Neural Networks with PyTorch
Working with Datasets
To simplify data preparation for machine learning models and enable efficient batch processing, shuffling, and data handling, PyTorch provides the `TensorDataset` and `DataLoader` utilities.
Loading and Inspecting the Dataset
We'll use a dataset (`wine.csv`) containing data about different kinds of wine, including their features and corresponding class labels.
First, let's load the dataset and inspect its structure to understand the features and target variable:
```python
import pandas as pd

wine_df = pd.read_csv('wine.csv')
print(wine_df.head())
```
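Beyond the first few rows, it can also help to confirm the dataset's dimensions, column types, and class balance before building tensors. The following sketch is an optional inspection step (not part of the original pipeline) and assumes the target column is named `class`, as in the snippets below:

```python
# Inspect the overall structure of the dataset
print(wine_df.shape)   # (number of samples, number of columns)
print(wine_df.dtypes)  # data type of each column

# Check how many samples belong to each wine class
# (assumes the target column is named 'class')
print(wine_df['class'].value_counts())
```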
Creating a TensorDataset
The next step is to separate the features and target, convert them into PyTorch tensors, and use these tensors directly to create a `TensorDataset`. We'll ensure that the features are of type `float32` (for handling floating-point numbers) and the target is of type `long` (a 64-bit integer type suitable for labels).
```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)
```
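As a quick check (an addition, not part of the lesson code), a `TensorDataset` supports `len()` and integer indexing, so you can verify its size and look at a single (features, target) pair:

```python
# Number of samples in the dataset
print(len(wine_dataset))

# A single sample: a (features, target) tuple of tensors
sample_features, sample_target = wine_dataset[0]
print(sample_features.shape, sample_features.dtype)  # e.g. torch.Size([13]) torch.float32
print(sample_target)                                  # class label as a long tensor
```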
Using DataLoader for Batch Processing
To facilitate batch processing, shuffling, and efficient data loading during training, we wrap the `TensorDataset` in a `DataLoader`. This step is crucial for managing the flow of data to the model during training, especially when working with larger datasets. The `DataLoader` allows us to:
- Batch process: split the data into smaller, manageable chunks (batches) for training, which optimizes memory usage and allows gradient updates after each batch;
- Shuffle: randomize the order of the dataset, which helps break any inherent ordering in the data and prevents the model from learning spurious patterns;
- Efficient loading: automatically handle data fetching and preprocessing for each batch during training, reducing overhead.
```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)

# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,    # TensorDataset
    batch_size=32,   # Number of samples per batch
    shuffle=True     # Randomize the order of the data
)
```
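To sanity-check the loader before training, you can pull a single batch with `next(iter(...))` and confirm the batch dimensions. This inspection step is an addition to the lesson code:

```python
# Number of batches per epoch (the last batch may be smaller than 32)
print(len(wine_loader))

# Fetch one batch and inspect its shapes
batch_features, batch_targets = next(iter(wine_loader))
print(batch_features.shape)  # torch.Size([32, number_of_features])
print(batch_targets.shape)   # torch.Size([32])
```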
With this setup, the `DataLoader` ensures that the model receives batches of data efficiently and in random order. This is especially important for training neural networks, as it helps the model generalize better to unseen data.
Iterating Over the DataLoader
We can now iterate over the `DataLoader` to access batches of data. Each batch contains a tuple `(batch_features, batch_targets)`:
```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)

# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,    # TensorDataset
    batch_size=32,   # Number of samples per batch
    shuffle=True     # Randomize the order of the data
)

# Iterate through batches
for batch_idx, (batch_features, batch_targets) in enumerate(wine_loader):
    print(f"Batch {batch_idx + 1}")
    print(f"Features: {batch_features}")
    print(f"Targets: {batch_targets}")
    print("-" * 30)
```
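In practice, this iteration pattern is what drives a training loop. The sketch below is a minimal illustration (not part of the lesson), continuing from the `DataLoader` defined above. The model architecture, optimizer settings, and the assumption that class labels run from 0 to the number of classes minus one are all illustrative choices, not requirements from the original text:

```python
import torch
import torch.nn as nn

# Minimal illustrative classifier (an assumption, not the lesson's model):
# input size and number of classes are taken from the data loaded above
n_features = features.shape[1]
n_classes = wine_df['class'].nunique()

model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes)
)

criterion = nn.CrossEntropyLoss()  # Expects labels in the range 0..n_classes-1
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for batch_features, batch_targets in wine_loader:
        optimizer.zero_grad()                      # Clear gradients from the previous batch
        outputs = model(batch_features)            # Forward pass on the current batch
        loss = criterion(outputs, batch_targets)   # If labels start at 1, subtract 1 beforehand
        loss.backward()                            # Backpropagate
        optimizer.step()                           # Update parameters after each batch
    print(f"Epoch {epoch + 1}, loss: {loss.item():.4f}")
```

Because gradients are updated once per batch rather than once per full pass over the data, the batch size chosen in the `DataLoader` directly controls how often the model's parameters change during each epoch.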