Neural Networks with PyTorch

Working with Datasets

To simplify data preparation for machine learning models and enable efficient batch processing, shuffling, and data handling, PyTorch provides the TensorDataset and DataLoader utilities.

Loading and Inspecting the Dataset

We'll use a dataset (wine.csv) containing data about different kinds of wine, including their features and corresponding class labels.

First, let's load the dataset and inspect its structure to understand the features and target variable:

```python
import pandas as pd

# Load the dataset and preview the first five rows
wine_df = pd.read_csv('wine.csv')
print(wine_df.head())
```
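Beyond `head()`, a few more calls give a fuller picture of the data before it is converted to tensors. The sketch below uses a small in-memory DataFrame as a stand-in for wine.csv so that it runs on its own. The feature column names (`alcohol`, `malic_acid`) and all values are made up for illustration; only the `class` column name is taken from the code later in this chapter.

```python
import pandas as pd

# Hypothetical stand-in for wine.csv: two made-up feature columns plus 'class'
wine_df = pd.DataFrame({
    'alcohol':    [14.2, 13.2, 12.4, 12.9, 13.7, 12.3],
    'malic_acid': [1.7, 1.8, 0.9, 1.6, 3.3, 2.8],
    'class':      [0, 0, 1, 1, 2, 2],
})

# Number of rows (samples) and columns (features + target)
n_rows, n_cols = wine_df.shape
print(f"Rows: {n_rows}, columns: {n_cols}")

# Data type of each column; features should be numeric before tensor conversion
print(wine_df.dtypes)

# Count of missing values per column
missing_per_column = wine_df.isna().sum()
print(missing_per_column)

# Number of samples in each class, to check how balanced the labels are
class_counts = wine_df['class'].value_counts().sort_index()
print(class_counts)
```

Running the same calls on the real `wine_df` shows whether any column needs cleaning before the tensor conversion in the next step.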

Creating a TensorDataset

The next step is to separate the features and target, convert them into PyTorch tensors, and use these tensors directly to create a TensorDataset. We'll ensure that the features are of type float32 (for handling floating-point numbers) and the target is of type long (a 64-bit integer type suitable for labels).

```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)
```
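A TensorDataset behaves like a sequence of samples: `len()` returns the number of samples, and indexing returns a tuple with one entry per tensor passed to the constructor. The following minimal sketch illustrates this with synthetic random data in place of the wine data; the sizes (10 samples, 4 features, 3 classes) are arbitrary assumptions.

```python
import torch
from torch.utils.data import TensorDataset

# Synthetic stand-in for the wine data: 10 samples, 4 features, 3 classes (arbitrary sizes)
torch.manual_seed(0)
features = torch.randn(10, 4, dtype=torch.float32)
target = torch.randint(0, 3, (10,), dtype=torch.long)

dataset = TensorDataset(features, target)

# len() gives the number of samples (the size of the first dimension)
print(len(dataset))

# Indexing returns one (features, target) pair
sample_features, sample_target = dataset[0]
print(sample_features.shape, sample_features.dtype)  # one row of 4 features
print(sample_target.shape, sample_target.dtype)      # a scalar class label
```

All tensors passed to TensorDataset must have the same size along the first dimension, since sample `i` is built by taking index `i` from each of them.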

Using DataLoader for Batch Processing

To facilitate batch processing, shuffling, and efficient data loading during training, we wrap the TensorDataset in a DataLoader. This step is crucial for managing the flow of data to the model during training, especially when working with larger datasets. The DataLoader allows us to:

  1. Batch process: split the data into smaller chunks (batches), which bounds memory use and allows a gradient update after each batch;
  2. Shuffle: re-randomize the sample order every epoch, so the model cannot pick up on any ordering in the data (for example, rows sorted by class);
  3. Efficient loading: fetch the samples for each batch and stack them into tensors automatically, optionally in parallel worker processes.

```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)

# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,   # TensorDataset
    batch_size=32,  # Number of samples per batch
    shuffle=True    # Randomize the order of the data
)
```
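The number of batches per epoch follows directly from the dataset size and `batch_size`. As a rough sketch, assuming a synthetic dataset of 100 samples (an arbitrary size, not the size of wine.csv), the example below shows that the last batch is smaller when the sizes do not divide evenly, and that `drop_last=True` discards it.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in: 100 samples with 4 features each (arbitrary sizes)
n_samples, batch_size = 100, 32
dataset = TensorDataset(
    torch.randn(n_samples, 4),
    torch.randint(0, 3, (n_samples,)),
)

# Default behaviour: the final, smaller batch is kept
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
print(len(loader))                         # batches per epoch: 4
print(math.ceil(n_samples / batch_size))   # same value, computed by hand: 4

# drop_last=True discards the incomplete final batch
loader_drop = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
print(len(loader_drop))                    # 3

# Sizes of the individual batches in one epoch
batch_sizes = [len(batch_targets) for _, batch_targets in loader]
print(batch_sizes)                         # [32, 32, 32, 4]
```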

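With this setup, the DataLoader serves the data in batches of 32 samples and reshuffles the order at the start of every epoch. Shuffling matters when training neural networks: if consecutive batches always contained the same or similar samples (for example, all of one class), the gradient updates would be biased toward that ordering, which tends to hurt generalization to unseen data.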
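To see what shuffling does in practice, the sketch below uses the sample index itself as the only "feature", so the serving order can be read off directly. Passing a seeded `torch.Generator` through the DataLoader's `generator` argument is one way to make the shuffle reproducible across runs; the seed value 42 and the dataset of 10 samples are arbitrary assumptions.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Use the sample index itself as the "feature" so the order is easy to read
indices = torch.arange(10)
dataset = TensorDataset(indices)

def epoch_order(loader):
    """Collect the order in which samples are served during one epoch."""
    return torch.cat([batch for (batch,) in loader]).tolist()

# A seeded generator makes the shuffling reproducible across runs
generator = torch.Generator().manual_seed(42)
loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)

# Each pass over the loader is one epoch and draws a fresh permutation
first_epoch = epoch_order(loader)
second_epoch = epoch_order(loader)
print(first_epoch)
print(second_epoch)

# Every sample appears exactly once per epoch, whatever the order
print(sorted(first_epoch) == list(range(10)))
```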

Iterating Over the DataLoader

We can now iterate over the DataLoader to access batches of data. Each iteration yields one batch, which unpacks into a pair of tensors (batch_features, batch_targets):

```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

wine_df = pd.read_csv('wine.csv')

# Separate features and target
features = wine_df.drop(columns='class').values
target = wine_df['class'].values

# Create TensorDataset
wine_dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),  # Features tensor
    torch.tensor(target, dtype=torch.long)        # Target tensor
)

# Wrap the dataset in a DataLoader
wine_loader = DataLoader(
    wine_dataset,   # TensorDataset
    batch_size=32,  # Number of samples per batch
    shuffle=True    # Randomize the order of the data
)

# Iterate through batches (the loader defined above is named wine_loader)
for batch_idx, (batch_features, batch_targets) in enumerate(wine_loader):
    print(f"Batch {batch_idx+1}")
    print(f"Features: {batch_features}")
    print(f"Targets: {batch_targets}")
    print("-" * 30)
```
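In training, the same loop usually performs one gradient update per batch instead of printing. The following is only a minimal sketch of that pattern under stated assumptions: the data is synthetic (64 samples, 4 features, 3 classes), and the single `nn.Linear` layer, the loss function, the optimizer, and the learning rate are placeholders chosen for illustration rather than anything prescribed by this chapter.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for the wine data: 64 samples, 4 features, 3 classes (arbitrary sizes)
torch.manual_seed(0)
dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 3, (64,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Hypothetical placeholder model: one linear layer mapping 4 features to 3 class scores
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()  # takes float scores and long class indices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_losses = []
for batch_features, batch_targets in loader:
    optimizer.zero_grad()                  # clear gradients from the previous batch
    scores = model(batch_features)         # shape: (batch size, 3)
    loss = loss_fn(scores, batch_targets)
    loss.backward()                        # compute gradients for this batch
    optimizer.step()                       # one parameter update per batch
    batch_losses.append(loss.item())

print(f"Batches processed: {len(batch_losses)}")
```

This is also where the dtype choices made earlier pay off: the float32 features feed the layer directly, and the long targets are the form `nn.CrossEntropyLoss` accepts for class indices.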

Section 2. Chapter 5