Creating Dataloaders for Transformers | Preparing Data and Tokenization
Fine-Tuning Transformers

Creating Dataloaders for Transformers


When training Transformer models on large text datasets, you need to process multiple samples at once to take advantage of modern hardware. This is called batching, and it is essential for efficient training. However, text samples usually have different lengths, which can make batching tricky. To address this, you use padding: shorter sequences are padded with a special token so that all sequences in a batch have the same length. The dataloader is responsible for organizing your dataset into batches, applying padding, and shuffling the data to ensure robust learning. Without proper batching and padding, your training would be slow and resource-inefficient.

Note

Use dynamic padding to minimize computation waste. Instead of padding all sequences to a fixed maximum length, pad each batch to the length of its longest sequence. This reduces the number of unnecessary computations for padded tokens.
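The idea can be sketched in pure Python: pad each sequence only to the length of the longest sequence in its batch, rather than to a global maximum such as 512. (The helper name `pad_batch` below is hypothetical, used just for illustration.)

```python
def pad_batch(sequences, pad_id=0):
    """Pad each sequence to the length of the longest one in this batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = [
    [101, 7592, 102],             # 3 tokens
    [101, 7592, 2088, 999, 102],  # 5 tokens -> batch max is 5
]
padded = pad_batch(batch)
# Every row is now length 5 (the batch max), not a fixed global length,
# so the model wastes no computation on padding beyond this batch's needs.
```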

import torch
from torch.utils.data import DataLoader, Dataset

# Sample tokenized data (pretend these are token ids)
tokenized_samples = [
    [101, 2009, 2001, 2307, 102],        # "It was great"
    [101, 2027, 2293, 2009, 1012, 102],  # "They love it."
    [101, 2057, 2342, 102],              # "We tried"
]
labels = [1, 1, 0]  # Sentiment labels

class IMDBDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

    def __len__(self):
        return len(self.labels)

def collate_fn(batch):
    input_ids = [item['input_ids'] for item in batch]
    labels = torch.stack([item['labels'] for item in batch])
    # Dynamic padding: pad only to the longest sequence in this batch
    input_ids_padded = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=0
    )
    return {'input_ids': input_ids_padded, 'labels': labels}

dataset = IMDBDataset(tokenized_samples, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

for batch in dataloader:
    print("Batch input_ids:", batch['input_ids'])
    print("Batch labels:", batch['labels'])
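The collate function above pads with token id 0 but does not record which positions are padding. Transformer models typically also expect an attention mask (1 for real tokens, 0 for pads) so they can ignore padded positions. A minimal, framework-free sketch of how such a mask is derived from a padded batch (the helper name `build_attention_mask` is hypothetical):

```python
def build_attention_mask(padded_batch, pad_id=0):
    """Mark real tokens with 1 and padding positions with 0."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in padded_batch]

# A batch padded to length 6; the last two positions of the row are padding
padded = [[101, 2057, 2342, 102, 0, 0]]
mask = build_attention_mask(padded)  # [[1, 1, 1, 1, 0, 0]]
```

In a real collate function you would return this mask alongside `input_ids`, e.g. as an `attention_mask` tensor.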
Note

Shuffle your data at the start of each training epoch. This prevents your model from learning patterns based on the order of the data, which could reduce its ability to generalize.
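What `DataLoader(shuffle=True)` does each epoch can be approximated by reshuffling the sample order before slicing it into batches. A minimal pure-Python sketch (the helper `epoch_batches` is hypothetical, and the seeds exist only to make the example reproducible):

```python
import random

data = list(range(10))  # stand-ins for dataset indices

def epoch_batches(data, batch_size, seed):
    """Shuffle a copy of the data, then slice it into batches."""
    order = data[:]                      # copy so the dataset stays intact
    random.Random(seed).shuffle(order)   # a fresh order each epoch
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

epoch1 = epoch_batches(data, 3, seed=1)
epoch2 = epoch_batches(data, 3, seed=2)
# Both epochs cover the same samples, but grouped into different batches,
# so the model cannot latch onto a fixed presentation order.
```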


Section 2. Chapter 4
