Creating Dataloaders for Transformers | Preparing Data and Tokenization
Fine-Tuning Transformers

Creating Dataloaders for Transformers


When training Transformer models on large text datasets, you need to process multiple samples at once to take advantage of modern hardware. This is called batching, and it is essential for efficient training. However, text samples usually have different lengths, which can make batching tricky. To address this, you use padding: shorter sequences are padded with a special token so that all sequences in a batch have the same length. The dataloader is responsible for organizing your dataset into batches, applying padding, and shuffling the data to ensure robust learning. Without proper batching and padding, your training would be slow and resource-inefficient.

Note

Use dynamic padding to minimize computation waste. Instead of padding all sequences to a fixed maximum length, pad each batch to the length of its longest sequence. This reduces the number of unnecessary computations for padded tokens.
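The idea can be sketched in pure Python: pad each sequence only to the length of the longest sequence in its batch, rather than to a global maximum such as 512. (The helper name `pad_batch` below is hypothetical, used just for illustration.)

```python
def pad_batch(sequences, pad_id=0):
    """Pad each sequence to the length of the longest one in this batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = [
    [101, 7592, 102],             # 3 tokens
    [101, 7592, 2088, 999, 102],  # 5 tokens -> batch max is 5
]
padded = pad_batch(batch)
# Every row is now length 5 (the batch max), not a fixed global length,
# so the model wastes no computation on padding beyond this batch's needs.
```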

import torch
from torch.utils.data import DataLoader, Dataset

# Sample tokenized data (pretend these are token ids)
tokenized_samples = [
    [101, 2009, 2001, 2307, 102],        # "It was great"
    [101, 2027, 2293, 2009, 1012, 102],  # "They love it."
    [101, 2057, 2342, 102],              # "We tried"
]
labels = [1, 1, 0]  # Sentiment labels

class IMDBDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

    def __len__(self):
        return len(self.labels)

def collate_fn(batch):
    input_ids = [item['input_ids'] for item in batch]
    labels = torch.stack([item['labels'] for item in batch])
    # Dynamic padding: pad only to the longest sequence in this batch
    input_ids_padded = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=0
    )
    return {'input_ids': input_ids_padded, 'labels': labels}

dataset = IMDBDataset(tokenized_samples, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

for batch in dataloader:
    print("Batch input_ids:", batch['input_ids'])
    print("Batch labels:", batch['labels'])
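The collate function above pads with token id 0 but does not record which positions are padding. Transformer models typically also expect an attention mask (1 for real tokens, 0 for pads) so they can ignore padded positions. A minimal, framework-free sketch of how such a mask is derived from a padded batch (the helper name `build_attention_mask` is hypothetical):

```python
def build_attention_mask(padded_batch, pad_id=0):
    """Mark real tokens with 1 and padding positions with 0."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in padded_batch]

# A batch padded to length 6; the last two positions of the row are padding
padded = [[101, 2057, 2342, 102, 0, 0]]
mask = build_attention_mask(padded)  # [[1, 1, 1, 1, 0, 0]]
```

In a real collate function you would return this mask alongside `input_ids`, e.g. as an `attention_mask` tensor.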
Note

Shuffle your data at the start of each training epoch. This prevents your model from learning patterns based on the order of the data, which could reduce its ability to generalize.
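What `DataLoader(shuffle=True)` does each epoch can be approximated by reshuffling the sample order before slicing it into batches. A minimal pure-Python sketch (the helper `epoch_batches` is hypothetical, and the seeds exist only to make the example reproducible):

```python
import random

data = list(range(10))  # stand-ins for dataset indices

def epoch_batches(data, batch_size, seed):
    """Shuffle a copy of the data, then slice it into batches."""
    order = data[:]                      # copy so the dataset stays intact
    random.Random(seed).shuffle(order)   # a fresh order each epoch
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

epoch1 = epoch_batches(data, 3, seed=1)
epoch2 = epoch_batches(data, 3, seed=2)
# Both epochs cover the same samples, but grouped into different batches,
# so the model cannot latch onto a fixed presentation order.
```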


Section 2. Chapter 4
