Creating Dataloaders for Transformers
When training Transformer models on large text datasets, you need to process multiple samples at once to take advantage of modern hardware. This is called batching, and it is essential for efficient training. However, text samples usually have different lengths, which can make batching tricky. To address this, you use padding: shorter sequences are padded with a special token so that all sequences in a batch have the same length. The dataloader is responsible for organizing your dataset into batches, applying padding, and shuffling the data to ensure robust learning. Without proper batching and padding, your training would be slow and resource-inefficient.
Use dynamic padding to minimize computation waste. Instead of padding all sequences to a fixed maximum length, pad each batch to the length of its longest sequence. This reduces the number of unnecessary computations for padded tokens.
```python
import torch
from torch.utils.data import DataLoader, Dataset

# Sample tokenized data (pretend these are token ids)
tokenized_samples = [
    [101, 2009, 2001, 2307, 102],        # "It was great"
    [101, 2027, 2293, 2009, 1012, 102],  # "They love it."
    [101, 2057, 2342, 102],              # "We tried"
]
labels = [1, 1, 0]  # Sentiment labels

class IMDBDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings[idx], dtype=torch.long),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

    def __len__(self):
        return len(self.labels)

def collate_fn(batch):
    input_ids = [item['input_ids'] for item in batch]
    labels = torch.stack([item['labels'] for item in batch])
    # Dynamic padding: pad only to the longest sequence in this batch
    input_ids_padded = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=0
    )
    return {'input_ids': input_ids_padded, 'labels': labels}

dataset = IMDBDataset(tokenized_samples, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

for batch in dataloader:
    print("Batch input_ids:", batch['input_ids'])
    print("Batch labels:", batch['labels'])
```
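To see the effect of dynamic padding in isolation, here is a minimal sketch using `pad_sequence` directly; the token values and batch groupings are made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical batches with different maximum lengths
batch_a = [torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3, 4, 5])]
batch_b = [torch.tensor([1, 2]), torch.tensor([1])]

# Each batch is padded only to its own longest sequence, not a global maximum
padded_a = pad_sequence(batch_a, batch_first=True, padding_value=0)
padded_b = pad_sequence(batch_b, batch_first=True, padding_value=0)

print(padded_a.shape)  # torch.Size([2, 5])
print(padded_b.shape)  # torch.Size([2, 2])
```

With fixed-length padding to a global maximum of 5, the second batch would also be 2x5; dynamic padding keeps it at 2x2, so the model does no work on those extra padded positions.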
Shuffle your data at the start of each training epoch. This prevents your model from learning patterns based on the order of the data, which could reduce its ability to generalize.
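With PyTorch, `shuffle=True` on the `DataLoader` handles this automatically: each time you iterate the loader, it draws a fresh random permutation of the dataset. A minimal sketch with a toy dataset (the values here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of six single-feature samples, just to show the ordering
data = torch.arange(6).unsqueeze(1)
dataset = TensorDataset(data)

# shuffle=True reshuffles at the start of every pass over the loader
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for epoch in range(2):
    order = [int(x) for batch in loader for x in batch[0].flatten()]
    print(f"Epoch {epoch} sample order: {order}")
```

Every epoch still visits all six samples exactly once, but the order they arrive in (and which samples share a batch) changes from epoch to epoch.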