Working with Datasets and Splits | Preparing Data and Tokenization
Fine-Tuning Transformers


When preparing data for NLP tasks, you often need to load a dataset, split it into training and validation sets, and access the label information for downstream tasks such as classification. The Hugging Face datasets library provides a convenient interface for these steps, allowing you to work with popular benchmarks like IMDb, SST-2, and others. Loading a dataset is typically straightforward: you specify the dataset name and configuration, then access the splits directly or create your own custom splits. For classification problems, accessing the label column is essential for both model training and evaluation.

Note

Always shuffle and stratify your data splits. Shuffling ensures that your training and validation data are representative of the overall distribution, while stratification guarantees that class proportions are preserved across splits, leading to more balanced and reliable evaluation.

```python
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split

# Load a subset of the IMDb dataset (for demonstration)
dataset = load_dataset("imdb", split="train[:2000]")

# Convert to pandas DataFrame for easier manipulation
df = pd.DataFrame(dataset)

# Stratified split: preserve label distribution
train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=42,
    shuffle=True,
)

# Show label distribution in each split
print("Train label distribution:")
print(train_df["label"].value_counts(normalize=True))
print("\nValidation label distribution:")
print(val_df["label"].value_counts(normalize=True))
```

Proper splitting is crucial to avoid bias and ensure that your model's performance metrics are meaningful. If the splits are not representative, or if information from the validation set leaks into the training set, your evaluation results may be unreliable.

Note

Data leakage can occur if you split your dataset after preprocessing or tokenization, or if you inadvertently include related samples across splits. This can inflate validation performance and lead to models that do not generalize well to unseen data.
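One common instance of this mistake is fitting a preprocessing step (a vocabulary, a vectorizer, normalization statistics) on the full dataset before splitting. A minimal sketch of the safe ordering, shown here with scikit-learn's `TfidfVectorizer` as the preprocessing step and a small made-up text dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data for illustration
texts = ["bad film", "great movie", "awful plot", "loved it",
         "boring stuff", "fantastic work", "terrible acting", "wonderful cast"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# 1) Split FIRST, on the raw data
train_texts, val_texts, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# 2) Fit preprocessing on the training split only...
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# 3) ...then only TRANSFORM the validation split with the fitted state
X_val = vectorizer.transform(val_texts)

print(X_train.shape, X_val.shape)  # same vocabulary size (column count) in both
```

Fitting the vectorizer on all the texts would let vocabulary and document-frequency statistics from the validation set leak into training; splitting first keeps the validation data truly unseen.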

Review question: Why is it important to use stratified splits when preparing data for classification tasks?


Section 2, Chapter 3

