Working with Datasets and Splits | Preparing Data and Tokenization
Fine-Tuning Transformers

Working with Datasets and Splits


When preparing data for NLP tasks, you often need to load a dataset, split it into training and validation sets, and access the label information for downstream tasks such as classification. The Hugging Face datasets library provides a convenient interface for these steps, allowing you to work with popular benchmarks like IMDb, SST-2, and others. Loading a dataset is typically straightforward: you specify the dataset name and configuration, then access the splits directly or create your own custom splits. For classification problems, accessing the label column is essential for both model training and evaluation.

Note

Always shuffle and stratify your data splits. Shuffling ensures that your training and validation data are representative of the overall distribution, while stratification guarantees that class proportions are preserved across splits, leading to more balanced and reliable evaluation.

from datasets import load_dataset
from sklearn.model_selection import train_test_split
import pandas as pd

# Load a shuffled subset of the IMDb dataset (for demonstration).
# The raw train split is sorted by label, so shuffle before selecting
# a subset to get examples from both classes.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2000))

# Convert to a pandas DataFrame for easier manipulation
df = pd.DataFrame(dataset)

# Stratified split: preserve the label distribution
train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=42,
    shuffle=True
)

# Show the label distribution in each split
print("Train label distribution:")
print(train_df["label"].value_counts(normalize=True))
print("\nValidation label distribution:")
print(val_df["label"].value_counts(normalize=True))

Proper splitting is crucial to avoid bias and ensure that your model's performance metrics are meaningful. If the splits are not representative, or if information from the validation set leaks into the training set, your evaluation results may be unreliable.

Note

Data leakage can occur if you split your dataset after preprocessing or tokenization, or if you inadvertently include related samples across splits. This can inflate validation performance and lead to models that do not generalize well to unseen data.
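A minimal sketch of the leakage-safe ordering described above, using plain Python and a hypothetical whitespace "tokenizer": the split happens first, and the vocabulary is fitted on the training portion only.

```python
import random

# Hypothetical toy corpus of (text, label) pairs
corpus = [
    ("great film", 1), ("terrible plot", 0),
    ("loved it", 1), ("boring mess", 0),
]

# 1) Split FIRST, before any preprocessing or tokenization
rng = random.Random(42)
shuffled = corpus[:]
rng.shuffle(shuffled)
cut = int(0.75 * len(shuffled))
train, val = shuffled[:cut], shuffled[cut:]

# 2) Fit preprocessing on the training split only: validation text
#    never influences the vocabulary, so nothing leaks across splits
vocab = sorted({tok for text, _ in train for tok in text.split()})
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text, unk_id=-1):
    # Tokens unseen during training map to a shared "unknown" id,
    # just as real tokenizers fall back to an <unk> token
    return [token_to_id.get(tok, unk_id) for tok in text.split()]

for text, label in val:
    print(text, "->", encode(text), "label:", label)
```

If the vocabulary were built on the full corpus before splitting, validation tokens would silently shape the preprocessing, which is exactly the leakage the note warns about.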

Question

Why is it important to use stratified splits when preparing data for classification tasks?


Section 2. Chapter 3
