Text Preprocessing for Transformers | Preparing Data and Tokenization

Fine-Tuning Transformers

When working with transformer models in natural language processing (NLP), the quality of your input data has a direct impact on model performance. Raw text often contains inconsistencies, such as mixed casing, unnecessary special characters, and irregular spacing, all of which can confuse the model and degrade results. Lowercasing text is a common step that helps reduce the vocabulary size and ensures that words like Apple and apple are treated the same, unless case carries meaning in your specific context. Removing special characters—such as stray punctuation or symbols—helps eliminate noise, while careful handling of whitespace ensures that token boundaries are clear and consistent. Each of these steps plays a critical role in making your data more uniform and easier for transformer models to process.

Note

Avoid aggressive preprocessing that removes context, such as stripping all punctuation. Some punctuation (like question marks or exclamation points) can carry important semantic information that transformers use to understand intent or meaning.
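One way to follow this advice is to whitelist the punctuation you want to keep rather than stripping everything. The sketch below is illustrative (the function name and sample sentence are invented for this example, not part of the lesson's code) and keeps question marks, exclamation points, periods, commas, and apostrophes while removing other symbols:

```python
import re

def clean_keep_semantic_punct(text):
    """Lowercase and normalize whitespace, but keep punctuation
    that can carry meaning (?, !, ., ',', and apostrophes)."""
    text = text.lower()
    # Replace everything except letters, digits, whitespace,
    # and the whitelisted punctuation with a space
    text = re.sub(r"[^a-z0-9\s?!.,']", " ", text)
    # Collapse runs of whitespace and strip the ends
    return " ".join(text.split())

print(clean_keep_semantic_punct("Is this REALLY working?!  #wow"))
# → "is this really working?! wow"
```

The question and exclamation marks survive, so a downstream transformer can still pick up on the interrogative, emphatic tone, while the stray hash symbol is treated as noise.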

# List of sample sentences
sentences = [
    "Hello, World!  ",
    "Transformers are amazing...  ",
    "Fine-tune your model: it's powerful!",
    "   Spaces everywhere!   "
]

cleaned = []
for s in sentences:
    # Lowercase
    s = s.lower()
    # Remove special characters (keep only letters, numbers, and spaces)
    s = ''.join(c for c in s if c.isalnum() or c.isspace())
    # Collapse multiple spaces into one and strip leading/trailing spaces
    s = ' '.join(s.split())
    cleaned.append(s)

print(cleaned)
Note

Over-cleaning your text—such as removing all punctuation or stopwords—can strip away valuable context and negatively affect model performance. Always balance cleaning with preserving information relevant to your task.
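To make the trade-off concrete, here is a small side-by-side sketch (the sample sentence and variable names are invented for illustration). The aggressive version strips all punctuation, losing the surprise and sarcasm signaled by "?!" and "..."; the conservative version keeps sentence-level punctuation a transformer can use:

```python
import re
import string

text = "Wait, you did WHAT?! That's not what I meant..."

# Aggressive: strip ALL punctuation -- intent markers are lost
aggressive = text.lower().translate(str.maketrans("", "", string.punctuation))
aggressive = " ".join(aggressive.split())

# Conservative: keep punctuation that carries tone (?, !, ., apostrophes)
conservative = re.sub(r"[^a-z0-9\s?!.']", " ", text.lower())
conservative = " ".join(conservative.split())

print(aggressive)    # → "wait you did what thats not what i meant"
print(conservative)  # → "wait you did what?! that's not what i meant..."
```

Which version is appropriate depends on the task: for topic classification the aggressive output may be fine, but for sentiment or intent detection the conservative output preserves signal the model needs.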


Which of the following preprocessing steps is least likely to harm the semantic meaning of your text?



Section 2. Chapter 1
