Fine-Tuning Transformers

Text Preprocessing for Transformers


When working with transformer models in natural language processing (NLP), the quality of your input data has a direct impact on model performance. Raw text often contains inconsistencies, such as mixed casing, unnecessary special characters, and irregular spacing, all of which can confuse the model and degrade results. Lowercasing text is a common step that helps reduce the vocabulary size and ensures that words like Apple and apple are treated the same, unless case carries meaning in your specific context. Removing special characters—such as stray punctuation or symbols—helps eliminate noise, while careful handling of whitespace ensures that token boundaries are clear and consistent. Each of these steps plays a critical role in making your data more uniform and easier for transformer models to process.
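To see concretely how casing inflates the vocabulary, you can compare the number of distinct tokens before and after lowercasing. The snippet below uses a toy whitespace split rather than a real tokenizer, and the sample sentence is made up for illustration:

```python
# Toy example: count distinct whitespace-separated tokens
# with and without lowercasing (illustrative only).
text = "Apple shipped an apple pie. APPLE fans cheered."

raw_vocab = set(text.split())          # "Apple", "apple", "APPLE" are all distinct
lower_vocab = set(text.lower().split())  # they collapse into a single "apple"

print(len(raw_vocab), len(lower_vocab))  # 8 2
```

Lowercasing shrinks the vocabulary from 8 distinct tokens to 6, because the three casings of "apple" merge into one entry.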

Note

Avoid aggressive preprocessing that removes context, such as stripping all punctuation. Some punctuation (like question marks or exclamation points) can carry important semantic information that transformers use to understand intent or meaning.

```python
# List of sample sentences
sentences = [
    "Hello, World!  ",
    "Transformers are amazing...  ",
    "Fine-tune your model: it's powerful!",
    "  Spaces   everywhere!  "
]

cleaned = []
for s in sentences:
    # Lowercase
    s = s.lower()
    # Remove special characters (keep only letters, numbers, and spaces)
    s = ''.join(c for c in s if c.isalnum() or c.isspace())
    # Collapse multiple spaces into one and strip leading/trailing spaces
    s = ' '.join(s.split())
    cleaned.append(s)

print(cleaned)
```
Note

Over-cleaning your text—such as removing all punctuation or stopwords—can strip away valuable context and negatively affect model performance. Always balance cleaning with preserving information relevant to your task.
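One way to balance cleaning with preservation is to whitelist the punctuation you want to keep. The sketch below keeps question marks, exclamation points, periods, commas, and apostrophes while still lowercasing and normalizing whitespace; the exact set of kept characters is an illustrative choice, not a fixed rule, and should be tuned to your task:

```python
# Characters to preserve during cleaning (illustrative choice).
KEEP = set("?!.,'")

def clean_keep_punct(text):
    """Lowercase and normalize whitespace, but retain semantically
    meaningful punctuation instead of stripping everything."""
    text = text.lower()
    # Drop characters that are not letters, digits, whitespace,
    # or part of the kept punctuation set
    text = ''.join(c for c in text if c.isalnum() or c.isspace() or c in KEEP)
    # Collapse runs of whitespace and strip the ends
    return ' '.join(text.split())

print(clean_keep_punct("Is this REALLY necessary?!  Yes -- sometimes!"))
# is this really necessary?! yes sometimes!
```

Compared with the stricter `isalnum`-only filter above, this version preserves the question and exclamation marks that signal intent, while still removing stray symbols such as the dashes.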


Section 2. Chapter 1
