Tokenization with AutoTokenizer | Preparing Data and Tokenization
Fine-Tuning Transformers



Tokenization is a crucial step in preparing text data for Transformer models. It involves splitting raw text into smaller units called tokens, which may be words, subwords, or even individual characters. Transformers require input sequences to be in a consistent, numerical format, as they cannot process raw strings directly. By converting text into sequences of tokens, and subsequently into integer IDs, you enable the model to interpret and process the data efficiently. Subword units are especially important in Transformer models because they allow the tokenizer to handle rare or unseen words by breaking them into more common, meaningful pieces. This ensures that the model can generalize better and handle a wider variety of input text.

Definition

Tokenization is the process of splitting text into smaller units called tokens.
A Subword Token is a fragment of a word, often used when a whole word is not found in the tokenizer's vocabulary, allowing the model to handle unknown or rare words by breaking them into familiar pieces.
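As a sketch of how subword splitting works, the snippet below implements a simplified greedy longest-match-first algorithm in the style of WordPiece (the scheme used by BERT-family tokenizers such as distilbert-base-uncased). The vocabulary here is a tiny toy set, not the real model vocabulary, and the function is an illustration rather than the actual Hugging Face implementation:

```python
# Toy vocabulary: continuation pieces carry the "##" prefix, as in WordPiece
VOCAB = {"token", "##ization", "trans", "##form", "##ers", "[UNK]"}

def wordpiece(word):
    """Greedy longest-match-first subword splitting (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark word-internal pieces
            if sub in VOCAB:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece covers this span
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("tokenization"))  # -> ['token', '##ization']
print(wordpiece("transformers"))  # -> ['trans', '##form', '##ers']
```

Because a rare word like "tokenization" is decomposed into pieces that do appear in the vocabulary, the model never needs a separate ID for every possible word, which is exactly what lets Transformers generalize to unseen text.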

from transformers import AutoTokenizer

# Initialize the tokenizer for a pretrained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Batch of sentences to tokenize
sentences = [
    "Transformers are powerful models for NLP.",
    "Tokenization breaks text into subword units.",
    "Fine-tuning adapts models to specific tasks."
]

# Tokenize the batch with padding and truncation
tokenized = tokenizer(
    sentences,
    padding=True,        # Pad sentences to the same length
    truncation=True,     # Truncate sentences longer than max length
    return_tensors="np"  # Return NumPy arrays
)

# Inspect input IDs and attention masks
print("Input IDs:\n", tokenized["input_ids"])
print("Attention Masks:\n", tokenized["attention_mask"])
Note

When processing batches of sentences, always use padding and truncation. Padding ensures all input sequences in the batch are the same length, which is required for efficient batch processing. Truncation prevents sequences from exceeding the model's maximum input length, avoiding errors and unnecessary computation.
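The mechanics of padding and truncation can be sketched without a real tokenizer. In this simplified illustration (assuming a pad ID of 0, as in BERT-family vocabularies), shorter ID sequences are padded up to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0 so the model can ignore the padded positions:

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad variable-length ID sequences to a common length and build
    attention masks, truncating anything longer than max_length.
    Simplified sketch of what tokenizer(..., padding=True, truncation=True) does."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]  # truncation
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * pad)           # pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * pad)  # 0 = ignore
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 19204, 102], [101, 102]])
print(ids)   # [[101, 19204, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The zeros in the mask line up exactly with the padded positions in the IDs, which is what the attention mask returned by the tokenizer above encodes.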


Why are attention masks needed when training Transformer models?

Select all correct answers


Section 2. Chapter 2
