Tokenization with AutoTokenizer
Tokenization is a crucial step in preparing text data for Transformer models. It involves splitting raw text into smaller units called tokens, which may be words, subwords, or even individual characters. Transformers require input sequences to be in a consistent, numerical format, as they cannot process raw strings directly. By converting text into sequences of tokens, and subsequently into integer IDs, you enable the model to interpret and process the data efficiently. Subword units are especially important in Transformer models because they allow the tokenizer to handle rare or unseen words by breaking them into more common, meaningful pieces. This ensures that the model can generalize better and handle a wider variety of input text.
Tokenization is the process of splitting text into smaller units called tokens.
A Subword Token is a fragment of a word, often used when a whole word is not found in the tokenizer's vocabulary, allowing the model to handle unknown or rare words by breaking them into familiar pieces.
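To make the subword idea concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization, the general approach used by BERT-family tokenizers. The tiny vocabulary and the helper function are illustrative only, chosen to show how a word absent from the vocabulary is split into familiar pieces marked with the `##` continuation prefix:

```python
# Hypothetical mini-vocabulary for illustration; real vocabularies
# contain tens of thousands of entries.
VOCAB = {"token", "##ization", "##izer", "fine", "model", "##s", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split a word into subword tokens via greedy longest-match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try progressively shorter substrings until one is in the vocab
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # Continuation pieces get a marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # No piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("tokenization"))  # → ['token', '##ization']
print(wordpiece_tokenize("models"))        # → ['model', '##s']
```

Even though "tokenization" is not in the mini-vocabulary as a whole word, the tokenizer recovers it from two known pieces instead of falling back to `[UNK]`.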
```python
from transformers import AutoTokenizer

# Initialize the tokenizer for a pretrained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Batch of sentences to tokenize
sentences = [
    "Transformers are powerful models for NLP.",
    "Tokenization breaks text into subword units.",
    "Fine-tuning adapts models to specific tasks."
]

# Tokenize the batch with padding and truncation
tokenized = tokenizer(
    sentences,
    padding=True,        # Pad sentences to the same length
    truncation=True,     # Truncate sentences longer than max length
    return_tensors="np"  # Return NumPy arrays
)

# Inspect input IDs and attention masks
print("Input IDs:\n", tokenized["input_ids"])
print("Attention Masks:\n", tokenized["attention_mask"])
```
When processing batches of sentences, always use padding and truncation. Padding ensures all input sequences in the batch are the same length, which is required for efficient batch processing. Truncation prevents sequences from exceeding the model's maximum input length, avoiding errors and unnecessary computation.
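The mechanics behind padding and truncation can be sketched in plain Python. This is a simplified stand-in for what the tokenizer does internally (the function name, the illustrative ID values, and the pad ID of 0 are assumptions for the example; note that `padding=True` in the real tokenizer pads to the longest sequence in the batch, while this sketch pads to a fixed `max_length`):

```python
# Illustrative sketch: pad/truncate token-ID sequences to a fixed
# length and build the matching attention masks.
def pad_and_truncate(batch, max_length, pad_id=0):
    """Pad or truncate ID sequences and build attention masks."""
    input_ids, attention_masks = [], []
    for seq in batch:
        seq = seq[:max_length]            # Truncate overlong sequences
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # Pad with pad_id
        attention_masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return input_ids, attention_masks

# Hypothetical token IDs for two sentences of different lengths
batch = [[101, 2009, 102], [101, 2009, 2003, 1037, 3231, 102]]
ids, masks = pad_and_truncate(batch, max_length=5)
print(ids)    # [[101, 2009, 102, 0, 0], [101, 2009, 2003, 1037, 3231]]
print(masks)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask tells the model which positions hold real tokens (1) and which are padding to ignore (0), so padded batches do not distort the model's computations.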