Tokenization with AutoTokenizer
Tokenization is a crucial step in preparing text data for Transformer models. It involves splitting raw text into smaller units called tokens, which may be words, subwords, or even individual characters. Transformers require input sequences to be in a consistent, numerical format, as they cannot process raw strings directly. By converting text into sequences of tokens, and subsequently into integer IDs, you enable the model to interpret and process the data efficiently. Subword units are especially important in Transformer models because they allow the tokenizer to handle rare or unseen words by breaking them into more common, meaningful pieces. This ensures that the model can generalize better and handle a wider variety of input text.
Tokenization is the process of splitting text into smaller units called tokens.
A Subword Token is a fragment of a word, often used when a whole word is not found in the tokenizer's vocabulary, allowing the model to handle unknown or rare words by breaking them into familiar pieces.
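To make the subword idea concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization, the general approach used by BERT-family tokenizers. The tiny vocabulary and the helper function are illustrative only, chosen to show how a word absent from the vocabulary is split into familiar pieces marked with the `##` continuation prefix:

```python
# Hypothetical mini-vocabulary for illustration; real vocabularies
# contain tens of thousands of entries.
VOCAB = {"token", "##ization", "##izer", "fine", "model", "##s", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split a word into subword tokens via greedy longest-match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try progressively shorter substrings until one is in the vocab
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # Continuation pieces get a marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # No piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("tokenization"))  # → ['token', '##ization']
print(wordpiece_tokenize("models"))        # → ['model', '##s']
```

Even though "tokenization" is not in the mini-vocabulary as a whole word, the tokenizer recovers it from two known pieces instead of falling back to `[UNK]`.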
```python
from transformers import AutoTokenizer

# Initialize the tokenizer for a pretrained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Batch of sentences to tokenize
sentences = [
    "Transformers are powerful models for NLP.",
    "Tokenization breaks text into subword units.",
    "Fine-tuning adapts models to specific tasks."
]

# Tokenize the batch with padding and truncation
tokenized = tokenizer(
    sentences,
    padding=True,        # Pad sentences to the same length
    truncation=True,     # Truncate sentences longer than max length
    return_tensors="np"  # Return NumPy arrays
)

# Inspect input IDs and attention masks
print("Input IDs:\n", tokenized["input_ids"])
print("Attention Masks:\n", tokenized["attention_mask"])
```
When processing batches of sentences, always use padding and truncation. Padding ensures all input sequences in the batch are the same length, which is required for efficient batch processing. Truncation prevents sequences from exceeding the model's maximum input length, avoiding errors and unnecessary computation.
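The mechanics behind padding and truncation can be sketched in plain Python. This is a simplified stand-in for what the tokenizer does internally (the function name, the illustrative ID values, and the pad ID of 0 are assumptions for the example; note that `padding=True` in the real tokenizer pads to the longest sequence in the batch, while this sketch pads to a fixed `max_length`):

```python
# Illustrative sketch: pad/truncate token-ID sequences to a fixed
# length and build the matching attention masks.
def pad_and_truncate(batch, max_length, pad_id=0):
    """Pad or truncate ID sequences and build attention masks."""
    input_ids, attention_masks = [], []
    for seq in batch:
        seq = seq[:max_length]            # Truncate overlong sequences
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # Pad with pad_id
        attention_masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return input_ids, attention_masks

# Hypothetical token IDs for two sentences of different lengths
batch = [[101, 2009, 102], [101, 2009, 2003, 1037, 3231, 102]]
ids, masks = pad_and_truncate(batch, max_length=5)
print(ids)    # [[101, 2009, 102, 0, 0], [101, 2009, 2003, 1037, 3231]]
print(masks)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask tells the model which positions hold real tokens (1) and which are padding to ignore (0), so padded batches do not distort the model's computations.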