Tokenization with AutoTokenizer
Tokenization is a crucial step in preparing text data for Transformer models. It involves splitting raw text into smaller units called tokens, which may be words, subwords, or even individual characters. Transformers require input sequences to be in a consistent, numerical format, as they cannot process raw strings directly. By converting text into sequences of tokens, and subsequently into integer IDs, you enable the model to interpret and process the data efficiently. Subword units are especially important in Transformer models because they allow the tokenizer to handle rare or unseen words by breaking them into more common, meaningful pieces. This ensures that the model can generalize better and handle a wider variety of input text.
Tokenization is the process of splitting text into smaller units called tokens.
A Subword Token is a fragment of a word, often used when a whole word is not found in the tokenizer's vocabulary, allowing the model to handle unknown or rare words by breaking them into familiar pieces.
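To make the subword idea concrete, here is a simplified, self-contained sketch of WordPiece-style greedy longest-match tokenization. The tiny vocabulary and the `subword_tokenize` helper are hypothetical, chosen only to illustrate how a word absent from the vocabulary gets split into familiar pieces; real tokenizers use much larger learned vocabularies.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word,
# mirroring the convention used by WordPiece tokenizers.
VOCAB = {"token", "##ization", "##izer", "trans", "##form", "##ers", "[UNK]"}

def subword_tokenize(word):
    """Greedily match the longest vocabulary entry from the left."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(subword_tokenize("tokenization"))  # ['token', '##ization']
print(subword_tokenize("transformers")) # ['trans', '##form', '##ers']
```

Even though neither full word is in the vocabulary, both are recovered from common fragments, which is exactly why subword units help models generalize to rare or unseen words.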
```python
from transformers import AutoTokenizer

# Initialize the tokenizer for a pretrained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Batch of sentences to tokenize
sentences = [
    "Transformers are powerful models for NLP.",
    "Tokenization breaks text into subword units.",
    "Fine-tuning adapts models to specific tasks."
]

# Tokenize the batch with padding and truncation
tokenized = tokenizer(
    sentences,
    padding=True,        # Pad sentences to the same length
    truncation=True,     # Truncate sentences longer than max length
    return_tensors="np"  # Return NumPy arrays
)

# Inspect input IDs and attention masks
print("Input IDs:\n", tokenized["input_ids"])
print("Attention Masks:\n", tokenized["attention_mask"])
```
When processing batches of sentences, always use padding and truncation. Padding ensures all input sequences in the batch are the same length, which is required for efficient batch processing. Truncation prevents sequences from exceeding the model's maximum input length, avoiding errors and unnecessary computation.
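The tokenizer handles padding and truncation internally, but the mechanics are easy to sketch. In this simplified illustration, the token IDs, the pad ID of 0, and the `pad_and_truncate` helper are all hypothetical; it shows how sequences are cut to a maximum length, padded to a common length, and paired with an attention mask where 1 marks a real token and 0 marks padding.

```python
PAD_ID = 0  # hypothetical padding token ID

def pad_and_truncate(batch, max_length):
    """Truncate each sequence, pad to a shared length, build attention masks."""
    input_ids, attention_mask = [], []
    # Pad to the longest sequence in the batch, capped at max_length
    target = min(max(len(seq) for seq in batch), max_length)
    for seq in batch:
        seq = seq[:target]                    # truncation
        pad = [PAD_ID] * (target - len(seq))  # padding to the shared length
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return input_ids, attention_mask

ids, mask = pad_and_truncate(
    [[101, 7592, 102], [101, 2129, 2024, 2017, 102]], max_length=4
)
print(ids)   # [[101, 7592, 102, 0], [101, 2129, 2024, 2017]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The attention mask is what lets the model ignore padded positions, so the extra zeros added for batching never influence the output.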