Tokenization with AutoTokenizer | Preparing Data and Tokenization
Fine-Tuning Transformers



Tokenization is a crucial step in preparing text data for Transformer models. It involves splitting raw text into smaller units called tokens, which may be words, subwords, or even individual characters. Transformers require input sequences to be in a consistent, numerical format, as they cannot process raw strings directly. By converting text into sequences of tokens, and subsequently into integer IDs, you enable the model to interpret and process the data efficiently. Subword units are especially important in Transformer models because they allow the tokenizer to handle rare or unseen words by breaking them into more common, meaningful pieces. This ensures that the model can generalize better and handle a wider variety of input text.

Definition

Tokenization is the process of splitting text into smaller units called tokens.
A Subword Token is a fragment of a word, often used when a whole word is not found in the tokenizer's vocabulary, allowing the model to handle unknown or rare words by breaking them into familiar pieces.
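To make the subword idea concrete, here is a minimal pure-Python sketch of the greedy longest-match strategy that WordPiece-style tokenizers (such as the one used by distilbert-base-uncased) apply to each word. The tiny vocabulary below is illustrative only, not a real model's vocabulary:

```python
# Illustrative toy vocabulary; real tokenizers ship vocabularies
# with tens of thousands of entries. "##" marks a word-internal piece.
VOCAB = {"token", "##ization", "fine", "##-", "##tun", "##ing",
         "un", "##known", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Split a word into subword tokens by greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces get "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece("tokenization"))  # → ['token', '##ization']
print(wordpiece("fine-tuning"))   # → ['fine', '##-', '##tun', '##ing']
```

Because rare words fall back to smaller, more common pieces, the vocabulary stays compact while still covering arbitrary input text.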

```python
from transformers import AutoTokenizer

# Initialize the tokenizer for a pretrained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Batch of sentences to tokenize
sentences = [
    "Transformers are powerful models for NLP.",
    "Tokenization breaks text into subword units.",
    "Fine-tuning adapts models to specific tasks."
]

# Tokenize the batch with padding and truncation
tokenized = tokenizer(
    sentences,
    padding=True,        # Pad sentences to the same length
    truncation=True,     # Truncate sentences longer than max length
    return_tensors="np"  # Return NumPy arrays
)

# Inspect input IDs and attention masks
print("Input IDs:\n", tokenized["input_ids"])
print("Attention Masks:\n", tokenized["attention_mask"])
```
Note

When processing batches of sentences, always use padding and truncation. Padding ensures all input sequences in the batch are the same length, which is required for efficient batch processing. Truncation prevents sequences from exceeding the model's maximum input length, avoiding errors and unnecessary computation.
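The padding and attention-mask behavior described above can be sketched in plain Python, independent of any tokenizer library. The `pad_batch` helper and the pad ID of 0 (which matches DistilBERT's `[PAD]` token) are illustrative assumptions, not part of the Transformers API:

```python
def pad_batch(id_sequences, pad_id=0, max_length=None):
    """Pad variable-length ID sequences to a common length and build
    the matching attention masks (1 = real token, 0 = padding)."""
    target = max(len(seq) for seq in id_sequences)
    if max_length is not None:
        target = min(target, max_length)  # cap length, like truncation=True
    input_ids, attention_mask = [], []
    for seq in id_sequences:
        seq = seq[:target]                    # truncate overly long sequences
        pad = [pad_id] * (target - len(seq))  # pad short sequences
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return input_ids, attention_mask

# Hypothetical token IDs for two sentences of different lengths
ids, mask = pad_batch([[101, 7592, 102], [101, 2088, 999, 2003, 102]])
print(ids)   # → [[101, 7592, 102, 0, 0], [101, 2088, 999, 2003, 102]]
print(mask)  # → [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask tells the model which positions are real tokens and which are padding, so padded positions can be ignored during attention.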


Section 2. Chapter 2
