Transformers for Natural Language Processing
Section 3 (Applying Transformers to NLP Tasks), Chapter 6

Challenge: Evaluating Transformer Models

Evaluation Metrics for NLP Tasks

When you finish training a Transformer model for NLP, you need to measure how well it performs on your task. Most classification metrics are built from four counts: TP, TN, FP, and FN.

Note: Definitions

  • TP (True Positive): number of positive samples correctly predicted as positive;
  • TN (True Negative): number of negative samples correctly predicted as negative;
  • FP (False Positive): number of negative samples incorrectly predicted as positive;
  • FN (False Negative): number of positive samples incorrectly predicted as negative.
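
From these four counts, the standard formulas are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

Which metric matters most depends on the task: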

  • Accuracy is best for balanced datasets where each class has about the same number of samples;
  • Precision and Recall become more important when you have imbalanced data, such as spam detection, where missing a spam message (a false negative) and wrongly flagging a normal message (a false positive) have different costs;
  • F1 score combines both precision and recall into a single number, making it easier to compare models on imbalanced tasks;
  • BLEU and ROUGE are used for tasks where the output is a sequence, such as translation or summarization;
  • Perplexity is most useful for language models that predict the next word in a sequence.
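
As a concrete illustration, the classification metrics can be computed in a few lines of plain Python. The label lists below are made up for the example, not output from a real model:

    # Hypothetical labels: 1 = positive review, 0 = negative review
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up model predictions

    # Count the four outcomes by comparing predictions to true labels
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")

If scikit-learn is available, the equivalent helpers in sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score) produce the same numbers and also handle multi-class averaging.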

Interpreting Evaluation Results and Improving Model Performance

Once you have calculated evaluation metrics for your Transformer model, it is important to understand what the results mean and how to use them to improve the model. High accuracy generally indicates correct predictions, but on imbalanced data it can be misleading, so also look at precision, recall, and F1 score. For example, a model with high precision but low recall is conservative: it only makes positive predictions when it is very sure, and therefore misses many true positives. If recall is high but precision is low, the model predicts positives more freely but includes more false alarms.
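
Which of these two regimes a classifier lands in often comes down to its decision threshold. Here is a minimal sketch with made-up confidence scores showing the trade-off:

    # Made-up true labels and model confidence scores
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.35, 0.20, 0.10]

    for threshold in (0.3, 0.5, 0.9):
        # Predict positive whenever the score clears the threshold
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")

Raising the threshold makes the model more conservative (precision rises, recall falls); lowering it does the opposite.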

If your model's performance is not satisfactory, consider the following ways to improve it:

  • Collect more labeled data, especially for underrepresented classes;
  • Try different preprocessing steps, such as removing noise or balancing classes;
  • Fine-tune hyperparameters, such as learning rate, batch size, or number of epochs;
  • Adjust the model architecture, such as adding attention heads or layers;
  • Use data augmentation techniques to increase dataset diversity;
  • Analyze errors to see if the model is struggling with certain types of inputs (see the sketch after this list).
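
For the last point, a simple error-analysis pass is often the fastest way to find such patterns. A sketch, where texts, y_true, and y_pred are placeholders for your own test set and predictions:

    # Placeholder test set and predictions; substitute your own
    texts = ["great movie", "not good at all", "boring", "surprisingly fun"]
    y_true = [1, 0, 0, 1]   # 1 = positive, 0 = negative
    y_pred = [1, 1, 0, 0]   # hypothetical model output

    # Collect every misclassified example for inspection
    errors = [(text, t, p) for text, t, p in zip(texts, y_true, y_pred) if t != p]
    for text, t, p in errors:
        print(f"true={t} pred={p} text={text!r}")

    # Group errors by a simple property (here: negation words) to spot patterns
    negations = ("not", "never", "no ")
    negation_errors = [e for e in errors if any(w in e[0] for w in negations)]
    print(f"{len(negation_errors)} of {len(errors)} errors contain negation")

If a large share of errors clusters on one property (negation, very short inputs, a rare class), that points directly to which of the fixes above is likely to help.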

By carefully selecting the right metric and interpreting the results, you will be able to diagnose model weaknesses and focus your improvement efforts where they matter most.

Task

Use your knowledge from previous chapters to complete a small evaluation scenario for a Transformer text classifier.

  • Given a model that predicts whether a movie review is positive or negative, you have the following results on a test set of 10 samples:
    • 6 reviews are truly positive, 4 are truly negative;
    • The model predicts: 5 positive (4 correct), 5 negative (3 correct).
  • Calculate accuracy, precision, recall, and F1 score for the positive class;
  • Enter your answers as decimals rounded to two places.

Solution
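
One way to work through it from the counts in the description: the model makes 5 positive predictions, of which 4 are correct, so TP = 4 and FP = 1; it makes 5 negative predictions, of which 3 are correct, so TN = 3 and FN = 2 (the two positives it missed). Note that TP + FN = 6 true positives and TN + FP = 4 true negatives, matching the test set.

Accuracy = (4 + 3) / 10 = 0.70
Precision = 4 / (4 + 1) = 0.80
Recall = 4 / (4 + 2) ≈ 0.67
F1 = 2 × 0.80 × 0.67 / (0.80 + 0.67) ≈ 0.73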
