Transformers for Natural Language Processing
Section 3 (Applying Transformers to NLP Tasks), Chapter 6

Challenge: Evaluating Transformer Models

Evaluation Metrics for NLP Tasks

When you finish training a Transformer model for NLP, you need to measure how well it performs on your task. Most classification metrics are built from four counts: TP, TN, FP, and FN.

Note: Definitions

  • TP (True Positive): number of positive samples correctly predicted as positive;
  • TN (True Negative): number of negative samples correctly predicted as negative;
  • FP (False Positive): number of negative samples incorrectly predicted as positive;
  • FN (False Negative): number of positive samples incorrectly predicted as negative.
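
From these four counts, the standard formulas are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

Which metric matters most depends on the task: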

  • Accuracy is best for balanced datasets where each class has about the same number of samples;
  • Precision and Recall become more important when you have imbalanced data, such as spam detection, where missing a spam message (a false negative) and wrongly flagging a normal message (a false positive) have different costs;
  • F1 score combines both precision and recall into a single number, making it easier to compare models on imbalanced tasks;
  • BLEU and ROUGE are used for tasks where the output is a sequence, such as translation or summarization;
  • Perplexity is most useful for language models that predict the next word in a sequence.
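
As a concrete illustration, the classification metrics can be computed in a few lines of plain Python. The label lists below are made up for the example, not output from a real model:

    # Hypothetical labels: 1 = positive review, 0 = negative review
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up model predictions

    # Count the four outcomes by comparing predictions to true labels
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")

If scikit-learn is available, the equivalent helpers in sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score) produce the same numbers and also handle multi-class averaging.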

Interpreting Evaluation Results and Improving Model Performance

Once you have calculated evaluation metrics for your Transformer model, it is important to understand what the results mean and how to use them to improve the model. High accuracy generally indicates correct predictions, but on imbalanced data it can be misleading, so also look at precision, recall, and F1 score. For example, a model with high precision but low recall is conservative: it only makes positive predictions when it is very sure, and therefore misses many true positives. If recall is high but precision is low, the model predicts positives more freely but includes more false alarms.
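
Which of these two regimes a classifier lands in often comes down to its decision threshold. Here is a minimal sketch with made-up confidence scores showing the trade-off:

    # Made-up true labels and model confidence scores
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.35, 0.20, 0.10]

    for threshold in (0.3, 0.5, 0.9):
        # Predict positive whenever the score clears the threshold
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")

Raising the threshold makes the model more conservative (precision rises, recall falls); lowering it does the opposite.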

If your model's performance is not satisfactory, consider the following ways to improve it:

  • Collect more labeled data, especially for underrepresented classes;
  • Try different preprocessing steps, such as removing noise or balancing classes;
  • Fine-tune hyperparameters, such as learning rate, batch size, or number of epochs;
  • Adjust the model architecture, such as adding attention heads or layers;
  • Use data augmentation techniques to increase dataset diversity;
  • Analyze errors to see if the model is struggling with certain types of inputs (see the sketch after this list).
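
For the last point, a simple error-analysis pass is often the fastest way to find such patterns. A sketch, where texts, y_true, and y_pred are placeholders for your own test set and predictions:

    # Placeholder test set and predictions; substitute your own
    texts = ["great movie", "not good at all", "boring", "surprisingly fun"]
    y_true = [1, 0, 0, 1]   # 1 = positive, 0 = negative
    y_pred = [1, 1, 0, 0]   # hypothetical model output

    # Collect every misclassified example for inspection
    errors = [(text, t, p) for text, t, p in zip(texts, y_true, y_pred) if t != p]
    for text, t, p in errors:
        print(f"true={t} pred={p} text={text!r}")

    # Group errors by a simple property (here: negation words) to spot patterns
    negations = ("not", "never", "no ")
    negation_errors = [e for e in errors if any(w in e[0] for w in negations)]
    print(f"{len(negation_errors)} of {len(errors)} errors contain negation")

If a large share of errors clusters on one property (negation, very short inputs, a rare class), that points directly to which of the fixes above is likely to help.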

By carefully selecting the right metric and interpreting the results, you will be able to diagnose model weaknesses and focus your improvement efforts where they matter most.

Task

Use your knowledge from previous chapters to complete a small evaluation scenario for a Transformer text classifier.

  • Given a model that predicts whether a movie review is positive or negative, you have the following results on a test set of 10 samples:
    • 6 reviews are truly positive, 4 are truly negative;
    • The model predicts: 5 positive (4 correct), 5 negative (3 correct).
  • Calculate accuracy, precision, recall, and F1 score for the positive class;
  • Enter your answers as decimals rounded to two places.

Solution
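
One way to work through it from the counts in the description: the model makes 5 positive predictions, of which 4 are correct, so TP = 4 and FP = 1; it makes 5 negative predictions, of which 3 are correct, so TN = 3 and FN = 2 (the two positives it missed). Note that TP + FN = 6 true positives and TN + FP = 4 true negatives, matching the test set.

Accuracy = (4 + 3) / 10 = 0.70
Precision = 4 / (4 + 1) = 0.80
Recall = 4 / (4 + 2) ≈ 0.67
F1 = 2 × 0.80 × 0.67 / (0.80 + 0.67) ≈ 0.73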
