Transformers for Natural Language Processing

How NLP Models Have Evolved


The Evolution of NLP Models

Early NLP models relied on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Because RNNs process text one token at a time, they often lose track of long-distance context. CNNs excel at identifying local patterns but struggle with the overall meaning of complex sentences. Both architectures are also held back by slow training and an inability to fully exploit the parallelism of modern hardware.
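To make the sequential bottleneck concrete, here is a minimal sketch of an RNN reading a sentence; the PyTorch cell is real, but the sizes and random data are made up for illustration. Each hidden state depends on the previous one, so the tokens cannot be processed in parallel.

    import torch
    import torch.nn as nn

    # Toy RNN cell: sizes are illustrative, not taken from any real model.
    rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)

    tokens = torch.randn(5, 8)   # 5 made-up token embeddings of dimension 8
    h = torch.zeros(1, 16)       # initial hidden state (batch of 1)

    # The loop is inherently serial: step t cannot start before step t-1 finishes.
    for x in tokens:
        h = rnn_cell(x.unsqueeze(0), h)

    print(h.shape)  # torch.Size([1, 16]) -- a single summary of the whole sentence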

The Power of Transformers

The Transformer architecture revolutionized the field by introducing self-attention. This mechanism allows a model to:

  • Analyze all words in a sentence simultaneously to capture global context (sketched in the code below);
  • Train more efficiently by processing entire sequences in parallel;
  • Achieve superior accuracy in translation, summarization, and text generation.

Mastering these models gives you deeper context and more precise results in your real-world applications.
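The following is a minimal sketch of scaled dot-product self-attention in NumPy; the sizes, random weights, and the self_attention helper are illustrative, not part of any library.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # Project the word vectors into queries, keys, and values.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Every word scores every other word -- this is the "global context" step.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = softmax(scores, axis=-1)
        # Each output is a context-weighted mixture of all value vectors.
        return weights @ V

    rng = np.random.default_rng(0)
    seq_len, d_model = 6, 16                    # made-up sizes for illustration
    X = rng.normal(size=(seq_len, d_model))     # toy "word embeddings"
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 16): one updated vector per word

Because the score matrix covers all positions at once, the whole sentence is handled with a few matrix multiplications instead of a token-by-token loop.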
2017: Attention Is All You Need

Introduced the original Transformer architecture, replacing RNNs/CNNs with self-attention for sequence modeling. Enabled parallel training and better handling of context.
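As a rough sketch of that parallelism, PyTorch's built-in encoder layer processes every position of a sequence in one pass; the hyperparameters and tensor sizes below are arbitrary toy values, far smaller than the paper's.

    import torch
    import torch.nn as nn

    # One Transformer encoder block: multi-head self-attention + feed-forward.
    layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)

    batch = torch.randn(2, 10, 32)   # (batch, sequence length, embedding dim), random data
    out = layer(batch)               # all 10 positions attend to each other at once
    print(out.shape)                 # torch.Size([2, 10, 32])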

2018: BERT (Bidirectional Encoder Representations from Transformers)

Showed how pre-training on large text corpora could yield universal language representations. BERT's bidirectional attention improved performance on many NLP tasks.
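For illustration, a minimal sketch of loading a pre-trained BERT checkpoint with the Hugging Face transformers library and extracting bidirectional contextual embeddings; the checkpoint name and example sentence are just placeholders.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional vector per token, each informed by context on both sides.
    print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])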

2018 - 2019: GPT (Generative Pre-trained Transformer)

Demonstrated the power of large, generative language models trained on vast amounts of data. GPT models could generate coherent, contextually relevant text.
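A minimal sketch of this kind of generation, using the Hugging Face pipeline API with the small public GPT-2 checkpoint; the prompt and generation length are arbitrary.

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # The model continues the prompt token by token, each step conditioned
    # on everything generated so far.
    result = generator("Transformers changed NLP because", max_new_tokens=30)
    print(result[0]["generated_text"])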

2019: Transformer-XL

Extended Transformers to capture longer-term dependencies by introducing recurrence at the segment level, improving performance on long documents.
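A highly simplified, conceptual sketch of segment-level recurrence follows; it is not the actual Transformer-XL code, and the single-head attention, sizes, and process_segment helper are inventions for illustration. The idea: hidden states from the previous segment are cached as a memory that the current segment can attend over.

    import numpy as np

    def process_segment(segment, memory):
        # Queries come from the current segment; keys/values also see the
        # cached memory, so information flows across segment boundaries.
        context = segment if memory is None else np.concatenate([memory, segment])
        scores = segment @ context.T / np.sqrt(segment.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        hidden = weights @ context
        return hidden, hidden        # new hidden states become the next memory

    rng = np.random.default_rng(0)
    long_document = rng.normal(size=(12, 8))      # 12 toy "token" vectors

    memory = None
    for segment in np.split(long_document, 3):    # process 4 tokens at a time
        hidden, memory = process_segment(segment, memory)

    print(hidden.shape)  # (4, 8): last segment, informed by the earlier ones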

2020: T5 (Text-to-Text Transfer Transformer)

Unified many NLP tasks under a single framework by treating all tasks as text-to-text problems, further simplifying model training and deployment.
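A minimal sketch of the text-to-text idea with a small public T5 checkpoint from Hugging Face; the checkpoint, prefixes, and prompts are illustrative. Switching tasks only means switching the input prefix.

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    prompts = [
        "translate English to German: The book is on the table.",
        "summarize: Transformers process all words in parallel with "
        "self-attention, which made training faster and results more accurate.",
    ]

    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40)
        # Every task produces plain text as its answer.
        print(tokenizer.decode(out[0], skip_special_tokens=True))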

Impact of Transformer Milestones

Each milestone has pushed the boundaries of what you can achieve with text data, making models more powerful, flexible, and applicable to real-world NLP challenges.

Question

Which of the following statements best explains why the Transformer architecture replaced RNNs and CNNs in modern NLP?

