Learn Challenge: Building a Transformer Text Classifier | Applying Transformers to NLP Tasks

Section 3. Chapter 2

single

Swipe to show menu


              123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
            
import numpy as np

# Simple tokenizer: splits text into lowercase words
def tokenize(text):
    return text.lower().split()

# Build vocabulary from training texts
def build_vocab(texts):
    vocab = {}
    idx = 1  # 0 reserved for padding
    for text in texts:
        for word in tokenize(text):
            if word not in vocab:
                vocab[word] = idx
                idx += 1
    return vocab

# Convert text to sequence of word indices
def text_to_sequence(text, vocab, max_len):
    seq = [vocab.get(word, 0) for word in tokenize(text)]
    if len(seq) < max_len:
        seq += [0] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return np.array(seq)

# Sinusoidal positional encoding
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pe[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pe

# Simple self-attention layer
def self_attention(X):
    d_k = X.shape[-1]
    Q = X
    K = X
    V = X
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Feed-forward layer
def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Minimal Transformer encoder block
def transformer_encoder(X, pe, W1, b1, W2, b2):
    X_pe = X + pe
    attn_out = self_attention(X_pe)
    ff_out = feed_forward(attn_out, W1, b1, W2, b2)
    return ff_out

# Classifier
class TransformerClassifier:
    def __init__(self, vocab_size, d_model, max_len, num_classes):
        self.d_model = d_model
        self.max_len = max_len
        self.embeddings = np.random.randn(vocab_size + 1, d_model) * 0.01
        self.W1 = np.random.randn(d_model, d_model) * 0.01
        self.b1 = np.zeros(d_model)
        self.W2 = np.random.randn(d_model, d_model) * 0.01
        self.b2 = np.zeros(d_model)
        self.Wc = np.random.randn(d_model, num_classes) * 0.01
        self.bc = np.zeros(num_classes)
        self.pe = positional_encoding(max_len, d_model)

    def forward(self, x_seq):
        X = self.embeddings[x_seq]
        encoded = transformer_encoder(X, self.pe, self.W1, self.b1, self.W2, self.b2)
        pooled = np.mean(encoded, axis=0)  # Simple mean pooling
        logits = pooled @ self.Wc + self.bc
        probs = softmax(logits)
        return probs

# Example usage:
texts = ["I love machine learning", "Transformers are powerful", "Deep learning is cool"]
labels = [0, 1, 0]  # 2 classes: 0, 1
vocab = build_vocab(texts)
max_len = 6
d_model = 8
num_classes = 2

clf = TransformerClassifier(len(vocab), d_model, max_len, num_classes)

# Predict class probabilities for a new text
test_text = "I love transformers"
test_seq = text_to_sequence(test_text, vocab, max_len)
probs = clf.forward(test_seq)
print("Predicted class probabilities:", probs)

You have just seen a minimal implementation of a Transformer-based text classifier using only numpy and core Python features. This code demonstrates how the essential parts of a Transformer can be put together for a text classification task without relying on deep learning libraries.

The process begins by converting raw text into a format suitable for computation. The tokenize function breaks the text into lowercase words:

def tokenize(text):
    return text.lower().split()

Next, build_vocab creates a mapping from each unique word to a unique integer:

def build_vocab(texts):
    vocab = {}
    idx = 1  # 0 reserved for padding
    for text in texts:
        for word in tokenize(text):
            if word not in vocab:
                vocab[word] = idx
                idx += 1
    return vocab

The text_to_sequence function turns each input sentence into a fixed-length sequence of word indices, padding or truncating as needed:

def text_to_sequence(text, vocab, max_len):
    seq = [vocab.get(word, 0) for word in tokenize(text)]
    if len(seq) < max_len:
        seq += [0] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return seq

To give the model awareness of word order, positional_encoding generates a matrix of sinusoidal values that can be added to the word embeddings. This step is crucial because, unlike recurrent networks, Transformers do not inherently process sequences in order:

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pe[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pe

The self-attention mechanism, implemented in self_attention, allows the model to weigh the importance of each word in the context of the whole sequence. This is achieved by computing similarity scores between all pairs of words, applying the softmax function to obtain attention weights, and then creating a new representation for each word as a weighted sum of all words in the sequence:

def self_attention(X):
    d_k = X.shape[-1]
    Q = X
    K = X
    V = X
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

After self-attention, a simple feed-forward network (feed_forward) further processes the sequence:

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

The transformer_encoder function combines these steps, applying positional encoding, self-attention, and the feed-forward network in sequence:

def transformer_encoder(X, pe, W1, b1, W2, b2):
    X_pe = X + pe
    attn_out = self_attention(X_pe)
    ff_out = feed_forward(attn_out, W1, b1, W2, b2)
    return ff_out

The TransformerClassifier class ties everything together. It initializes random word embeddings, layer weights, and positional encodings. In the forward method, it looks up embeddings for each token, applies the encoder block, pools the outputs by averaging across the sequence, and passes the result through a final linear layer to produce class probabilities:

class TransformerClassifier:
    def __init__(self, vocab_size, d_model, max_len, num_classes):
        self.d_model = d_model
        self.max_len = max_len
        self.embeddings = np.random.randn(vocab_size + 1, d_model) * 0.01
        self.W1 = np.random.randn(d_model, d_model) * 0.01
        self.b1 = np.zeros(d_model)
        self.W2 = np.random.randn(d_model, d_model) * 0.01
        self.b2 = np.zeros(d_model)
        self.Wc = np.random.randn(d_model, num_classes) * 0.01
        self.bc = np.zeros(num_classes)
        self.pe = positional_encoding(max_len, d_model)

    def forward(self, x_seq):
        X = self.embeddings[x_seq]
        encoded = transformer_encoder(X, self.pe, self.W1, self.b1, self.W2, self.b2)
        pooled = np.mean(encoded, axis=0)  # Simple mean pooling
        logits = pooled @ self.Wc + self.bc
        probs = softmax(logits)
        return probs

You can see an example at the end of the code where a small set of training texts is used to build the vocabulary and set up the classifier. When you input a new sentence, the model predicts class probabilities, showing how the minimal Transformer pipeline works from raw text to prediction:

texts = ["I love machine learning", "Transformers are powerful", "Deep learning is cool"]
labels = [0, 1, 0]  # 2 classes: 0, 1
vocab = build_vocab(texts)
max_len = 6
d_model = 8
num_classes = 2

clf = TransformerClassifier(len(vocab), d_model, max_len, num_classes)

test_text = "I love transformers"
test_seq = text_to_sequence(test_text, vocab, max_len)
probs = clf.forward(test_seq)
print("Predicted class probabilities:", probs)

Task

Swipe to start coding

Build a minimal Transformer text classifier step by step:

Write a function tokenize that takes a string and returns a list of lowercase words;
Write a function build_vocab that takes a list of texts and returns a dictionary mapping each unique word to an integer index (start indexing from 1, reserve 0 for padding);
Write a function text_to_sequence that converts a text to a list of word indices using a vocabulary and pads/truncates to a fixed length.

Test your functions with:

tokenize("Hello World!") should return ["hello", "world!"];
build_vocab(["Hello world", "world of NLP"]) should return a dictionary mapping words to unique indices (e.g., {"hello": 1, "world": 2, "of": 3, "nlp": 4});
text_to_sequence("Hello NLP", vocab, 4) should return a list of 4 indices, padded with zeros if needed.

Solution

Switch to desktop for real-world practiceContinue from where you are using one of the options below

Everything was clear?

Thanks for your feedback!

Section 3. Chapter 2

single

Ask AI

Ask anything or try one of the suggested questions to begin our chat