Multi-class Cross-Entropy and the Softmax Connection
The multi-class cross-entropy loss is a fundamental tool for training classifiers when there are more than two possible classes. Its formula is:
$$
L_{CE}(y, \hat{p}) = -\sum_{k} y_k \log \hat{p}_k
$$

where $y_k$ is the true distribution for class $k$ (typically 1 for the correct class and 0 otherwise), and $\hat{p}_k$ is the predicted probability for class $k$, usually produced by applying the softmax function to the model's raw outputs.
import numpy as np

correct_probs = np.array([0.9, 0.6, 0.33, 0.1])
loss = -np.log(correct_probs)

for p, l in zip(correct_probs, loss):
    print(f"Predicted probability for true class = {p:.2f} → CE loss = {l:.3f}")
A simple numeric demo showing:
- High confidence & correct → small loss;
- Moderate confidence → moderate loss;
- Confident but wrong (p very small) → huge loss.
Cross-entropy quantifies the difference between true and predicted class distributions. It measures how well the predicted probabilities match the actual class labels, assigning a higher loss when the model is confident but wrong.
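To make the full formula concrete, here is a minimal sketch (the one-hot label and predicted distribution are made up for illustration) that sums $-y_k \log \hat{p}_k$ over all classes. Because $y_k$ is zero everywhere except the true class, the sum collapses to $-\log$ of the probability assigned to that class, matching the demo above.

import numpy as np

# Hypothetical example: 3 classes, true class is index 1 (one-hot encoded)
y_true = np.array([0.0, 1.0, 0.0])
p_hat = np.array([0.2, 0.7, 0.1])   # predicted probabilities (sum to 1)

eps = 1e-12                          # small constant to avoid log(0)
ce_loss = -np.sum(y_true * np.log(p_hat + eps))

print(f"Cross-entropy loss        = {ce_loss:.3f}")        # equals -log(0.7)
print(f"-log(p of the true class) = {-np.log(0.7):.3f}")

Keeping the labels as a full vector means the same computation also works for soft labels (for example with label smoothing), where the sum no longer reduces to a single term.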
The softmax transformation is critical in multi-class classification. It converts a vector of raw output scores (logits) from a model into a probability distribution over classes, ensuring that all predicted probabilities $\hat{p}_k$ are between 0 and 1 and sum to 1. This is defined as:
$$
\hat{p}_k = \frac{\exp(z_k)}{\sum_{j} \exp(z_j)}
$$

where $z_k$ is the raw score for class $k$. Softmax and cross-entropy are paired because softmax outputs interpretable probabilities, and cross-entropy penalizes the model based on how far these probabilities are from the true class distribution. When the model assigns a high probability to the wrong class, the loss increases sharply, guiding the model to improve its predictions.
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exp_vals = np.exp(logits)
softmax = exp_vals / np.sum(exp_vals)

print("Logits:", logits)
print("Softmax probabilities:", softmax)
Shows how a single large logit can dominate the distribution and how softmax normalizes everything into probabilities.
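The two pieces come together as in the sketch below, which reuses the same toy logits and a hypothetical true class index: softmax turns the logits into probabilities, and cross-entropy then rewards a confident correct prediction and punishes a confident wrong one. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the softmax output.

import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability (the result is unchanged)
    shifted = z - np.max(z)
    exp_vals = np.exp(shifted)
    return exp_vals / np.sum(exp_vals)

def cross_entropy(probs, true_class):
    # One-hot labels collapse the sum to -log of the true-class probability
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print("Softmax probabilities:", probs)
print("Loss if true class is 0:", cross_entropy(probs, 0))  # high probability → small loss
print("Loss if true class is 2:", cross_entropy(probs, 2))  # low probability  → large loss

In practice, deep learning libraries usually fuse the two steps into a single "softmax cross-entropy" (or log-softmax plus negative log-likelihood) loss for better numerical behavior, but the separated version above mirrors the formulas in this lesson.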