Loss Function

In training a neural network, we need a way to measure how well our model is performing. This is done using a loss function, which quantifies the difference between the predicted outputs and the actual target values. The goal of training is to minimize this loss, making our predictions as close to the actual values as possible.

One of the most commonly used loss functions for binary classification is the cross-entropy loss, which works well with models that output probabilities.

Derivation of Cross-Entropy Loss

To understand cross-entropy loss, we start with the maximum likelihood principle. In a binary classification problem, the goal is to train a model that estimates the probability that a given input belongs to class 1. The actual label y can be either 0 or 1.

A good model should maximize the probability of correctly predicting all training examples. This means we want to maximize the likelihood function, which represents the probability of seeing the observed data given the model's predictions.

For a single training example, the likelihood of the observed label, given the model's predicted probability ŷ, can be written as:

P(y|x) = ŷ^y · (1 - ŷ)^(1 - y)

This expression simply means:

  • If y = 1, then P(y|x) = ŷ, meaning we want to maximize ŷ (the probability assigned to class 1);
  • If y = 0, then P(y|x) = 1 - ŷ, meaning we want to maximize 1 - ŷ (the probability assigned to class 0).
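
As a quick worked case (with ŷ = 0.8 chosen purely for illustration): if y = 1, the likelihood is 0.8^1 · 0.2^0 = 0.8, and if y = 0, it is 0.8^0 · 0.2^1 = 0.2, which in both cases is exactly the probability the model assigned to the observed label.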

To make optimization easier, we take the log-likelihood instead of the likelihood itself (since logarithms turn products into sums, making differentiation simpler):

log P(y|x) = y · log(ŷ) + (1 - y) · log(1 - ŷ)

Since the goal is maximization, we define the loss function as the negative log-likelihood, which we want to minimize:

L = -[y · log(ŷ) + (1 - y) · log(1 - ŷ)]

This is the binary cross-entropy loss function, commonly used for classification problems.
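
Below is a minimal sketch of this formula in Python; the function name bce_single and the eps clipping constant are assumptions added here for numerical stability, not part of the course code:

    import math

    def bce_single(y, y_hat, eps=1e-12):
        """Binary cross-entropy loss for a single training example."""
        # Keep y_hat strictly inside (0, 1) so log() is always defined.
        y_hat = min(max(y_hat, eps), 1 - eps)
        return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

    print(bce_single(1, 0.9))  # ~0.105: confident and correct, low loss
    print(bce_single(0, 0.9))  # ~2.303: confident but wrong, high loss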

Why This Formula?

Cross-entropy loss has a clear intuitive interpretation:

  • If y = 1, the loss simplifies to -log(ŷ), meaning the loss is low when ŷ is close to 1 and very high when ŷ is close to 0;
  • If y = 0, the loss simplifies to -log(1 - ŷ), meaning the loss is low when ŷ is close to 0 and very high when it is close to 1.

Since log(x) tends to negative infinity as x approaches zero, confident but incorrect predictions are penalized extremely heavily, encouraging the model to make confident, correct predictions.
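
To make the penalty concrete (using the natural logarithm and arbitrary example probabilities): with y = 1, a prediction of ŷ = 0.9 gives a loss of about 0.105, ŷ = 0.5 about 0.693, ŷ = 0.1 about 2.303, and ŷ = 0.01 about 4.605. The loss grows rapidly as the predicted probability of the correct class shrinks.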

If multiple examples are passed during forward propagation, the training examples are treated as independent, so the likelihood of the whole batch is the product of the per-example likelihoods. Taking the negative log turns this product into a sum, and the total loss is computed as the average loss across all examples:

L = -(1/N) · Σᵢ [yᵢ · log(ŷᵢ) + (1 - yᵢ) · log(1 - ŷᵢ)]

where N is the number of training samples, yᵢ is the label of the i-th example, ŷᵢ is its predicted probability, and the sum runs over all N examples.
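
A vectorized sketch of this averaged loss, assuming NumPy arrays of labels and predicted probabilities (the function name binary_cross_entropy and the sample values below are illustrative, not taken from the course code):

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        """Average binary cross-entropy over a batch of predictions."""
        # Clip predictions away from 0 and 1 so np.log never receives 0.
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    # Illustrative labels and predicted probabilities
    y_true = np.array([1, 0, 1])
    y_pred = np.array([0.9, 0.2, 0.6])
    print(binary_cross_entropy(y_true, y_pred))  # ~0.28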
