
Empirical Risk Minimization (ERM)

Empirical risk minimization, or ERM, is a central principle in statistical learning theory. When you use ERM, you choose a function from a set of possible hypotheses that minimizes the average loss on your observed training data. The motivation for ERM comes from the fact that, in real-world situations, the true distribution of the data is unknown. You cannot directly minimize the expected loss (the true risk) because you do not have access to all possible data points, only to a finite sample. ERM offers a practical approach: instead of minimizing the true risk, you minimize the empirical risk, which is the average loss over your sample. This makes ERM the default strategy for many learning algorithms that need to select the best hypothesis based solely on the available data.

Step 1: Define the hypothesis class and loss function

You start with a set of possible functions (hypothesis class) and a loss function that measures how well a function predicts the correct output. The hypothesis class could include models like linear functions, decision trees, or neural networks. The loss function, such as mean_squared_error or zero_one_loss, quantifies prediction errors.
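As a running illustration, here is a minimal sketch in Python of what Step 1 might look like. The one-dimensional threshold classifiers, the grid of thresholds, and the `zero_one_loss` helper are all illustrative choices for this toy setup, not part of any particular library:

```python
def zero_one_loss(prediction, label):
    """Return 1 for a misclassification and 0 for a correct prediction."""
    return 0 if prediction == label else 1

def make_threshold_classifier(t):
    """A hypothesis that predicts 1 when the input exceeds the threshold t."""
    return lambda x: 1 if x > t else 0

# The hypothesis class: eleven threshold classifiers on a grid over [0, 1].
hypothesis_class = [make_threshold_classifier(t / 10) for t in range(11)]
```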

Step 2: Observe a finite training sample

You collect a set of labeled examples drawn from the unknown data distribution. This training sample is your only source of information about the underlying process, since the true distribution is not accessible.
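Continuing the sketch, the snippet below simulates this step by sampling from a hand-written distribution. In a real problem you could never write `draw_example`, because the distribution is unknown; it exists here only so the example is runnable end to end:

```python
import random

random.seed(0)  # for reproducibility of the illustration

# Illustrative stand-in for the unknown distribution: x is uniform on [0, 1],
# the label is 1 when x > 0.6, and 10% of labels are flipped as noise.
def draw_example():
    x = random.random()
    y = 1 if x > 0.6 else 0
    if random.random() < 0.1:  # label noise
        y = 1 - y
    return x, y

# The finite training sample: the learner's only view of the distribution.
training_sample = [draw_example() for _ in range(50)]
```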

Step 3: Compute empirical risk for each hypothesis

For each function in your hypothesis class, calculate the average loss over the training data. This value is called the empirical risk:

\text{Empirical Risk}(h) = \frac{1}{n} \sum_{i=1}^{n} \text{loss}(h(x_i), y_i)

where h is a hypothesis, (x_i, y_i) are the training examples, and n is the number of examples.
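In code, the empirical risk is just the average of the per-example losses. Continuing the sketch above:

```python
def empirical_risk(h, sample, loss):
    """Average loss of hypothesis h over a finite sample of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```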

Step 4: Select the hypothesis with minimal empirical risk

Choose the function from your hypothesis class that achieves the lowest empirical risk on the observed sample. This is the hypothesis that best fits your training data according to the chosen loss function.
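Continuing the sketch, selecting the empirical risk minimizer is an argmin over the hypothesis class; because this toy class is small and finite, exhaustive search suffices:

```python
# Pick the hypothesis with the lowest empirical risk on the training sample.
best_hypothesis = min(
    hypothesis_class,
    key=lambda h: empirical_risk(h, training_sample, zero_one_loss),
)
```

For large or continuous hypothesis classes, this argmin cannot be computed by enumeration and is instead approximated with an optimization procedure such as gradient descent.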

Step 5: Use the selected hypothesis for future predictions

The chosen function is then used to make predictions on new, unseen data. The hope is that minimizing empirical risk will also lead to low true risk on future data.
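Continuing the sketch, a large fresh sample from the same (here, simulated) distribution stands in for future data; its average loss is an estimate of the selected hypothesis's true risk:

```python
# Fresh examples the learner never saw during training.
test_sample = [draw_example() for _ in range(1000)]

print("empirical risk:",
      empirical_risk(best_hypothesis, training_sample, zero_one_loss))
print("estimated true risk:",
      empirical_risk(best_hypothesis, test_sample, zero_one_loss))
```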

Note

While ERM is intuitive and widely used, it has important limitations. The most significant is its sensitivity to overfitting: a hypothesis that perfectly fits the training data may perform poorly on new data if it simply memorizes the sample rather than capturing the underlying pattern. The risk of overfitting is greatest when the hypothesis class is very large or flexible relative to the amount of available data.
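The toy setup above can make this concrete. The "memorizer" below is a hypothetical extension of the earlier sketch, not part of the original threshold class: if the hypothesis class were enlarged to include it, ERM would happily select it, since its empirical risk is exactly zero, yet it generalizes badly:

```python
# A lookup-table hypothesis: perfect recall on training inputs,
# an arbitrary default prediction everywhere else.
lookup = {x: y for x, y in training_sample}

def memorizer(x):
    return lookup.get(x, 0)

# Empirical risk is 0.0 by construction...
print(empirical_risk(memorizer, training_sample, zero_one_loss))
# ...but on fresh continuous inputs no x is ever memorized, so the memorizer
# always predicts the default 0 and its true risk is roughly P(y = 1),
# far worse than the best threshold classifier.
print(empirical_risk(memorizer, test_sample, zero_one_loss))
```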


