
Empirical Risk Minimization (ERM)

Empirical risk minimization, or ERM, is a central principle in statistical learning theory. When you use ERM, you choose a function from a set of possible hypotheses that minimizes the average loss on your observed training data. The motivation for ERM comes from the fact that, in real-world situations, the true distribution of the data is unknown. You cannot directly minimize the expected loss (the true risk) because you do not have access to all possible data points, only to a finite sample. ERM offers a practical approach: instead of minimizing the true risk, you minimize the empirical risk, which is the average loss over your sample. This makes ERM the default strategy for many learning algorithms that need to select the best hypothesis based solely on the available data.

Step 1: Define the hypothesis class and loss function

You start with a set of possible functions (hypothesis class) and a loss function that measures how well a function predicts the correct output. The hypothesis class could include models like linear functions, decision trees, or neural networks. The loss function, such as mean_squared_error or zero_one_loss, quantifies prediction errors.
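As a running illustration, here is a minimal sketch in Python of what Step 1 might look like. The one-dimensional threshold classifiers, the grid of thresholds, and the `zero_one_loss` helper are all illustrative choices for this toy setup, not part of any particular library:

```python
def zero_one_loss(prediction, label):
    """Return 1 for a misclassification and 0 for a correct prediction."""
    return 0 if prediction == label else 1

def make_threshold_classifier(t):
    """A hypothesis that predicts 1 when the input exceeds the threshold t."""
    return lambda x: 1 if x > t else 0

# The hypothesis class: eleven threshold classifiers on a grid over [0, 1].
hypothesis_class = [make_threshold_classifier(t / 10) for t in range(11)]
```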

Step 2: Observe a finite training sample

You collect a set of labeled examples drawn from the unknown data distribution. This training sample is your only source of information about the underlying process, since the true distribution is not accessible.
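Continuing the sketch, the snippet below simulates this step by sampling from a hand-written distribution. In a real problem you could never write `draw_example`, because the distribution is unknown; it exists here only so the example is runnable end to end:

```python
import random

random.seed(0)  # for reproducibility of the illustration

# Illustrative stand-in for the unknown distribution: x is uniform on [0, 1],
# the label is 1 when x > 0.6, and 10% of labels are flipped as noise.
def draw_example():
    x = random.random()
    y = 1 if x > 0.6 else 0
    if random.random() < 0.1:  # label noise
        y = 1 - y
    return x, y

# The finite training sample: the learner's only view of the distribution.
training_sample = [draw_example() for _ in range(50)]
```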

Step 3: Compute empirical risk for each hypothesis

For each function in your hypothesis class, calculate the average loss over the training data. This value is called the empirical risk:

\text{Empirical Risk}(h) = \frac{1}{n} \sum_{i=1}^{n} \text{loss}(h(x_i), y_i)

where h is a hypothesis, (x_i, y_i) are the training examples, and n is the number of examples.
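In code, the empirical risk is just the average of the per-example losses. Continuing the sketch above:

```python
def empirical_risk(h, sample, loss):
    """Average loss of hypothesis h over a finite sample of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```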

Step 4: Select the hypothesis with minimal empirical risk

Choose the function from your hypothesis class that achieves the lowest empirical risk on the observed sample. This is the hypothesis that best fits your training data according to the chosen loss function.
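Continuing the sketch, selecting the empirical risk minimizer is an argmin over the hypothesis class; because this toy class is small and finite, exhaustive search suffices:

```python
# Pick the hypothesis with the lowest empirical risk on the training sample.
best_hypothesis = min(
    hypothesis_class,
    key=lambda h: empirical_risk(h, training_sample, zero_one_loss),
)
```

For large or continuous hypothesis classes, this argmin cannot be computed by enumeration and is instead approximated with an optimization procedure such as gradient descent.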

Step 5: Use the selected hypothesis for future predictions

The chosen function is then used to make predictions on new, unseen data. The hope is that minimizing empirical risk will also lead to low true risk on future data.
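Continuing the sketch, a large fresh sample from the same (here, simulated) distribution stands in for future data; its average loss is an estimate of the selected hypothesis's true risk:

```python
# Fresh examples the learner never saw during training.
test_sample = [draw_example() for _ in range(1000)]

print("empirical risk:",
      empirical_risk(best_hypothesis, training_sample, zero_one_loss))
print("estimated true risk:",
      empirical_risk(best_hypothesis, test_sample, zero_one_loss))
```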

Note

While ERM is intuitive and widely used, it has important limitations. The most significant is its sensitivity to overfitting: a hypothesis that perfectly fits the training data may perform poorly on new data if it simply memorizes the sample rather than capturing the underlying pattern. The risk of overfitting is greatest when the hypothesis class is very large or flexible relative to the amount of available data.
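The toy setup above can make this concrete. The "memorizer" below is a hypothetical extension of the earlier sketch, not part of the original threshold class: if the hypothesis class were enlarged to include it, ERM would happily select it, since its empirical risk is exactly zero, yet it generalizes badly:

```python
# A lookup-table hypothesis: perfect recall on training inputs,
# an arbitrary default prediction everywhere else.
lookup = {x: y for x, y in training_sample}

def memorizer(x):
    return lookup.get(x, 0)

# Empirical risk is 0.0 by construction...
print(empirical_risk(memorizer, training_sample, zero_one_loss))
# ...but on fresh continuous inputs no x is ever memorized, so the memorizer
# always predicts the default 0 and its true risk is roughly P(y = 1),
# far worse than the best threshold classifier.
print(empirical_risk(memorizer, test_sample, zero_one_loss))
```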


