
Formalizing the Tradeoff

To understand the bias–variance tradeoff formally, consider a supervised learning scenario where you want to predict an output variable $Y$ from an input $X$ using a model trained on data. Suppose the true relationship between $X$ and $Y$ is given by a function $f(x)$, but you only observe noisy samples: $Y = f(X) + \varepsilon$, where $\varepsilon$ is random noise with mean zero and variance $\sigma^2$.

The expected prediction error of a model at a point $x$ can be decomposed as follows:

Bias–Variance Decomposition:

$$\mathbb{E}_{\mathcal{D},\varepsilon}\left[ (Y - \hat{f}(x))^2 \right] = \left( \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] - f(x) \right)^2 + \mathbb{E}_{\mathcal{D}}\left[ \left( \hat{f}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] \right)^2 \right] + \sigma^2$$
  • The first term, $\left( \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] - f(x) \right)^2$, is the squared bias: how far the average model prediction is from the true function;
  • The second term, $\mathbb{E}_{\mathcal{D}}\left[ \left( \hat{f}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] \right)^2 \right]$, is the variance: how much the model's prediction varies across different training sets;
  • The third term, $\sigma^2$, is the irreducible error: the intrinsic noise in the data that no model can eliminate.
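The decomposition can be checked numerically. The sketch below is a minimal Monte Carlo estimate: the true function, noise level, training-set size, evaluation point, and degree-3 polynomial model are all assumptions chosen purely for illustration. It draws many independent training sets $\mathcal{D}$, records $\hat{f}(x_0)$ for each, estimates the squared bias and variance at $x_0$, and compares their sum plus $\sigma^2$ with a direct estimate of the expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: true function, noise level, model capacity.
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3          # noise standard deviation
degree = 3           # fixed model capacity (polynomial degree)
n_train = 30         # training-set size
x0 = 0.25            # the point x at which the error is decomposed
n_datasets = 2000    # number of independent training sets D

preds = np.empty(n_datasets)  # f_hat(x0) for each training set
for i in range(n_datasets):
    X = rng.uniform(0, 1, n_train)
    y = f(X) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(X, y, degree)   # train f_hat on this dataset
    preds[i] = np.polyval(coeffs, x0)   # predict at x0

bias_sq = (preds.mean() - f(x0)) ** 2   # (E_D[f_hat(x0)] - f(x0))^2
variance = preds.var()                  # E_D[(f_hat(x0) - E_D[f_hat(x0)])^2]
irreducible = sigma ** 2                # sigma^2

# Direct estimate of the left-hand side: E_{D,eps}[(Y - f_hat(x0))^2],
# pairing each trained model with an independent noisy observation Y at x0.
y0 = f(x0) + rng.normal(0, sigma, n_datasets)
expected_error = np.mean((y0 - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + irreducible:.4f}")
print(f"directly estimated error    = {expected_error:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, illustrating that the decomposition is an identity rather than an approximation.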

The bias–variance tradeoff arises because models with high capacity (complexity) tend to have low bias but high variance, while simpler models have higher bias but lower variance. The optimal generalization performance is achieved by balancing these two sources of error, minimizing their sum.
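A small variation of the same simulation makes this tradeoff visible: sweeping the polynomial degree (again with an assumed true function, noise level, and evaluation point used only for illustration) typically shows the squared bias shrinking while the variance grows as capacity increases.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function, for illustration
sigma, n_train, x0, n_datasets = 0.3, 30, 0.25, 2000

for degree in [1, 3, 5, 9]:           # increasing model capacity
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        X = rng.uniform(0, 1, n_train)
        y = f(X) + rng.normal(0, sigma, n_train)
        preds[i] = np.polyval(np.polyfit(X, y, degree), x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```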

Note

When selecting a model for a learning task, you must consider both bias and variance to achieve optimal generalization. If you choose a model that is too simple, it may have high bias and underfit the data, failing to capture important patterns. On the other hand, a model that is too complex may have high variance and overfit, capturing random noise instead of the underlying structure. The practical implication is that you should use validation techniques, such as cross-validation, to empirically find the right level of model complexity that minimizes the total expected error on unseen data, rather than focusing solely on fitting the training set.
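A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset (the true function, noise level, and candidate degrees are illustrative choices): fit polynomial models of increasing complexity and compare their 5-fold cross-validation errors, then pick the degree with the lowest estimated error on held-out data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic noisy dataset; the true function and noise level are assumptions.
X = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 100)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validation estimates the expected error on unseen data.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV mean squared error = {mse:.4f}")
```

Very low degrees tend to show high cross-validation error from bias (underfitting), very high degrees from variance (overfitting); the degree with the smallest cross-validation error approximates the balance point described above.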

