
Formalizing the Tradeoff

To understand the bias–variance tradeoff formally, consider a supervised learning scenario where you want to predict an output variable $Y$ from an input $X$ using a model trained on data. Suppose the true relationship between $X$ and $Y$ is given by a function $f(x)$, but you only observe noisy samples: $Y = f(X) + \varepsilon$, where $\varepsilon$ is random noise with mean zero and variance $\sigma^2$.

The expected prediction error of a model at a point $x$ can be decomposed as follows:

Bias–Variance Decomposition:

$$\mathbb{E}_{\mathcal{D},\varepsilon}\left[ (Y - \hat{f}(x))^2 \right] = \left( \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] - f(x) \right)^2 + \mathbb{E}_{\mathcal{D}}\left[ \left( \hat{f}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] \right)^2 \right] + \sigma^2$$
  • The first term, $\left( \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] - f(x) \right)^2$, is the squared bias: how far the average model prediction is from the true function;
  • The second term, $\mathbb{E}_{\mathcal{D}}\left[ \left( \hat{f}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x)] \right)^2 \right]$, is the variance: how much the model's prediction varies across different training sets;
  • The third term, $\sigma^2$, is the irreducible error: the intrinsic noise in the data that no model can eliminate.
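The decomposition can be checked numerically. The sketch below is a minimal Monte Carlo estimate: the true function, noise level, training-set size, evaluation point, and degree-3 polynomial model are all assumptions chosen purely for illustration. It draws many independent training sets $\mathcal{D}$, records $\hat{f}(x_0)$ for each, estimates the squared bias and variance at $x_0$, and compares their sum plus $\sigma^2$ with a direct estimate of the expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: true function, noise level, model capacity.
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3          # noise standard deviation
degree = 3           # fixed model capacity (polynomial degree)
n_train = 30         # training-set size
x0 = 0.25            # the point x at which the error is decomposed
n_datasets = 2000    # number of independent training sets D

preds = np.empty(n_datasets)  # f_hat(x0) for each training set
for i in range(n_datasets):
    X = rng.uniform(0, 1, n_train)
    y = f(X) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(X, y, degree)   # train f_hat on this dataset
    preds[i] = np.polyval(coeffs, x0)   # predict at x0

bias_sq = (preds.mean() - f(x0)) ** 2   # (E_D[f_hat(x0)] - f(x0))^2
variance = preds.var()                  # E_D[(f_hat(x0) - E_D[f_hat(x0)])^2]
irreducible = sigma ** 2                # sigma^2

# Direct estimate of the left-hand side: E_{D,eps}[(Y - f_hat(x0))^2],
# pairing each trained model with an independent noisy observation Y at x0.
y0 = f(x0) + rng.normal(0, sigma, n_datasets)
expected_error = np.mean((y0 - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + irreducible:.4f}")
print(f"directly estimated error    = {expected_error:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, illustrating that the decomposition is an identity rather than an approximation.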

The bias–variance tradeoff arises because models with high capacity (complexity) tend to have low bias but high variance, while simpler models have higher bias but lower variance. The optimal generalization performance is achieved by balancing these two sources of error, minimizing their sum.
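A small variation of the same simulation makes this tradeoff visible: sweeping the polynomial degree (again with an assumed true function, noise level, and evaluation point used only for illustration) typically shows the squared bias shrinking while the variance grows as capacity increases.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function, for illustration
sigma, n_train, x0, n_datasets = 0.3, 30, 0.25, 2000

for degree in [1, 3, 5, 9]:           # increasing model capacity
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        X = rng.uniform(0, 1, n_train)
        y = f(X) + rng.normal(0, sigma, n_train)
        preds[i] = np.polyval(np.polyfit(X, y, degree), x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```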

Note

When selecting a model for a learning task, you must consider both bias and variance to achieve optimal generalization. If you choose a model that is too simple, it may have high bias and underfit the data, failing to capture important patterns. On the other hand, a model that is too complex may have high variance and overfit, capturing random noise instead of the underlying structure. The practical implication is that you should use validation techniques, such as cross-validation, to empirically find the right level of model complexity that minimizes the total expected error on unseen data, rather than focusing solely on fitting the training set.
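A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset (the true function, noise level, and candidate degrees are illustrative choices): fit polynomial models of increasing complexity and compare their 5-fold cross-validation errors, then pick the degree with the lowest estimated error on held-out data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic noisy dataset; the true function and noise level are assumptions.
X = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 100)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validation estimates the expected error on unseen data.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV mean squared error = {mse:.4f}")
```

Very low degrees tend to show high cross-validation error from bias (underfitting), very high degrees from variance (overfitting); the degree with the smallest cross-validation error approximates the balance point described above.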

