Mean Squared Error (MSE): Theory and Intuition
To understand the Mean Squared Error (MSE) loss, begin with its mathematical form. For a single data point, where y is the true value and ŷ (read as "y-hat") is the predicted value, the MSE loss is defined as:
$$L_{\text{MSE}}(y, \hat{y}) = (y - \hat{y})^2$$

```python
import numpy as np
import matplotlib.pyplot as plt

errors = np.linspace(-4, 4, 400)
mse = errors**2

plt.plot(errors, mse)
plt.title("MSE Loss as a Function of Error")
plt.xlabel("Error (y - ŷ)")
plt.ylabel("Loss")
plt.show()
```
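As a quick numerical check, here is a minimal sketch that evaluates this loss for a single prediction; the true and predicted values are arbitrary example numbers:

```python
# Arbitrary example point: true value 3.0, prediction 2.5
y = 3.0
y_hat = 2.5

loss = (y - y_hat) ** 2
print("Single-point MSE loss:", loss)  # (3.0 - 2.5)^2 = 0.25
```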
When you have a dataset with n observations, the average MSE across all points becomes:
$$L_{\text{MSE\_avg}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This formula calculates the mean of the squared differences between the actual and predicted values, providing a single metric that summarizes how well the model's predictions match the true outputs.
```python
import numpy as np

errors = np.array([1, 2, 4, 8, 16, 32])
squared = errors**2

print("Errors:", errors)
print("Squared errors:", squared)
```
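To tie this back to the averaged formula, here is a minimal sketch that computes the mean of the squared differences for a small set of made-up true and predicted values:

```python
import numpy as np

# Hypothetical true values and model predictions
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Average MSE: mean of the squared differences
mse_avg = np.mean((y - y_hat) ** 2)
print("Average MSE:", mse_avg)  # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```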
MSE penalizes larger errors more heavily because the difference is squared, so outliers or large deviations have a disproportionately large effect on the final value. MSE is also statistically well motivated when the noise in the data is Gaussian (normally distributed): minimizing it is then equivalent to maximum likelihood estimation, which makes it a natural choice under that assumption.
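To make the effect of squaring on outliers concrete, the sketch below (using invented error values) compares the average MSE of a set of small errors with and without a single large error added:

```python
import numpy as np

# Invented errors: a few small deviations, then one large outlier
small_errors = np.array([0.5, -1.0, 0.75, -0.25])
with_outlier = np.append(small_errors, 10.0)

print("Average MSE without outlier:", np.mean(small_errors ** 2))
print("Average MSE with outlier:   ", np.mean(with_outlier ** 2))
```

A single outlier raises the average from about 0.47 to over 20, illustrating how heavily MSE weights large deviations.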
You can interpret MSE geometrically as the squared Euclidean distance between the vectors of true values and predicted values. If you imagine each data point as a dimension, the difference vector y − ŷ represents the error in each dimension. Squaring and summing these errors gives the squared length (or squared distance) between the prediction vector and the true vector. This is the foundation of least squares regression, where the goal is to find the line (or hyperplane) that minimizes the sum of squared errors over all data points.
```python
import numpy as np

y = np.array([3, 5, 7])
y_hat = np.array([2.5, 5.5, 6])

diff = y - y_hat
squared_distance = np.sum(diff**2)

print("Error vector:", diff)
print("Squared Euclidean distance:", squared_distance)
```
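Building on the geometric view, least squares regression chooses the line whose predictions minimize exactly this squared distance. Below is a minimal sketch using np.polyfit, which solves the least squares problem for a polynomial fit; the slope, intercept, and noise level of the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative parameters)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.size)

# A degree-1 polynomial fit minimizes the sum of squared errors
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

print("Fitted slope and intercept:", slope, intercept)
print("Sum of squared errors:", np.sum((y - y_hat) ** 2))
```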
There is also a probabilistic perspective: minimizing MSE is equivalent to maximum likelihood estimation under the assumption that the noise in your observations is Gaussian. In other words, if your data are noisy measurements centered around some true value, the constant that minimizes the MSE is their sample mean, which is also the maximum likelihood estimate of that value. This connection is why the mean is called the optimal estimator under MSE loss.
```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10
observations = true_value + rng.normal(0, 2, size=1000)

mse_estimate = np.mean(observations)
print("Mean (MLE under MSE):", mse_estimate)
```
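As a sanity check on this claim, the sketch below (reusing arbitrary simulated observations like those above) evaluates the average squared error of a few candidate constant estimates; the sample mean always achieves the lowest value:

```python
import numpy as np

rng = np.random.default_rng(0)
observations = 10 + rng.normal(0, 2, size=1000)

# Compare the average squared error of several constant estimates
candidates = [9.0, 9.5, float(np.mean(observations)), 10.5, 11.0]
for c in candidates:
    avg_sq_error = np.mean((observations - c) ** 2)
    print(f"estimate = {c:.3f}  average squared error = {avg_sq_error:.3f}")
```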