Overfitting and Model Complexity
Understanding how your model performs on new, unseen data is a core challenge in supervised learning. Two concepts that often come up are overfitting and underfitting. Overfitting happens when your model learns not only the underlying pattern in the training data but also the noise—meaning it performs very well on the training set but poorly on new data. Underfitting is the opposite: your model is too simple to capture the underlying structure, resulting in poor performance on both training and test data.
This leads to the bias–variance tradeoff. Bias refers to errors introduced by approximating a real-world problem with a simplified model. Variance is the error introduced by sensitivity to small fluctuations in the training set. A model with high bias pays little attention to the training data and oversimplifies the model (underfitting). A model with high variance pays too much attention to the training data and does not generalize well (overfitting). Finding the right balance between bias and variance is crucial for building models that generalize well.
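For squared-error loss this tradeoff can be stated exactly. Writing f for the true function, f̂ for the model fit on a random training set, and σ² for the irreducible noise variance, the standard bias-variance decomposition of the expected test error at a point x is:

```latex
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \sigma^2
```

Making the model simpler usually shrinks the variance term while growing the bias term, and vice versa, which is why total error is typically minimized at an intermediate level of complexity.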
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data: a nonlinear (sinusoidal) trend plus noise,
# so a straight line genuinely underfits it
np.random.seed(0)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + np.random.normal(0, 0.15, size=X.shape)

# Reshape X into a column vector for sklearn
X = X.reshape(-1, 1)

# Fit linear regression (degree 1)
poly1 = PolynomialFeatures(degree=1)
X_poly1 = poly1.fit_transform(X)
model1 = LinearRegression().fit(X_poly1, y)
y_pred1 = model1.predict(X_poly1)

# Fit polynomial regression (degree 15 - very complex)
poly15 = PolynomialFeatures(degree=15)
X_poly15 = poly15.fit_transform(X)
model15 = LinearRegression().fit(X_poly15, y)
y_pred15 = model15.predict(X_poly15)

# Plot results
plt.figure(figsize=(10, 5))
plt.scatter(X, y, color='black', label='Data')
plt.plot(X, y_pred1, color='blue', label='Degree 1 (Underfit)')
plt.plot(X, y_pred15, color='red', linestyle='--', label='Degree 15 (Overfit)')
plt.legend()
plt.title('Polynomial Regression: Underfitting vs Overfitting')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
```
When you increase the complexity of your model, such as by raising the polynomial degree in regression, you give the model more flexibility to fit the training data. In the code above, the degree 1 polynomial (a straight line) cannot capture the pattern in the data well, resulting in underfitting. The degree 15 polynomial, on the other hand, fits the training data almost perfectly—including its noise—leading to overfitting. This model will likely perform poorly on new data because it has learned patterns that do not generalize. The key is to choose a model that is complex enough to capture the underlying trend, but not so complex that it memorizes noise.
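One way to see this numerically, not shown in the lesson itself but sketched below under an assumed setup (nonlinear synthetic data plus a held-out split), is to compare training and test error for the two degrees; the overfit model's test error is typically far worse than its near-zero training error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data of the same kind as above: a nonlinear trend plus noise
np.random.seed(0)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.15, size=X.shape[0])

# Hold out part of the data so generalization can actually be measured
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression().fit(X_train_poly, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
    test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```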
This is why controlling model complexity is so important for generalization. You want your model to perform well on both the training data and unseen data. As you saw in the previous example, too simple a model leads to high bias and underfitting, while too complex a model leads to high variance and overfitting.
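A common recipe for finding that middle ground, sketched here as an assumption rather than something prescribed by this lesson, is cross-validation: score a range of candidate degrees on held-out folds and keep the degree with the lowest validation error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data, matching the earlier sketches
np.random.seed(0)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.15, size=X.shape[0])

# Score each candidate degree with 5-fold cross-validation (negative MSE)
degrees = range(1, 16)
cv_mse = []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_degree = degrees[int(np.argmin(cv_mse))]
print(f"degree with lowest cross-validated MSE: {best_degree}")
```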
Regularization is a set of techniques used to control model complexity by adding a penalty to large parameter values in a model. By discouraging overly complex models, regularization helps prevent overfitting and improves the model's ability to generalize to new data.
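Ridge regression is one such technique: it adds an L2 penalty on the coefficients to the least-squares objective, so large weights become expensive and the fitted curve stays smooth even with many polynomial features. The sketch below is illustrative only; the alpha value and the data are assumptions, not values from this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic nonlinear data, matching the earlier sketches
np.random.seed(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.15, size=X.shape[0])

# Degree 15 features fit with plain least squares vs. with an L2 (ridge) penalty
unregularized = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    LinearRegression(),
).fit(X, y)
regularized = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),  # alpha controls penalty strength (illustrative value)
).fit(X, y)

# The penalty keeps the fitted weights small, which smooths the fitted curve
print("max |coefficient|, unregularized:", np.abs(unregularized[-1].coef_).max())
print("max |coefficient|, ridge:        ", np.abs(regularized[-1].coef_).max())
```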