
Overfitting and Regularization

As demonstrated in the previous chapter, using PolynomialFeatures, you can create a complex decision boundary. Second-degree polynomial features can even produce the boundaries shown in the image below:

And that is only degree two. A higher degree can yield even more complex shapes. But this comes with a problem: the decision boundary built by Logistic Regression may become too complicated, causing the model to overfit.
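As a refresher, here is a minimal sketch of what a second-degree polynomial expansion produces; the tiny input array is purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1, 2],
              [3, 4]])

# Expand two features into all second-degree terms
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# Columns: x0, x1, x0^2, x0*x1, x1^2
```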

Overfitting occurs when the model, instead of learning general patterns in the data, builds a very complex decision boundary to accommodate every single training instance. As a result, it performs worse on data it has never seen, even though performing well on unseen data is the primary task of a machine learning model.

Regularization tackles the problem of overfitting. In fact, l2 regularization is used by default in the LogisticRegression class, so all you need to configure is how strongly the model should be regularized. This is controlled by the C parameter:

  • greater C - weaker regularization, more overfitting;
  • lower C - stronger regularization, less overfitting (but possibly underfitting); see the sketch after this list.
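Here is a minimal sketch of fitting models at both extremes; the dataset (make_classification) and the specific C values are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# greater C - weaker regularization; lower C - stronger regularization
weak_reg = LogisticRegression(C=100, max_iter=1000).fit(X, y)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

# Stronger regularization shrinks the coefficients toward zero
print(np.abs(weak_reg.coef_).max())
print(np.abs(strong_reg.coef_).max())
```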

Which value of C results in a good model depends on the dataset, so it is best to choose it using GridSearchCV.
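A minimal sketch of such a search follows, assuming an illustrative dataset and an arbitrary grid of candidate C values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Try several regularization strengths with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the C value that scored best on validation folds
```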

Note

When using Logistic Regression with regularization, it's essential to scale your data. Regularization penalizes large coefficients, and without scaling, features with larger values can distort the results. In fact, scaling is almost always necessary - even when regularization is not used.

The LogisticRegression class includes regularization by default, so you should either remove regularization (by setting penalty=None) or scale the data (e.g., using StandardScaler). Both options are sketched below.
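Here is a minimal sketch of both options, assuming an illustrative dataset from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Option 1: keep the default l2 regularization, but scale the data first
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X, y)

# Option 2: remove regularization entirely (penalty=None requires
# scikit-learn >= 1.2; older versions use penalty='none')
unregularized = LogisticRegression(penalty=None, max_iter=1000)
unregularized.fit(X, y)
```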

Note

If you're using both PolynomialFeatures and StandardScaler, make sure to apply StandardScaler after generating the polynomial features. Scaling the data before polynomial expansion can distort the resulting features, since operations like squaring or multiplying already standardized values may lead to unnatural distributions.
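A minimal sketch of this ordering inside a Pipeline, again with an illustrative dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Polynomial expansion first, scaling second, regularized model last
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X, y)
```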

1. Choose the INCORRECT statement.

2. What is the correct order to preprocess data?

