Course Content

Linear Regression with Python

1. Simple Linear Regression

What is Linear Regression; Finding the Parameters; Building Linear Regression Using NumPy; Building Linear Regression Using Statsmodels; Challenge: Predicting House Prices

2. Multiple Linear Regression

Linear Regression with Two Features; Linear Regression with N Features; Building Multiple Linear Regression; Choosing the Features; Challenge: Predicting Prices Using Two Features

3. Polynomial Regression

Quadratic Regression; Polynomial Regression; Building Polynomial Regression; Interpolation vs Extrapolation; Challenge: Evaluating the Model

4. Choosing The Best Model

Metrics; Overfitting; R-squared; Challenge: Predicting Prices Using Polynomial Regression

Metrics

When building a model, it is important to measure its performance.
We require a score associated with the model that accurately describes how well it fits the data. This score is known as a metric, and there are numerous metrics available.
In this chapter, we will focus on the most commonly used ones.

We will use the following notation:
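A common convention, assumed for the formula sketches in this chapter (the symbol choices are an assumption and may differ from the course's own):

```latex
\begin{aligned}
n &: \text{number of instances} \\
y_i &: \text{actual target value of the } i\text{-th instance} \\
\hat{y}_i &: \text{predicted target value of the } i\text{-th instance} \\
e_i = y_i - \hat{y}_i &: \text{residual of the } i\text{-th instance}
\end{aligned}
```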

We are already familiar with one metric, SSR (Sum of Squared Residuals), which we minimized to find the optimal parameters.
SSR is the sum, over all instances, of the squared differences between the actual and predicted target values, that is, the sum of the squared residuals.
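In symbols, a sketch assuming $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of instances (the symbol names are an assumption). The second form writes the same sum in terms of the residual $e_i$:

```latex
\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad \text{or equally} \qquad
\text{SSR} = \sum_{i=1}^{n} e_i^2, \quad e_i = y_i - \hat{y}_i
```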

SSR works for comparing models evaluated on the same number of instances. However, it does not tell us how well a model performs on its own. Here is why:
Suppose you have two models trained on different training sets (shown in the image below).

The first model fits its data well but still has a higher SSR than the second model, which visually fits its data worse. This happens only because the first model has many more data points, so the sum is larger, even though its residuals are smaller on average. Taking the average of the squared residuals therefore describes the model better. That is precisely what the Mean Squared Error (MSE) is.
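The effect can be reproduced with a small sketch. The residuals are hypothetical values chosen only for illustration: model A has small residuals on many points, model B has large residuals on few points.

```python
import numpy as np

# Hypothetical residuals, chosen only for illustration:
# model A fits closely but is evaluated on 1000 points,
# model B fits loosely but is evaluated on only 20 points.
residuals_a = np.tile([1.0, -1.0], 500)  # 1000 residuals of magnitude 1
residuals_b = np.tile([3.0, -3.0], 10)   # 20 residuals of magnitude 3

# SSR sums the squared residuals, so it grows with the number of points
ssr_a, ssr_b = np.sum(residuals_a ** 2), np.sum(residuals_b ** 2)

# MSE averages them, so it is comparable across dataset sizes
mse_a, mse_b = np.mean(residuals_a ** 2), np.mean(residuals_b ** 2)

print(ssr_a, ssr_b)  # 1000.0 180.0
print(mse_a, mse_b)  # 1.0 9.0
```

SSR ranks model A as worse purely because of its larger dataset, while MSE ranks it as better.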

MSE

MSE is the average of the squared residuals, or equally, SSR divided by the number of instances.
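Written out, assuming $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of instances (the symbol names are an assumption):

```latex
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{\text{SSR}}{n}
```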

To calculate MSE in Python, you can use NumPy: square the residuals and average them with np.mean().
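A minimal sketch of the NumPy calculation, assuming y_true and y_pred are equal-length NumPy arrays (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.0, 5.0, 8.0, 11.0])  # predicted target values

# Mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 1.5
```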

Or you can use Scikit-learn's mean_squared_error() function, called as mean_squared_error(y_true, y_pred).
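A minimal sketch of the Scikit-learn call, using hypothetical sample values:

```python
from sklearn.metrics import mean_squared_error

# Hypothetical sample values, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]   # actual target values
y_pred = [2.0, 5.0, 8.0, 11.0]  # predicted target values

mse = mean_squared_error(y_true, y_pred)
print(mse)  # 1.5
```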

Here, y_true is an array of actual target values and y_pred is an array of target values predicted for the same instances.

The problem with MSE is that the error it reports is in squared units. For example, suppose the MSE of a model predicting house prices is 49 dollars². We are interested in price, not price squared, so we take the square root of MSE and get 7 dollars. Now we have a metric with the same unit as the predicted value. This metric is called Root Mean Squared Error (RMSE).

RMSE
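RMSE is the square root of MSE. In symbols, assuming $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of instances (the symbol names are an assumption):

```latex
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```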

To calculate RMSE in Python, you can use NumPy: take the square root of the MSE with np.sqrt().
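A minimal sketch of the NumPy calculation, assuming y_true and y_pred are equal-length NumPy arrays (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.0, 5.0, 8.0, 11.0])  # predicted target values

# Square root of the mean of the squared residuals
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # about 1.2247, the square root of 1.5
```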

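Or you can use Scikit-learn's mean_squared_error() function with squared=False. Note that the squared parameter was deprecated in scikit-learn 1.4, which added a dedicated root_mean_squared_error() function, and has been removed in later releases.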
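A version-independent sketch: rather than relying on squared=False or root_mean_squared_error(), whose availability depends on the installed scikit-learn release, it takes the square root of mean_squared_error() directly. The sample values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical sample values, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]   # actual target values
y_pred = [2.0, 5.0, 8.0, 11.0]  # predicted target values

# RMSE as the square root of MSE; works regardless of whether the
# installed version supports squared=False or root_mean_squared_error()
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # about 1.2247, the square root of 1.5
```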

MAE

In SSR, we squared the residuals to get rid of the sign. An alternative is to take the absolute values of the residuals instead of squaring them. That is the idea behind Mean Absolute Error (MAE).

MAE is the average of the absolute values of the residuals. It is the same as MSE, but instead of squaring the residuals, we take their absolute values.
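In symbols, assuming $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of instances (the symbol names are an assumption):

```latex
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```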

Because MAE uses absolute values rather than squares, it does not amplify large errors the way MSE does, which makes it more robust to outliers. As a result, MAE is often a better choice when the dataset contains outliers, since a few extreme errors do not increase its value disproportionately.
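The difference in sensitivity can be illustrated with a small sketch. The residuals and the helper functions mae() and mse() are hypothetical, introduced only for this example: a single outlier is appended to otherwise small residuals, and the growth of each metric is compared.

```python
import numpy as np

def mae(residuals):
    """Mean of the absolute residuals."""
    return np.mean(np.abs(residuals))

def mse(residuals):
    """Mean of the squared residuals."""
    return np.mean(residuals ** 2)

# Hypothetical residuals, chosen only for illustration
residuals = np.array([1.0, -1.0, 2.0, -2.0])
with_outlier = np.append(residuals, 20.0)  # add one extreme error

# How many times each metric grows once the outlier is included
mae_growth = mae(with_outlier) / mae(residuals)
mse_growth = mse(with_outlier) / mse(residuals)

print(round(mae_growth, 2), round(mse_growth, 2))
```

With these values MAE grows roughly 3.5 times, while MSE grows roughly 33 times.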

To calculate MAE in Python, you can use NumPy: average the absolute residuals with np.abs() and np.mean().
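A minimal sketch of the NumPy calculation, assuming y_true and y_pred are equal-length NumPy arrays (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.0, 5.0, 8.0, 11.0])  # predicted target values

# Mean of the absolute residuals
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 1.0
```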

Or you can use Scikit-learn's mean_absolute_error() function, called as mean_absolute_error(y_true, y_pred).
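A minimal sketch of the Scikit-learn call, using hypothetical sample values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical sample values, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]   # actual target values
y_pred = [2.0, 5.0, 8.0, 11.0]  # predicted target values

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 1.0
```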

We chose the parameters by minimizing SSR because it is mathematically convenient and leads to the Normal Equation. To compare models afterwards, however, you can use any other metric.

Note

When comparing models on the same data, SSR, MSE, and RMSE always agree on which model is better and which is worse. MAE can sometimes prefer a different model than SSR/MSE/RMSE, since those penalize large residuals much more heavily. Usually, you want to choose one metric in advance and focus on minimizing it.
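A small sketch of such a disagreement, assuming both models are evaluated on the same four instances. The residuals and the helper function metrics() are hypothetical, introduced only for this example.

```python
import numpy as np

def metrics(residuals):
    """Return SSR, MSE, RMSE, and MAE for an array of residuals."""
    mse = np.mean(residuals ** 2)
    return {
        "ssr": np.sum(residuals ** 2),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
    }

# Hypothetical residuals on the same four instances
m1 = metrics(np.array([2.0, 2.0, 2.0, 2.0]))  # uniformly moderate errors
m2 = metrics(np.array([0.0, 0.0, 0.0, 6.0]))  # mostly exact, one large error

print(m1)  # ssr 16.0, mse 4.0, rmse 2.0, mae 2.0
print(m2)  # ssr 36.0, mse 9.0, rmse 3.0, mae 1.5
```

SSR, MSE, and RMSE all prefer the first model, while MAE prefers the second, because the single large residual is squared in the former metrics but not in MAE.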

Now you can surely tell that the second model is better since all its metrics are lower. However, lower metrics do not always mean the model is better.
