ML Introduction with scikit-learn
StandardScaler, MinMaxScaler, MaxAbsScaler
There are three popular approaches to scaling the data:
MinMaxScaler: scales features to a [0, 1] range;
MaxAbsScaler: scales features such that the maximum absolute value is 1 (so the data is guaranteed to be in a [-1, 1] range);
StandardScaler: standardizes features, making the mean equal to 0 and the variance equal to 1.
To demonstrate how the scalers work, we will use the 'culmen_depth_mm' and 'body_mass_g' features of the penguins dataset. Let's plot them.
MinMaxScaler
The MinMaxScaler works by subtracting the minimum value (so that the values start from zero) and then dividing by (x_max - x_min), so that the largest value becomes 1.
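As a minimal sketch of the formula above, the following compares a manual computation with MinMaxScaler. The toy array is hypothetical and is not taken from the penguins dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature: one column, four samples
X = np.array([[10.0], [15.0], [20.0], [30.0]])

# Manual min-max scaling: (x - x_min) / (x_max - x_min), computed per column
x_min, x_max = X.min(axis=0), X.max(axis=0)
manual = (X - x_min) / (x_max - x_min)

# The same transformation done by scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual.ravel())               # values 0, 0.25, 0.5 and 1
print(np.allclose(manual, scaled))  # True
```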
Here is the gif showing how MinMaxScaler works:
MaxAbsScaler
The MaxAbsScaler works by finding the maximum absolute value and dividing each value by it. This ensures that the maximum absolute value is 1.
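A small sketch of the same idea in code, using a hypothetical toy array with both negative and positive values. Note that the data is only divided, never shifted, so signs are preserved.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy feature containing negative and positive values
X = np.array([[-4.0], [-1.0], [2.0], [8.0]])

# Manual scaling: divide every value by the column's maximum absolute value
max_abs = np.abs(X).max(axis=0)
manual = X / max_abs

# The same transformation done by scikit-learn
scaled = MaxAbsScaler().fit_transform(X)

print(manual.ravel())               # values -0.5, -0.125, 0.25 and 1
print(np.allclose(manual, scaled))  # True
```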
StandardScaler
The idea of StandardScaler comes from statistics. It works by subtracting the mean (to center around zero) and dividing by the standard deviation (to make the variance equal to 1).
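The standardization step can be sketched as follows, again on a hypothetical toy array. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default, whereas the pandas std method defaults to ddof=1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature: one column, four samples
X = np.array([[2.0], [4.0], [6.0], [8.0]])

# Manual standardization: (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```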
Let's look at a coding example using MinMaxScaler. Other scalers are used in the same way.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed_encoded.csv')
# Assign X, y variables
X, y = df.drop('species', axis=1), df['species']
# Initialize a MinMaxScaler object and transform X
minmax = MinMaxScaler()
X = minmax.fit_transform(X)
print(X)
The output is not the prettiest since scalers transform the data to a NumPy array, but with pipelines, it won't be a problem.
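Outside of a pipeline, one possible way to get readable output is to wrap the scaled array back into a DataFrame with the original column names. This is only a sketch: the small DataFrame below is a hypothetical stand-in with made-up values, not the actual penguins data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for two penguin features (values are made up)
X = pd.DataFrame({
    'culmen_depth_mm': [18.7, 17.4, 18.0, 15.2],
    'body_mass_g': [3750.0, 3800.0, 3250.0, 5400.0],
})

# fit_transform returns a plain NumPy array without column names
X_scaled = MinMaxScaler().fit_transform(X)

# Restore the column names and the index for easier inspection
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
print(X_scaled_df)
```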
Which Scaler to Use?
StandardScaler is less sensitive to outliers than the other two scalers, because a single extreme value sets the entire output range for MinMaxScaler and MaxAbsScaler. This makes it a good default scaler. If you prefer an alternative to StandardScaler, the choice between MinMaxScaler and MaxAbsScaler depends on personal preference: scaling the data to the [0, 1] range with MinMaxScaler or to the [-1, 1] range with MaxAbsScaler.
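To see how an outlier affects each scaler, here is a rough sketch on a hypothetical toy feature with four ordinary values and one extreme value. With MinMaxScaler, the outlier pins the top of the range, so the four ordinary values end up within about 0.03 of 0. Printing all three outputs side by side lets you judge the effect for yourself.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, StandardScaler

# Hypothetical toy feature: four ordinary values and one outlier (100)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Apply each scaler to the same data and collect the results by class name
results = {}
for scaler in (MinMaxScaler(), MaxAbsScaler(), StandardScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(X).ravel()
    print(name, np.round(results[name], 3))
```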