Data Scaling | Processing Quantitative Data

Data Scaling

Data scaling is a technique used to transform data into a common scale, making it easier to compare and analyze. It is an important step in data preprocessing that helps improve machine learning models' performance.

Real-world data often comes in different units and ranges, which makes features hard to compare directly. Scaling brings all features onto a common scale, so that no feature dominates a machine learning model merely because of the units it is measured in.

Suppose we have a dataset of customer information for a bank, where we want to predict whether or not a customer will default on their loan. The dataset contains: age, income, credit score, loan amount, and whether or not the customer defaulted (1 for yes, 0 for no).

Let's say that the age column ranges from 20 to 70, the income column ranges from 20,000 to 200,000, and the credit score column ranges from 400 to 800. The loan amount column, however, ranges from 10,000 to 500,000, a far wider range than age or credit score and the widest of all the columns.

If we trained a machine learning model on this data without scaling the features, the loan amount could have a disproportionate influence on the prediction. Models that rely on distances or on gradient-based optimization, such as k-nearest neighbors, support vector machines, or neural networks, are sensitive to feature magnitude: a feature whose values span hundreds of thousands swamps one whose values span 50, regardless of how informative each one is.

As a result, the model may be less accurate than it could be, because a feature's influence reflects its units rather than its actual importance. To avoid this, we scale the data so that all features have a similar range.
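The effect is easy to see for a distance-based model. The sketch below uses three hypothetical customers (the values are invented for illustration) and the column ranges from the example above. Before scaling, a 10% difference in loan amount outweighs a large difference in age and credit score; after min-max scaling, the ordering flips.

```python
import numpy as np

# Hypothetical customers: [age, income, credit score, loan amount]
a = np.array([25, 50_000, 700, 100_000], dtype=float)
b = np.array([65, 50_000, 450, 100_000], dtype=float)  # very different age and credit score
c = np.array([25, 50_000, 700, 110_000], dtype=float)  # differs from `a` only by loan amount

# Column ranges taken from the example in the text
col_min = np.array([20, 20_000, 400, 10_000], dtype=float)
col_max = np.array([70, 200_000, 800, 500_000], dtype=float)


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))


def min_max(x):
    """Scale a feature vector to [0, 1] using the column ranges above."""
    return (x - col_min) / (col_max - col_min)


raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)
scaled_ab = euclidean(min_max(a), min_max(b))
scaled_ac = euclidean(min_max(a), min_max(c))

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.4f}, d(a, c) = {scaled_ac:.4f}")
```

On the raw data, customer `c` looks far more distant from `a` than `b` does, even though `b` differs much more in terms of each feature's own range.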

There are several techniques for scaling data; here we look in detail only at min-max normalization.

Min-max normalization scales the data to a fixed range between 0 and 1. The formula for min-max normalization is:

$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where X is the original value, X_min is the minimum value in the data, and X_max is the maximum value in the data.
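Min-max normalization can be applied directly with NumPy: subtract each column's minimum and divide by that column's range (maximum minus minimum). The following is a minimal sketch on a small illustrative array; the variable names are assumptions made for this example.

```python
import numpy as np

# Small illustrative dataset: 3 observations (rows), 3 features (columns)
dataset = np.array([[10, 2, 3], [5, 7, 9], [11, 12, 8]], dtype=float)

# Per-column minimum and maximum (axis=0 reduces over the rows)
x_min = dataset.min(axis=0)
x_max = dataset.max(axis=0)

# Min-max normalization: (X - X_min) / (X_max - X_min), broadcast per column
manual_scaled = (dataset - x_min) / (x_max - x_min)

print(manual_scaled)
```

In each column, the smallest value maps to 0 and the largest to 1. The sketch assumes no column is constant; a constant column would make the denominator zero.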

Other methods include Z-score normalization and decimal scaling normalization.
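For comparison, here is a rough NumPy sketch of the two other methods, using their usual textbook definitions: Z-score normalization subtracts the column mean and divides by the column standard deviation, and decimal scaling divides each column by the smallest power of 10 that brings every absolute value below 1. The array and variable names are illustrative assumptions.

```python
import numpy as np

values = np.array([[10, 2, 3], [5, 7, 9], [11, 12, 8]], dtype=float)

# Z-score normalization: (X - mean) / std, computed per column.
# np.std defaults to the population standard deviation (ddof=0).
mean = values.mean(axis=0)
std = values.std(axis=0)
z_scored = (values - mean) / std

# Decimal scaling: divide by 10**j, where j is the smallest integer
# such that max(|X|) / 10**j < 1. Assumes every column has a nonzero maximum.
max_abs = np.abs(values).max(axis=0)
j = np.floor(np.log10(max_abs)) + 1
decimal_scaled = values / 10 ** j

print(z_scored)
print(decimal_scaled)
```

Unlike min-max normalization, Z-score normalization does not bound the result to a fixed interval; it produces columns with mean 0 and standard deviation 1.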

Here is an example of how to normalize data using sklearn:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create simple dataset
dataset = np.array([[10, 2, 3], [5, 7, 9], [11, 12, 8]])

# Create a scaler model
scaler = MinMaxScaler()

# Fit and transform dataset
scaled_data = scaler.fit_transform(dataset)
```

We first import the MinMaxScaler class and NumPy. Next, we create a MinMaxScaler object called scaler. This scaler transforms our data to a common scale using the minimum and maximum values of each column.

We then fit the scaler to our sample data and transform it in one step with fit_transform. The resulting scaled_data is a NumPy array that contains our scaled data.
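To see what the fitted scaler learned, it can help to inspect its attributes. The sketch below repeats the example and prints `data_min_` and `data_max_` (the per-column minimum and maximum that scikit-learn stores after fitting), then uses `inverse_transform` to map the scaled values back to the original ones. The variable `restored` is a name chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

dataset = np.array([[10, 2, 3], [5, 7, 9], [11, 12, 8]])

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(dataset)

# Per-column statistics learned during fitting
print(scaler.data_min_)
print(scaler.data_max_)

# Every column of the scaled data now lies in [0, 1]
print(scaled_data)

# inverse_transform undoes the scaling
restored = scaler.inverse_transform(scaled_data)
print(restored)
```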

Which scaling technique is appropriate depends on the nature of the data and the specific problem being solved.

Data scaling is usually done along the feature axis: each feature (column) is scaled separately, using its own minimum and maximum, so that all features end up on a similar scale. Scaling along the example axis (row-wise) would instead rescale each observation using values that come from different features, which can distort the relationships between the features.
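The difference between the two axes can be illustrated with a small sketch. The helper `min_max` and the two-column data (age and income) are hypothetical, made up for this example.

```python
import numpy as np


def min_max(data, axis):
    """Min-max scale `data` along the given axis (0 = per column, 1 = per row)."""
    lo = data.min(axis=axis, keepdims=True)
    hi = data.max(axis=axis, keepdims=True)
    return (data - lo) / (hi - lo)


# Hypothetical observations: [age, income]
data = np.array([[25, 30_000],
                 [40, 60_000],
                 [55, 90_000]], dtype=float)

by_feature = min_max(data, axis=0)  # each column scaled on its own
by_row = min_max(data, axis=1)      # each row scaled on its own

print(by_feature)
print(by_row)
```

Scaling by feature keeps the ordering of customers within each column. Scaling by row turns every observation into `[0, 1]`, because age is always the smaller value and income the larger one, so the differences between customers disappear.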

The last point we'll look at is which data should be used to fit the scaler: the training set, the test set, or the entire dataset? The scaler should be fitted on the training data only. The test data is then transformed with those same parameters (the training minimum and maximum), so that it is consistent with the training data. Fitting the scaler on the entire dataset, or refitting it on the test set, would let information from the test set leak into the preprocessing and make the evaluation overly optimistic.
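A minimal sketch of this workflow with scikit-learn is shown below. The data is randomly generated for illustration (two hypothetical features with the age and income ranges from the earlier example), and the split ratio and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 100 rows of [age, income] and a binary target
rng = np.random.default_rng(0)
X = rng.uniform(low=[20, 20_000], high=[70, 200_000], size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()

# Learn the minimum and maximum from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training parameters on the test data: transform, not fit_transform
X_test_scaled = scaler.transform(X_test)
```

Because the test set is scaled with the training minimum and maximum, some test values may fall slightly outside the [0, 1] range. This is expected and does not indicate an error.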

Task

Scale the data in the 'pr_cars.csv' dataset.
