Deleting "Extra" Data | Brief Introduction

Data processing, which includes removing “extra” data, covers a wide range of tasks. Here we briefly review three of its main steps: handling missing values, removing duplicates, and dropping non-informative features.

Dealing with missing values is the first important step

The simplest thing we can do is remove rows that contain NaN values:
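A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The toy data and column names below are illustrative only and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "species": ["Adelie", "Adelie", None, "Gentoo"],
})

# dropna() returns a new DataFrame without any row that has a missing value.
df_clean = df.dropna()
print(df_clean)
```

By default `dropna()` drops a row if any of its columns is missing; the original `df` is left unchanged.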

Another option is to replace the missing data with the median over the entire column:
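One way this could look with pandas, again on hypothetical toy data (the column name `body_mass_g` is an assumption made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values.
df = pd.DataFrame({"body_mass_g": [3750.0, np.nan, 3250.0, 4675.0, np.nan]})

# median() skips NaN values by default.
median_value = df["body_mass_g"].median()

# Replace every missing value in the column with that median.
df["body_mass_g"] = df["body_mass_g"].fillna(median_value)
print(df)
```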

If rows with NaN values make up no more than about 5-8% of the dataset, they can usually still be removed. If there are many more of them, it is better to use imputation, that is, to replace the missing values with the mean, median, or mode.
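To apply this rule of thumb, it helps to measure the share of affected rows first. The sketch below is one possible way to do that; the helper name `nan_row_share` and the 0.08 cut-off (the upper end of the range mentioned above) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def nan_row_share(df: pd.DataFrame) -> float:
    """Return the fraction of rows that contain at least one missing value."""
    return float(df.isna().any(axis=1).mean())

# Hypothetical toy data: 2 of the 4 rows contain a missing value.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})

share = nan_row_share(df)
# Assumed cut-off, taken from the upper end of the 5-8% rule of thumb.
strategy = "drop rows" if share <= 0.08 else "impute"
print(f"{share:.0%} of rows have missing values -> {strategy}")
```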

For example, mean imputation suits features with a roughly symmetric distribution, but the mean is sensitive to outliers. Median imputation is more robust and is a better fit for data with a skewed distribution.

Mode imputation is commonly used for categorical features and discrete variables with a small number of possible values.
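The three strategies can be sketched side by side. The data and the choice of which column gets which strategy are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, np.nan, 193.0],  # roughly symmetric
    "body_mass_g": [3000.0, 3100.0, np.nan, 9000.0],     # skewed by one large value
    "island": ["Biscoe", "Dream", "Biscoe", None],       # categorical
})

# Mean imputation for the symmetric numeric column.
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(df["flipper_length_mm"].mean())

# Median imputation for the skewed numeric column.
df["body_mass_g"] = df["body_mass_g"].fillna(df["body_mass_g"].median())

# mode() returns a Series because there may be ties; take the first value.
df["island"] = df["island"].fillna(df["island"].mode()[0])
print(df)
```

Note how the single large value in `body_mass_g` would pull the mean far above the typical values, while the median stays close to them.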

Identification of duplicates in the dataset is our next step

To implement it, we use the .drop_duplicates() method:
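A short sketch on hypothetical toy data, which first counts the duplicated rows with `.duplicated()` and then removes them:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "island": ["Dream", "Dream", "Biscoe", "Dream"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
df_unique = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df_unique))
```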

Removing non-informative features is the last thing we will consider

A column in which almost every row holds the same value provides little useful information for the project. Using the following algorithm, we can compile a list of features for which more than 95% of the rows contain the same value:
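One possible version of such an algorithm is sketched below. The function name `low_information_columns` and the toy data are hypothetical, and counting NaN as a value of its own is a design choice made for this sketch.

```python
import pandas as pd

def low_information_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """List columns whose most frequent value fills more than `threshold` of the rows."""
    columns = []
    for col in df.columns:
        # Share of rows taken by the most frequent value (NaN counted as a value).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            columns.append(col)
    return columns

# Hypothetical toy data: "constant_flag" is 1 in 99 of 100 rows.
df = pd.DataFrame({
    "constant_flag": [1] * 99 + [0],
    "value": list(range(100)),
})

to_drop = low_information_columns(df)
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```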

Another simple way to find non-informative features is to calculate the correlation between each feature and the target variable. If the absolute correlation does not reach a certain threshold (which you set manually), the feature can be removed. Keep in mind that correlation captures only linear relationships, so this check is best suited to linear models. When working with non-linear dependencies, you can instead compare the cross-entropy of a model trained with and without the features in question.
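A minimal sketch of the correlation check, assuming numeric features and a numeric target. The column names, the toy data, and the 0.5 threshold are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: "x_strong" is linearly related to the target, "x_weak" is not.
df = pd.DataFrame({
    "x_strong": [1, 2, 3, 4, 5, 6],
    "x_weak": [1, -1, -1, 1, 1, -1],
    "target": [2, 4, 6, 8, 10, 12],
})

threshold = 0.5  # assumed cut-off; choose it for your own data

# Absolute Pearson correlation of every feature with the target.
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# Features whose correlation with the target stays below the threshold.
weak_features = correlations[correlations < threshold].index.tolist()
print(correlations)
print(weak_features)
```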

In some cases, even features with small correlations can still provide useful information when combined with other features. Feature selection techniques such as forward/backward selection or regularization methods can be used to identify and select the most informative features for the model.
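To make forward selection concrete, here is a simplified sketch written with NumPy only. It greedily adds the feature that most reduces the residual sum of squares of a least-squares fit on the training data. The function name `forward_select` and the synthetic data are hypothetical, and a real project would normally score candidates on validation data rather than on the training set.

```python
import numpy as np

def forward_select(X: np.ndarray, y: np.ndarray, n_features: int) -> list:
    """Greedy forward selection using the training RSS of a linear fit."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        best_j, best_rss = None, np.inf
        for j in remaining:
            # Fit intercept + already selected features + candidate feature j.
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: only the first two of four features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

selected = forward_select(X, y, n_features=2)
print(selected)
```

Because each candidate is evaluated together with the features already chosen, a feature that looks weak on its own can still be picked if it helps in combination with the others.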

Task

Clean up the penguins.csv dataset using the 2 methods above.

Section 1. Chapter 4