Data Preprocessing
Deleting "Extra" Data
Data preprocessing, which includes removing “extra” data, covers many different tasks. Here we briefly review three of its main steps: handling missing values, removing duplicates, and removing non-informative features.
Dealing with missing values is the first important step
The simplest thing we can do is remove rows that contain NaN values.
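A minimal sketch of row removal, assuming the data is held in a pandas DataFrame. The toy data and column names below are hypothetical and only serve as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
})

# Drop every row that has at least one NaN value.
df_clean = df.dropna()

print(len(df), "rows before,", len(df_clean), "rows after")
```

Note that `dropna()` returns a new DataFrame and leaves the original unchanged.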
Another option is to replace the missing data with the median of the entire column.
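Median replacement could look roughly like this, again assuming a pandas DataFrame. The column name is a hypothetical placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
df = pd.DataFrame({"body_mass_g": [3000.0, np.nan, 4000.0, 5000.0]})

# median() skips NaN by default, so it is computed from the known values only.
median_value = df["body_mass_g"].median()

# Fill the gaps with that median.
df["body_mass_g"] = df["body_mass_g"].fillna(median_value)
```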
If no more than 5-8% of the rows contain NaN values, they can simply be removed. If there are more, it is better to use imputation (replacing the missing values with the mean, median, or mode).
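One way to apply this rule of thumb is to measure the share of affected rows first. The sketch below assumes a pandas DataFrame with numeric columns. The cutoff `MAX_DROP_SHARE` is a hypothetical name, set here to the lower end of the 5-8% range:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 of the 10 rows contain a NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
})

# Share of rows that have at least one NaN.
nan_row_share = df.isna().any(axis=1).mean()

MAX_DROP_SHARE = 0.05  # assumed cutoff, taken from the 5-8% guideline

if nan_row_share <= MAX_DROP_SHARE:
    # Few affected rows: removing them loses little data.
    df_out = df.dropna()
else:
    # Many affected rows: impute with the column medians instead.
    df_out = df.fillna(df.median())
```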
For example, mean imputation works well for data with a symmetrical distribution, but it is sensitive to outliers. Median imputation is better suited to data with a skewed distribution.
Mode imputation is commonly used for categorical features and discrete variables with a small number of possible values.
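Mean and mode imputation follow the same pattern as the median. This is a sketch with hypothetical column names, assuming pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, np.nan, 200.0],
    "island": ["Biscoe", None, "Biscoe", "Dream"],
})

# Mean imputation for a roughly symmetrical numeric feature.
mean_value = df["flipper_length_mm"].mean()
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(mean_value)

# Mode imputation for a categorical feature.
# mode() returns a Series because several values can tie; take the first.
mode_value = df["island"].mode().iloc[0]
df["island"] = df["island"].fillna(mode_value)
```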
Identification of duplicates in the dataset is our next step
To do this, we use the .drop_duplicates() method, which removes repeated rows.
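A short sketch of duplicate handling with pandas. The data is a hypothetical example, and `duplicated()` is used first so that we can see how many rows will be dropped:

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first one.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3750, 3750, 5000, 3800],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_unique = df.drop_duplicates().reset_index(drop=True)
```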
Removing non-informative features is the last thing we will consider
A column in which almost all rows hold the same value provides little useful information for the project. We can compile a list of features for which more than 95% of the rows contain the same value.
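The original algorithm is not shown here, so the following is only one possible sketch of such a check, assuming pandas. The function name `find_low_information_columns` and the toy columns are hypothetical:

```python
import pandas as pd


def find_low_information_columns(df, threshold=0.95):
    """Return columns whose most frequent value fills more than `threshold` of the rows."""
    low_info = []
    for col in df.columns:
        # normalize=True gives shares instead of counts, sorted from largest to smallest;
        # dropna=False counts missing values as a value of their own.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            low_info.append(col)
    return low_info


# Hypothetical toy data: one nearly constant column and one varied column.
df = pd.DataFrame({
    "constant_flag": [1] * 99 + [0],
    "varied": list(range(100)),
})

low_info_cols = find_low_information_columns(df)
df_reduced = df.drop(columns=low_info_cols)
```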
Another simple method for removing non-informative features is to calculate the correlation between each feature and the target variable. If the correlation does not reach a certain threshold (which you set manually), the feature can be removed. Keep in mind that correlation captures only linear relationships, so it is mainly useful for linear models. When working with non-linear dependencies, you can instead compare the cross-entropy of a model trained with and without certain features.
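A sketch of the correlation filter, assuming pandas and the Pearson correlation. The threshold value, the column names, and the data are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data: x_strong follows the target closely, x_weak does not.
df = pd.DataFrame({
    "x_strong": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_weak": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
    "target": [2.0, 4.1, 5.9, 8.2, 9.9, 12.1],
})

THRESHOLD = 0.3  # assumed cutoff; choose it manually for your own data

# Absolute Pearson correlation of every feature with the target.
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# Features whose correlation stays below the threshold are candidates for removal.
weak_features = correlations[correlations < THRESHOLD].index.tolist()
df_selected = df.drop(columns=weak_features)
```

The absolute value is used because a strong negative correlation is just as informative as a strong positive one.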
In some cases, even features with small correlations can still provide useful information when combined with other features. Feature selection techniques such as forward/backward selection or regularization methods can be used to identify and select the most informative features for the model.
Clean up the penguins.csv dataset using the 2 methods above.