Learn Deleting "Extra" Data | Brief Introduction
Data Preprocessing
Deleting "Extra" Data

Data preprocessing, which includes removing “extra” data, covers a wide range of tasks. Here we briefly review three of its main steps: handling missing values, removing duplicates, and dropping non-informative features.

Dealing with missing values is the first important step

The simplest thing we can do is remove rows that contain NaN values:
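A minimal sketch of this step, assuming pandas and a small made-up DataFrame (the data and column names are illustrative, not taken from the course):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
})

# Drop every row that contains at least one NaN value.
df_clean = df.dropna()
print(df_clean)
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data stays available for comparison.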


Another option is to replace the missing data with the median over the entire column:
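One way this could look with pandas, again on a made-up column (the name `body_mass_g` is only an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
df = pd.DataFrame({"body_mass_g": [3000.0, np.nan, 4000.0, 5000.0]})

# median() skips NaN by default, so it is computed from observed values only.
median_value = df["body_mass_g"].median()

# Replace the missing entries with that median.
df["body_mass_g"] = df["body_mass_g"].fillna(median_value)
print(df)
```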


If rows containing NaN values make up no more than roughly 5-8% of the dataset, removing them is usually acceptable. If the share is larger, it is better to use imputation, that is, to replace the missing values with the mean, median, or mode.
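The rule of thumb above can be sketched as a simple check. The 8% cut-off and the choice of median imputation as the fallback are assumptions made for this example, not fixed rules:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 of 10 rows contains a NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Fraction of rows that contain at least one NaN.
nan_row_share = df.isna().any(axis=1).mean()

THRESHOLD = 0.08  # assumed cut-off, upper end of the 5-8% range

if nan_row_share <= THRESHOLD:
    # Few incomplete rows: dropping them loses little data.
    df_clean = df.dropna()
else:
    # Many incomplete rows: impute each numeric column with its median.
    df_clean = df.fillna(df.median(numeric_only=True))

print(f"Share of rows with NaN: {nan_row_share:.0%}")
```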

For example, mean imputation suits features with a roughly symmetric distribution, but it is sensitive to outliers. Median imputation is more robust and is better suited to data with a skewed distribution.

Mode imputation is commonly used for categorical features and discrete variables with a small number of possible values.
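As a rough illustration of mean and mode imputation side by side (the DataFrame and its column names are invented for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, np.nan, 200.0],
    "island": ["Biscoe", "Dream", "Biscoe", None],
})

# Mean imputation for a roughly symmetric numeric feature.
mean_value = df["flipper_length_mm"].mean()
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(mean_value)

# Mode imputation for a categorical feature.
# mode() returns a Series because ties are possible; take the first entry.
most_frequent = df["island"].mode().iloc[0]
df["island"] = df["island"].fillna(most_frequent)
print(df)
```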

Identification of duplicates in the dataset is our next step

To find and remove them, we use the .drop_duplicates() method:
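A short sketch on invented data, using `duplicated()` to count repeated rows before dropping them:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0 exactly.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "body_mass_g": [3750, 3750, 5000, 3750],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_unique = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_duplicates} duplicate rows")
```

By default all columns are compared; the `subset` parameter of `drop_duplicates()` restricts the comparison to selected columns.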


Removing non-informative features is the last thing we will consider

A column in which almost every row holds the same value provides little useful information. Using the following algorithm, we can compile a list of features for which more than 95% of the rows contain the same value:
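The original listing is not reproduced here; one possible way to build such a list with pandas, on made-up data, is:

```python
import pandas as pd

# Hypothetical toy data: "constant_like" holds the same value in 99% of rows.
df = pd.DataFrame({
    "constant_like": ["x"] * 99 + ["y"],
    "varied": list(range(100)),
})

THRESHOLD = 0.95  # share of rows taken by a single value

low_info_cols = []
for col in df.columns:
    # Share of rows occupied by the most common value (NaN counts as a value).
    top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
    if top_share > THRESHOLD:
        low_info_cols.append(col)

# Drop the columns that were flagged as non-informative.
df_reduced = df.drop(columns=low_info_cols)
print(low_info_cols)
```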


Another simple method for removing non-informative features is to calculate the correlation between each feature and the target variable. If the absolute correlation does not reach a threshold that you set manually, the feature can be removed. Note that correlation captures only linear relationships, so this check is most appropriate for linear models. When working with non-linear dependencies, you can instead compare the cross-entropy of a model trained with and without the feature in question.
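A sketch of the correlation filter using pandas' `corrwith()`, which computes Pearson correlation by default. The data, the column names, and the 0.5 threshold are all assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: "strong" tracks the target, "weak" barely does.
df = pd.DataFrame({
    "strong": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "weak": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "target": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})

TARGET = "target"
THRESHOLD = 0.5  # assumed minimum absolute correlation to keep a feature

# Absolute Pearson correlation of each feature with the target.
correlations = df.drop(columns=TARGET).corrwith(df[TARGET]).abs()

# Features below the threshold are candidates for removal.
weak_features = correlations[correlations < THRESHOLD].index.tolist()
df_selected = df.drop(columns=weak_features)
print(correlations)
```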

In some cases, even features with small correlations can still provide useful information when combined with other features. Feature selection techniques such as forward/backward selection or regularization methods can be used to identify and select the most informative features for the model.

Task

Using the above two methods, clean up the penguins.csv dataset.
