Deleting "Extra" Data | Brief Introduction

Data preprocessing, which includes removing "extra" data, covers a wide range of tasks. Here we briefly review its main steps: handling missing values, removing duplicates, and dropping non-informative features.

Dealing with missing values is the first important step

The simplest thing we can do is remove rows that contain NaN values:

df = df.dropna()

Another option is to replace the missing data with the median over the entire column:

med = df['bill_depth_mm'].median()
df['bill_depth_mm'] = df['bill_depth_mm'].fillna(med)

If rows with NaN values make up no more than about 5-8% of the dataset, they can simply be removed. If there are many more, it is better to use imputation (replacing the missing values with the mean, median, or mode).
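As a rough illustration of this rule of thumb, the sketch below measures the share of rows that contain at least one NaN and then either drops those rows or imputes them. The toy data, the `body_mass_g` column, and the `MAX_DROP_SHARE` cut-off of 8% are assumptions made for the example, not values fixed by the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "bill_depth_mm": [18.7, 17.4, np.nan, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1],
    "body_mass_g": [3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250, 3300],
})

# Share of rows that contain at least one NaN
nan_row_share = df.isna().any(axis=1).mean()

# Assumed cut-off, taken from the upper end of the 5-8% rule of thumb
MAX_DROP_SHARE = 0.08

if nan_row_share <= MAX_DROP_SHARE:
    # Few incomplete rows: dropping them loses little data
    cleaned = df.dropna()
else:
    # Too many incomplete rows: fill numeric columns with their median instead
    cleaned = df.fillna(df.median(numeric_only=True))
```

In this toy data one row in ten is incomplete, which is above the assumed cut-off, so the missing value is imputed rather than dropped.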

Mean imputation, for example, suits data with a roughly symmetrical distribution, but it is sensitive to outliers. Median imputation is better suited to data with a skewed distribution.

Mode imputation is commonly used for categorical features and discrete variables with a small number of possible values.
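The three imputation strategies can be sketched side by side. The column names and values below are hypothetical and chosen only so that each column matches one of the cases described above: a roughly symmetrical column, a skewed column, and a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration
df = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, np.nan, 193.0, 190.0],  # roughly symmetrical
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 6300.0],    # skewed by one large value
    "island": ["Torgersen", "Biscoe", None, "Biscoe", "Dream"], # categorical
})

# Mean for the roughly symmetrical column
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(df["flipper_length_mm"].mean())

# Median for the skewed column, so the large value does not pull the fill value up
df["body_mass_g"] = df["body_mass_g"].fillna(df["body_mass_g"].median())

# mode() returns a Series because ties are possible, so take its first entry
df["island"] = df["island"].fillna(df["island"].mode().iloc[0])
```

When several values tie for the mode, taking the first entry is an arbitrary choice, so it is worth inspecting the result of `mode()` before filling.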

Identification of duplicates in the dataset is our next step

To remove duplicate rows, we use the .drop_duplicates() method:

df = df.drop_duplicates()
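Before removing anything, it can help to check how many duplicates there actually are. The sketch below, on hypothetical toy data, counts repeated rows with `.duplicated()` and then drops them.

```python
import pandas as pd

# Hypothetical toy data with two repeats of the first row
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "bill_depth_mm": [18.7, 18.7, 14.3, 18.7],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = df.duplicated().sum()

# keep="first" (the default) retains the first occurrence of each row;
# reset_index gives the remaining rows a clean 0..n-1 index
deduped = df.drop_duplicates().reset_index(drop=True)
```

The `subset` parameter of both methods restricts the comparison to selected columns, which may be useful when only some columns define what counts as a duplicate.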

Removing non-informative features is the last thing we will consider

A column in which almost every row holds the same value provides little useful information. Using the following loop, we can compile a list of features for which more than 95% of the rows contain the same value:

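num_rows = len(df.index)
low_information_cols = []

for col in df.columns:
    # Count each distinct value, treating NaN as a value of its own
    cnts = df[col].value_counts(dropna=False)
    # value_counts sorts in descending order, so the first entry is the most frequent value
    top_pct = (cnts / num_rows).iloc[0]

    if top_pct > 0.95:
        low_information_cols.append(col)
        print('{0}: {1:.5f}%'.format(col, top_pct * 100))
        print(cnts)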
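Once the list is compiled, the flagged columns still have to be dropped. A minimal sketch of the whole step, wrapped in a hypothetical helper `find_low_information_cols` and run on made-up data, might look like this:

```python
import pandas as pd

def find_low_information_cols(df, threshold=0.95):
    """Return columns whose most frequent value (NaN included) covers more than `threshold` of rows."""
    cols = []
    for col in df.columns:
        # normalize=True returns shares instead of counts; the largest share comes first
        top_share = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if top_share > threshold:
            cols.append(col)
    return cols

# Hypothetical toy data: "year" holds the same value in 99 of 100 rows
df = pd.DataFrame({
    "year": [2007] * 99 + [2008],
    "body_mass_g": list(range(3000, 3100)),
})

low_information_cols = find_low_information_cols(df)
df_reduced = df.drop(columns=low_information_cols)
```

The 95% threshold is the one used above; it is a parameter here because a suitable value depends on the dataset.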

Another simple method for removing non-informative features is to calculate the correlation between each feature and the target variable. If the absolute correlation does not reach a certain threshold (which you set manually), the feature can be removed. Note that correlation measures only linear dependence, so it is mainly useful for linear models. When working with non-linear dependencies, you can instead compare a model's cross-entropy with and without certain features.
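The correlation filter can be sketched as follows. The column names, the toy values, and the threshold of 0.1 are all assumptions for illustration; `feature_a` is built to correlate perfectly with the target and `feature_b` to have zero correlation with it.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [1, -1, 0, -1, 1],
    "target": [2, 4, 6, 8, 10],
})

TARGET = "target"
THRESHOLD = 0.1  # assumed cut-off; in practice it is set manually

# Absolute Pearson correlation of every feature with the target
corr = df.drop(columns=TARGET).corrwith(df[TARGET]).abs()

# Features whose correlation with the target stays below the threshold
weak_features = corr[corr < THRESHOLD].index.tolist()
df_selected = df.drop(columns=weak_features)
```

A constant column produces a NaN correlation and is therefore not flagged by this comparison, so it would need to be caught by the low-information check instead.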

In some cases, even features with small correlations can still provide useful information when combined with other features. Feature selection techniques such as forward/backward selection or regularization methods can be used to identify and select the most informative features for the model.

Task

Clean up the penguins.csv dataset using the 2 methods described above.
