Removing Outliers | Data Cleaning

Removing Outliers

Outliers are values that fall outside a defined acceptable interval. They can distort some metrics (such as the mean or the standard deviation) and model weights. Let's explore some methods for removing them.

There are several popular ways to define the allowable interval (the limits of acceptable values for a distribution):

  • Remove the values that form the first and the last 1% of the data (sorted in ascending order).
  • Keep the values that fall within the interval [q25 - 1.5*IQR; q75 + 1.5*IQR], where IQR = q75 - q25 is the interquartile range, and remove the rest.
  • Keep the values that fall within the interval [mean - std; mean + std] and remove the rest.
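The three rules above can be sketched side by side. This is a minimal illustration, not the course's own code: the function name `allowable_intervals` and the toy `sample` series are assumptions made for the example, and `Series.between` keeps the interval endpoints (the lesson's snippets use strict comparisons instead).

```python
import pandas as pd

def allowable_intervals(values: pd.Series) -> dict:
    """Return (low, high) bounds for the three rules listed above."""
    q01, q25, q75, q99 = values.quantile([0.01, 0.25, 0.75, 0.99])
    iqr = q75 - q25
    mean, std = values.mean(), values.std()
    return {
        "percentile": (q01, q99),                       # drop first and last 1%
        "iqr": (q25 - 1.5 * iqr, q75 + 1.5 * iqr),      # IQR rule
        "mean_std": (mean - std, mean + std),           # mean and std rule
    }

# Hypothetical toy sample with one extreme value
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
intervals = allowable_intervals(sample)

for rule, (low, high) in intervals.items():
    kept = sample[sample.between(low, high)]  # endpoints included
    print(f"{rule}: [{low:.2f}; {high:.2f}] keeps {kept.size} of {sample.size}")
```

On this toy sample the IQR and mean/std rules both discard only the value 100, while the percentile rule also discards the smallest value.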

Let's explore the data distribution for each numerical column (Age, SibSp, Parch, and Fare):

Note that the Age feature was cleaned using interpolation.

The Age distribution is close to normal, while the distributions of the other features look exponential.
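A rough numeric complement to reading histograms is sample skewness: values near 0 suggest a symmetric, bell-like shape, while clearly positive values suggest a long right tail. The sketch below uses synthetic data, not the lesson's dataset; the series names and distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a bell-shaped column and a right-skewed column
bell_like = pd.Series(rng.normal(loc=30, scale=13, size=10_000))
tail_heavy = pd.Series(rng.exponential(scale=30, size=10_000))

skew_bell = bell_like.skew()   # close to 0 for a symmetric distribution
skew_tail = tail_heavy.skew()  # clearly positive for a long right tail

print(f"bell-like skewness:  {skew_bell:.2f}")
print(f"tail-heavy skewness: {skew_tail:.2f}")
```

The same `.skew()` call could be applied to each real column to check the visual impression.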

Below is a demo of removing outliers from the Age data using two approaches.

Removing Outliers Using Mean and Std

Let's remove all the data outside the range [mean - std; mean + std] and check how the distribution changes by running the following code:

    ages = new_data['Age']
    mean, std = ages.mean(), ages.std()

    # Age values without outliers (strictly inside mean ± std)
    ages_wo = ages.loc[(ages > mean - std) & (ages < mean + std)]
    print('Removed data:', (1 - ages_wo.size/ages.size)*100, '%')

About 26.936% of the data is removed, which is quite a lot. On the plot, the orange area marks the outliers, and the blue area marks the remaining observations:

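This is expected: for a normal distribution, roughly 32% of values lie more than one standard deviation from the mean, so this interval is too narrow to isolate genuine outliers. Another approach is worth considering.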
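To see why a one-standard-deviation interval discards so much, the sketch below applies the same rule to a purely synthetic normal sample (an assumption for illustration, not the lesson's data). Even with no real outliers present, close to a third of the values fall outside the interval.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, outlier-free normal sample (hypothetical parameters)
values = pd.Series(rng.normal(loc=30, scale=13, size=100_000))

mean, std = values.mean(), values.std()
kept = values.loc[(values > mean - std) & (values < mean + std)]

# Share of perfectly ordinary values discarded by the mean ± std rule
removed_pct = (1 - kept.size / values.size) * 100
print(f"Removed data: {removed_pct:.1f} %")
```

Widening the interval, for example to two or three standard deviations, would discard far less, but that is a different rule from the one used in this lesson.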

Removing Outliers Using IQR

The following code removes all data outside the range [q25 - 1.5*IQR; q75 + 1.5*IQR]:

    ages = new_data['Age']
    q25, q50, q75 = ages.quantile(q=[0.25, 0.5, 0.75])
    iqr = q75 - q25

    # Age values without outliers (strictly inside the IQR fences)
    ages_wo = ages.loc[(ages > q25 - 1.5*iqr) & (ages < q75 + 1.5*iqr)]
    print('Removed data:', (1 - ages_wo.size/ages.size)*100, '%')

Now only about 1.571% of the data is removed, which looks reasonable. The resulting distribution is pictured below:
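The snippets in this lesson filter a single column. When the goal is to drop the corresponding rows from the whole table, the same boolean mask can be applied to the DataFrame. The sketch below assumes a hypothetical helper `iqr_bounds` and a small made-up DataFrame `df`; neither comes from the course.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (low, high) fences of the IQR rule."""
    q25, q75 = values.quantile([0.25, 0.75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Hypothetical toy table; the last row has an unusually high Age
df = pd.DataFrame({
    "Age":  [22, 25, 28, 30, 31, 35, 38, 40, 80],
    "Fare": [7.3, 8.1, 13.0, 26.0, 10.5, 52.0, 71.3, 15.5, 30.0],
})

low, high = iqr_bounds(df["Age"])
mask = (df["Age"] > low) & (df["Age"] < high)

# Keep whole rows whose Age lies inside the fences
df_wo = df.loc[mask]
print(f"Fences: [{low}; {high}], rows kept: {len(df_wo)} of {len(df)}")
```

Filtering rows this way keeps the other columns aligned with the cleaned Age values.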

Your goal is to apply these two approaches to the Fare column, determine the share of outliers, and visualize the result.

Task

For the Fare column:

  1. Find the share of outliers outside the range [mean - std; mean + std].
  2. Find the share of outliers outside the range [q25 - 1.5*IQR; q75 + 1.5*IQR]. Be careful to start from the original data, not from the data already filtered in the previous step.

The expected results are 8.193% and 13.019%, respectively.
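One possible shape for the task is sketched below. The helper `removed_share` and the `fares` values are hypothetical stand-ins, so the printed numbers will not match the expected results above; with the real data, `fares` would be `new_data['Fare']`. Both rules read from the same untouched series, which avoids the pitfall mentioned in step 2.

```python
import pandas as pd

def removed_share(values: pd.Series, low: float, high: float) -> float:
    """Percentage of values lying outside the open interval (low, high)."""
    kept = values[(values > low) & (values < high)]
    return (1 - kept.size / values.size) * 100

# Hypothetical stand-in for new_data['Fare']
fares = pd.Series([7.25, 7.9, 8.05, 10.5, 13.0, 15.5,
                   26.0, 30.0, 52.0, 83.1, 151.55, 512.33])

# Rule 1: mean and std, computed on the original series
mean, std = fares.mean(), fares.std()
share_std = removed_share(fares, mean - std, mean + std)

# Rule 2: IQR, again computed on the original series, not on filtered data
q25, q75 = fares.quantile([0.25, 0.75])
iqr = q75 - q25
share_iqr = removed_share(fares, q25 - 1.5 * iqr, q75 + 1.5 * iqr)

print(f"mean/std rule removes {share_std:.3f} %")
print(f"IQR rule removes {share_iqr:.3f} %")
```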

Section 2. Chapter 8