Data Preprocessing and Feature Engineering

Dealing with Duplicates and Outliers

When working with real-world datasets, you will often encounter duplicate records and outliers. Both can significantly impact your data analysis and the performance of your machine learning models. Duplicates can artificially inflate the importance of certain patterns, leading to biased results, while outliers can distort statistical summaries and model predictions. Properly identifying and handling these issues is a core part of data cleaning.

import pandas as pd
import seaborn as sns

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Find duplicate rows in the Titanic dataset
duplicates = df.duplicated()
print("Duplicate row indicators:")
print(duplicates.value_counts())  # Show how many duplicates exist

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nNumber of rows before removing duplicates:")
print(len(df))
print("Number of rows after removing duplicates:")
print(len(df_no_duplicates))
Definition

Outliers are data points that deviate significantly from the majority of a dataset. Common methods to detect outliers include visualizations (such as box plots), statistical measures (like Z-score), and the interquartile range (IQR) method.
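
For example, a box plot of the Titanic "fare" column makes extreme values visible at a glance. Here is a minimal sketch using seaborn and matplotlib (assuming a standard matplotlib setup for displaying figures):

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# A box plot shows the central 50% of values as a box
# and draws extreme values as individual points beyond the whiskers
sns.boxplot(x=df["fare"])
plt.title("Box plot of Titanic fares")
plt.show()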

Z-score and interquartile range (IQR) are two common statistical measures used to identify outliers in a dataset:

  • Z-score:
    • Measures how many standard deviations a data point is from the mean;
    • A Z-score is calculated using the formula: (value - mean) / standard deviation;
    • Data points with Z-scores greater than 3 or less than -3 are often considered outliers, as they are far from the average value.
  • Interquartile Range (IQR):
    • Represents the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile);
    • The IQR is calculated as Q3 - Q1;
    • Outliers are typically defined as data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, which means they fall outside the typical spread of the central 50% of the data.

Both methods help you measure how far values deviate from the expected range. Z-score focuses on distance from the mean, while IQR identifies values outside the central portion of the dataset.
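
As a quick illustration of the Z-score approach, here is a minimal sketch applied to the Titanic "fare" column. The ±3 cutoff is the conventional threshold mentioned above, not a strict rule, and the mean_fare, std_fare, and fare_zscore names are introduced here purely for this example:

import seaborn as sns

# Load the Titanic dataset and drop missing 'fare' values
df = sns.load_dataset("titanic").dropna(subset=["fare"])

# Compute the Z-score for each fare: (value - mean) / standard deviation
mean_fare = df["fare"].mean()
std_fare = df["fare"].std()
df["fare_zscore"] = (df["fare"] - mean_fare) / std_fare  # illustrative column name

# Flag values more than 3 standard deviations from the mean
z_outliers = df[df["fare_zscore"].abs() > 3]
print("Outliers detected in 'fare' using Z-score method:")
print(z_outliers[["fare", "fare_zscore"]])

The next example applies the IQR method to the same column: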

import seaborn as sns
import pandas as pd

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Drop rows with missing 'fare' values
df_fare = df.dropna(subset=["fare"])

# Calculate Q1 and Q3 for the 'fare' column
Q1 = df_fare["fare"].quantile(0.25)
Q3 = df_fare["fare"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers in 'fare'
outliers = df_fare[(df_fare["fare"] < lower_bound) | (df_fare["fare"] > upper_bound)]

print("Outliers detected in 'fare' using IQR method:")
print(outliers[["fare"]])
Note

When handling outliers, you can choose to remove them or transform them (for example, by capping extreme values or applying a log transformation). The best approach depends on your dataset and the goals of your analysis.
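
To illustrate both options, here is a minimal sketch on the Titanic "fare" column. It reuses the IQR bounds from the example above; the fare_capped and fare_log column names are illustrative only, and log1p is chosen so zero fares do not produce negative infinity:

import numpy as np
import seaborn as sns

# Load the Titanic dataset and drop missing 'fare' values
df = sns.load_dataset("titanic").dropna(subset=["fare"])

# Recompute the IQR bounds from the previous example
Q1 = df["fare"].quantile(0.25)
Q3 = df["fare"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Option 1: cap extreme values at the IQR bounds (winsorizing)
df["fare_capped"] = df["fare"].clip(lower=lower_bound, upper=upper_bound)

# Option 2: apply a log transformation to compress large values
# log1p computes log(1 + x), which handles zero fares safely
df["fare_log"] = np.log1p(df["fare"])

print(df[["fare", "fare_capped", "fare_log"]].describe())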
