Dealing with Duplicates and Outliers
When working with real-world datasets, you will often encounter duplicate records and outliers. Both can significantly impact your data analysis and the performance of your machine learning models. Duplicates can artificially inflate the importance of certain patterns, leading to biased results, while outliers can distort statistical summaries and model predictions. Properly identifying and handling these issues is a core part of data cleaning.
```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Find duplicate rows in the Titanic dataset
duplicates = df.duplicated()
print("Duplicate row indicators:")
print(duplicates.value_counts())  # Show how many duplicates exist

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nNumber of rows before removing duplicates:")
print(len(df))
print("Number of rows after removing duplicates:")
print(len(df_no_duplicates))
```
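Sometimes a row only counts as a duplicate when certain columns match. Both `duplicated()` and `drop_duplicates()` accept a `subset` argument for this. The sketch below is a minimal illustration; the columns chosen (`pclass`, `sex`, `embarked`) are arbitrary examples, not a recommendation for this dataset.

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Treat rows as duplicates only when these columns match
# (columns chosen purely for illustration)
subset_cols = ["pclass", "sex", "embarked"]

# keep="first" retains the first occurrence of each combination
df_unique = df.drop_duplicates(subset=subset_cols, keep="first")

print("Rows before:", len(df))
print("Rows after de-duplicating on", subset_cols, ":", len(df_unique))
```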
Outliers are data points that deviate significantly from the majority of a dataset. Common methods to detect outliers include visualizations (such as box plots), statistical measures (like Z-score), and the interquartile range (IQR) method.
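A box plot is the quickest visual check: the box spans the central 50% of the data and points beyond the whiskers are flagged as potential outliers. A minimal sketch, assuming seaborn and matplotlib are available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("titanic")

# Box plot of 'fare': the box covers the IQR, points beyond the
# whiskers are drawn individually as potential outliers
sns.boxplot(x=df["fare"])
plt.title("Distribution of 'fare' with potential outliers")
plt.show()
```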
Z-score and interquartile range (IQR) are two common statistical measures used to identify outliers in a dataset:
- Z-score:
  - Measures how many standard deviations a data point is from the mean;
  - A Z-score is calculated using the formula `(value - mean) / standard deviation`;
  - Data points with Z-scores greater than 3 or less than -3 are often considered outliers, as they are far from the average value.
- Interquartile Range (IQR):
  - Represents the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile);
  - The IQR is calculated as `Q3 - Q1`;
  - Outliers are typically defined as data points below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`, which means they fall outside the typical spread of the central 50% of the data.
Both methods help you measure how far values deviate from the expected range. Z-score focuses on distance from the mean, while IQR identifies values outside the central portion of the dataset.
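As a counterpart to the IQR example below, here is a minimal sketch of Z-score-based detection on the `fare` column, computing the score directly from the mean and standard deviation:

```python
import seaborn as sns

df = sns.load_dataset("titanic")
fare = df["fare"].dropna()

# Z-score: how many standard deviations each value is from the mean
z_scores = (fare - fare.mean()) / fare.std()

# Flag values more than 3 standard deviations from the mean
outliers = fare[z_scores.abs() > 3]
print("Number of 'fare' outliers by Z-score:", len(outliers))
print(outliers.head())
```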
```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Drop rows with missing 'fare' values
df_fare = df.dropna(subset=["fare"])

# Calculate Q1 and Q3 for the 'fare' column
Q1 = df_fare["fare"].quantile(0.25)
Q3 = df_fare["fare"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers in 'fare'
outliers = df_fare[(df_fare["fare"] < lower_bound) | (df_fare["fare"] > upper_bound)]

print("Outliers detected in 'fare' using IQR method:")
print(outliers[["fare"]])
```
When handling outliers, you can choose to remove them or transform them (for example, by capping extreme values or applying a log transformation). The best approach depends on your dataset and the goals of your analysis.
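As a rough illustration of those two options (a sketch, not a prescribed recipe), the code below caps `fare` at the IQR bounds with pandas' `clip` and also applies a log transformation with NumPy's `log1p`:

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("titanic").dropna(subset=["fare"])

# IQR bounds for 'fare' (same definition as above)
Q1 = df["fare"].quantile(0.25)
Q3 = df["fare"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: cap (winsorize) extreme values at the IQR bounds
df["fare_capped"] = df["fare"].clip(lower=lower, upper=upper)

# Option 2: compress the long right tail with a log transform
# (log1p handles zero fares safely)
df["fare_log"] = np.log1p(df["fare"])

print(df[["fare", "fare_capped", "fare_log"]].describe())
```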