Dealing with Duplicates and Outliers
When working with real-world datasets, you will often encounter duplicate records and outliers. Both can significantly impact your data analysis and the performance of your machine learning models. Duplicates can artificially inflate the importance of certain patterns, leading to biased results, while outliers can distort statistical summaries and model predictions. Properly identifying and handling these issues is a core part of data cleaning.
```python
import pandas as pd
import seaborn as sns

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Find duplicate rows in the Titanic dataset
duplicates = df.duplicated()
print("Duplicate row indicators:")
print(duplicates.value_counts())  # Show how many duplicates exist

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nNumber of rows before removing duplicates:")
print(len(df))
print("Number of rows after removing duplicates:")
print(len(df_no_duplicates))
```
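Sometimes a row only counts as a duplicate when certain columns match. Both `duplicated()` and `drop_duplicates()` accept a `subset` argument for this. The sketch below is a minimal illustration; the columns chosen (`pclass`, `sex`, `embarked`) are arbitrary examples, not a recommendation for this dataset.

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Treat rows as duplicates only when these columns match
# (columns chosen purely for illustration)
subset_cols = ["pclass", "sex", "embarked"]

# keep="first" retains the first occurrence of each combination
df_unique = df.drop_duplicates(subset=subset_cols, keep="first")

print("Rows before:", len(df))
print("Rows after de-duplicating on", subset_cols, ":", len(df_unique))
```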
Outliers are data points that deviate significantly from the majority of a dataset. Common methods to detect outliers include visualizations (such as box plots), statistical measures (like Z-score), and the interquartile range (IQR) method.
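A box plot is the quickest visual check: the box spans the central 50% of the data and points beyond the whiskers are flagged as potential outliers. A minimal sketch, assuming seaborn and matplotlib are available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("titanic")

# Box plot of 'fare': the box covers the IQR, points beyond the
# whiskers are drawn individually as potential outliers
sns.boxplot(x=df["fare"])
plt.title("Distribution of 'fare' with potential outliers")
plt.show()
```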
Z-score and interquartile range (IQR) are two common statistical measures used to identify outliers in a dataset:
- Z-score:
  - Measures how many standard deviations a data point is from the mean;
  - A Z-score is calculated using the formula `(value - mean) / standard deviation`;
  - Data points with Z-scores greater than 3 or less than -3 are often considered outliers, as they are far from the average value.
- Interquartile Range (IQR):
  - Represents the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile);
  - The IQR is calculated as `Q3 - Q1`;
  - Outliers are typically defined as data points below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`, which means they fall outside the typical spread of the central 50% of the data.
Both methods help you measure how far values deviate from the expected range. Z-score focuses on distance from the mean, while IQR identifies values outside the central portion of the dataset.
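As a counterpart to the IQR example below, here is a minimal sketch of Z-score-based detection on the `fare` column, computing the score directly from the mean and standard deviation:

```python
import seaborn as sns

df = sns.load_dataset("titanic")
fare = df["fare"].dropna()

# Z-score: how many standard deviations each value is from the mean
z_scores = (fare - fare.mean()) / fare.std()

# Flag values more than 3 standard deviations from the mean
outliers = fare[z_scores.abs() > 3]
print("Number of 'fare' outliers by Z-score:", len(outliers))
print(outliers.head())
```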
```python
import seaborn as sns
import pandas as pd

# Load the Titanic dataset from seaborn
df = sns.load_dataset("titanic")

# Drop rows with missing 'fare' values
df_fare = df.dropna(subset=["fare"])

# Calculate Q1 and Q3 for the 'fare' column
Q1 = df_fare["fare"].quantile(0.25)
Q3 = df_fare["fare"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers in 'fare'
outliers = df_fare[(df_fare["fare"] < lower_bound) | (df_fare["fare"] > upper_bound)]

print("Outliers detected in 'fare' using IQR method:")
print(outliers[["fare"]])
```
When handling outliers, you can choose to remove them or transform them (for example, by capping extreme values or applying a log transformation). The best approach depends on your dataset and the goals of your analysis.
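As a rough illustration of those two options (a sketch, not a prescribed recipe), the code below caps `fare` at the IQR bounds with pandas' `clip` and also applies a log transformation with NumPy's `log1p`:

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("titanic").dropna(subset=["fare"])

# IQR bounds for 'fare' (same definition as above)
Q1 = df["fare"].quantile(0.25)
Q3 = df["fare"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: cap (winsorize) extreme values at the IQR bounds
df["fare_capped"] = df["fare"].clip(lower=lower, upper=upper)

# Option 2: compress the long right tail with a log transform
# (log1p handles zero fares safely)
df["fare_log"] = np.log1p(df["fare"])

print(df[["fare", "fare_capped", "fare_log"]].describe())
```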