Removing Outliers Using Z-Score Method
Outliers can heavily influence statistical analyses and models. One common method for detecting and removing them is the Z-score method, which measures how far a data point lies from the mean in units of standard deviations. A data point whose Z-score falls beyond a chosen threshold (commonly ±3) is treated as an outlier.
What Is a Z-Score?
A Z-score (also known as a standard score) is calculated using the formula:
$$Z = \frac{X - \mu}{\sigma}$$
Where:
- X: the original data point;
- μ: the mean of the dataset;
- σ: the standard deviation of the dataset.
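As a quick check of the formula, here is a minimal sketch on three made-up CGPA values (the vector is hypothetical and not part of the lesson's dataset). Note that R's sd() computes the sample standard deviation, which divides by n - 1.

```r
# Hypothetical CGPA values, invented for illustration only
cgpa <- c(2.0, 3.0, 4.0)

mu <- mean(cgpa)     # 3
sigma <- sd(cgpa)    # 1 (sample standard deviation, n - 1 denominator)

# Apply Z = (X - mu) / sigma to every value at once
z <- (cgpa - mu) / sigma
print(z)             # -1  0  1
```

Each Z-score reads as "this many standard deviations below or above the mean".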
Calculating Z-Scores for CGPA
# Step 1: Calculate mean and standard deviation
mean_cgpa <- mean(df$cgpa)
sd_cgpa <- sd(df$cgpa)
# Step 2: Calculate Z-scores manually
df$cgpa_zscore <- (df$cgpa - mean_cgpa) / sd_cgpa
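# OR use the built-in function; scale() returns a one-column matrix,
# so as.numeric() converts the result back to a plain numeric vector
df$cgpa_zscore <- as.numeric(scale(df$cgpa))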
head(df$cgpa_zscore) # View first few Z-scores
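The two approaches above should give the same numbers. The sketch below checks this on a small hypothetical data frame (the values stand in for the lesson's `df` and are invented). It also shows that scale() returns a matrix rather than a plain vector.

```r
# Hypothetical data frame standing in for the lesson's `df`
df <- data.frame(cgpa = c(2.8, 3.1, 3.4, 3.6, 3.9))

# Manual calculation and built-in function side by side
z_manual <- (df$cgpa - mean(df$cgpa)) / sd(df$cgpa)
z_scale <- scale(df$cgpa)

is.matrix(z_scale)                        # TRUE: scale() returns a one-column matrix
all.equal(z_manual, as.numeric(z_scale))  # TRUE: same values once flattened
```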
Identifying Outliers
thresh_hold <- 3 # Common threshold for Z-score outliers
# Select the rows whose Z-score lies beyond the threshold in either direction
outliers <- df[abs(df$cgpa_zscore) > thresh_hold, ]
print(outliers) # View outlier rows
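The threshold is a choice, and a lower value flags more points. A rough sketch of that effect, using hypothetical CGPA values (20 invented numbers, two of them unusually low):

```r
# Hypothetical CGPA values, invented for illustration only
cgpa <- c(seq(3.0, 3.9, by = 0.1), seq(3.1, 3.8, by = 0.1), 1.2, 0.2)
z <- (cgpa - mean(cgpa)) / sd(cgpa)

# Count how many points are flagged at several candidate thresholds
thresholds <- c(2, 2.5, 3)
flagged <- sapply(thresholds, function(t) sum(abs(z) > t))
names(flagged) <- thresholds
print(flagged)
# A threshold of 2 flags both low values (1.2 and 0.2);
# thresholds of 2.5 and 3 flag only 0.2
```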
Creating an Outlier-Free Dataset
# Keep every row that was not flagged as an outlier (|Z| <= threshold)
df2 <- df[abs(df$cgpa_zscore) <= thresh_hold, ]
View(df2) # View cleaned data
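The steps above can be wrapped into one reusable function. This is only a sketch: the name `remove_zscore_outliers`, its arguments, and the sample data are hypothetical. It also assumes that rows with a missing value should be kept rather than dropped.

```r
# Hypothetical helper: drop rows whose Z-score in `column` exceeds `threshold`
remove_zscore_outliers <- function(data, column, threshold = 3) {
  x <- data[[column]]
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  # Keep rows within the threshold; rows with a missing value are kept as well
  keep <- is.na(z) | abs(z) <= threshold
  data[keep, , drop = FALSE]
}

# Hypothetical data: mostly between 3.0 and 3.9, with two low values
df <- data.frame(
  id = 1:20,
  cgpa = c(seq(3.0, 3.9, by = 0.1), seq(3.1, 3.8, by = 0.1), 1.2, 0.2)
)

df2 <- remove_zscore_outliers(df, "cgpa")
nrow(df)   # 20
nrow(df2)  # 19: only the row with cgpa = 0.2 is removed
```

Passing a different `threshold` (for example 2) makes the filter stricter without changing the rest of the code.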