Removing Outliers Using Z-Score Method
Outliers can heavily influence statistical analyses and models. One common method for detecting and removing them is the Z-score method, which measures how far a data point lies from the mean in units of standard deviation. A data point whose Z-score falls beyond a chosen threshold (commonly ±3) is treated as an outlier.
What Is a Z-Score?
A Z-score (also known as a standard score) is calculated using the formula:
$$Z = \frac{X - \mu}{\sigma}$$
Where:
- X: the original data point;
- μ: the mean of the dataset;
- σ: the standard deviation of the dataset.
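To make the formula concrete, here is a minimal sketch that applies it to a single value. The CGPA numbers are invented for illustration and are not taken from the lesson's dataset; the names `mu`, `sigma`, `x`, and `z` are likewise assumptions chosen to mirror the symbols above.

```r
# Hypothetical CGPA values, invented for illustration only
cgpa <- c(6, 7, 8, 9, 10)

mu <- mean(cgpa)    # mean of the values: 8
sigma <- sd(cgpa)   # sd() uses the sample formula (n - 1 denominator)

# Z-score of one observation X
x <- 10
z <- (x - mu) / sigma
print(z)            # roughly 1.26 standard deviations above the mean
```

Note that `sd()` in R computes the sample standard deviation, so the result differs slightly from a population-based calculation on small datasets.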
Calculating Z-Scores for CGPA
# Step 1: Calculate mean and standard deviation
mean_cgpa <- mean(df$cgpa)
sd_cgpa <- sd(df$cgpa)
# Step 2: Calculate Z-scores manually
df$cgpa_zscore <- (df$cgpa - mean_cgpa) / sd_cgpa
# OR use the built-in function
df$cgpa_zscore <- as.numeric(scale(df$cgpa)) # scale() returns a one-column matrix; as.numeric() keeps the column a plain vector
head(df$cgpa_zscore) # View first few Z-scores
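The two approaches above should give the same numbers. A quick way to check this is sketched below, using a small stand-in `df` with made-up values (an assumption, since the lesson's dataset is not shown here).

```r
# Hypothetical stand-in for the lesson's df; values invented for illustration
df <- data.frame(cgpa = c(6.5, 7.2, 8.1, 7.8, 9.0, 5.9))

z_manual <- (df$cgpa - mean(df$cgpa)) / sd(df$cgpa)
z_scale <- scale(df$cgpa)

# scale() returns a one-column matrix that carries the centering and
# scaling values as attributes
is.matrix(z_scale)
attr(z_scale, "scaled:center")   # same value as mean(df$cgpa)
attr(z_scale, "scaled:scale")    # same value as sd(df$cgpa)

# Converted to a plain vector, both approaches agree
all.equal(z_manual, as.numeric(z_scale))
```

Because `scale()` returns a matrix rather than a vector, converting the result with `as.numeric()` before storing it in a data frame column tends to avoid surprises in later steps.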
Identifying Outliers
thresh_hold <- 3 # Common threshold for Z-score outliers
# Select the rows flagged as outliers (Z-score beyond the threshold in either direction)
outliers <- df[abs(df$cgpa_zscore) > thresh_hold, ]
print(outliers) # View outlier rows
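The detection step can be tried end to end on a small synthetic dataset. The sketch below assumes a made-up `df` with 30 typical CGPA values and one planted extreme value; the helper variable `is_outlier` is introduced here for illustration only.

```r
# Hypothetical data: 30 typical CGPA values plus one planted extreme value
df <- data.frame(cgpa = c(rep(c(7.0, 7.5, 8.0), times = 10), 2.0))
df$cgpa_zscore <- as.numeric(scale(df$cgpa))

thresh_hold <- 3
is_outlier <- abs(df$cgpa_zscore) > thresh_hold   # TRUE for flagged rows

outliers <- df[is_outlier, ]
print(outliers)    # only the planted 2.0 row is flagged
sum(is_outlier)    # number of flagged rows
```

One caveat worth keeping in mind: with the sample standard deviation, the largest possible absolute Z-score in a dataset of n values is (n - 1) / sqrt(n), so a threshold of 3 cannot flag anything when there are 10 or fewer observations.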
Creating an Outlier-Free Dataset
df2 <- df[abs(df$cgpa_zscore) <= thresh_hold, ] # Keep every row not flagged as an outlier, including Z-scores exactly at the threshold
View(df2) # View cleaned data
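The whole workflow can also be wrapped in a small function. The sketch below is one possible way to do it, not part of the lesson: `remove_z_outliers` and its parameters are hypothetical names, the data is invented, and the function assumes the column has no missing values.

```r
# Hypothetical helper (not from the lesson): drop rows whose Z-score in
# `column` exceeds `threshold` in absolute value.
# Assumes the column contains no missing values.
remove_z_outliers <- function(data, column, threshold = 3) {
  z <- as.numeric(scale(data[[column]]))
  data[abs(z) <= threshold, , drop = FALSE]
}

# Invented data: 30 typical CGPA values plus one extreme value
df <- data.frame(cgpa = c(rep(c(7.0, 7.5, 8.0), times = 10), 2.0))
df2 <- remove_z_outliers(df, "cgpa")

nrow(df) - nrow(df2)   # number of rows removed
range(df2$cgpa)        # range of the remaining values
```

Comparing row counts before and after is a simple sanity check that the filter removed only the rows it was expected to remove.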