Removing Outliers Using the Z-Score Method
Outliers can heavily influence statistical analyses and models. One common method for detecting and removing them is the Z-score method. This technique measures how far a data point lies from the mean, expressed in standard deviations. If the absolute value of a point's Z-score exceeds a chosen threshold (commonly 3), the point is treated as an outlier.
What Is a Z-Score?
A Z-score (also known as a standard score) is calculated using the formula:
$$Z = \frac{X - \mu}{\sigma}$$
Where:
- X: the original data point;
- μ: the mean of the dataset;
- σ: the standard deviation of the dataset.
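As a minimal worked sketch of the formula, assume a small made-up vector of CGPA values. The name `cgpa_example` and its contents are hypothetical and are not part of the dataset used in this chapter.

```r
# Hypothetical toy data, for illustration only
cgpa_example <- c(6.0, 7.0, 8.0)

mu <- mean(cgpa_example)     # mean of the toy data: 7
sigma <- sd(cgpa_example)    # sample standard deviation (n - 1 denominator): 1

# Apply Z = (X - mu) / sigma to every value at once
z_example <- (cgpa_example - mu) / sigma
print(z_example)             # -1, 0, 1
```

Each value sits exactly one standard deviation away from its neighbor, so the Z-scores come out as -1, 0, and 1.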
Calculating Z-Scores for CGPA
# Step 1: Calculate mean and standard deviation
mean_cgpa <- mean(df$cgpa)
sd_cgpa <- sd(df$cgpa)
# Step 2: Calculate Z-scores manually
df$cgpa_zscore <- (df$cgpa - mean_cgpa) / sd_cgpa
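# OR use the built-in scale() function; it returns a one-column matrix,
# so as.numeric() converts the result to a plain vector
df$cgpa_zscore <- as.numeric(scale(df$cgpa))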
head(df$cgpa_zscore) # View first few Z-scores
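A quick way to convince yourself that the two approaches agree is to run both on a small data frame and compare the results. The sketch below assumes a hypothetical data frame `toy` with made-up values. It also shows that `scale()` returns a matrix rather than a plain vector.

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(cgpa = c(5.5, 6.8, 7.2, 8.1, 9.0))

# Manual calculation
z_manual <- (toy$cgpa - mean(toy$cgpa)) / sd(toy$cgpa)

# Built-in calculation; by default scale() centers on the mean
# and divides by the sample standard deviation
z_scaled <- scale(toy$cgpa)

is.matrix(z_scaled)                         # TRUE: a one-column matrix
all.equal(z_manual, as.numeric(z_scaled))   # TRUE: same values
```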
Identifying Outliers
thresh_hold <- 3 # Common threshold for Z-score outliers
# Select the rows whose Z-score lies beyond the threshold in either direction
outliers <- df[abs(df$cgpa_zscore) > thresh_hold, ]
print(outliers) # View outlier rows
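The threshold is a judgment call, and lowering it flags more rows. The sketch below illustrates this on a hypothetical data frame `toy` containing 30 typical values and two made-up low values; none of these numbers come from the chapter's dataset.

```r
# Hypothetical toy data: 30 typical CGPA values plus two unusually low ones
toy <- data.frame(cgpa = c(rep(c(6.5, 7.0, 7.5), times = 10), 1.0, 3.5))
toy$cgpa_zscore <- (toy$cgpa - mean(toy$cgpa)) / sd(toy$cgpa)

# Count how many rows are flagged at each candidate threshold
thresholds <- c(2, 3)
outlier_counts <- sapply(thresholds, function(t) sum(abs(toy$cgpa_zscore) > t))
names(outlier_counts) <- paste0("threshold_", thresholds)
print(outlier_counts)
```

With this toy data, a threshold of 2 flags both low values, while a threshold of 3 flags only the most extreme one. Note that in very small samples a Z-score can never get large, so a threshold of 3 may flag nothing at all.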
Creating an Outlier-Free Dataset
# Keep every row that is not an outlier, including Z-scores exactly at the threshold
df2 <- df[abs(df$cgpa_zscore) <= thresh_hold, ]
View(df2) # View cleaned data
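If the same cleaning step is needed for several columns, it can be wrapped in a small function. The sketch below is one possible way to do that, not a standard R function: `remove_zscore_outliers`, its arguments, and the `toy` data frame are all hypothetical names introduced for illustration.

```r
# Hypothetical helper: drop rows whose Z-score in `column` exceeds `threshold`
remove_zscore_outliers <- function(data, column, threshold = 3) {
  values <- data[[column]]
  z <- (values - mean(values, na.rm = TRUE)) / sd(values, na.rm = TRUE)
  # Keep rows within the threshold; rows with a missing value are dropped too
  keep <- !is.na(z) & abs(z) <= threshold
  data[keep, , drop = FALSE]
}

# Hypothetical toy data: 30 typical CGPA values plus two unusually low ones
toy <- data.frame(cgpa = c(rep(c(6.5, 7.0, 7.5), times = 10), 1.0, 3.5))
cleaned <- remove_zscore_outliers(toy, "cgpa")

nrow(toy)       # 32 rows before cleaning
nrow(cleaned)   # 31 rows after cleaning: only the value 1.0 is removed
```

The Z-scores here are computed once, on the full data. Recomputing them on the cleaned data would give different values, because removing an outlier changes both the mean and the standard deviation.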