Removing Outliers | Processing Quantitative Data

Removing Outliers

Outliers are data points that differ significantly from the other data points in a dataset. They can arise from measurement errors, data entry errors, or other factors, and they are important to deal with because they can significantly affect the results of data analysis.

Outliers can affect statistical analysis, machine learning models, and data visualization. They can distort summary statistics such as the mean and standard deviation, bias machine learning models, and stretch the scale of a plot so that the rest of the data is hard to read. Removing outliers can improve the accuracy and reliability of the analysis and make the results easier to interpret.

There are several ways to detect and remove outliers in Python. One common technique is the Z-score method:

```python
import numpy as np

# Generate a small dataset
dataset = np.random.normal(0, 1, 1000)

# Calculate the Z-scores
z_scores = (dataset - np.mean(dataset)) / np.std(dataset)

# Find the indices of the outliers
outlier_indices = np.where(np.abs(z_scores) > 3)[0]

# Print outliers
print('Outliers are: ', dataset[outlier_indices])

# Remove the outliers
filtered_data = np.delete(dataset, outlier_indices)
```

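In this example, we first generate some sample data using the np.random.normal() function. We then calculate the Z-scores for the data by subtracting the mean and dividing by the standard deviation. We define outliers as any data point whose absolute Z-score is greater than 3 (a common threshold for identifying outliers). We find the indices of these outliers using the np.where() function and then remove them from the original data using the np.delete() function, which returns a new array and leaves the original unchanged. Note that for normally distributed data about 0.27% of values lie more than three standard deviations from the mean, so roughly 3 of the 1000 samples are expected to be flagged even though nothing is wrong with them.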

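Note that this method works best when the data is approximately normally distributed (Gaussian). The mean and standard deviation are themselves pulled toward extreme values, so on skewed or heavy-tailed data the Z-score can miss genuine outliers. If your data has a non-symmetric distribution, you can use a modified Z-score instead. The modified Z-score is calculated as the difference between a data point and the median, divided by the median absolute deviation (MAD).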
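The modified Z-score described above can be sketched as follows. This is a minimal illustration, not part of the original lesson: the function name modified_z_scores and the sample values are hypothetical. The scaling constant 0.6745 and the cutoff of 3.5 are the conventional choices for this score and are assumptions here, since the text above does not specify them.

```python
import numpy as np

def modified_z_scores(data):
    """Return modified Z-scores based on the median and the MAD."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    # Median absolute deviation: the median distance from the median
    mad = np.median(np.abs(data - median))
    if mad == 0:
        # Assumption: when at least half the values are identical the score
        # is undefined, so this sketch flags nothing in that case
        return np.zeros_like(data)
    # 0.6745 makes the score comparable to a standard Z-score for normal data
    return 0.6745 * (data - median) / mad

# Hypothetical skewed sample with one extreme value
skewed = np.array([1.0, 1.2, 1.1, 1.3, 1.0, 1.4, 1.2, 1.1, 1.3, 25.0])

scores = modified_z_scores(skewed)
outlier_mask = np.abs(scores) > 3.5  # conventional cutoff, assumed here
filtered = skewed[~outlier_mask]

print('Outliers are: ', skewed[outlier_mask])
```

Because the median and the MAD barely move when a single extreme value is added, the extreme point stands out clearly here, whereas it would inflate the mean and standard deviation used by the ordinary Z-score.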

It is also important to remember that not all outliers need to be removed. Outliers can be a natural part of the data and provide important information about the underlying process or phenomenon being studied.

In some cases, outliers may represent rare or extreme events that are important to capture in the analysis. For example, in medical research, outliers in patient data may represent rare but important cases that need to be studied separately.
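When extreme cases are worth studying on their own, one option is to set them aside rather than delete them. The sketch below is a minimal illustration, assuming the same Z-score rule with a threshold of 3; the function name split_outliers and the sample readings are hypothetical.

```python
import numpy as np

def split_outliers(data, threshold=3.0):
    """Split data into (typical, extreme) arrays using absolute Z-scores."""
    data = np.asarray(data, dtype=float)
    std = data.std()
    if std == 0:
        # All values are identical, so nothing can be flagged
        return data, data[:0]
    z_scores = (data - data.mean()) / std
    mask = np.abs(z_scores) > threshold
    # Keep both groups so the extreme cases can be analysed separately
    return data[~mask], data[mask]

# Hypothetical readings: 20 typical values and one extreme case
readings = np.array([10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9,
                     10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1,
                     100.0])

typical, extreme = split_outliers(readings)
print('Typical values kept:', typical.size)
print('Extreme cases set aside:', extreme)
```

Returning both arrays leaves the decision about what to do with the extreme cases to a later step, instead of discarding them at the point of detection.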

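Furthermore, an extreme value can stem from a measurement error or simply from random fluctuation in the data, and a score alone cannot tell the two apart. Removing every flagged point may therefore be neither necessary nor appropriate, and it is worth checking where an outlier comes from before discarding it.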


Section 2. Chapter 3