Data Cleaning

Data cleaning in time series processing removes anomalies, errors, and incomplete or irrelevant data. It is an important preprocessing step to ensure analysis quality and forecast accuracy.

The main methods of data cleaning are:

Imputation

Imputation - filling missing values using the mean, median, interpolation, or time series methods (e.g., extrapolation).

The window size (the span over which the mean or median is taken) is usually set between 2 and 10-15 points. In practice, the choice is made by visually assessing how well the dataset is recovered. Mean imputation is generally not recommended for time series data because it can introduce bias and distort the underlying patterns in the data. Interpolation, regression, or more sophisticated time-series-specific methods are therefore often preferred for handling missing values in time series data.
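A window-based fill can be sketched as follows. This is a minimal illustration, assuming a centered rolling median; the series values and the window size of 5 are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative series with two gaps (values are made up for the example)
series = pd.Series(
    [10.0, 11.0, np.nan, 13.0, 14.0, 15.0, np.nan, 17.0, 18.0],
    index=pd.date_range("2022-01-01", periods=9, freq="D"),
)

window = 5  # hypothetical choice from the 2 to 10-15 range

# Centered rolling median; min_periods=1 lets a window with NaNs still produce a value
local_median = series.rolling(window=window, center=True, min_periods=1).median()

# Replace only the missing points and keep observed values untouched
filled = series.fillna(local_median)
print(filled)
```

Trying a few window sizes and plotting the filled series against the original is one way to make the visual assessment mentioned above.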

Extrapolation may be appropriate if the missing values occur at the end of a time series and the pattern or trend of the series is relatively stable. In short, it is useful when the time series exhibits a clear trend or pattern that can be continued beyond the observed values. Interpolation, by contrast, fills gaps that lie between observed values.
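Continuing a stable trend past the last observed value might look like the sketch below. It assumes a purely linear trend and uses `np.polyfit` on the observed points; the data are invented for illustration, and a real series may need a different model.

```python
import numpy as np
import pandas as pd

# Series with a stable linear trend and two missing values at the end
series = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0, np.nan, np.nan])

# Fit a straight line to the observed points, using positions as the x-axis
observed = series.dropna()
slope, intercept = np.polyfit(
    observed.index.to_numpy(dtype=float), observed.to_numpy(), deg=1
)

# Continue the fitted trend over the missing positions only
missing_pos = series.index[series.isna()]
extrapolated = series.copy()
extrapolated.loc[missing_pos] = slope * missing_pos.to_numpy(dtype=float) + intercept
print(extrapolated)
```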

Outlier removal

Outlier removal - identifying and removing anomalous values that may distort analysis using statistical methods (e.g., IQR, z-score).
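Both rules can be sketched in a few lines of pandas. The data, the 1.5 multiplier for IQR, and the z-score threshold of 2.5 are assumptions chosen for illustration, not values prescribed by the lesson.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10, 12, 13], dtype=float)

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# z-score rule: flag points far from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()
z_mask = z_scores.abs() > 2.5  # hypothetical threshold

# Keep only the points that the IQR rule did not flag
cleaned = values[~iqr_mask]
print(cleaned)
```

With very short series the largest possible z-score is bounded by the sample size, so a common threshold such as 3 may never trigger; the IQR rule does not have this limitation.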

For non-stationary data, we can use the following procedure:

If you are working with homoscedastic data, manually set a limit L and filter out every value x_val for which |x_val - x_mean| > L, where x_mean is the average calculated over the moving window;
If you are working with heteroscedastic data, first transform the data with a function such as the Box-Cox transformation, which can stabilize the variance and make the data more homoscedastic. Then apply the first step to the transformed data.
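The two steps can be sketched as follows. The helper name `filter_by_rolling_mean`, the window size, and both limits are hypothetical. The sketch also assumes a log transform, which is the Box-Cox transformation with lambda = 0, rather than a fitted Box-Cox parameter.

```python
import numpy as np
import pandas as pd

def filter_by_rolling_mean(series, window, limit):
    """Drop points whose distance from the centered rolling mean exceeds `limit`."""
    rolling_mean = series.rolling(window=window, center=True, min_periods=1).mean()
    deviation = (series - rolling_mean).abs()
    return series[deviation <= limit]

# Homoscedastic case: a trending series with one spike
series = pd.Series([1.0, 2.0, 3.0, 4.0, 30.0, 6.0, 7.0, 8.0, 9.0])
cleaned = filter_by_rolling_mean(series, window=5, limit=10.0)

# Heteroscedastic case: the spread grows with the level, so transform first.
# np.log is the Box-Cox transform with lambda = 0 (requires positive values).
growing = pd.Series([1.0, 2.0, 4.0, 8.0, 160.0, 32.0, 64.0, 128.0, 256.0])
log_cleaned = filter_by_rolling_mean(np.log(growing), window=5, limit=1.5)

# Map the surviving positions back to the original scale
cleaned_growing = growing.loc[log_cleaned.index]
print(cleaned)
print(cleaned_growing)
```

Because the spike itself pulls the rolling mean upward, a rolling median may be a more robust reference in practice.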

A time series dataset is said to be homoscedastic when the variance of the errors or residuals is constant and does not change with respect to time. One way to check for homoscedasticity is to perform a statistical test, such as the Breusch-Pagan or White test.

Heteroscedasticity refers to the situation where the variance of the error terms, or the spread of the data, is not constant over time. In other words, the variability of the data points is inconsistent across the time series.
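An informal way to see the difference is to compare the rolling standard deviation across the series. The sketch below uses synthetic noise and a hypothetical helper `spread_ratio`; it is a quick visual-style check and not a substitute for a formal test such as Breusch-Pagan or White.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400

# Constant-variance noise vs. noise whose scale grows over time
homo = pd.Series(rng.normal(0.0, 1.0, n))
hetero = pd.Series(rng.normal(0.0, 1.0, n) * np.linspace(1.0, 5.0, n))

def spread_ratio(series, window=50):
    """Ratio of the largest to the smallest rolling standard deviation."""
    rolling_std = series.rolling(window).std().dropna()
    return rolling_std.max() / rolling_std.min()

print("homoscedastic ratio:  ", spread_ratio(homo))
print("heteroscedastic ratio:", spread_ratio(hetero))
```

A ratio close to 1 suggests a roughly constant spread, while a clearly larger ratio hints that the variance changes over time.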

Smoothing - reducing noise in the data using moving average filters, exponential smoothing, or other methods that improve the clarity of the time series.
Seasonality adjustment - extracting and accounting for seasonal components of a time series to obtain cleaner data and improve forecasting (e.g., using the Holt-Winters method or time series decomposition).
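The two smoothing filters named above can be sketched with pandas alone. The data, the 3-point window, and the smoothing factor `alpha=0.3` are assumptions chosen for illustration.

```python
import pandas as pd

noisy = pd.Series([10.0, 14.0, 9.0, 15.0, 11.0, 16.0, 12.0, 17.0])

# Moving average over a 3-point centered window (the edges stay NaN)
moving_avg = noisy.rolling(window=3, center=True).mean()

# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}
exp_smooth = noisy.ewm(alpha=0.3, adjust=False).mean()

print(pd.DataFrame({"noisy": noisy, "moving_avg": moving_avg, "exp_smooth": exp_smooth}))
```

A larger window or a smaller `alpha` smooths more aggressively, at the cost of reacting more slowly to real changes in the series.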

Here we will consider a method for recovering missing data using interpolation since the previous sections have already covered the use of the mean or median:


import pandas as pd

# Create a time-series dataset with missing values
# A DatetimeIndex (rather than plain strings) lets index-aware methods such as 'time' work
dataset = pd.DataFrame({'value': [1, 2, 3, None, 5, 6, None, 8, 9]},
                       index=pd.date_range('2022-01-01', periods=9, freq='D'))

# Interpolate missing values using linear method
dataset['value_interpolated'] = dataset['value'].interpolate(method='linear')

print(dataset)

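The .interpolate() method performs the interpolation. Its method parameter accepts options such as 'linear', 'time', 'index', 'pad', and 'polynomial', which you can experiment with depending on the data. Note that 'time' requires a datetime index and 'polynomial' requires an additional order argument.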
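The choice of method matters most when timestamps are unevenly spaced. The small sketch below, using made-up data, compares 'linear' and 'time' on the same gap.

```python
import numpy as np
import pandas as pd

# Irregularly spaced timestamps: the gap after the missing point is longer
index = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05"])
series = pd.Series([1.0, np.nan, 5.0], index=index)

linear = series.interpolate(method="linear")  # treats points as equally spaced
by_time = series.interpolate(method="time")   # weights by elapsed time

print(linear.iloc[1])   # 3.0, halfway between the neighbors
print(by_time.iloc[1])  # 2.0, one day into a four-day interval
```

For regularly sampled data the two methods give the same result, so the difference only shows up when observations are missing from the index itself.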

Task

Read the 'clients.csv' dataset and recover the missing values using the linear interpolation method.
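One possible approach is sketched below. The structure of 'clients.csv' is not shown in the lesson, so the column names `date` and `clients` and the inline stand-in data are hypothetical; with the real file, the `io.StringIO` object would be replaced by the file path.

```python
import io
import pandas as pd

# Stand-in for 'clients.csv'; the column names and values are hypothetical
csv_text = """date,clients
2022-01-01,100
2022-01-02,
2022-01-03,120
2022-01-04,
2022-01-05,
2022-01-06,150
"""

# With the real file this would be pd.read_csv('clients.csv', ...)
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Linearly interpolate every numeric column
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].interpolate(method="linear")
print(df)
```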
