Data Preprocessing
Data Cleaning
Data cleaning in time series processing removes anomalies, errors, and incomplete or irrelevant data. It is an important preprocessing step to ensure analysis quality and forecast accuracy.
The main methods of data cleaning are:
Imputation
Imputation - filling missing values using the mean, median, interpolation, or time series methods (e.g., extrapolation).
The window size (the span over which the mean or median is taken) is usually set between 2 and 10-15 points, and it is mostly chosen by visually assessing how well the dataset is recovered. Mean imputation is generally not recommended for time series data because it can introduce bias and distort the underlying patterns in the data. Interpolation, regression, or more sophisticated time-series-specific methods are therefore often preferred for filling missing values in time series data.
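As a minimal sketch of window-based imputation, the snippet below fills gaps with a centered rolling median. The series values and the window size of 3 are assumptions made up for illustration, not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two gaps (illustrative values only)
series = pd.Series(
    [10.0, 12.0, np.nan, 11.0, 13.0, np.nan, 14.0, 15.0],
    index=pd.date_range("2022-01-01", periods=8, freq="D"),
)

window = 3  # assumed window size; try several values and inspect the result

# Centered rolling median; min_periods=1 lets the window skip the NaN itself
rolling_median = series.rolling(window=window, center=True, min_periods=1).median()

# Replace only the missing points; observed values are left untouched
imputed = series.fillna(rolling_median)
print(imputed)
```

Each gap is filled with the median of its observed neighbours inside the window, so the filled value follows the local level of the series rather than the global mean.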
Interpolation is appropriate when the missing values lie between observed points and the pattern or trend of the time series is relatively stable. When the missing values occur at the end of a time series, the estimate has to continue the trend beyond the observed values, which is extrapolation; it is reasonable only when the series exhibits a clear and stable trend or pattern.
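The difference between gaps inside the series and gaps at its end can be seen in a small sketch. The values are invented for illustration; `limit_area` is an existing parameter of pandas `interpolate`.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one gap inside, two missing values at the end
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan])

# Fill only gaps that are surrounded by observed values
inside_only = s.interpolate(method="linear", limit_area="inside")

# Default behaviour: the trailing gap is filled by repeating the last
# observed value, which is not a continuation of the trend
default_fill = s.interpolate(method="linear")

print(pd.DataFrame({"raw": s, "inside_only": inside_only, "default": default_fill}))
```

With `limit_area="inside"` the trailing values stay missing, which makes it explicit that they need a separate, extrapolation-style treatment.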
Outlier removal
Outlier removal - using statistical methods (e.g., IQR, z-score) to identify and remove anomalous values that may distort the analysis.
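A minimal sketch of both rules on a hypothetical series with one obvious spike. The 1.5 multiplier for the IQR rule and the threshold of 3 for the z-score are common conventions, assumed here rather than taken from the course.

```python
import pandas as pd

# Hypothetical measurements with a single anomalous value (95)
values = pd.Series(
    [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 95, 12, 10, 11, 9, 10, 11, 12, 10, 11],
    dtype=float,
)

# IQR rule: flag points far outside the interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_mask = z_scores.abs() > 3

# Keep only the points that the IQR rule does not flag
cleaned = values[~iqr_mask]
print(values[iqr_mask], values[z_mask], sep="\n")
```

Both rules use global statistics, so they suit series without a strong trend; for trending data a moving-window variant is usually a better fit.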
For non-stationary data, we can use the following procedure:
- If you are working with homoscedastic data, manually set a limit L and filter out every value x_val for which |x_val - x_mean| > L, where x_mean is the average calculated over the moving window;
- If you are working with heteroscedastic data, first transform it with a mathematical function such as the Box-Cox transformation, which can reduce the data's variability and make it more homoscedastic, and then apply the first point.
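The procedure above can be sketched as follows. The function name `filter_by_rolling_mean`, the sample series, the window of 5, and the limit of 20 are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd


def filter_by_rolling_mean(series, window, limit):
    """Drop points whose distance from the centered rolling mean exceeds `limit`."""
    x_mean = series.rolling(window=window, center=True, min_periods=1).mean()
    keep = (series - x_mean).abs() <= limit
    return series[keep]


# Hypothetical non-stationary (trending) series with one spike at position 10
trend = pd.Series(np.arange(20, dtype=float))
trend.iloc[10] = 60.0

# The spike also inflates the rolling mean of its neighbours, so the limit
# must be loose enough to keep them while still rejecting the spike itself
filtered = filter_by_rolling_mean(trend, window=5, limit=20.0)
print(filtered)
```

For heteroscedastic, strictly positive data, one option is to apply `np.log` (the Box-Cox transformation with lambda equal to 0) or `scipy.stats.boxcox` first, and then run the same filter on the transformed values.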
A time series dataset is said to be homoscedastic when the variance of its errors or residuals stays constant over time. One way to check for homoscedasticity is to perform a statistical test, such as the Breusch-Pagan or White test.
Heteroscedasticity is the opposite situation: the variance of the error terms, or the spread of the data, is not constant over time. In other words, the variability of the data points is inconsistent across the time series.
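As an informal illustration only, the sketch below compares how much the rolling standard deviation varies for two simulated residual series. It is not the Breusch-Pagan or White test (statsmodels provides those); the helper `spread_ratio`, the window of 50, and the simulated data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400

# Simulated residuals: constant spread vs. spread that grows over time
homo = pd.Series(rng.normal(0.0, 1.0, n))
hetero = pd.Series(rng.normal(0.0, 1.0, n) * np.linspace(0.5, 5.0, n))


def spread_ratio(residuals, window=50):
    """Ratio of the largest to the smallest rolling standard deviation."""
    rolling_std = residuals.rolling(window).std().dropna()
    return rolling_std.max() / rolling_std.min()


print("homoscedastic  :", round(spread_ratio(homo), 2))
print("heteroscedastic:", round(spread_ratio(hetero), 2))
```

A ratio close to 1 suggests a stable spread, while a large ratio hints at heteroscedasticity and is a cue to run a formal test.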
- Smoothing - reducing noise in the data using moving average filters, exponential smoothing, or other methods that improve the clarity of time series;
- Seasonality adjustment - extracting and accounting for seasonal components of a time series to obtain cleaner data and improve forecasting (e.g., using the Holt-Winters method or time series decomposition).
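The last two items can be sketched with plain pandas. The simulated series, the 7-day window, and `alpha=0.3` are assumptions, and the seasonal step is a naive weekday-average adjustment, not the Holt-Winters method or a library decomposition.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend + weekly pattern + noise
rng = np.random.default_rng(42)
index = pd.date_range("2022-01-01", periods=56, freq="D")
weekly = np.tile([0.0, 1.0, 2.0, 1.0, 0.0, -2.0, -2.0], 8)
series = pd.Series(0.1 * np.arange(56) + weekly + rng.normal(0, 0.3, 56), index=index)

# Smoothing 1: moving average over a centered 7-day window
moving_avg = series.rolling(window=7, center=True).mean()

# Smoothing 2: simple exponential smoothing with an assumed factor alpha
exp_smooth = series.ewm(alpha=0.3, adjust=False).mean()

# Naive seasonal adjustment: remove the trend estimate, average the remainder
# per weekday, and subtract that weekly profile from the original series
detrended = series - moving_avg
profile = detrended.groupby(detrended.index.dayofweek).mean()
adjusted = series - profile.reindex(series.index.dayofweek).to_numpy()

print(pd.DataFrame({"raw": series, "moving_avg": moving_avg, "adjusted": adjusted}).head(10))
```

The centered moving average leaves the first and last three points undefined, which is one reason exponential smoothing is often preferred near the end of a series.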
Here we will consider a method for recovering missing data using interpolation since the previous sections have already covered the use of the mean or median:
import pandas as pd

# Create a time-series dataset with missing values
dataset = pd.DataFrame(
    {'value': [1, 2, 3, None, 5, 6, None, 8, 9]},
    index=['2022-01-01', '2022-01-02', '2022-01-03',
           '2022-01-04', '2022-01-05', '2022-01-06',
           '2022-01-07', '2022-01-08', '2022-01-09'],
)

# Interpolate missing values using linear method
dataset['value_interpolated'] = dataset['value'].interpolate(method='linear')
print(dataset)
The .interpolate() method performs the interpolation. Its method argument accepts options such as 'linear', 'time', 'index', 'pad', and 'polynomial' (which also needs an order argument), and you can experiment with them depending on the data.
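To see why the choice of method matters, here is a small sketch comparing 'linear' and 'time' on irregularly spaced dates. The dates and values are made up for illustration; 'time' requires a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical observations with uneven gaps between dates
index = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05", "2022-01-06"])
s = pd.Series([1.0, np.nan, np.nan, 6.0], index=index)

# 'linear' ignores the index and treats the points as equally spaced
by_linear = s.interpolate(method="linear")

# 'time' weights the estimate by the actual time distance between points
by_time = s.interpolate(method="time")

print(pd.DataFrame({"linear": by_linear, "time": by_time}))
```

With evenly spaced timestamps the two methods agree; they diverge only when the spacing is irregular, as in this example.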
Read the 'clients.csv' dataset and recover the missing values using the linear interpolation method.