Data Cleaning | Time Series Data Processing
Data Cleaning

Data cleaning in time series processing removes anomalies, errors, and incomplete or irrelevant data. It is an important preprocessing step to ensure analysis quality and forecast accuracy.

The main methods of data cleaning are:

Imputation

Imputation - filling missing values using the mean, median, interpolation, or time series methods (e.g., extrapolation).

The window size (the span over which the mean or median is taken) is usually set somewhere between 2 and 10-15 observations; in practice, the choice is made by visually assessing how well the dataset is recovered. Plain mean imputation is generally not recommended for time series data, because it can introduce bias and distort the underlying patterns. Therefore, other imputation methods, such as interpolation, regression, or more sophisticated time-series-specific techniques, are often preferred for dealing with missing values in time series data.
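As a rough sketch of window-based imputation (the series and the window size of 5 below are assumptions, not values taken from any particular dataset):

import pandas as pd

# Hypothetical series with gaps; the window size of 5 is an assumed, tunable value
series = pd.Series([10, 12, None, 11, 13, None, 12, 14, 13, None, 15])

# Rolling median over a centered window; min_periods=1 tolerates NaNs inside the window
rolling_median = series.rolling(window=5, center=True, min_periods=1).median()

# Fill only the missing positions, keeping observed values untouched
series_imputed = series.fillna(rolling_median)
print(series_imputed)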

In terms of imputation, interpolation is appropriate when the missing values fall between observed points and the pattern or trend of the time series is relatively stable; if the gaps occur at the very end of the series, extrapolation of the trend is needed instead. In short, interpolation is useful when the time series exhibits a clear trend or pattern along which the gaps can be filled.

Outlier removal

Outlier removal - identifying and removing anomalous values that may distort the analysis, using statistical methods (e.g., IQR, z-score).

For non-stationary data, we can use the following procedure:

  • If you are working with homoscedastic data, manually set a limit L and filter out every value x_val for which |x_val - x_mean| > L, where x_mean is the mean computed over a moving window;
  • If you are working with heteroscedastic data, first transform the data with a variance-stabilizing function such as the Box-Cox transformation, which can reduce the variability and make the series more homoscedastic, and then apply the first step (a sketch of the whole procedure is given after the definitions below).

A time series is said to be homoscedastic when the variance of the errors or residuals is constant and does not change with respect to time. One way to check for homoscedasticity is to run a statistical test such as the Breusch-Pagan or White test.
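For instance, statsmodels provides a Breusch-Pagan test; the sketch below runs it on the residuals of a simple trend-on-time regression (the synthetic series is just an assumed example):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical series whose spread grows over time (heteroscedastic by construction)
rng = np.random.default_rng(0)
t = np.arange(100)
y = 0.5 * t + rng.normal(scale=1 + 0.05 * t)

# Fit a simple linear trend on time and test its residuals
X = sm.add_constant(t)
model = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)

# A small p-value suggests the residual variance is not constant (heteroscedasticity)
print(lm_pvalue)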

Heteroscedasticity, by contrast, refers to a situation where the variance of the error terms, or the spread of the data, is not constant over time; in other words, the variability of the data points is inconsistent across the range of the time series.
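A minimal sketch of the filtering procedure described above (the series, the window size, and the limit L are assumed values you would tune by inspecting your own data):

import pandas as pd
from scipy import stats

# Hypothetical strictly positive series with two obvious spikes
values = pd.Series([5, 6, 5, 7, 40, 6, 8, 7, 9, 8, 60, 9, 10, 11, 10], dtype=float)

# Box-Cox requires positive data and helps stabilize the variance
transformed, _ = stats.boxcox(values)
transformed = pd.Series(transformed, index=values.index)

# Mean over a moving window (a window of 5 is an assumed choice)
rolling_mean = transformed.rolling(window=5, center=True, min_periods=1).mean()

# Manually chosen limit L (assumed value); keep points with |x_val - x_mean| <= L
L = 1.5
cleaned = values[(transformed - rolling_mean).abs() <= L]
print(cleaned)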

  • Smoothing - reducing noise in the data using moving-average filters, exponential smoothing, or other methods that improve the clarity of the time series;
  • Seasonality adjustment - extracting and accounting for the seasonal component of a time series to obtain cleaner data and improve forecasting (e.g., using the Holt-Winters method or time series decomposition). A sketch of both follows this list.
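As an illustration of both ideas, here is a rough sketch using a rolling mean for smoothing and statsmodels' seasonal_decompose for the seasonal component (the synthetic monthly series, window of 3, and period of 12 are assumptions):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series with trend, yearly seasonality and noise
rng = np.random.default_rng(0)
index = pd.date_range('2018-01-01', periods=48, freq='MS')
values = (np.linspace(10, 20, 48)
          + 3 * np.sin(2 * np.pi * np.arange(48) / 12)
          + rng.normal(scale=0.5, size=48))
series = pd.Series(values, index=index)

# Smoothing: centered moving average reduces the noise
smoothed = series.rolling(window=3, center=True).mean()

# Seasonality adjustment: subtract the estimated seasonal component
decomposition = seasonal_decompose(series, model='additive', period=12)
deseasonalized = series - decomposition.seasonal

print(smoothed.head())
print(deseasonalized.head())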

Here we will recover missing data using interpolation, since the previous sections have already covered imputation with the mean or median:

import pandas as pd

# Create a time-series dataset with missing values
dataset = pd.DataFrame(
    {'value': [1, 2, 3, None, 5, 6, None, 8, 9]},
    index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05',
           '2022-01-06', '2022-01-07', '2022-01-08', '2022-01-09']
)

# Interpolate missing values using the linear method
dataset['value_interpolated'] = dataset['value'].interpolate(method='linear')

print(dataset)

Interpolation is implemented by the .interpolate() method, whose method parameter accepts 'linear', 'time', 'index', 'pad', 'polynomial', etc.; you can experiment with these depending on the data.
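For example, with a DatetimeIndex the 'time' method weights the interpolation by the actual spacing between timestamps, while 'linear' treats all points as equally spaced (the unevenly spaced dates below are just an illustration):

import pandas as pd

# Unevenly spaced timestamps: a 1-day gap followed by an 8-day gap
series = pd.Series(
    [1.0, None, 10.0],
    index=pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10'])
)

print(series.interpolate(method='linear'))  # midpoint of the values: 5.5
print(series.interpolate(method='time'))    # proportional to elapsed time: 2.0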

Task

Read the 'clients.csv' dataset and recover the missing values using the linear interpolation method.
