Module 17 of 25 · Time Series Forecasting — ARIMA, SARIMA, Prophet, LSTM, Transformers for Time Series · Intermediate

Handling Missing Data in Time Series

Duration: 5 min

Real-world time series often have gaps: sensors drop out, markets close, and logs lose records. Many forecasting models assume a regularly spaced series with no gaps, so the way you fill or flag missing values directly affects the accuracy and reliability of your forecasts. This module covers the main types of missingness and several ways to handle them.

Understanding Missing Data Types

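Missing data in time series can be classified into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). With MCAR, the chance that a value is missing is unrelated to any data, observed or unobserved (for example, a random network glitch). With MAR, the missingness depends on other observed data but not on the missing values themselves (for example, readings are lost more often when a recorded battery level is low). With MNAR, the missingness depends on the missing values themselves (for example, a sensor fails whenever the reading exceeds its range).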
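To make the three mechanisms concrete, the sketch below simulates each one on synthetic data. The sensor scenario, the column names (temperature, battery), and the probabilities and thresholds are all assumptions chosen for illustration, not values from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical daily sensor data: a temperature reading and a battery level
n = 1000
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="D"),
    "temperature": rng.normal(20, 5, n),
    "battery": rng.uniform(0, 100, n),
})

# MCAR: every reading has the same 10% chance of being dropped
mcar_mask = rng.random(n) < 0.10

# MAR: readings are dropped more often when the *observed* battery level is low
mar_mask = rng.random(n) < np.where(df["battery"] < 20, 0.40, 0.02)

# MNAR: readings are dropped when the temperature itself is high,
# so missingness depends on the very value that goes missing
mnar_mask = df["temperature"] > 27

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["temperature"].mask(mask)
    print(f"{name}: {observed.isna().mean():.1%} missing, "
          f"observed mean = {observed.mean():.2f}")
```

In this setup the MNAR series loses exactly its highest readings, so the mean of what remains is lower than the true mean. That is why MNAR is the hardest case: the observed data alone cannot reveal the bias.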

import pandas as pd
import numpy as np

# Create a sample time series with missing values
data = {'date': pd.date_range(start='1/1/2020', periods=10),
         'value': [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10]}
df = pd.DataFrame(data)

# Print the original DataFrame
print('Original DataFrame:')
print(df)

# Forward fill: copy the last observed value into each gap.
# (fillna(method='ffill') is deprecated in recent pandas versions; use ffill() instead.)
df_filled = df.ffill()

# Print the DataFrame after filling missing values
print('\nDataFrame after filling missing values:')
print(df_filled)

Original DataFrame:
        date  value
0 2020-01-01    1.0
1 2020-01-02    2.0
2 2020-01-03    NaN
3 2020-01-04    4.0
4 2020-01-05    5.0
5 2020-01-06    NaN
6 2020-01-07    7.0
7 2020-01-08    8.0
8 2020-01-09    NaN
9 2020-01-10   10.0

DataFrame after filling missing values:
        date  value
0 2020-01-01    1.0
1 2020-01-02    2.0
2 2020-01-03    2.0
3 2020-01-04    4.0
4 2020-01-05    5.0
5 2020-01-06    5.0
6 2020-01-07    7.0
7 2020-01-08    8.0
8 2020-01-09    8.0
9 2020-01-10   10.0
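Forward fill has two edge cases worth knowing: it cannot fill a gap at the very start of the series, and by default it stretches one value across a gap of any length. The sketch below shows both on a small made-up series (the values are hypothetical), using the limit parameter of ffill and a follow-up bfill.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a three-day gap
s = pd.Series(
    [np.nan, 2.0, np.nan, np.nan, np.nan, 6.0, 7.0],
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
)

# Forward fill cannot fill the first value: there is nothing earlier to copy
ffilled = s.ffill()

# limit=1 carries a value forward at most one step, so long gaps stay visible
ffilled_limited = s.ffill(limit=1)

# Backward fill afterwards closes the leading gap. Note that it uses a
# future value, which can leak information into a forecasting model.
ffilled_then_bfilled = s.ffill().bfill()

print(pd.DataFrame({
    "original": s,
    "ffill": ffilled,
    "ffill(limit=1)": ffilled_limited,
    "ffill + bfill": ffilled_then_bfilled,
}))
```

Capping the fill length is a conservative choice: it keeps long outages visible as missing values rather than hiding them behind a stale reading.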

Advanced Techniques for Handling Missing Data

Advanced techniques for handling missing data include interpolation, regression imputation, and model-based approaches. Interpolation estimates each missing value from the observed points on either side of the gap, for example by drawing a straight line between them. Regression imputation fits a regression model on the observed data and uses its predictions to fill the gaps. Model-based approaches such as K-Nearest Neighbors (KNN) estimate each missing value from the most similar observed samples.
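As a minimal sketch of interpolation, the example below uses pandas' Series.interpolate on the same sample values as the earlier example. It also adds a small irregularly spaced series (hypothetical values) to show how method="linear", which treats points as equally spaced, differs from method="time", which weights by the actual time elapsed.

```python
import numpy as np
import pandas as pd

# Same sample values as before, indexed by date so time-aware methods can be used
s = pd.Series(
    [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
    name="value",
)

# Linear interpolation: each gap becomes the midpoint of its two neighbors
linear = s.interpolate(method="linear")
print(linear)

# Hypothetical irregular series: the gap is 1 day after the first point
# and 3 days before the last one
irregular = pd.Series(
    [1.0, np.nan, 5.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
)

by_position = irregular.interpolate(method="linear")  # ignores the dates: 3.0
by_time = irregular.interpolate(method="time")        # uses the dates: 2.0

print(by_position.iloc[1], by_time.iloc[1])
```

For evenly spaced data the two methods agree; method="time" only matters when timestamps are irregular.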

from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Create a sample time series with missing values
data = {'date': pd.date_range(start='1/1/2020', periods=10),
         'value': [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10]}
df = pd.DataFrame(data)

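# Add a numeric time index so that "nearest" means "closest in time".
# With the value column alone, KNNImputer has no observed feature to measure
# distance on and falls back to the column mean for every gap.
df['t'] = np.arange(len(df))
features = df[['t', 'value']]

# Apply KNNImputer: each gap becomes the mean of its 2 nearest observed neighbors
imputer = KNNImputer(n_neighbors=2)
features_filled = imputer.fit_transform(features)

# Create a new DataFrame with the dates and the filled values
df_filled = pd.DataFrame({'date': df['date'], 'value': features_filled[:, 1]})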

# Print the DataFrame after filling missing values
print('DataFrame after filling missing values using KNNImputer:')
print(df_filled)
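Regression imputation, mentioned earlier, can be sketched in a similar way. The example below is one possible setup, not the only one: it assumes a single predictor t (days since the first observation, a name introduced here for illustration) and scikit-learn's LinearRegression fitted on the observed rows only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "value": [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10],
})

# Hypothetical predictor: days elapsed since the first observation
df["t"] = (df["date"] - df["date"].iloc[0]).dt.days

# Fit the regression on rows where the value was actually observed
observed = df["value"].notna()
model = LinearRegression().fit(df.loc[observed, ["t"]], df.loc[observed, "value"])

# Predict the missing rows and write the predictions into a copy
df_reg = df.copy()
df_reg.loc[~observed, "value"] = model.predict(df.loc[~observed, ["t"]])

print(df_reg[["date", "value"]])
```

Because the sample values lie on a straight line, the fitted model reproduces the gaps exactly. On real data a straight line is rarely enough, and extra predictors (lagged values, calendar features) would be a natural extension.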

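💡 Tip: When using KNNImputer, choose the number of neighbors (n_neighbors) with care. A small value follows nearby observations closely but is sensitive to noise; a large value smooths more and pulls imputed values toward the overall mean. A common practice is to start with a small number and increase it only if the imputed values look too noisy.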
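One way to choose n_neighbors more systematically is to hide some values you actually know, impute them, and compare the result with the truth. The sketch below does this on a synthetic seasonal series; the series, the 20% hiding rate, and the candidate values of k are all assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Hypothetical complete series: a 30-step seasonal cycle plus noise
n = 200
t = np.arange(n, dtype=float)
truth = 10 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 1.0, n)

# Hide 20% of the known values so that imputations can be scored
hidden = rng.random(n) < 0.20
masked = truth.copy()
masked[hidden] = np.nan

# Use the time index as a feature so neighbors are close in time
# (an assumption of this sketch, not a requirement of KNNImputer)
X = np.column_stack([t, masked])

# Mean absolute error on the hidden values for each candidate k
errors = {}
for k in (1, 2, 5, 10, 50):
    filled = KNNImputer(n_neighbors=k).fit_transform(X)[:, 1]
    errors[k] = np.mean(np.abs(filled[hidden] - truth[hidden]))

for k, mae in errors.items():
    print(f"n_neighbors={k:>2}: MAE on hidden values = {mae:.3f}")

best_k = min(errors, key=errors.get)
print("Lowest error at n_neighbors =", best_k)
```

With a seasonal series like this one, a very large k averages over whole cycles and washes out the pattern, which shows up as a much higher error.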

❓ What are the three types of missing data in time series?

❓ Which method is used in the second code example to handle missing data?

Key Concepts

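Concept | Description
Trend | Long-run upward or downward movement in the series; interpolation follows it across a gap, while forward fill flattens it.
Seasonality | A pattern that repeats at a fixed period (daily, weekly, yearly); a fill that ignores it can distort the cycle.
Stationarity | The mean and variance of the series stay constant over time; careless imputation can change both.
Autocorrelation | Correlation between the series and its own lagged values; it is the reason nearby observations are informative about a gap.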

Check Your Understanding

❓ How does forward fill handle edge cases, such as a missing value at the very start of the series?

❓ How does the computational cost of KNN imputation grow as the series gets longer?

❓ Which hyperparameter is most critical for KNNImputer?
