Handling Missing Data in Time Series
Duration: 5 min
This module covers how to handle missing data in time series forecasting. Missing values can bias model estimates and reduce forecast accuracy, so understanding why data is missing and choosing a suitable way to fill it is essential for reliable forecasts.
Understanding Missing Data Types
Missing data in time series can be classified into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). MCAR occurs when the probability that a value is missing is independent of both observed and unobserved values. MAR occurs when the missingness depends on observed data but not on the missing values themselves. MNAR occurs when the missingness depends on the very values that are missing.
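To make the three mechanisms concrete, here is a minimal sketch that simulates each one on synthetic data. The sensor scenario, the column names (`battery`, `temp`), and the thresholds are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily sensor data: a temperature reading plus the battery level
n = 200
df = pd.DataFrame(
    {
        "battery": rng.uniform(0, 100, size=n),
        "temp": rng.normal(20, 5, size=n),
    },
    index=pd.date_range("2020-01-01", periods=n, freq="D"),
)

# MCAR: every reading has the same 10% chance of being dropped
mcar_mask = rng.random(n) < 0.10

# MAR: readings are dropped only when the observed battery level is low
mar_mask = (df["battery"] < 20).to_numpy()

# MNAR: readings are dropped when the temperature itself is high
mnar_mask = (df["temp"] > 25).to_numpy()

# Mean of the readings that remain observed under each mechanism
observed_mean = {
    "full": df["temp"].mean(),
    "mcar": df.loc[~mcar_mask, "temp"].mean(),
    "mar": df.loc[~mar_mask, "temp"].mean(),
    "mnar": df.loc[~mnar_mask, "temp"].mean(),
}
for name, value in observed_mean.items():
    print(f"{name:>4}: {value:.2f}")
```

Under the MNAR mask the surviving readings exclude every high temperature, so their mean is pulled below the full-data mean. This is why MNAR is usually the hardest case: the bias cannot be detected from the observed values alone.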
import pandas as pd
import numpy as np
# Create a sample time series with missing values
data = {'date': pd.date_range(start='1/1/2020', periods=10),
'value': [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10]}
df = pd.DataFrame(data)
# Print the original DataFrame
print('Original DataFrame:')
print(df)
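# Fill missing values with forward fill: each gap takes the last observed value
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent)
df_filled = df.ffill()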
# Print the DataFrame after filling missing values
print('\nDataFrame after filling missing values:')
print(df_filled)

Output:
Original DataFrame:
        date  value
0 2020-01-01    1.0
1 2020-01-02    2.0
2 2020-01-03    NaN
3 2020-01-04    4.0
4 2020-01-05    5.0
5 2020-01-06    NaN
6 2020-01-07    7.0
7 2020-01-08    8.0
8 2020-01-09    NaN
9 2020-01-10   10.0

DataFrame after filling missing values:
        date  value
0 2020-01-01    1.0
1 2020-01-02    2.0
2 2020-01-03    2.0
3 2020-01-04    4.0
4 2020-01-05    5.0
5 2020-01-06    5.0
6 2020-01-07    7.0
7 2020-01-08    8.0
8 2020-01-09    8.0
9 2020-01-10   10.0

Advanced Techniques for Handling Missing Data
Advanced techniques for handling missing data include interpolation, regression imputation, and model-based approaches. Interpolation estimates each missing value from the observations on either side of the gap, while regression imputation fits a regression model to the observed data and uses it to predict the missing values. Model-based approaches use algorithms such as K-Nearest Neighbors (KNN) or other machine learning models to estimate missing values.
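As a rough illustration of the first two techniques, the sketch below applies time-based interpolation and a simple linear regression imputation to the same ten-point series used in this module. The choice of a straight-line fit on the time index is an assumption made for brevity; real series usually need a richer model.

```python
import numpy as np
import pandas as pd

# Same sample values as in this module, indexed by date
s = pd.Series(
    [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10],
    index=pd.date_range(start="2020-01-01", periods=10),
    name="value",
)

# Interpolation: linear in time between the observations around each gap
s_interp = s.interpolate(method="time")

# Regression imputation: fit value ~ time index on the observed rows,
# then predict only the rows that are missing
t = np.arange(len(s))
observed = s.notna().to_numpy()
slope, intercept = np.polyfit(t[observed], s.to_numpy()[observed], deg=1)
s_reg = s.copy()
s_reg.loc[~observed] = slope * t[~observed] + intercept

print(pd.DataFrame({"original": s, "interpolated": s_interp, "regression": s_reg}))
```

Because the sample series is exactly linear, both methods recover the same values here. On real data they differ: interpolation only looks at the neighbors of each gap, while the regression uses every observed point.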
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np
# Create a sample time series with missing values
data = {'date': pd.date_range(start='1/1/2020', periods=10),
        'value': [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan, 10]}
df = pd.DataFrame(data)
# Add a numeric time index as a second feature. With 'value' as the only
# column, a row with a missing value has no defined distance to any other
# row, and KNNImputer falls back to the column mean.
df['t'] = np.arange(len(df))
features = df[['t', 'value']]
# Apply KNNImputer: neighbors are now the rows closest in time
imputer = KNNImputer(n_neighbors=2)
features_filled = imputer.fit_transform(features)
# Create a new DataFrame with the dates and the filled values
df_filled = pd.DataFrame({'date': df['date'], 'value': features_filled[:, 1]})
# Print the DataFrame after filling missing values
print('DataFrame after filling missing values using KNNImputer:')
print(df_filled)

💡 Tip: When using KNNImputer, choose the number of neighbors (n_neighbors) carefully. Too few neighbors makes the imputed values track noise in individual observations (overfitting), while too many averages over distant points and smooths away real structure (underfitting). A common practice is to start with a small number and increase it if necessary.
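One way to choose n_neighbors is to hide some values that are actually known, impute them, and compare the result against the truth. The sketch below does this on a synthetic noisy sine wave; the series, the 15% holdout share, and the candidate values of k are all assumptions for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical series: a sine wave with a 30-step period plus noise
n = 120
t = np.arange(n, dtype=float)
y_true = np.sin(2 * np.pi * t / 30) + rng.normal(0, 0.1, size=n)

# Hide 15% of the known values so the imputed values can be scored
holdout = rng.choice(n, size=18, replace=False)
y_masked = y_true.copy()
y_masked[holdout] = np.nan

# Use the time index as a feature so "nearest" means nearest in time
X = np.column_stack([t, y_masked])

# Mean absolute error on the hidden points for each candidate k
scores = {}
for k in (1, 2, 5, 10, 30):
    imputed = KNNImputer(n_neighbors=k).fit_transform(X)[:, 1]
    scores[k] = float(np.mean(np.abs(imputed[holdout] - y_true[holdout])))

for k, mae in scores.items():
    print(f"n_neighbors={k:>2}  MAE={mae:.3f}")
```

With a very large k the imputer averages over roughly a full cycle of the sine wave, so the imputed values collapse toward the series mean and the error grows. On real data, the same masking procedure can be repeated over several random holdouts to get a more stable comparison.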
❓ What are the three types of missing data in time series?
❓ Which method does the scikit-learn code example use to handle missing data?
Key Concepts
| Concept | Description |
|---|---|
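| Trend | Long-run increase or decrease in the level of the series; interpolation follows it across a gap, while forward fill holds the last value flat |
| Seasonality | Pattern that repeats at a fixed period (daily, weekly, yearly); a fill method that ignores it can distort values inside the cycle |
| Stationarity | Property of a series whose mean, variance, and autocorrelation do not change over time; imputation can alter it, for example by reducing variance |
| Autocorrelation | Correlation of the series with its own lagged values; the reason neighboring observations carry information about a missing point |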
Check Your Understanding
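❓ How does forward fill handle edge cases, such as a missing value at the very start of the series?
❓ How does the computational cost of KNN imputation compare with that of forward fill?
❓ Which KNNImputer hyperparameter is most critical for imputation quality?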