Time Series Basics for Machine Learning
Learn the fundamentals of working with time series data in machine learning.
Time series data has a twist: the order matters. You can't randomly split it like regular data. Let's learn the fundamentals.
What Makes Time Series Special
1. **Temporal order matters** - Can't shuffle data
2. **Autocorrelation** - Values correlate with past values
3. **Trends** - Long-term increase/decrease
4. **Seasonality** - Repeating patterns (daily, weekly, yearly)
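Autocorrelation is easy to check directly in pandas. A minimal sketch on a synthetic series with a 7-step cycle (the data here is made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy series: a 7-step cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(100)
s = pd.Series(np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.1, size=100))

# Correlation of the series with itself shifted by k steps
lag1 = s.autocorr(lag=1)  # neighboring values
lag7 = s.autocorr(lag=7)  # one full cycle apart
print(f"lag-1 autocorrelation: {lag1:.2f}")
print(f"lag-7 autocorrelation: {lag7:.2f}")
```

A high lag-7 value confirms the weekly-style cycle, which suggests lag features at that offset would be informative.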
Loading and Exploring
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load with datetime index
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')

# Quick visualization
df['value'].plot(figsize=(12, 4))
plt.title('Time Series Data')
plt.show()

# Check for trends and seasonality
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['value'], period=12)
decomposition.plot()
```
Splitting Time Series Data
**NEVER use random splits!** Future data leaks into training.
```python
# Correct: Split by time
train_size = int(len(df) * 0.8)
train = df[:train_size]
test = df[train_size:]

# Visual check
plt.plot(train.index, train['value'], label='Train')
plt.plot(test.index, test['value'], label='Test')
plt.legend()
```
Creating Lag Features
Use past values as features for predicting the future:
```python
def create_lag_features(df, column, lags):
    df_features = df.copy()
    for lag in lags:
        df_features[f'{column}_lag_{lag}'] = df[column].shift(lag)
    return df_features.dropna()

# Create lags 1-7
df_features = create_lag_features(df, 'value', range(1, 8))

# Features: value at t-1, t-2, ..., t-7
# Target: value at t
```
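On a tiny hand-made series you can see exactly what the function produces (the function is repeated here so the sketch is self-contained):

```python
import pandas as pd

def create_lag_features(df, column, lags):
    df_features = df.copy()
    for lag in lags:
        df_features[f'{column}_lag_{lag}'] = df[column].shift(lag)
    return df_features.dropna()

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})
out = create_lag_features(df, 'value', [1, 2])
print(out)
# The first two rows are dropped: their lags would need
# values from before the series starts
```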
Rolling Statistics
```python
# Moving average
df['rolling_mean_7'] = df['value'].rolling(window=7).mean()
df['rolling_std_7'] = df['value'].rolling(window=7).std()

# Exponential moving average (more weight on recent values)
df['ema_7'] = df['value'].ewm(span=7).mean()
```
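The difference between the two matters when the series shifts level. A small sketch on made-up data showing that the EMA adapts to a jump faster than the plain moving average:

```python
import pandas as pd

# Series that jumps from 0 to 10 halfway through
s = pd.Series([0.0] * 10 + [10.0] * 10)

sma = s.rolling(window=7).mean()
ema = s.ewm(span=7).mean()

# Shortly after the jump, the EMA is already closer to the new level
print(f"SMA at t=12: {sma[12]:.2f}")
print(f"EMA at t=12: {ema[12]:.2f}")
```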
Time-Based Features
```python
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['is_weekend'] = df.index.dayofweek.isin([5, 6]).astype(int)
df['quarter'] = df.index.quarter
```
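A quick check on a made-up date range (early January 2024, which starts on a Monday) shows what these features look like:

```python
import pandas as pd

# Friday Jan 5 through Monday Jan 8, 2024
idx = pd.date_range('2024-01-05', periods=4, freq='D')
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=idx)

df['day_of_week'] = df.index.dayofweek  # Monday=0 ... Sunday=6
df['is_weekend'] = df.index.dayofweek.isin([5, 6]).astype(int)
print(df[['day_of_week', 'is_weekend']])
```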
Cross-Validation for Time Series
Regular k-fold CV leaks future information into training folds. Use TimeSeriesSplit, which always trains on the past and tests on the future:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on the past and tests on the future
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]  # use .iloc for DataFrames
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate

# With cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
```
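To see the expanding windows TimeSeriesSplit produces (with scikit-learn's defaults), print the fold indices on a toy dataset of 10 samples and 3 splits:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

# Training windows expand; each test fold is strictly later than its train set
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```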
Simple Forecasting with ML
```python
from sklearn.ensemble import RandomForestRegressor

# Prepare features and target
X = df_features.drop('value', axis=1)
y = df_features['value']

# Time-based split
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Train model
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import mean_absolute_error, mean_squared_error
predictions = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"RMSE: {mean_squared_error(y_test, predictions)**0.5:.2f}")
```
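To make the two metrics concrete, here is the same arithmetic by hand on three made-up predictions; note how RMSE penalizes the one large error more heavily than MAE does:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 10.0])

errors = y_pred - y_true            # [-1, 0, 3]
mae = np.mean(np.abs(errors))       # (1 + 0 + 3) / 3
rmse = np.sqrt(np.mean(errors**2))  # sqrt((1 + 0 + 9) / 3)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```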
Common Mistakes
1. **Random train/test split** - Future leaks into training
2. **Using future data as features** - Check your lags!
3. **Ignoring seasonality** - Add time-based features
4. **Wrong evaluation** - Use time-aware CV
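For mistake #2, the direction of `shift` is the thing to check: a positive shift pulls values from the past (safe), while a negative shift pulls them from the future (leakage). A tiny demonstration:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

past = s.shift(1)     # previous value: safe to use as a feature
future = s.shift(-1)  # NEXT value: using this leaks the future!

print(pd.DataFrame({'value': s, 'shift(1)': past, 'shift(-1)': future}))
```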
When to Use ML vs Traditional Methods
**ML approaches work when:**
- Multiple features available
- Complex non-linear patterns
- External variables matter
**Traditional methods (ARIMA, etc.) better when:**
- Pure time series (no extra features)
- Strong seasonality
- Need confidence intervals
Key Takeaway
Time series requires respecting temporal order in everything: splits, cross-validation, and feature creation. Create lag features and rolling statistics, add time-based features for seasonality, and always use TimeSeriesSplit for validation. Never let future information leak into your training data!