ML · 9 min read

Time Series Basics for Machine Learning

Learn the fundamentals of working with time series data in machine learning.

Sarah Chen
December 19, 2025

Time series data comes with a twist: the order matters. You can't shuffle it or split it randomly like regular tabular data. Let's walk through the fundamentals.

What Makes Time Series Special

  1. Temporal order matters - Can't shuffle data
  2. Autocorrelation - Values correlate with their own past values (see the quick check after this list)
  3. Trends - Long-term increase/decrease
  4. Seasonality - Repeating patterns (daily, weekly, yearly)
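
A quick way to see point 2 in practice is an autocorrelation check. This is a minimal sketch, assuming the same data.csv with a date column and a value column that gets loaded in the next section:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Assumes the same setup as below: a datetime index and a 'value' column
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')

# Lag-1 autocorrelation: how strongly each value correlates with the previous one
print(f"Lag-1 autocorrelation: {df['value'].autocorr(lag=1):.2f}")

# Autocorrelation plot: tall bars at lag k mean value(t) correlates with value(t-k)
plot_acf(df['value'], lags=30)
plt.show()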

Loading and Exploring

import pandas as pd
import matplotlib.pyplot as plt

# Load with datetime index
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')

# Quick visualization
df['value'].plot(figsize=(12, 4))
plt.title('Time Series Data')
plt.show()

# Check for trends and seasonality (period=12 assumes monthly data with yearly seasonality)
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['value'], period=12)
decomposition.plot()
plt.show()

Splitting Time Series Data

NEVER use random splits! Future data leaks into training.

# Correct: Split by time
train_size = int(len(df) * 0.8)
train = df[:train_size]
test = df[train_size:]

# Visual check
plt.plot(train.index, train['value'], label='Train')
plt.plot(test.index, test['value'], label='Test')
plt.legend()
plt.show()

Creating Lag Features

Use past values as features to predict future values:

def create_lag_features(df, column, lags):
    df_features = df.copy()
    for lag in lags:
        # shift(lag) moves the series down so row t holds the value from t-lag
        df_features[f'{column}_lag_{lag}'] = df[column].shift(lag)
    # Drop the initial rows made NaN by shifting
    return df_features.dropna()

# Create lags 1-7
df_features = create_lag_features(df, 'value', range(1, 8))

# Features: value at t-1, t-2, ..., t-7
# Target: value at t

Rolling Statistics

# Moving average and moving standard deviation over the last 7 observations
df['rolling_mean_7'] = df['value'].rolling(window=7).mean()
df['rolling_std_7'] = df['value'].rolling(window=7).std()

# Exponential moving average (more weight on recent values)
df['ema_7'] = df['value'].ewm(span=7).mean()

# Note: these windows include the value at time t itself. If the target is the
# value at t, shift them by one step, e.g. df['value'].shift(1).rolling(7).mean()

Time-Based Features

df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['is_weekend'] = df.index.dayofweek.isin([5, 6]).astype(int)
df['quarter'] = df.index.quarter

Cross-Validation for Time Series

Regular k-fold CV lets folds from the future end up in training. Use TimeSeriesSplit, which always trains on earlier data and validates on what comes after:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X):
    # Use X.iloc[train_idx] / y.iloc[train_idx] if X and y are pandas objects
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate

# With cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

Simple Forecasting with ML

from sklearn.ensemble import RandomForestRegressor

# Prepare features and target
X = df_features.drop('value', axis=1)
y = df_features['value']

# Time-based split
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)  # fixed seed for reproducibility
model.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import mean_absolute_error, mean_squared_error
predictions = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"RMSE: {mean_squared_error(y_test, predictions)**0.5:.2f}")

Common Mistakes

  1. Random train/test split - Future leaks into training
  2. Using future data as features - Check your lags! (see the example after this list)
  3. Ignoring seasonality - Add time-based features
  4. Wrong evaluation - Use time-aware CV
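
To make mistake 2 concrete, here is a minimal sketch contrasting a valid lag with an accidental look-ahead feature (the column names are illustrative):

# Valid: a lag only uses the past
df['value_lag_1'] = df['value'].shift(1)    # value at t-1, known when predicting t

# Leaky: a negative shift pulls in the future
df['value_leak'] = df['value'].shift(-1)    # value at t+1, NOT known when predicting t

# Rolling stats can leak too: shift first so the window ends at t-1
df['rolling_mean_7_safe'] = df['value'].shift(1).rolling(window=7).mean()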

When to Use ML vs Traditional Methods

ML approaches work when:

  • Multiple features available
  • Complex non-linear patterns
  • External variables matter

Traditional (ARIMA, etc.) better when:

  • Pure time series (no extra features)
  • Strong seasonality
  • Need confidence intervals

Key Takeaway

Time series requires respecting temporal order in everything: splits, cross-validation, and feature creation. Create lag features and rolling statistics, add time-based features for seasonality, and always use TimeSeriesSplit for validation. Never let future information leak into your training data!

#Machine Learning #Time Series #Forecasting #Intermediate