Time Series Basics for Machine Learning

Learn the fundamentals of working with time series data in machine learning.

Sarah Chen
December 19, 2025

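Time series data has a twist: the order of the observations matters. You can't shuffle it or split it at random the way you would with independent samples, because each value depends on what came before it. Let's walk through the fundamentals.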

What Makes Time Series Special

1. **Temporal order matters** - Can't shuffle data
2. **Autocorrelation** - Values correlate with past values
3. **Trends** - Long-term increase/decrease
4. **Seasonality** - Repeating patterns (daily, weekly, yearly)
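
These properties can be checked numerically before any modeling. The sketch below is a minimal illustration on a synthetic daily series; the trend slope, weekly cycle, and noise level are made-up assumptions, not values from a real dataset. It uses pandas' `Series.autocorr` to compare the correlation at a few lags, then repeats the check after shuffling to show what is lost when order is ignored.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative assumption): linear trend + weekly cycle + noise
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=140, freq="D")
t = np.arange(len(idx))
series = pd.Series(
    0.05 * t + 2.0 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, len(idx)),
    index=idx,
)

# Autocorrelation: correlation between the series and a copy of itself shifted by `lag` steps
acf_1 = series.autocorr(lag=1)
acf_3 = series.autocorr(lag=3)  # about half a weekly cycle
acf_7 = series.autocorr(lag=7)  # one full weekly cycle

# Shuffling keeps the same values but destroys the temporal structure
shuffled = series.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_acf_7 = shuffled.autocorr(lag=7)

print(f"lag 1: {acf_1:.2f}, lag 3: {acf_3:.2f}, lag 7: {acf_7:.2f}")
print(f"lag 7 after shuffling: {shuffled_acf_7:.2f}")
```

On a series like this, the lag-7 value should come out higher than the lag-3 value, because the weekly pattern lines up again after seven days, while the shuffled copy should land close to zero.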

Loading and Exploring

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load with datetime index
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')

# Quick visualization
df['value'].plot(figsize=(12, 4))
plt.title('Time Series Data')
plt.show()

# Check for trends and seasonality
# period=12 assumes monthly data with a yearly cycle; adjust it to your data's frequency
decomposition = seasonal_decompose(df['value'], period=12)
decomposition.plot()
plt.show()
```

Splitting Time Series Data

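**NEVER use random splits!** A random split puts observations from the future into the training set, so the model gets evaluated on a period it has effectively already seen.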

```python
# Correct: split by time (assumes df is sorted by date)
train_size = int(len(df) * 0.8)
train = df.iloc[:train_size]   # earliest 80% of rows
test = df.iloc[train_size:]    # most recent 20% of rows

# Visual check
plt.plot(train.index, train['value'], label='Train')
plt.plot(test.index, test['value'], label='Test')
plt.legend()
plt.show()
```
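
The same idea can also be expressed with a calendar cutoff instead of a row count, which is sometimes easier to reason about. A minimal sketch, assuming a DataFrame with a DatetimeIndex; `split_by_date` and the toy data are hypothetical names introduced here for illustration:

```python
import pandas as pd

def split_by_date(df, cutoff):
    """Rows strictly before `cutoff` go to train, the rest go to test."""
    cutoff = pd.Timestamp(cutoff)
    df = df.sort_index()  # make sure rows are in time order first
    train = df[df.index < cutoff]
    test = df[df.index >= cutoff]
    return train, test

# Toy data: ten consecutive days
toy = pd.DataFrame(
    {"value": range(10)},
    index=pd.date_range("2025-01-01", periods=10, freq="D"),
)
train, test = split_by_date(toy, "2025-01-08")

print(len(train), len(test))
print(train.index.max(), "<", test.index.min())
```

Whichever form is used, the property to verify is that the latest training timestamp comes before the earliest test timestamp.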

Creating Lag Features

Use past values as features for predicting future ones:

```python
def create_lag_features(df, column, lags):
    """Add one column per lag, holding the value from `lag` steps earlier."""
    df_features = df.copy()
    for lag in lags:
        df_features[f'{column}_lag_{lag}'] = df[column].shift(lag)
    # The first max(lags) rows have no complete history, so drop them
    return df_features.dropna()

# Create lags 1-7
df_features = create_lag_features(df, 'value', range(1, 8))

# Features: value at t-1, t-2, ..., t-7
# Target: value at t
```
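
To see what `shift` does to the rows, here is a small sketch on a five-row toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)

# shift(k) moves values k rows down, so row t holds the value from t-k
toy["value_lag_1"] = toy["value"].shift(1)
toy["value_lag_2"] = toy["value"].shift(2)

# The first two rows have no complete history (NaN) and are dropped
lagged = toy.dropna()
print(lagged)
```

Three rows remain (January 3 to 5). On January 3 the target is 30, with lag 1 equal to 20 and lag 2 equal to 10, so every feature comes from an earlier day than the target.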

Rolling Statistics

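```python
# Moving average and standard deviation over the previous 7 observations.
# shift(1) keeps the current value out of the window, so these columns are
# safe to use as features for predicting 'value' at the same row.
past = df['value'].shift(1)
df['rolling_mean_7'] = past.rolling(window=7).mean()
df['rolling_std_7'] = past.rolling(window=7).std()

# Exponential moving average (more weight on recent values)
df['ema_7'] = past.ewm(span=7).mean()
```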
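
One detail matters when rolling statistics feed a model that predicts the value at the same row: a plain `rolling(window=n)` ends at the current row, so it includes the very value being predicted. A minimal sketch on a toy series, comparing a window that includes the current row with one built from past rows only:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window ends at the current row: the mean at row t includes the value at t
including_current = s.rolling(window=3).mean()

# Shift first: the mean at row t covers only rows t-3, t-2, t-1
past_only = s.shift(1).rolling(window=3).mean()

print(pd.DataFrame({
    "value": s,
    "including_current": including_current,
    "past_only": past_only,
}))
```

At the last row (value 5), the first version averages 3, 4, 5 while the second averages 2, 3, 4. For plotting or exploration either is fine; as a model input, the past-only version avoids handing the model part of its own target.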

Time-Based Features

```python
# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['is_weekend'] = df.index.dayofweek.isin([5, 6]).astype(int)
df['quarter'] = df.index.quarter
```
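
Integer calendar features treat Sunday (6) and Monday (0) as far apart even though they are adjacent days. One optional variant, not something the code above requires, is to map the day of week onto a circle with sine and cosine. A minimal sketch; the column names `dow_sin` and `dow_cos` are hypothetical:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2025-01-06", periods=7, freq="D")  # 2025-01-06 is a Monday
dow = idx.dayofweek.to_numpy()

# Place each day on the unit circle so the week wraps around
angle = 2 * np.pi * dow / 7
feats = pd.DataFrame(
    {"day_of_week": dow, "dow_sin": np.sin(angle), "dow_cos": np.cos(angle)},
    index=idx,
)

# Distances in the encoded space: Monday-Sunday vs. Monday-Thursday
points = feats[["dow_sin", "dow_cos"]].to_numpy()
mon_sun = float(np.linalg.norm(points[0] - points[6]))
mon_thu = float(np.linalg.norm(points[0] - points[3]))

print(f"Monday-Sunday: {mon_sun:.2f}, Monday-Thursday: {mon_thu:.2f}")
```

Tree-based models such as random forests can often work with the raw integers, so whether this encoding helps is something to check on your own data.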

Cross-Validation for Time Series

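Regular k-fold CV trains on data that comes after the validation fold, which leaks future information. Use TimeSeriesSplit, where each fold trains on the past and validates on the period that follows: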

```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# X, y: feature DataFrame and target Series in time order
# model: any scikit-learn regressor
tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X):
    # .iloc selects rows by position; X[train_idx] would look up columns on a DataFrame
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # Train and evaluate

# With cross_val_score
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
```
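
Printing the fold indices on a tiny array makes the expanding-window behavior concrete. A minimal sketch with 12 observations and 3 splits (both numbers chosen only to keep the output short):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 observations in time order

folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")

# fold 1: train=[0, 1, 2] test=[3, 4, 5]
# fold 2: train=[0, 1, 2, 3, 4, 5] test=[6, 7, 8]
# fold 3: train=[0, 1, 2, 3, 4, 5, 6, 7, 8] test=[9, 10, 11]
```

In every fold the test indices come strictly after the training indices, and the training window grows from one fold to the next.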

Simple Forecasting with ML

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Prepare features and target
X = df_features.drop('value', axis=1)
y = df_features['value']

# Time-based split
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# Train model (fixed seed so results are reproducible)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"RMSE: {mean_squared_error(y_test, predictions)**0.5:.2f}")
```
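
An MAE on its own is hard to judge. One common sanity check, offered here as a suggestion rather than part of the pipeline above, is to compare against a naive persistence forecast that predicts each value as the previous one. The sketch below is self-contained and uses a synthetic series (weekly cycle plus noise, both assumptions for illustration) with the same lag-feature setup:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic series (illustrative assumption): weekly cycle plus noise
rng = np.random.default_rng(42)
t = np.arange(300)
data = pd.DataFrame(
    {"value": 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, len(t))},
    index=pd.date_range("2024-01-01", periods=len(t), freq="D"),
)
for lag in range(1, 8):
    data[f"value_lag_{lag}"] = data["value"].shift(lag)
data = data.dropna()

# Time-based split, as in the section above
X, y = data.drop("value", axis=1), data["value"]
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

# Persistence baseline: predict that today's value equals yesterday's value
naive_mae = mean_absolute_error(y_test, X_test["value_lag_1"])

print(f"Model MAE: {model_mae:.2f}")
print(f"Naive MAE: {naive_mae:.2f}")
```

If the model does not clearly beat the naive number on your data, the extra complexity may not be paying for itself.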

Common Mistakes

1. **Random train/test split** - Future leaks into training
2. **Using future data as features** - Check your lags!
3. **Ignoring seasonality** - Add time-based features
4. **Wrong evaluation** - Use time-aware CV
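
The second mistake can be tested mechanically. One possible approach, sketched below under the assumption that features at row t should depend only on values before t, is to alter the series from some row onward and confirm the features up to that row do not change. The helper `leaks_future` and both feature builders are hypothetical names written for this illustration:

```python
import numpy as np
import pandas as pd

def leaks_future(build_features, series, cut):
    """Return True if features for rows 0..cut change when values from row `cut` onward change.

    Features at row t should depend only on values before t, so altering the
    series from `cut` onward must leave rows 0..cut of the features untouched.
    """
    altered = series.copy()
    altered.iloc[cut:] = altered.iloc[cut:] + 1000.0
    original = build_features(series).iloc[: cut + 1]
    perturbed = build_features(altered).iloc[: cut + 1]
    return not original.equals(perturbed)

def safe_features(s):
    # Lag and shifted rolling mean: both use only values before the current row
    return pd.DataFrame({"lag_1": s.shift(1), "roll_mean_3": s.shift(1).rolling(3).mean()})

def leaky_features(s):
    # Unshifted rolling mean includes the current row's value, i.e. the target
    return pd.DataFrame({"roll_mean_3": s.rolling(3).mean()})

series = pd.Series(np.arange(20, dtype=float))
print(leaks_future(safe_features, series, cut=10))   # False
print(leaks_future(leaky_features, series, cut=10))  # True
```

A check like this only covers leakage through the target series itself; features built from other columns would need the same treatment.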

When to Use ML vs Traditional Methods

**ML approaches work when:**
- Multiple features available
- Complex non-linear patterns
- External variables matter

**Traditional methods (ARIMA, etc.) work better when:**
- Pure time series (no extra features)
- Strong seasonality
- Need confidence intervals

Key Takeaway

Time series requires respecting temporal order in everything: splits, cross-validation, and feature creation. Create lag features and rolling statistics, add time-based features for seasonality, and always use TimeSeriesSplit for validation. Never let future information leak into your training data!

#Machine Learning #Time Series #Forecasting #Intermediate