LSTM Networks: Solving Long-Term Dependencies
Learn how LSTM networks solve the vanishing gradient problem and enable learning of long-term dependencies in sequences.
Simple RNNs forget quickly. Need to remember something from 100 steps back? Nearly impossible. LSTMs (Long Short-Term Memory networks) fix this with a clever gating mechanism.
The Problem
In vanilla RNNs, information gets diluted at every step: during backpropagation the gradient is multiplied by the recurrent weights again and again, so it shrinks exponentially with sequence length. After 20-30 steps, early inputs barely influence the output. This is the vanishing gradient problem.
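To see the effect, here is a toy sketch (the 16-unit size, the 0.1 weight scale, and the 0.5 stand-in for the tanh derivative are all made-up illustration values): backpropagating through T steps multiplies the gradient by roughly the same small factor T times, so its norm collapses.
import numpy as np

np.random.seed(0)
W = np.random.randn(16, 16) * 0.1        # small recurrent weight matrix (toy value)
grad = np.random.randn(16)               # gradient arriving at the last timestep

for T in [1, 10, 50, 100]:
    g = grad.copy()
    for _ in range(T):
        g = W.T @ g * 0.5                # 0.5 stands in for a typical tanh'(x)
    print(T, np.linalg.norm(g))          # the norm shrinks toward zero as T grows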
LSTM's Solution: Memory Cell + Gates
LSTM adds a "memory highway" (cell state) that information can flow through unchanged. Three gates control what gets added, forgotten, or output:
- Forget Gate: What to remove from memory
- Input Gate: What new information to add
- Output Gate: What to output from memory
Visual Overview
c_prev ──── × ───────── + ─────────┬─────▶ c_new   (Cell State)
            ▲           ▲          │
     [Forget Gate] [Input Gate]  tanh
            ▲           ▲          │
            └[h_prev, x]┘          × ◀── [Output Gate]
                                   │
                                   ▼
                            [Hidden State]
LSTM Step by Step
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, weights):
    # Concatenate previous hidden state and current input
    combined = np.concatenate([h_prev, x])
    # Forget gate: what to forget from the cell state
    f = sigmoid(weights['Wf'] @ combined + weights['bf'])
    # Input gate: what new info to add
    i = sigmoid(weights['Wi'] @ combined + weights['bi'])
    # Candidate cell state
    c_candidate = np.tanh(weights['Wc'] @ combined + weights['bc'])
    # New cell state
    c = f * c_prev + i * c_candidate
    # Output gate: what to output
    o = sigmoid(weights['Wo'] @ combined + weights['bo'])
    # New hidden state
    h = o * np.tanh(c)
    return h, c
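To check the shapes, here is a toy run of the cell above (the 4-dimensional input, the 8-unit hidden size, and the random weight initialization are made-up example values):
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
weights = {}
for gate in ['f', 'i', 'c', 'o']:
    weights[f'W{gate}'] = rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
    weights[f'b{gate}'] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):  # feed 5 timesteps one by one
    h, c = lstm_cell(x, h, c, weights)
print(h.shape, c.shape)                        # (8,) (8,)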
Using LSTM in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout

# Text classification with LSTM
model = Sequential([
    Embedding(vocab_size, 128, input_length=max_length),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
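The snippet above assumes vocab_size, max_length, X_train, and y_train already exist. One way to get them, as a sketch using the built-in IMDB review dataset (the 10,000-word vocabulary and length 200 are arbitrary choices):
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_length = 10000, 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=max_length)  # pad/truncate to a fixed length
X_test = pad_sequences(X_test, maxlen=max_length)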
Stacked LSTMs
For complex patterns, stack multiple LSTM layers:
model = Sequential([
    Embedding(vocab_size, 128),
    LSTM(128, return_sequences=True),   # Return the full sequence for the next layer
    LSTM(64, return_sequences=False),   # Only return the final output
    Dense(1, activation='sigmoid')
])
LSTM for Time Series
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import numpy as np

# Prepare sliding-window sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)

X, y = create_sequences(stock_prices, seq_length=60)
X = X.reshape(-1, 60, 1)  # (samples, timesteps, features)

# Model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(60, 1)),
    LSTM(50),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=50, batch_size=32)
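Once trained, the model can forecast the next value from the most recent window. This assumes stock_prices is the same 1-D array used above; in practice you would also scale the values before training.
last_window = np.array(stock_prices[-60:]).reshape(1, 60, 1)  # most recent 60 steps
next_value = model.predict(last_window)[0, 0]                 # single-step forecast
print(next_value)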
LSTM vs GRU
GRU (Gated Recurrent Unit) is a simpler alternative:
| LSTM | GRU |
|---|---|
| 3 gates | 2 gates |
| More parameters | Fewer parameters |
| Better for long sequences | Often similar performance |
| Slower training | Faster training |
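In Keras, trying GRU is usually a one-line swap. A sketch of the earlier text-classification model with the LSTM layer replaced:
from tensorflow.keras.layers import GRU

model = Sequential([
    Embedding(vocab_size, 128),
    GRU(128, dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])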
Key Takeaway
LSTM addresses the vanishing gradient problem through its cell state and gating mechanism. Use it for sequences where long-term dependencies matter: language, long time series, music. Start with a single LSTM layer and stack more only if needed. Consider GRU as a faster, lighter alternative.