
LSTM Networks: Solving Long-Term Dependencies

Learn how LSTM networks solve the vanishing gradient problem and enable learning of long-term dependencies in sequences.

Sarah Chen
December 19, 2025


Simple RNNs forget quickly. Need to remember something from 100 steps ago? Nearly impossible. LSTMs (Long Short-Term Memory networks) address this with a clever gating mechanism.

The Problem

In vanilla RNNs, information gets diluted at each step: the same recurrent weights are applied over and over, so the gradients flowing back through time shrink exponentially. After 20-30 steps, early inputs barely influence the output, and the network gets almost no learning signal from them. This is the vanishing gradient problem.
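
To see the effect numerically, here is a minimal sketch (an illustration, not part of the original RNN code) that repeatedly multiplies a gradient by the transpose of a recurrent weight matrix, the way backpropagation through time does. The matrix size and its scaling to a largest singular value of 0.9 are assumptions chosen to make the shrinkage visible:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical recurrent weight matrix, scaled so its largest
# singular value is 0.9 (typical when activations saturate)
W = rng.standard_normal((32, 32))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]

grad = rng.standard_normal(32)  # gradient at the final timestep
for t in [1, 10, 20, 50, 100]:
    g = np.linalg.matrix_power(W.T, t) @ grad
    print(f"after {t:3d} steps, gradient norm ~ {np.linalg.norm(g):.2e}")

The norm decays roughly like 0.9**t, which is why a signal from 100 steps back contributes almost nothing to the weight update.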

LSTM's Solution: Memory Cell + Gates

LSTM adds a "memory highway" (cell state) that information can flow through unchanged. Three gates control what gets added, forgotten, or output:

  1. Forget Gate: What to remove from memory
  2. Input Gate: What new information to add
  3. Output Gate: What to output from memory

Visual Overview

        ┌─────────── Cell State ───────────┐
        │    ×           +           ×     │
        ▼    │           │           │     ▼
   [Forget Gate]   [Input Gate]   [Output Gate]
        │           │     │           │
        └───────────┴─────┴───────────┘
                    │
              [Hidden State]

LSTM Step by Step

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, weights):
    # Concatenate input and previous hidden state
    combined = np.concatenate([h_prev, x])
    
    # Forget gate: what to forget from cell state
    f = sigmoid(weights['Wf'] @ combined + weights['bf'])
    
    # Input gate: what new info to add
    i = sigmoid(weights['Wi'] @ combined + weights['bi'])
    
    # Candidate cell state
    c_candidate = np.tanh(weights['Wc'] @ combined + weights['bc'])
    
    # New cell state
    c = f * c_prev + i * c_candidate
    
    # Output gate: what to output
    o = sigmoid(weights['Wo'] @ combined + weights['bo'])
    
    # New hidden state
    h = o * np.tanh(c)
    
    return h, c
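
To try the cell, initialize random weights with the right shapes and step it over a short sequence. The sizes below (input dimension 10, hidden dimension 20) and the 0.1 scale are arbitrary choices for illustration:

rng = np.random.default_rng(42)
input_size, hidden_size = 10, 20

def init_weights(input_size, hidden_size):
    dim = input_size + hidden_size
    w = {g: rng.standard_normal((hidden_size, dim)) * 0.1 for g in ['Wf', 'Wi', 'Wc', 'Wo']}
    b = {g: np.zeros(hidden_size) for g in ['bf', 'bi', 'bc', 'bo']}
    return {**w, **b}

weights = init_weights(input_size, hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)

# Step the cell over a random 5-timestep sequence
for x in rng.standard_normal((5, input_size)):
    h, c = lstm_cell(x, h, c, weights)

print(h.shape, c.shape)  # (20,) (20,)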

Using LSTM in Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout

# Text classification with LSTM
model = Sequential([
    Embedding(vocab_size, 128, input_length=max_length),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
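
The model above assumes vocab_size, max_length, and the training arrays already exist. One way to produce them from raw text is with Keras's Tokenizer and pad_sequences; the toy texts, labels, and the 10,000/200 limits below are placeholders, not from the original post:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

texts = ["great movie, loved it", "terrible plot and acting",
         "pretty good overall", "not worth watching"]
labels = [1, 0, 1, 0]

vocab_size, max_length = 10000, 200
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<unk>")
tokenizer.fit_on_texts(texts)

X_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_length)
y_train = np.array(labels)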

Stacked LSTMs

For complex patterns, stack multiple LSTM layers:

model = Sequential([
    Embedding(vocab_size, 128),
    LSTM(128, return_sequences=True),   # Return full sequence for next layer
    LSTM(64, return_sequences=False),    # Only return final output
    Dense(1, activation='sigmoid')
])
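
If it is unclear what return_sequences changes, this small check (with an assumed vocabulary of 5,000 and a random batch of padded sequences) prints the output shape of each setting:

import numpy as np
from tensorflow.keras.layers import Embedding, LSTM

x = np.random.randint(0, 5000, size=(8, 40))  # batch of 8 sequences, length 40
emb = Embedding(5000, 128)(x)

print(LSTM(128, return_sequences=True)(emb).shape)   # (8, 40, 128): one vector per timestep
print(LSTM(64, return_sequences=False)(emb).shape)   # (8, 64): final hidden state only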

LSTM for Time Series

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import numpy as np

# Prepare sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)

X, y = create_sequences(stock_prices, seq_length=60)
X = X.reshape(-1, 60, 1)  # (samples, timesteps, features)

# Model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(60, 1)),
    LSTM(50),
    Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=50, batch_size=32)
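
To forecast one step ahead after training, reshape the most recent 60 observations into the same (samples, timesteps, features) layout; stock_prices is whatever 1-D array the sequences above were built from:

last_window = np.array(stock_prices[-60:]).reshape(1, 60, 1)
next_value = model.predict(last_window)[0, 0]
print(f"Predicted next value: {next_value:.2f}")

In practice you would also scale the series (for example with MinMaxScaler) before training and invert the scaling on the prediction.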

LSTM vs GRU

GRU (Gated Recurrent Unit) is a simpler alternative:

LSTM                          GRU
3 gates                       2 gates
More parameters               Fewer parameters
Better for long sequences     Often similar performance
Slower training               Faster training
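
Since the Keras layer interfaces match, swapping is usually a one-line change. Here is a sketch of the earlier text-classification model with GRU in place of LSTM (same assumed vocab_size and max_length):

from tensorflow.keras.layers import GRU

model = Sequential([
    Embedding(vocab_size, 128, input_length=max_length),
    GRU(128, dropout=0.2, recurrent_dropout=0.2),  # drop-in replacement for the LSTM layer
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])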

Key Takeaway

LSTM solves the vanishing gradient problem through its cell state and gating mechanism. Use it for sequences where long-term dependencies matter: language, long time series, music. Start with a single LSTM layer, add stacking if needed. Consider GRU as a faster alternative.

#Machine Learning#Deep Learning#LSTM#RNN#Advanced