
LSTM Networks: Solving Long-Term Dependencies

Learn how LSTM networks solve the vanishing gradient problem and enable learning of long-term dependencies in sequences.

Sarah Chen
December 19, 2025

Simple RNNs forget quickly. Try to remember something 100 steps ago? Nearly impossible. LSTMs (Long Short-Term Memory) fix this with a clever gating mechanism.

The Problem

In vanilla RNNs, information gets diluted at each step: during backpropagation through time, the gradient is multiplied by the recurrent weights and the activation derivative at every step, so it shrinks exponentially. After 20-30 steps, early inputs barely influence the output. This is the vanishing gradient problem.
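You can see the effect with a toy one-dimensional RNN. This is a minimal sketch (not from the original post): it multiplies the per-step derivatives of a tanh RNN with a single recurrent weight, which is exactly the factor backpropagation accumulates, and prints how the gradient collapses as the sequence grows.

```python
import numpy as np

def vanilla_rnn_gradient(seq_length, w_rec=0.9):
    """Product of per-step derivatives dh_t/dh_{t-1} for a 1-D tanh RNN.

    Each backprop step multiplies the gradient by w_rec * tanh'(h),
    a factor typically < 1, so the product shrinks exponentially.
    """
    rng = np.random.default_rng(0)
    h = 0.0
    grad = 1.0
    for _ in range(seq_length):
        pre = w_rec * h + rng.normal()   # pre-activation with random input
        h = np.tanh(pre)
        grad *= w_rec * (1 - h ** 2)     # chain rule through this step
    return grad

for T in (5, 20, 50, 100):
    print(T, vanilla_rnn_gradient(T))
```

With a recurrent weight around 0.9 the gradient is effectively zero well before 100 steps, which is the behaviour the rule of thumb above describes.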

LSTM's Solution: Memory Cell + Gates

LSTM adds a "memory highway" (cell state) that information can flow through unchanged. Three gates control what gets added, forgotten, or output:

1. **Forget Gate:** What to remove from memory
2. **Input Gate:** What new information to add
3. **Output Gate:** What to output from memory
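The "highway" behaviour comes from the cell-state update itself: the previous cell state is carried forward through element-wise gating and an addition, not through another weight matrix. Here is a minimal sketch of just that update (the full cell appears in the step-by-step code below):

```python
import numpy as np

def cell_state_update(c_prev, f, i, c_candidate):
    """Core LSTM update: c_prev flows through an element-wise gate and a sum,
    so the gradient along this path is just f, not a product of weight matrices."""
    return f * c_prev + i * c_candidate  # all arguments are same-shape arrays
```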

Visual Overview

```
┌─────────── Cell State ───────────┐
│       ×          +          ×    │
▼       │          │          │    ▼
  [Forget Gate] [Input Gate] [Output Gate]
        │          │    │          │
        └──────────┴────┴──────────┘
                   │
            [Hidden State]
```

LSTM Step by Step

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, weights):
    # Concatenate input and previous hidden state
    combined = np.concatenate([h_prev, x])
    # Forget gate: what to forget from cell state
    f = sigmoid(weights['Wf'] @ combined + weights['bf'])
    # Input gate: what new info to add
    i = sigmoid(weights['Wi'] @ combined + weights['bi'])
    # Candidate cell state
    c_candidate = np.tanh(weights['Wc'] @ combined + weights['bc'])
    # New cell state
    c = f * c_prev + i * c_candidate
    # Output gate: what to output
    o = sigmoid(weights['Wo'] @ combined + weights['bo'])
    # New hidden state
    h = o * np.tanh(c)
    return h, c
```
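To run this cell over a sequence, you need one weight matrix and bias per gate, sized to act on the concatenated `[h_prev, x]` vector. The following is a minimal sketch with randomly initialized weights; the sizes and initialization are illustrative, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
concat_size = hidden_size + input_size

# One weight matrix and bias per gate, acting on [h_prev, x]
weights = {}
for gate in ('f', 'i', 'c', 'o'):
    weights[f'W{gate}'] = rng.normal(scale=0.1, size=(hidden_size, concat_size))
    weights[f'b{gate}'] = np.zeros(hidden_size)

# Process a toy 100-step sequence, carrying h and c forward
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for t in range(100):
    x_t = rng.normal(size=input_size)
    h, c = lstm_cell(x_t, h, c, weights)

print(h.shape, c.shape)  # (16,) (16,)
```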

Using LSTM in Keras

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout

# Text classification with LSTM
# (vocab_size, max_length, X_train, y_train are assumed to be defined)
model = Sequential([
    Embedding(vocab_size, 128, input_length=max_length),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```
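The `vocab_size`, `max_length`, `X_train`, and `y_train` above come from whatever preprocessing pipeline you use. One common way to produce them, sketched here with tf.keras's `Tokenizer` and `pad_sequences` (the `texts` and `labels` values are toy placeholders):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 20000
max_length = 200

# Toy placeholder data; replace with your real corpus
texts = ["a great movie with a strong cast", "terrible plot and wooden acting"]
labels = [1, 0]

# Map words to integer ids, reserving an out-of-vocabulary token
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<unk>')
tokenizer.fit_on_texts(texts)

# Convert to integer sequences and pad/truncate to a fixed length
sequences = tokenizer.texts_to_sequences(texts)
X_train = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
y_train = np.array(labels)
```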

Stacked LSTMs

For complex patterns, stack multiple LSTM layers:

```python
model = Sequential([
    Embedding(vocab_size, 128),
    LSTM(128, return_sequences=True),   # Return full sequence for next layer
    LSTM(64, return_sequences=False),   # Only return final output
    Dense(1, activation='sigmoid')
])
```

LSTM for Time Series

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Prepare sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)

# stock_prices: 1-D array of past prices, assumed loaded earlier
X, y = create_sequences(stock_prices, seq_length=60)
X = X.reshape(-1, 60, 1)  # (samples, timesteps, features)

# Model
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(60, 1)),
    LSTM(50),
    Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=50, batch_size=32)
```
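Once the model above is trained, forecasting the next value just means feeding it the most recent window. A minimal sketch, again assuming `stock_prices` is a 1-D array of past prices:

```python
import numpy as np

# Take the last 60 observed prices as the input window
last_window = np.asarray(stock_prices[-60:]).reshape(1, 60, 1)

# Predict the next price (output shape is (1, 1))
next_price = model.predict(last_window)[0, 0]
print(f"Predicted next price: {next_price:.2f}")
```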

LSTM vs GRU

GRU (Gated Recurrent Unit) is a simpler alternative:

| LSTM | GRU |
|------|-----|
| 3 gates | 2 gates |
| More parameters | Fewer parameters |
| Better for long sequences | Often similar performance |
| Slower training | Faster training |
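Trying GRU is usually a one-line change in Keras, since `GRU` takes the same core arguments as `LSTM`. A minimal sketch of the text-classification model from earlier with the recurrent layer swapped out (`vocab_size` and `max_length` as defined before):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding, Dropout

model = Sequential([
    Embedding(vocab_size, 128, input_length=max_length),
    GRU(128, dropout=0.2, recurrent_dropout=0.2),  # LSTM(...) swapped for GRU(...)
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```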

Key Takeaway

LSTM solves the vanishing gradient problem through its cell state and gating mechanism. Use it for sequences where long-term dependencies matter: language, long time series, music. Start with a single LSTM layer, add stacking if needed. Consider GRU as a faster alternative.

#Machine Learning#Deep Learning#LSTM#RNN#Advanced