ML · 9 min read

Understanding Gradient Descent

Learn how gradient descent works - the optimization algorithm behind most ML models.

Sarah Chen
December 19, 2025

Gradient descent is how most machine learning models "learn." It is an iterative algorithm that adjusts a model's parameters, step by step, to reduce the model's error.

The Big Picture

Imagine you're blindfolded on a hilly terrain, trying to find the lowest point (valley).

Strategy:

  1. Feel which direction goes downhill
  2. Take a step in that direction
  3. Repeat until you can't go lower

That's gradient descent!

The Math (Simple Version)

We have a loss function that measures how bad our model is:

Loss = f(weights)

Goal: Find weights that minimize this loss.

Gradient = Direction of steepest increase
Negative gradient = Direction of steepest decrease
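To make "direction of steepest increase" concrete, here is a minimal sketch that estimates a gradient numerically. The loss `(w - 3)^2` and the helper `numerical_gradient` are assumptions chosen for illustration, not part of any library.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3
    return (w - 3.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central difference approximation of df/dw
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 5.0
grad = numerical_gradient(loss, w)

# The gradient is positive here (about 4), so the loss rises as w increases
print(grad)

# Moving against the gradient therefore lowers the loss
print(loss(w - 0.1 * grad) < loss(w))
```

In practice, frameworks compute gradients analytically or by automatic differentiation. A finite-difference estimate like this is mostly useful for checking that a hand-written gradient is correct.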

Update rule:

new_weight = old_weight - learning_rate × gradient
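As a rough illustration, the update rule can be applied repeatedly to a single weight. The loss `(w - 3)^2`, its gradient `2 * (w - 3)`, the starting point, and the step count are all assumptions made for this sketch.

```python
def gradient(w):
    # Analytic gradient of the hypothetical loss (w - 3)^2
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0          # arbitrary starting weight
history = [w]

for step in range(50):
    # new_weight = old_weight - learning_rate * gradient
    w = w - learning_rate * gradient(w)
    history.append(w)

print(history[:4])   # the first few weights move toward 3
print(abs(w - 3.0))  # remaining distance to the minimum after 50 steps
```

With these settings, each step shrinks the distance to the minimum by the same factor (0.8), which is why the weights close in on 3 quickly at first and then more slowly.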

Visual Example

Loss
  │╲
  │ ╲     ← Start here
  │  ╲
  │   ╲   ← Take steps down
  │    ╲
  │     ╲___ ← Minimum!
  └────────────
        weights

Learning Rate

The learning rate (α, written as learning_rate in the update rule) controls the step size.

Too small:              Too big:                Just right:
Loss                    Loss                    Loss
  │╲.                     │╲    ╱╲                │╲
  │ ╲...                  │ ╲  ╱  ╲  ╱            │ ╲
  │  ╲....                │  ╲╱    ╲╱             │  ╲
  │   ╲.....              │                       │   ╲___
  └──────────             └──────────             └────────
  Takes forever           Overshoots!             Converges

Typical values: 0.001 to 0.1 (the best choice depends on the model, the data, and the optimizer)
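The three regimes can be reproduced on a toy problem. This sketch assumes the loss `w^2` (gradient `2w`) and three hand-picked learning rates. The helper `run_gd` is hypothetical, and the exact thresholds differ for real models.

```python
def run_gd(learning_rate, steps=20, w0=10.0):
    """Minimize the hypothetical loss w^2 (gradient 2w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {run_gd(lr):.4g}")
```

On this loss, the smallest rate barely moves away from the starting value of 10, the middle rate gets close to the minimum at 0, and the largest rate overshoots further on every step, so the weight grows instead of shrinking.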

Types of Gradient Descent

1. Batch Gradient Descent

Uses ALL of the data to compute the gradient at each step.

for epoch in range(num_epochs):
    gradient = compute_gradient(ALL_data)
    weights = weights - learning_rate * gradient

✅ Smooth convergence
❌ Slow for large datasets
❌ Needs all data in memory

2. Stochastic Gradient Descent (SGD)

Uses ONE sample per step.

for epoch in range(num_epochs):
    shuffle(data)
    for sample in data:
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient

✅ Fast updates
✅ Noise can help escape shallow local minima
❌ Noisy, zigzags a lot

3. Mini-Batch Gradient Descent (Most Common!)

Uses a BATCH of samples (e.g., 32, 64, 128).

batch_size = 32
for epoch in range(num_epochs):
    for batch in get_batches(data, batch_size):
        gradient = compute_gradient(batch)
        weights = weights - learning_rate * gradient

✅ Balances the stability of batch updates with the speed of SGD
✅ Efficient GPU usage
✅ Reasonably smooth
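The loops above are pseudocode. A runnable version for linear regression might look like the sketch below. The function `minibatch_gd`, the synthetic data, and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, learning_rate=0.1, num_epochs=50, seed=0):
    """Fit y ≈ X @ weights with mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights = np.zeros((n_features, 1))
    for _ in range(num_epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ weights - y_batch
            gradient = (2 / len(idx)) * X_batch.T @ error
            weights = weights - learning_rate * gradient
    return weights

# Synthetic data, assumed for illustration: y = 4 + 3x + noise
rng = np.random.default_rng(42)
x = 2 * rng.random((200, 1))
y = 4 + 3 * x + 0.5 * rng.standard_normal((200, 1))
X = np.c_[np.ones((200, 1)), x]  # column of ones for the bias term

weights = minibatch_gd(X, y, batch_size=32)
print(weights.ravel())  # should land near the true values 4 and 3
```

The same function covers the other two variants: `batch_size=1` gives stochastic gradient descent, and `batch_size=len(X)` gives batch gradient descent.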

Code Example: Linear Regression with GD

import numpy as np

# Generate data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1) * 0.5

# Add bias term
X_b = np.c_[np.ones((100, 1)), X]

# Gradient Descent
learning_rate = 0.1
n_iterations = 1000
m = len(X)

theta = np.random.randn(2, 1)  # Random initialization

for iteration in range(n_iterations):
    # Predictions
    predictions = X_b.dot(theta)
    
    # Error
    error = predictions - y
    
    # Gradient
    gradient = (2/m) * X_b.T.dot(error)
    
    # Update
    theta = theta - learning_rate * gradient

print(f"Learned parameters: {theta.flatten()}")
# Should be close to [4, 3]
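One way to sanity-check a gradient descent result on linear regression is to compare it with a direct least-squares solve. This sketch regenerates the same data and uses `np.linalg.lstsq` purely as a reference point. The tolerance is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1) * 0.5
X_b = np.c_[np.ones((100, 1)), X]

# Direct least-squares solution, used here only as a reference
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Same gradient descent loop as in the example above
theta = np.random.randn(2, 1)
for _ in range(1000):
    gradient = (2 / len(X_b)) * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - 0.1 * gradient

print(theta_exact.ravel())
print(np.allclose(theta, theta_exact, atol=1e-6))
```

If the two answers disagree noticeably, the usual suspects are a learning rate that is too large or too small, too few iterations, or a bug in the gradient.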

Convergence Issues

Problem 1: Local Minima

Loss
  │  ╱╲    ╱╲
  │ ╱  ╲__╱  ╲
  │╱    ↑      ╲
  └──────────────
       Stuck here (local minimum)
       Not the global minimum!

Solutions:

  • Random initialization (try multiple starting points)
  • Momentum (helps roll past local minima)
  • SGD noise can help escape
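The "multiple starting points" idea can be sketched on a one-dimensional loss with two valleys. The loss `w^4 - 3w^2 + w` is an assumption picked because it has a local minimum near w = 1.13 and a lower, global one near w = -1.30. A fixed list of starting points stands in for random draws so the run is reproducible.

```python
def loss(w):
    # Hypothetical non-convex loss with two valleys
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learning_rate=0.01, steps=500):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

starts = [-2.0, -0.5, 0.5, 2.0]
results = [descend(w0) for w0 in starts]

for w0, w in zip(starts, results):
    print(f"start {w0:+.1f} -> w = {w:+.3f}, loss = {loss(w):.3f}")

# Keep whichever run ended with the lowest loss
best = min(results, key=loss)
print(f"best w = {best:+.3f}")
```

Runs that start on the right side settle in the shallower valley, while runs that start on the left reach the deeper one. Keeping the best of several runs avoids depending on a single lucky start.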

Problem 2: Vanishing/Exploding Gradients

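In deep networks, gradients can become very small or very large as they are propagated backward through many layers. Tiny gradients stall learning in the early layers, while huge gradients make updates unstable.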

Solutions:

  • Proper weight initialization
  • Batch normalization
  • Gradient clipping
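Of these, gradient clipping is the easiest to show in a few lines. This is a minimal sketch of clipping by L2 norm on a plain list of floats. The helper name `clip_by_norm` and the threshold are assumptions, and deep learning frameworks ship their own clipping utilities.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale a gradient (list of floats) so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]

# Norm 5 is rescaled to norm 1; the direction is preserved
print(clip_by_norm([3.0, 4.0], max_norm=1.0))

# Norm 0.5 is below the threshold, so nothing changes
print(clip_by_norm([0.3, 0.4], max_norm=1.0))
```

Clipping limits the size of a single update without changing its direction, which is why it helps with exploding gradients but does nothing for vanishing ones.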

Problem 3: Saddle Points

        ↑ Up in one direction
        │
─────╲──┼──╱───── ← Flat here
        │
        ↓ Down in another direction

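Solution: Advanced optimizers (Adam, RMSprop), whose momentum and per-parameter step sizes help keep the weights moving when gradients are close to zero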
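To see why a saddle point is a trap for plain gradient descent, consider the hypothetical surface `f(x, y) = x^2 - y^2`, which curves up along x and down along y. This sketch does not use Adam or RMSprop. It only shows that a run started exactly on the flat axis stalls, while a tiny nudge is enough to escape.

```python
def descend_saddle(x, y, learning_rate=0.1, steps=100):
    """Plain gradient descent on the hypothetical surface f(x, y) = x^2 - y^2."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x -= learning_rate * grad_x
        y -= learning_rate * grad_y
    return x, y

# Started with y exactly 0, the y-gradient is always 0: the run ends at the saddle
stuck = descend_saddle(1.0, 0.0)
print(stuck)

# A tiny offset in y grows every step, so the run slides off the saddle
escaped = descend_saddle(1.0, 1e-6)
print(escaped)
```

This toy surface is unbounded below, so the escaped run simply keeps descending. A real loss has a floor, but the mechanism is the same: near a saddle the gradient is tiny, and progress depends on whatever small push exists along the downhill direction.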

Advanced Optimizers

Momentum

Adds "velocity" to updates:

velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity

Updates accelerate along directions where successive gradients agree and are damped where they keep flipping sign.
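A small experiment can show the effect. This sketch applies the two update lines above to an assumed elongated bowl, `0.5 * (a^2 + 25 * b^2)`, once with momentum 0 (plain gradient descent) and once with momentum 0.9. The loss, starting point, and step count are illustrative choices.

```python
def loss(w):
    a, b = w
    return 0.5 * (a * a + 25.0 * b * b)

def gradient(w):
    a, b = w
    return [a, 25.0 * b]

def descend(momentum, learning_rate=0.02, steps=100):
    """Run the momentum update rule and return the final loss."""
    w = [10.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        velocity = [momentum * v - learning_rate * gi for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return loss(w)

plain_loss = descend(momentum=0.0)
momentum_loss = descend(momentum=0.9)
print(f"plain GD loss:   {plain_loss:.6f}")
print(f"momentum loss:   {momentum_loss:.6f}")
```

With the same learning rate and step budget, the momentum run ends at a much lower loss here because velocity builds up along the shallow `a` direction, where plain gradient descent crawls.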

Adam (Most Popular!)

Adam adapts the learning rate for each parameter, using running averages of the gradient and of its square:

from keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)  # Default settings work well

A strong default choice for many problems, especially neural networks.
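For intuition about what the optimizer does internally, here is a minimal single-parameter sketch of the standard Adam update. It is not the Keras implementation. The function name `adam_minimize` and the toy loss `(w - 3)^2` are assumptions, and the learning rate of 0.1 is deliberately larger than the usual 0.001 so the toy problem converges in a few hundred steps.

```python
import math

def adam_minimize(gradient_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimal single-parameter Adam loop (illustrative sketch)."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for starting at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical loss (w - 3)^2 with gradient 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 3))  # should end up close to 3
```

Dividing by the square root of `v_hat` normalizes the step, so its size depends on how consistent recent gradients have been rather than on their raw magnitude.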

Sklearn Gradient Descent

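from sklearn.linear_model import SGDClassifier, SGDRegressor

# X_train and y_train are assumed to hold your training features and labels

# Classification (loss='log_loss' needs scikit-learn 1.1+; older releases call it 'log')
clf = SGDClassifier(loss='log_loss', learning_rate='adaptive', eta0=0.01)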
clf.fit(X_train, y_train)

# Regression
reg = SGDRegressor(learning_rate='adaptive', eta0=0.01)
reg.fit(X_train, y_train)
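The snippet above assumes `X_train` and `y_train` already exist. A self-contained variant might look like the sketch below, which generates synthetic data (an assumption for illustration) and puts a `StandardScaler` in front of the regressor, since SGD-based estimators are sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration: y = 4 + 3x + noise
rng = np.random.default_rng(0)
X = 2 * rng.random((500, 1))
y = 4 + 3 * X[:, 0] + 0.5 * rng.standard_normal(500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale features first, then fit the SGD-based regressor
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(learning_rate='adaptive', eta0=0.01, random_state=0),
)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.3f}")
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training split only and then reused at prediction time.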

Key Hyperparameters

Parameter       What It Does             Typical Values
Learning rate   Step size                0.001 - 0.1
Batch size      Samples per update       32, 64, 128
Epochs          Passes through data      10 - 1000
Momentum        Velocity accumulation    0.9

Tips for Success

  1. Normalize features: Gradient descent works better when features are on similar scales
  2. Start with Adam: It's robust and works well by default
  3. Monitor training: Plot loss over time to ensure convergence
  4. Use a learning rate schedule: Start high, then decrease over time
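The last tip can be sketched with a simple inverse-time decay. The function `inverse_time_decay`, its constants, and the toy loss `w^2` are assumptions for illustration. Libraries offer several built-in schedules with their own parameter names.

```python
def inverse_time_decay(initial_rate, decay, step):
    """Hypothetical schedule: the rate shrinks as 1 / (1 + decay * step)."""
    return initial_rate / (1.0 + decay * step)

w = 10.0
for step in range(200):
    lr = inverse_time_decay(initial_rate=0.4, decay=0.05, step=step)
    w = w - lr * 2.0 * w  # gradient of the hypothetical loss w^2 is 2w

print(inverse_time_decay(0.4, 0.05, 0))    # rate at the first step
print(inverse_time_decay(0.4, 0.05, 100))  # much smaller rate later on
print(abs(w))                              # distance from the minimum at 0
```

Large early steps cover ground quickly, and the shrinking rate later lets the weights settle instead of bouncing around the minimum.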

Key Takeaway

Gradient Descent is iterative optimization:

  1. Compute how wrong you are (loss)
  2. Compute which direction improves things (gradient)
  3. Take a small step in that direction
  4. Repeat until good enough

It's simple in concept but powers everything from linear regression to deep neural networks!

Tags: Machine Learning, Gradient Descent, Optimization, Beginner