# Understanding Gradient Descent

Learn how gradient descent works: the optimization algorithm behind most ML models.

Gradient descent is how machines "learn." It's the algorithm that finds the best parameters for your model.
## The Big Picture
Imagine you're blindfolded on hilly terrain, trying to find the lowest point (a valley).

Strategy:

- Feel which direction goes downhill
- Take a step in that direction
- Repeat until you can't go lower

That's gradient descent!
## The Math (Simple Version)
We have a loss function that measures how bad our model is:

```
Loss = f(weights)
```

Goal: find the weights that minimize this loss.

- Gradient = direction of steepest increase
- Negative gradient = direction of steepest decrease

Update rule:

```
new_weight = old_weight - learning_rate × gradient
```
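To make the rule concrete, here is one full run on a toy one-dimensional loss f(w) = (w - 3)², chosen purely for illustration, whose gradient is 2(w - 3):

```python
# Toy example: minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3)

w = 0.0  # starting weight
learning_rate = 0.1

for step in range(100):
    w = w - learning_rate * gradient(w)  # the update rule

print(round(w, 4))  # 3.0: the weight converges to the minimum
```

Each step shrinks the distance to the minimum by a constant factor (here 0.8), so the weight homes in on 3 geometrically.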
## Visual Example

```
Loss
│╲
│ ╲  ← Start here
│  ╲
│   ╲  ← Take steps down
│    ╲
│     ╲___  ← Minimum!
└────────────
      weights
```
## Learning Rate

The learning rate (α) controls the step size.

- Too small: the loss creeps down slowly and convergence takes forever
- Too big: each step overshoots the minimum, so the loss oscillates or even diverges
- Just right: the loss decreases steadily and settles at the minimum

Typical values: 0.001 to 0.1
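These three regimes are easy to reproduce on a toy quadratic f(w) = w², assumed here for illustration (gradient 2w):

```python
# Sketch: effect of learning rate on f(w) = w**2 (gradient 2w).
def run(learning_rate, steps=50):
    w = 10.0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(run(0.001))            # ~9.05: too small, barely moved after 50 steps
print(run(0.4))              # effectively 0: just right, converged
print(abs(run(1.1)) > 1000)  # True: too big, |w| grows every step
```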
## Types of Gradient Descent

### 1. Batch Gradient Descent

Uses ALL the data to compute the gradient at each step.

```python
for epoch in range(num_epochs):
    gradient = compute_gradient(ALL_data)
    weights = weights - learning_rate * gradient
```

✅ Smooth convergence
❌ Slow for large datasets
❌ Needs all data in memory
### 2. Stochastic Gradient Descent (SGD)

Uses ONE sample per step.

```python
for epoch in range(num_epochs):
    shuffle(data)
    for sample in data:
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient
```

✅ Fast updates
✅ Can escape local minima
❌ Noisy, zigzags a lot
### 3. Mini-Batch Gradient Descent (Most Common!)

Uses a BATCH of samples (e.g., 32, 64, or 128).

```python
batch_size = 32
for epoch in range(num_epochs):
    for batch in get_batches(data, batch_size):
        gradient = compute_gradient(batch)
        weights = weights - learning_rate * gradient
```

✅ Best of both worlds
✅ Efficient GPU usage
✅ Reasonably smooth
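A minimal version of the `get_batches` helper (a hypothetical name from the pseudocode above, not a library function) could look like this:

```python
import numpy as np

# Sketch of a get_batches helper (hypothetical name from the pseudocode
# above): yield shuffled mini-batches of rows from a 2-D array.
def get_batches(data, batch_size, seed=0):
    indices = np.random.default_rng(seed).permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[indices[start:start + batch_size]]

data = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
batches = list(get_batches(data, batch_size=32))
print([len(b) for b in batches])  # [32, 18]: the last batch is smaller
```

Shuffling once per pass and slicing the permuted indices gives every sample exactly one appearance per epoch.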
## Code Example: Linear Regression with GD

```python
import numpy as np

# Generate data: y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1) * 0.5

# Add bias term (column of ones) so theta = [intercept, slope]
X_b = np.c_[np.ones((100, 1)), X]

# Gradient descent
learning_rate = 0.1
n_iterations = 1000
m = len(X)
theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    predictions = X_b.dot(theta)            # predictions
    error = predictions - y                 # error
    gradient = (2 / m) * X_b.T.dot(error)   # gradient of MSE
    theta = theta - learning_rate * gradient  # update

print(f"Learned parameters: {theta.flatten()}")
# Should be close to [4, 3]
```
## Convergence Issues

### Problem 1: Local Minima

```
Loss
│  ╱╲    ╱╲
│ ╱  ╲__╱  ╲
│╱    ↑     ╲
└──────────────
  Stuck here (local minimum)
  Not the global minimum!
```

Solutions:

- Random initialization (try multiple starting points)
- Momentum (helps roll past local minima)
- SGD noise can help escape
### Problem 2: Vanishing/Exploding Gradients

Gradients become very small or very large as they propagate through deep networks.

Solutions:

- Proper weight initialization
- Batch normalization
- Gradient clipping
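Gradient clipping, for instance, just rescales a gradient whose norm is too large; a framework-agnostic sketch of norm-based clipping:

```python
import numpy as np

# Sketch: clip a gradient by its L2 norm. Deep learning libraries offer
# built-in options for this; here is the idea in plain NumPy.
def clip_gradient(gradient, max_norm=1.0):
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        return gradient * (max_norm / norm)
    return gradient

g = np.array([3.0, 4.0])  # L2 norm 5.0: too large
print(clip_gradient(g, max_norm=1.0))  # [0.6 0.8], rescaled to norm 1.0
```

The direction of the update is preserved; only its magnitude is capped.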
### Problem 3: Saddle Points

```
        ↑ Up in one direction
        │
─────╲──┼──╱─────  ← Flat here
        │
        ↓ Down in another direction
```

Solution: advanced optimizers (Adam, RMSprop)
## Advanced Optimizers

### Momentum

Adds "velocity" to the updates:

```
velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity
```

Helps accelerate movement in consistently downhill directions.
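Run on a toy quadratic f(w) = w² (assumed purely for illustration), the momentum update converges even though individual steps can overshoot:

```python
# Sketch: gradient descent with momentum on f(w) = w**2 (gradient 2w).
w = 10.0
velocity = 0.0
momentum = 0.9
learning_rate = 0.05

for _ in range(300):
    gradient = 2 * w
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity

print(abs(w) < 1e-3)  # True: w has converged near the minimum at 0
```

With momentum 0.9, the velocity accumulates up to roughly 10× the plain gradient step in a consistently downhill direction, which is what speeds things up.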
### Adam (Most Popular!)

Adapts the learning rate per parameter:

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)  # default settings work well
```

Usually the best default choice.
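Under the hood, Adam keeps running averages of the gradient and its square. A minimal NumPy sketch of the update rule, using the standard default hyperparameters and the same toy quadratic as before:

```python
import numpy as np

# Minimal sketch of the Adam update for one parameter vector,
# minimizing the toy loss f(w) = w**2 (gradient 2w).
beta1, beta2, eps = 0.9, 0.999, 1e-8  # standard default hyperparameters
learning_rate = 0.1

theta = np.array([10.0])
m = np.zeros_like(theta)  # first moment: running average of gradients
v = np.zeros_like(theta)  # second moment: running average of squared gradients

for t in range(1, 501):
    gradient = 2 * theta
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    m_hat = m / (1 - beta1**t)  # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(abs(theta[0]) < 1.0)  # True: moved from 10 toward the minimum at 0
```

Dividing by the root of the second moment normalizes the step size per parameter, which is why Adam is robust across scales.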
## Sklearn Gradient Descent

```python
from sklearn.linear_model import SGDClassifier, SGDRegressor

# Classification (assumes X_train, y_train are already defined)
clf = SGDClassifier(loss='log_loss', learning_rate='adaptive', eta0=0.01)
clf.fit(X_train, y_train)

# Regression
reg = SGDRegressor(learning_rate='adaptive', eta0=0.01)
reg.fit(X_train, y_train)
```
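A self-contained run on synthetic data (made up here for illustration) shows the regressor recovering a linear relationship:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: y = 4 + 3x + small noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 2, size=(200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, size=200)

# SGD is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

reg = SGDRegressor(learning_rate='adaptive', eta0=0.01, random_state=0)
reg.fit(X_scaled, y)

print(round(reg.score(X_scaled, y), 3))  # R^2 should be close to 1.0
```

Note the scaling step: without it, SGD on raw features often converges much more slowly or not at all.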
## Key Hyperparameters
| Parameter | What It Does | Typical Values |
|---|---|---|
| Learning rate | Step size | 0.001 - 0.1 |
| Batch size | Samples per update | 32, 64, 128 |
| Epochs | Passes through data | 10 - 1000 |
| Momentum | Velocity accumulation | 0.9 |
## Tips for Success
- Normalize features: Gradient descent works better when features are on similar scales
- Start with Adam: It's robust and works well by default
- Monitor training: Plot loss over time to ensure convergence
- Learning rate schedules: Start high, decrease over time
## Key Takeaway

Gradient descent is iterative optimization:

1. Compute how wrong you are (the loss)
2. Compute which direction improves things (the gradient)
3. Take a small step in that direction
4. Repeat until good enough
It's simple in concept but powers everything from linear regression to deep neural networks!