
Understanding Gradient Descent

Learn how gradient descent works - the optimization algorithm behind most ML models.

Sarah Chen
December 19, 2025

Gradient Descent is how machines "learn." It's the algorithm that finds the best parameters for your model.

The Big Picture

Imagine you're blindfolded on hilly terrain, trying to find the lowest point (the valley).

Strategy:
1. Feel which direction goes downhill
2. Take a step in that direction
3. Repeat until you can't go lower

That's gradient descent!

The Math (Simple Version)

We have a **loss function** that measures how bad our model is:

``` Loss = f(weights) ```

Goal: Find weights that minimize this loss.

**Gradient** = direction of steepest increase
**Negative gradient** = direction of steepest decrease

Update rule:

```
new_weight = old_weight - learning_rate × gradient
```
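For example, plugging in some illustrative numbers (these values are made up, just to show one step):

```python
# One gradient descent step with illustrative numbers
old_weight = 2.0
gradient = 0.5        # the loss increases as this weight increases
learning_rate = 0.1

new_weight = old_weight - learning_rate * gradient
print(new_weight)     # 1.95 -- a small step downhill
```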

Visual Example

```
Loss
│╲
│ ╲        ← Start here
│  ╲
│   ╲      ← Take steps down
│    ╲
│     ╲___ ← Minimum!
└──────────── weights
```

Learning Rate

The learning rate (α) controls step size.

```
Too small:        Too big:          Just right:
Loss              Loss              Loss
│╲.               │╲                │╲
│ ╲...            │ ╲  /╲           │ ╲
│  ╲....          │  ╲/  ╲          │  ╲
│   ╲.....        │   ↑             │   ╲___
└──────────       └──────────       └──────────
Takes forever     Overshoots!       Converges
```

**Typical values:** 0.001 to 0.1
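To see these regimes concretely, here is a minimal sketch on a toy loss L(w) = w² (the loss function and the specific rates are illustrative, not taken from any model above):

```python
# Compare learning rates on the toy loss L(w) = w**2, whose gradient is 2*w
def run_gd(learning_rate, steps=20, w=5.0):
    for _ in range(steps):
        w = w - learning_rate * (2 * w)   # gradient of w**2 is 2w
    return w

print(run_gd(0.001))  # too small: after 20 steps, barely moved toward 0
print(run_gd(1.1))    # too big: |w| grows every step (diverges)
print(run_gd(0.1))    # just right: close to the minimum at 0
```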

Types of Gradient Descent

### 1. Batch Gradient Descent
Uses ALL the data to compute the gradient at each step.

```python
for epoch in range(num_epochs):
    gradient = compute_gradient(ALL_data)
    weights = weights - learning_rate * gradient
```

✅ Smooth convergence
❌ Slow for large datasets
❌ Needs all data in memory

### 2. Stochastic Gradient Descent (SGD)
Uses ONE sample per step.

```python
for epoch in range(num_epochs):
    shuffle(data)
    for sample in data:
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient
```

✅ Fast updates
✅ Can escape local minima
❌ Noisy, zigzags a lot

### 3. Mini-Batch Gradient Descent (Most Common!)
Uses a BATCH of samples (e.g., 32, 64, 128).

```python
batch_size = 32
for epoch in range(num_epochs):
    for batch in get_batches(data, batch_size):
        gradient = compute_gradient(batch)
        weights = weights - learning_rate * gradient
```

✅ Best of both worlds
✅ Efficient GPU usage
✅ Reasonably smooth
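The `get_batches` helper above is pseudocode; one possible minimal implementation in NumPy (assuming `data` is an array with one sample per row) could look like this:

```python
import numpy as np

def get_batches(data, batch_size):
    # Shuffle once per epoch, then yield consecutive slices
    indices = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[indices[start:start + batch_size]]

# Example: 100 samples with 3 features, batches of 32
data = np.random.rand(100, 3)
for batch in get_batches(data, 32):
    print(batch.shape)  # (32, 3), (32, 3), (32, 3), (4, 3)
```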

Code Example: Linear Regression with GD

```python
import numpy as np

# Generate data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1) * 0.5

# Add bias term
X_b = np.c_[np.ones((100, 1)), X]

# Gradient Descent
learning_rate = 0.1
n_iterations = 1000
m = len(X)

theta = np.random.randn(2, 1)  # Random initialization

for iteration in range(n_iterations):
    # Predictions
    predictions = X_b.dot(theta)
    # Error
    error = predictions - y
    # Gradient
    gradient = (2 / m) * X_b.T.dot(error)
    # Update
    theta = theta - learning_rate * gradient

print(f"Learned parameters: {theta.flatten()}")  # Should be close to [4, 3]
```

Convergence Issues

### Problem 1: Local Minima
```
Loss
│  ╱╲    ╱╲
│ ╱  ╲__╱  ╲
│╱    ↑     ╲
└──────────────
      Stuck here (local minimum)
      Not the global minimum!
```

**Solutions:**
- Random initialization (try multiple starting points)
- Momentum (helps roll past local minima)
- SGD noise can help escape

### Problem 2: Vanishing/Exploding Gradients
Gradients become very small or very large.

**Solutions:**
- Proper weight initialization
- Batch normalization
- Gradient clipping (see the sketch below)
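For instance, gradient clipping by global norm takes only a few lines of NumPy (a framework-agnostic sketch; `max_norm` is an illustrative threshold):

```python
import numpy as np

def clip_by_norm(gradient, max_norm=1.0):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        gradient = gradient * (max_norm / norm)
    return gradient

g = np.array([30.0, -40.0])           # "exploding" gradient, norm = 50
print(clip_by_norm(g, max_norm=1.0))  # [ 0.6 -0.8], norm = 1
```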

### Problem 3: Saddle Points
```
        ↑ Up in one direction
        │
─────╲──┼──╱───── ← Flat here
        │
        ↓ Down in another direction
```

**Solution:** Advanced optimizers (Adam, RMSprop)

Advanced Optimizers

### Momentum
Adds "velocity" to updates:

```python
velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity
```

Helps accelerate in consistent directions.
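As a small, self-contained illustration, here is the momentum update applied to the same toy quadratic loss used in the learning-rate sketch above (the hyperparameters are illustrative):

```python
# Momentum on the toy loss L(w) = w**2 (gradient is 2*w)
w, velocity = 5.0, 0.0
momentum, learning_rate = 0.9, 0.05

for step in range(100):
    gradient = 2 * w
    velocity = momentum * velocity - learning_rate * gradient
    w = w + velocity

print(round(w, 4))  # close to 0: momentum overshoots a little, then settles at the minimum
```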

### Adam (Most Popular!)
Adaptive learning rates per parameter:

```python
from keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)  # Default settings work well
```

Usually the best default choice.
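For intuition only, the core of a single Adam update looks roughly like this (a sketch with the standard default hyperparameters; this is not the library implementation):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (matters most during the first few steps)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, 2 * w, m, v, t=1)
print(w)  # ~4.999: the first step has size ~lr, regardless of the gradient's scale
```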

Sklearn Gradient Descent

```python
from sklearn.linear_model import SGDClassifier, SGDRegressor

# Classification
clf = SGDClassifier(loss='log_loss', learning_rate='adaptive', eta0=0.01)
clf.fit(X_train, y_train)

# Regression
reg = SGDRegressor(learning_rate='adaptive', eta0=0.01)
reg.fit(X_train, y_train)
```
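The snippet above assumes `X_train` and `y_train` already exist; a minimal setup with synthetic data (names and sizes are illustrative) might be:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Synthetic classification data so the snippet above runs end to end
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SGDClassifier(loss='log_loss', learning_rate='adaptive', eta0=0.01)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```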

Key Hyperparameters

| Parameter | What It Does | Typical Values |
|-----------|--------------|----------------|
| Learning rate | Step size | 0.001 - 0.1 |
| Batch size | Samples per update | 32, 64, 128 |
| Epochs | Passes through data | 10 - 1000 |
| Momentum | Velocity accumulation | 0.9 |

Tips for Success

1. **Normalize features:** Gradient descent works better when features are on similar scales (see the pipeline sketch below)
2. **Start with Adam:** It's robust and works well by default
3. **Monitor training:** Plot loss over time to ensure convergence
4. **Learning rate schedules:** Start high, decrease over time
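For the first tip, a common pattern is to put a scaler in front of the SGD model. Here is a sketch using scikit-learn (reusing the hypothetical `X_train` / `y_train` names from earlier):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Scaling before SGD keeps all features on a similar scale,
# so a single learning rate works for every weight
model = make_pipeline(StandardScaler(), SGDRegressor(learning_rate='adaptive', eta0=0.01))
# model.fit(X_train, y_train)  # assumes regression-style X_train / y_train
```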

Key Takeaway

Gradient Descent is iterative optimization:
1. Compute how wrong you are (loss)
2. Compute which direction improves things (gradient)
3. Take a small step in that direction
4. Repeat until good enough

It's simple in concept but powers everything from linear regression to deep neural networks!

#Machine Learning #Gradient Descent #Optimization #Beginner