Q-Learning: Tabular Reinforcement Learning
Learn the Q-Learning algorithm, the foundation of value-based reinforcement learning.
Q-Learning is a foundational algorithm of value-based reinforcement learning. It learns which action to take in each state to maximize expected future reward.
The Q-Table
Q-Learning maintains a table of Q-values: Q(state, action) = the expected future reward for taking that action in that state (and acting optimally afterwards).
| | Action1 | Action2 | Action3 | Best action |
|---|---|---|---|---|
| State1 | 0.5 | 0.2 | 0.8 | Action3 |
| State2 | 0.9 | 0.3 | 0.1 | Action1 |
| State3 | 0.1 | 0.7 | 0.4 | Action2 |
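In code, the Q-table is just a 2-D array indexed by (state, action). Here is a minimal sketch using the illustrative values above:

import numpy as np

# Rows are states, columns are actions (illustrative values from the table above)
Q = np.array([
    [0.5, 0.2, 0.8],   # State1 -> best action: Action3 (index 2)
    [0.9, 0.3, 0.1],   # State2 -> best action: Action1 (index 0)
    [0.1, 0.7, 0.4],   # State3 -> best action: Action2 (index 1)
])

best_actions = np.argmax(Q, axis=1)  # greedy action for each state
print(best_actions)                  # [2 0 1]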
The Q-Learning Update
Q(s,a) ← Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]
Where:
- α (alpha): Learning rate (how big each update step is)
- γ (gamma): Discount factor (how much future rewards matter)
- r: Immediate reward
- s': Next state
- max_a' Q(s',a'): Value of the best action available in the next state
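To see the update in action, here is one step worked by hand with illustrative numbers (α = 0.1, γ = 0.9):

# One hand-worked Q-learning update (illustrative numbers)
alpha, gamma = 0.1, 0.9
q_sa = 0.5         # current estimate Q(s, a)
reward = 1.0       # immediate reward r
max_q_next = 0.8   # max over a' of Q(s', a')

target = reward + gamma * max_q_next   # 1.0 + 0.9 * 0.8 = 1.72
q_sa += alpha * (target - q_sa)        # 0.5 + 0.1 * 1.22 = 0.622
print(round(q_sa, 3))                  # 0.622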
Implementation
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.n_actions = n_actions

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return np.argmax(self.Q[state])               # Exploit

    def update(self, state, action, reward, next_state, done):
        if done:
            target = reward  # Terminal state: no future value
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        # Q-learning update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
Training Loop
import gym  # Classic Gym API (gym < 0.26): reset() returns the state, step() returns 4 values

# Environment with a small, discrete state space
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearning(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n
)

# Training
n_episodes = 10000
rewards_history = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward

    agent.decay_epsilon()
    rewards_history.append(total_reward)

    if episode % 1000 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
Hyperparameters
| Parameter | Effect | Typical Range |
|---|---|---|
| α (learning rate) | How fast to update Q | 0.01 - 0.5 |
| γ (discount) | Future reward importance | 0.9 - 0.99 |
| ε (exploration) | Random action probability | 1.0 → 0.01 |
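A useful rule of thumb for γ: the discounted sum weights rewards over an effective horizon of roughly 1 / (1 − γ) steps, so a higher γ makes the agent plan further ahead. A quick sketch:

# Effective planning horizon implied by the discount factor
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: effective horizon ≈ {1 / (1 - gamma):.0f} steps")
# gamma=0.9   -> ~10 steps
# gamma=0.99  -> ~100 steps
# gamma=0.999 -> ~1000 steps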
Visualizing the Policy
# After training, visualize what the agent learned
def show_policy(Q, shape=(4, 4)):
    actions = ['←', '↓', '→', '↑']  # FrozenLake action order: 0=left, 1=down, 2=right, 3=up
    policy = np.argmax(Q, axis=1).reshape(shape)
    for row in policy:
        print(' '.join(actions[a] for a in row))

show_policy(agent.Q)
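You can also look at the learned state values, since V(s) = max_a Q(s, a). A short companion sketch to show_policy:

# State values: V(s) = max over actions of Q(s, a)
def show_values(Q, shape=(4, 4)):
    values = np.max(Q, axis=1).reshape(shape)
    for row in values:
        print(' '.join(f"{v:.2f}" for v in row))

show_values(agent.Q)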
Limitations
Tabular Q-Learning only works for small, discrete state spaces. When states are continuous or high-dimensional (like images), the table becomes impossibly large. Solution: use a neural network to approximate Q → Deep Q-Network (DQN).
# Can't have a table entry for every pixel combination!
# States like [0.523, 1.234, -0.891, 2.456] need function approximation
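To make the scale concrete, here is a back-of-the-envelope count of the rows a Q-table would need if every raw grayscale image were its own state (the 84×84 frame size is just an illustrative choice):

import math

# Every distinct image would be a separate "state" in the table
pixels = 84 * 84   # one small grayscale frame
levels = 256       # intensity values per pixel
digits = pixels * math.log10(levels)
print(f"Distinct image states: ~10^{digits:.0f}")   # ~10^16993 -- no table can hold that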
Key Takeaway
Q-Learning is the cornerstone of value-based RL. It learns action values through the Bellman equation and balances exploration with exploitation. It works well for small, discrete problems; for larger state spaces, you'll need Deep Q-Networks (DQN), which we'll cover next.