Q-Learning: Tabular Reinforcement Learning
Learn the Q-Learning algorithm, the foundation of value-based reinforcement learning.
Q-Learning is a foundational algorithm in value-based reinforcement learning. It learns which action to take in each state to maximize cumulative future reward.
The Q-Table
Q-Learning maintains a table of Q-values: Q(state, action) = the expected cumulative future reward from taking that action in that state.
```
         Action1  Action2  Action3
State1     0.5      0.2      0.8     ← Best action is Action3
State2     0.9      0.3      0.1     ← Best action is Action1
State3     0.1      0.7      0.4     ← Best action is Action2
```
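As a minimal sketch (using the made-up values from the table above), picking the greedy action for a state is just a row lookup followed by an argmax:

```python
import numpy as np

# Hypothetical 3x3 Q-table matching the example above: rows are states, columns are actions
Q = np.array([
    [0.5, 0.2, 0.8],   # State1 -> best action is Action3 (index 2)
    [0.9, 0.3, 0.1],   # State2 -> best action is Action1 (index 0)
    [0.1, 0.7, 0.4],   # State3 -> best action is Action2 (index 1)
])

state = 0
best_action = np.argmax(Q[state])   # greedy action for State1
print(best_action)                  # 2
```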
The Q-Learning Update
```
Q(s,a) ← Q(s,a) + α * [r + γ * max(Q(s',a')) - Q(s,a)]
```
Where:
- α (alpha): Learning rate
- γ (gamma): Discount factor (how much the future matters)
- r: Immediate reward
- s': Next state
- max(Q(s',a')): Best possible future value
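To make the update concrete, here is a small worked example with invented numbers (α = 0.1, γ = 0.99) applied to a single transition:

```python
# One hand-computed Q-learning update with illustrative values
alpha, gamma = 0.1, 0.99

q_sa = 0.5            # current estimate Q(s, a)
reward = 1.0          # immediate reward r
max_q_next = 0.8      # max over a' of Q(s', a')

# TD target and TD error
target = reward + gamma * max_q_next     # 1.0 + 0.99 * 0.8 = 1.792
td_error = target - q_sa                 # 1.292

q_sa_new = q_sa + alpha * td_error
print(round(q_sa_new, 4))                # 0.6292
```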
Implementation
```python
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha        # Learning rate
        self.gamma = gamma        # Discount factor
        self.epsilon = epsilon    # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.n_actions = n_actions

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return np.argmax(self.Q[state])               # Exploit

    def update(self, state, action, reward, next_state, done):
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        # Q-learning update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```
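As a quick usage sketch (assuming the QLearning class above and an invented 5-state, 3-action problem), a single update moves the tabular entry from 0 toward the TD target:

```python
# Tiny sanity check on an invented 5-state, 3-action problem
toy_agent = QLearning(n_states=5, n_actions=3)

print(toy_agent.Q[0, 1])   # 0.0 before any update
toy_agent.update(state=0, action=1, reward=1.0, next_state=2, done=False)
print(toy_agent.Q[0, 1])   # 0.1 = alpha * (1.0 + gamma * 0.0 - 0.0)
```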
Training Loop
```python
import gym

# Environment with discrete state space
# (classic Gym API, gym < 0.26: reset() returns the state, step() returns 4 values)
env = gym.make('FrozenLake-v1', is_slippery=False)

agent = QLearning(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n
)

# Training
n_episodes = 10000
rewards_history = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward

    agent.decay_epsilon()
    rewards_history.append(total_reward)

    if episode % 1000 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
```
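Once training finishes, it is worth checking how the greedy policy performs with exploration switched off. The sketch below assumes the env and agent objects from the loop above and the same classic Gym API:

```python
# Evaluate the learned greedy policy (no exploration)
n_eval = 100
successes = 0

for _ in range(n_eval):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(agent.Q[state])   # always exploit the learned Q-values
        state, reward, done, _ = env.step(action)
    successes += reward                      # FrozenLake: reward is 1 only when the goal is reached

print(f"Success rate over {n_eval} episodes: {successes / n_eval:.2f}")
```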
Hyperparameters
| Parameter | Effect | Typical Range |
|-----------|--------|---------------|
| α (learning rate) | How fast to update Q | 0.01 - 0.5 |
| γ (discount) | Future reward importance | 0.9 - 0.99 |
| ε (exploration) | Random action probability | 1.0 → 0.01 |
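One way to sanity-check the exploration schedule: with the multiplicative decay of 0.995 per episode used above, the number of episodes needed to take ε from 1.0 down to the 0.01 floor follows from 0.995^n = 0.01:

```python
import math

epsilon_start, epsilon_min, decay = 1.0, 0.01, 0.995

# Solve epsilon_start * decay**n = epsilon_min for n
n_episodes_to_min = math.log(epsilon_min / epsilon_start) / math.log(decay)
print(round(n_episodes_to_min))   # ~919 episodes until epsilon reaches its floor
```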
Visualizing the Policy
```python
# After training, visualize what the agent learned
def show_policy(Q, shape=(4, 4)):
    actions = ['←', '↓', '→', '↑']
    policy = np.argmax(Q, axis=1).reshape(shape)
    for row in policy:
        print(' '.join(actions[a] for a in row))

show_policy(agent.Q)
```
Limitations
Q-Learning works for **discrete** state spaces. When states are continuous or high-dimensional (like images), the table would need impossibly many entries. Solution: use a neural network to approximate Q → Deep Q-Network (DQN).
```python
# Can't have a table entry for every pixel combination!
# States like [0.523, 1.234, -0.891, 2.456] need function approximation
```
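For a rough sense of scale, take an illustrative 84×84 grayscale image (a frame size commonly used in Atari experiments) with 256 intensity levels per pixel; the number of distinct states dwarfs anything a table could hold:

```python
import math

# Rough count of distinct states for an 84x84 grayscale image
n_pixels = 84 * 84               # 7056 pixels
n_values_per_pixel = 256         # 8-bit intensities

# Number of decimal digits in 256**7056
digits = n_pixels * math.log10(n_values_per_pixel)
print(f"State space size ~ 10^{digits:.0f}")   # roughly 10^16993 states
```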
Key Takeaway
Q-Learning is the cornerstone of value-based RL. It learns action values through the Bellman equation and balances exploration with exploitation. It works well for small, discrete problems; for larger state spaces, you'll need Deep Q-Networks (DQN), which we'll cover next.