Q-Learning: Tabular Reinforcement Learning
Learn the Q-Learning algorithm, the foundation of value-based reinforcement learning.
Q-Learning is a foundational algorithm of value-based reinforcement learning. It learns which action to take in each state to maximize expected future reward.
The Q-Table
Q-Learning maintains a table of Q-values: Q(state, action) = the expected future reward for taking that action in that state (and acting optimally afterwards).
| | Action1 | Action2 | Action3 | Best action |
|---|---|---|---|---|
| State1 | 0.5 | 0.2 | 0.8 | Action3 |
| State2 | 0.9 | 0.3 | 0.1 | Action1 |
| State3 | 0.1 | 0.7 | 0.4 | Action2 |
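In code, the Q-table is just a 2-D array indexed by (state, action). Here is a minimal sketch using the illustrative values above:

import numpy as np

# Rows are states, columns are actions (illustrative values from the table above)
Q = np.array([
    [0.5, 0.2, 0.8],   # State1 -> best action: Action3 (index 2)
    [0.9, 0.3, 0.1],   # State2 -> best action: Action1 (index 0)
    [0.1, 0.7, 0.4],   # State3 -> best action: Action2 (index 1)
])

best_actions = np.argmax(Q, axis=1)  # greedy action for each state
print(best_actions)                  # [2 0 1]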
The Q-Learning Update
Q(s,a) ← Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]
Where:
- α (alpha): Learning rate (how big each update step is)
- γ (gamma): Discount factor (how much future rewards matter)
- r: Immediate reward
- s': Next state
- max_a' Q(s',a'): Value of the best action available in the next state
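To see the update in action, here is one step worked by hand with illustrative numbers (α = 0.1, γ = 0.9):

# One hand-worked Q-learning update (illustrative numbers)
alpha, gamma = 0.1, 0.9
q_sa = 0.5         # current estimate Q(s, a)
reward = 1.0       # immediate reward r
max_q_next = 0.8   # max over a' of Q(s', a')

target = reward + gamma * max_q_next   # 1.0 + 0.9 * 0.8 = 1.72
q_sa += alpha * (target - q_sa)        # 0.5 + 0.1 * 1.22 = 0.622
print(round(q_sa, 3))                  # 0.622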
Implementation
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.n_actions = n_actions

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return np.argmax(self.Q[state])               # Exploit

    def update(self, state, action, reward, next_state, done):
        if done:
            target = reward  # Terminal state: no future value
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        # Q-learning update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
Training Loop
import gym  # Classic Gym API (gym < 0.26): reset() returns the state, step() returns 4 values

# Environment with a small, discrete state space
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearning(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n
)

# Training
n_episodes = 10000
rewards_history = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward

    agent.decay_epsilon()
    rewards_history.append(total_reward)

    if episode % 1000 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
Hyperparameters
| Parameter | Effect | Typical Range |
|---|---|---|
| α (learning rate) | How fast to update Q | 0.01 - 0.5 |
| γ (discount) | Future reward importance | 0.9 - 0.99 |
| ε (exploration) | Random action probability | 1.0 → 0.01 |
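A useful rule of thumb for γ: the discounted sum weights rewards over an effective horizon of roughly 1 / (1 − γ) steps, so a higher γ makes the agent plan further ahead. A quick sketch:

# Effective planning horizon implied by the discount factor
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: effective horizon ≈ {1 / (1 - gamma):.0f} steps")
# gamma=0.9   -> ~10 steps
# gamma=0.99  -> ~100 steps
# gamma=0.999 -> ~1000 steps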
Visualizing the Policy
# After training, visualize what the agent learned
def show_policy(Q, shape=(4, 4)):
    actions = ['←', '↓', '→', '↑']  # FrozenLake action order: 0=left, 1=down, 2=right, 3=up
    policy = np.argmax(Q, axis=1).reshape(shape)
    for row in policy:
        print(' '.join(actions[a] for a in row))

show_policy(agent.Q)
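You can also look at the learned state values, since V(s) = max_a Q(s, a). A short companion sketch to show_policy:

# State values: V(s) = max over actions of Q(s, a)
def show_values(Q, shape=(4, 4)):
    values = np.max(Q, axis=1).reshape(shape)
    for row in values:
        print(' '.join(f"{v:.2f}" for v in row))

show_values(agent.Q)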
Limitations
Tabular Q-Learning only works for small, discrete state spaces. When states are continuous or high-dimensional (like images), the table becomes impossibly large. Solution: use a neural network to approximate Q → Deep Q-Network (DQN).
# Can't have a table entry for every pixel combination!
# States like [0.523, 1.234, -0.891, 2.456] need function approximation
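To make the scale concrete, here is a back-of-the-envelope count of the rows a Q-table would need if every raw grayscale image were its own state (the 84×84 frame size is just an illustrative choice):

import math

# Every distinct image would be a separate "state" in the table
pixels = 84 * 84   # one small grayscale frame
levels = 256       # intensity values per pixel
digits = pixels * math.log10(levels)
print(f"Distinct image states: ~10^{digits:.0f}")   # ~10^16993 -- no table can hold that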
Key Takeaway
Q-Learning is the cornerstone of value-based RL. It learns action values through the Bellman equation and balances exploration with exploitation. It works well for small, discrete problems; for larger state spaces, you'll need Deep Q-Networks (DQN), which we'll cover next.