
Q-Learning: Tabular Reinforcement Learning

Learn the Q-Learning algorithm, the foundation of value-based reinforcement learning.

Sarah Chen
December 19, 2025


Q-Learning is one of the foundational algorithms in value-based reinforcement learning. It learns, for every state, which action to take to maximize the expected sum of future rewards.

The Q-Table

Q-Learning maintains a table of Q-values: Q(state, action) = the expected discounted future reward for taking that action in that state and acting greedily afterwards.

        Action1  Action2  Action3
State1    0.5      0.2      0.8   ← Best action is Action3
State2    0.9      0.3      0.1   ← Best action is Action1
State3    0.1      0.7      0.4   ← Best action is Action2
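
As a quick sketch, the table above is just a 2-D array, and the greedy action in each state is a row-wise argmax. The values below are copied from the illustration; actions are 0-indexed.

import numpy as np

# The Q-table from the illustration above (rows = states, columns = actions)
Q = np.array([
    [0.5, 0.2, 0.8],   # State1 -> best action is index 2 (Action3)
    [0.9, 0.3, 0.1],   # State2 -> best action is index 0 (Action1)
    [0.1, 0.7, 0.4],   # State3 -> best action is index 1 (Action2)
])

# Greedy policy: pick the highest-valued action in each row
print(np.argmax(Q, axis=1))  # [2 0 1]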

The Q-Learning Update

Q(s,a) ← Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]

Where:

  • α (alpha): Learning rate
  • γ (gamma): Discount factor (how much future matters)
  • r: Immediate reward
  • s': Next state
  • max_a' Q(s',a'): Value of the best action available in the next state
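
A quick worked update with illustrative numbers: α = 0.1, γ = 0.99, current estimate Q(s,a) = 0.5, reward r = 1, and best next-state value max_a' Q(s',a') = 0.8.

# Illustrative numbers only -- not from any particular environment
alpha, gamma = 0.1, 0.99
q_sa, reward, max_q_next = 0.5, 1.0, 0.8

td_target = reward + gamma * max_q_next   # 1.792
td_error  = td_target - q_sa              # 1.292
q_sa_new  = q_sa + alpha * td_error       # 0.6292

print(q_sa_new)  # the estimate moves 10% of the way toward the target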

Implementation

import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.n_actions = n_actions
    
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return np.argmax(self.Q[state])  # Exploit
    
    def update(self, state, action, reward, next_state, done):
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.Q[next_state])
        
        # Q-learning update
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
    
    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
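
As a quick sanity check (not part of the original walkthrough), the class can be exercised on a toy one-state, two-action bandit where action 1 always pays 1 and action 0 pays 0; the learned Q-values should approach those payoffs.

# Toy check: 1 state, 2 actions, gamma=0 (pure bandit setting)
agent = QLearning(n_states=1, n_actions=2, gamma=0.0)

for _ in range(500):
    a = agent.choose_action(0)
    r = 1.0 if a == 1 else 0.0
    agent.update(0, a, r, next_state=0, done=True)  # done=True: target is just the reward
    agent.decay_epsilon()

print(agent.Q)  # approximately [[0., 1.]]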

Training Loop

import gym  # classic Gym API (gym < 0.26): reset() returns the observation, step() returns 4 values

# Environment with a small, discrete state space
env = gym.make('FrozenLake-v1', is_slippery=False)

agent = QLearning(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n
)

# Training
n_episodes = 10000
rewards_history = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        
        agent.update(state, action, reward, next_state, done)
        
        state = next_state
        total_reward += reward
    
    agent.decay_epsilon()
    rewards_history.append(total_reward)
    
    if episode % 1000 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")

Hyperparameters

Parameter            Effect                       Typical Range
α (learning rate)    How fast Q-values update     0.01 - 0.5
γ (discount)         Weight on future rewards     0.9 - 0.99
ε (exploration)      Random-action probability    1.0 → 0.01 (decayed)
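
One way to build intuition for γ: the discounted return weights rewards over an effective horizon of roughly 1/(1 - γ) steps, so γ = 0.9 looks about 10 steps ahead while γ = 0.99 looks about 100.

# Rough effective planning horizon implied by the discount factor
for gamma in (0.9, 0.95, 0.99):
    print(f"gamma={gamma}: ~{1 / (1 - gamma):.0f} steps")
# gamma=0.9: ~10 steps
# gamma=0.95: ~20 steps
# gamma=0.99: ~100 steps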

Visualizing the Policy

# After training, visualize what the agent learned
def show_policy(Q, shape=(4, 4)):
    actions = ['←', '↓', '→', '↑']  # FrozenLake action encoding: 0=left, 1=down, 2=right, 3=up
    policy = np.argmax(Q, axis=1).reshape(shape)
    
    for row in policy:
        print(' '.join(actions[a] for a in row))

show_policy(agent.Q)

Limitations

Tabular Q-Learning only works when the state space is small and discrete. When states are continuous (like images or sensor readings), the table would need a row for every possible state, which is intractable. The fix is to approximate Q with a neural network instead of a table → Deep Q-Network (DQN).

# Can't have a table entry for every pixel combination!
# States like [0.523, 1.234, -0.891, 2.456] need function approximation
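
To make the contrast concrete: the tabular lookup only works because the state is a small integer index; with a continuous observation there is no row to index, so some function has to map the raw state to Q-values instead (the q_network below is a hypothetical stand-in for the DQN covered next).

# Tabular lookup: state must be one of n_states integer indices
q_value = agent.Q[state, action]

# Function approximation (sketch): a network maps a continuous state vector
# to one Q-value per action -- the core idea behind DQN
# q_values = q_network(np.array([0.523, 1.234, -0.891, 2.456]))  # hypothetical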

Key Takeaway

Q-Learning is the cornerstone of value-based RL. It learns action values through the Bellman update and balances exploration with exploitation via ε-greedy action selection. It works well for small, discrete problems; for larger state spaces you'll need Deep Q-Networks (DQN), which we'll cover next.

#Machine Learning  #Reinforcement Learning  #Q-Learning  #Advanced