
Introduction to Reinforcement Learning

Understand the fundamentals of reinforcement learning: agents, environments, rewards, and the exploration-exploitation tradeoff.

Sarah Chen
December 19, 2025


Supervised learning needs labeled data. Reinforcement Learning (RL) doesn't: it learns from experience, through trial and error, guided by rewards.

The RL Framework

   [Agent] ───action───> [Environment]
      ▲                        │
      │                        │
      └──────reward, state─────┘

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • State: Current situation
  • Action: What the agent can do
  • Reward: Feedback signal (good or bad)

Example: Game Playing

State: Current game screen
Action: Move left, right, jump
Reward: +1 for coins, -1 for dying, +100 for winning

The agent learns which actions lead to high rewards.
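
As a minimal sketch, that reward rule could be written as a simple function. The event labels here are hypothetical, not from any particular game API:

def compute_reward(event):
    # Hypothetical event labels for the game example above
    if event == "coin":
        return 1     # collected a coin
    if event == "death":
        return -1    # agent died
    if event == "win":
        return 100   # level completed
    return 0         # nothing notable this step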

Key Concepts

Policy (π): Strategy for choosing actions given states

π(state) → action

Value Function (V): Expected total future reward starting from a state and following the policy

V(state) = Expected total future reward

Q-Function (Q): Expected total future reward for taking a given action in a state, then following the policy

Q(state, action) = Expected total future reward

In practice, future rewards are usually discounted by a factor γ (between 0 and 1), so rewards that arrive sooner count more.
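
In a small, discrete environment, all three can be stored as plain tables. Here's a minimal sketch, assuming a toy problem with 16 states and 4 actions:

import numpy as np

num_states, num_actions = 16, 4  # assumed sizes for a small GridWorld

Q = np.zeros((num_states, num_actions))  # Q(s, a): one entry per state-action pair
V = Q.max(axis=1)                        # V(s) = best Q-value available in each state
policy = Q.argmax(axis=1)                # greedy policy: pick the highest-valued action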

Simple RL Environment

import gymnasium as gym  # Gymnasium is the maintained successor to OpenAI Gym

# Create environment
env = gym.make('CartPole-v1')

# Reset and get initial state (reset returns the observation and an info dict)
state, info = env.reset()

total_reward = 0
done = False

while not done:
    # Choose action (random for now)
    action = env.action_space.sample()
    
    # Take action; step returns (obs, reward, terminated, truncated, info)
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    
    total_reward += reward
    state = next_state

print(f"Total reward: {total_reward}")
env.close()

Exploration vs Exploitation

The fundamental tradeoff:

  • Exploitation: Choose the best known action
  • Exploration: Try new actions to learn more

Too much exploitation = miss better options
Too much exploration = waste time on bad actions

Epsilon-Greedy Strategy

import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # Q is a [num_states, num_actions] table of action values
    num_actions = Q.shape[1]
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(num_actions)
    else:
        # Exploit: best known action
        return np.argmax(Q[state])

# Decay epsilon over time: explore heavily early, exploit more later
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

for episode in range(1000):
    # ... training: call epsilon_greedy(Q, state, epsilon) each step ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

Types of RL

Value-Based: Learn a value function, then derive the policy from it (see the sketch after this list)

  • Q-Learning, DQN

Policy-Based: Learn policy directly

  • REINFORCE, Policy Gradients

Actor-Critic: Learn both a policy (the actor) and a value function (the critic)

  • A2C, PPO, SAC
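
To make the value-based family concrete, here is a minimal sketch of the tabular Q-learning update. The environment size and hyperparameters are assumptions for a toy problem:

import numpy as np

num_states, num_actions = 16, 4  # assumed toy environment
alpha = 0.1                      # learning rate
gamma = 0.99                     # discount factor

Q = np.zeros((num_states, num_actions))

def q_learning_update(state, action, reward, next_state, done):
    # Target: reward now, plus the discounted value of the best next action
    best_next = 0.0 if done else np.max(Q[next_state])
    target = reward + gamma * best_next
    # Nudge Q(s, a) toward the target
    Q[state, action] += alpha * (target - Q[state, action])

Each environment transition feeds one call to this update; pair it with epsilon_greedy for action selection, and the table gradually converges on good action values.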

Reward Design

Good rewards are crucial:

# Bad: sparse reward - the agent only gets a signal at the very end
def sparse_reward(goal_reached):
    return 1 if goal_reached else 0  # hard to learn from

# Better: shaped reward - continuous feedback every step
def shaped_reward(distance_to_goal):
    return -distance_to_goal

# Watch out for reward hacking!
# The agent might find unexpected ways to maximize reward.
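
If you do shape rewards, potential-based shaping (Ng et al., 1999) is a well-known way to add guidance without changing which policy is optimal: add γ·Φ(next_state) − Φ(state) to the reward, where Φ is any guess at how good a state is. A minimal sketch, assuming a hypothetical distance_to_goal(state) helper:

gamma = 0.99

def potential(state):
    # Assumed potential: closer to the goal = higher value
    return -distance_to_goal(state)  # distance_to_goal is hypothetical

def potential_shaped_reward(reward, state, next_state):
    # Potential-based shaping: preserves the optimal policy
    return reward + gamma * potential(next_state) - potential(state)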

When to Use RL

Good fit:

  • Sequential decision making
  • Clear reward signal
  • Can simulate many episodes
  • Games, robotics, recommendations

Not ideal:

  • One-shot decisions
  • No clear reward
  • Expensive to try actions (real robots)
  • Labeled data is available (use supervised)

Key Takeaway

RL learns through interaction: taking actions, receiving rewards, and improving. It's powerful for sequential decision problems, but it demands careful reward design and a lot of experience. Start with simple environments (CartPole, GridWorld), understand the exploration-exploitation tradeoff, then move on to more complex algorithms.

#Machine Learning#Reinforcement Learning#Advanced