Introduction to Reinforcement Learning
Understand the fundamentals of reinforcement learning: agents, environments, rewards, and the exploration-exploitation tradeoff.
Supervised learning needs labeled data. Reinforcement Learning (RL) learns from experience - through trial and error, guided by rewards.
The RL Framework
[Agent] ───── action ─────> [Environment]
   ▲                               │
   │                               │
   └────── reward, state ──────────┘
- Agent: The learner/decision maker
- Environment: The world the agent interacts with
- State: Current situation
- Action: What the agent can do
- Reward: Feedback signal (good or bad)
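To make these pieces concrete, here is a minimal sketch of the loop above using a hypothetical toy environment (the LineWorld class, the 0..4 positions, and the action encoding are invented for illustration, not taken from any library):
import random

class LineWorld:
    # Toy environment: the agent walks along positions 0..4 and gets +1 for reaching 4.
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clipped to the ends of the track)
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1 if self.state == 4 else 0
        done = self.state == 4
        return self.state, reward, done

# The agent-environment loop from the diagram above
env = LineWorld()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # the "agent" (random for now) picks an action
    state, reward, done = env.step(action)   # the environment returns the next state and reward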
Example: Game Playing
State: Current game screen
Action: Move left, right, jump
Reward: +1 for coins, -1 for dying, +100 for winning
The agent learns which actions lead to high rewards.
Key Concepts
Policy (π): Strategy for choosing actions given states
π(state) → action
Value Function (V): Expected future reward from a state
V(state) = Expected total future reward
Q-Function (Q): Expected total future reward for taking action a in state s, then continuing from there
Q(state, action) = Expected total future reward
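As a tiny numeric sketch (the Q-values below are made up purely for illustration), a table of Q-values is enough to read off both a greedy policy and the corresponding state values:
import numpy as np

# Made-up Q-values for a 3-state, 2-action problem (illustrative numbers only)
Q = np.array([
    [1.0, 3.0],   # state 0: action 1 looks better
    [0.5, 0.2],   # state 1: action 0 looks better
    [2.0, 2.0],   # state 2: both actions look equally good
])

policy = np.argmax(Q, axis=1)   # greedy policy: best action per state -> [1, 0, 0]
V = np.max(Q, axis=1)           # value of each state under that policy -> [3.0, 0.5, 2.0]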
Simple RL Environment
import gym  # Gymnasium also works: `import gymnasium as gym`

# Create environment
env = gym.make('CartPole-v1')

# Reset and get initial state (recent Gym/Gymnasium versions return (state, info))
state, info = env.reset()

total_reward = 0
done = False
while not done:
    # Choose action (random for now)
    action = env.action_space.sample()
    # Take action, get result; the episode ends when it terminates or is truncated
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    state = next_state

print(f"Total reward: {total_reward}")
env.close()
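With purely random actions, a CartPole episode usually ends after only a few dozen steps, so the printed total is small; CartPole-v1 caps episodes at 500 steps, which is roughly the score a well-trained policy reaches.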
Exploration vs Exploitation
The fundamental tradeoff:
- Exploitation: Choose the best known action
- Exploration: Try new actions to learn more
Too much exploitation = miss better options
Too much exploration = waste time on bad actions
Epsilon-Greedy Strategy
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(len(Q[state]))
    else:
        # Exploit: best known action
        return np.argmax(Q[state])

# Decay epsilon over time
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

for episode in range(1000):
    # ... training ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
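With this schedule, epsilon shrinks geometrically: it falls to about 0.1 after roughly 460 episodes (0.995^460 ≈ 0.1) and hits the 0.01 floor near episode 920, so early episodes are mostly exploration and later ones mostly exploitation.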
Types of RL
Value-Based: Learn a value function and derive the policy from it (a minimal update step is sketched after this list)
- Q-Learning, DQN
Policy-Based: Learn policy directly
- REINFORCE, Policy Gradients
Actor-Critic: Learn both
- A2C, PPO, SAC
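As a minimal sketch of the value-based family mentioned above, here is one tabular Q-learning update step (the state/action counts, alpha, and gamma values are placeholder choices, not from the text):
import numpy as np

n_states, n_actions = 16, 4          # placeholder sizes for a small discrete problem
alpha, gamma = 0.1, 0.99             # learning rate and discount factor (assumed values)

Q = np.zeros((n_states, n_actions))  # tabular Q-function

def q_learning_update(state, action, reward, next_state, done):
    # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# The policy is then derived from the learned values: pick argmax_a Q[s, a] in each state.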
Reward Design
Good rewards are crucial:
# Bad: Sparse reward
reward = 1 if goal_reached else 0 # Hard to learn
# Better: Shaped reward
reward = -distance_to_goal # Continuous feedback
# Watch for reward hacking!
# Agent might find unexpected ways to maximize reward
When to Use RL
Good fit:
- Sequential decision making
- Clear reward signal
- Can simulate many episodes
- Games, robotics, recommendations
Not ideal:
- One-shot decisions
- No clear reward
- Expensive to try actions (real robots)
- Labeled data is available (use supervised)
Key Takeaway
RL learns through interaction - taking actions, receiving rewards, improving. It's powerful for sequential decision problems but requires careful reward design and lots of experience. Start with simple environments (CartPole, GridWorld), understand the exploration-exploitation tradeoff, then move to more complex algorithms.