Introduction to Reinforcement Learning
Understand the fundamentals of reinforcement learning: agents, environments, rewards, and the exploration-exploitation tradeoff.
Supervised learning needs labeled data. Reinforcement Learning (RL) learns from experience - through trial and error, guided by rewards.
The RL Framework
```
[Agent] ───── action ─────> [Environment]
   ▲                              │
   │                              │
   └─────── reward, state ────────┘
```
- **Agent:** The learner/decision maker
- **Environment:** The world the agent interacts with
- **State:** Current situation
- **Action:** What the agent can do
- **Reward:** Feedback signal (good or bad)
Example: Game Playing
State: Current game screen
Action: Move left, right, jump
Reward: +1 for coins, -1 for dying, +100 for winning
The agent learns which actions lead to high rewards.
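To make the loop concrete, here is a minimal sketch with a hypothetical toy "game" (the states, actions, and reward values are made up purely for illustration):

```python
import random

# Hypothetical toy game used only to illustrate the agent-environment loop.
# States are positions 0..4; reaching position 4 is the goal.
def step(state, action):
    """Apply an action ('left' or 'right') and return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == "left" else min(4, state + 1)
    if next_state == 4:
        return next_state, +100, True   # reached the goal
    return next_state, -1, False        # small penalty per step

state, total_reward, done = 0, 0, False
while not done:
    action = random.choice(["left", "right"])   # random policy for now
    state, reward, done = step(state, action)
    total_reward += reward

print("Total reward:", total_reward)
```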
Key Concepts
**Policy (π):** Strategy for choosing actions given states

```
π(state) → action
```
**Value Function (V):** Expected future reward from a state

```
V(state) = Expected total future reward
```
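In practice the "total future reward" is usually a discounted sum. A quick sketch of computing it from one recorded episode (the rewards and discount factor below are arbitrary illustration values):

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
rewards = [-1, -1, -1, +100]   # rewards collected after visiting a state (made up)
gamma = 0.99                   # discount factor

G = 0.0
for r in reversed(rewards):
    G = r + gamma * G          # work backwards: G_t = r_t + gamma * G_{t+1}

print(f"Discounted return from the first state: {G:.2f}")
```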
**Q-Function:** Expected reward for taking action a in state s

```
Q(state, action) = Expected total future reward
```
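These three concepts are linked: given a Q-table, you can read off a greedy policy and a value estimate. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical Q-table: 3 states x 2 actions, values made up for illustration
Q = np.array([
    [1.0, 3.0],   # state 0: action 1 looks better
    [2.5, 0.5],   # state 1: action 0 looks better
    [0.0, 0.0],   # state 2: nothing learned yet
])

def greedy_policy(state):
    """π(state): pick the action with the highest estimated Q-value."""
    return int(np.argmax(Q[state]))

def state_value(state):
    """V(state) under the greedy policy: the best available Q-value."""
    return float(np.max(Q[state]))

print(greedy_policy(0), state_value(0))   # -> 1 3.0
```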
Simple RL Environment
```python
import gym

# Create environment
env = gym.make('CartPole-v1')

# Reset and get initial state
state, info = env.reset()

total_reward = 0
done = False

while not done:
    # Choose action (random for now)
    action = env.action_space.sample()

    # Take action, get result
    # (newer Gym versions return terminated and truncated separately)
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    state = next_state

print(f"Total reward: {total_reward}")
env.close()
```
Exploration vs Exploitation
The fundamental tradeoff:
- **Exploitation:** Choose the best known action
- **Exploration:** Try new actions to learn more

Too much exploitation = miss better options
Too much exploration = waste time on bad actions
Epsilon-Greedy Strategy
```python
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    num_actions = len(Q[state])
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(num_actions)
    else:
        # Exploit: best known action
        return np.argmax(Q[state])

# Decay epsilon over time
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

for episode in range(1000):
    # ... training ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```
Types of RL
**Value-Based:** Learn value function, derive policy
- Q-Learning, DQN
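As a rough sketch of the value-based idea, the tabular Q-Learning update nudges Q(state, action) toward the reward plus the best estimated value of the next state (sizes and hyperparameters below are illustrative):

```python
import numpy as np

# Tabular Q-Learning update (sketch)
num_states, num_actions = 5, 2
Q = np.zeros((num_states, num_actions))
alpha, gamma = 0.1, 0.99   # learning rate, discount factor

def q_learning_update(state, action, reward, next_state, done):
    # Target: reward now + discounted best value of the next state
    target = reward + (0 if done else gamma * np.max(Q[next_state]))
    # Move the current estimate a small step toward the target
    Q[state, action] += alpha * (target - Q[state, action])

q_learning_update(state=0, action=1, reward=-1, next_state=1, done=False)
```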
**Policy-Based:** Learn policy directly
- REINFORCE, Policy Gradients
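A minimal sketch of the policy-gradient idea behind REINFORCE, using a softmax policy over a small table of parameters (the episode data, sizes, and hyperparameters are made up):

```python
import numpy as np

# Softmax policy over a parameter table theta[state, action] (illustrative sizes)
num_states, num_actions = 5, 2
theta = np.zeros((num_states, num_actions))
alpha, gamma = 0.01, 0.99

def policy(state):
    prefs = theta[state] - np.max(theta[state])   # subtract max for numerical stability
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return probs

# One recorded episode: (state, action, reward) triples (made-up data)
episode = [(0, 1, -1), (1, 1, -1), (2, 0, +10)]

# REINFORCE: push up the log-probability of each taken action, scaled by the return
G = 0.0
for state, action, reward in reversed(episode):
    G = reward + gamma * G                        # return from this step onward
    probs = policy(state)
    grad_log_pi = -probs                          # gradient of log π(action|state) w.r.t. theta[state]
    grad_log_pi[action] += 1.0
    theta[state] += alpha * G * grad_log_pi       # gradient ascent step
```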
**Actor-Critic:** Learn both
- A2C, PPO, SAC
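Actor-critic methods combine the two ideas: a critic estimates values, and its TD error scales the actor's policy update. A compact tabular sketch (names and numbers are illustrative):

```python
import numpy as np

# One-step tabular actor-critic update (sketch)
num_states, num_actions = 5, 2
V = np.zeros(num_states)                      # critic: state-value estimates
theta = np.zeros((num_states, num_actions))   # actor: softmax policy parameters
alpha_v, alpha_pi, gamma = 0.1, 0.01, 0.99

def actor_critic_update(state, action, reward, next_state, done):
    # Critic: TD error measures how much better or worse the step was than expected
    td_error = reward + (0 if done else gamma * V[next_state]) - V[state]
    V[state] += alpha_v * td_error
    # Actor: adjust the taken action's probability in proportion to the TD error
    probs = np.exp(theta[state] - np.max(theta[state]))
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta[state] += alpha_pi * td_error * grad_log_pi

actor_critic_update(state=0, action=1, reward=-1, next_state=1, done=False)
```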
Reward Design
Good rewards are crucial:
```python
# Bad: Sparse reward
reward = 1 if goal_reached else 0   # Hard to learn

# Better: Shaped reward
reward = -distance_to_goal          # Continuous feedback

# Watch for reward hacking!
# The agent might find unexpected ways to maximize reward
```
When to Use RL
**Good fit:**
- Sequential decision making
- Clear reward signal
- Can simulate many episodes
- Games, robotics, recommendations

**Not ideal:**
- One-shot decisions
- No clear reward
- Expensive to try actions (real robots)
- Labeled data is available (use supervised)
Key Takeaway
RL learns through interaction: taking actions, receiving rewards, and improving. It's powerful for sequential decision problems but requires careful reward design and lots of experience. Start with simple environments (CartPole, GridWorld), understand the exploration-exploitation tradeoff, then move on to more complex algorithms.