
Introduction to Reinforcement Learning

Understand the fundamentals of reinforcement learning: agents, environments, rewards, and the exploration-exploitation tradeoff.

Sarah Chen
December 19, 2025


Supervised learning needs labeled data. Reinforcement Learning (RL) learns from experience - through trial and error, guided by rewards.

The RL Framework

```
    ┌────────────────────────┐
    │                        │
    ▼                        │
 [Agent] ───action───> [Environment]
    ▲                        │
    │                        │
    └──reward, state─────────┘
```

- **Agent:** The learner/decision maker
- **Environment:** The world the agent interacts with
- **State:** Current situation
- **Action:** What the agent can do
- **Reward:** Feedback signal (good or bad)
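
To make the loop concrete, here is a minimal sketch of the interaction protocol in Python. The `Environment` and `Agent` interfaces and method names are illustrative assumptions, not from any particular library; real toolkits such as gym (used later in this post) follow the same shape.

```python
# Illustrative interfaces -- names and signatures are assumptions for this sketch.
class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        ...

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        ...


class Agent:
    def act(self, state):
        """Choose an action given the current state."""
        ...


def run_episode(agent, env):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                # Agent ---action---> Environment
        state, reward, done = env.step(action)   # Environment ---reward, state---> Agent
        total_reward += reward
    return total_reward
```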

Example: Game Playing

- **State:** Current game screen
- **Action:** Move left, right, jump
- **Reward:** +1 for coins, -1 for dying, +100 for winning

The agent learns which actions lead to high rewards.

Key Concepts

**Policy (π):** Strategy for choosing actions given states

```
π(state) → action
```

**Value Function (V):** Expected future reward from a state

```
V(state) = Expected total future reward
```

**Q-Function:** Expected reward for taking action a in state s

```
Q(state, action) = Expected total future reward
```
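
For small, finite state and action spaces these objects can literally be arrays. A minimal sketch follows (the sizes and helper names are illustrative assumptions): the greedy policy just picks the action with the highest Q-value, and the value of a state under that policy is the best Q-value available there.

```python
import numpy as np

# Illustrative tabular setup -- sizes are assumptions for the sketch.
num_states, num_actions = 5, 2
Q = np.zeros((num_states, num_actions))   # Q[s, a] = expected total future reward

def greedy_policy(state):
    # pi(state) -> action: pick the action with the highest Q-value
    return int(np.argmax(Q[state]))

def state_value(state):
    # Value under the greedy policy: V(s) = max_a Q(s, a)
    return float(np.max(Q[state]))
```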

Simple RL Environment

```python
import gym

# Create environment
env = gym.make('CartPole-v1')

# Reset and get initial state
state = env.reset()

total_reward = 0
done = False

while not done:
    # Choose action (random for now)
    action = env.action_space.sample()
    # Take action, get result
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    state = next_state

print(f"Total reward: {total_reward}")
env.close()
```
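
Note that this snippet uses the classic `gym` API. If you are on Gymnasium (the maintained fork of gym), `reset()` and `step()` return slightly different values; a minimal adaptation looks like this:

```python
import gymnasium as gym

env = gym.make('CartPole-v1')
state, info = env.reset()          # reset() also returns an info dict

total_reward, done = 0, False
while not done:
    action = env.action_space.sample()
    # step() splits "done" into terminated (episode ended) and truncated (time limit hit)
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    state = next_state

print(f"Total reward: {total_reward}")
env.close()
```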

Exploration vs Exploitation

The fundamental tradeoff:

- **Exploitation:** Choose the best known action
- **Exploration:** Try new actions to learn more

Too much exploitation = miss better options. Too much exploration = waste time on bad actions.

Epsilon-Greedy Strategy

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    num_actions = Q.shape[1]   # Q is a (num_states, num_actions) table
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(num_actions)
    else:
        # Exploit: best known action
        return np.argmax(Q[state])

# Decay epsilon over time
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

for episode in range(1000):
    # ... training ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```
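
In a full training loop, `epsilon_greedy(Q, state, epsilon)` would replace the random `env.action_space.sample()` call from the CartPole example above, so the agent gradually shifts from exploring to exploiting as epsilon decays.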

Types of RL

**Value-Based:** Learn a value function, then derive the policy from it
- Q-Learning, DQN

**Policy-Based:** Learn the policy directly
- REINFORCE, Policy Gradients

**Actor-Critic:** Learn both a policy (actor) and a value function (critic)
- A2C, PPO, SAC
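
To make the value-based family concrete, here is a minimal tabular Q-learning sketch on `FrozenLake-v1`, a small discrete environment that ships with gym. The hyperparameters are illustrative, not tuned, and the action selection mirrors the `epsilon_greedy` helper above.

```python
import gym
import numpy as np

env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # Learning rate, discount, exploration rate (illustrative)

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, info = env.step(action)
        # Q-learning update: nudge Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

env.close()
```

The learned (greedy) policy is then simply `np.argmax(Q[state])` for each state.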

Reward Design

Good rewards are crucial:

```python
# Bad: sparse reward
reward = 1 if goal_reached else 0  # Hard to learn

# Better: shaped reward
reward = -distance_to_goal  # Continuous feedback

# Watch for reward hacking!
# The agent might find unexpected ways to maximize the reward.
```

When to Use RL

**Good fit:**

- Sequential decision making
- Clear reward signal
- Can simulate many episodes
- Games, robotics, recommendations

**Not ideal:**

- One-shot decisions
- No clear reward
- Expensive to try actions (real robots)
- Labeled data is available (use supervised learning)

Key Takeaway

RL learns through interaction - taking actions, receiving rewards, improving. It's powerful for sequential decision problems but requires careful reward design and lots of experience. Start with simple environments (CartPole, GridWorld), understand the exploration-exploitation tradeoff, then move to more complex algorithms.

#Machine Learning #Reinforcement Learning #Advanced