Reinforcement Learning Basics
Learn through rewards and penalties: an agent takes actions, receives feedback, and adjusts its behavior accordingly.
What is Reinforcement Learning?
An agent interacts with an environment, choosing actions to maximize the total reward it collects over time.
**Like training a dog**: Good behavior → Reward!
Key Concepts
- **Agent**: The learner (e.g., a robot)
- **Environment**: The world the agent interacts with
- **State**: The current situation
- **Action**: What the agent can do
- **Reward**: Feedback from the environment (positive or negative)
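These pieces fit together in a simple loop: the agent observes a state, picks an action, and the environment returns a reward and the next state. A minimal sketch using a hypothetical `WalkEnv` (a toy class written here for illustration, not part of any library):

```python
import random

class WalkEnv:
    """Toy environment: the agent walks on positions 0..4; reaching 4 ends
    the episode with reward 1, every other step gives reward 0."""
    def reset(self):
        self.pos = 0
        return self.pos  # initial state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the track)
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

env = WalkEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])  # a random agent: no learning yet
    state, reward, done = env.step(action)
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")
```

A learning agent would replace `random.choice` with a policy that improves from the rewards it observes.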
Simple RL Example
```python
import gym
import numpy as np

# Create environment (simple grid world)
env = gym.make('FrozenLake-v1')

# Q-learning table: one row per state, one column per action
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Parameters
learning_rate = 0.8
discount_factor = 0.95
epsilon = 0.1  # Exploration rate
episodes = 2000

# Training
for episode in range(episodes):
    state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
    done = False
    while not done:
        # Choose action (explore vs exploit)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])  # Exploit

        # Take action (step returns separate terminated/truncated flags)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q-table
        old_value = Q[state, action]
        next_max = np.max(Q[next_state, :])
        Q[state, action] = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        state = next_state

    if episode % 100 == 0:
        print(f"Episode {episode} completed")

print("Training finished!")
```
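To see what one Q-table update actually does, here is the same rule worked with concrete made-up numbers (the values below are illustrative, not taken from a real run):

```python
# Suppose the current estimate, the observed reward, and the best
# next-state value are:
old_value = 0.5        # Q[state, action] before the update
reward = 1.0           # reward observed for this step
next_max = 0.8         # max over Q[next_state, :]
learning_rate = 0.8
discount_factor = 0.95

# TD target: immediate reward plus discounted best future value
target = reward + discount_factor * next_max   # 1.0 + 0.95 * 0.8 = 1.76

# Move the old estimate a fraction of the way toward the target
new_value = old_value + learning_rate * (target - old_value)

print(new_value)  # approximately 1.508
```

The learning rate controls how far each estimate moves toward the target; with `learning_rate = 1.0` the old value would be replaced outright.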
Test Trained Agent
```python
# Test the agent with the learned (greedy) policy
state, _ = env.reset()
done = False
total_reward = 0

while not done:
    action = np.argmax(Q[state, :])  # Use learned policy
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Total reward: {total_reward}")
```
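A common refinement, not included in the script above, is to decay `epsilon` over training so the agent explores heavily at first and exploits more as its estimates improve. A sketch with assumed decay values:

```python
epsilon = 1.0           # start fully exploratory
epsilon_min = 0.01      # never stop exploring entirely
epsilon_decay = 0.995   # multiplicative decay per episode (assumed value)

for episode in range(2000):
    # ... run one episode using the current epsilon for action selection ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(f"Final epsilon: {epsilon:.3f}")
```

After 2000 episodes this schedule has decayed to the floor of 0.01, so late training is almost entirely greedy.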
RL Algorithms
- **Q-Learning**: Off-policy learning of action values
- **SARSA**: On-policy learning of action values
- **DQN**: Deep Q-Network (approximates Q with a neural network)
- **A3C**: Asynchronous Advantage Actor-Critic
- **PPO**: Proximal Policy Optimization
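The on-policy/off-policy distinction between Q-learning and SARSA comes down to one term in the update rule. A side-by-side sketch (function names and defaults are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.8, gamma=0.95):
    # Off-policy: bootstrap from the BEST next action,
    # regardless of which action the policy will actually take
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.8, gamma=0.95):
    # On-policy: bootstrap from the action a_next that the
    # current (e.g., epsilon-greedy) policy actually takes next
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

Because SARSA's target depends on the action actually taken, it accounts for exploration noise and tends to learn more conservative policies than Q-learning.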
Applications
- Game playing (Chess, Go, video games)
- Robotics
- Self-driving cars
- Resource management
Remember
- RL learns by trial and error
- Balance exploration (trying new actions) with exploitation (using what already works)
- Learning typically requires many episodes
- Works best when the environment can be simulated cheaply