AI · 7 min read

Reinforcement Learning Basics

Learn through rewards and penalties.

Dr. Michael Torres
December 18, 2025

An agent learns by taking actions in an environment and receiving rewards.

What is Reinforcement Learning?

In reinforcement learning (RL), an agent interacts with an environment and learns, by trial and error, to choose actions that maximize its cumulative reward.

**Like training a dog**: Good behavior → Reward!

Key Concepts

- **Agent**: The learner and decision-maker (like a robot)
- **Environment**: The world the agent interacts with
- **State**: The current situation the agent observes
- **Action**: What the agent can do
- **Reward**: Feedback from the environment (positive or negative)
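
To see these pieces in one place, here is a minimal interaction loop with a purely random agent, no learning yet. It is a sketch assuming the same classic Gym API used in the examples below (where `env.step` returns four values).

```python
import gym

# One episode with a random agent: just the loop of
# state -> action -> reward -> next state.
env = gym.make('FrozenLake-v1')   # the environment
state = env.reset()               # the initial state
done = False

while not done:
    action = env.action_space.sample()                  # the agent picks an action (randomly here)
    next_state, reward, done, info = env.step(action)   # the environment responds
    print(f"state={state}, action={action}, reward={reward}")
    state = next_state
```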

Simple RL Example

```python
import gym
import numpy as np

# Create environment (simple grid world)
env = gym.make('FrozenLake-v1')

# Q-learning table
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Parameters
learning_rate = 0.8
discount_factor = 0.95
epsilon = 0.1  # Exploration rate
episodes = 2000

# Training
for episode in range(episodes):
    state = env.reset()
    done = False

    while not done:
        # Choose action (explore vs exploit)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])     # Exploit

        # Take action
        next_state, reward, done, info = env.step(action)

        # Update Q-table
        old_value = Q[state, action]
        next_max = np.max(Q[next_state, :])
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        Q[state, action] = new_value

        state = next_state

    if episode % 100 == 0:
        print(f"Episode {episode} completed")

print("Training finished!")
```

Test the Trained Agent

```python
# Test the agent
state = env.reset()
done = False
total_reward = 0

while not done:
    action = np.argmax(Q[state, :])  # Use learned policy
    state, reward, done, info = env.step(action)
    total_reward += reward

print(f"Total reward: {total_reward}")
```

RL Algorithms

- **Q-Learning**: Learns action values (off-policy)
- **SARSA**: On-policy learning (see the sketch below)
- **DQN**: Deep Q-Network (uses neural networks)
- **A3C**: Asynchronous Advantage Actor-Critic
- **PPO**: Proximal Policy Optimization
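
To make the Q-Learning vs. SARSA difference concrete, here is a sketch of how the inner loop of the training example would change under SARSA, reusing the same environment and parameters. The key point: the update uses the action the ε-greedy policy actually takes next, not the max over actions. The helper `choose_action` is introduced here just for illustration.

```python
def choose_action(Q, state, epsilon):
    """Epsilon-greedy selection, same rule as in the training loop above."""
    if np.random.uniform(0, 1) < epsilon:
        return env.action_space.sample()   # explore
    return np.argmax(Q[state, :])          # exploit

# SARSA inner loop (on-policy)
state = env.reset()
action = choose_action(Q, state, epsilon)
done = False

while not done:
    next_state, reward, done, info = env.step(action)
    next_action = choose_action(Q, next_state, epsilon)   # chosen by the same policy
    Q[state, action] += learning_rate * (
        reward + discount_factor * Q[next_state, next_action] - Q[state, action]
    )
    state, action = next_state, next_action
```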

Applications

- Game playing (Chess, Go, video games)
- Robotics
- Self-driving cars
- Resource management

Remember

- RL learns from trial and error
- Balance exploration vs. exploitation (see the sketch below)
- Requires many episodes to learn
- Works best when you can simulate the environment
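
A common way to balance exploration and exploitation is to decay ε over training: explore heavily at first, then act more greedily as the Q-table improves. The schedule below is illustrative; the values are assumptions, not from the article.

```python
# Illustrative epsilon decay: explore a lot early, act greedily later
epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # never stop exploring entirely
epsilon_decay = 0.995  # multiplicative decay per episode
episodes = 2000

for episode in range(episodes):
    # ... run one epsilon-greedy episode here, as in the training loop above ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(f"Final epsilon: {epsilon:.3f}")
```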

#AI #Advanced #RL