
Reinforcement Learning Basics

Learn through rewards and penalties.

Dr. Michael Torres
December 18, 2025

An agent learns by taking actions and receiving rewards.

What is Reinforcement Learning?

An agent interacts with its environment and learns to choose actions that maximize cumulative reward.

Like training a dog: Good behavior → Reward!

Key Concepts

Agent: The learner and decision-maker (e.g., a robot)
Environment: The world the agent interacts with
State: The current situation the agent observes
Action: What the agent can do
Reward: Feedback signal (positive or negative)
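These five concepts fit together in one interaction loop. Here is a minimal sketch using a made-up one-dimensional corridor environment (the `CorridorEnv` class and its details are illustrative, not from any library):

```python
import random

class CorridorEnv:
    """Hypothetical environment: walk from position 0 to the goal at position 4."""
    def __init__(self):
        self.state = 0  # state: the current situation

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to [0, 4])
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1 if self.state == 4 else 0  # reward: feedback from the environment
        done = self.state == 4
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])          # agent chooses an action
    state, reward, done = env.step(action)  # environment returns new state + reward
print("Reached the goal, reward:", reward)
```

This random agent never learns; the RL algorithms below replace `random.choice` with a policy that improves from the rewards it sees.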

Simple RL Example

import gym
import numpy as np

# Create environment (simple grid world)
env = gym.make('FrozenLake-v1')

# Q-learning table: one row per state, one column per action
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
learning_rate = 0.8
discount_factor = 0.95
epsilon = 0.1  # Exploration rate
episodes = 2000

# Training
for episode in range(episodes):
    state, info = env.reset()  # gym >= 0.26: reset returns (observation, info)
    done = False
    
    while not done:
        # Choose action (explore vs. exploit)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])  # Exploit
        
        # Take action (gym >= 0.26: step returns 5 values)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        # Q-learning update
        old_value = Q[state, action]
        next_max = np.max(Q[next_state, :])
        Q[state, action] = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        
        state = next_state
    
    if episode % 100 == 0:
        print(f"Episode {episode} completed")

print("Training finished!")

Test the Trained Agent

# Test the agent
state, info = env.reset()
done = False
total_reward = 0

while not done:
    action = np.argmax(Q[state, :])  # Use learned policy
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Total reward: {total_reward}")

RL Algorithms

Q-Learning: Off-policy learning of action values
SARSA: On-policy counterpart of Q-learning
DQN: Deep Q-Network (approximates Q with a neural network)
A3C: Asynchronous Advantage Actor-Critic
PPO: Proximal Policy Optimization
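The difference between Q-learning and SARSA shows up directly in the update rule. A sketch of both updates side by side (the table size, transition values, and variable names are illustrative):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

# One observed transition (illustrative values)
state, action, reward = 0, 1, 1.0
next_state, next_action = 1, 0  # next_action is what the policy actually picks

# Q-learning (off-policy): bootstrap from the *greedy* next action
q_target = reward + gamma * np.max(Q[next_state])
Q[state, action] += alpha * (q_target - Q[state, action])

# SARSA (on-policy): bootstrap from the action *actually taken* next
sarsa_target = reward + gamma * Q[next_state, next_action]
Q[state, action] += alpha * (sarsa_target - Q[state, action])

print(Q[state, action])
```

The two targets coincide only when the next action happens to be the greedy one, which is why SARSA learns the value of the policy it actually follows (exploration included), while Q-learning learns the value of the greedy policy.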

Applications

  • Game playing (Chess, Go, video games)
  • Robotics
  • Self-driving cars
  • Resource management

Remember

  • RL learns from trial and error
  • Balance exploration and exploitation
  • Learning typically requires many episodes
  • Works best when the environment can be simulated cheaply
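A common way to balance exploration and exploitation is to decay epsilon over training, so the agent explores a lot early on and exploits its knowledge later. A minimal sketch (the schedule constants here are arbitrary choices, not tuned values):

```python
epsilon_start, epsilon_min, decay = 1.0, 0.05, 0.995

epsilon = epsilon_start
for episode in range(1000):
    # ... run one episode with epsilon-greedy action selection ...
    epsilon = max(epsilon_min, epsilon * decay)  # explore less over time

print(f"Final epsilon: {epsilon:.3f}")
```

Plugging a schedule like this into the training loop above (in place of the fixed `epsilon = 0.1`) usually speeds up early learning while still converging to a mostly greedy policy.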
#AI #Advanced #RL