Reinforcement Learning Basics
Learn through rewards and penalties.
Dr. Michael Torres
December 18, 2025
An agent learns by taking actions and receiving rewards.
What is Reinforcement Learning?
The agent interacts with an environment and learns to pick the actions that maximize its total reward over time.
It's like training a dog: good behavior → reward!
Key Concepts
Agent: The learner and decision-maker (for example, a robot)
Environment: The world the agent interacts with
State: The agent's current situation in the environment
Action: A move the agent can make
Reward: Feedback from the environment (positive or negative)
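To make these five terms concrete, here is a minimal sketch of the interaction loop in plain Python. The `CorridorEnv` class, its reward values, and the `run_random_episode` helper are hypothetical and invented for illustration; they are not part of Gym.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # State: the agent's current cell
        self.state = 0
        return self.state

    def step(self, action):
        # Action: 0 = move left, 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Reward: +1 for reaching the goal, a small penalty otherwise
        reward = 1.0 if done else -0.01
        return self.state, reward, done


def run_random_episode(env, rng, max_steps=1000):
    """Agent: here just a random policy that has not learned anything yet."""
    state = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = rng.choice([0, 1])
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return state, total_reward, step


final_state, total_reward, steps = run_random_episode(CorridorEnv(), random.Random(0))
print(f"Stopped in cell {final_state} after {steps} steps, total reward {total_reward:.2f}")
```

A learning agent would replace the random choice with a policy that improves as rewards come in.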
Simple RL Example
import gym
import numpy as np

# Note: this uses the classic Gym API (gym < 0.26), where reset() returns the
# state and step() returns four values. Newer Gym and Gymnasium releases return
# (state, info) from reset() and five values from step().

# Create environment (simple grid world)
env = gym.make('FrozenLake-v1')

# Q-learning table: one row per state, one column per action
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Parameters
learning_rate = 0.8
discount_factor = 0.95
epsilon = 0.1  # Exploration rate
episodes = 2000

# Training
for episode in range(episodes):
    state = env.reset()
    done = False
    while not done:
        # Choose action (explore vs exploit)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])  # Exploit

        # Take action
        next_state, reward, done, info = env.step(action)

        # Update Q-table: move the old estimate toward reward + discounted best next value
        old_value = Q[state, action]
        next_max = np.max(Q[next_state, :])
        new_value = old_value + learning_rate * (reward + discount_factor * next_max - old_value)
        Q[state, action] = new_value

        state = next_state

    if episode % 100 == 0:
        print(f"Episode {episode} completed")

print("Training finished!")
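The update inside the training loop is easy to check by hand for a single step. Here is a small sketch that uses the same `learning_rate` and `discount_factor` as above; the Q-values and the transition are made-up numbers, and `q_update` is a hypothetical helper name.

```python
def q_update(old_value, reward, next_max, learning_rate=0.8, discount_factor=0.95):
    """One Q-learning update: move old_value toward the TD target."""
    td_target = reward + discount_factor * next_max
    td_error = td_target - old_value
    return old_value + learning_rate * td_error


# Made-up transition: no immediate reward, but the next state looks promising.
new_value = q_update(old_value=0.2, reward=0.0, next_max=0.5)
print(f"{new_value:.3f}")  # prints 0.420
```

The target here is 0 + 0.95 * 0.5 = 0.475, and the estimate moves 80% of the way from 0.2 toward it. The discount factor is what lets a reward at the goal flow back to earlier states over many episodes.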
Test Trained Agent
# Test the agent
state = env.reset()
done = False
total_reward = 0
while not done:
    action = np.argmax(Q[state, :])  # Use learned policy
    state, reward, done, info = env.step(action)
    total_reward += reward
print(f"Total reward: {total_reward}")
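FrozenLake-v1 is slippery by default, so a single test episode can fail by chance even with a good table. It can help to also look at the greedy policy the table encodes. The sketch below assumes FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up); `greedy_policy_grid` is a hypothetical helper and `example_q` is a made-up table, not trained output.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # FrozenLake: 0=left, 1=down, 2=right, 3=up


def greedy_policy_grid(q_table, n_cols=4):
    """Return one string per grid row showing the highest-valued action per state."""
    rows = []
    for start in range(0, len(q_table), n_cols):
        row = ""
        for action_values in q_table[start:start + n_cols]:
            best_action = max(range(len(action_values)), key=lambda a: action_values[a])
            row += ARROWS[best_action]
        rows.append(row)
    return rows


# Made-up Q-table for a 2x2 grid (4 states, 4 actions), for illustration only.
example_q = [
    [0.1, 0.2, 0.6, 0.0],  # state 0: right is best
    [0.0, 0.7, 0.1, 0.1],  # state 1: down is best
    [0.3, 0.1, 0.5, 0.0],  # state 2: right is best
    [0.0, 0.0, 0.0, 0.0],  # state 3: all tied, so the first action (left) wins
]
for line in greedy_policy_grid(example_q, n_cols=2):
    print(line)
```

With the trained table, `greedy_policy_grid(Q)` should print four rows for the default 4x4 map. States that were never updated (holes and the goal) stay at zero and show up as `<`.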
RL Algorithms
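Q-Learning: Off-policy; learns the value of each action in each state
SARSA: On-policy; learns the value of the policy it is actually following
DQN: Deep Q-Network (approximates Q-values with a neural network)
A3C: Asynchronous Advantage Actor-Critic
PPO: Proximal Policy Optimization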
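The difference between the first two entries, Q-learning and SARSA, comes down to one term in the update target. Here is a sketch with hypothetical helper names and made-up values:

```python
def q_learning_target(reward, next_action_values, discount_factor=0.95):
    """Off-policy: bootstrap from the best action in the next state."""
    return reward + discount_factor * max(next_action_values)


def sarsa_target(reward, next_action_values, next_action, discount_factor=0.95):
    """On-policy: bootstrap from the action the agent will actually take next."""
    return reward + discount_factor * next_action_values[next_action]


# Made-up values for the next state; suppose exploration picked action 0.
next_action_values = [0.2, 0.8, 0.4, 0.0]
q_target = q_learning_target(0.0, next_action_values)
s_target = sarsa_target(0.0, next_action_values, next_action=0)
print(f"Q-learning target: {q_target:.2f}")  # prints Q-learning target: 0.76
print(f"SARSA target:      {s_target:.2f}")  # prints SARSA target:      0.19
```

When the next action happens to be the greedy one, the two targets are identical; they only differ on exploratory steps.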
Applications
- Game playing (Chess, Go, video games)
- Robotics
- Self-driving cars
- Resource management
Remember
- RL learns from trial and error
- Balance exploration vs exploitation
- Usually requires many episodes to learn
- Works best when you can simulate the environment
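One common way to balance exploration and exploitation is to start with a high exploration rate and shrink it over time, rather than keeping epsilon fixed at 0.1 as in the training example. The sketch below shows the idea; the schedule and its constants are assumptions for illustration, not tuned values, and both function names are hypothetical.

```python
import random


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially decay epsilon per episode, never dropping below `end`."""
    return max(end, start * decay ** episode)


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # Explore
    return max(range(len(action_values)), key=lambda a: action_values[a])  # Exploit


for episode in (0, 500, 2000):
    print(episode, round(decayed_epsilon(episode), 3))
```

Early episodes are almost fully random, which helps the agent find the goal at all; later episodes mostly follow what it has learned.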