Reinforcement Learning Basics
Train AI through rewards and penalties.
AI that learns by doing.
What is Reinforcement Learning?
An agent learns by interacting with its environment.
**Key Idea**: Actions → Rewards/Penalties → Learning
Like training a dog with treats!
Key Concepts
- **Agent**: The learner (AI)
- **Environment**: The world
- **State**: Current situation
- **Action**: What the agent can do
- **Reward**: Feedback from the environment
Example - Robot Navigation
- **State**: Robot's position in the room
- **Actions**: Move forward, turn left, turn right
- **Reward**: +10 for reaching the goal, -1 for hitting a wall
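A minimal, self-contained sketch of such a setup and an agent interacting with it (a hypothetical grid world with up/down/left/right moves instead of turn commands, not from any library):

```python
import random

class GridWorld:
    # Environment: a 4x4 room; the goal is the bottom-right cell
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.pos = (0, 0)                       # State: robot's position
        return self.pos

    def step(self, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        row, col = self.pos
        dr, dc = moves[action]
        new_row, new_col = row + dr, col + dc
        if 0 <= new_row < self.size and 0 <= new_col < self.size:
            self.pos = (new_row, new_col)
            reward = 10 if self.pos == self.goal else 0   # +10 for reaching the goal
        else:
            reward = -1                                   # -1 for hitting a wall
        done = self.pos == self.goal
        return self.pos, reward, done

# Agent: a random policy, just to show the interaction loop
env = GridWorld()
state = env.reset()
done = False
while not done:
    action = random.randrange(4)                # Action chosen by the agent
    state, reward, done = env.step(action)      # Reward fed back by the environment
```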
Q-Learning
A simple tabular RL algorithm that learns a table of expected rewards:
```python
import numpy as np

# Q-table: state x action -> expected reward
# (assumes a discrete environment `env`, e.g. gym's FrozenLake,
#  with integer states and actions)
Q = np.zeros((num_states, num_actions))

# Hyperparameters
learning_rate = 0.1
discount = 0.95
epsilon = 0.1  # Exploration rate

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # Choose action (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # Explore
        else:
            action = np.argmax(Q[state])         # Exploit

        # Take action
        next_state, reward, done, _ = env.step(action)

        # Update Q-value
        old_q = Q[state, action]
        next_max = np.max(Q[next_state])
        new_q = old_q + learning_rate * (reward + discount * next_max - old_q)
        Q[state, action] = new_q

        state = next_state

# Use learned policy
state = env.reset()
done = False
while not done:
    action = np.argmax(Q[state])
    state, reward, done, _ = env.step(action)
```
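The inner-loop update above is the standard Q-learning rule, where α is the learning rate and γ the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$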
OpenAI Gym
A standard library of RL environments:
```python
import gym

# Create environment
env = gym.make('CartPole-v1')

# Reset environment
state = env.reset()

for _ in range(1000):
    env.render()

    # Take random action
    action = env.action_space.sample()

    # Get result
    next_state, reward, done, info = env.step(action)

    if done:
        break

env.close()
```
Deep Q-Network (DQN)
Q-Learning with a neural network approximating the Q-table:
```python
import random
import numpy as np
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # Replay buffer
        self.gamma = 0.95                  # Discount rate
        self.epsilon = 1.0                 # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)   # Explore
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])                   # Exploit

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target += self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Train DQN on CartPole (4 state values, 2 actions);
# `env` is the CartPole environment created in the Gym example above
agent = DQNAgent(state_size=4, action_size=2)

for episode in range(1000):
    state = env.reset()
    state = np.reshape(state, [1, 4])
    for time in range(500):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, 4])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print(f"Episode: {episode}, Score: {time}")
            break
        if len(agent.memory) > 32:
            agent.replay(32)
```
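After training, the agent can be run greedily with exploration turned off. A minimal sketch, assuming the CartPole `env` and the old four-value `step()` API used above:

```python
# Evaluate the trained agent without exploration
agent.epsilon = 0.0
state = np.reshape(env.reset(), [1, 4])
done = False
score = 0
while not done:
    action = agent.act(state)
    next_state, reward, done, _ = env.step(action)
    state = np.reshape(next_state, [1, 4])
    score += reward
print(f"Evaluation score: {score}")
```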
Policy Gradient
Learn the policy directly instead of Q-values:
```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Instead of Q-values, output action probabilities
# (state_size, action_size and state as in the DQN example above)
model = Sequential([
    Dense(24, input_dim=state_size, activation='relu'),
    Dense(24, activation='relu'),
    Dense(action_size, activation='softmax')   # Probabilities
])

# Sample action from the probability distribution
probs = model.predict(state)[0]
action = np.random.choice(action_size, p=probs)
```
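The block above only defines and samples from the policy network. One common way to train such a network is the REINFORCE policy gradient: increase the log-probability of actions that were followed by high returns. The sketch below is illustrative (the helpers `discounted_returns` and `reinforce_update` are not from any library), assuming the `model` above and one episode's data collected as lists of states, actions, and rewards:

```python
import numpy as np
import tensorflow as tf

def discounted_returns(rewards, gamma=0.99):
    # Discounted return G_t for each step of one episode
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(model, optimizer, states, actions, rewards):
    # One policy-gradient step on a single episode
    returns = discounted_returns(rewards)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalize for stability
    states = np.array(states, dtype=np.float32)
    actions = np.array(actions, dtype=np.int32)
    with tf.GradientTape() as tape:
        probs = model(states)                            # (T, action_size)
        taken = tf.gather(probs, actions, batch_dims=1)  # Prob of each taken action
        loss = -tf.reduce_sum(tf.math.log(taken + 1e-8) * returns)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Typical usage after collecting one episode:
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
# reinforce_update(model, optimizer, states, actions, rewards)
```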
Applications
- Game playing (AlphaGo, Atari)
- Robotics
- Self-driving cars
- Resource management
- Trading algorithms
- Recommendation systems
Challenges
- Sparse rewards
- Exploration vs. exploitation
- Sample inefficiency
- Stability
Remember
- RL learns through trial and error
- Tabular Q-Learning works for small, discrete state and action spaces
- DQN scales to larger or continuous state spaces
- Both typically require many episodes of experience