AI · 8 min read

Reinforcement Learning Basics

Train AI through rewards and penalties.

Dr. Patricia Moore
December 18, 2025

AI that learns by doing.

What is Reinforcement Learning?

An agent learns by interacting with its environment.

**Key Idea**: Actions → Rewards/Penalties → Learning

Like training a dog with treats!

Key Concepts

- **Agent**: The learner (the AI)
- **Environment**: The world the agent acts in
- **State**: The current situation
- **Action**: What the agent can do
- **Reward**: Feedback from the environment
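
These pieces fit together in a simple loop: the agent observes the state, picks an action, and the environment returns a reward and the next state. Here is a minimal sketch of that loop, assuming a Gym-style environment called `env` and a placeholder `choose_action` function:

```python
# Minimal agent-environment loop (Gym-style API; choose_action is a placeholder)
state = env.reset()                     # State: current situation
done = False
while not done:
    action = choose_action(state)       # Action: what the agent can do
    next_state, reward, done, info = env.step(action)  # Reward: feedback
    state = next_state                  # Move to the new state and repeat
```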

Example - Robot Navigation

- **State**: Robot's position in the room
- **Actions**: Move forward, turn left, turn right
- **Reward**: +10 for reaching the goal, -1 for hitting a wall
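
As an illustration only, a reward function for this example might look like the sketch below; the goal position and the `hit_wall` flag are hypothetical:

```python
GOAL = (4, 4)  # hypothetical goal cell in a 5x5 room

def reward(position, hit_wall):
    """Toy reward for the robot-navigation example."""
    if position == GOAL:
        return +10   # reached the goal
    if hit_wall:
        return -1    # bumped into a wall
    return 0         # otherwise, a neutral step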

Q-Learning

A simple RL algorithm that keeps a table of expected rewards for every state-action pair:

```python
import numpy as np

# Assumes a discrete Gym-style environment `env` with
# num_states states and num_actions actions (e.g. FrozenLake)

# Q-table: State x Action → Expected reward
Q = np.zeros((num_states, num_actions))

# Hyperparameters
learning_rate = 0.1
discount = 0.95
epsilon = 0.1  # Exploration rate

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # Choose action (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit

        # Take action
        next_state, reward, done, _ = env.step(action)

        # Update Q-value
        old_q = Q[state, action]
        next_max = np.max(Q[next_state])
        new_q = old_q + learning_rate * (reward + discount * next_max - old_q)
        Q[state, action] = new_q

        state = next_state

# Use the learned policy
state = env.reset()
done = False
while not done:
    action = np.argmax(Q[state])
    state, reward, done, _ = env.step(action)
```

OpenAI Gym

A standard toolkit of RL environments with a common interface:

```python
import gym

# Create environment
env = gym.make('CartPole-v1')

# Reset environment
state = env.reset()

for _ in range(1000):
    env.render()

    # Take random action
    action = env.action_space.sample()

    # Get result
    next_state, reward, done, info = env.step(action)

    if done:
        break

env.close()
```
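
Note that the original `gym` package is no longer maintained; its successor, Gymnasium, keeps the same idea but changes the interface slightly: `reset()` returns `(observation, info)` and `step()` returns five values. A quick sketch of the equivalent random-action loop:

```python
import gymnasium as gym

env = gym.make('CartPole-v1')
observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```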

Deep Q-Network (DQN)

Q-Learning with a neural network approximating the Q-table, so it scales to large state spaces:

```python
import random
from collections import deque

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # Replay buffer
        self.gamma = 0.95                  # Discount rate
        self.epsilon = 1.0                 # Exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)   # Explore
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])                   # Exploit

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target += self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Train DQN on CartPole (4 state values, 2 actions)
agent = DQNAgent(state_size=4, action_size=2)

for episode in range(1000):
    state = env.reset()
    state = np.reshape(state, [1, 4])
    for time in range(500):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, 4])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print(f"Episode: {episode}, Score: {time}")
            break
        if len(agent.memory) > 32:
            agent.replay(32)
```

Policy Gradient

Learn the policy directly, instead of estimating Q-values first:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Instead of Q-values, output action probabilities
# (state_size and action_size come from the environment)
model = Sequential([
    Dense(24, input_dim=state_size, activation='relu'),
    Dense(24, activation='relu'),
    Dense(action_size, activation='softmax')   # Probabilities
])

# Sample an action from the probability distribution
probs = model.predict(state)[0]
action = np.random.choice(action_size, p=probs)
```
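
The block above only samples actions; it does not show how the policy improves. As a rough sketch of the REINFORCE update (one possible policy-gradient method, not the only one), the idea is to increase the probability of actions that led to high discounted returns. The episode arrays passed in are assumed to be collected elsewhere:

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

def reinforce_update(states, actions, returns):
    """One REINFORCE step on a finished episode.
    states: (T, state_size), actions: (T,) ints, returns: (T,) discounted returns."""
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    returns = tf.convert_to_tensor(returns, dtype=tf.float32)
    with tf.GradientTape() as tape:
        probs = model(states)                                   # (T, action_size)
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        taken = tf.gather_nd(probs, idx)                         # prob of each action taken
        # Raise the log-probability of actions in proportion to their return
        loss = -tf.reduce_mean(tf.math.log(taken + 1e-8) * returns)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```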

Applications

- Game playing (AlphaGo, Atari)
- Robotics
- Self-driving cars
- Resource management
- Trading algorithms
- Recommendation systems

Challenges

- Sparse rewards
- Exploration vs exploitation
- Sample inefficiency
- Stability

Remember

- RL learns through trial and error
- Q-Learning for discrete actions
- DQN for complex environments
- Requires lots of episodes

#AI#Advanced#RL