Attention Mechanism and Transformers
Understand the attention mechanism that revolutionized NLP and how Transformers work without recurrence.
RNNs process sequences one step at a time. Transformers process everything at once with "attention" - and they've taken over NLP.
The Attention Insight
When translating "The cat sat on the mat," the word "sat" mostly depends on "cat," not equally on all words. Attention learns these dependencies.
How Attention Works
For each position, compute attention scores to all other positions:
Query (what am I looking for?)
Key (what do I contain?)
Value (what do I give back?)
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Convert to probabilities
    weights = softmax(scores)
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output, weights
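A quick sanity check with random matrices (the shapes here are purely illustrative): each of the 4 query positions ends up with a weight distribution over the 6 key positions.

# Toy example: 4 query positions, 6 key/value positions, dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))

output, weights = attention(Q, K, V)
print(output.shape)          # (4, 8) - one context vector per query
print(weights.sum(axis=-1))  # each row of weights sums to 1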
Self-Attention
In self-attention, Q, K, V all come from the same sequence. Each position attends to all positions (including itself).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        # x shape: (seq_len, batch, embed_dim)
        attn_output, attn_weights = self.attention(x, x, x)
        return attn_output, attn_weights
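For example (dimensions chosen only for illustration), a batch of 2 sequences of length 10 with 64-dimensional embeddings:

layer = SelfAttention(embed_dim=64, num_heads=8)
x = torch.randn(10, 2, 64)  # (seq_len, batch, embed_dim)
out, weights = layer(x)
print(out.shape)      # torch.Size([10, 2, 64])
print(weights.shape)  # torch.Size([2, 10, 10]) - weights averaged over heads by default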
The Transformer Architecture
Transformers stack self-attention with feed-forward networks:
Input Embedding + Positional Encoding
│
┌───────▼───────┐
│ Multi-Head │
│ Attention │
└───────┬───────┘
│ + Residual
┌───────▼───────┐
│ Feed-Forward │
│ Network │
└───────┬───────┘
│ + Residual
▼
(Repeat N times)
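PyTorch ships this stack as nn.TransformerEncoderLayer and nn.TransformerEncoder; a minimal sketch (the hyperparameters below are illustrative, not canonical):

import torch
import torch.nn as nn

# One block: multi-head attention + feed-forward, each with residual + LayerNorm
layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256, dropout=0.1)
# Repeat N times
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(10, 2, 64)  # (seq_len, batch, d_model)
out = encoder(x)
print(out.shape)            # torch.Size([10, 2, 64])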
Multi-Head Attention
Instead of a single attention function, multi-head attention runs several "heads" in parallel, each learning different patterns. nn.MultiheadAttention handles the head splitting internally; here it sits inside a complete Transformer block:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward with residual
        ff_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x
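Running the custom block on the same illustrative shapes, and stacking it to get the "repeat N times" part of the diagram:

block = TransformerBlock(embed_dim=64, num_heads=8, ff_dim=256)
x = torch.randn(10, 2, 64)  # (seq_len, batch, embed_dim)
out = block(x)
print(out.shape)            # torch.Size([10, 2, 64])

# Stack N blocks - each returns a tensor of the same shape, so nn.Sequential works
encoder = nn.Sequential(*[TransformerBlock(64, 8, 256) for _ in range(6)])
print(encoder(x).shape)     # torch.Size([10, 2, 64])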
Positional Encoding
Self-attention by itself has no notion of token order. Positional encoding adds position information to the embeddings:
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Apply sin to even indices, cos to odd
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angle[:, 0::2])
    encoding[:, 1::2] = np.cos(angle[:, 1::2])
    return encoding
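The encoding is simply added to the token embeddings before the first Transformer block; the shapes below are illustrative:

# Add position information to a sequence of token embeddings (illustrative shapes)
seq_len, d_model = 10, 64
pe = positional_encoding(seq_len, d_model)   # (10, 64)

token_embeddings = np.random.randn(seq_len, d_model)
inputs = token_embeddings + pe               # same shape, now position-aware
print(inputs.shape)                          # (10, 64)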
Using Pre-trained Transformers
from transformers import AutoTokenizer, AutoModel
# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
# Tokenize
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
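last_hidden_state gives one vector per token. To get a single sentence embedding, one common approach is mean pooling over the non-padding tokens; this is a sketch of that option, not the only way to use BERT's outputs:

# Mean-pool token embeddings into one sentence vector, ignoring padding
mask = inputs["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
summed = (embeddings * mask).sum(dim=1)         # sum over real tokens only
sentence_embedding = summed / mask.sum(dim=1)   # (batch, hidden_size)
print(sentence_embedding.shape)                 # torch.Size([1, 768])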
Why Transformers Win
| RNN | Transformer |
|---|---|
| Sequential processing | Parallel processing |
| Limited context | Full context |
| Vanishing gradients | Stable training |
| Slower training | Much faster |
Key Takeaway
Attention lets models focus on relevant parts of the input. Transformers use self-attention to process entire sequences in parallel, enabling massive scale and better long-range dependencies. They power GPT, BERT, and most modern NLP. Use pre-trained models for most tasks - training from scratch requires enormous resources.