Attention Mechanism and Transformers
Understand the attention mechanism that revolutionized NLP and how Transformers work without recurrence.
RNNs process sequences one step at a time. Transformers process everything at once with "attention" - and they've taken over NLP.
The Attention Insight
When translating "The cat sat on the mat," the word "sat" mostly depends on "cat," not equally on all words. Attention learns these dependencies.
How Attention Works
For each position, compute attention scores to all other positions:
```
Query (what am I looking for?)
Key   (what do I contain?)
Value (what do I give back?)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
```
```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Compute scaled dot-product attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Convert scores to probabilities
    weights = softmax(scores)
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output, weights
```
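As a quick sanity check, here is a toy call with random matrices; the shapes are chosen purely for illustration, and each row of `weights` sums to 1:

```python
# Toy example: 4 positions, dimension 8 (arbitrary illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = attention(Q, K, V)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row sums to ~1.0
```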
Self-Attention
In self-attention, Q, K, V all come from the same sequence. Each position attends to all positions (including itself).
```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        # x shape: (seq_len, batch, embed_dim)
        attn_output, attn_weights = self.attention(x, x, x)
        return attn_output, attn_weights
```
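A minimal forward pass might look like this; the dimensions below are illustrative assumptions, not values from the text:

```python
# Illustrative shapes: 10 tokens, batch of 2, 64-dim embeddings, 4 heads
x = torch.randn(10, 2, 64)            # (seq_len, batch, embed_dim)
layer = SelfAttention(embed_dim=64, num_heads=4)
out, weights = layer(x)
print(out.shape)      # torch.Size([10, 2, 64])
print(weights.shape)  # torch.Size([2, 10, 10]), averaged over heads by default
```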
The Transformer Architecture
Transformers stack self-attention with feed-forward networks:
```
Input Embedding + Positional Encoding
        │
┌───────▼───────┐
│  Multi-Head   │
│   Attention   │
└───────┬───────┘
        │ + Residual
┌───────▼───────┐
│ Feed-Forward  │
│    Network    │
└───────┬───────┘
        │ + Residual
        ▼
  (Repeat N times)
```
Multi-Head Attention
Instead of a single attention function, Transformers use multiple "heads" that learn different patterns in parallel. The block below combines multi-head attention with the feed-forward network, residual connections, and layer normalization:
```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward with residual connection
        ff_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x
```
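To mirror the "(Repeat N times)" step in the diagram, a full encoder simply stacks several of these blocks. A minimal sketch, with layer count and sizes chosen only for illustration:

```python
# Sketch: stack N identical blocks (sizes are illustrative assumptions)
embed_dim, num_heads, ff_dim, num_layers = 64, 4, 256, 2
encoder = nn.Sequential(*[
    TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)
])

x = torch.randn(10, 2, embed_dim)   # (seq_len, batch, embed_dim)
out = encoder(x)                    # output keeps the input shape
print(out.shape)                    # torch.Size([10, 2, 64])
```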
Positional Encoding
Transformers have no inherent sense of token order, because self-attention treats the sequence as a set. Positional encoding adds the missing position information:
```python
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Apply sin to even indices, cos to odd indices
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angle[:, 0::2])
    encoding[:, 1::2] = np.cos(angle[:, 1::2])
    return encoding
```
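The encoding is added to the token embeddings before the first block. A short sketch, with illustrative shapes:

```python
# Add positional information to a batch of embeddings (illustrative shapes)
seq_len, batch, d_model = 10, 2, 64
embeddings = torch.randn(seq_len, batch, d_model)
pos = torch.tensor(positional_encoding(seq_len, d_model), dtype=torch.float32)
embeddings = embeddings + pos.unsqueeze(1)   # broadcast over the batch dimension
print(embeddings.shape)                      # torch.Size([10, 2, 64])
```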
Using Pre-trained Transformers
```python
from transformers import AutoTokenizer, AutoModel

# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Tokenize
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```
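If you need a single vector per sentence rather than per-token embeddings, one common convention (not from the original text, and not the only option; `[CLS]` pooling is another) is to mean-pool the token embeddings while ignoring padding:

```python
# Sketch: mean-pool token embeddings into one sentence vector
mask = inputs["attention_mask"].unsqueeze(-1)             # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                           # torch.Size([1, 768])
```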
Why Transformers Win
| RNN | Transformer |
|-----|-------------|
| Sequential processing | Parallel processing |
| Limited context | Full context |
| Vanishing gradients | Stable training |
| Slower training | Much faster |
Key Takeaway
Attention lets models focus on the relevant parts of the input. Transformers use self-attention to process entire sequences in parallel, enabling massive scale and better modeling of long-range dependencies. They power GPT, BERT, and most modern NLP. Use pre-trained models for most tasks - training from scratch requires enormous resources.