ML · 12 min read

Attention Mechanism and Transformers

Understand the attention mechanism that revolutionized NLP and how Transformers work without recurrence.

Sarah Chen
December 19, 2025


RNNs process sequences one step at a time. Transformers process everything at once with "attention" - and they've taken over NLP.

The Attention Insight

When translating "The cat sat on the mat," the word "sat" mostly depends on "cat," not equally on all words. Attention learns these dependencies.

How Attention Works

For each position, compute attention scores to all other positions:

```
Query (what am I looking for?)
Key   (what do I contain?)
Value (what do I give back?)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
```

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax helper
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Convert to probabilities
    weights = softmax(scores)
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output, weights
```
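
As a quick sanity check, here is a toy call to the function above; the shapes and random inputs are purely illustrative, not from the original post:

```python
# Toy example: 4 query positions, 6 key/value positions, d_k = 8
np.random.seed(0)
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 8)

output, weights = attention(Q, K, V)
print(output.shape)   # (4, 8) - one weighted value vector per query position
print(weights.shape)  # (4, 6) - each row is a probability distribution over keys
```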

Self-Attention

In self-attention, Q, K, V all come from the same sequence. Each position attends to all positions (including itself).

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        # x shape: (seq_len, batch, embed_dim)
        attn_output, attn_weights = self.attention(x, x, x)
        return attn_output, attn_weights
```
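
A minimal usage sketch, assuming random inputs in the (seq_len, batch, embed_dim) layout that nn.MultiheadAttention expects by default:

```python
# Hypothetical sizes: 10 tokens, batch of 2, 64-dim embeddings, 4 heads
x = torch.randn(10, 2, 64)
layer = SelfAttention(embed_dim=64, num_heads=4)

out, weights = layer(x)
print(out.shape)      # torch.Size([10, 2, 64])
print(weights.shape)  # torch.Size([2, 10, 10]) - weights averaged across heads
```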

The Transformer Architecture

Transformers stack self-attention with feed-forward networks:

```
Input Embedding + Positional Encoding
                │
        ┌───────▼───────┐
        │  Multi-Head   │
        │   Attention   │
        └───────┬───────┘
                │  + Residual
        ┌───────▼───────┐
        │ Feed-Forward  │
        │    Network    │
        └───────┬───────┘
                │  + Residual
                ▼
        (Repeat N times)
```

Multi-Head Attention

Instead of a single attention computation, multi-head attention runs several "heads" in parallel, each learning different patterns. Here is a full Transformer block that combines multi-head attention with the feed-forward network:

```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward with residual connection
        ff_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x
```
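
To mirror the "(Repeat N times)" step in the diagram, blocks can simply be stacked. A rough sketch, with layer count and dimensions chosen only for illustration:

```python
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim):
        super().__init__()
        # N identical blocks applied in sequence
        self.layers = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Example: 6 layers, 64-dim embeddings, 4 heads, 256-dim feed-forward
encoder = TransformerEncoder(num_layers=6, embed_dim=64, num_heads=4, ff_dim=256)
x = torch.randn(10, 2, 64)   # (seq_len, batch, embed_dim)
print(encoder(x).shape)      # torch.Size([10, 2, 64])
```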

Positional Encoding

Self-attention treats its input as an unordered set, so Transformers have no built-in sense of word order. Positional encoding adds that information:

```python
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)

    # Apply sin to even indices, cos to odd indices
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angle[:, 0::2])
    encoding[:, 1::2] = np.cos(angle[:, 1::2])
    return encoding
```
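
A brief sketch of how this encoding is typically combined with token embeddings; the random embedding matrix here is a stand-in for a learned one:

```python
seq_len, d_model = 10, 64
token_embeddings = np.random.randn(seq_len, d_model)  # placeholder for learned embeddings

pe = positional_encoding(seq_len, d_model)
x = token_embeddings + pe   # same shape (10, 64); position info is now mixed in
print(x.shape)
```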

Using Pre-trained Transformers

```python
from transformers import AutoTokenizer, AutoModel

# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Tokenize
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```
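
One common follow-up (not prescribed by the post) is to mean-pool the token embeddings into a single sentence vector, ignoring padding tokens:

```python
# Mask out padding positions, then average the remaining token embeddings
mask = inputs['attention_mask'].unsqueeze(-1)              # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                            # torch.Size([1, 768]) for bert-base
```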

Why Transformers Win

| RNN | Transformer |
|-----|-------------|
| Sequential processing | Parallel processing |
| Limited context | Full context |
| Vanishing gradients | Stable training |
| Slower training | Much faster |

Key Takeaway

Attention lets models focus on relevant parts of the input. Transformers use self-attention to process entire sequences in parallel, enabling massive scale and better long-range dependencies. They power GPT, BERT, and most modern NLP. Use pre-trained models for most tasks - training from scratch requires enormous resources.

#Machine Learning #Deep Learning #Transformers #Attention #NLP #Advanced