ML · 12 min read

Attention Mechanism and Transformers

Understand the attention mechanism that revolutionized NLP and how Transformers work without recurrence.

Sarah Chen
December 19, 2025

RNNs process sequences one step at a time. Transformers process the whole sequence at once using "attention", and they have taken over NLP.

The Attention Insight

When translating "The cat sat on the mat," the word "sat" mostly depends on "cat," not equally on all words. Attention learns these dependencies.

How Attention Works

For each position, the model computes three vectors and uses them to score every position in the sequence:

Query (what am I looking for?)
Key (what do I contain?)
Value (what do I give back?)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    
    # Compute scaled dot-product scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Convert scores to attention weights (each row sums to 1)
    weights = softmax(scores, axis=-1)
    
    # Weighted sum of the values
    output = np.matmul(weights, V)
    
    return output, weights
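
As a quick sanity check (the shapes here are arbitrary), you can call it on random matrices:

# Toy example: 4 positions with 8-dimensional queries, keys, and values
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)

output, weights = attention(Q, K, V)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1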

Self-Attention

In self-attention, Q, K, V all come from the same sequence. Each position attends to all positions (including itself).

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
    
    def forward(self, x):
        # x shape: (seq_len, batch, embed_dim)
        attn_output, attn_weights = self.attention(x, x, x)
        return attn_output, attn_weights
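
A minimal sketch of running the layer on random input (the dimensions are made up for illustration):

import torch

layer = SelfAttention(embed_dim=64, num_heads=4)
x = torch.randn(10, 2, 64)   # (seq_len=10, batch=2, embed_dim=64)
out, weights = layer(x)
print(out.shape)       # torch.Size([10, 2, 64])
print(weights.shape)   # torch.Size([2, 10, 10]), averaged over heads by default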

The Transformer Architecture

Transformers stack self-attention with feed-forward networks:

Input Embedding + Positional Encoding
            │
    ┌───────▼───────┐
    │  Multi-Head   │
    │   Attention   │
    └───────┬───────┘
            │ + Residual
    ┌───────▼───────┐
    │  Feed-Forward │
    │    Network    │
    └───────┬───────┘
            │ + Residual
            ▼
      (Repeat N times)

Multi-Head Attention

Instead of a single attention function, multi-head attention runs several "heads" in parallel, each learning different patterns. A Transformer block combines it with a feed-forward network, residual connections, and layer normalization:

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        
        # Feed-forward with residual
        ff_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ff_out))
        
        return x
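
To realize the "Repeat N times" step from the diagram, you can stack several blocks. This is a rough sketch with arbitrary sizes, not a full Transformer (it omits embeddings and positional encoding):

import torch
import torch.nn as nn

# Stack N identical blocks; each maps (seq_len, batch, embed_dim) to the same shape
blocks = nn.Sequential(*[
    TransformerBlock(embed_dim=64, num_heads=4, ff_dim=256)
    for _ in range(6)   # N = 6
])

x = torch.randn(10, 2, 64)   # (seq_len, batch, embed_dim)
out = blocks(x)
print(out.shape)             # torch.Size([10, 2, 64])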

Positional Encoding

Self-attention has no built-in notion of word order. Positional encoding adds position information to the embeddings:

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    
    # Apply sin to even indices, cos to odd
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angle[:, 0::2])
    encoding[:, 1::2] = np.cos(angle[:, 1::2])
    
    return encoding
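
The encoding is simply added to the token embeddings before the first block (a sketch with made-up sizes; the random matrix stands in for real embeddings):

seq_len, d_model = 10, 64
token_embeddings = np.random.randn(seq_len, d_model)  # stand-in for learned embeddings
pe = positional_encoding(seq_len, d_model)
model_input = token_embeddings + pe                   # same shape: (10, 64)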

Using Pre-trained Transformers

from transformers import AutoTokenizer, AutoModel

# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Tokenize
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
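
One common way to collapse the per-token embeddings into a single sentence vector is mean pooling over non-padding tokens; this is a sketch of that idea, not part of the transformers API:

import torch

# Mask out padding tokens, then average the remaining token embeddings
mask = inputs['attention_mask'].unsqueeze(-1)             # (1, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)          # torch.Size([1, seq_len, 768])
print(sentence_embedding.shape)  # torch.Size([1, 768])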

Why Transformers Win

RNN                     Transformer
Sequential processing   Parallel processing
Limited context         Full context
Vanishing gradients     Stable training
Slower training         Much faster

Key Takeaway

Attention lets models focus on the relevant parts of the input. Transformers use self-attention to process entire sequences in parallel, enabling massive scale and better handling of long-range dependencies. They power GPT, BERT, and most modern NLP. Use pre-trained models for most tasks; training from scratch requires enormous resources.

#Machine Learning  #Deep Learning  #Transformers  #Attention  #NLP  #Advanced