AI · 8 min read
Transformer Architecture
Understand the architecture behind ChatGPT.
Dr. Patricia Moore
December 18, 2025
The architecture that changed AI.
What are Transformers?
A revolutionary architecture, introduced in the 2017 paper "Attention Is All You Need", that has largely replaced RNNs for sequence modeling.
Key Innovation: Self-attention mechanism
Powers: ChatGPT, BERT, GPT-4, Gemini
Why Better Than RNNs?
RNNs:
- Process tokens sequentially (slow)
- Struggle with long-range dependencies (vanishing gradients)
- Can't parallelize across time steps
Transformers:
- Process the entire sequence at once (fast)
- Capture long-range dependencies directly
- Fully parallelizable (see the toy sketch below)
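To make the contrast concrete, here is a toy sketch (the matrices W and U and all shapes are illustrative, not taken from any real model):
import numpy as np

d, seq_len = 8, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d))    # one token vector per row
W = rng.standard_normal((d, d))          # illustrative recurrent weights
U = rng.standard_normal((d, d))          # illustrative input weights

# RNN: step t depends on step t-1, so the loop cannot run in parallel
h = np.zeros(d)
for x_t in X:
    h = np.tanh(W @ h + U @ x_t)

# Transformer: all pairwise token interactions in a single matrix product
scores = X @ X.T                         # (seq_len, seq_len), fully parallel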
Self-Attention Explained
Goal: each word looks at all other words and weighs how relevant each one is
Example: "The cat sat on the mat"
- "sat" pays attention to "cat" (who sat?)
- "sat" pays attention to "mat" (where?)
Attention Mechanism
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Q: Query matrix  (seq_len, d_k)
    K: Key matrix    (seq_len, d_k)
    V: Value matrix  (seq_len, d_v)
    """
    # Calculate attention scores, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.swapaxes(-2, -1)) / np.sqrt(d_k)
    # Softmax over the last axis to get attention weights
    weights = softmax(scores, axis=-1)
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output

# Example: 3 tokens ("The cat sat"), each with random Q, K, V vectors
seq_len, d_k = 3, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
attention_output = scaled_dot_product_attention(Q, K, V)
print(attention_output.shape)   # (3, 4)
Multi-Head Attention
Several attention heads run in parallel, each able to learn different relationships (implemented here in TensorFlow, with a TF version of the attention function above):
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense

def scaled_dot_product_attention_tf(q, k, v):
    # TensorFlow version of the NumPy attention above, for batched tensors
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads   # dimension per head
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, num_heads, seq, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v):
        batch_size = tf.shape(q)[0]
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        # Scaled dot-product attention, applied to every head in parallel
        attention = scaled_dot_product_attention_tf(q, k, v)
        # Concatenate heads: (batch, num_heads, seq, depth) -> (batch, seq, d_model)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.d_model))
        # Final linear projection
        output = self.dense(concat_attention)
        return output
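A quick way to sanity-check the layer (sizes here are illustrative):
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = tf.random.normal((2, 5, 64))    # batch of 2 sequences, 5 tokens each
out = mha(x, x, x)                  # self-attention: q = k = v = x
print(out.shape)                    # (2, 5, 64)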
Positional Encoding
Since Transformers process all words at once, they need an explicit signal for word order:
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]   # positions: (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]     # dimensions: (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even indices: sin
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    # Odd indices: cos
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

# Added to the input embeddings, e.g. for a (max_len, d_model) embedding matrix:
# embeddings = word_embeddings + positional_encoding(max_len, d_model)
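A quick sanity check (sizes illustrative):
pe = positional_encoding(max_len=50, d_model=64)
print(pe.shape)                     # (50, 64)
print(np.allclose(pe[0], pe[1]))    # False: positions are distinguishable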
Full Transformer Block
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dropout, LayerNormalization

class TransformerBlock(Layer):
    def __init__(self, d_model, num_heads, ff_dim):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        self.dropout1 = Dropout(0.1)
        self.dropout2 = Dropout(0.1)

    def call(self, x, training=False):
        # Multi-head self-attention (q = k = v = x)
        attn_output = self.attention(x, x, x)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection
        return out2
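Because the block maps (batch, seq, d_model) back to the same shape, blocks can be stacked; a minimal check (sizes illustrative):
block = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)
x = tf.random.normal((2, 5, 64))
y = block(x, training=False)
print(y.shape)   # (2, 5, 64): shape preserved, so blocks stack cleanly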
Encoder-Decoder Architecture
Encoder: processes the input sequence into contextual representations
Decoder: generates the output sequence token by token, attending to the encoder's output
Used for translation, summarization, and other sequence-to-sequence tasks.
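A minimal sketch of how the two halves connect, reusing the classes defined above (simplified: a real decoder also applies masked self-attention before cross-attention):
# Encoder: self-attention over the embedded source sequence
encoder = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)
src = tf.random.normal((2, 7, 64))                 # illustrative embedded source
enc_out = encoder(src, training=False)

# Decoder cross-attention: queries from the target sequence,
# keys and values from the encoder output
cross_attention = MultiHeadAttention(d_model=64, num_heads=8)
tgt = tf.random.normal((2, 5, 64))                 # illustrative embedded target
dec_out = cross_attention(tgt, enc_out, enc_out)   # (2, 5, 64)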
Using Pre-trained Transformers
from transformers import pipeline
# Text classification
classifier = pipeline('sentiment-analysis')
result = classifier("The service in Austin was amazing!")
print(result) # [{'label': 'POSITIVE', 'score': 0.9998}]
# Text generation
generator = pipeline('text-generation', model='gpt2')
text = generator("In San Francisco", max_length=50)
print(text)
# Question answering
qa = pipeline('question-answering')
context = "The Golden Gate Bridge is in San Francisco."
question = "Where is the Golden Gate Bridge?"
answer = qa(question=question, context=context)
print(answer)
Key Models
BERT: Bidirectional Encoder (understanding)
GPT: Decoder-only (generation)
T5: Encoder-Decoder (versatile)
BART: Encoder-Decoder (generation + understanding)
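All four families are available as pre-trained checkpoints through the same pipeline API used above; the model names below are widely used public checkpoints, shown as examples rather than the only options:
from transformers import pipeline

# BERT-family encoder: understanding via fill-in-the-blank
fill = pipeline('fill-mask', model='bert-base-uncased')

# GPT-family decoder: text generation
gen = pipeline('text-generation', model='gpt2')

# T5 encoder-decoder: e.g. translation
translate = pipeline('translation_en_to_de', model='t5-small')

# BART encoder-decoder: e.g. summarization
summarize = pipeline('summarization', model='facebook/bart-large-cnn')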
Applications
- Machine translation
- Text summarization
- Question answering
- Text generation
- Sentiment analysis
- Named entity recognition
Remember
- Self-attention is the key innovation
- Positional encoding captures order
- Pre-trained models give strong results without training from scratch
- Transformers revolutionized NLP
#AI #Advanced #Transformers