AI · 8 min read

Transformer Architecture

Understand the architecture behind ChatGPT.

Dr. Patricia Moore
December 18, 2025

The architecture that changed AI.

What are Transformers?

A neural network architecture, introduced in the 2017 paper "Attention Is All You Need," that has largely replaced RNNs for sequence modeling.

Key Innovation: Self-attention mechanism

Powers: ChatGPT, BERT, GPT-4, Gemini

Why Better Than RNNs?

RNNs:

  • Process sequentially (slow)
  • Forget long-range dependencies
  • Can't parallelize

Transformers:

  • Process entire sequence at once (fast)
  • Capture long-range dependencies
  • Fully parallelizable (see the sketch below)
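
To make the parallelism point concrete, here is a rough NumPy sketch (toy code, not a real RNN or Transformer): the recurrent update must run step by step because each hidden state depends on the previous one, while the attention scores for every pair of tokens come out of a single matrix multiplication.

import numpy as np

seq_len, d = 6, 4
x = np.random.randn(seq_len, d)   # toy input: 6 tokens, each a 4-dim vector

# RNN-style processing: step t needs the hidden state from step t-1,
# so the loop cannot be parallelized across time steps
W = np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] @ W + h)

# Attention-style processing: all pairwise token interactions come from
# one matrix product, which parallelizes trivially on modern hardware
scores = x @ x.T / np.sqrt(d)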

Self-Attention Explained

Goal: Each word looks at every other word in the sequence and weighs how relevant each one is

Example: "The cat sat on the mat"

  • "sat" pays attention to "cat" (who sat?)
  • "sat" pays attention to "mat" (where?)

Attention Mechanism

import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability, then normalize each row
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Q: Query matrix
    K: Key matrix
    V: Value matrix
    """
    # Calculate attention scores, scaled by the square root of the key dimension
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Softmax to get attention weights (each row sums to 1)
    weights = softmax(scores)
    
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output

# Example: 3 tokens ("The cat sat"), each projected to 4-dim Q, K, V vectors
np.random.seed(0)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
attention_output = scaled_dot_product_attention(Q, K, V)
print(attention_output.shape)  # (3, 4)

Multi-Head Attention

Several attention heads run in parallel, each learning to attend to different kinds of relationships:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer

def scaled_dot_product_attention_tf(q, k, v):
    # Same idea as the NumPy version above, but with TensorFlow ops so it
    # works on batched 4-D tensors of shape (batch, heads, seq_len, depth)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        # Dimension of each individual head
        self.depth = d_model // num_heads
        
        # Learned projections for queries, keys, and values
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        
        # Final output projection
        self.dense = Dense(d_model)
    
    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, q, k, v):
        batch_size = tf.shape(q)[0]
        
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # Scaled dot-product attention, applied to all heads in parallel
        attention = scaled_dot_product_attention_tf(q, k, v)
        
        # Concatenate heads: back to (batch, seq_len, d_model)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.d_model))
        
        # Final linear projection
        output = self.dense(concat_attention)
        return output
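
A quick smoke test of the layer on a dummy batch (the shapes and hyperparameters here are arbitrary picks for illustration):

# Dummy batch: 2 sequences of 5 tokens, each token a 64-dim embedding
x = tf.random.normal((2, 5, 64))

mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)   # self-attention: queries, keys, and values are the same input
print(out.shape)     # (2, 5, 64)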

Positional Encoding

Since Transformers process all words at once, they need positional information added explicitly:

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    
    # Even indices: sin
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    # Odd indices: cos
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    
    return angles

# Example: add positional information to (dummy) word embeddings
max_len, d_model = 50, 128
word_embeddings = np.random.randn(max_len, d_model)  # stand-in for learned embeddings
embeddings = word_embeddings + positional_encoding(max_len, d_model)

Full Transformer Block

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Layer, LayerNormalization

class TransformerBlock(Layer):
    def __init__(self, d_model, num_heads, ff_dim):
        super().__init__()
        
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        self.dropout1 = Dropout(0.1)
        self.dropout2 = Dropout(0.1)
    
    def call(self, x, training=False):
        # Multi-head self-attention
        attn_output = self.attention(x, x, x)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection
        
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection
        
        return out2
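
A minimal usage sketch of the block (the hyperparameters are arbitrary choices for illustration):

block = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)

x = tf.random.normal((2, 5, 64))   # batch of 2 sequences, 5 tokens, 64-dim each
y = block(x, training=False)
print(y.shape)                     # (2, 5, 64): same shape in, same shape out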

Encoder-Decoder Architecture

Encoder: Processes the input sequence into contextual representations
Decoder: Generates the output sequence one token at a time, attending to the encoder's output

Used for translation, summarization, etc.
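
A rough sketch of how the two halves connect, reusing the layers defined above (it omits masking and layer stacking, which a real Transformer needs):

# Encoder: turn a 7-token source sentence into contextual representations
encoder_input = tf.random.normal((1, 7, 64))
encoder_block = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)
encoder_output = encoder_block(encoder_input, training=False)

# Decoder cross-attention: queries come from the decoder's own states,
# keys and values come from the encoder output
decoder_state = tf.random.normal((1, 3, 64))   # 3 target tokens generated so far
cross_attention = MultiHeadAttention(d_model=64, num_heads=8)
attended = cross_attention(decoder_state, encoder_output, encoder_output)
print(attended.shape)                           # (1, 3, 64)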

Using Pre-trained Transformers

from transformers import pipeline

# Text classification
classifier = pipeline('sentiment-analysis')
result = classifier("The service in Austin was amazing!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline('text-generation', model='gpt2')
text = generator("In San Francisco", max_length=50)
print(text)

# Question answering
qa = pipeline('question-answering')
context = "The Golden Gate Bridge is in San Francisco."
question = "Where is the Golden Gate Bridge?"
answer = qa(question=question, context=context)
print(answer)

Key Models

BERT: Bidirectional encoder-only (understanding)
GPT: Decoder-only (generation)
T5: Encoder-decoder (versatile)
BART: Encoder-decoder (generation + understanding)
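
A sketch of loading two of these families with the Hugging Face transformers library (the model names are common public checkpoints, chosen here just as examples):

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# BERT: encoder-only, useful for contextual embeddings
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer("Transformers changed NLP.", return_tensors="pt")
embeddings = bert(**inputs).last_hidden_state   # shape: (1, seq_len, 768)

# GPT-2: decoder-only, useful for text generation
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = gpt2_tokenizer("In San Francisco", return_tensors="pt")
generated = gpt2.generate(**prompt, max_length=30)
print(gpt2_tokenizer.decode(generated[0]))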

Applications

  • Machine translation
  • Text summarization
  • Question answering
  • Text generation
  • Sentiment analysis
  • Named entity recognition
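
Most of these are one-liners with the pipeline API shown earlier; for example, summarization and named entity recognition (the default checkpoints are downloaded automatically and may change between library versions):

from transformers import pipeline

# Text summarization
summarizer = pipeline('summarization')
summary = summarizer("The Transformer architecture replaced RNNs for most NLP tasks "
                     "because self-attention captures long-range dependencies and "
                     "parallelizes well on modern hardware.", max_length=30, min_length=5)
print(summary)

# Named entity recognition
ner = pipeline('ner')
print(ner("The Golden Gate Bridge is in San Francisco."))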

Remember

  • Self-attention is the key innovation
  • Positional encoding captures order
  • Pre-trained models are very powerful
  • Transformers revolutionized NLP
#AI #Advanced #Transformers