AI · 8 min read

Transformer Architecture

Understand the architecture behind ChatGPT.

Dr. Patricia Moore
December 18, 2025

The architecture that changed AI.

What are Transformers?

A neural network architecture, introduced in the 2017 paper "Attention Is All You Need," that has largely replaced RNNs for sequence modeling.

Key Innovation: Self-attention mechanism

Powers: ChatGPT, BERT, GPT-4, Gemini

Why Better Than RNNs?

RNNs:

  • Process sequentially (slow)
  • Forget long-range dependencies
  • Can't parallelize

Transformers:

  • Process entire sequence at once (fast)
  • Capture long-range dependencies
  • Fully parallelizable (see the sketch below)
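
To make the parallelism point concrete, here is a rough NumPy sketch (toy code, not a real RNN or Transformer): the recurrent update must run step by step because each hidden state depends on the previous one, while the attention scores for every pair of tokens come out of a single matrix multiplication.

import numpy as np

seq_len, d = 6, 4
x = np.random.randn(seq_len, d)   # toy input: 6 tokens, each a 4-dim vector

# RNN-style processing: step t needs the hidden state from step t-1,
# so the loop cannot be parallelized across time steps
W = np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] @ W + h)

# Attention-style processing: all pairwise token interactions come from
# one matrix product, which parallelizes trivially on modern hardware
scores = x @ x.T / np.sqrt(d)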

Self-Attention Explained

Goal: Each word looks at every other word in the sequence and weighs how relevant each one is

Example: "The cat sat on the mat"

  • "sat" pays attention to "cat" (who sat?)
  • "sat" pays attention to "mat" (where?)

Attention Mechanism

import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability, then normalize each row
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Q: Query matrix
    K: Key matrix
    V: Value matrix
    """
    # Calculate attention scores, scaled by the square root of the key dimension
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Softmax to get attention weights (each row sums to 1)
    weights = softmax(scores)
    
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output

# Example: 3 tokens ("The cat sat"), each projected to 4-dim Q, K, V vectors
np.random.seed(0)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
attention_output = scaled_dot_product_attention(Q, K, V)
print(attention_output.shape)  # (3, 4)

Multi-Head Attention

Several attention heads run in parallel, each learning to attend to different kinds of relationships:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer

def scaled_dot_product_attention_tf(q, k, v):
    # Same idea as the NumPy version above, but with TensorFlow ops so it
    # works on batched 4-D tensors of shape (batch, heads, seq_len, depth)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        # Dimension of each individual head
        self.depth = d_model // num_heads
        
        # Learned projections for queries, keys, and values
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        
        # Final output projection
        self.dense = Dense(d_model)
    
    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, q, k, v):
        batch_size = tf.shape(q)[0]
        
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # Scaled dot-product attention, applied to all heads in parallel
        attention = scaled_dot_product_attention_tf(q, k, v)
        
        # Concatenate heads: back to (batch, seq_len, d_model)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.d_model))
        
        # Final linear projection
        output = self.dense(concat_attention)
        return output
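
A quick smoke test of the layer on a dummy batch (the shapes and hyperparameters here are arbitrary picks for illustration):

# Dummy batch: 2 sequences of 5 tokens, each token a 64-dim embedding
x = tf.random.normal((2, 5, 64))

mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)   # self-attention: queries, keys, and values are the same input
print(out.shape)     # (2, 5, 64)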

Positional Encoding

Since Transformers process all words at once, they need positional information added explicitly:

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    
    # Even indices: sin
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    # Odd indices: cos
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    
    return angles

# Example: add positional information to (dummy) word embeddings
max_len, d_model = 50, 128
word_embeddings = np.random.randn(max_len, d_model)  # stand-in for learned embeddings
embeddings = word_embeddings + positional_encoding(max_len, d_model)

Full Transformer Block

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Layer, LayerNormalization

class TransformerBlock(Layer):
    def __init__(self, d_model, num_heads, ff_dim):
        super().__init__()
        
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        self.dropout1 = Dropout(0.1)
        self.dropout2 = Dropout(0.1)
    
    def call(self, x, training=False):
        # Multi-head self-attention
        attn_output = self.attention(x, x, x)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection
        
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection
        
        return out2
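
A minimal usage sketch of the block (the hyperparameters are arbitrary choices for illustration):

block = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)

x = tf.random.normal((2, 5, 64))   # batch of 2 sequences, 5 tokens, 64-dim each
y = block(x, training=False)
print(y.shape)                     # (2, 5, 64): same shape in, same shape out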

Encoder-Decoder Architecture

Encoder: Processes the input sequence into contextual representations
Decoder: Generates the output sequence one token at a time, attending to the encoder's output

Used for translation, summarization, etc.
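
A rough sketch of how the two halves connect, reusing the layers defined above (it omits masking and layer stacking, which a real Transformer needs):

# Encoder: turn a 7-token source sentence into contextual representations
encoder_input = tf.random.normal((1, 7, 64))
encoder_block = TransformerBlock(d_model=64, num_heads=8, ff_dim=256)
encoder_output = encoder_block(encoder_input, training=False)

# Decoder cross-attention: queries come from the decoder's own states,
# keys and values come from the encoder output
decoder_state = tf.random.normal((1, 3, 64))   # 3 target tokens generated so far
cross_attention = MultiHeadAttention(d_model=64, num_heads=8)
attended = cross_attention(decoder_state, encoder_output, encoder_output)
print(attended.shape)                           # (1, 3, 64)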

Using Pre-trained Transformers

from transformers import pipeline

# Text classification
classifier = pipeline('sentiment-analysis')
result = classifier("The service in Austin was amazing!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline('text-generation', model='gpt2')
text = generator("In San Francisco", max_length=50)
print(text)

# Question answering
qa = pipeline('question-answering')
context = "The Golden Gate Bridge is in San Francisco."
question = "Where is the Golden Gate Bridge?"
answer = qa(question=question, context=context)
print(answer)

Key Models

BERT: Bidirectional encoder-only (understanding)
GPT: Decoder-only (generation)
T5: Encoder-decoder (versatile)
BART: Encoder-decoder (generation + understanding)
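
A sketch of loading two of these families with the Hugging Face transformers library (the model names are common public checkpoints, chosen here just as examples):

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# BERT: encoder-only, useful for contextual embeddings
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer("Transformers changed NLP.", return_tensors="pt")
embeddings = bert(**inputs).last_hidden_state   # shape: (1, seq_len, 768)

# GPT-2: decoder-only, useful for text generation
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = gpt2_tokenizer("In San Francisco", return_tensors="pt")
generated = gpt2.generate(**prompt, max_length=30)
print(gpt2_tokenizer.decode(generated[0]))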

Applications

  • Machine translation
  • Text summarization
  • Question answering
  • Text generation
  • Sentiment analysis
  • Named entity recognition
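
Most of these are one-liners with the pipeline API shown earlier; for example, summarization and named entity recognition (the default checkpoints are downloaded automatically and may change between library versions):

from transformers import pipeline

# Text summarization
summarizer = pipeline('summarization')
summary = summarizer("The Transformer architecture replaced RNNs for most NLP tasks "
                     "because self-attention captures long-range dependencies and "
                     "parallelizes well on modern hardware.", max_length=30, min_length=5)
print(summary)

# Named entity recognition
ner = pipeline('ner')
print(ner("The Golden Gate Bridge is in San Francisco."))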

Remember

  • Self-attention is the key innovation
  • Positional encoding captures order
  • Pre-trained models are very powerful
  • Transformers revolutionized NLP
#AI #Advanced #Transformers