# GPT and Language Generation

Generate human-like text with GPT models.
## What is GPT?

GPT stands for Generative Pre-trained Transformer.

**Goal**: predict the next word (token) in a sequence.
## How GPT Works

GPT is trained to complete text:

- Input: "The weather is"
- Output: "nice today"
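To make "predict the next word" concrete, here is a minimal sketch (assuming the Hugging Face `transformers` and `torch` packages used throughout this section) that inspects GPT-2's scores over the next token for the prompt "The weather is":

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: look at the raw next-token predictions instead of generating text
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("The weather is", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]          # scores for the token that would come next
top5 = torch.topk(next_token_logits, 5)
for token_id, score in zip(top5.indices, top5.values):
    print(repr(tokenizer.decode(int(token_id))), float(score))
```

The highest-scoring tokens are the model's best guesses for the next word; `generate` simply repeats this step, feeding each chosen token back in.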
## Using GPT-2

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text
def generate_text(prompt, max_length=50):
    inputs = tokenizer.encode(prompt, return_tensors='pt')
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        do_sample=True,  # temperature/top_k/top_p only take effect when sampling
        temperature=0.7,
        top_k=50,
        top_p=0.95
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text

# Test
prompt = "In the year 2050,"
result = generate_text(prompt)
print(result)
```
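A small variation on the call above: with sampling enabled, `num_return_sequences` can be raised to draw several different continuations of the same prompt (a sketch reusing the `tokenizer` and `model` loaded above):

```python
# Sample three independent continuations of one prompt
inputs = tokenizer.encode("In the year 2050,", return_tensors='pt')
outputs = model.generate(
    inputs,
    max_length=50,
    do_sample=True,                       # required for varied outputs
    num_return_sequences=3,               # three candidates instead of one
    top_k=50,
    pad_token_id=tokenizer.eos_token_id   # silences GPT-2's padding warning
)
for i, seq in enumerate(outputs):
    print(f"--- candidate {i} ---")
    print(tokenizer.decode(seq, skip_special_tokens=True))
```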
Control Generation
```python # More creative (higher temperature) outputs = model.generate( inputs, max_length=100, temperature=1.0, # More random do_sample=True )
More focused (lower temperature) outputs = model.generate( inputs, max_length=100, temperature=0.3, # More deterministic do_sample=True )
Top-k sampling outputs = model.generate( inputs, max_length=100, top_k=40, # Consider top 40 words do_sample=True )
Top-p (nucleus) sampling outputs = model.generate( inputs, max_length=100, top_p=0.9, # Cumulative probability 0.9 do_sample=True ) ```
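Each of these calls returns token IDs rather than text; decoding works the same way as in `generate_text` above:

```python
# Decode the most recent `outputs` tensor from the snippets above
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```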
Fine-tuning GPT-2
```python from transformers import TextDataset, DataCollatorForLanguageModeling from transformers import Trainer, TrainingArguments
Prepare dataset def load_dataset(file_path, tokenizer): dataset = TextDataset( tokenizer=tokenizer, file_path=file_path, block_size=128 ) return dataset
train_dataset = load_dataset('train.txt', tokenizer)
data_collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=False # GPT uses causal LM, not masked LM )
Training training_args = TrainingArguments( output_dir='./gpt2-finetuned', overwrite_output_dir=True, num_train_epochs=3, per_device_train_batch_size=4, save_steps=500, save_total_limit=2, )
trainer = Trainer( model=model, args=training_args, data_collator=data_collator, train_dataset=train_dataset, )
trainer.train() ```
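Once training finishes, the fine-tuned weights can be saved and reloaded like any other `transformers` checkpoint. The sketch below assumes the run above completed and uses `./gpt2-finetuned` (the `output_dir` from the training arguments) as the save location:

```python
# Save the final model and tokenizer to the output directory
trainer.save_model('./gpt2-finetuned')
tokenizer.save_pretrained('./gpt2-finetuned')

# Reload and generate with the fine-tuned model
finetuned_model = GPT2LMHeadModel.from_pretrained('./gpt2-finetuned')
finetuned_tokenizer = GPT2Tokenizer.from_pretrained('./gpt2-finetuned')

inputs = finetuned_tokenizer.encode("In the year 2050,", return_tensors='pt')
outputs = finetuned_model.generate(
    inputs,
    max_length=50,
    do_sample=True,
    top_p=0.95,
    pad_token_id=finetuned_tokenizer.eos_token_id
)
print(finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True))
```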
## Practical Applications

### Story Generation

```python
prompt = "Once upon a time in Boston, there was"
story = generate_text(prompt, max_length=200)
print(story)
```

### Code Generation

```python
prompt = "def calculate_fibonacci(n):"
code = generate_text(prompt, max_length=150)
print(code)
```
### Chatbot

```python
def chat(message, history=""):
    prompt = f"{history} User: {message} Bot:"
    response = generate_text(prompt, max_length=100)
    # Keep only the text after the final "Bot:" and cut at any new "User:" turn
    return response.split("Bot:")[-1].split("User:")[0].strip()

# Use it
response = chat("Hello! How are you?")
print(response)
```
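The `history` argument only matters if it is carried between turns. A minimal multi-turn loop (a sketch reusing the `chat` function above; the messages are just illustrative) could look like this:

```python
# Accumulate the conversation so each turn sees what came before
history = ""
for message in ["Hello! How are you?", "What's your favorite city?"]:
    reply = chat(message, history)
    history += f" User: {message} Bot: {reply}"
    print(f"User: {message}")
    print(f"Bot: {reply}")
```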
## GPT Versions

- **GPT-2**: 1.5B parameters, open source
- **GPT-3**: 175B parameters, API access
- **GPT-3.5**: base model behind ChatGPT
- **GPT-4**: most advanced, multimodal
## Generation Parameters

- **Temperature**: controls randomness (0.7-1.0 is a good range)
- **Top-k**: limit sampling to the k most likely tokens
- **Top-p**: nucleus (cumulative-probability) sampling threshold
- **Max length**: maximum number of tokens to generate
- **Repetition penalty**: discourages repeating text
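These parameters can be combined in a single `generate` call. The values below are illustrative, not recommendations; the sketch reuses the `model`, `tokenizer`, and `inputs` from earlier and adds `repetition_penalty`, which the previous examples did not use:

```python
outputs = model.generate(
    inputs,
    max_length=100,          # maximum total length in tokens
    do_sample=True,          # temperature/top-k/top-p only apply when sampling
    temperature=0.8,         # moderate randomness
    top_k=50,                # keep only the 50 most likely tokens
    top_p=0.95,              # nucleus sampling threshold
    repetition_penalty=1.2,  # penalize tokens that already appeared
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```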
## Remember

- GPT generates text token by token
- Temperature controls creativity
- Fine-tune for domain-specific tasks
- Larger models = better quality