Word Embeddings
Representing words as vectors so that machine-learning models can work with them as numbers.
What are Word Embeddings?
Representing words as vectors (lists of numbers).
Similar words have similar vectors!
Why Better than One-Hot?
**One-Hot**: Each word is unique, with no relationships between words:

- cat = [1, 0, 0, 0]
- dog = [0, 1, 0, 0]
- king = [0, 0, 1, 0]

**Embeddings**: Capture relationships between words:

- cat and dog are similar (both animals)
- king and queen are similar (both royalty)
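To make the difference concrete, here is a minimal sketch comparing the two representations with cosine similarity (the dense vectors are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every pair of different words has similarity 0
cat_onehot = np.array([1, 0, 0, 0])
dog_onehot = np.array([0, 1, 0, 0])
print(cosine(cat_onehot, dog_onehot))   # 0.0 -- no notion of relatedness

# Dense embeddings (made-up 4-d vectors): related words point in similar directions
cat_emb = np.array([0.9, 0.8, 0.1, 0.0])
dog_emb = np.array([0.8, 0.9, 0.2, 0.1])
king_emb = np.array([0.1, 0.0, 0.9, 0.8])
print(cosine(cat_emb, dog_emb))    # high (~0.99): both animals
print(cosine(cat_emb, king_emb))   # low  (~0.12): unrelated
```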
Word2Vec
Popular embedding method:
```python
from gensim.models import Word2Vec

# A tiny toy corpus: each sentence is a list of tokens
sentences = [
    ['cat', 'sits', 'on', 'mat'],
    ['dog', 'plays', 'in', 'park'],
    ['cat', 'and', 'dog', 'play']
]

# Train Word2Vec (vector_size = embedding dimension, window = context size)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get the vector for a word
vector = model.wv['cat']
print(vector)

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print(similar)  # [('dog', 0.87), ...]
```
Cool Math with Embeddings
```python
# king - man + woman ≈ queen
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
print(result[0])  # 'queen'
```
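Under the hood this is plain vector arithmetic followed by a nearest-neighbour search. A minimal sketch of the same analogy done by hand, assuming a `model` trained on a corpus large enough to contain these words (the toy corpus above is not):

```python
# king - man + woman, computed directly on the raw vectors
target = model.wv['king'] - model.wv['man'] + model.wv['woman']

# Find the words whose vectors are closest (by cosine similarity) to the result
print(model.wv.similar_by_vector(target, topn=3))
```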
Pre-trained Embeddings
Use embeddings trained on millions of documents:
- **GloVe**: Stanford's embeddings
- **FastText**: Facebook's embeddings
- **BERT**: Google's contextual embeddings
```python
import gensim.downloader as api

# Load pre-trained GloVe (100-dimensional vectors)
glove = api.load("glove-wiki-gigaword-100")

# Use immediately
similar = glove.most_similar('python')
print(similar)
```
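The loaded object behaves like gensim's `KeyedVectors`, so word lookup, pairwise similarity, and analogy queries all work directly; a small sketch:

```python
# Look up a single word's 100-dimensional vector
vec = glove['computer']
print(vec.shape)  # (100,)

# Cosine similarity between two words
print(glove.similarity('cat', 'dog'))

# Analogy queries work here too
print(glove.most_similar(positive=['paris', 'germany'], negative=['france'], topn=1))
```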
Using in Neural Networks
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=128),  # learns a 128-d vector for each of 10,000 words
    LSTM(64),
    Dense(1, activation='sigmoid')
])
```
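The two ideas can be combined: instead of learning the embedding layer from scratch, it can be seeded with the pre-trained GloVe vectors loaded earlier. A hedged sketch, assuming `glove` is the gensim object from above and `word_index` is a hypothetical {word: integer id} mapping produced by your own tokenizer:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000
embedding_dim = 100  # must match the GloVe vectors (glove-wiki-gigaword-100)

# Build an embedding matrix: row i holds the GloVe vector for the word with index i.
# word_index is a hypothetical {word: integer id} mapping from your own tokenizer.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    if i < vocab_size and word in glove:
        embedding_matrix[i] = glove[word]

model = Sequential([
    Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,  # keep the pre-trained vectors frozen
    ),
    LSTM(64),
    Dense(1, activation='sigmoid'),
])
```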
BERT - Contextual Embeddings
Unlike Word2Vec, BERT understands context:
"Bank" in "river bank" vs "money bank" gets different vectors!
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "The bank by the river"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
# outputs.last_hidden_state holds one contextual vector per token
```
Applications
- Text classification
- Machine translation
- Sentiment analysis
- Question answering
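As a taste of the classification use case, a minimal sketch of sentiment classification using averaged GloVe vectors as features (it assumes the `glove` vectors loaded earlier; the tiny labelled dataset is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(sentence, vectors, dim=100):
    """Average the embeddings of the words we have vectors for."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# Made-up toy dataset: 1 = positive, 0 = negative
texts = ["great movie loved it", "terrible boring film", "wonderful acting", "awful plot"]
labels = [1, 0, 1, 0]

X = np.array([sentence_vector(t, glove) for t in texts])
clf = LogisticRegression().fit(X, labels)

print(clf.predict([sentence_vector("fantastic film", glove)]))
```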
Remember
- Embeddings capture word meanings
- Pre-trained embeddings save time
- BERT understands context