AI · 6 min read

Word Embeddings

Represent words as vectors in AI.

Robert Anderson
December 18, 2025

Words as numbers.

What are Word Embeddings?

Word embeddings represent each word as a vector (a list of numbers).

Similar words have similar vectors!
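To make "similar vectors" concrete, here is a minimal sketch using made-up toy vectors (the numbers are invented for illustration, not taken from a trained model). Similarity between word vectors is usually measured with cosine similarity:

```python
import numpy as np

# Toy 4-dimensional vectors (invented values, purely for illustration)
cat = np.array([0.8, 0.2, 0.9, 0.1])
dog = np.array([0.7, 0.3, 0.8, 0.2])
car = np.array([0.1, 0.9, 0.0, 0.8])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 = very similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, dog))  # ~0.99: related words, similar vectors
print(cosine_similarity(cat, car))  # ~0.23: unrelated words
```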

Why Better than One-Hot?

**One-Hot**: Each word is unique, no relationships
- cat = [1, 0, 0, 0]
- dog = [0, 1, 0, 0]
- king = [0, 0, 1, 0]

**Embeddings**: Capture relationships
- cat and dog are similar (both animals)
- king and queen are similar (both royalty)
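A quick sketch of the difference (the embedding values below are toy numbers, not learned ones): one-hot vectors for different words always have a dot product of 0, so they can never express a relationship, while dense vectors can overlap:

```python
import numpy as np

# One-hot: every pair of distinct words is orthogonal (dot product 0)
cat_onehot = np.array([1, 0, 0, 0])
dog_onehot = np.array([0, 1, 0, 0])
print(np.dot(cat_onehot, dog_onehot))  # 0 -> "cat" and "dog" look unrelated

# Embeddings: dense vectors (toy values) can encode that cat and dog are close
cat_emb = np.array([0.8, 0.2, 0.9])
dog_emb = np.array([0.7, 0.3, 0.8])
print(np.dot(cat_emb, dog_emb))  # 1.34 -> similarity is representable
```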

Word2Vec

Word2Vec is a popular method for learning embeddings:

```python
from gensim.models import Word2Vec

sentences = [
    ['cat', 'sits', 'on', 'mat'],
    ['dog', 'plays', 'in', 'park'],
    ['cat', 'and', 'dog', 'play']
]

# Train Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get the vector for a word
vector = model.wv['cat']
print(vector)

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print(similar)  # [('dog', 0.87), ...]
```

Cool Math with Embeddings

```python
# king - man + woman ≈ queen
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
print(result[0])  # 'queen'
```

Pre-trained Embeddings

Use embeddings trained on millions of documents:

- **GloVe**: Stanford's embeddings
- **FastText**: Facebook's embeddings
- **BERT**: Google's contextual embeddings

```python
import gensim.downloader as api

# Load pre-trained GloVe vectors
glove = api.load("glove-wiki-gigaword-100")

# Use immediately
similar = glove.most_similar('python')
print(similar)
```

Using in Neural Networks

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# The Embedding layer maps each of 10,000 word indices to a 128-dimensional vector
model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    LSTM(64),
    Dense(1, activation='sigmoid')
])
```

BERT - Contextual Embeddings

Unlike Word2Vec, BERT understands context:

"Bank" in "river bank" vs "money bank" gets different vectors!

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "The bank by the river"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
# outputs.last_hidden_state holds one contextual vector per token
```

Applications

- Text classification
- Machine translation
- Sentiment analysis
- Question answering

Remember

- Embeddings capture word meanings
- Pre-trained embeddings save time
- BERT understands context

#AI #Intermediate #NLP