AI · 6 min read

Word Embeddings

How to represent words as vectors that capture meaning.

Robert Anderson
December 18, 2025

Words as numbers.

What are Word Embeddings?

Representing words as vectors (lists of numbers).

Similar words have similar vectors!

Why Are Embeddings Better than One-Hot?

One-hot: each word gets its own dimension, so no relationships are captured

  • cat = [1, 0, 0, 0]
  • dog = [0, 1, 0, 0]
  • king = [0, 0, 1, 0]

Embeddings: capture relationships (see the sketch after this list)

  • cat and dog are similar (both animals)
  • king and queen are similar (both royalty)
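A minimal sketch of the difference, using cosine similarity and made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; the numbers here are purely illustrative):

import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every pair of distinct words is equally unrelated
cat_oh = np.array([1, 0, 0, 0])
dog_oh = np.array([0, 1, 0, 0])
print(cosine(cat_oh, dog_oh))  # 0.0 -- no notion of "animal"

# Hypothetical dense embeddings (values invented for illustration)
cat_emb  = np.array([0.9, 0.8, 0.1])
dog_emb  = np.array([0.8, 0.9, 0.2])
king_emb = np.array([0.1, 0.2, 0.9])
print(cosine(cat_emb, dog_emb))   # ~0.99 -- "cat" and "dog" are close
print(cosine(cat_emb, king_emb))  # ~0.30 -- "cat" and "king" are not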

Word2Vec

A popular method that learns a word's vector from the words that appear around it:

from gensim.models import Word2Vec

sentences = [
    ['cat', 'sits', 'on', 'mat'],
    ['dog', 'plays', 'in', 'park'],
    ['cat', 'and', 'dog', 'play']
]

# Train Word2Vec: vector_size = embedding dimension,
# window = context words on each side, min_count = drop rarer words
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get vector for word
vector = model.wv['cat']
print(vector)

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print(similar)  # e.g. [('dog', ...), ...] -- scores on a toy corpus are noisy
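You can also ask for the similarity between two specific words. On a 3-sentence toy corpus the score is essentially noise, but the call is the same on a real model:

# Cosine similarity between two words' vectors (roughly -1 to 1)
print(model.wv.similarity('cat', 'dog'))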

Cool Math with Embeddings

Embedding arithmetic captures analogies. The classic example:

# king - man + woman ≈ queen
# Note: this needs a model trained on a large corpus; the toy model
# above has no 'king' in its vocabulary, so it would raise a KeyError.
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
print(result[0])  # ('queen', score) with a well-trained model

Pre-trained Embeddings

Use embeddings trained on millions of documents:

  • GloVe: Stanford's embeddings
  • FastText: Facebook's embeddings
  • BERT: Google's contextual embeddings

import gensim.downloader as api

# Load pre-trained GloVe (downloads the vectors on first use)
glove = api.load("glove-wiki-gigaword-100")

# Use immediately
similar = glove.most_similar('python')
print(similar)
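With these pre-trained vectors, the king - man + woman analogy from earlier actually works, because the vocabulary now covers royalty:

# Same analogy as before, now on a large-corpus model
result = glove.most_similar(positive=['king', 'woman'], negative=['man'])
print(result[0])  # the top hit should be 'queen'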

Using in Neural Networks

In Keras, an Embedding layer learns word vectors jointly with the rest of the network:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    # Maps each token ID (vocabulary of 10,000) to a learned 128-dim vector
    Embedding(input_dim=10000, output_dim=128),
    LSTM(64),
    Dense(1, activation='sigmoid')  # e.g. a binary sentiment score
])
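A quick shape sanity-check, feeding a hypothetical batch of integer token IDs (each below the Embedding layer's input_dim of 10,000) through the untrained model:

import numpy as np

# Made-up batch: 2 sequences of 20 token IDs each
dummy_ids = np.random.randint(0, 10000, size=(2, 20))
probs = model.predict(dummy_ids)
print(probs.shape)  # (2, 1) -- one sigmoid score per sequence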

BERT - Contextual Embeddings

Unlike Word2Vec, BERT understands context:

"Bank" in "river bank" vs "money bank" gets different vectors!

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "The bank by the river"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
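A sketch of the context sensitivity, comparing the vector for "bank" in two sentences (this assumes "bank" is a single WordPiece token, which it is in bert-base-uncased):

import torch

def bank_vector(sentence):
    # Return BERT's contextual vector for the 'bank' token
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    return outputs.last_hidden_state[0, tokens.index('bank')]

river = bank_vector("He sat on the river bank")
money = bank_vector("She deposited money at the bank")
print(torch.cosine_similarity(river, money, dim=0).item())
# Noticeably below 1.0: same word, different vectors in different contexts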

Applications

  • Text classification
  • Machine translation
  • Sentiment analysis
  • Question answering

Remember

  • Embeddings capture word meanings
  • Pre-trained embeddings save time
  • BERT understands context
#AI #Intermediate #NLP