Text Preprocessing for Machine Learning
Learn how to prepare text data for machine learning models.
Sarah Chen
December 19, 2025
Machines don't understand text. They understand numbers. Text preprocessing converts words into numbers that models can learn from.
The Pipeline
Raw Text → Clean → Tokenize → Vectorize → Model
Step 1: Basic Cleaning
import re

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text

# df is a pandas DataFrame with a 'text' column
df['clean_text'] = df['text'].apply(clean_text)
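A quick sanity check on a single string (the URL and sentence here are just illustrative):

sample = "Check out https://example.com - Machine Learning is GREAT!!!"
print(clean_text(sample))
# 'check out machine learning is great'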
Step 2: Tokenization
Breaking text into words or subwords:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
text = "Machine learning is amazing!"
tokens = word_tokenize(text)
# ['Machine', 'learning', 'is', 'amazing', '!']
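Subword tokenization (used by transformer models) splits rarer words into smaller pieces instead of treating every word as one token. A minimal sketch using the Hugging Face transformers library, assuming it is installed (the exact pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Tokenization is amazing!"))
# Something like: ['token', '##ization', 'is', 'amazing', '!']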
Step 3: Stopword Removal
Remove common words (like "the", "is", "and") that carry little meaning on their own:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    # Lowercase for the lookup and drop punctuation-only tokens
    return [w for w in tokens if w.lower() not in stop_words and w.isalpha()]

print(remove_stopwords(tokens))
# ['Machine', 'learning', 'amazing']
Step 4: Stemming/Lemmatization
Reduce words to base form:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # required for WordNetLemmatizer

# Stemming (crude, fast)
stemmer = PorterStemmer()
stemmer.stem('running')  # 'run'
stemmer.stem('studies')  # 'studi'

# Lemmatization (proper words, slower)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('studies')           # 'study'
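Putting Steps 1-4 together, a minimal preprocessing helper might look like the sketch below (one reasonable combination; lemmatization is used here, but stemming would also work):

def preprocess(text):
    text = clean_text(text)            # Step 1: clean
    tokens = word_tokenize(text)       # Step 2: tokenize
    tokens = remove_stopwords(tokens)  # Step 3: drop stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]  # Step 4: lemmatize

preprocess("Studying machine learning at https://example.com!")
# Something like: ['studying', 'machine', 'learning']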
Vectorization: Turning Text into Numbers
Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'I love machine learning',
    'Machine learning is great',
    'I love programming'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# See vocabulary
print(vectorizer.get_feature_names_out())
# ['great', 'is', 'learning', 'love', 'machine', 'programming']
print(X.toarray())
# [[0, 0, 1, 1, 1, 0],
# [1, 1, 1, 0, 1, 0],
# [0, 0, 0, 1, 0, 1]]
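Once fitted, the same vocabulary is reused for new text; words the vectorizer has never seen are simply dropped (the sentence below is illustrative):

new_doc = ['I love deep learning']
print(vectorizer.transform(new_doc).toarray())
# [[0, 0, 1, 1, 0, 0]]  -- 'deep' is not in the vocabulary, so it is ignored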
TF-IDF (Better!)
Weighs words by importance:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
    max_features=5000,   # Limit vocabulary size
    min_df=2,            # Ignore words in fewer than 2 documents
    max_df=0.95,         # Ignore words in more than 95% of documents
    ngram_range=(1, 2)   # Include unigrams and bigrams
)
X = tfidf.fit_transform(corpus)
TF-IDF logic:
- High frequency within a document = higher score
- High frequency across ALL documents = lower score
- Rare across the corpus but present in a document = a strong signal
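As a rough sketch of the arithmetic, scikit-learn's default smoothed IDF is idf = ln((1 + n) / (1 + df)) + 1, with each row L2-normalized afterwards. Applying it to the small corpus above:

import numpy as np

n = 3               # total documents in the corpus above
df_learning = 2     # 'learning' appears in 2 documents
df_programming = 1  # 'programming' appears in 1 document

idf_learning = np.log((1 + n) / (1 + df_learning)) + 1        # ~1.29
idf_programming = np.log((1 + n) / (1 + df_programming)) + 1  # ~1.69
# The rarer word gets the higher weight before normalization.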
Complete Pipeline Example
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Text classification pipeline
# X_train / X_test are lists of raw text strings, y_train / y_test their labels
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
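To see how well it does, a quick evaluation sketch (assuming y_test holds the true labels for X_test):

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))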
Modern Approach: Word Embeddings
Pre-trained embeddings capture semantic meaning:
# Using sentence-transformers (recommended)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)  # texts is a list of strings

# Now use the embeddings as features
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(embeddings_train, y_train)  # embeddings computed from the training texts
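Because the vectors capture meaning, semantically similar sentences end up close together. A small sketch using cosine similarity (the sentences are just illustrative):

from sklearn.metrics.pairwise import cosine_similarity

pair = model.encode(['The movie was fantastic', 'I really enjoyed the film'])
other = model.encode(['The stock market fell today'])
print(cosine_similarity([pair[0]], [pair[1]]))  # relatively high
print(cosine_similarity([pair[0]], other))      # relatively low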
Choosing an Approach
| Method | Pros | Cons |
|---|---|---|
| Bag of Words | Simple, interpretable | Ignores word order |
| TF-IDF | Better weighting | Still ignores semantics |
| Word Embeddings | Captures meaning | Less interpretable |
| Transformers | State-of-the-art | Computationally expensive |
Key Takeaway
Start with TF-IDF for most text classification tasks: it's simple and works well. Use max_features to limit the vocabulary, remove stopwords, and consider n-grams. For semantic understanding, or if TF-IDF isn't enough, move to pre-trained embeddings. Preprocessing steps like cleaning and lowercasing matter for traditional methods but much less for modern transformers, which handle raw text with their own tokenizers.
#Machine Learning #NLP #Text Processing #Intermediate