
Text Preprocessing for Machine Learning

Learn how to prepare text data for machine learning models.

Sarah Chen
December 19, 2025


Machines don't understand text. They understand numbers. Text preprocessing converts words into numbers that models can learn from.

The Pipeline

Raw Text → Clean → Tokenize → Vectorize → Model

Step 1: Basic Cleaning

import re

def clean_text(text):
    # Lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Keep only letters and whitespace (also drops digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

df['clean_text'] = df['text'].apply(clean_text)
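A quick sanity check of the cleaner (the function is restated here so the snippet runs on its own):

```python
import re

def clean_text(text):
    # Same steps as above: lowercase, strip URLs,
    # keep only letters/whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return ' '.join(text.split())

print(clean_text("Check https://example.com NOW!!! 100% #ML"))
# check now ml
```

Note that the URL, punctuation, and digits all disappear; this is fine for bag-of-words models but would destroy information a transformer could use.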

Step 2: Tokenization

Breaking text into words or subwords:

from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

text = "Machine learning is amazing!"
tokens = word_tokenize(text)
# ['Machine', 'learning', 'is', 'amazing', '!']

Step 3: Stopword Removal

Remove common words that don't carry meaning:

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Compare case-insensitively so capitalized words are checked too
    return [w for w in tokens if w.lower() not in stop_words]

remove_stopwords(tokens)
# ['Machine', 'learning', 'amazing', '!']
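The filtering itself is just a set lookup; here is the same idea without NLTK's download, using a small hand-picked stopword set for illustration:

```python
# Tiny hand-picked stopword set (NLTK's English list has ~180 entries)
stop_words = {'is', 'the', 'a', 'an', 'and', 'of', 'to', 'in'}

def remove_stopwords(tokens):
    # Lowercase before the lookup so 'Is' is dropped along with 'is'
    return [w for w in tokens if w.lower() not in stop_words]

print(remove_stopwords(['Machine', 'learning', 'is', 'amazing']))
# ['Machine', 'learning', 'amazing']
```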

Step 4: Stemming/Lemmatization

Reduce words to base form:

from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # required for the lemmatizer

# Stemming (crude, fast)
stemmer = PorterStemmer()
stemmer.stem('running')  # 'run'
stemmer.stem('studies')  # 'studi'

# Lemmatization (proper words, slower)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('studies')  # 'study'
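To see why stems like 'studi' come out, note that stemming is rule-based suffix stripping. A toy sketch (not the real Porter algorithm, which applies dozens of ordered rules):

```python
def naive_stem(word):
    # Strip a few common suffixes; real stemmers use ordered rule sets
    # and conditions on what remains after stripping
    for suffix in ('ies', 'ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == 'ies':
                return word[:-3] + 'y'   # studies -> study
            return word[:-len(suffix)]
    return word

print(naive_stem('running'))  # 'runn' -- crude output, just like real stemmers
print(naive_stem('studies'))  # 'study'
```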

Vectorization: Turning Text into Numbers

Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is great',
    'I love programming'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# See vocabulary
print(vectorizer.get_feature_names_out())
# ['great', 'is', 'learning', 'love', 'machine', 'programming']

print(X.toarray())
# [[0, 0, 1, 1, 1, 0],
#  [1, 1, 1, 0, 1, 0],
#  [0, 0, 0, 1, 0, 1]]

TF-IDF (Better!)

Weighs words by importance:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,    # Cap vocabulary size
    min_df=2,             # Ignore words in fewer than 2 documents
    max_df=0.95,          # Ignore words in over 95% of documents
    ngram_range=(1, 2)    # Include unigrams and bigrams
)

X = tfidf.fit_transform(corpus)

TF-IDF logic:

  • High frequency in document = higher score
  • High frequency across ALL documents = lower score
  • Rare but present = important!
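Those three rules fall out of the formula tf × idf. A from-scratch sketch (using a plain log idf; sklearn's exact formula adds smoothing and normalization, so its numbers differ slightly):

```python
import math

docs = [
    ['love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'great'],
    ['love', 'programming'],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # frequency within this document
    df = sum(1 for d in docs if term in d)   # how many documents contain it
    idf = math.log(len(docs) / df) + 1       # rarer across docs => larger idf
    return tf * idf

# 'machine' appears in 2 of 3 docs, 'programming' in only 1,
# so 'programming' earns the higher weight
print(round(tf_idf('machine', docs[0], docs), 3))      # 0.468
print(round(tf_idf('programming', docs[2], docs), 3))  # 1.049
```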

Complete Pipeline Example

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Text classification pipeline
# X_train / X_test are lists of raw strings;
# the pipeline handles vectorization internally
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
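With toy data (hypothetical labels, made up for illustration), the whole thing fits in a few lines; note that the pipeline takes raw strings, not pre-vectorized features:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical sentiment data: 1 = positive, 0 = negative
train_texts = [
    'I love this movie', 'great film, loved it', 'what an amazing story',
    'terrible movie', 'I hated this film', 'awful and boring story',
]
train_labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])
pipeline.fit(train_texts, train_labels)

preds = pipeline.predict(['a great amazing film', 'boring terrible stuff'])
print(preds)
```

A real project would hold out a test split and check metrics, but the shape of the code stays the same.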

Modern Approach: Word Embeddings

Pre-trained embeddings capture semantic meaning:

# Using sentence-transformers (recommended)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings_train = model.encode(train_texts)  # one vector per text

# Now use embeddings as features
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(embeddings_train, y_train)
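What makes embeddings useful is that closeness in vector space tracks closeness in meaning, usually measured with cosine similarity. A minimal helper, shown on made-up 4-dimensional vectors since running the real model (all-MiniLM-L6-v2 outputs 384 dimensions) requires the package installed:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: 'cat' and 'kitten' point roughly the same way
cat = np.array([0.9, 0.1, 0.3, 0.0])
kitten = np.array([0.8, 0.2, 0.4, 0.1])
car = np.array([0.0, 0.9, 0.0, 0.8])

print(round(cosine_similarity(cat, kitten), 2))  # 0.98
print(round(cosine_similarity(cat, car), 2))     # 0.08
```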

Choosing an Approach

Method            Pros                    Cons
Bag of Words      Simple, interpretable   Ignores word order
TF-IDF            Better weighting        Still ignores semantics
Word Embeddings   Captures meaning        Less interpretable
Transformers      State-of-the-art        Computationally expensive

Key Takeaway

Start with TF-IDF for most text classification tasks: it's simple and works well. Use max_features to cap the vocabulary, remove stopwords, and consider n-grams. When you need semantic understanding, or TF-IDF stops improving, move to pre-trained embeddings. The preprocessing steps (cleaning, lowercasing) matter for traditional methods but much less for modern transformers, whose tokenizers work on raw text.

#Machine Learning#NLP#Text Processing#Intermediate