ML · 8 min read

Text Preprocessing for Machine Learning

Learn how to prepare text data for machine learning models.

Sarah Chen
December 19, 2025


Machines don't understand text. They understand numbers. Text preprocessing converts words into numbers that models can learn from.

The Pipeline

```
Raw Text → Clean → Tokenize → Vectorize → Model
```

Step 1: Basic Cleaning

```python
import re

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['clean_text'] = df['text'].apply(clean_text)
```
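
A quick sanity check on a made-up string (the input here is just an illustration, not from a real dataset):

```python
sample = "Check out https://example.com ... it's GREAT!!!"
print(clean_text(sample))
# 'check out its great'
```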

Step 2: Tokenization

Breaking text into words or subwords:

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "Machine learning is amazing!"
tokens = word_tokenize(text)
# ['Machine', 'learning', 'is', 'amazing', '!']
```

Step 3: Stopword Removal

Remove common words (like "the", "is", "a") that carry little meaning on their own:

```python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # The stopword list is lowercase, so normalize case before comparing
    return [w for w in tokens if w.lower() not in stop_words]

remove_stopwords(['Machine', 'learning', 'is', 'amazing'])
# ['Machine', 'learning', 'amazing']
```

Step 4: Stemming/Lemmatization

Reduce words to their base form:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

# Stemming (crude, fast)
stemmer = PorterStemmer()
stemmer.stem('running')   # 'run'
stemmer.stem('studies')   # 'studi'

# Lemmatization (proper words, slower)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('studies')           # 'study'
```

Vectorization: Turning Text into Numbers

### Bag of Words

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is great',
    'I love programming'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# See vocabulary
print(vectorizer.get_feature_names_out())
# ['great', 'is', 'learning', 'love', 'machine', 'programming']

print(X.toarray())
# [[0, 0, 1, 1, 1, 0],
#  [1, 1, 1, 0, 1, 0],
#  [0, 0, 0, 1, 0, 1]]
```
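
One limitation worth seeing directly: because Bag of Words only counts, sentences with the same words in a different order produce identical vectors. A quick illustration:

```python
pairs = ['the dog bites the man', 'the man bites the dog']
bow = CountVectorizer().fit_transform(pairs)
print(bow.toarray())
# Both rows are identical: word order is lost
```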

### TF-IDF (Better!)

Weighs words by importance:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,    # Limit vocabulary
    min_df=2,             # Ignore rare words
    max_df=0.95,          # Ignore very common words
    ngram_range=(1, 2)    # Include bigrams
)

X = tfidf.fit_transform(corpus)
```

**TF-IDF logic:**
- High frequency in a document = higher score
- High frequency across ALL documents = lower score
- Rare but present = important!
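
To see that weighting in action, here is a small sketch on the toy corpus from above (default settings, so the tiny corpus isn't filtered out by `min_df`/`max_df`):

```python
demo = TfidfVectorizer()
scores = demo.fit_transform(corpus)

print(demo.get_feature_names_out())
print(scores.toarray().round(2))
# Words shared by several documents ('machine', 'learning', 'love') get lower
# idf weights than words unique to one document ('great', 'programming'),
# so the rarer, more distinctive terms stand out in each row.
```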

Complete Pipeline Example

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Text classification pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', LogisticRegression())
])

# X_train / X_test are raw text strings; the vectorizer runs inside the pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
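
End to end, that might look like the sketch below. The `df` columns and labels are hypothetical, assuming a DataFrame with a cleaned text column and a target column:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical DataFrame: 'clean_text' holds the strings, 'label' the target
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'], df['label'], test_size=0.2, random_state=42
)

pipeline.fit(X_train, y_train)
print(accuracy_score(y_test, pipeline.predict(X_test)))
```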

Modern Approach: Word Embeddings

Pre-trained embeddings capture semantic meaning:

```python
# Using sentence-transformers (recommended)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)

# Now use embeddings as features
# (embeddings_train = embeddings computed on the training texts)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(embeddings_train, y_train)
```
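
To check that the embeddings really encode meaning, you can compare sentence vectors directly. A minimal sketch using cosine similarity (the example sentences are made up):

```python
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The movie was fantastic",
    "I really enjoyed the film",
    "The invoice is due next week",
]
vecs = model.encode(sentences)

print(cosine_similarity([vecs[0]], [vecs[1]]))  # paraphrases: relatively high
print(cosine_similarity([vecs[0]], [vecs[2]]))  # unrelated topic: lower
```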

Choosing an Approach

| Method | Pros | Cons |
|--------|------|------|
| Bag of Words | Simple, interpretable | Ignores word order |
| TF-IDF | Better weighting | Still ignores semantics |
| Word Embeddings | Captures meaning | Less interpretable |
| Transformers | State-of-the-art | Computationally expensive |

Key Takeaway

Start with TF-IDF for most text classification tasks; it's simple and works well. Use `max_features` to limit the vocabulary, remove stopwords, and consider n-grams. When you need semantic understanding, or TF-IDF isn't enough, move to pre-trained embeddings. The preprocessing steps (cleaning, lowercasing) matter for traditional methods but far less for modern transformers, which handle raw text with their own tokenizers.

#Machine Learning  #NLP  #Text Processing  #Intermediate