# Text Preprocessing for Machine Learning
Learn how to prepare text data for machine learning models.
Machines don't understand text. They understand numbers. Text preprocessing converts words into numbers that models can learn from.
## The Pipeline
```
Raw Text → Clean → Tokenize → Vectorize → Model
```
## Step 1: Basic Cleaning
```python
import re

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['clean_text'] = df['text'].apply(clean_text)
```
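As a quick sanity check, here is the cleaner applied to a made-up messy string:

```python
clean_text("Check https://example.com — it's AMAZING!!!")
# 'check its amazing'
```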
## Step 2: Tokenization
Breaking text into words or subwords:
```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "Machine learning is amazing!"
tokens = word_tokenize(text)
# ['Machine', 'learning', 'is', 'amazing', '!']
```
## Step 3: Stopword Removal
Remove common words ("the", "is", "and") that add little signal for most models:
```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # NLTK's stopword list is lowercase, so compare case-insensitively
    return [w for w in tokens if w.lower() not in stop_words]

remove_stopwords(['Machine', 'learning', 'is', 'amazing'])
# ['Machine', 'learning', 'amazing']
```
## Step 4: Stemming/Lemmatization

Reduce words to their base form:
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # needed for the lemmatizer

# Stemming (crude, fast)
stemmer = PorterStemmer()
stemmer.stem('running')  # 'run'
stemmer.stem('studies')  # 'studi'

# Lemmatization (proper words, slower)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('studies')           # 'study'
```
## Vectorization: Turning Text into Numbers
### Bag of Words
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is great',
    'I love programming'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# See the vocabulary (note: single-character tokens like 'I'
# are dropped by the default token pattern)
print(vectorizer.get_feature_names_out())
# ['great' 'is' 'learning' 'love' 'machine' 'programming']

print(X.toarray())
# [[0 0 1 1 1 0]
#  [1 1 1 0 1 0]
#  [0 0 0 1 0 1]]
```
### TF-IDF (Better!)
Weighs words by importance:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,   # Limit vocabulary size
    min_df=2,            # Ignore rare words (in fewer than 2 documents)
    max_df=0.95,         # Ignore very common words (in over 95% of documents)
    ngram_range=(1, 2)   # Include unigrams and bigrams
)

X = tfidf.fit_transform(corpus)
```
**TF-IDF logic:**

- High frequency in a document = higher score
- High frequency across ALL documents = lower score
- Rare but present = important!
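To make that concrete, here is a minimal sketch of the textbook tf-idf formula. Scikit-learn's implementation differs in detail (it smooths the idf term and l2-normalizes each row), so the exact numbers won't match, but the logic is the same:

```python
import math

def tf_idf(term, doc, corpus):
    """Textbook tf-idf for a tokenized doc within a tokenized corpus."""
    # Term frequency: share of the document made up of this term
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency: terms that appear everywhere score zero
    idf = math.log(len(corpus) / df)
    return tf * idf

docs = [text.lower().split() for text in [
    'I love machine learning',
    'Machine learning is great',
    'I love programming',
]]
tf_idf('learning', docs[0], docs)      # in 2 of 3 docs: modest score
tf_idf('programming', docs[2], docs)   # in 1 of 3 docs: higher idf, higher score
```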
## Complete Pipeline Example
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Text classification pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', LogisticRegression())
])

# X_train is a list (or Series) of raw text strings
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
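A nice property of wrapping both steps in a `Pipeline` is that raw strings go straight in; here's a toy run with made-up data:

```python
# Hypothetical toy data, just to show the pipeline accepts raw strings
texts = ['great product, love it', 'terrible, broke after a day',
         'works great, highly recommend', 'awful, total waste of money']
labels = [1, 0, 1, 0]

pipeline.fit(texts, labels)
pipeline.predict(['love it, works great'])  # expect: array([1])
```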
## Modern Approach: Word Embeddings
Pre-trained embeddings capture semantic meaning:
```python
# Using sentence-transformers (recommended)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings_train = model.encode(train_texts)

# Now use the embeddings as features
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(embeddings_train, y_train)
```
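Because these embeddings encode meaning, semantically similar sentences land close together in vector space. A quick check with cosine similarity (the exact scores depend on the model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(['I love machine learning',
                    'ML is my favorite field',
                    'I had pasta for lunch'])

# Similar meaning, different words -> high similarity
print(util.cos_sim(emb[0], emb[1]))
# Unrelated meaning -> much lower similarity
print(util.cos_sim(emb[0], emb[2]))
```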
## Choosing an Approach
| Method | Pros | Cons |
|--------|------|------|
| Bag of Words | Simple, interpretable | Ignores word order |
| TF-IDF | Better weighting | Still ignores semantics |
| Word Embeddings | Captures meaning | Less interpretable |
| Transformers | State of the art | Computationally expensive |
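The "ignores word order" limitation is easy to demonstrate: two sentences with opposite meanings get identical bag-of-words vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = ['the dog bit the man', 'the man bit the dog']
X = CountVectorizer().fit_transform(pair).toarray()
print((X[0] == X[1]).all())  # True: same vectors, opposite meanings
```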
## Key Takeaway
Start with TF-IDF for most text classification tasks: it's simple and works well. Use `max_features` to limit the vocabulary, remove stopwords, and consider n-grams. If you need semantic understanding, or TF-IDF isn't enough, move to pre-trained embeddings. Preprocessing steps like cleaning and lowercasing matter for traditional methods, but much less for modern transformers, whose tokenizers handle raw text.