Text Preprocessing for Machine Learning
Learn how to prepare text data for machine learning models.
Sarah Chen
December 19, 2025
Machines don't understand text. They understand numbers. Text preprocessing converts words into numbers that models can learn from.
The Pipeline
Raw Text → Clean → Tokenize → Vectorize → Model
Step 1: Basic Cleaning
import re

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text

# df is a pandas DataFrame with a 'text' column
df['clean_text'] = df['text'].apply(clean_text)
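A quick sanity check on a single string (the URL and sentence here are just illustrative):

sample = "Check out https://example.com - Machine Learning is GREAT!!!"
print(clean_text(sample))
# 'check out machine learning is great'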
Step 2: Tokenization
Breaking text into words or subwords:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
text = "Machine learning is amazing!"
tokens = word_tokenize(text)
# ['Machine', 'learning', 'is', 'amazing', '!']
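Subword tokenization (used by transformer models) splits rarer words into smaller pieces instead of treating every word as one token. A minimal sketch using the Hugging Face transformers library, assuming it is installed (the exact pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Tokenization is amazing!"))
# Something like: ['token', '##ization', 'is', 'amazing', '!']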
Step 3: Stopword Removal
Remove common words (like "the", "is", "and") that carry little meaning on their own:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    # Lowercase for the lookup and drop punctuation-only tokens
    return [w for w in tokens if w.lower() not in stop_words and w.isalpha()]

print(remove_stopwords(tokens))
# ['Machine', 'learning', 'amazing']
Step 4: Stemming/Lemmatization
Reduce words to base form:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # required for WordNetLemmatizer

# Stemming (crude, fast)
stemmer = PorterStemmer()
stemmer.stem('running')  # 'run'
stemmer.stem('studies')  # 'studi'

# Lemmatization (proper words, slower)
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('studies')           # 'study'
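Putting Steps 1-4 together, a minimal preprocessing helper might look like the sketch below (one reasonable combination; lemmatization is used here, but stemming would also work):

def preprocess(text):
    text = clean_text(text)            # Step 1: clean
    tokens = word_tokenize(text)       # Step 2: tokenize
    tokens = remove_stopwords(tokens)  # Step 3: drop stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]  # Step 4: lemmatize

preprocess("Studying machine learning at https://example.com!")
# Something like: ['studying', 'machine', 'learning']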
Vectorization: Turning Text into Numbers
Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'I love machine learning',
    'Machine learning is great',
    'I love programming'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# See vocabulary
print(vectorizer.get_feature_names_out())
# ['great', 'is', 'learning', 'love', 'machine', 'programming']
print(X.toarray())
# [[0, 0, 1, 1, 1, 0],
# [1, 1, 1, 0, 1, 0],
# [0, 0, 0, 1, 0, 1]]
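Once fitted, the same vocabulary is reused for new text; words the vectorizer has never seen are simply dropped (the sentence below is illustrative):

new_doc = ['I love deep learning']
print(vectorizer.transform(new_doc).toarray())
# [[0, 0, 1, 1, 0, 0]]  -- 'deep' is not in the vocabulary, so it is ignored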
TF-IDF (Better!)
Weighs words by importance:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
    max_features=5000,   # Limit vocabulary size
    min_df=2,            # Ignore words in fewer than 2 documents
    max_df=0.95,         # Ignore words in more than 95% of documents
    ngram_range=(1, 2)   # Include unigrams and bigrams
)
X = tfidf.fit_transform(corpus)
TF-IDF logic:
- High frequency within a document = higher score
- High frequency across ALL documents = lower score
- Rare across the corpus but present in a document = a strong signal
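As a rough sketch of the arithmetic, scikit-learn's default smoothed IDF is idf = ln((1 + n) / (1 + df)) + 1, with each row L2-normalized afterwards. Applying it to the small corpus above:

import numpy as np

n = 3               # total documents in the corpus above
df_learning = 2     # 'learning' appears in 2 documents
df_programming = 1  # 'programming' appears in 1 document

idf_learning = np.log((1 + n) / (1 + df_learning)) + 1        # ~1.29
idf_programming = np.log((1 + n) / (1 + df_programming)) + 1  # ~1.69
# The rarer word gets the higher weight before normalization.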
Complete Pipeline Example
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Text classification pipeline
# X_train / X_test are lists of raw text strings, y_train / y_test their labels
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
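To see how well it does, a quick evaluation sketch (assuming y_test holds the true labels for X_test):

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))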
Modern Approach: Word Embeddings
Pre-trained embeddings capture semantic meaning:
# Using sentence-transformers (recommended)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)  # texts is a list of strings

# Now use the embeddings as features
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(embeddings_train, y_train)  # embeddings computed from the training texts
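Because the vectors capture meaning, semantically similar sentences end up close together. A small sketch using cosine similarity (the sentences are just illustrative):

from sklearn.metrics.pairwise import cosine_similarity

pair = model.encode(['The movie was fantastic', 'I really enjoyed the film'])
other = model.encode(['The stock market fell today'])
print(cosine_similarity([pair[0]], [pair[1]]))  # relatively high
print(cosine_similarity([pair[0]], other))      # relatively low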
Choosing an Approach
| Method | Pros | Cons |
|---|---|---|
| Bag of Words | Simple, interpretable | Ignores word order |
| TF-IDF | Better weighting | Still ignores semantics |
| Word Embeddings | Captures meaning | Less interpretable |
| Transformers | State-of-the-art | Computationally expensive |
Key Takeaway
Start with TF-IDF for most text classification tasks: it's simple and works well. Use max_features to limit the vocabulary, remove stopwords, and consider n-grams. For semantic understanding, or if TF-IDF isn't enough, move to pre-trained embeddings. Preprocessing steps like cleaning and lowercasing matter for traditional methods but much less for modern transformers, which handle raw text with their own tokenizers.
#Machine Learning #NLP #Text Processing #Intermediate