
Natural Language Processing Tasks

Common NLP tasks and techniques.

Robert Anderson
December 18, 2025

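This guide walks through common ways to process text in Python: preprocessing, tokenization, bag of words, TF-IDF, sentiment analysis, named entity recognition, and text classification.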

Text Preprocessing

Clean and normalize text before passing it to a model:

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The cats are running in Boston!"

# Lowercase
text = text.lower()

# Remove punctuation (drop everything except word characters and whitespace)
text = re.sub(r'[^\w\s]', '', text)

# Remove stopwords (download the list once with nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
words = [w for w in text.split() if w not in stop_words]

# Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

print(words)  # ['cat', 'run', 'boston']

Tokenization

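Split text into tokens (words and punctuation marks). word_tokenize needs NLTK's Punkt tokenizer data, downloaded once with nltk.download('punkt') (recent NLTK releases use 'punkt_tab' instead):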

from nltk.tokenize import word_tokenize

text = "Hello from San Francisco!"
tokens = word_tokenize(text)
print(tokens)  # ['Hello', 'from', 'San', 'Francisco', '!']
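
To make the idea concrete, here is a minimal standard-library sketch of a tokenizer. It is not NLTK's algorithm: word_tokenize handles contractions, abbreviations, and other cases that this does not. The function name simple_tokenize and the regular expression are illustrative assumptions.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Illustrative only: a rough approximation, not NLTK's word_tokenize.
    """
    # \w+ matches a run of word characters;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello from San Francisco!")
print(tokens)  # ['Hello', 'from', 'San', 'Francisco', '!']
```

Note that a pattern this simple splits "Don't" into three tokens, which is one reason to prefer a library tokenizer for real text.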

Bag of Words

Convert each text into a vector of word counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love AI",
    "AI is amazing",
    "I love machine learning"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
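
To see what those two print calls produce, the same counting can be sketched with the standard library alone. This is an approximation written for illustration, not scikit-learn's implementation; it assumes CountVectorizer's default settings, which lowercase the text and keep only tokens of two or more characters (so "I" is dropped). The helper name tokenize is an assumption.

```python
import re
from collections import Counter

texts = [
    "I love AI",
    "AI is amazing",
    "I love machine learning",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for text in texts for token in tokenize(text)})

# One row per document, one column per vocabulary word
matrix = []
for text in texts:
    counts = Counter(tokenize(text))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['ai', 'amazing', 'is', 'learning', 'love', 'machine']
for row in matrix:
    print(row)
# [1, 0, 0, 0, 1, 0]
# [1, 1, 1, 0, 0, 0]
# [0, 0, 0, 1, 1, 1]
```

Each row is one document and each column is one vocabulary word, so word order within a sentence is lost.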

TF-IDF

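TF-IDF reweights the counts so that words appearing in many documents count for less. This often works better than raw bag-of-words counts: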

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # texts: the list from the Bag of Words example

# Words that appear in few documents get higher weights
# Words that appear in most documents get lower weights
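
The weighting can be worked through by hand. The sketch below is a standard-library approximation, assuming the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The helper names tokenize and tfidf are assumptions made for illustration.

```python
import math
import re
from collections import Counter

texts = [
    "I love AI",
    "AI is amazing",
    "I love machine learning",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(text)) for text in texts]
n_docs = len(docs)

# Document frequency: the number of documents each word appears in
df = Counter(word for doc in docs for word in doc)

def tfidf(doc):
    # Term count times smoothed inverse document frequency
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in doc.items()
    }
    # L2-normalize so the document vector has length 1
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}

scores = tfidf(docs[1])  # "AI is amazing"
for word, score in sorted(scores.items()):
    print(f"{word}: {score:.3f}")
# ai: 0.474
# amazing: 0.623
# is: 0.623
```

"ai" appears in two of the three documents, so it scores lower than "amazing" and "is", which each appear in only one.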

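Sentiment Analysis

Score how positive or negative a text is. TextBlob's polarity ranges from -1.0 (negative) to 1.0 (positive):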

from textblob import TextBlob

text = "The restaurant in Austin was amazing!"
blob = TextBlob(text)

sentiment = blob.sentiment.polarity
if sentiment > 0:
    print("Positive!")
elif sentiment < 0:
    print("Negative!")
else:
    print("Neutral")
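
Scores very close to zero usually carry little signal, so one common refinement is to treat a small band around zero as neutral. A minimal sketch, assuming a cutoff of 0.1: the function name label_polarity and the threshold value are hypothetical and should be tuned on your own data.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    threshold is a hypothetical cutoff chosen for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for score in (0.75, 0.05, -0.4):
    print(score, label_polarity(score))
# 0.75 positive
# 0.05 neutral
# -0.4 negative
```

With TextBlob installed, this could be called as label_polarity(TextBlob(text).sentiment.polarity).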

Named Entity Recognition

Find names, places, organizations:

import spacy

# Install the model once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is located in Cupertino, California."

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Cupertino: GPE
# California: GPE

Text Classification

Train a model to assign a label to each text:

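from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder data for illustration: lists of raw strings and their labels
X_train = ["I love this", "Absolutely amazing", "I hate this", "Truly awful"]
y_train = ["pos", "pos", "neg", "neg"]
X_test = ["This is amazing"]

# The pipeline vectorizes the raw text, then passes the vectors to the classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)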
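
Once you have predictions, the usual next step is to compare them with the true labels. Here is a minimal standard-library sketch of accuracy; the label lists are hypothetical values made up for illustration, not results from the pipeline above. scikit-learn's sklearn.metrics.accuracy_score computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only
y_test = ["pos", "neg", "pos", "neg"]
predictions = ["pos", "neg", "neg", "neg"]

print(accuracy(y_test, predictions))  # 0.75
```

Accuracy can mislead when one class dominates the data, so per-class precision and recall are worth checking too.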

Remember

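  • Preprocess text (lowercase, strip punctuation, remove stopwords) before vectorizing it
  • TF-IDF usually works better than raw bag-of-words counts
  • Use pre-trained models, such as spaCy's en_core_web_sm, when possible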