Natural Language Processing Tasks
Common NLP tasks and techniques for processing text, with short Python examples.
Text Preprocessing
Clean and normalize text before passing it to a model:
```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time setup: nltk.download('stopwords')

text = "The cats are running in Boston!"

# Lowercase
text = text.lower()

# Remove punctuation (keep only word characters and whitespace)
text = re.sub(r'[^\w\s]', '', text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in text.split() if w not in stop_words]

# Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

print(words)  # ['cat', 'run', 'boston']
```
Tokenization
Split text into tokens (words and punctuation marks):
```python
from nltk.tokenize import word_tokenize

# Requires the NLTK Punkt tokenizer data to be downloaded once

text = "Hello from San Francisco!"
tokens = word_tokenize(text)
print(tokens)  # ['Hello', 'from', 'San', 'Francisco', '!']
```
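To see what a tokenizer does at its simplest, here is a rough regex-based sketch using only the standard library. The name `simple_tokenize` is hypothetical, and this is not how `word_tokenize` works internally; it only mimics the result for plain sentences like the one above.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello from San Francisco!")
print(tokens)  # ['Hello', 'from', 'San', 'Francisco', '!']
```

A sketch like this splits contractions such as "don't" at the apostrophe, whereas NLTK applies more careful rules, so prefer a library tokenizer for real text.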
Bag of Words
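Represent each document as a vector of word counts:
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love AI",
    "AI is amazing",
    "I love machine learning",
]

# The default tokenizer lowercases text and drops one-character tokens such as "I"
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # vocabulary, one entry per column
print(X.toarray())                         # one row of counts per document
```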
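The same idea can be sketched by hand with the standard library, which makes the "vocabulary plus counts" structure explicit. The helper `tokenize` is hypothetical; it approximates the vectorizer's defaults (lowercasing, dropping one-character tokens) with a plain whitespace split, which is only adequate for punctuation-free text like this.

```python
from collections import Counter

texts = ["I love AI", "AI is amazing", "I love machine learning"]


def tokenize(text):
    # Lowercase and keep tokens with two or more characters, so "I" is dropped.
    return [w for w in text.lower().split() if len(w) >= 2]


docs = [tokenize(t) for t in texts]

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocab = sorted({w for doc in docs for w in doc})

# Each row holds the count of every vocabulary word in one document.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[w] for w in vocab])

print(vocab)
# ['ai', 'amazing', 'is', 'learning', 'love', 'machine']
for row in matrix:
    print(row)
# [1, 0, 0, 0, 1, 0]
# [1, 1, 1, 0, 0, 0]
# [0, 0, 0, 1, 1, 1]
```

Word order is discarded entirely: only the counts survive, which is where the name "bag of words" comes from.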
TF-IDF
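TF-IDF weights each word count by how rare the word is across documents, which often works better than raw counts:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same example documents as in the Bag of Words section
texts = ["I love AI", "AI is amazing", "I love machine learning"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Words that appear in few documents get higher scores
# Words that appear in many documents get lower scores
```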
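To make the weighting concrete, here is a hand-rolled sketch using only the standard library. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is the variant scikit-learn describes as its default; treat an exact numerical match with `TfidfVectorizer` as an assumption rather than a guarantee. The tokenization is the same simplified whitespace split, and `tfidf_row` is a hypothetical helper.

```python
import math
from collections import Counter

texts = ["I love AI", "AI is amazing", "I love machine learning"]
docs = [[w for w in t.lower().split() if len(w) >= 2] for t in texts]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = {w: sum(w in doc for doc in docs) for w in vocab}

# Smoothed inverse document frequency (assumed formula, see the note above).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    # L2-normalize so every document vector has length 1.
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


tfidf = [tfidf_row(doc) for doc in docs]

print(round(idf["ai"], 3))       # 1.288 ("ai" occurs in 2 of 3 documents)
print(round(idf["amazing"], 3))  # 1.693 ("amazing" occurs in 1 of 3 documents)
```

In the second document, "amazing" therefore ends up with a larger weight than "ai", even though both occur exactly once.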
Sentiment Analysis
```python
from textblob import TextBlob

text = "The restaurant in Austin was amazing!"
blob = TextBlob(text)

# Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
sentiment = blob.sentiment.polarity
if sentiment > 0:
    print("Positive!")
elif sentiment < 0:
    print("Negative!")
else:
    print("Neutral")
```
Named Entity Recognition
Find names, places, organizations:
```python
import spacy

# One-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is located in Cupertino, California."

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Cupertino: GPE
# California: GPE
```
Text Classification
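```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chain the vectorizer and the classifier so both are fit together
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# X_train and X_test are lists of raw text strings; y_train holds the labels.
# They are assumed to be defined elsewhere (for example, from a train/test split).
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```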
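For intuition about what the classifier step does, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing, written with the standard library only. It is a simplification under stated assumptions: it uses raw word counts rather than TF-IDF weights, the training sentences are invented toy data, and `train_nb` and `predict_nb` are hypothetical helpers, not scikit-learn functions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration only.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]


def train_nb(texts, labels, alpha=1.0):
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab, alpha


def predict_nb(model, text):
    word_counts, label_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior of the class.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # skip words never seen during training
            # Add the Laplace-smoothed log likelihood of the word.
            score += math.log(
                (word_counts[label][w] + alpha)
                / (total_words + alpha * len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_nb(train_texts, train_labels)
print(predict_nb(model, "loved the great acting"))  # pos
print(predict_nb(model, "awful boring movie"))      # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was unseen for one class from forcing that class's probability to zero.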
Remember
- Preprocess text before vectorizing it
- TF-IDF usually works better than raw bag-of-words counts
- Use pre-trained models when possible