Process text with AI.

Text Preprocessing

Clean text before feeding it to a model:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)  # one-time download of the stopword list

text = "The cats are running in Boston!"

# Lowercase
text = text.lower()

# Remove punctuation (keep only word characters and whitespace)
text = re.sub(r'[^\w\s]', '', text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in text.split() if w not in stop_words]

# Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

print(words)  # ['cat', 'run', 'boston']
```
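The cleaning steps can also be wrapped in one reusable function. The sketch below uses only the standard library and assumes a tiny hand-written stopword set (`STOP_WORDS` is hypothetical and far smaller than NLTK's list). It skips stemming, so its output differs from the NLTK version.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "and"}


def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return [w for w in text.split() if w not in STOP_WORDS]


print(preprocess("The cats are running in Boston!"))  # ['cats', 'running', 'boston']
```

Keeping the steps in one function makes it easier to apply identical cleaning to training and prediction inputs.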

Tokenization

Split text into pieces:

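```python
from nltk.tokenize import word_tokenize

# Requires NLTK tokenizer data: nltk.download('punkt')
# (recent NLTK releases ask for 'punkt_tab' instead)
text = "Hello from San Francisco!"
tokens = word_tokenize(text)
print(tokens)  # ['Hello', 'from', 'San', 'Francisco', '!']
```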
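To see roughly what a word tokenizer does, here is a minimal regex-based sketch. It is not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper name; it only separates runs of word characters from individual punctuation marks.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello from San Francisco!"))
# ['Hello', 'from', 'San', 'Francisco', '!']
```

A sketch this simple breaks contractions such as "don't" into three pieces, which is one reason a library tokenizer is usually the better choice.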

Bag of Words

Convert text to numbers:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love AI",
    "AI is amazing",
    "I love machine learning",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # vocabulary: one column per word
print(X.toarray())  # one row of word counts per text
```
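The counting itself is simple enough to do by hand. The sketch below rebuilds the matrix with the standard library, assuming the same tokenization as `CountVectorizer`'s defaults (lowercasing, and keeping only tokens of two or more characters, which is why "I" disappears).

```python
import re
from collections import Counter

texts = ["I love AI", "AI is amazing", "I love machine learning"]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for text in texts for tok in tokenize(text)})

# One row per text, one column per vocabulary word
matrix = []
for text in texts:
    counts = Counter(tokenize(text))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['ai', 'amazing', 'is', 'learning', 'love', 'machine']
for row in matrix:
    print(row)
# [1, 0, 0, 0, 1, 0]
# [1, 1, 1, 0, 0, 0]
# [0, 0, 0, 1, 1, 1]
```

Word order is lost in this representation: each text becomes a row of counts and nothing more.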

TF-IDF

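TF-IDF scales each word count by how rare the word is across all texts, which often works better than raw bag-of-words counts: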

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `texts` is the list of strings from the bag-of-words example
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Rare words get higher scores
# Common words get lower scores
```
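To make the "rare words score higher" claim concrete, here is a standard-library sketch of the computation. It assumes the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which to my understanding is what `TfidfVectorizer` does with default settings. Treat it as an illustration rather than a reimplementation.

```python
import math
import re
from collections import Counter

texts = ["I love AI", "AI is amazing", "I love machine learning"]


def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(t)) for t in texts]
n_docs = len(docs)

# Document frequency: number of texts containing each word
df = Counter(word for doc in docs for word in doc)

# Smoothed inverse document frequency: rarer words get a larger value
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}


def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one document's word counts."""
    weights = {word: tf * idf[word] for word, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


weights = tfidf(docs[1])  # "AI is amazing"
for word, w in sorted(weights.items()):
    print(f"{word}: {w:.3f}")
# ai: 0.474
# amazing: 0.623
# is: 0.623
```

"ai" appears in two of the three texts, so it ends up with a lower weight than "amazing" and "is", which each appear in only one.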

Sentiment Analysis

```python
from textblob import TextBlob

text = "The restaurant in Austin was amazing!"
blob = TextBlob(text)

sentiment = blob.sentiment.polarity  # float between -1.0 and 1.0
if sentiment > 0:
    print("Positive!")
elif sentiment < 0:
    print("Negative!")
else:
    print("Neutral")
```
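A polarity score can be pictured as an average of per-word sentiment values. The toy scorer below illustrates that idea with a hypothetical five-word `LEXICON`. It is not TextBlob's implementation, and it ignores negation, intensifiers, and context.

```python
import re

# Hypothetical word scores for illustration; real lexicons hold thousands of entries
LEXICON = {"amazing": 1.0, "great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -1.0}


def polarity(text):
    """Average the scores of known words; 0.0 when no word is in the lexicon."""
    words = re.findall(r"\w+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label(score):
    """Map a polarity score to the same three labels used above."""
    if score > 0:
        return "Positive!"
    if score < 0:
        return "Negative!"
    return "Neutral"


print(label(polarity("The restaurant in Austin was amazing!")))  # Positive!
```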

Named Entity Recognition

Find names, places, organizations:

```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is located in Cupertino, California."

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Apple Inc.: ORG
# Cupertino: GPE
# California: GPE
```
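Once entities are extracted, a common next step is grouping them by label. The sketch below works on plain `(text, label)` pairs copied from the example output, so it runs without spaCy; `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the example output above
entities = [("Apple Inc.", "ORG"), ("Cupertino", "GPE"), ("California", "GPE")]


def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


print(group_by_label(entities))
# {'ORG': ['Apple Inc.'], 'GPE': ['Cupertino', 'California']}
```

With a spaCy `doc`, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.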

Text Classification

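```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# X_train / X_test: lists of raw text strings; y_train: the matching labels
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```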
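For intuition about the classifier step, here is a small multinomial Naive Bayes written with the standard library. It is a sketch under stated assumptions: the training sentences and labels are hypothetical toy data, features are raw word counts rather than TF-IDF weights, and smoothing is add-one (Laplace).

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data for illustration
train_texts = [
    "I love this movie",
    "what an amazing film",
    "I hate this movie",
    "what a terrible film",
]
train_labels = ["pos", "pos", "neg", "neg"]


def tokenize(text):
    return text.lower().split()


# Count words per class and documents per class
word_counts = defaultdict(Counter)
class_counts = Counter(train_labels)
for text, label in zip(train_texts, train_labels):
    word_counts[label].update(tokenize(text))

vocabulary = {w for counts in word_counts.values() for w in counts}


def predict(text):
    """Return the class with the highest log-probability for the text."""
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training documents in this class
        score = math.log(class_counts[label] / len(train_labels))
        total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen class/word pairs
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(predict("an amazing movie"))  # pos
print(predict("a terrible movie"))  # neg
```

Working in log space keeps the product of many small probabilities from underflowing to zero.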

Remember

- Always preprocess text
- TF-IDF better than bag of words
- Use pre-trained models when possible

Natural Language Processing Tasks