# Logistic Regression: Classification Basics
Despite the confusing name, Logistic Regression is for **classification**, not regression. It predicts categories.
## The Problem

Linear Regression predicts numbers. But what if you need:

- Spam or Not Spam?
- Will a customer churn? Yes/No
- Is a transaction fraudulent? Yes/No
You need probabilities and categories, not raw numbers.
## The Solution: The Sigmoid Function

Logistic Regression uses the sigmoid function to squash any number into the 0-1 range:
```
             1
σ(x) = ────────────
        1 + e^(-x)
```
```
Output
  1 │        ════════
    │      ╱
0.5 │──────╋──────────
    │    ╱
  0 │════
    └──────────────── x
```
Now we can interpret the output as a probability!
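To check the squashing numerically, here's a minimal sketch (the `sigmoid` helper below is our own, not a library function):

```python
import numpy as np

def sigmoid(x):
    """Map any real number into the (0, 1) range."""
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))    # 0.5    -- exactly the midpoint
print(sigmoid(5))    # ~0.993 -- large positive inputs approach 1
print(sigmoid(-5))   # ~0.007 -- large negative inputs approach 0
```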
## How It Works

### Step 1: Linear combination (like linear regression)

```python
z = w1*x1 + w2*x2 + ... + bias
```

### Step 2: Sigmoid to get a probability

```python
probability = 1 / (1 + exp(-z))   # sigmoid(z)
```

### Step 3: Threshold for the final prediction

```python
prediction = 1 if probability >= 0.5 else 0
```
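Putting the three steps together, here's a from-scratch sketch. The weights and bias are hand-picked for illustration; a real model learns them from data during `fit`:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked parameters (illustrative only -- training would learn these)
w = np.array([0.4, 0.05])   # weights for [study_hours, previous_score]
bias = -5.0

x = np.array([5, 62])       # one student: 5 hours studied, previous score 62

z = np.dot(w, x) + bias                      # Step 1: linear combination
probability = sigmoid(z)                     # Step 2: squash into (0, 1)
prediction = 1 if probability >= 0.5 else 0  # Step 3: threshold

print(f"z = {z:.2f}, probability = {probability:.3f}, prediction = {prediction}")
```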
## Code Example

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: study hours, previous score -> pass/fail
X = np.array([[2, 50], [3, 55], [5, 65], [6, 70], [8, 85],
              [10, 90], [1, 45], [2, 48], [4, 60], [7, 78]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1])  # 0=fail, 1=pass

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
new_student = [[5, 62]]  # 5 hours of study, previous score of 62
probability = model.predict_proba(new_student)
prediction = model.predict(new_student)

print(f"Probability of passing: {probability[0][1]:.2%}")
print(f"Prediction: {'Pass' if prediction[0] == 1 else 'Fail'}")
```
Output (exact numbers will vary with the random train/test split):

```
Probability of passing: 73.45%
Prediction: Pass
```
## Understanding the Output

```python
model.predict_proba([[5, 62]])
# Returns: [[0.2655, 0.7345]]
#             ↑ Prob(Fail)  ↑ Prob(Pass)
```
Probabilities always sum to 1.
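You can verify this with the model trained above (this reuses `model` and `X_test` from the earlier example):

```python
import numpy as np

proba = model.predict_proba(X_test)       # one row per sample, one column per class
print(np.allclose(proba.sum(axis=1), 1))  # True -- each row sums to 1
```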
## Multiclass Classification
What if you have more than 2 classes? (Cat, Dog, Bird)
```python
model = LogisticRegression(multi_class='multinomial')
model.fit(X, y)  # y can be [0, 1, 2] for three classes

# Predictions
model.predict_proba([[features]])
# Returns: [[0.15, 0.75, 0.10]]
# Probabilities for each class
```
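A self-contained version on made-up three-class data (the feature values are arbitrary, just enough to make the snippet run):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two features, three classes (0=Cat, 1=Dog, 2=Bird)
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5],
              [3.2, 0.7], [0.5, 3.5], [0.4, 3.8]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression(multi_class='multinomial')
clf.fit(X, y)

print(clf.predict_proba([[1.1, 1.9]]))  # three probabilities, one per class
print(clf.predict([[1.1, 1.9]]))        # the class with the highest probability
```

Note that in recent scikit-learn versions, multinomial handling is already the default for multiclass targets, so the explicit `multi_class` argument may be unnecessary.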
## Evaluation Metrics

### Accuracy

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_true, y_pred)  # fraction of predictions that are correct
```
### Confusion Matrix

```
              Predicted
               0     1
Actual  0   [ TN,   FP ]
        1   [ FN,   TP ]
```

- **TN**: Correctly predicted negative
- **TP**: Correctly predicted positive
- **FP**: False alarm (predicted positive, was negative)
- **FN**: Missed it (predicted negative, was positive)
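Rather than counting by hand, scikit-learn's `confusion_matrix` returns these four numbers; a sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # made-up ground truth
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # made-up predictions

# For binary labels, ravel() unpacks in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")   # TN=3, FP=1, FN=1, TP=3
```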
### Precision & Recall

```python
precision = TP / (TP + FP)  # Of predicted positives, how many were correct?
recall = TP / (TP + FN)     # Of actual positives, how many were found?
```
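scikit-learn computes both directly; continuing with the made-up labels from the confusion matrix sketch:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```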
## Choosing the Threshold

The default threshold is 0.5, but you can change it:
```python
# Lower threshold = more positive predictions
threshold = 0.3
predictions = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
```
**When to lower the threshold:** when missing positives is costly (disease detection)

**When to raise the threshold:** when false alarms are costly (spam filter)
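To see the trade-off concretely, you can sweep the threshold and watch precision and recall move in opposite directions (this reuses `model`, `X_test`, and `y_test` from the pass/fail example; exact numbers depend on the random split):

```python
from sklearn.metrics import precision_score, recall_score

probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds, zero_division=0)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```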
## Logistic vs Linear Regression

| Aspect   | Linear Regression | Logistic Regression |
|----------|-------------------|---------------------|
| Output   | Any number        | Probability (0-1)   |
| Use case | Predict values    | Predict categories  |
| Function | Straight line     | S-curve (sigmoid)   |
| Example  | House price       | Spam/Not spam       |
## Pros and Cons

**Pros:**

- Simple and fast
- Gives probabilities, not just predictions
- Interpretable coefficients
- Works well for linearly separable data

**Cons:**

- Assumes a linear decision boundary
- Can't capture complex patterns
- Struggles with highly correlated features
## Key Takeaway
Logistic Regression is the go-to algorithm for classification. It's simple, interpretable, and often works surprisingly well. Even when you use fancier models, Logistic Regression is a great baseline to compare against.