# Logistic Regression: Classification Basics
Despite the confusing name, Logistic Regression is for classification, not regression. It predicts categories.
## The Problem
Linear Regression predicts continuous numbers. But what if you need to answer questions like:

- Is this email spam or not spam?
- Will this customer churn? Yes/No
- Is this transaction fraudulent? Yes/No
You need probabilities and categories, not raw numbers.
## The Solution: The Sigmoid Function

Logistic Regression uses the sigmoid function to squash any real number into the range (0, 1):
```
σ(x) = 1 / (1 + e^(-x))
```
Plotted, the sigmoid is an S-shaped curve: it approaches 0 for large negative x, crosses 0.5 at x = 0, and approaches 1 for large positive x. Because the output always lies strictly between 0 and 1, we can interpret it as a probability.
## How It Works

**Step 1:** Compute a linear combination of the features (exactly as in linear regression):

```
z = w1*x1 + w2*x2 + ... + bias
```

**Step 2:** Apply the sigmoid to turn z into a probability:

```
probability = sigmoid(z) = 1 / (1 + exp(-z))
```

**Step 3:** Apply a threshold to get the final prediction:

```
prediction = 1 if probability >= 0.5 else 0
```
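To make the three steps concrete, here is a minimal from-scratch sketch in NumPy. The weights and bias are made-up values chosen only for illustration, not parameters fitted to any data:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters, invented for this illustration
w = np.array([0.8, 0.05])   # one weight per feature
bias = -7.0

x = np.array([5, 62])       # features: study hours, previous score

z = np.dot(w, x) + bias                        # Step 1: linear combination
probability = sigmoid(z)                       # Step 2: squash to a probability
prediction = 1 if probability >= 0.5 else 0    # Step 3: threshold

print(f"z = {z:.2f}, probability = {probability:.3f}, prediction = {prediction}")
```

In practice a library such as scikit-learn learns the weights and bias from data, which is what the next example does.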
## Code Example

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: [study hours, previous score] -> pass/fail
X = np.array([[2, 50], [3, 55], [5, 65], [6, 70], [8, 85], [10, 90],
              [1, 45], [2, 48], [4, 60], [7, 78]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1])  # 0 = fail, 1 = pass

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict for a new student: 5 hours of study, previous score of 62
new_student = [[5, 62]]
probability = model.predict_proba(new_student)
prediction = model.predict(new_student)

print(f"Probability of passing: {probability[0][1]:.2%}")
print(f"Prediction: {'Pass' if prediction[0] == 1 else 'Fail'}")
```
Example output (your numbers will vary because the train/test split is random):

```
Probability of passing: 73.45%
Prediction: Pass
```
## Understanding the Output

```python
model.predict_proba([[5, 62]])
# Returns: [[0.2655, 0.7345]]
#           first value = Prob(Fail), second value = Prob(Pass)
```
Probabilities always sum to 1.
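You can check this directly on the test set from the example above:

```python
proba = model.predict_proba(X_test)
print(proba.sum(axis=1))  # every row sums to 1.0
```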
## Multiclass Classification

What if you have more than two classes (Cat, Dog, Bird)?
```python
# Note: recent scikit-learn versions (1.5+) deprecate the multi_class
# parameter; multinomial is the default, so LogisticRegression() suffices.
model = LogisticRegression(multi_class='multinomial')
model.fit(X, y)  # y can be [0, 1, 2] for three classes

# Predictions ('features' is a placeholder for one sample's feature values)
model.predict_proba([[features]])
# Returns something like: [[0.15, 0.75, 0.10]] -- one probability per class
```
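Here is a small runnable sketch with made-up data for three classes; the feature values are invented purely for illustration (scikit-learn handles the multiclass case automatically):

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Made-up data: two features, three classes (0 = cat, 1 = dog, 2 = bird)
X_animals = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5],
                      [3.2, 0.7], [0.5, 3.5], [0.4, 3.8]])
y_animals = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression()
clf.fit(X_animals, y_animals)

sample = [[1.1, 1.9]]
print(clf.predict_proba(sample))  # one probability per class, summing to 1
print(clf.predict(sample))        # the class with the highest probability
```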
## Evaluation Metrics

### Accuracy

```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)  # fraction of predictions that are correct
```
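Concretely, with the pass/fail model and test split from the Code Example above:

```python
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # e.g. 1.0 if every test sample is correct
```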
### Confusion Matrix

| | Predicted 0 | Predicted 1 |
|---|---|---|
| **Actual 0** | TN | FP |
| **Actual 1** | FN | TP |
- TN: Correctly predicted negative
- TP: Correctly predicted positive
- FP: False alarm (predicted positive, was negative)
- FN: Missed it (predicted negative, was positive)
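scikit-learn computes this matrix for you. A short sketch, reusing `y_test` and `y_pred` from the accuracy snippet above:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN  FP]
#  [FN  TP]]  -- rows are actual classes, columns are predicted classes
```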
### Precision & Recall

```
precision = TP / (TP + FP)  # Of predicted positives, how many were correct?
recall    = TP / (TP + FN)  # Of actual positives, how many did we find?
```
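Both are available as ready-made functions; again reusing `y_test` and `y_pred` from above:

```python
from sklearn.metrics import precision_score, recall_score

print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
```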
## Choosing the Threshold

The default threshold is 0.5, but you can change it:

```python
# Lower threshold = more positive predictions
threshold = 0.3
predictions = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
```
- **Lower** the threshold when missing a positive is costly (e.g., disease detection).
- **Raise** the threshold when false alarms are costly (e.g., spam filtering).
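A quick way to see the trade-off is to sweep the threshold and watch precision and recall move in opposite directions. A sketch, once more reusing the binary model and test split from the Code Example:

```python
from sklearn.metrics import precision_score, recall_score

proba_pos = model.predict_proba(X_test)[:, 1]  # probability of the positive class
for threshold in (0.3, 0.5, 0.7):
    preds = (proba_pos >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds, zero_division=0)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```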
## Logistic vs Linear Regression
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Any number | Probability (0-1) |
| Use case | Predict values | Predict categories |
| Function | Straight line | S-curve (sigmoid) |
| Example | House price | Spam/Not spam |
## Pros and Cons

**Pros:**
- Simple and fast
- Gives probabilities, not just predictions
- Interpretable coefficients
- Works well for linearly separable data
**Cons:**
- Assumes linear decision boundary
- Can't capture complex patterns
- Struggles with highly correlated features
## Key Takeaway
Logistic Regression is the go-to algorithm for classification. It's simple, interpretable, and often works surprisingly well. Even when you use fancier models, Logistic Regression is a great baseline to compare against.