Machine Learning Interview Questions: 50 Essential Questions for Developers

A collection of 50 essential Machine Learning interview questions with answers, covering algorithms, model evaluation, feature engineering, and ML best practices.

Dr. Alex Kumar
December 16, 2025

This comprehensive guide covers 50 essential Machine Learning interview questions that every ML engineer should know. These questions cover fundamental algorithms, model evaluation, feature engineering, optimization, and practical ML concepts commonly asked in technical interviews.

Core ML Algorithms

Understanding core machine learning algorithms is essential. These questions test your knowledge of linear regression, logistic regression, decision trees, random forests, and ensemble methods.

Model Evaluation & Metrics

Proper evaluation is crucial for ML models. Master these questions to demonstrate your understanding of accuracy, precision, recall, ROC curves, cross-validation, and model selection.

Feature Engineering

Feature engineering is often more important than algorithm choice. These questions cover feature selection, transformation, encoding, handling missing values, and feature scaling techniques.

Optimization & Training

Understanding how models learn is key. These questions cover gradient descent variants, learning rates, regularization techniques, and optimization algorithms used in ML.

Advanced ML Concepts

Advanced topics include bias-variance tradeoff, overfitting prevention, ensemble methods, and production ML considerations. These questions test your deep understanding of ML principles.

Common Questions & Answers

Q1

What is Machine Learning?

A

Machine Learning is a subset of AI that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms to identify patterns in data and make predictions or decisions. Three main types: supervised, unsupervised, and reinforcement learning.

Q2

What is the difference between supervised and unsupervised learning?

A

Supervised learning uses labeled data (input-output pairs) to train models. Unsupervised learning finds patterns in unlabeled data. Supervised: classification, regression. Unsupervised: clustering, dimensionality reduction, association rules.

python
# Supervised Learning
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)  # y_train has labels

# Unsupervised Learning
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(X)  # No labels
Q3

What is linear regression?

A

Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation: y = mx + b for a single feature, or y = b + w1*x1 + ... + wn*xn in general. Finds the best-fit line by minimizing the sum of squared errors. Used for continuous predictions. Assumes a linear relationship.

python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

model = LinearRegression()
model.fit(X, y)
predictions = model.predict([[5]])  # [10]
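
For a single feature, the best-fit line described above has a closed form: the slope is the covariance of x and y divided by the variance of x. The following is a minimal sketch in plain Python on the same toy data; the helper name `fit_line` is hypothetical and not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
prediction = slope * 5 + intercept
print(slope, intercept, prediction)  # 2.0 0.0 10.0
```
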
Q4

What is logistic regression?

A

Logistic regression is classification algorithm that predicts probability using sigmoid function. Outputs values between 0 and 1. Uses log-odds (logit). Binary classification: predicts class based on probability threshold (usually 0.5). Can be extended to multi-class.

python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)
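
To make the sigmoid and threshold steps concrete, here is a small sketch in plain Python. The weights and bias are hand-picked, hypothetical values rather than learned ones, and `predict_class` is an illustrative helper name.

```python
import math

def sigmoid(z):
    """Map any real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    # The linear score is the log-odds of the positive class
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical, hand-picked weights (not learned from data)
prob, label = predict_class([2.0, 1.0], weights=[0.8, -0.4], bias=-0.2)
print(round(prob, 3), label)  # 0.731 1
```

Raising the threshold above 0.5 trades recall for precision, which is why it is usually treated as a tunable choice rather than a constant.
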
Q5

What is a decision tree?

A

Decision tree makes decisions by splitting data based on feature values. Tree structure: root, internal nodes (decisions), leaves (outcomes). Uses information gain or Gini impurity for splits. Easy to interpret, prone to overfitting. Basis for random forests.

python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
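
The split criteria named above can be computed by hand. This is a sketch in plain Python with hypothetical helper names and a made-up 10-sample node; a real tree evaluates many candidate splits and keeps the one with the best score.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
gain = information_gain(parent, left, right)
print(gini(parent), entropy(parent), round(gain, 3))  # 0.5 1.0 0.278
```
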
Q6

What is a random forest?

A

Random forest is ensemble method combining multiple decision trees. Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of features. Predictions averaged (regression) or voted (classification). Reduces overfitting, handles non-linearity, feature importance available.

python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)
feature_importance = model.feature_importances_
Q7

What is overfitting and how do you prevent it?

A

Overfitting occurs when model learns training data too well, including noise, performs poorly on new data. Prevent with: more training data, cross-validation, regularization (L1/L2), early stopping, dropout, feature selection, ensemble methods, reducing model complexity.

python
# Regularization
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # L2 regularization

# Early stopping
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
    n_estimators=100,
    validation_fraction=0.2,
    n_iter_no_change=5
)
Q8

What is cross-validation?

A

Cross-validation splits data into k folds, trains on k-1 folds, tests on remaining fold, repeats k times. Provides better estimate of model performance than single train/test split. Common: k-fold (k=5 or 10), stratified k-fold, leave-one-out, time series CV.

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
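
The fold rotation described above can be sketched without any library. The generator below is a simplified illustration with a hypothetical name (`k_fold_indices`); it uses contiguous, unshuffled folds, whereas in practice the data is often shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("test:", test, "train size:", len(train))
```

Every sample lands in the test fold exactly once, which is what lets the averaged score use all of the data for evaluation.
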
Q9

What is the difference between precision and recall?

A

Precision = TP / (TP + FP) - accuracy of positive predictions. Recall = TP / (TP + FN) - ability to find all positives. High precision: few false positives. High recall: few false negatives. F1-score balances both: 2 * (precision * recall) / (precision + recall).

python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
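
A worked example of the formulas above, computed from raw counts. The spam-filter numbers are made up for illustration, and the zero-division guards are an assumption about how to handle empty denominators.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical spam filter: 40 spam caught, 10 good emails flagged, 60 spam missed
p, r, f = precision_recall_f1(tp=40, fp=10, fn=60)
print(p, r, round(f, 3))  # 0.8 0.4 0.533
```

Here the filter is rarely wrong when it flags an email (precision 0.8) but misses most spam (recall 0.4), and F1 sits closer to the weaker of the two.
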
Q10

What is the ROC curve and AUC?

A

ROC curve plots True Positive Rate vs False Positive Rate at different classification thresholds. AUC (Area Under Curve) summarizes classifier performance: 1.0 is perfect, 0.5 is random guessing; values above roughly 0.7 are often treated as acceptable, though the bar depends on the problem. Higher AUC = better discrimination. Useful for binary classification evaluation.

python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.legend()  # display the AUC label
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
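
One way to build intuition for AUC: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper, and the quadratic loop is only practical for small inputs.

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_scores) if y == 1]
    negatives = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)  # 0.75
```
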
Q11

What is gradient descent?

A

Gradient descent minimizes cost function by iteratively moving in direction of steepest descent (negative gradient). Updates parameters: θ = θ - α * ∇J(θ). α is learning rate. Variants: batch (all data), stochastic (one sample), mini-batch (small subset). Optimizers such as Adam and RMSprop build on it with adaptive per-parameter step sizes.

python
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for i in range(iterations):
        predictions = X.dot(theta)
        error = predictions - y
        gradient = X.T.dot(error) / m
        theta = theta - alpha * gradient
    return theta
Q12

What is the difference between L1 and L2 regularization?

A

L1 (Lasso) adds sum of absolute weights: λΣ|w|, encourages sparsity (zero weights), feature selection. L2 (Ridge) adds sum of squared weights: λΣw², prevents large weights, smoother solutions. Elastic Net combines both. L1 for feature selection, L2 for generalization.

python
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# L1 regularization
lasso = Lasso(alpha=1.0)

# L2 regularization
ridge = Ridge(alpha=1.0)

# Both
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
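
Why L1 produces exact zeros while L2 only shrinks can be seen in a one-weight simplification. Assume a single weight whose unregularized optimum is z, with loss 0.5*(w - z)^2 plus the penalty; this setup and the helper names are illustrative assumptions, not how scikit-learn solves Lasso or Ridge internally.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

unregularized = [3.0, -0.3, 0.1, -2.0]  # hypothetical least-squares weights
lam = 0.5
l1_weights = [l1_shrink(z, lam) for z in unregularized]
l2_weights = [l2_shrink(z, lam) for z in unregularized]
print(l1_weights)  # [2.5, 0.0, 0.0, -1.5]
print(l2_weights)  # [1.5, -0.15, 0.05, -1.0]
```

L1 removes the two small weights entirely, while L2 keeps every weight but scales all of them toward zero.
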
Q13

What is the bias-variance tradeoff?

A

Bias is error from oversimplifying assumptions. Variance is error from sensitivity to small fluctuations. High bias: underfitting. High variance: overfitting. Goal: balance both. Complex models: low bias, high variance. Simple models: high bias, low variance.

python
# High bias (underfitting) - too simple
from sklearn.linear_model import LinearRegression
simple_model = LinearRegression()

# High variance (overfitting) - too complex
from sklearn.tree import DecisionTreeClassifier
complex_model = DecisionTreeClassifier(max_depth=None)

# Balanced
from sklearn.ensemble import RandomForestClassifier
balanced_model = RandomForestClassifier(n_estimators=100, max_depth=10)
Q14

What is feature scaling and why is it important?

A

Feature scaling normalizes features to similar scale. Important because algorithms using distance (k-NN, SVM) or gradient descent are sensitive to scale. Methods: standardization (mean=0, std=1), min-max scaling (0-1), normalization. Tree-based models don't need scaling.

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Min-Max scaling
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
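
The two main formulas can be written out directly. This sketch uses the standard library only and a made-up income column; it assumes the feature is not constant, since both formulas divide by the spread.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 40_000, 60_000, 80_000]  # hypothetical feature on a large scale
z_scores = standardize(incomes)
scaled = min_max_scale(incomes)
print([round(z, 3) for z in z_scores])
print([round(s, 3) for s in scaled])
```
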
Q15

What is feature engineering?

A

Feature engineering creates, transforms, selects features to improve model performance. Includes: scaling, encoding categorical variables, creating polynomial features, handling missing values, feature selection, creating interaction features. Often more important than algorithm choice.

python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Feature selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
Q16

What is the difference between bagging and boosting?

A

Bagging trains models in parallel on different data subsets, averages predictions (e.g., Random Forest). Boosting trains models sequentially, each corrects previous errors (e.g., AdaBoost, XGBoost). Bagging reduces variance, boosting reduces bias. Both improve accuracy.

python
# Bagging
from sklearn.ensemble import RandomForestClassifier
bagging = RandomForestClassifier(n_estimators=100)

# Boosting
from sklearn.ensemble import AdaBoostClassifier
boosting = AdaBoostClassifier(n_estimators=100)
Q17

What is XGBoost?

A

XGBoost (Extreme Gradient Boosting) is optimized gradient boosting implementation. Features: regularization, parallel processing, handles missing values, tree pruning, early stopping. Often wins Kaggle competitions. Fast, accurate, handles large datasets. Popular for tabular data.

python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8
)
model.fit(X_train, y_train)
Q18

What is k-means clustering?

A

K-means partitions data into k clusters. Algorithm: initialize k centroids, assign points to nearest centroid, update centroids, repeat until convergence. Unsupervised learning. Requires specifying k. Sensitive to initialization. Used for customer segmentation, image compression.

python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
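
The assign-and-update loop described above, sketched on 1-D data in plain Python. The function name and the fixed starting centroids are assumptions for illustration; library implementations add smarter initialization (such as k-means++) and multiple restarts.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
final_centroids, final_clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(final_centroids)  # [1.5, 10.0]
```
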
Q19

What is PCA (Principal Component Analysis)?

A

PCA reduces dimensionality by finding principal components (directions of maximum variance). Projects data onto lower-dimensional space. Preserves most variance with fewer dimensions. Unsupervised, linear transformation. Used for visualization, noise reduction, feature extraction.

python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_}")
Q20

What is the curse of dimensionality?

A

Curse of dimensionality: as dimensions increase, data becomes sparse, distances become similar, volume increases exponentially. Makes learning difficult, requires more data. Solutions: dimensionality reduction (PCA, t-SNE), feature selection, regularization, more training data.

python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Dimensionality reduction
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X_high_dimensional)

# Feature selection
selector = SelectKBest(k=20)
X_selected = selector.fit_transform(X, y)
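
The claim that "distances become similar" can be checked with a small simulation. The sketch below draws uniform random points and measures how much farther the farthest point is than the nearest one, relative to the nearest; the function name, point count, and seed are arbitrary assumptions.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

The contrast shrinks as the dimension grows, which is why nearest-neighbor style methods lose discriminating power in very high dimensions.
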
Q21

What is SVM (Support Vector Machine)?

A

SVM finds optimal hyperplane separating classes with maximum margin. Uses support vectors (closest points). Kernel trick handles non-linear data. Types: linear, polynomial, RBF. Good for high-dimensional data. Can be sensitive to feature scaling.

python
from sklearn.svm import SVC

# Linear SVM
linear_svm = SVC(kernel='linear')

# RBF kernel for non-linear
rbf_svm = SVC(kernel='rbf', gamma='scale')

linear_svm.fit(X_train, y_train)
Q22

What is k-NN (k-Nearest Neighbors)?

A

K-NN classifies based on k nearest neighbors. Lazy learner (no training, stores all data). Distance metric (Euclidean, Manhattan). k value important: small k (noisy), large k (smooth). Sensitive to scale, curse of dimensionality. Simple but can be slow for large datasets.

python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
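
What "lazy learner" means in practice: there is no fitting step beyond storing the data, and all computation happens at prediction time. A minimal from-scratch sketch with hypothetical names and toy 2-D points:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the stored training points by Euclidean distance to the query
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, query=(2, 2)))  # a
print(knn_predict(train_X, train_y, query=(7, 9)))  # b
```

Each prediction scans the whole training set, which is the source of the slowness on large datasets mentioned above.
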
Q23

What is Naive Bayes?

A

Naive Bayes is probabilistic classifier based on Bayes theorem with "naive" assumption of feature independence. Fast, simple, works well with small data. Types: Gaussian (continuous), Multinomial (counts), Bernoulli (binary). Good baseline, handles high dimensions.

python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# For count data (features must be non-negative, e.g. word counts)
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
Q24

What is the difference between classification and regression?

A

Classification predicts discrete categories (classes). Regression predicts continuous values. Classification: email spam/not spam, image labels. Regression: house prices, temperature. Different loss functions: cross-entropy for classification, MSE/MAE for regression.

python
# Classification
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)  # y_train: categories

# Regression
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg.fit(X_train, y_train)  # y_train: continuous values
Q25

What is hyperparameter tuning?

A

Hyperparameter tuning finds optimal hyperparameters (not learned, set before training). Methods: grid search (exhaustive), random search (random sampling), Bayesian optimization (efficient). Examples: learning rate, k in k-NN, max_depth in trees, regularization strength.

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
Q26

What is the difference between training, validation, and test sets?

A

Training set: used to train model. Validation set: used to tune hyperparameters, select models, prevent overfitting. Test set: used for final evaluation, never used during training. Typical split: 60% train, 20% validation, 20% test. Test set should be held out completely.

python
from sklearn.model_selection import train_test_split

# First split: train and temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Second split: validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
Q27

What is ensemble learning?

A

Ensemble combines multiple models for better performance. Types: bagging (parallel, e.g., Random Forest), boosting (sequential, e.g., XGBoost), stacking (meta-learner), voting (majority/weighted). Reduces variance, improves accuracy. "Wisdom of crowds" principle.

python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

model1 = RandomForestClassifier()
model2 = LogisticRegression()
ensemble = VotingClassifier(
    estimators=[('rf', model1), ('lr', model2)],
    voting='soft'
)
ensemble.fit(X_train, y_train)
Q28

What is the difference between batch, stochastic, and mini-batch gradient descent?

A

Batch uses all training data per update, stable but slow. Stochastic uses one sample per update, fast but noisy. Mini-batch uses small subset (32-256 samples), balances speed and stability. Mini-batch is most common in practice.

python
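# Mini-batch gradient descent for linear regression (MSE loss)
import numpy as np

def minibatch_gd(X, y, theta, alpha=0.01, epochs=100, batch_size=32):
    for epoch in range(epochs):
        order = np.random.permutation(len(y))  # reshuffle every epoch
        for i in range(0, len(y), batch_size):
            idx = order[i:i + batch_size]
            batch_X, batch_y = X[idx], y[idx]
            # Gradient estimated from the current mini-batch only
            gradient = batch_X.T.dot(batch_X.dot(theta) - batch_y) / len(idx)
            theta = theta - alpha * gradient
    return theta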
Q29

What is feature selection?

A

Feature selection chooses most relevant features, removes irrelevant/redundant ones. Benefits: reduces overfitting, faster training, better interpretability. Methods: filter (correlation, chi-square), wrapper (forward/backward selection), embedded (L1 regularization, tree importance).

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE

# Filter method
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Wrapper method (Recursive Feature Elimination)
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Q30

What is the difference between parametric and non-parametric models?

A

Parametric models have a fixed number of parameters regardless of dataset size (e.g., linear regression, neural networks). In non-parametric models, the effective number of parameters grows with the data (e.g., k-NN, decision trees). Parametric: faster, less data needed. Non-parametric: more flexible, need more data.

python
# Parametric - fixed parameters
from sklearn.linear_model import LinearRegression
parametric = LinearRegression()

# Non-parametric - adapts to data
from sklearn.neighbors import KNeighborsRegressor
non_parametric = KNeighborsRegressor(n_neighbors=5)
Q31

What is the learning rate?

A

Learning rate (α) controls step size in gradient descent. Too high: overshoots minimum, unstable. Too low: slow convergence, may get stuck. Adaptive methods: Adam, RMSprop adjust learning rate. Learning rate scheduling: reduce over time. Critical hyperparameter.

python
# Fixed learning rate
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(learning_rate='constant', eta0=0.01)

# Adaptive learning rate: in scikit-learn, eta0 is kept while the loss
# keeps improving and is divided by 5 when it stops (not Adam/RMSprop)
model = SGDClassifier(learning_rate='adaptive', eta0=0.01)
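
The "too high" and "too low" cases can be seen on a one-parameter toy problem. This sketch runs plain gradient descent on J(theta) = (theta - 3)^2 with three step sizes; the function and the specific values of alpha are assumptions chosen to show each regime.

```python
def minimize_quadratic(alpha, steps=50, theta=0.0):
    """Gradient descent on J(theta) = (theta - 3)**2, whose minimum is at 3."""
    for _ in range(steps):
        gradient = 2 * (theta - 3)
        theta = theta - alpha * gradient
    return theta

for alpha in (0.001, 0.1, 1.1):
    print(alpha, minimize_quadratic(alpha))
```

With alpha = 0.001 the parameter has barely moved after 50 steps, with 0.1 it is essentially at the minimum, and with 1.1 each step overshoots by more than the last, so the iterates diverge.
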
Q32

What is the difference between accuracy, precision, recall, and F1-score?

A

Accuracy: (TP + TN) / total, overall correctness. Precision: TP / (TP + FP), positive prediction accuracy. Recall: TP / (TP + FN), ability to find positives. F1: harmonic mean of precision and recall, balances both. Use based on problem requirements.

python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
Q33

What is confusion matrix?

A

Confusion matrix shows actual vs predicted classifications. 2x2 for binary: TP, TN, FP, FN. Larger for multi-class. From it derive: accuracy, precision, recall, specificity. Visual representation of model performance. Helps identify which classes are confused.

python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Q34

What is the difference between MSE, MAE, and RMSE?

A

MSE (Mean Squared Error): average of squared differences, penalizes large errors. MAE (Mean Absolute Error): average of absolute differences, robust to outliers. RMSE (Root Mean Squared Error): square root of MSE, same units as target. MSE sensitive to outliers.

python
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
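
A small worked comparison shows how differently the metrics react to a single large error. The numbers are made up, and `regression_errors` is a hypothetical helper written in plain Python.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MSE, MAE, RMSE) for two equal-length sequences."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return mse, mae, math.sqrt(mse)

y_true = [10, 12, 14, 16]
small_errors = regression_errors(y_true, [11, 11, 15, 15])  # every prediction off by 1
one_outlier = regression_errors(y_true, [10, 12, 14, 26])   # one prediction off by 10
print(small_errors)  # (1.0, 1.0, 1.0)
print(one_outlier)   # (25.0, 2.5, 5.0)
```

The single outlier multiplies MSE by 25 but MAE by only 2.5, which is the sensitivity difference the answer describes.
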
Q35

What is one-hot encoding?

A

One-hot encoding converts categorical variables to binary vectors. Each category becomes binary feature (1 if category, 0 otherwise). Creates sparse matrix. Alternative: label encoding (for ordinal), target encoding. Used when categories have no inherent order.

python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

encoder = OneHotEncoder(sparse_output=False)  # this argument was named `sparse` before scikit-learn 1.2
encoded = encoder.fit_transform(df[['category']])
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
Q36

What is the difference between imputation and deletion for missing values?

A

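Imputation fills missing values (mean, median, mode, predictive). Preserves data, can introduce bias. Deletion removes rows/columns with missing values. Simple but loses data. Rough rules of thumb: imputation works well when only a small share of values is missing (e.g., <5%); dropping a column is reasonable when most of it is missing (e.g., >50%); dropping rows introduces the least bias when data is MCAR (Missing Completely At Random).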

python
from sklearn.impute import SimpleImputer

# Mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Median imputation
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
Q37

What is stratified sampling?

A

Stratified sampling maintains class distribution in train/test splits. Important for imbalanced datasets. Ensures each split has similar proportion of classes. Prevents one split having all minority class. Use StratifiedKFold for cross-validation with imbalanced data.

python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
Q38

What is the difference between correlation and causation?

A

Correlation measures how strongly two variables move together (Pearson correlation captures only linear relationships) and does not imply causation. Causation means one variable causes change in another. Correlation can be spurious. Establish causation through: randomized experiments, controlled studies, temporal precedence, mechanism. Standard predictive ML finds correlations, not causation.

python
import pandas as pd

# Correlation
correlation = df['feature1'].corr(df['feature2'])

# Correlation matrix
corr_matrix = df.corr(numeric_only=True)  # skip non-numeric columns
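
Pearson correlation only captures linear association, so a coefficient of zero does not mean two variables are unrelated. The sketch below computes it by hand on toy data; `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
linear = pearson(xs, [2 * x + 1 for x in xs])   # perfectly linear relationship
quadratic = pearson(xs, [x ** 2 for x in xs])   # y fully determined by x
print(linear, quadratic)
```

The quadratic case gives a correlation of zero even though y depends entirely on x, so the coefficient is a summary of linear trend and nothing more.
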
Q39

What is the difference between underfitting and overfitting?

A

Underfitting: model too simple, high bias, poor performance on train and test. Overfitting: model too complex, high variance, good on train, poor on test. Balance: model complexity matches data complexity. Use validation set to detect, regularization to prevent.

python
# Underfitting - too simple
from sklearn.linear_model import LinearRegression
simple_model = LinearRegression()

# Overfitting - too complex
from sklearn.tree import DecisionTreeRegressor
complex_model = DecisionTreeRegressor(max_depth=None)

# Balanced
from sklearn.ensemble import RandomForestRegressor
balanced_model = RandomForestRegressor(max_depth=10)
Q40

What is feature importance?

A

Feature importance measures contribution of each feature to model predictions. Tree-based models provide importance (Gini/entropy reduction). Permutation importance: shuffle feature, measure performance drop. Useful for feature selection, model interpretation, understanding data.

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

model = RandomForestClassifier()
model.fit(X_train, y_train)

# Tree-based importance
importance = model.feature_importances_

# Permutation importance
perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=10
)
Q41

What is the difference between batch, stochastic, and mini-batch gradient descent?

A

Batch uses all training data per update, stable but slow, memory intensive. Stochastic uses one sample per update, fast but noisy, may not converge. Mini-batch uses small subset (32-256), balances speed and stability, most common in practice.

python
import numpy as np

# Batch gradient descent
def batch_gd(X, y, theta, alpha, iterations):
    m = len(y)
    for i in range(iterations):
        gradient = X.T.dot(X.dot(theta) - y) / m
        theta = theta - alpha * gradient
    return theta

# Stochastic gradient descent
def sgd(X, y, theta, alpha, iterations):
    for i in range(iterations):
        idx = np.random.randint(len(y))
        gradient = (X[idx].dot(theta) - y[idx]) * X[idx]
        theta = theta - alpha * gradient
    return theta
Q42

What is the difference between classification and regression metrics?

A

Classification metrics: accuracy, precision, recall, F1, ROC-AUC, confusion matrix. Regression metrics: MSE, MAE, RMSE, R², MAPE. Different because classification predicts categories, regression predicts continuous values. Choose metrics based on problem type and business requirements.

python
# Classification metrics
from sklearn.metrics import accuracy_score, f1_score
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression metrics
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
Q43

What is data leakage?

A

Data leakage occurs when training data contains information about target that won't be available at prediction time. Causes: target encoding with future data, including target-derived features, train/test contamination. Results in overly optimistic performance, poor generalization.

python
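# Data leakage example (WRONG)
# Each row's encoding uses targets from the whole dataset, including the
# row itself and rows that will later be used for validation
df['target_mean'] = df.groupby('category')['target'].transform('mean')

# Correct approach: out-of-fold target encoding
from sklearn.model_selection import KFold
df['target_mean'] = float('nan')
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Compute encoding only on the train fold
    fold_means = df.iloc[train_idx].groupby('category')['target'].mean()
    # Apply it to the held-out fold; unseen categories stay NaN
    encoded = df.iloc[val_idx]['category'].map(fold_means)
    df.iloc[val_idx, df.columns.get_loc('target_mean')] = encoded.values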
Q44

What is the difference between holdout and cross-validation?

A

Holdout splits data once into train/test (e.g., 80/20). Simple, fast, but gives a single estimate. Cross-validation splits into k folds and trains and tests k times. More robust estimate, every sample is used for both training and testing. CV preferred for small datasets, model selection.

python
# Holdout
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
Q45

What is the difference between bagging and boosting?

A

Bagging trains models in parallel on bootstrap samples, averages predictions, reduces variance (Random Forest). Boosting trains models sequentially, each corrects previous errors, reduces bias (AdaBoost, XGBoost). Bagging: independent models. Boosting: dependent models.

python
# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100
)

# Boosting
from sklearn.ensemble import AdaBoostClassifier
boosting = AdaBoostClassifier(n_estimators=100)
Q46

What is the difference between supervised, unsupervised, and semi-supervised learning?

A

Supervised: labeled data, learns input-output mapping. Unsupervised: unlabeled data, finds patterns. Semi-supervised: mix of labeled and unlabeled data, uses both. Semi-supervised useful when labels are expensive/scarce but unlabeled data is abundant.

python
# Semi-supervised learning
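import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

base_model = LogisticRegression()
# Passed positionally because the keyword name differs across scikit-learn versions
self_training = SelfTrainingClassifier(base_model)
# -1 marks unlabeled samples
y_semi = np.concatenate([y_labeled, np.full(len(y_unlabeled), -1)])
self_training.fit(X_all, y_semi)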
Q47

What is the difference between parametric and non-parametric models?

A

Parametric models have a fixed number of parameters regardless of dataset size (e.g., linear regression, neural networks). In non-parametric models, the effective number of parameters grows with the data (e.g., k-NN, decision trees). Parametric: faster, less data needed, stronger assumptions about the data. Non-parametric: more flexible, need more data.

python
# Parametric - fixed parameters
from sklearn.linear_model import LinearRegression
parametric = LinearRegression()  # one coefficient per feature plus an intercept

# Non-parametric - adapts to data
from sklearn.neighbors import KNeighborsRegressor
non_parametric = KNeighborsRegressor(n_neighbors=5)  # Stores all data
Q48

What is the difference between imbalanced and balanced datasets?

A

Imbalanced dataset has unequal class distribution (e.g., 99% class A, 1% class B). Balanced has equal distribution. Imbalanced causes models to favor majority class. Solutions: resampling (oversample minority, undersample majority), class weights, SMOTE, different metrics (F1, precision-recall).

python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# SMOTE oversampling (apply to the training split only)
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

# Class weights, keyed by the actual class labels
classes = np.unique(y)
class_weights = compute_class_weight('balanced', classes=classes, y=y)
model = RandomForestClassifier(class_weight=dict(zip(classes, class_weights)))
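
The 'balanced' heuristic weights each class by n_samples / (n_classes * class_count), so rare classes count for more in the loss. A plain-Python sketch of that formula on the 99% / 1% split mentioned above; `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 99 + [1] * 1  # 99% majority class, 1% minority class
weights = balanced_class_weights(labels)
print(weights)
```

Each minority sample ends up weighted about 99 times more than a majority sample, so both classes contribute equally in total.
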
Q49

What is the difference between model accuracy and model performance?

A

Accuracy is specific metric: (TP + TN) / total. Performance is broader term encompassing all evaluation metrics (accuracy, precision, recall, F1, AUC, etc.). Performance depends on problem: accuracy for balanced data, precision/recall for imbalanced, AUC for ranking.

python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Accuracy
accuracy = accuracy_score(y_true, y_pred)

# Overall performance
performance = {
    'accuracy': accuracy_score(y_true, y_pred),
    'precision': precision_score(y_true, y_pred),
    'recall': recall_score(y_true, y_pred),
    'f1': f1_score(y_true, y_pred),
    'auc': roc_auc_score(y_true, y_scores)
}
Q50

What is the difference between training loss and validation loss?

A

Training loss measures error on training data. Validation loss measures error on validation set. Training loss is usually somewhat lower than validation loss; a large or widening gap indicates overfitting. Both decreasing: good. Training decreasing, validation increasing: overfitting. Use validation loss for early stopping, model selection.

python
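# Track losses during training
# (assumes a Keras-style model compiled without extra metrics,
# so train_on_batch and evaluate each return a single loss value)
train_losses = []
val_losses = []
best_val_loss = float('inf')
patience = 5  # assumed value: epochs to wait without improvement
wait = 0

for epoch in range(epochs):
    train_loss = model.train_on_batch(X_train, y_train)
    val_loss = model.evaluate(X_val, y_val, verbose=0)

    train_losses.append(train_loss)
    val_losses.append(val_loss)

    # Early stopping: stop once validation loss has not improved
    # for `patience` consecutive epochs
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break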