Question 1

What is Machine Learning?

Accepted Answer

Machine learning is a subset of AI that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms to identify patterns in data and to make predictions or decisions. The three main types are supervised, unsupervised, and reinforcement learning.

Question 2

What is the difference between supervised and unsupervised learning?

Accepted Answer

Supervised learning trains models on labeled data (input-output pairs). Unsupervised learning finds patterns in unlabeled data. Typical supervised tasks: classification and regression. Typical unsupervised tasks: clustering, dimensionality reduction, and association rule mining.

Question 3

What is linear regression?

Accepted Answer

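Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation. With a single feature this is y = mx + b; with several features it generalizes to y = b + m1x1 + m2x2 + ... + mnxn. It finds the best-fit line (or hyperplane) by minimizing the sum of squared errors. Used for continuous predictions. Assumes a linear relationship between the features and the target.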
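
As a rough illustration of "best-fit line minimizing sum of squared errors", here is a minimal sketch for the single-feature case. The function name fit_line and the sample data are made up for this example; the slope and intercept come from the standard closed-form least-squares solution.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx  # assumes the x values are not all identical
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```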

Question 4

What is logistic regression?

Accepted Answer

Logistic regression is a classification algorithm that predicts a class probability by passing a linear combination of the features through the sigmoid function, so its outputs lie between 0 and 1. The linear part models the log-odds (logit). For binary classification, the class is assigned by comparing the probability to a threshold (usually 0.5). It can be extended to multi-class problems.
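
A small sketch of the prediction step only (not training), assuming the weights and bias have already been learned. The names sigmoid and predict and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, class label) for one sample."""
    # z is the log-odds: a linear combination of the features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)


# Hypothetical, already-trained parameters.
weights = [0.5, -0.25]
bias = 0.0
print(predict([4.0, 0.0], weights, bias))  # log-odds 2.0, class 1
print(predict([0.0, 4.0], weights, bias))  # log-odds -1.0, class 0
```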

Question 5

What is a decision tree?

Accepted Answer

A decision tree makes decisions by repeatedly splitting the data on feature values. Tree structure: root, internal nodes (decisions), leaves (outcomes). Splits are chosen using a criterion such as information gain or Gini impurity. Easy to interpret but prone to overfitting. Decision trees are the basis for random forests.
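
To make the Gini criterion concrete, here is a minimal sketch that scores a candidate split. The helper names gini and split_gini are illustrative; a real tree learner would evaluate many candidate splits and keep the one with the lowest weighted impurity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


print(gini(["a", "a", "b", "b"]))            # 0.5 (evenly mixed node)
print(split_gini(["a", "a"], ["b", "b"]))    # 0.0 (perfectly pure split)
```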

Question 6

What is a random forest?

Accepted Answer

A random forest is an ensemble method that combines multiple decision trees. Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of the features. Predictions are averaged (regression) or voted (classification). It reduces overfitting compared with a single tree, handles non-linearity, and provides feature importance scores.

Question 7

What is overfitting and how do you prevent it?

Accepted Answer

Overfitting occurs when a model learns the training data too well, including its noise, and therefore performs poorly on new data. Prevent it with: more training data, regularization (L1/L2), early stopping, dropout, feature selection, ensemble methods, and reducing model complexity. Cross-validation helps detect it and guides these choices.

Question 8

What is cross-validation?

Accepted Answer

Cross-validation splits the data into k folds, trains on k-1 folds, tests on the remaining fold, and repeats k times so that each fold serves as the test set once. Averaging the k scores gives a more reliable estimate of model performance than a single train/test split. Common variants: k-fold (k=5 or 10), stratified k-fold, leave-one-out, and time series CV.
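
The fold rotation can be sketched as an index generator. This is a simplified illustration: the name k_fold_indices is made up, and the sketch does not shuffle or stratify, which practical implementations usually offer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 5):
    # Each sample index appears in exactly one test fold.
    print("train:", train, "test:", test)
```

In use, a model would be fit on the train indices and scored on the test indices inside the loop, and the k scores averaged.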

Question 9

What is the difference between precision and recall?

Accepted Answer

Precision = TP / (TP + FP): the fraction of positive predictions that are correct. Recall = TP / (TP + FN): the fraction of actual positives that are found. High precision means few false positives; high recall means few false negatives. The F1-score balances both as their harmonic mean: 2 * (precision * recall) / (precision + recall).
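
A worked example of these formulas, as a minimal sketch for binary labels encoded as 0/1. The function name and the sample labels are invented for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: TP = 2, FP = 1, FN = 1.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # each value is about 0.667
```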

Question 10

What is the ROC curve and AUC?

Accepted Answer

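The ROC curve plots True Positive Rate against False Positive Rate at different classification thresholds. AUC (Area Under the Curve) summarizes classifier performance: 1.0 is perfect and 0.5 is random; values above 0.7 are often treated as good, though that is a rule of thumb that depends on the problem. Higher AUC means better discrimination between the classes. Useful for binary classification evaluation.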
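
One way to see what AUC measures: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it by comparing every positive-negative pair, which is simple but quadratic; the function name auc and the scores are illustrative.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    # Assumes at least one positive and one negative sample.
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```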

Question 11

What is gradient descent?

Accepted Answer

Gradient descent minimizes a cost function by iteratively moving in the direction of steepest descent (the negative gradient). Parameter update: θ = θ - α * ∇J(θ), where α is the learning rate. Variants differ in how much data each update uses: batch (all data), stochastic (one sample), mini-batch (small subset). Optimizers such as Adam and RMSprop build on it with adaptive learning rates.
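
The update rule can be sketched for a single parameter. The cost function J(θ) = (θ - 3)² is a toy example chosen here because its minimum (θ = 3) is known; the function name and default values are assumptions for illustration.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeatedly apply theta = theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


# Toy cost J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)  # very close to 3.0
```

With this cost, a learning rate above 1.0 would make the iterates diverge, which is why α usually needs tuning.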

Question 12

What is the difference between L1 and L2 regularization?

Accepted Answer

L1 (Lasso) adds the sum of absolute weights, λΣ|w|, to the loss. It encourages sparsity (weights driven to exactly zero), so it acts as feature selection. L2 (Ridge) adds the sum of squared weights, λΣw², which shrinks large weights and gives smoother solutions. Elastic Net combines both. Use L1 when feature selection is wanted and L2 for general weight shrinkage.
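
The two penalty terms can be computed directly. This is only the penalty, not a full training loop; the function names, the weights, and λ = 0.1 are illustrative values.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0]  # hypothetical model weights
lam = 0.1                   # hypothetical regularization strength
# The large weight (-2.0) dominates the L2 penalty because it is squared.
print(l1_penalty(weights, lam), l2_penalty(weights, lam))
```

During training, the chosen penalty is added to the data loss, so the optimizer trades fit against weight size.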

Question 13

What is the bias-variance tradeoff?

Accepted Answer

Bias is error from oversimplifying assumptions in the model. Variance is error from sensitivity to small fluctuations in the training data. High bias leads to underfitting; high variance leads to overfitting. The goal is to balance both. Complex models tend to have low bias and high variance; simple models tend to have high bias and low variance.

Question 14

What is feature scaling and why is it important?

Accepted Answer

Feature scaling brings features to a similar scale. It matters because algorithms that use distances (k-NN, SVM) or gradient descent are sensitive to feature scale. Methods: standardization (mean=0, std=1), min-max scaling (range 0-1), and normalization (scaling each sample to unit norm). Tree-based models generally don't need scaling.
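
A minimal sketch of the two most common methods for one feature column. The function names are illustrative, and both assume the column is not constant (otherwise the divisor is zero).

```python
def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


column = [10.0, 20.0, 30.0, 40.0]  # hypothetical feature values
print(standardize(column))
print(min_max_scale(column))
```

In practice the mean, std, min, and max are computed on the training set only and then reused to transform validation and test data.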

Question 15

What is feature engineering?

Accepted Answer

Feature engineering creates, transforms, and selects features to improve model performance. It includes scaling, encoding categorical variables, creating polynomial and interaction features, handling missing values, and feature selection. It often matters more than the choice of algorithm.

Question 16

What is the difference between bagging and boosting?

Accepted Answer

Bagging trains models independently (so they can run in parallel) on bootstrap samples of the data and averages or votes their predictions (e.g., Random Forest). Boosting trains models sequentially, with each one correcting the errors of the previous ones (e.g., AdaBoost, XGBoost). Bagging mainly reduces variance; boosting mainly reduces bias. Both can improve accuracy over a single model.
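
The two ingredients of bagging for classification, bootstrap sampling and majority voting, can be sketched as below. The helper names are hypothetical and no actual model is trained here; the predictions list stands in for the outputs of several independently trained models.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common prediction among the models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # seeded so the sketch is repeatable
data = list(range(10))
# Each model in the ensemble would be trained on its own bootstrap sample.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
print(samples)

# Stand-in predictions from three hypothetical models for one input.
print(majority_vote(["cat", "dog", "cat"]))  # cat
```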

Question 17

What is XGBoost?

Accepted Answer

XGBoost (Extreme Gradient Boosting) is an optimized implementation of gradient boosting. Features: regularization, parallel processing, built-in handling of missing values, tree pruning, early stopping. It has been used in many winning Kaggle solutions. Fast, accurate, handles large datasets. Popular for tabular data.

Question 18

What is k-means clustering?

Accepted Answer

K-means partitions data into k clusters. Algorithm: initialize k centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing. It is an unsupervised method. It requires specifying k in advance and is sensitive to initialization. Used for customer segmentation and image compression.
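
The assign-and-update loop can be sketched in one dimension. To keep the example deterministic, the initial centroids are passed in explicitly rather than chosen at random; the function name and data are illustrative.

```python
def k_means_1d(points, centroids, max_iters=100):
    """Cluster 1-D points; returns (final centroids, clusters)."""
    clusters = []
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

A poor starting choice can converge to a worse partition, which is why implementations commonly run several initializations and keep the best result.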

Question 19

What is PCA (Principal Component Analysis)?

Accepted Answer

PCA reduces dimensionality by finding principal components (orthogonal directions of maximum variance) and projecting the data onto the leading ones. This preserves most of the variance with fewer dimensions. It is an unsupervised, linear transformation. Used for visualization, noise reduction, and feature extraction.
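
For intuition, the first principal component of 2-D data can be found as the dominant eigenvector of the covariance matrix. The sketch below uses power iteration, a swapped-in technique chosen here for brevity (libraries typically use SVD or an eigen-decomposition). The function name and data are hypothetical, and the fixed starting vector assumes the x-coordinates are not all equal.

```python
def first_principal_component(points, iters=100):
    """Unit vector along the direction of maximum variance (2-D data)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    v = (1.0, 0.0)  # starting guess; must not be orthogonal to the answer
    for _ in range(iters):
        wx = cxx * v[0] + cxy * v[1]
        wy = cxy * v[0] + cyy * v[1]
        norm = (wx * wx + wy * wy) ** 0.5
        v = (wx / norm, wy / norm)
    return v


# Hypothetical points on the line y = x: all variance lies along (1, 1).
pc = first_principal_component([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(pc)  # both entries are about 0.707
```

Projecting each centered point onto this vector (a dot product) gives its one-dimensional representation.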

Question 20

What is the curse of dimensionality?

Accepted Answer

Curse of dimensionality: as the number of dimensions increases, data becomes sparse, distances between points become similar, and the volume of the space grows exponentially. This makes learning difficult and requires more data. Mitigations: dimensionality reduction (PCA; t-SNE mainly for visualization), feature selection, regularization, and more training data.

Machine Learning Interview Questions - ML Interview Prep
