Top 28 Machine Learning Interview Questions (2026)

Machine learning questions dominate technical rounds at Meta, Google, Amazon, Apple, Netflix, and Spotify because building production ML systems requires deep expertise across modeling, evaluation, and deployment. Unlike software engineering roles where coding skills can be isolated, ML interviews test your ability to connect business problems to statistical solutions, debug model failures in production, and design experiments that actually move metrics.

What trips up most candidates is the gap between academic ML and production reality. You might know that Random Forest reduces overfitting through bagging, but can you explain why your click-through rate model started failing after Black Friday when feature distributions shifted? Can you design an evaluation framework for a recommendation system where user behavior changes based on the recommendations themselves? These are the scenarios that separate strong candidates from those who memorized textbook definitions.

Here are the top 28 machine learning questions organized by the core competencies that matter most in production ML roles.

Supervised Learning Fundamentals

Interviewers focus on supervised learning fundamentals because they reveal whether you can translate messy business problems into clean modeling setups. Most candidates stumble when asked to define labels precisely or choose appropriate baselines, especially when the business objective doesn't map cleanly to standard loss functions.

The key insight that separates strong answers is understanding that label definition drives everything downstream. If you're predicting user churn at Spotify, does a user who cancels after their trial ends count as churn or natural conversion failure? This choice affects your positive rate, class balance, and ultimately which model architecture makes sense.

Start by proving you can frame real product problems as prediction tasks, choose appropriate targets, and justify baseline models. You often get tripped up when asked to connect assumptions like linearity and independence to concrete feature, label, and data quality choices.

At Spotify, you are asked to reduce user churn for Premium trials. Frame this as a supervised learning problem, define the label precisely, and choose a baseline model that is hard to beat in the first week of iteration.

Spotify | Medium | Supervised Learning Fundamentals

Sample Answer

Most candidates default to predicting churn probability with a complex model, but that fails here because the label and leakage risks are not nailed down yet. Define churn as a time-bound event, for example a cancellation within 14 days after the trial ends, and restrict features to those available before the prediction timestamp. Start with simple baselines: a calibrated prior (the base churn rate, stratified by segment if needed), a rule based on tenure and recent activity, and logistic regression. These set a strong bar and surface label and feature bugs early. If your baseline cannot be explained, you cannot tell whether lift comes from modeling or from leakage.
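
To make the time-bound label concrete, here is a minimal sketch. The function name, the `as_of` argument, and the choice to count cancellations during the trial as churn are illustrative assumptions, not part of the original answer; the 14-day window comes from the example above.

```python
from datetime import date, timedelta


def trial_churn_label(trial_end, cancel_date, as_of, window_days=14):
    """Return 1 (churned), 0 (retained), or None (label not resolved yet).

    Assumption: any cancellation up to `window_days` after the trial ends
    counts as churn, including cancellations during the trial itself.
    """
    window_end = trial_end + timedelta(days=window_days)
    if cancel_date is not None and cancel_date <= window_end:
        return 1
    if as_of < window_end:
        # The window is still open, so the outcome is unknown.
        # Returning 0 here would silently turn unknowns into negatives.
        return None
    return 0
```

Rows that come back as `None` would be held out of training until their window closes.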

At Amazon, you are building a model to predict whether a product will be returned. How do you pick the target, handle delayed labels, and choose a loss or metric that matches business cost?

Amazon | Hard | Supervised Learning Fundamentals

Sample Answer

Define the target as a return within a fixed window after delivery, and train with a time-based cutoff so you only use examples whose label has had time to resolve. Delayed labels mean you either wait for maturity or use techniques like censoring-aware labeling, but you must not treat unknowns as negatives. Use a cost-sensitive objective, for example weighted log loss where the weights on false positives and false negatives reflect downstream cost, or optimize expected cost: $$\mathbb{E}[c_{FP}\,\mathbf{1}[\hat y=1,y=0] + c_{FN}\,\mathbf{1}[\hat y=0,y=1]].$$ Calibrate probabilities so the threshold can be retuned as business tradeoffs change.
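
The expected-cost objective can be evaluated directly on a validation set. The sketch below is illustrative: the function names and the threshold grid are assumptions. It also includes the standard closed form for calibrated probabilities: flagging is cheaper than not flagging when $p \cdot c_{FN} > (1-p) \cdot c_{FP}$, which gives a threshold of $c_{FP} / (c_{FP} + c_{FN})$.

```python
def expected_cost(y_true, p_hat, threshold, c_fp, c_fn):
    """Average cost per example when predicting 1 for p_hat >= threshold."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            total += c_fp  # false positive
        elif pred == 0 and y == 1:
            total += c_fn  # false negative
    return total / len(y_true)


def best_threshold(y_true, p_hat, c_fp, c_fn, grid=None):
    """Grid-search the threshold with the lowest empirical expected cost."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: expected_cost(y_true, p_hat, t, c_fp, c_fn))


def bayes_threshold(c_fp, c_fn):
    """Cost-optimal threshold, valid only if probabilities are calibrated."""
    return c_fp / (c_fp + c_fn)
```

If the empirical and closed-form thresholds disagree substantially, that is one hint that the scores are not well calibrated.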

At Netflix, you want to predict whether a user will watch a recommended title within 24 hours of seeing it. Would you model this as a classification problem or a regression problem, and what baseline would you compare against?

Netflix | Medium | Supervised Learning Fundamentals

Sample Answer

You could do classification on $y \in \{0,1\}$ for watch within 24 hours, or regression on watch time, for example minutes watched, conditioned on exposure. Classification wins here because the product question is binary and exposure creates a clear prediction timestamp, while regression mixes intent with duration noise. A strong baseline is a per user and per title popularity model, for example $\hat p = \sigma(b_u + b_i)$, which often beats naive global averages. If your model cannot beat user and item bias, your features or labeling are likely wrong.
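
As a rough illustration of the $\hat p = \sigma(b_u + b_i)$ baseline, the sketch below fits one bias per user and one per title with plain SGD on log loss. The function names, learning rate, epoch count, and L2 strength are illustrative assumptions; a production baseline would more likely come from an existing logistic regression implementation.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_bias_baseline(rows, epochs=200, lr=0.1, l2=0.01):
    """rows: list of (user_id, title_id, watched) with watched in {0, 1}."""
    b_u = defaultdict(float)  # per-user bias
    b_i = defaultdict(float)  # per-title bias
    for _ in range(epochs):
        for u, i, y in rows:
            # Gradient of log loss with respect to the logit b_u + b_i.
            err = sigmoid(b_u[u] + b_i[i]) - y
            b_u[u] -= lr * (err + l2 * b_u[u])
            b_i[i] -= lr * (err + l2 * b_i[i])
    return b_u, b_i


def predict(b_u, b_i, user, title):
    """Unseen users or titles fall back to a bias of 0."""
    return sigmoid(b_u.get(user, 0.0) + b_i.get(title, 0.0))
```

A richer model is then judged by how much it improves on these two lookup tables.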

At Meta, you are predicting click through rate for a new ad format. An interviewer asks you to connect the linearity and independence assumptions of logistic regression to concrete feature and data choices. What do you say?

Meta | Hard | Supervised Learning Fundamentals

Sample Answer

First, note that logistic regression assumes the log odds are approximately linear in the features, so choose features that make effects close to additive: log transforms for spend, bucketed counts, and explicit interaction terms for known non-linearities. Next, independence is not about features being independent of each other; it is about examples. Avoid leakage across repeated impressions by grouping by user or session in splits, and use validation that respects time and user identity. Then check for correlated rows, like many impressions from one user, and mitigate with downsampling, per-user weighting, or clustered standard errors if doing inference. Finally, tie this back to data quality: if labels are noisy due to delayed attribution, a linear model will look stable while learning a biased relationship, so validate attribution windows and missingness explicitly.
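
One simple way to keep all impressions from a user in the same split is to assign the split from a stable hash of the user ID. This is a minimal sketch: the function name and the 80/10/10 proportions are illustrative assumptions.

```python
import hashlib


def split_for_user(user_id, valid_pct=10, test_pct=10):
    """Map a user to 'train', 'valid', or 'test' deterministically.

    Every row for the same user lands in the same split, so repeated
    impressions cannot leak across the train/validation boundary.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

This handles user identity only; respecting time as well would mean combining it with a date cutoff.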

At Google, you need to predict which search queries will be reformulated by a user within 60 seconds. Describe how you would create training data without label leakage, including the timestamp you would attach to each example.

Google | Medium | Supervised Learning Fundamentals

At Airbnb, you are asked to predict booking conversion for a listing, but the data has strong seasonality and the inventory changes daily. What baseline model and validation scheme do you propose, and how do you justify them to a product manager?

Airbnb | Medium | Supervised Learning Fundamentals

Model Evaluation and Experiment Design

Evaluation and experiment design questions test your ability to connect offline metrics to online business outcomes, which is where most production ML systems succeed or fail. Candidates often know the formulas for precision and recall but can't explain why their fraud model with 0.95 ROC AUC still floods operations with false alarms.

The critical mistake here is treating offline evaluation as the end goal rather than a proxy for online performance. Your ranking model might achieve higher AUC on last week's data, but if the test set doesn't reflect the temporal patterns users actually experience, you're optimizing for a metric that doesn't predict success.

In interviews, you are tested on whether you can pick metrics, build reliable validation schemes, and avoid leakage under realistic constraints. Many candidates struggle to explain tradeoffs like PR AUC vs ROC AUC, offline metrics vs online A/B tests, and how to debug metric regressions.

You are building a fraud model where only 0.2% of transactions are fraudulent, and operations can review at most 500 alerts per day. Which evaluation metric(s) do you choose, and how do you set the decision threshold for launch?

Amazon | Medium | Model Evaluation and Experiment Design

Sample Answer

Use precision at a fixed review capacity (for example Precision@500/day) plus recall at that operating point, not ROC AUC. ROC AUC can look great even when your precision at the top of the list is unusable under extreme class imbalance. Set the threshold by sorting scores and picking the cutoff that yields 500 alerts per day on recent validation data, then report the resulting precision and recall with confidence intervals. Calibrate scores if you need stable thresholds across time, but tie the KPI to the business constraint.
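
The capacity-based threshold can be sketched as follows. The function names are illustrative assumptions, and `scores` is assumed to cover `n_days` of recent validation traffic so that the alert budget is `alerts_per_day * n_days`.

```python
def threshold_for_capacity(scores, n_days, alerts_per_day=500):
    """Pick the score cutoff that yields about alerts_per_day alerts per day.

    Note: tied scores at the cutoff can push the alert count slightly
    above the budget.
    """
    budget = alerts_per_day * n_days
    ranked = sorted(scores, reverse=True)
    if budget >= len(ranked):
        return ranked[-1]  # capacity exceeds volume: everything is reviewed
    return ranked[budget - 1]


def precision_recall(y_true, scores, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    flagged = sum(1 for s in scores if s >= threshold)
    positives = sum(y_true)
    precision = tp / flagged if flagged else 0.0
    recall = tp / positives if positives else 0.0
    return precision, recall
```

Reporting both numbers at this cutoff ties the metric to what operations can actually review.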

Your team tracks offline PR AUC for a ranking model, but an online A/B test shows a significant drop in user engagement even though PR AUC improved. What do you do next to diagnose the mismatch?

Netflix | Hard | Model Evaluation and Experiment Design

Sample Answer

You could treat the offline metric as wrong and rerun the A/B test, or you could treat the online result as real and audit the offline setup. The audit path wins here because offline improvements often come from label or exposure shifts, leakage, or optimizing a surrogate that is misaligned with engagement. Check that offline evaluation replays the same candidate generation, filtering, and position bias as production, and segment by the traffic slices where the A/B test regressed. Then align metrics, for example by using counterfactual estimators like inverse propensity scoring (IPS) or by optimizing for calibrated expected value, and only rerun online once you have a falsifiable hypothesis.
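
For readers unfamiliar with IPS, here is a minimal sketch of the basic inverse propensity scoring estimator. The log format and function name are illustrative assumptions: each log row is assumed to record the action the production policy took, the probability with which it took that action, the observed reward, and the action the new policy would have chosen in the same context.

```python
def ips_estimate(logs):
    """Estimate the new policy's average reward from logged data.

    logs: iterable of (logged_action, propensity, reward, new_action),
    where propensity > 0 is the logging policy's probability of
    choosing logged_action.
    """
    total, n = 0.0, 0
    for logged_action, propensity, reward, new_action in logs:
        n += 1
        if new_action == logged_action:
            # Reweight matching rows by how unlikely the logged action was.
            total += reward / propensity
    return total / n
```

The estimate becomes noisy when propensities are small, which is why clipped or self-normalized variants are commonly used in practice.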

You are predicting next-day churn for a subscription app, and you have event logs, support tickets, and marketing touches. How do you design a train, validation, and test split to avoid leakage while still using as much data as possible?

Meta | Medium | Model Evaluation and Experiment Design

Sample Answer

First, define the prediction time $t$ and the label window, for example churn in $(t, t+1]$, then only use features available at or before $t$. Next, split by time, not random, so training is earlier weeks, validation is a later block, and test is the most recent block, which matches deployment and prevents future info leaking backward. If users appear multiple times, group by user within each time block or use a rolling origin evaluation so the model never trains on a later snapshot of the same user than it is evaluated on. Finally, confirm leakage by checking suspiciously high feature importance for post-outcome events like cancellation confirmations or retention offers sent after churn.
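
The time-based split can be sketched in a few lines. The row format (a dict with a `prediction_date` key) and the function name are illustrative assumptions.

```python
from datetime import date


def time_split(rows, train_end, valid_end):
    """Split rows by prediction date.

    train: prediction_date <  train_end
    valid: train_end <= prediction_date < valid_end
    test:  prediction_date >= valid_end
    """
    train, valid, test = [], [], []
    for row in rows:
        t = row["prediction_date"]
        if t < train_end:
            train.append(row)
        elif t < valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test
```

Because the label window extends past $t$, it may also be worth dropping training rows whose label window crosses `train_end`, so that no training label is resolved with information from the validation period.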

A model’s offline AUC dropped from 0.82 to 0.78 after a feature pipeline change, but online metrics are flat. How do you debug whether this is a real regression or an evaluation artifact?

Microsoft | Medium | Model Evaluation and Experiment Design

Sample Answer

This question is checking whether you can separate model quality from measurement quality under production constraints. You should first verify evaluation parity: same dataset version, same label definition, same joins, and identical filtering, because small pipeline shifts often change the evaluation population. Then run a backtest with the old model scored through the new pipeline and the new model scored through the old pipeline to isolate whether features or scoring logic changed. If the drop only appears on one split, check for temporal drift or a label delay issue, and use bootstrap CIs to see if 0.04 is statistically meaningful.
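
One way to check whether an AUC gap is meaningful is a paired bootstrap over the evaluation set. The sketch below is illustrative: the function names, the 95% percentile interval, and the default number of resamples are assumptions, and the pairwise AUC is quadratic in the number of examples, so it only suits small samples.

```python
import random


def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff_ci(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """95% percentile interval for AUC(b) - AUC(a) on paired resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined without both classes
        a = auc(y, [scores_a[i] for i in idx])
        b = auc(y, [scores_b[i] for i in idx])
        diffs.append(b - a)
    if not diffs:
        raise ValueError("no resample contained both classes")
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lo, hi
```

If the interval for the difference comfortably contains zero, the offline drop is weak evidence of a real regression.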

You are launching a search ranking change and the primary metric is click-through rate, but you worry about position bias and novelty effects. How do you design the online experiment and guardrail metrics?

Google | Hard | Model Evaluation and Experiment Design

Your training data includes a feature called "is_premium" that is set by a downstream system after a user converts, and the model is used to predict conversion. What tests do you run to detect leakage and what is your remediation plan?

Airbnb | Easy | Model Evaluation and Experiment Design

Regularization, Overfitting, and Feature Engineering

Questions about regularization and feature engineering probe whether you understand the bias-variance tradeoff in practice, not just in theory. When your gradient boosting model overfits, do you reach for early stopping, L2 regularization, or feature selection first? The wrong choice wastes weeks of iteration.

Smart candidates recognize that overfitting often comes from features, not just model complexity. If your click-through rate model memorizes campaign IDs that only appear in training data, adding dropout won't help. You need to either regularize those high-cardinality features specifically or engineer them differently from the start.

You will be asked to diagnose overfitting and propose fixes that make sense for the data, the model family, and deployment constraints. A common failure mode is listing techniques without explaining which signal they preserve or destroy, and how you would verify the fix with ablations.

You train a gradient boosted trees model for click through rate prediction and see training AUC 0.93 but validation AUC 0.78, with the biggest gap on rare categorical IDs like campaign_id. What do you change first, regularization settings or feature engineering, and how do you verify you did not kill real signal?

Meta | Medium | Regularization, Overfitting, and Feature Engineering

Sample Answer

You could apply heavier tree regularization or change how you represent those high-cardinality IDs. Regularization goes first because it is the fastest, least invasive way to reduce memorization: reduce `max_depth`, increase `min_child_weight`, add row and column subsampling, and strengthen the L2 penalty on leaf weights. Then run feature-level ablations: drop `campaign_id` entirely, and compare against target encoding with K-fold smoothing to see whether you recover validation AUC without the gap. Confirm with a time-based split and a leakage check, because rare IDs often proxy the label through logging artifacts.

A logistic regression model for spam detection performs well offline but degrades in production after a week. You suspect overfitting to short-lived tokens and distribution shift. What regularization and feature engineering changes do you make, and how do you validate them before redeploying?

Google | Hard | Regularization, Overfitting, and Feature Engineering

Sample Answer

First, check whether the offline split matched production. If you used a random split, redo it as a time split to reproduce the drop. Next, inspect feature drift, especially token frequency and new-token rate, and quantify it with the population stability index (PSI) or KL divergence on hashed buckets. Then add stronger $L_1$ or elastic net regularization to prune brittle token weights, cap the vocabulary by minimum document frequency, and consider feature hashing with a stable hash to reduce churn. Finally, validate with rolling-window backtests, ablate token features versus meta features, and monitor calibration and precision at fixed recall to ensure the fix addresses the real failure mode.
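
The drift check can be sketched with the usual PSI formula, $\sum_b (a_b - e_b)\ln(a_b / e_b)$, where $e_b$ and $a_b$ are the shares of bucket $b$ in the reference and current data. The function name and the `eps` floor for empty buckets are illustrative assumptions.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two bucketed distributions.

    expected_counts: bucket counts from the reference (training) period
    actual_counts:   bucket counts from the current (production) period
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / e_total, eps)  # floor avoids log(0) on empty buckets
        pa = max(a / a_total, eps)
        total += (pa - pe) * math.log(pa / pe)
    return total
```

For token features, the buckets could be hashed token IDs, with counts taken over the training week and the most recent production week.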

You build a deep model for demand forecasting with many static store features and lag features. Training loss keeps dropping but validation worsens after epoch 5. What do you do, and how do you prove which change helped: early stopping, dropout, weight decay, or changing the lag feature set?

Amazon | Medium | Regularization, Overfitting, and Feature Engineering

Sample Answer

This question is checking whether you can tie an overfitting symptom to targeted interventions and validate with clean ablations. Start by fixing the evaluation: use a strict time split per store and a metric aligned to the business. Then apply early stopping as the baseline, since it directly matches the observed divergence. Next, run controlled ablations in sequence (early stopping only, then add weight decay, then dropout, then adjust lag features), keeping everything else fixed and logging validation error by forecast horizon. Pick the winner based on improved validation across horizons and stability across folds, not just a single best epoch.

At serving time you have a 10 ms latency budget, but your model overfits when you add dozens of interaction features. How would you redesign the feature set to reduce overfitting while staying within latency, and what experiments would you run to confirm you did not remove essential signal?

Netflix | Medium | Regularization, Overfitting, and Feature Engineering

You see a large generalization gap on a small tabular dataset with 200 features and only 5,000 labeled examples. Compare $L_1$, $L_2$, elastic net, and feature selection using mutual information or permutation importance. Which would you choose under strict interpretability requirements, and how would you defend it with ablations?

Microsoft | Easy | Regularization, Overfitting, and Feature Engineering

Tree Based Models and Boosting

Tree-based model questions dominate ML interviews because trees handle tabular data better than neural networks in most production settings, yet many candidates only know the scikit-learn defaults. Can you tune a Random Forest to reduce both overfitting and inference latency when you're serving millions of predictions per day?

The insight that matters most is understanding how tree ensemble hyperparameters interact with your data characteristics. Increasing max_depth might improve training accuracy, but if you have high-cardinality categorical features, you're probably just memorizing rare category combinations that won't generalize.

Expect questions that probe how you tune and interpret decision trees, random forests, and gradient boosted trees for tabular data at scale. Candidates often stumble on explaining bias variance behavior, handling categorical features and missingness, and why boosting can overfit without careful early stopping.

You trained a CART decision tree on 10 million rows of tabular data and it overfits badly. What hyperparameters do you change first, and how do those changes affect bias and variance?

Google | Medium | Tree Based Models and Boosting

Sample Answer

Reason through it: a deep tree with tiny leaves drives training error toward zero, so variance explodes. First, cap growth with `max_depth`, raise `min_samples_leaf` (`min_data_in_leaf` in LightGBM), and possibly raise `min_samples_split`. Each change forces more averaging per leaf, which increases bias and reduces variance. Then tune split-level regularization such as `min_impurity_decrease` or the cost-complexity pruning parameter $\alpha$ to avoid low-value splits. Finally, validate with time-aware or group-aware CV if needed, because the apparent overfit might be leakage rather than pure model capacity.
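
For reference, the role of $\alpha$ in cost-complexity pruning is usually written as a penalized risk. The symbols $R(T)$ and $|\tilde T|$ are notation introduced here for illustration, not taken from the answer above:

$$R_\alpha(T) = R(T) + \alpha\,|\tilde T|,$$

where $R(T)$ is the training error of tree $T$ summed over its leaves and $|\tilde T|$ is the number of leaves. At $\alpha = 0$ the full tree minimizes $R_\alpha$. As $\alpha$ grows, each extra leaf has to buy at least $\alpha$ of error reduction to survive pruning, so the tree shrinks, bias rises, and variance falls.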

In a random forest for tabular classification, training AUC is high but validation AUC is flat and the model is slow. Which levers do you pull to improve generalization and latency, and what tradeoffs do you expect?

Amazon | Medium | Tree Based Models and Boosting

Sample Answer

This question is checking whether you can control both variance and compute in an ensemble without guessing. You reduce per tree cost by limiting `max_depth`, using larger `min_samples_leaf`, and subsampling features via `max_features`, which also decorrelates trees and can improve validation. You tune `n_estimators` until OOB or validation stabilizes, since more trees mostly reduce variance but increase latency and memory. If latency is the bottleneck, you can distill the forest or export a smaller forest, accepting a small AUC drop for predictable inference time.

You are using gradient boosted trees on a dataset with high cardinality categorical features and some missing values. How do you handle encoding and missingness, and when is target encoding risky?

Meta | Hard | Tree Based Models and Boosting

Sample Answer

The standard move is to use native categorical handling if your GBDT implementation supports it, or use target encoding with strong regularization and out of fold computation. But here, leakage matters because target encoding can memorize rare categories, so you must compute encodings on training folds only, add smoothing toward the global mean, and optionally add noise. For missingness, trees can learn missing as its own branch if supported, otherwise you add missing indicators and impute with a constant so the model can split on "is missing". If the feature is time varying, you also ensure encodings and imputations are computed with only past data to avoid temporal leakage.
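
Out-of-fold target encoding with smoothing can be sketched as below. The function names, the smoothing weight `m`, and the index-modulo fold assignment are illustrative assumptions; the fold assignment in particular assumes rows are not ordered by label or time.

```python
from collections import defaultdict


def smoothed_means(cats, ys, prior, m):
    """Per-category mean of y, shrunk toward `prior` with weight m."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}


def oof_target_encode(cats, ys, n_folds=5, m=10.0):
    """Encode each row using statistics from the other folds only."""
    n = len(cats)
    encoded = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        holdout = [i for i in range(n) if i % n_folds == fold]
        prior = sum(ys[i] for i in train) / len(train)  # global mean on train folds
        table = smoothed_means(
            [cats[i] for i in train], [ys[i] for i in train], prior, m
        )
        for i in holdout:
            # Categories unseen in the training folds fall back to the prior.
            encoded[i] = table.get(cats[i], prior)
    return encoded
```

A row's own label never enters its encoding, which is what stops rare categories from memorizing the target. For time-varying features, the folds would be replaced by a past-only window.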

Your XGBoost or LightGBM model improves on the training set for hundreds of rounds, but validation metrics start degrading. Explain why boosting can overfit, and how you would use early stopping and regularization to prevent it.

Netflix | Hard | Tree Based Models and Boosting

Sample Answer

Get this wrong in production and your offline lift turns into online churn because the model chases noise and degrades on fresh traffic. Boosting adds trees sequentially to fit residuals, so after the signal is captured, later trees fit idiosyncrasies in the training sample and validation loss rises. The right call is early stopping on a clean validation set, plus smaller `learning_rate` with an appropriate `n_estimators`, and stronger regularization like shallower trees, larger `min_data_in_leaf`, subsampling rows and columns, and $L_1/L_2$ penalties. You also watch for dataset shift, because apparent overfit can be a validation split that does not match serving.
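
The early stopping rule itself is simple enough to sketch without any library. The function name and the `patience` parameter are illustrative assumptions, and `val_losses` is assumed to hold one validation loss per boosting round.

```python
def best_round_with_patience(val_losses, patience):
    """Return the index of the best round under patience-based early stopping.

    Training stops once `patience` consecutive rounds have passed
    without a new best validation loss.
    """
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx
```

The patience value is itself a tradeoff: too small and a noisy validation curve stops training early, too large and the extra rounds are wasted compute.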

A stakeholder asks why a gradient boosted trees model rejected certain users. How do you produce reliable global and local explanations, and what pitfalls do you warn about when using feature importance and SHAP on correlated features?

Microsoft | Medium | Tree Based Models and Boosting

You need to train GBDT models at scale on a very wide, sparse dataset, and your training time and memory blow up. What algorithmic and systems level choices do you make, including histogram binning, sparsity aware splits, sampling, and distributed training, and how do they change model quality?

Spotify | Hard | Tree Based Models and Boosting

Unsupervised Learning and Representation Learning

Unsupervised learning questions test whether you can extract business value from unlabeled data, which is harder than it sounds because there's no ground truth to validate against. How do you convince a product manager that your user clustering actually represents meaningful behavioral differences rather than noise?

The biggest challenge here is evaluation without labels. You can't just run k-means with different values of k and pick the one with the lowest within-cluster sum of squares, because that quantity keeps decreasing as k grows and will always favor the largest k. You need business-relevant validation, like showing that users in different clusters respond differently to the same product changes or have different lifetime value distributions.

To stand out, you need to show you can use clustering, dimensionality reduction, and embeddings to drive product decisions when labels are scarce. Interviewers frequently see vague answers here, especially around choosing k, validating clusters, and avoiding misleading visualizations from t-SNE or UMAP.

At Spotify, you cluster users based on 50 behavior features to drive personalized onboarding, but there are no labels. How do you choose $k$ for k-means and convince a PM the clusters are real and actionable?

Spotify | Medium | Unsupervised Learning and Representation Learning

Sample Answer

This question is checking whether you can select model capacity and validate usefulness without labels. You should sweep $k$ and compare multiple signals: the elbow on inertia, silhouette or Davies-Bouldin scores, and stability under bootstrap or time-based resampling. Then translate clusters into business narratives via top differentiating features, cluster size, and downstream lift in an offline proxy like onboarding completion, even if it is not a perfect label. If clusters are unstable or only separable in 2D plots, you should say so and propose a different representation or clustering family.
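
As one of the signals to compare across values of $k$, the silhouette score can be computed directly from points and cluster labels. This is a minimal sketch with Euclidean distance and quadratic cost, so it only suits a subsample; the function name is an illustrative assumption, and the cluster labels are assumed to come from whatever clustering was run for a given $k$.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (singleton clusters contribute 0)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue
        # a: mean distance to the rest of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Scores near 1 suggest tight, well-separated clusters and scores near or below 0 suggest overlap, but a good score still says nothing about whether the clusters are actionable.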

At Google, you have 10 million high-dimensional embeddings for search queries and you need to visualize them to spot topical structure. How do you use UMAP or t-SNE without being misled, and what quantitative checks do you run?

Google | Hard | Unsupervised Learning and Representation Learning

Sample Answer

The standard move is to subsample, run UMAP or t-SNE, and eyeball clusters. But here, distortion matters because these methods can invent apparent separation by optimizing local neighborhoods, not global geometry. You should tune and report sensitivity to key hyperparameters (t-SNE perplexity, UMAP `n_neighbors` and `min_dist`), and rerun with multiple seeds to check stability. Quantitatively, you can evaluate neighborhood preservation, for example trustworthiness, and verify that any claimed clusters also separate in the original space via kNN agreement or centroid distances.
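
A simple form of the kNN agreement check is the average overlap between each point's nearest neighbors in the original space and in the 2D projection. The sketch below is a simplified stand-in for trustworthiness rather than that exact metric; the function names are illustrative assumptions, and the brute-force neighbor search is quadratic, so it only suits a subsample.

```python
import math


def knn_indices(points, k):
    """For each point, the set of indices of its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()
        result.append({j for _, j in others[:k]})
    return result


def knn_overlap(high_dim, low_dim, k):
    """Mean fraction of each point's k nearest neighbors kept by the projection."""
    high = knn_indices(high_dim, k)
    low = knn_indices(low_dim, k)
    return sum(len(h & l) / k for h, l in zip(high, low)) / len(high)
```

A value near 1 means local neighborhoods survived the projection; a low value is a warning that clusters visible in the plot may be artifacts.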

At Netflix, you want to learn title embeddings from co-watch sequences to improve recommendations, but you also need to handle cold-start titles and popularity bias. What representation learning approach do you pick, and how do you evaluate it?

Netflix | Medium | Unsupervised Learning and Representation Learning

Sample Answer

Get this wrong in production and you amplify popularity, drown out niche content, and make cold-start items invisible. The right call is to learn sequence-based embeddings, for example skip-gram style on sessions or a two-tower setup with self-supervised objectives, and regularize against popularity via sampling or debiasing. For cold start, you should blend metadata or content features into the embedding space, then calibrate with a gating model. You evaluate with offline ranking metrics like NDCG or recall at $k$ on time-split data, plus embedding quality checks like nearest-neighbor coherence and stability across time.

At Meta, you run DBSCAN on user event embeddings to detect emerging communities, but density varies a lot across regions and languages. How do you adapt the method, choose hyperparameters, and monitor drift over time?

Meta | Hard | Unsupervised Learning and Representation Learning

At Amazon, you need to deduplicate product listings using text and image embeddings, but you cannot afford many false merges. How do you design a clustering or matching pipeline, set thresholds, and validate it with minimal labels?

Amazon | Easy | Unsupervised Learning and Representation Learning

How to Prepare for Machine Learning Interviews

Practice label definition edge cases

Take three business problems and write down five different ways to define the target variable for each. For user churn, consider time windows, grace periods, and reactivation scenarios. This exercise reveals how label choices affect model complexity and business alignment.

Build end-to-end baselines

Pick a dataset and implement three baselines: a simple heuristic, logistic regression, and a tree ensemble. Focus on the evaluation pipeline, not just model accuracy. Can you explain why each baseline fails and what that tells you about the problem structure?

Debug overfitting systematically

Use a high-dimensional dataset and intentionally overfit a model. Then practice isolating the cause: is it model complexity, feature engineering, or data leakage? Learn to diagnose overfitting by examining feature importance and validation curves, not just train/test gaps.

Connect offline metrics to business value

For every model evaluation metric you know, practice explaining when it aligns with business goals and when it doesn't. Why might optimizing for AUC hurt a fraud detection system? When does RMSE mislead you in demand forecasting problems?

Design evaluation without ground truth

Practice unsupervised evaluation techniques on real datasets. Use clustering to segment users, then validate clusters by checking if they have different conversion rates, session lengths, or other business metrics. Learn to argue for cluster quality using external validation.

Frequently Asked Questions

How deep do I need to go on Machine Learning theory for interviews?

You should be able to explain core models and concepts from first principles, then connect them to practical tradeoffs. Expect questions on bias-variance, regularization, loss functions, evaluation metrics, and how data issues affect performance. For senior roles, you will also need to discuss system constraints like latency, monitoring, and model drift.

Which companies ask the most Machine Learning questions in interviews?

Large tech companies with mature ML stacks, plus AI-first product companies, tend to ask the most ML-specific questions. Teams that deploy models to production, such as search, recommendations, ads, fraud, and ranking, usually test ML depth more heavily. At smaller companies, ML questions are often blended with analytics, data engineering, and experimentation topics.

Do I need to code in a Machine Learning interview?

Yes, most ML interviews include coding, even if the role is model-focused. You will likely write Python to manipulate data, implement parts of an algorithm, or debug training and evaluation logic. Practice with ML-flavored coding tasks at datainterview.com/coding.

How do ML interviews differ for Data Scientist vs AI Engineer vs Machine Learning Engineer?

For Data Scientist roles, you will see more emphasis on problem framing, metrics, experimentation, and interpreting model results. For AI Engineer roles, interviews often focus on LLM usage, prompt design, retrieval, evaluation, safety, and integrating models into applications. For Machine Learning Engineer roles, expect deeper coverage of training pipelines, deployment, scaling, feature stores, monitoring, and reliability of production models.

How can I prepare for ML interviews if I have no real-world ML experience?

Build one or two end-to-end projects that mimic production, include data collection, preprocessing, model training, evaluation, and a simple deployment or batch inference job. Be ready to justify model choice, handle leakage, define metrics, and explain error analysis with concrete examples. Use datainterview.com/questions to drill common ML interview topics and structure your explanations.

What are the most common mistakes candidates make in Machine Learning interviews?

You lose points when you talk about models without defining the objective, metric, and data constraints first. Another common mistake is ignoring data leakage, distribution shift, and class imbalance, then claiming strong performance without validating properly. You should also avoid vague statements like using a more complex model, instead describe specific levers like regularization, thresholding, calibration, or better negative sampling.

Dan Lee
