Feature engineering makes or breaks machine learning interviews at Meta, Google, Amazon, Airbnb, Uber, and Netflix. Unlike coding problems where you implement algorithms, feature engineering tests your ability to extract signal from messy real-world data while avoiding leakage, handling scale, and shipping features that work reliably in production. Senior roles expect you to design entire feature pipelines, not just answer textbook questions about normalization.
What makes feature engineering interviews brutal is the open-ended nature combined with production constraints. You might start with a simple question like "how would you encode user_id for a recommendation model" but then face follow-ups about memory budgets, training-serving skew, cold start handling, and privacy requirements. A candidate who suggests target encoding without mentioning regularization or holdout strategies immediately signals they haven't shipped features at scale.
Here are the top 30 feature engineering questions organized by the core challenges you'll face: numerical preprocessing, categorical encoding, temporal features, text processing, and feature selection.
Feature Engineering Interview Questions
Top Feature Engineering interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Numerical Features and Scaling
Interviewers use numerical feature questions to test whether you understand the downstream effects of your preprocessing choices on model behavior, not just the mechanics of scaling formulas. Most candidates can explain StandardScaler but fail when asked how different scaling approaches interact with L1 versus L2 regularization, or why you might choose robust scaling over min-max for tree-based models.
The key insight is that numerical preprocessing isn't just about making features "nice" for the algorithm. It's about preserving signal while handling edge cases that break models in production: outliers from data quality issues, distributional shifts between training and serving, and computational constraints during online inference.
In this section you show you can turn raw numbers into stable model inputs, including transforms, scaling, clipping, and missing value handling. You will get pressed on when choices change model behavior, and many candidates hand-wave the tradeoffs across linear models, trees, and neural nets.
You are predicting ad click probability with logistic regression, and one feature is user spend in the last 30 days with a heavy right tail and many zeros. What transform and scaling would you apply, and how would you validate that it improved calibration and stability?
Sample Answer
Most candidates default to z score scaling on the raw spend, but that fails here because extreme outliers dominate the mean and variance and you still keep a highly skewed feature. You should use a monotonic transform like $x' = \log(1 + x)$ to compress the tail, then standardize $x'$ for the optimizer and regularization to behave sensibly. Treat zeros naturally via $\log(1 + 0)=0$, and consider winsorizing at a high percentile if fraud or logging spikes exist. Validate by checking AUC plus calibration metrics like ECE or reliability plots, and by monitoring coefficient stability across time splits, not just a random split.
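A minimal sketch of this recipe, assuming a single spend array and a 99.5th-percentile winsorization cap (both illustrative); note that the cap and the standardization statistics are fit on training data only and then frozen for serving:

```python
import numpy as np

def transform_spend(train_spend, serve_spend, clip_pct=99.5):
    """Fit clip threshold and z-score stats on training data only,
    then apply log1p -> clip -> standardize to new data."""
    cap = np.percentile(train_spend, clip_pct)        # winsorize cap from train only
    train_t = np.log1p(np.clip(train_spend, 0, cap))  # compress the heavy right tail
    mu, sigma = train_t.mean(), train_t.std()         # frozen, shipped with the model
    serve_t = np.log1p(np.clip(serve_spend, 0, cap))
    return (serve_t - mu) / sigma

train = np.array([0, 0, 5, 12, 40, 90, 10_000.0])  # heavy right tail, many zeros
print(transform_spend(train, np.array([0.0, 90.0])))
```

Zeros pass through naturally because $\log(1+0)=0$, and the extreme value no longer dominates the mean and variance after clipping and the log transform.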
In an L2 regularized linear model, you have features on wildly different scales: price in dollars, distance in meters, and a binary flag. Should you scale, and if yes, how does scaling change the learned model behavior under regularization?
You are building a gradient boosted tree model for trip duration, and the raw numeric feature pickup speed sometimes has sensor glitches with huge spikes. Would you clip, log transform, or leave it, and why might your choice differ from a neural network model?
You have a numeric feature, restaurant prep time, missing for 20% of rows, and missingness is not random because some partners do not report it. For a model that could be either linear or tree based, how do you handle missing values and scaling without leaking information?
You need to combine income, account balance, and transaction count into a single feature for a fraud model, and each has a different distribution and outliers. Design a robust scaling pipeline and explain how you would set hyperparameters like clip thresholds using only training data.
Your team standardizes all numeric features globally, but you suspect this hurts a model serving multiple countries with different currencies and purchasing power. What alternative scaling strategies would you propose, and how would you evaluate them while controlling for leakage and fairness concerns?
Categorical Encoding and High Cardinality
High cardinality categorical encoding separates junior from senior candidates because it forces you to balance statistical power against overfitting, cold start robustness, and computational efficiency. Many candidates know one-hot encoding and target encoding but struggle when asked to handle a feature with millions of categories, frequent new values, and strict latency requirements.
The critical mistake is treating encoding as a preprocessing step instead of a modeling decision. Smart encoding strategies like frequency-based grouping, learned embeddings, or hierarchical bucketing require deep understanding of your model architecture, data distribution, and business constraints.
Expect interviewers to test how you encode categories under constraints like sparse data, new values at serving time, and leakage risk. Candidates often struggle to justify one hot vs target encoding vs hashing while keeping pipelines reproducible and safe.
You are building a click-through rate model for ads with a "campaign_id" feature that has 2 million unique values and a long tail. New campaign_ids appear every hour at serving time. How do you encode this feature and keep training and serving consistent?
Sample Answer
Use hashing, optionally combined with a smoothed target encoding learned on training folds, to handle high cardinality and unseen IDs safely. Hashing gives you a fixed dimensional representation and deterministic handling of new values by mapping them into the same bucket space. If you add target encoding, compute it out of fold to prevent leakage and apply smoothing so rare campaigns back off to the global mean. Keep the hash function, seed, number of buckets, and any fold logic versioned in the feature pipeline so offline and online match.
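The hashing piece can be sketched as below; the bucket count and seed string are illustrative choices that would be versioned with the feature pipeline so offline and online match:

```python
import hashlib

N_BUCKETS = 2**18   # fixed dimensionality, chosen for the memory budget
HASH_SEED = "v1"    # versioned artifact: changing it requires a retrain

def hash_bucket(campaign_id: str) -> int:
    """Map any campaign_id, including never-seen ones, to a stable bucket."""
    digest = hashlib.md5(f"{HASH_SEED}:{campaign_id}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

# A brand-new ID at serving time gets a bucket with no special-casing.
print(hash_bucket("campaign_123"), hash_bucket("brand_new_campaign"))
```

Because the mapping is a pure function of the ID, seed, and bucket count, training and serving agree by construction, with no lookup table to sync.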
You have a "restaurant_id" categorical feature for a ranking model, with many restaurants having fewer than 20 impressions. You want a strong signal but are worried about leakage and overfitting. Would you use one hot encoding or target encoding, and how would you make it safe?
In an ecommerce model you have "brand" with 50k categories and "color" with 30 categories, and you can only add 5000 feature dimensions to stay within latency and memory budgets. How do you decide what encoding to use for each, and how do you implement it?
Your model uses "city" and "user_segment" categorical features and is trained daily on yesterday's data, then served in real time. Some categories only appear today, and others disappear. How do you design the encoding and pipeline to avoid training serving skew and silent failures?
You are training a churn model with a categorical "plan_type" feature, but the plan definitions change over time and some plans are renamed. How do you encode plan_type to preserve signal, support backfills, and avoid leakage from future plan mappings?
You want to use target encoding for "host_id" in a booking conversion model, but you also have repeated observations per host and strong seasonality. Describe a cross validation or splitting strategy that produces unbiased encodings, and how you would deploy it without leaking labels online.
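The smoothed, out-of-fold target encoding these questions point at can be sketched as follows; column names, fold count, and the smoothing weight are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, cat_col, label_col, n_folds=5, smoothing=20, seed=0):
    """Encode each row using label means computed on the *other* folds,
    shrinking rare categories toward the global mean."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(df))
    encoded = np.zeros(len(df))
    global_mean = df[label_col].mean()
    for f in range(n_folds):
        train = df[folds != f]                      # labels from other folds only
        stats = train.groupby(cat_col)[label_col].agg(["mean", "count"])
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded[folds == f] = (
            df.loc[folds == f, cat_col].map(smooth).fillna(global_mean).to_numpy()
        )
    return encoded

df = pd.DataFrame({
    "host_id": ["a"] * 50 + ["b"] * 5,
    "booked": [1] * 40 + [0] * 10 + [1] * 5,
})
enc = oof_target_encode(df, "host_id", "booked")
print(enc[:3])
```

The fold assignment and smoothing weight are artifacts to version alongside the model; for strong seasonality you would replace the random folds with time-based splits so encodings never see future labels.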
Time Based Features, Windows, and Leakage Control
Temporal feature engineering questions expose whether you've actually built production ML systems or just worked with clean datasets. The challenge isn't computing rolling averages; it's handling late-arriving data, backfills, timezone inconsistencies, and label leakage while maintaining reproducible training pipelines.
Most failures happen because candidates ignore the relationship between event time, processing time, and label time. You might correctly compute a 7-day window feature but create subtle leakage if you don't account for when data becomes available versus when decisions need to be made.
You will be evaluated on building time aware features like rolling aggregates, recency, seasonality, and lagged signals without peeking into the future. Many people can write the window logic but fail when asked to align event time, label time, and backfill rules in production.
You are building a churn model for a music app where labels are defined at a user-level label_time (end of day), and you want a 7-day rolling count of plays. How do you compute it to avoid leakage when plays arrive late and can be backfilled?
Sample Answer
You could compute the window off event_time or off ingestion_time. Event_time wins here because the feature must reflect what happened in the user world, but you must enforce a cutoff of event_time < label_time and freeze features as of a chosen snapshot. To handle late events, you either (1) train and serve on the same snapshot lag, for example compute features from data available by label_time + 2 days, or (2) exclude late events by requiring ingestion_time <= label_time. Pick one policy and apply it consistently in both training and online backfills, otherwise your offline AUC will be inflated by future-arriving events.
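The "exclude late events" policy (option 2 above) can be sketched in pandas; the table and column names follow the question, but the data here is synthetic:

```python
import pandas as pd

plays = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-06"]),
    "ingestion_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-09"]),
})
labels = pd.DataFrame({
    "user": ["u1"],
    "label_time": pd.to_datetime(["2024-01-07"]),
})

def plays_7d(labels, plays):
    rows = labels.merge(plays, on="user")
    in_window = (
        (rows["event_time"] >= rows["label_time"] - pd.Timedelta(days=7))
        & (rows["event_time"] < rows["label_time"])       # never peek past label_time
        & (rows["ingestion_time"] <= rows["label_time"])  # drop late-arriving events
    )
    return rows[in_window].groupby(["user", "label_time"]).size()

print(plays_7d(labels, plays))  # the Jan 6 play is excluded: it arrived Jan 9
```

The same predicate must run in online backfills; if the serving path silently includes late events that this training query drops, offline and online features diverge.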
At a marketplace, you predict whether a host will accept a booking request within 24 hours. You have request_time, host_response_time (nullable), and you want features like 'messages sent in the last 3 days'. Walk through how you align event time, label time, and window boundaries for training rows.
You are building a feed ranking model and want a recency feature 'time since last click' per user. The click table has both click_time and log_time, and sometimes log_time is later due to batching. How do you compute the feature without leakage and with reproducible training data?
You are predicting daily active users and want a 28-day rolling average of sessions per user. In production, you recompute features daily and backfill missed days. What is your rule for windowing, and when can that rule break?
You are forecasting next-hour demand for rides and you have a feature 'rides in the last 15 minutes' computed from trip events. Trips can be updated after completion (price adjustments, cancellations). How do you design the feature store logic to prevent leakage while keeping the metric accurate?
For an ecommerce model predicting whether a user will purchase in the next 7 days, you want category-level rolling conversion rates computed over the past 30 days. How do you avoid target leakage when purchases and views are linked, and how do you handle users who appear multiple times with different label_times?
Text and NLP Feature Construction
Text feature engineering tests your ability to extract meaningful signal from unstructured data while respecting computational budgets and handling real-world messiness like multilingual content, evolving vocabulary, and adversarial inputs. Candidates often focus on sophisticated NLP techniques but ignore basic issues like consistent tokenization between training and serving.
The fundamental tension in text features is between expressiveness and robustness. TF-IDF might outperform embeddings for your specific task, but only if you design the vocabulary management and normalization pipeline to handle distribution drift and edge cases without breaking production systems.
Rather than definitions, you need to explain what text features you would ship, how you would normalize, and how you would handle vocabulary drift and multilingual data. Candidates commonly over index on fancy embeddings and miss baselines, latency, and offline online skew concerns.
You are building a model to predict whether a user will click a search result using only the query text and the result title, and you have a 20 ms online budget. What text features do you ship first, and how do you normalize them to avoid offline online skew?
Sample Answer
Start with the cheapest strong baselines: character and word $n$-grams with hashing for both query and title, plus simple cross features like shared token count, Jaccard overlap, and BM25-like scores. Normalize with the exact same tokenizer in training and serving, same Unicode normalization, same lowercasing rules, same stopword policy, and store the hashing seed and vocab settings as versioned artifacts. Add length features, fraction of digits, and casing ratios, then standardize numeric features using training-set statistics that you freeze and ship. Only after you have these stable features and measured latency do you consider a compact embedding, and you still keep the hashed baselines as fallbacks.
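The shared-token and Jaccard cross features can be sketched with a single normalization path reused verbatim at training and serving time; the NFKC-plus-lowercase rules here are illustrative assumptions:

```python
import unicodedata

def tokenize(text: str) -> set:
    """One normalization path for both training and serving:
    Unicode NFKC, lowercase, whitespace split."""
    text = unicodedata.normalize("NFKC", text).lower()
    return set(text.split())

def overlap_features(query: str, title: str) -> dict:
    """Cheap cross features computable well inside a 20 ms budget."""
    q, t = tokenize(query), tokenize(title)
    shared = q & t
    union = q | t
    return {
        "shared_tokens": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "query_len": len(q),
    }

print(overlap_features("red running shoes", "Red Shoes for Running"))
```

Because both the offline feature job and the online service call the same `tokenize`, a casing or Unicode change cannot silently skew one side without the other.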
Your product reviews model uses TF-IDF features trained monthly, but traffic and slang shift weekly and you are seeing a slow accuracy drop. How do you design the feature pipeline to handle vocabulary drift without destabilizing the model?
You need a multilingual intent classifier for a global app, but you only have labeled data in English and Spanish. What features do you ship first, and how do you avoid language-specific tokenization bugs and inconsistent normalization?
A feed ranking model uses a sentence embedding of the post text, computed offline, and you also want to add an online feature for the first 200 characters. What do you do to prevent offline online skew and keep latency predictable?
You are adding toxicity detection for user comments, but you must be robust to obfuscation like spaced letters, leetspeak, and emoji. What feature set do you propose that is cheap, resilient, and debuggable?
You trained a model with subword tokenization, but a new app release changes how users input text, adding more emoji and mixed scripts. What monitoring and retraining strategy do you implement to catch vocabulary drift early and avoid regressions across languages?
Feature Selection, Importance, and Model Debugging
Feature selection and model debugging questions test your ability to understand what your models actually learned versus what you think they learned. Interviewers want to see if you can diagnose feature issues, run proper ablation studies, and make principled decisions about feature complexity versus model performance.
The most common mistake is relying too heavily on automated importance scores without understanding their limitations. SHAP values, permutation importance, and correlation analysis all tell different stories, and experienced practitioners know when each method is reliable versus when it can mislead you about true feature contributions.
Advanced interviews ask you to prove which features matter and why, using ablations, permutation tests, SHAP style explanations, and slice based error analysis. You will be challenged on correlated features, stability across time, and how you decide to drop, keep, or refactor features safely.
You launch a ranking model and AUC improves overall, but you suspect a new user embedding feature is doing most of the work. How do you prove the embedding is truly important and not just correlated with user age and activity, and what ablations would you run?
Sample Answer
This question is checking whether you can separate true signal from correlation, and whether your importance claims survive controlled experiments. You should run grouped ablations, remove the embedding and also remove the correlated feature set together, then compare with a model that keeps only the correlated set to see if the embedding adds incremental lift. Add permutation importance with conditional or grouped permutations, because naive permutation overstates importance when features are correlated. Validate stability by repeating ablations across time splits and key slices, and check calibration and slice metrics, not just overall AUC.
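A grouped permutation check can be sketched as below: the correlated columns (here a hypothetical two-column embedding) are permuted with the same row order so their shared signal is destroyed together rather than double-counted. The synthetic data, column names, and fold of repeats are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = pd.DataFrame({
    "emb_1": signal + rng.normal(scale=0.1, size=n),  # correlated pair sharing
    "emb_2": signal + rng.normal(scale=0.1, size=n),  # one underlying signal
    "noise": rng.normal(size=n),
})
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def grouped_drop(group_cols, n_repeats=5):
    """Mean AUC drop when an entire feature group is permuted jointly."""
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = []
    for i in range(n_repeats):
        Xp = X.copy()
        perm = np.random.default_rng(i).permutation(n)
        Xp[group_cols] = X[group_cols].to_numpy()[perm]  # same order for the group
        drops.append(base - roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
    return float(np.mean(drops))

print(grouped_drop(["emb_1", "emb_2"]), grouped_drop(["noise"]))
```

Permuting `emb_1` alone would understate its importance because `emb_2` carries the same signal; permuting the group exposes the real incremental contribution.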
You compute SHAP values for a churn model and one feature, "days_since_last_login", dominates globally. The PM wants to drop several smaller features to simplify the pipeline. How do you decide what to drop, and how do you sanity check that the SHAP story is not misleading?
After a retrain, your fraud model performance drops mainly for one country, and only for Android. You suspect a device fingerprint feature changed distribution and is now hurting. What debugging steps do you take, and how do you decide whether to keep, refactor, or roll back that feature?
You have two highly correlated features, "price" and "discounted_price", and L1 regularization keeps flipping which one has a nonzero weight across runs. Stakeholders want a stable explanation for which feature matters. What do you do to get stable importance and a robust feature choice?
You are asked to remove 30 percent of features to reduce latency in a real time model, but you cannot degrade p99 latency or lose more than 0.2 percent AUC. How do you prioritize features to drop safely, and what validation plan do you propose?
A new text derived feature improves offline metrics, but online it hurts retention for new users while helping power users. How do you use slice based analysis and counterfactual style tests to decide whether to keep it, and how do you prevent regressions during rollout?
How to Prepare for Feature Engineering Interviews
Practice with Production Constraints
Every feature engineering question should include follow-ups about latency budgets, memory limits, and training-serving consistency. When you propose a solution, immediately ask yourself: how would this scale to billions of examples, what breaks if new categories appear, and how do I validate it's working correctly?
Master Leakage Detection Patterns
Build a mental checklist for temporal leakage: are you using information that wouldn't be available at prediction time, are your windows aligned correctly with label definition, and how do late-arriving events affect your features? Practice walking through the exact timeline of when data becomes available.
Know When Standard Approaches Fail
Don't just memorize scaling techniques; understand when they break. Standard scaling fails with heavy tails, one-hot encoding explodes memory with high cardinality, and target encoding overfits with small categories. Prepare specific examples of when you'd choose alternatives and why.
Connect Features to Business Metrics
For every feature you propose, explain how it connects to the business objective and what could go wrong. A recency feature for engagement prediction makes sense, but what if it discriminates against users in different timezones or usage patterns? Think beyond statistical performance to real-world impact.
How Ready Are You for Feature Engineering Interviews?
You are training a linear model with L2 regularization on a dataset where one feature is annual income in dollars and another is a ratio between 0 and 1. Validation performance is unstable and coefficients look dominated by income. What is the best next step?
Frequently Asked Questions
How deep do I need to go on feature engineering concepts for interviews?
You should be able to explain why a feature helps, how it is computed, and how you would validate it without leaking target information. Expect depth on handling missingness, scaling, encoding, time based features, text features, and interaction features, plus tradeoffs like interpretability versus performance. You should also be ready to discuss feature selection, drift, and how feature engineering changes with model choice such as linear models versus tree based models.
Which companies tend to ask the most feature engineering questions?
Companies with mature ML systems and high data complexity ask it most, including big tech, ad tech, fintech, and marketplaces. You will often see feature engineering emphasized where offline and online consistency matters, like ranking, recommendations, fraud, and pricing. Startups building their first production models also focus on it because feature quality can dominate model choice.
Will I need to code feature engineering in the interview?
Often yes, you may be asked to write SQL to build aggregates, window features, or leakage safe labels, and sometimes Python to transform data with pandas or sklearn pipelines. You should be able to implement train and validation splits correctly for time series and build features without peeking into the future. Practice with realistic transformations at datainterview.com/coding so you can move quickly from raw tables to model ready features.
How does feature engineering interviewing differ for Data Scientist vs Machine Learning Engineer roles?
As a Data Scientist, you are usually evaluated on feature ideation, statistical validation, and interpreting feature effects, including ablations and error analysis by segment. As a Machine Learning Engineer, you are more often evaluated on feature pipelines, feature stores, online serving constraints, and reproducibility across training and inference. You should tailor answers toward experimentation and model performance for DS, and toward data contracts, latency, and monitoring for MLE.
How can I prepare for feature engineering interviews if I have no real world experience?
You can build a small end to end project where you create features from messy raw data, then show the lift from baseline to engineered features using a clear evaluation setup. Focus on common patterns like time window aggregates, categorical encoding with rare categories, and text normalization, and document how you avoided leakage. Use datainterview.com/questions to practice explaining your choices and the validation logic in a concise way.
What are common feature engineering mistakes interviewers watch for?
The biggest red flag is data leakage, like using future information in time series features or aggregating over a window that includes the label period. Another common mistake is applying preprocessing incorrectly, such as fitting scalers or encoders on the full dataset instead of training only, or exploding cardinality with naive one hot encoding. You should also avoid creating features that cannot be computed at prediction time, or that are too slow or unstable to serve in production.
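The scaler-fitting mistake has a mechanical fix: put preprocessing inside a pipeline so it is refit on each cross-validation training fold, never on the full dataset. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Wrong: scaler.fit(X) on all rows leaks held-out-fold statistics into training.
# Right: inside a pipeline, the scaler is refit on each CV training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The same pattern applies to encoders and imputers: any statistic learned from data belongs inside the pipeline so cross-validation, and later the serving path, only ever sees statistics fit on training rows.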
