Feature Engineering Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 13, 2026

Feature engineering makes or breaks machine learning interviews at Meta, Google, Amazon, Airbnb, Uber, and Netflix. Unlike coding problems where you implement algorithms, feature engineering tests your ability to extract signal from messy real-world data while avoiding leakage, handling scale, and shipping features that work reliably in production. Senior roles expect you to design entire feature pipelines, not just answer textbook questions about normalization.

What makes feature engineering interviews brutal is the open-ended nature combined with production constraints. You might start with a simple question like "how would you encode user_id for a recommendation model" but then face follow-ups about memory budgets, training-serving skew, cold start handling, and privacy requirements. A candidate who suggests target encoding without mentioning regularization or holdout strategies immediately signals they haven't shipped features at scale.

Here are the top 30 feature engineering questions organized by the core challenges you'll face: numerical preprocessing, categorical encoding, temporal features, text processing, and feature selection.

Intermediate · 30 questions


Top Feature Engineering interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.


Numerical Features and Scaling

Interviewers use numerical feature questions to test whether you understand the downstream effects of your preprocessing choices on model behavior, not just the mechanics of scaling formulas. Most candidates can explain StandardScaler but fail when asked how different scaling approaches interact with L1 versus L2 regularization, or why you might choose robust scaling over min-max for tree-based models.

The key insight is that numerical preprocessing isn't just about making features "nice" for the algorithm. It's about preserving signal while handling edge cases that break models in production: outliers from data quality issues, distributional shifts between training and serving, and computational constraints during online inference.


In this section you show you can turn raw numbers into stable model inputs, including transforms, scaling, clipping, and missing value handling. You will get pressed on when these choices change model behavior, and many candidates hand-wave the tradeoffs across linear models, trees, and neural nets.

You are predicting ad click probability with logistic regression, and one feature is user spend in the last 30 days with a heavy right tail and many zeros. What transform and scaling would you apply, and how would you validate that it improved calibration and stability?

Meta · Medium

Sample Answer

Most candidates default to z-score scaling on the raw spend, but that fails here because extreme outliers dominate the mean and variance, and you still keep a highly skewed feature. You should use a monotonic transform like $x' = \log(1 + x)$ to compress the tail, then standardize $x'$ so the optimizer and regularization behave sensibly. Zeros are handled naturally via $\log(1 + 0)=0$, and consider winsorizing at a high percentile if fraud or logging spikes exist. Validate by checking AUC plus calibration metrics like ECE or reliability plots, and by monitoring coefficient stability across time splits, not just a random split.
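The pipeline described above can be sketched in a few lines of NumPy. The lognormal sample, the 99th-percentile cap, and the variable names are illustrative choices, not a real production setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 30-day spend: many zeros plus a heavy right tail.
spend = np.concatenate([np.zeros(500), rng.lognormal(mean=3.0, sigma=2.0, size=500)])

# Winsorize at a high percentile to tame fraud or logging spikes.
cap = np.quantile(spend, 0.99)
spend_capped = np.minimum(spend, cap)

# log1p sends zero to zero and compresses the tail.
x = np.log1p(spend_capped)

# Standardize with training-split statistics that you freeze and ship.
mu, sigma = x.mean(), x.std()
x_scaled = (x - mu) / sigma
```

In production, `mu` and `sigma` would be computed on the training split only and versioned with the model, so serving applies exactly the same transform.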

Practice more Numerical Features and Scaling questions

Categorical Encoding and High Cardinality

High cardinality categorical encoding separates junior from senior candidates because it forces you to balance statistical power against overfitting, cold start robustness, and computational efficiency. Many candidates know one-hot encoding and target encoding but struggle when asked to handle a feature with millions of categories, frequent new values, and strict latency requirements.

The critical mistake is treating encoding as a preprocessing step instead of a modeling decision. Smart encoding strategies like frequency-based grouping, learned embeddings, or hierarchical bucketing require deep understanding of your model architecture, data distribution, and business constraints.


Expect interviewers to test how you encode categories under constraints like sparse data, new values at serving time, and leakage risk. Candidates often struggle to justify one-hot vs. target encoding vs. hashing while keeping pipelines reproducible and safe.

You are building a click-through rate model for ads with a "campaign_id" feature that has 2 million unique values and a long tail. New campaign_ids appear every hour at serving time. How do you encode this feature and keep training and serving consistent?

Meta · Hard

Sample Answer

Use hashing, optionally combined with a smoothed target encoding learned on training folds, to handle high cardinality and unseen IDs safely. Hashing gives you a fixed dimensional representation and deterministic handling of new values by mapping them into the same bucket space. If you add target encoding, compute it out of fold to prevent leakage and apply smoothing so rare campaigns back off to the global mean. Keep the hash function, seed, number of buckets, and any fold logic versioned in the feature pipeline so offline and online match.
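A minimal sketch of hashing plus out-of-fold smoothed target encoding, as described above. The bucket count, smoothing strength `ALPHA`, fold layout, and toy data are illustrative choices:

```python
import hashlib
import pandas as pd

N_BUCKETS = 2**18  # fixed bucket count, versioned with the pipeline
SEED = "v1"        # bump deliberately to re-shuffle hash collisions

def hash_bucket(campaign_id: str) -> int:
    """Deterministic bucket for any campaign_id, including unseen ones."""
    h = hashlib.md5(f"{SEED}:{campaign_id}".encode()).hexdigest()
    return int(h, 16) % N_BUCKETS

df = pd.DataFrame({
    "campaign_id": ["c1", "c1", "c2", "c3", "c1", "c2"],
    "click": [1, 0, 1, 0, 1, 1],
    "fold": [0, 1, 0, 1, 0, 1],
})
df["campaign_bucket"] = df["campaign_id"].map(hash_bucket)

# Out-of-fold smoothed target encoding: statistics come from the OTHER folds,
# and rare campaigns back off toward the global mean.
ALPHA = 10.0
global_mean = df["click"].mean()
for fold in df["fold"].unique():
    oof = df[df["fold"] != fold]
    stats = oof.groupby("campaign_id")["click"].agg(["sum", "count"])
    te = (stats["sum"] + ALPHA * global_mean) / (stats["count"] + ALPHA)
    mask = df["fold"] == fold
    df.loc[mask, "campaign_te"] = df.loc[mask, "campaign_id"].map(te).fillna(global_mean)

# A brand-new campaign at serving time still gets a deterministic bucket,
# and its target encoding falls back to the global mean.
new_bucket = hash_bucket("campaign_never_seen_before")
```

The key production property is determinism: the same `SEED`, hash function, and `N_BUCKETS` shipped offline and online guarantee identical buckets for identical IDs.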

Practice more Categorical Encoding and High Cardinality questions

Time Based Features, Windows, and Leakage Control

Temporal feature engineering questions expose whether you've actually built production ML systems or just worked with clean datasets. The challenge isn't computing rolling averages; it's handling late-arriving data, backfills, timezone inconsistencies, and label leakage while maintaining reproducible training pipelines.

Most failures happen because candidates ignore the relationship between event time, processing time, and label time. You might correctly compute a 7-day window feature but create subtle leakage if you don't account for when data becomes available versus when decisions need to be made.


You will be evaluated on building time aware features like rolling aggregates, recency, seasonality, and lagged signals without peeking into the future. Many people can write the window logic but fail when asked to align event time, label time, and backfill rules in production.

You are building a churn model for a music app where labels are defined at a user-level label_time (end of day). You want a 7-day rolling count of plays. How do you compute it to avoid leakage when plays arrive late and can be backfilled?

Spotify · Hard

Sample Answer

You could compute the window off event_time or off ingestion_time; event_time wins here because the feature must reflect what happened in the user's world, but you must enforce a cutoff of event_time < label_time and freeze features as of a chosen snapshot. To handle late events, you either (1) train and serve on the same snapshot lag, for example compute features from data available by label_time + 2 days, or (2) exclude late events by requiring ingestion_time <= label_time. Pick one policy and apply it consistently in both training and online backfills, otherwise your offline AUC will be inflated by future-arriving events.
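The cutoff policy from option (2) can be sketched in pandas. The timestamps and column names here are hypothetical:

```python
import pandas as pd

# One user's play events; the 03-07 play arrived late (ingested 03-09).
plays = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "event_time": pd.to_datetime(
        ["2024-02-28", "2024-03-05", "2024-03-07", "2024-03-08"]),
    "ingestion_time": pd.to_datetime(
        ["2024-02-28", "2024-03-05", "2024-03-09", "2024-03-08"]),
})
label_time = pd.Timestamp("2024-03-08")  # end-of-day label for this user

# Count only events that happened before label_time AND had arrived by
# label_time. Apply the same filter in training and in online backfills.
window_start = label_time - pd.Timedelta(days=7)
eligible = plays[
    (plays["event_time"] >= window_start)
    & (plays["event_time"] < label_time)
    & (plays["ingestion_time"] <= label_time)
]
rolling_7d_plays = len(eligible)
# 02-28 is out of window, 03-07 arrived too late, 03-08 is at label time,
# so only the 03-05 play counts.
```

The point is that the filter encodes the availability policy explicitly; if you instead chose the snapshot-lag policy (1), the ingestion cutoff would move to label_time plus the agreed lag in both training and serving.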

Practice more Time Based Features, Windows, and Leakage Control questions

Text and NLP Feature Construction

Text feature engineering tests your ability to extract meaningful signal from unstructured data while respecting computational budgets and handling real-world messiness like multilingual content, evolving vocabulary, and adversarial inputs. Candidates often focus on sophisticated NLP techniques but ignore basic issues like consistent tokenization between training and serving.

The fundamental tension in text features is between expressiveness and robustness. TF-IDF might outperform embeddings for your specific task, but only if you design the vocabulary management and normalization pipeline to handle distribution drift and edge cases without breaking production systems.


Rather than definitions, you need to explain what text features you would ship, how you would normalize them, and how you would handle vocabulary drift and multilingual data. Candidates commonly over-index on fancy embeddings and miss baselines, latency, and offline-online skew concerns.

You are building a model to predict whether a user will click a search result using only the query text and the result title, and you have a 20 ms online budget. What text features do you ship first, and how do you normalize them to avoid offline-online skew?

Google · Medium

Sample Answer

Reason through it: start with the cheapest strong baselines, character and word $n$-grams with hashing for both query and title, plus simple cross features like shared token count, Jaccard overlap, and BM25-like scores. Normalize with the exact same tokenizer in training and serving, same Unicode normalization, same lowercasing rules, same stopword policy, and store the hashing seed and vocab settings as versioned artifacts. Add length features, fraction of digits, and casing ratios, then standardize numeric features using training-set statistics that you freeze and ship. Only after you have these stable features and latency measured do you consider a compact embedding, and you still keep the hashed baselines as fallbacks.
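A sketch of these baseline features, assuming scikit-learn's `HashingVectorizer`. The normalization rules, feature dimension, and example strings are illustrative choices:

```python
import re
import unicodedata

from sklearn.feature_extraction.text import HashingVectorizer

def normalize(text: str) -> str:
    """The single normalizer used in BOTH training and serving."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

# Hashed n-grams: fixed dimension, no vocabulary file to ship or drift.
vec = HashingVectorizer(n_features=2**16, ngram_range=(1, 2), alternate_sign=False)

query = normalize("Cheap Flights NYC")
title = normalize("cheap flights to new york")
q_tokens, t_tokens = set(query.split()), set(title.split())

# Cheap cross features computable inside a tight online budget.
shared = len(q_tokens & t_tokens)
jaccard = shared / len(q_tokens | t_tokens)

X = vec.transform([query, title])  # two sparse hashed n-gram vectors
```

Because `HashingVectorizer` is stateless, the only artifacts to version are the hashing parameters and the `normalize` function itself, which is exactly what keeps offline and online feature values identical.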

Practice more Text and NLP Feature Construction questions

Feature Selection, Importance, and Model Debugging

Feature selection and model debugging questions test your ability to understand what your models actually learned versus what you think they learned. Interviewers want to see if you can diagnose feature issues, run proper ablation studies, and make principled decisions about feature complexity versus model performance.

The most common mistake is relying too heavily on automated importance scores without understanding their limitations. SHAP values, permutation importance, and correlation analysis all tell different stories, and experienced practitioners know when each method is reliable versus when it can mislead you about true feature contributions.


Advanced interviews ask you to prove which features matter and why, using ablations, permutation tests, SHAP style explanations, and slice based error analysis. You will be challenged on correlated features, stability across time, and how you decide to drop, keep, or refactor features safely.

You launch a ranking model and AUC improves overall, but you suspect a new user embedding feature is doing most of the work. How do you prove the embedding is truly important and not just correlated with user age and activity, and what ablations would you run?

Meta · Hard

Sample Answer

This question is checking whether you can separate true signal from correlation, and whether your importance claims survive controlled experiments. You should run grouped ablations, remove the embedding and also remove the correlated feature set together, then compare with a model that keeps only the correlated set to see if the embedding adds incremental lift. Add permutation importance with conditional or grouped permutations, because naive permutation overstates importance when features are correlated. Validate stability by repeating ablations across time splits and key slices, and check calibration and slice metrics, not just overall AUC.
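The grouped-ablation idea can be sketched on synthetic data, where a hypothetical "embedding" score is constructed to correlate with age and activity but also carry independent signal. All coefficients and sample sizes here are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
age = rng.normal(size=n)
activity = rng.normal(size=n)
# "Embedding" score: correlated with age/activity, plus its own signal.
embed = 0.6 * age + 0.6 * activity + rng.normal(size=n)
logits = 0.5 * age + 0.5 * activity + 1.0 * embed + rng.normal(size=n)
y = (logits > 0).astype(int)

X = np.column_stack([age, activity, embed])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def auc_with(cols):
    """Test AUC of a model trained on the given column subset."""
    m = LogisticRegression().fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, m.predict_proba(X_te[:, cols])[:, 1])

auc_full = auc_with([0, 1, 2])    # all features
auc_corr_only = auc_with([0, 1])  # drop the embedding, keep correlated set
auc_embed_only = auc_with([2])    # embedding alone

# The honest importance claim is the incremental lift over the correlated
# set, not the embedding's standalone AUC, which double-counts shared signal.
incremental_lift = auc_full - auc_corr_only
```

In a real ranking system you would repeat this across time splits and key user slices, as the answer above notes, rather than trusting a single random split.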

Practice more Feature Selection, Importance, and Model Debugging questions

How to Prepare for Feature Engineering Interviews

Practice with Production Constraints

Every feature engineering question should include follow-ups about latency budgets, memory limits, and training-serving consistency. When you propose a solution, immediately ask yourself: how would this scale to billions of examples, what breaks if new categories appear, and how do I validate it's working correctly?

Master Leakage Detection Patterns

Build a mental checklist for temporal leakage: are you using information that wouldn't be available at prediction time, are your windows aligned correctly with label definition, and how do late-arriving events affect your features? Practice walking through the exact timeline of when data becomes available.
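This checklist can be turned into a cheap automated audit of a feature table. The column names here ("feature_as_of", "max_source_event_time", "label_time") are hypothetical and would need to match your own schema:

```python
import pandas as pd

def check_temporal_leakage(features: pd.DataFrame) -> list[str]:
    """Cheap pre-training audit of a feature table.

    Assumed columns: 'feature_as_of' is when the feature snapshot was taken,
    'max_source_event_time' is the latest raw event that fed the row, and
    'label_time' is when the label is defined.
    """
    problems = []
    if (features["feature_as_of"] >= features["label_time"]).any():
        problems.append("feature snapshot taken at or after label time")
    if (features["max_source_event_time"] >= features["label_time"]).any():
        problems.append("a source event from the label period fed a feature")
    return problems

# Row 2 is leaky on both counts: its snapshot and its latest source event
# are not strictly before the label time.
rows = pd.DataFrame({
    "feature_as_of": pd.to_datetime(["2024-05-01", "2024-05-02"]),
    "max_source_event_time": pd.to_datetime(["2024-04-30", "2024-05-03"]),
    "label_time": pd.to_datetime(["2024-05-02", "2024-05-02"]),
})
print(check_temporal_leakage(rows))
```

Running a check like this in CI for every feature table is a lightweight way to make the "walk through the timeline" habit mechanical.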

Know When Standard Approaches Fail

Don't just memorize scaling techniques; understand when they break. Standard scaling fails with heavy tails, one-hot encoding explodes memory with high cardinality, and target encoding overfits with small categories. Prepare specific examples of when you'd choose alternatives and why.

Connect Features to Business Metrics

For every feature you propose, explain how it connects to the business objective and what could go wrong. A recency feature for engagement prediction makes sense, but what if it discriminates against users in different timezones or usage patterns? Think beyond statistical performance to real-world impact.

How Ready Are You for Feature Engineering Interviews?

Numerical Features and Scaling

You are training a linear model with L2 regularization on a dataset where one feature is annual income in dollars and another is a ratio between 0 and 1. Validation performance is unstable and coefficients look dominated by income. What is the best next step?

Frequently Asked Questions

How deep do I need to go on feature engineering concepts for interviews?

You should be able to explain why a feature helps, how it is computed, and how you would validate it without leaking target information. Expect depth on handling missingness, scaling, encoding, time based features, text features, and interaction features, plus tradeoffs like interpretability versus performance. You should also be ready to discuss feature selection, drift, and how feature engineering changes with model choice such as linear models versus tree based models.

Which companies tend to ask the most feature engineering questions?

Companies with mature ML systems and high data complexity ask it most, including big tech, ad tech, fintech, and marketplaces. You will often see feature engineering emphasized where offline and online consistency matters, like ranking, recommendations, fraud, and pricing. Startups building their first production models also focus on it because feature quality can dominate model choice.

Will I need to code feature engineering in the interview?

Often yes: you may be asked to write SQL to build aggregates, window features, or leakage-safe labels, and sometimes Python to transform data with pandas or sklearn pipelines. You should be able to implement train and validation splits correctly for time series and build features without peeking into the future. Practice with realistic transformations at datainterview.com/coding so you can move quickly from raw tables to model-ready features.

How does feature engineering interviewing differ for Data Scientist vs Machine Learning Engineer roles?

As a Data Scientist, you are usually evaluated on feature ideation, statistical validation, and interpreting feature effects, including ablations and error analysis by segment. As a Machine Learning Engineer, you are more often evaluated on feature pipelines, feature stores, online serving constraints, and reproducibility across training and inference. You should tailor answers toward experimentation and model performance for DS, and toward data contracts, latency, and monitoring for MLE.

How can I prepare for feature engineering interviews if I have no real world experience?

You can build a small end to end project where you create features from messy raw data, then show the lift from baseline to engineered features using a clear evaluation setup. Focus on common patterns like time window aggregates, categorical encoding with rare categories, and text normalization, and document how you avoided leakage. Use datainterview.com/questions to practice explaining your choices and the validation logic in a concise way.

What are common feature engineering mistakes interviewers watch for?

The biggest red flag is data leakage, like using future information in time series features or aggregating over a window that includes the label period. Another common mistake is applying preprocessing incorrectly, such as fitting scalers or encoders on the full dataset instead of training only, or exploding cardinality with naive one hot encoding. You should also avoid creating features that cannot be computed at prediction time, or that are too slow or unstable to serve in production.
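The scaler-fitting mistake above is easiest to avoid with a pipeline that re-fits preprocessing inside each split. A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrong: StandardScaler().fit(X) before splitting leaks test statistics.
# Right: the pipeline fits the scaler on the training split only, and
# tools like cross_val_score re-fit it inside every fold automatically.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
```

The same pattern extends to encoders and imputers: anything with a `fit` step belongs inside the pipeline so it never sees validation or test rows.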


Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn