eBay Machine Learning Engineer at a Glance
Total Compensation
$125k - $380k/yr
Interview Rounds
8 rounds
Difficulty
Levels
T22 - T26
Education
PhD
Experience
0–15+ yrs
eBay's ML engineers rank 2+ billion live listings where every piece of metadata is seller-generated and wildly inconsistent. That single constraint shapes everything about this role: the models you build, the validation cycles you run, and the infrastructure you operate. If you've only worked with clean, first-party catalog data, the adjustment is real.
eBay Machine Learning Engineer Role
Primary Focus
Skill Profile
Math & Stats
High: Strong foundation expected in data analysis, model evaluation, and experimentation methodology (e.g., experiment design, validation, diagnosing regressions). Interview prep sources emphasize probability/statistics coverage. The role isn't explicitly research-heavy, so "expert" depth typically isn't required.
Software Eng
High: The role emphasizes end-to-end productionization, designing reliable components, writing technical build/implementation plans, mentoring and reviewing, and integrating ML into live services; the interview process includes DS&A/coding rounds.
Data & SQL
High: Explicit requirement to design and operate both batch and real-time inference pipelines with guarantees around correctness, reproducibility, and operational stability, turning noisy, large-scale data into reliable signals.
Machine Learning
High: Hands-on applied ML is required: building and deploying predictive modeling solutions, modeling plus experimentation and validation, accuracy evaluation, performance analysis, and iterative improvement in production.
Applied AI
Medium: Some evidence of transformer/modern model knowledge appears in interview experiences; however, the core posting focuses more broadly on predictive modeling and inference pipelines than on explicit GenAI/LLM product work. The estimate is conservative.
Infra & Cloud
High: Production deployment is central (operationalizing models; high-traffic, high-reliability systems preferred). Close collaboration with platform/infra teams and ownership of operational debugging and architectural decisions are emphasized; specific cloud vendor details aren't explicit in the sources.
Business
Medium: Must translate business/product intent into measurable production outcomes and operate from ambiguous problem statements; cross-team alignment and influencing stakeholders are repeatedly emphasized.
Viz & Comms
High: Strong technical communication is explicitly required, including written design documentation and executive-level explanations; the role involves aligning multiple teams and providing technical clarity.
What You Need
- Production ML model development and deployment (predictive modeling solutions)
- Model evaluation, performance analysis, and regression diagnosis
- Experimentation methodology (design, validation, metrics, A/B-style thinking)
- Batch and real-time inference pipeline design and operation
- Data analysis on imperfect/large-scale datasets; signal extraction from noisy data
- Production-safe ML workflows enabling rapid experimentation
- Integration of ML outputs into live systems with platform/application teams
- Technical design documentation and implementation planning
- Debugging and operational support for ML-based services
- Cross-functional collaboration and influence across engineering/data/product
Nice to Have
- Operationalizing ML in high-traffic, high-reliability systems
- Distributed data processing familiarity (e.g., Spark-like patterns; inferred from the posting's "distributed data processing" preference, specific tech varies)
- Scalable inference architectures
- Applied ML for classification/inference/decision-support systems
- Ownership of shared ML workflows/platforms used by multiple teams
You'll own production ML models for one of eBay's core surfaces: search ranking, promoted listings, recommendations, or fraud detection. These models serve a marketplace where sellers create their own listings with unpredictable quality, so your day-to-day involves as much data debugging and production validation as it does modeling. Year-one success looks like shipping a model change that moves an online metric (search relevance, ad click-through, fraud catch rate) while earning enough operational trust to own your deployment pipeline end-to-end.
A Typical Week
A Week in the Life of an eBay Machine Learning Engineer
Typical T24 (Senior) workweek · eBay
Weekly time split
Culture notes
- eBay runs at a steady large-company pace — weeks are structured but not frantic, and most engineers protect focused afternoon blocks for deep work without guilt.
- eBay requires hybrid in-office attendance (typically Tuesday through Thursday at the San Jose campus), with Monday and Friday commonly worked from home.
The ratio of production validation to model architecture work will surprise you. Tracing why a listing quality score returns stale values from the feature store, or investigating why item condition fields are blank across an entire category like Collectibles, eats more hours than hyperparameter tuning. The mid-week deep modeling block (refactoring a two-tower retrieval model, adjusting negative sampling) is real, but you earn it by clearing Monday's operational debt first.
Projects & Impact Areas
Search ranking is the gravitational center, where retrieval and re-ranking models must handle seller-created inventory with no standardized catalog to fall back on. Promoted listings ML has grown increasingly strategic as eBay's advertising revenue expands, and those models must balance ad relevance against organic results without eroding buyer trust. Trust & Safety teams (based in Toronto and San Jose) tackle a different flavor of the problem: fraud detection where adversaries actively adapt, forcing retraining and monitoring cycles far tighter than what search ranking requires.
Skills & What's Expected
Deployment and infrastructure skills matter more here than at most peer companies. The skill profile above rates infrastructure/cloud deployment as "high" and notes that specific cloud vendor details aren't standardized across teams, so you can't assume a single managed platform in system design answers. Modern AI and GenAI knowledge is rated "medium," not negligible: eBay has shipped AI-powered listing tools, and the role specialization explicitly includes transformer-based models, so dismissing GenAI entirely would be a mistake. The sweet spot is someone who pairs classical ranking and recommendation depth with real production engineering chops (pipeline orchestration, inference serving, monitoring).
Levels & Career Growth
eBay Machine Learning Engineer Levels
Each level has different expectations, compensation, and interview focus.
$105k
$12k
$8k
What This Level Looks Like
Contributes to a single ML component or small end-to-end feature (model, data pipeline, evaluation, or serving integration) within an established ML system. Impact is typically limited to one team’s roadmap and measured via offline metrics and small online experiments under guidance.
Day-to-Day Focus
- Strong software engineering fundamentals (readability, testing, reliability)
- ML fundamentals (supervised learning, evaluation, overfitting, feature leakage)
- Data proficiency (SQL, data validation, pipeline basics)
- Production awareness (latency, scalability basics, monitoring, rollback)
Interview Focus at This Level
Emphasis on coding ability (Python and/or another backend language), core ML knowledge (model selection, evaluation, leakage, bias/variance), practical data/SQL skills, and ability to reason about taking a model from notebook to production with basic MLOps hygiene (testing, monitoring, reproducibility). System design is lightweight and focused on simple ML pipeline/serving components rather than large-architecture ownership.
Promotion Path
Promotion to the next level typically requires independently owning a well-scoped ML feature end-to-end (data + model + deployment), consistently delivering high-quality production changes, demonstrating strong debugging/operational excellence (monitoring, incident follow-up), improving team velocity via reusable components or automation, and showing growing autonomy in scoping work and communicating tradeoffs to stakeholders.
T24 (Senior) and T25 (Staff) are the levels that appear most often in current external ML job postings. The jump from T24 to T25 is the hardest promotion because it demands multi-team influence and platform-level thinking, not just shipping great models within your own pod. eBay uses "MTS" (Member of Technical Staff) titles alongside T-levels, which confuses external candidates: MTS-1 maps roughly to T23, MTS-2 to T24, and T26 Principal MTS roles show up in active postings with real architectural scope.
Work Culture
eBay's hybrid model has most ML engineers in-office Tuesday through Thursday at the San Jose campus (Austin and Bengaluru are the other major hubs), with Monday and Friday commonly remote. With roughly 12,000 employees globally, you get more direct product influence than you would at a company ten times that size, though internal mobility options are narrower if your team's charter shifts. eBay's mission around "economic opportunity for all" shows up in practice: ML teams actively debate ranking fairness for small sellers rather than optimizing purely for GMV.
eBay Machine Learning Engineer Compensation
eBay does not publicly disclose its RSU vesting schedule. The data that exists for other large tech companies (Amazon's backloaded structure, Google's even quarterly vest) simply can't be projected onto eBay. Ask your recruiter directly about cliff periods, vesting cadence, and, most importantly, refresh grant policies before you sign anything.
The compensation data above shows wide bands at T25 and T26, with a $220k+ spread between min and max total comp. That range is your signal: at Staff and above, there's real room to move. Your strongest lever is anchoring on the specific scope of the role you're being hired for (owning search ranking models across 2B+ listings, for example, or leading fraud detection for a global marketplace) rather than relying on generic competing-offer tactics. Equity and sign-on amounts are where conversations tend to have the most flexibility, so push there with concrete numbers tied to what you'd be leaving behind.
eBay Machine Learning Engineer Interview Process
8 rounds · ~4 weeks end to end
Initial Screen
2 rounds
Recruiter Screen
Kick off with a recruiter conversation focused on role fit, location/remote expectations, and compensation alignment. You’ll walk through your resume with emphasis on ML projects that shipped to production (search, ads, personalization, fraud). Expect light probing on availability, work authorization, and what team/domain you’re targeting inside the marketplace ecosystem.
Tips for this round
- Prepare a 60-second pitch that names 1-2 shipped ML systems, the metric moved (e.g., CTR, NDCG, fraud catch rate), and the scale (QPS, latency, data volume).
- Have a crisp compensation range ready and ask how eBay structures base/bonus/RSUs for this level before giving a hard number.
- Clarify domain preferences early (search ranking vs ads relevance vs trust/fraud) and tie your experience to that area.
- Share concrete tooling you’ve used end-to-end (Python, Spark, Airflow, Kubernetes, Triton/TF-Serving) to signal production maturity.
- Ask about interview format and whether a long final loop is expected so you can plan stamina and scheduling.
Hiring Manager Screen
Next, the hiring manager will dig into how you choose models, handle tradeoffs, and deliver measurable business impact. You’ll be asked to narrate one project deeply: problem framing, offline evaluation, online experimentation, and production constraints like latency and monitoring. The conversation often includes collaboration style with PM/infra and how you handle ambiguous requirements.
Technical Assessment
4 rounds
Coding & Algorithms
Expect a live coding session where you solve one or two problems under time pressure, typically in Python. The interviewer will care about correctness, complexity, and how you communicate your approach while debugging. Some prompts may be ML-adjacent (ranking, retrieval, string/array processing) but are graded like standard SWE problems.
Tips for this round
- Practice writing clean Python with tests-in-head: edge cases, empty inputs, duplicates, and time/space complexity commentary.
- Use a consistent problem-solving template: restate, list constraints, propose approach, then code and validate with examples.
- Be fluent with hash maps, heaps, two pointers, BFS/DFS, and sorting-based techniques since these often appear in ranking/relevance contexts.
- Narrate tradeoffs and add quick sanity checks (assertions, small dry-runs) to reduce mistakes.
- If stuck, propose a baseline solution first, then optimize—showing progression matters in ambiguous setups.
SQL & Data Modeling
You’ll be given marketplace-style tables (listings, clicks, purchases, impressions, users) and asked to write SQL to compute metrics and cohorts. The interviewer may push on joins, window functions, deduping, and handling missing or delayed events. A small portion can include how you’d model event data for experimentation or feature generation.
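The round itself runs in SQL, but since this guide's code is Python, here is the same dedupe-with-window pattern as a PySpark sketch; `events_df`, `event_id`, and `ingest_ts` are made-up names for illustration:

from pyspark.sql import functions as F, Window

# Keep only the latest-ingested copy of each event (handles retries and delayed duplicates).
w = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").desc())
deduped = (
    events_df
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn")
)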
Statistics & Probability
The interviewer will probe your applied stats foundation, often through A/B testing and inference questions tied to search/ads/product changes. You may be asked to choose metrics, estimate sample sizes, interpret confidence intervals, and handle multiple testing or novelty effects. Expect some ambiguity—part of the evaluation is how you clarify assumptions and choose a defensible method.
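As a concrete warm-up, here is a minimal Python sketch of the back-of-envelope sample-size math interviewers often expect, using the standard two-proportion approximation (the baseline rate and lift in the example are made up):

from math import ceil
from statistics import NormalDist

def min_sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = p_base * rel_lift              # absolute lift to detect
    p_bar = p_base + delta / 2             # average rate across arms under H1
    variance = 2 * p_bar * (1 - p_bar)     # variance of the difference estimator
    return ceil(variance * (z_alpha + z_beta) ** 2 / delta ** 2)

# e.g., detecting a 5% relative lift on a 2% CTR baseline:
print(min_sample_size_per_arm(0.02, 0.05))  # on the order of ~315k users per arm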
Machine Learning & Modeling
A dedicated ML round typically focuses on model choice, feature engineering, offline/online evaluation, and practical constraints in a large marketplace. You’ll discuss ranking/recommendation patterns like candidate generation + re-ranking, handling implicit feedback, and combating bias or feedback loops. The interviewer may also ask how you would optimize inference latency (e.g., batching, quantization) and monitor drift in production.
Onsite
2 rounds
System Design
During the final loop, you’ll be asked to design an end-to-end ML system such as real-time search ranking, ads relevance, or fraud detection. Expect questions on data ingestion, feature stores, training cadence, online serving, latency budgets, and experimentation/rollout strategy. The interviewer will challenge scalability and reliability choices, especially around real-time inference and A/B testing infrastructure.
Tips for this round
- Start with requirements: target metric, latency/QPS, freshness needs, and failure modes (fallback ranking, circuit breakers).
- Propose a two-stage architecture (retrieval + ranking) and specify what runs online vs offline with clear boundaries.
- Include an experimentation plan: logging, exposure attribution, guardrails, and rollback criteria.
- Address observability: data quality checks, drift, model performance by segment, and alerting tied to business KPIs.
- Mention inference optimization techniques (caching, batching, distillation, quantization, Triton-style serving) when latency is tight.
Behavioral
Close out with a behavioral interview that emphasizes collaboration, ownership, and how you operate in fast feedback loops. You’ll be evaluated on conflict resolution, influencing without authority, and delivering under ambiguity—common in cross-functional ML work. Expect follow-ups that test depth and authenticity, especially around failures and learning.
Tips to Stand Out
- Tell one end-to-end production story. Pick a flagship ML project and be able to explain data collection, labeling, leakage prevention, offline metrics, online A/B design, deployment, monitoring, and rollback with numbers (traffic, latency, lift).
- Bias toward marketplace-relevant examples. Frame answers around search relevance, ads ranking, personalization, or fraud/trust, using metrics like CTR/CVR, NDCG, GMV, chargeback rate, and seller/buyer experience guardrails.
- Prepare for ambiguity in prompts. Practice clarifying questions (baseline rates, constraints, attribution windows, data availability) because candidates report unclear problem framing and you’ll be graded on how you impose structure.
- Be fluent in experimentation details. Expect to discuss sample size/power, novelty effects, multiple testing, and segmentation; be ready with variance reduction ideas (CUPED/stratification; see the CUPED sketch after this list) and robust guardrails.
- Rehearse ML system design with latency constraints. Marketplace systems often require real-time inference; practice architectures with retrieval+ranking, feature stores, streaming updates, caching, and inference optimization.
- Communicate while coding. In live coding, narrate tradeoffs, write small checks, and keep code readable; many rejections come from silent debugging and missed edge cases rather than lack of knowledge.
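For the variance-reduction tip above, a minimal CUPED sketch; x is a pre-experiment covariate such as each user's pre-period value of the metric (all names illustrative):

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    theta = cov_xy / var_x
    # Adjusted metric keeps the same mean but has lower variance whenever x predicts y.
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]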
Common Reasons Candidates Don't Pass
- ✗Unstructured answers under ambiguity. Candidates get rejected when they jump into solutions without clarifying assumptions, success metrics, or constraints, leading to misaligned designs and incorrect conclusions.
- ✗Weak production/MLOps depth. Not being able to discuss monitoring, drift, data quality, rollout strategies, or latency/QPS tradeoffs signals notebook-only experience and is a frequent blocker for ML engineer roles.
- ✗Experimentation gaps. Misinterpreting A/B results, ignoring guardrails, or failing to address multiple testing/peeking makes it hard to trust you with marketplace-impacting launches.
- ✗Coding fundamentals issues. Struggling with standard data structures, complexity analysis, or edge cases in live sessions can outweigh strong ML knowledge.
- ✗Metrics not tied to business impact. Talking only about model accuracy without connecting to online KPIs (CTR/CVR/GMV, fraud loss, user trust) suggests poor product sense for applied ML.
Offer & Negotiation
For Machine Learning Engineer offers at a large public tech company like eBay, compensation is typically a mix of base salary, annual cash bonus (often tied to company and individual performance), and RSUs that vest over multiple years (commonly 4 years with periodic vesting). The most negotiable levers are base (within level band), initial equity/RSU grant, sign-on bonus (especially if you’re leaving unvested equity), and sometimes level/title if interview feedback supports it. Ask for the exact leveling, equity vesting schedule, and bonus target, then negotiate by anchoring on competing offers and the scope/impact of the role (e.g., owning ranking models in a high-traffic surface) while staying within bands.
The Hiring Manager Screen is where most candidates lose momentum. eBay's ML hiring managers tend to probe deeply on one past project, asking you to walk through offline evaluation, online A/B results, and production constraints like latency budgets or monitoring. From what candidates report, surface-level answers about "building a model" without specific metrics (NDCG lift, fraud catch rate, serving latency) make it hard to advance past this stage.
The SQL & Data Modeling round is the other common stumble point, especially for ML engineers who've lived in notebook environments. eBay's marketplace data involves massive event tables (impressions, clicks, purchases across 2B+ listings) and interviewers expect fluent window functions and complex joins under time pressure. If you're strong on ML but rusty on SQL, that gap alone can cost you the offer, so budget real prep time there.
eBay Machine Learning Engineer Interview Questions
ML System Design (Ranking/Ads/Personalization)
Expect questions that force you to design an end-to-end ML-powered service (candidate generation + ranking, feature computation, online/offline parity) under real marketplace constraints like latency, freshness, and scale. You’ll be evaluated on tradeoffs, failure modes, and how you make the system measurable and debuggable.
Design an online search ranking service for eBay where query, user, and listing features come from both real-time events (clicks, add-to-cart) and batch aggregates, with a hard $\mathrm{p99} \le 120\ \mathrm{ms}$ latency budget for the ranking call. What is your feature computation and serving plan to guarantee offline/online parity and safe fallbacks when features are missing or stale?
Sample Answer
Most candidates default to a single feature store call and assume the same features exist offline and online, but that fails here because latency and freshness constraints force different pipelines and create silent training/serving skew. You separate features into tiers: cheap request-time features computed in-process, precomputed per-user and per-listing features served from low-latency stores, and expensive cross-features approximated or dropped online. You enforce parity with a shared feature spec and logging of resolved feature values, plus offline backfills that replay the same resolution logic. You ship degradations: default values with missingness indicators, cached last-known values with TTLs, and a rules-based safe ranker when critical features fail.
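A minimal sketch of that degradation logic, assuming a hypothetical key-value store client with a `.get` API and made-up feature names:

import time

FEATURE_DEFAULTS = {"seller_rating": 0.0, "item_ctr_7d": 0.0}  # hypothetical features

def resolve_feature(name, online_store, local_cache, ttl_s=300):
    """Resolve one feature with fallbacks; returns (value, missing_indicator)."""
    value = online_store.get(name)                 # low-latency store (assumed API)
    if value is not None:
        local_cache[name] = (value, time.time())
        return value, 0.0
    cached = local_cache.get(name)
    if cached is not None and time.time() - cached[1] < ttl_s:
        return cached[0], 0.0                      # last known value within TTL
    # Default plus a missingness indicator so the model can learn to discount it.
    return FEATURE_DEFAULTS.get(name, 0.0), 1.0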
You own the eBay Promoted Listings auction ranking model, and a new transformer-based model increases offline NDCG but drops revenue per mille and increases user ad hides by $+3\%$ relative in A/B. Design the end-to-end debugging plan across data, modeling, and serving to find the root cause and decide whether to ship, roll back, or gate traffic.
MLOps & Production Inference
Most candidates underestimate how much the interview probes operational ownership: deployment patterns, rollouts/rollbacks, monitoring, drift/quality checks, and incident response for models in high-traffic services. The focus is on making model updates safe while keeping iteration speed high.
Your real-time search ranking model for eBay listings is rolled out to 10% traffic and CTR is flat, but conversion rate drops 0.8% and p99 latency increases by 25 ms. What production monitoring and rollback gates do you put in place to catch this within 10 minutes, and how do you decide whether to auto-rollback?
Sample Answer
Set SLO-based auto-rollback gates on conversion rate (primary), p99 latency, and critical error rates, then page on-call when any gate breaches for 2 to 3 consecutive windows. CTR is not a safety gate here because it can stay flat while you harm downstream conversion or buyer experience. Use near-real-time metrics keyed by experiment cell, query class, and device, with 1-to-2-minute aggregation and guardrails like $\Delta\mathrm{CVR} < -\tau$ and $\mathrm{p99} > \mathrm{p99}_{\mathrm{baseline}} + \delta$. Add a canary holdback plus a hard stop if latency or error-budget burn exceeds threshold, because infra regressions can mask model quality signals.
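A minimal sketch of the consecutive-window gate logic described above; the thresholds stand in for the $\tau$ and $\delta$ values you would tune per surface:

from collections import deque

class RollbackGate:
    """Trip when a guardrail breaches for k consecutive aggregation windows."""

    def __init__(self, max_rel_cvr_drop=0.005, max_p99_delta_ms=20.0, k=3):
        self.max_rel_cvr_drop = max_rel_cvr_drop   # tau: relative CVR drop allowed
        self.max_p99_delta_ms = max_p99_delta_ms   # delta: p99 increase allowed (ms)
        self.recent = deque(maxlen=k)

    def observe(self, cvr_treat, cvr_ctrl, p99_treat_ms, p99_ctrl_ms):
        rel_drop = (cvr_ctrl - cvr_treat) / cvr_ctrl if cvr_ctrl > 0 else 0.0
        breach = (rel_drop > self.max_rel_cvr_drop
                  or (p99_treat_ms - p99_ctrl_ms) > self.max_p99_delta_ms)
        self.recent.append(breach)
        # Auto-rollback only on k consecutive breaches to avoid paging on noise.
        return len(self.recent) == self.recent.maxlen and all(self.recent)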
A transformer-based reranker for search is served online, and you also run a nightly batch inference job to backfill features for offline evaluation, but you see a persistent offline-online metric gap. How do you redesign the inference and feature logging so the offline evaluation matches online behavior while keeping iteration speed high?
Machine Learning (Applied Modeling & Evaluation)
Your ability to choose models, features, and metrics for ranking/ads/recs/trust & safety is tested through practical scenarios (sparse categorical signals, cold start, imbalance, delayed labels). You’ll need to explain validation strategy, diagnose regressions, and link offline metrics to online outcomes.
You are shipping a new learning-to-rank model for eBay Search and offline NDCG@10 improves by 1.2%, but the model reduces long-tail item exposure. What offline metric set and validation slice strategy do you use to decide whether to launch, and why?
Sample Answer
You could optimize on a pure relevance metric like NDCG@10, or you could use a balanced scorecard that includes relevance plus marketplace health metrics (long-tail exposure, coverage, seller diversity). A pure relevance focus wins if the business goal is narrowly clicks on head queries, but the scorecard wins here because eBay is a two-sided marketplace and long-tail exposure is a first-order constraint. Validate by slicing on query frequency buckets (head, torso, tail), cold-start items, and seller cohorts so you can see where the gain comes from and where you are harming the ecosystem. Ship only if gains are not concentrated in one bucket while tail exposure regresses beyond a predefined guardrail.
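A tiny sketch of the slicing mechanics, assuming a metric function with the same per-query input shape as the NDCG code later in this guide and a hypothetical query_bucket mapping:

from collections import defaultdict

def metric_by_slice(metric_fn, per_query_results, query_bucket):
    """Evaluate a ranking metric separately per slice (head/torso/tail, cold-start, ...)."""
    grouped = defaultdict(dict)
    for qid, items in per_query_results.items():
        grouped[query_bucket[qid]][qid] = items   # e.g., query_bucket["q1"] == "tail"
    return {bucket: metric_fn(queries) for bucket, queries in grouped.items()}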
You are training an ads CTR model where clicks are rare and conversions arrive with a 7-day delay, but product wants a decision in 48 hours. How do you evaluate candidate models offline so the metric correlates with online revenue without leaking future information?
A transformer-based query/item encoder for Search improves offline MRR, but online you see higher bounce rate and lower add-to-cart for mobile users. What is your regression diagnosis plan, and which hypotheses do you test first?
Data Pipelines & Feature Engineering at Scale
Rather than trivia, you’re judged on whether you can build reproducible batch + streaming pipelines that keep training and serving consistent (feature stores, backfills, late data, idempotency). Candidates often struggle to articulate data contracts, lineage, and correctness guarantees.
Your search ranking model uses a feature "seller_7d_cancel_rate" computed daily in batch, but online inference reads real-time cancellations from Kafka, and you see an offline AUC lift with an online CTR drop. What exact checks and pipeline changes do you make to prove, and then eliminate, training/serving skew for this feature?
Sample Answer
Start by validating that the feature definitions match byte-for-byte: same numerator, denominator, time zone, and inclusion rules. Then compare offline feature values against online logged feature values for the same $(user, item, timestamp)$ samples. Next, check time-travel correctness: the batch job likely uses a full day of data while streaming uses event time with late arrivals, so your training set may be leaking future cancels relative to the impression time. Then enforce a single source of truth: either an online feature store with point-in-time reads for training, or logged features from serving used directly for training. Finally, add automated skew monitors, distribution drift checks, and a canary that diffs batch versus streaming aggregates on a rolling window with thresholds that page.
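A sketch of the value-diff step in that plan, assuming both sides can be keyed by the same (user, item, timestamp) tuple:

def feature_parity_report(offline, logged, rel_tol=1e-6):
    """Diff offline-recomputed feature values against values logged at serving time."""
    mismatches = missing = 0
    for key, offline_val in offline.items():
        logged_val = logged.get(key)
        if logged_val is None:
            missing += 1                      # served request never logged the feature
        elif abs(offline_val - logged_val) > rel_tol * max(1.0, abs(logged_val)):
            mismatches += 1                   # definition or timing disagreement
    n = max(len(offline), 1)
    return {"n": len(offline), "mismatch_rate": mismatches / n, "missing_rate": missing / n}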
You are building an ads CTR model and need "user_30d_ad_clicks" with late click events arriving up to 48 hours late and duplicates due to retries. Write a Spark-style PySpark job that produces a daily feature table that is idempotent and correct by event time, not processing time.
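One hedged sketch of the shape an answer could take: dedupe on a stable event ID, aggregate by event time, and overwrite a single dated partition so reruns are idempotent. The paths, column names, and the convention of running day D two days late (so events up to 48h late have landed) are all assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user_30d_ad_clicks").getOrCreate()

def build_user_30d_ad_clicks(run_date):
    events = (
        spark.read.parquet("s3://bucket/ad_click_events/")        # hypothetical path
        .where(F.col("event_date").between(
            F.date_sub(F.lit(run_date), 29), F.lit(run_date)))    # event time, not processing time
        .dropDuplicates(["event_id"])                             # retries reuse event_id (assumed)
    )
    daily = (
        events.groupBy("user_id")
        .agg(F.count("*").alias("user_30d_ad_clicks"))
        .withColumn("feature_date", F.lit(run_date))
    )
    # Overwriting one dated path makes the job rerun-safe (idempotent per partition).
    out = f"s3://bucket/features/user_30d_ad_clicks/feature_date={run_date}"
    daily.write.mode("overwrite").parquet(out)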
A fraud model uses a feature store with offline backfills, and you discover that some training examples include features computed after the transaction time because of a backfill bug. How do you redesign the data contract and point-in-time join so this cannot happen again, and what tests catch it before training starts?
Experimentation & Metrics (A/B, Marketplace Constraints)
The bar here isn’t whether you know A/B testing terms, it’s whether you can design experiments for search/ads systems with interference, seasonality, and multiple objectives (CTR vs revenue vs buyer/seller trust). You’ll be pushed to define guardrails, success criteria, and debugging steps when results disagree.
You run an A/B test on eBay search ranking that increases CTR by 0.8% but decreases GMV per search by 0.3% and increases return rate by 0.1 pp. What is your primary success metric, what are your guardrails, and what decision do you ship?
Sample Answer
This question is checking whether you can translate business intent into a measurable win condition, then stop the experiment from optimizing the wrong thing. Pick one primary metric tied to the goal (for search, typically GMV per search or buyer conversion), and treat CTR as a diagnostic, not the objective. Add guardrails like return rate, cancel rate, buyer complaints, latency, and zero-results rate, and require the primary metric to win while guardrails stay within pre-set thresholds. If the primary metric loses or a trust metric regresses, you do not ship, you segment and diagnose.
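The decision rule in that answer compresses to something like the following; the threshold names and values are placeholders:

def launch_decision(primary_ci_low, guardrail_deltas, guardrail_floors):
    """Ship only if the primary metric clearly wins and no guardrail breaches.

    Deltas are oriented so more negative = worse; floors are the worst acceptable deltas.
    """
    if primary_ci_low <= 0.0:
        return "do_not_ship"                  # no statistically clear win on the primary
    for name, delta in guardrail_deltas.items():
        if delta < guardrail_floors[name]:
            return "segment_and_diagnose"     # trust/health regression: dig in, don't ship
    return "ship"

# e.g., GMV/search CI excludes 0 but return rate regressed past its floor:
print(launch_decision(0.001, {"return_rate": -0.001}, {"return_rate": -0.0005}))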
You A/B test a new ad ranking model, randomizing at user level, and see a strong lift in ad revenue but a drop in seller organic impressions and seller NPS. How do you redesign the experiment to handle marketplace interference between buyers, sellers, and auctions?
A search ranking experiment shows no overall GMV lift, but large positive lift on mobile and negative lift on desktop, and the result flips sign after you exclude the first 48 hours of traffic. How do you debug whether this is a real interaction, a logging issue, or novelty and learning effects, and what analysis do you trust?
ML Coding (Python for Modeling/Signals)
Coding prompts typically mirror day-to-day applied work: compute ranking metrics, build a simple training/eval loop, implement calibration or thresholding, or process logged events into features. Clear, correct code plus reasoning about edge cases matters more than fancy algorithms.
You are evaluating an eBay search ranker and have per-query graded relevance labels $y \in \{0,1,2,3\}$ for returned items. Write a function to compute $\mathrm{NDCG@k}$ averaged over queries, handling ties and queries with all-zero relevance safely.
Sample Answer
The standard move is to compute $\mathrm{DCG@k}=\sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)}$ and divide by the ideal $\mathrm{IDCG@k}$ per query, then average over queries. But here, the all-zero (or empty) query matters because $\mathrm{IDCG@k}=0$ makes the ratio undefined, so you must decide and implement a consistent policy (typically return $0.0$ for that query).
from __future__ import annotations

import math
from typing import Dict, List, Sequence, Tuple


def ndcg_at_k(
    per_query_results: Dict[str, Sequence[Tuple[float, int]]],
    k: int,
    *,
    zero_idcg_policy: str = "zero",
) -> float:
    """Compute mean NDCG@k across queries.

    Args:
        per_query_results: Mapping query_id -> sequence of (score, relevance).
            The sequence can be in any order, it will be sorted by score desc.
        k: Cutoff.
        zero_idcg_policy: What to return for a query with IDCG@k == 0.
            "zero" -> NDCG = 0.0, "skip" -> exclude from mean.

    Returns:
        Mean NDCG@k.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    def dcg(rels: List[int]) -> float:
        total = 0.0
        for i, rel in enumerate(rels[:k], start=1):
            gain = (2 ** rel) - 1
            denom = math.log2(i + 1)
            total += gain / denom
        return total

    ndcgs: List[float] = []

    for qid, items in per_query_results.items():
        # Sort by model score descending. Stable sort helps deterministic tie handling.
        sorted_by_score = sorted(items, key=lambda x: x[0], reverse=True)
        rels_ranked = [rel for _, rel in sorted_by_score]
        rels_ideal = sorted([rel for _, rel in items], reverse=True)

        dcg_k = dcg(rels_ranked)
        idcg_k = dcg(rels_ideal)

        if idcg_k == 0.0:
            if zero_idcg_policy == "skip":
                continue
            if zero_idcg_policy == "zero":
                ndcgs.append(0.0)
                continue
            raise ValueError(f"Unknown zero_idcg_policy: {zero_idcg_policy}")

        ndcgs.append(dcg_k / idcg_k)

    return float(sum(ndcgs) / len(ndcgs)) if ndcgs else 0.0


if __name__ == "__main__":
    # Example usage
    data = {
        "q1": [(0.9, 3), (0.2, 0), (0.1, 2)],
        "q2": [(0.3, 0), (0.2, 0)],  # all-zero relevance
    }
    print(ndcg_at_k(data, k=3))

For an eBay Trust and Safety model, you have predicted fraud probabilities $p$ and observed labels $y \in \{0,1\}$; implement temperature scaling to calibrate probabilities by minimizing negative log-likelihood on a validation set. Return calibrated probabilities for a new batch, and keep it numerically stable for $p$ near $0$ or $1$.
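A minimal pure-Python sketch of one way to implement it: map probabilities to logits with clamping, search the temperature on validation NLL with a coarse-then-fine grid, then apply it (the search ranges are arbitrary choices):

import math

def _nll(probs, labels, temp, eps=1e-12):
    """Mean negative log-likelihood of temperature-scaled probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)            # clamp so log/logit stay finite
        logit = math.log(p / (1.0 - p))
        q = 1.0 / (1.0 + math.exp(-logit / temp))
        q = min(max(q, eps), 1.0 - eps)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(probs)

def fit_temperature(probs, labels):
    """Coarse log-spaced grid over T in ~[0.05, 20], then a fine local refinement."""
    coarse = [10 ** (k / 10.0) for k in range(-13, 14)]
    t0 = min(coarse, key=lambda t: _nll(probs, labels, t))
    fine = [t0 * (1.0 + s / 100.0) for s in range(-9, 10)]
    return min(fine, key=lambda t: _nll(probs, labels, t))

def calibrate(probs, temp, eps=1e-12):
    """Apply a fitted temperature to a new batch of probabilities."""
    out = []
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        logit = math.log(p / (1.0 - p))
        out.append(1.0 / (1.0 + math.exp(-logit / temp)))
    return out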
You are joining logged impressions, clicks, and purchases for eBay promoted listings into a training set for a CTR model, where each row has (user_id, item_id, ts, event_type) and multiple events can occur per (user_id, item_id). Write Python to build one example per impression with labels click_within_30m and purchase_within_24h, using only events with $ts' > ts_{impression}$ to avoid leakage.
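A sketch of the leakage-safe join using per-(user, item) sorted event times and a strict inequality on the impression timestamp; the input shapes (separate event lists of (user_id, item_id, ts) tuples, ts in epoch seconds) are assumptions:

from bisect import bisect_right
from collections import defaultdict

def build_examples(impressions, clicks, purchases):
    """One training row per impression; labels use only events strictly after ts."""
    def index_by_pair(events):
        idx = defaultdict(list)
        for user_id, item_id, ts in events:
            idx[(user_id, item_id)].append(ts)
        for key in idx:
            idx[key].sort()
        return idx

    click_idx, purchase_idx = index_by_pair(clicks), index_by_pair(purchases)

    def any_event_within(idx, key, ts, horizon_s):
        times = idx.get(key, [])
        pos = bisect_right(times, ts)        # first event with ts' > ts (strict, no leakage)
        return pos < len(times) and times[pos] <= ts + horizon_s

    rows = []
    for user_id, item_id, ts in impressions:
        key = (user_id, item_id)
        rows.append({
            "user_id": user_id, "item_id": item_id, "ts": ts,
            "click_within_30m": int(any_event_within(click_idx, key, ts, 30 * 60)),
            "purchase_within_24h": int(any_event_within(purchase_idx, key, ts, 24 * 3600)),
        })
    return rows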
eBay's question mix rewards candidates who can trace a model from whiteboard sketch all the way through a canary rollout on live marketplace traffic, then diagnose why conversion dropped even though CTR held steady. Where this gets uniquely hard is the interplay between design and operations: a search reranker prompt doesn't end when you draw the architecture, because interviewers will pivot to asking how you'd detect training/serving skew when seller-generated listing metadata shifts daily in ways a curated catalog never would. The prep mistake that costs the most time is drilling applied modeling theory in isolation while ignoring the experimentation scenarios where eBay's two-sided marketplace creates interference (ranking changes alter seller pricing behavior, contaminating your control group) and the pipeline scenarios where late-arriving click data breaks feature parity.
Sharpen your prep with eBay-style ML interview questions at datainterview.com/questions.
How to Prepare for eBay Machine Learning Engineer Interviews
Know the Business
Official mission
“We connect people and build communities to create economic opportunity for all.”
What it actually means
eBay's real mission is to facilitate global commerce by connecting millions of buyers and sellers, providing a platform for economic opportunity, and offering a vast and unique selection of goods. It aims to be the preferred destination for discovering value and unique items, particularly focusing on enthusiast buyers and high-value categories.
Key Business Metrics
$11B revenue
+15% YoY
$39B
+26% YoY
12K employees
-6% YoY
Current Strategic Priorities
- Transform through innovation, investment, and powerful tools designed to fuel sellers’ growth
- Accelerate innovation using AI to make selling smarter, faster, and more efficient
- Enhance trust throughout the marketplace
- Connect the right buyers to unique inventory
- Create more personalized, inspirational shopping experiences for all
eBay's Q4 2025 earnings show $11.1 billion in revenue (up 15% YoY), and the company's north star priorities center on AI-powered seller tools, search personalization, and trust. Their engineering team published a candid breakdown of GenAI's actual impact on developer productivity, measuring what works and what doesn't rather than chasing hype. Read it before your loop, because eBay's ML culture skews empirical, and interviewers notice when candidates can speak to that measurement-first posture.
The "why eBay" answer that falls flat is any version of "marketplace scale excites me." What separates strong candidates: connecting your experience to a problem only eBay's two-sided marketplace creates. Maybe you've dealt with ranking under inconsistent metadata (eBay's 2B+ seller-generated listings have wildly uneven quality), or you've run A/B tests where treatment effects leak across user groups the way seller behavior shifts do when eBay changes ranking. eBay also designs its own server hardware and has open-sourced those designs, so if you've optimized model serving under hardware constraints rather than just scaling up cloud instances, say so.
Try a Real Interview Question
Streaming AUC for Click Prediction
Implement a function that computes ROC AUC for a binary label stream given $y_i \in \{0,1\}$ and predicted score $s_i \in \mathbb{R}$, where ties in $s$ must be handled by assigning average rank. Return the AUC as a float in $[0,1]$, and return $0.5$ if all labels are the same. Your implementation must be $O(n\log n)$ time and should not use external libraries.
def roc_auc(y_true, y_score):
    """ROC AUC with tie-aware average ranks; returns 0.5 if only one class is present."""
    # One possible solution via the rank-sum (Mann-Whitney U) identity; O(n log n) from the sort.
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1.0          # average 1-based rank for the tied block
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
700+ ML coding problems with a live Python executor.
eBay's ML coding round leans toward problems where you process marketplace signals or build scoring logic in Python, not isolated algorithm exercises. The widget above gives you a feel for that flavor. Build your muscle memory with more ML-oriented coding problems at datainterview.com/coding.
Test Your Readiness
How Ready Are You for eBay Machine Learning Engineer?
Question 1 of 10: Can you design a ranking system for eBay search that combines candidate generation, feature retrieval, and learning to rank, and explain how you would optimize for both buyer satisfaction and marketplace health?
This quiz covers the mix of ranking, experimentation, and production ML topics that eBay's onsite emphasizes. Identify your weak spots, then go deeper at datainterview.com/questions.
Frequently Asked Questions
How long does the eBay Machine Learning Engineer interview process take?
From first recruiter call to offer, most candidates report the eBay MLE process takes about 4 to 6 weeks. You'll typically have a recruiter screen, a technical phone screen focused on coding and ML basics, and then a virtual or onsite loop with 4 to 5 rounds. Scheduling can stretch things out, especially if the team is busy, so don't be surprised if it takes closer to 7 weeks in some cases.
What technical skills are tested in the eBay Machine Learning Engineer interview?
Python coding is non-negotiable. Beyond that, expect questions on production ML model development and deployment, model evaluation and performance analysis, batch and real-time inference pipeline design, and experimentation methodology like A/B testing. They also care about your ability to work with large, noisy datasets and integrate ML outputs into live systems. At senior levels (T24+), system design for ML becomes a major focus.
How should I tailor my resume for an eBay MLE role?
Lead with production ML experience, not just research or Kaggle projects. eBay wants people who've deployed models, monitored them in production, and debugged issues in live systems. Quantify your impact with metrics (latency improvements, revenue lift, precision/recall gains). Mention Python explicitly, and call out experience with feature engineering, experimentation design, or real-time inference if you have it. If you're applying at T25 or T26, highlight cross-functional leadership and end-to-end ownership of ML systems.
What is the total compensation for eBay Machine Learning Engineers by level?
Here's what I've seen in the data. T22 (Junior, 0-2 years): total comp around $125K with a range of $95K to $160K. T23 (Mid, 2-5 years): about $165K, ranging $155K to $190K. T24 (Senior, 6-10 years): roughly $151K to $173K. T25 (Staff, 8-15 years): this is where it jumps, with total comp averaging $380K and a range of $300K to $520K. T26 (Principal): averages $276K but can reach $526K at the top end. Base salaries range from about $105K at junior to $254K at principal.
How do I prepare for the behavioral interview at eBay for a Machine Learning Engineer position?
eBay's core values are Customer Focus, Innovate Boldly, Be For Everyone, Deliver With Impact, and Act With Integrity. Prepare stories that map to these. They want to hear about times you shipped something that directly helped users, took a bold technical bet, collaborated across diverse teams, or made a tough ethical call. I'd have at least 6 to 8 stories ready that you can adapt to different prompts. Focus on cross-functional collaboration since MLE work at eBay involves tight partnership with engineering, data, and product teams.
How hard are the coding and SQL questions in the eBay MLE interview?
The coding questions are practical software engineering style, not pure competitive programming. Think data structures, algorithms, and writing clean Python that could go into production. Difficulty is roughly medium, occasionally medium-hard for senior levels. SQL comes up more at junior and mid levels (T22, T23) as part of data skills assessment. You should be comfortable with window functions, joins, and aggregations on messy datasets. Practice at datainterview.com/coding to get a feel for the style.
What ML and statistics concepts should I study for the eBay Machine Learning Engineer interview?
Bias-variance tradeoff comes up constantly, across every level. You also need solid understanding of model evaluation metrics (precision, recall, AUC, calibration), regularization techniques, feature engineering, and common model families (tree-based models, linear models, neural nets). At T23 and above, expect applied case studies where you'd improve a ranking or classification system. For staff and principal levels, they'll test your ability to reason about end-to-end ML systems, training/inference architecture, and production tradeoffs. Check datainterview.com/questions for ML concept practice.
What's the best format for answering behavioral questions at eBay?
Use the STAR format (Situation, Task, Action, Result) but keep it tight. I've seen candidates ramble for 10 minutes on a single story. Aim for 2 to 3 minutes max. Start with a one-sentence setup, spend most of your time on what you specifically did (not the team), and end with a measurable result. At eBay, they value impact and integrity, so always close with what changed because of your work and any lessons learned.
What happens during the eBay Machine Learning Engineer onsite interview?
The onsite (often virtual now) typically has 4 to 5 rounds. Expect at least one pure coding round in Python, one or two ML-focused rounds covering fundamentals and applied problem solving, a system design round (especially at T24 and above where you'll design production ML pipelines), and a behavioral round. At staff and principal levels, the system design round gets much deeper, covering data pipelines, feature management, training/inference architecture, and latency/reliability tradeoffs. There's usually a lunch or informal chat that isn't scored but still matters for culture fit.
What metrics and business concepts should I know for the eBay MLE interview?
eBay is a two-sided marketplace connecting buyers and sellers, generating $11.1B in revenue. You should understand marketplace metrics like GMV (gross merchandise volume), conversion rate, search relevance, and buyer/seller engagement. Know how to think about A/B testing in a marketplace context where treating one side affects the other. They also care about experimentation methodology, so be ready to discuss how you'd design an experiment, pick success metrics, and handle interference effects. Tying your ML solutions back to business outcomes will set you apart.
What education do I need for an eBay Machine Learning Engineer role?
A BS in Computer Science, Engineering, Mathematics, Statistics, or a related field is the baseline. An MS is preferred at most levels, and a PhD is common (though not required) for senior ML roles. That said, eBay explicitly notes that equivalent practical experience is acceptable. If you don't have a graduate degree but have shipped production ML systems and can demonstrate depth in your interviews, you're still a strong candidate. Your portfolio of real work matters more than the degree.
What are common mistakes candidates make in the eBay Machine Learning Engineer interview?
The biggest one I see is treating it like a pure software engineering interview and neglecting the ML depth. eBay wants you to reason about model selection, evaluation pitfalls, data leakage, and production deployment, not just write clean code. Another common mistake is being too theoretical. They care about practical tradeoffs: why would you choose batch over real-time inference, how would you debug a model regression in production. Finally, at senior levels, candidates often underestimate the system design round. Practice designing end-to-end ML systems with real constraints like latency, scale, and data freshness.