Amazon Data Engineer at a Glance
Total Compensation
$194k - $600k/yr
Interview Rounds
8 rounds
Difficulty
Levels
L4 - L8
Education
Bachelor's / Master's / PhD
Experience
0–25+ yrs
Most candidates prep for Amazon's BI Engineer role like it's a backend engineering job. It isn't. You'll own everything from Redshift data models to the QuickSight dashboard a VP checks before making inventory decisions, and the interview loop tests that full spectrum.
Amazon Data Engineer Role
Primary Focus
Skill Profile
Math & Stats
Medium: Strong analytical ability is expected; the work involves trend analysis, anomaly investigation, defining and using metrics, and interpreting large-scale business data. The posting doesn't explicitly require advanced statistical modeling, but interviews often include product/metric-sense case questions (uncertainty: some teams may expect deeper stats depending on scope).
Software Eng
Medium: Scripting (Python) for processing data and automating reporting solutions is required, but the role emphasis is BI/analytics engineering rather than building customer-facing services. Expect solid coding hygiene for maintainable scripts and reproducible analyses.
Data & SQL
High: Explicit requirement for data modeling, warehousing concepts, and building ETL pipelines, plus owning complex reporting and automating reporting solutions. Working in a large cloud-based data lake with curated source-of-truth datasets implies strong dimensional modeling, ETL design, and data quality practices at scale.
Machine Learning
Low: The job description focuses on BI, metrics, reporting automation, and trend analysis rather than ML model development. The team mission mentions predicting customer actions, but the posting does not list ML skills as required (uncertainty: may collaborate with DS/ML rather than build models).
Applied AI
Low: No explicit GenAI/LLM tooling or prompt-engineering requirements in the provided posting or interview materials.
Infra & Cloud
Medium: Preferred experience with AWS services (EC2, DynamoDB, S3, Redshift) and working in a cloud data lake environment; typically more about using managed data services than deploying infrastructure via IaC (not explicitly required).
Business
High: Explicitly calls for outstanding business acumen, partnering with product/business leaders, defining key business questions and KPIs, driving decision-making, and identifying business opportunities; interview prep sources emphasize product-sense/metrics case studies.
Viz & Comms
High: Required experience with Tableau/QuickSight (or similar) and building dashboards and operational/business metrics; strong written and verbal communication is needed to work with business owners and present insights concisely and effectively.
What You Need
- Advanced SQL querying for analytics and reporting (joins, aggregations, validation)
- Data modeling and data warehousing concepts
- ETL pipeline development (building/maintaining pipelines for analytics datasets)
- Dashboarding and data visualization (Tableau, QuickSight, or similar)
- Scripting for data processing/automation (Python)
- Analysis of large-scale datasets to identify trends, anomalies, and customer friction
- Defining KPIs/metrics and translating business questions into datasets
- Stakeholder communication and cross-functional collaboration
Nice to Have
- AWS data ecosystem experience (S3, Redshift, DynamoDB; EC2 mentioned in posting)
- Data mining on large-scale, complex datasets in a business environment
- Designing scalable, explainable metrics and enforcing data quality/source-of-truth standards
- Root cause analysis across upstream/downstream data systems (not explicit in the posting, but common in BIE expectations)
Want to ace the interview?
Practice with real questions.
After year one, the measure of success is whether stakeholders trust your datasets enough to make decisions without pinging you to verify the numbers. That means your Redshift models are documented, your ETL pipelines have data quality checks that actually catch issues before the Monday metrics review, and you've built enough domain context to proactively flag when a KPI shifts unexpectedly. The role's official title is BI Engineer, and the work skews closer to analytics engineering than pure data plumbing.
A Typical Week
A Week in the Life of an Amazon Data Engineer
Typical L5 workweek · Amazon
Weekly time split
Expect to spend more time investigating metric anomalies and talking to stakeholders than writing new pipeline code. Amazon's two-pizza team structure means no analyst sits between you and the product leader asking why conversion dipped, so your communication skills get tested daily, not just during interviews. Data freshness SLAs and pipeline reliability fall squarely on your shoulders, and on small teams that responsibility concentrates fast.
Projects & Impact Areas
One quarter you might build the dimensional model (slowly changing dimensions included) behind a Marketplace seller performance tracking system that product managers check in QuickSight every morning. Simultaneously, a BIE on an AWS Finance team could be migrating legacy Oracle reporting to Redshift Spectrum queries against S3, slashing costs while improving latency. The common thread: you own ingestion, transformation, metric logic, and the dashboard layer, not just one slice of the stack.
Skills & What's Expected
SQL mastery is the single most underestimated requirement. Candidates over-index on Python algorithms, but interview rounds probe Redshift-specific tuning (sort keys, dist keys, VACUUM), window functions, and query debugging under time pressure. Business acumen is the most underrated skill: interviewers hand you a vague prompt like "how would you measure the health of Subscribe & Save?" and expect you to define KPIs, propose a data model, and articulate tradeoffs. ML knowledge is rated low for this role, though understanding how your curated datasets feed systems like SageMaker may come up depending on the team.
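If that vocabulary is rusty, here is a minimal sketch of what those Redshift physical-design choices look like in DDL. Every table and column name below is invented for illustration:

-- Hypothetical table illustrating DISTKEY/SORTKEY choices (names invented).
CREATE TABLE fact_orders (
    order_id       BIGINT,
    customer_id    BIGINT,
    marketplace_id VARCHAR(8),
    order_ts       TIMESTAMP,
    order_total    DECIMAL(12, 2)
)
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (order_ts);     -- let date-range filters prune blocks

-- After heavy deletes or updates, restore sort order and refresh stats.
VACUUM fact_orders;
ANALYZE fact_orders;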
Levels & Career Growth
Amazon Data Engineer Levels
Each level has different expectations, compensation, and interview focus.
What This Level Looks Like
Owns well-scoped analytics problems for a team or feature area; builds and maintains a small set of production datasets/ETL jobs and dashboards with defined SLAs; influences decisions primarily within an immediate team/program through accurate metrics, deep-dives, and experiment/operational reporting.
Day-to-Day Focus
- SQL excellence and correctness (including performance tuning basics)
- Reliable metric definitions and data quality (recon, checks, explainability)
- Building repeatable reporting/dashboards that stakeholders trust
- Clear written and verbal communication of insights and recommendations
- Operating effectively under ambiguity while using established patterns and guidance
Interview Focus at This Level
Emphasizes SQL-heavy technical screens (joins, window functions, CTEs, aggregation, data validation), basic data modeling and ETL/pipeline thinking, and strong behavioral assessment against Amazon Leadership Principles (e.g., Ownership, Customer Obsession, Bias for Action). Expect probing on how you define metrics, ensure data accuracy, communicate insights, and handle ambiguous stakeholder requests.
Promotion Path
To be promoted to L5 (Business Intelligence Engineer II), consistently deliver independent end-to-end analytics solutions (datasets + pipelines + dashboards) with measurable business impact, demonstrate strong ownership of metric quality and operational excellence, raise the bar on design/documentation/testing, and influence beyond a single stakeholder by driving decisions or process improvements across adjacent teams/programs.
Find your level
Practice with questions tailored to your target level.
The widget shows the level bands, but here's the career insight that matters most: the L5-to-L6 jump is where the promotion path narrows, and the blocker is almost always influence scope. At L5 you own your team's datasets reliably; L6 demands cross-team metric governance and evidence that your data products changed how an org makes decisions. Also worth knowing: Amazon's leveling tends to run lower than peer companies, so compare total compensation rather than title when evaluating offers.
Work Culture
Amazon's Leadership Principles aren't wall art. "Dive Deep" and "Ownership" are the literal rubric on your performance review, and they shape the BIE experience more than any technology choice. Frugality is genuine: justifying your Redshift cluster spend shows up in performance goals, and pipeline cost optimization is a real metric your manager tracks. Work arrangement details (remote, hybrid, in-office) vary by org and location, so confirm the policy for your specific team during the recruiter screen.
Amazon Data Engineer Compensation
Amazon's Data Engineer roles often carry Business Intelligence Engineer titles (BIE I through Senior Principal BIE), so don't be confused when your offer letter doesn't say "Data Engineer." The RSU vesting schedule is commonly described as 5/15/40/40 over four years, though Amazon doesn't publish this officially and your offer terms may vary. Sign-on bonuses paid across Years 1 and 2 are designed to bridge the gap, but from what candidates report, the front-loaded cash still doesn't fully match the back-loaded equity, creating a real incentive to stay through Year 3.
Your single biggest lever in negotiation is the sign-on bonus, specifically the Year 1 and Year 2 cash amounts. Amazon's base salary bands are rigid at each level, and RSU grants move only modestly without a competing offer. Frame your counter around total first-year cash (base + sign-on + vested equity) rather than annualized total comp, because that's the number where the gap bites and where recruiters at Amazon have the most flexibility to adjust.
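To see why first-year cash is where the gap bites, run the arithmetic. A quick sketch with purely hypothetical numbers (none of these figures are real offer data) under the commonly cited 5/15/40/40 vest:

# Hypothetical offer math: shows the front-loaded cash vs back-loaded
# equity tension. All figures are invented for illustration.
base = 160_000
rsu_grant = 300_000                  # grant value at offer time
signon = {1: 90_000, 2: 70_000}      # year-1 and year-2 sign-on cash
vest = {1: 0.05, 2: 0.15, 3: 0.40, 4: 0.40}

for year in range(1, 5):
    cash = base + signon.get(year, 0)
    equity = rsu_grant * vest[year]
    print(f"Year {year}: cash ${cash:,} + equity ${equity:,.0f} = ${cash + equity:,.0f}")

With these invented numbers, Years 1 and 2 land near $265k and $275k while Years 3 and 4 reach $280k on equity alone, which is exactly the retention curve the sign-on bonus is papering over.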
Amazon Data Engineer Interview Process
8 rounds · ~5 weeks end to end
Initial Screen
1 round: Recruiter Screen
This initial conversation with a recruiter will cover your background, experience, and why you're interested in a Data Engineer role at Amazon. You should be prepared for a couple of light SQL or Python checks to verify your fundamental technical skills. The recruiter will also assess your general fit for the role and company culture.
Tips for this round
- Be ready to articulate your career goals and how they align with Amazon's mission.
- Review your resume thoroughly and be prepared to discuss key projects and accomplishments.
- Practice basic SQL queries (e.g., joins, aggregations) and Python scripting fundamentals.
- Familiarize yourself with Amazon's Leadership Principles (LPs) and be ready to briefly mention how you embody one or two.
- Prepare a few thoughtful questions to ask the recruiter about the role, team, or interview process.
Technical Assessment
2 rounds: SQL & Data Modeling
Expect a focused technical phone screen that delves into your proficiency with SQL, Python, and ETL design principles. You'll be asked to solve problems, not just for correctness, but also to explain your thought process, potential trade-offs, and optimization strategies. This round will also include behavioral questions related to Amazon's Leadership Principles.
Tips for this round
- Master advanced SQL concepts including window functions, common table expressions (CTEs), and query optimization.
- Be prepared to write Python code for data manipulation, scripting, and potentially API interactions.
- Understand ETL (Extract, Transform, Load) concepts, different design patterns, and how to handle data quality issues.
- Practice explaining your technical solutions clearly and discussing alternative approaches and their pros/cons.
- Prepare STAR method examples for 2-3 Leadership Principles, focusing on technical challenges and data-related projects.
Coding & Algorithms
This second technical phone screen will likely focus more heavily on your general programming skills, data structures, and algorithms, typically using Python. You'll be expected to write clean, efficient code to solve problems relevant to data processing or general software engineering. As with other rounds, expect behavioral questions interspersed throughout.
Onsite
5 rounds: SQL & Data Modeling
You'll be challenged to demonstrate deep expertise in SQL and data modeling during this onsite interview. Expect to design database schemas, write complex analytical queries, and discuss various data warehousing concepts like dimensional modeling (star/snowflake schemas) and slowly changing dimensions. Behavioral questions will also be a significant component.
Tips for this round
- Review advanced SQL, including complex joins, subqueries, window functions, and performance tuning.
- Understand different data modeling techniques (e.g., 3NF, dimensional modeling) and their trade-offs for analytical workloads.
- Be prepared to discuss ETL/ELT pipelines, data governance, and ensuring data quality and integrity.
- Practice whiteboarding or typing out your SQL solutions and explaining your design choices clearly.
- Prepare STAR examples that highlight your 'Customer Obsession' and 'Ownership' in data projects.
System Design
This round will assess your ability to design scalable, fault-tolerant, and efficient data systems. You'll be given a high-level problem and asked to design an end-to-end data pipeline, considering components like data ingestion, storage, processing, and serving. Expect to discuss distributed systems concepts and how you would leverage AWS services.
Coding & Algorithms
You'll face another coding challenge in this onsite round, typically involving a more complex problem than the phone screens, often related to data manipulation or processing. The interviewer will evaluate your problem-solving approach, algorithmic thinking, and ability to write robust, optimized code. Expect to discuss the time and space complexity of your solutions.
Data Engineering Deep Dive
This interview will serve as a deep dive into your practical data engineering experience and domain knowledge. The interviewer will probe your understanding of various strategies for ingesting, modeling, processing, and persisting data at scale. You might discuss specific big data technologies, data governance, or how technical decisions impact business analytics.
Bar Raiser
This is a critical behavioral interview conducted by a 'Bar Raiser,' an experienced Amazonian from a different team, whose primary role is to ensure hiring standards are maintained. This round focuses exclusively on Amazon's 16 Leadership Principles, and you'll be expected to provide detailed, specific examples from your past experiences using the STAR method. The Bar Raiser looks for evidence that you can operate at a higher level than 50% of current employees in that role.
Tips to Stand Out
- Master Amazon's Leadership Principles (LPs). These are central to every interview. Prepare multiple STAR method examples for each LP, focusing on quantifiable results and your specific actions. Interviewers will probe deeply into your stories.
- Excel in SQL and Python. Data Engineers at Amazon are expected to be highly proficient in both. Practice complex SQL queries (window functions, CTEs, optimization) and Python coding for data manipulation, algorithms, and scripting.
- Understand Data System Design. Be prepared to design scalable, reliable, and efficient data pipelines and architectures. Familiarize yourself with distributed systems concepts and relevant AWS data services (S3, Redshift, Glue, Kinesis, EMR).
- Practice the STAR Method relentlessly. For every behavioral question, structure your answer using Situation, Task, Action, and Result. Quantify your results whenever possible and clearly articulate your individual contribution.
- Think out loud during technical problems. Interviewers want to understand your thought process, problem-solving approach, and how you handle ambiguity. Clearly communicate your assumptions, explore different solutions, and discuss trade-offs.
- Showcase your data engineering expertise. Be ready to discuss ETL/ELT processes, data modeling (dimensional vs. normalized), data quality, data governance, and how technical decisions impact business outcomes.
- Prepare thoughtful questions. At the end of each interview, ask insightful questions that demonstrate your curiosity, engagement, and understanding of the role or team. This also helps you assess if the role is a good fit for you.
Common Reasons Candidates Don't Pass
- ✗Lack of strong Leadership Principle examples. Candidates often fail to provide specific, detailed STAR stories that clearly demonstrate how they embody Amazon's LPs, or their examples lack depth and quantifiable results.
- ✗Insufficient technical depth in SQL or Python. Many candidates struggle with complex SQL queries, efficient Python coding, or fail to optimize their solutions for large datasets, indicating a gap in core technical skills.
- ✗Poor system design capabilities. Inability to design scalable data pipelines, understand distributed systems, or effectively leverage cloud technologies (especially AWS) for data solutions is a common reason for rejection.
- ✗Weak problem-solving and communication skills. Candidates who don't articulate their thought process clearly, struggle to break down complex problems, or fail to discuss trade-offs effectively often don't pass.
- ✗Not a strong culture fit. While technical skills are crucial, a candidate's inability to align with Amazon's unique culture, as assessed through the Leadership Principles, can lead to rejection, particularly in the Bar Raiser round.
Offer & Negotiation
Amazon's compensation packages for Data Engineers typically consist of a base salary, a sign-on bonus (often paid out over the first two years), and Restricted Stock Units (RSUs). The RSU vesting schedule is usually back-weighted, commonly 5% in year 1, 15% in year 2, and 40% in years 3 and 4. The most negotiable components are often the sign-on bonus and, to a lesser extent, the number of RSUs. Having competing offers can significantly strengthen your negotiation position. Be prepared to articulate your value and desired compensation clearly, but also be aware that Amazon has a structured approach to offers.
Weak Leadership Principle stories are among the most common reasons candidates get rejected, right alongside insufficient SQL depth and poor system design. From what candidates report, vague STAR answers without quantified results hurt just as badly as a botched Redshift schema question. Every interviewer scores you against specific LPs, so a single round with thin behavioral examples drags down your entire debrief scorecard.
After your loop wraps, each interviewer writes independent feedback before seeing anyone else's notes. The Bar Raiser, an experienced Amazonian pulled from a completely different org with zero incentive to fill the headcount, carries outsized influence in that debrief. Their role is to protect Amazon's hiring bar, not to help a team staff up, which means a strong Bar Raiser objection can override enthusiasm from the rest of the panel. Prep for that round by mapping at least two detailed STAR stories to each of the top LPs (Customer Obsession, Ownership, Dive Deep, Bias for Action, Earn Trust) with real metrics from your past work.
Amazon Data Engineer Interview Questions
Data Architecture & Pipeline Design
Expect questions that force you to design batch/stream pipelines end-to-end: ingestion, orchestration, backfills, SLAs, and failure recovery. Candidates often struggle to balance correctness, cost, and operability under real e-commerce data volumes and change rates.
You own a daily batch pipeline that builds an Orders fact table in Redshift from S3 (Parquet), used for BI metrics like gross merchandise sales and units shipped. Design the pipeline to support late-arriving updates, idempotent reruns, and a 2-hour SLA; specify keys and a staging strategy, and explain how you detect data completeness.
Sample Answer
Most candidates default to append-only loads into the fact table, but that fails here because orders can be updated after creation (returns, cancellations, address changes) and reruns will double count. You need a landing zone in S3, a deterministic partitioning scheme (for example by event_date and marketplace_id), and a Redshift staging table loaded via COPY, then a MERGE pattern (or delete plus insert) keyed on order_id plus line_id plus versioning. Track a high-water mark per source and partition, and enforce completeness with a control table that records expected partitions and row counts per marketplace plus checksum or count deltas. For late data, allow a rolling backfill window (for example $N$ days), and make the merge idempotent by using the source record version (updated_at) and only applying the latest record per key.
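As a concrete sketch of that pattern, here is one way the staging-plus-upsert could look in Redshift SQL. Table and column names (stage_orders, fact_orders, line_id, updated_at) are illustrative, not a real Amazon schema:

BEGIN;

-- Keep only the latest version of each key from the staged batch.
CREATE TEMP TABLE stage_dedup AS
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY order_id, line_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM stage_orders s
) x
WHERE rn = 1;

-- Delete-plus-insert keeps reruns idempotent: replaying the same batch
-- converges to the same fact rows instead of double counting.
DELETE FROM fact_orders
USING stage_dedup d
WHERE fact_orders.order_id = d.order_id
  AND fact_orders.line_id  = d.line_id;

INSERT INTO fact_orders (order_id, line_id, order_ts, updated_at, quantity, revenue)
SELECT order_id, line_id, order_ts, updated_at, quantity, revenue
FROM stage_dedup;

COMMIT;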
You ingest clickstream events (AddToCart, Purchase, Search) via Kinesis into S3 and compute near real time conversion rate per ASIN for an Ops dashboard with a 5 minute freshness target. Design the architecture, include deduplication, exactly-once semantics expectations, and how you handle a sudden 10x traffic spike on Prime Day.
A business team wants a single dataset for Customer Lifetime Value that joins orders (fact), returns (fact), and customer and product dimensions, and it must support backfills for 2 years and daily incremental loads. Choose between an ELT approach in Redshift versus a lake-first approach in S3 plus Athena, describe the tables, partitions, and how you keep BI queries fast and costs predictable.
AWS Data Infrastructure
Most candidates underestimate how much the interview probes practical AWS tradeoffs across S3, Redshift, Glue/EMR, Kinesis/Firehose, Lambda, and IAM. You’ll be evaluated on secure, scalable, and cost-aware choices plus how you’d monitor and operate them.
Your daily Glue job loads 500 GB of clickstream JSON from S3 into Redshift and suddenly the runtime doubles and Redshift WLM queues spike. What specific AWS checks and fixes do you apply across S3 layout, Glue execution, and Redshift load strategy to restore steady runtime and control cost?
Sample Answer
Fix the S3 to Redshift ingestion path by enforcing partitioned, columnar data in S3, right-sizing Glue, and switching to a Redshift COPY-based load with proper compression and sort keys. Check S3 for small files and missing partitions because they inflate listing and shuffle cost in Glue and create many tiny COPY operations. Validate Glue job metrics (DPU, shuffle spill, skew) and enable job bookmarks to avoid reprocessing. In Redshift, load into a staging table via COPY from S3, then MERGE, and confirm DISTKEY and SORTKEY align with the dominant join and filter patterns to reduce WLM queue time.
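For the load path itself, a minimal sketch of the COPY-based staging load described above; the bucket, prefix, and IAM role are placeholders:

-- Parquet COPY avoids per-row JSON parsing and benefits from the
-- partitioned S3 layout; all identifiers here are placeholders.
COPY stage_clickstream
FROM 's3://example-bucket/clickstream/event_date=2024-01-07/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
FORMAT AS PARQUET;

-- Refresh stats so the planner sees the new staging volume.
ANALYZE stage_clickstream;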
You need near real-time order events (p95 under 5 seconds) for an Operations dashboard and also a durable replayable history for backfills, events are 20k per second at peak. How do you choose between Kinesis Data Streams plus Lambda versus Kinesis Firehose into S3 plus Glue, and what IAM, encryption, and monitoring controls do you put in place?
SQL: Analytics, Debugging, and Performance
Your ability to write correct SQL under pressure is a major signal, especially for BI-style questions with tricky joins, window functions, and edge cases. You’ll also need to explain performance implications (partitioning, distribution/sort keys, predicate pushdown) rather than just producing a query.
You have Amazon retail tables orders(order_id, customer_id, order_ts, order_total, order_status) and order_items(order_id, asin, qty, item_price). Return each customer's most recent non-canceled order with its order_total and total_units, breaking ties by higher order_total.
Sample Answer
You could use a correlated subquery to pick the max timestamp, or a window function with ROW_NUMBER. The window function wins here because it handles tie-breaking deterministically and avoids the repeated scans that often show up in correlated patterns. It also reads like the business rule: sort by recency, then by value. Most bugs come from forgetting to filter canceled orders before ranking.
WITH item_rollup AS (
    SELECT
        oi.order_id,
        SUM(oi.qty) AS total_units
    FROM order_items oi
    GROUP BY oi.order_id
), ranked_orders AS (
    SELECT
        o.customer_id,
        o.order_id,
        o.order_ts,
        o.order_total,
        COALESCE(ir.total_units, 0) AS total_units,
        ROW_NUMBER() OVER (
            PARTITION BY o.customer_id
            ORDER BY o.order_ts DESC, o.order_total DESC, o.order_id DESC
        ) AS rn
    FROM orders o
    LEFT JOIN item_rollup ir
        ON ir.order_id = o.order_id
    WHERE o.order_status <> 'CANCELED'
)
SELECT
    customer_id,
    order_id,
    order_ts,
    order_total,
    total_units
FROM ranked_orders
WHERE rn = 1;

A BI dashboard shows duplicate GMV for some days; you suspect a join fanout between shipments(shipment_id, order_id, shipped_ts) and shipment_events(shipment_id, event_ts, event_type) when calculating daily shipped GMV from orders(order_id, order_ts, order_total). Write SQL to compute daily shipped GMV correctly and explain how your query prevents duplication.
In Redshift, a query that computes 7-day rolling conversion rate by marketplace is slow; tables are fact_sessions(session_id, customer_id, marketplace_id, session_ts) and fact_orders(order_id, customer_id, marketplace_id, order_ts). Write SQL for daily conversion rate (orders within 7 days of a session) and call out 2 concrete performance fixes using sort and distribution keys.
Data Modeling for Warehouses & BI
The bar here isn’t whether you know star schema terms, it’s whether you can model messy business processes into dimensions/facts that stay stable as requirements evolve. Interviewers look for grain clarity, SCD strategy, and how models enable reliable self-serve analytics.
You are modeling Amazon retail Orders for BI, and business wants daily metrics for units, revenue, and discounts by marketplace, seller type, and customer segment. Define the grain of your main fact table and name 5 dimensions you would include, including how you would handle returned and canceled items without double counting.
Sample Answer
Reason through it: start by locking the grain; if you cannot say one sentence that uniquely identifies a row, the model will break under new questions. Use order item as the default grain (one row per order_id and order_item_id, and optionally shipment_id if partial shipments matter), because units and revenue live at item level. Put marketplace, date, product, seller, customer, and fulfillment channel as conformed dimensions, and keep discounts as additive measures at the same grain. Treat returns and cancellations as separate fact rows (or a linked returns fact) with clear transaction types and signed measures, so net metrics are plain SUMs with no duplicating joins.
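A minimal DDL sketch of that order-item-grain fact with signed measures; every name below is illustrative rather than a known Amazon schema:

CREATE TABLE fact_order_item (
    order_id         BIGINT,
    order_item_id    BIGINT,
    transaction_type VARCHAR(16),    -- 'SALE', 'RETURN', 'CANCEL'
    date_key         INTEGER,
    marketplace_key  INTEGER,
    product_key      INTEGER,
    seller_key       INTEGER,
    customer_key     INTEGER,
    fulfillment_key  INTEGER,
    units            INTEGER,        -- signed: negative for returns
    revenue          DECIMAL(12, 2), -- signed: negative for returns
    discount         DECIMAL(12, 2)
);

-- Net metrics stay a plain SUM at the declared grain:
-- SELECT date_key, SUM(units), SUM(revenue) FROM fact_order_item GROUP BY 1;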
In a Redshift star schema for Prime delivery performance, you have a fact_delivery_event table at event grain and a dim_address that changes when a customer updates their address. Which SCD type do you choose for dim_address, and how do you keep historical on-time rate consistent across backfills and late arriving events?
Your BI users want a single dataset to analyze Customer Lifetime Value (CLV) for Amazon retail, mixing orders, refunds, and promotional credits, and they also want to slice by both order date and ship date. How do you model facts and date dimensions so CLV is additive and filtering by either date does not silently change the metric definition?
Data Warehouse Engineering (Redshift/OLAP)
In practice, you’ll be pushed on how to make warehouses fast and trustworthy: incremental loads, compaction/vacuuming patterns, late-arriving data, and concurrency. Strong answers connect physical design decisions to query patterns and downstream BI usage.
You own an hourly refresh of a Redshift star schema for Amazon retail analytics with fact_orders (order_id, customer_id, sku_id, order_ts, ship_ts, price) and dimensions dim_customer, dim_sku. How do you handle late arriving ship_ts updates so BI dashboards show correct shipped revenue without full reloads, and what physical maintenance do you schedule (VACUUM, ANALYZE) to keep queries fast?
Sample Answer
This question is checking whether you can keep a warehouse correct under updates, and fast under constant ingestion. You should describe an incremental pattern that upserts changed rows (staging plus dedupe by business key and latest update_ts, then MERGE or delete-insert) and preserves stable surrogate keys in dimensions. Call out how you backfill metrics for late updates (recompute affected partitions, for example by order_date or update_date). Then tie it to Redshift ops: ANALYZE after significant changes, VACUUM on tables with deletes or updates to restore sort order and reclaim space, not on every run.
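One possible shape for the in-place ship_ts correction plus the scheduled maintenance, with assumed table names (stage_shipments, fact_orders):

-- Apply late-arriving ship_ts values without a full reload.
UPDATE fact_orders
SET ship_ts = s.ship_ts
FROM stage_shipments s
WHERE fact_orders.order_id = s.order_id
  AND s.ship_ts IS NOT NULL
  AND (fact_orders.ship_ts IS NULL OR fact_orders.ship_ts <> s.ship_ts);

-- Scheduled off-peak rather than after every run:
VACUUM SORT ONLY fact_orders;   -- restore sort order after updates
ANALYZE fact_orders;            -- refresh planner statistics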
A Redshift cluster powers an Amazon operations dashboard where 150 concurrent users run the same 3 queries, one query scans fact_clickstream (10 TB) joined to dim_sku and dim_marketplace and groups by day and marketplace, but it spikes to 40 minutes at peak. What concrete Redshift table design changes (DISTKEY, SORTKEY, compression, materialized views) and workload controls would you apply, and how do you validate each change with evidence?
Coding & Algorithms (Python)
You’ll likely face coding tasks that test clean implementation, runtime/space reasoning, and robust edge-case handling rather than exotic theory. Typical prompts map to data engineering utilities (parsing, batching, deduping, top-K, stream-like processing) with production-quality expectations.
You ingest an S3 file of Amazon retail order events, each line is a JSON object with fields order_id, event_time (ISO-8601), and status. Return the first event per order_id by event_time, and if ties exist keep the lexicographically smallest status.
Sample Answer
The standard move is to track one best record per key (order_id) while you scan the data. But here, tie handling matters because duplicate timestamps happen in distributed systems, so you must apply a deterministic secondary rule (status lexicographic) to avoid nondeterministic downstream aggregates.
import json
from datetime import datetime
from typing import Iterable, List, Dict, Any, Tuple


def _parse_time(ts: str) -> datetime:
    """Parse ISO-8601 timestamps, supporting a trailing 'Z'."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def first_event_per_order(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Return earliest event per order_id, tie broken by smallest status.

    Each input line is a JSON object with keys: order_id, event_time, status.
    Output is a list of the selected JSON dicts (original fields preserved).
    """
    best: Dict[str, Tuple[datetime, str, Dict[str, Any]]] = {}

    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        oid = rec["order_id"]
        t = _parse_time(rec["event_time"])
        status = rec["status"]

        if oid not in best:
            best[oid] = (t, status, rec)
            continue

        bt, bs, _ = best[oid]
        # Earlier timestamp wins; if equal timestamp, lexicographically smaller status wins.
        if t < bt or (t == bt and status < bs):
            best[oid] = (t, status, rec)

    # Return just the chosen records.
    return [triple[2] for triple in best.values()]


if __name__ == "__main__":
    sample = [
        '{"order_id":"A1","event_time":"2024-01-01T00:00:00Z","status":"SHIPPED"}',
        '{"order_id":"A1","event_time":"2024-01-01T00:00:00Z","status":"CREATED"}',
        '{"order_id":"A2","event_time":"2024-01-02T03:00:00+00:00","status":"CREATED"}',
        '{"order_id":"A1","event_time":"2024-01-01T00:01:00Z","status":"DELIVERED"}',
    ]
    out = first_event_per_order(sample)
    # For A1, timestamps tie, so CREATED wins.
    print(sorted(out, key=lambda r: r["order_id"]))

Given a stream of (asin, customer_id, ts) clicks for an Amazon detail page, compute the top K ASINs by unique customer count within the last 24 hours for a given query time ts_now. Input can be unsorted, and you must handle duplicates and out-of-window events correctly.
You receive a list of Glue job dependencies as (upstream_job, downstream_job) for a daily BI pipeline, and you must return a valid execution order or raise an error if there is a cycle. Also return one example cycle path when a cycle exists.
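For reference, one common way to attack that last prompt is Kahn's algorithm plus a predecessor walk to surface an example cycle. This is a practice sketch, not an official answer:

from collections import defaultdict, deque
from typing import Dict, List, Tuple


def execution_order(deps: List[Tuple[str, str]]) -> List[str]:
    """Topological order via Kahn's algorithm; raises ValueError carrying
    one example cycle path when the dependency graph is cyclic."""
    succs: Dict[str, List[str]] = defaultdict(list)
    preds: Dict[str, List[str]] = defaultdict(list)
    indeg: Dict[str, int] = defaultdict(int)
    nodes = set()
    for up, down in deps:
        succs[up].append(down)
        preds[down].append(up)
        indeg[down] += 1
        nodes.update((up, down))

    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order: List[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)

    if len(order) < len(nodes):
        # Every unprocessed node still has an unprocessed predecessor, so
        # walking predecessors must eventually revisit a node: a cycle.
        start = next(n for n in nodes if indeg[n] > 0)
        path, seen = [start], {start}
        while True:
            prev = next(p for p in preds[path[-1]] if indeg[p] > 0)
            if prev in seen:
                cycle = path[path.index(prev):] + [prev]
                raise ValueError("cycle detected: " + " <- ".join(cycle))
            path.append(prev)
            seen.add(prev)
    return order


# Example: execution_order([("extract", "transform"), ("transform", "load")])
# returns ['extract', 'transform', 'load'].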
Amazon's loop is structured so that pipeline design and AWS infrastructure questions feed off each other in the same interview day, and candidates who prep them in isolation get exposed when a Bar Raiser or system design interviewer asks them to walk through a real scenario end-to-end (say, rebuilding the ingestion layer for seller performance metrics on the marketplace). The compounding difficulty isn't just topical overlap; it's that Amazon interviewers frame every prompt around a specific business domain like fulfillment forecasting or ad attribution, so generic "I'd use Spark and S3" answers without connecting to how that data actually flows through Amazon's ecosystem score poorly on the Ownership and Dive Deep LPs. The single biggest prep mistake? Treating Python algorithms as the main technical hurdle when the real gate is the combined weight of SQL, data modeling, and Redshift/OLAP engineering, three distinct areas that all demand you reason about physical warehouse design choices under Amazon-scale constraints.
Practice Amazon-style questions across all six areas at datainterview.com/questions.
How to Prepare for Amazon Data Engineer Interviews
Know the Business
Official mission
“Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. We strive to be Earth’s most customer-centric company, Earth’s best employer, and Earth’s safest place to work.”
What it actually means
Amazon's core mission is to be the most customer-centric company on Earth, achieved through relentless innovation, operational excellence, and a long-term strategic outlook. It also aims to be Earth's best employer and safest place to work, though the consistent prioritization of these employee-focused goals is debated.
Key Business Metrics
- Revenue: $717B (+14% YoY)
- Market cap: $2.2T (-12% YoY)
- Employees: 1.6M (+1% YoY)
Business Segments and Where DS Fits
AWS
Cloud platform that powers AI inference with custom chips, smart routing systems, and purpose-built infrastructure, making AI faster and more affordable. Offers services like Amazon Bedrock.
DS focus: Making AI faster and more affordable (inference), foundation model evaluation (via Amazon Bedrock with models like Claude Sonnet 4.6)
Amazon Stores
Encompasses Prime benefits, small businesses, retail stores, and other features. Focuses on improving delivery speed and expanding services like Amazon Pharmacy.
DS focus: Personalized product recommendations, tracking price history, automated purchasing based on target prices (via Rufus AI assistant)
Amazon Ads
Advertising platform for brands to connect with audiences, focusing on authenticated identity, AI-powered optimization, and integrated campaigns across streaming TV, online video, and display advertising. Offers solutions like Amazon Marketing Cloud and AWS Clean Rooms.
DS focus: AI-powered optimization, unified audience view across touchpoints, connecting media exposure to shopping behavior, AI for creative brief generation and storyboarding (Creative Agent), continuous optimization for full-funnel campaigns
Current Strategic Priorities
- Continue to be a leading corporate purchaser of carbon-free energy
- Make AI faster and more affordable via AWS infrastructure
- Deploy initial low Earth orbit satellite internet constellation (Project Kuiper)
- Expand Amazon Pharmacy Same-Day Delivery to nearly 4,500 cities
- Improve Prime delivery speed (set new record in 2025)
- Advance advertising solutions with authenticated identity, AI-powered optimization, and integrated campaigns
- Simplify advertising for brands by leveraging AI to remove friction and accelerate insight-to-action
Competitive Moat
Amazon reported $717 billion in revenue last year, and the company's current priorities read like a Data Engineering hiring brief. AWS is racing to make AI inference cheaper through custom chips and purpose-built infrastructure, Ads is wiring up AI-powered campaign optimization with unified audience views across streaming and display, and Stores just pushed same-day pharmacy delivery to nearly 4,500 cities.
Each of those bets demands pipelines, warehouse schemas, and real-time data flows that don't exist yet. Before your loop, ask your recruiter which segment you're interviewing for so you can speak to its specific data problems, not Amazon's mission statement in the abstract.
Your "why Amazon" answer needs to name the exact LP you'll demonstrate, then prove it with a story that could only come from your experience. Don't just say "Ownership." Describe how you owned a pipeline's on-call rotation, caught a data quality regression during a weekly business review, and communicated the root cause to a non-technical stakeholder before they had to ask. That mirrors how Amazon's Leadership Principles actually function as the scoring rubric in behavioral and Bar Raiser rounds, where interviewers are trained to probe for specifics like SLA ownership and cross-team influence rather than enthusiasm about working at scale.
Try a Real Interview Question
Top categories by 7-day revenue share per marketplace
SQL · Given daily category revenue per marketplace, return for each marketplace the top 2 categories by revenue share in the last 7 days ending on 2024-01-07. Output columns: marketplace_id, category, revenue_7d, total_revenue_7d, revenue_share, rank, where revenue_share is revenue_7d / total_revenue_7d and rank uses dense ranking by revenue_7d descending with category as an ascending tiebreaker.
| marketplace_id | order_date | category | revenue |
|---|---|---|---|
| US | 2024-01-01 | Electronics | 500 |
| US | 2024-01-03 | Books | 120 |
| US | 2024-01-07 | Electronics | 300 |
| DE | 2024-01-02 | Books | 90 |
| DE | 2024-01-06 | Electronics | 110 |
| marketplace_id | marketplace_name |
|---|---|
| US | United States |
| DE | Germany |
| JP | Japan |
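One possible solution sketch, assuming the first table above is named daily_category_revenue (no table name is given); treat it as a reference point, not an answer key:

WITH last7 AS (
    SELECT marketplace_id, category, SUM(revenue) AS revenue_7d
    FROM daily_category_revenue
    WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07'
    GROUP BY marketplace_id, category
), totals AS (
    SELECT marketplace_id, SUM(revenue_7d) AS total_revenue_7d
    FROM last7
    GROUP BY marketplace_id
), ranked AS (
    SELECT l.marketplace_id,
           l.category,
           l.revenue_7d,
           t.total_revenue_7d,
           l.revenue_7d::DECIMAL(18, 6) / t.total_revenue_7d AS revenue_share,
           DENSE_RANK() OVER (
               PARTITION BY l.marketplace_id
               ORDER BY l.revenue_7d DESC, l.category ASC
           ) AS "rank"
    FROM last7 l
    JOIN totals t ON t.marketplace_id = l.marketplace_id
)
SELECT *
FROM ranked
WHERE "rank" <= 2;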
700+ ML coding problems with a live Python executor.
Practice in the Engine
Amazon's coding rounds, from what candidates report, lean toward business-flavored data manipulation (think: parsing seller transaction logs or aggregating ad attribution events) rather than graph theory or dynamic programming. The interviewers want to see you handle messy, real-world inputs the way you'd handle a production ETL edge case. Practice similar scenarios at datainterview.com/coding, paying special attention to problems that require you to clean and reshape data before computing a result.
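In that spirit, a toy example of the clean-then-aggregate pattern; the records and cleaning rules are invented:

# Invented seller transaction records: normalize messy fields, drop
# unusable rows, then aggregate -- the shape of many Amazon-style prompts.
from collections import defaultdict

raw = [
    {"seller_id": "S1", "amount": "20.00", "currency": "usd"},
    {"seller_id": " S1", "amount": "5.00", "currency": "USD"},
    {"seller_id": "S2", "amount": None, "currency": "USD"},   # no amount
    {"seller_id": "", "amount": "3.50", "currency": "USD"},   # no seller
]

totals = defaultdict(float)
for rec in raw:
    seller = (rec.get("seller_id") or "").strip()
    amount = rec.get("amount")
    if not seller or amount is None:
        continue  # production ETL would route these to a dead-letter table
    totals[seller] += float(amount)

print(dict(totals))  # {'S1': 25.0}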
Test Your Readiness
How Ready Are You for Amazon Data Engineer?
1 / 10: Can you design an end-to-end batch plus streaming pipeline, including ingestion, validation, idempotent writes, late-arriving data handling, and data quality checks?
Amazon's two dedicated SQL rounds go deep on Redshift-specific tuning (sort keys, dist keys, VACUUM) and window functions for business metrics. Pressure-test those skills at datainterview.com/questions so you discover gaps while you can still close them.
Frequently Asked Questions
What technical skills are tested in Data Engineer interviews?
Core skills tested are SQL (complex joins, optimization, data modeling), Python coding, system design (design a data pipeline, a streaming architecture), and knowledge of tools like Spark, Airflow, and dbt. Statistics and ML are not primary focus areas.
How long does the Data Engineer interview process take?
Most candidates report 3 to 5 weeks. The process typically includes a recruiter screen, hiring manager screen, SQL round, system design round, coding round, and behavioral interview. Some companies add a take-home or replace live coding with a pair-programming session.
What is the total compensation for a Data Engineer?
Total compensation across the industry ranges from $105k to $1,014k depending on level, location, and company. This includes base salary, equity (RSUs or stock options), and annual bonus. Pre-IPO equity is harder to value, so weight cash components more heavily when comparing offers.
What education do I need to become a Data Engineer?
A Bachelor's degree in Computer Science or Software Engineering is the most common background. A Master's is rarely required. What matters more is hands-on experience with data systems, SQL, and pipeline tooling.
How should I prepare for Data Engineer behavioral interviews?
Use the STAR format (Situation, Task, Action, Result). Prepare 5 stories covering cross-functional collaboration, handling ambiguity, failed projects, technical disagreements, and driving impact without authority. Keep each answer under 90 seconds. Most interview loops include 1-2 dedicated behavioral rounds.
How many years of experience do I need for a Data Engineer role?
Entry-level positions typically require 0+ years (including internships and academic projects). Senior roles expect 9-18+ years of industry experience. What matters more than raw years is demonstrated impact: shipped models, experiments that changed decisions, or pipelines you built and maintained.