SQL is the most tested skill in data interviews, appearing in 95% of Data Analyst, Data Scientist, and Data Engineer roles at top-tier companies. Meta, Google, Amazon, Airbnb, Uber, and Stripe all use SQL as a primary filter to assess your ability to work with real data at scale. Unlike coding interviews where algorithms dominate, SQL questions mirror the exact problems you'll solve on the job: measuring user engagement, computing business metrics, and debugging production queries.
What makes SQL interviews particularly challenging is that seemingly simple questions hide multiple layers of complexity. A question about computing monthly revenue sounds straightforward until you realize you need to handle refunds in different months, avoid double-counting from table joins, and deal with NULL customer IDs for guest purchases. The difference between a junior and senior data professional is often their ability to spot these edge cases and write queries that work correctly in production.
Here are the top 33 SQL interview questions organized by the core skills companies actually test, from basic filtering semantics to advanced performance optimization.
SQL Interview Questions
Top SQL interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Core SELECT, Filtering, and NULL Semantics
Most candidates fail SQL interviews not because they can't write SELECT statements, but because they misunderstand how NULL values behave in filters and comparisons. Companies use these foundational questions to quickly identify candidates who will write buggy queries that return incorrect results in production.
The most common mistake is assuming that NOT IN works the same way when the subquery contains NULLs. If your subquery returns even a single NULL value, NOT IN will return zero rows, not the filtered set you expect.
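To make the trap concrete, here is a minimal runnable sketch using Python's built-in sqlite3 with hypothetical `users` and `purchases` tables: a single NULL in the subquery makes `NOT IN` return nothing, while `NOT EXISTS` behaves as intended.

```python
import sqlite3

# Hypothetical tables illustrating the NOT IN / NULL trap.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    INSERT INTO users VALUES (1), (2), (3);
    CREATE TABLE purchases (user_id INTEGER);
    INSERT INTO purchases VALUES (1), (NULL);  -- one NULL from a test row
""")

# NOT IN against a subquery containing NULL: every comparison with NULL is
# UNKNOWN, so no row can ever satisfy the predicate -> zero rows returned.
not_in = conn.execute(
    "SELECT user_id FROM users "
    "WHERE user_id NOT IN (SELECT user_id FROM purchases)"
).fetchall()
print(not_in)  # []

# NOT EXISTS checks row by row and is unaffected by the NULL,
# returning the intended set.
not_exists = conn.execute(
    "SELECT user_id FROM users u "
    "WHERE NOT EXISTS (SELECT 1 FROM purchases p WHERE p.user_id = u.user_id)"
).fetchall()
print(not_exists)  # [(2,), (3,)]
```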
Start here: you are tested on whether you can accurately filter and shape rows under time pressure. Candidates often slip on NULL behavior, boolean logic, date handling, and subtle differences between WHERE and HAVING.
At Netflix, you are pulling a list of active subscriptions where country is not 'US'. The table has many NULL countries due to missing data. Write a query that returns only truly non-US rows and excludes NULLs, and explain why a naive filter is wrong.
Sample Answer
Most candidates default to `WHERE country <> 'US'`, but that fails here because comparisons with NULL evaluate to UNKNOWN and get filtered out by `WHERE`. You need to be explicit about NULL handling, for example `WHERE country IS NOT NULL AND country <> 'US'`. If the requirement is instead to treat NULL as non-US, you would use `WHERE COALESCE(country, '') <> 'US'` or `WHERE country IS NULL OR country <> 'US'`. The key is remembering that SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN.
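A minimal sqlite3 sketch of the three predicates, using a hypothetical `subscriptions` table, shows how each one treats the NULL row:

```python
import sqlite3

# Tiny version of the Netflix example: country is nullable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (sub_id INTEGER, country TEXT);
    INSERT INTO subscriptions VALUES (1, 'US'), (2, 'CA'), (3, NULL), (4, 'FR');
""")

# Naive filter: the NULL row evaluates to UNKNOWN and silently disappears.
# That happens to match "exclude NULLs" here, but only by accident, and it
# breaks the moment the requirement flips to "treat NULL as non-US".
naive = conn.execute(
    "SELECT sub_id FROM subscriptions WHERE country <> 'US'").fetchall()

# Explicit version states the intent: non-NULL and not 'US'.
explicit = conn.execute(
    "SELECT sub_id FROM subscriptions "
    "WHERE country IS NOT NULL AND country <> 'US'").fetchall()

# Requirement variant: treat NULL as non-US, so row 3 is kept.
null_as_non_us = conn.execute(
    "SELECT sub_id FROM subscriptions "
    "WHERE country IS NULL OR country <> 'US'").fetchall()

print(naive, explicit, null_as_non_us)
```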
At Meta, you need users who did not make a purchase in the last 30 days. Purchases are in a table with user_id and purchased_at, and purchased_at can be NULL for test rows. Write a query that correctly returns those users and does not get tripped up by NULLs in a NOT IN subquery.
At Stripe, you are asked to find customers whose first successful charge happened in January 2026. The charges table has customer_id, status, and charged_at timestamps. Write the query and explain why WHERE vs HAVING matters for both correctness and performance.
At Uber, you are given a trips table with requested_at timestamps and a cancel_reason column that is often NULL. You need trips in the last 7 days where the rider did not cancel for 'fraud' and you must keep rows with NULL cancel_reason. Write the WHERE clause only.
At Google, you are debugging a query that uses `WHERE (a = 1 OR b = 1) AND c = 1` and the results look wrong when a or b can be NULL. Provide a minimal example dataset of 4 to 6 rows that demonstrates the issue, then write a corrected predicate that matches the intended logic: treat NULL as 0 for a and b, but keep NULLs for c as unknown so they are excluded.
Aggregations, Grouping, and Conditional Metrics
Aggregation questions test whether you understand the difference between row-level and group-level logic, which is critical for building accurate business metrics. Data professionals who confuse WHERE and HAVING clauses or misuse window functions in aggregations create dashboards that show wrong numbers to executives.
The key insight is that conditional aggregation with CASE statements is often cleaner and more performant than multiple subqueries. Instead of writing separate queries for each metric, you can compute DAU, conversion rate, and average order value in a single pass through the data.
In analytics and product SQL interviews, you need to translate a metric definition into correct GROUP BY logic. You will likely struggle if you mix row-level filters with aggregate filters, or if you mishandle distinct counts, cohorts, and conditional aggregation.
At Meta, you are given events(user_id, event_time, event_name). Define DAU for each date as users who had at least one 'app_open' that day, and also compute the share of DAU who had at least one 'purchase' the same day.
Sample Answer
Compute per-day DAU with a distinct count of app_open users, and compute purchase share using conditional distinct counting in the same grouped query. You group by date(event_time) and set DAU as COUNT(DISTINCT CASE WHEN event_name = 'app_open' THEN user_id END). Then compute purchasers as COUNT(DISTINCT CASE WHEN event_name = 'purchase' THEN user_id END) and divide by DAU, guarding against division by zero with NULLIF. Do not filter the table to app_open rows only, or you will undercount purchasers who did not also have an app_open row in the filtered set.
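A runnable sketch of the conditional distinct counting (Python's sqlite3, tiny hypothetical event data): the CASE expression returns NULL for non-matching events, and COUNT(DISTINCT ...) ignores those NULLs, so one pass computes both metrics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_time TEXT, event_name TEXT);
    INSERT INTO events VALUES
      (1, '2026-01-01 09:00', 'app_open'),
      (2, '2026-01-01 10:00', 'app_open'),
      (2, '2026-01-01 10:05', 'purchase'),
      (3, '2026-01-01 11:00', 'app_open');
""")

rows = conn.execute("""
    SELECT date(event_time) AS d,
           COUNT(DISTINCT CASE WHEN event_name = 'app_open'
                               THEN user_id END) AS dau,
           -- multiply by 1.0 to force float division; NULLIF guards zero DAU
           COUNT(DISTINCT CASE WHEN event_name = 'purchase'
                               THEN user_id END) * 1.0
             / NULLIF(COUNT(DISTINCT CASE WHEN event_name = 'app_open'
                                          THEN user_id END), 0) AS purchase_share
    FROM events
    GROUP BY date(event_time)
""").fetchall()
print(rows)  # one row: DAU = 3, purchase_share = 1/3
```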
At Amazon, you have orders(order_id, user_id, order_ts, status, gmv). For each week, report total GMV for completed orders, and also the number of users who had 2 or more completed orders that week.
At Airbnb, you have reservations(reservation_id, guest_id, created_at, check_in_date, canceled_at). For each month of reservation creation, compute the cancellation rate where a reservation counts as canceled if canceled_at is not null, and the numerator and denominator are both distinct reservations.
At Stripe, you have charges(charge_id, customer_id, created_at, amount, outcome). For each day, compute the average amount of successful charges, and separately compute the success rate defined as successful charges divided by all charges that day.
At Netflix, you have plays(user_id, play_ts, title_id, seconds_watched). For each week, report (a) weekly active viewers, defined as distinct users with at least 300 seconds watched total that week, and (b) total seconds watched that week from those active viewers only.
At Uber, you have trips(trip_id, driver_id, city_id, requested_at, completed_at, canceled_at). For each city and day, compute the cancellation rate, but exclude drivers who had fewer than 5 trip requests that day from both numerator and denominator.
Joins, Relationships, and Duplicates
Join questions separate candidates who write correct queries from those who accidentally multiply rows and return inflated metrics. This skill matters because real production tables have one-to-many relationships, and a poorly written join can make your revenue numbers 10x too high.
Always ask yourself: am I joining to a table that has multiple rows per key? If yes, you need to either filter to one row per key before joining, or use window functions to deduplicate after joining. The FROM clause structure determines whether your final aggregations will be correct.
Expect join questions to probe whether you can reason about cardinality and avoid accidental row multiplication. You will get tripped up when the interviewer introduces one-to-many relationships, missing keys, or requires anti-joins and de-duplication.
At Spotify, you need daily active users by country from `events(user_id, event_ts)` and `users(user_id, country, updated_at)`, where `users` has multiple rows per user due to profile updates. Write a query that counts distinct active users per day and country without multiplying rows.
Sample Answer
You could join `events` to `users` directly, or you could first collapse `users` to one row per `user_id` and then join. The direct join looks simpler but it multiplies each event by the number of profile versions, inflating counts. The de-dup wins here because it restores the intended 1-to-1 join key, for example by picking the latest profile per user with `ROW_NUMBER() = 1`. Then you group by `date(event_ts)` and `country`, and count distinct `user_id`.
At Amazon, you are given `orders(order_id, user_id, order_ts)` and `order_items(order_id, item_id, quantity)`. Return users who placed at least one order in the last 30 days that contains both item 101 and item 202, and explain how you avoid false positives from joins.
At Uber, you have `trips(trip_id, rider_id, requested_at)` and `promotions(rider_id, promo_id, start_at, end_at)`, where a rider can have overlapping promotions. For each trip, assign the promo in effect at request time, but if multiple promos match, pick the one with the latest `start_at`.
At Stripe, you want merchants with zero successful charges in the last 90 days. Tables are `merchants(merchant_id)` and `charges(charge_id, merchant_id, created_at, status)`. Write the query and call out how you handle missing keys and duplicates.
At Meta, you have `posts(post_id, author_id)` and `likes(post_id, user_id, liked_at)`. Return each author and their most-liked post, breaking ties by earliest `post_id`, and ensure authors with zero likes still appear.
At Google, you are debugging a metric spike after joining `sessions(session_id, user_id, started_at)` to `pageviews(session_id, url, ts)` and then to `experiments(user_id, experiment_id, variant, assigned_at)`, where users can be reassigned over time. Explain why row multiplication happens, and write a corrected query to compute total sessions per variant for a given day using the variant active at session start.
Subqueries, CTEs, and Set Operations
Subquery and CTE questions test your ability to break down complex business logic into readable, maintainable SQL. Senior data professionals use CTEs to make their intent clear, while junior candidates write nested subqueries that are impossible to debug.
Window functions inside CTEs are your best tool for ranking and filtering patterns. When you need the top N items per group or consecutive day streaks, build your window function first in a CTE, then filter the results in the outer query.
Rather than writing one giant query, you are evaluated on how you structure logic into readable, correct steps. You can lose points by using correlated subqueries incorrectly, misunderstanding UNION vs UNION ALL, or producing non-deterministic results.
At Spotify, you have a table plays(user_id, track_id, played_at). Return the top 3 tracks per day by unique listeners for the last 7 days, and include ties deterministically so results do not change between runs.
Sample Answer
Reason through it: first, you filter plays to the last 7 days and compute daily unique listeners per track with a GROUP BY on date(played_at), track_id and COUNT(DISTINCT user_id). Next, you rank tracks within each day using a window function like DENSE_RANK() OVER (PARTITION BY play_date ORDER BY listeners DESC, track_id ASC) so ties are handled and ordering is deterministic. Then you select rows with rank <= 3 and output play_date, track_id, listeners. If the question forces subqueries or CTEs, you put the aggregation in a CTE, then apply the ranking in an outer SELECT to keep the logic readable and correct.
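The aggregate-then-rank shape can be sketched in sqlite3 (hypothetical plays data, SQLite 3.25+ for window functions; the 7-day filter is elided to keep the example deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plays (user_id INTEGER, track_id TEXT, played_at TEXT);
    INSERT INTO plays VALUES
      (1, 'A', '2026-01-05 08:00'), (2, 'A', '2026-01-05 09:00'),
      (3, 'A', '2026-01-05 10:00'),
      (1, 'B', '2026-01-05 11:00'), (2, 'B', '2026-01-05 12:00'),
      (1, 'C', '2026-01-05 13:00'), (3, 'C', '2026-01-05 14:00'),
      (2, 'D', '2026-01-05 15:00');
""")

rows = conn.execute("""
    WITH daily AS (                       -- step 1: aggregate to the right grain
      SELECT date(played_at) AS play_date, track_id,
             COUNT(DISTINCT user_id) AS listeners
      FROM plays
      GROUP BY play_date, track_id
    ),
    ranked AS (                           -- step 2: rank within each day
      SELECT *,
             DENSE_RANK() OVER (PARTITION BY play_date
                                ORDER BY listeners DESC, track_id ASC) AS rnk
      FROM daily
    )
    SELECT play_date, track_id, listeners
    FROM ranked
    WHERE rnk <= 3
    ORDER BY play_date, rnk
""").fetchall()
print(rows)
# [('2026-01-05', 'A', 3), ('2026-01-05', 'B', 2), ('2026-01-05', 'C', 2)]
```

Note the track_id tie-break inside the rank's ORDER BY: it is what makes the result repeatable between runs.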
At Meta, you have pageviews(user_id, view_ts, url). Find users who viewed at least 5 distinct URLs in a day for 3 consecutive days, and return their first qualifying start date.
At Amazon, you need a list of active customers in Q1 from two systems: crm_customers(customer_id) and billing_customers(customer_id). You want customers who appear in either system, plus a flag for whether they are in both, and you must avoid accidentally dropping duplicates that signal data quality issues.
At Uber, you have trips(trip_id, driver_id, city_id, start_ts, status). Write a query that returns cities where the cancel rate in the last 28 days is higher than the global cancel rate, and show both rates side by side using CTEs.
At Netflix, you have subscriptions(user_id, start_date, end_date) and you need to compute daily active users for the last 90 days. Do it using set operations on event dates, not a calendar table, and explain how you avoid double counting when users have overlapping subscription periods.
Window Functions and Advanced Analytics SQL
Window function questions are where companies assess your ability to solve advanced analytics problems that simple GROUP BY cannot handle. These questions mirror real product analytics work: cohort analysis, rolling metrics, and ranking systems that power recommendation engines.
The secret to window functions is understanding the frame specification. ROWS BETWEEN tells SQL exactly which rows to include in calculations like rolling sums, and getting this wrong will shift your numbers by days or weeks in time-series analysis.
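A quick sqlite3 sketch of the frame in action (hypothetical daily_revenue table, SQLite 3.25+). Note that ROWS BETWEEN 6 PRECEDING counts physical rows, so it only equals "7 calendar days" when every day has exactly one row; gaps in the date series would need a calendar join first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (dt TEXT, revenue INTEGER);
    INSERT INTO daily_revenue VALUES
      ('2026-01-01', 10), ('2026-01-02', 20), ('2026-01-03', 30),
      ('2026-01-04', 40), ('2026-01-05', 50), ('2026-01-06', 60),
      ('2026-01-07', 70), ('2026-01-08', 80);
""")

rows = conn.execute("""
    SELECT dt, revenue,
           SUM(revenue) OVER (ORDER BY dt
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
             AS rolling_7d
    FROM daily_revenue
    ORDER BY dt
""").fetchall()
print(rows[-1])  # ('2026-01-08', 80, 350) -- sum of Jan 2 through Jan 8
```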
When the question asks for top-N per group, running totals, retention curves, or sessionization, you are expected to reach for window functions. You may struggle with partitioning, ordering, frames, and the difference between windowed and grouped aggregates.
Meta wants to show the 3 most recent posts per user in the feed. Given a posts table (user_id, post_id, created_at), write SQL to return only the latest 3 posts per user, breaking ties deterministically.
Sample Answer
This question is checking whether you can use window functions for top-N per group without accidentally collapsing rows. You should rank posts per user with ROW_NUMBER() over (PARTITION BY user_id ORDER BY created_at DESC, post_id DESC). Then filter to row_number <= 3 in an outer query or QUALIFY if supported. The deterministic tie break is the extra ORDER BY column, otherwise your results can be non-repeatable.
At Amazon you are asked to compute a 7-day rolling sum of daily revenue per marketplace. Given daily_revenue (marketplace_id, dt, revenue), return dt, revenue, and rolling_7d_revenue for each marketplace.
Uber wants a D1 retention curve by signup_date. Given users (user_id, signup_date) and events (user_id, event_ts), compute for each signup_date the fraction of users who return on day 1, day 2, and day 7 after signup.
Netflix wants to detect binge sessions. Given watch_events (user_id, event_ts, title_id), define a new session when the gap between consecutive events for a user is greater than 30 minutes, then return session_id and the total minutes watched per session assuming each event represents 5 minutes watched.
Stripe asks you to find for each customer the first successful payment after a failed payment, and the time between them. Given payments (customer_id, payment_id, created_at, status), return customer_id, failed_payment_id, next_success_payment_id, and minutes_to_recover.
Spotify wants to label each track play with the running count of distinct artists a user has listened to so far, ordered by play time. Given plays (user_id, played_at, track_id, artist_id), return all rows with distinct_artists_so_far.
Query Performance, Debugging, and Edge Cases
Performance and debugging questions test whether you can optimize queries for production scale and diagnose problems when things go wrong. These skills matter because a query that works on sample data might time out or return wrong results when run against billions of rows.
Query execution plans reveal the truth about performance: are you scanning full tables when you should hit an index, or creating massive intermediate results from poorly ordered joins? Learning to read EXPLAIN plans turns you from someone who writes queries into someone who optimizes data systems.
To close out, you are judged on whether your SQL would hold up on production-style datasets at Meta, Google, Amazon, or Stripe. You can stumble by ignoring explain plans, indexing implications, predicate pushdown, and correctness under messy data edge cases.
At Meta, you need daily active users per country from a 10B row events table. Your query joins events to users and filters to the last 30 days, but it scans the full table. What specific query changes and physical design choices would you make to force predicate pushdown and avoid the scan?
Sample Answer
The standard move is to filter as early as possible and join as late as possible, so you put the 30 day filter in a subquery or CTE that only selects needed columns, then aggregate, then join. But here, predicate pushdown can be blocked by wrapping the partition column in a function, for example WHERE DATE(event_ts) >= ..., so you should filter with a sargable predicate like event_ts >= CURRENT_TIMESTAMP - INTERVAL '30' DAY. Partition on event_date or event_ts, cluster or sort by user_id, and only project the columns you need so the engine can prune partitions and use data skipping. Verify with EXPLAIN that the filter is applied at the scan and that the join order did not pull the users table first.
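The sargable-versus-wrapped distinction is easy to demonstrate even in SQLite, whose EXPLAIN QUERY PLAN reports SCAN (full scan) versus SEARCH ... USING INDEX. A sketch with a hypothetical indexed events table (the exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_ts TEXT);
    CREATE INDEX idx_events_ts ON events(event_ts);
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the indexed column in a function hides it from the index:
scan = plan("SELECT * FROM events WHERE date(event_ts) >= '2026-01-01'")
print(scan)    # e.g. "SCAN events"

# The sargable form compares the raw column, so the index is usable:
search = plan("SELECT * FROM events WHERE event_ts >= '2026-01-01'")
print(search)  # e.g. "SEARCH events USING INDEX idx_events_ts (event_ts>?)"
```

The same principle drives partition pruning in warehouse engines: filter on the raw partition column, not a function of it.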
At Amazon, a dashboard query times out after a schema change, and you suspect join duplication. You are counting orders per customer after joining orders to order_items and shipments. How do you debug correctness and performance, and what is the safest rewrite?
At Stripe, you need monthly revenue by customer from payments, refunds, and chargebacks. Some payments have multiple refunds, some refunds arrive in a later month, and there are NULL customer_ids for guest checkouts. How do you write a query that is correct under these edge cases and does not double count?
At Google, your query has WHERE user_id IN (SELECT user_id FROM banned_users) and it is slow on large tables. What rewrite would you try first, when would it be faster, and what edge case can change the results?
At Netflix, you are asked to find the top 100 titles by watch time in the last 7 days, but the watch_events table has late arriving events and occasional duplicate event_ids. How would you write the query so it is both performant and correct, and how would you validate it with an explain plan and spot checks?
How to Prepare for SQL Interviews
Practice NULL edge cases systematically
For every filter or join condition you write, ask what happens if the column contains NULL values. Test your assumptions by creating small sample tables with NULLs and running your queries against them.
Build queries incrementally from inner to outer
Start with the most complex CTE or subquery, verify it returns the right data, then build the next layer. This approach catches logical errors early and makes debugging much faster during interviews.
Always specify your window frame explicitly
Instead of relying on the default frame (which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW whenever ORDER BY is present), write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for running totals or ROWS BETWEEN 6 PRECEDING AND CURRENT ROW for 7-day windows. Explicit frames prevent off-by-one errors and surprises when the ORDER BY column has ties.
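Why the explicit frame matters: when the ORDER BY column has ties, the default RANGE frame includes all peer rows, so a "running total" jumps in steps instead of row by row. A sqlite3 sketch with a hypothetical two-rows-per-date table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (dt TEXT, amount INTEGER);
    -- two rows share the same dt: a tie in the ORDER BY column
    INSERT INTO t VALUES ('2026-01-01', 10), ('2026-01-01', 20), ('2026-01-02', 5);
""")

rows = conn.execute("""
    SELECT dt, amount,
           -- default frame: RANGE, so peer rows (same dt) are included
           SUM(amount) OVER (ORDER BY dt) AS default_frame,
           -- explicit ROWS frame: a true row-by-row running total
           SUM(amount) OVER (ORDER BY dt
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
             AS explicit_rows
    FROM t
    ORDER BY dt, amount
""").fetchall()
for r in rows:
    print(r)
# Both '2026-01-01' rows get default_frame = 30 (peers summed together),
# while explicit_rows advances one row at a time.
```

Because the two tied rows have no deterministic order between them, their explicit_rows values come out as 10 and 30 in some order, which is exactly the non-determinism a tie-breaking ORDER BY column would fix.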
Test join logic with small manual examples
Before writing complex joins, sketch out 3-4 rows from each table on paper and manually trace through what the join should produce. This catches cardinality issues before you run the query.
Use consistent date filtering patterns
Establish habits like the half-open range timestamp_column >= '2024-01-01' AND timestamp_column < '2024-01-02' for day filters; it works for both dates and timestamps and, unlike DATE(timestamp_column) = '2024-01-01', keeps the predicate sargable. Consistent patterns reduce bugs and improve readability.
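Both forms select the same rows; the difference is what the engine can do with them. A sqlite3 sketch with a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, created_at TEXT);
    INSERT INTO orders VALUES
      (1, '2024-01-01 00:00:00'),
      (2, '2024-01-01 23:59:59'),
      (3, '2024-01-02 00:00:00');
""")

# Function-wrapped exact-day match: correct, but not index-friendly.
exact_day = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE date(created_at) = '2024-01-01' ORDER BY order_id"
).fetchall()

# Half-open range: same rows, and the raw column stays usable by an index.
half_open = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02' "
    "ORDER BY order_id"
).fetchall()

print(exact_day, half_open)  # both [(1,), (2,)] -- order 3 is the next day
```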
Frequently Asked Questions
How much SQL depth do I need for a Data Analyst, Data Scientist, or Data Engineer interview?
You should be comfortable writing joins, aggregations, subqueries or CTEs, and window functions, and you should understand how NULLs affect logic and aggregates. For analyst and scientist roles, focus on accurate analytics SQL and edge cases like duplicates and time ranges. For engineering roles, add deeper knowledge of query performance, indexing basics, data modeling, and incremental loads.
Which companies tend to ask the most SQL interview questions?
Data heavy companies often lean hard on SQL, including Meta, Google, Amazon, Microsoft, Uber, Lyft, Airbnb, Netflix, and Stripe. Many fintech, marketplaces, and ad tech firms also use SQL as a primary filter because day to day work is query driven. You should expect at least one SQL round at most roles where you will analyze product, growth, or pipeline data.
Do SQL interviews require live coding, or is it mostly discussion?
Many interviews include writing SQL live in a shared editor, or in a take home query task, not just talking through concepts. You may be asked to produce a correct query, then iterate after feedback to handle edge cases and improve readability. Practice writing queries from scratch at datainterview.com/coding and review common prompts at datainterview.com/questions.
How do SQL interviews differ across Data Analyst, Data Scientist, and Data Engineer roles?
For Data Analysts, questions emphasize business metrics, cohorting, funnels, and clean aggregations with correct grouping and filtering. For Data Scientists, expect experiment analysis, feature extraction, and time series style queries, often using window functions and careful joins. For Data Engineers, expect more about correctness at scale, schema design, deduplication, slowly changing dimensions, and performance considerations like partitioning and avoiding expensive cross joins.
How can I prepare for SQL interviews if I have no real world SQL experience?
Start by learning a consistent pattern: define the grain, list the tables, decide join keys, then build the query in layers using CTEs. Work through realistic datasets and prompts, then compare your outputs to expected results to catch logic bugs like double counting. Use datainterview.com/questions to study common scenarios and datainterview.com/coding to practice writing complete solutions under time constraints.
What are common SQL mistakes that cause people to fail interviews?
You often lose points by double counting from many to many joins, forgetting DISTINCT when needed, or grouping at the wrong grain. Another common issue is mishandling NULLs, for example using NOT IN with NULLs, or filtering in WHERE instead of ON and changing join behavior. You should also avoid ambiguous columns, unreadable nested queries, and missing edge cases like ties, multiple events per user, or incomplete date ranges.
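The WHERE-versus-ON mistake from the list above is worth seeing once: with a LEFT JOIN, a filter on the right table in WHERE drops the NULL-extended rows and silently turns the join into an inner join. A sqlite3 sketch with hypothetical users and orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    INSERT INTO users VALUES (1), (2);
    CREATE TABLE orders (user_id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'completed');  -- user 2 has no orders
""")

# Filter in WHERE: user 2's joined row has NULL status, NULL = 'completed'
# is UNKNOWN, so the row is dropped and user 2 disappears entirely.
in_where = conn.execute("""
    SELECT u.user_id, COUNT(o.user_id)
    FROM users u LEFT JOIN orders o ON o.user_id = u.user_id
    WHERE o.status = 'completed'
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
print(in_where)  # [(1, 1)]

# Filter in ON: the condition restricts which orders match,
# but every user survives the LEFT JOIN.
in_on = conn.execute("""
    SELECT u.user_id, COUNT(o.user_id)
    FROM users u LEFT JOIN orders o
      ON o.user_id = u.user_id AND o.status = 'completed'
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
print(in_on)  # [(1, 1), (2, 0)]
```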
