SQL is the most tested skill in data interviews, appearing in 95% of Data Analyst, Data Scientist, and Data Engineer roles at top-tier companies. Meta, Google, Amazon, Airbnb, Uber, and Stripe all use SQL as a primary filter to assess your ability to work with real data at scale. Unlike coding interviews where algorithms dominate, SQL questions mirror the exact problems you'll solve on the job: measuring user engagement, computing business metrics, and debugging production queries.
What makes SQL interviews particularly challenging is that seemingly simple questions hide multiple layers of complexity. A question about computing monthly revenue sounds straightforward until you realize you need to handle refunds in different months, avoid double-counting from table joins, and deal with NULL customer IDs for guest purchases. The difference between a junior and senior data professional is often their ability to spot these edge cases and write queries that work correctly in production.
Here are the top 33 SQL interview questions organized by the core skills companies actually test, from basic filtering semantics to advanced performance optimization.
SQL Interview Questions
Top SQL interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Core SELECT, Filtering, and NULL Semantics
Most candidates fail SQL interviews not because they can't write SELECT statements, but because they misunderstand how NULL values behave in filters and comparisons. Companies use these foundational questions to quickly identify candidates who will write buggy queries that return incorrect results in production.
The most common mistake is assuming that NOT IN works the same way when the subquery contains NULLs. If your subquery returns even a single NULL value, NOT IN will return zero rows, not the filtered set you expect.
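To make the trap concrete, here is a minimal runnable sketch using Python's built-in sqlite3 with hypothetical `users` and `purchases` tables: a single NULL in the subquery makes `NOT IN` return nothing, while `NOT EXISTS` behaves as intended.

```python
import sqlite3

# Hypothetical tables illustrating the NOT IN / NULL trap.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    INSERT INTO users VALUES (1), (2), (3);
    CREATE TABLE purchases (user_id INTEGER);
    INSERT INTO purchases VALUES (1), (NULL);  -- one NULL from a test row
""")

# NOT IN against a subquery containing NULL: every comparison with NULL is
# UNKNOWN, so no row can ever satisfy the predicate -> zero rows returned.
not_in = conn.execute(
    "SELECT user_id FROM users "
    "WHERE user_id NOT IN (SELECT user_id FROM purchases)"
).fetchall()
print(not_in)  # []

# NOT EXISTS checks row by row and is unaffected by the NULL,
# returning the intended set.
not_exists = conn.execute(
    "SELECT user_id FROM users u "
    "WHERE NOT EXISTS (SELECT 1 FROM purchases p WHERE p.user_id = u.user_id)"
).fetchall()
print(not_exists)  # [(2,), (3,)]
```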
Start here: you are tested on whether you can accurately filter and shape rows under time pressure. Candidates often slip on NULL behavior, boolean logic, date handling, and subtle differences between WHERE and HAVING.
At Netflix, you are pulling a list of active subscriptions where country is not 'US'. The table has many NULL countries due to missing data. Write a query that returns only truly non-US rows and excludes NULLs, and explain why a naive filter is wrong.
Sample Answer
Most candidates default to `WHERE country <> 'US'`, but that fails here because comparisons with NULL evaluate to UNKNOWN and get filtered out by `WHERE`. You need to be explicit about NULL handling, for example `WHERE country IS NOT NULL AND country <> 'US'`. If the requirement is instead to treat NULL as non-US, you would use `WHERE COALESCE(country, '') <> 'US'` or `WHERE country IS NULL OR country <> 'US'`. The key is remembering that SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN.
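A minimal sqlite3 sketch of the three predicates, using a hypothetical `subscriptions` table, shows how each one treats the NULL row:

```python
import sqlite3

# Tiny version of the Netflix example: country is nullable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (sub_id INTEGER, country TEXT);
    INSERT INTO subscriptions VALUES (1, 'US'), (2, 'CA'), (3, NULL), (4, 'FR');
""")

# Naive filter: the NULL row evaluates to UNKNOWN and silently disappears.
# That happens to match "exclude NULLs" here, but only by accident, and it
# breaks the moment the requirement flips to "treat NULL as non-US".
naive = conn.execute(
    "SELECT sub_id FROM subscriptions WHERE country <> 'US'").fetchall()

# Explicit version states the intent: non-NULL and not 'US'.
explicit = conn.execute(
    "SELECT sub_id FROM subscriptions "
    "WHERE country IS NOT NULL AND country <> 'US'").fetchall()

# Requirement variant: treat NULL as non-US, so row 3 is kept.
null_as_non_us = conn.execute(
    "SELECT sub_id FROM subscriptions "
    "WHERE country IS NULL OR country <> 'US'").fetchall()

print(naive, explicit, null_as_non_us)
```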
At Meta, you need users who did not make a purchase in the last 30 days. Purchases are in a table with user_id and purchased_at, and purchased_at can be NULL for test rows. Write a query that correctly returns those users and does not get tripped up by NULLs in a NOT IN subquery.
At Stripe, you are asked to find customers whose first successful charge happened in January 2026. The charges table has customer_id, status, and charged_at timestamps. Write the query and explain why WHERE vs HAVING matters for both correctness and performance.
At Uber, you are given a trips table with requested_at timestamps and a cancel_reason column that is often NULL. You need trips in the last 7 days where the rider did not cancel for 'fraud' and you must keep rows with NULL cancel_reason. Write the WHERE clause only.
At Google, you are debugging a query that uses `WHERE (a = 1 OR b = 1) AND c = 1` and the results look wrong when a or b can be NULL. Provide a minimal example dataset of 4 to 6 rows that demonstrates the issue, then write a corrected predicate that matches the intended logic: treat NULL as 0 for a and b, but keep NULLs for c as unknown so they are excluded.
Aggregations, Grouping, and Conditional Metrics
Aggregation questions test whether you understand the difference between row-level and group-level logic, which is critical for building accurate business metrics. Data professionals who confuse WHERE and HAVING clauses or misuse window functions in aggregations create dashboards that show wrong numbers to executives.
The key insight is that conditional aggregation with CASE statements is often cleaner and more performant than multiple subqueries. Instead of writing separate queries for each metric, you can compute DAU, conversion rate, and average order value in a single pass through the data.
In analytics and product SQL interviews, you need to translate a metric definition into correct GROUP BY logic. You will likely struggle if you mix row-level filters with aggregate filters, or if you mishandle distinct counts, cohorts, and conditional aggregation.
At Meta, you are given events(user_id, event_time, event_name). Define DAU for each date as users who had at least one 'app_open' that day, and also compute the share of DAU who had at least one 'purchase' the same day.
Sample Answer
Compute per-day DAU with a distinct count of app_open users, and compute purchase share using conditional distinct counting in the same grouped query. You group by date(event_time) and set DAU as COUNT(DISTINCT CASE WHEN event_name = 'app_open' THEN user_id END). Then compute purchasers as COUNT(DISTINCT CASE WHEN event_name = 'purchase' THEN user_id END) and divide by DAU, guarding against division by zero with NULLIF. Do not filter the table to app_open rows only, or you will undercount purchasers who did not also have an app_open row in the filtered set.
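A runnable sketch of the conditional distinct counting (Python's sqlite3, tiny hypothetical event data): the CASE expression returns NULL for non-matching events, and COUNT(DISTINCT ...) ignores those NULLs, so one pass computes both metrics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_time TEXT, event_name TEXT);
    INSERT INTO events VALUES
      (1, '2026-01-01 09:00', 'app_open'),
      (2, '2026-01-01 10:00', 'app_open'),
      (2, '2026-01-01 10:05', 'purchase'),
      (3, '2026-01-01 11:00', 'app_open');
""")

rows = conn.execute("""
    SELECT date(event_time) AS d,
           COUNT(DISTINCT CASE WHEN event_name = 'app_open'
                               THEN user_id END) AS dau,
           -- multiply by 1.0 to force float division; NULLIF guards zero DAU
           COUNT(DISTINCT CASE WHEN event_name = 'purchase'
                               THEN user_id END) * 1.0
             / NULLIF(COUNT(DISTINCT CASE WHEN event_name = 'app_open'
                                          THEN user_id END), 0) AS purchase_share
    FROM events
    GROUP BY date(event_time)
""").fetchall()
print(rows)  # one row: DAU = 3, purchase_share = 1/3
```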
At Amazon, you have orders(order_id, user_id, order_ts, status, gmv). For each week, report total GMV for completed orders, and also the number of users who had 2 or more completed orders that week.
At Airbnb, you have reservations(reservation_id, guest_id, created_at, check_in_date, canceled_at). For each month of reservation creation, compute the cancellation rate where a reservation counts as canceled if canceled_at is not null, and the numerator and denominator are both distinct reservations.
At Stripe, you have charges(charge_id, customer_id, created_at, amount, outcome). For each day, compute the average amount of successful charges, and separately compute the success rate defined as successful charges divided by all charges that day.
At Netflix, you have plays(user_id, play_ts, title_id, seconds_watched). For each week, report (a) weekly active viewers, defined as distinct users with at least 300 seconds watched total that week, and (b) total seconds watched that week from those active viewers only.
At Uber, you have trips(trip_id, driver_id, city_id, requested_at, completed_at, canceled_at). For each city and day, compute the cancellation rate, but exclude drivers who had fewer than 5 trip requests that day from both numerator and denominator.
Joins, Relationships, and Duplicates
Join questions separate candidates who write correct queries from those who accidentally multiply rows and return inflated metrics. This skill matters because real production tables have one-to-many relationships, and a poorly written join can make your revenue numbers 10x too high.
Always ask yourself: am I joining to a table that has multiple rows per key? If yes, you need to either filter to one row per key before joining, or use window functions to deduplicate after joining. The FROM clause structure determines whether your final aggregations will be correct.
Expect join questions to probe whether you can reason about cardinality and avoid accidental row multiplication. You will get tripped up when the interviewer introduces one-to-many relationships, missing keys, or requires anti-joins and de-duplication.
At Spotify, you need daily active users by country from `events(user_id, event_ts)` and `users(user_id, country, updated_at)`, where `users` has multiple rows per user due to profile updates. Write a query that counts distinct active users per day and country without multiplying rows.
Sample Answer
You could join `events` to `users` directly, or you could first collapse `users` to one row per `user_id` and then join. The direct join looks simpler but it multiplies each event by the number of profile versions, inflating counts. The de-dup wins here because it restores the intended 1-to-1 join key, for example by picking the latest profile per user with `ROW_NUMBER() = 1`. Then you group by `date(event_ts)` and `country`, and count distinct `user_id`.
At Amazon, you are given `orders(order_id, user_id, order_ts)` and `order_items(order_id, item_id, quantity)`. Return users who placed at least one order in the last 30 days that contains both item 101 and item 202, and explain how you avoid false positives from joins.
At Uber, you have `trips(trip_id, rider_id, requested_at)` and `promotions(rider_id, promo_id, start_at, end_at)`, where a rider can have overlapping promotions. For each trip, assign the promo in effect at request time, but if multiple promos match, pick the one with the latest `start_at`.
At Stripe, you want merchants with zero successful charges in the last 90 days. Tables are `merchants(merchant_id)` and `charges(charge_id, merchant_id, created_at, status)`. Write the query and call out how you handle missing keys and duplicates.
At Meta, you have `posts(post_id, author_id)` and `likes(post_id, user_id, liked_at)`. Return each author and their most-liked post, breaking ties by earliest `post_id`, and ensure authors with zero likes still appear.
At Google, you are debugging a metric spike after joining `sessions(session_id, user_id, started_at)` to `pageviews(session_id, url, ts)` and then to `experiments(user_id, experiment_id, variant, assigned_at)`, where users can be reassigned over time. Explain why row multiplication happens, and write a corrected query to compute total sessions per variant for a given day using the variant active at session start.
Subqueries, CTEs, and Set Operations
Subquery and CTE questions test your ability to break down complex business logic into readable, maintainable SQL. Senior data professionals use CTEs to make their intent clear, while junior candidates write nested subqueries that are impossible to debug.
Window functions inside CTEs are your best tool for ranking and filtering patterns. When you need the top N items per group or consecutive day streaks, build your window function first in a CTE, then filter the results in the outer query.
Rather than writing one giant query, you are evaluated on how you structure logic into readable, correct steps. You can lose points by using correlated subqueries incorrectly, misunderstanding UNION vs UNION ALL, or producing non-deterministic results.
At Spotify, you have a table plays(user_id, track_id, played_at). Return the top 3 tracks per day by unique listeners for the last 7 days, and include ties deterministically so results do not change between runs.
Sample Answer
Reason through it: first, you filter plays to the last 7 days and compute daily unique listeners per track with a GROUP BY on date(played_at), track_id and COUNT(DISTINCT user_id). Next, you rank tracks within each day using a window function like DENSE_RANK() OVER (PARTITION BY play_date ORDER BY listeners DESC, track_id ASC) so ties are handled and ordering is deterministic. Then you select rows with rank <= 3 and output play_date, track_id, listeners. If the question forces subqueries or CTEs, you put the aggregation in a CTE, then apply the ranking in an outer SELECT to keep the logic readable and correct.
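The aggregate-then-rank shape can be sketched in sqlite3 (hypothetical plays data, SQLite 3.25+ for window functions; the 7-day filter is elided to keep the example deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plays (user_id INTEGER, track_id TEXT, played_at TEXT);
    INSERT INTO plays VALUES
      (1, 'A', '2026-01-05 08:00'), (2, 'A', '2026-01-05 09:00'),
      (3, 'A', '2026-01-05 10:00'),
      (1, 'B', '2026-01-05 11:00'), (2, 'B', '2026-01-05 12:00'),
      (1, 'C', '2026-01-05 13:00'), (3, 'C', '2026-01-05 14:00'),
      (2, 'D', '2026-01-05 15:00');
""")

rows = conn.execute("""
    WITH daily AS (                       -- step 1: aggregate to the right grain
      SELECT date(played_at) AS play_date, track_id,
             COUNT(DISTINCT user_id) AS listeners
      FROM plays
      GROUP BY play_date, track_id
    ),
    ranked AS (                           -- step 2: rank within each day
      SELECT *,
             DENSE_RANK() OVER (PARTITION BY play_date
                                ORDER BY listeners DESC, track_id ASC) AS rnk
      FROM daily
    )
    SELECT play_date, track_id, listeners
    FROM ranked
    WHERE rnk <= 3
    ORDER BY play_date, rnk
""").fetchall()
print(rows)
# [('2026-01-05', 'A', 3), ('2026-01-05', 'B', 2), ('2026-01-05', 'C', 2)]
```

Note the track_id tie-break inside the rank's ORDER BY: it is what makes the result repeatable between runs.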
At Meta, you have pageviews(user_id, view_ts, url). Find users who viewed at least 5 distinct URLs in a day for 3 consecutive days, and return their first qualifying start date.
At Amazon, you need a list of active customers in Q1 from two systems: crm_customers(customer_id) and billing_customers(customer_id). You want customers who appear in either system, plus a flag for whether they are in both, and you must avoid accidentally dropping duplicates that signal data quality issues.
At Uber, you have trips(trip_id, driver_id, city_id, start_ts, status). Write a query that returns cities where the cancel rate in the last 28 days is higher than the global cancel rate, and show both rates side by side using CTEs.
At Netflix, you have subscriptions(user_id, start_date, end_date) and you need to compute daily active users for the last 90 days. Do it using set operations on event dates, not a calendar table, and explain how you avoid double counting when users have overlapping subscription periods.
Window Functions and Advanced Analytics SQL
Window function questions are where companies assess your ability to solve advanced analytics problems that simple GROUP BY cannot handle. These questions mirror real product analytics work: cohort analysis, rolling metrics, and ranking systems that power recommendation engines.
The secret to window functions is understanding the frame specification. ROWS BETWEEN tells SQL exactly which rows to include in calculations like rolling sums, and getting this wrong will shift your numbers by days or weeks in time-series analysis.
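A quick sqlite3 sketch of the frame in action (hypothetical daily_revenue table, SQLite 3.25+). Note that ROWS BETWEEN 6 PRECEDING counts physical rows, so it only equals "7 calendar days" when every day has exactly one row; gaps in the date series would need a calendar join first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (dt TEXT, revenue INTEGER);
    INSERT INTO daily_revenue VALUES
      ('2026-01-01', 10), ('2026-01-02', 20), ('2026-01-03', 30),
      ('2026-01-04', 40), ('2026-01-05', 50), ('2026-01-06', 60),
      ('2026-01-07', 70), ('2026-01-08', 80);
""")

rows = conn.execute("""
    SELECT dt, revenue,
           SUM(revenue) OVER (ORDER BY dt
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
             AS rolling_7d
    FROM daily_revenue
    ORDER BY dt
""").fetchall()
print(rows[-1])  # ('2026-01-08', 80, 350) -- sum of Jan 2 through Jan 8
```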
When the question asks for top-N per group, running totals, retention curves, or sessionization, you are expected to reach for window functions. You may struggle with partitioning, ordering, frames, and the difference between windowed and grouped aggregates.
Meta wants to show the 3 most recent posts per user in the feed. Given a posts table (user_id, post_id, created_at), write SQL to return only the latest 3 posts per user, breaking ties deterministically.
Sample Answer
This question is checking whether you can use window functions for top-N per group without accidentally collapsing rows. You should rank posts per user with ROW_NUMBER() over (PARTITION BY user_id ORDER BY created_at DESC, post_id DESC). Then filter to row_number <= 3 in an outer query or QUALIFY if supported. The deterministic tie break is the extra ORDER BY column, otherwise your results can be non-repeatable.
At Amazon you are asked to compute a 7-day rolling sum of daily revenue per marketplace. Given daily_revenue (marketplace_id, dt, revenue), return dt, revenue, and rolling_7d_revenue for each marketplace.
Uber wants a D1 retention curve by signup_date. Given users (user_id, signup_date) and events (user_id, event_ts), compute for each signup_date the fraction of users who return on day 1, day 2, and day 7 after signup.
Netflix wants to detect binge sessions. Given watch_events (user_id, event_ts, title_id), define a new session when the gap between consecutive events for a user is greater than 30 minutes, then return session_id and the total minutes watched per session assuming each event represents 5 minutes watched.
Stripe asks you to find for each customer the first successful payment after a failed payment, and the time between them. Given payments (customer_id, payment_id, created_at, status), return customer_id, failed_payment_id, next_success_payment_id, and minutes_to_recover.
Spotify wants to label each track play with the running count of distinct artists a user has listened to so far, ordered by play time. Given plays (user_id, played_at, track_id, artist_id), return all rows with distinct_artists_so_far.
Query Performance, Debugging, and Edge Cases
Performance and debugging questions test whether you can optimize queries for production scale and diagnose problems when things go wrong. These skills matter because a query that works on sample data might time out or return wrong results when run against billions of rows.
Query execution plans reveal the truth about performance: are you scanning full tables when you should hit an index, or creating massive intermediate results from poorly ordered joins? Learning to read EXPLAIN plans turns you from someone who writes queries into someone who optimizes data systems.
To close out, you are judged on whether your SQL would hold up on production-style datasets at Meta, Google, Amazon, or Stripe. You can stumble by ignoring explain plans, indexing implications, predicate pushdown, and correctness under messy data edge cases.
At Meta, you need daily active users per country from a 10B row events table. Your query joins events to users and filters to the last 30 days, but it scans the full table. What specific query changes and physical design choices would you make to force predicate pushdown and avoid the scan?
Sample Answer
The standard move is to filter as early as possible and join as late as possible, so you put the 30 day filter in a subquery or CTE that only selects needed columns, then aggregate, then join. But here, predicate pushdown can be blocked by wrapping the partition column in a function, for example WHERE DATE(event_ts) >= ..., so you should filter with a sargable predicate like event_ts >= CURRENT_TIMESTAMP - INTERVAL '30' DAY. Partition on event_date or event_ts, cluster or sort by user_id, and only project the columns you need so the engine can prune partitions and use data skipping. Verify with EXPLAIN that the filter is applied at the scan and that the join order did not pull the users table first.
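The sargable-versus-wrapped distinction is easy to demonstrate even in SQLite, whose EXPLAIN QUERY PLAN reports SCAN (full scan) versus SEARCH ... USING INDEX. A sketch with a hypothetical indexed events table (the exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_ts TEXT);
    CREATE INDEX idx_events_ts ON events(event_ts);
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the indexed column in a function hides it from the index:
scan = plan("SELECT * FROM events WHERE date(event_ts) >= '2026-01-01'")
print(scan)    # e.g. "SCAN events"

# The sargable form compares the raw column, so the index is usable:
search = plan("SELECT * FROM events WHERE event_ts >= '2026-01-01'")
print(search)  # e.g. "SEARCH events USING INDEX idx_events_ts (event_ts>?)"
```

The same principle drives partition pruning in warehouse engines: filter on the raw partition column, not a function of it.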
At Amazon, a dashboard query times out after a schema change, and you suspect join duplication. You are counting orders per customer after joining orders to order_items and shipments. How do you debug correctness and performance, and what is the safest rewrite?
At Stripe, you need monthly revenue by customer from payments, refunds, and chargebacks. Some payments have multiple refunds, some refunds arrive in a later month, and there are NULL customer_ids for guest checkouts. How do you write a query that is correct under these edge cases and does not double count?
At Google, your query has WHERE user_id IN (SELECT user_id FROM banned_users) and it is slow on large tables. What rewrite would you try first, when would it be faster, and what edge case can change the results?
At Netflix, you are asked to find the top 100 titles by watch time in the last 7 days, but the watch_events table has late arriving events and occasional duplicate event_ids. How would you write the query so it is both performant and correct, and how would you validate it with an explain plan and spot checks?
How to Prepare for SQL Interviews
Practice NULL edge cases systematically
For every filter or join condition you write, ask what happens if the column contains NULL values. Test your assumptions by creating small sample tables with NULLs and running your queries against them.
Build queries incrementally from inner to outer
Start with the most complex CTE or subquery, verify it returns the right data, then build the next layer. This approach catches logical errors early and makes debugging much faster during interviews.
Always specify your window frame explicitly
Instead of relying on the default frame (which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW whenever ORDER BY is present), write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for running totals or ROWS BETWEEN 6 PRECEDING AND CURRENT ROW for 7-day windows. Explicit frames prevent off-by-one errors and surprises when the ORDER BY column has ties.
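Why the explicit frame matters: when the ORDER BY column has ties, the default RANGE frame includes all peer rows, so a "running total" jumps in steps instead of row by row. A sqlite3 sketch with a hypothetical two-rows-per-date table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (dt TEXT, amount INTEGER);
    -- two rows share the same dt: a tie in the ORDER BY column
    INSERT INTO t VALUES ('2026-01-01', 10), ('2026-01-01', 20), ('2026-01-02', 5);
""")

rows = conn.execute("""
    SELECT dt, amount,
           -- default frame: RANGE, so peer rows (same dt) are included
           SUM(amount) OVER (ORDER BY dt) AS default_frame,
           -- explicit ROWS frame: a true row-by-row running total
           SUM(amount) OVER (ORDER BY dt
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
             AS explicit_rows
    FROM t
    ORDER BY dt, amount
""").fetchall()
for r in rows:
    print(r)
# Both '2026-01-01' rows get default_frame = 30 (peers summed together),
# while explicit_rows advances one row at a time.
```

Because the two tied rows have no deterministic order between them, their explicit_rows values come out as 10 and 30 in some order, which is exactly the non-determinism a tie-breaking ORDER BY column would fix.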
Test join logic with small manual examples
Before writing complex joins, sketch out 3-4 rows from each table on paper and manually trace through what the join should produce. This catches cardinality issues before you run the query.
Use consistent date filtering patterns
Establish habits like the half-open range timestamp_column >= '2024-01-01' AND timestamp_column < '2024-01-02' for day filters; it works for both dates and timestamps and, unlike DATE(timestamp_column) = '2024-01-01', keeps the predicate sargable. Consistent patterns reduce bugs and improve readability.
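Both forms select the same rows; the difference is what the engine can do with them. A sqlite3 sketch with a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, created_at TEXT);
    INSERT INTO orders VALUES
      (1, '2024-01-01 00:00:00'),
      (2, '2024-01-01 23:59:59'),
      (3, '2024-01-02 00:00:00');
""")

# Function-wrapped exact-day match: correct, but not index-friendly.
exact_day = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE date(created_at) = '2024-01-01' ORDER BY order_id"
).fetchall()

# Half-open range: same rows, and the raw column stays usable by an index.
half_open = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02' "
    "ORDER BY order_id"
).fetchall()

print(exact_day, half_open)  # both [(1,), (2,)] -- order 3 is the next day
```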
Frequently Asked Questions
How much SQL depth do I need for a Data Analyst, Data Scientist, or Data Engineer interview?
You should be comfortable writing joins, aggregations, subqueries or CTEs, and window functions, and you should understand how NULLs affect logic and aggregates. For analyst and scientist roles, focus on accurate analytics SQL and edge cases like duplicates and time ranges. For engineering roles, add deeper knowledge of query performance, indexing basics, data modeling, and incremental loads.
Which companies tend to ask the most SQL interview questions?
Data heavy companies often lean hard on SQL, including Meta, Google, Amazon, Microsoft, Uber, Lyft, Airbnb, Netflix, and Stripe. Many fintech, marketplaces, and ad tech firms also use SQL as a primary filter because day to day work is query driven. You should expect at least one SQL round at most roles where you will analyze product, growth, or pipeline data.
Do SQL interviews require live coding, or is it mostly discussion?
Many interviews include writing SQL live in a shared editor, or in a take home query task, not just talking through concepts. You may be asked to produce a correct query, then iterate after feedback to handle edge cases and improve readability. Practice writing queries from scratch at datainterview.com/coding and review common prompts at datainterview.com/questions.
How do SQL interviews differ across Data Analyst, Data Scientist, and Data Engineer roles?
For Data Analysts, questions emphasize business metrics, cohorting, funnels, and clean aggregations with correct grouping and filtering. For Data Scientists, expect experiment analysis, feature extraction, and time series style queries, often using window functions and careful joins. For Data Engineers, expect more about correctness at scale, schema design, deduplication, slowly changing dimensions, and performance considerations like partitioning and avoiding expensive cross joins.
How can I prepare for SQL interviews if I have no real world SQL experience?
Start by learning a consistent pattern: define the grain, list the tables, decide join keys, then build the query in layers using CTEs. Work through realistic datasets and prompts, then compare your outputs to expected results to catch logic bugs like double counting. Use datainterview.com/questions to study common scenarios and datainterview.com/coding to practice writing complete solutions under time constraints.
What are common SQL mistakes that cause people to fail interviews?
You often lose points by double counting from many to many joins, forgetting DISTINCT when needed, or grouping at the wrong grain. Another common issue is mishandling NULLs, for example using NOT IN with NULLs, or filtering in WHERE instead of ON and changing join behavior. You should also avoid ambiguous columns, unreadable nested queries, and missing edge cases like ties, multiple events per user, or incomplete date ranges.
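The WHERE-versus-ON mistake from the list above is worth seeing once: with a LEFT JOIN, a filter on the right table in WHERE drops the NULL-extended rows and silently turns the join into an inner join. A sqlite3 sketch with hypothetical users and orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    INSERT INTO users VALUES (1), (2);
    CREATE TABLE orders (user_id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'completed');  -- user 2 has no orders
""")

# Filter in WHERE: user 2's joined row has NULL status, NULL = 'completed'
# is UNKNOWN, so the row is dropped and user 2 disappears entirely.
in_where = conn.execute("""
    SELECT u.user_id, COUNT(o.user_id)
    FROM users u LEFT JOIN orders o ON o.user_id = u.user_id
    WHERE o.status = 'completed'
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
print(in_where)  # [(1, 1)]

# Filter in ON: the condition restricts which orders match,
# but every user survives the LEFT JOIN.
in_on = conn.execute("""
    SELECT u.user_id, COUNT(o.user_id)
    FROM users u LEFT JOIN orders o
      ON o.user_id = u.user_id AND o.status = 'completed'
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
print(in_on)  # [(1, 1), (2, 0)]
```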
