Spark and Big Data questions dominate data engineering interviews at Meta, Google, Amazon, Netflix, Uber, and Databricks. These companies process petabytes daily, and they need engineers who can architect resilient pipelines, debug performance bottlenecks, and handle real-time streaming workloads. Unlike coding questions with clear right answers, Spark interviews test your ability to reason through trade-offs and diagnose production issues.
What makes these questions brutal is that they mirror actual on-call scenarios. You might get a question about why a join is spilling to disk with 32GB executors, or why a streaming job is losing events after a driver restart. The interviewer has lived through these exact problems, and they can tell immediately if you understand the underlying execution model or if you are just reciting documentation.
Here are the top 30 Spark and Big Data interview questions, organized by the core areas that trip up most candidates.
Spark Architecture and Execution Model
Most candidates can explain what a driver and executor do, but they crumble when asked to trace how a specific query creates stages and tasks. Interviewers at companies like Databricks will give you a multi-step query and ask you to predict exactly where bottlenecks will occur, testing whether you truly understand how Spark breaks down work.
The key insight is that Spark's lazy evaluation means the driver only creates a physical plan when you trigger an action, and wide transformations like groupBy force stage boundaries. Candidates who miss this will give wrong answers about parallelism and task distribution.
Start by nailing how Driver, Executors, tasks, and stages relate, because interviewers test whether you can predict behavior under load. You often struggle here when you memorize terms but cannot reason about what happens during an action.
You call df.repartition(200).groupBy("user_id").count().write.mode("overwrite").parquet("s3://bucket/out"). On a 20-executor cluster, you observe only a few tasks running at a time during the write. Explain how the Driver, stages, and tasks are created for this job, and give two concrete reasons parallelism might be low.
Sample Answer
Most candidates default to "200 partitions means 200 tasks will saturate the cluster," but that fails here because the number of active tasks is bounded by stage boundaries, available executor cores, and the final stage's partitioning. The Driver builds a DAG, and the scheduler splits it into stages at shuffle boundaries; here, repartition and groupBy each introduce a shuffle that creates a separate stage. Even if a stage has 200 tasks, only as many tasks as you have total executor cores can run concurrently, and speculation, backpressure from the output committer, or small-file coalescing can further reduce throughput. Also, the write stage's task count is driven by the number of output partitions at that point, which may be fewer than 200 if AQE coalesces partitions or if you accidentally reduced partitions earlier.
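The concurrency bound in this answer reduces to simple arithmetic. A minimal sketch, assuming a hypothetical cluster shape (the question fixes 20 executors; the 4 cores per executor below is an illustrative assumption):

```python
import math

def concurrent_task_bound(num_partitions, num_executors, cores_per_executor):
    """Upper bound on simultaneously running tasks in one stage,
    and the number of 'waves' needed to finish all tasks."""
    total_cores = num_executors * cores_per_executor
    running = min(num_partitions, total_cores)
    waves = math.ceil(num_partitions / total_cores)
    return running, waves

# 200 shuffle partitions on 20 executors x 4 cores (hypothetical numbers)
running, waves = concurrent_task_bound(200, 20, 4)
print(running, waves)  # 80 3 -> at most 80 tasks at once, finished in 3 waves
```

If the write stage has been coalesced down to, say, 8 output partitions, the same arithmetic explains the "only a few tasks running" symptom regardless of cluster size.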
In a Spark application you see the Driver logs show "Job aborted" after an executor is lost, but the cluster manager immediately launches a replacement executor. Under what conditions will Spark recompute the lost work versus fail the job, and what parts of the execution model determine that behavior?
You need to speed up a pipeline that does a heavy join and then runs two different actions: one writes Parquet and the other computes a KPI count. You can either cache right after the join or rely on Spark to reuse work across actions. Which do you pick, and what will the Driver do when each action is triggered?
You run df.selectExpr("key", "explode(values) as v").groupBy("key").agg(count("v")).collect() and it is slow, plus the Driver OOMs. Walk through what happens from action to execution, including where data moves, what runs on executors, and why the Driver is the one that crashes.
A job has three stages: stage 2 is a shuffle map stage with 20,000 tasks and stage 3 has 200 tasks. On a cluster with 2,000 total cores, explain why stage 3 might still be the end-to-end bottleneck, and what in Spark's execution model causes that.
You observe many "Shuffle fetch failed" errors only when dynamic allocation scales executors down. Describe how shuffle data is produced and consumed across stages, what happens when executors that wrote shuffle blocks disappear, and two mitigations you would apply.
RDDs vs DataFrames, Catalyst and Tungsten
DataFrames versus RDDs is not just an API choice, it fundamentally changes how your code executes under the hood. Senior engineers at Meta and Netflix expect you to know when Catalyst optimization helps and when it hurts, plus how Tungsten's memory management affects garbage collection patterns.
The most common mistake is assuming DataFrames are always faster than RDDs. When you need complex custom logic or are working with nested data that Catalyst cannot optimize well, dropping to RDDs can actually improve performance by avoiding serialization overhead.
You will be asked to choose between RDD, DataFrame, and Dataset and justify it with optimizer and memory implications. Candidates stumble when they describe APIs instead of explaining query planning, code generation, and serialization costs.
You need to compute daily active users from 5 TB of click logs, then join to a 50 GB user dimension and write Parquet. Would you implement this in RDDs or DataFrames, and what do Catalyst and Tungsten change about the runtime behavior?
Sample Answer
Use DataFrames, because Catalyst can optimize the join and aggregation plan, and Tungsten reduces CPU and memory overhead via whole-stage code generation and a compact off-heap binary row format. You get predicate pushdown, projection pruning, and automatic join selection: a broadcast hash join for the 50 GB dimension if you can prune it down to a broadcastable size, or sort-merge if not. With RDDs, you mostly handcraft map and reduce steps, pay higher serialization and object-allocation costs, and miss most logical and physical plan optimizations.
You have a pipeline that does a parse, a filter, and a groupBy, and it is slower than expected. Would you keep it as a typed Dataset with case classes, convert to an untyped DataFrame, or drop to RDDs, and why in terms of encoders, code generation, and serialization?
A job uses DataFrames but is still spilling heavily during a join and running long GC pauses. How would you reason about whether the bottleneck is in Catalyst planning, Tungsten memory format, or shuffle mechanics, and what change would you try first?
You have a Python UDF in a DataFrame pipeline that parses a nested JSON blob and then filters on a field. The job is slow and CPU-bound. How would you redesign this using Spark SQL functions or RDDs, and what does Catalyst do differently once the UDF is gone?
In a mixed workload, one stage is a pure SQL aggregation, another stage is a custom algorithm that needs complex per-key state and is hard to express in SQL. Where would you draw the boundary between DataFrames and RDDs, and how do you minimize serialization and plan breakage at that boundary?
Joins, Shuffles, and Data Skew
Join performance separates junior from senior data engineers, and it is where most production Spark jobs fail. Google and Uber will present you with realistic data size scenarios and expect you to choose the right join strategy, predict where shuffles happen, and solve data skew without just throwing more resources at the problem.
Data skew is particularly tricky because it is not just about one key being popular. You need to detect it in the Spark UI by looking at task duration histograms, then apply techniques like salting or splitting skewed keys, which requires understanding both the business logic and Spark's partitioning model.
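Salting can be illustrated without a cluster: hash-partition a skewed key distribution with and without a salt suffix and compare the largest partition. This is a toy simulation (the key mix, 8 partitions, and 16 salt buckets are made-up numbers, and real Spark salting also requires replicating the other join side across the salt range):

```python
import random
from collections import Counter

def partition_sizes(keys, num_partitions, salt_buckets=1):
    """Count records per hash partition, optionally salting each key."""
    rng = random.Random(42)
    sizes = Counter()
    for k in keys:
        # With salting, the hot key is spread across salt_buckets sub-keys,
        # mirroring adding a random salt column before the shuffle
        salted = (k, rng.randrange(salt_buckets)) if salt_buckets > 1 else k
        sizes[hash(salted) % num_partitions] += 1
    return sizes

# One hot user dominates the records, mirroring the 30 percent skew scenario
keys = ["hot_user"] * 9000 + [f"user_{i}" for i in range(1000)]

plain = partition_sizes(keys, num_partitions=8)
salted = partition_sizes(keys, num_partitions=8, salt_buckets=16)
# The salted max is far below the single hot partition of the plain run
print(max(plain.values()), max(salted.values()))
```

In the Spark UI this shows up exactly as the task duration histogram the paragraph describes: one run has a single giant task, the other a flat distribution.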
In interviews, you need to explain exactly when a shuffle occurs and how join strategy affects runtime and stability. Many people miss skew and end up proposing fixes that move the problem around rather than reducing hot partitions.
You need to join a 5 TB fact table with a 200 MB dimension table in Spark, and the dimension updates daily. How do you decide between a broadcast hash join and a shuffle sort-merge join, and what knobs do you check first?
Sample Answer
You could do a broadcast hash join or a shuffle-based sort-merge join. Broadcast wins here because it avoids shuffling the 5 TB side: you only ship the 200 MB dimension to executors and join locally. First, check whether the dimension actually fits under your broadcast threshold and within executor memory, and whether it is already pruned by filters. If it does not fit safely, fall back to sort-merge, accept the shuffle, and then focus on partitioning and skew handling.
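The "knobs you check first" can be sketched as plain decision logic. The 10 MB default mirrors `spark.sql.autoBroadcastJoinThreshold`; the 10 percent memory-headroom margin is an arbitrary illustrative safety check, not a Spark rule:

```python
def choose_join_strategy(dim_size_bytes, executor_mem_bytes,
                         broadcast_threshold=10 * 1024 * 1024):
    """Toy version of the first checks from the answer above: broadcast
    only if the dimension fits the threshold and leaves comfortable
    headroom in executor memory."""
    fits_threshold = dim_size_bytes <= broadcast_threshold
    fits_memory = dim_size_bytes < 0.1 * executor_mem_bytes  # arbitrary margin
    return "broadcast_hash_join" if fits_threshold and fits_memory \
        else "sort_merge_join"

# 200 MB dimension, 8 GB executors, threshold raised to 512 MB -> broadcast
print(choose_join_strategy(200 * 2**20, 8 * 2**30, 512 * 2**20))
# Same inputs with the default 10 MB threshold -> shuffle sort-merge join
print(choose_join_strategy(200 * 2**20, 8 * 2**30))
```

In a real interview answer, pair this with verifying the chosen strategy in the physical plan via `df.explain()`.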
You call groupBy(user_id).count() and then join the result back to the original DataFrame on user_id. Explain exactly where shuffles happen and why, assuming no bucketing and default settings.
Your job joins clickstream with user_profile on user_id, but one user_id accounts for 30 percent of clicks, causing one task to run 20 minutes while others finish in seconds. How do you detect and mitigate the skew without just increasing partitions?
You have two large tables in Parquet and you repeatedly join them on (country, day). When would you bucket or repartition to reduce shuffles, and when does it not help?
A Spark SQL query uses a left join where the right side has multiple matches per key, so output size explodes 50x and the job then OOMs during shuffle write. What would you change in the plan to keep it stable?
Explain how Spark AQE can change join strategies at runtime, including when it can switch to broadcast and how it handles skewed partitions. Give one scenario where AQE helps and one where it cannot.
You are asked to justify, with specifics, whether a sort-merge join will necessarily be slower than a broadcast hash join on the same inputs. What factors would you mention, and how would you verify them with Spark metrics?
Performance Tuning and Resource Management
Performance tuning questions test whether you can systematically diagnose bottlenecks using Spark UI metrics rather than just guessing. Companies like Amazon will describe a slow job and ask you to prioritize which configuration changes to try first, based on symptoms like high GC time versus shuffle spill.
The critical skill is connecting Spark UI evidence to root causes. If you see long GC pauses but low CPU utilization, the problem is usually memory pressure from too much data per partition, not insufficient compute resources. Most candidates jump straight to scaling up instead of optimizing data layout first.
Expect scenarios where you must tune partitions, memory, and file layout based on symptoms like spill, long GC, or straggler tasks. You typically struggle if you tweak configs randomly instead of forming a hypothesis from the Spark UI and metrics.
A Spark SQL job got 3x slower after a data growth jump. In the Spark UI you see many tasks with high spill to disk and shuffle read, but executor CPU is low. What do you change first, and how do you validate it with metrics?
Sample Answer
Reason through it from the symptoms: low CPU plus high spill usually means tasks are starved for memory during the shuffle and spend their time spilling and reading back. Next, check the stage details: look at spill metrics, shuffle read size per task, and the number of shuffle partitions to see if each task is processing too much data. Your first lever is usually to increase parallelism: raise `spark.sql.shuffle.partitions` (or enable AQE and verify it is coalescing sensibly) so per-task shuffle data drops and spill goes down. Then validate by comparing median and p95 spill per task, the task runtime distribution, and disk read throughput before and after; you want spill and p95 task time to fall while CPU utilization rises.
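The "too much data per task" check is simple arithmetic. The 2 TB stage size and partition counts below are hypothetical, and the roughly 100-200 MB per-task target is a common rule of thumb, not a Spark constant:

```python
def per_task_shuffle_mb(stage_shuffle_gb, shuffle_partitions):
    """Rough average shuffle read per task, the quantity the answer
    says to check in the stage details."""
    return stage_shuffle_gb * 1024 / shuffle_partitions

# Hypothetical stage shuffling 2 TB with the default 200 partitions
before = per_task_shuffle_mb(2048, 200)    # ~10 GB per task -> heavy spill
after = per_task_shuffle_mb(2048, 8000)    # ~260 MB per task -> near target
print(round(before), round(after))
```

The same arithmetic run in reverse gives a defensible starting partition count: total shuffle bytes divided by the per-task target.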
Executors show long GC time, frequent full GCs, and occasional OOM, but the job is mostly joins and aggregations in Spark SQL. What configuration and query-level changes do you consider, and what Spark UI evidence tells you which one to try first?
You have a skewed key that causes one reducer task to run 10x longer than others, and the stage cannot finish until that straggler completes. How do you detect skew precisely in Spark UI, and what is your go-to mitigation for a skewed join?
A daily ETL writes a partitioned Delta or Parquet table. Reads are fast at first, but after weeks the same queries slow down even though compute is unchanged. What do you change in file layout and write behavior to stabilize performance?
A Spark job runs fine on small samples, but on full data the shuffle fetch phase is slow and you see intermittent `FetchFailedException` and executor loss. What knobs and architectural changes do you consider to make shuffle more reliable?
Your cluster has 50 executors, each with 8 cores, and a wide transformation stage shows low CPU utilization with many short tasks and high scheduler delay. How do you choose a better partition count and task sizing strategy?
Structured Streaming, State, and Exactly-once Semantics
Structured Streaming questions reveal whether you understand stateful processing and exactly-once semantics, which are essential for real-time analytics at scale. Netflix and Uber rely heavily on streaming for personalization and matching, so they will ask about watermarks, state management, and recovery scenarios that test your grasp of distributed systems fundamentals.
The trickiest aspect is that achieving exactly-once delivery requires coordination between Spark checkpointing and your output sink. Even with Kafka transactions and Delta Lake, you can still get duplicates if your external system writes are not idempotent, which most candidates miss.
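The duplicate-on-retry failure mode is easy to simulate: replay the same micro-batch twice against an append-style sink versus an idempotent, batch-id-versioned upsert. This is a toy in-memory model of the pattern, not a real sink API:

```python
def append_sink(store, batch_id, rows):
    """Non-idempotent: a replayed batch double-counts."""
    for key, n in rows:
        store[key] = store.get(key, 0) + n

def idempotent_sink(store, batch_id, rows):
    """Idempotent: each key's value records the batch that wrote it, so a
    replay of the same batch is a no-op (mirrors the foreachBatch pattern
    of keying external writes on the micro-batch id)."""
    for key, n in rows:
        last_batch, total = store.get(key, (None, 0))
        if last_batch != batch_id:      # skip if this batch already applied
            store[key] = (batch_id, total + n)

batch = [("u1", 5), ("u2", 3)]
plain, safe = {}, {}
for _ in range(2):                      # driver restart replays batch 7
    append_sink(plain, 7, batch)
    idempotent_sink(safe, 7, batch)
print(plain["u1"], safe["u1"][1])       # 10 vs 5: only the append sink double-counts
```

Spark's checkpoint guarantees at-least-once delivery of each micro-batch to your sink function; turning that into exactly-once results is the sink's job, as the second function shows.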
At the advanced end, interviewers probe whether you can design streaming pipelines with watermarks, state stores, and correct sink semantics. You may find this hard because correctness, latency, and backpressure trade-offs show up only in real production systems.
You ingest click events from Kafka where event_time can arrive up to 2 hours late. You need per-campaign counts in 5 minute windows and you cannot keep state forever. How do you use watermarks and windowing to bound state, and what output modes are valid?
Sample Answer
This question checks whether you can bound state correctly while preserving correctness for late data. You define event-time windows, such as 5 minute tumbling windows, and add a 2 hour watermark on the event_time column so Spark can drop state for windows older than the watermark. With aggregations plus a watermark, append mode is only correct once the engine can finalize windows; otherwise use update mode, and complete mode is usually too heavy at scale. You should call out that events arriving later than the watermark are dropped from the aggregation, so your SLA is explicitly encoded in the watermark choice.
You are deduplicating events using a unique event_id in a stream, then writing to Delta. The feed can resend the same event days later. How do you implement dedup with state, and what is your plan to prevent unbounded state growth?
A Structured Streaming job does a stateful aggregation and writes to an external key value store. After a driver crash, you observe some keys are double-counted. Explain why this happens and how to get effectively exactly-once results end-to-end.
You need to maintain per-user session state with a 30 minute inactivity timeout, and also emit a final session summary when the session closes. Would you use mapGroupsWithState, flatMapGroupsWithState, or windowed aggregation, and how do timeouts interact with watermarks?
You run a join between two Kafka streams, impressions and clicks, keyed by ad_id with a 10 minute join window. What are the state and watermark requirements to make this join correct and bounded, and what failure mode happens if you set the watermark only on one side?
A job uses foreachBatch to write to two sinks, Delta for analytics and a serving store for low latency reads. How do you design the batch logic and checkpointing so that a restart does not cause the two sinks to diverge, and what is your strategy if one sink write succeeds and the other fails?
How to Prepare for Spark & Big Data Interviews
Run Spark locally with intentional bottlenecks
Set up a local Spark cluster with artificially small memory limits and force spilling scenarios. Practice reading the Spark UI to identify where jobs are actually spending time, not where you think they should be slow.
Memorize the shuffle boundary rules
Know exactly which operations force shuffles: wide aggregations like groupBy, joins (unless one side is broadcast), distinct, and repartition. Practice tracing through multi-step queries to predict how many stages Spark will create and where the expensive data movement happens.
Build streaming jobs that handle late data
Implement a Structured Streaming pipeline with watermarks and stateful operations using sample data with intentionally delayed events. Practice explaining how state grows and gets cleaned up as watermarks advance.
Practice Spark UI forensics on real jobs
Find slow Spark jobs in your current work or create synthetic ones with known problems like data skew or memory pressure. Learn to navigate between the Jobs, Stages, and Executors tabs to build a coherent story about what went wrong.
How Ready Are You for Spark & Big Data Interviews?
A Spark job reads Parquet, filters, joins, and writes results. You see many tasks, and some stages are much slower than others. In an interview, how do you explain what creates stages and why some tasks run longer within a stage?
Frequently Asked Questions
How deep do I need to go on Spark and Big Data topics for a Data Engineer interview?
You should be comfortable explaining Spark execution basics like partitions, shuffles, stages, joins, and caching, plus how they impact cost and runtime. Expect to discuss file formats like Parquet, partitioning strategies, and how you would troubleshoot skew and out of memory issues. You do not need to memorize every API, but you should be able to reason about performance, correctness, and data reliability.
Which companies tend to ask the most Spark and Big Data interview questions for Data Engineers?
Cloud and data platform heavy companies ask them most often, including big tech, streaming and marketplace companies, and fintechs with large batch and streaming pipelines. You will also see a lot of Spark questions at organizations that run large lakehouse stacks on AWS, GCP, or Azure, especially when Databricks or managed Spark is central. If the job description mentions petabyte scale, ETL frameworks, or lakehouse migration, expect deep Spark questions.
Will I need to code in Spark, SQL, or Scala/PySpark during a Spark and Big Data interview?
Many Data Engineer interviews include a live coding portion in SQL and sometimes PySpark or Scala, often focused on transforms, joins, windowing, and data quality checks. You may also get a debugging style task like fixing a slow Spark job or rewriting a plan to reduce shuffles. For practice, use datainterview.com/coding and focus on SQL plus Spark style transformations.
How do Spark and Big Data interviews differ across Data Engineer, Analytics Engineer, and ML Engineer roles?
As a Data Engineer, you will be judged on pipeline design, Spark performance tuning, reliability, and batch versus streaming trade-offs. Analytics Engineer interviews usually emphasize SQL modeling, warehouse and lakehouse semantics, and less low-level Spark tuning unless the role runs dbt on Spark. ML Engineer interviews typically focus on feature pipelines and distributed training inputs; you should still know Spark data prep patterns, but the depth on cluster internals is often lighter.
How can I prepare for Spark and Big Data interviews if I do not have real production experience?
You can build small but realistic projects that demonstrate partitioning, incremental loads, and handling late arriving data, then be ready to explain your design choices. Practice reading Spark UIs from sample jobs, compare query plans, and learn what triggers shuffles and wide dependencies. Use datainterview.com/questions to drill Spark concepts and scenarios, and practice writing Spark style transformations on datainterview.com/coding.
What are the most common mistakes candidates make in Spark and Big Data interviews?
You often lose points by giving high-level answers without explaining shuffles, partitioning, and why a join becomes expensive at scale. Another common mistake is treating Spark like a single-machine program, for example calling collect on large results, running overly wide groupBy operations, or ignoring skew and small files. You should also avoid hand-waving around streaming semantics: be clear about exactly-once versus at-least-once delivery, watermarking, and idempotent writes.
