Streaming and Kafka questions dominate data engineering interviews at companies like Confluent, LinkedIn, Uber, and Netflix. These companies process billions of events daily, so they need engineers who understand not just Kafka basics, but the production gotchas around partitioning, exactly-once semantics, and stream processing. A single misconfigured consumer group can take down critical data pipelines, making this knowledge essential for senior roles.
What makes Kafka interviews particularly challenging is that they test both theoretical understanding and operational experience. You might be asked to debug why a consumer group keeps rebalancing during peak traffic, then immediately pivot to designing a stateful stream processor that handles late-arriving events. Many candidates can explain topics and partitions but struggle when asked about transaction isolation levels or how Schema Registry compatibility modes interact with rolling deployments.
Here are the top 29 Kafka interview questions organized by the core areas that trip up most candidates.
Streaming (Kafka) Interview Questions
Top Streaming (Kafka) interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Kafka Architecture and Core Concepts
Architecture questions separate candidates who've only used Kafka from those who've operated it in production. Interviewers at Meta and Uber focus heavily on partition leadership, replication mechanics, and failure scenarios because these directly impact system reliability.
The most common mistake is treating Kafka like a simple message queue instead of understanding it as a distributed commit log. When asked about broker failures or uneven partition load, many candidates suggest adding more brokers without considering partition assignment, leadership distribution, or the impact on consumer rebalancing.
Start with how Kafka is put together: brokers, topics, partitions, leaders, replicas, and controller behavior. You are tested on whether you can reason about throughput, availability, and failure modes under real incident scenarios, not just recite terms.
A producer is seeing uneven throughput: one partition in a topic is hot while others are mostly idle. You are using a key-based partitioner, 12 partitions, RF=3, and acks=all. What do you change first to improve throughput without losing ordering guarantees where they matter?
Sample Answer
Most candidates default to adding partitions, but that fails here because it does not fix key skew; it just spreads the same skew across more partitions and can still bottleneck on the hottest key. Your first lever is to change the partitioning strategy: ensure your key has high cardinality and is evenly distributed, or introduce a compound key or consistent hashing with salting for known hot keys. If ordering is only required per entity, keep the key as that entity id and fix upstream skew; if ordering is only required within a smaller scope, you can widen the key to reduce hotspots. Only after you fix key distribution does adding partitions translate into real throughput gains.
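To make the hot-key mitigation concrete, here is a minimal Python sketch of salting a known hot key into a compound key. The hot-key list, salt count, and helper name are illustrative assumptions, and ordering narrows to each (entity, salt) sub-stream, so this only applies where that is acceptable.

```python
# Assumed names, not production code: salt known hot keys into a compound key so
# their traffic spreads over several partitions, while every other key keeps
# plain per-entity ordering.
import random

HOT_ENTITY_IDS = {"entity_42"}   # hypothetical list of known hot entities
SALT_BUCKETS = 4                 # each hot entity fans out over 4 sub-keys

def partition_key(entity_id: str) -> bytes:
    """Kafka message key for an event belonging to entity_id."""
    if entity_id in HOT_ENTITY_IDS:
        # Ordering now holds only per (entity, salt) sub-stream; do this only
        # where consumers can merge or tolerate reordering across salts.
        salt = random.randrange(SALT_BUCKETS)
        return f"{entity_id}#{salt}".encode()
    return entity_id.encode()

# producer.produce(topic, key=partition_key(event["entity_id"]), value=payload)
```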
A broker that hosts several leaders for a high-traffic topic goes down during peak. With replication factor 3 and min.insync.replicas=2, what happens to producers using acks=all and to consumers in that topic, assuming unclean leader election is disabled?
You need to increase read throughput for a topic consumed by many independent consumer groups. You are debating between adding partitions versus adding more replicas and using follower fetching. Which do you choose, and what tradeoffs do you call out?
A cluster has 5 brokers. A topic has 60 partitions, RF=3. The controller broker dies and a new controller is elected. Walk through what metadata changes occur and what client-visible symptoms you expect during the controller failover.
You have a topic with RF=3 spread across 3 AZs. One AZ becomes partitioned from the other two for 10 minutes. Describe expected effects on ISR, leader election, and write availability for producers using acks=all and min.insync.replicas=2.
An incident report shows that after a rolling restart, some partitions had leaders concentrated on one broker, causing hotspotting. Explain how Kafka assigns leaders and replicas, what "preferred leader" means, and how you would restore balanced leadership safely in production.
Producers, Consumers, and Consumer Groups
Producer and consumer configuration questions reveal whether you understand Kafka's delivery guarantees and performance characteristics. Companies like Netflix and LinkedIn ask detailed questions about consumer group coordination because lag spikes and rebalancing issues are their most frequent production problems.
Candidates often memorize configuration parameters without understanding their interactions. For example, knowing that max.poll.interval.ms prevents rebalancing is useless if you don't understand how it relates to processing time, fetch sizes, and commit strategies in real applications.
You will be asked to debug and design producer and consumer configurations such as acks, retries, idempotence, batching, fetch settings, and group rebalancing. Many candidates struggle to connect these knobs to symptoms like lag spikes, duplicates, and uneven partition load.
Your Kafka producer is configured with acks=1 and retries enabled. After a broker restart you see occasional duplicate events downstream and your consumers are not idempotent. What producer settings do you change to prevent duplicates, and what tradeoff do you accept?
Sample Answer
Enable idempotence and require durable acknowledgments, typically enable.idempotence=true with acks=all. This prevents duplicates caused by retrying sends after timeouts by assigning sequence numbers per partition and deduplicating at the broker. You should also keep max.in.flight.requests.per.connection at or below 5, the safe range for idempotence, so ordering is preserved per partition. The tradeoff is higher end to end latency and potentially lower throughput, because you wait for more replicas and may reduce in flight concurrency.
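As a rough sketch, these settings look like the following with the confluent-kafka Python client (librdkafka config names); the broker address, topic, and key are placeholders, not recommendations.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",         # placeholder
    "acks": "all",                               # wait for the in-sync replicas
    "enable.idempotence": True,                  # broker dedupes retried batches per partition
    "max.in.flight.requests.per.connection": 5,  # <= 5 preserves per-partition ordering with idempotence
})

def on_delivery(err, msg):
    # With idempotence, retried sends do not create duplicates; a non-None err
    # means the message was ultimately not delivered and needs app-level handling.
    if err is not None:
        print(f"delivery failed for key={msg.key()}: {err}")

producer.produce("orders", key=b"order-123", value=b"...", on_delivery=on_delivery)
producer.flush()
```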
A consumer group processing 100 partitions shows lag spikes every few minutes even though average throughput is fine. How do you tune consumer fetch and processing settings to smooth lag, and how do you decide between increasing max.poll.records or increasing fetch.min.bytes and fetch.max.wait.ms?
In one consumer group, a few members are constantly rebalancing and the whole group pauses, causing periodic lag cliffs. The logs show 'Max poll interval exceeded' and you do heavy processing per message. How do you fix it without just adding more consumers?
You have uneven load: one partition is hot and the consumer assigned to it is always behind while others are idle. The producer uses a key, and you cannot change the number of partitions immediately. What do you do on the producer and consumer sides to reduce the impact today, and what longer term fix do you plan?
A producer with high throughput shows rising end to end latency and many small requests to brokers. You suspect batching is ineffective. Which producer configs do you inspect and change, and how do you confirm you improved batching without causing unacceptable delay?
After enabling exactly once semantics for a pipeline, you see occasional transaction timeouts and consumers stuck in read_committed mode with growing lag. What producer, broker, and consumer group behaviors could cause this, and what is your debug plan?
Delivery Guarantees and Exactly Once Semantics
Exactly-once semantics questions are where most senior candidates stumble, even those with years of Kafka experience. The topic requires understanding transactions, idempotent producers, and the subtle difference between exactly-once delivery and exactly-once processing.
The critical insight that separates strong candidates is recognizing that exactly-once is an end-to-end property, not just a Kafka feature. You must design your entire data pipeline, including external systems and failure recovery, to achieve exactly-once effects that matter to business logic.
Expect interviewers to push you on at most once, at least once, and exactly once, and what you actually guarantee end to end. To do well, you must explain how offsets, transactions, idempotent producers, and downstream dedupe work together when things crash mid flight.
Your consumer reads from Kafka, writes each event into Postgres, and then commits the offset. A crash happens after the DB write but before the offset commit. What delivery semantics do you actually have end to end, and how would you change the design to get exactly once effects?
Sample Answer
You could rely on at least once and make the Postgres writes idempotent, or you could use Kafka transactions with read-process-write and commit offsets as part of the transaction. The first path gives you duplicates on restart, which you then eliminate by deduping with a primary key or unique constraint on an event id. The second path wins when your output is Kafka, because transactions tie the produced records and the consumed offsets into one atomic commit. If your sink is Postgres, Kafka EOS alone is not enough; you still need idempotency or a transactional outbox style pattern.
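A minimal sketch of the first path, assuming each event carries a unique event id and the target table has a unique constraint on it; the table, columns, and connection details are illustrative.

```python
# At-least-once delivery plus an idempotent sink: replays after a crash between
# the DB write and the offset commit are silently skipped by the constraint.
import psycopg2

conn = psycopg2.connect("dbname=app user=etl")  # placeholder DSN

def write_event(event_id: str, payload: str) -> None:
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(
            "INSERT INTO events (event_id, payload) VALUES (%s, %s) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id, payload),
        )

# Consumer loop outline: poll -> write_event(...) -> commit the Kafka offset,
# so the crash window can only produce duplicates (absorbed here), never losses.
```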
You enable idempotent producer and set acks=all, then you see duplicates in a downstream topic after a broker failover. Explain how that can happen, and what additional pieces you need to truly prevent duplicates from being visible to consumers.
In Kafka Streams or a custom consumer-producer app, you want exactly once processing from input topic to output topic. Describe how offsets and transactions interact, and what happens if the process dies after producing output but before committing offsets.
A team claims they have end to end exactly once because they commit offsets only after writing to S3, and they turn on retries for uploads. Challenge that claim, and explain the standard fix and the key caveat.
You need to join events from two Kafka topics and write the result to another Kafka topic with exactly once semantics. What configuration and processing pattern do you use, and what failure mode still forces you to think about downstream dedupe?
Your pipeline is Kafka to Flink to a payment ledger database. The business asks for exactly once, not just at least once. What guarantees can you honestly claim, and what concrete mechanisms do you add at the database boundary to make the effect exactly once?
Stream Processing Design and Stateful Operations
Stream processing design questions test your ability to handle time, state, and scale simultaneously. At companies like Uber and LinkedIn, these scenarios are pulled directly from real systems that process user behavior, financial transactions, or operational metrics.
The key challenge is balancing freshness, correctness, and resource constraints. Many candidates focus too heavily on the happy path without considering how their windowing and state management strategies behave during traffic spikes, late data, or partial failures.
In system design style questions, you design streaming pipelines with joins, windows, aggregations, out of order data handling, and backpressure. Candidates often stumble when translating product requirements into event time semantics, state sizing, recovery, and operational SLOs.
You are building a real time dashboard of ride ETAs. Driver GPS events arrive out of order by up to 90 seconds, and product wants a 1 minute rolling average speed per driver with less than 5 seconds freshness. How do you pick event time vs processing time, define the windowing, and configure allowed lateness and watermarks?
Sample Answer
Reason through it: you start from the requirement that averages must reflect when the GPS reading happened, not when Kafka delivered it, so you use event time semantics keyed by driver. A 1 minute rolling average implies sliding or hopping windows, for example 60 second windows with a 1 second hop to keep freshness under 5 seconds while controlling compute. With out of order data up to 90 seconds, you set the watermark lag near 90 seconds and allowed lateness slightly above it if you want corrections; otherwise you drop late events or route them to a side output. You then decide whether the dashboard can tolerate updates to already emitted results: if not, you emit only after the watermark passes; if yes, you emit early results and send retractions or upserts as late events arrive.
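A framework-agnostic sketch of that reasoning, assuming integer-second timestamps: assign each GPS event to hopping windows and let a watermark decide when a window is final. The constants mirror the scenario; a real job would delegate this to Kafka Streams or Flink.

```python
WINDOW_SIZE_S = 60     # 1 minute rolling average
HOP_S = 1              # emit an updated result every second, well under the 5 s freshness budget
WATERMARK_LAG_S = 90   # matches the maximum expected out-of-orderness

def windows_for(event_ts: int):
    """All hopping windows [start, start + WINDOW_SIZE_S) that contain event_ts."""
    last_start = (event_ts // HOP_S) * HOP_S
    first_start = last_start - WINDOW_SIZE_S + HOP_S
    return [(s, s + WINDOW_SIZE_S)
            for s in range(first_start, last_start + HOP_S, HOP_S) if s >= 0]

def watermark(max_event_ts_seen: int) -> int:
    # "We believe no event older than this will still arrive."
    return max_event_ts_seen - WATERMARK_LAG_S

def is_final(window_end: int, max_event_ts_seen: int) -> bool:
    # Emit a final result once the watermark passes the window end; with allowed
    # lateness you keep the window state a bit longer to absorb corrections, and
    # anything later than that is dropped or sent to a side output.
    return watermark(max_event_ts_seen) >= window_end
```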
You need a streaming join between playback start events and ad impression events to attribute ads to sessions. Events can arrive up to 10 minutes late, session ids are skewed, and you must keep end to end latency under 2 seconds at p99. Design the join, state, and eviction strategy.
You are computing a per merchant fraud score by counting distinct cards per merchant in a 24 hour window, updated every minute. Memory is limited, and you need deterministic results for audits. How do you design the stateful aggregation, including state sizing and recovery?
A Kafka Streams job does a 5 minute windowed aggregation by user and writes results to an output topic consumed by multiple services. During peak traffic, lag grows and the app starts timing out downstream. How do you handle backpressure, state store pressure, and keep SLOs without dropping critical events?
Design a streaming pipeline that maintains a real time top 100 songs per country using play events, deduping repeated plays from retries, and updating the leaderboard every 10 seconds. Explain your windowing, exactly once strategy, and how you avoid global bottlenecks.
You run a stateful stream job that must survive region failover with less than 2 minutes recovery time and no more than 0.01% duplicate outputs. Topics are geo replicated, but state stores are local. Describe your checkpointing, changelog, replay, and cutover plan, including what metrics prove correctness during failover.
Schema Registry, Evolution, and Data Contracts
Schema Registry and evolution questions have become critical as more companies adopt event-driven architectures with dozens of producer and consumer teams. These questions test whether you can maintain data contracts while enabling rapid development cycles.
Understanding compatibility modes is just the beginning. The real test is designing rollout strategies that coordinate schema changes across independent teams with different deployment schedules, especially when those changes involve breaking changes disguised as compatible ones.
From a data quality angle, you need to show how you prevent breaking changes while teams ship independently. Interviewers look for how you use compatibility modes, versioning, serialization formats, and rollout strategies to keep consumers safe during migrations.
Your Avro event has a field "price" as an int. A consumer team wants it to be a decimal with cents, and producers want to change it to a string or logicalType. How do you evolve the schema in Schema Registry without breaking existing consumers, and what compatibility mode do you pick?
Sample Answer
This question is checking whether you can keep producers shipping while protecting existing consumers with concrete compatibility guarantees. You do not mutate the type in place: you add a new field like "price_cents" as a long or bytes with a decimal logicalType, keep the old field, and make the new field optional with a default. Set subject compatibility to BACKWARD or FULL depending on whether old consumers must also keep reading data written with the new schema; then roll out consumers that understand the new field before producers start writing it, and only later deprecate the old field. If you must change semantics, treat it as a new field or a new topic, because type changes are the classic schema break even if the name stays the same.
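A sketch of that additive change with the confluent-kafka Schema Registry client; the subject name, registry URL, and field names are illustrative, and per-subject compatibility is set here through the registry's REST config endpoint.

```python
import json
import requests
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

new_schema = {
    "type": "record",
    "name": "PriceEvent",
    "fields": [
        {"name": "price", "type": "int"},   # keep the old field for existing readers
        {"name": "price_cents",             # new optional field with a default
         "type": ["null", {"type": "bytes", "logicalType": "decimal",
                           "precision": 18, "scale": 2}],
         "default": None},
    ],
}

# Pin the subject's compatibility mode before registering the new version.
requests.put("http://schema-registry:8081/config/price-events-value",
             json={"compatibility": "BACKWARD"})

sr = SchemaRegistryClient({"url": "http://schema-registry:8081"})
sr.register_schema("price-events-value", Schema(json.dumps(new_schema), "AVRO"))
```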
You own a Kafka topic used by 12 downstream teams. You need to add a new required field for compliance within two weeks, but some consumers are deployed monthly. How do you use Schema Registry features and rollout strategy to meet the deadline without causing outages?
A producer team plans to rename field "userId" to "user_id" for consistency, and they want to do it in place. You are the data engineer responsible for the shared contract. What do you allow, and how do you implement the change safely across producers and consumers?
You are migrating a high volume topic from JSON to Protobuf with Schema Registry, and you cannot stop traffic. Describe a dual publish, dual consume plan, how you version subjects, and how you validate that consumers are safe before cutover.
Your Schema Registry is set to FULL compatibility, but a team wants to delete an unused field to reduce payload size. They claim no one reads it. What evidence do you require, what alternatives do you propose, and when would you allow a breaking change?
How to Prepare for Streaming (Kafka) Interviews
Run Kafka locally with failures
Set up a multi-broker cluster locally and practice killing brokers, triggering leader elections, and observing consumer group behavior. Use kafka-topics, kafka-consumer-groups, and kafka-log-dirs tools to inspect partition assignments and log segments.
Memorize key configuration interdependencies
Learn how acks, retries, and enable.idempotence work together for producers. Understand the relationship between max.poll.interval.ms, max.poll.records, and fetch.min.bytes for consumers. These combinations appear in most troubleshooting scenarios.
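A sketch of how those consumer knobs fit together, using kafka-python's keyword-argument spellings of the Java client properties; the values, topic, and handle() are illustrative assumptions.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="broker1:9092",
    group_id="etl-group",
    enable_auto_commit=False,
    max_poll_records=200,          # smaller batches finish faster per poll loop
    max_poll_interval_ms=300_000,  # must exceed worst-case processing time for one batch
    fetch_min_bytes=64 * 1024,     # batch fetches for throughput...
    fetch_max_wait_ms=250,         # ...but cap the wait so latency stays bounded
)

for record in consumer:
    handle(record)      # hypothetical per-record processing
    consumer.commit()   # commit after processing: at-least-once semantics
```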
Practice transaction scenarios on paper
Draw timeline diagrams showing exactly-once processing with transactions, including what happens during failures between producing output and committing input offsets. This visualization skill is essential for explaining complex exactly-once scenarios clearly.
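The same timeline in code: a minimal read-process-write loop with the confluent-kafka client, where the output record and the consumed offset commit in one transaction. Topic names, the group id, and transform() are placeholders, and real code also needs error handling plus abort_transaction() on failure.

```python
from confluent_kafka import Consumer, Producer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "eos-app",
    "isolation.level": "read_committed",  # do not read uncommitted or aborted data
    "enable.auto.commit": False,
})
consumer.subscribe(["input"])

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "transactional.id": "eos-app-1",      # stable id fences zombie instances
})
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("output", value=transform(msg.value()))  # transform() is hypothetical
    # The input offset commits inside the same transaction as the output record.
    producer.send_offsets_to_transaction(
        [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()  # a crash before this line aborts both, and the input is re-read
```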
Design state management for realistic constraints
Practice sizing stateful operations by calculating memory requirements for time windows and key cardinalities. Include eviction policies and recovery time estimates in your designs, as these operational concerns often determine feasibility.
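A back-of-the-envelope calculation of the kind interviewers expect; every number below is an assumption you would replace with your scenario's cardinalities.

```python
# Keyed windowed aggregation: state grows with keys x windows retained x entry size.
active_keys = 20_000_000          # e.g. users seen within the retention horizon
windows_retained = 12             # one hour of 5-minute windows kept for late data
bytes_per_entry = 64              # key, aggregate, and timestamps per window

state_bytes = active_keys * windows_retained * bytes_per_entry
changelog_bytes = state_bytes * 1.5   # rough overhead for the recovery/changelog topic

print(f"state:     {state_bytes / 1e9:.1f} GB")      # ~15 GB: shard across instances or spill to RocksDB
print(f"changelog: {changelog_bytes / 1e9:.1f} GB")  # replaying this bounds recovery time after failover
```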
Study Schema Registry compatibility matrices
Create examples of field additions, deletions, and type changes under each compatibility mode (backward, forward, full). Practice explaining why certain changes break compatibility even when they seem safe, like widening int to long, which is fine for new readers but breaks forward compatibility because old readers cannot handle the wider type.
How Ready Are You for Streaming (Kafka) Interviews?
Self-check: your topic has 24 partitions and 3 replicas. One broker fails. Producers and consumers must keep working with minimal disruption. What design choice best explains why the cluster can keep serving reads and writes?
Frequently Asked Questions
How much Kafka depth do I need for a Data Engineer interview?
You should be comfortable explaining core concepts like partitions, consumer groups, offsets, retention, delivery semantics, and schema evolution. Expect to reason about tradeoffs, for example at-least-once vs exactly-once delivery, or log compaction vs time-based retention. You do not need to memorize every broker config, but you should be able to troubleshoot common issues like lag, rebalancing, and duplicates.
Which companies tend to ask the most Kafka and streaming questions?
Companies with large event-driven systems and real-time data products ask Kafka questions the most, especially big tech, fintech, and high-scale consumer platforms. Data infrastructure teams and analytics platform teams often go deeper into Kafka internals than general product data teams. You should be ready for design questions about pipelines, SLAs, and failure handling.
Will I need to code in a Kafka-focused Data Engineer interview?
Often yes, but it is usually practical coding, not pure algorithms. You may write a small producer or consumer, implement idempotency and dedup logic, parse and validate events, or build a mini streaming transform with windowing. If you want targeted practice, use datainterview.com/coding for coding drills and datainterview.com/questions for Kafka interview prompts.
How do Kafka interview expectations differ across Data Engineer roles?
For an analytics-focused Data Engineer, you are typically evaluated on modeling event data, handling late events, and building reliable pipelines from Kafka into warehouses or lakes. For a platform or infra Data Engineer, you are expected to go deeper into broker internals, partitioning strategy, throughput tuning, ACLs, multi-cluster setups, and operational debugging. For a real-time applications Data Engineer, you should emphasize stream processing semantics, stateful processing, and end-to-end guarantees.
How can I prepare for Kafka interviews if I have no real-world Kafka experience?
Set up Kafka locally with Docker and build a small end-to-end project: produce events, consume them with a consumer group, and persist results to a sink like files or a database. Practice creating topics, choosing partitions, testing rebalances, and simulating failures like consumer crashes to observe offset commits and duplicates. Document your design choices and tradeoffs, then rehearse explaining them using questions from datainterview.com/questions.
What common Kafka interview mistakes should I avoid?
Do not claim exactly-once end-to-end without explaining where duplicates can still happen and how you handle idempotency at sinks. Avoid hand-wavy partitioning choices: explain how keys affect ordering, parallelism, and hotspot risk. And do not ignore operational concerns: be ready to discuss lag monitoring, offset management, rebalancing impact, and schema compatibility.
