Streaming (Kafka) Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 13, 2026

Streaming and Kafka questions dominate data engineering interviews at companies like Confluent, LinkedIn, Uber, and Netflix. These companies process billions of events daily, so they need engineers who understand not just Kafka basics, but the production gotchas around partitioning, exactly-once semantics, and stream processing. A single misconfigured consumer group can take down critical data pipelines, making this knowledge essential for senior roles.

What makes Kafka interviews particularly challenging is that they test both theoretical understanding and operational experience. You might be asked to debug why a consumer group keeps rebalancing during peak traffic, then immediately pivot to designing a stateful stream processor that handles late-arriving events. Many candidates can explain topics and partitions but struggle when asked about transaction isolation levels or how Schema Registry compatibility modes interact with rolling deployments.

Here are the top 29 Kafka interview questions organized by the core areas that trip up most candidates.

Advanced · 29 questions

Top Streaming (Kafka) interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.

Data Engineer · Meta · Uber · Netflix · LinkedIn · Confluent · Amazon · Spotify · Airbnb

Kafka Architecture and Core Concepts

Architecture questions separate candidates who've only used Kafka from those who've operated it in production. Interviewers at Meta and Uber focus heavily on partition leadership, replication mechanics, and failure scenarios because these directly impact system reliability.

The most common mistake is treating Kafka like a simple message queue instead of understanding it as a distributed commit log. When asked about broker failures or uneven partition load, many candidates suggest adding more brokers without considering partition assignment, leadership distribution, or the impact on consumer rebalancing.


Start with how Kafka is put together: brokers, topics, partitions, leaders, replicas, and controller behavior. You are tested on whether you can reason about throughput, availability, and failure modes under real incident scenarios, not just recite terms.

A producer is seeing uneven throughput: one partition in a topic is hot while the others sit mostly idle. You are using a key-based partitioner, 12 partitions, RF=3, and acks=all. What do you change first to improve throughput without losing ordering guarantees where they matter?

Uber · Medium · Kafka Architecture and Core Concepts

Sample Answer

Most candidates default to adding partitions, but that fails here because it does not fix key skew; it just spreads the same skew across more partitions and can still bottleneck on the hottest key. Your first lever is to change the partitioning strategy: ensure your key has high cardinality and is evenly distributed, or introduce a compound key or consistent hashing with salting for known hot keys. If ordering is required only per entity, keep the key as that entity id and fix the upstream skew; if ordering is required only within a smaller scope, you can widen the key to reduce hotspots. Only after you fix key distribution does adding partitions translate into real throughput gains.
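To make the skew argument concrete, here is a small self-contained Python sketch, not real Kafka code: the partitioner stands in for Kafka's default key-based strategy (Kafka actually hashes key bytes with murmur2), and the hot-key name, traffic split, and salt count are all made up for illustration.

```python
import itertools
from collections import Counter

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Stand-in for Kafka's default key-based partitioner
    # (real Kafka hashes the key bytes with murmur2).
    return hash(key) % NUM_PARTITIONS

# 90% of traffic comes from one hot key -> classic key skew.
events = ["driver-42"] * 900 + [f"driver-{i}" for i in range(100)]

skewed = Counter(partition_for(k) for k in events)
# The hot key's partition receives at least 900 of the 1000 events.

# Salting: append a rotating salt to known hot keys so their load
# spreads over several partitions. Ordering now holds only per
# (key, salt), so this is acceptable only where per-entity ordering
# can be relaxed or restored downstream.
salt = itertools.cycle(range(4))

def salted_key(key: str) -> str:
    return f"{key}#{next(salt)}" if key == "driver-42" else key

salted = Counter(partition_for(salted_key(k)) for k in events)

print("worst partition before:", max(skewed.values()))
print("worst partition after: ", max(salted.values()))
```

Note the salt is carried in the key itself (driver-42#3), so a downstream consumer can strip the suffix when re-aggregating per driver.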

Practice more Kafka Architecture and Core Concepts questions

Producers, Consumers, and Consumer Groups

Producer and consumer configuration questions reveal whether you understand Kafka's delivery guarantees and performance characteristics. Companies like Netflix and LinkedIn ask detailed questions about consumer group coordination because lag spikes and rebalancing issues are their most frequent production problems.

Candidates often memorize configuration parameters without understanding their interactions. For example, knowing that raising max.poll.interval.ms can stop unwanted rebalances is useless if you don't understand how it relates to processing time, fetch sizes, and commit strategies in real applications.


You will be asked to debug and design producer and consumer configurations such as acks, retries, idempotence, batching, fetch settings, and group rebalancing. Many candidates struggle to connect these knobs to symptoms like lag spikes, duplicates, and uneven partition load.

Your Kafka producer is configured with acks=1 and retries enabled. After a broker restart you see occasional duplicate events downstream and your consumers are not idempotent. What producer settings do you change to prevent duplicates, and what tradeoff do you accept?

Confluent · Medium · Producers, Consumers, and Consumer Groups

Sample Answer

Enable idempotence and require durable acknowledgments: typically enable.idempotence=true with acks=all. This prevents duplicates caused by retrying sends after timeouts, because the broker assigns sequence numbers per partition and deduplicates retried batches. You should also keep max.in.flight.requests.per.connection at 5 or below, the safe range for idempotence, so ordering is preserved per partition. The tradeoff is higher end-to-end latency and potentially lower throughput, because you wait for more replicas and may reduce in-flight concurrency.
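The settings named above are real producer config keys; the broker-side dedup they enable can be sketched with a toy partition log. This is a simplification of what Kafka actually tracks (a per-producer-id, per-epoch sequence per partition), reduced to "last sequence seen" for illustration.

```python
# Real producer settings that together prevent retry duplicates.
safe_producer_config = {
    "enable.idempotence": True,                  # assigns PID + per-partition sequence numbers
    "acks": "all",                               # wait for all in-sync replicas
    "max.in.flight.requests.per.connection": 5,  # <= 5 keeps ordering with idempotence
}

class PartitionLog:
    """Toy broker partition that dedupes by producer sequence number."""

    def __init__(self):
        self.records = []
        self.last_seq = -1

    def append(self, seq: int, value: str) -> None:
        if seq <= self.last_seq:
            return  # duplicate retry: the broker silently drops it
        self.records.append(value)
        self.last_seq = seq

log = PartitionLog()
log.append(0, "order-created")
log.append(1, "order-paid")
log.append(1, "order-paid")  # timeout-triggered retry of the same batch

print(log.records)  # ['order-created', 'order-paid'] -- the retry was absorbed
```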

Practice more Producers, Consumers, and Consumer Groups questions

Delivery Guarantees and Exactly Once Semantics

Exactly-once semantics questions are where most senior candidates stumble, even those with years of Kafka experience. The topic requires understanding transactions, idempotent producers, and the subtle difference between exactly-once delivery and exactly-once processing.

The critical insight that separates strong candidates is recognizing that exactly-once is an end-to-end property, not just a Kafka feature. You must design your entire data pipeline, including external systems and failure recovery, to achieve exactly-once effects that matter to business logic.


Expect interviewers to push you on at-most-once, at-least-once, and exactly-once, and on what you actually guarantee end to end. To do well, you must explain how offsets, transactions, idempotent producers, and downstream dedupe work together when things crash mid-flight.

Your consumer reads from Kafka, writes each event into Postgres, and then commits the offset. A crash happens after the DB write but before the offset commit. What delivery semantics do you actually have end to end, and how would you change the design to get exactly-once effects?

Uber · Medium · Delivery Guarantees and Exactly Once Semantics

Sample Answer

You have at-least-once delivery: on restart the consumer re-reads from the last committed offset and re-writes the event, producing duplicates. From there you have two paths. The first is to keep at-least-once and make the Postgres writes idempotent, deduping with a primary key or unique constraint on an event id. The second is Kafka transactions with read-process-write, committing offsets as part of the transaction; that path wins when your output is Kafka, because transactions tie the produced records and the consumed offsets into one atomic commit. If your sink is Postgres, Kafka EOS alone is not enough; you still need idempotency or a transactional-outbox-style pattern.
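A minimal sketch of the idempotent-sink path, with SQLite standing in for Postgres (the table and event shapes are illustrative): the unique constraint on the event id absorbs the redelivery that follows a crash between the DB write and the offset commit.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def handle(event):
    # INSERT OR IGNORE = dedupe at the sink; a replayed event is a no-op.
    # (In Postgres this would be INSERT ... ON CONFLICT DO NOTHING.)
    db.execute("INSERT OR IGNORE INTO events VALUES (?, ?)",
               (event["event_id"], event["payload"]))
    db.commit()
    # The Kafka offset commit would happen here; if we crash before it,
    # the event is redelivered and the insert above absorbs it.

batch = [{"event_id": "e1", "payload": "ride_started"},
         {"event_id": "e2", "payload": "ride_finished"}]
for e in batch:
    handle(e)

# Simulate the crash scenario from the question: e2 was written to the
# DB but its offset was never committed, so the consumer redelivers it.
handle({"event_id": "e2", "payload": "ride_finished"})

count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # still 2: exactly-once *effects* on top of at-least-once delivery
```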

Practice more Delivery Guarantees and Exactly Once Semantics questions

Stream Processing Design and Stateful Operations

Stream processing design questions test your ability to handle time, state, and scale simultaneously. At companies like Uber and LinkedIn, these scenarios are pulled directly from real systems that process user behavior, financial transactions, or operational metrics.

The key challenge is balancing freshness, correctness, and resource constraints. Many candidates focus too heavily on the happy path without considering how their windowing and state management strategies behave during traffic spikes, late data, or partial failures.


In system design style questions, you design streaming pipelines with joins, windows, aggregations, out of order data handling, and backpressure. Candidates often stumble when translating product requirements into event time semantics, state sizing, recovery, and operational SLOs.

You are building a real time dashboard of ride ETAs. Driver GPS events arrive out of order by up to 90 seconds, and product wants a 1 minute rolling average speed per driver with less than 5 seconds freshness. How do you pick event time vs processing time, define the windowing, and configure allowed lateness and watermarks?

Uber · Medium · Stream Processing Design and Stateful Operations

Sample Answer

Reason through it: you start from the requirement that averages must reflect when the GPS reading happened, not when Kafka delivered it, so you use event-time semantics keyed by driver. A 1-minute rolling average implies sliding or hopping windows, for example 60-second windows with a 1-second hop, to keep freshness under 5 seconds while controlling compute. With data out of order by up to 90 seconds, you set the watermark lag near 90 seconds, and allowed lateness slightly above it if you want corrections; otherwise you drop late events or route them to a side output. You then decide whether the dashboard can tolerate updates: if not, emit only after the watermark passes; if yes, emit early results and send retractions or upserts as late events arrive.
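The watermark and window mechanics can be sketched in plain Python with no framework. This is a toy model under simplifying assumptions: a 10-second hop instead of 1 second to keep the demo small, a watermark driven directly by the max observed timestamp, and drop-on-late rather than configurable allowed lateness.

```python
WINDOW = 60        # window length in seconds
HOP = 10           # hop interval (the scenario would use 1s)
WATERMARK_LAG = 90  # matches the 90s out-of-order bound

def windows_for(ts: int):
    # Every hopping window [start, start + WINDOW) that contains ts.
    first = (ts // HOP) * HOP - WINDOW + HOP
    return [s for s in range(max(0, first), ts + 1, HOP) if s <= ts < s + WINDOW]

state = {}     # (driver, window_start) -> list of observed speeds
watermark = 0

def on_event(driver: str, ts: int, speed: float):
    global watermark
    watermark = max(watermark, ts - WATERMARK_LAG)
    if ts < watermark:
        return "late"   # beyond allowed lateness: drop, or route to a side output
    for start in windows_for(ts):
        state.setdefault((driver, start), []).append(speed)
    return "accepted"

on_event("d1", 100, 30.0)
on_event("d1", 130, 50.0)                 # advances the watermark to 40
status = on_event("d1", 20, 99.0)         # 110s out of order: behind the watermark

avg = sum(state[("d1", 100)]) / len(state[("d1", 100)])
print(status, avg)  # late 40.0 -- the stale reading never pollutes the average
```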

Practice more Stream Processing Design and Stateful Operations questions

Schema Registry, Evolution, and Data Contracts

Schema Registry and evolution questions have become critical as more companies adopt event-driven architectures with dozens of producer and consumer teams. These questions test whether you can maintain data contracts while enabling rapid development cycles.

Understanding compatibility modes is just the beginning. The real test is designing rollout strategies that coordinate schema changes across independent teams with different deployment schedules, especially when those changes involve breaking changes disguised as compatible ones.


From a data quality angle, you need to show how you prevent breaking changes while teams ship independently. Interviewers look for how you use compatibility modes, versioning, serialization formats, and rollout strategies to keep consumers safe during migrations.

Your Avro event has a field "price" typed as an int. A consumer team wants it to be a decimal with cents, and producers want to change it to a string or a logicalType. How do you evolve the schema in Schema Registry without breaking existing consumers, and which compatibility mode do you pick?

Confluent · Hard · Schema Registry, Evolution, and Data Contracts

Sample Answer

This question checks whether you can keep producers shipping while protecting existing consumers with concrete compatibility guarantees. You do not mutate the type in place; you add a new field like "price_cents" as a long, or as bytes with a decimal logicalType, keep the old field, and make the new field optional with a default. Set subject compatibility to BACKWARD or FULL, depending on whether old producers must also keep working with new consumers, then roll out consumers that read the new field and fall back to the old one, and only later deprecate the old field. If you must change semantics, treat it as a new field or a new topic, because a type change is the classic schema break even when the name stays the same.
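A toy illustration of why the optional-field-with-default move is backward compatible, with plain dicts standing in for Avro schemas and records (the field names and schema shape are illustrative, not pulled from a real registry):

```python
schema_v1 = {"fields": {"price": {"type": "int"}}}
schema_v2 = {"fields": {"price": {"type": "int"},
                        "price_cents": {"type": "long", "default": None}}}

def read_with(schema, record):
    # Backward compatibility in miniature: a reader on the NEW schema
    # must decode records written with the OLD one, filling declared
    # defaults for fields the writer did not know about.
    out = {}
    for name, spec in schema["fields"].items():
        if name in record:
            out[name] = record[name]
        elif "default" in spec:
            out[name] = spec["default"]
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

old_record = {"price": 19}                 # written by a v1 producer
decoded = read_with(schema_v2, old_record)
print(decoded)  # {'price': 19, 'price_cents': None}
```

Had price_cents been added without a default, the last branch would raise: exactly the break a BACKWARD compatibility check in Schema Registry rejects at registration time.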

Practice more Schema Registry, Evolution, and Data Contracts questions

How to Prepare for Streaming (Kafka) Interviews

Run Kafka locally with failures

Set up a multi-broker cluster locally and practice killing brokers, triggering leader elections, and observing consumer group behavior. Use kafka-topics, kafka-consumer-groups, and kafka-log-dirs tools to inspect partition assignments and log segments.

Memorize key configuration interdependencies

Learn how acks, retries, and enable.idempotence work together for producers. Understand the relationship between max.poll.interval.ms, max.poll.records, and fetch.min.bytes for consumers. These combinations appear in most troubleshooting scenarios.
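A quick sanity check of the max.poll.interval.ms / max.poll.records relationship described above: the consumer must finish an entire poll batch before the interval expires or it is kicked from the group and triggers a rebalance. The per-record processing time here is an assumed measurement, not a Kafka default.

```python
max_poll_interval_ms = 300_000   # Kafka's default: 5 minutes between polls
max_poll_records = 500           # Kafka's default batch cap per poll
per_record_ms = 800              # assumed measured processing time per record

batch_time_ms = max_poll_records * per_record_ms
will_rebalance = batch_time_ms > max_poll_interval_ms
print(batch_time_ms, will_rebalance)  # 400000 True -> this consumer will thrash

# Either speed up processing or cap the batch so it fits the interval:
safe_max_records = max_poll_interval_ms // per_record_ms
print(safe_max_records)  # 375
```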

Practice transaction scenarios on paper

Draw timeline diagrams showing exactly-once processing with transactions, including what happens during failures between producing output and committing input offsets. This visualization skill is essential for explaining complex exactly-once scenarios clearly.

Design state management for realistic constraints

Practice sizing stateful operations by calculating memory requirements for time windows and key cardinalities. Include eviction policies and recovery time estimates in your designs, as these operational concerns often determine feasibility.
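A worked sizing example of the kind interviewers expect, under assumed inputs (the key count, event rate, and per-entry size are all illustrative): it also shows why choosing an aggregate-only representation can shrink state by orders of magnitude.

```python
active_keys = 2_000_000          # e.g. distinct drivers seen in the window (assumed)
window_seconds = 60
events_per_key_per_sec = 1       # assumed event rate
bytes_per_entry = 100            # key + timestamp + value + overhead (assumed)

# Raw-event state: required when the operator must keep every event,
# e.g. for sliding-window percentiles.
raw_bytes = active_keys * window_seconds * events_per_key_per_sec * bytes_per_entry

# Aggregate-only state: a running (sum, count) per key is enough for an average.
agg_bytes = active_keys * bytes_per_entry

print(raw_bytes / 1e9, "GB raw vs", agg_bytes / 1e9, "GB aggregate")
```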

Study Schema Registry compatibility matrices

Create examples of field additions, deletions, and type changes under each compatibility mode (backward, forward, full). Practice explaining why certain changes break compatibility even when they seem safe, like changing int to long.

How Ready Are You for Streaming (Kafka) Interviews?

Kafka Architecture and Core Concepts

Your topic has 24 partitions and 3 replicas. One broker fails. Producers and consumers must keep working with minimal disruption. What design choice best explains why the cluster can keep serving reads and writes?

Frequently Asked Questions

How much Kafka depth do I need for a Data Engineer interview?

You should be comfortable explaining core concepts like partitions, consumer groups, offsets, retention, delivery semantics, and schema evolution. Expect to reason about tradeoffs, for example at-least-once vs exactly-once, log compaction vs time-based retention. You do not need to memorize every broker config, but you should be able to troubleshoot common issues like lag, rebalancing, and duplicates.

Which companies tend to ask the most Kafka and streaming questions?

Companies with large event-driven systems and real-time data products ask Kafka questions the most, especially big tech, fintech, and high-scale consumer platforms. Data infrastructure teams and analytics platform teams often go deeper into Kafka internals than general product data teams. You should be ready for design questions about pipelines, SLAs, and failure handling.

Will I need to code in a Kafka-focused Data Engineer interview?

Often yes, but it is usually practical coding, not pure algorithms. You may write a small producer or consumer, implement idempotency and dedup logic, parse and validate events, or build a mini streaming transform with windowing. If you want targeted practice, use datainterview.com/coding for coding drills and datainterview.com/questions for Kafka interview prompts.

How do Kafka interview expectations differ across Data Engineer roles?

For an analytics-focused Data Engineer, you are typically evaluated on modeling event data, handling late events, and building reliable pipelines from Kafka into warehouses or lakes. For a platform or infra Data Engineer, you are expected to go deeper into broker internals, partitioning strategy, throughput tuning, ACLs, multi-cluster setups, and operational debugging. For a real-time applications Data Engineer, you should emphasize stream processing semantics, stateful processing, and end-to-end guarantees.

How can I prepare for Kafka interviews if I have no real-world Kafka experience?

Set up Kafka locally with Docker and build a small end-to-end project: produce events, consume them with a consumer group, and persist results to a sink like files or a database. Practice creating topics, choosing partitions, testing rebalances, and simulating failures like consumer crashes to observe offset commits and duplicates. Document your design choices and tradeoffs, then rehearse explaining them using questions from datainterview.com/questions.

What common Kafka interview mistakes should I avoid?

Do not claim exactly-once end-to-end without explaining where duplicates can still happen and how you handle idempotency at sinks. Avoid hand-wavy partitioning choices; you should explain how keys affect ordering, parallelism, and hotspot risk. And do not ignore operational concerns: you should be able to discuss lag monitoring, offset management, rebalancing impact, and schema compatibility.


Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn