Back-of-the-Envelope for Data Systems
A candidate I coached last year could explain Lambda architecture, knew the difference between Iceberg and Delta Lake, and had real Spark tuning experience. They got rejected at the debrief because when the interviewer asked "how much storage does this pipeline need?", they said "it depends" and moved on. That answer kills your credibility faster than any wrong answer would.
Estimation is a distinct skill from system design knowledge. Your interviewer doesn't expect you to land on the exact number. They want to see whether you can anchor on something real, chain your assumptions together logically, and arrive at an order-of-magnitude answer that proves you've actually run data systems before. Backend engineers think in QPS and p99 latency. You need to think in daily event volumes, row sizes, Parquet compression ratios, partition counts, and Spark job runtimes. Different units, different intuitions.
There are exactly two ways candidates fail this part of the interview. The first is skipping it entirely, jumping straight to "we'll use Kafka and S3" without any sense of scale. The second is the opposite: getting so tangled in arithmetic that three minutes pass and no conclusion surfaces. This guide gives you a repeatable four-step structure that keeps you moving forward, sounds credible out loud, and gets you to an answer before the interviewer loses patience.
The Framework
Four steps. Every estimation question in a data engineering interview fits this structure. Memorize the phases and the time budget before anything else.
| Phase | Time | Goal |
|---|---|---|
| 1. Anchor | 1-2 min | Lock in a starting quantity you and the interviewer agree on |
| 2. Derive Volume | 2-3 min | Multiply out to daily/monthly row counts |
| 3. Estimate Storage & Throughput | 3-4 min | Apply row sizes, compression, replication, and rate math |
| 4. Sanity Check | 1 min | Validate against a real-world benchmark and flag surprises |
Ten minutes total. If you're spending more than that, you've gone too deep into arithmetic and lost the interviewer.

Phase 1: Anchor on a Known Quantity
Start by naming a number you're confident about, or asking the interviewer to confirm one. Your anchor is the foundation everything else multiplies from. A bad anchor doesn't sink your estimate; an unstated anchor does.
What to do:
- Ask for (or state) DAU, events per user per day, or records per second. Pick whichever is most natural for the problem.
- Write it on the whiteboard or say it out loud explicitly. Don't keep it in your head.
- If the interviewer hasn't given you a number, propose one and ask if it's reasonable. "I'll assume 10 million daily active users. Does that feel right for this system?"
What to say:
"Before I start sizing anything, let me anchor on the scale we're designing for. I'll assume 10 million DAU, each generating roughly 50 events per session, with maybe two sessions per day. So we're looking at about 1 billion events per day. Does that match your mental model of the system?"
How the interviewer is evaluating you: They want to see that you don't just pull numbers from thin air. Stating your anchor explicitly, and inviting pushback, signals that you understand estimation is collaborative. If they correct you (say, "actually closer to 100M DAU"), take the correction without flinching and update your anchor. That's the right behavior.
Do this: Always write your anchor down before doing any math. It keeps you honest and gives the interviewer something to react to.
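The anchor math itself is trivial, which is exactly the point: state it, don't compute it silently. A quick sanity script using the example assumptions from the talk track above (10M DAU, 50 events per session, two sessions per day; all three are illustrative, not fixed facts):

```python
# Illustrative anchor from the talk track above. All three inputs are
# stated assumptions, open to interviewer pushback.
dau = 10_000_000
events_per_session = 50
sessions_per_day = 2

daily_events = dau * events_per_session * sessions_per_day
print(f"{daily_events:,} events/day")  # 1,000,000,000 events/day
```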
Phase 2: Derive Event and Row Volume
Once you have your anchor, the job is mechanical: multiply through to a daily row count. The interviewer is watching to see if you can chain unit conversions without losing track of what you're computing.
What to do:
- Convert your anchor to events per second AND events per day. You'll need both. Events per second drives your streaming design; events per day drives your storage math.
- Use the conversion: 1 day = 86,400 seconds. Round to 100,000 for fast math. (That introduces roughly 15% error, which is totally acceptable.)
- Write out the chain explicitly: DAU × events/user/day = daily events. Daily events ÷ 86,400 = events/sec.
What to say:
"Okay, 1 billion events per day. Dividing by roughly 100,000 seconds in a day, that's about 10,000 events per second on average. I'll use that as my baseline throughput and plan peak headroom on top of it. And for storage math, I'll work from the daily total of 1 billion rows."
How the interviewer is evaluating you: They're checking whether you naturally think in both batch and streaming terms. Candidates who only compute daily totals look like batch engineers. Candidates who only compute events per second look like they've never thought about storage. Show both.
Common mistake: Forgetting to account for traffic spikes. A system averaging 10K events/sec might see 3-5x that during peak hours. Mention it: "I'll design for 30K events/sec to handle peak load, even though the average is 10K."
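The full Phase 2 chain fits in a few lines. The 3x peak factor below is the assumption from the tip above; note that the exact division gives roughly 11,600 events/sec, which the talk track rounds down to 10,000:

```python
SECONDS_PER_DAY = 86_400   # round to 100_000 for mental math

daily_events = 1_000_000_000   # from the Phase 1 anchor
avg_eps = daily_events / SECONDS_PER_DAY
peak_eps = 3 * avg_eps         # assumed 3x peak-to-average ratio

print(round(avg_eps))   # 11574 -- the talk track rounds this to 10,000
print(round(peak_eps))  # 34722 -- call it ~30-35K at peak
```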
Phase 3: Estimate Storage and Throughput
This is where most candidates either shine or fall apart. The math isn't hard, but you need the right constants memorized. Here's what you should have in your head walking into any data engineering interview:
Row sizes for common event types:
- Simple click/pageview event (user ID, timestamp, URL, session ID): ~200-500 bytes raw JSON, ~50-100 bytes in Parquet
- Ride-sharing trip record (origin, destination, driver, fare, timestamps, status): ~500 bytes-1 KB raw, ~150-200 bytes compressed
- Log line (service, level, message, trace ID, timestamp): ~300-800 bytes raw
Compression and format multipliers:
- Parquet + Snappy compression: roughly 5-10x reduction from raw JSON. Use 5x as your conservative estimate.
- Parquet + Gzip: closer to 8-12x. Use 8x if you want to be more aggressive.
- Avro (binary, uncompressed): roughly 2-3x reduction from JSON, because field names live in the schema instead of being repeated in every record.
Kafka and Spark rules of thumb:
- Kafka partition throughput: ~10-50 MB/sec write, ~50-100 MB/sec read. Use 10 MB/sec as a conservative write limit per partition.
- Spark partition target size: 128 MB per partition. Divide your total data volume by 128 MB to get your partition count, which tells you your parallelism.
- Spark task throughput: roughly 100-200 MB/sec per core for simple transformations on Parquet data.
Storage costs (approximate, as of 2024):
- S3 standard: ~$23/TB/month
- BigQuery storage: ~$20/TB/month (active), ~$10/TB/month (long-term)
- Snowflake storage: ~$40/TB/month on-demand (capacity pricing runs closer to $23)
What to do:
- Start from your daily row count. Multiply by your assumed row size to get raw daily bytes.
- Apply your compression ratio to get compressed storage per day.
- Apply the replication factor (typically 3x for Kafka; 1x for S3, with cross-region replication handled and priced separately).
- Multiply by retention period to get total storage footprint.
What to say:
"I'm assuming each event is about 500 bytes as raw JSON. 1 billion events times 500 bytes is 500 GB of raw data per day. Stored as Parquet with Snappy compression, I'd expect roughly a 5x reduction, so about 100 GB per day compressed. With 3x Kafka replication and 7 days retention, the Kafka footprint alone is 100 GB times 3 times 7, which is 2.1 TB. The S3 long-term storage is just the compressed number times retention, so 100 GB times 30 days is 3 TB per month, costing roughly $70/month in storage. Compute is the bigger cost."
How the interviewer is evaluating you: They want to see you separate storage from compute costs. Storage is almost always cheap. The expensive part is scanning and processing that data repeatedly. If you only estimate storage and stop, you've missed the point. Mention compute cost separately, even if you only estimate it roughly.
Key insight: The Kafka replication multiplier surprises a lot of candidates. A topic with 3x replication and 7-day retention holds 21x a single day's ingest volume. Always call this out explicitly. It shows you've actually operated Kafka in production.
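The whole Phase 3 chain fits in one small function. This is a sketch of the math in the spoken example above (1 billion 500-byte events, 5x compression, 3x Kafka replication, 7-day Kafka and 30-day S3 retention); every default is an assumption you would state out loud, not a fixed constant:

```python
def storage_footprint(daily_rows, row_bytes, compression=5,
                      kafka_replicas=3, kafka_retention_days=7,
                      s3_retention_days=30, s3_usd_per_tb_month=23):
    """Chain raw bytes -> compressed -> Kafka footprint -> S3 cost."""
    raw_gb = daily_rows * row_bytes / 1e9
    compressed_gb = raw_gb / compression
    kafka_tb = compressed_gb * kafka_replicas * kafka_retention_days / 1_000
    s3_tb = compressed_gb * s3_retention_days / 1_000
    monthly_cost_usd = s3_tb * s3_usd_per_tb_month
    return raw_gb, compressed_gb, kafka_tb, monthly_cost_usd

# The spoken example: 1B events/day at 500 bytes each
raw, comp, kafka, cost = storage_footprint(1_000_000_000, 500)
print(raw, comp, kafka, cost)  # 500 GB raw, 100 GB compressed, 2.1 TB Kafka, $69/mo
```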
For streaming-specific estimates, your key outputs are different. Instead of total storage, you care about:
- Partition count: (peak events/sec × avg event size in bytes) / 10 MB/sec per partition. Round up.
- Consumer lag budget: how many seconds of backlog can your consumers tolerate before it's a problem?
- Flink or Spark Streaming cluster size: (peak throughput in MB/sec) / (MB/sec per core) = cores needed.
Batch jobs care about total volume and job duration. Streaming jobs care about sustained throughput and lag. State which mode you're estimating for before you start the math.
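The streaming outputs above reduce to two ceiling divisions. A sketch with assumed example values (30K peak events/sec, 500-byte events, plus the conservative per-partition and per-core constants from the rules of thumb):

```python
import math

peak_eps = 30_000          # assumed peak events/sec
event_bytes = 500          # assumed average event size
partition_write_mbps = 10  # conservative Kafka per-partition write limit
core_mbps = 100            # conservative per-core processing throughput

peak_mbps = peak_eps * event_bytes / 1e6            # 15.0 MB/s at peak
min_partitions = math.ceil(peak_mbps / partition_write_mbps)
min_cores = math.ceil(peak_mbps / core_mbps)
print(peak_mbps, min_partitions, min_cores)  # 15.0 2 1
```

Note what the result tells you: at this scale, raw throughput barely constrains anything, so in practice partition count is set by the consumer parallelism you want, not by the write ceiling.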
Phase 4: Sanity Check Against Real-World Systems
Don't skip this. A 30-second sanity check is what separates a candidate who's done estimation exercises from one who's actually built data systems.
What to do:
- Name a real company or system at a similar scale and compare your number to what you know about them.
- If your number is wildly different (more than 10x off), say so and investigate which assumption is driving the gap.
- Flag anything that surprised you in your own estimate. Interviewers love when you catch your own anomalies.
What to say:
"Let me just sense-check this. Uber has said publicly they process around 1 TB of trip data per day. We're estimating 100 GB compressed for a 10M DAU ride-sharing app, which is roughly 10x smaller than Uber. Uber has closer to 100M active users, so that ratio tracks. I feel good about this estimate."
Or, if something looks off:
"Actually, 2.1 TB in Kafka feels high for a 7-day window. Let me double-check my replication math... yes, that's right: 100 GB/day compressed, times 3 replicas, times 7 days. The number is correct, it just means we'd want to think carefully about Kafka retention policy versus offloading to S3 earlier."
How the interviewer is evaluating you: This step is almost entirely about intellectual honesty. They want to see that you have a mental model of real systems, not just abstract math. Dropping a benchmark reference (Uber, Netflix, Airbnb, Spotify) shows you've thought about data at scale outside of textbooks. And catching your own errors out loud, rather than hiding them, is a strong signal.
Do this: Before your interview, memorize two or three public data scale facts. Netflix streams ~15 PB/day. Airbnb has on the order of 7-8 million active listings. Uber processes over 100M trips/month. One well-placed benchmark reference makes your sanity check feel grounded rather than hand-wavy.
Putting It Into Practice
Let's work through a real example end-to-end. The prompt: "Design a real-time trip event pipeline for a ride-sharing app with 10 million daily active users. Events should land in S3 within a few minutes of occurring."
This is a classic data engineering interview question. It touches Kafka, object storage, streaming compute, and partitioning strategy, all in one. Here's how you walk through the estimation live.
The Worked Example: Ride-Share Event Pipeline
Interviewer: "Before we get into the architecture, can you give me a rough sense of the data volumes we're dealing with?"
You: "Sure. Let me start from the user base and work down. We have 10 million DAU. For a ride-sharing app, I'd expect a user to generate events at a few key moments: app open, ride request, driver match, trip start, location pings during the trip, and trip end. That's maybe 20 to 30 events per active session on average. I'll use 25 to keep the math clean."
Do this: You anchored on DAU and immediately derived an events-per-user assumption. You also stated the assumption out loud. The interviewer can challenge it, but you're not pulling numbers from thin air.
You: "So that's 10M users times 25 events, which is 250 million events per day. Divide by 86,400 seconds in a day, and we get roughly 3,000 events per second at average load. Peak is probably 3 to 5x that, so call it 10,000 to 15,000 events per second during rush hour."
Interviewer: "That feels a bit low for location pings. Don't those fire every few seconds during a trip?"
You: "Good catch. If the average trip is 20 minutes and location pings fire every 5 seconds, that's 240 pings per trip. If 20% of DAU take one trip per day, that's 2 million trips, times 240 pings, which is another 480 million events per day just from location pings. So let me revise total daily events up to around 700 million, and average throughput to roughly 8,000 events per second."
Do this: You didn't freeze when challenged. You broke the location ping volume into its own sub-calculation, revised your number, and moved on. This is exactly what interviewers want to see. The revised number is more credible, and you got there in under 30 seconds.
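That revision is worth being able to reproduce cold. A sketch of the sub-calculation, with the dialogue's assumptions labeled (25 base events per user, 20-minute trips, 5-second pings, 20% of DAU taking one trip a day):

```python
dau = 10_000_000
base_events = dau * 25                    # first-pass estimate: 250M/day

trip_minutes = 20                         # assumed average trip length
ping_interval_s = 5                       # assumed GPS ping cadence
pings_per_trip = trip_minutes * 60 // ping_interval_s   # 240 pings/trip

trips_per_day = int(dau * 0.2)            # assume 20% of DAU take one trip/day
ping_events = trips_per_day * pings_per_trip            # 480M/day

total_events = base_events + ping_events
avg_eps = total_events / 86_400
print(f"{total_events:,} events/day, ~{avg_eps:,.0f} events/sec")
```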
Storage Estimation
You: "Now let me size the storage. A trip event has a user ID, driver ID, trip ID, timestamp, lat/long, event type, and maybe a status code. I'll assume 200 bytes per event as a rough row size for JSON. We can compress that significantly with Parquet and Snappy, so call it a 5x compression ratio, bringing effective storage to about 40 bytes per event."
You: "700 million events times 200 bytes raw is 140 GB per day uncompressed. After Parquet compression, that's around 28 GB per day. For S3 retention, if we keep 90 days of data, that's about 2.5 TB total. At roughly $23 per TB per month on S3, we're talking maybe $60 per month for storage. That's basically free."
Key insight: Separating raw bytes from compressed bytes is a signal that you've actually worked with real pipelines. Candidates who stop at raw bytes look like they've only read about storage, not managed it.
Interviewer: "What if we're wrong about the row size? What if events are closer to 500 bytes?"
You: "Then raw daily volume is 350 GB instead of 140 GB. Compressed, that's around 70 GB per day, and 90-day retention becomes about 6 TB. Still cheap on S3, but it changes the Kafka sizing more than the storage cost. The throughput math shifts, which I'll get to."
Do this: When an interviewer changes an assumption, don't re-derive everything from scratch. Show which downstream numbers change and which ones don't. That's the sign of someone who understands the model, not just the arithmetic.
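Sensitivity checks like that are easy when the model is a pure function of its assumptions. A sketch comparing the two row-size scenarios from the dialogue:

```python
def daily_storage_gb(daily_events, row_bytes, compression=5):
    """Raw and compressed daily volume for a given row-size assumption."""
    raw_gb = daily_events * row_bytes / 1e9
    return raw_gb, raw_gb / compression

for row_bytes in (200, 500):  # original assumption vs the interviewer's challenge
    raw, comp = daily_storage_gb(700_000_000, row_bytes)
    print(f"{row_bytes} B rows: {raw:.0f} GB raw, {comp:.0f} GB compressed/day")
```

The storage cost stays negligible in both scenarios; the number that actually moves downstream is the Kafka throughput, exactly as the dialogue says.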
Kafka Sizing
You: "For Kafka, I need to figure out partition count. The rule of thumb I use is that a single Kafka partition handles about 10 MB/s of throughput safely. At 8,000 events per second and 200 bytes per event, that's 1.6 MB/s of raw throughput. So one partition could technically handle it, but you'd never run a production topic with one partition."
You: "In practice, you want to partition by something meaningful, like trip ID or region, to enable parallel consumers. I'd start with 20 to 30 partitions to give room for consumer parallelism and handle peak spikes. With 3x replication, the broker storage for a 7-day retention window is: 1.6 MB/s times 86,400 seconds times 7 days times 3 replicas. That's about 2.9 TB across the Kafka cluster."
Common mistake: Forgetting replication and retention when sizing Kafka. A 7-day topic with 3x replication is 21x your daily raw volume. Miss that and you'll under-provision brokers by an order of magnitude.
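The broker storage arithmetic from that exchange, scripted with the dialogue's numbers (8K average events/sec, 200-byte events, 3x replication, 7-day retention):

```python
avg_eps = 8_000            # average events/sec from the revised estimate
event_bytes = 200          # assumed raw event size
replicas = 3
retention_days = 7

ingest_mbps = avg_eps * event_bytes / 1e6        # 1.6 MB/s into the topic
daily_gb = ingest_mbps * 86_400 / 1_000          # ~138 GB/day raw
cluster_tb = daily_gb * retention_days * replicas / 1_000
print(f"{ingest_mbps} MB/s ingest, ~{cluster_tb:.1f} TB across the cluster")
```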
Spark Job Runtime
Interviewer: "Let's say we're running hourly Spark jobs to transform and compact these events before writing to S3. How long do you expect those jobs to take?"
You: "Each hourly batch is about 1/24th of daily volume. At 28 GB compressed per day, that's roughly 1.2 GB per hour. Spark's partition sizing rule of thumb is 128 MB per partition, so that's about 10 partitions worth of data. On a modest cluster, say 5 executors with 4 cores each, that's 20 parallel tasks. With 10 partitions, the job is basically one wave of tasks."
You: "Reading 1.2 GB from S3, running transformations, and writing back out should take 2 to 4 minutes on that cluster. If the SLA is 'events land in S3 within a few minutes,' hourly compaction jobs are fine. If we needed sub-minute latency, we'd need to rethink the architecture entirely and go with a streaming approach."
Do this: Tie your compute estimate back to the original SLA. The interviewer said "within a few minutes." You just showed that hourly batch jobs satisfy that requirement, which means you don't need a more complex streaming architecture for the compaction layer.
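The runtime reasoning above is mostly partition counting. A sketch using the example's numbers (the 5-executor, 4-core cluster shape is an assumption from the dialogue):

```python
import math

daily_compressed_gb = 28
hourly_gb = daily_compressed_gb / 24                   # ~1.2 GB per batch
spark_partitions = math.ceil(hourly_gb * 1_000 / 128)  # 128 MB target size
cores = 5 * 4                                          # 5 executors x 4 cores
task_waves = math.ceil(spark_partitions / cores)
print(spark_partitions, cores, task_waves)  # 10 20 1 -- a single wave of tasks
```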
Streaming Path (Flink/Spark Streaming)
Interviewer: "Actually, let's say the requirement is more like 30 seconds end-to-end. How does that change things?"
You: "Then we're in streaming territory. At 8,000 events per second, a Flink or Spark Streaming job needs to keep up with that throughput continuously. For Flink, I'd size one task slot per Kafka partition, so 20 to 30 task slots. With 2 cores per task manager and 4 GB of memory, that's maybe 10 to 15 task managers. That's a small cluster."
You: "The consumer lag budget is the key constraint here. If we want 30-second end-to-end latency and Kafka ingestion adds maybe 5 seconds, the streaming job has 25 seconds to process and flush each micro-batch. With 8,000 events per second, each 25-second window contains 200,000 events. At 200 bytes each, that's 40 MB per window. A healthy Flink cluster should process that in well under 25 seconds, so we have headroom."
Key insight: Consumer lag budget is the streaming equivalent of job runtime in batch. State it explicitly: total latency budget minus ingestion time equals the processing window you have to work with. Interviewers rarely hear candidates frame it this way, and it immediately signals experience.
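The lag-budget framing reduces to a subtraction and a throughput check. The same numbers as the dialogue (30-second budget, an assumed 5 seconds of Kafka ingestion overhead, 8K events/sec, 200-byte events):

```python
latency_budget_s = 30
ingestion_s = 5                          # assumed Kafka ingestion overhead
processing_window_s = latency_budget_s - ingestion_s   # 25s to process + flush

avg_eps = 8_000
event_bytes = 200
window_events = avg_eps * processing_window_s          # 200K events per window
window_mb = window_events * event_bytes / 1e6          # 40 MB per window
print(processing_window_s, window_events, window_mb)
```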
Interviewer: "You mentioned 8,000 events per second as average. What happens at peak?"
You: "At 3 to 5x peak, we're looking at 24,000 to 40,000 events per second. That's where the partition count matters. If we provisioned 30 Kafka partitions and 30 Flink task slots, we can scale the Flink cluster horizontally to match. The Kafka partitions are the ceiling on parallelism, so I'd actually bump that to 50 partitions upfront to give us room to scale without repartitioning later."
The whole estimation took about 5 to 7 minutes of dialogue. You covered daily volume, storage, Kafka sizing, batch job runtime, and streaming throughput. You revised a number mid-stream when challenged, explained which assumptions drive the most sensitivity, and connected every estimate back to the original design requirements. That's the full loop.
Common Mistakes
Most candidates don't fail estimation because they got the math wrong. They fail because they make structural errors that signal inexperience with real data systems. Here's what those look like.
Pulling Numbers Out of Thin Air
You've seen this candidate. They say "let's assume 10TB of data per day" with zero explanation, then build an entire architecture on top of it. The interviewer has no idea if that number came from reasoning or from nowhere.
Don't do this: "So we'll have about 10TB per day... anyway, for the storage layer I'm thinking S3 with Parquet..."
The problem isn't the number. It's the silence before it. Interviewers penalize this because they can't tell if you understand the system or if you're just pattern-matching to something you heard once. An unjustified anchor is worse than a wrong one, because at least a wrong one can be corrected.
Do this: State your assumption explicitly. "I'll assume 50M daily active users, each generating about 20 events per session. That gives me roughly 1 billion events per day. Let me work forward from there."
One sentence of justification turns a guess into a reasoned estimate.
Forgetting Compression
You multiply out your row count, apply your row size, land on 50TB of raw storage, and stop there. Confident. Wrong.
Parquet with Snappy compression typically achieves 5x to 10x reduction over raw JSON or CSV. That 50TB becomes 5-10TB on disk. If you're estimating for a columnar format like Parquet (which you almost always should be in a data engineering context), skipping this step means you've over-provisioned storage by an order of magnitude.
Don't do this: Presenting a raw byte estimate as your final storage number without mentioning compression at all.
Interviewers who've actually run pipelines know that nobody stores raw JSON in production at scale. Missing compression signals that you haven't either.
Fix: After you calculate raw bytes, explicitly say "applying a 5x Parquet/Snappy compression ratio" and divide. It takes five seconds and immediately signals production awareness.
Conflating Storage Cost With Compute Cost
"So we're storing 5TB per day, that's going to get expensive fast."
Storage is not expensive. S3 costs roughly $23 per TB per month. Five terabytes a day for a month is 150TB, which runs you about $3,500. That's a rounding error in a data platform budget.
The expensive part is compute: scanning that 150TB in Spark or BigQuery, running daily aggregation jobs, reprocessing on schema changes. A single full-table scan in BigQuery at $5 per TB costs $750 for that 150TB. Run it 10 times a day and you've spent more in a week than your storage costs in a year.
Key insight: When an interviewer asks "how expensive will this be to run?", they want to hear you separate storage cost (cheap, predictable) from compute cost (variable, the real lever). Candidates who treat them as the same thing look like they've only read about data systems, not operated them.
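The storage-versus-compute gap is worth quantifying once so it sticks. A sketch of the comparison above, using the approximate prices already cited ($23/TB/month for S3, ~$5/TB scanned for BigQuery on-demand; both rounded):

```python
daily_tb = 5
monthly_tb = daily_tb * 30                 # 150 TB accumulated in a month

storage_per_month = monthly_tb * 23        # S3 standard, ~$23/TB/month
one_full_scan = monthly_tb * 5             # BigQuery on-demand, ~$5/TB scanned
scans_for_a_week = one_full_scan * 10 * 7  # 10 full scans/day for 7 days

print(storage_per_month, one_full_scan, scans_for_a_week)
# $3,450/month to store it; $52,500 for a week of scanning it 10x/day
```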
Getting Stuck on Precision
Three minutes of silence while you work out 86,400 seconds times 1,000 events times 512 bytes divided by 1,073,741,824 bytes per gigabyte. The interviewer is watching you do long division in your head. Nobody is impressed.
Don't do this: Treating estimation like an exam where partial credit depends on showing every step of the arithmetic.
Precision is the enemy of momentum here. The interviewer does not care if you land on 43.2GB versus 40GB. They care that you can reason through orders of magnitude quickly and keep the conversation moving.
Fix: Round aggressively and say so. "86,400 seconds in a day, I'll call it 100K. 1,000 events per second times 100K seconds is 100M events. At 500 bytes each, that's 50GB raw. I'm rounding throughout, so treat this as a ballpark." That's the whole thing, done in 20 seconds.
Ignoring Replication and Retention
You estimate that your Kafka topic will receive 100GB of data per day. You size your Kafka cluster for 100GB. You've now undersized it by 21x.
A standard Kafka topic with 3x replication and 7-day retention holds 100 GB/day × 7 days × 3 replicas = 2.1 TB. That's not a small rounding error. That's a cluster that falls over in production.
Don't do this: Sizing Kafka (or any storage layer with retention) based on daily ingest volume alone.
This is one of the clearest signals that separates candidates who have actually run pipelines from those who haven't. Replication and retention are operational realities, not theoretical concerns. Missing both of them in the same estimate is a red flag.
Do this: After you calculate daily ingest volume, immediately ask yourself: "What's the replication factor? What's the retention window?" State both out loud, then multiply. "Three replicas, seven-day retention, so my Kafka cluster needs to hold 21x the daily volume."
It takes one extra sentence and it tells the interviewer you've thought about this in production terms.
Quick Reference
Memorize these numbers. Not approximately. Exactly.
The Core Numbers
| Category | Value | Notes |
|---|---|---|
| Row sizes | ||
| Clickstream event | 200–500 bytes | URL, user ID, timestamp, metadata |
| Ride/trip event | 500–1,000 bytes | GPS coords, driver/rider IDs, status |
| Log line | 300–800 bytes | Service, level, message, trace ID |
| Payment transaction | 300–600 bytes | IDs, amount, currency, status codes |
| Compression | ||
| Parquet + Snappy | 5–10x vs raw JSON | Use 5x for safety |
| Parquet + GZIP | 8–12x vs raw JSON | Slower write, better scan |
| Avro (binary) | 2–3x vs JSON | Compact row format; common on Kafka |
| Kafka | ||
| Max throughput per partition | ~10 MB/s write | Rule of thumb; hardware-dependent |
| Target throughput per partition | 1–10 MB/s | Add partitions if you're near the ceiling |
| Replication factor | 3x storage overhead | Standard for production |
| 7-day retention multiplier | 7 × daily volume × 3 | Don't forget this one |
| Spark | ||
| Target partition size | 128 MB | Tune with spark.sql.files.maxPartitionBytes |
| Parallelism rule | 2–4 tasks per CPU core | More tasks = better skew tolerance |
| Shuffle partition default | 200 | Almost always too low for large jobs |
| Storage costs (approx.) | ||
| S3 standard | $0.023 / GB / month | ~$23 / TB / month |
| BigQuery storage | $0.02 / GB / month | Active; long-term is $0.01 |
| Snowflake storage | ~$23–40 / TB / month | $23 capacity pricing, ~$40 on-demand |
| Throughput | ||
| Spark read from S3 (Parquet) | ~200–500 MB/s per executor | Network-bound in practice |
| Kafka consumer throughput | ~50–100 MB/s per consumer | Depends on message size |
Unit Conversions
Keep these in your head so you're not doing long division mid-interview.
| Conversion | Multiplier |
|---|---|
| Seconds in a day | 86,400 (round to 100K) |
| Events/sec to events/day | × 86,400 |
| 1M events/day at 500 bytes | = 500 MB/day raw |
| 500 MB/day compressed (5x) | = 100 MB/day on disk |
| 100 MB/day × 365 | ~35 GB/year |
| 1 TB = | 1,000 GB = 1M MB |
| 1 GB = | 1,000 MB = 1B bytes |
| 1M users × 10 events/day | = 10M events/day |
| 10M events/day ÷ 86,400 | ~115 events/sec |
Common Prompts and Where to Start
| Interview prompt | Your anchor |
|---|---|
| "Design a clickstream pipeline" | Page views per DAU (assume 20–50 per user) |
| "Build a ride event pipeline" | Trips per DAU (assume 1–2 per active rider) |
| "Design a log aggregation system" | Log lines per service per second (assume 1K–10K/sec) |
| "Build a payment analytics pipeline" | Transactions per DAU (assume 0.1–0.5 per user) |
| "Design a real-time recommendation system" | Impression events per DAU (assume 50–200 per user) |
Verbal Template
When you open your estimation, say something like this:
"Let me start with what I know and derive from there. I'll assume 10 million daily active users, each generating around 20 events per day. That's 200 million events daily. At roughly 500 bytes per event, that's 100 GB of raw data per day. Compressed with Parquet and Snappy at a 5x ratio, we're looking at about 20 GB per day on disk. With 3x Kafka replication and 7-day retention, the Kafka storage footprint is around 420 GB. That feels reasonable. Let me know if you want me to adjust any of these assumptions."
That's it. State the anchor, show the chain, land on a number, invite pushback.
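The template's chain, verified in code. Every input below is one of the template's stated assumptions:

```python
dau = 10_000_000
events_per_user = 20
row_bytes = 500
compression = 5
kafka_replicas, kafka_retention_days = 3, 7

daily_events = dau * events_per_user                 # 200M events/day
raw_gb = daily_events * row_bytes / 1e9              # 100 GB raw per day
disk_gb = raw_gb / compression                       # 20 GB/day on disk
kafka_gb = disk_gb * kafka_replicas * kafka_retention_days  # Kafka footprint
print(raw_gb, disk_gb, kafka_gb)  # 100.0 20.0 420.0
```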
Framework Phases at a Glance
| Phase | What you do | Time to spend |
|---|---|---|
| Anchor | State your starting assumption out loud | 30 seconds |
| Volume | Multiply to daily/monthly row count | 1 minute |
| Storage | Apply row size, compression, replication | 1–2 minutes |
| Throughput | Derive events/sec, partition count, job runtime | 1–2 minutes |
| Sanity check | Compare to a real system you know | 30 seconds |
Total: under 5 minutes in this compressed form (the full framework at the top budgets up to ten). If you're going longer, you're over-indexing on precision.
Phrases to Use
These are ready to drop in at specific moments.
- Opening the estimate: "Let me anchor on DAU and derive from there. I'll state my assumptions as I go."
- Stating a rough number: "I'll assume 500 bytes per event. That's a reasonable middle ground for this schema. We can revisit if you think it's off."
- Applying compression: "Raw JSON is about 100 GB, but in Parquet with Snappy we get roughly a 5x reduction, so call it 20 GB on disk."
- Catching your own error: "Actually, I forgot to account for replication. Let me multiply that by 3 before we move on."
- Handling a challenge: "Fair point. If we double the row size to 1KB, the storage doubles to 40 GB per day. That changes the Kafka partition count but doesn't fundamentally change the architecture."
- Landing the conclusion: "So we're looking at roughly 20 GB per day compressed, about 600 GB per month. That's well within a single S3 prefix and a manageable Spark job. Let me sketch the pipeline."
Red Flags to Avoid
- Pulling a number with no anchor. "About 10 TB" with no derivation looks like a guess, because it is.
- Stopping at raw bytes. Always apply compression. Always. Parquet changes the math by 5–10x.
- Forgetting Kafka replication and retention. A 7-day, 3x-replicated topic is 21x your daily raw volume.
- Spending more than 90 seconds on arithmetic. Round aggressively. The interviewer does not care if it's 18 GB or 22 GB.
- Never sanity-checking. Saying "that's roughly in line with what Uber or Airbnb would see at this scale" signals you've thought about real systems.
Key takeaway: Estimation in a data engineering interview is not about getting the right number; it's about showing you can reason from a stated assumption through a chain of multiplications to a defensible conclusion, without freezing up or losing the thread.
