Data Serialization & Encoding: What Every Data Engineer Must Know

Dan Lee, Data & AI Lead
Last updated: March 9, 2026

Data Serialization & Encoding

At 3am, a pipeline at a mid-sized fintech company went silent. No errors at first, just missing data. The culprit: a backend team had added a new required field to their event schema, pushed it to production, and the downstream Kafka consumer, expecting the old Avro schema, couldn't deserialize a single message. By the time anyone noticed, six hours of transaction events had piled up unprocessed.

This is a serialization problem. At its core, serialization is just the process of taking data that lives in memory and converting it into bytes you can store or send somewhere. The format you choose determines how those bytes are structured, what happens when the structure changes, and how fast something downstream can read it back. Every pipeline you build makes this choice, often without realizing it.

The format decision touches more than you'd think. Parquet vs Avro vs JSON isn't just a storage preference; it affects your query costs in Snowflake, your Kafka consumer throughput, your ability to add a field without breaking downstream jobs, and how much you're paying for S3. Interviewers at Airbnb, Uber, and Netflix expect you to reason through all of that. Most candidates drop "Parquet" into the conversation like a password and move on. The ones who get offers can explain exactly why.

How It Works

Every time data moves in your pipeline, something has to convert it. A Spark job holds a DataFrame in memory as JVM objects. Kafka needs bytes. S3 needs bytes. Your downstream consumer needs objects again. Serialization is that conversion: in-memory structure goes in, byte sequence comes out. Deserialization runs it in reverse.

The format you choose determines everything about how those bytes are arranged, and that choice has real consequences for storage cost, query speed, and whether your pipeline survives a schema change at 3am.

Think of it like packing a suitcase. You can throw everything in randomly (JSON, readable but bulky) or fold and organize by category (Parquet, harder to unpack one item but dramatically more space-efficient for bulk access).
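As a minimal sketch of that round trip, here's the conversion in Python using the stdlib `json` module (any format works the same way conceptually; only the byte layout differs):

```python
import json

# In-memory structure: a Python dict (think JVM object, DataFrame row, etc.)
event = {"user_id": 42, "action": "purchase", "amount_usd": 19.99}

# Serialization: in-memory structure -> bytes you can store or send
payload: bytes = json.dumps(event).encode("utf-8")

# Deserialization: bytes -> in-memory structure again
restored = json.loads(payload.decode("utf-8"))

assert restored == event  # the round trip preserves the data
```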

Here's what that flow looks like:

Data Serialization Flow with Schema Registry

The Row vs. Column Split

This is the most important mental model to lock in before your interview. Row-oriented formats (Avro, Protobuf, JSON) store all fields for a single record together, one record after another. When a Kafka producer emits an event, it needs to write that whole record in one shot and move on. Row formats are built for that.

Column-oriented formats (Parquet, ORC) flip the layout. All values for a single column live together across thousands of rows. When Spark runs a query like SELECT revenue, country FROM events WHERE date = '2024-01-01', it only needs to read two column chunks instead of scanning every field of every record. That's why Parquet dominates data lake workloads.

The trade-off is sharp: columnar formats are terrible for streaming writes and single-row lookups. Appending one row to a Parquet file means rewriting the whole thing. Interviewers will probe this, so know it cold.
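A toy illustration of the two layouts (pure Python, not a real storage engine) makes the trade-off concrete: appending a row is trivial in the row layout, while an analytical read touches only the columns it needs in the columnar layout:

```python
# Toy illustration of row vs. column layout (not a real storage engine).
records = [
    {"user_id": 1, "country": "US", "revenue": 10.0},
    {"user_id": 2, "country": "DE", "revenue": 25.0},
    {"user_id": 3, "country": "US", "revenue": 5.0},
]

# Row-oriented (Avro/Protobuf/JSON style): each record's fields stay
# together, so writing one new event is a single append.
row_store = list(records)
row_store.append({"user_id": 4, "country": "FR", "revenue": 12.0})

# Column-oriented (Parquet/ORC style): all values for one column live
# together across rows. Appending one row means touching every column.
col_store = {
    "user_id": [r["user_id"] for r in records],
    "country": [r["country"] for r in records],
    "revenue": [r["revenue"] for r in records],
}

# Analytical read: "SELECT revenue WHERE country = 'US'" touches only
# two of the three columns, never the user_id values.
total_us_revenue = sum(
    rev for rev, c in zip(col_store["revenue"], col_store["country"]) if c == "US"
)
print(total_us_revenue)  # 15.0
```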

Common mistake: Candidates say "we'll store everything in Parquet" without realizing that Parquet and streaming are a bad combination. Parquet is a landing format, not a transport format.

Schema-on-Write vs. Schema-on-Read

Some formats enforce a schema at encode time. With Avro or Protobuf, the producer can't write a malformed record because the serializer validates it against a schema before producing any bytes. You get a hard failure early, which is exactly what you want in a production pipeline.

JSON and CSV take the opposite approach. The producer writes whatever it wants, and the consumer figures out the structure later. This feels flexible until a producer renames a field and your downstream job silently starts reading nulls for three hours before anyone notices.
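That silent-null failure mode is easy to demonstrate. In this sketch (hypothetical field names), the rename produces no error anywhere; the consumer just starts reading `None`:

```python
import json

# Producer v1 writes "user_id"; the consumer expects it.
old_msg = json.dumps({"user_id": 42, "amount": 10.0})

# Producer v2 silently renames the field. Nothing fails at write time.
new_msg = json.dumps({"uid": 42, "amount": 10.0})

def consume(raw: str):
    record = json.loads(raw)
    # Schema-on-read: the consumer's structural assumptions live here,
    # unenforced by anything upstream.
    return record.get("user_id")  # returns None instead of raising

print(consume(old_msg))  # 42
print(consume(new_msg))  # None -- the silent nulls the paragraph warns about
```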

Your interviewer cares about this because schema drift is one of the most common real-world pipeline failures. Knowing the difference between schema-on-write and schema-on-read signals that you've actually operated pipelines at scale, not just read about them.

How Binary Formats Stay Compact

JSON repeats field names in every single record. In a payload with a million events, the string "user_id" gets written a million times. Binary formats like Avro and Protobuf eliminate that waste entirely.

Avro encodes field values sequentially, relying on the schema to tell the reader what each position means. Protobuf uses numbered field tags, so instead of writing "user_id", it writes a small integer like 1. The reader's compiled proto definition maps that tag back to the field name. No names in the payload, just compact values.
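The size difference is easy to see with a stdlib sketch. `struct` is not Avro or Protobuf, but it illustrates the same principle: values are written positionally, and the "schema" (field order and types) is known out of band by both sides, so no field names appear in the payload:

```python
import json
import struct

record = {"user_id": 123456, "amount_cents": 1999, "country_code": 840}

# Text encoding: field names repeat in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding in the spirit of Avro: values packed positionally.
# The schema string is shared out of band, like a registered Avro schema.
SCHEMA = "<qiH"  # int64 user_id, int32 amount_cents, uint16 country_code
binary_bytes = struct.pack(
    SCHEMA, record["user_id"], record["amount_cents"], record["country_code"]
)

print(len(json_bytes), len(binary_bytes))  # 62 14

# Decoding needs the schema to know what each position means.
user_id, amount_cents, country_code = struct.unpack(SCHEMA, binary_bytes)
assert user_id == 123456
```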

Compression (Snappy, GZIP, ZSTD) layers on top of this. It doesn't replace serialization; it shrinks the byte sequence that serialization already produced. Columnar formats benefit especially from compression because a column of repeated values (like a country field that says "US" ten thousand times) compresses down to almost nothing.

Key insight: Binary format plus columnar layout plus compression is multiplicative, not additive. Each layer compounds the benefit of the others. That's why Parquet with Snappy can be 10x smaller than equivalent JSON.
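The layering is easy to demonstrate with the stdlib (`zlib` standing in for Snappy or ZSTD): serialization produces the bytes first, compression shrinks them second, and a column of repeated values collapses to almost nothing:

```python
import json
import zlib

# A column of highly repetitive values, like the "country" field above.
column = ["US"] * 10_000

# Layer 1, serialization: structure -> bytes (JSON here for simplicity).
serialized = json.dumps(column).encode("utf-8")

# Layer 2, compression: shrink the bytes serialization already produced.
# The layers are independent -- swap zlib for Snappy or ZSTD without
# touching the serialization format at all.
compressed = zlib.compress(serialized)

print(len(serialized), len(compressed))  # tens of KB down to a few dozen bytes
```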

Your 30-second explanation: "Serialization converts in-memory data structures into bytes for storage or transport, and the format you choose determines the byte layout. Row formats like Avro keep one record's fields together, which is fast for writes and streaming. Columnar formats like Parquet group values by column, which is far more efficient for analytical reads and compression. Some formats enforce a schema at write time, which catches bad data early; others defer that to read time, which is flexible but risky at scale."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

JSON and CSV

These are the formats everyone starts with and the ones you'll need to talk about critically. JSON encodes data as human-readable text with nested key-value pairs; CSV is flat text with delimiter-separated fields. Neither enforces a schema at write time. The structure lives in your head, in documentation, or gets inferred by the query engine when it reads the file.

That flexibility is genuinely useful in early-stage pipelines or low-volume APIs where debuggability matters more than efficiency. But at scale, you're paying a steep price: field names repeat in every single JSON record, parsing text is slow, and there's nothing stopping a producer from silently renaming a field and breaking every downstream consumer. When an interviewer asks why you wouldn't just use JSON everywhere, this is the answer.

When to reach for this: mention JSON when discussing prototyping, REST API payloads, or low-volume event streams where human readability is worth the overhead. Then pivot to why you'd migrate off it.

JSON / CSV: Schema-on-Read Row Format

Avro

Avro is the standard binary row format for Kafka pipelines, and if you're designing any event-driven data system, you should default to it for the transport layer. The schema is defined separately (in JSON, ironically) and gets registered in a schema registry. Each serialized payload carries only a schema ID in its header, not the full schema, so you get compact binary encoding without sacrificing the ability to decode records later.

The real reason Avro dominates Kafka pipelines is schema evolution. When a producer adds a new optional field with a default value, old consumers can still read the new records because the reader schema fills in the missing field from the declared default. Remove a field? Same idea in reverse. The schema registry enforces compatibility rules (backward, forward, or full) before a new schema version is accepted, which means bad schema changes get caught before they reach consumers.

One thing to know cold: adding a field without a default value breaks backward compatibility. The registry will reject it under backward-compatible mode, but if you're running without enforcement, you'll get silent deserialization failures at 3am.

When to reach for this: any Kafka-based event pipeline where multiple teams produce and consume the same topic. Avro plus a schema registry is the standard answer.
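The default-filling rule can be sketched in a few lines of pure Python. This is a simplified model of Avro's reader-schema resolution, not the real codec (real implementations like fastavro or the Java SDK do this inside the deserializer), but the failure mode is the same:

```python
# Simplified sketch of Avro's reader-schema resolution (not the real codec):
# the reader fills in fields missing from the writer's data using defaults.

reader_schema = {
    "fields": [
        {"name": "user_id"},
        {"name": "amount"},
        # New optional field WITH a default -> old records stay readable.
        {"name": "currency", "default": "USD"},
    ]
}

def resolve(record: dict, reader: dict) -> dict:
    out = {}
    for field in reader["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            # The 3am failure: new required field, no default, old data.
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out

old_record = {"user_id": 42, "amount": 10.0}
print(resolve(old_record, reader_schema))
# {'user_id': 42, 'amount': 10.0, 'currency': 'USD'}
```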

Avro: Binary Row Format with Schema Evolution

Protobuf

Protobuf and Avro are both binary row formats, but they take a different approach to schema evolution. Instead of resolving fields by name, Protobuf identifies every field by a numeric tag defined in a .proto file. When a consumer receives a message with a field number it doesn't recognize, it skips it safely. This makes forward compatibility nearly automatic, as long as you never reuse a deleted field's number.

That last part matters. If you delete field 3 and later assign tag 3 to a new field with a different type, old consumers will try to decode the new data using the old type definition. Silent corruption, no error. The rule is simple: retired field numbers are retired forever. Protobuf is also tightly integrated with gRPC, which makes it the natural choice for service-to-service communication rather than event streaming.

When to reach for this: gRPC APIs, microservice communication, or any streaming pipeline where the producer and consumer are both services you control and you want strongly typed contracts.
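In a `.proto` definition, the "retired forever" rule is enforced with the `reserved` keyword. A hypothetical message (field names are illustrative) might look like:

```protobuf
// Hypothetical event message. Field 3 ("coupon_code") was deleted.
message PurchaseEvent {
  int64 user_id = 1;
  int64 amount_cents = 2;

  // Retired numbers (and names) stay retired: `reserved` makes the
  // compiler reject any future attempt to reuse them for a new field.
  reserved 3;
  reserved "coupon_code";

  string currency = 4;
}
```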

Protobuf: Binary Row Format with Field Tags

Parquet

Parquet is the de facto storage format for data lakes, and there's a specific reason for that. Instead of storing all fields for one row together, Parquet groups all values for one column together across many rows. A query that reads only three columns out of fifty never touches the other forty-seven. Combined with per-column compression (dictionary encoding, run-length encoding) and per-row-group statistics stored in the file footer, a query engine like Spark or BigQuery can skip entire chunks of a file without reading them.

This is what "predicate pushdown" means in practice. If your WHERE clause filters on event_date = '2024-01-01' and the row group statistics say the minimum date in that group is 2024-03-01, the engine skips the entire group. No I/O, no CPU.

Where Parquet falls apart: streaming appends and point lookups. Writing a single new record means writing a new file or rewriting a row group. Querying one row by ID means scanning the footer of every file to find the right row group. Neither is what Parquet was designed for. Schema changes at the table level also require care; tools like Apache Iceberg and Delta Lake exist partly to manage Parquet schema evolution safely across a partitioned table.

When to reach for this: analytical workloads on a data lake, columnar SELECT queries, anything landing in S3 or GCS for Spark, Athena, BigQuery, or Snowflake to read.

Parquet: Columnar Format for Analytical Reads

Schema Evolution: The Cross-Cutting Concern

Schema evolution isn't a format, but it's the topic that ties all of these together, and interviewers will probe it once you name-drop any of the above. The core question is always: if I change the schema, what breaks?

With Avro, the answer depends on whether you provided defaults. With Protobuf, it depends on whether you reused field numbers. With Parquet, it depends on whether your table format (Iceberg, Delta Lake, Hive metastore) can handle column additions or type changes without requiring a full rewrite. JSON and CSV have no enforcement mechanism at all, which means nothing breaks at serialization time and everything breaks silently at query time.

Key insight: Backward compatibility means new readers can read old data. Forward compatibility means old readers can read new data. Full compatibility means both. Know which mode your schema registry is enforcing and why.

Schema Evolution: Backward vs Forward Compatibility

Format Comparison

| Format | Orientation | Schema Handling | Best For |
| --- | --- | --- | --- |
| JSON / CSV | Row | Schema-on-read | Prototyping, low-volume APIs |
| Avro | Row (binary) | Schema-on-write, registry-enforced | Kafka event streams |
| Protobuf | Row (binary) | Schema-on-write, tag-based | gRPC, service-to-service |
| Parquet | Columnar | Schema-on-write, table-level | Data lake analytics |
| ORC | Columnar | Schema-on-write, Hive-native | Hive/Spark with heavy predicate pushdown |

For most interview problems involving a data lake or analytical pipeline, you'll default to Parquet at rest. Reach for Avro when the question involves Kafka, event streaming, or multiple teams sharing a schema. If the system involves gRPC or typed service contracts, Protobuf is the right answer. ORC is worth mentioning if the stack is Hive-heavy or the interviewer brings up Spark on EMR with legacy workloads, but it rarely comes up as the primary choice in modern system design conversations.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Reflex-Parqueting Everything

You ask a candidate what format they'd use for an event stream. They say "Parquet" immediately, with confidence. The interviewer nods and asks, "how would you handle the streaming writes?" Silence.

Parquet is a read-optimized columnar format. It's built for batch workloads where you write large, immutable files and then query subsets of columns repeatedly. It is genuinely bad at streaming appends, small frequent writes, and point lookups. Writing one event at a time to Parquet produces thousands of tiny files, which destroys both query performance and object store costs.

The interviewer isn't trying to trick you. They just want to know if you understand why you're recommending something.

Interview tip: Say "Parquet at rest, Avro in flight." Use Avro or Protobuf for the Kafka layer where you need row-level streaming, then compact and convert to Parquet when data lands in your lake. That one sentence shows you understand the access pattern difference.

The Mistake: Treating Snappy as a Format

"We serialize the data using Snappy." This comes up more than you'd think, and it immediately signals a gap.

Snappy, GZIP, and ZSTD are compression algorithms. They reduce the size of bytes that already exist. They don't define how fields are encoded, how schemas are stored, or how a consumer knows what type a value is. Saying you "use Snappy" without naming the underlying format is like saying you "store data in ZIP files" without mentioning whether it's CSV or JSON inside.

The two layers are independent. Parquet with Snappy compression. Avro with GZIP. Those are complete answers. "Snappy" alone is not.

Common mistake: Candidates conflate compression ratio with serialization efficiency. A well-encoded binary format like Avro will be smaller than JSON before compression, and compress even further after. The interviewer hears "Snappy" and wonders if you know the difference.

The Mistake: Knowing Schema Evolution Buzzwords Without the Rules

Candidates say "Avro supports schema evolution" and stop there. Then the interviewer asks, "what happens if you add a required field to your Avro schema?" and the answer falls apart.

The rules matter. In Avro, adding a field without a default value breaks backward compatibility because old data doesn't have that field, and the new reader has no fallback. In Protobuf, if you delete a field and later reuse its field number for a different type, you corrupt deserialization silently for any consumer still holding old messages. These aren't edge cases. They're the failure modes that cause 3am incidents.

Know the specific rules for at least one format cold. For Avro: new fields need defaults, you can't rename fields safely, removing fields requires the consumer to handle missing data. For Protobuf: never reuse field numbers, mark removed fields as reserved.

Interview tip: When you mention schema evolution, immediately follow it with a constraint. "Avro handles schema evolution well, as long as new fields have defaults and you're running backward compatibility mode in your schema registry." That caveat is what separates someone who's used it from someone who's read about it.

The Mistake: Designing Kafka Pipelines Without a Schema Registry

A candidate designs a Kafka-based ingestion system, walks through producers, topics, consumers, partitioning strategy. Solid answer. Then the interviewer asks, "what prevents a bad producer deploy from breaking all your consumers?"

If you haven't mentioned a schema registry, you don't have an answer.

Without one, any producer can write malformed or structurally incompatible payloads to a topic. Consumers will fail at deserialization, often with cryptic errors, and the bad messages sit in the topic permanently. The schema registry is the enforcement layer. It validates that a producer's schema is compatible with the registered version before a single byte gets written.

Bring it up proactively. Mention Confluent Schema Registry or AWS Glue Schema Registry by name. Specify the compatibility mode you'd configure (backward is the most common default). That level of specificity tells the interviewer you've actually operated one of these systems, not just read the docs.

How to Talk About This in Your Interview

When to Bring It Up

Serialization isn't a topic you wait to be asked about directly. The signals are usually indirect.

When you hear "we're ingesting billions of events per day," that's your cue to mention format choice and why JSON won't cut it at that volume. When the interviewer says "consumers need to read historical data in the lake," that's your opening to bring up Parquet and columnar access patterns. When they mention "multiple teams producing to the same Kafka topic," that's exactly when you raise schema evolution and the Confluent Schema Registry.

Other triggers: "we've had pipeline failures from unexpected data changes" (schema drift, data contracts), "storage costs are high" (compression and columnar encoding), "our Spark jobs are slow" (file format, predicate pushdown, partitioning).


Sample Dialogue

Interviewer: "So for the event stream, what format would you use?"

You: "A few things drive that decision for me. First, volume and velocity. If we're talking millions of events per minute, I want a compact binary format, not JSON. The parse overhead and payload size add up fast. I'd use Avro on the Kafka layer."

Interviewer: "Why Avro specifically? Why not Protobuf?"

You: "Honestly, either works for binary row encoding. I lean toward Avro in Kafka pipelines because the ecosystem around it is mature. Confluent Schema Registry has first-class Avro support, and the compatibility modes, backward, forward, full, are built into the workflow. Protobuf is a great choice too, especially if you're already running gRPC services and want to share .proto definitions across teams. But if we're starting fresh on a Kafka-centric stack, Avro is the path of least resistance."

Interviewer: "And what about once the data lands in S3?"

You: "That's where I'd convert to Parquet. The Kafka layer and the lake layer have completely different access patterns. Avro is optimized for sequential row writes and streaming reads. Parquet is optimized for analytical queries where you're scanning a few columns across millions of rows. Keeping them as Avro in S3 and then running Spark or Athena queries on top of it is leaving a lot of performance on the table. I'd run a Spark job to land data into Parquet, partitioned by date, and register it in the Glue Data Catalog."

Interviewer: "What if the team pushes back and says JSON is simpler to debug?"

You: "That's a real trade-off and I wouldn't dismiss it. JSON is genuinely easier to inspect in a terminal, and for low-volume internal APIs it's totally fine. The problem is at scale. At a billion events a day, JSON's verbosity means you're storing field names repeatedly in every single record. Parse time is slower than binary. And there's no schema enforcement, so one producer adding a field or changing a type can silently break downstream consumers for hours before anyone notices. I'd keep JSON for debugging tools and dashboards, but not as the wire format for the pipeline itself."


Follow-Up Questions to Expect

"How does schema evolution actually work in Avro?" Walk through the writer schema vs reader schema model: the consumer fetches the schema by ID from the registry, and Avro resolves field differences using declared defaults. Adding a field with a default is safe; adding one without breaks backward compatibility.

"What's the difference between backward and forward compatibility?" Backward compatible means new readers can read old data. Forward compatible means old readers can read new data. Full compatibility is both. In practice, you want at least backward compatibility so you can deploy consumers before producers.

"How do you handle schema changes in Parquet files?" This is where table formats like Apache Iceberg or Delta Lake earn their keep. They track schema versions at the table level and handle column additions safely. Without them, a schema change can make older partitions unreadable or require a full rewrite.

"What compression would you use with Parquet?" Snappy for a balance of speed and compression ratio, especially when Spark jobs are CPU-bound. ZSTD if you want better compression and your cluster has headroom. GZIP if you're optimizing purely for storage cost and query frequency is low. The key point: compression is a separate layer on top of the format, not a replacement for choosing the right format.


What Separates Good from Great

  • A mid-level answer names the right format for the use case. A senior answer explains the layered decision: binary for transport, columnar for storage at rest, and why those two things don't have to be the same format.
  • Mid-level candidates treat schema evolution as a feature. Senior candidates treat it as a contract. They mention compatibility modes, what happens when you violate them, and how a schema registry enforces the rules before a bad payload ever reaches a consumer.
  • The detail that really signals experience: naming specific tools. Not "a schema registry" but "Confluent Schema Registry" or "AWS Glue Schema Registry." Not "a table format" but "Iceberg with the Hive metastore." Specificity tells the interviewer you've actually operated these systems, not just read about them.

Key takeaway: The strongest candidates don't pick one format. They pick the right format for each layer of the pipeline, and they can explain exactly why those layers have different requirements.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn