Understanding the Problem
What is a Video Streaming Service?
Product definition: A platform where creators upload videos and viewers stream them on demand, with a personalized feed to help users discover content.
Think YouTube, not Netflix. Anybody can be a creator. You upload a raw video file, the platform processes it into multiple quality levels, and millions of viewers can stream it seconds later from anywhere in the world. The other half of the product is discovery: a home feed that blends content from your subscriptions with recommendations tailored to your watch history.
What makes this problem interesting in an interview is that it's really two systems stitched together. The upload side is a heavy, asynchronous data pipeline. The playback side is a latency-sensitive, read-heavy serving system that needs to work globally. The interviewer wants to see you recognize this tension early and design each path accordingly.
Functional Requirements
Core Requirements
- Video upload and processing: Creators upload raw video files. The system transcodes them into multiple resolutions and bitrates asynchronously, then marks them as ready for viewing.
- Video playback with adaptive quality: Viewers stream videos with the player automatically adjusting quality based on their network conditions (adaptive bitrate streaming).
- Home feed: A personalized feed blending subscriptions and recommendations, ranked by relevance and recency.
- Social interactions: Likes, comments, and subscriptions so viewers can engage with content and follow creators.
Below the line (out of scope)
- Live streaming (fundamentally different architecture from on-demand)
- Monetization, ads, and creator analytics dashboards
- Content moderation and copyright detection (important in production, but a separate system)
Note: "Below the line" features are acknowledged but won't be designed in this lesson. Calling them out explicitly in your interview shows you understand the full product without getting pulled into rabbit holes.
Non-Functional Requirements
- Low-latency playback startup: Video should begin playing within 2 seconds of the viewer pressing play (p99). Once you're past that threshold, users bounce.
- High availability for reads: 99.99% uptime for the streaming path. A viewer in Tokyo at 2 AM should have the same experience as one in New York at peak hours. This means global distribution is non-negotiable.
- Eventual consistency is acceptable for most writes: View counts, like counts, and feed updates don't need to be real-time accurate. A few seconds of staleness is fine. Upload processing can take minutes.
- Scale: 500M daily active viewers, with peak concurrent streams hitting 50M. The write side handles 500K video uploads per day. This is an extremely read-heavy system.
The ratio between reads and writes here is on the order of thousands to one (3 billion daily plays against 500K daily uploads, before you even count segment fetches). That single number should shape every design decision you make.
Back-of-Envelope Estimation
Tip: Always clarify requirements before jumping into design. This shows maturity. In your interview, spend 2-3 minutes asking about scale, expected latency, and which features are in scope. Then do a quick napkin math pass like the one below.
Let's anchor on concrete numbers. These don't need to be exact; the interviewer wants to see that you can reason about order of magnitude.
Assumptions:
- 500M DAU (viewers)
- 500K video uploads/day
- Average raw video: 500MB, average duration: 5 minutes
- Each video transcoded into 5 renditions (240p through 1080p), roughly 3x the raw size total
- Average viewer watches 30 minutes/day (roughly 6 video plays per day, given a 5-minute average duration)
- Average segment bitrate across all viewers: ~3 Mbps
| Metric | Calculation | Result |
|---|---|---|
| Upload QPS | 500K / 86,400 sec | ~6 uploads/sec |
| Raw storage growth/day | 500K × 500MB | ~250 TB/day |
| Total storage growth/day (with renditions) | 250TB × 3 | ~750 TB/day |
| Playback requests (manifest fetches) | 500M × 6 plays/day / 86,400 | ~35K QPS |
| Peak concurrent streams | 50M simultaneous | 50M streams |
| Peak CDN bandwidth | 50M × 3 Mbps | ~150 Tbps |
| Segment fetch QPS at CDN | 50M streams × 1 segment/4 sec | ~12.5M QPS |
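If you want to sanity-check the table, the arithmetic is simple enough to script. This is a minimal sketch using only the assumptions stated above:

```python
# Napkin-math check of the estimates above. All inputs are the stated
# assumptions from this section, not measured data.
DAU = 500_000_000
UPLOADS_PER_DAY = 500_000
RAW_VIDEO_MB = 500
RENDITION_MULTIPLIER = 3          # renditions add ~3x the raw size
PLAYS_PER_VIEWER = 6
PEAK_CONCURRENT = 50_000_000
AVG_BITRATE_MBPS = 3
SEGMENT_SEC = 4
SECONDS_PER_DAY = 86_400

upload_qps = UPLOADS_PER_DAY / SECONDS_PER_DAY               # ~6 uploads/sec
raw_tb_per_day = UPLOADS_PER_DAY * RAW_VIDEO_MB / 1_000_000  # ~250 TB/day
total_tb_per_day = raw_tb_per_day * RENDITION_MULTIPLIER     # ~750 TB/day
manifest_qps = DAU * PLAYS_PER_VIEWER / SECONDS_PER_DAY      # ~35K QPS
peak_tbps = PEAK_CONCURRENT * AVG_BITRATE_MBPS / 1_000_000   # ~150 Tbps
segment_qps = PEAK_CONCURRENT / SEGMENT_SEC                  # ~12.5M QPS
```

In an interview you would do this on the whiteboard, but the point stands: every number in the table is one multiplication or division away from the assumptions.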
A few things jump out. The upload QPS is tiny (6/sec), but each upload triggers a massive amount of compute work in the transcoding pipeline. Meanwhile, the CDN is handling 150 Tbps of bandwidth at peak. No single origin server cluster can serve that. You need a globally distributed CDN with aggressive caching.
Storage grows at 750 TB per day. That's roughly 274 PB per year. You're not storing this in a database. This is object storage (S3, GCS) with lifecycle policies to manage cost, potentially moving older, less-viewed content to cheaper storage tiers.
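A lifecycle policy like the one just mentioned boils down to a tiering rule. This is an illustrative sketch; the thresholds and tier names are assumptions, not a real cloud provider's API:

```python
def storage_tier(age_days: int, views_last_30d: int) -> str:
    # Illustrative lifecycle rule: hot or recent content stays on
    # standard object storage; cold, old content moves to cheaper tiers.
    # Thresholds are made up for the example.
    if age_days < 30 or views_last_30d > 10_000:
        return "standard"
    if age_days < 365:
        return "infrequent_access"
    return "archive"
```

In production this would be expressed as an object-storage lifecycle configuration rather than application code, but the decision logic is the same.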
This asymmetry between the write path and the read path is the core architectural tension. Your transcoding pipeline can be a queue-based, eventually-consistent batch system. Your streaming path needs to be a globally cached, sub-second-latency serving layer. Keep these two worlds separate in your design, and you'll be in great shape.
The Set Up
Core Entities
Five entities carry the weight of this system. Some are obvious, but the relationship between two of them (Video and VideoAsset) is where interviewers separate candidates who've thought about streaming from those who haven't.
User represents both creators and viewers. There's no separate creator entity. A user becomes a creator the moment they upload their first video.
Video is the metadata container for an uploaded piece of content. It holds the title, description, processing status, and ownership. It does not hold any information about resolutions, bitrates, or file locations. That's the next entity's job.
VideoAsset is a single transcoded rendition of a video. One Video produces many VideoAssets: a 1080p version at 5 Mbps, a 720p version at 3 Mbps, a 480p version at 1.5 Mbps, and so on. Each asset points to its own file in object storage. This one-to-many relationship is the entire foundation of adaptive bitrate streaming, and if you don't model it explicitly, your design falls apart the moment the interviewer asks "how does the player switch quality mid-stream?"
Key insight: Candidates who lump all renditions into a single Video row can't cleanly explain how a manifest file references individual quality levels. The Video/VideoAsset split makes the manifest generation trivial: just query all assets for a given video_id and list them.
Subscription models the follower graph between users. A viewer subscribes to a creator. This powers the "subscriptions" tab of the home feed.
Comment is straightforward. A user writes a comment on a video. You could extend this with replies and threading, but keep it flat unless the interviewer pushes you there.
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(64) NOT NULL UNIQUE,
display_name VARCHAR(128) NOT NULL,
avatar_url TEXT,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE videos (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
creator_id UUID NOT NULL REFERENCES users(id),
title VARCHAR(256) NOT NULL,
description TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'uploading', -- uploading | processing | ready | failed
duration_sec INT, -- populated after transcoding
view_count BIGINT NOT NULL DEFAULT 0,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_videos_creator ON videos(creator_id, created_at DESC);
CREATE INDEX idx_videos_status ON videos(status) WHERE status != 'ready';
CREATE TABLE video_assets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
video_id UUID NOT NULL REFERENCES videos(id),
resolution VARCHAR(10) NOT NULL, -- '1080p', '720p', '480p', '360p'
bitrate_kbps INT NOT NULL, -- e.g. 5000, 3000, 1500
codec VARCHAR(20) NOT NULL, -- 'h264', 'vp9', 'av1'
storage_url TEXT NOT NULL, -- path in object storage
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_video_assets_video ON video_assets(video_id);
CREATE TABLE subscriptions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
subscriber_id UUID NOT NULL REFERENCES users(id),
creator_id UUID NOT NULL REFERENCES users(id),
created_at TIMESTAMP NOT NULL DEFAULT now(),
UNIQUE(subscriber_id, creator_id) -- prevent duplicate subscriptions
);
CREATE INDEX idx_subscriptions_subscriber ON subscriptions(subscriber_id);
CREATE INDEX idx_subscriptions_creator ON subscriptions(creator_id);
CREATE TABLE comments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
video_id UUID NOT NULL REFERENCES videos(id),
user_id UUID NOT NULL REFERENCES users(id),
body TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_comments_video ON comments(video_id, created_at DESC);
Notice the status field on videos. This is the state machine that tracks a video through its lifecycle: uploading → processing → ready (or failed). The partial index on status filters out the vast majority of videos that are already ready, keeping the index small for the pipeline workers that poll for in-progress jobs.
The unique constraint on subscriptions(subscriber_id, creator_id) prevents a user from subscribing to the same creator twice. Simple, but easy to forget.
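The Video/VideoAsset split pays off directly at manifest-generation time. A minimal sketch, assuming asset rows shaped like the `video_assets` table above (the pixel-dimension mapping and CDN URL scheme are illustrative assumptions):

```python
# Master-manifest generation enabled by the Video/VideoAsset split:
# query all assets for a video_id, emit one #EXT-X-STREAM-INF entry
# per rendition. PIXELS and the CDN URL layout are made up for the sketch.
PIXELS = {"1080p": "1920x1080", "720p": "1280x720",
          "480p": "854x480", "360p": "640x360"}

def build_master_manifest(video_id, assets):
    # assets: rows from `SELECT resolution, bitrate_kbps FROM video_assets
    # WHERE video_id = ...`, listed highest quality first
    lines = ["#EXTM3U"]
    for a in sorted(assets, key=lambda a: a["bitrate_kbps"], reverse=True):
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={a['bitrate_kbps'] * 1000},"
                     f"RESOLUTION={PIXELS[a['resolution']]}")
        lines.append(f"https://cdn.example.com/v/{video_id}/{a['resolution']}/playlist.m3u8")
    return "\n".join(lines)
```

One loop over one indexed query. Candidates who stuff renditions into the Video row have no equivalent of this function.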

API Design
Each endpoint maps to one of the functional requirements from our earlier discussion. The upload flow is the trickiest because it involves a two-step handshake with object storage.
// Initiate a video upload. Returns a pre-signed URL for direct upload to object storage.
POST /videos
{
"title": "My Cat Video",
"description": "A very important cat."
}
-> {
"video_id": "abc-123",
"upload_url": "https://storage.example.com/raw/abc-123?signature=...",
"status": "uploading"
}
The client never sends video bytes through your API servers. That would be insanely expensive and slow. Instead, the server generates a pre-signed URL pointing directly at object storage (S3, GCS, etc.), and the client uploads there. Once the upload completes, object storage fires an event notification that kicks off transcoding. Your API server stays thin.
Common mistake: Designing the upload endpoint to accept the video file as a multipart form body. This forces every byte of every upload through your application servers, which kills throughput and costs a fortune in compute. Always use pre-signed URLs for large binary uploads.
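To make the pre-signed URL idea concrete, here is a simplified sketch of the server-side signing step. Real systems use the cloud SDK's signing (e.g. S3 Signature Version 4) rather than this hand-rolled HMAC; the secret, host, and path layout are all illustrative:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # illustrative; real systems use SDK-managed keys

def presign_upload_url(video_id: str, expires_in_sec: int = 3600) -> str:
    # The server signs (path, expiry) so object storage can verify the
    # upload request without the video bytes ever touching the API tier.
    expires = int(time.time()) + expires_in_sec
    path = f"/raw/{video_id}"
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(),
                   hashlib.sha256).hexdigest()
    return f"https://storage.example.com{path}?" + urlencode(
        {"expires": expires, "signature": sig})
```

The client PUTs the file to this URL; storage recomputes the HMAC and rejects expired or tampered requests. Your backend retains access control without ever proxying bytes.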
// Get the streaming manifest for a video. Redirects to the CDN-hosted manifest.
GET /videos/{id}/manifest
-> 302 Redirect to https://cdn.example.com/manifests/{id}/master.m3u8
This endpoint doesn't return the manifest itself. It returns a redirect to the CDN, where the actual .m3u8 (HLS) or .mpd (DASH) file lives. The video player follows the redirect and takes over from there. In practice, many implementations skip this redirect entirely and have the client construct the CDN URL client-side, but showing the redirect demonstrates you understand the separation between metadata and media paths.
// Fetch the personalized home feed for the authenticated user.
GET /feed?page_token={token}&page_size=20
-> {
"videos": [
{
"video_id": "abc-123",
"title": "My Cat Video",
"creator": { "id": "u-456", "display_name": "CatLover99" },
"thumbnail_url": "https://cdn.example.com/thumbs/abc-123.jpg",
"duration_sec": 187,
"view_count": 42000,
"created_at": "2025-01-15T10:30:00Z"
}
],
"next_page_token": "eyJvZmZzZXQiOjIwfQ=="
}
The feed is cursor-paginated, not offset-paginated. With 500M daily active users, offset pagination falls apart once you're past the first few pages because the database still has to scan and discard all preceding rows. A cursor token (typically an opaque encoding of the last item's sort key) lets the query jump directly to the right position.
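A minimal sketch of what the opaque token might contain, assuming the feed is sorted by `(created_at, id)`. The token format is an illustrative choice, not a standard:

```python
import base64
import json

def encode_cursor(created_at_iso: str, video_id: str) -> str:
    # Opaque token encoding the last returned item's sort key.
    raw = json.dumps({"created_at": created_at_iso, "id": video_id})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(token: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(token.encode()))

# The next page is then a keyset query rather than an OFFSET scan, e.g.:
#   SELECT ... FROM feed_items
#   WHERE (created_at, id) < (:created_at, :id)
#   ORDER BY created_at DESC, id DESC
#   LIMIT 20;
```

The keyset `WHERE` clause lets the database seek directly via an index instead of scanning and discarding every preceding row.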
// Post a comment on a video.
POST /videos/{id}/comments
{
"body": "Great video!"
}
-> {
"comment_id": "c-789",
"user_id": "u-456",
"body": "Great video!",
"created_at": "2025-01-15T11:02:00Z"
}
// Subscribe to a creator.
POST /subscriptions
{
"creator_id": "u-456"
}
-> {
"subscription_id": "sub-101",
"creator_id": "u-456",
"created_at": "2025-01-15T11:05:00Z"
}
A few verb choices worth calling out: POST /videos creates the video resource and returns the upload URL in one shot. You could split this into two calls (create metadata, then request an upload URL), but combining them reduces round trips for the client. GET /feed is a pure read with no side effects, so GET is the right verb. Subscriptions use POST because they create a new resource; unsubscribing would be DELETE /subscriptions/{id}.
Tip: If the interviewer asks about authentication, mention that all endpoints expect a JWT or session token in the Authorization header. The user ID is extracted from the token server-side, never passed in the request body. Don't spend more than 10 seconds on this unless they dig in.
High-Level Design
Two fundamentally different systems live under one roof here. The write path (uploading and transcoding video) is heavy, asynchronous, and tolerant of minutes of latency. The read path (streaming video to millions of concurrent viewers) must be fast, globally distributed, and cache-friendly. Every design decision you make should reinforce this split.
1) Video Upload and Transcoding Pipeline
Components: Client (creator's device), API Server, Object Storage (S3 or equivalent), Message Queue (SQS/Kafka), Transcoding Worker Fleet, Metadata Database.
A creator wants to upload a 500MB video file. You absolutely do not want that file flowing through your API servers. Instead, the API server acts as a coordinator: it creates a Video record in the metadata database with status uploading, then hands the client a pre-signed URL pointing directly at object storage.
Here's the flow:
- The client calls `POST /videos` with metadata (title, description, tags). The API server creates a Video row with `status = 'uploading'` and returns a pre-signed upload URL along with the new `video_id`.
- The client uploads the raw file directly to object storage using that pre-signed URL. Your API servers never touch the video bytes.
- Object storage emits an upload-complete event (S3 event notification, for example) onto the message queue.
- A transcoding worker picks up the job, downloads the raw file, and produces multiple renditions: 1080p, 720p, 480p, 360p. Each rendition gets chunked into small segments (typically 2-10 seconds each).
- The worker uploads all segments back to object storage and generates an HLS manifest file (`.m3u8`) that lists every rendition and its segments.
- The worker updates the Video record to `status = 'ready'`.
POST /videos
{
"title": "My Cat Video",
"description": "Cat does something funny",
"tags": ["cats", "funny"]
}
Response:
{
"video_id": "v-abc123",
"upload_url": "https://storage.example.com/raw/v-abc123?X-Amz-Signature=...",
"status": "uploading"
}
Tip: When you describe this flow in an interview, emphasize why you use pre-signed URLs. The interviewer wants to hear that routing hundreds of megabytes through your API tier would be a bottleneck, and that pre-signed URLs let the client talk directly to object storage while your backend retains access control.
Why a message queue between the upload event and the workers? Two reasons. First, transcoding is CPU-intensive and takes minutes. You need to decouple the upload acknowledgment from the processing. Second, the queue gives you natural backpressure. If uploads spike, jobs queue up instead of overwhelming your worker fleet. You can autoscale workers based on queue depth.
The Video status field is your contract with the client. The creator's UI can poll GET /videos/{id} and show a progress indicator. When status flips to ready, the video appears on their channel.

2) Video Playback and Streaming
Components: Viewer Client (video player), API Server, CDN Edge Nodes, Object Storage (origin), Metadata Database.
Playback is where 99%+ of your traffic lives. The entire design goal here is: keep video bytes as close to the viewer as possible, and keep your origin servers out of the hot path.
- The viewer client calls `GET /videos/{id}/manifest`. The API server looks up the Video record, confirms `status = 'ready'`, and returns a redirect (HTTP 302) to the CDN-hosted manifest URL.
- The video player fetches the HLS manifest from the CDN edge. This manifest lists all available renditions and their segment URLs.
- The player estimates current bandwidth and picks a starting rendition (say, 720p). It begins fetching segments sequentially from the CDN.
- As playback continues, the player monitors download speed. If bandwidth drops, it switches to 480p segments on the next request. If bandwidth improves, it jumps to 1080p. This is adaptive bitrate streaming.
- Meanwhile, the API server logs a view event and returns video metadata (title, creator info, like count) from the metadata database.
GET /videos/v-abc123/manifest
Response (302 Redirect):
Location: https://cdn.example.com/videos/v-abc123/master.m3u8
The CDN edge node either has the manifest and segments cached already (cache hit) or pulls them from object storage on the first request (cache miss), then caches them for subsequent viewers. Since video segments are immutable (a transcoded chunk never changes), you can set very long cache TTLs on them. Manifests get shorter TTLs because you might update them if you add a new rendition later.
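The long-TTL-for-segments, short-TTL-for-manifests policy can be expressed as a tiny rule at upload time. The specific TTL values and extensions here are illustrative assumptions:

```python
def cache_headers(object_key: str) -> dict:
    # Segments are immutable once transcoded -> cache effectively forever.
    # Manifests may be rewritten (e.g. a new rendition added) -> short TTL.
    # TTL values are illustrative, not prescriptive.
    if object_key.endswith((".ts", ".fmp4", ".mp4")):
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if object_key.endswith((".m3u8", ".mpd")):
        return {"Cache-Control": "public, max-age=300"}
    return {"Cache-Control": "public, max-age=86400"}  # thumbnails, etc.
```

In practice you would set these as object metadata when the transcoding worker uploads each file, and the CDN honors them on every edge.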
Common mistake: Candidates sometimes design the playback flow with the API server proxying video bytes to the client. This is a dealbreaker at scale. Your API servers should never serve media content. They handle metadata and redirect to the CDN. Say this explicitly in the interview.
One subtle point: the API server's role during playback is minimal. It serves the initial metadata request and the manifest redirect, then it's out of the picture. The video player talks directly to CDN edges for the rest of the session. This means your API tier can be relatively modest even with 50 million concurrent streams.

3) Home Feed Generation
Components: Viewer Client, Feed Service, Subscription Store (database or graph store), Recommendation Engine, Feed Cache (Redis/Memcached).
The feed is what keeps viewers on the platform. It blends two sources: videos from creators the viewer subscribes to, and algorithmically recommended content.
- The viewer client calls `GET /feed`. The request hits the Feed Service.
- The Feed Service first checks the Feed Cache for a pre-computed feed for this user. If a fresh cached result exists (within the TTL window), return it immediately.
- On a cache miss, the Feed Service fans out two parallel requests:
  - Subscription Store: Fetch the latest videos from creators this user follows. This is a straightforward query: get the user's subscriptions, then fetch recent videos from those creators, ordered by recency.
  - Recommendation Engine: Request a ranked list of suggested videos based on the user's watch history, preferences, and trending content.
- The Feed Service merges and ranks these two lists using a scoring function. Subscription content might get a recency boost; recommended content gets weighted by predicted engagement.
- The merged result is written to the Feed Cache with a short TTL (30-60 seconds) and returned to the client.
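The cache-then-compute flow can be sketched as follows. The in-process dict stands in for Redis, and the flat recency bonus is an illustrative scoring choice:

```python
import time

FEED_TTL_SEC = 45   # illustrative, within the 30-60s range discussed
_cache = {}         # stand-in for Redis: user_id -> (expires_at, feed)

def get_feed(user_id, fetch_subscriptions, fetch_recommendations):
    now = time.time()
    hit = _cache.get(user_id)
    if hit and hit[0] > now:
        return hit[1]                    # fresh cached feed, no recompute
    subs = fetch_subscriptions(user_id)  # issued in parallel in production
    recs = fetch_recommendations(user_id)
    # Toy scoring function: subscription items get a flat boost so they
    # outrank similarly-scored recommendations.
    scored = ([(v["score"] + 1.0, v) for v in subs]
              + [(v["score"], v) for v in recs])
    feed = [v for _, v in sorted(scored, key=lambda p: p[0], reverse=True)]
    _cache[user_id] = (now + FEED_TTL_SEC, feed)
    return feed
```

The important properties are visible in a few lines: the cache check short-circuits the expensive path, and the TTL bounds how stale any user's feed can get.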
GET /feed?page_size=20
Response:
{
"videos": [
{
"video_id": "v-abc123",
"title": "My Cat Video",
"creator": { "id": "u-xyz", "display_name": "CatLover99" },
"thumbnail_url": "https://cdn.example.com/thumbs/v-abc123.jpg",
"duration_sec": 184,
"view_count": 45200,
"source": "subscription"
},
{
"video_id": "v-def456",
"title": "Top 10 Travel Destinations",
"creator": { "id": "u-travel", "display_name": "Wanderlust" },
"thumbnail_url": "https://cdn.example.com/thumbs/v-def456.jpg",
"duration_sec": 612,
"view_count": 1200000,
"source": "recommended"
}
],
"next_page": 2
}
The short TTL on the feed cache is a deliberate tradeoff. You want the feed to feel fresh (new uploads from subscriptions should appear quickly), but you also can't afford to recompute personalized rankings on every single request for 500M daily users. Thirty seconds is a reasonable middle ground. If a user refreshes within that window, they get the cached version. After the TTL expires, the next request triggers a fresh computation.
Key insight: The feed is a read concern, but it depends on write events (new uploads, new subscriptions, new watch history). You don't need real-time consistency here. A viewer seeing a new upload 30-60 seconds after it goes live is perfectly acceptable. This eventual consistency lets you cache aggressively.
The Recommendation Engine itself is a deep topic (and interviewers usually don't expect you to design the ML model). What they do want to hear is that it's a separate service, called at feed-generation time, and that you've thought about latency. If the recommendation engine is slow, you fall back to a subscription-only feed. Graceful degradation matters.
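The graceful-degradation point can be made concrete with a latency budget. This is a sketch, assuming a synchronous recommendation call; the timeout value is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

REC_TIMEOUT_SEC = 0.2  # illustrative latency budget for recommendations

def feed_with_fallback(user_id, fetch_subscriptions, fetch_recommendations):
    # If the recommendation engine blows its latency budget, serve a
    # subscription-only feed instead of failing the whole request.
    subs = fetch_subscriptions(user_id)
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_recommendations, user_id)
        try:
            recs = future.result(timeout=REC_TIMEOUT_SEC)
        except TimeoutError:
            recs = []  # degrade gracefully
    return subs + recs
```

Saying "we time-box the recommendation call and fall back to subscriptions" is exactly the kind of failure-mode thinking interviewers reward here.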

4) The Two Paths: Metadata vs. Media
This is the architectural insight that ties everything together, and it's worth stating directly to your interviewer.
The metadata path handles titles, descriptions, comments, likes, view counts, subscriptions, and feed data. It flows through traditional API servers, relational or document databases, and caches. The data is small (kilobytes), mutable, and queried in complex ways (joins, aggregations, search). Standard web architecture applies here.
The media path handles video files, segments, thumbnails, and manifests. It flows through object storage and CDN edge nodes. The data is large (megabytes to gigabytes), mostly immutable once transcoded, and accessed by simple key-based lookups. No database is involved.
These two paths share almost nothing at runtime. Your API server is the bridge: it knows the mapping between a video_id in the metadata database and the corresponding CDN URL for the manifest. But once it hands that URL to the client, the media path takes over independently.
Why does this matter in an interview? Because candidates who try to design one unified system for both paths end up with something that's either too slow for streaming or too complex for metadata. The separation lets you scale each path independently. You might have 20 API servers handling metadata but 200+ CDN edge locations serving video. Different scaling dimensions, different cost profiles, different failure modes.
| Concern | Metadata Path | Media Path |
|---|---|---|
| Data size | Kilobytes per record | Megabytes per segment |
| Mutability | Frequently updated | Immutable after transcode |
| Storage | PostgreSQL, Redis | S3, CDN edge caches |
| Access pattern | Complex queries, joins | Simple key-value lookups |
| Scaling lever | Database replicas, app server instances | CDN edge nodes, origin replication |
Putting It All Together
The complete architecture has three major subsystems working in concert:
Upload pipeline (async, write-heavy): Creators upload raw video to object storage via pre-signed URLs. A message queue feeds a transcoding worker fleet that produces multiple renditions and HLS manifests. The metadata database tracks processing status.
Streaming infrastructure (sync, read-heavy): Viewers request manifest URLs from the API, then stream segments directly from CDN edge nodes. Adaptive bitrate logic lives entirely in the client player. The CDN absorbs the vast majority of read traffic, with object storage as the origin for cache misses.
Feed and metadata layer (mixed): The Feed Service blends subscription data with recommendation engine output, caches results per user, and serves the discovery experience. All social features (comments, likes, subscriptions) flow through standard API servers backed by a relational database.
A global CDN sits in front of everything viewers touch: video segments, thumbnails, manifests, and even API responses where appropriate. GeoDNS routes each viewer to the nearest edge. Your origin infrastructure (API servers, databases, object storage) can live in a smaller number of regions while the CDN provides global reach.
Tip: When you draw this on the whiteboard, physically separate the left side (upload/transcode) from the right side (playback/CDN). Connect them through object storage in the middle. This visual separation immediately communicates that you understand the core architectural tension, and interviewers notice.
Deep Dives
"How do we handle video transcoding at scale?"
This is the question that separates candidates who've thought about media systems from those who haven't. A single uploaded video needs to become 6-10 different renditions (resolutions and bitrates), each sliced into small segments. At 500K uploads per day, that's millions of individual transcoding tasks. The interviewer wants to see you reason about parallelism, failure handling, and resource management.
Bad Solution: Single-Server Sequential Processing
You spin up a beefy server with FFmpeg installed. When a video is uploaded, the server picks it up, transcodes it to each rendition one at a time, and writes the results back to object storage. Simple, right?
It's also a disaster. A 10-minute 1080p video can take 5-15 minutes to transcode into a single rendition. Multiply that by 8 renditions and you're looking at over an hour per video. With 500K uploads per day, you'd need hundreds of these machines, each sitting idle between jobs or backed up for hours. One corrupted video that causes FFmpeg to hang blocks everything behind it. There's no retry logic, no visibility into progress, and no way to scale individual bottlenecks.
Warning: Some candidates jump to "just add more servers" without changing the architecture. Horizontal scaling of a bad design just gives you more bad servers. The interviewer wants to see you rethink the approach, not throw hardware at it.
Good Solution: Queue-Based Worker Pool
Decouple the upload event from the transcoding work. When a video lands in object storage, an event fires onto a message queue (SQS, Kafka, etc.). A fleet of transcoding workers pulls jobs from the queue, each worker handling one video-to-rendition task.
# Pseudocode for a transcoding worker
def handle_job(message):
video_id = message["video_id"]
rendition = message["rendition"] # e.g. {"resolution": "720p", "bitrate_kbps": 3000}
try:
update_status(video_id, rendition, "processing")
raw_path = download_from_storage(video_id)
output_path = transcode(raw_path, rendition)
upload_to_storage(video_id, rendition, output_path)
update_status(video_id, rendition, "complete")
except Exception as e:
update_status(video_id, rendition, "failed")
if message["retry_count"] < 3:
requeue_with_backoff(message)
else:
send_to_dead_letter_queue(message)
Each uploaded video fans out into N messages (one per rendition). Workers scale horizontally based on queue depth. You get built-in retry semantics from the queue, and a dead-letter queue catches poison jobs (corrupted files, unsupported codecs) so they don't block the pipeline.
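The fan-out step on the upload-complete event can be sketched like this. The `enqueue` callable stands in for the real queue client (SQS `send_message`, a Kafka producer, etc.), and the rendition ladder matches the one used elsewhere in this lesson:

```python
import json

RENDITIONS = [
    {"resolution": "1080p", "bitrate_kbps": 5000},
    {"resolution": "720p",  "bitrate_kbps": 3000},
    {"resolution": "480p",  "bitrate_kbps": 1500},
    {"resolution": "360p",  "bitrate_kbps": 800},
]

def fan_out_transcode_jobs(video_id: str, enqueue) -> int:
    # Called from the upload-complete event handler: one queue message
    # per rendition, so workers can process renditions independently.
    for rendition in RENDITIONS:
        enqueue(json.dumps({
            "video_id": video_id,
            "rendition": rendition,
            "retry_count": 0,
        }))
    return len(RENDITIONS)
```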
The tradeoff: each worker still processes an entire video file for one rendition. A 2-hour movie still takes a long time per worker, and you're downloading the full raw file for every single rendition. That's a lot of redundant I/O.
Track progress with a simple status table:
CREATE TABLE transcode_jobs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
video_id UUID NOT NULL REFERENCES videos(id),
rendition VARCHAR(50) NOT NULL, -- e.g. '1080p_5000kbps'
status VARCHAR(20) NOT NULL DEFAULT 'queued', -- queued, processing, complete, failed
worker_id VARCHAR(100), -- which worker picked this up
attempts INT NOT NULL DEFAULT 0,
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT
);
CREATE INDEX idx_transcode_video ON transcode_jobs(video_id);
CREATE INDEX idx_transcode_status ON transcode_jobs(status);
When all renditions for a video reach "complete," a finalizer process generates the HLS manifest and flips the video's status to "ready."
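The finalizer's check is simple once the `transcode_jobs` table exists. A minimal sketch, with the manifest generator and status writer passed in as callables for illustration:

```python
def maybe_finalize(video_id, jobs, generate_manifest, mark_ready):
    # jobs: rows for this video, e.g. from
    #   SELECT rendition, status FROM transcode_jobs WHERE video_id = ...
    # Only flip the video to 'ready' once every rendition is complete.
    if jobs and all(j["status"] == "complete" for j in jobs):
        generate_manifest(video_id, [j["rendition"] for j in jobs])
        mark_ready(video_id)
        return True
    return False
```

Run this whenever a worker reports completion; it is idempotent to re-check, which matters because completion events can arrive out of order or twice.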
Great Solution: Chunked Parallel Transcoding
Instead of treating a video as one monolithic blob, split it into time-based chunks (say, 10-second segments) before transcoding. Now a single video fans out into chunks × renditions independent tasks, and they all run in parallel.
The pipeline has three stages:
- Splitter: Downloads the raw video once, splits it into GOP-aligned chunks (splitting on keyframes to avoid artifacts), and uploads each chunk to object storage. Then it enqueues one task per chunk-rendition pair.
- Worker Pool: Each worker transcodes exactly one chunk at one rendition. A 10-second chunk transcodes in seconds, not minutes. Workers are stateless and ephemeral; you can use spot instances aggressively since any interrupted task just gets requeued.
- Stitcher: Once all chunks for a given rendition are done, a stitcher concatenates them (or, for HLS, simply generates the playlist pointing to the individual segment files, which means no stitching at all).
# Splitter logic
def split_video(video_id: str, chunk_duration_sec: int = 10):
raw_path = download_from_storage(video_id)
chunks = split_on_keyframes(raw_path, chunk_duration_sec)
renditions = [
{"resolution": "1080p", "bitrate_kbps": 5000},
{"resolution": "720p", "bitrate_kbps": 3000},
{"resolution": "480p", "bitrate_kbps": 1500},
{"resolution": "360p", "bitrate_kbps": 800},
]
for chunk in chunks:
upload_chunk_to_storage(video_id, chunk)
for rendition in renditions:
enqueue_task({
"video_id": video_id,
"chunk_index": chunk.index,
"rendition": rendition,
"chunk_storage_path": chunk.path,
})
For HLS specifically, each transcoded chunk already is a segment file (.ts or .fmp4). The stitcher's job reduces to writing the .m3u8 playlist that references them in order. No actual video concatenation needed.
This approach turns a 30-minute video with 4 renditions into ~720 independent tasks (180 chunks × 4 renditions), each completing in a few seconds. Total wall-clock time drops from potentially an hour to under a minute with enough workers.
Tip: This is what distinguishes senior candidates. Mentioning chunked parallel transcoding shows you understand that the unit of work should be as small as possible to maximize parallelism and minimize blast radius from failures. One failed chunk gets retried in seconds, not re-transcoding an entire video.

"How does adaptive bitrate streaming actually work?"
Interviewers ask this to test whether you understand what happens between "the user hits play" and "video appears on screen." If you just say "we use a CDN," that's not enough. You need to explain the mechanism that lets a viewer on a flaky train connection watch the same video as someone on gigabit fiber.
The core idea: the video player doesn't fetch one big file. It fetches a manifest that lists all available quality levels and their segment URLs, then downloads segments one at a time, choosing the quality level that matches current network conditions.
Two protocols dominate: HLS (Apple, uses .m3u8 playlists) and DASH (open standard, uses .mpd XML). They work almost identically in concept. For an interview, pick HLS since it's more widely deployed.
Here's what a master playlist looks like:
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
https://cdn.example.com/v/abc123/1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
https://cdn.example.com/v/abc123/720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480
https://cdn.example.com/v/abc123/480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
https://cdn.example.com/v/abc123/360p/playlist.m3u8
Each rendition's playlist then lists the actual segment files:
#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
https://cdn.example.com/v/abc123/720p/seg_000.ts
#EXTINF:10.0,
https://cdn.example.com/v/abc123/720p/seg_001.ts
#EXTINF:10.0,
https://cdn.example.com/v/abc123/720p/seg_002.ts
The player's adaptive bitrate (ABR) algorithm works like this: after downloading each segment, it measures how long the download took versus the segment's duration. If a 10-second segment at 720p took 3 seconds to download, there's headroom to try 1080p. If it took 9 seconds, the player drops to 480p for the next segment. The switch happens at segment boundaries, so the viewer sees at most one quality change per segment and never a mid-frame glitch.
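The per-segment decision can be sketched as a throughput-based picker. The bitrate ladder matches the master playlist above; the 0.8 safety factor is an illustrative assumption:

```python
RENDITION_BITS_PER_SEC = {  # matches the master playlist above
    "1080p": 5_000_000, "720p": 3_000_000,
    "480p": 1_500_000, "360p": 800_000,
}
SAFETY_FACTOR = 0.8  # illustrative: only budget ~80% of measured bandwidth

def pick_rendition(segment_duration_sec: float, download_sec: float,
                   segment_bytes: int) -> str:
    # Throughput-based ABR: estimate bandwidth from the last segment
    # download, then pick the highest rendition that fits with headroom.
    measured_bps = segment_bytes * 8 / download_sec
    budget = measured_bps * SAFETY_FACTOR
    for name, bps in sorted(RENDITION_BITS_PER_SEC.items(),
                            key=lambda kv: kv[1], reverse=True):
        if bps <= budget:
            return name
    return "360p"  # floor: always serve the lowest rung rather than stall
```

Real players (hls.js, ExoPlayer, AVPlayer) use smoothed estimates and buffer-occupancy signals on top of this, but the throughput comparison is the core of it.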
Segment size is a real tradeoff. Shorter segments (2-4 seconds) let the player react to bandwidth changes faster, giving a smoother experience on mobile networks. But each segment is a separate HTTP request, which means more connection overhead, more CDN cache entries, and more manifest bloat. Longer segments (10-15 seconds) are more efficient for stable connections but sluggish to adapt. Most production systems land on 4-6 second segments.
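To make the tradeoff concrete, a quick count of HTTP requests and CDN cache entries for a 10-minute video with 4 renditions:

```python
# Segment-size arithmetic: shorter segments mean many more objects per
# video, each a separate request and CDN cache entry.

def cache_entries(duration_s: int, segment_s: int, renditions: int) -> int:
    segments = -(-duration_s // segment_s)  # ceiling division
    return segments * renditions

short = cache_entries(600, 2, 4)   # 2s segments  -> 1200 objects
long_ = cache_entries(600, 10, 4)  # 10s segments ->  240 objects
```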
Key insight: Video segments are immutable. Once segment seg_042.ts for the 720p rendition is created, it never changes. This makes CDN caching trivially effective: set a long TTL (weeks or months) on segments, and a short TTL (minutes) on the manifest. The manifest is tiny (a few KB) and cheap to serve from origin on a cache miss.
The CDN caches each segment independently per rendition. A popular video might have its 1080p and 720p segments cached at every edge location, while 360p segments only get cached in regions with slower average connections. This happens naturally through access patterns; no special configuration needed.

"How do we protect content with DRM and encryption?"
If you're designing a service that hosts licensed content (movies, TV shows, sports), the content owners will require DRM before they hand over a single file. Even for user-generated content, creators don't want their videos trivially downloadable and redistributed. Skipping this topic in an interview signals you've only thought about the happy path.
Three DRM systems matter in practice: Widevine (Google, covers Chrome and Android), FairPlay (Apple, covers Safari and iOS), and PlayReady (Microsoft, covers Edge and smart TVs). A real streaming service must support all three to reach every device. The good news is that the underlying encryption standard is the same across all of them.
How It Works End-to-End
Every segment gets encrypted with AES-128 during the transcoding pipeline. The encryption key itself is never stored alongside the content. Instead, a license server holds the keys and only releases them to authenticated, authorized players.
The flow looks like this:
- During transcoding, the pipeline requests a content encryption key (CEK) from the key management service for each video.
- Each segment is encrypted with that CEK under the Common Encryption standard (CENC). CENC defines two modes: AES-CTR ("cenc", used by Widevine and PlayReady) and AES-CBC ("cbcs", required by FairPlay).
- The encrypted segments go to object storage and the CDN as usual. They're useless without the key.
- The HLS/DASH manifest includes a reference to the license server URL, not the key itself.
- When the player starts playback, it parses the manifest, sees the DRM requirement, and sends a license request to the license server.
- The license server verifies the user's authentication, checks their subscription tier, confirms the content is available in their region, and only then returns the decryption key wrapped in a DRM-specific license blob.
- The player's DRM module (built into the browser or OS) decrypts segments in a secure pipeline that prevents the application layer from accessing the raw bytes.
# During transcoding: request a key and encrypt segments
def encrypt_and_transcode(video_id: str, chunk: Chunk, rendition: dict):
    # Get or create the encryption key for this video
    key_info = key_management_service.get_key(
        video_id=video_id,
        key_id=generate_deterministic_key_id(video_id),
    )
    transcoded_path = transcode(chunk.path, rendition)
    encrypted_path = encrypt_segment(
        input_path=transcoded_path,
        key=key_info["cek"],
        iv=key_info["iv"],
        scheme="cenc",  # Common Encryption standard
    )
    upload_to_storage(video_id, rendition, chunk.index, encrypted_path)
The manifest points players to the license server:
#EXTM3U
#EXT-X-KEY:METHOD=SAMPLE-AES,URI="skd://license.example.com/key?video_id=abc123",KEYFORMAT="com.apple.streamingkeydelivery"
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
https://cdn.example.com/v/abc123/1080p/playlist.m3u8
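The authorization gate described in the flow above can be sketched as a minimal license endpoint. Everything here (the User, Video, and KeyStore shapes, the tier ranks) is a hypothetical stand-in; a real license server returns a DRM-specific blob built by the Widevine/FairPlay/PlayReady server SDKs, not a raw wrapped key.

```python
# Hypothetical license-server check: authenticate, authorize, then (and
# only then) release a wrapped content key. Data shapes are illustrative.
from dataclasses import dataclass

@dataclass
class User:
    is_authenticated: bool
    tier_rank: int
    device_drm_system: str = "widevine"

@dataclass
class Video:
    id: str
    min_tier_rank: int = 0
    blocked_regions: frozenset = frozenset()

class KeyStore:
    """Toy stand-in: real systems wrap the CEK in a DRM license blob."""
    def __init__(self, keys: dict):
        self._keys = keys
    def wrap_key_for_client(self, video_id: str, drm_system: str) -> bytes:
        return f"{drm_system}:{self._keys[video_id].hex()}".encode()

def handle_license_request(user: User, video: Video,
                           region: str, key_store: KeyStore) -> bytes:
    """Every policy check must pass before the key leaves the server."""
    if not user.is_authenticated:
        raise PermissionError("login required")
    if video.min_tier_rank > user.tier_rank:
        raise PermissionError("subscription tier too low")
    if region in video.blocked_regions:
        raise PermissionError("content not available in this region")
    return key_store.wrap_key_for_client(video.id, user.device_drm_system)
```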
Key Management Is the Hard Part
The encryption itself is straightforward. What trips people up is key lifecycle management.
You need a dedicated key management service (or use a managed one like AWS KMS or Google Cloud KMS, or a specialized DRM provider like BuyDRM or PallyCon). Each video gets a unique content key. Those keys are themselves encrypted with a master key (envelope encryption). The master key never leaves the KMS hardware security module.
CREATE TABLE content_keys (
    key_id UUID PRIMARY KEY,
    video_id UUID NOT NULL REFERENCES videos(id),
    encrypted_cek BYTEA NOT NULL,  -- CEK encrypted with the master key
    iv BYTEA NOT NULL,             -- initialization vector
    drm_systems TEXT[] NOT NULL,   -- e.g. {'widevine', 'fairplay', 'playready'}
    created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_content_keys_video ON content_keys(video_id);
Key rotation matters too. If a key is compromised, you need to re-encrypt the affected content and invalidate existing licenses. This is expensive, so the real defense is making compromise unlikely: HSMs for master keys, short-lived license tokens, and device attestation to ensure the player hasn't been tampered with.
Tip: You don't need to memorize the Widevine protocol spec. What interviewers want to hear is that you understand the separation of concerns: encrypted content on the CDN (public, cacheable, useless without keys), keys on a separate license server (authenticated, authorized, audited), and decryption happening inside a trusted execution environment on the client. That three-part architecture is the answer.
One subtlety worth mentioning: because all three DRM systems support CENC, you encrypt the content once and generate three different license server responses. You're not storing three copies of every video. The segments on the CDN are identical regardless of whether a Widevine, FairPlay, or PlayReady client requests them.
"How do we handle live streaming?"
VOD and live streaming share some infrastructure (CDN, player protocols, adaptive bitrate), but the architecture for getting a live signal from a camera to millions of viewers simultaneously is fundamentally different. If the interviewer asks about live, they're probing whether you can reason about latency constraints, real-time pipelines, and failure modes where "just retry" isn't an option.
In VOD, you transcode once and serve forever. In live, you're transcoding continuously, and every second of delay between the real world and the viewer's screen is a second where someone on Twitter spoils the goal.
Ingest: Getting the Stream In
The broadcaster (a creator with OBS, a sports production truck, a phone app) pushes a video stream to your platform. The standard protocol here is RTMP (Real-Time Messaging Protocol). It's old, it's not perfect, but it's what every encoder supports. Newer alternatives like SRT (Secure Reliable Transport) handle packet loss better over unreliable networks, but RTMP remains the default.
Your ingest layer is a fleet of servers that terminate these RTMP connections. Each ingest server validates the stream key (authentication), checks that the stream resolution and bitrate are within allowed limits, and forwards the raw stream to the transcoding pipeline.
Geographic distribution of ingest servers matters. A streamer in Seoul pushing to an ingest server in Virginia adds 150+ ms of network latency before you even start processing. Place ingest POPs in major regions and route streamers to the nearest one via GeoDNS, just like you do for viewers.
CREATE TABLE live_streams (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    channel_id UUID NOT NULL REFERENCES channels(id),
    stream_key VARCHAR(128) NOT NULL UNIQUE,
    status VARCHAR(20) NOT NULL DEFAULT 'idle',  -- idle, live, ended
    ingest_server VARCHAR(100),  -- which POP is handling this stream
    started_at TIMESTAMP,
    ended_at TIMESTAMP,
    recording_path TEXT  -- object storage path if DVR is enabled
);
CREATE INDEX idx_live_streams_channel ON live_streams(channel_id);
CREATE INDEX idx_live_streams_status ON live_streams(status);
Real-Time Transcoding: The Pipeline That Can't Fall Behind
Unlike VOD transcoding where you can queue work and process it whenever, live transcoding must keep up with real time. If a streamer is pushing 1080p60 and your transcoder falls behind by even a few seconds, that delay compounds and never recovers.
The architecture looks like this: the ingest server splits the incoming stream into GOP-aligned chunks (typically 2-4 seconds for live, shorter than VOD to minimize latency). Each chunk gets fanned out to parallel transcoders, one per output rendition. The transcoders must finish processing each chunk before the next one arrives.
# Live transcoding pipeline (per stream)
import asyncio

class LiveTranscoder:
    def __init__(self, stream_id: str, renditions: list):
        self.stream_id = stream_id
        self.renditions = renditions
        self.segment_index = 0

    async def on_chunk_received(self, chunk: bytes, pts: float):
        tasks = [
            transcode_chunk_async(chunk, rendition, self.stream_id, self.segment_index)
            for rendition in self.renditions
        ]
        # All renditions must complete before we can update the manifest
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for r in results:
            if isinstance(r, Exception):
                # Drop this rendition for this segment; the player will skip it
                log_error(f"Transcode failed for segment {self.segment_index}: {r}")
        self.update_live_manifest(results)
        self.segment_index += 1
Notice the failure handling difference from VOD. You can't retry a failed live segment by re-downloading the source. The source is gone; it was a real-time stream. If a rendition fails for one segment, you skip it. The player's ABR algorithm handles the gap by staying on whatever rendition it was already using.
Warning: Candidates sometimes propose using the same queue-based architecture from VOD transcoding for live. That won't work. Message queues introduce variable latency (queue depth, consumer lag, backoff on retries). Live transcoding needs a dedicated, always-running pipeline per active stream with direct memory handoff between ingest and transcode stages.
Packaging and Distribution: The Sliding Window
For VOD, the manifest is static: it lists all segments from start to finish. For live, the manifest is a sliding window that updates every few seconds.
A live HLS manifest looks like this:
#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:4521
#EXTINF:4.0,
https://cdn.example.com/live/stream123/720p/seg_4521.ts
#EXTINF:4.0,
https://cdn.example.com/live/stream123/720p/seg_4522.ts
#EXTINF:4.0,
https://cdn.example.com/live/stream123/720p/seg_4523.ts
The player polls this manifest every segment duration (every 4 seconds in this case). Each poll returns the latest few segments. The #EXT-X-MEDIA-SEQUENCE tag tells the player where it is in the stream so it doesn't re-download segments it already has.
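A sliding-window manifest like the one above can be produced by a small helper that keeps only the most recent N segments and advances the media sequence as old ones fall off. The URL layout and window size here are illustrative:

```python
# Sliding-window live HLS manifest: list the last `window` segments and set
# EXT-X-MEDIA-SEQUENCE to the index of the oldest one still listed.

def render_live_manifest(stream_id: str, rendition: str,
                         next_segment_index: int,
                         window: int = 3, target_duration: int = 4) -> str:
    first = max(0, next_segment_index - window)
    lines = [
        "#EXTM3U",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{first}",
    ]
    for i in range(first, next_segment_index):
        lines.append(f"#EXTINF:{target_duration:.1f},")
        lines.append(
            f"https://cdn.example.com/live/{stream_id}/{rendition}/seg_{i}.ts"
        )
    return "\n".join(lines)
```

With `next_segment_index=4524` this reproduces the example manifest above: segments 4521 through 4523, media sequence 4521.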
This is where CDN caching gets tricky. VOD segments are immutable and cacheable forever. Live manifests change every few seconds and must not be stale. Set Cache-Control: no-cache or a very short max-age (1-2 seconds) on live manifests. The segments themselves are still immutable once created, so they can be cached normally.
Latency Tiers
Not all live content needs the same latency. A music concert stream can tolerate 15-30 seconds of delay. A live auction or sports betting scenario needs sub-5-second latency. The architecture choices differ:
Standard latency (15-30s): Regular HLS/DASH with 4-6 second segments. Simple, reliable, works with existing CDN infrastructure. The player buffers 3-4 segments ahead, which adds latency but smooths over network hiccups.
Low latency (3-8s): Use LL-HLS (Low-Latency HLS) or LL-DASH. These protocols allow the player to request partial segments (called "parts") before the full segment is complete. The server pushes parts as they're transcoded, and the player can start rendering before the segment boundary. This requires CDN support for chunked transfer encoding and HTTP/2 push or preload hints.
Ultra-low latency (<2s): At this point, HLS/DASH breaks down. You need WebRTC for the last mile, which means peer-to-peer or server-to-client UDP-based delivery. This sacrifices scalability (WebRTC doesn't cache at CDN edges the way HTTP segments do) and quality adaptation. Typically reserved for interactive use cases like video calls or live auctions, not broadcast to millions.
Key insight: The interviewer isn't expecting you to design a WebRTC system. What they want to hear is that you recognize latency as a spectrum with engineering tradeoffs at each tier. Picking the right tier for the use case shows product thinking alongside systems thinking.
DVR and Rewind
Viewers expect to rewind live streams or start watching from the beginning after joining late. This is "DVR mode," and it's essentially a hybrid of live and VOD.
As segments are produced, they're written to both the CDN (for live playback) and object storage (for DVR/archival). The manifest expands to include a longer window of past segments. When the stream ends, the live manifest gets converted to a static VOD manifest, and the recording becomes a regular video in your catalog. The transcoded segments are already in the right format; no re-processing needed.
"How do we scale view counts without melting the database?"
At 50M concurrent streams, you could be receiving tens of millions of "I watched this" events per second. If each one triggers an UPDATE videos SET view_count = view_count + 1 WHERE id = ?, your database will catch fire. The interviewer is testing whether you understand write amplification and when approximate answers are good enough.
Bad Solution: Increment on Every View
Every time a client reports a view event, the API server runs an atomic increment against the videos table.
UPDATE videos SET view_count = view_count + 1 WHERE id = 'abc123';
For a viral video getting 100K views per second, that's 100K row-level locks per second on the same row. PostgreSQL will spend more time managing lock contention than doing actual work. Your database's CPU flatlines, write latency spikes, and every other query on the videos table suffers.
Warning: Candidates often say "just use a counter column with atomic increments" as if databases handle unlimited concurrent writes to the same row. They don't. Row-level lock contention is the bottleneck, and no amount of indexing fixes it.
Good Solution: Batched Writes via In-Memory Counter
Put Redis (or any fast in-memory store) between the API servers and the database. Each view event increments a Redis counter (INCR video:abc123:views). A background flusher periodically reads these counters and batch-writes them to the database.
# Background flusher (runs every 30 seconds)
def flush_view_counts():
    keys = redis.scan_iter("video:*:views")
    batch = []
    for key in keys:
        video_id = extract_video_id(key)
        count = redis.getset(key, 0)  # atomically read and reset
        if count and int(count) > 0:
            batch.append((video_id, int(count)))
    # Single transaction for the whole batch
    with db.transaction():
        for video_id, delta in batch:
            db.execute(
                "UPDATE videos SET view_count = view_count + %s WHERE id = %s",
                (delta, video_id),
            )
Now instead of 100K writes per second for a viral video, you get one write every 30 seconds. Redis handles the high-throughput increments effortlessly since it's single-threaded and in-memory.
The tradeoff: view counts in the database lag by up to 30 seconds. For a platform where videos display "1.2M views," nobody notices. You also need to handle Redis failures gracefully (accept that some counts might be lost, or persist the Redis data to disk).
Great Solution: Dedicated Counting Service with Probabilistic Deduplication
The good solution handles throughput, but it still counts every single event, including duplicates. A user who refreshes the page 50 times shouldn't generate 50 views. And bots can inflate counts trivially.
Build a dedicated counting service that sits between the API layer and the database. It does two things: deduplicates views and aggregates counts.
For deduplication, use a HyperLogLog per video per time window. HyperLogLog is a probabilistic data structure that estimates unique cardinality with ~0.8% error using only 12KB of memory per counter, regardless of how many unique elements you add.
class ViewCountingService:
    def __init__(self):
        self.redis = Redis()
        self.flush_interval = 60  # seconds

    def record_view(self, video_id: str, viewer_id: str):
        # HyperLogLog for unique-viewer deduplication
        hll_key = f"hll:views:{video_id}:{current_hour()}"
        self.redis.pfadd(hll_key, viewer_id)
        # Raw counter for total views (including repeats, for analytics)
        self.redis.incr(f"raw:views:{video_id}")

    def get_unique_views(self, video_id: str) -> int:
        # Merge HyperLogLogs across time windows
        keys = self.redis.keys(f"hll:views:{video_id}:*")
        if not keys:
            return 0
        self.redis.pfmerge("hll:tmp", *keys)
        return self.redis.pfcount("hll:tmp")

    def flush_to_db(self):
        # Periodic job: write aggregated unique counts to the metadata DB
        for video_id in get_active_videos():
            unique_count = self.get_unique_views(video_id)
            db.execute(
                "UPDATE videos SET view_count = %s WHERE id = %s",
                (unique_count, video_id),
            )
The counting service writes to the metadata DB on a relaxed schedule (every few minutes). The displayed view count is eventually consistent, which is perfectly fine. YouTube itself shows "1,234 views" that can lag by hours for viral content.
Tip: Mentioning HyperLogLog by name and knowing its error bounds (~0.8% with 12KB) signals real depth. If the interviewer asks "but what about exact counts?", explain that exact unique counting at this scale would require storing every (video_id, viewer_id) pair, which is billions of rows. The 0.8% error is a worthwhile tradeoff for orders of magnitude less storage and compute.
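To make the mechanism less magical, here's a toy HyperLogLog (not Redis's tuned 12KB implementation): hash each viewer id, use the top bits to pick a bucket, and keep the maximum "leading-zeros rank" per bucket. The harmonic mean of the bucket ranks yields the cardinality estimate.

```python
# Minimal HyperLogLog sketch for intuition only; register packing, bias
# correction, and sparse encoding from production implementations omitted.
import hashlib
import math

class TinyHLL:
    def __init__(self, b: int = 10):
        self.b = b
        self.m = 1 << b            # number of buckets (1024 here)
        self.registers = [0] * self.m

    def add(self, item: str):
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - self.b)          # top b bits choose the bucket
        rest = h & ((1 << (64 - self.b)) - 1)
        # rank = position of the first 1-bit in the remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        # small-range correction (linear counting) when many buckets are empty
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return e
```

Note the deduplication property: adding the same viewer id again recomputes the same bucket and rank, so the registers, and therefore the estimate, don't move.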

"How do we design the CDN and global distribution strategy?"
Video bytes are 99%+ of your bandwidth. If every segment request hits your origin servers, you'll need absurd network capacity and your viewers in Tokyo will wait hundreds of milliseconds for each segment from a US data center. The CDN strategy isn't an afterthought; it's the core of your read path.
GeoDNS routing is the entry point. When a viewer's player resolves cdn.example.com, GeoDNS returns the IP of the nearest edge cluster. A viewer in Berlin gets routed to Frankfurt, a viewer in São Paulo gets routed to a South American POP. This happens transparently at the DNS layer.
Immutable segments get aggressive caching. Since seg_042.ts for a given rendition never changes, set Cache-Control: public, max-age=31536000 (one year). The CDN edge serves it forever without revalidating. Manifests, on the other hand, need short TTLs (30-60 seconds) because you might update them if you add a new rendition or fix a segment reference.
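One way to express this policy, sketched as an application-level helper (real deployments usually encode it in CDN configuration instead; the path patterns are assumptions):

```python
# Cache-Control policy per asset type: immutable segments cache for a year,
# VOD manifests for a minute, live manifests barely at all.

def cache_headers(path: str) -> dict:
    if path.endswith((".ts", ".m4s", ".mp4")):
        # immutable media segments: cache for a year, never revalidate
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.endswith((".m3u8", ".mpd")):
        if "/live/" in path:
            # live manifests change every few seconds; don't let edges go stale
            return {"Cache-Control": "public, max-age=1"}
        # VOD manifests may occasionally be updated (new rendition, fix)
        return {"Cache-Control": "public, max-age=60"}
    return {"Cache-Control": "no-store"}
```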
For popular content, the CDN's natural access patterns handle caching well. A trending video's segments get pulled to every edge location within minutes as viewers worldwide request them. But what about the long tail?
Here's the uncomfortable math: on a platform like YouTube, the top 10% of videos account for roughly 90% of views. The remaining 90% of videos are rarely watched. Caching all of them at every edge would be astronomically expensive. Most CDN providers charge per GB stored at the edge, and you'd be paying to cache millions of videos that get one view per month.
The practical approach is tiered caching:
- Edge POPs (100+ locations): Cache only hot content. Small storage, fast eviction. LRU naturally keeps popular segments warm.
- Regional mid-tier caches (10-20 locations): Larger storage, hold moderately popular content. Edge misses go here before hitting origin.
- Origin storage (2-3 regions): Multi-region replicated object store (S3 with cross-region replication, or GCS multi-region buckets). This is the source of truth.
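The read-through behavior across these tiers can be sketched with dict-backed stand-ins for the caches: a miss at one tier falls through to the next, and the response fills the caches on the way back.

```python
# Tiered read-through: edge -> regional -> origin, filling caches on the
# return path. `origin` is any callable that fetches from object storage.

def fetch_segment(url: str, edge: dict, regional: dict, origin) -> bytes:
    if url in edge:
        return edge[url]          # edge hit: the common case for hot content
    if url in regional:
        data = regional[url]      # edge miss, regional mid-tier hit
    else:
        data = origin(url)        # both miss: go to the origin object store
        regional[url] = data      # fill the mid-tier on the way back
    edge[url] = data              # fill the edge for the next local viewer
    return data
```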
For cache warming, proactively push content to edge locations before demand hits. When a creator with 10M subscribers uploads a new video, don't wait for the first viewer in each region to trigger a cache miss. Push the first few segments of each rendition to major POPs as soon as transcoding completes. This eliminates the "thundering herd" problem where thousands of subscribers all trigger cache misses simultaneously.
def warm_cdn_for_video(video_id: str, creator_subscriber_count: int):
    if creator_subscriber_count < 100_000:
        return  # only warm for large creators
    segments = get_first_n_segments(video_id, n=5)  # roughly the first 30 seconds
    target_pops = get_top_pops_by_subscriber_region(video_id)
    for pop in target_pops:
        for segment in segments:
            cdn_api.prefetch(segment.url, pop_id=pop.id)
Regional content restrictions add another layer. Some videos can't be served in certain countries due to licensing or legal requirements. Handle this at the API layer, not the CDN layer. When a viewer requests a manifest, the API checks their region against the video's restriction list and returns a 451 (Unavailable For Legal Reasons) if blocked. The CDN itself doesn't need to know about restrictions; it just caches and serves whatever the origin allows.
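A minimal sketch of that API-layer check (the data shapes and response handling are illustrative):

```python
# Geo-restriction check in the manifest endpoint: resolve the viewer's
# region and refuse with HTTP 451 before handing out any segment URLs.

BLOCKED_REGIONS = {"abc123": {"DE", "FR"}}  # video_id -> blocked country codes

def get_manifest(video_id: str, viewer_region: str):
    if viewer_region in BLOCKED_REGIONS.get(video_id, set()):
        return 451, "Unavailable For Legal Reasons"
    return 200, f"https://cdn.example.com/v/{video_id}/master.m3u8"
```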
Key insight: Don't try to enforce geo-restrictions at the CDN edge. CDN configurations are error-prone and slow to update. Keep the logic in your API server where you have full control, and let the CDN be a dumb, fast cache.
Multi-region origin replication deserves a mention. If your single origin is in us-east-1 and it goes down, every cache miss globally fails. Replicate your object storage across at least two regions (e.g., US and EU). Configure the CDN to failover to the secondary origin automatically. The replication lag for new uploads is typically seconds, which is fine since videos take minutes to transcode anyway.

What is Expected at Each Level
Interviewers calibrate their evaluation based on the level you're interviewing for. Here's what separates a passing answer at each band.
Mid-Level
- Separate the write path from the read path. The single biggest signal at this level is recognizing that uploading/transcoding a video and streaming it to viewers are fundamentally different problems with different architectures. If you treat them as one monolithic flow, you'll lose the interviewer early.
- Put a CDN in front of video playback without being prompted. You don't need to explain cache warming strategies or TTL policies yet. But the interviewer expects you to know that serving video bytes directly from your origin servers to millions of concurrent viewers is a non-starter.
- Produce a coherent high-level design that includes object storage, a transcoding queue, and a metadata database. The components should connect logically: pre-signed URL upload to object storage, event triggers a queue message, workers transcode and write back. If you can draw this pipeline on a whiteboard and explain the data flow, you're in good shape.
- Understand why a single video needs multiple renditions. You don't need to explain HLS manifests in detail, but you should articulate that viewers on different devices and network conditions need different quality levels, and that transcoding produces these variants.
Senior
- Explain how adaptive bitrate streaming works end to end. Walk through the manifest file (HLS or DASH), how the player estimates bandwidth, and how it switches between renditions mid-stream. The interviewer wants to see that you understand the protocol, not just the concept of "different quality levels."
- Design the transcoding pipeline with failure handling baked in. Retries with exponential backoff, dead-letter queues for videos that repeatedly fail, and a status field on the Video entity that the client can poll or subscribe to. A senior candidate doesn't just draw the happy path.
- Propose a real solution for view counting at scale. Saying "just increment a counter in the database" on every view event for 50M concurrent streams will get you a raised eyebrow. You should reach for in-memory batching (Redis counters flushed periodically) and explain why eventual consistency is acceptable here.
- Discuss CDN caching with specifics. Long TTLs on immutable video segments, shorter TTLs on manifests that might update, and cache warming for content that's about to spike (new uploads from popular creators). This is where you show you've thought about the operational reality of serving video globally.
Staff+
- Drive toward chunked parallel transcoding without being asked. A 2-hour video shouldn't block a single worker for 2 hours. You should propose splitting the raw file into time-based segments, transcoding them across many workers in parallel, and stitching the results. Then discuss the coordination complexity this introduces and how you'd track per-chunk completion.
- Address the long-tail CDN cost problem. Most videos get very few views, but they still exist in storage. Caching everything at the edge is financially ruinous. You should propose tiered caching: hot content stays at the edge, warm content at regional mid-tier caches, and cold content is served directly from origin. Articulate the cost/latency tradeoff explicitly.
- Bring up the recommendation engine's integration challenges unprompted. How does the feed service handle cold-start for a brand new user with no watch history? What about a freshly uploaded video with no engagement signals? Staff candidates connect the feed system to the broader product experience and acknowledge the ML pipeline's dependency on engagement data that doesn't exist yet.
- Proactively discuss operational maturity. What happens when the transcoding backlog grows to 100K jobs? How do you monitor pipeline health, alert on processing delays, and gracefully degrade (maybe by temporarily reducing the number of renditions produced)? Cost management for transcoding compute, auto-scaling policies for the worker fleet, and multi-region origin replication strategies all belong here. The interviewer is looking for someone who thinks about running the system, not just building it.
Key takeaway: The architecture of a video streaming service is defined by one fundamental split: the write path (upload and transcode) is heavy, asynchronous, and failure-prone, while the read path (streaming playback) must be fast, globally distributed, and cache-friendly. Every design decision flows from keeping these two paths cleanly separated and optimized for their very different constraints.
