System Design Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 16, 2026

System design interviews are the make-or-break component of AI engineer interviews at top tech companies. Google, Meta, Amazon, Microsoft, OpenAI, and Anthropic all use system design rounds to evaluate whether you can architect production ML systems that handle real-world constraints like latency, scale, and reliability. Unlike coding interviews, there's no single correct answer, which means your approach and reasoning matter more than your final diagram.

What makes system design interviews particularly challenging for AI engineers is the intersection of distributed systems knowledge and ML-specific constraints. You might be asked to design a recommendation system that serves 10 million users while handling model updates, or architect a RAG pipeline that maintains sub-50ms latency across terabytes of embeddings. The tricky part isn't just knowing about load balancers or databases; it's understanding how GPU memory limits, model inference costs, and feature freshness requirements fundamentally change your architectural decisions.

Here are the top 30 system design questions organized by the core skills interviewers evaluate, from requirements gathering to distributed systems reliability.

Advanced · 30 questions

Top System Design interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.

AI Engineer · Google · Meta · Amazon · Microsoft · OpenAI · Anthropic · Netflix · Uber

System Design Framework & Requirements Gathering

Most candidates jump straight into drawing boxes and arrows without understanding what they're actually building. Interviewers test your requirements gathering skills because this reveals whether you approach ambiguous problems systematically or make dangerous assumptions that will sink your architecture later.

The biggest mistake here is treating functional requirements as obvious when they're often the most complex part. When asked to design a content moderation system, candidates assume they know what 'moderation' means, missing critical details like whether the system needs to handle images, what languages to support, or how false positives should be handled.

Before you sketch a single box on the whiteboard, interviewers want to see you clarify scope, define functional and non-functional requirements, and estimate capacity. Candidates struggle here because they jump straight into architecture without establishing constraints, which leads to unfocused designs that miss the mark entirely.

You're asked to design a real-time content moderation system using LLMs for a platform with 500 million daily active users. Walk me through how you would gather requirements and define scope before drawing any architecture.

Meta · Medium

Sample Answer

Most candidates default to immediately discussing which LLM to use or how to fine-tune it, but that fails here because you will design a completely different system depending on whether moderation must be synchronous (blocking post publication) or asynchronous (flagging after the fact). Start by clarifying content types: text only, or also images, video, and audio? Then pin down latency requirements, because synchronous moderation at 500M DAU means you need to estimate throughput at roughly $500M \times 20\ \text{posts/day} / 86400 \approx 115K\ \text{requests/sec}$ at peak, which rules out calling a massive LLM for every single post. You should also define non-functional requirements like false positive tolerance, regional compliance differences, and whether human-in-the-loop review is in scope. Only after these constraints are locked should you begin sketching a tiered architecture.
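The arithmetic in the answer above is worth doing explicitly in the interview. Here is a minimal back-of-the-envelope sketch; the 20 posts/user/day figure comes from the answer, while the 3x peak multiplier is an illustrative assumption you would state out loud:

```python
# Back-of-the-envelope throughput for synchronous moderation.
# Assumptions: 500M DAU, 20 posts/user/day, a rough 3x peak multiplier.
DAU = 500_000_000
POSTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400

avg_rps = DAU * POSTS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_rps = avg_rps * 3  # traffic is never uniform across the day

print(f"average: {avg_rps:,.0f} req/s, peak: ~{peak_rps:,.0f} req/s")
# average: 115,741 req/s -- far too much to route every post through a large LLM
```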

Practice more System Design Framework & Requirements Gathering questions

API Design & Data Contracts

API design questions reveal whether you understand the contract between services and can think through edge cases that break systems in production. Candidates often design APIs that work for the happy path but fall apart when handling errors, versioning, or unexpected input formats.

Your API design directly impacts how your system scales and evolves. A poorly designed inference API that doesn't support batching can cut your throughput by 10x, while inadequate error codes make debugging production issues nearly impossible.

You will be expected to define clean, versioned APIs that serve as the contract between system components, especially for ML model serving endpoints and data pipelines. Many candidates underestimate this area, designing vague interfaces that fall apart when the interviewer probes edge cases like pagination, rate limiting, or backward compatibility.

You are building a real-time ML inference API at Google that serves a text embedding model. Design the REST endpoint contract, including request/response schemas, error handling, and how you would handle batched inputs with varying sequence lengths.

Google · Medium

Sample Answer

Your endpoint should be POST /v1/embeddings with a request body containing a list of strings and an optional model parameter, returning a list of embedding vectors with corresponding indices. You batch inputs server-side up to a max batch size (e.g., 64) and enforce a per-string token limit (e.g., 8192 tokens), returning a 422 with a clear error pointing to which input exceeded the limit. For partial failures in a batch, return a 200 with an errors array alongside the successful results so the caller can retry selectively. Include a usage field in the response reporting total tokens consumed, which feeds into rate limiting tracked via a response header like X-RateLimit-Remaining.
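A sketch of that validation logic helps make the contract concrete. The endpoint shape, the 64-item batch cap, and the 8,192-token limit come from the answer above; the whitespace-based token count is a deliberate simplification of real tokenization:

```python
# Request validation for POST /v1/embeddings (limits are illustrative).
MAX_BATCH_SIZE = 64
MAX_TOKENS_PER_INPUT = 8_192  # real services count tokens, not words

def validate_request(body: dict) -> tuple[int, dict]:
    """Return (http_status, response_body) for an embeddings request."""
    inputs = body.get("input")
    if not isinstance(inputs, list) or not inputs:
        return 400, {"error": {"code": "invalid_request",
                               "message": "input must be a non-empty list of strings"}}
    if len(inputs) > MAX_BATCH_SIZE:
        return 422, {"error": {"code": "batch_too_large",
                               "message": f"max batch size is {MAX_BATCH_SIZE}"}}
    for i, text in enumerate(inputs):
        if len(text.split()) > MAX_TOKENS_PER_INPUT:  # crude word-count proxy
            return 422, {"error": {"code": "input_too_long", "index": i}}
    return 200, {"object": "list"}

status, _ = validate_request({"input": ["hello world"]})
print(status)  # 200
```

Returning the failing index in the error body is what lets a caller retry selectively instead of resubmitting the whole batch.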

Practice more API Design & Data Contracts questions

Database Design & Storage Systems

Storage decisions are where ML systems live or die in production. Interviewers probe your understanding of how data access patterns, consistency requirements, and query latency constraints should drive your choice between SQL, NoSQL, and specialized databases like vector stores.

The trap most candidates fall into is optimizing for the wrong bottleneck: choosing a vector database for 1 million embeddings might seem smart until the operational overhead outweighs the benefits, and picking DynamoDB for analytics workloads that need complex aggregations is an equally costly mismatch.

Choosing between SQL, NoSQL, vector databases, and object stores is a decision you will need to justify clearly in every system design round. This section tests whether you understand data modeling, indexing strategies, and the tradeoffs between consistency and availability, which is where candidates often give surface-level answers without connecting choices to the specific workload requirements.

You are building a retrieval-augmented generation (RAG) pipeline at scale for a product like Google Search. You need to store 500 million document embeddings (768 dimensions each) and serve nearest-neighbor queries at p99 latency under 50ms. How would you choose between a dedicated vector database like Pinecone or Milvus versus adding a vector index (pgvector) to your existing PostgreSQL cluster?

Google · Hard

Sample Answer

You could use pgvector inside PostgreSQL or a purpose-built vector database like Milvus. A dedicated vector database wins here because at 500 million embeddings, you need sharding, HNSW or IVF indexing with fine-tuned parameters, and memory-mapped storage optimized for high-dimensional similarity search. PostgreSQL with pgvector works well under ~10 million vectors, but it lacks native distributed vector indexing, and at this scale your p99 latency target of 50ms would be nearly impossible to hit without significant custom engineering. You should also mention that a dedicated system lets you decouple embedding storage scaling from your transactional database scaling, which matters when your RAG query volume spikes independently of your CRUD workload.
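It helps to back the scale argument with numbers. A quick footprint estimate for the corpus described in the question, using float32 and no index overhead (HNSW typically adds roughly another 1.5-2x on top, a ballpark rather than an exact figure):

```python
# Raw memory footprint of the embedding corpus (float32, no index overhead).
NUM_VECTORS = 500_000_000
DIMS = 768
BYTES_PER_FLOAT = 4

raw_bytes = NUM_VECTORS * DIMS * BYTES_PER_FLOAT
print(f"{raw_bytes / 1e12:.2f} TB raw")  # 1.54 TB before index overhead
```

A ~1.5 TB corpus cannot sit in one PostgreSQL node's memory, which is the concrete reason the sharded, purpose-built option wins at this scale.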

Practice more Database Design & Storage Systems questions

Scalability, Load Balancing & Caching

Scalability questions test whether you can identify bottlenecks before they hit production and design systems that gracefully handle increasing load. This is where many ML systems fail because traditional web service scaling patterns don't account for GPU constraints and model-specific resource requirements.

Caching becomes exponentially more complex in ML systems because cached features can go stale and cached model outputs depend on model versions. A feature cache that improves latency but serves outdated data to your fraud detection model is worse than no cache at all.

When your system needs to handle millions of inference requests or training data ingestion at scale, interviewers expect you to articulate horizontal scaling strategies, caching layers, and load balancing policies with precision. You will find that many interview failures stem from memorizing generic patterns without being able to reason about when each technique applies and what bottlenecks it actually solves.

You are designing a real-time inference service at Google that serves a large language model to 10 million users. Each request requires ~2 seconds of GPU compute. Walk me through how you would horizontally scale this system and what bottlenecks you would hit first.

Google · Hard

Sample Answer

Reason through it: Start by estimating the load. If you have $10^7$ users generating, say, 0.1 requests per second on average, that is $10^6$ RPS, and each ties up a GPU for 2 seconds, so you need roughly $2 \times 10^6$ GPU-seconds of capacity per second, meaning $2 \times 10^6$ concurrent GPU slots. You would horizontally scale by deploying model replicas across thousands of GPU nodes behind a load balancer, but the first bottleneck you hit is GPU memory: each replica must hold the full model weights, so you need to decide between tensor parallelism across GPUs within a node versus full replication across nodes. Next, you would face network bandwidth bottlenecks if you try to shard the model across nodes, so the practical move is to replicate the full model per node (or per multi-GPU node using intra-node tensor parallelism) and scale out the number of nodes. Finally, autoscaling on GPU utilization rather than CPU or request count is critical because GPU saturation is your true constraint.
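The capacity math above can be written out as a quick script; the 0.1 requests/user/sec rate is an assumption you would state to the interviewer, not a given:

```python
# GPU capacity estimate for the serving problem above.
USERS = 10_000_000
RPS_PER_USER = 0.1           # assumed average rate; say this assumption out loud
GPU_SECONDS_PER_REQUEST = 2

total_rps = USERS * RPS_PER_USER
concurrent_slots = total_rps * GPU_SECONDS_PER_REQUEST  # Little's law: L = lambda * W
print(f"{total_rps:,.0f} req/s -> {concurrent_slots:,.0f} concurrent GPU slots")
```

Framing the concurrency requirement as Little's law (occupancy = arrival rate x service time) is a one-liner that signals you can turn vague scale numbers into hardware counts.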

Practice more Scalability, Load Balancing & Caching questions

ML System Architecture & Model Serving

ML system architecture questions are where domain expertise separates strong candidates from generic backend engineers. Interviewers want to see that you understand model serving patterns, how to handle model updates safely, and the specific failure modes that plague production ML systems.

Model versioning and rollback strategies are particularly revealing because they require deep understanding of how ML systems differ from traditional services. Rolling back a recommendation model affects user experience differently than rolling back a payment service, and your architecture needs to account for these nuances.

As an AI Engineer, you are uniquely expected to design end-to-end ML systems covering feature stores, training pipelines, model registries, and real-time or batch inference infrastructure. Candidates often treat ML components as black boxes instead of addressing latency budgets, model versioning, A/B testing frameworks, and graceful degradation when models fail or drift.

Design a real-time model serving system for a product recommendation engine at scale, where the p99 latency budget is 50ms and you need to support rolling model updates without downtime. Walk through your architecture from feature retrieval to response.

Amazon · Hard

Sample Answer

This question is checking whether you can decompose a latency-critical inference path into its component stages and reason about where time is spent. You should design a two-tier architecture: a precomputed candidate retrieval layer (e.g., approximate nearest neighbors over embeddings refreshed hourly) followed by a lightweight ranking model served via a low-latency framework like TensorFlow Serving or Triton behind a load balancer. Feature retrieval must hit an online feature store (Redis or DynamoDB) with precomputed user and item features, not a batch warehouse, to stay within your 50ms budget. For zero-downtime updates, use blue-green deployment of model versions behind a traffic router, where the new version warms up and passes shadow traffic validation before receiving live traffic. Budget roughly 10ms for feature fetch, 15ms for candidate retrieval, 15ms for ranking inference, and 10ms for network overhead, and instrument each segment so you can pinpoint regressions.
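Writing the budget down and checking that it sums within the p99 target is a habit worth showing on the whiteboard. The per-stage numbers below are the illustrative allocations from the answer above:

```python
# p99 latency budget from the answer above; instrument each stage in production.
budget_ms = {
    "feature_fetch": 10,
    "candidate_retrieval": 15,
    "ranking_inference": 15,
    "network_overhead": 10,
}
total_ms = sum(budget_ms.values())
assert total_ms <= 50, "stage budgets exceed the p99 target"
print(f"{total_ms} ms of 50 ms p99 budget allocated")
```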

Practice more ML System Architecture & Model Serving questions

Distributed Systems & Reliability

Distributed systems reliability questions test your ability to reason about failure scenarios and design systems that degrade gracefully rather than catastrophically. This becomes particularly complex in ML systems where partial failures can lead to inconsistent model outputs or training data corruption.

Consensus algorithms and replication strategies take on new meaning when you're serving LLMs with shared cache layers. A partitioned node serving stale cached embeddings might cause model hallucinations that are nearly impossible to debug, making traditional eventually-consistent approaches dangerous.

Designing for fault tolerance, data replication, and consistency in distributed environments is the most advanced area interviewers probe, and it separates strong candidates from average ones. You need to demonstrate fluency with concepts like consensus protocols, event-driven architectures, idempotency, and failure recovery, particularly as they apply to distributed training jobs and multi-region AI serving infrastructure.

You are serving a large language model across three regions (US, EU, Asia) with a shared KV cache layer for prompt prefixes. A network partition isolates the EU region. How do you design the system so EU users still get responses without stale or inconsistent cache entries causing hallucination divergence?

Google · Hard

Sample Answer

The standard move is to use a consensus protocol like Raft across regions to keep the KV cache consistent. But here, availability matters more than strict consistency because a partitioned region serving slightly redundant computation is far better than returning errors to all EU users. You should design the KV cache with a local fallback: if the EU node cannot reach the global cache leader, it recomputes prefix KV entries locally and tags them as provisional. Once the partition heals, you reconcile by comparing cache entry hashes and evicting provisional entries that diverge from the authoritative version. This gives you availability under partition while bounding inconsistency to a temporary compute cost, not a correctness cost, since the model itself is deterministic given the same inputs.
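A minimal sketch of that provisional-entry fallback, assuming a deterministic model so a content hash can stand in for the actual KV tensors (all class and method names here are hypothetical, not a real library API):

```python
import hashlib

class RegionalKVCache:
    """Per-region prefix cache with a provisional local-compute fallback."""

    def __init__(self):
        self.entries = {}  # prefix -> (kv_hash, is_provisional)

    def get_or_compute(self, prefix: str, leader_reachable: bool):
        if prefix in self.entries:
            return self.entries[prefix]
        # Deterministic model: hashing the prefix stands in for real KV tensors.
        kv_hash = hashlib.sha256(prefix.encode()).hexdigest()
        # If the global cache leader is unreachable, tag the entry as provisional.
        self.entries[prefix] = (kv_hash, not leader_reachable)
        return self.entries[prefix]

    def reconcile(self, authoritative: dict):
        """After the partition heals, promote matching entries and evict the rest."""
        for prefix, (kv_hash, provisional) in list(self.entries.items()):
            if not provisional:
                continue
            if authoritative.get(prefix) == kv_hash:
                self.entries[prefix] = (kv_hash, False)  # promote to authoritative
            else:
                del self.entries[prefix]  # divergent: recompute on next request
```

The key property is that an evicted divergent entry only costs a recomputation, never an incorrect response, which is exactly the availability-over-consistency tradeoff the answer argues for.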

Practice more Distributed Systems & Reliability questions

How to Prepare for System Design Interviews

Practice the two-way conversation

System design interviews are collaborative, not interrogations. Practice asking clarifying questions out loud and explaining your reasoning as you design. Record yourself walking through a system design and listen for moments where you make assumptions without stating them.

Learn ML-specific bottlenecks by heart

Memorize the performance characteristics that matter for ML systems: GPU memory limits, batch size impact on throughput, feature staleness tolerance, and model cold start times. These constraints should drive your architectural decisions, not be afterthoughts.

Build a mental library of scaling patterns

Study how real companies scale their ML systems by reading engineering blogs from Google, Meta, Netflix, and Uber. Focus on specific numbers: request volumes, latency requirements, and infrastructure costs. Generic scaling knowledge won't cut it for ML system design.

Master the art of reasonable estimation

Practice back-of-the-envelope calculations for storage, compute, and network requirements until they become automatic. Know how to estimate embedding storage needs, GPU memory requirements for different model sizes, and feature serving QPS from user behavior patterns.
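Two of those estimators are worth having at your fingertips, sketched below; fp16 weights (2 bytes per parameter) and float32 embeddings (4 bytes per dimension) are the assumed formats:

```python
# Two estimators to have memorized (fp16 weights, float32 embeddings assumed).
def model_weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for model weights alone (no KV cache/activations)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def embedding_store_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw embedding storage before any index overhead."""
    return n_vectors * dims * bytes_per_float / 1e9

print(model_weight_gb(7))                    # 14.0 (GB for a 7B-param model)
print(embedding_store_gb(10_000_000, 768))   # 30.72 (GB for 10M 768-dim vectors)
```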

Design for failure scenarios first

Always start with how your system handles failures rather than treating reliability as an add-on. Walk through specific scenarios: what happens when your feature store is down, how you handle model inference timeouts, and how you ensure training data consistency during distributed failures.

How Ready Are You for System Design Interviews?


You're asked to design a URL shortening service in an interview. What should you do first before jumping into architecture diagrams?

Frequently Asked Questions

How much depth is expected in a System Design interview for an AI Engineer role?

You need to go well beyond surface-level architecture diagrams. Interviewers expect you to discuss ML-specific concerns like model serving infrastructure, feature stores, training pipelines, data versioning, and latency/throughput tradeoffs in detail. You should be able to reason about scaling model inference, handling model updates without downtime, and monitoring for data drift or model degradation in production.

Which companies ask the most System Design questions for AI Engineer positions?

Large tech companies like Google, Meta, Amazon, Apple, and Microsoft almost always include a system design round for AI Engineers. ML-focused companies such as OpenAI, Anthropic, and Netflix also emphasize it heavily. Startups with production ML systems increasingly include these rounds too, since they need engineers who can design end-to-end ML infrastructure, not just train models.

Will I need to write actual code during a System Design interview?

Typically, no. System Design rounds focus on whiteboarding or diagramming architectures, discussing component interactions, and making design tradeoffs. However, you may be asked to write pseudocode for specific components like a feature pipeline or a model serving API. Some companies blend system design with a coding round, so review the interview format beforehand and practice coding problems at datainterview.com/coding to stay sharp.

How does the System Design interview differ for AI Engineers compared to general software engineers?

For AI Engineers, the focus shifts heavily toward ML-specific systems: designing training pipelines, real-time inference services, feature stores, embedding retrieval systems, and RAG architectures. General software engineering system design tends to center on web services, databases, and distributed systems. As an AI Engineer, you are still expected to understand those fundamentals, but you must also demonstrate expertise in ML infrastructure, model lifecycle management, and data pipeline design.

How should I prepare for System Design interviews if I have no real-world experience building ML systems?

Start by studying published architectures from companies like Uber, Spotify, and Google, which often share blog posts about their ML infrastructure. Build small end-to-end projects that include data ingestion, model training, serving, and monitoring so you can speak from hands-on experience. Practice articulating design decisions out loud, and work through ML system design questions at datainterview.com/questions to build structured thinking habits.

What are the most common mistakes candidates make in System Design interviews for AI Engineer roles?

The biggest mistake is jumping straight into model architecture without addressing the broader system: data collection, storage, serving, and monitoring. Another common error is ignoring scalability and operational concerns like how to retrain models, handle failures, or roll back bad deployments. Candidates also frequently forget to clarify requirements upfront, which leads to designing a system that does not match what the interviewer had in mind. Always start by asking questions and defining scope before diving into components.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn