Cloud Infrastructure Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 13, 2026

Cloud infrastructure questions dominate data engineering interviews at FAANG companies and modern data platforms like Snowflake, Databricks, and Netflix. These companies expect you to architect end-to-end data systems that handle petabytes of data, serve millions of users, and scale across multiple regions. Unlike traditional software engineering interviews, data infrastructure questions test your ability to choose the right managed services, design for failure scenarios, and optimize for both cost and performance at massive scale.

What makes cloud infrastructure interviews particularly challenging is that interviewers expect you to reason about trade-offs across multiple dimensions simultaneously. For example, you might need to design a real-time feature store that handles 500k requests per second while staying under a $10k monthly budget, ensuring sub-50ms p99 latency, and maintaining exactly-once delivery guarantees during regional outages. The wrong choice of compute instance, storage tier, or networking configuration can blow your budget by 10x or introduce subtle data consistency bugs that only surface under load.

Here are the top 30 cloud infrastructure questions organized by the six areas that matter most in data engineering interviews.


Cloud Fundamentals and Core Building Blocks

Interviewers use cloud fundamentals questions to assess whether you understand how compute, storage, networking, and DNS work together in real data systems. Most candidates fail because they memorize service names without understanding the underlying resource limits, failure modes, and performance characteristics that drive architectural decisions.

The key insight interviewers look for is your ability to trace data flow through multiple system boundaries and predict where bottlenecks will emerge. When you design an event ingestion pipeline, you need to reason about object storage request limits, network bandwidth between regions, DNS failover timing, and compute autoscaling delays as an interconnected system, not isolated components.


Start by proving you can map a data platform to compute, storage, networking, and DNS choices. You struggle here when you describe services but cannot explain tradeoffs, failure modes, or how traffic and data actually flow.

You are designing an event ingestion path for clickstream data: producers in multiple regions send to a regional endpoint, data lands in object storage, then a batch job processes it into analytics tables. Walk me through the compute, storage, networking, and DNS pieces you would choose, and explain how traffic flows end to end during a regional outage.

Netflix · Hard

Sample Answer

Most candidates default to a single global endpoint and a single bucket, but that fails here because a region failure turns into total ingestion loss, and DNS cutovers are not instant. You should front producers with regional DNS names and health checks, terminate TLS at a regional load balancer, then write to regional durable storage with replication to a second region using async copy or cross-region replication. During a regional outage, DNS or client-side region selection shifts new traffic to the healthy region, while backfill replays from the message log or retry buffer so you do not lose events. You must call out failure modes: partial writes, duplicate events, and eventual consistency, then show how you handle them with idempotent keys and replayable logs.
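The idempotent keys and replay handling above can be sketched as a deduplicating sink. The event shape and in-memory store here are hypothetical; a production system would back the seen-keys set with a durable keyed store (with a TTL) rather than process memory:

```python
import hashlib

def idempotency_key(event: dict) -> str:
    """Derive a stable key from producer, session, and event timestamp."""
    raw = f"{event['producer_id']}:{event['session_id']}:{event['ts']}"
    return hashlib.sha256(raw.encode()).hexdigest()

class DedupingSink:
    """Accepts at-least-once delivery; writes each logical event exactly once."""
    def __init__(self):
        self.seen = set()        # in production: a durable keyed store with TTL
        self.written = []

    def write(self, event: dict) -> bool:
        key = idempotency_key(event)
        if key in self.seen:
            return False         # duplicate from a retry or cross-region replay
        self.seen.add(key)
        self.written.append(event)
        return True

sink = DedupingSink()
click = {"producer_id": "p1", "session_id": "s1", "ts": 1700000000, "page": "/home"}
sink.write(click)    # first delivery is written
sink.write(click)    # replayed duplicate is dropped
```

The point of the sketch is that dedup must be keyed on event identity, not arrival order, so the same logic works whether the duplicate comes from a producer retry or a post-outage backfill.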


AWS vs GCP vs Azure Service Mapping for Data Engineering

Service mapping questions reveal whether you understand the functional differences between AWS, GCP, and Azure services beyond surface-level feature comparisons. Candidates often stumble because they assume services with similar names have identical capabilities, missing critical differences in consistency models, scaling limits, or integration patterns that affect data pipeline design.

The trap most candidates fall into is recommending direct service swaps without considering the ecosystem effects. For instance, migrating from AWS Kinesis to GCP Pub/Sub isn't just about message throughput; it's about how authentication, monitoring, dead-letter queues, and exactly-once semantics change across the entire pipeline stack.


In interviews, you are often asked to translate an architecture across clouds and justify equivalent services. Candidates stumble when they rely on brand names instead of capabilities like object storage semantics, managed Spark options, and warehouse primitives.

Your current AWS batch pipeline lands raw events in S3, runs Spark ETL on EMR, and loads curated tables into Redshift. You are asked to propose the closest GCP and Azure equivalents, and call out one capability mismatch you would validate in a proof of concept.

Netflix · Medium

Sample Answer

Map S3 plus EMR plus Redshift to GCS plus Dataproc plus BigQuery on GCP, and to ADLS Gen2 plus Azure Databricks or HDInsight plus Synapse Dedicated SQL Pool on Azure. The key is to map capabilities, not names: object storage for raw data, managed Spark for ETL, and an MPP warehouse for serving. Validate IAM and data access semantics early, for example cross-account or cross-project access patterns, and how Spark commit protocols behave when reading and writing object storage. Also validate warehouse loading patterns and concurrency behavior, for example COPY versus external tables, and how partitioning and clustering choices translate.
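The capability-first mapping can be written down as a small lookup table. This is an illustration of the reasoning, not an exhaustive equivalence, and each mapping still needs proof-of-concept validation:

```python
# Map capabilities, not brand names: each row is one architectural role.
SERVICE_MAP = {
    "object_storage": {"aws": "S3",       "gcp": "GCS",      "azure": "ADLS Gen2"},
    "managed_spark":  {"aws": "EMR",      "gcp": "Dataproc", "azure": "Azure Databricks / HDInsight"},
    "mpp_warehouse":  {"aws": "Redshift", "gcp": "BigQuery", "azure": "Synapse Dedicated SQL Pool"},
}

def translate(architecture: list, target_cloud: str) -> list:
    """Translate a list of capabilities into the target cloud's services."""
    return [SERVICE_MAP[cap][target_cloud] for cap in architecture]

pipeline = ["object_storage", "managed_spark", "mpp_warehouse"]
gcp_stack = translate(pipeline, "gcp")
```

Framing the translation this way also makes the mismatches visible: any capability whose semantics differ across clouds (consistency, commit protocol, loading model) is exactly where the proof of concept should focus.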


Serverless and Managed Services for Data Pipelines

Serverless and managed service questions test your judgment about when to give up control for operational simplicity. Engineering teams increasingly prefer managed services to reduce operational overhead, but interviewers want to see that you understand the constraints and design patterns that make serverless architectures successful at scale.

The critical insight is recognizing that serverless services introduce new failure modes around cold starts, concurrency limits, and retry behavior that don't exist in traditional infrastructure. You need to design your data pipeline to work with these constraints, not against them, which often means rethinking your approach to state management, error handling, and backpressure control.


You will need to defend when to choose managed over self-hosted for ingestion, orchestration, and streaming. Many candidates miss operational constraints like cold starts, concurrency limits, retries, and how managed runtimes impact observability.

You are ingesting clickstream events into a lake, and traffic spikes from 5k to 200k events per second during live sports. Would you choose a managed streaming service with serverless consumers or self-hosted Kafka with long-running consumers, and how do you handle cold starts, concurrency limits, and backpressure?

Netflix · Hard

Sample Answer

You could run managed streaming plus serverless consumers, or self-hosted Kafka plus long-running consumers. Managed wins here because it offloads broker operations and scaling, and you can absorb spikes with autoscaling and partition-based parallelism while keeping SLOs. The tradeoff is cold starts and concurrency caps, so you design idempotent consumers, pre-warm critical functions, and use batching with a bounded retry policy and a DLQ. You also cap per-key throughput, implement backpressure via consumer max-in-flight limits, and monitor lag to trigger scaling before you hit throttling.
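The bounded retry policy with a dead-letter queue can be sketched as follows. The handler and event shapes are hypothetical, and managed services typically expose the retry policy and DLQ as configuration rather than code; this shows the control flow you are configuring:

```python
import time

def process_with_retries(event, handler, max_retries=3, dead_letters=None, base_delay=0.0):
    """Attempt the handler with exponential backoff; dead-letter on exhaustion."""
    if dead_letters is None:
        dead_letters = []
    for attempt in range(max_retries):
        try:
            return handler(event)
        except Exception:
            time.sleep(base_delay * (2 ** attempt))   # backoff between attempts
    dead_letters.append(event)    # retries exhausted: park the event in the DLQ
    return None

dlq = []

def flaky_handler(event):
    raise RuntimeError("downstream throttled")

# a handler that always fails ends up in the DLQ instead of retrying forever
process_with_retries({"id": 1}, flaky_handler, max_retries=2, dead_letters=dlq)
```

Bounding retries is what turns a throttled downstream into a bounded lag problem rather than an unbounded retry storm; the DLQ preserves the failed events for replay once the downstream recovers.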


Infrastructure as Code and Environment Promotion

Infrastructure as Code questions evaluate your ability to manage complex data platform deployments across multiple environments without breaking production systems. Most candidates struggle here because they focus on the Terraform syntax rather than the organizational processes, state management, and promotion workflows that prevent outages during infrastructure changes.

What separates senior candidates is understanding how to structure IaC for team collaboration and environment promotion while handling secrets, cross-account permissions, and state drift safely. You need to show that you can detect when manual changes have created drift, decide whether to import or revert those changes, and design variable structures that make environment promotion predictable and auditable.


Expect to be tested on how you make infrastructure reproducible across dev, staging, and prod using Terraform, CloudFormation, ARM, or similar. You get tripped up when you cannot explain state, drift detection, secret handling, or safe rollout patterns.

You have Terraform managing a data platform stack across dev, staging, and prod. A manual console change in prod fixed an outage, and now the next terraform apply wants to revert it. How do you detect the drift and decide whether to import, update code, or roll back?

Amazon · Hard

Sample Answer

Reason through it step by step. First, you confirm drift by running terraform plan against the prod workspace with the same backend state, so you see exactly which attributes differ. Next, you classify the console change as either an emergency hotfix you want to keep or a mistake you want to undo, and you validate impact with logs and recent incidents. If you want to keep it, you update the Terraform code to match the live config, then apply so state and reality converge, and only then promote the same change through staging and dev if appropriate. If the change was wrong, you let apply revert it, but you do it with a targeted rollout plan, a smaller blast radius, and a clear rollback path.
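Conceptually, the drift check is a diff between the declared configuration and the live one, which is the comparison terraform plan computes against state. A toy sketch of that comparison in Python, with hypothetical attribute names and values:

```python
def diff_attrs(desired: dict, live: dict) -> dict:
    """Return the attributes whose live value differs from the declared value."""
    return {
        k: {"desired": desired.get(k), "live": live.get(k)}
        for k in set(desired) | set(live)
        if desired.get(k) != live.get(k)
    }

# declared in code/state vs. what the console hotfix left running (illustrative)
declared = {"instance_type": "m5.xlarge",  "max_connections": 100}
actual   = {"instance_type": "m5.2xlarge", "max_connections": 100}

drift = diff_attrs(declared, actual)
```

If the hotfix should stay, you change the declared value to match the live one before applying; if not, applying the declared value is the rollback, and the diff tells you exactly what will change.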


Security, IAM, and Data Governance Controls

Security and governance questions assess whether you can implement fine-grained access controls and audit trails for data systems handling sensitive information. Candidates typically fail by proposing overly permissive IAM policies or missing critical security boundaries between data zones, environments, or user roles.

The challenge interviewers focus on is designing least-privilege access that actually works in practice when data engineers need to deploy code, debug production issues, and maintain cross-account data flows. You must balance security isolation with operational efficiency, often using techniques like assume-role patterns, resource-based policies, and break-glass access procedures that provide audit trails without blocking legitimate work.


Security questions evaluate whether you can design least-privilege access for humans, services, and pipelines. You often struggle if you hand-wave policies, key management, network isolation, and cross-account or cross-project access patterns.

You have an S3 data lake with raw and curated zones. A new Databricks job in a different AWS account must read only one curated prefix and write to a separate output prefix. How do you design IAM, KMS, and bucket policies to enforce least privilege and prevent lateral access to other prefixes?

Amazon · Hard

Sample Answer

This question is checking whether you can translate least privilege into concrete resource policies, key policies, and cross account assumptions. You set up an IAM role in the data lake account that trusts the Databricks account role, then you lock permissions to exact ARNs like s3:GetObject on curated/prefix/* and s3:PutObject on output/prefix/*. You add a bucket policy that allows only that role, enforces TLS, and denies access outside approved prefixes using explicit Deny with s3:prefix and object ARNs. For encryption, you use SSE-KMS with a CMK whose key policy and grants allow decrypt for reads and encrypt for writes only for that role, plus restrict via kms:EncryptionContext conditions tied to the bucket and prefixes.
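The prefix-scoped bucket policy in this answer can be written out as a JSON document. A sketch in Python, with the account ID, role name, bucket, and prefixes as placeholders:

```python
import json

ROLE_ARN = "arn:aws:iam::111111111111:role/databricks-etl"  # placeholder role
BUCKET = "example-data-lake"                                # placeholder bucket

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read-only on the single curated prefix
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/curated/sales/*",
        },
        {   # write-only on the separate output prefix
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/output/sales/*",
        },
        {   # refuse any non-TLS request to the bucket
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Scoping the Allow statements to exact object ARNs under each prefix is what prevents lateral access: anything outside curated/sales/ and output/sales/ simply never matches an Allow, and the KMS key policy applies the same role restriction to decryption.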


Cost, Performance, and Reliability Optimization at Scale

Cost and performance optimization questions test your ability to diagnose resource bottlenecks and implement changes that improve both efficiency and user experience. Senior data engineering roles require you to manage infrastructure budgets measured in hundreds of thousands of dollars per month, so interviewers expect you to think like a business owner, not just a technologist.

What makes these questions difficult is that cost optimization often conflicts with performance requirements, and the optimal solution depends on usage patterns, SLA requirements, and business priorities that aren't immediately obvious. You need to demonstrate a systematic approach to identifying the highest-impact optimizations, measuring their effectiveness, and making trade-offs between competing objectives like query latency, data freshness, and infrastructure costs.


At senior levels, you must show you can hit SLOs while controlling spend and protecting against outages. You may stumble when you cannot quantify drivers like egress, scan costs, autoscaling behavior, multi region strategy, and workload isolation.

Your daily Spark job on S3 jumped from $800/day to $2,500/day after a schema change that added a nested JSON column. You are still meeting the 2-hour SLO. What do you investigate first, and what concrete changes do you make to cut cost without breaking the SLO?

Databricks · Medium

Sample Answer

The standard move is to reduce bytes scanned: partition on the most selective predicates, use columnar formats, and enforce projection so you only read the columns you need. But here, the nested JSON matters because it can defeat predicate pushdown and inflate IO even if you do not select it, so you should verify the physical plan, file sizes, and whether the new column forced wider rows or less effective compression. Convert JSON to typed columns, store as Parquet or Delta with stats, and consider ZORDER or clustering on the common filters to cut scan. Then right size the cluster for the new IO pattern, because overprovisioned executors can hide inefficiency while still meeting the SLO.
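A back-of-envelope model shows why the nested JSON column dominates IO in a columnar layout: bytes scanned scale with the width of the columns you actually read. The row counts and per-column sizes below are illustrative, not measurements:

```python
def scan_bytes(rows, bytes_per_col, cols_read):
    """Bytes scanned in a columnar layout: only referenced columns are read."""
    return rows * sum(bytes_per_col[c] for c in cols_read)

rows = 5_000_000_000  # hypothetical 5B events/day

# before flattening: queries must read the wide JSON blob to get any field
blob_cols = {"user_id": 8, "ts": 8, "url": 60, "payload_json": 900}
json_scan = scan_bytes(rows, blob_cols, ["user_id", "ts", "payload_json"])

# after flattening: the few fields actually queried become narrow typed columns
typed_cols = {"user_id": 8, "ts": 8, "price": 8}
typed_scan = scan_bytes(rows, typed_cols, ["user_id", "ts", "price"])

# the blob makes every query pay for ~900 bytes/row it mostly does not need
```

With these illustrative numbers the flattened layout scans well over an order of magnitude fewer bytes, which is why converting the JSON to typed Parquet or Delta columns is usually the first concrete change, ahead of any cluster tuning.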


How to Prepare for Cloud Infrastructure Interviews

Practice cost estimation for real data workloads

Pick a streaming pipeline or batch ETL job and calculate the monthly AWS/GCP/Azure costs for 1TB, 10TB, and 100TB of daily processing. Include compute, storage, network egress, and managed service costs to understand how different architectural choices affect your budget.
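A sketch of that exercise in Python. The unit prices and the compute-hours-per-TB figure below are placeholders, not a real price sheet; substitute current pricing for your provider and region:

```python
def monthly_cost(tb_per_day, compute_hr_per_tb, price_per_compute_hr,
                 storage_price_per_tb_month, egress_tb_per_day, egress_price_per_tb):
    """Rough monthly bill: compute + accumulated storage + network egress."""
    days = 30
    compute = tb_per_day * compute_hr_per_tb * price_per_compute_hr * days
    # naive retention model: data written on day d is stored for the rest
    # of the month, so average stored volume is half the month's total
    storage = tb_per_day * days / 2 * storage_price_per_tb_month
    egress = egress_tb_per_day * egress_price_per_tb * days
    return round(compute + storage + egress, 2)

# placeholder unit prices; egress assumed at 5% of daily volume
costs = {
    tb: monthly_cost(tb, compute_hr_per_tb=4, price_per_compute_hr=0.50,
                     storage_price_per_tb_month=23.0,
                     egress_tb_per_day=tb * 0.05, egress_price_per_tb=90.0)
    for tb in (1, 10, 100)
}
```

Even with made-up prices, the exercise shows which term dominates at each scale, and that is the insight interviewers want: at small volumes storage is noise, while at 100TB/day egress and scan-driven compute become the levers worth optimizing.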

Build the same pipeline on two different clouds

Implement a simple batch job that reads from object storage, transforms data, and writes to a data warehouse using both AWS and GCP services. You'll discover the subtle differences in IAM models, networking concepts, and service behaviors that interviewers expect you to know.

Set up Infrastructure as Code with proper state management

Create Terraform modules for a multi-environment data platform and practice promoting changes from dev to staging to prod. Focus on how you handle secrets, cross-account access, and state drift rather than just getting the resources created.

Simulate failure scenarios in your test environment

Practice designing systems that handle regional outages, service throttling, and network partitions. Deploy a simple data pipeline and then break different components to understand how failures propagate and where you need circuit breakers, retries, and fallback logic.

Master the IAM and networking mental models

Draw diagrams showing how requests flow through VPCs, subnets, security groups, IAM roles, and service endpoints for a typical data pipeline. Understanding these foundations helps you reason about security, performance, and troubleshooting in system design questions.


Frequently Asked Questions

How deep does my cloud infrastructure knowledge need to be for a Data Engineer interview?

You should be able to explain how you would run and secure data pipelines on cloud primitives, not just name services. Expect to go deep on networking basics, IAM, storage and compute tradeoffs, and reliability concepts like autoscaling and multi-AZ design. You should also be able to justify choices with cost, performance, and failure mode reasoning.

Which companies tend to ask the most cloud infrastructure questions for Data Engineers?

Large tech companies and cloud-heavy product companies often emphasize infrastructure, especially those operating at scale or with strict reliability requirements. Cloud vendors and data platform teams also go deeper on networking, IAM, and distributed systems fundamentals. You will see more infrastructure depth in roles tied to platform, ingestion at scale, or regulated environments.

Is coding required in cloud infrastructure interviews for Data Engineers?

Often yes, but it is usually practical coding, like writing SQL, a small Python script, or infrastructure automation snippets rather than purely algorithmic puzzles. You may also be asked to interpret logs, reason about a Terraform plan, or sketch deployment steps. Practice the coding patterns most relevant to pipelines at datainterview.com/coding.

How do cloud infrastructure interviews differ between Data Engineer and other data roles?

As a Data Engineer, you are typically evaluated on how you build, operate, and scale data systems on cloud infrastructure, including storage formats, orchestration, and failure handling. Analytics Engineers usually get less networking and reliability depth and more modeling and BI stack focus. ML Engineers often get more emphasis on GPU compute, feature stores, and model serving infrastructure.

How can I prepare for cloud infrastructure interviews if I have no real-world cloud experience?

Build a small end to end pipeline in a personal cloud account, like object storage to warehouse, with an orchestrator and basic monitoring. Document your architecture decisions, IAM boundaries, and how you would handle retries, backfills, and cost controls. Use targeted question sets at datainterview.com/questions to practice explaining tradeoffs clearly.

What common mistakes should I avoid in cloud infrastructure interviews as a Data Engineer?

Do not hand-wave networking and security; you should be able to explain VPC boundaries, private connectivity, and least-privilege IAM. Avoid designing everything as always-on and overprovisioned; interviewers look for autoscaling, right-sized storage, and cost awareness. Do not ignore operational details like alerting, runbooks, data quality checks, and disaster recovery assumptions.


Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn