Cloud infrastructure questions dominate data engineering interviews at FAANG companies and modern data platforms like Snowflake, Databricks, and Netflix. These companies expect you to architect end-to-end data systems that handle petabytes of data, serve millions of users, and scale across multiple regions. Unlike traditional software engineering interviews, data infrastructure questions test your ability to choose the right managed services, design for failure scenarios, and optimize for both cost and performance at massive scale.
What makes cloud infrastructure interviews particularly challenging is that interviewers expect you to reason about trade-offs across multiple dimensions simultaneously. For example, you might need to design a real-time feature store that handles 500k requests per second while staying under a $10k monthly budget, ensuring sub-50ms p99 latency, and maintaining exactly-once delivery guarantees during regional outages. The wrong choice of compute instance, storage tier, or networking configuration can blow your budget by 10x or introduce subtle data consistency bugs that only surface under load.
Here are the top 30 cloud infrastructure questions organized by the six areas that matter most in data engineering interviews.
Cloud Fundamentals and Core Building Blocks
Interviewers use cloud fundamentals questions to assess whether you understand how compute, storage, networking, and DNS work together in real data systems. Most candidates fail because they memorize service names without understanding the underlying resource limits, failure modes, and performance characteristics that drive architectural decisions.
The key insight interviewers look for is your ability to trace data flow through multiple system boundaries and predict where bottlenecks will emerge. When you design an event ingestion pipeline, you need to reason about object storage request limits, network bandwidth between regions, DNS failover timing, and compute autoscaling delays as an interconnected system, not isolated components.
Start by proving you can map a data platform to compute, storage, networking, and DNS choices. You struggle here when you describe services but cannot explain tradeoffs, failure modes, or how traffic and data actually flow.
You are designing an event ingestion path for clickstream data: producers in multiple regions send to a regional endpoint, data lands in object storage, then a batch job processes it into analytics tables. Walk me through the compute, storage, networking, and DNS pieces you would choose, and explain how traffic flows end to end during a regional outage.
Sample Answer
Most candidates default to a single global endpoint and a single bucket, but that fails here because a region failure turns into total ingestion loss and DNS cutovers are not instant. You should front producers with regional DNS names and health checks, terminate TLS at a regional load balancer, then write to regional durable storage with replication to a second region using async copy or cross-region replication. During a regional outage, DNS or client-side region selection shifts new traffic to the healthy region, while backfill replays from the message log or retry buffer so you do not lose events. You must call out failure modes: partial writes, duplicate events, and eventual consistency, then show how you handle them with idempotent keys and replayable logs.
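A minimal sketch of the idempotent-replay idea, assuming a hypothetical append-only landing sink and an in-memory set standing in for a real dedup store:

```python
import hashlib
import json

def idempotency_key(event: dict) -> str:
    # Derive a stable key from fields the producer controls, so the same event
    # replayed from a retry buffer or a cross-region backfill maps to the same key.
    raw = f"{event['producer_id']}:{event['session_id']}:{event['event_ts']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def ingest(event: dict, seen_keys: set, sink: list) -> bool:
    key = idempotency_key(event)
    if key in seen_keys:                 # duplicate from replay or retry, drop it
        return False
    sink.append(json.dumps(event))       # replayable, append-only landing zone
    seen_keys.add(key)
    return True

# Usage: replaying the same batch twice still writes each event only once.
events = [{"producer_id": "p1", "session_id": "s1", "event_ts": 1700000000, "page": "/home"}]
sink, seen = [], set()
for e in events + events:                # simulate a replay after failover
    ingest(e, seen, sink)
assert len(sink) == 1
```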
A Spark job reading 20 TB from object storage is slow and spiky, with frequent timeouts and throttling errors. How would you decide whether the bottleneck is compute, storage request limits, or networking, and what two changes would you make first?
You need to expose an internal metadata service used by Airflow and Spark across three environments: dev, staging, prod, each in separate networks. Describe how you would use DNS, routing, and security controls so jobs resolve the same name in each environment, and explain what breaks when DNS caching is misconfigured.
You are asked to store sensitive PII for analytics with a 30-day retention policy, while still allowing efficient joins with non-sensitive data. What storage building blocks do you choose, how do you manage encryption keys, and what are the failure modes during key rotation or a region restore?
A partner team wants to send daily files into your data lake over the public internet. Describe a secure ingestion design using networking and DNS primitives, and compare it to a private connectivity option, including cost, operational overhead, and blast radius.
AWS vs GCP vs Azure Service Mapping for Data Engineering
Service mapping questions reveal whether you understand the functional differences between AWS, GCP, and Azure services beyond surface-level feature comparisons. Candidates often stumble because they assume services with similar names have identical capabilities, missing critical differences in consistency models, scaling limits, or integration patterns that affect data pipeline design.
The trap most candidates fall into is recommending direct service swaps without considering the ecosystem effects. For instance, migrating from AWS Kinesis to GCP Pub/Sub isn't just about message throughput; it's about how authentication, monitoring, dead-letter queues, and exactly-once semantics change across the entire pipeline stack.
In interviews, you are often asked to translate an architecture across clouds and justify equivalent services. Candidates stumble when they rely on brand names instead of capabilities like object storage semantics, managed Spark options, and warehouse primitives.
Your current AWS batch pipeline lands raw events in S3, runs Spark ETL on EMR, and loads curated tables into Redshift. You are asked to propose the closest GCP and Azure equivalents, and call out one capability mismatch you would validate in a proof of concept.
Sample Answer
Map S3 plus EMR plus Redshift to GCS plus Dataproc plus BigQuery on GCP, and to ADLS Gen2 plus Azure Databricks or HDInsight plus Synapse Dedicated SQL Pool on Azure. The key is to map capabilities, not names: object storage for raw, managed Spark for ETL, and an MPP warehouse for serving. Validate IAM and data access semantics early, for example cross-account or cross-project access patterns, and how Spark read and write commit protocols behave against object storage. Also validate warehouse loading patterns and concurrency behavior, for example COPY versus external tables, and how partitioning and clustering choices translate.
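One way to keep the mapping capability-first rather than brand-first is to write it down explicitly; the picks below are common defaults matching the answer above, not the only valid choices:

```python
# Capability-first service mapping; adjust per workload and customer constraints.
SERVICE_MAP = {
    "object storage (raw zone)": {"aws": "S3", "gcp": "GCS", "azure": "ADLS Gen2"},
    "managed Spark (ETL)":       {"aws": "EMR", "gcp": "Dataproc", "azure": "Azure Databricks / HDInsight"},
    "MPP warehouse (serving)":   {"aws": "Redshift", "gcp": "BigQuery", "azure": "Synapse Dedicated SQL Pool"},
}

def equivalent(capability: str, target_cloud: str) -> str:
    return SERVICE_MAP[capability][target_cloud]

print(equivalent("managed Spark (ETL)", "gcp"))  # Dataproc
```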
You need a cloud-agnostic lakehouse that supports ACID tables, time travel, and incremental upserts for CDC, but you may need to run on AWS, GCP, or Azure depending on the customer. Which table format and managed services do you map to each cloud, and why?
In AWS you use Kinesis Data Streams into a streaming Spark job, then write to S3 and query via Athena. Translate this to GCP and Azure, and explain what you would change if you need exactly-once delivery into the analytical tables.
You are moving a data warehouse workload from BigQuery to AWS and Azure. The workload relies on BigQuery features like separation of storage and compute, slot-based concurrency, and frequent ad hoc queries on partitioned tables. What AWS and Azure services do you propose, and what behavioral differences do you warn the team about?
Your team uses AWS Glue for metadata crawling, schema discovery, and a central catalog used by Spark and SQL engines. Translate the catalog and governance layer to GCP and Azure, and explain how you would keep consistent table definitions across clouds for a multi-cloud product.
Serverless and Managed Services for Data Pipelines
Serverless and managed service questions test your judgment about when to give up control for operational simplicity. Engineering teams increasingly prefer managed services to reduce operational overhead, but interviewers want to see that you understand the constraints and design patterns that make serverless architectures successful at scale.
The critical insight is recognizing that serverless services introduce new failure modes around cold starts, concurrency limits, and retry behavior that don't exist in traditional infrastructure. You need to design your data pipeline to work with these constraints, not against them, which often means rethinking your approach to state management, error handling, and backpressure control.
You will need to defend when to choose managed over self-hosted for ingestion, orchestration, and streaming. Many candidates miss operational constraints like cold starts, concurrency limits, retries, and how managed runtimes impact observability.
You are ingesting clickstream events into a lake, and traffic spikes from 5k to 200k events per second during live sports. Would you choose a managed streaming service with serverless consumers or self-hosted Kafka plus long-running consumers, and how do you handle cold starts, concurrency limits, and backpressure?
Sample Answer
You could go with managed streaming plus serverless consumers, or with self-hosted Kafka plus long-running consumers. Managed wins here because it offloads broker operations and scaling, and you can absorb spikes with autoscaling and partition-based parallelism while keeping SLOs. The tradeoff is cold starts and concurrency caps, so you design idempotent consumers, pre-warm critical functions, and use batching with a bounded retry policy and a DLQ. You also cap per-key throughput, implement backpressure via a consumer max-in-flight limit, and monitor lag to trigger scaling before you hit throttling.
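A minimal sketch of the consumer-side controls described here; the record shape, process_record(), and the in-memory DLQ list are placeholders, not a specific managed service API:

```python
import time

MAX_ATTEMPTS = 3
BASE_BACKOFF_S = 0.5

class TransientError(Exception):
    """Retryable failure such as throttling or a timeout."""

def process_record(record: dict) -> None:
    # Placeholder transform / write; raise TransientError on retryable errors.
    pass

def handle_batch(records: list, processed_ids: set, dlq: list) -> None:
    for record in records:
        if record["id"] in processed_ids:      # idempotency: skip replayed records
            continue
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                process_record(record)
                processed_ids.add(record["id"])
                break
            except TransientError:
                if attempt == MAX_ATTEMPTS:
                    dlq.append(record)          # bounded retries, then dead-letter
                else:
                    time.sleep(BASE_BACKOFF_S * 2 ** (attempt - 1))
```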
Your orchestration is currently Airflow on Kubernetes, and it takes a full-time engineer to keep it stable. The team wants to move to a managed workflow service. What is your step-by-step decision process, and what changes in retries, secret management, and observability do you plan for?
A serverless ingestion job writes to object storage and occasionally creates duplicate files and inconsistent partition folders when retried. How do you redesign it to be effectively exactly-once, and what managed service constraints do you explicitly account for?
Your Databricks streaming job currently runs on an always-on cluster to avoid latency. Leadership wants to move it to a more managed serverless runtime to cut cost. What pitfalls do you expect around autoscaling behavior, state store management, and end-to-end observability, and how do you mitigate them?
You need to ingest 10 TB per day from hundreds of SaaS APIs with strict rate limits and frequent schema drift. Would you pick a managed connector service or build your own workers, and what is your plan for quotas, incremental sync correctness, and debugging production failures?
Infrastructure as Code and Environment Promotion
Infrastructure as Code questions evaluate your ability to manage complex data platform deployments across multiple environments without breaking production systems. Most candidates struggle here because they focus on the Terraform syntax rather than the organizational processes, state management, and promotion workflows that prevent outages during infrastructure changes.
What separates senior candidates is understanding how to structure IaC for team collaboration and environment promotion while handling secrets, cross-account permissions, and state drift safely. You need to show that you can detect when manual changes have created drift, decide whether to import or revert those changes, and design variable structures that make environment promotion predictable and auditable.
Expect to be tested on how you make infrastructure reproducible across dev, staging, and prod using Terraform, CloudFormation, ARM, or similar. You get tripped up when you cannot explain state, drift detection, secret handling, or safe rollout patterns.
You have Terraform managing a data platform stack across dev, staging, and prod. A manual console change in prod fixed an outage, and now the next terraform apply wants to revert it. How do you detect the drift and decide whether to import, update the code, or roll back?
Sample Answer
Reason through it: First you confirm drift by running terraform plan against the prod workspace with the same backend state, so you see exactly which attributes differ. Next you classify the console change as either an emergency hotfix you want to keep, or a mistake you want to undo, and you validate impact with logs and recent incidents. If you want to keep it, you update the Terraform code to match the live config, then apply so state and reality converge, and only then promote the same change through staging and dev if appropriate. If the change was wrong, you let apply revert it, but you do it with a targeted rollout plan, smaller blast radius, and a clear rollback path.
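As a rough sketch of the first step, you can script the drift check around Terraform's documented plan exit codes (0 means no changes, 1 means error, 2 means the plan contains changes); the directory layout here is an assumption:

```python
import subprocess

def detect_drift(workdir: str = "envs/prod") -> str:
    # Read-only plan against the prod configuration and its backend state.
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return "no drift"
    if result.returncode == 2:
        # Inspect result.stdout to classify the diff, then either update the
        # code to match reality (keep the hotfix) or apply to revert it.
        return "drift detected"
    raise RuntimeError(result.stderr)
```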
You need to promote the same IaC change from dev to staging to prod, but each environment uses different VPC IDs, KMS keys, and service endpoints. How do you structure modules, variables, and state so promotions are reproducible and least-privilege, without copy-pasting stacks?
Your pipeline deploys Terraform for a Databricks workspace and its secrets, plus cloud resources like buckets and IAM. How do you handle secrets so they are not in Git, not in Terraform state, and still usable during promotion to staging and prod?
You are using CloudFormation to deploy a change to an S3 bucket policy and an IAM role used by production ETL jobs. Describe a safe rollout pattern that avoids breaking running jobs, and explain how you would validate and roll back if needed.
Your Terraform remote state is in an S3 bucket with DynamoDB locking, and two teams share the same repo but different environments. A failed apply left the lock in place and the state might be partially updated. What exact steps do you take to recover safely, and how do you prevent this class of failure in the first place?
Security, IAM, and Data Governance Controls
Security and governance questions assess whether you can implement fine-grained access controls and audit trails for data systems handling sensitive information. Candidates typically fail by proposing overly permissive IAM policies or missing critical security boundaries between data zones, environments, or user roles.
The challenge interviewers focus on is designing least-privilege access that actually works in practice when data engineers need to deploy code, debug production issues, and maintain cross-account data flows. You must balance security isolation with operational efficiency, often using techniques like assume-role patterns, resource-based policies, and break-glass access procedures that provide audit trails without blocking legitimate work.
Security questions evaluate whether you can design least-privilege access for humans, services, and pipelines. You often struggle if you hand-wave policies, key management, network isolation, and cross-account or cross-project access patterns.
You have an S3 data lake with raw and curated zones. A new Databricks job in a different AWS account must read only one curated prefix and write to a separate output prefix. How do you design IAM, KMS, and bucket policies to enforce least privilege and prevent lateral access to other prefixes?
Sample Answer
This question is checking whether you can translate least privilege into concrete resource policies, key policies, and cross account assumptions. You set up an IAM role in the data lake account that trusts the Databricks account role, then you lock permissions to exact ARNs like s3:GetObject on curated/prefix/* and s3:PutObject on output/prefix/*. You add a bucket policy that allows only that role, enforces TLS, and denies access outside approved prefixes using explicit Deny with s3:prefix and object ARNs. For encryption, you use SSE-KMS with a CMK whose key policy and grants allow decrypt for reads and encrypt for writes only for that role, plus restrict via kms:EncryptionContext conditions tied to the bucket and prefixes.
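A minimal sketch of the permission policy attached to that lake-account role, written as a Python dict; the bucket name, prefixes, account ID, and key ARN are placeholders, and the trust policy, bucket policy, and KMS key policy are configured separately:

```python
import json

LAKE_BUCKET = "arn:aws:s3:::example-data-lake"
KMS_KEY = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read only the approved curated prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"{LAKE_BUCKET}/curated/team_a/*",
        },
        {   # write only the approved output prefix
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"{LAKE_BUCKET}/output/team_a/*",
        },
        {   # listing is scoped with a prefix condition so other zones stay hidden
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": LAKE_BUCKET,
            "Condition": {"StringLike": {"s3:prefix": ["curated/team_a/*", "output/team_a/*"]}},
        },
        {   # SSE-KMS: decrypt for reads, generate data keys for writes, this CMK only
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": KMS_KEY,
        },
    ],
}
print(json.dumps(policy, indent=2))
```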
A BigQuery dataset contains PII and needs column level access for analysts, while a scheduled pipeline service account must read all columns and write derived tables. How do you set up IAM, authorized views or policy tags, and audit to keep analysts from seeing raw PII while the pipeline still works?
Your company uses Snowflake with separate dev, staging, and prod environments. A data engineer accidentally queried production PII from a dev notebook last quarter. What governance controls do you put in place to prevent environment hopping while still allowing CI/CD and break-glass access?
You are building a cross-subscription data pipeline in Azure where ADF in Subscription A writes to ADLS Gen2 in Subscription B, both using private endpoints. How do you set up Managed Identities, RBAC, ACLs, and network rules so the pipeline works without opening public network access?
A Spark job in Databricks reads from an object store and publishes aggregates to a warehouse. The security team requires short-lived credentials, rotation, and provable non-use of long-lived access keys in notebooks. What is your end-to-end approach for identity, secrets distribution, and auditability?
Cost, Performance, and Reliability Optimization at Scale
Cost and performance optimization questions test your ability to diagnose resource bottlenecks and implement changes that improve both efficiency and user experience. Senior data engineering roles require you to manage infrastructure budgets measured in hundreds of thousands of dollars per month, so interviewers expect you to think like a business owner, not just a technologist.
What makes these questions difficult is that cost optimization often conflicts with performance requirements, and the optimal solution depends on usage patterns, SLA requirements, and business priorities that aren't immediately obvious. You need to demonstrate a systematic approach to identifying the highest-impact optimizations, measuring their effectiveness, and making trade-offs between competing objectives like query latency, data freshness, and infrastructure costs.
At senior levels, you must show you can hit SLOs while controlling spend and protecting against outages. You may stumble when you cannot quantify drivers like egress, scan costs, autoscaling behavior, multi-region strategy, and workload isolation.
The cost of your daily Spark job on S3 jumped from $800/day to $2,500/day after a schema change added a nested JSON column. You are still meeting the 2-hour SLO. What do you investigate first, and what concrete changes do you make to cut cost without breaking the SLO?
Sample Answer
The standard move is to reduce bytes scanned: partition on the most selective predicates, use columnar formats, and enforce projection so you only read the columns you need. But here the nested JSON matters because it can defeat predicate pushdown and inflate IO even if you do not select it, so you should verify the physical plan, file sizes, and whether the new column forced wider rows or less effective compression. Convert the JSON to typed columns, store the data as Parquet or Delta with statistics, and consider Z-ordering or clustering on the common filters to cut scanned bytes. Then right-size the cluster for the new IO pattern, because overprovisioned executors can hide inefficiency while still meeting the SLO.
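A minimal PySpark sketch of the flattening step; the paths, nested field names, and partition key are illustrative, not tied to a specific schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-clickstream").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/clickstream/")

# Pull only the nested fields you actually need into flat, typed columns so the
# downstream job benefits from column pruning and predicate pushdown.
curated = raw.select(
    F.col("event_id"),
    F.col("event_ts").cast("timestamp").alias("event_ts"),
    F.col("payload.page").alias("page"),
    F.col("payload.user.country").alias("country"),
    F.to_date(F.col("event_ts").cast("timestamp")).alias("event_date"),
)

(
    curated.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/clickstream/")
)
```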
A data product serves features to an online model and must hit p99 < 50 ms with 99.9% availability, but cross-region replication and egress are driving costs up. How do you choose between active-active, active-passive, or single region plus backup, and how do you quantify the cost tradeoffs?
Your warehouse query bill is dominated by a few dashboards that scan tens of TB daily, and users complain about sporadic slowness at peak hours. What do you change first to improve both cost and tail latency, and how do you verify it worked?
You ingest 50 TB/day into a lake, and a new downstream consumer in another region needs 10 TB/day of that data. How do you design the data movement to minimize egress while meeting a 30 minute freshness SLO, and what metrics do you watch?
A streaming pipeline uses autoscaling, but during traffic spikes it oscillates, misses a 5-minute end-to-end lag SLO, and your compute spend doubles. What changes do you make to stabilize scaling behavior and protect reliability while controlling cost?
How to Prepare for Cloud Infrastructure Interviews
Practice cost estimation for real data workloads
Pick a streaming pipeline or batch ETL job and calculate the monthly AWS/GCP/Azure costs for 1TB, 10TB, and 100TB of daily processing. Include compute, storage, network egress, and managed service costs to understand how different architectural choices affect your budget.
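A back-of-envelope sketch of how such an estimate can be structured; every unit price below is a placeholder to replace with current list prices for your cloud and region:

```python
# Rough monthly cost model for a daily batch pipeline (structure, not real prices).
DAILY_TB = 10                       # daily volume processed
COMPUTE_HOURS_PER_TB = 2            # observed or benchmarked on a sample run
PRICE_PER_COMPUTE_HOUR = 0.50       # $/instance-hour (placeholder)
STORAGE_PRICE_PER_TB_MONTH = 23.0   # $/TB-month (placeholder)
EGRESS_TB_PER_DAY = 1
EGRESS_PRICE_PER_TB = 90.0          # $/TB (placeholder)
DAYS = 30

compute = DAILY_TB * COMPUTE_HOURS_PER_TB * PRICE_PER_COMPUTE_HOUR * DAYS
storage = DAILY_TB * DAYS * STORAGE_PRICE_PER_TB_MONTH  # upper bound: retain everything all month
egress = EGRESS_TB_PER_DAY * EGRESS_PRICE_PER_TB * DAYS

print(f"compute ${compute:,.0f}  storage ${storage:,.0f}  egress ${egress:,.0f}")
print(f"total   ${compute + storage + egress:,.0f} per month")
```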
Build the same pipeline on two different clouds
Implement a simple batch job that reads from object storage, transforms data, and writes to a data warehouse using both AWS and GCP services. You'll discover the subtle differences in IAM models, networking concepts, and service behaviors that interviewers expect you to know.
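For example, the same object read looks like this with boto3 and google-cloud-storage; bucket and object names are placeholders, and credentials are assumed to come from the environment:

```python
import boto3
from google.cloud import storage

def read_from_s3(bucket: str, key: str) -> bytes:
    # AWS: credentials resolved from the environment or an attached IAM role.
    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

def read_from_gcs(bucket: str, key: str) -> bytes:
    # GCP: credentials resolved from application default credentials.
    client = storage.Client()
    return client.bucket(bucket).blob(key).download_as_bytes()

# Same logical operation, but note what differs around it: auth model (IAM role
# vs service account), request pricing, consistency guarantees, and error types.
```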
Set up Infrastructure as Code with proper state management
Create Terraform modules for a multi-environment data platform and practice promoting changes from dev to staging to prod. Focus on how you handle secrets, cross-account access, and state drift rather than just getting the resources created.
Simulate failure scenarios in your test environment
Practice designing systems that handle regional outages, service throttling, and network partitions. Deploy a simple data pipeline and then break different components to understand how failures propagate and where you need circuit breakers, retries, and fallback logic.
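A toy sketch of the retry-plus-circuit-breaker behavior worth rehearsing, with a simulated flaky dependency standing in for a throttled service:

```python
import random
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after_s:
            self.opened_at, self.failures = None, 0   # half-open: allow a probe
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()          # open the breaker

def call_flaky_dependency(failure_rate: float = 0.3) -> str:
    if random.random() < failure_rate:
        raise TimeoutError("simulated throttling or network partition")
    return "ok"

breaker = CircuitBreaker()
for attempt in range(10):
    if not breaker.allow():
        print("breaker open, skipping call")
        continue
    try:
        print(call_flaky_dependency())
        breaker.record(True)
    except TimeoutError as exc:
        breaker.record(False)
        time.sleep(min(2 ** attempt * 0.1, 2.0))      # capped exponential backoff
        print(f"retrying after failure: {exc}")
```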
Master the IAM and networking mental models
Draw diagrams showing how requests flow through VPCs, subnets, security groups, IAM roles, and service endpoints for a typical data pipeline. Understanding these foundations helps you reason about security, performance, and troubleshooting in system design questions.
Frequently Asked Questions
How deep does my cloud infrastructure knowledge need to be for a Data Engineer interview?
You should be able to explain how you would run and secure data pipelines on cloud primitives, not just name services. Expect to go deep on networking basics, IAM, storage and compute tradeoffs, and reliability concepts like autoscaling and multi-AZ design. You should also be able to justify choices with cost, performance, and failure mode reasoning.
Which companies tend to ask the most cloud infrastructure questions for Data Engineers?
Large tech companies and cloud-heavy product companies often emphasize infrastructure, especially those operating at scale or with strict reliability requirements. Cloud vendors and data platform teams also go deeper on networking, IAM, and distributed systems fundamentals. You will see more infrastructure depth in roles tied to platform, ingestion at scale, or regulated environments.
Is coding required in cloud infrastructure interviews for Data Engineers?
Often yes, but it is usually practical coding, like writing SQL, a small Python script, or infrastructure automation snippets rather than purely algorithmic puzzles. You may also be asked to interpret logs, reason about a Terraform plan, or sketch deployment steps. Practice the coding patterns most relevant to pipelines at datainterview.com/coding.
How do cloud infrastructure interviews differ between Data Engineer and other data roles?
As a Data Engineer, you are typically evaluated on how you build, operate, and scale data systems on cloud infrastructure, including storage formats, orchestration, and failure handling. Analytics Engineers usually get less networking and reliability depth and more modeling and BI stack focus. ML Engineers often get more emphasis on GPU compute, feature stores, and model serving infrastructure.
How can I prepare for cloud infrastructure interviews if I have no real-world cloud experience?
Build a small end to end pipeline in a personal cloud account, like object storage to warehouse, with an orchestrator and basic monitoring. Document your architecture decisions, IAM boundaries, and how you would handle retries, backfills, and cost controls. Use targeted question sets at datainterview.com/questions to practice explaining tradeoffs clearly.
What common mistakes should I avoid in cloud infrastructure interviews as a Data Engineer?
Do not hand-wave networking and security; you should be able to explain VPC boundaries, private connectivity, and least-privilege IAM. Avoid designing everything as always-on and overprovisioned; interviewers look for autoscaling, right-sized storage, and cost awareness. Do not ignore operational details like alerting, runbooks, data quality checks, and disaster recovery assumptions.
