Goldman Sachs Data Engineer at a Glance
Interview Rounds
6 rounds
When a batch job feeding Goldman's SecDB risk system fails at 6:47 AM, someone on the data engineering team gets a call from a managing director, not a Jira notification. From what candidates tell us, that urgency is the single biggest culture shock for people coming from pure tech companies. GS also happens to maintain Legend, an open-source data management platform, which signals an engineering ambition most people don't associate with a bank.
Goldman Sachs Data Engineer Role
Skill Profile
- Math & Stats: Medium (insufficient source detail)
- Software Eng: Medium (insufficient source detail)
- Data & SQL: Medium (insufficient source detail)
- Machine Learning: Medium (insufficient source detail)
- Applied AI: Medium (insufficient source detail)
- Infra & Cloud: Medium (insufficient source detail)
- Business: Medium (insufficient source detail)
- Viz & Comms: Medium (insufficient source detail)
You're building and maintaining the pipelines that feed trading risk models, portfolio analytics, client reporting, and regulatory compliance inside divisions like Asset & Wealth Management. The day-to-day means owning Airflow DAGs end to end, shipping ingestion pipelines for new data sources into the Legend platform, and working directly with quant teams who depend on your output before markets open.
A Typical Week
A Week in the Life of a Goldman Sachs Data Engineer
Typical L5 workweek · Goldman Sachs
Culture notes
- Goldman expects consistent in-office presence five days a week at 200 West Street, and the pace is intense — data pipelines feed risk and PnL systems that senior leadership monitors daily, so reliability is non-negotiable.
- Hours typically run 8:30 AM to 6:30 PM, with occasional late nights around quarter-end reporting or major platform migrations. The engineering culture has modernized significantly with open-source contributions like Legend, but it retains Goldman's signature rigor around code quality and documentation.
The infrastructure slice is what catches people off guard. You might picture yourself writing Spark transformations all day, but a meaningful chunk of each week goes to debugging failed DAGs, fixing partition skew on cluster jobs, and cleaning up deprecated pipelines nobody wants to touch. The writing load is also real: Goldman's documentation culture means you're producing design docs, runbooks, and on-call handoff notes regularly. If you hate prose, that'll be a friction point.
Projects & Impact Areas
Pipeline work for Asset & Wealth Management sits at the center, where you might build the orchestration layer to land a new satellite imagery dataset in S3, validate it with Great Expectations, and load it into Legend for the quant strategies group. Regulatory and compliance flows (SEC reporting, audit trails, risk aggregation) run alongside that work and demand bulletproof lineage. Less glamorous, arguably higher stakes, since a data quality gap in a compliance pipeline has consequences that go well beyond a broken dashboard.
Skills & What's Expected
SQL and Python are expected, but comfort with the JVM ecosystem is the skill most candidates underestimate. Goldman maintains significant Scala and Java tooling internally, and engineers who arrive knowing only Python often face a steeper ramp. The structured skill profile shows medium demand across every dimension, which in practice means pipeline reliability and data quality are where you earn trust, while deep ML theory rarely comes up in the actual work.
Levels & Career Growth
Where people get stuck is the Associate-to-VP transition. Earlier promotions tend to follow a predictable timeline if you're performing, but VP requires visible architecture ownership or cross-team platform leadership, not just solid execution on assigned tickets. Lateral moves into quant engineering, data science, or platform infrastructure are common and genuinely supported, which makes the banking-style hierarchy more flexible than it looks on paper.
Work Culture
Goldman expects consistent five-day in-office presence (at 200 West Street for the NYC team), with hours that from candidate reports run roughly 8:30 AM to 6:30 PM and stretch later around quarter-end reporting. The engineering culture has modernized (Legend, modern tooling, open-source contributions), but GS's signature rigor around code quality and documentation remains. If your pipeline breaks, you own the fix and the postmortem, and that accountability is both the hardest and most growth-accelerating part of the job.
Goldman Sachs Data Engineer Compensation
The provided compensation data for this role is too sparse to make specific claims about bonus percentages, vesting schedules, or benefits. If you're evaluating a GS offer, compare it against the full picture (base, bonus, deferred comp, and benefits) rather than fixating on base salary alone, since Wall Street comp structures differ fundamentally from tech industry packages built around RSU grants.
Without verified level-by-level data, the most Goldman-specific negotiation advice is this: research the division you're joining. Asset & Wealth Management, Global Banking & Markets, and Platform Solutions each have different revenue profiles that influence discretionary bonus pools, so the same title at two different desks can yield meaningfully different total comp over time.
Goldman Sachs Data Engineer Interview Process
6 rounds · ~5 weeks end to end
Initial Screen
2 rounds
Recruiter Screen
An initial phone call with a recruiter to discuss your background, interest in the role, and confirm basic qualifications. Expect questions about your experience, compensation expectations, and timeline.
Tips for this round
- Prepare a crisp 60–90 second walkthrough of your last data pipeline: sources → ingestion → transform → storage → consumption, including scale (rows/day, latency, SLA).
- Be ready to name specific tools you’ve used (e.g., Spark, Airflow, Kafka, Redshift/BigQuery, Delta/Iceberg) and what you personally owned.
- Clarify your stakeholder-facing experience: managing requirements that start out ambiguous and communicating tradeoffs clearly.
- Ask which group or division you’re interviewing for, because expectations and rounds can differ.
Hiring Manager Screen
A deeper conversation with the hiring manager focused on your past projects, problem-solving approach, and team fit. You'll walk through your most impactful work and explain how you think about data problems.
Technical Assessment
2 rounds
SQL & Data Modeling
A hands-on round where you write SQL queries and discuss data modeling approaches. Expect window functions, CTEs, joins, and questions about how you'd structure tables for analytics.
Tips for this round
- Be fluent with window functions (ROW_NUMBER, LAG/LEAD, SUM OVER PARTITION) and explain why you choose them over self-joins.
- Talk through performance: indexes/cluster keys, partition pruning, predicate pushdown, and avoiding unnecessary shuffles in distributed SQL engines.
- For modeling, structure answers around grain, keys, slowly changing dimensions (Type 1/2), and how facts relate to dimensions.
- Show data quality thinking: constraints, dedupe logic, reconciliation checks, and how you’d detect schema drift.
System Design
You'll be given a high-level problem and asked to design a scalable, fault-tolerant data system from scratch. This round assesses your ability to think about data architecture, storage, processing, and infrastructure choices.
Onsite
2 rounds
Behavioral
Assesses collaboration, leadership, conflict resolution, and how you handle ambiguity. Interviewers look for structured answers (STAR format) with concrete examples and measurable outcomes.
Tips for this round
- Use STAR with measurable outcomes (e.g., reduced pipeline cost 30%, improved SLA from 6h to 1h) and be explicit about your role vs the team’s.
- Prepare 2–3 stories about handling ambiguity with stakeholders: clarifying requirements, documenting assumptions, and aligning on acceptance criteria.
- Demonstrate consulting-style communication: summarize, propose options, call out risks, and confirm next steps.
- Have an example of a production incident you owned: root cause, mitigation, and long-term prevention (postmortem actions).
Case Study
This is the company's version of a practical problem-solving exercise, where you'll likely be given a business scenario related to data. You'll need to analyze the problem, propose a data-driven solution, and articulate your reasoning and potential impact.
The timeline from first recruiter call to offer varies, but candidates on forums and in interview-sharing communities frequently describe loops in the range of four to eight weeks. Where the process stalls is often after the final round, when you're waiting on a decision with little communication. That silence can feel like a rejection when it's really just bureaucracy.
The non-obvious thing worth knowing: behavioral performance at Goldman carries weight that surprises candidates coming from pure tech backgrounds. GS interviewers across divisions reportedly probe for composure under operational pressure, think "walk me through a time a critical deliverable broke at the worst possible moment." A polished system design answer paired with vague or rehearsed-sounding behavioral responses is, from what candidates describe, a more common failure mode than struggling with a SQL window function.
Goldman Sachs Data Engineer Interview Questions
Data Pipelines & Engineering
Expect questions that force you to design reliable batch/streaming flows for training and online features (e.g., Kafka/Flink + Airflow/Dagster). You’ll be evaluated on backfills, late data, idempotency, SLAs, lineage, and operational failure modes.
What is the difference between a batch pipeline and a streaming pipeline, and when would you choose each?
Sample Answer
Batch pipelines process data in scheduled chunks (e.g., hourly, daily ETL jobs). Streaming pipelines process data continuously as it arrives (e.g., Kafka + Flink). Choose batch when: latency tolerance is hours or days (daily reports, model retraining), data volumes are large but infrequent, and simplicity matters. Choose streaming when you need real-time or near-real-time results (fraud detection, live dashboards, recommendation updates). Most companies use both: streaming for time-sensitive operations and batch for heavy analytical workloads, model training, and historical backfills.
You ingest Kafka events for booking state changes (created, confirmed, canceled) into a Hive table, then daily compute confirmed_nights per listing for search ranking. How do you make the Spark job idempotent under retries and late-arriving cancels without double counting?
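One hedged way to show the core idea in Spark SQL terms (a full answer would also cover the Kafka ingest semantics and backfill windows): take the latest state per booking so late cancels flip the status, then deterministically rebuild and overwrite the day's partition, so a retry produces the same rows. Table names such as booking_state_events and fact_confirmed_nights, and the ${run_date} placeholder, are illustrative assumptions.
-- Sketch only: rebuild one partition from deduplicated booking state.
WITH latest_state AS (
  SELECT
    booking_id,
    listing_id,
    status,
    nights,
    ROW_NUMBER() OVER (
      PARTITION BY booking_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM booking_state_events
  WHERE event_ds <= '${run_date}'   -- include late-arriving cancels for earlier days
)
INSERT OVERWRITE TABLE fact_confirmed_nights PARTITION (ds = '${run_date}')
SELECT
  listing_id,
  SUM(CASE WHEN status = 'confirmed' THEN nights ELSE 0 END) AS confirmed_nights
FROM latest_state
WHERE rn = 1
GROUP BY listing_id;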
You need a pipeline that produces a near real-time host payout ledger: streaming updates every minute, but also a daily audited snapshot that exactly matches finance when late adjustments arrive up to 30 days. Design the batch plus streaming architecture, including how you handle schema evolution and backfills without breaking downstream tables.
System Design
Most candidates underestimate how much your design must balance latency, consistency, and cost at enterprise scale. You’ll be evaluated on clear component boundaries, failure modes, and how you’d monitor and evolve the system over time.
Design a dataset registry for LLM training and evaluation that lets you reproduce any run months later, including the exact prompt template, filtering rules, and source snapshots. What metadata and storage layout do you require, and which failure modes does it prevent?
Sample Answer
Use an immutable, content-addressed dataset registry that writes every dataset as a manifest of exact source pointers, transforms, and hashes, plus a separate human-readable release record. Store raw sources append-only, store derived datasets as partitioned files keyed by dataset_id and version, and capture code commit SHA, config, and schema in the manifest so reruns cannot drift. This prevents silent data changes, schema drift, and accidental reuse of a similarly named dataset, which is where most people fail.
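To make the manifest idea concrete, here is a minimal relational sketch of the registry; the table and column names are illustrative rather than a prescribed schema.
-- Illustrative DDL for an immutable dataset registry.
CREATE TABLE dataset_version (
  dataset_id       VARCHAR   NOT NULL,
  version          VARCHAR   NOT NULL,   -- content hash of the full manifest
  created_at       TIMESTAMP NOT NULL,
  code_commit_sha  VARCHAR   NOT NULL,
  config_json      VARCHAR   NOT NULL,   -- prompt template, filtering rules, parameters
  schema_json      VARCHAR   NOT NULL,
  PRIMARY KEY (dataset_id, version)
);

CREATE TABLE dataset_source (
  dataset_id   VARCHAR NOT NULL,
  version      VARCHAR NOT NULL,
  source_uri   VARCHAR NOT NULL,         -- exact snapshot path in append-only raw storage
  source_hash  VARCHAR NOT NULL,         -- detects silent changes to inputs
  PRIMARY KEY (dataset_id, version, source_uri)
);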
A company wants a unified fact table for Marketplace Orders (bookings, cancellations, refunds, chargebacks) that supports finance reporting and ML features, while source systems emit out-of-order updates and occasional duplicates. Design the data model and pipeline, including how you handle upserts, immutable history, backfills, and data quality gates at petabyte scale.
SQL & Data Manipulation
Your SQL will get stress-tested on joins, window functions, deduping, and incremental logic that mirrors real ETL/ELT work. Common pitfalls include incorrect grain, accidental fan-outs, and filtering at the wrong stage.
Airflow runs a daily ETL that builds fact_host_daily(host_id, ds, active_listings, booked_nights). Source tables are listings(listing_id, host_id, created_at, deactivated_at) and bookings(booking_id, listing_id, check_in, check_out, status, created_at, updated_at). Write an incremental SQL for ds = :run_date that counts active_listings at end of day and booked_nights for stays overlapping ds, handling late-arriving booking updates by using updated_at.
Sample Answer
Walk through the logic step by step as if thinking out loud. You start by defining the day window, ds start and ds end. Next, active_listings is a snapshot metric, so you count listings where created_at is before ds end, and deactivated_at is null or after ds end. Then booked_nights is an overlap metric, so you compute the intersection of [check_in, check_out) with [ds, ds+1), but only for non-canceled bookings. Finally, for incrementality you only scan bookings that could affect ds, either the stay overlaps ds or the record was updated recently, and you upsert the single ds partition for each host.
WITH params AS (
  SELECT
    CAST(:run_date AS DATE) AS ds,
    CAST(:run_date AS TIMESTAMP) AS ds_start_ts,
    CAST(:run_date AS TIMESTAMP) + INTERVAL '1' DAY AS ds_end_ts
),
active_listings_by_host AS (
  SELECT
    l.host_id,
    p.ds,
    COUNT(*) AS active_listings
  FROM listings l
  CROSS JOIN params p
  WHERE l.created_at < p.ds_end_ts
    AND (l.deactivated_at IS NULL OR l.deactivated_at >= p.ds_end_ts)
  GROUP BY l.host_id, p.ds
),
-- Limit booking scan for the incremental run.
-- Assumption: you run daily and keep a small lookback for late updates.
-- This reduces IO while still catching updates that change ds attribution.
bookings_candidates AS (
  SELECT
    b.booking_id,
    b.listing_id,
    b.check_in,
    b.check_out,
    b.status,
    b.updated_at
  FROM bookings b
  CROSS JOIN params p
  WHERE b.updated_at >= p.ds_start_ts - INTERVAL '7' DAY
    AND b.updated_at < p.ds_end_ts + INTERVAL '1' DAY
),
booked_nights_by_host AS (
  SELECT
    l.host_id,
    p.ds,
    SUM(
      CASE
        WHEN bc.status = 'canceled' THEN 0
        -- Compute overlap nights between [check_in, check_out) and [ds, ds+1)
        ELSE GREATEST(
          0,
          DATE_DIFF(
            'day',
            GREATEST(CAST(bc.check_in AS DATE), p.ds),
            LEAST(CAST(bc.check_out AS DATE), p.ds + INTERVAL '1' DAY)
          )
        )
      END
    ) AS booked_nights
  FROM bookings_candidates bc
  JOIN listings l
    ON l.listing_id = bc.listing_id
  CROSS JOIN params p
  WHERE CAST(bc.check_in AS DATE) < p.ds + INTERVAL '1' DAY
    AND CAST(bc.check_out AS DATE) > p.ds
  GROUP BY l.host_id, p.ds
),
final AS (
  SELECT
    COALESCE(al.host_id, bn.host_id) AS host_id,
    (SELECT ds FROM params) AS ds,
    COALESCE(al.active_listings, 0) AS active_listings,
    COALESCE(bn.booked_nights, 0) AS booked_nights
  FROM active_listings_by_host al
  FULL OUTER JOIN booked_nights_by_host bn
    ON bn.host_id = al.host_id
    AND bn.ds = al.ds
)
-- In production this would be an upsert into the ds partition.
SELECT *
FROM final
ORDER BY host_id;

Event stream table listing_price_events(listing_id, event_time, ingest_time, price_usd) can contain duplicates and out-of-order arrivals. Write SQL to build a daily snapshot table listing_price_daily(listing_id, ds, price_usd, event_time) for ds = :run_date using the latest event_time within the day, breaking ties by latest ingest_time, and ensuring exactly one row per listing per ds.
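A hedged sketch of one acceptable shape for that snapshot query, using ROW_NUMBER to guarantee exactly one row per listing per ds; dialect details such as the date cast may vary, and in production the result would overwrite or upsert the ds partition.
WITH in_day AS (
  SELECT
    listing_id,
    event_time,
    ingest_time,
    price_usd,
    ROW_NUMBER() OVER (
      PARTITION BY listing_id
      ORDER BY event_time DESC, ingest_time DESC   -- latest event wins, ingest_time breaks ties
    ) AS rn
  FROM listing_price_events
  WHERE CAST(event_time AS DATE) = CAST(:run_date AS DATE)
)
SELECT
  listing_id,
  CAST(:run_date AS DATE) AS ds,
  price_usd,
  event_time
FROM in_day
WHERE rn = 1;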
Data Warehouse
A client runs one shared warehouse account for 15 business units, each with its own analysts, plus a central delivery team that runs dbt and Airflow. Design the warehouse layer and access model (schemas, roles, row-level security, data products) so units cannot see each other’s data but can consume shared conformed dimensions.
Sample Answer
Most candidates default to separate databases per business unit, but that fails here because conformed dimensions and shared transformation code become duplicated and drift fast. You want a shared curated layer for conformed entities (customer, product, calendar) owned by a platform team, plus per-unit marts or data products with strict role-based access control. Use warehouse roles with least privilege, database roles, and row access policies (and masking policies) keyed on tenant identifiers where physical separation is not feasible. Put ownership, SLAs, and contract tests on the shared layer so every unit trusts the same definitions.
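As a concrete illustration of that access model, here is a sketch in Snowflake-style syntax; the vendor is an assumption since the answer only names row access and masking policies, and all object names are illustrative.
-- Each business unit gets a role scoped to its own mart schema.
GRANT USAGE  ON DATABASE analytics                          TO ROLE bu_retail_analyst;
GRANT USAGE  ON SCHEMA   analytics.mart_retail              TO ROLE bu_retail_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.mart_retail  TO ROLE bu_retail_analyst;

-- Shared conformed tables stay readable by every unit, with row-level
-- access enforced wherever a shared table mixes tenants.
CREATE ROW ACCESS POLICY governance.bu_rls AS (bu_code VARCHAR) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'PLATFORM_ADMIN'
  OR EXISTS (
    SELECT 1
    FROM governance.role_to_bu m
    WHERE m.role_name = CURRENT_ROLE()
      AND m.bu_code = bu_code
  );

ALTER TABLE analytics.shared.fact_orders
  ADD ROW ACCESS POLICY governance.bu_rls ON (bu_code);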
A Redshift cluster powers an operations dashboard where 150 concurrent users run the same 3 queries; one of them scans fact_clickstream (10 TB) joined to dim_sku and dim_marketplace, groups by day and marketplace, and spikes to 40 minutes at peak. What concrete Redshift table design changes (DISTKEY, SORTKEY, compression, materialized views) and workload controls would you apply, and how do you validate each change with evidence?
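If you want something concrete to anchor that discussion, here is a hedged sketch of the table-design levers in Redshift DDL; column names are assumptions, workload controls (WLM queues, concurrency scaling, result caching) would accompany the DDL, and each change should be validated with before/after runtimes and system views such as SVL_QUERY_SUMMARY.
-- Small dimension: replicate to every node so the join never redistributes.
CREATE TABLE analytics.dim_marketplace (
  marketplace_id    INT          ENCODE az64,
  marketplace_name  VARCHAR(64)  ENCODE lzo
)
DISTSTYLE ALL;

CREATE TABLE analytics.fact_clickstream (
  event_date      DATE    ENCODE az64,
  marketplace_id  INT     ENCODE az64,
  sku_id          BIGINT  ENCODE az64,
  clicks          INT     ENCODE az64
)
DISTSTYLE KEY
DISTKEY (sku_id)                                 -- co-locate with a KEY-distributed dim_sku
COMPOUND SORTKEY (event_date, marketplace_id);   -- range-restricted scans by day

-- Precompute the dashboard aggregate so 150 concurrent users hit a small result set.
CREATE MATERIALIZED VIEW analytics.mv_clicks_by_day_marketplace AS
SELECT f.event_date, f.marketplace_id, SUM(f.clicks) AS clicks
FROM analytics.fact_clickstream f
GROUP BY f.event_date, f.marketplace_id;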
Data Modeling
Rather than raw SQL skill, you’re judged on how you structure facts, dimensions, and metrics so downstream analytics stays stable. Watch for prompts around SCD types, grain definition, and metric consistency across Sales/Analytics consumers.
A company has a daily snapshot table listing_snapshot(listing_id, ds, price, is_available, host_id, city_id) and an events table booking_event(booking_id, listing_id, created_at, check_in, check_out). Write SQL to compute booked nights and average snapshot price at booking time by city and ds, where snapshot ds is the booking created_at date.
Sample Answer
Start with what the interviewer is really testing: "This question is checking whether you can align event time to snapshot time without creating fanout joins or time leakage." You join booking_event to listing_snapshot on listing_id plus the derived snapshot date, then aggregate nights as $\text{datediff}(\text{check\_out}, \text{check\_in})$. You also group by snapshot ds and city_id, and you keep the join predicates tight so each booking hits at most one snapshot row.
SELECT
  ls.ds,
  ls.city_id,
  SUM(DATE_DIFF('day', be.check_in, be.check_out)) AS booked_nights,
  AVG(ls.price) AS avg_snapshot_price_at_booking
FROM booking_event be
JOIN listing_snapshot ls
  ON ls.listing_id = be.listing_id
  AND ls.ds = DATE(be.created_at)
GROUP BY 1, 2;

You are designing a star schema for host earnings and need to support two use cases: monthly payouts reporting and real-time fraud monitoring on payout anomalies. How do you model payout facts and host and listing dimensions, including slowly changing attributes like host country and payout method, so both use cases stay correct?
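One hedged way to sketch the slowly changing piece of that answer is a Type 2 host dimension; names are illustrative, and a full answer would still cover the payout fact grain and the fraud-monitoring path.
-- Illustrative SCD Type 2 host dimension.
CREATE TABLE dim_host (
  host_sk         BIGINT     NOT NULL,   -- surrogate key referenced by fact_payout
  host_id         BIGINT     NOT NULL,   -- natural key
  host_country    VARCHAR    NOT NULL,   -- slowly changing attribute
  payout_method   VARCHAR    NOT NULL,   -- slowly changing attribute
  effective_from  TIMESTAMP  NOT NULL,
  effective_to    TIMESTAMP,             -- NULL for the current row
  is_current      BOOLEAN    NOT NULL,
  PRIMARY KEY (host_sk)
);
-- Monthly payout reporting joins fact_payout.host_sk to the version that was
-- current when the payout occurred; real-time fraud monitoring filters on
-- is_current = TRUE to score anomalies against the latest attributes.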
Coding & Algorithms
Your ability to reason about constraints and produce correct, readable Python under time pressure is a major differentiator. You’ll need solid data-structure choices, edge-case handling, and complexity awareness rather than exotic CS theory.
Given a stream of (asin, customer_id, ts) clicks on product detail pages, compute the top K ASINs by unique customer count within the last 24 hours for a given query time ts_now. Input can be unsorted, and you must handle duplicates and out-of-window events correctly.
Sample Answer
Get this wrong in production and your top ASIN dashboard flaps, because late events and duplicates inflate counts and reorder the top K every refresh. The right call is to filter by the $24$ hour window relative to ts_now, dedupe by (asin, customer_id), then use a heap or partial sort to extract K efficiently.
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple
import heapq


def _parse_time(ts: str) -> datetime:
    """Parse ISO-8601 timestamps, supporting a trailing 'Z'."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def top_k_asins_unique_customers_last_24h(
    events: Iterable[Tuple[str, str, str]],
    ts_now: str,
    k: int,
) -> List[Tuple[str, int]]:
    """Return the top K (asin, unique_customer_count) pairs in the last 24h window.

    events: iterable of (asin, customer_id, ts) where ts is an ISO-8601 string.
    ts_now: window reference time (ISO-8601).
    k: number of ASINs to return.

    Ties are broken by ASIN lexicographic order (stable, deterministic output).
    """
    if k <= 0:
        return []

    now = _parse_time(ts_now)
    start = now - timedelta(hours=24)

    # Dedupe by (asin, customer_id) within the window: sets collapse duplicate
    # clicks automatically, so each customer counts once per ASIN.
    # If events are huge, you would partition by asin or approximate; here keep it exact.
    customers_by_asin: Dict[str, Set[str]] = {}
    for asin, customer_id, ts in events:
        t = _parse_time(ts)
        if t < start or t > now:
            continue  # out-of-window event
        customers_by_asin.setdefault(asin, set()).add(customer_id)

    counts: List[Tuple[str, int]] = [
        (asin, len(custs)) for asin, custs in customers_by_asin.items()
    ]

    # Partial sort via a heap: top K by count desc, then ASIN asc for deterministic ties.
    return heapq.nsmallest(k, counts, key=lambda p: (-p[1], p[0]))


if __name__ == "__main__":
    data = [
        ("B001", "C1", "2024-01-02T00:00:00Z"),
        ("B001", "C1", "2024-01-02T00:01:00Z"),  # duplicate customer for same ASIN
        ("B001", "C2", "2024-01-02T01:00:00Z"),
        ("B002", "C3", "2024-01-01T02:00:00Z"),
        ("B003", "C4", "2023-12-31T00:00:00Z"),  # out of window
    ]
    print(top_k_asins_unique_customers_last_24h(data, "2024-01-02T02:00:00Z", 2))

Given a list of nightly booking records {"listing_id": int, "guest_id": int, "checkin": int day, "checkout": int day} (checkout is exclusive), flag each listing_id that is overbooked, meaning at least one day has more than $k$ active stays, and return the earliest day where the maximum occupancy exceeds $k$.
Data Engineering
You need to join a 5 TB Delta table of per-frame telemetry with a 50 GB Delta table of trip metadata on trip_id to produce a canonical fact table. Would you rely on broadcast join or shuffle join, and what explicit configs or hints would you set to make it stable and cost efficient?
Sample Answer
You could force a broadcast join of the 50 GB table or run a standard shuffle join on trip_id. Broadcast wins only if the metadata table can reliably fit in executor memory across the cluster, otherwise you get OOM or repeated GC and retries. In most real clusters 50 GB is too big to broadcast safely, so shuffle join wins, then you make it stable by pre-partitioning or bucketing by trip_id where feasible, tuning shuffle partitions, and enabling AQE to coalesce partitions.
from pyspark.sql import functions as F

# Inputs
telemetry = spark.read.format("delta").table("raw.telemetry_frames")  # very large
trips = spark.read.format("delta").table("dim.trip_metadata")         # large, but the smaller side

# Prefer shuffle join with AQE for stability
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Right-size shuffle partitions, set via env or job config in practice
spark.conf.set("spark.sql.shuffle.partitions", "4000")

# Pre-filter early if possible to reduce shuffle
telemetry_f = telemetry.where(F.col("event_date") >= F.date_sub(F.current_date(), 7))
trips_f = trips.select("trip_id", "vehicle_id", "route_id", "start_ts", "end_ts")

joined = (
    telemetry_f
    .join(trips_f.hint("shuffle_hash"), on="trip_id", how="inner")
)

# Write out with sane partitioning and file sizing
(
    joined
    .repartition("event_date")
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("canon.fact_telemetry_enriched")
)

A customer support organization wants a governed semantic layer for "First Response Time" and "Resolution Time" across email and chat, and an LLM tool will answer questions using those metrics. How do you enforce metric definitions, data access, and quality guarantees so the LLM and Looker both return consistent numbers and do not leak restricted fields?
Cloud Infrastructure
In practice, you’ll need to articulate why you’d pick Spark/Hive vs an MPP warehouse vs Cassandra for a specific workload. Interviewers look for pragmatic tradeoffs: throughput vs latency, partitioning/sharding choices, and operational constraints.
A cloud data warehouse behind a client’s KPI dashboard has unpredictable concurrency, and monthly spend is spiking. What specific changes do you make to balance performance and cost, and what signals do you monitor to validate the change?
Sample Answer
The standard move is to right-size compute, enable auto-suspend and auto-resume, and separate workloads with different warehouses (ELT, BI, ad hoc). But here, concurrency matters because scaling up can be cheaper than scaling out if query runtime drops sharply, and scaling out can be required if queueing dominates. You should call out monitoring of queued time, warehouse load, query history, cache hit rates, and top cost drivers by user, role, and query pattern. You should also mention guardrails like resource monitors and workload isolation via roles and warehouse assignment.
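For concreteness, assuming a Snowflake-style warehouse (the vendor isn't specified above), the statements involved might look like the sketch below; warehouse names and numbers are illustrative, and the monitoring signals named in the answer are what you would watch after each change.
-- Right-size compute and let idle time suspend automatically.
ALTER WAREHOUSE bi_wh SET
  WAREHOUSE_SIZE    = 'MEDIUM'
  AUTO_SUSPEND      = 60        -- seconds of idle before suspending
  AUTO_RESUME       = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3;        -- scale out only when queueing appears

-- Guardrail: cap monthly spend and suspend if it is exceeded.
CREATE RESOURCE MONITOR bi_wh_monthly WITH
  CREDIT_QUOTA = 200
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = bi_wh_monthly;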
You need near real-time order events (p95 under 5 seconds) for an operations dashboard and also a durable, replayable history for backfills, with events peaking at 20k per second. How do you choose between Kinesis Data Streams plus Lambda versus Kinesis Firehose into S3 plus Glue, and what IAM, encryption, and monitoring controls do you put in place?
GS interviewers love to blur the line between "can you write the query" and "can you ship it reliably before the trading desk needs it." That compounding effect, where a SQL problem escalates into a pipeline design discussion grounded in GS's pre-market SLA pressure, is what separates this loop from a standard data engineering screen. The biggest prep trap is treating each topic as isolated when GS's Superday panelists explicitly build on each other's questions across sessions.
Drill finance-flavored SQL and pipeline design questions mapped to Goldman Sachs interviews at datainterview.com/questions.
How to Prepare for Goldman Sachs Data Engineer Interviews
Know the Business
Official mission
“Goldman Sachs’ mission is to advance sustainable economic growth and financial opportunity across the globe.”
What it actually means
Goldman Sachs aims to provide comprehensive financial services, including investment banking, asset management, and wealth management, to a diverse global client base. Its core purpose is to foster sustainable economic growth and broaden financial opportunities for individuals and institutions worldwide.
Key Business Metrics
- Revenue: $59B (+15% YoY)
- $279B (+35% YoY)
- Employees: ~47K (+3% YoY)
Business Segments and Where DS Fits
Goldman Sachs Asset Management
The primary investing area within Goldman Sachs, delivering investment and advisory services across public and private markets for the world's leading institutions, financial advisors, and individuals. It is a leading investor across fixed income, liquidity, equity, alternatives, and multi-asset solutions. Goldman Sachs oversees approximately $3.5 trillion in assets under supervision as of September 30, 2025.
DS focus: Utilizing quantitative strategies to navigate market complexities and inefficiencies, employing data-driven approaches for diversified portfolios, and leveraging AI applications for automation, customer engagement, and operational intelligence.
Current Strategic Priorities
- Expand offerings in the wealth channel to help more investors reach their long-term goals by combining expertise with T. Rowe Price through co-branded model portfolios.
Competitive Moat
Goldman Sachs Asset Management oversaw approximately $3.5 trillion in assets under supervision as of September 30, 2025, and the firm is actively expanding its wealth channel through moves like co-branded model portfolios with T. Rowe Price. Cross-firm product launches like that require stitching together portfolio data, risk feeds, and client reference data across systems that weren't originally designed to talk to each other. That's the kind of integration problem GS data engineers solve daily.
The "why Goldman?" answer that actually works ties your skills to a specific initiative, not to the brand. Instead of gesturing at prestige, reference the T. Rowe Price partnership or GS's investment in open-source Scala tooling and explain why the data engineering challenge behind it excites you. Interviewers at GS are screening for people who understand that a data pipeline feeding a co-branded portfolio product has different SLA and lineage requirements than a generic analytics dashboard.
Try a Real Interview Question
Daily net volume with idempotent status selection
Given payment events where a transaction can have multiple status updates, compute daily net processed amount per merchant in USD for a date range. For each transaction_id, use only the latest event by event_ts; count COMPLETED as +amount_usd, REFUNDED or CHARGEBACK as -amount_usd, and treat PENDING and FAILED as 0. Output event_date, merchant_id, and net_amount_usd aggregated by day and merchant.
| transaction_id | merchant_id | event_ts | status | amount_usd |
|---|---|---|---|---|
| tx1001 | m001 | 2026-01-10 09:15:00 | PENDING | 50.00 |
| tx1001 | m001 | 2026-01-10 09:16:10 | COMPLETED | 50.00 |
| tx1002 | m001 | 2026-01-10 10:05:00 | COMPLETED | 20.00 |
| tx1002 | m001 | 2026-01-11 08:00:00 | REFUNDED | 20.00 |
| tx1003 | m002 | 2026-01-11 12:00:00 | FAILED | 75.00 |
| merchant_id | merchant_name |
|---|---|
| m001 | Alpha Shop |
| m002 | Beta Games |
| m003 | Gamma Travel |
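One hedged solution sketch, assuming the events table is named payment_events and that :start_date / :end_date bound the reporting range: pick the latest event per transaction first, then filter and aggregate, so a transaction can never be counted twice.
WITH latest AS (
  SELECT
    transaction_id,
    merchant_id,
    event_ts,
    status,
    amount_usd,
    ROW_NUMBER() OVER (
      PARTITION BY transaction_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM payment_events
)
SELECT
  CAST(event_ts AS DATE) AS event_date,
  merchant_id,
  SUM(
    CASE status
      WHEN 'COMPLETED'  THEN  amount_usd
      WHEN 'REFUNDED'   THEN -amount_usd
      WHEN 'CHARGEBACK' THEN -amount_usd
      ELSE 0                            -- PENDING / FAILED contribute nothing
    END
  ) AS net_amount_usd
FROM latest
WHERE rn = 1
  AND CAST(event_ts AS DATE) BETWEEN :start_date AND :end_date
GROUP BY CAST(event_ts AS DATE), merchant_id
ORDER BY event_date, merchant_id;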
700+ ML coding problems with a live Python executor.
Practice in the Engine
From what candidates report, GS leans heavily on SQL problems with a financial flavor, where understanding what the data represents (positions, P&L, client hierarchies) matters as much as getting the syntax right. Practice at datainterview.com/coding with that mindset: don't just solve the query, articulate why the business cares about the result.
Test Your Readiness
Data Engineer Readiness Assessment
1 / 10 · Can you design an ETL or ELT pipeline that handles incremental loads (CDC or watermarking), late-arriving data, and idempotent retries?
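As a concrete mental model for that first item, here is a minimal watermark-plus-MERGE sketch with illustrative table names; re-running the same window is safe because the MERGE keys on order_id and only applies newer changes.
MERGE INTO dw.orders AS t
USING (
  -- CDC window since the last successful watermark, deduped to the latest
  -- change per order so the MERGE input has exactly one row per key.
  SELECT order_id, status, amount, updated_at
  FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id ORDER BY updated_at DESC
           ) AS rn
    FROM staging.orders_changes c
    WHERE updated_at >  :last_watermark
      AND updated_at <= :current_watermark
  ) d
  WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, amount, updated_at)
  VALUES (s.order_id, s.status, s.amount, s.updated_at);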
GS's Superday panels score behavioral answers independently, so drill your STAR stories against Goldman-specific prompts at datainterview.com/questions until the delivery feels natural under fatigue.
Frequently Asked Questions
How long does the Goldman Sachs Data Engineer interview process take?
Expect roughly 4 to 8 weeks from application to offer. You'll typically start with a recruiter screen, then move to a technical phone screen, and finally an onsite (or virtual onsite) with multiple rounds. Goldman moves faster for experienced hires but can slow down during busy quarters. I've seen some candidates wait 2+ weeks between rounds, so don't panic if things go quiet.
What technical skills are tested in the Goldman Sachs Data Engineer interview?
SQL is non-negotiable. You'll also be tested on Python, distributed systems (Spark, Hadoop), and data pipeline design. Goldman cares a lot about data modeling, ETL architecture, and how you handle large-scale data workflows. Expect questions on cloud platforms like AWS or GCP, and be ready to talk about orchestration tools like Airflow. They want engineers who can build reliable, production-grade systems.
How should I tailor my resume for a Goldman Sachs Data Engineer role?
Lead with impact. Goldman values quantifiable results, so frame your experience around scale (rows processed, pipeline latency reduced, cost savings). Highlight any work with financial data or regulated environments since that resonates with their business. Keep it to one page if you have under 10 years of experience. And mention specific tools like Spark, Airflow, or Kafka by name rather than vague phrases like 'big data technologies.'
What is the total compensation for a Goldman Sachs Data Engineer?
For an analyst-level Data Engineer (entry to ~3 years), base salary typically ranges from $100K to $130K with a bonus that can push total comp to $130K to $170K. At the associate level (3 to 7 years), base is roughly $130K to $165K, with total comp reaching $180K to $250K including bonus. VP-level engineers can see total comp north of $300K. These numbers are for New York. Other offices may adjust slightly.
How do I prepare for the behavioral interview at Goldman Sachs for a Data Engineer position?
Goldman takes culture fit seriously. Their core values are Partnership, Client Service, Integrity, and Excellence, so frame your stories around those themes. Prepare 4 to 5 strong examples covering teamwork, conflict resolution, ownership of a technical project, and a time you delivered under pressure. They want to see that you can work across teams in a fast-paced financial environment, not just write good code.
How hard are the SQL and coding questions in the Goldman Sachs Data Engineer interview?
SQL questions are medium to hard. Expect multi-join queries, window functions, CTEs, and performance optimization scenarios. Python coding questions tend to be medium difficulty, focused on data manipulation and algorithm basics rather than competitive programming puzzles. I've seen candidates get tripped up on query optimization and indexing questions specifically. Practice at datainterview.com/questions to get a feel for the right difficulty level.
Are ML or statistics concepts tested in the Goldman Sachs Data Engineer interview?
This isn't a data science role, so you won't face deep ML theory. That said, Goldman may ask about basic statistics (distributions, aggregations, anomaly detection) and how you'd build pipelines that feed ML models. Understanding feature engineering workflows and how data engineers support model training and serving is a plus. You don't need to derive gradient descent, but knowing the data lifecycle around ML systems helps.
What format should I use to answer behavioral questions at Goldman Sachs?
Use the STAR format (Situation, Task, Action, Result) but keep it tight. Goldman interviewers are busy and direct, so don't spend two minutes on setup. Get to the action fast and always quantify the result. For example, 'I redesigned the pipeline, which cut processing time by 60% and saved $40K per quarter.' Practice being concise. Rambling is one of the most common mistakes I see.
What happens during the Goldman Sachs Data Engineer onsite interview?
The onsite typically has 3 to 5 rounds lasting about 45 minutes each. You'll face at least one SQL round, one system design or data architecture round, one Python coding round, and one or two behavioral rounds. Some panels include a hiring manager conversation focused on your past projects and how you think about data problems at scale. It's a long day, so pace yourself and bring energy to every round.
What business metrics or domain concepts should I know for a Goldman Sachs Data Engineer interview?
Goldman is a financial services firm with $59.4B in revenue, so understanding basic financial concepts helps. Know what trade data, risk metrics, P&L reporting, and regulatory data requirements look like at a high level. You don't need to be a quant, but showing awareness of how data pipelines support trading desks, compliance, and client reporting will set you apart from candidates who only talk about generic ETL.
What are the most common mistakes in the Goldman Sachs Data Engineer interview?
Three things I see repeatedly. First, candidates write SQL that works but is wildly inefficient, and Goldman cares about performance at scale. Second, people underestimate the behavioral rounds and wing them. Third, candidates fail to connect their technical work to business outcomes. Goldman's culture is results-oriented. Always tie your answers back to impact. Practice your end-to-end stories at datainterview.com/questions before the real thing.
How should I prepare for system design questions as a Goldman Sachs Data Engineer?
Focus on designing data pipelines, not web applications. Think batch vs. streaming architectures, data lake design, schema evolution, and fault tolerance. A common prompt might be 'Design a pipeline that ingests trade data in real time and serves it to downstream analytics.' Practice drawing clear diagrams, explaining tradeoffs (latency vs. throughput, cost vs. reliability), and calling out where you'd use tools like Kafka, Spark, or Airflow. Specificity wins over hand-waving.



