Design a Data Catalog

Dan Lee, Data & AI Lead
Last update: March 9, 2026

Understanding the Problem

Product definition: A data catalog is a centralized registry where data teams can discover, understand, and trust every data asset in their organization, from warehouse tables and Kafka topics to Airflow pipelines and ML feature stores.

What is a Data Catalog?

At most companies beyond a certain size, data is everywhere and nobody knows where anything is. An analyst needs the orders table but isn't sure if it lives in Snowflake or Redshift, which version is canonical, or whether it's even being updated anymore. A data catalog solves this by giving every asset a home: a searchable record with its schema, owner, upstream dependencies, and freshness status.

Think of it as the "source of truth about your sources of truth." The catalog doesn't store the data itself. It stores metadata: what exists, who owns it, how it was created, what it feeds downstream, and whether you should trust it right now.

Two distinct personas use this system. Data consumers (analysts, data scientists, ML engineers) come to the catalog to find trusted datasets and understand what columns mean. Data producers (data engineers, platform teams) use it to register new assets, document schemas, and set ownership. Your design needs to serve both.

Functional Requirements

Core Requirements

  • Asset discovery via search: Users can search for tables, pipelines, dashboards, and ML features by name, description, column name, or tag. Results return in under 200ms.
  • Metadata ingestion from sources: The catalog ingests schema and asset metadata from sources like Snowflake, Hive, Kafka, dbt, and Airflow, either via crawlers or push-based connectors.
  • Lineage tracking: The catalog tracks upstream and downstream relationships between assets, so users can see what a table depends on and what breaks if it changes.
  • Data quality and freshness signals: Assets surface health information: last successful pipeline run, SLA breach status, and quality check results from tools like Great Expectations or dbt tests.
  • Tagging and ownership: Users and systems can apply tags (e.g., pii, finance, deprecated) and assign owners (teams or individuals) to assets.

Below the line (out of scope)

  • Fine-grained access control enforcement (the catalog can display who has access, but won't gate queries)
  • Data previews or sample row fetching
  • Automated data quality rule creation
Note: "Below the line" features are acknowledged but won't be designed in this lesson.

Non-Functional Requirements

  • Search latency: p99 under 200ms for asset search queries. Asset detail page loads under 500ms.
  • Ingestion freshness: Metadata changes in source systems should appear in the catalog within 5 minutes. Schema changes should be reflected within 15 minutes at the latest.
  • Scale: Support up to 10 million assets across hundreds of data sources and connectors.
  • Read availability: 99.9% uptime for read paths (search and asset detail). Ingestion can tolerate brief delays during outages without data loss.
Tip: Always clarify requirements before jumping into design. Asking "do we need column-level lineage or is table-level enough?" or "is ingestion push or pull?" signals maturity and shapes the entire architecture. These aren't stalling tactics; they're the questions a senior engineer asks on day one.

Back-of-Envelope Estimation

A few assumptions to ground the numbers. A large company has roughly 5 million cataloged assets. Most activity is reads (search, browsing asset pages). Writes come from automated ingestion, not humans.

  • Total assets: 5 million
  • Avg metadata record size: 10 KB (schema, tags, lineage refs, quality scores)
  • Total metadata storage: ~50 GB
  • Search QPS (peak): ~500 QPS (analysts querying during business hours)
  • Ingestion events/sec: ~200 (schema changes, pipeline completions, quality results)
  • Lineage edges: ~20 million (avg 4 edges per asset)
  • Graph store size: ~2 GB (edges are lightweight: two URNs + metadata)
  • Elasticsearch index size: ~100 GB (denormalized asset docs with all searchable fields)

The storage numbers are small compared to the data systems the catalog describes. The hard problems here aren't scale in the traditional sense. They're correctness (is this metadata actually fresh?) and relevance (does search surface the right asset, not just any asset matching the keyword?). Keep that in mind as you walk through the design.
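The arithmetic behind these figures is worth being able to reproduce on the spot. A quick sketch, using decimal units and an assumed ~100 bytes per lineage edge:

```python
# Sanity-check of the estimates above (rough arithmetic, not a benchmark).
assets = 5_000_000
record_kb = 10
metadata_gb = assets * record_kb / 1_000_000        # KB -> GB (decimal units)

edges_per_asset = 4
edges = assets * edges_per_asset
edge_bytes = 100                                    # two URNs + light metadata (assumed)
graph_gb = edges * edge_bytes / 1_000_000_000

print(metadata_gb)  # 50.0 GB of raw metadata
print(edges)        # 20,000,000 lineage edges
print(graph_gb)     # 2.0 GB for the graph store
```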

The Set Up

Before you start drawing boxes and arrows, you need a clean domain model. Interviewers will push back if your entities are fuzzy or if you're conflating things that should be separate. Get this right first.

Core Entities

Five entities do most of the work here. Everything else in the system (search, ingestion, lineage traversal) is just moving data into and out of these five tables.

Asset is the central node. A table in Snowflake, a Kafka topic, an Airflow DAG, a dbt model, a Looker dashboard: all of these are assets. Every other entity hangs off an asset via its URN. The URN is a stable, globally unique identifier that looks like urn:snowflake:prod.analytics.orders or urn:kafka:prod.user-events. It's the key that lets you reconcile the same table appearing in multiple source systems without creating duplicates.

Schema is versioned. When a column gets dropped or a type changes, you don't overwrite the old schema; you write a new version. This is what lets downstream consumers detect breaking changes before they hit a pipeline failure at 3am.

Lineage is a directed edge between two assets. Upstream produces, downstream consumes. The edge also carries how the relationship was detected, whether that's SQL parsing, a dbt ref(), or an Airflow task dependency. That provenance matters when you're debugging why a lineage graph looks wrong.

Tag and Owner are both many-to-many with Assets. One asset can have multiple owners and multiple tags. That's what enables governance queries like "show me all PII-tagged tables owned by the payments team."

Key insight: The URN is load-bearing. If you use auto-increment IDs as your primary key for assets, you lose the ability to deduplicate across sources. A Snowflake crawler and a dbt manifest might both register prod.analytics.orders. The URN is what lets the ingestion service recognize they're the same asset and merge the metadata instead of creating two records.
CREATE TABLE assets (
    urn             VARCHAR(512) PRIMARY KEY,              -- e.g. urn:snowflake:prod.analytics.orders
    name            VARCHAR(255) NOT NULL,                 -- human-readable name
    source_type     VARCHAR(50) NOT NULL,                  -- 'snowflake', 'kafka', 'dbt', 'airflow'
    description     TEXT,                                  -- user-provided or auto-extracted
    created_at      TIMESTAMP NOT NULL DEFAULT now(),
    last_updated    TIMESTAMP NOT NULL DEFAULT now()
);

CREATE INDEX idx_assets_source_type ON assets(source_type);
CREATE INDEX idx_assets_last_updated ON assets(last_updated DESC);
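To make the URN-based deduplication concrete, here is a minimal merge sketch. The field names mirror the assets table; the merge policy (incoming non-empty values win, keep the newest timestamp) is an illustrative assumption, not a fixed spec.

```python
# Illustrative merge of two metadata records that share a URN, as the
# ingestion service might do when a Snowflake crawler and a dbt manifest
# both register the same table. Merge policy here is an assumption.
def merge_asset_records(existing: dict, incoming: dict) -> dict:
    merged = dict(existing)
    for field in ("name", "source_type", "description"):
        if incoming.get(field):  # incoming wins only when it has a value
            merged[field] = incoming[field]
    merged["last_updated"] = max(existing["last_updated"], incoming["last_updated"])
    return merged

from_crawler = {"urn": "urn:snowflake:prod.analytics.orders", "name": "orders",
                "source_type": "snowflake", "description": None,
                "last_updated": "2024-01-15"}
from_dbt = {"urn": "urn:snowflake:prod.analytics.orders", "name": "orders",
            "source_type": "snowflake", "description": "One row per order",
            "last_updated": "2024-01-16"}
merged = merge_asset_records(from_crawler, from_dbt)
```

The result is one record per URN that accumulates the best metadata from every source, instead of two competing rows.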
CREATE TABLE schemas (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_urn       VARCHAR(512) NOT NULL REFERENCES assets(urn),
    version         INT NOT NULL,                          -- monotonically increasing per asset
    columns         JSONB NOT NULL DEFAULT '[]',           -- [{name, type, description, nullable}]
    captured_at     TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE (asset_urn, version)
);

CREATE INDEX idx_schemas_asset_urn ON schemas(asset_urn, version DESC);

The columns field is JSONB rather than a normalized table. Column counts vary wildly across assets (a Kafka topic might have 5 fields; a Hive table might have 200), and you'll almost always read all columns together. Keeping them in a single JSONB blob makes reads cheap and avoids a 200-row join on every asset page load.

CREATE TABLE lineage (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    upstream_urn        VARCHAR(512) NOT NULL REFERENCES assets(urn),
    downstream_urn      VARCHAR(512) NOT NULL REFERENCES assets(urn),
    detection_method    VARCHAR(50) NOT NULL,              -- 'sql_parse', 'dbt_ref', 'airflow_task'
    created_at          TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE (upstream_urn, downstream_urn, detection_method)
);

CREATE INDEX idx_lineage_upstream ON lineage(upstream_urn);
CREATE INDEX idx_lineage_downstream ON lineage(downstream_urn);

Both indexes matter. You traverse lineage in both directions: "what does this asset feed into?" and "what does this asset depend on?"

CREATE TABLE tags (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_urn   VARCHAR(512) NOT NULL REFERENCES assets(urn),
    tag_key     VARCHAR(100) NOT NULL,                     -- e.g. 'sensitivity', 'domain'
    tag_value   VARCHAR(255) NOT NULL,                     -- e.g. 'PII', 'payments'
    applied_by  VARCHAR(255) NOT NULL,                     -- user ID or system name
    applied_at  TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE (asset_urn, tag_key, tag_value)
);

CREATE INDEX idx_tags_key_value ON tags(tag_key, tag_value);
CREATE TABLE owners (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_urn       VARCHAR(512) NOT NULL REFERENCES assets(urn),
    owner_id        VARCHAR(255) NOT NULL,                 -- user or team identifier
    owner_type      VARCHAR(20) NOT NULL,                  -- 'user' or 'team'
    assigned_at     TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE (asset_urn, owner_id)
);

CREATE INDEX idx_owners_owner_id ON owners(owner_id);
Data Catalog Core Entities
Common mistake: Candidates often model lineage as a field on the Asset table (like upstream_urns TEXT[]). That breaks the moment you need to traverse more than one hop, filter by detection method, or query "what are all the downstream consumers of this asset?" Lineage needs its own table with indexed foreign keys on both ends.

API Design

The catalog has two primary interaction patterns: producers registering and updating assets, and consumers searching and exploring them. The API reflects that split cleanly.

// Register or update an asset (upsert by URN)
POST /assets
{
  "urn": "urn:snowflake:prod.analytics.orders",
  "name": "orders",
  "source_type": "snowflake",
  "description": "One row per order, partitioned by created_date"
}
-> { "urn": "urn:snowflake:prod.analytics.orders", "created": true }
// Search for assets by keyword, with optional filters
GET /assets/search?q=orders&source_type=snowflake&tag=PII&owner=payments-team&limit=20&offset=0
-> {
     "results": [
       {
         "urn": "urn:snowflake:prod.analytics.orders",
         "name": "orders",
         "source_type": "snowflake",
         "description": "...",
         "quality_score": 0.92,
         "last_updated": "2024-01-15T08:00:00Z",
         "tags": [{"key": "domain", "value": "payments"}],
         "owners": [{"owner_id": "payments-team", "owner_type": "team"}]
       }
     ],
     "total": 143
   }
// Fetch full asset detail (schema, tags, owners, freshness)
GET /assets/{urn}
-> {
     "urn": "urn:snowflake:prod.analytics.orders",
     "schema": { "version": 7, "columns": [...] },
     "tags": [...],
     "owners": [...],
     "quality_score": 0.92,
     "last_successful_run": "2024-01-15T07:45:00Z"
   }
// Get lineage for an asset (upstream or downstream, with depth limit)
GET /assets/{urn}/lineage?direction=upstream&depth=3
-> {
     "nodes": [
       { "urn": "urn:snowflake:prod.raw.events", "name": "events" }
     ],
     "edges": [
       {
         "upstream_urn": "urn:snowflake:prod.raw.events",
         "downstream_urn": "urn:snowflake:prod.analytics.orders",
         "detection_method": "sql_parse"
       }
     ]
   }
// Add a tag to an asset
POST /assets/{urn}/tags
{ "tag_key": "sensitivity", "tag_value": "PII" }
-> { "id": "uuid", "applied_at": "2024-01-15T09:00:00Z" }
// Register a lineage edge (called by ingestion workers)
POST /lineage
{
  "upstream_urn": "urn:snowflake:prod.raw.events",
  "downstream_urn": "urn:snowflake:prod.analytics.orders",
  "detection_method": "dbt_ref"
}
-> { "id": "uuid", "created": true }

The verb choices here are deliberate. POST /assets is an upsert, not a pure create, because ingestion workers will re-emit the same asset repeatedly as schemas evolve. GET /assets/search uses query parameters rather than a request body because search results should be bookmarkable and cacheable. The lineage endpoint returns a graph structure (nodes and edges) rather than a flat list, because the frontend needs both to render a lineage visualization.

Interview tip: When you present the lineage endpoint, mention the depth parameter proactively. Without a depth limit, a single lineage query on a heavily connected asset could traverse thousands of nodes. Interviewers will ask about this, and raising it yourself signals you've thought about graph traversal at scale.

High-Level Design

The catalog has four distinct data flows, each serving a different user need. Walk through them one at a time in your interview. Interviewers want to see you decompose the problem cleanly before you start optimizing.

1) Ingesting Metadata from Sources

Core components: Source Connectors, Kafka Metadata Topic, Metadata Ingestion Service, Metadata Store (PostgreSQL), Graph Store (Neo4j or recursive SQL).

The fundamental challenge here is that your metadata sources are wildly heterogeneous. Snowflake exposes an INFORMATION_SCHEMA. Airflow has a REST API. dbt produces a manifest.json. Kafka has a Schema Registry. You need a normalization layer that turns all of these into the same Asset/Schema/Lineage model.

Here's the flow:

  1. Source Connectors run as crawlers or receive push events from each source system. A Snowflake connector queries INFORMATION_SCHEMA.COLUMNS on a schedule; a dbt connector parses manifest.json after each dbt run; an Airflow connector listens to task completion callbacks.
  2. Each connector serializes its output as a metadata event (Avro or JSON) and publishes to a Kafka Metadata Topic. Every event carries a source_type, a canonical asset_urn, and an event_type (UPSERT, DELETE, SCHEMA_CHANGE).
  3. The Metadata Ingestion Service consumes from Kafka, normalizes the event into the Asset/Schema/Lineage model, and performs an upsert into the Metadata Store.
  4. Lineage edges extracted from dbt ref() calls or Airflow task dependencies are written to the Graph Store as directed edges.
{
  "event_type": "SCHEMA_CHANGE",
  "source_type": "snowflake",
  "asset_urn": "urn:snowflake:prod.analytics.orders",
  "schema_version": 14,
  "columns": [
    { "name": "order_id", "type": "VARCHAR", "nullable": false },
    { "name": "customer_id", "type": "VARCHAR", "nullable": false },
    { "name": "total_usd", "type": "DECIMAL(10,2)", "nullable": true }
  ],
  "detected_at": "2024-01-15T10:23:00Z"
}
Metadata Ingestion Path

The key design decision is push vs. pull. Pull (scheduled crawling) is easier to implement but introduces lag; a schema change in Snowflake might not appear in the catalog for 30 minutes if your crawler runs hourly. Push (DDL change logs, webhook callbacks) gives you near-real-time freshness but requires deeper integration with each source. In practice, you want both: push for sources that support it, pull as a fallback and consistency check.

Kafka in the middle is not optional. Without it, every connector writes directly to the Metadata Store, and you lose the ability to replay events during backfills, add new consumers (like a lineage parser) without modifying connectors, or handle ingestion spikes gracefully.
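A sketch of what the Metadata Ingestion Service's event handler might look like, assuming the event shape shown above. InMemoryStore is a stand-in for the PostgreSQL Metadata Store, not a real client API.

```python
# Stand-in for the Metadata Store so the handler sketch is runnable.
class InMemoryStore:
    def __init__(self):
        self.assets: dict = {}
        self.schemas: dict = {}  # urn -> list of schema snapshots

    def upsert_asset(self, urn, source_type):
        self.assets[urn] = {"urn": urn, "source_type": source_type}

    def append_schema_version(self, urn, columns):
        self.schemas.setdefault(urn, []).append(columns)

    def delete_asset(self, urn):
        self.assets.pop(urn, None)

def handle_metadata_event(event: dict, store: InMemoryStore) -> None:
    urn = event["asset_urn"]
    if event["event_type"] == "DELETE":
        store.delete_asset(urn)
        return
    # UPSERT and SCHEMA_CHANGE are both idempotent upserts keyed by URN.
    store.upsert_asset(urn, event["source_type"])
    if event["event_type"] == "SCHEMA_CHANGE":
        store.append_schema_version(urn, event["columns"])
```

Because every write is an idempotent upsert keyed by URN, replaying the Kafka topic during a backfill is safe: the same event applied twice leaves the store in the same state.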

Interview tip: When the interviewer asks "how fresh is the metadata?", your answer lives here. Be specific: push-based connectors give you sub-minute lag; scheduled crawlers give you lag equal to their interval. The architecture lets you tune this per source.

2) Search and Asset Discovery

Core components: Metadata Store, Elasticsearch Search Index, Catalog API.

Analysts don't browse a tree of databases and schemas. They type "orders table payments team" into a search bar and expect the right result in the top three. That's a full-text search problem, not a SQL query problem, which is why you can't serve search directly from PostgreSQL.

The flow:

  1. Every write to the Metadata Store triggers a sync to Elasticsearch. This can be synchronous (write-through) or asynchronous (CDC via Debezium reading the Postgres WAL). Async is safer at scale; a slow Elasticsearch write shouldn't block the ingestion path.
  2. A user types a query in the Frontend. The request hits the Catalog API.
  3. The Catalog API translates the query into an Elasticsearch request with filters (source type, owner, tags) and returns a ranked list of matching assets.
  4. The user clicks an asset. The Catalog API fetches the full asset record from the Metadata Store (not Elasticsearch, which only stores a search-optimized projection).
GET /catalog/search?q=orders&source=snowflake&tag=pii&limit=20

{
  "results": [
    {
      "urn": "urn:snowflake:prod.analytics.orders",
      "name": "orders",
      "description": "One row per completed order. Source of truth for revenue reporting.",
      "owner": "payments-team",
      "tags": ["pii", "revenue", "certified"],
      "quality_score": 0.94,
      "last_updated": "2024-01-15T09:00:00Z"
    }
  ],
  "total": 1
}
Search and Serve Path

Why keep Elasticsearch and the Metadata Store separate? Elasticsearch is optimized for search (inverted indexes, relevance scoring, faceted filtering) but is not a reliable source of truth. It can fall behind, lose data during a cluster failure, or return stale results. PostgreSQL is your source of truth; Elasticsearch is a read-optimized projection of it. You can always rebuild the index from the Metadata Store if it drifts.
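That rebuild can be sketched as a generator that streams assets out of the Metadata Store and yields Elasticsearch bulk actions. The document shape and index name here are illustrative assumptions:

```python
# Sketch of an index-rebuild job: stream assets from the source of truth
# and emit Elasticsearch-style bulk actions. Field selection is illustrative.
def bulk_actions(assets, index="catalog-assets"):
    for asset in assets:
        yield {
            "_index": index,
            "_id": asset["urn"],  # URN as doc id makes reindexing idempotent
            "_source": {
                "name": asset["name"],
                "description": asset.get("description", ""),
                "tags": asset.get("tags", []),
            },
        }
```

Using the URN as the document `_id` means a re-run overwrites documents rather than duplicating them, so the job can be restarted safely after a partial failure.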

Common mistake: Candidates often propose serving search directly from PostgreSQL using ILIKE queries. This works for a few hundred assets. At millions of assets across hundreds of sources, it falls apart on both latency and relevance. Elasticsearch gives you sub-100ms p99 search with proper relevance tuning.

The Catalog API should be the single entry point for all client requests. It handles auth, rate limiting, and routes to the right backend (Elasticsearch for search, PostgreSQL for detail, Graph Store for lineage). Don't let the frontend talk to three different backends directly.


3) Lineage Tracking

Core components: SQL Parser Workers, Kafka Metadata Topic, Lineage Resolver, Graph Store, Catalog API.

Lineage answers "where does this data come from, and what depends on it?" It's the feature that turns a catalog from a fancy spreadsheet into something engineers actually trust during incidents.

The flow:

  1. SQL Parser Workers pull from query logs (Snowflake query history, Spark event logs) and static manifests (dbt manifest.json). Each worker extracts table and column references from SQL statements using an AST-based parser, then publishes the extracted lineage relationships as events to the same Kafka Metadata Topic used for schema and quality metadata. Every lineage event carries the upstream and downstream asset URNs and the detection method (query log, dbt manifest, etc.).
  2. The Lineage Resolver consumes these events from Kafka, maps raw table names to canonical asset URNs, deduplicates edges (the same orders -> revenue_summary relationship might appear in thousands of query log entries), and writes directed edges to the Graph Store.
  3. When a user opens an asset page, the Catalog API traverses the Graph Store to return upstream and downstream assets up to N hops.

Routing lineage events through Kafka is what makes the architecture consistent. The Lineage Resolver can be deployed, upgraded, or replaced without touching the SQL Parser Workers. You can replay the Kafka topic to rebuild the entire Graph Store from scratch after a migration. And if you later add a second lineage consumer (say, an impact analysis service), you don't modify a single parser.

from datetime import datetime, timezone

# Simplified lineage edge deduplication. The key includes the detection
# method so it matches the table's UNIQUE (upstream, downstream, method)
# constraint; thousands of query-log hits collapse into a single edge.
def upsert_lineage_edge(upstream_urn: str, downstream_urn: str, method: str):
    edge_key = f"{upstream_urn}::{downstream_urn}::{method}"
    if not graph_store.edge_exists(edge_key):
        graph_store.create_edge(
            upstream=upstream_urn,
            downstream=downstream_urn,
            detection_method=method,
            created_at=datetime.now(timezone.utc),
        )

The graph store choice matters. Neo4j handles deep traversals (5+ hops) elegantly with Cypher queries. A recursive SQL model in PostgreSQL works fine for shallow lineage (2-3 hops) and avoids another infrastructure dependency. At scale, with millions of assets and complex pipelines, Neo4j wins on traversal performance. For a v1, recursive SQL is a reasonable starting point.

Key insight: dbt lineage is easy because manifest.json gives you the full dependency graph for free. SQL query log parsing is hard because SQL is ambiguous, queries use aliases, and dynamic SQL (string concatenation) is nearly impossible to parse statically. Be honest about this in the interview. A great answer acknowledges the limitation and proposes a fallback: use dbt lineage as the high-quality source, query logs as a best-effort supplement.

4) Freshness and Quality Signals

Core components: Airflow (Orchestrator), Quality Tools (Great Expectations, dbt tests), Signal Aggregator, Metadata Store, Search Index.

An asset page without quality signals is just documentation. With them, it becomes a trust indicator. Analysts need to know: did this table load successfully today? Did it pass its row count checks? Is it 6 hours stale?

The flow:

  1. Airflow emits task completion events (success, failure, SLA miss) after each pipeline run. These are published as metadata events to the same Kafka topic used for schema ingestion.
  2. dbt test results and Great Expectations checkpoint results are similarly pushed as quality check events, carrying pass/fail status per check and the asset URN they apply to.
  3. The Signal Aggregator consumes these events and computes two things per asset: a rolling quality score (percentage of checks passing over the last N runs) and a freshness lag (time since the last successful load).
  4. These signals are written back to the Metadata Store and synced to Elasticsearch, so stale or failing assets can be demoted in search rankings.
CREATE TABLE asset_quality_signals (
    asset_urn       VARCHAR(512) PRIMARY KEY REFERENCES assets(urn),
    quality_score   FLOAT NOT NULL DEFAULT 1.0,    -- 0.0 to 1.0, rolling 7-day pass rate
    last_run_at     TIMESTAMP,                     -- last successful pipeline run
    last_run_status VARCHAR(20),                   -- 'success', 'failure', 'sla_miss'
    sla_breach      BOOLEAN NOT NULL DEFAULT FALSE,
    updated_at      TIMESTAMP NOT NULL DEFAULT now()
);

The freshness signal is deceptively simple but operationally important. An asset with last_run_at more than 25 hours ago on a daily pipeline is stale. The catalog should surface this prominently, not bury it in a metadata field. Analysts who build dashboards on stale data and don't know it are the exact problem a catalog exists to prevent.
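A minimal version of that staleness rule, with the 24-hour interval and a 1-hour grace period as assumed defaults:

```python
from datetime import datetime, timedelta

# Freshness check for a daily pipeline: stale if the last successful run is
# older than the expected interval plus a grace period. Thresholds assumed;
# in practice the interval would come from the pipeline's schedule.
def is_stale(last_run_at: datetime, now: datetime,
             expected_interval: timedelta = timedelta(hours=24),
             grace: timedelta = timedelta(hours=1)) -> bool:
    return (now - last_run_at) > (expected_interval + grace)
```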

Interview tip: If the interviewer asks "how do you handle a source that doesn't emit quality events?", the answer is: you fall back to crawl-based freshness. The connector checks last modified timestamps from the source system (Snowflake's LAST_ALTERED in INFORMATION_SCHEMA.TABLES, for example) and infers freshness from that. It's less precise but better than nothing.

Putting It All Together

The full architecture has two planes: an ingestion plane that keeps metadata current, and a serving plane that answers user queries.

On the ingestion side, source connectors push metadata events to Kafka. The Ingestion Service normalizes and writes to the Metadata Store and Graph Store. SQL Parser Workers publish lineage events to the same Kafka topic; the Lineage Resolver consumes them and writes edges to the Graph Store. Quality signals from Airflow and dbt flow through Kafka and land in the Metadata Store via the Signal Aggregator. Every write to the Metadata Store triggers a sync to Elasticsearch.

On the serving side, the Catalog API is the single gateway. Search queries go to Elasticsearch. Asset detail requests go to the Metadata Store. Lineage traversals go to the Graph Store. The Frontend renders all of this in a single asset page: name, description, schema, owner, tags, quality score, freshness, and lineage graph.

The Metadata Store (PostgreSQL) is the source of truth for everything. Elasticsearch and the Graph Store are derived, queryable projections of it. If either goes down, you rebuild from Postgres.

At the scale described in the requirements (millions of assets, hundreds of sources), the Kafka topic and Ingestion Service need to be partitioned by asset_urn to ensure ordered processing per asset. The Catalog API should be stateless and horizontally scalable. Elasticsearch should run as a multi-node cluster with replicas for read availability.
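Per-asset ordering falls out of the partitioning scheme: hash the URN to pick a partition, and all events for one asset land on the same partition in publish order. A sketch, with the partition count as an assumption:

```python
import hashlib

# Keying messages by asset URN: every event for a given asset hashes to the
# same partition, so the ingestion service consumes them in order. md5 is
# used because Python's built-in hash() is salted per process.
def partition_for(asset_urn: str, num_partitions: int = 12) -> int:
    digest = hashlib.md5(asset_urn.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Events for different assets can still interleave freely across partitions, which is what lets the consumer group scale horizontally.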

Key insight: The catalog is itself a data pipeline. Treat it like one. Metadata events are your raw data. The Metadata Store is your curated layer. Elasticsearch is your serving layer. The same principles you'd apply to any data platform (idempotency, schema evolution, backfill support) apply here too.

Deep Dives

These are the questions that separate candidates who've read about data catalogs from those who've actually built them. Each one probes a specific failure mode that shows up in production.


"How do we keep metadata fresh when sources change?"

This is the question interviewers love because it has no clean answer. Every approach involves a trade-off between freshness, load on source systems, and operational complexity.

Bad Solution: Scheduled Full Crawls

The naive approach is a nightly cron job that connects to every source (Snowflake, Hive, Kafka) and re-scans all schemas. Simple to implement, easy to reason about.

The problem is that "nightly" means your catalog can be 23 hours stale. A data engineer drops a column at 9am; analysts are querying a catalog that still shows it until the next morning. At scale, full crawls also hammer source systems. Scanning millions of Snowflake tables on a schedule is expensive and slow.

Warning: Candidates who only propose scheduled crawls are implicitly accepting hours of metadata lag. Interviewers will push back with "what if a schema change breaks a downstream pipeline at 10am?" Have an answer ready.

Good Solution: Push-Based Ingestion via Source Connectors

Instead of pulling on a schedule, you flip the model. Source systems push metadata change events when something actually changes. Snowflake exposes DDL change logs; Kafka emits schema registry events; dbt generates a manifest.json on every run that captures the full schema and lineage graph.

Each connector translates these native events into a normalized MetadataChangeEvent and publishes to a Kafka metadata topic. The ingestion service consumes from that topic and writes to the metadata store. Lag drops from hours to minutes.

# Simplified MetadataChangeEvent schema
@dataclass
class MetadataChangeEvent:
    asset_urn: str          # e.g. urn:snowflake:prod.analytics.orders
    event_type: str         # SCHEMA_CHANGE | ASSET_CREATED | ASSET_DELETED
    source: str             # snowflake | hive | dbt | kafka
    schema_snapshot: dict   # current column definitions
    detected_at: datetime
    raw_payload: dict       # source-specific metadata for debugging

The trade-off: you're now dependent on each source system's change notification mechanism. Hive doesn't have great DDL event support. Some sources require polling anyway, just at a finer granularity.

Great Solution: Event-Driven Crawling with Schema Drift Detection

The real production answer combines push and pull. Push connectors handle sources that support it (Snowflake, dbt, Kafka). For sources that don't, you run lightweight incremental crawlers triggered by signals rather than clocks. An Airflow task completing is a signal to re-crawl the tables it touched. A new dbt run is a signal to re-ingest the manifest.

On top of that, you add a Schema Diff Worker that compares every incoming schema snapshot against the latest stored version and classifies the change: additive (new column), breaking (column dropped or type changed), or cosmetic (description updated). Breaking changes trigger alerts to asset owners before analysts hit the problem.

def classify_schema_change(old_schema: dict, new_schema: dict) -> str:
    old_cols = {c["name"]: c["type"] for c in old_schema["columns"]}
    new_cols = {c["name"]: c["type"] for c in new_schema["columns"]}

    dropped = set(old_cols) - set(new_cols)
    type_changed = {
        col for col in old_cols & new_cols
        if old_cols[col] != new_cols[col]
    }

    if dropped or type_changed:
        return "BREAKING"
    elif set(new_cols) - set(old_cols):
        return "ADDITIVE"
    return "COSMETIC"

Each non-cosmetic change creates a new schema version in the metadata store. Consumers can see exactly when a column was added or dropped, which is invaluable for debugging pipeline failures after a schema change.

Tip: Mentioning schema versioning with breaking change classification is strong senior signal. It shows you've thought about the downstream impact of metadata changes, not just the ingestion mechanics.
Push vs Pull Ingestion and Schema Drift Detection

"How do we make search actually useful?"

Most catalog implementations get ingestion working and then treat search as an afterthought. The result is a search bar that returns 200 tables matching "orders" with no way to know which one to trust.

Bad Solution: Default Text Relevance

Out of the box, Elasticsearch will match on asset names and descriptions. Type "orders" and you get every table with "orders" in the name, ranked by text relevance score. That's it.

The problem is that text relevance is the wrong signal for data discovery. A table named orders_v2_final_FINAL_use_this might score lower than an abandoned orders_test table from three years ago. Analysts learn quickly that the search results can't be trusted, and they go back to asking on Slack.

Warning: Don't just say "use Elasticsearch." Interviewers expect you to explain how you'd tune relevance. "We'd index the asset name and description" is not a complete answer.

Good Solution: Boosted Search with Metadata Signals

The fix is enriching each asset's Elasticsearch document with signals that proxy for trustworthiness, then using function score queries to boost results accordingly.

The four signals that matter most: query popularity (how often is this asset queried in the warehouse), recency (when was it last updated), ownership completeness (does it have an owner, description, and tags), and quality score (are its dbt tests passing). An asset that's queried 10,000 times a day, has a description, and passes all quality checks should surface above a stale undocumented table even if the text match is slightly weaker.

{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "orders",
          "fields": ["name^3", "description^2", "tags.value", "owner_name"]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity_score",
            "factor": 1.5,
            "missing": 1
          }
        },
        {
          "field_value_factor": {
            "field": "completeness_score",
            "factor": 1.2,
            "missing": 0.5
          }
        },
        {
          "gauss": {
            "last_updated": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

This gets you meaningfully better results. The remaining gap is that popularity scores need to be computed somewhere and kept fresh in the index.

Great Solution: Popularity Service with Async Score Updates

The completeness score is easy to compute at write time. Popularity is harder because it comes from outside the catalog: warehouse query logs, BI tool usage, notebook access patterns.

A dedicated Popularity Service tails Snowflake's QUERY_HISTORY view and BI tool audit logs, computes a rolling 30-day query frequency per asset URN, and pushes score updates to Elasticsearch asynchronously. This decouples the hot search path from the analytics computation. Search reads a pre-computed score; the score gets refreshed every few hours.
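A minimal sketch of the scoring step, assuming the service has already aggregated 30-day query counts per URN (the function name and normalization scheme here are illustrative). Log-scaling dampens the long tail so one hyper-hot table doesn't swamp every other boost:

```python
import math

def compute_popularity_score(query_count_30d: int, max_count: int) -> float:
    """Log-scaled popularity in [0, 1]: 0 queries -> 0.0, the busiest asset -> 1.0."""
    if query_count_30d <= 0 or max_count <= 0:
        return 0.0
    return math.log1p(query_count_30d) / math.log1p(max_count)
```

The Popularity Service would run this over every asset after each refresh and bulk-update the `popularity_score` field in Elasticsearch.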

The completeness score is computed inline when an asset is written to the metadata store: count how many of the five key fields (description, owner, at least one tag, schema version, source type) are populated and normalize to 0-1. Write it to the metadata store and propagate it to the Elasticsearch document in the same write path; a true cross-store transaction isn't available, so the sync job reconciles any failed propagations.

def compute_completeness_score(asset: dict) -> float:
    """Fraction of key metadata fields that are populated, in [0, 1]."""
    fields = [
        bool(asset.get("description")),
        bool(asset.get("owner_id")),
        bool(asset.get("tags")),
        bool(asset.get("schema_version_id")),
        bool(asset.get("source_type")),
    ]
    return sum(fields) / len(fields)
Tip: Framing completeness score as a first-class metric that feeds back into search ranking is a detail most candidates miss. It creates a virtuous cycle: teams that document their assets get more discoverability, which incentivizes documentation.
Relevance-Tuned Asset Search

"How do we compute and serve column-level lineage at scale?"

Table-level lineage tells you that dim_customers feeds orders_summary. Column-level lineage tells you that orders_summary.revenue comes from orders.amount multiplied by orders.quantity. The second is dramatically more useful for impact analysis, and dramatically harder to build.

Bad Solution: Regex-Based SQL Parsing

The tempting shortcut is regex. Scan query logs for FROM table_name and JOIN table_name patterns, extract table references, and call it lineage. Fast to implement, works for simple queries.

Regex breaks on anything real. CTEs, subqueries, aliased tables, INSERT INTO ... SELECT, dynamic SQL with string interpolation. A query like SELECT a.col FROM (SELECT * FROM orders) a will fool most regex parsers. You end up with missing edges and phantom edges, which is worse than no lineage because analysts trust it.

Warning: If you propose regex parsing for SQL lineage in an interview, expect immediate pushback. It signals you haven't thought about the actual complexity of SQL in production environments.

Good Solution: AST-Based SQL Parsing

A proper abstract syntax tree parser handles the full SQL grammar. Libraries like sqlglot (Python) can parse SQL from Snowflake, Spark, BigQuery, and Hive dialects and return a structured tree you can walk to extract all table and column references.

import sqlglot
from sqlglot import exp

def _qualified(table: exp.Table) -> str:
    # table.db is empty for unqualified references; join only non-empty parts
    return ".".join(p for p in (table.db, table.name) if p).lower()

def extract_lineage(sql: str, dialect: str = "snowflake") -> dict:
    tree = sqlglot.parse_one(sql, dialect=dialect)

    sources = set()
    targets = set()

    # Note: CTE aliases also parse as Table nodes; production code should
    # exclude names defined by the query's own CTEs before emitting edges.
    for table in tree.find_all(exp.Table):
        sources.add(_qualified(table))

    # For INSERT / CREATE TABLE ... AS SELECT, the first Table node is the target
    if isinstance(tree, (exp.Insert, exp.Create)):
        target = tree.find(exp.Table)
        if target:
            targets.add(_qualified(target))
            sources.discard(_qualified(target))

    return {"sources": list(sources), "targets": list(targets)}

AST parsing handles CTEs, subqueries, and most dialect-specific syntax. The remaining gap is dynamic SQL: queries built as strings at runtime in stored procedures or application code. You can't parse what you can't see.

Great Solution: Multi-Source Lineage with URN Resolution and Graph Traversal Optimization

Production lineage comes from multiple sources simultaneously: Snowflake query history, Spark event logs, dbt manifest.json, and Airflow task dependencies. Each source gives you a different slice of the lineage graph, and they overlap.

The Lineage Resolver's job is to take parsed references from all these sources and map them to canonical asset URNs. analytics.orders in a Snowflake query and ref('orders') in a dbt model are the same asset. The resolver maintains a URN mapping table and deduplicates edges before writing to the graph store.
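A toy version of that resolution step, assuming a hand-maintained alias table (the mapping format and URN scheme here are illustrative, not any particular catalog's convention):

```python
def canonical_urn(source: str, ref: str, alias_map: dict[str, str]) -> str:
    """Map a source-specific reference to a canonical asset URN.

    alias_map holds known equivalences, e.g. a dbt ref resolved to the
    warehouse relation it compiles to. Unknown refs fall back to a
    source-qualified URN so the edge is never silently dropped.
    """
    key = f"{source}:{ref.lower()}"
    if key in alias_map:
        return alias_map[key]
    return f"urn:asset:{source}:{ref.lower()}"

# Illustrative equivalences: a dbt ref and a Snowflake relation, same asset
aliases = {
    "dbt:ref('orders')": "urn:asset:snowflake:prod.analytics.orders",
    "snowflake:analytics.orders": "urn:asset:snowflake:prod.analytics.orders",
}
```

With both references mapped to one URN, edge deduplication becomes a simple uniqueness check on `(upstream_urn, downstream_urn)` pairs.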

CREATE TABLE lineage_edges (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    upstream_urn    VARCHAR NOT NULL,
    downstream_urn  VARCHAR NOT NULL,
    granularity     VARCHAR NOT NULL,      -- 'TABLE' or 'COLUMN'
    upstream_col    VARCHAR,               -- NULL for table-level edges
    downstream_col  VARCHAR,
    detection_method VARCHAR NOT NULL,     -- 'sql_parse' | 'dbt_ref' | 'airflow_dep'
    confidence      FLOAT NOT NULL DEFAULT 1.0,
    created_at      TIMESTAMP NOT NULL DEFAULT now(),
    -- PG15+: NULLS NOT DISTINCT so table-level edges (NULL col fields) still dedupe
    UNIQUE NULLS NOT DISTINCT (upstream_urn, downstream_urn, upstream_col, downstream_col)
);

CREATE INDEX idx_lineage_downstream ON lineage_edges(downstream_urn);
CREATE INDEX idx_lineage_upstream ON lineage_edges(upstream_urn);

For graph traversal, the key interview question is: how do you handle deep lineage chains without blowing up query time? A naive recursive CTE works for 3-4 hops but gets slow at 10+. The answer is to pre-compute and cache lineage subgraphs for high-traffic assets. When an analyst views orders_summary, you serve the cached 3-hop neighborhood; full traversal is a background job.
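The hop-limited traversal can be sketched as a plain BFS over an in-memory edge map (in production this would read from the lineage_edges table or a graph store; the adjacency-dict shape here is an assumption for illustration):

```python
from collections import deque

def neighborhood(start: str, upstream_of: dict[str, list[str]], max_hops: int = 3) -> set[str]:
    """Return all assets within max_hops upstream of start.

    upstream_of maps an asset URN to the URNs it reads from. Capping
    depth keeps the hot path bounded; deeper traversal runs async.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # stop expanding past the hop budget
        for parent in upstream_of.get(node, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return seen - {start}
```

The cached subgraph for a high-traffic asset is just this function's output, precomputed and invalidated when a new edge touches any node in it.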

Tip: Bringing up confidence scores on lineage edges (lower confidence for SQL-parsed edges, higher for explicit dbt refs) shows you understand that lineage is probabilistic in practice. Not all edges are equally trustworthy.
Column-Level Lineage via SQL Parsing

"How do we surface data quality and freshness signals on asset pages?"

An analyst finds a table in the catalog. It has a description and an owner. But is it safe to use? Is it up to date? Did it fail validation last night? Without quality signals, the catalog is a phone book, not a trusted data platform.

Bad Solution: Manual Status Fields

The simplest approach is letting asset owners manually set a "status" field: Trusted, Experimental, Deprecated. Some teams do this.

It doesn't scale and it doesn't stay accurate. Owners forget to update status when pipelines start failing. A "Trusted" table that hasn't been updated in six months is worse than no label at all because it creates false confidence. Manual signals decay immediately.

Good Solution: Pipeline Run Events from Airflow

Wire Airflow to emit a metadata event on every task completion. Each event carries the asset URN, run timestamp, success/failure status, and row count. The ingestion service processes these events and updates a pipeline_runs table in the metadata store.

From this you can compute two simple signals: freshness lag (time since last successful run) and recent failure rate (failures in the last 7 days / total runs). Both are objective, automatically updated, and directly useful to analysts.
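Both signals reduce to a few lines over the pipeline_runs rows (the run-record shape assumed here, a dict with `run_at` and `status`, is illustrative):

```python
from datetime import datetime, timedelta

def freshness_lag_hours(last_success: datetime, now: datetime) -> float:
    """Hours since the last successful pipeline run."""
    return (now - last_success).total_seconds() / 3600

def recent_failure_rate(runs: list[dict], now: datetime, window_days: int = 7) -> float:
    """Fraction of runs in the trailing window that failed; 0.0 if no runs."""
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in runs if r["run_at"] >= cutoff]
    if not recent:
        return 0.0
    return sum(r["status"] == "FAILURE" for r in recent) / len(recent)
```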

# Airflow callback (registered as both on_success_callback and
# on_failure_callback) to emit a metadata event on task completion
def emit_catalog_event(context):
    # Convention: each task declares the asset it materializes in its params
    asset_urn = context["params"]["asset_urn"]
    event = {
        "asset_urn": asset_urn,
        "event_type": "PIPELINE_RUN",
        "status": "FAILURE" if context.get("exception") else "SUCCESS",
        "run_at": context["logical_date"].isoformat(),
        "rows_written": context["task_instance"].xcom_pull(key="rows_written"),
    }
    kafka_producer.send("metadata-events", value=event)

The gap here is that pipeline success doesn't mean data quality. A pipeline can succeed and write 0 rows, or write rows with 40% null values in a critical column.

Great Solution: Quality Score Aggregation with dbt Test Integration and Search Demotion

The full picture combines pipeline run events with data quality test results. dbt tests and Great Expectations both produce structured pass/fail results per check, per asset. A Signal Aggregator consumes both streams and computes a rolling quality score.

The quality score is a weighted average: pipeline reliability (last 7 days), dbt test pass rate (last run), and freshness (hours since last successful update vs. expected SLA). Each component is normalized to 0-1 and weighted by severity.

from datetime import datetime

def compute_quality_score(
    pipeline_success_rate: float,   # 0-1, last 7 days
    test_pass_rate: float,          # 0-1, last dbt run
    freshness_score: float,         # 1.0 if on time, decays toward 0
) -> float:
    weights = {
        "pipeline": 0.3,
        "tests": 0.5,
        "freshness": 0.2,
    }
    return (
        pipeline_success_rate * weights["pipeline"]
        + test_pass_rate * weights["tests"]
        + freshness_score * weights["freshness"]
    )

def compute_freshness_score(
    last_success: datetime,
    expected_sla_hours: int,
) -> float:
    lag_hours = (datetime.utcnow() - last_success).total_seconds() / 3600
    if lag_hours <= expected_sla_hours:
        return 1.0
    # Linear decay: score hits 0 at 3x the SLA
    return max(0.0, 1.0 - (lag_hours - expected_sla_hours) / (2 * expected_sla_hours))

The quality score then feeds back into Elasticsearch. Assets with a quality score below 0.5 get a penalty multiplier in search ranking and a visual "stale" or "failing" badge in the UI. This closes the loop: quality signals don't just live on the asset detail page, they affect discoverability. Analysts naturally find healthy assets first.
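One way to express the demotion, assuming `quality_score` is already synced into each Elasticsearch document: wrap the base query in a `function_score` that down-weights assets below the threshold (the 0.5 threshold and penalty weight are illustrative choices, not fixed values):

```python
def with_quality_demotion(base_query: dict, threshold: float = 0.5, penalty: float = 0.5) -> dict:
    """Wrap an ES query body so low-quality assets are down-ranked, not hidden."""
    return {
        "function_score": {
            "query": base_query,
            "functions": [
                {
                    # Only assets under the quality threshold get the penalty
                    "filter": {"range": {"quality_score": {"lt": threshold}}},
                    "weight": penalty,
                }
            ],
            "boost_mode": "multiply",
        }
    }
```

Demoting rather than filtering matters: a failing asset may still be exactly what the on-call engineer is looking for, so it stays findable, just not first.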

Tip: The search demotion piece is what makes this a great answer. Surfacing quality signals only on the detail page means analysts have to click in before they know there's a problem. Demoting failing assets in search results means they're less likely to be discovered in the first place. That's the difference between a catalog that informs and one that actually guides behavior.
Data Quality and Freshness Signal Pipeline

What is Expected at Each Level

Mid-Level

  • Define the five core entities (Asset, Schema, Lineage, Tag, Owner) and explain why the Asset URN is the stable identifier that ties everything together.
  • Sketch both the ingestion path (source connectors to Kafka to Metadata Store) and the search path (Metadata Store synced to Elasticsearch, served via Catalog API).
  • Explain why you can't just query the metadata store for search. The interviewer will push on this. The answer is latency and relevance: Elasticsearch gives you sub-200ms full-text search with faceting that a relational store can't match.
  • Design a basic API: GET /assets?q=orders&source=snowflake, GET /assets/{urn}, GET /assets/{urn}/lineage. You don't need to go deep, but you need to show the shape of the interface.

Senior

  • Go beyond table-level lineage. Explain how SQL parsing (AST-based, not regex) extracts column-level references from query logs and dbt manifests, and why column-level lineage is what actually unblocks impact analysis when a column gets renamed.
  • Articulate the push vs. pull ingestion trade-off clearly. Pull (scheduled crawling) is simpler to operate but introduces lag. Push (DDL change events from Snowflake streams or Kafka schema registry hooks) is fresher but requires source-side integration work. Know when each is appropriate.
  • Design schema drift detection: the Schema Diff Worker compares incoming schema versions, classifies changes as breaking (column dropped, type changed) vs. non-breaking (column added, description updated), and routes alerts to asset owners.
  • Explain how quality signals feed back into search ranking. An asset with three consecutive dbt test failures should be demoted in results, not just flagged on its detail page. Freshness lag and SLA breach status belong in the Elasticsearch document, not just the metadata store.
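The schema drift classification in the Schema Diff Worker bullet above can be sketched as a column-by-column diff (the `{column: type}` map shape is an assumption for illustration; real schemas carry nullability, descriptions, and nesting too):

```python
def classify_schema_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Compare {column: type} maps; dropped columns or type changes are breaking."""
    dropped = old.keys() - new.keys()
    changed = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    if dropped or changed:
        return "BREAKING"
    if new.keys() - old.keys():
        return "NON_BREAKING"  # additive changes are safe for downstream readers
    return "NO_CHANGE"
```

A breaking classification would route an alert to the asset owner and to owners of downstream assets via the lineage graph.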

Staff+

  • Tackle multi-source federation head-on. When a Hive crawler and a dbt manifest both register prod.analytics.orders, you need a URN normalization layer that deduplicates them into a single canonical asset and merges their metadata. This is harder than it sounds; source systems use different naming conventions, and you need a resolver that maps hive://prod/analytics/orders and snowflake://prod.analytics.orders to the same node.
  • Design a pluggable connector SDK so that individual teams can self-serve new source integrations without touching the core catalog codebase. The SDK defines the event schema (what a metadata event looks like), authentication contracts, and retry semantics. Teams ship a connector; the platform consumes it.
  • Address access control integration. The catalog shouldn't just show what exists; it should show what a given user is allowed to see. That means the Catalog API needs to intersect asset results with the user's permissions from whatever access control system (Ranger, IAM, column-level masking policies) governs the underlying data. Candidates who raise this without prompting signal real platform maturity.
  • Think about operational evolution: how do you backfill lineage for assets that existed before the catalog was deployed? How do you handle a connector that goes silent for six hours? Staff candidates have opinions on these failure modes, not just the happy path.
Key takeaway: A data catalog is only as valuable as the trust it earns. Every architectural decision, from sub-200ms search latency to schema drift alerts to quality signals in rankings, exists to answer one question for the analyst staring at a search result: "Can I use this data right now?" Design for that question, and the rest follows.
Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
