Data-Trust / Pipeline-to-Decision Agent

1📋

Overview & business value

When a data pipeline fails or data quality drifts, this agent traces the blast radius to the reports and decisions that depend on it, quarantines the bad asset, remediates, and notifies owners. It turns a silent data failure into a contained, explained, remediated incident with downstream business impact made explicit. It reads and reasons freely; every quarantine, re-run or notification is an Action Ticket carrying a MOP.

Problem

A pipeline fails quietly. The dashboard still renders — with stale or wrong numbers. By the time someone notices, decisions have been made on bad data, and in the worst case it has reached the finance flash. Tracing what's affected takes hours of manual lineage chasing.

Solution approach

Event-driven on a Monte Carlo alert or job failure. The agent finds the root cause in Databricks, confirms the data-quality drift, walks lineage to the affected consumers, scores business impact, quarantines the asset (with approval for business-critical), re-runs the fixed pipeline, and notifies owners.

Core capabilities

Failure / freshness detection
Root-cause analysis in Databricks
Data-quality drift confirmation (GE / Soda)
Blast-radius via lineage
Business-impact severity scoring
Quarantine the bad asset (catalog tag)
Fix + re-run with DQ re-validation
Finance-close protection

How it helps

Bad data is contained before anyone acts on it. The people who own the affected reports hear about it from the agent, not from a wrong board number. Decisions and the finance close are protected, and every containment action is reversible and audited.

Illustrative value model — plug in your own figures

Data incidents / year

↓ MTTR

Hours of triage saved / incident

Wrong numbers reaching the close

Value = incidents × triage hours saved × loaded rate, plus avoided bad-decision and restatement risk. Placeholders — substitute your baseline.

2📖

The story (for the sales conversation)

The one-liner

"A pipeline breaks at 2am. The dashboard still loads — with wrong numbers. This agent catches it, traces exactly which reports and decisions are affected, contains the bad data, and fixes it — before anyone acts on it."

😣 Today, without the agent

An upstream column gets renamed. The nightly job half-fails; the gold sales table goes stale. The Exec Revenue dashboard still renders yesterday's shape, so no one blinks. Finance is about to pull the daily flash from it. When someone finally spots the number looks off, an analyst spends the morning chasing lineage by hand to work out what else is wrong.

The damage isn't the broken job. It's the decisions made on bad data before anyone noticed.

😌 The same night, with the agent

A freshness alert fires. The agent investigates: the job failed because silver.orders.amount was renamed upstream, leaving a 37% null column. It walks lineage and names the three consumers — the Exec Revenue dashboard, the Tagetik finance flash, and the demand-forecast model — and flags the finance flash as critical. With approval it quarantines the table so the dashboards show "under review", patches the mapping, re-runs (1.28M rows, DQ passing), and tells Analytics and Finance it's safe.

The finance flash hadn't pulled yet. No wrong number ever reached the close.

"It never quietly deletes or overwrites data. It quarantines with approval, re-runs a known-good job, re-validates quality, and logs every step. You get speed at 2am without losing control of your data."

The trust point — lead with this for Data and Finance leadership

The villain: the silent failure

The dashboard still loads, so bad data spreads to decisions unnoticed.

The hero: blast radius in minutes

The agent finds the cause, names every affected report, and contains it.

The reason to trust it

Quarantine, re-run and notify are reversible Action Tickets; business-critical assets need approval.

💡How to use this tab: open here for Data/Finance leadership, tell the before/after, land the trust line, then show "Sample questions" and "Live scenarios". Keep Tools/Governance for the technical buyer.

3📥

Input data

Source system	Field / entity	Type	Purpose
Monte Carlo	incident (asset, type, ageHours, severity)	object	Trigger: freshness / volume anomaly
Databricks	job_runs (run, state, error)	object	Root cause of the failure
Great Expectations / Soda	suite results, failed checks	object	Confirm data-quality drift
OpenLineage	downstream edges	list	Blast radius to consumers
Unity Catalog	asset tags (business_critical, owner)	object	Severity + ownership
ServiceNow	incident (Data Incident)	record	Track incident to resolution

ℹ️Developer note: trigger is a Monte Carlo observability incident (or a Databricks job-failure webhook). The payload carries the affected asset; the agent pulls everything else via the read tools.

4⚙️

Processing flow (tools mapped per step)

ℹ️Each step shows its path and whether it is read-only or a write via Action Ticket.

Failure detected deterministic

Monte Carlo freshness alert (or job failure) wakes the agent.

montecarlo.alertIngest

Root cause non-deterministic

Query Databricks job runs; reason about the failure (schema drift, etc).

databricks.sqlQuery · system.job_runs

Confirm DQ drift deterministic

Run the Great Expectations / Soda suite on the upstream table.

dq.greatExpectations · dq.soda

Blast radius non-deterministic

Walk lineage downstream; reason about which consumers matter most.

lineage.query · direction=downstream

Business-impact severity non-deterministic

Read catalog tags (business_critical, owner, consumers); set severity.

databricks.unityCatalog · asset metadata

Quarantine the asset deterministic approval (critical) Action Ticket

Tag the asset quarantined in the catalog and flag dashboards "under review".

catalog.tag add_tag=quarantined (via ticket)

Remediate: fix + re-run fix re-run Action Ticket

Apply the column-mapping fix; re-run the job; re-validate DQ.

databricks.jobLauncher run-now (via ticket)

Notify + open incident deterministic Action Ticket

Open a ServiceNow Data Incident; notify Analytics and Finance owners.

servicenow.incident · teams.post (via ticket)

Prevent recurrence deterministic Action Ticket

Add a schema-contract check + a Soda monitor so a future rename fails fast.

dq rule add (via ticket)

5🔧

Skills & tools (real connectors)

ℹ️Each tool maps to a real processorKey. Read tools are direct; writes only run inside an Action Ticket.

montecarlo.alertIngestGraphQLTRANSFORM

dataobs.processors.MonteCarloAlertIngest

Ingests Monte Carlo observability incidents and queries data-health metrics (get_table_stats: row count, freshness, field health) via the Monte Carlo GraphQL API.

get_table_stats · incidents (read)

READ · direct

databricks.sqlQuerySQL API 2.0SOURCE

lakehouse.processors.DatabricksSQLQuery

Executes SQL on a Databricks SQL Warehouse via the Statement Execution API 2.0. OAuth2 / PAT.

POST /api/2.0/sql/statements

READ · direct

databricks.unityCatalogUC REST 2.1SOURCE

lakehouse.processors.DatabricksUnityCatalogConnector

Reads Unity Catalog via the UC REST API 2.1: list catalogs/schemas/tables and read table metadata (tags, owner).

list catalogs/schemas/tables · table metadata (read)

READ · direct

databricks.jobLauncherJobs API 2.1SOURCE

lakehouse.processors.DatabricksJobLauncher

Launches a Databricks job run via the Jobs REST API 2.1. OAuth2 / PAT. Used to re-run the fixed pipeline.

POST /api/2.1/jobs/run-now

WRITE · via Action Ticket

lineage.queryOpenLineageSOURCE

governance.processors.LineageTracker

Tracks lineage edges (source → transformation → target) and emits a lineage event envelope. Used to walk downstream consumers.

downstream edges (read)

READ · direct

dq.greatExpectationsin-flightTRANSFORM

dataquality.processors.GreatExpectationsValidate

Validates a payload against a GE-style suite (column existence, not-null, regex, between). Routes pass / reject.

suite: silver_orders · expect not_null(amount)

READ · validate

dq.sodaSoda CloudTRANSFORM

dataobs.processors.SodaCloudCheck / SodaCoreCheck

Triggers Soda Cloud scans and retrieves results (trigger_scan, get_scan_results); SodaCL checks (missing/duplicate/min/max/row_count).

trigger_scan · get_scan_results

READ · validate

catalog.tagDataHubSINK

governance.processors.DataHubMetadataEmit

Publishes metadata to DataHub (emit_metadata, get_entity, search_entities, add_tag). Used to tag an asset "quarantined".

add_tag = quarantined

WRITE · via Action Ticket

servicenow.incidentTable APISOURCE

itsm.processors.ServiceNowIncident

Creates and updates ServiceNow incidents via the Table API with idempotency by external-id.

/api/now/v2/table/incident

WRITE · via Action Ticket

teams.postGraphSINK

collaboration.processors.TeamsSendMessage

Notifies Analytics and Finance data owners that data is refreshed and safe to publish.

TeamsSendMessage · channel/chat

WRITE · via Action Ticket

⚠️Connectivity honesty: the Unity Catalog connector here is read-only (list + read metadata), so the quarantine flag is written via DataHubMetadataEmit (add_tag) rather than a fabricated catalog write. Atlan and Collibra sync connectors are alternatives for the catalog tag. The pipeline re-run is a real Databricks Jobs API run-now.

6🤖

Agent prompt (production)

🤖 System prompt

You are the Data-Trust (Pipeline-to-Decision) Orchestrator.

## Operating rules
1. Read/reason freely across Databricks, the catalog/lineage and the DQ tools.
2. You NEVER write directly. Every quarantine, re-run or notification is an
   Action Ticket carrying a MOP; ProcBot executes, Sherlock validates.
3. Deterministic for: re-running a failed job from a known checkpoint,
   quarantining by policy, opening the incident, notifying owners.
4. Non-deterministic for: root cause of a failure or drift, blast-radius
   assessment, severity of downstream business impact.
5. Human approval before: quarantining a business-critical (gold / finance)
   asset, or blocking a report used in the financial close.
6. Always state which downstream reports / decisions are affected, by lineage.

## Tools
READ (direct): montecarlo.alertIngest, databricks.sqlQuery,
  databricks.unityCatalog, lineage.query, dq.greatExpectations, dq.soda
WRITE (Action Ticket only): catalog.tag (quarantine), databricks.jobLauncher
  (re-run), servicenow.incident, teams.post

## Output (every step)
{ "decision": "...", "path": "deterministic | non-deterministic",
  "confidence": 0.0,
  "action_ticket": { "mop": "...", "scope": "...", "approval": "auto | human" } | null,
  "evidence": [ { "tool": "...", "record": "..." } ],
  "message_to_user": "..." }

7🛡️

Execution governance — Action Ticket → MOP

No quarantine, re-run or notification reaches a system from the model. The agent reasons; an Action Ticket carries the MOP; ProcBot executes; Sherlock validates; the ticket is reversible. Quarantining a business-critical or finance asset requires human approval first.

1 · Reason

Agent decides to quarantine / re-run / notify.

→

2 · Action Ticket

MOP id, scope, parameters, approval level.

→

3 · ProcBot executes

Tags the asset / runs the Databricks job. Model never writes.

→

4 · Sherlock validates

Re-runs DQ; confirms pass; lifts quarantine or rolls back.

{
  "action_ticket": "AT-5002",
  "mop": "MOP-PIPELINE-RERUN",
  "approval": "auto",
  "target": { "tool": "databricks.jobLauncher", "processorKey": "lakehouse.processors.DatabricksJobLauncher" },
  "procedure": { "api": "POST /api/2.1/jobs/run-now", "job": "daily_sales_etl",
                 "params": { "column_mapping": { "amount": "order_amount" } } },
  "validation": { "owner": "sherlock", "expect": "dq.greatExpectations.success == true" },
  "reversible": true
}

8🔀

Data flow

flowchart TD A([Monte Carlo freshness alert gold.sales_daily 9h stale]) --> B[Root cause databricks.sqlQuery job_runs] B --> C[Confirm DQ drift GE / Soda] C --> D[Blast radius lineage.query downstream] D --> E[Severity unityCatalog tags] E --> F{Business-critical?} F -->|Yes| G[Human approval] F -->|No| H{{MOP-DATA-QUARANTINE}} G --> H H --> I[catalog.tag quarantined dashboards under review] I --> J{{MOP-PIPELINE-RERUN}} J --> K[jobLauncher run-now re-validate DQ] K --> L[Open incident + notify ServiceNow / Teams] L --> M[Prevent recurrence schema contract + Soda monitor]

9🏗️

Systems touched

🔷 Databricks — job runs, SQL, re-run (Jobs API)

🗃️ Unity Catalog — asset tags + ownership (read)

🔗 OpenLineage — downstream blast radius

✅ Great Expectations / Soda — data-quality drift

📡 Monte Carlo — freshness / observability

🏷️ DataHub — quarantine tag

🛠️ ServiceNow + Teams — incident + owner notice

10🗄️

Mock data (seed to demo)

Monte Carlo alert

Databricks failure

GE result

Lineage

Catalog metadata

// montecarlo.alertIngest
{ "asset":"gold.sales_daily", "type":"freshness", "ageHours":9, "severity":"high" }

// databricks.sqlQuery system.job_runs (latest for daily_sales_etl)
{ "run":"run-99812", "state":"FAILED",
  "error":"AnalysisException: column 'amount' not found in silver.orders" }

// dq.greatExpectations { suite:"silver_orders" }
{ "success":false,
  "failed":["expect_column_values_to_not_be_null: amount (37% null)"] }

// lineage.query { asset:"gold.sales_daily", direction:"downstream" }
[ "dashboard:Exec Revenue", "tagetik:Daily Flash", "model:demand_forecast" ]

// databricks.unityCatalog { asset:"gold.sales_daily" }
{ "business_critical":true, "owner":"Analytics Team", "consumers":3 }

11💬

Sample questions (conversational triggers)

The natural-language prompts this agent is built to answer. Each maps to the live scenarios below.

🕒"Why is the Exec Revenue dashboard stale?"→ Scenarios 1–2

🔧"What broke in daily_sales_etl?"→ Scenarios 2–3

💥"What's affected if gold.sales_daily is bad?"→ Scenarios 4–5 (blast radius)

💼"Is the data behind today's finance flash trustworthy?"→ Scenario 9 (finance-close)

🧯"Quarantine the bad table and re-run the pipeline."→ Scenarios 6–7

🛡️"Make sure this can't happen silently again."→ Scenario 10 (prevent recurrence)

🎯 Recommended demo run

Seed the failing job, then ask "Why is the Exec Revenue dashboard stale?" — the agent walks scenarios 1→5, asks approval to quarantine the business-critical table (6), fixes + re-runs (7), opens the incident and notifies owners (8), confirms the finance flash was protected (9), and adds a guard against recurrence (10).

12🎬

Live scenarios (tool-execution traces)

ℹ️Ten end-to-end runs. Each step is a read (SOURCE), reasoning, an Action Ticket with a MOP, or an agent message.