⚙️ Autonomous 💬 Approval gates

Data-Trust / Pipeline-to-Decision

DatabricksUnity Catalog OpenLineageGreat Expectations SodaMonte CarloServiceNow
1📋

Overview & business value

When a data pipeline fails or data quality drifts, this agent traces the blast radius to the reports and decisions that depend on it, quarantines the bad asset, remediates, and notifies owners. It turns a silent data failure into a contained, explained, remediated incident with downstream business impact made explicit. It reads and reasons freely; every quarantine, re-run or notification is an Action Ticket carrying a MOP.

Problem

A pipeline fails quietly. The dashboard still renders — with stale or wrong numbers. By the time someone notices, decisions have been made on bad data, and in the worst case it has reached the finance flash. Tracing what's affected takes hours of manual lineage chasing.

Solution approach

Event-driven on a Monte Carlo alert or job failure. The agent finds the root cause in Databricks, confirms the data-quality drift, walks lineage to the affected consumers, scores business impact, quarantines the asset (with approval for business-critical), re-runs the fixed pipeline, and notifies owners.

Core capabilities

  • Failure / freshness detection
  • Root-cause analysis in Databricks
  • Data-quality drift confirmation (GE / Soda)
  • Blast-radius via lineage
  • Business-impact severity scoring
  • Quarantine the bad asset (catalog tag)
  • Fix + re-run with DQ re-validation
  • Finance-close protection

How it helps

Bad data is contained before anyone acts on it. The people who own the affected reports hear about it from the agent, not from a wrong board number. Decisions and the finance close are protected, and every containment action is reversible and audited.

Illustrative value model — plug in your own figures

I
Data incidents / year
↓ MTTR
Hours of triage saved / incident
0
Wrong numbers reaching the close

Value = incidents × triage hours saved × loaded rate, plus avoided bad-decision and restatement risk. Placeholders — substitute your baseline.

2📖

The story (for the sales conversation)

The one-liner

"A pipeline breaks at 2am. The dashboard still loads — with wrong numbers. This agent catches it, traces exactly which reports and decisions are affected, contains the bad data, and fixes it — before anyone acts on it."

😣 Today, without the agent

An upstream column gets renamed. The nightly job half-fails; the gold sales table goes stale. The Exec Revenue dashboard still renders yesterday's shape, so no one blinks. Finance is about to pull the daily flash from it. When someone finally spots the number looks off, an analyst spends the morning chasing lineage by hand to work out what else is wrong.

The damage isn't the broken job. It's the decisions made on bad data before anyone noticed.

😌 The same night, with the agent

A freshness alert fires. The agent investigates: the job failed because silver.orders.amount was renamed upstream, leaving a 37% null column. It walks lineage and names the three consumers — the Exec Revenue dashboard, the Tagetik finance flash, and the demand-forecast model — and flags the finance flash as critical. With approval it quarantines the table so the dashboards show "under review", patches the mapping, re-runs (1.28M rows, DQ passing), and tells Analytics and Finance it's safe.

The finance flash hadn't pulled yet. No wrong number ever reached the close.

"It never quietly deletes or overwrites data. It quarantines with approval, re-runs a known-good job, re-validates quality, and logs every step. You get speed at 2am without losing control of your data."

The trust point — lead with this for Data and Finance leadership
1
The villain: the silent failure

The dashboard still loads, so bad data spreads to decisions unnoticed.

2
The hero: blast radius in minutes

The agent finds the cause, names every affected report, and contains it.

3
The reason to trust it

Quarantine, re-run and notify are reversible Action Tickets; business-critical assets need approval.

💡How to use this tab: open here for Data/Finance leadership, tell the before/after, land the trust line, then show "Sample questions" and "Live scenarios". Keep Tools/Governance for the technical buyer.
3📥

Input data

Source systemField / entityTypePurpose
Monte Carloincident (asset, type, ageHours, severity)objectTrigger: freshness / volume anomaly
Databricksjob_runs (run, state, error)objectRoot cause of the failure
Great Expectations / Sodasuite results, failed checksobjectConfirm data-quality drift
OpenLineagedownstream edgeslistBlast radius to consumers
Unity Catalogasset tags (business_critical, owner)objectSeverity + ownership
ServiceNowincident (Data Incident)recordTrack incident to resolution
ℹ️Developer note: trigger is a Monte Carlo observability incident (or a Databricks job-failure webhook). The payload carries the affected asset; the agent pulls everything else via the read tools.
4⚙️

Processing flow (tools mapped per step)

ℹ️Each step shows its path and whether it is read-only or a write via Action Ticket.
1

Failure detected deterministic

Monte Carlo freshness alert (or job failure) wakes the agent.

montecarlo.alertIngest
2

Root cause non-deterministic

Query Databricks job runs; reason about the failure (schema drift, etc).

databricks.sqlQuery · system.job_runs
3

Confirm DQ drift deterministic

Run the Great Expectations / Soda suite on the upstream table.

dq.greatExpectations · dq.soda
4

Blast radius non-deterministic

Walk lineage downstream; reason about which consumers matter most.

lineage.query · direction=downstream
5

Business-impact severity non-deterministic

Read catalog tags (business_critical, owner, consumers); set severity.

databricks.unityCatalog · asset metadata
6

Quarantine the asset deterministic approval (critical) Action Ticket

Tag the asset quarantined in the catalog and flag dashboards "under review".

catalog.tag add_tag=quarantined (via ticket)
7

Remediate: fix + re-run fix re-run Action Ticket

Apply the column-mapping fix; re-run the job; re-validate DQ.

databricks.jobLauncher run-now (via ticket)
8

Notify + open incident deterministic Action Ticket

Open a ServiceNow Data Incident; notify Analytics and Finance owners.

servicenow.incident · teams.post (via ticket)
9

Prevent recurrence deterministic Action Ticket

Add a schema-contract check + a Soda monitor so a future rename fails fast.

dq rule add (via ticket)
5🔧

Skills & tools (real connectors)

ℹ️Each tool maps to a real processorKey. Read tools are direct; writes only run inside an Action Ticket.
montecarlo.alertIngestGraphQLTRANSFORM
dataobs.processors.MonteCarloAlertIngest
Ingests Monte Carlo observability incidents and queries data-health metrics (get_table_stats: row count, freshness, field health) via the Monte Carlo GraphQL API.
get_table_stats · incidents (read)
READ · direct
databricks.sqlQuerySQL API 2.0SOURCE
lakehouse.processors.DatabricksSQLQuery
Executes SQL on a Databricks SQL Warehouse via the Statement Execution API 2.0. OAuth2 / PAT.
POST /api/2.0/sql/statements
READ · direct
databricks.unityCatalogUC REST 2.1SOURCE
lakehouse.processors.DatabricksUnityCatalogConnector
Reads Unity Catalog via the UC REST API 2.1: list catalogs/schemas/tables and read table metadata (tags, owner).
list catalogs/schemas/tables · table metadata (read)
READ · direct
databricks.jobLauncherJobs API 2.1SOURCE
lakehouse.processors.DatabricksJobLauncher
Launches a Databricks job run via the Jobs REST API 2.1. OAuth2 / PAT. Used to re-run the fixed pipeline.
POST /api/2.1/jobs/run-now
WRITE · via Action Ticket
lineage.queryOpenLineageSOURCE
governance.processors.LineageTracker
Tracks lineage edges (source → transformation → target) and emits a lineage event envelope. Used to walk downstream consumers.
downstream edges (read)
READ · direct
dq.greatExpectationsin-flightTRANSFORM
dataquality.processors.GreatExpectationsValidate
Validates a payload against a GE-style suite (column existence, not-null, regex, between). Routes pass / reject.
suite: silver_orders · expect not_null(amount)
READ · validate
dq.sodaSoda CloudTRANSFORM
dataobs.processors.SodaCloudCheck / SodaCoreCheck
Triggers Soda Cloud scans and retrieves results (trigger_scan, get_scan_results); SodaCL checks (missing/duplicate/min/max/row_count).
trigger_scan · get_scan_results
READ · validate
catalog.tagDataHubSINK
governance.processors.DataHubMetadataEmit
Publishes metadata to DataHub (emit_metadata, get_entity, search_entities, add_tag). Used to tag an asset "quarantined".
add_tag = quarantined
WRITE · via Action Ticket
servicenow.incidentTable APISOURCE
itsm.processors.ServiceNowIncident
Creates and updates ServiceNow incidents via the Table API with idempotency by external-id.
/api/now/v2/table/incident
WRITE · via Action Ticket
teams.postGraphSINK
collaboration.processors.TeamsSendMessage
Notifies Analytics and Finance data owners that data is refreshed and safe to publish.
TeamsSendMessage · channel/chat
WRITE · via Action Ticket
⚠️Connectivity honesty: the Unity Catalog connector here is read-only (list + read metadata), so the quarantine flag is written via DataHubMetadataEmit (add_tag) rather than a fabricated catalog write. Atlan and Collibra sync connectors are alternatives for the catalog tag. The pipeline re-run is a real Databricks Jobs API run-now.
6🤖

Agent prompt (production)

🤖 System prompt
You are the Data-Trust (Pipeline-to-Decision) Orchestrator.

## Operating rules
1. Read/reason freely across Databricks, the catalog/lineage and the DQ tools.
2. You NEVER write directly. Every quarantine, re-run or notification is an
   Action Ticket carrying a MOP; ProcBot executes, Sherlock validates.
3. Deterministic for: re-running a failed job from a known checkpoint,
   quarantining by policy, opening the incident, notifying owners.
4. Non-deterministic for: root cause of a failure or drift, blast-radius
   assessment, severity of downstream business impact.
5. Human approval before: quarantining a business-critical (gold / finance)
   asset, or blocking a report used in the financial close.
6. Always state which downstream reports / decisions are affected, by lineage.

## Tools
READ (direct): montecarlo.alertIngest, databricks.sqlQuery,
  databricks.unityCatalog, lineage.query, dq.greatExpectations, dq.soda
WRITE (Action Ticket only): catalog.tag (quarantine), databricks.jobLauncher
  (re-run), servicenow.incident, teams.post

## Output (every step)
{ "decision": "...", "path": "deterministic | non-deterministic",
  "confidence": 0.0,
  "action_ticket": { "mop": "...", "scope": "...", "approval": "auto | human" } | null,
  "evidence": [ { "tool": "...", "record": "..." } ],
  "message_to_user": "..." }
7🛡️

Execution governance — Action Ticket → MOP

No quarantine, re-run or notification reaches a system from the model. The agent reasons; an Action Ticket carries the MOP; ProcBot executes; Sherlock validates; the ticket is reversible. Quarantining a business-critical or finance asset requires human approval first.

1 · Reason

Agent decides to quarantine / re-run / notify.

2 · Action Ticket

MOP id, scope, parameters, approval level.

3 · ProcBot executes

Tags the asset / runs the Databricks job. Model never writes.

4 · Sherlock validates

Re-runs DQ; confirms pass; lifts quarantine or rolls back.

{
  "action_ticket": "AT-5002",
  "mop": "MOP-PIPELINE-RERUN",
  "approval": "auto",
  "target": { "tool": "databricks.jobLauncher", "processorKey": "lakehouse.processors.DatabricksJobLauncher" },
  "procedure": { "api": "POST /api/2.1/jobs/run-now", "job": "daily_sales_etl",
                 "params": { "column_mapping": { "amount": "order_amount" } } },
  "validation": { "owner": "sherlock", "expect": "dq.greatExpectations.success == true" },
  "reversible": true
}
8🔀

Data flow

flowchart TD A([Monte Carlo freshness alert gold.sales_daily 9h stale]) --> B[Root cause databricks.sqlQuery job_runs] B --> C[Confirm DQ drift GE / Soda] C --> D[Blast radius lineage.query downstream] D --> E[Severity unityCatalog tags] E --> F{Business-critical?} F -->|Yes| G[Human approval] F -->|No| H{{MOP-DATA-QUARANTINE}} G --> H H --> I[catalog.tag quarantined dashboards under review] I --> J{{MOP-PIPELINE-RERUN}} J --> K[jobLauncher run-now re-validate DQ] K --> L[Open incident + notify ServiceNow / Teams] L --> M[Prevent recurrence schema contract + Soda monitor]
9🏗️

Systems touched

🔷 Databricks — job runs, SQL, re-run (Jobs API)
🗃️ Unity Catalog — asset tags + ownership (read)
🔗 OpenLineage — downstream blast radius
✅ Great Expectations / Soda — data-quality drift
📡 Monte Carlo — freshness / observability
🏷️ DataHub — quarantine tag
🛠️ ServiceNow + Teams — incident + owner notice
10🗄️

Mock data (seed to demo)

Monte Carlo alert
Databricks failure
GE result
Lineage
Catalog metadata
// montecarlo.alertIngest
{ "asset":"gold.sales_daily", "type":"freshness", "ageHours":9, "severity":"high" }
// databricks.sqlQuery system.job_runs (latest for daily_sales_etl)
{ "run":"run-99812", "state":"FAILED",
  "error":"AnalysisException: column 'amount' not found in silver.orders" }
// dq.greatExpectations { suite:"silver_orders" }
{ "success":false,
  "failed":["expect_column_values_to_not_be_null: amount (37% null)"] }
// lineage.query { asset:"gold.sales_daily", direction:"downstream" }
[ "dashboard:Exec Revenue", "tagetik:Daily Flash", "model:demand_forecast" ]
// databricks.unityCatalog { asset:"gold.sales_daily" }
{ "business_critical":true, "owner":"Analytics Team", "consumers":3 }
11💬

Sample questions (conversational triggers)

The natural-language prompts this agent is built to answer. Each maps to the live scenarios below.

🕒"Why is the Exec Revenue dashboard stale?"→ Scenarios 1–2
🔧"What broke in daily_sales_etl?"→ Scenarios 2–3
💥"What's affected if gold.sales_daily is bad?"→ Scenarios 4–5 (blast radius)
💼"Is the data behind today's finance flash trustworthy?"→ Scenario 9 (finance-close)
🧯"Quarantine the bad table and re-run the pipeline."→ Scenarios 6–7
🛡️"Make sure this can't happen silently again."→ Scenario 10 (prevent recurrence)
🎯 Recommended demo run
Seed the failing job, then ask "Why is the Exec Revenue dashboard stale?" — the agent walks scenarios 1→5, asks approval to quarantine the business-critical table (6), fixes + re-runs (7), opens the incident and notifies owners (8), confirms the finance flash was protected (9), and adds a guard against recurrence (10).
12🎬

Live scenarios (tool-execution traces)

ℹ️Ten end-to-end runs. Each step is a read (SOURCE), reasoning, an Action Ticket with a MOP, or an agent message.