Engineering

Zero-Touch Runbook Execution: Engineering Autonomous MOPs at Scale

Runbooks fail at 3 AM because humans do. Not because humans are incompetent - the engineers executing those runbooks are often exceptional. They fail because following a 47-step procedure under sleep deprivation, time pressure, and the psychological weight of a production outage is a fundamentally unreliable process. Steps get skipped. Conditions go unchecked. The wrong server gets restarted.

When we set out to build MOPs - Machine Operations Procedures - we started with a simple constraint: the system has to be safer than the human process it replaces, or it is not worth building. This post is about how we engineered toward that bar.

Why "Runbook Automation" Has Failed Before

There is a graveyard of runbook automation initiatives in enterprise IT. Most organizations have tried it. Most have quietly abandoned it. The failure modes are consistent:

MOPs are our answer to all four of these failure modes. They are not scripts with better logging. They are a fundamentally different approach to operational automation.

The MOP Execution Model

Pre-Execution Safety Checks

The most important thing a MOP does is check before it acts. Pre-execution checks are not a formality - they are the primary safety mechanism that prevents automation from making a bad situation worse.

Pre-checks are specific to each MOP. They fall into four categories:

🔍 State checks

Is the target resource in the expected state for this MOP? A connection pool restart MOP checks that the target is in degraded state, not failed state - the actions are different.

🔄 Dependency checks

Are the dependencies this MOP relies on healthy? A Kubernetes pod restart MOP checks that the node it is targeting is schedulable and that a replica is available to handle traffic during the restart.

📅 Change window checks

Is this action permitted right now? Checks against maintenance windows, change freeze periods, and business-hours restrictions configured per environment.

💥 Blast radius analysis

How many users or services will be affected by this action? If the estimated impact exceeds the configured threshold for autonomous action, the MOP requires human approval before proceeding.

If any pre-check fails, the MOP aborts cleanly - it does not attempt partial execution. It logs the failure reason and escalates to the human with a summary of what it was trying to do and what condition was not met. The human then decides whether to override the check or address the condition first.
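Under these rules, the pre-check phase reduces to a small loop. The sketch below is illustrative only; the result type and check signatures are assumptions for this post, not the MOP SDK's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreCheckResult:
    name: str
    passed: bool
    detail: str
    abort_on_fail: bool

def run_pre_checks(checks: list[tuple[str, Callable[[], tuple[bool, str]], bool]]):
    """Run pre-checks in order. A failing abort_on_fail check stops the MOP
    immediately (no partial execution); a warn-only failure is recorded and
    execution continues. Returns (proceed, results) so the caller can
    escalate with exactly which condition was not met."""
    results = []
    for name, check_fn, abort_on_fail in checks:
        passed, detail = check_fn()
        results.append(PreCheckResult(name, passed, detail, abort_on_fail))
        if not passed and abort_on_fail:
            return False, results   # abort cleanly and escalate to a human
    return True, results
```

Note that a warn-only failure (like the `no_active_writes` check in the schema below) still lands in the results, so the escalation summary always carries the full picture.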

The Execution Graph

A MOP is not a linear list of commands. It is a directed acyclic graph (DAG) of operations with explicit dependencies, conditional branches, and parallel execution where safe. This structure is what makes MOPs reliable at scale and under partial-failure conditions.

# Simplified MOP Schema (YAML)
mop_id: MOP-284
name: "Database Connection Pool Recovery"
version: "2.3.1"
category: database
blast_radius_max: medium

pre_checks:
  - check: service_state
    target: "{{ incident.affected_resource }}"
    expected_state: degraded
    abort_on_fail: true
  - check: replica_healthy
    target: "{{ incident.affected_resource }}.replica"
    abort_on_fail: true
  - check: no_active_writes
    timeout_seconds: 30
    abort_on_fail: false  # warn only

execution_graph:
  - id: drain_connections
    action: db.connection_pool.drain
    params:
      graceful_timeout: 15s
    on_failure: abort

  - id: restart_conn_manager
    action: service.restart
    params:
      service: "{{ incident.affected_resource }}.conn_manager"
      wait_healthy: true
    depends_on: [drain_connections]
    on_failure: rollback

  - id: reset_pool
    action: db.connection_pool.reset
    params:
      initial_size: "{{ env.pool_size_default }}"
    depends_on: [restart_conn_manager]
    on_failure: rollback

validation:
  - check: connection_wait_time
    threshold_ms: 50
    timeout_seconds: 60
  - check: service_health_probe
    endpoint: "{{ incident.affected_resource }}/health"
    expected_status: 200

rollback_graph:
  - id: restore_pool_config
    action: db.connection_pool.restore_config
    params:
      config_snapshot: "{{ pre_execution.pool_config }}"
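An executor for a graph like this boils down to repeatedly running whichever step has all of its dependencies satisfied. The following is a minimal sequential sketch, assuming a step shape that mirrors the YAML above; the real executor presumably also handles parallel branches, timeouts, and conditional edges:

```python
def execute_graph(steps, run_action):
    """Execute steps in dependency order. `steps` is a list of dicts with
    id, depends_on, and on_failure ('abort' or 'rollback'); `run_action`
    performs one step and returns True on success. Returns a status string
    plus the ids of completed steps (the set a rollback would need)."""
    done = set()
    pending = {s["id"]: s for s in steps}
    while pending:
        # pick any step whose dependencies have all completed
        ready = [s for s in pending.values()
                 if all(d in done for d in s.get("depends_on", []))]
        if not ready:
            return "stuck", done   # cycle or unsatisfiable dependency
        step = ready[0]
        if run_action(step):
            done.add(step["id"])
            del pending[step["id"]]
        else:
            return step.get("on_failure", "abort"), done
    return "success", done
```

Returning the set of completed steps is what makes `on_failure: rollback` tractable: the rollback graph knows exactly how far execution got before things went wrong.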

Post-Execution Validation

After execution completes, the MOP enters validation mode. This is where we verify that the intended outcome was actually achieved - not just that the commands ran without error. A service can restart successfully and immediately crash again. Without outcome validation, you have automation that says "done" while the incident is still open.

Validation is outcome-based, not action-based. We do not check "did the restart command succeed?" We check "is the service now healthy, are response times within SLA, and have the original alert conditions cleared?"

Validation checks run continuously against defined thresholds for a configurable window (typically 5 minutes). If conditions return to baseline and hold for the validation window, the incident is marked resolved. If conditions do not clear within the timeout, or degrade further, rollback triggers automatically.
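That hold-for-a-window behavior can be sketched as a polling loop. This is purely illustrative; the clock and sleep are injected so the logic is testable, and the threshold mirrors the `connection_wait_time` check in the schema above:

```python
def validate_outcome(sample_metric, threshold_ms, window_s, timeout_s, now, sleep):
    """Poll a metric until it holds below threshold for a full validation
    window. Returns 'resolved' if the condition holds for window_s
    consecutive seconds, or 'rollback' if it never clears within timeout_s."""
    start = now()
    healthy_since = None
    while now() - start < timeout_s:
        if sample_metric() <= threshold_ms:
            if healthy_since is None:
                healthy_since = now()   # condition just returned to baseline
            if now() - healthy_since >= window_s:
                return "resolved"       # baseline held for the full window
        else:
            healthy_since = None        # any regression resets the window
        sleep(1)
    return "rollback"
```

The key design point is the reset on regression: a service that flaps back above threshold mid-window starts the clock over, so "resolved" always means the condition held continuously, not just at two sample points.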

Automatic Rollback

Rollback is perhaps the hardest engineering challenge in operational automation. The naive approach - "just undo the steps in reverse order" - fails in many real-world scenarios because operations are not always reversible, and the system state after partial execution may differ from what any pre-designed rollback assumes.

Our approach to rollback involves three mechanisms:

The Audit Trail

Every MOP execution produces a structured audit record that is immutable once written. The record includes:

- the triggering incident context and confidence score
- every pre-check result and its outcome
- each execution step with start time, end time, parameters used, and output
- every validation check result
- the final resolution state
- any escalations or human approvals required
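As a rough illustration of that shape (the field names here are assumptions for this post, not the published schema), an immutable record might look like:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)         # frozen: the record cannot be mutated once built
class AuditRecord:
    mop_id: str
    incident_context: dict      # triggering incident details
    confidence_score: float
    pre_check_results: tuple    # (check_name, passed) pairs
    execution_steps: tuple      # (step_id, start, end, params, output) per step
    validation_results: tuple
    resolution_state: str
    escalations: tuple = ()     # human approvals / escalations, if any

    def to_json(self) -> str:
        """Serialize deterministically for append-only storage."""
        return json.dumps(asdict(self), sort_keys=True, default=str)
```

Freezing the dataclass enforces immutability at the object level; in practice you would also want the storage layer itself to be append-only.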

This audit record is the foundation of both compliance and continuous improvement. Compliance teams get a precise record of every automated action. The Sherlock optimization engine uses outcome patterns across thousands of MOP executions to improve pre-check logic, update validation thresholds, and surface patterns in which MOPs succeed and which fail under which conditions.

What started as "runbook automation" has become, in practice, a self-improving operational knowledge base - one that gets more reliable every time it runs.

The MOP SDK, schema specification, and no-code MOP Builder are available to all Ops Singularity enterprise customers. Contact your Customer Success Engineer for access to the MOP authoring documentation.