Runbooks fail at 3 AM because humans do. Not because humans are incompetent - the engineers executing those runbooks are often exceptional. They fail because following a 47-step procedure under sleep deprivation, time pressure, and the psychological weight of a production outage is a fundamentally unreliable process. Steps get skipped. Conditions go unchecked. The wrong server gets restarted.
When we set out to build MOPs - Machine Operations Procedures - we started with a simple constraint: the system has to be safer than the human process it replaces, or it is not worth building. This post is about how we engineered toward that bar.
Why "Runbook Automation" Has Failed Before
There is a graveyard of runbook automation initiatives in enterprise IT. Most organizations have tried it. Most have quietly abandoned it. The failure modes are consistent:
- No precondition checking. A script that assumes the system is in a known good state before execution will behave unpredictably when it is not. Most runbook automation tools execute steps without verifying that the environment is actually ready for those steps.
- No outcome validation. A script can restart a service and report success - even if the service immediately crashed again. Without verification that the expected outcome was achieved, "automation" is just executing steps, not solving problems.
- No rollback. When automation makes things worse, you want to undo it. Most scripted runbooks have no rollback capability - they are linear procedures that assume forward progress. An execution that fails halfway through leaves the system in an undefined state.
- No audit trail. Operations teams and compliance functions need to know exactly what was done, when, and why. Automation that logs "script ran successfully" does not meet that bar.
MOPs are our answer to all four of these failure modes. They are not scripts with better logging. They are a fundamentally different approach to operational automation.
The MOP Execution Model
Pre-Execution Safety Checks
The most important thing a MOP does is check before it acts. Pre-execution checks are not a formality - they are the primary safety mechanism that prevents automation from making a bad situation worse.
Pre-checks are specific to each MOP. They fall into four categories:
State checks
Is the target resource in the expected state for this MOP? A connection pool restart MOP checks that the target is in degraded state, not failed state - the actions are different.
Dependency checks
Are the dependencies this MOP relies on healthy? A Kubernetes pod restart MOP checks that the node it is targeting is schedulable and that a replica is available to handle traffic during the restart.
Change window checks
Is this action permitted right now? Checks against maintenance windows, change freeze periods, and business-hours restrictions configured per environment.
Blast radius analysis
How many users or services will be affected by this action? If the estimated impact exceeds the configured threshold for autonomous action, the MOP requires human approval before proceeding.
If any pre-check fails, the MOP aborts cleanly - it does not attempt partial execution. It logs the failure reason and escalates to the human with a summary of what it was trying to do and what condition was not met. The human then decides whether to override the check or address the condition first.
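The abort-versus-warn behavior described above can be sketched in a few lines. This is an illustrative sketch, not the MOP engine itself - the `PreCheck` and `PreCheckFailed` names are hypothetical, and each check's `run` callable stands in for a real probe against the environment:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreCheck:
    name: str
    run: Callable[[], bool]   # returns True if the condition holds
    abort_on_fail: bool = True

class PreCheckFailed(Exception):
    """Raised when a blocking pre-check fails; carries the check name for escalation."""

def run_pre_checks(checks: list[PreCheck]) -> list[str]:
    """Run every pre-check; abort cleanly on the first blocking failure.

    Returns the names of non-blocking checks that failed (warn-only)."""
    warnings = []
    for check in checks:
        if check.run():
            continue
        if check.abort_on_fail:
            # Abort before any execution step runs - no partial execution.
            raise PreCheckFailed(f"pre-check '{check.name}' failed")
        warnings.append(check.name)
    return warnings
```

The key property is that a blocking failure raises before any execution step has run, so there is never a partially executed MOP to clean up at this stage.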
The Execution Graph
A MOP is not a linear list of commands. It is a directed acyclic graph (DAG) of operations with explicit dependencies, conditional branches, and parallel execution where safe. This structure is what makes MOPs reliable at scale and under partial-failure conditions.
# Simplified MOP Schema (YAML)
mop_id: MOP-284
name: "Database Connection Pool Recovery"
version: "2.3.1"
category: database
blast_radius_max: medium
pre_checks:
  - check: service_state
    target: "{{ incident.affected_resource }}"
    expected_state: degraded
    abort_on_fail: true
  - check: replica_healthy
    target: "{{ incident.affected_resource }}.replica"
    abort_on_fail: true
  - check: no_active_writes
    timeout_seconds: 30
    abort_on_fail: false  # warn only
execution_graph:
  - id: drain_connections
    action: db.connection_pool.drain
    params:
      graceful_timeout: 15s
    on_failure: abort
  - id: restart_conn_manager
    action: service.restart
    params:
      service: "{{ incident.affected_resource }}.conn_manager"
      wait_healthy: true
    depends_on: [drain_connections]
    on_failure: rollback
  - id: reset_pool
    action: db.connection_pool.reset
    params:
      initial_size: "{{ env.pool_size_default }}"
    depends_on: [restart_conn_manager]
    on_failure: rollback
validation:
  - check: connection_wait_time
    threshold_ms: 50
    timeout_seconds: 60
  - check: service_health_probe
    endpoint: "{{ incident.affected_resource }}/health"
    expected_status: 200
rollback_graph:
  - id: restore_pool_config
    action: db.connection_pool.restore_config
    params:
      config_snapshot: "{{ pre_execution.pool_config }}"
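The execution side of that schema reduces to a topological walk of the DAG. Here is a minimal sketch, assuming Python 3.9+ for `graphlib.TopologicalSorter`; the `steps` shape and the "return completed step ids for rollback" convention are illustrative, not the actual engine contract:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def execute_graph(steps):
    """steps: {step_id: {"run": callable, "depends_on": [step_ids]}}.

    Runs steps in dependency order. On success returns (True, completed ids);
    on failure returns (False, completed ids newest-first) as the rollback worklist."""
    # TopologicalSorter takes a node -> predecessors mapping and yields a valid order.
    order = TopologicalSorter(
        {sid: spec.get("depends_on", []) for sid, spec in steps.items()}
    ).static_order()
    completed = []
    for sid in order:
        try:
            steps[sid]["run"]()
        except Exception:
            # Roll back only what already ran, in reverse completion order.
            return False, list(reversed(completed))
        completed.append(sid)
    return True, completed
```

Steps with no dependency between them could equally be dispatched in parallel once their predecessors complete; the linear loop keeps the sketch short.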
Post-Execution Validation
After execution completes, the MOP enters validation mode. This is where we verify that the intended outcome was actually achieved - not just that the commands ran without error. A service can restart successfully and immediately crash again. Without outcome validation, you have automation that says "done" while the incident is still open.
Validation is outcome-based, not action-based. We do not check "did the restart command succeed?" We check "is the service now healthy, are response times within SLA, and have the original alert conditions cleared?"
Validation checks run continuously against defined thresholds for a configurable window (typically 5 minutes). If conditions return to baseline and hold for the validation window, the incident is marked resolved. If conditions do not clear within the timeout, or degrade further, rollback triggers automatically.
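The "hold for the validation window" rule has a subtle detail worth making explicit: any regression resets the clock. A minimal sketch of that loop, with injectable `clock` and `sleep` so the logic is testable (the function name and defaults are illustrative, not the product's API):

```python
import time

def validate(check, window_s=300, timeout_s=900, poll_s=5,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it holds continuously for `window_s`, or `timeout_s` expires.

    Returns True (mark resolved) or False (trigger rollback)."""
    deadline = clock() + timeout_s
    healthy_since = None
    while clock() < deadline:
        if check():
            if healthy_since is None:
                healthy_since = clock()
            if clock() - healthy_since >= window_s:
                return True   # conditions held for the full validation window
        else:
            healthy_since = None  # any regression resets the window
        sleep(poll_s)
    return False
```

A single healthy poll is not enough to declare victory - the service that restarts cleanly and crashes two minutes later fails this loop, which is exactly the failure mode outcome validation exists to catch.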
Automatic Rollback
Rollback is perhaps the hardest engineering challenge in operational automation. The naive approach - "just undo the steps in reverse order" - fails in many real-world scenarios because operations are not always reversible, and the system state after partial execution may differ from what any pre-designed rollback assumes.
Our approach to rollback involves three mechanisms:
- Pre-execution state snapshot: Before any execution step, the MOP takes a targeted snapshot of the relevant system state - configuration values, service parameters, resource allocations. The rollback graph uses these snapshots as restore targets rather than assuming a known prior state.
- Step-level rollback graphs: Each execution step with a rollback risk has its own rollback action defined inline. If step 3 of 5 fails, only steps 1 and 2 need rolling back - not the entire MOP.
- Idempotency requirements: All MOP actions are required to be idempotent - running them twice produces the same result as running them once. This makes partial-execution recovery tractable and eliminates a class of "double-action" failures.
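The first two mechanisms combine naturally: snapshot before executing, and on failure run only the rollback actions for steps that actually completed, newest first. A hedged sketch under those assumptions (the pairing of each step with an optional inline rollback callable is illustrative):

```python
def execute_with_rollback(steps, snapshot):
    """steps: list of (run, rollback) pairs; rollback may be None for risk-free steps.

    `snapshot` is the targeted pre-execution state captured before step one;
    rollback actions restore toward it rather than assuming a known prior state."""
    done = []
    for run, rollback in steps:
        try:
            run()
        except Exception:
            # Roll back only the completed steps, in reverse order.
            for rb in reversed([rb for _, rb in done if rb is not None]):
                rb(snapshot)
            raise
        done.append((run, rollback))
```

Idempotency is what makes this safe even when a rollback action is retried: restoring the same config snapshot twice lands the system in the same place as restoring it once.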
The Audit Trail
Every MOP execution produces a structured audit record that is immutable once written. The record includes:
- the triggering incident context and confidence score
- every pre-check and its outcome
- each execution step with start time, end time, parameters used, and output
- every validation check result
- the final resolution state
- any escalations or human approvals required
This audit record is the foundation of both compliance and continuous improvement. Compliance teams get a precise record of every automated action. The Sherlock optimization engine uses outcome patterns across thousands of MOP executions to improve pre-check logic, update validation thresholds, and surface patterns in which MOPs succeed and which fail under which conditions.
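One common way to make append-only records tamper-evident is a hash chain, where each record commits to its predecessor. This is an illustrative technique, not necessarily the mechanism MOPs use; the `audit_entry` helper and field names are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(prev_hash: str, event: dict) -> dict:
    """Build one append-only audit record.

    Each record embeds a SHA-256 hash over its own contents plus the previous
    record's hash, so mutating any earlier record breaks every later hash."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

Verification is the same computation in reverse: recompute each record's hash from its body and confirm it matches both the stored value and the next record's `prev_hash`.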
What started as "runbook automation" has become, in practice, a self-improving operational knowledge base - one that gets more reliable every time it runs.
The MOP SDK, schema specification, and no-code MOP Builder are available to all Ops Singularity enterprise customers. Contact your Customer Success Engineer for access to the MOP authoring documentation.