Technical Deep Dive

Building an AI That Observes Before It Acts: Inside the OIAO Architecture

Most monitoring systems are reactive by design. They detect. They alert. They wait for a human to decide what to do. This model made sense when the systems being monitored were simple enough that a human could hold the full picture in their head. Modern enterprise infrastructure has long since outgrown that assumption.

When we designed Sentinel AI, we made a deliberate architectural choice: the system would not act until it understood. Specifically, it would not take remediation action on an infrastructure event until it had completed a structured investigation of that event, assessed its confidence in the root cause diagnosis, and selected the appropriate response from a validated action library.

This is what we call the OIAO framework - Observe, Investigate, Act, Optimize. This post is a technical deep dive into how each phase works, the engineering decisions we made, and the tradeoffs we navigated.

OIAO is not a monitoring loop. It is an intelligence loop. The distinction matters: monitoring tells you something happened; intelligence tells you what it means and what to do about it.

Phase 1: OBSERVE - Signal Normalization at Scale

Sentinel ingests events from monitoring tools, cloud APIs, logs, and traces. Every signal is normalized into a unified event schema and passed through temporal and topological correlation engines before any analysis begins.

The first engineering challenge in building an autonomous operations system is the signal problem. An enterprise environment generates signals from dozens of sources - Datadog metrics, Splunk logs, PagerDuty alerts, Kubernetes events, cloud provider APIs, SNMP traps, and custom application telemetry - all with different schemas, different timestamp formats, different severity taxonomies, and different semantic meanings for seemingly identical field names.

Before Sentinel can understand anything, it needs to normalize everything. Our ingestion pipeline processes every incoming event through four stages: mapping the source-specific schema into the unified event model, normalizing timestamps to UTC, translating severities into a common taxonomy, and resolving semantic field names so that identically named fields from different sources do not collide.
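In simplified form, a normalization stage might look like the following. The `UnifiedEvent` schema, field names, and severity mapping here are illustrative sketches, not our production code:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed common severity taxonomy (lower number = more severe)
SEVERITY_MAP = {"critical": 1, "error": 2, "warning": 3, "info": 4}

@dataclass
class UnifiedEvent:
    source: str
    resource: str
    severity: int
    timestamp: datetime
    fields: dict

def normalize(raw: dict, source: str) -> UnifiedEvent:
    # Stage 1: schema mapping - lift source-specific keys into the unified model
    resource = raw.get("host") or raw.get("resource") or "unknown"
    # Stage 2: timestamp normalization - everything becomes timezone-aware UTC
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
    # Stage 3: severity taxonomy translation
    severity = SEVERITY_MAP.get(str(raw.get("level", "info")).lower(), 4)
    # Stage 4: semantic field resolution - prefix keys to avoid collisions
    fields = {f"{source}.{k}": v for k, v in raw.items() if k not in ("ts", "level")}
    return UnifiedEvent(source, resource, severity, ts, fields)
```

The prefixing in stage 4 is the blunt-instrument version of semantic resolution: it guarantees that a `status` field from Datadog can never be confused with a `status` field from Kubernetes.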

The output of the OBSERVE phase is not an alert. It is an Incident Context - a structured object containing the correlated signal set, affected resources and their topology context, inferred scope (blast radius estimate), and initial severity classification.
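The blast radius estimate in that object can be sketched as a traversal of the topology graph. The graph representation below (resource mapped to its downstream dependents) is an illustrative assumption:

```python
from collections import deque

def blast_radius(topology: dict, root: str) -> set:
    """Estimate which resources an incident at `root` can affect by
    breadth-first search over the dependency graph. `topology` maps
    each resource to the resources that depend on it."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for dependent in topology.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {root}
```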

Phase 2: INVESTIGATE - Multi-Layer Root Cause Analysis

The investigation engine runs multiple analytical layers in parallel - pattern matching, causal graph traversal, change correlation, and anomaly scoring - before synthesizing a root cause hypothesis with an associated confidence score.

This is where OIAO diverges most significantly from traditional monitoring systems. Investigation is not a lookup. It is not "does this alert match a known pattern?" It is a systematic analytical process that synthesizes multiple evidence streams into a root cause hypothesis.

The investigation engine runs four layers of analysis in parallel:

Layer 1: Pattern Library Matching

Sentinel maintains a continuously updated library of known incident patterns - signatures built from historical incidents in your environment and from our global anonymized incident corpus. Pattern matching is fast (milliseconds) and provides the initial hypothesis set. For common incident types, this alone often produces a high-confidence match.
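One plausible way to score a pattern match is set overlap between the incident's signal set and a pattern's signature. This is a simplification, not the real matcher:

```python
def match_score(observed: set, signature: set) -> float:
    """Jaccard overlap between the signals observed in an incident and
    the signals in a library pattern's signature (0.0 to 1.0)."""
    if not signature:
        return 0.0
    return len(observed & signature) / len(observed | signature)
```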

Layer 2: Causal Graph Traversal

Using the topology graph, the investigation engine traces causality upstream and downstream from the affected resource. If a service is experiencing latency, the engine checks its dependencies - is the upstream database also showing anomalous query times? Is the network layer between them reporting packet loss? This traversal identifies whether the affected resource is the root cause or a symptom of a deeper failure.
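The traversal idea reduces to: follow anomalous dependencies upstream until you reach an anomalous resource whose own dependencies are healthy. This is a deliberately simplified sketch, with an assumed graph representation:

```python
def root_candidates(deps: dict, anomalous: set, start: str) -> list:
    """deps maps each service to the services it depends on (upstream).
    Returns anomalous resources reachable from `start` that have no
    anomalous dependencies of their own - likely root causes."""
    seen, stack, roots = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # upstream dependencies that are themselves misbehaving
        anomalous_deps = [d for d in deps.get(node, []) if d in anomalous]
        if node in anomalous and not anomalous_deps:
            # anomalous, but with healthy dependencies: root-cause candidate
            roots.append(node)
        stack.extend(anomalous_deps)
    return roots
```

In the latency example above, the API service is anomalous but so is its database dependency, so the traversal keeps going and nominates the database, not the API, as the candidate.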

Layer 3: Change Correlation

One of the most reliable indicators of incident root cause is a recent change. The investigation engine queries your change management systems (ServiceNow, Jira, Kubernetes deployment history, CI/CD pipeline logs) for changes to affected resources or their dependencies in the preceding 24 hours. A deployment that happened 90 minutes before incident onset is a strong causal signal.
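A minimal version of temporal change scoring might apply a linear decay over the 24-hour lookback window. The decay shape is an illustrative assumption, not our production scoring:

```python
from datetime import datetime, timedelta, timezone

def change_correlation(change_time: datetime, incident_time: datetime,
                       window: timedelta = timedelta(hours=24)) -> float:
    """Score a change's causal plausibility by how recently it occurred
    before incident onset: 1.0 at the moment of the change, decaying
    linearly to 0.0 at the edge of the lookback window."""
    delta = incident_time - change_time
    if delta < timedelta(0) or delta > window:
        return 0.0  # change happened after the incident or outside the window
    return 1.0 - delta / window
```

A deployment 90 minutes before onset scores roughly 0.94 under this scheme, which matches the intuition that very recent changes deserve the most suspicion.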

Layer 4: Anomaly Context

The final layer evaluates the statistical anomalousness of the observed signals. Is this a 2-sigma deviation from baseline, or a 6-sigma event? Is it localized to one resource or correlated across a class of resources? Does it correlate with external signals like time-of-day traffic patterns, upstream provider status, or geographic load shifts?
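The 2-sigma-versus-6-sigma question is, at its core, a z-score against the resource's baseline. A simplified illustration:

```python
from statistics import mean, stdev

def sigma_deviation(baseline: list, observed: float) -> float:
    """How many standard deviations the observed value sits from the
    baseline mean. Real baselining would account for seasonality and
    trend; this is the bare statistical core."""
    mu, sd = mean(baseline), stdev(baseline)
    return abs(observed - mu) / sd if sd else float("inf")
```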

These four layers synthesize into a root cause hypothesis - a structured object that includes the proposed root cause, supporting evidence from each analytical layer, and a confidence score (0-100) representing the model's certainty in the diagnosis.

# Example RCA Hypothesis Object (simplified)
{
  "incident_id": "INC-2026-03-19-0847",
  "hypothesis": {
    "root_cause": "connection_pool_exhaustion",
    "affected_resource": "db-prod-primary-us-east-1",
    "confidence": 94,
    "evidence": {
      "pattern_match": {
        "pattern_id": "DBCP-EXHAUST-001",
        "match_score": 0.97,
        "matched_signals": ["conn_wait_time_spike", "active_conn_at_max"]
      },
      "change_correlation": {
        "recent_change": "deploy/api-gateway/v2.4.1",
        "change_timestamp": "2026-03-19T06:12:00Z",
        "correlation_score": 0.82,
        "notes": "New API version increased default connection pool size by 3x"
      },
      "anomaly_context": {
        "sigma_deviation": 7.3,
        "anomaly_type": "step_change",
        "correlated_with_change": true
      }
    }
  },
  "recommended_action": "MOP-284",
  "escalate_if_confidence_below": 88
}
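One way the four evidence layers could combine into the 0-100 confidence score is a weighted sum of per-layer scores. The weights below are illustrative assumptions, not our calibrated values:

```python
# Hypothetical per-layer weights (must sum to 1.0)
LAYER_WEIGHTS = {
    "pattern_match": 0.40,
    "causal_graph": 0.25,
    "change_correlation": 0.20,
    "anomaly_context": 0.15,
}

def confidence(layer_scores: dict) -> int:
    """Combine per-layer evidence scores (each 0.0-1.0) into the 0-100
    confidence value carried on the hypothesis object. A missing layer
    contributes zero evidence."""
    total = sum(w * layer_scores.get(layer, 0.0)
                for layer, w in LAYER_WEIGHTS.items())
    return round(total * 100)
```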

The confidence threshold is configurable per environment and per incident category. When confidence falls below the threshold, Sentinel escalates to a human with the full investigation brief pre-populated - the engineer sees what Sentinel found, what it was considering doing, and can approve, modify, or override the recommended action.
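The escalation gate itself is simple to express. This sketch assumes the field names from the hypothesis object shown earlier:

```python
def decide(rca: dict) -> str:
    """Gate autonomous execution on the configured confidence threshold.
    Below threshold, the system escalates rather than acting."""
    threshold = rca.get("escalate_if_confidence_below", 88)
    if rca["hypothesis"]["confidence"] >= threshold:
        return "execute:" + rca["recommended_action"]
    return "escalate:human_review"
```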

Phase 3: ACT - Structured Autonomous Execution

Sentinel selects the highest-confidence MOP for the diagnosed root cause, runs pre-execution safety checks, executes the remediation steps, and validates the outcome. If validation fails, automatic rollback triggers.

Once the investigation produces a high-confidence root cause hypothesis, Sentinel selects the appropriate MOP (Machine Operations Procedure) from its action library. A MOP is not a script - it is a structured execution graph with explicit safety boundaries at every step.

Every MOP execution proceeds through four mandatory phases: pre-execution safety checks, staged execution of the remediation steps, outcome validation, and automatic rollback if validation fails.
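In outline, the execution flow can be sketched as follows. The function shapes are assumptions, not the real MOP engine:

```python
def execute_mop(steps, precheck, validate, rollback) -> str:
    """Sketch of a MOP run: safety pre-check, execute the remediation
    steps in order, validate the outcome, and roll back automatically
    if validation fails."""
    if not precheck():
        return "aborted:precheck_failed"
    for step in steps:
        step()
    if validate():
        return "resolved"
    rollback()
    return "rolled_back"
```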

Phase 4: OPTIMIZE - Closing the Learning Loop

Sherlock, our optimization engine, reviews every resolved incident to improve detection accuracy, update pattern libraries, surface recurring infrastructure issues, and recommend proactive improvements before the next incident occurs.

Most operational AI systems stop at action. OIAO does not. The Optimize phase - powered by our Sherlock engine - is what turns every incident into institutional knowledge and every pattern into a proactive defense.

Sherlock performs post-incident analysis on every resolved incident, regardless of whether resolution was autonomous or human-assisted. It updates the pattern library with any new signals that proved diagnostic, recalibrates confidence thresholds based on outcome accuracy, identifies recurring incident patterns that suggest underlying infrastructure issues worth addressing proactively, and generates optimization recommendations - rightsizing suggestions, configuration changes, and architectural improvements - to reduce incident probability.
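As a toy illustration of threshold recalibration (not Sherlock's actual algorithm): if an autonomous action taken above the current threshold turned out to be wrong, raise the bar above the most confident failure:

```python
def recalibrate_threshold(current: int, outcomes: list) -> int:
    """outcomes is a list of (confidence, action_was_correct) pairs for
    past autonomous actions. If a wrong action slipped past the current
    threshold, move the threshold just above the worst offender."""
    failures = [conf for conf, ok in outcomes if not ok and conf >= current]
    return max(failures) + 1 if failures else current
```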

Over a 90-day period, customers consistently observe not just faster incident resolution but a measurable reduction in total incident frequency - because Sherlock's recommendations address the root causes before they manifest again.

The Engineering Decisions That Matter Most

Building OIAO required several non-obvious engineering choices that shaped the system's behavior in production. Three stand out as particularly consequential.

Confidence over speed. We deliberately chose to make investigation thorough rather than instant. An investigation that completes in 18 seconds with 94% confidence is more valuable than one that completes in 2 seconds with 67% confidence. The cost of wrong action in infrastructure automation is higher than the cost of a few extra seconds of investigation time.

Correlated incidents, not individual alerts. The decision to group correlated signals into Incident Contexts before any investigation begins was critical. Responding to individual alerts in a cascading failure would mean hundreds of simultaneous investigations and potentially conflicting actions. Responding to one root-cause-focused Incident Context means one focused investigation and one coordinated response.

Human-in-loop as a design feature, not a fallback. OIAO is designed with escalation as a first-class behavior, not a failure mode. When confidence is below threshold, the system does not try harder or lower its standards - it escalates gracefully, providing the human with everything they need to make a fast, well-informed decision. This design choice is what makes high-confidence autonomy trustworthy: operators know the system knows its own limits.

This post reflects the Sentinel AI architecture as of Q1 2026. Some implementation details have been simplified or generalized for publication. We welcome questions and discussion - reach out to engineering@opssingularity.com.