# Failure Modes Of The Autonomous Loop

This is the public taxonomy ZERO uses to review autonomous-loop failures before
they reach live capital. Every serious replay, refusal, incident, or safety-gate
anomaly should map to an existing row or prompt a new one.

Each failure mode must answer five questions: how it is detected, what its blast
radius is, how it is rolled back, which journal entry records it, and how it
alerts.
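
One way to keep rows complete is to treat each one as a record and reject it
when any answer is blank. The sketch below is illustrative only: the
`FailureMode` class, its field names, and the `missing_answers` helper are
assumptions made for this example, not part of ZERO's codebase.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FailureMode:
    """One taxonomy row. Field names are hypothetical, chosen to mirror the table."""
    name: str
    detection: str
    blast_radius: str
    rollback: str
    journal_entry: str
    alerting: str


def missing_answers(mode: FailureMode) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(mode) if not getattr(mode, f.name).strip()]


# A complete row, copied from the taxonomy table.
stale_state = FailureMode(
    name="Stale market state drives a live decision",
    detection="Market snapshot age exceeds policy",
    blast_radius="One decision path; signing route refuses commit",
    rollback="Refuse signature, refresh state, rerun observe phase",
    journal_entry="stale_state_refusal",
    alerting="Operator warning; escalates if repeated in same lease",
)

# A draft row that is not ready to be merged into the table.
incomplete = FailureMode("New mode", "Some detector", "", "", "new_mode", "")
```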

| Failure mode | Detection | Blast radius | Rollback | Journal entry | Alerting |
| --- | --- | --- | --- | --- | --- |
| Agent hallucinates a strategy and burns paper budget | Budget counter, repeated losing paper orders, replay quality score drops | Paper allocation only; no live lease | Freeze strategy, reset paper allocation, require human review before promotion | `paper_budget_burn` with prompt SHA, strategy id, rejected hypothesis | Operator console warning after 1 budget breach; page after 3 |
| `evolve` produces config that passes tests but fails at runtime | Runtime heartbeat degraded, config digest mismatch, first-cycle exception | Current deployment session; live stays off unless lease exists | Revert to last signed config bundle; quarantine proposed genome | `evolve_runtime_failure` with config diff hash and test run id | Immediate alert on runtime exception after config promotion |
| Agent and human edit journal concurrently | Journal sequence gap, hash-chain parent mismatch, concurrent writer lock conflict | Journal append path; replay remains readable but flagged partial | Preserve both writes as competing entries, require reconciliation entry | `journal_conflict` with both entry hashes and winning parent | Console alert immediately; public anomaly if replay references it |
| Hyperliquid returns malformed response and agent retries repeatedly | Response schema validator failure, retry counter, venue error class | Venue adapter for one deployment; order loop pauses | Stop retries, mark venue degraded, require fresh state reconciliation | `venue_malformed_response` with redacted response hash and retry count | Alert after first malformed live response or 3 malformed paper responses |
| Stale market state drives a live decision | Market snapshot age exceeds policy, WS heartbeat gap, order intent timestamp drift | One decision path; signing route refuses commit | Refuse signature, refresh state, rerun observe phase | `stale_state_refusal` with snapshot age and policy max | Operator warning; escalates if repeated in same lease |
| Privy signer unavailable during live lease | Sign request timeout, authorization-key error, wallet id mismatch | Commit phase only; no private-key fallback | Refuse action, keep position manager in reduce-only/monitor mode | `signer_unavailable` with wallet id and error class | Immediate live alert; lease cannot renew until signer recovers |
| Kill switch fails to flatten or pause | Kill-switch command lacks confirmation receipt, open exposure remains | All live deployments under operator | Disable new signatures, mark deployment unsafe, require manual venue action | `kill_switch_anomaly` with expected vs observed exposure | Immediate high-severity page and persistent cockpit banner |
| Replay frame write fails after decision | Replay DB write error, missing replay id in evidence timeline | Audit layer; decision may have executed but public proof is incomplete | Retry append with same replay id; mark public profile partial until resolved | `replay_write_failure` with payload hash and DB error class | Alert after first live replay write miss |
| Public profile leaks sensitive material | Secret scanner, wallet/private key regex, redaction test failure | Public API and static metadata | Stop publishing affected payload, rotate leaked secret if real | `public_redaction_failure` with detector id, never raw secret | Immediate blocker; deployment must not proceed |
| Circuit breaker disabled or bypassed | Policy digest mismatch, breaker state absent, unsafe write path | Live lease and session promotion | Expire lease, restore signed policy bundle, require review | `circuit_breaker_bypass` with policy hash and call site | Immediate high-severity alert |

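As an illustration of how a single row might translate into a check, the
following is a minimal sketch of the stale-market-state row: compare snapshot
age against the policy maximum and refuse instead of signing. The `Refusal`
type, the `check_snapshot_age` function, and the use of seconds as the unit are
all assumptions for this example; only the `stale_state_refusal` entry name and
its two recorded values come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Refusal:
    """Hypothetical refusal record carrying the values the journal entry needs."""
    journal_entry: str
    snapshot_age_s: float
    policy_max_s: float


def check_snapshot_age(
    snapshot_ts: float, now: float, policy_max_s: float
) -> Optional[Refusal]:
    """Return a Refusal when the snapshot is older than policy allows, else None."""
    age = now - snapshot_ts
    if age > policy_max_s:
        # Refuse the signature; the caller should refresh state and rerun observe.
        return Refusal("stale_state_refusal", age, policy_max_s)
    return None
```

A caller would run this check before the signing route and journal the returned
record on refusal. The WS heartbeat gap and order intent timestamp drift named
in the same row would need their own checks, which this sketch omits.
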
## Operational Rule

A new incident is not closed until it has a replay or refusal id, a journal
entry, a regression test or monitoring check, a documented rollback, and a
decision about whether the failure mode becomes public. Live-capital incidents
must cite journal root evidence from `/docs/journal-cryptography.md` or explain
why the signed root is unavailable.
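
The closure rule reads naturally as a checklist. Below is a minimal sketch,
assuming a hypothetical `IncidentRecord` shape; none of these field names are
taken from ZERO's code, and the strings returned by `closure_blockers` are
illustrative labels only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical closure record; field names are assumptions."""
    evidence_id: Optional[str] = None        # replay id or refusal id
    journal_entry: Optional[str] = None      # e.g. "signer_unavailable"
    regression_check: Optional[str] = None   # regression test or monitoring check
    rollback_doc: Optional[str] = None       # where the rollback is documented
    publish_decision: Optional[bool] = None  # None means not yet decided
    live_capital: bool = False
    journal_root: Optional[str] = None       # signed root evidence, if available
    root_unavailable_reason: Optional[str] = None


def closure_blockers(rec: IncidentRecord) -> list[str]:
    """List what is still missing before the incident may be closed."""
    blockers = []
    if not rec.evidence_id:
        blockers.append("replay or refusal id")
    if not rec.journal_entry:
        blockers.append("journal entry")
    if not rec.regression_check:
        blockers.append("regression test or monitoring check")
    if not rec.rollback_doc:
        blockers.append("documented rollback")
    if rec.publish_decision is None:
        blockers.append("publication decision")
    # Live-capital incidents need root evidence or a stated reason it is missing.
    if rec.live_capital and not (rec.journal_root or rec.root_unavailable_reason):
        blockers.append("journal root evidence or explanation")
    return blockers
```

An incident is closable only when `closure_blockers` returns an empty list.
Note that a decision not to publish (`publish_decision=False`) still counts as
a decision.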

## Public Postmortem Rule

ZERO should publish redacted postmortems for any live-trading incident involving
a safety-gate refusal anomaly, journal anomaly, signing anomaly, kill-switch
anomaly, or public evidence inconsistency. Routine refused orders are replay
artifacts, not incidents.
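
The distinction between a publishable incident and a routine refusal can be
sketched as a small predicate. The `Anomaly` enum and the
`needs_public_postmortem` function are hypothetical names introduced for this
example; the anomaly categories themselves are the ones listed in the rule.

```python
from enum import Enum, auto


class Anomaly(Enum):
    """Anomaly classes named in the postmortem rule (enum itself is illustrative)."""
    SAFETY_GATE_REFUSAL = auto()
    JOURNAL = auto()
    SIGNING = auto()
    KILL_SWITCH = auto()
    PUBLIC_EVIDENCE_INCONSISTENCY = auto()


def needs_public_postmortem(live_trading: bool, anomalies: set[Anomaly]) -> bool:
    """True when a live-trading incident involves at least one listed anomaly."""
    return live_trading and len(anomalies) > 0


# A routine refused order carries no anomaly, so it stays a replay artifact.
routine_refusal = needs_public_postmortem(live_trading=True, anomalies=set())

# A signer failure during a live lease qualifies for a redacted postmortem.
signer_incident = needs_public_postmortem(
    live_trading=True, anomalies={Anomaly.SIGNING}
)
```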
