Demo Runbook¶
Failure drills, expected outputs, and recovery procedures for the
make demo-up / make demo-run stack.
Operator paths. This runbook covers two audiences:

- In-repo operator (`git clone`) — use `make demo-*` targets directly.
- Installed operator (`curl … install.sh | bash`) — run the bundled scripts from `~/.autonomyops/quickstart/`. The bundle ships the compose files, the numbered demo scripts (`01_build.sh` through `05_failure_drills.sh`), the policy source (`demo/policies/`), the lock fixture (`demo/locks/example.lock.json`), the demo Python agent (`demo/agent_py/`), and the cosign key-pair generator (`demo/keys/generate.sh` — the bundle ships only the generator; you generate a keypair locally on first use). The supply-chain demo (the `make demo-run` equivalent), preflight, smoke, offline-drain, and failure-injection drills all run from the bundle.
A few flows still require a repo checkout because they need fixtures
or topology not bundled today: HA failover (make demo-ha-failover),
multi-CP fleet operations, and edge-relay demos under the edge
profile. These are scoped to the more advanced runbook flows below
and are explicitly called out where they appear.
Each make demo-* command below is annotated with its installed-operator
equivalent in parentheses. Both forms run the same script and produce
the same output; the only difference is whether you have a repo
checkout.
Installed: the per-command parentheticals throughout this runbook
(# (installed: bash demo/scripts/X)) are the source of truth for the
in-repo→installed mapping; sections below that don’t carry an explicit
(installed: ...) annotation describe in-repo-only flows (HA failover,
fleet operations, edge-relay demos under the edge profile) — those
are flagged inline where they appear.
Installed-mode prerequisites (run once, after extracting the bundle)¶
The bundle ships only the cosign key-pair generator, never the keys
themselves. Generate the keypair locally before running preflight or
any of the supply-chain scripts (02_push_attach_sign.sh,
03_verify_and_run.sh) — preflight hard-fails on missing keys, and
the signing scripts read from the same demo/keys/cosign.{key,pub}
paths.
Installed: the bash block below IS the installed-mode setup —
the in-repo equivalent is “do nothing” because the demo keypair is
committed at demo/keys/cosign.{key,pub} for in-repo developers.
cd ~/.autonomyops/quickstart
bash demo/keys/generate.sh # one-time, per extracted bundle
Re-running generate.sh rotates the demo keypair and invalidates any
previously-signed artifacts; do it only when starting fresh.
In-repo operators can skip this — the demo keypair is committed at
demo/keys/cosign.{key,pub} for local development convenience.
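Before running preflight, you can confirm the prerequisite mechanically. This is a minimal sketch assuming the key paths named above; the `check_demo_keys` helper is ours, not something shipped in the bundle:

```shell
# Hypothetical helper (not part of the bundle): verify the demo keypair
# exists at the paths the signing scripts read from.
check_demo_keys() {
  local missing=0
  for f in demo/keys/cosign.key demo/keys/cosign.pub; do
    [ -f "$f" ] || { echo "missing: $f (run: bash demo/keys/generate.sh)"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "demo keypair present"
  return "$missing"
}
```

Run it from `~/.autonomyops/quickstart` (installed) or the repo root (in-repo); a non-zero exit means `generate.sh` still needs to be run.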
Pre-flight¶
make build # (installed: skip — autonomy binary is on PATH from install.sh)
make demo-up # (installed: bash demo/scripts/demo_up.sh)
Wait for all services to report healthy:
docker compose -f demo/docker-compose.yml ps
Expected (State column) — six non-profiled services:
NAME SERVICE STATE PORTS
demo-jaeger-1 jaeger running 0.0.0.0:16686->16686/tcp ...
demo-orchestrator-1 orchestrator running 0.0.0.0:8888->8888/tcp
demo-otel-collector-1 otel-collector running ...
demo-otel-sink-1 otel-sink running 0.0.0.0:4319->4318/tcp
demo-registry-1 registry running 0.0.0.0:5000->5000/tcp
demo-runtime-1 runtime running 0.0.0.0:7777->7777/tcp
All six must be running (not starting or unhealthy) before proceeding.
The two edged-node-* services declared in the same compose file are gated
behind the edge profile and are not started by default; they should not
appear in this list unless --profile edge was passed.
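The health gate can be scripted rather than eyeballed. A sketch with a generic retry helper (`wait_until` and `all_running` are our names, not repo targets); the count of six matches the service list above:

```shell
# Generic poll-until-success helper (hypothetical, not part of the repo).
wait_until() {  # wait_until <timeout_seconds> <command...>
  local deadline=$(( $(date +%s) + $1 )); shift
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# All six non-profiled services must be in the "running" state.
all_running() {
  [ "$(docker compose -f demo/docker-compose.yml ps --services --status running \
      | wc -l)" -eq 6 ]
}

# Example gate before proceeding:
# wait_until 60 all_running && echo "stack healthy"
```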
Bootstrap policy and run the agent:
# In-repo:
make demo-run
# Installed (bundle):
# (assumes `bash demo/keys/generate.sh` has already been run — see the
# "Installed-mode prerequisites" section above)
bash demo/scripts/01_build.sh
bash demo/scripts/02_push_attach_sign.sh
bash demo/scripts/03_verify_and_run.sh
The three numbered scripts together do exactly what make demo-run
does (it’s a thin wrapper around the same scripts).
Expected final output:
✓ PASS — echo allowed, shell denied correctly
Golden demo sequence (≤10 commands)¶
The repeatable demo path from a clean machine. Each command is idempotent and prints a stable success marker the operator can show on screen.
# 0. Verify tools, keys, paths, and regressions — fail fast before wasting time
make demo-preflight # (installed: bash demo/scripts/preflight.sh)
# Expected: "All preflight checks passed — ready for: make demo-up && make demo-run"
# 1. Build the Go binary
make build # (installed: skip — autonomy binary is on PATH from install.sh)
# 2. Start infrastructure + gate on health checks
make demo-smoke # (installed: bash demo/scripts/smoke.sh)
# Expected: "Smoke test passed — stack is healthy"
# 3. Build policy, push OCI artifacts, attach sidecars, sign with cosign,
# verify supply chain, and run the Python agent
make demo-run # (installed: bash demo/scripts/01_build.sh && \
# bash demo/scripts/02_push_attach_sign.sh && \
# bash demo/scripts/03_verify_and_run.sh)
# (installed-mode requires `bash demo/keys/generate.sh`
# to have been run once — see the prerequisites
# section near the top of this runbook)
# Expected: "✓ PASS — echo allowed, shell denied correctly"
# 4. Offline telemetry buffering + priority drain
make demo-offline-drain # (installed: bash demo/scripts/04_offline_then_drain.sh)
# Expected: "telemetry drain: OK — N events sent"
# 5. Failure-injection drills (optional)
make demo-drills # (installed: bash demo/scripts/05_failure_drills.sh)
# Expected: "Drills complete — passed: 9, failed: 0"
# 6. Tear down
make demo-clean # (installed: docker compose -f demo/docker-compose.yml down -v && rm -rf demo/data)
Stable markers worth confirming after each step:
| Step | Command | Stable marker |
|---|---|---|
| 0 | `make demo-preflight` | `All preflight checks passed — ready for: make demo-up && make demo-run` |
| 2 | `make demo-smoke` | `Smoke test passed — stack is healthy` |
| 3 | `make demo-run` | |
| 3 | `make demo-run` | |
| 3 | `make demo-run` | |
| 3 | `make demo-run` | `✓ PASS — echo allowed, shell denied correctly` |
| 4 | `make demo-offline-drain` | `telemetry drain: OK — N events sent` |
| 5 | `make demo-drills` | `Drills complete — passed: 9, failed: 0` |
`audit_id` UUIDs vary per call; the format is stable
(`xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx`).
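When scripting against the output, that stable shape can be asserted mechanically. A sketch; `is_audit_id` is our helper name:

```shell
# Hypothetical check: does a string match the audit_id shape
# xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx (lowercase hex, version nibble 4)?
is_audit_id() {
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$'
}
```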
Drill 1 — Registry offline¶
Simulates: OCI registry failure during push/attach operations.
docker compose -f demo/docker-compose.yml stop registry
autonomy oci push-test-artifact --image localhost:5000/autonomy-demo/agent:v1
Expected (any non-zero exit):
Error: ... connection refused
Restore:
docker compose -f demo/docker-compose.yml start registry
Wait for registry to pass its health check (≤15s), then confirm recovery:
curl -sf http://localhost:5000/v2/
Expected:
{}
Push succeeds:
autonomy oci push-test-artifact --image localhost:5000/autonomy-demo/agent:v1
pushed ref=localhost:5000/autonomy-demo/agent:v1 digest=sha256:...
Drill 2 — Incompatible policy bundle¶
Simulates: Deploying a bundle whose required_runtime_version does not
satisfy the runtime binary version (0.1.0).
Installed: identical commands; demo/policies/ and demo/data/policy
resolve to the same paths after cd ~/.autonomyops/quickstart.
Build an incompatible bundle:
autonomy policy build \
--in demo/policies \
--out /tmp/bad-bundle.tar.gz \
--version 99.0.0 \
--name bad \
--runtime-version ">=99.0.0"
Attempt to load:
autonomy policy load \
--bundle /tmp/bad-bundle.tar.gz \
--manager-dir demo/data/policy
Expected (exit 1):
policy load: REJECTED — bundle requires runtime >=99.0.0, have 0.1.0
The current and LKG slots are unchanged. Confirm:
autonomy policy status --manager-dir demo/data/policy
Current: version=1.0.0 digest=sha256:... loaded=...
LKG: (none)
The runtime continues to serve requests under the previous policy. Verify:
curl -s -X POST http://localhost:7777/v1/tool \
-H 'Content-Type: application/json' \
-d '{"kind":"tool.echo","params":{"message":"still working"}}'
{"decision":"allow","output":"still working","policy_ref":"1.0.0"}
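The gate this drill exercises is a minimum-version comparison. Assuming versions are plain dotted numerics, it can be sketched with GNU `sort -V` (the `satisfies_min` helper name is ours, not the runtime's implementation):

```shell
# Hypothetical sketch of the compatibility check: does the runtime version
# satisfy a ">=min" requirement? Relies on GNU sort's version ordering.
satisfies_min() {  # satisfies_min <have> <min>
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# satisfies_min 0.1.0 99.0.0 fails, so the bundle is rejected and the
# current/LKG slots stay untouched.
```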
Drill 3 — OTLP backend offline during drain¶
Simulates: Telemetry backend unavailable during the drain cycle.
Confirm WAL has events from prior tool calls:
autonomy telemetry export --dir demo/data/wal --out - | wc -l
Stop the OTLP sink:
docker compose -f demo/docker-compose.yml stop otel-sink
Attempt drain to the now-dead endpoint:
autonomy telemetry drain \
--dir demo/data/wal \
--endpoint http://localhost:4319
Expected (exit 1):
telemetry drain: send error: ...connection refused
WAL is not modified by a failed drain. Confirm entry count is unchanged:
autonomy telemetry export --dir demo/data/wal --out - | wc -l
The count must be equal to or greater than before the failed drain.
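The durability assertion can be scripted. A sketch: the count comparison is the invariant, and the surrounding `autonomy` calls are the ones shown above (the helper names are ours):

```shell
# Hypothetical wrapper around the invariant: a failed drain must never
# shrink the WAL. Capture counts around the drain attempt and compare.
wal_count() { autonomy telemetry export --dir demo/data/wal --out - | wc -l; }
wal_intact() {  # wal_intact <before> <after>; succeeds if no entries were lost
  [ "$2" -ge "$1" ]
}

# before=$(wal_count)
# autonomy telemetry drain --dir demo/data/wal --endpoint http://localhost:4319 || true
# wal_intact "$before" "$(wal_count)" && echo "WAL intact"
```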
Restore the sink:
docker compose -f demo/docker-compose.yml start otel-sink
Wait for the sink to accept connections (≤15s):
curl -s -o /dev/null -w "%{http_code}" -X POST http://localhost:4319/v1/logs \
-H 'Content-Type: application/json' -d '{}'
Expected: 200
Drain successfully:
autonomy telemetry drain \
--dir demo/data/wal \
--endpoint http://localhost:4319
Expected:
telemetry drain: OK — N events sent to http://localhost:4319
Re-run drain immediately (no new events):
autonomy telemetry drain \
--dir demo/data/wal \
--endpoint http://localhost:4319
Expected:
telemetry drain: nothing to drain
Offline → drain scenario¶
demo/scripts/04_offline_then_drain.sh runs the full offline accumulation
and priority drain sequence:
bash demo/scripts/04_offline_then_drain.sh
Expected sequence:
[demo] Simulating offline: stopping otel-sink...
✓ otel-sink stopped
[demo] Generating tool calls (runtime buffers events in WAL while sink is offline)...
[allow] tool.echo ×3
call 1: "decision":"allow"
call 2: "decision":"allow"
call 3: "decision":"allow"
[deny] tool.shell ×2
call 1: "decision":"deny"
call 2: "decision":"deny"
✓ 5 tool calls made (3 allow, 2 deny)
[demo] WAL entry count:
N events buffered in WAL
[demo] Bringing otel-sink back online...
✓ otel-sink ready at http://localhost:4319
[demo] Draining WAL → http://localhost:4319 in priority order (errors first, lifecycle last)...
telemetry drain: OK — N events sent to http://localhost:4319
✓ Drain complete
✓ Script 04 complete — offline WAL accumulation and priority drain demonstrated
Run all failure drills¶
make demo-drills # (installed: bash demo/scripts/05_failure_drills.sh)
Expected summary:
[demo] Drills complete — passed: 9, failed: 0
✓ All failure drills behaved correctly
If any drill fails, the output shows [DRILL FAIL] with the specific assertion
that did not hold.
Golden-output check¶
make demo-golden asserts that the live make demo-run output matches the
checked-in golden fixtures under demo/fixtures/. Run it after a successful
make demo-run:
make demo-golden # (installed: not yet supported — requires demo/fixtures/, tracked in #723)
Expected:
All golden checks passed
The check strips ANSI colour codes and variable fields (audit-ID UUIDs and timestamps) before comparing against the structural markers in the fixture. A drift between what the runtime currently emits and the recorded structure fails the check with a precise diff so the fixture can be intentionally regenerated rather than silently updated.
Fixture contents:
| File | Purpose |
|---|---|
| | Expected runtime responses for the two demo calls (echo allow, shell deny) |
| | Console-output template with … |
Python-version guard¶
The demo Python agent is pinned to Python 3.12 via
demo/agent_py/.python-version. To assert the local environment satisfies
the pin without starting Docker:
make demo-check-python # (installed: bash demo/scripts/check_python.sh)
Pydantic v2 and LangChain are fully compatible with 3.12; running the agent on a different minor version produces deprecation warnings and may break the supply-chain demo.
Operator workflows (orchestrator + fleet)¶
These targets exercise the optional control-plane and fleet surfaces. They
require make demo-up to have already started the orchestrator at
localhost:8888.
Installed: the per-target H3 sections below carry their own
(installed: ...) annotations where a script equivalent exists; HA
failover, make demo-fleet-summary, and make demo-up-fleet are
in-repo-only (the bundle’s compose has the orchestrator service but
not the multi-node fleet topology those flows depend on).
Control-plane smoke (make demo-orchestrator-smoke)¶
Asserts the orchestrator’s health endpoint responds and that event ingestion
is idempotent — the same event_id posted twice is silently deduplicated
(does not double-count).
make demo-orchestrator-smoke # (installed: bash demo/scripts/orchestrator_smoke.sh)
Runs demo/scripts/orchestrator_smoke.sh. Pass criterion: zero non-zero
exits across the health probe and the duplicate-ingest check.
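The deduplication property being asserted can be modeled in a few lines. This is a toy seen-set, not the orchestrator's implementation:

```shell
# Toy model of idempotent ingestion: the same event_id counts once.
# Requires bash (associative arrays).
declare -A seen
ingest() {  # ingest <event_id>: prints "ingested" or "duplicate"
  if [ -n "${seen[$1]:-}" ]; then
    echo duplicate
  else
    seen[$1]=1
    echo ingested
  fi
}
```

The smoke script's duplicate-ingest check asserts the real endpoint behaves the same way: the second `POST` of an identical `event_id` must not double-count.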
Recent control-plane events (make demo-cp-check)¶
Prints the last 10 events from the orchestrator’s event store. Useful as a
quick sanity check after make demo-run or after a polled-release demo:
make demo-cp-check # (installed: curl -s 'http://localhost:8888/v1/events?limit=10' | python3 -m json.tool)
Sends GET http://localhost:8888/v1/events?limit=10 and pretty-prints the
JSON. If python3 is not available, prints raw JSON; if no events have been
ingested yet, the response is {"events":[],"count":0}.
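That fallback behavior can be reproduced as a pipeline helper (`pretty_json` is our name; the actual target lives in the Makefile):

```shell
# Hypothetical equivalent of the target's behavior: pretty-print JSON when
# python3 is available, otherwise pass the raw body through.
pretty_json() {
  if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool
  else
    cat
  fi
}

# curl -s 'http://localhost:8888/v1/events?limit=10' | pretty_json
```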
Fleet snapshot (make demo-show-fleet)¶
A two-query orchestrator snapshot for channel=stable. The Makefile
target runs:
- `GET /v1/releases/latest?channel=stable` — the latest release pointer.
- `GET /v1/events?limit=10` — the ten most recent events from the orchestrator’s event store.
make demo-show-fleet # (installed: not yet supported — fleet-summary helper not bundled, tracked in #723)
The target does not call /v1/releases/{release_id}/acks directly,
so it does not show a structured per-node ack table. Acks are visible
indirectly — they appear in the recent-events stream as
ai.deployment.ack event rows when the orchestrator has received
recent traffic. Useful immediately after make demo-publish-release
to confirm the release pointer advanced and that ack events are
landing in the event store.
If no releases have been published yet, the first query prints
(no releases yet — run: make demo-publish-release) and the target
continues to the recent-events block.
HA failover (make demo-ha-failover)¶
Demonstrates that killing the leader control-plane does not produce a
double-promotion. Runs demo/scripts/10_ha_failover.sh which kills CP-1
(the current leader), waits for CP-2 to acquire the advisory lock, then
asserts no two CPs ever held the lock simultaneously.
make demo-ha-failover # (installed: not yet supported — multi-CP HA topology not bundled, tracked in #723)
Pass criterion: CP-2 reports session_lock_held=1 and CP-1 reports 0
within the failover window; no two-leader interval recorded in the audit
log.
Publish a desired-state release (make demo-publish-release)¶
Demonstrates the pull-based release model (v1.13 §1.2.3, advisory only —
the control plane never pushes to nodes). The script
(demo/scripts/06_releases.sh) waits for the orchestrator to be healthy,
then:
1. Publishes a release to channel `stable` via `POST /v1/releases`.
2. Simulates an agent poll: `GET /v1/releases/latest?channel=stable`.
3. Records three node acks via `POST /v1/nodes/{node_id}/ack`, exercising the full ack-status set:

    | Node | Status | Reason |
    |---|---|---|
    | `node-alpha` | `accepted` | Lock fingerprint matched, runtime compatible. |
    | `node-beta` | `rejected` | `required_runtime_version >=1.0.0` not satisfied by 0.1.0. |
    | `node-gamma` | `failed` | Disk-full or transient runtime failure during adoption. |

4. Shows the fleet view via `GET /v1/releases/{release_id}/acks`.
5. Publishes a second release and confirms the latest pointer advances.
make demo-publish-release # (installed: bash demo/scripts/06_releases.sh)
Each run publishes new releases (sequence increments). Pair with
make demo-show-fleet to see the latest release pointer and the most
recent events (acks appear as ai.deployment.ack rows in that event
stream); pair with make demo-cp-check for the same recent-events view
without the release-pointer query. For a structured per-node ack table,
query the orchestrator directly:
curl http://localhost:8888/v1/releases/{release_id}/acks. The longer
narrative flow that frames this script in a multi-node story is in
02-multi-node-seed-once-update-everywhere.md;
this section is the single-command operator entry point.
Release poll loop and lifecycle events (make demo-poll-loop)¶
Installed: bash demo/scripts/07_poll_loop.sh covers the same flow
(prerequisite make demo-up-build is in-repo only — the bundle uses
prebuilt images, so use bash demo/scripts/demo_up.sh instead).
Demonstrates the runtime’s release poll loop emitting lifecycle events
that flow through the WAL → OTel bridge → orchestrator pipeline. The
script (demo/scripts/07_poll_loop.sh):
1. Verifies the orchestrator and runtime are healthy.
2. Publishes a new release to channel `stable`.
3. Waits for the runtime’s poll loop to fire (default interval 30 s; override with `POLL_WAIT=10 make demo-poll-loop` if the runtime is started with a smaller `--poll-interval`).
4. Drains the runtime WAL to the OTel bridge so events reach the orchestrator.
5. Queries `GET /v1/events?event_type=ai.deployment.lifecycle` and parses the emitted phases.
make demo-poll-loop
Lifecycle phases emitted by runtime/poller.go (the script asserts the
first two; the verify-* phases require an explicit cosign pubkey
configuration):
| Phase | When emitted |
|---|---|
| | Every poll cycle. |
| | New … |
| | Cosign pubkey is configured (…). |
| | OCI + cosign + fingerprint + policy verification all succeeded. |
| | Any verification step failed (non-fatal — the poller keeps running). |
Prerequisites: make demo-up-build (full stack with orchestrator +
runtime + poll loop). The runtime must be started with
AUTONOMY_ORCHESTRATOR_URL set; docker-compose sets this automatically.
vNext acceptance harness (make demo-verify-vnext)¶
Runs the vNext Definition-of-Done harness end-to-end: supply-chain demo,
failure drills, control-plane telemetry, and lifecycle events. Builds the
binary first, then runs demo/scripts/verify_vnext.sh:
make demo-verify-vnext # (installed: bash demo/scripts/verify_vnext.sh)
Use this as the single command that asserts every demo-relevant invariant in one pass. It is the canonical pre-release acceptance check; CI runs the same script on every release-tag pipeline.
CI acceptance scripts¶
The same behaviors are tested automatically in CI via make ci:
| Script | What it asserts |
|---|---|
| | 10× fingerprint stability; canonicalize round-trip; Go unit tests |
| | Deny-all before load; allow/deny after load; Python adapter tests |
| | Sidecar attach + pull byte-equality; cosign sign + verify (optional) |
| | WAL accumulation; durability on failed drain; priority drain |
Run locally (requires Docker):
make ci
Telemetry event priority order¶
The drain delivers events in this order:
Priority 0 (High) autonomy.error ← security and fault events
Priority 1 (Normal) autonomy.decision ← policy allow/deny decisions
Priority 1 (Normal) autonomy.action ← tool execution results
Priority 2 (Low) autonomy.lifecycle ← bundle load, stale, rejected
Within each priority tier, events are ordered by age (TierHot < TierWarm < TierCold) and then by sequence number.
Events expire after 30 days and are removed by store.Purge() at the start
of each drain cycle.
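The tier mapping can be written down as a pure function. A sketch using the event names from the priority listing above (`priority_of` is our name; the real drain also orders by age tier and sequence number within each priority):

```shell
# Sketch of the drain's priority assignment.
priority_of() {  # priority_of <event_type>: 0 (high) .. 2 (low)
  case "$1" in
    autonomy.error) echo 0 ;;
    autonomy.decision|autonomy.action) echo 1 ;;
    autonomy.lifecycle) echo 2 ;;
    *) echo 2 ;;  # assumption: unknown types drain last, like lifecycle
  esac
}
```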
Jaeger UI¶
Traces from the runtime are forwarded to Jaeger at http://localhost:16686.
1. Open http://localhost:16686.
2. Select service: `autonomy-adk`.
3. Click Find Traces.
Each POST /v1/tool request appears as a trace with spans for policy
evaluation and tool execution.
Cleanup¶
Tear down and remove all generated data:
make demo-clean # (installed: docker compose -f demo/docker-compose.yml down -v && rm -rf demo/data)
This runs docker compose down -v (removes named volumes including registry
data) and rm -rf demo/data/.
To preserve the registry content across restarts, use demo-down instead:
make demo-down # (installed: docker compose -f demo/docker-compose.yml down)
Screen recording¶
make demo-record prints a step-by-step recording checklist with
narration cues and the stable output markers to call out during a 2–3
minute session.
make demo-record # (installed: bash demo/scripts/record_script.sh)
Stable markers worth showing on screen, in order:
| Step | Command | Marker to show |
|---|---|---|
| Preflight | `make demo-preflight` | `All preflight checks passed — ready for: make demo-up && make demo-run` |
| Supply-chain demo | `make demo-run` | |
| Policy enforcement | `make demo-run` | `✓ PASS — echo allowed, shell denied correctly` |
| Telemetry drain | `make demo-offline-drain` | `telemetry drain: OK — N events sent` |
| Failure drills | `make demo-drills` | `Drills complete — passed: 9, failed: 0` |
Port-conflict notes during preflight:
- If the demo stack is already running, port checks show `demo stack already running ✓` (green, not a warning).
- If a port is occupied by an unrelated process, preflight prints a `conflict` warning with remediation steps (`make demo-clean`, `sudo lsof -i :<port>`).
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| | cosign not on PATH | Install: … |
| | Demo key is not in cosign-native format | Regenerate: `bash demo/keys/generate.sh` |
| | Image was pushed but not signed | Re-run from script 02: `bash demo/scripts/02_push_attach_sign.sh` |
| Runtime starts in deny-all mode | No policy loaded | Run `make demo-run` |
| | Python launcher missing | Install: … |
| Port conflict on 4318 or 7777 | Another process is listening | Check: `sudo lsof -i :<port>` |
| Docker permission denied | User not in docker group | Add yourself to the `docker` group and re-login (`sudo usermod -aG docker $USER`) |
| Python agent prints version warning | Local Python is not 3.12 | Install Python 3.12 (the demo pins to 3.12 via `demo/agent_py/.python-version`) |
| `make demo-golden` fails | Demo output drifted from recorded fixture | Inspect the diff. If the drift is intentional (legitimate runtime change), regenerate the fixture; if not, fix the runtime regression. |
Installed: the table’s make demo-* references map to the same
bash demo/scripts/... equivalents documented elsewhere in this
runbook (e.g. make demo-run-unsigned →
SKIP_SIGNING=1 bash demo/scripts/0{1,2}.sh,
make demo-preflight → bash demo/scripts/preflight.sh).