Tutorial 02 — Multi-Node: Seed Once, Update Everywhere (Peer Propagation)

Objective: Publish a desired-state release to the control plane, watch multiple runtime nodes poll for it, detect the candidate, and emit verifiable lifecycle telemetry proving each node received and evaluated the update.

What you will demonstrate:

  • Start a multi-node fleet with the Docker Compose fleet overlay

  • Publish a desired-state release to the control plane

  • Observe each node poll and emit polled and candidate_detected lifecycle events

  • Query the control plane for per-node telemetry confirming update propagation

  • Understand which capabilities are fully implemented and which are optional deployment modes

Time: ~20 minutes


Implementation Status

Before running this tutorial, understand which capabilities are fully implemented:

| Capability | Status | Evidence |
|---|---|---|
| Control-plane release publication | ✅ Implemented | orchestrator/server.go:POST /v1/releases |
| Release polling by runtime nodes | ✅ Implemented | runtime/poller.go, 7 unit tests |
| Lifecycle telemetry (polled, candidate_detected, verify_*) | ✅ Implemented | runtime/poller.go:emitLifecycle() |
| Control-plane event query | ✅ Implemented | orchestrator/server.go:GET /v1/events |
| Node acknowledgement of releases | ✅ Implemented | orchestrator/server.go:POST /v1/nodes/{id}/ack |
| Activation of release (policy swap) | ✅ Implemented | runtime/poller.go (PolicyActivator), cmd/autonomy/commands/runtime.go (makeRuntimeActivator) |
| Edge-to-edge airgapped segment relay | ✅ Implemented (edge module) | edge/relay/executor.go, TestRelayE2E_MultiPeer |
| Edge-to-edge relay in Docker demo | ✅ Wired | demo/docker-compose.yml (edged-node-a, edged-node-b profile), demo/scripts/08_edge_relay.sh |

What “seed once, update everywhere” means in the current codebase: The control plane is seeded with one desired-state release. Each runtime node independently polls, detects the candidate, verifies supply-chain constraints, and (when PolicyActivator is configured) activates the new policy without process restart. Activation persists active-lock.json in the runtime WAL directory to survive daemon restarts. The edge relay daemon (edged) is a separate binary for segment propagation; this tutorial focuses on runtime release polling + activation.
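
Once an activation has happened (Steps 4–6 below), you can confirm the persisted lock from the host. This is only a quick check, assuming the runtime image ships basic shell utilities and the same /data/wal mount used by the drain command in Step 5:

# active-lock.json only exists after a successful activation:
docker compose -f demo/docker-compose.yml exec -T runtime ls -l /data/wal/active-lock.json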


Architecture

                    ┌──────────────────────────────────────────┐
                    │           Control Plane                  │
                    │         :8888 /v1/releases               │
                    │         POST → desired-state record      │
                    │         GET  → latest release            │
                    └──────┬─────────────────────┬────────────┘
                           │ poll every 30s       │ poll every 30s
                    ┌──────▼──────┐        ┌──────▼──────┐
                    │  Runtime 1  │        │  Runtime 2  │
                    │  :7777      │        │  :7778      │
                    │             │        │             │
                    │  polled     │        │  polled     │
                    │  candidate  │        │  candidate  │
                    │  _detected  │        │  _detected  │
                    └──────┬──────┘        └──────┬──────┘
                           │ OTLP                 │ OTLP
                    ┌──────▼──────────────────────▼──────┐
                    │         OTel Collector              │
                    │         → otel-sink bridge          │
                    │         → Control-plane /v1/events  │
                    └─────────────────────────────────────┘

GET /v1/events?event_type=ai.deployment.lifecycle → shows events from both nodes

Step 0: Prerequisites

Complete Tutorial 01 first. The fleet demo uses the same policy bundle and agent image.

Ensure the stack is down:

make demo-down

Step 1: Start the Fleet Stack

The fleet overlay adds N runtime nodes (default N=3):

make demo-up-fleet N=2

This runs:

docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml up -d --build
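
To confirm which containers the overlay started (service names vary with N), a standard compose listing works:

docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml ps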

Wait for all nodes to be healthy:

# Node 1 (primary):
curl -s http://localhost:7777/health | jq .

# Node 2 (fleet):
curl -s http://localhost:7778/health | jq .

Expected:

{"status":"ok","mode":"normal"}

Check the control-plane:

curl -s http://localhost:8888/v1/health | jq .

Expected:

{"status":"ok"}

Step 2: Load Policy into Each Node

Repeat Tutorial 01 Steps 1–3 (or run make demo-run-unsigned) to build a bundle and load it:

make demo-run-unsigned

Confirm both nodes have a policy:

curl -s http://localhost:7777/health | jq .mode    # "normal"
curl -s http://localhost:7778/health | jq .mode    # "normal"

Step 3: Publish a Desired-State Release

The control plane stores releases as desired-state records. They are advisory only — they describe what the desired state should be, but do not push changes to nodes. Each node polls and decides independently.

For production high availability and durability, use the replicated PostgreSQL control-plane backend (orchestrator/pgstore/*) and migration flow (autonomy-orchestrator migrate). This tutorial’s local compose flow is the minimal single-instance path for demonstrating release polling behavior.

# Capture the behavioral fingerprint from the lock we built in Tutorial 01
TARGET_FP=$(autonomy lock fingerprint --in /tmp/demo.lock.json 2>&1 | head -1)
echo "Target fingerprint: $TARGET_FP"

# Publish the release
RELEASE=$(curl -s -X POST http://localhost:8888/v1/releases \
  -H 'Content-Type: application/json' \
  -d "{
    \"channel\": \"stable\",
    \"target_lock_fingerprint\": \"${TARGET_FP}\",
    \"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v1\",
    \"policy_ref\": \"localhost:5000/autonomy-demo/agent:v1-policy\",
    \"notes\": \"Tutorial 02 release - $(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }")

echo $RELEASE | jq .

Expected output:

{
  "release_id": "01JX...",
  "channel": "stable",
  "release_sequence": 1
}

Save the release ID:

RELEASE_ID=$(echo $RELEASE | jq -r .release_id)

Verify the release is queryable:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .

Expected (abridged):

{
  "release": {
    "release_id": "01JX...",
    "channel": "stable",
    "release_sequence": 1,
    "target_lock_fingerprint": "blake3:...",
    "artifact_ref": "localhost:5000/autonomy-demo/agent:v1"
  }
}
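
As a quick sanity check, confirm the latest pointer matches the release you just published:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq -r .release.release_id
# Should print the same value as $RELEASE_ID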

Control-plane authority model: The control plane stores desired-state releases as advisory records. It does NOT push configuration to nodes. Nodes pull via polling. This is intentional: offline-first design means nodes operate without connectivity. Evidence: orchestrator/store.go, v1.13 §1.2.3 advisory-only annotation in runtime/poller.go:1


Step 4: Wait for Poll Cycles

Each node polls every 30 seconds by default. The poll interval is configurable:

# Override for faster demo (5s interval):
# Add --poll-interval 5s to the runtime start command, or set via docker-compose env

# For this demo, wait at least one default interval:
echo "Waiting 35 seconds for poll cycles..."
sleep 35

During this time, each runtime’s Poller goroutine:

  1. Calls GET /v1/releases/latest?channel=stable (15s client timeout)

  2. Decodes the response into latestReleaseResponse

  3. Compares TargetLockFingerprint against CurrentFingerprintFunc() (persisted active-lock.json)

  4. If fingerprints differ, emits polled then candidate_detected

  5. If verifier + pubkey are configured, emits verify_started and either verify_passed or verify_failed

  6. On verify_passed, runs PolicyActivator and emits activated or activate_failed
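
You can fetch the same record a node sees on each poll and project out the fields it compares; the jq projection below is just for readability and uses the field names shown in Step 3. Each node compares candidate_fp against its own persisted fingerprint and emits candidate_detected when they differ:

# What every node fetches on each poll cycle:
curl -s "http://localhost:8888/v1/releases/latest?channel=stable" \
  | jq '{candidate_fp: .release.target_lock_fingerprint, release_sequence: .release.release_sequence}'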


Step 5: Drain Telemetry from Each Node

Each node accumulates lifecycle events in its local WAL. Drain to the control plane:

# The demo/docker-compose.yml runtime container has /data/wal bind-mounted
# to demo/data/wal. For fleet nodes, each has its own WAL volume.

# Drain primary node WAL:
docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

# If using fleet nodes, drain each one (they share the bind-mount in this demo setup)

Alternatively, use the automated target:

make demo-fleet-summary

Step 6: Query Per-Node Lifecycle Events

Query the control plane for lifecycle events from all nodes:

curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=20" | jq .

Expected output (abridged):

{
  "count": 4,
  "events": [
    {
      "event_id": "runtime:wal:5",
      "event_type": "ai.deployment.lifecycle",
      "node_id": "runtime",
      "timestamp": "2026-...",
      "payload": "{\"phase\":\"candidate_detected\",\"channel\":\"stable\",\"release_id\":\"01JX...\",\"release_sequence\":1,...}"
    },
    {
      "event_id": "runtime:wal:4",
      "event_type": "ai.deployment.lifecycle",
      "node_id": "runtime",
      "timestamp": "2026-...",
      "payload": "{\"phase\":\"polled\",\"channel\":\"stable\"}"
    }
  ]
}

Each event carries phase in the payload. Expected phases per poll cycle:

| Phase | When emitted | Fields |
|---|---|---|
| polled | Every poll, always | channel |
| candidate_detected | Fingerprint differs from current | channel, release_id, release_sequence, candidate_fp, current_fp |
| verify_started | Only if AUTONOMY_COSIGN_PUBKEY set | release_id, artifact_ref |
| verify_passed | cosign verification succeeded | release_id, artifact_ref |
| verify_failed | cosign verification failed (non-fatal) | release_id, artifact_ref, error |
| activated | Policy activation succeeded | release_id, lock_fp |
| activate_failed | Policy activation failed (non-fatal) | release_id, artifact_ref, error |

Verification and activation phases: In this tutorial, verify_* phases are emitted only when a verifier public key is configured; activate_* phases additionally require verification to pass (and a configured PolicyActivator). Set AUTONOMY_COSIGN_PUBKEY=/path/to/cosign.pub in the runtime service environment to enable verification and activation events. Evidence: runtime/poller.go:poll() lines 203-229
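
To get a compact per-node view of the reported phases, you can unpack the payload string from the same events query (a convenience one-liner, not one of the demo make targets):

curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=50" \
  | jq -r '.events[] | [.node_id, (.payload | fromjson | .phase)] | @tsv'
# Prints one "node_id  phase" pair per line, e.g. "runtime  candidate_detected"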


Step 7: Simulate Node Acknowledgement

After a node processes a release (in production: after activation), it acknowledges:

curl -s -X POST "http://localhost:8888/v1/nodes/node-1/ack" \
  -H 'Content-Type: application/json' \
  -d "{
    \"release_id\": \"${RELEASE_ID}\",
    \"status\": \"accepted\",
    \"reason\": \"tutorial demonstration\",
    \"lock_fingerprint_observed\": \"${TARGET_FP}\",
    \"runtime_version\": \"0.1.0\",
    \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }" | jq .

Expected output:

{"stored": true}

Query fleet acknowledgements:

curl -s "http://localhost:8888/v1/releases/${RELEASE_ID}/acks" | jq .

Expected output:

{
  "count": 1,
  "acks": [
    {
      "node_id": "node-1",
      "status": "accepted",
      "reason": "tutorial demonstration",
      "lock_fingerprint_observed": "blake3:...",
      "runtime_version": "0.1.0"
    }
  ]
}
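
To see the fleet view fill out, repeat the acknowledgement with a second node ID (the node ID is just the path parameter in /v1/nodes/{id}/ack); the acks query for the same release should then report count: 2:

curl -s -X POST "http://localhost:8888/v1/nodes/node-2/ack" \
  -H 'Content-Type: application/json' \
  -d "{
    \"release_id\": \"${RELEASE_ID}\",
    \"status\": \"accepted\",
    \"reason\": \"tutorial demonstration\",
    \"lock_fingerprint_observed\": \"${TARGET_FP}\",
    \"runtime_version\": \"0.1.0\",
    \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }" | jq .

curl -s "http://localhost:8888/v1/releases/${RELEASE_ID}/acks" | jq .count
# Expected: 2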

Step 8: Publish a Second Release and Observe Sequence Advance

curl -s -X POST http://localhost:8888/v1/releases \
  -H 'Content-Type: application/json' \
  -d "{
    \"channel\": \"stable\",
    \"target_lock_fingerprint\": \"blake3:0000000000000000000000000000000000000000000000000000000000000001\",
    \"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v2\",
    \"policy_ref\": \"localhost:5000/autonomy-demo/agent:v2-policy\",
    \"notes\": \"Tutorial 02 release v2\"
  }" | jq .

Expected:

{
  "release_id": "01JY...",
  "channel": "stable",
  "release_sequence": 2
}

Verify the latest pointer advanced:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .release.release_sequence
# Expected: 2

Sequence semantics: release_sequence is monotonic per channel. Nodes use it to detect whether there is a new release since the last poll without comparing full payloads. Evidence: orchestrator/store.go:CreateRelease()


Edge Relay Demo (Advanced — Unit Test Level)

The edged daemon implements full airgapped segment relay:

  • Ingest segments from peers via mTLS transport

  • Store to local disk with ceiling enforcement and eviction

  • Schedule outbound relay via BoltDB ledger

  • Execute relay with bounded retry and dead-letter (INV-12)

  • Confirm success via all_peers or one_peer condition

This is fully implemented and tested at the unit level:

# Run the relay end-to-end test (in-process, no network):
cd edge
GOWORK=off go test ./relay/... -run TestRelayE2E_MultiPeer -v

Expected output (key lines):

INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-1 condition=all_peers acked_count=2
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-2 condition=all_peers acked_count=2
PASS

The full failure-injection suite:

# From repo root:
make edge-fi

This runs the FI suite and produces a traceability report.

Relay in airgapped deployments: In production, edged is configured with a static peer list (known_peers in edge.toml) and runs on each node. When Node A ingests a segment (e.g., a new policy bundle), it schedules outbound relay to Node B and Node C. Once both ACK (if success_condition = all_peers), idx.RecordRelayed() marks the segment as successfully propagated. This is the airgapped “seed once, update everywhere” propagation path. Evidence: edge/relay/executor.go:checkSuccessCondition(), edge/relay/e2e_test.go:TestRelayE2E_MultiPeer


Automated Version

make demo-up
make demo-run              # build + sign + verify (Tutorial 01)
make demo-publish-release  # publish a release
make demo-poll-loop        # wait + drain + show lifecycle events
make demo-fleet-summary    # per-node telemetry

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| No candidate_detected events | Poll interval not elapsed | Wait 35s, or use the POLL_WAIT=35 env override |
| count: 0 in event query | WAL not drained yet | Run autonomy telemetry drain in the runtime container |
| 404 on GET /v1/releases/latest | No releases published | Run Step 3 |
| verify_failed in events | cosign pubkey set but artifact not signed | Unset AUTONOMY_COSIGN_PUBKEY or sign the artifact first |
| Fleet node connection refused on :7778 | Fleet compose not running | make demo-up-fleet N=2 |


What Just Happened

  • Started a multi-node fleet stack (primary + fleet nodes)

  • Published a desired-state release to the control plane with a target lock fingerprint

  • Waited for each node’s poll cycle (30s default) to fire

  • Drained WAL telemetry to the control plane via the OTel pipeline

  • Queried per-node lifecycle events confirming polled and candidate_detected phases

  • Demonstrated node acknowledgement (POST /v1/nodes/{id}/ack)

  • Published a second release and confirmed the release_sequence advanced

  • Ran the edge relay end-to-end unit test (TestRelayE2E_MultiPeer) to show the full airgapped propagation pipeline

Next Tutorial

Tutorial 03 — Crash and Recovery: WAL + Safe-Point