Tutorial 02 — Multi-Node: Seed Once, Update Everywhere (Peer Propagation)¶
Objective: Publish a desired-state release to the control plane, watch multiple runtime nodes poll for it, detect the candidate, and emit verifiable lifecycle telemetry proving each node received and evaluated the update.
What you will demonstrate:
Start a multi-node fleet with the Docker Compose fleet overlay
Publish a desired-state release to the control plane
Observe each node poll and emit `polled` → `candidate_detected` lifecycle events
Query the control plane for per-node telemetry confirming update propagation
Understand what is fully implemented vs. optional deployment modes
Time: ~20 minutes
Implementation Status¶
Before running this tutorial, understand which capabilities are fully implemented:
| Capability | Status | Evidence |
|---|---|---|
| Control-plane release publication | ✅ Implemented | |
| Release polling by runtime nodes | ✅ Implemented | |
| Lifecycle telemetry (`ai.deployment.lifecycle`) | ✅ Implemented | |
| Control-plane event query | ✅ Implemented | |
| Node acknowledgement of releases | ✅ Implemented | |
| Activation of release (policy swap) | ✅ Implemented | |
| Edge-to-edge airgapped segment relay | ✅ Implemented (edge module) | |
| Edge-to-edge relay in Docker demo | ✅ Wired | |
What “seed once, update everywhere” means in the current codebase: The control plane is seeded with one desired-state release. Each runtime node independently polls, detects the candidate, verifies supply-chain constraints, and (when `PolicyActivator` is configured) activates the new policy without process restart. Activation persists `active-lock.json` in the runtime WAL directory to survive daemon restarts. The edge relay daemon (`edged`) is a separate binary for segment propagation; this tutorial focuses on runtime release polling + activation.
Architecture¶
┌──────────────────────────────────────────┐
│ Control Plane │
│ :8888 /v1/releases │
│ POST → desired-state record │
│ GET → latest release │
└──────┬─────────────────────┬────────────┘
│ poll every 30s │ poll every 30s
┌──────▼──────┐ ┌──────▼──────┐
│ Runtime 1 │ │ Runtime 2 │
│ :7777 │ │ :7778 │
│ │ │ │
│ polled │ │ polled │
│ candidate │ │ candidate │
│ _detected │ │ _detected │
└──────┬──────┘ └──────┬──────┘
│ OTLP │ OTLP
┌──────▼──────────────────────▼──────┐
│ OTel Collector │
│ → otel-sink bridge │
│ → Control-plane /v1/events │
└─────────────────────────────────────┘
GET /v1/events?event_type=ai.deployment.lifecycle → shows events from both nodes
Step 0: Prerequisites¶
Complete Tutorial 01 first. The fleet demo uses the same policy bundle and agent image.
Ensure the stack is down:
make demo-down
Step 1: Start the Fleet Stack¶
The fleet overlay adds N runtime nodes (default N=3):
make demo-up-fleet N=2
This runs:
docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml up -d --build
Wait for all nodes to be healthy:
# Node 1 (primary):
curl -s http://localhost:7777/health | jq .
# Node 2 (fleet):
curl -s http://localhost:7778/health | jq .
Expected:
{"status":"ok","mode":"normal"}
Check the control-plane:
curl -s http://localhost:8888/v1/health | jq .
Expected:
{"status":"ok"}
Step 2: Load Policy into Each Node¶
Repeat Tutorial 01 Steps 1–3 (or run make demo-run-unsigned) to build a bundle and load it:
make demo-run-unsigned
Confirm both nodes have a policy:
curl -s http://localhost:7777/health | jq .mode # "normal"
curl -s http://localhost:7778/health | jq .mode # "normal"
Step 3: Publish a Desired-State Release¶
The control plane stores releases as desired-state records. They are advisory only — they describe what the desired state should be, but do not push changes to nodes. Each node polls and decides independently.
For production HA durability, use the replicated PostgreSQL control-plane
backend (orchestrator/pgstore/*) and migration flow
(autonomy-orchestrator migrate). This tutorial’s local compose flow is the
minimal single-instance path for release polling behavior.
# Capture the behavioral fingerprint from the lock we built in Tutorial 01
TARGET_FP=$(autonomy lock fingerprint --in /tmp/demo.lock.json 2>&1 | head -1)
echo "Target fingerprint: $TARGET_FP"
# Publish the release
RELEASE=$(curl -s -X POST http://localhost:8888/v1/releases \
-H 'Content-Type: application/json' \
-d "{
\"channel\": \"stable\",
\"target_lock_fingerprint\": \"${TARGET_FP}\",
\"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v1\",
\"policy_ref\": \"localhost:5000/autonomy-demo/agent:v1-policy\",
\"notes\": \"Tutorial 02 release - $(date -u +%Y-%m-%dT%H:%M:%SZ)\"
}")
echo $RELEASE | jq .
Expected output:
{
"release_id": "01JX...",
"channel": "stable",
"release_sequence": 1
}
Save the release ID:
RELEASE_ID=$(echo $RELEASE | jq -r .release_id)
Verify the release is queryable:
curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .
Expected (abridged):
{
"release": {
"release_id": "01JX...",
"channel": "stable",
"release_sequence": 1,
"target_lock_fingerprint": "blake3:...",
"artifact_ref": "localhost:5000/autonomy-demo/agent:v1"
}
}
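Optional cross-check (a sketch using only the endpoints already shown above): confirm that the release you just published is the one the latest pointer now returns.
# Compare the saved release ID against the channel's latest pointer
LATEST_ID=$(curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq -r .release.release_id)
if [ "$LATEST_ID" = "$RELEASE_ID" ]; then
  echo "latest pointer matches published release $RELEASE_ID"
else
  echo "latest pointer is $LATEST_ID, expected $RELEASE_ID" >&2
fi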
Control-plane authority model: The control plane stores desired-state releases as advisory records. It does NOT push configuration to nodes. Nodes pull via polling. This is intentional: offline-first design means nodes operate without connectivity. Evidence: `orchestrator/store.go`, v1.13 §1.2.3 advisory-only annotation in `runtime/poller.go:1`
Step 4: Wait for Poll Cycles¶
Each node polls every 30 seconds by default. The poll interval is configurable:
# Override for faster demo (5s interval):
# Add --poll-interval 5s to the runtime start command, or set via docker-compose env
# For this demo, wait at least one default interval:
echo "Waiting 35 seconds for poll cycles..."
sleep 35
During this time, each runtime’s Poller goroutine:
Calls `GET /v1/releases/latest?channel=stable` (15s client timeout)
Decodes the response into `latestReleaseResponse`
Compares `TargetLockFingerprint` against `CurrentFingerprintFunc()` (persisted `active-lock.json`); a shell sketch of this comparison follows the list
If fingerprints differ, emits `polled` then `candidate_detected`
If verifier + pubkey are configured, emits `verify_started` and either `verify_passed` or `verify_failed`
On `verify_passed`, runs `PolicyActivator` and emits `activated` or `activate_failed`
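You can reproduce that comparison from outside the node with curl and jq. This is only a sketch: `CURRENT_FP` stands in for the node's persisted active-lock fingerprint and is left empty to mimic a node that has not yet activated any release; it is not a value queried from the runtime.
# Fetch the latest release the way the poller does and compare fingerprints
CURRENT_FP=""   # stand-in for the node's CurrentFingerprintFunc() result
LATEST_FP=$(curl -s "http://localhost:8888/v1/releases/latest?channel=stable" \
  | jq -r .release.target_lock_fingerprint)
if [ "$LATEST_FP" != "$CURRENT_FP" ]; then
  echo "fingerprints differ: a node in this state would emit candidate_detected"
else
  echo "fingerprints match: no candidate"
fi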
Step 5: Drain Telemetry from Each Node¶
Each node accumulates lifecycle events in its local WAL. Drain to the control plane:
# The demo/docker-compose.yml runtime container has /data/wal bind-mounted
# to demo/data/wal. For fleet nodes, each has its own WAL volume.
# Drain primary node WAL:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry drain \
--dir /data/wal \
--endpoint http://otel-sink:4318
# If using fleet nodes, drain each one (they share the bind-mount in this demo setup)
Alternatively, use the automated target:
make demo-fleet-summary
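If you want to drain each fleet node's WAL explicitly instead of relying on the shared bind-mount or the make target, a loop like the one below works. The extra service name (`runtime-2`) is hypothetical; check `docker compose ps` for the names your fleet overlay actually uses.
# Drain each runtime service's WAL to the OTel sink (service names after
# "runtime" are illustrative)
for SVC in runtime runtime-2; do
  docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml \
    exec -T "$SVC" autonomy telemetry drain \
    --dir /data/wal \
    --endpoint http://otel-sink:4318
done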
Step 6: Query Per-Node Lifecycle Events¶
Query the control plane for lifecycle events from all nodes:
curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=20" | jq .
Expected output (abridged):
{
"count": 4,
"events": [
{
"event_id": "runtime:wal:5",
"event_type": "ai.deployment.lifecycle",
"node_id": "runtime",
"timestamp": "2026-...",
"payload": "{\"phase\":\"candidate_detected\",\"channel\":\"stable\",\"release_id\":\"01JX...\",\"release_sequence\":1,...}"
},
{
"event_id": "runtime:wal:4",
"event_type": "ai.deployment.lifecycle",
"node_id": "runtime",
"timestamp": "2026-...",
"payload": "{\"phase\":\"polled\",\"channel\":\"stable\"}"
}
]
}
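A compact way to see which phases each node has reported (a sketch that relies on `payload` being a JSON-encoded string, as in the sample output above):
# Count lifecycle phases per node: decode each event's payload and group by node_id + phase
curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=100" \
  | jq -r '.events[] | "\(.node_id)\t\(.payload | fromjson | .phase)"' \
  | sort | uniq -c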
Each event carries a `phase` field in the payload. Expected phases per poll cycle:

| Phase | When emitted | Fields |
|---|---|---|
| `polled` | Every poll, always | |
| `candidate_detected` | Fingerprint differs from current | |
| `verify_started` | Only if `AUTONOMY_COSIGN_PUBKEY` is set | |
| `verify_passed` | cosign verification succeeded | |
| `verify_failed` | cosign verification failed (non-fatal) | |
| `activated` | policy activation succeeded | |
| `activate_failed` | policy activation failed (non-fatal) | |
Verification and activation phases: In this tutorial, `verify_*` and `activate_*` phases are emitted only when a verifier public key is configured and verification passes. Set `AUTONOMY_COSIGN_PUBKEY=/path/to/cosign.pub` in the runtime service environment to enable verification and activation events. Evidence: `runtime/poller.go:poll()` lines 203-229
Step 7: Simulate Node Acknowledgement¶
After a node processes a release (in production: after activation), it acknowledges:
curl -s -X POST "http://localhost:8888/v1/nodes/node-1/ack" \
-H 'Content-Type: application/json' \
-d "{
\"release_id\": \"${RELEASE_ID}\",
\"status\": \"accepted\",
\"reason\": \"tutorial demonstration\",
\"lock_fingerprint_observed\": \"${TARGET_FP}\",
\"runtime_version\": \"0.1.0\",
\"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
}" | jq .
Expected output:
{"stored": true}
Query fleet acknowledgements:
curl -s "http://localhost:8888/v1/releases/${RELEASE_ID}/acks" | jq .
Expected output:
{
"count": 1,
"acks": [
{
"node_id": "node-1",
"status": "accepted",
"reason": "tutorial demonstration",
"lock_fingerprint_observed": "blake3:...",
"runtime_version": "0.1.0"
}
]
}
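To record acknowledgements for every node in the fleet, loop over the node IDs. The IDs below (`node-1`, `node-2`) are demonstration values in the same spirit as the single ack above; substitute whatever IDs identify your runtime nodes.
# Post one acknowledgement per node for the current release
for NODE in node-1 node-2; do
  curl -s -X POST "http://localhost:8888/v1/nodes/${NODE}/ack" \
    -H 'Content-Type: application/json' \
    -d "{
      \"release_id\": \"${RELEASE_ID}\",
      \"status\": \"accepted\",
      \"reason\": \"tutorial demonstration\",
      \"lock_fingerprint_observed\": \"${TARGET_FP}\",
      \"runtime_version\": \"0.1.0\",
      \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
    }" | jq .
done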
Step 8: Publish a Second Release and Observe Sequence Advance¶
curl -s -X POST http://localhost:8888/v1/releases \
-H 'Content-Type: application/json' \
-d "{
\"channel\": \"stable\",
\"target_lock_fingerprint\": \"blake3:0000000000000000000000000000000000000000000000000000000000000001\",
\"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v2\",
\"policy_ref\": \"localhost:5000/autonomy-demo/agent:v2-policy\",
\"notes\": \"Tutorial 02 release v2\"
}" | jq .
Expected:
{
"release_id": "01JY...",
"channel": "stable",
"release_sequence": 2
}
Verify the latest pointer advanced:
curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .release.release_sequence
# Expected: 2
Sequence semantics: `release_sequence` is monotonic per channel. Nodes use it to detect whether there is a new release since the last poll without comparing full payloads. Evidence: `orchestrator/store.go:CreateRelease()`
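The shell equivalent of that check looks like the following sketch; `LAST_SEEN` stands in for whatever sequence a node recorded on its previous poll.
# Detect "new release since last poll" by comparing sequences, not payloads
LAST_SEEN=1   # hypothetical sequence recorded on the previous poll
LATEST_SEQ=$(curl -s "http://localhost:8888/v1/releases/latest?channel=stable" \
  | jq -r .release.release_sequence)
if [ "$LATEST_SEQ" -gt "$LAST_SEEN" ]; then
  echo "new release: sequence advanced $LAST_SEEN -> $LATEST_SEQ"
fi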
Edge Relay Demo (Advanced — Unit Test Level)¶
The edged daemon implements full airgapped segment relay:
Ingest segments from peers via mTLS transport
Store to local disk with ceiling enforcement and eviction
Schedule outbound relay via BoltDB ledger
Execute relay with bounded retry and dead-letter (INV-12)
Confirm success via `all_peers` or `one_peer` condition
This is fully implemented and tested at the unit level:
# Run the relay end-to-end test (in-process, no network):
cd edge
GOWORK=off go test ./relay/... -run TestRelayE2E_MultiPeer -v
Expected output (key lines):
INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-1 condition=all_peers acked_count=2
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-2 condition=all_peers acked_count=2
PASS
The full failure-injection suite:
# From repo root:
make edge-fi
This runs the FI suite and produces a traceability report.
Relay in airgapped deployments: In production, `edged` is configured with a static peer list (`known_peers` in `edge.toml`) and runs on each node. When Node A ingests a segment (e.g., a new policy bundle), it schedules outbound relay to Node B and Node C. Once both ACK (if `success_condition = all_peers`), `idx.RecordRelayed()` marks the segment as successfully propagated. This is the airgapped “seed once, update everywhere” propagation path. Evidence: `edge/relay/executor.go:checkSuccessCondition()`, `edge/relay/e2e_test.go:TestRelayE2E_MultiPeer`
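For orientation only, a minimal peer-list fragment might look like the sketch below. `known_peers` and `success_condition` are the keys named in this tutorial; the value formats, file layout, and any other keys are assumptions to verify against the edged configuration reference.
# Illustrative edge.toml fragment (schema not confirmed here; peer addresses are placeholders)
cat > /tmp/edge.toml <<'EOF'
known_peers = ["node-b.example:9443", "node-c.example:9443"]
success_condition = "all_peers"
EOF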
Automated Version¶
make demo-up
make demo-run # build + sign + verify (Tutorial 01)
make demo-publish-release # publish a release
make demo-poll-loop # wait + drain + show lifecycle events
make demo-fleet-summary # per-node telemetry
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| No lifecycle events | Poll interval not elapsed | Wait 35s; or use a shorter `--poll-interval` |
| | WAL not drained yet | Run the drain command from Step 5 |
| | No releases published | Run Step 3 |
| `verify_failed` events | cosign pubkey set but artifact not signed | Unset `AUTONOMY_COSIGN_PUBKEY` |
| Fleet node `:7778` unreachable | Fleet compose not running | `make demo-up-fleet N=2` |
What Just Happened¶
Started a multi-node fleet stack (primary + fleet nodes)
Published a desired-state release to the control plane with a target lock fingerprint
Waited for each node’s poll cycle (30s default) to fire
Drained WAL telemetry to the control plane via the OTel pipeline
Queried per-node lifecycle events confirming `polled` and `candidate_detected` phases
Demonstrated node acknowledgement (`POST /v1/nodes/{id}/ack`)
Published a second release and confirmed the `release_sequence` advanced
Ran the edge relay end-to-end unit test (`TestRelayE2E_MultiPeer`) to show the full airgapped propagation pipeline
Evidence Links¶
| Claim | File | Symbol |
|---|---|---|
| Release publication | | |
| Release storage | `orchestrator/store.go` | `CreateRelease()` |
| Release polling | `runtime/poller.go` | |
| Lifecycle phases | `runtime/poller.go` | Lines 142–234 |
| Node acknowledgement | | |
| Event query | | |
| Edge relay e2e | `edge/relay/e2e_test.go` | `TestRelayE2E_MultiPeer` |
| Dead-letter invariant | | |