Tutorial 02 — Multi-Node: Seed Once, Update Everywhere (Peer Propagation)

Objective: Publish a desired-state release to the control plane, watch multiple runtime nodes poll for it, detect the candidate, and emit verifiable lifecycle telemetry proving each node received and evaluated the update.

What you will demonstrate:

  • Start a multi-node fleet with the Docker Compose fleet overlay

  • Publish a desired-state release to the control plane

  • Observe each node poll and emit polled and candidate_detected lifecycle events

  • Query the control plane for per-node telemetry confirming update propagation

  • Understand which capabilities are fully implemented and which are optional deployment modes

Time: ~20 minutes


Implementation Status

Before running this tutorial, understand which capabilities are fully implemented:

| Capability | Status | Evidence |
|---|---|---|
| Control-plane release publication | ✅ Implemented | orchestrator/server.go:POST /v1/releases |
| Release polling by runtime nodes | ✅ Implemented | runtime/poller.go, 7 unit tests |
| Lifecycle telemetry (polled, candidate_detected, verify_*) | ✅ Implemented | runtime/poller.go:emitLifecycle() |
| Control-plane event query | ✅ Implemented | orchestrator/server.go:GET /v1/events |
| Node acknowledgement of releases | ✅ Implemented | orchestrator/server.go:POST /v1/nodes/{id}/ack |
| Activation of release (policy swap) | ✅ Implemented | runtime/poller.go (PolicyActivator), cmd/autonomy/commands/runtime.go (makeRuntimeActivator) |
| Edge-to-edge airgapped segment relay | ✅ Implemented (edge module) | edge/relay/executor.go, TestRelayE2E_MultiPeer |
| Edge-to-edge relay in Docker demo | ✅ Wired | demo/docker-compose.yml (edged-node-a, edged-node-b profile), demo/scripts/08_edge_relay.sh |

What “seed once, update everywhere” means in the current codebase: The control plane is seeded with one desired-state release. Each runtime node independently polls, detects the candidate, verifies supply-chain constraints, and (when PolicyActivator is configured) activates the new policy without process restart. Activation persists active-lock.json in the runtime WAL directory to survive daemon restarts. The edge relay daemon (edged) is a separate binary for segment propagation; this tutorial focuses on runtime release polling + activation.
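
Once an activation has happened (Steps 4–6 below), you can confirm the persisted lock from the host. This is only a quick check, assuming the runtime image ships basic shell utilities and the same /data/wal mount used by the drain command in Step 5:

# active-lock.json only exists after a successful activation:
docker compose -f demo/docker-compose.yml exec -T runtime ls -l /data/wal/active-lock.json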


Architecture

                    ┌──────────────────────────────────────────┐
                    │           Control Plane                  │
                    │         :8888 /v1/releases               │
                    │         POST → desired-state record      │
                    │         GET  → latest release            │
                    └──────┬─────────────────────┬────────────┘
                           │ poll every 30s       │ poll every 30s
                    ┌──────▼──────┐        ┌──────▼──────┐
                    │  Runtime 1  │        │  Runtime 2  │
                    │  :7777      │        │  :7778      │
                    │             │        │             │
                    │  polled     │        │  polled     │
                    │  candidate  │        │  candidate  │
                    │  _detected  │        │  _detected  │
                    └──────┬──────┘        └──────┬──────┘
                           │ OTLP                 │ OTLP
                    ┌──────▼──────────────────────▼──────┐
                    │         OTel Collector              │
                    │         → otel-sink bridge          │
                    │         → Control-plane /v1/events  │
                    └─────────────────────────────────────┘

GET /v1/events?event_type=ai.deployment.lifecycle → shows events from both nodes

Step 0: Prerequisites

Complete Tutorial 01 first. The fleet demo uses the same policy bundle and agent image.

Ensure the stack is down:

make demo-down

Step 1: Start the Fleet Stack

The fleet overlay adds N runtime nodes (default N=3):

make demo-up-fleet N=2

This runs:

docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml up -d --build
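
To confirm which containers the overlay started (service names vary with N), a standard compose listing works:

docker compose -f demo/docker-compose.yml -f demo/docker-compose.fleet.yml ps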

Wait for all nodes to be healthy:

# Node 1 (primary):
curl -s http://localhost:7777/health | jq .

# Node 2 (fleet):
curl -s http://localhost:7778/health | jq .

Expected:

{"status":"ok","mode":"normal"}

Check the control-plane:

curl -s http://localhost:8888/v1/health | jq .

Expected:

{"status":"ok"}

Step 2: Load Policy into Each Node

Repeat Tutorial 01 Steps 1–3 (or run make demo-run-unsigned) to build a bundle and load it:

make demo-run-unsigned

Confirm both nodes have a policy:

curl -s http://localhost:7777/health | jq .mode    # "normal"
curl -s http://localhost:7778/health | jq .mode    # "normal"

Step 3: Publish a Desired-State Release

The control plane stores releases as desired-state records. They are advisory only — they describe what the desired state should be, but do not push changes to nodes. Each node polls and decides independently.

For production high availability and durability, use the replicated PostgreSQL control-plane backend (orchestrator/pgstore/*) and migration flow (autonomy-orchestrator migrate). This tutorial’s local compose flow is the minimal single-instance path for demonstrating release polling behavior.

# Capture the behavioral fingerprint from the lock we built in Tutorial 01
TARGET_FP=$(autonomy lock fingerprint --in /tmp/demo.lock.json 2>&1 | head -1)
echo "Target fingerprint: $TARGET_FP"

# Publish the release
RELEASE=$(curl -s -X POST http://localhost:8888/v1/releases \
  -H 'Content-Type: application/json' \
  -d "{
    \"channel\": \"stable\",
    \"target_lock_fingerprint\": \"${TARGET_FP}\",
    \"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v1\",
    \"policy_ref\": \"localhost:5000/autonomy-demo/agent:v1-policy\",
    \"notes\": \"Tutorial 02 release - $(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }")

echo $RELEASE | jq .

Expected output:

{
  "release_id": "01JX...",
  "channel": "stable",
  "release_sequence": 1
}

Save the release ID:

RELEASE_ID=$(echo $RELEASE | jq -r .release_id)

Verify the release is queryable:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .

Expected (abridged):

{
  "release": {
    "release_id": "01JX...",
    "channel": "stable",
    "release_sequence": 1,
    "target_lock_fingerprint": "blake3:...",
    "artifact_ref": "localhost:5000/autonomy-demo/agent:v1"
  }
}
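
As a quick sanity check, confirm the latest pointer matches the release you just published:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq -r .release.release_id
# Should print the same value as $RELEASE_ID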

Control-plane authority model: The control plane stores desired-state releases as advisory records. It does NOT push configuration to nodes. Nodes pull via polling. This is intentional: offline-first design means nodes operate without connectivity. Evidence: orchestrator/store.go, v1.13 §1.2.3 advisory-only annotation in runtime/poller.go:1


Step 4: Wait for Poll Cycles

Each node polls every 30 seconds by default. The poll interval is configurable:

# Override for faster demo (5s interval):
# Add --poll-interval 5s to the runtime start command, or set via docker-compose env

# For this demo, wait at least one default interval:
echo "Waiting 35 seconds for poll cycles..."
sleep 35

During this time, each runtime’s Poller goroutine:

  1. Calls GET /v1/releases/latest?channel=stable (15s client timeout)

  2. Decodes the response into latestReleaseResponse

  3. Compares TargetLockFingerprint against CurrentFingerprintFunc() (persisted active-lock.json)

  4. If fingerprints differ, emits polled then candidate_detected

  5. If verifier + pubkey are configured, emits verify_started and either verify_passed or verify_failed

  6. On verify_passed, runs PolicyActivator and emits activated or activate_failed
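
You can fetch the same record a node sees on each poll and project out the fields it compares; the jq projection below is just for readability and uses the field names shown in Step 3. Each node compares candidate_fp against its own persisted fingerprint and emits candidate_detected when they differ:

# What every node fetches on each poll cycle:
curl -s "http://localhost:8888/v1/releases/latest?channel=stable" \
  | jq '{candidate_fp: .release.target_lock_fingerprint, release_sequence: .release.release_sequence}'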


Step 5: Drain Telemetry from Each Node

Each node accumulates lifecycle events in its local WAL. Drain to the control plane:

# The demo/docker-compose.yml runtime container has /data/wal bind-mounted
# to demo/data/wal. For fleet nodes, each has its own WAL volume.

# Drain primary node WAL:
docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

# If using fleet nodes, drain each one (they share the bind-mount in this demo setup)

Alternatively, use the automated target:

make demo-fleet-summary

Step 6: Query Per-Node Lifecycle Events

Query the control plane for lifecycle events from all nodes:

curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=20" | jq .

Expected output (abridged):

{
  "count": 4,
  "events": [
    {
      "event_id": "runtime:wal:5",
      "event_type": "ai.deployment.lifecycle",
      "node_id": "runtime",
      "timestamp": "2026-...",
      "payload": "{\"phase\":\"candidate_detected\",\"channel\":\"stable\",\"release_id\":\"01JX...\",\"release_sequence\":1,...}"
    },
    {
      "event_id": "runtime:wal:4",
      "event_type": "ai.deployment.lifecycle",
      "node_id": "runtime",
      "timestamp": "2026-...",
      "payload": "{\"phase\":\"polled\",\"channel\":\"stable\"}"
    }
  ]
}

Each event carries phase in the payload. Expected phases per poll cycle:

| Phase | When emitted | Fields |
|---|---|---|
| polled | Every poll, always | channel |
| candidate_detected | Fingerprint differs from current | channel, release_id, release_sequence, candidate_fp, current_fp |
| verify_started | Only if AUTONOMY_COSIGN_PUBKEY set | release_id, artifact_ref |
| verify_passed | cosign verification succeeded | release_id, artifact_ref |
| verify_failed | cosign verification failed (non-fatal) | release_id, artifact_ref, error |
| activated | Policy activation succeeded | release_id, lock_fp |
| activate_failed | Policy activation failed (non-fatal) | release_id, artifact_ref, error |

Verification and activation phases: In this tutorial, verify_* phases are emitted only when a verifier public key is configured; activate_* phases additionally require verification to pass (and a configured PolicyActivator). Set AUTONOMY_COSIGN_PUBKEY=/path/to/cosign.pub in the runtime service environment to enable verification and activation events. Evidence: runtime/poller.go:poll() lines 203-229
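
To get a compact per-node view of the reported phases, you can unpack the payload string from the same events query (a convenience one-liner, not one of the demo make targets):

curl -s "http://localhost:8888/v1/events?event_type=ai.deployment.lifecycle&limit=50" \
  | jq -r '.events[] | [.node_id, (.payload | fromjson | .phase)] | @tsv'
# Prints one "node_id  phase" pair per line, e.g. "runtime  candidate_detected"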


Step 7: Simulate Node Acknowledgement

After a node processes a release (in production: after activation), it acknowledges:

curl -s -X POST "http://localhost:8888/v1/nodes/node-1/ack" \
  -H 'Content-Type: application/json' \
  -d "{
    \"release_id\": \"${RELEASE_ID}\",
    \"status\": \"accepted\",
    \"reason\": \"tutorial demonstration\",
    \"lock_fingerprint_observed\": \"${TARGET_FP}\",
    \"runtime_version\": \"0.1.0\",
    \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }" | jq .

Expected output:

{"stored": true}

Query fleet acknowledgements:

curl -s "http://localhost:8888/v1/releases/${RELEASE_ID}/acks" | jq .

Expected output:

{
  "count": 1,
  "acks": [
    {
      "node_id": "node-1",
      "status": "accepted",
      "reason": "tutorial demonstration",
      "lock_fingerprint_observed": "blake3:...",
      "runtime_version": "0.1.0"
    }
  ]
}
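
To see the fleet view fill out, repeat the acknowledgement with a second node ID (the node ID is just the path parameter in /v1/nodes/{id}/ack); the acks query for the same release should then report count: 2:

curl -s -X POST "http://localhost:8888/v1/nodes/node-2/ack" \
  -H 'Content-Type: application/json' \
  -d "{
    \"release_id\": \"${RELEASE_ID}\",
    \"status\": \"accepted\",
    \"reason\": \"tutorial demonstration\",
    \"lock_fingerprint_observed\": \"${TARGET_FP}\",
    \"runtime_version\": \"0.1.0\",
    \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
  }" | jq .

curl -s "http://localhost:8888/v1/releases/${RELEASE_ID}/acks" | jq .count
# Expected: 2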

Step 8: Publish a Second Release and Observe Sequence Advance

curl -s -X POST http://localhost:8888/v1/releases \
  -H 'Content-Type: application/json' \
  -d "{
    \"channel\": \"stable\",
    \"target_lock_fingerprint\": \"blake3:0000000000000000000000000000000000000000000000000000000000000001\",
    \"artifact_ref\": \"localhost:5000/autonomy-demo/agent:v2\",
    \"policy_ref\": \"localhost:5000/autonomy-demo/agent:v2-policy\",
    \"notes\": \"Tutorial 02 release v2\"
  }" | jq .

Expected:

{
  "release_id": "01JY...",
  "channel": "stable",
  "release_sequence": 2
}

Verify the latest pointer advanced:

curl -s "http://localhost:8888/v1/releases/latest?channel=stable" | jq .release.release_sequence
# Expected: 2

Sequence semantics: release_sequence is monotonic per channel. Nodes use it to detect whether there is a new release since the last poll without comparing full payloads. Evidence: orchestrator/store.go:CreateRelease()


Edge Relay Demo (Advanced — Unit Test Level)

The edged daemon implements full airgapped segment relay:

  • Ingest segments from peers via mTLS transport

  • Store to local disk with ceiling enforcement and eviction

  • Schedule outbound relay via BoltDB ledger

  • Execute relay with bounded retry and dead-letter (INV-12)

  • Confirm success via all_peers or one_peer condition

This is fully implemented and tested at the unit level:

# Run the relay end-to-end test (in-process, no network):
cd edge
GOWORK=off go test ./relay/... -run TestRelayE2E_MultiPeer -v

Expected output (key lines):

INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-1 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-1 condition=all_peers acked_count=2
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-a attempt_num=1
INFO relay: segment delivered seg_id=seg-e2e-2 peer_id=peer-e2e-b attempt_num=1
INFO relay: success condition met seg_id=seg-e2e-2 condition=all_peers acked_count=2
PASS

The full failure-injection suite:

# From repo root:
make edge-fi

This runs the FI suite and produces a traceability report.

Relay in airgapped deployments: In production, edged is configured with a static peer list (known_peers in edge.toml) and runs on each node. When Node A ingests a segment (e.g., a new policy bundle), it schedules outbound relay to Node B and Node C. Once both ACK (if success_condition = all_peers), idx.RecordRelayed() marks the segment as successfully propagated. This is the airgapped “seed once, update everywhere” propagation path. Evidence: edge/relay/executor.go:checkSuccessCondition(), edge/relay/e2e_test.go:TestRelayE2E_MultiPeer


Automated Version

make demo-up
make demo-run              # build + sign + verify (Tutorial 01)
make demo-publish-release  # publish a release
make demo-poll-loop        # wait + drain + show lifecycle events
make demo-fleet-summary    # per-node telemetry

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| No candidate_detected events | Poll interval not elapsed | Wait 35s, or use the POLL_WAIT=35 env override |
| count: 0 in event query | WAL not drained yet | Run autonomy telemetry drain in the runtime container |
| 404 on GET /v1/releases/latest | No releases published | Run Step 3 |
| verify_failed in events | cosign pubkey set but artifact not signed | Unset AUTONOMY_COSIGN_PUBKEY or sign the artifact first |
| Fleet node connection refused on :7778 | Fleet compose not running | make demo-up-fleet N=2 |


What Just Happened

  • Started a multi-node fleet stack (primary + fleet nodes)

  • Published a desired-state release to the control plane with a target lock fingerprint

  • Waited for each node’s poll cycle (30s default) to fire

  • Drained WAL telemetry to the control plane via the OTel pipeline

  • Queried per-node lifecycle events confirming polled and candidate_detected phases

  • Demonstrated node acknowledgement (POST /v1/nodes/{id}/ack)

  • Published a second release and confirmed the release_sequence advanced

  • Ran the edge relay end-to-end unit test (TestRelayE2E_MultiPeer) to show the full airgapped propagation pipeline

Next Tutorial

Tutorial 03 — Crash and Recovery: WAL + Safe-Point