OS Reconstruction Rollouts

This document covers the OS reconstruction activation path for fleet rollouts. OS reconstruction rollouts replace the node’s operating system artifacts and restart the process under a new configuration epoch.

Overview

OS reconstruction is a special rollout_kind that goes beyond behavior/model activation. The activation sequence:

  1. Emit os_reconstruction.started

  2. RunReconstruction — verify artifact, apply OS manifest

  3. SaveFingerprint — persist the new target lock fingerprint

  4. RotateEpoch — advance the node’s configuration epoch counter

  5. Emit os_reconstruction.completed

  6. Request reboot — controlled process exit (exit code 42)

On any error in steps 2–4, the callback emits os_reconstruction.failed and returns the error (fail-closed). The activator surfaces this as activate_failed in its own telemetry.
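The sequence above can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual callback: the Plan struct stands in for *rollout.RolloutPlan, and the emit and exit hooks, the failingAt test double, and the runSketch helper are all hypothetical names introduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Plan is a hypothetical stand-in for *rollout.RolloutPlan.
type Plan struct {
	ID                string
	TargetFingerprint string
}

// Bootstrapper mirrors the interface described in this document.
type Bootstrapper interface {
	RunReconstruction(ctx context.Context, plan *Plan) error
	SaveFingerprint(ctx context.Context, fingerprint string) error
	RotateEpoch(ctx context.Context) error
}

// activate runs steps 1-6. emit and exit are hypothetical hooks (assumptions).
func activate(ctx context.Context, b Bootstrapper, p *Plan, emit func(phase string), exit func(code int)) error {
	emit("os_reconstruction.started")
	steps := []struct {
		name string
		run  func() error
	}{
		{"run_reconstruction", func() error { return b.RunReconstruction(ctx, p) }},
		{"save_fingerprint", func() error { return b.SaveFingerprint(ctx, p.TargetFingerprint) }},
		{"rotate_epoch", func() error { return b.RotateEpoch(ctx) }},
	}
	for _, s := range steps {
		if err := s.run(); err != nil {
			emit("os_reconstruction.failed") // fail closed: no later step runs
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	emit("os_reconstruction.completed")
	exit(42) // controlled exit; the supervisor is expected to restart the process
	return nil
}

// failingAt is a test double that fails at one named step ("" never fails).
type failingAt string

func (f failingAt) check(step string) error {
	if string(f) == step {
		return errors.New("injected failure")
	}
	return nil
}
func (f failingAt) RunReconstruction(context.Context, *Plan) error { return f.check("run_reconstruction") }
func (f failingAt) SaveFingerprint(context.Context, string) error  { return f.check("save_fingerprint") }
func (f failingAt) RotateEpoch(context.Context) error              { return f.check("rotate_epoch") }

// runSketch drives activate and reports the emitted phases and the exit code
// (-1 when no exit was requested).
func runSketch(failStep string) (phases []string, exitCode int, err error) {
	exitCode = -1
	err = activate(context.Background(), failingAt(failStep),
		&Plan{ID: "plan-1", TargetFingerprint: "fp-1"},
		func(phase string) { phases = append(phases, phase) },
		func(code int) { exitCode = code })
	return phases, exitCode, err
}

func main() {
	phases, code, err := runSketch("")
	fmt.Println(phases, code, err)
	phases, code, err = runSketch("save_fingerprint")
	fmt.Println(phases, code, err)
}
```

Driving the steps from a table keeps the failing step's name available for the error and for telemetry, and makes the fail-closed behavior visible: the loop returns on the first error, so the completed event and the exit request are never reached.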

Bootstrapper Interface

The runtime package does not import edge/bootstrap. Instead, it defines a Bootstrapper interface injected at the cmd layer:

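// Bootstrapper is implemented outside the runtime package and injected at the cmd layer.
type Bootstrapper interface {
    // RunReconstruction verifies the artifact and applies the OS manifest.
    RunReconstruction(ctx context.Context, plan *rollout.RolloutPlan) error
    // SaveFingerprint persists the new target lock fingerprint.
    SaveFingerprint(ctx context.Context, fingerprint string) error
    // RotateEpoch advances the node's configuration epoch counter.
    RotateEpoch(ctx context.Context) error
}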

This keeps the runtime package portable and testable without edge dependencies.
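As an illustration of the testability point, a test could satisfy the interface with an in-memory double along these lines. The recordingBootstrapper type, the recordedCalls helper, and the local RolloutPlan stand-in are hypothetical; only the three method names come from the interface above.

```go
package main

import (
	"context"
	"fmt"
)

// RolloutPlan is a hypothetical stand-in for rollout.RolloutPlan.
type RolloutPlan struct{ ID string }

// Bootstrapper repeats the interface from this document so the sketch compiles alone.
type Bootstrapper interface {
	RunReconstruction(ctx context.Context, plan *RolloutPlan) error
	SaveFingerprint(ctx context.Context, fingerprint string) error
	RotateEpoch(ctx context.Context) error
}

// recordingBootstrapper is a hypothetical double that records each call
// instead of touching the operating system.
type recordingBootstrapper struct {
	calls []string
}

// Compile-time check that the double satisfies the interface.
var _ Bootstrapper = (*recordingBootstrapper)(nil)

func (r *recordingBootstrapper) RunReconstruction(_ context.Context, plan *RolloutPlan) error {
	r.calls = append(r.calls, "RunReconstruction:"+plan.ID)
	return nil
}

func (r *recordingBootstrapper) SaveFingerprint(_ context.Context, fingerprint string) error {
	r.calls = append(r.calls, "SaveFingerprint:"+fingerprint)
	return nil
}

func (r *recordingBootstrapper) RotateEpoch(context.Context) error {
	r.calls = append(r.calls, "RotateEpoch")
	return nil
}

// recordedCalls exercises the double through the interface, as runtime code would.
func recordedCalls() []string {
	rec := &recordingBootstrapper{}
	var b Bootstrapper = rec
	ctx := context.Background()
	_ = b.RunReconstruction(ctx, &RolloutPlan{ID: "plan-1"})
	_ = b.SaveFingerprint(ctx, "fp-1")
	_ = b.RotateEpoch(ctx)
	return rec.calls
}

func main() {
	fmt.Println(recordedCalls())
}
```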

Reboot Mechanism

After successful reconstruction, the runtime calls os.Exit(42). This is a controlled process exit — not a shell command. The supervisor (systemd, k8s) should treat exit code 42 as “restart required, not a crash”.

The ExitFunc field in OSReconstructionConfig can be overridden for testing.
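A test might override ExitFunc roughly as shown below. Only the ExitFunc field name and the exit code come from this document; its signature, the requestReboot method, the fallback to os.Exit when the field is nil, and the captureExit helper are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"os"
)

// OSReconstructionConfig is reduced here to the one field this document names.
// The func(int) signature is an assumption.
type OSReconstructionConfig struct {
	ExitFunc func(code int)
}

// rebootExitCode is the "restart required" exit code described above.
const rebootExitCode = 42

// requestReboot is a hypothetical helper: it calls ExitFunc when set and
// falls back to os.Exit otherwise (the fallback is an assumption).
func (c OSReconstructionConfig) requestReboot() {
	exit := c.ExitFunc
	if exit == nil {
		exit = os.Exit
	}
	exit(rebootExitCode)
}

// captureExit swaps in a recording ExitFunc so the process keeps running.
func captureExit() int {
	got := -1
	cfg := OSReconstructionConfig{ExitFunc: func(code int) { got = code }}
	cfg.requestReboot()
	return got
}

func main() {
	fmt.Println("captured exit code:", captureExit())
}
```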

OPA Policy Preconditions

OS reconstruction rollouts have additional OPA policy gates beyond standard rollout activation:

| Precondition | Rule | Threshold |
|---|---|---|
| Node must be idle | operational_state != "idle" → deny | |
| Sufficient battery | battery <= 50 → deny | > 50% |
| Valid certificates | cert_validity_days < 7 → deny | ≥ 7 days |

These are implemented in policy/rollout.rego as os_reconstruction_precondition_failed rules. Standard rollout kinds (behavior, model) are not affected by these gates.
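To make the boundary conditions concrete, the three gates can be restated in Go. This is only a mirror of the rules listed above, not the Rego in policy/rollout.rego; the NodeState struct, its field names, and the preconditionFailures function are hypothetical.

```go
package main

import "fmt"

// NodeState holds the inputs the gates inspect. Field names are hypothetical.
type NodeState struct {
	OperationalState string
	BatteryPercent   int
	CertValidityDays int
}

// preconditionFailures returns one reason per failed gate. Rollout kinds other
// than os_reconstruction are not subject to these gates and return nil.
func preconditionFailures(rolloutKind string, n NodeState) []string {
	if rolloutKind != "os_reconstruction" {
		return nil
	}
	var reasons []string
	if n.OperationalState != "idle" {
		reasons = append(reasons, "node is not idle")
	}
	if n.BatteryPercent <= 50 { // exactly 50% is denied; more than 50% is required
		reasons = append(reasons, "battery not above 50%")
	}
	if n.CertValidityDays < 7 { // exactly 7 days is allowed
		reasons = append(reasons, "certificates valid for fewer than 7 days")
	}
	return reasons
}

func main() {
	ready := NodeState{OperationalState: "idle", BatteryPercent: 51, CertValidityDays: 7}
	busy := NodeState{OperationalState: "active", BatteryPercent: 50, CertValidityDays: 6}
	fmt.Println(preconditionFailures("os_reconstruction", ready))
	fmt.Println(preconditionFailures("os_reconstruction", busy))
	fmt.Println(preconditionFailures("behavior", busy))
}
```

The comparisons are deliberately asymmetric: the battery gate denies at exactly 50, while the certificate gate allows exactly 7 days.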

Boot Epoch

Each OS reconstruction advances the node’s boot epoch — a monotonic counter stored in the node’s OS fingerprint (edge/bootstrap/osfingerprint.go). Telemetry, caches, and other epoch-scoped state are reset at the new boundary.

The boot_epoch_increment event kind (registered in telemetry/events.go) marks this transition.

Telemetry Events

All OS reconstruction events are emitted as EventKindLifecycle with the phase carried in attrs["event_kind"]:

| Phase Constant | Value | When |
|---|---|---|
| RolloutPhaseOSReconStarted | os_reconstruction.started | Sequence begins |
| RolloutPhaseOSReconCompleted | os_reconstruction.completed | All steps succeeded |
| RolloutPhaseOSReconFailed | os_reconstruction.failed | Any step failed |
| RolloutPhaseBootEpochIncrement | boot_epoch_increment | Epoch counter advanced |

Additional attributes on these events (a parenthetical marks attributes limited to specific phases):

  • node_id — the emitting node’s identity

  • plan_id — the rollout plan identifier

  • artifact_ref — OCI artifact reference (on started)

  • target_fingerprint — target lock fingerprint (on started/completed)

  • error — error message (on failed)

  • step — which step failed: run_reconstruction, save_fingerprint, or rotate_epoch

Routing

The Activator routes OS reconstruction rollouts via dispatchKind():

rollout_kind == "os_reconstruction" → OSReconstructionActivate callback

If no OSReconstructionActivate callback is configured, activation returns an error and emits activate_failed.