OS Reconstruction Rollouts
This document covers the OS reconstruction activation path for fleet rollouts. OS reconstruction rollouts replace the node’s operating system artifacts and restart the process under a new configuration epoch.
Overview
OS reconstruction is a special rollout_kind that goes beyond behavior/model
activation. The activation sequence:
1. Emit os_reconstruction.started
2. RunReconstruction: verify artifact, apply OS manifest
3. SaveFingerprint: persist the new target lock fingerprint
4. RotateEpoch: advance the node’s configuration epoch counter
5. Emit os_reconstruction.completed
6. Request reboot: controlled process exit (exit code 42)
On any error in steps 2–4, the callback emits os_reconstruction.failed and
returns the error (fail-closed). The activator surfaces this as activate_failed
in its own telemetry.
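The sequence and its fail-closed behavior can be sketched in Go as follows. This is a minimal illustration, not the runtime's actual callback: the `step` type, `runSequence`, the `emit` and `exit` hooks, and the `demo` helper are all hypothetical names introduced here, and the three bootstrap steps are reduced to plain functions.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a telemetry step name with the action it performs.
type step struct {
	name string
	run  func() error
}

// runSequence is a hypothetical stand-in for the activation callback.
// emit records a phase (plus the failing step, if any); exit stands in
// for the controlled process exit.
func runSequence(steps []step, emit func(phase, failedStep string), exit func(code int)) error {
	emit("os_reconstruction.started", "") // step 1
	for _, s := range steps {             // steps 2 to 4
		if err := s.run(); err != nil {
			emit("os_reconstruction.failed", s.name)
			// Fail closed: return the error, never request a reboot.
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	emit("os_reconstruction.completed", "") // step 5
	exit(42)                                // step 6
	return nil
}

// demo runs the sequence with one step failing ("" means no failure) and
// reports the emitted phases, the failed step, and the requested exit code
// (-1 when no exit was requested).
func demo(failAt string) (phases []string, failedStep string, exitCode int) {
	exitCode = -1
	mk := func(name string) step {
		return step{name: name, run: func() error {
			if name == failAt {
				return errors.New("injected failure")
			}
			return nil
		}}
	}
	steps := []step{mk("run_reconstruction"), mk("save_fingerprint"), mk("rotate_epoch")}
	emit := func(phase, s string) {
		phases = append(phases, phase)
		if s != "" {
			failedStep = s
		}
	}
	_ = runSequence(steps, emit, func(code int) { exitCode = code })
	return phases, failedStep, exitCode
}

func main() {
	fmt.Println(demo(""))                 // success path: exit code 42 is requested
	fmt.Println(demo("save_fingerprint")) // failure path: no exit is requested
}
```

Passing the exit hook as a parameter keeps the failure path observable: when a step fails, the hook is never called and the error goes back to the caller.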
Bootstrapper Interface
The runtime package does not import edge/bootstrap. Instead, it defines a
Bootstrapper interface injected at the cmd layer:
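    type Bootstrapper interface {
        // RunReconstruction verifies the artifact and applies the OS manifest.
        RunReconstruction(ctx context.Context, plan *rollout.RolloutPlan) error
        // SaveFingerprint persists the new target lock fingerprint.
        SaveFingerprint(ctx context.Context, fingerprint string) error
        // RotateEpoch advances the node's configuration epoch counter.
        RotateEpoch(ctx context.Context) error
    }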
This keeps the runtime package portable and testable without edge dependencies.
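As an illustration of the testability claim, a test double could look roughly like the sketch below. `RolloutPlan` is a placeholder for `rollout.RolloutPlan` (its fields are not shown in this document), and `fakeBootstrapper` and `exercise` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// RolloutPlan is a placeholder for rollout.RolloutPlan; the ID field is assumed.
type RolloutPlan struct{ ID string }

// Bootstrapper has the same method set as the interface in the text,
// using the placeholder plan type.
type Bootstrapper interface {
	RunReconstruction(ctx context.Context, plan *RolloutPlan) error
	SaveFingerprint(ctx context.Context, fingerprint string) error
	RotateEpoch(ctx context.Context) error
}

// fakeBootstrapper records every call and fails on one chosen method.
type fakeBootstrapper struct {
	calls  []string
	failOn string
}

func (f *fakeBootstrapper) record(name string) error {
	f.calls = append(f.calls, name)
	if name == f.failOn {
		return errors.New(name + " failed")
	}
	return nil
}

func (f *fakeBootstrapper) RunReconstruction(context.Context, *RolloutPlan) error {
	return f.record("RunReconstruction")
}

func (f *fakeBootstrapper) SaveFingerprint(context.Context, string) error {
	return f.record("SaveFingerprint")
}

func (f *fakeBootstrapper) RotateEpoch(context.Context) error {
	return f.record("RotateEpoch")
}

// Compile-time check that the fake satisfies the interface.
var _ Bootstrapper = (*fakeBootstrapper)(nil)

// exercise calls the three methods in order and stops at the first error.
func exercise(b Bootstrapper) error {
	ctx := context.Background()
	if err := b.RunReconstruction(ctx, &RolloutPlan{ID: "plan-1"}); err != nil {
		return err
	}
	if err := b.SaveFingerprint(ctx, "fp-example"); err != nil {
		return err
	}
	return b.RotateEpoch(ctx)
}

func main() {
	f := &fakeBootstrapper{failOn: "SaveFingerprint"}
	fmt.Println(exercise(f), f.calls)
}
```

Because the fake lives entirely outside edge/bootstrap, a test can check both the order of calls and the point at which the sequence stops.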
Reboot Mechanism
After successful reconstruction, the runtime calls os.Exit(42). This is a
controlled process exit — not a shell command. The supervisor (systemd, k8s)
should treat exit code 42 as “restart required, not a crash”.
The ExitFunc field in OSReconstructionConfig can be overridden for testing.
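A minimal sketch of how such an override might be wired is shown below. The struct is reduced to the single field discussed here, and the `func(code int)` signature, the `RebootExitCode` constant, and the `requestReboot` and `captureExit` helpers are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// RebootExitCode is the exit code the text assigns to "restart required".
const RebootExitCode = 42

// OSReconstructionConfig is reduced to one field; the signature is an
// assumption modelled on os.Exit.
type OSReconstructionConfig struct {
	ExitFunc func(code int)
}

// requestReboot calls the configured ExitFunc and falls back to os.Exit
// when none is set.
func requestReboot(cfg OSReconstructionConfig) {
	exit := cfg.ExitFunc
	if exit == nil {
		exit = os.Exit
	}
	exit(RebootExitCode)
}

// captureExit injects a recording ExitFunc so the calling process survives,
// which is what a test would do.
func captureExit() int {
	got := -1
	requestReboot(OSReconstructionConfig{ExitFunc: func(code int) { got = code }})
	return got
}

func main() {
	fmt.Println("captured exit code:", captureExit())
}
```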
OPA Policy Preconditions
OS reconstruction rollouts have additional OPA policy gates beyond standard rollout activation:
| Precondition | Rule | Threshold |
|---|---|---|
| Node must be idle | | n/a |
| Sufficient battery | | > 50% |
| Valid certificates | | ≥ 7 days |
These are implemented in policy/rollout.rego as os_reconstruction_precondition_failed
rules. Standard rollout kinds (behavior, model) are not affected by these gates.
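The real gates are Rego rules. The Go sketch below only mirrors the thresholds from the table so that they can be reasoned about in one place. `NodeState`, its fields, the `days` helper, and the failure messages are hypothetical, and reading "≥ 7 days" as remaining certificate validity is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// NodeState holds the inputs the three gates look at (field names assumed).
type NodeState struct {
	Idle           bool
	BatteryPercent float64
	CertValidFor   time.Duration // remaining certificate validity (assumed meaning)
}

// days converts a whole number of days into a Duration.
func days(n int) time.Duration { return time.Duration(n) * 24 * time.Hour }

// osReconstructionPreconditionFailures lists the gates that fail. Kinds other
// than "os_reconstruction" are not subject to these gates and yield nil.
func osReconstructionPreconditionFailures(kind string, s NodeState) []string {
	if kind != "os_reconstruction" {
		return nil
	}
	var failed []string
	if !s.Idle {
		failed = append(failed, "node not idle")
	}
	if s.BatteryPercent <= 50 { // threshold is strictly greater than 50%
		failed = append(failed, "battery not above 50%")
	}
	if s.CertValidFor < days(7) { // threshold is at least 7 days
		failed = append(failed, "certificates valid for less than 7 days")
	}
	return failed
}

func main() {
	s := NodeState{Idle: true, BatteryPercent: 50, CertValidFor: days(30)}
	fmt.Println(osReconstructionPreconditionFailures("os_reconstruction", s))
	fmt.Println(osReconstructionPreconditionFailures("model", s))
}
```

Note the boundary cases: a battery at exactly 50% fails the gate, while certificates valid for exactly 7 days pass it.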
Boot Epoch
Each OS reconstruction advances the node’s boot epoch — a monotonic counter
stored in the node’s OS fingerprint (edge/bootstrap/osfingerprint.go).
Telemetry, caches, and other epoch-scoped state are reset at the new boundary.
The boot_epoch_increment event kind (registered in telemetry/events.go)
marks this transition.
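The idea of epoch-scoped state can be illustrated with a small sketch. `OSFingerprint` here is a hypothetical two-field reduction, not the type in edge/bootstrap/osfingerprint.go, and `advance` and `epochCache` are names invented for this example.

```go
package main

import "fmt"

// OSFingerprint is a hypothetical reduction of the stored fingerprint.
type OSFingerprint struct {
	TargetLock string
	BootEpoch  uint64
}

// advance returns the fingerprint for the next boot epoch. The counter only
// ever moves forward by one.
func advance(fp OSFingerprint, newLock string) OSFingerprint {
	return OSFingerprint{TargetLock: newLock, BootEpoch: fp.BootEpoch + 1}
}

// epochCache holds state that is valid for a single boot epoch.
type epochCache struct {
	epoch uint64
	data  map[string]string
}

// reset drops all entries when the current epoch differs from the stored one.
func (c *epochCache) reset(current uint64) {
	if c.data == nil || c.epoch != current {
		c.epoch, c.data = current, map[string]string{}
	}
}

// Put stores a value under the current epoch.
func (c *epochCache) Put(current uint64, key, val string) {
	c.reset(current)
	c.data[key] = val
}

// Get returns a value only if it was stored during the current epoch.
func (c *epochCache) Get(current uint64, key string) (string, bool) {
	c.reset(current)
	v, ok := c.data[key]
	return v, ok
}

func main() {
	fp := OSFingerprint{TargetLock: "lock-a", BootEpoch: 7}
	cache := &epochCache{}
	cache.Put(fp.BootEpoch, "manifest", "cached")

	fp = advance(fp, "lock-b")
	_, ok := cache.Get(fp.BootEpoch, "manifest")
	fmt.Println(fp.BootEpoch, ok) // the entry from epoch 7 is gone in epoch 8
}
```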
Telemetry Events
All OS reconstruction events are emitted as EventKindLifecycle with the
phase carried in attrs["event_kind"]:
| Phase Constant | Value | When |
|---|---|---|
| | | Sequence begins |
| | | All steps succeeded |
| | | Any step failed |
| | | Epoch counter advanced |
Additional attributes on each event:
- node_id: the emitting node’s identity
- plan_id: the rollout plan identifier
- artifact_ref: OCI artifact reference (on started)
- target_fingerprint: target lock fingerprint (on started/completed)
- error: error message (on failed)
- step: which step failed (run_reconstruction, save_fingerprint, or rotate_epoch)
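A sketch of how these attributes might be assembled is shown below. The `Event` struct, the string value of `EventKindLifecycle`, and the helper functions are assumptions. The sketch also assumes that attrs["event_kind"] carries the same strings the text uses for the emitted events (os_reconstruction.started, os_reconstruction.failed).

```go
package main

import (
	"errors"
	"fmt"
)

// EventKindLifecycle stands in for the real constant; its value is assumed.
const EventKindLifecycle = "lifecycle"

// Event is a hypothetical telemetry record with string attributes.
type Event struct {
	Kind  string
	Attrs map[string]string
}

// newEvent fills in the attributes common to every phase.
func newEvent(phase, nodeID, planID string) Event {
	return Event{Kind: EventKindLifecycle, Attrs: map[string]string{
		"event_kind": phase,
		"node_id":    nodeID,
		"plan_id":    planID,
	}}
}

// startedEvent adds the attributes listed for the started phase.
func startedEvent(nodeID, planID, artifactRef, targetFingerprint string) Event {
	e := newEvent("os_reconstruction.started", nodeID, planID)
	e.Attrs["artifact_ref"] = artifactRef
	e.Attrs["target_fingerprint"] = targetFingerprint
	return e
}

// failedEvent adds the failing step and the error message.
func failedEvent(nodeID, planID, step string, cause error) Event {
	e := newEvent("os_reconstruction.failed", nodeID, planID)
	e.Attrs["step"] = step
	e.Attrs["error"] = cause.Error()
	return e
}

// errExample is an illustrative error value, not a real runtime error.
var errExample = errors.New("manifest digest mismatch")

func main() {
	fmt.Println(startedEvent("node-1", "plan-1", "registry.example/os:1", "fp-example").Attrs)
	fmt.Println(failedEvent("node-1", "plan-1", "run_reconstruction", errExample).Attrs)
}
```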
Routing
The Activator routes OS reconstruction rollouts via dispatchKind():
rollout_kind == "os_reconstruction" → OSReconstructionActivate callback
If no OSReconstructionActivate callback is configured, activation returns an
error and emits activate_failed.
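The routing rule can be sketched as follows. Only the names `Activator`, `dispatchKind`, `OSReconstructionActivate`, and `activate_failed` come from this document. The callback signature, the `Emit` field, `errNoCallback`, and the `route` helper are assumptions, and the other rollout kinds are left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ActivateFunc is an assumed signature for the activation callback.
type ActivateFunc func(ctx context.Context, planID string) error

// Activator is reduced to the callback and a telemetry hook.
type Activator struct {
	OSReconstructionActivate ActivateFunc
	Emit                     func(event string)
}

// errNoCallback is a hypothetical error for the unconfigured case.
var errNoCallback = errors.New("no OSReconstructionActivate callback configured")

// dispatchKind routes os_reconstruction plans to the callback and emits
// activate_failed on a missing callback or a callback error.
func (a *Activator) dispatchKind(ctx context.Context, kind, planID string) error {
	switch kind {
	case "os_reconstruction":
		if a.OSReconstructionActivate == nil {
			a.Emit("activate_failed")
			return errNoCallback
		}
		if err := a.OSReconstructionActivate(ctx, planID); err != nil {
			a.Emit("activate_failed")
			return err
		}
		return nil
	default:
		return fmt.Errorf("rollout_kind %q is outside this sketch", kind)
	}
}

// route dispatches one os_reconstruction plan, with or without a callback,
// and returns the emitted events and the resulting error.
func route(configured bool) (events []string, err error) {
	a := &Activator{Emit: func(e string) { events = append(events, e) }}
	if configured {
		a.OSReconstructionActivate = func(context.Context, string) error { return nil }
	}
	err = a.dispatchKind(context.Background(), "os_reconstruction", "plan-1")
	return events, err
}

func main() {
	fmt.Println(route(true))
	fmt.Println(route(false))
}
```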