VAL 01 — Zero-Downtime Certificate Rotation Validation

1. Purpose and Guarantee

This validation proves that autonomy cert rotate preserves point-in-time connectivity for new mTLS client connections and completes within a bounded time window.

The zero-downtime claim in this validation is client-side, not server-side: the control-plane process remains running the entire time, while the client-side certificate and private key files for node-c are replaced atomically in place. The pre-rotation and post-rotation probes use the same file paths, so a new curl invocation naturally picks up the rotated client certificate without any control-plane restart.

cert rotate is crash-safe: the new cert is written to a temporary file and renamed into place, and the key file is updated in a separate atomic write. If the process is interrupted before the cert rename, the temporary file is left behind and the original cert is untouched. If it is interrupted after the cert rename but before the key write completes, the new cert is in place alongside the old key. That worst case, a brief window where the cert and key files are from different generations, is detectable and correctable by re-running cert rotate.
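The write-then-rename pattern and the generation check described above can be sketched in shell. This is a minimal illustration under stated assumptions, not the actual cert rotate implementation: the helper names atomic_replace and cert_key_match are hypothetical, and the sketch assumes the temporary file sits in the same directory as the target so that mv is a same-filesystem rename.

```shell
# Hypothetical helpers for illustration; not the real cert rotate code.

# atomic_replace TARGET: write stdin to a temp file beside TARGET, then rename over it.
# Readers see either the old file or the complete new one, never a partial write.
atomic_replace() {
  _target=$1
  # mktemp creates the file in the target's directory (mode 0600)
  _tmp=$(mktemp "${_target}.tmp.XXXXXX") || return 1
  if cat > "$_tmp"; then
    mv -f "$_tmp" "$_target"
  else
    rm -f "$_tmp"
    return 1
  fi
}

# cert_key_match CERT KEY: succeed only if CERT and KEY carry the same public key.
# Returns 2 if openssl cannot parse either file.
cert_key_match() {
  _cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
  _key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
  [ "$_cert_pub" = "$_key_pub" ]
}
```

A non-zero result from cert_key_match on the node-c cert and key pair would correspond to the mixed-generation window, in which case re-running cert rotate is the documented correction.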

This validation captures live evidence of all six claims:

| # | Claim |
|---|---|
| VAL01-1 | cert list --expiring-within-days detects a near-expiry cert |
| VAL01-2 | Old cert accepted over live mTLS before rotation |
| VAL01-3 | cert rotate completes within 300 seconds (practical bound) |
| VAL01-4 | Expiry window clears after rotation to 90-day validity |
| VAL01-5 | New cert accepted over live mTLS without restarting the control-plane |
| VAL01-6 | cert.rotated audit event recorded in the retained store |


2. Scope

Covered

  • Client certificate rotation (edge node → control-plane mTLS path)

  • Expiry detection via autonomy cert list --expiring-within-days

  • Atomic in-place rotation via autonomy cert rotate

  • Timing bound: elapsed wall-clock seconds ≤ 300

  • New cert accepted without restarting the control-plane (new-connection continuity claim)

  • cert.rotated audit event captured in the retained file-backed store

  • Serial number change showing a new certificate was issued (not a no-op renewal)

Not covered (known gaps)

  • CA rotation: no CLI support; requires manual key replacement and leaf re-issuance

  • Server-side cert rotation: rotating the control-plane’s own TLS certificate requires a process restart (server TLS config is loaded once at startup); this is out of scope

  • OCSP-based live status: not implemented; revocation checking is CRL-only

  • In-flight request continuity: VAL01 proves successful new handshakes before and after rotation, not an uninterrupted stream of requests across the rotation window

  • Multi-node simultaneous rotation: single node pair (client + server); coordinated rotation across an HA cluster is covered in the self-hosted certificate rotation runbook (HA control-plane context section)

  • HSM-backed CA keys: the lab uses a locally generated CA; production CA key management is out of scope for automated validation

As of March 19, 2026, the current validation backlog does not define later VAL slices for CA rotation, server-certificate hot reload, uninterrupted in-flight continuity, or coordinated multi-node rotation. The only adjacent planned follow-on is VAL-02 Trust-chain rejection validation, which is expected to cover certificate accept/reject behavior, not the excluded rotation workflows above.


3. Harness

VAL01 is embedded in run_cert_lab() as Phase 8 within scripts/labs/run_cli_audit_lab.sh. No separate runner is required.

Setup inherited from earlier phases:

  • Phases 1–4: local CA generated by openssl; node-a and node-b certs issued

  • Phase 5: mTLS-enforcing control-plane started on localhost:18443 using the same CA (--tls-cert-file, --tls-key-file, --tls-ca-file, --tls-crl-file)

  • Phases 6–7: CRL distribution and revocation tests

Phase 8 setup:

  1. Issue a 2-day certificate for node-c.edge.local using the same CA (short validity so it falls inside the --expiring-within-days 5 window immediately)

  2. Run the six validation checks against this cert while the control-plane from Phase 5 remains live

The Phase 5 control-plane is not restarted between VAL01-2 (pre-rotation) and VAL01-5 (post-rotation). This is what proves continuity for new client connections across the rotation.


4. Exact Scenarios

VAL01-1 — Expiry Detection

Purpose: Confirm that cert list --expiring-within-days identifies a certificate about to expire.

Setup: node-c.edge.local was issued with --validity-days 2 so its expiry is approximately now + 48 hours, well within the 5-day window.

Action:

AUTONOMY_RBAC_ENFORCEMENT=0 \
  autonomy cert list \
    --cert-file /tmp/.../node-c.crt \
    --expiring-within-days 5

Evidence file: autonomy/cert-rotation-list-expiring.txt

Pass criterion: File contains the string expiring or the identity node-c.
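As an independent cross-check of the same 5-day window, the -checkend option of openssl x509 tests whether a certificate expires within a given number of seconds. The sketch below is an assumption about how one might verify the result outside the autonomy CLI; it is not part of the lab runner, and days_to_seconds and expires_within_days are hypothetical helper names.

```shell
# days_to_seconds DAYS: convert a day count to seconds (5 days is 432000 seconds).
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# expires_within_days CERT DAYS:
#   returns 0 if CERT expires within DAYS days,
#   returns 1 if it remains valid beyond the window,
#   returns 2 if the file is unreadable or not a parseable certificate.
expires_within_days() {
  [ -r "$1" ] || return 2
  openssl x509 -in "$1" -noout >/dev/null 2>&1 || return 2
  # -checkend exits 0 when the cert will NOT expire within the given seconds
  if openssl x509 -in "$1" -noout -checkend "$(days_to_seconds "$2")" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

For the 2-day node-c cert, expires_within_days with a 5-day window should succeed before rotation and return 1 after rotation to 90-day validity.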


VAL01-2 — Pre-rotation mTLS Connection

Purpose: Baseline — the 2-day cert is accepted by the live control-plane before rotation.

Action:

curl -fsS \
  --cacert /tmp/.../ca.crt \
  --cert   /tmp/.../node-c.crt \
  --key    /tmp/.../node-c.key \
  https://localhost:18443/v1/health

Evidence file: autonomy/cert-rotation-prerotate-health.json

Pass criterion: File contains "status":"ok".


VAL01-3 — Rotation Timing (Bounded Downtime)

Purpose: Prove that the rotation operation itself completes within the 5-minute practical bound. (Actual elapsed time is sub-second; the 300-second bound is a formal SLA ceiling rather than a tight measurement target.)

Action:

rotation_start=$(date +%s)
AUTONOMY_RBAC_ENFORCEMENT=0 \
  autonomy cert rotate \
    --cert-file /tmp/.../node-c.crt \
    --key-file  /tmp/.../node-c.key \
    --ca-cert   /tmp/.../ca.crt \
    --ca-key    /tmp/.../ca.key
rotation_end=$(date +%s)
rotation_elapsed=$((rotation_end - rotation_start))
printf "rotation_elapsed_seconds=%d\nbound_seconds=300\npass=%s\n" \
  "$rotation_elapsed" \
  "$([ "$rotation_elapsed" -le 300 ] && echo true || echo false)"

Evidence files:

  • autonomy/cert-rotation-rotate.txt — cert rotate stdout

  • autonomy/cert-rotation-audit-rotate.log — slog line with cert.rotated

  • autonomy/cert-rotation-timing.txt — elapsed + bound + pass flag

Pass criterion: cert-rotation-timing.txt contains pass=true.


VAL01-4 — Expiry Window Cleared

Purpose: After rotation to 90-day validity the cert no longer appears in the 5-day expiry window.

Action:

AUTONOMY_RBAC_ENFORCEMENT=0 \
  autonomy cert list \
    --cert-file /tmp/.../node-c.crt \
    --expiring-within-days 5

Evidence file: autonomy/cert-rotation-list-after.txt

Pass criterion: File contains the string no certificates matched.


VAL01-5 — Post-rotation mTLS (Zero-Downtime Claim)

Purpose: The core claim. The new cert is accepted by the same, unmodified control-plane process immediately after rotation, with no restart.

Action:

curl -fsS \
  --cacert /tmp/.../ca.crt \
  --cert   /tmp/.../node-c.crt \
  --key    /tmp/.../node-c.key \
  https://localhost:18443/v1/health

Evidence file: autonomy/cert-rotation-postrotate-health.json

Pass criterion: File contains "status":"ok".

Note: the pre-rotation and post-rotation curl calls use the same file paths. The control-plane process is unchanged; the new curl invocation rereads the rotated client cert and key from those paths and establishes a fresh mTLS connection without any explicit restart or reload step.


VAL01-6 — Audit Event Captured

Purpose: Prove the cert.rotated event is written to the retained audit store, providing an operator-queryable record of every rotation.

Action:

AUTONOMY_RBAC_ENFORCEMENT=0 \
  autonomy audit query \
    --audit-dir "$AUTONOMY_AUDIT_DIR" \
    --event-type cert.rotated \
    --output json

Evidence file: autonomy/cert-rotation-audit-events.json

Pass criterion: File contains the string cert.rotated.


Serial Assertion (supplementary)

Not a scored check, but captured as additional evidence that a new certificate was actually issued (not a no-op or timestamp-only renewal):

# Before rotation
openssl x509 -in node-c.crt -noout -subject -serial -dates \
  > cert-rotation-before-dates.txt

# After rotation
openssl x509 -in node-c.crt -noout -subject -serial -dates \
  > cert-rotation-after-dates.txt

Evidence files:

  • autonomy/cert-rotation-before-dates.txt

  • autonomy/cert-rotation-after-dates.txt

Criterion: before_serial ≠ after_serial (both non-empty). Reported in the composite report as serials_differ=true.
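The comparison can be sketched as follows. This is an illustrative reconstruction, not the runner's actual code: serial_of and serials_differ are hypothetical function names, and the sketch assumes each evidence file contains a serial= line in the form openssl x509 -serial prints.

```shell
# serial_of FILE: print the value of the first serial= line in FILE.
serial_of() {
  sed -n 's/^serial=//p' "$1" | head -n 1
}

# serials_differ BEFORE AFTER: print serials_differ=true only when both
# serials are non-empty and different; otherwise print serials_differ=false.
serials_differ() {
  _before=$(serial_of "$1")
  _after=$(serial_of "$2")
  if [ -n "$_before" ] && [ -n "$_after" ] && [ "$_before" != "$_after" ]; then
    echo "serials_differ=true"
  else
    echo "serials_differ=false"
  fi
}
```

Calling serials_differ with cert-rotation-before-dates.txt and cert-rotation-after-dates.txt as arguments yields the flag in the same form the composite report uses.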


5. Evidence Files

All files are written to $EVIDENCE_DIR/autonomy/ by the lab runner.

| File | Produced by | Contains |
|---|---|---|
| cert-rotation-list-expiring.txt | cert list --expiring-within-days 5 (before rotation) | Identity row with expiring status |
| cert-rotation-prerotate-health.json | curl against live mTLS CP (before rotation) | {"status":"ok"} |
| cert-rotation-before-dates.txt | openssl x509 -serial -dates (before rotation) | Baseline serial, notBefore, notAfter |
| cert-rotation-timing.txt | date +%s bracketing cert rotate | rotation_elapsed_seconds, bound_seconds=300, pass=true/false |
| cert-rotation-rotate.txt | cert rotate stdout | rotated  identity=node-c.edge.local cert=... valid_days=90 |
| cert-rotation-audit-rotate.log | cert rotate stderr (slog) | Structured log line: cert.rotated |
| cert-rotation-after-dates.txt | openssl x509 -serial -dates (after rotation) | Post-rotation serial, notBefore, notAfter |
| cert-rotation-list-after.txt | cert list --expiring-within-days 5 (after rotation) | (no certificates matched) |
| cert-rotation-postrotate-health.json | curl against live mTLS CP (after rotation, no restart) | {"status":"ok"} |
| cert-rotation-audit-events.json | audit query --event-type cert.rotated --output json | JSON array with cert.rotated record |
| cert-rotation-val01-report.txt | Composite report written by Phase 8 | 6-check PASS/FAIL + serial assertion |


6. Pass/Fail Criteria

| Check ID | Name | File | Pass condition |
|---|---|---|---|
| VAL01-1 | expiry_detection | cert-rotation-list-expiring.txt | contains expiring or node-c |
| VAL01-2 | prerotate_connect | cert-rotation-prerotate-health.json | contains "status":"ok" |
| VAL01-3 | rotation_timing | cert-rotation-timing.txt | contains pass=true |
| VAL01-4 | expiry_cleared | cert-rotation-list-after.txt | contains no certificates matched |
| VAL01-5 | postrotate_connect | cert-rotation-postrotate-health.json | contains "status":"ok" |
| VAL01-6 | audit_captured | cert-rotation-audit-events.json | contains cert.rotated |

Overall pass: all 6 checks pass and cert-rotation-val01-report.txt reports 6/6 checks PASS.
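The six pass conditions are plain substring checks, so they can be re-evaluated against an evidence directory by hand. The sketch below is a hypothetical re-implementation for illustration, not the code in run_cli_audit_lab.sh; contains_any and evaluate_val01 are assumed names, and the summary line it prints is deliberately worded differently from the runner's own output.

```shell
# contains_any FILE PATTERN...: succeed if FILE contains any of the fixed strings.
contains_any() {
  _file=$1
  shift
  for _pat in "$@"; do
    grep -qF -- "$_pat" "$_file" 2>/dev/null && return 0
  done
  return 1
}

# evaluate_val01 DIR: apply the six pass conditions to the evidence files in DIR.
# Prints one line per check and returns 0 only if all six pass.
evaluate_val01() {
  _dir=$1
  _passed=0
  # Columns: check ID | evidence file | accepted string | optional alternative
  while IFS='|' read -r _id _name _p1 _p2; do
    if contains_any "$_dir/$_name" "$_p1" ${_p2:+"$_p2"}; then
      echo "$_id PASS"
      _passed=$((_passed + 1))
    else
      echo "$_id FAIL"
    fi
  done <<'EOF'
VAL01-1|cert-rotation-list-expiring.txt|expiring|node-c
VAL01-2|cert-rotation-prerotate-health.json|"status":"ok"|
VAL01-3|cert-rotation-timing.txt|pass=true|
VAL01-4|cert-rotation-list-after.txt|no certificates matched|
VAL01-5|cert-rotation-postrotate-health.json|"status":"ok"|
VAL01-6|cert-rotation-audit-events.json|cert.rotated|
EOF
  echo "VAL 01: $_passed/6 checks passed"
  [ "$_passed" -eq 6 ]
}
```

Running evaluate_val01 "$EVIDENCE_DIR/autonomy" after a lab run should agree with the composite report; a disagreement would suggest the evidence files were modified after the report was written.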

Serial assertion: serials_differ=true in the report is expected but not a gate. It confirms a new certificate was issued; serials_differ=false would indicate a cert rotate implementation regression.

Failure handling:

  • VAL01-2 or VAL01-5 fails (no "status":"ok"): the control-plane from Phase 5 may have exited; check cert-rotation-audit-rotate.log for errors and re-run the full cert lab

  • VAL01-3 fails (pass=false): not expected in practice, since rotation is sub-second; a failure points to severe system overload or a wall-clock adjustment in the CI environment during the measurement

  • VAL01-4 fails (expiry not cleared): the rotated cert may still expire within 5 days, for example if cert rotate was called with a short --validity-days value instead of the 90-day default; verify valid_days in the rotate command output in cert-rotation-rotate.txt

  • VAL01-6 fails (no cert.rotated): audit dir may not match; check AUTONOMY_AUDIT_DIR is set to $RETAINED_EVIDENCE_DIR/store before the query


7. Deferred Coverage Matrix

The table below closes the remaining coverage ambiguity by stating exactly which rotation-adjacent cases are still outside VAL-01, whether a later committed VAL slice exists, and what kind of future work would be required to validate them.

| Area | Covered by VAL-01? | Later committed VAL? | Current status |
|---|---|---|---|
| CA rotation | No | No | Product workflow not implemented; requires manual CA replacement and leaf re-issuance |
| Server-certificate hot reload | No | No | Product behavior not implemented; control-plane TLS cert is loaded at startup |
| In-flight request continuity | No | No | Not validated; would require a streaming or long-lived connection harness |
| Coordinated multi-node rotation | No | No | Runbook guidance exists, but no committed validation slice exercises clustered rotation |
| Trust-chain acceptance/rejection after cert changes | Partially adjacent | Yes, VAL-02 | Planned separately as trust-chain rejection validation, not as rotation continuity validation |

The safest operator interpretation today is:

  1. Treat VAL-01 as proof of bounded client-certificate rotation for fresh mTLS connections against a live control-plane.

  2. Treat the rows above as open validation gaps unless and until the product gains new capabilities or a later validation slice is explicitly added and merged.


8. Report Template

The composite report written to autonomy/cert-rotation-val01-report.txt follows this format:

# VAL 01 — Certificate Rotation Validation Report
timestamp: 2026-03-19T10:00:00Z

## Results
VAL01-1 expiry_detection:   PASS
VAL01-2 prerotate_connect:  PASS
VAL01-3 rotation_timing:    PASS  (elapsed=0s  bound=300s)
VAL01-4 expiry_cleared:     PASS
VAL01-5 postrotate_connect: PASS
VAL01-6 audit_captured:     PASS

## Serial assertion
  before_serial=3e8
  after_serial=3e9
  serials_differ=true

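A run is green when all six lines end in PASS. serials_differ=true is expected alongside a green run but, as stated in section 6, it is a supplementary assertion and not a gate.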

The runner also prints VAL 01: 6/6 checks PASS (report: cert-rotation-val01-report.txt) to stdout so CI log scanners can detect the result with a simple grep, without parsing the report file.
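A CI step could gate on that summary line with a fixed-string grep. This is a minimal sketch that assumes the runner's stdout has been captured to a log file; val01_ci_gate is a hypothetical function name and the log path is supplied by the caller.

```shell
# val01_ci_gate LOGFILE: succeed only if the runner's 6/6 summary line appears in LOGFILE.
# -F matches the text literally; -q suppresses output so only the exit status matters.
val01_ci_gate() {
  grep -qF 'VAL 01: 6/6 checks PASS' "$1"
}
```

Because the gate keys on the full 6/6 string, a partial result or a missing summary line (for example, a runner that exited before Phase 8) is treated as a failure.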


9. How to Run

Phase 8 executes automatically as part of run_cert_lab() when the full lab is run:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_cli_audit_lab.sh

The composite report is printed to stdout as part of the cert lab output. All evidence files land in $EVIDENCE_DIR/autonomy/ (default: evidence/pr17-cli-audit-local-2026-03-17/autonomy/).

To inspect results after a run:

# Quick pass/fail
cat evidence/pr17-cli-audit-local-2026-03-17/autonomy/cert-rotation-val01-report.txt

# Verify the serial assertion (certificate was actually replaced)
diff \
  evidence/pr17-cli-audit-local-2026-03-17/autonomy/cert-rotation-before-dates.txt \
  evidence/pr17-cli-audit-local-2026-03-17/autonomy/cert-rotation-after-dates.txt
# Serial lines must differ

# Confirm audit record
jq '.[0] | {event, actor, resource, outcome}' \
  evidence/pr17-cli-audit-local-2026-03-17/autonomy/cert-rotation-audit-events.json