NVIDIA GPU Integration

The nvidia-demo bundle demonstrates the AutonomyOps governance model for GPU-accelerated inference workloads. It does not perform real CUDA compute; it validates that the governance and attribution rules are enforced correctly before you integrate a real inference runtime.

What the demo covers

  • Container runtime GPU access via CDI (--device nvidia.com/gpu=all)

  • tool.infer.* attribution enforcement: inference calls require a non-empty model_id or they are rejected

  • The nvidia_safety.rego policy and its mapping to Python governance logic

  • Self-test mode (--check) for CI and post-deployment smoke tests

Hardware targets

Platform                 Runtime                     Notes
Jetson Orin (aarch64)    nvidia container runtime    /dev/nvmap exposed via CDI
x86_64 desktop GPU       nvidia container runtime    /dev/nvidiactl, /dev/nvidia0
No GPU (CI / dev)        --runtime=runc              Self-test passes; GPU probe is advisory

Prerequisites

  • Docker 24+ with the NVIDIA container runtime installed (/etc/docker/daemon.json must include "default-runtime": "nvidia", or pass --runtime=nvidia explicitly; see the example after this list)

  • For Jetson Orin: JetPack 6 with CDI support enabled
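
For reference, a typical /etc/docker/daemon.json that registers the NVIDIA runtime and makes it the default looks like the following (the binary path may differ by installation):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}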

Build the image

# From the repo root. Builds on the python:3.11-slim base; the arm64 variant is pulled automatically on Orin.
docker build -t autonomyops/nvidia-demo:local demo/nvidia-demo/

Self-test (CI and smoke)

docker run --rm --runtime=nvidia autonomyops/nvidia-demo:local --check

Expected output:

[self-test] model_id present → accepted (HTTP 200) ✓
[self-test] model_id absent  → rejected (HTTP 400) ✓
[self-test] GPU probe: /dev/nvmap found
[self-test] PASS

On a host without an NVIDIA runtime or GPU devices, the GPU probe line reads:

[self-test] GPU probe: none of /dev/nvmap, /dev/nvidiactl, /dev/nvidia0 present (advisory)

The self-test still exits 0 because GPU access is advisory; the governance logic does not depend on GPU availability.
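
The probe itself is a simple existence check over the known device paths. A minimal sketch of the logic, with the helper name assumed for illustration (the actual implementation lives in infer_server.py):

import os

# Device paths probed by the demo: Jetson Orin exposes /dev/nvmap;
# discrete GPU hosts expose /dev/nvidiactl and /dev/nvidia0.
GPU_DEVICE_PATHS = ("/dev/nvmap", "/dev/nvidiactl", "/dev/nvidia0")

def gpu_probe() -> str:
    """Report which GPU device nodes are visible. Advisory only: the
    result never affects the self-test exit code."""
    found = [p for p in GPU_DEVICE_PATHS if os.path.exists(p)]
    if found:
        return f"GPU probe: {', '.join(found)} found"
    return "GPU probe: none of /dev/nvmap, /dev/nvidiactl, /dev/nvidia0 present (advisory)"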

HTTP server mode

docker run --rm --runtime=nvidia -p 8080:8080 autonomyops/nvidia-demo:local

Allowed inference call

curl -s -X POST http://localhost:8080/infer \
    -H "Content-Type: application/json" \
    -d '{"kind":"tool.infer.classify","params":{"model_id":"sonic-v1","input":"..."}}'

Expected:

{"result": "inference accepted", "model_id": "sonic-v1", "node": "sonic-v1"}

Rejected call (no model_id)

curl -s -X POST http://localhost:8080/infer \
    -H "Content-Type: application/json" \
    -d '{"kind":"tool.infer.classify","params":{}}'

Expected (HTTP 400):

{"error": "model_id required — unattributed inference rejected by governance policy"}

This mirrors the allow rule in nvidia_safety.rego:

allow if {
    startswith(input.kind, "tool.infer.")
    input.params.model_id != ""
}

Inference calls that do not carry a model_id cannot be attributed to a specific model version and are always denied, regardless of other parameters.
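
The same rule maps onto the demo's Python governance logic. A minimal sketch, with the function name assumed for illustration; note that an absent model_id and an empty one are both rejected, matching the undefined-reference and empty-string cases in Rego:

def infer_allowed(kind: str, params: dict) -> bool:
    # Mirrors nvidia_safety.rego: tool.infer.* kinds must carry a
    # non-empty model_id; absent and empty string both fail.
    return kind.startswith("tool.infer.") and bool(params.get("model_id"))

assert infer_allowed("tool.infer.classify", {"model_id": "sonic-v1"})      # HTTP 200
assert not infer_allowed("tool.infer.classify", {})                        # HTTP 400
assert not infer_allowed("tool.infer.classify", {"model_id": ""})          # HTTP 400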

Bundle policy

The nvidia-demo bundle (demo/bundles/nvidia.tar) carries demo/bundles/nvidia/policies/nvidia_safety.rego.

Allowed actions:

Kind             Condition
tool.infer.*     model_id must be non-empty
lifecycle.*      Unconditional (node start/stop/health)
telemetry.emit   Unconditional (WAL drain, OTLP export)
tool.echo        Unconditional (health probes)
tool.shell       Always denied
anything else    Denied (fail-closed default)
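
Expressed as Python governance logic, the whole table reduces to one fail-closed dispatch. A sketch, not the bundle's actual source (the authoritative policy is nvidia_safety.rego):

def action_allowed(kind: str, params: dict) -> bool:
    if kind == "tool.shell":
        return False                           # always denied
    if kind.startswith("tool.infer."):
        return bool(params.get("model_id"))    # attribution required
    if kind.startswith("lifecycle."):
        return True                            # node start/stop/health
    if kind in ("telemetry.emit", "tool.echo"):
        return True                            # WAL drain, OTLP export, health probes
    return False                               # fail-closed default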

Inspecting the bundle

autonomy bundle inspect demo/bundles/nvidia.tar --local

# With policy text
autonomy bundle inspect demo/bundles/nvidia.tar --local --show-policy

Key manifest fields:

name:       nvidia-demo
version:    0.1.0
channel:    dev
context:    container
entrypoint: docker run --rm --runtime=nvidia --device nvidia.com/gpu=all
            ghcr.io/autonomyops/nvidia-demo:latest --check

CDI device access

On modern NVIDIA container runtime installations, GPU devices are exposed via the Container Device Interface (CDI):

# Check CDI devices available on the host
nvidia-ctk cdi list

# Run with all visible GPUs via CDI
docker run --rm --device nvidia.com/gpu=all autonomyops/nvidia-demo:local --check

# Run with a specific GPU by index
docker run --rm --device nvidia.com/gpu=0 autonomyops/nvidia-demo:local --check

On Jetson Orin the device path is /dev/nvmap; on discrete GPU hosts the paths are /dev/nvidiactl and /dev/nvidia0. The infer_server.py probe checks all three paths and reports the result as advisory.

Multi-arch images

The demo/nvidia-demo/Dockerfile uses python:3.11-slim, which provides both linux/amd64 and linux/arm64 variants. For Jetson Orin (aarch64) builds:

# Build and push a multi-arch manifest
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag ghcr.io/autonomyops/nvidia-demo:latest \
    --push demo/nvidia-demo/

The CI release pipeline builds both architectures with docker buildx, using QEMU emulation for the non-native platform. See build/demo-multiarch-images for the full Makefile targets.

Adapting for a production inference runtime

The demo stub is intentionally minimal — no real CUDA compute is performed. To replace it with a real runtime:

  1. Replace infer_server.py with your model serving code. Retain the _handle_infer attribution check, or call the AutonomyOps runtime endpoint (/v1/tool) for each inference request (see the sketch after this list).

  2. Update model_id routing so each request carries the exact model version string that matches your audit requirements.

  3. Update nvidia_safety.rego if your inference kinds deviate from tool.infer.*; add package/version constraints as needed.

  4. Add CUDA base layer to the Dockerfile:

    FROM nvcr.io/nvidia/cuda:12.3.0-runtime-ubuntu22.04
    # … add your inference server …
    
  5. Rebuild and verify the bundle before activating on edge nodes:

    autonomy bundle push my-inference.tar \
        registry.example.com/my-inference:v1.0.0 \
        --key demo/keys/cosign.key
    autonomy bundle verify registry.example.com/my-inference:v1.0.0 \
        --pub-key demo/keys/cosign.pub
    
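For step 1, a hypothetical sketch of delegating the attribution check to the runtime's /v1/tool endpoint. The endpoint path comes from step 1 above; the runtime address, request shape, and function name are assumptions:

import json
import urllib.error
import urllib.request

RUNTIME_URL = "http://127.0.0.1:9000/v1/tool"  # assumed local runtime address

def governance_check(kind: str, params: dict) -> bool:
    """Ask the AutonomyOps runtime whether this call is allowed.
    A 2xx response means the policy accepted it."""
    body = json.dumps({"kind": kind, "params": params}).encode()
    req = urllib.request.Request(
        RUNTIME_URL, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=2.0) as resp:
            return 200 <= resp.status < 300
    except urllib.error.URLError:
        # Fail closed: HTTP 400 (unattributed inference) and an
        # unreachable runtime are both treated as a denial.
        return False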

See also

  • Bundle Workflows — pull, inspect, stage, activate

  • ROS2 Governance — policy model and dual-path execution

  • Hardware Adaptation — Jetson Orin HIL setup

  • demo/nvidia-demo/ — Dockerfile and inference server source

  • demo/bundles/nvidia/ — manifest and Rego policy source