ROS 2 Governed Bridge

Audience: operators turning on, observing, or recovering the long-lived governed_ros2_bridge process — the runtime-owned C++ rclcpp bridge that subscribes on a separate “agent” DDS domain, POSTs every message to the AutonomyOps /v1/tool runtime for policy evaluation, and republishes allowed messages on the “real” DDS domain. This is the per-message counterpart to launch-level governance (the ROS 2 Governance reference covers the launch path).

The bridge is opt-in via --governed-bridge on either autonomy ros2 run (paid) or autonomy run ros2.launch (CE); default off preserves prior AutoRuntime behavior. This page tells you what to do when you turn it on, what to look for, and how to get out of trouble.

Walking through the demo first? Start at ROS 2 Governed Bridge Quickstart; it runs autonomy demo ros2-bridge end-to-end with allow + deny evidence. The runbook below assumes you already have a workload to govern.

Prerequisites

  • docker on PATH. The bridge runs in a container even when the workload runs natively — runtime/ros2bridge.BridgeProcess enforces NetworkMode=host and IPCMode=host, both of which are dispatched by runtime/exec via Docker.

  • ghcr.io/autonomyops/adk-ros2-runtime:<version> present locally. Pull with docker pull ghcr.io/autonomyops/adk-ros2-runtime:latest, or build from source with docker build -t ghcr.io/autonomyops/adk-ros2-runtime:local -f demo/ros2-runtime/Dockerfile . (the trailing dot is the build context).

  • A policy bundle that allows the topics the bridge will republish. The embedded embedded:ros2-bridge-demo policy allows /cmd_vel (and /cmd_vel/*) and denies /disable_safety. Production fleets stage a custom bundle via autonomy bundle pull <ref> and pass --policy <ref> to the launch command.
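To make the allow/deny split of the demo policy concrete, here is a minimal Python sketch of the topic matching described above. It is not the runtime's policy engine. The function name demo_policy_outcome, the pattern lists, the reading of /cmd_vel/* as "any sub-topic", and the choice to deny topics the policy does not mention are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the description of embedded:ros2-bridge-demo above.
ALLOW_PATTERNS = ["/cmd_vel", "/cmd_vel/*"]
DENY_PATTERNS = ["/disable_safety"]


def demo_policy_outcome(topic: str) -> str:
    """Return "allow" or "deny" for a topic (illustrative only)."""
    # Explicit denies are checked first so they always win.
    if any(fnmatchcase(topic, p) for p in DENY_PATTERNS):
        return "deny"
    if any(fnmatchcase(topic, p) for p in ALLOW_PATTERNS):
        return "allow"
    # Assumption: a topic the policy does not mention is denied.
    return "deny"
```

Under these assumptions /cmd_vel and /cmd_vel/left are allowed, while /disable_safety and an unrelated topic such as /cmd_vel_raw are denied. Check your own bundle's rules before relying on any of this.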

Procedure

1. Pick two ROS_DOMAIN_IDs

The bridge subscribes on one domain and republishes on another. The two must differ: with identical IDs the bridge would collapse into a loopback that defeats governance entirely, so BridgeProcess.Run rejects the configuration and returns ErrSameDomain immediately.

Conventions used across the demo + docs:

Role           | Default | Meaning
--agent-domain | 99      | Where the launched workload publishes: the “untrusted” side. Any DDS participant on this domain is intercepted.
--real-domain  | 42      | Where allowed messages get republished: the “real robot” side. Production subscribers (motor controller, perception stack) live here.

If your fleet already uses a particular ROS_DOMAIN_ID for production traffic, pin it to --real-domain and pick any unused integer in 0..101 for --agent-domain. The runbook below assumes 99 / 42.
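A pre-flight check can catch a bad domain pair before the launch is attempted. The sketch below is a hypothetical operator-side helper, not part of the autonomy CLI. The name check_domains is invented here, and the 0..101 bound is taken from the guidance above.

```python
def check_domains(agent_domain: int, real_domain: int) -> None:
    """Raise ValueError if the pair cannot be used for a governed bridge."""
    for name, value in (("agent", agent_domain), ("real", real_domain)):
        # Range taken from the 0..101 guidance in this step.
        if not 0 <= value <= 101:
            raise ValueError(f"{name} domain {value} is outside 0..101")
    if agent_domain == real_domain:
        # Same condition that BridgeProcess.Run reports as ErrSameDomain.
        raise ValueError("agent and real domains must differ")
```

Calling check_domains(99, 42) returns silently. check_domains(42, 42) and check_domains(99, 250) both raise.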

2. Enable the bridge on the launch

You must pass --bridge-topics to tell the bridge which workload topics to intercept. Without it the bridge falls back to a compiled-in default (/agent_chat, typed std_msgs/msg/String), so the workload’s publishes on /cmd_vel, sensor topics, and so on are not intercepted. The runner prints a WARN on stderr for this combination, but the launch still proceeds. Because the bridge fails closed, those un-bridged publishes never reach the real domain: nothing crosses ungoverned, but nothing crosses at all.

CE (no orchestrator required):

autonomy run \
    --image ghcr.io/autonomyops/adk-ros2-runtime:latest \
    --governed-bridge \
    --agent-domain 99 \
    --real-domain 42 \
    --bridge-topics '/cmd_vel:geometry_msgs/msg/Twist,/disable_safety:std_msgs/msg/Bool' \
    ros2.launch launch demo_robot arm_demo.launch.py

Paid tier (same flags, paid-tier surface):

autonomy ros2 run \
    --image ghcr.io/autonomyops/adk-ros2-runtime:latest \
    --governed-bridge \
    --agent-domain 99 \
    --real-domain 42 \
    --bridge-topics '/cmd_vel:geometry_msgs/msg/Twist,/disable_safety:std_msgs/msg/Bool' \
    launch demo_robot arm_demo.launch.py

What happens, in order:

  1. The runtime binds the in-process /v1/tool server to a random 127.0.0.1:<port>. The URL is injected into both the bridge container and the launched workload container as AUTONOMY_RUNTIME_URL.

  2. The bridge container is spawned (--network host --ipc host, subscribing on ROS_DOMAIN_ID=99).

  3. The launch waits for the bridge to print governed_ros2_bridge: ready agent_domain=99 real_domain=42 on stdout. The readiness wait is bounded by --bridge-ready-timeout (default 30s); see Step 5 if it times out.

  4. The launched workload starts with ROS_DOMAIN_ID=99 and --ipc=host injected — so its publishes land on the bridge’s subscription domain and share /dev/shm with the bridge container for FastDDS SHM transport.

  5. Every message the workload publishes on a bridged topic flows: workload → bridge.subscribe → POST /v1/tool → policy (allow) → bridge.republish → real domain.
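If you script around the launch (for example, tailing captured bridge stdout in CI), the readiness line from item 3 can be recognised with a small parser. This is a sketch that assumes the line format is exactly as quoted above. The names READY_RE and parse_ready_line are hypothetical.

```python
import re
from typing import Optional, Tuple

# Format taken from the ready line quoted in item 3 above.
READY_RE = re.compile(
    r"governed_ros2_bridge: ready agent_domain=(\d+) real_domain=(\d+)"
)


def parse_ready_line(line: str) -> Optional[Tuple[int, int]]:
    """Return (agent_domain, real_domain) if the line is the ready signal."""
    match = READY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Any line that does not match returns None, so the helper can be applied to every line of output without special-casing.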

3. Confirm the loop is closed

In a second terminal, inspect decision frames the bridge has emitted so far (re-run after each publish — autonomy wal inspect reads the WAL file end-to-end on each invocation; there is no streaming wal subcommand):

# Preferred: first-class --bridge-only filter (#939 4-E.a).
autonomy wal inspect --kind autonomy.decision --bridge-only --json

# Equivalent jq form (still works; use this if you need richer projection).
autonomy wal inspect --kind autonomy.decision --json \
  | jq 'select(.event.attrs.bridge_origin == "governed_ros2_bridge")'

You should see one frame per bridged publish, each carrying:

  • tool=tool.ros2.topic.publish

  • outcome=allow (or deny if the policy rejected it)

  • bridge_origin=governed_ros2_bridge (#939 4-E.a marker; absent on direct node POSTs, present on bridge-routed POSTs — pinned by bridgeOriginFromRequest in runtime/server.go)

  • policy_ref matching the bundle’s manifest.policy_ref

If the marker is absent on frames you expected to be bridge-routed, the bridge is not in fact mediating the publish: the workload is publishing directly on the real domain. Re-check that --agent-domain and --real-domain differ and that the workload’s ROS_DOMAIN_ID env was actually overridden (see Step 5).
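When you want the marker check as a script rather than a jq one-liner, the same test can be expressed in Python. The sketch below assumes the --json output is one JSON object per line, which this page does not state, so adjust the parsing if the CLI emits a single array. It reads only the .event.attrs.bridge_origin path used by the jq filter above. The function name split_by_origin is hypothetical.

```python
import json
from typing import Iterable, List, Tuple

BRIDGE_SENTINEL = "governed_ros2_bridge"


def split_by_origin(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split decision frames into (bridge_routed, everything_else)."""
    bridged: List[dict] = []
    other: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        frame = json.loads(line)
        # Same path the jq filter reads: .event.attrs.bridge_origin
        attrs = frame.get("event", {}).get("attrs", {})
        if attrs.get("bridge_origin") == BRIDGE_SENTINEL:
            bridged.append(frame)
        else:
            other.append(frame)
    return bridged, other
```

Passing sys.stdin as the lines argument lets you pipe autonomy wal inspect --kind autonomy.decision --json into it. A non-empty second list is the signal to investigate.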

4. Subscribe-side sanity check

On the real domain, confirm allowed messages are arriving:

ROS_DOMAIN_ID=42 ros2 topic echo /cmd_vel

Expected: one echoed message per allow decision in the WAL. If you see decisions but no echoes, the policy is most likely denying every message: check the outcome values in the WAL (Step 7) and the policy bundle.

Common operator situations

5. Troubleshooting the readiness gap

Symptom: the launch fails with

ros2: governed bridge did not signal ready within 30s

or

ros2: governed bridge exited before signaling ready: <wrapped error>

The first case is a soft timeout (the bridge is alive but slow); the second case is a hard exit (#940 fix — pre-fix this used to fall through to the soft timeout and silently corrupt the run). Both abort the launch — the runtime will not start the workload without a ready bridge.

Triage:

  1. Was the image pull cold? Cold pulls of adk-ros2-runtime regularly exceed 30s on slow networks. Pre-pull:

    docker pull ghcr.io/autonomyops/adk-ros2-runtime:latest
    

    then re-run. For bandwidth-constrained environments, raise the timeout on the launch: --bridge-ready-timeout 2m.

  2. Is the binary actually in the image?

    docker run --rm --entrypoint /bin/bash \
        ghcr.io/autonomyops/adk-ros2-runtime:latest \
        -c 'which governed_ros2_bridge && governed_ros2_bridge --version'
    

    If the binary is missing, your local image was built before the governed_ros2_bridge colcon target was added to demo/ros2-runtime/Dockerfile. Rebuild:

    docker build -t ghcr.io/autonomyops/adk-ros2-runtime:local -f demo/ros2-runtime/Dockerfile .
    autonomy run --image ghcr.io/autonomyops/adk-ros2-runtime:local --governed-bridge ...
    
  3. Is the bridge container exiting on a config error? Check the wrapped error in the abort message — ErrSameDomain and ErrRuntimeURLRequired both surface here. ErrSameDomain means you passed --agent-domain == --real-domain; pick different integers.
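The two abort messages map onto different triage paths, which a wrapper script can encode. The sketch below matches only the message fragments quoted in this step. The function name classify_bridge_failure and the hint strings are hypothetical.

```python
from typing import Tuple


def classify_bridge_failure(message: str) -> Tuple[str, str]:
    """Map a launch abort message to (kind, next_step_hint)."""
    if "governed bridge exited before signaling ready" in message:
        # Hard exit: the wrapped error names the cause, e.g. ErrSameDomain.
        return ("hard-exit", "read the wrapped error; check binary and config")
    if "governed bridge did not signal ready within" in message:
        # Soft timeout: the bridge is alive but slow.
        return ("soft-timeout", "pre-pull the image or raise --bridge-ready-timeout")
    return ("unknown", "not a bridge readiness failure")
```

Checking for the hard-exit message first keeps the classification stable even if a wrapped error happens to quote the timeout wording.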

6. Recovering from a stuck bridge container

Symptom: the launch process has been killed (Ctrl-C, SIGKILL, host reboot) but docker ps shows the bridge container still running. Or: a fresh launch fails with a “port already in use” / “FastDDS already bound” stderr line.

The bridge is spawned as docker run --rm, so a clean shutdown removes the container. A killed launch process may leak it if Docker didn’t get the SIGTERM cascade in time.

  1. List bridge containers (adjust the ancestor tag if you launched from a different image, such as the :local build):

    docker ps --filter ancestor=ghcr.io/autonomyops/adk-ros2-runtime:latest --format 'table {{.ID}}\t{{.Status}}\t{{.Command}}'
    
  2. Stop with grace (lets the bridge flush its last decisions):

    docker stop <container-id>
    
  3. If stop hangs >10s, force-remove:

    docker rm -f <container-id>
    
  4. Re-launch. The runtime starts a fresh bridge on a fresh 127.0.0.1:<random> port; no shared state with the prior process.
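Steps 2 and 3 of this procedure can be combined into one helper that tries a graceful stop and falls back to a forced removal. This is a sketch of a hypothetical operator script, not something the runtime ships. The names run_with_timeout and stop_bridge_container are invented here, and the 5-second margin on top of the grace period is an arbitrary choice.

```python
import subprocess
from typing import Callable, List, Sequence


def run_with_timeout(argv: Sequence[str], timeout: float) -> int:
    """Run a command; report -1 if it does not finish within the timeout."""
    try:
        return subprocess.run(list(argv), timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return -1


def stop_bridge_container(
    container_id: str,
    grace_seconds: int = 10,
    run: Callable[[Sequence[str], float], int] = run_with_timeout,
) -> List[List[str]]:
    """Stop a leaked bridge container; return the docker commands issued."""
    issued: List[List[str]] = []
    stop_cmd = ["docker", "stop", "-t", str(grace_seconds), container_id]
    issued.append(stop_cmd)
    # Give docker a little longer than the grace period before giving up.
    if run(stop_cmd, grace_seconds + 5) == 0:
        return issued
    # Graceful stop failed or hung, so fall back to a forced removal.
    rm_cmd = ["docker", "rm", "-f", container_id]
    issued.append(rm_cmd)
    run(rm_cmd, grace_seconds + 5)
    return issued
```

The run parameter exists so the logic can be exercised without Docker. In normal use you would call stop_bridge_container with the container ID from the listing in step 1.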

If your shell history shows the bridge launched with --keep, the WAL directory under /tmp/autonomyops-demo-wal-* will still be on disk — that’s intended (see Step 7 for how to read it).

7. Inspecting the WAL after the fact

Every bridge-mediated publish writes one autonomy.decision frame to the WAL with the bridge_origin=governed_ros2_bridge marker. To pull the per-run audit trail:

# Preferred: first-class flag, no jq needed (#939 4-E.a).
autonomy wal inspect --kind autonomy.decision --bridge-only --json

# Equivalent older form (still supported).
autonomy wal inspect --kind autonomy.decision --json \
  | jq 'select(.event.attrs.bridge_origin == "governed_ros2_bridge")'

To distinguish bridge-routed decisions from direct-node POSTs (e.g. a node inside the launched container that calls /v1/tool itself without going through the bridge):

# Bridge-routed (first-class):
autonomy wal inspect --kind autonomy.decision --bridge-only --json \
  | jq '{tool, outcome, attrs: .event.attrs}'

# Direct node-POSTs (no marker — invert via jq; the negative case is
# rarer than the positive case, so it stays in jq).
autonomy wal inspect --kind autonomy.decision --json \
  | jq 'select(.event.attrs.tool == "tool.ros2.topic.publish" and .event.attrs.bridge_origin == null) | {tool, outcome, attrs: .event.attrs}'

The runtime sets bridge_origin only when the inbound POST’s params._bridge_origin field is the canonical sentinel governed_ros2_bridge AND the request kind is on the closed BridgeRoutableKinds set (#941 fix — pre-fix the marker could be spoofed by a node calling tool.echo with the marker in params).
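The two-part rule above can be restated as a short Python sketch. The real check is bridgeOriginFromRequest in the Go runtime and may differ in detail. The set contents below are an assumption: this page only establishes that tool.ros2.topic.publish is bridge-routed and that tool.echo is not.

```python
from typing import Mapping, Optional

BRIDGE_SENTINEL = "governed_ros2_bridge"
# Assumption: the real BridgeRoutableKinds set lives in the runtime and may
# hold more kinds; only this one is known from this page.
BRIDGE_ROUTABLE_KINDS = frozenset({"tool.ros2.topic.publish"})


def bridge_origin(kind: str, params: Mapping[str, object]) -> Optional[str]:
    """Return the marker to record, or None when it must not be set."""
    if kind not in BRIDGE_ROUTABLE_KINDS:
        # e.g. tool.echo with a forged _bridge_origin gets no marker.
        return None
    if params.get("_bridge_origin") != BRIDGE_SENTINEL:
        return None
    return BRIDGE_SENTINEL
```

Both conditions must hold, so a forged params field on a non-routable kind and a routable kind without the sentinel both yield None.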

8. Disabling the bridge cleanly

Just drop --governed-bridge from the launch. Without that flag the runtime falls back to:

  • ExecBridge (the runtime is the publisher of every node-level POST), matching the prior AutoRuntime behavior unchanged.

  • No bridge container is spawned, no ROS_DOMAIN_ID injection, no IPCMode=host on the workload.

Bridge containers that were already running won’t be terminated — see Step 6.

Multi-topic + generic-type interception (#939 4-A)

The bridge accepts arbitrary DDS message types on any number of topics in a single process via rclcpp::GenericSubscription + rclcpp::GenericPublisher. Operator configuration:

  • --bridge-topics (CLI, on both autonomy ros2 run and autonomy run) — comma-separated topic:type pairs OR repeated flags. The runner forwards these to RunOptions.BridgeTopics, which runtime/ros2.defaultStartGovernedBridge sets on BridgeProcess.Topics, which becomes the GOVERNED_BRIDGE_TOPICS env on the bridge container/native binary.

  • GOVERNED_BRIDGE_TOPICS env (direct, when invoking the bridge binary outside the runner — e.g. via docker run) — same comma-separated topic:type pairs. Each entry creates one subscription on the agent domain + one publisher on the real domain, typed by the operator-supplied type.

  • GOVERNED_BRIDGE_TOPIC (singular, back-compat): one topic; the C++ side hard-defaults its type to std_msgs/msg/String (the pre-4-A behavior). Kept for single-topic legacy wiring only; new callers should use GOVERNED_BRIDGE_TOPICS with an explicit type.

  • Neither set — falls back to /agent_chat + std_msgs/msg/String, the compiled-in default.
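The fallback order in the list above can be sketched as follows. This models the documented behavior in Python for illustration and is not the bridge's C++ parsing code. Two points are assumptions because this page does not state them: that the plural variable wins when both are set, and that a malformed entry is an error rather than being skipped.

```python
from typing import List, Mapping, Tuple

DEFAULT_TOPIC = "/agent_chat"
DEFAULT_TYPE = "std_msgs/msg/String"


def resolve_bridge_topics(env: Mapping[str, str]) -> List[Tuple[str, str]]:
    """Return the (topic, type) pairs the bridge would intercept."""
    plural = env.get("GOVERNED_BRIDGE_TOPICS", "").strip()
    if plural:
        # Assumption: the plural form takes precedence over the singular.
        pairs: List[Tuple[str, str]] = []
        for entry in plural.split(","):
            # Topic and type names contain '/', not ':', so split on ':'.
            topic, sep, msg_type = entry.strip().partition(":")
            if not sep or not topic or not msg_type:
                raise ValueError(f"expected topic:type, got {entry!r}")
            pairs.append((topic, msg_type))
        return pairs
    singular = env.get("GOVERNED_BRIDGE_TOPIC", "").strip()
    if singular:
        # Back-compat path: the type is fixed to std_msgs/msg/String.
        return [(singular, DEFAULT_TYPE)]
    # Neither variable set: compiled-in default.
    return [(DEFAULT_TOPIC, DEFAULT_TYPE)]
```

Passing os.environ as env shows which topics a given shell environment would select before you start the bridge by hand.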

Wire format addition. Every bridge-routed POST now carries params.payload_b64 = base64 of the message’s serialized CDR bytes, alongside params.type and the existing params.topic. The params.data field is still emitted for std_msgs/msg/String only (back-compat with the canonical wire-shape contract test); other types ship the bytes via payload_b64 alone. The runtime currently keys policy on topic + kind; field-level typed-policy via rosidl_typesupport_introspection_cpp decoding of payload_b64 lands in a follow-up.
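As an illustration of the params fields named above, the sketch below assembles them in Python. It covers only the params object, since the rest of the POST body is not described on this page. The function name build_params is hypothetical, and the bytes passed in stand in for real serialized CDR data, which this sketch does not produce.

```python
import base64
from typing import Dict, Optional

BRIDGE_SENTINEL = "governed_ros2_bridge"
STRING_TYPE = "std_msgs/msg/String"


def build_params(
    topic: str,
    msg_type: str,
    cdr_bytes: bytes,
    string_data: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the params fields described above (illustrative only)."""
    params = {
        "topic": topic,
        "type": msg_type,
        # Serialized message bytes travel base64-encoded.
        "payload_b64": base64.b64encode(cdr_bytes).decode("ascii"),
        # Marker the runtime checks before recording bridge_origin.
        "_bridge_origin": BRIDGE_SENTINEL,
    }
    # data is emitted for std_msgs/msg/String only (back-compat).
    if msg_type == STRING_TYPE and string_data is not None:
        params["data"] = string_data
    return params
```

Decoding payload_b64 with base64.b64decode returns the original bytes, which is the form field-level typed policy would need to work from.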

Native + container dual-path is validated. The same C++ source compiles under both apt install ros-humble-ros-base natively and the adk-ros2-runtime docker image build. The bridge can run either way (Go-side: BridgeProcess.Image empty → native, set → container). End-to-end subscribe → POST → republish is smoke-validated on both paths against 3 different types.

Production hardening

  • Use the bridge for the topics you intend to govern, not for all of them. Direct-node POSTs to /v1/tool (no bridge) are also governed and can carry typed envelopes — split your fleet’s topics between the two paths according to typed-policy needs.

  • Stage the bridge behind a non-default policy bundle pinned via --policy <ref>. The embedded embedded:ros2-bridge-demo policy is for demos.

  • Layer SROS 2 / DDS-Security as defense-in-depth via --bridge-keystore + --bridge-enclave + --workload-enclave (the three flags are all-or-nothing, enforced before any side effect). Provision the keystore with autonomy ros2 keystore init/mint/permissions. End-to-end procedure + bypass-resistance verification in the SROS 2 runbook and SROS 2 quickstart.

  • Monitor the bridge container’s stderr for the rate-limited “POST failed” lines (#942 4-E.c — one line per topic per second, not per message). A sustained burst means the runtime listener died or the bridge can’t reach 127.0.0.1; correlate with autonomy wal status.

  • DDS interop note. Container-to-container DDS via --network host --ipc host works reliably (it’s what autonomy demo ros2-bridge uses); host-process-to-container DDS can fail to deliver under FastDDS even with shared /dev/shm because of how FastDDS announces locator addresses. Keep publishers and the bridge on the same side of the container/native boundary in production.
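To interpret the “POST failed” log volume mentioned in the monitoring bullet above, it helps to picture the per-topic rate limit. The Python sketch below models "one line per topic per second". The actual limiter is in the C++ bridge and may be implemented differently. The class name PerTopicLogLimiter is hypothetical.

```python
from typing import Dict


class PerTopicLogLimiter:
    """Allow at most one log line per topic per interval."""

    def __init__(self, interval_seconds: float = 1.0) -> None:
        self.interval = interval_seconds
        self._last: Dict[str, float] = {}

    def should_log(self, topic: str, now: float) -> bool:
        """Return True if a line for this topic may be emitted at time now."""
        last = self._last.get(topic)
        if last is not None and now - last < self.interval:
            return False  # still inside this topic's quiet window
        self._last[topic] = now
        return True
```

Under this model, a bridge with N failing topics emits at most N lines per second no matter how fast the workload publishes, so the line count reflects how many topics are affected and for how long, not how many messages were lost.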

Reference