
[multicast] DDM multicast exchange: V4 protocol, MRIB sync#696

Open
zeeshanlakhani wants to merge 4 commits into zl/mrib from zl/ddm-mcast

Conversation

@zeeshanlakhani

@zeeshanlakhani zeeshanlakhani commented Apr 2, 2026

Adds multicast group subscription distribution to the DDM exchange protocol with a V4 version.

Key changes:

  • V4 exchange protocol with multicast support (V3 peers are unaffected)
  • UnderlayMulticastIpv6 validated newtype (ff04::/64) moved from rdb types to mg-common
  • MRIB->DDM sync in mg-lower/mrib.rs
  • Learned multicast state exposed via DDM admin API for Omicron consumption
  • Atomic update_imported_mcast on Db (single lock for import/delete/diff, which is a bit different from the tunnel work)
  • Collapsed send_update dispatch
  • Shared pull handler helpers (collect_underlay_tunnel, collect_multicast)
  • MulticastPathHop constructor
  • Serde round-trip and validation tests, including version handling
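The atomic update_imported_mcast idea can be sketched roughly as below: a single lock guards import, delete, and diff, so callers never observe a half-applied update. `McastRoute` and `Db` here are simplified hypothetical stand-ins, not the actual maghemite types.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical stand-in for the real multicast route type.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct McastRoute(pub String);

pub struct Db {
    imported_mcast: Mutex<HashSet<McastRoute>>,
}

impl Db {
    pub fn new() -> Self {
        Db { imported_mcast: Mutex::new(HashSet::new()) }
    }

    // Apply imports and deletes under one lock and return the actual
    // diff (what was newly added, what was actually removed), so no
    // reader can interleave between the import and the delete.
    pub fn update_imported_mcast(
        &self,
        import: &HashSet<McastRoute>,
        delete: &HashSet<McastRoute>,
    ) -> (Vec<McastRoute>, Vec<McastRoute>) {
        let mut table = self.imported_mcast.lock().unwrap();
        let added: Vec<McastRoute> = import
            .iter()
            .filter(|r| table.insert((*r).clone())) // true only if new
            .cloned()
            .collect();
        let removed: Vec<McastRoute> = delete
            .iter()
            .filter(|r| table.remove(*r)) // true only if present
            .cloned()
            .collect();
        (added, removed)
    }
}
```

Returning the diff from inside the critical section is what lets the caller redistribute exactly the delta without re-reading (and possibly racing) the table.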

References

Stacked on zl/mrib (MRIB: Multicast RIB implementation #675) and depends on oxidecomputer/opte#924.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddm-mcast branch 3 times, most recently from 339f250 to 1b5996d Compare April 2, 2026 11:49
@zeeshanlakhani zeeshanlakhani changed the title ddm-mcast [multicast] DDM multicast exchange: V4 protocol, MRIB sync, M2P hooks Apr 2, 2026
@zeeshanlakhani zeeshanlakhani marked this pull request as ready for review April 2, 2026 15:43
Adds multicast group subscription distribution to the DDM exchange
protocol with a V4 version bump (frozen V3 types for wire compat).

Key changes:
- V4 exchange protocol with multicast support (V3 peers are unaffected)
- UnderlayMulticastIpv6 validated newtype (ff04::/64) moved from rdb types to mg-common
- MRIB->DDM sync in mg-lower/mrib.rs
- OPTE M2P hooks for learned multicast routes (requires OPTE #924)
- Atomic update_imported_mcast on Db (single lock for import/delete/diff, which is a bit different from the tunnel work)
- Collapsed send_update dispatch
- Shared pull handler helpers (collect_underlay_tunnel, collect_multicast)
- MulticastPathHop constructor
- Serde round-trip and validation tests, including version handling

Stacked on zl/mrib (MRIB: Multicast RIB implementation [#675](#675)).
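The "V3 peers are unaffected" property amounts to per-peer version dispatch on the send path, which might look like the following sketch. The enum and field names are illustrative, not the real DDM wire types.

```rust
// V3 peers keep the frozen V3 wire shape and never see multicast
// state; V4 peers additionally carry the multicast payload.
#[derive(Debug, PartialEq)]
pub enum ExchangeUpdate {
    V3 { underlay: Vec<String> },
    V4 { underlay: Vec<String>, multicast: Vec<String> },
}

pub fn update_for_peer(
    peer_version: u8,
    underlay: Vec<String>,
    multicast: Vec<String>,
) -> ExchangeUpdate {
    if peer_version >= 4 {
        ExchangeUpdate::V4 { underlay, multicast }
    } else {
        // Frozen V3 shape: multicast state is simply dropped for
        // older peers, so their decoding path is untouched.
        ExchangeUpdate::V3 { underlay }
    }
}
```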
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddm-mcast branch 5 times, most recently from 4281301 to 5d7d89d Compare April 7, 2026 05:02
return;
}
let resp =
    rt.block_on(async { client.advertise_multicast_groups(&routes).await });
Contributor

Is the dendrite side of the synchronization coming later?

Comment on lines +225 to +232
impl PartialEq for MulticastOrigin {
fn eq(&self, other: &Self) -> bool {
self.overlay_group == other.overlay_group
&& self.underlay_group == other.underlay_group
&& self.vni == other.vni
&& self.source == other.source
}
}
Contributor

I may be overlooking it, but where is this used?
I'm just trying to think about the implications this impl will have, depending on where/how it's used. (I'm a little paranoid after #650)

Author

Used in HashSet comparisons, but that usage is now commented out.

Contributor

My read on this is that the PartialEq impl is effectively defining what the "identity" of a MulticastOrigin is, based on which members we compare. That way we could do a replace() or something if the same MulticastOrigin (PartialEq returns true) is updated. Assuming I'm tracking properly, I just want to double-check which members we're comparing. Is the VNI an attribute of the path or part of the identity? Put another way, should (Group X, VNI X) and (Group X, VNI Y) be considered the same MulticastOrigin or two unique ones?

Author

I think for future-proofing, we want them to be considered different, as different VNIs mean different network contexts, even if the overlay group address is the same. Even though we're going with the arbitrary VNI setup right now, this keeps us from having logic bugs if we carry multiple VNIs in the future.
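The identity semantics above can be illustrated with a minimal sketch (a simplified, hypothetical field set, not the real MulticastOrigin): with the VNI participating in equality and hashing, the same overlay group under two VNIs is two distinct origins.

```rust
use std::net::Ipv6Addr;

// Simplified, hypothetical MulticastOrigin. Deriving PartialEq, Eq,
// and Hash over both fields makes (group, vni) the identity, so
// (Group X, VNI X) and (Group X, VNI Y) occupy two slots in a set.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct MulticastOrigin {
    pub overlay_group: Ipv6Addr,
    pub vni: u32,
}
```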

@taspelund
Contributor

I left my initial round of review comments, which are mostly nits. I think the one place where I noticed what looks to be a real issue is that we don't cleanup all the OPTE multicast state in Exchange.expire_peer().

At this point, the main questions I have are around the intended operation of DDM as a component of the overall multicast architecture. Until I have a better understanding of the architecture we plan to move forward with, I'm not sure how to give good feedback as to how well these updates to DDM align with it.

I started having deeper questions when I started seeing PartialEq/Eq/Hash trait impls for types like MulticastOrigin, MulticastRoute, MulticastPathVector, and MulticastPathHop. In particular, I want to make sure I understand the intention for the shape of the DDM route updates (i.e. how are the different path fields used in mroute installation and rpf calculation), how we plan to propagate them to peers (i.e. what fields are updated and in what way), and how we plan to install them locally. It's hard for me to tell how well the types work together and where things could go wrong, e.g. when put into collections relying on manual trait impls (I hit a very similar scenario with the unicast RIB paths and had to go back and fix it in #650).

zeeshanlakhani added a commit that referenced this pull request Apr 8, 2026
This addresses review feedback on DDM multicast exchange PR (#696).

Additionally, oriented towards the goal of Omicron owning all OPTE
M2P (multicast-to-physical) mappings via the sled-agent, we remove direct OPTE
M2P writes from DDM. This was an oversight by me using similar patterns to
tunnel routing. The M2P table is global to xde, so having both DDM and Omicron's
reconciler write to it creates a conflict risk where Nexus could reap
DDM-written entries as stale.

Our original intention has always been Omicron driving most things
(-> Dpd/Dendrite, -> OPTE) due to the implicit and dynamic lifecycle of
multicast groups. The work here is for DDM to distribute multicast membership
and expose learned state via its admin API for Omicron to consume (and rely on
for port knowledge), which is currently a TODO that will be removed in Omicron
once this work is plumbed up.

Other changes:
  - Replace String errors with UnderlayMulticastError in mg-common
  - Use route key VNI instead of hard-coded DEFAULT_MULTICAST_VNI
  - Sort ddmadm multicast output by overlay group for deterministic display
  - Consolidate oxide_vpc imports in sys.rs
  - Derive Eq on MulticastOrigin, document PartialEq/Hash exclusion of metric with #649 reference
  - Downgrade multicast withdrawal success log to debug
  - Fix MRIB diagram indentation
  - Remove "old" zl/mrib rdb-types UnderlayMulticastIpv6 and use mg-common one
  - Remove unnecessary DEFAULT_MULTICAST_VNI constant
@zeeshanlakhani zeeshanlakhani changed the title [multicast] DDM multicast exchange: V4 protocol, MRIB sync, M2P hooks [multicast] DDM multicast exchange: V4 protocol, MRIB sync Apr 8, 2026
Add an optional `if_name` field to PeerInfo (v2) so Omicron can learn
which switch port a DDM peer was discovered on. This enables Omicron
to use DDM as the primary source of truth for sled-to-port mapping,
replacing or cross-validating the current inventory-based approach.

The get_peers endpoint is versioned: v2+ returns PeerInfo with
if_name, v1 returns the original PeerInfo without it.
@zeeshanlakhani
Author

zeeshanlakhani commented Apr 8, 2026

Thanks @taspelund.

I left my initial round of review comments, which are mostly nits. I think the one place where I noticed what looks to be a real issue is that we don't cleanup all the OPTE multicast state in Exchange.expire_peer().

Actually, this was removed, as we didn't need it (more in the next response).

At this point, the main questions I have are around the intended operation of DDM as a component of the overall multicast architecture. Until I have a better understanding of the architecture we plan to move forward with, I'm not sure how to give good feedback as to how well these updates to DDM align with it.

I started having deeper questions when I started seeing PartialEq/Eq/Hash trait impls for types like MulticastOrigin, MulticastRoute, MulticastPathVector, and MulticastPathHop. In particular, I want to make sure I understand the intention for the shape of the DDM route updates (i.e. how are the different path fields used in mroute installation and rpf calculation), how we plan to propagate them to peers (i.e. what fields are updated and in what way), and how we plan to install them locally. It's hard for me to tell how well the types work together and where things could go wrong, e.g. when put into collections relying on manual trait impls (I hit a very similar scenario with the unicast RIB paths and had to go back and fix it in #650).

So, per RFD 488's API-driven approach, Omicron is entirely responsible for programming the OPTE MRIB (functionally M2P mappings) and the switch/DPD state (multicast group table state). I'm working on clearing up many bits of this RFD, as we've been working through the end-to-end implementation.

> #696 (comment)

So, Omicron does the work here. Because multicast groups can be joined dynamically (and implicitly) at any time, this approach is preferred over the two-fold approach we use for unicast (w.r.t. Dendrite/DPD).

DDM's job is distributing multicast group membership across the underlay, not programming the data plane. I was following bits of the tunnel routing implementation and mistakenly included OPTE M2P writes (tunnel routing writes to the V2B). I've now removed the direct OPTE M2P writes from DDM (sys.rs) that were in the original push. The M2P table is global to the xde driver and Omicron's reconciler also writes to it, so two independent writers to the same global table create a conflict risk anyway.

The data flow is now (and should be):

  1. MRIB change -> mg-lower -> DDM admin API (advertise group subscription)
  2. DDM distributes to peers via exchange protocol (path vectors provide loop detection)
  3. DDM stores learned routes, exposed via get_multicast_groups() admin API
  4. Omicron reads DDM state and owns all M2P programming.
  • There's even a TODO in Omicron to plumb this work up and replace or resolve what we get from the inventory involving port information (then used/updated in DPD on the switch):
 // TODO: Use DDM as the primary source of truth for sled→port
 // mapping, with inventory as cross-validation.
 //
 // Currently we trust inventory (MGS/SP topology) for sled→port
 // mapping. DDM (maghemite/ddmd) on switches has authoritative
 // knowledge of which sleds are reachable on which ports.
 //
 // Future approach:
 //   1. Query DDM for operational sled→port mapping
 //      // TODO: Add GET /peers endpoint to ddm-admin-client (`get_peers` already here)
 //      // returning Map<peer_addr, PeerInfo> where PeerInfo
 //      // includes port/interface field  ---> I had this about ready & it's up now
 //   2. Use DDM mapping as primary source for multicast routing
 //   3. Cross-validate against inventory to detect mismatches
 //   4. On mismatch: invalidate cache, log warning, potentially
 //      trigger inventory reconciliation
 //
 // This catches cases where inventory is stale or a sled moved
 // but inventory hasn't updated yet.

DDM distributes and does peer exchange. Omicron programs the data plane. The Omicron side of consuming DDM's learned multicast state is the next piece of work. On the MRIB entry side, static routes flow in via PUT /static/mroute and DELETE /static/mroute on the mg-admin API. The future mcastd IGMP host-proxy will provide the dynamic route path handling.
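The Omicron-side consumption in step 4 is essentially a reconciliation diff: read DDM's learned state, compare against what was last programmed into M2P, and apply the delta. A hypothetical sketch, with `String` keys standing in for the real group/mapping types:

```rust
use std::collections::HashSet;

// Diff DDM's learned multicast state against what was previously
// programmed, yielding the entries to add and the entries to remove.
pub fn reconcile(
    learned: &HashSet<String>,
    programmed: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    // Learned by DDM but not yet programmed.
    let to_add: Vec<String> =
        learned.difference(programmed).cloned().collect();
    // Programmed earlier but no longer present in DDM's state.
    let to_remove: Vec<String> =
        programmed.difference(learned).cloned().collect();
    (to_add, to_remove)
}
```

With a single writer (Omicron) driving this loop, there is no risk of another component's entries being reaped as stale.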

Regarding peer expiry, the DB cleanup and withdrawal redistribution to peers now correctly handles multicast routes on peer expiry (remove_nexthop_routes returns to_remove_mcast, which feeds the withdrawal Update propagated to transit peers).

Regarding RPF, we operate at two layers. At the DDM control-plane level, path-vector loop detection serves as control-plane RPF: MulticastPathHop carries the router ID and underlay address, and announcements where our hostname already appears in the path are dropped in exchange.rs. The MRIB in rdb carries an rpf_neighbor field per route for data-plane RPF (verifying incoming multicast traffic arrived on the expected interface). DDM provides the topology information (path vectors, nexthops) that Omicron can use to compute the correct RPF neighbor, which matters more for multi-rack deployments than for a single rack.
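The control-plane loop-detection check described above is small; as a sketch (types are illustrative, not the real DDM structs):

```rust
// A hypothetical hop in a multicast announcement's path vector.
#[derive(Debug, Clone)]
pub struct MulticastPathHop {
    pub hostname: String,
}

// Accept an announcement only if our own hostname is absent from its
// path: seeing ourselves means we already propagated it once, and
// accepting it again would form a forwarding loop.
pub fn accept_announcement(
    our_hostname: &str,
    path: &[MulticastPathHop],
) -> bool {
    !path.iter().any(|hop| hop.hostname == our_hostname)
}
```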

@zeeshanlakhani zeeshanlakhani requested a review from taspelund April 8, 2026 13:58