Skip to content

feat: add mlx5 DC (Dynamically Connected) transport#103

Open
harshavardhana wants to merge 6 commits intoRDMA-Rust:mainfrom
harshavardhana:feat/mlx5-dc-transport
Open

feat: add mlx5 DC (Dynamically Connected) transport#103
harshavardhana wants to merge 6 commits intoRDMA-Rust:mainfrom
harshavardhana:feat/mlx5-dc-transport

Conversation

@harshavardhana
Copy link
Copy Markdown

Summary

Add DC (Dynamically Connected) transport support behind --features mlx5. Addresses #102.

What's added

src/mlx5/ module:

  • Mlx5Context — query DC device capabilities
  • DcQpBuilder — create DCI/DCT QPs via mlx5dv_create_qp()
  • DcInitiator — client-side QP with wr_set_dc_addr() for per-WR target addressing
  • DcTarget — server-side QP (exposes DCTN for remote DCIs, assigned after RTR transition)
  • Proper CQ lifecycle: DcInitiator/DcTarget hold CQ references ensuring correct drop order

src/ibverbs/ additions:

  • QueuePairAttribute::as_raw_ptr() / as_raw_ptr_const() / attr_mask_raw() for direct ibv_modify_qp calls
  • ProtectionDomain::create_srq() and destroy_srq() for DC Target SRQ management

Feature gate

--features mlx5 — requires rdma-mummy-sys with mlx5 feature for mlx5dv bindings.
No impact on existing RC/UC/UD transports.

Testing

All 4 tests pass on ConnectX-6 (vendor_part_id 4129, FW 28.44.1036):

  • DC device query: max_dc_rd_atom=64
  • DCI creation + state transitions (INIT→RTR→RTS)
  • DCT creation + state transitions (INIT→RTR, DCTN assigned at RTR)
  • DC RDMA WRITE loopback: 4096 bytes verified

Production validated: 90 GB/s GET throughput over DC transport (2-node, 48x NVMe, dual-rail 400GbE).

Dependencies

Requires rdma-mummy-sys with mlx5 feature (PR: RDMA-Rust/rdma-mummy-sys#20).

Feature-gated behind `--features mlx5`. Requires ConnectX-4+ and
system-installed mlx5 provider (MLNX_OFED or rdma-core).

New module `sideway::mlx5`:
- Mlx5Context: query DC capabilities (max_dc_rd_atom, etc.)
- DcQpBuilder: create DCI/DCT QPs via mlx5dv_create_qp()
- DcInitiator: client-side QP with per-WR addressing (wr_set_dc_addr)
- DcTarget: server-side QP (DCTN) accepting from any DCI

DC eliminates N² QP state in large clusters. N nodes need
1 DCT + W DCIs per node instead of N×C RC QPs.

Depends on rdma-mummy-sys with mlx5 feature for mlx5dv bindings.
- DcInitiator/DcTarget hold CQ references to ensure correct drop order
  (QP destroyed via ibv_destroy_qp before CQs freed)
- DCI: IBV_QP_INIT_ATTR_SEND_OPS_FLAGS in ibv comp_mask for extended send ops
- DCI: max_recv_wr=0 (initiator doesn't receive)
- DCT: requires SRQ (added srq() builder method)
- ProtectionDomain: create_srq() and destroy_srq() helpers
- Verified on ConnectX-6: DCI qp_num=7771, DCT dctn=0
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces mlx5 vendor-specific extensions, specifically Dynamically Connected (DC) transport support. It adds Mlx5Context for querying device capabilities and DcInitiator/DcTarget for DC operations. Additionally, it introduces Shared Receive Queue (SRQ) support in ProtectionDomain. Feedback focuses on correcting error handling for RDMA verbs functions that return error codes directly, improving SRQ management using RAII patterns, and addressing a path dependency in Cargo.toml that could break external builds.


[dependencies]
rdma-mummy-sys = "0.2.3"
rdma-mummy-sys = { path = "../rdma-mummy-sys" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a path dependency for rdma-mummy-sys will break the build for any external users or CI environments that do not have the repository checked out at that specific relative path. Since this feature depends on a specific PR in the sys crate, consider using a git dependency until the changes are published to a registry.

Suggested change
rdma-mummy-sys = { path = "../rdma-mummy-sys" }
rdma-mummy-sys = { git = "https://github.com/RDMA-Rust/rdma-mummy-sys", features = ["mlx5"] }

src/mlx5/dc.rs Outdated
let ret = unsafe {
rdma_mummy_sys::ibv_modify_qp(self.qp.as_ptr(), &mut qp_attr, attr.attr_mask_raw())
};
if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

ibv_modify_qp returns the error code directly as a positive integer. It does not set the global errno. Therefore, std::io::Error::last_os_error() is incorrect here and should be replaced with from_raw_os_error(ret).

Suggested change
if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
if ret == 0 { Ok(()) } else { Err(std::io::Error::from_raw_os_error(ret)) }

src/mlx5/dc.rs Outdated
let ret = unsafe {
rdma_mummy_sys::ibv_modify_qp(self.qp.as_ptr(), &mut qp_attr, attr.attr_mask_raw())
};
if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the DcInitiator implementation, ibv_modify_qp returns the error code directly. last_os_error() should not be used.

Suggested change
if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
if ret == 0 { Ok(()) } else { Err(std::io::Error::from_raw_os_error(ret)) }

Comment on lines 74 to 93
pub fn create_srq(
&self,
max_wr: u32,
max_sge: u32,
) -> Result<NonNull<rdma_mummy_sys::ibv_srq>, std::io::Error> {
let mut attr: rdma_mummy_sys::ibv_srq_init_attr = unsafe { std::mem::zeroed() };
attr.attr.max_wr = max_wr;
attr.attr.max_sge = max_sge;
let srq = unsafe { rdma_mummy_sys::ibv_create_srq(self.pd.as_ptr(), &mut attr) };
NonNull::new(srq).ok_or_else(std::io::Error::last_os_error)
}
}

/// Destroy an SRQ.
///
/// # Safety
/// The SRQ must not be in use by any QP.
pub unsafe fn destroy_srq(srq: NonNull<rdma_mummy_sys::ibv_srq>) {
rdma_mummy_sys::ibv_destroy_srq(srq.as_ptr());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of SRQ management lacks RAII safety and is inconsistent with the rest of the library (e.g., QueuePair). Returning a raw NonNull pointer forces the user to manually call destroy_srq, which is error-prone.

I suggest wrapping the SRQ in a struct that implements Drop and holds an Arc<ProtectionDomain> to ensure correct destruction order.

    pub fn create_srq(
        self: &Arc<Self>,
        max_wr: u32,
        max_sge: u32,
    ) -> Result<SharedReceiveQueue, std::io::Error> {
        let mut attr: rdma_mummy_sys::ibv_srq_init_attr = unsafe { std::mem::zeroed() };
        attr.attr.max_wr = max_wr;
        attr.attr.max_sge = max_sge;
        let srq = unsafe { rdma_mummy_sys::ibv_create_srq(self.pd.as_ptr(), &mut attr) };
        let srq = NonNull::new(srq).ok_or_else(std::io::Error::last_os_error)?;
        Ok(SharedReceiveQueue {
            srq,
            _pd: Arc::clone(self),
        })
    }
}

/// A wrapper for a Shared Receive Queue (SRQ).
pub struct SharedReceiveQueue {
    pub(crate) srq: NonNull<rdma_mummy_sys::ibv_srq>,
    _pd: Arc<ProtectionDomain>,
}

impl Drop for SharedReceiveQueue {
    fn drop(&mut self) {
        unsafe { rdma_mummy_sys::ibv_destroy_srq(self.srq.as_ptr()) };
    }
}

mlx5dv::mlx5dv_query_device(self.dev_ctx.context.as_ptr().cast(), &mut attrs)
};
if ret != 0 {
return Err(std::io::Error::last_os_error());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the verbs API, functions that return an int (like mlx5dv_query_device) typically return the error code directly (the value of errno) rather than setting the global errno and returning -1. Using last_os_error() may return an unrelated error from a previous call.

Suggested change
return Err(std::io::Error::last_os_error());
return Err(std::io::Error::from_raw_os_error(ret));

@dragonJACson
Copy link
Copy Markdown
Contributor

Thank you for your contribution and for sharing the real-world test results — we really appreciate both.

We would prefer to split the scope of this PR into several smaller steps and merge them into the main branch incrementally. For example, it may make sense to first land SRQ support so that other standard ibverbs QP types can also make use of it, then build up a foundational mlx5dv framework focused on the control plane only — for example, pieces such as query_ctx — and only after that introduce the data-plane interfaces.

Throughout this process, please also make sure the implementation matches our project conventions. In particular, this includes our naming style and our RAII design principles. For naming, we generally prefer fully spelled-out names rather than abbreviations — for example, SharedReceivedQueue instead of an acronym — unless the name needs to correspond directly to an existing verbs API function or field. cc @FujiZ

@harshavardhana harshavardhana force-pushed the feat/mlx5-dc-transport branch from dfd0524 to 7858bfe Compare April 1, 2026 07:33
DCT qp_num is 0 at creation time — the kernel assigns it during
the INIT→RTR state transition. Read dctn() live from the QP struct
instead of caching at build time.

Also: remove dctn field from DcTarget, add SRQ post_recv support.

All 4 DC tests pass on ConnectX-6:
- test_dc_device_query: max_dc_rd_atom=64
- test_dc_create_dci: qp_num assigned
- test_dc_create_dct: dctn=0 at create, assigned after RTR
- test_dc_rdma_write_loopback: 4096 bytes verified
@harshavardhana harshavardhana force-pushed the feat/mlx5-dc-transport branch from 7858bfe to aef191d Compare April 1, 2026 07:35
- context.rs: remove unused dev_ctx() getter
- dc.rs: use max_recv_sge from builder instead of hardcoded 1
@harshavardhana harshavardhana force-pushed the feat/mlx5-dc-transport branch from 46b67bb to 58541e0 Compare April 2, 2026 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants