feat: add mlx5 DC (Dynamically Connected) transport#103
feat: add mlx5 DC (Dynamically Connected) transport#103harshavardhana wants to merge 6 commits intoRDMA-Rust:mainfrom
Conversation
Feature-gated behind `--features mlx5`. Requires ConnectX-4+ and system-installed mlx5 provider (MLNX_OFED or rdma-core). New module `sideway::mlx5`: - Mlx5Context: query DC capabilities (max_dc_rd_atom, etc.) - DcQpBuilder: create DCI/DCT QPs via mlx5dv_create_qp() - DcInitiator: client-side QP with per-WR addressing (wr_set_dc_addr) - DcTarget: server-side QP (DCTN) accepting from any DCI DC eliminates N² QP state in large clusters. N nodes need 1 DCT + W DCIs per node instead of N×C RC QPs. Depends on rdma-mummy-sys with mlx5 feature for mlx5dv bindings.
- DcInitiator/DcTarget hold CQ references to ensure correct drop order (QP destroyed via ibv_destroy_qp before CQs freed) - DCI: IBV_QP_INIT_ATTR_SEND_OPS_FLAGS in ibv comp_mask for extended send ops - DCI: max_recv_wr=0 (initiator doesn't receive) - DCT: requires SRQ (added srq() builder method) - ProtectionDomain: create_srq() and destroy_srq() helpers - Verified on ConnectX-6: DCI qp_num=7771, DCT dctn=0
…nvestigation needed)
There was a problem hiding this comment.
Code Review
This pull request introduces mlx5 vendor-specific extensions, specifically Dynamically Connected (DC) transport support. It adds Mlx5Context for querying device capabilities and DcInitiator/DcTarget for DC operations. Additionally, it introduces Shared Receive Queue (SRQ) support in ProtectionDomain. Feedback focuses on correcting error handling for RDMA verbs functions that return error codes directly, improving SRQ management using RAII patterns, and addressing a path dependency in Cargo.toml that could break external builds.
|
|
||
| [dependencies] | ||
| rdma-mummy-sys = "0.2.3" | ||
| rdma-mummy-sys = { path = "../rdma-mummy-sys" } |
There was a problem hiding this comment.
Using a path dependency for rdma-mummy-sys will break the build for any external users or CI environments that do not have the repository checked out at that specific relative path. Since this feature depends on a specific PR in the sys crate, consider using a git dependency until the changes are published to a registry.
| rdma-mummy-sys = { path = "../rdma-mummy-sys" } | |
| rdma-mummy-sys = { git = "https://github.com/RDMA-Rust/rdma-mummy-sys", features = ["mlx5"] } |
src/mlx5/dc.rs
Outdated
| let ret = unsafe { | ||
| rdma_mummy_sys::ibv_modify_qp(self.qp.as_ptr(), &mut qp_attr, attr.attr_mask_raw()) | ||
| }; | ||
| if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) } |
There was a problem hiding this comment.
ibv_modify_qp returns the error code directly as a positive integer. It does not set the global errno. Therefore, std::io::Error::last_os_error() is incorrect here and should be replaced with from_raw_os_error(ret).
| if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) } | |
| if ret == 0 { Ok(()) } else { Err(std::io::Error::from_raw_os_error(ret)) } |
src/mlx5/dc.rs
Outdated
| let ret = unsafe { | ||
| rdma_mummy_sys::ibv_modify_qp(self.qp.as_ptr(), &mut qp_attr, attr.attr_mask_raw()) | ||
| }; | ||
| if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) } |
There was a problem hiding this comment.
src/ibverbs/protection_domain.rs
Outdated
| pub fn create_srq( | ||
| &self, | ||
| max_wr: u32, | ||
| max_sge: u32, | ||
| ) -> Result<NonNull<rdma_mummy_sys::ibv_srq>, std::io::Error> { | ||
| let mut attr: rdma_mummy_sys::ibv_srq_init_attr = unsafe { std::mem::zeroed() }; | ||
| attr.attr.max_wr = max_wr; | ||
| attr.attr.max_sge = max_sge; | ||
| let srq = unsafe { rdma_mummy_sys::ibv_create_srq(self.pd.as_ptr(), &mut attr) }; | ||
| NonNull::new(srq).ok_or_else(std::io::Error::last_os_error) | ||
| } | ||
| } | ||
|
|
||
| /// Destroy an SRQ. | ||
| /// | ||
| /// # Safety | ||
| /// The SRQ must not be in use by any QP. | ||
| pub unsafe fn destroy_srq(srq: NonNull<rdma_mummy_sys::ibv_srq>) { | ||
| rdma_mummy_sys::ibv_destroy_srq(srq.as_ptr()); | ||
| } |
There was a problem hiding this comment.
The current implementation of SRQ management lacks RAII safety and is inconsistent with the rest of the library (e.g., QueuePair). Returning a raw NonNull pointer forces the user to manually call destroy_srq, which is error-prone.
I suggest wrapping the SRQ in a struct that implements Drop and holds an Arc<ProtectionDomain> to ensure correct destruction order.
pub fn create_srq(
self: &Arc<Self>,
max_wr: u32,
max_sge: u32,
) -> Result<SharedReceiveQueue, std::io::Error> {
let mut attr: rdma_mummy_sys::ibv_srq_init_attr = unsafe { std::mem::zeroed() };
attr.attr.max_wr = max_wr;
attr.attr.max_sge = max_sge;
let srq = unsafe { rdma_mummy_sys::ibv_create_srq(self.pd.as_ptr(), &mut attr) };
let srq = NonNull::new(srq).ok_or_else(std::io::Error::last_os_error)?;
Ok(SharedReceiveQueue {
srq,
_pd: Arc::clone(self),
})
}
}
/// A wrapper for a Shared Receive Queue (SRQ).
pub struct SharedReceiveQueue {
pub(crate) srq: NonNull<rdma_mummy_sys::ibv_srq>,
_pd: Arc<ProtectionDomain>,
}
impl Drop for SharedReceiveQueue {
fn drop(&mut self) {
unsafe { rdma_mummy_sys::ibv_destroy_srq(self.srq.as_ptr()) };
}
}| mlx5dv::mlx5dv_query_device(self.dev_ctx.context.as_ptr().cast(), &mut attrs) | ||
| }; | ||
| if ret != 0 { | ||
| return Err(std::io::Error::last_os_error()); |
There was a problem hiding this comment.
In the verbs API, functions that return an int (like mlx5dv_query_device) typically return the error code directly (the value of errno) rather than setting the global errno and returning -1. Using last_os_error() may return an unrelated error from a previous call.
| return Err(std::io::Error::last_os_error()); | |
| return Err(std::io::Error::from_raw_os_error(ret)); |
|
Thank you for your contribution and for sharing the real-world test results — we really appreciate both. We would prefer to split the scope of this PR into several smaller steps and merge them into the main branch incrementally. For example, it may make sense to first land Throughout this process, please also make sure the implementation matches our project conventions. In particular, this includes our naming style and our RAII design principles. For naming, we generally prefer fully spelled-out names rather than abbreviations — for example, |
dfd0524 to
7858bfe
Compare
DCT qp_num is 0 at creation time — the kernel assigns it during the INIT→RTR state transition. Read dctn() live from the QP struct instead of caching at build time. Also: remove dctn field from DcTarget, add SRQ post_recv support. All 4 DC tests pass on ConnectX-6: - test_dc_device_query: max_dc_rd_atom=64 - test_dc_create_dci: qp_num assigned - test_dc_create_dct: dctn=0 at create, assigned after RTR - test_dc_rdma_write_loopback: 4096 bytes verified
7858bfe to
aef191d
Compare
- context.rs: remove unused dev_ctx() getter - dc.rs: use max_recv_sge from builder instead of hardcoded 1
46b67bb to
58541e0
Compare
Summary
Add DC (Dynamically Connected) transport support behind
--features mlx5. Addresses #102.What's added
src/mlx5/module:Mlx5Context— query DC device capabilitiesDcQpBuilder— create DCI/DCT QPs viamlx5dv_create_qp()DcInitiator— client-side QP withwr_set_dc_addr()for per-WR target addressingDcTarget— server-side QP (exposes DCTN for remote DCIs, assigned after RTR transition)src/ibverbs/additions:QueuePairAttribute::as_raw_ptr()/as_raw_ptr_const()/attr_mask_raw()for direct ibv_modify_qp callsProtectionDomain::create_srq()anddestroy_srq()for DC Target SRQ managementFeature gate
--features mlx5— requiresrdma-mummy-syswith mlx5 feature for mlx5dv bindings.No impact on existing RC/UC/UD transports.
Testing
All 4 tests pass on ConnectX-6 (vendor_part_id 4129, FW 28.44.1036):
Production validated: 90 GB/s GET throughput over DC transport (2-node, 48x NVMe, dual-rail 400GbE).
Dependencies
Requires rdma-mummy-sys with mlx5 feature (PR: RDMA-Rust/rdma-mummy-sys#20).