Skip to content

feat: replaces PersistentConnector monkey-patch with proper nixl_conn…#6913

Open
dsocek wants to merge 1 commit intoai-dynamo:mainfrom
dsocek:nixl-singleton-and-refcount-fix
Open

feat: replaces PersistentConnector monkey-patch with proper nixl_conn…#6913
dsocek wants to merge 1 commit intoai-dynamo:mainfrom
dsocek:nixl-singleton-and-refcount-fix

Conversation

@dsocek
Copy link
Contributor

@dsocek dsocek commented Mar 5, 2026

Overview:

Under concurrent multimodal workloads, nixl_connect has two issues:

  1. Every operation creates a new NIXL agent instead of reusing one (under load this spawns 50+ agents, exhausting resources).
  2. When any operation finishes, it tears down the remote agent for everyone (breaking other in-flight transfers with NIXL_ERR_NOT_FOUND).

There is an existing workaround in embedding_transfer.py that monkey-patches the cleanup to a no-op and subclasses the connector. This prevents the crashes but leaks remote agents and affects all nixl_connect users globally.

This PR fixes both issues properly in nixl_connect itself and removes the workaround.

Details:

nixl_connect/__init__.py:

  • Reuse a single shared connection instead of creating a new one every time
  • Track how many operations are using each remote agent — only clean it up when the last one finishes

embedding_transfer.py:

  • Remove PersistentConnector subclass
  • Remove Remote._release monkey-patch
  • Use standard nixl_connect.Connector() directly

Where should the reviewer start?

  • nixl_connect/__init__.py: _create_connection() and the new acquire_remote_ref()/release_remote_ref() methods
  • embedding_transfer.py: removal of PersistentConnector and monkey-patch

Related Issues:

Summary by CodeRabbit

  • Refactor
    • Standardized connector implementation to use the shared Connector API, improving reliability and consistency in connection management.
    • Enhanced remote agent lifecycle with reference counting and connection pooling for optimized resource utilization.

@dsocek dsocek requested review from a team as code owners March 5, 2026 00:14
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 5, 2026

👋 Hi dsocek! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Mar 5, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 5, 2026

Walkthrough

This change refactors NIXL agent lifecycle management by introducing native remote reference counting and shared connection pooling in nixl_connect.Connector, replacing a workaround that monkey-patched release behavior. The embedding_transfer.py module is updated to use the standard Connector API instead of custom PersistentConnector subclassing.

Changes

Cohort / File(s) Summary
Core Reference Counting & Lifecycle Management
lib/bindings/python/src/dynamo/nixl_connect/__init__.py
Introduces per-connection remote reference tracking with acquire/release methods. Implements shared singleton Connection per Connector with lazy initialization. Extends lifecycle coordination for Remote and Descriptor cleanup with idempotent release semantics and logging.
NIXL Connector Integration
components/src/dynamo/common/multimodal/embedding_transfer.py
Removes custom PersistentConnector subclass and monkey-patched release behavior. NixlReadEmbeddingSender simplified to use plain Connector. NixlReadEmbeddingReceiver signature expanded with embedding parameters and now performs async descriptor initialization.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops through lifecycle zones,
Where agents once leaked and roamed,
Now reference counts keep watch and care,
With shared connections everywhere!
No more patches, just clean API flow,
Off we go, off we go! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: replacing PersistentConnector monkey-patch with proper nixl_connect implementation.
Description check ✅ Passed The description comprehensively covers the overview, details, where to start, and related issues, matching the required template structure.
Linked Issues check ✅ Passed All objectives from issue #6912 are met: connection reuse via shared connection, reference-counted lifecycle management, and removal of the downstream workaround.
Out of Scope Changes check ✅ Passed All changes are directly focused on the stated objectives: fixing nixl_connect concurrency issues and removing the workaround from embedding_transfer.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/bindings/python/src/dynamo/nixl_connect/__init__.py`:
- Around line 609-616: The current release_remote_ref method treats missing
names as a "last release" because it uses self._remote_refs.get(name, 0) <= 1;
change the logic in release_remote_ref to first check presence (if name not in
self._remote_refs: return False) so unknown names are ignored, then read the
count, if count <= 1 pop and return True, otherwise decrement
self._remote_refs[name] and return False; reference: release_remote_ref,
self._remote_refs, and self._remote_refs_lock (ensure the presence check and
subsequent pop/decrement happen while holding the lock).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b9551e9-ced9-4ba5-9166-35ec3fc75cfd

📥 Commits

Reviewing files that changed from the base of the PR and between 96237ba and 02965ca.

📒 Files selected for processing (2)
  • components/src/dynamo/common/multimodal/embedding_transfer.py
  • lib/bindings/python/src/dynamo/nixl_connect/__init__.py

@rmccorm4 rmccorm4 requested a review from GuanLuo March 5, 2026 00:24
@rmccorm4
Copy link
Contributor

rmccorm4 commented Mar 5, 2026

Hi @dsocek can you help fix the merge conflict and update to latest main on your branch?

@GuanLuo to review

…ect implementation

Signed-off-by: Daniel Socek <daniel.socek@intel.com>
@dsocek dsocek force-pushed the nixl-singleton-and-refcount-fix branch from 02965ca to 4d381e6 Compare March 5, 2026 20:58
@rmccorm4
Copy link
Contributor

rmccorm4 commented Mar 6, 2026

/ok to test 4d381e6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor feat size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Proper NIXL agent lifecycle management in nixl_connect

3 participants