
feat: Qwen3-VL-30B recipe for agg/disagg and encoder cache with vLLM patch #6919

Open
esoba wants to merge 11 commits into ai-dynamo:main from esoba:esoba/qwen3-vl-recipe

Conversation


@esoba esoba commented Mar 5, 2026

Overview:

Recipe for enabling the encoder cache and configuring aggregated, E/PD disaggregated, and EP/D disaggregated deployments for Qwen3-VL-30B-A3B. The recipe includes patches to enable the aggregated encoder cache, dataset generation at varying image re-use levels, and performance comparisons with the encoder cache on and off.

Details:

vLLM image patch — patch_vllm_agg_encoder_cache.sh applies the vLLM patch needed to enable the ECConnector type ECBoth for the aggregated encoder cache, per the documentation here.
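For the second PR, the script filters the fetched diff down to changes under vllm/ before applying it. That filtering step can be reproduced as a standalone sketch (the sample diff text below is illustrative):

```python
# Keep only the hunks of a unified diff that touch files under vllm/,
# mirroring the inline python3 -c filter used by the patch script.
def filter_diff_to_vllm(diff_text: str) -> str:
    chunks = diff_text.split("diff --git ")
    kept = [c for c in chunks if c.startswith("a/vllm/")]
    return "".join("diff --git " + c for c in kept)

if __name__ == "__main__":
    sample = (
        "diff --git a/vllm/core/cache.py b/vllm/core/cache.py\n+x\n"
        "diff --git a/docs/readme.md b/docs/readme.md\n+y\n"
    )
    filtered = filter_diff_to_vllm(sample)
    print("docs" in filtered)  # → False (the docs hunk was dropped)
```

This keeps the image patch scoped to library code and avoids conflicts from documentation or test changes in the upstream PR.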

Dataset generation — generate-datasets-job.yaml creates 5 datasets at varying levels of image overlap using the multimodal request generator here. The hardcoded dataset variants represent how much of the dataset you need to traverse before duplicates start appearing (represented in the code as the ratio of total slots to image slots).
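The image re-use knob can be sketched as follows: each request carries a fixed number of image slots, and images are drawn round-robin from a pool sized as a fraction of the total slots, so duplicates begin once the pool has been traversed. Everything here (function name, image file names, the exact ratio semantics) is an illustrative assumption rather than the generator's actual implementation; the 5000-request, 3-image shape matches the dataset filenames used in this recipe.

```python
# Hypothetical sketch of the image-reuse knob: images are assigned round-robin
# from a pool whose size is a fraction of the total image slots, so the first
# duplicate appears exactly one pool-traversal into the dataset.
from itertools import cycle, islice

def assign_images(num_requests: int, images_per_req: int, reuse_ratio: float):
    total_slots = num_requests * images_per_req
    pool_size = max(1, int(total_slots * reuse_ratio))
    pool = [f"img_{i:05d}.jpg" for i in range(pool_size)]  # hypothetical names
    slots = list(islice(cycle(pool), total_slots))
    # Group the flat slot list back into per-request image lists.
    return [slots[i * images_per_req:(i + 1) * images_per_req]
            for i in range(num_requests)]

reqs = assign_images(num_requests=5000, images_per_req=3, reuse_ratio=0.2)
# With reuse_ratio=0.2, the first 20% of image slots are unique; every slot
# after that repeats an earlier image.
```

A lower ratio means the encoder cache starts getting hits sooner, which is exactly the axis the five dataset variants sweep.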

Cache on/off deployments — deploy-cache-{on/off}.yaml, per-configuration deployments demonstrating the --multimodal-embedding-cache-capacity-gb functionality. NIXL write is set as the default for encoder disagg transfer.

Analysis — analysis.yaml sweeps through the results for the generated datasets. Results are kept in the PVC and printed so they can be viewed via kubectl logs.

Closes DIS-1479

Summary by CodeRabbit

  • Documentation
    • Added comprehensive documentation and configuration for Qwen3-VL-30B encoder cache experiment workflow
    • Included deployment configurations for aggregated and disaggregated architectures
    • Provides benchmarking setup for measuring encoder cache performance impact

esoba added 5 commits March 4, 2026 18:06
Signed-off-by: Elijah Soba <esoba@nvidia.com>
@esoba esoba requested review from a team as code owners March 5, 2026 02:39

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions bot commented Mar 5, 2026

@github-actions github-actions bot added documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor labels Mar 5, 2026

coderabbitai bot commented Mar 5, 2026

Walkthrough

This PR introduces a comprehensive Qwen3-VL-30B encoder cache experiment framework for disaggregated deployments. It includes documentation, infrastructure setup (persistent volume claims, model downloading, dataset generation), vLLM patches for encoder cache support, deployment manifests across three configurations (aggregated, disagg E/PD, disagg EP/D), and performance benchmarking workflows with analysis pipelines to compare encoder cache-off vs. cache-on scenarios.

Changes

Cohort / File(s) / Summary

  • Documentation
    Files: recipes/qwen3-vl-30b/README.md
    New comprehensive guide documenting the Qwen3-VL-30B encoder cache experiment workflow, including prerequisites, quick-start instructions, deployment modes, and cache analysis procedures.

  • Infrastructure Setup
    Files: recipes/qwen3-vl-30b/model-cache/model-cache.yaml, recipes/qwen3-vl-30b/model-cache/model-download.yaml, recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml
    Kubernetes resources for persistent storage (model, compilation, and perf caches), model downloading from HuggingFace, and dataset generation with 5000 requests across five image-pool distributions.

  • Patch Utilities
    Files: recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh
    Bash script that patches vLLM Docker images with aggregated encoder cache support via two vLLM PRs (34182, 34783), with automated image building and validation.

  • Aggregated Deployment
    Files: recipes/qwen3-vl-30b/vllm/agg/deploy-cache-off.yaml, recipes/qwen3-vl-30b/vllm/agg/deploy-cache-on.yaml, recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml, recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml, recipes/qwen3-vl-30b/vllm/agg/analysis.yaml
    Deployment and benchmarking manifests for the aggregated topology with the encoder cache enabled/disabled, including aiperf performance measurement and CSV-based analysis comparing p90 latencies.

  • Disagg E/PD Deployment
    Files: recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy-cache-off.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy-cache-on.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml, recipes/qwen3-vl-30b/vllm/disagg-e-pd/analysis.yaml
    Multi-component deployment (frontend, encode workers, PD workers) with disaggregated inference, cache variants, and corresponding benchmarking and analysis workflows.

  • Disagg EP/D Deployment
    Files: recipes/qwen3-vl-30b/vllm/disagg-ep-d/deploy-cache-off.yaml, recipes/qwen3-vl-30b/vllm/disagg-ep-d/deploy-cache-on.yaml, recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml, recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml, recipes/qwen3-vl-30b/vllm/disagg-ep-d/analysis.yaml
    A similar EP/D disaggregation topology (encode-prefill and decode workers) with cache variants, performance benchmarking with aiperf, and result aggregation into summary CSVs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

🐰 Whiskers twitching with delight,
Cache experiments shining bright,
Three topologies, benchmarks run,
Encoder speedups—tests are fun!
Qwen3-VL hops toward the night. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The PR title clearly summarizes the main change: adding a recipe for Qwen3-VL-30B with encoder cache support and aggregated/disaggregated deployment configurations using vLLM patches.
  • Description check ✅ Passed: The PR description covers all required template sections with clear, specific information about changes, deployment configurations, and linked issue.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (24)
recipes/qwen3-vl-30b/model-cache/model-download.yaml-16-23 (1)

16-23: ⚠️ Potential issue | 🟠 Major

Add container securityContext to avoid default root/privilege escalation behavior.

Please explicitly set non-root execution and disable privilege escalation for this job container.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/model-cache/model-download.yaml` around lines 16 - 23,
Add a pod/container securityContext for the job container named
"model-download": set securityContext.runAsUser and runAsGroup to a non-root
UID/GID (e.g., 1000), set securityContext.runAsNonRoot=true, and under the
container spec set securityContext.allowPrivilegeEscalation=false (and
optionally readOnlyRootFilesystem=true and capabilities.drop: ["ALL"]) to
prevent root/privilege escalation; update the Pod spec where the container
"model-download" is defined to include these securityContext fields.
recipes/qwen3-vl-30b/vllm/agg/analysis.yaml-17-20 (1)

17-20: ⚠️ Potential issue | 🟠 Major

Add non-root and no-priv-escalation controls for this Job.

The container runs with default privileges. Add explicit pod/container securityContext to align with baseline hardening expectations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/analysis.yaml` around lines 17 - 20, The Job's
Pod spec (the block containing restartPolicy and the containers list,
specifically the container named "analysis" using image "python:3.11") needs
explicit non-root and no-privilege-escalation controls: add a top-level pod
securityContext with runAsNonRoot: true and a numeric runAsUser (e.g., 1000) and
add a container-level securityContext on the "analysis" container that sets
allowPrivilegeEscalation: false, drops all capabilities (capabilities: drop:
["ALL"]), and enables readOnlyRootFilesystem: true (and optionally
runAsNonRoot/restrictive runAsUser there as well) so the container cannot run as
root or escalate privileges.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/analysis.yaml-17-20 (1)

17-20: ⚠️ Potential issue | 🟠 Major

Harden the analysis container security context.

This Job currently allows default root/privilege settings. Please set pod/container securityContext (at minimum runAsNonRoot: true and allowPrivilegeEscalation: false) to reduce runtime risk.

Suggested hardening diff
 spec:
   backoffLimit: 1
   completions: 1
   parallelism: 1
   template:
     metadata:
       labels:
         app: benchmark-analysis
         topology: disagg-e-pd
     spec:
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile:
+          type: RuntimeDefault
       restartPolicy: Never
       containers:
         - name: analysis
           image: python:3.11
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/analysis.yaml` around lines 17 - 20,
Add a Pod and container securityContext to harden the Job that runs the
"analysis" container: set pod.spec.securityContext.runAsNonRoot: true and
pod.spec.securityContext.runAsUser to a non-root UID (e.g., 1000), and add
container.securityContext for the "analysis" container with runAsNonRoot: true,
runAsUser matching the pod UID and allowPrivilegeEscalation: false (also
consider readOnlyRootFilesystem: true and capabilities.drop: ["ALL"]). Update
the manifest sections that define the Pod/Job spec and the "analysis" container
to include these fields so the container cannot run as root or escalate
privileges.
recipes/qwen3-vl-30b/README.md-89-136 (1)

89-136: ⚠️ Potential issue | 🟠 Major

Add explicit teardown between cache-off and cache-on deploy runs.

The commands currently allow both deployments to coexist (different resource names), which can consume the same GPU pool and distort benchmark comparisons.

Suggested command patch pattern
 # Cache OFF
 kubectl apply -f vllm/disagg-ep-d/deploy-cache-off.yaml -n ${NAMESPACE}
 kubectl apply -f vllm/disagg-ep-d/perf-cache-off.yaml -n ${NAMESPACE}
+# wait for completion, then tear down before cache-on
+kubectl delete -f vllm/disagg-ep-d/deploy-cache-off.yaml -n ${NAMESPACE}

 # Cache ON
 kubectl apply -f vllm/disagg-ep-d/deploy-cache-on.yaml -n ${NAMESPACE}
 kubectl apply -f vllm/disagg-ep-d/perf-cache-on.yaml -n ${NAMESPACE}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/README.md` around lines 89 - 136, The README currently
applies cache-off and then cache-on manifests without tearing down the prior
deployment, which lets both coexist and skew GPU usage; update the instructions
for each option (agg, disagg-ep-d, disagg-e-pd) to explicitly delete the
previous deployment resources before enabling cache-on by adding a teardown step
that deletes the cache-off manifests/pods (e.g., remove resources created by
deploy-cache-off.yaml and the benchmark pod names like
qwen3-vl-30b-agg-benchmark-cache-off) or run kubectl delete -f
deploy-cache-off.yaml (and associated perf YAMLs) and wait for pods to terminate
before applying deploy-cache-on.yaml and perf-cache-on.yaml so the cache-off and
cache-on runs never overlap.
recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh-64-71 (1)

64-71: ⚠️ Potential issue | 🟠 Major

Fetch patches using immutable commit-scoped URLs and harden curl error handling.

GitHub PR diff URLs (e.g., pull/34182.diff) are mutable—they reflect the PR's current head and change if commits are pushed or rebased. Additionally, curl -sL silently stores HTTP error responses (404/500) without failing, which can cause opaque failures downstream during patching.

Replace PR-based URLs with immutable commit-scoped diffs and add curl error handling:

Suggested hardening
 echo "Preparing patch diffs..."
-curl -sL "https://github.com/vllm-project/vllm/pull/34182.diff" > "${WORKDIR}/vllm_pr34182.diff"
-curl -sL "https://github.com/vllm-project/vllm/pull/34783.diff" | python3 -c '
+curl -fsSL --retry 5 "https://github.com/vllm-project/vllm/compare/<base-commit>..<head-commit-34182>.diff" > "${WORKDIR}/vllm_pr34182.diff"
+curl -fsSL --retry 5 "https://github.com/vllm-project/vllm/compare/<base-commit>..<head-commit-34783>.diff" | python3 -c '
 import sys
 chunks = sys.stdin.read().split("diff --git ")
 filtered = [c for c in chunks if c.startswith("a/vllm/")]
 print("".join("diff --git " + c for c in filtered), end="")
 ' > "${WORKDIR}/vllm_pr34783_vllm_only.diff"
+
+# Verify patch integrity (pin real SHA256 hashes after obtaining diffs)
+# echo "<expected-sha256>  ${WORKDIR}/vllm_pr34182.diff" | sha256sum -c -
+# echo "<expected-sha256>  ${WORKDIR}/vllm_pr34783_vllm_only.diff" | sha256sum -c -

Replace <base-commit>, <head-commit-34182>, <head-commit-34783> with actual commit SHAs from the vllm repository, and add real checksums for verification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh` around lines 64
- 71, Replace the mutable PR diff URLs used when writing vllm_pr34182.diff and
vllm_pr34783_vllm_only.diff with immutable commit-scoped URLs (use the target
repo commit SHAs for the base..head diffs) and harden the curl usage in the two
curl invocations: enable --fail (or -f), keep -sL as desired, check curl's exit
status immediately and abort on failure, and verify the downloaded files against
expected checksums before proceeding; update the two references to
"${WORKDIR}/vllm_pr34182.diff" and "${WORKDIR}/vllm_pr34783_vllm_only.diff"
accordingly and ensure any Python filtering step still receives valid input if
curl fails.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Bound the model/dataset readiness waits.

The current loops can run forever when prerequisites never arrive, causing stuck benchmark Pods and resource leakage in CI/ops workflows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml` around lines 24 - 49, Add a
bounded timeout to both readiness loops so they cannot run indefinitely:
introduce a MAX_WAIT_SECONDS or MAX_RETRIES env var (used alongside MODEL_NAME
and FRONTEND for the model loop, and DATASET_DIR/tags for the dataset loop),
track elapsed time or attempt count inside the "until curl ..." model readiness
block and inside the dataset "until [ \"${missing}\" -eq 0 ]" loop, and when the
limit is exceeded log a clear error (including MODEL_NAME or which dataset tag
is missing), exit non-zero to fail the job, and ensure directories
(ARTIFACT_BASE_DIR, DATASET_DIR) are still created before failing.
recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Add failure bounds to readiness polling.

Without timeout guards, this Pod can block forever waiting for model or dataset availability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml` around lines 24 - 49, The
loops that wait for the model and for dataset files (the curl/jq readiness loop
using MODEL_NAME and FRONTEND, and the dataset loop that iterates over tags,
missing, input_file, DATASET_DIR) need bounded retries: add a max retry counter
or timeout variable (e.g., MAX_RETRIES or MAX_WAIT_SECS) and increment/check it
inside each until loop; when exceeded, log a clear error including what timed
out (model or which dataset tag) and exit non‑zero so the Pod fails fast instead
of blocking forever. Ensure the new logic returns non‑zero on timeout and
include the relevant identifiers (MODEL_NAME, FRONTEND, DATASET_DIR, tags) in
the error messages to aid debugging.
recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml-11-23 (1)

11-23: ⚠️ Potential issue | 🟠 Major

Add explicit container security hardening.

No container securityContext is set here; default root/privilege behavior is retained, which degrades security posture.

Also applies to: 80-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml` around lines 11 -
23, The container named "benchmark" lacks a pod/container securityContext; add a
securityContext to the container spec for the "benchmark" container (and repeat
for the other similar containers referenced) that enforces non-root execution
and least privilege: set runAsNonRoot: true and runAsUser to a non-root UID
(e.g., 1000), set allowPrivilegeEscalation: false, set readOnlyRootFilesystem:
true, drop all Linux capabilities (capabilities.drop: ["ALL"]) and add a
restricted seccomp profile if available; also consider setting fsGroup for
shared volumes. Update the container spec where the image python:3.11 and
command block appear (the "benchmark" container) and mirror the same
securityContext for the other blocks noted (lines ~80-119).
recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml-11-23 (1)

11-23: ⚠️ Potential issue | 🟠 Major

Set explicit container security controls.

This benchmark Pod currently relies on default root privilege behavior. Add container hardening (allowPrivilegeEscalation: false, dropped caps, seccomp, non-root once image is pre-baked).

Also applies to: 80-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml` around lines 11 - 23, The
benchmark container block (name: benchmark, image: python:3.11) needs explicit
hardening: add a container.securityContext with allowPrivilegeEscalation: false,
capabilities.drop: ["ALL"], seccompProfile.type: RuntimeDefault, and set
runAsNonRoot: true with a non-zero runAsUser (or set
podSecurityContext.runAsNonRoot/runAsUser) so the container does not run as
root; apply the same container.securityContext/podSecurityContext changes to the
other similar container block referenced in the comment (the block covering
lines 80-119).
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml-11-23 (1)

11-23: ⚠️ Potential issue | 🟠 Major

Lock down container privileges explicitly.

This Pod does not define container securityContext; it inherits root/default privilege behavior, which is a security posture gap and can be blocked by restricted admission policies.

Also applies to: 73-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml` around lines 11 - 23, The
benchmark container (name: "benchmark") is missing a securityContext which
leaves it running with default/root privileges; update the Pod/Container spec to
add a securityContext for the benchmark container (and mirror the same changes
for other containers on lines 73-102) that sets runAsNonRoot: true and a
non-zero runAsUser, disallows privilege escalation (allowPrivilegeEscalation:
false), drops all Linux capabilities and only adds minimal required ones, and
enables readOnlyRootFilesystem: true (optionally set seccompProfile/unconfined
to RuntimeDefault) to explicitly lock down privileges.
recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Prevent indefinite waiting on dependencies.

Both readiness loops are open-ended; introduce deadlines/fail-fast behavior to avoid permanently stuck benchmark Pods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml` around lines 24 -
49, The readiness loops currently can wait forever; add deadline/fail-fast logic
for both the model readiness loop (the until that curls
http://${FRONTEND}:8000/v1/models checking MODEL_NAME) and the dataset wait loop
(the until that checks files in DATASET_DIR using tags/missing). Introduce
timeout variables (e.g., MODEL_READY_TIMEOUT and DATASET_WAIT_TIMEOUT or max
retries), capture a start timestamp before each loop, and on each iteration
compare elapsed time; if the timeout is exceeded log a clear error including the
relevant ID/paths (MODEL_NAME, FRONTEND, DATASET_DIR) and exit non-zero to fail
fast. Ensure the curl check still uses quiet/exit codes and respect the timeout
by failing when the deadline is reached.
recipes/qwen3-vl-30b/vllm/disagg-ep-d/analysis.yaml-18-21 (1)

18-21: ⚠️ Potential issue | 🟠 Major

Harden the analysis container security context.

The Job container runs with default privileges (no explicit allowPrivilegeEscalation/capability drop/non-root settings), which is a preventable security gap.

Also applies to: 112-118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/analysis.yaml` around lines 18 - 21,
The analysis container currently runs with default privileges; update the
manifest for the container named "analysis" (the entry under containers -> -
name: analysis) by adding a securityContext at both pod and container levels:
set podSecurityContext/runAsNonRoot: true with runAsUser/runAsGroup/fsGroup as
appropriate, and in the container securityContext set allowPrivilegeEscalation:
false, privileged: false, readOnlyRootFilesystem: true, capabilities.drop:
["ALL"], and seccompProfile.type: RuntimeDefault to ensure non-root,
least-privilege execution for the analysis job.
recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Add timeouts to readiness wait loops.

Both waits are unbounded; if the frontend/model or datasets never become ready, this Pod hangs indefinitely and keeps resources pinned.

⏱️ Suggested bounded wait pattern
+          MODEL_WAIT_TIMEOUT_SEC="${MODEL_WAIT_TIMEOUT_SEC:-1800}"
+          DATASET_WAIT_TIMEOUT_SEC="${DATASET_WAIT_TIMEOUT_SEC:-1800}"
+          model_wait_start=$(date +%s)
           echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
           until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+            if (( $(date +%s) - model_wait_start > MODEL_WAIT_TIMEOUT_SEC )); then
+              echo "Timed out waiting for model '${MODEL_NAME}'"
+              exit 1
+            fi
             echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
             sleep 5
           done
           echo "Model '${MODEL_NAME}' is ready."
...
+          dataset_wait_start=$(date +%s)
           until [ "${missing}" -eq 0 ]; do
             missing=0
             for tag in "${tags[@]}"; do
               input_file="${DATASET_DIR}/qwen3_vl_5000req_3img_${tag}.jsonl"
               if [ ! -f "${input_file}" ]; then
                 missing=1
                 break
               fi
             done
             if [ "${missing}" -eq 1 ]; then
+              if (( $(date +%s) - dataset_wait_start > DATASET_WAIT_TIMEOUT_SEC )); then
+                echo "Timed out waiting for datasets in ${DATASET_DIR}"
+                exit 1
+              fi
               echo "Waiting for datasets to appear in ${DATASET_DIR} ..."
               sleep 15
             fi
           done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml` around lines 24 - 49, The
two unbounded readiness loops (the model readiness curl loop that references
MODEL_NAME and FRONTEND and the dataset polling loop that iterates over tags and
checks files in DATASET_DIR) must be converted to bounded waits: add a max-wait
duration or max-retries counter (e.g., MAX_WAIT_SECS or MAX_RETRIES) and track
elapsed time or attempts inside each loop, logging a clear error via echo and
exiting non-zero if the timeout is reached; ensure the model loop still polls
http://${FRONTEND}:8000/v1/models for MODEL_NAME and the dataset loop still
checks each input_file in "${DATASET_DIR}" before deciding success, but break
out and fail cleanly when the timeout/retry limit is exceeded.
recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml-11-23 (1)

11-23: ⚠️ Potential issue | 🟠 Major

Harden container security defaults for this benchmark Pod.

The container runs without an explicit securityContext (root/default privilege behavior), which weakens isolation and may fail in restricted clusters.

🔒 Suggested hardening
   containers:
     - name: benchmark
       image: python:3.11
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
+        seccompProfile:
+          type: RuntimeDefault
+        # After pre-baking dependencies into the image, also enforce:
+        # runAsNonRoot: true
+        # runAsUser: 1000
+        # runAsGroup: 1000

Also applies to: 80-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml` around lines 11 - 23, The
benchmark container definition (name: benchmark, image: python:3.11, command
block) is missing a securityContext—add a pod/container securityContext to
harden defaults: set runAsNonRoot: true and runAsUser to a non-root UID, set
readOnlyRootFilesystem: true, set allowPrivilegeEscalation: false, and drop all
Linux capabilities (and add only required ones if any); apply the same
securityContext pattern to the other similar container definitions referenced
(lines 80-119) so all benchmark pods run with non-root, least-privilege
settings.
recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml-17-24 (1)

17-24: ⚠️ Potential issue | 🟠 Major

Avoid forcing root execution for this Job.

Setting runAsUser, runAsGroup, and fsGroup to 0 broadens blast radius and may violate restricted cluster policies. Prefer non-root execution and explicit container hardening.

🔒 Suggested safer baseline
       securityContext:
-        runAsUser: 0
-        runAsGroup: 0
-        fsGroup: 0
+        runAsNonRoot: true
+        runAsUser: 1000
+        runAsGroup: 1000
+        fsGroup: 1000
       containers:
         - name: generate-datasets
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
+            seccompProfile:
+              type: RuntimeDefault
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml` around lines 17 -
24, The manifest forces root by setting securityContext
runAsUser/runAsGroup/fsGroup to 0 for the Job's pod (securityContext block near
the generate-datasets container using image dynamo:latest-vllm-runtime); change
this to a non-root baseline by removing the zero values and instead enable
runAsNonRoot: true and set runAsUser (and runAsGroup/fsGroup if needed) to a
non-root UID/GID (e.g., 1000), or ensure the image is built to run as a non-root
user and document the required UID; apply these changes in the
pod/securityContext that contains the container named generate-datasets and
validate the container image supports the chosen UID.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Add timeout bounds to both blocking wait loops.

The model readiness loop and dataset availability loop are unbounded. A misconfigured frontend name or missing dataset files will keep the pod alive forever instead of failing fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml` around lines 24 -
49, The two unbounded waiting loops (the model readiness curl loop that checks
MODEL_NAME against FRONTEND and the dataset availability loop that iterates over
tags with DATASET_DIR and missing) must enforce timeouts and fail fast:
introduce configurable timeout variables (e.g., MODEL_WAIT_TIMEOUT and
DATASET_WAIT_TIMEOUT) and record a start timestamp before each loop, then on
each iteration check elapsed time against the timeout and if exceeded log a
clear error including MODEL_NAME/FRONTEND or DATASET_DIR and exit non‑zero;
alternatively implement a max-retries counter (e.g., MODEL_MAX_RETRIES,
DATASET_MAX_RETRIES) incremented each loop with a final error+exit when reached
and keep the existing sleep intervals. Ensure the error paths include the same
identifying context (MODEL_NAME, FRONTEND, DATASET_DIR, tags) so failures are
actionable.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Unbounded polling loops can stall benchmarks indefinitely.

Both wait loops run forever on dependency failures (frontend never ready or datasets never materialize). Add max-wait handling and exit non-zero on timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml` around lines 24 -
49, The two unbounded wait loops (the model-ready loop that calls curl against
http://${FRONTEND}:8000/v1/models and the dataset wait loop that checks until [
"${missing}" -eq 0 ]) must enforce a max-wait and fail fast: introduce a
MAX_WAIT_SECS or MAX_RETRIES variable and an attempt/elapsed-time counter,
increment it inside each loop, and when the limit is exceeded print a clear
error including MODEL_NAME/FRONTEND or DATASET_DIR/tags context and exit
non-zero (e.g., exit 1); update the retry echo messages to show
attempts/remaining time and ensure both the model readiness block and the
dataset presence block use this same timeout pattern so they won’t hang
indefinitely.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml-10-119 (1)

10-119: ⚠️ Potential issue | 🟠 Major

Pod should declare an explicit container security context.

This benchmark container currently relies on default privileges. Please harden it with explicit security settings (no privilege escalation, dropped capabilities), and move runtime package installs into the image if non-root execution is required.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml` around lines 10 -
119, The benchmark container currently runs with default privileges; update the
container spec for the container named "benchmark" to include an explicit
securityContext: set runAsNonRoot: true (and runAsUser/runAsGroup to a non-root
UID/GID such as 1000), set allowPrivilegeEscalation: false, add capabilities:
drop: ["ALL"], and preferably set readOnlyRootFilesystem: true if the workload
permits; also consider adding a pod-level securityContext where appropriate.
Remove or move the runtime apt installs and pip install steps from the in-line
command into the container image (or ensure the image supports non-root package
installation) so the container can run non-root without needing root package
installs. Ensure any mounted paths (e.g., /perf-cache, ARTIFACT_BASE_DIR,
DATASET_DIR) have appropriate permissions for the chosen non-root UID to
read/write.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy.yaml-8-12 (1)

8-12: ⚠️ Potential issue | 🟠 Major

This topology has the same shared-RWO cache bottleneck as other disagg manifests.

The two worker services share model-cache and compilation-cache while scaling to 8 GPU pods. With ReadWriteOnce/local-path claims, cross-node placement is fragile and can leave pods Pending.

Recommend enforcing worker co-location with node affinity/selector, or using RWX storage for shared cache mounts.

Also applies to: 63-75, 117-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy.yaml` around lines 8 - 12, The
manifest's pvcs section defines shared PVCs "model-cache" and
"compilation-cache" that are likely created with ReadWriteOnce/local-path,
causing a shared-RWO bottleneck when scaling GPU workers; update the deployment
to either (a) enforce worker co-location so all pods that mount "model-cache"
and "compilation-cache" are scheduled on the same node (add
nodeAffinity/nodeSelector to the worker Deployment/StatefulSet spec for the
services that mount these PVCs), or (b) switch the PVCs to use RWX-capable
storageClass (replace the PV/PVC definitions for "model-cache" and
"compilation-cache" with a storageClass that supports ReadWriteMany) so multiple
nodes can mount them concurrently; apply the same change to the other
occurrences noted (the blocks around lines 63-75 and 117-129) so all worker
services consistently use either co-location or RWX storage.
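Option (b) from the prompt above, an RWX-capable claim for the shared caches, could be sketched like this; the `nfs-client` storage class and 100Gi size are placeholders for whatever RWX-capable class and capacity your cluster provides:

```yaml
# Hedged sketch: RWX PVC so model-cache can be mounted across nodes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany          # replaces ReadWriteOnce/local-path
  storageClassName: nfs-client   # assumption: any RWX-capable class
  resources:
    requests:
      storage: 100Gi
```

Option (a), co-location, would instead add a `nodeSelector` or `podAffinity` term to every worker spec that mounts these claims.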
recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml-10-119 (1)

10-119: ⚠️ Potential issue | 🟠 Major

Benchmark container is missing explicit security hardening.

There is no container securityContext, so privilege escalation/root constraints are left to defaults (also flagged by CKV_K8S_20/23). Add explicit hardening (allowPrivilegeEscalation: false, drop capabilities). If you want runAsNonRoot, move apt/pip install into a prebuilt image first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml` around lines 10 -
119, The benchmark container (name: benchmark) lacks a securityContext—add one
to harden the pod: set allowPrivilegeEscalation: false, add capabilities.drop:
["ALL"], and set runAsNonRoot: true (or runAsUser to a non-root UID) and
readOnlyRootFilesystem: true; because the manifest currently runs apt/pip
installs in-container, if you must run as non-root, move those install steps into
a prebuilt image and reference that image in the container spec (or keep a
separate init image to perform privileged installs) so the benchmark container
can safely use runAsNonRoot and the other securityContext fields.
recipes/qwen3-vl-30b/vllm/agg/deploy-cache-off.yaml-8-12 (1)

8-12: ⚠️ Potential issue | 🟠 Major

Shared RWO cache PVCs with 8 worker replicas can cause Pending pods on multi-node clusters.

VllmWorker replicas mount the same model-cache and compilation-cache claims. Those claims are ReadWriteOnce with local-path (see recipes/qwen3-vl-30b/model-cache/model-cache.yaml), so workers must be co-located on one node to mount successfully. Without explicit placement constraints, scheduling can fail or become brittle.

Use RWX-capable storage for shared caches, or add explicit node affinity/selector so all workers are intentionally pinned to the same node.

Also applies to: 62-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/deploy-cache-off.yaml` around lines 8 - 12, The
pvcs block in deploy-cache-off.yaml mounts the same ReadWriteOnce local-path
claims (model-cache and compilation-cache) across VllmWorker replicas causing
Pending pods on multi-node clusters; either convert the cache PVCs to use an
RWX-capable storageClass in the model-cache/model-cache.yaml (replace local-path
RWO with an RWX storageClass) or add explicit pod placement to the VllmWorker
spec (nodeSelector/nodeAffinity or a nodeName) so all replicas are pinned to the
same node; update the pvcs configuration or the VllmWorker deployment spec
accordingly (referencing the PVC names model-cache and compilation-cache and the
VllmWorker replicas field) to ensure mounts succeed.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml-10-119 (1)

10-119: ⚠️ Potential issue | 🟠 Major

Container security posture is currently default/overly permissive.

The benchmark container has no explicit securityContext (same CKV_K8S_20/23 class of issue). Please set at least allowPrivilegeEscalation: false and dropped capabilities, and consider switching to a prebuilt benchmark image so it can run non-root.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml` around lines 10 -
119, The benchmark container (name: benchmark, image: python:3.11) lacks a
securityContext — add a securityContext block on the benchmark container that
sets allowPrivilegeEscalation: false, drops all nonessential capabilities
(capabilities: drop: ["ALL"]), and sets enforcement fields such as runAsNonRoot: true
and a non-root runAsUser (e.g., 1000); optionally add readOnlyRootFilesystem:
true and runAsGroup if needed. If you switch to a prebuilt non-root benchmark
image, ensure the same securityContext remains and remove any elevated fields
that would require root privileges.
recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml-24-49 (1)

24-49: ⚠️ Potential issue | 🟠 Major

Readiness and dataset wait loops can hang forever.

Both until loops have no timeout, so a bad frontend endpoint or missing datasets can keep this pod running indefinitely.

Suggested bounded-wait pattern
+          MAX_WAIT_SECONDS="${MAX_WAIT_SECONDS:-1800}"
+          start_ts=$(date +%s)
           until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+            if [ $(( $(date +%s) - start_ts )) -ge "${MAX_WAIT_SECONDS}" ]; then
+              echo "Timed out waiting for model '${MODEL_NAME}'" >&2
+              exit 1
+            fi
             echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
             sleep 5
           done
@@
+          start_ts=$(date +%s)
           until [ "${missing}" -eq 0 ]; do
+            if [ $(( $(date +%s) - start_ts )) -ge "${MAX_WAIT_SECONDS}" ]; then
+              echo "Timed out waiting for datasets in ${DATASET_DIR}" >&2
+              exit 1
+            fi
             missing=0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml` around lines 24 -
49, The readiness and dataset wait loops (the model-ready loop using MODEL_NAME
and FRONTEND checking curl to /v1/models, and the dataset loop that checks files
in DATASET_DIR using tags/r00 r10 r25 r50 r75 and the missing variable) must be
bounded: add a max timeout or max_retries counter and track elapsed
time/retries; if the timeout is exceeded, log an error with context (include
MODEL_NAME and FRONTEND for the model loop, and which input_file or DATASET_DIR
for the dataset loop) and exit non‑zero. Replace the unbounded until constructs
with loops that increment a retry counter (or check elapsed seconds), sleep as
currently done between attempts, and break/exit with non‑zero after the timeout;
keep the existing success logs (e.g., "Model '${MODEL_NAME}' is ready.") when
checks pass. Ensure both loops return distinct, descriptive error messages so
callers can diagnose whether the frontend or datasets timed out.
recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy-cache-off.yaml-8-12 (1)

8-12: ⚠️ Potential issue | 🟠 Major

Both worker pools share RWO local-path PVCs, which can break disagg scheduling.

VllmEncodeWorker and VllmPdWorker both mount the same model-cache and compilation-cache claims while scaling to 8 total GPU pods. With ReadWriteOnce + local-path claims, this only works reliably when all pods land on one node; otherwise mounts/scheduling can fail.

Please add explicit node placement for co-location or switch shared cache PVCs to RWX-capable storage.

Also applies to: 63-75, 109-121

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy-cache-off.yaml` around lines 8 -
12, The shared PVCs (model-cache and compilation-cache) are defined with
RWO/local-path which will fail when VllmEncodeWorker and VllmPdWorker scale
across nodes; either make the PVCs RWX-capable or enforce co-location: update
the pvcs entries for model-cache and compilation-cache to use accessModes:
ReadWriteMany and an RWX storageClassName (e.g., NFS/CephFS) so they can be
mounted by multiple pods, OR add explicit node placement (nodeSelector/affinity
or podAffinity) to both VllmEncodeWorker and VllmPdWorker so they are scheduled
onto the same node; apply the same change to the other identical pvc/worker
blocks mentioned in the comment.
🟡 Minor comments (3)
recipes/qwen3-vl-30b/README.md-56-56 (1)

56-56: ⚠️ Potential issue | 🟡 Minor

Fix dataset job filename typo in the Quick Start explanation.

data-gen/generate-datastes-job.yaml appears misspelled; this should reference generate-datasets-job.yaml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/README.md` at line 56, The README contains a typo in the
Quick Start filename reference: replace the incorrect string
"data-gen/generate-datastes-job.yaml" with the correct
"data-gen/generate-datasets-job.yaml" in the README.md line that describes
testing datasets (the sentence mentioning `data-gen/generate-datastes-job.yaml`
and the r_{##} tagging).
recipes/qwen3-vl-30b/README.md-51-55 (1)

51-55: ⚠️ Potential issue | 🟡 Minor

Replica counts in overview don’t match the EP/D manifest in this PR.

The overview says EP/D is x2 EP + x6 decode, but recipes/qwen3-vl-30b/vllm/disagg-ep-d/deploy-cache-off.yaml defines 4 and 4. Please align docs/manifests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/README.md` around lines 51 - 55, The README's replica
counts for the "disagg EP/D" mode (currently written as "x2 EP workers, x6
decode workers") do not match the manifest in
recipes/qwen3-vl-30b/vllm/disagg-ep-d/deploy-cache-off.yaml which defines 4 EP
and 4 decode replicas; update either the README or the manifest so they match:
either change the README line to "disagg EP/D x4 EP, x4 decode" to reflect the
YAML, or modify the manifest's replica counts in deploy-cache-off.yaml (and any
sibling deploy-cache-on.yaml) to be 2 EP and 6 decode to match the README;
ensure both the overview text in README.md and the manifests
(deploy-cache-off.yaml and deploy-cache-on.yaml if present) are consistent.
recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh-56-67 (1)

56-67: ⚠️ Potential issue | 🟡 Minor

Add a python3 availability check before its first use.

The script validates Docker availability at line 56 but invokes python3 at line 66 without a prerequisite check. On minimal hosts lacking python3, this causes a late failure after temporary directories and downloads have already been created.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh` around lines 56
- 67, Add a pre-flight check for python3 similar to the existing Docker check:
before the curl that pipes to python3 -c, verify python3 is available (use
command -v python3 >/dev/null 2>&1) and if not print an explanatory message and
exit non‑zero; update the script near the WORKDIR/trap block (referencing
WORKDIR and the curl ... | python3 -c invocation) so the script fails fast on
systems without python3.
🧹 Nitpick comments (1)
recipes/qwen3-vl-30b/vllm/agg/analysis.yaml (1)

25-111: Consider extracting shared analysis logic to a reusable script/template.

This inline Python is effectively duplicated across topology manifests; centralizing it (parameterized by topology/path) will reduce drift and future maintenance errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/vllm/agg/analysis.yaml` around lines 25 - 111, Inline
analysis logic (topology/base/out_dir and load_p90_by_metric) is duplicated
across manifests; extract it into a reusable parameterized script and call it
from each manifest. Create a single Python script (e.g., analyze_cache.py) that
accepts topology and base-path arguments and implements existing symbols:
load_p90_by_metric, tags, metrics, rows assembly, CSV writing (csv_out) and
printing; replace the inline heredoc blocks in each manifest with a call to that
script passing topology and path (or tag list) so manifests only supply
parameters not duplicated logic.
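A skeleton of the extracted `analyze_cache.py` the nitpick suggests might look like this; the function and tag names mirror the symbols the prompt lists, but the JSON layout of the result files is an assumption:

```python
#!/usr/bin/env python3
"""Hedged sketch of a shared analyze_cache.py, parameterized by
topology and base path so each manifest only passes arguments."""
import argparse
import csv
import json
from pathlib import Path

TAGS = ["r00", "r10", "r25", "r50", "r75"]

def load_p90_by_metric(result_file: Path) -> dict:
    """Assumed result layout: {"metrics": {name: {"p90": value}}}."""
    data = json.loads(result_file.read_text())
    return {name: vals.get("p90") for name, vals in data.get("metrics", {}).items()}

def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize perf results per topology.")
    parser.add_argument("--topology", required=True, help="e.g. agg, disagg-e-pd")
    parser.add_argument("--base-path", type=Path, required=True)
    parser.add_argument("--out", type=Path, default=Path("summary.csv"))
    args = parser.parse_args()

    rows = []
    for tag in TAGS:
        result = args.base_path / args.topology / f"{tag}.json"
        if result.exists():
            row = {"topology": args.topology, "tag": tag}
            row.update(load_p90_by_metric(result))
            rows.append(row)

    if rows:
        fieldnames = sorted({k for r in rows for k in r})
        with args.out.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

if __name__ == "__main__":
    main()
```

Each manifest's inline heredoc would then shrink to a single invocation such as `python3 analyze_cache.py --topology agg --base-path /perf-cache`.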
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/qwen3-vl-30b/model-cache/model-download.yaml`:
- Around line 35-37: Pin the huggingface_hub dependency when installing in the
model-download step so hf_transfer acceleration is preserved: replace the
unpinned "pip install ... huggingface_hub" with a pinned v0.x spec (e.g.,
huggingface_hub~=0.14) or, alternatively, migrate to the newer hf_xet workflow
by installing hf_xet and setting HF_XET_HIGH_PERFORMANCE=1 (instead of relying
on HF_HUB_ENABLE_HF_TRANSFER). Update the install command shown in the snippet
(the pip install line) and ensure the chosen approach is applied consistently
across all recipe model-download manifests.

---

Major comments:
In `@recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml`:
- Around line 17-24: The manifest forces root by setting securityContext
runAsUser/runAsGroup/fsGroup to 0 for the Job's pod (securityContext block near
the generate-datasets container using image dynamo:latest-vllm-runtime); change
this to a non-root baseline by removing the zero values and instead enable
runAsNonRoot: true and set runAsUser (and runAsGroup/fsGroup if needed) to a
non-root UID/GID (e.g., 1000), or ensure the image is built to run as a non-root
user and document the required UID; apply these changes in the
pod/securityContext that contains the container named generate-datasets and
validate the container image supports the chosen UID.

In `@recipes/qwen3-vl-30b/model-cache/model-download.yaml`:
- Around line 16-23: Add a pod/container securityContext for the job container
named "model-download": set securityContext.runAsUser and runAsGroup to a
non-root UID/GID (e.g., 1000), set securityContext.runAsNonRoot=true, and under
the container spec set securityContext.allowPrivilegeEscalation=false (and
optionally readOnlyRootFilesystem=true and capabilities.drop: ["ALL"]) to
prevent root/privilege escalation; update the Pod spec where the container
"model-download" is defined to include these securityContext fields.

In `@recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh`:
- Around line 64-71: Replace the mutable PR diff URLs used when writing
vllm_pr34182.diff and vllm_pr34783_vllm_only.diff with immutable commit-scoped
URLs (use the target repo commit SHAs for the base..head diffs) and harden the
curl usage in the two curl invocations: enable --fail (or -f), keep -sL as
desired, check curl's exit status immediately and abort on failure, and verify
the downloaded files against expected checksums before proceeding; update the
two references to "${WORKDIR}/vllm_pr34182.diff" and
"${WORKDIR}/vllm_pr34783_vllm_only.diff" accordingly and ensure any Python
filtering step still receives valid input if curl fails.
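The hardened download step could be sketched as below; `fetch_verified` and `verify_sha256` are illustrative helper names, and the URL and checksum in the usage comment are placeholders, not real pins:

```shell
#!/bin/sh
# Hedged sketch: fail-fast curl plus checksum verification for patch diffs.
verify_sha256() {
  # $1 = file, $2 = expected sha256 hex digest
  echo "$2  $1" | sha256sum -c - >/dev/null 2>&1
}

fetch_verified() {
  # $1 = url, $2 = destination, $3 = expected sha256
  curl -fsSL "$1" -o "$2" || { echo "Error: download failed: $1" >&2; return 1; }
  verify_sha256 "$2" "$3" || { echo "Error: checksum mismatch: $2" >&2; return 1; }
}

# Usage (placeholders, not the actual PR diff pins):
#   fetch_verified "https://github.com/vllm-project/vllm/compare/<base>...<head>.diff" \
#     "${WORKDIR}/vllm_pr34182.diff" "<expected-sha256>"
```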

In `@recipes/qwen3-vl-30b/README.md`:
- Around line 89-136: The README currently applies cache-off and then cache-on
manifests without tearing down the prior deployment, which lets both coexist and
skew GPU usage; update the instructions for each option (agg, disagg-ep-d,
disagg-e-pd) to explicitly delete the previous deployment resources before
enabling cache-on by adding a teardown step that deletes the cache-off
manifests/pods (e.g., remove resources created by deploy-cache-off.yaml and the
benchmark pod names like qwen3-vl-30b-agg-benchmark-cache-off) or run kubectl
delete -f deploy-cache-off.yaml (and associated perf YAMLs) and wait for pods to
terminate before applying deploy-cache-on.yaml and perf-cache-on.yaml so the
cache-off and cache-on runs never overlap.

In `@recipes/qwen3-vl-30b/vllm/agg/analysis.yaml`:
- Around line 17-20: The Job's Pod spec (the block containing restartPolicy and
the containers list, specifically the container named "analysis" using image
"python:3.11") needs explicit non-root and no-privilege-escalation controls: add
a top-level pod securityContext with runAsNonRoot: true and a numeric runAsUser
(e.g., 1000) and add a container-level securityContext on the "analysis"
container that sets allowPrivilegeEscalation: false, drops all capabilities
(capabilities: drop: ["ALL"]), and enables readOnlyRootFilesystem: true (and
optionally runAsNonRoot/restrictive runAsUser there as well) so the container
cannot run as root or escalate privileges.

In `@recipes/qwen3-vl-30b/vllm/agg/deploy-cache-off.yaml`:
- Around line 8-12: The pvcs block in deploy-cache-off.yaml mounts the same
ReadWriteOnce local-path claims (model-cache and compilation-cache) across
VllmWorker replicas causing Pending pods on multi-node clusters; either convert
the cache PVCs to use an RWX-capable storageClass in the
model-cache/model-cache.yaml (replace local-path RWO with an RWX storageClass)
or add explicit pod placement to the VllmWorker spec (nodeSelector/nodeAffinity
or a nodeName) so all replicas are pinned to the same node; update the pvcs
configuration or the VllmWorker deployment spec accordingly (referencing the PVC
names model-cache and compilation-cache and the VllmWorker replicas field) to
ensure mounts succeed.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-off.yaml`:
- Around line 24-49: The two unbounded readiness loops (the model readiness curl
loop that references MODEL_NAME and FRONTEND and the dataset polling loop that
iterates over tags and checks files in DATASET_DIR) must be converted to bounded
waits: add a max-wait duration or max-retries counter (e.g., MAX_WAIT_SECS or
MAX_RETRIES) and track elapsed time or attempts inside each loop, logging a
clear error via echo and exiting non-zero if the timeout is reached; ensure the
model loop still polls http://${FRONTEND}:8000/v1/models for MODEL_NAME and the
dataset loop still checks each input_file in "${DATASET_DIR}" before deciding
success, but break out and fail cleanly when the timeout/retry limit is
exceeded.
- Around line 11-23: The benchmark container definition (name: benchmark, image:
python:3.11, command block) is missing a securityContext—add a pod/container
securityContext to harden defaults: set runAsNonRoot: true and runAsUser to a
non-root UID, set readOnlyRootFilesystem: true, set allowPrivilegeEscalation:
false, and drop all Linux capabilities (and add only required ones if any);
apply the same securityContext pattern to the other similar container
definitions referenced (lines 80-119) so all benchmark pods run with non-root,
least-privilege settings.

In `@recipes/qwen3-vl-30b/vllm/agg/perf-cache-on.yaml`:
- Around line 24-49: The loops that wait for the model and for dataset files
(the curl/jq readiness loop using MODEL_NAME and FRONTEND, and the dataset loop
that iterates over tags, missing, input_file, DATASET_DIR) need bounded retries:
add a max retry counter or timeout variable (e.g., MAX_RETRIES or MAX_WAIT_SECS)
and increment/check it inside each until loop; when exceeded, log a clear error
including what timed out (model or which dataset tag) and exit non‑zero so the
Pod fails fast instead of blocking forever. Ensure the new logic returns
non‑zero on timeout and include the relevant identifiers (MODEL_NAME, FRONTEND,
DATASET_DIR, tags) in the error messages to aid debugging.
- Around line 11-23: The benchmark container block (name: benchmark, image:
python:3.11) needs explicit hardening: add a container.securityContext with
allowPrivilegeEscalation: false, capabilities.drop: ["ALL"],
seccompProfile.type: RuntimeDefault, and set runAsNonRoot: true with a non-zero
runAsUser (or set podSecurityContext.runAsNonRoot/runAsUser) so the container
does not run as root; apply the same
container.securityContext/podSecurityContext changes to the other similar
container block referenced in the comment (the block covering lines 80-119).

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/analysis.yaml`:
- Around line 17-20: Add a Pod and container securityContext to harden the Job
that runs the "analysis" container: set pod.spec.securityContext.runAsNonRoot:
true and pod.spec.securityContext.runAsUser to a non-root UID (e.g., 1000), and
add container.securityContext for the "analysis" container with runAsNonRoot:
true, runAsUser matching the pod UID and allowPrivilegeEscalation: false (also
consider readOnlyRootFilesystem: true and capabilities.drop: ["ALL"]). Update
the manifest sections that define the Pod/Job spec and the "analysis" container
to include these fields so the container cannot run as root or escalate
privileges.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy-cache-off.yaml`:
- Around line 8-12: The shared PVCs (model-cache and compilation-cache) are
defined with RWO/local-path which will fail when VllmEncodeWorker and
VllmPdWorker scale across nodes; either make the PVCs RWX-capable or enforce
co-location: update the pvcs entries for model-cache and compilation-cache to
use accessModes: ReadWriteMany and an RWX storageClassName (e.g., NFS/CephFS) so
they can be mounted by multiple pods, OR add explicit node placement
(nodeSelector/affinity or podAffinity) to both VllmEncodeWorker and VllmPdWorker
so they are scheduled onto the same node; apply the same change to the other
identical pvc/worker blocks mentioned in the comment.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/deploy.yaml`:
- Around line 8-12: The manifest's pvcs section defines shared PVCs
"model-cache" and "compilation-cache" that are likely created with
ReadWriteOnce/local-path, causing a shared-RWO bottleneck when scaling GPU
workers; update the deployment to either (a) enforce worker co-location so all
pods that mount "model-cache" and "compilation-cache" are scheduled on the same
node (add nodeAffinity/nodeSelector to the worker Deployment/StatefulSet spec
for the services that mount these PVCs), or (b) switch the PVCs to use
RWX-capable storageClass (replace the PV/PVC definitions for "model-cache" and
"compilation-cache" with a storageClass that supports ReadWriteMany) so multiple
nodes can mount them concurrently; apply the same change to the other
occurrences noted (the blocks around lines 63-75 and 117-129) so all worker
services consistently use either co-location or RWX storage.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-off.yaml`:
- Around line 24-49: The two unbounded wait loops (the model-ready loop that
calls curl against http://${FRONTEND}:8000/v1/models and the dataset wait loop
that checks until [ "${missing}" -eq 0 ]) must enforce a max-wait and fail fast:
introduce a MAX_WAIT_SECS or MAX_RETRIES variable and an attempt/elapsed-time
counter, increment it inside each loop, and when the limit is exceeded print a
clear error including MODEL_NAME/FRONTEND or DATASET_DIR/tags context and exit
non-zero (e.g., exit 1); update the retry echo messages to show
attempts/remaining time and ensure both the model readiness block and the
dataset presence block use this same timeout pattern so they won’t hang
indefinitely.
- Around line 10-119: The benchmark container currently runs with default
privileges; update the container spec for the container named "benchmark" to
include an explicit securityContext: set runAsNonRoot: true (and
runAsUser/runAsGroup to a non-root UID/GID such as 1000), set
allowPrivilegeEscalation: false, add capabilities: drop: ["ALL"], and preferably
set readOnlyRootFilesystem: true if the workload permits; also consider adding a
pod-level securityContext where appropriate. Remove or move the runtime apt
installs and pip install steps from the in-line command into the container image
(or ensure the image supports non-root package installation) so the container
can run non-root without needing root package installs. Ensure any mounted paths
(e.g., /perf-cache, ARTIFACT_BASE_DIR, DATASET_DIR) have appropriate permissions
for the chosen non-root UID to read/write.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf-cache-on.yaml`:
- Around line 24-49: The two unbounded waiting loops (the model readiness curl
loop that checks MODEL_NAME against FRONTEND and the dataset availability loop
that iterates over tags with DATASET_DIR and missing) must enforce timeouts and
fail fast: introduce configurable timeout variables (e.g., MODEL_WAIT_TIMEOUT
and DATASET_WAIT_TIMEOUT) and record a start timestamp before each loop, then on
each iteration check elapsed time against the timeout and if exceeded log a
clear error including MODEL_NAME/FRONTEND or DATASET_DIR and exit non‑zero;
alternatively implement a max-retries counter (e.g., MODEL_MAX_RETRIES,
DATASET_MAX_RETRIES) incremented each loop with a final error+exit when reached
and keep the existing sleep intervals. Ensure the error paths include the same
identifying context (MODEL_NAME, FRONTEND, DATASET_DIR, tags) so failures are
actionable.
- Around line 10-119: The benchmark container (name: benchmark, image:
python:3.11) lacks a securityContext — add a securityContext block on the
benchmark container that sets allowPrivilegeEscalation: false, drops all
nonessential capabilities (capabilities: drop: ["ALL"]), and sets enforcement fields
like runAsNonRoot: true and a non-root runAsUser (e.g., 1000); optionally add
readOnlyRootFilesystem: true and runAsGroup if needed. If you switch to a
prebuilt non-root benchmark image, ensure the same securityContext remains and
remove any elevated fields that would require root privileges.

In `@recipes/qwen3-vl-30b/vllm/disagg-e-pd/perf.yaml`:
- Around line 24-49: Add a bounded timeout to both readiness loops so they
cannot run indefinitely: introduce a MAX_WAIT_SECONDS or MAX_RETRIES env var
(used alongside MODEL_NAME and FRONTEND for the model loop, and DATASET_DIR/tags
for the dataset loop), track elapsed time or attempt count inside the "until
curl ..." model readiness block and inside the dataset "until [ \"${missing}\"
-eq 0 ]" loop, and when the limit is exceeded log a clear error (including
MODEL_NAME or which dataset tag is missing), exit non-zero to fail the job, and
ensure directories (ARTIFACT_BASE_DIR, DATASET_DIR) are still created before
failing.
- Around line 11-23: The benchmark container (name: "benchmark") is missing a
securityContext which leaves it running with default/root privileges; update the
Pod/Container spec to add a securityContext for the benchmark container (and
mirror the same changes for other containers on lines 73-102) that sets
runAsNonRoot: true and a non-zero runAsUser, disallows privilege escalation
(allowPrivilegeEscalation: false), drops all Linux capabilities and only adds
minimal required ones, and enables readOnlyRootFilesystem: true (optionally set
seccompProfile.type to RuntimeDefault) to explicitly lock down privileges.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/analysis.yaml`:
- Around line 18-21: The analysis container currently runs with default
privileges; update the manifest for the container named "analysis" (the entry
under containers -> - name: analysis) by adding a securityContext at both pod
and container levels: set podSecurityContext/runAsNonRoot: true with
runAsUser/runAsGroup/fsGroup as appropriate, and in the container
securityContext set allowPrivilegeEscalation: false, privileged: false,
readOnlyRootFilesystem: true, capabilities.drop: ["ALL"], and
seccompProfile.type: RuntimeDefault to ensure non-root, least-privilege
execution for the analysis job.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-off.yaml`:
- Around line 10-119: The benchmark container (name: benchmark) lacks a
securityContext—add one to harden the pod: set allowPrivilegeEscalation: false,
add capabilities.drop: ["ALL"], and set runAsNonRoot: true (or runAsUser to a
non-root UID) and readOnlyRootFilesystem: true; because the manifest currently
runs apt/pip in-container, if you must run as non-root, move those
install steps into a prebuilt image and reference that image in the container
spec (or keep a separate init image to perform privileged installs) so the
benchmark container can safely use runAsNonRoot and the other securityContext
fields.
- Around line 24-49: The readiness and dataset wait loops (the model-ready loop
using MODEL_NAME and FRONTEND checking curl to /v1/models, and the dataset loop
that checks files in DATASET_DIR using tags/r00 r10 r25 r50 r75 and the missing
variable) must be bounded: add a max timeout or max_retries counter and track
elapsed time/retries; if the timeout is exceeded, log an error with context
(include MODEL_NAME and FRONTEND for the model loop, and which input_file or
DATASET_DIR for the dataset loop) and exit non‑zero. Replace the unbounded until
constructs with loops that increment a retry counter (or check elapsed seconds),
sleep as currently done between attempts, and break/exit with non‑zero after the
timeout; keep the existing success logs (e.g., "Model '${MODEL_NAME}' is
ready.") when checks pass. Ensure both loops return distinct, descriptive error
messages so callers can diagnose whether the frontend or datasets timed out.

In `@recipes/qwen3-vl-30b/vllm/disagg-ep-d/perf-cache-on.yaml`:
- Around line 11-23: The container named "benchmark" lacks a pod/container
securityContext; add a securityContext to the container spec for the "benchmark"
container (and repeat for the other similar containers referenced) that enforces
non-root execution and least privilege: set runAsNonRoot: true and runAsUser to
a non-root UID (e.g., 1000), set allowPrivilegeEscalation: false, set
readOnlyRootFilesystem: true, drop all Linux capabilities (capabilities.drop:
["ALL"]) and add a restricted seccomp profile if available; also consider
setting fsGroup for shared volumes. Update the container spec where the image
python:3.11 and command block appear (the "benchmark" container) and mirror the
same securityContext for the other blocks noted (lines ~80-119).
- Around line 24-49: The readiness loops currently can wait forever; add
deadline/fail-fast logic for both the model readiness loop (the until that curls
http://${FRONTEND}:8000/v1/models checking MODEL_NAME) and the dataset wait loop
(the until that checks files in DATASET_DIR using tags/missing). Introduce
timeout variables (e.g., MODEL_READY_TIMEOUT and DATASET_WAIT_TIMEOUT or max
retries), capture a start timestamp before each loop, and on each iteration
compare elapsed time; if the timeout is exceeded log a clear error including the
relevant ID/paths (MODEL_NAME, FRONTEND, DATASET_DIR) and exit non-zero to fail
fast. Ensure the curl check still uses quiet/exit codes and respect the timeout
by failing when the deadline is reached.
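The securityContext hardening described in the first item above could look like the following fragment; the UID/GID values are illustrative, not taken from the manifest:

```yaml
spec:
  securityContext:
    fsGroup: 1000              # group ownership for shared volumes (illustrative)
  containers:
    - name: benchmark
      image: python:3.11
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000        # illustrative non-root UID
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
```

Note that `readOnlyRootFilesystem: true` may require an `emptyDir` mount for any scratch paths the benchmark writes to.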

---

Minor comments:
In `@recipes/qwen3-vl-30b/patches/patch_vllm_agg_encoder_cache.sh`:
- Around line 56-67: Add a pre-flight check for python3 similar to the existing
Docker check: before the curl that pipes to python3 -c, verify python3 is
available (use command -v python3 >/dev/null 2>&1) and if not print an
explanatory message and exit non‑zero; update the script near the WORKDIR/trap
block (referencing WORKDIR and the curl ... | python3 -c invocation) so the
script fails fast on systems without python3.
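A minimal sketch of that pre-flight check, written as a reusable helper (the `require_cmd` name is hypothetical; the commented usage line shows where it would slot into the patch script):

```shell
# Fail fast when a required tool is absent, mirroring the existing Docker check.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "ERROR: $1 is required but was not found in PATH." >&2
    return 1
  fi
}

# In patch_vllm_agg_encoder_cache.sh, before the `curl ... | python3 -c` invocation:
# require_cmd python3 || exit 1
```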

In `@recipes/qwen3-vl-30b/README.md`:
- Line 56: The README contains a typo in the Quick Start filename reference:
replace the incorrect string "data-gen/generate-datastes-job.yaml" with the
correct "data-gen/generate-datasets-job.yaml" in the README.md line that
describes testing datasets (the sentence mentioning
`data-gen/generate-datastes-job.yaml` and the r_{##} tagging).
- Around line 51-55: The README's replica counts for the "disagg EP/D" mode
(currently written as "x2 EP workers, x6 decode workers") do not match the
manifest in recipes/qwen3-vl-30b/vllm/disagg-ep-d/deploy-cache-off.yaml which
defines 4 EP and 4 decode replicas; update either the README or the manifest so
they match: either change the README line to "disagg EP/D x4 EP, x4 decode" to
reflect the YAML, or modify the manifest's replica counts in
deploy-cache-off.yaml (and any sibling deploy-cache-on.yaml) to be 2 EP and 6
decode to match the README; ensure both the overview text in README.md and the
manifests (deploy-cache-off.yaml and deploy-cache-on.yaml if present) are
consistent.

---

Nitpick comments:
In `@recipes/qwen3-vl-30b/vllm/agg/analysis.yaml`:
- Around line 25-111: Inline analysis logic (topology/base/out_dir and
load_p90_by_metric) is duplicated across manifests; extract it into a reusable
parameterized script and call it from each manifest. Create a single Python
script (e.g., analyze_cache.py) that accepts topology and base-path arguments
and implements existing symbols: load_p90_by_metric, tags, metrics, rows
assembly, CSV writing (csv_out) and printing; replace the inline heredoc blocks
in each manifest with a call to that script passing topology and path (or tag
list) so manifests only supply parameters not duplicated logic.
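A sketch of such a parameterized `analyze_cache.py` follows. The result-file layout (one JSON file per tag mapping metric name to a list of samples) and the metric names are assumptions for illustration; only the symbol names (`load_p90_by_metric`, tags, metrics, `csv_out`) come from the manifests:

```python
#!/usr/bin/env python3
"""Hypothetical shared analysis script, parameterized by topology and base path."""
import argparse
import csv
import json
import math
from pathlib import Path

TAGS = ["r00", "r10", "r25", "r50", "r75"]                 # image-reuse dataset tags
METRICS = ["time_to_first_token", "inter_token_latency"]   # assumed metric keys


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered)) - 1
    return ordered[rank]


def load_p90_by_metric(result_file, metrics=METRICS):
    """Read a per-run JSON file (metric -> list of samples, assumed layout)
    and return metric -> p90."""
    data = json.loads(Path(result_file).read_text())
    return {m: p90(data[m]) for m in metrics if m in data}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--topology", required=True)    # e.g. agg, disagg-ep-d
    parser.add_argument("--base", required=True, type=Path)
    args = parser.parse_args()

    out_dir = args.base / args.topology
    rows = [
        {"tag": tag, **load_p90_by_metric(out_dir / f"{tag}.json")}
        for tag in TAGS
        if (out_dir / f"{tag}.json").exists()
    ]

    csv_out = out_dir / "p90_summary.csv"
    with csv_out.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["tag", *METRICS])
        writer.writeheader()
        writer.writerows(rows)
    print(f"wrote {csv_out}")


if __name__ == "__main__":
    main()
```

Each manifest's heredoc then reduces to `python3 analyze_cache.py --topology agg --base /results`, and the per-topology differences live entirely in the arguments.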

Comment on lines +35 to +37
set -eux
pip install --no-cache-dir huggingface_hub hf_transfer
hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the file mentioned in the review
find . -name "model-download.yaml" -o -name "*qwen*" -type f | head -20

Repository: ai-dynamo/dynamo

Length of output: 600


🏁 Script executed:

# List the recipes directory structure
git ls-files recipes/ | grep -E "(qwen|model-cache)" | head -20

Repository: ai-dynamo/dynamo

Length of output: 1057


🏁 Script executed:

# Read the target file to verify the code and context
cat -n recipes/qwen3-vl-30b/model-cache/model-download.yaml

Repository: ai-dynamo/dynamo

Length of output: 1665


🏁 Script executed:

# Check other model-download.yaml files for comparison
for file in recipes/*/model-cache/model-download.yaml; do
  echo "=== $file ==="
  grep -A 2 "pip install" "$file" || echo "No pip install found"
done

Repository: ai-dynamo/dynamo

Length of output: 1730


🏁 Script executed:

# Check if there's a README in recipes explaining the versioning strategy
cat recipes/qwen3-vl-30b/README.md 2>/dev/null || echo "No README found"

Repository: ai-dynamo/dynamo

Length of output: 7116


🏁 Script executed:

# Check other deployment/inference manifests to see if they pin versions
rg "pip install.*huggingface" --type yaml -A 1 -B 1 | head -50

Repository: ai-dynamo/dynamo

Length of output: 3285


🏁 Script executed:

# Check if there are any requirements.txt or setup files in recipes
find recipes/ -name "requirements*.txt" -o -name "setup.py" | head -20

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Check if there's a base Dockerfile or version constraints elsewhere
rg "huggingface" --type yaml | grep -i version | head -20

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Look for any documentation about dependency management
find . -name "*.md" -type f -exec grep -l "pip install\|dependencies\|versions" {} \; | head -10

Repository: ai-dynamo/dynamo

Length of output: 2300


🏁 Script executed:

# Check the Python base image to see what default versions might be installed
# Also verify if hf_transfer is a real package and can be version-pinned
python3 <<'EOF'
# Check if hf_transfer is a real PyPI package
import json
try:
    import urllib.request
    resp = urllib.request.urlopen('https://pypi.org/pypi/hf-transfer/json')
    data = json.loads(resp.read().decode())
    releases = list(data['releases'].keys())
    print(f"hf-transfer is a real package with {len(releases)} releases")
    print(f"Latest versions: {sorted(releases, reverse=True)[:5]}")
except Exception as e:
    print(f"Error checking hf-transfer: {e}")
EOF

Repository: ai-dynamo/dynamo

Length of output: 215


🏁 Script executed:

# Check if huggingface_hub has version compatibility info
python3 <<'EOF'
import json
try:
    import urllib.request
    resp = urllib.request.urlopen('https://pypi.org/pypi/huggingface-hub/json')
    data = json.loads(resp.read().decode())
    releases = list(data['releases'].keys())
    print(f"huggingface-hub is a real package with {len(releases)} releases")
    print(f"Latest versions: {sorted(releases, reverse=True)[:5]}")
except Exception as e:
    print(f"Error checking huggingface-hub: {e}")
EOF

Repository: ai-dynamo/dynamo

Length of output: 219


🌐 Web query:

huggingface_hub hf_transfer package version stability compatibility

💡 Result:

What hf_transfer is (and how “stable” it is)

  • hf_transfer is a power-user Rust-based accelerator intended for extremely high-bandwidth transfers (the project explicitly says it’s not a general usability tool and “comes generally as-is”).[3]
  • Latest published hf-transfer on PyPI is 0.1.9 (Jan 7, 2025); it requires Python ≥ 3.7 and ships platform wheels (Windows/macOS/Linux, etc.).[2]

Compatibility with huggingface_hub (most important point)

  • huggingface_hub v1.0+ removed support for the optional hf_transfer integration. As a result, HF_HUB_ENABLE_HF_TRANSFER is ignored in v1.x, and Hugging Face recommends using hf_xet instead (and HF_XET_HIGH_PERFORMANCE for tuning).[1]
  • Current huggingface_hub releases are v1.x (e.g., PyPI shows 1.5.0 uploaded Feb 26, 2026), so in modern environments hf_transfer generally won’t do anything through huggingface_hub anymore.[4]

Practical “version stability / compatibility” guidance

  • If you need HF_HUB_ENABLE_HF_TRANSFER=1 to work via huggingface_hub, you typically must stay on huggingface_hub v0.x (pre‑v1.0 behavior).[1]
  • If you are on huggingface_hub v1.x, install/use hf_xet rather than hf_transfer for accelerated transfers.[1]
  • Be aware there have been packaging/installer edge cases around the huggingface_hub[hf_transfer] extra (reported as broken for some pip versions), which is another reason pinning and testing your toolchain matters if you’re on older huggingface_hub where hf_transfer still integrates.[5]

Sources:
[1] Hugging Face docs — Migrating to huggingface_hub v1.0 (hf_transfer removed; env var ignored; use hf_xet / HF_XET_HIGH_PERFORMANCE). (huggingface.co)
[2] PyPI — hf-transfer 0.1.9 (release date, Python requirement, wheels). (pypi.org)
[3] GitHub — huggingface/hf_transfer README (power-user disclaimer; “as-is”). (github.com)
[4] PyPI — huggingface-hub (shows v1.x current releases, incl. 1.5.0 upload date). (pypi.org)
[5] GitHub issue — huggingface_hub #3029 (hf_transfer extra broken for some pip versions). (github.com)

Citations:


🏁 Script executed:

# Verify the file one more time to confirm it uses HF_HUB_ENABLE_HF_TRANSFER
sed -n '24,37p' recipes/qwen3-vl-30b/model-cache/model-download.yaml

Repository: ai-dynamo/dynamo

Length of output: 567


Pin the huggingface_hub version: an unpinned install breaks hf_transfer acceleration.

Unpinned huggingface_hub will pull v1.x (current latest), which removed support for the hf_transfer integration. The HF_HUB_ENABLE_HF_TRANSFER=1 environment variable (line 29) is silently ignored in v1.x, causing the intended download acceleration to fail without error.

Pin huggingface_hub to v0.x to restore hf_transfer functionality, or migrate to hf_xet (recommended by Hugging Face for modern setups) with the HF_XET_HIGH_PERFORMANCE=1 environment variable instead. Apply the fix consistently across all recipe model-download manifests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3-vl-30b/model-cache/model-download.yaml` around lines 35 - 37,
Pin the huggingface_hub dependency when installing in the model-download step so
hf_transfer acceleration is preserved: replace the unpinned "pip install ...
huggingface_hub" with a pinned v0.x spec (e.g., huggingface_hub~=0.14) or,
alternatively, migrate to the newer hf_xet workflow by installing hf_xet and
setting HF_XET_HIGH_PERFORMANCE=1 (instead of relying on
HF_HUB_ENABLE_HF_TRANSFER). Update the install command shown in the snippet (the
pip install line) and ensure the chosen approach is applied consistently across
all recipe model-download manifests.
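For the pin-to-v0.x option, the manifest's install step would change along these lines; the `~=0.34` specifier is illustrative (any 0.x release that still integrates hf_transfer and ships the `hf` CLI would do), not taken from the manifest:

```yaml
args:
  - |
    set -eux
    # Pin huggingface_hub to 0.x so HF_HUB_ENABLE_HF_TRANSFER=1 is honored
    # (the ~=0.34 specifier is illustrative).
    pip install --no-cache-dir "huggingface_hub~=0.34" hf_transfer
    hf download "$MODEL_NAME" --revision "$MODEL_REVISION"
```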

@esoba esoba changed the title Qwen3-VL-30B + Encoder Cache Recipe feat: Qwen3-VL-30B recipe for agg/disagg and encoder cache with vLLM patch Mar 5, 2026
@github-actions github-actions bot added the feat label Mar 5, 2026
Signed-off-by: Elijah Soba <esoba@nvidia.com>
Signed-off-by: Elijah Soba <esoba@nvidia.com>
--max-model-len 16384 \
--disable-log-requests \
--enable-prefix-caching \
--multimodal-embedding-cache-capacity-gb 10
Contributor

@esoba can you give input on your perf findings with 10GB of cache enabled? Curious if we want a larger size or not for this recipe

Signed-off-by: Elijah Soba <esoba@nvidia.com>
@rmccorm4
Contributor

rmccorm4 commented Mar 6, 2026

/ok to test 4c1a65f

esoba added 3 commits March 6, 2026 11:36
Signed-off-by: Elijah Soba <esoba@nvidia.com>
…d some AI comments

Signed-off-by: Elijah Soba <esoba@nvidia.com>
Signed-off-by: Elijah Soba <esoba@nvidia.com>
@rmccorm4
Contributor

rmccorm4 commented Mar 7, 2026

/ok to test 70f4e6f


Labels

documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor feat size/XXL
