From 8e8c032a015552aac808f074772fc41a21161efe Mon Sep 17 00:00:00 2001
From: Kai Xu
Date: Sat, 28 Mar 2026 16:07:29 -0700
Subject: [PATCH 1/5] Add Agent Evaluation skill for accuracy benchmarking

Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator
Launcher (NEL). Based on the upstream nel-assistant skill (commit f1fa073)
with ModelOpt-specific additions:

- Auto-detect ModelOpt quantization format from hf_quant_config.json and
  set the correct vLLM/SGLang --quantization flag
- Quantization-aware benchmark defaults (recommend MMLU, GSM8K,
  ARC-Challenge for quantized models)
- Workspace management for multi-user environments (Step 0)
- Disable MD036/MD029 markdownlint rules for upstream NEL formatting

The skill guides users through NEL config generation, model card research,
and evaluation execution (local and SLURM).

Signed-off-by: Kai Xu
---
 .claude/skills/evaluation/SKILL.md            | 386 ++++++++++++++++++
 .../evals/nemotron3-nano-bf16-reasoning.json  |  26 ++
 .markdownlint-cli2.yaml                       |   2 +
 3 files changed, 414 insertions(+)
 create mode 100644 .claude/skills/evaluation/SKILL.md
 create mode 100644 .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
new file mode 100644
index 0000000000..9f4ffe1504
--- /dev/null
+++ b/.claude/skills/evaluation/SKILL.md
@@ -0,0 +1,386 @@
+---
+name: evaluation
+description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.
+license: Apache-2.0
+# Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073)
+# https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant
+# Modifications: renamed to evaluation, added workspace management (Step 0),
+# auto-detect ModelOpt quantization format, quantization-aware benchmark defaults.
+---
+
+## NeMo Evaluator Launcher Assistant
+
+You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via the interactive workflow specified below.
+
+### Workspace (multi-user / Slack bot)
+
+If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
+
+### Workflow
+
+```text
+Config Generation Progress:
+- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
+- [ ] Step 1: Check if nel is installed
+- [ ] Step 2: Build the base config file
+- [ ] Step 3: Configure model path and parameters
+- [ ] Step 4: Fill in remaining missing values
+- [ ] Step 5: Confirm tasks (iterative)
+- [ ] Step 6: Advanced - Multi-node (Data Parallel)
+- [ ] Step 7: Advanced - Interceptors
+- [ ] Step 8: Run the evaluation
+```
+
+**Step 1: Check if nel is installed**
+
+Test that `nel` is installed with `nel --version`.
+
+If not, instruct the user to `pip install nemo-evaluator-launcher`.
+
+**Step 2: Build the base config file**
+
+Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
+
+1. Execution:
+
+- Local
+- SLURM
+
+2. Deployment:
+
+- None (External)
+- vLLM
+- SGLang
+- NIM
+- TRT-LLM
+
+3. Auto-export:
+
+- None (auto-export disabled)
+- MLflow
+- wandb
+
+4. Model type
+
+- Base
+- Chat
+- Reasoning
+
+5. Benchmarks:
+   Allow for multiple choices in this question.
+1. Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
+2. Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
+3. Math & Reasoning (like AIME, GPQA, MATH-500, ...)
+4. Safety & Security (like Garak and Safety Harness)
+5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)
+
+DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
+
+When you have all the answers, run the script to build the base config:
+
+```bash
+nel skills build-config --execution <execution> --deployment <deployment> --model_type <model_type> --benchmarks <benchmarks...> [--export <exporter>] [--output <path>]
+```
+
+Where `--output` depends on what the user provides:
+
+- Omit: Uses current directory with auto-generated filename
+- Directory: Writes to that directory with auto-generated filename
+- File path (*.yaml): Writes to that specific file
+
+It never overwrites existing files.
+
+**Step 3: Configure model path and parameters**
+
+Ask for model path. Determine type:
+
+- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
+- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
+
+**Auto-detect ModelOpt quantization format** (checkpoint paths only):
+
+Check for `hf_quant_config.json` in the checkpoint directory:
+
+```bash
+cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
+```
+
+If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:
+
+| `quant_algo` | Flag to add |
+|-------------|-------------|
+| `FP8` | `--quantization modelopt` |
+| `W4A8_AWQ` | `--quantization modelopt` |
+| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
+
+If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
+
+**Quantization-aware benchmark defaults:**
+
+When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
+
+- **Always include**: MMLU (general knowledge, most affected by quantization)
+- **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
+- **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
+- **Less useful for quant comparison**: IFEval (instruction following — rarely affected by quantization)
+
+Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
+
+Use WebSearch to find model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. Extract ALL relevant configurations:
+
+- Sampling params (`temperature`, `top_p`)
+- Context length (`deployment.extra_args: "--max-model-len <value>"`)
+- TP/DP settings (to set them appropriately, AskUserQuestion how many GPUs the model will be deployed on)
+- Reasoning config (if applicable):
+  - reasoning on/off: use either:
+    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
+    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
+  - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
+  - higher `max_new_tokens`
+  - etc.
+- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
+- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
+- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
+  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
+  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
+- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
+  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
+  - Use `curl` instead of `wget` as it's more widely available in Docker containers
+  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
+  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
+  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
+- Any other model-specific requirements
+
+Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
+
+Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
+
+**Step 4: Fill in remaining missing values**
+
+- Find all remaining `???` missing values in the config.
+- Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text.
+- Ask the user if they want to change any other defaults e.g. execution partition or walltime (if running on SLURM) or add MLflow/wandb tags (if auto-export enabled).
+
+**Step 5: Confirm tasks (iterative)**
+
+Show tasks in the current config. Loop until the user confirms the task list is final:
+
+1. Tell the user: "Run `nel ls tasks` to see all available tasks".
+2. Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides.
+   To add per-task `nemo_evaluator_config` as specified by the user, e.g.:
+
+   ```yaml
+   tasks:
+     - name: <task_name>
+       nemo_evaluator_config:
+         config:
+           params:
+             temperature: <value>
+             max_new_tokens: <value>
+             ...
+   ```
+
+3. Apply changes.
+4. Show updated list and ask: "Is the task list final, or do you want to make more changes?"
+
+**Known Issues**
+
+- NeMo-Skills workaround (self-deployment only): If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
+
+  ```yaml
+  target:
+    api_endpoint:
+      api_key_name: DUMMY_API_KEY
+  ```
+
+  For the None (External) deployment the `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.
+
+**Step 6: Advanced - Multi-node**
+
+There are two multi-node patterns. Ask the user which applies:
+
+**Pattern A: Multi-instance (independent instances with HAProxy)**
+
+Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
+
+```yaml
+execution:
+  num_nodes: 4      # Total nodes
+  num_instances: 4  # 4 independent instances → HAProxy auto-enabled
+```
+
+**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**
+
+When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config:
+
+```yaml
+defaults:
+  - deployment: vllm_ray  # Built-in Ray cluster setup (replaces manual pre_cmd)
+
+execution:
+  num_nodes: 2  # Single instance spanning 2 nodes
+
+deployment:
+  tensor_parallel_size: 8
+  pipeline_parallel_size: 2
+```
+
+**Pattern A+B combined: Multi-instance with multi-node instances**
+
+For very large models needing both cross-node parallelism AND multiple instances:
+
+```yaml
+defaults:
+  - deployment: vllm_ray
+
+execution:
+  num_nodes: 4      # Total nodes
+  num_instances: 2  # 2 instances of 2 nodes each → HAProxy auto-enabled
+
+deployment:
+  tensor_parallel_size: 8
+  pipeline_parallel_size: 2
+```
+
+**Common Confusions**
+
+- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
+- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
+- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
+- `num_nodes` must be divisible by `num_instances`.
+
+**Step 7: Advanced - Interceptors**
+
+- Tell the user they should see the interceptors documentation.
+- DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. If the user asks about interceptors, only then read the webpage to provide precise information.
+- If the user asks you to configure some interceptor, then read the webpage of this interceptor and configure it according to the `--overrides` syntax but put the values in the YAML config under `evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config` (NOT under `target.api_endpoint.adapter_config`) instead of using CLI overrides.
+  By defining the `interceptors` list you'd override the full chain of interceptors, which can have unintended consequences like disabling default interceptors. That's why you should use the fields specified in the `CLI Configuration` section after the `--overrides` keyword to configure interceptors in the YAML config.
+
+**Documentation Errata**
+
+- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).
+
+**Step 8: Run the evaluation**
+
+Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
+
+**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
+
+```bash
+# If using pre_cmd or post_cmd:
+export NEMO_EVALUATOR_TRUST_PRE_CMD=1
+
+# If using nemo_skills.* tasks with self-deployment:
+export DUMMY_API_KEY=dummy
+```
+
+1. **Dry-run** (validates config without running):
+
+   ```bash
+   nel run --config <config.yaml> --dry-run
+   ```
+
+2. **Test with limited samples** (quick validation run):
+
+   ```bash
+   nel run --config <config.yaml> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
+   ```
+
+3. **Re-run a single task** (useful for debugging or re-testing after config changes):
+
+   ```bash
+   nel run --config <config.yaml> -t <task_name>
+   ```
+
+   Combine with `-o` for limited samples: `nel run --config <config.yaml> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10`
+
+4. **Full evaluation** (production run):
+
+   ```bash
+   nel run --config <config.yaml>
+   ```
+
+After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.
+
+**Monitoring Progress**
+
+After job submission, you can monitor progress using:
+
+1. **Check job status:**
+
+   ```bash
+   nel status
+   nel info
+   ```
+
+2. **Stream logs** (Local execution only):
+
+   ```bash
+   nel logs
+   ```
+
+   Note: `nel logs` is not supported for SLURM execution.
+
+3. **Inspect logs via SSH** (SLURM workaround):
+
+   When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly:
+
+   First, get log locations:
+
+   ```bash
+   nel info --logs
+   ```
+
+   Then, use SSH to view logs:
+
+   **Check server deployment logs:**
+
+   ```bash
+   ssh <user>@<host> "tail -100 <output_dir from `nel info --logs`>/server-*.log"
+   ```
+
+   Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).
+
+   **Check evaluation client logs:**
+
+   ```bash
+   ssh <user>@<host> "tail -100 <output_dir from `nel info --logs`>/client-<task>.log"
+   ```
+
+   Shows evaluation progress, task execution, and results.
+
+   **Check SLURM scheduler logs:**
+
+   ```bash
+   ssh <user>@<host> "tail -100 <output_dir from `nel info --logs`>/slurm-<job_id>.log"
+   ```
+
+   Shows job scheduling, health checks, and overall execution flow.
+
+   **Search for errors:**
+
+   ```bash
+   ssh <user>@<host> "grep -i 'error\|warning\|failed' <output_dir from `nel info --logs`>/*.log"
+   ```
+
+---
+
+Direct users with issues to:
+
+- **GitHub Issues:**
+- **GitHub Discussions:**
+
+Now, copy this checklist and track your progress:
+
+```text
+Config Generation Progress:
+- [ ] Step 0: Check workspace (if multi-user)
+- [ ] Step 1: Check if nel is installed
+- [ ] Step 2: Build the base config file
+- [ ] Step 3: Configure model path and parameters
+- [ ] Step 4: Fill in remaining missing values
+- [ ] Step 5: Confirm tasks (iterative)
+- [ ] Step 6: Advanced - Multi-node (Data Parallel)
+- [ ] Step 7: Advanced - Interceptors
+- [ ] Step 8: Run the evaluation
+```
diff --git a/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json b/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
new file mode 100644
index 0000000000..6fb32570eb
--- /dev/null
+++ b/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
@@ -0,0 +1,26 @@
+{
+  "skills": ["nel-assistant"],
+  "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed by running 'nel --version'",
+    "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks) before generating the config",
+    "Runs 'nel skills build-config' with correct flags matching user answers: --execution slurm --deployment vllm --model-type reasoning --benchmarks standard code math_reasoning --export mlflow",
+    "Searches the web for the model card on HuggingFace and extracts model-specific settings",
+    "Sets correct HF handle: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
+    "Sets reasoning sampling params from model card: temperature=1.0, top_p=1.0",
+    "Configures reasoning toggle via params_to_add with chat_template_kwargs.enable_thinking (not via system prompt)",
+    "Disables reasoning for IFEval task using enable_thinking: false with use_system_prompt: false",
+    "Adds deployment.pre_cmd using curl (not wget) to download nano_v3_reasoning_parser.py from HuggingFace",
+    "Adds vLLM extra_args including --trust-remote-code, --reasoning-parser-plugin, --reasoning-parser nano_v3, --max-num-seqs 8",
+    "Pins vLLM image to v0.12.0 or later as required by model card",
+    "Adds target.api_endpoint.api_key_name: DUMMY_API_KEY for nemo_skills tasks with self-deployment",
+    "Fills in all ??? placeholders after asking the user for SLURM hostname, account, output_dir, MLflow tracking_uri, and experiment_name",
+    "Applies user-requested SLURM customizations: partition batch_short, walltime 00:20:00, MLflow tag scenario: demo",
+    "Presents task list and waits for user confirmation before proceeding",
+    "Configures request and response logging interceptors under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config using correct field names (max_logged_requests/max_logged_responses, not max_saved_*)",
+    "Handles dry-run failure for missing HF_TOKEN_FOR_GPQA_DIAMOND by offering to fix the config",
+    "Successfully submits test run with limit_samples=10 after dry-run passes",
+    "Provides monitoring commands (nel status, nel info --logs) and inspects server logs via SSH when asked"
+  ]
+}
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
index 4c5a690145..f6de39a4a9 100644
--- a/.markdownlint-cli2.yaml
+++ b/.markdownlint-cli2.yaml
@@ -2,6 +2,8 @@ config:
   MD013: false # line-length
   MD024: false # no-duplicate-heading
   MD028: false # no-blanks-blockquote
+  MD029: false # ol-prefix — upstream NEL skill uses actual numbers
   MD033: false # no-inline-html
+  MD036: false # no-emphasis-as-heading — upstream NEL skill uses **Bold** as headers
   MD041: false # first-line-heading
   MD059: false # no-hard-tabs

From 31ff035f015dccc18be3a9e01435a9f76436cbef Mon Sep 17 00:00:00 2001
From: Kai Xu
Date: Sun, 29 Mar 2026 10:50:50 -0700
Subject: [PATCH 2/5] Refactor and add modelopt path

Signed-off-by: Kai Xu
---
 .claude/skills/evaluation/SKILL.md            | 90 ++-----------------
 .../evals/base-model-local-execution.json     | 18 ++++
 .../evals/external-deployment-eval.json       | 17 ++++
 .../evals/interceptor-configuration.json      | 16 ++++
 .../evals/multi-node-evaluation.json          | 18 ++++
 .../evaluation/evals/nel-not-installed.json   | 12 +++
 .../evals/nemotron3-nano-bf16-reasoning.json  |  2 +-
 .../evals/nvfp4-auto-detect-quantization.json | 16 ++++
 .../quantized-checkpoint-local-vllm.json      | 22 +++++
 .../evals/reasoning-model-sglang.json         | 21 +++++
 .../evals/safety-multilingual-benchmarks.json | 17 ++++
 .../evals/wandb-export-code-benchmarks.json   | 18 ++++
 .../evals/workspace-reuse-from-ptq.json       | 15 ++++
 .../references/model-card-research.md         | 30 +++++++
 .../evaluation/references/multi-node.md       | 53 +++++++++++
 15 files changed, 281 insertions(+), 84 deletions(-)
 create mode 100644 .claude/skills/evaluation/evals/base-model-local-execution.json
 create mode 100644 .claude/skills/evaluation/evals/external-deployment-eval.json
 create mode 100644 .claude/skills/evaluation/evals/interceptor-configuration.json
 create mode 100644 .claude/skills/evaluation/evals/multi-node-evaluation.json
 create mode 100644 .claude/skills/evaluation/evals/nel-not-installed.json
 create mode 100644 .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
 create mode 100644 .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
 create mode 100644 .claude/skills/evaluation/evals/reasoning-model-sglang.json
 create mode 100644 .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
 create mode 100644 .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
 create mode 100644 .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
 create mode 100644 .claude/skills/evaluation/references/model-card-research.md
 create mode 100644 .claude/skills/evaluation/references/multi-node.md

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 9f4ffe1504..b4778b79af 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -112,48 +112,22 @@ If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantiz
 | `FP8` | `--quantization modelopt` |
 | `W4A8_AWQ` | `--quantization modelopt` |
 | `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
+| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
 
-If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
+If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.
 
 **Quantization-aware benchmark defaults:**
 
 When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
 
-- **Always include**: MMLU (general knowledge, most affected by quantization)
+- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization)
 - **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
 - **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
-- **Less useful for quant comparison**: IFEval (instruction following — rarely affected by quantization)
+- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4)
 
 Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
 
-Use WebSearch to find model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. Extract ALL relevant configurations:
-
-- Sampling params (`temperature`, `top_p`)
-- Context length (`deployment.extra_args: "--max-model-len <value>"`)
-- TP/DP settings (to set them appropriately, AskUserQuestion how many GPUs the model will be deployed on)
-- Reasoning config (if applicable):
-  - reasoning on/off: use either:
-    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
-    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
-  - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
-  - higher `max_new_tokens`
-  - etc.
-- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
-- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
-- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
-  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
-  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
-- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
-  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
-  - Use `curl` instead of `wget` as it's more widely available in Docker containers
-  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
-  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
-  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
-- Any other model-specific requirements
-
-Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
-
-Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
+Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
 
 **Step 4: Fill in remaining missing values**
@@ -197,57 +171,7 @@ Show tasks in the current config. Loop until the user confirms the task list is
 
 **Step 6: Advanced - Multi-node**
 
-There are two multi-node patterns. Ask the user which applies:
-
-**Pattern A: Multi-instance (independent instances with HAProxy)**
-
-Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
-
-```yaml
-execution:
-  num_nodes: 4      # Total nodes
-  num_instances: 4  # 4 independent instances → HAProxy auto-enabled
-```
-
-**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**
-
-When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config:
-
-```yaml
-defaults:
-  - deployment: vllm_ray  # Built-in Ray cluster setup (replaces manual pre_cmd)
-
-execution:
-  num_nodes: 2  # Single instance spanning 2 nodes
-
-deployment:
-  tensor_parallel_size: 8
-  pipeline_parallel_size: 2
-```
-
-**Pattern A+B combined: Multi-instance with multi-node instances**
-
-For very large models needing both cross-node parallelism AND multiple instances:
-
-```yaml
-defaults:
-  - deployment: vllm_ray
-
-execution:
-  num_nodes: 4      # Total nodes
-  num_instances: 2  # 2 instances of 2 nodes each → HAProxy auto-enabled
-
-deployment:
-  tensor_parallel_size: 8
-  pipeline_parallel_size: 2
-```
-
-**Common Confusions**
-
-- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
-- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
-- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
-- `num_nodes` must be divisible by `num_instances`.
+If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
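+For quick orientation, a minimal sketch of the simplest case (multi-instance with HAProxy load balancing, values taken from the pattern descriptions; see the reference file for the Ray-based variants):
+
+```yaml
+execution:
+  num_nodes: 4      # Total nodes
+  num_instances: 4  # 4 independent instances → HAProxy auto-enabled
+```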
 
 **Step 7: Advanced - Interceptors**
 
@@ -374,7 +298,7 @@ Now, copy this checklist and track your progress:
 
 ```text
 Config Generation Progress:
-- [ ] Step 0: Check workspace (if multi-user)
+- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
 - [ ] Step 1: Check if nel is installed
 - [ ] Step 2: Build the base config file
 - [ ] Step 3: Configure model path and parameters
diff --git a/.claude/skills/evaluation/evals/base-model-local-execution.json b/.claude/skills/evaluation/evals/base-model-local-execution.json
new file mode 100644
index 0000000000..6bb277fd4c
--- /dev/null
+++ b/.claude/skills/evaluation/evals/base-model-local-execution.json
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate Qwen/Qwen3-0.6B on standard benchmarks locally",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: execution=local, deployment=vllm, export=none, model_type=base, benchmarks=standard",
+    "Runs nel skills build-config --execution local --deployment vllm --model_type base --benchmarks standard",
+    "Sets deployment.hf_model_handle to Qwen/Qwen3-0.6B and deployment.checkpoint_path to null",
+    "No hf_quant_config.json since this is an HF hub model — no quantization flag needed",
+    "Searches web for Qwen3-0.6B model card to extract deployment settings",
+    "For local execution: no SLURM-specific config needed",
+    "Fills remaining ??? values",
+    "Shows task list for confirmation",
+    "Runs dry-run, then test with limit_samples=10, then full evaluation",
+    "Reports results"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/external-deployment-eval.json b/.claude/skills/evaluation/evals/external-deployment-eval.json
new file mode 100644
index 0000000000..e20286fb9a
--- /dev/null
+++ b/.claude/skills/evaluation/evals/external-deployment-eval.json
@@ -0,0 +1,17 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model that's already running at http://myserver:8000",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects deployment=None (External) since model is already deployed",
+    "Runs nel skills build-config with --deployment none",
+    "Configures target.api_endpoint with the user's existing server URL",
+    "Does NOT start a deployment — uses the external endpoint directly",
+    "api_key_name should already be defined for external deployment",
+    "Asks user for model type (base/chat/reasoning) and benchmark selection",
+    "Fills remaining config values",
+    "Runs dry-run, test, then full evaluation against the external endpoint",
+    "Reports results"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/interceptor-configuration.json b/.claude/skills/evaluation/evals/interceptor-configuration.json
new file mode 100644
index 0000000000..d2fbe1b78c
--- /dev/null
+++ b/.claude/skills/evaluation/evals/interceptor-configuration.json
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model and configure request/response logging interceptors",
+  "files": [],
+  "expected_behavior": [
+    "Follows standard evaluation workflow through Step 6",
+    "In Step 7 (Interceptors): tells user to see the interceptors documentation URL",
+    "Does NOT provide general information about interceptors without reading the docs",
+    "If user asks to configure logging interceptor: reads the interceptor webpage",
+    "Configures interceptor under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config)",
+    "Uses field names from CLI Configuration section after --overrides keyword",
+    "Does NOT define interceptors list directly (would override full chain with unintended consequences)",
+    "Uses correct field names: max_logged_requests and max_logged_responses (NOT max_saved_* or max_*)",
+    "Proceeds with evaluation after interceptor configuration"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/multi-node-evaluation.json b/.claude/skills/evaluation/evals/multi-node-evaluation.json
new file mode 100644
index 0000000000..3a60b4a38b
--- /dev/null
+++ b/.claude/skills/evaluation/evals/multi-node-evaluation.json
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate a 405B parameter model, I have 4 nodes available",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "In Step 6 (Advanced - Multi-node): asks which pattern applies",
+    "For a single 405B model: recommends Pattern B (multi-node single instance with Ray TP/PP)",
+    "Uses vllm_ray deployment config: defaults: [deployment: vllm_ray]",
+    "Sets execution.num_nodes: 2 or 4 depending on GPU memory",
+    "Configures deployment.tensor_parallel_size and pipeline_parallel_size",
+    "Explains: num_instances controls independent instances (with HAProxy), while this is single-instance across nodes",
+    "If user wants throughput AND cross-node: explains Pattern A+B combined",
+    "Notes: num_nodes must be divisible by num_instances",
+    "Proceeds with standard evaluation flow after multi-node config"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/nel-not-installed.json b/.claude/skills/evaluation/evals/nel-not-installed.json
new file mode 100644
index 0000000000..48e213bf69
--- /dev/null
+++ b/.claude/skills/evaluation/evals/nel-not-installed.json
@@ -0,0 +1,12 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my quantized model",
+  "files": [],
+  "expected_behavior": [
+    "Checks if nel is installed by running 'nel --version'",
+    "nel command not found or errors",
+    "Instructs user to install: pip install nemo-evaluator-launcher",
+    "Does NOT attempt to proceed without nel installed",
+    "After user installs, re-checks nel --version and proceeds with workflow"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json b/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
index 6fb32570eb..8f6ad62b9d 100644
--- a/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
+++ b/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
@@ -1,5 +1,5 @@
 {
-  "skills": ["nel-assistant"],
+  "skills": ["evaluation"],
   "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA",
   "files": [],
   "expected_behavior": [
diff --git a/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json b/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
new file mode 100644
index 0000000000..707c24c7fb
--- /dev/null
+++ b/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate accuracy of my NVFP4 quantized model at ./llama-nvfp4",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "Sets deployment.checkpoint_path to ./llama-nvfp4",
+    "Auto-detects quantization by reading ./llama-nvfp4/hf_quant_config.json",
+    "Finds quant_algo contains 'FP4' or 'NVFP4' and adds --quantization modelopt_fp4 to deployment.extra_args",
+    "Does NOT use --quantization modelopt (that's for FP8 only)",
+    "Recommends quantization-sensitive benchmarks: MMLU, GSM8K, ARC-Challenge",
+    "Mentions that NVFP4 inference requires Blackwell GPUs",
+    "Proceeds with standard evaluation flow"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
b/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json new file mode 100644 index 0000000000..be152ecbb0 --- /dev/null +++ b/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json @@ -0,0 +1,22 @@ +{ + "skills": ["evaluation"], + "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running 'nel --version'", + "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)", + "Runs 'nel skills build-config' with correct flags matching user answers", + "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null", + "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json", + "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args", + "Recommends accuracy-sensitive benchmarks: MMLU (always), GSM8K (math reasoning), ARC-Challenge", + "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings", + "Asks user for GPU count to set tensor_parallel_size", + "Fills in remaining ??? 
values by asking user", + "Shows task list and confirms with user", + "Runs dry-run first: nel run --config --dry-run", + "Then test run with limit_samples=10", + "Then full evaluation", + "Reports accuracy results per benchmark" + ] +} diff --git a/.claude/skills/evaluation/evals/reasoning-model-sglang.json b/.claude/skills/evaluation/evals/reasoning-model-sglang.json new file mode 100644 index 0000000000..8527c37a43 --- /dev/null +++ b/.claude/skills/evaluation/evals/reasoning-model-sglang.json @@ -0,0 +1,21 @@ +{ + "skills": ["evaluation"], + "query": "evaluate QwQ-32B reasoning model with math benchmarks on SLURM using SGLang", + "files": [], + "expected_behavior": [ + "Verifies nel is installed", + "Asks 5 base config questions — user selects: execution=slurm, deployment=sglang, model_type=reasoning, benchmarks=math_reasoning", + "Runs nel skills build-config with --execution slurm --deployment sglang --model_type reasoning --benchmarks math_reasoning", + "Searches web for QwQ-32B model card", + "Configures reasoning toggle: either via adapter_config.custom_system_prompt (/think, /no_think) or via adapter_config.params_to_add with chat_template_kwargs.enable_thinking", + "Sets higher max_new_tokens for reasoning (thinking tokens can be long)", + "Asks user about reasoning effort/budget if configurable", + "Configures SGLang-specific deployment settings", + "Asks user for SLURM hostname, account, partition, walltime", + "For nemo_skills.* tasks with self-deployment: adds target.api_endpoint.api_key_name: DUMMY_API_KEY", + "Disables reasoning for tasks where it's not needed (e.g., IFEval) using task-level overrides", + "Shows task list for confirmation", + "Exports DUMMY_API_KEY=dummy before running", + "Runs dry-run, test, then full evaluation" + ] +} diff --git a/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json b/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json new file mode 100644 index 0000000000..6f2cf96b5b --- 
/dev/null +++ b/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json @@ -0,0 +1,17 @@ +{ + "skills": ["evaluation"], + "query": "evaluate my chat model on safety and multilingual benchmarks", + "files": [], + "expected_behavior": [ + "Verifies nel is installed", + "Asks 5 base config questions — user selects: model_type=chat, benchmarks=safety multilingual", + "Runs nel skills build-config with --model_type chat --benchmarks safety multilingual", + "Safety benchmarks include: Garak and Safety Harness", + "Multilingual benchmarks include: MMATH, Global MMLU, MMLU-Prox", + "Searches web for model card to extract chat-specific settings (system prompt, sampling params)", + "Configures chat template and system prompt if needed", + "Shows task list including both safety and multilingual tasks", + "Allows user to add/remove tasks in Step 5 confirmation loop", + "Proceeds with evaluation" + ] +} diff --git a/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json b/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json new file mode 100644 index 0000000000..f4434bc3f2 --- /dev/null +++ b/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json @@ -0,0 +1,18 @@ +{ + "skills": ["evaluation"], + "query": "evaluate my model on code benchmarks and export results to Weights & Biases", + "files": [], + "expected_behavior": [ + "Verifies nel is installed", + "Asks 5 base config questions — user selects: export=wandb, benchmarks=code", + "Runs nel skills build-config with --export wandb --benchmarks code", + "Code benchmarks include: HumanEval, MBPP, LiveCodeBench", + "Asks user for wandb tracking URI and project name", + "Fills in wandb-specific config values", + "Asks if user wants to add wandb tags", + "Shows task list for confirmation", + "Runs dry-run to validate config including wandb connection", + "Proceeds with test and full evaluation", + "Results are exported to wandb automatically" + ] +} diff --git 
a/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json b/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json new file mode 100644 index 0000000000..35e88c3059 --- /dev/null +++ b/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json @@ -0,0 +1,15 @@ +{ + "skills": ["evaluation"], + "query": "evaluate the model I just quantized", + "files": [], + "expected_behavior": [ + "Checks if MODELOPT_WORKSPACE_ROOT is set", + "If set: reads skills/common/workspace-management.md", + "Lists existing workspaces and finds the one from prior PTQ step", + "Reuses the workspace to access the quantized checkpoint", + "Auto-detects quantization format from hf_quant_config.json in the checkpoint", + "Sets correct deployment.extra_args based on detected format (--quantization modelopt or modelopt_fp4)", + "Recommends quantization-sensitive benchmarks since this is a quantized model", + "Proceeds with standard evaluation workflow" + ] +} diff --git a/.claude/skills/evaluation/references/model-card-research.md b/.claude/skills/evaluation/references/model-card-research.md new file mode 100644 index 0000000000..4397f88736 --- /dev/null +++ b/.claude/skills/evaluation/references/model-card-research.md @@ -0,0 +1,30 @@ +# Model Card Research + +Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. 
Extract ALL relevant configurations: + +- Sampling params (`temperature`, `top_p`) +- Context length (`deployment.extra_args: "--max-model-len "`) +- TP/DP settings (to set them appropriately, use AskUserQuestion to ask how many GPUs the model will be deployed on) +- Reasoning config (if applicable): + - reasoning on/off: use either: + - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched) + - `adapter_config.params_to_add` as a payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) with no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched). + - reasoning effort/budget (if it's configurable, use AskUserQuestion to ask what reasoning effort they want) + - higher `max_new_tokens` + - etc. +- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command) +- Deployment-specific vLLM/SGLang versions (by default we use the latest Docker images, but you can control this with `deployment.image`; e.g. vLLM images newer than `vllm/vllm-openai:v0.11.0` stopped supporting the `rope-scaling` arg used by Qwen models) +- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures.
For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead: + - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3` + - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints +- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins): + - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command + - Use `curl` instead of `wget` as it's more widely available in Docker containers + - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py` + - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems) + - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation` +- Any other model-specific requirements + +Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)! + +Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations. diff --git a/.claude/skills/evaluation/references/multi-node.md b/.claude/skills/evaluation/references/multi-node.md new file mode 100644 index 0000000000..a7b9d27fbf --- /dev/null +++ b/.claude/skills/evaluation/references/multi-node.md @@ -0,0 +1,53 @@ +# Multi-Node Evaluation Patterns + +There are two multi-node patterns. Ask the user which applies: + +## Pattern A: Multi-instance (independent instances with HAProxy) + +Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances." 
+ +```yaml +execution: + num_nodes: 4 # Total nodes + num_instances: 4 # 4 independent instances → HAProxy auto-enabled +``` + +## Pattern B: Multi-node single instance (Ray TP/PP across nodes) + +When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config: + +```yaml +defaults: + - deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd) + +execution: + num_nodes: 2 # Single instance spanning 2 nodes + +deployment: + tensor_parallel_size: 8 + pipeline_parallel_size: 2 +``` + +## Pattern A+B combined: Multi-instance with multi-node instances + +For very large models needing both cross-node parallelism AND multiple instances: + +```yaml +defaults: + - deployment: vllm_ray + +execution: + num_nodes: 4 # Total nodes + num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled + +deployment: + tensor_parallel_size: 8 + pipeline_parallel_size: 2 +``` + +## Common Confusions + +- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance. +- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas). +- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance. +- `num_nodes` must be divisible by `num_instances`. 
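The divisibility rule and the global data-parallel arithmetic in the notes above can be sketched as a tiny check (a hypothetical helper for illustration only — NEL validates configs itself, and the function name is ours):

```python
def validate_multi_node(num_nodes: int, num_instances: int, data_parallel_size: int = 1) -> int:
    """Check a multi-node layout and return the global number of DP replicas.

    Illustrative only: NEL performs its own config validation.
    """
    if num_nodes % num_instances != 0:
        raise ValueError(
            f"num_nodes ({num_nodes}) must be divisible by num_instances ({num_instances})"
        )
    # Global data parallelism = independent instances x DP replicas per instance.
    return num_instances * data_parallel_size


# Pattern A+B combined: 4 nodes, 2 instances of 2 nodes each, 8 DP per instance.
print(validate_multi_node(num_nodes=4, num_instances=2, data_parallel_size=8))  # 16
```

The same helper rejects the invalid layout of 3 instances on 4 nodes, mirroring the constraint stated above.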
From 3eb159137395be04f2e05a9f2e168c14086dad74 Mon Sep 17 00:00:00 2001 From: Kai Xu Date: Sun, 29 Mar 2026 11:16:05 -0700 Subject: [PATCH 3/5] Remove extra eval files, keep 2 core scenarios Signed-off-by: Kai Xu --- .../evals/base-model-local-execution.json | 18 ---------------- .../evals/external-deployment-eval.json | 17 --------------- .../evals/interceptor-configuration.json | 16 -------------- .../evals/multi-node-evaluation.json | 18 ---------------- .../evaluation/evals/nel-not-installed.json | 12 ----------- .../evals/nvfp4-auto-detect-quantization.json | 16 -------------- .../evals/reasoning-model-sglang.json | 21 ------------------- .../evals/safety-multilingual-benchmarks.json | 17 --------------- .../evals/wandb-export-code-benchmarks.json | 18 ---------------- .../evals/workspace-reuse-from-ptq.json | 15 ------------- 10 files changed, 168 deletions(-) delete mode 100644 .claude/skills/evaluation/evals/base-model-local-execution.json delete mode 100644 .claude/skills/evaluation/evals/external-deployment-eval.json delete mode 100644 .claude/skills/evaluation/evals/interceptor-configuration.json delete mode 100644 .claude/skills/evaluation/evals/multi-node-evaluation.json delete mode 100644 .claude/skills/evaluation/evals/nel-not-installed.json delete mode 100644 .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json delete mode 100644 .claude/skills/evaluation/evals/reasoning-model-sglang.json delete mode 100644 .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json delete mode 100644 .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json delete mode 100644 .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json diff --git a/.claude/skills/evaluation/evals/base-model-local-execution.json b/.claude/skills/evaluation/evals/base-model-local-execution.json deleted file mode 100644 index 6bb277fd4c..0000000000 --- a/.claude/skills/evaluation/evals/base-model-local-execution.json +++ /dev/null @@ -1,18 +0,0 @@ -{ -
"skills": ["evaluation"], - "query": "evaluate Qwen/Qwen3-0.6B on standard benchmarks locally", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions — user selects: execution=local, deployment=vllm, export=none, model_type=base, benchmarks=standard", - "Runs nel skills build-config --execution local --deployment vllm --model_type base --benchmarks standard", - "Sets deployment.hf_model_handle to Qwen/Qwen3-0.6B and deployment.checkpoint_path to null", - "No hf_quant_config.json since this is an HF hub model — no quantization flag needed", - "Searches web for Qwen3-0.6B model card to extract deployment settings", - "For local execution: no SLURM-specific config needed", - "Fills remaining ??? values", - "Shows task list for confirmation", - "Runs dry-run, then test with limit_samples=10, then full evaluation", - "Reports results" - ] -} diff --git a/.claude/skills/evaluation/evals/external-deployment-eval.json b/.claude/skills/evaluation/evals/external-deployment-eval.json deleted file mode 100644 index e20286fb9a..0000000000 --- a/.claude/skills/evaluation/evals/external-deployment-eval.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my model that's already running at http://myserver:8000", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions — user selects deployment=None (External) since model is already deployed", - "Runs nel skills build-config with --deployment none", - "Configures target.api_endpoint with the user's existing server URL", - "Does NOT start a deployment — uses the external endpoint directly", - "api_key_name should already be defined for external deployment", - "Asks user for model type (base/chat/reasoning) and benchmark selection", - "Fills remaining config values", - "Runs dry-run, test, then full evaluation against the external endpoint", - "Reports results" - ] -} diff --git 
a/.claude/skills/evaluation/evals/interceptor-configuration.json b/.claude/skills/evaluation/evals/interceptor-configuration.json deleted file mode 100644 index d2fbe1b78c..0000000000 --- a/.claude/skills/evaluation/evals/interceptor-configuration.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my model and configure request/response logging interceptors", - "files": [], - "expected_behavior": [ - "Follows standard evaluation workflow through Step 6", - "In Step 7 (Interceptors): tells user to see the interceptors documentation URL", - "Does NOT provide general information about interceptors without reading the docs", - "If user asks to configure logging interceptor: reads the interceptor webpage", - "Configures interceptor under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config)", - "Uses field names from CLI Configuration section after --overrides keyword", - "Does NOT define interceptors list directly (would override full chain with unintended consequences)", - "Uses correct field names: max_logged_requests and max_logged_responses (NOT max_saved_* or max_*)", - "Proceeds with evaluation after interceptor configuration" - ] -} diff --git a/.claude/skills/evaluation/evals/multi-node-evaluation.json b/.claude/skills/evaluation/evals/multi-node-evaluation.json deleted file mode 100644 index 3a60b4a38b..0000000000 --- a/.claude/skills/evaluation/evals/multi-node-evaluation.json +++ /dev/null @@ -1,18 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate a 405B parameter model, I have 4 nodes available", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions", - "In Step 6 (Advanced - Multi-node): asks which pattern applies", - "For a single 405B model: recommends Pattern B (multi-node single instance with Ray TP/PP)", - "Uses vllm_ray deployment config: defaults: [deployment: vllm_ray]", - "Sets 
execution.num_nodes: 2 or 4 depending on GPU memory", - "Configures deployment.tensor_parallel_size and pipeline_parallel_size", - "Explains: num_instances controls independent instances (with HAProxy), while this is single-instance across nodes", - "If user wants throughput AND cross-node: explains Pattern A+B combined", - "Notes: num_nodes must be divisible by num_instances", - "Proceeds with standard evaluation flow after multi-node config" - ] -} diff --git a/.claude/skills/evaluation/evals/nel-not-installed.json b/.claude/skills/evaluation/evals/nel-not-installed.json deleted file mode 100644 index 48e213bf69..0000000000 --- a/.claude/skills/evaluation/evals/nel-not-installed.json +++ /dev/null @@ -1,12 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my quantized model", - "files": [], - "expected_behavior": [ - "Checks if nel is installed by running 'nel --version'", - "nel command not found or errors", - "Instructs user to install: pip install nemo-evaluator-launcher", - "Does NOT attempt to proceed without nel installed", - "After user installs, re-checks nel --version and proceeds with workflow" - ] -} diff --git a/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json b/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json deleted file mode 100644 index 707c24c7fb..0000000000 --- a/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate accuracy of my NVFP4 quantized model at ./llama-nvfp4", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions", - "Sets deployment.checkpoint_path to ./llama-nvfp4", - "Auto-detects quantization by reading ./llama-nvfp4/hf_quant_config.json", - "Finds quant_algo contains 'FP4' or 'NVFP4' and adds --quantization modelopt_fp4 to deployment.extra_args", - "Does NOT use --quantization modelopt (that's for FP8 only)", - "Recommends 
quantization-sensitive benchmarks: MMLU, GSM8K, ARC-Challenge", - "Mentions that NVFP4 inference requires Blackwell GPUs", - "Proceeds with standard evaluation flow" - ] -} diff --git a/.claude/skills/evaluation/evals/reasoning-model-sglang.json b/.claude/skills/evaluation/evals/reasoning-model-sglang.json deleted file mode 100644 index 8527c37a43..0000000000 --- a/.claude/skills/evaluation/evals/reasoning-model-sglang.json +++ /dev/null @@ -1,21 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate QwQ-32B reasoning model with math benchmarks on SLURM using SGLang", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions — user selects: execution=slurm, deployment=sglang, model_type=reasoning, benchmarks=math_reasoning", - "Runs nel skills build-config with --execution slurm --deployment sglang --model_type reasoning --benchmarks math_reasoning", - "Searches web for QwQ-32B model card", - "Configures reasoning toggle: either via adapter_config.custom_system_prompt (/think, /no_think) or via adapter_config.params_to_add with chat_template_kwargs.enable_thinking", - "Sets higher max_new_tokens for reasoning (thinking tokens can be long)", - "Asks user about reasoning effort/budget if configurable", - "Configures SGLang-specific deployment settings", - "Asks user for SLURM hostname, account, partition, walltime", - "For nemo_skills.* tasks with self-deployment: adds target.api_endpoint.api_key_name: DUMMY_API_KEY", - "Disables reasoning for tasks where it's not needed (e.g., IFEval) using task-level overrides", - "Shows task list for confirmation", - "Exports DUMMY_API_KEY=dummy before running", - "Runs dry-run, test, then full evaluation" - ] -} diff --git a/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json b/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json deleted file mode 100644 index 6f2cf96b5b..0000000000 --- 
a/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my chat model on safety and multilingual benchmarks", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions — user selects: model_type=chat, benchmarks=safety multilingual", - "Runs nel skills build-config with --model_type chat --benchmarks safety multilingual", - "Safety benchmarks include: Garak and Safety Harness", - "Multilingual benchmarks include: MMATH, Global MMLU, MMLU-Prox", - "Searches web for model card to extract chat-specific settings (system prompt, sampling params)", - "Configures chat template and system prompt if needed", - "Shows task list including both safety and multilingual tasks", - "Allows user to add/remove tasks in Step 5 confirmation loop", - "Proceeds with evaluation" - ] -} diff --git a/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json b/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json deleted file mode 100644 index f4434bc3f2..0000000000 --- a/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json +++ /dev/null @@ -1,18 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my model on code benchmarks and export results to Weights & Biases", - "files": [], - "expected_behavior": [ - "Verifies nel is installed", - "Asks 5 base config questions — user selects: export=wandb, benchmarks=code", - "Runs nel skills build-config with --export wandb --benchmarks code", - "Code benchmarks include: HumanEval, MBPP, LiveCodeBench", - "Asks user for wandb tracking URI and project name", - "Fills in wandb-specific config values", - "Asks if user wants to add wandb tags", - "Shows task list for confirmation", - "Runs dry-run to validate config including wandb connection", - "Proceeds with test and full evaluation", - "Results are exported to wandb automatically" - ] -} diff --git 
a/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json b/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json deleted file mode 100644 index 35e88c3059..0000000000 --- a/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate the model I just quantized", - "files": [], - "expected_behavior": [ - "Checks if MODELOPT_WORKSPACE_ROOT is set", - "If set: reads skills/common/workspace-management.md", - "Lists existing workspaces and finds the one from prior PTQ step", - "Reuses the workspace to access the quantized checkpoint", - "Auto-detects quantization format from hf_quant_config.json in the checkpoint", - "Sets correct deployment.extra_args based on detected format (--quantization modelopt or modelopt_fp4)", - "Recommends quantization-sensitive benchmarks since this is a quantized model", - "Proceeds with standard evaluation workflow" - ] -} From 99dac7d083fd32b020fc131e72268df0ef442ea6 Mon Sep 17 00:00:00 2001 From: Kai Xu Date: Tue, 31 Mar 2026 17:42:50 -0700 Subject: [PATCH 4/5] Address review comments Signed-off-by: Kai Xu --- .../{evals => tests}/nemotron3-nano-bf16-reasoning.json | 0 .../{evals => tests}/quantized-checkpoint-local-vllm.json | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename .claude/skills/evaluation/{evals => tests}/nemotron3-nano-bf16-reasoning.json (100%) rename .claude/skills/evaluation/{evals => tests}/quantized-checkpoint-local-vllm.json (100%) diff --git a/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json b/.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json similarity index 100% rename from .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json rename to .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json diff --git a/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json b/.claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json 
similarity index 100% rename from .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json rename to .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json From 56309a8abd2003e7644398c06b4c988ab08303a6 Mon Sep 17 00:00:00 2001 From: Kai Xu Date: Tue, 31 Mar 2026 23:25:58 -0700 Subject: [PATCH 5/5] Address review comments Signed-off-by: Kai Xu --- .claude/skills/evaluation/SKILL.md | 27 ++++---- .../references/quantization-benchmarks.md | 26 ++++++++ .claude/skills/evaluation/tests/evals.json | 65 +++++++++++++++++++ .../tests/nemotron3-nano-bf16-reasoning.json | 26 -------- .../quantized-checkpoint-local-vllm.json | 22 ------- 5 files changed, 103 insertions(+), 63 deletions(-) create mode 100644 .claude/skills/evaluation/references/quantization-benchmarks.md create mode 100644 .claude/skills/evaluation/tests/evals.json delete mode 100644 .claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json delete mode 100644 .claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index b4778b79af..957fbfc54e 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -1,6 +1,6 @@ --- name: evaluation -description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution. +description: Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. 
Not for quantizing models (use ptq) or deploying/serving models (use deployment). license: Apache-2.0 # Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073) # https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant @@ -21,7 +21,7 @@ If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md ```text Config Generation Progress: - [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set) -- [ ] Step 1: Check if nel is installed +- [ ] Step 1: Check if nel is installed and if user has existing config - [ ] Step 2: Build the base config file - [ ] Step 3: Configure model path and parameters - [ ] Step 4: Fill in remaining missing values @@ -31,11 +31,11 @@ Config Generation Progress: - [ ] Step 8: Run the evaluation ``` -**Step 1: Check if nel is installed** +**Step 1: Check prerequisites** -Test that `nel` is installed with `nel --version`. +Test that `nel` is installed with `nel --version`. If not, instruct the user to `pip install nemo-evaluator-launcher`. -If not, instruct the user to `pip install nemo-evaluator-launcher`. +If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running. **Step 2: Build the base config file** @@ -76,6 +76,8 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config. +> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead. 
+ When you have all the answers, run the script to build the base config: ```bash @@ -116,16 +118,11 @@ If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantiz If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed. -**Quantization-aware benchmark defaults:** - -When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss: +> **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below. -- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization) -- **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning) -- **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense) -- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4) +**Quantization-aware benchmark defaults:** -Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed. +When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include. Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm. @@ -191,7 +188,7 @@ Print the following commands to the user. 
Propose to execute them in order to co **Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands. ```bash -# If using pre_cmd or post_cmd: +# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands): export NEMO_EVALUATOR_TRUST_PRE_CMD=1 # If using nemo_skills.* tasks with self-deployment: @@ -299,7 +296,7 @@ Now, copy this checklist and track your progress: ```text Config Generation Progress: - [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set) -- [ ] Step 1: Check if nel is installed +- [ ] Step 1: Check if nel is installed and if user has existing config - [ ] Step 2: Build the base config file - [ ] Step 3: Configure model path and parameters - [ ] Step 4: Fill in remaining missing values diff --git a/.claude/skills/evaluation/references/quantization-benchmarks.md b/.claude/skills/evaluation/references/quantization-benchmarks.md new file mode 100644 index 0000000000..a0ca45453c --- /dev/null +++ b/.claude/skills/evaluation/references/quantization-benchmarks.md @@ -0,0 +1,26 @@ +# Quantization-Aware Benchmark Recommendations + +When evaluating a quantized checkpoint, prioritize benchmarks that are sensitive to precision loss. 
+ +## Sensitivity ranking + +| Priority | Benchmarks | Why | +|----------|-----------|-----| +| **Always include** | MMLU | General knowledge — typically shows measurable accuracy loss from quantization | +| **Recommended** | GSM8K, ARC-Challenge | Math reasoning and general reasoning — sensitive to precision loss | +| **Good to add** | HumanEval, Winogrande | Code generation and commonsense — catches subtle degradation | +| **Less useful for quant comparison** | IFEval | Instruction following — typically less affected, but worth including for aggressive quantization like FP4 | + +## Recommended sets by use case + +| Use case | Benchmarks | +|----------|-----------| +| Quick sanity check | MMLU | +| Standard quant validation | MMLU, GSM8K, ARC-Challenge | +| Thorough evaluation | MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande | +| Code-focused model | HumanEval, MBPP, MMLU | +| Reasoning model | GSM8K, MATH-500, GPQA, MMLU | + +## How to use + +Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed. 
diff --git a/.claude/skills/evaluation/tests/evals.json b/.claude/skills/evaluation/tests/evals.json new file mode 100644 index 0000000000..0f35dacd7a --- /dev/null +++ b/.claude/skills/evaluation/tests/evals.json @@ -0,0 +1,65 @@ +[ + { + "name": "nemotron3-nano-bf16-reasoning", + "skills": ["evaluation"], + "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running 'nel --version'", + "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks) before generating the config", + "Runs 'nel skills build-config' with correct flags matching user answers: --execution slurm --deployment vllm --model-type reasoning --benchmarks standard code math_reasoning --export mlflow", + "Searches the web for the model card on HuggingFace and extracts model-specific settings", + "Sets correct HF handle: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", + "Sets reasoning sampling params from model card: temperature=1.0, top_p=1.0", + "Configures reasoning toggle via params_to_add with chat_template_kwargs.enable_thinking (not via system prompt)", + "Disables reasoning for IFEval task using enable_thinking: false with use_system_prompt: false", + "Adds deployment.pre_cmd using curl (not wget) to download nano_v3_reasoning_parser.py from HuggingFace", + "Adds vLLM extra_args including --trust-remote-code, --reasoning-parser-plugin, --reasoning-parser nano_v3, --max-num-seqs 8", + "Pins vLLM image to v0.12.0 or later as required by model card", + "Adds target.api_endpoint.api_key_name: DUMMY_API_KEY for nemo_skills tasks with self-deployment", + "Fills in all ??? 
placeholders after asking the user for SLURM hostname, account, output_dir, MLflow tracking_uri, and experiment_name", + "Applies user-requested SLURM customizations: partition batch_short, walltime 00:20:00, MLflow tag scenario: demo", + "Presents task list and waits for user confirmation before proceeding", + "Configures request and response logging interceptors under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config using correct field names (max_logged_requests/max_logged_responses, not max_saved_*)", + "Handles dry-run failure for missing HF_TOKEN_FOR_GPQA_DIAMOND by offering to fix the config", + "Successfully submits test run with limit_samples=10 after dry-run passes", + "Provides monitoring commands (nel status, nel info --logs) and inspects server logs via SSH when asked" + ] + }, + { + "name": "quantized-checkpoint-local-vllm", + "skills": ["evaluation"], + "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running nel --version", + "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)", + "Runs nel skills build-config with correct flags matching user answers", + "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null", + "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json", + "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args", + "Recommends accuracy-sensitive benchmarks from references/quantization-benchmarks.md", + "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings", + "Fills in remaining missing values by asking user", + "Runs dry-run, then test with limit_samples=10, then full evaluation", + "Reports accuracy results per benchmark" + ] + }, + { + "name": "slurm-quantized-model", + "skills": ["evaluation"], + "query": 
"Evaluate my quantized Llama-3.1-8B-FP8 checkpoint on mmlu and gsm8k on the SLURM cluster", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running nel --version", + "Asks 5 base config questions with execution=slurm pre-selected based on user request", + "Runs nel skills build-config with --execution slurm --deployment vllm --benchmarks standard", + "Detects FP8 quantization from hf_quant_config.json and sets deployment.extra_args with --quantization modelopt", + "Reads references/quantization-benchmarks.md and recommends accuracy-sensitive benchmarks", + "Uses WebSearch to research model card for sampling params and context length", + "Fills in SLURM-specific values: hostname, account, partition from user input", + "Runs dry-run validation before full evaluation", + "Provides SSH-based log monitoring commands for SLURM execution" + ] + } +] diff --git a/.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json b/.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json deleted file mode 100644 index 8f6ad62b9d..0000000000 --- a/.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json +++ /dev/null @@ -1,26 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA", - "files": [], - "expected_behavior": [ - "Verifies nel is installed by running 'nel --version'", - "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks) before generating the config", - "Runs 'nel skills build-config' with correct flags matching user answers: --execution slurm --deployment vllm --model-type reasoning --benchmarks standard code math_reasoning --export mlflow", - "Searches the web for the model card on HuggingFace and extracts model-specific settings", - "Sets correct HF handle: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", - "Sets reasoning sampling params from model card: temperature=1.0, top_p=1.0", - "Configures reasoning toggle via params_to_add 
with chat_template_kwargs.enable_thinking (not via system prompt)", - "Disables reasoning for IFEval task using enable_thinking: false with use_system_prompt: false", - "Adds deployment.pre_cmd using curl (not wget) to download nano_v3_reasoning_parser.py from HuggingFace", - "Adds vLLM extra_args including --trust-remote-code, --reasoning-parser-plugin, --reasoning-parser nano_v3, --max-num-seqs 8", - "Pins vLLM image to v0.12.0 or later as required by model card", - "Adds target.api_endpoint.api_key_name: DUMMY_API_KEY for nemo_skills tasks with self-deployment", - "Fills in all ??? placeholders after asking the user for SLURM hostname, account, output_dir, MLflow tracking_uri, and experiment_name", - "Applies user-requested SLURM customizations: partition batch_short, walltime 00:20:00, MLflow tag scenario: demo", - "Presents task list and waits for user confirmation before proceeding", - "Configures request and response logging interceptors under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config using correct field names (max_logged_requests/max_logged_responses, not max_saved_*)", - "Handles dry-run failure for missing HF_TOKEN_FOR_GPQA_DIAMOND by offering to fix the config", - "Successfully submits test run with limit_samples=10 after dry-run passes", - "Provides monitoring commands (nel status, nel info --logs) and inspects server logs via SSH when asked" - ] -} diff --git a/.claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json b/.claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json deleted file mode 100644 index be152ecbb0..0000000000 --- a/.claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json +++ /dev/null @@ -1,22 +0,0 @@ -{ - "skills": ["evaluation"], - "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K", - "files": [], - "expected_behavior": [ - "Verifies nel is installed by running 'nel --version'", - "Asks all 5 base config questions 
(execution, deployment, auto-export, model type, benchmarks)", - "Runs 'nel skills build-config' with correct flags matching user answers", - "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null", - "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json", - "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args", - "Recommends accuracy-sensitive benchmarks: MMLU (always), GSM8K (math reasoning), ARC-Challenge", - "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings", - "Asks user for GPU count to set tensor_parallel_size", - "Fills in remaining ??? values by asking user", - "Shows task list and confirms with user", - "Runs dry-run first: nel run --config --dry-run", - "Then test run with limit_samples=10", - "Then full evaluation", - "Reports accuracy results per benchmark" - ] -}