Merged (24 commits)
- 4c17dfc: Add Claude Code PTQ skill for agent-assisted model quantization (mxinO, Mar 25, 2026)
- 70bb65f: Update workspace management to support single-user case (mxinO, Mar 25, 2026)
- d8e1d37: Reference workspace management in PTQ skill (mxinO, Mar 25, 2026)
- 60e654c: Improve step 4 routing and use nvidia-smi for GPU detection (mxinO, Mar 25, 2026)
- 0667f13: update launcher guide (mxinO, Mar 25, 2026)
- 1b0b346: clear remote path (mxinO, Mar 25, 2026)
- e796a64: bridge config to launcher (mxinO, Mar 25, 2026)
- b5bd2b4: restructure the decision tree (mxinO, Mar 25, 2026)
- 2b99eb5: bridge partiton (mxinO, Mar 25, 2026)
- 630a954: optimize flow (mxinO, Mar 26, 2026)
- 21f68b9: unsupported -> unlisted (mxinO, Mar 26, 2026)
- 5a4cb26: remote vs local path (mxinO, Mar 26, 2026)
- 84ad824: slurm single node fix (mxinO, Mar 26, 2026)
- 87988e8: Merge remote-tracking branch 'origin/main' into mxin/agent-ptq (mxinO, Mar 26, 2026)
- fc457fc: update for unlisted model (mxinO, Apr 1, 2026)
- fe8516c: add tests (mxinO, Apr 1, 2026)
- d8f1764: address comments (mxinO, Apr 1, 2026)
- 448091b: Add FakeUnsupported-0.6B test case for custom module patching (mxinO, Apr 2, 2026)
- 6bd0609: Merge remote-tracking branch 'origin/main' into mxin/agent-ptq (mxinO, Apr 2, 2026)
- bf3f574: Simplify MoE Pattern 2: table of plugin examples instead of inline code (mxinO, Apr 2, 2026)
- 9cec7ba: Add PTQ skill to changelog (mxinO, Apr 2, 2026)
- f612c66: Prefer patching ModelOpt plugin over custom scripts for unlisted models (mxinO, Apr 2, 2026)
- 8a3de6e: Mark PTQ skill as early testing in changelog (mxinO, Apr 3, 2026)
- 8e73642: Reject tilde in workspace path - won't expand remotely (mxinO, Apr 3, 2026)
18 changes: 18 additions & 0 deletions .claude/clusters.yaml.example
@@ -0,0 +1,18 @@
# ModelOpt Remote Cluster Configuration
# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
# or .claude/clusters.yaml (project-level, can be committed).

clusters:
  # GPU workstation or SLURM login node
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    ssh_key: ~/.ssh/id_rsa
    # ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # optional
    workspace: /path/to/remote/workdir
    gpu_type: H100  # used for quantization format recommendation
    # slurm:
    #   default_account: my_account
    #   default_partition: batch_short

default_cluster: my-cluster
80 changes: 80 additions & 0 deletions .claude/skills/common/environment-setup.md
@@ -0,0 +1,80 @@
# Environment Setup

Common environment detection for all ModelOpt skills. After completing these steps, you know what compute and tooling are available.

## Env-1. Get ModelOpt source

```bash
ls examples/llm_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
```

If not found: `git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer`

If found, ensure the source is up to date:

```bash
git pull origin main
```

If previous runs left patches in `modelopt/` (from 4C unlisted model work), check whether they should be kept. Reset only if starting a completely new task: `git checkout main`.

## Env-2. Local or remote?

1. **User explicitly requests local or remote** → follow the user's choice
2. **User doesn't specify** → check for cluster config:

```bash
cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
```

If a cluster config exists and has content → **use the remote cluster**. Do not fall back to local even if local GPUs are available; a cluster config indicates the user's preferred execution environment. Otherwise → **local execution**.
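The decision above can be sketched as a small helper. This is a sketch only: the config paths come from this document, but `decide_target` is a hypothetical function, not part of `remote_exec.sh`.

```shell
#!/usr/bin/env bash
# Decide the execution target: "remote:<config>" if a non-empty cluster
# config is found (first match wins), otherwise "local".
# Sketch only; `decide_target` is hypothetical, not a real helper.
decide_target() {
  local cfg
  for cfg in "$@"; do
    # -s: file exists and is non-empty
    if [ -s "$cfg" ]; then
      echo "remote:$cfg"
      return 0
    fi
  done
  echo "local"
}

# Demo with temp files standing in for the two real config locations.
tmp=$(mktemp -d)
echo "clusters: {}" > "$tmp/clusters.yaml"
decide_target "$tmp/missing.yaml" "$tmp/clusters.yaml"   # remote:.../clusters.yaml
decide_target "$tmp/missing.yaml"                        # local
```

In real use the arguments would be `"$HOME/.config/modelopt/clusters.yaml" ".claude/clusters.yaml"`, in that order.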

For remote, connect:

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>
remote_check_ssh
remote_detect_env # sets REMOTE_ENV_TYPE = slurm / docker / bare
```

If remote but no config, ask user for: hostname, SSH username, SSH key path, remote workdir. Create `~/.config/modelopt/clusters.yaml` (see `skills/common/remote-execution.md` for format).

## Env-3. What compute is available?

Run on the **target machine** (local, or via `remote_run` if remote):

```bash
which srun sbatch 2>/dev/null && echo "SLURM"
docker info 2>/dev/null | grep -qi nvidia && echo "Docker+GPU"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null
```

Also check:

```bash
ls tools/launcher/launch.py 2>/dev/null && echo "Launcher available"
```
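The checks above can be folded into one classification helper, mirroring what `remote_detect_env` does on the remote side. A sketch only: `classify_env` is a hypothetical local analogue, and it checks in the same priority order the skill uses (SLURM, then Docker+GPU, then bare GPU).

```shell
# Classify the target machine's compute environment.
# Sketch only; `classify_env` is hypothetical, not part of remote_exec.sh.
classify_env() {
  if command -v sbatch >/dev/null 2>&1; then
    echo "slurm"            # SLURM scheduler present
  elif docker info 2>/dev/null | grep -qi nvidia; then
    echo "docker"           # Docker with NVIDIA runtime
  elif nvidia-smi >/dev/null 2>&1; then
    echo "bare"             # bare-metal GPU
  else
    echo "none"             # no usable compute found
  fi
}

classify_env
```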

**No GPU detected?**

- If local with no GPU and no cluster config → ask the user:
*"No local GPU detected. Do you have a remote machine or cluster with GPUs? If so, I'll need connection details (hostname, SSH username, key path, remote workdir) to run there."*
- If user provides remote info → create `clusters.yaml`, go back to Env-2
- If user has no GPU anywhere → **stop**: this task requires a CUDA GPU

## Summary

After this, you should know:

- ModelOpt source location
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher availability
- GPU model and count

Return to the skill's SKILL.md for the execution path based on these results.

## Multi-user / Slack bot

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md` before proceeding.
147 changes: 147 additions & 0 deletions .claude/skills/common/remote-execution.md
@@ -0,0 +1,147 @@
# Remote Execution

Read this when Claude Code runs on a different machine than the target GPU cluster/workstation. This covers SSH connectivity, cluster config, persistent sessions, and remote command execution.

---

## 1. Cluster Config

Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  my-cluster:
    login_node: cluster-login.example.com  # SSH hostname or SSH config alias
    user: username                         # SSH user
    ssh_key: ~/.ssh/id_rsa                 # (optional) SSH key path
    ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # (optional) proxy
    workspace: /absolute/path/to/workdir   # Remote working directory
    gpu_type: H100                         # For quant format recommendation
    slurm:                                 # (optional) pre-fill SLURM defaults
      default_account: my_account
      default_partition: batch_short

default_cluster: my-cluster
```
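For a quick sanity check of a config file, key fields can be pulled out with standard tools; no YAML library is needed for this flat layout. A sketch only: this naive `grep`/`sed` works for the simple format above, not arbitrary YAML, and the real parsing in `remote_exec.sh` is authoritative.

```shell
# Extract a scalar field from the simple clusters.yaml layout.
# Sketch only; handles flat "key: value" lines, strips trailing comments.
yaml_field() {  # yaml_field <file> <key>
  grep -E "^[[:space:]]*$2:" "$1" | head -1 \
    | sed -e "s/^[^:]*:[[:space:]]*//" -e "s/[[:space:]]*#.*$//"
}

cfg=$(mktemp)
cat > "$cfg" <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
    user: username
    workspace: /absolute/path/to/workdir
default_cluster: my-cluster
EOF

yaml_field "$cfg" login_node        # cluster-login.example.com
yaml_field "$cfg" default_cluster   # my-cluster
```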

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---

## 2. Connect and Establish Persistent Session

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name> # or omit name to use default_cluster
remote_check_ssh # validates connectivity + starts persistent session
```

`remote_check_ssh` starts an SSH **ControlMaster** connection. All subsequent `remote_run` / `remote_sync_*` / SCP calls reuse this single connection:

- ~180ms per command (vs 5-15s per new connection)
- Eliminates flaky proxy timeouts
- Auto-cleaned up when the shell exits
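Under the hood, a persistent session is just a set of SSH multiplexing options. The block below sketches the mechanism with illustrative values; the exact flags `remote_exec.sh` uses may differ.

```shell
# SSH multiplexing options behind a persistent session (illustrative values).
CTRL_DIR=${TMPDIR:-/tmp}
SSH_OPTS=(
  -o ControlMaster=auto                    # first call opens the master
  -o ControlPath="$CTRL_DIR/ssh-%r@%h:%p"  # socket shared by later calls
  -o ControlPersist=600                    # keep master alive 10 min idle
)
# Every later ssh/scp/rsync invocation with the same ControlPath reuses the
# open TCP connection, which is where the ~180ms-per-command figure comes from:
#   ssh "${SSH_OPTS[@]}" user@host "nvidia-smi"
echo "${SSH_OPTS[@]}"
```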

---

## 3. Detect Remote Environment

```bash
remote_detect_env
```

Auto-discovers whether the remote has SLURM, Docker, or bare-metal GPUs. Sets `REMOTE_ENV_TYPE` to `slurm`, `docker`, `bare`, or `unknown`.

After detection, proceed with the environment-specific setup:

- **SLURM** → prefix all commands with `remote_run`. For SLURM job scripts, see the skill's own references.
- **Docker** → use `remote_docker_run <container> "<command>"`
- **Bare metal** → use `remote_run` directly

---

## 4. Running Commands Remotely

### Single commands

```bash
remote_run "nvidia-smi"
remote_run "python --version"
remote_run "sbatch /path/to/job.sh"
```

`remote_run` uses base64 encoding internally, so special characters (`%`, `$`, quotes) work without escaping. It retries up to 3 times on SSH failures.
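The base64 trick can be demonstrated locally: the command text is never interpreted by the outer shell, so quoting survives intact. A sketch of the mechanism only; `remote_run` adds the SSH transport and retries on top.

```shell
# How remote_run dodges quoting problems: ship the command as base64 and
# decode it on the far side. Demonstrated locally, with plain `bash`
# standing in for `ssh user@host bash`.
cmd="printf '%s\n' \"it's 100% safe\""
b64=$(printf '%s' "$cmd" | base64 | tr -d '\n')
# On a real remote, the decode-and-run happens on the far side of the SSH hop.
printf '%s' "$b64" | base64 -d | bash   # prints: it's 100% safe
```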

### Syncing files

```bash
# Local → remote
remote_sync_to /local/path remote_subdir

# Remote → local
remote_sync_from remote_subdir /local/path
```

Both use rsync over the persistent SSH session with default excludes (`.git`, `__pycache__`, `.claude`, `*.pyc`, `node_modules`, `*.egg-info`). The `.claude` directory is intentionally excluded — skills and config should not be synced to the remote machine.
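The shape of the underlying rsync call can be sketched as below. The command is printed rather than run, the flags and paths are illustrative placeholders, and only the exclude list comes from this document.

```shell
# Assemble the default exclude flags used by the sync helpers and show the
# resulting rsync invocation. Sketch only; flags and paths are placeholders.
EXCLUDES=(.git __pycache__ .claude '*.pyc' node_modules '*.egg-info')
RSYNC_ARGS=(-az)
for pat in "${EXCLUDES[@]}"; do
  RSYNC_ARGS+=(--exclude="$pat")
done
echo rsync "${RSYNC_ARGS[@]}" /local/path/ "user@host:/remote/workdir/remote_subdir/"
```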

### SCP (alternative to rsync)

SCP also reuses the persistent session automatically via ControlMaster:

```bash
scp /local/script.sh ${REMOTE_USER}@${REMOTE_HOST}:/remote/path/
```

---

## 5. The Two-Script Pattern

When submitting SLURM jobs remotely, write **two files** locally to avoid shell escaping issues:

1. **SLURM wrapper** (e.g., `job_slurm.sh`) — `#SBATCH` directives + `srun` with container
2. **Inner runner** (e.g., `run.sh`) — the actual work (runs inside the container)

Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
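A minimal version of the two files, written locally before upload, might look like this. All SLURM values, container paths, and mount points below are placeholders, and the `srun --container-*` flags assume a pyxis-style setup.

```shell
# Two-script pattern: SLURM wrapper + inner runner, both written locally so
# no shell escaping happens over SSH. All values below are placeholders.
workdir=$(mktemp -d)
mkdir -p "$workdir/scripts"

cat > "$workdir/scripts/job_slurm.sh" <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --partition=batch_short
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00
# The wrapper stays trivial: launch the inner script inside the container.
srun --container-image=/path/to/image.sqsh \
     --container-mounts=/lustre/fs1/workdir:/workspace \
     bash /workspace/scripts/run.sh
EOF

cat > "$workdir/scripts/run.sh" <<'EOF'
#!/bin/bash
set -euo pipefail
# The actual work happens here, inside the container.
cd /workspace
python examples/llm_ptq/hf_ptq.py --help
EOF

chmod +x "$workdir/scripts/job_slurm.sh" "$workdir/scripts/run.sh"
ls "$workdir/scripts"
```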

---

## 6. Verifying Results Remotely

```bash
remote_run "ls -lh <output_path>/"
remote_run "cat <output_path>/hf_quant_config.json"
```

Or fetch results to local:

```bash
remote_sync_from <remote_output_subdir> /local/output/
```

---

## 7. Troubleshooting

| Problem | Cause | Fix |
| ------- | ----- | --- |
| `Connection timed out during banner exchange` | Proxy/login node overloaded | `remote_run` retries 3x automatically; the persistent session avoids repeated handshakes |
| SSH proxy completely unreachable (`Network is unreachable`) | VPN/proxy host is down or not running on this machine | Check if VPN is connected; verify `socat`/proxy service is running locally; try direct SSH by temporarily removing `ssh_proxy` from config |
| `unix_listener: cannot bind to path ... Read-only file system` | SSH ControlMaster socket in non-writable `/tmp` | `remote_exec.sh` auto-finds writable dir; ensure `TMPDIR` or `/tmp/claude-*` exists |
| `cd: /home/user/~/path: No such file or directory` | `~` not expanding on remote | Use absolute paths in `workspace` config, not `~/...` |
| Login nodes resolve home dirs differently | Symlinked home dirs vary by node | Use absolute lustre/NFS paths (e.g., `/lustre/fs1/...`) in job scripts |
| `#!` becomes `#\!` in scripts | Shell environment mangles shebang | Fix with `sed -i 's\|^#\\!\|#!\|' script.sh` after writing |

## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config