A deterministic quality scorer for AI agent instruction files. The Codecov for your SKILL.md.
AI coding agents use instruction files — SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md — to define behavior. These files degrade silently: triggers overlap, instructions contradict, edge cases slip through. Schliff catches that before your users do.
Zero dependencies. No LLM needed. Same input, same score. Python 3.9+ stdlib only.
```bash
pip install schliff
schliff score path/to/SKILL.md
```

Deterministic static analysis. No LLM required. Same input, same output, every time.
| Dimension | Weight | What it catches |
|---|---|---|
| structure | 15% | Missing frontmatter, empty headers, no examples, dead content |
| triggers | 20% | Eval-suite trigger accuracy, false positives, missed activations |
| quality | 20% | Thin assertions, missing feature coverage, low coherence |
| edges | 15% | No edge cases defined, missing categories (invalid, scale, unicode) |
| efficiency | 10% | Hedging, filler words, repetition, low signal-to-noise |
| composability | 10% | Missing scope boundaries, no error behavior, no handoff points |
| clarity | 5% | Contradictions, vague references, ambiguous instructions |
| security | 8% | (opt-in via --security) Prompt injection, data exfiltration, obfuscation, dangerous commands |
| runtime | 10% | (opt-in) Actual Claude behavior against eval assertions |
Weights are renormalized across the measured dimensions so they sum to 1.0. Without `--runtime`, the 7 structural dimensions carry 100% of the score.
Grades: S (>=95) / A (>=85) / B (>=75) / C (>=65) / D (>=50) / E (>=35) / F (<35)
Full methodology and weight rationale: docs/SCORING.md
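The renormalization and grade mapping can be sketched in a few lines. This is an illustrative sketch using the weights and grade cutoffs from the tables above; the function names and shapes are assumptions, not Schliff's actual implementation:

```python
# Illustrative sketch, not Schliff's real code. Weights mirror the
# 7 structural dimensions in the table above.
WEIGHTS = {
    "structure": 0.15, "triggers": 0.20, "quality": 0.20, "edges": 0.15,
    "efficiency": 0.10, "composability": 0.10, "clarity": 0.05,
}

def composite(scores):
    """Weighted composite over whichever dimensions were measured,
    with weights renormalized so they sum to 1.0."""
    measured = {d: w for d, w in WEIGHTS.items() if d in scores}
    total = sum(measured.values())
    return sum(scores[d] * (w / total) for d, w in measured.items())

def grade(score):
    """Map a 0-100 composite to the letter grades listed above."""
    for cutoff, letter in [(95, "S"), (85, "A"), (75, "B"),
                           (65, "C"), (50, "D"), (35, "E")]:
        if score >= cutoff:
            return letter
    return "F"

# If every structural dimension scores 80, the composite is 80.
print(round(composite({d: 80 for d in WEIGHTS}), 1))  # 80.0
print(grade(80))  # B
```

Because weights are renormalized, a skill measured on only a subset of dimensions is still scored on a 0-100 scale.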
Schliff detects score inflation. The benchmark suite tests 6 common gaming patterns — all caught:
| Gaming attempt | How Schliff catches it |
|---|---|
| Empty headers (inflate structure) | Header content check — empty sections penalized |
| Keyword stuffing (inflate triggers) | Dedup + frequency cap on repeated terms |
| Copy-paste examples | Repeated-line detection — score drops 94 → 43 |
| Contradictory instructions | "always X" vs "never X" contradiction finder |
| Bloated preamble | Signal-to-noise ratio via sqrt density curve |
| Missing scope boundaries | 10 composability sub-checks, not a single binary |
Reproduce it yourself:

```bash
python benchmarks/anti-gaming/run.py
```
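To give a flavor of how one of these checks can work, here is a toy repeated-line detector in the spirit of the copy-paste check above. The function and its threshold are illustrative assumptions, not Schliff's actual rule:

```python
from collections import Counter

def repeated_line_ratio(text):
    """Fraction of substantive lines that duplicate an earlier line.
    Toy version of the copy-paste example detection described above;
    the 10-character cutoff is an arbitrary illustrative choice."""
    lines = [l.strip() for l in text.splitlines() if len(l.strip()) > 10]
    if not lines:
        return 0.0
    duplicates = sum(n - 1 for n in Counter(lines).values())
    return duplicates / len(lines)

pasted = "Example: run the deploy script\n" * 5
print(repeated_line_ratio(pasted))  # 0.8 (4 of 5 lines are copies)
```

A high ratio signals padded examples, which is why the pasted-examples benchmark drops from 94 to 43.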
```bash
pip install schliff                # or: pipx install schliff

schliff demo                       # see it in action instantly
schliff score path/to/SKILL.md     # score any skill file
schliff score CLAUDE.md            # works with any format
schliff score --url https://github.com/.../SKILL.md   # score remote files
schliff compare skill-v1.md skill-v2.md               # side-by-side comparison
schliff suggest path/to/SKILL.md   # ranked fixes with impact
schliff doctor                     # scan all installed skills
schliff report path/to/SKILL.md    # markdown report (+ --gist)
schliff drift --repo .             # find stale references
schliff sync .                     # cross-file consistency check
schliff track path/to/SKILL.md     # score history + sparkline
```

```bash
git clone https://github.com/Zandereins/schliff.git && bash schliff/install.sh
```

```
# Inside Claude Code:
/schliff:init path/to/SKILL.md   # bootstrap eval suite + baseline
/schliff:auto                    # patch → measure → keep or revert → repeat
```

Prerequisites: Python 3.9+, Bash, Git, jq
```
Write instruction file --> schliff score --> schliff suggest --> fix --> ship
      (any format)         (9 dimensions)    (ranked fixes)             │
                                                                        ↓
      schliff track   <--  schliff sync  <--  schliff drift  <----------┘
    (trend over time)      (cross-file)     (stale references)
```
Works with any AI coding agent: Claude Code (SKILL.md), Cursor (.cursorrules), GitHub Copilot (AGENTS.md), or project configs (CLAUDE.md). Schliff grinds instruction files to production quality.
| Skill | Before | After | Iterations | Author |
|---|---|---|---|---|
| agent-review-panel | 64.0 [D] | 85.6 [A] | 3 rounds | @wan-huiyan |
| shieldclaw (OpenClaw plugin) | 68.3 [C] | 94.6 [A] | 1 round | @Zandereins |
| demo skill (demo/bad-skill/) | 54.0 [D] | 98.3 [S] | 18 | @Zandereins |
The demo skill — a vague, hedging-filled deployment helper — goes from [D] to [S] in 18 autonomous iterations:
```
structure      70 → 100   Frontmatter, examples, concrete commands
triggers        0 → 100   Description keywords, negative boundaries
quality         0 →  95   Eval suite generated, assertions added
edges           0 → 100   Edge cases synthesized
efficiency     35 →  93   Hedging removed, information density up
composability  30 →  90   Scope boundaries, error behavior, deps
clarity        90 → 100   Vague references resolved
```
Real-world skills vary. Complex skills plateau around [A] to [S] depending on eval suite coverage.
ShieldClaw is a prompt injection defense plugin for the OpenClaw agent framework — not a Claude Code skill. Schliff scored its SKILL.md at 68.3 [C], and after one round of /schliff:auto, it reached 94.6 [A] while staying under 300 tokens. Adding an eval-suite unlocked 3 previously-unmeasured dimensions (triggers, quality, edges), which drove most of the 26-point gain.
Run schliff score on your skill and add your result.
"It's become a core part of my skill development workflow!" — @wan-huiyan
@wan-huiyan used schliff to improve agent-review-panel from 64 to 85.6 across three rounds. Along the way, SKILL.md went from 1,331 to 340 lines — a 75% token reduction via references/ extraction. A/B testing on a 1,132-line document confirmed identical review quality with fewer tokens.
- @wan-huiyan — agent-review-panel (64 → 85.6, 3 rounds)
- @Zandereins — shieldclaw, OpenClaw plugin (68.3 → 94.6, 1 round)
- Add your project
| Command | Purpose |
|---|---|
| `schliff demo` | Score a built-in bad skill — see schliff in action instantly |
| `schliff score <path>` | Score any instruction file (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md) |
| `schliff score --url <url>` | Score a remote file from GitHub (HTTPS-only) |
| `schliff score --security` | Include security dimension (injection, exfiltration, obfuscation) |
| `schliff compare <a> <b>` | Side-by-side quality comparison with dimension deltas |
| `schliff suggest <path>` | Ranked actionable fixes with estimated score impact |
| `schliff diff <path>` | Show score delta vs. previous commit (or any `--ref`) |
| `schliff verify <path>` | CI gate — exit 0/1, `--min-score`, `--regression`, pre-commit hook |
| `schliff doctor` | Scan all installed skills + instruction files, health grades, drift analysis |
| `schliff badge <path>` | Generate copy-paste markdown badge |
| `schliff report <path>` | Generate Markdown quality report (`--gist` for shareable link) |
| `schliff score --tokens` | Section-by-section token breakdown with format-specific budgets |
| `schliff drift --repo <dir>` | Find stale paths, scripts, and make targets in instruction files |
| `schliff sync <dir>` | Cross-file consistency: contradictions, gaps, redundancies |
| `schliff track <path>` | Score history over time with sparkline and regression detection |
| Command | Purpose |
|---|---|
| `/schliff:auto` | Autonomous improvement loop with EMA-based stopping |
| `/schliff:init <path>` | Bootstrap eval suite + baseline from any SKILL.md |
| `/schliff:analyze` | One-shot gap analysis with ranked fix recommendations |
| `/schliff:mesh` | Detect trigger conflicts across all installed skills |
| `/schliff:report` | Generate shareable markdown report with badge |
Score skills in CI. Block regressions. The Codecov for SKILL.md files.
```yaml
# GitHub Action
- uses: Zandereins/schliff@v7
  with:
    skill-path: '.claude/skills/my-skill/SKILL.md'
    minimum-score: '75'
    comment-on-pr: 'true'
```

```bash
# Or use the CLI directly
schliff verify path/to/SKILL.md --min-score 75 --regression
```

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v7.1.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']
```

- Playground — interactive browser-based scorer, try before installing
- Leaderboard — community scoreboard (scaffold, external storage coming)
Inspired by Karpathy's autoresearch — but Schliff is a linter, not a research loop. You can run schliff score in CI without ever touching the improvement loop.
| autoresearch | Schliff | |
|---|---|---|
| Target | ML training scripts | Claude Code SKILL.md files |
| Patches | 100% LLM-generated | 60-70% deterministic rules, 30-40% LLM |
| Scoring | 1 metric | 7 dimensions + optional runtime |
| Anti-gaming | None | 6 detection vectors |
| Memory | Stateless | Cross-session episodic store |
| Dependencies | External (ML frameworks) | Python 3.9+ stdlib only |
| Tests | Minimal | 732 unit + 99 integration |
Architecture — How the scoring engine and improvement loop connect (view diagram on GitHub)
The scorer is the ruler. Claude is the craftsman.
```mermaid
flowchart TB
    subgraph Scoring ["Scoring Engine (deterministic, no LLM)"]
        SKILL[SKILL.md + eval-suite.json] --> PARSE[Parse & Extract]
        PARSE --> S1[Structure]
        PARSE --> S2[Triggers]
        PARSE --> S3[Quality]
        PARSE --> S4[Edges]
        PARSE --> S5[Efficiency]
        PARSE --> S6[Composability]
        PARSE --> S7[Clarity]
        S1 & S2 & S3 & S4 & S5 & S6 & S7 --> COMPOSITE[Weighted Composite + Grade]
    end
    subgraph Loop ["Improvement Loop (Claude Code)"]
        COMPOSITE --> GRADIENT[Identify Weakest Dimension]
        GRADIENT --> MEMORY[(Episodic Memory)]
        MEMORY --> PREDICT[Predict Strategy Success]
        PREDICT --> PATCH[Generate Patch]
        PATCH --> APPLY[Apply + Re-score]
        APPLY -->|delta > 0| KEEP[Keep]
        APPLY -->|delta <= 0| REVERT[Revert]
        KEEP & REVERT --> EMA{EMA Plateau?}
        EMA -->|no| GRADIENT
        EMA -->|yes| DONE[Done]
    end
```
Note: Mermaid diagram renders on GitHub. On PyPI, view the repository for the visual.
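The EMA plateau check in the loop above can be sketched as follows. The smoothing factor and threshold here are illustrative values, not Schliff's actual defaults:

```python
def ema_plateau(score_deltas, alpha=0.3, threshold=0.5):
    """Return True when the exponential moving average of recent
    score deltas has flattened out (the loop's stopping signal).
    alpha and threshold are illustrative, not Schliff's defaults."""
    ema = score_deltas[0]
    for delta in score_deltas[1:]:
        ema = alpha * delta + (1 - alpha) * ema
    return abs(ema) < threshold

print(ema_plateau([5.0, 4.0, 3.0]))   # False: still improving
print(ema_plateau([0.2, 0.0, 0.1]))   # True: gains have plateaued
```

Smoothing over several iterations keeps one lucky patch from ending the loop early, and one dud from keeping it running forever.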
60-70% of patches follow deterministic rules (frontmatter fixes, noise removal, TODO cleanup, hedging elimination). The LLM handles the remaining 30-40% — structural reorganization, example generation, edge case synthesis.
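One of those deterministic rules, hedging elimination, can be sketched like this. The hedge list and function are illustrative; Schliff's actual pattern set is larger:

```python
import re

# Illustrative hedge list, not Schliff's actual pattern set.
HEDGES = re.compile(r"\b(perhaps|maybe|possibly|sort of|kind of|arguably)\b\s*",
                    re.IGNORECASE)

def strip_hedging(line):
    """Deterministic rule sketch: delete hedging filler, collapse spaces."""
    return re.sub(r"\s{2,}", " ", HEDGES.sub("", line)).strip()

print(strip_hedging("You should maybe possibly run the tests first."))
# You should run the tests first.
```

Rules like this are cheap, reversible, and need no LLM call, which is why they handle the bulk of the patches.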
The structural score measures file organization, not runtime effectiveness. A skill scoring 95/100 structurally can still produce wrong output at runtime — use --runtime scoring for that.
The trigger scorer uses TF-IDF heuristics. Skills whose domain vocabulary overlaps with generic terms (e.g., "review", "analyze") may hit a precision ceiling around 75-80. Precision/recall reporting helps diagnose this.
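The precision ceiling falls out of the math. A toy TF-IDF calculation (a minimal sketch under standard TF-IDF definitions; Schliff's actual heuristics differ) shows why generic terms earn little trigger weight:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Toy TF-IDF: terms common across the corpus earn a low idf,
    so they contribute little to trigger precision."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(term in d for d in corpus)
    return tf * math.log(len(corpus) / (1 + df))

corpus = [
    ["review", "code", "deploy"],
    ["review", "analyze", "test"],
    ["review", "kubernetes", "helm"],
]
# "review" appears in every doc, so its idf is at or below zero;
# "kubernetes" is domain-specific and scores higher.
print(tf_idf("kubernetes", corpus[2], corpus) > tf_idf("review", corpus[2], corpus))
```

A skill whose triggers are all words like "review" has nothing distinctive for the scorer to reward, hence the 75-80 ceiling.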
[![Schliff: 99 [S]](https://img.shields.io/badge/Schliff-99%2F100_%5BS%5D-brightgreen)](https://github.com/Zandereins/schliff)

Found a scoring bug? Add a test case and open an issue.
Want to improve scoring logic? Edit the relevant `scoring/*.py`, run `bash scripts/test-integration.sh`, and PR the diff.
MIT
schliff (German) — the finishing cut. "Den letzten Schliff geben" = to give something its final polish.
![schliff score: bad skill [D] vs production skill [S]](https://raw.githubusercontent.com/Zandereins/schliff/main/demo/schliff-demo.gif?v=10)