Schliff

A deterministic quality scorer for AI agent instruction files. The Codecov for your SKILL.md.


AI coding agents use instruction files — SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md — to define behavior. These files degrade silently: triggers overlap, instructions contradict, edge cases slip through. Schliff catches that before your users do.

Zero dependencies. No LLM needed. Same input, same score. Python 3.9+ stdlib only.

```bash
pip install schliff
schliff score path/to/SKILL.md
```

*schliff score: bad skill [D] vs production skill [S]*


Scoring

Deterministic static analysis. No LLM required. Same input, same output, every time.

| Dimension | Weight | What it catches |
|---|---|---|
| structure | 15% | Missing frontmatter, empty headers, no examples, dead content |
| triggers | 20% | Eval-suite trigger accuracy, false positives, missed activations |
| quality | 20% | Thin assertions, missing feature coverage, low coherence |
| edges | 15% | No edge cases defined, missing categories (invalid, scale, unicode) |
| efficiency | 10% | Hedging, filler words, repetition, low signal-to-noise |
| composability | 10% | Missing scope boundaries, no error behavior, no handoff points |
| clarity | 5% | Contradictions, vague references, ambiguous instructions |
| security | 8% (opt-in via `--security`) | Prompt injection, data exfiltration, obfuscation, dangerous commands |
| runtime | 10% (opt-in) | Actual Claude behavior against eval assertions |

Weights are renormalized across the measured dimensions so they sum to 1.0. Without `--runtime`, the 7 structural dimensions carry 100% of the score.
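As a sketch of that renormalization (the weight values come from the table above; the function and variable names are illustrative, not Schliff's internals):

```python
# Base weights from the dimensions table above. Names are illustrative,
# not taken from Schliff's source.
BASE_WEIGHTS = {
    "structure": 0.15, "triggers": 0.20, "quality": 0.20,
    "edges": 0.15, "efficiency": 0.10, "composability": 0.10,
    "clarity": 0.05, "security": 0.08, "runtime": 0.10,
}

def renormalize(measured):
    """Scale the weights of the measured dimensions so they sum to 1.0."""
    total = sum(BASE_WEIGHTS[d] for d in measured)
    return {d: BASE_WEIGHTS[d] / total for d in measured}

structural = ["structure", "triggers", "quality", "edges",
              "efficiency", "composability", "clarity"]
weights = renormalize(structural)
# With only the 7 structural dimensions measured (base sum 0.95),
# triggers scales from 0.20 to 0.20 / 0.95, roughly 0.2105.
```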

Grades: S (>=95) / A (>=85) / B (>=75) / C (>=65) / D (>=50) / E (>=35) / F (<35)
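The grade thresholds above map onto a composite score like this (a minimal sketch assuming simple floor comparisons, which is all the listed cutoffs require):

```python
def grade(score):
    """Map a 0-100 composite score to a letter grade using the
    thresholds listed above."""
    for letter, floor in [("S", 95), ("A", 85), ("B", 75),
                          ("C", 65), ("D", 50), ("E", 35)]:
        if score >= floor:
            return letter
    return "F"
```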

Full methodology and weight rationale: docs/SCORING.md


Anti-Gaming

Schliff detects score inflation. The benchmark suite tests 6 common gaming patterns — all caught:

| Gaming attempt | How Schliff catches it |
|---|---|
| Empty headers (inflate structure) | Header content check — empty sections penalized |
| Keyword stuffing (inflate triggers) | Dedup + frequency cap on repeated terms |
| Copy-paste examples | Repeated-line detection — score drops 94 → 43 |
| Contradictory instructions | "always X" vs "never X" contradiction finder |
| Bloated preamble | Signal-to-noise ratio via sqrt density curve |
| Missing scope boundaries | 10 composability sub-checks, not a single binary |

Reproduce: `python benchmarks/anti-gaming/run.py`
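One of the checks above, repeated-line detection, can be approximated in a few lines (an illustrative stand-in, not Schliff's actual heuristic or threshold):

```python
from collections import Counter

def repeated_lines(text, threshold=3):
    """Flag non-empty lines that repeat verbatim -- a rough stand-in for
    the copy-paste-example check. The threshold is an assumed value."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    return {ln: n for ln, n in counts.items() if n >= threshold}
```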


Quick Start

Score any instruction file (no Claude Code needed)

```bash
pip install schliff          # or: pipx install schliff
schliff demo                                        # see it in action instantly
schliff score path/to/SKILL.md                      # score any skill file
schliff score CLAUDE.md                             # works with any format
schliff score --url https://github.com/.../SKILL.md # score remote files
schliff compare skill-v1.md skill-v2.md             # side-by-side comparison
schliff suggest path/to/SKILL.md                    # ranked fixes with impact
schliff doctor                                      # scan all installed skills
schliff report path/to/SKILL.md                     # markdown report (+ --gist)
schliff drift --repo .                              # find stale references
schliff sync .                                      # cross-file consistency check
schliff track path/to/SKILL.md                      # score history + sparkline
```

Autonomous improvement (requires Claude Code)

```bash
git clone https://github.com/Zandereins/schliff.git && bash schliff/install.sh
```

```text
# Inside Claude Code:
/schliff:init path/to/SKILL.md    # bootstrap eval suite + baseline
/schliff:auto                     # patch → measure → keep or revert → repeat
```

Prerequisites: Python 3.9+, Bash, Git, jq

Where Schliff fits

```text
Write instruction file  -->  schliff score  -->  schliff suggest  -->  fix  -->  ship
     (any format)            (9 dimensions)      (ranked fixes)        │
                                                                       ↓
                             schliff track  <--  schliff sync  <--  schliff drift
                             (trend over time)   (cross-file)    (stale references)
```

Works with any AI coding agent: Claude Code (SKILL.md), Cursor (.cursorrules), GitHub Copilot (AGENTS.md), or project configs (CLAUDE.md). Schliff grinds instruction files to production quality.


Results

| Skill | Before | After | Iterations | Author |
|---|---|---|---|---|
| agent-review-panel | 64.0 [D] | 85.6 [A] | 3 rounds | @wan-huiyan |
| shieldclaw (OpenClaw plugin) | 68.3 [C] | 94.6 [A] | 1 round | @Zandereins |
| demo skill (demo/bad-skill/) | 54.0 [D] | 98.3 [S] | 18 | @Zandereins |

The demo skill — a vague, hedging-filled deployment helper — goes from [D] to [S] in 18 autonomous iterations:

```text
structure         70 → 100     Frontmatter, examples, concrete commands
triggers           0 → 100     Description keywords, negative boundaries
quality            0 → 95      Eval suite generated, assertions added
edges              0 → 100     Edge cases synthesized
efficiency        35 → 93      Hedging removed, information density up
composability     30 → 90      Scope boundaries, error behavior, deps
clarity           90 → 100     Vague references resolved
```

Real-world skills vary. Complex skills plateau around [A] to [S] depending on eval suite coverage.

ShieldClaw is a prompt injection defense plugin for the OpenClaw agent framework — not a Claude Code skill. Schliff scored its SKILL.md at 68.3 [C]; after one round of `/schliff:auto` it reached 94.6 [A] while staying under 300 tokens. Adding an eval suite unlocked three previously unmeasured dimensions (triggers, quality, edges), which drove most of the 26-point gain.

Run `schliff score` on your skill and add your result.

Community

> "It's become a core part of my skill development workflow!" — @wan-huiyan

@wan-huiyan used schliff to improve agent-review-panel from 64 to 85.6 across three rounds. Along the way, SKILL.md went from 1,331 to 340 lines — a 75% token reduction via references/ extraction. A/B testing on a 1,132-line document confirmed identical review quality with fewer tokens.

Used by


Commands

CLI (standalone — pip install schliff)

| Command | Purpose |
|---|---|
| `schliff demo` | Score a built-in bad skill — see schliff in action instantly |
| `schliff score <path>` | Score any instruction file (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md) |
| `schliff score --url <url>` | Score a remote file from GitHub (HTTPS-only) |
| `schliff score --security` | Include security dimension (injection, exfiltration, obfuscation) |
| `schliff compare <a> <b>` | Side-by-side quality comparison with dimension deltas |
| `schliff suggest <path>` | Ranked actionable fixes with estimated score impact |
| `schliff diff <path>` | Show score delta vs. previous commit (or any `--ref`) |
| `schliff verify <path>` | CI gate — exit 0/1, `--min-score`, `--regression`, pre-commit hook |
| `schliff doctor` | Scan all installed skills + instruction files, health grades, drift analysis |
| `schliff badge <path>` | Generate copy-paste markdown badge |
| `schliff report <path>` | Generate Markdown quality report (`--gist` for shareable link) |
| `schliff score --tokens` | Section-by-section token breakdown with format-specific budgets |
| `schliff drift --repo <dir>` | Find stale paths, scripts, and make targets in instruction files |
| `schliff sync <dir>` | Cross-file consistency: contradictions, gaps, redundancies |
| `schliff track <path>` | Score history over time with sparkline and regression detection |

Claude Code skills (require integration)

| Command | Purpose |
|---|---|
| `/schliff:auto` | Autonomous improvement loop with EMA-based stopping |
| `/schliff:init <path>` | Bootstrap eval suite + baseline from any SKILL.md |
| `/schliff:analyze` | One-shot gap analysis with ranked fix recommendations |
| `/schliff:mesh` | Detect trigger conflicts across all installed skills |
| `/schliff:report` | Generate shareable markdown report with badge |

CI Integration

Score skills in CI. Block regressions. The Codecov for SKILL.md files.

```yaml
# GitHub Action
- uses: Zandereins/schliff@v7
  with:
    skill-path: '.claude/skills/my-skill/SKILL.md'
    minimum-score: '75'
    comment-on-pr: 'true'
```

```bash
# Or use the CLI directly
schliff verify path/to/SKILL.md --min-score 75 --regression
```

Pre-commit Hook

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v7.1.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']
```

Web

  • Playground — interactive browser-based scorer, try before installing
  • Leaderboard — community scoreboard (scaffold, external storage coming)

How it differs from autoresearch

Inspired by Karpathy's autoresearch — but Schliff is a linter, not a research loop. You can run schliff score in CI without ever touching the improvement loop.

| | autoresearch | Schliff |
|---|---|---|
| Target | ML training scripts | Claude Code SKILL.md files |
| Patches | 100% LLM-generated | 60-70% deterministic rules, 30-40% LLM |
| Scoring | 1 metric | 7 dimensions + optional runtime |
| Anti-gaming | None | 6 detection vectors |
| Memory | Stateless | Cross-session episodic store |
| Dependencies | External (ML frameworks) | Python 3.9+ stdlib only |
| Tests | Minimal | 732 unit + 99 integration |

Architecture — How the scoring engine and improvement loop connect (view diagram on GitHub)

The scorer is the ruler. Claude is the craftsman.

```mermaid
flowchart TB
    subgraph Scoring ["Scoring Engine (deterministic, no LLM)"]
        SKILL[SKILL.md + eval-suite.json] --> PARSE[Parse & Extract]
        PARSE --> S1[Structure]
        PARSE --> S2[Triggers]
        PARSE --> S3[Quality]
        PARSE --> S4[Edges]
        PARSE --> S5[Efficiency]
        PARSE --> S6[Composability]
        PARSE --> S7[Clarity]
        S1 & S2 & S3 & S4 & S5 & S6 & S7 --> COMPOSITE[Weighted Composite + Grade]
    end

    subgraph Loop ["Improvement Loop (Claude Code)"]
        COMPOSITE --> GRADIENT[Identify Weakest Dimension]
        GRADIENT --> MEMORY[(Episodic Memory)]
        MEMORY --> PREDICT[Predict Strategy Success]
        PREDICT --> PATCH[Generate Patch]
        PATCH --> APPLY[Apply + Re-score]
        APPLY -->|delta > 0| KEEP[Keep]
        APPLY -->|delta <= 0| REVERT[Revert]
        KEEP & REVERT --> EMA{EMA Plateau?}
        EMA -->|no| GRADIENT
        EMA -->|yes| DONE[Done]
    end
```

Note: Mermaid diagram renders on GitHub. On PyPI, view the repository for the visual.
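The EMA plateau check that stops the loop can be sketched as follows (`alpha`, `eps`, and `window` are assumed values for illustration, not Schliff's documented defaults):

```python
def ema_plateau(scores, alpha=0.3, eps=0.5, window=3):
    """Return True once the exponential moving average of the score has
    failed to improve by more than `eps` for `window` consecutive
    iterations. Parameter values are illustrative assumptions."""
    ema, stalled = scores[0], 0
    for s in scores[1:]:
        new_ema = alpha * s + (1 - alpha) * ema
        stalled = stalled + 1 if new_ema - ema <= eps else 0
        ema = new_ema
        if stalled >= window:
            return True
    return False
```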

60-70% of patches follow deterministic rules (frontmatter fixes, noise removal, TODO cleanup, hedging elimination). The LLM handles the remaining 30-40% — structural reorganization, example generation, edge case synthesis.
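A deterministic rule of the kind described, hedging elimination, might look like this (the hedge list and regex are illustrative, not Schliff's actual rule set):

```python
import re

# Hypothetical hedge vocabulary -- not taken from Schliff's source.
HEDGES = re.compile(
    r"\b(?:perhaps|maybe|possibly|generally|in most cases|might want to)\b\s*,?\s*",
    re.IGNORECASE,
)

def strip_hedging(line):
    """Remove hedging words from an instruction line, one example of a
    rule that needs no LLM: same input, same patch, every time."""
    return HEDGES.sub("", line).strip()
```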


Limitations

The structural score measures file organization, not runtime effectiveness. A skill scoring 95/100 structurally can still produce wrong output at runtime — use --runtime scoring for that.

The trigger scorer uses TF-IDF heuristics. Skills whose domain vocabulary overlaps with generic terms (e.g., "review", "analyze") may hit a precision ceiling around 75-80. Precision/recall reporting helps diagnose this.
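The precision/recall diagnostics mentioned can be computed from eval-suite outcomes roughly like so (the `(should_trigger, did_trigger)` tuple shape is an assumption for illustration, not Schliff's data model):

```python
def trigger_metrics(results):
    """Compute (precision, recall) over trigger eval cases, where each
    case is a (should_trigger, did_trigger) pair of booleans."""
    tp = sum(1 for want, got in results if want and got)
    fp = sum(1 for want, got in results if not want and got)
    fn = sum(1 for want, got in results if want and not got)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Low precision with high recall points at false positives (generic vocabulary firing too often), the failure mode behind the 75-80 ceiling.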


Badge

```markdown
[![Schliff: 99 [S]](https://img.shields.io/badge/Schliff-99%2F100_%5BS%5D-brightgreen)](https://github.com/Zandereins/schliff)
```



Contributing

Found a scoring bug? Add a test case and open an issue. Want to improve scoring logic? Edit the relevant `scoring/*.py`, run `bash scripts/test-integration.sh`, PR the diff.

License

MIT


schliff (German) — the finishing cut. "Den letzten Schliff geben" = to give something its final polish.
