Schliff

A deterministic quality scorer for AI agent instruction files. The Codecov for your SKILL.md.


AI coding agents use instruction files — SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md — to define behavior. These files degrade silently: triggers overlap, instructions contradict, edge cases slip through. Schliff catches that before your users do.

Zero dependencies. No LLM needed. Same input, same score. Python 3.9+ stdlib only.

```bash
pip install schliff
schliff score path/to/SKILL.md
```

*schliff score: bad skill [D] vs production skill [S]*


Scoring

Deterministic static analysis. No LLM required. Same input, same output, every time.

| Dimension | Weight | What it catches |
|---|---|---|
| structure | 15% | Missing frontmatter, empty headers, no examples, dead content |
| triggers | 20% | Eval-suite trigger accuracy, false positives, missed activations |
| quality | 20% | Thin assertions, missing feature coverage, low coherence |
| edges | 15% | No edge cases defined, missing categories (invalid, scale, unicode) |
| efficiency | 10% | Hedging, filler words, repetition, low signal-to-noise |
| composability | 10% | Missing scope boundaries, no error behavior, no handoff points |
| clarity | 5% | Contradictions, vague references, ambiguous instructions |
| security | 8% (opt-in via `--security`) | Prompt injection, data exfiltration, obfuscation, dangerous commands |
| runtime | 10% (opt-in) | Actual Claude behavior against eval assertions |

Weights are renormalized across the measured dimensions so they sum to 1.0. Without `--runtime`, the 7 structural dimensions carry 100% of the score.
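As a sketch of that renormalization (the weight values come from the table above; the function and variable names are illustrative, not Schliff's internals):

```python
# Base weights from the dimensions table above. Names are illustrative,
# not taken from Schliff's source.
BASE_WEIGHTS = {
    "structure": 0.15, "triggers": 0.20, "quality": 0.20,
    "edges": 0.15, "efficiency": 0.10, "composability": 0.10,
    "clarity": 0.05, "security": 0.08, "runtime": 0.10,
}

def renormalize(measured):
    """Scale the weights of the measured dimensions so they sum to 1.0."""
    total = sum(BASE_WEIGHTS[d] for d in measured)
    return {d: BASE_WEIGHTS[d] / total for d in measured}

structural = ["structure", "triggers", "quality", "edges",
              "efficiency", "composability", "clarity"]
weights = renormalize(structural)
# With only the 7 structural dimensions measured (base sum 0.95),
# triggers scales from 0.20 to 0.20 / 0.95, roughly 0.2105.
```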

Grades: S (>=95) / A (>=85) / B (>=75) / C (>=65) / D (>=50) / E (>=35) / F (<35)
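The grade thresholds above map onto a composite score like this (a minimal sketch assuming simple floor comparisons, which is all the listed cutoffs require):

```python
def grade(score):
    """Map a 0-100 composite score to a letter grade using the
    thresholds listed above."""
    for letter, floor in [("S", 95), ("A", 85), ("B", 75),
                          ("C", 65), ("D", 50), ("E", 35)]:
        if score >= floor:
            return letter
    return "F"
```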

Full methodology and weight rationale: docs/SCORING.md


Anti-Gaming

Schliff detects score inflation. The benchmark suite tests 6 common gaming patterns — all caught:

| Gaming attempt | How Schliff catches it |
|---|---|
| Empty headers (inflate structure) | Header content check — empty sections penalized |
| Keyword stuffing (inflate triggers) | Dedup + frequency cap on repeated terms |
| Copy-paste examples | Repeated-line detection — score drops 94 → 43 |
| Contradictory instructions | "always X" vs "never X" contradiction finder |
| Bloated preamble | Signal-to-noise ratio via sqrt density curve |
| Missing scope boundaries | 10 composability sub-checks, not a single binary |

Reproduce: `python benchmarks/anti-gaming/run.py`
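One of the checks above, repeated-line detection, can be approximated in a few lines (an illustrative stand-in, not Schliff's actual heuristic or threshold):

```python
from collections import Counter

def repeated_lines(text, threshold=3):
    """Flag non-empty lines that repeat verbatim -- a rough stand-in for
    the copy-paste-example check. The threshold is an assumed value."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    return {ln: n for ln, n in counts.items() if n >= threshold}
```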


Quick Start

Score any instruction file (no Claude Code needed)

```bash
pip install schliff          # or: pipx install schliff
schliff demo                                        # see it in action instantly
schliff score path/to/SKILL.md                      # score any skill file
schliff score CLAUDE.md                             # works with any format
schliff score --url https://github.com/.../SKILL.md # score remote files
schliff compare skill-v1.md skill-v2.md             # side-by-side comparison
schliff suggest path/to/SKILL.md                    # ranked fixes with impact
schliff doctor                                      # scan all installed skills
schliff report path/to/SKILL.md                     # markdown report (+ --gist)
schliff drift --repo .                              # find stale references
schliff sync .                                      # cross-file consistency check
schliff track path/to/SKILL.md                      # score history + sparkline
```

Autonomous improvement (requires Claude Code)

```bash
git clone https://github.com/Zandereins/schliff.git && bash schliff/install.sh
```

```text
# Inside Claude Code:
/schliff:init path/to/SKILL.md    # bootstrap eval suite + baseline
/schliff:auto                     # patch → measure → keep or revert → repeat
```

Prerequisites: Python 3.9+, Bash, Git, jq

Where Schliff fits

```text
Write instruction file  -->  schliff score  -->  schliff suggest  -->  fix  -->  ship
     (any format)            (9 dimensions)      (ranked fixes)        │
                                                                       ↓
                             schliff track  <--  schliff sync  <--  schliff drift
                             (trend over time)   (cross-file)    (stale references)
```

Works with any AI coding agent: Claude Code (SKILL.md), Cursor (.cursorrules), GitHub Copilot (AGENTS.md), or project configs (CLAUDE.md). Schliff grinds instruction files to production quality.


Results

| Skill | Before | After | Iterations | Author |
|---|---|---|---|---|
| agent-review-panel | 64.0 [D] | 85.6 [A] | 3 rounds | @wan-huiyan |
| shieldclaw (OpenClaw plugin) | 68.3 [C] | 94.6 [A] | 1 round | @Zandereins |
| demo skill (demo/bad-skill/) | 54.0 [D] | 98.3 [S] | 18 | @Zandereins |

The demo skill — a vague, hedging-filled deployment helper — goes from [D] to [S] in 18 autonomous iterations:

```text
structure         70 → 100     Frontmatter, examples, concrete commands
triggers           0 → 100     Description keywords, negative boundaries
quality            0 → 95      Eval suite generated, assertions added
edges              0 → 100     Edge cases synthesized
efficiency        35 → 93      Hedging removed, information density up
composability     30 → 90      Scope boundaries, error behavior, deps
clarity           90 → 100     Vague references resolved
```

Real-world skills vary. Complex skills plateau around [A] to [S] depending on eval suite coverage.

ShieldClaw is a prompt injection defense plugin for the OpenClaw agent framework — not a Claude Code skill. Schliff scored its SKILL.md at 68.3 [C]; after one round of `/schliff:auto` it reached 94.6 [A] while staying under 300 tokens. Adding an eval suite unlocked three previously unmeasured dimensions (triggers, quality, edges), which drove most of the 26-point gain.

Run `schliff score` on your skill and add your result.

Community

> "It's become a core part of my skill development workflow!" — @wan-huiyan

@wan-huiyan used schliff to improve agent-review-panel from 64 to 85.6 across three rounds. Along the way, SKILL.md went from 1,331 to 340 lines — a 75% token reduction via references/ extraction. A/B testing on a 1,132-line document confirmed identical review quality with fewer tokens.

Used by


Commands

CLI (standalone — pip install schliff)

| Command | Purpose |
|---|---|
| `schliff demo` | Score a built-in bad skill — see schliff in action instantly |
| `schliff score <path>` | Score any instruction file (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md) |
| `schliff score --url <url>` | Score a remote file from GitHub (HTTPS-only) |
| `schliff score --security` | Include security dimension (injection, exfiltration, obfuscation) |
| `schliff compare <a> <b>` | Side-by-side quality comparison with dimension deltas |
| `schliff suggest <path>` | Ranked actionable fixes with estimated score impact |
| `schliff diff <path>` | Show score delta vs. previous commit (or any `--ref`) |
| `schliff verify <path>` | CI gate — exit 0/1, `--min-score`, `--regression`, pre-commit hook |
| `schliff doctor` | Scan all installed skills + instruction files, health grades, drift analysis |
| `schliff badge <path>` | Generate copy-paste markdown badge |
| `schliff report <path>` | Generate Markdown quality report (`--gist` for shareable link) |
| `schliff score --tokens` | Section-by-section token breakdown with format-specific budgets |
| `schliff drift --repo <dir>` | Find stale paths, scripts, and make targets in instruction files |
| `schliff sync <dir>` | Cross-file consistency: contradictions, gaps, redundancies |
| `schliff track <path>` | Score history over time with sparkline and regression detection |

Claude Code skills (require integration)

| Command | Purpose |
|---|---|
| `/schliff:auto` | Autonomous improvement loop with EMA-based stopping |
| `/schliff:init <path>` | Bootstrap eval suite + baseline from any SKILL.md |
| `/schliff:analyze` | One-shot gap analysis with ranked fix recommendations |
| `/schliff:mesh` | Detect trigger conflicts across all installed skills |
| `/schliff:report` | Generate shareable markdown report with badge |

CI Integration

Score skills in CI. Block regressions. The Codecov for SKILL.md files.

```yaml
# GitHub Action
- uses: Zandereins/schliff@v7
  with:
    skill-path: '.claude/skills/my-skill/SKILL.md'
    minimum-score: '75'
    comment-on-pr: 'true'
```

```bash
# Or use the CLI directly
schliff verify path/to/SKILL.md --min-score 75 --regression
```

Pre-commit Hook

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v7.1.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']
```

Web

  • Playground — interactive browser-based scorer, try before installing
  • Leaderboard — community scoreboard (scaffold, external storage coming)

How it differs from autoresearch

Inspired by Karpathy's autoresearch — but Schliff is a linter, not a research loop. You can run schliff score in CI without ever touching the improvement loop.

| | autoresearch | Schliff |
|---|---|---|
| Target | ML training scripts | Claude Code SKILL.md files |
| Patches | 100% LLM-generated | 60-70% deterministic rules, 30-40% LLM |
| Scoring | 1 metric | 7 dimensions + optional runtime |
| Anti-gaming | None | 6 detection vectors |
| Memory | Stateless | Cross-session episodic store |
| Dependencies | External (ML frameworks) | Python 3.9+ stdlib only |
| Tests | Minimal | 732 unit + 99 integration |

Architecture — How the scoring engine and improvement loop connect (view diagram on GitHub)

The scorer is the ruler. Claude is the craftsman.

```mermaid
flowchart TB
    subgraph Scoring ["Scoring Engine (deterministic, no LLM)"]
        SKILL[SKILL.md + eval-suite.json] --> PARSE[Parse & Extract]
        PARSE --> S1[Structure]
        PARSE --> S2[Triggers]
        PARSE --> S3[Quality]
        PARSE --> S4[Edges]
        PARSE --> S5[Efficiency]
        PARSE --> S6[Composability]
        PARSE --> S7[Clarity]
        S1 & S2 & S3 & S4 & S5 & S6 & S7 --> COMPOSITE[Weighted Composite + Grade]
    end

    subgraph Loop ["Improvement Loop (Claude Code)"]
        COMPOSITE --> GRADIENT[Identify Weakest Dimension]
        GRADIENT --> MEMORY[(Episodic Memory)]
        MEMORY --> PREDICT[Predict Strategy Success]
        PREDICT --> PATCH[Generate Patch]
        PATCH --> APPLY[Apply + Re-score]
        APPLY -->|delta > 0| KEEP[Keep]
        APPLY -->|delta <= 0| REVERT[Revert]
        KEEP & REVERT --> EMA{EMA Plateau?}
        EMA -->|no| GRADIENT
        EMA -->|yes| DONE[Done]
    end
```

Note: Mermaid diagram renders on GitHub. On PyPI, view the repository for the visual.
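The EMA plateau check that stops the loop can be sketched as follows (`alpha`, `eps`, and `window` are assumed values for illustration, not Schliff's documented defaults):

```python
def ema_plateau(scores, alpha=0.3, eps=0.5, window=3):
    """Return True once the exponential moving average of the score has
    failed to improve by more than `eps` for `window` consecutive
    iterations. Parameter values are illustrative assumptions."""
    ema, stalled = scores[0], 0
    for s in scores[1:]:
        new_ema = alpha * s + (1 - alpha) * ema
        stalled = stalled + 1 if new_ema - ema <= eps else 0
        ema = new_ema
        if stalled >= window:
            return True
    return False
```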

60-70% of patches follow deterministic rules (frontmatter fixes, noise removal, TODO cleanup, hedging elimination). The LLM handles the remaining 30-40% — structural reorganization, example generation, edge case synthesis.
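A deterministic rule of the kind described, hedging elimination, might look like this (the hedge list and regex are illustrative, not Schliff's actual rule set):

```python
import re

# Hypothetical hedge vocabulary -- not taken from Schliff's source.
HEDGES = re.compile(
    r"\b(?:perhaps|maybe|possibly|generally|in most cases|might want to)\b\s*,?\s*",
    re.IGNORECASE,
)

def strip_hedging(line):
    """Remove hedging words from an instruction line, one example of a
    rule that needs no LLM: same input, same patch, every time."""
    return HEDGES.sub("", line).strip()
```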


Limitations

The structural score measures file organization, not runtime effectiveness. A skill scoring 95/100 structurally can still produce wrong output at runtime — use --runtime scoring for that.

The trigger scorer uses TF-IDF heuristics. Skills whose domain vocabulary overlaps with generic terms (e.g., "review", "analyze") may hit a precision ceiling around 75-80. Precision/recall reporting helps diagnose this.
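The precision/recall diagnostics mentioned can be computed from eval-suite outcomes roughly like so (the `(should_trigger, did_trigger)` tuple shape is an assumption for illustration, not Schliff's data model):

```python
def trigger_metrics(results):
    """Compute (precision, recall) over trigger eval cases, where each
    case is a (should_trigger, did_trigger) pair of booleans."""
    tp = sum(1 for want, got in results if want and got)
    fp = sum(1 for want, got in results if not want and got)
    fn = sum(1 for want, got in results if want and not got)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Low precision with high recall points at false positives (generic vocabulary firing too often), the failure mode behind the 75-80 ceiling.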


Badge

```markdown
[![Schliff: 99 [S]](https://img.shields.io/badge/Schliff-99%2F100_%5BS%5D-brightgreen)](https://github.com/Zandereins/schliff)
```



Contributing

Found a scoring bug? Add a test case and open an issue. Want to improve scoring logic? Edit the relevant `scoring/*.py`, run `bash scripts/test-integration.sh`, PR the diff.

License

MIT


schliff (German) — the finishing cut. "Den letzten Schliff geben" = to give something its final polish.
