optimize-anything

LLM-guided optimization for text artifacts using an iterative propose-evaluate-reflect loop with a bring-your-own evaluator.

Quickstart (v2)

# 1) Install
curl -fsSL https://raw.githubusercontent.com/ASRagab/optimize-anything/main/install.sh | bash

# 2) Create a seed artifact
echo "Write a concise support prompt" > seed.txt

# 3) Generate a starter evaluator (default: judge/python template)
optimize-anything generate-evaluator seed.txt \
  --objective "Score clarity, actionability, and specificity" \
  > eval.py

# 4) Optimize
optimize-anything optimize seed.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Improve clarity and specificity" \
  --model openai/gpt-4o-mini \
  --budget 20 \
  --parallel --workers 4 \
  --cache \
  --run-dir runs \
  --output result.txt

The CLI prints a JSON summary to stdout — see Result Contract for the full shape.

How It Works

optimize-anything runs a GEPA (Genetic-Pareto) loop: propose → evaluate → reflect, repeating until the budget is exhausted or early stopping kicks in.

seed.txt ──► [Propose] ──► candidates
                 ▲               │
                 │           [Evaluate]
             [Reflect] ◄──── scores + diagnostics
  1. Propose — The optimizer generates candidate artifacts from your seed (or from scratch in seedless mode).
  2. Evaluate — Each candidate is scored by your evaluator. Three evaluator types are supported: a command evaluator (any executable that reads JSON on stdin and writes a score on stdout), an HTTP evaluator (a service that accepts POST requests), or a built-in LLM judge (no evaluator script required — just pass --judge-model).
  3. Reflect — Scores and diagnostics feed back into the next proposal round. The loop continues, progressively improving the artifact toward your objective.

The evaluator is the only thing you bring. Everything else — proposal strategy, reflection, early stopping, caching, parallelism — is handled by the optimizer.
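For the command evaluator case, a minimal sketch in Python — it reads the payload JSON on stdin and writes a score JSON on stdout, per the Evaluator Protocol. The word-count heuristic is purely illustrative; substitute whatever "better" means for your artifact:

```python
#!/usr/bin/env python3
"""Minimal command evaluator sketch for optimize-anything."""
import json
import sys


def evaluate(payload: dict) -> dict:
    candidate = payload["candidate"]  # the only required input field
    words = len(candidate.split())
    # Toy heuristic: full marks up to 50 words, linear falloff after.
    score = max(0.0, min(1.0, 1.0 - max(0, words - 50) / 100))
    return {"score": score, "notes": f"{words} words"}


if __name__ == "__main__":
    print(json.dumps(evaluate(json.load(sys.stdin))))
```

Wire it in with --evaluator-command "python eval.py" in place of --judge-model.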

Runtime Modes

Dataset / Valset modes

Use --dataset for multi-task optimization (one evaluator call per example). Add --valset for generalization validation.

optimize-anything optimize prompt.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Generalize across customer request types" \
  --dataset data/train.jsonl \
  --valset data/val.jsonl \
  --model openai/gpt-4o-mini \
  --budget 120 --parallel --workers 6 --cache --run-dir runs
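Each dataset row is a JSON object on its own line; in dataset mode the row is forwarded to your evaluator as the example field of the payload, so the row schema is whatever your evaluator expects. A hypothetical train.jsonl:

```json
{"input": "Customer asks for a refund past the return window", "expected": "polite refusal citing policy"}
{"input": "Customer reports a double charge", "expected": "apology plus escalation steps"}
```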

Multi-provider validation

Cross-check one artifact with multiple judge providers:

optimize-anything validate result.txt \
  --providers openai/gpt-4o-mini anthropic/claude-sonnet-4-5 google/gemini-2.0-flash \
  --objective "Score clarity, constraints, and robustness" \
  --intake-file intake.json

Seedless mode

No seed file required; GEPA bootstraps the first candidates from the objective alone.

optimize-anything optimize --no-seed \
  --objective "Draft a concise, testable API prompt" \
  --model openai/gpt-4o-mini \
  --judge-model openai/gpt-4o-mini

--no-seed requires both --objective and --model.

Early stopping and cache reuse

  • Early stop is auto-enabled when --budget > 30 (or force with --early-stop)
  • Reuse prior evaluator cache with --cache-from (requires --cache + --run-dir)
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --budget 150 \
  --cache --cache-from runs/run-20260303-120000 \
  --run-dir runs \
  --early-stop --early-stop-window 12 --early-stop-threshold 0.003

Score range options

For command/HTTP evaluators:

  • --score-range unit (default): enforce score in [0, 1]
  • --score-range any: allow any finite float
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --score-range any
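With --score-range any, the evaluator may return any finite float rather than a value clamped to [0, 1] — useful when the natural metric is a count, latency, or penalty. A hypothetical sketch (the keyword list and weights are invented for illustration):

```python
#!/usr/bin/env python3
"""Command evaluator emitting an unbounded score (--score-range any)."""
import json
import sys


def evaluate(payload: dict) -> dict:
    candidate = payload["candidate"]
    # Unbounded score: reward keyword coverage, subtract a length penalty.
    keywords = ("clarity", "specificity", "constraints")
    hits = sum(k in candidate.lower() for k in keywords)
    return {"score": hits * 10.0 - len(candidate) / 100.0}


if __name__ == "__main__":
    print(json.dumps(evaluate(json.load(sys.stdin))))
```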

CLI Subcommands

  • optimize
  • generate-evaluator
  • intake
  • explain
  • budget
  • score
  • analyze
  • validate

Claude Code Plugin

optimize-anything is also a Claude Code plugin with guided slash commands and skills.

Installation

# In Claude Code
/plugin install ASRagab/optimize-anything

Or clone and install locally:

git clone https://github.com/ASRagab/optimize-anything.git
cd optimize-anything
/plugin install .

Slash Commands

  • /optimize-anything:optimize — Guided optimization workflow: walks you through mode selection, evaluator setup, execution, and results
  • /optimize-anything:quick — Zero-config one-shot optimization; just provide a file and objective
  • /optimize-anything:analyze — Discover quality dimensions for an artifact and objective
  • /optimize-anything:score — Score a single artifact without optimization
  • /optimize-anything:validate — Cross-validate with multiple LLM judges
  • /optimize-anything:compare — Side-by-side comparison of two artifacts
  • /optimize-anything:budget — Get a budget recommendation for your artifact
  • /optimize-anything:explain — Preview the optimization plan without running it
  • /optimize-anything:intake — Normalize and validate an intake specification

Skills

The plugin includes three skills that Claude Code can invoke automatically:

  • optimization-guide — Full workflow walkthrough covering modes, configuration, budget, and result interpretation
  • generate-evaluator — Choose the right evaluator pattern (judge, command, composite) and generate a script
  • evaluator-patterns — Library of ready-to-use evaluator templates for prompts, code, docs, and agent instructions

Typical Plugin Workflow

/optimize-anything:analyze prompt.txt --objective "Score clarity"
  → discovers quality dimensions

/optimize-anything:quick prompt.txt "improve clarity and specificity"
  → runs analyze + optimize with sensible defaults, shows diff

/optimize-anything:validate result.txt --providers openai/gpt-4o anthropic/claude-sonnet-4-5
  → cross-checks the result with multiple judges

Reference

Result Contract

The CLI prints a JSON summary to stdout with these fields:

  • best_artifact
  • total_metric_calls
  • score_summary (initial, latest, best, deltas, num_candidates)
  • top_diagnostics (list of {name, value})
  • plateau_detected, plateau_guidance
  • optional evaluator_failure_signal
  • optional early_stopped, stopped_at_iteration
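An illustrative summary for the quickstart run — the field names come from the list above, but every value (and the exact shape of score_summary and top_diagnostics entries) is invented for illustration:

```json
{
  "best_artifact": "Write a concise support prompt that...",
  "total_metric_calls": 20,
  "score_summary": {"initial": 0.52, "latest": 0.81, "best": 0.83, "num_candidates": 12},
  "top_diagnostics": [{"name": "clarity", "value": 0.9}],
  "plateau_detected": false,
  "plateau_guidance": null
}
```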

Evaluator Protocol (v2)

Evaluator input payload (stdin JSON for command mode, POST JSON for HTTP mode):

{"_protocol_version": 2, "candidate": "...", "example": {...}, "task_model": "..."}
  • candidate is required
  • _protocol_version, example, and task_model are optional/additive
  • legacy evaluators that only read candidate remain compatible

Evaluator output payload:

{"score": 0.75, "notes": "optional diagnostics"}
  • score is required
  • additional keys are treated as side-info
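An HTTP evaluator speaks the same contract over POST. A minimal stdlib sketch — the port and the "must"-keyword heuristic are illustrative assumptions, not part of the protocol:

```python
#!/usr/bin/env python3
"""Minimal HTTP evaluator sketch: accepts the POST payload, returns score JSON."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def evaluate(payload: dict) -> dict:
    candidate = payload["candidate"]
    # Illustrative heuristic: reward candidates stating explicit constraints.
    score = 1.0 if "must" in candidate.lower() else 0.4
    return {"score": score, "notes": "constraint-language check"}


class Evaluator(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = json.dumps(evaluate(json.loads(body))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(result)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Evaluator).serve_forever()
```

Point the optimizer at it with --evaluator-url http://127.0.0.1:8080.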

Intake

Intake is optional structured guidance you pass to the optimizer to shape how evaluation works. Instead of relying solely on an --objective string, intake lets you declare quality dimensions, hard constraints, evaluation patterns, and execution preferences — giving you finer control over what "better" means for your artifact.

Use intake when your evaluation criteria are multi-dimensional, when you need to enforce hard constraints, or when you want consistent evaluation behavior across runs.

Pass it inline or from a file:

# Inline
optimize-anything optimize seed.txt \
  --intake-json '{"quality_dimensions": ["clarity", "specificity"], "hard_constraints": ["max 100 words"]}' \
  --judge-model openai/gpt-4o-mini

# From file
optimize-anything optimize seed.txt \
  --intake-file intake.json \
  --judge-model openai/gpt-4o-mini

optimize-anything intake normalizes and validates these keys:

  • artifact_class
  • quality_dimensions
  • hard_constraints
  • evaluation_pattern
  • execution_mode
  • evaluator_cwd

optimize flags (complete)

Exactly one evaluator source is required: --evaluator-command OR --evaluator-url OR --judge-model.

  • --no-seed — Run without a seed file; bootstrap from the objective (default: false)
  • --evaluator-command <cmd...> — Command evaluator (stdin/stdout JSON)
  • --evaluator-url <url> — HTTP evaluator endpoint
  • --intake-json <json> — Inline intake spec
  • --intake-file <path> — Intake spec file
  • --evaluator-cwd <path> — Working dir for the command evaluator
  • --objective <text> — Optimization objective
  • --background <text> — Extra domain context
  • --dataset <train.jsonl> — Training dataset JSONL
  • --valset <val.jsonl> — Validation dataset JSONL (requires --dataset)
  • --budget <int> — Max evaluator calls (default: 100)
  • --output, -o <file> — Write the best artifact to a file
  • --model <model> — Proposer model (default: OPTIMIZE_ANYTHING_MODEL env fallback)
  • --judge-model <model> — Built-in LLM judge evaluator model
  • --judge-objective <text> — Judge objective override (default: falls back to --objective)
  • --api-base <url> — Override the LiteLLM API base
  • --diff — Print a unified diff (seed vs. best) to stderr (default: false)
  • --run-dir <path> — Save run artifacts in a timestamped run dir
  • --parallel — Enable parallel evaluator calls (default: false)
  • --workers <int> — Max workers for parallel evaluation
  • --cache — Enable the evaluator cache (default: false)
  • --cache-from <run-dir> — Copy a prior fitness_cache into the new run
  • --early-stop — Enable plateau early stop (default: auto on when budget > 30)
  • --early-stop-window <int> — Plateau window size (default: 10)
  • --early-stop-threshold <float> — Minimum improvement required over the window (default: 0.005)
  • --spec-file <path> — Load TOML spec defaults
  • --task-model <model> — Optional metadata forwarded to evaluators
  • --score-range <unit|any> — Score validation mode for command/HTTP evaluators (default: unit)
