Local service for benchmark-guided code improvement
autoarc is an open-source Gleam service for running benchmark-driven code
experiments on local Git repositories. It creates isolated worktrees, asks pi
to propose changes, runs the benchmark itself, and records jobs plus
experiments in SQLite.
autoarc keeps the loop intentionally small and inspectable: one repository,
one benchmark contract, repeated measured code improvements.
For a deeper walkthrough, see docs/architecture.md and docs/code-tour.md.
Important
autoarc is designed for trusted local repositories. It runs model-authored
edits, git, and bun benchmarks on your machine. Start with the bundled
example, use it on repos you are comfortable modifying locally, and make sure
you have a working local pi login before running real jobs. The trust model
and operating guidance live in SECURITY.md.
autoarc is intentionally narrow. It gives you:
- a local HTTP service for running benchmark-guided code improvement jobs
- a required human research_direction that sets the direction of work
- a model-authored benchmark contract under .autoarc/
- isolated experiments in Git worktrees
- benchmark-owned promotion decisions
- durable job and experiment history in SQLite plus inspectable logs and worktrees
autoarc is not trying to be a hosted platform, a
general-purpose agent framework, or a sandbox for untrusted code.
- Benchmark-owned decisions. Models can propose changes, but autoarc runs the metric and decides whether a candidate improves the frontier.
- Git isolation by default. Each experiment runs in its own worktree against a clean base commit.
- Durable experiment history. Jobs, experiments, commits, metrics, and summaries are stored in SQLite.
- Manual or automatic promotion. Improved candidates can wait for review or be rechecked automatically on top of the latest frontier.
- Bounded long-running work. Agent calls and benchmark runs have explicit wall-clock time limits so jobs fail clearly instead of hanging forever.
- Not a hosted service. autoarc is built to run on your machine against repos on your machine.
- Not a product prioritizer. It improves what your benchmark measures; it does not decide what your benchmark should care about.
- Not a sandbox. It executes model-authored code and benchmarks with your local user permissions.
- Not a sprawling framework. The scope is intentionally small enough to read, understand, and extend.
autoarc was inspired by two adjacent projects:
- autoresearch, which explores autonomous experiment loops against a measurable training setup
- symphony, which explores autonomous implementation runs against project work
The fastest way to understand the workflow is to run the bundled example repo:
make live-test
That command:
- starts the local service
- copies test/fixtures/example_repo/ into a temp repo
- asks pi to design a benchmark contract
- runs a few experiments and prints a job summary
It resets ./autoarc-data/ at the start of the run and requires pi, bun,
and a working local pi login.
Requirements:
- Erlang/OTP 28
- Gleam 1.14
- git
- bun
- pi with a working local login
- a clean target repository
Start the service:
gleam deps download
gleam run
You can also copy .env.example to .env and edit it locally. By default,
runtime data is written to ./autoarc-data/ in this repo. That directory is
gitignored so you can inspect autoarc.sqlite, logs, and worktrees without
cluttering the repo.
- POST /v1/jobs receives a local repo path, a required human research_direction, an experiment count, and a promotion mode.
- The design step turns that human direction into .autoarc/config.json, .autoarc/benchmark.ts, and .autoarc/design.md on a job branch.
- autoarc runs the baseline benchmark once to establish the starting frontier.
- Mutation experiments run in separate worktrees and stay within the allowed editable paths.
- Improved candidates are either held for manual promotion or rechecked automatically on top of the latest frontier before merge.
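The first step of the flow can be sketched from the client side. The field names below come from the curl example in this README and the x-api-key header from the configuration table; the type names, the `buildJobRequest` helper, the `"manual"` mode value, and the client-side checks are illustrative assumptions, not part of autoarc itself.

```typescript
// Hypothetical client-side types. Field names mirror the README's curl example.
type PromotionMode = "auto" | "manual"; // "manual" is an assumed value

interface JobRequest {
  repo_path: string;
  research_direction: string;
  num_experiments: number;
  promotion_mode: PromotionMode;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) the POST /v1/jobs request.
// The validation rules are assumptions made for this sketch.
function buildJobRequest(
  baseUrl: string,
  job: JobRequest,
  apiKey?: string,
): PreparedRequest {
  if (!job.repo_path.startsWith("/")) {
    throw new Error("repo_path should be an absolute path");
  }
  if (job.research_direction.trim() === "") {
    throw new Error("research_direction is required");
  }
  if (!Number.isInteger(job.num_experiments) || job.num_experiments < 1) {
    throw new Error("num_experiments should be a positive integer");
  }
  const headers: Record<string, string> = {
    "content-type": "application/json",
  };
  // Only relevant when the service is started with API_KEY set.
  if (apiKey !== undefined) {
    headers["x-api-key"] = apiKey;
  }
  return {
    url: `${baseUrl}/v1/jobs`,
    method: "POST",
    headers,
    body: JSON.stringify(job),
  };
}
```

The returned object can be handed to any HTTP client; keeping request construction separate from sending makes the payload easy to inspect before a job starts modifying a repository.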
The design step writes these files under .autoarc/:
- .autoarc/config.json
- .autoarc/benchmark.ts
- .autoarc/design.md
config.json contains:
- metric_name
- direction as minimize or maximize
- editable_paths as repo-relative file or directory prefixes
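The contract is small enough to sketch as types: the three config fields above, plus the single JSON object the benchmark prints to stdout. The helper names below are hypothetical, and the comparison and prefix-matching rules (strict improvement, segment-aware prefixes) are assumptions about reasonable behavior rather than a description of what autoarc's Gleam code does.

```typescript
type Direction = "minimize" | "maximize";

interface BenchmarkConfig {
  metric_name: string;
  direction: Direction;
  editable_paths: string[]; // repo-relative file or directory prefixes
}

interface BenchmarkResult {
  metric_name: string;
  metric_value: number;
  summary: string;
}

// Parse the one JSON object printed by the benchmark and check it
// against the config. Throws on anything that does not fit the contract.
function parseBenchmarkOutput(
  stdout: string,
  config: BenchmarkConfig,
): BenchmarkResult {
  const parsed: unknown = JSON.parse(stdout.trim());
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const name = record["metric_name"];
  const value = record["metric_value"];
  const summary = record["summary"];
  if (name !== config.metric_name) {
    throw new Error(`expected metric_name "${config.metric_name}"`);
  }
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error("metric_value should be a finite number");
  }
  if (typeof summary !== "string") {
    throw new Error("summary should be a string");
  }
  return { metric_name: config.metric_name, metric_value: value, summary };
}

// Assumption: only a strict improvement moves the frontier; ties do not.
function improvesFrontier(
  direction: Direction,
  frontier: number,
  candidate: number,
): boolean {
  return direction === "minimize" ? candidate < frontier : candidate > frontier;
}

// Assumption: a prefix matches the exact file or anything below that directory.
function isEditable(path: string, editablePaths: string[]): boolean {
  return editablePaths.some((prefix) => {
    const dir = prefix.endsWith("/") ? prefix : `${prefix}/`;
    return path === prefix || path.startsWith(dir);
  });
}
```

With direction set to minimize and a frontier of 1.23, a candidate reporting 1.1 would count as an improvement under this sketch, while 1.23 or higher would not.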
The benchmark entrypoint is .autoarc/benchmark.ts and it runs with:
bun run .autoarc/benchmark.ts
The benchmark prints one JSON object to stdout:
{"metric_name":"score","metric_value":1.23,"summary":"short explanation"}
Create a job:
curl -X POST http://127.0.0.1:8000/v1/jobs \
-H 'content-type: application/json' \
-d '{
"repo_path": "/absolute/path/to/repo",
"research_direction": "Improve the benchmark by simplifying hot-path code and avoiding broad refactors.",
"num_experiments": 4,
"promotion_mode": "auto"
}'
Inspect a job:
curl http://127.0.0.1:8000/v1/jobs/1
Inspect an experiment:
curl http://127.0.0.1:8000/v1/experiments/1
Promote a candidate manually:
curl -X POST http://127.0.0.1:8000/v1/experiments/1/promote

| Env var | What it controls | Default |
|---|---|---|
| HOST | Bind host | 127.0.0.1 |
| PORT | HTTP port | 8000 |
| DATA_DIR | Runtime data directory | ./autoarc-data |
| SECRET_KEY_BASE | Wisp/Mist signing secret | autoarc-dev-secret |
| API_KEY | Optional x-api-key requirement | unset |
| MAX_CONCURRENCY | Worker concurrency | 2 |
| DEFAULT_MODEL | Default pi model override | unset |
| AGENT_TIMEOUT_MS | Wall-clock limit for each pi design or experiment command | 900000 |
| BENCHMARK_TIMEOUT_MS | Wall-clock limit for each benchmark run | 300000 |
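For tooling that needs to agree with the service about these settings (for example, a script that polls the API or inspects the data directory), the table can be mirrored as a small typed loader. This is a sketch only: the service itself is written in Gleam, and the `ServiceSettings` shape, the `loadSettings` and `readPositiveInt` helpers, and the rule that numeric values must be positive integers are all assumptions. SECRET_KEY_BASE is left out on purpose, since a client has no use for it.

```typescript
interface ServiceSettings {
  host: string;
  port: number;
  dataDir: string;
  apiKey: string | undefined; // unset by default
  maxConcurrency: number;
  defaultModel: string | undefined; // unset by default
  agentTimeoutMs: number;
  benchmarkTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Read a numeric variable, falling back to the documented default.
// Assumption: values should be positive integers.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${name} should be a positive integer, got "${raw}"`);
  }
  return parsed;
}

// Defaults are copied from the table above.
function loadSettings(env: Env): ServiceSettings {
  return {
    host: env["HOST"] ?? "127.0.0.1",
    port: readPositiveInt(env, "PORT", 8000),
    dataDir: env["DATA_DIR"] ?? "./autoarc-data",
    apiKey: env["API_KEY"],
    maxConcurrency: readPositiveInt(env, "MAX_CONCURRENCY", 2),
    defaultModel: env["DEFAULT_MODEL"],
    agentTimeoutMs: readPositiveInt(env, "AGENT_TIMEOUT_MS", 900000),
    benchmarkTimeoutMs: readPositiveInt(env, "BENCHMARK_TIMEOUT_MS", 300000),
  };
}
```

In more familiar units, the default limits are 15 minutes per pi command and 5 minutes per benchmark run.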
Runtime data lives under DATA_DIR:
- autoarc.sqlite lives at the top level of DATA_DIR
- per-repo job artifacts live under DATA_DIR/repos/<repo_id>/jobs/<job_id>/
- repo_id is autoarc's stable internal identifier for a canonical repo root
- completed and failed jobs keep an inspectable frontier worktree in their job directory
- internal experiment worktrees and temporary autoarc/* branches are cleaned up when they are no longer needed
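When scripting against this layout (for example, to open the database or find a job's artifacts), the two documented locations can be computed with a pair of helpers. The function names are hypothetical, and treating repo_id as an opaque string is an assumption, since its exact format is not described here.

```typescript
// Drop trailing slashes so "./autoarc-data/" and "./autoarc-data" behave the same.
function normalizeDataDir(dataDir: string): string {
  return dataDir.replace(/\/+$/, "");
}

// autoarc.sqlite sits at the top level of DATA_DIR.
function sqlitePath(dataDir: string): string {
  return `${normalizeDataDir(dataDir)}/autoarc.sqlite`;
}

// Per-repo job artifacts: DATA_DIR/repos/<repo_id>/jobs/<job_id>/
// repoId is treated as an opaque string (assumption).
function jobDir(dataDir: string, repoId: string, jobId: number | string): string {
  return `${normalizeDataDir(dataDir)}/repos/${repoId}/jobs/${jobId}/`;
}
```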
- src/autoarc/runtime/: HTTP entrypoints, coordinator, workers, and message flow
- src/autoarc/integration/: shell, git, benchmark, and pi boundaries
- src/autoarc/persistence/: SQLite schema setup and queries
- src/autoarc/types/: shared records and enums grouped by concern
- docs/architecture.md: lifecycle, promotion rules, and design notes
- docs/code-tour.md: reading order, file map, and common change paths
- test/: API and workflow tests
- test/fixtures/example_repo/: small repo used by tests and the live workflow harness
If you are opening the repo for the first time, start here:
- src/autoarc/runtime/app.gleam
- src/autoarc/runtime/api.gleam
- src/autoarc/runtime/coordinator.gleam
- src/autoarc/runtime/worker.gleam
That path gets you from boot, to HTTP entry, to scheduling, to the side-effect work itself.
For the deeper version, including “what file to read for what question,” see docs/code-tour.md.
make help
make deps
make format
make test
You can still run the raw Gleam commands directly:
gleam format
gleam test
For the opt-in live workflow test:
make live-test
You can override the live-test setup from .env or on the command line:
make live-test \
MODEL=gpt-5.4 \
EXPERIMENT_COUNT=5 \
CONCURRENCY=1 \
RESEARCH_DIRECTION='Focus on parser and benchmark hot paths.'
The live test uses DATA_DIR itself as the runtime data directory, so
you can inspect the normal on-disk layout directly in ./autoarc-data by
default. It resets that directory at the start of the run, then uses a fresh
temp copy of test/fixtures/example_repo/ as the input repo. It requires pi,
bun, and a working local pi login.
See SECURITY.md for the trust model, CONTRIBUTING.md for contributor workflow, and AGENTS.md for repo-specific agent instructions.
Built by Arcnem AI in Tokyo.