autoarc

Local service for benchmark-guided code improvement

Japanese · Why · Run Locally · Job Flow · Code Tour · API · Development


autoarc is an open-source Gleam service for running benchmark-driven code experiments on local Git repositories. It creates isolated worktrees, asks pi to propose changes, runs the benchmark itself, and records jobs plus experiments in SQLite.

autoarc keeps the loop intentionally small and inspectable: one repository, one benchmark contract, repeated measured code improvements.

For a deeper walkthrough, see docs/architecture.md and docs/code-tour.md.

Important

autoarc is designed for trusted local repositories. It runs model-authored edits, git, and bun benchmarks on your machine. Start with the bundled example, use it on repos you are comfortable modifying locally, and make sure you have a working local pi login before running real jobs. The trust model and operating guidance live in SECURITY.md.

What Autoarc Does

autoarc is intentionally narrow. It gives you:

  • a local HTTP service for running benchmark-guided code improvement jobs
  • a required human research_direction that steers the work
  • a model-authored benchmark contract under .autoarc/
  • isolated experiments in Git worktrees
  • benchmark-owned promotion decisions
  • durable job and experiment history in SQLite plus inspectable logs and worktrees

autoarc is not trying to be a hosted platform, a general-purpose agent framework, or a sandbox for untrusted code.

Why autoarc

  • Benchmark-owned decisions. Models can propose changes, but autoarc runs the metric and decides whether a candidate improves the frontier.
  • Git isolation by default. Each experiment runs in its own worktree against a clean base commit.
  • Durable experiment history. Jobs, experiments, commits, metrics, and summaries are stored in SQLite.
  • Manual or automatic promotion. Improved candidates can wait for review or be rechecked automatically on top of the latest frontier.
  • Bounded long-running work. Agent calls and benchmark runs have explicit wall-clock time limits so jobs fail clearly instead of hanging forever.

What It Is Not

  • Not a hosted service. autoarc is built to run on your machine against repos on your machine.
  • Not a product prioritizer. It improves what your benchmark measures; it does not decide what your benchmark should care about.
  • Not a sandbox. It executes model-authored code and benchmarks with your local user permissions.
  • Not a sprawling framework. The scope is intentionally small enough to read, understand, and extend.

Inspiration

autoarc was inspired by two adjacent projects:

  • autoresearch, which explores autonomous experiment loops against a measurable training setup
  • symphony, which explores autonomous implementation runs against project work

Start With The Example

The fastest way to understand the workflow is to run the bundled example repo:

make live-test

That command:

  • starts the local service
  • copies test/fixtures/example_repo/ into a temp repo
  • asks pi to design a benchmark contract
  • runs a few experiments and prints a job summary

It resets ./autoarc-data/ at the start of the run and requires pi, bun, and a working local pi login.

Run Locally

Requirements:

  • Erlang/OTP 28
  • Gleam 1.14
  • git
  • bun
  • pi with a working local login
  • a clean target repository

Start the service:

gleam deps download
gleam run

You can also copy .env.example to .env and edit it locally. By default, runtime data is written to ./autoarc-data/ in this repo. That directory is gitignored so you can inspect autoarc.sqlite, logs, and worktrees without cluttering the repo.

Job Flow

  1. POST /v1/jobs receives a local repo path, a required human research_direction, an experiment count, and a promotion mode.
  2. The design step turns that human direction into .autoarc/config.json, .autoarc/benchmark.ts, and .autoarc/design.md on a job branch.
  3. autoarc runs the baseline benchmark once to establish the starting frontier.
  4. Mutation experiments run in separate worktrees and stay within the allowed editable paths.
  5. Improved candidates are either held for manual promotion or rechecked automatically on top of the latest frontier before merge.

Benchmark Contract

The design step writes these files under .autoarc/:

  • .autoarc/config.json
  • .autoarc/benchmark.ts
  • .autoarc/design.md

config.json contains:

  • metric_name
  • direction as minimize or maximize
  • editable_paths as repo-relative file or directory prefixes
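Putting those three fields together, a config.json might look like the following (the metric name and paths here are illustrative, not defaults):

```json
{
  "metric_name": "workload_ms",
  "direction": "minimize",
  "editable_paths": ["src/", "lib/parser.ts"]
}
```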

The benchmark entrypoint is .autoarc/benchmark.ts, which autoarc runs with:

bun run .autoarc/benchmark.ts

The benchmark prints one JSON object to stdout:

{"metric_name":"score","metric_value":1.23,"summary":"short explanation"}
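As a sketch of that contract, a minimal benchmark.ts could time a hot path and print the single JSON object on stdout. The workload function and metric name below are illustrative stand-ins, not part of autoarc itself:

```typescript
// Illustrative .autoarc/benchmark.ts sketch. The only hard requirement
// from the contract is one JSON object printed to stdout.

// Hypothetical workload standing in for the code under measurement.
function workload(): number {
  let acc = 0;
  for (let i = 0; i < 100_000; i++) acc += Math.sqrt(i);
  return acc;
}

// Build the result object in the shape autoarc expects.
function makeResult(elapsedMs: number) {
  return {
    metric_name: "workload_ms",
    metric_value: elapsedMs,
    summary: `workload completed in ${elapsedMs.toFixed(2)} ms`,
  };
}

const start = performance.now();
workload();
const elapsedMs = performance.now() - start;

// autoarc reads exactly one JSON object from stdout.
console.log(JSON.stringify(makeResult(elapsedMs)));
```

Because direction in config.json would be minimize for a timing metric like this, lower metric_value means a better candidate.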

API

Create a job:

curl -X POST http://127.0.0.1:8000/v1/jobs \
  -H 'content-type: application/json' \
  -d '{
    "repo_path": "/absolute/path/to/repo",
    "research_direction": "Improve the benchmark by simplifying hot-path code and avoiding broad refactors.",
    "num_experiments": 4,
    "promotion_mode": "auto"
  }'

Inspect a job:

curl http://127.0.0.1:8000/v1/jobs/1

Inspect an experiment:

curl http://127.0.0.1:8000/v1/experiments/1

Promote a candidate manually:

curl -X POST http://127.0.0.1:8000/v1/experiments/1/promote

Configuration

Env var               What it controls                                           Default
HOST                  Bind host                                                  127.0.0.1
PORT                  HTTP port                                                  8000
DATA_DIR              Runtime data directory                                     ./autoarc-data
SECRET_KEY_BASE       Wisp/Mist signing secret                                   autoarc-dev-secret
API_KEY               Optional x-api-key requirement                             unset
MAX_CONCURRENCY       Worker concurrency                                         2
DEFAULT_MODEL         Default pi model override                                  unset
AGENT_TIMEOUT_MS      Wall-clock limit for each pi design or experiment command  900000
BENCHMARK_TIMEOUT_MS  Wall-clock limit for each benchmark run                    300000
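For example, a local .env overriding a few of these might look like this (the values are illustrative; the variable names and defaults come from the table above):

```shell
# .env — example overrides for local runs
PORT=8100
DATA_DIR=./autoarc-data
MAX_CONCURRENCY=1
AGENT_TIMEOUT_MS=600000
```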

Runtime data lives under DATA_DIR:

  • autoarc.sqlite lives at the top level of DATA_DIR
  • per-repo job artifacts live under DATA_DIR/repos/<repo_id>/jobs/<job_id>/
  • repo_id is autoarc's stable internal identifier for a canonical repo root
  • completed and failed jobs keep an inspectable frontier worktree in their job directory
  • internal experiment worktrees and temporary autoarc/* branches are cleaned up when they are no longer needed

Repository Layout

  • src/autoarc/runtime/ HTTP entrypoints, coordinator, workers, and message flow
  • src/autoarc/integration/ shell, git, benchmark, and pi boundaries
  • src/autoarc/persistence/ SQLite schema setup and queries
  • src/autoarc/types/ shared records and enums grouped by concern
  • docs/architecture.md lifecycle, promotion rules, and design notes
  • docs/code-tour.md reading order, file map, and common change paths
  • test/ API and workflow tests
  • test/fixtures/example_repo/ small repo used by tests and the live workflow harness

Code Tour

If you are opening the repo for the first time, start here:

  1. src/autoarc/runtime/app.gleam
  2. src/autoarc/runtime/api.gleam
  3. src/autoarc/runtime/coordinator.gleam
  4. src/autoarc/runtime/worker.gleam

That path gets you from boot, to HTTP entry, to scheduling, to the side-effect work itself.

For the deeper version, including “what file to read for what question,” see docs/code-tour.md.

Development

make help
make deps
make format
make test

You can still run the raw Gleam commands directly:

gleam format
gleam test

For the opt-in live workflow test:

make live-test

You can override the live-test setup from .env or on the command line:

make live-test \
  MODEL=gpt-5.4 \
  EXPERIMENT_COUNT=5 \
  CONCURRENCY=1 \
  RESEARCH_DIRECTION='Focus on parser and benchmark hot paths.'

The live test uses DATA_DIR itself as the runtime data directory, so you can inspect the normal on-disk layout directly in ./autoarc-data by default. It resets that directory at the start of the run, then uses a fresh temp copy of test/fixtures/example_repo/ as the input repo. It requires pi, bun, and a working local pi login.

See SECURITY.md for the trust model, CONTRIBUTING.md for contributor workflow, and AGENTS.md for repo-specific agent instructions.


Built by Arcnem AI in Tokyo.
