Paper: Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation (arXiv 2026)
This repository contains the code, benchmark suite, and evaluation framework accompanying the paper.
A benchmark suite for evaluating Compiled AI β a paradigm where LLMs generate executable code artifacts during a one-time "compilation" phase, eliminating runtime inference costs.
Current LLM-based automation approaches suffer from:
- High per-transaction costs β every request requires LLM inference
- Non-deterministic outputs β identical inputs can produce different results
- Latency variability β P99 latency is unpredictable
- Reliability gaps β 35-65% failure rates in multi-turn scenarios
Instead of calling LLMs at runtime, Compiled AI:
- Generates code once from a YAML specification
- Validates through 4 stages (Security β Syntax β Execution β Accuracy)
- Executes deterministically with zero runtime LLM costs
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMPILED AI BENCHMARK SYSTEM β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββ
β YAML β
β Spec β
ββββββββ¬βββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CODE FACTORY β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β Templates β β Modules β β Prompt Blocks β β
β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β
β β β Simple β β β β Database β β β β HIPAA β β β
β β β Streaming β β β β HTTP β β β β PCI-DSS β β β
β β β Validator β β β β Notif β β β β SOC2 β β β
β β β Batch β β β β ... β β β β ... β β β
β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β β β β
β ββββββββββββββββββββββΌβββββββββββββββββββββββ β
β βΌ β
β ββββββββββββββββββββββββββ β
β β CONFIG AGENT β β
β β β’ Parse YAML spec β β
β β β’ Select template β β
β β β’ Compose modules β β
β β β’ Assemble prompt β β
β βββββββββββββ¬βββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββ β
β β GENERATOR ββββββββββββ β
β β β’ LLM prompt β β β
β β β’ Code extraction β β Regeneration β
β β β’ Max 5 attempts ββββββββββββ€ on failure β
β βββββββββββββ¬βββββββββββββ β β
ββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββ
β β
βΌ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 4-STAGE VALIDATION PIPELINE β
β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β SECURITY βββββΆβ SYNTAX βββββΆβ EXECUTION βββββΆβ ACCURACY β β
β β (Stage 1) β β (Stage 2) β β (Stage 3) β β (Stage 4) β β
β β β β β β β β β β
β β β’ Bandit β β β’ AST parse β β β’ Sandbox β β β’ Golden β β
β β β’ Semgrep β β β’ mypy β β β’ Fixtures β β outputs β β
β β β’ Secrets β β β’ ruff β β β’ Timeout β β β’ Threshold β β
β β β’ OWASP β β β’ radon β β β’ Coverage β β check β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β β β β β
β β FAIL β FAIL β FAIL β FAIL β
β ββββββββββββββββββββ΄βββββββββββββββββββ΄βββββββββββββββββββΌβββββββββββββββββββββββ€
β β β
β Regenerate βββββββββββββββββββ β PASS β
β βΌ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββ
β
βΌ
ββββββββββββββββββββββββββ
β VALIDATED ARTIFACT β
β β’ Temporal Activity β
β β’ Production-ready β
β β’ Zero runtime LLM β
β β’ Deterministic β
ββββββββββββββββββββββββββ
# Clone the repository
git clone https://github.com/XY-Corp/CompiledAI.git
cd CompiledAI
# Install with uv (recommended)
uv sync
# Or with pip
pip install -e .# Set up API keys
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY and/or OPENAI_API_KEY
# Run the benchmark
uv run python run_benchmark.py --dataset bfcl --baseline code_factoryThe benchmark evaluates 7 metric categories:
| Category | Key Metrics | Competitive Target |
|---|---|---|
| Token Efficiency | Compression ratio, LOC/token, Break-even N* | >4x compression |
| Latency | TTFT, TPOT, P50, P99, Jitter | TTFT <2s, TPOT <200ms |
| Consistency | Semantic entropy, Exact match rate | Entropy = 0 |
| Reliability | Task completion, Error rate | >50% completion |
| Code Quality | Cyclomatic complexity, Coverage, pass@k | Cyclomatic <10 |
| Validation | First-pass rate, Regen attempts | >70% first-pass |
| Cost | Generation cost, TCO, Determinism Advantage | DA > 1 |
Compiled AI becomes cost-effective after N* executions:
N* = Generation_Cost / Runtime_Cost_Per_Execution
For function-calling tasks: N β 17 executions* (paper result, BFCL evaluation).
| Category | Example Tasks |
|---|---|
| Document Processing | EOB extraction, Invoice parsing |
| Data Transformation | Schema mapping, Normalization |
| Decision Logic | Eligibility checks, Routing rules |
| API Orchestration | Multi-system updates, Webhooks |
The paper evaluates on two external benchmark datasets:
| Dataset | Instances | Description |
|---|---|---|
| BFCL v3 | 400 | Berkeley Function Calling Leaderboard β function calling accuracy |
| DocILE | 5,680 invoices | Document Information Extraction β KILE and LIR metrics |
# Download BFCL from HuggingFace (free)
python scripts/download_bfcl.py
# Download DocILE (requires access token from https://docile.rossum.ai/)
./scripts/download_dataset_docile.sh YOUR_TOKEN annotated-trainval datasets/docile --unzipDownloaded data goes into datasets/ (excluded from git).
CompiledAI includes a 3-gate security validation pipeline that protects against prompt injection, data leakage, and vulnerable code generation.
User Prompt β INPUT GATE β Compilation β CODE GATE β Execution β OUTPUT GATE β Result
(validates (LLM Coder (validates (runs (checks for
user input) generates generated code) leakage)
code) code)
| Gate | Validators | Purpose |
|---|---|---|
| INPUT GATE | PromptInjectionValidator, PIIScanner | Block malicious prompts, detect PII |
| CODE GATE | CodeShieldValidator | Block vulnerable generated code |
| OUTPUT GATE | CanaryManager | Detect system prompt leakage |
# INPUT GATE tests (prompt injection + PII detection)
uv run python run_benchmark.py --dataset security_input_gate --baseline code_factory
# CODE GATE tests (vulnerable code detection - 20 deterministic fixtures)
uv run python run_benchmark.py --dataset security_code_gate_fixtures
# OUTPUT GATE tests (canary leakage detection)
uv run python run_benchmark.py --dataset security_output_gate --baseline code_factory
# Direct validator testing with confusion matrix metrics
uv run python scripts/run_security_benchmark.py --category input_injection| Gate | Dataset | Instances | Success Rate |
|---|---|---|---|
| INPUT GATE | security_input_gate |
55 | 96.7% |
| CODE GATE | security_code_gate_fixtures |
20 | 100% |
| OUTPUT GATE | security_output_gate |
40 | 87.5% |
Compare against:
- Direct LLM β Per-transaction inference
- LangChain Agent β Tool-using agent
- Multi-Agent β AutoGen-style coordination
- Human Code β Hand-written implementation
CompiledAI/
βββ src/compiled_ai/
β βββ factory/ # Code generation (Templates, Modules, Prompts)
β βββ validation/ # 4-stage validation pipeline
β βββ baselines/ # Comparison implementations
β βββ metrics/ # All 7 metric categories
β βββ runner/ # Benchmark execution & dataset loading
β βββ utils/ # LLM client, logging, sandbox
βββ datasets/ # Downloaded datasets (gitignored)
β βββ xy_benchmark/ # Internal benchmark tasks
β βββ bfcl_v3/ # BFCL function calling (download required)
β # DocILE goes in datasets/docile/ (download required)
βββ scripts/ # CLI entry points & dataset downloaders
βββ results/ # Benchmark results (gitignored)
βββ tests/ # Unit & integration tests
# Install dev dependencies
uv sync --group dev
# Run tests
pytest
# Type checking
mypy src/
# Linting
ruff check src/Based on evaluation frameworks and datasets including:
- BFCL v3 β Berkeley Function Calling Leaderboard (gorilla-llm)
- DocILE β Document Information Extraction benchmark
- Pan & Wang 2025 β Break-even analysis for code generation
- AgentBench β Multi-turn agent benchmark (ICLR 2024)
MIT
If you use this work, please cite:
@article{trooskens2026compiledai,
title={Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation},
author={Trooskens, Geert and Karlsberg, Aaron and Sharma, Anmol and De Brouwer, Lamara
and Van Puyvelde, Max and Young, Matthew and Thickstun, John
and Alterovitz, Gil and De Brouwer, Walter A.},
journal={arXiv preprint},
year={2026}
}Note:
results/,logs/, andworkflows/are generated at runtime and are not tracked in git.