Skip to content

XY-Corp/CompiledAI

Repository files navigation

Compiled AI

Paper: Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation (arXiv 2026)

This repository contains the code, benchmark suite, and evaluation framework accompanying the paper.


A benchmark suite for evaluating Compiled AI β€” a paradigm where LLMs generate executable code artifacts during a one-time "compilation" phase, eliminating runtime inference costs.

The Problem

Current LLM-based automation approaches suffer from:

  • High per-transaction costs β€” every request requires LLM inference
  • Non-deterministic outputs β€” identical inputs can produce different results
  • Latency variability β€” P99 latency is unpredictable
  • Reliability gaps β€” 35-65% failure rates in multi-turn scenarios

The Solution: Compiled AI

Instead of calling LLMs at runtime, Compiled AI:

  1. Generates code once from a YAML specification
  2. Validates through 4 stages (Security β†’ Syntax β†’ Execution β†’ Accuracy)
  3. Executes deterministically with zero runtime LLM costs
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              COMPILED AI BENCHMARK SYSTEM                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                      β”‚   YAML      β”‚
                                      β”‚   Spec      β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                 CODE FACTORY                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚  β”‚   Templates     β”‚    β”‚    Modules      β”‚    β”‚  Prompt Blocks  β”‚                       β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                       β”‚
β”‚  β”‚  β”‚ Simple    β”‚  β”‚    β”‚  β”‚ Database  β”‚  β”‚    β”‚  β”‚ HIPAA     β”‚  β”‚                       β”‚
β”‚  β”‚  β”‚ Streaming β”‚  β”‚    β”‚  β”‚ HTTP      β”‚  β”‚    β”‚  β”‚ PCI-DSS   β”‚  β”‚                       β”‚
β”‚  β”‚  β”‚ Validator β”‚  β”‚    β”‚  β”‚ Notif     β”‚  β”‚    β”‚  β”‚ SOC2      β”‚  β”‚                       β”‚
β”‚  β”‚  β”‚ Batch     β”‚  β”‚    β”‚  β”‚ ...       β”‚  β”‚    β”‚  β”‚ ...       β”‚  β”‚                       β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚            β”‚                    β”‚                      β”‚                                 β”‚
β”‚            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                 β”‚
β”‚                                 β–Ό                                                        β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                            β”‚
β”‚                    β”‚    CONFIG AGENT        β”‚                                            β”‚
β”‚                    β”‚  β€’ Parse YAML spec     β”‚                                            β”‚
β”‚                    β”‚  β€’ Select template     β”‚                                            β”‚
β”‚                    β”‚  β€’ Compose modules     β”‚                                            β”‚
β”‚                    β”‚  β€’ Assemble prompt     β”‚                                            β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                            β”‚
β”‚                                β”‚                                                         β”‚
β”‚                                β–Ό                                                         β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                            β”‚
β”‚                    β”‚      GENERATOR         │──────────┐                                 β”‚
β”‚                    β”‚  β€’ LLM prompt          β”‚          β”‚                                 β”‚
β”‚                    β”‚  β€’ Code extraction     β”‚          β”‚ Regeneration                    β”‚
β”‚                    β”‚  β€’ Max 5 attempts      │◄────────── on failure                      β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚                       β”‚
                                 β–Ό                       β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          4-STAGE VALIDATION PIPELINE                                     β”‚
β”‚                                                                                          β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚   β”‚  SECURITY   │───▢│   SYNTAX    │───▢│  EXECUTION  │───▢│  ACCURACY   β”‚              β”‚
β”‚   β”‚  (Stage 1)  β”‚    β”‚  (Stage 2)  β”‚    β”‚  (Stage 3)  β”‚    β”‚  (Stage 4)  β”‚              β”‚
β”‚   β”‚             β”‚    β”‚             β”‚    β”‚             β”‚    β”‚             β”‚              β”‚
β”‚   β”‚ β€’ Bandit    β”‚    β”‚ β€’ AST parse β”‚    β”‚ β€’ Sandbox   β”‚    β”‚ β€’ Golden    β”‚              β”‚
β”‚   β”‚ β€’ Semgrep   β”‚    β”‚ β€’ mypy      β”‚    β”‚ β€’ Fixtures  β”‚    β”‚   outputs   β”‚              β”‚
β”‚   β”‚ β€’ Secrets   β”‚    β”‚ β€’ ruff      β”‚    β”‚ β€’ Timeout   β”‚    β”‚ β€’ Threshold β”‚              β”‚
β”‚   β”‚ β€’ OWASP     β”‚    β”‚ β€’ radon     β”‚    β”‚ β€’ Coverage  β”‚    β”‚   check     β”‚              β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚         β”‚                  β”‚                  β”‚                  β”‚                      β”‚
β”‚         β”‚ FAIL             β”‚ FAIL             β”‚ FAIL             β”‚ FAIL                 β”‚
β”‚         └──────────────────┴──────────────────┴──────────────────┼───────────────────────
β”‚                                                                  β”‚                      β”‚
β”‚                                     Regenerate β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       βœ“ PASS        β”‚
β”‚                                                                                β–Ό        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                                 β”‚
                                                                                 β–Ό
                                                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                                                    β”‚   VALIDATED ARTIFACT   β”‚
                                                                    β”‚   β€’ Temporal Activity  β”‚
                                                                    β”‚   β€’ Production-ready   β”‚
                                                                    β”‚   β€’ Zero runtime LLM   β”‚
                                                                    β”‚   β€’ Deterministic      β”‚
                                                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installation

# Clone the repository
git clone https://github.com/XY-Corp/CompiledAI.git
cd CompiledAI

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .

Quick Start

# Set up API keys
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY and/or OPENAI_API_KEY

# Run the benchmark
uv run python run_benchmark.py --dataset bfcl --baseline code_factory

Metrics

The benchmark evaluates 7 metric categories:

Category Key Metrics Competitive Target
Token Efficiency Compression ratio, LOC/token, Break-even N* >4x compression
Latency TTFT, TPOT, P50, P99, Jitter TTFT <2s, TPOT <200ms
Consistency Semantic entropy, Exact match rate Entropy = 0
Reliability Task completion, Error rate >50% completion
Code Quality Cyclomatic complexity, Coverage, pass@k Cyclomatic <10
Validation First-pass rate, Regen attempts >70% first-pass
Cost Generation cost, TCO, Determinism Advantage DA > 1

Break-Even Analysis

Compiled AI becomes cost-effective after N* executions:

N* = Generation_Cost / Runtime_Cost_Per_Execution

For function-calling tasks: N β‰ˆ 17 executions* (paper result, BFCL evaluation).

Task Categories

Category Example Tasks
Document Processing EOB extraction, Invoice parsing
Data Transformation Schema mapping, Normalization
Decision Logic Eligibility checks, Routing rules
API Orchestration Multi-system updates, Webhooks

Datasets

The paper evaluates on two external benchmark datasets:

Dataset Instances Description
BFCL v3 400 Berkeley Function Calling Leaderboard β€” function calling accuracy
DocILE 5,680 invoices Document Information Extraction β€” KILE and LIR metrics

Download Datasets

# Download BFCL from HuggingFace (free)
python scripts/download_bfcl.py

# Download DocILE (requires access token from https://docile.rossum.ai/)
./scripts/download_dataset_docile.sh YOUR_TOKEN annotated-trainval datasets/docile --unzip

Downloaded data goes into datasets/ (excluded from git).

Security Validation Pipeline

CompiledAI includes a 3-gate security validation pipeline that protects against prompt injection, data leakage, and vulnerable code generation.

Architecture

User Prompt β†’ INPUT GATE β†’ Compilation β†’ CODE GATE β†’ Execution β†’ OUTPUT GATE β†’ Result
              (validates     (LLM Coder    (validates    (runs        (checks for
               user input)    generates     generated     code)         leakage)
                              code)         code)
Gate Validators Purpose
INPUT GATE PromptInjectionValidator, PIIScanner Block malicious prompts, detect PII
CODE GATE CodeShieldValidator Block vulnerable generated code
OUTPUT GATE CanaryManager Detect system prompt leakage

Running Security Benchmarks

# INPUT GATE tests (prompt injection + PII detection)
uv run python run_benchmark.py --dataset security_input_gate --baseline code_factory

# CODE GATE tests (vulnerable code detection - 20 deterministic fixtures)
uv run python run_benchmark.py --dataset security_code_gate_fixtures

# OUTPUT GATE tests (canary leakage detection)
uv run python run_benchmark.py --dataset security_output_gate --baseline code_factory

# Direct validator testing with confusion matrix metrics
uv run python scripts/run_security_benchmark.py --category input_injection

Security Benchmark Results

Gate Dataset Instances Success Rate
INPUT GATE security_input_gate 55 96.7%
CODE GATE security_code_gate_fixtures 20 100%
OUTPUT GATE security_output_gate 40 87.5%

Baselines

Compare against:

  • Direct LLM β€” Per-transaction inference
  • LangChain Agent β€” Tool-using agent
  • Multi-Agent β€” AutoGen-style coordination
  • Human Code β€” Hand-written implementation

Project Structure

CompiledAI/
β”œβ”€β”€ src/compiled_ai/
β”‚   β”œβ”€β”€ factory/          # Code generation (Templates, Modules, Prompts)
β”‚   β”œβ”€β”€ validation/       # 4-stage validation pipeline
β”‚   β”œβ”€β”€ baselines/        # Comparison implementations
β”‚   β”œβ”€β”€ metrics/          # All 7 metric categories
β”‚   β”œβ”€β”€ runner/           # Benchmark execution & dataset loading
β”‚   └── utils/            # LLM client, logging, sandbox
β”œβ”€β”€ datasets/             # Downloaded datasets (gitignored)
β”‚   β”œβ”€β”€ xy_benchmark/     # Internal benchmark tasks
β”‚   └── bfcl_v3/          # BFCL function calling (download required)
β”‚       # DocILE goes in datasets/docile/ (download required)
β”œβ”€β”€ scripts/              # CLI entry points & dataset downloaders
β”œβ”€β”€ results/              # Benchmark results (gitignored)
└── tests/                # Unit & integration tests

Development

# Install dev dependencies
uv sync --group dev

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/

Research

Based on evaluation frameworks and datasets including:

  • BFCL v3 β€” Berkeley Function Calling Leaderboard (gorilla-llm)
  • DocILE β€” Document Information Extraction benchmark
  • Pan & Wang 2025 β€” Break-even analysis for code generation
  • AgentBench β€” Multi-turn agent benchmark (ICLR 2024)

License

MIT

Citation

If you use this work, please cite:

@article{trooskens2026compiledai,
  title={Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation},
  author={Trooskens, Geert and Karlsberg, Aaron and Sharma, Anmol and De Brouwer, Lamara
          and Van Puyvelde, Max and Young, Matthew and Thickstun, John
          and Alterovitz, Gil and De Brouwer, Walter A.},
  journal={arXiv preprint},
  year={2026}
}

Note: results/, logs/, and workflows/ are generated at runtime and are not tracked in git.

About

Benchmark suite

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors