Oak Bench

Oak Bench exists so you can evaluate agent reliability and safety on realistic workloads: do agents follow the right procedures, respect constraints, and adhere to policies instead of drifting or cutting corners? Oak Health Insurance is the first scenario—a lifelike healthcare insurance setting you can run and score with the benchmark harness.

What's in this repo

Purpose	What to use
Demo / showcase	Install `oak_health/` as a package and run `cuga-oak-health`
Agent evaluation	Run `uv run run.py oak_health` from the repo root
Unit tests	Run `pytest` from `oak_health/tests/`

Repository Structure

oak-benchmark/
├── run.py                          # Benchmark runner (starts services + runs eval)
├── eval_bench_sdk.py               # Evaluation script
├── oak_health_test_suite_v1.json   # Benchmark task definitions
├── config/
│   ├── global.env                  # Shared configuration
│   └── oak_health_insurance.env    # App-specific configuration
├── helpers/                        # Eval helper modules and scripts
├── oak_health/                       # The Health API — installable package
│   ├── pyproject.toml              # Package metadata and entry points
│   ├── src/
│   │   └── oak_health/             # Core FastAPI application
│   │       ├── main.py             # FastAPI app (port 8090)
│   │       ├── models.py           # Pydantic models
│   │       └── data.py             # Seed data and fixtures
│   ├── oak_mcp_servers.yaml        # MCP server config (used by CUGA registry)
│   └── tests/                      # Unit tests for the API
└── scripts/                        # Visualization utilities

Oak Health Insurance — demo app

The oak_health/ directory is the Oak Health Insurance package: a self-contained Python package. Install it once, then launch the server with a single command.

Install

cd oak_health
uv pip install .

Run

uv run cuga-oak-health

This starts the FastAPI server at http://localhost:8090.

Interactive API docs are available at http://localhost:8090/docs.

Oak Health Insurance — benchmark

The benchmark runner (run.py) automatically starts the Oak Health Insurance FastAPI app and CUGA registry, runs the evaluation, then shuts everything down.

Prerequisites

CUGA Agent installed at ../cuga-agent
API keys configured in .env (see below)

Setup

# 1. Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync

# 2. Configure API keys
cp .env.example .env   # or: touch .env

Add to .env:

ANTHROPIC_API_KEY=your-anthropic-key
OPENAI_API_KEY=your-openai-key

# Optional — Langfuse tracing
LANGFUSE_SECRET_KEY=your-secret-key
LANGFUSE_PUBLIC_KEY=your-public-key
LANGFUSE_HOST=http://localhost:3000

Run the benchmark

uv run run.py oak_health

Options

# Repeat N times and report mean ± std
uv run run.py oak_health --rep 5

# Filter by difficulty
uv run run.py oak_health --difficulty easy
uv run run.py oak_health --difficulty medium
uv run run.py oak_health --difficulty hard

# Run a single task
uv run run.py oak_health --task approved_claims

# Keep services running after eval (useful for debugging)
uv run run.py oak_health --no-cleanup

Results

# Open visualization dashboard
./scripts/viz.sh oak_health_insurance

Results are written to results/ (JSON) and logging/.

Unit Tests

Tests live in oak_health/tests/ and cover every API endpoint.

cd oak_health/tests
pytest

Or from the repo root:

pytest oak_health/tests/

Configuration

`config/oak_health_insurance.env`

Key settings:

MCP_SERVERS_FILE="oak_health/oak_mcp_servers.yaml"
CUGA_LOGGING_DIR="./logging"

Advanced CUGA agent settings:

DYNACONF_ADVANCED_FEATURES__CUGA_MODE = "accurate"
DYNACONF_FEATURES__FORCED_APPS = ["oak_health_insurance"]
DYNACONF_FEATURES__LOCAL_SANDBOX = true

`config/global.env`

Shared settings applied to all components (loaded automatically).

Metrics

The benchmark tracks:

Metric	Description
Pass Rate	% of tasks where the agent produced a correct answer
Keyword Match Rate	% of expected keywords found in responses
Tool Call Recall	% of expected tools the agent actually called
Tool Call Precision	% of agent's tool calls that were expected
Tool Call F1	Harmonic mean of recall and precision
Avg Latency	Mean time per task (seconds)

Langfuse Tracing (Optional)

Start Langfuse locally:

git clone https://github.com/langfuse/langfuse.git
cd langfuse && docker compose up

Get keys from the UI at http://localhost:3000 → Project Settings → API Keys.
Add to .env and enable in config/global.env.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
config		config
config_loader		config_loader
helpers		helpers
oak_health		oak_health
scripts		scripts
templates		templates
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
calculate_test_score.py		calculate_test_score.py
eval_bench.py		eval_bench.py
eval_bench_sdk.py		eval_bench_sdk.py
oak_health_test_suite_v1.json		oak_health_test_suite_v1.json
pyproject.toml		pyproject.toml
run.py		run.py
setup_cuga.sh		setup_cuga.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Oak Bench

What's in this repo

Repository Structure

Oak Health Insurance — demo app

Install

Run

Oak Health Insurance — benchmark

Prerequisites

Setup

Run the benchmark

Options

Results

Unit Tests

Configuration

`config/oak_health_insurance.env`

`config/global.env`

Metrics

Langfuse Tracing (Optional)

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Oak Bench

What's in this repo

Repository Structure

Oak Health Insurance — demo app

Install

Run

Oak Health Insurance — benchmark

Prerequisites

Setup

Run the benchmark

Options

Results

Unit Tests

Configuration

config/oak_health_insurance.env

config/global.env

Metrics

Langfuse Tracing (Optional)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`config/oak_health_insurance.env`

`config/global.env`

Packages