AgentSettlementBench

Safety benchmark for AI agents making irreversible financial decisions.

AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.

Result Snapshot (Public Leaderboard)

It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).

Model	Accuracy	Critical Fail Rate	Risk-Weighted Fail
Codex	50.0%	30.0%	40.0%
Gemini 3.1	55.0%	28.6%	39.9%
Claude Haiku (subset 13/20)	84.6%	0.0%	15.0%
ChatGPT-4.1 (subset 10/20)	90.0%	0.0%	9.0%
MiniMax-2.5 (subset 10/20)	80.0%	20.0%	24.0%

Subset rows are reference-only and not leaderboard-eligible.

Mental Model

Traditional benchmarks: question -> answer -> score

AgentSettlementBench: event -> financial decision -> irreversible consequence

We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.

What you get

Running the benchmark produces:

Safety accuracy score
Critical failure rate (money loss risk)
Risk-weighted reliability score

Example:

Accuracy: 55%
Critical Fail Rate: 28.6%
Risk Weighted Fail: 39.9%

Run in 60 seconds

git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark

Optional arguments:

npm run benchmark -- -- --model openai --mode v0 --key YOUR_OPENAI_KEY
npm run benchmark -- -- --model gemini --key YOUR_GEMINI_KEY
npm run benchmark -- -- --model local --api-model qwen2.5:7b --base-url http://localhost:11434/v1/chat/completions

Notes:

--model supports: mock, openai, gemini, local
--mode tags the run in summary output (for example: v0, v1, v3)
--api-model chooses provider model id (defaults are built in)
--key can be omitted if .env has OPENAI_API_KEY or GEMINI_API_KEY

Supported Models

OpenAI
Gemini
Local (Ollama, LM Studio, vLLM)
Mock baseline
Ensemble voting setups

Small models are intentionally supported.

Who this is for

AI agent developers
Fintech builders
Alignment researchers
Evaluation researchers
Safety teams

Evaluation Modes

Mode	Description
v0	Open reasoning (raw LLM)
v1	Strict policy prompt
v3	Tool-verified / state machine bounded

Repo Purpose

Evaluate AI agent settlement safety under adversarial crypto payment scenarios.
Compare single-run model behavior vs architecture-level controls (strict prompts, verification tools, ensembles).
Provide reproducible scoring and model comparison artifacts.

Quick Start

Generate prompts

node scripts/generate-benchmark-prompts.js

Run model manually and save outputs to:

eval/responses.jsonl

Score results

node scripts/score-model-responses.js eval/responses.jsonl

Build comparison

node scripts/build-model-comparison.js

Example Output:

Accuracy: 55%
Critical Fail Rate: 28.6%
Risk Weighted Fail: 39.9%

Ensemble Evaluation Mode (K=7 Strict Majority Vote)

To evaluate whether repeated sampling improves system reliability:

Generate K responses per case and save to eval/responses_ensemble.jsonl.
Score the ensemble run with explicit K:

node scripts/score-ensemble-responses.js --input eval/responses_ensemble.jsonl --k 7

Review outputs:

eval/ensemble_scored.csv
eval/ensemble_summary.json
eval/ensemble_summary.md

Scoring rules:

Each case must have exactly K responses.
Decision uses strict majority (threshold = floor(K/2)+1), not plurality.
If no strict majority exists, decision is marked NO_MAJORITY and scored as fail (ensemble_no_majority).
The report includes both Single Model Accuracy (all raw responses) and Ensemble Majority Accuracy (case-level strict vote).

Ensemble runs are reference-only and not leaderboard-eligible.

Ensemble K Sweep (Reliability vs Cost Curve)

To compare multiple ensemble sizes in one run (for example K=1,3,5,7):

node scripts/run-ensemble-k-sweep.js --input eval/responses_ensemble.jsonl --k-values 1,3,5,7 --cost-per-call-usd 0.002 --latency-per-call-ms 850

For research-grade stability, average each K across random subset trials:

node scripts/run-ensemble-k-sweep.js --input eval/responses_ensemble.jsonl --k-values 1,3,5,7 --bootstrap-runs 30 --random-seed 42 --cost-per-call-usd 0.002 --latency-per-call-ms 850

Outputs:

eval/ensemble_k_sweep/ensemble_k_sweep.md
eval/ensemble_k_sweep/ensemble_k_sweep.json
eval/ensemble_k_sweep/ensemble_k_sweep.csv
per-K subfolders: eval/ensemble_k_sweep/k1, k3, k5, k7

The sweep report includes:

accuracy and risk-weighted fail rate per K
standard deviation across bootstrap trials (mean +/- sd)
percentile interval bands (p5/p50/p95) for accuracy and fail metrics
delta vs baseline K (first value in --k-values)
strict-majority failure signal (NO_MAJORITY count)
estimated cost/latency from your per-call assumptions

Notes:

--bootstrap-runs 1 keeps deterministic first-K behavior.
--bootstrap-runs > 1 samples random subsets per case and reports averaged metrics.
--random-seed makes bootstrap runs reproducible.

Benchmark Data

ai_benchmark/agentsettlement_benchmark.json
ai_benchmark/agentsettlement_benchmark_raw_v1.json
ai_benchmark/ground_truth.json

Rubric

rubric/agentsettlement_rules.md
rubric/agentsettlement_rules.json

Execution Protocol

ai_benchmark/run_eval.md

Scripts

scripts/run-benchmark.js
scripts/generate-benchmark-prompts.js
scripts/generate-response-template.js
scripts/generate-ensemble-mock.js
scripts/score-model-responses.js
scripts/score-ensemble-responses.js
scripts/run-ensemble-k-sweep.js
scripts/score-manual-decisions.js
scripts/validate-manual-runs.js
scripts/build-model-comparison.js

Validity Rule

Leaderboard-eligible results must come from raw model outputs scored via:

score-model-responses.js

Baselines/manual/self-check runs are reference-only and are not leaderboard-eligible.

Objective Labels

Canonical case decisions are stored in:

ai_benchmark/ground_truth.json

Scoring and prompt-generation scripts enforce consistency between benchmark cases, rubric metadata, and ground-truth labels.

Key Findings

Evaluation across multiple prompt regimes reveals a critical behavioral law in AI payment validation: In this benchmark, observed LLM safety behavior depends strongly on operational instruction structure.

Performance across three distinct evaluation modes highlights this:

Strict safety policy: 100% accuracy (measures rule execution)
Guided agent prompt: ~95% accuracy (measures constrained decision-making)
Open reasoning prompt: 55% accuracy with a 28.6% critical failure rate (measures true reasoning capability)

Insight Details

Models demonstrate strong semantic understanding of typical financial attacks (e.g., spoofed tokens, wrong recipient). However, under unguided reasoning, they fail on operational distributed-systems logic, such as:

RPC: finality & consensus reasoning
Edge Cases: concurrency & timing logic
Boundary Races: state machine thinking

Architecture Safety Calibration Curve

Version	Architecture	Accuracy	Critical Fail Rate	Risk-Weighted Fail
v0	Open Reasoning (Raw LLM)	55.0%	28.6%	39.9%
v1	Strict Prompt Policy	100%	0.0%	0.0%
v3	Tool Verification (State Machine)	80.0%	14.3%	17.6%

By simply altering the architecture to limit the LLM's authority (transitioning from decision-maker to evidentiary analyst determining recommendations), the architecture reduces high-risk decision exposure on distributed systems errors like RPC timeouts and mempool finality. The accuracy inherently clusters back up strictly because the model explicitly delays boundary evaluations to the deterministic systems layer.

Core Insight: Safety improved not by increasing model correctness, but by reducing model authority.

Model Independence & Next Steps

This curve demonstrates that reliability comes from system design rather than model IQ.

To prove this methodology generalizes, the benchmark executes this exact Safety Calibration Curve across distinct frontier models. The following initial capability comparison under unguided reasoning (v0) illustrates baseline model vulnerabilities:

Model	Accuracy	Critical Fail Rate	Risk-Weighted Fail
Codex	50.0%	30.0%	40.0%
Gemini 3.1	55.0%	28.6%	39.9%
Claude Haiku (Manual Open Reasoning Subset, 13/20)	84.6%	0.0%	15.0%
ChatGPT-4.1 (Open Reasoning Subset, 10/20)	90.0%	0.0%	9.0%
MiniMax-2.5 (Open Reasoning Subset, 10/20)	80.0%	20.0%	24.0%

Subset rows are manual samples and are not leaderboard-eligible. Subset coverage is standardized against the full 20-case benchmark.

Across tested models, distributed-systems uncertainty handling remains a recurring weakness, even when overall accuracy improves.

If the architectural improvement (v0 → v3) persists uniformly across all models as initialized above, the benchmark will formally establish that deterministic state constraints reliably correct LLM financial distributed-systems failures regardless of the underlying model weights.

Conclusion

On this benchmark, models follow explicit safety rules reliably but show reduced performance under unguided reasoning, especially in distributed-system edge cases such as consensus disagreement and timing races.

This benchmark is intended to serve as a Safety Calibration Tool. Its goal is not simply to score baseline models, but to help developers design robust control layers (such as deterministic state machines and tool-verified reasoning architectures) that help reduce high-risk edge-case failures.

Limitations

1. Synthetic Scenarios

The benchmark cases are derived from realistic payment failure patterns but are still simulated descriptions rather than live blockchain execution traces. Therefore, results measure decision-making reliability under structured conditions, not full production behavior under adversarial network latency or real economic incentives.

2. Prompt Sensitivity

Model performance varies significantly with instruction framing. The benchmark intentionally exposes this property, but it also means scores should not be interpreted as inherent intelligence or safety capability of the base model. They represent behavior under a specific interaction protocol.

3. Not a Complete Payment System Audit

Passing the benchmark does not guarantee a secure payment gateway. The evaluation focuses on settlement decision logic only and does not cover:

wallet key management
signing infrastructure
API authentication
economic attacks (MEV, bribing, fee manipulation)
denial-of-service conditions

4. Limited Chain Coverage

Scenarios abstract multiple blockchain environments into common patterns (confirmation depth, reorg risk, RPC disagreement). Different networks have unique finality properties, and results may not directly transfer without adapting thresholds.

5. Architecture Dependency

The safety improvements observed arise from restricting model authority and introducing deterministic verification layers. The benchmark therefore evaluates system design choices as much as model behavior. It should not be interpreted as a standalone model safety certification.

6. Small Sample Size

The benchmark contains 20 high-signal cases rather than a large statistical dataset. Its purpose is adversarial coverage, not probabilistic performance measurement. Future work may expand the case set to improve statistical confidence.

Future Work

1. Multi-Model Validation

Run the full safety calibration curve across multiple independent model families (e.g., frontier, mid-tier, and smaller open models). The goal is to determine whether the observed safety improvements arise from architectural constraints rather than specific model training.

2. Expanded Adversarial Coverage

Increase the benchmark set beyond 20 scenarios to include:

partial chain outages
delayed finality conditions
cross-chain bridging inconsistencies
fee market manipulation
adversarial timing attacks

This will improve statistical confidence and broaden distributed-systems coverage.

3. Live Trace Evaluation

Introduce replay testing using real historical blockchain transactions (sanitized). This will measure behavior under realistic noise rather than structured descriptions.

4. Automated Policy Synthesis

Investigate whether benchmark failures can automatically generate new control rules, enabling a feedback loop: failure → rule → safer agent → re-evaluation. The goal is to convert the benchmark into a continuous safety hardening pipeline.

5. Formal Safety Guarantees

Explore combining LLM reasoning with verifiable checks (deterministic state machines or formal constraints) to move from empirical reliability toward provable safety bounds for financial agents.

Project Policies

Maintainer Notes

Visibility operations (topics, discussions, release, starter issue): docs/github-visibility.md Canonical v1.0 release notes source: docs/releases/v1.0.md

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github		.github
ai_benchmark		ai_benchmark
docs		docs
eval		eval
rubric		rubric
scripts		scripts
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
package-lock.json		package-lock.json
package.json		package.json
v1.0.md		v1.0.md

Folders and files

Latest commit

History

Repository files navigation

AgentSettlementBench

Result Snapshot (Public Leaderboard)

Mental Model

What you get

Run in 60 seconds

Supported Models

Who this is for

Evaluation Modes

Repo Purpose

Quick Start

Ensemble Evaluation Mode (K=7 Strict Majority Vote)

Ensemble K Sweep (Reliability vs Cost Curve)

Benchmark Data

Rubric

Execution Protocol

Scripts

Validity Rule

Objective Labels

Key Findings

Insight Details

Architecture Safety Calibration Curve

Model Independence & Next Steps

Conclusion

Limitations

1. Synthetic Scenarios

2. Prompt Sensitivity

3. Not a Complete Payment System Audit

4. Limited Chain Coverage

5. Architecture Dependency

6. Small Sample Size

Future Work

1. Multi-Model Validation

2. Expanded Adversarial Coverage

3. Live Trace Evaluation

4. Automated Policy Synthesis

5. Formal Safety Guarantees

Project Policies

Maintainer Notes

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages