An open-source benchmark for evaluating AI models on OpenTelemetry instrumentation tasks across multiple programming languages.
- Benchmark: OTelBench results
- Blog post: Benchmarking OpenTelemetry: Can AI trace your failed login?
Requires Harbor (`uv tool install harbor`), Docker, and the relevant API keys.
By default, we use the terminus-2 agent (Harbor's default) via OpenRouter to compare models.
You are free to use other agents, including well-known CLI AI agents like Claude Code, Codex, or Cursor CLI.
First, clone this repo:

```shell
git clone git@github.com:QuesmaOrg/otel-bench.git
cd otel-bench
```

Run a single task, for a single model:
```shell
export ANTHROPIC_API_KEY=...
harbor run \
  --path datasets/otel \
  --task-name cpp-simple \
  --agent terminus-2 \
  --model anthropic/claude-opus-4-5-20251101
```

Task names allow wildcards, so to run all Go tasks:
```shell
export OPENAI_API_KEY=...
harbor run \
  --path datasets/otel \
  --task-name 'go-*' \
  --agent terminus-2 \
  --model openai/gpt-5.2
```

Run all tasks with a few models, with 3 attempts per model-task combination:
```shell
export OPENROUTER_API_KEY=...
harbor run \
  --path datasets/otel \
  --agent terminus-2 \
  --model openrouter/google/gemini-3-pro-preview \
  --model openrouter/anthropic/claude-opus-4-5 \
  --model openrouter/openai/gpt-5.2-codex \
  --n-attempts 3
```

You can view trajectories (the interactions between the agent and the system) with `harbor view jobs`.
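Since `--task-name` patterns are glob-style, you can sanity-check a pattern locally before launching a run. A minimal sketch in plain bash (no Harbor required); that Harbor uses shell-style glob matching is an assumption here, and the task names are a subset of those listed below:

```shell
# Check which known task names a glob pattern would select.
# Uses bash's own glob matching in `case`; $pattern is deliberately
# unquoted so it is treated as a pattern, not a literal string.
pattern='go-*'
for task in go-http-tracing go-grpc-fix cpp-simple java-microservices; do
  case "$task" in
    $pattern) echo "would run: $task" ;;
  esac
done
```

This prints only the `go-` tasks, confirming the pattern before you spend API credits on a full run.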
For an overview of Harbor, see our post Migrating CompileBench to Harbor: standardizing AI agent evals.
The OpenTelemetry dataset in `datasets/otel` contains tasks testing AI models' ability to instrument applications with OpenTelemetry across 11 programming languages. So far, it includes the following tasks:
- C++: simple, advanced, distributed-context-propagation
- Go: http-tracing, distributed-context-propagation, workflow-tracing, microservices, grpc-fix, microservices-logs, microservices-traces, microservices-traces-simple
- Java: simple, advanced, distributed-context-propagation, microservices
- JavaScript: microservices
- .NET: microservices
- PHP: distributed-context-propagation, microservices
- Python: distributed-context-propagation, microservices
- Ruby: microservices
- Rust: distributed-context-propagation, microservices
- Erlang: microservices
- Swift: microservices
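Several languages above share the microservices task, so a glob over the suffix lets you compare languages on the same scenario. A dry-run sketch (drop the leading `echo` to actually run it); that `--task-name` globs match against the task names listed above is an assumption:

```shell
# Dry run: print the harbor command instead of executing it.
# Assumption: '*-microservices' matches go-microservices, java-microservices, etc.
echo harbor run \
  --path datasets/otel \
  --task-name '*-microservices' \
  --agent terminus-2 \
  --model openrouter/anthropic/claude-opus-4-5
```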
- Tasks require internet access to run
- Task solution instructions are not yet included (work in progress)
- Reference results for comparison are in benchmark-results/otel; we generate these from jobs (the generation pipeline is not yet included)
Apache 2.0, see LICENSE for details.