An open-source benchmark for evaluating AI models on OpenTelemetry instrumentation tasks across multiple programming languages.
- Benchmark: OTelBench results
- Blog post: Benchmarking OpenTelemetry: Can AI trace your failed login?
Requires Harbor (`uv tool install harbor`), Docker, and the relevant API keys.
By default, we use the terminus-2 agent (Harbor's default) via OpenRouter to compare models.
You are free to use other agents, including well-known CLI AI agents like Claude Code, Codex, or Cursor CLI.
First, clone this repo:

```shell
git clone git@github.com:QuesmaOrg/otel-bench.git
cd otel-bench
```

Run a single task, for a single model:
```shell
export ANTHROPIC_API_KEY=...
harbor run \
  --path datasets/otel \
  --task-name cpp-simple \
  --agent terminus-2 \
  --model anthropic/claude-opus-4-5-20251101
```

Task names allow wildcards, so to run all Go tasks:
```shell
export OPENAI_API_KEY=...
harbor run \
  --path datasets/otel \
  --task-name 'go-*' \
  --agent terminus-2 \
  --model openai/gpt-5.2
```

Run all tasks with a few models, with 3 attempts per model-task combination:
```shell
export OPENROUTER_API_KEY=...
harbor run \
  --path datasets/otel \
  --agent terminus-2 \
  --model openrouter/google/gemini-3-pro-preview \
  --model openrouter/anthropic/claude-opus-4-5 \
  --model openrouter/openai/gpt-5.2-codex \
  --n-attempts 3
```

You can view trajectories (the interactions between the agent and the system) with `harbor view jobs`.
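Since `--task-name` patterns are glob-style, you can sanity-check a pattern locally before launching a run. A minimal sketch in plain bash (no Harbor required); that Harbor uses shell-style glob matching is an assumption here, and the task names are a subset of those listed below:

```shell
# Check which known task names a glob pattern would select.
# Uses bash's own glob matching in `case`; $pattern is deliberately
# unquoted so it is treated as a pattern, not a literal string.
pattern='go-*'
for task in go-http-tracing go-grpc-fix cpp-simple java-microservices; do
  case "$task" in
    $pattern) echo "would run: $task" ;;
  esac
done
```

This prints only the `go-` tasks, confirming the pattern before you spend API credits on a full run.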
For an overview of Harbor, see our post Migrating CompileBench to Harbor: standardizing AI agent evals.
The OpenTelemetry dataset in `datasets/otel` contains tasks testing AI models' ability to instrument applications with OpenTelemetry across 11 programming languages. So far, it includes the following tasks:
- C++: simple, advanced, distributed-context-propagation
- Go: http-tracing, distributed-context-propagation, workflow-tracing, microservices, grpc-fix, microservices-logs, microservices-traces, microservices-traces-simple
- Java: simple, advanced, distributed-context-propagation, microservices
- JavaScript: microservices
- .NET: microservices
- PHP: distributed-context-propagation, microservices
- Python: distributed-context-propagation, microservices
- Ruby: microservices
- Rust: distributed-context-propagation, microservices
- Erlang: microservices
- Swift: microservices
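Several languages above share the microservices task, so a glob over the suffix lets you compare languages on the same scenario. A dry-run sketch (drop the leading `echo` to actually run it); that `--task-name` globs match against the task names listed above is an assumption:

```shell
# Dry run: print the harbor command instead of executing it.
# Assumption: '*-microservices' matches go-microservices, java-microservices, etc.
echo harbor run \
  --path datasets/otel \
  --task-name '*-microservices' \
  --agent terminus-2 \
  --model openrouter/anthropic/claude-opus-4-5
```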
- Tasks require internet access to run
- Task solution instructions are not yet included (work in progress)
- Reference results for comparison are in benchmark-results/otel; we generate these from jobs (the generation pipeline is not yet included)
Apache 2.0, see LICENSE for details.