Standards for building agents, better
Updated Feb 22, 2026 - TypeScript
Agentic testing for agentic codebases
Ship agents you can audit.
The pre-flight check for AI agents
The open-source AgentOps evaluation harness.
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
GitHub template for agent-testable SaaS apps. Next.js 16 + shadcn/ui + Neon Postgres + agent-browser e2e testing via accessibility tree.
Agent testing automation 🤖 by simulating users 👥 and agents 🤝 with a judge ⚖️ (langwatch-scenario)
The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers. Powered by MyClaw.ai
PHP testing framework for LLM agents — multi-turn dialogs, cassette replay, tool calling, LLM-as-judge assertions
Token-efficient stochastic testing for AI agents. 5-20x cost reduction. 10 framework adapters. Paper: arXiv:2603.02601
A Multi-Agent System for Cross-Checking Phishing URLs.
Diagnose your AI agents in production. Extract policies from prompts, evaluate traces, generate diagnostic reports.
Holdout scenario evaluation harness for AI agents. Doer/Judge/Adversary/Observer roles, probabilistic satisfaction scoring, append-only JSONL audit trails with integrity hashes. Created Dec 2025.
QUEEF — Quality User Experience Enforcement Framework. Test structural UX of agent responses.
Real performance testing for CI/CD pipelines, staging environments, and load balancer validation
AI Agent Evaluation and Monitoring Guide
Behavior test framework for AI agents. Define tests in YAML. Run against transcripts. Get scored reports.
Regression and evaluation toolkit for prompt and agent output quality
Demonstration of testing and evaluation patterns for AI agents using Azure AI evaluation tools with custom evaluators