feat(evaluators): add built-in budget evaluator for per-agent cost tracking#144

Open
amabito wants to merge 7 commits into agentcontrol:main from amabito:feat/budget-evaluator

Conversation

@amabito amabito commented Mar 21, 2026

Summary

Scope

  • User-facing/API changes:

    • "budget" evaluator registered alongside regex/list/json/sql
    • BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
    • BudgetEvaluatorConfig: limits list, optional pricing table, path configs
  • Internal changes:

    • evaluators/builtin/src/agent_control_evaluators/budget/ -- 4 files, ~650 LOC
    • evaluators/builtin/tests/budget/test_budget.py -- 63 tests
  • Out of scope:

    • No Postgres store, no DB tables, no new dependencies

Risk and Rollout

  • Risk level: low -- new evaluator, no changes to existing code. 230 existing tests untouched.
  • Rollback plan: revert PR

Testing

  • Added or updated automated tests (63 tests incl. thread safety, NaN/Inf, scope injection, double-count)
  • Ran pytest tests/ -- 293 passed
  • Manually verified behavior

Checklist

amabito added 7 commits March 21, 2026 09:30
…acking

Closes agentcontrol#130

Add BudgetEvaluator -- a deterministic evaluator that tracks cumulative
LLM token and cost usage per agent, per channel, per user, with
configurable time windows (daily/weekly/monthly/cumulative).

Components:
- BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
- BudgetSnapshot frozen dataclass for atomic state reads
- BudgetEvaluator with scope key building, period key derivation,
  token extraction, and optional model pricing estimation
- BudgetLimitRule config with scope, per, window, limit_usd, limit_tokens
- 48 tests covering store, config, evaluator, registration

Design:
- In-memory only (no PostgreSQL, no new dependencies)
- Store is "dumb" (accumulate + check), evaluator is "smart" (resolve
  scope, derive period, extract tokens, check limits)
- record_and_check() is atomic (single lock acquisition)
- Evaluator instances are cached per config (thread-safe by design)
- matched=True only when limit exceeded, confidence=1.0 always
- Utilization ratio in metadata, not confidence
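
For illustration, the "dumb store" with a single-lock `record_and_check()` could look roughly like the sketch below. This is not the PR's code: the `BudgetSnapshot` field names, the argument list, and the use of `>=` for the exceeded check (which matches the boundary fix in the follow-up commit) are assumptions made for the example.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetSnapshot:
    """Immutable view of one bucket (field names are assumed)."""
    spent_usd: float
    spent_tokens: int
    exceeded: bool


class InMemoryBudgetStore:
    """Dict of buckets guarded by a single lock (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # (scope_key, period_key) -> [spent_usd, spent_tokens]
        self._buckets: dict = {}

    def record_and_check(
        self,
        scope_key: str,
        period_key: str,
        tokens: int,
        cost_usd: float,
        limit_usd: Optional[float] = None,
        limit_tokens: Optional[int] = None,
    ) -> BudgetSnapshot:
        # Accumulate and compare inside one lock acquisition so no other
        # thread can observe or modify the bucket between the two steps.
        with self._lock:
            bucket = self._buckets.setdefault((scope_key, period_key), [0.0, 0])
            bucket[0] += cost_usd
            bucket[1] += tokens
            over_usd = limit_usd is not None and bucket[0] >= limit_usd
            over_tokens = limit_tokens is not None and bucket[1] >= limit_tokens
            return BudgetSnapshot(bucket[0], bucket[1], over_usd or over_tokens)
```

Returning a frozen dataclass means the caller gets a consistent copy of the totals and cannot mutate store state by accident.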
…arial tests

3-body review findings:

Security:
- Sanitize pipe/equals in scope key metadata values (injection prevention)
- Add max_buckets=100K to InMemoryBudgetStore (OOM prevention, fail-closed)
- Block dunder attribute access in _extract_by_path
- Add math.isfinite guard on extracted cost values
- Skip per-user rules when per field missing from metadata (was collapsing
  per-user budgets into global bucket)

Correctness:
- Changed exceeded check from > to >= (utilization=100% now triggers exceeded)
- Removed unused BudgetSnapshot import from evaluator.py

Tests (6 adversarial):
- Exact limit boundary (USD and tokens)
- Scope key injection via pipe character
- max_buckets OOM prevention
- per-field missing skips rule
- dunder path rejection

54 budget tests, 284 total evaluator tests passing.
… dunder guard

R2 findings:

- _sanitize_scope_value: percent-encode |/= instead of replacing with _
  (was causing key collisions between "a|b" and "a_b")
- max_buckets fail-closed: spent_usd/spent_tokens now 0.0/0 (not recorded,
  previously reported current-call-only values misleading callers)
- _extract_by_path: narrowed guard from startswith("_") to startswith("__")
  (single-underscore dict keys are legitimate data fields)
- Fixed tautological test assertion in test_scope_key_injection_pipe
- Added 3 tests: no-collision, single-underscore access, NaN/Inf cost

57 budget tests, 287 total evaluator tests passing.
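
A minimal sketch of the percent-encoding idea, assuming the key format shown in the store docstring (`name=value` pairs joined with `|`). The function names are hypothetical, and encoding `%` itself is an assumption added here so that already-encoded input cannot collide with encoded output; the commit message only mentions `|` and `=`.

```python
def sanitize_scope_value(value: str) -> str:
    """Percent-encode the key delimiters so a value cannot forge extra fields."""
    # Encode "%" first; otherwise "a%7Cb" and "a|b" would map to the same text.
    return value.replace("%", "%25").replace("|", "%7C").replace("=", "%3D")


def build_scope_key(dimensions: dict) -> str:
    """Join sorted name=value pairs with "|" so equal scopes give equal keys."""
    return "|".join(
        f"{name}={sanitize_scope_value(value)}"
        for name, value in sorted(dimensions.items())
    )
```

Unlike replacing the delimiters with `_`, the encoding is reversible, so two distinct input values never produce the same key.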
R4 finding: negative pricing rates in config caused _estimate_cost to
return negative cost_usd, which subtracted from spent_usd and disabled
USD limit enforcement entirely.

Fix: max(0.0, cost) in _estimate_cost return.
Test: negative pricing rates produce spent_usd >= 0.

58 budget tests, 288 total evaluator tests passing.
R5 finding: Inf pricing rates produced inf cost, permanently locking
buckets in exceeded state. max(0.0, inf) = inf.

Fix: isfinite + negative check on _estimate_cost return value.
Tests: Inf pricing rate test, strengthened negative pricing assertion.

59 budget tests passing.
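
To make the R4/R5 failure modes concrete, here is a hedged sketch of a guarded cost estimate. The rate key names (`input_per_1k`, `output_per_1k`), the per-1K-token unit, and returning `None` for an unusable result are all assumptions; the commit message does not say what `_estimate_cost` returns in that case.

```python
import math
from typing import Optional


def estimate_cost(input_tokens: int, output_tokens: int, rates: dict) -> Optional[float]:
    """Estimate cost in USD; return None if the result cannot be trusted."""
    cost = (
        input_tokens * rates["input_per_1k"]
        + output_tokens * rates["output_per_1k"]
    ) / 1000.0
    # max(0.0, cost) alone is not enough: max(0.0, inf) is still inf, and
    # 0 * inf yields nan. isfinite() rejects both before the sign check.
    if not math.isfinite(cost) or cost < 0:
        return None
    return cost
```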
…dation

R8 finding: float("nan") passed the `v <= 0` validator (IEEE 754:
nan <= 0 is False). NaN limit_usd silently disabled budget enforcement
because all NaN comparisons return False.

Fix: added math.isfinite(v) guard to validate_limit_usd.
Tests: NaN and Inf limit_usd rejection.

61 budget tests, 291 total evaluator tests passing.
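
The NaN behaviour is easy to reproduce in isolation. The sketch below is a plain-function version of the validator, written without any framework decorators; the real `validate_limit_usd` presumably lives on the config model, and its exact signature is an assumption.

```python
import math
from typing import Optional


def validate_limit_usd(v: Optional[float]) -> Optional[float]:
    """Accept None or a finite, strictly positive limit."""
    if v is None:
        return None
    # Every ordered comparison with NaN is False, so `v <= 0` alone lets
    # NaN through. Checking isfinite() first rejects NaN and +/-Inf.
    if not math.isfinite(v) or v <= 0:
        raise ValueError("limit_usd must be a finite positive number")
    return v
```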
…pe+period

R10 finding: when multiple limit rules share the same (scope_key, period_key),
each rule called record_and_check() independently, causing the same tokens
and cost to be counted N times in the store.

Fix: track recorded (scope_key, period_key) pairs per evaluate() call.
First rule records; subsequent rules for the same pair use get_snapshot().

Tests: 2 new tests for same-scope double-count prevention.
63 budget tests, 293 total evaluator tests passing.

Review loop: R9 CLEAN, R10 fix, R11 CLEAN -- 3 consecutive clean achieved.
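
The R10 fix can be sketched as follows. `CountingStore` is a hypothetical stand-in that tracks tokens only, and the rule tuples are simplified; the point is just the per-call set of already-recorded `(scope_key, period_key)` pairs.

```python
from collections import defaultdict


class CountingStore:
    """Hypothetical stand-in store that tracks tokens only."""

    def __init__(self) -> None:
        self.spent = defaultdict(int)

    def record_and_check(self, key, tokens: int) -> int:
        self.spent[key] += tokens
        return self.spent[key]

    def get_snapshot(self, key) -> int:
        return self.spent[key]


def evaluate(store: CountingStore, rules: list, tokens: int) -> bool:
    """rules: list of (scope_key, period_key, limit_tokens) tuples."""
    recorded = set()  # pairs already written during this evaluate() call
    exceeded = False
    for scope_key, period_key, limit_tokens in rules:
        key = (scope_key, period_key)
        if key in recorded:
            # Same bucket as an earlier rule: read it, do not add again.
            spent = store.get_snapshot(key)
        else:
            spent = store.record_and_check(key, tokens)
            recorded.add(key)
        exceeded = exceeded or spent >= limit_tokens
    return exceeded
```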
Contributor

@lan17 lan17 left a comment

Thank you for working on this! Overall it looks like the right direction to me, but I think we can improve the design a bit to make it more universal.

Could you also implement this as a contrib evaluator so it doesn't become a built-in evaluator right away? We will then put it to use, and once it's production-ready we can move it to builtin.

scope: dict[str, str] = Field(default_factory=dict)
per: str | None = None
window: Literal["daily", "weekly", "monthly"] | None = None
limit_usd: float | None = None
Contributor

This seems like the wrong abstraction for budgeting. What if the budget is in a different currency? Why do we need separate fields for tokens vs. USD?

It might be better to use an integer for the limit and then define a "currency" Enum, which could be USD, tokens, Euros, etc.

I don't think there's a use case for having floating point for USD or Euros, no?

Author

Good point. I'll switch to an integer limit + a Currency enum (USD, EUR, tokens, etc.). Float was unnecessary — cents-level precision is handled by using integer cents anyway.
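
A rough sketch of the shape being discussed, using a plain dataclass rather than the config model so it stays dependency-free. The class name `BudgetLimit`, the enum values, and the convention that money amounts are in minor units (cents) are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Currency(str, Enum):
    USD = "usd"
    EUR = "eur"
    TOKENS = "tokens"


@dataclass(frozen=True)
class BudgetLimit:
    # Minor units (e.g. cents) for money, a raw count for tokens.
    amount: int
    currency: Currency

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.amount, bool) or not isinstance(self.amount, int):
            raise TypeError("amount must be an int")
        if self.amount <= 0:
            raise ValueError("amount must be positive")
```

With integer amounts the NaN and Inf cases from the earlier review rounds cannot occur in the limit itself.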


scope: dict[str, str] = Field(default_factory=dict)
per: str | None = None
window: Literal["daily", "weekly", "monthly"] | None = None
Contributor

What if we want hourly or half-an-hour?

Maybe it's better to define "window" as an integer in seconds or minutes? That way you can express whatever window you want.

Author

Agreed. Will change to window_seconds: int. I'll keep a few named constants (DAILY = 86400 etc.) as convenience but the field itself will be raw seconds.
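
As a sketch of what raw seconds plus named constants might look like, assuming fixed windows aligned to the Unix epoch (the helper name `window_start` and the extra constants beyond `DAILY` are hypothetical):

```python
HOURLY = 3600
DAILY = 86400
WEEKLY = 7 * DAILY


def window_start(now: float, window_seconds: int) -> int:
    """Return the epoch second at which the window containing `now` began."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Floor-divide to get the window index, then scale back to seconds.
    return int(now // window_seconds) * window_seconds
```

One consequence of this choice is that windows are aligned to the epoch rather than to a calendar: a `WEEKLY` window starts on a Thursday (UTC), and a calendar month has no fixed length in seconds, so the old "monthly" option would need an approximation or separate handling.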

on the first breach.

Attributes:
scope: Static scope dimensions that must match for this rule
Contributor

Worth it to give some examples here for scope?

Author

Sure, will add. Something like {"agent": "summarizer", "channel": "slack"}.
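
For the docstring, a short matching example may also help. The sketch below assumes a rule applies when every key in `scope` is present in the call metadata with the same value; `rule_applies` is a hypothetical helper, not a function from the PR.

```python
def rule_applies(scope: dict, metadata: dict) -> bool:
    """True when every static scope dimension matches the call metadata."""
    # all() over an empty scope is True, which gives the "global rule" case.
    return all(metadata.get(name) == value for name, value in scope.items())
```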

limits: list[BudgetLimitRule] = Field(min_length=1)
pricing: dict[str, dict[str, float]] | None = None
token_path: str | None = None
cost_path: str | None = None
Contributor

Shouldn't this evaluator be computing cost in USD based on model and token count?

It doesn't seem like an LLM step should be passing it down here. I may be wrong on this.

Author

My thinking was: the caller already has cost from the LLM response, so passing it avoids maintaining a pricing table. But I see the argument — if the evaluator owns cost computation, the contract is simpler and the caller can't lie about cost. One question: should the evaluator maintain its own pricing table, or pull from an external source (e.g. LiteLLM's model cost map)?

return max(ratios) if ratios else 0.0


class InMemoryBudgetStore:
Contributor

nit: This should go into a separate file so that the store interface and its implementation are separate.

Author

Will split into store.py (protocol) and memory_store.py (InMemoryBudgetStore).

"""Atomically record usage and return current budget state.

Args:
scope_key: Composite scope identifier (e.g. "channel=slack|user_id=u1").
Contributor

This doc should go into the interface docs instead of here.

Author

Yep, will move to the protocol definition.

def record_and_check(
self,
scope_key: str,
period_key: str,
Contributor

Why does this need to be passed in? Shouldn't the implementation figure this out on its own based on the current time?

Author

Fair. The store should derive the period from window_seconds + current time internally. I was passing it in for testability but I can inject a clock instead.
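
A minimal sketch of the clock-injection idea, assuming a single counter per window. `WindowedCounter` is a hypothetical, deliberately simplified class (one bucket, no locking), meant only to show how a test can control time without the caller passing a period key.

```python
import time
from typing import Callable, Optional


class WindowedCounter:
    """Tracks spend for the current window; not thread-safe (sketch only)."""

    def __init__(self, window_seconds: int, clock: Callable[[], float] = time.time) -> None:
        self._window_seconds = window_seconds
        self._clock = clock  # injected so tests can supply a fake clock
        self._window_index: Optional[int] = None
        self._spent = 0

    def record(self, amount: int) -> int:
        # Derive the window from the clock instead of taking it as an argument.
        index = int(self._clock() // self._window_seconds)
        if index != self._window_index:
            # A new window has started, so the running total resets.
            self._window_index = index
            self._spent = 0
        self._spent += amount
        return self._spent
```

In production the default `time.time` is used; a test passes a lambda over a mutable value and advances it by hand.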

input_tokens: int,
output_tokens: int,
cost_usd: float,
limit_usd: float | None = None,
Contributor

Why do we need to pass in the limit here? Can't we instantiate the store with the BudgetEvaluatorConfig or something similar so it already knows what the limits are?

If we want to share the store between many different kinds of limits and keys, can't we just pass in BudgetLimitRule here instead?

Author

Makes sense. I'll have the store accept BudgetLimitRule at registration time so it knows its own limits. record_and_check just takes usage data.

# ---------------------------------------------------------------------------


class TestInMemoryBudgetStore:
Contributor

Tests should follow Given/When/Then behavioral comment style.

Author

Will restructure. Quick example to confirm this is what you mean:

def test_single_record_under_limit(self):
    # Given: store with a $10 daily limit
    # When: record $3 of usage
    # Then: not breached, ratio ~0.3
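
Filled in, the same test might read as below. `FakeStore` is a hypothetical stand-in so the example runs on its own; the real test would use the store from this PR, whose final API is still under discussion.

```python
import math


class FakeStore:
    """Hypothetical stand-in with one USD limit and one bucket."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float):
        self.spent_usd += cost_usd
        ratio = self.spent_usd / self.limit_usd
        return ratio >= 1.0, ratio


def test_single_record_under_limit():
    # Given: a store with a $10 daily limit
    store = FakeStore(limit_usd=10.0)

    # When: $3 of usage is recorded
    breached, ratio = store.record(3.0)

    # Then: the limit is not breached and utilization is about 0.3
    assert not breached
    assert math.isclose(ratio, 0.3)
```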

Attributes:
scope: Static scope dimensions that must match for this rule
to apply. Empty dict = global rule.
per: If set, the limit is applied independently for each unique
Contributor

I don't quite understand this. Can't we just handle separate budgets by having multiple rules with different scope dicts?

Author

For static scopes, agreed — multiple rules work. My concern is the dynamic case: "each user gets $10/day" where users aren't known at config time. With per, one rule covers all future users. Without it, you'd need to generate rules on the fly. Would a group_by field work? e.g. group_by: "user_id" means "apply this limit independently per distinct user_id value." Open to other approaches if you have something in mind.
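
To make the `group_by` proposal concrete, here is a hedged sketch of how one rule could fan out into per-user buckets. `resolve_scope_key` is a hypothetical helper; it omits value sanitization and static-scope matching for brevity, and it skips the rule when the grouped field is missing, in line with the earlier "per field missing" fix.

```python
from typing import Optional


def resolve_scope_key(scope: dict, group_by: Optional[str], metadata: dict) -> Optional[str]:
    """Return the bucket key for this call, or None if the rule is skipped."""
    dimensions = dict(scope)
    if group_by is not None:
        value = metadata.get(group_by)
        if value is None:
            # Skip instead of collapsing every caller into one shared bucket.
            return None
        dimensions[group_by] = str(value)
    return "|".join(f"{name}={value}" for name, value in sorted(dimensions.items()))
```

Each distinct `user_id` then gets its own key, so a single rule covers users that were not known at config time.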

Author

amabito commented Mar 24, 2026

@lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week.
