-
Notifications
You must be signed in to change notification settings - Fork 16
feat(evaluators): add built-in budget evaluator for per-agent cost tracking #144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
adaa614
8ae042f
4cd08eb
0b41ae9
5a100f8
96e59ba
fa414d7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| """Budget evaluator for per-agent LLM cost and token tracking.""" | ||
|
|
||
| from agent_control_evaluators.budget.config import BudgetEvaluatorConfig | ||
| from agent_control_evaluators.budget.evaluator import BudgetEvaluator | ||
| from agent_control_evaluators.budget.store import ( | ||
| BudgetSnapshot, | ||
| BudgetStore, | ||
| InMemoryBudgetStore, | ||
| ) | ||
|
|
||
| __all__ = [ | ||
| "BudgetEvaluator", | ||
| "BudgetEvaluatorConfig", | ||
| "BudgetSnapshot", | ||
| "BudgetStore", | ||
| "InMemoryBudgetStore", | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| """Configuration for the budget evaluator.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import math | ||
| from typing import Any, Literal | ||
|
|
||
| from pydantic import Field, field_validator, model_validator | ||
|
|
||
| from agent_control_evaluators._base import EvaluatorConfig | ||
|
|
||
|
|
||
| class BudgetLimitRule(EvaluatorConfig): | ||
| """A single budget limit rule. | ||
|
|
||
| Each rule defines a ceiling (USD and/or tokens) for a combination | ||
| of scope dimensions and time window. Multiple rules can apply to | ||
| the same step -- the evaluator checks all of them and triggers | ||
| on the first breach. | ||
|
|
||
| Attributes: | ||
| scope: Static scope dimensions that must match for this rule | ||
| to apply. Empty dict = global rule. | ||
| per: If set, the limit is applied independently for each unique | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't quite understand this. Can't we just handle separate budgets by having multiple rules with different
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For static scopes, agreed — multiple rules work. My concern is the dynamic case: "each user gets $10/day" where users aren't known at config time. With per, one rule covers all future users. Without it, you'd need to generate rules on the fly. Would a group_by field work? e.g. group_by: "user_id" means "apply this limit independently per distinct user_id value." Open to other approaches if you have something in mind. |
||
| value of this metadata field (e.g. "user_id" creates per-user | ||
| budgets within the scope). | ||
| window: Time window for accumulation. None = cumulative (no reset). | ||
| limit_usd: Maximum USD spend in the window. None = uncapped. | ||
| limit_tokens: Maximum tokens in the window. None = uncapped. | ||
| """ | ||
|
|
||
| scope: dict[str, str] = Field(default_factory=dict) | ||
| per: str | None = None | ||
| window: Literal["daily", "weekly", "monthly"] | None = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if we want hourly or half-an-hour? Maybe its better to define "window" as an integer in seconds, or minutes? That way you can express whatever window you want.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. Will change to window_seconds: int. I'll keep a few named constants (DAILY = 86400 etc.) as convenience but the field itself will be raw seconds. |
||
| limit_usd: float | None = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like this is a wrong abstraction for budgeting. What if the budget is in different currency? Why do we need separate fields for tokens vs usd? It might be better to do something like an integer for a limit and then define "currency" Enum which could be USD, tokens, Euros, etc., I don't think there's a use case for having floating point for USD or Euros, no?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I'll switch to an integer limit + a Currency enum (USD, EUR, tokens, etc.). Float was unnecessary — cents-level precision is handled by using integer cents anyway. |
||
| limit_tokens: int | None = None | ||
|
|
||
| @model_validator(mode="after") | ||
| def at_least_one_limit(self) -> "BudgetLimitRule": | ||
| if self.limit_usd is None and self.limit_tokens is None: | ||
| raise ValueError("At least one of limit_usd or limit_tokens must be set") | ||
| return self | ||
|
|
||
| @field_validator("limit_usd") | ||
| @classmethod | ||
| def validate_limit_usd(cls, v: float | None) -> float | None: | ||
| if v is not None and (not math.isfinite(v) or v <= 0): | ||
| raise ValueError("limit_usd must be a finite positive number") | ||
| return v | ||
|
|
||
| @field_validator("limit_tokens") | ||
| @classmethod | ||
| def validate_limit_tokens(cls, v: int | None) -> int | None: | ||
| if v is not None and v <= 0: | ||
| raise ValueError("limit_tokens must be positive") | ||
| return v | ||
|
|
||
|
|
||
| class BudgetEvaluatorConfig(EvaluatorConfig): | ||
| """Configuration for the budget evaluator. | ||
|
|
||
| Attributes: | ||
| limits: List of budget limit rules. Each is checked independently. | ||
| pricing: Optional model pricing table. Maps model name to per-1K | ||
| token rates. Used to derive cost_usd from token counts when | ||
| cost is not provided in step data. | ||
| token_path: Dot-notation path to extract token usage from step | ||
| data (e.g. "usage.total_tokens"). If None, looks for standard | ||
| fields (input_tokens, output_tokens, total_tokens, usage). | ||
| cost_path: Dot-notation path to extract cost from step data. | ||
| model_path: Dot-notation path to extract model name (for pricing lookup). | ||
| metadata_paths: Mapping of metadata field name to dot-notation path | ||
| in step data. Used to extract scope dimensions (channel, user_id, etc). | ||
| """ | ||
|
|
||
| limits: list[BudgetLimitRule] = Field(min_length=1) | ||
| pricing: dict[str, dict[str, float]] | None = None | ||
| token_path: str | None = None | ||
| cost_path: str | None = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this evaluator be computing cost in USD based on model and token count? Doesn't seem like an LLM step should be passing it down here. I maybe wrong on this.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My thinking was: the caller already has cost from the LLM response, so passing it avoids maintaining a pricing table. But I see the argument — if the evaluator owns cost computation, the contract is simpler and the caller can't lie about cost. One question: should the evaluator maintain its own pricing table, or pull from an external source (e.g. LiteLLM's model cost map)?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe you already include a pricing table in the evaluator config, so for version 1 of this evaluator we can just rely on that?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense. I'll have the evaluator compute cost from the config pricing table + model + token counts. That lets me drop cost_path entirely — just need token_path and a new model_path to extract the model name from the input. If the model isn't in the pricing table, fail closed. |
||
| model_path: str | None = None | ||
| metadata_paths: dict[str, str] = Field(default_factory=dict) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth it to give some examples here for scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add. Something like {"agent": "summarizer", "channel": "slack"}.