
feat: time-bound runs, live stats display, and send-window metrics #58

Open
acere wants to merge 3 commits into awslabs:main from acere:feature/time-bound-runs

Conversation

Collaborator

@acere acere commented Apr 2, 2026

Closes #57

What

Adds time-bound test runs, a live stats display, send-window-based throughput metrics, and fixes a StopIteration bug in invocation loops.

Changes

llmeter/runner.py

  • New run_duration parameter on _RunConfig/Runner/run(): clients send requests continuously for a fixed duration. Mutually exclusive with n_requests.
  • New _invoke_for_duration / _invoke_duration / _invoke_duration_c methods — clean separation from count-bound _invoke_n / _invoke_n_c.
  • _tick_time_bar async task advances a time-based progress bar every 0.5s.
  • _run() dispatches to the right invocation path based on _time_bound flag.
  • total_requests always derived from RunningStats._count (single source of truth).
  • Both _invoke_n_no_wait and _invoke_for_duration use while/next() instead of for-in-cycle() to prevent StopIteration from silently killing the loop.
  • record_send() called before each endpoint.invoke() for send-window timing.
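The while/next() fix can be illustrated with a minimal, hypothetical sketch (names are illustrative, not the actual runner code). Advancing the payload cycle explicitly means a StopIteration raised by the endpoint call can be caught and handled in the loop body, rather than being conflated with iterator exhaustion:

```python
from itertools import cycle

def invoke_loop(payloads, invoke, n_requests):
    # Hypothetical sketch of the while/next() pattern: the payload
    # iterator is advanced explicitly, so a StopIteration raised by the
    # endpoint call is caught and recorded instead of ending the loop.
    payload_iter = cycle(payloads)  # cycle() itself never raises StopIteration
    sent = 0
    responses = []
    while sent < n_requests:
        payload = next(payload_iter)
        try:
            responses.append(invoke(payload))
        except StopIteration:
            responses.append(None)  # record the failure, keep looping
        sent += 1
    return responses
```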

llmeter/utils.py

  • RunningStats.record_send(): tracks _first_send_time / _last_send_time.
  • RPM in snapshot() uses send window instead of response-side elapsed time.
  • New "output_tps" special spec: aggregate output tokens/s based on send window.
  • snapshot() returns placeholder values ("—") when _count == 0.
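A stripped-down sketch of the send-window idea (not the actual RunningStats class): record_send() pins the window, and RPM is derived from first-to-last send time rather than response completion, so it does not taper off as clients finish.

```python
import time

class SendWindowStats:
    # Minimal illustrative stand-in for the send-window bookkeeping
    # described above; method and attribute names mirror the PR text,
    # but the implementation details are assumptions.
    def __init__(self):
        self._count = 0
        self._first_send_time = None
        self._last_send_time = None

    def record_send(self, now=None):
        # Called just before each endpoint.invoke(); pins the send window.
        now = now if now is not None else time.monotonic()
        if self._first_send_time is None:
            self._first_send_time = now
        self._last_send_time = now
        self._count += 1

    def rpm(self):
        if self._count == 0:
            return "—"  # placeholder before any request is sent
        window = self._last_send_time - self._first_send_time
        if window <= 0:
            return float(self._count) * 60.0  # degenerate one-send window
        return self._count * 60.0 / window
```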

llmeter/live_display.py (new)

  • LiveStatsDisplay: HTML table in Jupyter (grouped columns), ANSI multi-line in terminals.
  • _classify / _group_stats: auto-groups stats by key patterns (Throughput, TTFT, TTLT, Tokens, Errors).
  • Updates in-place, shows placeholders immediately before first response.
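The grouping logic can be sketched roughly as substring matching on lower-cased stat keys (a simplified stand-in for _classify, using the pattern pairs shown in the diff; the real method may differ):

```python
_GROUP_PATTERNS = [
    ("rpm", "Throughput"),
    ("tps", "Throughput"),
    ("ttft", "TTFT"),
    ("ttlt", "TTLT"),
    ("token", "Tokens"),
    ("fail", "Errors"),
]

def classify(stat_key: str) -> str:
    # First matching substring decides the column group;
    # anything unmatched lands in "Other".
    key = stat_key.lower()
    for pattern, group in _GROUP_PATTERNS:
        if pattern in key:
            return group
    return "Other"
```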

llmeter/experiments.py

  • LoadTest: new run_duration, low_memory, progress_bar_stats fields forwarded to each run.

docs/user_guide/run_experiments.md

  • New sections: Time-bound runs, Live progress-bar statistics, Low-memory mode.

examples/Time-bound runs with Bedrock OpenAI API.ipynb (new)

  • End-to-end notebook using bedrock-mantle endpoint with LoadTest, custom stats, low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT vs clients).

Tests (51 new, 504 total)

  • test_running_stats.py: record_send, update, to_stats, snapshot (placeholders, rpm, output_tps, send window, aggregations).
  • test_live_display.py: _classify, _group_stats, _in_notebook, LiveStatsDisplay (disabled, terminal, overwrite, prefix).
  • test_experiments.py: LoadTest with run_duration/low_memory/progress_bar_stats.
  • test_runner.py: time-bound validation, _invoke_for_duration, full runs with duration.

Usage

```python
# Time-bound run
result = await runner.run(run_duration=60, clients=5)

# Time-bound LoadTest
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    run_duration=60,
    low_memory=True,
    output_path="outputs/load_test",
)
result = await load_test.run()
result.plot_results()
```

acere added 2 commits April 1, 2026 11:38
- Add `low_memory` parameter to Runner/run() that writes responses to
  disk without keeping them in memory, for large-scale test runs.
- Introduce `RunningStats` class that accumulates metrics incrementally
  (counts, sums, sorted values for percentile computation).
- Replace `_builtin_stats` cached_property on Result with `_preloaded_stats`
  populated by RunningStats during the run or from stats.json on load.
- Add `snapshot()` method on RunningStats for live progress-bar display
  of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and
  failure count — configurable via `progress_bar_stats` parameter.
- Add `_compute_stats()` classmethod on Result as fallback for manually
  constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.

Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests
  continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate
  from count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns
  for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in
  terminals. Updates in-place, shows placeholders before first response.
- Replaces single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use send window (first-to-last request sent)
  instead of response-side elapsed time, preventing taper-off as
  clients finish.
- output_tps (aggregate tokens/s) added to default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next()
  instead of for-in-cycle() to prevent StopIteration from streaming
  endpoints from killing the loop.

Add LoadTest support for new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end
  demo using bedrock-mantle endpoint with LoadTest, custom stats,
  low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs,
  live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot
  (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook,
  LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/
  progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration,
  full run with duration, output path, multiple clients.
Comment on lines +20 to +29
_GROUP_PATTERNS: list[tuple[str, str]] = [
("rpm", "Throughput"),
("tps", "Throughput"),
("ttft", "TTFT"),
("ttlt", "TTLT"),
("token", "Tokens"),
("fail", "Errors"),
]

_GROUP_ORDER = ["Throughput", "TTFT", "TTLT", "Tokens", "Errors", "Other"]
Collaborator

Maybe these could be condensed to a single config variable like below?

_GROUP_PATTERNS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)

If it's an immutable type like this, it could also nicely become the default value of an argument groups in LiveStatsDisplay constructor, instead of a module-level constant?

stats = self._builtin_stats.copy()
else:
# Fallback: compute from responses (e.g. Result constructed manually)
stats = self._compute_stats(self)
Collaborator

Should this be cached back to _preloaded_stats so it's not recomputed on subsequent accesses?

Collaborator Author

Updated

result._preloaded_stats = None
else:
# Compute stats from the loaded responses
result._preloaded_stats = cls._compute_stats(result)
Collaborator

What happens to callback _contributed_stats when a result is saved to file and loaded again? It looks like, even if the contributed stats get saved to stats.json, they might be overridden/deleted here?

Collaborator Author

Yes, I missed it. Fixed, and also added a dedicated set of tests.

llmeter/utils.py (Outdated)
Comment on lines +121 to +132
DEFAULT_SNAPSHOT_STATS: dict[str, tuple[str, ...] | str] = {
"rpm": "rpm",
"output_tps": "output_tps",
"p50_ttft": ("time_to_first_token", "p50"),
"p90_ttft": ("time_to_first_token", "p90"),
"p50_ttlt": ("time_to_last_token", "p50"),
"p90_ttlt": ("time_to_last_token", "p90"),
"p50_tps": ("time_per_output_token", "p50", "inv"),
"input_tokens": ("num_tokens_input", "sum"),
"output_tokens": ("num_tokens_output", "sum"),
"fail": "failed",
}
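One way to read the spec format above (an interpretation, not confirmed by the source): a plain string names a top-level stat, a tuple drills into nested keys, and a trailing "inv" marker inverts the value (e.g. seconds-per-token into tokens/s). A hypothetical resolver:

```python
def resolve_spec(stats: dict, spec):
    # Hypothetical interpretation of the snapshot spec format:
    #   "rpm"                                  -> stats["rpm"]
    #   ("time_to_first_token", "p50")         -> stats["time_to_first_token"]["p50"]
    #   ("time_per_output_token", "p50", "inv") -> 1 / that value
    if isinstance(spec, str):
        return stats.get(spec)
    invert = spec[-1] == "inv"
    keys = spec[:-1] if invert else spec
    value = stats
    for key in keys:
        value = value[key]
    return 1.0 / value if invert else value
```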
Collaborator

Not a big fan of defining name aliases at this level - shouldn't that be more of a display-level property?

It also feels weird that this class is separate from Result stats... I'd suggest instead revisiting the way Result itself computes stats, and adding the capability for some of them to be calculated on a running basis during the run. After all, callbacks can already choose to _update_contributed_stats at any time?

Then, the LiveStatsDisplay could just be configured which stats to pull (e.g. time_to_first_token-p50) with alias names / groups / whatever other display-level properties.

Collaborator Author

Restructured the relationship between Result and LiveStatsDisplay; it should be more consistent now.

Collaborator

Should this be optional now if n_requests is optional in _RunConfig?

Collaborator Author

updated

tokenizer: Tokenizer | Any | None = None
clients: int = 1
n_requests: int | None = None
run_duration: int | float | None = None
Collaborator

Perhaps this should either be a timedelta type, or have a name that explicitly indicates its units?

Collaborator Author

Added timedelta type as option, and clarified in docstrings that any numerical type represents duration in seconds.
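The normalization described here might look roughly like this (a sketch; the actual __post_init__ logic isn't shown in the diff, and the helper name is made up):

```python
from datetime import timedelta

def normalize_run_duration(run_duration):
    # Accept a timedelta or a plain number of seconds; return seconds
    # as a float, or None when no duration bound is set.
    if run_duration is None:
        return None
    if isinstance(run_duration, timedelta):
        return run_duration.total_seconds()
    return float(run_duration)
```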

self._time_bound = self.run_duration is not None
if self._time_bound:
# For time-bound runs, _n_requests is unknown upfront
self._n_requests = 0
Collaborator

Do we need both n_requests and _n_requests? And the inconsistency of the public property being nullable while the private one's getting set to 0?

Collaborator Author

Combined into a single variable

Comment on lines +563 to +567
async def _invoke_duration_c(
self,
payload: list[dict],
clients: int = 1,
) -> tuple[float, float, float]:
Collaborator

A bit concerned by the amount of duplication introduced by defining parallel _invoke_duration_c, _invoke_duration, _invoke_for_duration methods, rather than sharing anything with the corresponding _invoke_n... methods. Since these are all private, couldn't we consolidate some to a single method that tracks both the number and duration and terminates when either condition is met?
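The suggested consolidation could look something like this single loop (illustrative names, synchronous for brevity; it terminates when either bound is hit, whichever comes first):

```python
import time
from itertools import cycle

def invoke_until(payloads, invoke, n_requests=None, run_duration=None):
    # Single loop covering both count-bound and time-bound runs:
    # stop when the request count OR the duration limit is reached.
    if n_requests is None and run_duration is None:
        raise ValueError("set n_requests and/or run_duration")
    start = time.monotonic()
    payload_iter = cycle(payloads)
    sent = 0
    results = []
    while True:
        if n_requests is not None and sent >= n_requests:
            break
        if run_duration is not None and time.monotonic() - start >= run_duration:
            break
        results.append(invoke(next(payload_iter)))
        sent += 1
    return results
```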

Collaborator Author

Consolidated from 6 to 3 methods.

Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to
  canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json
  so callback-contributed stats survive save/load round-trips
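The merge behavior described here can be sketched as a simple dict merge (hypothetical helper, not the actual implementation): keys present in stats.json but absent from the freshly computed stats are carried over.

```python
def merge_contributed_stats(computed: dict, saved: dict) -> dict:
    # Freshly computed stats win; extra keys from stats.json
    # (callback-contributed) are preserved.
    merged = dict(computed)
    for key, value in saved.items():
        merged.setdefault(key, value)
    return merged
```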

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration),
  _invoke_client (replaces _invoke_n/_invoke_duration),
  _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs