
feat: time-bound runs, live stats display, and send-window metrics #58

Open
acere wants to merge 3 commits into awslabs:main from acere:feature/time-bound-runs

Conversation

Collaborator

@acere acere commented Apr 2, 2026

Closes #57

What

Adds time-bound test runs, a live stats display, send-window-based throughput metrics, and fixes a StopIteration bug in invocation loops.

Changes

llmeter/runner.py

  • New run_duration parameter on _RunConfig/Runner/run(): clients send requests continuously for a fixed duration. Mutually exclusive with n_requests.
  • New _invoke_for_duration / _invoke_duration / _invoke_duration_c methods — clean separation from count-bound _invoke_n / _invoke_n_c.
  • _tick_time_bar async task advances a time-based progress bar every 0.5s.
  • _run() dispatches to the right invocation path based on _time_bound flag.
  • total_requests always derived from RunningStats._count (single source of truth).
  • Both _invoke_n_no_wait and _invoke_for_duration use while/next() instead of for-in-cycle() to prevent StopIteration from silently killing the loop.
  • record_send() called before each endpoint.invoke() for send-window timing.
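The while/next() fix can be illustrated with a minimal, hypothetical sketch (names are illustrative, not the actual runner code). Advancing the payload cycle explicitly means a StopIteration raised by the endpoint call can be caught and handled in the loop body, rather than being conflated with iterator exhaustion:

```python
from itertools import cycle

def invoke_loop(payloads, invoke, n_requests):
    # Hypothetical sketch of the while/next() pattern: the payload
    # iterator is advanced explicitly, so a StopIteration raised by the
    # endpoint call is caught and recorded instead of ending the loop.
    payload_iter = cycle(payloads)  # cycle() itself never raises StopIteration
    sent = 0
    responses = []
    while sent < n_requests:
        payload = next(payload_iter)
        try:
            responses.append(invoke(payload))
        except StopIteration:
            responses.append(None)  # record the failure, keep looping
        sent += 1
    return responses
```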

llmeter/utils.py

  • RunningStats.record_send(): tracks _first_send_time / _last_send_time.
  • RPM in snapshot() uses send window instead of response-side elapsed time.
  • New "output_tps" special spec: aggregate output tokens/s based on send window.
  • snapshot() returns placeholder values ("—") when _count == 0.
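A stripped-down sketch of the send-window idea (not the actual RunningStats class): record_send() pins the window, and RPM is derived from first-to-last send time rather than response completion, so it does not taper off as clients finish.

```python
import time

class SendWindowStats:
    # Minimal illustrative stand-in for the send-window bookkeeping
    # described above; method and attribute names mirror the PR text,
    # but the implementation details are assumptions.
    def __init__(self):
        self._count = 0
        self._first_send_time = None
        self._last_send_time = None

    def record_send(self, now=None):
        # Called just before each endpoint.invoke(); pins the send window.
        now = now if now is not None else time.monotonic()
        if self._first_send_time is None:
            self._first_send_time = now
        self._last_send_time = now
        self._count += 1

    def rpm(self):
        if self._count == 0:
            return "—"  # placeholder before any request is sent
        window = self._last_send_time - self._first_send_time
        if window <= 0:
            return float(self._count) * 60.0  # degenerate one-send window
        return self._count * 60.0 / window
```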

llmeter/live_display.py (new)

  • LiveStatsDisplay: HTML table in Jupyter (grouped columns), ANSI multi-line in terminals.
  • _classify / _group_stats: auto-groups stats by key patterns (Throughput, TTFT, TTLT, Tokens, Errors).
  • Updates in-place, shows placeholders immediately before first response.
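The grouping logic can be sketched roughly as substring matching on lower-cased stat keys (a simplified stand-in for _classify, using the pattern pairs shown in the diff; the real method may differ):

```python
_GROUP_PATTERNS = [
    ("rpm", "Throughput"),
    ("tps", "Throughput"),
    ("ttft", "TTFT"),
    ("ttlt", "TTLT"),
    ("token", "Tokens"),
    ("fail", "Errors"),
]

def classify(stat_key: str) -> str:
    # First matching substring decides the column group;
    # anything unmatched lands in "Other".
    key = stat_key.lower()
    for pattern, group in _GROUP_PATTERNS:
        if pattern in key:
            return group
    return "Other"
```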

llmeter/experiments.py

  • LoadTest: new run_duration, low_memory, progress_bar_stats fields forwarded to each run.

docs/user_guide/run_experiments.md

  • New sections: Time-bound runs, Live progress-bar statistics, Low-memory mode.

examples/Time-bound runs with Bedrock OpenAI API.ipynb (new)

  • End-to-end notebook using bedrock-mantle endpoint with LoadTest, custom stats, low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT vs clients).

Tests (51 new, 504 total)

  • test_running_stats.py: record_send, update, to_stats, snapshot (placeholders, rpm, output_tps, send window, aggregations).
  • test_live_display.py: _classify, _group_stats, _in_notebook, LiveStatsDisplay (disabled, terminal, overwrite, prefix).
  • test_experiments.py: LoadTest with run_duration/low_memory/progress_bar_stats.
  • test_runner.py: time-bound validation, _invoke_for_duration, full runs with duration.

Usage

```python
# Time-bound run
result = await runner.run(run_duration=60, clients=5)

# Time-bound LoadTest
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    run_duration=60,
    low_memory=True,
    output_path="outputs/load_test",
)
result = await load_test.run()
result.plot_results()
```

acere added 2 commits April 1, 2026 11:38
- Add `low_memory` parameter to Runner/run() that writes responses to
  disk without keeping them in memory, for large-scale test runs.
- Introduce `RunningStats` class that accumulates metrics incrementally
  (counts, sums, sorted values for percentile computation).
- Replace `_builtin_stats` cached_property on Result with `_preloaded_stats`
  populated by RunningStats during the run or from stats.json on load.
- Add `snapshot()` method on RunningStats for live progress-bar display
  of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and
  failure count — configurable via `progress_bar_stats` parameter.
- Add `_compute_stats()` classmethod on Result as fallback for manually
  constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.

Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests
  continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate
  from count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns
  for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in
  terminals. Updates in-place, shows placeholders before first response.
- Replaces single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use send window (first-to-last request sent)
  instead of response-side elapsed time, preventing taper-off as
  clients finish.
- output_tps (aggregate tokens/s) added to default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next()
  instead of for-in-cycle() to prevent StopIteration from streaming
  endpoints from killing the loop.

Add LoadTest support for new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end
  demo using bedrock-mantle endpoint with LoadTest, custom stats,
  low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs,
  live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot
  (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook,
  LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/
  progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration,
  full run with duration, output path, multiple clients.
Comment on lines +20 to +29
_GROUP_PATTERNS: list[tuple[str, str]] = [
("rpm", "Throughput"),
("tps", "Throughput"),
("ttft", "TTFT"),
("ttlt", "TTLT"),
("token", "Tokens"),
("fail", "Errors"),
]

_GROUP_ORDER = ["Throughput", "TTFT", "TTLT", "Tokens", "Errors", "Other"]
Collaborator

Maybe these could be condensed to a single config variable like below?

_GROUP_PATTERNS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)

If it's an immutable type like this, it could also nicely become the default value of an argument groups in LiveStatsDisplay constructor, instead of a module-level constant?

stats = self._builtin_stats.copy()
else:
# Fallback: compute from responses (e.g. Result constructed manually)
stats = self._compute_stats(self)
Collaborator

Should this be cached back to _preloaded_stats so it's not recomputed on subsequent accesses?

Collaborator Author

Updated

result._preloaded_stats = None
else:
# Compute stats from the loaded responses
result._preloaded_stats = cls._compute_stats(result)
Collaborator

What happens to callback _contributed_stats when a result is saved to file and loaded again? It looks like, even if the contributed stats get saved to stats.json, they might be overridden/deleted here?

Collaborator Author

Yes, I missed it. Fixed, and also added a dedicated set of tests.

llmeter/utils.py (Outdated)
Comment on lines +121 to +132
DEFAULT_SNAPSHOT_STATS: dict[str, tuple[str, ...] | str] = {
"rpm": "rpm",
"output_tps": "output_tps",
"p50_ttft": ("time_to_first_token", "p50"),
"p90_ttft": ("time_to_first_token", "p90"),
"p50_ttlt": ("time_to_last_token", "p50"),
"p90_ttlt": ("time_to_last_token", "p90"),
"p50_tps": ("time_per_output_token", "p50", "inv"),
"input_tokens": ("num_tokens_input", "sum"),
"output_tokens": ("num_tokens_output", "sum"),
"fail": "failed",
}
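One way to read the spec format above (an interpretation, not confirmed by the source): a plain string names a top-level stat, a tuple drills into nested keys, and a trailing "inv" marker inverts the value (e.g. seconds-per-token into tokens/s). A hypothetical resolver:

```python
def resolve_spec(stats: dict, spec):
    # Hypothetical interpretation of the snapshot spec format:
    #   "rpm"                                  -> stats["rpm"]
    #   ("time_to_first_token", "p50")         -> stats["time_to_first_token"]["p50"]
    #   ("time_per_output_token", "p50", "inv") -> 1 / that value
    if isinstance(spec, str):
        return stats.get(spec)
    invert = spec[-1] == "inv"
    keys = spec[:-1] if invert else spec
    value = stats
    for key in keys:
        value = value[key]
    return 1.0 / value if invert else value
```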
Collaborator

Not a big fan of defining name aliases at this level - shouldn't that be more of a display-level property?

It also feels weird that this class is separate from Result stats... I'd suggest instead revisiting the way Result itself computes stats, and adding the capability for some of them to be calculated on a running basis during the run. After all, callbacks can already choose to _update_contributed_stats at any time?

Then, the LiveStatsDisplay could just be configured which stats to pull (e.g. time_to_first_token-p50) with alias names / groups / whatever other display-level properties.

Collaborator Author

Restructured the relationship between Result and LiveStatsDisplay; it should be more consistent now.

Collaborator

Should this be optional now if n_requests is optional in _RunConfig?

Collaborator Author

updated

tokenizer: Tokenizer | Any | None = None
clients: int = 1
n_requests: int | None = None
run_duration: int | float | None = None
Collaborator

Perhaps this should either be a timedelta type, or have a name that explicitly indicates its units?

Collaborator Author

Added timedelta type as option, and clarified in docstrings that any numerical type represents duration in seconds.
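The normalization described here might look roughly like this (a sketch; the actual __post_init__ logic isn't shown in the diff, and the helper name is made up):

```python
from datetime import timedelta

def normalize_run_duration(run_duration):
    # Accept a timedelta or a plain number of seconds; return seconds
    # as a float, or None when no duration bound is set.
    if run_duration is None:
        return None
    if isinstance(run_duration, timedelta):
        return run_duration.total_seconds()
    return float(run_duration)
```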

self._time_bound = self.run_duration is not None
if self._time_bound:
# For time-bound runs, _n_requests is unknown upfront
self._n_requests = 0
Collaborator

Do we need both n_requests and _n_requests? And the inconsistency of the public property being nullable while the private one's getting set to 0?

Collaborator Author

Combined into a single variable

Comment on lines +563 to +567
async def _invoke_duration_c(
self,
payload: list[dict],
clients: int = 1,
) -> tuple[float, float, float]:
Collaborator

A bit concerned by the amount of duplication introduced by defining parallel _invoke_duration_c, _invoke_duration, _invoke_for_duration methods, rather than sharing anything with the corresponding _invoke_n... methods. Since these are all private, couldn't we consolidate some to a single method that tracks both the number and duration and terminates when either condition is met?
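The suggested consolidation could look something like this single loop (illustrative names, synchronous for brevity; it terminates when either bound is hit, whichever comes first):

```python
import time
from itertools import cycle

def invoke_until(payloads, invoke, n_requests=None, run_duration=None):
    # Single loop covering both count-bound and time-bound runs:
    # stop when the request count OR the duration limit is reached.
    if n_requests is None and run_duration is None:
        raise ValueError("set n_requests and/or run_duration")
    start = time.monotonic()
    payload_iter = cycle(payloads)
    sent = 0
    results = []
    while True:
        if n_requests is not None and sent >= n_requests:
            break
        if run_duration is not None and time.monotonic() - start >= run_duration:
            break
        results.append(invoke(next(payload_iter)))
        sent += 1
    return results
```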

Collaborator Author

Consolidated from 6 to 3 methods.

Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to
  canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json
  so callback-contributed stats survive save/load round-trips
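The merge behavior described here can be sketched as a simple dict merge (hypothetical helper, not the actual implementation): keys present in stats.json but absent from the freshly computed stats are carried over.

```python
def merge_contributed_stats(computed: dict, saved: dict) -> dict:
    # Freshly computed stats win; extra keys from stats.json
    # (callback-contributed) are preserved.
    merged = dict(computed)
    for key, value in saved.items():
        merged.setdefault(key, value)
    return merged
```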

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration),
  _invoke_client (replaces _invoke_n/_invoke_duration),
  _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs