Skip to content

Add plan for visualizing experiment results in the debugger#233

Open
smirnovlad wants to merge 10 commits intomainfrom
experiment-visualization-plan
Open

Add plan for visualizing experiment results in the debugger#233
smirnovlad wants to merge 10 commits intomainfrom
experiment-visualization-plan

Conversation

@smirnovlad
Copy link
Copy Markdown
Collaborator

No description provided.

smirnovlad and others added 3 commits March 14, 2026 00:42
The original plan incorrectly assumed step_candidates and all_trajectories
are saved in results.json. They are not — strategies return them but
run_tts_eval.py only saves a subset of fields. Added Step 0 prerequisite
to save tree-building data before the converter can work.
smirnovlad

This comment was marked as off-topic.

Copy link
Copy Markdown
Collaborator Author

@smirnovlad smirnovlad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reposting with proper formatting — ignore the previous review)

Nice implementation — follows the plan well and the app.js changes for offline mode / sample navigation are solid. One critical bug and a couple of minor things.

Critical: import before sys.path (convert_results_to_debugger.py:23)

from service_app.core.debugger_events import ... runs before sys.path.insert on line 26, so Python can't find the service_app package → ModuleNotFoundError on every run.

Fix: move PROJECT_ROOT / sys.path.insert above the import:

PROJECT_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(PROJECT_ROOT))

from service_app.core.debugger_events import convert_strategy_result_to_debugger_run

Minor: import yaml dependency

pyyaml is not in the base deps (only in service extras). Consider adding a note or using omegaconf (already a dep via hydra).

Nit: speculative keys in run_tts_eval.py

all_traces and all_trajectory_steps — I don't see any strategy returning these keys. Not harmful (the if key in result guard skips them), but consider removing to avoid confusion.

Serialization: looks good

Checked: save_results_json uses safe_serialize() which handles StepCandidate objects. And for beam search, step_candidates entries are already plain dicts. No issues here.

app.js offline mode: looks good

Graceful fallback to cached examples when API is down, auto-enable cached mode toggle, sample navigation with prev/next and incorrect filter — all solid.

- Move sys.path.insert before service_app import to avoid ModuleNotFoundError
- Remove all_traces and all_trajectory_steps keys (no strategy returns them)
Both are used by debugger_events.py: all_traces for self-consistency
visualization, all_trajectory_steps for offline BoN step-level display.
@smirnovlad
Copy link
Copy Markdown
Collaborator Author

One more thing: scripts/README_debugger.md feels misplaced — we keep docs in docs/. Please move it to docs/service/debugger.md (next to the existing docs/service/SERVICE_API_GUIDE.md).

Copy link
Copy Markdown
Collaborator Author

@smirnovlad smirnovlad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed two follow-up commits on this branch:

  1. 1069ce5 — Fixed the critical import order bug in convert_results_to_debugger.py (moved sys.path.insert before the service_app import)
  2. 7d3e57e — Restored all_traces and all_trajectory_steps keys that I incorrectly removed (they are used by debugger_events.py for self-consistency and offline BoN visualization)

Remaining suggestion: move scripts/README_debugger.md to docs/service/debugger.md (see comment above).

Copy link
Copy Markdown
Collaborator Author

@smirnovlad smirnovlad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final review after fixes

All previously identified issues have been resolved:

  • Import order — fixed (sys.path.insert now comes before service_app import)
  • Linting — flake8 (noqa: E402), black, isort all pass
  • Tree data keys — all 7 keys restored; verified each is used by at least one strategy + debugger_events.py
  • Docs location — moved to docs/service/debugger.md

What this PR does

  1. run_tts_eval.py — saves tree visualization data that strategies already compute but was previously discarded
  2. convert_results_to_debugger.py — new converter script that transforms experiment results into debugger-compatible JSON
  3. app.js + index.html — file upload, sample navigation (prev/next), incorrect only filter, offline mode fallback
  4. docs/service/debugger.md — usage docs
  5. Test config — quick beam search config for MATH-500 with OpenRouter

Non-blocking note

pyyaml dependency: the converter uses import yaml which requires pyyaml. It is available via hydra (already a dep for eval), but worth mentioning in docs if someone runs the converter in a minimal env.

Should be tested end-to-end by the author: run an experiment with updated run_tts_eval.py, then run the converter and verify the debugger renders the tree.

LGTM from my side.

@smirnovlad-test
Copy link
Copy Markdown
Collaborator

smirnovlad-test commented Mar 20, 2026

image
  1. Not clear names of entries in selector of cached examples
  2. "Выберите файл" -> add styles
  3. Hardcoded "Online bon" for each cached example? It is like a constant
  4. More tests (with bigger beam size, number of candidates and so on)

Improve experiment visualizer: add --merge flag, timestamps, and UI fixes
  - Add --merge flag to converter for accumulating results in cached_examples.json
  - Add run timestamps and dataset index to scenario titles
  - Add requires_scorer field to fix self-consistency "Run" button
  - Use unique IDs per strategy to prevent scenario collisions when combining
  - Replace browser-native file input with English "Choose JSON file…" button
  - Auto-run cached example when switching scenarios
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants