Add plan for visualizing experiment results in the debugger #233
smirnovlad wants to merge 10 commits into main from
Conversation
The original plan incorrectly assumed step_candidates and all_trajectories are saved in results.json. They are not — strategies return them but run_tts_eval.py only saves a subset of fields. Added Step 0 prerequisite to save tree-building data before the converter can work.
… experiments in visual debugger
smirnovlad left a comment
(Reposting with proper formatting — ignore the previous review)
Nice implementation — follows the plan well and the app.js changes for offline mode / sample navigation are solid. One critical bug and a couple of minor things.
Critical: import before sys.path (convert_results_to_debugger.py:23)
from service_app.core.debugger_events import ... runs before sys.path.insert on line 26, so Python can't find the service_app package → ModuleNotFoundError on every run.
Fix: move PROJECT_ROOT / sys.path.insert above the import:
PROJECT_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(PROJECT_ROOT))
from service_app.core.debugger_events import convert_strategy_result_to_debugger_run
Minor: import yaml dependency
pyyaml is not in the base deps (only in service extras). Consider adding a note or using omegaconf (already a dep via hydra).
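One way to avoid a hard pyyaml requirement is a small loader that prefers pyyaml and falls back to omegaconf (already a transitive dependency via hydra). This is a sketch, not the code in this PR; `load_yaml_config` is a hypothetical helper:

```python
import importlib.util


def load_yaml_config(path):
    """Load a YAML file using whichever parser is installed.

    Hypothetical helper: prefers pyyaml, falls back to omegaconf
    (pulled in transitively by hydra), and fails with a clear
    message if neither is available.
    """
    if importlib.util.find_spec("yaml") is not None:
        import yaml

        with open(path) as f:
            return yaml.safe_load(f)
    if importlib.util.find_spec("omegaconf") is not None:
        from omegaconf import OmegaConf

        return OmegaConf.to_container(OmegaConf.load(path), resolve=True)
    raise ImportError("Install pyyaml or omegaconf to load YAML configs")
```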
Nit: speculative keys in run_tts_eval.py
all_traces and all_trajectory_steps — I don't see any strategy returning these keys. Not harmful (the if key in result guard skips them), but consider removing to avoid confusion.
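The guard pattern under discussion looks roughly like this; the helper name and the key list (a subset of the keys mentioned in this thread) are illustrative, not the actual `run_tts_eval.py` code:

```python
# Tree-building fields that strategies may return; only those actually
# present in a result are copied, so unknown keys are skipped silently
# instead of raising KeyError.
TREE_KEYS = [
    "step_candidates",
    "all_trajectories",
    "all_traces",
    "all_trajectory_steps",
]


def extract_tree_data(result: dict) -> dict:
    # Illustrative helper mirroring the `if key in result` guard.
    return {key: result[key] for key in TREE_KEYS if key in result}
```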
Serialization: looks good
Checked: save_results_json uses safe_serialize() which handles StepCandidate objects. And for beam search, step_candidates entries are already plain dicts. No issues here.
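For context, a recursive serializer like the `safe_serialize()` referenced here typically looks something like the sketch below; the project's real implementation may handle more cases:

```python
import dataclasses


def safe_serialize(obj):
    """Recursively convert objects (e.g. StepCandidate instances) into
    JSON-friendly structures. Minimal sketch only."""
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return safe_serialize(dataclasses.asdict(obj))
    if isinstance(obj, dict):
        return {k: safe_serialize(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [safe_serialize(v) for v in obj]
    if isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    if hasattr(obj, "__dict__"):
        return safe_serialize(vars(obj))
    return str(obj)  # last resort: stringify unknown types
```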
app.js offline mode: looks good
Graceful fallback to cached examples when API is down, auto-enable cached mode toggle, sample navigation with prev/next and incorrect filter — all solid.
- Move sys.path.insert before service_app import to avoid ModuleNotFoundError
- Remove all_traces and all_trajectory_steps keys (no strategy returns them)
Both are used by debugger_events.py: all_traces for self-consistency visualization, all_trajectory_steps for offline BoN step-level display.
smirnovlad left a comment
I've pushed two follow-up commits on this branch:
- 1069ce5 — Fixed the critical import order bug in convert_results_to_debugger.py (moved sys.path.insert before the service_app import)
- 7d3e57e — Restored the all_traces and all_trajectory_steps keys that I incorrectly removed (they are used by debugger_events.py for self-consistency and offline BoN visualization)
Remaining suggestion: move scripts/README_debugger.md to docs/service/debugger.md (see comment above).
smirnovlad left a comment
Final review after fixes
All previously identified issues have been resolved:
- Import order — fixed (sys.path.insert now comes before service_app import)
- Linting — flake8 (noqa: E402), black, isort all pass
- Tree data keys — all 7 keys restored; verified each is used by at least one strategy + debugger_events.py
- Docs location — moved to docs/service/debugger.md
What this PR does
- run_tts_eval.py — saves tree visualization data that strategies already compute but was previously discarded
- convert_results_to_debugger.py — new converter script that transforms experiment results into debugger-compatible JSON
- app.js + index.html — file upload, sample navigation (prev/next), "incorrect only" filter, offline mode fallback
- docs/service/debugger.md — usage docs
- Test config — quick beam search config for MATH-500 with OpenRouter
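The converter's overall shape, per the description above, is a CLI that reads results.json and writes debugger-compatible JSON. A hypothetical skeleton follows; function names, flags, and the transform itself are assumptions, not the PR's actual code:

```python
import argparse
import json
from pathlib import Path


def convert(results: dict) -> dict:
    """Placeholder transform. The real script delegates to
    service_app.core.debugger_events.convert_strategy_result_to_debugger_run."""
    return {"scenarios": results.get("samples", [])}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert experiment results into debugger JSON (sketch)"
    )
    parser.add_argument("results", type=Path, help="path to results.json")
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("cached_examples.json"),
        help="where to write the debugger-compatible JSON",
    )
    return parser


def run(argv=None) -> None:
    args = build_parser().parse_args(argv)
    payload = json.loads(args.results.read_text())
    args.output.write_text(json.dumps(convert(payload), indent=2))
```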
Non-blocking note
pyyaml dependency: the converter uses import yaml which requires pyyaml. It is available via hydra (already a dep for eval), but worth mentioning in docs if someone runs the converter in a minimal env.
This should be tested end-to-end by the author: run an experiment with the updated run_tts_eval.py, then run the converter and verify the debugger renders the tree.
LGTM from my side.
Improve experiment visualizer: add --merge flag, timestamps, and UI fixes
- Add --merge flag to converter for accumulating results in cached_examples.json
- Add run timestamps and dataset index to scenario titles
- Add requires_scorer field to fix self-consistency "Run" button
- Use unique IDs per strategy to prevent scenario collisions when combining
- Replace browser-native file input with English "Choose JSON file…" button
- Auto-run cached example when switching scenarios
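Merging by unique scenario ID, as described above, can be sketched like this; the function and field names are illustrative, not the converter's actual code:

```python
def merge_scenarios(existing: list, new: list) -> list:
    """Accumulate scenarios for cached_examples.json: a new entry whose
    'id' is already known replaces the old one, everything else is
    appended. Unique per-strategy IDs keep combined runs from colliding.
    Illustrative sketch of the --merge behavior described above."""
    by_id = {scenario["id"]: scenario for scenario in existing}
    for scenario in new:
        by_id[scenario["id"]] = scenario  # replace on ID match, append otherwise
    return list(by_id.values())
```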