Task: Verify ToT implementation against original paper #240

Open
smirnovlad wants to merge 3 commits into main from task/tot-verification

@smirnovlad commented Mar 20, 2026

Task for: @rvz16

Verify that our beam search + LLM-as-a-critic (PR #161) correctly implements Tree of Thoughts from the original paper.
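For reference while comparing against the paper, here is a minimal sketch of the breadth-first search with a fixed beam that the verification targets. It is not the code from PR #161: `beam_search`, `propose`, and `score` are hypothetical names, and in a real run `propose` and `score` would wrap LLM calls (thought generation and critic scoring). Here they are plain callables, so the control flow can be checked in isolation.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")


def beam_search(
    root: State,
    propose: Callable[[State], List[State]],  # hypothetical: generates next thoughts
    score: Callable[[State], float],          # hypothetical: critic value for a state
    beam_width: int = 5,
    max_depth: int = 3,
) -> List[State]:
    """Keep the `beam_width` best states at each depth, breadth-first."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every state currently in the beam.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break  # nothing left to expand; return the last non-empty beam
        # sorted() is stable, so ties keep their generation order.
        candidates = sorted(candidates, key=score, reverse=True)
        frontier = candidates[:beam_width]
    return frontier
```

One thing to check against the original logs is whether ties and duplicate candidates are handled the same way, since this sketch keeps duplicates and breaks ties by generation order.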

Phase 1: Reproduce Game of 24 results with GPT-4

  • Compare our prompts (config/prompts/tree-of-thought/game24/) with original prompts
  • Create experiment config for Game of 24 + GPT-4 via OpenRouter + beam_width=5
  • Implement Game of 24 evaluator (verify expression = 24 using given numbers)
  • Run on 100 puzzles (indices 900–999), target ~74% success rate (paper result)
  • Compare trajectories with original logs
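The evaluator item above could look roughly like the sketch below. It is an assumption about the interface, not the evaluator from the original repository: `is_correct_24` is a hypothetical name, the input is assumed to be a final expression such as `(10 - 4) * (13 - 9) = 24`, and only `+`, `-`, `*`, `/`, parentheses, and non-negative integer literals are accepted. Exact rational arithmetic avoids floating-point false negatives on puzzles that need fractions.

```python
import ast
import re
from collections import Counter
from fractions import Fraction

# Only the four basic operators are allowed in a solution.
_BINOPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression exactly, rejecting anything else."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        return _BINOPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported syntax")


def is_correct_24(expression, numbers, target=24):
    """True if `expression` uses each given number exactly once and equals `target`."""
    lhs = expression.split("=")[0].strip()  # drop an optional "= 24" suffix
    used = [int(token) for token in re.findall(r"\d+", lhs)]
    if Counter(used) != Counter(numbers):
        return False  # a number is missing, extra, or repeated
    try:
        return _evaluate(ast.parse(lhs, mode="eval")) == target
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False


def success_rate(outcomes):
    """Fraction of puzzles solved, given one boolean per puzzle."""
    return sum(outcomes) / len(outcomes)
```

For example, `is_correct_24("8 / (3 - 8 / 3)", [3, 3, 8, 8])` is `True`, a case that floating-point evaluation with a strict equality check would reject.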

Phase 2: Run experiments with Qwen2.5-Math-7B-Instruct

After Phase 1 confirms correctness:

  • MATH-500 — beam search + LLM-as-a-critic, 3 seeds
  • OlympiadBench — beam search + LLM-as-a-critic, 3 seeds
  • GaoKao 2023 En — beam search + LLM-as-a-critic, 3 seeds
  • Minerva Math — beam search + LLM-as-a-critic, 3 seeds
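Since each Phase 2 benchmark is run with 3 seeds, results will presumably be reported as a mean with a spread. A minimal sketch of that aggregation follows. `summarize_seeds` is a hypothetical helper, and the choice of sample standard deviation is an assumption rather than something the task description specifies.

```python
from statistics import mean, stdev


def summarize_seeds(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return mean(accuracies), stdev(accuracies)
```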

References

See full task description in docs/tasks/tot_verification.md.
