Problems in Reproducing the Results #13
Open
Description
I tried to reproduce the results in Figure 2, which shows the impact of increasing test-time computation in reflection. I changed the maximum number of iterations in locomo_test.py and hotpotqa.py and evaluated on locomo10.json and eval_400.json. However, the F1 score did not improve with more iterations, so the deep reflection mechanism does not appear to work as expected. Could you please advise on how to resolve this issue? The result and statistics files from my runs are attached:
iteration1_batch_results_0_9.json
iteration1_batch_statistics_0_9.json
iteration3_batch_results_0_9.json
iteration3_batch_statistics_0_9.json
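
For reference, here is a minimal sketch of the kind of comparison I mean when I say the F1 score did not improve. It assumes a SQuAD-style token-overlap F1 with plain whitespace tokenization, which may differ from the metric the repository actually uses. The record keys `prediction` and `reference` are hypothetical and are not the schema of the attached files, and the two toy runs are made-up inputs, not my results.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed metric, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(records) -> float:
    """Average F1 over records with hypothetical 'prediction'/'reference' keys."""
    scores = [token_f1(r["prediction"], r["reference"]) for r in records]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy inputs for illustration only; not taken from the attached files.
    run_1_iteration = [{"prediction": "Paris France", "reference": "Paris"}]
    run_3_iterations = [{"prediction": "Paris", "reference": "Paris"}]
    delta = mean_f1(run_3_iterations) - mean_f1(run_1_iteration)
    print(f"F1 change from 1 to 3 iterations: {delta:+.4f}")
```

In my runs the equivalent difference between the 1-iteration and 3-iteration settings was not positive, which is why I suspect the extra reflection iterations are not taking effect.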