Problems in Reproducing the Results #13

I tried to reproduce the results in Figure 2, which shows the impact of increasing test-time computation in reflection. I changed the maximum number of iterations in locomo_test.py and hotpotqa.py and evaluated on locomo10.json and eval_400.json. However, the F1 score did not improve as the number of iterations increased, so the deep reflection mechanism does not seem to work as expected. Could you please advise on how to resolve this issue?

[iteration1_batch_results_0_9.json](https://github.com/user-attachments/files/26328513/iteration1_batch_results_0_9.json)
[iteration1_batch_statistics_0_9.json](https://github.com/user-attachments/files/26328514/iteration1_batch_statistics_0_9.json)

[iteration3_batch_results_0_9.json](https://github.com/user-attachments/files/26328518/iteration3_batch_results_0_9.json)
[iteration3_batch_statistics_0_9.json](https://github.com/user-attachments/files/26328519/iteration3_batch_statistics_0_9.json)
