
Problems in Reproducing the Results #13

@Ann997943

I tried to reproduce the results in Figure 2, which shows the impact of increasing test-time computation in reflection. I changed the maximum number of iterations in locomo_test.py and hotpotqa.py and evaluated on locomo10.json and eval_400.json. However, the F1 score did not improve with more iterations, so the deep reflection mechanism does not seem to be working as expected. Could you please advise on how to resolve this issue?

iteration1_batch_results_0_9.json
iteration1_batch_statistics_0_9.json

iteration3_batch_results_0_9.json
iteration3_batch_statistics_0_9.json
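
A minimal sketch of one way to check whether raising the iteration limit changed any predictions at all, before comparing aggregate F1. It assumes each run has already been loaded into a dict mapping a question id to the predicted answer string; the names `token_f1` and `compare_runs` and this dict layout are hypothetical and do not reflect the actual schema of the attached JSON files or the repository's own scoring code. The F1 here is the common whitespace-token overlap F1, which may differ from the metric used in the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token overlap F1 (assumed metric, not the repo's own)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compare_runs(run_a: dict, run_b: dict, references: dict) -> dict:
    """Compare two runs keyed by question id (hypothetical layout)."""
    ids = list(references)
    # Questions whose predicted answer differs between the two runs.
    changed = [q for q in ids if run_a.get(q) != run_b.get(q)]
    mean_a = sum(token_f1(run_a.get(q, ""), references[q]) for q in ids) / len(ids)
    mean_b = sum(token_f1(run_b.get(q, ""), references[q]) for q in ids) / len(ids)
    return {"changed": changed, "mean_f1_a": mean_a, "mean_f1_b": mean_b}
```

If `changed` comes back empty, the two runs produced identical answers, which would suggest the modified iteration limit is not reaching the reflection loop rather than reflection failing to help.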
