Hi WildBench Team 👋,
We have updated a series of fine-tuned judge models:
opencompass/CompassJudger-1-32B-Instruct
opencompass/CompassJudger-1-14B-Instruct
opencompass/CompassJudger-1-7B-Instruct
opencompass/CompassJudger-1-1.5B-Instruct
In our experiments, we tested the reliability of these models as judges. Among them, CompassJudger-1-14B-Instruct and CompassJudger-1-32B-Instruct achieved a judge correlation of over 95% with GPT4o-0806 on the WildBench dataset, so they can be considered low-cost alternatives to GPT4o. For more details, please refer to our paper: https://arxiv.org/pdf/2410.16256
Thank you!