| general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3 shot | Accuracy |
| general | IFEval | IFEval_gen | IFEvaluator | 0 shot | Average accuracy on all IFEval benchmarks |
| general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5 shot | Average accuracy on all MMLU-PRO benchmarks |
| math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0 shot | Accuracy |
| math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| math | AIME_2024 | aime2024_repeat8_cver_gen | CompassVerifier-7B | 0 shot | Average accuracy over 8 runs |
| math | AIME_2025 | aime2025_repeat8_cver_gen | CompassVerifier-7B | 0 shot | Average accuracy over 8 runs |
| math | HMMT_Feb_2025 | hmmt2025_repeat8_cver_gen | CompassVerifier-7B | 0 shot | Average accuracy over 8 runs |
| math | CMIMC_2025 | cmimc2025_repeat8_cver_gen | CompassVerifier-7B | 0 shot | Average accuracy over 8 runs |
| math | BRUMO_2025 | brumo2025_repeat8_cver_gen | CompassVerifier-7B | 0 shot | Average accuracy over 8 runs |
| code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0 shot | Pass@1 |
| code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0 shot | Pass@1 |
| code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3 shot | Pass@1 |
| code | LiveCodeBench(v5) | livecodebench_gen | LCBCodeGenerationEvaluator | 0 shot | Pass@1 |
| reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |
| reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Average accuracy on all KOR-Bench benchmarks |
| reasoning | CaLM | calm | CaLMEvaluator | 0 shot | Average accuracy on all CaLM benchmarks |
| reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0 shot | Accuracy |