| general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3shot | accuracy |
| general | IFEval | IFEval_gen | IFEvaluator | 0shot | Average accuracy on all IFEval benchmarks |
| general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5shot | Average accuracy on all mmlu pro benchmarks |
| math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0shot | accuracy |
| math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| math | AIME_2024 | aime2024_repeat8_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | Average accuracy of 8 runs |
| code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0shot | pass@1 |
| code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0shot | pass@1 |
| code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3shot | pass@1 |
| code | LiveCodeBench(v5) | livecodebench_gen | LCBCodeGenerationEvaluator | 0shot | pass@1 |
| reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
| reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | Average accuracy on all korbench benchmarks |
| reasoning | CaLM | calm | CaLMEvaluator | 0shot | Average accuracy on all calm benchmarks |
| reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
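
The AIME_2024 entry reports an accuracy averaged over 8 repeated runs. As a rough illustration of one plausible reading of that metric, the sketch below computes accuracy for each run and then takes the mean across runs. The function names and the sample judgements are hypothetical; they are not taken from the evaluation framework or from any real results.

```python
from statistics import mean


def accuracy(judgements):
    """Fraction of items judged correct in a single run."""
    return sum(judgements) / len(judgements)


def average_accuracy(runs):
    """Mean of the per-run accuracies across repeated runs."""
    return mean(accuracy(run) for run in runs)


# Hypothetical data: 8 runs over the same 4 questions (True = judged correct).
runs = [
    [True, True, True, False],
    [True, False, True, False],
    [True, True, False, True],
    [False, True, False, True],
    [True, True, True, True],
    [False, False, True, False],
    [True, False, True, True],
    [False, True, True, False],
]

print(f"average accuracy over {len(runs)} runs: {average_accuracy(runs):.3f}")
```

Because every run covers the same number of questions here, the mean of per-run accuracies equals the pooled accuracy over all answers; the two would differ if runs had different sizes.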