
{{ $t('config_train_settings') }}

{{ $t('config_train_param_title') }}

  • {{ $t('config_train_param_1') }}
  • {{ $t('config_train_param_2') }}
| Parameter | Llama | Llama for long CoT | Qwen2.5 | Qwen2.5 for long CoT | Qwen3 for long CoT |
| --- | --- | --- | --- | --- | --- |
| GPU | 8*A100 | 8*A100 | 8*A100 | 8*A100 | 8*A100 |
| Base model | Meta-Llama3.1-8B | Meta-Llama3.1-8B | Qwen2.5-7B | Qwen2.5-7B | Qwen3-8B-Base |
| deepspeed | ds_z3_config | ds_z3_config | ds_z3_config | ds_z3_config | ds_z3_config |
| template | default | default | default | default | default |
| cutoff_len | 4096 | 32768 | 4096 | 32768 | 32768 |
| preprocessing_num_workers | 16 | 16 | 16 | 16 | 16 |
| packing | false | true | false | true | true |
| per_device_train_batch_size | 4 | 2 | 4 | 2 | 2 |
| gradient_accumulation_steps | 4 | 2 | 4 | 2 | 2 |
| learning_rate | 2.0e-5 | 3.0e-5 | 5.0e-6 | 5.0e-5 | 5.0e-5 |
| use_liger_kernel | true | true | true | true | true |
| num_train_epochs | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| lr_scheduler_type | cosine | cosine | cosine | cosine | cosine |
| warmup_ratio | 0.03 | 0.03 | 0.1 | 0.1 | 0.1 |
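As an illustration, the "Qwen2.5 for long CoT" column above can be written out as a single config. This is a sketch using the field names from the table; the `model_name_or_path` key and the dict layout are assumptions (the remaining key names match LLaMA-Factory-style training configs).

```python
# "Qwen2.5 for long CoT" column from the table as one config dict.
# All values come from the table; "model_name_or_path" is an assumed key
# corresponding to the table's "Base model" row.
train_config = {
    "model_name_or_path": "Qwen2.5-7B",
    "deepspeed": "ds_z3_config",
    "template": "default",
    "cutoff_len": 32768,
    "preprocessing_num_workers": 16,
    "packing": True,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5.0e-5,
    "use_liger_kernel": True,
    "num_train_epochs": 3.0,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
}

# Effective global batch size on the 8*A100 setup:
num_gpus = 8
effective_batch = (num_gpus
                   * train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 8 * 2 * 2 = 32
```

Note that the long-CoT columns halve the per-device batch size and accumulation steps while raising `cutoff_len` to 32768 and enabling `packing`, trading batch size for sequence length.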

{{ $t('config_test_settings') }}

{{ $t('config_test_param_title') }}

| Setting | Llama3.1 | Qwen2.5 | Qwen3 |
| --- | --- | --- | --- |
| max-out-len | 32768 | 32768 | 32768 |
| hf-type | chat | chat | chat |
| inference setting | vllm_llama3_1_8b_instruct | vllm_qwen2_5_7b_instruct | vllm_qwen3_8b_instruct |
| accelerator | vllm + cutoff | vllm + cutoff | vllm + cutoff |
| temperature | 0 | 0 | 0.6 |
| top-p | - | - | 0.95 |
| top-k | - | - | 20 |
  • {{ $t('config_test_1') }}
    • {{ $t('config_test_1_2') }}
  • {{ $t('config_test_2') }}
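The decoding settings above can be captured as plain per-model dicts. This is a sketch: the field names mirror vLLM's `SamplingParams` (`max_tokens`, `temperature`, `top_p`, `top_k`), but that mapping is an assumption, and the "-" entries from the table are simply omitted.

```python
# Per-model decoding settings from the inference-settings table.
# Llama3.1 and Qwen2.5 use greedy decoding (temperature 0); only Qwen3
# samples with nucleus/top-k settings.
decoding = {
    "Llama3.1": {"max_tokens": 32768, "temperature": 0.0},
    "Qwen2.5":  {"max_tokens": 32768, "temperature": 0.0},
    "Qwen3":    {"max_tokens": 32768, "temperature": 0.6,
                 "top_p": 0.95, "top_k": 20},
}

def sampling_for(model: str) -> dict:
    """Return the decoding settings for one of the model families above."""
    return decoding[model]

print(sampling_for("Qwen3")["temperature"])  # 0.6
```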
| Domain | Benchmark | Benchmark file | Evaluator | Shot | Metric |
| --- | --- | --- | --- | --- | --- |
| general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3-shot | Accuracy |
| general | IFEval | IFEval_gen | IFEvaluator | 0-shot | Average accuracy on all IFEval benchmarks |
| general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5-shot | Average accuracy on all MMLU-PRO benchmarks |
| math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0-shot | Accuracy |
| math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | AIME_2024 | aime2024_repeat8_cver_gen | CompassVerifier-7B | 0-shot | Average accuracy over 8 runs |
| math | AIME_2025 | aime2025_repeat8_cver_gen | CompassVerifier-7B | 0-shot | Average accuracy over 8 runs |
| math | HMMT_Feb_2025 | hmmt2025_repeat8_cver_gen | CompassVerifier-7B | 0-shot | Average accuracy over 8 runs |
| math | CMIMC_2025 | cmimc2025_repeat8_cver_gen | CompassVerifier-7B | 0-shot | Average accuracy over 8 runs |
| math | BRUMO_2025 | brumo2025_repeat8_cver_gen | CompassVerifier-7B | 0-shot | Average accuracy over 8 runs |
| code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0-shot | Pass@1 |
| code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0-shot | Pass@1 |
| code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3-shot | Pass@1 |
| code | LiveCodeBench(v5) | livecodebench_gen | LCBCodeGenerationEvaluator | 0-shot | Pass@1 |
| reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Average accuracy on all KOR-Bench benchmarks |
| reasoning | CaLM | calm | CaLMEvaluator | 0-shot | Average accuracy on all CaLM benchmarks |
| reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
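The `repeat8` benchmarks (AIME, HMMT, CMIMC, BRUMO) are generated 8 times and the per-run accuracies are averaged. A minimal sketch of that metric, with hypothetical run values:

```python
# "Average accuracy over 8 runs": each benchmark is run 8 times and the
# per-run accuracy values are averaged. The numbers below are hypothetical,
# not real evaluation results.
def avg_accuracy(run_accuracies: list) -> float:
    """Mean accuracy across repeated generation runs."""
    return sum(run_accuracies) / len(run_accuracies)

runs = [0.50, 0.40, 0.45, 0.50, 0.55, 0.40, 0.50, 0.50]  # 8 hypothetical runs
print(avg_accuracy(runs))  # 0.475
```

Averaging over repeated runs reduces the variance that a single sampled generation would show on small benchmarks like AIME (30 problems).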

{{ $t('config_dataset_selection_rules') }}

{{ $t('config_dataset_subtitle_selection_criteria') }}

{{ $t('config_dataset_criteria') }} {{ $t('config_dataset_requirements') }}
{{ $t('config_dataset_criteria_impact_metrics') }}
{{ $t('config_dataset_criteria_priority_ranking') }}
{{ $t('config_dataset_criteria_recency_requirement') }}
{{ $t('config_dataset_criteria_size_limitation') }}
{{ $t('config_dataset_criteria_suitability_assessment') }}

{{ $t('config_dataset_quality_assurance') }}

{{ $t('config_scorer_title') }}