
{{ $t('config_train_settings') }}

{{ $t('config_train_param_title') }}

  • {{ $t('config_train_param_1') }}
  • {{ $t('config_train_param_2') }}
| Parameter | LLaMA | LLaMA (long CoT) | Qwen | Qwen (long CoT) |
| --- | --- | --- | --- | --- |
| GPU | 8*A100 | 8*A100 | 8*A100 | 8*A100 |
| Base model | Meta-Llama-3.1-8B | Meta-Llama-3.1-8B | Qwen2.5-7B | Qwen2.5-7B |
| deepspeed | ds_z2_config | ds_z2_config | ds_z2_config | ds_z2_config |
| template | default | default | default | default |
| cutoff_len | 4096 | 32768 | 4096 | 32768 |
| preprocessing_num_workers | 16 | 16 | 16 | 16 |
| packing | false | true | false | true |
| per_device_train_batch_size | 4 | 1 | 4 | 1 |
| gradient_accumulation_steps | 4 | 2 | 4 | 2 |
| learning_rate | 2.0e-5 | 3.0e-5 | 5.0e-6 | 5.0e-5 |
| use_liger_kernel | true | true | true | true |
| num_train_epochs | 3.0 | 3.0 | 3.0 | 3.0 |
| lr_scheduler_type | cosine | cosine | cosine | cosine |
| warmup_ratio | 0.03 | 0.03 | 0.1 | 0.1 |
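The parameter names in the table (`cutoff_len`, `packing`, `use_liger_kernel`, `preprocessing_num_workers`) match LLaMA-Factory's YAML config keys, so the LLaMA (long CoT) column can be sketched as a config file of that form. This is an illustrative sketch, not the project's actual file; the `dataset` name and `output_dir` are placeholders:

```yaml
### model
model_name_or_path: Meta-Llama-3.1-8B

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: ds_z2_config.json   # ZeRO stage-2 config

### dataset
dataset: your_long_cot_dataset   # placeholder, not from the source
template: default
cutoff_len: 32768
packing: true
preprocessing_num_workers: 16

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2   # assumed companion to the batch size above
learning_rate: 3.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
use_liger_kernel: true
output_dir: saves/llama3.1-8b-long-cot   # placeholder
```

If LLaMA-Factory is indeed the trainer, such a file is launched with `llamafactory-cli train <config>.yaml`.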

{{ $t('config_test_settings') }}

{{ $t('config_test_param_title') }}

| Parameter | LLaMA | Qwen |
| --- | --- | --- |
| max-out-len | 32768 | 32768 |
| hf-type | chat | chat |
| inference setting | vllm_llama3_1_8b_instruct | vllm_qwen2_5_7b_instruct |
| accelerator | vllm + cutoff | vllm + cutoff |
  • {{ $t('config_test_1') }}
    • {{ $t('config_test_1_2') }}
  • {{ $t('config_test_2') }}
| Domain | Benchmark | Benchmark file | Evaluator | Shots | Metric |
| --- | --- | --- | --- | --- | --- |
| general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3-shot | accuracy |
| general | IFEval | IFEval_gen | IFEvaluator | 0-shot | average accuracy on all IFEval benchmarks |
| general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5-shot | average accuracy on all MMLU-PRO benchmarks |
| math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0-shot | accuracy |
| math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| math | AIME_2024 | aime2024_repeat8_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | average accuracy over 8 runs |
| code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0-shot | pass@1 |
| code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0-shot | pass@1 |
| code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3-shot | pass@1 |
| code | LiveCodeBench(v5) | livecodebench_gen | LCBCodeGenerationEvaluator | 0-shot | pass@1 |
| reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
| reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | average accuracy on all KOR-Bench benchmarks |
| reasoning | CaLM | calm | CaLMEvaluator | 0-shot | average accuracy on all CaLM benchmarks |
| reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | accuracy |
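Programmatically, the benchmark table above can be treated as a small registry, e.g. for selecting which dataset config files to hand to the evaluation harness for one domain. This is an illustrative sketch with a subset of the rows; the `Benchmark` dataclass and `configs_for` helper are hypothetical, not part of any library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    domain: str     # general / math / code / reasoning
    name: str
    config: str     # dataset config file, e.g. "gsm8k_0shot_xver_gen"
    evaluator: str
    shots: int


# A subset of the rows from the table above.
BENCHMARKS = [
    Benchmark("general", "DROP", "drop_gen_a2697c", "IAAR-Shanghai/xVerify-9B-C", 3),
    Benchmark("math", "GSM8K", "gsm8k_0shot_xver_gen", "IAAR-Shanghai/xVerify-9B-C", 0),
    Benchmark("math", "MATH-500", "math_prm800k_500_0shot_cot_xver_gen", "IAAR-Shanghai/xVerify-9B-C", 0),
    Benchmark("code", "HumanEval", "humaneval_gen_8e312c", "HumanEvalEvaluator", 0),
    Benchmark("reasoning", "GPQA", "gpqa_xver_gen", "IAAR-Shanghai/xVerify-9B-C", 0),
]


def configs_for(domain: str) -> list[str]:
    """Return the dataset config names belonging to one domain."""
    return [b.config for b in BENCHMARKS if b.domain == domain]


print(configs_for("math"))
# → ['gsm8k_0shot_xver_gen', 'math_prm800k_500_0shot_cot_xver_gen']
```

Keeping the table in one place like this makes it easy to run a single domain (say, only the math suite) without editing the evaluation config by hand.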

{{ $t('config_dataset_selection_rules') }}

{{ $t('config_dataset_subtitle_selection_criteria') }}

{{ $t('config_dataset_criteria') }} {{ $t('config_dataset_requirements') }}
{{ $t('config_dataset_criteria_impact_metrics') }}
{{ $t('config_dataset_criteria_priority_ranking') }}
{{ $t('config_dataset_criteria_recency_requirement') }}
{{ $t('config_dataset_criteria_size_limitation') }}
{{ $t('config_dataset_criteria_suitability_assessment') }}

{{ $t('config_dataset_quality_assurance') }}

{{ $t('config_scorer_title') }}