Training Settings

Framework: LLaMA-Factory version 0.9.2

Base Models: Llama-3.1-8B and Qwen2.5-7B

Parameter Details

| Parameter | LLaMA | LLaMA for long CoT | Qwen | Qwen for long CoT |
| --- | --- | --- | --- | --- |
| GPU | 8*A100 | 8*A100 | 8*A100 | 8*A100 |
| Base model | Meta-Llama-3.1-8B | Meta-Llama-3.1-8B | Qwen2.5-7B | Qwen2.5-7B |
| deepspeed | ds_z2_config | ds_z2_config | ds_z2_config | ds_z2_config |
| template | default | default | default | default |
| cutoff_len | 4096 | 32768 | 4096 | 32768 |
| preprocessing_num_workers | 16 | 16 | 16 | 16 |
| packing | false | true | false | true |
| per_device_train_batch_size | 4 | 1 | 4 | 1 |
| gradient_accumulation_steps | 4 | 2 | 4 | 2 |
| learning_rate | 2.0e-5 | 3.0e-5 | 5.0e-6 | 5.0e-5 |
| use_liger_kernel | true | true | true | true |
| num_train_epochs | 3.0 | 3.0 | 3.0 | 3.0 |
| lr_scheduler_type | cosine | cosine | cosine | cosine |
| warmup_ratio | 0.03 | 0.03 | 0.1 | 0.1 |
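Read as a LLaMA-Factory run, the LLaMA long-CoT column corresponds roughly to the YAML sketch below. This is a sketch under assumptions: the dataset name, the deepspeed config path, and the precision flag are not stated above, and the second batch-size row is interpreted here as gradient_accumulation_steps, its usual companion setting in LLaMA-Factory configs.

```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3.1-8B

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json  # assumed path

### dataset
dataset: your_sft_dataset  # placeholder; not specified above
template: default
cutoff_len: 32768
packing: true
preprocessing_num_workers: 16

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2  # assumed reading of the duplicated row
learning_rate: 3.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
use_liger_kernel: true
bf16: true  # assumed; precision is not given in the table
```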

Testing Settings

Framework: OpenCompass version 0.4.2

Parameter Details

| Parameter | LLaMA | Qwen |
| --- | --- | --- |
| max-out-len | 32768 | 32768 |
| hf-type | chat | chat |
| inference setting | vllm_llama3_1_8b_instruct | vllm_qwen2_5_7b_instruct |
| accelerator | vllm+cutoff | vllm+cutoff |
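Assuming these settings map directly onto OpenCompass CLI flags, a single evaluation run might be launched roughly as follows; the dataset config name is a placeholder, and the exact flag combination depends on the OpenCompass version.

```shell
# Sketch only: model config and flags mirror the table above.
opencompass \
  --models vllm_llama3_1_8b_instruct \
  --datasets gsm8k_0shot_xver_gen \
  --hf-type chat \
  --accelerator vllm \
  --max-out-len 32768
```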
| Domain | Benchmark | Benchmark file | Evaluator | Shots | Metric |
| --- | --- | --- | --- | --- | --- |
| general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3-shot | Accuracy |
| general | IFEval | IFEval_gen | IFEvaluator | 0-shot | Average accuracy on all IFEval benchmarks |
| general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5-shot | Average accuracy on all MMLU-PRO benchmarks |
| math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0-shot | Accuracy |
| math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| math | AIME_2024 | aime2024_repeat8_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Average accuracy over 8 runs |
| code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0-shot | pass@1 |
| code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0-shot | pass@1 |
| code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3-shot | pass@1 |
| code | LiveCodeBench | livecodebench_gen | LCBCodeGenerationEvaluator | 0-shot | pass@1 |
| reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
| reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Average accuracy on all KOR-Bench benchmarks |
| reasoning | CaLM | calm | CaLMEvaluator | 0-shot | Average accuracy on all CaLM benchmarks |
| reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0-shot | Accuracy |
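Two of the aggregate metrics above can be made concrete. The AIME_2024 score is the mean accuracy over 8 repeated runs, and pass@1 for the code benchmarks is the standard unbiased estimator of Chen et al. (2021). The function names below are illustrative, not OpenCompass internals.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:  # too few failures to fill k samples: guaranteed pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_accuracy_over_runs(per_run_correct: list[list[bool]]) -> float:
    """AIME-style metric: accuracy per run, averaged across runs."""
    per_run = [sum(run) / len(run) for run in per_run_correct]
    return sum(per_run) / len(per_run)
```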

Dataset Selection Rules

To ensure fair and effective evaluation, we established strict dataset selection criteria. All datasets used for training and evaluation are sourced from the Hugging Face platform and must meet the following screening conditions:

Selection Criteria

| Criterion | Requirement |
| --- | --- |
| Impact metrics | Likes or downloads count greater than 10 on the Hugging Face platform, ensuring community recognition |
| Priority ranking | In domains with abundant training data (e.g., general dialogue, mathematical reasoning, code generation), prioritize datasets with higher likes counts |
| Recency requirement | Dataset last updated after January 2023, ensuring data timeliness and relevance |
| Size limitation | Dataset size limited to under 1M, balancing training effectiveness with computational resource consumption |
| Suitability assessment | Dataset purpose or content must be suitable for SFT (supervised fine-tuning), containing high-quality instruction-response pairs |
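The screening conditions above can be sketched as a simple filter. In practice the fields can be read from the records returned by `huggingface_hub.HfApi().list_datasets()`; the sketch below checks the thresholds on plain values, and two readings are assumptions noted in the comments.

```python
from datetime import datetime, timezone

# Thresholds from the selection-criteria table.
# Assumptions: "under 1M" is read as fewer than 1M examples, and
# "updated after January 2023" as any update on or after 2023-01-01.
ENGAGEMENT_MIN = 10
UPDATE_CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)
MAX_EXAMPLES = 1_000_000

def meets_criteria(likes: int, downloads: int,
                   last_modified: datetime, num_examples: int) -> bool:
    """Check one candidate dataset against the screening conditions."""
    engaged = likes > ENGAGEMENT_MIN or downloads > ENGAGEMENT_MIN
    recent = last_modified >= UPDATE_CUTOFF
    small = num_examples < MAX_EXAMPLES
    return engaged and recent and small
```

Suitability for SFT (the last criterion) is a content judgment and is left to manual review rather than an automated check.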

Quality Assurance

To ensure training data quality, we conduct the following additional validations: