Framework: LLaMA-Factory version 0.9.2
Base Models: Llama-3.1-8B, Qwen2.5-7B
Parameter | LLaMA | LLaMA for long CoT | Qwen | Qwen for long CoT |
---|---|---|---|---|
GPU | 8*A100 | 8*A100 | 8*A100 | 8*A100 |
Base model | Meta-Llama-3.1-8B | Meta-Llama-3.1-8B | Qwen2.5-7B | Qwen2.5-7B |
deepspeed | ds_z2_config | ds_z2_config | ds_z2_config | ds_z2_config |
template | default | default | default | default |
cutoff_len | 4096 | 32768 | 4096 | 32768 |
preprocessing_num_workers | 16 | 16 | 16 | 16 |
Packing | false | true | false | true |
per_device_train_batch_size | 4 | 1 | 4 | 1 |
gradient_accumulation_steps | 4 | 2 | 4 | 2 |
learning_rate | 2.0e-5 | 3.0e-5 | 5.0e-6 | 5.0e-5 |
use_liger_kernel | true | true | true | true |
num_train_epochs | 3.0 | 3.0 | 3.0 | 3.0 |
lr_scheduler_type | cosine | cosine | cosine | cosine |
warmup_ratio | 0.03 | 0.03 | 0.1 | 0.1 |
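For concreteness, the sketch below shows how one column of the table (here, "Qwen for long CoT") could be written as a LLaMA-Factory YAML config and launched with `llamafactory-cli train`. The dataset name, DeepSpeed config path, and output directory are placeholders not given in the table; the remaining keys simply mirror the values above.

```python
# Minimal sketch, assuming the key names used in the table above.
# Dataset name, DeepSpeed config path, and output_dir are placeholders.
import subprocess
import yaml

config = {
    "model_name_or_path": "Qwen/Qwen2.5-7B",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",
    "deepspeed": "examples/deepspeed/ds_z2_config.json",  # placeholder path
    "dataset": "my_long_cot_dataset",                      # placeholder dataset name
    "template": "default",
    "cutoff_len": 32768,
    "preprocessing_num_workers": 16,
    "packing": True,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5.0e-5,
    "use_liger_kernel": True,
    "num_train_epochs": 3.0,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "output_dir": "saves/qwen2_5-7b-long-cot",             # not listed in the table
}

# Write the config and launch training, equivalent to:
#   llamafactory-cli train qwen_long_cot_sft.yaml
with open("qwen_long_cot_sft.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
subprocess.run(["llamafactory-cli", "train", "qwen_long_cot_sft.yaml"], check=True)
```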
Framework: OpenCompass version 0.4.2
Setting | LLaMA | Qwen |
---|---|---|
max-out-len | 32768 | 32768 |
hf-type | chat | chat |
inference setting | vllm_llama3_1_8b_instruct | vllm_qwen2_5_7b_instruct |
accelerator | vllm+cutoff | vllm+cutoff |
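As an illustration, a minimal OpenCompass-style run config with these settings might look like the sketch below. The dataset import path is hypothetical (it refers to one of the project-local benchmark files listed in the table that follows), while `vllm_qwen2_5_7b_instruct` is the model config named in the "inference setting" row above; exact import paths depend on how the configs directory is laid out.

```python
# Illustrative OpenCompass run config (placed under the configs/ tree and run
# with `opencompass <this_file>.py`); import paths are assumptions.
from mmengine.config import read_base

with read_base():
    # Hypothetical project-local dataset config, named after the Benchmark_file
    # column in the table below.
    from .datasets.gsm8k.gsm8k_0shot_xver_gen import gsm8k_datasets
    # vLLM model config referenced in the "inference setting" row above.
    from .models.qwen2_5.vllm_qwen2_5_7b_instruct import models as qwen2_5_7b_vllm

datasets = [*gsm8k_datasets]
models = [*qwen2_5_7b_vllm]

# Align generation length with the max-out-len setting above.
for model in models:
    model["max_out_len"] = 32768
```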
Our evaluation configurations follow established toolkits such as math-eval-harness and lm-evaluation-harness, to ensure consistency and comparability with existing benchmarks. In addition, we employ model-based judges (e.g., xVerify, Omni-Judge) to extract and evaluate answers, ensuring higher accuracy and robustness in the results.
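To make the judge-based protocol concrete, the snippet below sketches how a single response could be scored with xVerify via standard `transformers` generation. The prompt wording here is only illustrative; the actual template and verdict parsing should follow the IAAR-Shanghai/xVerify-9B-C model card, and Omni-Judge is used analogously.

```python
# Minimal sketch of judging one response with xVerify; the prompt template is
# an illustration, not the official xVerify format.
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE = "IAAR-Shanghai/xVerify-9B-C"
tokenizer = AutoTokenizer.from_pretrained(JUDGE)
judge = AutoModelForCausalLM.from_pretrained(JUDGE, torch_dtype="auto", device_map="auto")

def judge_answer(question: str, candidate: str, reference: str) -> str:
    """Return the judge model's verdict for one (question, candidate, reference) triple."""
    prompt = (
        "Decide whether the candidate answer matches the reference answer.\n"
        f"Question: {question}\n"
        f"Candidate answer: {candidate}\n"
        f"Reference answer: {reference}\n"
        "Verdict (Correct/Incorrect):"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(judge.device)
    out = judge.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Decode only the newly generated tokens (the verdict).
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
```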
Domain | Benchmarks | Benchmark_file | Evaluator | Shot | Metric |
---|---|---|---|---|---|
general | DROP | drop_gen_a2697c | IAAR-Shanghai/xVerify-9B-C | 3shot | accuracy |
general | IFEval | IFEval_gen | IFEvaluator | 0shot | Average accuracy on all IFEval benchmarks |
general | AGIEval | agieval_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
general | MMLU-PRO | mmlu_pro_few_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 5shot | Average accuracy on all MMLU-PRO benchmarks |
math | Omni-MATH | omni_math_gen | KbsdJames/Omni-Judge | 0shot | accuracy |
math | OlympiadBenchMath | OlympiadBenchMath_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
math | GSM8K | gsm8k_0shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
math | MATH-500 | math_prm800k_500_0shot_cot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
math | AIME_2024 | aime2024_repeat8_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | Average accuracy over 8 runs |
code | HumanEval | humaneval_gen_8e312c | HumanEvalEvaluator | 0shot | pass@1 |
code | HumanEval+ | humaneval_plus_gen_8e312c | HumanEvalPlusEvaluator | 0shot | pass@1 |
code | MBPP | sanitized_mbpp_mdblock_gen_a447ff | MBPPEvaluator | 3shot | pass@1 |
code | LiveCodeBench | livecodebench_gen | LCBCodeGenerationEvaluator | 0shot | pass@1 |
reasoning | ARC_c | ARC_c_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
reasoning | BBH | bbh_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
reasoning | KOR-Bench | korbench_single_0_shot_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | Average accuracy on all KOR-Bench benchmarks |
reasoning | CaLM | calm | CaLMEvaluator | 0shot | Average accuracy on all CaLM benchmarks |
reasoning | GPQA | gpqa_xver_gen | IAAR-Shanghai/xVerify-9B-C | 0shot | accuracy |
To ensure fairness and effectiveness in evaluation, we have established strict dataset selection criteria. All datasets used for training and evaluation are sourced from the Hugging Face platform and must meet the following screening conditions:
Criteria | Requirements |
---|---|
Impact Metrics | Likes or downloads count greater than 10 on the Hugging Face platform, ensuring community recognition |
Priority Ranking | In domains with abundant training data (e.g., general dialogue, mathematical reasoning, code generation), prioritize datasets with higher likes counts |
Recency Requirement | Dataset last updated after January 2023, ensuring data timeliness and relevance |
Size Limitation | Dataset size limited to under 1M, balancing training effectiveness with computational resource consumption |
Suitability Assessment | Dataset purpose or content must be suitable for SFT (Supervised Fine-Tuning) training, containing high-quality instruction-response pairs |
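The screening conditions above can be applied programmatically through the Hugging Face Hub API. The sketch below is a rough illustration under assumptions of our own (the search keyword and candidate limit are arbitrary, and the size and suitability checks still require inspecting each dataset card); it is not the exact pipeline used.

```python
# Rough sketch of the screening conditions via huggingface_hub; the search
# keyword and the candidate limit are illustrative assumptions.
from datetime import datetime, timezone
from huggingface_hub import HfApi

api = HfApi()
cutoff_date = datetime(2023, 1, 1, tzinfo=timezone.utc)

candidates = []
# Priority Ranking: sort by likes in descending order.
for ds in api.list_datasets(search="math reasoning", sort="likes",
                            direction=-1, limit=200, full=True):
    likes = ds.likes or 0
    downloads = ds.downloads or 0
    if likes <= 10 and downloads <= 10:
        continue  # Impact Metrics: likes or downloads must exceed 10
    if ds.last_modified is None or ds.last_modified < cutoff_date:
        continue  # Recency Requirement: last updated after January 2023
    candidates.append(ds.id)

# Size Limitation and Suitability Assessment still require manual inspection
# of each dataset card and its contents.
print(candidates[:20])
```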
To ensure training data quality, we conduct the following additional validations: