{{ $t('comparison_title') }}

{{ $t('comparison_subtitle_1') }}

{{ $t('comparison_subtitle_2') }}

{{ getDisplayName(ds.name) }} {{ $t('all_' + tag.toLowerCase()) !== ('all_' + tag.toLowerCase()) ? $t('all_' + tag.toLowerCase()) : getTagDisplayName(tag) }}
{{ getDatasetByName(dsName).affiliation }}
{{ $t('comparison_release_year') }}
{{ getDatasetByName(dsName).year || 'N/A' }}
{{ $t('comparison_dataset_size') }}
{{ getDatasetByName(dsName).size || 'N/A' }}
{{ $t('comparison_tags') }}
{{ $t('all_' + tag.toLowerCase()) !== ('all_' + tag.toLowerCase()) ? $t('all_' + tag.toLowerCase()) : getTagDisplayName(tag) }}
{{ $t('comparison_overall_average') }}
{{ formatScore(getDatasetByName(dsName).overall_avg) }}

{{ $t('comparison_performance_overview') }}

{{ $t('comparison_overall') }}
{{ $t('all_general') }}
{{ $t('all_math') }}
{{ $t('all_code') }}
{{ $t('all_reasoning') }}

{{ $t('comparison_llama_model') }}

{{ $t('comparison_qwen_model') }}

{{ $t('comparison_data_score_overview') }}

Heuristic (Answer Length)

{{ $t('comparison_dataset') }} | {{ $t('comparison_min') }} | {{ $t('comparison_max') }} | {{ $t('comparison_average') }}
{{ getDisplayName(ds.name) }} | {{ ds.min }} | {{ ds.max }} | {{ ds.avg }}

LLM-as-Judge (Q Score)

LLM-as-Judge (QA Score)

Model-based Scorer

LLM-as-Judge

Clarity

Clarity (Q)

Clarity (QA)

Coherence

Coherence (Q)

Coherence (QA)

Completeness

Completeness (Q)

Completeness (QA)

Complexity

Complexity (Q)

Complexity (QA)

Correctness

Correctness (Q)

Correctness (QA)

Meaningfulness

Meaningfulness (Q)

Meaningfulness (QA)

Difficulty and Relevance

Difficulty (Q)

Relevance (QA)

Model-based Scorer

Model-based Scorer (Q)

Deita Complexity

Thinking Prob

Model-based Scorer (QA)

Deita Quality

IFD

Reward Model