OpenDataArena-Dataset Comparison

Compare datasets by model performance and multi-dimensional data scores, for both instruction-only and instruction–response data.

Each selected dataset is listed with its name, affiliation, release year, dataset size, tags, and overall average score, along with links to its Hugging Face page and paper.

Performance Overview

[Charts: performance by category (Overall, General, Math, Code, Reasoning), shown separately for the Llama model and the Qwen model.]

Data Score Overview

Heuristic (Answer Length)

Dataset Min Max Average

LLM-as-Judge (Q Score)

LLM-as-Judge (QA Score)

Model-based Data

LLM-as-Judge

Clarity

Clarity (Q)

Clarity (QA)

Coherence

Coherence (Q)

Coherence (QA)

Completeness

Completeness (Q)

Completeness (QA)

Complexity

Complexity (Q)

Complexity (QA)

Correctness

Correctness (Q)

Correctness (QA)

Meaningfulness

Meaningfulness (Q)

Meaningfulness (QA)

Difficulty and Relevance

Difficulty (Q)

Relevance (QA)

Model-based Scorer

Model-based Scorer (Q)

Deita Complexity

Thinking Prob

Model-based Scorer (QA)

Deita Quality

IFD

Reward Model