Evaluation Benchmarks Our evaluation encompasses five categories of benchmarks, each designed to assess distinct capabilities of the model:
• Language Understanding and Reasoning: HellaSwag [121], ARC-Challenge [14], WinoGrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].
• Code Generation: LiveCodeBench v6 [44] and EvalPlus [60].
• Math & Reasoning: AIME 2025, MATH-500, HMMT 2025, and PolyMath-en.
• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].
• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].
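To make the setup concrete, the minimal sketch below shows how a subset of the general-knowledge benchmarks in the first bullet could be run with the EleutherAI lm-evaluation-harness (v0.4+). The harness choice, the model identifier, the task names, and the few-shot/batch settings are illustrative assumptions, not the evaluation pipeline actually used in this work.

```python
# Illustrative sketch only: assumes the EleutherAI lm-evaluation-harness (v0.4+);
# the model path and task list are hypothetical placeholders.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                               # Hugging Face backend
    model_args="pretrained=my-org/my-model",  # hypothetical model identifier
    tasks=["hellaswag", "arc_challenge", "winogrande", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) are keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Exact task identifiers and reported metrics (e.g. acc vs. acc_norm) vary by harness version, so the values produced by such a script are not directly comparable to the numbers reported here without matching the prompting and scoring configuration.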