
Any comparison with existing models on common benchmarks? Text? Coding? MMLU?


Did you even look at the article?

Evaluation Benchmarks: Our evaluation encompasses five primary categories of benchmarks, each designed to assess distinct capabilities of the model:

• Language Understanding and Reasoning: HellaSwag [121], ARC-Challenge [14], WinoGrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].

• Code Generation: LiveCodeBench v6 [44], EvalPlus [60].

• Math & Reasoning: AIME 2025, MATH 500, HMMT 2025, PolyMath-en.

• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].

• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].


What article? The README and the linked Tech Report don't list MMLU results.
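
If anyone wants a number anyway: MMLU is straightforward to reproduce with EleutherAI's lm-evaluation-harness, assuming the weights are released in a Hugging Face-compatible format. A minimal sketch, where "org/model-name" is a placeholder and not the actual checkpoint id:

    import lm_eval

    # Minimal sketch: score MMLU with EleutherAI's lm-evaluation-harness.
    # "org/model-name" is a placeholder id, not the actual checkpoint.
    results = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face transformers backend
        model_args="pretrained=org/model-name,dtype=bfloat16",
        tasks=["mmlu"],
        num_fewshot=5,  # 5-shot, matching the usual MMLU reporting convention
        batch_size=8,
    )
    print(results["results"])  # per-subject and aggregate accuracies

Same harness covers several of the other listed benchmarks (ARC-Challenge, WinoGrande, GPQA), so a side-by-side with existing open models is doable even without official numbers.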



