Evaluation Benchmarks Our evaluation encompasses five categories of benchmarks, each designed to assess distinct capabilities of the model:
• Language Understanding and Reasoning: HellaSwag [121], ARC-Challenge [14], WinoGrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].
• Code Generation: LiveCodeBench v6 [44] and EvalPlus [60].
• Math & Reasoning: AIME 2025, MATH-500, HMMT 2025, and PolyMath-en.
• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].
• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].
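To make the setup concrete, the minimal sketch below shows how a subset of the general-knowledge benchmarks in the first bullet could be run with the EleutherAI lm-evaluation-harness (v0.4+). The harness choice, the model identifier, the task names, and the few-shot/batch settings are illustrative assumptions, not the evaluation pipeline actually used in this work.

```python
# Illustrative sketch only: assumes the EleutherAI lm-evaluation-harness (v0.4+);
# the model path and task list are hypothetical placeholders.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                               # Hugging Face backend
    model_args="pretrained=my-org/my-model",  # hypothetical model identifier
    tasks=["hellaswag", "arc_challenge", "winogrande", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) are keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Exact task identifiers and reported metrics (e.g. acc vs. acc_norm) vary by harness version, so the values produced by such a script are not directly comparable to the numbers reported here without matching the prompting and scoring configuration.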