Why does GPT-5.1 Codex underperform GPT-5 Codex on Terminal-Bench? (transluce.org)
9 points by mengk 67 days ago | 1 comment


Terminal-Bench evaluates how well models carry out complex tasks in the terminal. On the official leaderboard, GPT-5.1 Codex scores 6.5 percentage points below GPT-5 Codex, even with the same scaffold. What explains the regression? In under an hour of analysis, Docent finds that the gap probably stems from timeout errors rather than from weaker task performance.
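To make the distinction between timeout failures and capability failures concrete, here is a minimal Python sketch of the kind of breakdown one might compute over per-run results. Everything in it is an assumption for illustration: the `Run` record, the helper names, and the toy data are hypothetical, and none of it reflects Terminal-Bench's actual output format, Docent's method, or the real leaderboard numbers.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run record; not Terminal-Bench's real schema."""
    task: str
    passed: bool
    timed_out: bool


def pass_rate(runs):
    """Fraction of runs that passed (0.0 for an empty list)."""
    return sum(r.passed for r in runs) / len(runs) if runs else 0.0


def timeout_rate(runs):
    """Fraction of runs that hit the time limit."""
    return sum(r.timed_out for r in runs) / len(runs) if runs else 0.0


def pass_rate_excluding_timeouts(runs):
    """Pass rate over runs that finished within the time limit."""
    return pass_rate([r for r in runs if not r.timed_out])


# Toy data, invented purely to illustrate the breakdown.
model_a = [
    Run("t1", True, False),
    Run("t2", True, False),
    Run("t3", True, False),
    Run("t4", False, False),
]
model_b = [
    Run("t1", True, False),
    Run("t2", True, False),
    Run("t3", False, True),
    Run("t4", False, True),
]

for name, runs in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        f"raw={pass_rate(runs):.2f}",
        f"timeouts={timeout_rate(runs):.2f}",
        f"finished_only={pass_rate_excluding_timeouts(runs):.2f}",
    )
```

In this toy data, `model_b` has the lower raw pass rate, but all of its failures are timeouts. The finished-only rate is a diagnostic rather than a fair score, since dropping timed-out runs also drops the tasks a model was too slow to solve.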



