Why does GPT-5.1 Codex underperform GPT-5 Codex on Terminal-Bench?

mengk · 2026-02-17T18:06:31

Terminal-Bench evaluates how well models carry out complex tasks in the terminal. On the official leaderboard, GPT-5.1 Codex scores 6.5 percentage points below GPT-5 Codex, even though both run with the same scaffold. What explains the regression? In under an hour, Docent finds that it probably stems from timeout errors rather than from weaker task-solving ability.