I was initially excited by 4.7, as it does a lot better in my tests, but their reasoning/pricing is really weird and unpredictable.
Apart from that, in real-life usage, gpt-5.3-codex is ~10x cheaper in my case, simply because of the cached input discount (otherwise it would still be around 3-4x cheaper anyway).
Medium reasoning has regressed since 4.6, while None and Max have improved in our benchmark.
We suspect that this is how Claude tries to cope with the increased user base.
Note, Google and OpenAI probably did something similar long ago.
> Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] for being broken, because Opus 4.6 did poorly on them and called my tests broken...
One of the worst is TikTok. Even as a developer, when someone sends me a TikTok link and I have to visit it, I get stuck in the browser (same with the app, but I uninstalled it), and the way they trap you in feels almost device-breaking.
I guess writing code is now like creating punch cards for old computers. Or, more recently, like writing ASM instead of using a higher-level language like C. Now we simply write our "code" in a higher-level language, natural language, and the LLM is the compiler.
It needs to be something stronger than just deterministic.
With the right settings, an LLM is deterministic. But even then, small variations in input can cause unforeseen changes in output, sometimes drastic, sometimes minor. Knowing that I'm likely misusing the vocabulary, I would say that this counts as the output being chaotic, so we need compilers to be non-chaotic (and deterministic, though I think you might be able to have something that is non-deterministic and non-chaotic). I'm not sure that a non-chaotic LLM could ever exist.
(Thinking on it a bit more, there are some esoteric languages that might be chaotic, so this might be more difficult to pin down than I thought.)
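To make the deterministic-but-chaotic distinction a bit more concrete, here is a minimal sketch. It uses the logistic map as a stand-in (my choice of toy system, nothing LLM-specific): the same input always gives the same output, yet a 1e-10 nudge to the input eventually produces a completely different trajectory. The function name and step count are just illustrative.

```python
def logistic_orbit(x0: float, steps: int = 60, r: float = 4.0) -> list[float]:
    """Iterate x -> r * x * (1 - x); at r = 4 this map is a textbook chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


a = logistic_orbit(0.2)          # baseline run
b = logistic_orbit(0.2)          # identical input, run again
c = logistic_orbit(0.2 + 1e-10)  # almost identical input

# Deterministic: repeating the exact same input reproduces the exact same orbit.
same_run_identical = a == b

# Chaotic: the tiny initial gap grows until the two orbits are unrelated.
max_gap = max(abs(x - y) for x, y in zip(a, c))

print("identical on rerun:", same_run_identical)
print("largest gap after a 1e-10 nudge:", max_gap)
```

A compiler is the opposite case: renaming a variable or adding whitespace normally leaves the program's behavior unchanged, which is the "non-chaotic" property I'm trying to describe.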
Also, give the same programming task to 2 devs and you end up with 2 different solutions. Heck, have the same dev do the same thing twice and you will have 2 different ones.
Determinism seems like this big gotcha, but in itself, is it really?
> Heck, have the same dev do the same thing twice and you will have 2 different ones
"Do the same thing" I need to be pedantic here because if they do the same thing, the exact same solution will be produced.
The compiler needs to guarantee that across multiple systems. How would QA know they're testing the version that is staged to be pushed to prod if you can't guarantee it's the same?
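A rough sketch of what that guarantee buys you, assuming builds are compared by content hash. The helper name `artifact_digest` and the byte strings are hypothetical stand-ins for real build outputs, not any particular CI system's API.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of a build artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


# Hypothetical artifacts: the same source built on two machines, plus a
# regenerated build that differs by a single byte.
qa_build = b"\x7fELF...same machine code..."
staged_build = b"\x7fELF...same machine code..."
regenerated_build = b"\x7fELF...same machine code,,,"

# With a reproducible toolchain, QA can confirm it tested exactly what ships.
qa_matches_staged = artifact_digest(qa_build) == artifact_digest(staged_build)

# If the "compiler" may emit something different each time, the check fails
# and whatever QA signed off on says nothing about the staged build.
regenerated_matches = artifact_digest(regenerated_build) == artifact_digest(staged_build)

print(qa_matches_staged, regenerated_matches)
```

The check only works if the same input reliably produces byte-identical output, which is the property being asked of the compiler here.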
I cringe every time I read this "punch card" narrative. We are not at this stage at all. You are comparing deterministic stuff with LLMs, which are not deterministic and may or may not give you what you want. In fact, I personally barely use autonomous agents in my brownfield codebase because they generate so much unmaintainable slop.
Except that compiler is a non-deterministic pull of a slot-machine handle. No thanks, I'll keep my programming skills; COBOL programmers command a huge salary in 2026, soon all competent programmers will.
Releasing version 9.0 of my self-hosted analytics app[0]. I will finally add an in-app cron job editor, so you can easily schedule clean-up jobs, data retention settings, newsletters/summaries, etc.