You're super right -- this is probably the one crack in our narrative and one that I sorely need to address. Hope to be back with something positive on this front soon, we're setting up all the benchmark harnesses to do this more equitably.
I think we have one on the site right now -- it's roughly 4.1-mini pricing. We're not aiming to make money off individual users, which is why we're trialing a free tier (and trying to partner with open-source frameworks). Our bread and butter is companies doing this at scale and licensing.
Yes, they can -- I actually tried a semantic edit implementation in Aider. It got the "correct edit format" percentage to 100%, but didn't really budge the overall percent correct on SOTA models. I should push it sometime, since it really helps the reliability of these local models like Qwen3. If you reach out to me, I can try to share some of this code with you as well (it needs to be cleaned up).
But yes: 1. have some code, 2. create a patch (semantic, diff, or udiff formats all work), and 3. Apply will return the merged result to you very fast. When we last benchmarked, using Claude 3.7 Sonnet to create diff patches gave roughly a 10-15% merge error rate, versus 4% with us; and you can use Apply as a backup if your own merge fails.
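For anyone wondering what the "backup" pattern looks like in practice, here's a minimal sketch. Everything here is an assumption for illustration -- `call_apply_api` is a placeholder, not the real endpoint or payload shape: try a cheap deterministic merge first, and only fall back to the apply model when the patch doesn't land cleanly.

```python
def naive_search_replace(source: str, old: str, new: str) -> str:
    """Deterministic local merge: only succeeds on a single unambiguous match."""
    if source.count(old) != 1:
        raise ValueError("search block missing or ambiguous")
    return source.replace(old, new)


def call_apply_api(source: str, patch: dict) -> str:
    # Placeholder for the hosted apply model: in practice you'd POST the file
    # plus the (possibly lazy/semantic) patch and get back the merged file.
    raise NotImplementedError("wire this to your apply endpoint")


def merge_with_fallback(source: str, patch: dict) -> str:
    """Try the exact local merge first; fall back to the apply model on failure."""
    try:
        return naive_search_replace(source, patch["old"], patch["new"])
    except ValueError:
        return call_apply_api(source, patch)


code = "def add(a, b):\n    return a + b\n"
patch = {"old": "return a + b", "new": "return a + b  # sum"}
merged = merge_with_fallback(code, patch)
```

The nice property is that the fast path costs nothing when the model emits a clean patch, and the apply model only eats the ambiguous/failed cases.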