
But the code produced from the formal spec would still be nondeterministic. And I believe CodeSpeak doesn't wish to regenerate the entire program with each spec change, but to apply code changes based on the changes to the spec. Maybe there could be other benefits to formalisation in this case, but determinism isn't one of them.



Even with classic compilation, it is only the semantic behavior that is preserved.

The Church–Rosser property (confluence) in term rewriting and the lambda calculus is one possible lens.

To have a formally verified spec, one has to use some decidable fragment of first-order logic.

If you try to replace code generation with rewriting, things can get complicated fast.[2]

Rust uses affine types, as an example, and people try to add Petri nets[0], but in general Petri net reachability is Ackermann-complete.[1]
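To make the reachability point concrete, here is a toy sketch of my own (not taken from the cited papers): brute-force Petri net reachability by BFS over markings. It only works when the reachable set happens to be tiny; the Ackermann-completeness result says no algorithm scales in general.

```python
from collections import deque

# Markings are tuples of token counts per place; transitions are
# (consume, produce) vectors. BFS enumerates reachable markings.
def reachable(initial, transitions, target, limit=10_000):
    seen, queue = {initial}, deque([initial])
    while queue and len(seen) < limit:
        m = queue.popleft()
        if m == target:
            return True
        for consume, produce in transitions:
            # A transition fires only if every place holds enough tokens.
            if all(t >= c for t, c in zip(m, consume)):
                nxt = tuple(t - c + p for t, c, p in zip(m, consume, produce))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return target in seen

# Two places, one transition moving a token from place 0 to place 1.
print(reachable((1, 0), [((1, 0), (0, 1))], (0, 1)))  # True
print(reachable((1, 0), [((1, 0), (0, 1))], (2, 0)))  # False
```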

It is just the trade-off of using a context-free-like system, such as an LLM, with natural language.

HoTT, and how dependent types tend to break "isomorphic ≃ equal", is another possible lens.

[0] https://arxiv.org/abs/2212.02754v3

[1] https://arxiv.org/abs/2212.02754v3

[2] https://arxiv.org/abs/2407.20822


First, it's not a question of decidability but of tractability. Verifying programs in a language that has nothing but boolean variables, no subroutines, and loops nested at most two deep - far, far from Turing-completeness - is already intractable (reduction from TQBF).
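For readers who want to see where the blow-up comes from, here is a toy TQBF evaluator (my own illustrative sketch, not a claim about any particular verifier): each quantifier in the prefix forks the search, so k alternating quantifiers cost up to 2^k evaluations of the matrix.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Naive evaluator for quantified boolean formulas (TQBF).
# Each quantifier doubles the search tree, which is the intuition
# behind the hardness of verifying even all-boolean programs.
def eval_qbf(prefix: List[Tuple[str, str]],
             matrix: Callable[[Dict[str, bool]], bool],
             env: Optional[Dict[str, bool]] = None) -> bool:
    env = dict(env or {})
    if not prefix:
        return matrix(env)
    quantifier, var = prefix[0]
    branches = [eval_qbf(prefix[1:], matrix, {**env, var: b})
                for b in (False, True)]
    return all(branches) if quantifier == "forall" else any(branches)

# forall x exists y: x XOR y  -- true: pick y = not x
print(eval_qbf([("forall", "x"), ("exists", "y")],
               lambda e: e["x"] != e["y"]))  # True
```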

Second, it's very easy to have some specs decided tractably, at least in many practical instances, but they are far too weak to specify most correctness properties programs need. You mentioned the Rust type system, and it cannot specify properties with interleaved quantifiers, which most interesting properties require.

And as for HoTT - or any of the many equivalent rich formalisms - checking their proofs is tractable, but not finding them. The intractability of verification of even very limited languages (again TQBF) holds regardless of how the verification is done.

I think it's best to take it step by step, and CodeSpeak's approach is pragmatic.


I think there is a bit of the map territory relation here.

> First, it's not a question of decidability but of tractability

The question of decidability is a matter of many-to-one reduction; in fact, RE-completeness is defined via many-to-one reductions.

In a computational complexity sense, tractability is a far stronger notion. Basically, an algorithm is efficient if its time complexity is at most polynomial for any size-n input, and a problem is "tractable" if there is an efficient algorithm that solves it.

You are correct if you limit your expressiveness to PTIME, where, because P == co-P, PEM/tight apartness/omniscience principles hold.

But the problem is that the Church–Rosser property[0] (proofs ~= programs) and the Brouwer–Heyting–Kolmogorov interpretation[1] (propositions as types) are NOT binary SAT, and you have concepts like mere propositions[2] that are very different from just BSAT.
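A minimal Lean 4 sketch of the propositions-as-types reading and of mere propositions, assuming nothing beyond the core language:

```lean
-- BHK / propositions-as-types: a proof of A → B is a program
-- taking proofs of A to proofs of B.
def modusPonens (A B : Prop) (f : A → B) (a : A) : B := f a

-- "Mere propositions": in Lean's Prop, any two proofs of the same
-- proposition are definitionally equal (proof irrelevance) --
-- very different from a SAT-style yes/no answer about one assignment.
example (A : Prop) (p q : A) : p = q := rfl
```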

But CodeSpeak doesn't have formal specifications, so this is irrelevant. Their example output produced code with path traversal and resource exhaustion risks, as well as correctness issues, and is illustrative.
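As a concrete illustration of the kind of hole being described (my own hypothetical sketch, not CodeSpeak's generated code), a path-traversal guard plus a read cap in Python might look like:

```python
from pathlib import Path

def safe_read(base: Path, user_path: str, limit: int = 1 << 20) -> bytes:
    # Resolve the requested file and confirm it stays under `base`
    # (path traversal guard against inputs like "../../etc/passwd").
    root = base.resolve()
    target = (root / user_path).resolve()
    if not target.is_relative_to(root):
        raise ValueError("path escapes base directory")
    # Cap the read to bound memory use (resource exhaustion guard).
    with open(target, "rb") as f:
        return f.read(limit)
```

(`Path.is_relative_to` needs Python 3.9+; older code can compare `os.path.commonpath` results instead.)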

My personal opinion is that we will need to work within the limitations of these systems, and while it is trivial to come up with your own canary, I would recommend playing with [3] before the models directly target it.

Generating new code from a changed spec will be less difficult, specifically when the mess of real-world specs comes into play. You can play with the example on CodeSpeak's front page, trying to close the various holes the software has with malformed/malicious input while giving the LLM the existing code base, and you will see that "brown m&m"[3] problem arise quickly. At least for me, prompting it to look at the changed natural-language spec and generate new code was more successful.

But for some models, like qwen3-coder-next, that style resulted in far fewer happy-path protections, which that model seems to have been trained to deliver by default in some cases.

[0] https://calhoun.nps.edu/entities/publication/015f1bab-6642-4...

[1] https://www.cs.cornell.edu/courses/cs6110/2017sp/lectures/le...

[2] https://www.cambridge.org/core/journals/journal-of-functiona...

[3] https://codemanship.wordpress.com/2025/10/03/llms-context-wi...


> But the problem is that Church–Rosser Property[0] (proofs ~= programs) and Brouwer–Heyting–Kolmogorov Interpretation[1] (Propositions at types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different than just BSAT.

The computational limits imposed on program verification are independent of the logical theory used, and depend only on its expressive strength. Many if not most interesting program properties require interleaved quantifiers (forall-exists, or forall-exists-forall, etc.), which are intractable to verify.
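To make "interleaved quantifiers" concrete with a toy of my own (finite domains only, which is exactly what real programs don't give you):

```python
# Check a forall-exists property by brute force over a finite domain.
# Even here the cost is |domain|^2; over unbounded program states the
# same shape of property becomes intractable or undecidable.
def forall_exists(domain, pred):
    return all(any(pred(x, y) for y in domain) for x in domain)

mod = range(6)
# forall x exists y: x + y == 0 (mod 6) -- every element has an additive inverse
print(forall_exists(mod, lambda x, y: (x + y) % 6 == 0))  # True
# forall x exists y: x * y == 1 (mod 6) -- fails, e.g. for x = 0
print(forall_exists(mod, lambda x, y: (x * y) % 6 == 1))  # False
```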

> But CodeSpeak doesn't have formal specifications, so this is irrelevant.

The lack of formalism also doesn't matter to the limitations on correctness. If you wish to know, with certainty, that a program satisfies some property, that knowledge has a cost. But both humans and LLMs write programs at least in part based on inductive rather than deductive reasoning, which cannot be rigorously given a level of confidence. That may not be what we want, but that's what computational complexity says we can have, so we're forced to work in this way.

We should complain about what we don't have, but demanding things we can't have isn't going to help. There's no fundamental reason why AI shouldn't, someday, be able to program as well as humans, but there are fundamental limitations to producing the software we wish we had. Humans can rigorously use deductive methods, with mechanical help, and AI could possibly use it, too. But there's no reason to believe AI could break the size barrier of such problems. People have been able to rigorously verify only very small programs, and maybe AI could do a little better if only because of its tenacity, but if we expect to produce perfect programs by any means, we'll be waiting for a long time.


It doesn't matter if the code is different if the spec is formal enough to validate the software against it.

I have no idea about codespeak - I was responding to the comments above, not about codespeak.


Validating programs against a formal spec is very, very hard for foundational computational complexity reasons. There's a reason why the largest programs whose code was fully verified against a formal spec, and at an enormous cost, were ~10KLOC. If you want to do it using proofs, then lines of proof outnumber lines of code 10-1000 to 1, and the work is far harder than for proofs in mathematics (that are typically much shorter). There are less absolute ways of checking spec conformance at some useful level of confidence, and they can be worthwhile, but they require expertise and care (I'm very much in favour of using them, but the thought that AI can "just" prove conformance to a formal spec ignores the computational complexity results in that field).

For most cases we don't need nearly that comprehensive verification. This is expecting more of AI-written code than we ever bother to subject most human-written code to. There's a vast chasm there we only need to even slightly start to bridge to get to far higher confidence levels than the typical human dev team achieves.

> For most cases we don't need nearly that comprehensive verification. This is expecting more of AI-written code than we ever bother to subject most human-written code to.

True.

> There's a vast chasm there we only need to even slightly start to bridge to get to far higher confidence levels than the typical human dev team achieves.

The word "slightly" is doing a lot of work here to the point of making it impossible to estimate. For example, the complexity classes P and NP are only slightly apart, and yet that's where a very practical barrier between feasibility and infeasibility lies. I don't doubt that one day AI may be able to write programs as well as humans, although nobody can estimate how soon that day will come, but nobody knows how wide the gap between that and "far higher confidence" is. Maybe there are fundamental computational complexity barriers in that gap that no amount of intelligence can cross, and maybe there aren't. Nobody knows yet.

What we do know is that anything humans do is possible - after all, we're doing it - and many things we need and humans can't do (including predicting nonlinear systems like the behaviour of the economy) no machine can do drastically better, because of complexity limitations.




