
Anyone working on decompiler LLMs? Seems like we could render all code open source.

Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.

You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.
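
A sketch of what that training-data pipeline could look like, in Python: compile permissively licensed C files at several optimization levels and keep (disassembly, source) pairs. The corpus layout, compiler flags, and output format here are assumptions for illustration, not a recipe from any existing project.

    # Generate (assembly, source) pairs by compiling C files and disassembling them.
    import json
    import subprocess
    import tempfile
    from pathlib import Path

    COMPILERS = ["gcc", "clang"]
    OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

    def make_pair(c_file: Path, compiler: str, opt: str):
        """Compile one C file and return a single (assembly, source) training pair."""
        with tempfile.TemporaryDirectory() as tmp:
            obj = Path(tmp) / "out.o"
            build = subprocess.run([compiler, opt, "-c", str(c_file), "-o", str(obj)],
                                   capture_output=True)
            if build.returncode != 0:  # skip files that don't compile standalone
                return None
            asm = subprocess.run(["objdump", "-d", "--no-show-raw-insn", str(obj)],
                                 capture_output=True, text=True).stdout
        return {"input": asm, "target": c_file.read_text(),
                "compiler": compiler, "opt": opt}

    with open("pairs.jsonl", "w") as out:
        for c_file in Path("corpus").rglob("*.c"):  # assumed corpus directory
            for compiler in COMPILERS:
                for opt in OPT_LEVELS:
                    pair = make_pair(c_file, compiler, opt)
                    if pair:
                        out.write(json.dumps(pair) + "\n")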



The potential implications of this are huge. Not just open sourcing, but imagine easily decompiling and modifying proprietary apps to fix bugs or add features. This could be a huge unlock, especially for long dead programs.

For legal reasons I bet this will become blocked behavior in major models.


I've never seen a law that forbids decompiling programs. But some programs forbid decompilation in their license agreements. And even then, you still don't have any rights to the resulting source code. It depends on the license...


A mere decompilation or general reverse engineering should be fine in many if not most jurisdictions [1]. But it is a whole different matter to make use of any results from doing so.

[1] https://www.law.cornell.edu/wex/reverse_engineering



Using an LLM (or any technique) to decompile proprietary code is not clean room design. Declaring the results "open source" is deception and theft, which undermines the free open source software movement.


Only if you use the decompiled code. But if one team uses the decompiled code to write up a spec, and another team writes an implementation based on that spec, that could be considered clean room design. In this case, the decompiler would merely be a tool for reverse engineering.


It is true that at least some jurisdictions do also explicitly allow for reverse engineering to achieve interoperability, but I don't know if such provision is widespread.


> Seems like we could render all code open source

Unfortunately not really. Having the source is a first step, but you also need the rights to use it (read, modify, execute, redistribute the modifications), and only the authors of the code can grant these rights.


Doesn't it count as 'clean room' reverse engineering? Alternatively, we could develop an LLM that's trained on the outputs and side effects of any given function and learns to reproduce the source code from that.

Or, going back to the original idea: while the source code produced in such a way might be illegal, it's very likely 'clean' enough to train an LLM on, which could then help in reproducing such an application.


IANAL, but if the only source for your LLM is that code, I would assume the code it produces would be at high risk of being infringing.

I would guess clean room would still require having someone reading the LLM-decompiled code, write a spec, and have someone else write the code.

But this is definitely a good question, especially given the recent court verdicts. If you can launder open-source-licensed code, why not proprietary binaries? Although I don't think the situation is the same, I wouldn't expect how you decompile the code to matter.


> Seems like we could render all code open source.

I agree. I think "AI generating/understanding source code" is a huge red herring. If AI was any good at understanding code, it would just build (or fix) the binary.

And I believe that's how it will turn out: when we really have AI programmers, they will not bother with human-readable code but will code everything in machine code (and if they are tasked with maintaining an existing system, they will understand it in its entirety, across the SW and HW stack). It's kind of like how diffusion models that generate images don't actually bother with learning drawing techniques.


Why wouldn't AIs benefit from using abstractions? At the very least it saves tokens. Fewer tokens means less time spent solving a problem, which means more problem solving throughput. That is true for machines and people alike.

If anything I expect AI-written programs in the not so distant future to be incomprehensible because they're too short. Something like reading an APL program.


I agree, they might create abstractions, but I doubt they're going to reuse the same abstractions as human programming languages.


> Anyone working on decompiler LLMs?

Here is an LLM for x86 to C decompilation: https://github.com/albertan017/LLM4Decompile
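
For anyone curious, driving a checkpoint like that through Hugging Face transformers looks roughly like the sketch below. The model id and prompt strings are assumptions based on my reading of that repo's README; check the README for the exact checkpoint names and prompt format.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "LLM4Binary/llm4decompile-1.3b-v1.5"  # assumed checkpoint name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Disassembly of a single function (e.g. from objdump), not a whole binary.
    asm = open("func.s").read()
    prompt = f"# This is the assembly code:\n{asm}\n# What is the source code?\n"

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))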


Unminifying isn't decompiling.

It's just renaming variables and functions and inserting line breaks.


Minifying includes way more tricks than shortening variable names and removing whitespace.


No, but it's a baby brother of the same problem. Compiling is a much more complex transform, but ultimately it is still just a code transform.


It is true that compilation and minification are both code transformations (it's a correct reduction [1]), but that doesn't seem like a very useful observation in this discussion. In the end, everything you do to something is an operation, but that's not a very workable framing.

In practice, compilation is often (not always, agreed!) from a language A to a lower level language B such that the runtime for language A can't run language B or vice-versa, if language A has a runtime at all. Minification is always from language A to the same language A.

The implication is that, in practice, deminification is not the same exercise as decompilation. You might even want to run a deminification phase after a decompilation phase, using two separate tools, because one tool will be good at translating back and the other will be good at pretty printing.

[1] https://en.wikipedia.org/wiki/Reductionism
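
To make that distinction concrete in a single language (Python here, purely as an illustration): minification stays inside the source language, while compilation crosses into a lower-level one.

    import dis

    def mean(values):
        total = 0
        for v in values:
            total += v
        return total / len(values)

    # "Minified": still valid Python, just unreadable. Deminification is renaming
    # and reformatting within the same language.
    minified = "def m(v):\n t = 0\n for x in v: t += x\n return t / len(v)"
    exec(minified)  # runs on the same interpreter as the original

    # "Compiled": CPython bytecode, a different and lower-level language.
    # Decompilation has to translate back across that boundary.
    dis.dis(mean)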


There was a paper about this at CGO earlier this year [1]. Correctness is a problem that is hard to solve, though; 50% accuracy might not be enough for serious use cases, especially given that the relation to the original input for manual intervention is hard to preserve.

[1]: https://arxiv.org/abs/2305.12520
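
For context, the correctness check this implies is roughly: recompile the model's decompiled C and compare its behaviour against the original binary. A minimal sketch, where the file names and the single-integer-argument test harness are assumptions for illustration:

    import random
    import subprocess

    def behaves_the_same(original_bin: str, decompiled_c: str, trials: int = 100) -> bool:
        """Recompile the decompiled C and diff its behaviour against the original binary."""
        subprocess.run(["gcc", decompiled_c, "-o", "recompiled"], check=True)
        for _ in range(trials):
            arg = str(random.randint(-1000, 1000))  # assumes the program takes one int argument
            a = subprocess.run([original_bin, arg], capture_output=True, text=True)
            b = subprocess.run(["./recompiled", arg], capture_output=True, text=True)
            if (a.stdout, a.returncode) != (b.stdout, b.returncode):
                return False
        return True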


>Seems like we could render all code open source.

That's not how copyright and licensing works.

You could already break the law and open yourself up to lawsuits and prosecution by stealing intellectual property and violating its owner's rights before there were LLMs. They just make it more convenient, not less illegal.


I think there's actually some potential here, considering LLMs are already very good at translating text between human languages. I don't think LLMs on their own would be very good at this, but a specially trained AI model might be, such as those trained for protein folding. I think what an LLM could do best is generate better decompiled code: giving better names to symbols and producing code in a style a human is more likely to write.

I usually crap on things like ChatGPT for being unreliable and hallucinating a lot. But in this particular case, decompilers already usually generate inaccurate code, and it takes a lot of work to fix the decompiled code to make it correct (I speak from experience). So introducing AI here may not be such a huge stretch. Just don't expect an AI/LLM to generate perfectly correct decompiled code and we're good (wishful thinking).
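
A sketch of that cleanup role, assuming an OpenAI-style chat API as the backend (the client, model name, and prompt are placeholders I picked, not something any decompiler project prescribes):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def polish_pseudo_c(pseudo_c: str) -> str:
        """Ask an LLM to rename symbols and tidy style without changing semantics."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[
                {"role": "system",
                 "content": "Rewrite this decompiler output with meaningful variable "
                            "and function names, comments, and idiomatic formatting. "
                            "Do not change the program's behaviour."},
                {"role": "user", "content": pseudo_c},
            ],
        )
        return response.choices[0].message.content

The output still needs the manual verification mentioned above: recompile it and diff its behaviour against the original binary.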


It can’t really compensate for missing variable and function names, not to mention comments.



