
JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy. Doing it by hand would of course be tedious in general, but the transformations themselves are fairly limited, so it is possible to read minified code with nothing more than some notes to track mangled identifiers.

A more general unminification or unobfuscation still seems to be an open problem. I have written a handful of intentionally obfuscated programs in the past, and in my experience ChatGPT couldn't understand them even at the surface level. For example, the gist for my 160-byte-long Brainfuck interpreter in C has a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...

[1] https://gist.github.com/lifthrasiir/596667#gistcomment-47512...



> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.

Just because a task is simple doesn't mean its inverse need be. Examples:

  - multiplication / prime factorization
  - deriving / integrating
  - remembering the past / predicting the future
Code unobfuscation is clearly one of those difficult inverse problems, as it can easily be exacerbated by any of the following:

  - bugs
  - unused or irrelevant routines
  - incorrect implementations that incidentally give the right results
In that sense, it would be fortunate if ChatGPT could give decent results at unobfuscating code, as there is no a priori expectation that it should be able to do so. It's good that you've also checked ChatGPT's code unobfuscation capabilities on a more difficult problem, but I think you've only discovered an upper limit. I wouldn't consider the example in the OP to be trivial.


Of course, it is not generalizable! In my experience though, most minifiers do only the following:

- Whitespace removal, which is trivially invertible.

- Comment removal, which we never expect to recover via unminification.

- Renaming to shorter names, which is tedious to track but still mechanical. And most minifiers have little understanding of underlying types anyway, so they are usually very conservative and rarely reuse the same mangled identifier for multiple uses. (Google Closure Compiler is a significant counterexample here, but it is also known to be much slower.)

- Constant folding and inlining, which is annoying but can still be tracked. Again, most minifiers lack the reasoning needed for extensive constant folding and inlining.

- Language-specific transformations, like turning `a; b; c;` into `a, b, c;` and `if (a) b;` into `a && b;` whenever possible. They will be hard to understand if you don't know them in advance, but there aren't too many of them anyway.
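For illustration, here is a tiny before/after pair for those last two rewrites (the function names and bodies are made up for the demo); both forms behave identically:

```javascript
// Readable form: separate statements and an explicit `if`.
function verbose(a, b) {
  let log = [];
  if (a) log.push("a was truthy");
  log.push(b);
  return log;
}

// After typical minifier rewrites: the statement sequence becomes a
// comma expression, and `if (a) b;` becomes `a && b;`.
function minified(a, b) {
  let l = [];
  return a && l.push("a was truthy"), l.push(b), l;
}
```

Once you know these rewrites exist, reading `a && l.push(...)` back as `if (a) l.push(...)` becomes almost automatic.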

As a result, minified code still remains comparably human-readable with some note taking and perseverance. And since these transformations are mostly local, I would expect LLMs can pick them up on their own as well.

(But why? Because I do inspect such programs fairly regularly, for example for comments like https://news.ycombinator.com/item?id=39066262)


I feel you’re downplaying the obfuscatory power of name-mangling. Reversing that (giving everything meaningful names) is surely a difficult problem?


JSNice[1] is an academic project that did a pretty good job of this in the 2010s and they give some pointers on how it is accomplished[2].

[1]: http://jsnice.org/

[2]: https://www.sri.inf.ethz.ch/jsnice


I would say the actual difficulty varies greatly. It is generally easy if you have a good guess about what the code would actually do. It would be much harder if you have nothing to guess from, but usually you should have something to start with. Much like debugging, you need a detective mindset to be good at reverse engineering, and name mangling is a relatively easy obstacle to handle at this scale.

Let me give a concrete example from my old comment [1]. The full code in question was as follows, with only whitespace added:

    function smb(){
      var a,b,c,d,e,h,l;
      return t(function(m){
        a=new aj;
        b=document.createElement("ytd-player");
        try{
          document.body.prepend(b)
        }catch(p){
          return m.return(4)
        }
        c=function(){
          b.parentElement&&b.parentElement.removeChild(b)
        };
        0<b.getElementsByTagName("div").length?
          d=b.getElementsByTagName("div")[0]:
          (d=document.createElement("div"),b.appendChild(d));
        e=document.createElement("div");
        d.appendChild(e);
        h=document.createElement("video");
        l=new Blob([new Uint8Array([/* snip */])],{type:"video/webm"});
        h.src=lc(Mia(l));
        h.ontimeupdate=function(){
          c();
          a.resolve(0)
        };
        e.appendChild(h);
        h.classList.add("html5-main-video");
        setTimeout(function(){
          e.classList.add("ad-interrupting")
        },200);
        setTimeout(function(){
          c();
          a.resolve(1)
        },5E3);
        return m.return(a.promise)
      })
    }
Many local variables should be easy to reconstruct: b -> player, c -> removePlayer, d -> playerDiv1, e -> playerDiv2, h -> playerVideo, l -> blob (we don't know which blob it is yet though). We still don't know about non-local names including t, aj, lc, Mia and m, but we are reasonably sure that it builds some DOM tree that looks like `<ytd-player><div></div><div class="ad-interrupting"><video class="html5-main-video"></div></ytd-player>`. We can also infer that `removePlayer` would be some sort of a cleanup function, as it gets eventually called in any possible control flow visible here.
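The rename step itself is mechanical. Here is a naive sketch of it, using the mapping guessed above (regex-based text replacement is a deliberate simplification; it would break on strings, comments, and shadowed names, so a real tool would rename via the AST):

```javascript
// Apply a guessed identifier mapping to minified source text.
// NOTE: word-boundary regexes are a rough heuristic for this demo only;
// a proper rename pass must respect scopes and string literals.
function applyRenames(source, mapping) {
  let out = source;
  for (const [mangled, readable] of Object.entries(mapping)) {
    out = out.replace(new RegExp("\\b" + mangled + "\\b", "g"), readable);
  }
  return out;
}

// The mapping reconstructed by hand in the analysis above.
const mapping = {
  b: "player",
  c: "removePlayer",
  d: "playerDiv1",
  e: "playerDiv2",
  h: "playerVideo",
  l: "blob",
};

const snippet = "c=function(){b.parentElement&&b.parentElement.removeChild(b)};";
console.log(applyRenames(snippet, mapping));
// removePlayer=function(){player.parentElement&&player.parentElement.removeChild(player)};
```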

Given that `a.resolve` is the final function to be executed, even later than `removePlayer`, it will be some sort of "returning" function. You will need some information about how async functions are desugared to fully understand that (and also `m.return`), but such information is not strictly necessary here. In fact, you can safely ignore `lc` and `Mia` because it eventually sets `playerVideo.src` and we are not that interested in the exact contents here. (Actually, you will fall into a rabbit hole if you are going to dissect `Mia`. Better to assume first and verify later.)

And from there you can conclude that this function constructs a certain DOM tree, sets some class after 200 ms, and then "returns" 0 if the video "ticks" or 1 on timeout, giving my initial hypothesis. I then hardened my hypothesis by looking at the blob itself, which turned out to be a 3-second-long placeholder video and fits with the supposed timeout of 5 seconds. If it were something else, then I would look further to see what I might have missed.

[1] https://news.ycombinator.com/item?id=38346602


I believe the person you're responding to is saying that it's hard to do automated / programmatically. Yes a human can decode this trivial example without too much effort, but doing it via API in a fraction of the time and effort with a customizable amount of commentary/explanation is preferable in my opinion.


Indeed, that aspect was something I failed to get initially, but I still stand by my opinion because most of my reconstruction was local. Local "reasoning" can often be done without actual reasoning, so while it's great that we can automate the local reasoning, it falls short of the full reasoning necessary for general unobfuscation.


This is, IMO, the better way to approach this problem. Minification applies rules to transform code; if we know the rules, we can reverse the process (but can't directly recover any lost information).

A nice, constrained way to use an LLM here to enhance this solution is to ask it some variation of "what should this function be named?" and feed the output to a rename refactoring function.

You could do the same for variables, or be more holistic and ask it to rename variables and add comments (but risk the LLM changing what the code does).
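A sketch of that constrained loop, with the model call stubbed out (the name `suggestFunctionName` and its stub logic are hypothetical; in practice it would call whatever LLM API you use, and the rename itself should go through a real refactoring tool rather than this toy string replacement):

```javascript
// Hypothetical model call: in practice this would send the function body
// to an LLM with a prompt like "what should this function be named?".
// Stubbed here so the demo is self-contained.
function suggestFunctionName(functionBody) {
  if (functionBody.includes("removeChild")) return "removeFromParent";
  return "unknownFn";
}

// Feed the suggestion into a (toy) rename step. Constraining the LLM to
// naming only means it cannot change what the code actually does.
function renameFunction(source, oldName) {
  const newName = suggestFunctionName(source);
  return source.replace(new RegExp("\\b" + oldName + "\\b", "g"), newName);
}

const code = "function c(b){b.parentElement&&b.parentElement.removeChild(b)}";
console.log(renameFunction(code, "c"));
// function removeFromParent(b){b.parentElement&&b.parentElement.removeChild(b)}
```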


How do we end up with you pasting large blocks of code and detailed step-by-step explanations of what it does, in response to someone noting that just because process A is simple, it doesn't mean inverting A is simple?

This thread is incredibly distracting, at least 4 screenfuls to get through.

I'm really tired of the motte/bailey comments on HN on AI, where the motte is "meh the AI is useless, amateurish answer that's easy to beat" and the bailey is "but it didn't name a couple of global variables '''correctly'''." It verges on trolling at this point, and is at best self-absorbed and making the rest of us deal with it.


Because the original reply missed three explicit adverbs hinting that this is not a general rule (EDIT: and also mistook my comment as dismissive). I believe it was not in bad faith, so I went on to give more context to justify my reasoning. If you are not interested in that, please just hide the thread; otherwise I can do nothing to improve the status quo, and I personally enjoyed the entire conversation.


> As a result, minified code still remains comparably human-readable with some note taking and perseverance.

At least some of the time, simply reformatting it to be unfolded across multiple lines is enough to make it readable/debuggable. FIXING that bug is likely more complex, because you have to find where it is in the original code, which, to my eyes, isn't always easy to spot.


> - Comment removal, which we never expect to recover via unminification.

ChatGPT is quite good at adding meaningful comments back to uncommented code, actually.

Paste some code and add "comment the shit out of this" as a prompt.


As a point of order, Code Minification != Code Obfuscation.

Minification does tend to obfuscate as a side effect, but it is not the goal, so reversing minification becomes much easier. Obfuscation, on the other hand, can minify code, but crucially that isn't where it starts from. As the goals of minification and obfuscation differ, reversing them takes different effort, and I'd much rather attempt to reverse minification than obfuscation.

I'd also readily believe there are hundreds or thousands of examples online of reversed code minification (or "here is code X, here is code X _after_ minification") that LLMs have ingested in their training data.


Yeah, having run some state of the art obfuscated code through ChatGPT, it still fails miserably. Even what was state of the art 20 years ago it can't make heads or tails of.


> JS minification is fairly mechanical and comparably simple, so the inversion should be relatively easy.

This is stated as if it's a truism, but I can't understand how you can actually believe this. Converting `let userSignedInTimestamp = new Date()` to `let x = new Date()` is trivial, but going the other way probably requires reading and understanding the rest of the surrounding code to see in what contexts `x` is being used. Also, the rest of the code is also minified, making this even more challenging. Even if you do all that right, it's still at best a lossy conversion, since the name of the variable could capture characteristics that aren't explicitly outlined in the code at all.


You are technically correct, but I think you should try some reverse engineering to see that it is usually possible to reconstruct much of this in spite of the transformations made. I do understand that this fact might be hard to believe without any prior.

EDIT: I think I get why some comments complain that I downplayed the power of LLMs here. I never meant to; I wanted to say that unminification is a relatively easy task compared to other reverse engineering tasks. It is great that we can automate the easy task, but we still have to wait for a better model to do much more.


I have tried reconstructing minified code (I thought that would be obvious from my example). It feels like it takes just a bit less thought than it did to write the code in the first place, which is definitely not something I would classify as "comparably simple".


Because of how trivial that step is, it's likely pretty easy to just take lots of code and minify it. Then you have the training data you need to learn to generate full code from minified code. If your goal is to generate additional useful training data for your LLM, it could make sense to actually do that.
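A sketch of how such training pairs could be generated. The `toyMinify` here only strips comments and collapses whitespace (a real pipeline would run an actual minifier such as terser or esbuild), but it is enough to show the shape of the data:

```javascript
// Toy minifier: drop line comments and collapse whitespace. Real
// minifiers do far more, but the pipeline idea is the same: original
// code in, minified code out, yielding (minified, original) pairs.
function toyMinify(source) {
  return source
    .replace(/\/\/[^\n]*/g, "") // strip line comments
    .replace(/\s+/g, " ")       // collapse all whitespace runs
    .trim();
}

// One supervised training example: learn to map `input` back to `target`.
function makeTrainingPair(original) {
  return { input: toyMinify(original), target: original };
}

const pair = makeTrainingPair(
  "function add(a, b) {\n  // sum two numbers\n  return a + b;\n}"
);
console.log(pair.input);
// function add(a, b) { return a + b; }
```

The comments and original formatting survive only on the `target` side, which is exactly the information the model is being trained to reconstruct.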


I suspect, but definitely do not know, that all the coding aspects of llms work something like this. It’s such a fundamentally different problem from a paragraph, which should never be the same as any other paragraph. Seems to me that coding is a bit more like the game of go, where an absolute score can be used to guide learning. Seed the system with lots and lots of leetcode examples from reality, and then train it to write tests, and now you have a closed loop that can train itself.


If you're able to generate minified code from all the code you can find on the internet, you end up with a very large training set. Of course in some scenarios you won't know what the original variable names were, but you would expect to be able to get something very usable out of it. These things, where you can deterministically generate new and useful training data, you would expect to be used.


And I can’t understand why any reasonably intelligent human feels the need to be this abrasive. You could educate, but instead you had to be condescending.


Converting a picture from color to black and white is a fairly simple task. Getting back the original in color is not easy. This is of course due to data lost in the process.

Minification works in the same way. A lot of information needed for understanding the code is lost. Getting back that information can be a very demanding task.


But it is not much different from reading through badly documented code without any comments or meaningful names. In fact, much code that gets minified is not that bad, so it is often possible to infer the original code just from its structure. It is still not a trivial task, but I think my comment never implied that.


The act of reducing the length of variable names by replacing something descriptive (like "timeFactor") with something much shorter ("i") may be mechanical and simple, but it is destructive, and reversing it is not relatively easy; in fact, it's impossible to do without a fairly sophisticated understanding of what the code does. That's what the LLM did here, which isn't exactly surprising, but it is cool; being so immediately dismissive isn't cool.


I never meant to be dismissive; in fact, my current job is to build a runtime for an ML accelerator! I rather wanted to show that unminification is much easier than unobfuscation, and that the SOTA model has yet to do the latter.

Also, it should be noted that name reconstruction is not a new problem and was already partly solved multiple times before the LLM era. LLMs are great in that they can do this without massive retraining, but the reconstruction depends heavily on the local context (which is how earlier solutions approached the problem), so it doesn't really show their reasoning capability.


Random try (the first one) with Claude 3.5 Sonnet: https://claude.site/artifacts/246c1b1a-3088-447a-a526-b1e716...

I'm not on a PC, so it's untested.


That's much better in that most of the original code remains present and the comments are not that far off, but its understanding of global variables is utterly wrong (to be expected though, as many of them serve multiple purposes).


Yep, I've tried to use LLMs to disassemble and decompile binaries (giving them the hex bytes as plaintext), they do OK on trivial/artificial cases but quickly fail after that.



