This is conjecture. OTOH, I measured things while designing the LuaJIT IR:
1. An array index is just as suitable as a pointer for dereferencing.
2. What matters is how many dereferences are needed and their locality.
3. Data structure density is important to get high cache utilization.
References show a lot of locality: 40% of all IR operands reference the previous node, and 70% reference one of the previous 10 nodes. A linear IR is the best cache-optimized data structure for this.
That said, dereferencing of an operand happens less often than one might think. Most of the time, one really needs the operand index itself, e.g. for hashes or comparisons. Again, indexes have many advantages over pointers here.
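To illustrate the hashing point with a generic sketch (LuaJIT itself threads a per-opcode 'prev' chain through the IR for this, but the principle is the same): because operands are small integer indexes, the commoning key is a cheap integer combination, with no pointer chasing at all. The packing below is illustrative, not LuaJIT's actual scheme.

    -- Generic CSE-by-index sketch in Lua.
    local ir, cse = {}, {}
    local function emit(op, op1, op2)
      -- Pack opcode + two 16 bit operand indexes into one number key.
      local key = (op * 2^16 + op1) * 2^16 + op2
      local ref = cse[key]
      if ref then return ref end    -- identical node exists: reuse its index
      ir[#ir+1] = { op = op, op1 = op1, op2 = op2 }
      cse[key] = #ir
      return #ir                    -- the node's index doubles as its reference
    end

    local a = emit(1, 0, 0)
    assert(emit(1, 0, 0) == a)      -- the duplicate is commoned to the same index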
What paid off the most was to use a fixed size IR instruction format (only 64 bit!) with 2 operands and 16 bit indexes. The restriction to 2 operands is actually beneficial, since it helps with commoning and makes you think about IR design. The 16 bit index range is not a limitation in practice (split the IR into chunks if you need to). The high orthogonality of the IR avoids many iterations and unpredictable branches in the compiler itself.
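For illustration, here is that layout sketched with LuaJIT's own FFI; the field names follow lj_ir.h, but take the details as an approximation rather than the authoritative definition:

    local ffi = require("ffi")
    ffi.cdef[[
    typedef uint16_t IRRef1;   /* 16 bit IR reference: an index, not a pointer */
    typedef struct IRIns {
      IRRef1   op1;            /* first operand */
      IRRef1   op2;            /* second operand */
      uint16_t ot;             /* opcode and type, packed */
      IRRef1   prev;           /* chain to previous ins with the same opcode */
    } IRIns;
    ]]
    assert(ffi.sizeof("IRIns") == 8)  -- the whole instruction fits in 64 bits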
The 16 bit indexes also enable the use of tagged references in the compiler code (not in the IR). The tag caches node properties: type, flags, constness. This avoids even more dereferences. LuaJIT uses this in the frontend pipeline for fast type checks and on-the-fly folding.
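A minimal sketch of such a tagged reference (the exact shifts and masks here are made up for illustration; see TRef in lj_ir.h for the real encoding):

    local bit = require("bit")

    -- Pack a 16 bit IR index with cached node properties.
    -- Illustrative layout: bits 0-15 index, bits 24-31 cached type.
    local function tref(index, itype)
      return bit.bor(index, bit.lshift(itype, 24))
    end
    local function tref_ref(tr)   -- recover the plain index
      return bit.band(tr, 0xffff)
    end
    local function tref_type(tr)  -- type check without loading the IR node
      return bit.band(bit.rshift(tr, 24), 0xff)
    end

A fold engine can dispatch on tref_type() without ever touching memory for the node, which is where the saved dereferences come from.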
Coming up with a silly marketing name, writing a naive implementation, and then claiming it's their invention is impertinent. Especially since they mention LuaJIT itself in the text ...
I promise that this was a genuine case of parallel evolution – I didn't read any LuaJIT source or documentation while writing the code or article.
Afterwards, I searched for "reverse linear scan register allocation", discovered the blog post which referenced LuaJIT, and linked it in the text.
I've updated the article to move the LuaJIT reference to the top, and adjusted the phrasing to be clear that this is not an invention, it's simply an interesting implementation. What you call "naive", others may consider didactic :)
Let me know if you have other feedback or suggested changes, either here or via email (in my profile).
You may want to clarify that in the GitHub repo, too. See my issue there.
If you want to go the didactic route, then consider documenting the improvements over the naive implementation: register hinting, register priorities (PHI), two-headed register picking, fixed register picking, optimized register picking for 2-operand instructions (x86/x64), register pair picking, ABI calling-conventions, weak allocations, cost heuristics, eviction heuristics, lazy/eager spill/restore, rematerialization, register shuffling (PHI) with cycle breaking, register renaming, etc. That's all in ~2000 lines of lj_asm.c.
I had already changed the title after your reply. The objection is about the naming, which implies an invention claim without further explanation. It's not about the code.
Are you demanding that independent implementations of an idea you explicitly dedicated to the public domain must then describe the various ways your version is “superior”?
Yes, of course it's vulnerable; verified with Docker debian:sid. That was my first reaction when I read this, but I wanted to verify it first. You beat me to it with this post.
Since you've already let the cat out of the bag (which is not ideal), please file the bugs with Debian and Ubuntu.
Test command:
    redis-cli eval 'return select(2, loadstring("\027")):match("binary") and "VULNERABLE" or "OK"' 0
While we're at it, redis has ignored the advice at: http://lua-users.org/wiki/SandBoxes
Almost all of the critical functions (loadstring, load, getmetatable, getfenv, ...) are present and unprotected in the redis 'SandBox' (which isn't).
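A minimal whitelist sketch for Lua 5.1/LuaJIT, roughly along the lines of that page (the whitelisted set below is just an example, not a vetted list):

    -- Whitelist environment: only expose what scripts legitimately need.
    -- loadstring, load, getmetatable, getfenv etc. are simply absent.
    local sandbox_env = {
      tostring = tostring,
      tonumber = tonumber,
      math = math,  -- caution: sharing a library table lets scripts mutate it
    }

    local untrusted = "return tostring(loadstring)"

    -- Reject precompiled chunks: bytecode starts with an ESC (0x1b) byte
    -- and can break out of any Lua-level sandbox (the bug discussed here).
    assert(untrusted:byte(1) ~= 27, "binary chunk rejected")

    local chunk = assert(loadstring(untrusted))
    setfenv(chunk, sandbox_env)  -- Lua 5.1 / LuaJIT
    print(chunk())  --> "nil": the sandboxed code cannot see loadstring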
Which means: disable scripting or shut down NOW any redis instance that does not run with the same privileges as the clients that have access to it. Scripting can be disabled by renaming the EVAL and EVALSHA commands to unguessable names, as shown below.
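For reference, the rename trick goes in redis.conf; the replacement name below is a made-up example, and an empty string disables a command outright:

    # redis.conf
    rename-command EVAL    some-long-unguessable-name
    rename-command EVALSHA ""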
I don't think getmetatable et al are really problems, are they? You can mess things up in the sandbox, but that's not escaping it. I think that page is trying to build a sandbox where even a lua script can eval other lua code without blowback, but that's not what redis is trying to achieve.
While you're right about the binary loadstring issue, that lua-users page is way overly paranoid. The best Lua sandboxing implementation I know of is the one Wikipedia uses, and it allows a lot of what's "unsafe" there.
The page is trying to build a sandbox where a lua script can eval other untrusted lua code within the same lua execution environment. Many (even most?) people are only interested in isolating the host application from the lua environment.
One year ago I hardened LuaJIT's VM against these kinds of attacks. Since then, there has been a constant influx of complaints and filed issues, all bitterly insisting that their code, which mistakenly assumed a fixed hash table iteration order, is now broken.
Even when told that the Lua manual has clearly stated for 20 years that the order is undefined, they do not cease to complain. They do not realize this change helped them discover a serious bug in their code (the order could differ even before that change). Sigh.
You can now guess what one of the less enlightened forks of LuaJIT did ...
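To make the point concrete, this is the kind of code that broke, and per the manual's note on next()/pairs() it was always broken:

    local t = { foo = 1, bar = 2, baz = 3 }
    for k, v in pairs(t) do
      print(k, v)  -- visit order is undefined; with hash randomization it can
    end            -- change between runs, which merely exposes the latent bug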
I'm not surprised. The same issue occurred in Python.
And to be fair, it's a pain in the ass to debug and find out why something happens to implicitly depend on iteration order (float stability is a common culprit, but not the only one). And their code did work beforehand, for most values of "work".
The biggest pain in the ass is that (at least in Python) while you can set the hash seed explicitly, the language doesn't tell you which seed it picked if you don't. This makes reproducing the issue very annoying when only some seeds trigger it. (In CPython you can pin it via the PYTHONHASHSEED environment variable.)
> the order could differ even before that change
While the order could differ, I assume it was deterministic in practice, and nothing influencing those bits had changed in a while.
And, as predicted by core developer Raymond Hettinger in his Modern Dictionaries talk, Python's dicts are now guaranteed to be insertion-ordered by default (as of 3.7).
To the extent this creates a performance penalty, it's a little annoying that a few systems create a behavior dependency that is not truly needed in a core type, when a separate type (OrderedDict) already implements the desired semantics. But then again, if it doesn't make it slower, it should be fine.
FWIW, the way Raymond re-implemented dictionaries made them more efficient (the algorithms are the same, but the layout is now much more cache-friendly), and it had the side effect of making them insertion-ordered. He and many others advocated for guaranteeing the ordering going forward.
And now they're locked into an order-preserving implementation, even if an optimization tomorrow could eke out another 10% were ordering not guaranteed.
Actually, LuaJIT 1.x is just that: a translator from register-based bytecode to machine code, using templates (small assembler snippets) with fixed register assignments. There's only a little more magic to it, like template variants depending on the inferred type, etc.
You can compare the performance of LuaJIT 1.x and 2.0 yourself on the benchmark page (for x86). The LuaJIT 1.x JIT-compiled code is only slightly faster than the heavily tuned LuaJIT 2.x VM with its interpreter written by hand in assembly language. Sometimes the 2.x interpreter even beats the 1.x compiler.
A lot of this is due to the better design of the 2.x VM (object layout, stack layout, calling conventions, builtins, etc.). But from the perspective of the CPU, a heavily optimized interpreter does not look that different from simplistic, template-generated code. The interpreter dispatch overhead can be moved onto independent dependency chains by the CPU, if you do it right.
Of course, the LuaJIT 2.x JIT compiler handily beats both the 2.x interpreter and the 1.x compiler.
Article: "We can also refute Bernstein’s argument from first principles: the kind of people who can effectively hand-optimize code are expensive and not incredibly plentiful."
Commenter: "IMO he couldn't give a convincing answer to the guy who asked about LuaJIT author being out of a job."
Guy in audience: "I was that guy in the audience."
LuaJIT author: "Actually, LuaJIT 1.x is just that"
Voice in my head: "Aspen 20, I show you at one thousand eight hundred and forty-two knots, across the ground."
Meta: Apologies for the abstract response, but I couldn't figure out a better way to present the parallel. It can be hard to explain artistic allusions without ruining them. What I mean to say is that this pattern of responses reminded me in a delightful way of the classic story of the SR-71 ground speed check: http://www.econrates.com/reality/schul.html
I'm impressed (as usual with your work) that you're able to get that level of performance from an interpreter, although as you note it's not an apples-to-apples comparison. I wonder what you'd get from the 2.x VM design combined with a template-based JIT compiler.
But, as you said, my main point is your last sentence: optimizing compilers like LuaJIT 2.x are impressive and necessary.
No. You DO need a good understanding of a computer language and of JIT compilers to understand the code base for any just-in-time compiler for that computer language.
LuaJIT is not a toy compiler from a textbook. There's a lot of inherent complexity in a production compiler that employs advanced optimizations and needs to work on various CPU architectures and operating systems. This reflects in the code.
The only correction I have: LuaJIT _does_ have 64 bit integers, e.g. 0x0123456789abcdefLL.