Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

One more thing - the use of a stack-based bytecode is also curious. PHP itself (the Zend engine) uses a register-based bytecode (well, it uses a horror-show of an in-memory bytecode-like-thing, which could best be described as a register-based bytecode). The PHP engine details leak into the language a lot, so the difficulties they describe aren't unexpected.

Not only that, but bleeding-edge JITs typically use register-based bytecode. Java's JIT and all the modern versions of Javascript VMs use register-based (in many cases actually translating from stack-based bytecode that the parser outputs, to register-based bytecode consumed by the JIT compilers). Since 2003, when my PhD advisor and his co-authors (Google "David Gregg" and "Anton Ertl") showed register VMs could be much more efficient that stack-based ones, the jury has settled on register-based VMs.

So I'm curious as to why a stack-based instruction set was used in the VM design?



As noted elsewhere in this thread, a stack-based design typically produces more compact bytecode. Compactness was a concern for us because of the size of FB's PHP code base. Also, generally speaking a stack-based design tends to be easier to deal with when working to get a prototype VM up and running quickly.

Many of the advantages of register-based designs (ability to optimize by rewriting the program at the bytecode level, ability to write faster interpreters, ability to map bytecode registers to physical registers, etc.) weren't particularly attractive to us because we knew we were going to build an x64 JIT that did its own analysis and optimization to take advantage of type information observed at run time.

Thus, we drafted a stack-based design for HipHop bytecode. It captured PHP's semantics correctly and happened to fit in fairly well with PHP's evaluation order, so we ran with it and here we are.


Makes total sense, thanks!

Would using a register-based bytecode not have been useful for the x64 JIT?


Check out our HHIR work. It's an SSA-based IR, that gets us most of the advantages of a register representation. But it is at a much lower level than the bytecode.


AFAICT, two reasons come to mind:

1) A stack-based ISA is still more space-compact. (AFAIR the Shi et al. paper mentions something like a 40+% increase in space requirement for the instructions.)

2) The performance improvement of a register-based ISA is only visible for interpreters that suffer most from instruction dispatch [1]. PHP is a rather complex programming language that is most certainly not bound by instruction dispatch at all. So I guess it could very well make sense to stick with a simple stack-based ISA, which incidentally is also easier to compile from the AST.

[1] for the sake of completeness: the stack architecture emits many instructions to push operands onto the operand stack. A register-based interpreter does not need those. Hence, the overall number of dispatches is lower, and if dispatch cost is your bottleneck the overall performance increases. OTOH, you need more space, because in addition to the opcodes, you need to specify/encode which registers to take operands from and put results to. (Hint: quadruple code and the likes.)


1) Yes, that's my recollection too. I think 40% is exactly right.

2) But this is exactly my point. Zend PHP uses a register-based bytecode. The HipHop JIT could choose anything - why choose stack-based - its not only different, but also worse! If I had to guess, I'd say it was chosen to make the HipHop interpreter easier to implement, and then was legacy code when creating the JIT, and it was easier to keep it than to replace it.


> all the modern versions of Javascript VMs use register-based (in many cases actually translating from stack-based bytecode that the parser outputs, to register-based bytecode consumed by the JIT compilers)

Pretty sure V8 doesn't as it has no interpreter at all, it has a fast JIT and a slow JIT, but it only ever runs native code.


Sorry, I didn't mean to imply they all use interpreters. The more compilery parts use three-address code and SSA, which are IRs that are conceptually similar to a register-based bytecode.


Java's HotSpot is a stack machine, AFAIK. Dalvik is a register machine, but I wouldn't want to point to that as a positive. ;)


Internally, Hotspot doesn't use a stack representation. It converts it to an IR (internal representation) which is not stack based, which is my point. See Fig 2 from http://www-leland.stanford.edu/class/cs343/resources/java-ho...


Huh, so it's not. Thanks for the interesting read!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: