Bytecode is Evil
After having worked with .NET and virtual execution systems in general over several years, it struck me that there has to be an easier way to deal with intermediate code. Bytecode (or IL) seems ideal for use in a virtual execution system, because it’s a platform-neutral instruction set. It allows the JIT to compile it to the optimal native code for the platform the program is executed on.
However, there are serious problems with bytecode:
- Structured control flow information is lost (loops and conditionals are flattened into branches)
- IL is hard to interpret compared to walking an expression tree
- Transformation or translation is hard (admittedly easier than native code, though)
As a consequence of the loss of control flow, the JIT has a much harder time optimizing. It has to reconstruct the control flow of a method from the bytecode that compilers have given it, and this might fail if a compiler emits rather obscure code. And as mentioned, tools that operate on IL have an incredibly hard time trying to remap it to high-level code (e.g. SL#).
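To make the reconstruction step concrete, here is a minimal sketch of the first thing a consumer of flat bytecode typically has to do: split the instruction stream back into basic blocks by finding "leaders". The instruction set below (`load`, `br`, `br_ge`, and so on) is made up for illustration and is not real CIL; the names `find_leaders` and `basic_blocks` are my own.

```python
# Hypothetical flat instruction list: (opcode, operand). Not real CIL.
# It encodes roughly: while i < n: print(i); i += 1
code = [
    ("load", "i"),      # 0
    ("load", "n"),      # 1
    ("br_ge", 7),       # 2: if i >= n, jump to 7
    ("load", "i"),      # 3
    ("call", "print"),  # 4
    ("inc", "i"),       # 5
    ("br", 0),          # 6: jump back to 0
    ("ret", None),      # 7
]

# Opcodes whose operand is a jump target (an assumption of this toy set).
BRANCHES = {"br", "br_ge"}


def find_leaders(code):
    """Return the sorted indices of instructions that start a basic block."""
    leaders = {0}  # the first instruction always starts a block
    for index, (op, operand) in enumerate(code):
        if op in BRANCHES:
            leaders.add(operand)  # a branch target starts a block
            if index + 1 < len(code):
                leaders.add(index + 1)  # so does the instruction after a branch
    return sorted(leaders)


def basic_blocks(code):
    """Return (start, end) index pairs, end exclusive, one per basic block."""
    bounds = find_leaders(code) + [len(code)]
    return list(zip(bounds, bounds[1:]))


print(basic_blocks(code))
```

Even after this, the consumer only has blocks and jumps. Recognizing that blocks 0 to 6 form a `while` loop takes further analysis, which is exactly the information the original compiler had and threw away.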
Surely, we can do better than this. Imagine if all code were compiled to trees (ASTs, specifically). The JIT would have all the control flow and expression structure it could wish for when optimizing. Tools would be able to easily walk the compiled code and translate expressions.
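As a rough illustration of what "walking the compiled code" could look like, here is a small sketch of a tree representation with a recursive evaluator. The node types (`Const`, `Var`, `Add`, `Less`, `Assign`, `While`) and the `evaluate` function are hypothetical and only cover enough to run one loop; a real format would need far more.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Const:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class Add:
    left: Any
    right: Any


@dataclass
class Less:
    left: Any
    right: Any


@dataclass
class Assign:
    name: str
    value: Any


@dataclass
class While:
    condition: Any
    body: List[Any]


def evaluate(node, env):
    """Evaluate a node by walking the tree; env maps variable names to values."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Add):
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Less):
        return evaluate(node.left, env) < evaluate(node.right, env)
    if isinstance(node, Assign):
        env[node.name] = evaluate(node.value, env)
        return None
    if isinstance(node, While):
        # The loop is still a loop here; nothing has to be reconstructed.
        while evaluate(node.condition, env):
            for statement in node.body:
                evaluate(statement, env)
        return None
    raise TypeError(f"unknown node type: {type(node).__name__}")


# while i < 3: total = total + i; i = i + 1
program = While(
    Less(Var("i"), Const(3)),
    [
        Assign("total", Add(Var("total"), Var("i"))),
        Assign("i", Add(Var("i"), Const(1))),
    ],
)

env = {"i": 0, "total": 0}
evaluate(program, env)
print(env)
```

A translator to another language would have the same shape as `evaluate`: one case per node type, emitting text instead of computing values.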
Of course, with such a model, a few issues arise:
- Code can easily be reverse engineered
- AST representations of code would be much larger than the bytecode equivalents
First of all, reverse engineering: REing bytecode is already incredibly easy. We have tools like Reflector and JD that can almost entirely recreate the original high-level code from the bytecode. In response, many have written obfuscators that insert dead code, alter control flow to be more obscure and confusing, obfuscate class/method/variable names, and so on. Easy reverse engineering is something we won’t get around when we’re dealing with abstract code like bytecode, so ASTs don’t make the situation meaningfully worse.
As for space: I won’t deny that AST representations of code will grow quite large. This is arguably a problem for embedded devices (and similar deployment targets), but for desktop and server machines, it hardly matters. If an IL instruction takes between 1 and 6 bytes, then as a rough guess an AST node might take 2 to 12 bytes. That isn’t so bad.
Thinking about it, there isn’t much reason, apart from saving space, that we compile to bytecode. Bytecode is arguably a redundant and cumbersome step in the process of program compilation/execution. I can only imagine that we’re stuck with IL because we’re still in that assembly language mindset that comes from native code.
So, this was mostly a brain dump/rant/idea. I figure something can actually be done with the AST approach, and I do intend to pursue it. I’ll blog more about that later.
Good observations. I’m currently working on a DSL layered on top of an s-expression based language, which allows embedded calls in the s-exp language. In effect, the s-exp language is the IL. However, I skip the translation to “IL” and produce Cons-based ASTs from both languages. If, as you point out, the resource problems can be solved, this is an excellent approach. Maintaining the existing code and adding new constructs is a piece of cake compared to actually doing IL translation.
TechNeilogy - February 5, 2011 at 16:10
Indeed. For example, if the runtime provides a set of ‘basic’ nodes like For, While, DoWhile, etc., and you want to have a ForEach node, you just inherit For, and reduce your ForEach to that node upon compilation.
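A minimal sketch of that idea, assuming a hypothetical `reduce` method on every node. To keep it short, statements and conditions are plain strings here rather than real sub-trees, and the generated index name `_i_<item>` is just an illustrative convention.

```python
class Node:
    """Base class; 'basic' nodes the runtime understands reduce to themselves."""

    def reduce(self):
        return self


class For(Node):
    """A basic node the (hypothetical) runtime knows how to compile."""

    def __init__(self, init, condition, step, body):
        self.init = init
        self.condition = condition
        self.step = step
        self.body = body


class ForEach(For):
    """A language-specific node, expressed in terms of For."""

    def __init__(self, item, sequence, body):
        index = f"_i_{item}"  # hidden loop counter
        super().__init__(
            init=f"{index} = 0",
            condition=f"{index} < len({sequence})",
            step=f"{index} += 1",
            body=[f"{item} = {sequence}[{index}]"] + list(body),
        )
        # Keep the high-level view around for tools that want it.
        self.item = item
        self.sequence = sequence

    def reduce(self):
        # Hand the runtime a plain For; it never needs to know about ForEach.
        return For(self.init, self.condition, self.step, self.body)


loop = ForEach("x", "xs", ["print(x)"])
basic = loop.reduce()
print(type(basic).__name__, basic.condition, basic.body)
```

Tools that understand `ForEach` can still see `item` and `sequence` before reduction, while the runtime only ever deals with the basic node set.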
In addition, one could make it so every node can have data attached to it (attributes, in .NET terms). You would be able to annotate literally any piece of code, as opposed to only ‘high-level’ constructs such as types, members, etc.
Debugging info is also massively simplified; you could attach any piece of debug info to any node, containing line number, column number, file name, programming language name, etc.
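Roughly, per-node annotations and debug info could look like the sketch below. The `Node` and `SourceInfo` types, the `"debug"` and `"inline"` keys, and the file and language names are all hypothetical; the point is only that every node carries an open-ended dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SourceInfo:
    """Debug info of the kind described above."""
    file: str
    line: int
    column: int
    language: str


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    # Arbitrary data attached to this node, keyed by name.
    annotations: Dict[str, Any] = field(default_factory=dict)


def find_annotated(node, key):
    """Yield every node in the tree (pre-order) that carries the given key."""
    if key in node.annotations:
        yield node
    for child in node.children:
        yield from find_annotated(child, key)


literal = Node(
    "Const",
    annotations={"debug": SourceInfo("main.src", 3, 9, "ExampleLang")},
)
call = Node(
    "Call",
    children=[literal],
    annotations={
        "debug": SourceInfo("main.src", 3, 1, "ExampleLang"),
        "inline": True,  # a compiler hint on a single expression
    },
)

for node in find_annotated(call, "debug"):
    info = node.annotations["debug"]
    print(node.kind, info.file, info.line, info.column)
```

A debugger would only need to look up the `"debug"` entry on whichever node is currently executing, and other tools can ignore keys they don't recognize.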
XTZGZoReX - February 5, 2011 at 18:33
[...] good while back, I blogged about how bytecode is evil. This post will explain exactly what I intend to do with the AST [...]
Managed JIT Compilation « Zor's Blog - June 25, 2011 at 19:59