Another weekend has passed, which means another weekend project has been finished. I'm not sure how I managed this, as I currently spend most of my spare time shooting super mutants and ghouls in the face.
The project this time… A really sucky JIT compiler.
My recent success writing a small compiler and VM for a made-up language really got me interested in this area once again. My compiler would generate bytecode for a stack-based architecture, and my VM would then emulate this.
While this approach works, it's not very efficient; the bytecode is never going to change after it's compiled, yet we still pay the cost of dynamic dispatch on every instruction. I had an idea… could I generate sequences of host assembly instructions that directly encapsulate each of my bytecodes? If I could do this, then I could just string these little assembly blocks together in the same order as the bytecode and invoke the result directly on the host.
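For contrast, here is a minimal sketch of the dispatch overhead an interpreter pays (illustrative only; the opcodes and stack handling are simplified, not my actual VM):

    #include <cstdint>
    #include <vector>

    enum : uint8_t { INS_CONST, INS_ADD, INS_RET };

    // Every executed bytecode goes through the fetch and the switch below;
    // that per-instruction dispatch is the cost the JIT removes.
    int run(const std::vector<uint8_t> &code) {
        std::vector<int> stack;
        for (size_t pc = 0;;) {
            switch (code[pc++]) {
            case INS_CONST:                 // push a (one-byte, for brevity) constant
                stack.push_back(code[pc++]);
                break;
            case INS_ADD: {                 // pop two values, push the sum
                int a = stack.back(); stack.pop_back();
                stack.back() += a;
                break;
            }
            case INS_RET:                   // return the top of the stack
                return stack.back();
            }
        }
    }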
I chose to target x86 for my proof of concept because it's the architecture I'm most familiar with. The same approach should, in theory, work for almost any architecture however.
I kept the bytecode basically the same as before for simplicity. Next up, I wrote an assembly listing for nasm that contained all of these atomic little assembly blocks. A quick and dirty Python script takes the binary output from nasm, along with the generated map file, and produces a .cpp file containing all of the assembly chunks.
The assembly chunks start off life as follows:
ins_const:
    mov eax, 0xaabbccdd
    push eax

ins_getl:
    mov eax, [ebp+0xaabbccdd]
    push eax

ins_setl:
    pop eax
    mov [ebp+0xaabbccdd], eax

ins_add:
    pop eax
    add [esp], eax
And after assembling and passing the output through my script, I end up with the following:
{ ins_const , 6,  1, -1, "\xb8\xdd\xcc\xbb\xaa\x50" },
{ ins_getl  , 7,  2, -1, "\x8b\x85\xdd\xcc\xbb\xaa\x50" },
{ ins_setl  , 7,  3, -1, "\x58\x89\x85\xdd\xcc\xbb\xaa" },
{ ins_add   , 4, -1, -1, "\x58\x01\x04\x24" },
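Reading the fields left to right: the instruction, the chunk's length in bytes, the byte offset of the 0xaabbccdd placeholder (-1 when there is none), the offset of a relative operand, and the raw machine code. The generated entries fill a struct shaped something like this (field names here are illustrative, not the actual declaration; I'm assuming ins_const et al. are values of some ins_t enum):

    struct chunk_t {
        ins_t       op;      // bytecode instruction this chunk implements
        uint32_t    size;    // length of the chunk in bytes
        int32_t     abs_op;  // offset of the absolute 0xaabbccdd operand, or -1
        int32_t     rel_op;  // offset of a relative operand, or -1
        const char *code;    // the raw machine code bytes
    };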
Some instructions take operands, however, which the script deduces by looking for the operand placeholder 0xaabbccdd in the byte stream after assembly. If the least significant bytes have changed after assembly, then it's fairly clear the operand is relative; relative operands are slightly more involved to fix up.
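Concretely, once a chunk has been copied into the code buffer, the placeholder can be patched along these lines (a hypothetical helper using the chunk_t sketch above; the real fixup code differs in detail):

    #include <cstdint>
    #include <cstring>

    static void patch_operand(uint8_t *dst, const chunk_t &c, uint32_t value) {
        if (c.abs_op >= 0) {
            // absolute operand: write the value straight over 0xaabbccdd
            memcpy(dst + c.abs_op, &value, sizeof(value));
        } else if (c.rel_op >= 0) {
            // relative operand: x86 encodes these relative to the address of
            // the *next* instruction (32-bit addresses, matching the x86
            // target), hence the extra arithmetic that makes them
            // "slightly more involved"
            uint32_t rel = value - uint32_t(uintptr_t(dst) + c.size);
            memcpy(dst + c.rel_op, &rel, sizeof(rel));
        }
    }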
Most of the effort was spent wrapping up the interface to crapjit, to make it easy to build a code buffer and handle relocations and labels.
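To give a feel for it, here is one way labels and forward relocations can fit together (a sketch; the names match the API used below, but the bodies are my illustration, not crapjit's actual implementation):

    #include <cstdint>
    #include <cstring>

    typedef uint8_t *label_t;   // assumption: a label is just an address in the buffer

    struct reloc_t {
        uint8_t *site;          // where the rel32 field of a jump/call was emitted
        void set(label_t target) {
            // rel32 is relative to the first byte after the 4-byte field
            int32_t rel = int32_t(target - (site + 4));
            memcpy(site, &rel, sizeof(rel));
        }
    };

    // e.g. emit_jz() writes "je 0xaabbccdd" and returns a reloc_t pointing at
    // the placeholder; calling set() later resolves the forward jump.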
As usual, to prove things work, I decided to JIT a function to compute the factorial of a number. The code to JIT this looks as follows:
label_t jit_factoral(crapjit_t & j) {
    reloc_t l0;
    label_t fn;
    // prologue
    fn = j.emit_label();
    j.emit_frame(0);
    // if (val <= 1)
    j.emit_getl(2);
    j.emit_const(1);
    j.emit_leq();
    l0 = j.emit_jz();
    // return 1
    j.emit_const(1);
    j.emit_return(0);
    // else
    l0.set(j.emit_label());
    // return val * factoral(val - 1)
    j.emit_getl(2);
    j.emit_getl(2);
    j.emit_const(1);
    j.emit_sub();
    j.emit_call().set(fn);
    j.emit_sink(1);
    j.emit_mul();
    j.emit_return(0);
    return fn;
}
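For reference, the comments spell out the function being built; written out in plain C it is simply:

    int factorial(int val) {
        if (val <= 1)
            return 1;
        return val * factorial(val - 1);
    }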
Bear in mind that this is a stack architecture we are emulating, so it's fairly inefficient to start with. Still, I would have felt deeply ashamed if I didn't include at least one kind of optimization in crapjit. I noticed that many assembly chunks start with 'pop eax' or end with 'push eax'; when a chunk ending in 'push eax' is chained to one starting with 'pop eax', both instructions can be removed, since back to back they are effectively a nop.
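A sketch of that elision ('push eax' is 0x50, 'pop eax' is 0x58; a robust version would flag push/pop per chunk in the table rather than sniff raw bytes, since 0x50 could equally appear as an operand byte):

    #include <cstdint>
    #include <vector>

    // Append a chunk, eliding a redundant 'push eax' / 'pop eax' pair at the join.
    void emit_chunk(std::vector<uint8_t> &buf, const uint8_t *code, uint32_t size) {
        if (!buf.empty() && buf.back() == 0x50 && size > 0 && code[0] == 0x58) {
            buf.pop_back();       // drop the trailing 'push eax'
            ++code;               // skip the leading 'pop eax'
            --size;
        }
        buf.insert(buf.end(), code, code + size);
    }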
The generated code is absolutely horrible, yet functional:
00320000 push ebp
00320001 mov ebp,esp
00320003 sub esp,0
00320009 mov eax,dword ptr [ebp+8]
0032000F push eax
00320010 mov eax,1
00320015 pop edx
00320016 cmp edx,eax
00320018 setle al
0032001B and eax,1
0032001E cmp eax,0
00320021 je 00320034
00320027 mov eax,1
0032002C pop ebp
0032002D add esp,0
00320033 ret
00320034 mov eax,dword ptr [ebp+8]
0032003A push eax
0032003B mov eax,dword ptr [ebp+8]
00320041 push eax
00320042 mov eax,1
00320047 sub dword ptr [esp],eax
0032004A call 00320000
0032004F add esp,4
00320055 mul eax,dword ptr [esp]
00320058 mov dword ptr [esp],eax
0032005B pop eax
0032005C pop ebp
0032005D add esp,0
00320063 ret
One side benefit of targeting x86 is that the ABI is trivially compatible with my stack-based architecture, so I was able to call directly into my JIT code without any kind of interface layer. That was a nice freebie.
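That means calling the jitted factorial is just a cast away (a sketch; 'code' here stands for the executable buffer holding the emitted function):

    // arguments go on the stack, the result comes back in eax, and the
    // caller cleans up: effectively cdecl, so a plain function pointer works
    typedef int (*factorial_t)(int);
    factorial_t fact = (factorial_t)code;
    int r = fact(10);   // 3628800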
All the code for this is on GitHub and has been tested on Windows and Linux.
There is loads of scope to improve the generated output from crapjit. One approach would be to store all of the IR instructions in a buffer and then, as a final synthesis step, use a maximal munch tiling algorithm to cover the IR with assembly chunks. Some of the worst-case code could be improved using this method. Furthermore, the IR in the buffer could have all the usual optimization passes run over it. Lastly, a final peephole optimizer could clean up a ton of redundant operations in the output assembly.
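To sketch the tiling idea (none of this exists in crapjit yet; the types and helpers below are purely illustrative): sort the tiles longest-first and, at each IR position, take the longest one that matches, so a sequence like getl, const, add could be covered by a single fused chunk instead of three push/pop-heavy ones.

    #include <cstddef>
    #include <vector>

    enum ir_op { IR_GETL, IR_CONST, IR_ADD /* ... */ };

    struct tile_t {
        std::vector<ir_op> pattern;   // the IR sequence this tile covers
        const char *code;             // the machine-code template it would emit
    };

    static bool matches(const std::vector<ir_op> &ir, size_t pos,
                        const std::vector<ir_op> &pat) {
        if (pos + pat.size() > ir.size()) return false;
        for (size_t i = 0; i < pat.size(); ++i)
            if (ir[pos + i] != pat[i]) return false;
        return true;
    }

    // Greedy maximal munch: tiles must be sorted longest-first. Returns how
    // many IR ops were consumed at 'pos' (emitting the tile's code is elided).
    static size_t munch(const std::vector<ir_op> &ir, size_t pos,
                        const std::vector<tile_t> &tiles) {
        for (const tile_t &t : tiles)
            if (matches(ir, pos, t.pattern))
                return t.pattern.size();
        return 1;  // no tile matched: fall back to the 1:1 chunk for ir[pos]
    }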
It was fun to see that a simple JIT compiler could be made over a single weekend, during breaks from playing Fallout 4. Even if the output code is shockingly poor, it's functional, and certainly faster than interpreting bytecode, so let's not be too quick to poke fun at it.