Crap JIT

Another weekend has passed, which meens another weekend project has been finished.  Im not sure how I managed this as I currently spend most of my spare time shooting super mutants and ghouls in the face.

The project this time… A really sucky JIT compiler.

My recent success writing a small compiler and VM for a made up language really got me interested in this area once again.  My compiler would generate bytecode for a stack based architecture and then my VM would emulate this.

While this aproach works, its not very efficient; The bytecode is never going to change after its compiled, yet we still pay the code of dynamic dispatch.  I had an idea… could I generate sequences of host assembly instruction that directly encapsulate each of my bytecodes?  If I could do this, then I could just string these little assembly blocks together in the same order as the bytecode and invoke it directly on the host.

I chose to target x86 code for my proof of concept because I am most familiar with it.  The same aproach should in theory work for almost any architecture however.

I kept the bytecode basicaly the same as before for simplicity.  Next up I wrote an assembly listing for nasm that contained all of these atomic little assembly blocks.  A quick and dirty python script takes the binary output from nasm and the generated map file and outputs a generated cpp file containing all of the assembly chunks.

The assemly chunks start off life as follows:

ins_const:
    mov  eax, 0xaabbccdd
    push eax
ins_getl:
    mov  eax, [ebp+0xaabbccdd]
    push eax
ins_setl:
    pop eax
    mov [ebp+0xaabbccdd], eax
ins_add:
    pop eax;
    add [esp], eax;

And after assembly and passing threw my script I end up with the following:

{ ins_const   , 6 ,  1, -1, "\xb8\xdd\xcc\xbb\xaa\x50" },
{ ins_getl    , 7 ,  2, -1, "\x8b\x85\xdd\xcc\xbb\xaa\x50" },
{ ins_setl    , 7 ,  3, -1, "\x58\x89\x85\xdd\xcc\xbb\xaa" },
{ ins_add     , 4 , -1, -1, "\x58\x01\x04\x24" },

Some instructions take operands however, which the script deduces by looking for the operand constant 0xaabbccdd in the output bytestream after assembly.  If the least significant bytes have changed after assembly then its fairly clear this is a relative operand, which are slightly more involved to fixup.

Most of the effort was spent wrapping up the interface to crapjit to make it east to build a codebuffer and handle relocations and labels.

As usual to prove things work I decided to jit a function to compute the factoral of a number.  The code to jit this looks as follows:

label_t jit_factoral(crapjit_t & j) {
    reloc_t l0;
    label_t fn;
    // prologue
    fn = j.emit_label();
    j.emit_frame(0);
    // if (val <= 1)
    j.emit_getl(2);
    j.emit_const(1);
    j.emit_leq();
    l0 = j.emit_jz();
    // return 1
    j.emit_const(1);
    j.emit_return(0);
    // else
    l0.set(j.emit_label());
    // return val * factoral(val - 1)
    j.emit_getl(2);
    j.emit_getl(2);
    j.emit_const(1);
    j.emit_sub();
    j.emit_call().set(fn);
    j.emit_sink(1);
    j.emit_mul();
    j.emit_return(0);
    return fn;
}

Bare in mind that this is a stack architecture we are emulating so its fairly inefficient to start with.  I would have felt deeply ashamed if I didnt at least include one kind of optimization in crap jit.  Noticing that many assembly chunks start with ‘pop eax’ and end with ‘push eax’; when chaining such chunks, these instructions can be removed since they basicaly nop.

The generated code is absolutely horrible, yet functional:

00320000  push        ebp
00320001  mov         ebp,esp
00320003  sub         esp,0
00320009  mov         eax,dword ptr [ebp+8]
0032000F  push        eax
00320010  mov         eax,1
00320015  pop         edx
00320016  cmp         edx,eax
00320018  setle       al
0032001B  and         eax,1
0032001E  cmp         eax,0
00320021  je          00320034
00320027  mov         eax,1
0032002C  pop         ebp
0032002D  add         esp,0
00320033  ret
00320034  mov         eax,dword ptr [ebp+8]
0032003A  push        eax
0032003B  mov         eax,dword ptr [ebp+8]
00320041  push        eax
00320042  mov         eax,1
00320047  sub         dword ptr [esp],eax
0032004A  call        00320000
0032004F  add         esp,4
00320055  mul         eax,dword ptr [esp]
00320058  mov         dword ptr [esp],eax
0032005B  pop         eax
0032005C  pop         ebp
0032005D  add         esp,0
00320063  ret

Once side benefit of targeting x86 is that the ABI is trivialy compatable with my stack based architecture, so I was able to directly call into my JIT code without any kind of interface.  That was a nice freebie.

All the code for this is on github and has been tested on windows and linux.

There is loads of scope to improve the generated output from crapjit.  One aproach would be to store all of the IR instructions in a buffer, and then as a final synthesis step use a maximal munch tiling algorythm to cover the IR in assembly chunks.  Some of the worst case code could be improved usign this method.  Furthermore, the IR in the buffer itself could have all the usual optimization passes executed on it.  Lastly a final peephole optimizer could clean up a ton of redundant operations in the output assembly.

It was fun to see that a simple JIT compiler could be made during the breaks from playing fallout 4 over a single weekend.  Even if the output code is shockingly poor, its functional, and certainly faster then interpreting bytecode, so lets not be too quick to poke fun at it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s