This weekend past, I achieved a long-standing goal… I created a working compiler and Virtual Machine for a made up mini language. The target mini language is intentionally tiny and minimal.
Here are some buzzwords: Its a entirely handwritten recursive decent compiler. For expressions I use a slightly strange shunting yard implementation that I dreamed up with a while back. The target ISA is stack based similar to JAVA, Pascal or early LUA. There is only one data type… the 32bit integer.
At this point I’ll just throw up a link to the repo: https://github.com/8BitPimp/ccml
Be warned however that there a pungent reek of code smell emanating from there (it was a weekend project after all). So gas masks are advisable.
Some important constraints of the language are: Only integers are supported, there is also no globals or arrays or any such thing, also semantic checking is completely ignored.
It also helps to see a code fragment of this target language. Here is my factorial example I uses to prove to myself that recursion worked fine:
function factoral(val)
if (val <= 1)
return 1
else
return val * factoral(val - 1)
end
end
function main()
return factoral(4)
end
The above code currently translates into the following instructions:
-- factoral
0000 INS_GETV -1
0005 INS_CONST 1
0010 INS_LEQ
0011 INS_NOT
0012 INS_CJMP 37
0017 INS_CONST 1
0022 INS_RET 1
0027 INS_CONST 1
0032 INS_CJMP 64
0037 INS_GETV -1
0042 INS_GETV -1
0047 INS_CONST 1
0052 INS_SUB
0053 INS_CALL 0
0058 INS_MUL
0059 INS_RET 1
0064 INS_CONST 0
0069 INS_RET 1
-- main
0074 INS_CONST 4
0079 INS_CALL 0
0084 INS_RET 0
0089 INS_CONST 0
0094 INS_RET 0
The syntax of my mini language is somewhat similar to LUA, who’s code I always found nice to look at. I first learned real programming using BlitzBasic so I have a soft spot for the BASIC syntax. Also the author of BlitzBasic, Mark Sibley is one of my heros!
One of the more interesting aspects of this little project, was the design and implementation of the stack machine the VM emulates. I have read a fair bit about stack machines in the past, but somehow I never quite groked their operation. My main sticking point was the ABI, how are parameters passed/used, how are local stored/used, how is a frame constructed?
I had visions that I would need loads of stacks for different things, locals, return points, arguments etc. This turned out to be way off, and I was able to make a stack machine that used just two stacks, however even this could certainly be reduced to just with a bit of rework. I think there are advantages to maintaining multiple stacks however.
Things clicked for me when I realized I could use one stack for computing expressions, as well as storing locals and arguments. The second stack stored procedure frames, which captured a ‘frame pointer’ and a ‘return location’. The top procedure frame on the stack always refers to the function currently being executed.
In my system, the frame pointer is somewhat conventional, and procedure arguments can be found immediately bellow it on the stack, and local variables can be found immediately above it on the stack. This makes code generation simple, as accessing locals and arguments, becomes independent of the head of the stack, and each other.
During the course of execution the stack may look as follows:
08 scratch <-- top of stack is scratch for expression
07 var0 <-- FP (top)
06 arg1
05 arg0
04 scratch <-- may have scratch space if call part of incomplete expression
03 var1
02 var0 <-- FP (older)
01 arg1
00 arg0
Scratch space is the term I’m assigning to the fragments of a not yet fully evaluated expression. For something like ‘a+b’, both a and b will live on the stack before they are combined via the addition. If a function is called as part of an expression, these fragments will still hand around until it returns, and evaluation resumes.
My stack machine has two instructions GETV and SETV that provide everything I need to work with arguments and locals. Lets looks at the operation of GETV:
GETV <int32_t operand>:
1. index = frame.fp + operand
2. push(stack[index])
GETV will index the stack relative to FP+<operand> and place that element on the top of the stack. This has the effect that operands >= 0 refer to locals and operands < 0 refer to arguments.
SETV <int32_t operand>:
1. index = frame.fp + operand
2. stack[index] = pop()
SETV is much the same, however it pops the top item on the stack and overwrites the element on the stack that it indexes.
With these two instructions, my code can easily read and write arguments and locals. Perfect!
Unlike a regular register ISA, local variables are not allocated on the stack by moving the stack pointer, all that that is needed to allocate a local on the stack is to push an item on the stack and not pop it. This has the effect of incrementing the top of the stack, reserving space which can now be indexed relative to the fp. In my mini language, a variable declaration is preceded by the ‘var’ keyword which makes it really easy to know that we don’t need to pop anything after this statement has finished. It will happily sit there on the stack and can now be assigned to and read from.
One more trouble point is returning from a function… how do we know how to clean up the stack?
I took the easy route and made my RET instruction fairly complex. RET takes one operand, which is the number of items to discard from the stack. In this manner, RET can discard all local variables and arguments, similar to a callee save function. The parser knows how many arguments and locals have been provided, so is becomes trivial to emit this instruction.
There was one kink in this plan however… If we want to return a value from a function, how do we do that. Since the return value will be sitting on the top of the stack at the point that we return we cant simply discard all of the top elements.
My RET function operates as follows:
RET <int32_t operand>
1. save the top item on the stack (return value)
2. discard <operand> items from the stack
3. push the previously saved top item
4. pc = frame.return_value
5. discard the top procedure frame
That is a fairly meaty RET function, but I didn’t spend too long thinking about it and it seemed justified.
To simplify things in my compiler I enforce that every function must return a value, which defaults to 0 if no return statement is issued in the program. This is crude and implemented by issuing a { CONST 0, RET } instruction pair at the end of every function.
I believe this is the same trick that LUA uses. It my seem wasteful, but it keeps things simple stupid.
So to conclude, how good is my mini language. Well…
…Its crap…
…but it works…
…and I learned a ton!
I would chalk this up as a complete success.
The code is highly smelly, but hey! this was a weekend project. Its up on github.