Catmull-Rom Splines

I have been working on some code to generate procedural race tracks for a Wipeout-style game. Part of this work involved using splines to smooth out the track so that it looks natural and curvy.

For this purpose I used a cubic Catmull-Rom spline, which takes four control points and has the nice property that its curve passes through the middle two.

I also brushed off my high-school calculus and worked out a derivative function for the spline, which is useful for extracting tangent (and from those, normal) vectors at any point on the curve.

// evaluate spline
// where t is between 0.f and 1.f
float catmullrom_spline(float t, float v0, float v1, float v2, float v3) {

  const float c1 =                1.0f * v1;
  const float c2 = -0.5f * v0 +               0.5f * v2;
  const float c3 =  1.0f * v0 + -2.5f * v1 +  2.0f * v2 + -0.5f * v3;
  const float c4 = -0.5f * v0 +  1.5f * v1 + -1.5f * v2 +  0.5f * v3;

  return (c4 * t*t*t) + (c3 * t*t) + (c2 * t) + c1;
}

// calculate spline derivative
// where t is between 0.f and 1.f
float catmullrom_spline_deriv(float t, float v0, float v1, float v2, float v3) {

  const float c2 = -0.5f * v0 +               0.5f * v2;
  const float c3 =  1.0f * v0 + -2.5f * v1 +  2.0f * v2 + -0.5f * v3;
  const float c4 = -0.5f * v0 +  1.5f * v1 + -1.5f * v2 +  0.5f * v3;

  return (3.f * c4 * t*t) + (2.f * c3 * t) + (c2);
}

The above code is only one-dimensional, but it can be fairly easily scaled up to two or more dimensions as follows:

// evaluate spline
// where t is between 0.f and 1.f
float2 catmullrom_spline(float t, float2 v0, float2 v1, float2 v2, float2 v3) {
  return float2{
    catmullrom_spline(t, v0.x, v1.x, v2.x, v3.x),
    catmullrom_spline(t, v0.y, v1.y, v2.y, v3.y)
  };
}

// calculate spline derivative
// where t is between 0.f and 1.f
// note that the return value will most likely need to be normalized.
float2 catmullrom_spline_deriv(float t, float2 v0, float2 v1, float2 v2, float2 v3) {
  return float2{
    catmullrom_spline_deriv(t, v0.x, v1.x, v2.x, v3.x),
    catmullrom_spline_deriv(t, v0.y, v1.y, v2.y, v3.y)
  };
}

Hopefully these functions are useful to someone at some point.
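To show how the 1D function scales to a whole track, here is a sketch of sampling a closed loop of control points. `cr_eval` duplicates the evaluation above for self-containment, and the wrapping scheme in `track_sample` is my own illustration, not code from the game:

```cpp
#include <cstddef>
#include <vector>

// same evaluation as the spline function above
static float cr_eval(float t, float v0, float v1, float v2, float v3) {
  const float c1 =                1.0f * v1;
  const float c2 = -0.5f * v0 +               0.5f * v2;
  const float c3 =  1.0f * v0 + -2.5f * v1 +  2.0f * v2 + -0.5f * v3;
  const float c4 = -0.5f * v0 +  1.5f * v1 + -1.5f * v2 +  0.5f * v3;
  return (c4 * t*t*t) + (c3 * t*t) + (c2 * t) + c1;
}

// evaluate a closed loop of control points at parameter u in [0, n),
// where the integer part of u selects the segment and the fractional
// part is t. a looping race track is a natural fit for the wrap-around.
float track_sample(const std::vector<float>& p, float u) {
  const size_t n = p.size();
  const size_t i = (size_t)u % n;
  const float  t = u - (float)(size_t)u;
  return cr_eval(t,
                 p[(i + n - 1) % n],  // v0: previous point
                 p[i],                // v1: segment start
                 p[(i + 1) % n],      // v2: segment end
                 p[(i + 2) % n]);     // v3: next point
}
```

At integer parameter values this lands exactly on a control point, which is the pass-through property mentioned above.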

All things RISC-V

I have been increasingly interested in the RISC-V instruction set architecture recently and have undertaken a number of personal projects in this area.

For my initial foray into this realm I wanted to dip my toes in, so I decided to write an emulator for an RV32I processor, the base instruction set. The project quickly grew arms and legs, and I achieved much more than I thought I would at the outset.

The repository can be found here:

In its current state it functions as a user-mode emulator providing system call emulation, which allows virtualized processes to interact with the host OS. It passes many of the relevant RISC-V compliance tests and can even run an extensive set of benchmarks such as Dhrystone, CoreMark, etc.
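To give a flavour of how approachable the base ISA is to emulate, here is a hedged sketch (not code from my emulator) that decodes and executes a single RV32I ADDI instruction:

```cpp
#include <cstdint>

// minimal RV32I state: 32 registers and a program counter
struct rv32i_t {
  uint32_t x[32];   // x0 is hardwired to zero
  uint32_t pc;
};

// execute one I-type ADDI instruction (opcode 0x13, funct3 0).
// returns false if the word is not an ADDI. a real emulator
// dispatches on opcode/funct3/funct7 for the whole ISA.
bool exec_addi(rv32i_t& cpu, uint32_t inst) {
  const uint32_t opcode = inst & 0x7f;
  const uint32_t funct3 = (inst >> 12) & 0x7;
  if (opcode != 0x13 || funct3 != 0)
    return false;
  const uint32_t rd  = (inst >> 7)  & 0x1f;
  const uint32_t rs1 = (inst >> 15) & 0x1f;
  // the 12 bit immediate occupies the top bits and is sign extended
  const int32_t imm = (int32_t)inst >> 20;
  if (rd != 0)  // writes to x0 are discarded
    cpu.x[rd] = cpu.x[rs1] + (uint32_t)imm;
  cpu.pc += 4;
  return true;
}
```

All of the RV32I instruction formats keep rd, rs1 and rs2 in the same bit positions, which is a large part of why the decoder stays so small.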

One of the most exciting features to me, though, is that it can run Doom and Quake at decent speeds.

Seeing DOOM boot on your own VM is very satisfying.
QUAKE requires support for floating point arithmetic.

These two games were the most valuable benchmarks in terms of measuring execution speed, and initially they ran fairly slowly. DOOM was very easy to port, as most of the hard work had been done as part of the DoomGeneric project. Quake, however, required a fairly large amount of rework to build for my virtual machine, which I spun off into its own project, portable_quake, that very much follows the principles of the DOOM port.

My pursuit of speed ultimately led me to write an x86-64 JIT compiler for the VM, allowing me to dynamically translate RISC-V code blocks into x86-64 code. This was a fascinating, rewarding and fairly complex process, but I learned a heck of a lot doing it, which is always my main goal.

There are enough distinct topics in this project that I want to focus on them in more depth in following articles. The project is far from complete, however, and there are a lot of updates in the pipeline.

I also want to share some of the work I have done writing a tiny RV32I processor in Verilog to run on an FPGA, using an open-source flow built around Verilator, APIO and IceStorm.

‘Surge’ for Ludum Dare 36

Last weekend I spent 48 hours making a game from scratch for the Ludum Dare game jam.  You can see the various videos I took during its development below (in reverse order of course :p).


Binary available here:

It was an absolute blast to make and I learned so much during the 48 hours of its development.

The game code was written from scratch in C++, the framework is based on SDL2, music was made in Ableton Live using an Amiga sample pack, pixel art was made with Aseprite and sound effects in SFXR.

Some key learning points were finite state machines for enemy sequences.  Lissajous curves and Hermite splines were used for some of the enemy and boss movement patterns.  I also had my first experience of using SDL_Mixer as an alternative to my usual choice of BASS audio.
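For anyone curious, a Lissajous movement pattern is just two sinusoids with different frequencies driving x and y. A sketch (the constants here are made up, not the ones from Surge):

```cpp
#include <cmath>

struct vec2 { float x, y; };

// evaluate a lissajous curve at time t.
// a and b are the per-axis frequencies; phase offsets the x axis.
// an enemy can simply follow p(t) as t advances each frame, giving
// smooth looping figure-of-eight style paths for free.
vec2 lissajous(float t, float ax, float ay, float a, float b, float phase) {
  return vec2{
    ax * std::sin(a * t + phase),
    ay * std::sin(b * t)
  };
}
```

Varying the a:b ratio changes the figure traced out; 1:2 gives a figure of eight, 3:2 a pretzel, and so on.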

In the preceding week, I had duct-taped together some of my previous small projects, so I was able to leverage a nice spatial hash implementation I wrote to handle all the collisions.  I also made use of a reference-counted object library I made, with an object factory and very simple RTTI.  These elements all worked as well as I could have hoped, requiring just a few tweaks during the compo.

I also think the vector version of the game looks pretty neat and could be an aesthetic worth exploring in the future.

Quake 1 Palette Fun

Recently I have been playing around with the Quake 1 source, the SDL port [1] to be specific.  Quake was released in 1996 by id Software as the highly anticipated successor to Doom.  It was also the first game of its kind to implement a proper 3D engine, quite a leap from the 2.5D engine of Doom et al.


One of the things that interests me about Quake is that it shipped with a highly optimised software rasterizer; and I have a weird fascination with software rendering.  The Quake source also contains an OpenGL rendering system, known as GLQuake, but I don't think the SDL port is quite set up to use it out of the box.

Internally, Quake does all of its rasterization using 8-bit palettized colour.  While I'm sure this palette must be part of the game's pak data files, it was simple for me to extract the palette directly from Quake's memory, as shown below:


Because each colour is represented as 1 byte, the palette can only contain 256 entries.  Each entry in the palette is stored as an RGB triplet, and each byte in the colour buffer is expanded to 24-bit true colour before it is presented on the monitor.  The palette setup is done as follows:

// vid_sdl.c
void VID_SetPalette(unsigned char* palette) {
    int i;
    SDL_Color colors[256];
    for (i = 0; i < 256; ++i) {
        colors[i].r = *palette++;
        colors[i].g = *palette++;
        colors[i].b = *palette++;
    }
    SDL_SetColors(screen, colors, 0, 256);
}

It's clear just by looking at the screenshots that Quake has quite an advanced lighting system for the time, using lightmaps to store all of the incident light at each surface of the world.

Having the source to hand, one of the first things I wanted to do was to disable the surface textures so that I could view just the lightmap by itself.  As it turns out this was quite a tricky task, since Quake makes a few interesting optimizations in this area.

// d_scan.c
// void D_DrawSpans8(espan_t * pspan) {
// ...
do {
    *pdest++ = *(pbase + (s >> 16) +
                         (t >> 16) * cachewidth);
    s += sstep;
    t += tstep;
} while (--spancount > 0);

Looking at the rasterization code above, which forms the inner loop of the surface rendering, it is apparent that only one texture is being indexed (via pbase), despite each surface being the product of the lightmap and the albedo texture.  It is also clear that Quake uses 16:16 fixed-point arithmetic while indexing the lookup texture.
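For anyone unfamiliar with 16:16 fixed point, the integer texel index lives in the top 16 bits and the fraction in the bottom 16, so a right shift by 16 recovers the texel while stepping accumulates sub-texel error for free. A quick illustrative sketch (my own, not Quake source):

```cpp
#include <cstdint>

// 16:16 fixed point helpers: the value is just f * 65536
typedef int32_t fixed16_t;

fixed16_t to_fixed(float f)     { return (fixed16_t)(f * 65536.0f); }
int32_t   int_part(fixed16_t v) { return v >> 16; }

// stepping a texture coordinate across a span is then a plain integer
// add per pixel, which is exactly what s += sstep and t += tstep do
// in the inner loop above
fixed16_t step(fixed16_t v, fixed16_t delta) { return v + delta; }
```

On mid-90s hardware this avoided a float-to-int conversion per pixel, which was expensive at the time.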

A little more digging reveals that once a surface has been deemed visible by Quake's PVS system [2], it gets cached.  During this caching process, a new texture is allocated, which is indeed the product of the lighting contribution and the albedo texture.

// r_surf.c
void R_DrawSurfaceBlock8_mip0(void) {
    int v, i, b, lightstep, lighttemp, light;
    unsigned char pix, *psource, *prowdest;
    psource = pbasesource;
    prowdest = prowdestbase;
    static const int STEPS = 16;
    static const int LIGHTMASK = 0xff00;
    for (v = 0; v < r_numvblocks; v++) {
        // FIXME: make these locals?
        // FIXME: use delta rather than both right and left, like ASM?
        lightleft = r_lightptr[0];
        lightright = r_lightptr[1];
        r_lightptr += r_lightwidth;
        lightleftstep = (r_lightptr[0] - lightleft) / STEPS;
        lightrightstep = (r_lightptr[1] - lightright) / STEPS;
        for (i = 0; i < STEPS; i++) {
            lighttemp = lightleft - lightright;
            lightstep = lighttemp / STEPS;
            light = lightright;
            for (b = STEPS-1; b >= 0; b--) {
                pix = psource[b];
                prowdest[b] = ((uint8_t*)vid.colormap)[(light & LIGHTMASK) + pix];
                light += lightstep;
            }
            psource += sourcetstep;
            lightright += lightrightstep;
            lightleft += lightleftstep;
            prowdest += surfrowbytes;
        }
        if (psource >= r_sourcemax)
            psource -= r_stepback;
    }
}

The function shown above is responsible for calculating the product of the light and the albedo texture, and writing that into the cached texture.  Looking at the innermost loop reveals that psource is the source texture itself and light is the lighting contribution for this texel.  vid.colormap is clearly some kind of lookup table used to combine the two attributes.

To test this out, I set pix to a fixed entry in the Quake palette and fired up the game, to see the following:


Nice! My hunch that pix was the source texel was correct, and now we are presented with only the lighting contributions.  One thing to note is that all dynamic objects in the world still appear textured.  This is because Quake has two rendering pipelines: one for static BSP geometry, which I have been playing around with, and one for dynamic models, which I haven't delved into yet.  That is yet another optimisation.

In order to learn more I dumped vid.colormap and wrote a quick program to transform it by Quake's palette and save it out as the image below:


True to games of that era, it's a lookup table containing pre-multiplied versions of the base palette.  Such an approach is necessary because, while we can easily transform from an 8-bit colour index to the 24-bit colour space we would need to be in to perform our multiplication with the light, it is not possible to go back from a 24-bit colour to the correct 8-bit index.  By using a pre-multiplied palette we can effectively perform the multiplication while staying in the 8-bit colour index space.
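To make the idea concrete, here is a sketch of how such a pre-multiplied table could be built. This is my own illustration of the technique, not Quake's actual table builder; the row layout (row 0 fullbright, matching the screenshots above) is an assumption:

```cpp
#include <cstdint>

// build a pre-multiplied colour table: each row is one light level,
// each column a palette index. rows must be >= 2. 'palette' holds
// 256 rgb byte triplets, 'table' receives rows * 256 indices.
// sketch only; quake's own colormap builder may differ in detail.
void build_colormap(const uint8_t* palette, uint8_t* table, int rows) {
  for (int l = 0; l < rows; ++l) {
    const int scale = rows - 1 - l;  // row 0 brightest, last row black
    for (int c = 0; c < 256; ++c) {
      // scale this palette entry by the row's light level
      const int r = palette[c * 3 + 0] * scale / (rows - 1);
      const int g = palette[c * 3 + 1] * scale / (rows - 1);
      const int b = palette[c * 3 + 2] * scale / (rows - 1);
      // find the nearest palette entry, staying in 8 bit index space
      int best = 0, best_d = 1 << 30;
      for (int i = 0; i < 256; ++i) {
        const int dr = r - palette[i * 3 + 0];
        const int dg = g - palette[i * 3 + 1];
        const int db = b - palette[i * 3 + 2];
        const int d = dr * dr + dg * dg + db * db;
        if (d < best_d) { best_d = d; best = i; }
      }
      table[l * 256 + c] = (uint8_t)best;
    }
  }
}
```

The expensive nearest-entry search happens once, offline; at runtime the renderer only ever does the single table lookup seen in the inner loop above.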

If we instead set light in the above function to zero so that we are always indexing the top row of the colour table, we should render without any lighting contributions.


Another interesting point is that Quake also performs mip mapping [3], the benefits of which are twofold: using smaller textures for far away surfaces reduces memory bandwidth and uses less cache, while also avoiding the visual distortions caused by aliasing.

These changes must be made to each of the following:

// r_surf.c
void R_DrawSurfaceBlock8_mip0(void);
void R_DrawSurfaceBlock8_mip1(void);
void R_DrawSurfaceBlock8_mip2(void);
void R_DrawSurfaceBlock8_mip3(void);

When Quake performs its caching, depending on the shape, size and slope of the polygon, one of the above functions will be chosen from a table of function pointers, surfmiptable.

// r_surf.c
// void R_DrawSurface(void) {
if (r_pixbytes == 1) {
    pblockdrawer = surfmiptable[r_drawsurf.surfmip];
}



One fun modification I wanted to make to the rasterizer was to add Unreal-style filtering [4] to Quake, a challenge made more fun because it has to be done using 16:16 fixed point.  The block below shows the parts I changed:

// d_scan.c

#define DITHER 1

void D_DrawSpans8(espan_t* pspan) {
    // ...
#if (DITHER)
    int d_s, d_t;
#endif

    // ...

    pbase = (unsigned char*)cacheblock;

    // ...

    do {
        pdest = (unsigned char*)((byte*)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);

#if (DITHER)
        // dither offsets
        d_s = ((pspan->v&1) << 16);
        d_t = ((pspan->u&1) << 16) ^ d_s;
#endif

        // ...

        do {
            // calculate s and t at the far end of the span
            spancount = (count>=8) ? 8 : count;
            count -= spancount;

            // ...

            do {
#if (DITHER)
                // compute dither offsets for texel
                int s_offs = (s - d_s + 0x8000);
                int t_offs = (t - d_t + 0x8000);
                // advance the dither pattern
                d_s ^= 0x10000;
#else
                int s_offs = s;
                int t_offs = t;
#endif
                *pdest++ = *(pbase + (s_offs >> 16) +
                                     (t_offs >> 16) * cachewidth);
                s += sstep;
                t += tstep;

            } while (--spancount > 0);

            s = snext;
            t = tnext;

        } while (count > 0);

    } while ((pspan = pspan->pnext) != NULL);
}

In the end it looks as follows:



This is a fairly dodgy hack, to say the least; as you can see, we are reading +/-1 texels out of bounds around the edges of each texture.  It will require a bit more digging around in the source to correct this problem, however.



If any of this was interesting to you, Fabien Sanglard has written a fantastic teardown of the Quake source code [5].  He must also like the visual appearance of the Unreal texture filter, and did something similar for Quake 2; see the end of this page [6].  Note, however, that the original flipcode article contains a subtle bug that is also visible in Fabien's screenshots.

The original flipcode article suggests a kernel which adds a small bias to the uv coordinates of half a texel which is undesirable:

          (X&1)==0        (X&1)==1
(Y&1)==0 | u+=.25,v+=.00  u+=.50,v+=.75
(Y&1)==1 | u+=.75,v+=.50  u+=.00,v+=.25

A kernel such as the following should be roughly equivalent, yet adds no bias:

          (X&1)==0        (X&1)==1
(Y&1)==0 | u+=.5,v+=.5    u-=.5,v+=.5
(Y&1)==1 | u-=.5,v-=.5    u+=.5,v-=.5



[1] The SDL port was written by Sam Oscar Lantinga, the author of SDL, and can be found here:

[2] Potential Visibility Set:

[3] Mip Mapping:

[4] Unreal style filtering:

[5] Quake source code review:

[6] Quake 2 rendering:


random.h

random.h is a header-only random number generation library for games, image processing and procedural content generation.

A few years back I wrote some handy floating point random number generation routines for a raytracer.  Time and again, I have needed these routines for other programs I have been working on.  It seems like a good idea to wrap them up in a small header and release them in the public domain for others to use too.

Each RNG function takes a 64-bit integer as a seed value and permutes it in place.  The resulting bits are used to generate the result.  I like this design as it keeps the RNG functions stateless and thread-safe.  You just have to keep track of the uint64_t you have been using as your RNG stream.
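The pattern looks roughly like this. The sketch below uses an xorshift64 permutation and may well differ from the exact generators random.h uses, though the function names mirror the library's:

```cpp
#include <cstdint>

// permute the seed in place (xorshift64 here; the seed must be
// nonzero), then derive the result from the new bits. the caller
// owns the uint64_t stream state, so the function itself is
// stateless and thread safe by construction.
uint64_t rand64(uint64_t& x) {
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  return x;
}

// unsigned random float in [0, 1]
float randfu(uint64_t& x) {
  // use the top 24 bits to fill the float mantissa
  return (float)(rand64(x) >> 40) * (1.0f / 16777216.0f);
}

// signed random float in [-1, +1]
float randfs(uint64_t& x) {
  return randfu(x) * 2.0f - 1.0f;
}
```

Two callers holding the same seed value always see the same stream, which also makes results reproducible across runs.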

Included are various 1, 2 and 3 dimensional random number generators with a range of different distributions.

Triangular random noise is very useful when dithering audio and images during bit-depth conversion.  Typically, triangular noise is added to the signal immediately before type casting (at the same magnitude as the LSB of the destination bit depth).
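A triangular (TPDF) sample is simply two uniform samples combined, here averaged to keep the [-1,+1] range. The sketch below shows the dither-before-cast idea; the helper implementations and the 16-bit target are my own illustration, not the library's code:

```cpp
#include <cstdint>
#include <cmath>

// uniform signed random in [-1, +1] (xorshift64 based sketch)
static float randfs(uint64_t& x) {
  x ^= x << 13; x ^= x >> 7; x ^= x << 17;
  return (float)(x >> 40) * (2.0f / 16777216.0f) - 1.0f;
}

// triangular random in [-1, +1], denser around 0
static float trandfs(uint64_t& x) {
  return (randfs(x) + randfs(x)) * 0.5f;
}

// quantize a [-1, +1] float sample to 16 bit, with triangular dither
// added at the magnitude of one output lsb, as described above
int16_t quantize16(float sample, uint64_t& seed) {
  const float lsb = 1.0f / 32768.0f;
  float v = sample + trandfs(seed) * lsb;
  if (v >  1.0f) v =  1.0f;
  if (v < -1.0f) v = -1.0f;
  return (int16_t)std::lround(v * 32767.0f);
}
```

The dither decorrelates the quantization error from the signal, trading a touch of broadband noise for the absence of harmonic distortion.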

The 2D and 3D vector RNG functions generate statistically unbiased vectors inside and on the unit circle or sphere (generating vectors by randomizing the components and normalizing would bias the vectors toward the corners of the unit cube).  I have used these functions successfully in the past for generating ambient occlusion maps.
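One common way to achieve this is rejection sampling: draw from the cube and discard points outside the sphere. The sketch below illustrates that idea; random.h itself may well use a different method:

```cpp
#include <cstdint>
#include <cmath>

struct vec3 { float x, y, z; };

// uniform signed random in [-1, +1] (xorshift64 based sketch)
static float randfs(uint64_t& x) {
  x ^= x << 13; x ^= x >> 7; x ^= x << 17;
  return (float)(x >> 40) * (2.0f / 16777216.0f) - 1.0f;
}

// unbiased random vector inside the unit sphere, by rejection:
// sample the cube and discard anything outside the sphere, so no
// direction is favoured over any other.
vec3 vrand3d(uint64_t& seed) {
  for (;;) {
    const vec3 v{ randfs(seed), randfs(seed), randfs(seed) };
    if (v.x * v.x + v.y * v.y + v.z * v.z <= 1.0f)
      return v;
  }
}

// normalized version: on the surface of the unit sphere
vec3 nvrand3d(uint64_t& seed) {
  for (;;) {
    const vec3 v = vrand3d(seed);
    const float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
    if (len2 > 1.e-6f) {  // avoid dividing by a tiny length
      const float s = 1.0f / std::sqrt(len2);
      return vec3{ v.x * s, v.y * s, v.z * s };
    }
  }
}
```

The sphere fills roughly 52% of the cube, so the rejection loop only needs about two attempts on average.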

I wrote a little program to generate some images to display the output distribution of each function.  A 3D point is generated by filling its components from the RNG function under test, and the resulting point is projected in space, as well as onto each component axis (the red, green and blue planes).  Hopefully the images help convey the distribution of each function.

Again, you can find the library here:

randfu() - unsigned random 
range [0,1]


randfs() - signed random 
range [-1,+1]


trandfs() - triangular random 
range [-1,+1], tends to 0


grandfs() - gaussian random 
range ~[-1,+1], tends to 0


vrand3d() - 3d vector random 
magnitude range: [0, 1]


nvrand3d() - normalized 3d vector random 
magnitude = 1


vrand2d() - 2d vector random
magnitude range: [0,1]


nvrand2d() - normalized 2d vector random
magnitude = 1


Binary Heap

Despite the lack of activity here, I have been very busy programming cool stuff which I hope to get around to documenting and sharing with you all.

I needed to implement the A* algorithm for one of my projects.  It's a fairly common speed-up of the implementation to use a binary heap for the open list, because of its O(log n) insertion and removal properties and O(1) access to the best node.  As I coded it up I thought it would be nice to share my binary heap implementation, something which can be a bit fiddly to get right in practice.  I wrote a little fuzz tester to nuke it, so I can be fairly sure it works correctly.

Gist: BinaryHeap

I chose to make the size fixed at compile time for simplicity, and because I am happy to terminate a pathfinding search that exceeds a fixed open list size, as old-school games used to do.  It should however not require much work at all to grow the heap when a node is pushed while at full capacity.
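For reference, the core of a fixed-capacity binary min-heap looks roughly like this; it is a fresh sketch of the same idea, not the code from the gist:

```cpp
#include <cstdint>
#include <cassert>

// fixed capacity binary min-heap stored in a flat array.
// parent of node i is (i-1)/2, children are 2i+1 and 2i+2.
template <typename T, uint32_t SIZE>
struct heap_t {
  T data[SIZE];
  uint32_t count = 0;

  bool push(T v) {
    if (count >= SIZE)
      return false;          // full: caller can terminate the search
    uint32_t i = count++;
    data[i] = v;
    // sift up while smaller than the parent
    while (i > 0 && data[i] < data[(i - 1) / 2]) {
      T t = data[i]; data[i] = data[(i - 1) / 2]; data[(i - 1) / 2] = t;
      i = (i - 1) / 2;
    }
    return true;
  }

  T pop() {
    assert(count > 0);
    T top = data[0];
    data[0] = data[--count];
    // sift down toward the smaller child
    uint32_t i = 0;
    for (;;) {
      uint32_t c = i * 2 + 1;
      if (c >= count) break;
      if (c + 1 < count && data[c + 1] < data[c]) ++c;
      if (!(data[c] < data[i])) break;
      T t = data[i]; data[i] = data[c]; data[c] = t;
      i = c;
    }
    return top;
  }
};
```

The fiddly part is always the sift-down choosing the smaller child, which is exactly what a fuzz tester against a sorted reference catches quickly.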

I will also leave you with this picture from my latest project:


Crap JIT

Another weekend has passed, which means another weekend project has been finished.  I'm not sure how I managed this, as I currently spend most of my spare time shooting super mutants and ghouls in the face.

The project this time… A really sucky JIT compiler.

My recent success writing a small compiler and VM for a made-up language really got me interested in this area once again.  My compiler would generate bytecode for a stack-based architecture and then my VM would emulate this.

While this approach works, it's not very efficient; the bytecode is never going to change after it's compiled, yet we still pay the cost of dynamic dispatch.  I had an idea… could I generate sequences of host assembly instructions that directly encapsulate each of my bytecodes?  If I could do this, then I could just string these little assembly blocks together in the same order as the bytecode and invoke it directly on the host.

I chose to target x86 code for my proof of concept because I am most familiar with it.  The same approach should, in theory, work for almost any architecture.

I kept the bytecode basically the same as before for simplicity.  Next up I wrote an assembly listing for nasm that contained all of these atomic little assembly blocks.  A quick and dirty Python script takes the binary output from nasm and the generated map file, and outputs a generated cpp file containing all of the assembly chunks.

The assembly chunks start off life as follows:

; ins_const
    mov  eax, 0xaabbccdd
    push eax
; ins_getl
    mov  eax, [ebp+0xaabbccdd]
    push eax
; ins_setl
    pop  eax
    mov  [ebp+0xaabbccdd], eax
; ins_add
    pop  eax
    add  [esp], eax

And after assembly and a pass through my script, I end up with the following:

{ ins_const   , 6 ,  1, -1, "\xb8\xdd\xcc\xbb\xaa\x50" },
{ ins_getl    , 7 ,  2, -1, "\x8b\x85\xdd\xcc\xbb\xaa\x50" },
{ ins_setl    , 7 ,  3, -1, "\x58\x89\x85\xdd\xcc\xbb\xaa" },
{ ins_add     , 4 , -1, -1, "\x58\x01\x04\x24" },

Some instructions take operands, however, which the script deduces by looking for the operand constant 0xaabbccdd in the output bytestream after assembly.  If the least significant bytes have changed after assembly then it's fairly clear this is a relative operand, which is slightly more involved to fix up.
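Emitting a chunk then boils down to a copy plus a patch of the placeholder bytes. The structure below is my own illustration, not crapjit's actual tables; reading the generated table above, the second and third columns look like a size and an operand offset, and I mirror that guess here (it also assumes a little-endian host, which x86 is):

```cpp
#include <cstdint>
#include <cstring>

// one pre-assembled chunk: its bytes and the byte offset of the
// 0xaabbccdd operand placeholder (-1 if it takes no operand)
struct chunk_t {
  const uint8_t* code;
  uint32_t size;
  int32_t operand_offset;
};

// copy a chunk into the code buffer, patching in the real operand,
// and return the new write cursor
uint8_t* emit(uint8_t* out, const chunk_t& c, uint32_t operand) {
  memcpy(out, c.code, c.size);
  if (c.operand_offset >= 0)
    memcpy(out + c.operand_offset, &operand, sizeof(operand));
  return out + c.size;
}
```

Relative operands (jumps, calls) need an extra pass once label addresses are known, which is where the relocation handling mentioned below comes in.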

Most of the effort was spent wrapping up the interface to crapjit, to make it easy to build a code buffer and handle relocations and labels.

As usual, to prove things work, I decided to JIT a function to compute the factorial of a number.  The code to JIT this looks as follows:

label_t jit_factoral(crapjit_t & j) {
    reloc_t l0;
    label_t fn;
    // prologue
    fn = j.emit_label();
    // if (val <= 1)
    l0 = j.emit_jz();
    // return 1
    // else
    // return val * factoral(val - 1)
    return fn;
}

Bear in mind that this is a stack architecture we are emulating, so it's fairly inefficient to start with.  I would have felt deeply ashamed if I didn't at least include one kind of optimization in crapjit.  Noticing that many assembly chunks start with 'pop eax' and end with 'push eax', when chaining such chunks these instruction pairs can be removed, since together they amount to a no-op.
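A naive sketch of that chaining optimization might look like the following. Matching on raw opcode bytes like this is only safe at known chunk boundaries, and crapjit's real bookkeeping differs; this is just my illustration of the idea:

```cpp
#include <cstdint>
#include <cstring>

// when chaining chunks, a trailing 'push eax' (0x50) followed by a
// leading 'pop eax' (0x58) cancel out, so both bytes can be dropped.
// appends 'chunk' to 'out' (current length 'len'), returns new length.
uint32_t concat(uint8_t* out, uint32_t len,
                const uint8_t* chunk, uint32_t size) {
  if (len > 0 && size > 0 && out[len - 1] == 0x50 && chunk[0] == 0x58) {
    memcpy(out + len - 1, chunk + 1, size - 1);
    return (len - 1) + (size - 1);
  }
  memcpy(out + len, chunk, size);
  return len + size;
}
```

For the getl/setl pair from the table above this saves two bytes and two instructions per joint, which adds up quickly in expression-heavy code.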

The generated code is absolutely horrible, yet functional:

00320000  push        ebp
00320001  mov         ebp,esp
00320003  sub         esp,0
00320009  mov         eax,dword ptr [ebp+8]
0032000F  push        eax
00320010  mov         eax,1
00320015  pop         edx
00320016  cmp         edx,eax
00320018  setle       al
0032001B  and         eax,1
0032001E  cmp         eax,0
00320021  je          00320034
00320027  mov         eax,1
0032002C  pop         ebp
0032002D  add         esp,0
00320033  ret
00320034  mov         eax,dword ptr [ebp+8]
0032003A  push        eax
0032003B  mov         eax,dword ptr [ebp+8]
00320041  push        eax
00320042  mov         eax,1
00320047  sub         dword ptr [esp],eax
0032004A  call        00320000
0032004F  add         esp,4
00320055  mul         eax,dword ptr [esp]
00320058  mov         dword ptr [esp],eax
0032005B  pop         eax
0032005C  pop         ebp
0032005D  add         esp,0
00320063  ret

One side benefit of targeting x86 is that the ABI is trivially compatible with my stack-based architecture, so I was able to directly call into my JIT code without any kind of interface shim.  That was a nice freebie.

All the code for this is on GitHub and has been tested on Windows and Linux.

There is loads of scope to improve the generated output from crapjit.  One approach would be to store all of the IR instructions in a buffer, and then, as a final synthesis step, use a maximal munch tiling algorithm to cover the IR in assembly chunks.  Some of the worst-case code could be improved using this method.  Furthermore, the IR in the buffer itself could have all the usual optimization passes executed on it.  Lastly, a final peephole optimizer could clean up a ton of redundant operations in the output assembly.

It was fun to see that a simple JIT compiler could be made over a single weekend, during the breaks from playing Fallout 4.  Even if the output code is shockingly poor, it's functional, and certainly faster than interpreting bytecode, so let's not be too quick to poke fun at it.

A simpler tinier expression parser

In my previous post I showed the expression parser that I have been using in some of my hobby projects.  I always thought of it as fairly small and neat in its operation.  It made sense to me and I was happy with it.  However… since writing that post, I have been thinking about expression parsing again and I have made some discoveries.

Last night I tried to eliminate the explicit stack from my expression parser and instead use plain recursion and the machine stack itself.  I came up with the following algorithm, which works great, and I was amazed at how tiny it is:

int prec(char c) {
    switch (c) {
    case ('=') : return 1;
    case ('&') : return 2;
    case ('+') : return 3;
    case ('*') : return 4;
    default:     return 0;
    }
}

void level(int l) {
    while (char t = peek()) {
        if (t == ')') return;
        if (prec(t) > l) {
            level(l + 1);
        }
        else if (prec(t) == l) {
            t = next();
            level(l + 1);
            putc(t);
        }
        else {
            // non operator: emit the operand, otherwise a looser
            // binding operator belongs to our caller
            if (prec(t) == 0) { putc(next()); continue; }
            return;
        }
    }
}

void expr() {
    level(1);
}

Since nobody can come up with anything original these days, I wanted to know more about the algorithm I had just found, since it surely had been discovered by those who came before me. It turns out this is a crappy, weird version of a precedence climbing parser. This link here does a very good job of explaining how it works, and how to extend it to handle things like differing associativity, which I had not considered.

After seeing that article I wanted to try out the gist of the algorithm as they present it. Here is the full source of the little test program I made to play around with it:

#include <stdio.h>
#include <assert.h>

// the source expression
static const char * g = "z=a+b*c+d=e";
//static const char * g = "a+b&c";
//static const char * g = "a+(b&c)";

// lexer duties
char next()         { if (*g) return *g++; else return '\0'; }
char peek()         { return *g; }
void putc(char c)   { putchar(c); }
void expect(char c) { char t = next(); assert(t == c); }
void expr(int v);

// munch an itentifier or parenthesized expression
void ident() {
    char n = next();
    if (n == '(') { expr(1); expect(')'); return; }
    assert(n >= 'a' && n <= 'z');
    putc(n);
}

// the precedence table
int prec(char c) {
    switch (c) {
    case ('=') : return 1;
    case ('&') : return 2;
    case ('+') : return 3;
    case ('*') : return 4;
    default:     return 0;
    }
}

// precedence climbing expression parser
void expr(int min_prec) {
    ident();                          // parse the left hand side
    while (prec(peek()) >= min_prec) {
        char c = next();
        // parse the right hand side at a higher minimum precedence
        // ('=' keeps its own precedence here, making it right associative)
        expr(c == '=' ? prec(c) : prec(c) + 1);
        putc(c);                      // emit the operator in postfix
    }
}

int main() {
    expr(1);
    putc('\n');
    return 0;
}

This version is much, much smaller and neater than either of the versions I came up with. I will definitely be ditching my strange methods for this one. I am always happy to learn something new, especially when it's better than what I was previously doing. I also boldly said expressions were the hardest part of a recursive descent compiler in my previous post… well, I certainly don't think that's true anymore!

A simple tiny expression parser

I mentioned in my previous post that I parse expressions in my mini language using a strange shunting yard algorithm.  It mixes explicit stack management and recursive descent.  Its operation makes intuitive sense to me, which is why I use it, and the code to implement it happens to be very short.

An expression parser is one of the hardest parts of a recursive descent parser to write, but with a little practice it doesn't have to be hard at all.  I'm sure there are much better ways out there to implement an expression parser, so I would be interested to see them if anyone has any good links.  Are they smaller than the ~100 lines (including comments) this one takes up?

The code presented below is almost identical in operation to the one from my mini parser.  I added a tiny Lexer and REPL loop to it so it can be used as a very simple infix to postfix expression evaluator.  I also used the C libraries setjmp() function to add very simple error handling.

It supports the normal operators [+ - * /], parentheses, the unary not operator [!], and function calls.  Only single-character variables and literals are supported, so that the focus remains on the parser rather than the lexer.  Spaces are not supported either.

Try a simple expression such as:
> A*B-C(1+3)
> 1*(3+4)
> 2*!A

In the event that wordpress feels the need to mangle my source code here is an alternative link which will maintain the formatting:

#include <assert.h>
#include <stdint.h>
#include <vector>
#include <setjmp.h>
#include <stdio.h>

typedef char token_e;

namespace {
// return the precedence of an operator
int op_prec(token_e o) {
	switch (o) {
	case ('&'): return  1;
	case ('+'): 
	case ('-'): return  2;
	case ('*'): 
	case ('/'): return  3;
	default:    return  0;
	}
}

// return true if token is an operator
bool is_op(token_e o) {
	return op_prec(o) != 0;
}

// return true if token is a variable identifier
bool is_var(token_e o) {
	return (o>='A' && o<='Z') || (o>='a' && o<='z');
}

// return true if token is a numeric literal 
bool is_num(token_e o) { 
	return (o>='0' && o<='9');
}
} // namespace {}

struct parser_t {

	std::vector<token_e> op_;
	const char * s_;
	jmp_buf error_buf;

	void error() {
		longjmp (error_buf, 1);
	}

	parser_t(const char * stream)
		: op_()
		, s_(stream)
	{
	}

	char found_op () { return (is_op  (*s_)) ? *(s_++) : 0; }
	char found_var() { return (is_var (*s_)) ? *(s_++) : 0; }
	char found_num() { return (is_num (*s_)) ? *(s_++) : 0; }

	// check for specific token in the token stream
	bool found(char ch) {
		if (*s_ == ch) {
			++s_;
			return true;
		}
		return false;
	}

	// unconditionally expect a token in the token stream
	void expect(char ch) {
		if (*(s_++) != ch) {
			printf ("\nexpected '%c'\n", ch);
			error ();
		}
	}

	// ---- ---- ---- ---- ---- ---- ---- ---- <expression parser>

	void pop_op() {
		assert(op_.size() > 0);
		token_e t = op_.back();
		op_.pop_back();
		putchar (t);
	}

	void push_op(uint32_t tide, token_e tok) {
		while (op_.size() > tide) {
			if (op_prec (op_.back ()) > op_prec (tok))
				pop_op ();
			else
				break;
		}
		op_.push_back(tok);
	}

	void pop_all_op(uint32_t tide) {
		while (op_.size() > tide)
			pop_op ();
	}

	void parse_lhs(uint32_t tide) {
		// parenthesis
		if (found('(')) {
			parse_expr ();
			expect (')');
		}
		// identifier
		else if (char var = found_var()) {
			// identifier name
			putchar (var);
			// index an object
			while (found('.')) {
				if (char ix = found_var()) {
					putchar (ix);
					putchar ('@');
				}
				else {
					printf ("\nexpecting identifier, got '%c'\n", *s_);
					error ();
				}
			}
			// function call
			if (found('(')) {
				uint32_t num_args = 0;
				// argument list
				if (!found(')')) {
					do {
						parse_expr ();
						++num_args;
					} while (found(','));
					expect (')');
				}
				// call instruction ('$' used as a stand-in marker)
				putchar ('$');
			}
		}
		// numeric literal
		else if (char num = found_num()) {
			putchar (num);
		}
		// unexpected token
		else {
			printf ("\nunexpected token '%c'\n", *s_);
			error ();
		}
	}

	// parse an expression with extended control
	void parse_expr_ex(uint32_t tide) {
		bool unary_not = found ('!');
		parse_lhs (tide);
		if (unary_not) {
			// logical not instruction
			putchar ('!');
		}
		if (char op = found_op()) {
			push_op (tide, op);
			parse_expr_ex (tide);
		}
	}

	// parse a full expression
	void parse_expr() {
		uint32_t tide = op_.size();
		parse_expr_ex (tide);
		pop_all_op (tide);
	}

	// ---- ---- ---- ---- ---- ---- ---- ---- </expression parser>

	// parse root
	void parse() {
		if (!setjmp(error_buf)) {
			parse_expr ();
		} else {
			printf ("unable to parse\n");
		}
	}
};

// supported operators: 
//      &       and
//      + -     add sub
//      * /     mul div
//      !       unary not
// example input:
//      A+B*C*(0-3)
//      A&!B
//      2+A(B,C+1)
int main() {
	char buffer[1024];
	// little REPL loop
	for (;;) {
		printf ("expr: ");
		if (!fgets(buffer, sizeof(buffer), stdin))
			break;
		if (buffer[0] == '\0' || buffer[0] == '\n')
			break;
		parser_t parser(buffer);
		printf ("post: ");
		parser.parse ();
		printf ("\n\n");
	}
	return 0;
}

Mini Compiler and Stack Machine VM

This past weekend, I achieved a long-standing goal… I created a working compiler and virtual machine for a made-up mini language.  The target mini language is intentionally tiny and minimal.

Here are some buzzwords: it's an entirely handwritten recursive descent compiler. For expressions I use a slightly strange shunting yard implementation that I dreamed up a while back.  The target ISA is stack-based, similar to Java, Pascal or early Lua.  There is only one data type… the 32-bit integer.

At this point I’ll just throw up a link to the repo:
Be warned however that there is a pungent reek of code smell emanating from there (it was a weekend project after all).  So gas masks are advisable.

Some important constraints of the language are: only integers are supported, there are no globals or arrays or any such thing, and semantic checking is completely ignored.

It also helps to see a code fragment of the target language.  Here is the factorial example I used to prove to myself that recursion worked fine:

function factoral(val)
    if (val <= 1)
        return 1
    else
        return val * factoral(val - 1)

function main()
    return factoral(4)

The above code currently translates into the following instructions:

-- factoral
0000 INS_GETV   -1
0005 INS_CONST  1
0010 INS_LEQ   
0011 INS_NOT   
0012 INS_CJMP   37
0017 INS_CONST  1
0022 INS_RET    1
0027 INS_CONST  1
0032 INS_CJMP   64
0037 INS_GETV   -1
0042 INS_GETV   -1
0047 INS_CONST  1
0052 INS_SUB   
0053 INS_CALL   0
0058 INS_MUL   
0059 INS_RET    1
0064 INS_CONST  0
0069 INS_RET    1

-- main
0074 INS_CONST  4
0079 INS_CALL   0
0084 INS_RET    0
0089 INS_CONST  0
0094 INS_RET    0

The syntax of my mini language is somewhat similar to Lua, whose code I always found nice to look at.  I first learned real programming using BlitzBasic, so I have a soft spot for the BASIC syntax.  Also, the author of BlitzBasic, Mark Sibly, is one of my heroes!

One of the more interesting aspects of this little project was the design and implementation of the stack machine the VM emulates. I have read a fair bit about stack machines in the past, but somehow I never quite grokked their operation.  My main sticking point was the ABI: how are parameters passed and used, how are locals stored and used, how is a frame constructed?

I had visions that I would need loads of stacks for different things: locals, return points, arguments, etc.  This turned out to be way off, and I was able to make a stack machine that used just two stacks; even this could certainly be reduced to just one with a bit of rework.  I think there are advantages to maintaining multiple stacks, however.

Things clicked for me when I realized I could use one stack for computing expressions as well as for storing locals and arguments.  The second stack stores procedure frames, each capturing a ‘frame pointer’ and a ‘return location’. The top procedure frame on the stack always refers to the function currently being executed.

In my system, the frame pointer is somewhat conventional: procedure arguments can be found immediately below it on the stack, and local variables immediately above it.  This makes code generation simple, as accessing locals and arguments becomes independent of the head of the stack, and of each other.

During the course of execution the stack may look as follows:

08 scratch <-- top of stack is scratch for expression
07 var0    <-- FP (top)
06 arg1
05 arg0
04 scratch <-- may have scratch space if call part of incomplete expression
03 var1
02 var0    <-- FP (older)
01 arg1
00 arg0

Scratch space is the term I’m assigning to the fragments of a not-yet-fully-evaluated expression.  For something like ‘a+b’, both a and b will live on the stack before they are combined via the addition.  If a function is called as part of an expression, these fragments will still hang around until it returns and evaluation resumes.
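To illustrate, using the instruction names from the listing above (INS_ADD is my assumption here; the disassembly only happens to show SUB and MUL), an expression like ‘a + f(b)’ might compile to something like:

    INS_GETV  -2        -- push a: scratch, waiting for the add
    INS_GETV  -1        -- push b: the argument for f
    INS_CALL  <addr>    -- call f; a stays put beneath the new frame
    INS_ADD             -- f has returned; pop both, push the sum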

My stack machine has two instructions, GETV and SETV, that provide everything I need to work with arguments and locals.  Let’s look at the operation of GETV:

    GETV <int32_t operand>:
    1. index = frame.fp + operand
    2. push(stack[index])

GETV will index the stack relative to FP+<operand> and place that element on the top of the stack.  This has the effect that operands >= 0 refer to locals and operands < 0 refer to arguments.

    SETV <int32_t operand>:
    1. index = frame.fp + operand
    2. stack[index] = pop()

SETV is much the same; however, it pops the top item off the stack and overwrites the element that it indexes.

With these two instructions, my code can easily read and write arguments and locals.  Perfect!

Unlike in a regular register ISA, local variables are not allocated by moving a stack pointer; all that is needed to allocate a local is to push an item onto the stack and not pop it.  This has the effect of incrementing the top of the stack, reserving space which can now be indexed relative to the FP.  In my mini language, a variable declaration is preceded by the ‘var’ keyword, which makes it really easy to know that we don’t need to pop anything after the statement has finished.  The value will happily sit there on the stack and can now be assigned to and read from.
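So a declaration like ‘var x = 5’ need only compile to a single push that is never popped (instruction name from the listing above; the slot index is illustrative):

    INS_CONST  5    -- push 5... and simply never pop it;
                    -- x now lives at fp+0 until RET discards it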

One more trouble point is returning from a function… how do we know how to clean up the stack?

I took the easy route and made my RET instruction fairly complex.  RET takes one operand: the number of items to discard from the stack.  In this manner, RET can discard all local variables and arguments, similar to a callee-cleanup calling convention.  The parser knows how many arguments and locals have been declared, so it becomes trivial to emit this instruction.

There was one kink in this plan, however… if we want to return a value from a function, how do we do that?  Since the return value will be sitting on the top of the stack at the point we return, we can’t simply discard all of the top elements.

My RET function operates as follows:

RET <int32_t operand>
    1. save the top item on the stack (return value)
    2. discard <operand> items from the stack
    3. push the previously saved top item
    4. pc = frame.return_location
    5. discard the top procedure frame

That is a fairly meaty RET function, but I didn’t spend too long thinking about it and it seemed justified.

To simplify things in my compiler, I enforce that every function must return a value, which defaults to 0 if no return statement is issued.  This is crude, and is implemented by issuing a { CONST 0, RET } instruction pair at the end of every function.

I believe this is the same trick that Lua uses.  It may seem wasteful, but it keeps things simple stupid.

So to conclude, how good is my mini language?  Well…

…Its crap…

…but it works…

…and I learned a ton!
I would chalk this up as a complete success.

The code is highly smelly, but hey! This was a weekend project.  It’s up on GitHub.