Coroutines, x64 and Visual Studio

I really like co-routines, finding them really useful for programming game AI and scripting.  Since visual studio has dropped support for inline assembly when compiling x64 code I thought I would be out of luck trying to implement co-routines for this target.  Fortunately visual studio can still compile assembly files using masm in x64 mode.  In order to implement co-routines I had to familiarities myself with x64 calling conventions, something that has changed quite a bit from x86.  The first 4 arguments are passed on the stack, there are many more callee save registers, and more obviously all registers and pointers have been extended to 64bits.

Below is a small demo showing how to implement co-routines in visual studio for x64 targets.

#include <stdio.h>

typedef unsigned long long u64;
typedef void * user_t;
typedef void *stack_t;

typedef void (*cofunc_t)( stack_t *token, user_t arg );

/* coroutine functions */
extern "C" void yield_( stack_t *token );
extern "C" void enter_( stack_t *token, user_t arg );

/* artificial stack */
const int nSize = 1024 * 1024;
static char stack[ nSize ] = { 0 };

/* prepare a coroutine stack */
void prepare( stack_t *token, void *stack, u64 size, cofunc_t func )
    u64 *s64     = (u64*)( (char*)stack + size);
         s64    -= 10;                 // 10 items exist on stack
         s64[0]  = 0;                  // R15
         s64[1]  = 0;                  // R14
         s64[2]  = 0;                  // R13
         s64[3]  = 0;                  // R12
         s64[4]  = 0;                  // RSI
         s64[5]  = 0;                  // RDI
         s64[6]  = (u64) s64 + 64;     // RBP
         s64[7]  = 0;                  // RBX
         s64[8]  = (u64) func;         // return address
         s64[9]  = (u64) yield_;       // coroutine return address
          *token = (stack_t) s64;      // save the stack for yield

/* coroutine function */
void threadFunc( stack_t *token, user_t arg )
    for ( int i=0; i<10; i++ )
        printf( "  coroutine %d\n", i );
        yield_( token );

/* program entry point */
int main( )
    stack_t token = nullptr;

    /* prepare the stack */
    prepare( &token, stack, nSize, threadFunc );
    /* enter the coroutine */
    enter_( &token, (void*)0x12345678 );
    /* simple test loop */
    for ( int i=0; i<10; i++ )
        printf( "main thread %d\n", i );
        yield_( &token );
    /* program done */
    printf( "program exit\n" );
    getchar( );



;---- ---- ---- ---- ---- ---- ---- ----
; coroutine yield function
;   : void yield_( void * token );
;   'token' -&amp;gt; RCX
yield_ proc

    push RBX
    push RBP
    push RDI
    push RSI
    push R12
    push R13
    push R14
    push R15

    mov  RAX ,  RSP
    mov  RSP , [RCX]
    mov [RCX],  RAX

    pop R15
    pop R14
    pop R13
    pop R12
    pop RSI
    pop RDI
    pop RBP
    pop RBX


yield_ endp

;---- ---- ---- ---- ---- ---- ---- ----
; enter a co-routine
;   : void enter_( void * token, void * arg1, ... );
;   'token'     -&amp;gt; RCX
;   'arg1, ...' -&amp;gt; RDX, R8, and R9
enter_ proc

    jmp yield_

enter_ endp


Playing with MicroThreads

I recently bought Game Programming Gems 2 and was captivated by an article titled ‘Micro-Threads for Game Object AI’.

A really neat technique is describes that can performing very fast cooperative multitasking, also sometimes referred to as co routines.  The heart of this technique involves manipulating the processor stack pointer to switch between multiple contexts of execution.  Its very similar in concept to true multitasking, however each thread decides when to share the processor.  The concept is simple enough, but really so powerful and radically different from many other techniques for AI I have used.  I knew I would have to program up a text to play with this idea.  Using this method, each object in my game can pretend it is running in its own thread and I am sure this can find application elsewhere too.

The article provides a nice description, but doesn’t provide enough information to directly copy an implementation, and my copy arrived without the CD, but that is fine however and I am having fun of working it out my own.  I have made a few projects in the past which manipulated the CPU registers to perform various tricks, so I can build upon that knowledge.

My first implementation appeared to operate correctly until exit where my call to delete[] threw up an assert.  This is because the fake stack was getting sprayed with bad data somehow from within my virtual thread. I will be spending lots of time in the debugger until I find out the cause.

Update 1:

I believe that the debug run time version of printf() is writing past the end of the 1kb stack, causing a buffer overflow. Hmm….

update 2:

I went back and read the article all over again and there were many hints that I simply didn’t have enough stack space at all.  The author also proposed a very nice improvement, where we effectively extend the stack size by copying each micro thread into much larger stack before execution.  Thus the Micro Thread code can make use of a large stack space, but individually they only take up a small amount of space.  After execution, we extract the smaller part of the stack again.  This places the limitation that _yield() cant be called deep in some code in the thread, but that is hardly a restriction at all.

Here is my demo code for anyone that want to have a look at its implementation.  I wrapped it up with a little API as well so it should be somewhat readable.


update 3:

My brain must have been thinking of improvements overnight since I woke up with three ideas I had to experiment with. In each case it was to address the stability and protect against stack overflows. The most major improvement is the use of VirtualProtect(), which is used here to set the bottom of the shared stack region to a guard page. This wastes 4k of stack space, but it is well worth it since now a nice exception is thrown if we write to the stack anywhere in the bottom 4k.

The second improvement, is to check the value of the stack pointer upon return from a micro thread. If it falls within the space allocated to each thread then everything is fine, otherwise we will need to allocate more stack space for that thread. If this check fails, and esp lies outside of our thread space, then we can be sure that we wont be saving all of that threads state during the task switch. That would lead to very very bad things.

A final improvement is to address what happen when the micro thread function returns. Currently, there is no return address on the stack, so execution could jump literally anywhere.
The solution is to ensure that we have a valid address for this function to return to. We can set the return address to a custom function that can handle the situation on one of two ways. We can setup the stack and processor state so that it will execute the micro thread function once more, thus it will feel like our micro thread function is permanently called. Alternatively we can mark that thread as finished, switch back to our main thread, and then kill the micro thread cleanly. Since both options have merit, I will implement both of then and let the user specify the handling during thread creation.

update 4:

I am now playing with Vectored Exception handlers to see if I can recover from stack overflows, and automatically terminate a malfunctioning thread.  I have confirmed that an exception is being thrown when I intentionally overflow the stack from within a micro thread.  It shouldn’t be too hard to switch the context back from within a vectored exception handler.  I remember having played around with something similar, having self modifying code from an exception handler if a debugger is detected.  According to the original article, it states that structured exceptions no longer work correctly from micro threads since no handlers are on the stack, perhaps this is something I can look into addressing also.

update 5:

Here is the latest version of the library provided as a source release.  The vectored exception handler really works amazingly well, terminating any micro thread that attempts to overwrite its stack.  If a micro thread returns from its function, the thread will terminate as well.  There are many more stability improvements and validity check making this quite a viable option for use in any project now.  I have included a small test program also so that the functionality can be demonstrated.