SBCL: The Assembly Code Breadboard
EDIT: Lutz Euler points out that the dispatch sequence (used to) encode an effective address with an index register but no base. The mistake doesn't affect the meaning of the instruction, but it forces a wasteful encoding. I fixed the definition, but not the disassembly snippets below; they still show the old machine code.
Earlier this week, I took another look at the F18. As usual with Chuck Moore’s work, it’s hard to tell the difference between insanity and mere brilliance ;) One thing that struck me is how small the stack is: 10 slots, with no fancy overflow/underflow trap. The rationale is that, if you need more slots, you’re doing it wrong, and that silent overflow is useful when you know what you’re doing. That certainly jibes with my experience on the HP-41C and with x87. It also reminds me of a post of djb’s decrying our misuse of x87’s rotating stack: his thesis was that, with careful scheduling, a “free” fxch makes the stack equivalent – if not superior – to registers. The post ends with a (non-pipelined) loop that wastes no cycle on shuffling data, thanks to the x87’s implicit stack rotation.
That led me to wonder what implementation techniques become available for stack-based VMs that restrict their stack to, e.g., 8 slots. Obviously, it would be ideal to keep everything in registers. However, if we do that naïvely, push and pop become a lot more complicated; there’s a reason why Forth engines usually cache only the top 1-2 elements of the stack.
I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two TOS cache registers): pushing/popping doesn’t cause any data movement. Instead, like the drawing below shows, they decrement/increment a modular counter that points to the top of the stack (TOS). That would still be slow in software (most ISAs can’t index registers). The key is that the counter can’t take too many values: only 8 values if there are 8 slots in the stack. Stack VMs already duplicate primops for performance reasons (e.g., to help the BTB by spreading out execution of the same primitive between multiple addresses), so it seems reasonable to specialise primitives for all 8 values the stack counter can take.
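To make that concrete, here is a minimal sketch under my own assumptions (stack slots in r8-r15, one block of specialised code per counter value); the post's actual layout may differ:

```asm
;; Assumed layout: stack slots in r8-r15; counter = 2 means TOS = r10.
;; A push compiled for counter = 2 writes the new TOS into r9 and
;; continues in the code compiled for counter = 1:
        mov     r9, 42        ; push 42: no other slot moves
;; A pop compiled for counter = 2 reads r10 and continues in the code
;; compiled for counter = 3; again, no data movement.
```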
In a regular direct-threaded VM, most primops would end with a code sequence that jumps to the next one (next), something like

```asm
add rsi, 8   ; increment virtual IP before jumping
jmp [rsi-8]  ; jump to the address RSI previously pointed to
```

where rsi is the virtual instruction pointer, and VM instructions are simply pointers to the machine code for the relevant primitive.
I’ll make two changes to this sequence. I don’t like hardcoding addresses in bytecode, and 64 bits per virtual instruction is overly wasteful. Instead, I’ll encode offsets from the primop code block:

```asm
mov eax, [rsi]  ; load the next primop's 32-bit offset
add rsi, 4      ; advance the virtual IP
add rax, rdi    ; turn the offset into an address
jmp rax
```

where rdi is the base address for primops.
I also need to dispatch based on the new value of the implicit stack counter. I decided to make the dispatch as easy as possible by storing the variants of each primop at regular intervals (e.g., one page apart), rounding that interval up to minimise aliasing accidents. next becomes something like

```asm
mov eax, [rsi]
add rsi, 4
lea rax, [rax + rdi + variant_offset]
jmp rax
```
The trick is that variant_offset is a constant baked into each copy of the code, and the stack counter is (usually) known when the primitive is compiled. If the stack is left as is, so is the counter; pushing a value decrements the counter and popping one increments it.
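Putting the dispatch sequence and the specialisation together, one copy of a primitive might look like the hedged sketch below (the register assignment and the 4096-byte stride between variant blocks are my assumptions):

```asm
;; Hypothetical counter = 2 copy of "add": TOS in r10, next slot in r11.
;; The primitive pops one value, so its dispatch tail hardcodes the
;; variant_offset for counter = 3.
vm_add_2:
        add     r11, r10                  ; r11 becomes the new TOS
        mov     eax, [rsi]                ; standard dispatch...
        add     rsi, 4
        lea     rax, [rax + rdi + 3*4096] ; ...into the counter = 3 variants
        jmp     rax
```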
That seems reasonable enough. Let’s see if we can make it work.
I want to explore a problem for which I’ll emit a lot of repetitive machine code. SLIME’s REPL and SBCL’s assembler are perfect for the task! (I hope it’s clear that I’m using unsupported internals; if it breaks, you keep the pieces.)
The basic design of the VM is the one sketched above: an 8-slot stack kept in registers, an implicit modular stack counter, and one copy of every primitive per counter value, laid out at regular intervals.
The idea is that we’ll define functions to emit assembly code for each primitive; these functions will be implicitly parameterised on the stack counter, thanks to a special variable. We can then call them as many times as needed to cover all values of the counter. The only complication is that code sequences will differ in length, so we must insert padding to keep everything in sync. That’s what the padding helper does:
This function is used by the code-page emitter to emit the machine code for a bunch of primitives, while tracking the start offset of each one.
The code for the first primitive lives between bytes 0 and 32. Let’s take a look at two of its specialised versions.
The next primitive is at bytes 32-64, and a later one at 128-152:
These are relatively tight. I certainly like how little data shuffling there is; the next sequence is a bit hairy, but the indirect branch is likely its weakest (and least avoidable) point.
A VM without control flow isn’t even a toy. First, unconditional relative jumps. These can be encoded as an immediate operand in the instruction stream, with the 32-bit offset relative to the end of that operand. We just overwrite rsi, the virtual IP, with the new address.
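A hedged sketch of one specialised copy (same assumed layout as before; the stack is untouched, so the dispatch tail stays in the current variant block):

```asm
;; Hypothetical counter = 0 copy of the unconditional jump.
vm_jmp_0:
        movsxd  rax, dword [rsi]      ; 32-bit offset, relative to the
        lea     rsi, [rsi + rax + 4]  ; end of the offset itself
        mov     eax, [rsi]            ; usual dispatch
        add     rsi, 4
        add     rax, rdi              ; variant_offset = 0
        jmp     rax
```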
Call and return are at the heart of Forth-like engines. Return is easy: just pop the saved address from the control stack into rsi.
Call is a bit more complicated. It’s like the unconditional jump, but it also pushes the address of the next instruction to the control stack:
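Here is a hedged sketch of both. Where the control stack lives is an implementation detail; reusing the native stack via rsp below is my assumption, not necessarily the post's design:

```asm
;; Hypothetical counter = 0 copies of return and call.
vm_ret_0:
        pop     rsi                   ; restore the saved virtual IP
        mov     eax, [rsi]            ; usual dispatch
        add     rsi, 4
        add     rax, rdi
        jmp     rax

vm_call_0:
        movsxd  rax, dword [rsi]      ; relative target, as in vm_jmp
        lea     rcx, [rsi + 4]        ; address of the next instruction
        push    rcx                   ; save it on the control stack
        lea     rsi, [rsi + rax + 4]  ; enter the callee
        mov     eax, [rsi]
        add     rsi, 4
        add     rax, rdi
        jmp     rax
```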
Let’s look at the resulting machine code.
We finally almost have enough for interesting demos. The only important thing that’s still missing is calls from CL into the VM. I’ll assume that the caller takes care of saving any important register, and that the primop base (rdi) and virtual IP (rsi) registers are set up correctly. The stack will be filled on entry, by copying values from a caller-supplied buffer, and written back on exit.
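An entry stub might look like the sketch below; the slot registers (r8-r15) and the choice of rdx as the buffer pointer are hypothetical:

```asm
;; Hypothetical VM entry: rdi = primop base, rsi = virtual IP,
;; rdx -> 8-slot spill buffer.
vm_enter:
        mov     r8,  [rdx]            ; fill the stack slots...
        mov     r9,  [rdx + 8]
        mov     r10, [rdx + 16]
        mov     r11, [rdx + 24]
        mov     r12, [rdx + 32]
        mov     r13, [rdx + 40]
        mov     r14, [rdx + 48]
        mov     r15, [rdx + 56]
        mov     eax, [rsi]            ; ...then dispatch the first instruction
        add     rsi, 4
        add     rax, rdi              ; counter = 0 on entry
        jmp     rax
```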
The CL-side interlocutor of this entry sequence follows, as a VOP:
The only thing missing is to store the machine code for our primops in a range of memory that’s executable.
We should also test function calls.
Instead of executing the next instruction directly, this bytecode sequence calls whatever lies 8 bytes (2 dwords) after the call instruction.
Writing bytecode by hand is annoying. This tiny function takes care of the stupid stuff.
We can now write the same sequences much more legibly.
We can now either write (basically) straightline code or infinite loops. We need conditionals. Their implementation is much like the unconditional jump, with a tiny twist. Let’s start with jump if (top of stack is) non-zero and jump if zero.
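A hedged sketch of the non-zero case; the tested value is popped, so both paths dispatch into the next variant block (same assumed stride as before):

```asm
;; Hypothetical counter = 2 copy of jump-if-non-zero: TOS in r10.
vm_jnz_2:
        movsxd  rax, dword [rsi]      ; relative branch offset
        lea     rcx, [rsi + rax + 4]  ; virtual IP if taken
        add     rsi, 4                ; virtual IP if not taken
        test    r10, r10
        cmovnz  rsi, rcx              ; branchless select of the new IP
        mov     eax, [rsi]
        add     rsi, 4
        lea     rax, [rax + rdi + 3*4096]
        jmp     rax
```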
It’s hard to write programs without immediate values. Earlier control flow primitives already encode immediate data in the virtual instruction stream. We’ll do the same for a few primitives that take an immediate operand:
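For instance, a push-immediate might look like this sketch (same assumptions):

```asm
;; Hypothetical counter = 2 copy of push-immediate: the new TOS goes
;; in r9, and the dispatch tail targets the counter = 1 variants.
vm_push_imm_2:
        movsxd  r9, dword [rsi]       ; sign-extend the 32-bit immediate
        add     rsi, 4                ; skip over it
        mov     eax, [rsi]
        add     rsi, 4
        lea     rax, [rax + rdi + 1*4096]
        jmp     rax
```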
Finally, we have enough for a decent-looking (if useless) loop. First, update the primop code page:
One million iterations of this stupid loop that only decrements a counter took 15M cycles. 15 cycles/iteration really isn’t that bad… especially considering that it executes an indirect jump after loading 1, after subtracting, and after comparing with 0.
We can do better by fusing the comparison with zero into the conditional jump:
Decrementing a counter and jumping if non-zero is a common operation (old x86 even implemented that in hardware, with loop). Let’s add decrement-and-jump-if-non-zero (djnz) to the VM:
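A first attempt could select the new virtual IP with a conditional move, as in this hedged sketch:

```asm
;; Hypothetical counter = 2 copy of djnz: the loop counter stays on
;; the stack (TOS in r10), so both paths remain in the same variant block.
vm_djnz_2:
        movsxd  rax, dword [rsi]      ; relative branch offset
        lea     rcx, [rsi + rax + 4]  ; virtual IP if we loop again
        add     rsi, 4                ; virtual IP once the count hits 0
        sub     r10, 1                ; decrement in place and set flags
        cmovnz  rsi, rcx
        mov     eax, [rsi]
        add     rsi, 4
        lea     rax, [rax + rdi + 2*4096]
        jmp     rax
```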
That’s better… But I’m really not convinced by the conditional move. The branch will usually be predictable, so it makes sense to expose that to the hardware and duplicate the sequence.
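Duplicating the sequence could look like this sketch:

```asm
;; Same primitive with a real branch: each path gets its own dispatch
;; tail, so the hardware predicts the two indirect jumps independently.
vm_djnz_2_branchy:
        sub     r10, 1
        jz      .done
        movsxd  rax, dword [rsi]      ; taken: loop back
        lea     rsi, [rsi + rax + 4]
        mov     eax, [rsi]
        add     rsi, 4
        lea     rax, [rax + rdi + 2*4096]
        jmp     rax
.done:
        add     rsi, 4                ; fall through: skip the offset
        mov     eax, [rsi]
        add     rsi, 4
        lea     rax, [rax + rdi + 2*4096]
        jmp     rax
```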
The resulting code isn’t too large, and the two indirect jumps are 16 bytes apart.
This alternative implementation does work better on our stupid loop.
Let’s see how that compares to straight assembly code.
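The native baseline is presumably something like this two-instruction loop:

```asm
;; Pure native loop: all control overhead, no payload.
        mov     rax, 1000000
native_loop:
        sub     rax, 1
        jnz     native_loop
```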
My slow MacBook Air gets 1 iteration/cycle on a loop that’s 100% control overhead. With djnz, a good implementation of a reasonable specialised operator, the loop is about 6x as slow as native code. A worse implementation of djnz is still only 8x as slow as pure native code, and horribly unspecialised bytecode is 11-15x as slow as native code.
Specialising primops to a virtual stack pointer is feasible in practice, when the stack is restricted to a small size. It also seems to have a reasonable runtime overhead for threaded interpreters. I’m not actually interested in straight stack languages; however, I believe that a fixed-stack VM makes a nice runtime IR, when coupled with a mechanism for local variables. We’ll see if I find time to translate a high-level language into superoperators for such a VM. Fused operators would reduce the importance of specialised operators like djnz; in contrast, simpler function calls (because there’s less shuffling of items to stack them up in the right position) would remain as useful.
SBCL has definitely proven itself to be a good companion to explore the generation of domain-specific machine code. I don’t know of any other language implementation with that kind of support for interactive programming and machine code generation (and inspection). FWIW, I believe LuaJIT + dynasm will soon be comparable.
Steel Bank Common Lisp: because sometimes C abstracts away too much ;)