GA144 note: Native threaded execution

One approach to running general code on the GA144 is to use a virtual machine. For example polyForth uses a VM, and I have also used VMs on GA144 in my attempts to get it to run C. However the interpretive overhead of these VMs means that their best case performance is only a tiny fraction of the GA144's capability. For example polyForth typically runs 100 times slower than native GA144 code.

This leads to the idea of not having a VM at all, and executing native code directly from external storage. The difficulty is how to handle jumps, branches and calls. This is a description of one scheme that Michael, Mark and I have been working on.



When the GA144 boots, ROM in the SPI node (for example) reads out the SPI flash contents and sends it to a neighbor. These instructions can make the neighbor do anything, such as load RAM or set up registers, or even send some instructions to its neighbor. This can be extended indefinitely, potentially making a boot chain that covers the whole chip.

With one modification, this bootstrap procedure is the basis for the new execution model, called native threaded execution.

The SPI node boots by reading a stream from flash, but instead of always loading from flash at address zero, it reads the load address from the running node.

Each code sequence is a fragment, accessed by a fragment number -- a scaled byte offset into flash. Fragments are sized sequences of native instructions -- these execute on the X node. When it executes, a fragment must tell F which fragment is next.

One important property of fragments is that the whole fragment always executes. There is no way to execute only part of a fragment. Some examples of code flow with this kind of fragment:

In fact any program can be broken up into fragments like this. Every branch or label in the program is a fragment boundary. In other words, a fragment is a basic block: a sequence of non-branch instructions.

But consider subroutine calls -- that is, Forth words. To call a word FOO, a fragment needs to request that FOO executes next, but also must specify another fragment to be executed when FOO completes. So there must be a call-return stack.

Here the definition:

: 1+   1 + ;

is compiled into fragment 123. It ends by executing EXIT. A fragment 701 sets up a call of 1+. It pushes the "return address" 702 on the R stack, then jumps to fragment 123. When 123 EXITs, it pops 702 from the return stack, and branches to it. In this way the code can execute a general call/return stack.

This leads to a slightly more complicated implementation.

F is the fetch node. It accepts fragment addresses and drives the SPI flash, sending the fragment contents to D.

D is the decompressor node. Currently it does some minor transformations on the instruction stream.

X is the execution node, which actually executes the instructions.

R is the return stack node. It contains the stack, as well as a few commonly used definitions, such as EXIT.

The top of the parameter stack (TOS) is the hardware top-of-stack (T) in X. The rest of the parameter stack is in RAM in X. The X node's a register points to the top of the stack, which grows down. There are about 24 cells for the parameter stack.


The physical layout matches this design. 705 is the F node, because it is connected to the SPI flash.

The RAM controller is cyan, and its storage nodes are yellow. 192 words total, about 384 bytes. Some sample implementations of basic words:

+ fetches the value from the parameter stack, then adds them, leaving the result in T:

CODE +                #returns #inline
    @+ . +

DROP fetches from the stack into T. Note that it doesn't need to balance X's hardware stack:

CODE DROP             #returns #inline

SWAP again does not need to balance the hardware stack:

CODE SWAP             #returns #inline
    @ over !


As described so far, there need to be 'glue' fragments between consecutive calls. So to compile a definition:

: FOO  A B C ;

there must be three glue fragments after calls to A, B and C.

Of course the tail call optimization can remove one of these. But the compiler can actually can do better than this.

The first fragment of FOO pushes the fragment numbers of B and C on the R stack, then jumps to A. When A returns, it executes B. When B returns it executes C.

This is not new - it's sometimes used by exploit code where it's called return oriented programming. In this way, FOO compiles to a single setup fragment. The return stack needs to be deep enough to hold a chain of calls like this -- so the implementation has a 50 cell R stack.

And of course the compiler can go even further and inline the code for A, B and C.

If the words are all inlinable, then FOO doesn't need any calls at all. Compilation in this case is simple concatenation of each word's definition. And the fixed overhead of call/return is avoided.

With inlining, many pieces of code compile to straight sequences of instructions. When this happens to an inner loop - like a FOR, DO or UNTIL - it compiles to a fragment that jumps back to itself.

If the fragment is straight-line code, it's an easy transformation to compile a native jump back to the start of the code instead of re-requesting the same fragment. This trivial change makes things about 200X faster.

Running on the GreenArrays eval board

To execute on the board, get the ga144tools, then (assuming $PORT is the USB port that the GA144 is attached to) do:

python swapforth.fa
python $PORT write image
python $PORT

The first command runs the fragment assembler, building a binary flash image in file image. The second command writes image to SPI flash. The third command loads the Native Threaded runtime, which proceeds to execute the flash. The word boot in swapforth.fa is executed at startup.

The implementation is in three parts:

A benchmark

Running a million iterations of 0 drop has been used as a GA144 benchmark:

: asd 1000 for 1000 for 0 drop next next ;

For comparison, this takes 15 ms when written as native GA144 code, and 3000 ms in polyForth.

In the native threaded implementation, it runs in 75 ms.