Homebrew CPU in an FPGA

I've been playing around with small computers - PICs, Scenix, Rabbits - for a few years now, and they never seemed to do quite what I wanted. So I decided to design my own small CPU, and implement it in an FPGA. I aimed low, so it's pretty simple. Some highlights:

Some of the many things it doesn't have:

It actually can run real applications - a couple of things I've done with it so far:

All the source code is in this directory

Programming Model

There are two seperate stacks: one for data and one for call-return. Both stacks are 16 levels deep.

I followed a FORTH naming convention for the instructions:

These are the control flow instructions:

Both "jump" and "call" can be predicated (that is, conditionally executed) depending on the value on top of the stack. Also, a jump or call can optionally pop the stack, i.e. do a 'drop' instruction.

Assembly Language

The assembler uses a FORTH-like syntax, so labels are preceded by a colon (:). Calls are just the label. FORTH doesn't have an explicit jump, so I used greater-than (>) followed by the label. Return is a semicolon (;). For example:

: foo ;                          # declares an empty subroutine "foo"
: bar foo foo ;                  # subroutine "bar" calls "foo" twice
: bar2 foo >foo                  # more efficient version of "bar"

The predication on jumps and calls is pretty fancy. It allows you to do these tests:

modifier condition
?fa never
?tr always
?lt ST0 < 0
?gt ST0 > 0
?eq ST0 == 0
?ne ST0 != 0
?le ST0 <= 0
?ge ST0 >= 0

A (-) suffix means also pop the stack. So "?ne- foo" looks at ST0, if it's not zero decides to make the call, pops the stack, then either calls foo or does a NOP.

For example, a loop that calls "foo" 666 times might look like:

        666                       # loop counter on top of stack, ST0
: loop
        foo                       # call foo
        -1 +                      # decrement loop counter
        ?ne >loop                 # ST0 not zero yet, do it again
        drop                      # Remove loop counter from stack

Saving Space

The machine can only hold 1K instructions, so code size is important. Here are things that I do to save space. Some of these are old hat, some are specific to the CPU.

Some examples

Some frequently-used operations:

: 2dup                                  # a b -- a b a b
        over                            # a b a
        over;                           # a b a b

: 2drop                                 # a b --
        drop                            # a
        drop;                           #

: 1-    -1 +;                           # a -- a-1

: neg                                   # a -- -a
        not                             # ~a
                                        # and fall-through into...
: 1+    1 +;                            # a -- a+1

: -                                     # a b -- a-b
        neg                             # a -b
        +;                              # a-b

A subroutine that takes parameters a and b, and returns 2a+b:


: 2a+b                                  # a b -- 2a+b
        over                            # a b a
        +                               # a b+a
        +;                              # a+b+a

Maximum of two numbers:

: max                                   # a b -- max(a,b)
        2dup                            # a b a b
        -                               # a b a-b
        ?gt- >_drop                     # a
        nip;                            # b

: _drop drop;

A subroutine to calculate the greatest common divisor (GCD) using Dijkstra's algorithm:

# GCD(m, n)
#     if (m == n)
#        return m
#     if (m > n) 
#        return GCD(m - n, n)
#     else
#        return GCD(m, n - m)

With care, this can fit in 8 instructions:

: gcd                   # m n -- gcd(m,n)
        2dup            # m n m n
        -               # m n m-n
        ?eq >2drop      # if (m==n), return m
        ?gt- _swap      # if (m>n), swap them, so (m<n)
        over            # m n m
        -               # m n-m
        >gcd            # repeat

: _swap swap;           # subroutine to do a swap

VHDL Implementation in an FPGA

The program counter is 10 bits wide, as are all memory address paths. Memory, busses and ALU are all 16-bits wide.

The call-return stack lives in a dedicated 16x10 RAM, and the topmost item on that stack is the program counter. The logic for doing a call or return is trivial: just increment or decrement the CSP.

The data stack is more complicated, because many instructions reference the top two items on the stack. If the whole stack was in main memory, this would require two reads. To avoid this, the VHDL implementation keeps top-of-stack (ST0) in a register. The data stack pointer SP actually points to the second item on the stack (ST1).

This makes stack operations a little bit fiddly. Pushing X on the stack means that the CPU has to:

  1. increment SP
  2. write ST0 to the new SP (i.e. ST1)
  3. copy X to ST0

and to pop an item off the stack:

  1. read ST1
  2. decrement SP
  3. copy the old ST1 into ST0

A binary operation - like addition - requires these stages:

  1. read ST1
  2. add ST0 to ST1, result into ST0
  3. decrement SP.

Execution is a 3-cycle process:

  1. Fetch current instruction from memory. Write back the incremented PC.
  2. Read: depending on the instruction, this is either a read of ST1 from the data stack or from memory. Adjust SP and CSP according to the current instruction.
  3. Write: depending on the instruction this can be a write to the data stack or to main memory.

There are four instruction formats, distinguished by their two high bits

00: ALU

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 LOGIC IOW ALU STORE FETCH RET WRITE STACK

STACK Change stack pointer at the end of the read cycle. 00 means no change, 01 means increment, 11 means decrement.
WRITE If 1 then do a write-cycle write. The destination of the write is either ST1 or [ST1], depending on the STORE field. The data written is always ST0.
RET add 1 to the Call Stack Pointer (CSP): do a subroutine return.
FETCH If 0 then read-cycle read is from ST1. If 1 then the read is from memory address [ST0].
STORE If 0 then the write-cycle write (if any) is to ST1. If 1 then the last cycle write is to memory address [ST1].
ALU The three bit code controls the ALU function. If ST0 is top-of-stack and ST1 is the element under it, then the possible functions are:
000
ST1 + ST0
001
Result from LOGIC unit, see below
010
ST0 >> 1
011
status word read:
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
CARRY SP
CARRY Carry out (that is, bit 16) of (ST1 + ST0)
SP Data stack pointer. This is a pointer to the item under the top-of-stack, aka ST1. The stack lives at addresses 0x3f0-0x3ff and grows upwards.
101,110,111
I/O read 1,2,3
IOW Perform an I/O write on the last cycle.
LOGIC The four bit code controls the result of the logic unit. The possible bitwise functions of the 16 bit values ST0 and ST1 are:
0000
CLEAR0
0001
ANDST1 & ST0
0010
AND_REVERSEST0 & !ST1
0011
COPYST0
0100
AND_INVERTED!ST0 & ST1
0101
NOOPST1
0110
XORST1 ^ ST0
0111
ORST1 | ST0
1000
NOR!(ST1 | ST0)
1001
EQUIV!(ST1 ^ ST0)
1010
INVERT!ST1
1011
OR_REVERSEST0 | !ST1
1100
COPY_INVERTED!ST0
1101
OR_INVERTED!ST0 | ST1
1110
LOGIC_NAND!(ST1 & ST0)
1111
SET1

01: Push literal

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 LITERAL

The 14-bit literal is sign extended to 16 bits and pushed on the stack.

10: jump

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 PRED DROP DEST

The 3 bit PRED field specifies the predicate that must be true if the CPU should take the jump.

000false
001ST0 < 0
010ST0 == 0
011ST0 <= 0
100ST0 > 0
101ST0 !- 0
110ST0 >= 0
111true

The DROP field is 1 if the CPU should pop the stack. Note that this pop happens after predicate evaluation, and before the jump.

11: call

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 PRED DROP DEST

PRED and DROP fields are the same as for "jump" above.

How the assembler maps instructions to micro-ops

It's pretty clear how the assembler translates call, jump, and load-immediate instructions to bit patterns. The remaining instructions all use the ALU; here's a table that shows the mapping from mnemonic to microcode fields.

Opcode 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 LOGIC IOW ALU STORE FETCH RET WRITE STACK
dup 0 0 0011 (ST0) 0 001 (LOGIC) 0 0 0 1 01
drop 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 0 11
over 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 1 01
swap 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 1 00
nip 0 0 0101 (ST0) 0 001 (LOGIC) 0 0 0 0 11
+ 0 0 - 0 000 (ADD) 0 0 0 0 01
not 0 0 1100 (!ST0) 0 001 (LOGIC) 0 0 0 0 00
and 0 0 0001 (ST0 & ST1) 0 001 (LOGIC) 0 0 0 0 01
or 0 0 0111 (ST0 | ST1) 0 001 (LOGIC) 0 0 0 0 01
xor 0 0 0110 (ST0 ^ ST1) 0 001 (LOGIC) 0 0 0 0 01
rshift1 0 0 - 0 010 (RSHIFT1) 0 0 0 0 01
@ 0 0 0101 (ST1) 0 001 (LOGIC) 0 1 0 0 00
! 0 0 0101 (ST1) 0 001 (LOGIC) 1 0 0 1 11
in 0 0 - 0 101,110,111 (IN) 0 0 0 1 01
out 0 0 0101 (ST1) 1 001 (LOGIC) 1 0 0 0 11
; 0 0 0011 (ST0) 0 001 (LOGIC) 0 0 1 0 00