
http://david.carybros.com/html/minimal_instruction_set.html

minimal instruction set


When I was working on this, I thought it was pretty creative and unique. Later I found out that I'd been stuck in a ``Turing Tar Pit''. Still, this looks like a more reasonable instruction set than some of the really ugly things that have been developed by other people who have also wasted a lot of time in a ``Turing Tar Pit''.

Describes a microprocessor instruction set developed by David Cary that packs 2 instructions into 8 bits of RAM. As far as I know, this is *the* Minimal Instruction Set (for a single-processor von Neumann machine). In other words, I know of no other (Turing-complete) instruction set that has as many or fewer distinct instructions (DI).

2003-01-04:DAV: I just discovered ``A Minimal CISC'' by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/cisc/ that has fewer distinct instructions (DI): only 8, so 5 of them can pack into a 16 bit word.

The instructions are:
1. NOP: No operation.
2. DUP: Duplicate the stack top. This is the only way to allocate stack space.
3. ONE: Shift the stack top left one bit, shifting one into the least significant bit.
4. ZERO: Shift the stack top left one bit, shifting zero into the least significant bit.
5. LOAD: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.
6. POP: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.
7. SUB: Subtract the top value on the stack from the value below it, pop both and push the result.
8. JPOS: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.
... any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch.
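Those 8 instructions are easy to make concrete. Here is a minimal sketch of an interpreter for them in C; the 16-bit word size, the 256-deep stack, and the one-opcode-per-memory-word fetch are my own simplifying assumptions (Jones actually packs 5 three-bit opcodes per 16-bit word):

    #include <stdint.h>
    #include <stdio.h>

    enum { NOP, DUP, ONE, ZERO, LOAD, POP, SUB, JPOS };

    #define MEMSIZE 4096
    static int16_t mem[MEMSIZE];   /* unified code + data memory */
    static int16_t stack[256];
    static int sp;                 /* stack[sp-1] is the stack top */

    static void step(uint16_t *pc) {
        int op = mem[*pc % MEMSIZE] & 7;   /* 3-bit opcode */
        (*pc)++;
        switch (op) {
        case NOP:  break;
        case DUP:  stack[sp] = stack[sp-1]; sp++; break;
        case ONE:  stack[sp-1] = (int16_t)((stack[sp-1] << 1) | 1); break;
        case ZERO: stack[sp-1] = (int16_t)(stack[sp-1] << 1); break;
        case LOAD: stack[sp-1] = mem[(uint16_t)stack[sp-1] % MEMSIZE]; break;
        case POP:  mem[(uint16_t)stack[sp-2] % MEMSIZE] = stack[sp-1]; sp -= 2; break;
        case SUB:  stack[sp-2] = (int16_t)(stack[sp-2] - stack[sp-1]); sp--; break;
        case JPOS: if (stack[sp-2] > 0) *pc = (uint16_t)stack[sp-1]; sp -= 2; break;
        }
    }

    int main(void) {
        int16_t prog[] = { DUP, ONE, ZERO, ONE };   /* builds the constant 5 */
        for (int i = 0; i < 4; i++) mem[i] = prog[i];
        stack[sp++] = 0;                            /* seed the stack with a zero */
        for (uint16_t pc = 0; pc < 4; ) step(&pc);
        printf("%d\n", stack[sp-1]);                /* prints 5 */
        return 0;
    }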

See also its counterpart, "The Ultimate RISC" by Douglas W. Jones http://www.cs.uiowa.edu/~jones/arch/risc/

The "Whitespace" programming language is similar to this "Minimal CISC". http://compsoc.dur.ac.uk/whitespace/

The "Path" language http://pathlang.sourceforge.net/ has an interesting 2D layout, and a near-minimal instruction set (expressed in single characters):
+ - increment the current memory cell
- - decrement the current memory cell
} - go to the next memory cell
{ - go to the previous memory cell
, - input an ascii character from stdin into the current memory cell
. - output an ascii character from the current memory cell into stdout
/ \ - Turn (unconditional branch)
^ < > V - Turn if current memory cell is not equal to 0
! - jump next symbol
$ - start here heading right
# - end here
any other character including spaces - do nothing

2002-12-07:[FIXME: move ``Turing Tar Pit'' information here.] computer_architecture.html#tarpit contents: [FIXME:] See also computer_architecture.html#misc for other attempts at a Minimal Instruction Set Computer (MISC). Some of them have even been produced in silicon.

What is the minimum number of instructions for a Turing-complete von Neumann machine ?
Here David Cary pushes the MISC idea to ugly extremes. [FIXME: need a catchy name for this architecture]

"whenever you excessibly constrain any parameter, something else has got to give." -- Don Lancaster.

I've put together yet another MISC instruction set. I've squeezed it down to 11 instructions. I don't think I've painted myself into a corner yet (I hope). If I could just squeeze out 3 more instructions, I could pack 5 of them into a 16 bit cell. One advantage of having few instructions -- you can document them in a reasonably-sized email, rather than needing an entire book to document the details of scads of instructions.

There are 2 very different ways of counting the "size" of an instruction set. We could count the number of different instruction mnemonics (NM) mentioned in the assembler documentation. This method is subjective, since different assemblers may describe the same machine with different numbers of mnemonics (for example, one assembler may require programmers to type "LOAD dest TO PC" to get the effect of an immediate jump. Another assembler may require programmers to use an additional mnemonic "JUMP TO dest" to get exactly the same bit pattern in the final executable.) Processors with smaller NM are likely to have regular, orthogonal instruction sets (the same addressing modes apply to all instructions), which makes them easier to program than processors with larger NM. The minimum number of mnemonics NM is 1: the "move" instruction used by TTA http://www.rdrop.com/~cary/html/computer_architecture.html#tta . Since you know what the instruction will be in a TTA architecture, it takes zero bits to specify -- but TTA still requires a bunch of bits per instruction to select registers and/or addressing modes.

Another way of counting the "size" of an instruction set is to count all possible distinct instructions (DI), all valid variations of "source register", "destination register", and "addressing mode". Using this count, TTA has lots more than 1 instruction. This is a more objective count: if the longest instruction has b bits, then the number DI is equal to 2^b minus the number of invalid instructions of that length. Processors with smaller DI are likely to need fewer bits b to specify each instruction. Processors with fewer bits per instruction pack more instructions into a given number of bits of RAM, and execute more instructions per second with a given RAM bandwidth. Unfortunately, this advantage is partially (perhaps completely) cancelled out by the fact that processors with smaller DI often require *more* instructions (more RAM and/or more cycles) to implement a given piece of functionality than processors with larger DI. Nevertheless, it is an interesting intellectual challenge to wonder: What is the minimum number of bits bmin required per instruction ? What is the minimum number of distinct instructions DI for a Turing-complete von Neumann machine ?

The shBoom(tm) microprocessor from Patriot Scientific Corporation http://www.ptsc.com/ http://www.circuitcellar.com/articles/misc/tom-92.pdf packs instructions into 8 bits. So we know DImin is 256 or less.

The "Itty Bitty Stack Machine" http://www.ittybittycomputers.com/IttyBitty/IBSM.htm is very similar to the F21, with most instructions packed into 5 bits. (A total of about 54 instructions) (This instruction set is allegedly "fast, that is, it should be capable of emulation at a raw speed not slower than 10% of the host native hardware.")

The "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs has 27 distinct instructions, each one packed into 5 bits (neglecting the bits that follow "#" and branches). So we know DImin is 27 or less.

Clive Sinclair http://www.cdworld.co.uk/zx2000/clive.html claims that he has designed a CPU with only "16 principle instructions", but he doesn't list any details. Dr Neil Burgess mentions an "ultraRISC processor, that has only 15 instructions" http://www.acue.adelaide.edu.au/leap/discipline/eng/Burgess.html but he doesn't list any details. Can a CPU really be designed to have 16 or fewer distinct instructions DI, such that one can pack 2 instructions into 8 bits of RAM ? 8 instructions into a 32 bit word ? Or are these people counting NM, not DI ?

Starting with the elegant 27 instruction set for the "F21" Forth engine by Chuck Moore F21 STACK PROCESSOR CPU DESCRIPTION http://pisa.rockefeller.edu:8080/MISC/F21.specs , and eliminating instructions that could be emulated by (sometimes lengthy) combinations of the other instructions, David Cary managed to get a (very ugly) 16 instruction set: "# push pop A! A@ T=0 @A !A xor and + com 2/ dup over drop". (16 instructions as of 1998) Can you think of a more "elegant" (yet still "complete") instruction set of 16 or fewer distinct instructions DI ?
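The bit-counting behind these DI figures is simple enough to check mechanically. A small C sketch (the instruction counts are the ones quoted above; everything else is just arithmetic):

    #include <stdio.h>

    /* minimum bits per instruction: smallest b with 2^b >= DI */
    static int bits_needed(int di) {
        int b = 0;
        while ((1 << b) < di) b++;
        return b;
    }

    int main(void) {
        int counts[] = { 27, 16, 11 };  /* F21; the 16-instruction set; the 11-instruction set */
        for (int i = 0; i < 3; i++) {
            int b = bits_needed(counts[i]);
            printf("DI=%d needs %d bits, so %d fit in a 16-bit cell\n",
                   counts[i], b, 16 / b);
        }
        return 0;  /* prints 5 bits/3 per cell, then 4/4, then 4/4 */
    }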

Turing complete cellular automata are proof that a Turing Machine can be built without *any* distinct "instructions".

Hint: You soon get to the stage "Every instruction in this set is essential. If I eliminate *this*, I won't be able to do *that*, no matter how many of the remaining instructions I string together". While you can't simply eliminate 1 instruction from this set, sometimes you can replace 3 instructions from a set with 2 completely different instructions that, given the remaining instructions, can still do *that*. It helps to assume you have some scratchpad RAM, since the fewer internal registers you have, the fewer instructions you need to shuffle things back and forth between those registers.

Myron Plichota and vic plichota have developed a much more elegant set of 16 instructions they call qUark ../mirror/quark.txt

With a bit of a challenge from "vic plichota" <atsvap@cgo.wave.ca>, and many ideas from Myron Plichota, I've developed an instruction set with only 13 instructions: "# ! + xor nand 2/ push pop dropR swap dup T=0 nop" (13 instructions as of 1999-02-22) Programmer's model: there is an instruction pointer P and 2 stacks, the "return stack" and the "data stack". The top of the data stack is called T and the second on the data stack is called S; the top of the return stack is called R.

    data stack:          T S ...
    return stack:        R ...
    instruction pointer: P

I've managed to whittle it down even more with a ``conditional skip if arithmetic result not zero'' idea from Alan Grimes: Instruction summary: "# ! + xor nand 2/ push pop toA Afrom call nop" (12 instructions as of 1999-03-22)

I've managed to whittle it down even more: "# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) I *think* this is still functionally complete, and can still do everything that any other Turing-complete CPU can do. It just takes a *lot* more instruction cycles to do most things than most CPUs.

2000-01-05:DAV: I've just stumbled across "BF: An Eight-Instruction Turing-Complete Programming Language which was invented by Urban Mueller solely for the purpose of being able to create a compiler that was less than 256 bytes in size, for the Amiga OS." More BF details: A very clean instruction set, although it takes far more instruction cycles than even my 11 instructions to do even the most trivial things. I wonder if we could redefine the "input" and "output" op-codes ...
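For comparison, BF's eight commands fit in a few lines of C. This is my own illustrative sketch of a BF interpreter (not Mueller's sub-256-byte compiler); the 30000-cell tape is the conventional choice:

    #include <stdio.h>

    int main(void) {
        const char *prog = "++++++++[>++++++++<-]>+.";   /* prints 'A' */
        unsigned char tape[30000] = {0};
        unsigned char *cell = tape;
        for (const char *pc = prog; *pc; pc++) {
            switch (*pc) {
            case '>': cell++; break;
            case '<': cell--; break;
            case '+': (*cell)++; break;
            case '-': (*cell)--; break;
            case '.': putchar(*cell); break;
            case ',': { int c = getchar(); if (c != EOF) *cell = (unsigned char)c; } break;
            case '[': if (!*cell) {            /* skip forward past matching ] */
                          int depth = 1;
                          while (depth) { pc++; depth += (*pc == '[') - (*pc == ']'); }
                      } break;
            case ']': if (*cell) {             /* jump back to matching [ */
                          int depth = 1;
                          while (depth) { pc--; depth += (*pc == ']') - (*pc == '['); }
                      } break;
            }
        }
        return 0;
    }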

** Programmer's model
"# ! + xor nand 2/ push popA AT call nop" (11 instructions as of 1999-03-24) [perhaps more Forth-like names would be >R t oR inst ead of push A> Afrom inst ead of AT RA ?? inst ead of popA ]

2 push-down stacks, the "return stack" and the "data stack".

    data stack      return stack
        T       A       R       P
        S               .
        .               .

P = the instruction pointer (program counter).
R = the top value of the return stack. The return stack stores subroutine return addresses, address pointers, and occasional data.
A = a register to hold temporary data (during "dup", "swap", "drop", and "pop").
T = the top value of the data stack (which holds data values and occasional address pointers).
S = the second value on the data stack.
All registers are the same length, the length of a memory address.

[FIXME: An implementation needs to choose:
sizeof_address (in bits)
sizeof_data_word (in bits) (must be less than or equal to sizeof_address)
depth of data stack (in addresses)
depth of return stack (in addresses)
(Moore fixed (sizeof_address) = (sizeof_data_word + 1 bit). What other choices would be "interesting" ???) ]

[A slow, minimal-gate implementation needs stack pointers that point to S and R in RAM. Faster processors can keep most or all of the stack on-chip ... What should the processor do when stacks overflow or underflow ? ]

** Acronyms and notation:
cell: the implementation's native integer size, the number of bits read and written at once.
[ ... ] indicates that everything inside the brackets is contained in a single cell.
( ... ): parameter-stack diagram, T to far right, then S.
< ... >: return-stack diagram, R on far right.
|: pipe char used to indicate an "either-or" choice.
Subroutines are documented with the initial (to the left of the "-") and final (to the right of the "-") state of the data stack and the return stack.

** External interface summary


Data on the external memory bus always comes from T or from "external devices", and always goes to "external devices" or T or the instruction latch. All data in memory is accessed only on cell boundaries (i.e., only 1 whole cell is read or written at a time). Addresses on the memory bus always come from P or R (or perhaps from external devices while the CPU is not using the bus).

Since instructions are always only 4 bits each, instructions are packed into "cells", as many as will fit (e.g., 3 instructions per 12 bit cell on some machines, or 4 instructions per 16 bit cell, 5 instructions per 20 bit cell, or 8 instructions per 32 bit cell on other machines). (Does it make any sense for "cells" to be bigger or smaller than sizeof_data_word, the size of a single word of memory ?). A minimum of 3 instructions per cell is needed for the ugly hack that works around the lack of a proper "load" command. Any opcode ``# !R+ + xor nand 2/ push popA AT call nop'' can occur in any slot.

[FIXME: How are interrupts handled ? computer_architecture.html#interrupt ]

It seems that the instructions execute so quickly that the bottleneck is the speed of the RAM.
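To make the packing concrete, here is a minimal C sketch of the fetch/unpack/execute loop for such a machine. The 16-bit cell, the 4 slots per cell, and the opcode numbering are all my own assumptions (the page deliberately leaves them open), and only a few opcodes are filled in; later sections sketch the rest:

    #include <stdint.h>

    enum { OP_NOP, OP_LIT, OP_STORE, OP_ADD, OP_XOR, OP_NAND,
           OP_SHR, OP_PUSH, OP_POPA, OP_AT, OP_CALL };

    static uint16_t mem[1 << 16];
    static uint16_t P;                  /* instruction pointer            */
    static uint16_t ds[64], rs[64], A;  /* data stack, return stack, A    */
    static int dsp, rsp;                /* ds[dsp-1] is T, rs[rsp-1] is R */

    static void execute(int op) {
        switch (op) {
        case OP_LIT:  ds[dsp++] = mem[P++]; break;         /* "#": @P+       */
        case OP_ADD:  dsp--; ds[dsp-1] += ds[dsp]; break;  /* "+": S+T -> T  */
        case OP_PUSH: rs[rsp++] = ds[--dsp]; break;        /* T -> R         */
        case OP_POPA: A = rs[--rsp]; break;                /* R -> A         */
        case OP_AT:   ds[dsp++] = A; break;                /* copy of A -> T */
        /* ... remaining opcodes are sketched in the sections below ... */
        default: break;                                    /* OP_NOP         */
        }
    }

    /* One cell = 4 four-bit slots, executed first to last; P is
       incremented as soon as the cell is latched, and all 4 slots
       always run. */
    static void run_one_cell(void) {
        uint16_t cell = mem[P++];
        for (int slot = 3; slot >= 0; slot--)
            execute((cell >> (4 * slot)) & 0xF);
    }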

** Memory access instructions:


Surprisingly, only 1 memory read instruction is needed for both in-line literals and for data everywhere else in memory:

    // ( -- x )
    @P+   // load data in RAM at [P], push it onto T, then increment P.
    load  // ditto
    #     // ditto

For example, when the CPU executes a cell that starts with 3 "#" instructions, then the following 3 cells are pushed onto the data stack (the second one ends up on S, the third one ends up on T), then the remaining instructions in the cell are executed, and then the 4th following cell is loaded into the instruction latch and executed.

    // ( ... -- ... data1 data2 data3 )
    [ # # # ... ] [data1] [data2] [data3] [more instructions ...] ...

The assembler pseudo-op "#(value)" keeps track of implementation details (word size, register size, sign extension, the current instruction word, where the next literal/instruction cell will be loaded from, etc.). "#(value)" expands to "#" (or perhaps "# com" or "# unsigned com") and packs the (possibly complemented) literal value into the next available (sizeof_data_word) cell, such that when that code is executed, the desired value gets loaded into T.

To load data that is *not* an in-line literal, somehow calculate the address and get it onto R, then pack these instructions into a single cell: "swapPR load swapPR".

There is also only one write instruction:

    // ( x -- ) < address -- address+1 >
    !R+    // pop T, store popped value to RAM at [R], then increment R.
    store  // ditto

(??? should we use A for the address instead of R, a la the F21 ?) (hardware *might* be a bit simpler if P is used for *all* addresses to the memory bus, replacing !R+ with !P+).
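In the emulator sketch above, these two memory instructions come out as one case each (the opcode numbering is still my assumption); these slot into execute():

    case OP_LIT:   ds[dsp++] = mem[P++]; break;   /* "#" / "load" / "@P+" */
    case OP_STORE: mem[rs[rsp-1]] = ds[--dsp];    /* "!R+": store at [R], */
                   rs[rsp-1]++; break;            /* then increment R     */

Note how the ``swapPR load swapPR'' trick falls out for free: after the first swapPR, P holds the data address, so OP_LIT fetches the data and post-increments the address, and the second swapPR puts the incremented return address back into P before the cell ends.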

2 operand arithmetical instructions


The 2 operand arithmetical instructions always pop 2 items off the data stack (S and T) and push the result back onto the data stack (into T).

+    ( n1 n2 -- n1_+_n2 )         add S to T
xor  ( n1 n2 -- n1_bitxor_n2 )    bitwise exclusive-or S to T
nand ( n1 n2 -- ~(n1_bitand_n2) ) bitwise nand S to T

One way to write software to implement multiple-precision arithmetic is to use the most significant bit of T as the carry bit, extending precision in sizeof_address-1 bit chunks.
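A quick C illustration of that carry-in-the-MSB trick, using 31-bit chunks inside 32-bit words (the sizes are my assumption):

    #include <assert.h>
    #include <stdint.h>

    #define CHUNK_MASK 0x7FFFFFFFu   /* low 31 bits hold the chunk */

    int main(void) {
        uint32_t a = 0x7FFFFFFF, b = 1, carry_in = 0;
        uint32_t sum = a + b + carry_in; /* chunks < 2^31, so this never overflows 32 bits */
        uint32_t carry_out = sum >> 31;  /* the MSB is exactly the carry  */
        sum &= CHUNK_MASK;
        assert(sum == 0 && carry_out == 1);
        return 0;
    }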

1 operand arithmetical instructions


2/   ( x -- x/2 )  arithmetic shift right of T (keep sign), and skip next instruction if T was not zero. (Next instruction only executed if T was (and still is) 0).
nop  does nothing [can "nop" be eliminated ?]

This "2/" is the only conditional instruction in the entire set. The skipped instruction is typically "call" or "nop" or another ``2/''. For example, to shift 3 bits to the right, do "2/ 2/ 2/ nop". (I leave it as an exercise to the reader to show this sequence always has the net effect of unconditionally shifting T three times to the right).

[It may make hardware simpler (allow interrupts at end of every cell) if the programmer model always includes a virtual "nop" at the end of every cell, i.e., even if a 2/ is the last instruction of a cell, the first instruction of the next cell is executed unconditionally. This means the compiler must insert or delete a ``nop'' to make the ``2/'' do what the programmer expects:

    2/ call     --> [... nop] [2/ call ...]
    2/ nop call --> [... 2/ ] [call ... ]
    2/ 2/       --> [... 2/ ] [2/ ... ]
]

[Other options: perhaps make "2/", if the result is not zero, skip *all* the remaining instructions of the current cell. Perhaps make "2/", if the result is not zero, skip precisely 3 instructions: those following it in this cell and perhaps the start of the next cell. This delay slot could be filled with "AT push popA", reducing the need for the "nop" instruction. ]
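In the emulator sketch, "2/" becomes a shift plus a one-slot skip flag. My reading of the spec is that the test is on T *before* the shift ("if T was not zero"), which agrees with "(and still is)" since a zero T shifts to zero:

    static int skip_next;   /* set by 2/, consumed by the next slot */

    case OP_SHR:
        if (ds[dsp-1] != 0) skip_next = 1;                /* test T before shifting      */
        ds[dsp-1] = (uint16_t)((int16_t)ds[dsp-1] >> 1);  /* arithmetic shift, sign kept */
        break;

and the slot loop in run_one_cell() grows one line:

    for (int slot = 3; slot >= 0; slot--) {
        int op = (cell >> (4 * slot)) & 0xF;
        if (skip_next) { skip_next = 0; continue; }   /* the skipped instruction */
        execute(op);
    }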

Stack manipulation instructions


push  ( x -- ) < -- x >  pop T, push onto R
popA  ( -- )  < x -- >   pop R, push into A
dropR ( -- )  < x -- >   // alias for popA
AT    ( -- x )           push a copy of A onto T (leaving A unchanged).

Data in any register in the list (T R A T) can be moved in a single cycle to the register to its right. "popA" is a hint to the reader that the value of A will soon be used with AT; "dropR" is a hint to the reader that the value of A is now irrelevant.

Branches (change-of-program-flow instructions)


There is only 1 branch instruction, only one way to modify the value of P (but lots of different aliases to make the intent of the programmer clear):

    // various aliases for the same branch instruction bit pattern
    swapPR ( -- ) < future_P -- past_P >  // swap P with R.
    call
    branch
    return
    exit
    ;  // pronounced "exit"
    jmp

P typically (but not always) points to the cell following the cell from which the currently-executing instructions were taken. Once a cell is loaded into the instruction latch for execution, P is incremented, and then *all* of the instructions in that cell are executed, (possibly modifying P) from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the call instruction). If a "call" is immediately followed by a "#" instruction, then that distant value at [P] is loaded, *not* the data in the cell immediately following the cell currently being executed.

After *all* instructions in a cell have executed, the next cell of instructions is read from this new value of [P] into the instruction latch, P is incremented, and then all the instructions in that cell are always executed (possibly modifying P). Whatever value of P exists immediately after they are all executed is the address where the next instruction cell will be loaded from. Sequences that "temporarily" change P must be very careful to restore P before the end of that cell. All branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell.
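In the emulator sketch, the whole branch architecture is a single swap, which is why the cell loop above increments P before executing any slots:

    case OP_CALL: {                  /* swapPR, alias call/return/jmp        */
        uint16_t t = P;
        P = rs[rsp-1];               /* branch destination comes from R      */
        rs[rsp-1] = t;               /* old P becomes the return address     */
    } break;

So "# push call" is a subroutine call: # fetches the destination, push moves it to R, and swapPR exchanges it with the already-incremented P. The callee returns with the very same opcode.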

** prefetch implementation


Prefetch may be implemented on some chips. It has no effect on the programmer model; all implementations act "as if" there was no prefetch. Typically, while the instructions in one cell are executed, the next cell is being pre-fetched. That pre-fetched cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the pre-fetched cell to T instead of the instruction latch, increments P, and starts pre-fetching the following cell; (b) the current cell modifies P with "swapPR". The CPU will then flush the ``next cell'' it speculatively pre-loaded, then starts prefetching the next instruction where P now points.

[A strange alternative: Instead, a chip *could* stick with the absolute simplest thing to do with "call" -- merely swap R and P, and not worry about the consequences. This has a *major* impact on the programmer's model. It means that pre-fetched cells are *never* ``wasted''. The consequences are that after a cell with one ``call'' and no ``load'' instructions is fetched, the cell at the following address is prefetched and will be executed unconditionally (it is in the "branch delay slot" of a "delayed branch" a la Sparc ???), while prefetch loads the first cell of that subroutine. (Note that there is *both* a call delay slot *and* a return delay slot). Confusing, difficult-to-explain things happen if one tries to mix "call" followed by several "#"s in the same cell. The first "#" loads data from the cell immediately following the cell being executed, while the following "#"s load data from distant cells. Other strange things happen if you put multiple call instructions in consecutive cells. If you do this, then the macro for @R+ expands to:

    : @R+
      // next 4 instructions must be in same cell.
      swapPR @P+  // puts a wasted prefetched value in T, starts loading desired value
      swapPR @P+  // puts the desired value in T, starts loading next instruction
      cnop
      [ cnop ]    // this cell is skipped
      swap        // get the wasted value on top
      drop        // get rid of it.
    ;
]

[We might be able to have a minimal implementation that packs only 1 instruction per cell, feeding out of a 4 bit RAM, if we change the instruction buffer so it pre-fetches and holds not only the currently executing instruction, but also the 2 following instructions. The ugly hack to fake a "load [R]" would work like this: while executing the source code "swapPR swapPR # push call", the instruction buffer would look like this:

    1. ``swapPR swapPR #''  (P and R get swapped; then data at [P++] is read)
    2. ``swapPR # (data)''  (P and R get swapped; instruction at [P++] is read)
    3. ``# (data) push''    (data shunted to T, instruction buffer overwritten with NOP; instruction at [P++] is read)
    4. ``nop push call''    (etc.)

Every instruction triggers 1 read cycle to get the next instruction and shifts the instruction queue down one. ]

If we are feeding out of a 4 bit RAM but we want the register size to be N times larger, then we could prefetch N+1 following instructions (confused yet ?). Then the sequence would be (for N=3, so T = 3*4 bits = 12 bits):

    1. ``swapPR nop nop swapPR #''       (P and R get swapped; then data at [P++] is read)
    2. ``nop nop swapPR # (data)''       (nop; then data at [P++] is read)
    3. ``nop swapPR # (data) (data)''    (nop; then data at [P++] is read)
    4. ``swapPR # (data) (data) (data)'' (P and R get swapped; instruction at [P++] is read)
    5. ``# (data) (data) (data) push''   (data shunted to T, instruction buffer overwritten with NOPs; instruction at [P++] is read)
    6. ``nop nop nop push call''         (etc.)

** The compiler/assembler
The assembler pseudo-op ``cnop'' fills the rest of the current cell with (zero or more) NOPs, sufficient to align the next instruction with the start of the next cell. This alignment is necessary when you want that next instruction to be the destination of some branch in the code. (This is implied by labels at the start of a subroutine, and the "then" and "else" of a "if-then" statement.)

The compiler automatically generates inline code whenever the word being compiled would take *less* space inline than the subroutine call. (-- idea from Myron Plichota) Should the compiler choose to actually generate a call, ``.... subroutine_label ...'' expands to something like this:

    [ ... # push call] [ subroutine_label ] [ ... ] ...
    subroutine_label:
    ...  // instructions to actually do something
    [ ... return dropR ]

To reduce space slightly, the compiler may choose to compact lists of subroutine calls (after making sure that none of the subroutines will interfere too much with the return stack) from

    [ ... # push call] [ name_1 ]
    [ ... # push call] [ name_2 ]
    [ ... # push call] [ name_3 ]
    [ ... # push call] [ name_4 ]
    [ ... return dropR]

to

    [ # push # push ] [ name_4 ] [ name_3 ]
    [ # push # push jump dropR ] [ name_2 ] [ name_1 ]

or to

    [ # # # # push push ] [ name_1 ] [ name_2 ] [ name_3 ] [ name_4 ]
    [ push push jump dropR ]

(This is a generalization of "tailbiting"). To minimize space (but not time) in long sequences of only subroutine calls, some compilers may use direct threading. data_compression.html#program_compression Here is an example of a fully re-entrant direct-threaded subroutine list interpreter:

    [... # push call] [ subroutine_list_interpreter ]
    [ sub1 ] [ sub2 ] [ sub3 ] [ sub4 ] [ continue ]
    continue:  // ( ) < oldP p_subroutine_list >
    [dropR dropR ...] ...

    subroutine_list_interpreter: ( -- ) < p_subroutine_list -- >
    [ swapPR # swapPR push call cnop ]  // = [ @R+ push call cnop ]
    [ # jump dropR cnop ] [ subroutine_list_interpreter ]

This subroutine list interpreter assumes these subroutines don't modify the top 2 items on the return stack; it's OK if they use deeper items for parameters. Once you have this subroutine list interpreter, you can collapse other lists of subroutine calls to 2 full cells + 2 partial cells of overhead, plus 1 more cell per subroutine call.

The compiler converts ``.... if ... true_stuff ... then ...'' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ] [ then ]
    [ ... true_stuff ... ] ...
    then: ...

The compiler converts ``.... if ... true_stuff ... else ... false_stuff ... then ...'' statements to something like

    // last 5 instructions must all be in same cell
    [ ... # push 2/ jmp dropR push dropR ] [ else ]
    [ ... true_stuff ... ] ...
    [ ... # push jmp dropR ] [ then ]
    else: [ ... false_stuff ... ] ...
    then: ...

A minimum implementation (only 3 instructions per cell) is forced to use this alternate form:

    // last 2 instructions must be in same cell
    [ ... ... # push 2/ jmp ] [ else ]
    [ dropR push dropR ... true_stuff ... ] ...
    [ ... # push jmp dropR ] [ then ]
    else: [ dropR push dropR ... false_stuff ... ] ...
    then: [ ... ] ...

A smarter compiler might have a special ``else 0 then'' case for code like ``... if ... true_stuff ... else 0 then ...'', compiling it to

    [ ... # push 2/ jmp dropR ]  // ( refrain from dropping T )
    [ (then) ]
    [ push dropR ... true_stuff ... ]  // T was non-zero; drop it
    ...
    then: ...  // if T was 0, now T is still 0.

(Note that the "2/" never drops T, even in the case where T=0 and the CPU actually executes that next instruction. I suppose a really smart compiler might be able to refrain from dropping T in certain unusual situations when we later want to use T/2 in the true_stuff and/or 0 in more complicated false_stuff.)

The compiler converts ``... DO ... LOOP ...'' into [FIXME]
DO: loop initialization ...
LOOP: increment by 1 loop
+LOOP: variable increment loop
???

Some processors have the only conditional instruction be a "conditional return if 0" instruction. Another way of implementing ``.... if ... true_stuff ... then ...'' is to use the "conditional-return" style:

    [ ... # push call] [ subroutine_label ]
    [ dropR ... ] ...
    subroutine_label:
    [ ... condition ... ]
    [ ... 2/ return push dropR ]  // conditional return
    [ ... true_stuff ... ]
    [ ... return ]  // unconditional return

A "conditional return" implementation of ``.... if ... true_stuff ... else ... false_stuff ... then ...'' looks something like

    // could calculate some of condition here
    [ ... # push call] [ (else) ]
    ...
    else:  // < return_address >
    // could calculate some of condition here
    [ ... # push call ] [ (then) ]
    // < return_address >
    [ ... false_stuff ... ]
    [ ... return ]  // unconditional return
    ...
    then:  // < ... return_address else_address >
    [ ... condition ... ]
    // either return to "else:",
    // or fix up return stack so
    // we can do true_stuff
    // and return all the way to
    // original call.
    [ ... 2/ return drop dropR ]  // < return_address >
    [ ... true_stuff ... ] ...
    [ ... return dropR ]  // unconditional return

// some common macros
// using the instruction set
// "# !R+ + xor nand 2/ push popA AT call nop" (DAV 1999-03-24)
// pseudo-primitives: pop, dup, drop, swapTR, swap.

// pop: take top value off R stack, push onto T stack.
: pop ( -- x ) < x -- >  popA AT ;  // always inlined
: pop_dup ( -- x x ) < x -- >  _[[ popA AT AT _]] ;  // 3 cycles, always inlined
// dup: make a duplicate copy of T, push into T and S
: dup ( x -- x x )  push pop_dup ;  // 4 cycles, always inlined
: drop ( x -- ) < -- >  push dropR ;  // always inlined
// swapTR: swap T with R
: swapTR ( a -- b ) < b -- a >  popA push AT ;  // 3 cycles, always inlined
// swap: swap T with S
: swap ( ... a b -- ... b a )  push swapTR pop ;  // 6 cycles
: 0 ( -- 0 )  dup dup xor ;

: 0 ( -- 0 )  _[[ AT AT _]] xor ;  // 3 cycles, always inlined

// The A register and the ``popA'' and ``AT'' instructions may be
// reserved exclusively for these pseudo-primitives to use. Then use
// the "push, pop, dup, drop, swap, swapTR, dropR, 0" macros as if
// they were the primitive instructions.

// stack manipulation
: over ( ... b a -- ... b a b )  push dup pop swap ;
: over ( ... b a -- ... b a b )  // 13 cycles if inlined
  push dup swapTR pop ;
: rot ( ... a b c -- ... b c a )  push swap pop swap ;
: rot ( ... a b c -- ... b c a )  push push swapTR pop swapTR pop ;

// arithmetical manipulation
// com: complement: invert all bits of T.
: com ( x -- (-x-1) )  // or equivalently ( x -- -(x+1) )
  dup nand ;  // 5 cycles
: -1 ( -- -1 )  0 com ;  // 8 cycles
: -1 ( -- -1 )  0 0 nand ;  // 7 cycles
: negate ( x -- -x )  #(-1) + com ;  // 13 cycles, or 7 cycles + 1 RAM cycle
: 1 ( -- +1 )  #(1) ;  // 1 cycle + 1 RAM cycle
// too much work for no gain over a straightforward literal:
: 1  -1 dup + com ;  // 17 cycles
: 1
  _[[ AT AT _]] xor    // ( -- 0 )
  _[[ AT AT _]] xor    // ( 0 -- 0 0 )
  nand                 // ( 0 0 -- -1 )
  push _[[ popA AT AT  // ( -- -1 -1 ); A=-1
  +                    // ( ... -1 -1 -- -2 )
  AT AT _]] +          // ( ... -2 -1 -1 -- ... -2 -2 )
  nand ;               // ( ... -2 -2 -- ... +1 )  // 16 cycles

: and  nand dup nand ;
: or   over com and xor ;
// clever sequence from Chuck Moore:
: or   com swap com nand ;
: or   // 11 cycles
  push push pop_dup nand pop_dup nand nand

// subtract T from S:
: - ( ... a b -- ... (a-b) )  negate + ;

// 0<> logical buffer: 0 -> 0, nonzero -> (-1).
// 0= logical not: 0 -> (-1); nonzero -> 0.
// (Would it be better to map true to +1 rather than -1 ?)
// (Do we really use less power by using +1 rather than -1 ?)
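Since nand, xor, and + are the only arithmetic primitives, the identities above are worth a mechanical check. A small C test of them (my own, with 32-bit words assumed):

    #include <assert.h>
    #include <stdint.h>

    static int32_t nand32(int32_t a, int32_t b) { return ~(a & b); }

    int main(void) {
        int32_t a = 0x12345678, b = (int32_t)0x9ABCDEF0;
        /* : com  dup nand ;          -- complement          */
        assert(nand32(a, a) == ~a);
        /* : and  nand dup nand ;     -- and from nand       */
        assert(nand32(nand32(a, b), nand32(a, b)) == (a & b));
        /* : or   com swap com nand ; -- Chuck Moore's or    */
        assert(nand32(~a, ~b) == (a | b));
        /* : negate  #(-1) + com ;    -- two's complement    */
        assert(~(a + (-1)) == -a);
        /* : -  negate + ;            -- subtract T from S   */
        assert(a + (-b) == a - b);
        return 0;
    }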

// simple and straightforward
: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
  if 0 else -1 then ;
: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
  if -1 else 0  // take advantage of special speedup for ``else 0 then''
  then ;
: 0= ( 0 -- -1 ) | ( nonzero -- 0 )  0<> com ;
: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )  0= com ;

// doesn't work now -- only works if "swap" is really a primitive
//: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
//  #(-1) swap
//            // x=0 | x!=0
//  2/ swap   // ( ... -1 0 -- ... 0 -1 ) | ( ... -1 x -- ... -1 x/2 )
//  drop      // ( ... 0 -1 -- ... 0 )    | ( ... -1 x/2 -- ... -1 )

// somewhat more bizarre programming styles
: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
  -1 swap
  if dup else 0  // take advantage of special speedup for ``else 0 then''
  then xor ;

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
  1 push 0 push        // ( x -- x ) < ... -- ... 1 0 >
  _[[ popA             // ( x -- x ) < ... 1 0 -- ... 1 >, A=0
                       // x=0 | x!=0
  2/ popA              // < 1 -- >, A=1 | < 1 -- 1 >, A=0
  AT AT _]]            // ( 0 -- ... 0 1 1 ) < -- > | ( x/2 -- ... x/2 0 0 ) < 1 -- 1 >
  2/ dropR drop        // ( 0 1 1 -- ... 0 1 ) < -- > | ( ... x/2 0 0 -- ... x/2 0 ) < 1 -- >
  push push dropR pop  // ( ... 0 1 -- 1 ) < -- > | ( ... x/2 0 -- 0 ) < -- >
;

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
  -1 push
  if pop ret   // force a return
  else 0 then  // take advantage of special speedup for ``else 0 then''
  dropR ;

which the compiler expands to

: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
  [ # push # push 2/ jmp dropR ] [ -1 ] [ iszero ]
  nonzero: [ push dropR popA AT ret ]
  iszero:  [ dropR ret ]

// even more bizarre versions:
: 0<> ( 0 -- 0 ) | ( nonzero -- -1 )
  [ 2/ ret cnop ]
  nonzero: [ push dropR # ret cnop ] [ -1 ]

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
  // this version must *not* be in-lined
  #(nonzero) push 2/ swapPR push dropR # dropR ret
  // these 6 instructions *must* be in the same cell
  zero:    [ -1 ]
  nonzero: [ 0 ]

which the compiler expands to

: 0= ( 0 -- -1 ) | ( nonzero -- 0 )
  [ # push 2/ swapPR push dropR # dropR ret ] [ nonzero ] [ -1 ]
  nonzero: [ 0 ]

// A smart compiler might take any phrase of the form
// ``... if #(a) ... true_stuff ... else #(b) ... false_stuff ... then''
// and compile it to

    [ ... # push 2/ jmp dropR push dropR # ] [ else ] [ a ]
    [ ... true_stuff ... ] ...
    [ ... # push jmp dropR ] [ then ]
    else: [ b ]
    [ ... false_stuff ... ] ...
    then: ...

: 1+ ( ... a -- ... (1+a) )  com -1 + com ;  // 19 cycles
: 1+ ( ... a -- ... (1+a) )  #(1) + ;  // 2 cycles + 1 RAM cycle
: 1- ( ... a -- ... (a-1) )  -1 + ;  // 9 cycles

: R-- ( -- ) < x -- (x-1) >  // subtract one from R
  -1 pop + push ;

// other macros
// "R@+", "fetch": load word from RAM to T, at address in R.
// Note that both swapPR *must* be in same cell to avoid wild jumps.
: R@+ ( -- value ) < source_address -- (1+source_address) >
  _[[ swapPR load swapPR _]] ;

// "@", "fetch": load word from RAM at address in T.
: @ ( source_address -- value ) < -- >
  push R@+ dropR ;

// "!", "poke": put value in T into RAM at address S
// ( ... value dest_address -- )
: !  push !R+ dropR ;

: = ( n1 n2 -- (n1==n2) )  xor 0= ;

HEX

// from plichota
: MIN ( -- 80000000 )  80000000 ;
: signbit ( -- 80000000 )  80000000 ;
// from plichota
: MAX+ ( -- 7FFFFFFF )  7FFFFFFF ;

// if T is strictly negative, return true.
// from plichota
: 0< ( x<0 -- -1 ) | ( x>=0 -- 0 )  signbit and 0<> ;
: >  ( n1 n2 -- flag=n1>n2 )   - 0< ;
: <  ( n1 n2 -- flag=n1<n2 )   swap > ;
: <= ( n1 n2 -- flag=n1<=n2 )  > 0= ;

: >= ( n1 n2 -- flag=n1>=n2 )  swap <= ;
// from plichota
: ABS ( n -- |n| )  DUP 0< IF NEGATE THEN ;

// pull out the third element down in the stack, and put on T
: rot ( ... a b c -- ... b c a )  push swap pop swap ;

// multi-word arithmetic
// copy word from one literal location to another
// ( -- ) < -- >
: copyword(source, dest)  #(dest) #(source) @ ! ;
: copyword ( dest -- 1+dest ) < source -- 1+source >
  @R+ push swapTR store pop ;

// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 9 cycles + 4 RAM cycles
: copy2words ( source -- 2+source ) < dest -- 2+dest >
  push
  _[[ swapPR load store load store swapPR _]]
  pop

// ( dest -- 4+dest ) < source -- 4+source >
: copy4words  (copyword copyword copyword copyword)

// ( -- ) < source dest -- 4+source 4+dest >
// Note that both swapPR *must* be in same cell to avoid wild jumps.
// this requires cell size to be at least 6 instructions
// 13 cycles + 8 RAM cycles
: copy4words
  _[[ swapPR load load load load swapPR _]]
  _[[ popA store store store store AT _]] push

[ _Stack Computers_ by Philip Koopman 1989 mentions these words:

// ?DUP: conditionally duplicate T if it is non-zero
: ?DUP ( 0 -- 0 ) or ( x -- x x )  dup if dup then ;
(Chuck Moore doesn't like ?DUP: ``1x Forth'' by Charles Moore April 13, 1999 http://www.ultratechnology.com/1xforth.htm )

U<    U1 U2 - FLAG   Return a true FLAG if U1 is less than U2 when compared as unsigned integers.
U>    U1 U2 - FLAG   Return a true FLAG if U1 is greater than U2 when compared as unsigned integers.
U*    N1 N2 - D3     Perform unsigned integer multiplication on N1 and N2, yielding the unsigned double precision result D3.
U/MOD D1 N2 - N3 N4  Perform unsigned integer division on D1 and N2, yielding the quotient N4 and the remainder N3.

how to implement multi-precision operations ? since there are no condition codes, the carry flag must be pushed onto the data stack as a logical value. _Stack Computers_ by Philip Koopman 1989 mentions these words:

RLC   N1 CIN -> N2 COUT   Rotate left through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.
RRC   N1 CIN -> N2 COUT   Rotate right through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.
UNORM ( ... EXP1 U2 -> ... EXP3 U4 )    Floating point normalize of unsigned 32-bit mantissa.
ADC   ( ... N1 N2 CIN -> ... N3 COUT )  Add with carry. CIN and COUT are logical flags on the stack.

// Store the double-precision value D1 at the two memory
// words starting at ADDR.
// [Is most-significant or least-significant word on T ?]
: D! ( ... D1 ADDR - )

// Drop the double-precision integer D1.
: DDROP ( D1 -- )  drop drop ;

// Duplicate double-precision integer D1 on the stack.
: DDUP ( D1 - D1 D1 )  over over ;

D+      D1 D2 - D3   Return the double precision sum of D1 and D2 as D3.
D@      ADDR - D1    Fetch the double precision value D1 from memory starting at address ADDR.
DNEGATE D1 - D2      Return D2, which is the two's complement of D1.

// Swap the top two double-precision numbers on the stack.
: DSWAP ( D1 D2 - D2 D1 )  ( ... a b c d -- ... c d a b )
  push swap push swap  // ( ... c a ) < ... d b >
  pop pop              // ( ... c a b d ) < ... >
  swap push swap       // ( ... c d a ) < ... b >
  pop

I      - N1   Return the index of the currently active loop.
I'     - N1   Return the limit of the currently active loop.
J      - N1   Return the index of the outer loop in a nested loop structure.
LEAVE         Set the loop counter on the return stack equal to the loop limit to force an exit from the loop.
S-D    N1 - D2   Sign extend N1 to occupy two words, making it a double precision integer D2.
SP@    (fetch contents of data stack pointer)
SP!    (initialize data stack pointer)
RP@    (fetch contents of return stack pointer)
RP!    (initialize return stack pointer)
MATCH  (string compare primitive)
ABORT" (error checking & reporting word)
+LOOP  (variable increment loop)
/LOOP  (variable unsigned increment loop)
CMOVE  (string move)
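Since there are no condition codes, ADC's carry really is just another stack item. A C sketch of that convention (mine, not Koopman's):

    #include <stdint.h>

    /* ADC ( n1 n2 cin -- n3 cout ): carry in and out as logical flags. */
    static void adc(uint32_t n1, uint32_t n2, uint32_t cin,
                    uint32_t *n3, uint32_t *cout) {
        uint64_t wide = (uint64_t)n1 + n2 + (cin ? 1 : 0);
        *n3   = (uint32_t)wide;           /* low word of the sum          */
        *cout = (wide >> 32) ? ~0u : 0u;  /* carry as -1/0, Forth style   */
    }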

note to assembly language programmers and compiler writers


The compiler must ensure that A is in a don't-care state at the start and end of every cell in order for interrupt routines to be free to use A without saving it. The compiler can enforce this rule by only allowing the use of the ``popA'' and ``AT'' opcodes in the subroutines "0", "dup", "pop", "swap", "swapTR" (and perhaps a few others), and making sure that when these subroutines are inlined, cell breaks don't occur at the "wrong" place. (The dropR opcode, even though it is the same as popA, can be allowed anywhere A is already in a don't-care state, since it leaves A in a (different) don't-care state).

Perhaps we need a special "non-breaking space" notation so the programmer can indicate that certain instructions (those sensitive to A, and certain other ones following ``2/'') must be packed into a single cell. (If cells contain N instructions, ``cnop'' forces the next N instructions into the same cell; but when I really wanted the next 3 instructions to stay together, and there were 4 empty slots remaining in the current cell, ``cnop'' is a bit wasteful.)

note to optimizing compiler writers:

A peephole optimizer usually needs to eliminate do-nothing "popA AT push" sequences in straight-line code, since the only effect is to change A, and usually A is in a "don't care" state. In particular, the "pop swap" macro sequence (in the "over" macro) expands to "pop push swapTR pop", and the obviously do-nothing subsequence "pop push" is further expanded to the do-nothing sequence "popA AT push". Perhaps it would be simplest to immediately replace the "pop swap" sequence with the faster sequence "swapTR pop" and then expand *that* into "popA push AT popA AT". Similarly, "swap push" expands to "push swapTR pop push", with the same do-nothing subsequence "pop push"; it seems simple to make the compiler smart enough to immediately replace "swap push" with "push swapTR".
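A toy version of that pass in C, rewriting a mnemonic array with exactly the three substitutions named above (an illustrative sketch, not the author's compiler):

    #include <stdio.h>
    #include <string.h>

    static int pair(const char *a, const char *b,
                    const char *in[], int i, int n) {
        return i + 1 < n && !strcmp(in[i], a) && !strcmp(in[i+1], b);
    }

    static int peephole(const char *in[], int n, const char *out[]) {
        int i = 0, m = 0;
        while (i < n) {
            if (pair("pop", "push", in, i, n)) {
                i += 2;                                 /* do-nothing pair: delete */
            } else if (pair("pop", "swap", in, i, n)) {
                out[m++] = "swapTR"; out[m++] = "pop"; i += 2;
            } else if (pair("swap", "push", in, i, n)) {
                out[m++] = "push"; out[m++] = "swapTR"; i += 2;
            } else {
                out[m++] = in[i++];
            }
        }
        return m;
    }

    int main(void) {
        const char *over[] = { "push", "dup", "pop", "swap" };  /* the "over" macro */
        const char *opt[8];
        int m = peephole(over, 4, opt);
        for (int i = 0; i < m; i++) printf("%s ", opt[i]);      /* push dup swapTR pop */
        putchar('\n');
        return 0;
    }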

Macros to emulate each of the F21 instructions


(T he "A" is simulated as an location in RAM, _A ) else uncondit ional jump: #(dest ) push _[[ jmp dropR _]] cnop T=0(dest ) : dup #(dest ) push _[[ 2/ jmp dropR push dropR _]] cnop call(dest ) : #(dest ) push call cnop dropR C=0(dest ) : #(MSbset ) nand dup nand T=0(dest ) cnop ; ret urn : ret urn cnop cnop subrout ine st art : // no ent ry sequence needed. : @R+ // fet ch, address in R, t hen increment R cnop // next 3 inst ruct ion must be in same cell. swapPR @P+ // do t he load swapPR // rest ore P ; : @A+ : #(_A) // get address of _A (&A) dup push push @R+ dropR // fet ch value of A: (A) < &A > push @R+ // fet ch what A point s t o: (dat a) < A+1 &A > pop ! dropR // updat e A wit h new value. (dat a) < -- > #:# @A: #(_A) // get address of A (&A) push @R+ dropR // fet ch value of A: (A) < -- > push @R+ // fet ch what A point s t o: (dat a) < A+1 > dropR // !R+ : ! //st ore, address in R, increment R !A+ : #(_A) // get address of A (&A dat a) dup push push @R+ dropR // fet ch value of A: (A dat a) < &A > push ! // pop T, st ore it at [A]: ( -- ) < A+1 &A > pop ! dropR // updat e A wit h new value. (dat a) < -- > !A : #(_A) // get address of A (&A dat a) push @R+ dropR // fet ch value of A: (A dat a) < -- > push ! // pop T, st ore it at [A]: ( -- ) < A+1 > dropR // com : dup nand 2* : dup + 2/ : 2/

+* :  // add S to T if T0 is one
  // DUP 1 AND IF OVER + THEN
  dup #(1) and  // ``and'' expands to ``nand dup nand''
  if over       // ``over'' expands to ``push dup swapTR pop''
  else 0 then
  +

which expands to (with a good compiler on a 4-instruction-per-word machine)

  [ push popA AT AT ]       // ``dup''
  [ # push # nand ]         // set up for ``if'', #(1), ``nand''
  [ (then) ] [ (1) ]
  [ push popA AT AT ]       // ``dup''
  [ nand 2/ jmp dropR ]     // ``nand'', ``if''
  [ push dropR push push ]  // finish ``if'', start ``over'': ``push push''
  [ popA AT AT nop ]        // continue ``over'': ``dupR''
  [ popA push AT nop ]      // ``swapTR''
  [ popA AT nop nop ]       // finish ``over'': ``pop''.
  then: [ + ]

+*R :  // add R to T if T0 is one
  // DUP 1 AND IF pop dup push + THEN
  [ push popA AT AT ]       // ``dup''
  [ # push # nand ]         // set up for ``if'', #(1), ``nand''
  [ (then) ] [ (1) ]
  [ push popA AT AT ]       // ``dup''
  [ nand 2/ jmp dropR ]     // ``nand'', ``if''
  [ push dropR nop nop ]    // finish ``if''
  [ popA AT AT push ]       // get copy of R: ``pop dup push''
  then: [ + ]               // add either 0 or a copy of R to original T

// add step on the return stack
+*_return :  // add Rnext to Rtop if Rtop0 is one
  // pop DUP 1 AND IF pop dup push + THEN push
  [ popA AT AT # ]          // pop dup #(1)
  [ (1) ]
  [ nand push nop nop ]     // start of ``and''
  [ popA AT AT nand ]       // end of ``and''
  [ # 2/ jmp dropR ]        // ``if''
  [ (then) ]
  [ push dropR nop nop ]    // finish ``if''
  [ popA AT AT push ]       // get copy of R: ``pop dup push''
  then: [ + push ]          // add either 0 or a copy of Rnext to original Rtop

xor : xor
and : nand dup nand
+ : +
pop : popA AT
A@ : #(_A) push @R+ dropR
dup : dup
over : push dup pop swap
over : push dup swapTR pop
  which expands to
  push push
  _[[ popA AT AT _]]
  _[[ popA push AT _]]
  _[[ popA AT _]]
push : push
A! : #(_A) push ! dropR
nop : nop
drop : push popA

** Ways to expand the instruction set and make it less ugly
-- delete the "A" register, and replace 2 instructions "popA" and "AT " with 4 instructions "dup", "swap", "drop", and "pop". (I think this actually simplif ies the hardware) -- Add a proper "load" instruction (perhaps @R+) so we don't have to have this ugly hack of playing around with the program counter. (which seems to do bad things with the instruction pref etch). [FIXME:] 2000-07-23: Af ter f urther thought, this seems like a very good thing. Replace "!R++" and "@P++" with "@R++" and "!A++". Don't have time to go through and see if this is really superior and update all the examples. Advantages: Now we have a real load and a real store instruction. You may be asking, "OK, now we can *read* f rom a address in R, and we can *write* to a address in A -but how do we get literal values into R ?". err... ummm... we can't. Other than painstakingly mathematically assembling them a bit at a time. Unless... we have a special f unction at memory address 0. So normal code, whenever it needs some literal value, does dup dup xor // get a zero push // push t hat zero ont o t he ret urn st ack swapPR // call t he special funct ion at address 0. 60321 // some lit eral value, embedded in t he code dropR // get rid of t he 0. ... // program cont inues from t his point . T his is a classic technique in 6502 assembler (and probably other assembly languages) -- embed literal values, parameters to a subroutine, in the next memory cells af ter the subroutine call. T hen the subroutine can access them using the value on the return stack: [0 org] @R++ // grab t he const ant in t he caller's code st ream swapPR // ret urn from subrout ine. dropR // opt ionally get rid of t he 0 in t he branch delay. With this modif ication, code is *not* f orced to be executed a entire 3 instruction bundle at a time.

We don't need to call this subroutine *every* time -- I could imagine that we could have special versions of most subroutines that have an extra "@R++" tacked on the end, to get that literal value set to go.

-- replace the ``swapPR'' jack-of-all-trades branch (it's not as ugly as earlier versions with a ``T=0'' jack-of-all-trades branch) with normal "call" (``:'') and "return" (``;'') instructions (call: P->R, T->P) (return: R->P). The original F21 MISC had 5 branch instructions with several addressing modes each, but they run much faster.

-- delete "nand" and replace with "and" and "com".

Strange variations and ideas (ways to make the instruction set *more* ugly):

(Alan Grimes gave me the idea that we can make the jump instruction unconditional, if we design one or more of the arithmetical instructions so that, if they evaluate to zero, they execute the following "jump" instruction (typically either this unconditional jump instruction or NOP), and otherwise skip the "jump" and continue. Maybe make a completely different instruction the conditional instruction ?)

Would it be better to have a "-" subtract replace the "+" add ? If one does have a subtract, is it better to have f- or to have r- ? They can emulate each other, of course:

    : f- ( ... S T -- ... (S-T) )  swap r- ;
    : r- ( ... S T -- ... (T-S) )  swap f- ;
    : negate  0 r- ;
    : negate  0 swap f- ;
    : +  negate r- negate ;  // 5 cycles
    : +  negate f- ;  // 4 cycles
    : +_+ ( ... a b c -- ... (a+b+c) )  negate r- r- negate ;  // 6 cycles
    : +_+ ( ... a b c -- ... (a+b+c) )  negate f- negate f- ;  // 8 cycles

(??? would it be better to make ``2/'' *unsigned* shift-in-zeros ?)

Make the jump instruction conditional, like this: There is only 1 branch instruction, only one way to modify the value of P. It is also the only conditional instruction. T=0 conditional branch: conditionally swap P with R. (??? should this also drop T when T is zero ? When you *know* it is zero, why bother keeping it ?)

    +* :  // add S to T if T0 is one
    // DUP 1 AND IF OVER + THEN
    dup #(1) nand dup nand          // ( x -- lsb x )
    #(continuelabel) push T=0 cnop  // (1 x) < continuelabel >
    drop over +
    continuelabel:  // (x) < old_address >
    dropR

Once a cell (pointed to by [P]) is loaded for execution, all of the instructions in that cell are executed, from the first to the last. The CPU never skips the remaining instructions in a cell (not even for the branch instruction T=0). The CPU never starts executing somewhere in the middle of a cell -- all branch destinations are to (the first instruction of) a particular cell, not to some (other) instruction inside that cell. Once a cell has been loaded from a particular address into the instruction latch for execution, P is incremented.

Typically the instructions in that cell are executed while the contents of the next address are being loaded; that next cell will in turn be executed unless (a) the current cell mentions "load" (#), which diverts the next cell to T instead of the instruction latch, increments P, and starts loading the following cell; (b) the current cell modifies P with "T=0". (We *could* just flush the "next cell" we just loaded, then load where P now points, then start executing along this new branch).

Can "A" be thought of as part of a stack, so that a minimal-gate implementation only needs to hold P, T, the instruction latch, and pointers to the 2 stacks ? (What other internal registers are there ?) If A is part of the T S stack, then it needs to be copied when the stack shrinks (i.e., when we have 2-operand instructions or we have "push"). Perhaps making it part of the R stack will be easier. If A is part of the R stack, then "swapPR" is OK, "popA" is trivial (decrement pointer to A R stack), "AT" is OK, "push" is a bit tricky -- it needs to copy the old value of A to the next higher location, then move T to the old location of A.

It may make the hardware simpler for the "conditional skip next instruction" to only work *inside* a cell; i.e., after the end of every cell is an implied "nop", and if the conditional skip instruction is the last instruction in a cell, only that implied "nop" is skipped, not the 1st instruction of the next cell. Does this *really* make the hardware any simpler ?

One needs either "dropT" or "dropR"; the other can be synthesized:

    : dropT  push dropR ;
    : dropR  pop dropT ;

Picking "dropR" seems to reduce branch and call overhead.

One needs either "swap" or "over"; the other can be synthesized:

    : over  push dup pop swap ;
    : over  >R dup >A R> A> ;
    : swap  over push push drop pop pop ;
    : swap  push popA push AT pop ;
    : swap  >A >R A> R> ;

A "2-register" branch instruction can replace both call and return:

    ...
    load(label) SWAP(P,T) cnop
    (next instruction executed after subroutine returns)
    dropT
    ...
    label: (start of leaf subroutine)
    push
    ...
    pop SWAP(P,T) cnop

(6 instructions, not counting cnop). Or

    ...
    load(label) push SWAP(P,R) cnop
    (next instruction executed after subroutine returns)
    dropR
    ...
    label: (start of leaf subroutine)
    ...
    SWAP(P,R) cnop

(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")

A "3-register" branch instruction can replace both call and return. Call-like "3-register" branch:

    ...
    load(label) P->R, T->P cnop
    (next instruction executed after subroutine returns)
    dropR
    ...
    label: (start of leaf subroutine)
    ...
    pop P->R, T->P cnop

(5 instructions, not counting cnop) (6 instructions if we replace "dropR" with "pop dropT")

"Go To" style jumps can be coded

    #(label) push P->R, T->P // jump
    ...
    label: dropR // drop the unused address from R
    ...

Conditional "Go To" style jumps can be coded

    #(label) push conditional(T=0 ? P->R, T->P)
    ...
    label: dropR // drop the unused address from the R stack
    ...

Return-like "3-register" branch:

    ...
    load label push P->T, R->P // call subroutine
    cnop
    (next instruction executed after subroutine returns)
    dropT
    ...
    label: (start of leaf subroutine)
    push // save the return address on the R stack
    ...
    P->T, R->P cnop

(6 instructions, not counting cnop) The return-like "3-register" branch seems to always be inferior to the call-like "3-register" branch, because the call sequence requires that extra "push" instruction.

One could just make P the top of the return stack, making R not directly accessible ... one must be much more careful not to change P accidentally when temporarily using R. Then push and pop do dual-purpose as call and return. Then "Go To" style jumps can be coded

    #(label) push

with matching "come from" locations coded

    label: pop pop drop push

... or alternatively

    #(label) pop drop push

with no special code needed at the "come from" location.

Conditional "If (stuff) Then (thenpart) endif (continuation)" style structures are coded

    (stuff) // leaves result on T
    // T=0 conditionally drops either P (nonzero) or R (zero) ???
    #(label2) push T=0 cnop
    label: (thenpart)
    label2: (continuation)

"If (stuff) Then (thenpart) else (elsepart) endif (continuation)" style structures are coded

    (stuff) // leaves result on T
    // T=0 conditionally drops either P (nonzero) or R (zero) ???
    #(label2) push T=0 cnop
    label: (thenpart)
    pop drop // get rid of current value of P
    #(label3) push // get new value of P
    label2: (elsepart)
    label3: (continuation)

Subroutine calls can be coded

    # (com) push
    ...

while subroutines themselves can be coded

    // the return address is on the R stack
    ...
    pop drop

I played with this (1998) and got this instruction summary. The 15 instruction codes are:

Code  Description                                   Traditional Forth (where A is a variable)
@P+   fetch, address in P, push data onto T, P++    P @ @  1 P +!
!P+   store, address in P, pop data from T, P++     P @ !  1 P +!
push  pop T, push into P                            >P
pop   pop P, push into T                            P>
T=0   if T0 to T19 all zero, nop; else drop P.
      (note: ignores MSbit of T !)
A!    pop T into A                                  A!
A@    push A into T                                 A@
xor   exclusive-or S to T                           XOR
nand  nand S to T                                   AND COM
+     add S to T                                    +
2/    shift T, T20 to T19 (T20 unchanged)           2/
dup   push T into T                                 DUP
over  push S into T                                 OVER
drop  pop T                                         DROP
nop                                                 NOP

The next instruction fetch begins as soon as no memory instructions (@P+ !P+) or possible branches (push pop T=0) are pending.

    // fetch, address in R, increment R
    @R+   pop @P+ over push push drop pop

    // store, address in R, increment R
    !R+   pop over !P+ push drop

    // fetch, replacing address in T with data.
    @     push @P+ pop drop

    // Store, address in T, data in S, removing both.
    !     push !P+ pop drop

Memory load instructions: One *could* read everything through a single register, @P+ and !P+, but then references to variables in data RAM require careful manipulation to put the variable address in P, do the load, and restore the next-instruction address to P before the end of the cell: swapPR !P+ swapPR.

One *could* access everything through a single register, @R+ and !R+, but then references to in-line constants require that the sequence swapPR @R+ swapPR finish before the end of the cell.

It would be nice to *allow* a prefetch mechanism to work properly: As soon as an instruction is loaded into the instruction latch and starts to execute, the next word starts to be loaded into a prefetch buffer. If the instruction finishes with no memory access, we wait for the word to finish loading, then copy the prefetch buffer to the instruction latch. If the instruction has a # literal load, we wait for the word to finish loading, then copy the prefetch buffer to T, and start loading the *next* word. If the instruction has a data-memory access ... fastest would be to wait for the word to finish loading, somehow keep it in some other buffer, and start another memory cycle to load the requested item to T. But a simpler circuit would just interrupt and throw away the current cycle, start a new cycle to load the item to the buffer like all memory accesses do, and when that is done, move it to T and re-start the cycle to load the next instruction. (This may be faster if memory is *very* slow and can be cancelled quickly). The only case where this prefetch doesn't help is with a calculated jump; then one must interrupt and throw away the current cycle, start a new cycle to load the new P location, and when that is done move it from the buffer to the instruction latch.

We might get away with *not* having a "delayed branch" for normal calls (#(dest) push call) if we could force them into the sequence

    - execute current instruction while loading next cell
    - when you hit #, wait for cell to finish loading
    - don't pre-fetch any more cells until we're ready to put this new address on the data bus
    - load new cell from [dest] into instruction register and continue from there.

(but this still gives a delayed branch for the return). Using @P+ (rather than @R+) as the only memory access instruction seems to better reflect the state of the "simpler circuit" prefetch mechanism.

If one tries to replace the unconditional SWAP(P,R) instruction with a conditional cSWAP(P,R) instruction, then it becomes more difficult to pack a data memory load sequence into a single cell. One needs something like

    // < address -- address+1 >
    // ( -- [address] )
    dup dup xor // push 0 to T
    // all in one word:
    cSWAP(P,R)  // always swaps, since T is zero
    @P+         // do the load
    swap        // get the 0 back on top
    cSWAP(P,R)  // restore P
    // end of word
    dropT       // get rid of the 0.

OK, packing 4 instructions into a single word isn't too difficult. Although it *does* seem like a lot of work (8 instructions) to do a pretty fundamental operation. (It would only be 4 instructions if we add an unconditional SWAP(P,R) instruction; if we instead add a @R+ instruction, it would only take 1 instruction). Something simple like "copy one word from this address to that address" becomes

    // < source dest -- source++ dest++ >
    // ( -- )
    dup dup xor // push 0 to T
    // all in one word:
    cSWAP(P,R)  // always swaps, since T is zero
    @P+         // do the load
    swap        // get the 0 back on top
    cSWAP(P,R)  // restore P
    // end of word
    dropT       // get rid of the 0.
    pop swap    // now < dest > ( value source++ )
    !R+
    push

Since you never have multiple branches inside a single cell, maybe you could separate the flow-control and perhaps the "#" literal instruction, putting them in a special field. The remaining normal sequential instructions, since there are fewer of them, might fit into fewer bits in the remaining fields of the cell. This may be a slippery slope leading to a VLIW CPU rather than a MISC CPU. But it avoids the confusion of "What if there are multiple flow-control instructions in a word ?" Basically, I'm thinking something like a group of bits to indicate which type of cell this is:

    (a) this entire cell is a literal; read it into T and go on to the next word.
    (b) This cell is a normal sequential list of instructions; read them into the instruction latch, execute them, and go on to the next word.
    (c) When you're done with the instructions in this cell, do a conditional branch or call of type XXX.
    (d) When you're done with these instructions, return.

At the time I'm writing this, that is 4 different possibilities, which requires 2 bits, but it eliminates 3 instructions from my opcode list. (Just a single bit could be used to discriminate between sequential v. branching cells ...)

I don't know if I like the semantics of the conditional branch instruction T=0. Do we want it to test *all* of T ? Everything except the most significant bit of T (the carry bit) ? Do we want it to take the value T and conditionally (based on S, so we'll call it S=0) either move it to push down P (P->R, T->P) or just throw the value away ? Do we want T=0 to conditionally pop the value off P (R->P) or do nothing ? Do we want T=0 to conditionally *replace* the value P with a *copy* of the next value below it on the stack, or do nothing ? Do we want the zero value to be popped or not ?

Is this signed 2/ better than unsigned ? (I think so, since it's much easier to synthesize an unsigned shift -- generate a mask and strip off the shifted-in one bits -- than it is to have the unsigned shift be primitive and try to synthesize a signed shift.)
