Microprocessor History
A microprocessor -- also known as a CPU or central processing unit -- is a complete
computation engine that is fabricated on a single chip. The first microprocessor was the Intel
4004, introduced in 1971. The 4004 was not very powerful -- all it could do was add and
subtract, and it could only do that 4 bits at a time. But it was amazing that everything was on
one chip. Prior to the 4004, engineers built computers either from collections of chips or from
discrete components (transistors wired one at a time). The 4004 powered one of the first
portable electronic calculators.
The first microprocessor to make it into a home computer was the Intel 8080, a complete 8-
bit computer on one chip, introduced in 1974. The first microprocessor to make a real splash
in the market was the Intel 8088, introduced in 1979 and incorporated into the IBM PC
(which first appeared in 1981). If you are familiar with the PC market and its history, you
know that the PC market moved from the 8088 to the 80286 to the 80386 to the 80486 to the
Pentium to the Pentium II to the Pentium III to the Pentium 4. All of these microprocessors
are made by Intel and all of them are improvements on the basic design of the 8088. The
Pentium 4 can execute any piece of code that ran on the original 8088, but it does it about
5,000 times faster!
The following table helps you to understand the differences between the different processors
that Intel has introduced over the years.
Name         Date  Transistors  Microns  Clock speed  Data width           MIPS
8080         1974  6,000        6        2 MHz        8 bits               0.64
8088         1979  29,000       3        5 MHz        16 bits, 8-bit bus   0.33
80286        1982  134,000      1.5      6 MHz        16 bits              1
80386        1985  275,000      1.5      16 MHz       32 bits              5
80486        1989  1,200,000    1        25 MHz       32 bits              20
Pentium      1993  3,100,000    0.8      60 MHz       32 bits, 64-bit bus  100
Pentium II   1997  7,500,000    0.35     233 MHz      32 bits, 64-bit bus  ~300
Pentium III  1999  9,500,000    0.25     450 MHz      32 bits, 64-bit bus  ~510
Pentium 4    2000  42,000,000   0.18     1.5 GHz      32 bits, 64-bit bus  ~1,700
Compiled from The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores
• Transistors is the number of transistors on the chip. You can see that the number of
transistors on a single chip has risen steadily over the years.
• Microns is the width, in microns, of the smallest wire on the chip. For comparison, a
human hair is 100 microns thick. As the feature size on the chip goes down, the number
of transistors rises.
• Clock speed is the maximum rate that the chip can be clocked at. Clock speed will make
more sense in the next section.
• Data Width is the width of the ALU. An 8-bit ALU can add/subtract/multiply/etc. two 8-bit
numbers, while a 32-bit ALU can manipulate 32-bit numbers. An 8-bit ALU would have to
execute four instructions to add two 32-bit numbers, while a 32-bit ALU can do it in one
instruction. In many cases, the external data bus is the same width as the ALU, but not
always. The 8088 had a 16-bit ALU and an 8-bit bus, while the modern Pentiums fetch
data 64 bits at a time for their 32-bit ALUs.
• MIPS stands for "millions of instructions per second" and is a rough measure of the
performance of a CPU. Modern CPUs can do so many different things that MIPS ratings
lose a lot of their meaning, but you can get a general sense of the relative power of the
CPUs from this column.
From this table you can see that, in general, there is a relationship between clock speed and
MIPS. The maximum clock speed is a function of the manufacturing process and delays
within the chip. There is also a relationship between the number of transistors and MIPS. For
example, the 8088 clocked at 5 MHz but only executed at 0.33 MIPS (about one instruction
per 15 clock cycles). Modern processors can often execute at a rate of two instructions per
clock cycle. That improvement is directly related to the number of transistors on the chip and
will make more sense in the next section.
Inside a Microprocessor
To understand how a microprocessor works, it is helpful to look inside and learn about the
logic used to create one. In the process you can also learn about assembly language -- the
native language of a microprocessor -- and many of the things that engineers can do to
boost the speed of a processor.
A microprocessor executes a collection of machine instructions that tell the processor what
to do. Based on the instructions, a microprocessor does three basic things:
• Using its ALU (arithmetic/logic unit), a microprocessor can perform mathematical operations such as addition, subtraction, multiplication and division
• A microprocessor can move data from one memory location to another
• A microprocessor can make decisions and jump to a new set of instructions based on those decisions
There may be very sophisticated things that a microprocessor does, but those are its three
basic activities.
basic activities. The following diagram shows an extremely simple microprocessor capable of
doing those three things:
Let's assume that both the address and data buses are 8 bits wide in this example.
• Registers A, B and C are simply latches made out of flip-flops. (See the section on
"edge-triggered latches" in How Boolean Logic Works for details.)
• The address latch is just like registers A, B and C.
• The program counter is a latch with the extra ability to increment by 1 when told to do
so, and also to reset to zero when told to do so.
• The ALU could be as simple as an 8-bit adder (see the section on adders in How
Boolean Logic Works for details), or it might be able to add, subtract, multiply and
divide 8-bit values. Let's assume the latter here.
• The test register is a special latch that can hold values from comparisons performed
in the ALU. An ALU can normally compare two numbers and determine if they are
equal, if one is greater than the other, etc. The test register can also normally hold a
carry bit from the last stage of the adder. It stores these values in flip-flops and then
the instruction decoder can use the values to make decisions.
• There are six boxes marked "3-State" in the diagram. These are tri-state buffers. A
tri-state buffer can pass a 1, a 0 or it can essentially disconnect its output (imagine a
switch that totally disconnects the output line from the wire that the output is heading
toward). A tri-state buffer allows multiple outputs to connect to a wire, but only one of
them to actually drive a 1 or a 0 onto the line.
• The instruction register and instruction decoder are responsible for controlling all of
the other components. Although they are not shown in the diagram, there would be
control lines from the instruction decoder that would:
• Tell the A register to latch the value currently on the data bus
• Tell the B register to latch the value currently on the data bus
• Tell the C register to latch the value currently on the data bus
• Tell the program counter register to latch the value currently on the data bus
• Tell the address register to latch the value currently on the data bus
• Tell the instruction register to latch the value currently on the data bus
• Tell the program counter to increment
• Tell the program counter to reset to zero
• Activate any of the six tri-state buffers (six separate lines)
• Tell the ALU what operation to perform
• Tell the test register to latch the ALU's test bits
• Activate the RD line
• Activate the WR line
Coming into the instruction decoder are the bits from the test register and the clock line, as
well as the bits from the instruction register.
ROM stands for read-only memory. A ROM chip is programmed with a permanent collection
of pre-set bytes. The address bus tells the ROM chip which byte to get and place on the data
bus. When the RD line changes state, the ROM chip presents the selected byte onto the
data bus.
RAM stands for random-access memory. RAM contains bytes of information, and the
microprocessor can read or write to those bytes depending on whether the RD or WR line is
signaled. One problem with today's RAM chips is that they forget everything once the power
goes off. That is why the computer needs ROM.
By the way, nearly all computers contain some amount of ROM (it is possible to create a
simple computer that contains no RAM -- many microcontrollers do this by placing a handful
of RAM bytes on the processor chip itself -- but generally impossible to create one that
contains no ROM). On a PC, the ROM is called the BIOS (Basic Input/Output System).
When the microprocessor starts, it begins executing instructions it finds in the BIOS. The
BIOS instructions do things like test the hardware in the machine, and then it goes to the
hard disk to fetch the boot sector (see How Hard Disks Work for details). This boot sector is
another small program, and the BIOS stores it in RAM after reading it off the disk. The
microprocessor then begins executing the boot sector's instructions from RAM. The boot
sector program will tell the microprocessor to fetch something else from the hard disk into
RAM, which the microprocessor then executes, and so on. This is how the microprocessor
loads and executes the entire operating system.
Microprocessor Instructions
Even the incredibly simple microprocessor shown in the previous example will have a fairly
large set of instructions that it can perform. The collection of instructions is implemented as
bit patterns, each one of which has a different meaning when loaded into the instruction
register. Humans are not particularly good at remembering bit patterns, so a set of short
words are defined to represent the different bit patterns. This collection of words is called the
assembly language of the processor. An assembler can translate the words into their bit
patterns very easily, and then the output of the assembler is placed in memory for the
microprocessor to execute.
Here's the set of assembly language instructions that the designer might create for the
simple microprocessor in our example:
• LOADA mem - Load register A from memory address
• LOADB mem - Load register B from memory address
• CONB con - Load a constant value into register B
• SAVEB mem - Save register B to memory address
• SAVEC mem - Save register C to memory address
• ADD - Add A and B and store the result in C
• SUB - Subtract A and B and store the result in C
• MUL - Multiply A and B and store the result in C
• DIV - Divide A and B and store the result in C
• COM - Compare A and B and store the result in the test register
• JUMP addr - Jump to an address
• JEQ addr - Jump, if equal, to address
• JNEQ addr - Jump, if not equal, to address
• JG addr - Jump, if greater than, to address
• JGE addr - Jump, if greater than or equal, to address
• JL addr - Jump, if less than, to address
• JLE addr - Jump, if less than or equal, to address
• STOP - Stop execution
If you have read How C Programming Works, then you know that this simple piece of C code
will calculate the factorial of 5 (where the factorial of 5 = 5! = 5 * 4 * 3 * 2 * 1 = 120):
a=1;
f=1;
while (a <= 5)
{
    f = f * a;
    a = a + 1;
}
At the end of the program's execution, the variable f contains the factorial of 5.
A C compiler translates this C code into assembly language. Assuming that RAM starts at
address 128 in this processor, and ROM (which contains the assembly language program)
starts at address 0, then for our simple microprocessor the assembly language might look
like this:
// Assume a is at address 128
// Assume f is at address 129
0 CONB 1 // a=1;
1 SAVEB 128
2 CONB 1 // f=1;
3 SAVEB 129
4 LOADA 128 // if a > 5 then jump to 17
5 CONB 5
6 COM
7 JG 17
8 LOADA 129 // f=f*a;
9 LOADB 128
10 MUL
11 SAVEC 129
12 LOADA 128 // a=a+1;
13 CONB 1
14 ADD
15 SAVEC 128
16 JUMP 4 // loop back to the compare
17 STOP
So now the question is, "How do all of these instructions look in ROM?" Each of these
assembly language instructions must be represented by a binary number. For the sake of
simplicity, let's assume each assembly language instruction is given a unique number, like
this:
• LOADA mem - 1
• LOADB mem - 2
• CONB con - 3
• SAVEB mem - 4
• SAVEC mem - 5
• ADD - 6
• SUB - 7
• MUL - 8
• DIV - 9
• COM - 10
• JUMP addr - 11
• JEQ addr - 12
• JNEQ addr - 13
• JG addr - 14
• JGE addr - 15
• JL addr - 16
• JLE addr - 17
• STOP - 18
The numbers are known as opcodes. In ROM, our little program would look like this:
// Assume a is at address 128
// Assume f is at address 129
Addr opcode/value
0 3 // CONB 1
1 1
2 4 // SAVEB 128
3 128
4 3 // CONB 1
5 1
6 4 // SAVEB 129
7 129
8 1 // LOADA 128
9 128
10 3 // CONB 5
11 5
12 10 // COM
13 14 // JG 17
14 31
15 1 // LOADA 129
16 129
17 2 // LOADB 128
18 128
19 8 // MUL
20 5 // SAVEC 129
21 129
22 1 // LOADA 128
23 128
24 3 // CONB 1
25 1
26 6 // ADD
27 5 // SAVEC 128
28 128
29 11 // JUMP 4
30 8
31 18 // STOP
You can see that seven lines of C code became 18 lines of assembly language, and that
became 32 bytes in ROM.
The instruction decoder needs to turn each of the opcodes into a set of signals that drive the
different components inside the microprocessor. Let's take the ADD instruction as an
example and look at what it needs to do:
1. During the first clock cycle, we need to actually load the instruction. Therefore the
instruction decoder needs to:
• activate the tri-state buffer for the program counter
• activate the RD line
• activate the data-in tri-state buffer
• tell the instruction register to latch the value currently on the data bus
2. During the second clock cycle, the ADD instruction is decoded. It needs to do very
little:
• set the operation of the ALU to addition
• tell the C register to latch the ALU's output
3. During the third clock cycle, the program counter is incremented (in theory this could
be overlapped into the second clock cycle).
Every instruction can be broken down as a set of sequenced operations like these that
manipulate the components of the microprocessor in the proper order. Some instructions,
like this ADD instruction, might take two or three clock cycles. Others might take five or six
clock cycles.
Microprocessor Performance
The number of transistors available has a huge effect on the performance of a processor.
As seen earlier, a typical instruction in a processor like an 8088 took 15 clock cycles to
execute. Because of the design of the multiplier, it took approximately 80 cycles just to do
one 16-bit multiplication on the 8088. With more transistors, much more powerful multipliers
capable of single-cycle speeds become possible.
More transistors also allow for a technology called pipelining. In a pipelined architecture,
instruction execution overlaps. So even though it might take five clock cycles to execute
each instruction, there can be five instructions in various stages of execution simultaneously.
That way it looks like one instruction completes every clock cycle.
Many modern processors have multiple instruction decoders, each with its own pipeline. This
allows for multiple instruction streams, which means that more than one instruction can
complete during each clock cycle. This technique can be quite complex to implement, so it
takes lots of transistors.
The trend in processor design has been toward full 32-bit ALUs with fast floating point
processors built in and pipelined execution with multiple instruction streams. There has also
been a tendency toward special instructions (like the MMX instructions) that make certain
operations particularly efficient. There has also been the addition of hardware virtual memory
support and L1 caching on the processor chip. All of these trends push up the transistor
count, leading to the multi-million transistor powerhouses available today. These processors
can execute about one billion instructions per second!
Computer Caches
A computer is a machine in which we measure time in very small increments. When the
microprocessor accesses the main memory (RAM), it does it in about 60 nanoseconds (60
billionths of a second). That's pretty fast, but it is much slower than the typical
microprocessor, which can have cycle times of just a few nanoseconds.
What if we build a special memory bank, small but very fast (around 30 nanoseconds)?
That's already two times faster than the main memory access. That's called a level 2 cache
or an L2 cache. What if we build an even smaller but faster memory system directly into the
microprocessor's chip? That way, this memory will be accessed at the speed of the
microprocessor and not the speed of the memory bus. That's an L1 cache, which on a 233-
megahertz (MHz) Pentium is 3.5 times faster than the L2 cache, which is two times faster
than the access to main memory.
There are a lot of subsystems in a computer; you can put cache between many of them to
improve performance. Here's an example. We have the microprocessor (the fastest thing in
the computer). Then there's the L1 cache that caches the L2 cache that caches the main
memory which can be used (and is often used) as a cache for even slower peripherals like
hard disks and CD-ROMs. The hard disks are also used to cache an even slower medium --
your Internet connection.
Your Internet connection is the slowest link in your computer. So your browser (Internet
Explorer, Netscape, Opera, etc.) uses the hard disk to store HTML pages, putting them into a
special folder on your disk. The first time you ask for an HTML page, your browser renders it
and a copy of it is also stored on your disk. The next time you request access to this page,
your browser checks if the date of the file on the Internet is newer than the one cached. If the
date is the same, your browser uses the one on your hard disk instead of downloading it
from the Internet. In this case, the smaller but faster memory system is your hard disk and the
larger and slower one is the Internet.
Cache can also be built directly on peripherals. Modern hard disks come with fast memory,
around 512 kilobytes, hardwired to the hard disk. The computer doesn't directly use this
memory -- the hard-disk controller does. For the computer, these memory chips are the disk
itself. When the computer asks for data from the hard disk, the hard-disk controller checks
into this memory before moving the mechanical parts of the hard disk (which is very slow
compared to memory). If it finds the data that the computer asked for in the cache, it will
return the data stored in the cache without actually accessing data on the disk itself, saving a
lot of time.
Here's an experiment you can try. Your computer caches your floppy drive with main
memory, and you can actually see it happening. Access a large file from your floppy -- for
example, open a 300-kilobyte text file in a text editor. The first time, you will see the light on
your floppy turning on, and you will wait. The floppy disk is extremely slow, so it will take 20
seconds to load the file. Now, close the editor and open the same file again. The second
time (don't wait 30 minutes or do a lot of disk access between the two tries) you won't see
the light turning on, and you won't wait. The operating system checked into its memory
cache for the floppy disk and found what it was looking for. So instead of waiting 20 seconds,
the data was found in a memory subsystem much faster than when you first tried it (one
access to the floppy disk takes 120 milliseconds, while one access to the main memory
takes around 60 nanoseconds -- that's a lot faster). You could have run the same test on
your hard disk, but it's more evident on the floppy drive because it's so slow.
To give you the big picture of it all, here's the hierarchy of a normal caching system:
• L1 cache - memory accesses at full microprocessor speed
• L2 cache - SRAM-type memory access (around 20 to 30 nanoseconds)
• Main memory - RAM-type memory access (around 60 nanoseconds)
• Hard disk - mechanical, slow (around 12 milliseconds)
• Internet - incredibly slow (between 1 second and 3 days)
Cache Technology
One common question asked at this point is, "Why not make all of the computer's memory
run at the same speed as the L1 cache, so no caching would be required?" That would work,
but it would be incredibly expensive. The idea behind caching is to use a small amount of
expensive memory to speed up a large amount of slower, less-expensive memory.
In designing a computer, the goal is to allow the microprocessor to run at its full speed as
inexpensively as possible. A 500-MHz chip goes through 500 million cycles in one second
(one cycle every two nanoseconds). Without L1 and L2 caches, an access to the main
memory takes 60 nanoseconds, or about 30 wasted cycles accessing memory.
When you think about it, it is kind of incredible that such relatively tiny amounts of memory
can maximize the use of much larger amounts of memory. Think about a 256-kilobyte L2
cache that caches 64 megabytes of RAM. In this case, 256,000 bytes efficiently caches
64,000,000 bytes. Why does that work?
Even if you don't know much about computer programming, it is easy to understand that in
the 11 lines of this program, the loop part (lines 7 to 9) is executed 100 times. All of the
other lines are executed only once. Lines 7 to 9 will run significantly faster because of
caching.
This program is very small and can easily fit entirely in the smallest of L1 caches, but let's
say this program is huge. The result remains the same. When you program, a lot of action
takes place inside loops. A word processor spends 95 percent of the time waiting for your
input and displaying it on the screen. This part of the word-processor program is in the
cache.
This 95%-to-5% ratio (approximately) is what we call the locality of reference, and it's why a
cache works so efficiently. This is also why such a small cache can efficiently cache such a
large memory system. You can see why it's not worth it to construct a computer with the
fastest memory everywhere. We can deliver 95 percent of this effectiveness for a fraction of
the cost.
Virtual Memory
For example, if you load the operating system, an e-mail program, a Web browser and word
processor into RAM simultaneously, 32 megabytes is not enough to hold it all. If there were
no such thing as virtual memory, then once you filled up the available RAM your computer
would have to say, "Sorry, you cannot load any more applications. Please close another
application to load a new one." With virtual memory, what the computer can do is look at
RAM for areas that have not been used recently and copy them onto the hard disk. This
frees up space in RAM to load the new application.
Because this copying happens automatically, you don't even know it is happening, and it
makes your computer feel like it has unlimited RAM space even though it only has 32
megabytes installed. Because hard disk space is so much cheaper than RAM chips, it also
has a nice economic benefit.
The following is a comparative text meant to give people a feel for the differences
in the various 6th generation x86 CPUs. For this little ditty, I've chosen the Intel
P-II (aka Klamath, P6), the AMD K6 (aka NX686), and the Cyrix 6x86MX (aka
M2). These are all MMX capable 6th generation x86 compatible CPUs, however I
am not going to discuss the MMX capabilities at all beyond saying that they all
appear to have similar functionality. (MMX never really took off as the software
enabling technology Intel claimed it to be, so it's not worth going into any depth
on it.)
Much of the following information comes from online documentation from Cyrix,
AMD and Intel. I have played a little with Pentiums and Pentium-II's from work, as
well as my AMD-K6 at home. I would also like to thank Dan Wax, Lance Smith
and "Bob Instigator" from AMD, who corrected me on several points about the
K6, and both Andreas Kaiser and Lee Powell, who also provided insightful
information and corrections gleaned from first-hand experience with these
CPUs. Also, thanks to Terje Mathisen who pointed out an error, and Brian
Converse who helped me with my grammar.
Comments welcome.
The AMD K6
The K6 architecture seems to mix some of the ideas of the P-II and 6x86MX
architectures. They made trade-offs and decisions that they believed would
deliver the maximal performance over all potential software. They have
emphasized short latencies (like the 6x86MX) but the K6 translates their x86
instructions into RISC operations that are queued in large instruction buffers and
feed many (7 in all) independent units (like the P-II.) While they don't always
have the best single implementation of any specific aspect, this was the result of
conscious decisions that they believe helps strike a balance that hits a good
performance sweet spot. Versus the P-II, they avoid situations of really deep
pipelining which has high penalties when the pipeline has to be backed out.
Versus the Cyrix, the AMD is a fully POST-RISC architecture which is not as
susceptible to pipeline stalls that artificially back up other stages.
General Architecture
This seems remarkably simple considering the features that are claimed for the
K6. The secret is that most of these stages do very complicated things. The light
blue stages execute in an out of order fashion (and were colored by me, not
AMD.)
The fetch stage is much like a typical Pentium instruction fetcher, and is able to
present 16 cache aligned bytes of data per clock. Of course this means that
some instructions that straddle 16 byte boundaries will suffer an extra clock
penalty before reaching the decode stage, much like they do on a Pentium. (The
K6 is a little clever in that if there are partial opcodes from which the predecoder
can determine the instruction length, then the prefetching mechanism will fetch
the new 16 byte buffer just in time to feed the remaining bytes to the issue
stage.)
The decode stage attempts to simultaneously decode 2 simple, 1 long, and fetch
from 1 ROM x86 instruction(s). If both of the first two fail (usually only on rare
instructions), the decoder is stalled for a second clock which is required to
completely decode the instruction from the ROM. If the first fails but the second
does not (the usual case when involving memory, or an override), then a single
instruction or override is decoded. If the first succeeds (the usual case when not
involving memory or overrides) then two simple instructions are decoded. The
decoded "OpQuad" is then entered into the scheduler.
This last statement has been generally misunderstood in its importance (even by
me!). Given that the P-II architecture can decode 3 instructions at once, it is
tempting to conclude that the P-II can execute typically up to 50% faster than a
K6. According to "Bob Instigator" (a technical marketroid from AMD) and "The
That said, in real life decode bandwidth limitations crop up every now and then
as a limiting factor, but rarely egregiously so in comparison to ordinary execution
limitations.
The issue stage accepts up to 4 RISC86 instructions from the scheduler. The
scheduler is basically an OpQuad buffer that can hold up to 6 clocks of
instructions (which is up to 12 dual issued x86 instructions.) The K6 issues
instructions subject only to execution unit availability using an oldest unissued
first algorithm at a maximum rate of 4 RISC86 instructions per clock (the X and Y
ALU pipelines, the load unit, and the store unit.) The instructions are marked as
issued, but not removed until retirement.
The operand fetch stage reads the issued instruction operands without any
restriction other than register availability. This is in contrast with the P-II which
can only read up to two retired register operands per clock (but is unrestricted in
forwarding (unretired) register accesses.) The K6 uses some kind of internal
"register MUX" which allows arbitrary accesses of internal and commited register
space. If this stage "fails" because of a long data dependency, then according to
expected availability of the operands the instruction is either held in this stage for
an additional clock or unissued back into the scheduler, essentially moving the
instruction backwards through the pipeline!
This is an ingenious design that allows the K6 to perform "late" data dependency
determinations without over-complicating the scheduler's issue logic. This clever
idea gives a very close approximation of a reservation station architecture's
"greedy algorithm scheduling".
The execution stages perform in one or two pipelined stages (with the exception
of the floating point unit which is not pipelined, or complex instructions which stall
those units during execution.) In theory, all units can be executing at once.
What we see here is the front end starting fairly tight (two instructions) and the
back end ending somewhat wider (two integer execution units, one load, one
store, and one FPU.) The reason for this seeming mismatch in execution
bandwidth (as opposed to the Pentium, for example which remains two-wide
from top to bottom) is that it will be able to sustain varying execution loads as the
dependency states change from clock to clock. This is at the very heart of what an
out-of-order architecture is trying to accomplish; being wider at the back end is a
natural consequence of this kind of design.
Branch Prediction
Additional stalls are avoided by using a 16 entry times 16 byte branch target
cache which allows first instruction decode to occur simultaneously with
instruction address computation, rather than requiring (E)IP to be known and
used to direct the next fetch (as is the case with the P-II.) This removes an (E)IP
calculation dependency and instruction fetch bubble. (This is a huge advantage
in certain algorithms such as computing a GCD; see my examples for the code)
The K6 allows up to 7 outstanding unresolved branches (which seems like more
than enough since the scheduler only allows up to 6 issued clocks of pending
instructions in the first place.)
The K6 benefits additionally from the fact that it is only a 6 stage pipeline (as
opposed to a 12 stage pipeline like the P-II) so even if a branch is incorrectly
predicted it is only a 4 clock penalty as opposed to the P-II's 11-15 clock penalty.
But because of the K6's limited decode bandwidth, branch instructions take up
precious instruction decode bandwidth. There are no branch execution clocks in
most situations; however, branching instructions end up taking a slot where
essentially no calculation takes place. In that sense K6 branches have a typical penalty of
about 0.5 clocks. To combat this, the K6 executes the LOOP instruction in a
single clock; however, this instruction performs so badly on Intel CPUs that no
compiler generates it.
Floating Point
The common high demand, high performance FPU operations (FADD, FSUB,
FMUL) all execute with a throughput and latency of 2 clocks (versus 1 or 2 clock
throughput and 3-5 clock latency on the P-II.) Amazingly, this means that it can
complete FPU operations faster than the P-II, though it fares worse on FPU code that is
optimally scheduled for the P-II. Like the Pentium, in the P-II Intel has worked hard
on fully pipelining the faster FPU operations which works in their favor. Central to
this is FXCH which, in combination with FPU instruction operands allows two
new stack registers to be addressed by each binary FPU operation. The P-II
allows FXCH to execute in 0 clocks -- the early revs of the K6 took two clocks,
while later revs based on the "CXT core" can execute them in 0 clocks.
Unfortunately, the P-II derives much more benefit from this since its FPU
architecture allows it to decode and execute at a peak rate of one new FPU
instruction on every clock.
More complex instructions such as FDIV, FSQRT and so on will stall more of the
units on the P-II than on the K6. However since the P-II's scheduler is larger it will
be able to execute more instructions in parallel with the stalled FPU instruction
(21 in all, however the port 0 integer unit is unavailable for the duration of the
stalled FPU instruction) while the K6 can execute up to 11 other x86 instructions
at full speed before needing to wait for the stalled FPU instruction to complete.
In a test I wrote (admittedly rigged to favor Intel FPUs) the K6 measured to only
perform at about 55% of the P-II's performance. (Update: using the K6-2's new
SIMD floating point features, the roles have reversed -- the P-II can only execute
at about 70% of a K6-2's speed.)
An interesting note is that FPU instructions on the K6 will retire before they
completely execute. This is possible because it is only required that they work
out whether or not they will generate an exception, and the execution state is
reset on a task switch, by the OS's built-in FPU state saving mechanism.
The state of floating point has changed so drastically recently that it's hard to
make a definitive comment on this without a plethora of caveats. Facts: (1) the
pure x87 floating point unit in the K6 does not compare favorably with that of the
P-II, (2) this does not tend to always reflect in real life software which can be
made from bad compilers, (3) the future of floating point clearly lies with SIMD,
where AMD has clearly established a leadership role. (4) Intel's advantage was
primarily in software that was hand optimized by assembly coders -- but that has
clearly reversed roles since the introduction of the K6-2.
Cache
The K6's L1 cache is 64KB, which is twice as large as the P-II's L1 cache. But it
is only 2 way set associative (as opposed to the P-II which is 4 way). This makes
the replacement algorithm much simpler, but decreases its effectiveness in
random data accesses. The increased size, however, more than compensates
for the lost associativity. For code that works with contiguous data sets,
the K6 simply offers twice the working set ceiling of the P-II.
Like the P-II, the K6's cache is divided into two fixed caches for separate code
and data. I am not as big a fan of split architectures (commonly referred to as the
Harvard Architecture) because they set an artificial lower limit on your working
sets. As pointed out to me by the AMD folk, this keeps them from having to worry
about data accesses kicking out their instruction cache lines. But I would expect
this to be dealt with by associativity and don't believe that it is worth the trade off
of lower working set sizes.
Among the design benefits they do derive from a split architecture is that they
can add pre-decode bits to just the instruction cache. On the K6, the predecode
bits are used for determining instruction length boundaries. Their address tags
(which appear to work out to 9 bits) point to a sector which contains two 32 byte
cache lines, which (I assume) are selected by standard associativity rules.
Each cache line has a standard set of status bits to indicate accessibility state
(obsolete, busy, loaded, etc.).
Although the K6's cache is non-blocking (allowing accesses to other lines even
while a cache line miss is being processed), the K6's load/store unit architecture
only allows in-order data access. So this feature cannot be taken advantage of on
the K6. (Thanks to Andreas Kaiser for pointing this out to me.)
In addition, like the 6x86MX, the store unit of the K6 is actually buffered by a
store queue. A neat feature of the store unit architecture is that it has two
operand fetch stages -- the first for the address, and the second for the data,
which happens one clock later. This allows stores of data that are being
computed in the same clock as the store to occur without any apparent stall.
That is so darn cool!
But perhaps more fundamentally, as AMD have said themselves, bigger is better,
and at twice the P-II's size, I'll have to give the nod to AMD (though a bigger nod
to the 6x86MX; see below.)
The K6 takes two (fully pipelined) clocks to fetch from its L1 cache from within its
load execution unit. Like the original P55C, the 6x86MX spends extra load clocks
(i.e., address generation) during earlier stages of their pipeline. On the other
hand this compares favorably with the P-II which takes three (fully pipelined)
clocks to fetch from the L1 cache. What this means is that when walking a
(cached) linked list (a typical data structure manipulation), the 6x86MX is the
fastest, followed by the K6, followed by the P-II.
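The linked-list claim is easy to see in C: each load of the next pointer depends on the one before it, so no amount of out-of-order machinery can overlap the loads, and the walk proceeds at exactly the L1 load-to-use latency per node (2 clocks on the K6, 3 on the P-II, per the text). The node type and function below are a hypothetical illustration of mine:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; int value; };

/* Each iteration's load of p->next is data-dependent on the previous
   load, so the loop's throughput equals the L1 load-to-use latency. */
int walk_sum(const struct node *p) {
    int sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;   /* dependent load: cannot start before p arrives */
    }
    return sum;
}
```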
Update: AMD has released the K6-3 which, like the Celeron, adds a large on-die
L2 cache. The K6-3's L2 cache is 256K, which is larger than the Celeron's at
128K. Unlike Intel, however, AMD has recommended that motherboards continue
to include on-board L2 caches, creating what AMD calls a "TriLevel cache"
architecture (I recall that an earlier Alpha-based system did exactly this same
thing.) Benchmarks indicate that the K6-3 has increased in performance between
10% and 15% over similarly clocked K6-2's! (Wow! I think I might have to get one
of these.)
Other
• According to AMD, the typical 32 bit decode bandwidth is about the same
for both the K6 and the P-II, but 16 bit decode is about 20% faster for the
K6. Unfortunately for AMD, if software developers and compiler writers
heed the P-II optimization rules with the same vigor that they did with the
Pentium, the typical decode bandwidth will change over time to favor the
P-II.
• The K6's issue to execute scheduling is pretty cool. They use complete
logical comparisons between pipeline stages to always find the best path
forward. The 6x86MX seems to just let their pipelines accumulate with work
moving only in a forward direction, which makes them more susceptible to
being backed up, but they do allow their X and Y pipes to swap contents at
one stage.
• The K6 does not support the new P6 ISA instructions, specifically, the
conditional move instructions. It also does not appear to support the set of
MSRs that the P6 does (besides the ever-important TSC register.) So from
a programmer's architecture point of view, the K6 is more like a Pentium
than a Pentium-II. It's not clear that this is a really big issue since all the
modern compilers still target the 80386 ISA.
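Since the K6 lacks CMOV, code that must run well on both chips has to express selects with plain 386-era arithmetic if it wants to stay branch-free. One common idiom, as a sketch of my own (not from AMD's documentation):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless minimum using only 386-era ALU operations: -(a < b) is
   all-ones when a < b, so the masked XOR selects a; otherwise b. */
int32_t min_branchless(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}
```

On a P6 a compiler could emit CMOV for the same select; targeting the K6, this idiom (or an ordinary conditional jump) is the only option.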
Optimization
AMD, realizing that there is tremendous interest in code optimization for certain
high performance applications, decided to write up some optimization
documentation for the K6 (and now K6-2) processor(s). The documentation is
fairly good about describing general strategies, as well as giving a fairly detailed
description for modelling the exact performance of code. This documentation far
exceeds the quality of any of Intel's "Optimization AP notes", fundamentally
because it's accurate and more thorough.
The reason I have come to this conclusion is that the architecture of the chip
itself is much more straightforward than, say, the P-II, and so there is less
explanation necessary. So the volume of documentation is not the only
determining factor in measuring its quality.
If companies were interested in writing a compiler that optimized for the K6 I'm
sure they could do very well. In my own experiments, I've found that optimizing
for the K6 is very easy.
Brass Tacks
The K6 is cheap and supports Super Socket 7 (with a 100MHz bus), and it has
established itself very well in the marketplace, winning business from all the top
tier OEMs (with the exception of Dell, which seems to have missed the consumer
market shift entirely, and taken a serious step back from challenging Compaq's
number one position.) AMD really changed the minds of people who thought the
x86 market was pretty much an Intel deal (including me.)
Their marketing strategy of selling at a low price while adding features (cheaper
Super7 infrastructure, SIMD floating point, 256K on-chip L2 cache combined with
motherboard L2 cache) has paid off in an unheard-of level of brand name
recognition outside of Intel. Indeed, 3DNow! is a great counter to Intel Inside. If
nothing else they helped create a real sub-$1000 PC market, and have dictated
the price for retail x86 CPUs (Intel has been forced to drop even their own prices
to unheard-of lows for them.)
AMD has struggled more to meet the demand for new speeds as they come
online (they seem predictably optimistic) but overall has been able to sell a
boatload of K6's without being stepped on by Intel.
The first release of their x86 Optimization guide is what triggered me to write this
page. With it, I had documentation for all three of these 6th generation x86
CPUs. Unfortunately, they often elect to go with terse explanations that assume
the reader is very familiar with CPU architecture and terminologies. This led me
to some misunderstandings from my initial reading of the documentation (I'm just
a software guy.) On the other hand, the examples they give really help clarify the
inner workings of the K6.
Update: The IEEE Computer Society has published a book called "The Anatomy
of a High-Performance Microprocessor: A Systems Perspective" based on the
AMD K6-2 microprocessor. It gives inner details of the K6-2 that I have never
seen in any other documentation on microprocessors before. These details are a
bit overwhelming for a mere software developer; however, for a hard core x86
hacker it's a treasure trove of information.
Intel has enjoyed the status of "de facto standard" in the x86 world for some time.
Their P6/P-II architecture, while not delivering the same performance boost as
previous generational increments, solidifies their position. It is the fastest, but it
is also the most expensive of the lot.
General Architecture
The P-II is a highly pipelined architecture with an out of order execution engine in
the middle. The Intel Architecture Optimization Manual lists the following two
diagrams:
The two sections shown are essentially concatenated, showing 10 stages of in-
order processing (since retirement must also be in-order) with 3 stages of out of
order execution (RS, the Ports, and ROB write back colored in light blue by me,
not Intel.)
Intel's basic idea was to break down the problem of execution into as many units
as possible and to peel away every possible stall that was incurred by their
previous Pentium architecture as each instruction marches forward down their
assembly line. In particular, Intel invests 5 pipelined clocks to go from the
instruction cache to a set of ready to execute micro-ops. (RISC architectures
have no need for these 5 stages, since their fixed width instructions are generally
already specified to make this translation immediate. It is these 5 stages that truly
separate the x86 from ordinary RISC architectures, and Intel has essentially
solved it with a brute force approach which costs them dearly in chip area.)
As a note of interest, Intel divides the execution and write back stages into two
separate stages (the K6 does not, and there is really no compelling reason for
the P6's method that I can see.)
Although it is not as well described, I believe that Intel's reservation station and
reorder buffer combination serves substantially the same purpose as the K6's
scheduler, and similarly the retire unit acts on instruction clusters in exactly the
same way as they were issued (CPUs are not otherwise known to have sorting
algorithms wired into them.) Thus the micro-op throughput is limited to 3 per
clock (compared with 4 RISC86 ops for the K6.)
So when everything is working well, the P-II can take 3 simple x86 instructions
and turn them into 3 micro-ops on every clock. But, as can be plainly seen in
their comments, they have a bizarre problem: they can only read two physical input
register operands per clock (rename registers are not constrained by this
condition.) This means scheduling becomes very complicated. Registers to be
read for multiple purposes will not cost very much, and data dependencies don't
suffer from any more clocks than expected, however the very typical trick of
spreading calculations over several registers (used especially in loop unrolling)
will upper bound the pipeline to two micro-ops per clock because of a physical
register read bottleneck.
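Spreading a calculation over several registers looks like this in C (a generic unrolling sketch of mine, not Intel's example). The two accumulators break the dependency chain, but each instruction now reads a distinct architectural register, which is exactly where the two-read-port ceiling bites:

```c
#include <assert.h>
#include <stddef.h>

/* Sum with two independent accumulators. The two chains can run in
   parallel, but on the P-II the extra distinct architectural register
   reads per clock can hit the two-physical-register-read limit. */
long sum_unrolled(const int *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0 */
        s1 += a[i + 1];  /* chain 1 */
    }
    if (i < n) s0 += a[i];  /* odd-length tail */
    return s0 + s1;
}
```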
In any event, the decoders (which can decode up to 6 micro-ops per clock) are
clearly out-stripping the later pipeline stages which are bottlenecked both by the
3 micro-op issue and two physical register read operand limit. The front end
easily outperforms the back end. This helps Intel deal with their branch bubble,
by making sure the decode bandwidth can stay well ahead of the execution units.
Something that you cannot see in the pictures above is the fact that the FPU is
actually divided into two partitioned units. One for addition and subtraction and
the other for all the other operations. This is found in the Pentium Pro
documentation and given the above diagram and the fact that this is not
mentioned anywhere in the P-II documentation I assumed that in fact the P-II
was different from the PPro in this respect (Intel's misleading documentation is
really unhelpful on this point.) After I made some claims about these differences
on USENET some Intel engineer (who must remain anonymous since he had a
copyright statement insisting that I not copy anything he sent me -- and it made
no mention of excluding his name) who claims to have worked on the PPro felt it
his duty to point out that I was mistaken about this. In fact, he says, the PPro and
P-II have an identical FPU architecture. So in fact the P-II and PPro really are the
same core design with the exception of MMX, segment caching and probably
some different glue logic for the local L2 caches.
This engineer also reiterated Intel's position on not revealing the inner workings
of their CPU architectures, thus rendering it impossible for ordinary software
engineers to know how to properly optimize for the P-II.
Branch Prediction
Central to facilitating the P-II's aggressive fire and forget execution strategy is full
branch prediction. The functionality has been documented by Agner Fog, and
can track very complex patterns of branching. They have advertised a prediction
rate of about 90% (based on academic work using the same implementation.)
This prediction mechanism was also incorporated into the Pentium MMX CPUs.
Unlike the K6's, the branch target buffer contains target addresses, not instruction
bytes, and predictions only for the current branch. This means an extra clock is
required for taken branches to be able to decode their branch target. Branches not in the
branch target buffer are predicted statically (backward jumps taken, forward
jumps not.) However, this "extra clock" is generally overlapped with execution
clocks, and hence is not a factor except in short loops, or code loops with poorly
translated code sequences (like compiled sprites.)
When a branch is mispredicted, the correct instruction stream cannot be known until the mispredict is completely processed. This
huge penalty offsets the performance of the P-II, especially in code in which no
P6/P-II optimizations considerations have been made.
The P-II's predictor always deals with addresses (rather than boolean compare
results as is done in the K6) and so is applicable to all forms of control transfer
such as direct and indirect jumps and calls. This is critical to the P-II given that
the latency between the ALUs and the instruction fetch is so large.
In the event of a conditional branch both addresses are computed in parallel. But
this just aids in making the prediction address ready sooner; there is no
appreciable performance gained from having the mispredicted address ready
early given the huge penalty. The addresses are computed in an integer
execution port (separate from the FPU) so branches are considered an ALU
operation. The prefetch buffer is stalled for one clock until the target address is
computed, however since the decode bandwidth out-performs the execution
bandwidth by a fair margin, this is not an issue for non-trivial loops.
This is obviously a lot higher than the K6 penalty. (The zero as the first penalty
assumes that the loop is sufficiently large to hide the one clock branch bubble.)
For programmers this means one major thing: Avoid mispredicted branches in
your inner loops at all costs (make that 10% closer to 0%). Tables and
conditional move instructions are common workarounds; however, since the
predictor is used even in indirect jumps, there are situations with branching
where you have no choice but to suffer from branch prediction penalties.
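As an example of trading a branch for arithmetic (my own illustration, not from Intel's notes): an absolute value computed on random data mispredicts about half the time when written with a compare, while the sign-mask form spends a fixed couple of ALU clocks with nothing to mispredict.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy form: on random signs the conditional jump mispredicts
   roughly 50% of the time. */
int32_t abs_branchy(int32_t x) { return x < 0 ? -x : x; }

/* Branchless form: the arithmetic shift yields 0 or -1, and
   (x ^ mask) - mask negates exactly when x is negative. (Relies on
   arithmetic right shift of negative values, as on x86.) */
int32_t abs_branchless(int32_t x) {
    int32_t mask = x >> 31;
    return (x ^ mask) - mask;
}
```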
Floating Point
In keeping with their post-RISC architecture, the P-IIs have in some cases
increased the latency of some of the FPU instructions over the Pentium for the
sake of pipelining at high clock rates, with the idea that it hopefully will not matter
if the code is properly scheduled. Intel says that FXCH requires no execution cycles,
but does not explicitly state whether or not throughput bubbles are introduced.
Other than latency, the P-II is very similar to the Pentium in terms of performance
characteristics. This is because all FPU operations go through port 0 except
FXCH's which go to port 1, and the first stage of a multiply takes two non-
pipelined clocks. This is pretty much identical to the P5 architecture.
The Intel floating point design has traditionally beat the Cyrix and AMD CPUs on
floating point performance and this still appears to hold true as tests with Quake
and 3D Studio have confirmed. (The K6 is also beaten, but not by such a large
margin -- and in the case of Quake II on a K6-2 the roles are reversed.)
The P-II's floating point unit is issued from the same port as one of the ALU units.
This means that it cannot issue two integer operations and one floating point
operation on every clock, and thus is likely to be constrained to an issue rate
similar to the K6's. As Andreas Kaiser points out, this does not necessarily
preclude later execution clocks (for slower FPU operations, for example) from
executing in parallel across all three basic math units (though this same
comment applies to the K6).
As I mentioned above, the P-II's floating point unit is actually two units, one is a
fully pipelined add and subtract unit, and the other is a partially pipelined complex
unit (including multiplies.) In theory this gives greater parallelism opportunities
over the original Pentium but since the single port 0 cannot feed the units at a
rate greater than 1 instruction per clock, the only value is design simplification.
For most code, especially P5 optimized code, the extra multiply latency is likely
to be the most telling factor.
Update: Intel has introduced the P-III, which is nothing more than a 500MHz+ P6
core with 3DNow!-like SIMD instructions. These instructions appear to be very
similar in functionality and remarkably similar in performance to the 3DNow!
instruction set. There are a lot of misconceptions about the performance of SSE
versus 3DNow! The best analysis I've seen so far indicates that they are nearly
identical, by virtue of the fact that Intel's "4-1-1" issue rate restriction holds back
the mostly meaty 2 micro-op SSE instructions. Furthermore, there are twice as
many micro-ops contending for the SSE units per instruction as with 3DNow!,
which totally nullifies the doubled output width. In any event, it's almost
humorous to see Intel playing catch up to AMD like this. The clear winner:
consumers.
Cache
The P-II's L1 cache is 32KB divided into two fixed 16KB caches for separate
code and data. These caches are 4-way set associative which decreases
thrashing versus the K6. But relatively speaking, this is quite small and inflexible
when compared with the 6x86MX's unified cache. I am not a big fan of the P-II's
smaller, less flexible L1 cache, and it appears as though they have done little to
justify it being half the size of their competitors' L1 caches.
The greater associativity helps programs that are written indifferently with respect
to data locality, but has no effect on code mindful of data locality (i.e., keeping
their working sets contiguous and no larger than the L1 cache size.)
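"Mindful of data locality" means touching memory in the order it is laid out. A standard C illustration (mine, not from the text): summing a matrix row by row walks contiguous addresses and uses every byte of each 32-byte line, while a column-by-column walk strides through memory and pulls in a new line on almost every access once the matrix outgrows the cache.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 8, COLS = 8 };

/* Row-major walk: consecutive accesses hit consecutive addresses,
   so each cache line is fully consumed before moving on. */
int sum_rows(int m[ROWS][COLS]) {
    int s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk: each access jumps COLS * sizeof(int) bytes,
   spreading the same work across many more lines at once. */
int sum_cols(int m[ROWS][COLS]) {
    int s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Both functions compute the same sum; only the traversal order, and hence the working set behavior, differs.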
The P-II also has an "on-PCB L2 cache". What this means is that they do not
need to use the motherboard bus to access their L2 cache. As such, the
communications interface can (and does) run at a much higher frequency -- in
current P-IIs, 1/2 the CPU clock rate. This is an advantage over the K6, K6-2
and 6x86MX CPUs, which access motherboard-based L2 caches at only 66MHz
or 100MHz. (However, the K6-III's on-die L2 cache runs at the CPU clock rate,
which is thus twice as fast as the P-II's.)
Other
• The P-II has a partial register stall which is very costly. This occurs when
writing to a sub-register within a few clocks of writing to a 32 bit register.
That is to say, writing to a ?l or ?h 8 bit register will cause a partial register
stall when next reading the corresponding ?x or e?x register. The same is
true of writing to a ?x register then reading the corresponding e?x register.
As described by Agner Fog, the front end is in-order and must assign
internal registers before the instruction can be entered into the
reservation stations. If there is a partial register overlap with a live
instruction ahead of it, then a disjoint register cannot be assigned until that
instruction retires. This is a devastating performance stall when it occurs
because new instructions cannot even be entered into the reservation
stations until this stall is resolved. Intel lists this as having roughly a 7
clock cost.
This is not a big issue so long as the execution units are kept busy with
instructions leading up to this partial register stall, but that is a difficult
criterion to code towards. One way to accomplish this would be to try to
schedule this partial register stall as far away from the previous branch
control transfer as possible (the decoders usually get well ahead of the
ALUs after several clocks following a control transfer.)
• The P-II, like the P6, performs worse on 16 bit code per clock rate than the
Pentium. (Significantly worse than the Cyrix 6x86MX, and somewhat
worse than the K6.) However, the P-II is not as bad as the P6. In
particular, it uses a small 16 bit segment/selector cache which the P6
does not.
• The P-II's data accesses actually require an additional address unit for
stores. What this means is that memory writes must be broken down into
"address store" and "data store" micro-ops. This increases data write
latency (versus the K6.)
• The P-II can decode instructions to many, many micro-ops, but really only
decodes optimally when 2 out of every 3 instructions are decoded to a
single micro-op and in a specific "4-1-1" sequence (that is for three
instructions to decode in parallel the first must decode to no more than 4
micro-ops, and the second and third in no more than 1 micro-op).
Instructions must also be 8 bytes or less to allow other instructions to be
decoded in the same clock. According to MicroProcessor Report, only one
load or store memory operation can be decoded in the first of the at most
3 instructions. If this is true, it certainly detracts from the "one load or store
operation per clock" claim Intel makes (of course the second of the two
store microops might execute at the same time as a load.)
Only under these circumstances can the P-II achieve its maximum rate of
decoding 3 instructions per cycle.
Update: I recently tried to hand optimize some code, and found that it is
actually not all that difficult to achieve the 3 instruction issue per clock, but
that certainly no compiler I know of is up to the task. It turns out, though,
that such activities are almost certainly a red herring since dependency
bubbles will end up throttling your performance anyways. My
recommendation is to parallelize your calculations as much as possible.
• Stores are pipelined, but not queued as on the 6x86MX or K6. This means
cache misses necessarily stall subsequent store micro-op execution. So
the P-II ends up using the reservation station to queue up store
commands rather than a dedicated store queue. It is not totally clear to me
if this stalls the load unit, but I am guessing not, since the cache has been
claimed to be non-blocking.
• The K6 requires in-order writes, while the P-II almost assuredly reorders
its writes very aggressively in an attempt to build contiguous memory write
streams. The original P6 core has also included write combining (makes
clustered byte writes appear as byte enabled dword writes to the PCI bus.)
With the introduction of the Pentium Pro, many 3rd party hardware
peripheral vendors that used the memory mapping features of PCI found
themselves fixing their drivers to, in some cases, work around this
"feature" of the P6 architecture. However for ordinary applications this just
meant higher memory bandwidth performance (more so with the P-II than
the P6.)
• Intel has leveraged its dominance in the market, an advanced process and a
daring approach to L2 cache usage to introduce their Slot 1 cartridge
interface to motherboards. The upshot of all of this is that they are able to
use a larger heat sink and have better control over a more reliably yielded
L2 cache running at a reasonable clock rate (half the processor speed.)
At the same clock rate, this is its biggest advantage over the current K6,
whose L2 cache is tied to the chipset speed of 66MHz.
• Intel's CPUs come with more MSRs, which give detailed information about
branch prediction and scheduling stalls (by net counts) and let you mark
memory type ranges with respect to cacheability and write combinability.
These details, among others, were at the heart of the controversy
surrounding "Appendix H" a while back with the Pentium CPU.
But now, after being pressured into publishing information about MSRs,
Intel has decided to go one step further and provide a tool to help present
the MSR information in a Windows program. While this tool is very useful
in and of itself, it would be infinitely superior if there were accompanying
documentation that described the P-II's exact scheduling mechanism.
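The partial register stall described in the first bullet above can usually be sidestepped by never mixing sub-register writes with full-register reads; in C terms, widen narrow values explicitly so the compiler can use a zero-extending load (MOVZX) rather than writing AL and then reading EAX. Both helpers below are my own illustration of the two patterns, and what a compiler emits for them is era- and compiler-dependent:

```c
#include <assert.h>
#include <stdint.h>

/* Stall-prone pattern: merging a byte into a dword tempts an x86
   compiler into "mov al, ..." followed by a read of eax -- exactly
   the partial register overlap the P-II serializes on. */
uint32_t merge_low_byte(uint32_t x, uint8_t b) {
    return (x & 0xFFFFFF00u) | b;
}

/* Stall-free pattern: zero-extend the byte first (MOVZX), so only
   full 32-bit registers are ever written and then read. */
uint32_t widen_byte(uint8_t b) {
    return (uint32_t)b;
}
```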
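The "4-1-1" pairing rule from the decode bullet above is mechanical enough to model with a toy decode-group counter (entirely my own sketch: it tracks only micro-op counts and ignores the 8-byte length limit and the load/store restriction also mentioned there). Given each instruction's micro-op count, it reports how many clocks the three decoders need:

```c
#include <assert.h>
#include <stddef.h>

/* Toy "4-1-1" decode model: decoder 0 takes an instruction of up to 4
   micro-ops; decoders 1 and 2 only take single-micro-op instructions.
   Anything that doesn't fit the current template starts a new clock. */
int decode_clocks(const int *uops, size_t n) {
    int clocks = 0;
    size_t i = 0;
    while (i < n) {
        clocks++;
        if (uops[i] > 4) { i++; continue; }  /* microcoded: decodes alone here */
        i++;                                  /* decoder 0 takes it */
        for (int d = 0; d < 2 && i < n && uops[i] == 1; d++)
            i++;                              /* decoders 1 and 2 */
    }
    return clocks;
}
```

So a stream of {1, 1, 1} instructions decodes in one clock, while {2, 2, 2} needs three.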
Optimization
Intel has been diligent in creating optimization notes and even some interactive
tutorials that describe how the P-II microarchitecture works. But the truth is that
they serve as much as CPU advertisements as they do as serious technical
material. As we found out with the Pentium CPU, Intel's notes were woefully
inadequate to give an accurate characterization for modelling its behaviour with
respect to performance (this opened the door for people like Michael Abrash and
Agner Fog to write up far more detailed descriptions based on observation rather
than Intel's anemic documentation.) They contain egregious omissions, without
giving a totally clear description of the architecture.
While they claim that hand scheduling has little or no effect on performance,
experiments I and others have conducted have convinced me that this simply is
not the case. In the few attempts I've made using ideas I've recently been shown
and studied myself, I can get between 5% and 30% improvement on very
innocent looking loops via some very unintuitive modifications. The problem is
that these ideas don't have any well described explanation -- yet.
With the P-II we find a nice dog and pony show, but again the documentation is
inadequate to describe essential performance characteristics. They do steer you
away from the big performance drains (branch misprediction and partial register
stalls.) But in studying the P-II more closely, it is clear that there are lots of things
going on under the hood that are not generally well understood. Here are some
examples (1) since the front end out-performs the back-end (in most cases) the
"schedule on tie" situation is extremely important, but there is not a word about it
anywhere in their documentation (Lee Powell puts this succinctly by saying that
the P-II prefers 3X superscalar code to 2X superscalar code.) (2) The partial
register stall appears to totally parallelize with other execution in some cases
(the stall is less than 7 clocks), while not at all in others (a 7 clock stall in
addition to ordinary clock expenditures.) (3) Salting execution streams
So why doesn't Intel tell us these things so we can optimize for their CPU? The
theory that they are just telling Microsoft or other compiler vendors under NDA
doesn't fly since the kinds of details that are missing are well beyond the
capabilities of any conventional compiler to take advantage of (I can still beat the
best compilers by hand without even knowing the optimization rules, but instead
just by guessing at them!) I can only imagine that they are only divulging these
rules to certain companies that perform performance critical tasks that Intel has a
keen interest in seeing done well running on their CPUs (soft DVD from Zoran for
example; I'd be surprised if Intel didn't give them either better optimization
documentation or actual code to improve their performance.)
Intel has their own compiler that they have periodically advertised on the net as a
plug in replacement for the Windows NT based version of MSVC++, available for
evaluation purposes (it's called Proton if I recall correctly). However, it is unclear
to me how good it is, or whether anyone is using it (I don't use WinNT, so I did
not pursue trying to get on the beta list). Update: I have been told that Microsoft
and Inprise (Borland) have licensed Intel's compiler source and have been using
it as their compiler base.
Brass Tacks
When it comes right down to brass tacks though, the biggest advantage of their
CPU is the higher clock rates that they have achieved. They have managed to
stay one or two speed grades ahead of AMD. The chip also enjoys the benefit of
the Intel Inside branding. Intel has spent a ton of money in brand name
recognition to help lock its success over competitors. Like the Pentium, the P-II
still requires a lot of specific coding practices to wring the best performance out
of them, and there's no doubt that many programmers will do this, and Intel has gone
to some great lengths to write tutorials that explain how to do this (regardless of
their lack of correctness, they will give programmers a false sense of
empowerment).
During 1998, the transition to super cheap PCs that consumers have been
begging for for years finally took place. This is sometimes called the sub-$1000 PC
market segment. Intel's P-II CPUs are simply too expensive (costing up to $800
alone) for manufacturers to build compelling sub-$1000 systems with them. As
such, Intel has watched AMD and Cyrix pick up unprecedented market share.
Intel made a late foray into the sub-$1000 PC market. Their whole business
model did not support such an idea. Intel's "value consumer line" the Celeron
started out as a L2-cacheless piece of garbage architecture (read: about the
same speed as P55Cs at the same clock rates), then switched to an integrated
L2 cache architecture (stealing the K6-3's thunder). Intel was never really able to
shake off the bad reputation that stuck to the Celeron, but perhaps that was their
intent all along. It is now clear that Intel is basically dumping Celerons in an effort
to wipe out AMD and Cyrix, while trying to maintain their hefty margins in their
Pentium-II line. For the record, there is little performance difference between a
Pentium-II and a Celeron, and the clock rates for the Celeron were being made
artificially slow so as not to eat into their Pentium line. This action alone has
brought a resurgence into the "over clocking game" that some adventurous
power users like to get into.
But Intel being Intel has managed to seriously dent what was exclusively an AMD
and Cyrix market for a while. Nevertheless, since the "value consumer" market
has been growing so strongly, AMD and Cyrix have been able to increase their
volumes even with Intel's encroachment.
The P-II architecture is getting long in the tooth, but Intel keeps insisting on
pushing it (demonstrating an uncooled 650MHz sample in early 1999.) Mum's the
word on Intel's seventh generation x86 architecture (the Willamette or Foster),
probably because that architecture is not scheduled to be ready before late 2000.
This old 6th generation part may prove to be easy pickings for Cyrix's Jalapeno
and AMD's K7, both of which will be available in the second half of 1999.
While Intel does have plenty of documentation on their web site, they quite
simply do not sit still with their URLs. It is impossible to keep track of these URLs,
and I suspect Intel keeps changing their URLs based on some ulterior motive. All
I can suggest is: slog through their links starting at the top. I have provided a link
to Agner Fog's assembly page where his famous Pentium optimization manual
has been updated with a dissection of the P-II.
• http://www.intel.com/
• developer.intel.com
• Agner Fog's P5 and P-II optimization manual
The primary microarchitecture difference of the 6x86MX CPU versus the K6 and
P-II CPUs is that it still does native x86 execution rather than translation to
internal RISC ops.
General Architecture
By being able to swap the instructions, there is no concern about artificial stalls
due to scheduling of instructions to the wrong pipeline. By introducing two address
generation stages, they eliminate the all too common AGI stall that is seen in the
Pentium. The 6x86MX relies entirely on up-front dependency resolution via register
renaming and data forwarding; it does not buffer instructions in any way. Thus its
instruction issue performance becomes bottlenecked by dependencies.
The out of order nature of the execution units are not very well described in
Cyrix's documentation beyond saying that slower instructions will make way for
faster instructions. Hence it is not clear what the execution model really looks
like.
Branch Prediction
The Cyrix CPU uses a 512 entry 2 bit predictor and this does not have a
prediction rate that rivals either the P-II or K6 designs. However, both sides of the
branch will have their first instruction decoded simultaneously in the same clock. In
this way, the Cyrix hedges its bets so that it doesn't pay such a severe
performance penalty when its prediction goes wrong. Beyond this, it appears as
though Cyrix has gone full Post-RISC architecture and supports a branch
predict and speculative execution model. This fits nicely with their aggressive
register renaming, and data forwarding model from the original 6x86 design.
Because of potential FPU exceptions, all FPU instructions are treated the same
way as branch prediction. I would expect the same to be true of the P-II, but Intel
has not documented this, whereas Cyrix has.
They have a fixed scheme of 4 levels of speculation, which are simply increased
for every new speculative instruction issued (this is somewhat lower than the P-II
and K6, which can have 20 or 24 live instructions at any one given time, and
somewhat more outstanding branches.)
The 6x86MX architecture is more lock-stepped than the K6, and as such its
issue follows its latency timings more closely. Specifically, the decode, issue
and address generation stages execute in lock step, with any stalls from
resource contention, complex decoding, etc., backing up the entire instruction
fetch stage. However, the design makes it clear that Cyrix does everything
possible to resolve these resource contentions as early as possible. This is to
be contrasted with the K6 design, which is not lock-stepped at all but, due to
its late resource contention resolution, may end up re-issuing instructions
after wasting an extra clock that it didn't need to in its operand fetch
stage.
Floating Point
The 6x86MX has significantly slower floating point. Cyrix's FADD, FMUL,
FLD and FXCH instructions all take at least 4 clocks, which puts it at one
quarter of the P-II's peak FPU execution rate. The 6x86MX (and even the older
6x86) tries to make up for this with an FPU instruction FIFO. This means that
most of the FPU clocks can be overlapped with integer clocks, and that a
handful of FPU operations can be in flight at the same time, but in general it
requires hand scheduling and relatively light use of the FPU to leverage fully.
Oddly enough, their FDIV and FSQRT performance is about as good as, if not
better than, the P-II implementation. This seems like an odd design decision,
as optimizing FADD, FLD, FXCH and FMUL is clearly of much higher importance.
Like AMD, Cyrix designed the 6x86MX floating point around the weak FPU code
that x86 compilers generate. But, personally, I think Cyrix has gone way too
far in ignoring FPU performance. Compilers only need to get a tiny bit better
for the difference between the Cyrix and Pentium II to be very noticeable on
FPU code.
Cache
The Cyrix's cache design is a unified 64KB cache with a separate 256 byte
instruction buffer. I prefer this design to the K6's and P-II's split code and
data architecture, since it better accommodates the different ratios of code
usage to data usage that you would expect in varied software designs. As an
example, p-code interpreters, or interpreted languages in general, would be
expected to benefit more from a larger data cache. The same would apply to
multimedia algorithms, which tend to apply simple transformations to large
amounts of data (though in truth, your system benefits more from coprocessors
for this purpose.) As another example, highly complex (compiled) applications
that weave together the resources of many code paths (web browsers, office
suite packages, and pre-emptive multitasking OSes in general) would prefer to
have larger instruction caches. At both extremes, the Cyrix has twice the cache
ceiling.
Thus the L1 cache becomes a sort of L2 cache for the 256 byte instruction line
buffer, which allows the Cyrix design to confine predecode bits and the like to
a much smaller cache structure, and to use the unified L1 cache more
efficiently as described above. Although I don't know the details, the prefetch
unit could try to see a cache miss coming and pre-load the instruction line
cache in parallel with ordinary execution; this would compensate for the
instruction cache's unusually small size, I would expect to the point of making
it a moot point.
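To illustrate the tradeoff, here is a toy LRU model (illustrative Python; the sizes and access trace are invented for the example and are not 6x86MX parameters) comparing a unified cache against a split cache of the same total size on an interpreter-like, data-heavy trace:

```python
# Toy LRU cache model comparing a unified cache with a split cache of the
# same total line count on a data-heavy access trace. All sizes and the
# trace itself are illustrative assumptions, not 6x86MX parameters.

from collections import OrderedDict

class LRUCache:
    def __init__(self, lines):
        self.lines = lines
        self.store = OrderedDict()   # insertion order == LRU order
        self.hits = self.accesses = 0

    def access(self, line_addr):
        self.accesses += 1
        if line_addr in self.store:
            self.hits += 1
            self.store.move_to_end(line_addr)   # mark most recently used
        else:
            if len(self.store) >= self.lines:
                self.store.popitem(last=False)  # evict the LRU line
            self.store[line_addr] = True

# An interpreter-like trace: a small hot code loop over a large data set.
code = [("I", pc) for pc in range(8)]    # 8 hot code lines
data = [("D", a) for a in range(48)]     # 48 data lines, reused each pass
trace = (code + data) * 20

unified = LRUCache(64)                        # one 64-line unified cache
icache, dcache = LRUCache(32), LRUCache(32)   # split 32 + 32

for kind, addr in trace:
    unified.access((kind, addr))
    (icache if kind == "I" else dcache).access((kind, addr))

# The unified cache holds the whole 56-line working set, while the
# 32-line D-cache thrashes on the 48 cyclically-reused data lines.
```

With a code-heavy trace the comparison flips the other way; the point is that the unified design lets the workload, not the chip designer, pick the split.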
Other
• One clock LOOP instructions! This follows the PowerPC design choice of
making a high throughput count down branch instruction. Like the
PowerPC, they could in fact have implemented this with 100% accurate
branch target prediction, however they did not document whether or not
they have done this. Unfortunately, programmers have been using these
instructions less and less, since starting with the Pentium, Intel has been
making this instruction slower.
• Two barrel shifters (one for each pipe), allowing greater parallelism with
shift instructions. This is an advantage over both the P-II and K6 which
each have only one unit that can handle shifts.
• There are no partial register stalls or smaller operand restrictions that I
could find documented. Cyrix is clearly committed to retaining high
performance of older 16 bit code. This is important for Windows 95,
however less so for Windows NT.
• The Cyrix has a very interesting extension to their general architecture
that allows them to use part of the L1 cache as a scratch pad. This
presents a very interesting alternative for programmers who have
complained about the x86's lack of registers. It is not clear that
programmers would be willing to special case the Cyrix to use this feature,
but you can bet that the drivers Cyrix writes for their GX platforms use
this feature.
Although I have not read about the Cyrix in great detail, it would seem to
me that this was motivated by the desire to perform well on multimedia
algorithms. The reason is that multimedia tends to use memory in
streams, instead of reusing data which conventional caching strategies
are designed for. So if the Cyrix's cache line locking mechanism allows
redirecting certain memory loads, then they can keep the rest of their L1
cache intact for use by tables or other temporary buffers.
This would be a good strategy for their next generation MXi processor (an
integrated graphics and x86 processor.)
Optimization
Cyrix's documentation is not that deep, but I get the feeling that neither are
their CPUs. Nevertheless, they do not describe their out of order mechanism in
sufficient detail to even evaluate it. Not having tried to optimize for a Cyrix
CPU myself, I don't have enough data points to really judge how lacking the
documentation is. But it does appear that Cyrix is somewhat behind both Intel
and AMD here.
Update: I've been recently pointed at Cyrix's Appnotes page, in particular note
106 which describes optimization techniques for the 6x86 and 6x86MX. It does
provide a lot of good suggestions which are in line with what I know about the
Cyrix CPUs, but they do not explain everything about how the 6x86MX really
works. In particular, I still don't know how their "out of order" mechanism works.
It is very much like Intel's documentation, which just tells software
developers what to do without giving complete explanations as to how the CPU
works. The difference is that it's much shorter and more to the point.
One thing that surprised me is that the 6x86MX appears to have several
extended MMX instructions! So in fact, Cyrix actually beat AMD (who extended
the x86 instruction set nontrivially with the K6-2) to this; they just didn't
do a song and dance about it at the time. I haven't studied them yet, but I
suspect that when Cyrix releases their 3DNow! implementation they will be able
to advertise that they supply more total extensions to the x86 instruction
set, with all of them being MMX based.
Brass Tacks
The 6x86MX design clearly has the highest instructions processed per clock on
most ordinary tasks (read: WinStone.) I have been told various explanations for it
(4-way 64K L1 cache, massive TLB cache, very aggressive memory strategies,
etc), but without a real part to play with, I have not been able to verify this on my
own.
Well, whatever it is, Cyrix learned an important lesson the hard way: clock
rate is more important than architectural performance. Besides keeping Cyrix in
the "PR" labelling game, their clock scalability could not keep up with either
Intel or AMD. Cyrix did not simply give up, however. Faced with a quickly
dying architecture, a shared market with IBM, as well as an unsuccessful first
foray into integrated CPUs, Cyrix did the only thing they could do -- drop IBM,
get foundry capacity from National Semiconductor and sell the 6x86MX at rock
bottom prices into the sub-$1000 PC market. Indeed, here they remained out of
reach of both Intel and AMD, though they were not exactly making much money
with this strategy.
Update: National has buckled under the pressure of keeping the Cyrix division
alive (unable to produce CPUs with high enough clock rates) and has sold it off
to VIA. How this affects Cyrix's ability to reenter the market and release
next generation products remains to be seen.
Common Features
The P-II and K6 processors require in-order retirement (for the Cyrix,
retirement has no meaning; it uses its 4 levels of speculation to retain
order.) This can be reasoned out simply from x86 architectural constraints.
Specifically, in-order retirement is required to properly resolve resource
contention.
Within the scheduler, the order of the instructions is maintained. When a
micro-op is ready to retire it is marked as such. The retire unit then waits
for the micro-op blocks that correspond to an x86 instruction to become
entirely ready for retirement and removes them from the scheduler
simultaneously. (In fact, the K6 retains blocks corresponding to all the
RISC86 ops scheduled per clock, so that one or two x86 instructions might
retire per clock. The Intel documentation is not as clear about its retirement
strategies.) As instructions are retired, the non-speculative CS:EIP is
updated.
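The retirement mechanism described above can be sketched roughly as follows (illustrative Python; this is my reading of the general scheme, since neither vendor documents the exact structures):

```python
# Sketch of in-order retirement: micro-ops may complete out of order, but
# the retire unit only removes them from the head of the queue, and only
# when every micro-op of an x86 instruction is complete. The structures
# here are illustrative, not either vendor's actual implementation.

from collections import deque

# Each entry: [x86 instruction name, micro-op id, done flag]
rob = deque([
    ["add", 0, False],
    ["load", 0, False],
    ["store", 0, False],   # store split into two micro-ops,
    ["store", 1, False],   # e.g. address-generate + data
])

def complete(name, uop):
    """Mark one micro-op as having finished executing (any order)."""
    for entry in rob:
        if entry[0] == name and entry[1] == uop:
            entry[2] = True

def retire():
    """Retire whole x86 instructions, in program order, from the head."""
    retired = []
    while rob:
        name = rob[0][0]
        group = [e for e in rob if e[0] == name]
        if not all(e[2] for e in group):
            break                      # head instruction not fully done
        for _ in group:
            rob.popleft()
        retired.append(name)
    return retired

complete("load", 0)        # the load finishes first, out of order...
assert retire() == []      # ...but cannot retire past the older add
complete("add", 0)
assert retire() == ["add", "load"]   # both retire, in program order
```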
The speculation aspect is the fact that the branch target of a branch prediction is
simply fed to the prefetch immediately before the branch is resolved. A "branch
verify" instruction is then queued up in place of the branch instruction and if the
verify instruction checks out then it is simply retired (with no outputs except
possibly to MSRs) like any ordinary instruction, otherwise a branch misprediction
exception occurs.
According to Agner Fog, the P-II retains fixed architectural registers which
are not renamable and are only updated upon retirement. This provides a
convenient "undo" state. It also jells with the documentation, which indicates
that the P-II can only read at most two architectural registers per clock. The
K6 does not appear to be similarly stymied, though it too has fixed
architectural registers.
Contrary to what has been written about these processors, however, hand tuning
of code is not unnecessary. In particular, the Intel processors still handle
carry flag based computation very well, even though compilers do not; the K6
has load latencies; all of these processors still have alignment issues; and
the K6 and 6x86MX prefer the LOOP instruction, which compilers do not
generate. XCHG is also still the fastest way to swap two integer registers on
all these processors, but compilers continue to avoid that instruction. Many of
the exceptional cases (partial register stalls, vector decoding, etc.) are also
unknown to most modern compilers.
In the past, penalties for cache misses, instruction misalignment and other
hidden side-effects were largely ignored. This is because on older
architectures they hurt you no matter what, with no opportunity for instruction
overlap, so the rule of avoiding them as much as possible was more important
than knowing the precise penalty. With these architectures it's important to
know how much code can be overlapped with these cache misses. Issues such as
PCI bus, chipset and memory performance will have to be more closely watched by
programmers.
The K6's documentation was the clearest about its cache design, and indeed it
does appear to have a lot of good features. Its predecode bits are used in a
very logical manner (and appear to buy the same thing that the Cyrix's
instruction buffer buys them), and AMD has stuck with simple-to-implement
2-way set associativity. Per-cache-line status is kept, allowing independent
access to separate lines.
Final words
With out of order execution, all these processors appear to promise the
programmer freedom from the complicated scheduling and optimization rules of
previous generation CPUs. Just write your code in whatever manner pleases you
and the CPU will take care of making sure it all executes optimally. And
depending on who you believe, you can easily be led to think this.
While these architectures are impressive, I don't believe that programmers can
take such a relaxed attitude. There are still simple coding rules you have to
watch out for (partial register stalls and 32 bit coding, for example) and
there are other hardware limitations (at most 4 levels of speculation, a 4
deep FPU FIFO, etc.) that still require care on the part of the programmer in
search of the highest levels of performance. I also hope that the argument that
what these processors are doing is too complicated for programmers to model
dies down as these processors become better understood.
Some programmers may mistakenly believe that the K6 and 6x86MX processors
will fade away due to market dominance by Intel. I really don't think this is
the case, as my sources tell me that AMD and Cyrix are selling every CPU they
make, as fast as they can make them. The demand is definitely there. 3Q97 PC
purchases indicated unusually strong sales for PCs at $1000 or less
(dominated by Compaq machines powered by the Cyrix CPU), making up about
40% of the market.
The astute reader may notice that there are numerous features that I did not
discuss at all. While it's possible that some of this is oversight, I have
also intentionally left out discussion of features that are common to all
these processors (data forwarding, register renaming, call-return prediction
stacks, and out of order execution, for example.) If you are pretty sure I am
missing something that should be told, don't hesitate to send me feedback.
Update: Centaur, a subsidiary of IDT, has introduced a CPU called the WinChip
C6. A brief reading of the documentation on their web site indicates that it's
basically a single pipe 486 with a 64K split cache, dual MMX units, some 3D
instruction extensions and generally more RISCified instructions. From a
performance point of view, their angle seems to be that the simplicity of the
CPU will allow a quick ramp up in clock rate. Their chip has been introduced at
225 and 240 MHz initially (available in Nov 97) with an intended ramp up to 266
and 300 MHz by the second half of 1998. They are also targeting low power
consumption and small die size, with an obvious eye towards the laptop market.
Update: They have since announced the WinChip 2, which is superscalar and
which they expect to have far superior performance. (They claim that they will
be able to clock it between 400 and 600 MHz.) We shall see; and we shall see if
they explain their architecture to a greater depth.
Glossary of terms
• Branch prediction - a mechanism by which the processor guesses the
result of a conditional decision and thus assumes whether or not a
conditional branch is taken.
• Data forwarding - the process of copying the contents of a unit output
value to an input value for another unit in the same clock.
• (Instruction) coloring - a technique for marking speculatively executed
instructions to put them into equivalence classes of speculative resolution.
The idea is that once a speculative condition has been resolved, the
corresponding instructions of that color are all dealt with in the same way,
being either retired or undone.
• (Instruction) issue - the first stage of a CPU pipeline where the
instruction is first recognized and decoded.
• Latency - the total number of clocks required to completely execute an
instruction. In maximal resource contention situations, this is usually the
maximum number of clocks an instruction can take. (Often manufacturers
will abuse the precise definition in their documentation by ignoring clocks
that are assumed to (almost) always overlap. For example, most
instructions on fully pipelined processors really take at least 5 clocks
from issue to retirement; however, under normal circumstances most of
those clocks are consistently overlapped by stages of other instructions,
and hence the instructions are documented to take that many fewer clocks.)
The goal of the Post-RISC architecture is to hide latencies to the maximal
degree possible via parallelism.
• Out of order execution - a feature of the Post-RISC architecture whereby
instructions may actually complete their calculation steps in an order
different from that in which they were issued in the original program.
• Post-RISC architecture - a term coined by Charles Severance referring
to the modern trend of CPUs to use techniques not found on traditional
RISC processors such as speculative execution and register renaming in
conjunction with instruction retirement.
• Register contention - a condition where an instruction is trying to use a
register whose last write back or read has not yet completed.
• Register renaming - retargeting the output of an instruction to an
arbitrary internal register that is virtually renamed to be the value of the
architectural register. In x86 processors this ordinarily occurs whenever an
fld or fxch instruction, or a mov with a destination register, is encountered.
• Resource contention - A condition where a register, ALU or pipeline stage
is required by an instruction but is currently in use, or scheduled to be
used, by a previously unretired instruction.
• Retirement - The process by which the CPU knows that an instruction
has really completed.
• SIMD - Single Instruction Multiple Data. An instruction set which replicates
the same operation over multiple operands which are themselves packed
into a single wide register.
The AMD Athlon (K7)
The following information comes from various public presentations on the Athlon
that have been given. One in particular is the "dinner with Dirk Meyer" audio
session provided by Steve Porter/John Cholewa. I also did my own analysis on
a real Athlon. I must also thank Lance Smith -- my inside man at AMD -- for
invaluable assistance.
Comments welcome.
Shockingly, at the time of release, the 650MHz Athlon became the second
highest clocked modern CPU available on the market -- beaten only by the Alpha
21264 at 667MHz.
But enough of all the hype. Just how good is this architecture? The Athlon, as
far as I can tell, is a cross between a K6 and an Alpha 21264. It has the
cleanliness of the K6 architecture combined with a no holds barred brute force
set of functional units like the 21264.
AMD touts the Athlon as the first processor that can be considered 7th
generation. Most of the features of the K7 are really just super beefed up
versions of features that exist in the K6 (and P6). But what differentiates it
is its radically out of order floating point unit. Through a combination of 88
(!!) rename registers with stack and dependency renaming on a fully
superscalar FPU, AMD has created what is probably, with the possible exception
of the 21264, the most advanced architecture I've ever seen. It also definitely
delivers a significant performance level above both the K6 and P6
architectures, despite the claims of some skeptical high profile
microprocessor reviewers.
Throughout I will be comparing the K7 to the 21264, and the P6 cores. The
following are reference diagrams for each of the architectures found in
documentation supplied by the vendors. The mark ups contain what I consider to
be the most important considerations from a programming point of view, which
are explained in greater detail below. Red markings indicate a slow or previous
generation feature. Green markings indicate a fast or "state of the art feature".
The K7
This is the latest x86 compatible architecture from AMD. It is instruction set
compatible with Intel's Pentium II CPUs. It uses instruction translation to convert
the cumbersome x86 instruction set to high performance RISC-like instructions,
and drives those RISC instructions with a state of the art microarchitecture.
Update: This is not meant to contradict Dirk Meyer, who claimed that "With the
K7, the central quantum of information that floats around the machine is not
decomposed RISC operations, it is a macro operation." It's really just a matter
of perspective. The ALUs in the K7 don't understand "macro operations"; they
understand individual operations akin to the RISC86 ops in the K6. The macro
operation bundles that are decoded are just a convenient structure inside the
K7 which gives much more complete coverage of the x86 instruction set (and
which has the net effect of delivering more operations to the function units
per clock.) Each bundle is itself dispatched as separate operations to the
ALUs as individual execution morsels (I'd still call this decomposition to
RISC ops myself.)
I'm sure the reason Dirk says that this is not just an x86 to RISC translation
is that the internal mechanism by which the K7 does its translation bears no
resemblance to the way either the K6 or P6 perform theirs. Thus for
marketing reasons it is important for AMD to differentiate the way the K7 works
from these previous generation chips. I'm just speculating on this last part,
of course -- for all I know "translation from x86 to RISC" may be a technical
term with a hard and fast definition that puts me clearly in the wrong. :)
The 21264
This is the latest incarnation of the DEC Alpha. Its a no holds barred advanced
architecture, that is out-of-order and highly superscalar. It is fairly well recognized
as the fastest microprocessor on earth by the industry standard SPEC
benchmark.
The P6
General Architecture
The Athlon is a long pipelined architecture, and like the P6, does a lot of work to
unravel some of the oddball conventions of the x86 instruction architecture in
order to feed a powerful RISC-like engine.
The Athlon starts out with 3 beefy symmetrical direct path x86 decoders that are
fed by highly pipelined instruction prefetch and align stages. The direct path
decoders can take short x86 instructions as well as memory-register instructions.
The instructions are translated to Macro-Ops, which themselves contain two
packaged ops (one being one of load, load/store, or store, and the other being
an ALU op.) Thus the front end of the K7 can realistically sustain up to 6 ops
decoded per clock. (The decoders can also sustain up to one vector path decode
per clock for the rarely used weird x86 instructions.)
The K7 has a 72 entry instruction control unit (so that's up to 144 ops, which is
significantly more than the P6's 40 entry reorder buffer) in addition to an 18 entry
integer reservation station as well as a 36 entry FPU reservation station. Holy
cow. The K7 will do an awful lot of scheduling for you, that's for sure.
Now, the K7 has two load and one store ports into the D-cache (the P6 core can
sustain a throughput of one load and/or store per clock.) However, algorithms are
rarely store limited. Furthermore, stores can be retired before they have
totally completed. So I hesitate to stick with the 6 ops sustained rate.
Instead it's more realistic to consider it as 5 ops sustained with free
stores. (Note that for
comparison purposes, this is being very generous to the P6 core's estimated 3
ops per clock sustained rate of execution since it actually executes stores as two
micro-ops. This would be equivalent to only two AMD RISC86 ops per clock
throughput on code which is more store limited.)
After this point, the instructions are simply fed into fully pipelined instruction units
(except, presumably, instructions that are microsequenced.) So indeed 5 ops is
the K7's sustained instruction throughput. This is superior to the P6
architecture in that (1) it can supply an additional ALU op per clock (hence
50% more calculation bandwidth) (2) it can actually execute up to two
additional ops per clock (that's 67% more total general execution
bandwidth), and (3) it can service the ever important dual load case (this is
twice the load bandwidth of the P6 architecture.) So like its predecessor the K6,
the instruction decoders and back ends look fairly well balanced, except that with
the K7 we have a significantly wider engine.
The 21264 is a 4-decode machine with separate load and ALU instructions. The
21264 pipeline is structured with a maximum of 2 memory, 2 integer, or 2 FP
instructions, from which any combination of 4 can be sustained per clock. So
while the K7 has a higher total ops issued per clock, the 21264 has the
advantage in the one case of sustaining 2 integer and 2 floating point
instructions per clock. In reality this would not come up very often; but
conversely, neither would many of the memory-operand instruction combinations
on the K7. The K7 has the advantage of being able to execute 3 integer or 3
floating point ops, but that is balanced by the fact that the K7 has fewer
registers, and in reality only 2 "real work" floating point ops can be
executed.
Branch Prediction
For branch prediction, AMD went with the GShare algorithm with a large number
of entries -- a 2048 entry branch target buffer in addition to a 4096 entry
branch history buffer. This differs from the K6's sophisticated per-branch
history combined with recent branch history algorithm and a branch target
cache. AMD claims that the K7's algorithm achieves 95% prediction accuracy
(similar to the K6.) Given the long pipelined architecture of the K7, using a
very accurate predictor seems more necessary than it was on the K6. Like the
P6 core, the K7 also loses a decode clock on any taken branch (because it does
not use a branch target cache like the K6 does.) However, the high decode
bandwidth of the Athlon will typically make this a non issue.
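For reference, the GShare scheme works like this (illustrative Python; the 4096-entry table matches AMD's branch history buffer figure, but the history length and hash details are my assumptions):

```python
# Sketch of the GShare predictor used by the K7: a global branch-history
# register is XORed with branch-address bits to index a table of 2-bit
# saturating counters. Table size matches AMD's 4096-entry figure; the
# history length and exact hash are assumptions.

class GShare:
    def __init__(self, entries=4096):
        self.entries = entries
        self.mask = entries - 1
        self.history = 0               # global taken/not-taken shift register
        self.counters = [1] * entries  # 2-bit counters, "weakly not taken"

    def index(self, pc):
        # XOR the branch address with the global history to pick a counter
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

# An alternating branch defeats a lone 2-bit counter, but GShare learns
# it, because each distinct history pattern gets its own counter.
g = GShare()
hits_late = 0
for n, taken in enumerate([True, False] * 50):
    correct = g.predict(0x1234) == taken
    g.update(0x1234, taken)
    if n >= 20:          # ignore the warmup while the history fills
        hits_late += correct
# After warmup every prediction is correct: hits_late == 80.
```

The point of XORing in the global history is that the same branch gets different counters in different history contexts, which is how it learns patterns that defeat a plain per-branch 2-bit counter.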
Hey, that's not too bad! Remember that the K6 didn't really beat 0.5 clocks
due to the relatively larger impact of the branch instruction itself on
instruction decode bandwidth. So the K7 appears to have the same expected
average branch penalty as the K6! That's quite good for a deeply pipelined
architecture. It's better than the P6, which has a worse predictor (90%
accuracy) and a larger miss penalty (13+ clocks).
Update: Andreas Kaiser has written up a very detailed analysis of how the K7
branch predictor works.
Floating Point
There has been a lot of talk about the K7's floating point capability,
especially given the poor reputation of Intel's x86 competitors on floating
point. The interest in the K7's floating point probably overshadowed any other
feature.
I think AMD knew they had to deliver on floating point or forever suffer the
backlash of the raving lunatics that would be denied their Quake frame rate being
pegged at the monitor's refresh rate. And there is no question that AMD has
delivered. On top of being fully pipelined (the P6 is only partially pipelined
when performing multiplies), AMD had the gall to make a superpipelined FPU. I
would have thought that this was impossible given the horribly constipated x87
instruction set, but I was shocked to find that it's really possible to execute
well above one floating point operation per clock (on things like multiply
accumulates.)
Since the K7 can combine ALU and load instructions with high performance,
pervasive use of memory operands in floating point instructions (which reduces
the necessity of using FXCH) seems like a better idea than the Intel
recommended strategies.
A floating point test I did that uses this strategy confirms that the K7 is indeed
significantly faster than the P6's floating point performance. My test ran about
50% faster. I suspect that as I become more familiar with the Athlon FPU I will be
able to widen that gap (i.e., no I can't show what I have done so far.)
Nevertheless, the top two stages of the FPU pipeline are stack renaming then
internal register renaming steps. The register renaming stage would be
unnecessary if FXCH (which helps treat the stack more like a register file) did
not execute with very high bandwidth, so I can only assume that FXCH must be
really fast. Update: The latest Athlon optimization guide says that FXCH
generates a NOP instruction with no dependencies. Thus it has an effective
latency of 0 cycles (though it apparently has an internal latency of 2 clocks
-- I can't even think of a way to measure this.)
Holy cow. Nobody in the mainstream computer industry can complain about the
K7's floating point performance.
The 21264 also has two main FP units (Mul and Add) on top of a direct register
file. So while the 21264 will have better bandwidth than the K7 on typical code
which has been optimized in the Intel manner (with wasteful FXCHs) on code
fashioned as described above, I don't see that the Alpha has much of an
advantage at all over the K7. Both have identical peak FP throughput of 2 ops
per clock, that in theory should be able to be sustainable by either processor.
As far as SIMD FP goes, AMD is sticking to their guns with 3DNow! (Although
they did add the integer MMX register based SSE instructions -- it appears as
though this was just to ensure that the Pentium-!!! did not have any functional
coverage over the Athlon.) They did add 5 "DSP functions" which are basically 3
complex number arithmetic acceleration instructions as well as two FP <-> 16 bit
integer conversion instructions. The two way SIMD architecture seems to be a
perfect fit for complex numbers.
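To see why, note that a complex number is a (real, imaginary) pair -- exactly one 2-wide 3DNow! register. A complex multiply then decomposes into packed multiplies and adds (illustrative Python showing only the arithmetic these instructions accelerate, not the actual K7 instruction semantics):

```python
# Why two-way SIMD fits complex numbers: a complex value is a (real, imag)
# pair, i.e. exactly one 2-wide 3DNow! register. This models a complex
# multiply on such pairs; the K7's DSP instruction names and semantics are
# not reproduced here, only the arithmetic they accelerate.

def complex_mul(a, b):
    """(ar, ai) * (br, bi) on packed 2-wide 'registers'."""
    ar, ai = a
    br, bi = b
    # A SIMD unit can form these four products with two packed multiplies,
    # then combine them with a packed add/subtract.
    return (ar * br - ai * bi, ar * bi + ai * br)

# (1 + 2i) * (3 + 4i) = -5 + 10i
assert complex_mul((1.0, 2.0), (3.0, 4.0)) == (-5.0, 10.0)
```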
Other than these new instructions, there does not seem to be any architectural
advantage to the K7 implementation of 3DNow! over the K6's 3DNow!
implementation. I don't think this should be taken as any kind of negative against
AMD's K7 designers, however. 3DNow! is one of those architectures that
appears to be naturally implemented in only one way: the fastest way. So it's
not surprising that the K6 is as fast as the K7 in SIMD FP right out of the chute. (In
the real world the K7 should be faster on 3DNow! loops due to better execution
of necessary integer overhead instructions.)
On the surface it appears as though the SIMD capabilities of the Pentium !!!'s
full SSE implementation alleviate register pressure better than the K7's. However the
K7 has the opportunity to pull even with SSE in this area as well by virtue of its
use, once again, of memory operands. (The theoretical peak result throughput of
SSE and 3DNow! are identical -- each has slight advantages over the other
which on balance are a wash.)
Comparatively speaking, the Alpha has only added special acceleration functions
for video playback. I am not familiar with the Alpha's extensions however I am
under the impression that they did not add a full SIMD FP or SIMD integer
instruction set.
Cache
The K7's cache is now 128 KB (2-way, Harvard architecture, just like the Alpha
21264.) OK, this is just ridiculous -- the K7 has 4 times the amount of L1
cache as Intel's current offerings. If somebody can give me a good explanation
as to why Intel keeps letting itself be a victim to what appears to be a
simple design choice for AMD, I'd like to hear it.
The load pipe has increased from 2 cycle latency on the K6 to 3 cycle latency on
the K7. This matches up with the P6 which also has a 3 cycle access time to their
L1 cache. (But recall that the K7 can perform two loads per clock which is up to
twice as fast as the K6 or P6.)
The K7 has a 44 entry load/store queue. (Holy cow.) Well, that ought to support
plenty of outstanding memory operations.
Although starting from a 512K on-PCB L2 cache, AMD claims the ability to move
to caches as large as 8MB. It should be obvious that AMD intends to take the K7
head to head against Intel's Xeon line. Off the PCB card, the K7 bus (which was
actually designed by the Compaq Alpha team for the 21264) can support 20
outstanding transactions.
Other
• The memory bus (the EV6 bus, which is actually the same bus used by
the 21264) runs at 2x100MHz. Though everything I am told right now
indicates that memory throughput is still limited by the 100MHz PC100
RAM technology of today, it does allow for scaling into higher
performance RAM in the future. (PC133 is supposed to be around the
corner.) In any event it should allow the processor to dispatch stores to the
chipset in a fire and forget manner much faster than the current 1x100MHz
of the P6 bus. So the CPU should not be tied up issuing stores for as long.
(Not a big issue, realistically.)
• The FDIV latency is remarkably low in comparison to the P6. I suspect
that AMD is using the 3DNow! divide approximation tables to drive a faster
Newton-Raphson algorithm.
• According to independently confirmed tests, the LOOP instruction is slow!
Oh well. I can't imagine what it is about deeply pipelined architectures
that would make this instruction slow. I can only guess that AMD
got tired of dealing with the legacy timing loops people wrote with this
instruction expecting it to be the same absolute speed as it was on a 486.
Fortunately for AMD, this is not a problem, since for typical loops there is
easily enough leftover instruction decode bandwidth to perform a
DEC/JNZ instruction pair with the same performance.
• The K7 appears to support all of the P6 conditional move and conditional
floating point instructions, as well as the write combining "MTRR registers"
and the performance event counters.
Optimization
I would recommend this guide to anyone interested in optimizing for the next
generation of processors.
Brass Tacks
Holy cow! Did I mention that this thing was released at 650 MHz? That's a clear,
uncontested 50 MHz lead over Intel. Although it has been suggested that this was
simply a premature announcement meant to steal the limelight away from Intel
(which has only recently started shipping the Pentium !!! at 600 MHz), they also
said that 700 MHz was on its way (Q4 '99). I find it easier to believe that they are
telling the truth (something some stockholder lawsuits should be motivating from
them) than lying to this extent.
I think AMD's challenge from here is to try and figure out exactly what markets it
can grow the Athlon into. It's too expensive for sub-$1K PCs and it's not quite
ready for SMP. It's also currently only available in 512K L2 cache configurations,
so they can't go right after the Xeon market space just yet. While the Athlon is a
great processor, it's clear that AMD needs to complete the picture with their
intended derivatives (the Athlon Select for the low end, the Athlon Ultra for
servers, and the Athlon Professional for everyone else, as AMD have themselves
disclosed) to take the fight to Intel in every segment.
Versus the P6
The K7 is larger, faster, and better in just about every way. The Athlon simply
beats the P6, even on code tweaked to the hilt for the P6 architecture. From the
architecture, the Athlon should be able to execute any combination of optimized
x86 code at least as efficiently as the P6. Code optimized specifically for the K7
should widen the performance gap between these two processors substantially.
Versus the 21264
From a pure CPU technology point of view this one is too close to call. Both have
extremely comparable features with slightly different tradeoffs that should not, by
themselves, tip the balance either way. However, at the end of the day the 21264
cannot be denied the official crown. The Alpha processors have the advantage
that Compaq has developed the compilers themselves, and they are 64-bit on the
integer side. They also have a much cleaner floating point instruction set
architecture and use a higher end, more expensive infrastructure. AMD is stuck
with the 32-bit instruction set defined by Intel, as well as the software which has
followed the optimization rules dictated by Intel's chips.
The only counter-balance that the K7 has is the MMX and 3DNow! instruction
sets (in addition to the new instructions that have been added) which give the K7
the advantage for multimedia.
Nevertheless it's amazing how close the x86 compatible K7 comes. For a
developer writing something from scratch, 21264-like performance should be the
goal to shoot for.
Update: In recent months both Intel and AMD have overtaken the Alpha in clock
speed by a substantial amount, and consequently in terms of real integer
performance as well. While their roadmap still shows higher clocked versions of
the 21264 in the future, it looks like Compaq is concentrating their efforts on
simultaneous multithreading (something they presented at Microprocessor Forum
in 1999.)
The Willamette
On 02/15/00, at the Intel Developer Forum a very brief preview of the Willamette
architecture was given. Since that time other details have surfaced, and more
analysis has been done. However Intel has not yet fully unveiled all the details of
the architecture. As such, the analysis below is preliminary.
The architecture is a 20-stage deep pipeline, with the claimed purpose being for
clock rate scaling reasons. However this pipeline is very different from x86
processors designed up until this point. The top few stages feed from the on-chip
L2 cache straight into a one-way x86 decoder which feeds EIP and micro-ops
into something called a trace cache. This trace cache replaces the processor's
L1 I-cache. The trace cache then feeds micro-ops at a maximum rate of 3 per
clock (actually 6 micro-ops every other clock) in instruction order (driven by a
trace-cache-local branch predictor as necessary) into separate integer and
FP/multimedia schedulers (much like the Athlon, except that the rate is higher for
the Athlon.) This mechanism effectively serves the same purpose as the
combination of the Athlon's Instruction Control scheduler and I-cache (including
predecode bits.) Because the x86 decoder is applied only upon entry into the
trace cache, its performance impact is analogous to an increase in I-cache line fill
latency on other architectures. From an implementation point of view, Intel saves
itself the need to make a superscalar decoder (something it has implemented in a
clumsy way in the P6 and P5 architectures.)
Update: Just to make it clear -- one other thing this buys them is that the trace
cache eliminates direct jumps, calls, and returns from the instruction stream. On
the other hand, such instructions should not exist as bottlenecks in any
reasonably designed performance software. These instructions are necessarily
parallelizable with other code.
The integer side is two integer ALUs plus one load and one store unit. But an
important twist is that these computation stages are clocked at double the clock
rate of the base clock for the CPU. That is to say, the ALUs complete their
computation stages at 0.5 clock granularities (with 0.5 latencies in the examples
discussed). Results that complete in the earlier half of the clock can (in at least
the described cases) be forwarded directly to a computation issued into the
second half of the clock. (Intel calls this double pumping.) From this point of view
alone, the architecture has the potential to perform double the integer
computational work as the P6 architecture. However, since the trace cache can
sustain a maximum of 3 micro-ops delivered per clock (which is the same as the
maximum issue rate of the P6 architecture), there is no way for the integer units
to sustain 4-micro-ops of computation per clock. Nevertheless, this is a
shockingly innovative idea that does not exist in any other processor architecture
that I have ever heard of.
I previously thought that the 0.5 clock granularities applied to loads (thus allowing
two loads per clock). However, it has been clarified that in fact the load unit can
accept only one new load per clock. This is consistent with other people's
theories that the ALU clock doubling is synthesized with two fused adders which
are not applicable to the load unit.
Update: Leaked benchmarks indicate that there is some funny business going
on in their L1 cache. While they claimed an L1 latency of 2 clocks,
measurements indicate that it starts at 3 clocks (it's possible they were ignoring
the address calculations, which in some cases can be computed in parallel with
data access -- however, the Athlon architecture has the same feature.) The
latency benchmark scores that were leaked indicate that as the data size
increases to 4K and beyond, the latency gradually increases rather than falling
off in cliffs (as the data footprint size exceeds the size of one level of cache) like
most other CPUs.
I don't completely buy this. One of the statements Paul makes is: "However,
given the fact that modern x86 processors can execute up to three instructions
per cycle, the odds of finding up to 6 independent instructions to hide (or cover)
the load-use latency is rather small." This is not exactly the right way to view the
relationship between loads and ALU computation instructions. In modern x86's
the decoder's rate of bytes => scheduled riscops exceeds the rate of ALU
execution => retirement. The reason for this is that the amount of inherent
parallelism in typical programs is less than what these CPUs are capable of
exploiting. But memory loads are different. Memory loads depend only on
address computation, which usually is not on the critical path of
calculations in a typical algorithm (except when using slow data structures like
linked lists.) So once a memory instruction is decoded and scheduled, it can
almost always proceed immediately -- essentially always starting at the earliest
possible moment. As long as the data can be returned before the scheduler runs
out of other older work to do (which I claim it will have a lot of) then this latency
will not be noticed. Said in another way, a deep scheduler can cover for load
latency.
What does this mean? Well, I believe it means that shortening up the L1 D-cache
latency while sacrificing the size so dramatically cannot, in and of itself, possibly
be worth it. I am more inclined to believe that the latency to the L2 cache (which
may be strictly a D-cache) has shown itself to be short enough to benefit
from the effect I referred to above. If the *L2* latency can be totally hidden as
well, then the real size concern is not with the L1 D-cache but rather the L2
cache.
Update: Also presented was the fact that the CPU uses a 4x100 MHz Rambus
memory interface. While I would ordinarily ignore such bandwidth claims (for
memory latency is usually more important, and when you need bandwidth you
can use "prefetch" to hide the memory hits), leaks from some Intel insider on
USENET suggest that Willamette will use some sort of linear address pattern
matching prefetch mechanism. This technique has apparently been used by
other RISC vendors, however with mixed results. Benchmark leaks seem to
confirm that Willamette will have bandwidth that is about double that of current
SDRAM based Athlons (which are the current x86 leader on the Stream
benchmark.)
Update: Intel has been heavily hyping the new SIMD instructions added to the
Willamette (SSE-2). They have added a 2x64 packed floating point instruction set
as well as 128 bit integer MMX instructions. However, if their multimedia
computation can only be performed from one issue port (assuming that the
FMOVE and FSTORE pipe is not capable of any calculations) then they have
compromised their older 64 bit MMX performance (the P6 has dual MMX units)
and will only maintain parity with their older SSE unit if they've reduced the
number of micro-ops per instruction (which would necessitate a fully 128 bit wide
ALU, instead of the two parallel 64 bit units in the Pentium-!!!.) The new 2x64 FP
theoretically brings their "double FP" performance to parity with the Athlon's x87
FPU (again, this is contingent on single micro-ops per packed SSE-2 instruction).
I say theoretically, because the algorithm needs to be fully vectorized into SIMD
code just to keep up with what the Athlon can do in straight unvectorized (but
reasonably scheduled) code. The 128 bit MMX can at best match the
performance of the dual 64 bit MMX units which are present in the Athlon, K6,
P55c and P-!!! CPUs. One thing they have added which is nice is a SIMD 32-bit
multiplier on the integer side.
From an instruction point of view, Intel appears to be declaring victory (there are
now more instructions as well as more coverage than even the AltiVec instruction
set; with the exception of multiply accumulate, and lg/exp approximations), but I
don't see the performance benefit of SSE-2. In fact I think there is a real
possibility of a slight performance degradation here.
Although Intel correctly points out that x86 to micro-op decode penalties no
longer affect branch mispredicts, the bulk of the pipeline stages in the
architecture appear between the trace cache output and execution stages. Thus,
the latency of a branch mispredict (which basically needs to abort results from
trace cache output to execution) has worsened and in fact is worse than any
other architecture I have ever heard of. As a counter to this, Intel has increased
their branch target buffer to 4096 entries and is reportedly using an improved
prediction algorithm ("...[the] Willamette processor microarchitecture significantly
enhances the branch prediction algorithms originally implemented in the P6
family microarchitecture by effectively combining all currently available prediction
schemes"). Intel has not commented on the prediction probabilities of the
Willamette architecture. Intel has also added branch hint instructions.
Finally they claim to have a significantly larger scheduler (more than 100
instructions can be in-flight at once.)
On the surface it appears as though the Willamette processor will do very well on
integer code with lots of dependencies; however, it will not fare as well as the
Athlon on floating point. Other factors such as the trace cache and L1 D-cache
sizes and the quality of the branch predictor remain unknown.
The Crusoe
The Crusoe is the ultimate in x86 "emulation". The core chip is not an x86
compatible CPU at all, but rather a VLIW engine. The engine runs an emulation
program (the Code Morpher) which reads x86 instructions and compiles them to
VLIW code snippets, then executes the compiled snippets. The compiler uses
continuous, on-the-fly profiling feedback to decide which code snippets need to be
analyzed the most. This design probably gets the most bang for the buck in
terms of performance per clock invested in the translation problem. Unlike other
technologies like FX!32 or Bochs, Crusoe has been clearly designed for 100%
x86 compatibility from boot to shutdown (hence esoteric protected mode
instructions are emulated in a compatible way -- device drivers will be written in
x86 binaries, not native Crusoe binary).
This contorted way of executing x86 buys them a number of things. (1) They
have complete freedom in how the VLIW core is designed. For example, it does
not even have to have robust register access -- if it takes two clocks for an
operation to finish, then rather than stalling subsequent accesses to the output
register, perhaps the value is old or undefined for the immediately subsequent
clock and updated on the second clock. But more importantly, as they target
better and better process technologies, they can change most aspects of their
design without compromising x86 compatibility. (2) It is possible to find
optimizations that the original software authors, or their compilers did not find in
their binary code. You can imagine that at least some x86 code might end up
substantially faster on the Crusoe. (3) The VLIW engine is very small and very
simple -- thus it is easier to analyze from a performance point of view.
The bad news is that their initial clock rates of 300-400 MHz are not very
compelling, and the promise of 500-700 MHz in 6 months is kind of so-so, given
that the desktop competition is now at 800 MHz. There were no pure performance
benchmarks shown, which is indicative that they probably are not achieving
performance-per-cycle parity with Intel or AMD. The good news is that this part is
being positioned in the mobile space. The (apparent) maximum power draw is an
amazing 2W, which will make for very battery friendly notebooks. They also
seem to be targeting the "internet appliance market", but I don't take that too
seriously (the "internet appliance market", that is.)
The white papers suggest that the VLIW engine drives 4 instructions in parallel in
a strict [ALU, ALU, MEM, BRANCH] format. Hmmm ... this looks roughly
comparable to a K6 to me (a little better with branches, a little worse with
memory.)
One thing they definitely have introduced which is interesting is the idea of a
speculative non-aliased memory window mechanism. What happens is that the
morpher can rearrange loads and stores in more optimal orders, and the legality
of this is checked with a special speculation checkpoint instruction. So like
branch prediction, if a late determination is made that the memory reordering was
wrong, then an interrupt is thrown and the "wrongly executed" block of
instructions can be undone. Of course, the goal is not to take advantage of this
speculative undoing (back to the checkpoint instruction), but rather just to use it
as a parachute to ensure robustness, in the hopes that in most cases for a given
fragment of code, memory reordering is a valid thing to do. This is a big deal.
This problem has plagued CPU designers and compiler writers for decades. The
fact that these guys have implemented a solution for this, is indicative that they
are very serious designers with some good ideas. The idea fits very well with
their code morpher because for degenerate cases where load/store reordering
never works, the morpher can detect this and throw the whole idea out for that
fragment of code.
It's not entirely clear how much of a long term advantage this is, though.
Apparently there is at least one other CPU architecture (HP's PA-8500)
that has implemented load/store speculative ordering. So there's no telling how
long Transmeta might be able to hold onto this advantage before the same
technology makes it into conventional x86 architectures.
I really don't think these guys are going to seriously contend with the Athlon
in any kind of head-to-head, so I will avoid making any kind of direct comparison.
Given the translation architecture, I don't think that further discussion of
processor features (like branch prediction, cache or floating point) will make too
much sense. We're going to have to wait until we can play with it before we can
get a real idea of what it can do.
Before I leave this, there is the thought that somehow the Transmeta chip would
be able to execute other instruction sets in a different configuration, or perhaps
more interestingly, simultaneously with x86. The presentation seemed to steer
towards the direction of "we are only emulating x86's". However, public
statements made by Transmeta employees lead to a different possibility: "There
was a TM3120 running Doom on Linux. Doom was compiled mostly to x86,
except for the inner loop, which was compiled to picoJava using Steve
Chamberlain's picoJava back-end. The whole program was linked together
using a magic linker. When the program had to enter the inner loop, it
executed a reserved x86 opcode which jumped to picoJava mode. The
inner loop then executed picoJava bytecode until it was done, and re-
entered x86 mode."
This is very suggestive, at least to me, that they will support Java (or perhaps just
picoJava) on their CPUs that would likely be substantially faster than the current
crop of x86 based Java virtual machines.
Actually this sounds like it would be quite cool -- the x86 based JVM wrapper
code would have an inner loop that looked like:
jmp L2
L1:
cmp eax,[eax] ; force the OS to load the page
L2:
TM_OPCODE(picoJava)
jnc L1
For their technology demo, I wouldn't be surprised if they simply allocated
some fixed physical memory and disallowed interrupts for the duration of the
picoJava code.
Glossary of terms
• ALU - Arithmetic Logic Unit. An execution unit in the processor that
performs some amount of calculation (as opposed to a data movement
unit or a branching unit.)
• Branch prediction - a mechanism by which the processor guesses the
result of a conditional decision, and thus assumes whether or not a
conditional branch is taken.
• Data forwarding - the process of copying the contents of a unit output
value to an input value for another unit in the same clock.
• Decode - the stage where instructions are first decoded from their
instruction bytes. In x86 processors this is an important consideration due
to the non-uniformity and variable length nature of the instruction set.
• Double pumping - A scheme by which a macro instruction uses the same
ALU twice to perform two individual parts of an instruction. Ordinarily this
leads to the ALU being tied up for twice the duration of its default
bandwidth.
• (Instruction) coloring - a technique for marking speculatively executed
instructions to put them into equivalence classes of speculative resolution.
The idea is that once a speculative condition has been resolved, the
corresponding instructions of that color are all dealt with in the same way,
as being either retired or undone.
• (Instruction) issue - the first stage of a CPU pipeline where the
instruction is first recognized and sent to an execution unit.