Microprocessor History
A microprocessor -- also known as a CPU or central processing unit -- is a complete
computation engine that is fabricated on a single chip. The first microprocessor was the Intel
4004, introduced in 1971. The 4004 was not very powerful -- all it could do was add and
subtract, and it could only do that 4 bits at a time. But it was amazing that everything was on
one chip. Prior to the 4004, engineers built computers either from collections of chips or from
discrete components (transistors wired one at a time). The 4004 powered one of the first
portable electronic calculators.
The first microprocessor to make it into a home computer was the Intel 8080, a complete 8-
bit computer on one chip, introduced in 1974. The first microprocessor to make a real splash
in the market was the Intel 8088, introduced in 1979 and incorporated into the IBM PC
(which first appeared in 1981). If you are familiar with the PC market and its history, you
know that the PC market moved from the 8088 to the 80286 to the 80386 to the 80486 to the
Pentium to the Pentium II to the Pentium III to the Pentium 4. All of these microprocessors
are made by Intel and all of them are improvements on the basic design of the 8088. The
Pentium 4 can execute any piece of code that ran on the original 8088, but it does it about
5,000 times faster!
The following table helps you to understand the differences between the different processors
that Intel has introduced over the years.
Name         Date  Transistors  Microns  Clock speed  Data width           MIPS
8080         1974  6,000        6        2 MHz        8 bits               0.64
8088         1979  29,000       3        5 MHz        16 bits, 8-bit bus   0.33
80286        1982  134,000      1.5      6 MHz        16 bits              1
80386        1985  275,000      1.5      16 MHz       32 bits              5
80486        1989  1,200,000    1        25 MHz       32 bits              20
Pentium      1993  3,100,000    0.8      60 MHz       32 bits, 64-bit bus  100
Pentium II   1997  7,500,000    0.35     233 MHz      32 bits, 64-bit bus  ~300
Pentium III  1999  9,500,000    0.25     450 MHz      32 bits, 64-bit bus  ~510
Pentium 4    2000  42,000,000   0.18     1.5 GHz      32 bits, 64-bit bus  ~1,700
Compiled from The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores
• Transistors is the number of transistors on the chip. You can see that the number of
transistors on a single chip has risen steadily over the years.
• Microns is the width, in microns, of the smallest wire on the chip. For comparison, a
human hair is 100 microns thick. As the feature size on the chip goes down, the number
of transistors rises.
• Clock speed is the maximum rate that the chip can be clocked at. Clock speed will make
more sense in the next section.
• Data Width is the width of the ALU. An 8-bit ALU can add/subtract/multiply/etc. two 8-bit
numbers, while a 32-bit ALU can manipulate 32-bit numbers. An 8-bit ALU would have to
execute four instructions to add two 32-bit numbers, while a 32-bit ALU can do it in one
instruction. In many cases, the external data bus is the same width as the ALU, but not
always. The 8088 had a 16-bit ALU and an 8-bit bus, while the modern Pentiums fetch
data 64 bits at a time for their 32-bit ALUs.
• MIPS stands for "millions of instructions per second" and is a rough measure of the
performance of a CPU. Modern CPUs can do so many different things that MIPS ratings
lose a lot of their meaning, but you can get a general sense of the relative power of the
CPUs from this column.
From this table you can see that, in general, there is a relationship between clock speed and
MIPS. The maximum clock speed is a function of the manufacturing process and delays
within the chip. There is also a relationship between the number of transistors and MIPS. For
example, the 8088 clocked at 5 MHz but only executed at 0.33 MIPS (about one instruction
per 15 clock cycles). Modern processors can often execute at a rate of two instructions per
clock cycle. That improvement is directly related to the number of transistors on the chip and
will make more sense in the next section.
Inside a Microprocessor
To understand how a microprocessor works, it is helpful to look inside and learn about the
logic used to create one. In the process you can also learn about assembly language -- the
native language of a microprocessor -- and many of the things that engineers can do to
boost the speed of a processor.
A microprocessor executes a collection of machine instructions that tell the processor what
to do. Based on the instructions, a microprocessor does three basic things:
• Using its ALU (arithmetic/logic unit), a microprocessor can perform mathematical operations such as addition, subtraction, multiplication and division
• A microprocessor can move data from one memory location to another
• A microprocessor can make decisions and jump to a new set of instructions based on those decisions
There may be very sophisticated things that a microprocessor does, but those are its three
basic activities.
basic activities. The following diagram shows an extremely simple microprocessor capable of
doing those three things:
Let's assume that both the address and data buses are 8 bits wide in this example.
• Registers A, B and C are simply latches made out of flip-flops. (See the section on
"edge-triggered latches" in How Boolean Logic Works for details.)
• The address latch is just like registers A, B and C.
• The program counter is a latch with the extra ability to increment by 1 when told to do
so, and also to reset to zero when told to do so.
• The ALU could be as simple as an 8-bit adder (see the section on adders in How
Boolean Logic Works for details), or it might be able to add, subtract, multiply and
divide 8-bit values. Let's assume the latter here.
• The test register is a special latch that can hold values from comparisons performed
in the ALU. An ALU can normally compare two numbers and determine if they are
equal, if one is greater than the other, etc. The test register can also normally hold a
carry bit from the last stage of the adder. It stores these values in flip-flops and then
the instruction decoder can use the values to make decisions.
• There are six boxes marked "3-State" in the diagram. These are tri-state buffers. A
tri-state buffer can pass a 1, a 0 or it can essentially disconnect its output (imagine a
switch that totally disconnects the output line from the wire that the output is heading
toward). A tri-state buffer allows multiple outputs to connect to a wire, but only one of
them to actually drive a 1 or a 0 onto the line.
• The instruction register and instruction decoder are responsible for controlling all of
the other components. Although they are not shown in the diagram, there would be
control lines from the instruction decoder that would:
• Tell the A register to latch the value currently on the data bus
• Tell the B register to latch the value currently on the data bus
• Tell the C register to latch the value currently on the data bus
• Tell the program counter register to latch the value currently on the data bus
• Tell the address register to latch the value currently on the data bus
• Tell the instruction register to latch the value currently on the data bus
• Tell the program counter to increment
• Tell the program counter to reset to zero
• Activate any of the six tri-state buffers (six separate lines)
• Tell the ALU what operation to perform
• Tell the test register to latch the ALU's test bits
• Activate the RD line
• Activate the WR line
Coming into the instruction decoder are the bits from the test register and the clock line, as
well as the bits from the instruction register.
ROM stands for read-only memory. A ROM chip is programmed with a permanent collection
of pre-set bytes. The address bus tells the ROM chip which byte to get and place on the data
bus. When the RD line changes state, the ROM chip presents the selected byte onto the
data bus.
RAM stands for random-access memory. RAM contains bytes of information, and the
microprocessor can read or write to those bytes depending on whether the RD or WR line is
signaled. One problem with today's RAM chips is that they forget everything once the power
goes off. That is why the computer needs ROM.
By the way, nearly all computers contain some amount of ROM (it is possible to create a
simple computer that contains no RAM -- many microcontrollers do this by placing a handful
of RAM bytes on the processor chip itself -- but generally impossible to create one that
contains no ROM). On a PC, the ROM is called the BIOS (Basic Input/Output System).
When the microprocessor starts, it begins executing instructions it finds in the BIOS. The
BIOS instructions do things like test the hardware in the machine, and then it goes to the
hard disk to fetch the boot sector (see How Hard Disks Work for details). This boot sector is
another small program, and the BIOS stores it in RAM after reading it off the disk. The
microprocessor then begins executing the boot sector's instructions from RAM. The boot
sector program will tell the microprocessor to fetch something else from the hard disk into
RAM, which the microprocessor then executes, and so on. This is how the microprocessor
loads and executes the entire operating system.
Microprocessor Instructions
Even the incredibly simple microprocessor shown in the previous example will have a fairly
large set of instructions that it can perform. The collection of instructions is implemented as
bit patterns, each one of which has a different meaning when loaded into the instruction
register. Humans are not particularly good at remembering bit patterns, so a set of short
words are defined to represent the different bit patterns. This collection of words is called the
assembly language of the processor. An assembler can translate the words into their bit
patterns very easily, and then the output of the assembler is placed in memory for the
microprocessor to execute.
Here's the set of assembly language instructions that the designer might create for the
simple microprocessor in our example:
• LOADA mem - Load register A from memory address
• LOADB mem - Load register B from memory address
• CONB con - Load a constant value into register B
• SAVEB mem - Save register B to memory address
• SAVEC mem - Save register C to memory address
• ADD - Add A and B and store the result in C
• SUB - Subtract A and B and store the result in C
• MUL - Multiply A and B and store the result in C
• DIV - Divide A and B and store the result in C
• COM - Compare A and B and store the result in the test register
• JUMP addr - Jump to an address
• JEQ addr - Jump, if equal, to address
• JNEQ addr - Jump, if not equal, to address
• JG addr - Jump, if greater than, to address
• JGE addr - Jump, if greater than or equal, to address
• JL addr - Jump, if less than, to address
• JLE addr - Jump, if less than or equal, to address
• STOP - Stop execution
If you have read How C Programming Works, then you know that this simple piece of C code
will calculate the factorial of 5 (where the factorial of 5 = 5! = 5 * 4 * 3 * 2 * 1 = 120):
a=1;
f=1;
while (a <= 5)
{
    f = f * a;
    a = a + 1;
}
At the end of the program's execution, the variable f contains the factorial of 5.
A C compiler translates this C code into assembly language. Assuming that RAM starts at
address 128 in this processor, and ROM (which contains the assembly language program)
starts at address 0, then for our simple microprocessor the assembly language might look
like this:
// Assume a is at address 128
// Assume f is at address 129
0 CONB 1 // a=1;
1 SAVEB 128
2 CONB 1 // f=1;
3 SAVEB 129
4 LOADA 128 // if a > 5 then jump to 17
5 CONB 5
6 COM
7 JG 17
8 LOADA 129 // f=f*a;
9 LOADB 128
10 MUL
11 SAVEC 129
12 LOADA 128 // a=a+1;
13 CONB 1
14 ADD
15 SAVEC 128
16 JUMP 4 // loop back to the compare
17 STOP
So now the question is, "How do all of these instructions look in ROM?" Each of these
assembly language instructions must be represented by a binary number. For the sake of
simplicity, let's assume each assembly language instruction is given a unique number, like
this:
• LOADA mem - 1
• LOADB mem - 2
• CONB con - 3
• SAVEB mem - 4
• SAVEC mem - 5
• ADD - 6
• SUB - 7
• MUL - 8
• DIV - 9
• COM - 10
• JUMP addr - 11
• JEQ addr - 12
• JNEQ addr - 13
• JG addr - 14
• JGE addr - 15
• JL addr - 16
• JLE addr - 17
• STOP - 18
The numbers are known as opcodes. In ROM, our little program would look like this:
// Assume a is at address 128
// Assume f is at address 129
Addr opcode/value
0 3 // CONB 1
1 1
2 4 // SAVEB 128
3 128
4 3 // CONB 1
5 1
6 4 // SAVEB 129
7 129
8 1 // LOADA 128
9 128
10 3 // CONB 5
11 5
12 10 // COM
13 14 // JG 17
14 31
15 1 // LOADA 129
16 129
17 2 // LOADB 128
18 128
19 8 // MUL
20 5 // SAVEC 129
21 129
22 1 // LOADA 128
23 128
24 3 // CONB 1
25 1
26 6 // ADD
27 5 // SAVEC 128
28 128
29 11 // JUMP 4
30 8
31 18 // STOP
You can see that seven lines of C code became 18 lines of assembly language, and that
became 32 bytes in ROM.
The instruction decoder needs to turn each of the opcodes into a set of signals that drive the
different components inside the microprocessor. Let's take the ADD instruction as an
example and look at what it needs to do:
1. During the first clock cycle, we need to actually load the instruction. Therefore the
instruction decoder needs to:
• activate the tri-state buffer for the program counter
• activate the RD line
• activate the data-in tri-state buffer
• tell the instruction register to latch the value currently on the data bus
2. During the second clock cycle, the ADD instruction is decoded. It needs to do very
little:
• set the operation of the ALU to addition
• tell the C register to latch the ALU's output
3. During the third clock cycle, the program counter is incremented (in theory this could
be overlapped into the second clock cycle).
Every instruction can be broken down as a set of sequenced operations like these that
manipulate the components of the microprocessor in the proper order. Some instructions,
like this ADD instruction, might take two or three clock cycles. Others might take five or six
clock cycles.
Microprocessor Performance
The number of transistors available has a huge effect on the performance of a processor.
As seen earlier, a typical instruction in a processor like an 8088 took 15 clock cycles to
execute. Because of the design of the multiplier, it took approximately 80 cycles just to do
one 16-bit multiplication on the 8088. With more transistors, much more powerful multipliers
capable of single-cycle speeds become possible.
More transistors also allow for a technology called pipelining. In a pipelined architecture,
instruction execution overlaps. So even though it might take five clock cycles to execute
each instruction, there can be five instructions in various stages of execution simultaneously.
That way it looks like one instruction completes every clock cycle.
Many modern processors have multiple instruction decoders, each with its own pipeline. This
allows for multiple instruction streams, which means that more than one instruction can
complete during each clock cycle. This technique can be quite complex to implement, so it
takes lots of transistors.
The trend in processor design has been toward full 32-bit ALUs with fast floating point
processors built in and pipelined execution with multiple instruction streams. There has also
been a tendency toward special instructions (like the MMX instructions) that make certain
operations particularly efficient. There has also been the addition of hardware virtual memory
support and L1 caching on the processor chip. All of these trends push up the transistor
count, leading to the multi-million transistor powerhouses available today. These processors
can execute about one billion instructions per second!
Computer Caches
A computer is a machine in which we measure time in very small increments. When the
microprocessor accesses the main memory (RAM), it does it in about 60 nanoseconds (60
billionths of a second). That's pretty fast, but it is much slower than the typical
microprocessor, which can have cycle times of just a few nanoseconds.
What if we build a special memory bank, small but very fast (around 30 nanoseconds)?
That's already two times faster than the main memory access. That's called a level 2 cache
or an L2 cache. What if we build an even smaller but faster memory system directly into the
microprocessor's chip? That way, this memory will be accessed at the speed of the
microprocessor and not the speed of the memory bus. That's an L1 cache, which on a 233-
megahertz (MHz) Pentium is 3.5 times faster than the L2 cache, which is two times faster
than the access to main memory.
There are a lot of subsystems in a computer; you can put cache between many of them to
improve performance. Here's an example. We have the microprocessor (the fastest thing in
the computer). Then there's the L1 cache that caches the L2 cache that caches the main
memory which can be used (and is often used) as a cache for even slower peripherals like
hard disks and CD-ROMs. The hard disks are also used to cache an even slower medium --
your Internet connection.
Your Internet connection is the slowest link in your computer. So your browser (Internet
Explorer, Netscape, Opera, etc.) uses the hard disk to store HTML pages, putting them into a
special folder on your disk. The first time you ask for an HTML page, your browser renders it
and a copy of it is also stored on your disk. The next time you request access to this page,
your browser checks if the date of the file on the Internet is newer than the one cached. If the
date is the same, your browser uses the one on your hard disk instead of downloading it
from the Internet. In this case, the smaller but faster memory system is your hard disk and the
larger and slower one is the Internet.
Cache can also be built directly on peripherals. Modern hard disks come with fast memory,
around 512 kilobytes, hardwired to the hard disk. The computer doesn't directly use this
memory -- the hard-disk controller does. For the computer, these memory chips are the disk
itself. When the computer asks for data from the hard disk, the hard-disk controller checks
into this memory before moving the mechanical parts of the hard disk (which is very slow
compared to memory). If it finds the data that the computer asked for in the cache, it will
return the data stored in the cache without actually accessing data on the disk itself, saving a
lot of time.
Here's an experiment you can try. Your computer caches your floppy drive with main
memory, and you can actually see it happening. Access a large file from your floppy -- for
example, open a 300-kilobyte text file in a text editor. The first time, you will see the light on
your floppy turning on, and you will wait. The floppy disk is extremely slow, so it will take 20
seconds to load the file. Now, close the editor and open the same file again. The second
time (don't wait 30 minutes or do a lot of disk access between the two tries) you won't see
the light turning on, and you won't wait. The operating system checked into its memory
cache for the floppy disk and found what it was looking for. So instead of waiting 20 seconds,
the data was found in a memory subsystem much faster than when you first tried it (one
access to the floppy disk takes 120 milliseconds, while one access to the main memory
takes around 60 nanoseconds -- that's a lot faster). You could have run the same test on
your hard disk, but it's more evident on the floppy drive because it's so slow.
To give you the big picture of it all, here's the hierarchy of a normal caching system:
• L1 cache - memory accesses at full microprocessor speed
• L2 cache - SRAM-type memory access (around 20 to 30 nanoseconds)
• Main memory - RAM-type memory access (around 60 nanoseconds)
• Hard disk - mechanical, slow (around 12 milliseconds)
• Internet - incredibly slow (between 1 second and 3 days)
Cache Technology
One common question asked at this point is, "Why not make all of the computer's memory
run at the same speed as the L1 cache, so no caching would be required?" That would work,
but it would be incredibly expensive. The idea behind caching is to use a small amount of
expensive memory to speed up a large amount of slower, less-expensive memory.
In designing a computer, the goal is to allow the microprocessor to run at its full speed as
inexpensively as possible. A 500-MHz chip goes through 500 million cycles in one second
(one cycle every two nanoseconds). Without L1 and L2 caches, an access to the main
memory takes 60 nanoseconds, or about 30 wasted cycles accessing memory.
When you think about it, it is kind of incredible that such relatively tiny amounts of memory
can maximize the use of much larger amounts of memory. Think about a 256-kilobyte L2
cache that caches 64 megabytes of RAM. In this case, 256,000 bytes efficiently caches
64,000,000 bytes. Why does that work?
Even if you don't know much about computer programming, it is easy to understand that in
the 11 lines of this program, the loop part (lines 7 to 9) is executed 100 times. All of the
other lines are executed only once. Lines 7 to 9 will run significantly faster because of
caching.
This program is very small and can easily fit entirely in the smallest of L1 caches, but let's
say this program is huge. The result remains the same. When you program, a lot of action
takes place inside loops. A word processor spends 95 percent of the time waiting for your
input and displaying it on the screen. This part of the word-processor program is in the
cache.
This 95%-to-5% ratio (approximately) is what we call the locality of reference, and it's why a
cache works so efficiently. This is also why such a small cache can efficiently cache such a
large memory system. You can see why it's not worth it to construct a computer with the
fastest memory everywhere. We can deliver 95 percent of this effectiveness for a fraction of
the cost.
Virtual Memory
For example, if you load the operating system, an e-mail program, a Web browser and word
processor into RAM simultaneously, 32 megabytes is not enough to hold it all. If there were
no such thing as virtual memory, then once you filled up the available RAM your computer
would have to say, "Sorry, you cannot load any more applications. Please close another
application to load a new one." With virtual memory, what the computer can do is look at
RAM for areas that have not been used recently and copy them onto the hard disk. This
frees up space in RAM to load the new application.
Because this copying happens automatically, you don't even know it is happening, and it
makes your computer feel like it has unlimited RAM space even though it only has 32
megabytes installed. Because hard disk space is so much cheaper than RAM chips, it also
has a nice economic benefit.
The following is a comparative text meant to give people a feel for the differences
in the various 6th generation x86 CPUs. For this little ditty, I've chosen the Intel
P-II (aka Klamath, P6), the AMD K6 (aka NX686), and the Cyrix 6x86MX (aka
M2). These are all MMX capable 6th generation x86 compatible CPUs, however I
am not going to discuss the MMX capabilities at all beyond saying that they all
appear to have similar functionality. (MMX never really took off as the software
enabling technology Intel claimed it to be, so it's not worth going into any depth
on it.)
Much of the following information comes from online documentation from Cyrix,
AMD and Intel. I have played a little with Pentiums and Pentium-II's from work, as
well as my AMD-K6 at home. I would also like to thank Dan Wax, Lance Smith
and "Bob Instigator" from AMD, who corrected me on several points about the
K6, and both Andreas Kaiser and Lee Powell, who also provided insightful
information and corrections gleaned from first-hand experience with these
CPUs. Also, thanks to Terje Mathisen who pointed out an error, and Brian
Converse who helped me with my grammar.
Comments welcome.
The AMD K6
The K6 architecture seems to mix some of the ideas of the P-II and 6x86MX
architectures. They made trade-offs and decisions that they believed would
deliver the maximal performance over all potential software. They have
emphasized short latencies (like the 6x86MX) but the K6 translates their x86
instructions into RISC operations that are queued in large instruction buffers and
feed many (7 in all) independent units (like the P-II.) While they don't always
have the best single implementation of any specific aspect, this was the result of
conscious decisions that they believe helps strike a balance that hits a good
performance sweet spot. Versus the P-II, they avoid situations of really deep
pipelining which has high penalties when the pipeline has to be backed out.
Versus the Cyrix, the AMD is a fully POST-RISC architecture which is not as
susceptible to pipeline stalls that artificially back up other stages.
General Architecture
This seems remarkably simple considering the features that are claimed for the
K6. The secret is that most of these stages do very complicated things. The light
blue stages execute in an out of order fashion (and were colored by me, not
AMD.)
The fetch stage is much like a typical Pentium instruction fetcher, and is able to
present 16 cache aligned bytes of data per clock. Of course this means that
some instructions that straddle 16 byte boundaries will suffer an extra clock
penalty before reaching the decode stage, much like they do on a Pentium. (The
K6 is a little clever in that if there are partial opcodes from which the predecoder
can determine the instruction length, then the prefetching mechanism will fetch
the new 16 byte buffer just in time to feed the remaining bytes to the issue
stage.)
The decode stage attempts to simultaneously decode 2 simple, 1 long, and fetch
from 1 ROM x86 instruction(s). If both of the first two fail (usually only on rare
instructions), the decoder is stalled for a second clock which is required to
completely decode the instruction from the ROM. If the first fails but the second
does not (the usual case when involving memory, or an override), then a single
instruction or override is decoded. If the first succeeds (the usual case when not
involving memory or overrides) then two simple instructions are decoded. The
decoded "OpQuad" is then entered into the scheduler.
This last statement has been generally misunderstood in its importance (even by
me!). Given that the P-II architecture can decode 3 instructions at once, it is
tempting to conclude that the P-II can execute typically up to 50% faster than a
K6. According to "Bob Instigator" (a technical marketroid from AMD) and "The
That said, in real life decode bandwidth limitations crop up every now and then
as a limiting factor, but rarely egregiously so in comparison to ordinary execution
limitations.
The issue stage accepts up to 4 RISC86 instructions from the scheduler. The
scheduler is basically an OpQuad buffer that can hold up to 6 clocks of
instructions (which is up to 12 dual issued x86 instructions.) The K6 issues
instructions subject only to execution unit availability using an oldest unissued
first algorithm at a maximum rate of 4 RISC86 instructions per clock (the X and Y
ALU pipelines, the load unit, and the store unit.) The instructions are marked as
issued, but not removed until retirement.
The operand fetch stage reads the issued instruction operands without any
restriction other than register availability. This is in contrast with the P-II which
can only read up to two retired register operands per clock (but is unrestricted in
forwarding (unretired) register accesses.) The K6 uses some kind of internal
"register MUX" which allows arbitrary accesses of internal and commited register
space. If this stage "fails" because of a long data dependency, then according to
expected availability of the operands the instruction is either held in this stage for
an additional clock or unissued back into the scheduler, essentially moving the
instruction backwards through the pipeline!
This is an ingenious design that allows the K6 to perform "late" data dependency
determinations without over-complicating the scheduler's issue logic. This clever
idea gives a very close approximation of a reservation station architecture's
"greedy algorithm scheduling".
The execution stages perform in one or two pipelined stages (with the exception
of the floating point unit which is not pipelined, or complex instructions which stall
those units during execution.) In theory, all units can be executing at once.
What we see here is the front end starting fairly tight (two instructions) and the
back end ending somewhat wider (two integer execution units, one load, one
store, and one FPU.) The reason for this seeming mismatch in execution
bandwidth (as opposed to the Pentium, for example which remains two-wide
from top to bottom) is that it will be able to sustain varying execution loads as the
dependency states change from clock to clock. This is at the very heart of what an
out-of-order architecture is trying to accomplish; being wider at the back end is a
natural consequence of this kind of design.
Branch Prediction
Additional stalls are avoided by using a 16 entry times 16 byte branch target
cache which allows first instruction decode to occur simultaneously with
instruction address computation, rather than requiring (E)IP to be known and
used to direct the next fetch (as is the case with the P-II.) This removes an (E)IP
calculation dependency and instruction fetch bubble. (This is a huge advantage
in certain algorithms such as computing a GCD; see my examples for the code)
The K6 allows up to 7 outstanding unresolved branches (which seems like more
than enough since the scheduler only allows up to 6 issued clocks of pending
instructions in the first place.)
The K6 benefits additionally from the fact that it is only a 6 stage pipeline (as
opposed to a 12 stage pipeline like the P-II) so even if a branch is incorrectly
predicted it is only a 4 clock penalty as opposed to the P-II's 11-15 clock penalty.
But because of the K6's limited decode bandwidth, branch instructions take up
precious instruction decode bandwidth. There are no branch execution clocks in
most situations; however, branching instructions end up taking a slot where
essentially no calculation takes place. In that sense K6 branches have a typical penalty of
about 0.5 clocks. To combat this, the K6 executes the LOOP instruction in a
single clock; however, this instruction performs so badly on Intel CPUs that no
compiler generates it.
Floating Point
The common high demand, high performance FPU operations (FADD, FSUB,
FMUL) all execute with a throughput and latency of 2 clocks (versus 1 or 2 clock
throughput and 3-5 clock latency on the P-II.) Amazingly, this means that it can
complete FPU operations faster than the P-II, though it fares worse on FPU code that is
optimally scheduled for the P-II. Like the Pentium, in the P-II Intel has worked hard
on fully pipelining the faster FPU operations which works in their favor. Central to
this is FXCH which, in combination with FPU instruction operands allows two
new stack registers to be addressed by each binary FPU operation. The P-II
allows FXCH to execute in 0 clocks -- the early revs of the K6 took two clocks,
while later revs based on the "CXT core" can execute them in 0 clocks.
Unfortunately, the P-II derives much more benefit from this since its FPU
architecture allows it to decode and execute at a peak rate of one new FPU
instruction on every clock.
More complex instructions such as FDIV, FSQRT and so on will stall more of the
units on the P-II than on the K6. However since the P-II's scheduler is larger it will
be able to execute more instructions in parallel with the stalled FPU instruction
(21 in all, however the port 0 integer unit is unavailable for the duration of the
stalled FPU instruction) while the K6 can execute up to 11 other x86 instructions
at full speed before needing to wait for the stalled FPU instruction to complete.
In a test I wrote (admittedly rigged to favor Intel FPUs) the K6 measured to only
perform at about 55% of the P-II's performance. (Update: using the K6-2's new
SIMD floating point features, the roles have reversed -- the P-II can only execute
at about 70% of a K6-2's speed.)
An interesting note is that FPU instructions on the K6 will retire before they
completely execute. This is possible because it is only required that they work
out whether or not they will generate an exception, and the execution state is
reset on a task switch, by the OS's built-in FPU state saving mechanism.
The state of floating point has changed so drastically recently that it's hard to
make a definitive comment on this without a plethora of caveats. Facts: (1) the
pure x87 floating point unit in the K6 does not compare favorably with that of the
P-II, (2) this does not tend to always reflect in real life software which can be
made from bad compilers, (3) the future of floating point clearly lies with SIMD,
where AMD has clearly established a leadership role. (4) Intel's advantage was
primarily in software that was hand optimized by assembly coders -- but that has
clearly reversed roles since the introduction of the K6-2.
Cache
The K6's L1 cache is 64KB, which is twice as large as the P-II's L1 cache. But it
is only 2 way set associative (as opposed to the P-II which is 4 way). This makes
the replacement algorithm much simpler, but decreases its effectiveness in
random data accesses. The increased size, however, more than compensates
for the lost associativity. For code that works with contiguous data sets,
the K6 simply offers twice the working set ceiling of the P-II.
Like the P-II, the K6's cache is divided into two fixed caches for separate code
and data. I am not as big a fan of split architectures (commonly referred to as the
Harvard Architecture) because they set an artificial lower limit on your working
sets. As pointed out to me by the AMD folk, this keeps them from having to worry
about data accesses kicking out their instruction cache lines. But I would expect
this to be dealt with by associativity and don't believe that it is worth the trade off
of lower working set sizes.
Among the design benefits they do derive from a split architecture is that they
can add pre-decode bits to just the instruction cache. On the K6, the predecode
bits are used for determining instruction length boundaries. Their address tags
(which appear to work out to 9 bits) point to a sector which contains two 32 byte
cache lines, which (I assume) are selected by standard associativity rules.
Each cache line has a standard set of status bits to indicate accessibility state
(obsolete, busy, loaded, etc.).
Although the K6's cache is non-blocking (allowing accesses to other lines even
while a cache line miss is being processed), the K6's load/store unit architecture
only allows in-order data access. So this feature cannot be taken advantage of on
the K6. (Thanks to Andreas Kaiser for pointing this out to me.)
In addition, like the 6x86MX, the store unit of the K6 is actually buffered by a
store queue. A neat feature of the store unit architecture is that it has two
operand fetch stages -- the first for the address, and the second for the data,
which happens one clock later. This allows stores of data that are being
computed in the same clock as the store to occur without any apparent stall.
That is so darn cool!
But perhaps more fundamentally, as AMD have said themselves, bigger is better,
and at twice the P-II's size, I'll have to give the nod to AMD (though a bigger nod
to the 6x86MX; see below.)
The K6 takes two (fully pipelined) clocks to fetch from its L1 cache from within its
load execution unit. Like the original P55C, the 6x86MX spends extra load clocks
(i.e., address generation) during earlier stages of their pipeline. On the other
hand this compares favorably with the P-II which takes three (fully pipelined)
clocks to fetch from the L1 cache. What this means is that when walking a
(cached) linked list (a typical data structure manipulation), the 6x86MX is the
fastest, followed by the K6, followed by the P-II.
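The linked-list claim is easy to see in C: each load of the next pointer depends on the one before it, so no amount of out-of-order machinery can overlap the loads, and the walk proceeds at exactly the L1 load-to-use latency per node (2 clocks on the K6, 3 on the P-II, per the text). The node type and function below are a hypothetical illustration of mine:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; int value; };

/* Each iteration's load of p->next is data-dependent on the previous
   load, so the loop's throughput equals the L1 load-to-use latency. */
int walk_sum(const struct node *p) {
    int sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;   /* dependent load: cannot start before p arrives */
    }
    return sum;
}
```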
Update: AMD has released the K6-3 which, like the Celeron, adds a large on-die
L2 cache. The K6-3's L2 cache is 256K, which is larger than the Celeron's at
128K. Unlike Intel, however, AMD has recommended that motherboards continue
to include on-board L2 caches, creating what AMD calls a "TriLevel cache"
architecture (I recall that an earlier Alpha-based system did exactly this same
thing.) Benchmarks indicate that the K6-3 has increased in performance between
10% and 15% over similarly clocked K6-2's! (Wow! I think I might have to get one
of these.)
Other
• According to AMD, the typical 32 bit decode bandwidth is about the same
for both the K6 and the P-II, but 16 bit decode is about 20% faster for the
K6. Unfortunately for AMD, if software developers and compiler writers
heed the P-II optimization rules with the same vigor that they did with the
Pentium, the typical decode bandwidth will change over time to favor the
P-II.
• The K6's issue to execute scheduling is pretty cool. They use complete
logical comparisons between pipeline stages to always find the best path
forward. The 6x86MX seems to just let their pipelines accumulate with work
moving only in a forward direction, which makes them more susceptible to
being backed up, but they do allow their X and Y pipes to swap contents at
one stage.
• The K6 does not support the new P6 ISA instructions, specifically, the
conditional move instructions. It also does not appear to support the set of
MSRs that the P6 does (besides the ever-important TSC register.) So from
a programmer's architecture point of view, the K6 is more like a Pentium
than a Pentium-II. It's not clear that this is a really big issue since all the
modern compilers still target the 80386 ISA.
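Since the K6 lacks CMOV, code that must run well on both chips has to express selects with plain 386-era arithmetic if it wants to stay branch-free. One common idiom, as a sketch of my own (not from AMD's documentation):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless minimum using only 386-era ALU operations: -(a < b) is
   all-ones when a < b, so the masked XOR selects a; otherwise b. */
int32_t min_branchless(int32_t a, int32_t b) {
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}
```

On a P6 a compiler could emit CMOV for the same select; targeting the K6, this idiom (or an ordinary conditional jump) is the only option.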
Optimization
AMD, realizing that there is tremendous interest in code optimization for certain
high performance applications, decided to write up some optimization
documentation for the K6 (and now K6-2) processor(s). The documentation is
fairly good about describing general strategies, as well as giving a fairly detailed
description for modelling the exact performance of code. This documentation far
exceeds the quality of any of Intel's "Optimization AP notes", fundamentally
because it's accurate and more thorough.
The reason I have come to this conclusion is that the architecture of the chip
itself is much more straightforward than, say, the P-II, and so there is less
explanation necessary. So the volume of documentation is not the only
determining factor in measuring its quality.
If companies were interested in writing a compiler that optimized for the K6 I'm
sure they could do very well. In my own experiments, I've found that optimizing
for the K6 is very easy.
Brass Tacks
The K6 is cheap and supports Super Socket 7 (with a 100MHz bus), and it has
established itself very well in the marketplace, winning business from all the top
tier OEMs (with the exception of Dell, which seems to have missed the consumer
market shift entirely, and taken a serious step back from challenging Compaq's
number one position.) AMD really changed the minds of people who thought the
x86 market was pretty much an Intel deal (including me.)
Their marketing strategy of selling at a low price while adding features (cheaper
Super7 infrastructure, SIMD floating point, 256K on-chip L2 cache combined with
motherboard L2 cache) has paid off in an unheard-of level of brand name
recognition outside of Intel. Indeed, 3DNow! is a great counter to Intel Inside. If
nothing else they helped create a real sub-$1000 PC market, and have dictated
the price for retail x86 CPUs (Intel has been forced to drop even their own prices
to unheard-of lows for them.)
AMD has struggled more to meet the demand for new speeds as they come
online (they seem predictably optimistic) but overall has been able to sell a
boatload of K6's without being stepped on by Intel.
The first release of their x86 Optimization guide is what triggered me to write this
page. With it, I had documentation for all three of these 6th generation x86
CPUs. Unfortunately, they often elect to go with terse explanations that assume
the reader is very familiar with CPU architecture and terminologies. This led me
to some misunderstandings from my initial reading of the documentation (I'm just
a software guy.) On the other hand, the examples they give really help clarify the
inner workings of the K6.
Update: The IEEE Computer Society has published a book called "The Anatomy
of a High-Performance Microprocessor: A Systems Perspective" based on the
AMD K6-2 microprocessor. It gives inner details of the K6-2 that I have never
seen in any other documentation on microprocessors before. These details are a
bit overwhelming for a mere software developer; however, for a hard core x86
hacker it's a treasure trove of information.
Intel has enjoyed the status of "de facto standard" in the x86 world for some time.
Their P6/P-II architecture, while not delivering the same performance boost as
previous generational increments, solidifies their position. It is the fastest, but it
is also the most expensive of the lot.
General Architecture
The P-II is a highly pipelined architecture with an out of order execution engine in
the middle. The Intel Architecture Optimization Manual lists the following two
diagrams:
The two sections shown are essentially concatenated, showing 10 stages of in-
order processing (since retirement must also be in-order) with 3 stages of out of
order execution (RS, the Ports, and ROB write back colored in light blue by me,
not Intel.)
Intel's basic idea was to break down the problem of execution into as many units
as possible and to peel away every possible stall that was incurred by their
previous Pentium architecture as each instruction marches forward down their
assembly line. In particular, Intel invests 5 pipelined clocks to go from the
instruction cache to a set of ready to execute micro-ops. (RISC architectures
have no need for these 5 stages, since their fixed width instructions are generally
already specified to make this translation immediate. It is these 5 stages that truly
separate the x86 from ordinary RISC architectures, and Intel has essentially
solved it with a brute force approach which costs them dearly in chip area.)
As a note of interest, Intel divides the execution and write back stages into two
separate stages (the K6 does not, and there is really no compelling reason for
the P6's method that I can see.)
Although it is not as well described, I believe that Intel's reservation station and
reorder buffer combination serves substantially the same purpose as the K6's
scheduler, and similarly the retire unit acts on instruction clusters in exactly the
same way as they were issued (CPUs are not otherwise known to have sorting
algorithms wired into them.) Thus the micro-op throughput is limited to 3 per
clock (compared with 4 RISC86 ops for the K6.)
So when everything is working well, the P-II can take 3 simple x86 instructions
and turn them into 3 micro-ops on every clock. But, as can be plainly seen in
their comments, they have a bizarre problem: they can only read two physical input
register operands per clock (rename registers are not constrained by this
condition.) This means scheduling becomes very complicated. Registers to be
read for multiple purposes will not cost very much, and data dependencies don't
suffer from any more clocks than expected, however the very typical trick of
spreading calculations over several registers (used especially in loop unrolling)
will upper bound the pipeline to two micro-ops per clock because of a physical
register read bottleneck.
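Spreading a calculation over several registers looks like this in C (a generic unrolling sketch of mine, not Intel's example). The two accumulators break the dependency chain, but each instruction now reads a distinct architectural register, which is exactly where the two-read-port ceiling bites:

```c
#include <assert.h>
#include <stddef.h>

/* Sum with two independent accumulators. The two chains can run in
   parallel, but on the P-II the extra distinct architectural register
   reads per clock can hit the two-physical-register-read limit. */
long sum_unrolled(const int *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0 */
        s1 += a[i + 1];  /* chain 1 */
    }
    if (i < n) s0 += a[i];  /* odd-length tail */
    return s0 + s1;
}
```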
In any event, the decoders (which can decode up to 6 micro-ops per clock) are
clearly out-stripping the later pipeline stages which are bottlenecked both by the
3 micro-op issue and two physical register read operand limit. The front end
easily outperforms the back end. This helps Intel deal with their branch bubble,
by making sure the decode bandwidth can stay well ahead of the execution units.
Something that you cannot see in the pictures above is the fact that the FPU is
actually divided into two partitioned units. One for addition and subtraction and
the other for all the other operations. This is found in the Pentium Pro
documentation and given the above diagram and the fact that this is not
mentioned anywhere in the P-II documentation I assumed that in fact the P-II
was different from the PPro in this respect (Intel's misleading documentation is
really unhelpful on this point.) After I made some claims about these differences
on USENET some Intel engineer (who must remain anonymous since he had a
copyright statement insisting that I not copy anything he sent me -- and it made
no mention of excluding his name) who claims to have worked on the PPro felt it
his duty to point out that I was mistaken about this. In fact, he says, the PPro and
P-II have an identical FPU architecture. So in fact the P-II and PPro really are the
same core design with the exception of MMX, segment caching and probably
some different glue logic for the local L2 caches.
This engineer also reiterated Intel's position on not revealing the inner workings
of their CPU architectures, thus rendering it impossible for ordinary software
engineers to know how to properly optimize for the P-II.
Branch Prediction
Central to facilitating the P-II's aggressive fire and forget execution strategy is full
branch prediction. The functionality has been documented by Agner Fog, and
can track very complex patterns of branching. They have advertised a prediction
rate of about 90% (based on academic work using the same implementation.)
This prediction mechanism was also incorporated into the Pentium MMX CPUs.
Unlike the K6's, the branch target buffer contains target addresses, not instruction
bytes, and predictions only for the current branch. This means an extra clock is
required for taken branches to be able to decode their branch target. Branches not in the
branch target buffer are predicted statically (backward jumps taken, forward
jumps not.) However, this "extra clock" is generally overlapped with execution
clocks, and hence is not a factor except in short loops, or code loops with poorly
translated code sequences (like compiled sprites.)
When a branch is mispredicted, the correct instruction stream cannot be known until the mispredict is completely processed. This
huge penalty offsets the performance of the P-II, especially in code in which no
P6/P-II optimizations considerations have been made.
The P-II's predictor always deals with addresses (rather than boolean compare
results as is done in the K6) and so is applicable to all forms of control transfer
such as direct and indirect jumps and calls. This is critical to the P-II given that
the latency between the ALUs and the instruction fetch is so large.
In the event of a conditional branch both addresses are computed in parallel. But
this just aids in making the prediction address ready sooner; there is no
appreciable performance gained from having the mispredicted address ready
early given the huge penalty. The addresses are computed in an integer
execution port (separate from the FPU) so branches are considered an ALU
operation. The prefetch buffer is stalled for one clock until the target address is
computed, however since the decode bandwidth out-performs the execution
bandwidth by a fair margin, this is not an issue for non-trivial loops.
This is obviously a lot higher than the K6 penalty. (The zero as the first penalty
assumes that the loop is sufficiently large to hide the one clock branch bubble.)
For programmers this means one major thing: Avoid mispredicted branches in
your inner loops at all costs (make that 10% closer to 0%). Tables and
conditional move instructions are common workarounds; however, since the
predictor is used even in indirect jumps, there are situations with branching
where you have no choice but to suffer from branch prediction penalties.
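As an example of trading a branch for arithmetic (my own illustration, not from Intel's notes): an absolute value computed on random data mispredicts about half the time when written with a compare, while the sign-mask form spends a fixed couple of ALU clocks with nothing to mispredict.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy form: on random signs the conditional jump mispredicts
   roughly 50% of the time. */
int32_t abs_branchy(int32_t x) { return x < 0 ? -x : x; }

/* Branchless form: the arithmetic shift yields 0 or -1, and
   (x ^ mask) - mask negates exactly when x is negative. (Relies on
   arithmetic right shift of negative values, as on x86.) */
int32_t abs_branchless(int32_t x) {
    int32_t mask = x >> 31;
    return (x ^ mask) - mask;
}
```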
Floating Point
In keeping with their post-RISC architecture, the P-IIs have in some cases
increased the latency of some of the FPU instructions over the Pentium for the
sake of pipelining at high clock rates, with the idea that it hopefully will not matter
if the code is properly scheduled. Intel says that FXCH requires no execution cycles,
but does not explicitly state whether or not throughput bubbles are introduced.
Other than latency, the P-II is very similar to the Pentium in terms of performance
characteristics. This is because all FPU operations go through port 0 except
FXCH's which go to port 1, and the first stage of a multiply takes two non-
pipelined clocks. This is pretty much identical to the P5 architecture.
The Intel floating point design has traditionally beat the Cyrix and AMD CPUs on
floating point performance and this still appears to hold true as tests with Quake
and 3D Studio have confirmed. (The K6 is also beaten, but not by such a large
margin -- and in the case of Quake II on a K6-2 the roles are reversed.)
The P-II's floating point unit is issued from the same port as one of the ALU units.
This means that it cannot issue two integer operations and one floating point
operation on every clock, and thus is likely to be constrained to an issue rate
similar to the K6's. As Andreas Kaiser points out, this does not necessarily
preclude later execution clocks (for slower FPU operations, for example) from
executing in parallel across all three basic math units (though this same
comment applies to the K6).
As I mentioned above, the P-II's floating point unit is actually two units, one is a
fully pipelined add and subtract unit, and the other is a partially pipelined complex
unit (including multiplies.) In theory this gives greater parallelism opportunities
over the original Pentium but since the single port 0 cannot feed the units at a
rate greater than 1 instruction per clock, the only value is design simplification.
For most code, especially P5 optimized code, the extra multiply latency is likely
to be the most telling factor.
Update: Intel has introduced the P-III, which is nothing more than a 500MHz+ P6
core with 3DNow!-like SIMD instructions. These instructions appear to be very
similar in functionality and remarkably similar in performance to the 3DNow!
instruction set. There are a lot of misconceptions about the performance of SSE
versus 3DNow! The best analysis I've seen so far indicates that they are nearly
identical, by virtue of the fact that Intel's "4-1-1" issue rate restriction holds back
the mostly meaty 2 micro-op SSE instructions. Furthermore, there are twice as
many micro-ops contending for the SSE units per instruction as with 3DNow!,
which totally nullifies the doubled output width. In any event, it's almost
humorous to see Intel playing catch up to AMD like this. The clear winner:
consumers.
Cache
The P-II's L1 cache is 32KB divided into two fixed 16KB caches for separate
code and data. These caches are 4-way set associative which decreases
thrashing versus the K6. But relatively speaking, this is quite small and inflexible
when compared with the 6x86MX's unified cache. I am not a big fan of the P-II's
smaller, less flexible L1 cache, and it appears as though they have done little to
justify it being half the size of their competitors' L1 caches.
The greater associativity helps programs that are written indifferently with respect
to data locality, but has no effect on code mindful of data locality (i.e., keeping
their working sets contiguous and no larger than the L1 cache size.)
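"Mindful of data locality" means touching memory in the order it is laid out. A standard C illustration (mine, not from the text): summing a matrix row by row walks contiguous addresses and uses every byte of each 32-byte line, while a column-by-column walk strides through memory and pulls in a new line on almost every access once the matrix outgrows the cache.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 8, COLS = 8 };

/* Row-major walk: consecutive accesses hit consecutive addresses,
   so each cache line is fully consumed before moving on. */
int sum_rows(int m[ROWS][COLS]) {
    int s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk: each access jumps COLS * sizeof(int) bytes,
   spreading the same work across many more lines at once. */
int sum_cols(int m[ROWS][COLS]) {
    int s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Both functions compute the same sum; only the traversal order, and hence the working set behavior, differs.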
The P-II also has an "on-PCB L2 cache". What this means is that they do not
need to use the motherboard bus to access their L2 cache. As such, the
communications interface can (and does) run at a much higher frequency -- in
current P-IIs, 1/2 the CPU clock rate. This is an advantage over the K6, K6-2
and 6x86MX CPUs, which access motherboard-based L2 caches at only 66MHz
or 100MHz. (However, the K6-III's on-die L2 cache runs at the CPU clock rate,
which is thus twice as fast as the P-II's.)
Other
• The P-II has a partial register stall which is very costly. This occurs when
writing to a sub-register within a few clocks of writing to a 32 bit register.
That is to say, writing to a ?l or ?h 8 bit register will cause a partial register
stall when next reading the corresponding ?x or e?x register. The same is
true of writing to a ?x register then reading the corresponding e?x register.
As described by Agner Fog, the front end is in-order and must assign
internal registers before the instruction can be entered into the
reservation stations. If there is a partial register overlap with a live
instruction ahead of it, then a disjoint register cannot be assigned until that
instruction retires. This is a devastating performance stall when it occurs
because new instructions cannot even be entered into the reservation
stations until this stall is resolved. Intel lists this as having roughly a 7
clock cost.
This is not a big issue so long as the execution units are kept busy with
instructions leading up to this partial register stall, but that is a difficult
criterion to code towards. One way to accomplish this would be to try to
schedule this partial register stall as far away from the previous branch
control transfer as possible (the decoders usually get well ahead of the
ALUs after several clocks following a control transfer.)
• The P-II, like the P6, performs worse on 16 bit code per clock rate than the
Pentium. (Significantly worse than the Cyrix 6x86MX, and somewhat
worse than the K6.) However, the P-II is not as bad as the P6. In
particular, it uses a small 16 bit segment/selector cache which the P6
does not.
• The P-II's data accesses actually require an additional address unit for
stores. What this means is that memory writes must be broken down into
"address store" and "data store" micro-ops. This increases data write
latency (versus the K6.)
• The P-II can decode instructions to many, many micro-ops, but really only
decodes optimally when 2 out of every 3 instructions are decoded to a
single micro-op and in a specific "4-1-1" sequence (that is for three
instructions to decode in parallel the first must decode to no more than 4
micro-ops, and the second and third in no more than 1 micro-op).
Instructions must also be 8 bytes or less to allow other instructions to be
decoded in the same clock. According to MicroProcessor Report, only one
load or store memory operation can be decoded in the first of the at most
3 instructions. If this is true, it certainly detracts from the "one load or store
operation per clock" claim Intel makes (of course the second of the two
store microops might execute at the same time as a load.)
Only under these circumstances can the P-II achieve its maximum rate of
decoding 3 instructions per cycle.
Update: I recently tried to hand optimize some code, and found that it is
actually not all that difficult to achieve the 3 instruction issue per clock, but
that certainly no compiler I know of is up to the task. It turns out, though,
that such activities are almost certainly a red herring since dependency
bubbles will end up throttling your performance anyways. My
recommendation is to parallelize your calculations as much as possible.
• Stores are pipelined, but not queued as on the 6x86MX or K6. This means
cache misses necessarily stall subsequent store micro-op execution. So
the P-II ends up using the reservation station to queue up store
commands rather than a dedicated store queue. It is not totally clear to me
if this stalls the load unit, but I am guessing not, since the cache has been
claimed to be non-blocking.
• The K6 requires in-order writes, while the P-II almost assuredly reorders
its writes very aggressively in an attempt to build contiguous memory write
streams. The original P6 core has also included write combining (makes
clustered byte writes appear as byte enabled dword writes to the PCI bus.)
With the introduction of the Pentium Pro, many 3rd party hardware
peripheral vendors that used the memory mapping features of PCI found
themselves fixing their drivers to, in some cases, work around this
"feature" of the P6 architecture. However for ordinary applications this just
meant higher memory bandwidth performance (more so with the P-II than
the P6.)
• Intel has leveraged its dominance in the market, an advanced process and a
daring approach to L2 cache usage to introduce their Slot 1 cartridge
interface to motherboards. The upshot of all of this is that they are able to
use a larger heat sink and have better control over a more reliably yielded
L2 cache running at a reasonable clock rate (half the processor speed.)
At the same clock rate, this is its biggest advantage over the current K6,
whose L2 cache is tied to the chipset speed of 66MHz.
• Intel's CPUs come with more MSRs, which give detailed information about
branch prediction and scheduling stalls (by net counts) and let you mark
memory type ranges with respect to cacheability and write combinability.
These details, among others, were at the heart of the controversy
surrounding "Appendix H" a while back with the Pentium CPU.
But now, after being pressured into publishing information about MSRs,
Intel has decided to go one step further and provide a tool to help present
the MSR information in a Windows program. While this tool is very useful
in and of itself, it would be infinitely superior if there were accompanying
documentation that described the P-II's exact scheduling mechanism.
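The partial register stall described in the first bullet above can usually be sidestepped by never mixing sub-register writes with full-register reads; in C terms, widen narrow values explicitly so the compiler can use a zero-extending load (MOVZX) rather than writing AL and then reading EAX. Both helpers below are my own illustration of the two patterns, and what a compiler emits for them is era- and compiler-dependent:

```c
#include <assert.h>
#include <stdint.h>

/* Stall-prone pattern: merging a byte into a dword tempts an x86
   compiler into "mov al, ..." followed by a read of eax -- exactly
   the partial register overlap the P-II serializes on. */
uint32_t merge_low_byte(uint32_t x, uint8_t b) {
    return (x & 0xFFFFFF00u) | b;
}

/* Stall-free pattern: zero-extend the byte first (MOVZX), so only
   full 32-bit registers are ever written and then read. */
uint32_t widen_byte(uint8_t b) {
    return (uint32_t)b;
}
```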
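The "4-1-1" pairing rule from the decode bullet above is mechanical enough to model with a toy decode-group counter (entirely my own sketch: it tracks only micro-op counts and ignores the 8-byte length limit and the load/store restriction also mentioned there). Given each instruction's micro-op count, it reports how many clocks the three decoders need:

```c
#include <assert.h>
#include <stddef.h>

/* Toy "4-1-1" decode model: decoder 0 takes an instruction of up to 4
   micro-ops; decoders 1 and 2 only take single-micro-op instructions.
   Anything that doesn't fit the current template starts a new clock. */
int decode_clocks(const int *uops, size_t n) {
    int clocks = 0;
    size_t i = 0;
    while (i < n) {
        clocks++;
        if (uops[i] > 4) { i++; continue; }  /* microcoded: decodes alone here */
        i++;                                  /* decoder 0 takes it */
        for (int d = 0; d < 2 && i < n && uops[i] == 1; d++)
            i++;                              /* decoders 1 and 2 */
    }
    return clocks;
}
```

So a stream of {1, 1, 1} instructions decodes in one clock, while {2, 2, 2} needs three.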
Optimization
Intel has been diligent in creating optimization notes and even some interactive
tutorials that describe how the P-II microarchitecture works. But the truth is that
they serve as much as CPU advertisements as they do as serious technical
material. As we found out with the Pentium CPU, Intel's notes were woefully
inadequate to give an accurate characterization for modelling its behaviour with
respect to performance (this opened the door for people like Michael Abrash and
Agner Fog to write up far more detailed descriptions based on observation rather
than Intel's anemic documentation.) They contain egregious omissions, without
giving a totally clear description of the architecture.
While they claim that hand scheduling has little or no effect on performance,
experiments I and others have conducted have convinced me that this simply is
not the case. In the few attempts I've made using ideas I've recently been shown
and studied myself, I can get between 5% and 30% improvement on very
innocent looking loops via some very unintuitive modifications. The problem is
that these ideas don't have any well described explanation -- yet.
With the P-II we find a nice dog and pony show, but again the documentation is
inadequate to describe essential performance characteristics. They do steer you
away from the big performance drains (branch misprediction and partial register
stalls.) But in studying the P-II more closely, it is clear that there are lots of things
going on under the hood that are not generally well understood. Here are some
examples (1) since the front end out-performs the back-end (in most cases) the
"schedule on tie" situation is extremely important, but there is not a word about it
anywhere in their documentation (Lee Powell puts this succinctly by saying that
the P-II prefers 3X superscalar code to 2X superscalar code.) (2) The partial
register stall appears to totally parallelize with other execution in some cases
(the stall is less than 7 clocks), while not at all in others (a 7 clock stall in
addition to ordinary clock expenditures.) (3) Salting execution streams
So why doesn't Intel tell us these things so we can optimize for their CPU? The
theory that they are just telling Microsoft or other compiler vendors under NDA
doesn't fly since the kinds of details that are missing are well beyond the
capabilities of any conventional compiler to take advantage of (I can still beat the
best compilers by hand without even knowing the optimization rules, but instead
just by guessing at them!) I can only imagine that they are only divulging these
rules to certain companies that perform performance critical tasks that Intel has a
keen interest in seeing done well running on their CPUs (soft DVD from Zoran for
example; I'd be surprised if Intel didn't give them either better optimization
documentation or actual code to improve their performance.)
Intel has their own compiler that they have periodically advertised on the net as a
plug in replacement for the Windows NT based version of MSVC++, available for
evaluation purposes (it's called Proton if I recall correctly). However, it is unclear
to me how good it is, or whether anyone is using it (I don't use WinNT, so I did
not pursue trying to get on the beta list). Update: I have been told that Microsoft
and Inprise (Borland) have licensed Intel's compiler source and have been using
it as their compiler base.
Brass Tacks
When it comes right down to brass tacks though, the biggest advantage of their
CPU is the higher clock rates that they have achieved. They have managed to
stay one or two speed grades ahead of AMD. The chip also enjoys the benefit of
the Intel Inside branding. Intel has spent a ton of money in brand name
recognition to help lock its success over competitors. Like the Pentium, the P-II
still requires a lot of specific coding practices to wring the best performance out
of them, and there's no doubt that many programmers will do this, and Intel has gone
to some great lengths to write tutorials that explain how to do this (regardless of
their lack of correctness, they will give programmers a false sense of
empowerment).
During 1998, the transition to super cheap PCs that consumers have been
begging for for years finally took place. This is sometimes called the sub-$1000 PC
market segment. Intel's P-II CPUs are simply too expensive (costing up to $800
alone) for manufacturers to build compelling sub-$1000 systems with them. As
such, Intel has watched AMD and Cyrix pick up unprecedented market share.
Intel made a late foray into the sub-$1000 PC market. Their whole business
model did not support such an idea. Intel's "value consumer line" the Celeron
started out as a L2-cacheless piece of garbage architecture (read: about the
same speed as P55Cs at the same clock rates), then switched to an integrated
L2 cache architecture (stealing the K6-3's thunder). Intel was never really able to
shake off the bad reputation that stuck to the Celeron, but perhaps that was their
intent all along. It is now clear that Intel is basically dumping Celerons in an effort
to wipe out AMD and Cyrix, while trying to maintain their hefty margins in their
Pentium-II line. For the record, there is little performance difference between a
Pentium-II and a Celeron, and the clock rates for the Celeron were being made
artificially slow so as not to eat into their Pentium line. This action alone has
brought a resurgence into the "over clocking game" that some adventurous
power users like to get into.
But Intel being Intel has managed to seriously dent what was exclusively an AMD
and Cyrix market for a while. Nevertheless, since the "value consumer" market
has been growing so strongly, AMD and Cyrix have been able to increase their
volumes even with Intel's encroachment.
The P-II architecture is getting long in the tooth, but Intel keeps insisting on
pushing it (demonstrating an uncooled 650MHz sample in early 1999.) Mum's the
word on Intel's seventh generation x86 architecture (the Willamette or Foster),
probably because that architecture is not scheduled to be ready before late 2000.
This old 6th generation part may prove to be easy pickings for Cyrix's Jalapeno
and AMD's K7, both of which will be available in the second half of 1999.
While Intel does have plenty of documentation on their web site, they quite
simply do not sit still with their URLs. It is impossible to keep track of these URLs,
and I suspect Intel keeps changing their URLs based on some ulterior motive. All
I can suggest is: slog through their links starting at the top. I have provided a link
to Agner Fog's assembly page where his famous Pentium optimization manual
has been updated with a dissection of the P-II.
• http://www.intel.com/
• developer.intel.com
• Agner Fog's P5 and P-II optimization manual
The primary microarchitecture difference of the 6x86MX CPU versus the K6 and
P-II CPUs is that it still does native x86 execution rather than translation to
internal RISC ops.
General Architecture
By being able to swap the instructions, there is no concern about artificial stalls
due to scheduling of instructions to the wrong pipeline. By introducing two address
generation stages, they eliminate the all too common AGI stall that is seen in the
Pentium. The 6x86MX relies entirely on up-front dependency resolution via register
renaming and data forwarding; it does not buffer instructions in any way. Thus its
instruction issue performance becomes bottlenecked by dependencies.
The out of order nature of the execution units are not very well described in
Cyrix's documentation beyond saying that slower instructions will make way for
faster instructions. Hence it is not clear what the execution model really looks
like.
Branch Prediction
The Cyrix CPU uses a 512 entry 2 bit predictor and this does not have a
prediction rate that rivals either the P-II or K6 designs. However, both sides of the
branch will have their first instruction decoded simultaneously in the same clock. In
this way, the Cyrix hedges its bets so that it doesn't pay such a severe
performance penalty when its prediction goes wrong. Beyond this, it appears as
though Cyrix has gone full Post-RISC architecture and supports a branch
predict and speculative execution model. This fits nicely with their aggressive
register renaming, and data forwarding model from the original 6x86 design.
Because of potential FPU exceptions, all FPU instructions are treated the same
way as branch prediction. I would expect the same to be true of the P-II, but Intel
has not documented this, whereas Cyrix has.
They have a fixed scheme of 4 levels of speculation, which are simply increased
for every new speculative instruction issued (this is somewhat lower than the P-II
and K6, which can have 20 or 24 live instructions at any one given time, and
somewhat more outstanding branches.)
The 6x86MX architecture is more lock-stepped than the K6, and as such its
issue follows its latency timings more closely. Specifically, the decode, issue
and address generation stages execute in lock step, with any stalls from
resource contention, complex decoding, etc., backing up the entire instruction
fetch stage. However, the design makes it clear that Cyrix does everything
possible to resolve these resource contentions as early as possible. This is to
be contrasted with the K6 design, which is not lock-stepped at all but, due to
its late resource contention resolution, may end up re-issuing instructions
after wasting an extra clock that it didn't need to in its operand fetch
stage.
Floating Point
The 6x86MX has significantly slower floating point. Cyrix's FADD, FMUL,
FLD and FXCH instructions all take at least 4 clocks, which puts it at one
quarter of the P-II's peak FPU execution rate. The 6x86MX (and even the older
6x86) tries to make up for this with an FPU instruction FIFO. This means that
most of the FPU clocks can be overlapped with integer clocks, and that a
handful of FPU operations can be in flight at the same time, but in general it
requires hand scheduling and relatively light use of the FPU to leverage fully.
Oddly enough, their FDIV and FSQRT performance is about as good as, if not
better than, the P-II implementation. This seems like an odd design decision,
as optimizing FADD, FLD, FXCH and FMUL is clearly of much higher importance.
Like AMD, Cyrix designed the 6x86MX floating point around the weak FPU code
that x86 compilers generate. But, personally, I think Cyrix has gone way too
far in ignoring FPU performance. Compilers only need to get a tiny bit better
for the difference between the Cyrix and Pentium II to be very noticeable on
FPU code.
Cache
The Cyrix's cache design is a unified 64KB cache with a separate 256 byte
instruction buffer. I prefer this design to the K6's and P-II's split code and
data architecture, since it better accommodates the different ratios of code
usage to data usage that you would expect in varied software designs. As an
example, p-code interpreters, or interpreted languages in general, would be
expected to benefit more from a larger data cache. The same would apply to
multimedia algorithms, which tend to apply simple transformations to large
amounts of data (though in truth, your system benefits more from coprocessors
for this purpose.) As another example, highly complex (compiled) applications
that weave together the resources of many code paths (web browsers, office
suite packages, and pre-emptive multitasking OSes in general) would prefer to
have larger instruction caches. At both extremes, the Cyrix has twice the cache
ceiling.
Thus the L1 cache becomes a sort of L2 cache for the 256 byte instruction line
buffer, which allows the Cyrix design to confine predecode bits and the like to
a much smaller cache structure, and to use the unified L1 cache more
efficiently as described above. Although I don't know the details, the prefetch
unit could try to see a cache miss coming and pre-load the instruction line
cache in parallel with ordinary execution; this would compensate for the
instruction cache's unusually small size, I would expect to the point of making
it a moot point.
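To illustrate the tradeoff, here is a toy LRU model (illustrative Python; the sizes and access trace are invented for the example and are not 6x86MX parameters) comparing a unified cache against a split cache of the same total size on an interpreter-like, data-heavy trace:

```python
# Toy LRU cache model comparing a unified cache with a split cache of the
# same total line count on a data-heavy access trace. All sizes and the
# trace itself are illustrative assumptions, not 6x86MX parameters.

from collections import OrderedDict

class LRUCache:
    def __init__(self, lines):
        self.lines = lines
        self.store = OrderedDict()   # insertion order == LRU order
        self.hits = self.accesses = 0

    def access(self, line_addr):
        self.accesses += 1
        if line_addr in self.store:
            self.hits += 1
            self.store.move_to_end(line_addr)   # mark most recently used
        else:
            if len(self.store) >= self.lines:
                self.store.popitem(last=False)  # evict the LRU line
            self.store[line_addr] = True

# An interpreter-like trace: a small hot code loop over a large data set.
code = [("I", pc) for pc in range(8)]    # 8 hot code lines
data = [("D", a) for a in range(48)]     # 48 data lines, reused each pass
trace = (code + data) * 20

unified = LRUCache(64)                        # one 64-line unified cache
icache, dcache = LRUCache(32), LRUCache(32)   # split 32 + 32

for kind, addr in trace:
    unified.access((kind, addr))
    (icache if kind == "I" else dcache).access((kind, addr))

# The unified cache holds the whole 56-line working set, while the
# 32-line D-cache thrashes on the 48 cyclically-reused data lines.
```

With a code-heavy trace the comparison flips the other way; the point is that the unified design lets the workload, not the chip designer, pick the split.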
Other
• One clock LOOP instructions! This follows the PowerPC design choice of
making a high throughput count down branch instruction. Like the
PowerPC, they could in fact have implemented this with 100% accurate
branch target prediction, however they did not document whether or not
they have done this. Unfortunately, programmers have been using these
instructions less and less, since starting with the Pentium, Intel has been
making this instruction slower.
• Two barrel shifters (one for each pipe), allowing greater parallelism with
shift instructions. This is an advantage over both the P-II and K6 which
each have only one unit that can handle shifts.
• There are no partial register stalls or smaller operand restrictions that I
could find documented. Cyrix is clearly committed to retaining high
performance of older 16 bit code. This is important for Windows 95,
however less so for Windows NT.
• The Cyrix has a very interesting extension to their general architecture
that allows them to use part of the L1 cache as a scratch pad. This
presents a very interesting alternative for programmers who have
complained about the x86's lack of registers. It is not clear that
programmers would be willing to special case the Cyrix to use this feature,
but you can bet that the drivers Cyrix writes for their GX platforms use
this feature.
Although I have not read about the Cyrix in great detail, it would seem to
me that this was motivated by the desire to perform well on multimedia
algorithms. The reason is that multimedia tends to use memory in
streams, instead of reusing data which conventional caching strategies
are designed for. So if the Cyrix's cache line locking mechanism allows
redirecting certain memory loads, then they can keep the rest of their L1
cache intact for use by tables or other temporary buffers.
This would be a good strategy for their next generation MXi processor (an
integrated graphics and x86 processor.)
Optimization
Cyrix's documentation is not that deep, but I get the feeling that neither are
their CPUs. Nevertheless, they do not describe their out of order mechanism in
sufficient detail to even evaluate it. Not having tried to optimize for a Cyrix
CPU myself, I don't have enough data points to really judge how lacking the
documentation is. But it does appear that Cyrix is somewhat behind both Intel
and AMD here.
Update: I've been recently pointed at Cyrix's Appnotes page, in particular note
106 which describes optimization techniques for the 6x86 and 6x86MX. It does
provide a lot of good suggestions which are in line with what I know about the
Cyrix CPUs, but they do not explain everything about how the 6x86MX really
works. In particular, I still don't know how their "out of order" mechanism works.
It is very much like Intel's documentation, which just tells software
developers what to do without giving complete explanations as to how the CPU
works. The difference is that it's much shorter and more to the point.
One thing that surprised me is that the 6x86MX appears to have several
extended MMX instructions! So in fact, Cyrix actually beat AMD (who extended
the x86 instruction set nontrivially with the K6-2) to this; they just didn't
do a song and dance about it at the time. I haven't studied them yet, but I
suspect that when Cyrix releases their 3DNow! implementation they will be able
to advertise that they supply more total extensions to the x86 instruction
set, with all of them being MMX based.
Brass Tacks
The 6x86MX design clearly has the highest instructions processed per clock on
most ordinary tasks (read: WinStone.) I have been told various explanations for it
(4-way 64K L1 cache, massive TLB cache, very aggressive memory strategies,
etc), but without a real part to play with, I have not been able to verify this on my
own.
Well, whatever it is, Cyrix learned an important lesson the hard way: clock
rate is more important than architectural performance. Besides keeping Cyrix in
the "PR" labelling game, their clock scalability could not keep up with either
Intel or AMD. Cyrix did not simply give up, however. Faced with a quickly
dying architecture, a shared market with IBM, as well as an unsuccessful first
foray into integrated CPUs, Cyrix did the only thing they could do -- drop IBM,
get foundry capacity from National Semiconductor and sell the 6x86MX at rock
bottom prices into the sub-$1000 PC market. Indeed, here they remained out of
reach of both Intel and AMD, though they were not exactly making much money
with this strategy.
Update: National has buckled under the pressure of keeping the Cyrix division
alive (unable to produce CPUs with high enough clock rates) and has sold it off
to VIA. How this affects Cyrix's ability to reenter the market and release
next generation products remains to be seen.
Common Features
The P-II and K6 processors require in-order retirement (for the Cyrix,
retirement has no meaning; it uses its 4 levels of speculation to retain
order.) This can be reasoned out simply from x86 architectural constraints.
Specifically, in-order retirement is required to properly resolve resource
contention.
Within the scheduler, the order of the instructions is maintained. When a
micro-op is ready to retire it is marked as such. The retire unit then waits
for the micro-op blocks that correspond to an x86 instruction to become
entirely ready for retirement and removes them from the scheduler
simultaneously. (In fact, the K6 retains blocks corresponding to all the
RISC86 ops scheduled per clock, so that one or two x86 instructions might
retire per clock. The Intel documentation is not as clear about its retirement
strategies.) As instructions are retired, the non-speculative CS:EIP is
updated.
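The retirement mechanism described above can be sketched roughly as follows (illustrative Python; this is my reading of the general scheme, since neither vendor documents the exact structures):

```python
# Sketch of in-order retirement: micro-ops may complete out of order, but
# the retire unit only removes them from the head of the queue, and only
# when every micro-op of an x86 instruction is complete. The structures
# here are illustrative, not either vendor's actual implementation.

from collections import deque

# Each entry: [x86 instruction name, micro-op id, done flag]
rob = deque([
    ["add", 0, False],
    ["load", 0, False],
    ["store", 0, False],   # store split into two micro-ops,
    ["store", 1, False],   # e.g. address-generate + data
])

def complete(name, uop):
    """Mark one micro-op as having finished executing (any order)."""
    for entry in rob:
        if entry[0] == name and entry[1] == uop:
            entry[2] = True

def retire():
    """Retire whole x86 instructions, in program order, from the head."""
    retired = []
    while rob:
        name = rob[0][0]
        group = [e for e in rob if e[0] == name]
        if not all(e[2] for e in group):
            break                      # head instruction not fully done
        for _ in group:
            rob.popleft()
        retired.append(name)
    return retired

complete("load", 0)        # the load finishes first, out of order...
assert retire() == []      # ...but cannot retire past the older add
complete("add", 0)
assert retire() == ["add", "load"]   # both retire, in program order
```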
The speculation aspect is the fact that the branch target of a branch prediction is
simply fed to the prefetch immediately before the branch is resolved. A "branch
verify" instruction is then queued up in place of the branch instruction and if the
verify instruction checks out then it is simply retired (with no outputs except
possibly to MSRs) like any ordinary instruction, otherwise a branch misprediction
exception occurs.
According to Agner Fog, the P-II retains fixed architectural registers which
are not renamable and are only updated upon retirement. This provides a
convenient "undo" state. It also jells with the documentation, which indicates
that the P-II can only read at most two architectural registers per clock. The
K6 does not appear to be similarly stymied, though it too has fixed
architectural registers.
Contrary to what has been written about these processors, however, hand tuning
of code is not unnecessary. In particular, the Intel processors still handle
carry flag based computation very well, even though compilers do not; the K6
has load latencies; all of these processors still have alignment issues; and
the K6 and 6x86MX prefer the LOOP instruction, which compilers do not
generate. XCHG is also still the fastest way to swap two integer registers on
all these processors, but compilers continue to avoid that instruction. Many of
the exceptional cases (partial register stalls, vector decoding, etc.) are also
unknown to most modern compilers.
In the past, penalties for cache misses, instruction misalignment and other
hidden side-effects were largely ignored. This is because on older
architectures they hurt you no matter what, with no opportunity for instruction
overlap, so the rule of avoiding them as much as possible was more important
than knowing the precise penalty. With these architectures it's important to
know how much code can be overlapped with these cache misses. Issues such as
PCI bus, chipset and memory performance will have to be more closely watched by
programmers.
The K6's documentation was the clearest about its cache design, and indeed it
does appear to have a lot of good features. Its predecode bits are used in a
very logical manner (and appear to buy the same thing that the Cyrix's
instruction buffer buys them), and AMD has stuck with simple-to-implement
2-way set associativity. Per-cache-line status is kept, allowing independent
access to separate lines.
Final words
With out of order execution, all these processors appear to promise the
programmer freedom from the complicated scheduling and optimization rules of
previous generation CPUs. Just write your code in whatever manner pleases you
and the CPU will take care of making sure it all executes optimally. And
depending on who you believe, you can easily be led to think this.
While these architectures are impressive, I don't believe that programmers can
take such a relaxed attitude. There are still simple coding rules you have to
watch out for (partial register stalls and 32 bit coding, for example) and
there are other hardware limitations (at most 4 levels of speculation, a 4
deep FPU FIFO, etc.) that still require care on the part of the programmer in
search of the highest levels of performance. I also hope that the argument that
what these processors are doing is too complicated for programmers to model
dies down as these processors become better understood.
Some programmers may mistakenly believe that the K6 and 6x86MX processors
will fade away due to market dominance by Intel. I really don't think this is
the case, as my sources tell me that AMD and Cyrix are selling every CPU they
make, as fast as they can make them. The demand is definitely there. 3Q97 PC
purchases indicated unusually strong sales for PCs at $1000 or less
(dominated by Compaq machines powered by the Cyrix CPU), making up about
40% of the market.
The astute reader may notice that there are numerous features that I did not
discuss at all. While it's possible that some of this is oversight, I have
also intentionally left out discussion of features that are common to all
these processors (data forwarding, register renaming, call-return prediction
stacks, and out of order execution, for example.) If you are pretty sure I am
missing something that should be told, don't hesitate to send me feedback.
Update: Centaur, a subsidiary of IDT, has introduced a CPU called the WinChip
C6. A brief reading of the documentation on their web site indicates that it's
basically a single pipe 486 with a 64K split cache, dual MMX units, some 3D
instruction extensions and generally more RISCified instructions. From a
performance point of view, their angle seems to be that the simplicity of the
CPU will allow a quick ramp up in clock rate. Their chip has been introduced at
225 and 240 MHz initially (available in Nov 97) with an intended ramp up to 266
and 300 MHz by the second half of 1998. They are also targeting low power
consumption and small die size, with an obvious eye towards the laptop market.
Update: They have since announced the WinChip 2, which is superscalar and
which they expect to have far superior performance. (They claim that they will
be able to clock it between 400 and 600 MHz.) We shall see; and we shall see if
they explain their architecture to a greater depth.
Glossary of terms
• Branch prediction - a mechanism by which the processor guesses the
result of a conditional decision and thus assumes whether or not a
conditional branch is taken.
• Data forwarding - the process of copying the contents of a unit output
value to an input value for another unit in the same clock.
• (Instruction) coloring - a technique for marking speculatively executed
instructions to put them into equivalence classes of speculative resolution.
The idea is that once a speculative condition has been resolved, the
corresponding instructions of that color are all dealt with in the same way,
being either retired or undone.
• (Instruction) issue - the first stage of a CPU pipeline where the
instruction is first recognized and decoded.
• Latency - the total number of clocks required to completely execute an
instruction. In maximal resource contention situations, this is usually the
maximum number of clocks an instruction can take. (Often manufacturers
will abuse the precise definition in their documentation by ignoring clocks
that are assumed to (almost) always overlap. For example, most
instructions on fully pipelined processors really take at least 5 clocks
from issue to retirement; however, under normal circumstances most of
those clocks are consistently overlapped by stages of other instructions,
and hence the instructions are documented to take that many fewer clocks.)
The goal of the Post-RISC architecture is to hide latencies to the maximal
degree possible via parallelism.
• Out of order execution - a feature of the Post-RISC architecture whereby
instructions may actually complete their calculation steps in an order
different from that in which they were issued in the original program.
• Post-RISC architecture - a term coined by Charles Severance referring
to the modern trend of CPUs to use techniques not found on traditional
RISC processors such as speculative execution and register renaming in
conjunction with instruction retirement.
• Register contention - a condition where an instruction is trying to use a
register whose last write back or read has not yet completed.
• Register renaming - retargeting the output of an instruction to an
arbitrary internal register that is virtually renamed to be the value of the
architectural register. In x86 processors this ordinarily occurs whenever an
fld or fxch instruction, or a mov with a destination register, is encountered.
• Resource contention - A condition where a register, ALU or pipeline stage
is required by an instruction but is currently in use, or scheduled to be
used, by a previously unretired instruction.
• Retirement - The process by which the CPU knows that an instruction
has really completed.
• SIMD - Single Instruction Multiple Data. An instruction set which replicates
the same operation over multiple operands which are themselves packed
into a single wide register.
The AMD Athlon (K7)
The following information comes from various public presentations on the Athlon
that have been given. One in particular is the "dinner with Dirk Meyer" audio
session provided by Steve Porter/John Cholewa. I also did my own analysis on
a real Athlon. I must also thank Lance Smith -- my inside man at AMD -- for
invaluable assistance.
Comments welcome.
Shockingly, at the time of release, the 650MHz Athlon became the second
highest clocked modern CPU available on the market -- beaten only by the Alpha
21264 at 667MHz.
But enough of all the hype. Just how good is this architecture? The Athlon, as
far as I can tell, is a cross between a K6 and an Alpha 21264. It has the
cleanliness of the K6 architecture combined with a no holds barred brute force
set of functional units like the 21264.
AMD touts the Athlon as the first processor that can be considered 7th
generation. Most of the features of the K7 are really just super beefed up
versions of features that exist in the K6 (and P6). But what differentiates it
is its radically out of order floating point unit. Through a combination of 88
(!!) rename registers with stack and dependency renaming on a fully
superscalar FPU, AMD has created what is probably, with the possible exception
of the 21264, the most advanced architecture I've ever seen. It also definitely
delivers a significant performance level above both the K6 and P6
architectures, despite the claims of some skeptical high profile
microprocessor reviewers.
Throughout I will be comparing the K7 to the 21264, and the P6 cores. The
following are reference diagrams for each of the architectures found in
documentation supplied by the vendors. The mark ups contain what I consider to
be the most important considerations from a programming point of view, which
are explained in greater detail below. Red markings indicate a slow or previous
generation feature. Green markings indicate a fast or "state of the art feature".
The K7
This is the latest x86 compatible architecture from AMD. It is instruction set
compatible with Intel's Pentium II CPUs. It uses instruction translation to convert
the cumbersome x86 instruction set to high performance RISC-like instructions,
and drives those RISC instructions with a state of the art microarchitecture.
Update: This is not meant to contradict Dirk Meyer, who claimed that "With the
K7, the central quantum of information that floats around the machine is not
decomposed RISC operations, it is a macro operation." It's really just a matter
of perspective. The ALUs in the K7 don't understand "macro operations"; they
understand individual operations akin to the RISC86 ops in the K6. The macro
operation bundles that are decoded are just a convenient structure inside the
K7 which gives much more complete coverage of the x86 instruction set (and
which has the net effect of delivering more operations to the function units
per clock.) Each bundle is itself dispatched as separate operations to the
ALUs as individual execution morsels (I'd still call this decomposition to
RISC ops myself.)
I'm sure the reason Dirk says that this is not just an x86 to RISC translation
is that the internal mechanism by which the K7 does its translation bears no
resemblance to the way either the K6 or P6 perform theirs. Thus for
marketing reasons it is important for AMD to differentiate the way the K7 works
from these previous generation chips. I'm just speculating on this last part,
of course -- for all I know "translation from x86 to RISC" may be a technical
term with a hard and fast definition that puts me clearly in the wrong. :)
The 21264
This is the latest incarnation of the DEC Alpha. Its a no holds barred advanced
architecture, that is out-of-order and highly superscalar. It is fairly well recognized
as the fastest microprocessor on earth by the industry standard SPEC
benchmark.
The P6
General Architecture
The Athlon is a long pipelined architecture, and like the P6, does a lot of work to
unravel some of the oddball conventions of the x86 instruction architecture in
order to feed a powerful RISC-like engine.
The Athlon starts out with 3 beefy symmetrical direct path x86 decoders that are
fed by highly pipelined instruction prefetch and align stages. The direct path
decoders can take short x86 instructions as well as memory-register instructions.
The instructions are translated to Macro-Ops, which themselves contain two
packaged ops (one being one of load, load/store, or store, and the other being
an ALU op.) Thus the front end of the K7 can realistically sustain up to 6 ops
decoded per clock. (The decoders can also sustain up to one vector path decode
per clock for the rarely used weird x86 instructions.)
The K7 has a 72 entry instruction control unit (so that's up to 144 ops, which is
significantly more than the P6's 40 entry reorder buffer) in addition to an 18 entry
integer reservation station as well as a 36 entry FPU reservation station. Holy
cow. The K7 will do an awful lot of scheduling for you, that's for sure.
Now, the K7 has two load and one store ports into the D-cache (the P6 core can
sustain a throughput of one load and/or store per clock.) However, algorithms are
rarely store limited. Furthermore, stores can be retired before they have
totally completed. So I hesitate to stick with the 6 ops sustained rate.
Instead it's more realistic to consider it as 5 ops sustained with free
stores. (Note that for
comparison purposes, this is being very generous to the P6 core's estimated 3
ops per clock sustained rate of execution since it actually executes stores as two
micro-ops. This would be equivalent to only two AMD RISC86 ops per clock
throughput on code which is more store limited.)
After this point, the instructions are simply fed into fully pipelined instruction units
(except, presumably, instructions that are microsequenced.) So indeed 5 ops is
the K7's sustained instruction throughput. This is superior to the P6
architecture in that (1) it can supply an additional ALU op per clock (hence
50% more calculation bandwidth) (2) it can actually execute up to two
additional ops per clock (that's 67% more total general execution
bandwidth), and (3) it can service the ever important dual load case (this is
twice the load bandwidth of the P6 architecture.) So like its predecessor the K6,
the instruction decoders and back ends look fairly well balanced, except that with
the K7 we have a significantly wider engine.
The 21264 is a 4-decode machine with separate load and ALU instructions. The
21264 pipeline is structured with a maximum of 2 memory, 2 integer, or 2 FP
instructions, from which any combination of 4 can be sustained per clock. So
while the K7 has a higher total ops issued per clock, the 21264 has the
advantage in the one case of sustaining 2 integer and 2 floating point
instructions per clock. In reality this would not come up very often; but
conversely, neither would many of the memory-operand instruction combinations
on the K7. The K7 has the advantage of being able to execute 3 integer or 3
floating point ops, but that is balanced by the fact that the K7 has fewer
registers, and in reality only 2 "real work" floating point ops can be
executed.
Branch Prediction
For branch prediction, AMD went with the GShare algorithm with a large number
of entries -- a 2048 entry branch target buffer in addition to a 4096 entry
branch history buffer. This differs from the K6's sophisticated per-branch
history combined with recent branch history algorithm and a branch target
cache. AMD claims that the K7's algorithm achieves 95% prediction accuracy
(similar to the K6.) Given the long pipelined architecture of the K7, using a
very accurate predictor seems more necessary than it was on the K6. Like the
P6 core, the K7 also loses a decode clock on any taken branch (because it does
not use a branch target cache like the K6 does.) However, the high decode
bandwidth of the Athlon will typically make this a non issue.
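For reference, the GShare scheme works like this (illustrative Python; the 4096-entry table matches AMD's branch history buffer figure, but the history length and hash details are my assumptions):

```python
# Sketch of the GShare predictor used by the K7: a global branch-history
# register is XORed with branch-address bits to index a table of 2-bit
# saturating counters. Table size matches AMD's 4096-entry figure; the
# history length and exact hash are assumptions.

class GShare:
    def __init__(self, entries=4096):
        self.entries = entries
        self.mask = entries - 1
        self.history = 0               # global taken/not-taken shift register
        self.counters = [1] * entries  # 2-bit counters, "weakly not taken"

    def index(self, pc):
        # XOR the branch address with the global history to pick a counter
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

# An alternating branch defeats a lone 2-bit counter, but GShare learns
# it, because each distinct history pattern gets its own counter.
g = GShare()
hits_late = 0
for n, taken in enumerate([True, False] * 50):
    correct = g.predict(0x1234) == taken
    g.update(0x1234, taken)
    if n >= 20:          # ignore the warmup while the history fills
        hits_late += correct
# After warmup every prediction is correct: hits_late == 80.
```

The point of XORing in the global history is that the same branch gets different counters in different history contexts, which is how it learns patterns that defeat a plain per-branch 2-bit counter.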
Hey, that's not too bad! Remember that the K6 didn't really beat 0.5 clocks
due to the relatively larger impact of the branch instruction itself on
instruction decode bandwidth. So the K7 appears to have the same expected
average branch penalty as the K6! That's quite good for a deeply pipelined
architecture. It's better than the P6, which has a worse predictor (90%
accuracy) and a larger miss penalty (13+ clocks).
Update: Andreas Kaiser has written up a very detailed analysis of how the K7
branch predictor works.
Floating Point
There has been a lot of talk about the K7's floating point capability,
especially given the poor reputation of Intel's x86 competitors on floating
point. The interest in the K7's floating point probably overshadowed any other
feature.
I think AMD knew they had to deliver on floating point or forever suffer the
backlash of the raving lunatics that would be denied their Quake frame rate being
pegged at the monitor's refresh rate. And there is no question that AMD has
delivered. On top of being fully pipelined (the P6 is only partially pipelined
when performing multiplies), AMD had the gall to make a superpipelined FPU. I
would have thought that this was impossible given the horribly constipated x87
instruction set, but I was shocked to find that it's really possible to execute
well above one floating point operation per clock (on things like multiply
accumulates.)
Since the K7 can combine ALU and load instructions with high performance,
pervasive use of memory operands in floating point instructions (which reduces
the necessity of using FXCH) seems like a better idea than the Intel
recommended strategies.
A floating point test I did that uses this strategy confirms that the K7 is indeed
significantly faster than the P6's floating point performance. My test ran about
50% faster. I suspect that as I become more familiar with the Athlon FPU I will be
able to widen that gap (i.e., no I can't show what I have done so far.)
Nevertheless, the top two stages of the FPU pipeline are stack renaming then
internal register renaming steps. The register renaming stage would be
unnecessary if FXCH (which helps treat the stack more like a register file) did
not execute with very high bandwidth, so I can only assume that FXCH must be
really fast. Update: The latest Athlon optimization guide says that FXCH
generates a NOP instruction with no dependencies. Thus it has an effective
latency of 0 cycles (though it apparently has an internal latency of 2 clocks
-- I can't even think of a way to measure this.)
Holy cow. Nobody in the mainstream computer industry can complain about the
K7's floating point performance.
The 21264 also has two main FP units (Mul and Add) on top of a direct register
file. So while the 21264 will have better bandwidth than the K7 on typical code
which has been optimized in the Intel manner (with wasteful FXCHs) on code
fashioned as described above, I don't see that the Alpha has much of an
advantage at all over the K7. Both have identical peak FP throughput of 2 ops
per clock, that in theory should be able to be sustainable by either processor.
As far as SIMD FP goes, AMD is sticking to their guns with 3DNow! (Although
they did add the integer MMX register based SSE instructions -- it appears as
though this was just to ensure that the Pentium-!!! did not have any functional
coverage over the Athlon.) They did add 5 "DSP functions" which are basically 3
complex number arithmetic acceleration instructions as well as two FP <-> 16 bit
integer conversion instructions. The two way SIMD architecture seems to be a
perfect fit for complex numbers.
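To see why, note that a complex number is a (real, imaginary) pair -- exactly one 2-wide 3DNow! register. A complex multiply then decomposes into packed multiplies and adds (illustrative Python showing only the arithmetic these instructions accelerate, not the actual K7 instruction semantics):

```python
# Why two-way SIMD fits complex numbers: a complex value is a (real, imag)
# pair, i.e. exactly one 2-wide 3DNow! register. This models a complex
# multiply on such pairs; the K7's DSP instruction names and semantics are
# not reproduced here, only the arithmetic they accelerate.

def complex_mul(a, b):
    """(ar, ai) * (br, bi) on packed 2-wide 'registers'."""
    ar, ai = a
    br, bi = b
    # A SIMD unit can form these four products with two packed multiplies,
    # then combine them with a packed add/subtract.
    return (ar * br - ai * bi, ar * bi + ai * br)

# (1 + 2i) * (3 + 4i) = -5 + 10i
assert complex_mul((1.0, 2.0), (3.0, 4.0)) == (-5.0, 10.0)
```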
Other than these new instructions, there does not seem to be any architectural
advantage to the K7 implementation of 3DNow! over the K6's 3DNow!
implementation. I don't think this should be taken as any kind of negative against
AMD's K7 designers, however. 3DNow! is one of those architectures that
appears to be naturally implemented in only one way: the fastest way. So it's
not surprising that the K6 is as fast as the K7 in SIMD FP right out of the chute. (In
the real world the K7 should be faster on 3DNow! loops due to better execution
of necessary integer overhead instructions.)
On the surface it appears as though the SIMD capabilities of the Pentium !!!'s
full SSE implementation alleviate register pressure better than the K7's. However the
K7 has the opportunity to pull even with SSE in this area as well by virtue of its
use, once again, of memory operands. (The theoretical peak result throughput of
SSE and 3DNow! are identical -- each has slight advantages over the other
which on balance are a wash.)
Comparatively speaking, the Alpha has only added special acceleration functions
for video playback. I am not familiar with the Alpha's extensions however I am
under the impression that they did not add a full SIMD FP or SIMD integer
instruction set.
Cache
The K7's cache is now 128 KB (2-way, Harvard architecture, just like the Alpha
21264.) OK, this is just ridiculous -- the K7 has 4 times the amount of L1
cache as Intel's current offerings. If somebody can give me a good explanation
as to why Intel keeps letting itself be a victim to what appears to be a
simple design choice for AMD, I'd like to hear it.
The load pipe has increased from 2 cycle latency on the K6 to 3 cycle latency on
the K7. This matches up with the P6 which also has a 3 cycle access time to their
L1 cache. (But recall that the K7 can perform two loads per clock which is up to
twice as fast as the K6 or P6.)
The K7 has a 44 entry load/store queue. (Holy cow.) Well, that ought to support
plenty of outstanding memory operations.
Although starting from a 512K on-PCB L2 cache, AMD claims the ability to move
to caches as large as 8MB. It should be obvious that AMD intends to take the K7
head to head against Intel's Xeon line. Off the PCB card, the K7 bus (which was
actually designed by the Compaq Alpha team for the 21264) can support 20
outstanding transactions.
Other
• The memory bus (the EV6 bus, which is actually the same bus used by
the 21264) runs at 2x100MHz. Though everything I am told right now
indicates that memory throughput is still limited by the 100MHz PC100
RAM technology of today, it does allow for scaling into higher
performance RAM in the future. (PC133 is supposed to be around the
corner.) In any event it should allow the processor to dispatch stores to the
chipset in a fire and forget manner much faster than the current 1x100MHz
of the P6 bus. So the CPU should not be tied up issuing stores for as long.
(Not a big issue, realistically.)
• The FDIV latency is remarkably low in comparison to the P6. I suspect
that AMD is using the 3DNow! divide approximation tables to drive a faster
Newton-Raphson algorithm.
• According to independently confirmed tests, the LOOP instruction is slow!
Oh well. I can't imagine what it is about deeply pipelined architectures
that would make this instruction slow. I can only guess that AMD
got tired of dealing with the legacy timing loops people wrote with this
instruction expecting it to be the same absolute speed as it was on a 486.
Fortunately for AMD, this is not a problem, since for typical loops there is
easily enough leftover instruction decode bandwidth to perform a
DEC/JNZ instruction pair with the same performance.
• The K7 appears to support all of the P6 conditional move and conditional
floating point instructions, as well as the write combining "MTRR registers"
and the performance event counters.
Optimization
I would recommend this guide to anyone interested in optimizing for the next
generation of processors.
Brass Tacks
Holy cow! Did I mention that this thing was released at 650 MHz? That's a clear,
uncontested 50 MHz lead over Intel. Although it has been suggested that this was
simply a premature announcement meant to steal the limelight away from Intel
(which has only recently started shipping the Pentium !!! at 600 MHz), they also
said that 700 MHz was on its way (Q4 '99). I find it easier to believe that they are
telling the truth (something some stockholder lawsuits should be motivating from
them) than lying to this extent.
I think AMD's challenge from here is to try and figure out exactly what markets it
can grow the Athlon into. It's too expensive for sub-$1K PCs and it's not quite
ready for SMP. It's also currently only available in 512K L2 cache configurations,
so they can't go right after the Xeon market space just yet. While the Athlon is a
great processor, it's clear that AMD needs to complete the picture with their
intended derivatives (the Athlon Select for the low end, the Athlon Ultra for
servers, and the Athlon Professional for everyone else, as AMD have themselves
disclosed) to take the fight to Intel in every segment.
Versus the P6
The K7 is larger, faster, and better in just about every way. The Athlon simply
beats the P6, even on code tweaked to the hilt for the P6 architecture. From the
architecture, the Athlon should be able to execute any combination of optimized
x86 code at least as efficiently as the P6. Code optimized specifically for the K7
should widen the performance gap between these two processors substantially.
Versus the 21264
From a pure CPU technology point of view this one is too close to call. Both have
extremely comparable features with slightly different tradeoffs that should not, by
themselves, tip the balance either way. However, at the end of the day the 21264
cannot be denied the official crown. The Alpha processors have the advantage
that Compaq has developed the compilers themselves, and they are 64-bit on the
integer side. They also have a much cleaner floating point instruction set
architecture and use a higher end, more expensive infrastructure. AMD is stuck
with the 32-bit instruction set defined by Intel, as well as the software which has
followed the optimization rules dictated by Intel's chips.
The only counter-balance that the K7 has is the MMX and 3DNow! instruction
sets (in addition to the new instructions that have been added) which give the K7
the advantage for multimedia.
Nevertheless it's amazing how close the x86 compatible K7 comes. For a
developer writing something from scratch, 21264-like performance should be the
goal to shoot for.
Update: In recent months both Intel and AMD have overtaken the Alpha in clock
speed by a substantial amount, and consequently in terms of real integer
performance as well. While their roadmap still shows higher clocked versions of
the 21264 in the future, it looks like Compaq is concentrating their efforts on
simultaneous multithreading (something they presented at Microprocessor Forum
in 1999.)
The Willamette
On 02/15/00, at the Intel Developer Forum a very brief preview of the Willamette
architecture was given. Since that time other details have surfaced, and more
analysis has been done. However Intel has not yet fully unveiled all the details of
the architecture. As such, the analysis below is preliminary.
The architecture is a 20-stage deep pipeline, with the claimed purpose being for
clock rate scaling reasons. However this pipeline is very different from x86
processors designed up until this point. The top few stages feed from the on-chip
L2 cache straight into a one-way x86 decoder which feeds EIP and micro-ops
into something called a trace cache. This trace cache replaces the processor's
L1 I-cache. The trace cache then feeds micro-ops at a maximum rate of 3 per
clock (actually 6 micro-ops every other clock) in instruction order (driven by a
trace-cache-local branch predictor as necessary) into separate integer and
FP/multimedia schedulers (much like the Athlon, except that the rate is higher for
the Athlon.) This mechanism effectively serves the same purpose as the
combination of the Athlon's Instruction Control scheduler and I-cache (including
predecode bits.) Because the x86 decoder is applied only upon entry into the
trace cache, its performance impact is analogous to an increase in I-cache line fill
latency on other architectures. From an implementation point of view, Intel saves
itself the need to make a superscalar decoder (something it has implemented in a
clumsy way in the P6 and P5 architectures.)
Update: Just to make it clear -- one other thing this buys them is that the trace
cache eliminates direct jumps, calls, and returns from the instruction stream. On
the other hand, such instructions should not exist as bottlenecks in any
reasonably designed performance software. These instructions are necessarily
parallelizable with other code.
The integer side is two integer ALUs plus one load and one store unit. But an
important twist is that these computation stages are clocked at double the clock
rate of the base clock for the CPU. That is to say, the ALUs complete their
computation stages at 0.5 clock granularities (with 0.5 latencies in the examples
discussed). Results that complete in the earlier half of the clock can (in at least
the described cases) be forwarded directly to a computation issued into the
second half of the clock. (Intel calls this double pumping.) From this point of view
alone, the architecture has the potential to perform double the integer
computational work as the P6 architecture. However, since the trace cache can
sustain a maximum of 3 micro-ops delivered per clock (which is the same as the
maximum issue rate of the P6 architecture), there is no way for the integer units
to sustain 4-micro-ops of computation per clock. Nevertheless, this is a
shockingly innovative idea that does not exist in any other processor architecture
that I have ever heard of.
I previously thought that the 0.5 clock granularities applied to loads (thus allowing
two loads per clock). However, it has been clarified that in fact the load unit can
accept only one new load per clock. This is consistent with other people's
theories that the ALU clock doubling is synthesized with two fused adders which
are not applicable to the load unit.
Update: Leaked benchmarks indicate that there is some funny business going
on in their L1 cache. While they claimed an L1 latency of 2 clocks,
measurements indicate that it starts at 3 clocks (it's possible they were ignoring
the address calculations, which in some cases can be computed in parallel with
data access -- however, the Athlon architecture has the same feature.) The
latency benchmark scores that were leaked indicate that as the data size
increases to 4K and beyond, the latency gradually increases rather than falling
off in cliffs (as the data footprint size exceeds the size of one level of cache) like
most other CPUs.
I don't completely buy this. One of the statements Paul makes is: "However,
given the fact that modern x86 processors can execute up to three instructions
per cycle, the odds of finding up to 6 independent instructions to hide (or cover)
the load-use latency is rather small." This is not exactly the right way to view the
relationship between loads and ALU computation instructions. In modern x86's
the decoder's rate of bytes => scheduled riscops exceeds the rate of ALU
execution => retirement. The reason for this is that the amount of inherent
parallelism in typical programs is less than what these CPUs are capable of
exploiting. But memory loads are different. Memory loads depend only on
address computation, which usually is not on the critical path of
calculations in a typical algorithm (except when using slow data structures like
linked lists.) So once a memory instruction is decoded and scheduled, it can
almost always proceed immediately -- essentially always starting at the earliest
possible moment. As long as the data can be returned before the scheduler runs
out of other older work to do (which I claim it will have a lot of) then this latency
will not be noticed. Said in another way, a deep scheduler can cover for load
latency.
What does this mean? Well, I believe it means that shortening up the L1 D-cache
latency while sacrificing the size so dramatically cannot, in and of itself, possibly
be worth it. I am more inclined to believe that the latency to the L2 cache (which
may be strictly a D-cache) has shown itself to be short enough to benefit
from the effect I referred to above. If the *L2* latency can be totally hidden as
well, then the real size concern is not with the L1 D-cache but rather the L2
cache.
Update: Also presented was the fact that the CPU uses a 4x100 MHz Rambus
memory interface. While I would ordinarily ignore such bandwidth claims (for
memory latency is usually more important, and when you need bandwidth you
can use "prefetch" to hide the memory hits), leaks from some Intel insider on
USENET suggest that Willamette will use some sort of linear address pattern
matching prefetch mechanism. This technique has apparently been used by
other RISC vendors, however with mixed results. Benchmark leaks seem to
confirm that Willamette will have bandwidth that is about double that of current
SDRAM based Athlons (which are the current x86 leader on the Stream
benchmark.)
Update: Intel has been heavily hyping the new SIMD instructions added to the
Willamette (SSE-2). They have added a 2x64 packed floating point instruction set
as well as 128 bit integer MMX instructions. However, if their multimedia
computation can only be performed from one issue port (assuming that the
FMOVE and FSTORE pipe is not capable of any calculations) then they have
compromised their older 64 bit MMX performance (the P6 has dual MMX units)
and will only maintain parity with their older SSE unit if they've reduced the
number of micro-ops per instruction (which would necessitate a fully 128 bit wide
ALU, instead of the two parallel 64 bit units in the Pentium-!!!.) The new 2x64 FP
theoretically brings their "double FP" performance to parity with the Athlon's x87
FPU (again, this is contingent on single micro-ops per packed SSE-2 instruction).
I say theoretically, because the algorithm needs to be fully vectorized into SIMD
code just to keep up with what the Athlon can do in straight unvectorized (but
reasonably scheduled) code. The 128 bit MMX can at best match the
performance of the dual 64 bit MMX units which are present in the Athlon, K6,
P55c and P-!!! CPUs. One thing they have added which is nice is a SIMD 32-bit
multiplier on the integer side.
From an instruction point of view, Intel appears to be declaring victory (there are
now more instructions as well as more coverage than even the AltiVec instruction
set; with the exception of multiply accumulate, and lg/exp approximations), but I
don't see the performance benefit of SSE-2. In fact I think there is a real
possibility of a slight performance degradation here.
Although Intel correctly points out that x86 to micro-op decode penalties no
longer affect branch mispredicts, the bulk of the pipeline stages in the
architecture appear between the trace cache output and execution stages. Thus,
the latency of a branch mispredict (which basically needs to abort results from
trace cache output to execution) has worsened and in fact is worse than any
other architecture I have ever heard of. As a counter to this, Intel has increased
their branch target buffer to 4096 entries and is reportedly using an improved
prediction algorithm ("...[the] Willamette processor microarchitecture significantly
enhances the branch prediction algorithms originally implemented in the P6
family microarchitecture by effectively combining all currently available prediction
schemes"). Intel has not commented on the prediction probabilities of the
Willamette architecture. Intel has also added branch hint instructions.
Finally they claim to have a significantly larger scheduler (more than 100
instructions can be in-flight at once.)
On the surface it appears as though the Willamette processor will do very well on
integer code with lots of dependencies; however, it will not fare as well as the
Athlon on floating point. Other factors such as the trace cache and L1 D-cache
sizes and the quality of the branch predictor remain unknown.
The Crusoe
The Crusoe is the ultimate in x86 "emulation". The core chip is not an x86
compatible CPU at all, but rather a VLIW engine. The engine runs an emulation
program (the Code Morpher) which reads x86 instructions and compiles them to
VLIW code snippets, then executes the compiled snippets. The compiler uses
continuous, on-the-fly profiling feedback to decide which code snippets need to be
analyzed the most. This design probably gets the most bang for the buck in
terms of performance per clock invested in the translation problem. Unlike other
technologies like FX!32 or Bochs, Crusoe has been clearly designed for 100%
x86 compatibility from boot to shutdown (hence esoteric protected mode
instructions are emulated in a compatible way -- device drivers will be written in
x86 binaries, not native Crusoe binary).
This contorted way of executing x86 buys them a number of things. (1) They
have complete freedom in how the VLIW core is designed. For example, it does
not even have to have robust register access -- if it takes two clocks for an
operation to finish, then rather than stalling subsequent accesses to the output
register, perhaps the value is old or undefined for the immediately subsequent
clock and updated on the second clock. But more importantly, as they target
better and better process technologies, they can change most aspects of their
design without compromising x86 compatibility. (2) It is possible to find
optimizations that the original software authors, or their compilers did not find in
their binary code. You can imagine that at least some x86 code might end up
substantially faster on the Crusoe. (3) The VLIW engine is very small and very
simple -- thus it is easier to analyze from a performance point of view.
The bad news is that their initial clock rates of 300-400 MHz are not very
compelling, and the promise of 500-700 MHz in 6 months is kind of so-so, given
that the desktop competition is now at 800 MHz. There were no pure performance
benchmarks shown, which is indicative that they probably are not achieving
performance-per-cycle parity with Intel or AMD. The good news is that this part is
being positioned in the mobile space. The (apparent) maximum power draw is an
amazing 2W, which will make for very battery friendly notebooks. They also
seem to be targeting the "internet appliance market", but I don't take that too
seriously (the "internet appliance market", that is.)
The white papers suggest that the VLIW engine drives 4 instructions in parallel in
a strict [ALU, ALU, MEM, BRANCH] format. Hmmm ... this looks roughly
comparable to a K6 to me (a little better with branches, a little worse with
memory.)
One thing they definitely have introduced which is interesting is the idea of a
speculative non-aliased memory window mechanism. What happens is that the
morpher can rearrange loads and stores in more optimal orders, and the legality
of this is checked with a special speculation checkpoint instruction. So like
branch prediction, if a late determination is made that the memory reordering was
wrong, then an interrupt is thrown and the "wrongly executed" block of
instructions can be undone. Of course, the goal is not to take advantage of this
speculative undoing (back to the checkpoint instruction), but rather just to use it
as a parachute to ensure robustness, in the hopes that in most cases for a given
fragment of code, memory reordering is a valid thing to do. This is a big deal.
This problem has plagued CPU designers and compiler writers for decades. The
fact that these guys have implemented a solution for this, is indicative that they
are very serious designers with some good ideas. The idea fits very well with
their code morpher because for degenerate cases where load/store reordering
never works, the morpher can detect this and throw the whole idea out for that
fragment of code.
It's not entirely clear how much of a long term advantage this is, though.
Apparently there is at least one other CPU architecture (HP's PA-8500)
that has implemented load/store speculative ordering. So there's no telling how
long Transmeta might be able to hold onto this advantage before the same
technology makes it into conventional x86 architectures.
I really don't think these guys are going to seriously contend with the Athlon
in any kind of head-to-head, so I will avoid making any kind of direct comparison.
Given the translation architecture, I don't think that further discussion of
processor features (like branch prediction, cache or floating point) will make too
much sense. We're going to have to wait until we can play with it before we can
get a real idea of what it can do.
Before I leave this, there is the thought that somehow the Transmeta chip would
be able to execute other instruction sets in a different configuration, or perhaps
more interestingly, simultaneously with x86. The presentation seemed to steer
towards the direction of "we are only emulating x86's". However, public
statements made by Transmeta employees lead to a different possibility: "There
was a TM3120 running Doom on Linux. Doom was compiled mostly to x86,
except for the inner loop, which was compiled to picoJava using Steve
Chamberlain's picoJava back-end. The whole program was linked together
using a magic linker. When the program had to enter the inner loop, it
executed a reserved x86 opcode which jumped to picoJava mode. The
inner loop then executed picoJava bytecode until it was done, and re-
entered x86 mode."
This is very suggestive, at least to me, that they will support Java (or perhaps just
picoJava) on their CPUs that would likely be substantially faster than the current
crop of x86 based Java virtual machines.
Actually this sounds like it would be quite cool -- the x86 based JVM wrapper
code would have an inner loop that looked like:
jmp L2
L1:
cmp eax,[eax] ; force the OS to load the page
L2:
TM_OPCODE(picoJava)
jnc L1
For their technology demo, I wouldn't be surprised if they simply allocated
some fixed physical memory and disallowed interrupts for the duration of the
picoJava code.
Glossary of terms
• ALU - Arithmetic Logic Unit. An execution unit in the processor that
performs some amount of calculation (as opposed to a data movement
unit or a branching unit.)
• Branch prediction - a mechanism by which the processor guesses the
result of a conditional decision, and thus assumes whether or not a
conditional branch is taken.
• Data forwarding - the process of copying the contents of a unit output
value to an input value for another unit in the same clock.
• Decode - the stage where instructions are first decoded from their
instruction bytes. In x86 processors this is an important consideration due
to the non-uniformity and variable length nature of the instruction set.
• Double pumping - A scheme by which a macro instruction uses the same
ALU twice to perform two individual parts of an instruction. Ordinarily this
leads to the ALU being tied up for twice the duration of its default
bandwidth.
• (Instruction) coloring - a technique for marking speculatively executed
instructions to put them into equivalence classes of speculative resolution.
The idea is that once a speculative condition has been resolved, the
corresponding instructions of that color are all dealt with in the same way,
as being either retired or undone.
• (Instruction) issue - the first stage of a CPU pipeline where the
instruction is first recognized and sent to an execution unit.