
The Study and comparison

of

Pentium family Processors

Calin Ciordas Zhang Lei Yingbo Zhu


2001 02

Contributions: Calin Ciordas wrote the introduction and Part II.3, Zhang Lei wrote Part II.2 and Part II.4, and Yingbo Zhu wrote Part II.1 and the summary.

CONTENTS
PART I Introduction
Part II Study Issues
  1. Caches
    1.1 Consistency Protocol (MESI Protocol)
    1.2 The "least recently used" (LRU) Mechanism
    1.3 The Pentium Processor
    1.4 The Pentium Pro, Pentium II, Pentium III
    1.5 The Pentium 4 Processor
  2 pipeline
    2.1 Pentium & Pentium with MMX
    2.2 The Pentium Pro, Pentium II, Pentium III
    2.2 Pentium 4
    2.3 pipeline summary
  3 Parallel and superscalar aspects of Pentium processor family
    3.1 Superscalar aspects
    3.2 SIMD
  4 Branch prediction
    4.1 Pentium
    4.2 The Pentium Pro, Pentium II, Pentium III
    4.3 Pentium 4
PART III SUMMARY
Appendix 1: Comparison of Pentium Family Processors Specifications
Appendix 2: References

PART I Introduction
The main task of this paper is to offer a view of the entire Intel Pentium processor family from an architectural point of view. It is interesting to observe the evolution of design issues, the common things and the improvements that Intel engineers added over the years. The Pentium family is a family of CISC processors with advanced RISC concepts included. Except for the first member (Pentium), which has a modest superscalar design, the others present a full superscalar design. Architectural extensions like MMX, SSE and SSE2 are also an important improvement. The cache policies and the branch prediction strategies of the different processors are compared. In our opinion the Pentium family is a successful design family.

For our study the Pentium, Pentium II and Pentium 4 were studied in more detail. The other processors were studied only partially, with regard to certain interesting aspects, because Pentium Pro, Pentium II and Pentium III are based on similar designs.

Part II Study Issues


1. Caches
All caches use the following model. Main memory is divided up into fixed-size blocks called cache lines. A cache with n possible entries for each address is called an n-way set-associative cache. A "two-way set-associative" organisation is shown in Figure 1.1.
Figure 1.1 A "two-way set-associative" Cache Organization
(Diagram labels: a 32-bit address split into an 18-bit tag, a 9-bit line select, a 3-bit word select and a 2-bit byte select; 512 lines (2 sets); 18-bit tags; 32-bit data words, word #0 to word #7; 'word valid', 'line valid' and 'least recently used' bits; 'hit' logic.)

1.1 Consistency Protocol (MESI Protocol)


The Pentium processor Cache Consistency Protocol is a set of rules by which states are assigned to cached entries (lines). The rules apply to memory read/write cycles only. I/O and special cycles are not run through the data cache. Every line in the Pentium processor data cache is assigned a state dependent on both Pentium processor generated activities and activities generated by other bus masters (snooping). The Pentium processor Data Cache Protocol consists of four states that define whether a line is valid (HIT/MISS), whether it is available in other caches, and whether it has been MODIFIED. The four states are the M (Modified), E (Exclusive), S (Shared) and I (Invalid) states, and the protocol is referred to as the MESI protocol. A definition of the states is given below:

M - Modified: An M-state line is available in ONLY one cache and it is also MODIFIED (different from main memory). An M-state line can be accessed (read/written to) without sending a cycle out on the bus.

E - Exclusive: An E-state line is also available in ONLY one cache in the system, but the line is not MODIFIED (i.e., it is the same as main memory). An E-state line can be accessed (read/written to) without generating a bus cycle. A write to an E-state line will cause the line to become MODIFIED.

S - Shared: This state indicates that the line is potentially shared with other caches (i.e. the same line may exist in more than one cache). A read to an S-state line will not generate bus activity, but a write to a SHARED line will generate a write-through cycle on the bus. The write-through cycle may invalidate this line in other caches. A write to an S-state line will update the cache.

I - Invalid: This state indicates that the line is not available in the cache. A read to this line will be a MISS and may cause the Pentium processor to execute a LINE FILL (fetch the whole line into the cache from main memory). A write to an INVALID line will cause the Pentium processor to execute a write-through cycle on the bus.
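The state definitions above can be read as a small transition table for accesses made by the processor itself. The following Python sketch is only an illustration of those rules, not the actual hardware logic: the names `State` and `local_access` and the `shared_elsewhere` flag are hypothetical, and three choices are assumptions that the text does not spell out (a written Shared line stays Shared, a line fill ends in E or S depending on whether another cache holds the line, and a write to an Invalid line does not allocate the line). Snoop-induced transitions are left out.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def local_access(state, op, shared_elsewhere=False):
    """Return (next_state, bus_action) for a processor read or write.

    op is "read" or "write". shared_elsewhere is a hypothetical input that
    stands for what the bus reports during a line fill.
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")

    if state is State.M:
        # Reads and writes hit without any bus cycle.
        return State.M, None
    if state is State.E:
        # A write makes the line MODIFIED; neither access needs the bus.
        return (State.M if op == "write" else State.E), None
    if state is State.S:
        if op == "read":
            return State.S, None
        # The write updates the cache and is written through on the bus.
        # Assumption: the line stays Shared afterwards.
        return State.S, "write-through"
    # state is State.I
    if op == "read":
        # Miss followed by a line fill.
        # Assumption: the resulting state depends on sharing.
        return (State.S if shared_elsewhere else State.E), "line-fill"
    # Write to an invalid line: written through.
    # Assumption: no line is allocated, so the state stays Invalid.
    return State.I, "write-through"
```

For example, `local_access(State.E, "write")` yields the Modified state with no bus action, while `local_access(State.S, "write")` reports a write-through cycle.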

1.2 The 'least recently used' (LRU) Mechanism


The "least recently used" bit indicates which set in each line has been used last; the other set will be replaced if neither of them hits and both are valid. More generally, the "least recently used" (LRU) algorithm keeps an ordering of the set of locations that could be accessed from a given memory location. Whenever any of the present lines is accessed, it updates the list, marking that entry as the most recently accessed. When it comes time to replace an entry, the one at the end of the list, the least recently accessed, is the one discarded. This decision tree is shown in Figure 1.2.

Figure 1.2 LRU Cache Replacement Strategy
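For the two-way case the LRU ordering collapses to a single bit per pair of lines. The sketch below illustrates that idea in Python; the class name `TwoWaySet` and its fields are hypothetical, and preferring an invalid way over the LRU way on a miss is an assumption consistent with "replaced if none of them hits and both are valid".

```python
class TwoWaySet:
    """One pair of lines in a 2-way set-associative cache with one LRU bit."""

    def __init__(self):
        self.tags = [None, None]   # None marks an invalid way
        self.lru = 0               # index of the least recently used way

    def access(self, tag):
        """Look up tag; return True on a hit, False on a miss (after the fill)."""
        for way in (0, 1):
            if self.tags[way] == tag:
                self.lru = 1 - way      # the other way is now the older one
                return True
        # Miss: use an invalid way if there is one, otherwise evict the LRU way.
        if None in self.tags:
            victim = self.tags.index(None)
        else:
            victim = self.lru
        self.tags[victim] = tag
        self.lru = 1 - victim
        return False
```

After accessing tags 10, 20 and 10 again, a miss on tag 30 evicts tag 20, because tag 10 was touched more recently.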

1.3 The Pentium Processor

1.3.1 On-Chip Caches
The Pentium processor implements two internal caches for a total integrated cache size of 16 Kbytes: an 8 Kbyte data cache and a separate 8 Kbyte code cache. These caches are transparent to application software to maintain compatibility with previous Intel Architecture generations. The data cache fully supports the MESI (modified/exclusive/shared/invalid) writeback cache consistency protocol. The code cache is inherently write protected to prevent code from being inadvertently corrupted, and as a consequence supports a subset of the MESI protocol, the S (shared) and I (invalid) states. The caches have been designed for maximum flexibility and performance. The data cache is configurable as writeback or writethrough on a line-by-line basis. Memory areas can be defined as non-cacheable by software and external hardware. Cache writeback and invalidations can be initiated by hardware or software. Protocols for cache consistency and line replacement are implemented in hardware, easing system design.

1.3.2 Cache Organization
On the Pentium processor, each of the caches is 8 Kbytes in size and each is organized as a 2-way set associative cache. There are 128 sets in each cache, each set containing 2 lines (each line has its own tag address). Each cache line is 32 bytes wide. In the Pentium processor, replacement in both the data and instruction caches is handled by the LRU mechanism, which requires one bit per set in each of the caches. The data cache consists of eight banks interleaved on 4-byte boundaries. The data cache can be accessed simultaneously from both pipes, as long as the references are to different cache banks. A conceptual diagram of the organization of the data and code caches is shown in Figure 1-3. Note that the data cache supports the MESI writeback cache consistency protocol, which requires 2 state bits, while the code cache supports the S and I states only and therefore requires only one state bit.
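The organization figures quoted above (128 sets, 2 lines per set, 32 bytes per line) determine how an address selects a cache entry. The Python sketch below shows the conventional split into tag, set index and byte offset; the 32-bit address width and the exact bit-field layout are assumptions made for illustration, and `split_address` is a hypothetical helper name.

```python
LINE_BYTES = 32      # from the text: 32-byte cache lines
NUM_SETS = 128       # from the text: 128 sets per cache
WAYS = 2             # from the text: 2-way set associative

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 7 bits select the set


def split_address(addr):
    """Split an address into (tag, set_index, byte_offset).

    Assumes a 32-bit physical address and the usual layout
    tag | set index | byte offset, so the tag is the upper 20 bits.
    """
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


# Consistency check against the quoted capacity: 128 * 2 * 32 = 8192 bytes.
CACHE_BYTES = NUM_SETS * WAYS * LINE_BYTES
```

Two addresses that differ only in their tag compete for the same two lines of one set, which is where the LRU bit comes into play.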

Figure 1-3 Conceptual Organization of Code and Data Caches

1.3.3 Cache Structure
The instruction and data caches can be accessed simultaneously. The instruction cache can provide up to 32 bytes of raw opcodes and the data cache can provide data for two data references, all in the same clock. This capability is implemented partially through the tag structure. The tags in the data cache are triple ported. One of the ports is dedicated to snooping while the other two are used to look up two independent addresses corresponding to data references from each of the pipelines. The instruction cache tags of the Pentium processor are also triple ported. Again, one port is dedicated to support snooping and the other two ports facilitate split line accesses (simultaneously accessing the upper half of one line and the lower half of the next line). The storage array in the data cache is single ported but interleaved on 4-byte boundaries to be able to provide data for two simultaneous accesses to the same cache line.

Each of the caches is parity protected. In the instruction cache, there are parity bits on a quarter line basis and there is one parity bit for each tag. The data cache contains one parity bit for each tag and a parity bit per byte of data.

Each of the caches is accessed with physical addresses and each cache has its own TLB (translation lookaside buffer) to translate linear addresses to physical addresses. The TLBs associated with the instruction cache are single ported whereas the data cache TLBs are fully dual ported to be able to translate two independent linear addresses for two data references simultaneously. The tag and data arrays of the TLBs are parity protected with a parity bit associated with each of the tag and data entries in the TLBs. The data cache of the Pentium processor has a 4-way set associative, 64-entry TLB for 4-Kbyte pages and a separate 4-way set associative, 8-entry TLB to support 4-Mbyte pages. The code cache has one 4-way set associative, 32-entry TLB for 4-Kbyte pages and 4-Mbyte pages, which are cached in 4-Kbyte increments. Replacement in the TLBs is handled by a pseudo LRU mechanism (similar to the Intel486 CPU) that requires 3 bits per set.
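The bank interleaving of the data cache storage array decides whether the two pipes can be served in the same clock. A minimal Python sketch of that check follows; the function names are hypothetical, and the model is simplified in that it looks only at the bank number and ignores any special handling the hardware may have for two accesses to the very same location.

```python
NUM_BANKS = 8     # from the text: eight banks
BANK_BYTES = 4    # from the text: interleaved on 4-byte boundaries


def bank_of(addr):
    """Return the data cache bank that holds the given byte address."""
    return (addr // BANK_BYTES) % NUM_BANKS


def can_access_together(addr_u, addr_v):
    """True if the u-pipe and v-pipe references fall in different banks."""
    return bank_of(addr_u) != bank_of(addr_v)
```

Addresses 0x100 and 0x104 lie in neighbouring banks and can be paired, whereas 0x100 and 0x120 are 32 bytes apart, map to the same bank and conflict.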

1.4 The Pentium Pro, Pentium II, Pentium III

1.4.1 The Pentium Pro
The Pentium Pro processor on-chip level one (L1) caches consist of one 8-Kbyte four-way set associative instruction cache unit with a cache line length of 32 bytes and one 8-Kbyte two-way set associative data cache unit. Not all misses in the L1 cache expose the full memory latency. The level two (L2) cache masks the full latency caused by an L1 cache miss. The minimum delay for an L1 and L2 cache miss is between 11 and 14 cycles, based on a DRAM page hit or miss. The data cache can be accessed simultaneously by a load instruction and a store instruction, as long as the references are to different cache banks.

Figure 1.4 The Pentium Pro, II, III Processor Micro-Architecture with Advanced Transfer Cache Enhancement, The first and second level caches

1.4.2 The Pentium II / Pentium III
The on-chip cache subsystem of Pentium II and Pentium III processors consists of two 16-Kbyte four-way set associative caches with a cache line length of 32 bytes. The caches employ a write-back mechanism and a pseudo-LRU (least recently used) replacement algorithm. The data cache consists of eight banks interleaved on four-byte boundaries. Level two (L2) caches have been off chip but in the same package. They are 128K or more in size. L2 latencies are in the range of 4 to 10 cycles. An L2 miss initiates a transaction across the bus to memory chips. Such an access requires on the order of at least 11 additional bus cycles, assuming a DRAM page hit. A DRAM page miss incurs another three bus cycles. Each bus cycle equals several processor cycles; for example, one bus cycle for a 100 MHz bus is equal to four processor cycles on a 400 MHz processor. The speed of the bus and the sizes of L2 caches are implementation dependent, however. Check the specifications of a given system to understand the precise characteristics of the L2 cache.
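The bus-cycle figures above translate into processor cycles through the ratio of the two clock frequencies. The Python sketch below works through that arithmetic; the function names are hypothetical, and because the text gives 11 bus cycles as a minimum ("at least"), the results should be read as lower bounds rather than measured latencies.

```python
def bus_ratio(cpu_mhz, bus_mhz):
    """Processor cycles that elapse during one bus cycle."""
    return cpu_mhz / bus_mhz


def l2_miss_bus_cycles(dram_page_hit=True):
    """Minimum additional bus cycles for an L2 miss, per the text."""
    base = 11                       # at least 11 bus cycles on a DRAM page hit
    return base if dram_page_hit else base + 3   # a page miss adds three more


def l2_miss_cpu_cycles(cpu_mhz, bus_mhz, dram_page_hit=True):
    """Lower bound on the L2 miss penalty expressed in processor cycles."""
    return l2_miss_bus_cycles(dram_page_hit) * bus_ratio(cpu_mhz, bus_mhz)
```

With the example from the text (400 MHz processor, 100 MHz bus) the ratio is 4, so an L2 miss costs at least 44 processor cycles on a DRAM page hit and at least 56 on a page miss.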

Figure 1-5 The Intel NetBurst Micro-Architecture, the First Level, the Second Level Caches and Trace Cache

1.5 The Pentium 4 Processor

The Intel Pentium 4 processor is the latest IA-32 processor, and the first based on the Intel NetBurst micro-architecture (Figure 1-5). The Intel NetBurst micro-architecture can support up to three levels of on-chip cache. Only two levels of on-chip caches are implemented in the Pentium 4 processor, but it brings a new concept: the trace cache. The level nearest to the execution core of the processor, the first level, contains separate caches for instructions and data: a first-level data cache and the trace cache, which is an advanced first-level instruction cache. All other levels of caches are shared. The levels in the cache hierarchy are not inclusive, that is, the fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.

1.5.1 Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently-executed code, such as template restrictions and the extra latency to decode instructions upon a branch misprediction. In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the µop traces that are stored in the trace cache. The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace cache, efficiently and continuously, while only a few instructions involve the microcode ROM.

1.5.2 The Second-level Cache
A second-level cache miss initiates a transaction across the system bus interface to the memory sub-system. The system bus interface supports using a scalable bus clock and achieves an effective speed that quadruples the speed of the scalable bus clock. It takes on the order of 12 processor cycles to get to the bus and back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as the bus ratio. For example, one bus cycle for a 100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor.

2 pipeline
Pipelining is an architectural technique for increasing the throughput of complex, multiple-cycle instructions. An instruction is divided into a series of smaller stages, each of which can be completed within a single clock cycle, so that the clock frequency and the throughput of the system can be improved.

2.1 Pentium & Pentium with MMX

The Pentium processor has a five stage pipeline for the integer instructions, while the Pentium processor with MMX has an additional pipeline stage. The pipeline stages are shown below:
PF  Prefetch
F   Fetch (Pentium processor with MMX technology only)
D1  Instruction Decode
D2  Address Generate
EX  Execute - ALU and Cache Access
WB  Write back

Figure 2.1 Pentium processor pipeline

The Pentium processor is a superscalar machine which has two pipelines called the "u" and the "v" pipes. Figure 2.1 shows the instruction flow in the Pentium processor. The Pentium processor also has a floating point pipeline. The floating point unit (FPU) is integrated with the integer unit on the same chip and has 8 pipeline stages, the first five of which are shared with the integer unit. Integer instructions pass through only the first 5 stages. Integer instructions use the fifth (X1) stage as a WB (write-back) stage. The 8 FP pipeline stages and the activities that are performed in them are shown below:
PF  Prefetch
F   Fetch (Pentium processor with MMX technology only)
D1  Instruction Decode
D2  Address Generate
EX  Memory and register read; conversion of FP data to external memory format and memory write.
X1  Floating Point Execute stage one.
X2  Floating Point Execute stage two.
WF  Perform rounding and write floating-point result to register file.
ER  Error Reporting/Update Status Word.
The Pentium processor with MMX has an extra stage, obtained by dividing the Prefetch into two stages, Prefetch and Fetch; the pipeline is thus 6 stages deep to yield higher throughput.

2.2 The Pentium Pro, Pentium II, Pentium III

The Pentium Pro, Pentium II and Pentium III processors have the same pipeline architecture. The Pentium Pro and Pentium II processors have an in-order front end, an out-of-order execution path, and an in-order back end. In effect, the Pentium Pro/Pentium II processor consists of an outer CISC shell with an inner RISC core. Figure 2.2 shows a block diagram of the Pentium Pro/Pentium II processor. The operation of the Pentium Pro/Pentium II processor can be summarized as follows:
A. The processor fetches instructions from memory in the order of the static program.
B. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations, or micro-ops.
C. The processor executes the micro-ops on a superscalar pipeline organization, so that the micro-ops may be executed out of order.
D. The processor commits the results of each micro-op execution to the processor's register set in the order of the original flow.
The Pentium Pro/Pentium II processor has 13 stages (Figure 2.3) as shown below:
BTB0  Branch Target Buffer 0
BTB1  Branch Target Buffer 1
IFU0  Instruction Fetch
IFU1  Scan the bytes to determine instruction boundaries
IFU2  Instruction predecode
ID0   Instruction Decode
ID1   Instruction Decode

Figure 2.2 Pentium Pro/Pentium II block diagram

Figure 2.3 Pentium Pro/Pentium II pipeline

RAT      Register Allocator
ROB      Reorder Buffer, up to two register reads per cycle
RS       Reservation Station, micro-ops wait for operands and functional pipelines in ports 0-4 to become available.
EX       Execute Stage, 5 ports are available for the execute stage
ROB(wb)  Reorder buffer (writeback)
RRF      Reorder Buffer read

2.2 Pentium 4

The Pentium 4 has a different architecture from the previous processors, named the NetBurst Micro-Architecture. The pipeline of the Pentium 4 uses Hyper Pipelined Technology, which reaches a comparatively great depth: 20 stages. The execution of each instruction is divided into smaller parts, each of which is easier and faster to execute than the entire instruction, which lets the developers raise the CPU frequency. While today's 0.18 micron technology allows only 1 GHz for the Pentium III processor, the future Pentium 4 processors will be able to support up to 2 GHz working frequency. The pipeline of the Intel NetBurst Micro-Architecture contains three sections:
A. the in-order issue front end
B. the out-of-order superscalar execution core
C. the in-order retirement unit.
The front end supplies instructions in program order to the out-of-order core. It fetches and decodes IA-32 instructions into micro-operations. The out-of-order core can issue multiple micro-operations per cycle and aggressively reorder micro-operations so that those micro-operations that are ready for execution can execute as soon as possible. The retirement section ensures that the results of execution of the micro-operations are processed according to the original program order and that the proper architectural state is updated. Figure 1-5 shows the block diagram of the Intel NetBurst Micro-Architecture.

2.3 pipeline summary

Intel has developed its processor series from the Pentium to the Pentium 4. The architecture of the processors has changed a lot: to improve performance and throughput, the pipeline has become longer and longer and the operation more complex, as can be seen in Figure 2.4.


Figure 2.4 The pipeline of Intel Pentium series processor

To measure the pipeline performance, we can develop a speedup factor for the instruction pipeline compared to execution without the pipeline (based on a discussion in [HWAN93]). This model supposes that n instructions are processed without branches.

$$S_k = \frac{T_1}{T_k} = \frac{n k \tau}{[k + (n - 1)]\,\tau} = \frac{n k}{k + (n - 1)}$$

S_k : speedup factor
n : the number of instructions
τ : the cycle time of an instruction pipeline
T_1 : the time to execute n instructions without a pipeline
T_k : the time to execute n instructions with a k-stage pipeline
k : number of stages in the instruction pipeline

We can see that the speedup approaches k as n → ∞, so the larger the number of pipeline stages, the greater the potential for speedup. However, a deeper pipeline is not free from drawbacks. The first one is evident: since there are more stages to execute before the operation is completed, the overall time required for each operation increases. That is why, to make sure that the younger Pentium 4 models prove faster than the older Pentium III CPUs, Intel starts its new processor family at 1.4 GHz. If Intel launched a 1 GHz Pentium 4, it would undoubtedly be beaten by a 1 GHz Pentium III CPU.

The second drawback of a deeper pipeline comes to light when a branch prediction error occurs. The Pentium series processors are capable of executing instructions in succession as well as in parallel. In the latter case the instructions do not always follow the order in which they are listed in the program, and the branches are not always correctly predicted. In order to choose the right branch for further execution, the CPU predicts the results judging by the collected statistics. However, if the processor mispredicts a branch, all the speculatively executed instructions must be flushed from the processor pipeline in order to restart the instruction execution down the correct program branch. On more deeply pipelined designs, more instructions must be flushed from the pipeline, resulting in a longer recovery time from a branch mispredict. The net result is that applications that have many difficult-to-predict branches will tend to have a lower average level of instructions per clock.
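The speedup factor S_k of the branch-free model above is easy to evaluate numerically. The Python sketch below uses a hypothetical function name, `pipeline_speedup`, and simply computes nk / (k + (n - 1)); it does not model branch mispredictions or stalls.

```python
def pipeline_speedup(n, k):
    """Speedup of a k-stage pipeline over unpipelined execution.

    n is the number of instructions, k the number of pipeline stages.
    Implements S_k = n*k / (k + (n - 1)) for the branch-free model.
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return (n * k) / (k + (n - 1))
```

For a single instruction the speedup is 1 regardless of depth, and for a 20-stage pipeline it climbs towards 20 only as the number of instructions becomes very large, which illustrates why the ideal factor k is an upper bound.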

3 Parallel and superscalar aspects of Pentium processor family.


As the functionality of these processors was previously explained, this part takes a deeper look at the superscalar and parallel aspects of this processor family which were not mentioned before. It is beyond the purpose of this part to explain again the functionality of certain processors.


The SIMD aspects of this processor family are also presented: MMX, SSE, SSE2. We try to present the architectural details of these implementations and the reasons for which they were added, rather than an enumeration of the instructions added by them.

3.1 Superscalar aspects


3.1.1 Pentium
The original Pentium had a superscalar component consisting of two separate integer execution units capable of executing 2 instructions in parallel. The pipelines are called the "u" and "v" pipelines. The floating point unit shares the first 5 stages with the integer pipeline. In the decode stage D1, the Pentium has 2 decoders working in parallel.

3.1.2 Pentium Pro/Pentium II
The Pentium II has basically the same superscalar organization as the Pentium Pro, with the addition of the MMX execution units. The essential components of the superscalar organization are the instruction fetch and decode units, the dispatch and execute unit and the retire unit. The ID1 stage (instruction decode 1) contains 3 decoders which can work in parallel. One is a complex one and the others are simple ones. The complex decoder can handle Pentium instructions which translate into up to four micro-ops. The second and third decoders can handle just simple Pentium instructions that map into a single micro-op. A few instructions require more than four micro-ops. These are transferred to the MIS (microcode instruction sequencer), which is a microcode ROM that contains the series of micro-ops associated with the complex machine instructions. The output of ID1 or MIS is fed to the ID2 (instruction decode 2) in a block of 6 micro-ops at a time. Operations queued in ID2 pass through another renaming phase called register allocator (RAT), which remaps references to the 16 architectural registers into a set of 40 physical registers. In this way false dependencies are removed. The RAT feeds the reorder buffer with the revised micro-ops. The ROB is a circular buffer which can hold up to 40 micro-ops. Micro-ops enter the ROB in order and are dispatched to the execution units out of order, the only criteria for this dispatch being the availability of the appropriate execution unit and the necessary data items. Micro-ops are retired from the ROB in order.
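The behaviour of the reorder buffer, in-order entry, out-of-order completion and in-order retirement, can be illustrated with a small queue model. The Python sketch below is a teaching model only: the class `ReorderBuffer` and its method names are hypothetical, and the default retirement width of 3 micro-ops per cycle is an assumption chosen for illustration, since the text gives only the capacity of 40 entries for this processor.

```python
from collections import deque


class ReorderBuffer:
    """Simplified reorder buffer: in-order allocate, in-order retire."""

    def __init__(self, capacity=40):          # 40 entries, from the text
        self.capacity = capacity
        self.entries = deque()                # oldest micro-op on the left

    def allocate(self, uop):
        """Add a micro-op in program order; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"uop": uop, "done": False})
        return True

    def complete(self, uop):
        """Mark a micro-op as executed; completion may happen in any order."""
        for entry in self.entries:
            if entry["uop"] == uop:
                entry["done"] = True
                return
        raise KeyError(uop)

    def retire(self, max_per_cycle=3):        # assumed width, see lead-in
        """Retire finished micro-ops from the head, stopping at the first unfinished one."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < max_per_cycle):
            retired.append(self.entries.popleft()["uop"])
        return retired
```

If micro-ops a, b and c are allocated and c finishes first, nothing retires until a has finished; c leaves the buffer only after both a and b have retired ahead of it.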
The RS (reservation station) is responsible for retrieving micro-ops from the ROB. The RS looks for micro-ops whose status indicates that they are ready for execution (have all operands) and dispatches them to the appropriate execution unit. Up to 5 micro-ops can be dispatched in one cycle. Five ports connect the RS to the execution units. Port 0 is used for both integer and floating-point instructions, with the exception of simple operations on integers and handling branch mispredictions, which are allocated to port 1. MMX execution units are allocated between these two ports. The other ports are for memory loads and stores. Once an execution is completed, the entry in the ROB is updated and the execution unit is free for the next micro-op. The RU (retire unit) works to commit the result of instruction execution. Once it is determined that the micro-op is not vulnerable to branch misprediction, it is marked as ready for retirement. When the previous Pentium instruction has been retired and all the micro-ops of the next instruction have been marked as ready for retirement, the RU deletes the micro-ops from the ROB and updates all the registers affected by this instruction.

3.1.3 Pentium 4
Instructions are fetched and decoded by a translation engine. There is only one decoder, which can decode instructions at a maximum rate of 1 per clock cycle. Some complex instructions must use the microcode ROM (like the Pentium II/Pentium Pro). The translation engine builds the decoded instructions into sequences of micro-ops called traces, which are stored in the trace cache. The execution trace cache stores these micro-ops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line. The trace cache can deliver up to 3 micro-ops per clock to the core. The core is designed to facilitate parallel execution. It can dispatch up to 6 micro-ops per cycle through the 4 issue ports pictured in Figure 3.1. Six micro-ops per cycle exceeds the trace cache and retirement micro-op bandwidth.


Figure 3.1 Pentium 4 Execution Unit

Most execution units can start executing a new micro-op every cycle, so that several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start two per cycle, and many floating-point instructions can start one every two cycles. Micro-ops can begin execution, out of order, as soon as their data inputs are ready and resources are available (the same concept as Pentium II/Pentium Pro).

Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move micro-op (including floating-point stack move, floating-point exchange or floating-point store data), or one arithmetic logical unit (ALU) micro-op (including arithmetic, logic or store data). In the second half of the cycle, it can dispatch one similar ALU micro-op.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) micro-op or normal-speed integer (multiply, shift and rotate) micro-op, or one ALU (arithmetic, logic or branch) micro-op. In the second half of the cycle, it can dispatch one similar ALU micro-op.

Port 2. Port 2 supports the dispatch of one load operation per cycle.

Port 3. Port 3 supports the dispatch of one store address operation per cycle.

Thus the total issue bandwidth can range from zero to six micro-ops per cycle. When a micro-op completes and writes its result to the destination, it is retired. Up to 3 micro-ops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed micro-ops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB) to update branch history.
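The figure of six micro-ops per cycle follows directly from the per-port limits listed above. The short Python sketch below adds them up; the names are hypothetical, and treating sustained throughput as the minimum of the issue, trace cache and retirement widths is a simplifying assumption, not a statement about the real scheduler.

```python
# Maximum micro-ops each issue port can dispatch per cycle, as read from the text.
PORT_WIDTH = {
    0: 2,  # one micro-op in each half of the cycle
    1: 2,  # one micro-op in each half of the cycle
    2: 1,  # one load per cycle
    3: 1,  # one store address per cycle
}


def max_issue_per_cycle(port_width=PORT_WIDTH):
    """Peak number of micro-ops the four ports can dispatch in one cycle."""
    return sum(port_width.values())


def sustained_uops_per_cycle(issue=6, trace_cache=3, retire=3):
    """Assumed bound: the narrowest stage limits long-run throughput."""
    return min(issue, trace_cache, retire)
```

The peak works out to 2 + 2 + 1 + 1 = 6, while the trace cache delivery rate and the retirement rate of 3 micro-ops per cycle cap what can be sustained under this assumption.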

3.2 SIMD

3.2.1 Pentium Pro
The Pentium Pro does not implement the MMX (Matrix Math Extensions) execution unit or the MMX register set and therefore does not support the MMX instruction set.

3.2.2 Pentium II
SIMD computations were introduced into the Intel IA-32 architecture with the Intel MMX technology. The Pentium II processor implements MMX support. The Intel MMX technology allows SIMD computations to be performed on packed byte, word and doubleword integers that are contained in a set of eight registers called MMX registers. The eight general-purpose registers are used along with the existing IA-32 addressing modes to address operands in memory. (The MMX registers cannot be used to address memory.) The general-purpose registers are also used to hold operands for some MMX technology operations.
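What a SIMD operation on packed bytes means can be shown by emulating one in software. The Python sketch below treats a 64-bit integer as eight independent byte lanes and adds two such values lane by lane with wraparound, in the spirit of a packed byte add; the helper names are hypothetical and the code says nothing about how the hardware implements the operation.

```python
def unpack_bytes(reg64):
    """Split a 64-bit value into eight byte lanes, lowest lane first."""
    return [(reg64 >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes):
    """Combine eight byte lanes (lowest first) into one 64-bit value."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & 0xFF) << (8 * i)
    return value


def packed_add_bytes(a, b):
    """Add two 64-bit values as eight independent bytes with wraparound."""
    sums = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(sums)
```

The key property is that a carry never crosses a lane boundary: adding 0x01 to a lane holding 0xFF wraps that lane to 0x00 and leaves its neighbour untouched.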


These MMX registers are mapped over the FPU registers. FPU registers are 80 bits wide but MMX registers are 64 bits wide. MMX registers are aliased on the 64-bit mantissa portion of the FP registers. When a value is written to one of the MMX registers it also appears in the mantissa portion of the respective FP register. The reverse is also true. When a value is written to an MMX register, bits 79-64 of the corresponding FP register are all set to one. The MMX registers are explicitly addressed by name (FP registers are addressable as stack locations).

An application can contain both x87 FPU floating-point and MMX instructions. However, because the MMX registers are aliased to the x87 FPU register stack, care must be taken when making transitions between x87 FPU instructions and MMX instructions to prevent the loss of data in the x87 FPU and MMX registers and to prevent incoherent or unexpected results. When the first MMX instruction is executed two things occur: the FPU registers are renamed as MMX and the FPU tag word is marked valid. Because of this it is necessary that an EMMS (Empty MMX State) instruction be executed after completion of the MMX code and before any FPU code is executed.

The MMX instruction set is divided into the following groups of instructions: arithmetic, comparison, conversion, logical, shift, data transfer, and the Empty MMX State (EMMS) instruction. The MMX execution units are connected to the Reservation Station (RS) at the first 2 ports. At port 0 are the MMX ALU Unit and the MMX Multiplier Unit, and at port 1 there are an MMX ALU Unit and an MMX Shifter Unit.

3.2.3 Pentium III
The Intel MMX technology introduced single-instruction multiple-data (SIMD) capability into the IA-32 architecture, with the 64-bit MMX registers, 64-bit packed integer data types, and instructions that allowed SIMD operations to be performed on packed integers.
The SSE extensions extend this SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers. The SSE extensions introduced one data type, the 128-bit packed single-precision floating-point data type, to the IA-32 architecture. This data type consists of four IEEE 32-bit single-precision floating-point values packed into a double quadword. The 32-bit MXCSR register, which provides control and status bits for operations performed on the XMM registers, is also added to the Intel IA-32 architecture.

The Intel Pentium III offers the Internet Streaming SIMD Extensions (SSE), which add 70 new instructions enabling advanced imaging, 3D, streaming audio and video, and speech recognition for an enhanced Internet experience. These include SIMD floating-point instructions, additional SIMD integer instructions, and cacheability control instructions.

SSE allows SIMD computations to be performed on operands that contain 4 packed single-precision floating-point data elements. The operands can be either in memory or in a set of eight 128-bit registers called XMM registers. SSE also extends SIMD computational capability with additional 64-bit MMX instructions.

The XMM registers can be addressed directly using the names XMM0 to XMM7, and they can be accessed independently from the x87 FPU and MMX registers and the general-purpose registers (they are not aliased to any other of the processor's registers). The XMM registers can only be used to perform calculations on data; they cannot be used to address memory. Addressing memory is accomplished by using the general-purpose registers.

Data can be loaded into the XMM registers or written from the registers to memory in 32-bit, 64-bit, and 128-bit increments. When storing the entire contents of an XMM register in memory (128-bit store), the data is stored in 16 consecutive bytes, with the low-order byte of the register being stored in the first byte in memory.
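The memory layout of a 128-bit store can be illustrated with a short Python sketch. It packs four single-precision values into 16 little-endian bytes, element 0 first, which mirrors the rule that the low-order byte of the register lands in the first byte in memory. The helper names `xmm_store` and `xmm_load` are hypothetical and only model the layout; they are not a processor or library API.

```python
import struct


def xmm_store(values):
    """Return the 16 bytes a 128-bit store would write for four packed singles.

    values[0] is taken to be the lowest-order element of the register.
    """
    if len(values) != 4:
        raise ValueError("an XMM register holds exactly four single-precision values")
    # "<4f" = little-endian, four IEEE 32-bit floats, no padding.
    return struct.pack("<4f", *values)


def xmm_load(memory: bytes):
    """Inverse of xmm_store: read four packed singles from 16 bytes."""
    return struct.unpack("<4f", memory)


image = xmm_store([1.0, 2.0, 3.0, 4.0])
print(len(image), image[:4].hex())
```

The first four bytes of `image` hold element 0, the next four hold element 1, and so on up to byte 15.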
The 32-bit MXCSR register contains control and status information for SSE and SSE2 SIMD floating-point operations. This register contains the flag and mask bits for the SIMD floating-point exceptions, the rounding control field for SIMD floating-point operations, the flush-to-zero flag that provides a means of controlling underflow conditions on SIMD floating-point operations, and the denormals-are-zeros flag that controls how SIMD floating-point instructions handle denormal source operands. The contents of this register can be loaded from memory with the LDMXCSR and FXRSTOR instructions and stored in memory with the STMXCSR and FXSAVE instructions.

The SSE instructions are divided into four functional groups:
- Packed and scalar single-precision floating-point instructions.
- 64-bit SIMD integer instructions.
- State management instructions.
- Cacheability control, prefetch, and memory ordering instructions.

The packed and scalar single-precision floating-point instructions are divided into the following subgroups:
- Data movement instructions
- Arithmetic instructions
- Logical instructions
- Comparison instructions
- Shuffle instructions
- Conversion instructions

The SSE data movement instructions move single-precision floating-point data between XMM registers and between an XMM register and memory. The SSE arithmetic instructions perform addition, subtraction, multiply, divide, reciprocal, square root, reciprocal of square root, and maximum/minimum operations on packed and scalar single-precision floating-point values. The SSE logical instructions perform AND, AND NOT, OR, and XOR operations on packed single-precision floating-point values. The compare instructions compare packed and scalar single-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register. The SSE shuffle and unpack instructions shuffle or interleave the contents of two packed single-precision floating-point values and store the results in the destination operand. The SSE conversion instructions support packed and scalar conversions between single-precision floating-point and doubleword integer formats.

The SSE extensions also add 64-bit packed integer instructions to the IA-32 architecture. These instructions operate on data in MMX registers and 64-bit memory locations.

The MXCSR state management instructions (LDMXCSR and STMXCSR) load and save the state of the MXCSR register, respectively. The LDMXCSR instruction loads the MXCSR register from memory, while the STMXCSR instruction stores the contents of the register to memory.

The SSE extensions introduce several new instructions to give programs more control over the caching of data. The SSE extensions are fully compatible with all software written for IA-32 processors.
All existing software continues to run correctly, without modification, on processors that incorporate the SSE extensions, as well as in the presence of existing and new applications that incorporate these extensions. The XMM registers are independent of the x87 FPU and MMX registers, so SSE and SSE2 operations performed on the XMM registers can be performed in parallel with operations on the x87 FPU and MMX registers.

3.2.4 Pentium 4

The Pentium 4 upgrades the SSE of the P6 CPU to SSE2, Streaming SIMD Extensions 2, with seventy-six new SIMD instructions and enhancements to sixty-eight integer SIMD instructions. That makes 144 SIMD instructions to manage floating-point, application and multimedia performance. From a programmer's perspective, the model for the new Pentium 4 CPU is not that dissimilar to the MMX technology and SSE models in the Pentium II and III. The new SSE2 instructions add much more flexibility, as they allow SIMD computations to be performed on floating-point, integer, and packed integer data types in the XMM registers. SSE2 uses the same registers and is backward compatible with the SSE of the Pentium III processor.

The new SIMD instructions aim to do away with one of the major bottlenecks found in today's x86 CPUs: the x87 FPU, or floating-point unit. The performance of the x87 FPU is severely restricted by this aging standard, and improving performance would not be easy with its original design. Using SSE2 to bypass it completely circumvents the bottleneck. If Intel can find enough support among software developers to start using SSE2 for floating-point operations, the Pentium 4's SSE2 FPU will be a lot faster than an equivalently clocked x87 FPU.

SSE2 extends SIMD computations to operate on packed double-precision floating-point data elements and 128-bit packed integers. All 144 instructions in SSE2 can operate on two packed double-precision floating-point data elements, or on 16 packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.

The SSE2 instructions are divided into four functional groups:
- Packed and scalar double-precision floating-point instructions.
- 64-bit and 128-bit SIMD integer instructions.
- 128-bit extensions of SIMD integer instructions introduced with the MMX technology and the SSE extensions.
- Cacheability-control and instruction-ordering instructions.

The SSE2 extensions add several 128-bit packed integer instructions to the IA-32 architecture. Where appropriate, a 64-bit version of each of these instructions is also provided. The 128-bit versions of the instructions operate on data in the XMM registers, and the 64-bit versions of these new instructions operate on data in the MMX registers.

3.2.5 Summary of SIMD

The full set of IA-32 SIMD technologies (the Intel MMX technology, the SSE extensions, and the SSE2 extensions) gives the programmer the ability to develop algorithms that combine operations on packed 64-bit and 128-bit integer operands and on single- and double-precision floating-point operands. All these technologies are architectural extensions of the IA-32 Intel architecture. All SIMD instructions are accessible from all IA-32 execution modes: protected mode, real address mode and Virtual 8086 mode. A summary of the types used for MMX, SSE and SSE2 can be seen in Figure 3.2.

Figure 3.2 MMX, SSE and SSE2 data types
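The element counts quoted above follow directly from dividing the register width by the element width. A minimal Python sketch of that arithmetic is given below; the table `ELEMENT_BITS` and the function `lanes` are illustrative names, not part of any Intel interface.

```python
# Width in bits of each packed element type mentioned in the text.
ELEMENT_BITS = {
    "byte": 8,
    "word": 16,
    "doubleword": 32,
    "quadword": 64,
    "single": 32,   # single-precision floating point
    "double": 64,   # double-precision floating point
}


def lanes(register_bits: int, element: str) -> int:
    """Number of packed elements of the given type that fit in one register."""
    return register_bits // ELEMENT_BITS[element]


for name in ELEMENT_BITS:
    print(f"{name:>10}: MMX x{lanes(64, name)}  XMM x{lanes(128, name)}")
```

For a 128-bit XMM register this reproduces the 16 bytes, 8 words, 4 doublewords and 2 quadwords listed for SSE2, and the 4 single-precision values listed for SSE.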

4 Branch prediction
In order to achieve high throughput and performance, Intel has made the pipeline longer and longer from the Pentium to the Pentium 4 processor, so the problem of how to deal with branches has become more and more important. A variety of approaches have been taken for predicting whether a branch will be taken or not:
A. Predict never taken
B. Predict always taken
C. Predict by opcode
D. Taken/not taken switch
E. Branch history table


The first three can be classified as static prediction algorithms, and the last two as dynamic prediction algorithms. The processors of the Pentium series use several of these methods to predict branches, ensuring a steady flow of instructions to the initial stages of the pipelines.

4.1 Pentium
The Pentium processor uses a dynamic branch prediction strategy based on the history of recent executions of a branch instruction. A Branch Target Buffer (BTB) is maintained that caches information about recently encountered branch instructions in order to predict the outcome of branch instructions, which minimizes pipeline stalls due to prefetch delays. The Pentium processor accesses the BTB with the address of the instruction in the D1 stage. Each entry contains a branch prediction state machine with four states:
A. Strongly not taken
B. Weakly not taken
C. Weakly taken
D. Strongly taken

If an entry already exists in the BTB, then the instruction unit is guided by the history information for that entry in determining whether to predict that the branch is taken. If a branch is predicted taken, then the branch destination address associated with this entry is used for prefetching the branch target instruction. Once the instruction is executed, the history portion of the appropriate entry is updated to reflect the result of the branch instruction. If this instruction is not represented in the BTB, then the address of the instruction is loaded into an entry in the BTB; if necessary, an old entry is deleted.
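The lookup, update and replacement steps described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the Pentium's actual circuit: the transitions are those of a textbook 2-bit saturating counter, the initial state of a new entry and the oldest-first replacement policy are guesses, and the class name `BranchTargetBuffer` and the default capacity are hypothetical.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


class BranchTargetBuffer:
    def __init__(self, capacity: int = 8):
        self.capacity = capacity  # arbitrary size chosen for the sketch
        self.entries = {}         # branch address -> [state, target address]

    def predict(self, address: int):
        """Return the predicted target, or None for 'not taken' or an unknown branch."""
        entry = self.entries.get(address)
        if entry is not None and entry[0] >= WEAKLY_TAKEN:
            return entry[1]
        return None

    def update(self, address: int, taken: bool, target: int) -> None:
        """Record the real outcome once the branch has executed."""
        entry = self.entries.get(address)
        if entry is None:
            if len(self.entries) >= self.capacity:
                # Assumption: evict the oldest entry (dicts keep insertion order).
                del self.entries[next(iter(self.entries))]
            start = WEAKLY_TAKEN if taken else WEAKLY_NOT_TAKEN  # assumed initial state
            self.entries[address] = [start, target]
        elif taken:
            entry[0] = min(entry[0] + 1, STRONGLY_TAKEN)
            entry[1] = target
        else:
            entry[0] = max(entry[0] - 1, STRONGLY_NOT_TAKEN)


btb = BranchTargetBuffer(capacity=2)
btb.update(0x100, taken=True, target=0x80)
print(btb.predict(0x100))
```

Because the counter saturates, a branch in the "strongly taken" state needs two consecutive not-taken outcomes before the prediction flips, which keeps a single loop exit from disturbing the prediction for the next run of the loop.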

4.2 The Pentium Pro, Pentium II, Pentium III

The Pentium Pro, Pentium II and Pentium III have much longer pipelines, so the penalty for misprediction is greater. Accordingly, the Pentium Pro and the Pentium II use a more elaborate branch prediction scheme to reduce the misprediction rate. The Pentium Pro/Pentium II BTB is organized as a four-way set-associative cache with 512 lines. Each entry uses the address of the branch as a tag. The entry also includes the branch destination address for the last time this branch was taken and a 4-bit history field. This use of four history bits contrasts with the 2 bits used in the original Pentium processor. With 4 bits, the Pentium Pro/Pentium II mechanism can take into account a longer history in predicting branches. The algorithm, referred to as Yeh's algorithm, can provide a significant reduction in mispredictions compared to algorithms that use only 2 bits of history [EVER98].
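Why a longer history helps can be seen in a small Python experiment. The sketch compares a single 2-bit counter with a predictor that uses the last four outcomes of a branch to select one of sixteen 2-bit counters. This is a simplified reading of a two-level adaptive scheme, not Intel's implementation; the class names, the per-branch pattern table and the initial counter values are all assumptions.

```python
class TwoBitCounter:
    """Single 2-bit saturating counter (0..3); predicts taken when >= 2."""

    def __init__(self):
        self.state = 1  # assumed start: weakly not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)


class HistoryPredictor:
    """4-bit outcome history selecting one of 16 two-bit counters."""

    HISTORY_BITS = 4

    def __init__(self):
        self.history = 0  # newest outcome in bit 0
        self.counters = [TwoBitCounter() for _ in range(1 << self.HISTORY_BITS)]

    def predict(self) -> bool:
        return self.counters[self.history].predict()

    def update(self, taken: bool) -> None:
        self.counters[self.history].update(taken)
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes) -> int:
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken three times and then falls through, repeated.
pattern = [True, True, True, False] * 12
warm_up, steady = pattern[:8], pattern[8:]

simple, with_history = TwoBitCounter(), HistoryPredictor()
for predictor in (simple, with_history):
    count_mispredictions(predictor, warm_up)

simple_misses = count_mispredictions(simple, steady)
history_misses = count_mispredictions(with_history, steady)
print(simple_misses, history_misses)
```

On this repeating pattern the single counter keeps mispredicting the loop exit, while the history-based predictor learns that the history "taken, taken, taken" is followed by "not taken" and stops mispredicting after the warm-up.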

4.3 Pentium 4

The Pentium 4 processor with the Intel NetBurst Micro-Architecture predicts all near branches, including conditional branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, for example far calls, irets, and software interrupts. Several mechanisms are implemented to aid in predicting branches more accurately and in reducing the cost of taken branches:
A. Dynamically predict the direction and target of branches based on the instruction's linear address using the branch target buffer (BTB). The branch prediction buffer, which stores more detail on the history of past branches, is increased to 4 KB, while the buffer of the P6 family is only 512 bytes.
B. If no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken.
C. Return addresses are predicted using the 16-entry return address stack.
D. Traces of instructions are built across predicted taken branches to avoid branch penalties.

The Pentium 4 processor, with a larger branch target buffer and a more advanced branch prediction algorithm, has the net effect of reducing the number of branch mispredictions by about 33% over the Pentium III processor's branch prediction capability. This is a really good value, because it means that the Pentium 4 predicts about 90-95% of branches correctly.
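Mechanisms B and C above are simple enough to sketch in a few lines of Python. The sketch is illustrative only: the function and class names are hypothetical, and dropping the oldest entry when the stack overflows is an assumption, since the text only gives the depth of 16 entries.

```python
from collections import deque


def static_predict_taken(branch_address: int, target_address: int) -> bool:
    """Static rule: a backward branch is predicted taken, a forward branch not taken."""
    return target_address < branch_address


class ReturnAddressStack:
    def __init__(self, depth: int = 16):
        # Assumption: on overflow the oldest return address is silently dropped.
        self.stack = deque(maxlen=depth)

    def on_call(self, return_address: int) -> None:
        """A call pushes the address of the instruction following it."""
        self.stack.append(return_address)

    def predict_return(self):
        """A return pops the most recent address; None means no prediction."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x401005)
ras.on_call(0x402010)
print(hex(ras.predict_return()), static_predict_taken(0x1000, 0x0F80))
```

The backward-taken rule works well because backward branches usually close loops, and a stack suits returns because calls and returns nest: the most recent call is the first to return.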


PART III

SUMMARY

As we have discussed, all Pentium family processors so far, including the Pentium, Pentium Pro, Pentium II, Pentium III, and the latest Pentium 4, are based on the IA-32 Intel Architecture. The computing power and the complexity (or roughly, the number of transistors per processor) of Intel architecture processors have grown over the years in close relation to Moore's law. By taking advantage of new process technology and new micro-architecture designs, each new generation of IA-32 processors has demonstrated frequency-scaling headroom and new performance levels over the previous generation. We summarize the key features of the Pentium family processors in Table 1; more detailed comparisons are attached in Appendix 1.

Table 1. Key Features of Pentium Family Processors


Intel Processor | Date Introduced | Max. Clock Shipped | Transistors per Die | Register Sizes | Ext. Data Bus | Max. Extern. Addr. | Caches
----------------|-----------------|--------------------|---------------------|----------------|---------------|--------------------|-------
Pentium | 1993 | 60 MHz | 3.1 M | 32 GP, 80 FPU | 64 | 4 GB | L1: 16KB
Pentium Pro | 1995 | 200 MHz | 5.5 M | 32 GP, 80 FPU | 64 | 64 GB | L1: 16KB; L2: 256KB or 512KB
Pentium II | 1997 | 266 MHz | 7 M | 32 GP, 80 FPU, 64 MMX | 64 | 64 GB | L1: 32KB; L2: 256KB or 512KB
Pentium III | 1999 | 500 MHz | 8.2 M | 32 GP, 80 FPU, 64 MMX, 128 XMM | 64 | 64 GB | L1: 32KB; L2: 512KB
Pentium 4 | 2000 (Intel NetBurst microarchitecture) | 1.50 GHz | 42 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | L1: 8KB; Trace Cache: 12K µops; L2: 256KB

The IA-32 Intel Architecture has been at the forefront of the computer revolution and is today clearly the preferred computer architecture, as measured by the number of computers in use and the total computing power available in the world. Two of the major factors that may be the cause of the popularity of the IA-32 architecture are the compatibility of software written to run on IA-32 processors, and the fact that each generation of IA-32 processors delivers significantly higher performance than the previous generation. The IA-32 architecture has been and is committed to the task of maintaining backward compatibility at the object code level to preserve Intel customers' large investment in software. At the same time, in each generation of the architecture, the latest and most effective micro-architecture and silicon fabrication technologies have been used to produce high-performance processors. In each generation of IA-32 processors, Intel has conceived and incorporated increasingly sophisticated techniques into its micro-architecture in pursuit of ever faster computers.

What is the future of the Pentium family? Perhaps Intel has realized the disadvantages of the IA-32 architecture: because of the heavy emphasis on backward compatibility, ever more sophisticated techniques have to be implemented to obtain obscure features. Intel and HP are cooperating on a new architecture, IA-64. IA-64 may be the best way forward for next-generation processors, but it is a revolutionary approach: we would have to give up all programs written for IA-32 and move to this newest generation.


Appendix 1: Comparison of Pentium Family Processors Specifications


General Details

Name
Family/Generation

Pentium 80586, 5th Generation

Pentium Pro

Pentium II

Pentium III

Pentium 4

80686, 6th Generation

80686, 6th Generation, MMX 333 MHz 66 MHz, GTL+ RISC, Out-of-order and Speculative Execution 20 Entry RS, 40 Entry ROB

Clock Frequencies

CPU Core Speed
External Bus Speed

75, 90, 100, 120, 133, 133, 150, 166, 150, 166, 200 MHz 180, 200 MHz 50, 60, or 66 MHz 60 or 66 MHz, GTL+ RISC, Out-of-order and Speculative Execution 20 Entry RS, 40 Entry ROB

80686, 6th Generation, MMX, SSE 600E, 650E, ..., 850E MHz 100 MHz GTL+ (Slot 1), AGTL+ (Slot 2) RISC, Out-of-order and Speculative Execution

Intel NetBurst MicroArchitecture 42M transistors 1.4GHz 400MHz RISC, Out-Of-Order and Speculative Execution

Processor Core

Generic Details

CISC, In-order and Pipelined Execution Dual Pipeline Design

Specific Details

Registers

Pipeline Depth
Execution Units

20 Entry RS, 40 Entry ROB 32 Bit Integer, 80 Bit 32 Bit Integer, 80 32 Bit Integer, 80 Bit 32 Bit Integer, 80 Bit FP, 64 Bit MM, 128 Bit FP, 40 Entry FP, 64 Bit MM, 40 FP Bit SSE, 40 Entry RAT Entry RAT RAT 12 (In-order) plus 12 (In-order) plus 2 2 (Shared) plus 2x 3 12 (In-order) plus 2 2 (Out-of-order) (Out-of-order) (Dual Pipeline) Stages (Out-of-order) Stages Stages Stages 2x Integer, Pipelined FPU 2x ALU, Load, Store Address, Store Data, Pipelined FPU 36 Bit

GP: 32 FPU: 80 MMX: 64 XMM: 128 20 stages

Address Bus Width

32 Bit

2x DDR 2x ALU/MMX, 2x ALU/MMX/SSE, ALU/MMX/SSE2, Load, Store Address, Load, Store Address, Load, Store Address, Store Data, Pipelined Store Data, Pipelined Store Data, Pipelined FPU FPU FPU 36 Bit 36 Bit 36 Bit


Processor Buses

Data Bus Width 64 Bit, separate 64 64 Bit, separate 64 Bit Backside L2 Bit Backside L2 Cache Bus Cache Bus

64 Bit

Physical Memory
Virtual Memory
Multiprocessing

2^32 Bit = 4 GB (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) SMP, 2 Processors, using integrated local APICs N/A

Processor Caches

Level 7 Level 1

Code

8 KB, 2-Way, 32 Byte/Line, SI, 2x Fetch Port (supports Split-line Access), Snoop Port (for SMC), LRU 8 KB, 2-Way, 32 Byte/Line, MESI, Non-blocking, Dual-ported, Snoop Port, 8 Banks, LRU

Data

2^36 Bit = 64 GB (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) SMP, 4 Processors, using integrated local APICs N/A 8 KB, 4-Way, 32 Byte/Line, SI, Fetch Port, Internal and External Snoop Port (for SMC/XMC), LRU 8 KB, 2-Way, 32 Byte/Line, MESI, Non-blocking, Dual-ported, Snoop Port, Write Allocate, 8 Banks, LRU

2^36 Bit = 64 GB (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB)

64 Bit separate 64+8 Bit Backside L2 Cache Bus with ECC (0.2µm) separate 256+32 Bit Backside L2 Cache Bus with ECC (0.18 µm) 2^36 Bit = 64 GB (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB)

64 Bit

2^36 Bit = 64 GB (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB)

SMP, 2 Processors, SMP, 2 Processors, SMP, 2 Processors using integrated local using integrated local APICs APICs APICs N/A 16 KB, 4-Way, 32 Byte/Line, SI, Fetch Port, Internal and External Snoop Port (for SMC/XMC), LRU 16 KB, 4-Way, 32 Byte/Line, MESI, Non-blocking, Dual-ported, Snoop Port, Write Allocate, 8 Banks, LRU N/A 16 KB, 4-Way, 32 Byte/Line, SI, Fetch Port, Internal and External Snoop Port (for SMC/XMC), LRU N/A 8 KB, 4-Way, 64 B/line

16 KB, 4-Way, 32 Trace Cache: Byte/Line, MESI, 12000 uOPS Non-blocking, Dual-ported, Snoop Port, Write Allocate, 8 Banks, LRU


Level 2 External, depends on Motherboard

256 KB..1 MB, 4-Way, 32 Byte/Line, Non-blocking, 64 GB cacheable, using 1 or 2 Dies inside Package 4x 32 Byte

256 KB, 4-Way, 32 Byte/Line, Non-blocking, 64 GB cacheable,

256 KB, 8-Way, Unified, 256 KB, 8- 128 B/line Way, 32 Byte/Line, MESI Non-blocking, 64 GB cacheable, LRU 4x 32 Byte (Shared)

Processor Buffers

Read Buffer

Write Buffer

Prefetch Queue
Branch Static Prediction

TLB

Yes 512 Entries, 4- 256 Entries, 4-Way, 4- Way, providing Dynamic State 16x 4-State Pattern Recognition RSB N/A 4 Entries 4KB CODE 4KB CODE 32 Entries, 4-Way, 32 Entries, 4-Way, LRU LRU

32 Byte for Code Cache 32 Byte for Data Cache 2x 8 Byte (supports Dual Pipeline Design) 3x 32 Byte (Line Replacement Write Buffer, Internal and External Snoop Write Buffer) 2x 32 Byte (supports Dual Pipeline Design) SMC can be observed up to 24 Byte ahead Yes

4x 32 Byte

32 Byte

32 Byte

32 Byte

32 Byte Yes 512 Entries, 4-Way, providing 16x 4-State Pattern Recognition 4 Entries 4KB CODE 32 Entries, 4-Way, LRU

32 Byte Yes Yes 512 Entries, 4-Way, providing 16x 4-State Pattern Recognition 4 Entries 4KB CODE 4KB CODE 32 Entries, 4-Way, LRU


Instruction Set
Regular
Floating Point
Multi Media

4MB CODE LARGE CODE N/A (uses 4 KB Code 2 Entries, Full, Entries) LRU 4KB DATA 4KB DATA 64 Entries, 4-Way, 64 Entries, 4-Way, LRU LRU LARGE DATA 4MB DATA 8 Entries, 4-Way, 8 Entries, 4-Way, LRU LRU IA-32 IA-32 Integrated Integrated N/A N/A

LARGE CODE LARGE CODE 2 Entries, Full, LRU 2 Entries, Full, LRU 4KB DATA 64 Entries, 4-Way, LRU LARGE DATA 8 Entries, 4-Way, LRU IA-32 Integrated MMX, FXSAVE/FXRSTOR Real, Protected, Virtual, Paging, SMM, Probe Mode 4KB DATA 64 Entries, 4-Way, LRU LARGE DATA 8 Entries, 4-Way, LRU IA-32 Integrated MMX, SSE Real, Protected, Virtual, Paging, SMM, Probe Mode

LARGE CODE

4KB DATA

LARGE DATA IA-32 Integrated MMX, SSE2 Real, Protected, Virtual, Paging, SMM, Probe Mode

Processor Modes

Real, Protected, Real, Protected, Virtual, Paging, Virtual, Paging, SMM, SMM, Probe Probe Mode Mode


Appendix 2: References


[ASTA99] A. S. Tanenbaum: Structured Computer Organization, Prentice Hall, 1999
[EVER98] Evers, M., et al. "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work", Proceedings, 25th Annual International Symposium on Microarchitecture, July 1998
[HWAN93] Hwang, K. Advanced Computer Architecture. New York: McGraw-Hill, 1993
[WILL96] William, S., Computer Organization and Architecture, Prentice-Hall, Inc, 1996
[INTEL98] INTEL Inc., P6 Family of Processors Hardware Developer's Manual, 244001-001, 1998
[INTEL01] INTEL Inc., Intel Pentium 4 Processor and Intel 850 Performance Brief, 242240-003, 2001
[INTEL00] INTEL Inc., A Detailed Look Inside the Intel NetBurst MicroArchitecture of the Intel Pentium 4 Processor, 2000
