You are on page 1of 183

Introduction

4 Instruction tables
Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
By Agner Fog. Copenhagen University College of Engineering. Copyright 1996 - 2012. Last updated 2012-02-29.

Introduction
This is the fourth in a series of five manuals: 1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. 2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms. 3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. 5. Calling conventions for different C++ compilers and operating systems.
The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed below. The present manual contains tables of instruction latencies, throughputs and micro-operation breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere. The discrepancies can be explained by the following factors: My figures are experimental values while figures published by microprocessor vendors may be based on theory or simulations. My figures are obtained with a particular test method under particular conditions. It is possible that different values can be obtained under other conditions. Some latencies are difficult or impossible to measure accurately, especially for memory access and type conversions that cannot be chained. Latencies for moving data from one execution unit to another are listed explicitly in some of my tables while they are included in the general latencies in some tables published by Intel. Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64bit). Values for far calls and interrupts may be different in different modes. Call gates have not been tested. Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

Page 1

Introduction If any text in the pdf version of this manual is unreadable, then please refer to the spreadsheet version. Copyright notice

This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See www.gnu.org/copyleft/fdl.html

Page 2

Definition of terms

Definition of terms
Operands Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x or xmm = 128 bit xmm register, y = 256 bit ymm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity may increase the latencies by possibly more than 100 clock cycles on many processors, except in move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results may give a similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way. Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long depencency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64 bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc. And the upper 64 bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc. as shown in the figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles per instruction plus one extra clock cycle in the end. The latency in this case is listed as 4 in the tables because this is the value it adds to a dependency chain.

Latency

Reciprocal throughput

The throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The values listed are the reciprocals of the throughputs, i.e. the average number of clock cycles per instruction when the instructions are not part of a limiting dependency chain. For example, a reciprocal throughput of 2 for FMUL means that a new FMUL instruction can start executing 2 clock cycles after a previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution units can handle 3 integer additions per clock cycle. The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency. The values listed are for a single thread or a single core. A missing value in the table means that the value has not been measured.

Page 3

Definition of terms ops Uop or op is an abbreviation for micro-operation. Processors with out-of-order cores are capable of splitting complex instructions into ops. For example, a read-modify instruction may be split into a read-op and a modify-op. The number of ops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of ops per clock cycle. The execution core of a microprocessor has several execution units. Each execution unit can handle a particular category of ops, for example floating point additions. The information about which execution unit a particular op goes to can be useful for two purposes. Firstly, two ops cannot execute simultaneously if they need the same execution unit. And secondly, some processors have a latency of an extra clock cycle when the result of a op executing in one execution unit is needed as input for a op in another execution unit. The execution units are clustered around a few execution ports on most Intel processors. Each op passes through an execution port to get to the right execution unit. An execution port can be a bottleneck because it can handle only one op at a time. Two ops cannot execute simultaneously if they need the same execution port, even if they are going to different execution units. This indicates which instruction set an instruction belongs to. The instruction is only available in processors that support this instruction set. The different instruction sets are listed at the end of this manual. Availability in processors prior to 80386 does not apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not apply to 128-bit packed integer instructions, which require SSE2. Availability in the SSE instruction set does not apply to double precision floating point instructions, which require SSE2. 32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only available under operating systems that support this register set.

Execution unit

Execution port

Instruction set

How the values were measured


The values in the tables are measured with the use of my own test programs, which are available from www.agner.org/optimize/testp.zip The time unit for all measurements is CPU clock cycles. It is attempted to obtain the highest clock frequency if the clock frequency is varying with the workload. Many Intel processors have a performance counter named "core clock cycles". This counter gives measurements that are independent of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or turn off the power-saving feature in the BIOS setup. Instruction throughputs are measured with a long sequence of instructions of the same kind, where subsequent instructions use different registers in order to avoid dependence of each instruction on the previous one. The input registers are cleared in the cases where it is impossible to use different registers. The test code is carefully constructed in each case to make sure that no other bottleneck is limiting the throughput than the one that is being measured. Instruction latencies are measured in a long dependency chain of identical instructions where the output of each instruction is needed as input for the next instruction. The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a larger number of instructions is desired.

Page 4

Definition of terms It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time. A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and XMM vector registers. The value that can be measured is the combined latency of data transfer from one type of registers to another type and back again (A B A). The division of this latency between the A B latency and the B A latency is sometimes obvious, sometimes based on guesswork, op counts, indirect evidence, or triangular sequences such as A B Memory A. In many cases, however, the division of the total latency between A B latency and B A latency is arbitrary. However, what cannot be measured cannot matter for performance optimization. What counts is the sum of the A B latency and the B A latency, not the individual terms. The op counts are usually measured with the use of the performance monitor counters (PMCs) that are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation. The execution ports and execution units that are used by each instruction or op are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that can give this information directly. In other cases it is necessary to obtain this information indirectly by testing whether a particular instruction or op can execute simultaneously with another instruction/op that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to another. This delay can be used for detecting whether two different instructions/ops are using the same or different execution units.

Page 5

Microprocessors tested

Microprocessor versions tested


The tables in this manual are based on testing of the following microprocessors

Processor name
AMD K7 Athlon AMD K8 Opteron AMD K10 Opteron AMD Bulldozer AMD Bobcat Intel Pentium Intel Pentium MMX Intel Pentium II Intel Pentium III Intel Pentium 4 Intel Pentium 4 EM64T Intel Pentium M Intel Core Duo Intel Core 2 (65 nm) Intel Core 2 (45 nm) Intel Core i7 Intel Core i5 Intel Atom 330 VIA Nano L2200 VIA Nano L3050

Family Model Microarchitecture number number Code name (hex) (hex) Comment
6 F 10 15 14 5 5 6 6 F F 6 6 6 6 6 6 6 6 6 6 5 2 1 1 2 4 6 7 2 4 D E F 17 1A 2A 1C F F Step. 2, rev. A5 Stepping A 2350, step. 1 FX-6100, step 2 E350, step. 0 Stepping 4

Bulldozer, Zambezi Bobcat P5 P6 P6 Netburst Netburst, Prescott Dothan Yonah Merom Wolfdale Nehalem Sandy Bridge Diamondville Isaiah

Stepping 4, rev. B0 Xeon. Stepping 1 Stepping 6, rev. B1 Not fully tested T5500, Step. 6, rev. B2 E8400, Step. 6 i7-920, Step. 5, rev. D0 i5-2500, Step 7 Step. 2 Step. 2 Step. 8 (prerel. sample)

Page 6

AMD K7

AMD K7
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m). This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALU's. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.

Ops: Latency:

Reciprocal throughput:

Execution unit:

Integer instructions
Instruction Move instructions MOV MOV Operands Ops Latency Reciprocal throughput 1 1 1/3 1/3 Execution unit ALU ALU Any addr. mode. Add 1 clk if code segment base ALU, AGU 0 ALU, AGU do. AGU do. AGU AH, BH, CH, DH Any other 8-bit register AGU Any addressing mode AGU AGU Notes

r,r r,i

1 1

MOV MOV MOV MOV MOV MOV MOV MOV

r8,m8 r16,m16 r32,m32 m8,r8H m8,r8L m16/32,r m,i r,sr

1 1 1 1 1 1 1 1

4 4 3 8 2 2 2 2 Page 7

1/2 1/2 1/2 1/2 1/2 1/2 1/2 1

AMD K7 MOV MOVZX, MOVSX MOVZX, MOVSX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL IMUL sr,r/m r,r r,m r,r r,m r,r r,m r i m sr 6 1 1 1 1 3 3 2 1 1 2 2 1 9 2 3 6 9 2 9 2 1 4 2 1 10 1 9-13 1 4 1 2 16 5 8 1/3 1/2 1/3 1/2 1 16 1 1 1 1 1 4 1 1 10 18 1 4 1 1/3 2 2 1 9 1/3 ALU ALU, AGU ALU ALU, AGU ALU Timing depends ALU, AGU on hw ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU Any addr. size AGU Any addr. size ALU ALU ALU ALU

r m DS/ES/FS/GS SS

r16,[m] r32,[m]

3 2 3 2 1 1

r,m r

r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m

r8/m8 r16/m16 r32/m32 r16,r16/m16

1 1 1 1 1 1 1 1 1 1 9 12 16 4 31 3 3 3 2

1 1 7 1 1 7 1 1 7 5 6 7 5 13 3 3 4 3 Page 8

1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3 5 6 7

2 2 3 2

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0 latency ax=3, dx=4

AMD K7 IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE CWD, CDQ Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC, BTR, BTS BSF BSR r32,r32/m32 r16,(r16),i r32,(r32),i r16,m16,i r32,m32,i r8/m8 r16/m16 r32/m32 r8 r16 r32 m8 m16 m32 2 2 2 3 3 32 47 79 41 56 88 42 57 89 1 1 4 4 5 2.5 1 2 2 2 23 23 40 17 25 41 17 25 41 1/3 1/3 ALU0 ALU0 ALU0 ALU0 ALU0 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU

24 24 40 17 25 41 17 25 41 1 1

r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r r,r r,r

1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4 8 19 23

1 1 7 1 1 1 7 1 1 1 4 3 3 3 7 7 5 8 6 7 4 4 7 1

2 7 7 6 7 9 Page 9

1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 4 3 3 3 3 4 4 4 4 3 2 3 3 1/3 1/2 2 1 2 2 3 7 9

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU ALU

AMD K7 BSF BSR SETcc SETcc CLC, STC CMC CLD STD r,m r,m r m 20 23 1 1 1 1 2 3 8 10 1 8 10 1/3 1/2 1/3 1/3 1 2 ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU

Control transfer instructions JMP short/near JMP JMP JMP JMP Jcc J(E)CXZ LOOP CALL CALL CALL CALL CALL RETN RETN RETF RETF IRET INT BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS i i m far r m(near) m(far) short/near short short near far r m(near) m(far) i

1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33 6 2 23-32

ALU low values = real mode

2 2 25-33 1/3 - 2 1/3 - 2 3-4 2

ALU ALU, AGU low values = real mode rcp. t.= 2 if jump rcp. t.= 2 if jump

3-4 2 23-32 3 3 24-33 3 3 24-35 24-35 81 42

ALU ALU ALU ALU

low values = real mode 3 3 ALU ALU, AGU low values = real mode 3 3 ALU ALU low values = real mode low values = real mode real mode real mode values are for no jump values are for no jump

2 2

4 5 4 3 7 4 5 5 7 6

2 2 2 1 3 1-4 2 2 6 3-4 Page 10

2 2 2 1 3 1-4 2 2 6 3-4

values per count values per count values per count values per count values per count

AMD K7 Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC

1 1 i,0 3 8-9 16-17 19-28 5 9

0 0 12

1/3 1/3 12 3 5 27

ALU ALU 12 3 ops, 5 clk if 16 bit

44-74 11 11

Floating point x87 instructions


Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 Operands Ops Latency Reciprocal throughput 2 4 16 41 2 3 7 0 9 7 1/2 1/2 4 39 1/2 1 5 188 0.4 1 1 1 Execution unit FA/M FANY Notes

r m32/64 m80 m80 r m32/64 m80 m80 r m m

1 1 7 30 1 1 10 260 1 1 1 1

FA/M FMISC

FMISC
FMISC, FA/M

FMISC Low latency immediately after FMISC, FA/M FCOMI FANY FANY Low latency immediately after FMISC, ALU FCOM FTST FMISC, ALU do. FMISC, ALU do. FMISC, ALU faster if FMISC, ALU unchanged

FCMOVcc FFREE FINCSTP, FDECSTP

st0,r r

9 1 1

6 0

5 1/3 1/3

FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P)

AX AX m16 m16 m16

2 3 2 3 14

6-12 6-12

12 12 8 1 42

r/m m r/m m r/m

1 2 1 2 1

4 4 4 4 11-25 Page 11

1 1-2 1 2 8-22

FADD FADD,FMISC FMUL FMUL,FMISC FMUL Low values are for round divisors

AMD K7 FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR m r/m r m 2 1 1 1 1 2 1 2 5 1 1 12-26 2 2 2 3 2 10 7-10 8-11 9-23 1 1 1 1 1 1 2 3 8 8 FMUL,FMISC FMUL FADD FADD FADD
FADD, FMISC

do.

FADD FMISC, ALU FMUL FMUL

1 44 51 76 46 72 5 7 8 49 63

35 90-100 90-100 100-150 100-200 160-170 8 11 27 126 147

12

FMUL

1 1 7 25 76 65 44 85

0 0

1/3 1/3 24 92 147 120 59 87

FANY ALU FMISC FMISC

Integer MMX instructions


Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVNTQ PACKSSWB/DW PACKUSWB Operands Ops Latency Reciprocal throughput 7 9 2 2 1/2 1 1/2 1/2 1 2 2 Execution unit FMICS, ALU FANY, ALU FANY FMISC FA/M FANY FMISC FMISC FA/M Notes

r32, mm mm, r32 mm,m32 m32, r mm,mm mm,m64 m64,mm m,mm mm,r/m

2 2 1 1 1 1 1 1 1

2 Page 12

AMD K7 PUNPCKH/LBW/WD PSHUFW MASKMOVQ PMOVMSKB PEXTRW PINSRW Arithmetic instructions PADDB/W/D PADDSB/W PADDUSB/W PSUBB/W/D PSUBSB/W PSUBUSB/W mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m 1 1 1 1 1 1 1 2 2 3 3 2 2 3 1/2 1/2 1 1 1/2 1/2 1 FA/M FA/M FMUL FMUL FA/M FA/M FADD mm,r/m mm,mm,i mm,mm r32,mm r32,mm,i mm,r32,i 1 1 32 3 2 2 2 2 2 1/2 24 3 2 2 FA/M FA/M FADD FMISC, ALU FA/M

5 12

PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RLW/D/Q PSRAW/D Other EMMS

mm,r/m mm,i/mm/m

1 1

2 2

1/2 1/2

FA/M FA/M

1/3

FANY

Floating point XMM instructions


Instruction Move instructions MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHLPS, MOVLHPS MOVHPS, MOVLPS MOVHPS, MOVLPS MOVNTPS MOVMSKPS SHUFPS Operands Ops Latency Reciprocal throughput 2 1 2 2 1 2 2 1 1 1 1/2 1/2 1 4 2 3 Execution unit FA/M FMISC FMISC FA/M Notes

r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r r32,r r,r/m,i

2 2 2 2 5 5 1 2 1 1 1 1 2 3 3

2 4 3 2

3 Page 13

FA/M FANY FMISC FMISC FA/M FMISC FMISC FMISC FADD FMUL

AMD K7 UNPCK H/L PS Conversion CVTPI2PS CVT(T)PS2PI CVTSI2SS CVT(T)SS2SI Arithmetic ADDSS SUBSS ADDPS SUBPS MULSS MULPS r,r/m 2 3 3 FMUL

xmm,mm mm,xmm xmm,r32 r32,xmm

1 1 4 2

4 6 10 3

FMISC FMISC FMISC FMISC

r,r/m r,r/m r,r/m r,r/m

1 2 1 2

4 4 4 4

1 2 1 2

FADD FADD FMUL FMUL Low values are for round divisors, e.g. powers of 2. do.

DIVSS DIVPS RCPSS RCPPS MAXSS MINSS MAXPS MINPS CMPccSS CMPccPS COMISS UCOMISS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS RSQRTSS RSQRTPS Other LDMXCSR STMXCSR

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 2 1 2 1 2 1 2 1

11-16 18-30 3 3 2 2 2 2 2

8-13 18-30 1 2 1 2 1 2 1

FMUL FMUL FMUL FMUL FADD FADD FADD FADD FADD

r,r/m

FMUL

r,r/m r,r/m r,r/m r,r/m

1 2 1 2

19 36 3 3

16 36 1 2

FMUL FMUL FMUL FMUL

m m

8 3

9 10

3DNow instructions (obsolete)


Instruction Operands Ops Latency Reciprocal throughput 1/2 1 1 1 1 1/2 Execution unit AGU FMISC FMISC FMISC FMISC FA/M Notes Move and convert instructions PREFETCH(W) m PF2ID mm,mm PI2FD mm,mm PF2IW mm,mm PI2FW mm,mm PSWAPD mm,mm

1 1 1 1 1 1

5 5 5 5 2 Page 14

3DNow E 3DNow E 3DNow E

AMD K7 Integer instructions PAVGUSB PMULHRW Floating point instructions PFADD/SUB/SUBR PFCMPEQ/GE/GT PFMAX/MIN PFMUL PFACC PFNACC, PFPNACC PFRCP PFRCPIT1/2 PFRSQRT PFRSQIT1 Other FEMMS

mm,mm mm,mm

1 1

2 3

1/2 1

FA/M FMUL

mm,mm mm,mm mm,mm mm,mm mm,mm mm,mm mm,mm mm,mm mm,mm mm,mm

1 1 1 1 1 1 1 1 1 1

4 2 2 4 4 4 3 4 3 4

1 1 1 1 1 1 1 1 1 1

FADD FADD FADD FMUL FADD FADD FMUL FMUL FMUL FMUL

3DNow E

mm,mm

1/3

FANY

Page 15

K8

AMD K8
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Ops: Latency:

Reciprocal through- This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent input: struction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALU's. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.

Integer instructions
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV Operands Ops Latency Reciprocal throughput 1 1 4 4 3 3 8 3 3 3 3 2 9-13 1/3 1/3 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2-1 8 Execution unit Notes

r,r r,i r8,m8 r16,m16 r32,m32 r64,m64 m8,r8H m8,r8L m16/32/64,r m,i m64,i32 r,sr sr,r/m

1 1 1 1 1 1 1 1 1 1 1 1 6

ALU ALU ALU, AGU ALU, AGU AGU AGU AGU AGU AGU AGU AGU

Any addressing mode. Add 1 clock if code segment base 0 AH, BH, CH, DH Any other 8-bit register
Any addressing mode

Page 16

K8 MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE IN OUT m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m sr 1 1 1 1 1 1 1 3 3 2 1 1 2 2 5 9 2 3 4-6 7-9 25 9 2 1 1 4 1 1 10 1 1 1 6 1 7 270 300 1 4 1 1 2 16 5 1 1 1 1 2 4 1 1 8 28 10 4 3 2 2 3 1 1 1 2-3 1/3 1/2 1/3 1/2 1/3 1/2 1 16 1 1 1 1 2 4 1 1 8 28 10 4 1 1/3 1/3 2 1/3 1/3 9 1/3 1/2 1/2 8 5 16 AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU AGU AGU ALU ALU ALU ALU AGU AGU Timing depends on hw

r m
DS/ES/FS/GS

SS

r16,[m] r32,[m] r64,[m]

Any address size Any address size Any address size

r,m r m m

r,i/DX i/DX,r

Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG

r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m

1 1 1 1 1 1 1 1 1 1

1 1 7 1 1 7 1 1 7 Page 17

1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU

K8 AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,R OR 9 12 16 4 31 1 3 2 2 1 1 1 2 1 1 3 3 3 31 46 78 143 40 55 87 152 41 56 88 153 1 1 5 6 7 5 13 3 3-4 3 4-5 3 3 4 4 3 4 5 6 7 ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU

r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8 r16 r32 r64 m8 m16 m32 m64

15 23 39 71 17 25 41 73 17 25 41 73 1 1

1 2 1 2 1 1 2 1 1 2 2 2 2 15 23 39 71 17 25 41 73 17 25 41 73 1/3 1/3

latency ax=3, dx=4 latency rax=4, rdx=5

r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL

1 1 1 1 1 1 1 1 1 1 9 7 9 7 1

1 1 7 1 1 1 7 1 1 1 3 3 4 3 7 Page 18

1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 3 3 4 3 3

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU

K8 RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF BSF BSR BSF BSF BSF BSR SETcc SETcc CLC, STC CMC CLD STD m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r16/32,r r64,r r,r r16,m r32,m r64,m r,m r m 1 10 9 9 8 6 7 8 1 1 5 2 5 4 8 8 21 22 28 20 22 25 28 1 1 1 1 1 2 7 9 8 7 8 3 3 6 1 4 4 4 4 3 3 3 3 1/3 1/2 2 1 2 2 5 3 8 9 10 8 9 10 10 1/3 1/2 1/3 1/3 1/3 1/3 ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU

2 7 7 5 8 8 9 10 8 9 10 10 1

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET INT i

1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33

2 23-32 2 2 25-33 1/3 - 2 1/3 - 2 3-4 2 3 3 3 3

ALU
low values = real mode

ALU ALU, AGU


low values = real mode

3-4 2 23-32 3 3 24-33 3 3 24-35 24-35 81 42 Page 19

ALU ALU ALU ALU ALU ALU, AGU

recip. thrp.= 2 if jump recip. thrp.= 2 if jump

low values = real mode

low values = real mode

ALU ALU
low values = real mode low values = real mode

real mode real mode

K8 BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC m 6 2 2 2
values are for no jump values are for no jump

4 2 5 2 4 2 1.5 - 2 0.5 - 1 7 3 3 1-2 5 2 5 2 2 3 6 2

2 2 2 0.5 - 1 3 1-2 2 2 3 2

values are per count values are per count values are per count values are per count values are per count

1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 6 10 9 12

1/3 1/3 12 3 5 27 7 7

ALU ALU 12 3 ops, 5 clk if 16 bit

Floating point x87 instructions


Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP Operands Ops Latency Reciprocal throughput 2 4 16 41 2 3 7 173 0 9 7 1/2 1/2 4 39 1/2 1 5 160 0.4 1 1 1 4 2 1/3 Execution unit Notes

r m32/64 m80 m80 r m32/64 m80 m80 r m m

1 1 7 30 1 1 10 260 1 1 1 1 9 1 1

FA/M FANY

FA/M FMISC

FMISC FMISC, FA/M FMISC Low latency immediFMISC, FA/M ately after FCOMI FANY FANY Low latency immediately after FCOM FMISC, ALU FTST FMISC, ALU do.

st0,r r

4-15 0

FNSTSW FSTSW

AX AX

2 3

6-12 6-12 Page 20

12 12

K8 FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR m16 m16 m16 2 3 18 8 1 50 FMISC, ALU FMISC, ALU FMISC, ALU do. faster if unchanged

r/m m r/m m r/m m r/m r m

1 2 1 2 1 2 1 1 1 1 2 1 2 5 1 1

4 4 4 4 11-25 12-26 2 2 2 3 2 10 7-10 8-11

1 1-2 1 2 8-22 9-23 1 1 1 1 1 1 1 3 8 8

FADD FADD,FMISC FMUL FMUL,FMISC Low values are for round divisors FMUL FMUL,FMISC do. FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU FMUL FMUL

1 1 66 73 98 67 97 5 7 53 72 75

27 140-190 150-190 170-200 150-180 217 8 12 126 179 175

12 1

FMUL FMISC

1 1 8 26 77 70 61 101

0 0

1/3 1/3 27 100 171 136 56 95

FANY ALU FMISC FMISC

Integer MMX and XMM instructions

Page 21

K8 Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW PINSRW Operands Ops Latency Reciprocal throughput 4 9 2 3 2 2 1/2 2 2 1 1 2 2 2 1/2 1 1/2 1 1 1 2 2 2 2 1/2 1 2 3 2 2 2 2 1 1/2 1.5 1/2 1 13 26 1 2 2 3 Execution unit Notes

r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r r64,mm/xmm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,mm/x xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,mm/x,i mm,r32,i xmm,r32,i

2 2 1 3 3 2 1 2 2 3 1 2 1 2 1 2 2 2 4 5 1 2 1 2 1 3 1 2 2 1 3 1 2 32 64 1 2 2 3

FMICS, ALU FANY, ALU FANY FMISC, ALU FANY FMISC Moves 64 bits.Name FMISC, ALU of instruction differs FANY, ALU do. FANY, ALU do. FA/M FA/M, FMISC FANY FANY, FMISC FMISC FA/M FMISC FMISC

4 9 9 2 2

2 2

FA/M FA/M, FMISC FMISC FMISC FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M

2 3 2 2 2 2 3 2 2

2 5 12 12

FADD FMISC, ALU FA/M FA/M

Arithmetic instructions

Page 22

K8 PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PMADDWD PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS

mm,r/m

1/2

FA/M

xmm,r/m mm,r/m xmm,r/m

2 1 2

2 2 2

1 1/2 1

FA/M FA/M FA/M

mm,r/m

FMUL

xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m

2 1 2 1 2 1 2 1 2

3 3 3 2 2 2 2 3 3

2 1 2 1/2 1 1/2 1 1 2

FMUL FMUL FMUL FA/M FA/M FA/M FA/M FADD FADD

mm,r/m xmm,r/m mm,i/mm/m x,i/x/m xmm,i

1 2 1 2 2

2 2 2 2 2

1/2 1 1/2 1 1

FA/M FA/M FA/M FA/M FA/M

1/3

FANY

Floating point XMM instructions


Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D Operands Ops Latency Reciprocal throughput 2 1 2 2 1 Execution unit Notes

r,r r,m m,r r,r

2 2 2 2

2 Page 23

FA/M FMISC FMISC FA/M

K8 MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVDDUP MOVSH/LDUP MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D HADDPS/D HSUBPS/D MULSS/D MULPS/D r,m m,r r,r r,m m,r r,r r,m m,r r,r r,r m,r r32,r r,r/m,i r,r/m 4 5 1 2 1 1 1 1 2 2 2 1 3 2 2 2 1 1 1 1/2 1 1 1 2 3 1 2 3

2 4 3 2

FA/M FANY FMISC FMISC FA/M FMISC FMISC SSE3 SSE3 FMISC FADD FMUL FMUL

2 2 8 3 3

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm

2 4 3 1 2 2 2 4 1 2 1 3 3 2 2 2

4 8 8 2 5 5 5 8 4 5 6 8 14 12 10 9

2 3 8 1 2 2 2 3 1 2 1 2 2 2 2 2

FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC

r,r/m r,r/m r,r/m r,r/m r,r/m

1 2 2 1 2

4 4 4 4 4

1 2 2 1 2

FADD FADD FADD FMUL FMUL SSE3

DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 2 1 2 1 2

11-16 18-30 11-20 16-34 3 3 Page 24

8-13 18-30 8-17 16-34 1 2

FMUL FMUL FMUL FMUL FMUL FMUL

Low values are for round divisors, e.g. powers of 2. do. do. do.

K8 MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 FADD FADD FADD FADD FADD

r,r/m

FMUL

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 2 1 2 1 2

19 36 27 48 3 3

16 36 24 48 1 2

FMUL FMUL FMUL FMUL FMUL FMUL

m m

8 3

9 10

Page 25

K10

AMD K10
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Ops: Latency:

Reciprocal through- This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent put: instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALU's. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.

Integer instructions
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV Operands Ops Latency Reciprocal throughput 1 1 4 4 3 3 8 3 3 3 3 3-4 8-26 1/3 1/3 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 8 Execution unit Notes

r,r r,i r8,m8 r16,m16 r32,m32 r64,m64 m8,r8H m8,r8L m16/32/64,r m,i m64,i32 r,sr sr,r/m

1 1 1 1 1 1 1 1 1 1 1 1 6

ALU ALU ALU, AGU ALU, AGU AGU AGU AGU AGU AGU AGU AGU

Any addressing mode. Add 1 clock if code segment base 0 AH, BH, CH, DH Any other 8-bit register Any addressing mode

from AMD manual

Page 26

K10 MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE IN OUT Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m sr 1 1 1 1 1 1 1 2 2 2 1 1 2 2 9 9 1 3 6 10 28 9 2 1 1 4 1 1 10 1 1 1 6 1 4 ~270 ~300 1 4 1 4 1 4 1 21 5 1 1/3 1/2 1/3 1/2 1/3 1/2 1 19 5 1/2 1/2 1 1 3 6 1/2 1 8 16 11 6 1 1/3 1/3 2 1/3 1 10 1/3 1/2 1/2 8 1 33 AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU AGU ALU ALU ALU ALU AGU AGU

Timing depends on hw

6 3 10 26 16 6 3 1 2 3 1 1 1

r m
DS/ES/FS/GS

SS

r16,[m] r32/64,[m] r32/64,[m]

Any address size 2 source operands W. scale or 3 opr.

r,m r m m

r,i/DX i/DX,r

r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m

1 1 1 1 1 1 1 1 1 1

1 4 1 4 1 1 7 Page 27

1/3 1/2 1 1/3 1/2 1 1/3 1/2 1/3 2

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU

K10 AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV IDIV DIV DIV DIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO 9 12 16 4 30 1 3 2 2 1 1 1 2 1 1 3 3 3 5 6 7 5 13 3 3 3 4 3 3 4 4 3 4 5 6 7 5 13 1 2 1 2 1 1 2 1 1 2 2 2 2 17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1/3 1/3 ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU

r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r8 m8 r16/m16 r32/m32 r64/m64 r16/m16 r32/m32 r64/m64

latency ax=3, dx=4


latency rax=4, rdx=5

1 1

17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1 1

Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide.

Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR RCL, RCR RCL RCR RCL

r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL

1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9

1 4 1 1 7 1 1 1 3 3 4 3 7 7 7 7 8 Page 28

1/3 1/2 1 1/3 1/2 1/3 1 1/3 1/3 1 3 3 4 3 1 1 5 6 6

ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU

K10 RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF BSR BSF BSR POPCNT LZCNT SETcc SETcc CLC, STC CMC CLD STD m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r,r r,r r,m r,m r,r/m r,r/m r m 8 6 7 8 1 1 5 2 5 4 8 8 6 7 7 8 1 1 1 1 1 1 1 2 7 3 3 7.5 1 7 2 9 9 8 8 4 4 7 7 2 2 1 5 2 3 6 1/3 1/2 2 1/3 1.5 1.5 10 7 3 3 3 3 1 1 1/3 1/2 1/3 1/3 1/3 2/3 ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU ALU ALU ALU

SSE4.A / SSE4.2 SSE4.A, AMD only

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET INT i BOUND m INTO String instructions LODS

1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33 6 2

2 23-32 2 2 25-33 1/3 - 2 2/3 - 2 3 2 3 3 3 3

ALU
low values = real mode

ALU ALU, AGU


low values = real mode

2 23-32 3 3 24-33 3 3 24-35 24-35 81 42

ALU ALU ALU ALU ALU ALU, AGU

recip. thrp.= 2 if jump recip. thrp.= 2 if jump

low values = real mode

low values = real mode

ALU ALU
low values = real mode low values = real mode

real mode real mode 2 2


values are for no jump values are for no jump

2 Page 29

K10 REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC 5 4 2 7 3 5 5 7 3 2 2 1 3 1 2 2 3 1 2 2 1 3 1 2 2 3 1
values are per count values are per count values are per count values are per count values are per count

1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 30 13

1/3 1/3 3 5 27 67 5

ALU ALU 12 3 ops, 5 clk if 16 bit

Floating point x87 instructions


Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions Page 30 Operands Ops Latency Reciprocal throughput 2 4 13 94 2 2 8 167 0 6 4 1/2 1/2 4 30 1/2 1 7 163 1/3 1 1 1 Execution unit Notes

r m32/64 m80 m80 r m32/64 m80 m80 r m m

1 1 7 20 1 1 10 218 1 1 1 1 9 1 1 2 3 2 3 12

FA/M FANY

FA/M FMISC

FMISC FMISC FMISC Low latency immeFMISC, FA/M diately after FCOMI FANY FANY
Low latency immediately after FMISC, ALU FCOM FTST

st0,r r

1/3 1/3 16 14 9 2 14

AX AX m16 m16 m16

FMISC, ALU FMISC, ALU FMISC, ALU FMISC, ALU

do. do. faster if unchanged

K10 FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR r/m m r/m m r/m m r/m r m 1 2 1 2 1 2 1 1 1 1 2 1 2 6 1 1 4 4 ? 31 2 1 4 1 4 24 24 2 1 1 1 1 1 1 37 7 7 FADD FADD,FMISC FMUL FMUL,FMISC FMUL FMUL,FMISC FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU FMUL FMUL

1 1 45 51 76 45 9 5 11 8 8 12

35 ~51? ~90? ~125? ~119 151? 9 9 65 13 114

35 1

FMUL FMISC

45? 29 41 30? 30? 44?

m m m m

1 1 8 26 77 70 61 85

0 0

162 133 63 89

1/3 1/3 28 103 149 149 58 79

FANY ALU FMISC FMISC

Integer MMX and XMM instructions


Instruction Move instructions MOVD MOVD MOVD MOVD Operands Ops Latency Reciprocal throughput 3 6 4 3 Page 31 1 3 1/2 1 Execution unit Notes

r32, mm mm, r32 mm,m32 r32, xmm

1 2 1 1

FADD FANY FADD

K10 MOVD MOVD MOVD xmm, r32 xmm,m32 m32,mm/x 2 1 1 6 2 2 3 1/2 1

FMISC Moves 64 bits. Name of instruction differs do. do.

MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ

r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,(x)mm,i (x)mm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i

1 2 2 1 1 1 1 1 1 1 2 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 32 64 1 2 2 3 3 1 1

3 6 6 2 2.5 4 2 2 2.5 2 2 2 3 2 2

1 3 3 1/2 1/3 1/2 1/2 1 1/3 1/2 1 1/2 2 1/3 1/3 1 1 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 13 24 1 1 3 2 2 1/2 1/2

FADD FMUL, ALU FA/M FANY FANY ? FMISC FANY ? FMUL,FMISC

FANY FANY FMISC FMUL,FMISC FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M

2 3 2 3 3 3 3 2 2

3 6 9 6 6 2 2

FADD FA/M FA/M FA/M FA/M FA/M

SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only

Arithmetic instructions

Page 32

K10 PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS

mm/xmm,r/m mm/xmm,r/m

1 1

2 2

1/2 1/2

FA/M FA/M

mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m

1 1 1 1 1

3 3 2 2 3

1 1 1/2 1/2 1

FMUL FMUL FA/M FA/M FADD

mm/xmm,r/m mm,i/mm/m x,i/(x)mm xmm,i

1 1 1 1

2 2 3 3

1/2 1/2 1/2 1/2

FA/M FA/M FA/M FA/M

1/3

FANY

Floating point XMM instructions


Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVMSKPS/D SHUFPS/D Operands Ops Latency Reciprocal throughput 2.5 2 2 2.5 2 3 2 2 2 3 4 1/2 1/2 1 1/2 1/2 2 1/2 1/2 1 1/2 1/2 1 3 1 1 1/2 Execution unit Notes

r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r m,r r32,r r,r/m,i

1 1 2 1 1 3 1 1 1 1 1 1 2 1 1 1

FANY ? FMUL,FMISC FANY ? FMISC FA/M ? FMISC FA/M FA/M FMISC FMUL,FMISC FMISC SSE4.A, AMD only FADD FA/M

3 3 Page 33

K10 UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D MULSS/D MULPS/D DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other Page 34 r,r/m 1 3 1/2 FA/M

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm

1 2 3 3 1 1 1 2 2 1 1 2 3 3 2 2

2 7 8 7 4 4 4 7 7 4 4 7 14 14 8 8

1 1 2 2 1 1 1 1 1 1 1 1 3 3 1 1

FMISC

FMISC FMISC FMISC

FMISC FMISC

FADD,FMISC FADD,FMISC

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 1 1 1 1 1 1 1 1 1 1

4 4 4 4 16 18 20 20 3 2 2 2 2

1 1 1 1 13 15 17 17 1 1 1 1 1 1

FADD FADD FMUL FMUL FMUL FMUL FMUL FMUL FMUL FADD FADD FADD FADD FADD

r,r/m

1/2

FA/M

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 1 1

19 21 27 27 3 3

16 18 24 24 1 1

FMUL FMUL FMUL FMUL FMUL FMUL

K10 LDMXCSR STMXCSR m m 12 3 12 12 10 11

3DNow instructions (obsolete)


Instruction Operands Ops Latency Reciprocal throughput 1/2 1 1 1 1 1/2 Execution unit Notes Move and convert instructions PREFETCH(W) m PF2ID mm,mm PI2FD mm,mm PF2IW mm,mm PI2FW mm,mm PSWAPD mm,mm Integer instructions PAVGUSB PMULHRW

1 1 1 1 1 1

5 5 5 5 2

AGU FMISC FMISC FMISC FMISC FA/M

3DNow extension 3DNow extension 3DNow extension

mm,mm mm,mm

1 1

2 3

1/2 1

FA/M FMUL

Floating point instructions PFADD/SUB/SUBR mm,mm PFCMPEQ/GE/GT mm,mm PFMAX/MIN mm,mm PFMUL mm,mm PFACC mm,mm PFNACC, PFPNACC mm,mm PFRCP mm,mm PFRCPIT1/2 mm,mm PFRSQRT mm,mm PFRSQIT1 mm,mm Other FEMMS

1 1 1 1 1 1 1 1 1 1

4 2 2 4 4 4 3 4 3 4

1 1 1 1 1 1 1 1 1 1

FADD FADD FADD FMUL FADD FADD FMUL FMUL FMUL FMUL

3DNow extension

mm,mm

1/3

FANY

Thank you to Xucheng Tang for doing the measurements on the K10.

Page 35

Bulldozer

AMD Bulldozer
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listing for register and memory operand are joined (r/m). This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Indicates which execution pipe or unit is used for the macro-operations: Integer pipes: EX0: integer ALU, division EX1: integer ALU, multiplication, jump EX01: can use either EX0 or EX1 AG01: address generation unit 0 or 1 Floating point and vector pipes: P0: floating point add, mul, div, convert, shuffle, shift P1: floating point add, mul, div, shuffle, shift P2: move, integer add, boolean P3: move, integer add, boolean, store P01: can use either P0 or P1 P23: can use either P2 or P3 Two macro-operations can execute simultaneously if they go to different execution pipes Tells which execution unit domain is used: ivec: integer vector execution unit. fp: floating point execution unit. fma: floating point multiply/add subunit. inherit: the output operand inherits the domain of the input operand. ivec/fma means the input goes to the ivec domain and the output comes from the fma domain. There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of a fp or fma instruction, and when the output of a fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts. An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instuction, and 6+1 if the output goes to an ivec or store instruction.

Ops: Latency:

Reciprocal throughput:

Execution pipe:

Domain:

Page 36

Bulldozer

Integer instructions
Instruction Move instructions MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVSX MOVZX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB Operands Ops Latency Reciprocal throughput 1 1 4 4 5 1 5 4 1 5 1 1 ~50 6 0.5 0.5 0.5 1 1 2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 ~50 2 1 1 1.5 4 9 1 1 19 8 Execution pipes EX01 EX01 AG01 EX01 AG01 Notes

r,r r,i r,m m,r m,i m,r r,r r,m r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m

1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 8 9 1 2 34 14 2 1 1 1 4 2 1 1 1 1 6 1 6

all addr. modes all addr. modes

EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01


Timing depends on hw

r m

r16,[m] r32,[m] r32/64,[m] r32/64,[m]

2-3 2-3 2 1 3 2 1 1 0.5 0.5 2 1 1 0.5 0.5 0.5 89 0.25 89

EX01 EX01 EX01 EX01

any addr. size 16 bit addr. size scale factor > 1 or 3 operands all other cases

r m m

EX01

r,r r,i r,m

1 1 1

1 1

0.5 0.5 0.5

EX01 EX01 EX01

Page 37

Bulldozer ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CDQ, CQO CWD Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST m,r m,i r,r r,i r,m m,r m,i r,r r,i r,m r m 1 1 1 1 1 1 1 1 1 1 1 1 10 16 20 4 9 1 2 1 1 1 1 1 2 1 1 2 2 2 14 18 16 16 33 36 36 36 1 1 2 7-8 7-8 1 1 1 9 9 1 1 1 7-8 6 9 10 6 20 4 4 4 6 4 4 6 5 4 6 1 1 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01

1 1 1 0.5 0.5 0.5 0.5 1

r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64

20 15-27 16-43 16-75 23 23-33 22-48 22-79 1 1 1

20 2 2 2 4 2 2 4 2 2 4 2 2 4 20 15-28 16-43 16-75 20 20-27 20-43 20-75 0.5 1

EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX01 EX01 EX01

r,r r,i r,m m,r m,i r,r r,i

1 1 1 1 1 1 1

1 1 7-8 7-8 1 1 Page 38

0.5 0.5 0.5 1 1 0.5 0.5

EX01 EX01 EX01 EX01 EX01 EX01 EX01

Bulldozer TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL RCL RCL RCR RCR RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC, BTR, BTS BTC, BTR, BTS BSF BSF BSR BSR LZCNT POPCNT SETcc SETcc CLC, STC CMC CLD STD m,r m,i r m r,i/CL r,i/CL r,1 r,i r,cl r,1 r,i r,cl r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,r r,r r,m r,r r,m r,r r,r/m r m 1 1 1 1 1 1 1 16 17 1 15 16 6 7 8 1 1 7 2 4 10 6 8 7 9 1 1 1 1 1 1 2 2 0.5 0.5 0.5 1 0.5 0.5 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX0 EX1 EX01 EX01 EX01 EX01

7 1 1 1 8 9 1 8 8 3 4 1

3 4 4 2 4 1

3 3.5 3.5 0.5 0.5 3.5 1 2 5 3 4 4 5 2 2 0.5 1 0.5 3 4

SSE4.A SSE4.2

Control transfer instructions JMP short/near JMP r JMP m Jcc short/near fused CMP+Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short CALL near CALL r CALL m RET RET i BOUND m INTO

1 1 1 1 1 1 1 1 2 2 3 1 4 11 4 Page 39

2 2 2 1-2 1-2 1-2 1-2 1-2 2 2 2 2 2-3 5 24

EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1

2 if jumping 2 if jumping 2 if jumping 2 if jumping 2 if jumping

for no jump for no jump

Bulldozer String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG LOCK CMPXCHG CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC CRC32 CRC32 CRC32 XGETBV

3 6n 3 2n 3 per 16B 5 2n 4 per 16B 3 7n 6 9n

3 3n 3 2n 3 per 16B 3 2n 3 per 16B 3 4n 3 4n

small n best case small n best case

m,r m,r m,r m8,r8 m8,r8 m,r16/32/64 m,r16/32/64 m64 m64 m64 m64

1 4 4 5 5 6 6 18 18 22 22

~55 10 ~51 15 ~51 14 ~52 15 ~53 52 ~94

a,0 a,b

r32,r8 r32,r16 r32,r32

1 1 40 13 11+5b 2 37-63 36 22 3 5 5 4

3 5 6

0.25 0.25 43 22 16+4b 4 112-280 42 30 2 5 6 31

none none

Floating point x87 instructions


Instruction Move instructions FLD FLD Operands Ops Latency Reciprocal throughput 2 8 Page 40 0.5 1 Execution pipes P01 Domain, notes

r m32/64

1 1

fp fp

Bulldozer FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FLDCW FNSTCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 m80 m80 r m32/64 m80 m80 r m m st0,r r AX m16 m16 m16 8 60 1 2 13 239 1 1 2 1 8 1 1 4 3 1 3 14 61 2 8 9 240 0 12 8 3 0 ~13 ~13 4 40 0.5 1 20 244 0.5 1 1 0.5 3 0.25 0.25 22 19 3 2 P0 P1 P2 P3 P01 fp fp fp fp fp fp inherit fp fp fp fp inherit

P0 P1 F3 P01 F3 P0 F3 P01 P0 P1 F3 none none P0 P2 P3 P0 P2 P3

r/m m r/m m r m m r/m r m

1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1

5-6 5-6 10-42

1 2 1 2 5-18

~20 4 19-62 19-65

0.5 0.5 0.5 1 1 0.5 0.5 1

P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P0 P1 F3 P01 P01 P01 P0 P0 P0

fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp fp

1 1 10-162 160-170 12-166 11-190 10-355 8 12 10 10-175 10-175

10-53 65-210 ~160 95-160 95-245 60-440 52 10 64-71 60-290 60-320 Page 41 0.5 65-210 ~160 95-160 95-245 60-440 5

P01 P01 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3

Bulldozer Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR

m864 m864

1 1 18 31 103 76

300 312

0.25 0.25 57 170 300 312

none none P0 P0 P0 P1 P2 P3 P0 P3

Integer MMX and XMM instructions


Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA VMOVDQA VMOVDQA VMOVDQA MOVDQU MOVDQU MOVDQU LDDQU VMOVDQU VMOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/D Q PUNPCKHQDQ PUNPCKLQDQ PSHUFB PSHUFD PSHUFW PSHUFL/HW PALIGNR PBLENDW Operands Ops Latency Reciprocal throughput 8 10 6 5 2 6 5 0 6 5 2 6 5 0 6 5 6 6 6 2 2 6 6 6 2 2 2 2 2 3 2 2 2 2 2 Page 42 1 1 0.5 1 0.5 0.5 1 0.25 0.5 1 0.5 1 3 0.25 0.5 1 0.5 1-2 10 0.5 0.5 2 2 0.5 1 1 1 1 1 1 1 1 1 1 0.5 Execution pipes Notes

r32/64, mm/x mm/x, r32/64 mm/x,m32 m32,mm/x mm/x,mm/x mm/x,m64 m64,mm/x xmm,xmm xmm,m m,xmm ymm,ymm ymm,m256 m256,ymm xmm,xmm xmm,m m,xmm xmm,m ymm,m256 m256,ymm mm,xmm xmm,mm m,mm m,xmm xmm,m (x)mm,r/m (x)mm,r/m (x)mm,r/m xmm,r/m xmm,r/m (x)mm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i (x)mm,r/m,i xmm,r/m

1 2 1 1 1 1 1 1 1 1 2 2 4 1 1 1 1 2 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

P23 P3 none P3 P23 P3 none P3

inherit

inherit

P2 P3 P23 P23 P3 P3 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P23

SSE4.1

Bulldozer MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB/W/D/Q PINSRB/W/D/Q PMOVSXBW/BD/BQ/ WD/WQ/DQ PMOVZXBW/BD/BQ/ WD/WQ/DQ mm,mm xmm,xmm r32,mm/x r,x/mm,i x/mm,r,i xmm,xmm xmm,xmm 31 64 2 2 2 1 1 38 48 10 10 12 2 2 37 61 1 1 2 1 1 P3 P1 P3 P1 P3 P1 P3 P1 P1 P1

AVX

SSE4.1 SSE4.1

Arithmetic instructions PADDB/W/D/Q/SB/SW /USB/USW (x)mm,r/m PSUBB/W/D/Q/SB/SW /USB/USW (x)mm,r/m PCMPEQ/GT B/W/D (x)mm,r/m PMULLW PMULHW PMULHUW PMULUDQ (x)mm,r/m xmm,r/m PMULLD xmm,r/m PMULDQ (x)mm,r/m PMULHRSW PMADDWD (x)mm,r/m PMADDUBSW (x)mm,r/m PAVGB/W (x)mm,r/m PMIN/MAX SB/SW/ SD UB/UW/UD (x)mm,r/m PHMINPOSUW xmm,r/m PABSB/W/D (x)mm,r/m PSIGNB/W/D (x)mm,r/m PSADBW (x)mm,r/m MPSADBW x,x,i Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM Encryption PCLMULQDQ

1 1 1

2 2 2

0.5 0.5 0.5

P23 P23 P23

1 1 1 1 1 1 1 1 2 1 1 2 8

4 5 4 4 4 4 2 2 4 2 2 4 8

1 2 1 1 1 1 0.5 0.5 1 0.5 0.5 1 4

P0 P0 P0 P0 P0 P0 P23 P23 P1 P23 P23 P23 P23 P1 P23

SSE4.1 SSE4.1 SSSE3

SSE4.1 SSSE3 SSSE3 SSE4.1

(x)mm,r/m (x)mm,r/m (x)mm,i xmm,i xmm,r/m

1 1 1 1 2

2 3 2 2

0.5 1 1 1 1

P23 P1 P1 P1 P1 P3

SSE4.1

x,x,i x,x,i x,x,i x,x,i

27 27 7 7

17 10 14 7

10 10 3 4

P1 P2 P3 P1 P2 P3 P1 P2 P3 P1 P2 P3

SSE4.2 SSE4.2 SSE4.2 SSE4.2

xmm,r/m

12

P1

pclmul

Page 43

Bulldozer AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST Other EMMS x,x x,x x,x x,x x,x x,x,i 2 2 2 2 1 1 5 5 5 5 5 5 2 2 2 2 1 1 P01 P01 P01 P01 P0 P0 aes aes aes aes aes aes

0.25

Floating point XMM and YMM instructions


Instruction Move instructions MOVAPS/D MOVUPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD VBLENDPS/PD BLENDVPS/PD VBLENDVPS/PD Operands Ops Latency Reciprocal throughput Execution pipes Domain, notes

x,x y,y x,m128 y,m256 m128,x m256,y m256,y x,x x,m32/64 m32/64,x x,m64 m64,x m64,x x,x r32,x r32,y m128,x m256,y x,x/m,i y,y,y/m,i x,x,x/m y,y,y/m x,x/m,i y,y/m,i y,y,y,i y,y,m,i x,x/m,i y,y,y/m,i x,x/m,xmm0 y,y,y/m,y

1 2 1 2 1 4 8 1 1 1 1 2 1 1 2 1 1 2 1 2 1 2 8 10 1 2 1 2

0 2 6 6 5 5 6 2 6 5 7 8 7 2 10 6 2 2 3 3 2 2 4 2 2 2 2 Page 44

0.25 0.5 0.5 1-2 1 3 10 0.5 0.5 1 1 1 1 1 1 2 1 2 1 2 1 2 3 4 0.5 1 1 2

none P23

inherit ivec

P3 P3 P2 P3 P01

fp

P1 P3 P3 P1 P1 P3 P3 P1 P1 P1 P1 P1 P1 P23 P23 P23 P23 P1 P1

ivec

ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec

Bulldozer MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS VCVTPD2PS CVTPS2PD VCVTPS2PD CVTSD2SS CVTSS2SD CVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x/m y,y,y/m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 1 1 2 2 1 2 2 2 1 1 2 1 1 2 2 2 1 2 1 1 2 2 1 2 18 34 2 2 6 6 6 6 2 2 2 2 10 14 2 7 2 2 9 9 9 22 25 1 0.5 2 1 0.5 0.5 0.5 0.5 1 0.5 2 1 1 2 1 1 1 1 1 1 1 1 0.5 1 7 13 P1 P1 ivec ivec

P23 P23 P23 P1 P1 P1 P1 P1 P3 P1 P3 P23 P23 P1 P1 P23 P23 P01 P01 P0 P1 P2 P3 P0 P1 P2 P3

ivec ivec ivec ivec

ivec

ivec

x,x x,y x,x y,x x,x x,x x,x y,y x,x y,y x,x y,x x,x x,y x,mm mm,x x,mm mm,x x,r32 r32,x x,r32/64 r32/64,x

2 4 2 4 1 1 1 2 1 2 2 4 2 4 1 1 2 2 2 2 2 2

7 7 7 7 4 4 4 4 4 4 7 8 7 7 4 4 7 7 14 13 14 13 Page 45

1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1

P01 P01 P01 P01 P0 P0 P0 P0 P0 P0 P01 P01 P01 P01 P0 P0 P0 P1 P0 P1 P0 P0 P0 P0

fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp

Bulldozer Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULSD MULPS MULPD VMULPS VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS/PS VRCPPS CMPSS/D CMPPS/D VCMPPS/D COMISS/D UCOMISS/D MAXSS/SD/PS/PD MINSS/SD/PS/PD VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/ PD DPPS DPPS VDPPS VDPPS DPPD DPPD Math SQRTSS/PS VSQRTPS SQRTSD/PD VSQRTPD RSQRTSS/PS VRSQRTPS

x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x x,m128 y,y,y y,y,m x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y/m x,x/m y,y,y/m x,x/m x,x/m y,y,y/m x,x/m,i y,y/m,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i

1 1 2 1 2 3 4 8 10 1 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 16 18 25 29 15 17

5-6 5-6 5-6 5-6 5-6 10

0.5 0.5 1 0.5 1 2 2

P01 P01 P01 P01 P01 P01 P1 P01 P1 P01 P1 P01 P1 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P3 P01 P01 P0 P0 P01 P23 P01 P23 P01 P3 P01 P3 P01 P23 P01 P23

fma fma fma fma fma ivec/fma ivec/fma ivec/fma ivec/fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp fp fma fma fma fma fma fma

10

4 4 0.5 0.5 1 4.5-9.5 9-19 4.5-11 9-22 1 2 0.5 1 1

5-6 5-6 5-6 9-24 9-24 9-27 9-27 5 5 2 2

2 2 4 4 25 27 15

0.5 1 1 2 6 7 13 13 5 6

x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m

1 2 1 2 1 2

14-15 14-15 24-26 24-26 5 5

4.5-12 9-24 4.5-16.5 9-33 1 2

P01 P01 P01 P01 P01 P01

fp fp fp fp fp fp

Page 46

Bulldozer Logic
AND/ANDN/OR/XORPS/ PD

x,x/m y,y,y/m

1 2

2 2

0.5 1

P23 P23

ivec ivec

VAND/ANDN/OR/XOR PS/PD Other VZEROUPPER VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVE XRSTOR

m32 m32 m4096 m4096 m m

9 16 17 32 1 2 67 116 122 177

10 19 136 176 196 250

4 5 6 10 4 19 136 176 196 250

P2 P3 P2 P3 P0 P3 P0 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3

32 bit mode 64 bit mode 32 bit mode 64 bit mode

AMD-specific instructions
Instruction 3DNow instructions PREFETCH/W SSE4A instructions LZCNT POPCNT POPCNT EXTRQ EXTRQ INSERTQ INSERTQ MOVNTSS/SD XOP instructions VFRCZSS/SD/PS/PD VFRCZSS/SD/PS/PD VPCMOV VPCMOV VPPERM VPCOMB/W/D/Q VPCOMUB/W/D/Q VPHADDBW/BD/BQ/ WD/WQ/DQ VPHADDUBW/BD/BQ/ WD/WQ/DQ VPHSUBBW/WD/DQ VPMACSWW/WD Operands Ops Latency Reciprocal throughput 0.5 Execution pipes Notes

r,r r16/32,r16/32 r64,r64 x,i,i x,x x,x,i,i x,x m,x

2 1 1 1 1 1 1 1

2 4 4 3 3 3 3

2 2 4 1 1 1 1 4

P1 P1 P1 P1 P3

x,x x,m x,x,x,x/m y,y,y,y/m x,x,x,x/m x,x,x/m,i x,x,x/m,i x,x/m x,x/m x,x/m x,x,x/m,x

2 3 1 2 1 1 1 1 1 1 1

10 10 2 2 2 2 2 2 2 2 4 Page 47

2 2 1 2 1 0.5 0.5 0.5 0.5 0.5 1

P01 P01 P1 P1 P1 P23 P23 P23 P23 P23 P0

latency 0 if i=6, 7 latency 0 if i=6, 7

Bulldozer VPMACSDD VPMACSDQH/L VPMACSSWW/WD VPMACSSDD VPMACSSDQH/L VPMADCSWD VPMADCSSWD VPROTB/W/D/Q VPROTB/W/D/Q VPSHAB/W/D/Q VPSHLB/W/D/Q FMA4 instructions VFMADDSS/SD VFMADDSSPS/PD VFMADDSSPS/PD VFMSUBSS/SD VFMSUBSSPS/PD VFMSUBSSPS/PD VFMADDSUBPS/PD VFMADDSUBPS/PD VFMSUBADDPS/PD VFMSUBADDPS/PD VFNMADDSS/SD VFNMADDSSPS/PD VFNMADDSSPS/PD VFNMSUBSS/SD VFNMSUBSSPS/PD VFNMSUBSSPS/PD x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m x,x,i x,x,x/m x,x,x/m 1 1 1 1 1 1 1 1 1 1 1 5 4 4 5 4 4 4 3 2 3 3 2 1 1 2 1 1 1 1 1 1 1 P0 P0 P0 P0 P0 P0 P0 P1 P1 P1 P1

x,x,x,x/m x,x,x,x/m y,y,y,y/m x,x,x,x/m x,x,x,x/m y,y,y,y/m x,x,x,x/m y,y,y,y/m x,x,x,x/m y,y,y,y/m x,x,x,x/m x,x,x,x/m y,y,y,y/m x,x,x,x/m x,x,x,x/m y,y,y,y/m

1 1 2 1 1 2 1 2 1 2 1 1 2 1 1 2

5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6 5-6

0.5 0.5 1 0.5 0.5 1 0.5 1 0.5 1 0.5 0.5 1 0.5 0.5 1

P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01

fma fma fma fma fma fma fma fma fma fma fma fma fma fma fma fma

Page 48

Bobcat

AMD Bobcat
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value. Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.

Ops: Latency:

Execution pipe:

Integer instructions
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG Operands Ops Latency Reciprocal throughput 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 4 4 7 6 1 5 1 5 1 1 Page 49 1/2 1/2 1 1 1 1 1 1/2 1 1/2 1 1/2 1 1 Execution pipe I0/1 I0/1 AGU AGU AGU AGU AGU I0/1 Notes

r,r r,i r,m m,r m8,r8H m,i m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r

Any addressing mode Any addressing mode AH, BH, CH, DH

I0/1 I0/1

Bobcat XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH SFENCE LFENCE MFENCE r,m r i m 3 2 1 1 3 9 9 1 4 29 9 2 1 1 1 4 1 1 1 1 1 1 4 1 4 20 5 1 1 2 6 9 1 4 22 8 2 1/2 1 1/2 2 1/2 1/2 1 1 1 ~45 1 ~45
Timing depends on hw

r m

r16,[m] r32/64,[m] r32/64,[m] r64,[m]

3 1 2-4 4 1 1 1

I0 I0/1 I0 I0/1 I0/1 I0/1 AGU AGU AGU AGU AGU AGU

Any address size no scale, no offset w. scale or offset RIP relative

r m m m

AMD only

Arithmetic instructions ADD, SUB r,r/i ADD, SUB r,m ADD, SUB m,r ADC, SBB r,r/i ADC, SBB r,m ADC, SBB m,r/i CMP r,r/i CMP r,m INC, DEC, NEG r INC, DEC, NEG m AAA AAS DAA DAS AAD AAM MUL, IMUL r8/m8 MUL, IMUL r16/m16 MUL, IMUL r32/m32 MUL, IMUL r64/m64 IMUL r16,r16/m16 IMUL r32,r32/m32 IMUL r64,r64/m64

1 1 1 1 1 1 1 1 1 1 9 9 12 16 4 33 1 3 2 2 1 1 1

1 6-7 1 1 6 5 10 7 8 5 23 3 3-5 3-4 6-7 3 3 6 Page 50

1/2 1 1 1 1 1/2 1 1/2

I0/1

I0/1

I0/1 I0/1

23 1 2 1 1 4

I0 I0 I0 I0 I0 I0 I0

latency ax=3, dx=5 latency eax=3, edx=4 latency rax=6, rdx=7

Bobcat IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF, BSR BSF, BSR POPCNT r16,(r16),i r32,(r32),i r64,(r64),i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 2 1 1 1 1 1 1 1 1 1 1 1 1 4 3 7 27 33 49 81 29 37 55 81 1 1 3 1 4 27 33 49 81 29 37 55 81 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0/1 I0/1

r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r,r r,m r,r/m

1 1 1 1 1 1 1 1 1 1 9 7 9 9 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4-5 8 8 11 11 9

1 1 1 1 1 5 4 6 5 7 7 18

1/2 1 1 1/2 1 1/2 1 1/2 1/2 1 5 4 5 4 1 1 ~15 ~14 15 15 3 4 15 1/2 1 3 1 15 15 13 15 6 6 5

I0/1

I0/1 I0/1 I0/1 I0/1 I0/1

3 4 18

16 15 6 12 Page 51

SSE4.A/SSE4.2

Bobcat LZCNT SETcc SETcc CLC, STC CMC CLD STD r,r/m r m 8 1 1 1 1 1 2 5 1 SSE4.A, AMD only 1/2 1 1/2 1/2 1 2

I0/1 I0/1 I0 I0,I1

Control transfer instructions JMP short/near JMP r JMP m(near) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL r CALL m(near) RET RET i BOUND m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC

1 1 1 1 2 8 2 2 5 1 4 8 4

2 2 2 1/2 - 2 1-2 4 2 2 2 ~3 ~4 4 2

recip. thrp.= 2 if jump recip. thrp.= 2 if jump

values are for no jump values are for no jump

4 5 4 2 7 2 5 6 7 6

~3 ~3 2 5

values are per count best case 6-7 Byte/clk best case 5 Byte/clk

3 3 4 3

values are per count values are per count

1 0 1 0 6 i,0 12 a,b 10+6b 2 30-52 70-830 26 14

1/2 1/2 6

I0/1 I0/1 36 34+6b

3 87 8

32 bit mode

Floating point x87 instructions


Instruction Operands Ops Latency Reciprocal throughput Page 52 Execution pipe Notes

Bobcat Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FMUL(P) FIMUL FDIV(R)(P) FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN r m32/64 m80 m80 r m32/64 m80 m80 r m m st0,r r AX m16 m16 m16 1 1 7 21 1 1 16 217 1 1 1 1 12 1 1 2 2 3 12 2 6 14 30 2 6 19 177 0 9 6 7 1 ~20 ~20 1/2 1 5 35 1/2 1 9 180 1 1 1 1 7 1 1 10 10 2 10 FP0/1 FP0/1

FP0/1 FP1

FP1 FP1 FP1 FP0/1 FP1 FP1 FP1 FP1 FP0 FP1

r m m r m m r m m r m r m

1 1 2 1 1 2 1 1 2 1 1 1 1 1 2 1 2 5 1 1

3 3 5 5 19

1 1 3 3 3 19 19 19 2 1 1 1 2 1 1 2

11 11-16 11-19

FP0 FP0 FP0,FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0, FP1 FP0 FP1 FP0, FP1 FP1 FP1

1 31 1 4-44 27-105 11-51 51-94 11-75 48-110 ~45 ~113 Page 53

1 27-105 51-94 48-110 ~113

FP1 FP0 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1

Bobcat FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 9-75 49-163 5 8 7 9 30-56 ~60 8 29 12 44 49-163 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1

m m m m

1 1 9 26 85 80 71 111

0 0

1/2 1/2 30 78 163 123 105 118

FP0, FP1 ALU FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1

Integer MMX and XMM instructions


Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU, LDDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB Operands Ops Latency Reciprocal throughput 1 1 1 1 3 2 1 1 2 3 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 3 7 7 5 6 6 5 6 7 7 7 1 1 5 5 6 1 6 6 6-9 6-9 1 1 13 13 1 2 Page 54 1 3 1 1 3 1 2 1 3 3 1/2 1 1 1 2 1 2 3 2-5.5 3-6 1/2 1 1.5 3 1/2 2 Execution pipe FP0 FP0/1 FP0/1 FP0 FP1 FP1 FP1 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP1 FP1 FP0/1 AGU FP1 AGU FP1 FP0/1 FP0/1 FP1 FP1 FP0/1 FP0/1 Moves 64 bits.Name of instruction differs do. do. Notes

r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32,(x)mm r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m

Bobcat PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFB PSHUFB PSHUFD PSHUFW PSHUFL/HW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ Arithmetic instructions PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PHADD/SUBW/SW/D PHADD/SUBW/SW/D PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMULHRSW PMULHRSW PMADDWD PMADDWD PMADDUBSW PMADDUBSW mm,r/m xmm,r/m xmm,r/m xmm,r/m mm,mm xmm,xmm xmm,xmm,i mm,mm,i xmm,xmm,i xmm,xmm,i mm,mm xmm,xmm r32,(x)mm r32,(x)mm,i mm,r32,i xmm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i 1 2 2 1 1 6 3 1 2 20 32 64 1 2 2 3 3 3 1 1 1 1 1 1 2 3 2 1 2 19 146-1400 279-3000 8 12 10 10 3-4 3-4 1 2 1/2 1 1 1/2 1 3 2 1/2 2 12 130-1170 260-2300 2 2 6 3 3 1 2

FP0, FP1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0 FP0, FP1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0/1 FP0/1

Suppl. SSE3 Suppl. SSE3

Suppl. SSE3

SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only

mm,r/m

1/2

FP0/1

xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m

2 1 2 1 2

1 1 4 1 1

1 1/2 1 1/2 1

FP0/1 FP0/1 FP0/1 FP0/1 FP0/1

Suppl. SSE3 Suppl. SSE3

mm,r/m

FP0

xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m

2 1 2 1 2 1 2

2 2 2 2 2 2 2 Page 55

2 1 2 1 2 1 2

FP0 FP0 FP0 FP0 FP0 FP0 FP0

Suppl. SSE3 Suppl. SSE3

Suppl. SSE3 Suppl. SSE3

Bobcat PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1/2 1 1/2 1 1/2 1 1/2 1 2 2 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0 FP0, FP1

Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3

mm,r/m xmm,r/m mm,i/mm/m


xmm,i/xmm/m

1 2 1 2 2

1 1 1 1 1

1/2 1 1 1 1

FP0/1 FP0/1 FP0/1 FP0/1 FP0/1

xmm,i

1/2

FP0/1

Floating point XMM instructions


Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVDDUP MOVDDUP MOVSHDUP, MOVSLDUP Operands Ops Latency Reciprocal throughput 2 2 2 2 2 2 1 2 1 1 1 1 2 1 2 2 2 1 6 6 1 6-9 6-9 1 6 5 1 6 5 12 12 2 7 1 Page 56 1 2 3 1 2-6 3-6 1/2 2 2 1/2 2 3 3 2 1 2 1 Execution pipe FP0/1 AGU FP1 FP0/1 AGU FP1 FP0/1 FP1 FP1 FP0/1 AGU FP1 FP1 FP1 FP0/1 FP0/1 FP0/1 Notes

r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r m,r r,r r,m64 r,r

SSE4.A, AMD only SSE3 SSE3

Bobcat MOVSHDUP, MOVSLDUP MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SS2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULSS MULSD MULPS MULPD DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,m r32,r r,r/m,i r,r/m 2 1 3 2 12 ~6 2 1 3 2 2 1 AGU FP0 FP0/1 FP0/1

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm

2 4 3 1 2 2 2 4 1 2 1 3 3 2 2 2

5 5 5 4 4 5 4 6 4 5 4 6 12 11 12 11

2 3 3 1 4 2 4 3 2 2 1 2 3 3 1 1

FP1 FP0, FP1 FP0, FP1 FP1 FP1 FP1 FP1 FP0, FP1 FP1 FP1 FP1 FP0, FP1 FP0, FP1 FP1 FP0, FP1 FP0, FP1

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 2 2 2 1 1 2 2 1 2 1 2 1 2 1 2 1 2 1

3 3 3 3 2 4 2 4 13 38 17 34 3 3 2 2 2 2

1 2 2 2 1 2 2 4 13 38 17 34 1 2 1 2 1 2 1

FP0 FP0 FP0 FP0 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0

SSE3 SSE3

r,r/m

FP0/1

Page 57

Bobcat Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 14 48 24 48 3 3 14 48 24 48 1 2 FP1 FP1 FP1 FP1 FP1 FP1

m m

12 3

10 11

FP0, FP1 FP0, FP1

Page 58

Intel Pentium

Intel Pentium and Pentium MMX


List of instruction timings
Explanation of column headings:
Operands r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable.

Clock cycles

Pairability

Integer instructions (Pentium and Pentium MMX)


Instruction NOP MOV MOV MOV MOV XCHG XCHG XCHG XLAT PUSH POP PUSH POP PUSH POP PUSHF POPF PUSHA POPA PUSHAD POPAD LAHF SAHF MOVSX MOVZX LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP CMP TEST TEST TEST Operands r/m, r/m/i r/m, sr sr , r/m m , accum (E)AX, r r,r r,m r/i r m m sr sr Clock cycles Pairability 1 uv 1 uv 1 np >= 2 b) np 1 uv h) 2 np 3 np >15 np 4 np 1 uv 1 uv 2 np 3 np 1 b) np >= 3 b) np 3-5 np 4-6 np 5-9 i) np 5 np 2 np 3 a) np 1 uv 4 c) np 1 uv 2 uv 3 uv 1 u 2 u 3 u 1 uv 2 uv 1 uv 2 uv 1 f) Page 59

r , r/m r,m m r , r/i r,m m , r/i r , r/i r,m m , r/i r , r/i m , r/i r,r m,r r,i

Intel Pentium TEST INC DEC INC DEC NEG NOT MUL IMUL MUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWDE CWD CDQ SHR SHL SAR SAL SHR SHL SAR SAL SHR SHL SAR SAL ROR ROL RCR RCL ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc JMP CALL JMP CALL conditional jump CALL JMP RETN RETN RETF RETF J(E)CXZ LOOP BOUND CLC STC CMC CLD STD CLI STI LODS REP LODS STOS REP STOS MOVS m,i r m r/m r8/r16/m8/m16 all other versions r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32 2 1 3 1/3 11 9 d) 17 25 41 22 30 46 3 2 1 3 4/5 1/3 1/3 4/5 8/10 7/9 4 a) 5 a) 4 a) 4 a) 9 a) 7 a) 8 a) 14 a) 7-73 a) 1/2 a) 1 e) >= 3 e) 1/4/5/6 e) 2/5 e 2/5 e 3/6 e) 4/7 e) 5/8 e) 4-11 e) 5-10 e) 8 2 6-9 2 7+3*n g) 3 10+n g) 4 Page 60 np uv uv np np np np np np np np np np np u u np u np np np np np np np np np np np np np np v np v np np np np np np np np np np np np np np np

r,i m,i r/m, CL r/m, 1 r/m, i(><1) r/m, CL r/m, i(><1) r/m, CL r, i/CL m, i/CL r, r/i m, i m, i r, r/i m, i m, r r , r/m r/m short/near far short/near r/m i i short short r,m

Intel Pentium REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS BSWAP CPUID RDTSC Notes: a 12+n g) 4 9+4*n g) 5 8+4*n g) 1 a) 13-16 a) 6-13 a) j) np np np np np np np np

b c d e f g h i j

This instruction has a 0FH prefix which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction. versions with FS and GS have a 0FH prefix. see note a. versions with SS, FS, and GS have a 0FH prefix. see note a. versions with two operands and no immediate have a 0FH prefix, see note a. high values are for mispredicted jumps/branches. only pairable if register is AL, AX or EAX. add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD). pairs as if it were writing to the accumulator. 9 if SP divisible by 4 (imperfect pairing). on P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.

Floating point instructions (Pentium and Pentium MMX)


Explanation of column headings
Operands Clock cycles r = register, m = memory, m32 = 32-bit memory operand, etc. The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably. + = pairable with FXCH, np = not pairable with FXCH. Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions. Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions. (WAIT is considered a floating point instruction here)

Pairability i-ov

fp-ov

Instruction FLD FLD FBLD FST(P) FST(P) FST(P) FBSTP FILD

Operand r/m32/m64 m80 m80 r m32/m64 m80 m80 m

Clock cycles Pairability 1 0 3 np 48-58 np 1 np 2 m) np 3 m) np 148-154 np 3 np Page 61

i-ov 0 0 0 0 0 0 0 2

fp-ov 0 0 0 0 0 0 0 2

Intel Pentium FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FCHS FABS FCOM(P)(P) FUCOM FIADD FISUB(R) FIMUL FIDIV(R) FICOM FTST FXAM FPREM FPREM1 FRNDINT FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FXCH FINCSTP FDECSTP FFREE FNCLEX FNINIT FNSAVE FRSTOR WAIT Notes: m n o p m 6 2 5 s) 6 q) 8 2 3 3 3 19/33/39 p) 1 1 6 6 22/36/42 p) 4 1 17-21 16-64 20-70 9-20 20-32 12-66 70 65-100 r) 89-112 r) 53-59 r) 103 r) 105 r) 120-147 r) 112-134 r) 1 1 2 2 6-9 12-22 124-300 70-95 1 np np np np np np 0 0 0 0 0 0 np np np np np np np np np np np np np np np np np np np np np np np np np np np np 0 0 2 0 0 0 2 2 2 38 o) 0 0 2 2 38 o) 0 0 4 2 2 0 5 0 69 o) 2 2 2 2 2 36 o) 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 2 2 n) 2 0 0 2 2 2 0 0 0 2 2 0 0 0 2 2 2 2 2 2 0 2 0 0 0 0 0 0 0 0 0

AX/m16 m16 m16 r/m r/m r/m r/m r/m m m m m

r r

m m

q r

The value to store is needed one clock cycle in advance. 1 if the overlapping instruction is also an FMUL. Cannot overlap integer multiplication instructions. FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bit 8-9 of the floating point control word. The first 4 clock cycles can overlap with preceding integer instructions. Clock counts are typical. Trivial cases may be faster, extreme cases may be slower. Page 62

Intel Pentium s May be up to 3 clocks more when output needed for FST, FCHS, or FABS.

MMX instructions (Pentium MMX)


A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one multiplication per clock cycle. The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX. There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes when you store data from an MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance. This is analogous to the floating point store instructions. All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Page 63

Pentium II and III

Intel Pentium II and Pentium III


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops that the instruction generates for each execution port. Port 0: ALU, etc. Port 1: ALU, jumps Instructions that can go to either port 0 or 1, whichever is vacant first. Port 2: load data, etc. Port 3: address generation for store Port 4: store data This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.

ops: p0: p1: p01: p2: p3: p4: Latency:

Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.

Integer instructions (Pentium Pro, Pentium II and Pentium III)


Instruction MOV MOV MOV MOV MOV MOV MOV MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH POP POP PUSH POP PUSH POP Operands p0 r,r/i r,m m,r/i r,sr m,sr sr,r sr,m r,r r,m r,r r,m r,r r,m r/i r (E)SP m m sr sr p1 ops p01 p2 p3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 2 1 5 2 8 1 1 1 1 1 1 1 1 1 1 1 1 high b) Latency p4 Reciprocal throughput

1 1 5 8

8 7

1 1 1

1 1 1

Page 64

Pentium II and III PUSHF(D) POPF(D) PUSHA(D) POPA(D) LAHF SAHF LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP TEST CMP TEST INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CWD CDQ SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc 3 10 11 6 2 2 1 1 1 8 8 1 c) 8 1 1 1 2 2 3 1 1 1 1 1 1 1 1 1 2 3 3 2 2 2 1 r,i/CL m,i/CL r,1 r8,i/CL r16/32,i/CL m,1 m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r/i r,r/i m,r/i r,r r,m r 1 1 1 4 3 1 4 4 2 2 1 1 1 1 1 1 4 3 2 3 2 1 1 6 1 6 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 4 15 4 4 19 23 39 19 23 39 3 1 1 1 1 1 1 1 1 8 1

r,m m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m

r,(r),(i) (r),m r8 r16 r32 m8 m16 m32

1 1 1

1 1 12 21 37 12 21 37

1 1 1 1 1 1 1

1 1 1 1

1 1 1 1

Page 65

Pentium II and III SETcc JMP JMP JMP JMP JMP conditional jump CALL CALL CALL CALL CALL RETN RETN RETF RETF J(E)CXZ LOOP LOOP(N)E ENTER ENTER LEAVE BOUND CLC STC CMC CLD STD CLI STI INTO LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS BSWAP NOP (90) Long NOP (0F 1F) CPUID RDTSC IN OUT PREFETCHNTA d) PREFETCHT0/1/2 d) SFENCE d) Notes m short/near far r m(near) m(far) short/near near far r m(near) m(far) i i short short short i,0 a,b r,m 23 23 2 2 ca. 7 1 8 8 12 18 +4b 2 6 1 4 1 1 1 1 1 21 1 1 21 1 1 28 1 1 28 1 1 2 3 2 4 1 1 1 2 1 1 3 3 1 1 2 1 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2

1 b-1 1 2

1 2b

9 17 5 2 10+6n ca. 5n 1 ca. 6n 1 12+7n 4 12+9n r 1 1 1 1 0.5 1 2 1 a) 3 a) 2 1 1 1 1

23-48 31 18 18 m m 1 1 1 1

>300 >300

Page 66

Pentium II and III a) b) c) d) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". Has an implicit LOCK prefix. 3 if constant without base or index register P3 only.

Floating point x87 instructions (Pentium Pro, II and III)


Instruction FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Operands p0 r m32/64 m80 m80 r m32/m64 m80 m80 r m m 1 2 38 1 2 165 3 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 6 6 6 6 1 1 23 33 30 1 1 1 1 2 2 1 2 2 1 2 2 1 p1 ops p01 p2 p3 Latency p4 Reciprocal throughput

0 5 5

f)

r AX m16 m16 m16 r m r m r m

2 7 1 1 1 1 1 1 1 1 3 3-4 5 5-6 38 h) 38 h) 2 1 1 1 1 1 1 1 2 g) 2 g) 37 37 1 10

r m r m m m m m

1 1 1 1 1 1 1

1 2

Page 67

Pentium II and III FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FINCSTP FDECSTP FFREE FFREEP FNCLEX FNINIT FNSAVE FRSTOR WAIT Notes: e) f) g) 56 15 1 17-97 18-110 17-48 36-54 31-53 21-102 25-86 1 1 1 2 3 13 141 72 2 Not pipelined FXCH generates 1 op that is resolved by register renaming without going to any port. FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles. FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Reciprocal throughput is 1/ (latency-1). Faster for lower precision.

27-103 29-130 66 103 98-107 13-143 44-143

69 e) e) e) e) e) e) e)

e,i)

r r

h)

i)

Integer MMX instructions (Pentium II and Pentium III)


Instruction MOVD MOVQ MOVD MOVQ MOVD MOVQ PADD PSUB PCMP PADD PSUB PCMP PMUL PMADD PMUL PMADD PAND(N) POR PXOR PAND(N) POR PXOR PSRA PSRL PSLL PSRA PSRL PSLL PACK PUNPCK PACK PUNPCK EMMS MASKMOVQ d) Operands p0 r,r mm,m32/64 m32/64,mm mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm/i mm,m64 mm,mm mm,m64 mm,mm p1 ops p01 p2 p3 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 Page 68 1 1 1 1 1 1 1 6 k) 2-8 Latency p4 1 1 1 3 3 1 Reciprocal throughput 0.5 1 1 0.5 1 1 1 0.5 1 1 1 1 1 2 - 30

1 1

Pentium II and III PMOVMSKB d) MOVNTQ d) PSHUFW d) PSHUFW d) PEXTRW d) PINSRW d) PINSRW d) PAVGB PAVGW d) PAVGB PAVGW d) PMIN/MAXUB/SW d) PMIN/MAXUB/SW d) PMULHUW d) PMULHUW d) PSADBW d) PSADBW d) Notes: d) k) r32,mm m64,mm mm,mm,i mm,m64,i r32,mm,i mm,r32,i mm,m16,i mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 2 3 4 5 6 1 1 1 - 30 1 1 1 1 1 0.5 1 0.5 1 1 1 2 2

P3 only. The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.

Floating point XMM instructions (Pentium III)


Instruction MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHPS MOVLPS MOVHPS MOVLPS MOVLHPS MOVHLPS MOVMSKPS MOVNTPS CVTPI2PS CVTPI2PS CVT(T)PS2PI CVTPS2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVTSS2SI ADDPS SUBPS ADDPS SUBPS ADDSS SUBSS ADDSS SUBSS Operands p0 xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32 m32,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 p1 ops p01 p2 p3 2 2 2 4 1 4 1 1 1 1 1 1 1 2 2 2 2 1 2 2 1 1 2 2 1 1 1 2 1 2 1 2 2 1 Latency p4 1 2 3 2 3 1 1 1 1 1 1 1 3 4 3 4 4 5 3 4 3 3 3 3 Reciprocal throughput 1 2 2 4 4 1 1 1 1 1 1 1 2 - 15 1 2 1 1 2 2 1 2 2 2 1 1

2 4

1 1

1 2

Page 69

Pentium II and III MULPS MULPS MULSS MULSS DIVPS DIVPS DIVSS DIVSS AND(N)PS ORPS XORPS AND(N)PS ORPS XORPS MAXPS MINPS MAXPS MINPS MAXSS MINSS MAXSS MINSS CMPccPS CMPccPS CMPccSS CMPccSS COMISS UCOMISS COMISS UCOMISS SQRTPS SQRTPS SQRTSS SQRTSS RSQRTPS RSQRTPS RSQRTSS RSQRTSS RCPPS RCPPS RCPSS RCPSS SHUFPS SHUFPS UNPCKHPS UNPCKLPS UNPCKHPS UNPCKLPS LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm,i xmm,m128,i xmm,xmm xmm,m128 m32 m32 m4096 m4096 2 2 1 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 11 6 116 89 1 2 2 2 2 1 2 1 2 2 1 2 1 1 2 1 2 1 2 1 4 4 4 4 48 48 18 18 2 2 3 3 3 3 3 3 3 3 1 1 56 57 30 31 2 3 1 2 2 3 1 2 2 2 3 3 15 7 62 68 2 2 1 1 34 34 17 17 2 2 2 2 1 1 2 2 1 1 1 1 56 56 28 28 2 2 1 1 2 2 1 1 2 2 2 2 15 9

Page 70

Pentium M

Intel Pentium M, Core Solo and Core Duo


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one. The number of ops for each execution port. Fused ops count as two. Port 0: ALU, etc. Port 1: ALU, jumps Instructions that can go to either port 0 or 1, whichever is vacant first. Port 2: load data, etc. Port 3: address generation for store Port 4: store data This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The average number of clock cycles per instruction for a series of independent instructions of the same kind.

ops fused domain: ops unfused domain: p0: p1: p01: p2: p3: p4: Latency:

Reciprocal throughput:

Integer instructions
Instruction Operands ops fused domain ops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 through put 0.5 1 1 1

Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc

r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r32 r,r r,m r,r r,m

1 1 1 2 1 2 8 8 2 1 1 2 2

1 1 1 1 1 1 8 7 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 5 8

2 0.5 1 1.5

Page 71

Pentium M XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 SFENCE/LFENCE/MFENCE IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL IMUL IMUL MUL IMUL MUL IMUL IMUL IMUL DIV IDIV r,r r,m r i m sr 3 7 2 1 2 2 2 16 18 1 3 2 10 17 10 1 2 1 2 11 1 1 2 3 4 1 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 8 2 high b) 1 1 2 1.5 1 1 1 1 6 8

1 3 1 11 2 2 9 6 2 1 1 1 1 1 1 1 8

r (E)SP m sr

1 16 7 1 1 1

10

7 1 1

r,m r m m m

1 1 1

1 8

3 1 1 1 1 >300 >300

1 1 6

18 18

r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r m,i r m

r8 r16/r32 r,r r,r,i m8 m16/m32 r,m r,m,i r8

1 1 3 2 2 7 1 1 2 1 3 1 3 4 1 3 1 1 1 3 1 2 5

1 1

1 1 1 1 1 4 1 1 1 1 1 2 2

1 1 1 1 1 1 1

1 2 1 1 2 1 1 1 1 1 1 1 2 15 4 5 4 4 4 5 4 4 15-16 c)

0.5 1 1 2

0.5 1 1 0.5

1 1 1 1 3 1 1 1 3 1 1 4 1

1 1 1 1 1

1 1 1 1 1 1 1 1 12

Page 72

Pentium M DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CWD CDQ Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST TEST SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r16 r32 m8 m16 m32 4 4 6 5 5 1 1 3 3 4 3 3 1 1 1 1 1 1 1 15-24 c) 15-39 c) 15-16 c) 15-24 c) 15-39 c) 1 1 12-20 c) 12-20 c) 12 12-20 c) 12-20 c) 1 1

1 1 1

r,r/i r,m m,r/i r,r/i m,r m,i r,i/CL m,i/CL r,1 r8,i/CL r8,i/CL r16/32,i/CL m,1 m8,i/CL m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m

1 1 3 1 1 2 1 3 2 9 8 6 7 12 11 10 2 4 1 8 2 1 10 3 2 2 1 2 1 4

1 1 1 1 1 1 1 1 1 5 4 3 2 6 5 5 2 1 1 7 1 1 7 1 1 1 1 1 1 1 1

1 1 1 1 1

1 2 1 1 1 1 1 1 1 2 11 10 9

0.5 1 1 0.5 1 1 1 2

1 4 4 3 2 3 3 2 1

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1

1 1 1 1 2 1 1 1 2

1 1

1 1

1 4

1 1 7

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) conditional jump short/near J(E)CXZ short LOOP short LOOP(N)E short

1 22 1 2 25 1 2 11 11

1 21 1 1 23 1 1 1 1 1 8 8 1 1 2

2 2

1 28 1 2 31 1 1 6 6

Page 73

Pentium M CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CLI STI ENTER ENTER LEAVE CPUID RDTSC Notes: a) b) c) near far r m(near) m(far) i i r,m 4 32 4 4 35 2 3 27 27 15 5 1 27 1 1 29 1 1 24 24 7 2 1 2 1 2 1 1 3 3 2 1 1 1 2 1 1 2 1 2 1 1 2 2 27 9 2 30 2 2 30 30 8 4

6 5

2 6n 3 5n 6 6n 3 7n 6 9n

2 10+6n ca. 5n 1 ca. 6n 1 12+7n 4 12+9n 1 a) 3 a) 2 2 1 1 1 1

4 0.5 1 0.7 0.7 0.5 1.3 0.6 0.7 0.5

1 1 2 9 17 i,0 a,b 12 ca. 3 38-59 13 38-59 13

1 1 2

0.5 1

10 18 +4b 2

1 1 b-1 2b 1 ca. 130 42

Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". Has an implicit LOCK prefix. High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm.

Floating point x87 instructions


Instruction Operands ops fused domain ops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 through put

Move instructions Page 74

Pentium M FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE FFREEP FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT r m32/64 m80 m80 r m32/m64 m80 m80 r m m m 1 1 4 40 1 1 6 169 1 4 4 4 1 2 2 3 2 3 3 1 1 2 142 72 1 2 38 1 2 165 3 2 2 1 2 2 3 1 1 1 1 1 2 142 72 1 1 1 1 1 1 2 2 1 2 2 1 2 2 1 1

1 3 167 0.33 f) 2 2 2

0 5 5 5

r AX m16 m16 m16 r r

2 7 1 1 1 1 1 1 1

3 19 3 1 2 131 91

r m r m r m

r m r m m m m

1 1 1 1 1 1 1 1 1 1 2 1 6 6 6 6 1 1 26 15

1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1

1 1 1 1 1 1 1

3 5 5 3

3 3 5 5 9-38 c) 9-38 c) 1 1 1 1 1 1 3 5 9-38 c)

2 1 1

1 1 2 2 8-37 c) 8-37 c) 1 1 1 1 1 1 3 3 8-37 c) 4 1 1

26 15

37 19

28 15

28 15

43 9

Page 75

Pentium M FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: c) f) g) 1 80-100 90-110 ~ 20 ~ 40 ~ 55 ~ 100 ~ 85 1 80-100 90-110 ~20 ~40 ~55 ~100 ~85 9 h) 80-110 100-130 ~45 ~60 ~65 ~140 ~140 8

1 2 3 14

1 1 3 14 1

1 1 13 27

High values are typical, low values are for low precision or round divisors. FXCH generates 1 op that is resolved by register renaming without going to any port. SSE3 instruction only available on Core Solo and Core Duo.

Integer MMX and XMM instructions


Instruction Operands ops fused domain ops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 through put 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 2 5-6 1 1 2 2-3 2-3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 1 1 1 1 0.5 1 1 1 1 1 1 2 2 2-10 4-20 2 1 1 2

Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ

r32,mm mm,r32 mm,m32 m32,mm r32,xmm xmm,r32 xmm,m32 m32, xmm mm,mm mm,m64 m64,mm xmm,xmm xmm,m64 m64, xmm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm

1 1 1 1 1 2 2 1 1 1 1 2 2 1 2 2 2 4 8 4 1 2 1 Page 76

1 1

Pentium M MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKHQDQ PUNPCKHQDQ PUNPCKLQDQ PUNPCKLQDQ PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULUDQ PMULUDQ m128,xmm mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 xmm,xmm xmm, m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm xmm,xmm r32,mm r32,xmm r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 4 1 1 3 4 1 1 2 3 2 3 1 1 1 2 3 4 2 3 3 8 1 1 2 4 1 2 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 j) 1 2 2 2 2 1 2 1 2 1 1 2 3 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 2 1 3 1 1 2 2 1 1 2 2 1 1 1 1 1 1 2 2 1 1

mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64

1 1 2 4 2 2 4 6 1 1 2 2 1 1 2 4 1 1

1 1 2 2 2 2 4 4 1 1 2 2 1 1 2 2 1 1

1 1 1 2 2 1 2 2 1 1 1 2 1 2 1 3 3 3 3 4 4

0.5 1 1 2 1 1 2 2 0.5 1 1 2 1 1 2 2 1 1

Page 77

Pentium M PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDWD PMADDWD PAVGB/W PAVGB/W PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PSADBW PSADBW PSADBW PSADBW Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) j) k) xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 2 4 1 1 2 4 1 1 2 4 1 1 2 4 2 2 4 6 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 4 4 2 1 2 1 1 2 1 1 1 2 1 2 4 4 4 4 4 4 3 3 3 3 1 2 2 1 1 2 2 0.5 1 1 2 0.5 1 1 2 1 1 2 2

mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i

1 1 2 4 1 1 2 3 3 4

1 1 2 2 1 1 2 2 3

1 1 1 2 1 1 2 2 2 3

1 1 1

0.5 1 1 2 1 1 2 2 2 3

11

11

6 k)

SSE3 instruction only available on Core Solo and Core Duo. Also uses some execution units under port 1. You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.

Floating point XMM instructions


Instruction Operands ops fused domain ops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 through put 1 2 3 2 1 2 2 2

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D

xmm,xmm xmm,m128 m128,xmm xmm,m128

2 2 2 4 Page 78

2 2 2 4 2

Pentium M MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD Conversion CVTPS2PD CVTPS2PD CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 8 1 2 1 1 1 1 1 2 3 4 2 2 4 4 4 2 3 4 1 1 1 1 j) 2 2 1 1 1 2 1 2 2 2 2 1 1 3-4 2 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 3 1 1 1 1 1 1 2 4 1 1 1 1 1 1 1 3 2 2 1 2 5 5 1 1

xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm

4 4 4 6 2 3 2 3 2 4 2 4 4 5 4 6 1 2 1 2 4 5 3 5 2 2 3 2 3 2

2 1 3 3

2 2 1 1 2 2

3 1 4 2 4 1 2 1 2 2 2 2 4 4 4 4 3 2 3 2 4 1 4 2 3 1 3 1 5 1 3 3 4 2 4 4 1 4 1 1 4

2 2

2 2

1 1 1 1 2 2

1 1

1 1 1 1 1 1

1 1

3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 1 1 1 1 2 2 2 2 1 1 1 1 1 1

Page 79

Pentium M CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) HADDPS HSUBPS g) HADDPD HSUBPD g) MULSS MULSD MULSS MULSD MULPS MULPD MULPS MULPD DIVSS DIVSD DIVSS DIVSD DIVPS DIVPD DIVPS DIVPD CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D RCPSS RCPSS RCPPS RCPPS Math SQRTSS SQRTSS SQRTSD SQRTSD SQRTPS SQRTPD SQRTPS SQRTPD r32,m64 3 1 1 1 1

xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,m32 xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,xmm xmm,m32 xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128

1 2 2 4 2 6? 3 1 1 2 2 2 2 4 4 1 1 2 2 2 2 4 4 1 2 2 4 1 2 1 2 2 4 1 2 2 4

1 1 2 2 2 ? 3 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 1 1 2 2

1 2

1 1

2 2

1 1

2 2 1

3 3 3 3 3 7 4 4 5 4 5 4 5 4 5 9-18 c) 9-32 c) 9-18 c) 9-32 c) 16-34 c) 16-62 c) 16-34 c) 16-62 c) 3 3

2 1 1 2 1 3 2 3 3 3 3 3

1 1 2 2 2 4 2 1 2 1 2 2 4 2 4 8-17 c) 8-31 c) 8-17 c) 8-31 c) 16-34 c) 16-62 c) 16-34 c) 16-62 c) 1 1 2 2 1 1 1 1 2 2 1 1 2 2

xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128

2 3 1 2 2 2 4 4

2 2 1 1 2 2 2 2

6-30 1 5-58 1 8-56 16-114 2 2

4-28 4-28 4-57 4-57 16-55 16-114 16-55 16-114

Page 80

Pentium M RSQRTSS RSQRTSS RSQRTPS RSQRTPS Logic AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: c) g) j) xmm,xmm xmm,m32 xmm,xmm xmm,m128 1 2 2 4 1 1 3 2 3 1 3 2 1 1 2 2

xmm,xmm xmm,m128

2 4

2 2

1 2

1 1

m32 m32 m4096 m4096

9 6 118 87

9 6 32 43

43 44

43

20 12 63 72

High values are typical, low values are for round divisors. SSE3 instruction only available on Core Solo and Core Duo. Also uses some execution units under port 1.

Page 81

Merom

Intel Core 2 (Merom, 65nm)


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one. The number of ops for each execution port. Fused ops count as two. Fused macro-ops count as one. The instruction has op fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under ops fused domain. An x under p0, p1 or p5 means that at least one of the ops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one op which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these ops go to. The total number of ops going to port 0, 1 and 5. The number of ops going to port 0 (execution units). The number of ops going to port 1 (execution units). The number of ops going to port 5 (execution units). The number of ops going to port 2 (memory read). The number of ops going to port 3 (memory write address). The number of ops going to port 4 (memory write data). Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a op in the integer unit (int) is read by a op in the floating point unit (float) or vice versa. fltint means that an instruction with multiple ops receive the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

ops fused domain: ops unfused domain:

p015: p0: p1: p5: p2: p3: p4: Unit:

Latency:

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Integer instructions
Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main Page 82 Laten- Recicy procal throughput

Merom Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE CLFLUSH IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r,m r,r r,m r,r r,m r i m sr 1 1 1 1 1 2 8 8 2 1 1 2 2 3 7 2 1 1 2 2 17 18 1 4 2 10 24 10 1 2 1 2 11 1 1 2 2 2 4 1 x x x 1 1 1 1 1 4 5 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 2 3 3 0.33 1 1 1 1 1 16 16 2 0.33 1 1 2 1 1 1 1 1 7 8 1 1.5 17 20 1 4 1 4 7 0.33 1 1 1 17 1 1 8 9 9 117

4 3

x x

x x

1 1 2 2 3 1 x x x x x x x x x 1 x x x 1 1 1 1 1 1 1 1 1 1

1 2 2 high b) 4 3

1 1 1 1 1 1 8

1 1 15 9 3 9 23 2 1 2 1 2 11 x x x 1 1 1 1 1 8

r (E/R)SP m sr

x x x 1 1

x x x

x x x 1

r,m r m m m

1 1 1 1 1 1 1 1 1 1 1

m8

240

r,r/i r,m m,r/i r,r/i r,m

1 1 2 2 2

1 1 1 2 2

x x x x x

x x x x x

x x x x x

1 1 1

int int int int int

1 6 2 2

0.33 1 1 2 2

Page 83

Merom ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL m,r/i r,r/i m,r/i r m 4 1 1 1 3 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 3 5 4 32 56 4 6 5 32 56 1 1 3 1 1 1 1 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 3 5 4 32 56 3 5 4 31 55 1 1 x x x x x x x x x x x 1 x 1 x x x 1 1 1 1 1 x x 2 1 x x 1 1 1 1 1 1 x x 1 1 1 1 1 1 1 1 1 1 x x x x x x 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 7 1 1 1 6 0.33 1 0.33 1 1 1 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2 12 12-20 c) 12-36 c) 18-37 c) 28-40 c) 12 12-20 c) 12-36 c) 18-37 c) 28-40 c)

r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64

x x x

x x x

17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5

1 1 1 1 1 x x x x x

18 18-26 18-42 29-61 39-72 18 18-26 18-42 29-61 39-72 1 1

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl

1 1 2 1 1 1 3 1

1 1 1 1 1 1 2 1

x x x x x x x x

x x x x x

x x x x x x x x

1 1 1 1

int int int int int int int int

1 6 1 1 6 1

0.33 1 1 0.33 1 0.5 1 1

Page 84

Merom ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 3 2 9 8 6 4 12 11 10 2 3 1 10 2 1 11 3 2 2 1 2 1 7 6 2 2 9 8 6 3 9 8 7 2 2 1 9 1 1 8 1 2 2 1 1 1 7 6 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int 6 2 12 11 11 7 14 13 13 2 7 1 1 2

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1

1 1 1 1 1

1 1 5 1

1 1

1 1

1 5 6 2 1 1

1 2 1 1 0.33 4 14

Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS

1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5

1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 30 30 13 5

1 1 1 1 1 1 x x x

1 2

x x x x

x x x x

1 1 1

1 1 1

1 1

1 2 1 1 2 2 2

int int int int int int int int int int int int int int int int int int int int int

0 0 0 0 0

1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3

2 Page 85

int

Merom REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) 4+7n - 14+6n 4 2 8+5n - 20+1.2n 8 5 1 1 1 7+7n - 13+n 4 3 7+8n - 17+7n 7 5 7+10n - 7+9n 1 1 int int int int int int int int int int 1+5n - 21+3n 1 7+2n - 0.55n

5 1 2

1+3n - 0.63n 1 3+8n - 23+6n 3 2+7n - 22+5n

i,0 a,b

1 1 3 12 3 46-100 29 23

1 1 3 10 2

x x x

x x x

x x x 1 1 1

int int int int int int int int int

0.33 1 8 8

180-215 64 54

Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode.

Floating point x87 instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 40 1 1 7 170 1 1 2 3 3 1 2 1 1 2 38 1 1 2 1 x x x x 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 2 2 float float float float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 3 4 164 0 6 6 6 6 1 1 3 20 1 1 5 166 1 1 1 1 1 1 2

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP FISTTP g) FLDZ FLD1

r m32/64 m80 m80 r m32/m64 m80 m80 r m m m m

3 x 166 x 0 f) 1 1 1 1 1 1 1 2 1 Page 86

Merom FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: r AX m16 m16 m16 r m m 2 2 1 2 2 3 1 2 142 78 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 float float float float float float float float float float 2 2 2 1 2 10 8 1 2 192 177

1 184 169

r m r m r m

r m r m m m m

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 1 21-27 21-27 7-15 7-15

1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1

1 1 2 2 1

1 1 1

1 1 1 1

float float float float float float float float float float float float float float float float float float float float

1 1 5 2 2 6-38 d) 5-37 d) 5-37 d) 1 1 1 1 1 1 1 2 2 5-37 d) 2 1 1 16-56 22-29

27 82 1 ~96 ~100 ~19 ~53 ~98 ~70

27 82 1 ~96 ~100 ~19 ~53 ~98 ~70

float float float float float float float float float

41 170 6-69 ~96 ~115 ~45 ~96 ~136 ~119

1 2 4 15

1 2 4 15

float float float float

1 1 15 63

Page 87

Merom d) f) g) Round divisors or low precision give low values. Resolved by register renaming. Generates no ops in the unfused domain. SSE3 instruction set.

Integer MMX and XMM instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 3 4 1 1 3 4 1 2 1 2 4 5 1 2 2 3 1 2 2 2 2 1 1 1 x x x x x x 1 x 1 x 1 1 1 x x x 1 4 2 2 1 1 x x x x x x x x x x x 1 2 2 1 2 1 2 int int int int 1 1 1 1 3 3 1 1 3 3 1 1 1 1 4 4 1 1 2 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x 1 1 1 1 1 1 int int fltint int int int fltint int int int int int int int int int fltint int int int int int int 1 3 1 3 1 1 3 1 3 1 2 2 1 int int 1 int int int int int Laten- Recicy procal throughput 2 3 2 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 0.33 1 0.5 1 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1

Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PSHUFB h) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PALIGNR h)

r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm,i mm,m64,i xmm,xmm,i

x x

x x

x x

x x x

x x x

Page 88

Merom PALIGNR h) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW PINSRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW xmm,m128,i mm,mm xmm,xmm r32,(x)mm r32,mm,i r32,xmm,i mm,r32,i mm,m16,i xmm,r32,i xmm,m16,i 2 4 10 1 2 3 1 2 3 4 2 x x x 1 int int int int int int int int int int 1 2-5 6-10 1 1 1 1 1 1.5 1.5

1 2 3 1 1 3 3

x x

x x

1 1 x x

2 3 5 2 6

1 1

(x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m

1 1 2 2 5 6 7 8 3 4 5 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 2 2 5 5 7 7 3 3 5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

x x x x

x x x x

1 1

int int int int int

1 2

0.5 1 1 1 4 4

int int 6

4 4 2 2 3 3 0.5 1 1 1 1 1 1 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5 1 1 1

1 1 1 x x 1 1 1 1 1 1 1 1 1 1 x x x x x x x x 1 1 x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1

int int int int int int int int int int int int int int int int int int int int int int int int int int int

3 5 1 3 3 3 3 3 1 1 1 1 3

Page 89

Merom Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) h) k) (x)mm,(x)mm (x)mm,m mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 1 1 1 1 1 2 3 2 1 1 1 1 1 2 2 2 x x 1 1 1 x x x x x x x int int int int int int int int 1 1 1 2 2 0.33 1 1 1 1 1 1 1

1 1

x x x

11

11

float

SSE3 instruction set. Supplementary SSE3 instruction set. MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.

Floating point XMM instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 4 9 1 1 1 2 2 1 1 1 1 3 4 1 2 1 2 1 2 3 4 1 1 x x x 1 1 2 4 1 1 x x x x 1 x x 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 fltint fltint float float int int int int fltint int float 3 1 1 1 3 1 1 1 1 1 1 float float 1 int 2 1 int 2 int int int int Laten- Recicy procal throughput 1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 2 2 1 1 1 1 1 1 2 2 1

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPS SHUFPD SHUFPD MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD

xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm

Page 90

Merom UNPCKH/LPD Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD xmm,m128 2 1 1 1 float 1

xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64

2 2 2 2 2 2 2 2 1 1 1 1 2 3 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1

2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1

1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1

float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float

4 4 2 2 3 3 4 4 3 3 4 4 4 3 4 3

1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1

xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm

1 1 1 1 1 1 6 7 3 4 1 1 1

1 1 1 1 1 1 6 6 3 3 1 1 1

1 1 1 1 1 1

1 1 1 1 1

1 1 1

float float float float float float float float float float float float float

3 3 3 9 5 4 5

1 1 1 1 1 1 3 3 2 2 1 1 1

Page 91

Merom MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float float float float float float float float float float float float float float float float float float float float float float float float 4 5 6-18 d) 6-32 d) 6-18 d) 6-32 d) 3 3 3 3 3 3 1 1 1 1 1 5-17 d) 5-17 d) 5-31 d) 5-31 d) 5-17 d) 5-17 d) 5-31 d) 5-31 d) 2 2 1 1 1 1 1 1 1 1 1 1

xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m

1 2 1 2 1 1

1 1 1 1 1 1

1 1 1 1 1 1

1 1 1

float float float float float float

6-29 6-58 3

6-29 6-29 6-58 6-58 2 2

Logic AND/ANDN/OR/XORPS/D xmm,xmm AND/ANDN/OR/XORPS/D xmm,m128 Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: d) g)

1 1

1 1

x x

x x

x x

int int

0.33 1

m32 m32 m4096 m4096

14 6 141 119

13 4

1 1 1 145 164

42 19 145 164

Round divisors give low values. SSE3 instruction set.

Page 92

Wolfdale

Intel Core 2 (Wolfdale, 45nm)


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one. The number of ops for each execution port. Fused ops count as two. Fused macro-ops count as one. The instruction has op fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under ops fused domain. An x under p0, p1 or p5 means that at least one of the ops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one op which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these ops go to. The total number of ops going to port 0, 1 and 5. The number of ops going to port 0 (execution units). The number of ops going to port 1 (execution units). The number of ops going to port 5 (execution units). The number of ops going to port 2 (memory read). The number of ops going to port 3 (memory write address). The number of ops going to port 4 (memory write data). Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a op in the integer unit (int) is read by a op in the floating point unit (float) or vice versa. fltint means that an instruction with multiple ops receive the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

ops fused domain: ops unfused domain:

p015: p0: p1: p5: p2: p3: p4: Unit:

Latency:

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Integer instructions
Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main Laten- Recicy procal throughput

Page 93

Wolfdale Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE CLFLUSH IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r16/32,m r64,m r,r r,m r,r r,m r i m sr 1 1 1 1 1 2 8 8 2 1 1 2 2 2 3 7 2 1 1 2 2 17 18 1 4 2 10 24 10 1 2 1 2 11 1 1 2 2 2 4 1 x x x 1 1 1 1 1 4 5 1 1 1 1 1 2 3 3 0.33 1 1 1 1 1 16 16 2 0.33 1 1 1 2 1 1 1 1 1 7 8 1 1.5 17 20 1 4 1 4 1 1 1 1 1 1 1 1 1 1 1 7 0.33 1 1 1 17 1 1 8 6 9 90

4 3

x x

x x

1 1 1 2 2 3 1 x x x x x x x x x x x x x x x 1 1

1 1

2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 2 high b) 4 3

1 1 15 9 3 9 23 2 1 2 1 2 11 x x x 1 1 1 1 1 8

r (E/R)SP m sr

2 1 1

x x x 1 1

x x x

x x x 1

r,m r m m m

m8

120

r,r/i r,m m,r/i r,r/i

1 1 2 2

1 1 1 2

x x x x

x x x x

x x x x

1 1 1 1 1 6 2

0.33 1 1 2

Page 94

Wolfdale ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR r,m m,r/i r,r/i m,r/i r m 2 2 4 3 1 1 1 1 1 1 3 1 1 1 3 3 5 5 1 1 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 4 4 7 7 7 7 32-38 32-38 56-62 56-62 4 3 7 6 7 6 32 31 56 55 1 1 1 1 x x x x x x x x x x x x x x x x x 1 x x 1 x x x 1 1 1 1 1 x x 2 1 x x 1 1 1 1 1 1 1 2 1 x x x 2 3 2 9 10 13 x x x 1 2 2 3 2 x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x 1 1 1 1 1 1 1 1 2 7 1 1 1 6 2 0.33 1 0.33 1 1 1 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2

r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64

17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5

1 1 1 1 1

9-18 c) 14-22 c) 14-23 c) 18-57 c) 34-88 c) 9-18 14-22 c) 14-23 c) 34-88 c) 39-72 c) 1 1

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl

1 1 2 1 1 1 3

1 1 1 1 1 1 2

x x x x x x x

x x x x x

x x x x x x x

1 1 1 1 1 1 1 1 6 1 1 6 1

0.33 1 1 0.33 1 0.5 1

Page 95

Wolfdale ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 1 3 2 9 8 6 4 12 11 10 2 3 1 9 3 1 10 3 2 2 1 2 1 6 6 1 2 2 9 8 6 3 9 8 7 2 2 1 8 2 1 7 1 2 2 1 1 1 6 6 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 6 2 12 11 11 7 14 13 13 2 7 1 1 1 2

x x x x x x x x x x x x x x x x 1 1 x x x x x

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1

1 1 1 1 1

1 1 4 1

1 1

1 1

1 5 6 2 1

1 1

1 1 1 1 0.33 3 14

Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions

1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5

1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 30 30 13 5

1 1 1 1 1 1 x x x

0 0 0 0 0

1 2

x x x x

x x x x

1 1 1

1 1 1

1 1

1 2 1 1 2 2 2

1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3

Page 96

Wolfdale LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) 3 2 4+7n-14+6n 4 2 8+5n-20+1.2n 8 5 1 1 1 7+7n-13+n 4 3 7+8n-17+7n 7 5 7+10n-7+9n 1 1+5n-21+3n 1 1 7+2n-0.55n 5 1+3n-0.63n 1 3+8n-23+6n 2 2+7n-22+5n 3 1 1 1

i,0 a,b

1 1 3 12 3 53-117 13 23

1 1 3 10 2

x x x

x x x

x x x 1 1 1

0.33 1 8 8

53-211 32 54

Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. The reciprocal throughput is only slightly less than the latency. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode.

Floating point x87 instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 40 1 1 7 171 1 1 2 3 1 1 2 38 1 1 2 x 1 1 2 2 1 2 2 1 1 1 1 1 1 2 2 float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 3 4 164 0 6 6 6 1 1 3 20 1 1 5 166 1 1 1 1

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP

r m32/64 m80 m80 r m32/m64 m80 m80 r m m m

3 x 167 x 0 f) 1 1 1 Page 97

x x 1 1 1

x x

Wolfdale FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN m 3 1 2 2 2 1 2 2 3 1 2 141 78 1 1 2 2 2 1 1 1 1 1 2 95 51 1 1 1 2 1 1 1 1 1 x x x x x x 1 1 1 2 1 1 float float float float float float float float float float float float float 6 1 1 2 2 2 1 2 10 8 1 2 142 177

r AX m16 m16 m16 r m m

x x 7 23 23 x 27

r m r m r m

r m r m m m m

1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 26-29 28-35 17-19

1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1

1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 x x x

1 1 1

1 1

1 1 1 1

x x x

x x x

float float float float float float float float float float float float float float float float float float float float float

1 1 5 2 2 6-21 d) 5-20 d) 6-21 d) 5-20 d) 1 1 1 1 1 1 1 2 2 5-20 d) 2 1 1

3 5 6-21

13-40 18-41 10-22

28 28 53-84 1 1 18-85 76-100 18105 19 19 57-65 19-100 23-87

x x 1 x x x x x x x

x x x x x x x x x

x x x x x x x x x

float float float float float float float float float float

43 ~170 6-20 32-85 70-100 38-107 45 50-100 40-130 55-130

Page 98

Wolfdale Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 1 2 4 15 1 2 4 15 1 x x float float float float 1 1 15 63

x x x

x x x

Round divisors or low precision give low values. Resolved by register renaming. Generates no ops in the unfused domain. SSE3 instruction set.

Integer MMX and XMM instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 x x x x x x 1 x 1 x 1 1 1 x x x 1 4 2 2 1 1 x x x x x x x x x x x 1 2 2 1 2 1 2 int int int int 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int 1 1 1 2 1 1 int int 1 int int int int int Laten- Recicy procal throughput 2 3 2 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 0.33 1 0.5 1 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1

Move instructions MOVD k) r32/64,(x)mm MOVD k) m32/64,(x)mm MOVD k) (x)mm,r32/64 MOVD k) (x)mm,m32/64 MOVQ (x)mm, (x)mm MOVQ (x)mm,m64 MOVQ m64, (x)mm MOVDQA xmm, xmm MOVDQA xmm, m128 MOVDQA m128, xmm MOVDQU m128, xmm MOVDQU xmm, m128 LDDQU g) xmm, m128 MOVDQ2Q mm, xmm MOVQ2DQ xmm,mm MOVNTQ m64,mm MOVNTDQ m128,xmm MOVNTDQA j) xmm, m128 PACKSSWB/DW PACKUSWB mm,mm PACKSSWB/DW PACKUSWB mm,m64 PACKSSWB/DW PACKUSWB xmm,xmm PACKSSWB/DW PACKUSWB xmm,m128 PACKUSDW j) xmm,xmm PACKUSDW j) xmm,m PUNPCKH/LBW/WD/DQ mm,mm PUNPCKH/LBW/WD/DQ mm,m64 PUNPCKH/LBW/WD/DQ xmm,xmm PUNPCKH/LBW/WD/DQ xmm,m128 PUNPCKH/LQDQ xmm,xmm PUNPCKH/LQDQ xmm, m128

x x

1 1 1 1

Page 99

Wolfdale PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i x,x,xmm0 x,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 2 3 1 1 2 2 1 1 4 10 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2 2 1 1 1 1 1 x x x x x 3 x x x x x x x x 1 x 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2-5 6-10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 2

1 3

1 1 1 1 1 1 1 1 1 1 1

2 3 3 3 3 3 1 2 1 1

Page 100

Wolfdale Arithmetic instructions PADD/SUB(U)(S)B/W/D (x)mm, (x)mm PADD/SUB(U)(S)B/W/D (x)mm,m PADDQ PSUBQ (x)mm, (x)mm PADDQ PSUBQ (x)mm,m PHADD(S)W PHSUB(S)W h) (x)mm, (x)mm PHADD(S)W PHSUB(S)W h) (x)mm,m64 PHADDD PHSUBD h) (x)mm, (x)mm PHADDD PHSUBD h) (x)mm,m64 PCMPEQ/GTB/W/D (x)mm,(x)mm PCMPEQ/GTB/W/D (x)mm,m PCMPEQQ j) xmm,xmm PCMPEQQ j) xmm,m128 PMULL/HW PMULHUW (x)mm,(x)mm PMULL/HW PMULHUW (x)mm,m PMULHRSW h) (x)mm,(x)mm PMULHRSW h) (x)mm,m PMULLD j) xmm,xmm PMULLD j) xmm,m128 PMULDQ j) xmm,xmm PMULDQ j) xmm,m128 PMULUDQ (x)mm,(x)mm PMULUDQ (x)mm,m PMADDWD (x)mm,(x)mm PMADDWD (x)mm,m PMADDUBSW h) (x)mm,(x)mm PMADDUBSW h) (x)mm,m PAVGB/W (x)mm,(x)mm PAVGB/W (x)mm,m PMIN/MAXSB j) xmm,xmm PMIN/MAXSB j) xmm,m128 PMIN/MAXUB (x)mm,(x)mm PMIN/MAXUB (x)mm,m PMIN/MAXSW (x)mm,(x)mm PMIN/MAXSW (x)mm,m PMIN/MAXUW j) xmm,xmm PMIN/MAXUW j) xmm,m PMIN/MAXSD j) xmm,xmm PMIN/MAXSD j) xmm,m128 PMIN/MAXUD j) xmm,xmm PMIN/MAXUD j) xmm,m128 PHMINPOSUW j) xmm,xmm PHMINPOSUW j) xmm,m128 PABSB PABSW PABSD h)(x)mm,(x)mm PABSB PABSW PABSD h) (x)mm,m PSIGNB PSIGNW PSIGND h) (x)mm,(x)mm 1 1 2 2 3 4 3 4 1 1 1 1 1 1 1 1 4 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 1 1 1 1 2 2 3 3 3 3 1 1 1 1 1 1 1 1 4 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 1 1 x x x x 1 1 1 1 x x x x x x 2 2 2 2 x x 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 x x 1 1 x x x x 1 1 1 1 1 1 1 4 4 x x x 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 1 2 0.5 1 1 1 2 2 2 2 0.5 1 1 1 1 1 1 1 2 4 1 1 1 1 1 1 1 1 0.5 1 1 1 0.5 1 0.5 1 1 1 1 1 1 1 4 4 0.5 1 0.5

1 1

3 1 1 3 3 5 5 3 3 3 3 1 1 1 1 1 1 1 4 1

x x 1 1 x x x x 1

x x x

Page 101

Wolfdale PSIGNB PSIGNW PSIGND h) PSADBW PSADBW MPSADBW j) MPSADBW j) Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) h) j) k) m) (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm,i xmm,m,i 1 1 1 3 4 1 1 1 3 3 x 1 1 1 1 x 1 1 2 2 1 int int int int int 1 1 1 2 2

3 5

(x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i

1 1 2 2 1 1 1 2 3 1

1 1 2 2 1 1 1 2 2 1

x x 1 1 1 1 1 x x x

x x x x

x x x x

1 1 1

x x x

int int int int int int int int int int

1 1 1 1 2 1

0.33 1 1 1 1 1 1 1 1 1

11

11

float

SSE3 instruction set. Supplementary SSE3 instruction set. SSE4.1 instruction set MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits Only available in 64 bit mode

Floating point XMM instructions


Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 4 9 1 1 1 2 2 1 1 1 1 1 1 x x x 1 1 2 4 1 1 x x x x 1 x x 2 1 1 1 1 1 1 1 1 Page 102 1 1 1 1 1 1 1 int 1 1 1 1 1 1 float float 1 int 2 1 int 2 int int int int Laten- Recicy procal throughput 1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 1

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS

xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i

Wolfdale SHUFPS SHUFPD SHUFPD BLENDPS/PD j) BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i
xmm,xmm,xmm0

xmm,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i

2 1 2 1 1 2 2 1 2 1 2 1 1 1 2 2 2 1 2

1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1

1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 x

1 1 1 1 1 1 1 1

x 1 1 1

1 1

int float float int int int int int int int int int int float float int int int int

1 1 2 1 1 1 1 4 1

1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1

xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32

2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2

2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2

1 1 1 1 2 2 2 2

1 1 1 1

1 1 1 1

1 1 1 1

1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

float float float float float float float float float float float float float float float float float float float float float float float float float float float float float

4 4 2 2 3 3 4 4 3 3 4 4 4 3 4

1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3

Page 103

Wolfdale CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D j) ROUNDSS/D j) ROUNDPS/D j) ROUNDPS/D j) DPPS j) DPPS j) DPPD j) xmm,m32 r32,xmm r32,m64 2 1 1 1 1 1 1 1 1 1 1 float float float 3 3 1 1

xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i

1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4

1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4

x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 x x

1 1 1 2 2 x x 1 1 1 1 1 1 1 1 1 1

2 2 x

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 x

1 1 1 1 1 1 1 1 1 x

float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float

3 3 3 7 6 4 5 4 5 6-13 d) 6-21 d) 6-13 d) 6-21 d) 3 3 3 3 3 3 3 3 11 9

1 1 1 1 1 1 3 3 1.5 1.5 1 1 1 1 1 1 1 1 5-12 d) 5-12 d) 5-20 d) 5-20 d) 5-12 d) 5-12 d) 5-20 d) 5-20 d) 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3

Page 104

Wolfdale DPPD j) Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS Logic
AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D

xmm,m128,i

float

xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m

1 2 1 2 1 1

1 1 1 1 1 1

1 1 1 1 1 1

1 1 1

float float float float float float

6-13 6-20 3

5-12 5-12 5-19 5-19 2 2

xmm,xmm xmm,m128

1 1

1 1

x x

x x

x x

int int

0.33 1

Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: d) g)

m32 m32 m4096 m4096

13 10 151 121

12 8 67 74

x x x x

x x x x

x 1 x 1 1 x 8 38 38 x 47

38 20 145 150

Round divisors give low values. SSE3 instruction set.

Page 105

Nehalem

Intel Nehalem
List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one. The number of ops for each execution port. Fused ops count as two. Fused macro-ops count as one. The instruction has op fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under ops fused domain. An x under p0, p1 or p5 means that at least one of the ops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one op which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these ops go to. The total number of ops going to port 0, 1 and 5. The number of ops going to port 0 (execution units). The number of ops going to port 1 (execution units). The number of ops going to port 5 (execution units). The number of ops going to port 2 (memory read). The number of ops going to port 3 (memory write address). The number of ops going to port 4 (memory write data). Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a op in one domain is read by a op in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp", and between the "ivec" and "fp" units. The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain such as a general purpose register (e.g. eax) in the "ivec" domain. For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than one. The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain. The bypass delay from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table. Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

ops fused domain: ops unfused domain:

p015: p0: p1: p5: p2: p3: p4: Domain:

Page 106

Nehalem Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Integer instructions
Instruction Operands Doops ops unfused domain main fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 2 6 6 2 1 1 2 2 3 7 2 1 1 2 2 3 18 1 3 2 7 8 10 1 2 1 1 1 x x x 1 1 1 1 1 3 4 1 1 1 1 int int int int int int int int int int 1 2 2 3 1 x x x x x x x x x 1 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 8 int int int int int int int int int int int int int int int int int int int int int int Laten- Recicy procal throughput 1 2 3 3 0.33 1 1 1 1 1 13 14 1 0.33 1 1 2 1 1 1 1 1 1 8 1 5 1 15 14 8 0.33 1 1 1

Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP

r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r,m r,r r,m r,r r,m r i m sr

3 2

x x

x x

1 1 x x x

~270 1

2 2 20 b) 5 3

1 1 2 2 2 2 7 2 1 2 1 1 x x x x 1 1 x x x 1 1 1 5 1 8

r (E/R)SP m sr

x x x

x x x 1 1

x x x

r,m r32

1 4 1 1

Page 107

Nehalem BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV c) DIV c) DIV c) DIV c) IDIV c) IDIV c) IDIV c) IDIV c) r64 m m m 1 9 1 1 2 3 2 1 3 x 1 x x 6 1 1 1 1 1 1 1 1 int int int int int int int 3 1 15 1 1 9 23 5

r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m

r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64

1 1 2 2 2 4 1 1 1 3 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 4 6 6 ~40 4 8 7 ~60

1 1 1 2 2 3 1 1 1 1 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 4 6 6 4 8 7

x x x x x x x x x x x x x x x

x x x x x x x x x x 1 x x 1 x x x 1 1 1 1

x x x x x x x x x x x x x x x

1 1 1 1 1 1

1 x x 2 1 x x 1 1 1 1 1 1 1 x x x 1 x x x 2 4 3 x 2 5 3 x 1 x x x 1 x x x x x 1 1 1 1 1 1 1 1 1 1

int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int

1 6 2 2 7 1 1 1 6 3 15 20 3 5 5 3 3 3 3 3 3 3 3 5 5 3 3 3 3

0.33 1 1 2 2 0.33 1 0.33 1 1 2 7 1 2 2 2 1 1 1 1 1 2 1 2 2 2 1 1 1 1 1 1 7-11 7-12 7-17 19-69 7-11 7-12 7-17 26-86

11-21 17-22 17-28 28-90 10-22 18-23 17-28 37-100

Page 108

Nehalem CBW CWDE CDQE CWD CDQ CQO POPCNT ) POPCNT ) CRC32 ) CRC32 ) Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHLD SHRD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD 1 1 1 1 1 1 1 1 1 1 1 1 x x x 1 1 1 1 x x 1 1 int int int int int int 1 1 3 3 1 1 1 1 1 1

r,r r,m r,r r,m

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m

1 1 2 1 1 1 3 1 3 2 9 8 6 4 12 11 10 2 3 2 3 1 9 2 1 10 3 1 2 1 2 1 2 2

1 1 1 1 1 1 2 1 2 2 9 8 6 3 9 8 7 2 2 2 2 1 8 2 1 7 3 1 1 1 1 1 2 2

x x x x x x x x x x x x x x x x x x x x x x x x x x x

x x x x x

x x x x x x x x x x x x

x 1 1

x x x x x x x x x x x x x x x x x x x x x x x x x x x

1 1 1 1 1

1 1

1 1

1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1

1 1 1 1 1 1

1 1

1 1

x x x x x

x x x x

x x x x x

int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int

1 6 1 1 6 1 6 2 13 11 12-13 7 16 14 15 3 8 4 9 1

0.33 1 1 0.33 1 0.5 1 1 1 2

12-13

1 1 1 5 1 1

1 6 6 3 3 1 1

1 1 1 1 0.33 4 5

Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near

1 31 1 1 31 1

1 31 1 1 31 1 Page 109

1 1 1 1

1 11

int int int int int int

0 0 0 0

2 67 2 2 73 2

Nehalem Fused compare/test and branch e) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) ) 1 2 6 11 2 46 3 4 47 1 3 39 40 15 4 1 2 6 11 2 46 2 3 47 1 2 39 40 13 4 x x x x x x 1 1 x x 1 9 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int 0 2 2 4 7 2 74 2 2 79 2 2 120 124 7 5

small n large n small n large n

2 1 11+4n 3 1 60+n 2.5/16 bytes 5 2 13+6n 2/16 bytes 3 2 37+6n 5 3 65+8n

x x

x x

x x

1 1 1

x x

x x

x x

1 2

int int int int int int int int int int int int

1 40+12n 1 12+n 1 clk / 16 bytes 4 12+n 1 clk / 16 bytes 1 40+2n 4 42+2n

a,0 a,b

1 1 5 11 34+7b 3 25-100 22 28

1 1 5 9 3

x x x x

x x x x

x x x x

1 1

int int int int int int int int int

0.33 1 9 8 79+5b ~200 5 ~200 24 40-60

Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode. SSE4.2 instruction set.

Page 110

Nehalem

Floating point x87 instructions


Instruction Operands ops ops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 41 1 1 7 208 1 1 3 3 1 2 2 2 2 3 2 2 1 2 143 79 1 1 2 38 1 3 204 0 f) 1 1 1 1 2 2 2 2 2 1 1 1 2 89 52 1 1 x 1 x x 1 x 1 2 3 1 2 2 1 1 1 1 1 1 2 2 float float float float float float float float float float float float float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 4 5 242 0 6 7 7 1 1 2 20 1 1 5 245 1 1 1 1 1 2 2 2 1 2 31 1 1 4 178 156

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM

r m32/64 m80 m80 r m32/m64 m80 m80 r m m m

x x 1 1 1

x x

1 1 2

1 2

r AX m16 m16 m16 r m m

2+2

1 1 1 1 x x x 1 x x x

1 1

7 5 1 178 156

x x 8 23 23 x 27

r m r m r m

r m r m m m m

1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1

1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1

1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1

1 1 1

1 1

1 1 1 1

float 3 1 float 1 float 5 1 float 1 float 7-27 d) 7-27 d) float 7-27 d) 7-27 d) float 1 1 float 1 1 float 1 float 1 float 1 float 1 float 3 2 float 5 2 float 7-27 d) 7-27 d) float 1 float 1 float 1

Page 111

Nehalem FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 25 35 17 25 35 17 x x x x x x x x x float float float 14 19 22

24 17 1 ~100 ~100 ~100 19 ~55 ~100 ~82

24 17 1
~100 ~100 ~100

19 ~55
~100

~82

x x 1 x x x x x x x

x x x x x x x x x

x x x x x x x x x

float 12 float 13 float ~27 float 40-100 float 40-100 float ~110 float 58 float ~80 float ~115 float ~120

1 1 1 2 2 x 3 3 ~190 ~190 x

x x x

x x x

float float float float

1 1 17 77

Round divisors or low precision give low values. Resolved by register renaming. Generates no ops in the unfused domain. SSE3 instruction set.

Integer MMX and XMM instructions


Instruction Operands ops ops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x 1 x 1 x 1 1 1 x x x 1 1 1 1 1 1 1 1 1 1 x x x x x x 1 1 Page 112 1 1 ivec ivec 1 1 1 ivec ivec 1 ivec int Laten- Recicy procal throughput 1+1 3 1+1 2 1 2 3 1 2 3 2 3 2 1 1 ~270 ~270 0.33 1 0.33 1 0.33 1 1 0.33 1 1 1 1 1 0.33 0.33 2 2

Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ

r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm

Nehalem MOVNTDQA j) PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW j) PACKUSDW j) PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) xmm, m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m (x)mm, (x)mm (x)mm,m xmm,xmm xmm, m128 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 (x)mm, (x)mm (x)mm,m mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 xmm,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 x 1 x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 1 2 1 2 1 x ivec ivec float ivec ivec 1 1 1 ivec 1 ivec 2+1 2+1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ivec 1 1 ivec 2 1 1 1 2 0.5 2 2 2 0.5 2 0.5 1 1 2 1 2 1 2 1 2 1 2 1 2 0.5 1 0.5 1 0.5 1 0.5 1 1 1 1 1 0.5 1 2 7 1 1 1 1 1 1 1 1

2+2 2+1 2+1

Page 113

Nehalem PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) Arithmetic instructions PADD/SUB(U) (S)B/W/D/Q PADD/SUB(U) (S)B/W/D/Q PHADD/SUB(S)W/D h) PHADD/SUB(S)W/D h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ j) PCMPEQQ j) PCMPGTQ ) PCMPGTQ ) PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULLD j) PMULLD j) PMULDQ j) PMULDQ j) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXSB j) PMIN/MAXSB j) PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW j) PMIN/MAXUW j) PMIN/MAXU/SD j) PMIN/MAXU/SD j) PHMINPOSUW j) m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x 1 1 ivec 1 ivec 1 ivec 1 1+1 1+1 1+1 1 ivec 1+1 1 1 1 1 1 1 1 1 1

(x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m xmm,xmm xmm,m128 xmm,xmm

1 1 3 4 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 3 3 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

x x x x x x x x 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 x x x x x x x x x x x x 1

x x x x x x x x 1

ivec

0.5 2 1.5 3 0.5 2 0.5 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 0.5 1 1 2 0.5 2 0.5 2 1 2 1 2 1

ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1

3 1 1 3 3 3 6 3 3 3 3 1 1 1 1 1 1 3

x x x x x x x x x x x x

ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec

Page 114

Nehalem PHMINPOSUW j) PABSB PABSW PABSD h) PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW MPSADBW j) MPSADBW j) PCLMULQDQ n) AESDEC, AESDECLAST, AESENC, AESENCLAST n) AESIMC n) AESKEYGENASSIST n) Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ String instructions PCMPESTRI ) PCMPESTRI ) PCMPESTRM ) PCMPESTRM ) PCMPISTRI ) PCMPISTRI ) PCMPISTRM ) PCMPISTRM ) Other EMMS Notes: g) h) j) k) xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm,i xmm,m,i xmm,xmm,i 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 1 3 3 x x x x 1 1 x x 1 x x x x 1 ivec 1 x x ivec 1 12 5 3 1 ivec 1 1 ivec 1 3 0.5 1 0.5 2 1 3 1 2 8

x x

xmm,xmm xmm,xmm xmm,xmm,i

~5 ~5 ~5

~2 ~2 ~2

(x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i

1 1 2 2 1 1 1 2 3 1

1 1 2 2 1 1 1 2 2 1

x x x x

x x x

x x x x 1 1 1 1 1

x x x x

ivec 1 ivec 1 ivec 1 ivec ivec 1 ivec

1 3 1 1 2 1

x x x

0.33 1 1 1 1 2 1 2 1 1

xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i

8 9 9 10 3 4 4 6

8 8 9 10 3 4 4 5

x x x x x x x x

x x x x x x x x

x x x x x x x x

1 1 1 1

ivec ivec ivec ivec ivec ivec ivec ivec

14 14 7 7 8 8 7 7

5 6 6 6 2 2 2 5

11

11

float

SSE3 instruction set. Supplementary SSE3 instruction set. SSE4.1 instruction set MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits Page 115

Nehalem ) m) n) SSE4.2 instruction set Only available in 64 bit mode Only available on newer models

Floating point XMM instructions


Instruction Operands Doops ops unfused domain fused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 3 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 float float 1 float float float 1 1+2 1 1 1 1 float 1 float float float float float float float 1 1 1 float float 1 1 1 float Laten- Recicy procal throughput 1 2 3 2 3 1 2 3 3 5 1 1+2 ~270 1 1 2 1 2 1 1 1 1 1-4 1-3 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D BLENDPS/PD j) BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD

xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i x,x,xmm0 xmm,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i

xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32

2 2 2 2 2 2 1 1

2 2 2 2 2 2 1 1

1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1

float float float float float float float float

4 4 2 1

1 1 1 1 1 1 1 2

Page 116

Nehalem CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULPS MULSS MULPS MULSD MULPD MULSD MULPD DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D xmm,m 2 1 Page 117 1 1 float 1 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x 1
float/ivec

float float float float float float float float float float float float
ivec/float

3+2 3+2 4+2 4+2 3+2 3+2 6 6 3+2 3+2 4+2 3+2

x x

1 1 1 1 1 float float float float float float float float

1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1

xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm

1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 2 2 2 2 1 1 1 1 1 1 1

float float float float float float float float float float float float float float float float float float float float float

3 3 3 5 3 4 5 7-14 7-22 3

1 1 1 1 1 1 2 2 2 2 1 1 1 1 7-14 7-14 7-22 7-22 2 2 1

Nehalem COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D ROUNDPS/D j) ROUNDSS/D ROUNDPS/D j) DPPS j) DPPS j) DPPD j) DPPD j) Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS Logic
AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D

xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i

1 1 1 1 1 1 1 2 4 6 3 4

1 1 1 1 1 1 1 1 4 5 3 3

1 1 1 1 1 1 1 1 2 x x x

1 1 1

float float float float float float float

1+2 3 3

1 1 1 1 1 1 1 1 2 1 3

1 1 x x x 1 1

1 x x x

float float float float float

11 9

xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m

1 2 1 2 1 1

1 1 1 1 1 1

1 1 1 1 1 1

1 1 1

float float float float float float

7-18 7-32 3

7-18 7-18 7-32 7-32 2 2

xmm,xmm xmm,m128

1 1

1 1

1 1

float float

1 1

Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: g)

m32 m32 m4096 m4096

6 2 141 112

6 x 1 141 x 90 x

x x x

x 1 1 1 1 x 5 38 38 x 42

90

5 1 90 100

SSE3 instruction set.

Page 118

Sandy Bridge

Intel Sandy Bridge


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc. The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one. The number of ops for each execution port. Fused ops count as two. Fused macro-ops count as one. The instruction has op fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under ops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write op in the second clock cycle, but a read port can receive an address-calculation op in the second clock cycle. An x under p0, p1 or p5 means that at least one of the ops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one op which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these ops go to. The total number of ops going to port 0, 1 and 5. The number of ops going to port 0 (execution units). The number of ops going to port 1 (execution units). The number of ops going to port 5 (execution units). The number of ops going to port 2 or 3 (memory read or address calculation). The number of ops going to port 4 (memory write data). This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN's and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

ops fused domain: ops unfused domain:

p015: p0: p1: p5: p23: p4: Latency:

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread. The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. The latencies may be one or two clock cycles longer and the reciprocal throughputs double the values for shorter sequences of code. There is no warm-up effect when vectors are 128 bits wide or less.

Integer instructions
Instruction Operands ops ops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put

Move instructions Page 119

Sandy Bridge MOV MOV r,r/i r,m 1 1 1 x x x 1 1 2 0.33 0.5 all addressing modes

MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA LEA

m,r m,i m,r r,r r,m r,r r,m r,r r,m

1 1 2 1 1 2 2 3 8 3 1 1 2 3 16 1 1 2 9 18 1 3 1 1

1 1 1 1 x x x 1 2 2 3 x x x x x x x x x

1 1 1

3 ~350 1

1 1 1 0.33 0.5

2 1 2 1 2 25 7 3

1 1 1 implicit lock 1 1 1 1 1 8 0.5 0.5 1 18 9 1 1 0.5 1

r i m

2 0 0 8 10 1 3 1 1

r (E/R)SP m

1 1 1 2 1 8 1 1 2 1 8

1 1 1 1 8

not 64 bit

2 1

not 64 bit not 64 bit simple complex or rip relative

r,m r,m

1 1

1 1 1 3

BSWAP BSWAP PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB SUB ADC SBB ADC SBB ADC SBB CMP CMP

r32 r64 m m

1 2 1 1 2 3 2

1 2

1 2 1 1 1 1 1

1 2

1 1 1

1 1 0.5 0.5 4 33 6

r,r/i r,m m,r/i r,same r,r/i r,m m,r/i r,r/i m,r/i

1 1 2 1 2 2 4 1 1

1 1 1 0 2 2 3 1 1

x x x x x x x x

x x x x x x x x

x x x x x x x x

1 1 2 1 6 0 2 2 7 1 1

1 2 1

0.33 0.5 1 0.25 1 1 1.5 0.33 0.5

Page 120

Sandy Bridge INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT CRC32 CRC32 Logic instructions AND OR XOR AND OR XOR AND OR XOR XOR TEST TEST SHR SHL SAR r m 1 3 2 3 3 8 1 4 3 2 1 2 1 1 1 4 3 2 1 2 1 1 10 11 10 34-56 10 10 9 59138 1 1 1 2 1 1 1 1 1 1 1 1 2 3 3 8 1 4 3 2 1 2 1 1 1 3 2 1 1 2 1 1 10 11 10 10 10 9 x x x x x x 2 1 1 6 4 4 2 20 3 4 4 3 3 4 3 3 3 0.33 2 not 64 bit not 64 bit not 64 bit not 64 bit

r8 r16 r32 r64 r,r r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r,m r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64

1 1 1 1

1 1 1

1 1 1 1 1 1 1 1

20-24 21-25 20-28 30-94 21-24 21-25 20-27 40-103 1 1 1 1 1 1 3 1 3 1

11 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 11-14 11-14 11-18 22-76 11-14 11-14 11-18 25-84 0.5 1 0.5 1 1 0.5 1 1 1 1

r,r r,m r,r r,m

1 1 1 2 1 1 1 1 1 1

1 1 1 1

SSE4.2 SSE4.2 SSE4.2 SSE4.2

r,r/i r,m m,r/i r,same r,r/i m,r/i r,i

1 1 2 1 1 1 1

1 1 1 0 1 1 1

x x x x x x

x x x x x

x x x x x x

1 1 2 1 6 0 1 1

0.33 0.5 1 0.25 0.33 0.5 0.5

Page 121

Sandy Bridge SHR SHL SAR SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL RCR RCR RCR RCR RCR RCR RCL RCL RCL RCL RCL SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m,i r,cl m,cl r,i m,i r,cl m,cl r8,1 r16/32/64,1 r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl m,r,cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 3 3 5 1 4 3 5 high 3 8 11 8 11 3 8 11 8 11 1 3 4 5 1 10 2 1 11 3 1 1 1 2 1 1 3 1 3 3 1 3 3 3 3 8 7 8 7 3 8 7 8 7 1 2 4 3 1 8 1 1 7 1 1 1 1 1 0 1 3 2 1 1 1 2 2 1 x x x x x x x 1 1 1 1 1 1 1 3 1 2 1 1 2 2 2 2 1 1 1 1 2 1 high 2 5 5 2 6 6 1 2 2 2 4 1 2 2 4 high 2 5 6 5 6 2 6 6 6 6 0.5 2 2 4 0.5 5 0.5 0.5 5 2 1 1 0.5 1 0.25 0.33 4

Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near

1 1 1 1

1 1 1 1

1 1 1 1

0 0 0 0

2 2 2 1-2

fastest if not jumping

Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL

1 short short short near r 2 7 11 3 2

1 2 7 11 2 1 x x

1 1

1-2 2-4 5 5 2 2

1 1

1 1

1 1

Page 122

Sandy Bridge CALL RET RET BOUND INTO String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) m i r,m 3 2 3 15 4 2 2 2 13 4 1 1 1 2 1 1 1 2 2 2 7 6

not 64 bit not 64 bit

3 5n+12 3 2n 1.5/16B 5 2n 3/16B 3 6n+47 5 8n+80

2 1

1 ~2n 1 1 n 1/16B

1 1 worst case best case 4 1.5 n 1/16B 1 2n+45 4 2n+80 worst case best case

1 1

0 0

0.25 0.25

decode only 1 per clk

PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC

a,0 a,b

7 7 12 10 49+6b 3 3 31-75 21 35

2 1

1 84+3b

11 8 7 100-250 28 42

Floating point x87 instructions


Instruction Operands ops ops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put 1 1 4 43 1 1 7 1 1 2 40 1 3 Page 123 1 1 1 1 2 1 2 1 1 2 3 1 3 4 45 1 4 5 1 1 2 21 1 1 5

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP

r m32/64 m80 m80 r m32/m64 m80

Sandy Bridge FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X m80 r m m m 246 1 1 3 3 1 2 2 3 2 2 3 2 1 1 143 90 0 1 1 1 1 2 2 3 2 1 2 1 1 1 0 6 7 7 252 0.5 1 2 2 2 2 2 2 1 1 1 1 1 166 165

1 1 1 1 1 1 2

1 1 1

1 1

SSE3

r AX m16 m16 m16 r m m

3 2 1 1 1 1 1 8 5 1

1 1

r m r m r m

r m r m m m m

1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 41-87 17

1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 17

1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1

3 1 5 1 10-24 1 1 1 3 1 4 1 1 1 1

1 1

1 1 1 1 10-24 10-24 1 1 1 1 1 1 1 1 2 1 2 21 26-50

21 26-50 22

27 27 17 17 1 1 1 64-100 20-110 20-110 53-118 454 454 Page 124

12 10 10-24 47-100 47-115 43-123 61-69 724

Sandy Bridge FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT 464 464 102 102 28-91 726 130 93-146

1 2 5 26

1 2 5 26

1 1 22 81

Integer MMX and XMM instructions


Instruction Operands ops ops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 x x x x x x x 1 x 1 x 1 1 1 x x x 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 3 1 3 3 3 3 3 1 1 ~300 ~300 0.33 1 0.33 0.5 0.33 0.5 1 0.33 0.5 1 0.5 1 0.5 1 0.33 1 0.5 1 1 x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1

Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW PMOVSX/ZXBW

r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x x, m128 m128, x x, m128 mm, x x,mm m64,mm m128,x x, m128 mm,mm mm,m64 x,x x,m128 x,x x,m (x)mm,(x)mm (x)mm,m x,x x, m128 x,x x,m64

1 1 1 2 1

SSE3

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1

SSE4.1

SSE4.1 SSE4.1

SSE4.1 SSE4.1

Page 125

Sandy Bridge PMOVSX/ZXBD PMOVSX/ZXBD PMOVSX/ZXBQ PMOVSX/ZXBQ PMOVSX/ZXWD PMOVSX/ZXWD PMOVSX/ZXWQ PMOVSX/ZXWQ PMOVSX/ZXDQ PMOVSX/ZXDQ PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB PBLENDW PBLENDW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB PEXTRB PEXTRW PEXTRW PEXTRD PEXTRD PEXTRQ PEXTRQ PINSRB PINSRB PINSRW PINSRW PINSRD PINSRD PINSRQ PINSRQ Arithmetic instructions
PADD/SUB(U,S)B/W/D/Q PADD/SUB(U,S)B/W/D/Q

x,x x,m32 x,x x,m16 x,x x,m64 x,x x,m32 x,x x,m64 (x)mm,(x)mm (x)mm,m mm,mm,i mm,m64,i xmm,x,i x,m128,i x,x,i x, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 x,m,xmm0 x,x,i x,m,i mm,mm x,x r32,(x)mm r32,x,i m8,x,i r32,(x)mm,i m16,(x)mm,i r32,x,i m32,x,i r64,x,i m64,x,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i x,r64,i x,m64,i

1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1

x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 1 x x x x x x x x x x x x x x x x

x x x x x x x x x x x x x x x x x x x x 1 1 x x

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 4 1 x 2 2 1 1 1 1 1 2 1 2 1 2 1 1 2 1 2 1 2 1 2

x x

x x x x x x x x x x x x x x x x

0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 1 6 1 1 1 1 2 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5

SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3

SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1

SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1, 64b SSE4.1 SSE4.1

SSE4.1 SSE4.1 SSE4.1, 64 b

PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GTB/W/D

(x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm

1 1 3 4 1

1 1 3 3 1

x x x x x

x x x x x

1 1 2 1 1

0.5 0.5 1.5 1.5 0.5

SSSE3 SSSE3

Page 126

Sandy Bridge PCMPEQ/GTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PSUBxx, PCMPGTx PCMPEQx PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/MAXSB PMIN/MAXSB PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW PMIN/MAXUW PMIN/MAXU/SD PMIN/MAXU/SD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW (x)mm,m x,x x,m128 x,x x,m128 x,same x,same (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x,i x,m,i 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 x x x 1 1 x x x 1 1 1 5 1 0 0 5 1 5 1 5 1 5 1 5 1 5 1 5 1 x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 5 1 x x x x 1 1 1 1 5 1 6 1 0.5 0.5 0.5 1 1 0.25 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 0.5 0.5 1 1 1 1 SSE4.1 SSE4.1 SSE4.2 SSE4.2

1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x 1 1 x x x x 1 1

SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1

SSSE3 SSSE3

SSE4.1 SSE4.1

SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3

PCLMULQDQ AESDEC, AESDECLAST, AESENC, AESENCLAST

x,x,i

18

18

14

SSE4.1 SSE4.1 only in some processors

x,x

2 Page 127

do.

Sandy Bridge AESIMC AESKEYGENASSIST Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PXOR PTEST PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM Other EMMS x,x x,x,i 2 11 2 11 2 8 2 8 do. do.

(x)mm,(x)mm (x)mm,m x,same x,x x,m128 mm,mm/i mm,m64 xmm,i x,x x,m128 x,i

1 1 1 1 1 1 1 1 2 3 1

1 1 0 1 1 1 1 1 2 2 1

x x

x x

x x

1 1 0 1 1

1 1 1

1 1 1 2 1 1

0.33 0.5 0.25 1 1 1 2 1 1 1 1

SSE4.1 SSE4.1

x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i

8 8 8 8 3 4 3 4

8 7 8 7 3 3 3 3

4 1 11-12 1 3 1 11 1

4 4 4 4 3 3 3 3

SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2

31

31

18

Floating point XMM and YMM instructions


Instruction Operands ops ops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 3 4 3 3 1 3 3 3 3 1 1 1 0.5 1 1 1 1 0.5 1 1 1 1

Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS

x,x y,y x,m128 y,m256 m128,x m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,x

AVX

AVX

1 1+

AVX

1 1 1

1 1 1

Page 128

Sandy Bridge MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D VSHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD BLENDPS/PD VBLENDPS/PD VBLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D VUNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D r32,x r32,y m128,x m256,y x,x,i x,m128,i y,y,y,i y, y,m256,i x,x,x/i y,y,y/i x,x,m y,y,m x,m,i y,m,i y,y,y,i y,y,m,i x,x,i x,m128,i y,y,i y,m256,i x,x,xmm0 x,m,xmm0 y,y,y,y y,y,m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x x,m128 y,y,y y,y,m256 r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 1 1 1 1 1 2 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 3 2 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 3 1 2 1 2 1 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1+ 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 Page 129 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 1 1+ 2 1 1 1 2 1 1 1+ 1 2 1 1 1 1 1+ 1 1 1 1+ 1 1+ 2 1+ 1 1 1 1+ 2 1 2 1+ 1 1 3 1 3 1 4 2 2 ~300 ~300 1 1 1 1 25 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 1 1 1 1 1 0.5 1 1 1 1 1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

AVX

1 1 1 1

1 1 1 1

1 3 1 4 1

AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX AVX AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX

Sandy Bridge VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic Page 130 m128,x,x m256,y,y 4 4 2 2 1 1 1 1+ 1 2 AVX AVX

x,x x,m128 x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,m32 r32,x r32,m32 x,r32 x,m32 r32,x r32,m64

2 2 2 2 2 2 2 2 2 3 2 2 1 1 1 1 1 1 1 1 2 2 2 3 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2

2 2 2 2 2 2 2 2 2 3 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1+ 1 1 1 1 1 1

3 4 3 3 1 4 1 3 1 3 1 3 1+ 3 1 3 1+

1 1 1 1 1 1 1 1

4 1 5 1 4 1 5 1+ 4 1 4 1

1 1

4 1 4 1

1 1 1 1 1 1 1 1

4 1 4 1 4 1 4 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1.5 1.5 1 1 1.5 1.5 1 1

AVX AVX

AVX AVX

AVX AVX

AVX AVX

AVX AVX

AVX AVX

Sandy Bridge ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D ADDSUBPS/D VADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULPS MULSS MULPS VMULPS VMULPS MULSD MULPD MULSD MULPD VMULPD VMULPD DIVSS DIVPS DIVSS DIVPS VDIVPS VDIVPS DIVSD DIVPD DIVSD DIVPD VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D VCMPccPS/D VCMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D VMAXPS/D VMINPS/D VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD x,m128 y,y,y y,y,m256 x,x x,m32/64 x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x,i 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 Page 131 1 1 1 1 1 1 1 1 1 1 1 1 3 1+ 2 1 3 1 3 1 3 1+ 3 1 1 1 1 1 1 1 1 1 1 1 1 AVX AVX x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x x,m128 y,y,y y,y,m256 x,x x,m128 y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m128 y,y y,m256 x,x 1 1 1 1 1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 3 4 1 1 3 4 1 1 2 4 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 3 1 3 1 3 1+ 3 1 3 1+ 2 2 2 2 1+ 5 1 5 1+ 5 1 5 1+ 10-14 1 1 1 21-29 1+ 10-22 1 1 1+ 5 1 7 1+ 1 3 21-45 5 1 5 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 10-14 10-14 20-28 20-28 10-22 10-22 20-44 20-44 1 1 2 2 1

AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX

AVX AVX

AVX AVX

AVX AVX

AVX AVX

AVX AVX

AVX AVX SSE4.1

Sandy Bridge ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD DPPS DPPS VDPPS VDPPS DPPD DPPD Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD/PD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS VRSQRTPS VRSQRTPS Logic
AND/ANDN/OR/XORPS/PD AND/ANDN/OR/XORPS/PD

x,m128,i y,y,i y,m256,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i

2 1 2 4 6 4 6 3 4

1 1 1 4 5 4 5 3 3

1 1 1 2

1 3 1+ 1 1 12 1+ 9 1 12

1 1 1 2 4 2 4 2 2

SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1

x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256

1 1 3 4 1 2 3 4 1 1 3 4

1 1 3 3 1 1 3 3 1 1 3 3

1 1

10-14 1 1+

1 1

10-21 1 21-43 1+

1 1

5 1 7 1+

10-14 10-14 21-28 21-28 10-21 10-21 21-43 21-43 1 1 2 2

AVX AVX

AVX AVX

AVX AVX

x,x x,m128 y,y,y y,y,m256 x/y,x/y,same

1 1 1 1 1

1 1 1 1 0

1 1 1 1

1 1 1 1+ 0

1 1 1 1 0.25 AVX AVX

VAND/ANDN/OR/XORPS /PD VAND/ANDN/OR/XORPS /PD (V)XORPS/PD Other VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR VSTMXCSR FXSAVE FXRSTOR XSAVEOPT

4 12 20 3 3 3 3 3 3 130 116 100-161

1 11 9 3 1 1 68 72

AVX AVX, 32 bit AVX, 64 bit

m32 m32 m32 m4096 m4096 m

1 1

1 1 1

1 1

AVX

60-500

Page 132

Pentium 4

Intel Pentium 4
List of instruction timings and op breakdown
This list is measured for a Pentium 4, model 2. Timings for model 3 may be more like the values for P4E, listed on the next sheet

Explanation of column headings:


Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of ops issued from instruction decoder and stored in trace cache. Number of additional ops issued from microcode ROM. This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured. This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1. This is also called issue latency. This value indicates the number of clock cycles from the execution of an instruction begins to a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread. The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports. Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution unit is listed. Throughput measures apply only to instructions executing in the same subunit. Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.

ops: Microcode: Latency:

Additional latency:

Reciprocal throughput:

Port:

Execution unit:

Execution subunit: Instruction set

Integer instructions

Page 133

Pentium 4 ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVSX MOVSX CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POPF(D) POPA(D) LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2

r,r r,i r32,m r8/16,m m,r m,i r,sr sr,r/m m,r32 r,r r,m r,r r,m r,r/m r,r r,m r i m sr

r m sr

r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i]

r,m r r,r/i m m

1 1 1 2 1 3 4 4 2 1 1 1 2 3 3 4 4 2 2 3 4 4 4 2 4 4 4 4 1 2 3 2 3 1 1 3 4 3 8 4 4

0 0.5 0.5-1 0.25 0/1 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 3 0 1 2 0 1 2 0 0 2 0,3 2 6 4 12 0 14 0 33 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 0.5 0.5-1 0.5 0 0 3 0.5-1 1 2,0 0 6 0 3 0 1.5 0.5-1 1 0/1 8 >100 0 3 0 1 2 0 1 2 0 2 4 7 4 10 10 19 0 1 0 1 8 14 5 13 8 52 16 14 0 0.5 0.5-1 0.25 0/1 0 1 0.5-1 0.5 0/1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0 4 1 0 0.5 0.5-1 0.5 0/1 0 5 0 1 1 7 15 0 7 0 2 64 >1000 2 6 2 6 Page 134

alu0/1 alu0/1 load load store store

alu0/1 load alu0

alu0/1

alu0/1 alu0/1 int,alu int,alu int,alu int alu0/1 int int,alu 86

86 86 86 86 86 86 86 86 sse2 386 386 386 386 ppro 86 86 86 86 186 86 86 86 186 86 86 86 86 186 86 86 386 386 386 86 86 86 86 486 sse sse

b, c

a, q c c a, e

Pentium 4 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWD, CDQ CWDE Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT 4 4 4 2 2 2 40 38 100 sse sse2 sse2

r,r r,m m,r r,r r,i r,m m,r r,r r,m r m r m

r8/32 r16 m8/32 m16 r32,r r32,(r),i r16,r r16,r,i r16,m16 r32,m32 r,m,i r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32

1 2 3 4 3 4 4 1 2 2 4 1 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 1

0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 8 4 4 6 0 6 0 6 0 6 6 8 0 8 7 9 8 0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 0.5 0.5-1 0.5 0 4 4 0 0.5 0.5-1 0.5 0 3 27 90 57 100 10 22 22 56 6 16 0 8 7 17 0 8 7-8 16 0 8 10 16 0 8 0 14 0 4.5 0 14 0 4.5 5 16 0 9 5 15 0 8 7 15 0 10 0 14 0 8 7 14 0 10 20 61 0 24 18 53 0 23 21 50 0 23 24 61 0 24 22 53 0 23 20 50 0 23 0 1 0.5-1 1 0 1 0.5-1 0.5 0 0.5 0.5-1 0.5

0/1

alu0/1

1 1 1 0/1 0/1 0

int,alu int,alu int,alu alu0/1 alu0/1 alu0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0/1 0

int int int int int int int int int int int int int int int int int int int alu0 alu0/1 alu0

fpmul fpdiv fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpdiv fpdiv fpdiv fpdiv fpdiv fpdiv

86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 386 386 386 186 386 386 186 86 86 386 86 86 386 86 86 386

c c c

c c

a a a a a

r,r r,m m,r r,r r,m r

1 2 3 1 2 1

0 0 0 0 0 0

0.5 1 8 0.5 1 0.5

0.5-1 0.5 0.5-1 1 4 0.5-1 0.5 0.5-1 1 0.5-1 0.5

alu0

0 0

alu0 alu0

86 86 86 86 86 86

c c c c c

Page 135

Pentium 4 NOT SHL, SHR, SAR SHL, SHR, SAR ROL, ROR ROL, ROR RCL, RCR RCL, RCR RCL, RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL, RCR SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR BSF, BSR SETcc SETcc CLC, STC CMC CLD STD CLI STI m r,i r,CL r,i r,CL r,1 r,i r,CL m,i/CL m,1 m,i/CL r,r,i/CL m,r,i/CL r,i r,r m,i m,r r,i r,r m,i m,r r,r r,m r m 4 1 2 1 2 1 4 4 4 4 4 4 4 3 2 4 4 3 2 4 4 2 3 3 4 3 3 4 4 4 4 0 0 0 0 0 0 15 15 4 6 4 6 4 16 16 1 0 1 0 1 0 0 4 1 1 1 1 1 15 14 10 10 14 14 14 2 1 2 12 2 4 8 14 2 3 1 3 2 2 52 48 35 43 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh 86 186 86 186 86 86 186 86 86 86 86 386 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86 86 86 86

d d d d d d d d d

7-8 10 0 7 10 0 18 18-28 14 14 0 18 14 0 0 4 0 0 4 0 0 4 0 12 12 0 0 6 0 0 6 0 7 18 0 15 14 0 0 4 0 0 4 0 0 5 0 0 5 0 0 10 0 0 10 0 7 52 0 5 48 0 5 35 12 43

d d d d

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF

1 4 3 3 4 1 4 4 3 4 4 4 4 4 4 4

0 28 0 0 31 0 4 4 0 34 4 4 38 0 0 33

0 118 4 4 11 0 0 0 2 8 9 2 2 11

1 118 4 4 11 2-4 2-4 2-4 2

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0

branch branch branch branch branch branch branch branch branch branch branch

86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86

Page 136

Pentium 4 RETF IRET ENTER ENTER LEAVE BOUND INTO INT String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CPUID RDTSC Notes: a) b) i i,0 i,n m i 4 33 11 4 48 24 4 12 26 4 45+24n 4 0 3 4 14 14 4 5 18 4 84 644 0 0 26 128+16n 3 14 18 186 186 186 86 86 86 86 186

4 4 4 4 4 4 4 4 4 4

3 6 5n 4n+36 2 6 2n+3 3n+10 4 6 163+1.1n 3 40+6n 4n 5 50+8n 4n

6 86 6 86 4 86 6 86 8 86

86 86 86 86 86

1 0 0 1 0 0 4 2 4 39-81 4 7

0.25 0/1 0.25 0/1 200-500 80

alu0/1 alu0/1 p5

86 ppro sse2 p5

c) d) e) q)

Add 1 op if source is a memory operand. Uses an extra op (port 3) if SIB byte used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer. Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH). Has (false) dependence on the flags in most cases. Not available on PMMX Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode.

Floating point x87 instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions FLD FLD

r m32/64

1 1

0 0

6 7

0 0

1 1

0 2

mov load

87 87

Page 137

Pentium 4 FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST FIST FISTP FLDZ FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m80 m80 r m32/64 m80 m80 r m16 m32/64 m16 m32/64 m 3 3 1 2 3 3 1 3 2 3 2 3 1 2 4 3 1 4 6 4 4 4 4 75 0 0 8 311 0 3 0 0 0 0 0 0 0 0 0 0 0 4 4 7 6 2 90 2 1 0 2-3 0 8 0 400 0 1 0 6 2 1 2 2-4 0 2-3 0 2-4 0 2 0 2 0 4 1 4 0 1 0 3 1 3 1 6 0 6 0 (8) 0,2 load load mov store store store mov load load store store store mov mov fp mov mov 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 287 287 87 87 87

6 7

0 10 10 10 10 10

st0,r r AX AX m16 m16 m16

2-4 0 11 11

1 0 0 0

(3)

r m m16 m32 r m m16 m32 r m m16 m32

r m r m16 m32

1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 4 3 1 1 3 6 6

0 0 4 0 0 0 4 0 0 0 4 0 0 0 0 0 0 0 4 0 0 0 15 84 84

5 5 6 5 7 7 7 7 43 43 43 43 2 2 2 2 2 10 2 2 2 23 212 212

1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0

1 1 6 2 2 2 6 2 43 43 43 43 1 1 1 1 1 3 6 2 1 1 15

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1

fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp

add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc

87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387

g, h g, h g, h g, h

Page 138

Pentium 4 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR Notes: e) f) 1 2 6 6 7 6 3 3 3 3 3 11 0 43 0 150 180 175 207 178 216 160 230 92 187 24 57 15 20 45 165 60 200 134 242 0 43 3
170 207 211 200 153

66 20 63 90
220

1 1 1 1 1 1 1 1 1 1 1 1

fp fp fp fp fp fp fp fp fp fp fp fp

div

87 87 387 387 387 87 87 87 87 87 87 87

g, h

1 2 4 6 4 4 4 4

0 0 4 29 174 96 69 94

1 0

0 0

456 528 132 208

1 0 1 0 96 1 172 420 0,1 532 96 208

mov mov

87 87 87 87 87 87 sse sse

i i

Not available on PMMX The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput is 143. Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43. Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. Takes 6 ops more and 40-80 clocks more when XMM registers are disabled.

g)

h) i)

Integer MMX and XMM instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOVD MOVD MOVD MOVD MOVD

r32, mm mm, r32 mm,m32 r32, xmm xmm, r32

2 2 1 2 2

0 0 0 0 0

5 2 8 10 6

1 0 0 1 1

1 2 1 2 2

0 1 2 0 1

fp mmx load fp mmx

alu

shift

mmx mmx mmx sse2 sse2

Page 139

Pentium 4 MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W PMIN/MAXUB xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 1 2 1 1 1 2 1 1 2 4 4 3 2 3 2 1 1 1 1 1 1 1 1 4 4 2 3 3 2 2 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 4 6 0 0 0 0 0 8 8 6 2 8 8 6 8 8 0 0 1 1 2 1 2 1 2 1 1 2 2 2 2 2 75 18 1 2 1 2 2 2 2 1 7 10 3 2 2 2 2 2 load 0,1 0 mov 1 mmx shift 2 load 0 mov 0 mov 2 load 0 mov 2 load 0 mov 0,1 mov-mmx sse2 0,1 mov-mmx sse2 0 mov 0 mov 1 1 1 1 1 1 1 1 0 0 0,1 1 1 1 1 mmx mmx mmx mmx mmx mmx mmx mmx mov mov
mmx-alu0

sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2

k k

8 8

1 1

sse sse2 mmx mmx mmx sse2 sse2 sse2 sse2 mmx sse sse2 a a a a a

2 4 2 4 2 4 2 2

1 1 1 1 1 1 1 1

shift shift shift shift shift shift shift shift

7 8 9 3 4

1 1 1 1 1

mmx-int mmx-int int-mmx int-mmx

sse sse sse2 sse sse2

r,r/m r,r/m mm,r/m xmm,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0

2 2 2 4 2 6 6 6 6 2 2

1 1 1 1 1 1 1 1 1 1 1

1,2 1,2 1 2 1,2 1,2 1,2 1,2 1,2 1,2 1,2

1 1 1 1 1 1 1 1 1 1 1

mmx mmx mmx fp mmx fp fp fp fp mmx mmx

alu alu alu add alu mul mul mul mul alu alu

mmx mmx sse2 sse2 mmx mmx sse mmx sse2 sse sse

a,j a,j a a a,j a,j a,j a,j a,j a,j a,j

Page 140

Pentium 4 PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ Other EMMS Notes: a) j) k) r,r/m r,r/m 1 1 0 0 2 4 1 1 1,2 1,2 1 1 mmx mmx alu alu sse sse a,j a,j

r,r/m r,r/m r,i/r/m xmm,i

1 1 1 1

0 0 0 0

2 2 2 4

1 1 1 1

1,2 1,2 1,2 2

1 1 1 1

mmx mmx mmx mmx

alu alu shift shift

mmx mmx mmx sse2

a,j a,j a,j a

11

12

12

mmx

Add 1 op if source is a memory operand. Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands. It may be advantageous to replace this instruction by two 64-bit moves

Floating point XMM instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D

r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r r,m m,r m,r r32,r r,r/m,i r,r/m r,r/m

1 1 2 1 4 4 1 1 1 2 1 1 3 2 2 2 1 1 1

0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0

6 7 7 6

0 0 0

2 2 7 4 2

0 1 0 0 0

1 1 2 1 2 8 2 2 1 2 2 2 4 2 4 3 2 2 2

0 2 0 0 2 0 1 1 2 0 1 1 2 0 0 1 1 1 1

mov

mov

mmx mmx

shift shift

mmx mmx

shift shift

sse sse sse sse sse sse sse sse sse sse sse sse sse sse sse/2 sse sse sse sse

k k

6 4 4 2

1 1 1 1

fp mmx mmx mmx

shift shift shift

Page 141

Pentium 4 Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/DMINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR Notes: a) r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 2 4 4 1 3 1 2 4 4 3 3 3 4 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 10 14 10 4 9 4 9 10 11 7 11 10 15 8 8 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 4 2 6 6 2 4 2 2 4 5 2 3 3 6 2.5 2.5 1 1 1 1 1 1 1 1 1 1 0,1 0,1 1 1 1 1 mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx fp-mmx fp-mmx fp-mmx fp-mmx fp-mmx fp fp shift sse2 shift shift sse2 sse2 sse2 sse sse2 sse sse2 sse2 a sse2 sse2 sse2 a sse2 a sse a a a a a sse2 sse a a a a a a

a a

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 1 1 1 2

0 0 0 0 0 0 0 0

4 4 6 23 39 38 69 4

1 1 1 0 0 0 0 1

2 2 2 23 39 38 69 4

1 1 1 1 1 1 1 1

fp fp fp fp fp fp fp mmx

add add mul div div div div

sse sse sse sse sse sse2 sse2 sse

a a a a,h a,h a,h a,h a

r,r/m r,r/m r,r/m

1 1 2

0 0 0

4 4 6

1 1 1

2 2 3

1 1 1

fp fp fp

add add add

sse sse sse

a a a

r,r/m

mmx

alu

sse

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 2 2

0 0 0 0 0 0

23 39 38 69 4 4

0 0 0 0 1 1

23 39 38 69 3 4

1 1 1 1 1 1

fp fp fp fp mmx mmx

div div div div

sse sse sse2 sse2 sse sse

a,h a,h a,h a,h a a

m m

4 4

8 4

98

100 6

1 1

sse sse

Add 1 op if source is a memory operand. Page 142

Pentium 4 h) k) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. It may be advantageous to replace this instruction by two 64-bit moves.

Page 143

Prescott

Intel Pentium 4 w. EM64T (Prescott)


List of instruction timings and op breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address. Number of ops issued from instruction decoder and stored in trace cache. Number of additional ops issued from microcode ROM. This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured. This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1. This is also called issue latency. This value indicates the number of clock cycles from the execution of an instruction begins to a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread. The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports. Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution unit is listed. Throughput measures apply only to instructions executing in the same subunit. Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.

ops: Microcode: Latency:

Additional latency: Reciprocal throughput:

Port:

Execution unit:

Execution subunit: Instruction set

Integer instructions
ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOV MOV MOV

r,r r8/16/32,i r64,i32

1 1 1

0 0 0

1 1

0 0 0

0.25 0/1 0.25 0/1 0.5 0/1

alu0/1 alu0/1 alu0/1

86 86 x64

Page 144

Prescott MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVZX MOVSX MOVSX MOVSX MOVSXD CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ... LODS REP LODS STOS REP STOS MOVS REP MOVSB REP MOVSW r64,i64 r8/16,m r32/64,m m,r m,i m64,i32 r,sr sr,r/m r,mabs mabs,r m,r32 r,r r16,r8 r,m r16,r8 r32/64,r8/16 r,m r64,r32 r,r/m r,r r,m r i m sr 2 0 0 2 0 3 0 1 0 2 0 1 0 2 0 2 0 1 2 1 8 3 0 3 0 2 0 1 0 1 0 2 0 2 0 1 0 2 0 2 0 2 0 1 0 1 0 2 0 3 0 1 0 1 0 3 0 9.5 0 3 0 2 0 2 6 100 4 0 6 2 0 2 2 0 2 3 0 2 1 3 1 3 1 9 2 0 1 0 2 6 1 8 1 8 2 16 1 0 1 0 2.5 0 2 0 3.5 0 3 0 3.5 0 2 0 3.5 0 3 0 3.5 0 1 0 4 0 1 0 5 0 2 0 0 2 10 1 3 8 1 5n 4n+50 1 2 8 1 2.5n 3n 1 4 8 9 .3n .3n 1 .5-1.1n .6-1.4n Page 145 1 1 1 2 2 2 8 27 1 2 2 0.25 1 1 1 0.5 1 0.5 3 1 1 2 2 0 0,3 0,3 alu1 load load store store store x64 86 86 86 b,c 86 x64 86 86 a,q x64 l x64 l sse2 386 c 386 c 386 386 a,c,o 386 a,c,o 386 x64 a PPro a,e 86 86 86 86 186 86 86 86 186 m 86 86 86 86 186 m 86 p 86 86 386 386 386 86 n 86 d,n 86 m 86 m 86 86 86 8 86 86 86

0/1 0/1 2 0 0 2 0 0/1

alu0/1 alu0/1 load alu0 alu0 load alu0 alu0/1

r m sr

r,[m] r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i]

2 2 2 9 9 16 1 10 30 70 15 0.25 0.25 0.5 1 1 1

r,m

1 28 8 8

0/1 0/1 0/1 1 0,1 1 1 0/1 1

alu0/1 alu0/1 alu0/1 alu alu0,1 alu int alu0/1 int

86

Prescott REP MOVSD REP MOVSQ BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV 1 1 1 1 1 1 1 1 1
1.1n 1.4 n 1.1n 1.4 n

r r,r/i m m

0 52 0 0 2 2 4

1 >1000 1 1 50 50 124

86 x64 alu 86

486 sse sse sse sse2 sse2

r,r r,m m,r r,r/i r,m m,r m,i r,r r,m r m r m

r8 r16 r32 r64 m8 m16 m32 m64 r16,r16 r16,r16,i r32,r32 r32,(r32),i r64,r64 r64,(r64),i r16,m16 r32,m32 r64,m64 r,m,i r8/m8 r16/m16 r32/m32 r64/m64

1 2 3 3 2 2 3 1 2 2 4 1 3 1 1 2 2 1 4 3 1 2 2 3 2 1 2 1 1 1 1 2 2 2 3 1 1 1 1

0 0 0 0 5 6 5 0 0 0 0 0 0 10 16 5 17 0 0 0 5 0 5 0 6 0 0 0 0 0 0 0 0 0 0 20 19 21 31

1 1 5 10 10 20 22 1 1 1 5 1 5 26 29 13 71 10 11 11 11 10 11 11 11 10 11 10 10 10 10 10 10 10 10 74 73 76 63

0 0 0 0

0 0 0 0

0.25 0/1 1 2 10 1 10 1 10 10 0.25 0/1 1 0.5 0/1 3 0.5 0 3

alu0/1

int,alu int,alu

alu0/1 alu0/1 alu0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5


1-2.5

34 34 34 52

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

int int int int int int int int int int int int int int int int int int int int int int int int

mul fpdiv mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul fpdiv fpdiv fpdiv fpdiv

86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 x64 86 86 86 x64 386 186 386 386 x64 x64 386 386 x64 186 86 86 386 x64

c c c

c c

m m m m

a a a a

Page 146

Prescott IDIV IDIV IDIV IDIV CBW CWD CDQ CQO CWDE CDQE SCAS REP SCAS CMPS REP CMPS Logic AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL SHR, SAR SHR, SAR SHL SHR, SAR SHR, SAR ROL, ROR ROL, ROR ROL, ROR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL, SHR, SAR ROL. ROR SHL, SHR, SAR ROL. ROR RCL, RCR RCL, RCR RCL, RCR SHLD, SHRD SHLD SHRD SHLD, SHRD SHLD r8/m8 r16/m16 r32/m32 r64/m64 1 21 76 0 1 19 79 0 1 19 79 0 1 58 96 0 2 0 2 0 2 0 2 0 1 0 1 0 1 0 7 0 2 0 2 0 1 0 1 0 1 3 0 1 54+6n 4n 1 5 1 81+8n 5n 34 34 34 91 1 1 1 1 1 1 8 10 86 1 1 1 1 0 0/1 0/1 0/1 0/1 0/1 int int int int alu0 alu0/1 alu0/1 alu0/1 alu0/1 alu0/1 fpdiv fpdiv fpdiv fpdiv 86 86 386 x64 86 86 386 x64 386 x64 86 86 a a a a

86

r,r r,m m,r r,r r,m r m r,i r8/16/32,i r64,i r,CL r8/16/32,CL r64,CL r8/16/32,i r64,i r8/16/32,CL r64,CL r,1 r,i r,i r,CL r,CL m8/16/32,i m8/16/32,i m8/16/32,cl m8/16/32,cl m8/16/32,1 m8/16/32,i m8/16/32,cl r8/16/32,r,i r64,r64,i r64,r64,i r8/16/32,r,cl r64,r64,cl

1 2 3 1 2 1 3 1 1 1 2 2 2 1 1 2 2 1 2 2 1 1 3 3 2 2 2 3 2 3 4 3 4 4

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 11 11 11 6 6 6 6 5 13 13 0 5 7 0 5

1 1 5 1 1 1 5 1 1 7 2 2 8 1 7 2 8 7 31 25 31 25 10 10 10 10 27 38 37 8 10 10 9 14

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0.5 1 2 0.5 1 0.5 2 0.5 0.5 2 2 2 1 7 2 8 7 31 25 31 25

alu0

0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

alu0 alu0 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1

27 38 37 7

86 86 86 86 86 86 86 186 186 x64 86 86 x64 186 x64 86 x64 86 186 186 86 86 86 86 86 86 86 86 86 386 x64 x64 386 x64

c c c c c

d d d d d d d d d d d d d d

Page 147

Prescott SHRD SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR SETcc SETcc CLC, STC CMC CLD, STD r64,r64,cl m,r,i m,r,CL r,i r,r m,i m,r r,i r,r m,i m,r r,r/m r m 3 3 2 1 2 3 2 1 2 3 2 2 2 3 2 3 1 8 8 8 0 0 0 7 0 0 6 10 0 0 0 0 0 8 12 20 20 8 9 8 10 8 9 28 14 16 9 9 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 8 9 8 10 8 9 10 14 4 1 2 8 53 1 1 1 1 1 1 1 1 1 1 1 1 1 1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 int int x64 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86

d d d d

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET BOUND m INT i INTO Other NOP (90) Long NOP (0F 1F) PAUSE LEAVE CLI STI CPUID RDTSC

1 2 3 3 2 1 4 4 3 3 4 4 2 4 4 1 2 1 2 2 1

0 25 0 0 28 0 0 0 0 29 0 0 32 0 0 30 30 49 11 67 4

1 154 15 10 157 2-4 4 4 7 160 7 9 160 7 7 160 160 325 12 470 26

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0

branch branch branch branch branch branch branch branch branch branch branch

86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 186 86 86

m m

1 0 1 0 1 2 4 0 1 5 1 11 1 49-90 1 12

0 0 5

0.25 0/1 0.25 0/1 50 5 52 64 300-500 100

alu0/1 alu0/1

86 ppro sse2 186 86 86 p5

p5

Page 148

Prescott RDPMC (bit 31 = 1) RDPMC (bit 31 = 0) MONITOR MWAIT Notes: a) b) c) d) e) l) m) n) o) 1 4 37 154 100 240 p5 p5 (sse3) (sse3) Add 1 op if source is a memory operand. Uses an extra op (port 3) if SIB byte used. Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH). Has (false) dependence on the flags in most cases. Not available on PMMX Move accumulator to/from memory with 64 bit absolute address (opcode A0 A3). Not available in 64 bit mode. Not available in 64 bit mode on some processors. MOVSX uses an extra op if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance. LEA with a direct memory operand has 1 op and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A signextended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 ops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead. These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode op and a reciprocal throughput of 17.

p)

q)

Floating point x87 instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST(P) FISTTP FLDZ

r m32/64 m80 m80 r m32/64 m80 m80 r m16 m32/64 m m

1 1 3 3 1 2 3 3 1 3 2 3 3 1

0 0 3 74 0 0 6 311 0 2 0 0 0 0

0 0

7 7

1 1 8 90 1 2 10 400 1 8 2 2.5 2.5 2

0 2 2 2 0 0 0 0 0 2 2 0 0 0

mov load load load mov store store store mov load load store store mov

87 87 87 87 87 87 87 87 87 87 87 87 sse3 87

Page 149

Prescott FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN, FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 st0,r r AX AX m16 m16 m16 2 4 3 1 4 6 2 4 3 0 0 0 0 0 0 3 0 6 5 0 1 0 0 0 2 4 3 1 3 3 8 3 10 0 1 0 0 1 1 0 0 0,2 mov fp mov mov 87 PPro 87 87 287 287 87 87 87 e

r m m16 m32 r m m16 m32 r m m16 m32

r m r m16 m32

1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 3 3 1 1 3 8 9

0 0 3 0 0 0 3 0 0 0 3 3 0 0 0 0 0 0 3 0 0 0 14 86 92

6 6 7 6 8 8 8 8 45 45 45 45 3 3 3 3 3

1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0

28 220 220

1 1 1

1 1 6 2 2 2 8 3 45 45 45 45 1 1 1 1 1 3 8 2 1 1 16

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1

fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp

add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc

87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387

g,h g,h g,h g,h

1 2 3 5 8 4 3 4 3 3 3

0 0 100 150 170 97 25 16 190 63 58

45 200 200 270 250 96 27 270 170 170

45 2
200 200 270 250

1 1 1 1 1 1 1 1 1 1 1

fp fp fp fp fp fp fp fp fp fp fp

div

87 87 387 387 87 87 87 87 87 87 87

g,h

Page 150

Prescott Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR Notes: e) f) 1 2 1 1 2 2 2 2 0 0 4 30 181 96 121 118 1 0 0 0 1 1 120 200 0 0 1 0,1 160 244 mov mov 87 87 87 87 87 87 sse sse

500 570

i i

Not available on PMMX The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput is > 100. Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45. Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. Takes fewer microcode ops when XMM registers are disabled, but the throughput is the same.

g)

h) i)

Integer MMX and XMM instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q

r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm xmm,m mm,xmm

2 1 1 1 2 1 2 1 1 1 2 1 1 2 4 4 4 3

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0

6 3 7 4

1 1 1 1

7 2

0 1

10

1 1 1 1 2 1 2 1 2 1 2 1 1 2 23 8 2.5 2

0 fp 1 mmx 2 load 0 fp 1 mmx 2 load 0,1 0 mov 1 mmx 2 load 0 mov 0 mov 2 load 0 mov 2 load 0 mov 2 load 0,1 mov-mmx

alu

shift

shift

mmx mmx mmx sse2 sse2 sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2 sse3

k k

sse2

Page 151

Prescott MOVQ2DQ MOVNTQ MOVNTDQ MOVDDUP MOVSHDUP MOVSLDUP PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W PMIN/MAXUB PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ xmm,mm m,mm m,xmm xmm,xmm xmm,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i r,r32,i 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 0 0 0 0 10 1 2 4 4 2 2 2 4 2 4 2 2 2 1 10 12 3 2 3 2 0,1 mov-mmx sse2 0 mov 0 mov 1 mmx shift 1 1 1 1 1 1 1 1 1 0 0 0,1 1 1 1 mmx mmx mmx mmx mmx mmx mmx mmx mmx mov mov shift shift shift shift shift shift shift shift shift sse sse2 sse3 sse3 mmx mmx mmx sse2 sse2 sse2 sse sse sse sse2 a a a a a

2 4 2 4 2 4 2 4 2 2

1 1 1 1 1 1 1 1 1 1

7 7 7 4

mmx-alu0 sse

mmx-int sse mmx-int sse2 int-mmx sse

r,r/m r,r/m mm,r/m xmm,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0 0 0

2 2 2 5 2 7 7 7 7 2 2 2 4

1 1 1 1 1 1 1 1 1 1 1 1 1

1,2 1,2 1 2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2

1 1 1 1 1 1 1 1 1 1 1 1 1

mmx mmx mmx fp mmx fp fp fp fp mmx mmx mmx mmx

alu alu alu add alu mul mul mul mul alu alu alu alu

mmx mmx sse2 sse2 mmx mmx sse mmx sse2 sse sse sse sse

a,j a,j a a a,j a,j a,j a,j a,j a,j a,j a,j a,j

r,r/m r,r/m r,i/r/m xmm,i

1 1 1 1

0 0 0 0

2 2 2 4

1 1 1 1

1,2 1,2 1,2 2

1 1 1 1

mmx mmx mmx mmx

alu alu shift shift

mmx mmx mmx sse2

a,j a,j a,j

Page 152

Prescott Other EMMS Notes: a) j) k) 10 10 12 0 mmx

Add 1 op if source is a memory operand. Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands. It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Floating point XMM instructions


ops Microcode Latency Additional latency Port Reciprocal throughput Execution unit Subunit Instruction set Notes Instruction Operands

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS
MOVHPS/D, MOVLPS/D

MOVHPS/D, MOVLPS/D

MOVSH/LDUP MOVDDUP MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS

r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r r,m m,r r,r r,r m,r r32,r r,r/m,i r,r/m r,r/m

1 1 2 1 4 4 1 1 1 2 1 1 2 2 1 1 2 2 1 2 1

0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0

2 4

1 1 0 1 1

4 2

4 2 5 4 4 2

1 1 1 1 1 1

1 1 2 1 2 8 2 2 1 2 2 2 2 2 2 2 4 3 2 2 2

0 2 0 0 2 0 1 1 2 0 1 1 2 0 1 1 0 1 1 1 1

mov

mov

mmx mmx

shift shift

mmx mmx

shift shift

fp mmx mmx mmx

shift shift shift

sse sse sse sse sse sse sse sse sse sse sse sse sse sse sse3 sse3 sse sse sse sse sse

k k

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm

1 2 3 2 1 3 1 2 4

0 0 0 0 0 0 0 0 0

4 10 14 8 5 10 5 11 12

1 1 1 1 1 1 1 1 1

4 2 6 6 2 4 2 2 6

1 1 1 1 1 1 1 1 1

mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx

shift sse2 shift shift sse2 sse2

sse2 a sse2 sse2 sse2 a sse2 a sse

a a a a a a

Page 153

Prescott CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/DMINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR Notes: a) h) k) xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 3 4 3 4 2 2 0 0 0 0 0 0 0 12 8 12 20 20 12 17 1 0 1 1 1 1 1 5 2 3 4 5 4 4 1 0,1 0,1 1 1 1 1 fp-mmx sse2 fp-mmx sse fp-mmx sse2 fp-mmx sse fp-mmx sse2 fp fp a a a a a sse2 sse

a a

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 3 1 1 1 1 1 2

0 0 0 0 0 0 0 0 0 0

5 5 5 13 7 32 41 40 71 6

1 1 1 1 1 1 1 1 1 1

2 2 2 5-6 2 23 41 40 71 4

1 1 1 1 1 1 1 1 1 1

fp fp fp fp fp fp fp fp fp mmx

add add add add mul div div div div

sse sse sse3 sse3 sse sse sse sse2 sse2 sse

a a a a a a,h a,h a,h a,h a

r,r/m r,r/m r,r/m

1 1 2

0 0 0

5 5 6

1 1 1

2 2 3

1 1 1

fp fp fp

add add add

sse sse sse

a a a

r,r/m

mmx

alu

sse

r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m

1 1 1 1 2 2

0 0 0 0 0 0

32 41 40 71 5 6

1 1 1 1 1 1

32 41 40 71 3 4

1 1 1 1 1 1

fp fp fp fp mmx mmx

div div div div

sse sse sse2 sse2 sse sse

a,h a,h a,h a,h a a

m m

2 3

11 0

13 3

1 1

sse sse

Add 1 op if source is a memory operand. Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Page 154

Atom

Intel Atom
List of instruction timings and op breakdown
Explanation of column headings:
Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of ops from the decoder or ROM. Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously. ALU0 and ALU1 means integer unit 0 or 1, respectively. ALU0/1 means that either unit can be used. ALU0+1 means that both units are used. Mem means memory in/out unit. FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions). FP1 means floating point unit 1 (adder). MUL means multiplier, shared between FP and integer units. DIV means divider, shared between FP and integer units. np means not pairable: Cannot execute simultaneously with any other instruction. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

ops: Unit:

Latency:

Reciprocal throughput:

Integer instructions
Operands ops Unit Latency Reciprocal throughput 1 1 1-3 1 1 1/2 1/2 1 1 1 1 5 21 26 2.5 1 2 Remarks

Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD CMOVcc

r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r/m r,r

1 1 1 1 1 1 2 7 8 1 1 1

ALU0/1 ALU0/1 ALU0, Mem ALU0, Mem ALU0, Mem

All addr. modes All addr. modes

ALU0, Mem ALU0 ALU0+1

1 2

Page 155

Atom CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL r,m r,r r,m r i m sr 1 3 4 3 1 1 2 3 14 9 1 1 3 7 19 16 1 1 2 1 1 10 1 1 1 1 1 6 6 6 1 3 6 6 6 1 1 5 6 12 11 1 1 6 31 28 12 2 1/2 5 1 1 30 1 1 1/2 1 1

Implicit lock

np np

Not in x64 mode

r (E/R)SP m sr

np np

1 1

Not in x64 mode

ALU0+1 ALU0/1

2 1 7 1-4 1 30

r,m r m m m

AGU1 ALU0 Mem Mem

Not in x64 mode 4 clock latency on input register

r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m

r8 r16 r32 r64

1 1 1 1 1 1 1 1 1 1 13 13 20 21 4 10 3 4 3 8

ALU0/1 ALU0/1, Mem

1 2 2 2 2 1 1 1 16 12 20 25 7 24 7 6 6 14

ALU0/1 ALU0/1

1/2 1 1 2 2 2 1/2 1 1/2 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode 7 6 6 14

ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul

Page 156

Atom IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHLD SHLD SHLD SHLD SHLD SHRD SHRD SHRD SHRD SHRD r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r/m8 r/m16 r/m32 r/m 64 r/m8 r/m16 r/m32 r/m64 2 1 6 2 1 7 3 5 4 8 9 12 12 38 26 29 29 60 2 1 1 2 1 1 ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0 ALU0 ALU0 ALU0 ALU0 ALU0 6 5 13 5 5 14 6 7 7 14 22 33 49 183 38 45 61 207 5 1 1 5 1 1 5 2 11 5 2 14

22 33 49 183 38 45 61 207

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r,1 r/m,i/cl r/m,i/cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl

1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0/1, Mem 1 1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0 1 1 ALU0 1 1 ALU0 1 1 ALU0 1 5 ALU0 7 2 ALU0 1 12-17 ALU0 12-15 14-20 ALU0 14-18 10 ALU0 10 2 ALU0 5 10 ALU0 11 9 ALU0 9 2 ALU0 5 9 ALU0 10 8 ALU0 8 2 ALU0 5 10 ALU0 9 7 ALU0 8 2 ALU0 5 Page 157

1/2 1 1 1/2 1 1 1 1 1

1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem

Atom SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc CLC STC CMC CLD STD r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r/m r m 9 1 9 2 1 10 3 10 1 2 1 1 5 6 ALU0 ALU1 9 1 10 5 1 11 6 16 2 1-2 more if mem 1

ALU1 ALU1 ALU1 ALU0+1 ALU0/1

2 5 1/2 2 7 25

Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Conditional jump short/near J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND r,m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other

1 29 1 2 30 1 3 8 8 1 37 1 2 38 1 1 36 36 11 4

ALU1

ALU1

np np

2 66 4 7 78 2 7 8 8 3 65 18 20 64 6 6 80 80 10 6

Not in x64 mode

Not in x64 mode

Not in x64 mode Not in x64 mode

3 5n+11 2 3n+10 4 4n+11 3 5n+16 5 6n+16

6 3n+50 5 2n+4 6 2n - 4n 6 3n+60 7 4n+40

fastest for high n

Page 158

Atom NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 1 1 5 14 20+6b 4 40-80 16 24 ALU0/1 ALU0/1 24 23 6 100-170 29 48 1/2 1/2

a,0 a,b

Floating point x87 instructions


Operands Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) ops Unit Latency Reciprocal through1 1 3 1 9 10 92 92 1 1 7 9 12 13 221 221 1 1 7 6 11 9 11 9 1 8 10 9 9 10 10 8 9 1 1 1 321 321 177 177 Remarks

r m32/m64 m80 m80 r m32/m64 m80 m80 r m m m

r AX m16 m16 m16

m m

1 1 4 52 1 3 8 189 1 1 3 3 1 2 2 3 4 4 2 3 1 1 166 83

SSE3

r/m r/m r/m

r/m r m

1 1 1 1 1 1 1 5 3 Page 159

Mul Div

5 5 71 1 1 1 1

1 2 71 1 1 1 1 10 9

Atom FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT m m m 3 3 3 1 1 26 37 19 Mul Div 1 1 ~110 ~130 48 9 73 9 1 1

30 15 1 9 112 25 63 100 91

Div

56 24 71 ~260 ~260 ~100 ~220 ~300 ~300

1 2 4 23

5 74

1 5 26

Integer MMX and XMM instructions


Operands Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ ops Unit Latency Reciprocal through4 5 3 4 1 4 5 1 4 5 6 6 6 1 1 ~400 ~450 2 1 1 1 1/2 1 1 1/2 1 1 6 6 6 1 1 1 3 Remarks

r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm

1 1 1 1 1 1 1 1 1 1 3 4 4 1 1 1 1

Mem Mem FP0/1 Mem Mem FP0/1 Mem Mem Mem Mem Mem

Mem Mem

Page 160

Atom PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PSHUFB PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PINSRW PEXTRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PSADBW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm mm,mm xmm,xmm mm,mm,i xmm,xmm,i xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm (x)mm,r32,i r32,(x)mm,i 1 1 1 1 4 1 1 1 1 1 2 1 1 2 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 Mem Mem 1 1 1 1 6 1 1 1 1 1 1 1 1 6 1 1 1 1 2 7 2 1 5

4 3 5

(x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm

1 2 7 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

FP0/1

FP0/1 FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0/1 FP0/1 FP0/1 FP0/1 FP0/1

1 5 8 6 1 4 5 4 5 4 5 4 5 4 5 4 5 1 1 1 1 1

1/2 5 8 1/2 1 2 1 2 1 2 1 2 1 2 1 2 1/2 1/2 1/2 1/2 1/2

Logic instructions PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS

(x)mm,(x)mm (x)mm,(x)mm (x)xmm,i xmm,i

1 2 1 1

FP0/1 FP0 FP0 FP0

1 5 1 1

1/2 5 1 1

Page 161

Atom

Floating point XMM instructions


Operands Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD ops Unit Latency Reciprocal through1 1/2 4 1 5 1 6 6 6 6 1 1/2 4 1 5 1 5 1 4 1 4 1 1 1 4 2 ~500 3 1 1 1 1 1 1 1 1 1 1 1 Remarks

xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

FP0/1 Mem Mem Mem Mem FP0/1 Mem Mem Mem Mem Mem FP0 Mem FP0 FP0 FP0 FP0 FP0 FP0

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm

4 3 4 3 3 3 3 3 1 1 3 4 3 3 3 3

11 10 7 6 6 6 7 6 6 4 7 7 7 10 8 10

11 10 6 6 6 6 6 6 5 1 6 7 6 8 6 8

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 3 1 3 Page 162

FP1 FP1 FP1 FP1 FP1 FP1

5 5 5 6 5 6

1 1 1 6 1 6

Atom HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 5 5 1 1 1 6 3 3 6 6 1 5 1 3 4 1 3 FP0+1 FP0+1 FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Div FP0, Div FP0, Div FP0, Div 8 8 4 5 5 9 31 60 64 122 4 9 5 6 9 5 6 7 7 1 2 2 9 31 60 64 122 1 8 1 6 9 1 6

FP0 FP0 FP0 FP0 FP0

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

3 5 3 5 1 5

FP0, Div FP0, Div FP0, Div FP0, Div FP0 FP0

31 63 60 121 4 9

31 63 60 121 1 8

xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 1

FP0/1 FP0/1 FP0/1 FP0/1

1 1 1 1

1/2 1/2 1/2 1/2

m32 m32 m4096 m4096

4 4 121 116

5 14 142 149

6 15 144 150

Page 163

VIA Nano 2000

VIA Nano 2000 series


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of micro-operations from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for ops. Therefore the number of ops cannot be determined except in simple cases. Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously. Integer add, Boolean, shift, etc. I1: Integer add, Boolean, move, jump. I2: Can use either I1 or I2, whichever is vacant first. I12: Multiply, divide and square root on all operand types. MA: Various Integer and floating point SIMD operations. MB: Floating point addition subunit under MB. MBfadd: Memory store address. SA: Memory store. ST: Memory load. LD: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type. Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

ops:

Port:

Latency:

Integer instructions
Operands ops Port Latency Reciprocal thruoghput 1 1 1 1.5 1.5 1 2 20 20 1.5 Latency 4 on pointer register Remarks

Move instructions
MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 I2 I2 LD SA, ST SA, ST 1 1 2 2

SA, ST Page 164

20 20 2

VIA Nano 2000 MOVSX MOVSXD MOVZX MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS r,r r,m r,m r,r r,m r,r r,m m r i m sr 1 2 1 2 3 I2 LD, I2 LD I1, I2 LD, I1 I2 1 3 2 2 5 3 20 6 1 1 1 1 2 3 20 1-2 1-2 2 17 8 15 1.25 4 5 20 9 12 1 1 6 1 1 30 1-2 1-2 14 14 14

Implicit lock

SA, ST SA, ST Ld, SA, ST 8

Not in x64 mode

r (E/R)SP m sr

LD

9 1 1 r,m r m m m 1 1 I1 I1 SA I2 1 1 9 1 1 30 LD LD

Not in x64 mode

Not in x64 mode 3 clock latency on input register

r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m

1 2 3 1 2 3 1 2 1 3

I12 LD I12
LD I12 SA ST

1 5 1 5 1 1 5

I1 LD I1 LD I1 SA ST I12 LD I12 I12


LD I12 SA ST

1/2 1 2 1 1 2 1/2 1 1/2 37 37 22 24 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode

Page 165

VIA Nano 2000 AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 1 1 MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA I1 I1 7-9 7-9 7-9 8-10 4-6 4-6 5-7 4-6 4-6 5-7 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1 23 30 Not in x64 mode Not in x64 mode Extra latency to other ports do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do.

1 1 2 1 1 2 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i r32,r32,i r64,r64,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r r

1 2 3 1 2 1 1 1

I12 LD I12
LD I12 SA ST

1 5 1 1 1 1 28+3n 11 7 33 43 11 7 33 43 1

1 2 2

I12 LD I12 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 Page 166

2 10 8 3 2

1/2 1 2 1/2 1 1 1 1 28+3n 11 7 33 43 11 7 33 43 1 8 1 2 10 8 2 1

VIA Nano 2000 SETcc CLC STC CMC CLD STD Control transfer instructions JMP JMP JMP JMP JMP Conditional jump short/near far r m(near) m(far) short/near 1 I2 3 58 3 3 55 1-3-8 3 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block do. do. 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do. m I1 3 3 1 3 3

I2

3 3 1-3-8

J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q REP STOSB/W/D/Q

short short short near far r m(near) m(far)

1-3-8 1-3-8 25 3 72 3 4 72 3 3 39 39

1-3-8 1-3-8 25 3 72 3 3 72 3 3 39 39 13 7

i i r,m

Not in x64 mode Not in x64 mode

MOVSB/W/D/Q REP MOVSB/W/D/Q

SCASB/W/D/Q REP SCASB REP SCASW/D/Q

CMPSB/W/D/Q Page 167

1 3n+22 1-2 Small: 2n+2, Big: 6 bytes per clock 2 Small: 2n+45, Big: 6 bytes per clock 1 2.2n Small: 2n+50 Big: 5 bytes per clock 6

VIA Nano 2000 REP CMPSB/W/D/Q Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 2.4n+24

1 1 a,0 a,b

All I12

4 53-173 40

1 1/2 25 23 52+5b 4 39 40

Blocks all ports

Floating point x87 instructions


Operands ops Port and Unit MB LD MB LD MB MB MB SA ST MB SA ST I2

Move instructions
FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) r/m r/m r/m 1 1 MB MA MA r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 1 2 2 1 3 3 1

Latency Reciprocal thruoghput 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 1 1 1 54 1 1-2 1-2 125 1

Remarks

1 r AX m16 m16 m16 1 1 m m

MB 2

13 I2 MB 0 321 195

1 10 2 5 3 13 2 1 1 321 195

2 4 15-42

1 2 15-42

Lower precision: Lat: 4, Thr: 2

Page 168

VIA Nano 2000 FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT 1 1 1 1 1 MB MB MB MB MB MB 1 1 1 1 1 1 1 2 4 42 2 1 41

r/m r m m m m

1 1

MB 151-171 106-155 29

39 36-57 73 51-159 270-360 50-200 ~60 ~170 300-370 ~170

1 1

MB I12

1 1/2 57 85

Integer MMX and XMM instructions


Operands ops Port and Unit

Move instructions
MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 1 1 1 1 1 1 1 1 1 1 1 SA ST LD MB LD SA ST MB LD SA ST SA ST LD

Latency Reciprocal thruoghput 3 2-3 4 2-3 1 2-3 2-3 1 2-3 2-3 2-3 2-3 1 1-2 1 1 1 1 1-2 1 1 1-2 1-2 1

Remarks

Page 169

VIA Nano 2000 LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULHRSW PMULUDQ PMADDWD PMADDUBSW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND Logic instructions PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm 1 1 3 3 1 1 1 1 MB MB MB MB MB MA MA MA 1 1 3 3 1 3 3 3 4 10 2 1 1 1 1 1 1 1 3 3 1 1 1 1 2 8 1 1 1 1 1 1 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm,i xmm,xmm,i xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm r32 ,(x)mm,i (x)mm,r32,i 1 1 1 3 3 1 1 1 1 1 1 1 1 LD MB MB 2-3 1 1 ~300 ~300 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1-3 1-3 1 1 9

MB MB MB MB MB MB MB MB

3 3 9

1 1 1 1 1

MB MB MB MB MB MB

(x)mm,(x)mm (x)mm,(x)mm (x)xmm,i xmm,i

1 1 1 1

MB MB MB MB

1 1 1 1

1 1 1 1

MB Page 170

VIA Nano 2000

Floating point XMM instructions


Operands ops Port and Unit MB LD SA ST LD SA ST MB LD SA ST

Move instructions
MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 1 1 1 1 1

Latency Reciprocal thruoghput 1 2-3 2-3 2-3 2-3 1 2-3 2-3 6 6 6 2 1 3 ~300 1 1 1 1 1 1 1 1 1-2 1 1-2 1 1 1-2 1 1 1-2 1-2 1 1 2.5 1 1 1 1 1 1

Remarks

MB

1 1 1 1 1 1

MB MB MB MB MB MB

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm

3-4 15 3-4 15 3 2 4 3 4 3 4 3 5 4 5 4

xmm,xmm xmm,xmm xmm,xmm

1 1 1

MBfadd MBfadd MBfadd

2-3 2-3 2-3

1 1 1

Page 171

VIA Nano 2000 ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA 2-3 2-3 2-3 5 5 3 4 3 4 15-22 15-36 42-82 24-70 5 14 2 2 3 2 2 1 1 1 3 3 1 2 1 2 15-22 15-36 42-82 24-70 5 11 1 1 1 1 1

1 1

1 1

MBfadd MBfadd

1 1

MBfadd MBfadd

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

MA MA MA MA

33 126 62 122 5 14

33 126 62 122 5 11

xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 1

MB MB MB MB

1 1 1 1

1 1 1 1

m32 m32 m4096 m4096

45 13 208 232

29 13 208 232

VIA-specific instructions
Instruction XSTORE XSTORE REP XSTORE REP XSTORE REP XCRYPTECB Conditions Data available No data available Quality factor = 0 Quality factor > 0 128 bits key Clock cycles, approximately 160-400 clock giving 8 bytes 50-80 clock giving 0 bytes 4800 clock per 8 bytes 19200 clock per 8 bytes 44 clock per 16 bytes

Page 172

VIA Nano 2000 REP XCRYPTECB REP XCRYPTECB REP XCRYPTCBC REP XCRYPTCBC REP XCRYPTCBC REP XCRYPTCTR REP XCRYPTCTR REP XCRYPTCTR REP XCRYPTCFB REP XCRYPTCFB REP XCRYPTCFB REP XCRYPTOFB REP XCRYPTOFB REP XCRYPTOFB REP XSHA1 REP XSHA256 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 46 clock per 16 bytes 48 clock per 16 bytes 54 clock per 16 bytes 59 clock per 16 bytes 63 clock per 16 bytes 43 clock per 16 bytes 46 clock per 16 bytes 48 clock per 16 bytes 54 clock per 16 bytes 59 clock per 16 bytes 63 clock per 16 bytes 54 clock per 16 bytes 59 clock per 16 bytes 63 clock per 16 bytes 3 clock per byte 4 clock per byte

Page 173

Nano 3000

VIA Nano 3000 series


List of instruction timings and op breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. The number of micro-operations from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for ops. Therefore the number of ops cannot be determined except in simple cases. Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously. Integer add, Boolean, shift, etc. I1: Integer add, Boolean, move, jump. I2: Can use either I1 or I2, whichever is vacant first. I12: Multiply, divide and square root on all operand types. MA: Various Integer and floating point SIMD operations. MB: Floating point addition subunit under MB. MBfadd: Memory store address. SA: Memory store. ST: Memory load. LD: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type. Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

ops:

Port:

Latency:

Integer instructions
Operands ops Port Latency Reciprocal thruoghput 1 1 2 2 1 1/2 1 1.5 1.5 1/2 1.5 20 20 1.5 Latency 4 on pointer register Remarks

Move instructions
MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 I2 I12 LD SA, ST SA, ST I12

SA, ST Page 174

20 20 2

Nano 3000 MOVSX MOVZX MOVSXD MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD r,r r64,r32 r,m r,m r,r r,m r,r r,m m r i m sr 1 1 2 1 1 3 3 1 1 I12 LD, I12 LD I12 LD, I12 I12 LD, I1 SA, ST SA, ST LD, SA, ST 1 1 3 2 1 5 3 18 6 1/2 1 1 1 1/2 1 1.5 18 2 1-2 1-2 2 6 2 15 1.25 4 2 11 1 12 1 1 6 1 1 28 1 1 15

Implicit lock

r (E/R)SP m sr

3 9 2 3 3 16 1 1 2

2 LD

Not in x64 mode

Not in x64 mode

I1 I1

1 1 10 1 1 28

r,m r m m m

1 1 12 1 1

SA I2

Not in x64 mode Extra latency to other ports

LD LD

r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m

1 2 3 1 2 3 1 2 1 3 12 12 14 14 7

I12 LD I12
LD I12 SA ST

1 5 1 5 1 1 5

I1 LD I1 LD I1 SA ST I12 LD I12 I12


LD I12 SA ST

1/2 1 2 1 1 2 1/2 1 1/2 37 22 22 24 24 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode

Page 175

Nano 3000 AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc CLC STC CMC CLD STD r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 13 1 3 3 3 1 1 1 1 1 1 31 I2 I2 I2 MA I2 I2 MA I2 I2 MA MA MA MA MA MA MA MA MA I2 I2 2 3 3 8 2 2 5 2 2 5 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1 8 1 1 2 1 1 Extra latency to other ports Not in x64 mode

Extra latency to other ports

1 1

Extra latency to other ports 2 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1

r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i/cl r32,r32,i/cl r64,r64,i/cl r64,r64,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r8 m

1 I12 2 LD I12 3 LD I12 SA ST 1 I12 2 LD I12 1 I12 1 I1 1 I1 5+2n I1 2 I1 2 I1 16 I1 23 I1 1 I1 6 I1 2 I1 2 I1 8 I1 5 I1 2 I1 1 I1 2 3 I1 3 I1

1 5 1 1 1 1 28+3n 2 2 32 42 1

2 10 8 2 1 3 3

1/2 1 2 1/2 1 1/2 1 1 28+3n 2 2 32 42 1 8 1 2 10 8 2 1 2 3 3

Page 176

Nano 3000 Control transfer instructions JMP JMP JMP JMP JMP short/near far r m(near) m(far) 1 14 2 2 17 I2 3 3 50 3 3 42 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block

I2

3 3

Conditional jump J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q

short/near short short short near far r m(near) m(far)

1 2 2 5 2 17 2 3 19 3 4 20 20 9 3

I2

1-3-8 1-3-8 1-3-8 24 3

1-3-8 1-3-8 1-3-8 24 3 58 3 3 54 3 3 49 49 13 7

3 4

8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do.

i i r,m

3 3

Not in x64 mode Not in x64 mode

2 3n 1

REP STOSB/W/D/Q MOVSB/W/D/Q

REP MOVSB/W/D/Q SCASB/W/D/Q REP SCASB

REP SCASW/D/Q CMPSB/W/D/Q REP CMPSB/W/D/Q

1 3n+27 1-2 Small: n+40, Big: 6-7 bytes/clk 2 Small: 2n+20, Big: 6-7 bytes/clk 1 2.4n Small: 2n+31, Big: 5 bytes/clk 6 2.2n+30

Page 177

Nano 3000 Other NOP (90) long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 0-1 0-1 2 10 3 I12 I12 0 0 1/2 1/2 6 21 52+5b 2 37 40 Sometimes fused

a,0 a,b

2 55-146

Floating point x87 instructions


Operands ops Port

Move instructions
FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 1 2 2 36 1 3 3 80 1 3 2 2 3 3 3 1 3 1 1 3 5 3 1 1 122 115 MB LD MB LD MB MB MB SA ST MB SA ST I2

Latency Reciprocal thruoghput 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 1 1 1 54 1 1-2 1-2 125 1

Remarks

MB MB 2

r AX m16 m16 m16

I2 MB

0 319 196

m m

1 10 2 1 2 8 2 1 1 319 196

r/m r/m r/m

1 1 1 1 1

r/m

MB MA MA MB MB MB Page 178

2 4 14-23 1 1

1 2 14-23 1 1 1

Nano 3000 FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT r m m m m 1 1 3 3 3 3 1 15 MB MB MB 2 1 1 2 4 16 2 1 38

MB

11

2 38 ~130 ~130 27

22 13

37 57 73 ~150 270-360 50-200 ~50 ~50 300-370 ~180 Less at lower precision

1 1

MB I12

1 1/2 59 84

Integer MMX and XMM instructions


Operands ops Port

Move instructions
MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 MB SA ST I2 LD MB LD SA ST MB LD SA ST SA ST LD LD MB Page 179

Latency Reciprocal thruoghput 3 2 4 2 1 2 2 1 2 2 2 2 2 1 1 1-2 1 1 1 1 1-2 1 1 1-2 1-2 1 1 1

Remarks

Nano 3000 MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKUSDW


PUNPCKH/LBW/WD/DQ

xmm,mm m64,mm m128,xmm xmm,m128 (x)mm, (x)mm xmm,xmm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm,i xmm,xmm,i xmm,xmm,i x,x,xmm0 xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm r32 ,(x)mm,i r32/64 ,xmm,i (x)mm,r32,i xmm,r32/64,i xmm,xmm

1 2 2 1 1 1 1 1 1 1 1 1 1 1 1

MB

1 ~360 ~360 2 1 1 1 1 1 1 1 1 2 1 1

1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1-2 1-2 1 1 1 1 1 1

PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PBLENDVB PBLENDW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRB/D/Q PINSRW PINSRB/D/Q PMOVSX/ZXBW/BD/ BQ/WD/WQ/DQ Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PCMPEQQ PMULL/HW PMULHUW PMULHRSW PMULLD PMULUDQ PMULDQ PMADDWD PMADDUBSW PSADBW MPSADBW PAVGB/W PMIN/MAXSW PMIN/MAXUB PMIN/MAXSB/D PMIN/MAXUW/D PHMINPOSUW

MB MB MB MB MB MB MB MB MB MB MB

1 1 2 2 1

MB MB MB MB MB

3 3 3 5 5 1

(x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm,i (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm xmm,xmm xmm,xmm

1 1 3 3 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1

MB MB MB MB MB MB MA MA MA MA MA MA MB MB MB MB MB MB MB MB

1 1 3 3 1 1 3 3 3 3 3 4 10 2 2 1 1 1 1 1 2

1 1 3 3 1 1 1 1 1 1 1 2 8 1 1 1 1 1 1 1 1

Page 180

Nano 3000 PABSB PABSW PABSD (x)mm,(x)mm PSIGNB PSIGNW PSIGND Logic instructions PAND(N) POR PXOR PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS (x)mm,(x)mm 1 1 MB MB 1 1 1 1

(x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)xmm,i xmm,i

1 1 1 1 1

MB MB MB MB MB

1 3 1 1 1

1 1 1 1 1

MB

Floating point XMM instructions


Operands ops Port

Move instructions
MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 1 2 1 1 2 2 2 3 1 1 2 1 1 1 1 1 1 MB LD SA ST LD SA ST MB LD SA ST

Latency Reciprocal thruoghput 1 2 2 2 2 1 2-3 2-3 6 6 6 2 1 3 ~360 1 1 1 1 1 1 1 1 1 1 1 1 1 1-2 1 1 1-2 1-2 1 1 1-2 1 1 1 1 1 1

Remarks

MB MB MB MB MB MB

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

2 1 2 1 1 1 2

MB

5 2 5 2 3 2 5

2 1 1 1 1

Page 181

Nano 3000 CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm 2 1 2 2 2 1 2 1 4 5 4 4 4 5 4 5 4 2 2 1 1 2 1 1

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1

MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA MA MA MBfadd MBfadd MBfadd MBfadd MBfadd

2 2 2 2 2 2 5 5 3 4 3 4 13 13-20 24 21-38 5 14 2 2 3 2 2

1 1 1 1 1 1 3 3 1 2 1 2 13 13-20 24 21-38 5 11 1 1 1 1 1

xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 1 1 3

MA MA MA MA

33 64 62 122 5 14

33 64 62 122 5 11

xmm,xmm xmm,xmm xmm,xmm xmm,xmm

1 1 1 1

MB MB MB MB

1 1 1 1

1 1 1 1

Page 182

Nano 3000 Other LDMXCSR STMXCSR FXSAVE FXRSTOR m32 m32 m4096 m4096 31 13 97 201

VIA-specific instructions
Instruction XSTORE XSTORE REP XSTORE REP XSTORE REP XCRYPTECB REP XCRYPTECB REP XCRYPTECB REP XCRYPTCBC REP XCRYPTCBC REP XCRYPTCBC REP XCRYPTCTR REP XCRYPTCTR REP XCRYPTCTR REP XCRYPTCFB REP XCRYPTCFB REP XCRYPTCFB REP XCRYPTOFB REP XCRYPTOFB REP XCRYPTOFB REP XSHA1 REP XSHA256 Conditions Data available No data available Quality factor = 0 Quality factor > 0 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key 128 bits key 192 bits key 256 bits key Clock cycles, approximately 160-400 clock giving 8 bytes 50-80 clock giving 0 bytes 1300 clock per 8 bytes 5455 clock per 8 bytes 15 clock per 16 bytes 17 clock per 16 bytes 18 clock per 16 bytes 29 clock per 16 bytes 33 clock per 16 bytes 37 clock per 16 bytes 23 clock per 16 bytes 26 clock per 16 bytes 27 clock per 16 bytes 29 clock per 16 bytes 33 clock per 16 bytes 37 clock per 16 bytes 29 clock per 16 bytes 33 clock per 16 bytes 37 clock per 16 bytes 5 clock per byte 5 clock per byte

Page 183

You might also like