Instruction Table For Assembly Language

4. Instruction tables
Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
By Agner Fog. Copenhagen University College of Engineering. Copyright 1996 - 2012. Last updated 2012-02-29.
Introduction
This is the fourth in a series of five manuals:
1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs.
5. Calling conventions for different C++ compilers and operating systems.
The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed below. The present manual contains tables of instruction latencies, throughputs and micro-operation breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.
The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere. The discrepancies can be explained by the following factors:
- My figures are experimental values while figures published by microprocessor vendors may be based on theory or simulations.
- My figures are obtained with a particular test method under particular conditions. It is possible that different values can be obtained under other conditions.
- Some latencies are difficult or impossible to measure accurately, especially for memory access and type conversions that cannot be chained.
- Latencies for moving data from one execution unit to another are listed explicitly in some of my tables while they are included in the general latencies in some tables published by Intel.

Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64-bit). Values for far calls and interrupts may be different in different modes. Call gates have not been tested.

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
If any text in the PDF version of this manual is unreadable, then please refer to the spreadsheet version.

Copyright notice
This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See www.gnu.org/copyleft/fdl.html
Definition of terms
Operands
Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x or xmm = 128 bit xmm register, y = 256 bit ymm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Latency
The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.

Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity may increase the latencies by possibly more than 100 clock cycles on many processors, except in move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results may give a similar delay.

A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.

Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64 bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc. And the upper 64 bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc. as shown in the figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles per instruction plus one extra clock cycle in the end. The latency in this case is listed as 4 in the tables because this is the value it adds to a dependency chain.
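The timing of the split 128-bit chain described above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch; the function names and the default parameters (a latency of 4 and two 64-bit parts per operand) are assumptions taken from the example in the text, not part of any test program.

```python
def start_times(n_instructions, latency=4, parts=2):
    """Clock cycle at which each part of each chained instruction enters the unit.

    Part p of instruction k must wait for part p of instruction k-1,
    so it starts at k * latency + p.
    """
    return [[k * latency + p for p in range(parts)] for k in range(n_instructions)]


def total_chain_latency(n_instructions, latency=4, parts=2):
    """Cycles until the last part of the last instruction in the chain is finished."""
    last_start = (n_instructions - 1) * latency + (parts - 1)
    return last_start + latency


if __name__ == "__main__":
    times = start_times(5)
    print("lower halves start at:", [t[0] for t in times])
    print("upper halves start at:", [t[1] for t in times])
    print("one instruction in isolation:", total_chain_latency(1))
    print("chain of 100 instructions:", total_chain_latency(100))
```

For a chain of n instructions the sketch gives 4n + 1 clock cycles, which is why the value that matters for a long dependency chain is 4 rather than 5.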
Reciprocal throughput
The throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The values listed are the reciprocals of the throughputs, i.e. the average number of clock cycles per instruction when the instructions are not part of a limiting dependency chain. For example, a reciprocal throughput of 2 for FMUL means that a new FMUL instruction can start executing 2 clock cycles after a previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution units can handle 3 integer additions per clock cycle. The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency. The values listed are for a single thread or a single core. A missing value in the table means that the value has not been measured.
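The relation between reciprocal throughput, throughput and latency can be expressed as simple arithmetic. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, the instruction counts and the latency of 1 are illustrative values, and all other pipeline bottlenecks are ignored.

```python
from fractions import Fraction


def instructions_per_clock(reciprocal_throughput):
    """Throughput is the reciprocal of the value listed in the tables."""
    return 1 / Fraction(reciprocal_throughput)


def min_clock_cycles(count, latency, reciprocal_throughput, dependent):
    """Lower bound on the time for `count` identical instructions.

    In a dependency chain each instruction costs its latency;
    independent instructions cost only the reciprocal throughput.
    """
    per_instruction = latency if dependent else reciprocal_throughput
    return count * Fraction(per_instruction)


if __name__ == "__main__":
    third = Fraction(1, 3)  # listed as 1/3 or 0.33 in the tables
    print(instructions_per_clock(third))                      # additions per clock
    print(min_clock_cycles(300, 1, third, dependent=False))   # independent additions
    print(min_clock_cycles(300, 1, third, dependent=True))    # chained additions
```

Using exact fractions avoids the rounding in values such as 0.33, which stands for 1/3.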
μops
Uop or μop is an abbreviation for micro-operation. Processors with out-of-order cores are capable of splitting complex instructions into μops. For example, a read-modify instruction may be split into a read-μop and a modify-μop. The number of μops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of μops per clock cycle.

Execution unit
The execution core of a microprocessor has several execution units. Each execution unit can handle a particular category of μops, for example floating point additions. The information about which execution unit a particular μop goes to can be useful for two purposes. Firstly, two μops cannot execute simultaneously if they need the same execution unit. And secondly, some processors have a latency of an extra clock cycle when the result of a μop executing in one execution unit is needed as input for a μop in another execution unit.

Execution port
The execution units are clustered around a few execution ports on most Intel processors. Each μop passes through an execution port to get to the right execution unit. An execution port can be a bottleneck because it can handle only one μop at a time. Two μops cannot execute simultaneously if they need the same execution port, even if they are going to different execution units.

Instruction set
This indicates which instruction set an instruction belongs to. The instruction is only available in processors that support this instruction set. The different instruction sets are listed at the end of this manual. Availability in processors prior to 80386 does not apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not apply to 128-bit packed integer instructions, which require SSE2. Availability in the SSE instruction set does not apply to double precision floating point instructions, which require SSE2. 32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only available under operating systems that support this register set.
How the values were measured

The values in the tables are measured with the use of my own test programs, which are available from www.agner.org/optimize/testp.zip

The time unit for all measurements is CPU clock cycles. An attempt is made to obtain the highest clock frequency if the clock frequency varies with the workload. Many Intel processors have a performance counter named "core clock cycles". This counter gives measurements that are independent of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or turn off the power-saving feature in the BIOS setup.

Instruction throughputs are measured with a long sequence of instructions of the same kind, where subsequent instructions use different registers in order to avoid dependence of each instruction on the previous one. The input registers are cleared in the cases where it is impossible to use different registers. The test code is carefully constructed in each case to make sure that no other bottleneck is limiting the throughput than the one that is being measured.

Instruction latencies are measured in a long dependency chain of identical instructions where the output of each instruction is needed as input for the next instruction.

The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a larger number of instructions is desired.
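The difference between the two kinds of test sequence can be illustrated by generating them as text. This is a hypothetical sketch, not the actual test programs in testp.zip: the function names, the choice of IMUL and the register list are assumptions made for illustration only.

```python
def latency_chain(mnemonic, reg, length=100):
    """Every instruction reads and writes the same register,
    so each one has to wait for the result of the previous one."""
    return [f"{mnemonic} {reg}, {reg}"] * length


def throughput_sequence(mnemonic, regs, length=100):
    """Rotate through several registers so that no instruction
    depends on the instruction immediately before it."""
    lines = []
    for i in range(length):
        reg = regs[i % len(regs)]
        lines.append(f"{mnemonic} {reg}, {reg}")
    return lines


if __name__ == "__main__":
    print("\n".join(latency_chain("imul", "eax", 4)))
    print()
    print("\n".join(throughput_sequence("imul", ["eax", "ebx", "ecx", "edx"], 4)))
```

In the rotating sequence each register is still reused after len(regs) instructions, so the number of registers has to be large enough that this remaining dependency does not become the limiting factor.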
It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and XMM vector registers. The value that can be measured is the combined latency of data transfer from one type of registers to another type and back again (A → B → A). The division of this latency between the A → B latency and the B → A latency is sometimes obvious, sometimes based on guesswork, μop counts, indirect evidence, or triangular sequences such as A → B → Memory → A. In many cases, however, the division of the total latency between A → B latency and B → A latency is arbitrary. However, what cannot be measured cannot matter for performance optimization. What counts is the sum of the A → B latency and the B → A latency, not the individual terms.

The μop counts are usually measured with the use of the performance monitor counters (PMCs) that are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation.

The execution ports and execution units that are used by each instruction or μop are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that can give this information directly. In other cases it is necessary to obtain this information indirectly by testing whether a particular instruction or μop can execute simultaneously with another instruction/μop that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to another. This delay can be used for detecting whether two different instructions/μops are using the same or different execution units.
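The indirect test for shared execution ports can be thought of as a comparison of three measured reciprocal throughputs. The sketch below is a deliberately simplified model, assuming that each instruction is a single μop that can go to only one port and that no other bottleneck interferes; the function name and the decision rule are assumptions for illustration, not the method used for the tables.

```python
def shares_execution_resource(rt_a, rt_b, rt_pair):
    """Guess whether instructions A and B compete for the same port or unit.

    rt_a, rt_b: measured clock cycles per instruction when run alone.
    rt_pair:    measured clock cycles per interleaved A+B pair.

    If A and B share a resource they are serialized, so a pair costs
    about rt_a + rt_b. If not, they overlap and a pair costs about
    max(rt_a, rt_b). The closer of the two predictions wins.
    """
    serialized = rt_a + rt_b
    overlapped = max(rt_a, rt_b)
    return abs(rt_pair - serialized) < abs(rt_pair - overlapped)


if __name__ == "__main__":
    print(shares_execution_resource(1.0, 1.0, 2.0))   # pair is serialized
    print(shares_execution_resource(1.0, 1.0, 1.05))  # pair overlaps
```

Real measurements are noisier than this, and instructions that can use any of several ports need a more careful analysis.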
Microprocessors tested
Microprocessor versions tested

The tables in this manual are based on testing of the following microprocessors
Processor name          Microarchitecture     Family   Model    Comment
                        code name             number   number
                                              (hex)    (hex)
AMD K7 Athlon                                 6        6        Step. 2, rev. A5
AMD K8 Opteron                                F        5        Stepping A
AMD K10 Opteron                               10       2        2350, step. 1
AMD Bulldozer           Bulldozer, Zambezi    15       1        FX-6100, step 2
AMD Bobcat              Bobcat                14       1        E350, step. 0
Intel Pentium           P5                    5        2
Intel Pentium MMX       P5                    5        4        Stepping 4
Intel Pentium II        P6                    6        6
Intel Pentium III       P6                    6        7
Intel Pentium 4         Netburst              F        2        Stepping 4, rev. B0
Intel Pentium 4 EM64T   Netburst, Prescott    F        4        Xeon. Stepping 1
Intel Pentium M         Dothan                6        D        Stepping 6, rev. B1
Intel Core Duo          Yonah                 6        E        Not fully tested
Intel Core 2 (65 nm)    Merom                 6        F        T5500, Step. 6, rev. B2
Intel Core 2 (45 nm)    Wolfdale              6        17       E8400, Step. 6
Intel Core i7           Nehalem               6        1A       i7-920, Step. 5, rev. D0
Intel Core i5           Sandy Bridge          6        2A       i5-2500, Step 7
Intel Atom 330          Diamondville          6        1C       Step. 2
VIA Nano L2200          Isaiah                6        F        Step. 2
VIA Nano L3050          Isaiah                6        F        Step. 8 (prerel. sample)
AMD K7
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
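The latency and reciprocal throughput columns contain fractions such as 1/3, decimal values such as 2.5 and ranges such as 9-13. Readers who want to process the tables programmatically may find a small parser useful. The following is a hypothetical helper, not part of the manual's tooling; the function name and the (low, high) return format are assumptions.

```python
from fractions import Fraction


def parse_cell(text):
    """Parse a table cell into a (low, high) pair of exact fractions.

    '1/3'     -> (1/3, 1/3)
    '2.5'     -> (5/2, 5/2)
    '9-13'    -> (9, 13)
    '1/3 - 2' -> (1/3, 2)
    ''        -> None (value not measured)
    """
    text = text.replace(" ", "")
    if not text:
        return None
    low_text, _, high_text = text.partition("-")
    low = Fraction(low_text)
    high = Fraction(high_text) if high_text else low
    return (low, high)


if __name__ == "__main__":
    for cell in ["1/3", "2.5", "9-13", "1/3 - 2", ""]:
        print(repr(cell), "->", parse_cell(cell))
```

Keeping the values as exact fractions makes it easy to convert a reciprocal throughput of 1/3 into 3 instructions per clock cycle without rounding.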
Integer instructions
Instruction   Operands    Ops   Latency   Reciprocal   Execution   Notes
                                          throughput   unit
Move instructions
MOV           r,r         1     1         1/3          ALU
MOV           r,i         1     1         1/3          ALU
MOV           r8,m8       1     4         1/2          ALU, AGU    Any addr. mode. Add 1 clk if code segment base ≠ 0
MOV           r16,m16     1     4         1/2          ALU, AGU    do.
MOV           r32,m32     1     3         1/2          AGU         do.
MOV           m8,r8H      1     8         1/2          AGU         AH, BH, CH, DH
MOV           m8,r8L      1     2         1/2          AGU         Any other 8-bit register
MOV           m16/32,r    1     2         1/2          AGU         Any addressing mode
MOV           m,i         1     2         1/2          AGU
MOV           r,sr        1     2         1
AMD K7 MOV MOVZX, MOVSX MOVZX, MOVSX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL IMUL sr,r/m r,r r,m r,r r,m r,r r,m r i m sr 6 1 1 1 1 3 3 2 1 1 2 2 1 9 2 3 6 9 2 9 2 1 4 2 1 10 1 9-13 1 4 1 2 16 5 8 1/3 1/2 1/3 1/2 1 16 1 1 1 1 1 4 1 1 10 18 1 4 1 1/3 2 2 1 9 1/3 ALU ALU, AGU ALU ALU, AGU ALU Timing depends ALU, AGU on hw ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU Any addr. size AGU Any addr. size ALU ALU ALU ALU
r m DS/ES/FS/GS SS
r16,[m] r32,[m]
3 2 3 2 1 1
r,m r
r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m
r8/m8 r16/m16 r32/m32 r16,r16/m16
1 1 1 1 1 1 1 1 1 1 9 12 16 4 31 3 3 3 2
1 1 7 1 1 7 1 1 7 5 6 7 5 13 3 3 4 3
1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3 5 6 7
2 2 3 2
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0 latency ax=3, dx=4
AMD K7 IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE CWD, CDQ Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC, BTR, BTS BSF BSR r32,r32/m32 r16,(r16),i r32,(r32),i r16,m16,i r32,m32,i r8/m8 r16/m16 r32/m32 r8 r16 r32 m8 m16 m32 2 2 2 3 3 32 47 79 41 56 88 42 57 89 1 1 4 4 5 2.5 1 2 2 2 23 23 40 17 25 41 17 25 41 1/3 1/3 ALU0 ALU0 ALU0 ALU0 ALU0 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU
24 24 40 17 25 41 17 25 41 1 1
r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r r,r r,r
1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4 8 19 23
1 1 7 1 1 1 7 1 1 1 4 3 3 3 7 7 5 8 6 7 4 4 7 1
2 7 7 6 7 9
1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 4 3 3 3 3 4 4 4 4 3 2 3 3 1/3 1/2 2 1 2 2 3 7 9
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU ALU
BSF BSR SETcc SETcc CLC, STC CMC CLD STD r,m r,m r m 20 23 1 1 1 1 2 3 8 10 1 8 10 1/3 1/2 1/3 1/3 1 2 ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU
Control transfer instructions JMP short/near JMP JMP JMP JMP Jcc J(E)CXZ LOOP CALL CALL CALL CALL CALL RETN RETN RETF RETF IRET INT BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS i i m far r m(near) m(far) short/near short short near far r m(near) m(far) i
1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33 6 2 23-32
ALU low values = real mode
2 2 25-33 1/3 - 2 1/3 - 2 3-4 2
ALU ALU, AGU low values = real mode rcp. t.= 2 if jump rcp. t.= 2 if jump
3-4 2 23-32 3 3 24-33 3 3 24-35 24-35 81 42
ALU ALU ALU ALU
low values = real mode 3 3 ALU ALU, AGU low values = real mode 3 3 ALU ALU low values = real mode low values = real mode real mode real mode values are for no jump values are for no jump
2 2
4 5 4 3 7 4 5 5 7 6
2 2 2 1 3 1-4 2 2 6 3-4
2 2 2 1 3 1-4 2 2 6 3-4
values per count values per count values per count values per count values per count
Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC
1 1 i,0 3 8-9 16-17 19-28 5 9
0 0 12
1/3 1/3 12 3 5 27
ALU ALU 12 3 ops, 5 clk if 16 bit
44-74 11 11
Floating point x87 instructions

Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 Operands Ops Latency Reciprocal throughput 2 4 16 41 2 3 7 0 9 7 1/2 1/2 4 39 1/2 1 5 188 0.4 1 1 1 Execution unit FA/M FANY Notes
r m32/64 m80 m80 r m32/64 m80 m80 r m m
1 1 7 30 1 1 10 260 1 1 1 1
FA/M FMISC
FMISC
FMISC, FA/M
FMISC Low latency immediately after FMISC, FA/M FCOMI FANY FANY Low latency immediately after FMISC, ALU FCOM FTST FMISC, ALU do. FMISC, ALU do. FMISC, ALU faster if FMISC, ALU unchanged
FCMOVcc FFREE FINCSTP, FDECSTP
st0,r r
9 1 1
6 0
5 1/3 1/3
FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P)
AX AX m16 m16 m16
2 3 2 3 14
6-12 6-12
12 12 8 1 42
r/m m r/m m r/m
1 2 1 2 1
4 4 4 4 11-25
1 1-2 1 2 8-22
FADD FADD,FMISC FMUL FMUL,FMISC FMUL Low values are for round divisors
AMD K7 FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR m r/m r m 2 1 1 1 1 2 1 2 5 1 1 12-26 2 2 2 3 2 10 7-10 8-11 9-23 1 1 1 1 1 1 2 3 8 8 FMUL,FMISC FMUL FADD FADD FADD
FADD, FMISC
do.
FADD FMISC, ALU FMUL FMUL
1 44 51 76 46 72 5 7 8 49 63
35 90-100 90-100 100-150 100-200 160-170 8 11 27 126 147
12
FMUL
1 1 7 25 76 65 44 85
0 0
1/3 1/3 24 92 147 120 59 87
FANY ALU FMISC FMISC
Integer MMX instructions

Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVNTQ PACKSSWB/DW PACKUSWB Operands Ops Latency Reciprocal throughput 7 9 2 2 1/2 1 1/2 1/2 1 2 2 Execution unit FMISC, ALU FANY, ALU FANY FMISC FA/M FANY FMISC FMISC FA/M Notes
r32, mm mm, r32 mm,m32 m32, r mm,mm mm,m64 m64,mm m,mm mm,r/m
2 2 1 1 1 1 1 1 1
2
AMD K7 PUNPCKH/LBW/WD PSHUFW MASKMOVQ PMOVMSKB PEXTRW PINSRW Arithmetic instructions PADDB/W/D PADDSB/W PADDUSB/W PSUBB/W/D PSUBSB/W PSUBUSB/W mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m 1 1 1 1 1 1 1 2 2 3 3 2 2 3 1/2 1/2 1 1 1/2 1/2 1 FA/M FA/M FMUL FMUL FA/M FA/M FADD mm,r/m mm,mm,i mm,mm r32,mm r32,mm,i mm,r32,i 1 1 32 3 2 2 2 2 2 1/2 24 3 2 2 FA/M FA/M FADD FMISC, ALU FA/M
5 12
PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RLW/D/Q PSRAW/D Other EMMS
mm,r/m mm,i/mm/m
1 1
2 2
1/2 1/2
FA/M FA/M
1/3
FANY
Floating point XMM instructions

Instruction Move instructions MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHLPS, MOVLHPS MOVHPS, MOVLPS MOVHPS, MOVLPS MOVNTPS MOVMSKPS SHUFPS Operands Ops Latency Reciprocal throughput 2 1 2 2 1 2 2 1 1 1 1/2 1/2 1 4 2 3 Execution unit FA/M FMISC FMISC FA/M Notes
r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r r32,r r,r/m,i
2 2 2 2 5 5 1 2 1 1 1 1 2 3 3
2 4 3 2
3
FA/M FANY FMISC FMISC FA/M FMISC FMISC FMISC FADD FMUL
UNPCK H/L PS Conversion CVTPI2PS CVT(T)PS2PI CVTSI2SS CVT(T)SS2SI Arithmetic ADDSS SUBSS ADDPS SUBPS MULSS MULPS r,r/m 2 3 3 FMUL
xmm,mm mm,xmm xmm,r32 r32,xmm
1 1 4 2
4 6 10 3
FMISC FMISC FMISC FMISC
r,r/m r,r/m r,r/m r,r/m
1 2 1 2
4 4 4 4
1 2 1 2
FADD FADD FMUL FMUL Low values are for round divisors, e.g. powers of 2. do.
DIVSS DIVPS RCPSS RCPPS MAXSS MINSS MAXPS MINPS CMPccSS CMPccPS COMISS UCOMISS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS RSQRTSS RSQRTPS Other LDMXCSR STMXCSR
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 2 1 2 1 2 1 2 1
11-16 18-30 3 3 2 2 2 2 2
8-13 18-30 1 2 1 2 1 2 1
FMUL FMUL FMUL FMUL FADD FADD FADD FADD FADD
r,r/m
FMUL
r,r/m r,r/m r,r/m r,r/m
1 2 1 2
19 36 3 3
16 36 1 2
FMUL FMUL FMUL FMUL
m m
8 3
9 10
3DNow instructions (obsolete)

Instruction         Operands   Ops   Latency   Reciprocal   Execution   Notes
                                               throughput   unit
Move and convert instructions
PREFETCH(W)         m          1               1/2          AGU
PF2ID               mm,mm      1     5         1            FMISC
PI2FD               mm,mm      1     5         1            FMISC
PF2IW               mm,mm      1     5         1            FMISC       3DNow E
PI2FW               mm,mm      1     5         1            FMISC       3DNow E
PSWAPD              mm,mm      1     2         1/2          FA/M        3DNow E
Integer instructions
PAVGUSB             mm,mm      1     2         1/2          FA/M
PMULHRW             mm,mm      1     3         1            FMUL
Floating point instructions
PFADD/SUB/SUBR      mm,mm      1     4         1            FADD
PFCMPEQ/GE/GT       mm,mm      1     2         1            FADD
PFMAX/MIN           mm,mm      1     2         1            FADD
PFMUL               mm,mm      1     4         1            FMUL
PFACC               mm,mm      1     4         1            FADD
PFNACC, PFPNACC     mm,mm      1     4         1            FADD        3DNow E
PFRCP               mm,mm      1     3         1            FMUL
PFRCPIT1/2          mm,mm      1     4         1            FMUL
PFRSQRT             mm,mm      1     3         1            FMUL
PFRSQIT1            mm,mm      1     4         1            FMUL
Other
FEMMS               mm,mm                      1/3          FANY
AMD K8
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
Instruction   Operands       Ops   Latency   Reciprocal   Execution   Notes
                                             throughput   unit
Move instructions
MOV           r,r            1     1         1/3          ALU
MOV           r,i            1     1         1/3          ALU
MOV           r8,m8          1     4         1/2          ALU, AGU    Any addressing mode. Add 1 clock if code segment base ≠ 0
MOV           r16,m16        1     4         1/2          ALU, AGU
MOV           r32,m32        1     3         1/2          AGU
MOV           r64,m64        1     3         1/2          AGU
MOV           m8,r8H         1     8         1/2          AGU         AH, BH, CH, DH
MOV           m8,r8L         1     3         1/2          AGU         Any other 8-bit register
MOV           m16/32/64,r    1     3         1/2          AGU         Any addressing mode
MOV           m,i            1     3         1/2          AGU
MOV           m64,i32        1     3         1/2          AGU
MOV           r,sr           1     2         1/2-1
MOV           sr,r/m         6     9-13      8
K8 MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE IN OUT m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m sr 1 1 1 1 1 1 1 3 3 2 1 1 2 2 5 9 2 3 4-6 7-9 25 9 2 1 1 4 1 1 10 1 1 1 6 1 7 270 300 1 4 1 1 2 16 5 1 1 1 1 2 4 1 1 8 28 10 4 3 2 2 3 1 1 1 2-3 1/3 1/2 1/3 1/2 1/3 1/2 1 16 1 1 1 1 2 4 1 1 8 28 10 4 1 1/3 1/3 2 1/3 1/3 9 1/3 1/2 1/2 8 5 16 AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU AGU AGU ALU ALU ALU ALU AGU AGU Timing depends on hw
r m
DS/ES/FS/GS
SS
r16,[m] r32,[m] r64,[m]
Any address size Any address size Any address size
r,m r m m
r,i/DX i/DX,r
Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG
1 1 1 1 1 1 1 1 1 1
1 1 7 1 1 7 1 1 7
1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU
AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR 9 12 16 4 31 1 3 2 2 1 1 1 2 1 1 3 3 3 31 46 78 143 40 55 87 152 41 56 88 153 1 1 5 6 7 5 13 3 3-4 3 4-5 3 3 4 4 3 4 5 6 7 ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU
r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8 r16 r32 r64 m8 m16 m32 m64
15 23 39 71 17 25 41 73 17 25 41 73 1 1
1 2 1 2 1 1 2 1 1 2 2 2 2 15 23 39 71 17 25 41 73 17 25 41 73 1/3 1/3
latency ax=3, dx=4 latency rax=4, rdx=5
r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL
1 1 1 1 1 1 1 1 1 1 9 7 9 7 1
1 1 7 1 1 1 7 1 1 1 3 3 4 3 7
1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 3 3 4 3 3
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU
K8 RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF BSF BSR BSF BSF BSF BSR SETcc SETcc CLC, STC CMC CLD STD m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r16/32,r r64,r r,r r16,m r32,m r64,m r,m r m 1 10 9 9 8 6 7 8 1 1 5 2 5 4 8 8 21 22 28 20 22 25 28 1 1 1 1 1 2 7 9 8 7 8 3 3 6 1 4 4 4 4 3 3 3 3 1/3 1/2 2 1 2 2 5 3 8 9 10 8 9 10 10 1/3 1/2 1/3 1/3 1/3 1/3 ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU
2 7 7 5 8 8 9 10 8 9 10 10 1
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET INT i
1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33
2 23-32 2 2 25-33 1/3 - 2 1/3 - 2 3-4 2 3 3 3 3
ALU
low values = real mode
ALU ALU, AGU

3-4 2 23-32 3 3 24-33 3 3 24-35 24-35 81 42
ALU ALU ALU ALU ALU ALU, AGU
recip. thrp.= 2 if jump recip. thrp.= 2 if jump
ALU ALU
low values = real mode low values = real mode
real mode real mode
BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC m 6 2 2 2
values are for no jump values are for no jump
4 2 5 2 4 2 1.5 - 2 0.5 - 1 7 3 3 1-2 5 2 5 2 2 3 6 2
2 2 2 0.5 - 1 3 1-2 2 2 3 2
values are per count values are per count values are per count values are per count values are per count
1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 6 10 9 12
1/3 1/3 12 3 5 27 7 7

Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP Operands Ops Latency Reciprocal throughput 2 4 16 41 2 3 7 173 0 9 7 1/2 1/2 4 39 1/2 1 5 160 0.4 1 1 1 4 2 1/3 Execution unit Notes
r m32/64 m80 m80 r m32/64 m80 m80 r m m
1 1 7 30 1 1 10 260 1 1 1 1 9 1 1
FA/M FANY
FA/M FMISC
FMISC FMISC, FA/M FMISC FMISC, FA/M Low latency immediately after FCOMI FANY FANY Low latency immediately after FCOM FMISC, ALU FTST FMISC, ALU do.
st0,r r
4-15 0
FNSTSW FSTSW
AX AX
2 3
6-12 6-12
12 12
K8 FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR m16 m16 m16 2 3 18 8 1 50 FMISC, ALU FMISC, ALU FMISC, ALU do. faster if unchanged
r/m m r/m m r/m m r/m r m
1 2 1 2 1 2 1 1 1 1 2 1 2 5 1 1
4 4 4 4 11-25 12-26 2 2 2 3 2 10 7-10 8-11
1 1-2 1 2 8-22 9-23 1 1 1 1 1 1 1 3 8 8
FADD FADD,FMISC FMUL FMUL,FMISC Low values are for round divisors FMUL FMUL,FMISC do. FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU FMUL FMUL
1 1 66 73 98 67 97 5 7 53 72 75
27 140-190 150-190 170-200 150-180 217 8 12 126 179 175
12 1
FMUL FMISC
1 1 8 26 77 70 61 101
0 0
1/3 1/3 27 100 171 136 56 95
Integer MMX and XMM instructions
K8 Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW PINSRW Operands Ops Latency Reciprocal throughput 4 9 2 3 2 2 1/2 2 2 1 1 2 2 2 1/2 1 1/2 1 1 1 2 2 2 2 1/2 1 2 3 2 2 2 2 1 1/2 1.5 1/2 1 13 26 1 2 2 3 Execution unit Notes
r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r r64,mm/xmm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,mm/x xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,mm/x,i mm,r32,i xmm,r32,i
2 2 1 3 3 2 1 2 2 3 1 2 1 2 1 2 2 2 4 5 1 2 1 2 1 3 1 2 2 1 3 1 2 32 64 1 2 2 3
FMISC, ALU FANY, ALU FANY FMISC, ALU FANY FMISC FMISC, ALU Moves 64 bits. Name of instruction differs FANY, ALU do. FANY, ALU do. FA/M FA/M, FMISC FANY FANY, FMISC FMISC FA/M FMISC FMISC
4 9 9 2 2
2 2
FA/M FA/M, FMISC FMISC FMISC FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M
2 3 2 2 2 2 3 2 2
2 5 12 12
FADD FMISC, ALU FA/M FA/M
Arithmetic instructions
K8 PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PMADDWD PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS
mm,r/m
1/2
FA/M
xmm,r/m mm,r/m xmm,r/m
2 1 2
2 2 2
1 1/2 1
FA/M FA/M FA/M
mm,r/m
FMUL
xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m
2 1 2 1 2 1 2 1 2
3 3 3 2 2 2 2 3 3
2 1 2 1/2 1 1/2 1 1 2
FMUL FMUL FMUL FA/M FA/M FA/M FA/M FADD FADD
mm,r/m xmm,r/m mm,i/mm/m x,i/x/m xmm,i
1 2 1 2 2
2 2 2 2 2
1/2 1 1/2 1 1
FA/M FA/M FA/M FA/M FA/M
1/3
FANY

Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D Operands Ops Latency Reciprocal throughput 2 1 2 2 1 Execution unit Notes
r,r r,m m,r r,r
2 2 2 2
2
FA/M FMISC FMISC FA/M
K8 MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVDDUP MOVSH/LDUP MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D HADDPS/D HSUBPS/D MULSS/D MULPS/D r,m m,r r,r r,m m,r r,r r,m m,r r,r r,r m,r r32,r r,r/m,i r,r/m 4 5 1 2 1 1 1 1 2 2 2 1 3 2 2 2 1 1 1 1/2 1 1 1 2 3 1 2 3
2 4 3 2
FA/M FANY FMISC FMISC FA/M FMISC FMISC SSE3 SSE3 FMISC FADD FMUL FMUL
2 2 8 3 3
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm
2 4 3 1 2 2 2 4 1 2 1 3 3 2 2 2
4 8 8 2 5 5 5 8 4 5 6 8 14 12 10 9
2 3 8 1 2 2 2 3 1 2 1 2 2 2 2 2
FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC
r,r/m r,r/m r,r/m r,r/m r,r/m
1 2 2 1 2
4 4 4 4 4
1 2 2 1 2
FADD FADD FADD FMUL FMUL SSE3
DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 2 1 2 1 2
11-16 18-30 11-20 16-34 3 3
8-13 18-30 8-17 16-34 1 2
FMUL FMUL FMUL FMUL FMUL FMUL
Low values are for round divisors, e.g. powers of 2. do. do. do.
K8 MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 FADD FADD FADD FADD FADD
r,r/m
FMUL
1 2 1 2 1 2
19 36 27 48 3 3
16 36 24 48 1 2
m m
8 3
9 10
K10
AMD K10
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
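The difference between latency and reciprocal throughput can be illustrated with a small calculation. The following Python sketch is not part of the original manual; the function names are hypothetical, and the figures for ADD r,r (latency 1, reciprocal throughput 1/3) are taken from the K10 table. It gives only idealized lower bounds and ignores the other pipeline bottlenecks mentioned above.

```python
from fractions import Fraction

def chain_cycles(n, latency):
    """Lower bound in clock cycles for n instructions where each one
    needs the result of the previous one (a dependency chain)."""
    return n * Fraction(latency)

def independent_cycles(n, recip_throughput):
    """Lower bound in clock cycles for n mutually independent
    instructions of the same kind."""
    return n * Fraction(recip_throughput)

# ADD r,r on K10: latency 1, reciprocal throughput 1/3 (from the table).
dependent = chain_cycles(300, 1)
independent = independent_cycles(300, Fraction(1, 3))
print(dependent, independent)  # 300 100
```

A dependency chain is bounded by the latency, while independent instructions are bounded by the reciprocal throughput, which is why the same 300 additions can differ by a factor of three in this idealized estimate.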
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV Operands Ops Latency Reciprocal throughput 1 1 4 4 3 3 8 3 3 3 3 3-4 8-26 1/3 1/3 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 8 Execution unit Notes
r,r r,i r8,m8 r16,m16 r32,m32 r64,m64 m8,r8H m8,r8L m16/32/64,r m,i m64,i32 r,sr sr,r/m
1 1 1 1 1 1 1 1 1 1 1 1 6
ALU ALU ALU, AGU ALU, AGU AGU AGU AGU AGU AGU AGU AGU
Any addressing mode. Add 1 clock if code segment base ≠ 0 AH, BH, CH, DH Any other 8-bit register Any addressing mode
from AMD manual
K10 MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE IN OUT Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m sr 1 1 1 1 1 1 1 2 2 2 1 1 2 2 9 9 1 3 6 10 28 9 2 1 1 4 1 1 10 1 1 1 6 1 4 ~270 ~300 1 4 1 4 1 4 1 21 5 1 1/3 1/2 1/3 1/2 1/3 1/2 1 19 5 1/2 1/2 1 1 3 6 1/2 1 8 16 11 6 1 1/3 1/3 2 1/3 1 10 1/3 1/2 1/2 8 1 33 AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU AGU ALU ALU ALU ALU AGU AGU
Timing depends on hw
6 3 10 26 16 6 3 1 2 3 1 1 1
r m
DS/ES/FS/GS
SS
r16,[m] r32/64,[m] r32/64,[m]
Any address size 2 source operands W. scale or 3 opr.
r,m r m m
r,i/DX i/DX,r
1 1 1 1 1 1 1 1 1 1
1 4 1 4 1 1 7
1/3 1/2 1 1/3 1/2 1 1/3 1/2 1/3 2
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU
K10 AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV IDIV DIV DIV DIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO 9 12 16 4 30 1 3 2 2 1 1 1 2 1 1 3 3 3 5 6 7 5 13 3 3 3 4 3 3 4 4 3 4 5 6 7 5 13 1 2 1 2 1 1 2 1 1 2 2 2 2 17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1/3 1/3 ALU ALU ALU ALU0 ALU ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU
r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r8 m8 r16/m16 r32/m32 r64/m64 r16/m16 r32/m32 r64/m64
latency ax=3, dx=4

latency rax=4, rdx=5
1 1
17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1 1
Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide.
Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR RCL, RCR RCL RCR RCL
r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL
1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9
1 4 1 1 7 1 1 1 3 3 4 3 7 7 7 7 8
1/3 1/2 1 1/3 1/2 1/3 1 1/3 1/3 1 3 3 4 3 1 1 5 6 6
ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU
K10 RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF BSR BSF BSR POPCNT LZCNT SETcc SETcc CLC, STC CMC CLD STD m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r,r r,r r,m r,m r,r/m r,r/m r m 8 6 7 8 1 1 5 2 5 4 8 8 6 7 7 8 1 1 1 1 1 1 1 2 7 3 3 7.5 1 7 2 9 9 8 8 4 4 7 7 2 2 1 5 2 3 6 1/3 1/2 2 1/3 1.5 1.5 10 7 3 3 3 3 1 1 1/3 1/2 1/3 1/3 1/3 2/3 ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU ALU ALU ALU
SSE4.A / SSE4.2 SSE4.A, AMD only
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET INT i BOUND m INTO String instructions LODS
1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33 6 2
2 23-32 2 2 25-33 1/3 - 2 2/3 - 2 3 2 3 3 3 3
ALU
ALU ALU, AGU

2 23-32 3 3 24-33 3 3 24-35 24-35 81 42
ALU ALU ALU ALU ALU ALU, AGU
ALU ALU
low values = real mode low values = real mode
real mode real mode 2 2

2
K10 REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC 5 4 2 7 3 5 5 7 3 2 2 1 3 1 2 2 3 1 2 2 1 3 1 2 2 3 1
values are per count values are per count values are per count values are per count values are per count
1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 30 13
1/3 1/3 3 5 27 67 5

Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions Operands Ops Latency Reciprocal throughput 2 4 13 94 2 2 8 167 0 6 4 1/2 1/2 4 30 1/2 1 7 163 1/3 1 1 1 Execution unit Notes
r m32/64 m80 m80 r m32/64 m80 m80 r m m
1 1 7 20 1 1 10 218 1 1 1 1 9 1 1 2 3 2 3 12
FA/M FANY
FA/M FMISC
FMISC FMISC FMISC FMISC, FA/M Low latency immediately after FCOMI FANY FANY
FMISC, ALU Low latency immediately after FCOM FTST
st0,r r
1/3 1/3 16 14 9 2 14
AX AX m16 m16 m16
FMISC, ALU FMISC, ALU FMISC, ALU FMISC, ALU
do. do. faster if unchanged
K10 FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR r/m m r/m m r/m m r/m r m 1 2 1 2 1 2 1 1 1 1 2 1 2 6 1 1 4 4 ? 31 2 1 4 1 4 24 24 2 1 1 1 1 1 1 37 7 7 FADD FADD,FMISC FMUL FMUL,FMISC FMUL FMUL,FMISC FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU FMUL FMUL
1 1 45 51 76 45 9 5 11 8 8 12
35 ~51? ~90? ~125? ~119 151? 9 9 65 13 114
35 1
FMUL FMISC
45? 29 41 30? 30? 44?
m m m m
1 1 8 26 77 70 61 85
0 0
162 133 63 89
1/3 1/3 28 103 149 149 58 79

Instruction Move instructions MOVD MOVD MOVD MOVD Operands Ops Latency Reciprocal throughput 3 6 4 3 1 3 1/2 1 Execution unit Notes
r32, mm mm, r32 mm,m32 r32, xmm
1 2 1 1
FADD FANY FADD
K10 MOVD MOVD MOVD xmm, r32 xmm,m32 m32,mm/x 2 1 1 6 2 2 3 1/2 1
FMISC Moves 64 bits. Name of instruction differs do. do.
MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ
r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,(x)mm,i (x)mm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i
1 2 2 1 1 1 1 1 1 1 2 1 3 1 1 1 2 1 1 1 1 1 1 1 1 1 32 64 1 2 2 3 3 1 1
3 6 6 2 2.5 4 2 2 2.5 2 2 2 3 2 2
1 3 3 1/2 1/3 1/2 1/2 1 1/3 1/2 1 1/2 2 1/3 1/3 1 1 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 13 24 1 1 3 2 2 1/2 1/2
FADD FMUL, ALU FA/M FANY FANY ? FMISC FANY ? FMUL,FMISC
FANY FANY FMISC FMUL,FMISC FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M FA/M
2 3 2 3 3 3 3 2 2
3 6 9 6 6 2 2
FADD FA/M FA/M FA/M FA/M FA/M
SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only
Arithmetic instructions
K10 PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS
mm/xmm,r/m mm/xmm,r/m
1 1
2 2
1/2 1/2
FA/M FA/M
mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m
1 1 1 1 1
3 3 2 2 3
1 1 1/2 1/2 1
FMUL FMUL FA/M FA/M FADD
mm/xmm,r/m mm,i/mm/m x,i/(x)mm xmm,i
1 1 1 1
2 2 3 3
1/2 1/2 1/2 1/2
FA/M FA/M FA/M FA/M
1/3
FANY

Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVMSKPS/D SHUFPS/D Operands Ops Latency Reciprocal throughput 2.5 2 2 2.5 2 3 2 2 2 3 4 1/2 1/2 1 1/2 1/2 2 1/2 1/2 1 1/2 1/2 1 3 1 1 1/2 Execution unit Notes
r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r m,r r32,r r,r/m,i
1 1 2 1 1 3 1 1 1 1 1 1 2 1 1 1
FANY ? FMUL,FMISC FANY ? FMISC FA/M ? FMISC FA/M FA/M FMISC FMUL,FMISC FMISC SSE4.A, AMD only FADD FA/M
3 3
K10 UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D MULSS/D MULPS/D DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other r,r/m 1 3 1/2 FA/M
1 2 3 3 1 1 1 2 2 1 1 2 3 3 2 2
2 7 8 7 4 4 4 7 7 4 4 7 14 14 8 8
1 1 2 2 1 1 1 1 1 1 1 1 3 3 1 1
FMISC
FMISC FMISC FMISC
FMISC FMISC
FADD,FMISC FADD,FMISC
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 4 4 4 16 18 20 20 3 2 2 2 2
1 1 1 1 13 15 17 17 1 1 1 1 1 1
FADD FADD FMUL FMUL FMUL FMUL FMUL FMUL FMUL FADD FADD FADD FADD FADD
r,r/m
1/2
FA/M
1 1 1 1 1 1
19 21 27 27 3 3
16 18 24 24 1 1
Instruction  Operands  Ops  Latency  Reciprocal throughput
LDMXCSR      m         12   12       10
STMXCSR      m         3    12       11
3DNow instructions (obsolete)

Instruction        Operands  Ops  Latency  Reciprocal throughput  Execution unit  Notes

Move and convert instructions
PREFETCH(W)        m         1             1/2                    AGU
PF2ID              mm,mm     1    5        1                      FMISC
PI2FD              mm,mm     1    5        1                      FMISC
PF2IW              mm,mm     1    5        1                      FMISC           3DNow extension
PI2FW              mm,mm     1    5        1                      FMISC           3DNow extension
PSWAPD             mm,mm     1    2        1/2                    FA/M            3DNow extension

Integer instructions
PAVGUSB            mm,mm     1    2        1/2                    FA/M
PMULHRW            mm,mm     1    3        1                      FMUL

Floating point instructions
PFADD/SUB/SUBR     mm,mm     1    4        1                      FADD
PFCMPEQ/GE/GT      mm,mm     1    2        1                      FADD
PFMAX/MIN          mm,mm     1    2        1                      FADD
PFMUL              mm,mm     1    4        1                      FMUL
PFACC              mm,mm     1    4        1                      FADD
PFNACC, PFPNACC    mm,mm     1    4        1                      FADD            3DNow extension
PFRCP              mm,mm     1    3        1                      FMUL
PFRCPIT1/2         mm,mm     1    4        1                      FMUL
PFRSQRT            mm,mm     1    3        1                      FMUL
PFRSQIT1           mm,mm     1    4        1                      FMUL

Other
FEMMS                                      1/3                    FANY
Thank you to Xucheng Tang for doing the measurements on the K10.
Bulldozer
AMD Bulldozer
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3
Two macro-operations can execute simultaneously if they go to different execution pipes.
Domain: Tells which execution unit domain is used:
ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of a fp or fma instruction, and when the output of a fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts. An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
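The domain rules above can be written down as a small lookup. The following Python sketch is an illustration only, not part of the original manual; the function names `bypass_delay` and `fma_latency` are hypothetical, and the "inherit" domain is deliberately left out because its effective domain depends on the input operand.

```python
FLOAT_DOMAINS = {"fp", "fma"}

def bypass_delay(producer, consumer):
    """Extra clock cycles when a result moves from the producer's
    domain to the consumer's domain, following the rules in the text."""
    if producer == "ivec" and consumer in FLOAT_DOMAINS:
        return 1
    if producer in FLOAT_DOMAINS and consumer in {"ivec", "store"}:
        return 1
    return 0  # same domain, or fp <-> fma: no additional latency

def fma_latency(consumer):
    """Latency of an fma instruction as seen by the consuming instruction."""
    if consumer == "fma":
        return 5
    return 6 + bypass_delay("fma", consumer)

for consumer in ("fma", "fp", "ivec", "store"):
    print(consumer, fma_latency(consumer))
```

This reproduces the three cases stated in the text: 5 clocks into another fma instruction, 6 into an fp instruction, and 6+1 into an ivec or store instruction.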
Instruction Move instructions MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVSX MOVZX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB Operands Ops Latency Reciprocal throughput 1 1 4 4 5 1 5 4 1 5 1 1 ~50 6 0.5 0.5 0.5 1 1 2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 ~50 2 1 1 1.5 4 9 1 1 19 8 Execution pipes EX01 EX01 AG01 EX01 AG01 Notes
r,r r,i r,m m,r m,i m,r r,r r,m r,m r64,r32 r64,m32 r,r r,m r,r r,m r i m
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 8 9 1 2 34 14 2 1 1 1 4 2 1 1 1 1 6 1 6
all addr. modes all addr. modes
EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01

r m
r16,[m] r32,[m] r32/64,[m] r32/64,[m]
2-3 2-3 2 1 3 2 1 1 0.5 0.5 2 1 1 0.5 0.5 0.5 89 0.25 89
EX01 EX01 EX01 EX01
any addr. size 16 bit addr. size scale factor > 1 or 3 operands all other cases
r m m
EX01
r,r r,i r,m
1 1 1
1 1
0.5 0.5 0.5
EX01 EX01 EX01
Bulldozer ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CDQ, CQO CWD Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST m,r m,i r,r r,i r,m m,r m,i r,r r,i r,m r m 1 1 1 1 1 1 1 1 1 1 1 1 10 16 20 4 9 1 2 1 1 1 1 1 2 1 1 2 2 2 14 18 16 16 33 36 36 36 1 1 2 7-8 7-8 1 1 1 9 9 1 1 1 7-8 6 9 10 6 20 4 4 4 6 4 4 6 5 4 6 1 1 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01
1 1 1 0.5 0.5 0.5 0.5 1
r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64
20 15-27 16-43 16-75 23 23-33 22-48 22-79 1 1 1
20 2 2 2 4 2 2 4 2 2 4 2 2 4 20 15-28 16-43 16-75 20 20-27 20-43 20-75 0.5 1
EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX01 EX01 EX01
r,r r,i r,m m,r m,i r,r r,i
1 1 1 1 1 1 1
1 1 7-8 7-8 1 1
0.5 0.5 0.5 1 1 0.5 0.5
EX01 EX01 EX01 EX01 EX01 EX01 EX01
Bulldozer TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL RCL RCL RCR RCR RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC, BTR, BTS BTC, BTR, BTS BSF BSF BSR BSR LZCNT POPCNT SETcc SETcc CLC, STC CMC CLD STD m,r m,i r m r,i/CL r,i/CL r,1 r,i r,cl r,1 r,i r,cl r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,r r,r r,m r,r r,m r,r r,r/m r m 1 1 1 1 1 1 1 16 17 1 15 16 6 7 8 1 1 7 2 4 10 6 8 7 9 1 1 1 1 1 1 2 2 0.5 0.5 0.5 1 0.5 0.5 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX0 EX1 EX01 EX01 EX01 EX01
7 1 1 1 8 9 1 8 8 3 4 1
3 4 4 2 4 1
3 3.5 3.5 0.5 0.5 3.5 1 2 5 3 4 4 5 2 2 0.5 1 0.5 3 4
SSE4.A SSE4.2
Control transfer instructions JMP short/near JMP r JMP m Jcc short/near fused CMP+Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short CALL near CALL r CALL m RET RET i BOUND m INTO
1 1 1 1 1 1 1 1 2 2 3 1 4 11 4
2 2 2 1-2 1-2 1-2 1-2 1-2 2 2 2 2 2-3 5 24
EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1
2 if jumping 2 if jumping 2 if jumping 2 if jumping 2 if jumping
for no jump for no jump
Bulldozer String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG LOCK CMPXCHG CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC CRC32 CRC32 CRC32 XGETBV
3 6n 3 2n 3 per 16B 5 2n 4 per 16B 3 7n 6 9n
3 3n 3 2n 3 per 16B 3 2n 3 per 16B 3 4n 3 4n
small n best case small n best case
m,r m,r m,r m8,r8 m8,r8 m,r16/32/64 m,r16/32/64 m64 m64 m64 m64
1 4 4 5 5 6 6 18 18 22 22
~55 10 ~51 15 ~51 14 ~52 15 ~53 52 ~94
a,0 a,b
r32,r8 r32,r16 r32,r32
1 1 40 13 11+5b 2 37-63 36 22 3 5 5 4
3 5 6
0.25 0.25 43 22 16+4b 4 112-280 42 30 2 5 6 31
none none

Instruction Move instructions FLD FLD Operands Ops Latency Reciprocal throughput 2 8 0.5 1 Execution pipes P01 Domain, notes
r m32/64
1 1
fp fp
Bulldozer FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FLDCW FNSTCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 m80 m80 r m32/64 m80 m80 r m m st0,r r AX m16 m16 m16 8 60 1 2 13 239 1 1 2 1 8 1 1 4 3 1 3 14 61 2 8 9 240 0 12 8 3 0 ~13 ~13 4 40 0.5 1 20 244 0.5 1 1 0.5 3 0.25 0.25 22 19 3 2 P0 P1 P2 P3 P01 fp fp fp fp fp fp inherit fp fp fp fp inherit
P0 P1 F3 P01 F3 P0 F3 P01 P0 P1 F3 none none P0 P2 P3 P0 P2 P3
r/m m r/m m r m m r/m r m
1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1
5-6 5-6 10-42
1 2 1 2 5-18
~20 4 19-62 19-65
0.5 0.5 0.5 1 1 0.5 0.5 1
P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P0 P1 F3 P01 P01 P01 P0 P0 P0
fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp fp
1 1 10-162 160-170 12-166 11-190 10-355 8 12 10 10-175 10-175
10-53 65-210 ~160 95-160 95-245 60-440 52 10 64-71 60-290 60-320 0.5 65-210 ~160 95-160 95-245 60-440 5
P01 P01 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3
Bulldozer Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR
m864 m864
1 1 18 31 103 76
300 312
0.25 0.25 57 170 300 312
none none P0 P0 P0 P1 P2 P3 P0 P3

Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA VMOVDQA VMOVDQA VMOVDQA MOVDQU MOVDQU MOVDQU LDDQU VMOVDQU VMOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/D Q PUNPCKHQDQ PUNPCKLQDQ PSHUFB PSHUFD PSHUFW PSHUFL/HW PALIGNR PBLENDW Operands Ops Latency Reciprocal throughput 8 10 6 5 2 6 5 0 6 5 2 6 5 0 6 5 6 6 6 2 2 6 6 6 2 2 2 2 2 3 2 2 2 2 2 1 1 0.5 1 0.5 0.5 1 0.25 0.5 1 0.5 1 3 0.25 0.5 1 0.5 1-2 10 0.5 0.5 2 2 0.5 1 1 1 1 1 1 1 1 1 1 0.5 Execution pipes Notes
r32/64, mm/x mm/x, r32/64 mm/x,m32 m32,mm/x mm/x,mm/x mm/x,m64 m64,mm/x xmm,xmm xmm,m m,xmm ymm,ymm ymm,m256 m256,ymm xmm,xmm xmm,m m,xmm xmm,m ymm,m256 m256,ymm mm,xmm xmm,mm m,mm m,xmm xmm,m (x)mm,r/m (x)mm,r/m (x)mm,r/m xmm,r/m xmm,r/m (x)mm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i (x)mm,r/m,i xmm,r/m
1 2 1 1 1 1 1 1 1 1 2 2 4 1 1 1 1 2 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
P23 P3 none P3 P23 P3 none P3
inherit
inherit
P2 P3 P23 P23 P3 P3 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P23
SSE4.1
Bulldozer MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB/W/D/Q PINSRB/W/D/Q PMOVSXBW/BD/BQ/ WD/WQ/DQ PMOVZXBW/BD/BQ/ WD/WQ/DQ mm,mm xmm,xmm r32,mm/x r,x/mm,i x/mm,r,i xmm,xmm xmm,xmm 31 64 2 2 2 1 1 38 48 10 10 12 2 2 37 61 1 1 2 1 1 P3 P1 P3 P1 P3 P1 P3 P1 P1 P1
AVX
SSE4.1 SSE4.1
Arithmetic instructions PADDB/W/D/Q/SB/SW /USB/USW (x)mm,r/m PSUBB/W/D/Q/SB/SW /USB/USW (x)mm,r/m PCMPEQ/GT B/W/D (x)mm,r/m PMULLW PMULHW PMULHUW PMULUDQ (x)mm,r/m xmm,r/m PMULLD xmm,r/m PMULDQ (x)mm,r/m PMULHRSW PMADDWD (x)mm,r/m PMADDUBSW (x)mm,r/m PAVGB/W (x)mm,r/m PMIN/MAX SB/SW/ SD UB/UW/UD (x)mm,r/m PHMINPOSUW xmm,r/m PABSB/W/D (x)mm,r/m PSIGNB/W/D (x)mm,r/m PSADBW (x)mm,r/m MPSADBW x,x,i Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM Encryption PCLMULQDQ
1 1 1
2 2 2
0.5 0.5 0.5
P23 P23 P23
1 1 1 1 1 1 1 1 2 1 1 2 8
4 5 4 4 4 4 2 2 4 2 2 4 8
1 2 1 1 1 1 0.5 0.5 1 0.5 0.5 1 4
P0 P0 P0 P0 P0 P0 P23 P23 P1 P23 P23 P23 P23 P1 P23
SSE4.1 SSE4.1 SSSE3
SSE4.1 SSSE3 SSSE3 SSE4.1
(x)mm,r/m (x)mm,r/m (x)mm,i xmm,i xmm,r/m
1 1 1 1 2
2 3 2 2
0.5 1 1 1 1
P23 P1 P1 P1 P1 P3
SSE4.1
x,x,i x,x,i x,x,i x,x,i
27 27 7 7
17 10 14 7
10 10 3 4
P1 P2 P3 P1 P2 P3 P1 P2 P3 P1 P2 P3
SSE4.2 SSE4.2 SSE4.2 SSE4.2
xmm,r/m
12
P1
pclmul
Bulldozer AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST Other EMMS x,x x,x x,x x,x x,x x,x,i 2 2 2 2 1 1 5 5 5 5 5 5 2 2 2 2 1 1 P01 P01 P01 P01 P0 P0 aes aes aes aes aes aes
0.25
Floating point XMM and YMM instructions

Instruction Move instructions MOVAPS/D MOVUPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD VBLENDPS/PD BLENDVPS/PD VBLENDVPS/PD Operands Ops Latency Reciprocal throughput Execution pipes Domain, notes
x,x y,y x,m128 y,m256 m128,x m256,y m256,y x,x x,m32/64 m32/64,x x,m64 m64,x m64,x x,x r32,x r32,y m128,x m256,y x,x/m,i y,y,y/m,i x,x,x/m y,y,y/m x,x/m,i y,y/m,i y,y,y,i y,y,m,i x,x/m,i y,y,y/m,i x,x/m,xmm0 y,y,y/m,y
1 2 1 2 1 4 8 1 1 1 1 2 1 1 2 1 1 2 1 2 1 2 8 10 1 2 1 2
0 2 6 6 5 5 6 2 6 5 7 8 7 2 10 6 2 2 3 3 2 2 4 2 2 2 2
0.25 0.5 0.5 1-2 1 3 10 0.5 0.5 1 1 1 1 1 1 2 1 2 1 2 1 2 3 4 0.5 1 1 2
none P23
inherit ivec
P3 P3 P2 P3 P01
fp
P1 P3 P3 P1 P1 P3 P3 P1 P1 P1 P1 P1 P1 P23 P23 P23 P23 P1 P1
ivec
ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec
Bulldozer MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS VCVTPD2PS CVTPS2PD VCVTPS2PD CVTSD2SS CVTSS2SD CVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x/m y,y,y/m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 1 1 2 2 1 2 2 2 1 1 2 1 1 2 2 2 1 2 1 1 2 2 1 2 18 34 2 2 6 6 6 6 2 2 2 2 10 14 2 7 2 2 9 9 9 22 25 1 0.5 2 1 0.5 0.5 0.5 0.5 1 0.5 2 1 1 2 1 1 1 1 1 1 1 1 0.5 1 7 13 P1 P1 ivec ivec
P23 P23 P23 P1 P1 P1 P1 P1 P3 P1 P3 P23 P23 P1 P1 P23 P23 P01 P01 P0 P1 P2 P3 P0 P1 P2 P3
ivec ivec ivec ivec
ivec
ivec
x,x x,y x,x y,x x,x x,x x,x y,y x,x y,y x,x y,x x,x x,y x,mm mm,x x,mm mm,x x,r32 r32,x x,r32/64 r32/64,x
2 4 2 4 1 1 1 2 1 2 2 4 2 4 1 1 2 2 2 2 2 2
7 7 7 7 4 4 4 4 4 4 7 8 7 7 4 4 7 7 14 13 14 13
1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1
P01 P01 P01 P01 P0 P0 P0 P0 P0 P0 P01 P01 P01 P01 P0 P0 P0 P1 P0 P1 P0 P0 P0 P0
fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp
Bulldozer Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULSD MULPS MULPD VMULPS VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS/PS VRCPPS CMPSS/D CMPPS/D VCMPPS/D COMISS/D UCOMISS/D MAXSS/SD/PS/PD MINSS/SD/PS/PD VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/ PD DPPS DPPS VDPPS VDPPS DPPD DPPD Math SQRTSS/PS VSQRTPS SQRTSD/PD VSQRTPD RSQRTSS/PS VRSQRTPS
x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x x,m128 y,y,y y,y,m x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y/m x,x/m y,y,y/m x,x/m x,x/m y,y,y/m x,x/m,i y,y/m,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i
1 1 2 1 2 3 4 8 10 1 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 16 18 25 29 15 17
5-6 5-6 5-6 5-6 5-6 10
0.5 0.5 1 0.5 1 2 2
P01 P01 P01 P01 P01 P01 P1 P01 P1 P01 P1 P01 P1 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P3 P01 P01 P0 P0 P01 P23 P01 P23 P01 P3 P01 P3 P01 P23 P01 P23
fma fma fma fma fma ivec/fma ivec/fma ivec/fma ivec/fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp fp fma fma fma fma fma fma
10
4 4 0.5 0.5 1 4.5-9.5 9-19 4.5-11 9-22 1 2 0.5 1 1
5-6 5-6 5-6 9-24 9-24 9-27 9-27 5 5 2 2
2 2 4 4 25 27 15
0.5 1 1 2 6 7 13 13 5 6
x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m
1 2 1 2 1 2
14-15 14-15 24-26 24-26 5 5
4.5-12 9-24 4.5-16.5 9-33 1 2
P01 P01 P01 P01 P01 P01
fp fp fp fp fp fp
Bulldozer Logic
AND/ANDN/OR/XORPS/ PD
x,x/m y,y,y/m
1 2
2 2
0.5 1
P23 P23
ivec ivec
VAND/ANDN/OR/XOR PS/PD Other VZEROUPPER VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVE XRSTOR
m32 m32 m4096 m4096 m m
9 16 17 32 1 2 67 116 122 177
10 19 136 176 196 250
4 5 6 10 4 19 136 176 196 250
P2 P3 P2 P3 P0 P3 P0 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3
32 bit mode 64 bit mode 32 bit mode 64 bit mode
AMD-specific instructions
Instruction  Operands  Ops  Latency  Reciprocal throughput  Execution pipes  Notes
3DNow instructions PREFETCH/W SSE4A instructions LZCNT POPCNT POPCNT EXTRQ EXTRQ INSERTQ INSERTQ MOVNTSS/SD
0.5
r,r r16/32,r16/32 r64,r64 x,i,i x,x x,x,i,i x,x m,x
2 1 1 1 1 1 1 1
2 4 4 3 3 3 3
2 2 4 1 1 1 1 4
P1 P1 P1 P1 P3
XOP instructions
Instruction                 Operands     Ops  Latency  Reciprocal throughput  Execution pipes  Notes
VFRCZSS/SD/PS/PD            x,x          2    10       2                      P01
VFRCZSS/SD/PS/PD            x,m          3    10       2                      P01
VPCMOV                      x,x,x,x/m    1    2        1                      P1
VPCMOV                      y,y,y,y/m    2    2        2                      P1
VPPERM                      x,x,x,x/m    1    2        1                      P1
VPCOMB/W/D/Q                x,x,x/m,i    1    2        0.5                    P23              latency 0 if i=6, 7
VPCOMUB/W/D/Q               x,x,x/m,i    1    2        0.5                    P23              latency 0 if i=6, 7
VPHADDBW/BD/BQ/WD/WQ/DQ     x,x/m        1    2        0.5                    P23
VPHADDUBW/BD/BQ/WD/WQ/DQ    x,x/m        1    2        0.5                    P23
VPHSUBBW/WD/DQ              x,x/m        1    2        0.5                    P23
VPMACSWW/WD                 x,x,x/m,x    1    4        1                      P0
Instruction        Operands     Ops  Latency  Reciprocal throughput  Execution pipes  Notes
VPMACSDD           x,x,x/m,x    1    5        2                      P0
VPMACSDQH/L        x,x,x/m,x    1    4        1                      P0
VPMACSSWW/WD       x,x,x/m,x    1    4        1                      P0
VPMACSSDD          x,x,x/m,x    1    5        2                      P0
VPMACSSDQH/L       x,x,x/m,x    1    4        1                      P0
VPMADCSWD          x,x,x/m,x    1    4        1                      P0
VPMADCSSWD         x,x,x/m,x    1    4        1                      P0
VPROTB/W/D/Q       x,x,x/m      1    3        1                      P1
VPROTB/W/D/Q       x,x,i        1    2        1                      P1
VPSHAB/W/D/Q       x,x,x/m      1    3        1                      P1
VPSHLB/W/D/Q       x,x,x/m      1    3        1                      P1
FMA4 instructions
VFMADDSS/SD        x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMADDPS/PD        x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMADDPS/PD        y,y,y,y/m    2    5-6      1                      P01              fma
VFMSUBSS/SD        x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMSUBPS/PD        x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMSUBPS/PD        y,y,y,y/m    2    5-6      1                      P01              fma
VFMADDSUBPS/PD     x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMADDSUBPS/PD     y,y,y,y/m    2    5-6      1                      P01              fma
VFMSUBADDPS/PD     x,x,x,x/m    1    5-6      0.5                    P01              fma
VFMSUBADDPS/PD     y,y,y,y/m    2    5-6      1                      P01              fma
VFNMADDSS/SD       x,x,x,x/m    1    5-6      0.5                    P01              fma
VFNMADDPS/PD       x,x,x,x/m    1    5-6      0.5                    P01              fma
VFNMADDPS/PD       y,y,y,y/m    2    5-6      1                      P01              fma
VFNMSUBSS/SD       x,x,x,x/m    1    5-6      0.5                    P01              fma
VFNMSUBPS/PD       x,x,x,x/m    1    5-6      0.5                    P01              fma
VFNMSUBPS/PD       y,y,y,y/m    2    5-6      1                      P01              fma
Bobcat
AMD Bobcat
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
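As a rough illustration of how the latency and reciprocal throughput figures are meant to be used, the following Python sketch estimates clock counts for a run of identical instructions. The function names and the simple two-case model are assumptions made for this example only; real code is also limited by decoding, memory access and other pipeline bottlenecks, as noted above.

```python
from fractions import Fraction

def chain_cycles(n, latency):
    """Clocks for n instructions where each one needs the result of the previous one."""
    return n * latency

def independent_cycles(n, reciprocal_throughput):
    """Approximate clocks for n independent instructions of the same kind."""
    return n * Fraction(reciprocal_throughput)

# Hypothetical instruction with latency 1 and reciprocal throughput 1/2
print(chain_cycles(100, 1))            # dependency chain: limited by latency
print(independent_cycles(100, "1/2"))  # independent: limited by throughput
```

The dependent case is bounded by the latency column, the independent case by the reciprocal throughput column. Most real instruction sequences fall somewhere between the two.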
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG Operands Ops Latency Reciprocal throughput 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 4 4 7 6 1 5 1 5 1 1 Page 49 1/2 1/2 1 1 1 1 1 1/2 1 1/2 1 1/2 1 1 Execution pipe I0/1 I0/1 AGU AGU AGU AGU AGU I0/1 Notes
r,r r,i r,m m,r m8,r8H m,i m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r
Any addressing mode Any addressing mode AH, BH, CH, DH
I0/1 I0/1
Bobcat XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH SFENCE LFENCE MFENCE r,m r i m 3 2 1 1 3 9 9 1 4 29 9 2 1 1 1 4 1 1 1 1 1 1 4 1 4 20 5 1 1 2 6 9 1 4 22 8 2 1/2 1 1/2 2 1/2 1/2 1 1 1 ~45 1 ~45
r m
r16,[m] r32/64,[m] r32/64,[m] r64,[m]
3 1 2-4 4 1 1 1
I0 I0/1 I0 I0/1 I0/1 I0/1 AGU AGU AGU AGU AGU AGU
Any address size no scale, no offset w. scale or offset RIP relative
r m m m
AMD only
Arithmetic instructions ADD, SUB r,r/i ADD, SUB r,m ADD, SUB m,r ADC, SBB r,r/i ADC, SBB r,m ADC, SBB m,r/i CMP r,r/i CMP r,m INC, DEC, NEG r INC, DEC, NEG m AAA AAS DAA DAS AAD AAM MUL, IMUL r8/m8 MUL, IMUL r16/m16 MUL, IMUL r32/m32 MUL, IMUL r64/m64 IMUL r16,r16/m16 IMUL r32,r32/m32 IMUL r64,r64/m64
1 1 1 1 1 1 1 1 1 1 9 9 12 16 4 33 1 3 2 2 1 1 1
1 6-7 1 1 6 5 10 7 8 5 23 3 3-5 3-4 6-7 3 3 6
1/2 1 1 1 1 1/2 1 1/2
I0/1
I0/1
I0/1 I0/1
23 1 2 1 1 4
I0 I0 I0 I0 I0 I0 I0
latency ax=3, dx=5 latency eax=3, edx=4 latency rax=6, rdx=7
Bobcat IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF, BSR BSF, BSR POPCNT r16,(r16),i r32,(r32),i r64,(r64),i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 2 1 1 1 1 1 1 1 1 1 1 1 1 4 3 7 27 33 49 81 29 37 55 81 1 1 3 1 4 27 33 49 81 29 37 55 81 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0/1 I0/1
r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r,r r,m r,r/m
1 1 1 1 1 1 1 1 1 1 9 7 9 9 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4-5 8 8 11 11 9
1 1 1 1 1 5 4 6 5 7 7 18
1/2 1 1 1/2 1 1/2 1 1/2 1/2 1 5 4 5 4 1 1 ~15 ~14 15 15 3 4 15 1/2 1 3 1 15 15 13 15 6 6 5
I0/1
I0/1 I0/1 I0/1 I0/1 I0/1
3 4 18
16 15 6 12
SSE4.A/SSE4.2
Bobcat LZCNT SETcc SETcc CLC, STC CMC CLD STD r,r/m r m 8 1 1 1 1 1 2 5 1 SSE4.A, AMD only 1/2 1 1/2 1/2 1 2
I0/1 I0/1 I0 I0,I1
Control transfer instructions JMP short/near JMP r JMP m(near) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL r CALL m(near) RET RET i BOUND m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC
1 1 1 1 2 8 2 2 5 1 4 8 4
2 2 2 1/2 - 2 1-2 4 2 2 2 ~3 ~4 4 2
4 5 4 2 7 2 5 6 7 6
~3 ~3 2 5
values are per count best case 6-7 Byte/clk best case 5 Byte/clk
3 3 4 3
values are per count values are per count
1 0 1 0 6 i,0 12 a,b 10+6b 2 30-52 70-830 26 14
1/2 1/2 6
I0/1 I0/1 36 34+6b
3 87 8
32 bit mode

Instruction Operands Ops Latency Reciprocal throughput Page 52 Execution pipe Notes
Bobcat Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FMUL(P) FIMUL FDIV(R)(P) FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN r m32/64 m80 m80 r m32/64 m80 m80 r m m st0,r r AX m16 m16 m16 1 1 7 21 1 1 16 217 1 1 1 1 12 1 1 2 2 3 12 2 6 14 30 2 6 19 177 0 9 6 7 1 ~20 ~20 1/2 1 5 35 1/2 1 9 180 1 1 1 1 7 1 1 10 10 2 10 FP0/1 FP0/1
FP0/1 FP1
FP1 FP1 FP1 FP0/1 FP1 FP1 FP1 FP1 FP0 FP1
r m m r m m r m m r m r m
1 1 2 1 1 2 1 1 2 1 1 1 1 1 2 1 2 5 1 1
3 3 5 5 19
1 1 3 3 3 19 19 19 2 1 1 1 2 1 1 2
11 11-16 11-19
FP0 FP0 FP0,FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0, FP1 FP0 FP1 FP0, FP1 FP1 FP1
1 31 1 4-44 27-105 11-51 51-94 11-75 48-110 ~45 ~113
1 27-105 51-94 48-110 ~113
FP1 FP0 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1
Bobcat FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 9-75 49-163 5 8 7 9 30-56 ~60 8 29 12 44 49-163 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1
m m m m
1 1 9 26 85 80 71 111
0 0
1/2 1/2 30 78 163 123 105 118
FP0, FP1 ALU FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1

Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU, LDDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB Operands Ops Latency Reciprocal throughput 1 1 1 1 3 2 1 1 2 3 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 3 7 7 5 6 6 5 6 7 7 7 1 1 5 5 6 1 6 6 6-9 6-9 1 1 13 13 1 2 Page 54 1 3 1 1 3 1 2 1 3 3 1/2 1 1 1 2 1 2 3 2-5.5 3-6 1/2 1 1.5 3 1/2 2 Execution pipe FP0 FP0/1 FP0/1 FP0 FP1 FP1 FP1 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP1 FP1 FP0/1 AGU FP1 AGU FP1 FP0/1 FP0/1 FP1 FP1 FP0/1 FP0/1 Moves 64 bits.Name of instruction differs do. do. Notes
r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32,(x)mm r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m
Bobcat PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFB PSHUFB PSHUFD PSHUFW PSHUFL/HW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ Arithmetic instructions PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PHADD/SUBW/SW/D PHADD/SUBW/SW/D PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMULHRSW PMULHRSW PMADDWD PMADDWD PMADDUBSW PMADDUBSW mm,r/m xmm,r/m xmm,r/m xmm,r/m mm,mm xmm,xmm xmm,xmm,i mm,mm,i xmm,xmm,i xmm,xmm,i mm,mm xmm,xmm r32,(x)mm r32,(x)mm,i mm,r32,i xmm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i 1 2 2 1 1 6 3 1 2 20 32 64 1 2 2 3 3 3 1 1 1 1 1 1 2 3 2 1 2 19 146-1400 279-3000 8 12 10 10 3-4 3-4 1 2 1/2 1 1 1/2 1 3 2 1/2 2 12 130-1170 260-2300 2 2 6 3 3 1 2
FP0, FP1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0 FP0, FP1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0/1 FP0/1
Suppl. SSE3 Suppl. SSE3
Suppl. SSE3
SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only
mm,r/m
1/2
FP0/1
xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m
2 1 2 1 2
1 1 4 1 1
1 1/2 1 1/2 1
FP0/1 FP0/1 FP0/1 FP0/1 FP0/1
mm,r/m
FP0
xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m
2 1 2 1 2 1 2
2 2 2 2 2 2 2
2 1 2 1 2 1 2
FP0 FP0 FP0 FP0 FP0 FP0 FP0
Bobcat PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Other EMMS mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1/2 1 1/2 1 1/2 1 1/2 1 2 2 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0 FP0, FP1
Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3
mm,r/m xmm,r/m mm,i/mm/m

xmm,i/xmm/m
1 2 1 2 2
1 1 1 1 1
1/2 1 1 1 1
FP0/1 FP0/1 FP0/1 FP0/1 FP0/1
xmm,i
1/2
FP0/1

Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVDDUP MOVDDUP MOVSHDUP, MOVSLDUP Operands Ops Latency Reciprocal throughput 2 2 2 2 2 2 1 2 1 1 1 1 2 1 2 2 2 1 6 6 1 6-9 6-9 1 6 5 1 6 5 12 12 2 7 1 Page 56 1 2 3 1 2-6 3-6 1/2 2 2 1/2 2 3 3 2 1 2 1 Execution pipe FP0/1 AGU FP1 FP0/1 AGU FP1 FP0/1 FP1 FP1 FP0/1 AGU FP1 FP1 FP1 FP0/1 FP0/1 FP0/1 Notes
r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r m,r r,r r,m64 r,r
SSE4.A, AMD only SSE3 SSE3
Bobcat MOVSHDUP, MOVSLDUP MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SS2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULSS MULSD MULPS MULPD DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,m r32,r r,r/m,i r,r/m 2 1 3 2 12 ~6 2 1 3 2 2 1 AGU FP0 FP0/1 FP0/1
2 4 3 1 2 2 2 4 1 2 1 3 3 2 2 2
5 5 5 4 4 5 4 6 4 5 4 6 12 11 12 11
2 3 3 1 4 2 4 3 2 2 1 2 3 3 1 1
FP1 FP0, FP1 FP0, FP1 FP1 FP1 FP1 FP1 FP0, FP1 FP1 FP1 FP1 FP0, FP1 FP0, FP1 FP1 FP0, FP1 FP0, FP1
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 2 2 2 1 1 2 2 1 2 1 2 1 2 1 2 1 2 1
3 3 3 3 2 4 2 4 13 38 17 34 3 3 2 2 2 2
1 2 2 2 1 2 2 4 13 38 17 34 1 2 1 2 1 2 1
FP0 FP0 FP0 FP0 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0
SSE3 SSE3
r,r/m
FP0/1
Math
Instruction  Operands  Ops  Latency  Reciprocal throughput  Execution pipe
SQRTSS       r,r/m     1    14       14                     FP1
SQRTPS       r,r/m     2    48       48                     FP1
SQRTSD       r,r/m     1    24       24                     FP1
SQRTPD       r,r/m     2    48       48                     FP1
RSQRTSS      r,r/m     1    3        1                      FP1
RSQRTPS      r,r/m     2    3        2                      FP1
Other
LDMXCSR STMXCSR
m m
12 3
10 11
FP0, FP1 FP0, FP1
Intel Pentium
Intel Pentium and Pentium MMX

List of instruction timings
Operands: r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.
Pairability: u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable.
Integer instructions (Pentium and Pentium MMX)

Instruction NOP MOV MOV MOV MOV XCHG XCHG XCHG XLAT PUSH POP PUSH POP PUSH POP PUSHF POPF PUSHA POPA PUSHAD POPAD LAHF SAHF MOVSX MOVZX LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP CMP TEST TEST TEST Operands r/m, r/m/i r/m, sr sr , r/m m , accum (E)AX, r r,r r,m r/i r m m sr sr Clock cycles Pairability 1 uv 1 uv 1 np >= 2 b) np 1 uv h) 2 np 3 np >15 np 4 np 1 uv 1 uv 2 np 3 np 1 b) np >= 3 b) np 3-5 np 4-6 np 5-9 i) np 5 np 2 np 3 a) np 1 uv 4 c) np 1 uv 2 uv 3 uv 1 u 2 u 3 u 1 uv 2 uv 1 uv 2 uv 1 f) Page 59
r , r/m r,m m r , r/i r,m m , r/i r , r/i r,m m , r/i r , r/i m , r/i r,r m,r r,i
Intel Pentium TEST INC DEC INC DEC NEG NOT MUL IMUL MUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWDE CWD CDQ SHR SHL SAR SAL SHR SHL SAR SAL SHR SHL SAR SAL ROR ROL RCR RCL ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc JMP CALL JMP CALL conditional jump CALL JMP RETN RETN RETF RETF J(E)CXZ LOOP BOUND CLC STC CMC CLD STD CLI STI LODS REP LODS STOS REP STOS MOVS m,i r m r/m r8/r16/m8/m16 all other versions r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32 2 1 3 1/3 11 9 d) 17 25 41 22 30 46 3 2 1 3 4/5 1/3 1/3 4/5 8/10 7/9 4 a) 5 a) 4 a) 4 a) 9 a) 7 a) 8 a) 14 a) 7-73 a) 1/2 a) 1 e) >= 3 e) 1/4/5/6 e) 2/5 e 2/5 e 3/6 e) 4/7 e) 5/8 e) 4-11 e) 5-10 e) 8 2 6-9 2 7+3*n g) 3 10+n g) 4 Page 60 np uv uv np np np np np np np np np np np u u np u np np np np np np np np np np np np np np v np v np np np np np np np np np np np np np np np
r,i m,i r/m, CL r/m, 1 r/m, i(><1) r/m, CL r/m, i(><1) r/m, CL r, i/CL m, i/CL r, r/i m, i m, i r, r/i m, i m, r r , r/m r/m short/near far short/near r/m i i short short r,m
Instruction     Clock cycles    Pairability
REP MOVS        12+n g)         np
SCAS            4               np
REP(N)E SCAS    9+4*n g)        np
CMPS            5               np
REP(N)E CMPS    8+4*n g)        np
BSWAP           1 a)            np
CPUID           13-16 a)        np
RDTSC           6-13 a) j)      np

Notes:
a) This instruction has a 0FH prefix which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction.
b) Versions with FS and GS have a 0FH prefix. See note a.
c) Versions with SS, FS, and GS have a 0FH prefix. See note a.
d) Versions with two operands and no immediate have a 0FH prefix. See note a.
e) High values are for mispredicted jumps/branches.
f) Only pairable if register is AL, AX or EAX.
g) Add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD).
h) Pairs as if it were writing to the accumulator.
i) 9 if SP divisible by 4 (imperfect pairing).
j) On P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.
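The clock counts for the repeated string instructions are given as formulas in the repeat count n. The Python sketch below evaluates them, including the extra clock for decoding the repeat prefix described in the notes. The dictionary, the function name and the `after_multicycle` flag are hypothetical names introduced for this illustration, and the results are the minimum values of the table, not measurements.

```python
# Clock-count formulas for repeated string instructions on the Pentium,
# copied from the table above; n is the repeat count.
REP_FORMULAS = {
    "REP MOVS": lambda n: 12 + n,
    "REP(N)E SCAS": lambda n: 9 + 4 * n,
    "REP(N)E CMPS": lambda n: 8 + 4 * n,
}

def rep_clocks(instruction, n, after_multicycle=False):
    """Minimum clocks for a repeated string instruction with repeat count n.

    One extra clock is added for decoding the repeat prefix unless the
    instruction is preceded by a multi-cycle instruction such as CLD.
    """
    clocks = REP_FORMULAS[instruction](n)
    if not after_multicycle:
        clocks += 1
    return clocks

print(rep_clocks("REP MOVS", 100))                         # 12 + 100 + 1
print(rep_clocks("REP MOVS", 100, after_multicycle=True))  # 12 + 100
```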
Floating point instructions (Pentium and Pentium MMX)

Explanation of column headings
Operands: r = register, m = memory, m32 = 32-bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably.
Pairability: + = pairable with FXCH, np = not pairable with FXCH.
i-ov: Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions.
fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions. (WAIT is considered a floating point instruction here)
Instruction  Operand      Clock cycles  Pairability  i-ov  fp-ov
FLD          r/m32/m64    1             +            0     0
FLD          m80          3             np           0     0
FBLD         m80          48-58         np           0     0
FST(P)       r            1             np           0     0
FST(P)       m32/m64      2 m)          np           0     0
FST(P)       m80          3 m)          np           0     0
FBSTP        m80          148-154       np           0     0
FILD         m            3             np           2     2
Intel Pentium FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FCHS FABS FCOM(P)(P) FUCOM FIADD FISUB(R) FIMUL FIDIV(R) FICOM FTST FXAM FPREM FPREM1 FRNDINT FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FXCH FINCSTP FDECSTP FFREE FNCLEX FNINIT FNSAVE FRSTOR WAIT Notes: m n o p m 6 2 5 s) 6 q) 8 2 3 3 3 19/33/39 p) 1 1 6 6 22/36/42 p) 4 1 17-21 16-64 20-70 9-20 20-32 12-66 70 65-100 r) 89-112 r) 53-59 r) 103 r) 105 r) 120-147 r) 112-134 r) 1 1 2 2 6-9 12-22 124-300 70-95 1 np np np np np np 0 0 0 0 0 0 np np np np np np np np np np np np np np np np np np np np np np np np np np np np 0 0 2 0 0 0 2 2 2 38 o) 0 0 2 2 38 o) 0 0 4 2 2 0 5 0 69 o) 2 2 2 2 2 36 o) 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 2 2 n) 2 0 0 2 2 2 0 0 0 2 2 0 0 0 2 2 2 2 2 2 0 2 0 0 0 0 0 0 0 0 0
AX/m16 m16 m16 r/m r/m r/m r/m r/m m m m m
r r
m m
Notes:
m) The value to store is needed one clock cycle in advance.
n) 1 if the overlapping instruction is also an FMUL.
o) Cannot overlap integer multiplication instructions.
p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bit 8-9 of the floating point control word.
q) The first 4 clock cycles can overlap with preceding integer instructions.
r) Clock counts are typical. Trivial cases may be faster, extreme cases may be slower.
s) May be up to 3 clocks more when output needed for FST, FCHS, or FABS.
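The dependence of the FDIV clock count on the precision control bits can be sketched as a small lookup. The Python example below is an illustration only: the function and table names are hypothetical, and the mapping of the two control word bits to 24, 53 and 64 bits of precision is taken from the standard x87 definition rather than from this manual.

```python
# Precision-control field (bits 8-9 of the x87 control word) -> significand bits.
# Encoding per the x87 architecture definition; 0b01 is reserved.
PRECISION_BITS = {0b00: 24, 0b10: 53, 0b11: 64}

# FDIV clock cycles on the Pentium for each precision, from the FDIV note above.
FDIV_CLOCKS = {24: 19, 53: 33, 64: 39}

def fdiv_clocks(control_word, integer_operand=False):
    """Clock cycles for FDIV (or FIDIV if integer_operand is True)."""
    pc = (control_word >> 8) & 0b11
    if pc not in PRECISION_BITS:
        raise ValueError("reserved precision-control value")
    clocks = FDIV_CLOCKS[PRECISION_BITS[pc]]
    # FIDIV takes 3 clocks more than FDIV
    return clocks + 3 if integer_operand else clocks

print(fdiv_clocks(0x037F))        # extended precision (64 bits)
print(fdiv_clocks(0x027F))        # double precision (53 bits)
print(fdiv_clocks(0x007F, True))  # single precision (24 bits), FIDIV
```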
MMX instructions (Pentium MMX)

A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one multiplication per clock cycle. The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX. There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes when you store data from an MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance. This is analogous to the floating point store instructions. All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
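The difference between the 3 clock latency and the throughput of one multiplication per clock can be illustrated with a short calculation. This Python sketch is a simplified model under stated assumptions: the function name is made up for the example, and pairing, stores and the EMMS penalties described above are ignored.

```python
MMX_MULTIPLY_LATENCY = 3  # clock cycles, from the paragraph above

def mmx_multiply_clocks(n, dependent):
    """Approximate clocks for n MMX multiplications.

    dependent=True:  each multiplication needs the previous result.
    dependent=False: the multiplications are independent and are pipelined,
                     so a new one can start every clock cycle.
    """
    if n <= 0:
        return 0
    if dependent:
        return n * MMX_MULTIPLY_LATENCY
    return MMX_MULTIPLY_LATENCY + (n - 1)

print(mmx_multiply_clocks(10, dependent=True))   # 10 * 3
print(mmx_multiply_clocks(10, dependent=False))  # 3 + 9
```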
Pentium II and III
Intel Pentium II and Pentium III

List of instruction timings and op breakdown
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops: The number of ops that the instruction generates for each execution port.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
Integer instructions (Pentium Pro, Pentium II and Pentium III)

Instruction MOV MOV MOV MOV MOV MOV MOV MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH POP POP PUSH POP PUSH POP Operands p0 r,r/i r,m m,r/i r,sr m,sr sr,r sr,m r,r r,m r,r r,m r,r r,m r/i r (E)SP m m sr sr p1 ops p01 p2 p3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 2 1 5 2 8 1 1 1 1 1 1 1 1 1 1 1 1 high b) Latency p4 Reciprocal throughput
1 1 5 8
8 7
1 1 1
1 1 1
Pentium II and III PUSHF(D) POPF(D) PUSHA(D) POPA(D) LAHF SAHF LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP TEST CMP TEST INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CWD CDQ SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc 3 10 11 6 2 2 1 1 1 8 8 1 c) 8 1 1 1 2 2 3 1 1 1 1 1 1 1 1 1 2 3 3 2 2 2 1 r,i/CL m,i/CL r,1 r8,i/CL r16/32,i/CL m,1 m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r/i r,r/i m,r/i r,r r,m r 1 1 1 4 3 1 4 4 2 2 1 1 1 1 1 1 4 3 2 3 2 1 1 6 1 6 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 4 15 4 4 19 23 39 19 23 39 3 1 1 1 1 1 1 1 1 8 1
r,m m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m
r,(r),(i) (r),m r8 r16 r32 m8 m16 m32
1 1 1
1 1 12 21 37 12 21 37
1 1 1 1 1 1 1
1 1 1 1
1 1 1 1
Pentium II and III SETcc JMP JMP JMP JMP JMP conditional jump CALL CALL CALL CALL CALL RETN RETN RETF RETF J(E)CXZ LOOP LOOP(N)E ENTER ENTER LEAVE BOUND CLC STC CMC CLD STD CLI STI INTO LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS BSWAP NOP (90) Long NOP (0F 1F) CPUID RDTSC IN OUT PREFETCHNTA d) PREFETCHT0/1/2 d) SFENCE d) Notes m short/near far r m(near) m(far) short/near near far r m(near) m(far) i i short short short i,0 a,b r,m 23 23 2 2 ca. 7 1 8 8 12 18 +4b 2 6 1 4 1 1 1 1 1 21 1 1 21 1 1 28 1 1 28 1 1 2 3 2 4 1 1 1 2 1 1 3 3 1 1 2 1 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2
1 b-1 1 2
1 2b
9 17 5 2 10+6n ca. 5n 1 ca. 6n 1 12+7n 4 12+9n r 1 1 1 1 0.5 1 2 1 a) 3 a) 2 1 1 1 1
23-48 31 18 18 m m 1 1 1 1
>300 >300
Notes:
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) 3 if constant without base or index register.
d) P3 only.
Floating point x87 instructions (Pentium Pro, II and III)

Instruction FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Operands p0 r m32/64 m80 m80 r m32/m64 m80 m80 r m m 1 2 38 1 2 165 3 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 6 6 6 6 1 1 23 33 30 1 1 1 1 2 2 1 2 2 1 2 2 1 p1 ops p01 p2 p3 Latency p4 Reciprocal throughput
0 5 5
f)
r AX m16 m16 m16 r m r m r m
2 7 1 1 1 1 1 1 1 1 3 3-4 5 5-6 38 h) 38 h) 2 1 1 1 1 1 1 1 2 g) 2 g) 37 37 1 10
r m r m m m m m
1 1 1 1 1 1 1
1 2
FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FINCSTP FDECSTP FFREE FFREEP FNCLEX FNINIT FNSAVE FRSTOR WAIT
56 15 1 17-97 18-110 17-48 36-54 31-53 21-102 25-86 1 1 1 2 3 13 141 72 2
27-103 29-130 66 103 98-107 13-143 44-143
69 e) e) e) e) e) e) e)
e,i)
r r
Notes:
e) Not pipelined.
f) FXCH generates 1 op that is resolved by register renaming without going to any port.
g) FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles.
h) FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Throughput is 1/(latency-1), i.e. the reciprocal throughput is latency-1.
i) Faster for lower precision.
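The FDIV note can be turned into a small estimate of how long a series of independent divisions takes on these processors. The Python sketch below assumes that the reciprocal throughput is latency - 1 clocks, which is how the note is read here and which agrees with the values 38 and 37 listed for FDIV in the table. The names are hypothetical, and the 9 clock special case for division by a power of 2 is not modelled.

```python
# FDIV latency on Pentium Pro, II and III by precision (bits), from the FDIV note.
FDIV_LATENCY = {64: 38, 53: 32, 24: 18}

def fdiv_reciprocal_throughput(precision_bits):
    """Clocks between the starts of two independent divisions (assumed latency - 1)."""
    return FDIV_LATENCY[precision_bits] - 1

def independent_fdiv_clocks(n, precision_bits):
    """Approximate clocks until the last of n independent FDIVs has finished."""
    if n <= 0:
        return 0
    latency = FDIV_LATENCY[precision_bits]
    return latency + (n - 1) * fdiv_reciprocal_throughput(precision_bits)

print(fdiv_reciprocal_throughput(64))  # 38 - 1
print(independent_fdiv_clocks(3, 64))  # 38 + 2 * 37
print(independent_fdiv_clocks(3, 24))  # 18 + 2 * 17
```

Because the divider is almost unpipelined, lowering the precision in the control word helps both latency and throughput by roughly the same amount.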
Integer MMX instructions (Pentium II and Pentium III)

Instruction MOVD MOVQ MOVD MOVQ MOVD MOVQ PADD PSUB PCMP PADD PSUB PCMP PMUL PMADD PMUL PMADD PAND(N) POR PXOR PAND(N) POR PXOR PSRA PSRL PSLL PSRA PSRL PSLL PACK PUNPCK PACK PUNPCK EMMS MASKMOVQ d) Operands p0 r,r mm,m32/64 m32/64,mm mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm/i mm,m64 mm,mm mm,m64 mm,mm p1 ops p01 p2 p3 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 Page 68 1 1 1 1 1 1 1 6 k) 2-8 Latency p4 1 1 1 3 3 1 Reciprocal throughput 0.5 1 1 0.5 1 1 1 0.5 1 1 1 1 1 2 - 30
1 1
Pentium II and III PMOVMSKB d) MOVNTQ d) PSHUFW d) PSHUFW d) PEXTRW d) PINSRW d) PINSRW d) PAVGB PAVGW d) PAVGB PAVGW d) PMIN/MAXUB/SW d) PMIN/MAXUB/SW d) PMULHUW d) PMULHUW d) PSADBW d) PSADBW d) Notes: d) k) r32,mm m64,mm mm,mm,i mm,m64,i r32,mm,i mm,r32,i mm,m16,i mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 2 3 4 5 6 1 1 1 - 30 1 1 1 1 1 0.5 1 0.5 1 1 1 2 2
Notes:
d) P3 only.
k) The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.
Floating point XMM instructions (Pentium III)

Instruction MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHPS MOVLPS MOVHPS MOVLPS MOVLHPS MOVHLPS MOVMSKPS MOVNTPS CVTPI2PS CVTPI2PS CVT(T)PS2PI CVTPS2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVTSS2SI ADDPS SUBPS ADDPS SUBPS ADDSS SUBSS ADDSS SUBSS Operands p0 xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32 m32,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 p1 ops p01 p2 p3 2 2 2 4 1 4 1 1 1 1 1 1 1 2 2 2 2 1 2 2 1 1 2 2 1 1 1 2 1 2 1 2 2 1 Latency p4 1 2 3 2 3 1 1 1 1 1 1 1 3 4 3 4 4 5 3 4 3 3 3 3 Reciprocal throughput 1 2 2 4 4 1 1 1 1 1 1 1 2 - 15 1 2 1 1 2 2 1 2 2 2 1 1
2 4
1 1
1 2
Pentium II and III MULPS MULPS MULSS MULSS DIVPS DIVPS DIVSS DIVSS AND(N)PS ORPS XORPS AND(N)PS ORPS XORPS MAXPS MINPS MAXPS MINPS MAXSS MINSS MAXSS MINSS CMPccPS CMPccPS CMPccSS CMPccSS COMISS UCOMISS COMISS UCOMISS SQRTPS SQRTPS SQRTSS SQRTSS RSQRTPS RSQRTPS RSQRTSS RSQRTSS RCPPS RCPPS RCPSS RCPSS SHUFPS SHUFPS UNPCKHPS UNPCKLPS UNPCKHPS UNPCKLPS LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm,i xmm,m128,i xmm,xmm xmm,m128 m32 m32 m4096 m4096 2 2 1 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 11 6 116 89 1 2 2 2 2 1 2 1 2 2 1 2 1 1 2 1 2 1 2 1 4 4 4 4 48 48 18 18 2 2 3 3 3 3 3 3 3 3 1 1 56 57 30 31 2 3 1 2 2 3 1 2 2 2 3 3 15 7 62 68 2 2 1 1 34 34 17 17 2 2 2 2 1 1 2 2 1 1 1 1 56 56 28 28 2 2 1 1 2 2 1 1 2 2 2 2 15 9
Pentium M
Intel Pentium M, Core Solo and Core Duo

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops fused domain: The number of ops at the decode, rename, allocate and retirement stages in the pipeline. Fused ops count as one.
ops unfused domain: The number of ops for each execution port. Fused ops count as two.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
Instruction  Operands  ops fused domain  ops unfused domain: p0 p1 p01 p2 p3 p4  Latency  Reciprocal throughput
0.5 1 1 1
Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc
r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r32 r,r r,m r,r r,m
1 1 1 2 1 2 8 8 2 1 1 2 2
1 1 1 1 1 1 8 7 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 5 8
2 0.5 1 1.5
Pentium M XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 SFENCE/LFENCE/MFENCE IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL IMUL IMUL MUL IMUL MUL IMUL IMUL IMUL DIV IDIV r,r r,m r i m sr 3 7 2 1 2 2 2 16 18 1 3 2 10 17 10 1 2 1 2 11 1 1 2 3 4 1 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 8 2 high b) 1 1 2 1.5 1 1 1 1 6 8
1 3 1 11 2 2 9 6 2 1 1 1 1 1 1 1 8
r (E)SP m sr
1 16 7 1 1 1
10
7 1 1
r,m r m m m
1 1 1
1 8
3 1 1 1 1 >300 >300
1 1 6
18 18
r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r m,i r m
r8 r16/r32 r,r r,r,i m8 m16/m32 r,m r,m,i r8
1 1 3 2 2 7 1 1 2 1 3 1 3 4 1 3 1 1 1 3 1 2 5
1 1
1 1 1 1 1 4 1 1 1 1 1 2 2
1 1 1 1 1 1 1
1 2 1 1 2 1 1 1 1 1 1 1 2 15 4 5 4 4 4 5 4 4 15-16 c)
0.5 1 1 2
0.5 1 1 0.5
1 1 1 1 3 1 1 1 3 1 1 4 1
1 1 1 1 1
1 1 1 1 1 1 1 1 12
Pentium M DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CWD CDQ Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST TEST SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r16 r32 m8 m16 m32 4 4 6 5 5 1 1 3 3 4 3 3 1 1 1 1 1 1 1 15-24 c) 15-39 c) 15-16 c) 15-24 c) 15-39 c) 1 1 12-20 c) 12-20 c) 12 12-20 c) 12-20 c) 1 1
1 1 1
r,r/i r,m m,r/i r,r/i m,r m,i r,i/CL m,i/CL r,1 r8,i/CL r8,i/CL r16/32,i/CL m,1 m8,i/CL m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m
1 1 3 1 1 2 1 3 2 9 8 6 7 12 11 10 2 4 1 8 2 1 10 3 2 2 1 2 1 4
1 1 1 1 1 1 1 1 1 5 4 3 2 6 5 5 2 1 1 7 1 1 7 1 1 1 1 1 1 1 1
1 1 1 1 1
1 2 1 1 1 1 1 1 1 2 11 10 9
0.5 1 1 0.5 1 1 1 2
1 4 4 3 2 3 3 2 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 2 1 1 1 2
1 1
1 1
1 4
1 1 7
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) conditional jump short/near J(E)CXZ short LOOP short LOOP(N)E short
1 22 1 2 25 1 2 11 11
1 21 1 1 23 1 1 1 1 1 8 8 1 1 2
2 2
1 28 1 2 31 1 1 6 6
Pentium M CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CLI STI ENTER ENTER LEAVE CPUID RDTSC Notes: a) b) c) near far r m(near) m(far) i i r,m 4 32 4 4 35 2 3 27 27 15 5 1 27 1 1 29 1 1 24 24 7 2 1 2 1 2 1 1 3 3 2 1 1 1 2 1 1 2 1 2 1 1 2 2 27 9 2 30 2 2 30 30 8 4
6 5
2 6n 3 5n 6 6n 3 7n 6 9n
2 10+6n ca. 5n 1 ca. 6n 1 12+7n 4 12+9n 1 a) 3 a) 2 2 1 1 1 1
4 0.5 1 0.7 0.7 0.5 1.3 0.6 0.7 0.5
1 1 2 9 17 i,0 a,b 12 ca. 3 38-59 13 38-59 13
1 1 2
0.5 1
10 18 +4b 2
1 1 b-1 2b 1 ca. 130 42
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm.

Instruction Operands μops fused domain μops unfused domain (p0 p1 p01 p2 p3 p4) Latency Reciprocal throughput
Move instructions
Pentium M FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE FFREEP FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT r m32/64 m80 m80 r m32/m64 m80 m80 r m m m 1 1 4 40 1 1 6 169 1 4 4 4 1 2 2 3 2 3 3 1 1 2 142 72 1 2 38 1 2 165 3 2 2 1 2 2 3 1 1 1 1 1 2 142 72 1 1 1 1 1 1 2 2 1 2 2 1 2 2 1 1
1 3 167 0.33 f) 2 2 2
0 5 5 5
r AX m16 m16 m16 r r
2 7 1 1 1 1 1 1 1
3 19 3 1 2 131 91
r m r m r m
r m r m m m m
1 1 1 1 1 1 1 1 1 1 2 1 6 6 6 6 1 1 26 15
1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1
1 1 1 1 1 1 1
3 5 5 3
3 3 5 5 9-38 c) 9-38 c) 1 1 1 1 1 1 3 5 9-38 c)
2 1 1
1 1 2 2 8-37 c) 8-37 c) 1 1 1 1 1 1 3 3 8-37 c) 4 1 1
26 15
37 19
28 15
28 15
43 9
Pentium M FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: c) f) g) 1 80-100 90-110 ~ 20 ~ 40 ~ 55 ~ 100 ~ 85 1 80-100 90-110 ~20 ~40 ~55 ~100 ~85 9 h) 80-110 100-130 ~45 ~60 ~65 ~140 ~140 8
1 2 3 14
1 1 3 14 1
1 1 13 27
c) High values are typical, low values are for low precision or round divisors.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) SSE3 instruction only available on Core Solo and Core Duo.

Instruction Operands μops fused domain μops unfused domain (p0 p1 p01 p2 p3 p4) Latency Reciprocal throughput
1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 2 5-6 1 1 2 2-3 2-3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 1 1 1 1 0.5 1 1 1 1 1 1 2 2 2-10 4-20 2 1 1 2
Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ
r32,mm mm,r32 mm,m32 m32,mm r32,xmm xmm,r32 xmm,m32 m32, xmm mm,mm mm,m64 m64,mm xmm,xmm xmm,m64 m64, xmm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm
1 1 1 1 1 2 2 1 1 1 1 2 2 1 2 2 2 4 8 4 1 2 1
1 1
Pentium M MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKHQDQ PUNPCKHQDQ PUNPCKLQDQ PUNPCKLQDQ PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULUDQ PMULUDQ m128,xmm mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 xmm,xmm xmm, m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm xmm,xmm r32,mm r32,xmm r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 4 1 1 3 4 1 1 2 3 2 3 1 1 1 2 3 4 2 3 3 8 1 1 2 4 1 2 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 j) 1 2 2 2 2 1 2 1 2 1 1 2 3 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 2 1 3 1 1 2 2 1 1 2 2 1 1 1 1 1 1 2 2 1 1
mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64
1 1 2 4 2 2 4 6 1 1 2 2 1 1 2 4 1 1
1 1 2 2 2 2 4 4 1 1 2 2 1 1 2 2 1 1
1 1 1 2 2 1 2 2 1 1 1 2 1 2 1 3 3 3 3 4 4
0.5 1 1 2 1 1 2 2 0.5 1 1 2 1 1 2 2 1 1
Pentium M PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDWD PMADDWD PAVGB/W PAVGB/W PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PSADBW PSADBW PSADBW PSADBW Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) j) k) xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 2 4 1 1 2 4 1 1 2 4 1 1 2 4 2 2 4 6 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 4 4 2 1 2 1 1 2 1 1 1 2 1 2 4 4 4 4 4 4 3 3 3 3 1 2 2 1 1 2 2 0.5 1 1 2 0.5 1 1 2 1 1 2 2
mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i
1 1 2 4 1 1 2 3 3 4
1 1 2 2 1 1 2 2 3
1 1 1 2 1 1 2 2 2 3
1 1 1
0.5 1 1 2 1 1 2 2 2 3
11
11
6 k)
g) SSE3 instruction only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.
k) You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.

Instruction Operands μops fused domain μops unfused domain (p0 p1 p01 p2 p3 p4) Latency Reciprocal throughput
1 2 3 2 1 2 2 2
Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D
xmm,xmm xmm,m128 m128,xmm xmm,m128
2 2 2 4
2 2 2 4 2
Pentium M MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD Conversion CVTPS2PD CVTPS2PD CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 8 1 2 1 1 1 1 1 2 3 4 2 2 4 4 4 2 3 4 1 1 1 1 j) 2 2 1 1 1 2 1 2 2 2 2 1 1 3-4 2 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 3 1 1 1 1 1 1 2 4 1 1 1 1 1 1 1 3 2 2 1 2 5 5 1 1
xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm
4 4 4 6 2 3 2 3 2 4 2 4 4 5 4 6 1 2 1 2 4 5 3 5 2 2 3 2 3 2
2 1 3 3
2 2 1 1 2 2
3 1 4 2 4 1 2 1 2 2 2 2 4 4 4 4 3 2 3 2 4 1 4 2 3 1 3 1 5 1 3 3 4 2 4 4 1 4 1 1 4
2 2
2 2
1 1 1 1 2 2
1 1
1 1 1 1 1 1
1 1
3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 1 1 1 1 2 2 2 2 1 1 1 1 1 1
Pentium M CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) HADDPS HSUBPS g) HADDPD HSUBPD g) MULSS MULSD MULSS MULSD MULPS MULPD MULPS MULPD DIVSS DIVSD DIVSS DIVSD DIVPS DIVPD DIVPS DIVPD CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D RCPSS RCPSS RCPPS RCPPS Math SQRTSS SQRTSS SQRTSD SQRTSD SQRTPS SQRTPD SQRTPS SQRTPD r32,m64 3 1 1 1 1
xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,m32 xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,xmm xmm,m32 xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128
1 2 2 4 2 6? 3 1 1 2 2 2 2 4 4 1 1 2 2 2 2 4 4 1 2 2 4 1 2 1 2 2 4 1 2 2 4
1 1 2 2 2 ? 3 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 1 1 2 2
1 2
1 1
2 2
1 1
2 2 1
3 3 3 3 3 7 4 4 5 4 5 4 5 4 5 9-18 c) 9-32 c) 9-18 c) 9-32 c) 16-34 c) 16-62 c) 16-34 c) 16-62 c) 3 3
2 1 1 2 1 3 2 3 3 3 3 3
1 1 2 2 2 4 2 1 2 1 2 2 4 2 4 8-17 c) 8-31 c) 8-17 c) 8-31 c) 16-34 c) 16-62 c) 16-34 c) 16-62 c) 1 1 2 2 1 1 1 1 2 2 1 1 2 2
xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128
2 3 1 2 2 2 4 4
2 2 1 1 2 2 2 2
6-30 1 5-58 1 8-56 16-114 2 2
4-28 4-28 4-57 4-57 16-55 16-114 16-55 16-114
Pentium M RSQRTSS RSQRTSS RSQRTPS RSQRTPS Logic AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: c) g) j) xmm,xmm xmm,m32 xmm,xmm xmm,m128 1 2 2 4 1 1 3 2 3 1 3 2 1 1 2 2
xmm,xmm xmm,m128
2 4
2 2
1 2
1 1
m32 m32 m4096 m4096
9 6 118 87
9 6 32 43
43 44
43
20 12 63 72
c) High values are typical, low values are for round divisors.
g) SSE3 instruction only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.
Merom
Intel Core 2 (Merom, 65nm)

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).

Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
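The way the Unit, Latency and Reciprocal throughput columns combine can be illustrated with a rough estimate. The following Python sketch is not part of the measurements. The function names and the example figures are hypothetical, and the model considers only the table values plus the 1 clock delay for crossing between the int and float units. It ignores cache misses, port contention and all other effects.

```python
def in_out_units(unit):
    """Return (input_unit, output_unit) for a Unit column entry.

    "fltint" receives its input in the float unit and delivers its
    output in the int unit. An empty string means nothing is listed.
    """
    if unit == "fltint":
        return ("float", "int")
    return (unit, unit)


def chain_latency(chain):
    """Estimate core clock cycles for a dependency chain.

    chain is a list of (latency, unit) pairs in program order. One extra
    cycle is added whenever a result produced in one unit is consumed in
    the other. Entries with no unit listed never add a delay here.
    """
    total = 0
    prev_out = ""
    for latency, unit in chain:
        unit_in, unit_out = in_out_units(unit)
        if prev_out and unit_in and prev_out != unit_in:
            total += 1  # crossing between the int and float units
        total += latency
        prev_out = unit_out
    return total


def independent_cycles(count, reciprocal_throughput):
    """Estimate core clock cycles for independent instructions of one kind."""
    return count * reciprocal_throughput


# Hypothetical chain: int op (1 clock), float op (3 clocks), int op (1 clock).
example_chain = [(1, "int"), (3, "float"), (1, "int")]
print(chain_latency(example_chain))   # 1 + (1 + 3) + (1 + 1) = 7
print(independent_cycles(100, 0.5))   # 50.0
```

The first figure applies when each instruction must wait for the previous result. The second applies when the instructions are independent, so that only the throughput limits the speed.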
Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
Merom Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE CLFLUSH IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r,m r,r r,m r,r r,m r i m sr 1 1 1 1 1 2 8 8 2 1 1 2 2 3 7 2 1 1 2 2 17 18 1 4 2 10 24 10 1 2 1 2 11 1 1 2 2 2 4 1 x x x 1 1 1 1 1 4 5 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 2 3 3 0.33 1 1 1 1 1 16 16 2 0.33 1 1 2 1 1 1 1 1 7 8 1 1.5 17 20 1 4 1 4 7 0.33 1 1 1 17 1 1 8 9 9 117
4 3
x x
x x
1 1 2 2 3 1 x x x x x x x x x 1 x x x 1 1 1 1 1 1 1 1 1 1
1 2 2 high b) 4 3
1 1 1 1 1 1 8
1 1 15 9 3 9 23 2 1 2 1 2 11 x x x 1 1 1 1 1 8
r (E/R)SP m sr
x x x 1 1
x x x
x x x 1
r,m r m m m
1 1 1 1 1 1 1 1 1 1 1
m8
240
r,r/i r,m m,r/i r,r/i r,m
1 1 2 2 2
1 1 1 2 2
x x x x x
x x x x x
x x x x x
1 1 1
int int int int int
1 6 2 2
0.33 1 1 2 2
Merom ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL m,r/i r,r/i m,r/i r m 4 1 1 1 3 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 3 5 4 32 56 4 6 5 32 56 1 1 3 1 1 1 1 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 3 5 4 32 56 3 5 4 31 55 1 1 x x x x x x x x x x x 1 x 1 x x x 1 1 1 1 1 x x 2 1 x x 1 1 1 1 1 1 x x 1 1 1 1 1 1 1 1 1 1 x x x x x x 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 7 1 1 1 6 0.33 1 0.33 1 1 1 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2 12 12-20 c) 12-36 c) 18-37 c) 28-40 c) 12 12-20 c) 12-36 c) 18-37 c) 28-40 c)
r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64
x x x
x x x
17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5
1 1 1 1 1 x x x x x
18 18-26 18-42 29-61 39-72 18 18-26 18-42 29-61 39-72 1 1
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl
1 1 2 1 1 1 3 1
1 1 1 1 1 1 2 1
x x x x x x x x
x x x x x
x x x x x x x x
1 1 1 1
int int int int int int int int
1 6 1 1 6 1
0.33 1 1 0.33 1 0.5 1 1
Merom ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 3 2 9 8 6 4 12 11 10 2 3 1 10 2 1 11 3 2 2 1 2 1 7 6 2 2 9 8 6 3 9 8 7 2 2 1 9 1 1 8 1 2 2 1 1 1 7 6 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int 6 2 12 11 11 7 14 13 13 2 7 1 1 2
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 5 1
1 1
1 1
1 5 6 2 1 1
1 2 1 1 0.33 4 14
Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS
1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5
1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 30 30 13 5
1 1 1 1 1 1 x x x
1 2
x x x x
x x x x
1 1 1
1 1 1
1 1
1 2 1 1 2 2 2
int int int int int int int int int int int int int int int int int int int int int
0 0 0 0 0
1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3
2
int
Merom REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) 4+7n - 14+6n 4 2 8+5n - 20+1.2n 8 5 1 1 1 7+7n - 13+n 4 3 7+8n - 17+7n 7 5 7+10n - 7+9n 1 1 int int int int int int int int int int 1+5n - 21+3n 1 7+2n - 0.55n
5 1 2
1+3n - 0.63n 1 3+8n - 23+6n 3 2+7n - 22+5n
i,0 a,b
1 1 3 12 3 46-100 29 23
1 1 3 10 2
x x x
x x x
x x x 1 1 1
int int int int int int int int int
0.33 1 8 8
180-215 64 54
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.

Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
1 1 4 40 1 1 7 170 1 1 2 3 3 1 2 1 1 2 38 1 1 2 1 x x x x 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 2 2 float float float float float float float float float float float float float float float
1 3 4 45 1 3 4 164 0 6 6 6 6 1 1 3 20 1 1 5 166 1 1 1 1 1 1 2
Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP FISTTP g) FLDZ FLD1
r m32/64 m80 m80 r m32/m64 m80 m80 r m m m m
3 x 166 x 0 f) 1 1 1 1 1 1 1 2 1
Merom FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: r AX m16 m16 m16 r m m 2 2 1 2 2 3 1 2 142 78 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 float float float float float float float float float float 2 2 2 1 2 10 8 1 2 192 177
1 184 169
r m r m r m
r m r m m m m
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 1 21-27 21-27 7-15 7-15
1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1
1 1 2 2 1
1 1 1
1 1 1 1
float float float float float float float float float float float float float float float float float float float float
1 1 5 2 2 6-38 d) 5-37 d) 5-37 d) 1 1 1 1 1 1 1 2 2 5-37 d) 2 1 1 16-56 22-29
27 82 1 ~96 ~100 ~19 ~53 ~98 ~70
27 82 1 ~96 ~100 ~19 ~53 ~98 ~70
float float float float float float float float float
41 170 6-69 ~96 ~115 ~45 ~96 ~136 ~119
1 2 4 15
1 2 4 15
float float float float
1 1 15 63
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
1 1 1 1 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 3 4 1 1 3 4 1 2 1 2 4 5 1 2 2 3 1 2 2 2 2 1 1 1 x x x x x x 1 x 1 x 1 1 1 x x x 1 4 2 2 1 1 x x x x x x x x x x x 1 2 2 1 2 1 2 int int int int 1 1 1 1 3 3 1 1 3 3 1 1 1 1 4 4 1 1 2 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x 1 1 1 1 1 1 int int fltint int int int fltint int int int int int int int int int fltint int int int int int int 1 3 1 3 1 1 3 1 3 1 2 2 1 int int 1 int int int int int
2 3 2 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 0.33 1 0.5 1 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1
Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PSHUFB h) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PALIGNR h)
r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm,i mm,m64,i xmm,xmm,i
x x
x x
x x
x x x
x x x
Merom PALIGNR h) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW PINSRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW xmm,m128,i mm,mm xmm,xmm r32,(x)mm r32,mm,i r32,xmm,i mm,r32,i mm,m16,i xmm,r32,i xmm,m16,i 2 4 10 1 2 3 1 2 3 4 2 x x x 1 int int int int int int int int int int 1 2-5 6-10 1 1 1 1 1 1.5 1.5
1 2 3 1 1 3 3
x x
x x
1 1 x x
2 3 5 2 6
1 1
(x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m
1 1 2 2 5 6 7 8 3 4 5 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 2 2 5 5 7 7 3 3 5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
x x x x
x x x x
1 1
int int int int int
1 2
0.5 1 1 1 4 4
int int 6
4 4 2 2 3 3 0.5 1 1 1 1 1 1 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5 1 1 1
1 1 1 x x 1 1 1 1 1 1 1 1 1 1 x x x x x x x x 1 1 x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1
int int int int int int int int int int int int int int int int int int int int int int int int int int int
3 5 1 3 3 3 3 3 1 1 1 1 3
Merom Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) h) k) (x)mm,(x)mm (x)mm,m mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 1 1 1 1 1 2 3 2 1 1 1 1 1 2 2 2 x x 1 1 1 x x x x x x x int int int int int int int int 1 1 1 2 2 0.33 1 1 1 1 1 1 1
1 1
x x x
11
11
float
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.

Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
1 1 1 4 9 1 1 1 2 2 1 1 1 1 3 4 1 2 1 2 1 2 3 4 1 1 x x x 1 1 2 4 1 1 x x x x 1 x x 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 fltint fltint float float int int int int fltint int float 3 1 1 1 3 1 1 1 1 1 1 float float 1 int 2 1 int 2 int int int int
1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 2 2 1 1 1 1 1 1 2 2 1
Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPS SHUFPD SHUFPD MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD
xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm
Merom UNPCKH/LPD Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD xmm,m128 2 1 1 1 float 1
xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64
2 2 2 2 2 2 2 2 1 1 1 1 2 3 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1
2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1
1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1
float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float
4 4 2 2 3 3 4 4 3 3 4 4 4 3 4 3
1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1
xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm
1 1 1 1 1 1 6 7 3 4 1 1 1
1 1 1 1 1 1 6 6 3 3 1 1 1
1 1 1 1 1 1
1 1 1 1 1
1 1 1
float float float float float float float float float float float float float
3 3 3 9 5 4 5
1 1 1 1 1 1 3 3 2 2 1 1 1
Merom MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float float float float float float float float float float float float float float float float float float float float float float float float 4 5 6-18 d) 6-32 d) 6-18 d) 6-32 d) 3 3 3 3 3 3 1 1 1 1 1 5-17 d) 5-17 d) 5-31 d) 5-31 d) 5-17 d) 5-17 d) 5-31 d) 5-31 d) 2 2 1 1 1 1 1 1 1 1 1 1
xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m
1 2 1 2 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1
float float float float float float
6-29 6-58 3
6-29 6-29 6-58 6-58 2 2
Logic AND/ANDN/OR/XORPS/D xmm,xmm AND/ANDN/OR/XORPS/D xmm,m128 Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: d) g)
1 1
1 1
x x
x x
x x
int int
0.33 1
m32 m32 m4096 m4096
14 6 141 119
13 4
1 1 1 145 164
42 19 145 164
d) Round divisors give low values.
g) SSE3 instruction set.
Wolfdale
Intel Core 2 (Wolfdale, 45nm)

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).

Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. fltint means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
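The rule for recognizing μop fusion from a table row can be written as a one-line check. The Python sketch below is only an illustration of that rule. The function name and the two example rows are hypothetical and are not taken from the measurements.

```python
def has_uop_fusion(fused_domain, p015=0, p2=0, p3=0, p4=0):
    """Return True if a table row indicates μop fusion.

    The rule from the column description: the instruction has μop fusion
    if p015 + p2 + p3 + p4 exceeds the count in the fused domain.
    Blank table cells are passed as 0.
    """
    return (p015 + p2 + p3 + p4) > fused_domain


# Hypothetical store-like row: 1 μop in the fused domain, which splits into
# a write-address μop (p3) and a write-data μop (p4) in the unfused domain.
print(has_uop_fusion(fused_domain=1, p3=1, p4=1))   # True

# Hypothetical register-only row: 1 fused μop, 1 μop on port 0, 1 or 5.
print(has_uop_fusion(fused_domain=1, p015=1))       # False
```

A row where the unfused sum equals the fused domain count has no fusion. Each μop in the fused domain then corresponds to exactly one μop at an execution port.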
Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
Wolfdale Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE CLFLUSH IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r16/32,m r64,m r,r r,m r,r r,m r i m sr 1 1 1 1 1 2 8 8 2 1 1 2 2 2 3 7 2 1 1 2 2 17 18 1 4 2 10 24 10 1 2 1 2 11 1 1 2 2 2 4 1 x x x 1 1 1 1 1 4 5 1 1 1 1 1 2 3 3 0.33 1 1 1 1 1 16 16 2 0.33 1 1 1 2 1 1 1 1 1 7 8 1 1.5 17 20 1 4 1 4 1 1 1 1 1 1 1 1 1 1 1 7 0.33 1 1 1 17 1 1 8 6 9 90
4 3
x x
x x
1 1 1 2 2 3 1 x x x x x x x x x x x x x x x 1 1
1 1
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8 2 high b) 4 3
1 1 15 9 3 9 23 2 1 2 1 2 11 x x x 1 1 1 1 1 8
r (E/R)SP m sr
2 1 1
x x x 1 1
x x x
x x x 1
r,m r m m m
m8
120
r,r/i r,m m,r/i r,r/i
1 1 2 2
1 1 1 2
x x x x
x x x x
x x x x
1 1 1 1 1 6 2
0.33 1 1 2
Wolfdale ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR r,m m,r/i r,r/i m,r/i r m 2 2 4 3 1 1 1 1 1 1 3 1 1 1 3 3 5 5 1 1 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 4 4 7 7 7 7 32-38 32-38 56-62 56-62 4 3 7 6 7 6 32 31 56 55 1 1 1 1 x x x x x x x x x x x x x x x x x 1 x x 1 x x x 1 1 1 1 1 x x 2 1 x x 1 1 1 1 1 1 1 2 1 x x x 2 3 2 9 10 13 x x x 1 2 2 3 2 x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x 1 1 1 1 1 1 1 1 2 7 1 1 1 6 2 0.33 1 0.33 1 1 1 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2
r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64
17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5
1 1 1 1 1
9-18 c) 14-22 c) 14-23 c) 18-57 c) 34-88 c) 9-18 14-22 c) 14-23 c) 34-88 c) 39-72 c) 1 1
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl
1 1 2 1 1 1 3
1 1 1 1 1 1 2
x x x x x x x
x x x x x
x x x x x x x
1 1 1 1 1 1 1 1 6 1 1 6 1
0.33 1 1 0.33 1 0.5 1
Wolfdale ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 1 3 2 9 8 6 4 12 11 10 2 3 1 9 3 1 10 3 2 2 1 2 1 6 6 1 2 2 9 8 6 3 9 8 7 2 2 1 8 2 1 7 1 2 2 1 1 1 6 6 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 6 2 12 11 11 7 14 13 13 2 7 1 1 1 2
x x x x x x x x x x x x x x x x 1 1 x x x x x
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 4 1
1 1
1 1
1 5 6 2 1
1 1
1 1 1 1 0.33 3 14
Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions
1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5
1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 30 30 13 5
1 1 1 1 1 1 x x x
0 0 0 0 0
1 2
x x x x
x x x x
1 1 1
1 1 1
1 1
1 2 1 1 2 2 2
1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3
Wolfdale LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) 3 2 4+7n-14+6n 4 2 8+5n-20+1.2n 8 5 1 1 1 7+7n-13+n 4 3 7+8n-17+7n 7 5 7+10n-7+9n 1 1+5n-21+3n 1 1 7+2n-0.55n 5 1+3n-0.63n 1 3+8n-23+6n 2 2+7n-22+5n 3 1 1 1
i,0 a,b
1 1 3 12 3 53-117 13 23
1 1 3 10 2
x x x
x x x
x x x 1 1 1
0.33 1 8 8
53-211 32 54
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results. The reciprocal throughput is only slightly less than the latency.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.

Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
1 1 4 40 1 1 7 171 1 1 2 3 1 1 2 38 1 1 2 x 1 1 2 2 1 2 2 1 1 1 1 1 1 2 2 float float float float float float float float float float float float
1 3 4 45 1 3 4 164 0 6 6 6 1 1 3 20 1 1 5 166 1 1 1 1
Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP
r m32/64 m80 m80 r m32/m64 m80 m80 r m m m
3 x 167 x 0 f) 1 1 1
x x 1 1 1
x x
Wolfdale FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN m 3 1 2 2 2 1 2 2 3 1 2 141 78 1 1 2 2 2 1 1 1 1 1 2 95 51 1 1 1 2 1 1 1 1 1 x x x x x x 1 1 1 2 1 1 float float float float float float float float float float float float float 6 1 1 2 2 2 1 2 10 8 1 2 142 177
r AX m16 m16 m16 r m m
x x 7 23 23 x 27
r m r m r m
r m r m m m m
1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 26-29 28-35 17-19
1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 x x x
1 1 1
1 1
1 1 1 1
x x x
x x x
float float float float float float float float float float float float float float float float float float float float float
1 1 5 2 2 6-21 d) 5-20 d) 6-21 d) 5-20 d) 1 1 1 1 1 1 1 2 2 5-20 d) 2 1 1
3 5 6-21
13-40 18-41 10-22
28 28 53-84 1 1 18-85 76-100 18-105 19 19 57-65 19-100 23-87
x x 1 x x x x x x x
x x x x x x x x x
x x x x x x x x x
float float float float float float float float float float
43 ~170 6-20 32-85 70-100 38-107 45 50-100 40-130 55-130
Wolfdale Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 1 2 4 15 1 2 4 15 1 x x float float float float 1 1 15 63
x x x
x x x
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Instruction Operands ops ops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 x x x x x x 1 x 1 x 1 1 1 x x x 1 4 2 2 1 1 x x x x x x x x x x x 1 2 2 1 2 1 2 int int int int 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int 1 1 1 2 1 1 int int 1 int int int int int Laten- Recicy procal throughput 2 3 2 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 0.33 1 0.5 1 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
Move instructions MOVD k) r32/64,(x)mm MOVD k) m32/64,(x)mm MOVD k) (x)mm,r32/64 MOVD k) (x)mm,m32/64 MOVQ (x)mm, (x)mm MOVQ (x)mm,m64 MOVQ m64, (x)mm MOVDQA xmm, xmm MOVDQA xmm, m128 MOVDQA m128, xmm MOVDQU m128, xmm MOVDQU xmm, m128 LDDQU g) xmm, m128 MOVDQ2Q mm, xmm MOVQ2DQ xmm,mm MOVNTQ m64,mm MOVNTDQ m128,xmm MOVNTDQA j) xmm, m128 PACKSSWB/DW PACKUSWB mm,mm PACKSSWB/DW PACKUSWB mm,m64 PACKSSWB/DW PACKUSWB xmm,xmm PACKSSWB/DW PACKUSWB xmm,m128 PACKUSDW j) xmm,xmm PACKUSDW j) xmm,m PUNPCKH/LBW/WD/DQ mm,mm PUNPCKH/LBW/WD/DQ mm,m64 PUNPCKH/LBW/WD/DQ xmm,xmm PUNPCKH/LBW/WD/DQ xmm,m128 PUNPCKH/LQDQ xmm,xmm PUNPCKH/LQDQ xmm, m128
x x
1 1 1 1
Wolfdale PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i x,x,xmm0 x,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 2 3 1 1 2 2 1 1 4 10 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2 2 1 1 1 1 1 x x x x x 3 x x x x x x x x 1 x 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2-5 6-10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 2
1 3
1 1 1 1 1 1 1 1 1 1 1
2 3 3 3 3 3 1 2 1 1
Wolfdale Arithmetic instructions PADD/SUB(U)(S)B/W/D (x)mm, (x)mm PADD/SUB(U)(S)B/W/D (x)mm,m PADDQ PSUBQ (x)mm, (x)mm PADDQ PSUBQ (x)mm,m PHADD(S)W PHSUB(S)W h) (x)mm, (x)mm PHADD(S)W PHSUB(S)W h) (x)mm,m64 PHADDD PHSUBD h) (x)mm, (x)mm PHADDD PHSUBD h) (x)mm,m64 PCMPEQ/GTB/W/D (x)mm,(x)mm PCMPEQ/GTB/W/D (x)mm,m PCMPEQQ j) xmm,xmm PCMPEQQ j) xmm,m128 PMULL/HW PMULHUW (x)mm,(x)mm PMULL/HW PMULHUW (x)mm,m PMULHRSW h) (x)mm,(x)mm PMULHRSW h) (x)mm,m PMULLD j) xmm,xmm PMULLD j) xmm,m128 PMULDQ j) xmm,xmm PMULDQ j) xmm,m128 PMULUDQ (x)mm,(x)mm PMULUDQ (x)mm,m PMADDWD (x)mm,(x)mm PMADDWD (x)mm,m PMADDUBSW h) (x)mm,(x)mm PMADDUBSW h) (x)mm,m PAVGB/W (x)mm,(x)mm PAVGB/W (x)mm,m PMIN/MAXSB j) xmm,xmm PMIN/MAXSB j) xmm,m128 PMIN/MAXUB (x)mm,(x)mm PMIN/MAXUB (x)mm,m PMIN/MAXSW (x)mm,(x)mm PMIN/MAXSW (x)mm,m PMIN/MAXUW j) xmm,xmm PMIN/MAXUW j) xmm,m PMIN/MAXSD j) xmm,xmm PMIN/MAXSD j) xmm,m128 PMIN/MAXUD j) xmm,xmm PMIN/MAXUD j) xmm,m128 PHMINPOSUW j) xmm,xmm PHMINPOSUW j) xmm,m128 PABSB PABSW PABSD h)(x)mm,(x)mm PABSB PABSW PABSD h) (x)mm,m PSIGNB PSIGNW PSIGND h) (x)mm,(x)mm 1 1 2 2 3 4 3 4 1 1 1 1 1 1 1 1 4 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 1 1 1 1 2 2 3 3 3 3 1 1 1 1 1 1 1 1 4 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 1 1 x x x x 1 1 1 1 x x x x x x 2 2 2 2 x x 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 x x 1 1 x x x x 1 1 1 1 1 1 1 4 4 x x x 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 1 2 0.5 1 1 1 2 2 2 2 0.5 1 1 1 1 1 1 1 2 4 1 1 1 1 1 1 1 1 0.5 1 1 1 0.5 1 0.5 1 1 1 1 1 1 1 4 4 0.5 1 0.5
1 1
3 1 1 3 3 5 5 3 3 3 3 1 1 1 1 1 1 1 4 1
x x 1 1 x x x x 1
x x x
Wolfdale PSIGNB PSIGNW PSIGND h) PSADBW PSADBW MPSADBW j) MPSADBW j) Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) h) j) k) m) (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm,i xmm,m,i 1 1 1 3 4 1 1 1 3 3 x 1 1 1 1 x 1 1 2 2 1 int int int int int 1 1 1 2 2
3 5
(x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i
1 1 2 2 1 1 1 2 3 1
1 1 2 2 1 1 1 2 2 1
x x 1 1 1 1 1 x x x
x x x x
x x x x
1 1 1
x x x
int int int int int int int int int int
1 1 1 1 2 1
0.33 1 1 1 1 1 1 1 1 1
11
11
float
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
m) Only available in 64 bit mode.

Instruction Operands μops fused domain μops unfused domain (p015 p0 p1 p5 p2 p3 p4) Unit Latency Reciprocal throughput
1 1 1 4 9 1 1 1 2 2 1 1 1 1 1 1 x x x 1 1 2 4 1 1 x x x x 1 x x 2 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 int 1 1 1 1 1 1 float float 1 int 2 1 int 2 int int int int
1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 1
Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS
xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i
Wolfdale SHUFPS SHUFPD SHUFPD BLENDPS/PD j) BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i
xmm,xmm,xmm0
xmm,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i
2 1 2 1 1 2 2 1 2 1 2 1 1 1 2 2 2 1 2
1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1
1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 x
1 1 1 1 1 1 1 1
x 1 1 1
1 1
int float float int int int int int int int int int int float float int int int int
1 1 2 1 1 1 1 4 1
1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32
2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2
2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2
1 1 1 1 2 2 2 2
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
float float float float float float float float float float float float float float float float float float float float float float float float float float float float float
4 4 2 2 3 3 4 4 3 3 4 4 4 3 4
1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3
Wolfdale CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D j) ROUNDSS/D j) ROUNDPS/D j) ROUNDPS/D j) DPPS j) DPPS j) DPPD j) xmm,m32 r32,xmm r32,m64 2 1 1 1 1 1 1 1 1 1 1 float float float 3 3 1 1
xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i
1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4
1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4
x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 x x
1 1 1 2 2 x x 1 1 1 1 1 1 1 1 1 1
2 2 x
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 x
1 1 1 1 1 1 1 1 1 x
float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float
3 3 3 7 6 4 5 4 5 6-13 d) 6-21 d) 6-13 d) 6-21 d) 3 3 3 3 3 3 3 3 11 9
1 1 1 1 1 1 3 3 1.5 1.5 1 1 1 1 1 1 1 1 5-12 d) 5-12 d) 5-20 d) 5-20 d) 5-12 d) 5-12 d) 5-20 d) 5-20 d) 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
Wolfdale DPPD j) Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS Logic
AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D
xmm,m128,i
float
1 2 1 2 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1
6-13 6-20 3
5-12 5-12 5-19 5-19 2 2
xmm,xmm xmm,m128
1 1
1 1
x x
x x
x x
int int
0.33 1
Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: d) g)
m32 m32 m4096 m4096
13 10 151 121
12 8 67 74
x x x x
x x x x
x 1 x 1 1 x 8 38 38 x 47
38 20 145 150
d) Round divisors give low values.
g) SSE3 instruction set.
Nehalem
Intel Nehalem
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Domain: Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a μop in one domain is read by a μop in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp", and between the "ivec" and "fp" units. The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain such as a general purpose register (e.g. eax) in the "ivec" domain.
For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than 1. The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain. The bypass delay from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table. Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
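The fusion rule and the bypass delays described above can be checked mechanically for a single table row. The following is a minimal Python sketch, not part of the manual: the function names, the dictionary layout and the sample row are hypothetical and chosen only for illustration. The delay values (1 cycle between "int" and "ivec", 2 cycles for any pair involving "fp") are taken from the text.

```python
# Hypothetical helpers for reading one row of the Nehalem tables.
# Names and data layout are illustrative assumptions, not from the manual.

# Bypass delays in clock cycles between execution domains, as stated in the text.
BYPASS_DELAY = {
    frozenset(("int", "ivec")): 1,
    frozenset(("int", "fp")): 2,
    frozenset(("ivec", "fp")): 2,
}


def has_uop_fusion(fused, p015, p2, p3, p4):
    """Return True if p015 + p2 + p3 + p4 exceeds the fused-domain count."""
    return p015 + p2 + p3 + p4 > fused


def bypass_delay(producer, consumer):
    """Extra cycles when a value written in one domain is read in another."""
    if producer == consumer:
        return 0
    return BYPASS_DELAY[frozenset((producer, consumer))]


def effective_latency(own_latency, producer, consumer):
    """Latency of the instruction itself plus the bypass delay."""
    return own_latency + bypass_delay(producer, consumer)


if __name__ == "__main__":
    # Illustrative row: 1 fused uop, one uop on p015 and one on p2 (memory read).
    print(has_uop_fusion(fused=1, p015=1, p2=1, p3=0, p4=0))
    # PEXTRW: own latency 2, xmm source coming from the "ivec" or "fp" domain.
    print(effective_latency(2, "ivec", "int"))
    print(effective_latency(2, "fp", "int"))
    # COMISS: own latency 1, flags consumed in the "int" domain.
    print(effective_latency(1, "fp", "int"))
```

The PEXTRW and COMISS calls reproduce the 2+1 and 1+2 figures quoted in the text. The lookup uses an unordered pair as key because the text gives one delay per pair of domains without distinguishing direction.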
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity greatly increase the delays, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
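Two consequences of the latency definition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a measurement method from the manual: the function names are hypothetical, the chain estimate ignores everything except the listed minimum latencies, and the conversion assumes that the core clock frequency stays constant during the measurement.

```python
# Hypothetical helpers; names and example numbers are illustrative assumptions.

def chain_latency(latencies):
    """Minimum core clock cycles for a chain in which each instruction
    depends on the result of the previous one: the latencies add up."""
    return sum(latencies)


def reference_to_core_cycles(ref_cycles, core_hz, ref_hz):
    """Scale a count of reference clock ticks (as read from the time stamp
    counter) to core clock cycles, assuming a constant core frequency."""
    return ref_cycles * core_hz / ref_hz


if __name__ == "__main__":
    # Three dependent instructions with example latencies of 3, 1 and 5 cycles.
    print(chain_latency([3, 1, 5]))
    # 1000 reference ticks at a 2 GHz reference clock and a 3 GHz core clock.
    print(reference_to_core_cycles(1000, 3_000_000_000, 2_000_000_000))
```

If the core frequency changes during a measurement, for example through power management, a simple ratio like this no longer holds and the reference count cannot be converted reliably.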
Instruction Operands Doops ops unfused domain main fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 2 6 6 2 1 1 2 2 3 7 2 1 1 2 2 3 18 1 3 2 7 8 10 1 2 1 1 1 x x x 1 1 1 1 1 3 4 1 1 1 1 int int int int int int int int int int 1 2 2 3 1 x x x x x x x x x 1 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 8 int int int int int int int int int int int int int int int int int int int int int int Laten- Recicy procal throughput 1 2 3 3 0.33 1 1 1 1 1 13 14 1 0.33 1 1 2 1 1 1 1 1 1 8 1 5 1 15 14 8 0.33 1 1 1
Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP
r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r r,m r,r r,m r,r r,m r i m sr
3 2
x x
x x
1 1 x x x
~270 1
2 2 20 b) 5 3
1 1 2 2 2 2 7 2 1 2 1 1 x x x x 1 1 x x x 1 1 1 5 1 8
r (E/R)SP m sr
x x x
x x x 1 1
x x x
r,m r32
1 4 1 1
Nehalem BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV c) DIV c) DIV c) DIV c) IDIV c) IDIV c) IDIV c) IDIV c) r64 m m m 1 9 1 1 2 3 2 1 3 x 1 x x 6 1 1 1 1 1 1 1 1 int int int int int int int 3 1 15 1 1 9 23 5
r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m
r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64
1 1 2 2 2 4 1 1 1 3 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 4 6 6 ~40 4 8 7 ~60
1 1 1 2 2 3 1 1 1 1 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 4 6 6 4 8 7
x x x x x x x x x x x x x x x
x x x x x x x x x x 1 x x 1 x x x 1 1 1 1
x x x x x x x x x x x x x x x
1 1 1 1 1 1
1 x x 2 1 x x 1 1 1 1 1 1 1 x x x 1 x x x 2 4 3 x 2 5 3 x 1 x x x 1 x x x x x 1 1 1 1 1 1 1 1 1 1
int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int
1 6 2 2 7 1 1 1 6 3 15 20 3 5 5 3 3 3 3 3 3 3 3 5 5 3 3 3 3
0.33 1 1 2 2 0.33 1 0.33 1 1 2 7 1 2 2 2 1 1 1 1 1 2 1 2 2 2 1 1 1 1 1 1 7-11 7-12 7-17 19-69 7-11 7-12 7-17 26-86
11-21 17-22 17-28 28-90 10-22 18-23 17-28 37-100
Nehalem CBW CWDE CDQE CWD CDQ CQO POPCNT ) POPCNT ) CRC32 ) CRC32 ) Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHLD SHRD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD 1 1 1 1 1 1 1 1 1 1 1 1 x x x 1 1 1 1 x x 1 1 int int int int int int 1 1 3 3 1 1 1 1 1 1
r,r r,m r,r r,m
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m
1 1 2 1 1 1 3 1 3 2 9 8 6 4 12 11 10 2 3 2 3 1 9 2 1 10 3 1 2 1 2 1 2 2
1 1 1 1 1 1 2 1 2 2 9 8 6 3 9 8 7 2 2 2 2 1 8 2 1 7 3 1 1 1 1 1 2 2
x x x x x x x x x x x x x x x x x x x x x x x x x x x
x x x x x
x x x x x x x x x x x x
x 1 1
x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 1 1 1 1
1 1
1 1
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1
1 1
x x x x x
x x x x
x x x x x
int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int
1 6 1 1 6 1 6 2 13 11 12-13 7 16 14 15 3 8 4 9 1
0.33 1 1 0.33 1 0.5 1 1 1 2
12-13
1 1 1 5 1 1
1 6 6 3 3 1 1
1 1 1 1 0.33 4 5
Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near
1 31 1 1 31 1
1 31 1 1 31 1
1 1 1 1
1 11
int int int int int int
0 0 0 0
2 67 2 2 73 2
Nehalem Fused compare/test and branch e) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) ) 1 2 6 11 2 46 3 4 47 1 3 39 40 15 4 1 2 6 11 2 46 2 3 47 1 2 39 40 13 4 x x x x x x 1 1 x x 1 9 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int 0 2 2 4 7 2 74 2 2 79 2 2 120 124 7 5
small n large n small n large n
2 1 11+4n 3 1 60+n 2.5/16 bytes 5 2 13+6n 2/16 bytes 3 2 37+6n 5 3 65+8n
x x
x x
x x
1 1 1
x x
x x
x x
1 2
int int int int int int int int int int int int
1 40+12n 1 12+n 1 clk / 16 bytes 4 12+n 1 clk / 16 bytes 1 40+2n 4 42+2n
a,0 a,b
1 1 5 11 34+7b 3 25-100 22 28
1 1 5 9 3
x x x x
x x x x
x x x x
1 1
int int int int int int int int int
0.33 1 9 8 79+5b ~200 5 ~200 24 40-60
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.
) SSE4.2 instruction set.
Nehalem

Instruction Operands ops ops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 41 1 1 7 208 1 1 3 3 1 2 2 2 2 3 2 2 1 2 143 79 1 1 2 38 1 3 204 0 f) 1 1 1 1 2 2 2 2 2 1 1 1 2 89 52 1 1 x 1 x x 1 x 1 2 3 1 2 2 1 1 1 1 1 1 2 2 float float float float float float float float float float float float float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 4 5 242 0 6 7 7 1 1 2 20 1 1 5 245 1 1 1 1 1 2 2 2 1 2 31 1 1 4 178 156
Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM
r m32/64 m80 m80 r m32/m64 m80 m80 r m m m
x x 1 1 1
x x
1 1 2
1 2
2+2
1 1 1 1 x x x 1 x x x
1 1
7 5 1 178 156
x x 8 23 23 x 27
r m r m r m
r m r m m m m
1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1
1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1
1 1 1
1 1
1 1 1 1
float 3 1 float 1 float 5 1 float 1 float 7-27 d) 7-27 d) float 7-27 d) 7-27 d) float 1 1 float 1 1 float 1 float 1 float 1 float 1 float 3 2 float 5 2 float 7-27 d) 7-27 d) float 1 float 1 float 1
Nehalem FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 25 35 17 25 35 17 x x x x x x x x x float float float 14 19 22
24 17 1 ~100 ~100 ~100 19 ~55 ~100 ~82
24 17 1
~100 ~100 ~100
19 ~55
~100
~82
x x 1 x x x x x x x
x x x x x x x x x
x x x x x x x x x
float 12 float 13 float ~27 float 40-100 float 40-100 float ~110 float 58 float ~80 float ~115 float ~120
1 1 1 2 2 x 3 3 ~190 ~190 x
x x x
x x x
float float float float
1 1 17 77
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.

Instruction Operands ops ops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x 1 x 1 x 1 1 1 x x x 1 1 1 1 1 1 1 1 1 1 x x x x x x 1 1 Page 112 1 1 ivec ivec 1 1 1 ivec ivec 1 ivec int Laten- Recicy procal throughput 1+1 3 1+1 2 1 2 3 1 2 3 2 3 2 1 1 ~270 ~270 0.33 1 0.33 1 0.33 1 1 0.33 1 1 1 1 1 0.33 0.33 2 2
Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ
r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm
Nehalem MOVNTDQA j) PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW j) PACKUSDW j) PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) xmm, m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m (x)mm, (x)mm (x)mm,m xmm,xmm xmm, m128 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 (x)mm, (x)mm (x)mm,m mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 xmm,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 x 1 x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 1 2 1 2 1 x ivec ivec float ivec ivec 1 1 1 ivec 1 ivec 2+1 2+1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ivec 1 1 ivec 2 1 1 1 2 0.5 2 2 2 0.5 2 0.5 1 1 2 1 2 1 2 1 2 1 2 1 2 0.5 1 0.5 1 0.5 1 0.5 1 1 1 1 1 0.5 1 2 7 1 1 1 1 1 1 1 1
2+2 2+1 2+1
Nehalem PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) Arithmetic instructions PADD/SUB(U) (S)B/W/D/Q PADD/SUB(U) (S)B/W/D/Q PHADD/SUB(S)W/D h) PHADD/SUB(S)W/D h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ j) PCMPEQQ j) PCMPGTQ ) PCMPGTQ ) PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULLD j) PMULLD j) PMULDQ j) PMULDQ j) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXSB j) PMIN/MAXSB j) PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW j) PMIN/MAXUW j) PMIN/MAXU/SD j) PMIN/MAXU/SD j) PHMINPOSUW j) m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x 1 1 ivec 1 ivec 1 ivec 1 1+1 1+1 1+1 1 ivec 1+1 1 1 1 1 1 1 1 1 1
(x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m xmm,xmm xmm,m128 xmm,xmm
1 1 3 4 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 3 3 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
x x x x x x x x 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 x x x x x x x x x x x x 1
x x x x x x x x 1
ivec
0.5 2 1.5 3 0.5 2 0.5 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 0.5 1 1 2 0.5 2 0.5 2 1 2 1 2 1
ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1
3 1 1 3 3 3 6 3 3 3 3 1 1 1 1 1 1 3
x x x x x x x x x x x x
ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec
Nehalem PHMINPOSUW j) PABSB PABSW PABSD h) PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW MPSADBW j) MPSADBW j) PCLMULQDQ n) AESDEC, AESDECLAST, AESENC, AESENCLAST n) AESIMC n) AESKEYGENASSIST n) Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ String instructions PCMPESTRI ) PCMPESTRI ) PCMPESTRM ) PCMPESTRM ) PCMPISTRI ) PCMPISTRI ) PCMPISTRM ) PCMPISTRM ) Other EMMS Notes: g) h) j) k) xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm,i xmm,m,i xmm,xmm,i 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 1 3 3 x x x x 1 1 x x 1 x x x x 1 ivec 1 x x ivec 1 12 5 3 1 ivec 1 1 ivec 1 3 0.5 1 0.5 2 1 3 1 2 8
x x
xmm,xmm xmm,xmm xmm,xmm,i
~5 ~5 ~5
~2 ~2 ~2
(x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i
1 1 2 2 1 1 1 2 3 1
1 1 2 2 1 1 1 2 2 1
x x x x
x x x
x x x x 1 1 1 1 1
x x x x
ivec 1 ivec 1 ivec 1 ivec ivec 1 ivec
1 3 1 1 2 1
x x x
0.33 1 1 1 1 2 1 2 1 1
xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i
8 9 9 10 3 4 4 6
8 8 9 10 3 4 4 5
x x x x x x x x
x x x x x x x x
x x x x x x x x
1 1 1 1
ivec ivec ivec ivec ivec ivec ivec ivec
14 14 7 7 8 8 7 7
5 6 6 6 2 2 2 5
11
11
float
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
) SSE4.2 instruction set.
m) Only available in 64 bit mode.
n) Only available on newer models.

Instruction Operands Doops ops unfused domain fused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 3 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 float float 1 float float float 1 1+2 1 1 1 1 float 1 float float float float float float float 1 1 1 float float 1 1 1 float Laten- Recicy procal throughput 1 2 3 2 3 1 2 3 3 5 1 1+2 ~270 1 1 2 1 2 1 1 1 1 1-4 1-3 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2
Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D BLENDPS/PD j) BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD
xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i x,x,xmm0 xmm,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i
xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32
2 2 2 2 2 2 1 1
2 2 2 2 2 2 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
float float float float float float float float
4 4 2 1
1 1 1 1 1 1 1 2
Nehalem CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULPS MULSS MULPS MULSD MULPD MULSD MULPD DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D xmm,m 2 1 Page 117 1 1 float 1 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x 1
float/ivec
float float float float float float float float float float float float
ivec/float
3+2 3+2 4+2 4+2 3+2 3+2 6 6 3+2 3+2 4+2 3+2
x x
1 1 1 1 1 float float float float float float float float
1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1
xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm
1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 1 1 1 1 1 1 1
float float float float float float float float float float float float float float float float float float float float float
3 3 3 5 3 4 5 7-14 7-22 3
1 1 1 1 1 1 2 2 2 2 1 1 1 1 7-14 7-14 7-22 7-22 2 2 1
Nehalem COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D ROUNDPS/D j) ROUNDSS/D ROUNDPS/D j) DPPS j) DPPS j) DPPD j) DPPD j) Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS Logic
AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D
xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i
1 1 1 1 1 1 1 2 4 6 3 4
1 1 1 1 1 1 1 1 4 5 3 3
1 1 1 1 1 1 1 1 2 x x x
1 1 1
float float float float float float float
1+2 3 3
1 1 1 1 1 1 1 1 2 1 3
1 1 x x x 1 1
1 x x x
float float float float float
11 9
1 2 1 2 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1
7-18 7-32 3
7-18 7-18 7-32 7-32 2 2
xmm,xmm xmm,m128
1 1
1 1
1 1
float float
1 1
Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: g)
m32 m32 m4096 m4096
6 2 141 112
6 x 1 141 x 90 x
x x x
x 1 1 1 1 x 5 38 38 x 42
90
5 1 90 100
SSE3 instruction set.
Sandy Bridge
Intel Sandy Bridge

Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
µops fused domain: The number of µops at the decode, rename, allocate and retirement stages in the pipeline. Fused µops count as one.
µops unfused domain: The number of µops for each execution port. Fused µops count as two. Fused macro-ops count as one. The instruction has µop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under µops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write µop in the second clock cycle, but a read port can receive an address-calculation µop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the µops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one µop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these µops go to.
p015: The total number of µops going to port 0, 1 and 5.
p0: The number of µops going to port 0 (execution units).
p1: The number of µops going to port 1 (execution units).
p5: The number of µops going to port 5 (execution units).
p23: The number of µops going to port 2 or 3 (memory read or address calculation).
p4: The number of µops going to port 4 (memory write data).
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. For shorter sequences of code, the latencies may be one or two clock cycles longer and the reciprocal throughputs may be double the listed values. There is no warm-up effect when vectors are 128 bits wide or less.
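The fusion rule stated in the legend above can be checked mechanically for any table row. The following is a minimal sketch in Python, not part of the original tables. The function names and the sample cell values are hypothetical and chosen only for illustration; cells are passed as strings so that the "1+" notation for 256-bit accesses can be handled.

```python
def parse_uop_count(cell):
    """Convert a table cell such as "2", "1+" or "" to an integer count."""
    cell = cell.strip()
    if not cell:
        return 0  # an empty cell means no micro-operations on that port
    # A trailing "+" marks a 256-bit access that keeps the port busy for
    # two clock cycles; it is still counted as a single micro-operation.
    return int(cell.rstrip("+"))


def has_uop_fusion(fused_domain, p015, p23, p4):
    """Apply the legend's rule: fusion if p015 + p23 + p4 > fused-domain count."""
    unfused_total = sum(parse_uop_count(c) for c in (p015, p23, p4))
    return unfused_total > parse_uop_count(fused_domain)


# Illustrative rows (assumed values, not copied from the tables):
# one fused-domain op that splits into an ALU op and a memory read op.
print(has_uop_fusion("1", "1", "1", ""))
# one fused-domain op that is a single ALU op.
print(has_uop_fusion("1", "1", "", ""))
```

Passing the cells as text keeps the sketch close to how the values appear in the tables; a real parser would also need to handle ranges and approximate values such as "~300".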
Instruction  Operands  µops fused domain  µops unfused domain (p015 p0 p1 p5 p23 p4)  Latency  Reciprocal throughput  Comments
Move instructions
Sandy Bridge MOV MOV r,r/i r,m 1 1 1 x x x 1 1 2 0.33 0.5 all addressing modes
MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA LEA
m,r m,i m,r r,r r,m r,r r,m r,r r,m
1 1 2 1 1 2 2 3 8 3 1 1 2 3 16 1 1 2 9 18 1 3 1 1
1 1 1 1 x x x 1 2 2 3 x x x x x x x x x
1 1 1
3 ~350 1
1 1 1 0.33 0.5
2 1 2 1 2 25 7 3
1 1 1 implicit lock 1 1 1 1 1 8 0.5 0.5 1 18 9 1 1 0.5 1
r i m
2 0 0 8 10 1 3 1 1
r (E/R)SP m
1 1 1 2 1 8 1 1 2 1 8
1 1 1 1 8
not 64 bit
2 1
not 64 bit not 64 bit simple complex or rip relative
r,m r,m
1 1
1 1 1 3
BSWAP BSWAP PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB SUB ADC SBB ADC SBB ADC SBB CMP CMP
r32 r64 m m
1 2 1 1 2 3 2
1 2
1 2 1 1 1 1 1
1 2
1 1 1
1 1 0.5 0.5 4 33 6
r,r/i r,m m,r/i r,same r,r/i r,m m,r/i r,r/i m,r/i
1 1 2 1 2 2 4 1 1
1 1 1 0 2 2 3 1 1
x x x x x x x x
x x x x x x x x
x x x x x x x x
1 1 2 1 6 0 2 2 7 1 1
1 2 1
0.33 0.5 1 0.25 1 1 1.5 0.33 0.5
Sandy Bridge INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT CRC32 CRC32 Logic instructions AND OR XOR AND OR XOR AND OR XOR XOR TEST TEST SHR SHL SAR r m 1 3 2 3 3 8 1 4 3 2 1 2 1 1 1 4 3 2 1 2 1 1 10 11 10 34-56 10 10 9 59138 1 1 1 2 1 1 1 1 1 1 1 1 2 3 3 8 1 4 3 2 1 2 1 1 1 3 2 1 1 2 1 1 10 11 10 10 10 9 x x x x x x 2 1 1 6 4 4 2 20 3 4 4 3 3 4 3 3 3 0.33 2 not 64 bit not 64 bit not 64 bit not 64 bit
r8 r16 r32 r64 r,r r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r,m r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64
1 1 1 1
1 1 1
1 1 1 1 1 1 1 1
20-24 21-25 20-28 30-94 21-24 21-25 20-27 40-103 1 1 1 1 1 1 3 1 3 1
11 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 11-14 11-14 11-18 22-76 11-14 11-14 11-18 25-84 0.5 1 0.5 1 1 0.5 1 1 1 1
r,r r,m r,r r,m
1 1 1 2 1 1 1 1 1 1
1 1 1 1
SSE4.2 SSE4.2 SSE4.2 SSE4.2
r,r/i r,m m,r/i r,same r,r/i m,r/i r,i
1 1 2 1 1 1 1
1 1 1 0 1 1 1
x x x x x x
x x x x x
x x x x x x
1 1 2 1 6 0 1 1
0.33 0.5 1 0.25 0.33 0.5 0.5
Sandy Bridge SHR SHL SAR SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL RCR RCR RCR RCR RCR RCR RCL RCL RCL RCL RCL SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m,i r,cl m,cl r,i m,i r,cl m,cl r8,1 r16/32/64,1 r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl m,r,cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 3 3 5 1 4 3 5 high 3 8 11 8 11 3 8 11 8 11 1 3 4 5 1 10 2 1 11 3 1 1 1 2 1 1 3 1 3 3 1 3 3 3 3 8 7 8 7 3 8 7 8 7 1 2 4 3 1 8 1 1 7 1 1 1 1 1 0 1 3 2 1 1 1 2 2 1 x x x x x x x 1 1 1 1 1 1 1 3 1 2 1 1 2 2 2 2 1 1 1 1 2 1 high 2 5 5 2 6 6 1 2 2 2 4 1 2 2 4 high 2 5 6 5 6 2 6 6 6 6 0.5 2 2 4 0.5 5 0.5 0.5 5 2 1 1 0.5 1 0.25 0.33 4
Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near
1 1 1 1
1 1 1 1
1 1 1 1
0 0 0 0
2 2 2 1-2
fastest if not jumping
Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL
1 short short short near r 2 7 11 3 2
1 2 7 11 2 1 x x
1 1
1-2 2-4 5 5 2 2
1 1
1 1
1 1
Sandy Bridge CALL RET RET BOUND INTO String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) m i r,m 3 2 3 15 4 2 2 2 13 4 1 1 1 2 1 1 1 2 2 2 7 6
not 64 bit not 64 bit
3 5n+12 3 2n 1.5/16B 5 2n 3/16B 3 6n+47 5 8n+80
2 1
1 ~2n 1 1 n 1/16B
1 1 worst case best case 4 1.5 n 1/16B 1 2n+45 4 2n+80 worst case best case
1 1
0 0
0.25 0.25
decode only 1 per clk
PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC
a,0 a,b
7 7 12 10 49+6b 3 3 31-75 21 35
2 1
1 84+3b
11 8 7 100-250 28 42

Instruction  Operands  µops fused domain  µops unfused domain (p015 p0 p1 p5 p23 p4)  Latency  Reciprocal throughput  Comments
1 1 4 43 1 1 7 1 1 2 40 1 3 1 1 1 1 2 1 2 1 1 2 3 1 3 4 45 1 4 5 1 1 2 21 1 1 5
Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP
r m32/64 m80 m80 r m32/m64 m80
Sandy Bridge FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X m80 r m m m 246 1 1 3 3 1 2 2 3 2 2 3 2 1 1 143 90 0 1 1 1 1 2 2 3 2 1 2 1 1 1 0 6 7 7 252 0.5 1 2 2 2 2 2 2 1 1 1 1 1 166 165
1 1 1 1 1 1 2
1 1 1
1 1
SSE3
3 2 1 1 1 1 1 8 5 1
1 1
r m r m r m
r m r m m m m
1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 41-87 17
1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 17
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1
3 1 5 1 10-24 1 1 1 3 1 4 1 1 1 1
1 1
1 1 1 1 10-24 10-24 1 1 1 1 1 1 1 1 2 1 2 21 26-50
21 26-50 22
27 27 17 17 1 1 1 64-100 20-110 20-110 53-118 454 454
12 10 10-24 47-100 47-115 43-123 61-69 724
Sandy Bridge FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT 464 464 102 102 28-91 726 130 93-146
1 2 5 26
1 2 5 26
1 1 22 81

Instruction  Operands  µops fused domain  µops unfused domain (p015 p0 p1 p5 p23 p4)  Latency  Reciprocal throughput  Comments
1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 x x x x x x x 1 x 1 x 1 1 1 x x x 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 3 1 3 3 3 3 3 1 1 ~300 ~300 0.33 1 0.33 0.5 0.33 0.5 1 0.33 0.5 1 0.5 1 0.5 1 0.33 1 0.5 1 1 x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1
Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW PMOVSX/ZXBW
r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x x, m128 m128, x x, m128 mm, x x,mm m64,mm m128,x x, m128 mm,mm mm,m64 x,x x,m128 x,x x,m (x)mm,(x)mm (x)mm,m x,x x, m128 x,x x,m64
1 1 1 2 1
SSE3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1
SSE4.1
SSE4.1 SSE4.1
SSE4.1 SSE4.1
Sandy Bridge PMOVSX/ZXBD PMOVSX/ZXBD PMOVSX/ZXBQ PMOVSX/ZXBQ PMOVSX/ZXWD PMOVSX/ZXWD PMOVSX/ZXWQ PMOVSX/ZXWQ PMOVSX/ZXDQ PMOVSX/ZXDQ PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB PBLENDW PBLENDW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB PEXTRB PEXTRW PEXTRW PEXTRD PEXTRD PEXTRQ PEXTRQ PINSRB PINSRB PINSRW PINSRW PINSRD PINSRD PINSRQ PINSRQ Arithmetic instructions
PADD/SUB(U,S)B/W/D/Q PADD/SUB(U,S)B/W/D/Q
x,x x,m32 x,x x,m16 x,x x,m64 x,x x,m32 x,x x,m64 (x)mm,(x)mm (x)mm,m mm,mm,i mm,m64,i xmm,x,i x,m128,i x,x,i x, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 x,m,xmm0 x,x,i x,m,i mm,mm x,x r32,(x)mm r32,x,i m8,x,i r32,(x)mm,i m16,(x)mm,i r32,x,i m32,x,i r64,x,i m64,x,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i x,r64,i x,m64,i
1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1
x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 1 x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x x 1 1 x x
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 4 1 x 2 2 1 1 1 1 1 2 1 2 1 2 1 1 2 1 2 1 2 1 2
x x
x x x x x x x x x x x x x x x x
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 1 6 1 1 1 1 2 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5
SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3
SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1
SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1, 64b SSE4.1 SSE4.1
SSE4.1 SSE4.1 SSE4.1, 64 b
PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GTB/W/D
(x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm
1 1 3 4 1
1 1 3 3 1
x x x x x
x x x x x
1 1 2 1 1
0.5 0.5 1.5 1.5 0.5
SSSE3 SSSE3
Sandy Bridge PCMPEQ/GTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PSUBxx, PCMPGTx PCMPEQx PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/MAXSB PMIN/MAXSB PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW PMIN/MAXUW PMIN/MAXU/SD PMIN/MAXU/SD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW (x)mm,m x,x x,m128 x,x x,m128 x,same x,same (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x,i x,m,i 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 x x x 1 1 x x x 1 1 1 5 1 0 0 5 1 5 1 5 1 5 1 5 1 5 1 5 1 x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 5 1 x x x x 1 1 1 1 5 1 6 1 0.5 0.5 0.5 1 1 0.25 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 0.5 0.5 1 1 1 1 SSE4.1 SSE4.1 SSE4.2 SSE4.2
1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x 1 1 x x x x 1 1
SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1
SSSE3 SSSE3
SSE4.1 SSE4.1
SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3
PCLMULQDQ AESDEC, AESDECLAST, AESENC, AESENCLAST
x,x,i
18
18
14
SSE4.1 SSE4.1 only in some processors
x,x
2
do.
Sandy Bridge AESIMC AESKEYGENASSIST Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PXOR PTEST PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM Other EMMS x,x x,x,i 2 11 2 11 2 8 2 8 do. do.
(x)mm,(x)mm (x)mm,m x,same x,x x,m128 mm,mm/i mm,m64 xmm,i x,x x,m128 x,i
1 1 1 1 1 1 1 1 2 3 1
1 1 0 1 1 1 1 1 2 2 1
x x
x x
x x
1 1 0 1 1
1 1 1
1 1 1 2 1 1
0.33 0.5 0.25 1 1 1 2 1 1 1 1
SSE4.1 SSE4.1
x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i
8 8 8 8 3 4 3 4
8 7 8 7 3 3 3 3
4 1 11-12 1 3 1 11 1
4 4 4 4 3 3 3 3
SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2
31
31
18
Floating point XMM and YMM instructions

Instruction  Operands  µops fused domain  µops unfused domain (p015 p0 p1 p5 p23 p4)  Latency  Reciprocal throughput  Comments
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 3 4 3 3 1 3 3 3 3 1 1 1 0.5 1 1 1 1 0.5 1 1 1 1
Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS
x,x y,y x,m128 y,m256 m128,x m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,x
AVX
AVX
1 1+
AVX
1 1 1
1 1 1
Sandy Bridge MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D VSHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD BLENDPS/PD VBLENDPS/PD VBLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D VUNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D r32,x r32,y m128,x m256,y x,x,i x,m128,i y,y,y,i y, y,m256,i x,x,x/i y,y,y/i x,x,m y,y,m x,m,i y,m,i y,y,y,i y,y,m,i x,x,i x,m128,i y,y,i y,m256,i x,x,xmm0 x,m,xmm0 y,y,y,y y,y,m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x x,m128 y,y,y y,y,m256 r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 1 1 1 1 1 2 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 3 2 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 3 1 2 1 2 1 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1+ 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 Page 129 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 1 1+ 2 1 1 1 2 1 1 1+ 1 2 1 1 1 1 1+ 1 1 1 1+ 1 1+ 2 1+ 1 1 1 1+ 2 1 2 1+ 1 1 3 1 3 1 4 2 2 ~300 ~300 1 1 1 1 25 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 1 1 1 1 1 0.5 1 1 1 1 1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
AVX
1 1 1 1
1 1 1 1
1 3 1 4 1
AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX AVX AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX
Sandy Bridge VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic Page 130 m128,x,x m256,y,y 4 4 2 2 1 1 1 1+ 1 2 AVX AVX
x,x x,m128 x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,m32 r32,x r32,m32 x,r32 x,m32 r32,x r32,m64
2 2 2 2 2 2 2 2 2 3 2 2 1 1 1 1 1 1 1 1 2 2 2 3 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2
2 2 2 2 2 2 2 2 2 3 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1+ 1 1 1 1 1 1
3 4 3 3 1 4 1 3 1 3 1 3 1+ 3 1 3 1+
1 1 1 1 1 1 1 1
4 1 5 1 4 1 5 1+ 4 1 4 1
1 1
4 1 4 1
1 1 1 1 1 1 1 1
4 1 4 1 4 1 4 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1.5 1.5 1 1 1.5 1.5 1 1
AVX AVX
AVX AVX
AVX AVX
AVX AVX
AVX AVX
AVX AVX
Sandy Bridge ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D ADDSUBPS/D VADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULPS MULSS MULPS VMULPS VMULPS MULSD MULPD MULSD MULPD VMULPD VMULPD DIVSS DIVPS DIVSS DIVPS VDIVPS VDIVPS DIVSD DIVPD DIVSD DIVPD VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D VCMPccPS/D VCMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D VMAXPS/D VMINPS/D VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD x,m128 y,y,y y,y,m256 x,x x,m32/64 x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x,i 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 Page 131 1 1 1 1 1 1 1 1 1 1 1 1 3 1+ 2 1 3 1 3 1 3 1+ 3 1 1 1 1 1 1 1 1 1 1 1 1 AVX AVX x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x x,m128 y,y,y y,y,m256 x,x x,m128 y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m128 y,y y,m256 x,x 1 1 1 1 1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 3 4 1 1 3 4 1 1 2 4 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 3 1 3 1 3 1+ 3 1 3 1+ 2 2 2 2 1+ 5 1 5 1+ 5 1 5 1+ 10-14 1 1 1 21-29 1+ 10-22 1 1 1+ 5 1 7 1+ 1 3 21-45 5 1 5 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 10-14 10-14 20-28 20-28 10-22 10-22 20-44 20-44 1 1 2 2 1
AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX
AVX AVX
AVX AVX
AVX AVX
AVX AVX
AVX AVX
AVX AVX SSE4.1
Sandy Bridge ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD DPPS DPPS VDPPS VDPPS DPPD DPPD Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD/PD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS VRSQRTPS VRSQRTPS Logic
AND/ANDN/OR/XORPS/PD AND/ANDN/OR/XORPS/PD
x,m128,i y,y,i y,m256,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i
2 1 2 4 6 4 6 3 4
1 1 1 4 5 4 5 3 3
1 1 1 2
1 3 1+ 1 1 12 1+ 9 1 12
1 1 1 2 4 2 4 2 2
SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1
x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256
1 1 3 4 1 2 3 4 1 1 3 4
1 1 3 3 1 1 3 3 1 1 3 3
1 1
10-14 1 1+
1 1
10-21 1 21-43 1+
1 1
5 1 7 1+
10-14 10-14 21-28 21-28 10-21 10-21 21-43 21-43 1 1 2 2
AVX AVX
AVX AVX
AVX AVX
x,x x,m128 y,y,y y,y,m256 x/y,x/y,same
1 1 1 1 1
1 1 1 1 0
1 1 1 1
1 1 1 1+ 0
1 1 1 1 0.25 AVX AVX
VAND/ANDN/OR/XORPS /PD VAND/ANDN/OR/XORPS /PD (V)XORPS/PD Other VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR VSTMXCSR FXSAVE FXRSTOR XSAVEOPT
4 12 20 3 3 3 3 3 3 130 116 100-161
1 11 9 3 1 1 68 72
AVX AVX, 32 bit AVX, 64 bit
m32 m32 m32 m4096 m4096 m
1 1
1 1 1
1 1
AVX
60-500
Pentium 4
Intel Pentium 4
This list is measured for a Pentium 4, model 2. Timings for model 3 may be more like the values for P4E, listed on the next sheet.

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
µops: Number of µops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional µops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under "How the values were measured".
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each µop goes to an execution unit. Two independent µops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one µop uses more than one execution unit, only the first and the last execution unit is listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
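The way the latency, additional latency and reciprocal throughput columns combine can be illustrated with a short calculation. The following Python sketch is an assumption-laden illustration, not a method given in the tables: the `Timing` class, the function names and the sample numbers are hypothetical. It adds the additional latency only when the next dependent instruction runs in a different execution unit, and treats ALU0 and ALU1 as the same unit, as the legend describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Timing:
    latency: float             # "Latency" column, in clock cycles
    additional_latency: float  # "Additional latency" column
    recip_throughput: float    # "Reciprocal throughput" column
    unit: str                  # "Execution unit" column


def same_unit(a: str, b: str) -> bool:
    # The legend states that there is no additional latency between
    # ALU0 and ALU1, so both are treated as one unit here.
    alu = {"alu0", "alu1", "alu0/1"}
    return a == b or (a in alu and b in alu)


def chain_latency(chain: List[Timing]) -> float:
    """Minimum total latency of a chain of dependent instructions."""
    total = 0.0
    for i, t in enumerate(chain):
        total += t.latency
        # Additional latency applies only when the next dependent
        # instruction is in a different execution unit.
        if i + 1 < len(chain) and not same_unit(t.unit, chain[i + 1].unit):
            total += t.additional_latency
    return total


def instructions_per_clock(recip_throughput: float) -> float:
    """Upper bound on independent instructions started per clock cycle."""
    return 1.0 / recip_throughput


# Illustrative values only (not taken from a specific table row).
alu_op = Timing(0.5, 1.0, 0.25, "alu0/1")
fp_op = Timing(5.0, 1.0, 1.0, "fp")
print(chain_latency([alu_op, alu_op]))
print(chain_latency([alu_op, fp_op]))
print(instructions_per_clock(alu_op.recip_throughput))
```

Where the additional latency column gives a range, the caller has to decide which end of the range to use; the sketch takes a single number to keep the arithmetic visible.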
Instruction  Operands  µops  Microcode  Latency  Additional latency  Reciprocal throughput  Port  Execution unit  Subunit  Instruction set  Notes
Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVSX MOVSX CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POPF(D) POPA(D) LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ... BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2
r,r r,i r32,m r8/16,m m,r m,i r,sr sr,r/m m,r32 r,r r,m r,r r,m r,r/m r,r r,m r i m sr
r m sr
r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i]
r,m r r,r/i m m
1 1 1 2 1 3 4 4 2 1 1 1 2 3 3 4 4 2 2 3 4 4 4 2 4 4 4 4 1 2 3 2 3 1 1 3 4 3 8 4 4
0 0.5 0.5-1 0.25 0/1 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 3 0 1 2 0 1 2 0 0 2 0,3 2 6 4 12 0 14 0 33 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 0.5 0.5-1 0.5 0 0 3 0.5-1 1 2,0 0 6 0 3 0 1.5 0.5-1 1 0/1 8 >100 0 3 0 1 2 0 1 2 0 2 4 7 4 10 10 19 0 1 0 1 8 14 5 13 8 52 16 14 0 0.5 0.5-1 0.25 0/1 0 1 0.5-1 0.5 0/1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0 4 1 0 0.5 0.5-1 0.5 0/1 0 5 0 1 1 7 15 0 7 0 2 64 >1000 2 6 2 6 Page 134
alu0/1 alu0/1 load load store store
alu0/1 load alu0
alu0/1
alu0/1 alu0/1 int,alu int,alu int,alu int alu0/1 int int,alu 86
86 86 86 86 86 86 86 86 sse2 386 386 386 386 ppro 86 86 86 86 186 86 86 86 186 86 86 86 86 186 86 86 386 386 386 86 86 86 86 486 sse sse
b, c
a, q c c a, e
Pentium 4 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWD, CDQ CWDE Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT 4 4 4 2 2 2 40 38 100 sse sse2 sse2
r,r r,m m,r r,r r,i r,m m,r r,r r,m r m r m
r8/32 r16 m8/32 m16 r32,r r32,(r),i r16,r r16,r,i r16,m16 r32,m32 r,m,i r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32
1 2 3 4 3 4 4 1 2 2 4 1 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 1
0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 8 4 4 6 0 6 0 6 0 6 6 8 0 8 7 9 8 0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 0.5 0.5-1 0.5 0 4 4 0 0.5 0.5-1 0.5 0 3 27 90 57 100 10 22 22 56 6 16 0 8 7 17 0 8 7-8 16 0 8 10 16 0 8 0 14 0 4.5 0 14 0 4.5 5 16 0 9 5 15 0 8 7 15 0 10 0 14 0 8 7 14 0 10 20 61 0 24 18 53 0 23 21 50 0 23 24 61 0 24 22 53 0 23 20 50 0 23 0 1 0.5-1 1 0 1 0.5-1 0.5 0 0.5 0.5-1 0.5
0/1
alu0/1
1 1 1 0/1 0/1 0
int,alu int,alu int,alu alu0/1 alu0/1 alu0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0/1 0
int int int int int int int int int int int int int int int int int int int alu0 alu0/1 alu0
fpmul fpdiv fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpdiv fpdiv fpdiv fpdiv fpdiv fpdiv
86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 386 386 386 186 386 386 186 86 86 386 86 86 386 86 86 386
c c c
c c
a a a a a
r,r r,m m,r r,r r,m r
1 2 3 1 2 1
0 0 0 0 0 0
0.5 1 8 0.5 1 0.5
0.5-1 0.5 0.5-1 1 4 0.5-1 0.5 0.5-1 1 0.5-1 0.5
alu0
0 0
alu0 alu0
86 86 86 86 86 86
c c c c c
Pentium 4 NOT SHL, SHR, SAR SHL, SHR, SAR ROL, ROR ROL, ROR RCL, RCR RCL, RCR RCL, RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL, RCR SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR BSF, BSR SETcc SETcc CLC, STC CMC CLD STD CLI STI m r,i r,CL r,i r,CL r,1 r,i r,CL m,i/CL m,1 m,i/CL r,r,i/CL m,r,i/CL r,i r,r m,i m,r r,i r,r m,i m,r r,r r,m r m 4 1 2 1 2 1 4 4 4 4 4 4 4 3 2 4 4 3 2 4 4 2 3 3 4 3 3 4 4 4 4 0 0 0 0 0 0 15 15 4 6 4 6 4 16 16 1 0 1 0 1 0 0 4 1 1 1 1 1 15 14 10 10 14 14 14 2 1 2 12 2 4 8 14 2 3 1 3 2 2 52 48 35 43 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh 86 186 86 186 86 86 186 86 86 86 86 386 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86 86 86 86
d d d d d d d d d
7-8 10 0 7 10 0 18 18-28 14 14 0 18 14 0 0 4 0 0 4 0 0 4 0 12 12 0 0 6 0 0 6 0 7 18 0 15 14 0 0 4 0 0 4 0 0 5 0 0 5 0 0 10 0 0 10 0 7 52 0 5 48 0 5 35 12 43
d d d d
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF
1 4 3 3 4 1 4 4 3 4 4 4 4 4 4 4
0 28 0 0 31 0 4 4 0 34 4 4 38 0 0 33
0 118 4 4 11 0 0 0 2 8 9 2 2 11
1 118 4 4 11 2-4 2-4 2-4 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0
branch branch branch branch branch branch branch branch branch branch branch
86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86
Pentium 4 RETF IRET ENTER ENTER LEAVE BOUND INTO INT String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CPUID RDTSC Notes: a) b) i i,0 i,n m i 4 33 11 4 48 24 4 12 26 4 45+24n 4 0 3 4 14 14 4 5 18 4 84 644 0 0 26 128+16n 3 14 18 186 186 186 86 86 86 86 186
4 4 4 4 4 4 4 4 4 4
3 6 5n 4n+36 2 6 2n+3 3n+10 4 6 163+1.1n 3 40+6n 4n 5 50+8n 4n
6 86 6 86 4 86 6 86 8 86
86 86 86 86 86
1 0 0 1 0 0 4 2 4 39-81 4 7
0.25 0/1 0.25 0/1 200-500 80
alu0/1 alu0/1 p5
86 ppro sse2 p5
a) Add 1 µop if source is a memory operand.
b) Uses an extra µop (port 3) if SIB byte used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer.
c) Add 1 µop if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
q) Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode.
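The SIB-byte condition in the notes above is a simple predicate on the parts of a memory operand. The sketch below, in Python, encodes that condition as stated and counts the extra micro-operations from the memory-source note and the SIB note. The function names and the operand model (base register, index register, scale factor) are hypothetical, and the count only applies to instructions whose table rows carry both of those notes.

```python
from typing import Optional


def needs_sib(base: Optional[str] = None,
              index: Optional[str] = None,
              scale: int = 1) -> bool:
    """True if the memory operand needs a SIB byte according to the note."""
    two_registers = base is not None and index is not None
    scaled_index = index is not None and scale != 1
    esp_base = base is not None and base.upper() == "ESP"
    return two_registers or scaled_index or esp_base


def extra_uops(source_is_memory: bool, sib: bool) -> int:
    """Extra micro-operations: one for a memory source, one for a SIB byte."""
    return int(source_is_memory) + int(sib)


# [EAX]        : single pointer register, no SIB byte
print(needs_sib(base="EAX"))
# [ESP]        : ESP as base pointer requires a SIB byte
print(needs_sib(base="ESP"))
# [EAX+EBX]    : more than one pointer register
print(needs_sib(base="EAX", index="EBX"))
# [ECX*4]      : scaled index
print(needs_sib(index="ECX", scale=4))
```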

Instruction  Operands  µops  Microcode  Latency  Additional latency  Reciprocal throughput  Port  Execution unit  Subunit  Instruction set  Notes
Move instructions
FLD          r         1     0          6        0                   1                      0     mov                      87
FLD          m32/64    1     0          7        0                   1                      2     load                     87
Pentium 4 FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST FIST FISTP FLDZ FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m80 m80 r m32/64 m80 m80 r m16 m32/64 m16 m32/64 m 3 3 1 2 3 3 1 3 2 3 2 3 1 2 4 3 1 4 6 4 4 4 4 75 0 0 8 311 0 3 0 0 0 0 0 0 0 0 0 0 0 4 4 7 6 2 90 2 1 0 2-3 0 8 0 400 0 1 0 6 2 1 2 2-4 0 2-3 0 2-4 0 2 0 2 0 4 1 4 0 1 0 3 1 3 1 6 0 6 0 (8) 0,2 load load mov store store store mov load load store store store mov mov fp mov mov 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 287 287 87 87 87
6 7
0 10 10 10 10 10
st0,r r AX AX m16 m16 m16
2-4 0 11 11
1 0 0 0
(3)
r m m16 m32 r m m16 m32 r m m16 m32
r m r m16 m32
1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 4 3 1 1 3 6 6
0 0 4 0 0 0 4 0 0 0 4 0 0 0 0 0 0 0 4 0 0 0 15 84 84
5 5 6 5 7 7 7 7 43 43 43 43 2 2 2 2 2 10 2 2 2 23 212 212
1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
1 1 6 2 2 2 6 2 43 43 43 43 1 1 1 1 1 3 6 2 1 1 15
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1
fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp
add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc
87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387
g, h g, h g, h g, h
Pentium 4 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR Notes: e) f) 1 2 6 6 7 6 3 3 3 3 3 11 0 43 0 150 180 175 207 178 216 160 230 92 187 24 57 15 20 45 165 60 200 134 242 0 43 3
170 207 211 200 153
66 20 63 90
220
1 1 1 1 1 1 1 1 1 1 1 1
fp fp fp fp fp fp fp fp fp fp fp fp
div
87 87 387 387 387 87 87 87 87 87 87 87
g, h
1 2 4 6 4 4 4 4
0 0 4 29 174 96 69 94
1 0
0 0
456 528 132 208
1 0 1 0 96 1 172 420 0,1 532 96 208
mov mov
87 87 87 87 87 87 sse sse
i i
e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are 143.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes 6 ops more and 40-80 clocks more when XMM registers are disabled.
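The FLDCW rule above depends on the history of control word values, which is easier to follow with a small model. The Python sketch below is a literal reading of the note, not a description of the hardware: the function name `fldcw_latencies` is hypothetical, the cycle counts 3 and 143 are the Pentium 4 figures from the note, and the first FLDCW in a sequence is assumed to be slow because the value before its predecessor is unknown.

```python
def fldcw_latencies(initial_cw, loaded_values, fast=3, slow=143):
    """Model the FLDCW note: a load is fast only if the new value equals
    the control word value that was in effect before the preceding FLDCW."""
    before_preceding = None   # unknown until one FLDCW has executed
    current = initial_cw
    latencies = []
    for value in loaded_values:
        latencies.append(fast if value == before_preceding else slow)
        # The value in effect before this FLDCW becomes the reference
        # for the next one.
        before_preceding, current = current, value
    return latencies
```

Alternating between two values, for example `fldcw_latencies(0x037F, [0x027F, 0x037F, 0x027F, 0x037F])`, gives one slow load followed by fast ones, whereas rotating through three different values never hits the fast case in this model.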

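Move instructions
Instruction  Operands  ops  Microcode  Latency  Additional latency  Reciprocal throughput  Port  Execution unit  Subunit  Instruction set
MOVD         r32, mm   2    0          5        1                   1                      0     fp                       mmx
MOVD         mm, r32   2    0          2        0                   2                      1     mmx             alu      mmx
MOVD         mm,m32    1    0          8        0                   1                      2     load                     mmx
MOVD         r32, xmm  2    0          10       1                   2                      0     fp                       sse2
MOVD         xmm, r32  2    0          6        1                   2                      1     mmx             shift    sse2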
Pentium 4 MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W PMIN/MAXUB xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 1 2 1 1 1 2 1 1 2 4 4 3 2 3 2 1 1 1 1 1 1 1 1 4 4 2 3 3 2 2 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 4 6 0 0 0 0 0 8 8 6 2 8 8 6 8 8 0 0 1 1 2 1 2 1 2 1 1 2 2 2 2 2 75 18 1 2 1 2 2 2 2 1 7 10 3 2 2 2 2 2 load 0,1 0 mov 1 mmx shift 2 load 0 mov 0 mov 2 load 0 mov 2 load 0 mov 0,1 mov-mmx sse2 0,1 mov-mmx sse2 0 mov 0 mov 1 1 1 1 1 1 1 1 0 0 0,1 1 1 1 1 mmx mmx mmx mmx mmx mmx mmx mmx mov mov
mmx-alu0
sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2
k k
8 8
1 1
sse sse2 mmx mmx mmx sse2 sse2 sse2 sse2 mmx sse sse2 a a a a a
2 4 2 4 2 4 2 2
1 1 1 1 1 1 1 1
shift shift shift shift shift shift shift shift
7 8 9 3 4
1 1 1 1 1
mmx-int mmx-int int-mmx int-mmx
sse sse sse2 sse sse2
r,r/m r,r/m mm,r/m xmm,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0
2 2 2 4 2 6 6 6 6 2 2
1 1 1 1 1 1 1 1 1 1 1
1,2 1,2 1 2 1,2 1,2 1,2 1,2 1,2 1,2 1,2
1 1 1 1 1 1 1 1 1 1 1
mmx mmx mmx fp mmx fp fp fp fp mmx mmx
alu alu alu add alu mul mul mul mul alu alu
mmx mmx sse2 sse2 mmx mmx sse mmx sse2 sse sse
a,j a,j a a a,j a,j a,j a,j a,j a,j a,j
Pentium 4 PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ Other EMMS Notes: a) j) k) r,r/m r,r/m 1 1 0 0 2 4 1 1 1,2 1,2 1 1 mmx mmx alu alu sse sse a,j a,j
r,r/m r,r/m r,i/r/m xmm,i
1 1 1 1
0 0 0 0
2 2 2 4
1 1 1 1
1,2 1,2 1,2 2
1 1 1 1
mmx mmx mmx mmx
alu alu shift shift
mmx mmx mmx sse2
a,j a,j a,j a
11
12
12
mmx
a) Add 1 op if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves.

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D
r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r r,m m,r m,r r32,r r,r/m,i r,r/m r,r/m
1 1 2 1 4 4 1 1 1 2 1 1 3 2 2 2 1 1 1
0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0
6 7 7 6
0 0 0
2 2 7 4 2
0 1 0 0 0
1 1 2 1 2 8 2 2 1 2 2 2 4 2 4 3 2 2 2
0 2 0 0 2 0 1 1 2 0 1 1 2 0 0 1 1 1 1
mov
mov
mmx mmx
shift shift
mmx mmx
shift shift
sse sse sse sse sse sse sse sse sse sse sse sse sse sse sse/2 sse sse sse sse
k k
6 4 4 2
1 1 1 1
fp mmx mmx mmx
shift shift shift
Pentium 4 Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/DMINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR Notes: a) r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 2 4 4 1 3 1 2 4 4 3 3 3 4 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 10 14 10 4 9 4 9 10 11 7 11 10 15 8 8 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 4 2 6 6 2 4 2 2 4 5 2 3 3 6 2.5 2.5 1 1 1 1 1 1 1 1 1 1 0,1 0,1 1 1 1 1 mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx fp-mmx fp-mmx fp-mmx fp-mmx fp-mmx fp fp shift sse2 shift shift sse2 sse2 sse2 sse sse2 sse sse2 sse2 a sse2 sse2 sse2 a sse2 a sse a a a a a sse2 sse a a a a a a
a a
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 1 1 1 1 1 1 2
0 0 0 0 0 0 0 0
4 4 6 23 39 38 69 4
1 1 1 0 0 0 0 1
2 2 2 23 39 38 69 4
1 1 1 1 1 1 1 1
fp fp fp fp fp fp fp mmx
add add mul div div div div
sse sse sse sse sse sse2 sse2 sse
a a a a,h a,h a,h a,h a
r,r/m r,r/m r,r/m
1 1 2
0 0 0
4 4 6
1 1 1
2 2 3
1 1 1
fp fp fp
add add add
sse sse sse
a a a
r,r/m
mmx
alu
sse
1 1 1 1 2 2
0 0 0 0 0 0
23 39 38 69 4 4
0 0 0 0 1 1
23 39 38 69 3 4
1 1 1 1 1 1
fp fp fp fp mmx mmx
div div div div
sse sse sse2 sse2 sse sse
a,h a,h a,h a,h a a
m m
4 4
8 4
98
100 6
1 1
sse sse
a) Add 1 op if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves.
Prescott
Intel Pentium 4 w. EM64T (Prescott)

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address.
ops: Number of micro-operations issued from the instruction decoder and stored in the trace cache.
Microcode: Number of additional ops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under How the values were measured.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each op goes to an execution unit. Two independent ops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one op uses more than one execution unit, only the first and the last execution unit is listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
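The difference between latency and reciprocal throughput is easiest to see with a little arithmetic. The following Python sketch is a simplified model under stated assumptions, not a simulator: the function names are hypothetical, the results are lower bounds in the same sense as the table values, and the additional latency is charged on every link of the chain, which applies only when each dependent instruction runs in a different execution unit from its predecessor.

```python
def chain_cycles(n, latency, additional_latency=0):
    """Minimum clock cycles for n instructions where each one depends on
    the result of the previous one (a dependency chain).

    additional_latency is added per instruction and should be nonzero
    only if each dependent instruction is in a different execution unit.
    """
    return n * (latency + additional_latency)


def independent_cycles(n, reciprocal_throughput):
    """Minimum clock cycles for n independent instructions of the same
    kind executing in the same subunit."""
    return n * reciprocal_throughput
```

With a latency of 1 and a reciprocal throughput of 0.25, one hundred chained instructions need at least 100 clock cycles, while one hundred independent ones need at least 25, which is the "4 instructions per clock cycle" case mentioned in the definition.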
Move instructions MOV MOV MOV
r,r r8/16/32,i r64,i32
1 1 1
0 0 0
1 1
0 0 0
0.25 0/1 0.25 0/1 0.5 0/1
alu0/1 alu0/1 alu0/1
86 86 x64
Prescott MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVZX MOVSX MOVSX MOVSX MOVSXD CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ... LODS REP LODS STOS REP STOS MOVS REP MOVSB REP MOVSW r64,i64 r8/16,m r32/64,m m,r m,i m64,i32 r,sr sr,r/m r,mabs mabs,r m,r32 r,r r16,r8 r,m r16,r8 r32/64,r8/16 r,m r64,r32 r,r/m r,r r,m r i m sr 2 0 0 2 0 3 0 1 0 2 0 1 0 2 0 2 0 1 2 1 8 3 0 3 0 2 0 1 0 1 0 2 0 2 0 1 0 2 0 2 0 2 0 1 0 1 0 2 0 3 0 1 0 1 0 3 0 9.5 0 3 0 2 0 2 6 100 4 0 6 2 0 2 2 0 2 3 0 2 1 3 1 3 1 9 2 0 1 0 2 6 1 8 1 8 2 16 1 0 1 0 2.5 0 2 0 3.5 0 3 0 3.5 0 2 0 3.5 0 3 0 3.5 0 1 0 4 0 1 0 5 0 2 0 0 2 10 1 3 8 1 5n 4n+50 1 2 8 1 2.5n 3n 1 4 8 9 .3n .3n 1 .5-1.1n .6-1.4n Page 145 1 1 1 2 2 2 8 27 1 2 2 0.25 1 1 1 0.5 1 0.5 3 1 1 2 2 0 0,3 0,3 alu1 load load store store store x64 86 86 86 b,c 86 x64 86 86 a,q x64 l x64 l sse2 386 c 386 c 386 386 a,c,o 386 a,c,o 386 x64 a PPro a,e 86 86 86 86 186 86 86 86 186 m 86 86 86 86 186 m 86 p 86 86 386 386 386 86 n 86 d,n 86 m 86 m 86 86 86 8 86 86 86
0/1 0/1 2 0 0 2 0 0/1
alu0/1 alu0/1 load alu0 alu0 load alu0 alu0/1
r m sr
r,[m] r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i]
2 2 2 9 9 16 1 10 30 70 15 0.25 0.25 0.5 1 1 1
r,m
1 28 8 8
0/1 0/1 0/1 1 0,1 1 1 0/1 1
alu0/1 alu0/1 alu0/1 alu alu0,1 alu int alu0/1 int
86
Prescott REP MOVSD REP MOVSQ BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV 1 1 1 1 1 1 1 1 1
1.1n 1.4 n 1.1n 1.4 n
r r,r/i m m
0 52 0 0 2 2 4
1 >1000 1 1 50 50 124
86 x64 alu 86
486 sse sse sse sse2 sse2
r,r r,m m,r r,r/i r,m m,r m,i r,r r,m r m r m
r8 r16 r32 r64 m8 m16 m32 m64 r16,r16 r16,r16,i r32,r32 r32,(r32),i r64,r64 r64,(r64),i r16,m16 r32,m32 r64,m64 r,m,i r8/m8 r16/m16 r32/m32 r64/m64
1 2 3 3 2 2 3 1 2 2 4 1 3 1 1 2 2 1 4 3 1 2 2 3 2 1 2 1 1 1 1 2 2 2 3 1 1 1 1
0 0 0 0 5 6 5 0 0 0 0 0 0 10 16 5 17 0 0 0 5 0 5 0 6 0 0 0 0 0 0 0 0 0 0 20 19 21 31
1 1 5 10 10 20 22 1 1 1 5 1 5 26 29 13 71 10 11 11 11 10 11 11 11 10 11 10 10 10 10 10 10 10 10 74 73 76 63
0 0 0 0
0 0 0 0
0.25 0/1 1 2 10 1 10 1 10 10 0.25 0/1 1 0.5 0/1 3 0.5 0 3
alu0/1
int,alu int,alu
alu0/1 alu0/1 alu0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5

1-2.5
34 34 34 52
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
int int int int int int int int int int int int int int int int int int int int int int int int
mul fpdiv mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul fpdiv fpdiv fpdiv fpdiv
86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 x64 86 86 86 x64 386 186 386 386 x64 x64 386 386 x64 186 86 86 386 x64
c c c
c c
m m m m
a a a a
Prescott IDIV IDIV IDIV IDIV CBW CWD CDQ CQO CWDE CDQE SCAS REP SCAS CMPS REP CMPS Logic AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL SHR, SAR SHR, SAR SHL SHR, SAR SHR, SAR ROL, ROR ROL, ROR ROL, ROR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL, SHR, SAR ROL. ROR SHL, SHR, SAR ROL. ROR RCL, RCR RCL, RCR RCL, RCR SHLD, SHRD SHLD SHRD SHLD, SHRD SHLD r8/m8 r16/m16 r32/m32 r64/m64 1 21 76 0 1 19 79 0 1 19 79 0 1 58 96 0 2 0 2 0 2 0 2 0 1 0 1 0 1 0 7 0 2 0 2 0 1 0 1 0 1 3 0 1 54+6n 4n 1 5 1 81+8n 5n 34 34 34 91 1 1 1 1 1 1 8 10 86 1 1 1 1 0 0/1 0/1 0/1 0/1 0/1 int int int int alu0 alu0/1 alu0/1 alu0/1 alu0/1 alu0/1 fpdiv fpdiv fpdiv fpdiv 86 86 386 x64 86 86 386 x64 386 x64 86 86 a a a a
86
r,r r,m m,r r,r r,m r m r,i r8/16/32,i r64,i r,CL r8/16/32,CL r64,CL r8/16/32,i r64,i r8/16/32,CL r64,CL r,1 r,i r,i r,CL r,CL m8/16/32,i m8/16/32,i m8/16/32,cl m8/16/32,cl m8/16/32,1 m8/16/32,i m8/16/32,cl r8/16/32,r,i r64,r64,i r64,r64,i r8/16/32,r,cl r64,r64,cl
1 2 3 1 2 1 3 1 1 1 2 2 2 1 1 2 2 1 2 2 1 1 3 3 2 2 2 3 2 3 4 3 4 4
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 11 11 11 6 6 6 6 5 13 13 0 5 7 0 5
1 1 5 1 1 1 5 1 1 7 2 2 8 1 7 2 8 7 31 25 31 25 10 10 10 10 27 38 37 8 10 10 9 14
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.5 1 2 0.5 1 0.5 2 0.5 0.5 2 2 2 1 7 2 8 7 31 25 31 25
alu0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
alu0 alu0 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1
27 38 37 7
86 86 86 86 86 86 86 186 186 x64 86 86 x64 186 x64 86 x64 86 186 186 86 86 86 86 86 86 86 86 86 386 x64 x64 386 x64
c c c c c
d d d d d d d d d d d d d d
Prescott SHRD SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR SETcc SETcc CLC, STC CMC CLD, STD r64,r64,cl m,r,i m,r,CL r,i r,r m,i m,r r,i r,r m,i m,r r,r/m r m 3 3 2 1 2 3 2 1 2 3 2 2 2 3 2 3 1 8 8 8 0 0 0 7 0 0 6 10 0 0 0 0 0 8 12 20 20 8 9 8 10 8 9 28 14 16 9 9 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 8 9 8 10 8 9 10 14 4 1 2 8 53 1 1 1 1 1 1 1 1 1 1 1 1 1 1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 int int x64 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86
d d d d
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET BOUND m INT i INTO Other NOP (90) Long NOP (0F 1F) PAUSE LEAVE CLI STI CPUID RDTSC
1 2 3 3 2 1 4 4 3 3 4 4 2 4 4 1 2 1 2 2 1
0 25 0 0 28 0 0 0 0 29 0 0 32 0 0 30 30 49 11 67 4
1 154 15 10 157 2-4 4 4 7 160 7 9 160 7 7 160 160 325 12 470 26
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0 alu0
branch branch branch branch branch branch branch branch branch branch branch
86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 186 86 86
m m
1 0 1 0 1 2 4 0 1 5 1 11 1 49-90 1 12
0 0 5
0.25 0/1 0.25 0/1 50 5 52 64 300-500 100
alu0/1 alu0/1
86 ppro sse2 186 86 86 p5
p5
RDPMC (bit 31 = 1) RDPMC (bit 31 = 0) MONITOR MWAIT
1 4 37 154 100 240 p5 p5 (sse3) (sse3)
Notes:
a) Add 1 op if source is a memory operand.
b) Uses an extra op (port 3) if a SIB byte is used.
c) Add 1 op if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
l) Move accumulator to/from memory with 64 bit absolute address (opcode A0-A3).
m) Not available in 64 bit mode.
n) Not available in 64 bit mode on some processors.
o) MOVSX uses an extra op if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance.
p) LEA with a direct memory operand has 1 op and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A sign-extended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 ops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead.
q) These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode op and a reciprocal throughput of 17.

Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST(P) FISTTP FLDZ
r m32/64 m80 m80 r m32/64 m80 m80 r m16 m32/64 m m
1 1 3 3 1 2 3 3 1 3 2 3 3 1
0 0 3 74 0 0 6 311 0 2 0 0 0 0
0 0
7 7
1 1 8 90 1 2 10 400 1 8 2 2.5 2.5 2
0 2 2 2 0 0 0 0 0 2 2 0 0 0
mov load load load mov store store store mov load load store store mov
87 87 87 87 87 87 87 87 87 87 87 87 sse3 87
Prescott FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. FSIN, FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 st0,r r AX AX m16 m16 m16 2 4 3 1 4 6 2 4 3 0 0 0 0 0 0 3 0 6 5 0 1 0 0 0 2 4 3 1 3 3 8 3 10 0 1 0 0 1 1 0 0 0,2 mov fp mov mov 87 PPro 87 87 287 287 87 87 87 e
r m m16 m32 r m m16 m32 r m m16 m32
r m r m16 m32
1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 3 3 1 1 3 8 9
0 0 3 0 0 0 3 0 0 0 3 3 0 0 0 0 0 0 3 0 0 0 14 86 92
6 6 7 6 8 8 8 8 45 45 45 45 3 3 3 3 3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
28 220 220
1 1 1
1 1 6 2 2 2 8 3 45 45 45 45 1 1 1 1 1 3 8 2 1 1 16
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1
fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp
add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc
87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387
g,h g,h g,h g,h
1 2 3 5 8 4 3 4 3 3 3
0 0 100 150 170 97 25 16 190 63 58
45 200 200 270 250 96 27 270 170 170
45 2
200 200 270 250
1 1 1 1 1 1 1 1 1 1 1
fp fp fp fp fp fp fp fp fp fp fp
div
87 87 387 387 87 87 87 87 87 87 87
g,h
Prescott Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR Notes: e) f) 1 2 1 1 2 2 2 2 0 0 4 30 181 96 121 118 1 0 0 0 1 1 120 200 0 0 1 0,1 160 244 mov mov 87 87 87 87 87 87 sse sse
500 570
i i
e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are > 100.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes fewer microcode ops when XMM registers are disabled, but the throughput is the same.
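The dependence on the precision setting can be illustrated by decoding the precision-control field of the x87 control word. The Python sketch below is an illustration under stated assumptions: the names `PRESCOTT_PRECISION_CYCLES` and `cycles_for_control_word` are hypothetical, the cycle counts are the Prescott figures from the note on precision above, and the field encoding (bits 8 and 9: 00 = single, 10 = double, 11 = long double) is taken from the x87 architecture definition rather than from this manual.

```python
# Cycle counts from the precision note, keyed by the precision-control
# field (bits 8-9 of the x87 control word). Encoding 0b01 is reserved.
PRESCOTT_PRECISION_CYCLES = {
    0b00: 32,  # single precision
    0b10: 40,  # double precision
    0b11: 45,  # long double precision (default)
}


def cycles_for_control_word(control_word):
    """Return the latency and reciprocal throughput, in clock cycles,
    that the note gives for the precision selected by control_word."""
    precision_control = (control_word >> 8) & 0b11
    if precision_control not in PRESCOTT_PRECISION_CYCLES:
        raise ValueError("reserved precision-control encoding")
    return PRESCOTT_PRECISION_CYCLES[precision_control]
```

The default control word 0x037F selects long double precision and therefore the slowest case, while 0x027F selects double precision.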

Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q
r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm xmm,m mm,xmm
2 1 1 1 2 1 2 1 1 1 2 1 1 2 4 4 4 3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
6 3 7 4
1 1 1 1
7 2
0 1
10
1 1 1 1 2 1 2 1 2 1 2 1 1 2 23 8 2.5 2
0 fp 1 mmx 2 load 0 fp 1 mmx 2 load 0,1 0 mov 1 mmx 2 load 0 mov 0 mov 2 load 0 mov 2 load 0 mov 2 load 0,1 mov-mmx
alu
shift
shift
mmx mmx mmx sse2 sse2 sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2 sse3
k k
sse2
Prescott MOVQ2DQ MOVNTQ MOVNTDQ MOVDDUP MOVSHDUP MOVSLDUP PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W PMIN/MAXUB PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ xmm,mm m,mm m,xmm xmm,xmm xmm,xmm mm,r/m xmm,r/m mm,r/m xmm,r/m xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i r,r32,i 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 0 0 0 0 10 1 2 4 4 2 2 2 4 2 4 2 2 2 1 10 12 3 2 3 2 0,1 mov-mmx sse2 0 mov 0 mov 1 mmx shift 1 1 1 1 1 1 1 1 1 0 0 0,1 1 1 1 mmx mmx mmx mmx mmx mmx mmx mmx mmx mov mov shift shift shift shift shift shift shift shift shift sse sse2 sse3 sse3 mmx mmx mmx sse2 sse2 sse2 sse sse sse sse2 a a a a a
2 4 2 4 2 4 2 4 2 2
1 1 1 1 1 1 1 1 1 1
7 7 7 4
mmx-alu0 sse
mmx-int sse mmx-int sse2 int-mmx sse
r,r/m r,r/m mm,r/m xmm,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2 5 2 7 7 7 7 2 2 2 4
1 1 1 1 1 1 1 1 1 1 1 1 1
1,2 1,2 1 2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2
1 1 1 1 1 1 1 1 1 1 1 1 1
mmx mmx mmx fp mmx fp fp fp fp mmx mmx mmx mmx
alu alu alu add alu mul mul mul mul alu alu alu alu
mmx mmx sse2 sse2 mmx mmx sse mmx sse2 sse sse sse sse
a,j a,j a a a,j a,j a,j a,j a,j a,j a,j a,j a,j
r,r/m r,r/m r,i/r/m xmm,i
1 1 1 1
0 0 0 0
2 2 2 4
1 1 1 1
1,2 1,2 1,2 2
1 1 1 1
mmx mmx mmx mmx
alu alu shift shift
mmx mmx mmx sse2
a,j a,j a,j
Prescott Other EMMS Notes: a) j) k) 10 10 12 0 mmx
a) Add 1 op if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVSH/LDUP MOVDDUP MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS
r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r r,m m,r r,r r,r m,r r32,r r,r/m,i r,r/m r,r/m
1 1 2 1 4 4 1 1 1 2 1 1 2 2 1 1 2 2 1 2 1
0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0
2 4
1 1 0 1 1
4 2
4 2 5 4 4 2
1 1 1 1 1 1
1 1 2 1 2 8 2 2 1 2 2 2 2 2 2 2 4 3 2 2 2
0 2 0 0 2 0 1 1 2 0 1 1 2 0 1 1 0 1 1 1 1
mov
mov
mmx mmx
shift shift
mmx mmx
shift shift
fp mmx mmx mmx
shift shift shift
sse sse sse sse sse sse sse sse sse sse sse sse sse sse sse3 sse3 sse sse sse sse sse
k k
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm
1 2 3 2 1 3 1 2 4
0 0 0 0 0 0 0 0 0
4 10 14 8 5 10 5 11 12
1 1 1 1 1 1 1 1 1
4 2 6 6 2 4 2 2 6
1 1 1 1 1 1 1 1 1
mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx
shift sse2 shift shift sse2 sse2
sse2 a sse2 sse2 sse2 a sse2 a sse
a a a a a a
Prescott CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/DMINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Other LDMXCSR STMXCSR Notes: a) h) k) xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 3 4 3 4 2 2 0 0 0 0 0 0 0 12 8 12 20 20 12 17 1 0 1 1 1 1 1 5 2 3 4 5 4 4 1 0,1 0,1 1 1 1 1 fp-mmx sse2 fp-mmx sse fp-mmx sse2 fp-mmx sse fp-mmx sse2 fp fp a a a a a sse2 sse
a a
r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m
1 1 1 3 1 1 1 1 1 2
0 0 0 0 0 0 0 0 0 0
5 5 5 13 7 32 41 40 71 6
1 1 1 1 1 1 1 1 1 1
2 2 2 5-6 2 23 41 40 71 4
1 1 1 1 1 1 1 1 1 1
fp fp fp fp fp fp fp fp fp mmx
add add add add mul div div div div
sse sse sse3 sse3 sse sse sse sse2 sse2 sse
a a a a a a,h a,h a,h a,h a
r,r/m r,r/m r,r/m
1 1 2
0 0 0
5 5 6
1 1 1
2 2 3
1 1 1
fp fp fp
add add add
sse sse sse
a a a
r,r/m
mmx
alu
sse
1 1 1 1 2 2
0 0 0 0 0 0
32 41 40 71 5 6
1 1 1 1 1 1
32 41 40 71 3 4
1 1 1 1 1 1
fp fp fp fp mmx mmx
div div div div
sse sse sse2 sse2 sse sse
a,h a,h a,h a,h a a
m m
2 3
11 0
13 3
1 1
sse sse
a) Add 1 op if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.
Atom
Intel Atom
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
ops: The number of ops from the decoder or ROM.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously. ALU0 and ALU1 means integer unit 0 or 1, respectively. ALU0/1 means that either unit can be used. ALU0+1 means that both units are used. Mem means memory in/out unit. FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions). FP1 means floating point unit 1 (adder). MUL means multiplier, shared between FP and integer units. DIV means divider, shared between FP and integer units. np means not pairable: Cannot execute simultaneously with any other instruction.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
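Reciprocal throughput entries such as 1/2 are easier to interpret when converted to instructions per clock cycle. The Python sketch below shows the conversion using exact fractions. The function names are hypothetical, and only plain entries such as "1", "1/2" or "2.5" are handled; approximate or ranged entries such as "~110" or "1-3" would need separate treatment.

```python
from fractions import Fraction


def instructions_per_clock(reciprocal_throughput):
    """Convert a reciprocal throughput entry (clock cycles per
    instruction) into instructions per clock cycle."""
    return 1 / Fraction(reciprocal_throughput)


def cycles_for(n, reciprocal_throughput):
    """Average clock cycles for n independent instructions of the same
    kind in the same thread."""
    return n * Fraction(reciprocal_throughput)
```

An entry of 1/2 thus corresponds to two instructions per clock cycle, so ten such independent instructions take five clock cycles on average.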
Operands ops Unit Latency Reciprocal throughput 1 1 1-3 1 1 1/2 1/2 1 1 1 1 5 21 26 2.5 1 2 Remarks
Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD CMOVcc
r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r/m r,r
1 1 1 1 1 1 2 7 8 1 1 1
ALU0/1 ALU0/1 ALU0, Mem ALU0, Mem ALU0, Mem
All addr. modes All addr. modes
ALU0, Mem ALU0 ALU0+1
1 2
Atom CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL r,m r,r r,m r i m sr 1 3 4 3 1 1 2 3 14 9 1 1 3 7 19 16 1 1 2 1 1 10 1 1 1 1 1 6 6 6 1 3 6 6 6 1 1 5 6 12 11 1 1 6 31 28 12 2 1/2 5 1 1 30 1 1 1/2 1 1
Implicit lock
np np
Not in x64 mode
r (E/R)SP m sr
np np
1 1
Not in x64 mode
ALU0+1 ALU0/1
2 1 7 1-4 1 30
r,m r m m m
AGU1 ALU0 Mem Mem
Not in x64 mode 4 clock latency on input register
r8 r16 r32 r64
1 1 1 1 1 1 1 1 1 1 13 13 20 21 4 10 3 4 3 8
ALU0/1 ALU0/1, Mem
1 2 2 2 2 1 1 1 16 12 20 25 7 24 7 6 6 14
ALU0/1 ALU0/1
1/2 1 1 2 2 2 1/2 1 1/2 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode 7 6 6 14
ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul
Atom IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHLD SHLD SHLD SHLD SHLD SHRD SHRD SHRD SHRD SHRD r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r/m8 r/m16 r/m32 r/m 64 r/m8 r/m16 r/m32 r/m64 2 1 6 2 1 7 3 5 4 8 9 12 12 38 26 29 29 60 2 1 1 2 1 1 ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0 ALU0 ALU0 ALU0 ALU0 ALU0 6 5 13 5 5 14 6 7 7 14 22 33 49 183 38 45 61 207 5 1 1 5 1 1 5 2 11 5 2 14
22 33 49 183 38 45 61 207
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r,1 r/m,i/cl r/m,i/cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl
1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0/1, Mem 1 1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0 1 1 ALU0 1 1 ALU0 1 1 ALU0 1 5 ALU0 7 2 ALU0 1 12-17 ALU0 12-15 14-20 ALU0 14-18 10 ALU0 10 2 ALU0 5 10 ALU0 11 9 ALU0 9 2 ALU0 5 9 ALU0 10 8 ALU0 8 2 ALU0 5 10 ALU0 9 7 ALU0 8 2 ALU0 5
1/2 1 1 1/2 1 1 1 1 1
1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem
Atom SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc CLC STC CMC CLD STD r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r/m r m 9 1 9 2 1 10 3 10 1 2 1 1 5 6 ALU0 ALU1 9 1 10 5 1 11 6 16 2 1-2 more if mem 1
ALU1 ALU1 ALU1 ALU0+1 ALU0/1
2 5 1/2 2 7 25
Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Conditional jump short/near J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND r,m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other
1 29 1 2 30 1 3 8 8 1 37 1 2 38 1 1 36 36 11 4
ALU1
ALU1
np np
2 66 4 7 78 2 7 8 8 3 65 18 20 64 6 6 80 80 10 6
Not in x64 mode
Not in x64 mode
Not in x64 mode Not in x64 mode
3 5n+11 2 3n+10 4 4n+11 3 5n+16 5 6n+16
6 3n+50 5 2n+4 6 2n - 4n 6 3n+60 7 4n+40
fastest for high n
Atom NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 1 1 5 14 20+6b 4 40-80 16 24 ALU0/1 ALU0/1 24 23 6 100-170 29 48 1/2 1/2
a,0 a,b

Operands Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) ops Unit Latency Reciprocal through1 1 3 1 9 10 92 92 1 1 7 9 12 13 221 221 1 1 7 6 11 9 11 9 1 8 10 9 9 10 10 8 9 1 1 1 321 321 177 177 Remarks
r m32/m64 m80 m80 r m32/m64 m80 m80 r m m m
r AX m16 m16 m16
m m
1 1 4 52 1 3 8 189 1 1 3 3 1 2 2 3 4 4 2 3 1 1 166 83
SSE3
r/m r/m r/m
r/m r m
1 1 1 1 1 1 1 5 3
Mul Div
5 5 71 1 1 1 1
1 2 71 1 1 1 1 10 9
Atom FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT m m m 3 3 3 1 1 26 37 19 Mul Div 1 1 ~110 ~130 48 9 73 9 1 1
30 15 1 9 112 25 63 100 91
Div
56 24 71 ~260 ~260 ~100 ~220 ~300 ~300
1 2 4 23
5 74
1 5 26

Operands Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ ops Unit Latency Reciprocal through4 5 3 4 1 4 5 1 4 5 6 6 6 1 1 ~400 ~450 2 1 1 1 1/2 1 1 1/2 1 1 6 6 6 1 1 1 3 Remarks
r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm
1 1 1 1 1 1 1 1 1 1 3 4 4 1 1 1 1
Mem Mem FP0/1 Mem Mem FP0/1 Mem Mem Mem Mem Mem
Mem Mem
Atom PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PSHUFB PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PINSRW PEXTRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PSADBW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm mm,mm xmm,xmm mm,mm,i xmm,xmm,i xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm (x)mm,r32,i r32,(x)mm,i 1 1 1 1 4 1 1 1 1 1 2 1 1 2 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 Mem Mem 1 1 1 1 6 1 1 1 1 1 1 1 1 6 1 1 1 1 2 7 2 1 5
4 3 5
(x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm
1 2 7 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
FP0/1
FP0/1 FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0/1 FP0/1 FP0/1 FP0/1 FP0/1
1 5 8 6 1 4 5 4 5 4 5 4 5 4 5 4 5 1 1 1 1 1
1/2 5 8 1/2 1 2 1 2 1 2 1 2 1 2 1 2 1/2 1/2 1/2 1/2 1/2
Logic instructions
Instruction         Operands      μops  Unit   Latency  Reciprocal throughput
PAND(N) POR PXOR    (x)mm,(x)mm   1     FP0/1  1        1/2
PSLL/RL/RAW/D/Q     (x)mm,(x)mm   2     FP0    5        5
PSLL/RL/RAW/D/Q     (x)mm,i       1     FP0    1        1
PSLL/RLDQ           xmm,i         1     FP0    1        1
Other
EMMS

Operands Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T)PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T)PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD μops Unit Latency Reciprocal throughput 1 1/2 4 1 5 1 6 6 6 6 1 1/2 4 1 5 1 5 1 4 1 4 1 1 1 4 2 ~500 3 1 1 1 1 1 1 1 1 1 1 1 Remarks
xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm
1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
FP0/1 Mem Mem Mem Mem FP0/1 Mem Mem Mem Mem Mem FP0 Mem FP0 FP0 FP0 FP0 FP0 FP0
xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm
4 3 4 3 3 3 3 3 1 1 3 4 3 3 3 3
11 10 7 6 6 6 7 6 6 4 7 7 7 10 8 10
11 10 6 6 6 6 6 6 5 1 6 7 6 8 6 8
xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm
1 1 1 3 1 3
FP1 FP1 FP1 FP1 FP1 FP1
5 5 5 6 5 6
1 1 1 6 1 6
Atom HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 5 5 1 1 1 6 3 3 6 6 1 5 1 3 4 1 3 FP0+1 FP0+1 FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Div FP0, Div FP0, Div FP0, Div 8 8 4 5 5 9 31 60 64 122 4 9 5 6 9 5 6 7 7 1 2 2 9 31 60 64 122 1 8 1 6 9 1 6
FP0 FP0 FP0 FP0 FP0
3 5 3 5 1 5
FP0, Div FP0, Div FP0, Div FP0, Div FP0 FP0
31 63 60 121 4 9
31 63 60 121 1 8
xmm,xmm xmm,xmm xmm,xmm xmm,xmm
1 1 1 1
FP0/1 FP0/1 FP0/1 FP0/1
1 1 1 1
1/2 1/2 1/2 1/2
m32 m32 m4096 m4096
4 4 121 116
5 14 142 149
6 15 144 150
VIA Nano 2000
VIA Nano 2000 series

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
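The difference between latency and reciprocal throughput can be illustrated with a rough estimate. The sketch below is not part of the measured data: the function name estimate_cycles and the example values (latency 4, reciprocal throughput 2) are assumptions chosen for illustration. It treats a run of identical instructions as limited either by the dependency chain or by throughput, whichever is slower, and ignores every other bottleneck such as decoding, cache misses and port sharing.

```python
import math

def estimate_cycles(n, latency, recip_throughput, chains=1):
    """Rough cycle estimate for n identical instructions.

    chains is the number of independent dependency chains the
    instructions are split into (1 = every instruction waits for
    the previous result). This is a simplified model, not a
    measurement.
    """
    # Limit set by the longest dependency chain.
    chain_limit = math.ceil(n / chains) * latency
    # Limit set by how fast the execution unit accepts instructions.
    throughput_limit = n * recip_throughput
    return max(chain_limit, throughput_limit)

# Hypothetical instruction with latency 4 and reciprocal throughput 2.
print(estimate_cycles(100, 4, 2, chains=1))  # one long chain
print(estimate_cycles(100, 4, 2, chains=2))  # two interleaved chains
```

In this model, splitting the work into more independent chains helps only until the throughput limit is reached; beyond that point the reciprocal throughput dominates.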
Operands μops Port Latency Reciprocal throughput 1 1 1 1.5 1.5 1 2 20 20 1.5 Latency 4 on pointer register Remarks
Move instructions
MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 I2 I2 LD SA, ST SA, ST 1 1 2 2
SA, ST
20 20 2
VIA Nano 2000 MOVSX MOVSXD MOVZX MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS r,r r,m r,m r,r r,m r,r r,m m r i m sr 1 2 1 2 3 I2 LD, I2 LD I1, I2 LD, I1 I2 1 3 2 2 5 3 20 6 1 1 1 1 2 3 20 1-2 1-2 2 17 8 15 1.25 4 5 20 9 12 1 1 6 1 1 30 1-2 1-2 14 14 14
Implicit lock
SA, ST SA, ST Ld, SA, ST 8
Not in x64 mode
r (E/R)SP m sr
LD
9 1 1 r,m r m m m 1 1 I1 I1 SA I2 1 1 9 1 1 30 LD LD
Not in x64 mode
Not in x64 mode 3 clock latency on input register
1 2 3 1 2 3 1 2 1 3
I12 LD I12
LD I12 SA ST
1 5 1 5 1 1 5
I1 LD I1 LD I1 SA ST I12 LD I12 I12

LD I12 SA ST
1/2 1 2 1 1 2 1/2 1 1/2 37 37 22 24 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode
VIA Nano 2000 AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 1 1 MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA MA I1 I1 7-9 7-9 7-9 8-10 4-6 4-6 5-7 4-6 4-6 5-7 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1 23 30 Not in x64 mode Not in x64 mode Extra latency to other ports do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do.
1 1 2 1 1 2 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i r32,r32,i r64,r64,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r r
1 2 3 1 2 1 1 1
I12 LD I12
LD I12 SA ST
1 5 1 1 1 1 28+3n 11 7 33 43 11 7 33 43 1
1 2 2
I12 LD I12 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 Page 166
2 10 8 3 2
1/2 1 2 1/2 1 1 1 1 28+3n 11 7 33 43 11 7 33 43 1 8 1 2 10 8 2 1
VIA Nano 2000 SETcc CLC STC CMC CLD STD Control transfer instructions JMP JMP JMP JMP JMP Conditional jump short/near far r m(near) m(far) short/near 1 I2 3 58 3 3 55 1-3-8 3 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block do. do. 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do. m I1 3 3 1 3 3
I2
3 3 1-3-8
J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q REP STOSB/W/D/Q
short short short near far r m(near) m(far)
1-3-8 1-3-8 25 3 72 3 4 72 3 3 39 39
1-3-8 1-3-8 25 3 72 3 3 72 3 3 39 39 13 7
i i r,m
MOVSB/W/D/Q REP MOVSB/W/D/Q
SCASB/W/D/Q REP SCASB REP SCASW/D/Q
CMPSB/W/D/Q
1 3n+22 1-2 Small: 2n+2, Big: 6 bytes per clock 2 Small: 2n+45, Big: 6 bytes per clock 1 2.2n Small: 2n+50 Big: 5 bytes per clock 6
VIA Nano 2000 REP CMPSB/W/D/Q Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 2.4n+24
1 1 a,0 a,b
All I12
4 53-173 40
1 1/2 25 23 52+5b 4 39 40
Blocks all ports

Operands ops Port and Unit MB LD MB LD MB MB MB SA ST MB SA ST I2
Move instructions
FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) r/m r/m r/m 1 1 MB MA MA r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 1 2 2 1 3 3 1
Latency Reciprocal throughput 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 1 1 1 54 1 1-2 1-2 125 1
Remarks
1 r AX m16 m16 m16 1 1 m m
MB 2
13 I2 MB 0 321 195
1 10 2 5 3 13 2 1 1 321 195
2 4 15-42
1 2 15-42
Lower precision: Lat: 4, Thr: 2
VIA Nano 2000 FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT 1 1 1 1 1 MB MB MB MB MB MB 1 1 1 1 1 1 1 2 4 42 2 1 41
r/m r m m m m
1 1
MB 151-171 106-155 29
39 36-57 73 51-159 270-360 50-200 ~60 ~170 300-370 ~170
1 1
MB I12
1 1/2 57 85

Operands ops Port and Unit
Move instructions
MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 1 1 1 1 1 1 1 1 1 1 1 SA ST LD MB LD SA ST MB LD SA ST SA ST LD
Latency Reciprocal throughput 3 2-3 4 2-3 1 2-3 2-3 1 2-3 2-3 2-3 2-3 1 1-2 1 1 1 1 1-2 1 1 1-2 1-2 1
Remarks
VIA Nano 2000 LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULHRSW PMULUDQ PMADDWD PMADDUBSW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND Logic instructions PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm 1 1 3 3 1 1 1 1 MB MB MB MB MB MA MA MA 1 1 3 3 1 3 3 3 4 10 2 1 1 1 1 1 1 1 3 3 1 1 1 1 2 8 1 1 1 1 1 1 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm,i xmm,xmm,i xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm r32 ,(x)mm,i (x)mm,r32,i 1 1 1 3 3 1 1 1 1 1 1 1 1 LD MB MB 2-3 1 1 ~300 ~300 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1-3 1-3 1 1 9
MB MB MB MB MB MB MB MB
3 3 9
1 1 1 1 1
MB MB MB MB MB MB
(x)mm,(x)mm (x)mm,(x)mm (x)xmm,i xmm,i
1 1 1 1
MB MB MB MB
1 1 1 1
1 1 1 1
MB

Operands ops Port and Unit MB LD SA ST LD SA ST MB LD SA ST
Move instructions
MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 1 1 1 1 1
Latency Reciprocal throughput 1 2-3 2-3 2-3 2-3 1 2-3 2-3 6 6 6 2 1 3 ~300 1 1 1 1 1 1 1 1 1-2 1 1-2 1 1 1-2 1 1 1-2 1-2 1 1 2.5 1 1 1 1 1 1
Remarks
MB
1 1 1 1 1 1
MB MB MB MB MB MB
xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm
3-4 15 3-4 15 3 2 4 3 4 3 4 3 5 4 5 4
xmm,xmm xmm,xmm xmm,xmm
1 1 1
MBfadd MBfadd MBfadd
2-3 2-3 2-3
1 1 1
VIA Nano 2000 ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA 2-3 2-3 2-3 5 5 3 4 3 4 15-22 15-36 42-82 24-70 5 14 2 2 3 2 2 1 1 1 3 3 1 2 1 2 15-22 15-36 42-82 24-70 5 11 1 1 1 1 1
1 1
1 1
MBfadd MBfadd
1 1
MBfadd MBfadd
MA MA MA MA
33 126 62 122 5 14
33 126 62 122 5 11
1 1 1 1
MB MB MB MB
1 1 1 1
1 1 1 1
m32 m32 m4096 m4096
45 13 208 232
29 13 208 232
VIA-specific instructions
Instruction       Conditions           Clock cycles, approximately
XSTORE            Data available       160-400 clock giving 8 bytes
XSTORE            No data available    50-80 clock giving 0 bytes
REP XSTORE        Quality factor = 0   4800 clock per 8 bytes
REP XSTORE        Quality factor > 0   19200 clock per 8 bytes
REP XCRYPTECB     128 bits key         44 clock per 16 bytes
REP XCRYPTECB     192 bits key         46 clock per 16 bytes
REP XCRYPTECB     256 bits key         48 clock per 16 bytes
REP XCRYPTCBC     128 bits key         54 clock per 16 bytes
REP XCRYPTCBC     192 bits key         59 clock per 16 bytes
REP XCRYPTCBC     256 bits key         63 clock per 16 bytes
REP XCRYPTCTR     128 bits key         43 clock per 16 bytes
REP XCRYPTCTR     192 bits key         46 clock per 16 bytes
REP XCRYPTCTR     256 bits key         48 clock per 16 bytes
REP XCRYPTCFB     128 bits key         54 clock per 16 bytes
REP XCRYPTCFB     192 bits key         59 clock per 16 bytes
REP XCRYPTCFB     256 bits key         63 clock per 16 bytes
REP XCRYPTOFB     128 bits key         54 clock per 16 bytes
REP XCRYPTOFB     192 bits key         59 clock per 16 bytes
REP XCRYPTOFB     256 bits key         63 clock per 16 bytes
REP XSHA1                              3 clock per byte
REP XSHA256                            4 clock per byte
Nano 3000
VIA Nano 3000 series

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
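The statement that instructions using the same port cannot execute simultaneously suggests a simple lower bound on the time for a block of code. The sketch below is an illustration under strong assumptions, not a description of the actual scheduler: the function name port_bound is hypothetical, every instruction is assumed to be a single μop that occupies its port for one clock, I12 instructions are assumed to be balanced freely between I1 and I2, and MBfadd is treated as its own label even though it is a subunit of MB. Latencies and decoding limits are ignored.

```python
import math
from collections import Counter

def port_bound(ports):
    """Lower bound on clock cycles for a block of instructions.

    ports is a list with one port label per instruction,
    for example ["I1", "I12", "LD", "MB"].
    """
    counts = Counter(ports)
    i1 = counts.pop("I1", 0)
    i2 = counts.pop("I2", 0)
    i12 = counts.pop("I12", 0)
    # I12 instructions may go to either integer port, so the two
    # integer ports together need at least half the total, and each
    # port needs at least the instructions that are tied to it.
    integer_bound = max(i1, i2, math.ceil((i1 + i2 + i12) / 2))
    # Every other port handles one instruction per clock in this model.
    other_bound = max(counts.values(), default=0)
    return max(integer_bound, other_bound)

print(port_bound(["I1", "I1", "I12", "LD"]))
print(port_bound(["MB", "MB", "MB", "I2"]))
```

A block that needs more clocks than this bound is limited by something other than port contention, for example a dependency chain.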
Operands μops Port Latency Reciprocal throughput 1 1 2 2 1 1/2 1 1.5 1.5 1/2 1.5 20 20 1.5 Latency 4 on pointer register Remarks
Move instructions
MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 I2 I12 LD SA, ST SA, ST I12
SA, ST
20 20 2
Nano 3000 MOVSX MOVZX MOVSXD MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD r,r r64,r32 r,m r,m r,r r,m r,r r,m m r i m sr 1 1 2 1 1 3 3 1 1 I12 LD, I12 LD I12 LD, I12 I12 LD, I1 SA, ST SA, ST LD, SA, ST 1 1 3 2 1 5 3 18 6 1/2 1 1 1 1/2 1 1.5 18 2 1-2 1-2 2 6 2 15 1.25 4 2 11 1 12 1 1 6 1 1 28 1 1 15
Implicit lock
r (E/R)SP m sr
3 9 2 3 3 16 1 1 2
2 LD
Not in x64 mode
Not in x64 mode
I1 I1
1 1 10 1 1 28
r,m r m m m
1 1 12 1 1
SA I2
Not in x64 mode Extra latency to other ports
LD LD
1 2 3 1 2 3 1 2 1 3 12 12 14 14 7
I12 LD I12
LD I12 SA ST
1 5 1 5 1 1 5
I1 LD I1 LD I1 SA ST I12 LD I12 I12

LD I12 SA ST
1/2 1 2 1 1 2 1/2 1 1/2 37 22 22 24 24 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode
Nano 3000 AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc CLC STC CMC CLD STD r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 13 1 3 3 3 1 1 1 1 1 1 31 I2 I2 I2 MA I2 I2 MA I2 I2 MA MA MA MA MA MA MA MA MA I2 I2 2 3 3 8 2 2 5 2 2 5 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1 8 1 1 2 1 1 Extra latency to other ports Not in x64 mode
Extra latency to other ports
1 1
Extra latency to other ports 2 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1
r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i/cl r32,r32,i/cl r64,r64,i/cl r64,r64,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r8 m
1 I12 2 LD I12 3 LD I12 SA ST 1 I12 2 LD I12 1 I12 1 I1 1 I1 5+2n I1 2 I1 2 I1 16 I1 23 I1 1 I1 6 I1 2 I1 2 I1 8 I1 5 I1 2 I1 1 I1 2 3 I1 3 I1
1 5 1 1 1 1 28+3n 2 2 32 42 1
2 10 8 2 1 3 3
1/2 1 2 1/2 1 1/2 1 1 28+3n 2 2 32 42 1 8 1 2 10 8 2 1 2 3 3
Nano 3000 Control transfer instructions JMP JMP JMP JMP JMP short/near far r m(near) m(far) 1 14 2 2 17 I2 3 3 50 3 3 42 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block
I2
3 3
Conditional jump J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL CALL CALL RETN RETN RETF RETF BOUND INTO String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q
short/near short short short near far r m(near) m(far)
1 2 2 5 2 17 2 3 19 3 4 20 20 9 3
I2
1-3-8 1-3-8 1-3-8 24 3
1-3-8 1-3-8 1-3-8 24 3 58 3 3 54 3 3 49 49 13 7
3 4
8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do.
i i r,m
3 3
2 3n 1
REP STOSB/W/D/Q MOVSB/W/D/Q
REP MOVSB/W/D/Q SCASB/W/D/Q REP SCASB
REP SCASW/D/Q CMPSB/W/D/Q REP CMPSB/W/D/Q
1 3n+27 1-2 Small: n+40, Big: 6-7 bytes/clk 2 Small: 2n+20, Big: 6-7 bytes/clk 1 2.4n Small: 2n+31, Big: 5 bytes/clk 6 2.2n+30
Nano 3000 Other NOP (90) long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 0-1 0-1 2 10 3 I12 I12 0 0 1/2 1/2 6 21 52+5b 2 37 40 Sometimes fused
a,0 a,b
2 55-146

Operands ops Port
Move instructions
FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 1 2 2 36 1 3 3 80 1 3 2 2 3 3 3 1 3 1 1 3 5 3 1 1 122 115 MB LD MB LD MB MB MB SA ST MB SA ST I2
Latency Reciprocal throughput 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 1 1 1 54 1 1-2 1-2 125 1
Remarks
MB MB 2
r AX m16 m16 m16
I2 MB
0 319 196
m m
1 10 2 1 2 8 2 1 1 319 196
r/m r/m r/m
1 1 1 1 1
r/m
MB MA MA MB MB MB
2 4 14-23 1 1
1 2 14-23 1 1 1
Nano 3000 FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT r m m m m 1 1 3 3 3 3 1 15 MB MB MB 2 1 1 2 4 16 2 1 38
MB
11
2 38 ~130 ~130 27
22 13
37 57 73 ~150 270-360 50-200 ~50 ~50 300-370 ~180 Less at lower precision
1 1
MB I12
1 1/2 59 84

Operands ops Port
Move instructions
MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 MB SA ST I2 LD MB LD SA ST MB LD SA ST SA ST LD LD MB Page 179
Latency Reciprocal throughput 3 2 4 2 1 2 2 1 2 2 2 2 2 1 1 1-2 1 1 1 1 1-2 1 1 1-2 1-2 1 1 1
Remarks
Nano 3000 MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKUSDW

PUNPCKH/LBW/WD/DQ
xmm,mm m64,mm m128,xmm xmm,m128 (x)mm, (x)mm xmm,xmm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm,i xmm,xmm,i xmm,xmm,i x,x,xmm0 xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm r32 ,(x)mm,i r32/64 ,xmm,i (x)mm,r32,i xmm,r32/64,i xmm,xmm
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
MB
1 ~360 ~360 2 1 1 1 1 1 1 1 1 2 1 1
1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1-2 1-2 1 1 1 1 1 1
PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PBLENDVB PBLENDW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRB/D/Q PINSRW PINSRB/D/Q PMOVSX/ZXBW/BD/ BQ/WD/WQ/DQ Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PCMPEQQ PMULL/HW PMULHUW PMULHRSW PMULLD PMULUDQ PMULDQ PMADDWD PMADDUBSW PSADBW MPSADBW PAVGB/W PMIN/MAXSW PMIN/MAXUB PMIN/MAXSB/D PMIN/MAXUW/D PHMINPOSUW
MB MB MB MB MB MB MB MB MB MB MB
1 1 2 2 1
MB MB MB MB MB
3 3 3 5 5 1
(x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm,i (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm xmm,xmm xmm,xmm xmm,xmm
1 1 3 3 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1
MB MB MB MB MB MB MA MA MA MA MA MA MB MB MB MB MB MB MB MB
1 1 3 3 1 1 3 3 3 3 3 4 10 2 2 1 1 1 1 1 2
1 1 3 3 1 1 1 1 1 1 1 2 8 1 1 1 1 1 1 1 1
Nano 3000 PABSB PABSW PABSD (x)mm,(x)mm PSIGNB PSIGNW PSIGND Logic instructions PAND(N) POR PXOR PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS (x)mm,(x)mm 1 1 MB MB 1 1 1 1
(x)mm,(x)mm xmm,xmm (x)mm,(x)mm (x)xmm,i xmm,i
1 1 1 1 1
MB MB MB MB MB
1 3 1 1 1
1 1 1 1 1
MB

Operands ops Port
Move instructions
MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 1 2 1 1 2 2 2 3 1 1 2 1 1 1 1 1 1 MB LD SA ST LD SA ST MB LD SA ST
Latency Reciprocal throughput 1 2 2 2 2 1 2-3 2-3 6 6 6 2 1 3 ~360 1 1 1 1 1 1 1 1 1 1 1 1 1 1-2 1 1 1-2 1-2 1 1 1-2 1 1 1 1 1 1
Remarks
MB MB MB MB MB MB
xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm
2 1 2 1 1 1 2
MB
5 2 5 2 3 2 5
2 1 1 1 1
Nano 3000 CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm 2 1 2 2 2 1 2 1 4 5 4 4 4 5 4 5 4 2 2 1 1 2 1 1
xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm
1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1
MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA MA MA MBfadd MBfadd MBfadd MBfadd MBfadd
2 2 2 2 2 2 5 5 3 4 3 4 13 13-20 24 21-38 5 14 2 2 3 2 2
1 1 1 1 1 1 3 3 1 2 1 2 13 13-20 24 21-38 5 11 1 1 1 1 1
1 1 1 1 1 3
MA MA MA MA
33 64 62 122 5 14
33 64 62 122 5 11
1 1 1 1
MB MB MB MB
1 1 1 1
1 1 1 1
Nano 3000 Other LDMXCSR STMXCSR FXSAVE FXRSTOR m32 m32 m4096 m4096 31 13 97 201
VIA-specific instructions
Instruction       Conditions           Clock cycles, approximately
XSTORE            Data available       160-400 clock giving 8 bytes
XSTORE            No data available    50-80 clock giving 0 bytes
REP XSTORE        Quality factor = 0   1300 clock per 8 bytes
REP XSTORE        Quality factor > 0   5455 clock per 8 bytes
REP XCRYPTECB     128 bits key         15 clock per 16 bytes
REP XCRYPTECB     192 bits key         17 clock per 16 bytes
REP XCRYPTECB     256 bits key         18 clock per 16 bytes
REP XCRYPTCBC     128 bits key         29 clock per 16 bytes
REP XCRYPTCBC     192 bits key         33 clock per 16 bytes
REP XCRYPTCBC     256 bits key         37 clock per 16 bytes
REP XCRYPTCTR     128 bits key         23 clock per 16 bytes
REP XCRYPTCTR     192 bits key         26 clock per 16 bytes
REP XCRYPTCTR     256 bits key         27 clock per 16 bytes
REP XCRYPTCFB     128 bits key         29 clock per 16 bytes
REP XCRYPTCFB     192 bits key         33 clock per 16 bytes
REP XCRYPTCFB     256 bits key         37 clock per 16 bytes
REP XCRYPTOFB     128 bits key         29 clock per 16 bytes
REP XCRYPTOFB     192 bits key         33 clock per 16 bytes
REP XCRYPTOFB     256 bits key         37 clock per 16 bytes
REP XSHA1                              5 clock per byte
REP XSHA256                            5 clock per byte
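The "clock per 16 bytes" figures for the VIA-specific instructions can be converted to clocks per byte or to a data rate with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical and the 1.0 GHz clock frequency is an assumed value, not a property of any tested processor. The input of 15 clocks per 16 bytes is the REP XCRYPTECB figure with a 128 bits key listed above for the Nano 3000.

```python
def clocks_per_byte(clocks_per_block, block_bytes):
    """Convert a 'clocks per block' figure to clocks per byte."""
    return clocks_per_block / block_bytes

def bytes_per_second(clocks_per_block, block_bytes, clock_hz):
    """Approximate data rate for a given core clock frequency."""
    return block_bytes * clock_hz / clocks_per_block

# REP XCRYPTECB, 128 bits key, Nano 3000: 15 clocks per 16 bytes.
ASSUMED_CLOCK_HZ = 1.0e9  # hypothetical clock frequency for illustration

print(clocks_per_byte(15, 16))
print(bytes_per_second(15, 16, ASSUMED_CLOCK_HZ))
```

The same functions apply to the per-byte figures for REP XSHA1 and REP XSHA256 by passing a block size of 1.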
