
Microprocessors The key element of all computers, providing the mathematical and decision-making ability

Current state-of-the-art uPs (Pentium, Athlon, SPARC, PowerPC) contain complex circuits consisting of tens of millions of transistors. They operate at ultra-fast speeds, performing over a billion operations every second. They are made from a semiconductor, Silicon

Integrated Circuits Commonly known as an IC or a chip A tiny piece of Silicon that has several electronic parts on it Most of the size of an IC comes from the pins and packaging; the actual Silicon occupies a very small fraction of the volume The smallest components on an IC are much smaller than the thickness of a human hair

Those components are devices: transistors, diodes, resistors, capacitors, and wires. They are made of the following materials: Silicon - semiconductor; Copper - conductor; Silicon Dioxide - insulator

A microprocessor system? uPs are powerful pieces of hardware, but not very useful on their own Just as the human brain needs hands, feet, eyes, ears, and a mouth to be useful, so does the uP A uP system is a uP plus all the components it requires to do a certain task A microcomputer is one example of a uP system

Micro-controllers? Micro-controllers are another type of uP system They are generally not that powerful, cost a few dollars apiece, and are found embedded in video games, VCRs, microwave ovens, printers, autos, etc. They are a complete computer on a chip: the uP, memory, and direct input and output capability are all integrated on a single chip. They often contain other specialized, application-specific components as well

More than 90% of the microprocessors/micro-controllers manufactured are used in embedded computing applications In 2000 alone, 365 million uPs and 6.4 billion micro-controllers were manufactured

The Main Memory Bottleneck Modern super-fast uPs can process a huge amount of data in a short duration They require quick access to data to maximize their performance If they don't receive the data they require, they literally stop and wait, which results in reduced performance and wasted power Current uPs can process an instruction in about a nanosecond (ns), while the time required for fetching data from main memory (RAM) is of the order of 100 ns

Solution to the Bottleneck Problem Make the main memory faster Problem with that approach: 1-ns memory is extremely expensive compared to the currently popular 100-ns memory Another solution: in addition to the relatively slow main memory, put a small amount of ultra-fast RAM right next to the uP on the same chip, and make sure that frequently used data and instructions reside in that ultra-fast memory Advantage: much better overall performance due to fast access to frequently-used data and instructions

On-Chip Cache Memory That small amount of memory located on the same chip as the uP is called on-chip cache memory The uP stores a copy of frequently used data and instructions in its cache memory When the uP wants to look at a piece of data, it checks the cache first; only if it is not there does the uP request it from main memory

The small size and proximity to the uP make access times short, resulting in a boost in performance (it is easy to find things in a small box placed next to you) uPs predict what data will be required for future calculations, pre-fetch that data, and place it in the cache so that it is available immediately when the need arises The speed advantage of cache memory depends greatly on the algorithm used to decide what to keep in the cache
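The check-the-cache-first idea can be sketched in a few lines of C. This is only an illustration of the lookup the slides describe; the direct-mapped organization, the sizes, and the tag/index split below are assumptions made for the sketch, not details of any particular uP.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 4096      /* simulated main memory: 4096 words                    */
#define NUM_LINES 256       /* illustrative on-chip cache: 256 direct-mapped lines  */

static uint32_t main_memory[MEM_WORDS];   /* models the slow main memory (~100 ns)  */

typedef struct {
    bool     valid;         /* does this line hold real data?                       */
    uint32_t tag;           /* which memory block the line currently holds          */
    uint32_t data;          /* the cached word itself (simplified to one word)      */
} CacheLine;

static CacheLine cache[NUM_LINES];

/* Check the cache first; only on a miss go to main memory and keep a copy. */
static uint32_t read_word(uint32_t address)
{
    uint32_t index = address % NUM_LINES;   /* pick a cache line                    */
    uint32_t tag   = address / NUM_LINES;   /* identify the memory block            */

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;           /* hit: fast on-chip access             */

    /* Miss: fetch from the slow main memory and remember it for next time. */
    uint32_t value = main_memory[address % MEM_WORDS];  /* keep toy address in range */
    cache[index].valid = true;
    cache[index].tag   = tag;
    cache[index].data  = value;
    return value;
}

int main(void)
{
    main_memory[42] = 7;
    printf("%u\n", read_word(42));   /* first access: miss, goes to main memory */
    printf("%u\n", read_word(42));   /* second access: hit, served from the cache */
    return 0;
}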

uP Building Blocks

Bus Interface Unit Receives instructions & data from main memory Instructions are then sent to the instruction cache, data to the data cache Also receives the processed data and sends it to the main memory

Instruction Decoder This unit receives the programming instructions and decodes them into a form that is understandable by the processing units, i.e. the ALU or FPU Then, it passes on the decoded instruction to the ALU or FPU

Arithmetic & Logic Unit (ALU) Also known as the Integer Unit It performs whole-number math calculations (subtract, multiply, divide, etc.), comparisons (is greater than, is smaller than, etc.), and logical operations (NOT, OR, AND, etc.) The new breed of popular uPs has not one but two almost identical ALUs that can do calculations simultaneously, doubling the capability

Floating-Point Unit (FPU) Also known as the Numeric Unit

It performs calculations that involve numbers represented in scientific notation (also known as floating-point numbers). This notation can represent extremely small and extremely large numbers in a compact form Floating-point calculations are required for graphics, engineering, and scientific work The ALU can do these calculations as well, but will do them very slowly
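A small C illustration of why scientific notation matters; the particular values are chosen purely for illustration and are not taken from the slides.

#include <stdio.h>

int main(void)
{
    /* Floating-point (scientific) notation covers huge and tiny values       */
    /* in one compact, fixed-size representation.                             */
    double avogadro = 6.022e23;    /* an extremely large number               */
    double electron = 1.6e-19;     /* an extremely small number               */

    printf("%e\n", avogadro);      /* prints 6.022000e+23                     */
    printf("%e\n", electron);      /* prints 1.600000e-19                     */

    /* A 32-bit integer cannot hold either value; it tops out around 2e9.     */
    printf("%d\n", 2147483647);    /* largest signed 32-bit integer           */
    return 0;
}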

Registers Both ALU & FPU have a very small amount of super-fast private memory placed right next to them for their exclusive use. These are called registers The ALU & FPU store intermediate and final results from their calculations in these registers Processed data goes back to the data cache and then to main memory from these registers

Control Unit The brain of the uP Manages the whole uP Tasks include fetching instructions & data, storing data, managing input/output devices

Language of a uP Instruction Set The set of machine instructions that a uP recognizes and can execute - the only language the uP knows An instruction set consists of low-level, single-step-at-a-time instructions, such as add, subtract, multiply, and divide Each uP family has its own unique instruction set Bigger instruction sets mean more complex chips (higher costs, reduced efficiency), but shorter programs
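To make "single step-at-a-time" concrete, the short C function below shows how one source statement breaks down into machine-level steps. The mnemonics in the comments are generic and invented for this sketch; they are not the instruction set of any particular uP.

#include <stdio.h>

int total(int price, int tax)
{
    /* The single return statement below corresponds to several low-level   */
    /* steps of the kind an instruction set provides (illustrative only):   */
    /*   LOAD  R1, price   ; fetch the first operand into a register        */
    /*   ADD   R1, tax     ; the ALU adds the second operand                */
    /*   RETURN R1         ; hand the result back to the caller             */
    return price + tax;
}

int main(void)
{
    printf("%d\n", total(100, 7));   /* prints 107 */
    return 0;
}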

The 1st uP: Intel 4004 Introduced in 1971; 2,250 transistors; 108 kHz clock, 60,000 ops/sec; 16 pins; 10-micron process As powerful as the ENIAC, which had 18,000 tubes and occupied a large room

Targeted use: Calculators Cost: less than $100 Why did Intel come up with the idea?

A Japanese calculator manufacturer, Busicom, wanted Intel to develop 16 separate ICs for a line of new calculators Intel, at that point in time known only as a memory manufacturer, was quite small and did not have the resources to do all 16 chips Ted Hoff came up with the idea of doing all 16 on a single chip Later, Intel realized that the 4004 could have other uses as well

Currently Popular Intel Pentium 4 (2.2 GHz) Introduced in December 2001; 55 million transistors; 32-bit word size; 2 ALUs, each working at 4.4 GHz; 128-bit FPU; 0.13-micron process Targeted use: PCs and low-end workstations Cost: around $600

Moore's Law In 1965, Gordon Moore, one of the founders of Intel, predicted that the number of transistors on an IC (and therefore the capability of microprocessors) would double every year. He later modified the period to 18 months His prediction still held true in '02. In fact, the doubling time has been contracting back toward the original prediction and is now closer to a year
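The prediction is just a doubling formula: transistor count after t years is roughly the starting count times 2^(t / doubling period). The short C sketch below evaluates that; the starting count and the 30-year horizon are illustrative choices, not figures from the slides.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double start_count = 2250.0;   /* e.g. the 4004's transistor count        */
    double doubling    = 1.5;      /* doubling period in years (18 months)    */

    /* Projected transistor count after t years, if the trend holds. */
    for (int years = 0; years <= 30; years += 10) {
        double count = start_count * pow(2.0, years / doubling);
        printf("after %2d years: ~%.0f transistors\n", years, count);
    }
    return 0;
}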

Evolution of Intel Microprocessors

4-, 8-, 16-, 32-, 64-bit (Word Length) The 4004 dealt with data in chunks of 4 bits at a time The Pentium 4 deals with data in chunks (words) of 32 bits The new Itanium processor deals with 64-bit chunks (words) at a time Why have more bits (longer words)?

kHz, MHz, GHz (Clock Frequency) The 4004 worked at a clock frequency of 108 kHz The latest processors have clock frequencies in the GHz range Of two uPs having similar designs, the one with the higher clock frequency will be more powerful The same is not true for two uPs of dissimilar designs. Example: of the PowerPC and Pentium 4 uPs working at the same frequency, the former performs better due to superior design. The same holds for the Athlon uP when compared with a Pentium
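A very rough model behind such comparisons, stated as a textbook-style simplification rather than anything from the slides: performance is roughly instructions-per-clock times clock frequency, so a better design (higher IPC) can beat a higher clock. The numbers below are hypothetical.

#include <stdio.h>

/* Millions of instructions per second = IPC x clock frequency in MHz.        */
static double mips(double instructions_per_clock, double clock_mhz)
{
    return instructions_per_clock * clock_mhz;
}

int main(void)
{
    /* Made-up designs, only to show the trade-off. */
    printf("design A: %.0f MIPS\n", mips(1.0, 2000.0));  /* high clock, low IPC   */
    printf("design B: %.0f MIPS\n", mips(2.5, 1000.0));  /* lower clock, high IPC */
    return 0;
}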

Enhancing the capability of a uP? The computing capability of a uP can be enhanced in many different ways: By increasing the clock frequency By increasing the word width By having a more effective caching algorithm and the right cache size By adding more functional units (e.g. ALUs, FPUs, Vector/SIMD units, etc.) By improving the architecture

Microprocessor Generations First generation: 1971-78 Behind the power curve (16-bit, <50k transistors)

Second Generation: 1979-85 Becoming real computers (32-bit, >50k transistors)

Third Generation: 1985-89 Challenging the establishment (Reduced Instruction Set Computer/RISC, >100k transistors)

Fourth Generation: 1990 Architectural and performance leadership (64-bit, > 1M transistors, Intel/AMD translate into RISC internally)

In the beginning (8-bit) Intel 4004 First general-purpose, single-chip microprocessor Shipped in 1971 8-bit architecture, 4-bit implementation 2,300 transistors Performance < 0.1 MIPS (Million Instructions Per Sec) 8008: 8-bit implementation in 1972 3,500 transistors First microprocessor-based computer (Micral) Targeted at laboratory instrumentation Mostly sold in Europe

1st Generation (16-bit) Intel 8086 Introduced in 1978 Performance < 0.5 MIPS

New 16-bit architecture Assembly language compatible with 8080 29,000 transistors Includes memory protection, support for Floating Point coprocessor

In 1981, IBM introduces the PC Based on the 8088, an 8-bit-bus version of the 8086

2nd Generation (32-bit) Motorola 68000 Major architectural step in microprocessors: First 32-bit architecture, with an initial 16-bit implementation

First flat 32-bit address Support for paging

General-purpose register architecture Loosely based on PDP-11 minicomputer

First implementation in 1979 68,000 transistors < 1 MIPS (Million Instructions Per Second)

Used in the Apple Mac and in Sun, Silicon Graphics, & Apollo workstations

3rd Generation: MIPS R2000 Several firsts: First (commercial) RISC microprocessor First microprocessor to provide integrated support for instruction & data cache First pipelined microprocessor (sustains 1 instruction/clock)

Implemented in 1985 125,000 transistors 5-8 MIPS (Million Instructions per Second)

4th Generation (64 bit) MIPS R4000 First 64-bit architecture Integrated on-chip caches Support for an off-chip, secondary cache

Integrated floating point Implemented in 1991: Deep pipeline 1.4M transistors Initially 100MHz

> 50 MIPS

Intel translates 80x86/Pentium instructions into RISC operations internally

Key Architectural Trends Performance increases at 1.6x per year (2X per 1.5 years) True from 1985 to the present

Combination of technology and architectural enhancements Technology provides faster transistors (speed roughly proportional to 1/lithographic feature size) and more of them Faster transistors lead to high clock rates More transistors (Moore's Law): architectural ideas turn transistors into performance, responsible for about half of the yearly performance growth

Two key architectural directions Sophisticated memory hierarchies Exploiting instruction level parallelism

Memory Hierarchies Caches: hide the latency of DRAM and increase bandwidth (BW) The CPU-DRAM access gap has grown by a factor of 30-50!

Trend 1: Increasingly large caches On-chip: from 128 bytes (1984) to 100,000+ bytes Multilevel caches: add another level of caching First multilevel cache: 1986 Secondary cache sizes today: 128,000 B to 16,000,000 B Third-level caches: 1998

Trend 2: Advances in caching techniques to reduce or hide cache miss latencies Early restart after a cache miss (1992) Nonblocking caches: continue during a cache miss (1994)
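A standard way to quantify what these techniques buy is the average-memory-access-time formula from computer-architecture textbooks (not stated explicitly in the slides): AMAT = hit time + miss rate x miss penalty. The figures in the sketch below are made up for illustration.

#include <stdio.h>

/* Average Memory Access Time = hit time + miss rate x miss penalty.          */
static double amat_ns(double hit_ns, double miss_rate, double penalty_ns)
{
    return hit_ns + miss_rate * penalty_ns;
}

int main(void)
{
    /* Illustrative values: 1 ns cache hit, 100 ns penalty for going to DRAM. */
    printf("5%% misses: %.1f ns average\n", amat_ns(1.0, 0.05, 100.0));  /* 6.0 */
    printf("1%% misses: %.1f ns average\n", amat_ns(1.0, 0.01, 100.0));  /* 2.0 */
    return 0;
}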

Cache-aware combinations of computers, compilers, and code writers Prefetching: an instruction to bring data into the cache early

Exploiting Instruction Level Parallelism (ILP) ILP is the implicit parallelism among instructions (the programmer is not aware of it) Exploited by overlapping execution in a pipeline and by issuing multiple instructions per clock Superscalar: uses dynamic issue decisions (HW driven) VLIW: uses static issue decisions (SW driven)
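A tiny C fragment showing the kind of implicit parallelism meant here; how any particular pipeline would schedule it is beyond the sketch.

/* The first three statements have no data dependences on each other, so a    */
/* superscalar or VLIW machine may execute them in the same cycle even        */
/* though the programmer wrote them one after another.                        */
void update(int *a, int *b, int *c)
{
    a[0] = a[0] + 1;   /* independent of the next two statements */
    b[0] = b[0] * 2;   /* independent                            */
    c[0] = c[0] - 3;   /* independent                            */

    /* This statement depends on the new a[0] and b[0], so it must wait for   */
    /* them: the dependence limits the available ILP.                         */
    c[1] = a[0] + b[0];
}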

1985: simple microprocessor pipeline (1 instruction/clock) 1990: first static multiple-issue microprocessors 1995: sophisticated dynamic schemes - determine parallelism dynamically, execute instructions out of order, speculative execution depending on branch prediction

Off-the-shelf ILP techniques yielded a 15-year path of 2X performance every 1.5 years => 1000X faster!

Where have all the transistors gone? Superscalar execution (multiple instructions per clock cycle) Multiple levels of cache Branch prediction (predict the outcome of decisions) Out-of-order execution (executing instructions in a different order than the programmer wrote them)

Diminishing Returns on Investment Until recently: Microprocessor effective work per clock cycle (instructions per clock) goes up by ~ the square root of the number of transistors Microprocessor clock rate goes up as the lithographic feature size shrinks

With more than 4 instructions per clock, microprocessor performance increases even less efficiently

Chip-wide wires no longer scale with technology They get relatively slower than gates (by a factor of (1/scale)^3) More complicated processors have longer wires

New view: Cluster-On-a-Chip (CoC) Use several simple processors on a single chip: Performance goes up linearly with the number of transistors Simpler processors can run at faster clocks Less design cost/time, less time-to-market risk (reuse)

Inspiration: Google Search engine for the world: 100M/day Economical, scalable building block: the PC cluster Today: 8000 PCs, 16000 disks Advantages in fault tolerance, scalability, cost/performance

32-bit MPU as the new Transistor A cluster on a chip with 1000s of processors enables amazing MIPS/$ and MIPS/watt for cluster applications MPUs combined with dense memory + system-on-a-chip CAD

Concluding Remarks A great 30-year history and a challenge for the next 30! Not a wall in performance growth, but a slowing down Diminishing returns on silicon investment

But we need to use the right metrics. Not just raw (peak) performance, but: Performance per transistor Performance per Watt

Possible New Direction? Consider true multiprocessing? Key question: could multiprocessors on a single piece of silicon be much easier to use efficiently than today's multiprocessors?

Microprocessor History Early microprocessors: MOS technology, slow and awkward to interface with the TTL family; 4-bit processors; instructions were executed in about 20 μs. Intel 4004: the first MP; 4K address space. Intel 8008: can manipulate a whole byte; 16 Kbytes address space; 50,000 operations/second.

N-channel MOSFET, 1970. Faster than earlier (P-channel) MOS. Works with a positive supply; easy to interface with TTL. 1973: Intel 8080 MP. 500,000 operations/second. 64K bytes of memory. Upward software compatible with the 8008. Other brands include the MC6800, Fairchild's F-8, etc.

Basic types of MP Two types: single-component microprocessors and bit-sliced microprocessors Bit-sliced microprocessors can be cascaded to build functioning systems with word sizes from 4 bits to 200 bits.

Single-component Microcomputer Composed of a processor, read-only memory (for program storage), read/write memory (for data storage), input/output connections for interfacing, and a timer as event counter

Examples: Intel 8048, Motorola 6805R2. Used in ovens, washing machines, dishwashers, etc.

Modern MP 8-, 16-, 32-, and 64-bit uPs are available. Intel 8085, Motorola 6800: 8-bit word, 16-bit address. Intel 8088, 8086, Motorola 68000: 16-bit word, 20-bit address. 80186: never used. 286: real mode and protected mode, 16MB memory. 386: paging, 4GB memory, 32-bit word. 486: math coprocessor, L1 cache

Modern MP Pentium: 64-bit I/O off the chip but processes 32-bit words (exception: floating point is processed as 64 bits), cache doubled, instruction pipelining.

Pentium Pro: L2 cache, improved pipelining

Pentium MMX: Multi-Media eXtensions, 57 new integer instructions mostly used for multimedia programming

Pentium II, III, IV: Pentium Pro core with MMX technology, increased L2 cache, full 64-bit operation

RISC Reduced Instruction Set Computer: uniform-length instructions, faster in operation, but cannot perform as many different things as a CISC.

Basic MP architecture Fetch, decode, execute.

PC increment. The first instruction fetch is from address 0000H for the 8085 and from FFFF0H for the 8086/8088
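A minimal sketch of the fetch-decode-execute loop in C. The machine below is a toy: the two-instruction opcode map and the accumulator-only programming model are invented for the sketch; only the 64K memory and the 0000H reset value of the PC mirror the 8085 described above.

#include <stdint.h>
#include <stdio.h>

#define MEM_SIZE 65536                 /* 64K address space, like the 8085 */

static uint8_t  memory[MEM_SIZE];
static uint8_t  acc;                   /* accumulator                      */
static uint16_t pc = 0x0000;           /* 8085 starts fetching at 0000H    */

/* Toy opcodes, invented for this sketch. */
enum { OP_HALT = 0x00, OP_LOAD_IMM = 0x01, OP_ADD_IMM = 0x02 };

int main(void)
{
    /* A tiny program: load 5, add 3, halt. */
    memory[0] = OP_LOAD_IMM; memory[1] = 5;
    memory[2] = OP_ADD_IMM;  memory[3] = 3;
    memory[4] = OP_HALT;

    for (;;) {
        uint8_t opcode = memory[pc++];      /* FETCH, then increment the PC */
        switch (opcode) {                   /* DECODE                       */
        case OP_LOAD_IMM: acc = memory[pc++];                   break;  /* EXECUTE */
        case OP_ADD_IMM:  acc = (uint8_t)(acc + memory[pc++]);  break;
        case OP_HALT:     printf("acc = %u\n", acc);            return 0;
        default:          return 1;         /* unknown opcode               */
        }
    }
}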

Memory Interfacing and IO decoding Interfacing needs buses (unidirectional or bidirectional) and isolation and separation of the signals from the different devices connected to the MP. Memory map: a pictorial representation of the whole range of the memory address space. It defines which memory system is where, their sizes, etc.

Address space or range. 8086 has 1M address space in minimum mode. 8085 has 64K address space.

Address Decoding An address decoder is a digital circuit that indicates that a particular area of memory is being addressed, or pointed to, by the MP. Absolute address decoding

Decode an address to one single output: decode 10110 so that you get a signal from the decoder only when it receives exactly that bit pattern.

Partial address decoding Some bits are treated as don't-care so that the decoder gives a signal for a range of consecutive bit patterns.

Absolute decoding

Partial decoding When a range of addresses is decoded, it is called partial decoding. For example, if we need to generate a control signal for an address generated by the MP within the range FFF0-FFFF, that is partial decoding. A decoder or multiplexer can be used for address decoding
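The two decoding styles can be sketched in C, with the decoder's select signal modeled as a boolean return value. The 10110 pattern and the FFF0-FFFF range come from the text; treating the decoder as software is of course only an illustration of the logic.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Absolute decoding: assert the select only for one exact bit pattern.      */
/* Here the decoder fires only for the 5-bit pattern 10110 (0x16).           */
static bool absolute_decode(uint8_t addr_bits)
{
    return (addr_bits & 0x1F) == 0x16;       /* exactly 10110                */
}

/* Partial decoding: the low 4 address bits are don't-care, so the select    */
/* covers the whole range FFF0H-FFFFH.                                       */
static bool partial_decode(uint16_t address)
{
    return (address & 0xFFF0) == 0xFFF0;     /* high 12 bits must match      */
}

int main(void)
{
    printf("%d %d\n", absolute_decode(0x16), absolute_decode(0x17));   /* 1 0 */
    printf("%d %d\n", partial_decode(0xFFF5), partial_decode(0xFFE0)); /* 1 0 */
    return 0;
}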

Sample Microprocessor 8085 Microprocessor Internal architecture of 8085

Flag register

1. S: after the execution of an arithmetic operation, if bit 7 of the result is 1, the sign flag is set. 2. Z: the bit is set if an ALU operation results in a zero in the Acc or registers. 3. AC: the bit is set when a carry is generated by bit 3 and passed on to bit 4. 4. P: the parity bit is set when the result has an even number of 1s. 5. CY: the carry flag is set when the result generates a carry; it also serves as a borrow flag. Accumulator Holds data for manipulation (arithmetic, logical). Whenever an operation combines two words, either arithmetically or logically, the accumulator contains one word (say A) and the other word (say B) may be contained in a register or in a memory location. After the operation, the result is placed in the Acc, replacing word A. It is the major working register; the MP can work directly on the Acc. Also used for programmed data transfer.
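A sketch of how those five flags could be derived after an 8-bit addition, written as a plain C model of the rules listed above rather than as a description of the 8085's actual circuitry.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool s, z, ac, p, cy; } Flags;

/* Add two 8-bit values and derive the S, Z, AC, P, and CY flags. */
static uint8_t add_and_set_flags(uint8_t a, uint8_t b, Flags *f)
{
    uint16_t wide   = (uint16_t)a + (uint16_t)b;
    uint8_t  result = (uint8_t)wide;

    f->s  = (result & 0x80) != 0;               /* sign: bit 7 of the result   */
    f->z  = (result == 0);                      /* zero result                 */
    f->ac = ((a & 0x0F) + (b & 0x0F)) > 0x0F;   /* carry from bit 3 into bit 4 */
    f->cy = (wide > 0xFF);                      /* carry out of bit 7          */

    /* Parity: set when the result contains an even number of 1 bits. */
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (result >> i) & 1;
    f->p = (ones % 2) == 0;

    return result;
}

int main(void)
{
    Flags f;
    uint8_t r = add_and_set_flags(0x3A, 0xC6, &f);   /* 0x3A + 0xC6 = 0x100, result 00 */
    printf("result=%02X S=%d Z=%d AC=%d P=%d CY=%d\n", r, f.s, f.z, f.ac, f.p, f.cy);
    return 0;
}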

General purpose registers Six registers: B, C, D, E, H and L can each store 8-bit data. They can be combined in pairs to perform some 16-bit operations.

ALU Arithmetic Logic Unit. Two input ports, one output port. Performs AND, OR, ExOR, add, subtract, complement, increment, decrement, shift left, shift right. The ALU's two temporary registers are connected to the MP's internal bus, from which it can take data from any register. It can place data directly on the data bus through its single output port.

Program counter Its job is to keep track of which instruction is being used and what the next instruction will be. For the 8085 it is 16 bits long. It can get data from the internal bus as well as from a memory location. The PC automatically increments to point to the next memory location during the execution of the present instruction. The PC value can be changed by some instructions.

Stack pointer A 16-bit register that acts as a memory pointer. It can save the value of the program counter for later use. It points to a region of memory called the stack, which follows a LIFO (last-in, first-out) discipline. After every stack operation the SP points to the next available location of the stack. It usually decrements.
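A minimal C model of the LIFO behaviour described here, using byte-wise push and pop with a decrement-on-push stack pointer; this is a simplification, not the 8085's register-pair PUSH/POP instructions, and the starting SP value is chosen arbitrarily.

#include <stdint.h>
#include <stdio.h>

static uint8_t  memory[0x10000];   /* 64K address space                       */
static uint16_t sp = 0xFFFF;       /* stack pointer, starting near top of RAM */

/* Push: decrement SP, then store; SP ends up pointing at the most           */
/* recently written stack location.                                           */
static void push_byte(uint8_t value)
{
    memory[--sp] = value;
}

/* Pop: read the value, then move SP back up (last in, first out). */
static uint8_t pop_byte(void)
{
    return memory[sp++];
}

int main(void)
{
    push_byte(0x11);
    push_byte(0x22);
    uint8_t first  = pop_byte();    /* 0x22, the value pushed last */
    uint8_t second = pop_byte();    /* 0x11 */
    printf("%02X %02X\n", first, second);
    return 0;
}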

Memory address register The PC sends an address to the MAR. The MAR points to the location in memory from which the content is to be fetched. The PC increments but the MAR does not. If the content is an instruction, the IR decodes it. During execution, if it is required to fetch another word from memory, the PC is loaded with the new value; the PC again sends it to the MAR and the fetch operation starts.

Instruction register Holds the instruction the micro is currently executing; 8 bits long. Related blocks: instruction decoder, control logic, internal data bus. The 8085 itself is a 40-pin DIP running from +5V at 3-5 MHz; its pins are grouped into the address bus, data bus, control and status signals, power supply and frequency, externally initiated signals, and serial I/O ports.

ADD/DATA bus The address bus is 16 bits wide: the higher 8 bits, A8 to A15, are unidirectional; the lower 8 bits, AD0 to AD7, are multiplexed with data. These pins are bidirectional when used as the data bus.

Data bus 8 bit long: AD0 to AD7

Control signals

ALE: active-high output used to latch the lower 8 address bits. RD, WR: active-low output signals. IO/M: output signal used to differentiate memory and IO operations. S1 and S0: status output signals that identify various operations.

External control signals INTR: interrupt request, input signal. INTA: interrupt acknowledge, output signal. RST 7.5, RST 6.5, RST 5.5: restart interrupts; vectored interrupts with higher priority. TRAP: non-maskable interrupt, highest priority. HOLD: request for control of the buses, input signal. HLDA: hold acknowledge, output signal. READY: input signal; when it is low, the MP waits for an integral number of clock cycles until it goes high.
