
CPE 408340 Computer Organization Chapter 1: Computer Abstractions and Technology

Saed R. Abed
[Computer Engineering Department, Hashemite University] [Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, 2005, UCB]

Course Administration
Instructor: Saed Rasmi Abed
Instructor's e-mail: sabed@hu.edu.jo
Office Hours: Mon, Wed: 9:00 - 10:00 or by appointment
Lecture Time: Mon, Wed: 12:30 - 2:00
Text:
- Required: Computer Organization and Design, 4th Edition, Patterson and Hennessy, 2008
- Optional: Computer Organization and Architecture: Designing for Performance, 7th Edition, William Stallings, Prentice Hall, July 2005
Slides: PDF on the course web page (Moodle System)

Course Content
Content
Principles of computer architecture: CPU datapath and control unit design (single-issue pipelined, superscalar, VLIW), memory hierarchies and design, I/O organization and design, advanced processor design (multiprocessors).

Course goals
- To learn the organizational paradigms that determine the capabilities and performance of computer systems.
- To understand the interactions between the computer's architecture and its software, so that future software designers (compiler writers, operating system designers, database programmers, ...) can achieve the best cost-performance trade-offs and so that future architects understand the effects of their design choices on software applications.

Course prerequisites
CPE 408330: Assembly Language and Microprocessor Systems.

What You Should Know


- Basic logic design & machine organization: logical minimization, FSMs, component design; processor, memory, I/O.
- Create, assemble, run, and debug programs in an assembly language (MIPS preferred).

To learn the organizational paradigms that determine the capabilities and performance of computer systems.

To explore the memory hierarchy system and how to interface it to a computer.

Course Structure
Design focused class:

Various homework assignments throughout the semester

Lectures:
Topic                                   Chapter    Duration     Sections
Computer Abstractions and Technology    Chapter 1  2 Weeks      Sec. 1.1 to 1.4
Instructions: Language of the Computer  Chapter 2  2 1/2 Weeks  Sec. 2.1 to 2.7 & 2.10
Arithmetic for Computers                Chapter 3  1 1/2 Weeks  Sec. 3.1 to 3.4
Review and First Exam                              1/2 Week
The Processor                           Chapter 4  4 Weeks      Sec. 4.1 to 4.9
Review and Second Exam                             1/2 Week
Exploiting Memory Hierarchy             Chapter 5  3 Weeks      Sec. 5.1 to 5.3 & 5.5
Review and Final Exam                              1/2 Week

Grading Information
Grade determinants
- First Exam: ~20%, Monday, March 12th
- Second Exam: ~25%, Monday, April 16th
- Final Exam: ~50%, TBD
- Class participation & pop quizzes: ~5%

Let me know about any exam conflicts ASAP

Ethics and Professionalism


Ethics
- Disciplined dealing with moral duty.
- Moral principles or practice.
- System of right behavior.

Professionalism
- The conduct, aims or qualities that characterize a professional person.

What characterizes a professional?

- A professional accepts responsibility fully - does not blame others for failure.
- A professional is reliable - gets the job done on time.
- A professional is competent - gets the correct answer.
- A professional works independently - finds out what he/she does not know.
- A professional follows up on all the details.
- A professional has high standards of ethical behavior - does not lie or cheat.
- A professional does not steal the work of others and present it as his/her own.

What characterizes a professional?


- A professional is respectful to others.
- A professional does not offer excuses in lieu of completed work.
- A professional is resourceful.
- A professional has initiative.
- A professional succeeds in spite of obstacles and road blocks.
- A professional has justifiable self-confidence.

The Student is the Product of our Engineering School

We are an accredited engineering school: our product is engineering professionals. Employers expect our graduates to behave like professionals. Employers seek the qualities of a professional in job interviews. Professionalism must start in the first semester and be part of every course over four years.


The Student is the Product of our Engineering School


Every student must learn to think like an engineer:
- accept responsibility for his/her own learning
- follow up on lecture material and homework
- learn problem-solving skills, not just how to solve each specific homework problem
- build a body of knowledge integrated over four years of courses

We all want HU's excellent reputation to be reinforced so that employers will hire our graduates!

"By the architecture of a system, I mean the complete and detailed specification of the user interface. As Blaauw has said, 'Where architecture tells what happens, implementation tells how it is made to happen.'"

The Mythical Man-Month, Brooks, pg 45

Moore's Law
In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.

- 2300 transistors, 1 MHz clock (Intel 4004) - 1971
- 16 million transistors (Ultra Sparc III)
- 42 million transistors, 2 GHz clock (Intel Xeon) - 2001
- 55 million transistors, 3 GHz, 130 nm technology, 250 mm^2 die (Intel Pentium 4) - 2004
- 140 million transistors (HP PA-8500)
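As a rough illustration of what the doubling rule implies, the sketch below projects the Intel 4004 transistor count forward to 2001 under both doubling periods quoted on the slide. The function name and the choice of 2001 as the target year are assumptions made for this example; the result is only an order-of-magnitude comparison with the 42 million transistors listed for the 2001 Intel Xeon.

```python
def projected_transistors(start_count, start_year, year, doubling_months):
    """Project a transistor count assuming one doubling every `doubling_months`."""
    doublings = (year - start_year) * 12 / doubling_months
    return start_count * 2 ** doublings


# Intel 4004: 2300 transistors in 1971, projected to 2001.
slow = projected_transistors(2300, 1971, 2001, doubling_months=24)  # 15 doublings
fast = projected_transistors(2300, 1971, 2001, doubling_months=18)  # 20 doublings

print(f"24-month doubling: {slow:,.0f} transistors")
print(f"18-month doubling: {fast:,.0f} transistors")
```

The 24-month projection lands within a factor of two of the Xeon figure, while the 18-month projection overshoots it by a wide margin, which shows how sensitive an exponential is to the assumed doubling period.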

Where is the Market?


[Figure: bar chart of millions of computers sold per year, 1998 to 2002, in three categories: Embedded, Desktop, Servers]

Processor Performance Increase


[Figure: Performance (SPEC Int, log scale from 10 to 10000) versus Year, 1987 to 2003. Machines plotted: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, IBM POWER 100, DEC AXP/500, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]

Growth Capacity of DRAM Chips

K = 1024 (2^10)

In recent years the growth rate has slowed to 2x every 2 years

The Evolution of Computer Hardware
When was the first transistor invented?
Modern-day electronics began with the invention in 1947 of the transfer resistor (the bipolar transistor) by Bardeen et al. at Bell Laboratories.

The Evolution of Computer Hardware
When was the first IC (integrated circuit) invented?
In 1958 the IC was born when Jack Kilby at Texas Instruments successfully interconnected, by hand, several transistors, resistors and capacitors on a single substrate.

The Underlying Technologies

Year  Technology                    Relative Perform/Unit Cost
1951  Vacuum Tube                   1
1965  Transistor                    35
1975  Integrated Circuit (IC)       900
1995  Very Large Scale IC (VLSI)    2,400,000
2005  Submicron VLSI                6,200,000,000

What if technology in the transportation industry advanced at the same rate?

The PowerPC 750
- Introduced in 1999
- 3.65M transistors
- 366 MHz clock rate
- 40 mm^2 die size
- 250 nm (0.25 micron) technology

Technology Outlook
High Volume Manufacturing     2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)            90     65     45     32     22     16     11      8
Integration Capacity (BT)        2      4      8     16     32     64    128    256
Delay = CV/I scaling           0.7   ~0.7   >0.7   Delay scaling will slow down
Energy/Logic Op scaling      >0.35   >0.5   >0.5   Energy scaling will slow down
Bulk Planar CMOS             High Probability at the early nodes, Low Probability at the later nodes
Alternate, 3G etc            Low Probability at the early nodes, High Probability at the later nodes
Variability                  Medium, then High, then Very High
ILD (K)                      ~3, then <3; reduce slowly towards 2 to 2.5
RC Delay                     1 at every node
Metal Layers                 6-7, 7-8, 8-9; 0.5 to 1 layer per generation

Impacts of Advancing Technology


Processor
- logic capacity: increases about 30% per year
- performance: 2x every 1.5 years

ClockCycle = 1/ClockRate
- 500 MHz ClockRate = 2 nsec ClockCycle
- 1 GHz ClockRate = 1 nsec ClockCycle
- 4 GHz ClockRate = 250 psec ClockCycle

Memory
- DRAM capacity: 4x every 3 years, now 2x every 2 years
- memory speed: 1.5x every 10 years
- cost per bit: decreases about 25% per year

Disk
- capacity: increases about 60% per year

Computer Organization and Design


This course is all about how computers work.
But what do we mean by a computer?
- Different types: embedded, laptop, desktop, server
- Different uses: automobiles, graphics, finance, genomics, ...
- Different manufacturers: Intel, AMD, IBM, HP, Apple, Sony, Sun
- Different underlying technologies and different costs!

Best way to learn:
- Focus on a specific instance and learn how it works
- While learning general principles and historical perspectives

Example Machine Organization


Workstation design target
- 25% of cost on processor
- 25% of cost on memory (minimum memory size)
- Rest on I/O devices, power supplies, box

[Figure: Computer = CPU (Control, Datapath) + Memory + Devices (Input, Output)]

Embedded Computers in Your Car

Why Learn this Stuff?


- You want to call yourself a computer scientist/engineer
- You want to build HW/SW people use (so you need to deliver performance at low cost)
- You need to make a purchasing decision or offer expert advice
- Both hardware and software affect performance
  - The algorithm determines the number of source-level statements
  - The language/compiler/architecture determine the number of machine-level instructions (Chapters 1, 2 and 3)
  - The processor/memory determine how fast machine-level instructions are executed (Chapters 4 and 5)

What is a Computer?
Components:
- processor (datapath, control)
- input (mouse, keyboard)
- output (display, printer)
- memory (cache (SRAM), main memory (DRAM), disk drive, CD/DVD)
- network

Our primary focus: the processor (datapath and control)
- Implemented using millions of transistors
- Impossible to understand by looking at each transistor
- We need abstraction!

Major Components of a Computer

PC Motherboard Closeup

Inside the Pentium 4 Processor Chip

Below the Program


High-level language program (in C)

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];       /* save v[k] */
        v[k] = v[k+1];     /* move v[k+1] down */
        v[k+1] = temp;     /* put the saved value in v[k+1] */
    }

one-to-many: C compiler

Assembly language program (for MIPS)

    swap: sll $2, $5, 2
          add $2, $4, $2
          lw  $15, 0($2)
          lw  $16, 4($2)
          sw  $16, 0($2)
          sw  $15, 4($2)
          jr  $31

one-to-one: assembler

Machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    . . .

Advantages of Higher-Level Languages?

Higher-level languages
- Allow the programmer to think in a more natural language and for their intended use (Fortran for scientific computation, Cobol for business programming, Lisp for symbol manipulation, Java for web programming, ...)
- Improve programmer productivity: more understandable code that is easier to debug and validate
- Improve program maintainability
- Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine)
- Emergence of optimizing compilers that produce very efficient assembly code optimized for the target machine

As a result, very little programming is done today at the assembler level

Machine Organization
- Capabilities and performance characteristics of the principal Functional Units (FUs), e.g., register file, ALU, multiplexors, memories, ...
- The ways those FUs are interconnected, e.g., buses
- Logic and means by which information flow between FUs is controlled
- The machine's Instruction Set Architecture (ISA)
- Register Transfer Level (RTL) machine description

Instruction Set Architecture (ISA)


ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

"... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, and Brooks, 1964)

- Enables implementations of varying cost and performance to run identical software

ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

ISA Type Sales


[Figure: bar chart of millions of processors sold per year, 1998 to 2002, by ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other. Approximate values; see text for correct values.]

Major Components of a Computer

[Figure: Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

Below the Program


High-level language program (in C)

    swap (int v[], int k) . . .

C compiler

Assembly language program (for MIPS)

    swap: sll $2, $5, 2
          add $2, $4, $2
          lw  $15, 0($2)
          lw  $16, 4($2)
          sw  $16, 0($2)
          sw  $15, 4($2)
          jr  $31

assembler

Machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Input Device Inputs Object Code


    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

[Figure: the object code entering through the Input device; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

Object Code Stored in Memory

[Figure: the object code held in Memory; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Processor Fetches an Instruction


Processor fetches an instruction from memory

[Figure: the object code held in Memory; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Control Decodes the Instruction


Control decodes the instruction to determine what to execute

    000000 00100 00010 0001000000100000

[Figure: the instruction word held in Control; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

Datapath Executes the Instruction


Datapath executes the instruction as directed by control

    000000 00100 00010 0001000000100000

    contents of Reg #4 ADD contents of Reg #2, result put in Reg #2

[Figure: the instruction word held in Control and the addition performed in the Datapath; Memory, Devices (Input, Output), Network]
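To make the decode step concrete, here is a minimal sketch that splits the instruction word shown on this slide, 000000 00100 00010 0001000000100000, into the fields of the register format (OP, rs, rt, rd, sa, funct). The function name `decode_r_type` and the use of a dictionary are choices made for this illustration, not part of the slides; the 6/5/5/5/5/6-bit field widths are those of the standard MIPS register format.

```python
def decode_r_type(bits):
    """Split a 32-bit MIPS word, given as a binary string, into register-format fields."""
    word = int(bits.replace(" ", ""), 2)
    return {
        "op": (word >> 26) & 0x3F,     # bits 31..26
        "rs": (word >> 21) & 0x1F,     # bits 25..21, first source register
        "rt": (word >> 16) & 0x1F,     # bits 20..16, second source register
        "rd": (word >> 11) & 0x1F,     # bits 15..11, destination register
        "sa": (word >> 6) & 0x1F,      # bits 10..6, shift amount
        "funct": word & 0x3F,          # bits 5..0, selects the ALU operation
    }


fields = decode_r_type("000000 00100 00010 0001000000100000")
print(fields)
```

The decoded fields give rs = 4, rt = 2 and rd = 2 with funct = 100000 in binary, which matches the slide's description: add the contents of Reg #4 and Reg #2 and put the result in Reg #2.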

What Happens Next?


Processor fetches the next instruction from memory

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

[Figure: the Fetch, Decode, Exec cycle; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

How does it know which location in memory to fetch from next?

Processor Organization
Control needs to have circuitry to
- Decide which is the next instruction and input it from memory
- Decode the instruction
- Issue signals that control the way information flows between datapath components
- Control what operations the datapath's functional units perform

Datapath needs to have circuitry to
- Execute instructions: functional units (e.g., adder) and storage locations (e.g., register file)
- Interconnect the functional units so that the instructions can be executed as required
- Load data from and store data to memory

What location does it load from and store to?
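The division of labor described above can be sketched as a small fetch-decode-execute loop. This is only an illustration under simplifying assumptions: instructions are kept in symbolic form rather than as bits, registers are unbounded Python integers, memory is a dictionary keyed by byte address, and the names `SWAP` and `run`, the array address 100, and the return address 4096 are all made up for the example. A program counter (`pc`) records which instruction to fetch next, and the base register plus offset gives the location for each load and store.

```python
# The swap routine in symbolic form: (op, a, b, c).
# sll: regs[a] = regs[b] << c          add: regs[a] = regs[b] + regs[c]
# lw/sw: a = data register, b = base register, c = byte offset
# jr: jump to the address held in regs[a]
SWAP = [
    ("sll", 2, 5, 2),
    ("add", 2, 4, 2),
    ("lw", 15, 2, 0),
    ("lw", 16, 2, 4),
    ("sw", 16, 2, 0),
    ("sw", 15, 2, 4),
    ("jr", 31, 0, 0),
]


def run(program, regs, memory):
    """Run until the program counter leaves the program; return the final pc."""
    pc = 0  # byte address of the next instruction (4 bytes per instruction)
    while 0 <= pc < 4 * len(program):
        op, a, b, c = program[pc // 4]   # fetch
        pc += 4                          # default: fall through to the next instruction
        if op == "sll":                  # decode and execute
            regs[a] = regs[b] << c
        elif op == "add":
            regs[a] = regs[b] + regs[c]
        elif op == "lw":
            regs[a] = memory[regs[b] + c]
        elif op == "sw":
            memory[regs[b] + c] = regs[a]
        elif op == "jr":
            pc = regs[a]
        else:
            raise ValueError(f"unknown op {op!r}")
    return pc


regs = [0] * 32
regs[4] = 100     # $4: base address of v (hypothetical)
regs[5] = 1       # $5: k
regs[31] = 4096   # $31: return address outside the routine (hypothetical)
memory = {100: 10, 104: 20, 108: 30}   # v[0], v[1], v[2] as 4-byte words

final_pc = run(SWAP, regs, memory)
print(memory)
```

After the run, the words at addresses 104 and 108 (v[1] and v[2]) have traded places, and the loop stops because `jr $31` moved the program counter outside the routine.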

Output Data Stored in Memory


At program completion the data to be output resides in memory

    00000100010100000000000000000000
    00000000010011110000000000000100
    00000011111000000000000000001000

[Figure: the data words held in Memory; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

Output Device Outputs Data

    00000100010100000000000000000000
    00000000010011110000000000000100
    00000011111000000000000000001000

[Figure: the data words leaving through the Output device; Processor (Control, Datapath), Memory, Devices (Input, Output), Network]

The Instruction Set Architecture (ISA)

[Figure: software and hardware layers, with the instruction set architecture as the boundary between them]

The interface description separating the software and hardware

The MIPS ISA


Instruction Categories
- Load/Store
- Computational
- Jump and Branch
- Floating Point
  - coprocessor
- Memory Management
- Special

Registers: R0 - R31, PC, HI, LO

3 Instruction Formats: all 32 bits wide

    OP | rs | rt | rd | sa | funct
    OP | rs | rt | immediate
    OP | jump target

Q: How many already familiar with MIPS ISA?

How Do the Pieces Fit Together?


[Figure: levels of abstraction, with labels: Applications, Operating System, Compiler, Firmware, Instruction Set Architecture, Memory system, I/O system, Network, Processor, Datapath & Control, Digital Design, Circuit Design]

- Coordination of many levels of abstraction
- Under a rapidly changing set of forces
- Design, measurement, and evaluation

Performance Metrics
Purchasing perspective
- given a collection of machines, which has the
  - best performance?
  - least cost?
  - best cost/performance?

Design perspective
- faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best cost/performance?

Both require
- basis for comparison
- metric for evaluation

Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Which of these airplanes has the best performance?


Airplane            Passengers  Range (mi)  Speed (mph)
Boeing 737-100             101         630          598
Boeing 747                 470        4150          610
BAC/Sud Concorde           132        4000         1350
Douglas DC-8-50            146        8720          544

How much faster is the Concorde compared to the 747?
How much bigger is the 747 than the Douglas DC-8?
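The two questions can be answered directly from the table, and the answer to "best performance" changes with the metric chosen. The sketch below computes both ratios and, as an extra metric not shown on the slide, passenger throughput (passengers x mph); the variable names are chosen for this example only.

```python
# (passengers, range in miles, cruising speed in mph), taken from the table above
planes = {
    "Boeing 737-100": (101, 630, 598),
    "Boeing 747": (470, 4150, 610),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}

# How much faster is the Concorde than the 747?
speed_ratio = planes["BAC/Sud Concorde"][2] / planes["Boeing 747"][2]

# How much bigger (in passengers) is the 747 than the DC-8?
capacity_ratio = planes["Boeing 747"][0] / planes["Douglas DC-8-50"][0]

# Assumed extra metric: passenger throughput = passengers x speed
throughput = {name: p * mph for name, (p, _, mph) in planes.items()}

fastest = max(planes, key=lambda name: planes[name][2])
highest_throughput = max(throughput, key=throughput.get)

print(f"Concorde vs 747 speed: {speed_ratio:.2f}x")
print(f"747 vs DC-8 capacity: {capacity_ratio:.2f}x")
print(f"Fastest: {fastest}; highest throughput: {highest_throughput}")
```

By speed the Concorde wins, but by passenger throughput the 747 does, which is the same distinction drawn later between response time and throughput.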

Computer Performance: TIME, TIME, TIME


Response Time (latency)
- How long does it take for my job to run?
- How long does it take to execute a job?
- How long must I wait for the database query?

Throughput
- How many jobs can the machine run at once?
- What is the average execution rate?
- How much work is getting done?

If we upgrade a machine with a new processor what do we increase?
If we add a new machine to the lab what do we increase?

Execution Time
Elapsed Time
- counts everything (disk and memory accesses, I/O, etc.)
- a useful number, but often not good for comparison purposes

CPU time
- doesn't count I/O or time spent running other programs
- can be broken up into system time and user time

Our focus: user CPU time
- time spent executing the lines of code that are "in" our program

Book's Definition of Performance


For some program running on machine X,

$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

"X is n times faster than Y" means

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = n$$

Problem:
- machine A runs a program in 20 seconds
- machine B runs the same program in 25 seconds
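Applying the definition to the problem: since performance is the reciprocal of execution time, the ratio of performances equals the inverse ratio of execution times. The helper name below is an assumption for this sketch.

```python
def times_faster(time_x, time_y):
    """Return n such that machine X is n times faster than machine Y."""
    performance_x = 1 / time_x
    performance_y = 1 / time_y
    return performance_x / performance_y   # equals time_y / time_x


n = times_faster(time_x=20, time_y=25)   # machine A versus machine B
print(f"A is {n:.2f} times faster than B")
```

So machine A is 25/20 = 1.25 times faster than machine B.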

Defining (Speed) Performance


Normally interested in reducing
- Response time (aka execution time): the time between the start and the completion of a task
  - Important to individual users
- Thus, to maximize performance, need to minimize execution time

$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$

If X is n times faster than Y, then

$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$

- Throughput: the total amount of work done in a given time
  - Important to data center managers

Decreasing response time almost always improves throughput

Performance Factors
Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time): time the CPU spends working on a task
- Does not include time waiting for I/O or running other programs

$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

or

$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
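A short numeric sketch of the two improvement routes named above. The cycle count (10 billion) and clock rate (2 GHz) are hypothetical values chosen for this illustration, as is the function name.

```python
def cpu_time_seconds(clock_cycles, clock_rate_hz):
    """CPU execution time = clock cycles for the program / clock rate."""
    return clock_cycles / clock_rate_hz


cycles = 10e9       # hypothetical: 10 billion clock cycles
clock_rate = 2e9    # hypothetical: 2 GHz

baseline = cpu_time_seconds(cycles, clock_rate)
fewer_cycles = cpu_time_seconds(cycles / 2, clock_rate)    # halve the cycle count
faster_clock = cpu_time_seconds(cycles, 2 * clock_rate)    # halve the clock cycle time

print(baseline, fewer_cycles, faster_clock)
```

Halving the number of cycles and halving the clock cycle time (doubling the clock rate) each halve the CPU time, provided the other factor really stays fixed.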

Review: Machine Clock Rate


Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

    CC = 1 / CR

[Figure: clock waveform with one clock period marked]

    10 nsec clock cycle  => 100 MHz clock rate
     5 nsec clock cycle  => 200 MHz clock rate
     2 nsec clock cycle  => 500 MHz clock rate
     1 nsec clock cycle  =>   1 GHz clock rate
    500 psec clock cycle =>   2 GHz clock rate
    250 psec clock cycle =>   4 GHz clock rate
    200 psec clock cycle =>   5 GHz clock rate
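The table follows from CC = 1 / CR once the units are lined up. A minimal sketch, with a helper name chosen for this example, that converts a clock rate in Hz to a cycle time in picoseconds:

```python
def cycle_time_ps(clock_rate_hz):
    """Clock cycle time in picoseconds for a clock rate in Hz (CC = 1 / CR)."""
    return 1e12 / clock_rate_hz


for rate_hz in (100e6, 200e6, 500e6, 1e9, 2e9, 4e9, 5e9):
    ps = cycle_time_ps(rate_hz)
    print(f"{rate_hz / 1e6:6.0f} MHz -> {ps:7.0f} ps ({ps / 1000:g} ns)")
```

Each doubling of the clock rate halves the cycle time, so the step from 2 GHz to 4 GHz saves only 250 ps per cycle, compared with 5 ns for the step from 100 MHz to 200 MHz.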

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles

$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$

Clock ticks indicate when to start activities (one abstraction):

[Figure: clock ticks along a time axis]

- cycle time = time between ticks = seconds per cycle
- clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

A 4 GHz clock has a cycle time of

$$\frac{1}{4 \times 10^{9}} \times 10^{12} = 250 \text{ picoseconds (ps)}$$

How to Improve Performance


$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$

So, to improve performance (everything else being equal) you can either (increase or decrease?)

- ________ the # of required cycles for a program, or
- ________ the clock cycle time or, said another way,
- ________ the clock rate.

How many cycles are required for a program?


Could assume that number of cycles equals number of instructions

[Figure: time axis on which the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instructions each occupy one equal slot]

This assumption is incorrect; different instructions take different amounts of time on different machines.
Why? Hint: remember that these are machine instructions, not lines of C code.

Different numbers of cycles for different instructions

[Figure: time axis on which instructions occupy slots of different lengths]

- Multiplication takes more time than addition
- Floating point operations take longer than integer ones
- Accessing memory takes more time than accessing registers

Important point: changing the cycle time often changes the number of cycles required for various instructions

Clock Cycles per Instruction


Not all instructions take the same amount of time to execute
- One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$

Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
- A way to compare two different implementations of the same ISA

    Instruction class                 A   B   C
    CPI for this instruction class    1   2   3

[CSE431 L01 Introduction, Irwin, PSU, 2005]

Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

Where
- IC_i is the count (percentage) of the number of instructions of class i executed
- CPI_i is the (average) number of clock cycles per instruction for that instruction class
- n is the number of instruction classes

The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
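A small sketch of the weighted average, using three instruction classes A, B and C with CPIs of 1, 2 and 3. The two instruction mixes are hypothetical counts invented for this illustration, and the function name is likewise an assumption. Raw counts are divided by the total so that each IC_i is a fraction of the instructions executed.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}   # clock cycles per instruction, by class


def effective_cpi(counts, class_cpi=CLASS_CPI):
    """Overall CPI = sum over classes of CPI_i x (fraction of instructions in class i)."""
    total_instructions = sum(counts.values())
    total_cycles = sum(class_cpi[cls] * n for cls, n in counts.items())
    return total_cycles / total_instructions


# Two hypothetical code sequences for the same task (instruction counts per class)
sequence_1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

cpi_1 = effective_cpi(sequence_1)
cpi_2 = effective_cpi(sequence_2)
cycles_1 = cpi_1 * sum(sequence_1.values())
cycles_2 = cpi_2 * sum(sequence_2.values())

print(f"sequence 1: CPI = {cpi_1}, cycles = {cycles_1}")
print(f"sequence 2: CPI = {cpi_2}, cycles = {cycles_2}")
```

Sequence 2 executes more instructions yet needs fewer cycles because its mix favors the cheap class A, so instruction count alone does not predict execution time.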

THE Performance Equation


Our basic performance equation is then

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

or

$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$

These equations separate the three key factors that affect performance
- Can measure the CPU execution time by running the program
- The clock rate is usually given
- Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
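As an illustration of how the three factors trade off, the sketch below compares two hypothetical implementations of the same ISA running the same program, so the instruction count is identical. The CPI and clock cycle values, the instruction count, and the function name are all invented for the example.

```python
def cpu_time(instruction_count, cpi, clock_cycle):
    """CPU time = Instruction_count x CPI x clock_cycle (clock_cycle in seconds)."""
    return instruction_count * cpi * clock_cycle


IC = 1_000_000   # same program on the same ISA, so the same instruction count

# Hypothetical machine A: short clock cycle, higher CPI
time_a = cpu_time(IC, cpi=2.0, clock_cycle=250e-12)
# Hypothetical machine B: longer clock cycle, lower CPI
time_b = cpu_time(IC, cpi=1.2, clock_cycle=500e-12)

print(f"A: {time_a * 1e6:.0f} us, B: {time_b * 1e6:.0f} us")
print(f"A is {time_b / time_a:.2f} times faster than B")
```

Machine B has the better CPI but still loses, because its clock cycle is twice as long; only the full product decides which machine is faster.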

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
Algorithm                         X            X
Programming language              X            X
Compiler                          X            X
ISA                               X            X         X
Processor organization                         X         X
Technology                                               X

A Simple Example
                                  Freq x CPIi
Op      Freq  CPIi   base   load = 2 cycles   branch = 1 cycle   two ALU at once
ALU     50%   1      .5     .5                .5                 .25
Load    20%   5      1.0    .4                1.0                1.0
Store   10%   3      .3     .3                .3                 .3
Branch  20%   2      .4     .4                .2                 .4
Sum                  2.2    1.6               2.0                1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to save a cycle off the branch time?
- CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?
- CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
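The three what-if questions can be checked by recomputing the effective CPI with one entry changed each time. This is a sketch with names chosen for illustration (`BASE_MIX`, `effective_cpi`, `with_cpi`); it assumes, as the slide does, that instruction count and clock cycle stay the same, so the speedup is just the ratio of CPIs.

```python
# op: (frequency, CPI), taken from the table above
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def effective_cpi(mix):
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with the CPI of one class replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


base = effective_cpi(BASE_MIX)
scenarios = {
    "better data cache (load = 2 cycles)": with_cpi(BASE_MIX, "Load", 2),
    "branch prediction (branch = 1 cycle)": with_cpi(BASE_MIX, "Branch", 1),
    "two ALU instructions at once (ALU = 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}

results = {}
for name, mix in scenarios.items():
    cpi = effective_cpi(mix)
    results[name] = (cpi, base / cpi)
    print(f"{name}: CPI = {cpi:.2f}, {100 * (base / cpi - 1):.1f}% faster")
```

The load improvement helps most even though loads are only 20% of the mix, because their base CPI of 5 makes them the largest single contributor to the total of 2.2.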

Comparing and Summarizing Performance


How do we summarize the performance for a benchmark set with a single number?
- The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

- Where Time_i is the execution time for the ith program of a total of n programs in the workload
- A smaller mean indicates a smaller average execution time and thus improved performance

Guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
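A brief sketch of the arithmetic mean as a summary. The execution times for the two machines are hypothetical numbers made up for this example, as is the function name.

```python
def arithmetic_mean(times):
    """AM = (1/n) x sum of the n execution times."""
    return sum(times) / len(times)


# Hypothetical execution times (seconds) of three benchmark programs
times_x = [2.0, 10.0, 6.0]   # machine X
times_y = [4.0, 4.0, 4.0]    # machine Y

am_x = arithmetic_mean(times_x)
am_y = arithmetic_mean(times_y)

print(f"X: AM = {am_x} s, total = {sum(times_x)} s")
print(f"Y: AM = {am_y} s, total = {sum(times_y)} s")
```

Because the AM is the total time divided by n, ranking machines by AM gives the same order as ranking them by total execution time, even though machine X wins on the first program.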

Remember
Performance is specific to a particular program or set of programs
- Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
- increases in clock rate (without adverse CPI effects)
- improvements in processor organization that lower CPI
- compiler enhancements that lower CPI and/or instruction count
- algorithm/language choices that affect instruction count

Pitfall: expecting an improvement in one aspect of a machine's performance to improve total performance by a proportional amount

Summary: Evaluating ISAs


Design-time metrics:
- Can it be implemented, in how long, at what cost?
- Can it be programmed? Ease of compilation?

Static Metrics:
- How many bytes does the program occupy in memory?

Dynamic Metrics:
- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?

Best Metric: Time to execute the program!
- depends on the instruction set, the processor organization, and compilation techniques.

[Figure: diagram relating Inst. Count, CPI, and Cycle Time]

Next Lecture and Reminders


Next lecture
Instructions: Language of the Computer
- Reading assignment Chapter 2
