Saed R. Abed
[Computer Engineering Department, Hashemite University] [Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, 2005, UCB]
Course Administration
Instructor: Saed Rasmi Abed
Instructor's e-mail: sabed@hu.edu.jo
Office Hours:
Required: Computer Organization and Design, 4th Edition, Patterson and Hennessy, 2008
Optional: Computer Organization and Architecture: Designing for Performance, 7th Edition, William Stallings, published by Prentice Hall, July 2005.
Slides : pdf on the course web page (Moodle System)
Course Content
Content
Principles of computer architecture: CPU datapath and control unit design (single-issue pipelined, superscalar, VLIW), memory hierarchies and design, I/O organization and design, advanced processor design (multiprocessors).
Course goals
To learn the organizational paradigms that determine the capabilities and performance of computer systems. To understand the interactions between the computer's architecture and its software so that future software designers (compiler writers, operating system designers, database programmers, ...) can achieve the best cost-performance trade-offs and so that future architects understand the effects of their design choices on software applications.
Course prerequisites
CPE 408330: Assembly Language and Microprocessor Systems.
Create, assemble, run, and debug programs in an assembly language: MIPS preferred.
Course Structure
Design focused class:
Lectures:
Chapter 1 (2 Weeks): Computer Abstractions and Technology (Sec. 1.1 to 1.4)
Chapter 2 (2 1/2 Weeks): Instructions: Language of the Computer (Sec. 2.1 to 2.7 & 2.10)
Chapter 3 (1 1/2 Weeks): Arithmetic for Computers (Sec. 3.1 to 3.4)
(1/2 Week)
Chapter 4 (4 Weeks) (Sec. 4.1 to 4.9)
(1/2 Week)
Chapter 5 (3 Weeks) (Sec. 5.1 to 5.3 & 5.5)
(1/2 Week)
Grading Information
Grade determinants
First Exam
~20%
Second Exam
~25%
Exam
~50%
TBD
~5%
Professionalism
The conduct, aims or qualities that characterize a professional person.
We are an accredited engineering school: our product is engineering professionals. Employers expect our graduates to behave like professionals. Employers seek the qualities of a professional in job interviews. Professionalism must start in the first semester and be part of every course over four years.
accept responsibility for his/her own learning
follow up on lecture material and homework
learn problem-solving skills, not just how to solve each specific homework problem
build a body of knowledge integrated over four years of courses
We all want HU's excellent reputation to be reinforced so that employers will hire our graduates!
"By the architecture of a system, I mean the complete and detailed specification of the user interface. As Blaauw has said, 'Where architecture tells what happens, implementation tells how it is made to happen.'"
Moore's Law
In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).
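To get a feel for what "double every 18 to 24 months" implies, the growth factor over a period is 2 raised to (elapsed time / doubling period). A minimal sketch; the function name and the ten-year horizon are illustrative choices, not from the slides:

```python
def growth_factor(elapsed_months: float, doubling_period_months: float) -> float:
    """Factor by which the transistor count grows under steady doubling."""
    return 2.0 ** (elapsed_months / doubling_period_months)


# Ten years of growth at the two ends of the 18-to-24-month range.
for period in (18, 24):
    factor = growth_factor(120, period)
    print(f"doubling every {period} months: about {factor:.0f}x in 10 years")
```

The gap between the two ends of the range is large: roughly 100x versus 32x over a decade.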
[Figure: bar chart of millions of computers per year, 1998 to 2002]
[Figure: log-scale plot by year; labeled machines include SUN-4/260 and DEC AXP/500]
K = 1024 ($2^{10}$)
The Evolution of Computer Hardware
When was the first transistor invented?
Modern-day electronics began with the invention in 1947 of the transfer resistor, the bipolar transistor, by Bardeen et al. at Bell Laboratories.
The Evolution of Computer Hardware
When was the first IC (integrated circuit) invented?
In 1958 the IC was born when Jack Kilby at Texas Instruments successfully interconnected, by hand, several transistors, resistors, and capacitors on a single substrate.
Technology: Vacuum Tube, Transistor, Integrated Circuit (IC), Very Large Scale IC (VLSI), Submicron VLSI
The PowerPC 750
Introduced in 1999
3.65M transistors
366 MHz clock rate
40 mm$^2$ die size
250 nm (0.25 micron) technology
Technology Outlook
High Volume Manufacturing     2010  2012  2014  2016  2018
Technology Node (nm)          32    22    16    11    8
Integration Capacity (BT)     16    32    64    128   256
Delay = CV/I scaling          Delay scaling will slow down
Energy/Logic Op scaling       Energy scaling will slow down
Bulk Planar CMOS              Low Probability
Alternate, 3G etc             High Probability
Variability                   High, Very High
ILD (K)                       Reduce slowly towards 2 to 2.5
RC Delay                      1  1  1  1
Metal Layers
500 MHz ClockRate = 2 nsec ClockCycle
1 GHz ClockRate = 1 nsec ClockCycle
4 GHz ClockRate = 250 psec ClockCycle
Memory
DRAM capacity: 4x every 3 years, now 2x every 2 years
memory speed: 1.5x every 10 years
cost per bit: decreases about 25% per year
Disk
capacity: increases about 60% per year
What is a Computer?
Components:
processor (datapath, control)
input (mouse, keyboard)
output (display, printer)
memory (cache (SRAM), main memory (DRAM), disk drive, CD/DVD)
network
PC Motherboard Closeup
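C compiler: one-to-many (one high-level statement becomes several assembly instructions)
Assembler: one-to-one (one assembly instruction becomes one machine instruction)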
Machine Organization
Capabilities and performance characteristics of the principal Functional Units (FUs)
e.g., register file, ALU, multiplexors, memories, ...
Logic and means by which information flow between FUs is controlled
The machine's Instruction Set Architecture (ISA)
Register Transfer Level (RTL) machine description
ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.
[Figure: bar chart of millions of processors per year, 1998 to 2002; approximate values (see text for correct values)]
[Figure: processor block diagram with control and datapath]
Processor: Control, Datapath
Memory:
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 0000000000001000
Processor Control fetches and decodes:
000000 00100 00010 0001000000100000
Datapath executes: contents of Reg #4 ADD contents of Reg #2, result put in Reg #2
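The decode step for this word can be sketched in a few lines. This assumes the standard MIPS R-format layout (6-bit op, 5-bit rs, rt, rd, sa, and 6-bit funct); the function name `decode_r_type` is made up for illustration:

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit MIPS R-format word into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "sa": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# The word from the slide; the spaces only mark field boundaries.
bits = "000000 00100 00010 0001000000100000"
word = int(bits.replace(" ", ""), 2)
fields = decode_r_type(word)
print(fields)
```

Here op = 0 with funct = 32 selects ADD, rs = 4 and rt = 2 name the source registers, and rd = 2 names the destination, which matches the description on the slide.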
The processor repeats Fetch, Decode, Exec for each instruction in memory.
Processor Organization
Control needs to have circuitry to
Decide which is the next instruction and input it from memory
Decode the instruction
Issue signals that control the way information flows between datapath components
Control what operations the datapath's functional units perform
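These four jobs can be sketched as a software loop. The program is the seven words shown in memory earlier in the deck; the register contents, the data memory, and the name `run` are hypothetical, and the sketch handles only the five instructions that program uses (no sign extension, no overflow trap):

```python
def run(program, regs, memory):
    """Fetch, decode, and execute MIPS words until a `jr` is reached.

    program: list of 32-bit instruction words at byte addresses 0, 4, 8, ...
    regs:    list of 32 register values
    memory:  dict of byte address -> word (a hypothetical data memory)
    Returns the jump target of the final `jr`.
    """
    pc = 0
    while True:
        word = program[pc // 4]              # fetch the instruction
        pc += 4                              # decide which instruction is next
        op = (word >> 26) & 0x3F             # decode the fields
        rs, rt = (word >> 21) & 0x1F, (word >> 16) & 0x1F
        rd, sa = (word >> 11) & 0x1F, (word >> 6) & 0x1F
        funct, imm = word & 0x3F, word & 0xFFFF
        # "Control": select the datapath operation from op and funct.
        if op == 0 and funct == 0x00:        # sll rd, rt, sa
            regs[rd] = (regs[rt] << sa) & 0xFFFFFFFF
        elif op == 0 and funct == 0x20:      # add rd, rs, rt (overflow trap omitted)
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        elif op == 0x23:                     # lw rt, imm(rs); imm assumed non-negative
            regs[rt] = memory[regs[rs] + imm]
        elif op == 0x2B:                     # sw rt, imm(rs)
            memory[regs[rs] + imm] = regs[rt]
        elif op == 0 and funct == 0x08:      # jr rs
            return regs[rs]
        else:
            raise ValueError(f"unsupported instruction {word:#010x}")


# The seven words from the memory listing; spaces mark field boundaries.
LISTING = [
    "000000 00000 00101 0001000010000000",
    "000000 00100 00010 0001000000100000",
    "100011 00010 01111 0000000000000000",
    "100011 00010 10000 0000000000000100",
    "101011 00010 10000 0000000000000000",
    "101011 00010 01111 0000000000000100",
    "000000 11111 00000 0000000000001000",
]
program = [int(bits.replace(" ", ""), 2) for bits in LISTING]

regs = [0] * 32
regs[4], regs[5], regs[31] = 1000, 3, 0x400  # hypothetical base address, index, return address
memory = {1012: 7, 1016: 9}                  # hypothetical pair of adjacent words
returned_to = run(program, regs, memory)
print(memory, hex(returned_to))
```

With these inputs the two memory words end up swapped, which suggests the listing is a small swap routine; a hardware control unit does the same selection with decoders and control signals rather than an if-chain.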
[Figure: processor (control, datapath), memory, and output]
software
instruction set architecture
hardware
- coprocessor
Memory Management
Special
Registers: PC, HI, LO
3 Instruction Formats: all 32 bits wide
OP | rs | rt | rd | sa | funct
OP | rs | rt | immediate
OP | jump target
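The three formats can be packed into 32-bit words with shifts and ORs. A minimal sketch, assuming the standard MIPS field widths (6-bit OP; 5-bit rs, rt, rd, sa; 6-bit funct; 16-bit immediate; 26-bit jump target); the function names are illustrative:

```python
def encode_r(op, rs, rt, rd, sa, funct):
    """OP | rs | rt | rd | sa | funct"""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct


def encode_i(op, rs, rt, immediate):
    """OP | rs | rt | immediate (kept to its low 16 bits)"""
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def encode_j(op, target):
    """OP | jump target (kept to its low 26 bits)"""
    return (op << 26) | (target & 0x3FFFFFF)


add_word = encode_r(0, 4, 2, 2, 0, 0x20)   # add $2, $4, $2
lw_word = encode_i(0x23, 2, 15, 0)         # lw $15, 0($2)
j_word = encode_j(0x02, 0x100)             # j with a hypothetical target field
print(f"{add_word:032b}")
print(f"{lw_word:032b}")
print(f"{j_word:032b}")
```

Because OP always sits in the top six bits, the decoder can read it first and then decide how to interpret the remaining 26 bits.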
Coordination of many levels of abstraction
Under a rapidly changing set of forces
Design, measurement, and evaluation
Performance Metrics
Purchasing perspective
given a collection of machines, which has the
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors
How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8?
If we upgrade a machine with a new processor what do we increase? If we add a new machine to the lab what do we increase?
Execution Time
Elapsed Time
counts everything (disk and memory accesses, I/O, etc.)
a useful number, but often not good for comparison purposes
CPU time
doesn't count I/O or time spent running other programs
can be broken up into system time and user time
Problem:
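machine A runs a program in 20 seconds
machine B runs the same program in 25 seconds
$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$
If X is n times faster than Y, then
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$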
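Applying the definition performance = 1 / execution time to the two machines in the problem gives the ratio directly. A minimal sketch; the function names are illustrative:

```python
def performance(execution_time_s: float) -> float:
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """Return n such that X is n times faster than Y."""
    return performance(time_x_s) / performance(time_y_s)


n = times_faster(20.0, 25.0)  # machine A (20 s) versus machine B (25 s)
print(f"A is {n:.2f} times faster than B")
```

The ratio is 25 / 20 = 1.25, so A is 1.25 times faster than B on this program.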
Performance Factors
Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time): time the CPU spends working on a task
Does not include time waiting for I/O or running other programs
$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$
or
$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
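Both forms of the CPU execution time equation give the same answer, since clock rate is the reciprocal of clock cycle time. A small sketch with a hypothetical program of 10 billion clock cycles on a 4 GHz (250 psec) clock:

```python
def cpu_time_from_cycle_time(clock_cycles: float, clock_cycle_time_s: float) -> float:
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


cycles = 10e9  # hypothetical cycle count, not from the slides
t_rate = cpu_time_from_clock_rate(cycles, 4e9)
t_cycle = cpu_time_from_cycle_time(cycles, 250e-12)
print(t_rate, t_cycle)

# Halving either factor halves the CPU time.
t_fewer_cycles = cpu_time_from_clock_rate(cycles / 2, 4e9)
t_faster_clock = cpu_time_from_clock_rate(cycles, 8e9)
print(t_fewer_cycles, t_faster_clock)
```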
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
Clock Cycles
Instead of reporting execution time in seconds, we often use cycles
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 4 GHz clock has a cycle time of
$$\frac{1}{4 \times 10^{9}} \times 10^{12} = 250 \text{ picoseconds (ps)}$$
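The conversion between clock rate and cycle time is a single reciprocal; the only care needed is with units. A minimal sketch that reports the cycle time in picoseconds (the function name is illustrative):

```python
def cycle_time_ps(clock_rate_hz: float) -> float:
    """Cycle time in picoseconds: (1 / clock rate) seconds times 10^12."""
    return 1e12 / clock_rate_hz


for label, rate_hz in [("500 MHz", 500e6), ("1 GHz", 1e9), ("4 GHz", 4e9)]:
    print(f"{label}: {cycle_time_ps(rate_hz):.0f} ps per cycle")
```

The three rates give 2000 ps (2 nsec), 1000 ps (1 nsec), and 250 ps, in line with the clock rate and cycle time pairs listed elsewhere in the deck.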
So, to improve performance (everything else being equal) you can either (increase or decrease?)
________ the # of required cycles for a program, or
________ the clock cycle time or, said another way,
________ the clock rate.
[Figure: instructions (..., 4th, 5th, 6th, ...) laid out along a time axis]
This assumption is incorrect: different instructions take different amounts of time on different machines.
Why? Hint: remember that these are machine instructions, not lines of C code.
Multiplication takes more time than addition
Floating point operations take longer than integer ones
Accessing memory takes more time than accessing registers
Important point: changing the cycle time often changes the number of cycles required for various instructions
$$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA
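For two implementations of the same ISA, the instruction count for a given program is the same, so the comparison reduces to CPI times clock cycle time. A sketch with hypothetical numbers (machine A: CPI 2.0 at 250 ps; machine B: CPI 1.2 at 500 ps; neither is from the slides):

```python
def cpu_time_ps(instruction_count: float, cpi: float, clock_cycle_ps: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_ps


ic = 1_000_000  # hypothetical; identical on both machines (same ISA, same program)
time_a = cpu_time_ps(ic, cpi=2.0, clock_cycle_ps=250.0)
time_b = cpu_time_ps(ic, cpi=1.2, clock_cycle_ps=500.0)
ratio = time_b / time_a
print(f"A is {ratio:.2f} times faster than B")
```

Machine B has the lower CPI but still loses, because its clock cycle is twice as long: 1.2 x 500 ps is more than 2.0 x 250 ps per instruction.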
CSE431 L01 Introduction.71
Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$
Where
$\text{IC}_i$ is the count (percentage) of the number of instructions of class i executed
$\text{CPI}_i$ is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes
The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
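The instruction mix is a dynamic quantity: it comes from counting what actually executes, not from the static program text. A minimal sketch that derives the mix from a trace of executed instruction classes and then applies the effective CPI sum; the trace and the per-class cycle counts are hypothetical:

```python
from collections import Counter


def effective_cpi(trace, cpi_by_class):
    """Sum of CPI_i x IC_i, with IC_i taken as a fraction of the executed trace."""
    counts = Counter(trace)
    total = len(trace)
    return sum(cpi_by_class[cls] * n / total for cls, n in counts.items())


# Hypothetical dynamic trace: one entry per executed instruction.
trace = ["alu", "load", "alu", "branch", "store",
         "alu", "load", "alu", "branch", "alu"]
cpi_by_class = {"alu": 1, "load": 5, "store": 3, "branch": 2}  # hypothetical cycle counts

mix = {cls: n / len(trace) for cls, n in Counter(trace).items()}
print(mix)
print(round(effective_cpi(trace, cpi_by_class), 2))
```

A different program, or the same program on different input, gives a different trace and therefore a different effective CPI on the same hardware.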
CPU time
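$$\text{CPU time} = \text{Instruction count (IC)} \times \text{CPI} \times \text{clock cycle (CC)}$$
$$\text{CPU time} = \frac{\text{Instruction count (IC)} \times \text{CPI}}{\text{clock rate}}$$
These equations separate the three key factors that affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation, for which we must know the implementation details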
[Table: X marks showing which factors affect CPI and clock_cycle; row labels are missing from this extract]
Irwin, PSU, 2005
A Simple Example
Op      Freq  CPIi  Freq x CPIi  (Load CPI = 2)  (Branch CPI = 1)  (ALU CPI = 0.5)
ALU     50%   1     .5           .5              .5                .25
Load    20%   5     1.0          .4              1.0               1.0
Store   10%   3     .3           .3              .3                .3
Branch  20%   2     .4           .4              .2                .4
Sum                 2.2          1.6             2.0               1.95
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to save a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
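The two what-if questions can be checked by recomputing the effective CPI with one class changed. Since IC and CC stay fixed, the speedup is just the ratio of old to new CPI. A sketch using the frequencies and cycle counts from the example; the function names are illustrative:

```python
# (frequency, cycles per instruction) for each class, from the example.
BASE = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def effective_cpi(mix):
    """Sum of Freq x CPI_i over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def speedup(base, op, new_cpi):
    """Old CPU time / new CPU time when one class gets a new cycle count."""
    changed = dict(base)
    changed[op] = (base[op][0], new_cpi)
    return effective_cpi(base) / effective_cpi(changed)


base_cpi = effective_cpi(BASE)
load_speedup = speedup(BASE, "Load", 2)      # better data cache
branch_speedup = speedup(BASE, "Branch", 1)  # branch prediction
print(round(base_cpi, 2), round(load_speedup, 3), round(branch_speedup, 3))
```

The ratios come out near 1.375 and 1.1, matching the 37.5% and 10% figures: improving the less frequent but much slower loads pays off more than shaving a cycle off branches.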
$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$
Where $\text{Time}_i$ is the execution time for the ith program of a total of n programs in the workload
A smaller mean indicates a smaller average execution time and thus improved performance
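As a quick illustration of the arithmetic mean as a summary, here is a sketch comparing two machines over the same three-program workload; all the execution times are hypothetical:

```python
def arithmetic_mean(times_s):
    """AM = (1/n) * sum of the n execution times."""
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) for the same three programs.
machine_a = [2.0, 4.0, 6.0]
machine_b = [3.0, 3.0, 3.0]

am_a = arithmetic_mean(machine_a)
am_b = arithmetic_mean(machine_b)
print(am_a, am_b)
```

Machine B has the smaller mean even though machine A wins on the first program, which is why the workload itself has to be reported along with the summary number.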
The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)).
Remember
Performance is specific to a particular program or programs
Total execution time is a consistent summary of performance
Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance
Static Metrics:
How many bytes does the program occupy in memory?
Dynamic Metrics:
How many instructions are executed?
How many bytes does the processor fetch to execute the program?
How many clocks are required per instruction? (CPI)
How "lean" a clock is practical?