Vittorio Giovara
149374
04/08/2008
http://gle-mips.googlecode.com
Contents
1 Overview
2 Processor Design
2.1 General Structure
2.1.1 Control Unit
2.1.2 Arithmetic Logic Unit
2.2 Module Interconnection
2.2.1 PC / NPC conflict
2.3 VLIW Support
3 Physical Design
3.1 Synthesis Configuration
3.2 Results Report
3.3 On Silicon
4 Optimizations
4.1 Going back to RTL level
4.2 Synthesis Reconfiguration
4.3 Final Results
4.3.1 Final Conclusions
Bibliography
A Simulation Waves
Chapter 1
Overview
Chapter 2
Processor Design
In this chapter a detailed explanation of the processor design is provided; the two
most important modules, the Control Unit and the Arithmetic Logic Unit, are carefully
analyzed, taking into consideration the distributed decoding approach. Furthermore, an overview
of the Very Long Instruction Word support is given, with a careful analysis of its implementation
details.
• Signed/Unsigned Subtraction
• Comparisons
The addition/subtraction module uses a sparse tree implementation, allowing very fast carry
propagation and result computation; the logical module uses a design similar to the one present in
the SPARC T2 processor, with four selection signals; the comparison module is implemented
behaviorally, using the result of the subtraction of the two inputs.
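A logic module driven by four selection signals can be sketched as follows. This is only an illustration under an assumption: each selection signal is taken to enable one of the four input combinations (a=0/b=0, a=0/b=1, a=1/b=0, a=1/b=1), which is one common way to build such a unit. The function name, the 32-bit width and the select encodings are hypothetical and are not taken from the processor sources.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data path


def logic_unit(a, b, s0, s1, s2, s3):
    """Bitwise logic controlled by four select signals (each 0 or 1).

    s0 enables the result where a=0,b=0; s1 where a=0,b=1;
    s2 where a=1,b=0; s3 where a=1,b=1.
    """
    na, nb = ~a & MASK, ~b & MASK
    out = 0
    if s0:
        out |= na & nb
    if s1:
        out |= na & b
    if s2:
        out |= a & nb
    if s3:
        out |= a & b
    return out & MASK


# Hypothetical select encodings (s0, s1, s2, s3) for common operations
AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
XOR = (0, 1, 1, 0)
NOR = (1, 0, 0, 0)
```

With this scheme any of the sixteen two-input Boolean functions is obtained by choosing the four select values, so no separate gate network per operation is needed.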
The input operands from the two multiplexers are connected to all the modules above and
each one of them provides a result; only one result is selected, by the alu_opcode control
signal provided directly by the Control Unit.
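The "compute everything, select one" structure, with comparisons derived from the subtraction result, can be sketched as below. The opcode names, the 32-bit width and the reduced set of operations are assumptions made for illustration; the real unit is described in hardware and supports more operations.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data path


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(opcode, a, b):
    """Compute every candidate result, then select one by opcode."""
    # Signed subtraction result, reused by the comparison operations
    diff = to_signed(a) - to_signed(b)
    results = {
        "add": (a + b) & MASK,
        "sub": (a - b) & MASK,
        "and": a & b,
        "sge": int(diff >= 0),  # set if a >= b (signed)
        "seq": int(diff == 0),  # set if a == b
    }
    # Final multiplexer, driven by the ALU opcode control signal
    return results[opcode]
```

For example, `alu("sge", 28, 4)` yields 1 because the subtraction result is non-negative.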
As a matter of fact, the registers (Rs1, Rs2 and Rd) are assigned according to five different
instruction categories by the ARG PROC process:
1. For R-type instructions, bits from 25 to 12 are for the three registers respectively;
2. For unconditional jumps, the first register is always set to 0 in order to force jump
execution, the second register is used for register operations (like jalr or jr), and the
destination register is forced to 31 for jal and jalr operations (this doesn't cause
problems with the other jump instructions because the register file write enable is activated only
for these two instructions);
3. For conditional jumps, the first register is needed for comparison and branch evaluation;
4. For store instructions, a particular configuration is needed since the second register holds
the content to be saved in memory;
5. For all the other instructions, the first register is the first operand while the second register
corresponds to the destination.
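The five categories above can be sketched as a small decoding function. Several details are assumptions and not taken from the processor sources: the field positions are the standard DLX ones (bits 25-21, 20-16 and 15-11), the field that carries the register of jr/jalr is assumed to be the first one, and registers that the text leaves unspecified are returned as 0. The function and category names are hypothetical.

```python
def field(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def select_registers(word, category, links=False):
    """Return (rs1, rs2, rd) for one of the five categories.

    `links` marks jal/jalr, the only jumps that write register 31.
    Bit positions are the standard DLX ones (an assumption).
    """
    f1 = field(word, 25, 21)
    f2 = field(word, 20, 16)
    f3 = field(word, 15, 11)
    if category == "rtype":
        return f1, f2, f3
    if category == "jump":
        # rs1 forced to 0; rs2 assumed to come from the first field
        return 0, f1, (31 if links else 0)
    if category == "branch":
        return f1, 0, 0  # only the compared register matters
    if category == "store":
        return f1, f2, 0  # rs2 holds the value to be stored
    return f1, 0, f2  # other instructions: second field is the destination
```

A usage example: for a word whose three fields hold 7, 5 and 11, the "rtype" category returns `(7, 5, 11)` while the default category returns `(7, 0, 5)`.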
While providing the registers, this process also performs sign extension of the immediate
value and saves the correct value in the related immediate register; immediates are usually 16 bits
wide, but they are extended to 26 bits for unconditional jumps.
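Sign extension of the two immediate widths can be illustrated with a short sketch. The 32-bit target width and the function names are assumptions; only the 16-bit and 26-bit field sizes come from the text.

```python
def sign_extend(value, bits, width=32):
    """Sign-extend the low `bits` bits of value to a `width`-bit pattern."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):  # sign bit set: the number is negative
        value -= 1 << bits
    return value & ((1 << width) - 1)


def immediate(word, unconditional_jump=False):
    """Immediate field: 26 bits for unconditional jumps, 16 bits otherwise."""
    return sign_extend(word, 26 if unconditional_jump else 16)
```

For instance, the 16-bit pattern 0xFFFC (the value -4) becomes 0xFFFFFFFC, while a positive immediate such as 13 is left unchanged.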
Another distributed control is the size of the word to be saved in memory or loaded from it: a
simple behavioral multiplexer selects the control signals according to the opcode of the instruction.
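Such a size selection amounts to a lookup on the opcode. The sketch below uses the standard DLX load/store mnemonics as keys; which of them this processor actually implements is not stated in the text, so the table is an assumption, as are the names.

```python
# Hypothetical mapping from load/store mnemonics to access size in bytes
ACCESS_SIZE = {
    "lb": 1, "lbu": 1, "sb": 1,
    "lh": 2, "lhu": 2, "sh": 2,
    "lw": 4, "sw": 4,
}


def access_size(mnemonic):
    """Select the memory access size; non-memory instructions get 0."""
    return ACCESS_SIZE.get(mnemonic, 0)
```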
The saving of results in the register file also has distributed control, with a similar approach:
there is a multiplexer that by default selects the data arriving from the Write Back stage, but for jal and
jalr the correct value of the Program Counter is chosen (the destination register has already been
selected by the ARG PROC process) and sent to the register file.
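The behavior of this multiplexer can be sketched in a few lines. The function and parameter names are hypothetical, and the exact Program Counter value that is saved (the return address) is passed in as a parameter because the text does not specify how it is computed.

```python
def register_file_input(mnemonic, writeback_data, pc_value):
    """Choose the value written to the register file.

    jal and jalr store the Program Counter value (the return address);
    every other instruction stores the Write Back stage data.
    """
    if mnemonic in ("jal", "jalr"):
        return pc_value
    return writeback_data
```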
Chapter 3
Physical Design
Type                 Value
Clock Period         40 ns
Power Consumption    550 µW
Area Size            unconstrained
Every module of the processor has been analyzed, elaborated and compiled by Synopsys with
no errors; a minimal amount of RAM, only two lines, has also been synthesized, because
otherwise the missing module causes problems during silicon placement in Encounter.
The commands for applying the above constraints are:
Type                  Value
Data Required Time    39.90 ns
Data Arrival Time     -24.46 ns
Slack                 15.44 ns

Type                   Value
Cell Internal Power    496.97 µW
Net Switching Power    34.28 µW
Total Dynamic Power    531.25 µW

Type                     Value
Combinational area       57983.38 mm²
Noncombinational area    25153.22 mm²
Total area               83136.61 mm²
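The three reports above are internally consistent, which can be checked with simple arithmetic: the slack is the required time minus the arrival time, and the totals are the sums of their components. The sketch below only reuses the reported values; the variable names are illustrative.

```python
# Values copied from the synthesis report tables
required, arrival = 39.90, 24.46                      # ns
cell_internal, net_switching = 496.97, 34.28          # uW
combinational, noncombinational = 57983.38, 25153.22  # area, units as reported

# Slack is the margin between required and actual arrival time
slack = round(required - arrival, 2)

# Totals are the sums of the individual contributions
total_dynamic = round(cell_internal + net_switching, 2)
total_area = round(combinational + noncombinational, 2)
```

The computed area differs from the reported total by 0.01, which is compatible with rounding of the two partial values in the report.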
3.3 On Silicon
The design process was then carried on to silicon placement and routing, in order to extract more
detailed information about IR drop, electromigration and delay.
As figure 3.1 shows, there are four routing violations, marked by white crosses.
However, they did not influence the delay computation, which showed no maximum delay violation; the
maximum delay corresponds to the previously selected clock period (40 ns).
As for the power results, Encounter reported the following data (more exhaustive data is
reported in the project data files):
Type                           Value
Average Power                  1.305 mW
Average Leakage Power          0.083 mW
Worst IR drop                  12.1 mV
Worst Electromigration (M1)    29.539 µA
Chapter 4
Optimizations
Type                 Value
Clock Period         30 ns
Power Consumption    250 µW
Area Size            unconstrained
As for power reduction, this processor has a peculiar structure: a Very Long Instruction
Word path that can be disconnected at any time. A very effective power reduction technique
for it is therefore clock gating: since the second data path can be unused, clock gating deactivates
the clock strobes of unused modules until actual input data arrives.
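The effect of clock gating on an idle module can be illustrated with a toy model that simply counts the clock edges reaching the module. The activity pattern (ten active cycles followed by thirty idle ones) and all names are hypothetical; real savings depend on the cell library and on the switching activity.

```python
def count_clock_edges(enables, gated):
    """Count clock edges reaching a module over a run of cycles.

    enables[i] is True when the module has input data in cycle i.
    Without gating the module is clocked every cycle; with gating
    only the enabled cycles produce a clock edge.
    """
    if not gated:
        return len(enables)
    return sum(1 for en in enables if en)


# Hypothetical activity: the VLIW path is used for 10 cycles, then disconnected
vliw_enables = [True] * 10 + [False] * 30
```

In this toy run the gated VLIW path receives 10 clock edges instead of 40, which is why the registers of the unused data path stop dissipating switching power.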
The resulting configuration script used in Synopsys is:
set_max_dynamic_power 250 uW
propagate_constraints -gate_clock
Figure 4.1: Optimized DLX processor silicon after place and route
4.3 Final Results
All the synthesis constraints have been respected; in fact the new configuration reports:
Table 4.2
Type                  Value
Data Required Time    30 ns
Data Arrival Time     -3.78 ns
Slack                 26.22 ns

Table 4.3
Type                   Value
Cell Internal Power    196.54 µW
Net Switching Power    46.73 µW
Total Dynamic Power    243.27 µW

Table 4.4
Type                     Value
Combinational area       63230.93 mm²
Noncombinational area    26457.17 mm²
Total area               89688.1 mm²
Table 4.2 shows that the timing of the data path has been reduced to the selected value
and that the slack is positive, so no timing violations are present. The power consumption
in table 4.3 is greatly reduced thanks to clock gating; as expected, this technique
increased the combinational area and the net switching power while reducing the cell
internal power. The total area has increased only slightly (table 4.4).
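The trade-off can be quantified by comparing the totals of the first synthesis (chapter 3) with those of the optimized one. The sketch below only reuses the reported values; the helper name is illustrative.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Total dynamic power (uW), total area and clock period (ns):
# first synthesis versus optimized synthesis
power_change = percent_change(531.25, 243.27)
area_change = percent_change(83136.61, 89688.10)
clock_change = percent_change(40.0, 30.0)
```

With these figures the total dynamic power drops by roughly 54% and the clock period by 25%, while the total area grows by roughly 8%.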
After silicon placement and routing, the values extracted from Encounter are (more details
are available in the project data files):
Type                           Value
Average Power                  1.911 mW
Average Leakage Power          0.101 mW
Worst IR drop                  15.09 mV
Worst Electromigration (M1)    0.158 mA
Bibliography
[2] Frank Emnett, Mark Beigel, Power Reduction through RTL clock gating
Appendix A
Simulation Waves
The assembly program used to test the processor functionality and the resulting waveforms are
provided here; the testbench deactivates the VLIW support after ten clock cycles, and figure A.1
shows that execution then continues with single instructions.
addi r5,r2,#13
addui r7,r2,#15
sub r11,r7,r5
sw 10(r0),r11
addu r14,r5,r7
slli r1,r11,#1
jal #32
nop
nop
lw r2,10(r0)
addu r14,r11,r7
and r22,r2,r2
sge r20,r14,r1
bnez r31, #-4
nop
nop
nop
nop
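The values expected from the first six instructions can be worked out with a tiny sequential interpreter. This is only a sketch under stated assumptions: all registers start at zero, pipelining and VLIW pairing are ignored, and only the handful of operations needed here is modeled. The tuple encoding of the instructions is hypothetical.

```python
def run(program):
    """Execute a tiny DLX subset in program order; return (regs, mem)."""
    regs, mem = {}, {}

    def r(n):
        # r0 always reads as 0; unset registers are assumed to be 0
        return 0 if n == 0 else regs.get(n, 0)

    for op, x, y, z in program:
        if op in ("addi", "addui"):
            regs[x] = (r(y) + z) & 0xFFFFFFFF
        elif op == "addu":
            regs[x] = (r(y) + r(z)) & 0xFFFFFFFF
        elif op == "sub":
            regs[x] = (r(y) - r(z)) & 0xFFFFFFFF
        elif op == "slli":
            regs[x] = (r(y) << z) & 0xFFFFFFFF
        elif op == "sw":  # encoded here as (offset, base register, source register)
            mem[x + r(y)] = r(z)
        else:
            raise ValueError(op)
    return regs, mem


# First six instructions of the test program
program = [
    ("addi", 5, 2, 13),   # addi  r5,r2,#13
    ("addui", 7, 2, 15),  # addui r7,r2,#15
    ("sub", 11, 7, 5),    # sub   r11,r7,r5
    ("sw", 10, 0, 11),    # sw    10(r0),r11
    ("addu", 14, 5, 7),   # addu  r14,r5,r7
    ("slli", 1, 11, 1),   # slli  r1,r11,#1
]
regs, mem = run(program)
```

Under these assumptions the program yields r5 = 13, r7 = 15, r11 = 2, r14 = 28 and r1 = 4, with the value 2 stored at address 10; these are the same values that appear on the result, result_vliw and aluout_vliw signals of the simulation.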
[Figure A.1: simulation waveforms of the testbench (entity tb_glx, architecture test, Thu Jul 31 2008). Traced signals under /tb_glx/cpu: clock, reset, vliw_en, result, result_vliw, aluout, aluout_vliw, newpc, the instruction, NPC, operand, immediate, ALU output, LMD and PC latch enables of both data paths, i_eq_condition, i_jump_enable, i_alu_opcode, i_alu_opcode_vliw, the multiplexer selections and i_dram_wenable.]
Recoverable bus values, in order of appearance:
result: 0, 13, 0, 2, 0, 28, 0, 4, 0
result_vliw: 0, 15, 0
aluout: 0, 13, 2, 28, 4
aluout_vliw: 0, 15, 10, 0
newpc: 0, 8, 16, 20, 24, 28
i_alu_opcode: adds, subs, addu, llsh, adds
i_alu_opcode_vliw: addu, adds, nop