Designing A Custom DLX Processor With Very Long Instruction Word Support

Microelectronic System Design
Final Project Report
Vittorio Giovara
149374
04/08/2008
http://gle-mips.googlecode.com
Contents
1 Overview
2 Processor Design
2.1 General Structure
2.1.1 Control Unit
2.1.2 Arithmetic Logic Unit
2.2 Module Interconnection
2.2.1 PC / NPC conflict
2.3 VLIW Support
3 Physical Design
3.1 Synthesis Configuration
3.2 Results Report
3.3 On Silicon
4 Optimizations
4.1 Going back to RTL level
4.2 Synthesis Reconfiguration
4.3 Final Results
4.3.1 Final Conclusions
A Simulation Waves
List of Tables
3.1 Non-optimized synthesis constraints
3.2 Non-optimized synthesis timing results
3.3 Non-optimized synthesis power results
3.4 Non-optimized synthesis area results
3.5 Non-optimized synthesis silicon results
4.1 Optimized synthesis constraints
4.2 Optimized synthesis timing results
4.3 Optimized synthesis power results
4.4 Optimized synthesis area results
4.5 Optimized synthesis silicon results
List of Figures
1.1 Optimized DLX processor on silicon, simulating IRdrop measurements
2.1 DLX processor structural view
3.1 DLX processor on silicon after place and route
3.2 DLX processor IR drop rail analysis
4.1 Optimized DLX processor silicon after place and route
A.1 DLX simulation waves
Chapter 1
Overview
This document focuses on the implementation details of a custom DLX processor with Very Long
Instruction Word (VLIW) support and a non-pipelined data path. It first gives an architectural
description of the processor, outlining the functionality of the main modules and the general
structure of the design. The design is then carried on to the physical level, with detailed
information about system-on-chip simulation, power consumption and gate characterization.
Industry-standard tools, namely Design Vision from Synopsys and First Encounter from Cadence,
are used in the process.
Figure 1.1: Optimized DLX processor on silicon, simulating IRdrop measurements
Chapter 2
Processor Design
This chapter explains the processor design in detail. The two most important modules, the
Control Unit and the Arithmetic Logic Unit, are analyzed with the distributed decoding approach
in mind. An overview of the Very Long Instruction Word support and of its implementation details
follows.
2.1 General Structure

The processor follows the classical DLX implementation, with a simple non-pipelined Control Unit
and distributed decoding of the instruction. Every instruction requires exactly five clock
cycles to execute.
At any time the processor can activate a second data path, exploiting the Very Long Instruction
Word architecture: two instructions are executed in parallel, with no data dependency problems,
since these have already been resolved by the compiler.
Figure 2.1 shows the schematic view of the processor, with the double data path and the 4R2W
register file and memory (see footnote 1).
2.1.1 Control Unit

The Control Unit uses a microcode control memory with instruction relocation. Every I-type and
J-type instruction activates different control words according to its opcode. R-type
instructions (see footnote 2), which share a single opcode (all zeroes), are relocated in a
similar way, but using the func field (11 bits); this field is shifted left by two so that it
does not conflict with the opcodes of I-type and J-type instructions.
The microcode consists of four lines per instruction, plus two lines, reset (all zeroes) and
instruction fetch, that are the same for every instruction. To access the microcode, a
relocation vector maps every opcode or func value to the corresponding control word.
Finally, the Control Unit selects the correct operation for the Arithmetic Logic Unit, again
using the opcode or the func field of the instruction.
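The relocation scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions the report does not spell out: line 0 holds the reset word, line 1 the shared fetch word, and each instruction then owns four consecutive lines. The function names and the example encodings (opcode 0x08, func 0x20) are hypothetical, and the sketch tags R-type keys instead of reproducing the shift by two used in the actual design.

```python
RESET_LINE = 0              # assumed position of the all-zero reset word
FETCH_LINE = 1              # assumed position of the shared fetch word
LINES_PER_INSTRUCTION = 4   # from the report: four lines per instruction


def build_relocation(keys):
    """Map each opcode/func key to the first of its four microcode lines."""
    return {key: 2 + i * LINES_PER_INSTRUCTION for i, key in enumerate(keys)}


def relocation_key(instruction):
    """Use the opcode for I/J-type words and the func field for R-type words."""
    opcode = (instruction >> 26) & 0x3F
    if opcode == 0:                        # R-type: single all-zero opcode
        return ("func", instruction & 0x7FF)   # 11-bit func field
    return ("opcode", opcode)


def microcode_addresses(instruction, relocation):
    """Return the fetch line followed by the four lines of the instruction."""
    base = relocation[relocation_key(instruction)]
    return [FETCH_LINE] + [base + i for i in range(LINES_PER_INSTRUCTION)]


# Hypothetical encodings used only to exercise the sketch.
relocation = build_relocation([("opcode", 0x08), ("func", 0x20)])
i_type_lines = microcode_addresses(0x08 << 26, relocation)
r_type_lines = microcode_addresses(0x20, relocation)
```

Under this assumed layout every instruction walks five microcode lines (the shared fetch plus its own four), which is consistent with the five clock cycles per instruction stated in section 2.1.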
Footnote 1: 4R2W stands for four read ports and two write ports.
Footnote 2: I-type, J-type and R-type are, respectively, register-to-immediate, jump and branch,
and register-to-register instructions.
2.1.2 Arithmetic Logic Unit

The Arithmetic Logic Unit is capable of
• Signed/Unsigned Addition
• Signed/Unsigned Subtraction
• Logical Operation (and or xor)
• Comparisons
Figure 2.1: DLX processor structural view
The addition/subtraction module uses a sparse tree implementation, allowing very fast carry
propagation and result computation; the logic module uses a design similar to the one found in
the SPARC T2 processor, with four selection signals; the comparison module is implemented
behaviorally, using the result of the subtraction of the two inputs.
The input operands from the two multiplexers are connected to all the modules above, and each
of them provides a result; only one result is selected, through the alu_opcode control signal
provided directly by the Control Unit.
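As a rough behavioral model of the paragraph above, the following Python sketch computes every result and then picks one by opcode. The logic unit assumes that each of the four selection signals enables one minterm of the two operands, which is a common way to build a four-select logic unit but is not confirmed by the report; the opcode names and the selection table are illustrative.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def logic_unit(a, b, s0, s1, s2, s3):
    """Bitwise logic unit: each select signal enables one minterm of (a, b)."""
    not_a, not_b = ~a & MASK, ~b & MASK
    out = 0
    if s0:
        out |= not_a & not_b
    if s1:
        out |= not_a & b
    if s2:
        out |= a & not_b
    if s3:
        out |= a & b
    return out & MASK


# Assumed select patterns (s0, s1, s2, s3) for the three logical operations.
LOGIC_SELECT = {"and": (0, 0, 0, 1), "or": (0, 1, 1, 1), "xor": (0, 1, 1, 0)}


def alu(alu_opcode, a, b):
    """Every module produces a result; alu_opcode selects the one to output."""
    results = {"add": (a + b) & MASK, "sub": (a - b) & MASK}
    for name, select in LOGIC_SELECT.items():
        results[name] = logic_unit(a, b, *select)
    return results[alu_opcode]
```

Computing all results and selecting afterwards mirrors the hardware structure, where the modules work in parallel and a multiplexer chooses the output.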
2.2 Module Interconnection

To reflect general DLX implementations, the Control Unit does not provide all the control
signals needed during the various stages, but leaves most of the instruction decoding to the
execution process itself.
As a matter of fact, the registers (Rs1, Rs2 and Rd) are selected according to five different
categories by the ARG_PROC process:
1. For R-type instructions, bits 25 down to 11 hold the three registers, in that order;
2. For unconditional jumps, the first register is always set to 0, in order to force jump
execution, while the second register is used for register operations (like jalr or jr) and the
destination register is forced to 31 for jal and jalr operations (this does not cause problems
with the other jump instructions, because the register file write enable is activated only for
these two instructions);
3. For conditional jumps, the first register is needed for comparison and branch evaluation;
4. For store instructions, a particular configuration is needed, since the second register holds
the content to be saved in memory;
5. For all the other instructions, the first register is the first operand, while the second
register is the destination.
While providing the registers, this process also sign-extends the immediate value and saves the
result in the related immediate register; immediates are usually 16 bits wide, but 26 bits wide
for unconditional jumps.
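The field extraction and sign extension performed by this process can be illustrated as follows. This is a sketch, not the project's code: the function names are hypothetical, and the R-type field positions assume the standard DLX layout (Rs1 in bits 25 to 21, Rs2 in bits 20 to 16, Rd in bits 15 to 11).

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def r_type_registers(instruction):
    """Return (Rs1, Rs2, Rd) from an R-type word, assuming standard DLX fields."""
    rs1 = (instruction >> 21) & 0x1F
    rs2 = (instruction >> 16) & 0x1F
    rd = (instruction >> 11) & 0x1F
    return rs1, rs2, rd


def immediate(instruction, unconditional_jump):
    """16-bit immediate normally, 26-bit immediate for unconditional jumps."""
    return sign_extend(instruction, 26 if unconditional_jump else 16)
```

For instance, a 16-bit field holding 0xFFFC is read as -4, the branch offset used in the test program of appendix A.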
Another distributed control is the size of the word to be stored in memory or loaded from it: a
simple behavioral multiplexer selects the control signals according to the opcode of the
instruction.
The saving of results in the register file also has distributed control, with a similar
approach: a multiplexer selects by default the data arriving from the Write Back stage, but for
jal and jalr the correct value of the Program Counter is chosen (the destination register has
already been selected by the ARG_PROC process) and sent to the register file.
2.2.1 PC / NPC conflict

The possible loss of a clock strobe due to the presence of two sequential modules (the Program
Counter and the New Program Counter) in the same stage was initially resolved with a register
sensitive to the falling edge of the clock.
However, this element caused a frequency reduction, as data must be ready in a shorter period
of time, even though the critical path was not involved. As explained later in section 4.1,
this module has been replaced with a simple latch.
2.3 VLIW Support

The Very Long Instruction Word support can be enabled or disabled through the vliw_en control
signal: the instruction memory is designed to provide 64-bit instructions, which are split and
sent to two different data paths.
The LSB part of the long instruction is always executed and can contain any instruction; the
MSB part can contain only a reduced set of instructions, or no instruction at all. No
instruction is provided when vliw_en is set to zero, and no data is written to the register
file or the memory module, so there is no risk of data corruption; however, in order to keep
the structure simpler, the second data path cannot execute jumps or branches.
The register file and the memory module have been carefully redesigned with double write ports
and quadruple read ports, with an additional input for the vliw_en signal that disconnects the
additional write port when VLIW support is turned off.
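The splitting of the long instruction word can be modeled as below. This is a behavioral sketch only: the function name is hypothetical, and the all-zero value used for "no instruction" is an assumption, since the report does not give the encoding seen by the second data path when VLIW is off.

```python
NO_INSTRUCTION = 0x00000000  # assumed placeholder for "no instruction"


def split_bundle(word64, vliw_en):
    """Split a 64-bit word into (primary, secondary) 32-bit instructions."""
    primary = word64 & 0xFFFFFFFF             # LSB part: always executed
    secondary = (word64 >> 32) & 0xFFFFFFFF   # MSB part: second data path
    if not vliw_en:
        secondary = NO_INSTRUCTION            # second data path stays idle
    return primary, secondary
```

With vliw_en low the secondary slot never carries an instruction, which matches the statement that nothing is written through the additional write port.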
Chapter 3
Physical Design
After the detailed explanation of the processor structure, this chapter proceeds to the
physical design level, reporting the synthesis configuration and the related results for the
non-optimized design.
3.1 Synthesis Configuration

The first, non-optimized synthesis pass was run with relaxed constraints and normal compilation
effort; the values used are listed in table 3.1.
Type                 Value
Clock Period         40 ns
Power Consumption    550 µW
Area Size            unconstrained
Table 3.1: Non-optimized synthesis constraints
Every module of the processor has been analyzed, elaborated and compiled by Synopsys with no
errors; a minimum amount of RAM, only two lines, has also been synthesized, because otherwise
a missing module causes problems during silicon placement in Encounter.
The commands for applying the above constraints are:
create_clock -name "CLK" -period 40 clock
set_max_delay 40 -from [all_inputs] -to [all_outputs]
set_max_dynamic_power 550 uW
3.2 Results Report

Synopsys managed to respect the given constraints for both timing and power; table 3.2 shows
that the slack is positive, so there are no timing violations.
The power constraint has also been respected, as the Total Dynamic Power obtained in table 3.3
is lower than the value requested.
Type                  Value
Data Required Time    39.90 ns
Data Arrival Time     -24.46 ns
Slack                 15.44 ns
Table 3.2: Non-optimized synthesis timing results
Type                   Value
Cell Internal Power    496.97 µW
Net Switching Power    34.28 µW
Total Dynamic Power    531.25 µW
Table 3.3: Non-optimized synthesis power results
Type                     Value
Combinational area       57983.38 mm²
Noncombinational area    25153.22 mm²
Total area               83136.61 mm²
Table 3.4: Non-optimized synthesis area results
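The figures in tables 3.2 to 3.4 can be cross-checked with a few lines of Python. This is only an arithmetic sketch; the function and variable names are illustrative, and the values are copied from the tables.

```python
def slack_ns(required_ns, arrival_ns):
    """Slack is required time minus arrival time; positive means timing is met."""
    return round(required_ns - arrival_ns, 2)


timing_slack = slack_ns(39.90, 24.46)          # table 3.2
dynamic_power_uw = round(496.97 + 34.28, 2)    # table 3.3: internal + switching
total_area = round(57983.38 + 25153.22, 2)     # table 3.4: comb. + noncomb.
power_budget_met = dynamic_power_uw < 550      # constraint from table 3.1
```

The slack and the dynamic power reproduce the reported values exactly; the summed area differs from the reported total by 0.01, which is consistent with rounding of the two displayed components.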
3.3 On Silicon
The design process has then been carried on to silicon placement and routing, in order to
extract more detailed information about IR drop, electromigration and delay.
Figure 3.1: DLX processor on silicon after place and route

As figure 3.1 shows, there are four route violations, marked by white crosses. However, they
did not influence the delay computation, which showed no maximum delay violation; the maximum
delay corresponds to the previously selected clock period (40 ns).
As for the power results, Encounter reported the following data (more exhaustive data is
reported in the project data files):
Type                           Value
Average Power                  1.305 mW
Average Leakage Power          0.083 mW
Worst IR drop                  12.1 mV
Worst Electromigration (M1)    29.539 µA
Table 3.5: Non-optimized synthesis silicon results
Figure 3.2: DLX processor IR drop rail analysis
Chapter 4
Optimizations
In this final chapter a complete optimization analysis is performed, restarting from the RTL
level and coming back to the silicon representation.
4.1 Going back to RTL level

In order to achieve higher frequencies, the design restarted from the RTL level, where the
single modules are implemented.
One module in particular previously caused a performance loss: the falling-edge register of
the New Program Counter. Even though the critical path is not much affected by it, data from
the PC had to be delivered within half a clock period, causing a general slowdown.
This module has been replaced with a simple latch, with rising edge sensitivity, so that the
PC / NPC conflict due to the presence of two sequential modules in the same stage is still
avoided.
4.2 Synthesis Reconfiguration

The target of the optimization is to increase the clock frequency and reduce power consumption,
neglecting area constraints; this time a higher computational effort has been put into both
the synthesis process and the silicon placement and routing. The constraints used are:
Type                 Value
Clock Period         30 ns
Power Consumption    250 µW
Area Size            unconstrained
Table 4.1: Optimized synthesis constraints
As for power reduction, this processor has a peculiar structure, with a Very Long Instruction
Word path that can be disconnected at any time, so a very effective power reduction technique
is clock gating. Since the second data path can be unused, clock gating deactivates the clock
strobes for unused modules until actual input data arrives.
The configuration script used in Synopsys is:
create_clock -name "CLK" -period 30 clock
set_max_delay 30 -from [all_inputs] -to [all_outputs]
set_clock_gating_style -sequential_cell latch \
    -positive_edge_logic {and} -negative_edge_logic {or}
set_max_dynamic_power 250 uW
propagate_constraints -gate_clock
compile -exact_map -gate_clock -map_effort high -power_effort high
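The latch-based gating style with AND logic selected in the script can be modeled behaviorally as follows. This is a simplified sketch, assuming the usual arrangement in which the enable is captured by a latch that is transparent while the clock is low; the function name and the sampled-waveform representation are illustrative and do not come from the project sources.

```python
def gated_clock(clk, enable):
    """Latch-based AND gating over sampled waveforms (lists of 0/1 values)."""
    latched = 0
    out = []
    for c, en in zip(clk, enable):
        if c == 0:
            latched = en          # latch is transparent during the low phase
        out.append(c & latched)   # AND gate produces the gated clock
    return out


clk = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 1, 1, 0]
gated = gated_clock(clk, enable)
```

Because the enable is held while the clock is high, a change of the enable during the high phase neither truncates nor creates a pulse, so registers on the unused data path simply see no clock edges.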
4.3 Final Results

After synthesis the processor has been placed on silicon with Encounter; no violations were
detected.
Figure 4.1: Optimized DLX processor silicon after place and route
All the synthesis constraints have been respected; the new configuration reports:
Type                  Value
Data Required Time    30 ns
Data Arrival Time     -3.78 ns
Slack                 26.22 ns
Table 4.2: Optimized synthesis timing results
Type                   Value
Cell Internal Power    196.54 µW
Net Switching Power    46.73 µW
Total Dynamic Power    243.27 µW
Table 4.3: Optimized synthesis power results
Type                     Value
Combinational area       63230.93 mm²
Noncombinational area    26457.17 mm²
Total area               89688.1 mm²
Table 4.4: Optimized synthesis area results
Table 4.2 shows that the timing of the data path has been reduced to the value selected, and
that the slack is positive, so no timing violations are present. The power consumption in
table 4.3 is greatly reduced thanks to clock gating; as expected, this technique increased the
combinational area and the net switching power, while reducing the cell internal power. The
area has grown only slightly (table 4.4).
After silicon placement and routing, the values extracted from Encounter are (more details are
available in the project data files):
Type                           Value
Average Power                  1.911 mW
Average Leakage Power          0.101 mW
Worst IR drop                  15.09 mV
Worst Electromigration (M1)    0.158 mA
Table 4.5: Optimized synthesis silicon results
4.3.1 Final Conclusions

From the synthesis point of view the optimization was quite successful: with an area expansion
of only about 8%, the clock period has been reduced by 25% (a frequency increase of about 33%)
and the dynamic power consumption has been reduced by 54%. From the silicon placement point of
view, however, it is interesting to notice that the extracted values increased (average power
from 1.305 mW to 1.911 mW, worst IR drop from 12.1 mV to 15.09 mV): this is expected because,
most likely, in order to satisfy the optimized timing and power constraints, Encounter had to
use very high speed gates, which require more current and dissipate more leakage power.
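The percentages quoted in the conclusions follow directly from the synthesis tables; a short Python check, with illustrative names and the values copied from tables 3.1 to 4.4, is given here as a worked example.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


area_change = percent_change(83136.61, 89688.10)      # total area
period_change = percent_change(40, 30)                # clock period in ns
frequency_change = percent_change(1 / 40, 1 / 30)     # clock frequency
power_change = percent_change(531.25, 243.27)         # total dynamic power in µW
```

The area grows by roughly 7.9%, the period shrinks by 25% (which corresponds to a frequency gain of about 33%), and the dynamic power drops by roughly 54%.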
Appendix A
Simulation Waves
The assembly program used to test the processor functionality and the resulting waves are
provided here; the testbench deactivates VLIW support after ten clock cycles, and figure A.1
shows that execution then continues with single instructions.
addi r5,r2,#13
addui r7,r2,#15
sub r11,r7,r5
sw 10(r0),r11
addu r14,r5,r7
slli r1,r11,#1
jal #32
nop
nop
lw r2,10(r0)
addu r14,r11,r7
and r22,r2,r2
sge r20,r14,r1
bnez r31, #-4
nop
nop
nop
nop
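The arithmetic part of the program above can be replayed with a tiny Python model, as a way to see which results to expect on the result signals. It is a sketch under stated assumptions: all registers start at zero, only the five arithmetic instructions before the jump are modeled (the store and everything after jal are left out), and the tuple encoding of the instructions is purely illustrative.

```python
def run(program):
    """Execute (op, rd, rs1, src) tuples and return the value written by each."""
    regs = [0] * 32                      # assumption: registers reset to zero
    trace = []
    for op, rd, rs1, src in program:
        a = regs[rs1]
        b = src if op.endswith("i") else regs[src]   # immediate or register
        if op in ("addi", "addui", "addu"):
            value = a + b
        elif op == "sub":
            value = a - b
        elif op == "slli":
            value = a << b
        else:
            raise ValueError(f"unsupported instruction: {op}")
        regs[rd] = value & 0xFFFFFFFF
        trace.append(regs[rd])
    return trace


program = [
    ("addi", 5, 2, 13),    # addi  r5,r2,#13
    ("addui", 7, 2, 15),   # addui r7,r2,#15
    ("sub", 11, 7, 5),     # sub   r11,r7,r5
    ("addu", 14, 5, 7),    # addu  r14,r5,r7
    ("slli", 1, 11, 1),    # slli  r1,r11,#1
]
results = run(program)
```

The computed values 13, 15, 2, 28 and 4 are the same ones that appear on the result and result_vliw signals in figure A.1.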
Figure A.1: DLX simulation waves
[Waveform of testbench tb_glx (architecture test), from 0 ps to 2000000 ps. Signals shown under
/tb_glx/cpu: clock, reset, vliw_en, result (0, 13, 0, 2, 0, 28, 0, 4, 0), result_vliw (0, 15, 0),
aluout (0, 13, 2, 28, 4), aluout_vliw (0, 15, 10, 0), newpc (0, 8, 16, 20, 24, 28),
i_alu_opcode (adds, subs, addu, llsh, adds), i_alu_opcode_vliw (addu, adds, nop), and the
latch, multiplexer and memory enable signals of both data paths.]
