
ON-CHIP OPTICAL INTERCONNECTS: ARCHITECTURE, CIRCUIT & DEVICE CHALLENGES AND DIRECTIONS

A Thesis Presented to the Faculty of the Graduate School of Cornell University In Partial Fulfillment of the Requirements for the Degree of Master of Science

by RAJEEV KUMAR DOKANIA August 2008

© 2008 RAJEEV KUMAR DOKANIA

ABSTRACT

Recent trends in VLSI chip design suggest a shift towards multi-core designs. With a huge number of processing cores integrated on a single die, there is a need to carry large amounts of data both on-chip and off-chip. As Moore's law is expected to hold through the next decade, increasing research effort is directed towards the possibilities that technology scaling opens up for improving the performance of future multi-core designs. In this thesis, we first look into the scaling trends and see how interconnects could become a bottleneck at future technology nodes. After establishing that the gap between interconnect and transistor performance is widening, we study optical interconnects and their potential, based on research work done across the world. After showing that optical interconnect does have the potential to provide low-latency, low-power, high-bandwidth links for both on-chip and off-chip communication, we assess the performance advantages that optical technology would bring at a future scaled node, both on- and off-chip. Collaborative work presented here indicates that for some multi-core architectures, optics can give an average 13% speed-up over the corresponding electrical-interconnect-based system within the same power budget. To better exploit the benefits of optical interconnects, a higher data rate per channel would be beneficial. With that in mind, we looked into the design of a 10-Gbps Clock and Data Recovery (CDR) circuit to study the power and per-channel bandwidth of high-speed parallel interconnects. The design was based on a phase-interpolator CDR topology, and achieved low complexity through an innovative set-reset-latch based loop-filter scheme. The power/Gbps of the circuit, designed in an IBM 90nm process, suggests that the design is competitive with other state-of-the-art designs. This work also demonstrates that the latencies involved with non-source-synchronous data transfer may actually be detrimental to overall system performance, and that for on-chip communication an optical clock-forwarding based solution may be more useful on latency grounds. Another direction in on-chip optical interconnects focuses on thermal and process tuning of optical elements. This work focuses on ring-based modulators, in collaboration with the silicon photonics group at Cornell. We looked at carrier-injection and depletion-based schemes for thermal tuning with a split-ring structure, before finding that the large thermal sensitivity of the device structure itself can be used for localized tuning of the modulator, which proved to be a 2X better solution than metal heaters in terms of power consumption. We experimentally demonstrated that the data pattern can be recovered over a 15 °C temperature variation by this method. This result is significant for the integrate-ability of the modulator. We also looked at other system aspects of the design and at what it will take before optics can be integrated on-chip. With all this work, I hope to have contributed to the advancement of knowledge and theory of on-chip optical interconnects.

BIOGRAPHICAL SKETCH
Rajeev Kumar Dokania was born in Banka, in the state of Bihar, India, in December 1981. He studied at the local school Bal Vikas Vidyalaya until his VIIth standard, at Rani Mahakam Kumari High School until his Xth, and at Bihar College of Engineering (now NIT), Patna until his XIIth standard. He cleared the IIT entrance exam and entered the Indian Institute of Technology, Kharagpur in 1999. He graduated from IIT with a B.Tech(H) degree in 2003, securing the top rank in the Electrical Engineering Department. After graduating from IIT, Rajeev worked with Intel Corporation in Bangalore on their networking and multi-core server products as a component design engineer for two years, until August 2005. Rajeev was admitted to the School of Electrical and Computer Engineering at Cornell University in August 2005. He started working as a Graduate Research Assistant in Dr. Apsel's lab in August 2005, taking up optical interconnects as his research topic. Rajeev interned with the silicon photonics group at Intel for 6 months in 2006-2007. Rajeev was also awarded an Intel PhD Fellowship at Cornell in 2007-2008.


To my Dad, Mom and Brother


ACKNOWLEDGMENTS

I would like to first thank Dr. Apsel for taking me on as her graduate student and providing me guidance in doing research; I want to thank her for being a very good adviser. Special thanks to my special committee members, Dr. José F. Martínez and Dr. Michal Lipson, for their help and time. Special thanks also go to my collaborators across different research groups. I worked on thermal and process tuning aspects of the modulators with Dr. Lipson's group and received good support from her student Sasikant Manipatruni. I also worked with Dr. José Martínez's and Dr. Albonesi's research groups: I am extremely thankful to Nevin Kirman and Meyrem Kirman, and to Mathew Wattkins and Prof. David Albonesi, for our collaborative work on on-chip optical interconnects. I am also thankful to my lab-mates Anand, Zhongtao, Tony, Xiao, Silvia, Paul, Bo, and my room-mate Siva for useful discussions. I would also like to thank Intel for their fellowship support, and to acknowledge the support from NSF.

TABLE OF CONTENTS

BIOGRAPHICAL SKETCH ______ iii
DEDICATION ______ iv
ACKNOWLEDGEMENTS ______ v
LIST OF FIGURES ______ xi
LIST OF TABLES ______ xvi
CHAPTER 1: TECHNOLOGY SCALING & THE OPTICAL PROMISE
1.1 Chapter Introduction: ______ 1
1.2. Trends in VLSI chip-design: ______ 2
1.3. Trends in off-chip and on-chip bandwidth: ______ 3
1.4. Trends in global and local interconnect delays: ______ 4
1.4.1 Predictions for delay in different technology nodes: ______ 7
1.5 Optical interconnect and its promise: ______ 10
1.6 Optical interconnect evaluations: ______ 13
1.6.1 Transmitter system: ______ 13
1.6.2 Waveguide: ______ 15
1.6.3 Receiver System: ______ 15
1.7 Results: ______ 16
1.8 Key Takeaways: ______ 18

CHAPTER 2: OPTICAL INTERCONNECT FOR FUTURE CMPs
2.1 Chapter Introduction: ______ 19
2.2 Architectural considerations: ______ 20
2.2.1 What characteristics of optical interconnect would we like to exploit for on-chip communication?: ______ 20
2.2.2 What kind of interconnect fabric: mesh, cross-bar or bus?: ______ 21
2.2.3 How deep will optics have to go for each and every node?: ______ 21
2.2.4 Optical component choices: ______ 22
2.2.5 Electrical and optical power & delay characterization and assumptions: ______ 23


2.3 Design Space exploration of different Architectural organizations: ______ 23
2.3.1 Design Exploration of multiplex-by-node Vs multiplex-by-address topologies: ______ 24
2.3.2 Scalability of the two organizations: ______ 25
2.3.2.1 Number of Wavelengths ______ 25
2.3.2.2 Power ______ 26
2.3.2.3 Extinction ratio ______ 29

2.3.3 Bus Frequency ______ 29
2.3.4 Other Design Issues ______ 30
2.3.6 Arbitration and latency considerations: ______ 31
2.3.7 Summary of the design space exploration for architectural organization ______ 31
2.4 Hybrid Optical-Electrical Interconnect ______ 31
2.5 Optical Interconnect for a CMP in 32nm Technology: a case study ______ 32
2.5.1 CMP Model ______ 32
2.5.2 CMP Core Frequency Calculation ______ 34
2.5.3 Opto-electrical hierarchical bus design ______ 36
2.5.4 Protocol for the hierarchical hybrid bus ______ 37
2.5.5 Area Estimation ______ 38
2.6 Electrical Baseline: ______ 39
2.7 Assumptions about the processor core: ______ 41
2.8 Results ______ 41
2.9 Future work: what will be more important to tackle for on-chip optics, latency or higher bandwidth for the same power budget?: ______ 43
2.10 Key Takeaways: ______ 43

CHAPTER 3: DESIGN OF CLOCK & DATA RECOVERY (CDR) CIRCUIT FOR HIGH SPEED PARALLEL ON-CHIP INTERCONNECT
3.1 Chapter Introduction: ______ 45
3.2 Design of a 10Gbps low Complexity CDR: ______ 46
3.2.1 Functionality of a CDR Circuit: ______ 46
3.2.2 CDR Architecture Exploration: ______ 46
3.2.3 CDR design for 10Gbps: ______ 49
3.2.3.1 Phase detector Topology: ______ 49
3.2.3.2 CML Flip-flop design: ______ 49


3.2.3.3 Design of CML (Analog) XOR gate: ______ 52
3.2.3.4 Results with the Phase detector: ______ 53
3.2.3.5 Set-Reset Latch as the loop-filter: ______ 53
3.2.3.6 Design of CML (Analog) NAND gate: ______ 54

3.2.4 Design of the Phase Interpolator ______ 56
3.2.5 Design of the control circuit: ______ 58
3.2.6 I & Q clock generation circuit using DLL: ______ 59
3.2.6.1 Delay stages for DLL: ______ 60
3.2.6.2 Phase detector: ______ 60
3.2.6.3 Charge pump for the DLL: ______ 61

3.2.7 CML-to-CMOS converter: ______ 63
3.2.8 Design layout: ______ 63
3.2.9 Top level simulation: ______ 64
3.2.10 A comparison of our results with other state-of-the-art designs: ______ 66
3.3 Understanding the power trade-off with high speed operations: ______ 67
3.4 Understanding the latency involved with synchronization with asynchronous high speed operations: ______ 67
3.5 Use of clock-forwarding for better performance for on-chip communications: ______ 68
3.6 Key Takeaways: ______ 69

CHAPTER 4: THERMAL & PROCESS TUNING OF RING MODULATOR
4.1 Chapter Introduction: ______ 70
4.2 Silicon Modulator: ______ 70
4.3 Ring-Resonator based Modulator: ______ 71
4.3.1 Understanding Q and its importance for device performance: ______ 73
4.3.2 High Q, the problem?: ______ 73
4.3.2 Ways to De-Q: ______ 74
4.3.3 Is just lowering Q the solution for integrate-ability?: ______ 75
4.4 Thermal sensitivity of the device: ______ 75
4.4.1 Device structure and operation: ______ 75
4.4.2 Degradation in Modulated Waveform due to Thermal Effects ______ 77
4.4.3 Background work: ______ 79
4.5 Thought progression towards an injection-charge based prospective solution: ______ 80


4.6 Proposed Solution: ______ 82
4.7 Design of the device structure: ______ 83
4.7.1 Method: ______ 83
4.7.2 Isolation ______ 84
4.7.3 Reverse Isolation: ______ 85
4.7.4 Back-Reflection ______ 85
4.7.5 Extinction Ratio Degradation: ______ 86
4.7.6 Thermal self-heating, the killer: ______ 86
4.7.6.1 Thermal Modelling: ______ 87
4.8 Change of basic device structure by altering thickness of Si or SiO2 to reduce thermal sensitivity: ______ 90
4.9 Depletion based solution: ______ 90
4.10 Wide temperature range operation using thermal thin-film heating: ______ 91
4.11 Discussion: benefits and scalability of the thin-film heating method: ______ 94
4.12 Suggested future solutions for thermal compensation: ______ 95
4.12.1 Silicon-related changes: ______ 95
4.12.2 Cladding-related changes: ______ 96
4.13 Polymer cladding based thermally insensitive device: ______ 96
4.14 Process variation and tuning: ______ 97
4.15 Key Takeaways: ______ 97

CHAPTER 5: OPTICAL INTERCONNECT: QUESTION MARKS & DIRECTIONS
5.1 Chapter introduction: ______ 99
5.2 Integration: 3D stacking and constraint on off-chip optical signal path? ______ 99
5.3 Thermal budgeting for 3-D stacking ______ 101
5.4 Thermal sensitivity and process variation of silicon photonic devices ______ 101
5.5 Waveguide cross-over and optical vias: ______ 102
5.5.1 PSE (Photonic switching element): ______ 102
5.5.2 Interlayer coupling (Optical Vias): ______ 103
5.6 Polarization sensitivity of the devices: ______ 104
5.7 Misplaced assumption about WDM channels: ______ 105
5.8 Synchronous Vs Asynchronous data transfer and latency overhead: ______ 106


5.9 Power distribution across asymmetrically distributed nodes in a broadcast based network: ______ 106
5.10 OTDM: is it beneficial for on-chip interconnect?: ______ 107
5.11 WDM and filtering: ______ 107
5.12 Optical logic for computation and optical memory on-chip? ______ 107
5.13 System power Vs Real Power: ______ 108
5.14 System Real-estate considerations: ______ 108
5.15 Optical gain and its thermal sensitivities: ______ 108
5.16 Laser source: ______ 108
5.17 Mach-Zehnder modulator for the I/Os: ______ 108
5.18 Cost: packaging, testing, assembly, design: ______ 108
5.19 Conclusion: ______ 110
REFERENCES: ______ 111

LIST OF FIGURES

Figure 1.1 Moore's law over the years [1.2] ______ 2
Figure 1.2: Changing trends in the design space in recent times (note: trends are on a per-plot basis and don't necessarily reflect the trend w.r.t. each other) ______ 3
Figure 1.3: Bandwidth situation for feeding the processor with data, on-chip and off-chip (the figure is indicative of just the trends over the years, and not relative to each other) ______ 4
Figure 1.4: Memory wall, showing the significance of larger on-chip caches and the importance of the communication network between them [1.3] ______ 4
Figure 1.5: Wire and gate delay performance [1.4] ______ 5
Figure 1.6: 0.25um aluminum interconnect stack compared to a 0.13um copper interconnect stack [1.5] ______ 6
Figure 1.7: Modeling of resistance and capacitance of a wire ______ 7
Figure 1.8: Un-repeatered and repeatered wire segments ______ 8
Figure 1.9: Definition of FO4 ______ 8
Figure 1.10: Optical interconnect, from long distance to short distance [1.20] ______ 13
Figure 1.11: On-chip optical interconnect system ______ 13
Figure 1.12 The characteristics based on which the ring-based resonators [1.24] work ______ 14
Figure 1.13(a) Electrical with 100% activity ______ 17
Figure 1.13(b) Optical with 100% activity ______ 17
Figure 1.14(a) Electrical with 20% activity ______ 17
Figure 1.14(b) Optical with 20% activity ______ 17
Figure 2.1 Optical bus for design exploration ______ 25


Figure 2.2: Interconnect power for the two organizations; for any number of nodes, organization 1 (top) is more power-consuming than organization 2 (bottom) ______ 28
Figure 2.3: Leakage power (% of total power) projections ______ 36
Figure 2.4 Optical bus considered for evaluation in our work ______ 37
a.) Address path for the Electrical baseline ______ 41
b.) Data path for the Electrical baseline ______ 41
Figure 2.5 Electrical baseline ______ 41
Figure 2.6 Speedup of different applications ______ 42
Figure 2.7 Latency breakdowns for different applications ______ 42
Figure 3.1 Typical data eye, and optimal clock sampling edge ______ 46
Figure 3.2 Parallel sampling of the data edge ______ 46
Figure 3.3 PLL based CDR ______ 47
Figure 3.4 Phase Interpolator based CDR ______ 48
a.) Hogge's phase-detector ______ 50
b.) Alexander's phase-detector ______ 50
c.) Timing for Hogge's detector ______ 50
d.) Timing for Alexander's detector ______ 50
e.) Transfer fn for Hogge's detector after integrator ______ 50
f.) Transfer fn for Alexander's detector ______ 50
Figure 3.5 Two main types of phase detectors for high speed low power CDRs ______ 50
Figure 3.6 CML based Analog FF ______ 51
Figure 3.7 Digital FF ______ 51
Figure 3.8 CML based XOR ______ 52
Figure 3.9.a) Up, Down pulse when the clock is leading the data ______ 53
Figure 3.9.b) Up, Down pulse when the clock is lagging the data ______ 53
Figure 3.10 Set-Reset Latch for use as loop-filter ______ 54


Figure 3.11 Analog NAND Gate ______ 55
Figure 3.12.a) Modified up and down pulses when data leads the clock edge. Note: X and Y are the corresponding up and down pulses before the set-reset latch ______ 55
Figure 3.12.b) Modified up and down pulses when data lags the clock edge. Note: X and Y are the corresponding up and down pulses before the set-reset latch ______ 55
Figure 3.13 Phase Interpolator ______ 56
Figure 3.14: Clock phases interpolated between 0 and 90 degrees, with the designed scheme ______ 57
Figure 3.15: Interpolated clocks over 2 quadrants ______ 57
Figure 3.16: Phase interpolator control block ______ 58
Figure 3.17: Controllable current source for the control circuit ______ 58
Figure 3.18: DLL schematic for generating I & Q clocks ______ 59
Figure 3.19: Variable delay stage used for DLL ______ 60
Figure 3.20 Sampling-type phase detector used for the DLL phase detection ______ 61
Figure 3.21 Simple charge-pump based loop-filter for the DLL ______ 62
Figure 3.22: Charge pump node voltage, vctl ______ 62
Figure 3.23: Different clocks at different stages of the delay line, 90 degrees apart ______ 63
Figure 3.24.a) Phase-detector ______ 63
Figure 3.24.b) Phase-interpolator ______ 63
Figure 3.25: Up and down pulses of the control loop at the top level ______ 64
Figure 3.26: Data eye at 10-Gbps after retiming with the recovered clock ______ 64
Figure 3.27: Recovered clock eye at 10-GHz with fewer stages of interpolation ______ 65
Figure 3.28: Recovered clock eye at 10-GHz with 64 interpolation stages ______ 65
Figure 3.29: Data rate Vs power/Gbps trade-off ______ 67


Figure 4.1 MZI and Ring-Resonator ______ 71
Figure 4.2 Ring-resonator [4.1] ______ 72
Figure 4.3: Transmission characteristics vs ______ 73
Figure 4.4: Transmission effect by varying t and ______ 74
Figure 4.5: De-Qed spectrum by cascading ______ 75
Figure 4.6.a) Measured spectrum of the experimental ring ______ 76
Figure 4.6.b) Zoomed-out spectrum ______ 76
Figure 4.7 Transmission spectrum under DC bias voltage ______ 77
Figure 4.8 Modulated waveform at 1 Gbit/s ______ 77
Figure 4.9 Transmission spectrum under temperature increased successively by 1K ______ 77
Figure 4.10 Distortion in modulated waveforms as temperature is increased; successive pictures from left to right are with ΔT = 5K ______ 78
Figure 4.11 Heater based tuning [4.14] ______ 80
Figure 4.12 MEMS based tuning [4.15] ______ 80
Figure 4.13 Carrier injection based tuning
Figure 4.14 Carrier injection Vs spectral shift ______ 81
Figure 4.15 (a) Original device structure [4.1] and (b) proposed device structure ______ 83
Figure 4.16 Leakage current as a function of voltage difference between the two regions w/o reverse isolation ______ 85
Figure 4.17 Mesh structure of the simulated structure ______ 85
Figure 4.18 a.) Voltage Vs current required per 5um of the device length b.) Current Vs carrier concentration (in /cm3) ______ 86
Figure 4.19 Transient simulation to see the lifetime of the carriers at the 1e18/cm3 level of injection ______ 87


Figure 4.20 Self-heating effect of the device; temperature profile shown at a cross-section of the device in the lateral and vertical directions for the normal modulation case ______ 89
Figure 4.21: Self-heating effect with a carrier injection level of 1e18/cm3; locally, temperature rises by around 40 degrees C ______ 89
Figure 4.22 Depletion width Vs voltage, with 1e18/cm3 carriers ______ 91
Figure 4.23 Setup for controlling DC bias current through the device ______ 92
Figure 4.24 Restoration of the distorted waveforms using the bias current compensation scheme: a.) Normal data b.) Corrupted data at ΔT = 15K and c.) Recovered data at ΔT = 15K ______ 92
Figure 4.25 Optical transmission eye diagrams of the electro-optic modulator: a) Eye diagram at ΔT = 0K b) Degraded at ΔT = 15K c) Retrieved at ΔT = 15K ______ 93
Figure 5.1 A 3-D Stack [5.1] ______ 100
Figure 5.2 A waveguide intersection ______ 102
Figure 5.3 A PSE Scheme [5.2] ______ 103
Figure 5.4 Inter-layer coupling (or, optical via) ______ 104
Figure 5.5 Cascading rings, for increasing FSR ______ 105
Figure 5.6 Flexible scheme for off-chip communication ______ 109


LIST OF TABLES

Table 1.1 Wire delay according to ITRS with typical sizing [1.6] ______ 9
Table 1.2 Wire delay with optimal sizing ______ 9
Table 1.3 Parameters used for evaluation ______ 16
Table 2.1 Parameters used to estimate the power components in Fig. 2.2 ______ 28
Table 2.2 Component delays of transmitters and receivers at 45, 32 and 22nm technologies ______ 30
Table 2.3 Summary of ITRS [2.14] parameters used to calculate the processor frequencies ______ 36
Table 2.4 Processor core assumptions [2.16] ______ 41
Table 3.1 Comparison of performance with other state-of-the-art designs ______ 66


CHAPTER 1 TECHNOLOGY SCALING & THE OPTICAL PROMISE

1.1 Chapter Introduction:
Before we conceive of an architecture or assess the benefits of a technology at a future node, it is important to first assess the design trends. With that in mind, this chapter first looks at recent trends in microprocessor design: the shift to multi-core designs and the increasing focus on performance/watt. We move on to look at the on-chip and off-chip bandwidth trends for processor design. After a brief overview of the bandwidth requirements of current-generation processors, we begin to discuss our primary subject, interconnects. After laying out the problems on the interconnect front, we discuss some promising technologies and make a case for optical interconnects, based on ITRS projections. We discuss the potential performance advantages of an optical interconnect in a computing environment, and how it can help us work around the problems of electrical interconnects. We conclude the chapter with an observation of recent developments in the field that show on-chip optical interconnect as a possible future technology of choice.

1.2. Trends in VLSI chip-design:
VLSI chip design has come a long way over the last several decades, with several orders of magnitude improvement in system performance [1.1]. These improvements have been brought about by our ever-increasing ability to pack a huge number of transistors into a single die, coupled with equally impressive architectural innovations.

Figure 1.1 Moore's law over the years [1.2]
As shown in Fig. 1.2, the increasing performance of processors up to 2004 was achieved through increasing processor speed, increasing die area, and an increasing power budget. More recently we have seen a shift towards slower parallel cores rather than a fast single core. Increasing energy costs and on-chip power density limits also force a focus on energy-efficient design, and have put a ceiling of around 180-190W on TDP (Thermal Design Power). Yield and packaging issues have likewise seen chip sizes stagnate at around ~400mm2. Even so, technological innovation continues to improve processor performance. These trends are very important to track, as they guide us when we conceive of architectures at future technology nodes.

[Figure 1.2 plot: performance (MIPS), number of cores per die, core frequency, power per die, and die size, in arbitrary units on a logarithmic scale, 1970-2008.]

Figure 1.2: Changing trends in the design space in recent times (note: trends are on a per-plot basis and don't necessarily reflect the trend w.r.t. each other)

1.3. Trends in off-chip and on-chip bandwidth:
With the design shift to multi-cores, there has been a change in both on-chip and off-chip bandwidth requirements, as shown in Fig. 1.3. More cores tend to generate, on average, more I/O traffic due to memory consistency and coherency requirements. This, coupled with a stagnating I/O pin count, has resulted in a shift to higher I/O speeds. Architectural innovations that hide memory latency have also stagnated, creating a shift towards increasing cache sizes [1.3], as shown in Fig. 1.4. With larger caches and more cores per design, there is more on-chip traffic and a demand for higher bandwidth both on-chip and off-chip. With a physically limited number of pins, the trend is to increase the frequency of operation of the I/Os. This trend started with AMD's HyperTransport, which reaches 2.4Gbps per pin, while Intel's QuickPath, or Common System Interface (CSI), promises an even higher 6.4Gbps/pin.
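The pin-count argument above can be made concrete with a two-line calculation: the aggregate bandwidth of a parallel off-chip link is simply pins times per-pin rate, so with the pin count fixed, the per-pin rate is the only remaining lever. The 64-pin link width below is an assumed illustrative value, not a figure from this thesis:

```python
def aggregate_bandwidth_gbytes(n_pins: int, gbps_per_pin: float) -> float:
    """Aggregate bandwidth (GB/s) of a parallel link: pins x per-pin rate."""
    return n_pins * gbps_per_pin / 8.0  # 8 bits per byte

# Same (hypothetical) 64-pin link at the two per-pin rates quoted above:
legacy = aggregate_bandwidth_gbytes(64, 2.4)  # HyperTransport-class signaling
newer = aggregate_bandwidth_gbytes(64, 6.4)   # QuickPath-class signaling
```

At a fixed width, the 2.4 to 6.4 Gbps/pin jump alone yields a ~2.7X aggregate bandwidth improvement.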

[Figure 1.3 plot: core performance (MIPS), core and I/O frequency, number of I/O pins, cache and I/O traffic per instruction, and normalized on-chip and I/O bandwidth, in arbitrary units on a logarithmic scale, 1970-2008.]

Figure 1.3: Bandwidth situation for feeding the processor with data, on-chip and off-chip (the figure is indicative only of the trends over the years, not of the curves relative to each other).

Figure 1.4: Memory wall, showing the significance of larger on-chip caches and the importance of the communication network between them [1.3].

1.4. Trends in global and local interconnect delays:
At the time of this thesis, we are at the 45nm technology node, and according to ITRS and industry projections, by 2010-2012 devices will be built at a 32nm node. Scaled technology will provide more transistors and more on-chip processing cores, with more latency and a higher bandwidth requirement for the communication between these cores. At scaled technology nodes, the interconnect between transistors becomes increasingly important as the performance gap between interconnects and transistors widens. This trend was first noted in a paper from the early 1980s by Saraswat et al. [1.4], shown in Fig. 1.5. Though the figure does not state the length of the wire used, it still conveys the all-important message that for future scaled technologies, wires may be the performance bottleneck.

Figure 1.5: Wire and gate delay performance [1.4]
Wire performance degrades because, at scaled technology nodes, all dimensions of the metal stack (width, thickness, and separation) are reduced.

Figure 1.6: 0.25um aluminum interconnect stack compared to a 0.13um copper interconnect stack [1.5]
The R and C per unit length of the wire, which determine its delay, do not scale exactly in proportion to the R and C of the gates. Between two technology nodes, even though scaling the width and thickness would ideally make the wire's capacitance and resistance track each other, constraints on the vertical and horizontal separation make it difficult to scale the capacitance in proportion to the resistance. Since many local wires are needed to connect all the devices, their separation cannot be increased much, so the delay effect is larger on local wires. Fortunately, local wire lengths scale along with device size, keeping the delay component of local wires in check; using copper instead of aluminum has also reduced the problem for local wires. This is not true for global wires, whose access length has increased with every generation of design,

though of late, with chip sizes stagnating, the average access length of global wires has also started to stagnate at around 25mm. Even with a fixed access length, the laws of scaling dictate that global wire delay per unit length will increase. With increased separation and optimal repeater placement, latency can be kept near constant, but this comes at the cost of power and of reduced spatial bandwidth, due to the wider separation between wires needed to keep parasitics in check.

Figure 1.7: Modeling of the resistance and capacitance of a wire

R = \frac{\rho \, l}{(\mathrm{thickness} - \mathrm{barrier} - \mathrm{dishing}) \cdot (\mathrm{width} - 2 \cdot \mathrm{barrier})} \quad (1.1)

C = \underbrace{\frac{\varepsilon \, \mathrm{thickness} \cdot l}{\mathrm{separation}}}_{left} + \underbrace{\frac{\varepsilon \, \mathrm{thickness} \cdot l}{\mathrm{separation}}}_{right} + \underbrace{\frac{\varepsilon \, \mathrm{width} \cdot l}{h_{top}}}_{vert} + \underbrace{\frac{\varepsilon \, \mathrm{width} \cdot l}{h_{bottom}}}_{vert} \quad (1.2)
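As a concrete illustration, the R and C expressions above can be evaluated numerically. All geometry and material values below are illustrative assumptions (not ITRS data), chosen only to show the shape of the calculation:

```python
# Sketch of the wire R and C model of Eqs. (1.1)-(1.2).
# All numbers below are illustrative assumptions, not ITRS data.

RHO_CU = 2.2e-8        # ohm*m, effective copper resistivity (assumed)
EPS = 8.85e-12 * 3.9   # F/m, SiO2-like dielectric permittivity (assumed)

def wire_r_per_m(width, thickness, barrier=0.0, dishing=0.0):
    """Eq. (1.1): resistance per metre of a damascene copper wire."""
    area = (thickness - barrier - dishing) * (width - 2 * barrier)
    return RHO_CU / area

def wire_c_per_m(width, thickness, separation, h_top, h_bottom):
    """Eq. (1.2): parallel-plate estimate, left + right + up + down."""
    side = EPS * thickness / separation
    vert_top = EPS * width / h_top
    vert_bot = EPS * width / h_bottom
    return 2 * side + vert_top + vert_bot

# Illustrative 45nm-class global wire (dimensions are guesses):
r = wire_r_per_m(width=300e-9, thickness=600e-9)   # ohm/m
c = wire_c_per_m(300e-9, 600e-9, separation=300e-9,
                 h_top=500e-9, h_bottom=500e-9)    # F/m
print(r * 1e-6, "ohm/um,", c * 1e9, "fF/um")
```

With these assumed dimensions the result lands in the same range as the per-unit-length R and C values tabulated later for optimally sized wires, which is all such a first-order model should be expected to do.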

1.4.1 Predictions for delay in different technology nodes: For calculating the wire delays at different technologies, ITRS projections [1.6] are taken as the reference. The ITRS 2004 documents give the un-repeatered global wire delay as in Table 1.1. Since un-repeatered wires have large delay and limited signal bandwidth, due to the large RC time-constant of long wires, the un-repeatered delay is of little significance. The repeatered wire delay is of more importance, and following the theory developed in the paper by Ho et al. [1.7], we take the repeatered wire delay as the delay of interest for the global wire.

Figure 1.8: Un-repeatered and repeatered wire segments
For calculating the repeatered wire delay, the FO4 delay of a minimum-sized inverter comes into play. The FO4 delay is defined as the delay of an inverter with a fanout of 4, as shown in Fig. 1.9. There are several conventions for estimating this value; an older rule of thumb was FO4 = (250ps/um) * Lmin.

Figure 1.9: Definition of the FO4 delay
According to a white paper published by ITRS, the FO4 delay is given as 13.5 * CV/I of a minimum-sized inverter. We use this as the guiding formula for evaluating the FO4 delay. To calculate the FO1 delay, we use FO1 = (1/3) * FO4, following the same ITRS method.

Delay(\mathrm{Repeatered\_wire})\ (ps/mm) = 2.5 \sqrt{R_{wire}(\Omega/mm) \cdot C_{wire}(F/mm) \cdot FO1(ps)} \quad (1.3)

Energy(\mathrm{Repeatered\_wire}) = 1.3 \cdot C_{wire}(F/mm) \cdot L_{wire}(mm) \cdot V_{dd}^2 \quad (1.4)
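A short numerical sketch of the repeatered-wire delay and energy relations. The coefficient 2.5 is the value that reproduces the repeatered-delay rows of Tables 1.1 and 1.2 from their RC and FO1 rows; the 1.0 V supply in the energy example is an assumption, not an ITRS figure:

```python
import math

def repeatered_delay_ps_per_mm(rc_ps_per_mm2, fo1_ps):
    """Eq. (1.3): delay per mm of an optimally repeatered wire."""
    return 2.5 * math.sqrt(rc_ps_per_mm2 * fo1_ps)

def repeatered_energy_pj(c_ff_per_mm, l_mm, vdd):
    """Eq. (1.4): energy per bit of a repeatered wire, in pJ."""
    return 1.3 * c_ff_per_mm * 1e-15 * l_mm * vdd**2 * 1e12

# ITRS-style inputs (RC and FO1 rows of Table 1.1):
for rc, fo1, expected in [(200, 2.99, 61.10), (380, 1.82, 65.75),
                          (760, 1.21, 75.92)]:
    d = repeatered_delay_ps_per_mm(rc, fo1)
    print(f"RC={rc}: {d:.2f} ps/mm (table: {expected})")

# 25 mm global wire at 32nm, optimal sizing, Vdd = 1.0 V (assumed):
print(repeatered_energy_pj(c_ff_per_mm=315, l_mm=25, vdd=1.0), "pJ/bit")
```

The loop reproduces the 61-76 ps/mm repeatered delays tabulated below to within rounding, and the energy line gives roughly 10 pJ for a 25mm global wire under the assumed supply.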


We use these formulae to calculate the wire delay and energy cost for different interconnect systems. Note, as seen in Table 1.1, that the global wire delay is around 65ps/mm according to the ITRS projection.

Table 1.1. Wire delay according to ITRS with typical sizing [1.6]

Wire Delay (Global)          | 45nm  | 32nm  | 22nm
RC (ps/mm^2, typical width)  | 200   | 380   | 760
FO1 (ps)                     | 2.99  | 1.82  | 1.21
Repeatered Delay (ps/mm)     | 61.10 | 65.75 | 75.92

Table 1.2. Wire delay with optimal sizing

Wire Delay (optimal sizing)  | 45nm  | 32nm  | 22nm
R (ohm/um)                   | 0.15  | 0.306 | 0.6
C (fF/um)                    | 0.345 | 0.315 | 0.288
RC (optimal) (ps/mm^2)       | 51.75 | 96.39 | 172.8
FO1 (ps)                     | 2.98  | 1.82  | 1.21
Repeatered Delay (ps/mm)     | 31.08 | 33.11 | 36.2

However, with optimal sizing [1.5], one can reduce the repeatered wire delay to around 35ps/mm (Table 1.2). Some researchers have shown that with optimal wire sizing this delay can be kept as low as 25ps/mm, though at the cost of increased power. Although there are different suggested methods of calculating the global delay in a future scaled technology, all indications are that it will vary between 25ps/mm and 70ps/mm, depending on the trade-offs among power, spacing, and latency. These trade-offs sit poorly with the increased design focus on faster,

bigger, and less power-hungry designs. Thus there is a need to look at alternatives for global interconnects that avoid such hazards.
1.5 Optical interconnects and their promise: The interconnect problem is not new and has generated a number of approaches from industry and academia. One potential solution that has generated interest is to use light as a means of communicating on-die, and there exists a significant body of work comparing the performance of such systems against their electrical counterparts. Various research groups in academia as well as in industry (e.g., IBM, Intel, Luxtera, Sun) are working to identify the promise that optical interconnects may hold, and a significant amount of the literature discusses the advantages that optical interconnects may have over their electrical counterparts. Haurylau et al. [1.8] extract the delay, bandwidth density, and power requirements of optical interconnect components and discuss the criteria they must meet in order to be competitive with electrical links on-chip. Similarly, Chen et al. [1.9] project the performance characteristics of future optical devices and then compare optical and electrical interconnect paths in terms of delay, bandwidth density, and power. They estimate that, for a unit distance at the 32nm technology node, an optical interconnect would be approximately 2.2 times faster than an electrical wire. They further show that, at the same node, optical interconnects consume less power but suffer lower bandwidth density than their electrical counterparts due to their wider pitches. Kobrinski et al. [1.10] look at optical clock distribution and optical global signaling and compare them with electrical implementations. They do not find large power, jitter, or skew benefits from using optics in clock distribution.
However, comparing the delay and bandwidth opportunities of optical interconnects over electrical wires, they conclude that, together with WDM, optics can be beneficial for global


signaling in terms of high bandwidth and low latency. Chen et al. [1.11] compare four different technologies (electrical, 3D, optical, and RF) for on-chip clock distribution. They show that because most of the skew and power consumption of clock signaling arise in local clock distribution, there are no significant skew or power advantages to the new technologies, including the optical solution. Connor [1.12] reviews optical interconnect technologies and optoelectronic devices for inter-chip and intra-chip interconnects, followed by an EDA design-flow methodology for optical link design. He describes an optical clock distribution network implementation and finds, through circuit simulation, that such a realization can consume significantly less power (5 times lower in the case of a 64-node H-tree at 5GHz) than its electrical counterpart. The work also proposes a behavioral model of a 4x4 crossbar-like data network, based on wavelength routing, that connects four masters to four slaves. On-chip transmission-line interconnects have also been proposed as an alternative to global wires. These interconnects use very wide metal wires so that signals propagate in the high-frequency LC domain at near the speed of light [1.13]. While they do not require any new process to implement, a major drawback is their very low bandwidth density due to the large wire width required, which may not be suitable for realizing a wide inter-processor interconnect. Several off-chip optical interconnect proposals target large-scale shared or distributed memory multiprocessors. However, many of these cannot be directly implemented on chip due to constraints such as the limited number of optical layers, power, layout issues like waveguide crossings for multiple bits, and the relatively small number of wavelengths available for on-chip use. Louri et al. [1.14, 1.15] propose a snoopy address sub-interconnect in which an optical token circulates around the processors to provide arbitration for transmitting requests through an H-tree-like fully optical interconnect; it requires modification of the coherence protocol. Webb et al. [1.16]


focus on optical network implementations in large-scale distributed shared-memory systems. In particular, they propose the use of an optical crossbar (implemented using free-space optics) for intra-cluster connections, and either a crossbar or a less-connected point-to-point hypercube optical interconnect for inter-cluster connections. Burger and Goodman [1.17], in an attempt to exploit the high-bandwidth broadcasting capability of optical interconnects (particularly when free-space optics is used), propose a new execution model that reduces serial overheads within a parallel program by performing the serial code redundantly at each node of a massively parallel multiprocessor/multi-computer system allocated to the program. Nelson et al. [1.18] evaluate the performance improvement of replacing the global point-to-point electrical wires between the unified front-end and multiple back-ends of a large-scale clustered multithreaded (CMT) processor, where the back-ends are spread across the die, spatially interleaved with caches due to thermal constraints. Overall, the main point that emerges from reviewing these works is that the role of optics in computing is still debated. The main advantages attributed to optics are: 1) Low latency: though there is a break-even length beyond which optical wires beat electrical wires, due to E/O and O/E conversion overheads, global wires with typical lengths approaching 25-40mm give optical interconnects a definitive advantage. 2) Electrical isolation: Pappu et al. [1.19] have shown that optical interconnects can be exploited even for short-length communication because of the electrical isolation they provide, which is beneficial for high fan-out interconnects. 3) Low jitter and skew. 4) High data bandwidth. 5) Low power.


Figure 1.10: Optical interconnects, from long distance to short distance [1.20]
These features, coupled with the demands of scaled technology, have recently put a lot of focus on on-chip optics.
1.6 Optical interconnect evaluations:

Figure 1.11: On-chip optical interconnect system
Fig. 1.11 shows a block diagram of a typical on-chip implementation of an optical channel. The three major components of an optical system are the transmitter, the waveguide, and the receiver.

1.6.1 Transmitter system: The proposed transmitter requires an off-chip laser source, a modulator, and a modulator driver. Coupling light efficiently from an external source onto the chip tends to be very lossy; however, researchers have demonstrated coupling with less than 1dB of loss, and we expect this to keep improving. For the purpose of this evaluation work, we assume a coupling efficiency of 50%, i.e., 3dB of loss. Once light is coupled into the die, splitters and waveguides route the


light to different modulators. We account for the distribution losses along this path. The modulator is an important part of the optical transmitter system; this is where electrical information gets translated into an optical signal. High-speed Si electro-optic modulators are designed such that injection of an electrical signal changes the refractive index or the absorption coefficient of an optical path. Over the years, different types of modulators have been proposed, based on the free-carrier plasma dispersion effect [1.21], the Mach-Zehnder interferometer [1.22], the MOS capacitor [1.23], and the optical resonator [1.24]. The optical resonator based implementations are advantageous for silicon due to their low operating voltage and compact size.

Figure 1.12: The characteristics on which ring-based resonators [1.24] operate (optical transmission as a function of probe wavelength and of applied voltage)

We base our analysis on this modulator type. From the perspective of a circuit designer, this device acts like an optical switch. Like an electrical switch, the performance of a modulator is characterized by its off-to-on light intensity ratio, called the extinction ratio, which depends on the strength of the electrical input signal. A 10-15dB extinction ratio has been reported recently with a high input signal swing. Since researchers have already demonstrated modulators with a 10um diameter, we can safely predict that they will become smaller going forward, though this is limited by lithography and bending curvature. After the modulator, the transmitter has the driver, which is simply a series of inverter stages driving the modulator's capacitive load. Here the modulator's capacitance comes into the picture: a smaller capacitance improves both power and latency. In our case we assume a capacitance of 50fF, which is expected to decrease as the technology improves. The insertion loss of the modulator is assumed to be 1dB. An inverter-based driver for 2Gbps data with a 50fF load implies a power consumption of 117uW according to ITRS projections for the 32nm node.
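As a back-of-the-envelope check on a figure of this kind, the dynamic power of an inverter chain driving the 50fF modulator can be sketched as below. The supply voltage, activity factor, and driver-chain overhead multiplier are assumptions (not from the ITRS projection), so only the order of magnitude is meaningful:

```python
def driver_power_uw(c_load_ff, vdd, data_rate_gbps, activity=0.5,
                    driver_overhead=2.0):
    """Dynamic power (uW) of an inverter chain driving a capacitive load.

    activity: assumed fraction of bit periods that charge the load
    (0.5 for random data); driver_overhead: assumed multiplier for
    the energy burned in the inverter chain itself.
    """
    f_transitions = activity * data_rate_gbps * 1e9        # transitions/s
    p_load = c_load_ff * 1e-15 * vdd**2 * f_transitions    # W at the load
    return p_load * driver_overhead * 1e6

# 50 fF modulator at 2 Gbps, Vdd = 1.0 V (assumed):
print(driver_power_uw(50, 1.0, 2.0), "uW")
```

Under these assumptions the estimate lands around 100uW, in the same ballpark as the 117uW ITRS-based figure quoted above.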

1.6.2 Waveguide: The optical equivalent of an on-chip wire is a waveguide. For on-chip applications, polymer and Si waveguides are used most frequently. While the polymer used for waveguides has an effective refractive index of about 1.5, a Si waveguide has an index of around 3.5, making signals propagate more slowly in Si; however, Si waveguides are preferred for their higher bandwidth density, due to their smaller size and pitch. Bandwidth density in optical waveguides can be further enhanced by wavelength-division multiplexing (WDM), carrying multiple information channels in a single waveguide. For a Si waveguide, the width is assumed to be 1um with 5um of separation, while a polymer waveguide of 2um width requires 15um of separation. The loss in the path is assumed to be 1.3dB/cm [1.12]. In the absence of good polymer-based E/O modulators, we keep our evaluation focused on the silicon waveguide.

1.6.3 Receiver system: The receiver comprises a photo-detector and a trans-impedance amplifier (TIA) stage followed by high-gain stages that threshold the amplified voltage. The photo-detector most often proposed for use with a Si waveguide is a SiGe P-I-N diode [1.25]. Typically, these detectors have a large input capacitance and pose a design challenge for the high-speed gain stages. The TIA amplifies the photo-detector


current to a voltage, which is thresholded to a digital level by subsequent stages. The shrinking supply voltage and the higher bias voltage required by the detector mean we have to use higher voltages to realize the receiver, which in turn consumes more power. For our considerations we have assumed 100fF of detector capacitance and 100fF of output capacitance [1.26], and a TIA supply voltage 20% higher than the nominal supply for the power calculations in the next section.

Table 1.3. Parameters used for evaluation

Loss Type                            | Value
On-chip coupling loss (dB) [1.12]    | 3
Si waveguide loss (dB/cm) [1.12]     | 1.3
Splitter loss (dB) [1.12]            | 0.2
Modulator insertion loss (dB) [1.24] | 1
Interlayer coupling loss (dB)        | 1
Bending loss (dB)                    | 0.5
Quantum efficiency [1.12]            | 0.8

For the minimum receiver power required for detection, we followed this methodology: we first assume a BER of 10^-15, for which the required SNR comes to around 15.8. Then, assuming an extinction ratio of 8, we follow the calculation suggested by Connor et al. [1.12]. The minimum detection current for 2GHz operation comes to around 30uA, while the TIA power consumption is calculated to be 257uW.
1.7 Results: Based on these considerations, we evaluate the power, delay, and spatial bandwidth required for a 25mm optical communication link at the 32nm node for a fixed number of waveguides. The relative values of spatial bandwidth, latency, and power are all important performance parameters. With 100% activity on the bus, the optical link performs better than the electrical link. With 20% activity, the power of the electrical link goes down while the optical link still consumes the same power, reducing the performance advantage of optics. It is thus very important that optical links replace only those electrical links whose activity is high.
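The receiver-sensitivity and loss-budget arithmetic can be sketched as below. The loss entries follow Table 1.3; the split/bend counts and the 1 A/W detector responsivity are illustrative assumptions, and the Q-to-BER inversion is a standard Gaussian-noise relation rather than the exact method of [1.12]:

```python
import math

def link_loss_db(length_cm, n_splits=1, n_bends=2):
    """Total optical loss (dB) of one link, per Table 1.3 entries."""
    return (3.0                  # on-chip coupling
            + 1.3 * length_cm    # Si waveguide, 1.3 dB/cm
            + 0.2 * n_splits     # splitters (count assumed)
            + 1.0                # modulator insertion loss
            + 1.0                # interlayer coupling
            + 0.5 * n_bends)     # bends (count assumed)

def q_for_ber(ber):
    """Invert BER = 0.5*erfc(Q/sqrt(2)) by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(mid / math.sqrt(2)) > ber:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q = q_for_ber(1e-15)
print(f"Q = {q:.2f}, SNR = 2Q = {2*q:.1f}")  # ~15.8, as in the text

# Laser power per link for a 2.5 cm bus (responsivity assumed 1 A/W):
i_min = 30e-6                    # A, minimum detection current
p_detector = i_min / 1.0         # W needed at the detector
p_laser = p_detector * 10 ** (link_loss_db(2.5) / 10)
print(f"{p_laser * 1e3:.2f} mW per link")
```

The Q-factor inversion reproduces the SNR of roughly 15.8 quoted for a 10^-15 BER, and the budget shows how the dB entries of Table 1.3 compound into the off-chip laser power requirement.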

Figure 1.13(a): Electrical with 100% activity

Figure 1.13(b) Optical with 100% activity

Figure 1.14(a) Electrical with 20% activity

Figure 1.14(b) Optical with 20% activity

Based on these results, as well as the research review, we establish that optics can be utilized as a future global interconnect due to its better latency and


low power consumption. It is important, however, that optical links replace only high-activity electrical links.
1.8 Key Takeaways: 1) Design trends suggest that future processors will integrate more cores on a single die. 2) With multi-core designs there is an increasing need for I/O bandwidth as well as on-chip bandwidth. 3) Wire delay and power are not scaling with gate delay, creating a performance bottleneck. 4) Optics may provide a low-power alternative to global wires. 5) Our initial evaluations suggest that optics has a definitive advantage in terms of power and latency.


CHAPTER 2

OPTICAL INTERCONNECT FOR FUTURE CMPs

2.1 Chapter Introduction: As discussed in chapter 1, current research and technology trends indicate that future chip multiprocessors (CMPs) may comprise tens or even hundreds of processing elements. An important hurdle towards attaining this scale, however, is supplying data to such large numbers of on-chip cores. This can only be achieved if architecture and technology developments provide sufficient chip-to-chip and on-chip communication performance to future generations of CMPs. As discussed in chapter 1, optical interconnects show promise. In this chapter we build on that idea and look for an interconnect architecture that can exploit the benefits of optical interconnects. Although several efforts have attempted to identify under what conditions optics will be favorable over on-chip electrical signaling, these studies have been limited in scope to clock distribution networks and comparisons of point-to-point signaling. Although the technology is admittedly still in its formative stages, there is now enough understanding and data regarding on-chip, CMOS-compatible optical components to consider the broader architectural tradeoffs in designing an on-chip optical network for future high-performance microprocessors. In this chapter, we leverage optical device and integrated circuit expertise to investigate the potential of optical technology as a low-latency, high-bandwidth shared bus supporting snoopy cache coherence in future CMPs. We discuss possible optical bus organizations in terms of power, scalability, architectural advantages, and other implementation issues, as well as the implications for the coherence protocol. Through a carefully projected case study for a 32nm CMP, we conduct the first evaluation of


optical buses for this application, in collaboration with other research groups at Cornell. This initial step yields insights into the advantages and current limitations of the technology, to catalyze future interdisciplinary work. We first discuss the architectural considerations that had to be decided even before settling on an organization. Based on these considerations, we then discuss the microarchitecture of our CMP at the 32nm technology node and the design of our optical shared bus, and examine the significance of the results. After discussing these results, we look into other design and architectural modifications that can help exploit the gains from an optical interconnect system even further. The work was done in collaboration with the research groups of Prof. José Martínez and Prof. David Albonesi. Nevin Kirman, Meyrem Kirman, Dr. José Martínez, Dr. Albonesi, Matthew Watkins, and Dr. Alyssa Apsel were contributors to this work; the architectural evaluation was done entirely by Nevin Kirman and Meyrem Kirman.
2.2 Architectural considerations: Before considering the organization of the architecture, we must decide which optical characteristics we particularly want to exploit, what kind of interconnect fabric to use, and the type of optical devices to be selected for integration. In this section we examine these questions and give the rationale for our selections.

2.2.1 What characteristics of optical interconnects would we like to exploit for on-chip communication? In this study, for on-chip communication we intend to utilize the low-power and low-latency characteristics of optical communication. Though optical devices that can work up to 40Gbps per channel are expected to be available, a communication network at a 40Gbps per-channel data rate would require a lot of high-performance analog circuitry and communication protocol support for encoding, decoding, serialization, de-serialization, bit-boundary evaluation, and skew


correction of parallel channels. This extra circuitry adds significant latency to the communication path, as will be discussed in chapter 3. For cache-to-cache transactions, where the amount of information transferred per transaction is small, the added latency would reduce system performance. Though an optical system with on-chip interconnect at a high data rate may still be more power efficient than its exact electrical counterpart, and can be considered in future evaluations, in this architecture we target a low-latency interconnect network and hence exploit the low power and low latency of optical interconnects. We will use WDM to provide larger spatial bandwidth while keeping the per-channel data rate at a level that facilitates synchronous design.

2.2.2 What kind of interconnect fabric: mesh, crossbar, or bus? Due to the power and latency overhead of E/O and O/E conversion, an optical interconnect system performs better than an electrical system in terms of latency and power only beyond a break-even length. For the synchronous data-transfer system conceived in this project, a mesh interconnect cannot exploit the latency and power benefits because of the short distances between successive nodes. Though Pappu et al. [2.1] have shown that for very high fan-out systems this break-even length can be reduced to a relatively low value, it does not buy performance for the synchronous data-transfer system we are conceiving here. For our system, a shared bus appears best suited to exploiting the benefits of optical interconnects over their electrical counterparts.

2.2.3 How deep does optics have to go: to each and every node? Optical devices, though significantly reduced in size, still have a large footprint, and their scalability is fundamentally limited by the wavelength and the diffraction limit. For the future scaled technology, as will be


explained later, we find that optical devices cannot go all the way down to each node. The power and performance trade-off, as explained later, requires a hybrid interconnect system, where the nodes are connected at the higher hierarchy by a shared optical bus, while downstream they use an electrical tree-based interconnect network.

2.2.4 Optical component choices: As explained in chapter 1, an optical communication system requires a light source, a modulator, and a detector, besides filters, couplers, and a wave-guiding medium. At the transmitter side, the laser source provides the light on which the modulator encodes electrical information. Both off-chip and on-chip laser sources are feasible; in our analyses we employ an off-chip laser source for on-chip power, area, and cost efficiency. When light is coupled into an on-chip waveguide, a certain amount of optical power is lost. This loss is generally taken as 50%, i.e., 3dB, although ongoing research has demonstrated coupling with less than 1dB of loss, and we expect it to improve even further. Once light is coupled into the die, optical splitters and waveguides route it to the different modulators used for actual data transmission. These distribution paths are also sources of losses, which we take into account in our power calculations. The modulator translates electrical information into an optical signal. High-speed electro-optic modulators are designed such that injection of an electrical signal changes the refractive index or the absorption coefficient of an optical path. Among the different types of modulators proposed [2.2, 2.3], the most recent optical resonator based implementations are preferable for ICs due to their low operating voltage and compact size [2.3]; we assume this type of modulator in our work. For the wave-guiding medium, the material has important effects on bandwidth, latency, and area. As explained in chapter 1, we choose a silicon-based wave


guiding medium, due to the non-availability of polymer-based modulators, and will use WDM to increase the spatial bandwidth of the waveguides. On the receiver side, we consider a 1.5um photo-detector and a trans-impedance amplifier (TIA) stage followed by high-gain stages that threshold the amplified voltage to digital levels. The photo-detector most often proposed for use with a Si waveguide is a SiGe P-I-N diode [2.4]. The photo-detector's quantum efficiency is an important figure of merit for the system; a high quantum efficiency means lower losses when converting optical information into electrical. Typically, the detectors have a large input capacitance and pose a design challenge for the high-speed gain stages. The TIA amplifies the photo-detector current to a voltage, which is thresholded to a digital level by subsequent stages [2.5]. The shrinking supply voltage and the higher bias voltage required by the detector mean we have to use higher voltages to realize the receiver, which again consumes more power. For our analyses we have assumed 100fF of detector capacitance [2.6] and a supply voltage 20% higher than the nominal supply.
2.2.5 Electrical and optical power and delay characterization and assumptions: These considerations are discussed in detail in chapter 1 of this thesis. The delay of an optimized wire is taken to be 35ps/mm in our analysis.
2.3 Design-space exploration of different architectural organizations: In this work, we propose and analyze an optical broadcast interconnect that can scale snoopy bus-based coherence to large on-chip multiprocessor systems. This broadcast interconnect is based on a loop bus, a low-loss and simple topology for optical communication. It allows any of the masters on the bus to broadcast requests and responses to the others with as little signal loss as possible. Using an H-tree-like or highly bent structure would incur significant signal degradation due to turns and waveguide crossings, especially when laying out multiple bits. A loop structure


also enables a many-to-many interconnect without requiring duplicated buses or fully connected point-to-point networks. In the following sections, we first explore the design space of an optical loop bus under the constraints of power, area, and other on-chip optical technology projections for the general organization.
2.3.1 Design exploration of multiplex-by-node vs. multiplex-by-address topologies: The bus is composed of optical waveguides (residing on the optical layer) that traverse a large portion of the chip in a closed loop (Fig. 2.1). Multiple nodes on the bus, each responsible for issuing requests for a processor or a subset of processors, are equipped with the transmitters and receivers necessary to interface with the optical bus. We assume that the bus includes a conventional address/command bus (64 bits, including ECC and tag bits), a snoop response bus (8 bits), and a data bus (we consider 64-bit and 128-bit data paths plus error correction and tag bits). We presume the availability of WDM. Accordingly, in an N-node system with W wavelengths, two different broadcast organizations are possible. In the first (Organization 1, multiplex-by-address, Fig. 2.1a), the wavelengths split the address space into W partitions, and all nodes can transmit information using all wavelengths. However, to prevent multiple concurrent requests on the same wavelength, arbitration is needed per address-space partition (and thus per wavelength). This is effectively similar to a conventional electrical shared-bus/multi-bus interconnect. In the second organization (Organization 2, multiplex-by-node, Fig. 2.1b), the wavelengths are equally distributed among the nodes, with each wavelength serving a particular node. Both organizations achieve a peak aggregate bandwidth of W address/data transactions per bus cycle. Note that it is also possible to use some of the wavelengths to narrow the physical width (number of waveguides) of a bus (particularly the data bus). We do not consider this in our analysis below, but it could easily be integrated.


Figure 2.1: Optical bus for design exploration (courtesy of Meyrem and Nevin Kirman)
We present a thorough analysis of both organizations to understand their scalability, power, area, frequency, and other implementation issues. Together with the architectural aspects, this lets us determine a feasible and efficient organization on which we focus in the rest of the chapter.

2.3.2 Scalability of the two organizations: The number of nodes that can be placed on the bus is constrained by several factors: (1) the number of available wavelengths, (2) power consumption, and (3) the extinction ratio.
2.3.2.1 Number of wavelengths: The number of available wavelengths is crucial to Organization 2, where it directly determines the maximum number of nodes on the bus. In Organization 1, on the other hand, multiple (but possibly few) wavelengths are important to increase the snoop bandwidth of the bus. Different sources project different numbers of available wavelengths for on-chip use. Chen et al. [2.7] projected that the number of wavelengths will increase by 4 each technology generation, reaching 13 WDM channels at the 32nm node, and Kobrinsky et al. [2.8] assumed an increase of 1 WDM channel every two technology generations. In our analysis, we vary the number of


wavelengths from 1 to 16, keeping in mind that we anticipate it to be quite limited in reality. Looking at the device, the number of available wavelengths depends on the free spectral range (FSR) and the Q of the filter. With the devices available today, a 20nm FSR is possible, which with a channel spacing of ~1.2nm provides ~16 channels. Note that with ultra-high-Q ring modulators the channel spacing could be as small as 0.1nm [2.3], but such cavities would be highly unstable; process and thermal considerations require larger channel spacing for robustness.
2.3.2.2 Power: There are two main power-consumption components of an optical loop interconnect: electrical and optical power. Electrical power is the on-chip power consumed by the modulator drivers and TIAs of the transmitters and receivers, respectively. In Organization 1, the numbers of transmitters and receivers are both N*W*B, where N is the number of nodes on the bus, W is the number of available wavelengths, and B is the bus width. Using the same notation, the numbers of transmitters and receivers in Organization 2 are N*(W/N)*B and N*(N-1)*(W/N)*B, respectively; here W/N is the number of wavelengths exclusively assigned to each node, and the remaining wavelengths, if any, are unused. The second organization requires a noticeably smaller number of transmitters and receivers, especially transmitters. The total electrical power increases commensurately with the number of these components. Optical power is the off-chip power required by the modulator to modulate and transmit information optically from one node to the others. In our analysis, we first calculate the minimum optical threshold power required for the detector to detect the signal correctly, which is based on the voltage-swing requirement and the signal-to-noise ratio (SNR) of the receiver, as discussed in chapter 1. Having this minimum destination
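The channel and component counts in this subsection reduce to simple arithmetic, sketched below. The 4-node, 16-wavelength, 64-bit example values are illustrative, not a configuration from the thesis:

```python
# WDM channel count from FSR and channel spacing (values from the text):
fsr_nm, spacing_nm = 20.0, 1.2
channels = int(fsr_nm / spacing_nm)
print(channels, "WDM channels")  # ~16

def org1_counts(n, w, b):
    """Organization 1 (multiplex-by-address): every node transmits and
    receives on every wavelength -> N*W*B of each."""
    return n * w * b, n * w * b            # (transmitters, receivers)

def org2_counts(n, w, b):
    """Organization 2 (multiplex-by-node): each node owns W/N wavelengths
    and listens to the other N-1 nodes' wavelengths."""
    return n * (w // n) * b, n * (n - 1) * (w // n) * b

# Example: 4 nodes, 16 wavelengths, 64-bit bus (illustrative):
print(org1_counts(4, 16, 64))   # (4096, 4096)
print(org2_counts(4, 16, 64))   # (1024, 3072)
```

The example makes the text's point concrete: Organization 2 needs a quarter as many transmitters and three-quarters as many receivers as Organization 1 for this configuration.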


power, we back-calculate the input power required per modulator, accounting for all main sources of loss in the optical loop bus. Each transmitter requires this amount of optical power, since we assume a continuous-wave laser source that is always on, irrespective of whether data is being transmitted. We formulated the minimum power per modulator in Equation (2.1), which is valid for both organizations. In the equation, Pth is the minimum power required for a detector to detect the optical signal, Ploss is the waveguide loss per unit length, L is the length of the bus, N is the number of nodes on the bus, and Xi is the remaining fraction (in dB) of the incoming power after being tapped at each node i (node 0 is the transmitting node, and node N-1 is the furthest from it). The second term in the equation accounts for the waveguide loss, the third term for the remaining optical power after being tapped at each intermediate receiver, and the last term for the tapped power at the last node; K accounts for other losses in the system such as bending losses.
Pmodulator = Pth * K * 10^((Ploss*L*(N-1))/(10*N)) * 10^((X1 + X2 + ... + X(N-2))/10) * 1/(1 - 10^(-X(N-1)/10))    (2.1)

In this equation, Xi depends on the particular organization. For Organization 1, since any of the nodes can transmit on the same wavelength and the relative distance from a receiver to a transmitter cannot be distinguished, all Xi values must be the same. This has two consequences. First, it creates uneven optical power distribution at different nodes. Second, and more importantly, because of the identical tapping percentages but decreasing incoming power levels, to ensure that the last receiver gets its minimum required power, each preceding node usually taps more power than the minimum it needs, resulting in significantly higher overall power. Setting all Xi = X and solving the equation for minimum Pmodulator, the optimum X is found to be 10*log10((N-1)/(N-2)). A very important advantage of the second organization is that,


since only one node transmits with a specific wavelength, the relative distance from a receiver to the transmitter node is known at design time, which allows tuning each Xi to achieve minimum modulator power. This optimization allows the last node to tap all of its incoming optical power, causing the last term in the equation above to approach 1, and for minimum Pmodulator the preceding nodes are designed to tap the smallest amount of power possible.
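Equation (2.1) can be checked numerically. The sketch below works entirely in dB; the threshold power, excess-loss factor K, and waveguide loss per mm are chosen purely for illustration and are not values from our analysis:

```python
import math

def p_modulator_dbm(p_th_dbm, k_db, p_loss_db_per_mm, length_mm, n, taps_db):
    """Required modulator power per Eq. (2.1), worked in dB/dBm.
    taps_db = [X_1, ..., X_{N-1}]: remaining fraction (dB) after each tap."""
    waveguide_db = p_loss_db_per_mm * length_mm * (n - 1) / n
    intermediate_db = sum(taps_db[:-1])          # X_1 + ... + X_{N-2}
    x_last = taps_db[-1]
    tapped_frac = 1.0 - 10 ** (-x_last / 10.0)   # fraction tapped at the last node
    return p_th_dbm + k_db + waveguide_db + intermediate_db - 10 * math.log10(tapped_frac)

# Organization 1: all taps equal; the optimum is X = 10*log10((N-1)/(N-2)).
n = 8
x_opt = 10 * math.log10((n - 1) / (n - 2))       # ~0.67 dB for N = 8
p_org1 = p_modulator_dbm(-20.0, 3.0, 0.1, 36.0, n, [x_opt] * (n - 1))
```

For Organization 2, the per-node taps can instead be set individually, and the last node taps everything (the last term approaches 1), which is exactly why its modulator power comes out lower.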

Figure 2.2: Interconnect power for the two organizations; for any number of nodes, Organization 1 (top) consumes more power than Organization 2 (bottom).

We summarize the parameters used in Table 2.1.

Table 2.1: Parameters used to estimate the power components in Fig. 2.2
Modulator driver power (electrical): 0.117 mW @ 2 GHz
TIA power (electrical): 0.257 mW @ 2 GHz
Bus length (L): 36 mm
Number of nodes (N): 4, 8, 16
Number of wavelengths (W): up to 12

One of the most important conclusions is that Organization 1 is very power hungry, mainly due to the higher optical power requirement per modulator and the significantly higher number of transmitters. Generally, only configurations of up to 8 nodes with a single wavelength, or up to 4 nodes with up to 4 wavelengths, seem feasible with low power requirements, rendering this organization not very scalable. Narrowing the bus width


and/or shortening the bus length might be possible optimizations to reduce the required power and make its implementation more attractive. Organization 2, on the other hand, shows more promising results: generally, up to 8 nodes and up to 16 wavelengths are possible configurations. Again, reducing the bus length or width further are alternatives to reduce the power.

2.3.2.3 Extinction Ratio
The extinction ratio is the ratio between the on and off power levels generated by a light source. Ideally, a modulator in the off state would not pass any light; however, this is not possible in practice. The extinction ratio is important because it effectively limits the number of nodes that can transmit on the same wavelength. When many transmitters are transmitting on the same wavelength, the effective extinction ratio goes down due to higher off levels, creating detection problems at the detector side. In this scenario the off-condition power increases, and the receiver design becomes a problem; note that the off power is high because significant light still passes in the off state. As the ratio of worst-case to best-case on power becomes larger, it becomes difficult to design the receiver optimally. The extinction ratio is a limiting factor only for Organization 1, as the second organization does not have multiple transmitters employing the same wavelength on a single waveguide.

2.3.3 Bus Frequency
The signal latency on the outermost waveguide of the loop bus determines the bus speed, since a transaction must be visible to all nodes in the same cycle on a coherent bus, and the signal on the outermost waveguide has the longest distance to travel. Using the time-of-flight equation T = L*neff/c (L: length of the waveguide, neff: effective refractive index of the waveguide, and c: speed of light), and accounting for transmitter/receiver component delays (Table 2.2) and a 4 FO4 latching delay


(estimated using ITRS data), we calculate the maximum frequency at which an optical loop bus of a certain length can operate. If the effective waveguide length is approximately 25mm-40mm (sufficient for light to reach all nodes), the bus can operate safely at 2GHz using silicon waveguides. We observe that, for an application of optical signaling that requires single-cycle transmission on a relatively long global waveguide, the frequency drops significantly from the maximum achievable optical transmission speed. One possible improvement may be to broadcast concurrently in opposite directions so that the light only has to traverse half the length of the bus. For the sake of simplicity, however, we exclude this option from our study.

Table 2.2: Component delays of transmitters and receivers at 45, 32 and 22nm technologies [2.9]
Technology: 45nm / 32nm / 22nm
Modulator driver (ps): 25.8 / 16.3 / 9.8
Modulator (ps): 30.4 / 20 / 14.3
Detector (ps): 0.6 / 0.5 / 0.4
Amplifier (ps): 10.4 / 6.9 / 4.0
Si-waveguide delay (ps/mm): 10.5 / 10.5 / 10.5

2.3.4 Other Design Issues
One of the most important considerations for the loop topology is to prevent light from circulating around the loop for more than one complete cycle; otherwise, old messages can lead to inter-symbol interference. This restriction again favors Organization 2: attenuators can be placed just before the transmitter of each wavelength to prevent that light from circulating around the loop again. We cannot do the same for the first organization, as the transmitter node for a particular wavelength is not fixed.
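As a sanity check of the 2GHz figure, the bus-frequency calculation of Section 2.3.3 can be sketched with the 32nm delays from Table 2.2; the FO4 latching delay of ~12ps per FO4 is our assumption, not an ITRS number:

```python
def max_bus_freq_ghz(length_mm, wg_ps_per_mm=10.5,
                     component_delays_ps=(16.3, 20.0, 0.5, 6.9), fo4_ps=12.0):
    """Maximum loop-bus frequency: waveguide time of flight (T = L*neff/c,
    ~10.5 ps/mm for silicon) plus modulator driver, modulator, detector, and
    amplifier delays (32nm column of Table 2.2) and a 4 FO4 latching delay."""
    total_ps = length_mm * wg_ps_per_mm + sum(component_delays_ps) + 4 * fo4_ps
    return 1000.0 / total_ps

f_bus = max_bus_freq_ghz(36.0)   # a 36mm loop supports roughly 2GHz operation
```

The time of flight dominates the budget, which is why single-cycle transmission on a long global waveguide pulls the frequency well below the raw optical transmission speed.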


2.3.6 Arbitration and Latency Considerations
From an architectural point of view, Organization 2 is again more advantageous, because it does not require any arbitration for broadcasting on the bus, as each node transmits with a different wavelength. This, however, implies that conflicting requests may be broadcast on the bus during the same cycle; a method to handle such cases will be explained later. For Organization 1, realizing low-latency global arbitration optically is by itself a challenge. We envision a token scheme where a node that wants to issue a request attenuates the token signal using a modulator, thus absorbing the token. In the next cycle, the owner of the token emits the token in the same direction. The current owner of the token is responsible for absorbing the token if there is no request in that cycle, i.e., if the token traverses the entire loop without being absorbed.

2.3.7 Summary of the Design-Space Exploration for Architectural Organization
In summary, Organization 2 has clear advantages over Organization 1 in terms of power efficiency, implementability, and the architectural benefit of not requiring arbitration. As a result, Organization 2 is the preferable implementation to pursue further in the rest of the chapter.

2.4 Hybrid Optical-Electrical Interconnect
Further scaling of the optical loop-bus interconnect may be required to accommodate larger numbers of coherent caches than the number of nodes that can practically be placed on the bus. This scalability can be achieved using an optical-electrical hybrid solution where relatively local electrical sub-interconnects constitute each node on the optical bus. One such design topology is logically demonstrated in Figure 2.4, where there are four nodes on the optical bus, each being a switch connecting a subset of caches. Compared to an organization where all of the caches are directly connected to the optical bus, this hybrid organization has several advantages: (1) It has a lower


probability of request conflicts, as the nodes can select local requests that are not already conflicting; (2) It utilizes the available wavelengths more efficiently by dynamically allocating them to requests coming from multiple caches; and (3) It can benefit from local bypass paths for data without occupying optical bandwidth. This organization has the drawbacks of increased overall interconnect latency, a slightly more complex protocol, and requiring high-bandwidth electrical paths down to the caches to match the optical bandwidth.

2.5 Optical Interconnect for a CMP in 32nm Technology: A Case Study
In this section, we focus on implementing an optical interconnect network for the particular case of a CMP in 32nm technology. We apply the findings presented in Sections 2.1-2.4 to determine the most suitable and feasible organizations for the targeted CMP. We first model a reasonable CMP architecture in Section 2.5.1. Then, accordingly, we reason about the possible optical interconnect organizations in Section 2.5.2.

2.5.1 CMP Model
We target a 32nm process technology and assume a 400mm2 die area. Assuming 10mm2 per core+L1 at 65nm and extrapolating to 32nm, we find that 64 cores fit comfortably on the die (occupying 40% of the die area), with enough additional space to allocate L2 caches (20%), interconnect (15%), and controllers for the off-chip L3 cache and memory along with other system components (25%). The area breakdown closely follows the one in [2.10]. It is most reasonable to have a group of processors share an L2 cache due to the numerous benefits of cache sharing, among which are efficiently utilizing the limited on-chip cache capacity, alleviating the off-chip bandwidth requirement, and reducing the number of nodes across which coherence must be maintained. However, because of the significant crossbar area and power overheads as the sharing


degree increases, we adopt 4-way L2 cache sharing, resulting in 16 L2 caches on the chip. We find the L2 cache capacity using CACTI 4.1 to determine a cache size that matches the area of a single L2 (found by dividing the total area devoted to L2s by the number of L2s). As a result, the L2 cache size in our CMP model is 2MB, for a total of 32MB. Before optics is integrated on-chip, it is anticipated that it will first be used for chip-to-chip interconnection to solve the I/O bandwidth problem of chips. Chip-to-chip optical interconnection is an active research area in which both academia and industry are putting a lot of effort. Intel's chip-to-chip optical interconnection system prototype achieved 12 channel links x 8Gb/s bandwidth [2.11]. The Terabus project [2.12], carried out by Agilent Technologies and IBM, targets up to 40Tb/s bandwidth by year 2010. The fiber-to-the-processor (FTTP) [2.13] program projects a minimum of 1Tb/s/port x 5 ports/chip chip-to-chip optical I/O bandwidth by year 2010. In light of these projections, in our CMP model we assume a relatively reasonable optical I/O bandwidth to off-chip L3 and memory of 256GB/s (for L3) and 128GB/s (for memory), aggregating to a total of approximately 3Tb/s I/O bandwidth. Another crucial parameter is the core frequency. Blindly picking a frequency based on the projected transistor switching frequencies provided in ITRS [2.14], ignoring power considerations, may lead to wrong decisions and therefore misleading conclusions. In the following section we carefully project the core frequency at 32nm technology for a scenario where all cores are active and running simultaneously. The end result of the analysis presented in the next subsection is in agreement with the expectation that, as the number of cores doubles each new generation, the frequency will remain relatively constant. We end up with a 4GHz core frequency in 32nm technology for the above scenario.
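The die-area budget above can be verified with a quick sketch; the 0.5x area scaling per technology generation used for the 65nm-to-32nm extrapolation is our assumption:

```python
# Die-area budget for the 32nm CMP model (Section 2.5.1).
die_mm2 = 400.0
core_mm2_65nm = 10.0                       # core + L1 area at 65nm
core_mm2_32nm = core_mm2_65nm * 0.5 ** 2   # two generations of ~0.5x area scaling
n_cores = 64

core_fraction = n_cores * core_mm2_32nm / die_mm2   # 160 mm^2 -> 0.40 of the die
budget = {"cores": core_fraction, "L2": 0.20, "interconnect": 0.15, "other": 0.25}
total = sum(budget.values())               # the fractions should cover the whole die
```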


2.5.2 CMP Core Frequency Calculation
We derive a trend for the core frequencies in a scenario where all the cores are active and running. In spite of the many factors affecting the run-time core frequency, we use a relatively simple methodology that gives a rough estimate, sufficient for our purposes. ITRS projects a very high processor frequency for the considered technology node (23 GHz at 32nm) based solely on the maximum transistor switching frequency. However, if the allowable maximum power budget constrained by cooling, test, and packaging limitations is taken into account, along with the increasing leakage power contribution, it becomes infeasible to run all the cores concurrently at such high frequencies. Therefore, we derive a trend for the core frequencies taking power limitations into account. The total power can be expressed as the sum of the dynamic power (PD) and leakage power (PL), neglecting short-circuit and any other static power components:

PTOT = PL + PD    (2.2)

ITRS projects an allowable maximum power with heat-sink of 189W for 65nm and 198W for subsequent technology nodes. Borkar [2.15] provides a trend for the percentage of leakage power at high temperature for technology generations down to 50nm. We used an exponential curve fit to obtain the percentage values for the 32nm and 22nm technologies. Note that this trend is for the case where no leakage reduction technique is applied. There are various techniques for controlling leakage that can reduce the leakage power significantly [2.15], and we believe such techniques will be employed in future technologies. For our case, we assume a moderate 3x reduction in leakage power as a result of employing such techniques for 45nm and subsequent technologies (Figure 2.3). Having the maximum total power


budget projected by ITRS, and using the leakage power percentages, we find the target dynamic power. The dynamic power consumed on a chip can be written as:

PD = f * Vdd^2 * Cg * Wg * sum(i = 1 to #ofnodes) AFi*ki    (2.3)

where:
f : Core frequency.

Cg : Total gate capacitance per micron of device width (F/um). We use ITRS projections.
Wg : Minimum transistor width (um). It scales by a factor of 0.7 each new generation.
Vdd : Power supply voltage (V). We use ITRS projections.
AFi : Switching activity factor for each capacitive node in the processor.
ki : Ratio of node capacitance to the minimum NMOS transistor gate capacitance. It depends on the circuit topology and transistor sizing, as well as interconnect capacitance.

For the sum of AFi*ki, we do not use absolute values. Instead, since we mainly double the number of cores, caches, etc. while retaining the structuring, the number of nodes having the same switching activity factors and ki's doubles; thus, the sum doubles each new generation. One assumption we make is that the ratios for the interconnect capacitances (ki) remain constant across generations. This trend, together with the trend for the transistor width and the remaining known parameters (Table 2.3), enables us to form a ratio between the core frequencies of successive CMP designs. As a starting point for the absolute frequency numbers, we take a 4 GHz frequency at 65nm technology. This gives us frequencies slightly above 4 GHz in the following technologies. In our CMP model we pick a 4 GHz frequency to run all the available processors at the same time. Note that, instead of running all available processors, it is possible to utilize a subset of the processors running at a higher frequency. It


would be another way to stress the inter-processor interconnects. However, given the first-order scope of this project, we did not look into those details; they can be considered in future evaluations.
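To make the scaling argument concrete, the following sketch applies Equation (2.3) across generations using the Table 2.3 parameters. The leakage fractions are our approximate readings of Figure 2.3, not published values:

```python
# ITRS-based parameters from Table 2.3; the 'leak' fractions are our rough
# readings of Figure 2.3 (assumptions, not ITRS numbers).
nodes = {
    "65nm": dict(p_tot=189.0, cg=6.99, vdd=1.1, leak=0.40),
    "45nm": dict(p_tot=198.0, cg=7.35, vdd=1.0, leak=0.23),
    "32nm": dict(p_tot=198.0, cg=6.28, vdd=0.9, leak=0.31),
}

def next_freq(f_old, old, new):
    """Scale the core frequency via Eq. (2.3): P_D = f*Vdd^2*Cg*Wg*sum(AF_i*k_i),
    with Wg shrinking by 0.7x and the activity sum doubling each generation."""
    pd_old = old["p_tot"] * (1 - old["leak"])    # dynamic power budget, old node
    pd_new = new["p_tot"] * (1 - new["leak"])
    cap_old = old["vdd"] ** 2 * old["cg"]
    cap_new = new["vdd"] ** 2 * new["cg"] * 0.7 * 2   # 0.7x width, 2x node count
    return f_old * (pd_new / pd_old) * (cap_old / cap_new)

f65 = 4.0
f45 = next_freq(f65, nodes["65nm"], nodes["45nm"])   # ~4.4 GHz
f32 = next_freq(f45, nodes["45nm"], nodes["32nm"])   # ~4.1 GHz: roughly flat
```

The frequencies staying near 4 GHz across generations is consistent with the trend in Table 2.3: doubling the core count each generation absorbs essentially all of the scaling headroom.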

Figure 2.3: Leakage power (% of total power) projections

Table 2.3: Summary of ITRS [2.14] parameters used to calculate the processor frequencies
Technology: 65nm / 45nm / 32nm
PTOT (W): 189 / 198 / 198
Cg (E-16 F/um): 6.99 / 7.35 / 6.28
Vdd (V): 1.1 / 1 / 0.9
Frequency (GHz): 4.0 / 4.4 / 4.08

2.5.3 Opto-Electrical Hierarchical Bus Design
We propose an opto-electrical hierarchical bus, where the optical loop constitutes the top level of the hierarchy and nodes deliver information to processors via electrical sublevels. Fig. 2.4 depicts a possible four-node organization for our 64-processor CMP, where each node is shared among four electrically interconnected L2 caches. Our bus comprises an address/command bus, a data bus, and a snoop-response bus. We allocate 64 bits to address/command (including ECC and tag bits), 72 bits to data (including 8-bit ECC and assuming that tags are provided in the header), and 8 bits per snoop response. Therefore, the number of waveguides is 136 for the address/command plus data buses, and 8n to support snoop responses (each node provides w snoop


responses using w/n different wavelengths, for a total of 8w/(w/n) = 8n waveguides [2.16, 2.17]). The uncontended latencies on these optical buses are 10 bus cycles for the arbitration plus snoop request/response phases, and 12 bus cycles for a cache line of data to be transferred on the bus across bus nodes.
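The waveguide budget described above can be sketched as follows; the choice of w = 16 wavelengths is illustrative:

```python
def waveguide_count(n_nodes, w, addr_bits=64, data_bits=72, snoop_bits=8):
    """Waveguide budget for the hierarchical optical bus: one waveguide per
    address/command and data bit, plus 8w/(w/n) = 8n snoop-response waveguides
    (each node answers on its w/n dedicated wavelengths)."""
    per_node = w // n_nodes                 # wavelengths owned by each node
    snoop = snoop_bits * w // per_node      # equals snoop_bits * n_nodes
    return addr_bits + data_bits, snoop

addr_data, snoop = waveguide_count(n_nodes=4, w=16)   # 136 and 32 waveguides
```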

Figure 2.4: Optical bus considered for evaluation in our work (courtesy, Nevin and Meyrem Kirman)

2.5.4 Protocol for the Hierarchical Hybrid Bus
Here we focus on the handling of coherence requests by the split-transaction, fully pipelined hierarchical bus. L2 cache accesses by processors may result in coherence requests, which travel down the electrical sublevel to the corresponding node, where they are enqueued. Node switches arbitrate among the incoming coherence requests and broadcast the winner(s) on the optical address bus. Every node snoops the requests put on the optical address bus by every other node. (Recall that each node transmits on different wavelengths.) Then, nodes arbitrate among concurrent requests using the same finite state machine, so that they all reach the same outcome independently. (This requires factoring in requests even at their originating switch.) Next, the selected requests are delivered to all caches simultaneously, and the rest are


retried later. Caches compose individual snoop responses, which are relayed back down to the optical snoop-response bus, which again all nodes read and process concurrently. Finally, the appropriate decision is made, and the final snoop result is propagated up to the caches, where the appropriate action is taken. Eventually, if indicated, data is sent down to the optical data bus (after winning arbitration over possibly competing responses from other caches in the same node), which the original requesting node collects and sends up to the requesting L2 cache.

2.5.5 Area Estimation
We estimate the required areas on the active, optical, and metal layers for the optical organization. All address, snoop, and data buses are considered in the area calculations. In the active area, we account for the electrical switches in each node, as well as the transmitters and receivers on the optical bus. For simplicity, however, we do not include the area occupied by the repeaters in the electrical wiring, although we do include their contribution to power consumption later in this section. We use Orion to estimate the area of input and output buffers, as well as the crossbar areas inside the switches. We estimate the active area taken up by the transmitters and receivers required for the optical buses by conservatively assuming that the modulator driver and TIA each occupy 50um2, although standard scaling rules predict smaller areas for these components. We assume 80um2 modulators (10um-diameter rings), 10um x 10um detectors, and 80um2 wavelength-selective filters (10um-diameter ring resonators). Modulators and detectors consume area in both the active and optical layers; modulator drivers and TIAs are on the active layer, and filters are on the optical layer. The area occupied in the optical layer is calculated as the sum of the waveguide, modulator, detector, and wavelength-selective filter areas. We assume the component areas


specified above. The resulting active area is relatively modest, and the required optical layer easily fits within 400mm2. Finally, we estimate the metal wiring area required for the electrical sub-interconnects in the hierarchical organization. We assume a global wire pitch of 400nm, and wire lengths of 4.5mm and 2.25mm for the four-node configuration.

2.6 Electrical Baseline:
To conduct a meaningful evaluation of the impact of incorporating optical technology into bus-based interconnects, we establish a competitive, state-of-the-art electrical baseline with similar power and active/metal area characteristics as the competing opto-electrical buses. For the address network, we empirically found the tree topology to yield low latency and competitive bandwidth relative to other alternatives for our configuration, and we therefore choose it as our baseline. In the modeled tree organization (Figure 2.5a), four L2 caches and a memory controller connect to an address switch (AS), and four such address switches connect to a top-level address switch, all through point-to-point links. Requests issued by L2 caches are arbitrated in the switches at each level of the tree until they reach the top level and are selected. From that point on, broadcasting a snoop request down to all caches, combining snoop responses up at the top-level switch, and again broadcasting the final snoop result down to the caches takes a fixed number of cycles. We implement a multi-bus by selecting multiple snoop requests at the top-level address switch and employing as many snoop request/response buses as needed. We assume an H-tree layout with 4.5mm first-level (from the L2 caches) and 9mm second-level wire links. By using power-delay optimized repeatered wires, we can accommodate a 2GHz bus clock frequency, half the core speed.
Under no contention, the address phase of a request spends a total of 13 bus cycles on the bus: 4 cycles for request arbitration, 3 bus cycles for snoop request, and 6 bus cycles for snoop response combining and result broadcasting


(excluding time spent in the caches). The data network (Figure 2.5b) consists of a four-node bidirectional ring. As in the case of the address switches, each data router serves requests from/to four local caches and a memory controller connected to it through point-to-point links. Routing is deterministic and balanced. Transfers within a node use a 16GB/s bypass path within the local router. Bandwidth at each ring link is 16GB/s in each direction, as is the read and write bandwidth of each L2 cache. Bandwidth from (to) the memory controller is 48GB/s (32GB/s). In the absence of contention, it takes 14 bus cycles to transfer a cache line on the data network to a cache in the farthest node. Finally, we do not simulate I/O, and therefore do not include it in the system we model. To obtain area and power characteristics of the electrical bus, we follow the estimation methodology described earlier for the relevant electrical components. When compared to the hybrid opto-electrical bus, an electrical bus supporting an equal number of snoop requests per bus cycle (four) exhibits comparable power consumption and active device area, but a 50% increase in metal area overhead. On the other hand, an electrical baseline supporting half as many snoop requests per bus cycle adds up to similar area and power characteristics as the opto-electrical counterparts. Thus, for our comparison, we choose the latter configuration as our baseline.


a.) Address path for the electrical baseline; b.) Data path for the electrical baseline
Figure 2.5: Electrical baseline (courtesy, Meyrem and Nevin Kirman)

2.7 Assumptions about the Processor Core:
The following assumptions, shown in Table 2.4, were made about the processor core.

Table 2.4: Processor core assumptions [2.16]
Frequency: 4GHz
Fetch/issue/commit width: 4/4/6
Inst. window [(Int+Mem)/FP]: 56/48
ROB entries: 128
Int/FP registers: 96/96
Int ALUs/Branch units: 4/2
Int Mul/Div units: 1/1
FP ALUs: 3
FP Mul/Div units: 2/2
Ld/St units: 2/2
Ld/St queue entries: 24/24
Branch penalty (cycles): 7 (min.)
Store-forward delay (cycles): 2
Branch predictor: 16K entries

2.8 Results
We use eleven applications from the SPLASH-2 suite for the evaluation of the two topologies. The opto-electrical configuration was found to achieve an average speed-up of 1.13 and a peak speed-up of 1.71 as compared to the electrical baseline, as shown in Fig. 2.6.

Figure 2.6: Speedup of different applications (courtesy, Meyrem and Nevin Kirman)

Figure 2.7: Latency breakdowns for different applications (courtesy, Nevin and Meyrem Kirman)
Looking at the average latency breakdowns in Fig. 2.7, the latency advantages are obvious for each phase of the transactions, i.e., snoop, address, and data. Even in the absence of contention, the opto-electrical buses have a latency advantage over the electrical baseline. Moreover, the opto-electrical buses can support twice as much snoop request/response bandwidth as the electrical baseline at similar power and area cost. Another point to note from the results is that applications with relatively high bandwidth demand, such as radix, ocean, and fft, perform better with optical interconnects. In general, scalability improves with the addition of optical technology. Those applications that suffer from more contention in the data network tend to exhibit


lower parallel efficiencies in all configurations, and it is the scalability of these applications that improves the most with the addition of optical technology. Overall, the evaluation results suggest that there are performance advantages with optical interconnects, which is along expected lines. The whole exercise quantifies the improvement at an average of 13% as compared to the electrical baseline.

2.9 Future Work: What Will Be More Important to Tackle for On-Chip Optics, Latency or Higher Bandwidth for the Same Power Budget?
As discussed in the sections above, an on-chip network based on optical interconnects gives a performance advantage over its electrical counterpart. This has in essence been made possible by the low latency and high data bandwidth provided by the optical interconnect. Since the latency improvement with optics is a function of the length over which the information travels, and as the total length of the chip is limited, there is only limited scope for latency improvement with optics. A closer look at the results reveals that it is the high bandwidth at the same power budget that gives higher performance for on-chip optical interconnects. Thus, for the next stage of work, we would like to focus more on increasing the on-chip bandwidth rather than on the low-latency aspect of optical interconnects. With optical components showing more performance improvement than what was assumed in these evaluations, it may be possible to make more on-chip bandwidth available at the same power budget; hence, the future focus would be more on a higher-bandwidth system, using higher data rates and more WDM channels.

2.10 Key Takeaways:
1.) For future CMPs, an optical interconnect network provides lower-latency interconnects.
2.) For the same interconnect power budget compared to electrical interconnects, optical interconnects provide higher bandwidth.


3.) Higher available on-chip bandwidth translates to higher performance for applications that are bandwidth limited.
4.) For on-chip interconnects, the bandwidth provided is more important than the low-latency aspect of optical links, as the speed-up is due more to the higher bandwidth, though latency reduction also added to the performance advantage.
5.) At a similar power budget, the on-chip optical interconnect based system provides an average performance improvement of 13%, and for some applications a peak performance improvement of ~70%.


CHAPTER 3

DESIGN OF CLOCK & DATA RECOVERY (CDR) CIRCUIT FOR HIGH SPEED PARALLEL ON-CHIP INTERCONNECT

3.1 Chapter Introduction: As discussed at the end of Chapter 2, the main benefits of optical interconnects can be exploited by using the high bandwidth that they provide. For higher-bandwidth applications we need a non-source-synchronous design, which will invariably require a clock and data recovery scheme to extract the information at the receiver end. Though it is still not clear whether we need a CDR for on-chip interconnects, they are a must for off-chip interconnects. For on-chip interconnects, as we move to faster data-transfer rates, distributing a synchronous clock is not possible, and hence the transfer has to happen without source-synchronous clocking. In this chapter we discuss the design of a low-complexity 10Gbps clock and data recovery circuit to be used for asynchronous data transfer, detailing the performance results and the design considerations that went into the design phase. We then compare the performance results with other state-of-the-art results. After that, we come back to our original question of whether non-source-synchronous data transfers on-chip are beneficial. We try to understand the latency overhead of synchronization, as well as the encoding/decoding-related latency in the data-transfer mechanism. After appreciating the importance of latency for inter-cache transfers on-chip, we instead make a case for source-synchronous clocking with clock forwarding for point-to-point or broadcast-based architectures, while reserving the asynchronous data-transfer scheme for off-chip communications. We also discuss why high per-channel bandwidth is not desirable for on-chip interconnects, as it trades spatial bandwidth against the more important power, latency, and complexity.


3.2 Design of a 10Gbps Low-Complexity CDR:

3.2.1 Functionality of a CDR Circuit: A CDR is part of the receiver chain for
serial data channels. It extracts the clock information from the preamble bits and provides a good sampling clock that reduces jitter and phase error, thereby reducing the probability of wrong detection. It provides the receiver chain with a reference point for the bit boundary of the data and keeps track of it during the normal data-transfer phase.

Figure 3.1 Typical Data eye, and optimal clock sampling edge.

3.2.2 CDR Architecture Exploration: There are multiple ways in which one can
design a CDR circuit. The easiest (only in concept!) and fastest way of finding the

Figure 3.2 Parallel sampling of the data edge


bit-boundary would be to simply over-sample (Fig. 3.2) the incoming data, find the transition point, and hence derive an optimal sampling clock. However, for high data rates this scheme cannot work, as over-sampling the data requires a very fast sampling clock and/or multiple parallel phase detectors, and the power required for such a system would be high. The second type of architecture recovers the clock information in a PLL-based CDR, as shown in Fig. 3.3.

Figure 3.3: PLL-based CDR

To be able to extract the clock from random data, this scheme requires a certain amount of data-transition density and a special type of phase detector. This scheme is less power hungry than the first. For parallel-channel communication, however, the cost of a VCO per channel is unnecessary. The design can be optimized with a phase-interpolator based architecture for lower power, as shown in Fig. 3.4. This architecture requires a phase detector, a phase interpolator, a PLL, and a DLL circuit. For parallel I/Os operating at the same frequency, the PLL and DLL circuits can be shared, as shown in Fig. 3.4.

[Figure 3.4 block diagram: a shared crystal-referenced PLL (PFD, loop filter + charge pump, VCO, dividers Div_N1/Div_N2) and a shared DLL generate CLOCK_H and the quadrature ICLOCK_H/QCLOCK_H, which feed a per-channel phase detector (Up/Down outputs) and phase interpolator for channels CH1..CHn.]

Figure 3.4 Phase Interpolator based CDR

In this architecture the incoming data stream is compared against the sampling clock; the phase detector is a bang-bang circuit that indicates whether the sampling clock is leading or lagging the data. The phase interpolator adjusts the phase of the sampling clock based on this information and, after some time, converges to an optimal sampling point. This architecture is the least power-hungry. The phase-interpolator circuit can be controlled digitally, which effectively gives a good digital filter, while the phase-detector circuit can also be implemented with digital blocks. Since the design shares the VCO and the DLL across multiple channels, it is good for power and area savings.


3.2.3 CDR design for 10 Gbps: The design of the CDR involves multiple sub-circuit designs, such as the phase interpolator, the phase detector, the control circuit and the DLL. In the following sections we describe the topologies selected for these sub-circuits.

3.2.3.1 Phase detector Topology: There are two main types of phase-detector circuits in the literature, namely Hogge's [3.1] and Alexander's [3.2], shown in Fig. 3.5. Of the two, we select the Alexander phase detector over the Hogge detector due to the simplicity of the loop filter and control circuit that it requires.

3.2.3.2 CML Flip-flop design: As shown in the phase-detector topology, it requires a flip-flop. The D flip-flop can be designed either with digital logic gates (Fig. 3.7) or with analog CML latches (Fig. 3.6). With high-frequency data we need high-precision phase detectors, and normal digital flip-flops, with their high clock-to-Q delay, are not good for this application; so in our case we go for a CML-based latch and flip-flop design. Also, owing to the high-frequency nature of the signals, the CML flip-flop was made fully differential.


a) Hogge's phase-detector
b) Alexander's phase-detector
c) Timing for Hogge's detector
d) Timing for Alexander's detector
e) Transfer function for Hogge's detector after integrator
f) Transfer function for Alexander's detector

Figure 3.5 The two main types of phase detectors for high-speed, low-power CDRs


Figure 3.6 CML based Analog FF

Figure 3.7 Digital FF

The CML flip-flop was designed in a master-slave configuration, consisting of two CML latches. Each CML latch has two stages, a sample stage and a hold stage. It is important to have some amount of gain in the sample and hold stages to ensure that the signal does not weaken as it passes through the amplifiers. Also, for robustness of the design, a high voltage swing is required. The requirement of gain and bandwidth with high-swing signals at a low supply voltage implies higher power consumption for the circuit. We optimized the power consumption of the CML flip-flop by appropriate transistor sizing, and it came out to be 1 mW. The circuit operates properly for supply voltages of 0.9-1.2 V and was designed for a data bandwidth of 10 GHz, which means it works properly for data rates up to 10 Gbps if the clock is at 10 GHz. The corresponding digital flip-flop was limited in its operation: its clock-to-Q delay was found to be on the order of 25 ps at 5 Gbps, whereas for the CML-based design it was only 5 ps at 10 Gbps.

3.2.3.3 Design of CML (Analog) XOR gate: As discussed earlier, the phase detector also requires an XOR gate. Since the flip-flops were designed in CML logic and the XOR gate in our application needs to work with low latency, we designed the XOR gate with CML-based logic as well, as shown in Fig. 3.8.

Figure 3.8 CML based XOR

When A is high, the polarity of B decides the output: if A and B are both high, OUT gets pulled to zero, while with A high and B low, OUTBAR gets pulled low. By steering the current between the branches in this way, the circuit controls the state of OUT and OUTBAR and acts as an XOR gate. The basic cell used here is also known as the Gilbert cell [3.3], which is used in the design of high-speed RF mixers as well. The power consumption of the circuit was 0.5 mW, and it operates properly over a supply voltage of 0.9-1.2 V for data rates up to 10 Gbps.

3.2.3.4 Results with the Phase detector:


Figure 3.9.a) Up and down pulses when the clock is leading the data

Figure 3.9.b) Up and down pulses when the clock is lagging the data

With the designed CML FF and CML XOR gates, we tested the phase-detector circuit for its performance. The circuit was able to resolve the phase to 2 ps accuracy at 10 Gbps. Figure 3.9 shows the bang-bang nature of the phase detector: when the data phase lags the clock, the down pulse is high; when it leads the clock, the up pulse is high.

3.2.3.5 Set-Reset Latch as the loop-filter: Since the phase detector used in our design is bang-bang, the up and down pulses at any point in time transition with the data transitions. Thus, with random data transitions, we will still see the up and down pulses toggling, as that is how the phase detector works. If the data phase lags, the down pulse will go high on every data transition and will therefore acquire the randomness of the data, though the up pulse is guaranteed to stay low the whole time. Since the loop-update logic cannot update the clock phase very fast, we need a loop filter. Loop filters can be designed using both digital and analog methods, but because our phase detector is bang-bang, we can use a digital filter. Most loop-filter designers use majority-vote based filters, where in every update cycle the numbers of up and down pulses are counted, and a decision is taken based on which came more often. These counters usually run at the same speed as the data and are not good for low-power applications.

Figure 3.10 Set Reset Latch for use as loop-filter

In our design we went with a set-reset latch, in which, in any update cycle, the last pulse dictates the decision about the phase-update direction. This scheme is much simpler and lower power than the complex majority-vote based decision circuit.
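The last-pulse-wins behaviour of the set-reset latch filter, and its relation to the majority-vote alternative, can be illustrated with a small behavioural sketch (the functions and pulse streams below are invented for illustration, not the actual circuit):

```python
def sr_latch_filter(up_pulses, down_pulses):
    """Set-reset latch as loop filter: within an update cycle, the most
    recent pulse (up = set, down = reset) decides the shift direction."""
    state = None
    for up, dn in zip(up_pulses, down_pulses):
        if up:
            state = "up"
        elif dn:
            state = "down"
    return state

def majority_vote_filter(up_pulses, down_pulses):
    """Conventional alternative: count pulses over the whole update cycle."""
    return "up" if sum(up_pulses) > sum(down_pulses) else "down"

# Clock lagging the data: only 'up' pulses fire on data transitions,
# so both filters agree -- the guaranteed condition described above.
up = [1, 0, 1, 0, 0, 1]
dn = [0, 0, 0, 0, 0, 0]
print(sr_latch_filter(up, dn))       # up
print(majority_vote_filter(up, dn))  # up
```

Because one pulse stream is guaranteed quiet whenever the phase error has a definite sign, the cheap last-pulse decision reaches the same verdict as counting, without a high-speed counter.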

3.2.3.6 Design of CML (Analog) NAND gate: Since a latch can be made by cross-coupling two NAND gates, and since our phase-detector blocks all work in CML logic, we designed the NAND gate with CML as well, based on the same concept as the CMOS logic gate: one arm of the differential CML circuit has the NMOS devices connected in series, while the other arm has them connected in parallel. Thus, for OUT to go low, both A and B need to be high. The circuit takes 0.5 mW of power, works appropriately at the desired data rate, and operates over the full supply range of 0.9-1.2 V.

Figure 3.11 Analog NAND Gate

Figure 3.12.a) Modified up and down pulses when data leads the clock edge. Note: X and Y are the corresponding up and down pulses before the set-reset latch.

Figure 3.12.b) Modified up and down pulses when data lags the clock edge. Note: X and Y are the corresponding up and down pulses before the set-reset latch.


3.2.4 Design of the Phase Interpolator: The phase interpolator was designed based on the concept of adding, with different magnitudes, two clocks separated by 90 degrees.

Figure 3.13 Phase Interpolator

As shown in the figure above, if CLKI and CLKQ are 90 degrees apart,

CLKI = A sin(ωt)
CLKQ = A sin(ωt + 90°)

then

CLKOUT = K · A sin(ωt) + (N − K) · A sin(ωt + 90°)

Simplifying the math gives the phase of CLKOUT, relative to CLKI, as arctan((N − K)/K). For a fixed N of, say, 16, varying K from 0 to 15 interpolates the phase of the output clock from 0 to 90 degrees with respect to CLKI. The same idea can be extended to clock pairs of the other phases (90-180, 180-270 and 270-360 degrees), giving clock interpolation over the whole 0-360 degree range.
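The arctan relation above can be checked numerically; the helper below is a sketch for illustration (names invented), evaluating the phase of K·sin(ωt) + (N−K)·cos(ωt):

```python
import math

def interpolated_phase_deg(K, N=16):
    """Phase (degrees, relative to CLKI) of the weighted sum
    K*A*sin(wt) + (N-K)*A*sin(wt + 90deg) = K*sin(wt) + (N-K)*cos(wt),
    i.e. arctan((N-K)/K)."""
    return math.degrees(math.atan2(N - K, K))

# K = N gives 0 deg (pure CLKI); K = 0 gives 90 deg (pure quadrature clock).
for K in (16, 12, 8, 4, 0):
    print(K, round(interpolated_phase_deg(K), 1))
# prints: 16 0.0 / 12 18.4 / 8 45.0 / 4 71.6 / 0 90.0
```

Note that equal steps in K do not give exactly equal steps in phase (the mapping is arctangent-shaped); for the coarse step sizes used here the deviation from uniform spacing is small.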

Figure 3.14: Clock phases interpolated between 0 and 90 degrees with the designed scheme

Figure 3.15: Interpolated clocks over 2 quadratures

The designed phase interpolator consumes 1 mW of power at a 10 GHz clock rate for the interpolation stage, plus a further 2 mW for the clock-amplification stage after interpolation, needed to maintain good swing on the clock net. The overall power consumption of the phase-interpolator circuit was therefore around 3 mW.

3.2.5 Design of the control circuit: We used the phase-interpolation technique above to interpolate the clock. The interpolating stage was designed to be controlled by a 64-bit shift register, which wakes up at power-on with its first 16 bits set high. The UP and DOWN control pulses decide how these 16 high bits shift across the 64-bit shift register, and hence we get the interpolated clocks as decided by the control logic.

Figure 3.16: Phase interpolator control block

Figure 3.17: Controllable current source for the control circuit

During every update cycle this control logic shifts the pattern either up or down. Thereby, even after lock, the clock jitters around the optimal clock edge as it moves between two adjacent phase steps, creating deterministic jitter. We could have built a state machine with a third state, in which the update is held, to prevent this dithering, but for the sake of simplicity of the design we did not. The amount of dithering of the clock around the lock point is only +/-1/64 UI and is insignificant. Furthermore, this dithering of the clock is not entirely bad, as it can track higher-speed jitter in the data pattern compared to the other schemes.
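The convergence and lock-time dithering of this bang-bang loop can be seen in a toy behavioural model (an idealized sketch: the detector always knows the true optimum, the interpolator is a 64-step integer position, and all names are invented):

```python
def run_loop(start_step, optimum, n_updates, steps=64):
    """Bang-bang control: each update shifts the interpolator code by
    exactly one step toward the optimum. With no third 'hold' state,
    the code dithers +/-1 step (1/64 UI) once it reaches lock."""
    pos = start_step
    trace = []
    for _ in range(n_updates):
        pos += 1 if pos < optimum else -1  # direction only, no magnitude
        pos %= steps
        trace.append(pos)
    return trace

trace = run_loop(start_step=5, optimum=40, n_updates=100)
print(trace[-4:])  # [40, 39, 40, 39] -- deterministic +/-1-step dither at lock
```

The model makes the trade-off explicit: a linear search over 64 steps needs up to 64 updates to acquire, and after acquisition the code never rests, alternating between the two codes that straddle the optimum.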

3.2.6 I & Q clock generation circuit using DLL: The phase-interpolator scheme discussed above requires quadrature clocks. We generate these quadrature clocks using a delay-locked loop (DLL), whose delay stages are controlled to span a 360-degree phase. This requires a controllable delay block, a high-speed phase detector with very high phase accuracy, and a control circuit for the delay.


Figure 3.18 : DLL schematic for generating I & Q clocks


3.2.6.1 Delay stages for DLL: The delay stages for the DLL were designed using a conventional negative-gm delay block. As shown in the figure, the cross-coupled pair acts as a negative-gm load on the first stage. As the CTL input changes, the current in that branch changes the gm value, thereby changing the delay through the stage.

Figure 3.19: Variable delay stage used for DLL

Each delay block was designed for a delay between 15 ps and 35 ps, which means the delay through the four stages can be controlled between 60% and 140% of the 10 GHz clock cycle. This much delay range can easily absorb any process variation and hence should be sufficient for this application.

3.2.6.2 Phase detector: The phase detector required for this application has to work at high speed with high accuracy. We found that a simple sampling-based phase detector is ideal: the reference clock samples the delayed clock, and if the delayed clock lags the reference clock it samples a 1, while if it leads the reference clock it samples a 0. The scheme works properly for a DLL because the two clock frequencies are the same; they differ only in phase.
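The sampling decision can be sketched as follows. This is an idealized model with square-wave clocks and invented helper names; the particular sampling edge chosen below is one convention that reproduces the lag → 1 / lead → 0 behaviour described in the text (the real circuit's convention depends on which edge and polarity are used):

```python
def square(phase_deg):
    """Idealized 50%-duty clock: high in the first half of each cycle."""
    return 1 if (phase_deg % 360.0) < 180.0 else 0

def sampled_value(delay_phase_deg):
    """Sample the delayed clock (positive offset = lagging) at one edge
    of the reference clock (ref phase 180 deg here), chosen so that a
    lagging delayed clock samples 1 and a leading one samples 0."""
    return square(180.0 - delay_phase_deg)

def lead_or_lag(delay_phase_deg):
    return "lag" if sampled_value(delay_phase_deg) == 1 else "lead"

print(lead_or_lag(10))   # lag
print(lead_or_lag(-10))  # lead
```

Since the output is a bare 1 or 0 with no magnitude information, this is again a bang-bang detector, which is why the same CML flip-flop could be reused for it.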


Figure 3.20 Sampling-type phase detector used for the DLL phase detection

Also, as the phase detector needs to work with high accuracy, we reused our earlier CML flip-flop, which is fast and accurate, with a small region of metastability. In concept, the phase detector designed here again acts as a bang-bang phase detector, as its outputs are digital 1s and 0s.

3.2.6.3 Charge pump for the DLL: The charge pump required for this control loop was designed with the typical topology shown in the figure below.



Figure 3.21 Simple charge-pump based loop-filter for the DLL

The lead and lag pulses, as they arrive, charge the capacitor in the appropriate direction, which thereby controls the delay. At the appropriate delay, the circuit latches on to the correct phase.

Figure 3.22: Charge-pump node voltage, Vctl

The figure above shows the charge-pump node voltage Vctl around the time the DLL locks; Vctl dithers due to the bang-bang nature of the phase detection. The capacitor size can be increased to reduce these ripples.
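The ripple on Vctl at lock follows directly from the charge-pump relation ΔV = I·T/C per update; the sketch below uses invented component values purely to illustrate the scaling (doubling C halves the ripple):

```python
def charge_pump(pulses, I=10e-6, C=1e-12, T=100e-12, v0=0.0):
    """Integrate bang-bang lead(True)/lag(False) decisions onto the loop
    capacitor: dV = +/- I*T/C per update. Returns the Vctl trace."""
    v, trace = v0, []
    for lead in pulses:
        v += (I * T / C) if lead else -(I * T / C)
        trace.append(v)
    return trace

# After lock the bang-bang PD alternates lead/lag, so Vctl dithers by
# exactly one step of I*T/C.
step_mV = 10e-6 * 100e-12 / 1e-12 * 1e3
print(round(step_mV, 3))  # 1.0 (mV of ripple per update, for these values)
trace = charge_pump([True, False] * 4, v0=0.5)
```

This is the same bang-bang dithering seen in the phase-interpolator loop, only here it appears as an analog ripple rather than a code toggling between two values.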



Figure 3.23: Clocks at different stages of the delay line, 90 degrees apart

3.2.7 CML-to-CMOS converter: Since the control logic works at CMOS levels while most of the phase detection is at CML levels, a CML-to-CMOS converter circuit was also designed. This circuit consumes around 0.5 mW of power and consists of differential-to-single-ended conversion followed by thresholding with skewed inverter stages.

3.2.8 Design layout: The design was laid out in the IBM 90 nm process.

Figure 3.24.a) Phase-detector

Figure 3.24.b) Phase-interpolator


3.2.9 Top-level simulation:


Figure 3.25: Up and down pulses of the control loop at the top level

As shown in the figure above, the up and down pulses change between successive cycles due to the bang-bang nature of the phase detector.

Figure 3.26: Data eye at 10-Gbps after retiming with the recovered clock


Figure 3.27: Recovered clock eye at 10 GHz with 32 interpolation stages

Figure 3.28: Recovered clock eye at 10 GHz with 64 interpolation stages

As shown in Figure 3.26, the data eye looks very good at 10 Gbps with the retimed clock. The recovered clock has jitter due to the bang-bang nature of the PD. The jitter in the recovered clock is higher when the number of interpolation stages is set to only 32; with 64 interpolation stages it is reduced by half, consistent with the interpolation accuracy of the phase interpolator.


The overall power-consumption breakdown for the whole CDR circuit was as follows. For the phase detector: 4 mW for the Alexander PD, 1 mW for the retiming block, 1 mW for the CML XOR gates, 2 mW for retiming of the up and down pulses, 1 mW for the set-reset latch, and 0.5 mW for the CML-to-CMOS converter. The phase-interpolator circuit takes 3 mW in total, including the clock amplifiers. Clock distribution and the corresponding amplification require another 4 mW. This means the basic CDR circuit requires close to 16.5 mW. The DLL requires a total of 2 mW, but it can be shared across multiple links. Including 1 mW of high-speed clock-distribution power from the VCO, the actual power of the CDR per channel can be taken to be < 20 mW. For low-speed operation the delay characteristics of neighboring channels can be matched well, so the same extracted clock can be reused for neighboring channels; but for high-speed operation at 10 Gbps or so, controlling the delay characteristics of neighboring channels is usually not easy, so the CDR circuit may be required on a per-channel basis.
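The arithmetic in this budget can be tallied directly (numbers transcribed from the paragraph above):

```python
# Per-channel CDR power budget from the text, in mW.
cdr_core = {
    "Alexander PD": 4.0,
    "retiming block": 1.0,
    "CML XOR gates": 1.0,
    "up/down pulse retiming": 2.0,
    "set-reset latch": 1.0,
    "CML-to-CMOS converter": 0.5,
    "phase interpolator (incl. clock amps)": 3.0,
    "clock distribution + amplification": 4.0,
}
shareable = {"DLL (shared across links)": 2.0, "VCO clock distribution": 1.0}

core_total = sum(cdr_core.values())
print(core_total)                              # 16.5 mW for the basic CDR
print(core_total + sum(shareable.values()))    # 19.5 mW, i.e. < 20 mW/channel
```

At 10 Gbps this is the ~2 mW/Gbps figure quoted in the comparison below, with the shared DLL/VCO contribution amortized in the best case.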

3.2.10 A Comparison of our results with other state-of-the-art designs: A comparison of our result with other designs in a 90 nm process reveals that our design is on the low-power side: it takes ~2 mW/Gbps, whereas for other designs at 10 Gbps or greater speeds in a 90 nm process, the power consumption is on the order of 4-20 mW/Gbps [3.4, 3.5, 3.6, 3.7, 3.8].

Table 3.1: Comparison of performance with other state-of-the-art designs

Design       Data Rate   Power    Power/Gbps
This Work    10 Gbps     20 mW    2 mW/Gbps
ISSCC 2006   25 Gbps     98 mW    4 mW/Gbps
ISSCC 2007   20 Gbps     102 mW   5 mW/Gbps
ISSCC 2007   40 Gbps     800 mW   20 mW/Gbps
JSSC 2001    10 Gbps     72 mW    7 mW/Gbps


3.3 Understanding the power trade-off with high-speed operation: As the data rate increases, the power consumed at the receiver to resolve the information increases super-linearly, as does the complexity.


Figure 3.29: Data rate vs. power/Gbps trade-off (power/Gbps is minimized at an optimal data rate)

This super-linear increase in power at higher data rates means that the power per Gbps of data rate increases, so there is an optimal point of operation for low-power applications. It is therefore appropriate to limit the data rate to the point where the power consumption per Gbps of the link is lowest. This optimal point depends on the technology and topology as well as on the architecture of the whole system. For off-chip communications, constraints such as the limited number of pins normally force designers to ignore this optimal point, since their goal is to extract the maximum bandwidth out of the limited pins. For on-chip communication, however, the spatial bandwidth is high, so it should not be much of a problem to limit the data rate to this optimal point.

3.4 Understanding the latency involved with synchronization in asynchronous high-speed operation: As we go to higher-speed operation, encoding and decoding are invariably required to reduce the BER. There is latency involved with bit-boundary detection, with code-boundary detection, and with de-skewing the channels. The bit-boundary acquisition latency for our design is 64 update cycles; since the update clock has to run at least 2x slower than the high-frequency clock, this means a latency of 128 cycles. Even with a binary search over phases, one would still incur a high latency of ~14 cycles. Beyond this, there is latency in the decoding circuit and the code-boundary detection circuit: for normal 8b/10b decoding this runs to ~20 low-speed cycles, which translates into ~200 high-speed cycles, even with parallel code-group detection. At higher data rates there is a higher chance of delay mismatch between the parallel channels, which may necessitate de-skewing of the channels, with further latency. Even if we assume a dedicated, continuously running link, so that the latency of synchronization and de-skewing is only an initial overhead and unimportant compared to the throughput of the channel, there is still considerable latency in the continuous operation of the channel, which for most high-speed applications comes to around 20-30 cycles. Since for on-chip applications the amount of data moving between two cores is not very large, this latency may reduce system performance instead of increasing it. One needs to take it into consideration appropriately while architecting a design with high-speed asynchronous data transfer.

3.5 Use of clock forwarding for better performance for on-chip communications: One can also think of forwarding the clock with the data for on-chip communication. This removes some of the complexity of bit-boundary detection and of the CDR, although the internal clock still needs to be generated from the forwarded clock, so a PLL will still be required per channel. This scheme involves lower latencies. The speed achievable through this scheme, though, will depend on the dispersion characteristics of the on-chip waveguides.
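The cycle counts in Section 3.4 follow from simple arithmetic (values taken from the text; the ×10 factor is the 8b/10b low-speed-to-line-rate ratio, and the binary-search estimate comes out slightly below the ~14 quoted):

```python
import math

update_cycles = 64     # linear search over the 64 interpolator positions
update_div = 2         # update clock runs >= 2x slower than the bit clock
bit_boundary = update_cycles * update_div
print(bit_boundary)    # 128 high-speed cycles for bit-boundary acquisition

binary_search = math.ceil(math.log2(64)) * update_div
print(binary_search)   # 12 cycles -- same order as the ~14 quoted in the text

code_boundary = 20 * 10   # ~20 low-speed (8b/10b word) cycles x 10 bits/word
print(code_boundary)      # ~200 high-speed cycles for code-boundary detection
```

Against a transfer of only a few cache lines between cores, an acquisition overhead of hundreds of cycles plus 20-30 cycles of steady-state pipeline latency can easily dominate, which is the quantitative core of the argument for clock forwarding on-chip.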


3.6 Key Takeaways:

1.) We have designed a CDR with low power and low complexity whose performance is competitive with other state-of-the-art CDRs in a 90 nm CMOS process.
2.) The main innovation in the CDR is the simplicity of the loop filter built from a set-reset latch.
3.) The DLL was also designed with a sampling-based phase detector, though a patent on this was found to have been granted earlier to other designers [3.9].
4.) The complexity of the receiver circuit increases with higher data-rate operation.
5.) The power consumption of the link per Gbps grows super-linearly beyond some optimal data rate. For low-power designs this should be traded off against the spatial bandwidth of the link.
6.) The latency overhead of asynchronous high-speed data transfer needs to be considered appropriately for any on-chip communication.


CHAPTER 4 THERMAL & PROCESS TUNING OF RING MODULATOR

4.1 Chapter Introduction: As discussed in Chapters 1 and 2 of the thesis, modulators are among the most important components for realizing an optical-interconnect system. There are mainly two types of silicon-compatible modulators discussed in the literature. In this chapter, we discuss the integrate-ability of the two for dense on-chip optical interconnect. After justifying the selection of ring modulators due to their compact size, we look into the biggest problem these modulators have: the high Q that is the basis of the ring modulator's performance is also very much a cause for concern, because it makes the device equally susceptible to process and thermal variation. We look at these problems and evaluate thin-film heating, charge injection and depletion-based mechanisms for performance improvement. After discussing the pros and cons of the proposed solutions, we look at some recent work done at the material level to fix the thermal-sensitivity problem. We also look at process variation and discuss our charge-injection solution for process tuning of the ring resonator.

4.2 Silicon Modulator: High-speed electro-optic modulation in silicon is a crucial requirement for the integration of silicon photonics with microelectronics. High-speed modulators in the Gbit/s regime have been demonstrated recently using either resonant structures [4.1, 4.2, 4.3, 4.4] or Mach-Zehnder interferometers [4.5, 4.6]. Resonant electro-optic modulators are ideally suited for large-scale on-chip optical networks due to their compact size, high extinction ratio per unit length, and low power consumption arising from the resonant enhancement of the interaction of the light with the index change.


[Figure 4.1 schematic: CW light in, electrical data stream modulating the light out; implementation as a phase-shifter / index-modulator realized either as an MZI or a ring-resonator.]

Figure 4.1 MZI and Ring-Resonator

Even though MZI modulators have low process and thermal sensitivity due to their differential structure and are very robust in their performance, they are not good for on-chip optical interconnect due to their bulky size.

4.3 Ring-Resonator based Modulator: Ring-resonator based modulators are designed on the principle of resonant enhancement of an index change. Application of an electrical signal generates carriers, which modify the polarizability seen by the photons, resulting in a change of refractive index. For a fixed wavelength, this change of index results in a change of the transmission characteristics for the light. As the electrical ones and zeros control the transmission characteristic, the information thereby gets modulated onto the optical carrier.



Figure 4.2 Ring-resonator [4.1]

As shown in the figure above, $a_1$ is the field of the incoming light, $b_1$ the field of the outgoing light, $t$ and $\kappa$ are the transmission and coupling coefficients, and $\alpha$ is the round-trip amplitude transmission (loss factor) of the ring. Resonance occurs when the round-trip phase $\theta$ satisfies

$\theta = 2\pi m, \quad m = 0, 1, 2, \ldots$  (4.1)

Solving the coupling relations in detail, the power transmission of the ring is given by

$\left| \dfrac{b_1}{a_1} \right|^2 = \dfrac{\alpha^2 - 2\alpha t \cos\theta + t^2}{1 - 2\alpha t \cos\theta + \alpha^2 t^2}$  (4.2)

This function is very important in understanding the principle of operation of ring-based modulators. In Fig. 4.3 we show the transmission characteristics with a typical value of 0.96 for both $t$ and $\alpha$; the x-axis is $\theta$ and the y-axis is the ratio $|b_1/a_1|^2$.

Figure 4.3: Transmission characteristics vs. $\theta$ (resonance dips ~0.1 nm wide, spaced ~20 nm apart)

As shown in the figure, at certain values of $\theta$ the transmitted light drops to zero. Since $\theta$ varies with the change of index, for a fixed probe wavelength one can conceive of changing the index to control the transmission of the light, which is the basic principle on which these devices operate.
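The transmission relation can be evaluated numerically. The sketch below implements the standard all-pass ring power-transmission expression (the form behind Eq. 4.2) with the typical values $t = \alpha = 0.96$ quoted above; at critical coupling ($\alpha = t$) the on-resonance transmission is exactly zero:

```python
import math

def transmission(theta, t=0.96, alpha=0.96):
    """Power transmission |b1/a1|^2 of the all-pass ring resonator."""
    num = alpha**2 - 2 * alpha * t * math.cos(theta) + t**2
    den = 1 - 2 * alpha * t * math.cos(theta) + (alpha * t) ** 2
    return num / den

print(abs(transmission(0.0)) < 1e-12)   # True: zero transmission on resonance
print(round(transmission(math.pi), 3))  # 0.998: nearly full off resonance
```

The steep contrast between these two values over a small change in $\theta$ (equivalently, in index or wavelength) is exactly the high-Q sensitivity that the following sections both exploit and worry about.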

4.3.1 Understanding Q and its importance for device performance: The peaks in the transmission characteristic of the modulator lie around 1550 nm. Note that the transmission drops only over a small window of ~0.1 nm. The Q of the device is defined as $\lambda/\Delta\lambda_{1/2}$ (resonance wavelength divided by the full width at half maximum), which for these devices is typically ~10,000-20,000. This high Q is good for higher performance of the device, as it means a small change in index results in a large change in the characteristics of the device at a given wavelength. Since a small change in index can be produced with a small change in charge, it promises low-power operation.

4.3.2 High Q, the problem?: High Q for a well-controlled device gives high performance due to its higher sensitivity, but in the world of integrated circuits the environment is not controlled. The same high Q that yields higher electrical performance also makes the device more sensitive to process and temperature variations. Any uncontrolled parameter affects where the spectral dip in the transmission characteristics falls, and since one cannot control all the parameters perfectly, the device's usability for integrated applications can be questioned. Higher Q also means lower available data bandwidth: for a Q of 10,000, the data bandwidth for 1.5 um (200 THz) light is 200 THz/(2 x 10,000) = 10 GHz. Thus, for higher modulation bandwidth and lower sensitivity to process and thermal variation, the Q of the device needs to be lowered. This, however, results in higher modulation power consumption, as a larger index change is required to modulate information with a lower-Q device.
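The bandwidth estimate used above can be parameterized directly (this uses the text's own carrier frequency/(2Q) rule of thumb; the helper name is invented):

```python
def data_bandwidth_ghz(q, carrier_thz=200.0):
    """Available data bandwidth for a resonant modulator, using the
    text's estimate: optical carrier frequency / (2 * Q)."""
    return carrier_thz * 1e3 / (2 * q)  # THz -> GHz

print(data_bandwidth_ghz(10_000))  # 10.0 GHz, as in the text
print(data_bandwidth_ghz(4_000))   # 25.0 GHz for the Q = 4000 device of Sec. 4.4
```

The inverse relation makes the design tension concrete: every factor by which Q is lowered to gain modulation bandwidth (or thermal margin) must be paid back with a proportionally larger index swing, and hence more injected charge.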

4.3.3 Ways to De-Q: As shown in the figure below, de-Qing can be done by changing the coupling width of the device, as well as by increasing the losses in the cavity. Both of these methods invariably change the extinction ratio, i.e. the ratio of the turn-off to turn-on light intensity shown in the transmission spectrum. Degradation of the extinction ratio compromises the signal quality at the receiver end and results in a higher BER. Another way of de-Qing would be to cascade two or more devices on the same waveguide: the transmission characteristics of the cascaded devices combine, resulting in a lower effective Q.

Figure 4.4: Transmission effect of varying t and the cavity loss


Figure 4.5: De-Qed spectrum by cascading

This scheme assumes that the cascaded devices track each other due to local process and thermal correlation, which is true for most integrated applications. Since it lowers the Q of the modulator without compromising the extinction ratio, it is the better option.

4.3.4 Is just lowering Q the solution for integrate-ability?: With lower Q, we would be required to create a larger shift in the spectrum to modulate the light. Since the electro-optic effect of silicon is very weak, lowering Q will require more charge injection and removal for modulation, which will invariably mean higher power consumption and lower speed of operation. This is not an ideal solution for integrate-ability of the modulator, as it compromises the low-power characteristics. Thus we need to find either a static or a pseudo-dynamic compensation scheme to compensate for process variation and track the thermal variations.

4.4 Thermal sensitivity of the device: To understand the thermal and process variation of the silicon ring, we collaborated with Sasikanth Manipatruni and Dr. Lipson and worked with their ring modulators in their lab.

4.4.1 Device structure and operation: The investigated device is a silicon electro-optic ring modulator formed on a Silicon-on-Insulator (SOI) substrate. The modulator is formed by building a P-I-N junction around a ring resonator of quality factor 4000 and diameter 10 um [4.5]. The dimensions of the waveguides forming the ring and the add-drop ports are 250 nm x 450 nm. The drop-port coupling distance sets the quality factor of the ring at 4000. Nanotapers are used on both ends to couple light into and out of the chip [4.7]. The transmission spectrum of the ring for TM-polarized light is shown in Fig. 4.6. At constant temperature, transmission through the ring is modulated using a non-return-to-zero (NRZ) bit sequence at 1 Gbit/s. The refractive index of the ring is modulated by active carrier injection and extraction using the PIN junction. The modulated output waveform and eye diagram at 1 Gbit/s at the nominal temperature of operation are shown in Fig. 4.8. An on/off extinction ratio of 5 dB is measured, in accordance with the transmission spectral characteristics, for a +/-4 V applied voltage. The high applied voltage is attributed to a large contact resistance of the device, which can be greatly reduced by varying the implant conditions, doping profile and contact metallization as shown in [4.8].

Figure 4.6.a) Measured spectrum of the experimental ring

Figure 4.6.b) Zoomed-out spectrum


Figure 4.7. Transmission spectrum under DC bias voltage

Figure 4.8. Modulated waveform at 1 Gbit/s

4.4.2 Degradation in Modulated Waveform due to Thermal Effects: In this section, we explore the effects of temperature variation on the modulated waveform. We analyzed the transmission spectral shift as a function of temperature. Silicon's thermo-optic effect is given by $dn/dT = 1.86 \times 10^{-4}\,\mathrm{K^{-1}}$, which leads to a resonance shift $d\lambda/dT$ of ~0.11 nm/K from the base resonant wavelength. The effective index change in the modulator results from a combination of thermo-optic effects in both the silicon and the oxide. Figure 4.9 shows the spectral shift as the temperature of the chip is varied over 5 K.

Figure 4.9. Transmission spectrum as the temperature is increased in 1 K steps
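These numbers imply that roughly 1 K of drift detunes a high-Q ring by a full resonance linewidth. A quick estimate (assuming linewidth ≈ λ/Q, an approximation consistent with the ~0.1 nm dip width quoted earlier):

```python
dlambda_dT = 0.11                 # nm/K, from the thermo-optic shift above
linewidth_nm = 1550.0 / 10_000    # resonance FWHM ~ lambda / Q for Q = 10,000

print(round(linewidth_nm, 3))                # 0.155 nm linewidth
print(round(linewidth_nm / dlambda_dT, 2))   # 1.41 K shifts one full linewidth
```

This is why the 5 K sweeps in Figs. 4.9 and 4.10 walk the probe wavelength all the way across the resonance, from normal modulation through degraded zeros to complete bit reversal.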


We analyzed the degradation of the modulated waveforms as the temperature is varied; the variation in signal quality can be seen in Figure 4.10. At the nominal operating conditions at the base temperature, the transmission is modulated between the minimum and the high-transmission points to the right of the resonance spectrum. However, as temperature increases with a fixed probe wavelength, the transmission spectrum is modulated from points on the left side of the resonance, leading to higher transmission during the zero state. When the temperature is sufficiently high, the modulation occurs on the left edge between points leading to a complete bit reversal. However, the inverted bit pattern shows a high pattern dependency, since the optical state change can occur before the steady-state carrier concentration is reached in the PIN junction. This leads to pattern effects, as shown in Figure 4.10.

Figure 4.10. Distortion in modulated waveforms as temperature is increased; successive pictures from left to right are in steps of ΔT = 5 K


We compared the effect of thermal shift on the modulated waveforms with electrooptic simulations and show a good match between simulation and experiment. Our collaborator Sasikanth Manipatruni from Dr. Lipsons group did the electrical modeling with SILVACO ATLAS device simulation software. The software models the internal physics of the device by solving the Poisson equation and the charge continuity equation numerically. The suitability of SILVACO for simulation of these characteristics has been established by prior works [4.9, 4.10, 4.11]. We included Shockley-Read-Hall (SRH), Auger, direct recombination models. We assumed an interface trap density of 1010/cm2/eV and an interface recombination rate of 104 cm/s [4.11, 4.12]. The surface recombination rate of silicon is of the order of 10 4 cm/s for un-passivated surfaces [4.13] and 100 cm/s for passivated surfaces. The optical modeling assumes the ring resonator as a unidirectional single mode traveling wave cavity coupled to the waveguides with a quality factor of 4000. The changes in refractive index and absorption of silicon were modeled via free carrier dispersion [4.1]. The peak injection is estimated from the transmission waveforms. The injection rise time is given by the 4 ns as can be seen from the injection transient effect in the distorted bit pattern at 15 K. 4.4.3 Background work: In prior art, researchers proposed solutions which are essentially based on temperature tracking, i.e. operating the chip at the same fixed temperature. This requires thermal heating and cooling of the chip as the environmental temperature changes happen as shown in Fig. 4.11. The typical thermal-resistivity of the heaters designed are of the order of 5000K/W [4.14]. This means, a power requirement of the order of milliwatt for stabilizing the cavity against a temperature difference of 5K. 
Though this scheme is still widely used in long-distance communication systems, it is not suitable for integrated on-chip applications of optical devices, as the main advantage of optics for on-chip communication is


their power efficiency, which becomes worse if we must incur these thermal power penalties. Thus thermal heating is not a viable solution for densely integrated electronics, and we have to look for solutions that do not require heating. A MEMS-based solution can also be considered [4.15], in which the gap between the ring-resonator structure and a MEMS plate is used to tune out the variations due to temperature changes. However, to compensate for 5 K

Figure 4.11. Heater based tuning [4.14]

Figure 4.12. MEMS based tuning [4.15]

of temperature variation, the required change in gap is around 0.25 µm [4.15], which demands a very high voltage (>20 V). This voltage level is incompatible with the CMOS process.

4.5 Thought progression towards an injection-charge based prospective solution: We looked into the mechanism of carrier injection itself as a possible means of compensating for temperature variations. Carrier injection creates the electro-optic effect in silicon [4.16] that is the basis of modulation itself, and this effect can also be used to compensate for thermal variations. Since with higher temperature the resonance spectrum of the modulator shifts to the right (longer wavelengths), while with increasing carrier injection it shifts to the left, we can compensate for the temperature variation by injecting carriers, as shown in Figs. 4.13 and 4.14.


[Plot: resonance shift (0 to 12 nm) at λ = 1.55 µm (n_si = 3.4) vs. electrical carrier injection level (10^16 to 10^19 /cm^3, log scale)]

Figure 4.13. Carrier-injection based tuning. Figure 4.14. Carrier injection vs. spectral shift.

The change in resonance follows the free-carrier dispersion relation given in Eq. (4.3):

Δn_si(electrical) = −(8.8 × 10^−22 · ΔN + 8.5 × 10^−18 · (ΔP)^0.8)    (4.3)

It can be seen in Fig. 4.14 that for high carrier injection levels the spectral shift can be very large, which can be exploited for thermal compensation as well. By changing the bias level of the modulator we can change the carrier concentration in the ring to compensate for temperature changes. In this scheme a thermal sensor senses the temperature and, based on it, decides the compensatory carrier injection level Q_excess. The modulator, which earlier operated between injection levels of 0 and Q_mod for 0s and 1s, must now operate between Q_excess and Q_mod + Q_excess. As is easily seen, this control scheme puts an extra requirement on the drive circuit: between 1s and 0s it has to inject up to the higher carrier level and then precisely remove only the Q_mod carriers. Such control is very hard to realize precisely, so the scheme will suffer pattern-dependent jitter. Beyond the design difficulty of a modulator driver that must shift its 0 and 1 voltage levels to track temperature, this control scheme also limits the driver's operating frequency, and is therefore unattractive. The idea of carrier injection is still sound, but we need a way of applying it such that the modulator driver circuit can be designed independently of the


thermal compensator circuit, i.e. the charge injected for compensation should not become part of the modulation dynamics. In the next section we propose one such solution, in which the temperature-compensating injection charge does not participate in the modulation of data, so the driver and control circuits are easier to design without impacting their speed.

4.6 Proposed Solution: In the proposed solution we change the device structure such that one half of the ring is used for data modulation and the other half for carrier-injection based thermal tuning, as shown in Fig. 4.15. As seen in the cross-section at b-b′, the waveguide path remains intact, while the doped region is cut out and filled with SiO2 to provide isolation between the two PIN devices. One PIN device is used for high-speed modulation, while the thermal controller drives the other. This scheme is attractive because it makes the modulator driver design and operation independent of the thermal compensator. Overall, the resonance structure works as before, but as the length over which modulation happens is reduced by half, twice the injection level per cubic centimeter is required for data modulation.
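The tuning range available from carrier injection, for either the original or the split-ring scheme, can be estimated from Eq. (4.3). A sketch (unity optical confinement and the n_si = 3.4 used in Fig. 4.14 are simplifying assumptions of this estimate, not thesis results):

```python
# Free-carrier dispersion in silicon at 1.55 µm (Eq. 4.3) and the
# first-order resonance shift it produces.  Unity confinement and
# n_si = 3.4 (the value quoted with Fig. 4.14) are assumptions.

def delta_n_si(dN, dP):
    """Index change for injected electron/hole densities (per cm^3)."""
    return -(8.8e-22 * dN + 8.5e-18 * dP ** 0.8)

def resonance_shift_nm(dN, dP, lam_nm=1550.0, n_si=3.4):
    """First-order shift: dlam ~ lam * |dn| / n_si (blue shift for injection)."""
    return lam_nm * abs(delta_n_si(dN, dP)) / n_si

for level in (1e17, 1e18, 1e19):
    print(f"{level:.0e}/cm^3 -> {resonance_shift_nm(level, level):.2f} nm shift")
```

At the 10^19 /cm^3 level this gives a shift of roughly 10 nm, the same order as the ~12 nm upper end of Fig. 4.14.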


[Cross-section views at a-a′ (and c-c′) and at b-b′]

Figure 4.15 (a) Original device structure [4.1] and (b) proposed device structure

4.7 Design of the device structure:

4.7.1 Method: We perform a complete device simulation using the following device dimensions. The waveguides forming the ring and the add port are 250 nm × 450 nm in cross-section. The quality factor of the ring is found to be around 4000. The electrical modeling was carried out in the SILVACO ATLAS device simulation software, which models the internal physics of the device by solving the Poisson and charge-continuity equations numerically. The suitability of SILVACO for simulating these characteristics has been established by prior work [4.9, 4.10, 4.11]. We included Shockley-Read-Hall (SRH), Auger, and direct recombination models. We assumed an interface trap density of 10^10 /cm^2/eV and an interface recombination


rate of 10^4 cm/s [4.11, 4.12]. The surface recombination rate of silicon is of the order of 10^4 cm/s for un-passivated surfaces [4.13] and 100 cm/s for passivated surfaces. The optical modeling treats the ring resonator as a unidirectional single-mode traveling-wave cavity coupled to the waveguides, with a quality factor of 4000. The changes in refractive index and absorption of silicon are modeled via free-carrier dispersion [4.16].
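The quoted Q of 4000 fixes the resonance linewidth and photon lifetime, which in turn bound the achievable modulation bandwidth. A quick check using standard cavity relations (these numbers are a consistency check, not results from the thesis simulations):

```python
import math

# Linewidth and photon lifetime implied by the loaded Q of ~4000.
# Standard cavity relations; the bandwidth implication is an
# order-of-magnitude check only.

Q = 4000.0
lam = 1.55e-6                               # resonance wavelength (m)
c = 3.0e8                                   # speed of light (m/s)

linewidth_nm = lam / Q * 1e9                # FWHM ~ lambda / Q
tau_photon = Q * lam / (2 * math.pi * c)    # cavity photon lifetime (s)

print(f"linewidth ~ {linewidth_nm:.2f} nm")
print(f"photon lifetime ~ {tau_photon * 1e12:.1f} ps")
```

A photon lifetime of a few picoseconds means the cavity itself does not preclude ~10 Gbps modulation; the carrier dynamics, not the photons, set the limit.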

4.7.2 Isolation: Though the design idea is simple in concept, several issues surface during actual device design. One of the main constraints is the isolation requirement. Since the data-modulation part and the compensator part of the device can sit at different voltages, we need to provide electrical isolation between the two devices, or there will be pattern-dependent leakage current. Since we cannot remove the optical path, there will always be an n-i-n, p-i-n, or p-i-p type parasitic device built between the two arms. With low intrinsic doping and a longer isolation length, this leakage current can be reduced. We found that for a 5 µm isolation length between the two arms, the leakage current is reduced to a reasonable value of 5 µA for a voltage difference of 1.2 V between the arms, as shown in Fig. 4.16. This requirement increases the size of the device and also restricts the maximum voltage that can be applied across the device while modulating the carrier density. Since the carrier density is a function of the applied voltage, this correspondingly bounds the maximum temperature compensation achievable with this structure.


Figure 4.16. Leakage current as a function of voltage difference between the two regions, without reverse isolation

4.7.3 Reverse Isolation: Since diffusion of modulating carriers from the modulating region into the thermal compensation region must be kept low to preserve device speed, we found that creating a reverse-biased region between the two regions is a good solution. A 2 µm long reverse-isolated region was enough to prevent the modulating region from interacting with the static thermal tuning region: any diffused carriers are swept out through the reverse-isolation region before reaching the thermal compensation region.

Figure 4.17 Mesh structure of the simulated device

4.7.4 Back-Reflection: We also examined the back-reflection created by the change of index in the optical path due to the discontinuity in the carrier profile.


Since the change in index is very small, it does not create any appreciable back-reflection for the carrier densities of interest.

4.7.5 Extinction Ratio Degradation: At high carrier density the losses in the ring increase, degrading the extinction ratio. For this reason we limit our carrier injection to the 10^18 /cm^3 level, which results in a modest ring loss of <10 dB/cm.

4.7.6 Thermal self-heating, the killer: After conceptualizing the whole device structure and taking care of all the issues above, we ran SILVACO simulations on the device structure. As shown in Fig. 4.18(b), the current required for a 10^18 /cm^3 concentration was found to be 1 mA per 5 µm length of the ring, which amounts to around 240 µW per µm of ring length. For a nominal 10 µm diameter ring, this means ~8 mW of power consumption, which is unacceptable for integrated applications.
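The ~8 mW figure can be reproduced from the simulated current. A sketch (the 1.2 V forward drop is an assumed value consistent with the isolation analysis above, not a quantity stated in this section):

```python
import math

# Sanity check of the injection-power estimate: 1 mA per 5 µm of ring
# at an assumed ~1.2 V forward drop gives ~240 µW/µm, i.e. roughly
# 8 mW for a 10 µm diameter ring.

i_per_5um = 1.0e-3                   # A per 5 µm of ring (from simulation)
v_fwd = 1.2                          # assumed forward-bias drop (V)

p_per_um = i_per_5um * v_fwd / 5.0   # W per µm of ring length
ring_len_um = math.pi * 10.0         # circumference of a 10 µm dia ring (µm)
p_total_mw = p_per_um * ring_len_um * 1e3

print(f"{p_per_um * 1e6:.0f} uW/um, total ~ {p_total_mw:.1f} mW")
```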

Figure 4.18 (a) Voltage vs. current required per 5 µm of device length; (b) current vs. carrier concentration (in /cm^3)


This high current requirement to create a charge-carrier density of 10^18 /cm^3 is due to a high level of surface-assisted electron-hole recombination. In transient SILVACO simulation, the carrier lifetime was found to be only ~1 ns, as shown in Fig. 4.19.

Figure 4.19 Transient simulation showing the carrier lifetime at the 1×10^18 /cm^3 injection level

4.7.6.1 Thermal Modelling: We also performed a thermal simulation using a thermal heat-flow model. The key parameters and equations used in the model are given below.

1.) r = 1 / (2 · K_si · t_si), a measure of the thermal resistivity of the thin silicon film (lateral heat flow)

2.) 1/R = 2 · K_sio2 / t_sio2, a measure of the thermal conductance of the SiO2

3.) T = Ts · e^(−z·√(r/R)), the temperature profile away from the ring in the lateral direction

4.) R1 = t_sio2 / (w · K_sio2), a measure of the thermal resistance of the ring in the vertical direction (w is the ring waveguide width)

5.) Ts = P · (√(r·R) · R1) / (√(r·R) + R1), the local temperature rise at the ring for dissipated power P per unit length

where K_si = 149 W/m/K is the thermal conductivity of silicon and K_sio2 = 1.38 W/m/K is the thermal conductivity of SiO2. Due to the very low thermal conductivity of SiO2 and its large thickness, the thermal resistance of the device is very high, which results in a high local temperature. For the power level used for modulation with high-Q cavities, i.e. a carrier modulation level of 5×10^16 /cm^3, the self-heating is limited to only ~1 K, as shown in Fig. 4.20 below. For the power level required for our thermal compensation, the local temperature rises by 40 K or so (Fig. 4.21). Since we wanted to compensate for only 10 K, the device is quite obviously thermally unstable at that injection level: any amount of thermal compensation heats the device up even more, and hence the scheme is not usable due to the thermal instability of the device.


Figure 4.20. Self-heating effect of the device: temperature profile at a cross-section of the device in the lateral and vertical directions, for the normal modulation case

Figure 4.21: Self-heating effect with a carrier injection level of 10^18 /cm^3; the local temperature rises by around 40 K. Due to this strong self-heating, the high-carrier-injection scheme for tuning the ring is not a feasible solution for integrated applications.


4.8 Change of basic device structure by altering the thickness of Si or SiO2 to reduce thermal sensitivity: If the SiO2 thickness is reduced or the Si thickness increased, the self-heating effect can be significantly reduced; but either change degrades device performance, as the effective index contrast Δn becomes smaller, which degrades the confinement of the optical mode and reduces its interaction with the carriers. Device engineers need to re-examine the structure of the device to make it much more thermally stable before it can be used in integrated applications.

4.9 Depletion based solution: Since the carrier-injection scheme required high power to create a high carrier density and was not feasible for integration, we looked at a depletion-based device structure in which, using a high field, we propose to deplete carriers out of an already doped region. Depleting carriers has a similar effect on the index as injecting them, though in the opposite direction. As shown in Figure 4.22, although this solution is low power thanks to reverse-biased operation, it requires a high voltage, ~20 V, to deplete the carriers from the highly doped region where the optical mode lies.
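The ~20 V figure is consistent with a textbook one-sided abrupt-junction estimate (the junction model and the ~0.8 V built-in potential are standard assumptions of this sketch, not values from the thesis simulations):

```python
import math

# One-sided abrupt-junction estimate of the depletion width achievable
# at a given reverse bias for 1e18/cm^3 doping.  The built-in potential
# of ~0.8 V is an assumed textbook value.

EPS_SI = 11.7 * 8.854e-12    # permittivity of silicon (F/m)
Q_E = 1.602e-19              # elementary charge (C)

def depletion_width(v_rev, n_dope_cm3, v_bi=0.8):
    """Depletion width (m) for reverse bias v_rev (V) and doping (cm^-3)."""
    n_m3 = n_dope_cm3 * 1e6
    return math.sqrt(2 * EPS_SI * (v_rev + v_bi) / (Q_E * n_m3))

w20 = depletion_width(20.0, 1e18)
print(f"W at 20 V, 1e18/cm^3: {w20 * 1e9:.0f} nm")
```

Only ~160 nm of depletion is obtained even at 20 V, which is why reaching across the mode region of the waveguide demands such CMOS-incompatible voltages.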


Figure 4.22 Depletion width vs. voltage, with 1×10^18 /cm^3 doping

As future applications require devices to operate at low voltages, the high voltage needed to deplete the carriers again makes this solution unsuitable for integration.

4.10 Wide temperature range operation using thermal thin-film heating: As noted earlier, the device is by design thermally very sensitive. In this section we use the thermal self-heating effect of the device itself to compensate for the temperature variation. We demonstrate wide-temperature-range operation by controlling the DC bias current through the P-i-N diode. The experimental setup used for controlling the DC bias current is shown in Figure 4.23. The bias-T adds the NRZ electrical driving signal to an externally set DC bias. The DC bias current is varied in a direction opposite to the change in temperature to retrieve the bit pattern. The temperature of the chip is


controlled via an external feedback temperature controller acting on the chip holder.
[Schematic: Light_in → bias-T → test-chip → Light_out, with the bit pattern and a DC source feeding the bias-T]

Figure 4.23 Setup for controlling the DC bias current through the device

Figure 4.24 Restoration of the distorted waveforms using the bias-current compensation scheme: (a) normal data, (b) corrupted data at ΔT = 15 K, and (c) recovered data at ΔT = 15 K


The bit patterns are recovered by varying the DC bias current as the temperature of the chip holder is varied (monitoring the balance of the signal, i.e. the ratio of 1s to 0s). The exact temperature shift of the ring is estimated from the distortion in the bit pattern. At 24.7 °C ambient temperature, a current of 400 µA is passed through the ring to set up the base operating condition. The temperature of the ring is then varied by varying the wafer temperature using the external feedback-stabilized controller. The degradation of the modulated waveforms as the temperature is shifted can be seen in Figure 4.24. The modulated waveforms can be retrieved over a range of ~15 °C by reducing the DC bias current, bringing the ring back to its original operating temperature; the injected DC current is varied in a direction opposite to the change in temperature. We show eye diagrams with clear eye opening over a 15 K range, obtained by controlling the DC bias current to hold the local temperature of the ring at the original operating condition, as shown in Fig. 4.25. We estimate the quality factor of the eye diagrams (Q = (μ1 − μ0)/(σ1 + σ0)) to be 11.35 at the nominal operating temperature. The retrieved quality factor at ΔT = 15 K is 7.15. These Q values are sufficient for a BER of 10^−12 [4.17].
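The quoted Q values follow directly from the eye-diagram amplitude annotations (6σ spreads and mean separations), and the bit-error rate they imply can be checked with the standard Gaussian-noise approximation:

```python
import math

# Eye-quality factor from the measured eye parameters (6-sigma amplitude
# spreads and mean level separation), and the Gaussian-noise BER,
# BER = 0.5 * erfc(Q / sqrt(2)).

def eye_q(mu_diff_mv, six_sigma1_mv, six_sigma0_mv):
    """Q = (mu1 - mu0) / (sigma1 + sigma0); sigmas given as 6-sigma widths."""
    return mu_diff_mv / ((six_sigma1_mv + six_sigma0_mv) / 6.0)

def ber_from_q(q):
    return 0.5 * math.erfc(q / math.sqrt(2.0))

q_nom = eye_q(140.0, 46.0, 28.0)    # nominal eye   -> Q ~ 11.35
q_ret = eye_q(136.0, 81.0, 33.0)    # retrieved eye -> Q ~ 7.15
print(q_nom, q_ret, ber_from_q(q_ret))
```

Even the degraded-then-retrieved Q of ~7.15 corresponds to a BER of a few 10^−13, below the 10^−12 target.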

[Eye annotations, nominal: 6σ2 = 46 mV, μ2−μ1 = 140 mV, 6σ1 = 28 mV; retrieved: 6σ2 = 81 mV, μ2−μ1 = 136 mV, 6σ1 = 33 mV]

a) Eye diagram at ΔT = 0 K; b) degraded at ΔT = 15 K; c) retrieved at ΔT = 15 K

Figure 4.25 Optical transmission eye diagrams of the electro-optic modulator

With Sasikanth Manipatruni and Nicolas Sherwood-Droz in Dr. Lipson's group, we did the thermal modeling with the industry-standard COMSOL software. We assumed that


the bottom of the wafer is a heat sink maintained at 300 K (T_amb). We also assumed a 3 µm buried oxide layer and a 1 µm top cladding oxide; the top cladding thickness of 1 µm is chosen so that heating from the metal layers is effective while limiting the optical losses due to mode overlap with the metal [4.18]. We find that direct localized heating inside the waveguide is more efficient than a metal heater on top of the device for tuning the resonance. Two-dimensional thermal simulations show that the temperature difference produced by direct localized heating using the PIN structure is significantly larger than with the metal-strip-heater method: direct localized heating using the PIN structure produces ΔT = 40.1 K/(mW/µm^3), while a metal heater positioned on top produces ΔT = 21.3 K/(mW/µm^3). Hence, the simulations show that direct localized heating is approximately twice as efficient as the metal heater, and it makes use of the existing structure (i.e. the contacts of the PIN device) to achieve thermal tuning. Though we could restore the data eye over a 15 K range [4.19], this function relies on the self-heating effect of the ring itself, and the scheme has some limitations, as discussed in the next section.

4.11 Discussion: benefits and scalability of the thin-film heating method: While testing in the lab, we found that the self-heating of the device is very high. Just turning the device on created a spectral shift of 2.1 nm, equivalent to raising the local temperature by around 21 K. This self-heating occurred because the device itself was consuming considerable power, owing primarily to its high contact and thin-film resistance. This high resistance requires the current device to operate at a high swing level of 3-5 V; ideally, for integrability, the voltage levels need to be brought down.
This lower operating voltage will come only with lower contact resistances, and if we reduce the resistance, then more current needs to be injected into the thin-film waveguide area to achieve a similar


temperature effect. At higher current, the carrier injection level reaches higher values and creates performance degradation: extinction-ratio degradation due to increased losses in the ring, as well as a resonance shift due to the carrier injection itself. Thus, for future improved device topologies, the thermal tuning achievable with local thin-film self-heating is limited and cannot be scaled to the 40 K range of insensitivity that one would ideally like for integrated applications. Also, any thermal feedback loop needs accurate thermal sensors. The sensors themselves have a process variation of ±2 °C, which means the device should be insensitive to at least this much variation. The output of most thermal sensors is typically around 2 mV/°C, which requires high amplification in the control loop. Needless to say, with all these control circuits around it, the E/O conversion based solution will not be a low-power solution for on-chip interconnects. This scheme can, however, also be combined with the split-ring structure proposed earlier, where one part of the ring is heated through a high-contact-resistance path while the other half is driven by a low-resistance modulating circuit.

4.12 Suggested future solutions for thermal compensation: Since most of the

solutions explored for thermal insensitivity of the device are limited in their integrability, the need is to look at the materials science and see whether we can compensate for the thermal sensitivity of silicon by changing the materials used in the device structure.

4.12.1 Silicon-related changes: Silicon is a thermally very sensitive material, and its thermal sensitivity arises from three different effects. 1.) The index changes due to the change in the distribution functions of carriers and phonons, which affects the polarizability of the carriers by propagating EM waves, which in turn determines the refractive index; for silicon this effect is positive with temperature. 2.)


The index changes due to shrinking of the band-gap with temperature; for silicon this is again positive with temperature. 3.) The index changes due to thermal crystal expansion; this is negative with temperature. One class of solutions would be to mix another material with silicon to enhance the thermal-expansion effect so that it offsets the other two effects, yielding an overall thermally insensitive device.

4.12.2 Cladding related changes: The thermal sensitivity of the ring-resonator modulator is the net effect of the thermal sensitivities of the silicon and the cladding layer. If we could make the cladding layer's sensitivity opposite in sign to that of silicon, we could obtain a thermally insensitive device. Ideally, this cladding layer should have the same index and loss characteristics as SiO2 so that no performance degradation is seen. It has, however, proven hard to find a material with a suitably negative thermal sensitivity and the same optical characteristics as SiO2. Recently, though, a research group from Korea demonstrated a polymer cladding with characteristics opposite to silicon's and obtained a well-behaved thermally insensitive device, as discussed in the next section.

4.13 Polymer cladding based thermally insensitive device: As SiO2 has a very weak thermo-optic coefficient compared to silicon, it is not an ideal cladding material. Instead, the polymer WIR30-490 is a better cladding material, as it has a larger negative temperature coefficient. Lee et al. [4.20] designed a device with polymer cladding and found that it reduces the thermal sensitivity from ~100 pm/°C to ~5 pm/°C. This is a significant improvement and promises usability of the device for integrated applications. The solution looks promising, and they also found a slot-waveguide based solution for the TE mode. The polymer used in their device would, however, need to become part of the integration process.
Also, to reduce the thermal sensitivity of the overall device, they


had to design the device structure such that the mode interacts with the polymer. Ideally, for a modulator we want the mode confined in the silicon so that it interacts more strongly with the carriers for better performance. Thus their solution comes at the cost of integrating a polymer layer and of reduced modulation performance. Nevertheless, it is a good route to a thermally insensitive device. Future effort could go into enhancing the cladding layer's thermal sensitivity even further, allowing tighter confinement of the mode in the waveguide, so that the E/O modulation performance need not be compromised by much.

4.14 Process variation and tuning: As discussed, the large thermal sensitivity of the device cannot be tracked by the charge-injection method, and better approaches, such as the thermal insensitivity rendered by a compensating cladding layer, are the way to go for thermal compensation of the modulator. But once we make the device thermally insensitive by altering its materials, the problem of process variation becomes significant. One way of compensating for process variation would have been thermal tuning itself; with thermally insensitive devices, process variation can no longer be trimmed by changing the device temperature. This is where charge-injection based tuning can be used: with the variations expected to be in the range of 0.5 nm, charge-injection based tuning is well suited to this task. Also, since the device is already thermally stable, the heating associated with charge injection is no longer a problem.

4.15 Key Takeaways:

1.) Thermal and process sensitivity is a big problem when using ring-resonator based modulators in integrated applications.


2.) The carrier-injection based method proposed and explored in this work was not found to be an effective solution, due to self-heating and high power consumption.

3.) The depletion-based solution proposed was found to be low power but to require high voltage, and hence was not usable for thermal compensation.

4.) The thin-film heating based solution gave us a range of 15 K and is less power hungry than metal-heater based solutions.

5.) Changing the cladding layer renders the device thermally insensitive over a wide temperature range, albeit with some performance degradation, and seems to be the most practical solution for integrability of the device.

6.) With a thermally insensitive device, the carrier-injection based tuning proposed here can be used for process compensation, which will still be required; for thermally insensitive devices it becomes the only way of process-tuning the device.


CHAPTER 5 OPTICAL INTERCONNECT: QUESTION MARKS & DIRECTIONS

5.1 Chapter introduction: During the system-level work on optical interconnects and their utilization, we encountered multiple design choices and questions that had to be understood to judge whether on-chip optical interconnect makes sense for integration. In this chapter we discuss a host of these issues, which may dictate whether, and how much, benefit on-chip optical interconnect will actually deliver. Multiple architectures have been proposed and evaluated around these devices, promising modest system performance improvements. However, while evaluating these architectures, architects have tended to overlook the challenges of integrating these devices and have failed to provision for the system constraints that will arise. Lack of clarity is also a problem when it comes to understanding the architectural implications of putting these devices on-chip. Similarly, device researchers, though impressive in the results they obtain for individual devices in tightly controlled lab environments, tend to overlook the device performance required when large numbers of these devices are deployed in an uncontrolled, widely varying environment. Robustness and predictability of performance matter a great deal for dense integration. In this chapter we account for these constraints, based on our studies, while working out the system aspects of the design. We also list and discuss the directions that we think can be pursued to tackle these challenges in a consistent manner.

5.2 Integration: 3D stacking and constraints on the off-chip optical signal path: Due to on-chip real-estate requirements as well as the routing constraints of optical signals, the architecture has to be vertically integrated. There are efforts to vertically


integrate using the growth of polycrystalline silicon, but as grown polycrystalline silicon has very poor optical characteristics, it is better to bond the photonic and electronic chips into a 3-D structure using two separate high-quality active silicon layers. The problem then is understanding how the optical signal can be routed off-chip.

Figure 5.1. A 3-D stack [5.1]

If we look at the architecture above for a normal 3-D device, active Si #2 (ideally SOI, not bulk Si as in this case) has to be the layer on which the photonic devices are fabricated. This is the natural arrangement, as the more power-hungry electronic components need to be nearer the heat sink. Looking at this architecture, one can see that light cannot be coupled in or out from the top surface of the chip. Given that the photonic devices are all made in an SOI process and that the other side of the chip is populated with C4 bumps and decoupling caps, we are left with the option of routing optical signals in and out of the chip only through the edges of Si #2. Considering that off-chip these I/Os must connect through fiber, and that fiber diameters are of the order of 150 µm, edge


coupling will limit the number of such I/Os based on the chip perimeter, and this needs to be accounted for appropriately.

5.3 Thermal budgeting for 3-D stacking: As with any 3-D stacking, it is well known that the stack creates additional thermal constraints for the chip, and hence power budgeting must accommodate them. Researchers need to appropriately scale down the available power when comparing the performance of a proposed system against a baseline, to ensure the chip works within the same thermal envelope. Recently, Intel demonstrated the advantages of 3-D die stacking, though in the context of memory: within the same power budget there is a performance improvement, but it comes at the cost of increased die temperature, and the improvement shrinks when the system must operate within the same temperature envelope. Still, 3-D die stacking definitely gives a performance/watt advantage, and researchers discussing optical integration should take careful note of this. The temperature problem is exacerbated for optical components, which must be integrated on a thick (~3 µm) oxide layer. This thick oxide, required for good performance of SOI optical devices, demands more careful thermal budgeting than normal 3-D stacking, where the devices sit in bulk silicon or a standard SOI process with a thin oxide layer (~100-300 nm), and must be appropriately addressed.

5.4 Thermal sensitivity and process variation of silicon photonic devices: The design of high-speed electrical-to-optical data converters, i.e. modulators, has been the key development driving the buzz around silicon photonic devices. The thermal and process variation of ring-resonator based modulators and filters was discussed at length in chapter 4, along with various directions that can be taken for these devices.
The other silicon photonic


devices, such as gain amplifiers, couplers, and laser sources, are also thermally sensitive. Though in time, as research progresses, there will invariably be solutions to these problems, one needs to budget for them and for the adverse performance impact of the state-of-the-art solutions. This is particularly important when some of the solutions, such as thermal heating, are extremely power hungry and take the "cool" factor out of the optical interconnect architecture.

5.5 Waveguide cross-overs and optical vias: Whenever waveguides cross, they incur loss and back-reflection due to the change of effective index around the crossing region. These losses can be very high in a bus configuration, where the number of crossings can be large. Such losses and back-reflections are harmful for high-data-rate communication, where they create ISI in addition to reduced signal strength.

[Figure annotations: back-reflection, X-talk, radiation]

Figure 5.2. A waveguide intersection

5.5.1 PSE (Photonic switching element): One needs to take care of these waveguide crossings and provision for their losses. One way to avoid crossings while keeping the network configurable is


the photonic-switching-element based network configuration of Shacham et al. [5.2].

Figure 5.3 A PSE scheme [5.2]

Though PSEs provide crossing-free routing of information, this is only possible for a single-data-per-waveguide scheme; for a bus whose spatial width is more than 1, the scheme cannot work. For a bus architecture, we suggested in section 3 using inter-layer coupling for extracting information from, and inserting it into, the bus, as shown in the figure below.

5.5.2 Interlayer coupling (Optical Vias): As shown in the figure below, interlayer coupling can again be done using a ring resonator: first the light is vertically coupled into the ring, and then it is horizontally coupled out to the drop port.


Figure 5.4 Inter-layer coupling (an optical via), with add and drop ports

Inter-layer coupling provides an optical via, which is very useful for a massive on-chip optical interconnect network. This scheme, however, requires that the device be polarization-insensitive, i.e. that the TE and TM polarization modes behave the same way for vertical and horizontal coupling.

5.6 Polarization sensitivity of the devices: Silicon photonic devices have seen a trend whereby waveguide dimensions are reduced for performance improvement and compactness. This reduction comes at the cost of difficulty in maintaining single-mode operation while controlling polarization characteristics. Most miniaturized silicon photonic devices behave differently for the TE and TM modes; directional couplers and ring modulators are the components of the interconnect fabric most sensitive to the polarization of the light. Ideally, for integrated applications, devices should perform similarly for both TE and TM modes. The devices and waveguides must be designed to minimize the effective-index difference between the TE and TM modes while maintaining single-mode operation, and device design engineers need to account for this sensitivity in integrated applications. Reed et al. [5.3] have recently shown a mechanism to make polarization-insensitive devices that work similarly for the TE and TM modes.

5.7 Misplaced assumptions about WDM channels: In the literature, many researchers hold misplaced assumptions about the number of WDM channels that photonic devices will make available. The FSR of a ring resonator, which controls the number of WDM channels, is limited by the geometry of the ring: as a rule of thumb, a 10 um sized ring modulator or filter has an FSR of around 25 nm, so with a channel spacing of ~2 nm one can have close to 12 channels. But this is not the only limitation: by sizing and cascading different ring structures, the FSR, and hence the number of WDM channels, can be increased.
Figure 5.5 Cascading rings (with add and drop ports) for increasing the FSR
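The rule-of-thumb numbers above can be sanity-checked with a short script. All numeric values below (FSR, spacing, Vernier detuning, group index) are illustrative assumptions, not measured device data:

```python
# Rule-of-thumb WDM channel budget for ring-resonator filters.
# All numeric values used here are illustrative assumptions.

def ring_fsr_nm(wavelength_nm, group_index, circumference_um):
    """Free spectral range of a ring resonator: FSR = lambda^2 / (n_g * L)."""
    return wavelength_nm ** 2 / (group_index * circumference_um * 1e3)

def wdm_channels(fsr_nm, spacing_nm):
    """How many channels fit within one FSR at a given spacing."""
    return int(fsr_nm // spacing_nm)

def vernier_fsr_nm(fsr1_nm, fsr2_nm):
    """Effective FSR of two cascaded, slightly detuned rings (Vernier effect)."""
    return fsr1_nm * fsr2_nm / abs(fsr1_nm - fsr2_nm)

# ~25 nm FSR with ~2 nm spacing -> about 12 channels, as quoted above.
print(wdm_channels(25.0, 2.0))
# Cascading two rings with slightly different FSRs stretches the
# effective FSR to roughly FSR1*FSR2/|FSR1 - FSR2|.
print(vernier_fsr_nm(25.0, 27.5))
```

The Vernier calculation shows why cascading helps: two rings whose FSRs differ by only a few nanometres share a resonance only once over a much wider span, multiplying the usable channel range.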


Thus it is not the ring resonator that limits the number of WDM channels for on-chip applications; rather, what will limit the channel count is creating that many sources and controlling their characteristics over a wide wavelength range across the whole interconnect network.

5.8 Synchronous vs. asynchronous data transfer and latency overhead: As discussed in Chapter 3, asynchronous data transfer carries significant power and latency overhead for on-chip interconnect, since recovering the bit boundary, code boundary, lane-to-lane skew and packet boundary incurs a large latency cost. One needs to understand these costs before adopting a high per-channel data-rate on-chip interconnect scheme. Asynchronous data transfer is well suited to off-chip communication; for on-chip communication the clock may instead be forwarded with the data channel as a reference, though the dispersion and delay uncertainties of the channel will then limit the maximum speed of operation, and that aspect needs to be addressed appropriately.

5.9 Power distribution across asymmetrically distributed nodes in a broadcast-based network: In our architecture, for an asymmetrically distributed network, though we optimized the power tapped off the bus for distribution to individual nodes, this requires couplers of good precision to tap the optimal power. The power tapped by a coupler is always a percentage of the incident power, and since couplers cannot be designed to better than, say, 5% accuracy, this limits signal distribution for asymmetrically distributed nodes. As the number of nodes grows, the nodes early in the bus become un-optimized: they tap more power than they require because sufficiently accurate couplers are unavailable. In our conceived architecture we took care of this issue when calculating the power for the whole network. The important point is that designers need to take care of these issues when they address the scalability of the optical network.
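The over-tapping penalty from coarse coupler ratios can be illustrated with a small sketch. The node count, per-node detector power, and the 5% coupler granularity are assumptions for illustration; this is not the thesis's actual power model.

```python
import math

# Launch-power penalty from coarse tap-coupler ratios on a broadcast bus.
# Node count, per-node detector power and the 5% coupler granularity are
# illustrative assumptions.

def feasible(p0, n_nodes, p_det, step):
    """Can launch power p0 serve all nodes when tap ratios only come in
    multiples of `step` (e.g. 0.05 for 5% granularity)? Each node's tap
    ratio is rounded UP to the next available step so the node still
    receives at least p_det; early nodes therefore over-tap."""
    remaining = p0
    for _ in range(n_nodes):
        if remaining <= 0:
            return False
        ratio = math.ceil(p_det / remaining / step) * step
        if ratio > 1:
            return False          # even a 100% tap cannot deliver p_det
        remaining *= (1 - ratio)
    return True

def min_launch_power(n_nodes, p_det, step, hi=1e6):
    """Binary-search the smallest feasible launch power."""
    lo = n_nodes * p_det          # ideal bound: lossless, exact couplers
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if feasible(mid, n_nodes, p_det, step):
            hi = mid
        else:
            lo = mid
    return hi

# 16 nodes needing 1 unit each: the ideal launch power is 16 units, but
# with 5%-granular couplers the required launch power is noticeably higher.
print(min_launch_power(16, 1.0, 0.05))
```

The rounded-up tap ratios make the early nodes siphon off more power than they need, so the required launch power grows faster than linearly with node count — the scalability concern raised above.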


5.10 OTDM: is it beneficial for on-chip interconnect? OTDM is proposed as a good solution for long-distance, optical-fiber based communication, where the goal is to extract the maximum from the fiber infrastructure. It improves the per-channel bandwidth, but with an increase in latency and power/Gbps. Though the common components used for OTDM, namely add-drop filters, MZI phase shifters, etc., can be fabricated in silicon too, the main problems for real integrated use would be synchronization, precision and sensitivity. The implementation requires precise control of delays, which must be tuned against thermal and process variations, and on the receiver side there is power and latency overhead associated with the required synchronization. Synchronization is needed anyway whenever one goes to higher data rates; it just becomes more complicated in this case, and as per the lessons of Chapter 3 it is not good for on-chip interconnect. On-chip, the cost of the infrastructure, i.e. a waveguide, is much lower than that of, say, a 100 km fiber; and since the main purpose of integrating optics on-chip is to reduce power/Gbps, and OTDM only makes it higher, it is not a good solution for on-chip interconnects.

5.11 WDM and filtering: As highlighted before, WDM and the number of channels depend on the filtering characteristics. Using ring-resonator based modulators and their high Q, one can envision hundreds of WDM channels; but realistically, accounting for the data bandwidth and the filters' tolerance should limit the WDM channel spacing to 1-2 nm, and calculations need to be adjusted accordingly.

5.12 Optical logic for computation and optical memory on-chip? Since optical devices are diffraction-limited, they cannot be made smaller than our transistors. Also, realizing optical memories requires trapping photons with a low-loss mechanism.


Though photon trapping has been demonstrated for 100 ps-500 ps, it is still far from electron-based storage.

5.13 System power vs. real power: With most off-chip-laser based interconnect systems, there is a tendency among researchers to count only the on-chip power while assuming off-chip lasers. System architects should ideally account for the laser power and the conversion efficiency of the off-chip lasers in their system budget.

5.14 System real-estate considerations: With off-chip lasers, one also needs to be aware of the off-chip real estate, which is becoming increasingly important.

5.15 Optical gain and its thermal sensitivity: A larger on-chip interconnect network will require a gain medium. Recently the Intel-UCSB research group demonstrated optical gain [5.4], though it is thermally sensitive and has efficiency limitations; for this reason we did not consider it in our evaluation.

5.16 Laser source: The UCSB-Intel team has recently announced an on-chip laser [5.5]. Its thermal sensitivity and efficiency need to be appropriately considered before conceiving an architecture based on on-chip lasers.

5.17 Mach-Zehnder modulators for the I/Os: A Mach-Zehnder modulator is a better, more robust solution than ring modulators for off-chip communication, due to its robustness against process and thermal variation; but because it is bulky, it can be utilized only at the interface.

5.18 Cost: packaging, testing, assembly, design: With on-chip interconnects, cost becomes the major drag. For chip manufacturing, the cost works out to be roughly evenly distributed across the design, test, debug and assembly phases. On-chip optical interconnect topologies seem to raise this cost substantially at every stage of manufacturing, and with optics, low yield will also be a drag on cost. Though on-chip optical interconnects can have higher performance, this


performance will invariably come with higher cost, which must therefore be factored in during the architecture phase. One needs to come up with design architectures that increase effective yield, e.g. by including features that allow a part to be binned even if, say, the optics is not working. Though the cost of integration is not in the hands of designers and may change with economies of scale, architectural considerations that help bin parts into a lower bin, thereby recovering yield, will lower the cost of the system. Some architectural solutions for off-chip optical interconnect can be conceived along the same lines.

Figure 5.6 A flexible scheme for off-chip communication: Chip1 and Chip2 connected by a low-speed electrical link, with an optional E/O-O/E high-speed optical link.

The idea is to provide an interface that works at lower speed over an all-electrical link, while, with the provision of the optical link, the speed can be boosted. Similarly, the capability to resize the number of lanes will give architectural flexibility and help regain some of the yield loss. One needs to come up with more such cost-effective solutions.


5.19 Conclusion: In all, on-chip optics for large-scale integration still requires a tremendous amount of work across architecture, devices, circuits and packaging. Though integrated on-chip optics is being pursued very actively by industry for increasing I/O bandwidth, we expect that in the future it may be utilized for within-chip communication too. How deeply and how densely these devices get integrated will depend on improving the device performance along the lines discussed in this chapter.


REFERENCES

[1.1] G.E. Moore, "Cramming more components onto integrated circuits," Electronics, Vol. 38, No. 8, April 1965. [1.2] Intel Corporation, Microprocessor quick reference guide. Web resource, available: http://www.intel.com/pressroom/kits/quickreffam.htm. [1.3] W.A. Wulf and S.A. McKee, "Hitting the memory wall: Implications of the obvious," Computer Architecture News, March 1995. [1.4] K. Saraswat et al., "Effect of interconnection scaling on time delay of VLSI circuits," IEEE Transactions on Electron Devices, pp. 645-650, April 1982. [1.5] R. Ho, "On-chip wires: scaling and efficiency," Ph.D. Thesis, Stanford University, Stanford, California, August 2003. [1.6] ITRS, "Interconnect," International Technology Roadmap for Semiconductors, 2004 Update.

[1.7] Ho, R.; Mai, K.W.; Horowitz, M.A.; The future of wires, Proceedings of the IEEE, Volume 89, Issue 4, April 2001 Page(s):490 504 [1.8] M. Haurylau, H. Chen, J. Zhang, G. Chen, N.A. Nelson, D.H. Albonesi, E.G. Friedman, and P.M. Fauchet. On-chip optical interconnect roadmap: Challenges and critical directions. In 2nd International Conference on Group IV Photonics, pages 1719, Antwerp, Belgium, September 2005. [1.9] G. Chen, H. Chen, M. Haurylau, N. Nelson, D. Albonesi, P. M. Fauchet, and E.G. Friedman. Electrical and optical on-chip interconnects in scaled microprocessors. In International Symposium on Circuits and Systems, pages 2514 2517, Kobe, Japan, May 2005. [1.10] M. Kobrinsky, B. Block, J-F. Zheng, B. Barnett, E. Mohammed, M. Reshotko, F. Robertson, S. List, I. Young, and K. Cadien. On-chip optical interconnects. Intel Technology Journal, 08(02), May 2004. [1.11] K.-N. Chen, M. J. Kobrinsky, B. C. Barnett, and R. Reif. Comparisons of conventional, 3-D, optical, and RF interconnects for on-chip clock distribution. IEEE Transactions on Electron Devices, 51(2):233239, February 2004.


[1.12 ] Ian OConnor. Optical solutions for system-level interconnect. In International Workshop on System-Level Interconnect Prediction, pages 7988, Paris, France, February 2004. [1.13] R. T. Chang, N. Talwalkar, P. Yue, and S. S. Wong. Near speed-of-light signaling over on-chip electrical interconnects. IEEE Journal of Solid-State Circuits, 38(5):834838, May 2003. [1.14] A. Louri and A. K. Kodi. Parallel optical interconnection network for address transactions in large-scale cache coherent symmetric multiprocessors. IEEE Journal of Selected Topics on Quantum Electronics, 9(2):667676, MarchApril 2003. [1.15] A. Louri and A. K. Kodi. An optical interconnection network and a modified snooping protocol for the design of large-scale symmetric multiprocessors (SMPs). IEEE Transactions on Parallel and Distributed Systems, 15(12):10931104, December 2004. [1.16] B. Webb and A. Louri. A class of highly scalable optical crossbarconnected interconnection networks (SOCNs) for parallel computing systems. IEEE Transactions on Parallel and Distributed Systems, 11(5):444458, May 2000. [1.17] D. Burger and J. R. Goodman. Exploiting optical interconnects to eliminate serial bottlenecks. In Proceedings of the Third International Conference on Massively Parallel Processing Using Optical Interconnections, pages 106113, October 1996. [1.18] N. Nelson, G. Briggs, M. Haurylau, G. Chen, H. Chen, D.H. Albonesi, E.G. Friedman, and P.M. Fauchet. Alleviating thermal constraints while maintaining performance via silicon-based on-chip optical interconnects. In Workshop on Unique Chips and Systems, Austin, Texas, March 2005. [1.19] Pappu, A.; Apsel, A; Electrical isolation and fanout in intra-chip optical interconnects Proceedings of the 2004 International Symposium on Circuits and Systems, ISCAS 2004 Volume 2, 23-26 May 2004 Page(s):II - 533-6 Vol.2. [1.20] Pappu, A, Short Distance Optical Links: Analysis , Demonstration and Circuit design, Ph.D. 
Thesis, Cornell university, Ithaca, NY, August 2006. [1.21] R. A. Soref and B. R. Bennett, Electrooptical Effects in Silicon, IEEE Journal of Quantum Electronics, Vol. 23, No. 1, pp. 123129, January 1987. [1.22] L. Lia et al., High speed silicon Mach-Zehnder modulator, Optics Express, Vol. 13, No. 8, pp. 31293135, April 2005.


[1.23] A. Liu et al., A High-Speed Silicon Optical Modulator Based on a MetalOxide-Semiconductor Capacitor, Nature, Vol. 427, pp. 615618. [1.24] C. A. Barrios, V. R. Almeida, and M. Lipson, Low-Power-Consumption Short-Length and High-Modulation-Depth Silicon Electrooptic Modulator Journal of Lightwave Technology, Vol. 21, No. 4, pp. 10891098, April 2003. [1.25] Yin, T.; Pappu, A.M.; Apsel, A.B.; Low-Cost, High-Efficiency, and HighSpeed SiGe Phototransistors in Commercial BiCMOS Photonics Technology Letters, IEEE Volume 18, Issue 1, Jan. 1 2006 Page(s):55 57 [1.26] Pappu, A.M.; Apsel, A.B.; A low power, low delay TIA for on-chip applications, Lasers and Electro-Optics, 2005. (CLEO). Conference on Volume 1, 22-27 May 2005 Page(s):594 - 596 Vol. 1 [2.1] A. Pappu and A. Apsel. Electrical isolation and fan-out in intra-chip optical interconnects. In International Symposium on Circuits and Systems, pages II5336 Vol. 2, Vancouver, Canada, May 2004. [2.2] C. A. Barrios, V. R. de Almeida, and M. Lipson. Low-power-consumption short-length and high-modulation-depth silicon electrooptic Modulator. Journal of Lightwave Technology, 21(4):10891098, April 2003. [2.3] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen, R. Nicolaescu, and M. Paniccia. A high-speed silicon optical modulator based on a metal-oxidesemiconductor capacitor. Nature, 427:615618, February 2004. [2.4] T. Yin, A. M. Pappu, and A. B. Apsel. Low-cost, high-efficiency, and highspeed SiGe phototransistors in commercial BiCMOS. IEEE Photonics Technology Letters, 18(1):5557, January 2006. [2.5] A. M. Pappu and A. B. Apsel. A low power, low delay TIA for on-chip applications. Conference on Lasers and Electro-Optics, 1:594596, May 2005. [2.6] Ian OConnor. Optical solutions for system-level interconnect, In International Workshop on System-Level Interconnect Prediction, pages 7988, Paris, France, February 2004. [2.7] G. Chen, H. Chen, M. Haurylau, N. Nelson, P. M. Fauchet, E.G. Friedman, and D. Albonesi. 
Predictions of CMOS compatible on-chip optical interconnect. In International Workshop on System-Level Interconnect Prediction, pages 1320, San Francisco, CA, April 2005.


[2.8] M. Kobrinsky, B. Block, J-F. Zheng, B. Barnett, E. Mohammed, M. Reshotko, F. Robertson, S. List, I. Young, and K. Cadien, On-chip optical interconnects. Intel Technology Journal, 08(02), May 2004. [2.9] G. Chen, H. Chen, M. Haurylau, N. Nelson, P. M. Fauchet, E.G. Friedman, and D. Albonesi. Predictions of CMOS compatible on-chip optical interconnect. In International Workshop on System-Level Interconnect Prediction, pages 1320, San Francisco, CA, April 2005. [2.10] J. D. Davis, J. Laudon, and K. Olukotun. Maximizing CMP throughput with mediocre cores. In International Conference on Parallel Architectures and Compilation Techniques, Saint Louis, MO, September 2005. [2.11] I. Young. Intel introduces chip-to-chip optical I/O interconnect prototype,. Technology@Intel Magazine, pages 37, April 2004. [2.12] J. Crow. Terabus Objectives and Challenges, C2COI Kickoff Meeting, http://www.darpa.mil/mto/c2oi/kick-off/Crow Terabus.pdf, 2003. [2.13] A. F. J. Levi. Fiber-to-the-Processor and Other Challenges for Photonics in Future Systems, http://asia.stanford.edu/events/Spring05/slides/050421-Levi.pdf, 2005. [2.14] The ITRS Technology Working Groups, http://public.itrs.net. International Technology Roadmap for Semiconductors (ITRS) 2005 Edition. [2.15] S. Borkar. Low power design challenges for the decade. In Conference on Asia South Pacific Design Automation, pages 293296, Yokohama, Japan, January February 2001. [2.16] N. Kirman, M. Kirman, R. K. Dokania, J. F. Martinez, A. B. Apsel, M. A. Watkins, and D. H. Albonesi, "Leveraging Optical Technology in Future Busbased Chip Multiprocessors", 39th Annual IEEE/ACM International Symposium on Microarchitecture, Orlando FL, December 2006 [2.17] N. Kirman, M. Kirman, R.K. Dokania, J. Martinez, A.B. Apsel, M.A. Watkins, and D.H. Albonesi, "On-chip Optical Technology in Future Bus-based Multicore Designs" IEEE Micro, Special Issue on the Top Picks from Microarchitecture Conferences, Vol. 27, No. 1, January/February 2007. [3.1] C. 
R. Hogge, A Self-Correcting Clock Recovery Circuit, IEEE J. Lightwave Tech., vol. 3, Dec. 1985, pp. 131214. [3.2] J. D. H. Alexander, Clock Recovery from Random Binary Data, Elect. Lett., vol. 11, Oct. 1975, pp. 54142


[3.3] B. Gilbert, A new wide-band amplifier technique, IEEE Journal of solid state circuits, Dec 1968. [3.4] J. Savoj and B. Razavi, A 10-Gb/s CMOS Clock and Data Recovery Circuit with a Half Rate Linear Phase Detector, IEEE J. Solid-State Circuits, vol. 36, May 2001, pp. 76168 [3.5] Nikola Nedovic A 40-44Gb/s 3x oversampling CMOS CDR/1:16 DEMUX, IEEE solid state circuit conference,ISSCC 2007. [3.6] Lan-chou cho et.al. A 33.6-to-33.8Gb/s Burst-mode CDR in 90nm CMOS IEEE solid state circuit conference, ISSCC, Feb 2007. [3.7] Jri Lee, Mingchung Liu, A 20Gb/s Burst-Mode CDR circuit using injectionlocking technique, solid state circuit conference, ISSCC, Feb- 2007 [3.8] Christian Kromer et.al, A 25-Gb/s CDR in 90-nm CMOS for High-Density Interconnects, IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 12, DECEMBER 2006. [3.9] Jaussi et. al. sampling phase detector for delay-locked loop, US Patent# 7173460, feb6, 2006. [4.1] Q. Xu, B. Schmidt, S. Pradhan, and M. Lipson, Micrometre-scale silicon electro-optic modulator , Nature, Vol. 435, pp. 325-327, 19 May 2005 [4.2] G. Gunn, CMOS photonicsTM - SOI learns a new trick, in Proceedings of IEEE International SOI Conference (Institute of Electrical and Electronics Engineers, New York, 2005), pp. 7-13. [4.3] B. Schmidt, Q. Xu, J. Shakya, S. Manipatruni, and M. Lipson, " Compact electro-optic modulator on silicon-on-insulator substrates using cavities with ultra-small modal volumes," Opt. Express 15, 3140-3148 (2007) [4.4] L. Zhou and A. W. Poon, "Silicon electro-optic modulators using p-i-n diodes embedded 10-micron-diameter microdisk resonators," Opt. Express 14, 6851-6857 (2006) [4.5] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen, R. Nicolaescu and M. Paniccia, A high-speed silicon optical modulator based on a metal-oxidesemiconductor capacitor, Nature, 427, 615-618 (2004).


[4.6] F. Gan and F. X. Krtner, "High-Speed Electrical Modulator in High-IndexContrast (HIC) Si-Waveguides," in Conference on Lasers and Electro-Optics, Technical Digest (CD) (Optical Society of America, 2005), paper CMG1. [4.7] V. R. Almeida, R. R. Panepucci, and M. Lipson, "Nanotaper for compact mode conversion," Opt. Lett. 28, 1302-1304 (2003) [4.8] Wen Luh Yang, Tan Fu Lei, Chung Len Lee. "Contact Resistivities of Al and Ti on Si Measured by a Self-aligned Vertical Kelvin Test Resistor Structure", Solid-State Electronics, Vol. 32, No. 1 1, pp.997-1001, 1989. [4.9] C.A. Barrios, V.R. Almeida, R. Panepucci, M. Lipson, Electrooptic modulation of silicon-on-insulator submicrometer size waveguide devices, Lightwave Technology, Journal of, 2003 [4.10] P. D. Hewitt and G. T. Reed, Improved modulation performance of a silicon p-i-n device by trench isolation, J. Lightwave Technol., vol. 19, no. 3, p. 387, 2001. [4.11] B. Jalali, O. Boyraz, D. Dimitropoulos, V. Raghunathan, Scaling laws of nonlinear silicon nanophotonics Proceedings of SPIE, 2005 [4.12] T. Kuwayama, M. Ichimura, E. Arai, Interface recombination velocity of silicon-on-insulator wafers measured by microwave reflectance photoconductivity decay method with electric field, Applied Phys. Let. 83, 928930, (2003) [4.13] Palais, A. Arcari, Contactless measurement of bulk lifetime and surface recombination velocity in silicon wafers, J. of Appl. Phys. 93, 4686-4690, (2003) [4.14] R. Amatya, C. W. Holzwarth, F. Gan, H. I. Smith, F. Krtner, R. J. Ram, and M. A. Popovic, " Low Power Thermal Tuning of Second-Order Microring Resonators," in Conference on Lasers and Electro-Optics/Quantum Electronics and Laser Science Conference and Photonic Applications Systems Technologies, OSA Technical Digest Series (CD) (Optical Society of America, 2007), paper CFQ5. [4.15] Gregory N. Nielson*, Dilan Seneviratne, Francisco Lopez-Royo, Peter T. Rakich, Fabrizio Giacometti, Harry L. 
Tuller, George Barbastathis MEMS based wavelength selective optical switching for integrated photonic circuits CLEO 2004. [4.16] R. A. Soref and B. R. Bennett, "Electrooptical effects in silicon," IEEE J. Quantum Electron. 23, 123-129 (1987).


[4.17] I. Shake, H. Takara, and S. Kawanishi, "Simple measurement of eye diagram and BER using high-speed asynchronous sampling," Journal of Lightwave Technology, Vol. 22, No. 5, May 2004. [4.18] F. Gan, T. Barwicz, M.A. Popovic, M.S. Dahlem, C.W. Holzwarth, P.T. Rakich, H.I. Smith, E.P. Ippen, and F.X. Kartner, "Maximizing the thermo-optic tuning range of silicon photonic structures," Photonics in Switching 2007, pp. 67-68, 19-22 Aug. 2007. [4.19] S. Manipatruni, R.K. Dokania, B. Schmidt, J. Shakya, A.B. Apsel, and M. Lipson, "Wide temperature range operation of resonant silicon electro-optic modulators," Integrated Photonics and Nanophotonics Research and Applications (IPNRA), 13-16 July 2008. [4.20] J.-M. Lee, D.-J. Kim, H. Ahn, S.-H. Park, and G. Kim, "Temperature dependence of silicon nanophotonic ring resonator with a polymeric overlayer," Journal of Lightwave Technology, Vol. 25, No. 8, Aug. 2007. [5.1] B. Black et al., "Die stacking (3D) microarchitecture," 39th IEEE/ACM International Symposium on Microarchitecture, pp. 469-479, December 2006. [5.2] A. Shacham, B.G. Lee, and K. Bergman, "A wideband, non-blocking, 2x2 switching node for a SPINet network," IEEE Photonics Technology Letters, Vol. 17, No. 12, pp. 2742-2744, Dec. 2005. [5.3] G.T. Reed et al., "Are smaller devices always better?" Japanese Journal of Applied Physics, Vol. 45, pp. 6609-6615, 2006. [5.4] A. Liu et al., "Net optical gain in a low loss silicon-on-insulator waveguide by stimulated Raman scattering," Optics Express, Vol. 12, Issue 18, 2004. [5.5] H. Park et al., "A hybrid silicon evanescent laser fabricated with a silicon waveguide and III-V offset quantum wells," Optics Express, Vol. 13, No. 23, 2005.
