This paper describes a simulation environment for reliability prediction of VLSI designs. The effect of electromigration on the time to failure is investigated. The capabilities of the environment are illustrated with a case study of a microprocessor intended for control applications.
Gwan S. Choi, Ravi K. Iyer, Janak H. Patel
Center for Reliable and High Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1101 W. Springfield Avenue
Urbana, Illinois 61801

Abstract

This paper describes a simulation environment for reliability prediction of VLSI designs. Specifically, the effect of electromigration on the time to failure is investigated. The capabilities of the environment are illustrated with a case study of a microprocessor intended for control applications. The system under investigation is first simulated at the switch level and trace data on the switching activity is collected. This data is then used along with Monte Carlo simulation to model wear-out at the chip level.

Key Words: Reliability prediction, VLSI, device failure mechanism, electromigration, simulation, time-to-failure.

1. Introduction

The reliability of a VLSI chip is ultimately limited by the failure characteristics of its basic building materials, under the stresses imposed by the operating environment. Among all of the IC technological trends, scaling is an important method for reducing die size and thus increasing circuit performance and complexity. Common wear-out processes in ICs are highly influenced by scaling of device dimensions, since this usually leads to increased electrical stresses. A vast variety of methods are available for modeling and predicting device reliability and, to a lesser extent, system reliability. Seldom, however, do these models accurately represent actual systems under realistic operating conditions.

This paper describes a simulation environment for reliability prediction of VLSI designs. Specifically, the effect of electromigration on the time to failure is investigated. The design under study is first simulated at the switch level and trace data on the switching activity is collected. This data is then used along with Monte Carlo simulation to model wear-out of the chip due to electromigration.
The process of electromigration is modeled by removing elements (metal grains) in a matrix that depicts a metal line. The elements are removed based on the current density calculated from the trace data. Based on the wear-out data, the time-to-failure (TTF) of the system is estimated. The above procedure is repeated many times to obtain a TTF distribution.

The method is illustrated by applying it to predict the TTF characteristics of a microprocessor chip in a typical operating environment (room temperature, 5 volts). A log-linear relationship is found between the TTF and the operating voltage and fabrication scale. The log-linear model is used to predict the reliability of a given system in a certain operating environment, or of a circuit fabricated at a different scale.

The next section discusses the related research in this area. Section 3 describes the experimental environment. Section 4 contains the description of the target system; the experimental analysis of the target system is illustrated in Section 5. Section 6 quantifies the electromigration analysis result and proposes a model to predict the expected lifetime of the target system. Concluding remarks are in Section 7.

2. Related Research

Among all of the IC technological trends, scaling has always been an important method for reducing die size and thus increasing circuit performance and complexity. Scaling of design layout rules can lead to increased electrical stresses, and this in turn can accelerate the wear-out process. In [Woods86] the impact of scaling of device dimensions on electromigration is discussed.
An early investigation [Black69] of the electromigration process shows that the mean time-to-failure (MTTF) of a conductor under a constant current stress is expressed by the following equation:

$$\mathrm{MTTF} = A \, J^{-m} \exp\!\left(\frac{E_a}{kT}\right)$$

where $A$ and $m$ are constants dependent upon the microstructure of the metal film, $J$ is the current density, $E_a$ is the activation energy of the metal, $k$ is Boltzmann's constant and $T$ is the temperature in kelvin. This equation is widely accepted and used to model the reliability of IC devices. It is experimentally verified in [Hu88] [McPherson86].

Based on the above model, a number of approaches [Frost89] [Hu88] [Harrison88] to analyze and predict the reliability of ICs have been proposed. In [Frost89], $J$ is obtained via circuit simulation and Black's equation is used to predict device and chip reliability. [Harrison88] predicts the operational MTTF for electromigration failures by extrapolating the results obtained via accelerated testing.

In [Lacombe86] an experimental study to determine the device TTF due to electromigration is described. Test circuits consisting of metal lines of varying length and width were fabricated and tested under different current stresses. A log-linear relationship is found between the TTF and the operating voltage and fabrication scale.

TH0340-0/0000/0249$01.00 © 1991 IEEE

There is considerable evidence to show that the failure rate of a system is a dynamic function of the system activity. Statistical evidence shows that there is an increased probability of failure of logic devices at higher activity levels [Cortes84] [Duba85] [Iyer86]. Several studies have attempted to approximate the effect of activity on device reliability. In [Brooke87], an experiment to determine the MTTF due to electromigration with a pulsed rectangular current applied to the metal line is described. This and other experimental results indicate that the average current density, measured at the device, can be used to approximate $J$ (current density in MA/cm^2).
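The relation between current density, temperature and MTTF described above, with the average current density of a pulsed waveform standing in for the constant-stress current density, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names, the rectangular duty-cycle averaging, and the sample constants (prefactor 1.0, exponent 2, activation energy 0.6 eV) are assumptions chosen for the example and do not come from the paper.

```python
import math

# Boltzmann's constant in eV/K, so the activation energy can be given in eV.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def average_current_density(peak_density, duty_cycle):
    """Average current density of a rectangular unipolar pulse train.

    Assumption: the line carries peak_density for a fraction duty_cycle
    of the time and zero current otherwise.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must lie in [0, 1]")
    return peak_density * duty_cycle


def black_mttf(prefactor, exponent, current_density,
               activation_energy_ev, temperature_k):
    """Black's equation: MTTF = A * J**(-m) * exp(E_a / (k * T)).

    The prefactor A absorbs the units, so the result is in whatever
    time unit A was fitted in (hypothetical here).
    """
    if current_density <= 0.0 or temperature_k <= 0.0:
        raise ValueError("current density and temperature must be positive")
    thermal = activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_k)
    return prefactor * current_density ** (-exponent) * math.exp(thermal)


if __name__ == "__main__":
    # Hypothetical line: peak density 2.0 (arbitrary units), active 25% of the time.
    j_avg = average_current_density(2.0, 0.25)
    cool = black_mttf(1.0, 2.0, j_avg, 0.6, 298.0)
    hot = black_mttf(1.0, 2.0, j_avg, 0.6, 373.0)
    print(f"MTTF ratio, 298 K versus 373 K: {cool / hot:.1f}")
```

With the exponent fixed at 2, halving the average current density multiplies the predicted MTTF by four, which is the sensitivity that makes switching activity matter for lifetime.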
In [Hu88], a model for the electromigration lifetime under an arbitrary unipolar current waveform is developed.

In our study, accurate simulation of the target chip, using a hierarchical switch-level simulator, SPLICE1, is performed to acquire trace data on switch activity. Then, using this switch activity information, the electromigration wear-out process for the entire chip is simulated via Monte Carlo techniques.

3. The Experimental Analysis Environment

[Figure 1. Experimental Environment. Blocks: design/layout information, circuit, fabrication specification, operating environment parameters, Monte Carlo data analysis, reliability prediction.]

The environment allows logic simulation and Monte Carlo analysis of the electromigration process for an entire IC chip. The functions of the different parts of the experimental environment are illustrated in Figure 1. First, accurate simulation of the target chip, using the hierarchical switch-level simulator SPLICE1, is performed to acquire trace data on switch activity. A tracing facility is used to monitor switching activity on all of the internal nodes. Using the switching activity information, the electromigration wear-out process is simulated using Monte Carlo techniques. A failure site on a metal line is modeled by a matrix of metal grains. The grains in the matrix are removed during the simulation based on a normal probability distribution. This process is carried out in parallel for all the metal lines in the circuit. A metal line failure, i.e. chip failure, is assumed to occur if a path is created from the left to the right edge of the matrix. The analysis is performed a number of times to determine the distribution of the time-to-failure of the target chip.
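The grain-removal model just described can be sketched as a small Monte Carlo experiment in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: grains are removed one at a time in a uniformly random order, the interval between removals is drawn from a normal distribution with a hypothetical mean and standard deviation, current density is not modeled, and a "path" is taken to mean a 4-connected chain of removed grains. All function names and default values are made up for the example.

```python
import random
from collections import deque


def has_left_to_right_path(removed):
    """True if removed grains form a 4-connected path from the left
    column to the right column of the matrix (breadth-first search)."""
    rows, cols = len(removed), len(removed[0])
    queue = deque((r, 0) for r in range(rows) if removed[r][0])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if c == cols - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and removed[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def simulate_time_to_failure(rows, cols, mean_interval, sd_interval, rng):
    """Remove grains in random order until the line fails; return the
    elapsed time. Intervals are |N(mean, sd)| draws (an assumption)."""
    removed = [[False] * cols for _ in range(rows)]
    intact = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(intact)  # popping from a shuffled list = uniform random order
    elapsed = 0.0
    while intact:
        elapsed += abs(rng.gauss(mean_interval, sd_interval))
        r, c = intact.pop()
        removed[r][c] = True
        if has_left_to_right_path(removed):
            return elapsed
    raise AssertionError("unreachable: a fully removed matrix has a path")


def ttf_samples(trials, rows=3, cols=3, mean_interval=1.0,
                sd_interval=0.2, seed=0):
    """Repeat the experiment to obtain a time-to-failure distribution."""
    rng = random.Random(seed)
    return [simulate_time_to_failure(rows, cols, mean_interval,
                                     sd_interval, rng)
            for _ in range(trials)]


if __name__ == "__main__":
    samples = ttf_samples(1000)
    print(f"mean TTF over {len(samples)} runs: "
          f"{sum(samples) / len(samples):.2f}")
```

Repeating the run many times yields the empirical time-to-failure distribution; a fuller model would make the removal rate of each line depend on the current density derived from its trace data.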
The above procedure is performed under varying operating environments and fabrication technology parameters: the operating voltage, temperature and device dimension are varied, and the reliability impact of reduced dimensions and technology improvements is quantified.

3.1. The Logic Simulation.

In order to perform fast and accurate simulation of the target chip, a hierarchical switch-level simulator, SPLICE1 [Saleh87], was used (see footnote 1). For a comprehensive study of switching activity in the microprocessor, a tracing facility was also developed to monitor all of the internal nodes of the target chip. The tracing facility is capable of monitoring each node for all processed switching events. The trace data for each event consists of the time of the event, the hierarchical node name, and the new and previous electrical levels and their strengths.

The trace data from the simulation is then used to generate the workload, consisting of switching and logic state information and related timing information for all the devices in the target circuit. The workload data is then used along with Monte Carlo simulation to model wear-out at the device level due to electromigration.

Footnote 1: The switch-level analysis in SPLICE1 is performed using a relaxation-based method that uses MOS-oriented models. Virtually unlimited levels of signal strength can be associated with each of the logic values in order to further enhance the accuracy. This approach allows a correspondence between the electrical output conductance and the logic output strength. A fanout-dependent delay model capable of handling first-order effects is used to achieve accurate delay handling.

3.2. Monte Carlo Simulation of the Wear-out Process.

An example of the metal line model is shown in Figure 2. The process of electromigration is modeled by removing elements (metal grains) in a matrix that depicts a metal line. Monte Carlo analysis is used to simulate the grain removal process under a normal distribution. The grains are removed based on the current density calculated from the trace data. This process is carried out in parallel for all the metal lines in the circuit. A metal line failure, i.e. chip failure, is assumed to occur if a path is created from the left to the right edge of the matrix.

[Figure 2: Modeled metal line. Simulated matrix with metal width = 3 micron and grain size 1x1 micron; removed elements are marked.]

The above analyses can be performed under varying operating environments and fabrication technology parameters. In particular, the operating voltage, temperature and device dimension can be varied, and the reliability impact of reduced dimensions and technology improvements can be quantified.

The Monte Carlo analysis environment makes automatic use of "importance sampling" to reduce the run lengths. The underlying theory of the method of importance sampling is described in the next section.

3.3. Importance Sampling.

The problem with direct simulation of device wear-out processes is the time required to perform the analysis. This is because device failure due to wear-out occurs on the order of years, if not tens of years. Simulation of such a scenario is impossible with the capabilities of current simulators. The importance sampling technique allows us to accelerate the events causing the wear-out mechanisms by biasing the related parameters in order to increase the chance of failure [Hammersley64].

Given a function $f(x)$, assume that we are interested in estimating the integral

$$\theta = \int_0^1 f(x)\,dx$$

by taking $N$ independent samples $(s_1, \ldots, s_N)$ of $f(x)$ and calculating the average of the samples. The objective in importance sampling is to concentrate the distribution of the sample points on the parts of the interval that are of the greatest "importance", instead of spreading them out evenly. Thus, instead of sampling from a uniform distribution, we introduce a sampling distribution $G(x)$ with density function $g(x)$:

$$G(x) = \int_0^x g(t)\,dt, \qquad G(1) = \int_0^1 g(x)\,dx = 1.$$

So as not to bias the result, we compensate for the distortion by taking $f(x)/g(x)$, with $x$ sampled from $G$, in place of $f(x)$ as our estimator of $\theta$. The variance of this new unbiased estimator is:

$$\sigma^2 = \int_0^1 \left( \frac{f(x)}{g(x)} - \theta \right)^2 g(x)\,dx.$$

For a minimum variance solution, $g(x)$ must be close to $f(x)/\theta$. However, this requires knowledge of $\theta$, which is unknown. The above considerations do, however, provide some guidance for selecting $g(x)$. In particular, the shape of $g(x)$ should follow the shape of $f(x)$ as closely as possible.

The above methodology is used to accelerate the wear-out scenario in our Monte Carlo simulations. The reliability measure we are interested in estimating is the mean time to failure (MTTF). In the Monte Carlo analysis, the wear-out process is carried out in parallel for all the metal lines in the circuit. The key is to focus on those metal lines that are most likely to fail. Metal lines having a higher rate of switching events have a higher chance of causing a failure and have their wear-out accelerated. We accomplish this effect by biasing the probability distribution of grain removal appropriately at each switching event. The biased result is normalized by the appropriate acceleration factor to obtain the actual (unbiased) MTTF.

4. Example Design.

The target system for our study is a 16-bit microprocessor typically used for controlling engine functions. The control system samples engine parameters such as the fuel flow, the temperature and the engine speed, and other external inputs such as speed and positional parameters. The sampled parameters are digitized and updated into the RAM approximately every millisecond for further processing.

For a fault-tolerant design, the microprocessor can be used in a dual configuration with a suitable reconfiguration strategy.
A simple approach would be to have the lead channel stop its usual operation on detecting a fault and transfer control to the dual.

The overall system architecture thus contains microprocessors, memory units, I/O gate array chips, communication channels, A/D converters and D/A converters. In this experiment we simulate the microprocessor.

[Figure 3. Target Chip. Block labels as extracted: Timing, Countdown, Addr, Control, ALU, I/O, Memory, Decode, UART, Parity, Disc.]

The example microprocessor is shown in Figure 3. It consists of six major functional units. The arithmetic and logic unit (ALU) can perform double-precision arithmetic operations. The control unit, which is responsible for issuing signals to control the operations of the ALU, is made up of combinational logic and several registers. The decoder unit decodes I/O signals, the multiplexer unit provides the discrete lines and buses, and the countdown unit is used to drive chip-wide clock signals. The watchdog unit provides detection of faults by resetting the processor in the event of a parity error or when the application software is timed out by the software sanity timer. The signal to synchronize the dual system is also provided by this unit.

5. Experiment.

First, the entire microprocessor was simulated at the switch level. The initialization phase of the microprocessor was simulated; it consists of a watchdog test, a parity test, an instruction set test, a RAM test and a ROM sum test, and ensures that all of the functional units are exercised. The simulation included the processor accessing one external ROM for instructions and another external ROM for the initialization parameters. Arithmetic processing and address generation were also performed. Traces of events were collected from the simulation.

Using the switching activity information, the electromigration wear-out process was simulated using Monte Carlo techniques.
Each metal line was modeled by a 3 by 3 matrix of metal grains. The grains in the matrix were removed during the simulation based on a normal probability distribution and the current density calculated from the trace data. The Monte Carlo analysis environment makes automatic use of the importance sampling technique to accelerate the wear-out process.

The above analysis was performed under varying operating environments and fabrication technology parameters. The operating voltage, temperature and device dimension were varied, and the reliability impact of reduced dimensions and technology improvements was quantified. The voltage levels tested were 3, 5, 7 and 9 volts. The temperature was varied from 25 °C to 100 °C. Different device scales (2.1, 2.8 and 3.5 microns) were tested to study the reliability impact of reduced dimensions. Other technology improvements, such as metal lines with better conduction, were also studied.

6. Results.

Reliability projections based on the electromigration analysis, performed assuming T = 25 °C with a 5-volt power supply, are given in Figure 4. The figure shows the expected lifetime (z-axis) of the controller as a function of the overall average operational hours per day (x-axis) and the percentage of that time in actual use (y-axis). The MTTF figures in this graph compare favorably (within the 90% confidence interval) with those reported in manufacturers' reliability reports [Intel]. For example (Figure 4, example 1), if the overall operational period is 12 hours/day and the actual usage of the controller is 25% of this time, then the expected lifetime of the chip is about 46 years.

The results shown in Figure 5 give MTTF distributions (y-axis) as a function of operating voltage (x-axis) and device dimension (shown with different line attributes). For example (Figure 5, example 1), the same circuit fabricated in 2.8-micron technology (all other factors unchanged) has an MTTF of 4.61887 years when operating at 3 volts.

7. Conclusions.
This paper has described a simulation environment for reliability prediction of VLSI designs. Specifically, the effect of electromigration on the time to failure was investigated. The environment was illustrated with a case study of a microprocessor intended for control applications.

The system under investigation was first simulated at the switch level and trace data on switching activity was collected. This data was then used along with Monte Carlo simulation to model wear-out at the chip level. The results of the simulation compare favorably with actual manufacturer-supplied experimental results.

Acknowledgments

This work was supported by the National Aeronautics and Space Administration under NASA grant NAG-1-602. Thanks are also due to Antoine Mourad and Dong Tang for their careful reading of this manuscript.

[Figure 4. Expected life-time of the chip. x-axis: operational hours/day (6, 12, 18, 24).]

REFERENCES

[Black69] J. Black, "Electromigration Failure Modes in Aluminum Metallization for Semiconductor Devices," Proceedings of the IEEE, Vol. 57, No. 9, pp. 1587-1593, 1969.
[Brooke87] L. Brooke, "Pulsed Current Electromigration Failure Model," IEEE Proceedings IRPS, pp. 136-144, April 1987.
[Chen87] I. Chen, C. Hu, "Accelerated Testing of Time-Dependent Breakdown of SiO2," IEEE Electron Device Letters, Vol. EDL-8, No. 4, pp. 140-142, April 1987.
[Cortes84] M. Cortes, R. Iyer, "Device Failures and System Activity: A Thermal Effects Model," FTCS-14, 1984.
[Duba85] P. Duba and others, "Effects of System Activity on Chip Reliability," Proceedings, The First International Workshop on VLSI Design, Madras, India, December 1985.
[Frost89] D. Frost, K. Poole, "RELIANT: A Reliability Analysis Tool for VLSI Interconnects," IEEE Journal of Solid-State Circuits, Vol. 24, No. 2, pp. 458-462, April 1989.
[Hammersley64] J. Hammersley, D. Handscomb, "Monte Carlo Methods," Methuen, London, 1964.
[Harrison88] J. Harrison, "On Extrapolation from Accelerated Test MTTF to Operating Condition MTTF for Electromigration Failures," SRC TECHCON-88, Dallas TX, pp. 240-243, October 1988.
[Hu88] C. Hu, P. Ko, P. Lee, N. Cheung, B. Liew, "IC Reliability Prediction," SRC TECHCON-88, Dallas TX, pp. 240-243, October 1988.
[Intel] Intel Manuals on Reliability Assessment.
[Iyer86] R. Iyer, D. Rossetti, M. Hsueh, "Measurement and Modeling of Computer Reliability as Affected by System Activity," ACM Transactions on Computer Systems, Vol. 4, No. 3, pp. 214-237, August 1986.
[Lacombe86] D. Lacombe, E. Parks, "The Distribution of Electromigration Failures," IEEE Proceedings IRPS, pp. 1-6, April 1986.
[Lee88] J. Lee, I. Chen, C. Hu, "Statistical Modeling of Silicon Dioxide Reliability," IEEE Proceedings IRPS, pp. 131-138, 1988.
[McPherson86] J. McPherson, "Stress Dependent Activation Energy," IEEE Proceedings IRPS, pp. 12-18, April 1987.
[Ricco83] B. Ricco, M. Azbel, M. Brodsky, "Novel mechanism for tunneling and breakdown of thin SiO2 films," Phys. Rev. Lett., Vol. 51, No. 19, pp. 1795, 1983.
[Saleh87] R. A. Saleh, "Nonlinear relaxation algorithms for circuit simulation," Memorandum No. UCB/ERL M87/21, Electronics Research Laboratory, University of California, Berkeley, 1987.
[Woods86] M. Woods, "MOS VLSI Reliability and Yield Trends," Proceedings of the IEEE, Vol. 74, No. 12, pp. 1715-1729, December 1986.

[Figure 5. MTTF distributions at different operating voltages and fabrication dimensions. x-axis: operating voltage (V), 3 to 9; y-axis: MTTF (years), 2 to 14; one curve per device dimension.]