
A MONTE-CARLO SIMULATION ENVIRONMENT

FOR WEAR OUT IN VLSI SYSTEMS


Gwan S. Choi Ravi K. Iyer Janak H. Patel
Center for Reliable and High Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1101 W. Springfield Avenue
Urbana, Illinois 61801
Abstract
This paper describes a simulation environment for reliability
prediction of VLSI designs. Specifically, the effect of
electromigration on the time to failure is investigated. The
capabilities of the environment are illustrated with a case
study of a microprocessor intended for control applications.
The system under investigation is first simulated at the switch
level and trace data on the switching activity is collected.
This data is then used along with Monte Carlo simulation to
model wear-out at the chip-level.
Key Words: Reliability prediction, VLSI, device failure
mechanism, electromigration, simulation, time-to-failure.
1. Introduction
The reliability of a VLSI chip is ultimately limited by
the failure characteristics of its basic building materials, under
the stresses imposed by the operating environment. Among all
of the IC technological trends, scaling is an important method
for reducing die size and thus increasing circuit performance
and complexity. Common wear-out processes in ICs are
highly influenced by scaling of device dimensions since this
usually leads to increased electrical stresses. A vast variety of
methods are available for modeling and predicting device reliability
and, to a lesser extent, system reliability. Seldom, however,
do these models accurately represent actual systems
under realistic operating conditions.
This paper describes a simulation environment for relia-
bility prediction of VLSI designs. Specifically, the effect of
electromigration on the time to failure is investigated. The
design under study is first simulated at the switch level and
trace data on the switching activity is collected. This data is
then used along with Monte Carlo simulation to model wear-out
of the chip due to electromigration. The process of electromigration
is modeled by removing elements (metal grains) in a
matrix that depicts a metal line. The elements are removed
based on the current density calculated from the trace data.
Based on the wear-out data, the time-to-failure (TTF) of the
system is estimated. The above procedure is repeated many
times to obtain a TTF distribution.
The method is illustrated by applying it to predict the
TTF characteristics of a microprocessor chip in a typical
operating environment (room temperature, 5 volts). A log-linear
relationship is found between the TTF and operating
voltage/fabrication scale. The log-linear model is used to
predict reliability of a given system in a certain operating
environment or a circuit fabricated at a different scale.
The next section discusses the related research in this
area. Section 3 describes the experimental environment. Sec-
tion 4 contains the description of the target system; the experi-
mental analysis of the target system is illustrated in Section 5.
Section 6 quantifies the electromigration analysis result and
proposes a model to predict the expected lifetime of the target
system. Concluding remarks are in Section 7.
2. Related Research
Among all of the IC technological trends, scaling has
always been an important method for reducing die size and
thus increasing circuit performance and complexity. Scaling
of design layout rules can lead to increased electrical stresses
and this in turn can accelerate the wear-out process. In
[Woods86] the impact of scaling of device dimension on electromigration
is discussed.
An early investigation [Black69] of the electromigration
process shows that the mean time-to-failure (MTTF) of a conductor
under a constant current stress is expressed by the following
equation:
$$\mathrm{MTTF} = A\,J^{-m}\exp\!\left(\frac{E_a}{kT}\right)$$
where A and m are constants dependent upon the
microstructure of the metal film, J is the
current density, E_a is the activation energy of the metal,
k is Boltzmann's constant and T is the
temperature in kelvins.
This equation is widely accepted and used to model reliability
of IC devices. It is experimentally verified in
[Hu88] [McPherson86]. Based on the above model, a number
of approaches [Frost89] [Hu88] [Harrison88] to analyze and
predict the reliability of ICs have been proposed.
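As a worked illustration of Black's equation, the sketch below computes the ratio of the predicted MTTF at two operating points, so that the unknown constant A cancels. The function name `mttf_ratio` and the default values for m and the activation energy are illustrative assumptions and are not taken from this paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann's constant in eV/K


def mttf_ratio(j_ref, t_ref, j_new, t_new, m=2.0, e_a=0.6):
    """Return MTTF(new) / MTTF(ref) under Black's equation.

    The prefactor A cancels in the ratio. The defaults for m and for the
    activation energy e_a (in eV) are assumed, illustrative values.
    Temperatures are in kelvins; current densities share any one unit.
    """
    current_term = (j_new / j_ref) ** (-m)
    thermal_term = math.exp(
        e_a / BOLTZMANN_EV_PER_K * (1.0 / t_new - 1.0 / t_ref)
    )
    return current_term * thermal_term


if __name__ == "__main__":
    # Doubling the current density at a fixed temperature with m = 2
    # cuts the predicted MTTF to one quarter.
    print(mttf_ratio(j_ref=1.0, t_ref=298.15, j_new=2.0, t_new=298.15))
    # Raising the temperature from 25 C to 100 C shortens the predicted life.
    print(mttf_ratio(j_ref=1.0, t_ref=298.15, j_new=1.0, t_new=373.15))
```

Working with ratios avoids having to know A, which depends on the metal film and is usually obtained from accelerated tests.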
In [Frost89], the current density J is obtained via circuit simulation and
Black's equation is used to predict device and chip reliability.
[Harrison88] predicts the operational MTTF for electromigration
failures by extrapolating the results obtained via
accelerated testing. In [Lacombe86] an experimental study to
determine the device TTF due to electromigration is described.
The test circuits consisting of metal lines of varying length
and width were fabricated and tested under different current
stresses. A log-linear relationship is found between the TTF
and operating voltage/fabrication scale.
TH0340-0/0000/0249$01.00 © 1991 IEEE
There is considerable evidence to show that the failure
rate of a system is a dynamic function of the system activity.
Statistical evidence shows that there is an increased probability
of failure of logic devices at higher activity level [Cortes84]
[Duba85] [Iyer86]. Several studies have attempted to approximate
the effect of activity on device reliability. In [Brooke87],
an experiment to determine the MTTF due to electromigration
with pulsed rectangular current applied at the metal line is
described. This and other experimental results indicate that
average current density, measured at the device, can be used to
approximate J (current density in MA/cm^2). In [Hu88], a
model for the electromigration lifetime under arbitrary
unipolar current waveforms is developed.
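The averaging idea reported for pulsed stress can be made concrete with a small sketch that time-averages a piecewise-constant current-density waveform. The function name `average_current_density` and the sample pulse train are hypothetical and serve only as an illustration.

```python
def average_current_density(times, densities):
    """Time-average a piecewise-constant current-density waveform.

    times: strictly increasing instants, one more entry than densities,
        so that densities[i] holds on the interval [times[i], times[i + 1]).
    densities: current density on each interval (any consistent unit).
    """
    if len(times) != len(densities) + 1:
        raise ValueError("need len(times) == len(densities) + 1")
    total = 0.0
    for i, density in enumerate(densities):
        total += density * (times[i + 1] - times[i])
    return total / (times[-1] - times[0])


if __name__ == "__main__":
    # Rectangular pulse train: on for 1 time unit, off for 3 (25% duty cycle).
    times = [0.0, 1.0, 4.0, 5.0, 8.0]
    densities = [2.0, 0.0, 2.0, 0.0]
    print(average_current_density(times, densities))
```

For a rectangular pulse train the result is simply the peak density multiplied by the duty cycle.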
In our study, accurate simulation of the target chip,
using a hierarchical switch-level simulator, SPLICE1, is performed
to acquire trace data on switch activity. Then, using
this switch activity information, the electromigration wear-out
process for the entire chip is simulated via Monte Carlo tech-
niques.
3. The Experimental Analysis Environment
[Figure 1 block diagram. Legible labels: Design/Layout Info, Circuit, Fabrication Specification, Operating Environment Parameters, Monte Carlo, Data Analysis, Reliability Prediction.]
Figure 1. Experimental Environment.
The environment allows logic simulation and Monte
Carlo analysis of the electromigration process for an entire IC
chip. The functions of different parts of the experimental
environment are illustrated in Figure 1. First, accurate simulation
of the target chip, using a hierarchical switch-level simulator,
SPLICE1, is performed to acquire trace data on switch
activity. A tracing facility is used to monitor switching activities
on all of the internal nodes. Using the switching activity
information, the electromigration wear-out process is simu-
lated using Monte Carlo techniques. A failure site on a metal
line is modeled by a matrix of metal grains. The grains in the
matrix are removed during the simulation based on a normal
probability distribution. This process is carried out in parallel
for all the metal lines in the circuit. A metal line failure, i.e.
chip failure, is assumed to occur if a path is created from the
left to the right edge of the matrix. The analysis is performed
a number of times to determine the distribution of the time-
to-failure of the target chip. The above procedure is performed
under varying operating environments and fabrication
technology parameters: the operating voltage, temperature
and device dimension are varied, and the reliability impact
of reduced dimension and technology improvements is
quantified.
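The grain-removal process described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the removal rule (a grain is removed when a normal draw centred on a stress value exceeds a threshold), the 4-neighbour connectivity, and all names and parameter values are hypothetical.

```python
import random
from collections import deque


def has_left_to_right_path(removed):
    """True if removed grains connect the left edge to the right edge."""
    rows, cols = len(removed), len(removed[0])
    queue = deque((r, 0) for r in range(rows) if removed[r][0])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if c == cols - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and removed[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def steps_to_failure(rows, cols, stress, threshold, rng, max_steps=100_000):
    """Simulate one metal line and return the step at which it fails.

    At every step each surviving grain draws a normal variate with mean
    `stress` (a stand-in for current density) and unit standard deviation,
    and is removed when the draw exceeds `threshold`. This removal rule
    is an assumption made for illustration.
    """
    removed = [[False] * cols for _ in range(rows)]
    for step in range(1, max_steps + 1):
        for r in range(rows):
            for c in range(cols):
                if not removed[r][c] and rng.gauss(stress, 1.0) > threshold:
                    removed[r][c] = True
        if has_left_to_right_path(removed):
            return step
    return max_steps


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [steps_to_failure(3, 3, 0.0, 2.0, rng) for _ in range(200)]
    print("mean steps to failure:", sum(samples) / len(samples))
```

Repeating `steps_to_failure` many times yields an empirical distribution of failure times, which is the role the repeated analysis plays in the environment.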
3.1. The Logic Simulation.
In order to perform fast and accurate simulation of the
target chip, a hierarchical switch-level simulator, SPLICE1
[Saleh87]^1, was used. For a comprehensive study of switching
activity in the microprocessor, a tracing facility was also
developed to monitor all of the internal nodes of the target
chip. The tracing facility is capable of monitoring each node
for all processed switching events. The trace data for each
event consists of the time of the event, the hierarchical node
name and the new and previous electrical levels and their
strengths. The trace data from the simulation is then used to
generate the work load consisting of switching and logic state
information and related timing information for all the devices
in the target circuit. The work load data is then used along
with Monte Carlo simulation to model wear-out at the device
level, due to electromigration.
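The step from trace data to a work load can be sketched as below. The record layout follows the fields listed above (event time, hierarchical node name, new and previous levels and strengths), but the class name, field names and node names are hypothetical and do not reflect SPLICE1's actual output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One switching event; field names are illustrative assumptions."""
    time: float          # time of the event
    node: str            # hierarchical node name
    new_level: int       # logic level after the event
    prev_level: int      # logic level before the event
    new_strength: int    # signal strength after the event
    prev_strength: int   # signal strength before the event


def switching_counts(events):
    """Count level changes per node, ignoring strength-only events."""
    counts = Counter()
    for ev in events:
        if ev.new_level != ev.prev_level:
            counts[ev.node] += 1
    return counts


def switching_rates(events):
    """Return per-node switching events per unit of simulated time."""
    if not events:
        return {}
    span = max(ev.time for ev in events) - min(ev.time for ev in events)
    if span <= 0:
        raise ValueError("events must cover a positive time span")
    return {node: n / span for node, n in switching_counts(events).items()}


if __name__ == "__main__":
    trace = [
        TraceEvent(0.0, "alu.add0.out", 1, 0, 3, 3),
        TraceEvent(5.0, "alu.add0.out", 0, 1, 3, 3),
        TraceEvent(10.0, "ctl.reg2.q", 1, 0, 2, 2),
    ]
    print(switching_rates(trace))
```

The per-node rates are the kind of summary that a later wear-out model could convert into a current-density estimate for the metal lines attached to each node.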
3.2. Monte Carlo Simulation of The Wear-out Process.
An example of the metal line model is shown in Figure
2. The process of electromigration is modeled by removing
elements (metal grains) in a matrix that depicts a metal line.
Monte Carlo analysis is used to simulate the grain removal
process under a normal distribution. The grains are removed
based on the current density calculated from the trace data.
This process is carried out in parallel for all the metal lines in
the circuit. A metal line failure, i.e., chip failure, is assumed to
occur if a path is created from the left to the right edge of the
matrix.
[Figure 2 diagram: a metal line and its simulated matrix, with a removed element marked. Metal width = 3 micron; grain size 1x1 micron.]
Figure 2: Modeled metal line.
The above analyses can be performed under varying
operating environments and fabrication technology parameters.
In particular, operating voltage, temperature and the device
dimension can be varied and the reliability impact of reduced
dimension and technology improvements can be quantified.
The Monte Carlo analysis environment makes
automatic use of "importance sampling" to reduce the run
lengths. The underlying theory of the method of importance
sampling is described in the next section.
^1 The switch-level analysis in SPLICE1 is performed using a
relaxation based method that uses MOS oriented models. Virtually
unlimited levels of signal strength can be associated with each of the
logic values in order to further enhance the accuracy. This approach
allows a correspondence between the electrical output conductance and
the logic output strength. A fanout-dependent delay-model capable of
handling first-order effects is used to achieve accurate delay-handling.
3.3. Importance Sampling.
The problem with direct simulation of device wear-out
processes is the time required to perform the analysis. This is
because device failure due to wear-out occurs on the order of
years, if not tens of years. Simulation of such a scenario is
impossible with the capabilities of current simulators. The importance
sampling technique allows us to accelerate the events
causing the wear-out mechanisms by biasing the related
parameters in order to increase the chance of failure
[Hammersley64].
Given a function f(x), assume that we are interested in
estimating the integral:
$$\theta = \int_0^1 f(x)\,dx$$
by taking N independent samples (s_1, ..., s_N) of f(x) and calculating
the average of the samples. The objective in importance
sampling is to concentrate the distribution of the sample
points on the parts of the interval that are of the greatest
"importance", instead of spreading them out evenly. Thus,
instead of sampling from a uniform distribution, we introduce
a sampling distribution G(x) with its density function g(x):
$$G(1) = \int_0^1 g(x)\,dx = 1.$$
So as not to bias the result, we compensate for the distortion
by taking f'(x) = f(x)/g(x) in place of f(x) as our estimator of $\theta$.
The variance of this new unbiased estimator is:
$$\sigma^2 = \int_0^1 \left(\frac{f(x)}{g(x)} - \theta\right)^2 g(x)\,dx$$
For a minimum variance solution g(x) must be close to f(x)/$\theta$.
However this requires the knowledge of $\theta$, which is unknown.
The above considerations do however provide some guideline
for selecting g(x). In particular the shape of g(x) should follow
the shape of f(x) as closely as possible.
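The variance reduction can be checked numerically on a toy integrand. The sketch below compares plain uniform sampling with importance sampling for an f(x) whose integral is known; the choice of f, the sampling density g and all function names are assumptions made for illustration and are unrelated to the wear-out model itself.

```python
import math
import random


def f(x):
    """Integrand; its integral over [0, 1] is e - 1."""
    return math.exp(x)


def g(x):
    """Sampling density on [0, 1] that roughly follows the shape of f."""
    return (1.0 + x) / 1.5


def sample_from_g(rng):
    """Draw from g by inverting its CDF G(x) = (x + x**2 / 2) / 1.5."""
    u = rng.random()
    return -1.0 + math.sqrt(1.0 + 3.0 * u)


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


def estimate(n, rng):
    """Return (plain mean, plain variance, weighted mean, weighted variance)."""
    plain = [f(rng.random()) for _ in range(n)]
    weighted = []
    for _ in range(n):
        x = sample_from_g(rng)
        weighted.append(f(x) / g(x))  # compensate for the biased sampling
    return mean_and_variance(plain) + mean_and_variance(weighted)


if __name__ == "__main__":
    plain_mean, plain_var, is_mean, is_var = estimate(20000, random.Random(1))
    print("exact value        :", math.e - 1.0)
    print("uniform sampling   :", plain_mean, "variance", plain_var)
    print("importance sampling:", is_mean, "variance", is_var)
```

Both estimators target the same integral; the weighted one has the smaller per-sample variance because f(x)/g(x) is much flatter than f(x) on the interval.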
The above methodology is used to accelerate the wear-out
scenario in our Monte Carlo simulations. The reliability
measure we are interested in estimating is the mean time to
failure (MTTF). In the Monte Carlo analysis, the wear-out
process is carried out in parallel for all the metal lines in the
circuit. The key is to focus on those metal lines that are most
likely to fail. Metal lines having a higher rate of switching
events have a higher chance of causing a failure and have their
wear-out accelerated. We accomplish this effect by biasing
the probability distribution of grain removal appropriately at
each switching event. The biased result is normalized by the
appropriate acceleration factor to obtain the actual (unbiased)
MTTF.
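The normalization step can be illustrated on a deliberately reduced model: a single grain that is removed at each switching event with a small probability p. The sketch simulates at a biased probability and scales the result back by the acceleration factor. The model, the function names and the numeric values are assumptions for illustration; the appropriate factor for the full matrix model may differ.

```python
import math
import random


def events_to_removal(p, rng):
    """Number of switching events until a grain is removed, when each
    event removes it independently with probability p (geometric law)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1


def accelerated_mean_events(p, acceleration, n, rng):
    """Estimate the mean events-to-removal at probability p by simulating
    at the biased probability acceleration * p and scaling the result
    back by the acceleration factor."""
    biased_p = acceleration * p
    if not 0.0 < biased_p < 1.0:
        raise ValueError("acceleration * p must lie strictly between 0 and 1")
    total = sum(events_to_removal(biased_p, rng) for _ in range(n))
    return acceleration * total / n


if __name__ == "__main__":
    rng = random.Random(0)
    p = 1e-4  # hypothetical per-event removal probability
    print("exact mean      :", 1.0 / p)
    print("accelerated est.:", accelerated_mean_events(p, 100.0, 20000, rng))
```

In this reduced model the biased run needs about one hundredth of the simulated events per sample, and multiplying by the acceleration factor recovers the unbiased mean in expectation.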
4. Example Design.
The target system for our study is a 16-bit microprocessor
typically used for controlling engine functions. The control
system samples engine parameters such as the fuel flow,
the temperature, the engine speed and other external inputs
such as speed and positional parameters. The sampled param-
eters are digitized and updated into the RAM approximately
every millisecond for further processing.
For a fault tolerant design, the microprocessor can be
used in a dual configuration with a suitable reconfiguration
strategy. A simple approach would be to have the lead chan-
nel stop its usual operation on detecting a fault and transfer
control to the dual.
The overall system architecture thus contains microprocessors,
memory units, I/O gate array chips, communication
channels, A/D converters and D/A converters. In this
experiment we simulate the microprocessor.
[Figure 3 block diagram. Legible labels: Timing, Countdown, Addr, Control, ALU, I/O, Memory, Decode, UART, Parity, Disc.]
Figure 3. Target Chip.
The example microprocessor is shown in Figure 3. It
consists of six major functional units. The arithmetic and
logic unit (ALU) can perform double precision arithmetic
operations. The control unit, which is responsible for issuing
signals to control the operations of the ALU, is made up of
combinational logic and several registers. The decoder unit
decodes I/O signals, the multiplexer unit provides the discrete
lines and buses, and the countdown unit is used to drive chip-wide
clock signals. The watchdog unit provides detection of
faults by resetting the processor in the event of a parity error
or when the application software is timed out by the software
sanity timer. Also, the signal to synchronize the dual system
is provided by this unit.
5. Experiment.
First, the entire microprocessor was simulated at the
switch-level. The simulation covered the initialization phase of the
microprocessor, which consists of a watchdog test, a parity test, an
instruction set test, a RAM test and a ROM sum test, and which ensures
that all of the functional units are exercised. The simulation
included the processor accessing one external ROM for
instructions and another external ROM for the initialization
parameters. Arithmetic processing and address generation were
also performed. Traces of events were collected from the
simulation.
Using the switching activity information, the electromigration
wear-out process was simulated using Monte Carlo
techniques. Each metal line was modeled by a 3 by 3 matrix
of metal grains. The grains in the matrix were removed during
the simulation based on a normal probability distribution
and the current density calculated from the trace data. The
Monte Carlo analysis environment makes automatic use of the
importance sampling technique to accelerate the wear-out process.
The above analysis was performed under varying
operating environments and fabrication technology parameters.
The operating voltage, temperature and the device dimension
were varied and the reliability impact of reduced dimension
and technology improvements was quantified. Voltage levels
tested were 3, 5, 7 and 9 volts. Temperature was varied from 25 C
to 100 C. Different device scales (2.1 microns, 2.8 microns and
3.5 microns) were tested to study the reliability impact of
reduced dimension. Other technology improvements, such as
metal lines with better conduction, were also studied.
6. Results.
Reliability projections based on the electromigration
analysis, performed assuming T = 25 C with a 5 volt power
supply, are given in Figure 4. The figure shows the expected
life-time (z-axis) of the controller as a function of the overall
average operational hours per day (x-axis) and the percent of
actual time (y-axis) in use. The MTTF figures in this graph
compare favorably (within the 90% confidence interval) with
those reported in manufacturers' reliability reports [Intel]. For
example (Figure 4: Example 1), if the overall operational
period is 12 hours/day and the actual usage of the controller is
25% of this time, then the expected life time of the chip is
about 46 years.
The results shown in Figure 5 give MTTF distributions
(y-axis) for operating voltage (x-axis) and the device dimension
(shown in different line attributes). For example (Figure
5: Example 1), the same circuit fabricated in 2.8-micron technology
(all other factors remain unchanged) has an MTTF of
4.61887 years operating at 3 volts.
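A log-linear model of the kind mentioned in the introduction, ln(TTF) = a + b x with x the operating voltage or the fabrication scale, could be fitted by ordinary least squares as sketched below. The function names are hypothetical, and the data points in the example are invented placeholders that are not read from Figure 5.

```python
import math


def fit_log_linear(xs, ttfs):
    """Least-squares fit of ln(TTF) = a + b * x; returns (a, b)."""
    ys = [math.log(t) for t in ttfs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_ttf(a, b, x):
    """Evaluate the fitted model at operating point x."""
    return math.exp(a + b * x)


if __name__ == "__main__":
    # Invented (voltage, MTTF in years) pairs, for illustration only.
    voltages = [3.0, 5.0, 7.0, 9.0]
    mttf_years = [12.0, 6.0, 3.0, 1.5]
    a, b = fit_log_linear(voltages, mttf_years)
    print("slope b:", b)
    print("predicted MTTF at 6 V:", predict_ttf(a, b, 6.0))
```

Once a and b are fitted from simulated points, the same expression interpolates the expected lifetime at operating points that were not simulated.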
7. Conclusions.
This paper has described a simulation environment for
reliability prediction of VLSI designs. Specifically, the effect
of electromigration on the time to failure was investigated.
The environment was illustrated with a case study of a
microprocessor intended for control applications.
The system under investigation was first simulated at the
switch level and trace data on switching activity was collected.
This data was then used along with Monte Carlo simulation to
model wear-out at the chip-level. The results of the simulation
compare favorably with actual manufacturer-supplied
experimental results.
Acknowledgments
This work was supported by the National Aeronautics
and Space Administration under NASA grant NAG-1-602.
Thanks are also due to Antoine Mourad and Dong Tang for
their careful reading of this manuscript.
[Figure 4 plot. Legible axis: Operational hours/day (6, 12, 18, 24).]
Figure 4. Expected life-time of the chip
REFERENCES
[Black69] J. Black, "Electromigration Failure Modes in Aluminum Metallization for Semiconductor Devices," Proceedings of the IEEE, Vol. 57, No. 9, pp. 1587-1593, 1969.
[Brooke87] L. Brooke, "Pulsed Current Electromigration Failure Model," IEEE Proceedings IRPS, pp. 136-144, April 1987.
[Chen87] I. Chen, C. Hu, "Accelerated Testing of Time-Dependent Breakdown of SiO2," IEEE Electron Device Letters, Vol. EDL-8, No. 4, pp. 140-142, April 1987.
[Cortes84] M. Cortes, R. Iyer, "Device Failures and System Activity: A Thermal Effects Model," FTCS-14, 1984.
[Duba85] P. Duba and others, "Effects of System Activity on Chip Reliability," Proceedings, The First International Workshop on VLSI Design, Madras India, December 1985.
[Frost89] D. Frost, K. Poole, "RELIANT: A Reliability Analysis Tool for VLSI Interconnects," IEEE Journal of Solid-State Circuits, Vol. 24, No. 2, pp. 458-462, April 1989.
[Hammersley64] J. Hammersley, D. Handscomb, "Monte Carlo Methods," Methuen, London, 1964.
[Harrison88] J. Harrison, "On Extrapolation from Accelerated Test MTTF to Operating Condition MTTF for Electromigration Failures," SRC TECHCON-88, Dallas TX, pp. 240-243, October 88.
[Hu88] C. Hu, P. Ko, P. Lee, N. Cheung, B. Liew, "IC Reliability Prediction," SRC TECHCON-88, Dallas TX, pp. 240-243, October 88.
[Intel] Intel Manuals on Reliability Assessment.
[Iyer86] R. Iyer, D. Rossetti, M. Hsueh, "Measurement and Modeling of Computer Reliability as Affected by System Activity," ACM Transactions on Computer Systems, Vol. 4, No. 3, pp. 214-237, August 1986.
[Lacombe86] D. Lacombe, E. Parks, "The Distribution of Electromigration Failures," IEEE Proceedings IRPS, pp. 1-6, April 1986.
[Lee88] J. Lee, I. Chen, C. Hu, "Statistical Modeling of Silicon Dioxide Reliability," IEEE Proceedings IRPS, pp. 131-138, 1988.
[McPherson86] J. McPherson, "Stress Dependent Activation Energy," IEEE Proceedings IRPS, pp. 12-18, April 1987.
[Ricco83] B. Ricco, M. Azbel, M. Brodsky, "Novel mechanism for tunneling and breakdown of thin SiO2 films," Phys. Rev. Lett., Vol. 51, No. 19, pp. 1795, 1983.
[Saleh87] R. A. Saleh, "Nonlinear relaxation algorithms for circuit simulation," Memorandum No. UCB/ERL M87/21, Electronics Research Laboratory, University of California, Berkeley, 1987.
[Woods86] M. Woods, "MOS VLSI Reliability and Yield Trends," Proceedings of the IEEE, Vol. 74, No. 12, pp. 1715-1729, December 1986.
[Figure 5 plot. y-axis: MTTF (Years), ticks 2 to 14; x-axis: Operating Voltage (V), ticks 3, 5, 7, 9; legible legend entry: 2.1 micron.]
Figure 5. MTTF distributions at different operating
voltages and fabrication dimensions.
