
BAYESIAN STATISTICS 6, pp. 000–000
J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Eds.)
© Oxford University Press, 1998

Uncertainty Analysis and other Inference Tools for Complex Computer Codes

ANTHONY O'HAGAN, MARC C. KENNEDY and JEREMY E. OAKLEY
Department of Mathematics, University of Nottingham, UK
SUMMARY

This paper builds on work by Haylock and O'Hagan which developed a Bayesian approach to uncertainty analysis. The generic problem is to make posterior inference about the output of a complex computer code, and the specific problem of uncertainty analysis is to make inference when the "true" values of the input parameters are unknown. Given the distribution of the input parameters (which is often a subjective distribution derived from expert opinion), we wish to make inference about the implied distribution of the output. The computer code is sufficiently complex that the time to compute the output for any input configuration is substantial. The Bayesian approach was shown to improve dramatically on the classical approach, which is based on drawing a sample of values of the input parameters and thereby obtaining a sample from the output distribution. We review the basic Bayesian approach to the generic problem of inference for complex computer codes, and present some recent advances: inference about the distribution and quantile functions of the uncertainty distribution, calibration of models, and the use of runs of the computer code at different levels of complexity to make efficient use of the quicker, cruder, versions of the code. The emphasis is on practical applications.
Keywords: COMPUTATIONAL EXPERIMENT; SIMULATION; GAUSSIAN PROCESS; SENSITIVITY ANALYSIS; UNCERTAINTY DISTRIBUTION; CALIBRATION; MULTI-LEVEL CODES; MODEL INADEQUACY.


1. INTRODUCTION

1.1. Complex computer codes

In many fields, complex computer programs are used to model and predict real phenomena. For example, weather forecasting uses highly sophisticated models for atmospheric pressures and humidities; the behaviour of large or complex engineering structures is typically modelled in great detail as an aid to their design; astronomers and physicists have long required massive computations to model and predict the movements of planets or atomic particles. A feature of such computer programs is that they generally require substantial amounts of computing time, even on powerful computers. When it is necessary to use many runs of the program, in order to compute the output over a range of input configurations, the time required for each run becomes important. This paper presents some statistical tools for such problems.

We consider the computer model as a black box. Its output is represented as a function $\eta(\cdot)$ taking value $\eta(x)$ for input $x \in \mathcal{X}$. The input space $\mathcal{X}$ is typically multidimensional, with many individual quantities needing to be specified to define the input configuration for a given run, so we can think of $x$ as a vector. In practice, computer codes usually also produce many outputs from a single run, but for the purposes of this paper we will suppose that the output of $\eta$ is a scalar. There are straightforward extensions to


multivariate outputs for the methods presented here, but the univariate case suffices to illustrate the theory and to tackle some interesting practical problems.

We treat $\eta(\cdot)$ as an unknown function, and our objective is to make inference about individual values $\eta(x)$ or various functionals of $\eta(\cdot)$. The data will typically comprise observations of the output from $n$ runs of the program, $y_i = \eta(x_i)$, $i = 1, 2, \ldots, n$. The set $X = \{x_1, x_2, \ldots, x_n\}$ of inputs will be called the sample design.

To say that $\eta(\cdot)$ is an unknown function may seem strange, since somebody has written the computer code, programming a series of operations which, implicitly at least, define the function $\eta(\cdot)$. However, we say that $\eta(x)$ is unknown for a given input configuration $x$ in the same sense that a given digit, say the one thousandth, in the decimal expansion of $\sqrt{2}$ is unknown. It is well defined mathematically, and algorithms exist to compute it, but I do not know its value. To me it is an unknown quantity, until I compute it. In the same way, $\eta(x)$ is unknown until we run the computer code with input $x$ and observe its output.

1.2. Prior modelling

Since $\eta(\cdot)$ is to be an unknown function, the subject of inference, we will need to formulate a prior distribution for it. The most tractable form of prior distribution, and by far the most commonly used in previous work, is the Gaussian process. Formally, writing
$$ \eta(\cdot) \sim N\bigl(m_0(\cdot),\, v_0(\cdot,\cdot)\bigr) \qquad (1.1) $$

specifies that $\eta(\cdot)$ has a Gaussian process distribution with mean function $m_0(\cdot)$ and covariance function $v_0(\cdot,\cdot)$. It means that each individual $\eta(x)$ is normally distributed as
$$ \eta(x) \sim N\bigl(m_0(x),\, v_0(x,x)\bigr), $$

and furthermore that the joint distribution of the values of $\eta(\cdot)$ at any finite number of different $x$ values is multivariate normal, such that $\mathrm{Cov}\bigl(\eta(x), \eta(x')\bigr) = v_0(x, x')$ for all $x, x' \in \mathcal{X}$. The Gaussian process is the natural extension of the multivariate normal distribution to infinite numbers of dimensions.

Although the computer code is too complex for $\eta(x)$ to be known without running the computer program, we will typically have prior information about $\eta(\cdot)$. In particular, since the code is intended to model some real phenomenon, prior information about how $\eta(x)$ responds to input $x$ may come from knowledge of that phenomenon. It may be expressed via the mean and covariance functions, $m_0(\cdot)$ and $v_0(\cdot,\cdot)$. The following hierarchical formulation is useful.
$$ m_0(x) = h(x)^T \beta, \qquad v_0(x, x') = \sigma^2 w_0(x, x'), \qquad (1.2) $$

where $h(\cdot)$ is a known vector of $p$ regressor functions, $\beta$ is a $p \times 1$ vector of unknown coefficients and $\sigma^2$ is an unknown scale parameter. Through $h(\cdot)$ one can express prior information about the general form of the function $\eta(\cdot)$. We will adopt the specification
$$ p(\beta, \sigma^2) \propto \sigma^{-2} \qquad (1.3) $$

for weak prior information about the parameters $\beta$ and $\sigma^2$, although a more general normal inverse gamma prior is equally tractable; see O'Hagan (1992).
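To fix ideas, the following minimal sketch (not the authors' code) draws sample paths from a Gaussian process prior of the form (1.1)–(1.3), with a linear regression mean $h(x)^T\beta$ and, anticipating the correlation function (1.5) introduced below, a squared-exponential covariance. The numerical values of $\beta$, $\sigma^2$ and the roughness parameter $b$ are purely illustrative.

```python
# Sketch: prior sample paths of eta(.) under the hierarchical GP prior.
# beta, sigma2 and b are illustrative choices, not values from the paper.
import numpy as np

def prior_cov(x1, x2, sigma2=1.0, b=2.0):
    """Covariance v0(x, x') = sigma^2 * exp(-b * (x - x')^2) for scalar inputs."""
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-b * d**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 101)
h = np.column_stack([np.ones_like(x), x])    # h(x) = (1, x): a linear mean
beta = np.array([0.5, 1.0])                  # illustrative regression coefficients

mean = h @ beta
cov = prior_cov(x, x) + 1e-10 * np.eye(x.size)       # jitter for numerical stability
draws = rng.multivariate_normal(mean, cov, size=3)   # three prior sample paths
```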


The specification of the function $w_0(\cdot,\cdot)$ is important because it expresses beliefs about the smoothness of $\eta(\cdot)$. We will assume a stationary formulation,
$$ w_0(x, x') = c(x - x'), \qquad (1.4) $$
where $x - x'$ represents a suitable distance measure on $\mathcal{X}$. For vector $x$ this might be Euclidean distance. The function $c(\cdot)$ is then a correlation function, so that $c(0) = 1$ and $c(d)$ is a decreasing function for $d > 0$. It expresses prior belief about smoothness of $\eta(\cdot)$ in terms of correlation between points close together or further apart. For instance, as explained in O'Hagan (1992), belief in the existence of derivatives of $\eta(\cdot)$ is linked to differentiability of $c(\cdot)$ at the origin. $c(\cdot)$ may be completely specified or may be modelled in terms of further hyperparameters. For instance
$$ c(d) = \exp(-b d^2), \qquad (1.5) $$
which expresses a belief that $\eta(\cdot)$ is infinitely differentiable, depends on a roughness parameter $b > 0$, which may be specified or treated as another unknown parameter. (To be a valid correlation function, Bochner's Theorem requires that $c(\cdot)$ must be the characteristic function of a random variable whose distribution is symmetric about the origin: see Feller, 1966, p. 622.)

1.3. Posterior inference

Given the Gaussian process prior (1.1) and observation vector $y = (\eta(x_1), \ldots, \eta(x_n))^T$, the posterior distribution of $\eta(\cdot)$ is another Gaussian process (because if vectors $z_1$ and $z_2$ are jointly multivariate normally distributed, the distribution of $z_1$ given $z_2$ is also multivariate normal). With the hierarchical structure (1.2) and (1.3) added, the posterior distribution of $\eta(\cdot)$ given $\beta$ and $\sigma^2$ is a Gaussian process, while that of $\beta$ and $\sigma^2$ is normal inverse gamma; see, for example, O'Hagan (1992). After integrating out $\beta$, we can write
$$ \eta(\cdot) \mid \sigma^2, y \sim N\bigl(m(\cdot),\, \sigma^2 w(\cdot,\cdot)\bigr), \qquad (1.6) $$
$$ \sigma^2 \mid y \sim IG(a, d), \qquad (1.7) $$
where $IG(a, d)$ denotes the inverse gamma distribution with $d$ degrees of freedom and scale parameter $a$, so that $E(\sigma^2 \mid y) = a/(d - 2)$, and where $a$, $d$, $m(\cdot)$ and $w(\cdot,\cdot)$ depend on $y$, $h(\cdot)$ and $w_0(\cdot,\cdot)$. For details, see O'Hagan (1992). It follows that the posterior distribution of $\eta(x)$ is a t distribution, and the joint distribution of $\eta(\cdot)$ at any finite set of points in $\mathcal{X}$ is multivariate t. In fact, we call the posterior distribution of $\eta(\cdot)$ a Student process because it generalises the multivariate t distribution in the same way as the Gaussian process generalises the multivariate normal distribution.

The simplest inference to make now is about the value $\eta(x)$ of $\eta(\cdot)$ at some point $x \in \mathcal{X}$ at which we have not yet run the program. As just mentioned, the posterior distribution is a t distribution, and its mean $E(\eta(x) \mid y) = m(x)$ is the natural posterior estimate. Indeed, $m(\cdot)$ can now be used as an approximate surrogate for the computer program. Whereas computing $\eta(x)$ is highly complex and demanding of both time and computer resources, $m(x)$ is trivial to calculate. In the following sections of this paper we address a variety of further inference problems, and variations on the basic model of Section 1.2.
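A rough sketch of how $m(\cdot)$ can serve as a cheap surrogate is given below. It conditions on a fixed roughness parameter $b$, works with scalar inputs, and uses a toy function invented here to stand in for an expensive code; it omits the posterior variance and the inverse-gamma component, for which O'Hagan (1992) gives the full formulae.

```python
# Sketch: posterior mean m(x) of the GP emulator, used as a surrogate for eta(x).
import numpy as np

def corr(x1, x2, b=2.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-b * d**2)

def fit_surrogate(X, y, b=2.0):
    """Return m(.), the posterior mean of eta(.), for scalar design points X and outputs y."""
    H = np.column_stack([np.ones_like(X), X])             # h(x) = (1, x)
    A = corr(X, X, b) + 1e-10 * np.eye(X.size)
    Ainv = np.linalg.inv(A)
    beta_hat = np.linalg.solve(H.T @ Ainv @ H, H.T @ Ainv @ y)   # generalised least squares
    resid = Ainv @ (y - H @ beta_hat)
    def m(xnew):
        hnew = np.column_stack([np.ones_like(xnew), xnew])
        return hnew @ beta_hat + corr(xnew, X, b) @ resid
    return m

# Example: emulate a toy "code" from n = 6 runs.
eta = lambda x: np.sin(3 * x) + x           # stands in for an expensive code
X = np.linspace(0.0, 1.0, 6)
m = fit_surrogate(X, eta(X))
print(m(np.array([0.25, 0.75])), eta(np.array([0.25, 0.75])))
```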


1.4. Previous work

There is already a sizeable literature on what is often called "computer experiments". Much of this work is non-Bayesian, but nevertheless adopts (1.1) and (1.2), or similar models (although it is hard to conceive how anyone could believe in a frequentist interpretation of such models!). The review paper of Sacks, Welch, Mitchell and Wynn (1989) is an important source. Currin et al. (1991) provide a Bayesian interpretation. This part of the literature concentrates on inference about $\eta(x)$, to predict the output of the computer model for new input configurations, and much of this work focuses on the design problem (see, for example, Sacks, Schiller and Welch, 1989; Morris and Mitchell, 1995; Bates et al., 1996). Latin hypercube designs have generally been found to be useful, particularly when a code has many inputs. Often many of the inputs are not influential, and these designs have good projective properties. Welch et al. (1992) present an algorithm to identify the important inputs, which can be used to reduce the size of the prediction problem. The choice of suitable covariance function is another issue which is often discussed in the literature (Sacks, Welch, Mitchell and Wynn, 1989; Currin et al., 1991). Morris et al. (1993) consider the design of experiments when, in addition to model output, we can also obtain derivative information.

1.5. Outline of paper

Section 2 first reviews the Bayesian approach to uncertainty analysis, which is concerned with inferences about the distribution of $\eta(x)$ induced by a distribution on $x$. This is then extended with particular reference to inference about the distribution function of the induced distribution, and a flexible new simulation technique is presented for computing such inferences. An application is given to a model for the effect of ingesting radioactive iodine. Computer codes can often be run at different levels of complexity. Section 3 shows how fast, but cruder, runs of the code can be used to improve inference about the highest level output, reducing the number of runs needed of the slow code. An application to an oil field simulator is given. Section 4 concerns sensitivity analysis, where the objective is to identify those elements of the input vector to which the output is most sensitive. Section 5 concerns model inadequacy and calibration. Observing the real physical system which the computer code is modelling allows us to learn about inadequacies in the code. We can also use such observations to learn about input parameters which characterise the specific application context, a process known as calibration. The methods are illustrated with data arising in a simulated radiation accident. Finally, Section 6 presents some general comments and areas for further research.

2. UNCERTAINTY ANALYSIS

2.1. The uncertainty distribution

Suppose that we wish to use the output of the program to predict the real phenomenon in a situation when some or all of the inputs are unknown. The following example is considered in O'Hagan and Haylock (1997). A model developed by the International Commission on Radiological Protection (1993) models the movement of plutonium-239 (Pu) through the body, in order to predict the effective radioactive dose that a person would receive after ingesting a unit quantity of Pu. There are many inputs to this


model, which are the rates at which Pu is transmitted from one part of the body to another. These rates are in general unknown and vary between individuals. Therefore if we consider a randomly chosen individual the inputs will be random variables, and hence the model output (the predicted effective dose) is also random.

In general, suppose that the model input $x$ is a random variable with distribution $G$. Then $Y = \eta(x)$ is a random variable. The probability distribution of $Y$ induced by the distribution $G$ of $x$ is called the uncertainty distribution. If $\eta(\cdot)$ were a simple function like $\eta(x) = x^2$ then it would be a simple exercise of the probability calculus to infer the distribution of $Y$. Then we obtain whatever summaries of the uncertainty distribution may be of interest, such as its mean $E(Y) = E_G(x^2)$. However, $\eta(\cdot)$ is a complex function and is itself being treated as unknown. Just as we have considered making statistical inference about $\eta(x)$ for fixed $x$, the problem of uncertainty analysis is that of making statistical inferences about the uncertainty distribution induced by random $x$.

2.2. Mean and variance

Haylock and O'Hagan (1996) considered the question of inference about the mean and variance of the uncertainty distribution. For instance the mean is defined by
$$ M = E(Y) = E_G\bigl(\eta(x)\bigr) = \int_{\mathcal{X}} \eta(x)\, dG(x). \qquad (2.1) $$

Since according to (1.6) $\eta(\cdot)$ has a Gaussian process posterior distribution given $\sigma^2$, $M$ has a normal distribution $M \mid \sigma^2, y \sim N(m_1, \sigma^2 v_1)$, where
$$ m_1 = \int_{\mathcal{X}} m(x)\, dG(x), \qquad v_1 = \int_{\mathcal{X}} \int_{\mathcal{X}} w(x, x')\, dG(x)\, dG(x'). $$
Hence from (1.7) the posterior distribution of $M$ is a t distribution with mean $m_1$. Haylock and O'Hagan (1996) gave explicit formulae for $m_1$ and $v_1$ in the case where $G$ is a

(multivariate) normal distribution and the prior covariance structure (1.4) and (1.5) is used. They also derived formulae for the posterior mean and variance of $\mathrm{Var}(Y)$ and gave an illustration using a computer model similar to, but simpler than, the plutonium-239 model.

The standard non-Bayesian approach to uncertainty analysis is based on Monte Carlo methods. In its simplest form, a random sample $x_1, x_2, \ldots, x_N$ is drawn from $G$ and the computer code run for each of these input configurations. The resulting $y_i = \eta(x_i)$, $i = 1, 2, \ldots, N$, is a sample from the uncertainty distribution and hence, for instance, the sample mean $\bar{y}$ is a frequentist unbiased estimator of $M$. We have denoted the sample size by $N$ here, rather than $n$, to emphasise that the Monte Carlo method demands large samples, which in view of the time required for each run of some codes is not very practical.

Our Bayesian uncertainty analysis is far more efficient. Haylock and O'Hagan (1996) obtained more accurate inferences about $M$ for their simple iodine-131 model using Bayesian uncertainty analysis and $n = 10$ well chosen runs than with a conventional Monte Carlo analysis using $N = 1000$ runs. O'Hagan and Haylock (1997) applied the method to the more complex plutonium model. (Whereas the iodine model had only a 2-dimensional input, the plutonium model had 14 uncertain input parameters.) They also considered a more general correlation


structure in which (1.4) and (1.5) were replaced by
$$ w(x, x') = \exp\Bigl(-\sum_j b_j (x_j - x_j')^2\Bigr), \qquad (2.2) $$
allowing a separate roughness parameter for each of the 14 inputs. The parameters $b_j$ were given uniform prior distributions, but since the posterior distribution of $\eta(\cdot)$ depends in a highly complex way on $w_0(\cdot,\cdot)$ a full Bayesian analysis is not practical. Instead, the $b_j$s were estimated by their joint posterior mode and then the remainder of the analysis proceeded as if these were the true values. The roughness parameters are most often chosen using maximum likelihood: for example, this method is used within a non-Bayesian framework by Sacks, Welch, Mitchell and Wynn (1989), Welch et al. (1992) and Sacks, Schiller and Welch (1989), and by Currin et al. (1991) within a Bayesian framework. This is the same as using a non-informative prior in our case. Informative priors could be useful when there is not enough information in the data to identify competing values of $b_j$ and $\sigma^2$.
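As a small illustration of (2.2), the following sketch evaluates the product correlation with a separate roughness parameter for each input dimension. The design points and the $b_j$ values are arbitrary placeholders; in the analysis described above they would instead be set to their joint posterior mode.

```python
# Sketch: anisotropic squared-exponential correlation (2.2) with one b_j per input.
import numpy as np

def corr_aniso(x1, x2, b):
    """w(x, x') = exp(-sum_j b_j (x_j - x'_j)^2) for rows of x1 and x2."""
    d2 = (x1[:, None, :] - x2[None, :, :]) ** 2      # pairwise squared differences
    return np.exp(-(d2 * b).sum(axis=-1))

b = np.full(14, 0.5)                                  # placeholder roughness per input
X = np.random.default_rng(2).uniform(size=(10, 14))   # 10 design points, 14 inputs
R = corr_aniso(X, X, b)                               # 10 x 10 correlation matrix
```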

2.3. Distribution function

A complete description of the uncertainty distribution is provided by its distribution function
$$ F(c) = P(Y \le c) = \int_{\mathcal{X}} I\{\eta(x) \le c\}\, dG(x), $$
where $I(\cdot)$ denotes the indicator function. We can obtain expressions for posterior moments of $F(c)$, such as
$$ E\{F(c) \mid y\} = \int_{\mathcal{X}} P\bigl(\eta(x) \le c \mid y\bigr)\, dG(x). \qquad (2.3) $$
The integrand of (2.3) is computed from the posterior t distribution of $\eta(x)$.

To illustrate this inference, we consider again the iodine model of Haylock and O'Hagan (1996). The model predicts the effective dose following the ingestion of a unit quantity of radioactive iodine. The iodine accumulates in the thyroid gland, and there are two unknown inputs to the model, the mass of the thyroid gland ($w$) and the fraction of iodine absorbed by the thyroid ($f$). Following a study by Dunning, Schwarz and Schwarz (1981), lognormal distributions were used:
$$ \log w \sim N(2.889,\, 0.463^2), \qquad \log f \sim N(-1.315,\, 0.355^2). $$

We write $x = (\log w, \log f)$ and set $h(x)^T = (1, \log w, \log f)$. For the prior covariance function we used (2.2) and estimated the two roughness parameters $b_1$ and $b_2$ by their joint posterior mode as in O'Hagan and Haylock (1997). This model was simple enough for Haylock and O'Hagan to explore $\eta(\cdot)$ quite exhaustively with a grid of one million runs, and from these the true distribution function $F(\cdot)$ was computed and is plotted as the solid line in Figure 1. Here we have written dose as the CEDE $\times 10^8$. The dotted line is the posterior mean (2.3) based on just $n = 9$ runs. The estimate is already good: to achieve comparable accuracy with the empirical distribution function of a Monte Carlo sample would require hundreds of runs. The posterior mean from $n = 16$ runs is indistinguishable from the true $F(\cdot)$, the solid line in Figure 1.
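A minimal sketch of how the integrand of (2.3) can be averaged over draws from $G$ is given below. The functions post_mean and post_scale and the degrees of freedom df are hypothetical placeholders for the Student-process posterior quantities; they are not defined here, and the sketch assumes such functions are available from an emulator fit.

```python
# Sketch: Monte Carlo estimate of E{F(c) | y} using the posterior t distribution of eta(x).
import numpy as np
from scipy import stats

def expected_F(c, draws_from_G, post_mean, post_scale, df):
    """E{F(c) | y} = int P(eta(x) <= c | y) dG(x), estimated by averaging t CDF values."""
    probs = stats.t.cdf((c - post_mean(draws_from_G)) / post_scale(draws_from_G), df)
    return probs.mean()
```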


Figure 1. Expected distribution function

2.4. Computation

Computing the posterior mean (2.3) is not trivial, even if the dimensionality of $\mathcal{X}$ is low, as in the iodine example. This is because the integrand is not smooth. For consider evaluating $F(y_i)$, when $c = y_i$ is an observed value. At the corresponding $x = x_i$ there is no posterior uncertainty, and $P(\eta(x_i) \le y_i) = 1$, but for $x$ arbitrarily close to $x_i$ there is uncertainty. Oakley and O'Hagan (1998) show that not only is $\lim_{\delta \searrow 0} P\bigl(\eta(x_i + \delta x_0) \le y_i \mid y\bigr)$ strictly less than one for all $x_0 \in \mathcal{X}$, but that furthermore it depends on $x_0$. Therefore when $c = y_i$ the integrand of (2.3) is extremely badly behaved in the neighbourhood of $x_i$. For $c$ close to $y_i$, the integrand is continuous at $x_i$ but has steep gradients in the neighbourhood of $x_i$. Evaluating (2.3) accurately by conventional numerical integration is therefore difficult for low-dimensional $\mathcal{X}$ and quite impractical for something like the 14-dimensional plutonium model.

We propose an alternative technique based on simulated draws from the posterior distribution of $\eta(\cdot)$, which not only facilitates computation of $E(F(c) \mid y)$ without the need to take account of the intransigent nature of the integrand of (2.3) but also handles many other intractable inferences. Consider a set of points $X' = \{x'_1, x'_2, \ldots, x'_N\}$ in $\mathcal{X}$, distinct from the design points $X$. Defining $y' = (y'_1, y'_2, \ldots, y'_N)^T$, where $y'_i = \eta(x'_i)$, $i = 1, 2, \ldots, N$, the posterior distribution of $y'$ is multivariate t. Suppose that we make a random draw $y'_{(1)}$ from this distribution, and consider the posterior distribution of $\eta(\cdot)$ given both $y$ and $y'_{(1)}$. Let $\mathcal{X}_0$ be a subset of $\mathcal{X}$ such that $G(\mathcal{X}_0)$ is close to one. Then if $N$ is large enough and the points in $X'$ cover $\mathcal{X}_0$ well, the variance of $\eta(\cdot)$ given $y$ and $y'_{(1)}$ will be small for all $x \in \mathcal{X}_0$. In that case we can regard the posterior mean $m'_{(1)}(\cdot) = E\bigl(\eta(\cdot) \mid y, y'_{(1)}\bigr)$ as approximately a random draw from the posterior distribution of $\eta(\cdot)$, and the approximation is good for all


$x \in \mathcal{X}_0$. If we then repeat the exercise, drawing $y'_{(2)}, y'_{(3)}, \ldots, y'_{(M)}$ from the distribution of $y'$ given $y$, the resulting functions $m'_{(1)}(\cdot), m'_{(2)}(\cdot), \ldots, m'_{(M)}(\cdot)$ can be treated as a sample of size $M$ from the posterior Student process distribution of $\eta(\cdot)$.

We can now make inference about any aspect of $\eta(\cdot)$ of interest, by computing that functional for each $m'_{(j)}(\cdot)$ and thereby deriving a sample of size $M$. For instance, the posterior mean shown dotted in Figure 1 was computed using a grid of $N = 40$ and drawing $M = 1000$ realisations $m'_{(j)}(\cdot)$. For each realisation a sample of 1000 points $x_k$ were drawn from $G$ and a realisation $F_{(j)}(\cdot)$ of $F(\cdot)$ constructed from the empirical distribution function of the sample $m'_{(j)}(x_k)$ $(k = 1, 2, \ldots, 1000)$. Then $E(F(\cdot) \mid y)$ was estimated from the sample mean of these $F_{(j)}(\cdot)$s. Although computationally intensive, this approach is simple to implement and avoids the numerical difficulties of evaluating (2.3) directly. It is worth remembering that even this kind of extensive computation to evaluate posterior inferences may take less computer time and power than a single run of the original computer code.

One use of this approach is to perform uncertainty analysis with less tractable forms of covariance function than (1.5). As a simple artificial example, we considered the function
$$ \eta(x) = 5 + x + \cos x $$

when $x \sim N(0, 4)$ and in place of (1.5) we have $c(d) = \exp(-|d|/2)$. The uncertainty analysis formulae of Haylock and O'Hagan cannot be computed analytically. We used the simulation approach with just 5 data points and drawing 1000 realisations $m'_{(j)}(\cdot)$. The uncertainty distribution mean was estimated for each realisation as the sample mean of $m'_{(j)}(x_k)$ for 1000 draws $x_k$ from $N(0, 4)$. We obtained $E(M \mid y) = 5.06$ and $\mathrm{Var}(M \mid y) = 0.2$. Obviously, for such a simple $\eta(\cdot)$ we can evaluate $M$ exactly to check this computation. The true value is $M = 5.13$. (A sketch of this computation is given below.)

2.5. Distribution and quantile functions

When $F(c)$ is close to zero or one, we may expect its posterior distribution to be strongly skewed. This may lead to the posterior mean tending to overestimate $F(c)$ when it is small, and to underestimate it when it is large. The posterior median of $F(\cdot)$ would be another useful inference, therefore, and although no expression comparable to (2.3) exists to allow it to be computed directly, the simulation method is still applicable. Indeed, the same sample of 1000 empirical distribution functions $F_{(j)}(\cdot)$ drawn to compute Figure 1 yields Figure 2. The curves here are the 2.5, 50 and 97.5 percentile curves. For any $c$ on the horizontal axis, these curves plot percentiles of the posterior distribution of $F(c)$, so that for instance the middle curve is the posterior median. Figure 3 plots the posterior density of $F(c)$ for a small value of $c$, obtained from the simulated values $F_{(j)}(c)$ by kernel smoothing. It shows that the posterior is positively skewed as expected. Another interpretation of Figure 2 is to give posterior inference about the quantile function. For any $p$ on the vertical axis, reading horizontally the curves give percentiles of the posterior distribution of $F^{-1}(p)$. For more information about these methods see Oakley and O'Hagan (1998). In particular they show the further complexities involved in making inference about the density function $dF(c)/dc$ of the uncertainty distribution.
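Returning to the artificial example in Section 2.4, the following simplified sketch mimics the simulation approach under stated assumptions: it fixes $\sigma^2 = 1$, uses a constant mean estimated by generalised least squares, and draws realisations of $\eta(\cdot)$ on a grid from the conditional Gaussian process rather than from the full Student process. It is an illustration, not the analysis reported above.

```python
# Simplified sketch of simulation-based uncertainty analysis for eta(x) = 5 + x + cos(x),
# with c(d) = exp(-|d|/2), 5 design points, and G = N(0, 4).
import numpy as np

rng = np.random.default_rng(3)
eta = lambda x: 5 + x + np.cos(x)
corr = lambda a, b: np.exp(-np.abs(a[:, None] - b[None, :]) / 2)

X = np.linspace(-4.0, 4.0, 5)                       # 5 design points
y = eta(X)
grid = np.linspace(-6.0, 6.0, 60)                   # grid covering most of G's mass

# GP conditioning with constant mean (h(x) = 1) and sigma^2 fixed at 1 (a simplification).
A = corr(X, X) + 1e-8 * np.eye(X.size)
Ainv = np.linalg.inv(A)
beta_hat = (np.ones(X.size) @ Ainv @ y) / (np.ones(X.size) @ Ainv @ np.ones(X.size))
t = corr(grid, X)
m_grid = beta_hat + t @ Ainv @ (y - beta_hat)
V_grid = corr(grid, grid) - t @ Ainv @ t.T
V_grid = (V_grid + V_grid.T) / 2 + 1e-8 * np.eye(grid.size)

M_draws = []
for _ in range(1000):                               # M = 1000 realisations of eta(.)
    path = rng.multivariate_normal(m_grid, V_grid)
    xg = rng.normal(0.0, 2.0, size=1000)            # 1000 draws from G = N(0, 4)
    M_draws.append(np.interp(xg, grid, path).mean())

print(np.mean(M_draws), np.var(M_draws))            # compare with E(M|y) and Var(M|y)
```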


Figure 2. Quantile distribution functions

Figure 3. Density of the distribution function for a small output

3. MULTI-LEVEL CODES

3.1. A model for two-level codes

It will often be the case that a computer model can be run at different levels of complexity. Simpler versions of the code will run faster but provide less accurate simulation of the real phenomenon. For example, Craig et al. (1996) describe a computer model simulating an oil reservoir. The rocks which comprise the reservoir are divided into blocks, and finite


element analysis used to solve complex equations linking the properties of the blocks. If the division into blocks is fine, producing many small blocks, the simulator is believed to be accurate, but at this level of complexity a single run takes between 1 and 3 days of computation. A coarser division reduces the run time to a matter of minutes, but yields a much cruder simulation of the real process.

Suppose that the code may be run at $s$ different levels. Their outputs define $s$ functions $\eta_1(\cdot), \eta_2(\cdot), \ldots, \eta_s(\cdot)$. Our objective is to make inference about $\eta_s(\cdot)$, which is the highest level code, the most accurate but the most costly to run. Our only purpose in recognising other levels of code is because we will use output from runs at other levels to learn about $\eta_s(\cdot)$. The hope is that data from some runs at lower levels will provide additional information about $\eta_s(\cdot)$ much more cost-effectively than if we made all runs at level $s$. In this paper, we will simplify the presentation by considering only the two-level case, $s = 2$. Generalisation to $s > 2$ is straightforward, although it introduces more questions of how to model the relationships between levels.

We will suppose that $\eta_1(\cdot)$ and $\eta_2(\cdot)$ jointly have a Gaussian process prior distribution. In addition to specifying mean functions and covariance functions for the marginal Gaussian process distributions of $\eta_1(\cdot)$ and $\eta_2(\cdot)$ separately, we will need to define a cross-covariance function $\mathrm{Cov}\bigl(\eta_1(x), \eta_2(x')\bigr)$. This task is simplified by the following assumption:
$$ \mathrm{Cov}\bigl(\eta_2(x), \eta_1(x') \mid \eta_1(x)\bigr) = 0 \quad \text{for all } x' \ne x. \qquad (3.1) $$
This says that the most we can learn about the output $\eta_2(x)$ at $x \in \mathcal{X}$ under the complex model, from observations of the simpler model, is to observe $\eta_1(x)$ for the same input $x$. If we know $\eta_1(x)$ then observing $\eta_1(x')$ for some (or even all) $x' \ne x$ gives no further information about $\eta_2(x)$. This is a kind of Markov property: the complex code run $\eta_2(x)$ depends on $\eta_1(\cdot)$ only through the nearest point $\eta_1(x)$. The following autoregressive model is easily seen to have the property (3.1) and can be shown to be the simplest formulation with that property. We write
$$ \eta_2(x) = \rho\, \eta_1(x) + \delta(x), \qquad (3.2) $$

where $\rho$ is a regression parameter and where the two processes $\eta_1(\cdot)$ and $\delta(\cdot)$ are independent. This model therefore removes the need to think about a cross-covariance function, replacing it with the scalar parameter $\rho$.

3.2. Oil reservoir example

For this example we use part of the data from the oil reservoir simulator referred to by Craig et al. (1996). The model was used to simulate an oil reservoir which was split into 5 regions, each with different characteristics of permeability and porosity. The permeabilities and porosities of the 5 regions comprised the 10 inputs to the program. The program produces many outputs for a single input configuration, but our example uses one of these outputs, the pressure in one of the three wells in the reservoir, at a single time point. A Latin hypercube design of 180 points was generated in the 10-dimensional input space, and both the complex code and the simple code were run at each of the 180 points. We used as data for our analysis the output of the complex code at just 7 points, and the output of the simpler code at these 7 points plus another 38, making 45 in all. The performance of our method was then evaluated by using it to predict the output of the complex code at the remaining 135 points, for which we had the known values.
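Before turning to the analysis of these data, the following brief synthetic sketch illustrates the autoregressive structure (3.2); the cheap code $\eta_1(\cdot)$, the discrepancy $\delta(\cdot)$ and the value of $\rho$ used here are invented for illustration and are unrelated to the oil reservoir simulator.

```python
# Sketch: synthetic draws from the two-level structure eta_2(x) = rho*eta_1(x) + delta(x),
# with eta_1(.) and delta(.) independent Gaussian processes.
import numpy as np

rng = np.random.default_rng(4)
sq_exp = lambda a, b, s2, bb: s2 * np.exp(-bb * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 50)
jit = 1e-9 * np.eye(x.size)
eta1 = rng.multivariate_normal(np.zeros(x.size), sq_exp(x, x, 1.0, 5.0) + jit)
delta = rng.multivariate_normal(np.zeros(x.size), sq_exp(x, x, 0.1, 20.0) + jit)
rho = 1.3
eta2 = rho * eta1 + delta    # complex code = scaled cheap code plus an independent discrepancy
```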


Our model used the structure (3.2), with $\eta_1(\cdot)$ and $\delta(\cdot)$ separately having Gaussian process hierarchical priors of the form given by (1.1), (1.2) and (1.5). We set $h(x) = (1)$ in each case, since we had no prior knowledge of how either $\eta_1(\cdot)$ or $\eta_2(\cdot)$ would respond to the inputs. The hyperparameters are therefore $\beta_1$ and $\beta_\delta$ (both scalar coefficients), $\sigma_1^2$ and $\sigma_\delta^2$ (variances), $\rho$ (the autocorrelation coefficient), $b_1$ and $b_\delta$ (roughness parameters), seven in all. The posterior distribution of $\eta_1(\cdot)$ and $\delta(\cdot)$ is easily seen to be a Gaussian process conditional on $\sigma_1^2$, $\sigma_\delta^2$, $\rho$, $b_1$ and $b_\delta$. For this example we estimated all these five hyperparameters from their joint posterior mode and then proceeded as if these were fixed values, although we are also working towards implementing a more fully Bayesian analysis. Full details of the theory and this practical application are given in Kennedy and O'Hagan (1998a).

We computed the root mean squared error (RMSE) of prediction of the 135 values of $\eta_2(x)$ for a number of different prediction methods.

(a) First, we can ignore the simple simulator $\eta_1(\cdot)$ and apply the Bayesian method of Section 1 to just the 7 observed $\eta_2(x)$ values. The method yields $\mathrm{RMSE}(\hat{\eta}_2) = 51.3$.

(b) Using also the 45 $\eta_1(x_i)$ values and fitting the full model for two-level codes, we estimate $\eta_2(x)$ by $\hat{\rho}\,\hat{\eta}_1(x) + \hat{\delta}(x)$. The accuracy improves substantially to $\mathrm{RMSE}(\hat{\rho}\hat{\eta}_1 + \hat{\delta}) = 32.4$. This improvement is, of course, achieved at the cost of obtaining the 45 runs of the simple model $\eta_1(\cdot)$. On the assumption that these are very quick and cheap to do, the gain in accuracy has been made very cheaply, but it is interesting to ask how many more runs of the complex code would be needed to obtain the same RMSE using method (a). We found that an extra 8 runs were needed before method (a) gave a lower RMSE than method (b). Since for this particular computer program approximately 36 runs of the simple code can be made in the time it takes to do one run of the complex code, we see that it is definitely worth using the $\eta_1(x_i)$ values and method (b).

The original idea of the simple code was as a quick approximation to the more complex code. If we run the simple code $\eta_1(\cdot)$ at each input $x$ for which we need to predict $\eta_2(x)$, we have the following methods.

(c) The simplest, non-Bayesian, method is just to estimate $\eta_2(x)$ by $\eta_1(x)$, with resulting $\mathrm{RMSE}(\eta_1) = 266.5$. It is clear that in the case of this computer model the simple code alone is a poor approximation. Since it requires no runs at all of the complex code, this method is much cheaper than either (a) or (b), but its accuracy would be unacceptable.

(d) Finally, we fit the full model to the 7 + 45 training observations. We then use the resulting estimate $\hat{\rho}$ and the posterior mean $\hat{\delta}(\cdot)$ to adjust the simple model output $\eta_1(x)$ at each $x$. Then $\mathrm{RMSE}(\hat{\rho}\eta_1 + \hat{\delta}) = 29.9$.

Figure 4 clearly illustrates the difference between methods (c) and (d). The diagonal line in this figure represents perfect prediction, where the estimated value always equals the true value. The simple code $\eta_1(\cdot)$ systematically overestimates, but this is very effectively corrected by $\hat{\rho}\,\eta_1(\cdot) + \hat{\delta}(\cdot)$. It is clear from these data how a simple regression-like adjustment would correct $\eta_1(\cdot)$, but remember that the data plotted here are the 135 predictions, not the data on which the model was fitted. It is possible to create a non-Bayesian regression estimator, but this is based only on the 7 pairs $(\eta_1(x_i), \eta_2(x_i))$ for which both the simple and complex model outputs were available in the original data. This estimator yields $\mathrm{RMSE}(\hat{\beta}_1 + \hat{\beta}_2 \eta_1) = 41.9$. This clearly is a big improvement on method (c), but the Bayesian method (d) is better still.
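For completeness, a trivial sketch of the hold-out comparison used above is shown below; the prediction arrays are hypothetical placeholders for the outputs of methods (a) to (d), evaluated at the 135 reserved complex-code runs.

```python
# Sketch: root mean squared error of a predictor over the reserved complex-code runs.
import numpy as np

def rmse(pred, actual):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

# e.g. rmse(method_b_predictions, eta2_holdout) would give 32.4 in the example above
```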


Figure 4. Predicted and actual values of $\eta_2(\cdot)$ using $\hat{\rho}\,\hat{\eta}_1(\cdot) + \hat{\delta}(\cdot)$ (+), and using the fast code $\eta_1(\cdot)$ ($\cdot$)

4. SENSITIVITY ANALYSIS

Computer codes can often have large numbers of inputs, and it may be important to identify those which are most influential. The method described in Sacks et al. (1989) can be used to assess how the various inputs affect the output. This is known as sensitivity analysis. For example, the main effect of a single input can be estimated by averaging the posterior mean of $\eta_s(x)$ over the remaining inputs $x^{(l)}$. An important application of this technique, as demonstrated by Welch et al. (1992), is to identify a subset of active factors.

Suppose that after a possible reparameterisation the inputs are distributed as $x \sim N(0, vI_p)$, where $I_p$ is the $p \times p$ identity matrix. It follows that the distribution of $x^{(l)}$, which we write as $G_l(x^{(l)})$, is $N(0, vI_{p-1})$. Sacks et al. (1989) integrate the posterior mean of the code with respect to a uniform distribution over the input space. We instead use the prior input distribution. The main effect for input $l$ as a function of $x_l$ is defined by
$$ z_l(x_l) = \int_{\mathcal{X}^{(l)}} \eta_s(x)\, dG_l(x^{(l)}) - M, \qquad (4.1) $$
where the integration is over the space of all $x^{(l)}$ and $M$ is the uncertainty distribution mean defined in (2.1). The posterior mean of the main effect is obtained by replacing $\eta_s(x)$ in (4.1) by the posterior mean function $\hat{\eta}_s(x)$. The variance of the main effect $z_l(\cdot)$ at $x_l$ is given by
$$ \mathrm{Var}\{z_l(x_l)\} = \mathrm{Var}\Bigl(\int_{\mathcal{X}^{(l)}} \eta_s(x)\, dG_l(x^{(l)})\Bigr) + \mathrm{Var}(M) - 2\,\mathrm{Cov}\Bigl(M,\, \int_{\mathcal{X}^{(l)}} \eta_s(x)\, dG_l(x^{(l)})\Bigr). $$
Interaction effects are estimated in a similar way. The interaction effect for inputs $x_k$ and $x_l$ is estimated by
$$ \hat{z}_{k,l}(x_k, x_l) = \int_{\mathcal{X}^{(k,l)}} \hat{\eta}_s(x)\, dG_{k,l}(x^{(k,l)}) - \hat{z}_k(x_k) - \hat{z}_l(x_l) - M, $$


where $G_{k,l}(x^{(k,l)})$ is a $N(0, vI_{p-2})$ distribution.

To illustrate these ideas, we again use the 2-level oil reservoir data, and our objective is to study sensitivity of the top level code to its inputs. Prior beliefs about the 10 inputs were elicited from experts (Craig et al., 1996) and the input space was transformed in such a way that $x \sim N(0, 0.0651 I_{10})$. In Figure 5 we plot the main effects for inputs $x_3$, $x_7$ and $x_{10}$, together with bounds of $\pm 1$ standard deviation. Notice how the output seems to be insensitive to $x_{10}$. Figure 5 also includes a plot of the interaction effect between inputs $x_3$ and $x_7$.
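A Monte Carlo sketch of the estimated main effect (4.1) is given below: fix input $l$ at each grid value, average a stand-in for the posterior mean $\hat{\eta}_s(\cdot)$ over the remaining inputs drawn from their prior, and subtract the overall mean. Here eta_hat, the grid and the prior variance $v$ are hypothetical placeholders, not quantities from the oil reservoir analysis.

```python
# Sketch: Monte Carlo estimate of the main effect of input l under x ~ N(0, v I_p).
import numpy as np

def main_effect(eta_hat, l, xl_grid, p, v, n_mc=5000, seed=0):
    """Estimate z_l(x_l) on a grid of values of input l, using an emulator mean eta_hat."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, np.sqrt(v), size=(n_mc, p))   # draws of the full input vector
    M_hat = eta_hat(X).mean()                          # overall uncertainty-distribution mean
    effects = []
    for xl in xl_grid:
        Xl = X.copy()
        Xl[:, l] = xl                                  # fix input l, average over the rest
        effects.append(eta_hat(Xl).mean() - M_hat)
    return np.array(effects)
```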
Figure 5. Estimated main effects of $x_3$, $x_7$ and $x_{10}$, $\pm 1$ standard deviation, and the estimated interaction effect for $x_3$ and $x_7$

5. MODEL INADEQUACY AND CALIBRATION

5.1. Reality

The computer model $\eta(\cdot)$ is representing some real phenomenon, which we might denote by $\zeta(\cdot)$. We have been concentrating on inference about $\eta(\cdot)$ when in fact we are usually more interested in inference about $\zeta(\cdot)$. In the absence of any information about how $\zeta(\cdot)$ differs from $\eta(\cdot)$, it is natural to focus on $\eta(\cdot)$, but if we have some observations of the real phenomenon we should use them to learn about any inadequacy in the computer model.

There is a clear parallel with multi-level codes, in that we can regard $\zeta(\cdot)$ as the ultimate, highest level, code. The model $\eta(\cdot)$ will have been created because, even though it may be highly computer intensive and expensive to run, it is still much cheaper to compute $\eta(x)$ than to `compute' $\zeta(x)$. We can hope to use the techniques developed for multi-level codes, to make use of a small sample of observations of reality, the `complex' code, together with a larger sample of runs of $\eta(\cdot)$, the `simple' code, in order to estimate $\zeta(\cdot)$. However, some new considerations arise.


The first is that $\zeta(\cdot)$ can usually only be `computed' subject to observation error. If we can assume that observation errors are normally distributed, then only a simple modification of the theory is needed. The main complication is that the error variance becomes another hyperparameter to be estimated.

A more serious problem is that we often do not know the values of all the input variables when we observe the real phenomenon. Consider for example a model describing how rainfall in a valley or river catchment is first absorbed by the soil and then gradually flows down the hills to the river. The model is to be used to predict the effect of rainfall on river levels. We can observe rainfall events and measure the river level, and this provides real data, but consider the model inputs. Some will describe the geography in order to identify gradients, and these inputs will be known. Others will describe the quantity of rain falling, its duration etc., and these will also be known for the observed data. However, still other inputs will describe the saturation and absorbency of the soil and underlying rocks, which will vary over the catchment and will not usually be known. It is necessary, therefore, to estimate the values of unknown inputs, a process known as calibrating the model. The problem addressed by Craig et al. (1996) in relation to the oil reservoir model was a calibration problem.

5.2. Calibration

We write the input vector $x$ in the form $x = (z, \theta)$, where $\theta$ represents inputs which characterise the specific application of the computer model. We refer to $\theta$ as the parameters of the model application. The remaining inputs $z$ describe features which vary within the specific application. In the context of the river catchment model, the parameters $\theta$ describe the characteristics of the particular catchment being modelled, including the unknown parameters like absorbencies, while $z$ might describe magnitudes and durations of rainfall events. We wish to estimate $\theta$ to calibrate the model for the particular catchment, and then use $\eta(z, \theta)$ to predict river flows given specified rainfall inputs $z$.

Data are of two types. We have observed data, which, allowing for a relationship of the form (3.2) to represent model inadequacy, and also for observation error, we can write as
$$ a_i = \zeta(z_i) + e_i = \rho\, \eta(z_i, \theta) + \delta(z_i) + e_i, \qquad i = 1, 2, \ldots, n, $$
where the $e_i$ are normally distributed independent observation errors, with zero mean and variance $\lambda$; $\delta(\cdot)$ is the model inadequacy function and $\rho$ is the regression parameter. We know the inputs $z_1, z_2, \ldots, z_n$ but not the model application parameters $\theta$; $\theta$ has a prior distribution $G_\theta$.

The second type of data is from runs of the computer code,
$$ y_j = \eta(z_j, \theta_j), \qquad j = 1, 2, \ldots, m. $$
Both the $z_j$ and $\theta_j$ values are known for these data, of course. Modelling $\eta(\cdot,\cdot)$ and $\delta(\cdot)$ as in (1.1) to (1.5), with hyperparameters $\beta_\eta$, $\beta_\delta$, $\sigma_\eta^2$, $\sigma_\delta^2$, $b_\eta$, $b_\delta$, the posterior distribution of $\eta(\cdot,\cdot)$ and $\delta(\cdot)$ given $\sigma_\eta^2$, $\sigma_\delta^2$, $b_\eta$, $b_\delta$, $\rho$, $\lambda$ and $\theta$ is a Gaussian process. If $G_\theta$ is a normal prior distribution for $\theta$, then it is possible to integrate out $\theta$ analytically and the distribution given $\sigma_\eta^2$, $\sigma_\delta^2$, $b_\eta$, $b_\delta$, $\rho$ and $\lambda$ is still a Gaussian process. Estimating these remaining hyperparameters by their posterior joint mode, it is then straightforward to predict the physical system's true state $\zeta(z)$ for given inputs $z$, or to conduct uncertainty analysis for random $z$. For details, see Kennedy and O'Hagan (1998b).
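The following sketch shows the joint covariance structure implied by this calibration model for a fixed value of $\theta$ and fixed hyperparameters. Everything here is illustrative: the squared-exponential correlations, the argument names and the way the hyperparameters are passed are assumptions, and the analysis described above additionally integrates $\theta$ against its normal prior rather than fixing it (see Kennedy and O'Hagan, 1998b).

```python
# Sketch: covariance of (a_1..a_n, y_1..y_m) under the calibration model, given theta
# and the hyperparameters (regression means are ignored in this illustration).
import numpy as np

def w(a, b, roughness):
    """Squared-exponential correlation between rows of a and b."""
    d2 = (a[:, None, :] - b[None, :, :]) ** 2
    return np.exp(-(d2 * roughness).sum(axis=-1))

def joint_cov(z_obs, theta, z_code, t_code, rho, s2_eta, s2_del, lam, b_eta, b_del):
    x_obs = np.hstack([z_obs, np.tile(theta, (z_obs.shape[0], 1))])   # (z_i, theta)
    x_code = np.hstack([z_code, t_code])                               # (z_j, theta_j)
    n, m = x_obs.shape[0], x_code.shape[0]
    C = np.empty((n + m, n + m))
    C[:n, :n] = rho**2 * s2_eta * w(x_obs, x_obs, b_eta) \
                + s2_del * w(z_obs, z_obs, b_del) + lam * np.eye(n)
    C[:n, n:] = rho * s2_eta * w(x_obs, x_code, b_eta)
    C[n:, :n] = C[:n, n:].T
    C[n:, n:] = s2_eta * w(x_code, x_code, b_eta)
    return C
```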


The posterior distribution of $\theta$ may be of interest. It is analytically intractable because $\theta$ enters into the likelihood in a complex way through the covariance function of the $\eta(\cdot,\cdot)$ process. However, it may be explored numerically, which is feasible if the dimension of $\theta$ is not too large.

5.3. Nuclear accident example

We consider data arising from an exercise in early assessment of the consequences of a nuclear accident. In the event of any real emergency involving the escape of radioactive material into the environment, the National Radiological Protection Board (NRPB) needs to be able to assess consequences and provide advice as quickly as possible. Our data are from one of a series of exercises which are used periodically to test the NRPB's procedures. An escape of radionuclides into the atmosphere is presumed to have occurred at a known location and time. Information available to NRPB includes the wind direction during the release of radiation. Wind direction changed twice. The key unknown parameters $\theta_1$, $\theta_2$ and $\theta_3$ are the quantities of radioactive material released during each of the three wind phases. A simple Gaussian plume model is used to describe the atmospheric dispersion and deposition of the radionuclides. The model is described in Clarke (1979) and Jones (1981). More sophisticated models are available but require more computation and demand more parameters which would be unknown. In the early stages of an emergency, NRPB use the simple model because it is necessary to respond very quickly as data come in.
Figure 6. Locations of the observed physical deposition values

These data are in the form of radiation measurements at different geographical locations. The geographical coordinates are $z$, the other variable inputs of the model. Figures 6 and 7 show the 115 design points and measurements which became available to NRPB during the exercise. The first 20 are shown as `D', and these will comprise our calibration data $a_1, a_2, \ldots, a_{20}$. Notice that these are not particularly well placed.

Figure 7. Log of observed physical deposition values

Four of these data are the four points furthest from the area of greatest radiation deposition, and an area of high deposition at low values of the $z_1$ coordinate is not represented in these data. This is not unusual, and in practice one must use whatever early data are available, but it means that we cannot expect to predict the actual deposition pattern accurately from these data.

Traditional calibration would require computation of $\eta(z_i, \theta)$ for the $i = 1, 2, \ldots, 20$ points and a variety of $\theta$ values until a best-fitting $\hat{\theta}$ is found. Measuring fit by sum of squared errors, we find $\hat{\theta} = (31, 33, 36)$. The root mean squared error is 2.30. Prediction of deposition at other locations $z$ would then be by $\eta(z, \hat{\theta})$. Using the remaining 95 observations which became available during the exercise to assess the accuracy of this method, we obtain a root mean squared prediction error of 4.03 for these 95 observations. Notice that this is much higher than the fit of the model to the first 20 observations would have led us to expect, which is primarily due to model inadequacy.

Applying our Bayesian approach, we used a relatively weak normal prior distribution for $\theta$, centred around the best-fitting $\hat{\theta}$ but with variances of 10 in each direction. The posterior mean vector and variance matrix were found to be

$$ \begin{pmatrix} 31.45 \\ 32.31 \\ 35.31 \end{pmatrix}, \qquad \begin{pmatrix} 4.928 & 0.6913 & 0.9410 \\ 0.6913 & 5.406 & -0.6596 \\ 0.9410 & -0.6596 & 6.674 \end{pmatrix} $$

based on just four different $\theta$ values. That is, the data $y_j$ comprise 80 runs of the code $\eta(z_i, \theta_k)$ for $i = 1, 2, \ldots, 20$ and $k = 1, 2, 3, 4$. Figure 8 plots the predicted values, the posterior mean of $\zeta(z)$, against true values for each of the 95 remaining subsequent observations. The root mean squared error is 3.75.
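A sketch of the `traditional' calibration step described above is given below: a grid search over $\theta = (\theta_1, \theta_2, \theta_3)$ minimising the sum of squared errors at the first 20 observation points. Here plume_model is a hypothetical stand-in for the Gaussian plume code $\eta(z, \theta)$, and the search grid is an arbitrary assumption.

```python
# Sketch: least-squares calibration of theta by exhaustive grid search.
import itertools
import numpy as np

def calibrate_ls(plume_model, z_obs, a_obs, grid_1d):
    """Return the theta = (theta1, theta2, theta3) minimising the sum of squared errors."""
    best_theta, best_sse = None, np.inf
    for theta in itertools.product(grid_1d, repeat=3):
        preds = np.array([plume_model(z, theta) for z in z_obs])
        sse = float(np.sum((preds - a_obs) ** 2))
        if sse < best_sse:
            best_theta, best_sse = theta, sse
    return best_theta, np.sqrt(best_sse / len(a_obs))   # best theta and its RMSE
```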


15

10 actual 5

5 fitted 10 15

-5

Figure 8. Predicted deposition plotted against actual deposition

6. CONCLUDING REMARKS

We have presented a number of separate techniques applicable to complex computer models. In any given problem, we may need to combine several of these tools. For instance, when we have multi-level codes, in addition to estimating the highest level code output $\eta_s(\cdot)$ (or the ultimate highest level $\zeta(\cdot)$), we may wish to apply sensitivity analysis and/or uncertainty analysis. It is straightforward to combine these tools; see for example Kennedy and O'Hagan (1998a). We believe that the methods presented here already provide a powerful toolbox, but there are several areas for future research.

1. We have referred to some literature on design of computer experiments, but it is clear that new problems of design are posed by some of these techniques. Design for sensitivity analysis of a model with large numbers of inputs needs to be sparse compared with designs for estimating $\eta(\cdot)$ accurately, or for uncertainty analysis. With multi-level codes, a design needs to specify sets $X_1, X_2, \ldots, X_s$ of design points for the different levels, and the optimal sizes of these sets will clearly depend on run times for the different levels.

2. Analysis conditional on hyperparameters is quite tractable, but work is needed to develop computationally feasible methods for full Bayesian analysis. In a context where each computer run demands large amounts of computer time, it is acceptable to use computationally intensive methods (such as Markov chain Monte Carlo) to analyse the data, but there will still be a limit to the kind of computation which is practicable.

3. We have treated the case of a single output variable throughout this paper. There is one area where it will be particularly important to recognise the multivariate nature of the output of most models, and that is calibration. To make inferences about the unknown inputs it is not sensible to analyse outputs separately, because they will yield different estimates of those inputs. Analysis of all the outputs simultaneously


will be required for an effective calibration method.

4. Gaussian process models, particularly with the prior covariance structures used here, are appropriate essentially for smooth $\eta(\cdot)$. New extensions to the theory may be needed to deal with functions having jumps and/or singularities.

ACKNOWLEDGMENTS

The research on multi-level codes, model uncertainty and calibration has been supported by research grant GR/K54557 from the Engineering and Physical Sciences Research Council. We also gratefully acknowledge financial help from the National Radiological Protection Board, both as a contribution to that research grant and, through a CASE studentship, in connection with the work on uncertainty analysis. We thank Peter Craig, his colleagues at the University of Durham and Scientific Software Intercomp for providing us with the data for the oil reservoir example. We thank Richard Haylock, Neil Higgins, Tom Charnock and their colleagues at the National Radiological Protection Board for providing us with the data and advice on the iodine model and the radiation deposition example.
REFERENCES

Bates, R. A., Buck, R. J., Riccomagno, E. and Wynn, H. P. (1996). Experimental Design and Observation for Large Systems. J. Roy. Statist. Soc. B 58, 77–94.

Clarke, R. H. (1979). The First Report of a Working Group on Atmospheric Dispersion: A Model for Short and Medium Range Dispersion of Radionuclides Released to the Atmosphere. Harwell, NRPB-R91. London: HMSO.

Craig, P. S., Goldstein, M., Seheult, A. H. and Smith, J. A. (1996). Bayes Linear Strategies for Matching Hydrocarbon Reservoir History. Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford: University Press, 69–95.

Currin, C., Mitchell, T., Morris, M. and Ylvisaker, D. (1991). Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments. J. Amer. Statist. Assoc. 86, 953–963.

Dunning, D. E., Schwarz, J. R. and Schwarz, G. (1981). Variability of Human Thyroid Characteristics and Estimates of Dose from Ingested 131I. Health Physics 40.

Feller, W. (1966). An Introduction to Probability Theory and its Applications, Vol. II. New York: Wiley.

Haylock, R. and O'Hagan, A. (1996). On Inference for Outputs of Computationally Expensive Algorithms with Uncertainty on the Inputs. Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford: University Press, 629–637.

International Commission on Radiological Protection (1993). Age-dependent Doses to Members of the Public from Intake of Radionuclides: Part 2, Ingestion Dose Coefficients. ICRP Publication 67. Oxford: Pergamon Press.

Jones, J. A. (1981). The Second Report of a Working Group on Atmospheric Dispersion: A Procedure to Include Deposition in the Model for Short and Medium Range Atmospheric Dispersion of Radionuclides. Harwell, NRPB-R122. London: HMSO.

Kennedy, M. C. and O'Hagan, A. (1998a). Predicting the Output from a Complex Computer Code when Fast Approximations are Available. Tech. Rep. 98-09, Nottingham Statistics Group.

Kennedy, M. C. and O'Hagan, A. (1998b). Bayesian Calibration of Complex Computer Models. Tech. Rep. 98-10, Nottingham Statistics Group.

Morris, M. D. and Mitchell, T. J. (1995). Exploratory Designs for Computational Experiments. J. Statist. Planning and Inference 43, 381–402.

Morris, M. D., Mitchell, T. J. and Ylvisaker, D. (1993). Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction. Technometrics 35, 243–255.

Oakley, J. E. and O'Hagan, A. (1998). Bayesian Inference for the Uncertainty Distribution. Tech. Rep. 98-11, Nottingham Statistics Group.



O'Hagan, A. (1992). Some Bayesian Numerical Analysis. Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford: University Press, 345–363 (with discussion).

O'Hagan, A. and Haylock, R. (1997). Bayesian Uncertainty Analysis and Radiological Protection. Statistics for the Environment 3: Pollution Assessment and Control (V. Barnett and K. F. Turkman, eds.). Chichester: Wiley, 109–128.

Sacks, J., Welch, W. J., Mitchell, T. J. and Wynn, H. P. (1989). Design and Analysis of Computer Experiments. Statist. Sci. 4, 409–435 (with discussion).

Sacks, J., Schiller, S. B. and Welch, W. J. (1989). Designs for Computer Experiments. Technometrics 31, 41–47.

Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J. and Morris, M. D. (1992). Screening, Predicting, and Computer Experiments. Technometrics 34, 15–25.
