Article info
Article history:
Received 24 August 2014
Received in revised form 2 January 2015
Accepted 4 January 2015
Available online 12 January 2015
Keywords:
Inverse modeling
Functional data analysis
Canonical correlation analysis
Groundwater
Reservoir modeling
Abstract
Inverse modeling is widely used to assist with forecasting problems in the subsurface. However, full inverse modeling can be time-consuming, requiring iteration over a high-dimensional parameter space with computationally expensive forward models and complex spatial priors. In this paper, we investigate a prediction-focused approach (PFA) that aims at building a statistical relationship between data variables and forecast variables, avoiding the inversion of model parameters altogether. The statistical relationship is built by first applying the forward model related to the data variables and the forward model related to the prediction variables on a limited set of spatial prior model realizations, typically generated through geostatistical methods. The relationship observed between data and prediction is highly non-linear for many forecasting problems in the subsurface. In this paper we propose a Canonical Functional Component Analysis (CFCA) to map the data and forecast variables into a low-dimensional space where, if successful, the relationship is linear. CFCA consists of (1) functional principal component analysis (FPCA) for dimension reduction of time-series data and (2) canonical correlation analysis (CCA), the latter aiming to establish a linear relationship between data and forecast components. If such mapping is successful, then we illustrate with several cases that (1) simple regression techniques within a multi-Gaussian framework can be used to directly quantify uncertainty on the forecast without any model inversion and that (2) such uncertainty is a good approximation of the uncertainty obtained from full posterior sampling with rejection sampling.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Inverse modeling with dynamic data such as piezometric head (pressure), tracer or transport data has been an active area of research, as evidenced by several review papers in both oil/gas and groundwater research [1–4]. Essentially, two main factors make this a difficult problem to solve. First, the relationship between model parameters and the data is complex and non-linear, and often requires the numerical solution of partial differential equations that are CPU demanding, limiting in practice the number of forward model runs that can be done. Secondly, the ill-posed nature of the problem, due to the non-linearity and incompleteness of the dynamic data, requires the formulation of a 3D spatial prior model
based on other knowledge of the subsurface geological heterogeneity. The latter may include elements of structural uncertainty (layering and faults), lithofacies and petrophysical properties (porosity,
hydraulic conductivity) from geological and geophysical data
http://dx.doi.org/10.1016/j.advwatres.2015.01.002
sources. Ignoring such prior information may lead to inverse solutions that are geologically unrealistic and have limited forecasting
ability. In addition, modelers need to be mindful of generating inverse solutions that span a realistic range of uncertainty [5,6]. The latter requires formulating the inverse problem as a sampling problem by adhering to certain theories that combine likelihood and prior, such as Bayes' theory or, perhaps more generally, Tarantola's conjunction of information models.
Regardless of the methodologies being used, the end-goal of
inverse modeling is not necessarily the inverted subsurface models
or parameters themselves. Instead, one often requires performing
forecasts using the inverse solutions. Such forecasts could be the
future evolution of an existing plume of contaminants or the arrival
time at certain wells. The forecasting model itself may require a forward numerical model different from the forward model
establishing the data variables. In a recent contribution [7], it was recognized that time-consuming full-fledged inverse modeling may not always be required. In their work, a direct, statistical relationship was built between the data variables and the forecast (prediction) variables using non-linear principal component analysis (NLPCA). The latter required the generation of only a few subsurface models and accompanying forward model runs (200 in their
case) and did not require any iterative inverse modeling. For the
case studied their method provided the same posterior uncertainty
in the forecast as the full sampling technique (rejection sampling).
However, their approach, relying on non-linear principal component analysis, required reducing the dimension of both data and prediction variables to a very low-dimensional space (1–2D). This may not be realistically feasible in all practical cases. In addition, their method called upon sampling the non-linear low-dimensional relationship between data and prediction variables using a Metropolis sampler, which may be difficult to tune to convergence.
In this paper we extend the idea in [7] by relating data variables and prediction variables using functional data analysis (FDA). Techniques of FDA have been developed recently in the field of statistical science [8] and rely essentially on modeling a phenomenon by means of a linear combination of basis functions whose coefficients are statistically estimated from a set of observations. For example, a series of weather stations with measured temperature, moisture or atmospheric pressure constitutes time-series that vary systematically (functionally with time), but in addition have station-specific fluctuations that can be modeled using a stochastic process (statistical fluctuations). We recognize that most dynamic data in the subsurface are essentially time-series of dynamic variations observed at well locations (stations). We will use functional
data analysis to build a functional principal component space of
both the data variables and the prediction variables based on a
few forward models runs. Next, we will attempt to linearize the
relationship between the functional components of data and prediction variables using canonical analysis. In case this linearization
is statistically successful, then we propose to directly quantify
uncertainty in the forecast using linear least-squares (Gaussian
process regression) based on the observed data mapped into the
canonical functional principal component space. If this is not
successful, full non-linear inverse modeling is required. We first
illustrate our proposed approach on the same case study as
presented in [7] allowing a comparison with NLPCA and full
inverse modeling with rejection sampling. Then we apply the same
methodology to an extension of the same case.
2. Acronyms
Because of the extensive use of existing acronyms, the following
summary is provided:
PFA: Prediction Focused Analysis [7].
CCA: canonical correlation analysis [9,10].
PCA: principal component analysis [10].
NLPCA: non-linear principal component analysis [11].
MPS: Multiple Point Statistics [12].
FCA: Functional Component Analysis [13].
FDA: functional data analysis [14].
CFCA: Canonical Functional Component Analysis (this paper).
IMPALA: Improved Parallel Multiple-point Algorithm using List
Approach [15].
MaFloT: Matlab Flow and Transport [16].
3. Methodology

The proposed methodology combines Functional Component Analysis for dimension reduction and canonical correlation analysis for linearizing the relationship between two multivariate quantities.
3.1. Functional Component Analysis
To perform Functional Component Analysis (FCA), functional data analysis (FDA) [8] and principal component analysis (PCA) are performed in succession. FDA assumes that changes in any measurement of a physical variable over space or time are driven by an underlying smooth physical process [14] that can in turn be represented mathematically by a continuous and differentiable function, and that this function need not always be known explicitly in order to analyze the measurements. This assumption allows for the decomposition of any time-series measurement x(t) into a linear combination of underlying continuous functions called basis functions, forming a functional basis. Multiple functional bases, such as a sinusoidal basis, a Fourier basis, a polynomial basis, an exponential basis, a spline basis, etc., are available, and the choice between them is application driven. The spline basis has an advantage over the others in its versatility: computational ease of evaluation of the basis functions and their derivatives, as well as its flexibility. When a time series x(t) is analyzed using a spline basis of L spline functions \{\xi_1(t), \xi_2(t), \ldots, \xi_L(t)\}, it can be represented as the linear combination

x(t) = \sum_{i=1}^{L} k_{\xi,i}\, \xi_i(t)

where the FDA components k_{\xi,i} are the scalar linear combination coefficients of the spline functions \xi_i(t).
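As an illustration of this decomposition, the following sketch fits a least-squares cubic B-spline basis to a synthetic concentration-like time series using SciPy. The curve shape, knot placement and variable names are our own assumptions for illustration, not the authors' actual setup:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# Synthetic "measured concentration" time series: 30 samples over 3.5 days.
t = np.linspace(0.0, 3.5, 30)
x = np.exp(-((t - 1.5) ** 2))           # assumed smooth breakthrough-like curve

k = 3                                    # cubic B-splines
interior = np.linspace(0.0, 3.5, 8)[1:-1]
knots = np.r_[[0.0] * (k + 1), interior, [3.5] * (k + 1)]  # clamped knot vector

spl = make_lsq_spline(t, x, knots, k)    # least-squares fit of the spline basis
coeffs = spl.c                           # the FDA coefficients k_{xi,i}
x_hat = spl(t)                           # reconstruction from the basis
```

With 10 basis functions, the smooth curve is reconstructed almost exactly, illustrating why a modest spline basis suffices for dimension reduction of smooth time-series.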
PCA is a classical multivariate analysis procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [9], such that the first principal component accounts for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. When applied to vector data, PCA identifies the principal modes of variation from the eigen-vectors of the covariance matrix. Analogously, when applied to FDA component data k_{\xi,i}, PCA identifies the dominant functional modes of variation, or eigen-functions \phi_{x,i}(t). Performing PCA on FDA component data allows a time series x(t) to be represented as a linear combination of K orthonormal eigen-functions \{\phi_{x,1}(t), \phi_{x,2}(t), \ldots, \phi_{x,K}(t)\} with coefficients x^f such that

x(t) = \sum_{i=1}^{K} x_i^f\, \phi_{x,i}(t)
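The FPCA step can be sketched as plain PCA (via SVD) on the matrix of FDA coefficients; for an orthonormal functional basis the two coincide, while a non-orthonormal spline basis would additionally require weighting by the Gram matrix of the basis. The sizes and the synthetic coefficient matrix below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 200, 10                        # assumed: N curves, L spline coefficients each
# Synthetic FDA coefficient matrix with decaying column variances.
Kc = rng.normal(size=(N, L)) @ np.diag(np.linspace(2.0, 0.1, L))

Kc_centered = Kc - Kc.mean(axis=0)
U, s, Vt = np.linalg.svd(Kc_centered, full_matrices=False)

scores = U * s                        # principal-component scores x^f per curve
explained = s**2 / np.sum(s**2)       # variance fraction per eigen-function
K = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)  # components for 99%
```

The rows of `Vt` give the eigen-function coefficients in the spline basis, and truncating to the first `K` scores gives the low-dimensional representation used throughout the paper.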
Acronym | Purpose
FDA | Represents a set of measurements in space or time as a linear combination of continuous and differentiable mathematical functions
PCA | Represents a set of vectors as a linear combination of principal vector modes of variation
FCA | FDA + PCA; represents a set of curves as a linear combination of principal functional modes of variation
CCA | Identifies linear combinations of two vector variables such that the linear combinations are maximally correlated
CFCA | FDA + PCA + CCA; described in Section 4
x and y. The scalar values a_1^T x and b_1^T y are termed the first canonical variates of x and y respectively. The j-th pair of vectors a_j and b_j is obtained by maximizing the correlation between a_j^T x and b_j^T y under the constraint that they are uncorrelated with the preceding canonical variates a_k^T x and b_k^T y for all k < j.

The matrices R_{XX}^{-1} R_{XY} R_{YY}^{-1} R_{YX} and R_{YY}^{-1} R_{YX} R_{XX}^{-1} R_{XY} have the same positive eigenvalues. The eigen-vectors of R_{XX}^{-1} R_{XY} R_{YY}^{-1} R_{YX} yield the canonical coefficient vectors a_j, the eigen-vectors of R_{YY}^{-1} R_{YX} R_{XX}^{-1} R_{XY} yield the b_j, and the eigenvalues are the squared canonical correlations.

Fig. 1. Canonical Functional Component Analysis: comparison with other multivariate data analysis techniques (functional data analysis, functional principal component analysis, canonical correlation analysis, canonical functional analysis).

The posterior distribution on the model parameters combines the prior q_M(m) with the likelihood L(m):

\rho_M(m) = k\, q_M(m)\, L(m) \qquad (4)
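The canonical vectors a_j and b_j can be obtained from the eigen-decomposition of R_{XX}^{-1} R_{XY} R_{YY}^{-1} R_{YX}. The following NumPy sketch (function name, regularization term and test data are our assumptions) illustrates the computation:

```python
import numpy as np

def cca(X, Y, reg=1e-10):
    """Canonical vectors a_j, b_j from the covariance blocks R_XX, R_XY, R_YY."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Rxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # small ridge for stability
    Ryy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Rxy = Xc.T @ Yc / n
    # Eigen-decomposition of Rxx^{-1} Rxy Ryy^{-1} Ryx gives a_j and rho_j^2.
    M = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.T)
    vals, A = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    vals, A = vals.real[order], A.real[:, order]
    B = np.linalg.solve(Ryy, Rxy.T @ A)   # b_j proportional to Ryy^{-1} Ryx a_j
    return A, B, np.sqrt(np.clip(vals, 0.0, 1.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
Y = X @ rng.normal(size=(4, 5)) + 0.1 * rng.normal(size=(500, 5))
A, B, rho = cca(X, Y)
```

Because Y is built as a noisy linear map of X, the leading canonical correlation is close to one, and the empirical correlation of the first canonical variates matches the square root of the leading eigenvalue.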
the prior information available about the system, MPS [12] is used. The MPS training image contains the binary spatial distribution of the depositional feature with high hydraulic conductivity. A contaminant (or tracer) is injected on the left-hand edge of the earth model. The observed data describe the contaminant concentration at three different depths in the center of the system for 3.5 days after contamination. The desired forecast is the contaminant concentration at the drinking well on the right-hand edge over 12 days after contamination on the left-hand edge.
For the illustrative problem, an ensemble of N = 200 models \{m_1, m_2, \ldots, m_{200}\} was created on a 100 × 25 grid using the MPS code IMPALA [15] and the training image in Fig. 2. For each of these models m_i, using MaFloT, a flow and transport multi-scale finite volume model [16], the contaminant concentration at the three locations w_1, w_2, w_3 in the center of the grid during the 3.5 days after contamination was calculated as

d_i(t) = \begin{pmatrix} d_{w1,i}(t) \\ d_{w2,i}(t) \\ d_{w3,i}(t) \end{pmatrix} = g(m_i)

h_i(t) = r(m_i)

with one measurement every 2.88 h, resulting in a 3 × 30-dimensional variable for d and a 100-dimensional variable for h.
To analyze different situations, three different synthetic observed data-sets d_{obs,1}, d_{obs,2} and d_{obs,3} were chosen by [7] to exhibit different possible situations of contaminant arrival; they were generated in a manner analogous to the other d_i(t) to ensure that the observed data are consistent with, and fully contained in, the prior. A reference posterior distribution on h(t) is obtained using the NLPCA-based PFA [7] in each case for comparison. In [7] posterior samples for each case were generated using NLPCA, see Fig. 3. In this paper, posterior uncertainty on forecasts will be represented by quantiles, specifically the P10, P50 and P90 quantiles.
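Given an ensemble of posterior forecast curves, these quantile statistics reduce to per-time-step percentiles; a minimal sketch with synthetic curves (all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Assumed: 100 sampled posterior h(t) curves, each with 120 time steps.
posterior_curves = rng.normal(size=(100, 120))

# P10, P50, P90 computed independently at each time step across the ensemble.
p10, p50, p90 = np.percentile(posterior_curves, [10, 50, 90], axis=0)
```

Plotting the three resulting curves against time gives the uncertainty bands shown in Figs. 3, 7 and 8.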
In Prediction Focused Analysis (PFA), introduced by [7], a statistical relationship is established between the ensembles \{h_1, h_2, \ldots, h_L\} and \{d_1, d_2, \ldots, d_L\} obtained from the forward simulations r and g, respectively, of an ensemble of prior models \{m_1, m_2, \ldots, m_N\} sampled from the prior q_M(m). Directly establishing such a relationship is difficult because the dimensions of d and h may be large, although typically much smaller than that of m. Lower-dimensional representations of d and h are therefore required, denoted d^f and h^f. If a significant dimension reduction can be achieved without much loss of variability in d and h, then the joint distribution f(d^f, h^f) approximates the joint relationship between the data and the forecast variable.
In this paper, we use Functional Component Analysis (FCA) [13] and canonical correlation analysis (CCA) [9] to establish this joint relationship in a relatively small dimension (less than 100). When FCA is performed on a time series h(t), it can be represented
Fig. 2. Illustration case set-up. Top: training image containing the binary spatial distribution of the higher-conductivity spatial feature (Source: [7]). Bottom: aquifer modeled using a two-dimensional grid with 100 × 25 cells. A contaminant is injected on the left-hand edge, the contaminant concentration is measured in the center for 3.5 days after contamination, and the forecast seeks the contaminant concentration on the right-hand edge 12 days after contamination.
Fig. 3. The top row shows the contaminant arrival at the three observation wells. The red lines indicate Data Case One, the green lines Data Case Two and the blue lines Data Case Three. The grey lines are the data responses evaluated on a set of prior model realizations. The bottom row shows predictions of posterior uncertainty reduction obtained using NLPCA-based PFA using the P10-P50-P90 statistics for each data case. The grey lines indicate the P10-P50-P90 statistics of the prior. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
d(t) = \sum_{i=1}^{K} d_i^f\, \phi_{d,i}(t)

and

h(t) = \sum_{i=1}^{K} h_i^f\, \phi_{h,i}(t)

where \phi_{d,i}(t) are the eigen-functions of d(t) and \phi_{h,i}(t) are the eigen-functions of h(t). This results in N points (d^f, h^f) in functional space.
Fig. 4. Functional Component Analysis preserves proximity relationships: contaminant concentration curves simulated using MaFloT (left), FCA components of each curve in a two-dimensional Functional Component Space (center), and contamination curves reconstructed using only 2 FCA components containing 88% of the variability (right).
In Canonical Functional Component Space, the relationship between the data and forecast components is modeled as linear:

d^c = G h^c \qquad (10)

The likelihood of the canonical forecast components h^c given the observed canonical data components d^c_{obs} is then Gaussian:

L(h^c) \propto \exp\left( -\frac{1}{2} \left( G h^c - d^c_{obs} \right)^T C_{d^c}^{-1} \left( G h^c - d^c_{obs} \right) \right) \qquad (11)

Combining this likelihood with the Gaussian prior on h^c (mean \bar{h}^c, covariance C_H) by linear Gaussian regression yields the posterior mean and covariance

\tilde{h}^c = \left( G^T C_{d^c}^{-1} G + C_H^{-1} \right)^{-1} \left( G^T C_{d^c}^{-1} d^c_{obs} + C_H^{-1} \bar{h}^c \right) \qquad (12)

\tilde{C}_H = \left( G^T C_{d^c}^{-1} G + C_H^{-1} \right)^{-1} \qquad (13)

In practice, the linear model

d^c = G h^c \qquad (14)

is estimated from the ensemble by least squares over the N prior realizations:

G = \arg\min_G \frac{1}{N} \sum_{i=1}^{N} \left\| d_i^c - G h_i^c \right\|^2 \qquad (15)

and the residual (modeling error) covariance is estimated as

C_T = \frac{1}{N} D_{diff} D_{diff}^T \qquad (16)

where the columns of D_{diff} are the residuals

d_{diff,i} = d_i^c - G h_i^c \qquad (17)

Fig. 5. Comparison of relationships between data and prediction in functional space and Canonical Functional Space.
If linearizing the relationship between (d^c, h^c) using canonical correlation analysis is unsuccessful, the error covariance C_T will exceed the C_{d^c} covariance. In that case, instead of linear Gaussian regression, explicit inverse modeling of m might be required.
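The linear Gaussian regression update for the posterior mean and covariance described above can be sketched directly; the scalar check at the end uses assumed unit prior and noise variances:

```python
import numpy as np

def linear_gaussian_posterior(G, d_obs, C_d, h_prior, C_H):
    """Posterior mean/covariance of h^c given d_obs^c = G h^c + noise."""
    Cd_inv = np.linalg.inv(C_d)
    CH_inv = np.linalg.inv(C_H)
    C_post = np.linalg.inv(G.T @ Cd_inv @ G + CH_inv)
    h_post = C_post @ (G.T @ Cd_inv @ d_obs + CH_inv @ h_prior)
    return h_post, C_post

# Scalar sanity check: prior N(0, 1), unit noise variance, observation 2.0.
h_post, C_post = linear_gaussian_posterior(
    np.array([[1.0]]), np.array([2.0]), np.eye(1), np.array([0.0]), np.eye(1))
```

In the scalar case the posterior mean sits halfway between the prior mean and the observation, with the variance halved, as expected from combining two unit-variance Gaussians.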
Since a Gaussian distribution is completely defined by its mean and covariance, an ensemble \{h^c_1, h^c_2, \ldots, h^c_M\} of M samples of the posterior on h^c can be drawn using \tilde{h}^c and \tilde{C}_H. Each of these samples can be back-transformed into a time series by first calculating functional posterior samples from the canonical posterior samples, and then reconstructing the time series from the functional components.
functional space (three d(t) time series, each with four d^f components). Each h_i(t) was decomposed using a 2nd-order 5-spline basis, resulting in five eigen-function components containing 99.99% of the variability. The choice of basis (4th-order 6-spline / 2nd-order 5-spline) was guided by minimizing the maximum RMS error between simulated and reconstructed curves in an ensemble. This minimization provides the least number of splines needed to achieve the maximum dimension reduction. Using canonical correlation analysis, the functional data-forecast relationships were linearized into a 10-dimensional (d^c, h^c) Canonical Functional Component Space (5-dimensional d^c plus 5-dimensional h^c). All the relevant relationship information, however, could be effectively separated into five independent Canonical Functional Component Planes. Three of these planes are seen in Fig. 6. In each of these planes, the data-forecast relationship is highly linear. Moreover, the linear correlation went from 0.60 in functional space to 0.94 in canonical functional space (Fig. 5). The successfully achieved linear relationship permits the use of linear Gaussian regression to estimate the posterior on the forecast components h^c as a Gaussian distribution. Since sampling a Gaussian posterior distribution and back-transforming those samples are computationally inexpensive linear operations, as many samples can be generated as needed to effectively quantify an estimate of the forecast uncertainty. In this case, 100 posterior h^c points were sampled and, in turn, back-transformed into 100 h(t) curves, seen in Fig. 7.
As seen in Fig. 8, the P10-P50-P90 quantile statistics of the posterior ensemble estimated using CFCA-based PFA agree very well with the statistics of the NLPCA-based PFA estimate in Data Cases One and Three in terms of uncertainty reduction. In Data Case Two, CFCA-based PFA agrees with NLPCA-based PFA in suggesting that the observed data do not contain information about the system that is relevant to the targeted forecast. As a result, inverse modeling would not provide much uncertainty reduction.

The latter result (Data Case Two) suggests that CFCA-based PFA can be used, just like NLPCA-based PFA, as a diagnostic technique to ascertain the need for explicit inverse modeling in a forecasting problem. However, because it employs linear methods, CFCA is computationally less expensive than NLPCA (no need for Monte Carlo iterations) and more straightforward (involving linear models), especially for higher-dimensional data.
Fig. 6. Canonical Functional Space for d_{obs,1}: red dots indicate Canonical Functional Component Projections (d^c, h^c), and the blue line indicates where, in each of the Canonical Functional Component Planes, d^c_{obs,1} (the projection of the data observation in low-dimensional Canonical Functional Component Space) lies. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Posterior sampling in Data Case One. Histogram of 100 samples of the posterior of the first component h^c_1 (left). Reconstructed posterior h(t) curves in blue, showing uncertainty reduction compared to the grey prior h(t) curves (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 8. Comparison of P10-P50-P90 quantile statistics of posterior estimates using CFCA- and NLPCA-based PFA for Data Cases One (left), Two (center) and Three (right). The grey lines indicate the P10-P50-P90 statistics of the prior. The dotted blue lines indicate the P10-P50-P90 statistics obtained from NLPCA-based PFA as described in [7]. The dotted red lines indicate the P10-P50-P90 statistics obtained from CFCA-based PFA. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Three depositional scenarios. The white cells indicate the presence of a depositional feature with high hydraulic conductivity. The bottom scenario is considered a priori twice as likely as the upper two scenarios.
Fig. 10. Extended case: the top row shows the d_{obs}(t) curves (center) used as reference for rejection sampling and PFA. The middle row shows the projections in Canonical Functional Space. The bottom row shows that the P10-P50-P90 statistics of the rejection sampling forecast and the CFCA-based PFA forecast estimate agree well.
Fig. 11. Robustness check: d_{obs} defined in Canonical Functional Space such that it lies near the edge of the prior (top). Reconstructed d_{obs}(t) curves (center) used as reference for rejection sampling and PFA. The P10-P50-P90 statistics of the rejection sampling forecast and the CFCA-based PFA forecast estimate do not agree (bottom).
For CFCA-based PFA, this time each d_i(t) simulated using the initial ensemble of 300 models was functionally decomposed using a basis composed of 27 tri-variate fourth-order spline functions. The choice of a large basis (more spline functions), as compared to the illustration case, ensures that as many simulated measurements as possible can be used for calibrating the basis. In this case, the simulation provides 27 data points for each d_i(t). However, this leads to 81 k_{\xi,i} coefficients for each d_i(t) if 27 uni-variate splines are used. Using 27 tri-variate splines instead reduces the dimensionality without any change in the accuracy of reconstruction, because only 27 k_{\xi,i} coefficients are required for each d_i(t). Additionally, using a tri-variate basis rather than three uni-variate bases better accounts for the statistical relationships between the three time series d_{w1,i}(t), d_{w2,i}(t) and d_{w3,i}(t). This resulted in the first 13 tri-variate eigen-function components containing 99.99% of the variability. This allows for maximum granularity in the FDA, thereby ensuring, firstly, a minimum loss of accuracy in the reconstruction of the time-series at the time of back-transformation and, secondly, preservation of the relationships between the constituent time series of d(t).

In the case of h(t), the functional basis contained 102 fourth-order splines. In this case, the simulation provides 102 data points for each h_i(t), and all these simulated measurements can be used for calibrating the basis. This yielded five eigen-functions containing 99.99% of the variability. Using canonical correlation analysis, the data-forecast relationships were linearized into independent Canonical Functional Component Planes.
Fig. 12. Trade-off between linearity and dimension reduction: canonical correlation of the data-forecast pair when dimensions are reduced using NLPCA.
Appendix B

The relationship between the physical-space data covariance C_d and the canonical-functional-space data component covariance C_{d^c} can be modeled using the linear relationships between d and d^c. In general, if Y = X A^T, the covariance between the columns of Y is C_Y = A C_X A^T; by construction of CCA, the canonical components are uncorrelated, so C_{d^c} is a diagonal matrix. If d^c = d^f A^T, then

C_{d^c} = A C_{d^f} A^T

relates the covariance of the functional components C_{d^f} to C_{d^c}. C_{d^f} is a K × K matrix, where K is the number of eigen-functions in the functional basis, such that

d(t) = \sum_{i=1}^{K} d_i^f\, \phi_{d,i}(t)

Evaluating the eigen-functions at the measurement times \{t_1, \ldots, t_M\} and collecting them in the M × K matrix \Phi gives

C_d = \Phi C_{d^f} \Phi^T

or

C_d = \Phi A^{-1} \left( A C_{d^f} A^T \right) A^{-T} \Phi^T

in terms of the desired canonical functional data component covariance. Thus the expression

C_d = \Phi A^{-1} C_{d^c} A^{-T} \Phi^T

(the matrix A^{-1} represents the Moore–Penrose inverse of A if A is not a square matrix) can be used to express the relationship between the physical-space data covariance C_d and the canonical-functional-space data component covariance C_{d^c}.
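The covariance relationships above can be checked numerically: for a square invertible A, mapping C_{d^f} to canonical space and back through \Phi reproduces C_d exactly. All matrices below are random stand-ins with assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
K, M = 5, 40                              # assumed: eigen-functions, time samples
A = rng.normal(size=(K, K))               # canonical transform (square, a.s. invertible)
Phi = rng.normal(size=(M, K))             # eigen-function evaluations at M times
S = rng.normal(size=(K, K))
C_df = S @ S.T + np.eye(K)                # some functional-component covariance (SPD)

C_dc = A @ C_df @ A.T                     # canonical-space covariance
C_d = Phi @ C_df @ Phi.T                  # physical-space covariance
A_inv = np.linalg.pinv(A)                 # Moore-Penrose inverse (handles non-square A)
C_d_back = Phi @ A_inv @ C_dc @ A_inv.T @ Phi.T
```

Up to floating-point error, `C_d_back` equals `C_d`, confirming the round-trip identity C_d = \Phi A^{-1} C_{d^c} A^{-T} \Phi^T.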
In an Earth Sciences forecasting problem, the error covariance on the observable data C_d often models the measurement error of an instrument. In a 2q-dimensional Canonical Functional Component Space, the corresponding error covariance can be modeled as

C_{d^c} = \operatorname*{argmin}_{X = \mathrm{diag}(b_i),\; b \in \mathbb{R}^q} \left\| C_d - \Phi A^{-1} X A^{-T} \Phi^T \right\|
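Because X is constrained to be diagonal, this argmin is a linear least-squares problem in the q diagonal entries b_i: the fitted matrix is a sum of rank-one terms b_i u_i u_i^T with u_i the columns of \Phi A^{-1}, so vectorizing those terms gives an ordinary design matrix. A sketch with assumed sizes and a known b to recover:

```python
import numpy as np

rng = np.random.default_rng(6)
K, M = 4, 30                                      # assumed sizes
A_inv = np.linalg.pinv(rng.normal(size=(K, K)))   # stand-in for A^{-1}
Phi = rng.normal(size=(M, K))
U = Phi @ A_inv                                   # columns u_i of Phi A^{-1}

b_true = np.array([0.5, 1.0, 1.5, 2.0])
C_d = U @ np.diag(b_true) @ U.T                   # target physical-space covariance

# argmin over diagonal X = diag(b): least squares on vectorized rank-one terms.
design = np.column_stack([np.outer(U[:, i], U[:, i]).ravel() for i in range(K)])
b_fit, *_ = np.linalg.lstsq(design, C_d.ravel(), rcond=None)
```

When C_d lies exactly in the span of the rank-one terms, as constructed here, the diagonal entries are recovered exactly; with a real instrument-error C_d the same solve gives the best diagonal approximation in the Frobenius sense.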