http://www.jenvstat.org
Jesper Rydén
Skandia, Stockholm
Department of Mathematics
Uppsala University
Abstract
The exponentiated Gumbel (EG) distribution has been proposed as a generalization
of the classical Gumbel distribution. In this paper we discuss estimation of T-year return
values for significant wave height in a case study and compare point estimates and their
uncertainties to the results given by alternative approaches using Gumbel or Generalized
Extreme Value distributions. A jackknife approach is used to investigate the sensitivity
of the parameter estimates, and various model selection criteria are employed to compare
the models. When examining Anderson-Darling distances between samples and extreme-value distributions, the EG distribution turns out to give the closest fit. However, general
recommendations on whether to use the Gumbel or the EG distribution cannot be given.
1. Introduction
A frequently occurring problem in statistics is model selection and related issues. In standard applications like regression analysis, model selection may be related to the number of
independent variables to include in a final model. In some applications of statistical extreme-value analysis, convergence to some standard extreme-value distributions is crucial. A choice
occasionally has to be made between special cases of distributions and the more general
versions. In this paper, statistical properties of a recently proposed distribution are examined
more closely and a case study is performed where comparison is made to classical distributions.
In applications of extreme-value analysis to risk analysis, computation of return levels is of
importance, often for some quantity obtained from environmental data (wind speeds, wave
heights, maximum rainfall). The 100-year return level is a value which is exceeded on average
only once per 100 years. Usually this is estimated as a quantile of some extreme-value distribution, as will be defined in Section 3.2 (see Chapter 10 in Rychlik and Rydén (2006) for
an elementary discussion). The uncertainty of the obtained point estimate of the return level
might be considerable; hence related confidence intervals are of interest. In
extreme-value analysis, it is shown that the maximum of many of the common distributions
(normal, lognormal) converges to the Gumbel distribution,
$$F(x) = \exp\left\{-\exp\left[-\frac{x-\mu}{\sigma}\right]\right\}, \qquad -\infty < x < \infty,$$
where $\sigma > 0$, $-\infty < \mu < \infty$. This is a special case of the so-called Generalized Extreme
Value (GEV) distribution and in the literature of probabilistic risk analysis, there has been
discussion about the choice of Gumbel compared to GEV. To quote Coles and Pericchi (2003),
where analysis of rainfall measurements was made: "Although standard tests may support a
reduction to the Gumbel family, this is a risky strategy." Similar conclusions are given by
Koutsoyiannis and Baloutsos (2000). The Gumbel distribution yields narrower confidence
intervals than the three-parameter GEV but also carries the risk of underestimating the return
level. Hence, the choice of distribution is not trivial.
Recently, a generalization of the Gumbel distribution, called the exponentiated Gumbel (EG)
distribution, was introduced (Nadarajah, 2006):
$$F_{EG}(x) = 1 - \left[1 - \exp\left\{-\exp\left[-\frac{x-\mu}{\sigma}\right]\right\}\right]^{\alpha}, \qquad -\infty < x < \infty, \qquad (1)$$
where $\alpha > 0$, $\sigma > 0$. Moreover, hazard-rate functions, moments, asymptotics and maximum-likelihood functions were presented. A numerical illustration was given: computation of return
levels for a single data set of annual maximum daily rainfall. A comparison was made with
the Gumbel and GEV distributions, in which the EG distribution proved to be advantageous.
In this article, we further extend the analysis of the EG. Based on real data sets, return levels
are computed. The resulting point estimates are compared with results based on Gumbel and
GEV by performing a model selection based on deviance. As the question of uncertainty of
estimates is important, we give related approximate confidence intervals for the return levels.
The datasets are observations of significant wave height and originate from two buoys in
the Pacific Ocean. The data were chosen because of the authors' interest in and experience of
modelling such quantities, cf. e.g. Rychlik, Rydén and Anderson (2009), where new estimation
methodologies are presented based on theory for crossings in stationary Gaussian processes.
Measurements of this type have often been collected for a time period of the order of some
decades; hence, when regarding annual maxima, relatively small samples result.
The paper is organized as follows: in the next section, we give a brief orientation on exponentiated distributions. In Section 3, estimation of T-year return values is discussed and
expressions for confidence intervals by the delta method are explicitly stated. The remaining
sections are devoted to data analysis: in Section 4, the EG distribution is fitted to two data sets and we discuss whether the fit is reasonable. Moreover, in Section 5, comparison
of the results for T-year return values using EG, GEV and Gumbel distributions is made,
using various model selection criteria, such as deviance and comparison of Anderson-Darling
distances.
2. On exponentiated distributions
During the first half of the nineteenth century, certain cumulative distributions were introduced by B. Gompertz and P.F. Verhulst, for instance
$$F(x; \rho, \lambda, \alpha) = \left(1 - \rho e^{-x/\lambda}\right)^{\alpha}, \qquad x > \lambda \ln \rho,$$
see Ahuja and Nash (1967). More recently, the exponentiated exponential family, with distribution function
$$F(x; \alpha, \lambda) = \left(1 - e^{-\lambda x}\right)^{\alpha}, \qquad \alpha, \lambda > 0, \quad x > 0,$$
has been extensively studied in a number of papers by Gupta and Kundu, see for instance
Gupta and Kundu (2001). The families of exponential and Weibull distributions are found
within the exponentiated exponential distribution and therefore studies were performed to
investigate asymptotic results as well as fits to data sets. More generally, distributions $F(x) = [G(x)]^{\alpha}$, where $G(x)$ is a distribution family and $\alpha > 0$, are occasionally called Lehmann
alternatives in the context of modelling of failure times (Gupta, Gupta and Gupta 1998).
However, note that the definition in Eq. (1) is rather of the form $F_{EG}(x) = 1 - [1 - G(x)]^{\alpha}$,
where $G(x)$ is the distribution function of the Gumbel distribution. In this paper, we have
kept this definition, as it was stated by Nadarajah (2006). A reparametrization is possible.
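As a small illustration of the form $F_{EG}(x) = 1 - [1 - G(x)]^{\alpha}$, the following Python sketch builds the EG distribution function from the Gumbel distribution function. The function names `gumbel_cdf` and `eg_cdf` and the argument values are chosen here for illustration only; they do not come from the paper.

```python
import math


def gumbel_cdf(x, mu, sigma):
    """Gumbel distribution function G(x) = exp(-exp(-(x - mu) / sigma))."""
    return math.exp(-math.exp(-(x - mu) / sigma))


def eg_cdf(x, mu, sigma, alpha):
    """Exponentiated Gumbel distribution function, 1 - [1 - G(x)]^alpha."""
    return 1.0 - (1.0 - gumbel_cdf(x, mu, sigma)) ** alpha


if __name__ == "__main__":
    # Illustrative (hypothetical) argument values.
    x, mu, sigma = 12.0, 11.0, 2.0
    # With alpha = 1 the EG distribution reduces to the Gumbel distribution.
    print(eg_cdf(x, mu, sigma, 1.0), gumbel_cdf(x, mu, sigma))
    # For x fixed, a larger alpha gives a larger value of the distribution function.
    print(eg_cdf(x, mu, sigma, 2.0))
```

For fixed $x$, increasing $\alpha$ shrinks $[1 - G(x)]^{\alpha}$, so the EG distribution function increases with $\alpha$ and upper quantiles decrease.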
Turning to extreme-value distributions, various forms of generalized extreme-value distributions have been proposed in the literature, e.g. a four-parameter distribution by Scarf (1992).
For a review see Kotz and Nadarajah (2000), Chapter 2.7. We find it interesting to compare the three-parameter EG distribution to another three-parameter family, the GEV
distribution, as well as the Gumbel distribution (EG with $\alpha = 1$). The general class of
exponentiated distributions is closed under maximum (Gupta and Kundu 2007), that is, if
$X_1, \ldots, X_n$ are iid random variables then the $X_i$ variables are exponentiated random variables
if and only if the maximum of $X_1, \ldots, X_n$ is an exponentiated random variable.
3. Extreme-value modelling
In this section we first give a brief review on classical extreme-value modelling and then
present the estimation framework for the exponentiated Gumbel distribution. For further
reference, consult Coles (2001) or Rychlik and Rydén (2006).
Suppose $X_1, \ldots, X_n$ is a sequence of independent and identically distributed variables, and
let $M_n = \max\{X_1, \ldots, X_n\}$. Classical extreme-value theory is concerned with the limiting
distribution of $M_n$ as $n \to \infty$, or rather of its normalized version: if there exist sequences of
constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$P\left((M_n - b_n)/a_n \le x\right) \to F(x) \quad \text{as } n \to \infty,$$
the Extremal Types Theorem states that $F$ must belong to one of three families of distributions (Gumbel, Fréchet and Weibull). These can be combined into a single family, the Generalized
Extreme Value (GEV) distribution
$$F(x) = \begin{cases} \exp\left(-\left(1 - k(x-\mu)/\sigma\right)^{1/k}\right), & k \neq 0, \\ \exp\left(-\exp\left(-(x-\mu)/\sigma\right)\right), & k = 0, \end{cases}$$
for $x > \sigma/k + \mu$ (when $k < 0$) and $x < \sigma/k + \mu$ (when $k > 0$). The Gumbel distribution is
obtained as the special case $k = 0$.
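The GEV distribution function and its support can be sketched in Python as follows. This is a minimal illustration, assuming the sign convention of the formula above (the factor $1 - k(x-\mu)/\sigma$); the function name `gev_cdf`, the parameter name `k` for the shape parameter and the argument values are choices made here, not taken from the paper.

```python
import math


def gev_cdf(x, mu, sigma, k):
    """GEV distribution function with shape k; k = 0 gives the Gumbel case."""
    z = (x - mu) / sigma
    if k == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 - k * z
    if t <= 0.0:
        # Outside the support: below the lower end point when k < 0,
        # above the upper end point when k > 0.
        return 0.0 if k < 0.0 else 1.0
    return math.exp(-t ** (1.0 / k))


if __name__ == "__main__":
    # Illustrative (hypothetical) arguments: a very small shape value
    # gives practically the same result as the Gumbel special case.
    print(gev_cdf(12.0, 11.0, 2.0, 1e-8), gev_cdf(12.0, 11.0, 2.0, 0.0))
```

Note that software packages differ in the sign convention for the shape parameter, so the sign should be checked before comparing fitted values.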
A common procedure in applications to environmental data is the method of block maxima.
The original sequence is broken up into blocks of size n, say, and the maximum observation
is extracted from each block. For the resulting sequence of iid observations, an extreme-value
distribution is fitted. Often in applications the block size is chosen to be one year, and the
goal is to estimate quantiles of the distribution of block maxima, so-called return levels. The
method of block maxima is robust and is used in codes, but much of the data is disregarded. Other
approaches exist, for instance threshold methods.
Often measurements are made at networks of stations, and a statistical problem is then to
pool the information. It is, however, outside the scope of this paper to discuss such procedures; we
focus on data collected at one individual site at a time and then use the classical approach
of block maxima.
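The extraction of block maxima with a block size of one year can be sketched as below. The function name `annual_maxima` and the input format, an iterable of (year, value) pairs, are assumptions made for this illustration, and the numbers in the usage example are made up and are not buoy data.

```python
from collections import defaultdict


def annual_maxima(records):
    """Return {year: largest value observed in that year}.

    `records` is an iterable of (year, value) pairs, for example hourly
    observations of significant wave height tagged with their year.
    """
    maxima = defaultdict(lambda: float("-inf"))
    for year, value in records:
        maxima[year] = max(maxima[year], value)
    return dict(sorted(maxima.items()))


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    observations = [(2001, 3.1), (2001, 5.4), (2002, 4.0), (2002, 2.2)]
    print(annual_maxima(observations))
```

Only one value per year is kept, which illustrates why a record of a few decades yields a small sample of maxima.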
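For a sample $x_1, \ldots, x_n$, the log-likelihood function of the EG distribution in Eq. (1) is
$$\log L(\mu, \sigma, \alpha) = n \log \alpha - n \log \sigma + (\alpha - 1) \sum_{i=1}^{n} \log\left(1 - e^{-e^{-(x_i-\mu)/\sigma}}\right) - \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma} - \sum_{i=1}^{n} e^{-(x_i-\mu)/\sigma}.$$
Moreover, the first-order derivatives of $l(\mu, \sigma, \alpha) = \log L(\mu, \sigma, \alpha)$ with respect to the three
parameters are:
$$\frac{\partial l}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^{n} \log\left(1 - e^{-e^{-(x_i-\mu)/\sigma}}\right),$$
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}\left(1 - e^{-(x_i-\mu)/\sigma}\right) + \frac{\alpha - 1}{\sigma^2} \sum_{i=1}^{n} \frac{(x_i - \mu)\, e^{-(x_i-\mu)/\sigma}\, e^{-e^{-(x_i-\mu)/\sigma}}}{1 - e^{-e^{-(x_i-\mu)/\sigma}}},$$
$$\frac{\partial l}{\partial \mu} = \frac{n}{\sigma} - \frac{1}{\sigma} \sum_{i=1}^{n} e^{-(x_i-\mu)/\sigma} + \frac{\alpha - 1}{\sigma} \sum_{i=1}^{n} \frac{e^{-(x_i-\mu)/\sigma}\, e^{-e^{-(x_i-\mu)/\sigma}}}{1 - e^{-e^{-(x_i-\mu)/\sigma}}}.$$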
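For possible implementations, the log-likelihood of the EG distribution in Eq. (1) and its first-order derivatives can be coded as in the following sketch. The function names `eg_loglik` and `eg_score`, the argument order (mu, sigma, alpha) and the small sample in the usage example are assumptions made for illustration; the sample is made up and is not one of the datasets analysed in the paper.

```python
import math


def eg_loglik(x, mu, sigma, alpha):
    """Log-likelihood of the EG distribution for the sample x."""
    n = len(x)
    total = n * math.log(alpha) - n * math.log(sigma)
    for xi in x:
        z = (xi - mu) / sigma
        u = math.exp(-z)
        # -expm1(-u) equals 1 - exp(-u), computed in a numerically stable way.
        total += (alpha - 1.0) * math.log(-math.expm1(-u)) - z - u
    return total


def eg_score(x, mu, sigma, alpha):
    """Partial derivatives of the log-likelihood, ordered (mu, sigma, alpha)."""
    n = len(x)
    d_mu = n / sigma
    d_sigma = -n / sigma
    d_alpha = n / alpha
    for xi in x:
        z = (xi - mu) / sigma
        u = math.exp(-z)
        one_minus_g = -math.expm1(-u)        # 1 - exp(-exp(-z))
        r = u * math.exp(-u) / one_minus_g   # ratio appearing in both derivatives
        d_mu += (-u + (alpha - 1.0) * r) / sigma
        d_sigma += (z * (1.0 - u) + (alpha - 1.0) * z * r) / sigma
        d_alpha += math.log(one_minus_g)
    return d_mu, d_sigma, d_alpha


if __name__ == "__main__":
    sample = [9.3, 10.7, 11.3, 8.8, 12.1]    # made-up values, illustration only
    print(eg_loglik(sample, 10.0, 1.5, 2.0))
    print(eg_score(sample, 10.0, 1.5, 2.0))
```

A numerical optimizer can then be applied to the negative of `eg_loglik`, with `eg_score` supplying the gradient for gradient-based methods.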
By using approximate normality of ML estimates, the so-called delta method can be applied
to construct approximate confidence intervals for functions of the estimated parameters, in
our case the T-year return level. (Note that in the literature on statistical extreme-value
analysis, intervals based on profile likelihood are usually preferred (see e.g. Coles and Dixon
1999), and bootstrap approaches have also been suggested. However, we consider it outside the scope
of this article to evaluate different methods for obtaining confidence intervals.)
For the convenience of the reader and possible future implementations, we give some details
of the confidence interval for the return level as obtained by the delta method:
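$$x_T \in \left(\hat{x}_T - \lambda_{p/2} D, \; \hat{x}_T + \lambda_{p/2} D\right).$$
Here $\lambda_{p/2}$ is the upper $p/2$ quantile of the standard normal distribution (1.96 for a 95 percent interval) and
$$D^2 = \nabla x_T(\hat{\mu}, \hat{\sigma}, \hat{\alpha})^T \, \hat{\Sigma} \, \nabla x_T(\hat{\mu}, \hat{\sigma}, \hat{\alpha}),$$
where $\hat{\Sigma}$ is the estimated covariance matrix of the ML estimates and the gradient vector is
$$\nabla x_T(\mu, \sigma, \alpha) = \left(\frac{\partial x_T}{\partial \mu}, \frac{\partial x_T}{\partial \sigma}, \frac{\partial x_T}{\partial \alpha}\right)^T.$$
With $x_T = \mu - \sigma \ln\left(-\ln\left(1 - T^{-1/\alpha}\right)\right)$, the solution of $F_{EG}(x_T) = 1 - 1/T$ in Eq. (1), the derivatives of the T-year return level with respect to the parameters are
$$\left(\frac{\partial x_T}{\partial \mu}, \frac{\partial x_T}{\partial \sigma}, \frac{\partial x_T}{\partial \alpha}\right) = \left(1, \; -\ln\left(-\ln\left(1 - T^{-1/\alpha}\right)\right), \; \frac{\sigma\, T^{-1/\alpha} \ln T}{\alpha^2 \left(1 - T^{-1/\alpha}\right) \ln\left(1 - T^{-1/\alpha}\right)}\right). \qquad (2)$$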
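The return level and its delta-method interval can be sketched in Python as follows, assuming that the T-year return level solves $F_{EG}(x_T) = 1 - 1/T$ in Eq. (1), so that $x_T = \mu - \sigma \ln(-\ln(1 - T^{-1/\alpha}))$. The function names and the argument order (mu, sigma, alpha) are choices made here. The covariance matrix `cov` must be supplied by the user (for instance from the fitting routine); no such matrix is reported in the paper, so none is shown.

```python
import math


def eg_return_level(T, mu, sigma, alpha):
    """T-year return level of the EG distribution."""
    w = T ** (-1.0 / alpha)
    return mu - sigma * math.log(-math.log(1.0 - w))


def eg_return_level_gradient(T, mu, sigma, alpha):
    """Partial derivatives of the return level, ordered (mu, sigma, alpha)."""
    w = T ** (-1.0 / alpha)
    h = 1.0 - w
    d_mu = 1.0
    d_sigma = -math.log(-math.log(h))
    d_alpha = sigma * w * math.log(T) / (alpha ** 2 * h * math.log(h))
    return d_mu, d_sigma, d_alpha


def delta_method_interval(T, theta, cov, quantile=1.96):
    """Approximate interval x_T -/+ quantile * D with D^2 = grad' cov grad.

    `theta` is (mu, sigma, alpha) and `cov` a 3x3 covariance matrix of the
    estimates in the same order, given as nested lists.
    """
    grad = eg_return_level_gradient(T, *theta)
    d2 = sum(grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3))
    half_width = quantile * math.sqrt(d2)
    x_T = eg_return_level(T, *theta)
    return x_T - half_width, x_T + half_width


if __name__ == "__main__":
    # Nelder-Mead estimates listed in Section 4, read here as
    # (mu, sigma, alpha) = (22.49, 7.32, 100.31).
    print(round(eg_return_level(100, 22.49, 7.32, 100.31), 1))  # 14.2
```

With the second set of Nelder-Mead values, (mu, sigma, alpha) = (11.67, 2.94, 2.19), the same function gives about 17.7, in agreement with the 100-year return values reported for the EG distribution.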
4. Fitting of data
In oceanography and ocean engineering, the quantity significant wave height (Hs) is studied.
This is defined as the average height of the highest one-third of the waves at a given
location. The data originate from buoy measurements with Hs reported hourly, calculated as the
average of the highest one-third of all of the wave heights during 20-minute sampling periods.
Assuming independence between years, the method of block maxima (annual maxima) will
be used, which is the most basic methodology for estimation of return values. It is well established
in the literature that the limiting extreme-value distribution for data of this type is
Gumbel; hence, it is of interest to also study the EG distribution.
10.70   9.30    9.87
7.00    8.80    13.04
11.30   11.00   9.79
13.60   11.90   12.26
11.70   9.20    11.52
8.20    8.71    12.92
11.70   10.10   12.78
9.10    11.20   14.23
8.40    9.56    11.21
8.80    7.20    12.47
11.80   9.80    16.32
12.70   10.80   14.65
Dataset     Method        $\hat{\alpha}$   $\hat{\sigma}$   $\hat{\mu}$
Dataset 1   Nelder-Mead   100.31           7.32             22.49
Dataset 1   BFGS          32.76            5.74             18.46
Dataset 2   Nelder-Mead   2.19             2.94             11.67
Dataset 2   BFGS          2.20             2.94             11.68
Diagnostic plots in the form of QQ plots of residuals after fitting are presented in Figures 1-2.
These plots seem to indicate a reasonable fit of the EG distribution: the dots in the QQ plots
approximately follow a straight line.
Figure 1: Dataset 1. Left: Empirical and fitted EG cdf, plotted against Hs (m). Right: QQ plot of residuals after fitting the EG distribution.
Figure 2: Dataset 2. Left: Empirical and fitted EG cdf, plotted against Hs (m). Right: QQ plot of residuals after fitting the EG distribution.
The sensitivity of the estimates to individual observations is investigated by removing one observation at a time from the sorted
sample and re-estimating $\mu$, $\sigma$ and $\alpha$. Thus sample $i$ consists of 20 observations, with the $i$th
component removed. Denote by $\hat{\theta}_{(i)}$ the $i$th estimate of a parameter $\theta$ and the sample mean of
the estimates by $\hat{\theta}_{(\cdot)}$. Then the jackknife estimate of the standard error is defined by
$$d_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\right)^2},$$
and assuming normally distributed estimates, a 95 percent confidence interval for $\theta$ can be
constructed as $(\hat{\theta} \pm 1.96\, d_{jack})$. The resulting 95% confidence intervals for $x_{100}$ are similar
to those computed before: for Dataset 1, the interval (12.6, 15.7) is found; for Dataset 2,
(12.9, 22.4).
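The jackknife standard error can be sketched generically as below. The function name `jackknife_se` and the use of the sample mean as estimator are assumptions made to keep the example short and self-contained; in the setting of this paper the estimator would instead be the ML estimate of a parameter, or of the return level, refitted on each reduced sample. The numbers are made up.

```python
import math


def jackknife_se(sample, estimator):
    """Jackknife standard error of `estimator`, a function of a list of values."""
    n = len(sample)
    # Estimate recomputed with the i-th observation left out.
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_estimate = sum(leave_one_out) / n
    squared_deviations = sum((t - mean_estimate) ** 2 for t in leave_one_out)
    return math.sqrt((n - 1) / n * squared_deviations)


def sample_mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values, illustration only
    d_jack = jackknife_se(data, sample_mean)
    estimate = sample_mean(data)
    # Approximate 95 percent interval, assuming a normally distributed estimate.
    print(estimate - 1.96 * d_jack, estimate + 1.96 * d_jack)
```

For the sample mean, the jackknife standard error coincides with the usual standard error $s/\sqrt{n}$, which gives a simple check of an implementation.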
Log-likelihood   EG         GEV        Gumbel
Dataset 1        -40.8584   -40.6890   -42.6171
Dataset 2        -46.6529   -46.6315   -46.8145
We note that GEV has the highest log-likelihood in both samples, the EG distribution has the
second highest and Gumbel the lowest. These values can be used to test the null hypothesis
of Gumbel distribution, since this is a special case of both EG ($\alpha = 1$) and GEV (shape parameter $k = 0$).
The statistics 2(LL(EG) - LL(Gumbel)) and 2(LL(GEV) - LL(Gumbel)) are approximately
chi-square distributed with one degree of freedom. The following p values were obtained from
the chi-square distribution:
Test             p value (Data 1)   p value (Data 2)
EG vs Gumbel     0.06               0.57
GEV vs Gumbel    0.05               0.54
For Dataset 1, we note from the table that the Gumbel hypothesis is rejected, or very nearly so,
at the 5 percent significance level, while for Dataset 2 the hypothesis cannot be rejected.
Alternatively, an asymptotic test for shape parameter equal to zero, proposed by Hosking,
Wallis and Wood (1985), gives that the hypothesis of Gumbel distribution is rejected for
Dataset 1 (p value 0.015).
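The likelihood-ratio p values can be reproduced with a few lines of Python, as sketched below. The function names are chosen here for illustration. The log-likelihood values are those tabulated above, taken as negative numbers (an assumption, but the only sign under which GEV is highest and Gumbel lowest). The tail probability of the chi-square distribution with one degree of freedom is computed from the complementary error function.

```python
import math


def chi2_one_df_tail(statistic):
    """P(X > statistic) for X chi-square distributed with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def lr_p_value(loglik_full, loglik_null):
    """p value of the likelihood-ratio test with one extra parameter."""
    return chi2_one_df_tail(2.0 * (loglik_full - loglik_null))


# Log-likelihoods as tabulated, assumed negative.
LOGLIK = {
    "Dataset 1": {"EG": -40.8584, "GEV": -40.6890, "Gumbel": -42.6171},
    "Dataset 2": {"EG": -46.6529, "GEV": -46.6315, "Gumbel": -46.8145},
}

if __name__ == "__main__":
    for dataset, ll in LOGLIK.items():
        for model in ("EG", "GEV"):
            p = lr_p_value(ll[model], ll["Gumbel"])
            print(dataset, model, "vs Gumbel:", round(p, 3))
```

Since the tabulated log-likelihoods are rounded, the computed values may differ from the reported p values in the last digit.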
Next, the Anderson-Darling distance is used to study the differences between the samples
and distribution functions. This is given for a distribution function $F(x)$ by the formula
$$D_{AD} = -\frac{1}{n^2} \sum_{i=1}^{n} (2i - 1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right] - 1,$$
where $x_{(1)}, \ldots, x_{(n)}$ is the ordered sample; see e.g. Boos (1982). The distance $D_{AD}$ is usually considered superior to other distances (like Kolmogorov-Smirnov) with respect to tail
behaviour. Values of differences as measured with $D_{AD}$ are given in the table below, from
which we conclude that for both datasets, EG has the lowest discrepancy. Curiously, the
Gumbel distribution in both cases gives a closer fit to data than the GEV.
$D_{AD}$    EG       GEV    Gumbel
Dataset 1   0.0089   1.06   0.018
Dataset 2   0.030    5.75   0.38
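The Anderson-Darling distance for a given sample and a fitted distribution function can be computed as in the following sketch, which assumes the form $D_{AD} = -n^{-2} \sum_{i=1}^{n} (2i-1)[\ln F(x_{(i)}) + \ln(1 - F(x_{(n+1-i)}))] - 1$. The function name `anderson_darling_distance` is chosen here, and the sample and the two candidate distribution functions in the usage example are made up for illustration.

```python
import math


def anderson_darling_distance(sample, cdf):
    """Anderson-Darling distance between a sample and a distribution function.

    `cdf` is a function returning values strictly between 0 and 1 for all
    observations in the sample.
    """
    x = sorted(sample)
    n = len(x)
    total = 0.0
    for i in range(1, n + 1):
        # x[i - 1] is x_(i) and x[n - i] is x_(n+1-i) in zero-based indexing.
        total += (2 * i - 1) * (math.log(cdf(x[i - 1])) + math.log(1.0 - cdf(x[n - i])))
    return -total / n ** 2 - 1.0


if __name__ == "__main__":
    sample = [0.1, 0.3, 0.5, 0.7, 0.9]   # made-up values on (0, 1)
    # Evenly spread values agree well with the uniform distribution function
    # and poorly with the distribution function u^3.
    print(anderson_darling_distance(sample, lambda u: u))
    print(anderson_darling_distance(sample, lambda u: u ** 3))
```

A smaller value indicates closer agreement between the empirical and the fitted distribution function, with extra weight on discrepancies in the tails.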
In summary, GEV had the highest log-likelihood in both samples, EG the second highest and
the Gumbel distribution the lowest. Considering the Anderson-Darling distance
$D_{AD}$, EG gives the closest fit; hence one cannot say whether EG or GEV provides the better
fit. The Gumbel distribution had the lowest log-likelihood and the second smallest $D_{AD}$. This
indicates that the Gumbel distribution is not as good a model as the EG and the GEV. On the
other hand, the Gumbel distribution is a special case of GEV and naturally has less
flexibility in modelling the data. In statistics it is not recommended to use more complicated
models than needed to describe data adequately and, therefore, the Gumbel model can be preferable
in certain circumstances.
Dataset 1                 EG             GEV            Gumbel
$x_{100}$ (m)             14.2           14.3           17.5
Confidence interval       (12.4, 16.0)   (8.2, 20.4)    (14.6, 20.3)

Dataset 2                 EG             GEV            Gumbel
$x_{100}$ (m)             17.7           17.6           19.0
Confidence interval       (13.0, 22.3)   (2.7, 32.5)    (15.6, 22.4)
Gumbel distribution is not rejected at the level 0.05. The difference between EG and GEV is
about 2 metres for longer return periods, about one metre for return periods less than 1000
years. In the right panel, the differences between the established distributions Gumbel and
GEV are quite large for high values of T and the EG distribution gives an intermediate result.
Figure 3: Return values as a function of return period T (years) for the Gumbel, GEV and EG distributions. Left: Dataset 1; Right: Dataset 2.
However, the authors also had access to 21 yearly maxima from Buoy 44004, and the
corresponding results are shown in Figure 4. Here the estimates by the EG distribution are
the highest, although for practical purposes, the results from the Gumbel distribution are
close (and $\hat{\alpha} = 0.97$). For this data set, the point estimates are close and the difference in
confidence intervals may be interesting to study (though not shown here). In summary, based
on the analysis of the three datasets, no general conclusion can be drawn about the behaviour
of the EG distribution with respect to GEV and Gumbel.
Figure 4: Return values as a function of return period T (years) for the Gumbel, GEV and EG distributions; data from Buoy 44004.
6. Concluding remarks
The purpose of this paper was to extend the analysis of the recently introduced EG distribution by investigating estimation of T-year return values for significant wave height. From
an applied point of view, such estimation is by no means trivial, and the method of block
maxima is the classical, and simplest, way to proceed. This method
is often employed due to its simplicity and may serve as a benchmark when evaluating other,
more refined methods (e.g. threshold methods).
We investigated two datasets with values of about the same order. However, the estimated
parameter values possessed a high variability, in particular the estimate of the shape parameter $\alpha$
(about 100 and 2, respectively). The authors performed simulation studies and found that it is not
unlikely to obtain high values of $\hat{\alpha}$ (for parameter settings used in this paper, values as high as
2000 could be found). However, even with such high values, the estimates of return values
are not affected but behave in a stable way. An explanation could be the maximization of
the likelihood; the objective function might be flat around the extremum. Moreover, the
estimators $\hat{\mu}$, $\hat{\sigma}$, $\hat{\alpha}$ are positively correlated. It might be interesting to test other estimation
strategies, e.g. the method of moments, possibly in future work.
Nevertheless, the statistical analysis of the datasets indicates that the EG distribution could serve
as an alternative to the more well-established GEV distribution (also a three-parameter distribution). In particular, for the data analysed, EG renders narrower confidence intervals than
GEV and has for both datasets the smallest Anderson-Darling distance of the distributions
examined. The EG distribution deserves further studies, theoretical (estimation methodology)
as well as practical (analysis of further datasets).
7. References
Ahuja JC, Nash SW. 1967. The generalized Gompertz-Verhulst family of distributions. Sankhya, Series A 29: 141-156.
Boos D. 1982. Minimum Anderson-Darling estimation. Communications in Statistics - Theory and Methods 11:24, 2747-2774.
Coles SG, Dixon MJ. 1999. Likelihood-based inference for extreme value models. Extremes 2: 5-23.
Coles S. 2001. An Introduction to Statistical Modelling of Extreme Values. Springer-Verlag.
Coles S, Pericchi L. 2003. Anticipating catastrophes through extreme value modelling. Appl. Statist. 52: 405-416.
Gupta RC, Gupta PL, Gupta RD. 1998. Modeling failure time data by Lehmann alternatives. Comm. Statist. Theory and Methods 27: 887-904.
Gupta RD, Kundu D. 2001. Exponentiated exponential family: An alternative to Gamma and Weibull distributions. Biometrical Journal 43: 117-130.
Gupta RD, Kundu D. 2007. Generalized exponential distribution: Existing results and some recent developments. Journal of Statistical Planning and Inference 137: 3537-3547.
Hosking JRM, Wallis JR, Wood EF. 1985. Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics 27: 251-261.
Koutsoyiannis D, Baloutsos G. 2000. Analysis of long record of annual maxima rainfall in Athens, Greece, and design rainfall inferences. Natural Hazards 29: 29-48.
Kotz S, Nadarajah S. 2000. Extreme Value Distributions: Theory and Applications. Imperial College Press: London.
Leadbetter MR, Lindgren G, Rootzén H. 1983. Extremes and related properties of random sequences and processes. Springer-Verlag.
Nadarajah S. 2006. The exponentiated Gumbel distribution with climate application. Environmetrics 17: 13-23.
Nelder JA, Mead R. 1965. A simplex algorithm for function minimization. Computer Journal 7: 308-313.
Rychlik I, Rydén J. 2006. Probability and Risk Analysis. An Introduction for Engineers. Springer-Verlag.
Rychlik I, Rydén J, Anderson CW. 2009. Estimation of return values for significant wave height from satellite data. Preprint, Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University, ISSN 1652-9715; nr 2009:29.
Scarf P. 1992. Estimation for a four parameter generalized extreme value distribution. Comm. Statist. Theory and Methods 21: 2185-2201.
Affiliation:
Jesper Rydén
Department of Mathematics
Uppsala University
E-mail: jesper.ryden@math.uu.se
http://www.jenvstat.org
Submitted: 2008-11-25
Accepted: 2009-12-19