
STATISTICAL

PORTFOLIO
ESTIMATION

Masanobu Taniguchi, Hiroshi Shiraishi,
Junichi Hirukawa, Hiroko Kato Solvang,
and Takashi Yamashita
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy
of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute
endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB®
software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170720

International Standard Book Number-13: 978-1-4665-0560-5 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface ix

1 Introduction 1

2 Preliminaries 5
2.1 Stochastic Processes and Limit Theorems 5

3 Portfolio Theory for Dependent Return Processes 19


3.1 Introduction to Portfolio Theory 19
3.1.1 Mean-Variance Portfolio 20
3.1.2 Capital Asset Pricing Model 23
3.1.3 Arbitrage Pricing Theory 27
3.1.4 Expected Utility Theory 28
3.1.5 Alternative Risk Measures 33
3.1.6 Copulas and Dependence 36
3.1.7 Bibliographic Notes 38
3.1.8 Appendix 38
3.2 Statistical Estimation for Portfolios 41
3.2.1 Traditional Mean-Variance Portfolio Estimators 42
3.2.2 Pessimistic Portfolio 44
3.2.3 Shrinkage Estimators 46
3.2.4 Bayesian Estimation 48
3.2.5 Factor Models 52
3.2.5.1 Static factor models 52
3.2.5.2 Dynamic factor models 55
3.2.6 High-Dimensional Problems 58
3.2.6.1 The case of m/n → y ∈ (0, 1) 59
3.2.6.2 The case of m/n → y ∈ (1, ∞) 62
3.2.7 Bibliographic Notes 63
3.2.8 Appendix 63
3.3 Simulation Results 65
3.3.1 Quasi-Maximum Likelihood Estimator 66
3.3.2 Efficient Frontier 68
3.3.3 Difference between the True Point and the Estimated Point 69
3.3.4 Inference of µP 70
3.3.5 Inference of Coefficient 72
3.3.6 Bibliographic Notes 73

4 Multiperiod Problem for Portfolio Theory 75


4.1 Discrete Time Problem 76
4.1.1 Optimal Portfolio Weights 76
4.1.2 Consumption Investment 78

4.1.3 Simulation Approach for VAR(1) model 80
4.1.4 Bibliographic Notes 84
4.2 Continuous Time Problem 84
4.2.1 Optimal Consumption and Portfolio Weights 85
4.2.2 Estimation 90
4.2.2.1 Generalized method of moments (GMM) 91
4.2.2.2 Threshold estimation method 95
4.2.3 Bibliographic Notes 98
4.3 Universal Portfolio 98
4.3.1 µ-Weighted Universal Portfolios 98
4.3.2 Universal Portfolios with Side Information 101
4.3.3 Successive Constant Rebalanced Portfolios 104
4.3.4 Universal Portfolios with Transaction Costs 106
4.3.5 Bibliographic Notes 108
4.3.6 Appendix 109

5 Portfolio Estimation Based on Rank Statistics 113


5.1 Introduction to Rank-Based Statistics 113
5.1.1 History of Ranks 113
5.1.1.1 Wilcoxon’s signed rank and rank sum tests 113
5.1.1.2 Hodges–Lehmann and Chernoff–Savage 117
5.1.2 Maximal Invariants 124
5.1.2.1 Invariance of sample space, parameter space and tests 124
5.1.2.2 Most powerful invariant test 125
5.1.3 Efficiency of Rank-Based Statistics 126
5.1.3.1 Least favourable density and most powerful test 126
5.1.3.2 Asymptotically most powerful rank test 129
5.1.4 U-Statistics for Stationary Processes 137
5.2 Semiparametrically Efficient Estimation in Time Series 139
5.2.1 Introduction to Rank-Based Theory in Time Series 139
5.2.1.1 Testing for randomness against ARMA alternatives 139
5.2.1.2 Testing an ARMA model against other ARMA alternatives 146
5.2.2 Tangent Space 149
5.2.3 Introduction to Semiparametric Asymptotic Optimal Theory 155
5.2.4 Semiparametrically Efficient Estimation in Time Series, and Multivariate
Cases 159
5.2.4.1 Rank-based optimal influence functions (univariate case) 159
5.2.4.2 Rank-based optimal estimation for elliptical residuals 164
5.3 Asymptotic Theory of Rank Order Statistics for ARCH Residual Empirical
Processes 170
5.4 Independent Component Analysis 180
5.4.1 Introduction to Independent Component Analysis 180
5.4.1.1 The foregoing model for financial time series 180
5.4.1.2 ICA modeling for financial time series 187
5.4.1.3 ICA modeling in frequency domain for time series 191
5.5 Rank-Based Optimal Portfolio Estimation 202
5.5.1 Portfolio Estimation Based on Ranks for Independent Components 202
5.5.2 Portfolio Estimation Based on Ranks for Elliptical Residuals 204
6 Portfolio Estimation Influenced by Non-Gaussian Innovations and Exogenous
Variables 207
6.1 Robust Portfolio Estimation under Skew-Normal Return Processes 207
6.2 Portfolio Estimators Depending on Higher-Order Cumulant Spectra 211
6.3 Portfolio Estimation under the Utility Function Depending on Exogenous Variables 215
6.4 Multi-Step Ahead Portfolio Estimation 221
6.5 Causality Analysis 224
6.6 Classification by Quantile Regression 227
6.7 Portfolio Estimation under Causal Variables 229

7 Numerical Examples 235


7.1 Real Data Analysis for Portfolio Estimation 235
7.1.1 Introduction 235
7.1.2 Data 235
7.1.3 Method 237
7.1.3.1 Model selection 237
7.1.3.2 Confidence region 238
7.1.3.3 Locally stationary estimation 239
7.1.4 Results and Discussion 240
7.1.5 Conclusions 242
7.1.6 Bibliographic Notes 243
7.2 Application for Pension Investment 243
7.2.1 Introduction 244
7.2.2 Data 244
7.2.3 Method 244
7.2.4 Results and Discussion 248
7.2.5 Conclusions 248
7.3 Microarray Analysis Using Rank Order Statistics for ARCH Residual 248
7.3.1 Introduction 249
7.3.2 Data 250
7.3.3 Method 252
7.3.3.1 The rank order statistic for the ARCH residual empirical process 252
7.3.3.2 Two-group comparison for microarray data 252
7.3.3.3 GO analysis 254
7.3.3.4 Pathway analysis 254
7.3.4 Simulation Study 254
7.3.5 Results and Discussion 255
7.3.5.1 Simulation data 255
7.3.5.2 Affy947 expression dataset 255
7.3.6 Conclusions 257
7.4 Portfolio Estimation for Spectral Density of Categorical Time Series Data 259
7.4.1 Introduction 259
7.4.2 Method 259
7.4.2.1 Spectral Envelope 259
7.4.2.2 Diversification analysis 260
7.4.2.3 An extension of SpecEnv to the mean-diversification efficient
frontier 260
7.4.3 Data 261
7.4.3.1 Simulation data 261
7.4.3.2 DNA sequence data 261
7.4.4 Results and Discussion 262
7.4.4.1 Simulation study 262
7.4.4.2 DNA sequence for the BNRF1 genes 263
7.4.5 Conclusions 268
7.5 Application to Real-Value Time Series Data for Corticomuscular Functional
Coupling for SpecEnv and the Portfolio Study 270
7.5.1 Introduction 270
7.5.2 Method 274
7.5.3 Results and Discussion 274

8 Theoretical Foundations and Technicalities 283


8.1 Limit Theorems for Stochastic Processes 283
8.2 Statistical Asymptotic Theory 287
8.3 Statistical Optimal Theory 293
8.4 Statistical Model Selection 300
8.5 Efficient Estimation for Portfolios 312
8.5.1 Traditional mean variance portfolio estimators 313
8.5.2 Efficient mean variance portfolio estimators 315
8.5.3 Appendix 324
8.6 Shrinkage Estimation 331
8.7 Shrinkage Interpolation for Stationary Processes 342

Bibliography 349

Author Index 365

Subject Index 371


Preface

Over the past several decades, the field of financial engineering has developed as a broad integration of economics, probability
theory, statistics, and related disciplines. The composition of portfolios is one of the most funda-
mental and important methods of financial engineering for controlling the risk of investments. This book
provides a comprehensive development of statistical inference for portfolios and its applications.
Historically, Markowitz (1952) contributed to the advancement of modern portfolio theory by laying
the foundation for the diversification of investment portfolios. His approach is called the mean-variance
portfolio, which maximizes the mean of the portfolio return while reducing its variance (the risk
of the portfolio). The mean-variance portfolio coefficients are expressed as a function of the
mean and variance matrix of the return process. Optimal portfolio coefficients based on the mean
and variance matrix of return have been derived by various criteria. Assuming that the return pro-
cess is i.i.d. Gaussian, Jobson and Korkie (1980) proposed a portfolio coefficient estimator of the
optimal portfolio by making the sample version of the mean-variance portfolio. However, empirical
studies show that observed stretches of financial returns are often non-Gaussian and dependent. In
this situation, it is shown that portfolio estimators of the mean-variance type are not asymptotically
efficient in general even if the return process is Gaussian, which gives a strong warning against use of the
usual portfolio estimators. We also provide a necessary and sufficient condition for the estimators
to be asymptotically efficient in terms of the spectral density matrix of the return. This motivates
the fundamentally important theme of the book. Hence we will provide modern statistical techniques
for the problems of portfolio estimation, grasping them as optimal statistical inference for various
return processes. We will introduce a variety of stochastic processes, e.g., non-Gaussian stationary
processes, non-linear processes, non-stationary processes, etc. For them we will develop a mod-
ern statistical inference by use of local asymptotic normality (LAN), which is due to LeCam. The
approach is a unified and very general one. Based on this, we address many important problems
for portfolios. It is well known that the Markowitz portfolio optimizes the mean and variance
of the portfolio return. However, recent years have seen the increasing development of new risk measures
beyond the variance, such as Value at Risk (VaR), Expected Shortfall (ES), Conditional Value at
Risk (CVaR) and Tail Conditional Expectation (TCE). We show how to construct optimal portfolio
estimators based on these risk measures. Thanks to the advancement of computer technology,
a variety of financial transactions have become possible within a very short time in recent years. From
this point of view, multiperiod problems are among the most important issues in portfolio theory. We
discuss these problems from three perspectives. To implement an optimal portfolio
strategy, multivariate time series models are required. Moreover, the assumption that the innovation
densities underlying those models are known seems quite unrealistic. If those densities remain un-
specified, the model becomes a semiparametric one, and rank-based inference methods naturally
come into the picture. Rank-based inference methods under very general conditions are known to
achieve semiparametric efficiency bounds. However, defining ranks in the context of multivariate
time series models is not obvious. Two distinct definitions can be proposed. The first one relies on
the assumption that the innovation process is described by some unspecified independent compo-
nent analysis model. The second one relies on the assumption that the innovation density is some
unspecified elliptical density. Applications to portfolio management problems, rank statistics and
mean-diversification efficient frontier are discussed. These examples give readers a practical hint
for applying the introduced portfolio theories. This book contains applications ranging widely, from

the financial field to genomics and medical science.
This book is suitable for undergraduate and graduate students who specialize in statistics, math-
ematics, finance, econometrics, genomics, etc., as a textbook. It is also appropriate for researchers
in related fields as a reference book.

Throughout our writing of the book, our collaborative researchers have offered us advice, debate,
inspiration and friendship. For this we wish to thank Marc Hallin, Holger Dette, Yury Kutoyants,
Ngai Hang Chan, Liudas Giraitis, Murad S. Taqqu, Anna Clara Monti, Cathy W. Chen, David Stof-
fer, Sangyeol Lee, Ching-Kang Ing, Paul Embrechts, Roger Koenker, Claudia Klüppelberg, Thomas
Mikosch, Richard A. Davis and Zudi Lu.

The research by the first author M.T. has been supported by the Research Institute for Science &
Engineering, Waseda University, Japanese Government Pension Investment Fund and JSPS fund-
ings: Kiban(A)(23244011), Kiban(A)(15H02061) and Houga(26540015). M.T. deeply thanks all of
them for their generous support and kindness. The research by the second author H.S. has been
supported by JSPS fundings: Wakate(B)(24730193), Kiban(C)(16K00036) and Core-to-Core Pro-
gram (Foundation of a Global Research Cooperative Center in Mathematics focused on Number
Theory and Geometry). The research by the third author J.H. has been supported by JSPS funding:
Kiban(C)(16K00042).

Masanobu Taniguchi, Waseda, Tokyo,


Hiroshi Shiraishi, Keio, Kanagawa,
Junichi Hirukawa, Niigata,
Hiroko Kato Solvang, Institute of Marine Research, Bergen
Takashi Yamashita, Government Pension Investment Fund, Tokyo

MATLAB® is a registered trademark of The MathWorks, Inc. For product information, please
contact:

The MathWorks, Inc.


3 Apple Hill Drive
Natick, MA, 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: info@mathworks.com,
Web: www.mathworks.com.
Chapter 1

Introduction

This book provides a comprehensive integration of statistical inference for portfolios and its appli-
cations. Historically, Markowitz (1952) contributed to the advancement of modern portfolio theory
by laying the foundation for the diversification of investment portfolios. His approach is called the
mean-variance portfolio, which maximizes the mean of portfolio return with reducing its variance
(risk of portfolio). Actually, if the return process of assets has the mean vector µ and variance matrix
Σ, then the mean-variance portfolio coefficients are expressed as a function of µ and Σ.
Optimal portfolio coefficients based on µ and Σ have been derived by various criteria. Assuming
that the return process is i.i.d. Gaussian, Jobson and Korkie (1980) proposed a portfolio coefficient
estimator of the optimal portfolio by substituting the sample mean vector µ̂ and the sample variance
matrix Σ̂ into µ and Σ, respectively. However, empirical studies show that observed stretches of
financial returns are often neither i.i.d. nor Gaussian. This leads us to the assumption that finan-
cial returns are non-Gaussian dependent processes. From this point of view, Basak et al. (2002)
showed the consistency of optimal portfolio estimators when portfolio returns are stationary pro-
cesses. However, in the literature there has been no study on the asymptotic efficiency of estimators
for optimal portfolios.
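The plug-in construction just described can be sketched numerically. The following is a minimal illustration, not code from the book: it computes the closed-form Markowitz weights minimizing w′Σw subject to w′µ = µ_P and w′1 = 1, then evaluates them at the sample estimates µ̂ and Σ̂. The asset means, covariance matrix, and target return µ_P are invented for the example.

```python
import numpy as np

def mean_variance_weights(mu, Sigma, mu_P):
    """Closed-form Markowitz weights: minimize w' Sigma w subject to
    w' mu = mu_P and w' 1 = 1 (two Lagrange multipliers)."""
    ones = np.ones(len(mu))
    Si = np.linalg.inv(Sigma)
    a = ones @ Si @ ones
    b = ones @ Si @ mu
    c = mu @ Si @ mu
    d = a * c - b ** 2
    lam = (c - b * mu_P) / d        # multiplier for the budget constraint
    gam = (a * mu_P - b) / d        # multiplier for the return constraint
    return Si @ (lam * ones + gam * mu)

# Plug-in estimator g(mu_hat, Sigma_hat) from simulated i.i.d. returns
rng = np.random.default_rng(0)
R = rng.multivariate_normal([0.05, 0.03, 0.04],
                            [[0.04, 0.01, 0.00],
                             [0.01, 0.03, 0.01],
                             [0.00, 0.01, 0.05]], size=500)
mu_hat, Sigma_hat = R.mean(axis=0), np.cov(R, rowvar=False)
w_hat = mean_variance_weights(mu_hat, Sigma_hat, mu_P=0.04)
```

Note that the estimated weights satisfy both constraints exactly for any (µ̂, Σ̂); the estimation error lies entirely in where the weights sit relative to the true frontier.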
In this book, denoting optimal portfolios by a function g = g(µ, Σ) of µ and Σ, we discuss the
asymptotic efficiency of estimators ĝ = g(µ̂, Σ̂) when the return is a vector-valued non-Gaussian
stationary process. In Section 8.5 it is shown that ĝ is not asymptotically efficient in general even
if the return process is Gaussian, which gives a strong warning against use of the usual estimators ĝ.
We also provide a necessary and sufficient condition for ĝ to be asymptotically efficient in terms
of the spectral density matrix of the return. This motivates the fundamentally important theme of the
book. Hence we will provide modern statistical techniques for the problems of portfolio estimation,
grasping them as optimal statistical inference for various return processes. Because empirical finan-
cial phenomena show that return processes are non-Gaussian and dependent, we will introduce a
variety of stochastic processes, e.g., non-Gaussian stationary processes, non-linear processes, non-
stationary processes, etc. For them we will develop a modern statistical inference by use of local
asymptotic normality (LAN), which is due to LeCam. The approach is a unified and very general
one. We will deal with not only the mean-variance portfolio but also other portfolios such as the
mean-VaR portfolio, the mean-CVaR portfolio and the pessimistic portfolio. These portfolios have
recently received wide attention. Moreover, we discuss multiperiod problems, which are much more
realistic than (traditional) single-period problems. In this framework, we need to construct a sequence
or a continuous function of asset allocations. If the innovation densities are unspecified for the mul-
tivariate time series models in the analyses of optimal portfolios, semiparametric inference methods
are required. To achieve semiparametric efficient inference, rank-based methods under very general
conditions are widely used. As examples, we provide various practical applications ranging from
financial science to medical science.

This book is organized as follows. Chapter 2 explains the foundation of stochastic processes,
because a great deal of data in economics, finance, engineering and the natural sciences occur in
the form of time series where observations are dependent and where the nature of this dependence

is of interest in itself. A model which describes the probability structure of the time series is called
a stochastic process. We provide a modern introduction to stochastic processes. Because statistical
analysis for stochastic processes largely relies on asymptotic theory, some useful limit theorems
and central limit theorems are introduced. In Chapter 3, we introduce the modern portfolio theory
of Markowitz (1952), the capital asset pricing model (CAPM) of Sharpe (1964) and the arbitrage
pricing theory of Ross (1976). Then, we discuss other portfolios based on new risk measures such as
VaR and CVaR. Moreover, a variety of statistical estimation methods for portfolios are introduced
with some simulation studies. We discuss the multiperiod problem in Chapter 4. By using dynamic
programming, a sequence (or a continuous function) of portfolio weights is obtained in the discrete
time case (or the continuous time case). In addition, the universal portfolio (UP) introduced by
Cover (1991) is discussed. The idea of this portfolio comes from the information theory of Shannon
(1948) and Kelly (1956).
The concepts of rank-based inference are simple and easy to understand even at an introductory
level. However, at the same time, they provide asymptotically optimal semiparametric inference. We
discuss portfolio estimation problems based on rank statistics in Chapter 5. Section 5.1 contains the
introduction and history of rank statistics. The important concepts and methods for rank-based in-
ference, e.g., maximal invariants, least favourable and U-statistics for stationary processes, are also
introduced. In Section 5.2, semiparametrically efficient estimations in time series are addressed. As
an introduction, we discuss the testing problem for randomness against the ARMA alternative, and
testing for the ARMA model against other ARMA alternatives. The notion of tangent space, which
is important for semiparametric optimal inference, is introduced, and the semiparametric asymp-
totic optimal theory is addressed. Semiparametrically efficient estimations in univariate time series,
and the multivariate elliptical residuals case are discussed. In Section 5.3, we discuss the asymp-
totic theory of a class of rank order statistics for the two-sample problem based on the squared
residuals from two classes of ARCH models. Since portfolio estimation includes inference for mul-
tivariate time series, independent component analysis (ICA) for multivariate time series is useful for
rank-based inference, and is introduced in Section 5.4. First, we address the foregoing models for
financial multivariate time series. Then, we discuss ICA modeling in both the time domain and the
frequency domain. Finally, portfolio estimations based on ranks for an independent component and
on ranks for elliptical residuals are discussed in Section 5.5.
The classical theory of portfolio estimation often assumes that asset returns are independent
normal random variables. However, in view of empirical financial studies, we recognize that the
financial returns are not Gaussian or independent. In Chapter 6, assuming that the return process is
generated by a linear process with skew-normal innovations, we evaluate the influence of skewness
on the asymptotics of mean-variance portfolios, and discuss a robustness with respect to skewness.
We also address the problems of portfolio estimation (i) when the utility function depends on ex-
ogenous variables and (ii) when the portfolio coefficients depend on causal variables. In the case
of (ii) we apply the results to the Japanese pension investment portfolio by use of the canonical
correlation method. Chapter 7 provides five applications: Japanese companies’ monthly log-returns
analysis based on the theory of portfolio estimators, Japanese government pension investment fund
analysis for the multiperiod portfolio optimization problem, microarray data analysis by rank-order
statistics for an ARCH residual, and DNA sequence data analysis and a corticomuscular coupling
analysis as applications for a mean-diversification portfolio. As for the statistical inference, Chapter
8 is the main chapter. To describe financial returns, we introduce a variety of stochastic processes,
e.g., vector-valued non-Gaussian linear processes, semimartingales, ARCH(∞)-SM, CHARN pro-
cess, locally stationary process, long memory process, etc. For them we introduce some funda-
mental statistics, and show their fundamental asymptotics. As the actual data analysis we have to
choose a statistical model from a class of candidate processes. Section 8.4 introduces a generalized
AIC (GAIC) and derives contiguous asymptotics of the selected order. We elucidate the philosophy
of GAIC and the other criteria, which is helpful for use of these criteria. Section 8.5 shows that
the generalized mean-variance portfolio estimator ĝ is not asymptotically optimal in general even if
the return process is a Gaussian process. Then we give a necessary and sufficient condition for ĝ to
be asymptotically optimal in terms of the spectra of the return process. Then we seek the optimal
one. In this book we develop modern statistical inference for stochastic processes by use of LAN.
Section 8.3 provides the foundation of the LAN approach, leading to a conclusion that the MLE
is asymptotically optimal. Section 8.6 addresses the problem of shrinkage estimation for stochastic
processes and time series regression models, which will be helpful when the return processes are of
high dimension.
Chapter 2

Preliminaries

This chapter discusses the foundation of stochastic processes. Much of statistical analysis is con-
cerned with models in which the observations are assumed to vary independently. However, a great
deal of data in economics, finance, engineering, and the natural sciences occur in the form of time
series where observations are dependent and where the nature of this dependence is of interest in
itself. A model which describes the probability structure of a series of observations Xt , t = 1, . . . , n,
is called a stochastic process. An Xt might be the value of a stock price at time point t, the water
level in a lake at time point t, and so on. The main purpose of this chapter is to provide a modern in-
troduction to stochastic processes. Because statistical analysis for stochastic processes largely relies
on asymptotic theory, we explain some useful limit theorems and central limit theorems.

2.1 Stochastic Processes and Limit Theorems


Suppose that Xt is the water level at a given place of a lake at time point t. We may describe
its fluctuating evolution with respect to t as a family of random variables {X1 , X2 , . . . } indexed by
the discrete time parameter t ∈ N ≡ {1, 2, . . . }. If Xt is the number of accidents at an intersection
during the time interval [0, t], this leads to a family of random variables {Xt : t ≥ 0} indexed by the
continuous time parameter t. More generally,

Definition 2.1.1. Given an index set T , a stochastic process indexed by T is a collection of random
variables {Xt : t ∈ T } on a probability space (Ω, F , P) taking values in a set S, which is called the
state space of the process.

If T = Z ≡ the set of all integers, we say that {Xt } is a discrete time stochastic process, and it is
often written as {Xt : t ∈ Z}. If T = [0, ∞), then {Xt } is called a continuous time process, and it is
often written as {Xt : t ∈ [0, ∞)}.
The state space S is the set in which the possible values of each Xt lie. If S = {. . . , −1, 0, 1, 2, . . .},
we refer to the process as an integer-valued process. If S = R = (−∞, ∞), then we call {Xt } a real-
valued process. If S is the m-dimensional Euclidean space Rm , then {Xt } is said to be an m-vector
process, and it is written as {Xt }.
The distinguishing features of a stochastic process {Xt : t ∈ T } are relationships among the
random variables Xt , t ∈ T . These relationships are specified by the joint distribution function of
every finite family Xt1 , . . . , Xtn of the process.
In what follows we provide concrete and typical models of stochastic processes.

Example 2.1.2. (AR(p)-, MA(q)-, and ARMA(p, q)-processes).

Let {ut : t ∈ Z} be a family of independent and identically distributed (i.i.d.) random vari-
ables with mean zero and variance σ2 (for convenience, we write {ut } ∼ i.i.d.(0, σ2 ) ). We define a

stochastic process {Xt : t ∈ Z} by
$$X_t = -(a_1 X_{t-1} + \cdots + a_p X_{t-p}) + u_t \qquad (2.1.1)$$
where the $a_j$'s are real constants ($a_p \neq 0$). The process is said to be an autoregressive process of
order p (or AR(p)), which was introduced in Yule’s analysis of sunspot numbers. This model has an
intuitive appeal in view of the usual regression analysis. A more general class of stochastic model is
$$X_t = -\sum_{j=1}^{p} a_j X_{t-j} + u_t + \sum_{j=1}^{q} b_j u_{t-j} \qquad (2.1.2)$$

where the $b_j$'s are real constants ($b_q \neq 0$). This process {Xt : t ∈ Z} is said to be an autoregressive
moving average process (model) of order (p, q), and is denoted by ARMA(p, q). The special case of
ARMA(0, q) is referred to as a moving average process (model) of order q, denoted by MA(q).
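As a quick illustration of the recursion (2.1.2), an ARMA path can be generated directly from i.i.d. innovations. The sketch below follows the sign convention of the text; the coefficients and the Gaussian innovations are chosen arbitrarily for the example.

```python
import numpy as np

def simulate_arma(a, b, n, sigma=1.0, burn=500, rng=None):
    """Generate X_t = -(sum_j a_j X_{t-j}) + u_t + sum_j b_j u_{t-j}
    with {u_t} ~ i.i.d. N(0, sigma^2); a burn-in removes start-up effects."""
    rng = rng or np.random.default_rng(0)
    p, q = len(a), len(b)
    u = rng.normal(0.0, sigma, n + burn)
    X = np.zeros(n + burn)
    for t in range(max(p, q), n + burn):
        ar = -sum(a[j] * X[t - 1 - j] for j in range(p))
        ma = sum(b[j] * u[t - 1 - j] for j in range(q))
        X[t] = ar + u[t] + ma
    return X[burn:]

# a1 = -0.5 gives X_t = 0.5 X_{t-1} + u_t + 0.3 u_{t-1}: an ARMA(1, 1) path
x = simulate_arma(a=[-0.5], b=[0.3], n=2000)
```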

Example 2.1.3. (Nonlinear processes).


A stochastic process {Xt : t ∈ Z} is said to follow a nonlinear autoregressive model of order p if
there exists a measurable function $f : \mathbb{R}^{p+1} \to \mathbb{R}$ such that
$$X_t = f(X_{t-1}, X_{t-2}, \ldots, X_{t-p}, u_t), \quad t \in \mathbb{Z} \qquad (2.1.3)$$
where {ut } ∼ i.i.d.(0, σ2 ). Typical examples are as follows. Tong (1990) proposed a self-exciting
threshold autoregressive model (SETAR(k; p, . . . , p) model) defined by
$$X_t = a_0^{(j)} + a_1^{(j)} X_{t-1} + \cdots + a_p^{(j)} X_{t-p} + u_t^{(j)}, \qquad (2.1.4)$$
if $X_{t-d} \in \Omega_j$, $j = 1, \ldots, k$, where the $\Omega_j$ are disjoint intervals on $\mathbb{R}$ with $\bigcup_{j=1}^{k} \Omega_j = \mathbb{R}$, and $d > 0$ is
called the “threshold lag.” Here $a_i^{(j)}$, $i = 0, \ldots, p$, $j = 1, \ldots, k$, are real constants. In econometrics,
it is not natural to assume a constant one-period forecast variance. As a plausible model, Engle
(1982) introduced an autoregressive conditional heteroscedastic model (ARCH(q)), which is defined
as
$$X_t = u_t \sqrt{a_0 + \sum_{j=1}^{q} a_j X_{t-j}^2} \qquad (2.1.5)$$

where $a_0 > 0$ and $a_j \geq 0$, $j = 1, \ldots, q$.
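Under (2.1.5), the conditional variance of $X_t$ given the past is $a_0 + \sum_j a_j X_{t-j}^2$, so large shocks breed volatile spells. A direct ARCH(1) simulation (parameters invented for the example) recovers the unconditional variance $a_0/(1 - a_1)$.

```python
import numpy as np

def simulate_arch(a0, a, n, burn=500, rng=None):
    """Simulate (2.1.5): X_t = u_t * sqrt(a0 + sum_j a_j X_{t-j}^2)."""
    rng = rng or np.random.default_rng(2)
    q = len(a)
    u = rng.normal(size=n + burn)
    X = np.zeros(n + burn)
    for t in range(q, n + burn):
        var_t = a0 + sum(a[j] * X[t - 1 - j] ** 2 for j in range(q))
        X[t] = u[t] * np.sqrt(var_t)
    return X[burn:]

x = simulate_arch(a0=0.2, a=[0.5], n=5000)
# For ARCH(1) with a1 < 1 the unconditional variance is a0/(1 - a1) = 0.4
```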


A continuous time stochastic process {Xt : t ∈ [0, ∞)} with values in the state space S =
{0, 1, 2, . . .} is called a counting process if Xt for any t represents the total number of events that
have occurred during the time period [0, t].

Example 2.1.4. (Poisson processes) A counting process {Xt : t ∈ [0, ∞)} is said to be a homoge-
neous Poisson process with rate λ > 0 if
(i) X0 = 0,
(ii) for all choices of t1 , . . . , tn ∈ [0, ∞) satisfying t1 < t2 < · · · < tn , the increments
Xt2 − Xt1 , Xt3 − Xt2 , . . . , Xtn − Xtn−1
are independent,
(iii) for all s, t ≥ 0,
$$P(X_{t+s} - X_s = x) = \frac{e^{-\lambda t}(\lambda t)^x}{x!}, \quad x = 0, 1, 2, \ldots,$$
i.e., $X_{t+s} - X_s$ is Poisson distributed with rate $\lambda t$.
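A standard construction, equivalent to conditions (i)–(iii) though not stated in the text, builds the process from i.i.d. exponential inter-arrival times with mean 1/λ. A sketch:

```python
import numpy as np

def poisson_path(lam, horizon, rng=None):
    """Event times of a homogeneous Poisson process with rate lam on
    [0, horizon], built from i.i.d. exponential inter-arrival times."""
    rng = rng or np.random.default_rng(3)
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)     # inter-arrival time, mean 1/lam
        if t > horizon:
            return np.array(times)
        times.append(t)

events = poisson_path(lam=2.0, horizon=1000.0)
# len(events) is the count X_1000, Poisson with mean lam * 1000 = 2000
```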
STOCHASTIC PROCESSES AND LIMIT THEOREMS 7
The following example is one of the most fundamental and important continuous time stochastic
processes.

Example 2.1.5. (Brownian motion or Wiener process) A continuous time stochastic process {Xt :
t ∈ [0, ∞)} is said to be a Brownian motion or Wiener process if
(i) X0 = 0,
(ii) for all choices of t1 , . . . , tn ∈ [0, ∞) satisfying t1 < t2 < . . . < tn , the increments

Xt2 − Xt1 , Xt3 − Xt2 , . . . , Xtn − Xtn−1

are independent,
(iii) for any t > s,
$$P(X_t - X_s \leq x) = \frac{1}{\sqrt{2\pi c(t-s)}} \int_{-\infty}^{x} \exp\left\{ -\frac{u^2}{2c(t-s)} \right\} du,$$

where $c$ is a positive constant, i.e., $X_t - X_s \sim N(0, c(t-s))$.
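Property (iii) can be checked numerically by building a discretized path from independent N(0, c∆t) increments and differencing at a fixed span t − s; the variance of those differences should be close to c(t − s). The step size and span below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
c, dt, n = 2.0, 0.01, 200_000
# Path built from independent N(0, c*dt) increments (properties (ii)-(iii))
W = np.concatenate([[0.0], rng.normal(0.0, np.sqrt(c * dt), n).cumsum()])
lag = 500                          # span t - s = lag * dt = 5.0
d = W[lag:] - W[:-lag]             # increments over span 5.0
# np.var(d) should be close to c * (t - s) = 10
```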

The following process is often used to describe a variety of phenomena.

Example 2.1.6. (Diffusion process) Suppose a particle is plunged into a nonhomogeneous and
moving fluid. Let Xt be the position of the particle at time t and µ(x, t) the velocity of a small volume
v of fluid located at time t. A particle within v will carry out Brownian motion with parameter σ(x, t).
Then the change in position of the particle in time interval [t, t + ∆t] can be written approximately
in the form

$$X_{t+\Delta t} - X_t \approx \mu(X_t, t)\,\Delta t + \sigma(X_t, t)\,(W_{t+\Delta t} - W_t),$$

where {Wt } is a Brownian motion process. If we replace the increments by differentials, we obtain
the differential equation

$$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t, \qquad (2.1.6)$$

where µ(Xt , t) is called the drift and σ(Xt , t) the diffusion. A rigorous discussion will be given later
on. A stochastic process {Xt : t ∈ [0, ∞)} is said to be a diffusion process if it satisfies (2.1.6). Re-
cently this type of diffusion process has been used to model financial data.
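The increment relation above suggests the simplest numerical scheme for (2.1.6), the Euler–Maruyama method: replace dt and dW_t by small finite increments. A sketch follows; the Ornstein–Uhlenbeck-type drift and diffusion functions are chosen only for illustration.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n, rng=None):
    """Euler-Maruyama scheme for dX_t = mu(X_t,t) dt + sigma(X_t,t) dW_t:
    step size dt = T/n, Brownian increment dW ~ N(0, dt) at each step."""
    rng = rng or np.random.default_rng(4)
    dt = T / n
    X = np.empty(n + 1)
    X[0] = x0
    t = 0.0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        X[k + 1] = X[k] + mu(X[k], t) * dt + sigma(X[k], t) * dW
        t += dt
    return X

# Example: drift pulls the particle toward 0, constant diffusion
path = euler_maruyama(mu=lambda x, t: -0.8 * x,
                      sigma=lambda x, t: 0.5, x0=2.0, T=20.0, n=4000)
```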

Since probability theory has its roots in games of chance, it is profitable to interpret results in
terms of gambling.

Definition 2.1.7. Let (Ω, F , P) be a probability space, {X1 , X2 , . . .} a sequence of integrable random
variables on (Ω, F , P), and F1 ⊂ F2 ⊂ · · · an increasing sequence of sub σ-algebras of F , where
Xt is assumed to be Ft -measurable. The sequence {Xt } is said to be a martingale relative to the Ft
(or we say that {Xt , Ft } is a martingale) if and only if for all t = 1, 2, . . . ,

$$E(X_{t+1} \mid \mathcal{F}_t) = X_t \quad \text{a.e.} \qquad (2.1.7)$$

Martingales may be understood as models for fair games in the sense that Xt signifies the amount of
money that a player has at time t. The martingale property states that the average amount a player
will have at time (t + 1), given that he/she has amount Xt at time t, is equal to Xt regardless of what
his/her past fortune has been.
8 PRELIMINARIES
Nowadays the concept of a martingale is very fundamental in finance. Let Bt be the price (non-
random) of a bank account at time t. Assume that Bt satisfies
Bt − Bt−1 = ρBt−1 , for t ∈ N.
Let Xt be the price of a stock at time t, and write the return as
rt = (Xt − Xt−1 )/Xt−1 .
If we assume
E {rt |Ft−1 } = ρ (2.1.8)
then it is shown that
E{Xt/Bt | Ft−1} = Xt−1/Bt−1, (2.1.9)

(e.g., Shiryaev (1999)); hence, {Xt/Bt} becomes a martingale. The assumption (2.1.8) seems natural in
an economist’s view. If (2.1.8) is violated, the investor can decide whether to invest in the stock or
in the bank account.
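A quick Monte Carlo illustration of (2.1.9): if the conditional mean return equals ρ, the expectation of the discounted price Xt/Bt stays constant in t. The numbers (ρ = 0.02, noise scale 0.1) are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_paths, T = 0.02, 200000, 5
B = (1.0 + rho) ** np.arange(T + 1)                # bank account, B_0 = 1
X = np.ones((n_paths, T + 1))                      # stock price paths, X_0 = 1
for t in range(1, T + 1):
    r = rho + 0.1 * rng.standard_normal(n_paths)   # returns with E{r_t | F_{t-1}} = rho
    X[:, t] = X[:, t - 1] * (1.0 + r)

discounted_means = (X / B).mean(axis=0)            # E(X_t / B_t); all close to X_0 = 1
```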

In what follows we denote by F (X1 , . . . , Xt ) the smallest σ-algebra making random vari-
ables X1 , . . . , Xt measurable. If {Xt , Ft } is a martingale, it is automatically a martingale relative to
F (X1 , . . . , Xt ). To see this, condition both sides of (2.1.7) with respect to F (X1 , . . . , Xt ). If we do not
mention the Ft explicitly, we always mean Ft = F (X1 , . . . , Xt ). In statistical asymptotic theory, it is
known that one of the most fundamental quantities becomes a martingale under suitable conditions
(see Problem 2.1).
In time series analysis, stationary processes and linear processes have been used to describe
the data concerned (see Hannan (1970), Anderson (1971), Brillinger (2001b), Brockwell and Davis
(2006), and Taniguchi and Kakizawa (2000)). Let us explain these processes in vector form.

Definition 2.1.8. An m-vector process


{Xt = (X1 (t), . . . , Xm (t))′ : t ∈ Z}
is called strictly stationary if, for all n ∈ N, t1 , . . . , tn , h ∈ Z, the distributions of Xt1 , . . . , Xtn and
Xt1 +h , . . . , Xtn +h are the same.

The simplest example of a strictly stationary process is a sequence of i.i.d. random vectors
et , t ∈ Z. Let
Xt = f (et , et−1 , et−2 , · · · ), t ∈ Z (2.1.10)
where f (·) is an m-dimensional measurable function. Then it is seen that {Xt } is a strictly stationary
process (see Problem 2.2).
If we deal with strictly stationary processes, we have to specify the joint distribution at arbitrary
time points. In view of this we will introduce more convenient stationarity based on the first- and
second-order moments. Let {Xt : t ∈ Z} be an m-vector process whose components are possibly
complex valued. The autocovariance function Γ(·, ·) of {Xt } is defined by
Γ(t, s) = Cov(Xt , X s ) ≡ E{(Xt − EXt )(X s − EX s )∗ }, t, s ∈ Z, (2.1.11)
where ( )∗ denotes the complex conjugate transpose of ( ). If {Xt } is a real-valued process, then ( )∗
implies the usual transpose ( )′ .
STOCHASTIC PROCESSES AND LIMIT THEOREMS 9
Definition 2.1.9. An m-vector process {Xt : t ∈ Z} is said to be second order stationary if
(i) E(Xt∗ Xt ) < ∞ for all t ∈ Z,
(ii) E(Xt ) = c for all t ∈ Z, where c is a constant vector,
(iii) Γ(t, s) = Γ(0, s − t) for all s, t ∈ Z.

If {Xt } is second order stationary, then by (iii) Γ(t, s) depends only on s − t; hence we write
Γ(0, s − t) as Γ(s − t) for all s, t ∈ Z. The function Γ(h) is called the autocovariance function (matrix) of {Xt } at lag h. For
fundamental properties of Γ(h), see Problem 2.3.
Definition 2.1.10. An m-vector process {Xt : t ∈ Z} is said to be a Gaussian process if for each
t1 , . . . , tn ∈ Z, n ∈ N, the distribution of Xt1 , . . . , Xtn is multivariate normal.

Gaussian processes are very important and fundamental, and for them it holds that second order
stationarity is equivalent to strict stationarity (see Problem 2.5).
In what follows we shall state three theorems related to the spectral representation of second-
order stationary processes. For proofs, see, e.g., Hannan (1970) and Taniguchi and Kakizawa (2000).

Theorem 2.1.11. If Γ(·) is the autocovariance function of an m-vector second-order stationary
process {Xt }, then

Γ(t) = ∫_{−π}^{π} e^{itλ} dF(λ), t ∈ Z, (2.1.12)

where F(λ) is a matrix whose increments F(λ2 ) − F(λ1 ), λ2 ≥ λ1 , are nonnegative definite. The
function F(λ) is uniquely defined if we require in addition that (i) F(−π) = 0 and (ii) F(λ) is right
continuous.
The matrix F(λ) is called the spectral distribution matrix. If F(λ) is absolutely continuous with
respect to a Lebesgue measure on [−π, π] so that
F(λ) = ∫_{−π}^{λ} f (µ) dµ, (2.1.13)

where f (λ) is a matrix with entries f jk (λ), j, k = 1, . . . , m, then f (λ) is called the spectral density
matrix. If
∑_{h=−∞}^{∞} ||Γ(h)|| < ∞, (2.1.14)

where ||Γ(h)|| is the square root of the greatest eigenvalue of Γ(h)Γ(h)∗, the spectral density matrix
is expressed as

f (λ) = (1/2π) ∑_{h=−∞}^{∞} Γ(h) e^{−ihλ}. (2.1.15)
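As a sketch of (2.1.15) in the scalar case m = 1, the sum can be evaluated for an MA(1) process Xt = ut + θut−1 (θ and σ² chosen for illustration), whose autocovariances vanish beyond lag 1, and compared with the well-known closed form (σ²/2π)|1 + θe^{iλ}|²:

```python
import numpy as np

theta, sigma2 = 0.5, 1.0
lam = np.linspace(-np.pi, np.pi, 201)

# Autocovariances of X_t = u_t + theta*u_{t-1}: only lags -1, 0, 1 are nonzero
gamma = {0: (1 + theta ** 2) * sigma2, 1: theta * sigma2, -1: theta * sigma2}

# Spectral density via (2.1.15): f(lam) = (1/2pi) sum_h Gamma(h) e^{-ih*lam}
f_sum = sum(g * np.exp(-1j * h * lam) for h, g in gamma.items()).real / (2 * np.pi)

# Known closed form for the MA(1): f(lam) = (sigma2/2pi) |1 + theta e^{i*lam}|^2
f_closed = sigma2 * np.abs(1 + theta * np.exp(1j * lam)) ** 2 / (2 * np.pi)
```

The two curves agree term by term, since (1 + θ²) + θe^{−iλ} + θe^{iλ} = |1 + θe^{iλ}|².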

Next we state the spectral representation of {Xt }. For this the concept of an orthogonal increment
process and the stochastic integral is needed. We say that {Z(λ) : −π ≤ λ ≤ π} is an m-vector-valued
orthogonal increment process if
(i) E{Z(λ)} = 0, −π ≤ λ ≤ π,
(ii) the components of the matrix E{Z(λ)Z(λ)∗ } are finite for all λ ∈ [−π, π],
(iii) E[{Z(λ4 ) − Z(λ3 )}{Z(λ2 ) − Z(λ1 )}∗ ] = 0 if (λ1 , λ2 ] ∩ (λ3 , λ4 ] = ∅,
(iv) E[{Z(λ + δ) − Z(λ)}{Z(λ + δ) − Z(λ)}∗ ] → 0 as δ → 0.
If {Z(λ)} satisfies (i)-(iv), then there exists a unique matrix distribution function G on [−π, π]
such that

G(µ) − G(λ) = E[{Z(µ) − Z(λ)}{Z(µ) − Z(λ)}∗ ], λ ≤ µ.


Let L2 (G) be the family of matrix-valued functions M(λ) satisfying tr{∫_{−π}^{π} M(λ) dG(λ) M(λ)∗} < ∞.
For M(λ) ∈ L2 (G), there is a sequence of matrix-valued simple functions Mn (λ) such that

tr[∫_{−π}^{π} {Mn (λ) − M(λ)} dG(λ) {Mn (λ) − M(λ)}∗] → 0 (2.1.16)

as n → ∞ (e.g., Hewitt and Stromberg (1965)). Suppose that Mn (λ) is written as

Mn (λ) = ∑_{j=0}^{n} M_j^{(n)} χ_{(λ_j, λ_{j+1}]}(λ), (2.1.17)

where M_j^{(n)} are constant matrices, −π = λ0 < λ1 < · · · < λn+1 = π, and χ_A(λ) is the indicator
function of A. For (2.1.17), we define the stochastic integral by

∫_{−π}^{π} Mn (λ) dZ(λ) ≡ ∑_{j=0}^{n} M_j^{(n)} {Z(λ_{j+1}) − Z(λ_j)} = In (say).

Then it is shown that In converges to a random vector I in the sense that ||In − I|| → 0 as n → ∞,
where ||In ||2 =E(In∗ In ). We write this as I = l.i.m In (limit in the mean). For arbitrary M(λ) ∈ L2 (G)
we define the stochastic integral
∫_{−π}^{π} M(λ) dZ(λ) (2.1.18)

to be the random vector I defined as above.

Theorem 2.1.12. If {Xt : t ∈ Z} is an m-vector second-order stationary process with mean zero and
spectral distribution matrix F(λ), then there exists a right-continuous orthogonal increment process
{Z(λ) : −π ≤ λ ≤ π} such that
(i) E[{Z(λ) − Z(−π)}{Z(λ) − Z(−π)}∗ ] = F(λ), −π ≤ λ ≤ π,

(ii) Xt = ∫_{−π}^{π} e^{−itλ} dZ(λ). (2.1.19)
Although {Z(λ)} itself is difficult to grasp concretely, a good intuition for it is given by the relation

Z(λ2 ) − Z(λ1 ) = l.i.m_{n→∞} (1/2π) ∑′_{t=−n}^{n} Xt (e^{itλ2} − e^{itλ1})/(it), (2.1.20)

where ∑′ indicates that for t = 0 we take the summand to be (λ2 − λ1 )X0 (see Hannan (1970)).

Theorem 2.1.13. Suppose that {Xt : t ∈ Z} is an m-vector second-order stationary process with
mean zero, spectral distribution matrix F(·) and spectral representation (2.1.19), and that {A( j) :
j = 0, 1, 2, . . .} is a sequence of m × m matrices. Further we set

Yt = ∑_{j=0}^{∞} A( j)Xt− j , (2.1.21)

and

h(λ) = ∑_{j=0}^{∞} A( j) e^{i jλ}.

Then the following statements hold true:


(i) the necessary and sufficient condition that (2.1.21) exists as the l.i.m of partial sums is

tr{∫_{−π}^{π} h(λ) dF(λ) h(λ)∗} < ∞, (2.1.22)

(ii) if (2.1.22) holds, then the process {Yt : t ∈ Z} is second order stationary with autocovariance
function and spectral representation
ΓY (t) = ∫_{−π}^{π} e^{−itλ} h(λ) dF(λ) h(λ)∗ (2.1.23)

and

Yt = ∫_{−π}^{π} e^{−itλ} h(λ) dZ(λ), (2.1.24)

respectively.

If dF(λ) = (2π)^{−1} Σ dλ, where all the eigenvalues of Σ are bounded and bounded away from
zero, the condition (2.1.22) is equivalent to

∑_{j=0}^{∞} ||A( j)||² < ∞. (2.1.25)

Let {Ut } ∼ i.i.d.(0, Σ). If {A( j)} satisfies (2.1.25), then it follows from Theorem 2.1.13 that the
process

Xt = ∑_{j=0}^{∞} A( j)Ut− j (2.1.26)

is defined as the l.i.m of partial sums, and that {Xt } has the spectral density matrix

f (λ) = (1/2π) {∑_{j=0}^{∞} A( j) e^{i jλ}} Σ {∑_{j=0}^{∞} A( j) e^{i jλ}}∗. (2.1.27)
The process (2.1.26) is called the m-vector linear process, which includes m-vector AR(p), m-vector
ARMA(p, q) written as VAR(p) and VARMA(p, q), respectively; hence, it is very general.
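A numerical sanity check that (2.1.27) agrees with (2.1.15): for a bivariate MA(1) linear process the autocovariances are Γ(h) = ∑_j A(j)ΣA(j+h)∗, and both routes give the same spectral density matrix. The matrices A(0), A(1) and Σ below are illustrative choices, not from the text:

```python
import numpy as np

# Bivariate MA(1) linear process X_t = A0 U_t + A1 U_{t-1}; A(j), Sigma illustrative
A = [np.eye(2), np.array([[0.5, 0.2], [-0.3, 0.4]])]
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def f_linear(lam):
    """Spectral density matrix via (2.1.27)."""
    h = sum(Aj * np.exp(1j * j * lam) for j, Aj in enumerate(A))
    return h @ Sigma @ h.conj().T / (2 * np.pi)

def f_autocov(lam):
    """Same f via (2.1.15), using Gamma(h) = sum_j A(j) Sigma A(j+h)*."""
    out = np.zeros((2, 2), dtype=complex)
    for h in (-1, 0, 1):
        G = sum(A[j] @ Sigma @ A[j + h].conj().T
                for j in range(2) if 0 <= j + h < 2)
        out += G * np.exp(-1j * h * lam)
    return out / (2 * np.pi)
```

At any frequency the two matrices coincide, and each is Hermitian as a spectral density matrix must be.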
In the above, the dimension of the process is assumed to be finite. However, when the number of
cross-sectional time series (e.g., returns on different assets, or data disaggregated by sector or
region) is typically large, VARMA models are not appropriate, because we would need to estimate too many
unknown parameters. For this case, Forni et al. (2000) introduced a class of dynamic factor models
defined as follows:

Xt = ∑_{j=0}^{∞} B( j)ut− j + ξt = Yt + ξt , (say) (2.1.28)

where Xt = (X1 (t), . . . , Xm (t))′ , B( j) = {Bkl ( j) : k, l = 1, . . . , m} and ξt = (ξ1 (t), . . . , ξm (t))′ . They
impose the following assumption.
Assumption 2.1.14. (1) The process {ut } is uncorrelated with E{ut } = 0 and Var{ut } = Im for all
t ∈ Z.
(2) {ξt } is a zero-mean vector process, and is mutually uncorrelated with {ut }.
(3) ∑_{j=0}^{∞} ||B( j)||² < ∞.
(4) Letting the spectral density matrix of {Xt } be f^X(λ) = { f^X_{jl}(λ) : j, l = 1, . . . , m} (which exists
from the conditions (1)–(3)), there exists cX > 0 such that f^X_{j j}(λ) ≤ cX for all λ ∈ [−π, π] and
j = 1, . . . , m.
(5) Let the spectral density matrices of {Yt } and {ξt } be f^Y(λ) and f^ξ(λ), respectively, and let the jth
eigenvalue of f^Y(λ) and f^ξ(λ) be p^Y_{n, j}(λ) and p^ξ_{n, j}(λ), respectively. Then it holds that
(i) there exists cξ > 0 such that max_{1≤ j≤m} p^ξ_{n, j}(λ) ≤ cξ for any λ ∈ [−π, π] and any n ∈ N.
(ii) the first q largest eigenvalues p^Y_{n, j}(λ) of f^Y(λ) diverge almost everywhere in [−π, π].

Based on observations {X1 , . . . , Xn }, Forni et al. (2000) proposed a consistent estimator Ŷt of
Yt as m and n tend to infinity.

Example 2.1.15. (Spatiotemporal model) Recently the wide availability of data observed over time
and space has stimulated many studies in a variety of fields such as environmental science,
epidemiology, political science, economics and geography. Lu et al. (2009) proposed the following adaptive
varying-coefficient spatiotemporal model for such data:

Xt (s) = a{s, α(s)′ Yt (s)} + b{s, α(s)′ Yt (s)}′ Yt (s) + εt (s), (2.1.29)

where Xt (s) is the spatiotemporal variable of interest, t is time, and s = (u, v) ∈ S ⊂ R² is a
spatial location. Also, a(s, z) and b(s, z) are unknown scalar and d × 1 vector functions, α(s) is an
unknown d × 1 index vector, {εt (s)} is a noise process which, for each fixed s, forms a sequence of
i.i.d. random variables over time, and Yt (s) = {Yt1 (s), . . . , Ytd (s)}′ consists of time-lagged values of
Xt (·) in a neighbourhood of s and some exogenous variables. They introduced a two-step estimation
procedure for α(s), a(s, ·) and b(s, ·).

In Definition 2.1.7, we defined the martingale for sequences {Xn : n ∈ N}. In what follows we
generalize the notion to the case of an uncountable index set R+ , i.e., {Xt : t ≥ 0}. Let (Ω, F , P) be
a probability space. Let {Ft : t ≥ 0} be a family (filtration) of sub-σ-algebras satisfying
F_s ⊆ F_t ⊆ F, s ≤ t,
F_t = ∩_{s>t} F_s,
F_t = F_t^P,

where F_t^P stands for the completion of the σ-algebra F_t by the P-null sets from F . Then, the quadruple
(Ω, F , {Ft : t ≥ 0}, P) is called a stochastic basis. Suppose that a stochastic process {Xt : t ≥ 0} is
defined on (Ω, F , {Ft : t ≥ 0}, P), and Xt ’s are Ft -measurable. Then we say that they are adapted
with respect to {Ft : t ≥ 0}, and often write {Xt , Ft }.

Definition 2.1.16. A stochastic process {Xt , Ft } is said to be a martingale if


E|Xt | < ∞, t ≥ 0,
E{Xt |F s } = X s (P-a.s), s ≤ t.
A random variable τ = τ(ω) taking values in [0, ∞] is called a Markov time if
{ω : τ(ω) ≤ t} ∈ Ft , t ≥ 0.
The Markov times satisfying P(τ(ω) < ∞) = 1 are called stopping times.

Definition 2.1.17. A process {Xt , Ft } is called a local martingale if there exists a sequence {τn }
of stopping times such that τn (ω) ≤ τn+1 (ω), τn (ω) ր ∞ (P-a.s.) as n → ∞ and the processes
{Xτn ∧t , Ft } are martingales.
Henceforth we use the following notations:

D ≡ [{At , Ft } : ∫_0^t |dA_s (ω)| < ∞, t ≥ 0, ω ∈ Ω], i.e., processes of bounded variation,

and

Mloc ≡ [{Mt , Ft } : local martingales].
Definition 2.1.18. A stochastic process {Xt , Ft } is called a semimartingale if it admits the decom-
position
Xt = X0 + At + Mt , t≥0 (2.1.30)
where {At , Ft } ∈ D and {Mt , Ft } ∈ Mloc . The process {At } is called a compensator of {Xt }.
Under ordinary circumstances the process of interest can be modeled in terms of a signal plus
noise relationship:
process = signal + noise (2.1.31)
where the signal incorporates the predictable trend part of the model and the noise is the stochastic
disturbance. Semimartingales are precisely processes of this form. Sørensen (1991) discussed the process
dXt = θXt dt + dWt + dNt , t ≥ 0, X0 = x0 , (2.1.32)
where {Wt } is a standard Brownian motion and {Nt } is a Poisson process with intensity λ. This
process may be written in semimartingale form as
dXt = θXt dt + λdt + dMt , (2.1.33)
where
Mt = Wt + Nt − λt.
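The decomposition (2.1.33) can be checked by simulation: averaging over many replications, the compensated part Mt = Wt + Nt − λt should be roughly zero at the terminal time. The parameter values below (θ = −0.5, λ = 2) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, lam, T, n, n_paths = -0.5, 2.0, 1.0, 200, 50000
dt = T / n

W = np.zeros(n_paths)
N = np.zeros(n_paths)
X = np.zeros(n_paths)
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(dt), n_paths)         # Brownian increments
    dN = rng.poisson(lam * dt, n_paths).astype(float)  # Poisson jumps, intensity lam
    X = X + theta * X * dt + dW + dN                   # Euler step for (2.1.32)
    W += dW
    N += dN

M = W + N - lam * T          # compensated noise part of (2.1.33)
mean_M = M.mean()            # should be close to 0 (M is a martingale)
```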
In Example 2.1.3, we introduced a few nonlinear processes. In what follows we explain more
general ones.
Let {Xt } and {ut } be defined on a stochastic basis (Ω, F , {Ft }, P), where the Xt 's are Ft -measurable and the ut 's
are Ft -measurable and independent of Ft−1 . Bollerslev (1986) introduced a generalized
autoregressive conditional heteroscedastic model (GARCH(p, q)), which is defined by
Xt = ut √ht ,
ht = a0 + ∑_{j=1}^{q} a_j X²_{t− j} + ∑_{j=1}^{p} b_j h_{t− j} , (2.1.34)

where a0 > 0, a j ≥ 0, j = 1, . . . , q, and b j ≥ 0, j = 1, . . . , p.


Giraitis et al. (2000) introduced an ARCH(∞) model, which is defined by

Xt = ut √ht ,
ht = a0 + ∑_{j=1}^{∞} a_j X²_{t− j} , (2.1.35)

where a0 > 0, a j ≥ 0, j = 1, 2, . . .. The class of ARCH(∞) models is larger than that of stationary
GARCH(p, q) models.
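A minimal simulation sketch of a GARCH(1,1) path following (2.1.34), with illustrative coefficients satisfying a1 + b1 < 1 (a standard sufficient condition for second-order stationarity):

```python
import numpy as np

rng = np.random.default_rng(3)
a0, a1, b1, n = 0.1, 0.1, 0.8, 2000        # illustrative values; a1 + b1 < 1
h = np.empty(n)
X = np.empty(n)
h[0] = a0 / (1 - a1 - b1)                  # start at the unconditional variance
X[0] = np.sqrt(h[0]) * rng.standard_normal()
for t in range(1, n):
    h[t] = a0 + a1 * X[t - 1] ** 2 + b1 * h[t - 1]   # conditional variance recursion
    X[t] = np.sqrt(h[t]) * rng.standard_normal()     # X_t = u_t sqrt(h_t), u_t ~ N(0,1)
```

The simulated path shows the characteristic volatility clustering of returns.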
The following example gives a very general and persuasive model which includes ARCH, SE-
TAR, etc. as special cases.
Example 2.1.19. A stochastic process {Xt = (X1,t , . . . , Xm,t )′ : t ∈ Z} is said to follow a conditional
heteroscedastic autoregressive nonlinear (CHARN) model if it satisfies

Xt = Fθ (Xt−1 , · · · , Xt−p ) + Hθ (Xt−1 , · · · , Xt−q )Ut , (2.1.36)

where Ut = (U1,t , . . . , Um,t )′ ∼ i.i.d.(0, V), Fθ : R^{mp} → R^m , and Hθ : R^{mq} → R^{m×m} are


measurable functions, and θ = (θ1 , . . . , θr )′ ∈ Θ (an open set) ⊂ Rr is an unknown parameter
(Härdle et al. (1998)). This model is very general, and can be applied to analysis of brain and
muscular waves as well as financial time series analysis (Kato et al. (2006)).
When we analyze the nonlinear time series models, their stationarity is fundamental and im-
portant. We provide a sufficient condition for the CHARN model (2.1.36) to be stationary. De-
note by |A| the sum of the absolute values of all the elements of a matrix or vector A. Let
x = (x11 , . . . , x1m , x21 , . . . , x2m , . . . , x p1 , . . . , x pm )′ ∈ Rmp . Without loss of generality we assume
p = q in (2.1.36).
Theorem 2.1.20. (Lu and Jiang (2001)) Suppose that {Xt } is generated by the CHARN model
(2.1.36). Assume
(i) Ut has the probability density function p(u) > 0 a.e., u ∈ Rm .
(ii) There exist ai j ≥ 0, bi j ≥ 0, 1 ≤ i ≤ m, 1 ≤ j ≤ p, such that

|Fθ (x)| ≤ ∑_{i=1}^{m} ∑_{j=1}^{p} a_{i j} |x_{i j}| + o(|x|),
|Hθ (x)| ≤ ∑_{i=1}^{m} ∑_{j=1}^{p} b_{i j} |x_{i j}| + o(|x|), |x| → ∞.

(iii) Hθ (x) is a continuous and symmetric function with respect to x, and there exists λ > 0 such
that

{the minimum eigenvalue of Hθ (x)} ≥ λ

for all x ∈ Rmp .


(iv) max_{1≤i≤m} {∑_{j=1}^{p} a_{i j} + E|U1| ∑_{j=1}^{p} b_{i j}} < 1.

Then, {Xt } is strictly stationary.
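Condition (iv) is easy to verify numerically for a concrete model. For an illustrative scalar CHARN model (m = p = 1, not taken from the text) Xt = 0.3Xt−1 + √(0.1 + 0.2X²t−1) Ut with Ut ∼ i.i.d. N(0, 1), one can bound |Fθ| and |Hθ| as in (ii) and evaluate the left-hand side of (iv):

```python
import numpy as np

# Illustrative scalar CHARN model (m = p = 1, not from the text):
#   X_t = 0.3 X_{t-1} + sqrt(0.1 + 0.2 X_{t-1}^2) U_t,  U_t ~ i.i.d. N(0, 1)
# As |x| -> infinity, |F(x)| <= 0.3|x| and |H(x)| <= sqrt(0.2)|x| + o(|x|),
# so we may take a_11 = 0.3 and b_11 = sqrt(0.2).
a11, b11 = 0.3, np.sqrt(0.2)
E_abs_U = np.sqrt(2.0 / np.pi)        # E|U_1| for a standard normal variable
lhs = a11 + E_abs_U * b11             # left-hand side of condition (iv)
stationary = lhs < 1                  # condition (iv) holds -> strict stationarity
```

Here lhs ≈ 0.657 < 1, so Theorem 2.1.20 guarantees strict stationarity of this model.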


Suppose that we need to compute some statistical average of a strictly stationary process when
we observe just a single realization of it. In such a situation, is it possible to determine the statistical
average from an appropriate time average of a single realization? If the statistical (or ensemble)
average of the process equals the time average, the process will be called ergodic. We mention some
limit theorems related to the ergodic stationary process.
Let {Xt = (X1 (t), . . . , Xm (t))′ : t ∈ Z} be a strictly stationary process defined on a probability
space (Ω, F , P). The σ-algebra F is generated by the family of all cylinder sets

{ω |(Xt1 , . . . , Xtk ) ∈ B},

where B ∈ Bmk . For these cylinder sets we define the shift operator A → T A where, for a set A of
the form

A = {ω |(Xt1 , . . . , Xtk ) ∈ C}, C ∈ Bmk ,


we have

T A = {ω |(Xt1 +1 , . . . , Xtk +1 ) ∈ C}.

This definition extends to all sets in F . Since {Xt } is strictly stationary, A and T −1 A have the same
probability content. Then we say that T is measure preserving.
Definition 2.1.21. Given a measure preserving transformation T , a measurable event A is said to
be invariant if T −1 A = A.
Denote the collection of invariant sets by AI .
Definition 2.1.22. The process {Xt : t ∈ Z} is said to be ergodic if for all A ∈ AI , either P(A) = 0
or P(A) = 1.
Now we state the following two theorems (see Stout (1974, p. 182) and Hannan (1970, p. 204)).
Theorem 2.1.23. Suppose that a vector process {Xt : t ∈ Z} is strictly stationary and ergodic, and
that there is a measurable function φ : R∞ → Rk . Let Yt = φ(Xt , Xt−1 , . . .). Then {Yt : t ∈ Z} is
strictly stationary and ergodic.
Since the i.i.d. sequences are strictly stationary and ergodic, we have
Theorem 2.1.24. Suppose that {Ut } is a sequence of random vectors that are independent and
identically distributed with zero mean and finite covariance matrix. Let
Xt = ∑_{j=0}^{∞} A( j)Ut− j ,

where ∑_{j=0}^{∞} ||A( j)|| < ∞. Then {Xt : t ∈ Z} is strictly stationary and ergodic.
The following is essentially the pointwise ergodic theorem.
Theorem 2.1.25. If {Xt : t ∈ Z} is strictly stationary and ergodic and E||Xt || < ∞, then

(1/n) ∑_{t=1}^{n} Xt → E(X1 ) a.s. (2.1.37)

Also, if E||Xt ||² < ∞, then

(1/n) ∑_{t=1}^{n} Xt X′_{t+m} → E(X1 X′_{1+m}) a.s. (2.1.38)

Proof. The assertion (2.1.37) is exactly the pointwise ergodic theorem (e.g., Theorem 3.5.7 of Stout
(1974)). For (2.1.38), consider Xt X′_{t+m}. Fixing m and allowing t to vary, this constitutes a sequence
of random matrices which again constitutes a vector strictly stationary process. Since {Xt } is
ergodic, then so is {Xt X′_{t+m}}. Hence (2.1.38) follows from the pointwise ergodic theorem. □
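Both assertions can be illustrated numerically with a scalar AR(1) process Xt = φXt−1 + ut (an illustrative choice, strictly stationary and ergodic for |φ| < 1): the time averages in (2.1.37) and (2.1.38) approach E(X1) = 0 and E(X1X2) = φ/(1 − φ²), respectively:

```python
import numpy as np

rng = np.random.default_rng(4)
phi, n = 0.5, 200000
u = rng.standard_normal(n)
x = np.empty(n)
x[0] = u[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + u[t]        # AR(1): strictly stationary and ergodic

time_mean = x.mean()                    # (2.1.37): -> E(X_1) = 0
time_acov = (x[:-1] * x[1:]).mean()     # (2.1.38) with m = 1: -> E(X_1 X_2)
theory_acov = phi / (1 - phi ** 2)      # = 2/3 for phi = 0.5
```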
For martingales, we mention the following two theorems on the strong law of large numbers
(e.g., Hall and Heyde (1980), Heyde (1997)).
Theorem 2.1.26. (Doob’s martingale convergence theorem) If {Xn , Fn , n ∈ N} is a martingale such
that supn≥1 E|Xn | < ∞, then there exists a random variable X such that E|X| < ∞ and Xn → X a.s.
For a locally square integrable martingale M = {Mt } there exists a unique predictable process
⟨M, M⟩t for which

Mt² − ⟨M, M⟩t (2.1.39)

is a local martingale. The process ⟨M, M⟩t is called the quadratic characteristic of M. For an
example of ⟨M, M⟩t , see Problem 2.7. When {Mt } is a vector-valued local martingale, the quadratic
characteristic ⟨M, M′⟩t is the one for which Mt Mt′ − ⟨M, M′⟩t is a local martingale.
Theorem 2.1.27. Suppose that {Xt , t ≥ 0} is a locally square integrable martingale. Then,

Xt / At → 0 a.s., as t → ∞, (2.1.40)

for At ≡ f (⟨X, X⟩t ), where lim_{t→∞} At = ∞ and ∫_0^∞ dt / f (t)² < ∞.
We next state some central limit theorems which are useful to derive the asymptotic distribution
of statistics for stochastic processes.
Theorem 2.1.28. (Ibragimov (1963)) Let {Xt : t ∈ Z} be a strictly stationary ergodic process such
that E(Xt²) = σ², 0 < σ² < ∞, E{Xt |Ft−1 } = 0, a.s., where Ft = σ(X1 , . . . , Xt ). Then

(1/(σ√n)) ∑_{t=1}^{n} Xt →d N(0, 1).

Let {Xn,t : t = 1, . . . , kn } be an array of random variables on a probability space (Ω, F , P). Let
{Fn,t : 0 ≤ t ≤ kn } be any triangular array of sub σ-algebras of F such that for each n and 1 ≤ t ≤ kn ,
Xn,t is Fn,t -measurable and Fn,t−1 ⊂ Fn,t . We write S n = ∑_{t=1}^{kn} Xn,t .
Theorem 2.1.29. (Brown (1971)) Suppose that {Xn,t , Fn,t , 1 ≤ t ≤ kn } is a martingale difference
array satisfying
(i)

∑_{t=1}^{kn} E{X_{n,t}² χ(|Xn,t | > ε)} → 0 as n → ∞ for all ε > 0 (2.1.41)

(Lindeberg condition).
(ii)

∑_{t=1}^{kn} E(X_{n,t}² | Fn,t−1 ) →p 1. (2.1.42)

Then,

S n →d N(0, 1), (n → ∞).
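A simulation sketch of Theorem 2.1.29: take a martingale difference array Xn,t = ut g(ut−1) (a hypothetical construction chosen so that the Xn,t are dependent but satisfy E{Xn,t | Fn,t−1} = 0 and E(X²n,t) = 1); the normalized sum Sn/√n should then be approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 500, 20000
u = rng.standard_normal((n_rep, n))
# Martingale differences: X_t = u_t * g(u_{t-1}), so E{X_t | F_{t-1}} = 0 and
# E(X_t^2) = E{g(u_{t-1})^2} = 0.5 + 0.5 * E(u^2) = 1.
g = np.sqrt(0.5 + 0.5 * np.roll(u, 1, axis=1) ** 2)
g[:, 0] = 1.0                           # the first term has no past; set g = 1
X = u * g
S = X.sum(axis=1) / np.sqrt(n)          # normalized martingale sum
sample_mean, sample_var = S.mean(), S.var()   # approx 0 and 1 if S ~ N(0, 1)
```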
Problems

2.1. Let Xn = (X1 , . . . , Xn ) be a sequence of random variables forming a stochastic process, and
having the probability density pnθ (xn ), xn = (x1 , . . . , xn )′ , where θ ∈ Θ ⊂ R1 (Θ is an open set).
Assume that pnθ (·) is differentiable with respect to θ. Write the log-likelihood function based on
Xn as Ln (θ), and let S n = ∂/∂θ{Ln (θ)} (the score function). Assuming that ∂/∂θ and Eθ are
exchangeable, and the existence of moments of all the related quantities, show that {S n , Fn } is a martingale.

2.2. Let {Xt } be a process generated by (2.1.10). Then, show that it is strictly stationary.

2.3. Show that the autocovariance matrix Γ(·) = {rab (·) : a, b = 1, . . . , m} satisfies
(i) Γ(h) = Γ(−h)′ for all h ∈ Z,
(ii) |rab (h)| ≤ {raa (0)}^{1/2} {rbb (0)}^{1/2} for a, b = 1, . . . , m and h ∈ Z,
(iii) ∑_{a,b=1}^{n} v∗_a Γ(b − a) v_b ≥ 0 for all n ∈ N and v1 , . . . , vn ∈ Cm .

2.4. Show that a strictly stationary process with finite second-order moments is second order sta-
tionary.
2.5. For Gaussian processes, show that second order stationarity is equivalent to strict stationarity.

2.6. Let {Xt : t ∈ Z} be an m-vector process satisfying

Xt + Φ1 Xt−1 + · · · + Φ p Xt−p = Ut + Θ1 Ut−1 + · · · + Θq Ut−q , (P.1)

where Φ1 , . . . , Φ p , Θ1 , . . . , Θq are real m × m matrices and {Ut } ∼ i.i.d.(0, Σ). Defining Φ(z) =
I + Φ1 z + · · · + Φ p z^p , where I is the m × m identity matrix, we assume

det{Φ(z)} ≠ 0 for all z ∈ C such that |z| ≤ 1. (P.2)

The process {Xt } defined by (P.1) is called an m-vector autoregressive moving average process
of order (p, q), and is denoted by VARMA(p, q). Then, show that {Xt } is second order stationary,
and has the spectral density matrix

f (λ) = (1/2π) Φ(e^{iλ})^{−1} Θ(e^{iλ}) Σ Θ(e^{iλ})∗ Φ(e^{iλ})∗^{−1} , (P.3)

where Θ(z) = I + Θ1 z + · · · + Θq z^q .

2.7. Let a(t, ω) be a square integrable predictable process, and let {Wt } be the standard Wiener process.
Let X = {Xt , Ft , t ≥ 0} be generated by

Xt = ∫_0^t a(s, ω) dW_s .

Then, show that

⟨X, X⟩t = ∫_0^t a²(s, ω) ds.

2.8. Let {Xt } be generated by the ARCH(∞) model defined as (2.1.35), and let Ft =
F (. . . , X1 , X2 , . . . , Xt ). Then, show that {Xt , Ft } satisfies

E{Xt |Ft−1 } = 0, a.e., (P.4)

i.e., {Xt , Ft } is a martingale difference sequence. Hence, if {Xt } is an ARCH(q) or GARCH(p, q)
process (see (2.1.34)), then it also forms a martingale difference sequence.

2.9. Let {Xt } be generated by the CHARN(p, q) model defined by (2.1.36), and let Ft =
F (X1 , ..., Xt ). Then, show that {Xt , Ft } satisfies

E{Xt |Ft−1 } = Fθ (Xt−1 , . . . , Xt−p ), a.e.

2.10. Simulate a series of n = 100 observations from the model (2.1.1) with p = 1 and {ut } ∼
i.i.d. N(0, 1). Then, plot their graphs for the case of a1 = 0.1, 0.3, 0.5, 0.7 and 0.9.

2.11. Simulate a series of n = 100 observations from the model (2.1.5) with q = 1, a0 = 0.5 and
{ut } ∼ i.i.d. N(0, 1). Then, plot their graphs for the case of a1 = 0.1, 0.3, 0.5, 0.7 and 0.9.
Chapter 3

Portfolio Theory for Dependent Return Processes

3.1 Introduction to Portfolio Theory


Modern portfolio theory introduced by Markowitz (1952) has become a broad theory for
portfolio selection. He demonstrated how to reduce the standard deviation of the return (the portfolio risk) on a
portfolio of assets. The Markowitz theory of portfolio management deals with individual assets in
financial markets. It combines probability theory and optimization theory to model the behaviour
of investors. The investors are assumed to decide a balance between maximizing the expected re-
turn and minimizing the risk of their investment decision. Portfolio return is characterized by the
mean (i.e., expected return) and risk (defined by the standard deviation) of their portfolio of assets.
These mathematical representations of mean and risk have allowed optimization tools to be applied
to studies of portfolio management. The exact solution will depend on the level of the risk (in com-
parison with the rate of the mean) that the investors would bear. Even though many other models
may treat different risks in place of the standard deviation, the trade-off between mean and risk has
been the major issue which those theories try to solve.
If everybody uses the mean-variance approach to investing, and if everybody has the same esti-
mates of the asset’s expected returns, variances, and covariances, then everybody must invest in the
same fund F of risky and risk-free assets. Because F is the same for everybody, it follows that in
equilibrium, F must correspond to the market portfolio M, that is, a portfolio in which each asset
is weighted by its proportion of total market capitalization. This observation is the basis for the
capital asset pricing model (CAPM) proposed by Sharpe (1964). This model greatly simplifies
the input for portfolio selection and makes the mean-variance methodology into a practical applica-
tion. The CAPM result states that the expected return of each asset can be described as the function
of “beta” which expresses the covariance with the market portfolio. Since a beta of a portfolio is
equal to the weighted average of the betas of the individual assets that make up the portfolio, the
covariance structure is greatly simplified by using the betas. Although the CAPM brings us a benefit
for simplification, some researchers have questioned the validity of CAPM assumptions such as time
independence, uncorrelatedness and Gaussianity. To overcome these problems, several extensions
have been proposed.
In CAPM, the return of each asset is modeled as a regression with a single explanatory
variable (namely, the market return), which implies that the market risk is the only source of risk besides
the unique risk of each asset. However, there is some evidence that other common risk factors affect
the market risk. Companies within the same country or the same industry appear to have common
risks beyond the overall market risk. Also, research has suggested that companies with common
characteristics such as a high book-to-market value have common risks, though this is controversial.
A factor model that represents the connection between common risk factors and individual returns
also leads to a simplified covariance structure, and provides important insight into the relationships
among assets. The factor model framework leads to the arbitrage pricing theory (APT), introduced
by Ross (1976), which is a pricing theory based on the principle of the absence of arbitrage.

20 PORTFOLIO THEORY FOR DEPENDENT RETURN PROCESSES
The expected utility theory introduced by von Neumann and Morgenstern (1944) accounts for risk
aversion in financial decision making, and provides a more general and more useful approach than
the mean-variance approach. In view of the expected utility theory, the mean-variance approach has
two pitfalls: First, the probability distribution of each asset return is characterized only by its first
two moments. In the case of Gaussian distributions, the mean-variance model and utility theories
are largely compatible. The mean-variance-skewness model is introduced to relax the Gaussian
assumption and to consider the skewness of the portfolio return. Second, the dependence structure is
only described by the linear correlation coefficients of each pair of asset returns. In that case, serious
losses may be observed if extreme events are underestimated.
Recent years have seen increasing development of new tools for risk management analysis. Many
sources of risk have been identified, such as market risk, credit risk, counterparty default, liquidity
risk, operational risk and so on. One of the main problems concerning the evaluation and opti-
mization of risk exposure is the choice of risk measures. In portfolio theory, many risk measures
instead of the variance have been introduced. According to Giacometti and Lozza (2004), we can
distinguish two kinds of risk measures, namely, dispersion measures and safety measures. Dis-
persion measures include standard deviation, semivariance and mean absolute deviation. The “mean
semivariance model” and “mean absolute deviation model” are proposed by using an alternative
risk measure. The mean semivariance model deals with semivariance as the portfolio risk, which is
closer to reality than the mean-variance model. In order to solve large-scale portfolio optimization
problems, the mean absolute deviation model is considered, in which the portfolio risk is defined
as the absolute deviation. Since investment managers frequently associate risk with the failure to
attain a target return, the “mean target model” is introduced; this model penalizes the part of the return
distribution falling below the target. The safety measures describe the probability that the portfolio return
falls under a given level. The typical model using this measure is “safety first” introduced by Roy
(1952), Telser (1955) and Kataoka (1963). Value at risk (VaR) is the first attempt to take account of
fat tailed and non-Gaussian returns; for example, when volatilities are random and possible jumps
may occur, financial options are involved in the position, default risks are not negligible, or cross
dependence between assets is complex, etc. Since VaR is a risk measure that only takes account of the
probability of losses, and not of their size, other risk measures have been proposed. Among them,
the expected shortfall (ES) as defined in Acerbi and Tasche (2001), also called conditional value
at risk (CVaR) in Rockafellar and Uryasev (2000) or tail conditional expectation (TCE) in Artzner
et al. (1999), is a class of risk measures which take account both of the probability of losses and
of their size. Moreover, in Acerbi and Tasche (2001), coherence of these risk measures is proved.
“Pessimistic portfolio” introduced by Bassett et al. (2004) minimizes α-risks, which include ES,
CVaR and TCE. In order to explain the alternative calculation method for VaR or CVaR, we in-
troduce the copula dependence model. Rockafellar and Uryasev (2000) discussed the procedure
for calculating the mean-CVaR (or mean-VaR) portfolio based on the copula dependency structure.
Thanks to the copula structure, the modeling of the univariate marginal and the dependence struc-
ture can be separated. We show that once a copula function and marginal distribution functions are
determined, the mean-CVaR (or mean-VaR) portfolio is obtained.

3.1.1 Mean-Variance Portfolio


Suppose the existence of a finite number of assets indexed by i, (i = 1, . . . , m). Let Xt =
(X1 (t), . . . , Xm (t))′ denote the random returns on m assets at time t. Assuming the stationarity of {Xt },
write E(Xt ) = µ = (µ1 , . . . , µm )′ and Cov(Xt , Xt ) = Σ = (σi j )i, j=1,...,m (Σ is a nonsingular m × m
matrix). Let w = (w1 , . . . , wm )′ be the vector of portfolio weights. Then the return of the portfolio is
w′ Xt , and its expectation and variance are, respectively, given by µ(w) = w′ µ and η²(w) = w′ Σw.
Under the restriction ∑_{i=1}^{m} wi = 1, we can plot all portfolios as points (√(η²(w)), µ(w)) on the mean-
standard deviation diagram. The set of points that corresponds to portfolios is called the feasible set
INTRODUCTION TO PORTFOLIO THEORY 21
or feasible region, as shown in Figure 3.1. The feasible set1 satisfies the following two important
properties.
1. If m ≥ 3, the feasible set will be a two-dimensional region.
2. The feasible region is convex to the left.

Figure 3.1: Feasible set (feasible region) with short selling allowed

The left boundary of a feasible set is called the minimum-variance set, because for a fixed mean
rate of return, the feasible point with the smallest variance corresponds to the left boundary point.
The minimum-variance set has a characteristic bullet shape. The upper portion of the minimum-
variance set is called the efficient frontier of the feasible region, as shown in Figure 3.2. According

Figure 3.2: Minimum-variance set and efficient frontier. (a) Minimum-variance set; (b) efficient frontier

to Markowitz (1959), if a portfolio is represented by a point on the efficient frontier, the portfolio
is called an efficient portfolio or, more precisely, the mean variance efficient portfolio. In other
words, if a portfolio is not efficient, it is possible to find an efficient portfolio with either
1. greater expected return (µ(w)) but no greater variance (η2 (w)), or
2. less variance (or standard deviation) but no less expected return.
Mean variance efficient portfolio weights have been proposed by various criteria. The following are
typical ones.
Consider

min_{w∈R^m} η²(w) subject to µ(w) = µP , ∑_{i=1}^m wi = 1    (3.1.1)

1 The sale of an asset that is not owned is called "short selling." If short selling is allowed, the portfolio weight wi could be negative. Otherwise, additional constraints wi ≥ 0 for all i are added.
where µP is a given expected portfolio return by investors. This criterion indicates the variance
minimizer of the set of portfolios for a given expected return as µP . The point of the portfolio
(called the “minimum-variance portfolio”) on the mean-standard deviation diagram corresponds to
the point on the minimum-variance set. This portfolio is shown in Figure 3.3. Let e = (1, . . . , 1)′
(m-vector), A = e′ Σ−1 e, B = µ′ Σ−1 e, C = µ′ Σ−1 µ, and D = AC − B2 . The solution for w is given
by
w0 = ((µP A − B)/D) Σ⁻¹µ + ((C − µP B)/D) Σ⁻¹e.    (3.1.2)
In the case that the constraint µ(w) = µP is eliminated, the solution is given by
w0 = (1/A) Σ⁻¹e.    (3.1.3)
A
This portfolio is called the “global minimum variance portfolio.” The point of this portfolio on the
mean-standard deviation diagram corresponds to the minimum-variance point in Figure 3.3. These
solutions are obtained by solving the Lagrangian.
Next we consider
max_{w∈R^m} µ(w) subject to η²(w) = σP², ∑_{i=1}^m wi = 1    (3.1.4)

where σP is a given standard deviation of the portfolio return specified by investors. This criterion indicates the mean maximizer of the set of portfolios for a given standard deviation σP of the portfolio return. This portfolio is also shown in Figure 3.3. If J = B²(1 − σP²A)² − A(1 − σP²A)(C − σP²B²) ≥ 0 and σP²A ≥ 1 are satisfied, then the solution for w is given by

w0 = (1/(2λ1)) Σ⁻¹µ − (λ2/(2λ1)) Σ⁻¹e    (3.1.5)

where

λ1 = (B − λ2 A)/2,   λ2 = (B(1 − σP²A) + √J) / (A(1 − σP²A)).

Note that if σP²A = 1, the solution describes the global minimum-variance portfolio.

Figure 3.3: Mean variance efficient portfolio

Suppose there exists a risk-free asset with the return µ f . Consider

max_{w∈R^m} (µ(w) − µf )/η(w) subject to ∑_{i=1}^m wi = 1.    (3.1.6)
This efficient portfolio is called the “tangency portfolio.” Given a point in the feasible region, we
draw a line (l f ) between the risk-free asset and that point. When the line is a tangency line of the
efficient frontier, the tangency point describes the tangency portfolio shown in Figure 3.4. To calculate the solution, we denote by θ the angle between the line (lf ) drawn to each point in the feasible region and the horizontal axis. For any feasible portfolio, we have tan θ = (µ(w) − µf )/η(w). Since the tangency portfolio is the feasible point that maximizes θ or, equivalently, maximizes tan θ, the solution
is given by

w0 = Σ⁻¹(µ − µf e) / (e′Σ⁻¹(µ − µf e)).    (3.1.7)

Figure 3.4: Tangency portfolio
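The tangency portfolio (3.1.7) can likewise be computed directly. The sketch below uses hypothetical µ, Σ and µf; the helper tan_theta computes the slope tan θ of the line lf through a given feasible point.

```python
import numpy as np

def tangency_weights(mu, Sigma, mu_f):
    """Solution (3.1.7): weights of the tangency portfolio."""
    e = np.ones(len(mu))
    z = np.linalg.inv(Sigma) @ (mu - mu_f * e)
    return z / (e @ z)

# hypothetical inputs
mu = np.array([0.05, 0.08, 0.12])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu_f = 0.02
w_t = tangency_weights(mu, Sigma, mu_f)

def tan_theta(w):
    # slope of the line l_f through the feasible point (sqrt(w'Sigma w), w'mu)
    return (w @ mu - mu_f) / np.sqrt(w @ Sigma @ w)
```

Among all fully invested portfolios, w_t attains the largest slope tan θ, which is the defining property of the tangency point.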
The above portfolio selection problems such as (3.1.1), (3.1.4) and (3.1.6) consider cases where
short sales for risky assets are allowed and lending or borrowing for a riskless asset is possible.
If these assumptions are not satisfied, we need to add some additional constraints. In these cases,
we cannot obtain explicit solutions, but we can get solutions by using “the quadratic programming
method.” The effect of these assumptions is explained by using the feasible region, as shown in
Figure 3.5.
(a) Short sales for risky assets are not allowed and there is no risk-free asset: in this case, the fea-
sible region is formed by all portfolios with nonnegative weights, which implies that additional
constraints wi ≥ 0 for all i are added to the optimization problems.
(b) Short sales for risky assets are allowed and there is no risk-free asset: obviously the feasible
region in this case contains that without short selling. However, the efficient frontiers of these
two regions may partially coincide.
(c) Short sales for risky assets are allowed and only lending for a risk-free asset is allowed: in this
case, the feasible region is surrounded by the finite line segments between the risk-free asset and
the points in the feasible region, which implies that an additional constraint w f = 1 − w′ e ≥ 0,
that is, the nonnegative weight of the risk-free asset, is added to the optimization problems.
(d) Short sales for risky assets are allowed and both borrowing and lending for a risk-free asset
are allowed: in this case, the feasible region is an infinite triangle whenever a risk-free asset is
included in the universe of available assets.

3.1.2 Capital Asset Pricing Model


One of the important problems in the discipline of investment science is to determine the equilibrium
price of an asset. The capital asset pricing model (CAPM) developed primarily by Sharpe (1964),
Figure 3.5: Feasible region

Lintner (1969) and Mossin (1966) answers this problem. The CAPM follows logically from the
Markowitz mean-variance portfolio theory. The CAPM starts with the following assumptions.2
Assumption 3.1.1. The assumptions of CAPM
1. The market prices are “in equilibrium.” In particular, for each asset, supply equals demand.
2. Everyone has the same forecasts of expected returns and risks.
3. All investors choose portfolios optimally (efficiently) according to the principles of mean vari-
ance efficiency.
4. The market rewards investors for assuming unavoidable risk, but there is no reward for needless risks due to inefficient portfolio selection.
Suppose that there exists a market portfolio with the expectation µ M and the standard deviation
σ M , such as S&P500 and Nikkei 225. Then, we can plot the market portfolio on a µ − σ diagram.
Given a risk-free asset with the return µ f , we can draw a single straight line passing through the
risk-free point and the market portfolio. This line is called the capital market line (CML). (See
Figure 3.6.)
The CML relates the expected rate of return of an efficient portfolio to its standard deviation,
but it does not show how the expected rate of return of an individual asset relates to its individual
risk. This relation is expressed by the CAPM. Under Assumption 3.1.1, if the market portfolio is
mean variance efficient, the expected return µi = E[Xi (t)] of any asset i in the market can be written as

µi − µ f = βi (µ M − µ f ) (3.1.8)

where
βi = σiM / σM²    (3.1.9)

and σiM is the covariance between the returns on the asset i and the market portfolio. The value βi is
2 See, e.g., Luenberger (1997).

Figure 3.6: Capital market line

referred to as the beta of an asset. An asset’s beta gives important information about the asset’s risk
characteristics by using the CAPM formula. The value µi − µ f is termed the expected excess rate of
return (or the risk premium) of asset i; it is the amount by which the rate of return is expected to
exceed the risk-free rate. It is easy to calculate the overall beta of a portfolio in terms of the betas of
the individual assets in the portfolio. Suppose, for example, that a portfolio contains m assets with
the weight vector w = (w1 , . . . , wm )′ . It follows immediately that

µ(w) − µ f = β(w)(µ M − µ f ) (3.1.10)

where β(w) = w′ β and β = (β1 , . . . , βm )′ . In other words, the portfolio beta is just the weighted
average of the betas of the individual assets in the portfolio, with the weights being identical to those
that define the portfolio. The CAPM formula can be expressed in graphical form by regarding the
formula as a linear relationship. This relationship is termed the security market line (SML). (See
Figure 3.7.) Under the equilibrium conditions assumed by the CAPM, any asset should fall on the
SML.
Figure 3.7: Security market line. (a) SML on the µ − σiM diagram; (b) SML on the µ − β diagram
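A minimal simulation can illustrate how the betas in (3.1.9) and the portfolio beta in (3.1.10) are recovered from return data. All numbers below (betas, volatilities, sample size) are hypothetical assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20_000
mu_f, mu_M, sigma_M = 0.01, 0.06, 0.15
beta = np.array([0.8, 1.0, 1.3])           # hypothetical asset betas
sigma_eps = np.array([0.10, 0.05, 0.20])   # hypothetical idiosyncratic vols

# simulate returns from the regression (3.1.11)
X_M = mu_M + sigma_M * rng.standard_normal(T)
eps = sigma_eps * rng.standard_normal((T, 3))
X = mu_f + beta * (X_M[:, None] - mu_f) + eps

# recover beta_i = sigma_iM / sigma_M^2 as in (3.1.9)
beta_hat = np.array([np.cov(X[:, i], X_M)[0, 1] for i in range(3)])
beta_hat /= np.var(X_M, ddof=1)

# the portfolio beta (3.1.10) is the weighted average of the asset betas
w = np.array([0.5, 0.3, 0.2])
beta_portfolio = w @ beta_hat
```

The estimated betas converge to the true ones as T grows, and the simulated mean portfolio return stays close to the SML value µf + β(w)(µM − µf).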

Let Xi (t) be the return at time t on the ith asset. Similarly, let XM (t) and µf be the return on the market portfolio at time t and the return on the risk-free asset, respectively. Consider the following regression model:

Xi (t) = µ f + βi (X M (t) − µ f ) + ǫi (t) (3.1.11)

where ǫi (t) is a random variable with E(ǫi (t)) = 0, V(ǫi (t)) = σ²ǫi , cov(ǫi (t), XM (t)) = 0 and cov(ǫi (t), ǫ j (t)) = 0 for i ≠ j.3 Taking expectations in (3.1.11), we get the form (3.1.8), which is
3 This assumption is essentially the reason why we should hold a large portfolio of all stocks. But this assumption is some-

times unrealistic. In that case, correlation among the stocks in a market can be modeled using a “factor model” introduced in
the next section.
the SML. In addition, we have

σ2i = β2i σ2M + σ2ǫi (3.1.12)

where σ2i = V(Xi (t)) and σ2M = V(X M (t)). In the above equation, β2i σ2M is called the systematic
risk, which is the risk associated with the market as a whole. This risk cannot be reduced by diver-
sification because every asset with nonzero beta contains this risk. On the other hand, σ2ǫi is called
nonsystematic, idiosyncratic, or specific risk, which is uncorrelated with the market and can be
reduced by diversification. When you consider appropriate total asset number m of a portfolio in
this model, it is worth it to reduce the nonsystematic risk for portfolios. Ruppert (2004, Example
7.2.) shows that the nonsystematic risk for a portfolio is decreasing as the asset number m tends to
increase. Moreover, you can eliminate the nonsystematic risk by holding a diversified portfolio in
which securities are held in the same relative proportions as in a broad market index. Suppose we
have known βi and σ2ǫi for each asset in a portfolio and also known µ M and σ2M for the market and
µ f for a risk-free asset. Then, under the CAPM, we can compute the expectation and variance of the
portfolio return w′ Xt as
µ(w) = µf + w′β(µM − µf ),   η²(w) = w′(ββ′σM² + Σǫ )w    (3.1.13)

where β = (β1 , . . . , βm )′ and Σǫ = diag(σ2ǫ1 , . . . , σ2ǫm ). By substituting (3.1.13) into (3.1.1), (3.1.4)
or (3.1.6), we can obtain mean variance efficient portfolios in terms of β, Σǫ , σ2M , µ M , and µ f . Note
that the CAPM reduces the number of parameters. If you assume the CAPM, the number of parameters is 2 + 2m (i.e., β1 , . . . , βm , σ²ǫ1 , . . . , σ²ǫm , µM and σM²). Otherwise, 2m + m(m−1)/2 parameters (i.e., µ1 , . . . , µm , σ11 , . . . , σmm ) are needed. That is why the CAPM greatly simplifies the input for portfolio selection. However, there are the following problems with the validity of the CAPM assumptions.
1. Any or all of the quantities β, Σǫ , σ2M , µ M , and µ f could depend on time t. However, it is generally
assumed that β and Σǫ as well as σ2M and µ M of the market are independent of time t so that these
parameters can be estimated assuming stationarity of the time series of returns.
2. The assumption that ǫi (t) is uncorrelated with ǫ j (t) for i , j is essential to the decomposition
of the ith risk into the systematic risk and the nonsystematic risk. This assumption is sometimes
unrealistic (see Ruppert (2004, Section 7.4.3)).
3. The CAPM is the basis of the mean variance portfolio and this portfolio fundamentally relies on
the description of the probability distribution function (pdf) of asset returns in terms of Gaussian
functions (see, e.g., Malevergne and Sornette (2005)).
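The parameter counts quoted above (2 + 2m under the CAPM versus 2m + m(m − 1)/2 otherwise, e.g., 2,002 versus 501,500 for m = 1,000) and the portfolio moments (3.1.13) can be verified with a short sketch; the 3-asset inputs below are hypothetical.

```python
import numpy as np

m = 1000
n_params_full = 2 * m + m * (m - 1) // 2   # means, variances and covariances
n_params_capm = 2 + 2 * m                  # beta_i, sigma_eps_i^2, mu_M, sigma_M^2

# portfolio moments (3.1.13) for hypothetical 3-asset inputs
beta = np.array([0.8, 1.0, 1.3])
Sigma_eps = np.diag(np.array([0.10, 0.05, 0.20]) ** 2)
mu_f, mu_M, sigma_M2 = 0.01, 0.06, 0.15 ** 2
w = np.array([0.5, 0.3, 0.2])

mu_w = mu_f + (w @ beta) * (mu_M - mu_f)
eta2_w = w @ (np.outer(beta, beta) * sigma_M2 + Sigma_eps) @ w

# decomposition into systematic and nonsystematic risk
systematic = (w @ beta) ** 2 * sigma_M2
nonsystematic = w @ Sigma_eps @ w
```

The identity w′(ββ′σM²)w = (w′β)²σM² makes the systematic/nonsystematic split explicit.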

As for problem 1, Aue et al. (2012), Chochola et al. (2013, 2014) and others have constructed
sequential monitoring procedures for the testing of the stability of portfolio betas. It is well known
that if the model parameters βi in the CAPM vary over time, the pricing of assets and predictions
of risks may be incorrect and misleading. Therefore, some procedures for the detection of struc-
tural breaks in the CAPM have been proposed. On the other hand, Merton (1973) developed the
intertemporal CAPM (ICAPM) in which investors act so as to maximize the expected utility of life-
time consumption and can trade continuously in time. In this model, unlike the single-period model,
current demands are affected by the possibility of uncertain changes in future investment opportu-
nities. Moreover, Breeden (1979) pointed out that Merton’s intertemporal CAPM is quite important
from a theoretical standpoint and introduced the consumption-based CAPM (CCAPM).
As for problem 2, the correlation among the stocks in a market sector can be modeled using the “fac-
tor model.” For instance, Chamberlain (1983) and Chamberlain and Rothschild (1983) introduced
an approximate factor structure. They claimed that the uncorrelated assumption for the traditional
factor model included in CAPM is unnecessarily strong and can be relaxed.
As for problem 3, multi-moments CAPMs have been introduced. Since it is well known that the dis-
tributions of the actual asset returns have fat tails, Rubinstein (1973), Kraus and Litzenberger (1976),
Lim (1989) and Harvey and Siddique (2000) have underlined and tested the role of the skewness of
the distribution of asset returns. In addition, Fang and Lai (1997) and Hwang and Satchell (1999)
have introduced a four-moment CAPM to take into account the leptokurtic behaviour of the asset re-
turn distribution. Many other extensions have been presented, such as the VaR-CAPM by Alexander
and Baptista (2002) or the distribution-CAPM by Polimenis (2002).

The CAPM is useful not only for the portfolio selection problem but also for other investment problems such as investment implications, performance evaluation and project choice (see, e.g., Luenberger (1997)).

3.1.3 Arbitrage Pricing Theory


The information required by the mean-variance approach grows substantially as the number m of
assets increases. There are m mean values, m variances, and m(m − 1)/2 covariances: a total of
2m+m(m−1)/2 parameters. When m is large, this is a very large set of required values. For example,
if we consider a portfolio based on 1,000 stocks, 501,500 values are required to fully specify a mean-
variance model. Clearly, it is a formidable task to obtain this information directly so that we need
a simplified approach. Fortunately, the CAPM greatly simplifies the number of required values by
using “beta” (in the case of the above example, 2,002 values are required). However, in the CAPM,
the market risk factor is the only risk based on the correlation between asset returns. Indeed, there is
some evidence of other common risk factors besides the market risk. Factor models generalize the
CAPM by allowing more factors than simply the market risk and the unique risk of each asset. This
model leads to a simplified structure for the covariance matrix, and provides important insight into
the relationships among assets. The factors used to explain randomness must be chosen carefully, and the proper choice depends on the universe of assets under consideration: for some universes the factors might be population, employment rate, and school budgets; for common stocks listed on an exchange, the factors might be the stock market average, gross national product, employment rate, and so forth. Selection of factors is somewhat of an art, or a trial-and-error process, although formal analysis methods can also be helpful. The factor
model framework leads to an alternative theory of asset pricing, termed arbitrage pricing theory
(APT). This theory does not require the assumption that investors evaluate portfolios on the basis
of means and variances; only that, when returns are certain, investors prefer greater return to lesser
return. In this sense the theory is much more satisfying than the CAPM theory, which relies on both
the mean-variance framework and a strong version of equilibrium, which assumes that everyone
uses the mean-variance framework. However, the APT requires a special assumption, that is, the
universe of assets being considered is large. In this theory, we must assume that there are an infinite
number of securities, and that these securities differ from each other in nontrivial ways. For example,
if you consider the universe of all publicly traded U.S. stocks, this assumption is satisfied well.
Time Series Factor Model Suppose there exist p kinds of risk factor F1 (t), . . . , F p (t) which depend
on time t. In a “time series multifactor model,”4 the random return on m assets at time t is written as

Xt = a + bFt + ǫt (3.1.14)

where a = (a1 , . . . , am )′ is an m-dimensional constant vector, b = (bi j )i=1,...,m, j=1,...,p is an m × p


constant matrix, Ft = (F1 (t), . . . , Fp (t))′ are p-dimensional random vectors with a p-dimensional mean vector µF and a p × p covariance matrix ΣF = (σ f i j )i, j=1,...,p , and ǫt are m-dimensional random
vectors with a mean 0 and an m × m covariance matrix Σǫ = diag(σ2ǫ1 , . . . , σ2ǫm ). In addition, it is
usually assumed that the errors ǫt are uncorrelated with Ft , that is, Cov(ǫt , F s ) = 0. These are
idealizing assumptions which may not actually be true, but are usually assumed to be true for the
purpose of analysis. In this model, the ai ’s are called intercepts because ai is the intercept of the
4 An alternative type of model is a cross-sectional factor model, which is a regression model using data from many assets

but from only a single holding period (e.g., Ruppert (2004, Section 7.8.3)).
line for asset i with the vertical axis, and the bi ’s are called factor loadings because they measure
the sensitivity of the return to the factor.
If you consider the case of one factor (i.e., p = 1), it is called a single-factor model. In particular,
if the factor Ft is chosen to be the excess rate of return on the market XM (t) − µf , then we can set a = µf e and b = β, and the single-factor model is identical to the CAPM. A factor can be
anything that can be measured and is thought to affect asset returns. According to Ruppert (2004),
the following are examples of factors:
• Return on market portfolio (market model, e.g., CAPM)
• Growth rate of GDP
• Interest rate on short-term Treasury bills
• Inflation rate
• Interest rate spreads
• Return on some portfolio of stocks
• The difference between the returns on two portfolios.
Suppose that a, b, µF , ΣF and Σǫ are known. Then, under the time series factor model, we can
compute the expectation and variance of the portfolio return w′ Xt as

µ(w) = w′(a + bµF ),   η²(w) = w′(bΣF b′ + Σǫ )w.    (3.1.15)

By substituting (3.1.15) into (3.1.1), (3.1.4) or (3.1.6), we can obtain mean variance efficient
portfolios in terms of a, b, µF , ΣF and Σǫ . In the usual representation of asset returns, a total of
2m+m(m−1)/2 parameters are required to specify means, variances and covariances. In this model,
a total of just 2m + mp + p + p(p − 1)/2 parameters are required. Similarly to the CAPM, under
the time series multifactor model, the portfolio risk η2 (w) can be divided into the systematic risk
w′ bΣF b′ w and the nonsystematic risk w′ Σǫ w. The point of APT is that the values of a and b must
be related if arbitrage opportunities5 are to be excluded. This means that in the efficient portfolio
based on µ(w) and the systematic risk, there is no arbitrage opportunity. On the other hand, the
nonsystematic risk can be reduced by diversification because this term’s contribution to overall risk
is essentially zero in a well-diversified portfolio.
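The moment formulas (3.1.15) and the systematic/nonsystematic decomposition can be sketched as follows; the loadings b, the factor moments µF and ΣF, and the parameter-count example (m = 1,000 assets, p = 5 factors) are illustrative assumptions.

```python
import numpy as np

# hypothetical loadings and factor moments for m = 4 assets, p = 2 factors
a = np.array([0.01, 0.00, 0.02, 0.01])
b = np.array([[1.0, 0.2],
              [0.8, 0.5],
              [1.2, -0.1],
              [0.9, 0.3]])
mu_F = np.array([0.05, 0.02])
Sigma_F = np.array([[0.020, 0.004],
                    [0.004, 0.010]])
Sigma_eps = np.diag(np.array([0.05, 0.04, 0.06, 0.03]) ** 2)
w = np.full(4, 0.25)

# portfolio moments (3.1.15)
mu_w = w @ (a + b @ mu_F)
eta2_w = w @ (b @ Sigma_F @ b.T + Sigma_eps) @ w
systematic = w @ b @ Sigma_F @ b.T @ w
nonsystematic = w @ Sigma_eps @ w

# parameter counts as given in the text, for m = 1000 assets and p = 5 factors
m, p = 1000, 5
n_params_factor = 2 * m + m * p + p + p * (p - 1) // 2
n_params_full = 2 * m + m * (m - 1) // 2
```

For large m the factor-model count (7,015 here) is a small fraction of the full count (501,500), which is the simplification the section emphasizes.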

3.1.4 Expected Utility Theory


In order to model any decision problem under risk, it is necessary to introduce a functional repre-
sentation of preferences which measures the degree of satisfaction of the decision maker. This is the
purpose of the utility theory.
Suppose that there exists a set of possible outcomes which may have an impact on the con-
sequences of the decisions: Ω = {ω1 , . . . , ωk }. Let p = {p1 , . . . , pk } be the probability of occur-
P
rence of Ω satisfied with ∀i, 0 ≤ pi ≤ 1 and ki=1 pi = 1. Then, a lottery L is defined by a vector
{(ω1 , p1 ), . . . , (ωk , pk )} and the set of all lotteries is defined by L. Moreover, a compound lottery Lc
can be defined as a lottery whose outcomes are also lotteries.6
The decision maker is assumed to be "rational" if one's preference relation (denoted by ⪰) over the set of lotteries L is a binary relation which satisfies the following axioms:
Axiom 3.1.2.
5 “Arbitrage” means to earn money without investing anything. In other words, to get any return without any risk and cost
(e.g., Luenberger (1997, Section 1.2.)).
6 Assume that a compound lottery Lc has two outcomes, lottery La = {(ω1 , pa1 ), . . . , (ωk , pak )} and lottery Lb = {(ω1 , pb1 ), . . . , (ωk , pbk )}, where the probability that La becomes the outcome is ǫ and the probability that Lb becomes the outcome is 1 − ǫ. Then, the probability that the outcome of Lc will be ωi is given by pci = ǫ pai + (1 − ǫ)pbi . Therefore, Lc has the same vector of probabilities as the convex combination Lc = ǫLa + (1 − ǫ)Lb .
• The relation ⪰ is complete, that is, all lotteries are always comparable by ⪰: ∀La , ∀Lb ∈ L, La ⪰ Lb or Lb ⪰ La .
• The indifference relation ∼ associated to the relation ⪰ is defined by ∀La , ∀Lb ∈ L, La ∼ Lb ⇔ La ⪰ Lb and Lb ⪰ La .
• The relation ⪰ is reflexive: ∀L ∈ L, L ⪰ L.
• The relation ⪰ is transitive: ∀La , ∀Lb , ∀Lc ∈ L, La ⪰ Lb and Lb ⪰ Lc ⇒ La ⪰ Lc .
Another standard assumption is continuity: small changes on probabilities do not modify the
ordering between two lotteries. This property is specified in the following axiom (continuity):
Axiom 3.1.3. The preference relation ⪰ on the set L of lotteries is such that ∀La , ∀Lb , ∀Lc ∈ L, if La ⪰ Lb ⪰ Lc then there exists a scalar ǫ ∈ [0, 1] such that Lb ∼ ǫLa + (1 − ǫ)Lc .
This continuity axiom implies the existence of a functional U : L → R such that

∀La , ∀Lb ∈ L, La ⪰ Lb ⇔ U(La ) ≥ U(Lb ).

To develop the analysis of economics under uncertainty, additional properties must be imposed
on the preferences. One of the most important additional conditions is the independence axiom:
Axiom 3.1.4. The preference relation ⪰ on the set L of lotteries is such that ∀La , ∀Lb , ∀Lc ∈ L and for all ǫ ∈ [0, 1], La ⪰ Lb ⇔ ǫLa + (1 − ǫ)Lc ⪰ ǫLb + (1 − ǫ)Lc .
The independence axiom, implicitly introduced in von Neumann and Morgenstern (1944), char-
acterizes the expected utility criterion: indeed, it implies that the preference functional U on the
lotteries must be linear in the probabilities of the possible outcomes:
Lemma 3.1.5. (existence of utility function) Assume that the preference relation ⪰ on the set L of lotteries satisfies the continuity and independence axioms. Then, the relation ⪰ can be represented by a preference function that is linear in probabilities: there exists a function u defined on the space of possible outcomes Ω and with values in R such that for any two lotteries La = {(pa1 , . . . , pak )} and Lb = {(pb1 , . . . , pbk )}, we have

La ⪰ Lb ⇔ ∑_{i=1}^k u(ωi )pai ≥ ∑_{i=1}^k u(ωi )pbi .

Proof: See Theorem 1.1 of Prigent (2007).

Using the properties of concavity/convexity of utility functions, we deduce a characterization of risk aversion:
Lemma 3.1.6. (characterization of risk aversion) Let u be a utility function representing preferences
over the set of outcomes. Assume that u is increasing. Then
1) The function u is concave if and only if the investor is risk averse.
2) The function u is linear if and only if the investor is risk neutral.
3) The function u is convex if and only if the investor is risk loving.
Proof: See Theorem 1.2 of Prigent (2007).

Let X be a random variable which represents the outcomes of a lottery L, and suppose that a utility function u exists. Then the certain amount C[X] satisfying u(C[X]) = E[u(X)] is called the certainty equivalent of the lottery L. In addition, the difference π[X] = E[X] − C[X] is called the risk premium, as introduced in Pratt (1964). By using the risk premium, we can measure the "degree"
of risk aversion of an investor, which is equivalent to the Arrow–Pratt measure as below.
Lemma 3.1.7. (degree of risk aversion) Let u and v be two utility functions representing preferences
over wealth. Assume that they are continuous, monotonically increasing, and twice differentiable.
Then the following properties are equivalent and characterize the “more risk aversion”:
1) The derivatives of both utility functions are such that −(∂²u(x)/∂x²)/(∂u(x)/∂x) > −(∂²v(x)/∂x²)/(∂v(x)/∂x), for every x in R.
2) There exists a concave function Ψ such that u(x) = Ψ[v(x)], for every x in R.
3) The risk premium satisfies πu (X) ≥ πv (X), for all random real valued variables X.
Proof: See Theorem 1.3 of Prigent (2007).
Definition 3.1.8. The term A(x) = −(∂²u(x)/∂x²)/(∂u(x)/∂x) is called the Arrow–Pratt measure of absolute risk aversion (ARA). Another measure allows us to take account of the level of wealth: the ratio R(x) = xA(x), which is called the Arrow–Pratt measure of relative risk aversion (RRA).
From the above risk aversion measures, we can characterize some standard utility functions, and
in particular the general class of HARA utilities, which are useful to get analytical results.
Definition 3.1.9. A utility function u is said to have harmonic absolute risk aversion (HARA) if
the inverse of its absolute risk aversion is linear in wealth.7
Usually, three subclasses are distinguished:
Constant absolute risk aversion (CARA) In this class, the ARA A(x) is constant (= A), the RRA
R(x) is increasing in wealth and the utility function u takes the form
u(x) = −exp(−Ax)/A.
Constant relative risk aversion (CRRA) In this class, the RRA R(x) is constant (= R), and the
utility function u takes the form
u(x) = x^(1−R)/(1 − R) if R ≠ 1, and u(x) = ln[x] if R = 1.    (3.1.16)
This kind of utility function exhibits a decreasing absolute risk aversion (DARA).
Quadratic utility function The quadratic utility function u takes the form u(x) = a(b − x)². Note that we have to restrict its domain to (−∞, b], on which u is increasing (if a < 0), and the ARA is increasing with wealth.
Other utility theories, such as weighted utility theory, rank-dependent expected utility theory (RDEU), anticipated utility theory, cumulative prospect theory, nonadditive expected utility and regret theory, have been introduced (see Prigent (2007)).
As we mentioned above, under some specific assumptions, the investor’s preference on the pos-
sible positions X can be represented by a utility function u. Once a utility function is defined, all
alternative random wealth levels are ranked by evaluating their expected utility values E[u(X)]. In-
vestors wish to form a portfolio to maximize the expected utility of final wealth. The investor’s
problem is
max_{w∈R^m} E[U(w′Xt )] subject to ∑_{i=1}^m wi = 1.    (3.1.17)

Note that when all returns are Gaussian random variables, the mean-variance criterion is also equiv-
alent to the expected utility approach for any risk-averse utility function as follows:
Remark 3.1.10. (Luenberger (1997), Ruppert (2004), Prigent (2007)) If the utility function U is quadratic, U(x) = x − (k/2)x² with k > 0, or exponential, U(x) = −exp[−Ax]/A with A > 0, and the returns on all portfolios are normally distributed, then the portfolio that maximizes the expected utility is on the efficient frontier.
7 The HARA utility function u takes the form u(x) = a(b + x/c)^(1−c) on the domain b + x/c > 0. The constant parameters a, b and c satisfy the condition a(1 − c)/c > 0 if c ≠ 0, 1.


Suppose that all returns are non-Gaussian and a utility function U has the third derivative D³U. Then, by Taylor expansion, the expected utility can be represented as

E[U(w′Xt )] ≈ U[E(w′Xt )] + (1/1!) DU[E(w′Xt )] E[w′Xt − E(w′Xt )]
+ (1/2!) D²U[E(w′Xt )] E[(w′Xt − E(w′Xt ))²]
+ (1/3!) D³U[E(w′Xt )] E[(w′Xt − E(w′Xt ))³]
= U[µ(w)] + (1/2) D²U[µ(w)] η²(w) + (1/6) D³U[µ(w)] c3(w)

(the first-order term vanishes because E[w′Xt − E(w′Xt )] = 0), where c3(w) = E[(w′Xt − E(w′Xt ))³] and DⁱU(x) = dⁱU(x)/(dx)ⁱ for i = 1, 2, 3. If the returns on all portfolios are not Gaussian, then we cannot ignore the higher-order moments in the above approximation. That is the motivation for introducing the mean-variance-skewness model.
Mean-variance-skewness model (MVS) The third moment of a return is called skewness, which
measures the asymmetry of the probability distribution. A natural extension of the mean-variance
model is to add skewness into consideration for portfolio management. There will be triple goals:
maximizing the mean and the skewness, minimizing the variance. People interested in the use of
skewness prefer a portfolio with a larger probability for large payoffs when the mean and variance
remain the same. The mathematical programming formulation is as follows:
max_{w∈R^m} E[(w′Xt − µ(w))³] subject to µ(w) = µP , η²(w) = σP², ∑_{i=1}^m wi = 1.    (3.1.18)

The importance of higher-order moments in portfolio selection has been suggested by Samuelson
(1970). But partially because of the difficulty of estimating the third-order moment for a large num-
ber of securities such as over a few hundred, and of solving nonconvex programs by using traditional
computational methods, quantitative treatments of the third-order moment have been neglected for
a long time.
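For a moderate number of assets, the objective of (3.1.18) can be estimated from data via the sample coskewness tensor c_{ijk} = E[(Xi − µi)(Xj − µj)(Xk − µk)], since E[(w′Xt − µ(w))³] = ∑_{i,j,k} wi wj wk c_{ijk}. A sketch with simulated, deliberately skewed returns (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100_000
Z = rng.standard_normal((T, 3))
# hypothetical skewed returns: chi-square-type shocks give nonzero third moments
X = 0.01 + 0.05 * Z + 0.02 * (Z ** 2 - 1) + 0.01 * (Z[:, [0]] ** 2 - 1)

w = np.array([0.4, 0.4, 0.2])
Xc = X - X.mean(axis=0)

# sample coskewness tensor c_ijk
cosk = np.einsum('ti,tj,tk->ijk', Xc, Xc, Xc) / T

# portfolio skewness computed two equivalent ways
skew_tensor = np.einsum('i,j,k,ijk->', w, w, w, cosk)
skew_direct = np.mean((Xc @ w) ** 3)
```

The tensor has m³ entries, which illustrates why estimating third-order moments for several hundred assets was long considered impractical.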
When the utility functions are not observable, the theory of stochastic dominance is available
to rank two random prospects X and Y. This theory uses the cumulative distribution functions F X
and FY of X and Y instead of E[u(X)] and E[u(Y)].
Definition 3.1.11. Assume that the supports of all random variables are in an interval [a, b].
(1) X is said to dominate Y according to first-order stochastic dominance (X ⪰1 Y) if FX (ω) ≤ FY (ω) for all ω ∈ [a, b].
(2) X is said to dominate Y according to second-order stochastic dominance (X ⪰2 Y) if ∫_a^ω FX (s)ds ≤ ∫_a^ω FY (s)ds for all ω ∈ [a, b].
This definition is consistent with the expected utility, that is, X ⪰1 Y if and only if E[u(X)] ≥ E[u(Y)] for any utility function u which is monotonically increasing, and X ⪰2 Y if and only if E[u(X)] ≥ E[u(Y)] for any utility function u which is monotonically increasing and concave (see Propositions 1.3 and 1.4 of Prigent (2007)). Moreover, the second-order dominance property justifies the diversification investment strategy (see Remark 1.11 of Prigent (2007)).
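Definition 3.1.11 can be checked empirically by comparing empirical CDFs (and their partial integrals) on a grid. The sketch below uses simulated samples; the location shift X = Y + 0.5, an assumption made for illustration, dominates Y in both orders.

```python
import numpy as np

def empirical_cdf(sample, grid):
    return np.searchsorted(np.sort(sample), grid, side='right') / len(sample)

def dominates_first_order(x, y, grid):
    """Empirical check of X >=_1 Y: F_X <= F_Y at every grid point."""
    return np.all(empirical_cdf(x, grid) <= empirical_cdf(y, grid) + 1e-12)

def dominates_second_order(x, y, grid):
    """Empirical check of X >=_2 Y: integrated F_X below integrated F_Y."""
    dg = np.diff(grid, prepend=grid[0])
    Ix = np.cumsum(empirical_cdf(x, grid) * dg)
    Iy = np.cumsum(empirical_cdf(y, grid) * dg)
    return np.all(Ix <= Iy + 1e-9)

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, 50_000)
x = y + 0.5   # shifted up: dominates y in the first (hence second) order
```

First-order dominance implies second-order dominance, so the shifted sample passes both checks while the reverse comparison fails.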
The key property linked to the expected utility theory is the independence axiom, which may fail
empirically and can yield some paradoxes, as shown by Allais (1953). Therefore, during the 1980s,
the revision of the expected utility paradigm was intensely developed by slightly modifying or re-
laxing the original axioms. The following are typical ones.
Weighted utility theory One of the theories consistent with the Allais paradox was the "weighted utility theory," introduced by Chew and MacCrimmon (1979). The idea is to apply a transformation on the initial probability. The basic result of Chew and MacCrimmon yields the
32 PORTFOLIO THEORY FOR DEPENDENT RETURN PROCESSES
following representation of preferences over lotteries L = {(x1 , p1 ), . . . , (xk , pk )}:
X
k X
k
U(L) = u(xi )φ(pi ) with φ(xi ) = pi / v(xi )pi
i=1 i=1

where u and v are two different elementary utility functions.


Prospect theory Another approach is the “prospect theory” introduced by Tversky and Kahneman
(1979). The idea of the prospect theory is to represent the preferences by means of a function φ such
that the utility of a lottery L = {(x₁, p₁), . . . , (x_k, p_k)} is given by

U(L) = Σ_{i=1}^k u(x_i)φ(p_i)

where φ is an increasing function defined on [0, 1] with values in [0, 1] and φ(0) = 0, φ(1) = 1.
Using these transformations, the Allais paradox can be solved. However, these theories imply the
violation of first-order stochastic dominance. To solve this problem, alternative approaches have
been proposed.
Rank-dependent expected utility theory The rank dependent expected utility (RDEU) theory as-
sumes that people consider cumulative distribution functions rather than probabilities themselves.
For a lottery L = {(x₁, p₁), . . . , (x_k, p_k)} with x₁ ≤ x₂ ≤ · · · ≤ x_k , the utility is given by

U(L) = Σ_{i=1}^k u(x_i) { Φ(Σ_{j=1}^i p_j) − Φ(Σ_{j=1}^{i−1} p_j) }

where the weights Φ(Σ_{j=1}^i p_j) depend on the ranking of the outcomes x_i . In this framework, it
is possible to introduce preference representations that are compatible with first-order stochastic
dominance. If Φ is not the identity but u(x) = x, the RDEU is the “dual theory” of Yaari (1987).
Quiggin (1982) has proposed “anticipated utility theory,” in which the “weak independence axiom”
is adopted instead of the independence axiom. The utility is then a class of the RDEU where Φ is
nondecreasing from [0, 1] to [0, 1], concave on [0, 1/2] (Φ(p_i) > p_i) and convex on [1/2, 1]
(Φ(p_i) < p_i), with Φ(1/2) = 1/2 and Φ(1) = 1. Tversky and Kahneman (1992) have introduced
“cumulative prospect theory” in which utility functions are given for losses and gains. The utility is
also a class of the RDEU where Φ is divided into a loss part Φ− and a gain part Φ+ . The “Choquet
expected utility” of Schmeidler (1989) is also a class of the RDEU. The model is based on the so-
called “Choquet integral” (see Choquet (1953)). Consider two random variables X and Y associated
respectively to lotteries LX and LY , with the distribution functions F X and FY . In the framework of
the Choquet expected utility, X is preferred to Y if

E[u(X)] = ∫_0^1 u(F_X^{−1}(t))dν(t) ≥ ∫_0^1 u(F_Y^{−1}(t))dν(t) = E[u(Y)]

where u is a utility function, ν is a Choquet distortion measure and F^{−1}(t) is the t-th quantile. The
simplest Choquet distortion measure is given by the one-parameter family

ν_ε(t) = min(t/ε, 1) for ε ∈ (0, 1].
This family plays an important role in the portfolio application. Focusing for a moment on a single
ν_α , we have

E_{ν_α}[u(X)] = (1/α) ∫_0^α u(F_X^{−1}(t)) dt.
Based on this utility function, Bassett et al. (2004) introduced the “pessimistic portfolio” which we
will discuss later.
3.1.5 Alternative Risk Measures
In portfolio theory, many risk measures alternative to the variance or the standard deviation have
been introduced. As noted by Giacometti and Lozza (2004), we can distinguish two kinds of risk
measures, that is, dispersion measures and safety measures.
In the class of the dispersion measures, the risks ρ are increasing, positive, and positively homoge-
neous, such as standard deviation, semivariance and mean absolute deviation. The following models
are optimal portfolios to minimize dispersion measures alternative to the standard deviation.
Mean semivariance model Based on the observation that investors may only be concerned with
the risk of return being lower than the mean (downside risk), the method of mean semivariance was
proposed to model this fact (see, e.g., Markowitz (1959)). The portfolio optimization problem is
defined as
min_{w∈R^m} E[min{w′X_t − µ(w), 0}²] subject to µ(w) = µ_P , Σ_{i=1}^m w_i = 1.  (3.1.19)

Although the rationale behind introducing the model is intuitively closer to reality than the mean-
variance model, it is not widely used (see Wang and Xia (2002)).
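As an illustration only, the semivariance objective in (3.1.19) can be approximated from a sample of returns and minimized numerically. The sketch below uses scipy; the simulated data, the target µ_P and all variable names are our hypothetical choices (not from the book), and the sample mean stands in for µ(w):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(0.001, 0.02, size=(500, 3))    # hypothetical returns: n=500 days, m=3 assets
mu_hat = X.mean(axis=0)
mu_P = mu_hat.mean()                          # target expected return (arbitrary feasible choice)

def semivariance(w):
    r = X @ w
    d = np.minimum(r - r.mean(), 0.0)         # downside deviations below the portfolio mean
    return np.mean(d ** 2)

cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
        {"type": "eq", "fun": lambda w: w @ mu_hat - mu_P})
w0 = np.full(3, 1.0 / 3.0)                    # equal weights satisfy both constraints here
res = minimize(semivariance, w0, constraints=cons)
print(res.x, semivariance(res.x))
```

Starting from a feasible point, the constrained optimizer can only reduce the downside risk, which makes the sketch easy to sanity-check.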
Mean absolute deviation model In order to solve large-scale portfolio optimization problems,
Konno and Yamazaki (1991) considered the mean absolute deviation as the risk of portfolio in-
vestment. The portfolio optimization problem is defined as
min_{w∈R^m} E[|w′X_t − µ(w)|] subject to µ(w) = µ_P , Σ_{i=1}^m w_i = 1.  (3.1.20)

Using the historical data of the Tokyo Stock Exchange, Konno and Yamazaki compared the perfor-
mance of the mean-variance model and that of the mean absolute deviation model, and found that
the performance of those two models was very similar.
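The practical appeal of the mean absolute deviation criterion is that the sample version of (3.1.20) becomes a linear program, which is what made Konno and Yamazaki's large-scale application feasible. A minimal sketch with scipy on simulated (hypothetical) returns, using the sample mean in place of µ(w):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 300, 4
X = rng.normal(0.0005, 0.01, size=(n, m))          # hypothetical return history
mu_hat = X.mean(axis=0)
mu_P = mu_hat.mean()
D = X - mu_hat                                     # deviations from the sample mean

# variables z = (w_1..w_m, y_1..y_n); minimize (1/n) sum_t y_t
c = np.concatenate([np.zeros(m), np.full(n, 1.0 / n)])
# y_t >= |w'd_t|  <=>  w'd_t - y_t <= 0 and -w'd_t - y_t <= 0
A_ub = np.vstack([np.hstack([D, -np.eye(n)]),
                  np.hstack([-D, -np.eye(n)])])
b_ub = np.zeros(2 * n)
A_eq = np.vstack([np.concatenate([np.ones(m), np.zeros(n)]),
                  np.concatenate([mu_hat, np.zeros(n)])])
b_eq = np.array([1.0, mu_P])
bounds = [(None, None)] * m + [(0, None)] * n      # w free, auxiliary y nonnegative
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w = res.x[:m]
print(w, res.fun)
```

At the optimum the auxiliary variables y_t equal |w′d_t|, so the LP objective coincides with the sample mean absolute deviation of the portfolio.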
Mean target model Fishburn (1977) examined a class of mean risk dominance models in which
risk equals the expected value of a function that is zero at or above a target return T , and is non-
decreasing in deviations below T . Considering that investment managers frequently associate risk
with the failure to attain a target return, the model is formulated as follows:
min_{w∈R^m} ∫_{−∞}^T (T − x)^π dF_{w′X_t}(x) subject to Σ_{i=1}^m w_i = 1  (3.1.21)

where T is a specified target return, π (> 0) is a parameter that can approximate a wide variety of
attitudes toward the risk of falling below the target return, and F_{w′X_t}(x) is the probability distribution
function of w′X_t . This model penalizes distributions having below-target returns. However, a
major drawback, its computational complexity, has restricted this model's use in practice.

It is known that dispersion measures are not consistent with first-order stochastic dominance. On the
other hand, the safety measures introduced by Roy (1952), Telser (1955) and Kataoka (1963) are
consistent with first-order stochastic dominance. The safety first rules are based on risk measures
involving the probability that the portfolio return falls under a given level.
Safety first While Markowitz’s mean-variance model measures a portfolio’s variance as the risk,
Roy proposed another criterion to minimize the probability that a portfolio falls below a disaster
level. Denoting by D the minimum acceptable level of return for the investment, Roy's criterion is
min_{w∈R^m} P(w′X_t < D) subject to Σ_{i=1}^m w_i = 1.  (3.1.22)
Another example is the Telser criterion:
max_{w∈R^m} µ(w) subject to P(w′X_t < D) ≤ α, Σ_{i=1}^m w_i = 1.  (3.1.23)

With this different view of risk, the optimization problem can have several different formulations.
One example is the Kataoka criterion:
max_{w∈R^m} D(w) subject to P(w′X_t < D(w)) ≤ α, Σ_{i=1}^m w_i = 1  (3.1.24)

where α is an acceptable probability that the portfolio return may drop below the limit D.
The three previous criteria involve a given probability threshold. Under Gaussian assumptions,
Roy, Telser and Kataoka showed that all optimal solutions are obtained as points on the mean-variance
efficient frontier. Obviously, if asset returns are not Gaussian, owing to nonzero skewness
or a fat tail, then the optimal solutions may no longer be mean-variance efficient. For general
distributions, it is necessary to introduce the value at risk (VaR).
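The Gaussian reduction is easy to see: P(w′X < D) = Φ((D − µ(w))/η(w)), so Roy's criterion is equivalent to maximizing (µ(w) − D)/η(w), a point on the mean-variance frontier. The sketch below (parameters are our hypothetical choices) checks this equivalence numerically over random portfolios:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu = np.array([0.05, 0.08, 0.03])                 # hypothetical asset means
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])            # hypothetical covariance matrix
D = 0.0                                           # disaster level

def shortfall_prob(w):    # P(w'X < D) under Gaussian returns
    return norm.cdf((D - w @ mu) / np.sqrt(w @ Sigma @ w))

def roy_ratio(w):         # (mu(w) - D) / eta(w)
    return (w @ mu - D) / np.sqrt(w @ Sigma @ w)

# among random portfolios on the simplex, the minimizer of P(w'X < D)
# should coincide with the maximizer of the Roy ratio
W = rng.dirichlet(np.ones(3), size=2000)
i_min = np.argmin([shortfall_prob(w) for w in W])
i_max = np.argmax([roy_ratio(w) for w in W])
print(i_min == i_max)
```

Since Φ is strictly increasing, the two rankings agree exactly, not just approximately.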
Definition 3.1.12. The α-VaR is given by
VaRα (X) := − sup{x|P(X ≤ x) < α}. (3.1.25)
Mean VaR Portfolio Consider
min_{w∈R^m} VaR_α(w′X_t) subject to µ(w) = µ_P , Σ_{i=1}^m w_i = 1.  (3.1.26)

The solution indicates minimizing the portfolio’s VaR under a given expected value µP . It is easy
to see that if the constraint µ(w) = µP is eliminated, the above problem corresponds to Kataoka’s
criterion.
As can be easily seen, VaR is a risk measure that only takes account of the probability of losses, and
not of their size. Moreover, VaR is traditionally based on the assumption of normal asset returns and
has to be carefully evaluated when there are extreme price fluctuations. Furthermore, VaR may not
be convex for some probability distributions. To overcome these problems, other risk measures have
been proposed, such as coherent risk measures and convex risk measures. Driven by regulatory
concerns in the financial sector, there has been considerable interest in how to measure portfolio risks. For this
problem, Artzner et al. (1999) provided an axiomatic foundation for coherent risk measures.
Definition 3.1.13. For real-valued random variables X ∈ X on (Ω, A), a mapping ρ : X → R is
called a coherent risk measure if it is
• monotone: X, Y ∈ X, with X ≤ Y ⇒ ρ(X) ≥ ρ(Y).
• subadditive: X, Y, X + Y ∈ X ⇒ ρ(X + Y) ≤ ρ(X) + ρ(Y).
• linearly homogeneous: for all λ ≥ 0 and X ∈ X, ρ(λX) = λρ(X).⁸
• translation invariant: for all λ ∈ R and X ∈ X, ρ(λ + X) = ρ(X) − λ.
These requirements rule out most risk measures traditionally used in finance. In particular, risk
measures based on a variance or standard deviation are ruled out by the monotonicity requirement,
and quantile-based risk measures such as VaR are ruled out by the subadditivity requirement.
The α-risk is one of the important risk measures having the coherence properties. Let

ρ_{ν_α}(X) = −∫_0^1 F_X^{−1}(t)dν_α(t) = −(1/α) ∫_0^α F_X^{−1}(t)dt  (3.1.27)
⁸When the risk measure is not assumed to vary in proportion to the risk itself, linear homogeneity is no longer satisfied. A risk measure ρ is called a “convex risk measure” if it is monotone, subadditive, translation invariant and convex, which means that for all 0 ≤ λ ≤ 1 and X, Y ∈ X, ρ(λX + (1 − λ)Y) ≤ λρ(X) + (1 − λ)ρ(Y).
where F_X^{−1} is the generalized inverse of the cdf F_X , that is, F_X^{−1}(p) = sup{x | F_X(x) ≤ p}, ν_α(t) =
min{t/α, 1} and α ∈ (0, 1]. Variants of ρ_{ν_α}(X) have been suggested under a variety of names,
including expected shortfall (Acerbi and Tasche (2001)), conditional value at risk (Rockafellar and
Uryasev (2000)) and tail conditional expectation (Artzner et al. (1999)). For the sake of brevity,
Bassett et al. (2004) called ρ_{ν_α}(X) the α-risk of the random prospect X. Clearly, the α-risk is simply the
negative Choquet ν_α expected return.
Remark 3.1.14. The following risk measures correspond to the α-risk ρ_{ν_α}(X).

• expected shortfall-1: ES_α(X) = −(1/α) ∫_0^α F_X^{−1}(t)dt

• expected shortfall-2: ES_α(X) = −(1/α){ E[X 1_{X≤q_α(X)}] − q_α(X)(P[X ≤ q_α(X)] − α) }

• tail conditional expectation: TCE_α(X) = −E[X | X ≤ q_α(X)]⁹

where q_α(X) (= −VaR_α(X)) is the quantile of X at the level α ∈ (0, 1].
Note that ρ_{ν_α}(X) is always greater than or equal to VaR_α(X), because q_α(X) ≥ F_X^{−1}(t) for t ∈ [0, α].
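All of these quantities can be estimated from a sample simply by sorting. The sketch below (simulated heavy-tailed data; all inputs are our hypothetical choices) computes the empirical α-VaR and the empirical α-risk (expected shortfall-1) and illustrates the inequality ρ_{ν_α}(X) ≥ VaR_α(X):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_t(df=5, size=100_000) * 0.01   # hypothetical heavy-tailed returns
alpha = 0.05

Xs = np.sort(X)
k = int(np.ceil(alpha * len(Xs)))
var_alpha = -Xs[k - 1]          # empirical alpha-VaR: minus the alpha-quantile
es_alpha = -Xs[:k].mean()       # empirical alpha-risk: minus the mean of the lowest alpha-fraction
print(var_alpha, es_alpha)
```

Because the tail mean lies below the tail quantile, the sample version of the inequality holds by construction.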
As for the portfolio selection theory, by using the α-risk, it is natural to consider the following
criteria: ρνα (X) − λµ(X) or, alternatively, µ(X) − λρνα (X). Minimizing the former criterion implies
minimizing the α-risk subject to a constraint on mean return; maximizing the latter criterion implies
maximizing the mean return subject to a constraint on the α-risk. Several authors have suggested
these criteria as alternatives to the classical Markowitz criteria, in which the α-risk is replaced by the
standard deviation of the random variable X. Since µ(X) = ∫_0^1 F_X^{−1}(t)dt = −ρ_{ν_1}(X), these criteria are
special cases of the following more general class.
Definition 3.1.15. A risk measure ρ is called pessimistic if, for some probability measure ψ on
[0, 1],

ρ(X) = ∫_0^1 ρ_{ν_α}(X) dψ(α).

Pessimistic Portfolio Bassett et al. (2004) showed that decision making based on minimizing a
“coherent risk measure” is equivalent to Choquet expected utility maximization using a linear form
of the utility function and a pessimistic form of the Choquet distortion function.
Consider
min_{w∈R^m} ρ_{ν_α}(w′X_t) subject to µ(w) = µ_P , Σ_{i=1}^m w_i = 1.  (3.1.28)

A portfolio with the weight as the solution is called a pessimistic portfolio, which includes the
mean-conditional value at risk (M-CVaR) portfolio, the mean-expected shortfall (M-ES) portfolio
and the mean-tail conditional expectation (M-TCE) portfolio.
Although the α-risks provide a convenient one-parameter family of coherent risk measures, they
are obviously rather simplistic. The α-risk ρ_{ν_α} can be extended to weighted averages of α-risks

ρ_ν(X) = Σ_{k=1}^l ν_k ρ_{ν_{α_k}}(X),  (3.1.29)

which is a way to approximate general pessimistic risk measures. Note that the weights ν_k , k =
1, . . . , l, should be positive and sum to one.
⁹Note that TCE_α(X) is not always equal to ES_α(X). In that case, it may not be coherent (not subadditive), as shown in Acerbi (2004).
3.1.6 Copulas and Dependence
As we mentioned in Section 3.1.5, various risk measures alternative to the variance or the standard deviation
have been proposed for the portfolio optimization problem. VaR is widely used as a financial
risk measure (not only for portfolio optimization) because it is recommended as a standard by the
Basel Committee. However, in recent years, VaR has been criticized as a risk measure since it
is neither coherent nor convex. Due to this nonconvexity, VaR may have many local extrema when optimizing
a portfolio. As an alternative risk measure, CVaR is known to have better properties than VaR.
Definition 3.1.16. The α-CVaR is given by

CVaRα (X) := E {−X| − X ≥ VaRα (X)} (3.1.30)

where VaRα (X) is the α-VaR defined by (3.1.25).


Here β is a specified probability level in (0, 1) (in practice, β = 0.90, 0.95 or 0.99). Pflug (2000)
proved that CVaR is a coherent risk measure that is convex, monotonic, positively homogeneous
and so on. Rockafellar and Uryasev (2000) proposed the mean-conditional value at risk (M-CVaR)
portfolio, which minimizes the CVaR of a portfolio under the constraint for the portfolio mean.
Definition 3.1.17. Let w′X = Σ_{i=1}^m w_i X_i denote a portfolio return. Then, an M-CVaR portfolio is a
solution of the following problem:

min_{w∈R^m} CVaR_β(w′X) over w ∈ W(µ_P)  (3.1.31)

where

W(µ_P) = { w = (w₁, . . . , w_m)′ | w_j ≥ 0 for j = 1, . . . , m, Σ_{i=1}^m w_i = 1, µ(w) = E(w′X) ≥ µ_P }.
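Rockafellar and Uryasev (2000) also showed that the sample version of this problem reduces to a linear program, through the auxiliary representation CVaR_β(w′X) = min_ζ { ζ + (1/((1−β)n)) Σ_t max(−w′x_t − ζ, 0) }. The sketch below uses scipy on simulated (hypothetical) return scenarios; it is an illustration of that reformulation, not the book's own code:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, m, beta = 500, 3, 0.95
X = rng.normal([0.001, 0.002, 0.0005], 0.02, size=(n, m))  # hypothetical return scenarios
mu_hat = X.mean(axis=0)
mu_P = mu_hat.mean()

# variables z = (w_1..w_m, zeta, u_1..u_n); objective = zeta + sum(u)/((1-beta)n)
c = np.concatenate([np.zeros(m), [1.0], np.full(n, 1.0 / ((1 - beta) * n))])
# u_t >= -w'x_t - zeta   <=>   -X w - zeta - u <= 0
A_ub = np.hstack([-X, -np.ones((n, 1)), -np.eye(n)])
b_ub = np.zeros(n)
# mean constraint mu(w) >= mu_P   <=>   -mu_hat'w <= -mu_P
A_ub = np.vstack([A_ub, np.concatenate([-mu_hat, [0.0], np.zeros(n)])])
b_ub = np.append(b_ub, -mu_P)
A_eq = np.concatenate([np.ones(m), [0.0], np.zeros(n)])[None, :]   # sum of weights = 1
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)] + [(0, None)] * n        # w_j >= 0, zeta free, u >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w, cvar = res.x[:m], res.fun
print(w, cvar)
```

The solution w lies in W(µ_P), and the optimal objective value is the minimized sample CVaR_β of the portfolio.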

According to Rockafellar and Uryasev (2000), this set W(µ_P) is convex. Modeling CVaR
(including VaR) by means of a copula description of the dependence structure is the subject of many studies,
such as Palaro and Hotta (2006), Stulajter (2009), Kakouris and Rustem (2014) and so on. The
advantage of modeling with a copula is that, for continuous multivariate distributions, the modeling
of the univariate marginals and that of the multivariate, or dependence, structure can be separated, and
the multivariate structure can be represented by a copula. The general representation of a multivariate
distribution as a composition of a copula and its univariate margins is due to Sklar's theorem
(e.g., Joe (2015)).
Theorem 3.1.18. [Sklar’s theorem] For an m-variate distribution F ∈ F (F1 , . . . , Fm ), with jth
univariate marginal distribution F j , the copula associated with F is a distribution function C :
[0, 1]^m → [0, 1] with U(0, 1) margins that satisfies

F(x) = C(F₁(x₁), . . . , F_m(x_m)), x ∈ R^m.

(a) If F is a continuous m-variate distribution function with univariate margins F1 , . . . , Fm , and


quantile functions F₁^{−1}, . . . , F_m^{−1}, then

C(u) = F(F₁^{−1}(u₁), . . . , F_m^{−1}(u_m)), u ∈ [0, 1]^m

is the unique choice.


(b) If F is an m-variate distribution of discrete random variables (more generally, partly continuous
and partly discrete), then the copula is unique only on the set

Range(F1) × · · · × Range(Fm).
Proof: See Theorem 1.1 of Joe (2015).

Suppose that the m-variate distribution function F with a copula C is continuous with univariate
marginal distribution functions F₁, . . . , F_m . If F₁, . . . , F_m are absolutely continuous with respective
densities f_j(x) = ∂F_j(x)/∂x, and C has a mixed derivative of order m, then the joint density function
f can be written as

f(x) = f(x₁, . . . , x_m) = ∂^m F(x₁, . . . , x_m)/(∂x₁ · · · ∂x_m) = ∂^m C(F₁(x₁), . . . , F_m(x_m))/(∂x₁ · · · ∂x_m)
  = { ∂^m C(u₁, . . . , u_m)/(∂u₁ · · · ∂u_m) } Π_{i=1}^m ∂F_i(x_i)/∂x_i
  = { ∂^m C(u₁, . . . , u_m)/(∂u₁ · · · ∂u_m) } Π_{i=1}^m f_i(x_i).  (3.1.32)

Definition 3.1.19. If C(u) is an absolutely continuous copula, then its density function

c(u) = c(u₁, . . . , u_m) = ∂^m C(u₁, . . . , u_m)/(∂u₁ · · · ∂u_m), u ∈ (0, 1)^m

is called the copula density (function).
It is easy to see that if the copula density c(u) exists, it follows that

f(x) = c(F₁(x₁), . . . , F_m(x_m)) Π_{i=1}^m f_i(x_i).  (3.1.33)

So far, a variety of copulas have been introduced. The following is a typical one.
Example 3.1.20. [Gaussian copula] From the multivariate Gaussian distribution with zero means,
unit variances and m × m correlation matrix Σ, we get

C(u; Σ) = Φ_m(Φ^{−1}(u₁), . . . , Φ^{−1}(u_m); Σ), u ∈ [0, 1]^m,

where Φ_m(·; Σ) is the m-variate Gaussian cdf, Φ is the univariate Gaussian cdf, and Φ^{−1} is the
univariate Gaussian inverse cdf or quantile function. The copula density is given by

c(u; Σ) = det(Σ)^{−1/2} exp{ −(1/2) ξ′(Σ^{−1} − I_m)ξ },  (3.1.34)
where ξ = (ξ1 , . . . , ξm )′ and ξi (i = 1, . . . , m) is the ui quantile of the standard Gaussian random
variable Xi (i.e., ui = P(Xi < ξi ), Xi ∼ N(0, 1)).
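The density (3.1.34) is straightforward to evaluate numerically. The sketch below (the function name and inputs are ours, not the book's) checks that Σ = I_m gives the independence copula density c ≡ 1, and that positive correlation inflates the density in the joint upper tail:

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, Sigma):
    """Gaussian copula density (3.1.34): det(Sigma)^(-1/2) exp(-xi'(Sigma^{-1} - I)xi / 2)."""
    xi = norm.ppf(u)                       # componentwise Gaussian quantiles
    Q = np.linalg.inv(Sigma) - np.eye(len(u))
    return np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * xi @ Q @ xi)

print(gaussian_copula_density(np.array([0.2, 0.7, 0.55]), np.eye(3)))  # independence case
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_copula_density(np.array([0.9, 0.9]), Sigma))            # joint upper tail, rho = 0.5
```

With Σ = I_m we have Σ^{−1} − I_m = 0 and det(Σ) = 1, so the density is exactly 1 at every u.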
Suppose that a vector of random asset returns X = (X₁, . . . , X_m)′ follows a continuous m-variate
distribution function F_X with density function f_X . Let w = (w₁, . . . , w_m)′ be the vector
of portfolio weights. Then the return of the portfolio is w′ X, and the distribution function can be
written as

F_{w′X}(y) = P(w′X ≤ y) = ∫_{w′x≤y} f_X(x) dx.

If the density function f_X can be described by using the copula density and the marginal densities
f₁(x), . . . , f_m(x) as in (3.1.33), then we can write

F_{w′X}(y) = ∫_{w′x≤y} c(F_X(x)) Π_{i=1}^m f_i(x_i) dx

where F_X(x) = (F₁(x₁), . . . , F_m(x_m)). The variable transformation u_i = F_i(x_i) leads to

F_{w′X}(y) = ∫_{w′F_X^{−1}(u)≤y} c(u) du

where F_X^{−1}(u) = (F₁^{−1}(u₁), . . . , F_m^{−1}(u_m)) maps from [0, 1]^m to R^m.
By using this result, VaR and CVaR are given as follows.
Proposition 3.1.21. The α-VaR for w′X is defined as

VaR_α(w′X) = − sup{ y | ∫_{w′F_X^{−1}(u)≤y} c(u)du < α }.

The α-CVaR for w′X is defined as

CVaR_α(w′X) = −(1/α) ∫_{w′F_X^{−1}(u)≤−VaR_α(w′X)} w′F_X^{−1}(u) c(u) du.

Proof: It follows from the definitions of α-VaR (see (3.1.25)) and α-CVaR (see (3.1.30)).

This result shows that once a copula density c and the marginal distribution functions F_X are
determined by simulation or estimation, we can calculate the CVaR (or VaR) of the portfolio w′X, which
implies that the optimal M-CVaR portfolio (or M-VaR portfolio) can also be obtained as the solution
of (3.1.31) (or (3.1.26)).
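A common way to use this result in practice is by simulation: draw u from the copula density c, map through the marginal quantile functions, and estimate the VaR and CVaR of w′X from the simulated sample. The sketch below uses a Gaussian copula with heavy-tailed t marginals; all parameters (correlation, degrees of freedom, scale, weights) are hypothetical:

```python
import numpy as np
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(5)
alpha, N, m = 0.05, 100_000, 2
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])                 # hypothetical copula correlation
L = np.linalg.cholesky(Sigma)

# draw U from the Gaussian copula, then apply t(4) marginal quantile functions
Z = rng.standard_normal((N, m)) @ L.T
U = norm.cdf(Z)
X = 0.01 * student_t.ppf(U, df=4)              # F_i^{-1}(u_i), scaled hypothetical returns

w = np.array([0.5, 0.5])
R = X @ w                                      # simulated portfolio returns w'X
Rs = np.sort(R)
k = int(np.ceil(alpha * N))
var_alpha = -Rs[k - 1]                         # Monte Carlo alpha-VaR
cvar_alpha = -Rs[:k].mean()                    # Monte Carlo alpha-CVaR
print(var_alpha, cvar_alpha)
```

Minimizing the Monte Carlo CVaR over w (e.g., with the LP reformulation applied to the simulated scenarios) then approximates the M-CVaR portfolio of (3.1.31).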

3.1.7 Bibliographic Notes


Markowitz (1952) was the original paper on portfolio theory and was expanded into the book
Markowitz (1959). Elton et al. (2007), Luenberger (1997), Wang and Xia (2002), Ruppert (2004)
and Malevergne and Sornette (2005) provide an elementary introduction to portfolio selection
theory, covering the mean-variance portfolio, CAPM and APT. Prigent (2007) discusses the utility
theory and risk measures in more detail. Koenker (2005) introduces the pessimistic portfolio. Wang
and Xia (2002) discuss some other portfolios, taking care of the investment (transaction) cost. Joe
(2015) and McNeil et al. (2015) provide an introduction to general copula theory. Rockafellar
and Uryasev (2000) and Kakouris and Rustem (2014) discuss portfolio optimization problems
using copulas. McNeil et al. (2015) provide concepts and techniques of quantitative risk management
(QRM), for which portfolio theory, financial time series modelling, copulas and extreme
value theory are required.

3.1.8 Appendix
Proof of (3.1.2) Let L(w) be

L(w) = η²(w) − λ₁(µ(w) − µ_P) − λ₂(w′e − 1) = w′Σw − λ₁(w′µ − µ_P) − λ₂(w′e − 1).

When w₀ is the minimizer of L(w), it follows that

∂L(w)/∂w |_{w=w₀} = 0, ∂L(w)/∂λ₁ |_{w=w₀} = 0, ∂L(w)/∂λ₂ |_{w=w₀} = 0.

Since

∂(w′Σw)/∂w = 2Σw, ∂(w′µ)/∂w = µ, ∂(w′e)/∂w = e,

we have

∂L(w)/∂w = 2Σw − λ₁µ − λ₂e = 0,

which implies that

w = (λ₁Σ^{−1}µ + λ₂Σ^{−1}e)/2.

In addition, by substituting this result into

∂L(w)/∂λ₁ = w′µ − µ_P = 0, ∂L(w)/∂λ₂ = w′e − 1 = 0,

we have

(λ₁µ′Σ^{−1}µ + λ₂e′Σ^{−1}µ)/2 = (λ₁C + λ₂B)/2 = µ_P , (λ₁e′Σ^{−1}µ + λ₂e′Σ^{−1}e)/2 = (λ₁B + λ₂A)/2 = 1.

Hence, it follows that

λ₁ = (2µ_P A − 2B)/(AC − B²) = (2µ_P A − 2B)/D, λ₂ = (2C − 2µ_P B)/(AC − B²) = (2C − 2µ_P B)/D

and

w₀ = (λ₁/2)Σ^{−1}µ + (λ₂/2)Σ^{−1}e = ((µ_P A − B)/D)Σ^{−1}µ + ((C − µ_P B)/D)Σ^{−1}e.
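The closed form just derived is easy to verify numerically: the weights below satisfy both constraints exactly, and any feasible perturbation (orthogonal to both e and µ) increases the variance. The data are hypothetical, chosen only for illustration:

```python
import numpy as np

mu = np.array([0.06, 0.10, 0.04])            # hypothetical mean returns
Sigma = np.array([[0.05, 0.01, 0.00],
                  [0.01, 0.12, 0.03],
                  [0.00, 0.03, 0.03]])       # hypothetical covariance matrix
e = np.ones(3)
mu_P = 0.07                                  # target expected return

Si = np.linalg.inv(Sigma)
A, B, C = e @ Si @ e, e @ Si @ mu, mu @ Si @ mu
D = A * C - B ** 2
w0 = ((mu_P * A - B) / D) * Si @ mu + ((C - mu_P * B) / D) * Si @ e
print(w0, w0 @ Sigma @ w0)                   # optimal weights and their variance
```

The identities w₀′e = (AC − B²)/D = 1 and w₀′µ = µ_P follow directly from the definitions of A, B, C and D.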
Proof of (3.1.3) Let L(w) be

L(w) = η²(w) − λ(w′e − 1) = w′Σw − λ(w′e − 1).

When w₀ is the minimizer of L(w), it follows that

∂L(w)/∂w |_{w=w₀} = 0, ∂L(w)/∂λ |_{w=w₀} = 0.

Hence we have

∂L(w)/∂w = 2Σw − λe = 0,

which implies that

w = λΣ^{−1}e/2.

In addition, by substituting this result into

∂L(w)/∂λ = w′e − 1 = 0,

we have

λe′Σ^{−1}e/2 = λA/2 = 1

so that

λ = 2/A

and

w₀ = (1/A)Σ^{−1}e.
Proof of (3.1.5) Let L(w) be

L(w) = µ(w) − λ₁(η²(w) − σ²_P) − λ₂(w′e − 1) = w′µ − λ₁(w′Σw − σ²_P) − λ₂(w′e − 1).

When w₀ is the maximizer of L(w), it follows that

∂L(w)/∂w |_{w=w₀} = 0, ∂L(w)/∂λ₁ |_{w=w₀} = 0, ∂L(w)/∂λ₂ |_{w=w₀} = 0.

Hence we have

∂L(w)/∂w = µ − 2λ₁Σw − λ₂e = 0,

which implies that

w = (Σ^{−1}µ − λ₂Σ^{−1}e)/(2λ₁)

for λ₁ ≠ 0. In addition, by substituting this result into

∂L(w)/∂λ₁ = w′Σw − σ²_P = 0, ∂L(w)/∂λ₂ = w′e − 1 = 0,

we have

(µ′Σ^{−1}µ − 2λ₂µ′Σ^{−1}e + λ₂²e′Σ^{−1}e)/(4λ₁²) = (C − 2λ₂B + λ₂²A)/(4λ₁²) = σ²_P ,
(e′Σ^{−1}µ − λ₂e′Σ^{−1}e)/(2λ₁) = (B − λ₂A)/(2λ₁) = 1.

Hence, it follows that

A(1 − σ²_P A)λ₂² − 2B(1 − σ²_P A)λ₂ + (C − σ²_P B²) = 0

and, if J = B²(1 − σ²_P A)² − A(1 − σ²_P A)(C − σ²_P B²) ≥ 0 is satisfied, we have

λ₂ = (B(1 − σ²_P A) ± √J)/(A(1 − σ²_P A)), λ₁ = (B − λ₂A)/2.

Let

λ₂⁺ = (B(1 − σ²_P A) + √J)/(A(1 − σ²_P A)), λ₁⁺ = (B − λ₂⁺A)/2, w⁺ = (Σ^{−1}µ − λ₂⁺Σ^{−1}e)/(2λ₁⁺),
λ₂⁻ = (B(1 − σ²_P A) − √J)/(A(1 − σ²_P A)), λ₁⁻ = (B − λ₂⁻A)/2, w⁻ = (Σ^{−1}µ − λ₂⁻Σ^{−1}e)/(2λ₁⁻).

Then we have

(w⁺)′µ − (w⁻)′µ = ((λ₂⁺ − λ₂⁻)/(4λ₁⁺λ₁⁻)) D = 2√J/A.

Therefore, if σ²_P A ≥ 1, then (w⁺)′µ ≥ (w⁻)′µ, which implies that

w₀ = (1/(2λ₁⁺))Σ^{−1}µ − (λ₂⁺/(2λ₁⁺))Σ^{−1}e.
Proof of (3.1.7) Under w′e = 1, define L(w) by

L(w) = tan θ = (µ(w) − µ_f)/η(w) = w′(µ − µ_f e)/√(w′Σw).

When w₀ is the maximizer of L(w), it follows that

∂L(w)/∂w |_{w=w₀} = 0.

Hence we have

∂L(w)/∂w = { (µ − µ_f e)√(w′Σw) − w′(µ − µ_f e) Σw/√(w′Σw) } / (w′Σw) = 0,

which implies that

w = c_P Σ^{−1}(µ − µ_f e)

where c_P = w′Σw / (w′(µ − µ_f e)). Because of w′e = 1, it follows that

c_P = 1/(e′Σ^{−1}(µ − µ_f e)).

3.2 Statistical Estimation for Portfolios


In this section, we discuss statistical estimation for the optimal portfolio weights introduced in Sec-
tion 3.1. As we showed in Section 3.1, the class of mean-variance optimal portfolio weights can be
written as a function g = g(µ, Σ) of the mean vector µ and covariance matrix Σ with respect to asset
returns X(t). Several authors proposed estimators of g as functions of the sample mean vector µ̂ and
sample covariance matrix Σ̂ (i.e., ĝ = g(µ̂, Σ̂)) for independent returns of assets (e.g., Jobson et al.
(1979), Jobson and Korkie (1981) and Lauprete et al. (2002)). However, empirical studies show that
financial return processes are often dependent and non-Gaussian. From this point of view, Basak
et al. (2002) showed the consistency of optimal portfolio estimators when the portfolio returns are
stationary processes. In the literature there has been no study on the asymptotic efficiency of es-
timators for optimal portfolio weights. Shiraishi and Taniguchi (2008) considered the asymptotic
efficiency of the estimators when the returns are non-Gaussian linear stationary processes. They
give the asymptotic distribution of portfolio weight estimators ĝ for the processes.
Next, we discuss the other statistical methods in portfolio optimization problems. First, we dis-
cuss an estimation method and its asymptotic property of “pessimistic portfolio,” as we introduced
in Section 3.1. According to Bassett et al. (2004), the asymptotic property is explained via quan-
tile regression (QR) theory. Second, we briefly discuss shrinkage estimation of optimal portfolio
weight discussed by Jorion (1986) and Ledoit and Wolf (2004), among others. Third, we briefly discuss Bayesian estimation theory in portfolio optimization problems. According to Aït-Sahalia and
Hansen (2009), the econometrician’s optimal portfolio weights are quite different from the above
plug-in estimates, because the econometrician takes on the role of the investor by choosing portfolio
weights that are optimal with respect to his or her subjective belief about the true return distribu-
tion. Next, we discuss factor analysis. As for the static factor model, three types of factor models
are available for studying asset returns (see Connor (1995) and Campbell et al. (1996)), namely,
macroeconomic factor model, fundamental factor model and statistical factor model. Here, we in-
troduce static factor models and portfolio estimation methods. As for the dynamic factor model,
we explain the weak points for the traditional (static) factor models and introduce the generalized
dynamic factor model which provides the desirable properties, and construct a portfolio estimator.
Finally, we discuss high-dimensional problems. Let n and m be the sample size and the sample
dimension, respectively. We first show the asymptotic property for the estimators of some portfo-
lios (the global minimum portfolio (MP), the tangency portfolio (TP) and the naive portfolio (NP))
under n, m → ∞ and m/n → y ∈ (0, 1). Then, the asymptotic property for the estimator of MP
is shown under n, m → ∞ and m/n → y ∈ (1, ∞). Here the estimator is modified since the usual
sample covariance matrix is singular under m > n.

3.2.1 Traditional Mean-Variance Portfolio Estimators


Suppose that a sequence of random returns {X(t) = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} is a stationary process.
Writing µ = E{X(t)} and Σ = Var{X(t)} = E[{X(t) − µ}{X(t) − µ}′ ], we define (m + r)-vector
parameter θ = (θ1 , . . . , θm+r )′ by

θ = (µ′ , vech(Σ)′)′

where r = m(m + 1)/2. We estimate a general function g(θ) of θ, which expresses an optimal
portfolio weight for the mean-variance model such as (3.1.2), (3.1.3), (3.1.5) and (3.1.7). Here it
should be noted that the portfolio weight w = (w1 , . . . , wm )′ satisfies the restriction w′ e = 1. Then
we have only to estimate the subvector (w1 , . . . , wm−1 )′ . Hence we assume that the function g(·) is
(m − 1)-dimensional, i.e.,

g : θ → Rm−1 . (3.2.1)

In this setting we address the problem of statistical estimation for g(θ), which describes various
mean variance optimal portfolio weights.
Let the return process {X(t) = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} be an m-vector linear process

X(t) = Σ_{j=0}^∞ A(j)U(t − j) + µ  (3.2.2)

where {U (t) = (u1 (t), . . . , um (t))′ } is a sequence of independent and identically distributed (i.i.d.)
m-vector random variables with E{U (t)} = 0, Var{U (t)} = K (for short {U (t) ∼ i.i.d.(0, K)}) and
fourth-order cumulants. Here

A( j) = (Aab ( j))a,b=1,...,m , A(0) = Im , µ = (µa )a=1,...,m , K = (Kab )a,b=1,...,m .

We make the following assumption.


Assumption 3.2.1. (i) Σ_{j=0}^∞ |j|^{1+δ} ‖A(j)‖ < ∞ for some δ > 0.
(ii) det{Σ_{j=0}^∞ A(j)z^j} ≠ 0 on {z ∈ C; |z| ≤ 1}.
Assumption 3.2.1 (i) is the condition that guarantees finiteness of the fourth moment of X(t).
Assumption 3.2.1 (ii) is the “invertibility condition,” which guarantees that U(t) is representable as a
function of {X(s)}_{s≤t}.
The class of {X(t)} includes that of non-Gaussian vector-valued causal ARMA models. Hence
it is sufficiently rich. The process {X(t)} is a second-order stationary process with spectral density
matrix

f(λ) = (f_{ab}(λ))_{a,b=1,...,m} = (1/2π) A(λ)KA(λ)*

where A(λ) = Σ_{j=0}^∞ A(j)e^{ijλ}.
From the partial realization {X(1), . . . , X(n)}, we introduce

θ̂ = (µ̂′ , vech(Σ̂)′ )′ , θ̃ = (µ̂′ , vech(Σ̃)′ )′


where

µ̂ = (1/n) Σ_{t=1}^n X(t), Σ̂ = (1/n) Σ_{t=1}^n {X(t) − µ̂}{X(t) − µ̂}′, Σ̃ = (1/n) Σ_{t=1}^n {X(t) − µ}{X(t) − µ}′.

Denote the σ-field generated by {X(s); s ≤ t} by F_t . Also we introduce Ω₁ ((m × m)-matrix), Ω₂^G
((r × r)-matrix), Ω₂^{NG} ((r × r)-matrix) and Ω₃ ((m × r)-matrix) as follows:

Ω₁ = 2π f(0),

Ω₂^G = ( 2π ∫_{−π}^π { f_{a₁a₃}(λ)f_{a₂a₄}(λ) + f_{a₁a₄}(λ)f_{a₂a₃}(λ) } dλ )_{a₁,a₂,a₃,a₄=1,...,m; a₁≥a₂, a₃≥a₄} ,

Ω₂^{NG} = Ω₂^G + ( (1/(2π)²) Σ_{b₁,b₂,b₃,b₄=1}^m c^U_{b₁,b₂,b₃,b₄} ∫∫_{−π}^π A_{a₁b₁}(λ₁)A_{a₂b₂}(−λ₁)A_{a₃b₃}(λ₂)A_{a₄b₄}(−λ₂) dλ₁dλ₂ )_{a₁,a₂,a₃,a₄=1,...,m; a₁≥a₂, a₃≥a₄} ,

Ω₃ = ( (1/(2π)²) Σ_{b₁,b₂,b₃=1}^m c^U_{b₁,b₂,b₃} ∫∫_{−π}^π A_{a₁b₁}(λ₁ + λ₂)A_{a₂b₂}(−λ₁)A_{a₃b₃}(−λ₂) dλ₁dλ₂ )_{a₁,a₂,a₃=1,...,m; a₂≥a₃}

where the c^U_{b₁,...,b_j}'s are the jth-order cumulants of u_{b₁}(t), . . . , u_{b_j}(t) (j = 3, 4).
Remark 3.2.2. Denoting

Ω = [ Ω₁  0 ; 0  Ω₂^G ] if {U(t)} is Gaussian,  Ω = [ Ω₁  Ω₃ ; Ω₃′  Ω₂^{NG} ] if {U(t)} is non-Gaussian,

this Ω corresponds to the spectral representation of the asymptotic covariance matrix of √n(θ̂ − θ).
For a simple example, we assume that m = 1 and U(t) ∼ i.i.d. N(0, σ²). Then, it follows that


nVar(µ̂) = (1/n) Σ_{t₁,t₂=1}^n Cov{X(t₁), X(t₂)} = (1/n) Σ_{t₁,t₂=1}^n Σ_{j₁,j₂=0}^∞ A(j₁)A(j₂) Cov{U(t₁ − j₁), U(t₂ − j₂)}
  = Σ_{l=−n+1}^{n−1} (1 − |l|/n) Σ_{j=0}^∞ A(j)A(j + l)σ² → Σ_{l=−∞}^∞ Σ_{j=0}^∞ A(j)A(j + l)σ² = ( Σ_{j=0}^∞ A(j) )² σ²

under the convention A(j) = 0 for j < 0. In the same way, we can write

nVar(Σ̂) = (1/n) Σ_{t₁,t₂=1}^n Cov[{X(t₁) − µ}², {X(t₂) − µ}²] = (2/n) Σ_{t₁,t₂=1}^n Cov{X(t₁), X(t₂)}²
  = 2 Σ_{l=−n+1}^{n−1} (1 − |l|/n) { Σ_{j=0}^∞ A(j)A(j + l)σ² }² → 2 Σ_{l=−∞}^∞ { Σ_{j=0}^∞ A(j)A(j + l) }² σ⁴
  = 2 Σ_{j₁,j₂,j₃≥0, j₁≤j₂+j₃} A(j₁)A(j₂)A(j₃)A(j₂ + j₃ − j₁)σ⁴.

On the other hand, Ω₁ and Ω₂^G can be written as

Ω₁ = 2π f(0) = Σ_{j₁,j₂=0}^∞ A(j₁)A(j₂)σ² = ( Σ_{j=0}^∞ A(j) )² σ²,

Ω₂^G = 4π ∫_{−π}^π f(λ)² dλ = (σ⁴/π) Σ_{j₁,j₂,j₃,j₄=0}^∞ A(j₁)A(j₂)A(j₃)A(j₄) ∫_{−π}^π e^{−i(j₁−j₂−j₃+j₄)λ} dλ
  = 2 Σ_{j₁,j₂,j₃≥0, j₁≤j₂+j₃} A(j₁)A(j₂)A(j₃)A(j₂ + j₃ − j₁)σ⁴.

Theorem 3.2.3. Under Assumption 3.2.1,

√n (θ̂ − θ) →_L N(0, Ω).

Proof of Theorem 3.2.3. See Appendix.

Here we assume that the function g(·) is (m − 1)-dimensional, i.e.,

g : θ → Rm−1 . (3.2.3)

For g(·) given by (3.2.3) we impose the following.


Assumption 3.2.4. The function g(θ) is continuously differentiable.
Under this assumption, the delta method is available and (3.1.2), (3.1.3), (3.1.5) and (3.1.7)
satisfy this assumption. As a unified estimator for (mean-variance) optimal portfolio weights we
introduce g(θ̂). For this we have the following result.
Theorem 3.2.5. Under Assumptions 3.2.1 and 3.2.4,

√n (g(θ̂) − g(θ)) →_L N( 0, (∂g/∂θ′) Ω (∂g/∂θ′)′ )

where ∂g/∂θ′ is the vector differential (i.e., ∂g/∂θ′ = (∂g/∂θ_i)_{i=1,...,m+r}).

Proof of Theorem 3.2.5. The proof follows from Theorem 3.2.3 and the δ-method (e.g., Brockwell
and Davis (2006)).
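In practice, Theorem 3.2.5 justifies plugging µ̂ and Σ̂ into the weight function g. The sketch below does this for the global minimum variance weight function g(θ) = Σ^{−1}e/(e′Σ^{−1}e) under a simulated vector MA(1) process; all model parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 50_000
A1 = 0.4 * np.eye(m)                           # hypothetical MA(1) coefficient matrix
K = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])                # hypothetical innovation covariance
mu = np.array([0.01, 0.02, 0.015])

Lk = np.linalg.cholesky(K)
U = rng.standard_normal((n + 1, m)) @ Lk.T
X = U[1:] + U[:-1] @ A1.T + mu                 # X(t) = U(t) + A(1)U(t-1) + mu

def gmv(Sigma):
    """Global minimum variance weights g(theta) = Sigma^{-1} e / (e' Sigma^{-1} e)."""
    x = np.linalg.solve(Sigma, np.ones(len(Sigma)))
    return x / x.sum()

Sigma_true = K + A1 @ K @ A1.T                 # Var{X(t)} for this MA(1)
Sigma_hat = np.cov(X, rowvar=False, bias=True)
w_true, w_hat = gmv(Sigma_true), gmv(Sigma_hat)
print(w_true, w_hat)
```

For large n the plug-in estimate ĝ = g(θ̂) is close to the true weight vector, consistent with the √n-rate of Theorem 3.2.5.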

3.2.2 Pessimistic Portfolio


In Section 3.1.5 we introduced the pessimistic portfolio, which is a portfolio whose weight is the
solution of

min_{w=(w₁,...,w_m)′∈R^m} ρ_{ν_α}(w′X) subject to E(w′X) = µ_P , Σ_{i=1}^m w_i = 1  (3.2.4)

where ρ_{ν_α}(w′X) is called the α-risk of the portfolio return w′X and µ_P is a pre-specified expected
return. Here the α-risk ρ_{ν_α}(w′X) is defined as

ρ_{ν_α}(w′X) = −∫_0^1 F_{w′X}^{−1}(t)dν_α(t) = −(1/α) ∫_0^α F_{w′X}^{−1}(t)dt, α ∈ (0, 1)

where ν_α(t) = min{t/α, 1} and F_{w′X}^{−1}(t) = sup{x | F_{w′X}(x) ≤ t} denotes the quantile function of the
portfolio return w′X with distribution function F_{w′X}.


Bassett et al. (2004) showed that a portfolio with minimized α-risk can be constructed via the quan-
tile regression (QR) methods of Koenker and Bassett (1978). QR is based on the fact that a quantile
can be characterized as the minimizer of some expected asymmetric absolute loss function, namely,
\[
F_{w'X}^{-1}(\alpha) = \arg\min_{\theta} E\big[\big(\alpha I\{w'X - \theta \ge 0\} + (1-\alpha) I\{w'X - \theta < 0\}\big)\, |w'X - \theta|\big]
\equiv \arg\min_{\theta} E\big[\rho_\alpha(w'X - \theta)\big]
\]

where ρ_α(u) = u(α − I{u < 0}), u ∈ ℝ, is called the check function (see Koenker (2005)) and I{A} is the indicator function defined by I{A} = I_A(ω) = 1 if ω ∈ A and 0 if ω ∉ A. To construct the optimal
(i.e., α-risk minimized) portfolio, the following lemma is needed.
Lemma 3.2.6. (Theorem 2 of Bassett et al. (2004)) Let Y = w′ X be a real-valued random variable
with E(Y) = E(w′ X) = µP < ∞. Then,
\[
\min_{\theta \in \mathbb{R}} E\big[\rho_\alpha(Y - \theta)\big] = \alpha\big(\mu_P + \rho_{\nu_\alpha}(Y)\big).
\]
Denoting Y = w′X as a portfolio consisting of m different assets X = (X₁, ..., X_m)′ with allocation weights w = (w₁, ..., w_m)′ (subject to $\sum_{j=1}^{m} w_j = 1$), the optimization problem can be written as (3.2.4).
Suppose that we obtain a sequence of random returns {X(t) = (X1 (t), . . . , Xm (t))′ : t = 1, . . . , n, n ∈
N}. Taniai and Shiohama (2012) introduced the sample or empirical version of (3.2.4) as
\[
\min_{b=(b_1,\dots,b_m)' \in \mathbb{R}^m} \sum_{i=1}^{n+1} \rho_\alpha(Z_i - W_i' b) \tag{3.2.5}
\]
where $\bar{X}_i = n^{-1}\sum_{t=1}^{n} X_i(t)$ and
\[
Z = (Z_1, \dots, Z_n, Z_{n+1})' = \big(X_1(1), \dots, X_1(n), \kappa(\bar{X}_1 - \mu_P)\big)'
\]
\[
W = [W_1, \dots, W_n \mid W_{n+1}]
= \begin{pmatrix}
1 & \cdots & 1 & 0 \\
X_1(1) - X_2(1) & \cdots & X_1(n) - X_2(n) & \kappa(\bar{X}_1 - \bar{X}_2) \\
\vdots & \ddots & \vdots & \vdots \\
X_1(1) - X_m(1) & \cdots & X_1(n) - X_m(n) & \kappa(\bar{X}_1 - \bar{X}_m)
\end{pmatrix}
\]

with some κ = O(n^{1/2}). Denoting the minimizer of (3.2.5) as $\hat{\beta}^{(n)}(\alpha) = (\hat{\beta}_1^{(n)}(\alpha), \dots, \hat{\beta}_m^{(n)}(\alpha))'$, the optimal portfolio weights
\[
\hat{w}^{(n)}(\alpha) = \Big(1 - \sum_{i=2}^{m} \hat{\beta}_i^{(n)}(\alpha),\ \hat{\beta}_2^{(n)}(\alpha), \dots, \hat{\beta}_m^{(n)}(\alpha)\Big)'
\]
minimize the α-risk. The large sample properties of $\hat{\beta}^{(n)}(\alpha)$, especially its $\sqrt{n}$-consistency, follow from the standard arguments and assumptions in the QR context.
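Since the check-loss minimization in (3.2.5) is a linear program, the pessimistic weights can be computed with an off-the-shelf LP solver. The following Python sketch (assuming NumPy and SciPy; the return-generating process and the values of α, µ_P and κ are hypothetical) builds Z and W as above and solves for ŵ⁽ⁿ⁾(α):

```python
import numpy as np
from scipy.optimize import linprog

# Pessimistic portfolio via quantile regression, cast as a linear program.
# The data-generating process and (alpha, mu_P, kappa) are hypothetical.
rng = np.random.default_rng(0)
m, n, alpha = 4, 200, 0.1
mu_P, kappa = 0.05, np.sqrt(200)
X = rng.normal(0.05, 0.2, size=(n, m))          # n returns of m assets

Xbar = X.mean(axis=0)
Z = np.append(X[:, 0], kappa * (Xbar[0] - mu_P))
W = np.ones((n + 1, m))                          # rows are W_i'
W[n, 0] = 0.0
for j in range(1, m):
    W[:n, j] = X[:, 0] - X[:, j]
    W[n, j] = kappa * (Xbar[0] - Xbar[j])

# min sum(alpha*u + (1-alpha)*v)  s.t.  W b + u - v = Z,  u, v >= 0, b free
N = n + 1
c = np.concatenate([np.zeros(m), alpha * np.ones(N), (1 - alpha) * np.ones(N)])
A_eq = np.hstack([W, np.eye(N), -np.eye(N)])
res = linprog(c, A_eq=A_eq, b_eq=Z,
              bounds=[(None, None)] * m + [(0, None)] * (2 * N),
              method="highs")
beta = res.x[:m]                                 # (intercept, beta_2, ..., beta_m)
w_hat = np.concatenate([[1 - beta[1:].sum()], beta[1:]])
print("pessimistic weights:", np.round(w_hat, 3))
```

The LP splits each residual into its positive and negative parts u and v, so the objective equals the check-loss sum in (3.2.5); the weights then sum to one by construction.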
Let β(α) be a QR coefficient of Zi conditioning on Wi satisfied with

\[
F_{Z_i}^{-1}(\alpha \mid W_i) = W_i' \beta(\alpha) \tag{3.2.6}
\]
where $F_Z^{-1}(\alpha \mid W) = \inf\{x : P(Z \le x \mid W) \ge \alpha\}$. Note that here the QR model (3.2.6) has a random coefficient regression (RCR) interpretation of the form $Z_i = W_i'\beta(U_i)$ with componentwise monotone increasing function β and random variables $U_i \sim$ Uniform[0, 1]. Here a choice such that
46 PORTFOLIO THEORY FOR DEPENDENT RETURN PROCESSES
$\beta(u) = (\beta_1(u), \beta_2(u), \dots, \beta_m(u))' := (b_1 + F_\xi^{-1}(u), b_2, \dots, b_m)'$ with $F_\xi$ the distribution function of some independent and identically distributed (i.i.d.) n-tuple $(\xi_1, \dots, \xi_n)$ yields
\[
Z_i = W_i' \beta(U_i) = W_i' \begin{pmatrix} b_1 + \xi_i \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.
\]
Hence, recalling that the first component of Wi is 1 for i = 1, . . . , n, it follows that, for any fixed
α ∈ [0, 1], the QR coefficient β(α) can be characterized as the parameter b ∈ Rm of a model such as
\[
Z_i = W_i' b + \xi_i, \qquad \xi_i \overset{\text{iid}}{\sim} G
\]

where the density g of G is subject to
\[
g \in \mathcal{F}_\alpha = \left\{ f : \int_{-\infty}^{0} f(x)\,dx = \alpha = 1 - \int_{0}^{\infty} f(x)\,dx \right\},
\]

that is, G−1 (α) = 0.
We impose the following assumptions.
Assumption 3.2.7. The distribution functions {P(Zi ≤ x|Wi )} are absolutely continuous, with con-
tinuous densities fi (ξ) uniformly bounded away from 0 and ∞ at the points ξi (α), i = 1, 2, . . . .
Assumption 3.2.8. There exists a positive definite matrix $D_0$ such that
\[
\text{(i)} \quad \lim_{n\to\infty} \frac{1}{n+1} \sum_{i=1}^{n+1} W_i W_i' = D_0,
\]
\[
\text{(ii)} \quad \lim_{n\to\infty} \frac{1}{n+1} \sum_{i=1}^{n+1} f_i(\xi_i(\alpha))\, W_i W_i' = \lim_{n\to\infty} \frac{1}{n+1} \sum_{i=1}^{n+1} g(0)\, W_i W_i' = g(0) D_0 \quad \text{with } g(0) \ne 0,
\]
\[
\text{(iii)} \quad \max_{i=1,\dots,n+1} \|W_i\| / \sqrt{n} \to 0.
\]
According to Koenker (2005), Assumption 3.2.8 (i) and (iii) are familiar throughout the liter-
ature on M-estimators for regression models; some variant of them is necessary to ensure that a
Lindeberg condition is satisfied. Assumption 3.2.8 (ii) is really a matter of notational convenience
and could be deduced from Assumption 3.2.8 (i) and a slightly strengthened version of Assumption
3.2.7.
Theorem 3.2.9. (Theorem 4.1. of Koenker (2005)) Under Assumptions 3.2.7 and 3.2.8,
\[
\sqrt{n}\,\big\{\hat{\beta}^{(n)}(\alpha) - \beta(\alpha)\big\} \xrightarrow{\ d\ } N\!\left(0,\ \frac{\alpha(1-\alpha)}{g(0)^2}\, D_0^{-1}\right).
\]
The result that the pessimistic portfolio obtained from $\hat{\beta}^{(n)}(\alpha)$ minimizes the α-risk is essentially due to Bassett et al. (2004). Furthermore, Taniai and Shiohama (2012) showed that semiparametrically efficient inference on the optimal weights $\hat{\beta}^{(n)}(\alpha)$ is feasible.

3.2.3 Shrinkage Estimators
Let X(1), . . . , X(n) be a sequence of independent and identically distributed (m-dimensional) random vectors distributed as N(µ, Σ), where µ is an m-dimensional vector and Σ is an m × m positive definite matrix. The sample mean $\hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} X(t)$ seems the most fundamental and natural estimator of µ. However, if m ≥ 3, Stein (1956) showed that µ̂ is not admissible with respect to
the mean squared error loss function. Furthermore, James and Stein (1961) proposed a shrinkage
estimator µ̂S defined by
\[
\hat{\mu}^S := \left(1 - \frac{m-2}{n\,\|\hat{\mu}\|^2}\right)\hat{\mu}
\]
which improves on µ̂ with respect to the mean squared error when m ≥ 3.
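The dominance of the James–Stein estimator can be seen in a small simulation. A minimal Python sketch (with a hypothetical mean vector and unit covariance, not the book's data):

```python
import numpy as np

# Monte Carlo comparison of the sample mean and the James-Stein estimator
# for m = 5 >= 3 with unit covariance (a toy setting, not the book's data).
rng = np.random.default_rng(2)
m, n, reps = 5, 20, 4000
mu = np.full(m, 0.1)
mse_mean = mse_js = 0.0
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=(n, m))
    mu_hat = x.mean(axis=0)
    mu_js = (1 - (m - 2) / (n * np.sum(mu_hat**2))) * mu_hat
    mse_mean += np.sum((mu_hat - mu)**2)
    mse_js += np.sum((mu_js - mu)**2)
print("MSE sample mean :", mse_mean / reps)
print("MSE James-Stein :", mse_js / reps)
```

The improvement is largest when the true mean is close to the shrinkage point (here the origin), which is exactly the situation exploited by the portfolio shrinkage estimators below.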
For the portfolio choice problem, shrinkage estimation has been considered by Jobson et al. (1979),
Jobson and Korkie (1981), Jorion (1986), Frost and Savarino (1986), Ledoit and Wolf (2003, 2004),
among others. Traditional portfolio weight estimators are written as functions of the sample mean
µ̂ and the sample covariance matrix Σ̂. Jorion (1986) proposed a plug-in portfolio weight estimator
constructed with shrinkage sample mean, that is,

µ̂S := δµ0 e + (1 − δ)µ̂

where δ is a shrinkage factor satisfying 0 < δ < 1 and µ0 is a common constant value. In addition,
an optimal shrinkage factor δopt is given by
( )
opt (m − 2)/n
δ = min 1, .
(µ̂ − µ0 e)′ Σ−1 (µ̂ − µ0 e)
Furthermore, Frost and Savarino (1986) and Ledoit and Wolf (2003, 2004) showed that shrinkage estimation can be applied not only to the mean but also to the covariance matrix. They
proposed a shrinkage covariance matrix estimator by using the usual sample covariance matrix Σ̂
and a shrinkage target S (or its estimator Ŝ ), that is,

Σ̂S := δŜ + (1 − δ)Σ̂.

Ledoit and Wolf (2003) derive the following approximate expression for the optimal shrinkage
factor:
\[
\delta^{\text{opt}} \simeq \frac{1}{n} \frac{A - B}{C}
\]
with
\[
A = \sum_{i,j=1}^{m} \operatorname{asy\,var}(\sqrt{n}\,\hat{\sigma}_{ij}), \qquad
B = \sum_{i,j=1}^{m} \operatorname{asy\,cov}(\sqrt{n}\,\hat{\sigma}_{ij}, \sqrt{n}\,\hat{s}_{ij}), \qquad
C = \sum_{i,j=1}^{m} (\sigma_{ij} - s_{ij})^2.
\]
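A data-driven shrinkage factor in this spirit can be sketched as follows, shrinking the sample covariance toward a scaled identity target. The plug-in quantities below follow the well-known Ledoit–Wolf (2004) recipe rather than the exact A, B, C expression above, and all data are simulated:

```python
import numpy as np

# Shrinkage of the sample covariance toward the scaled identity with a
# data-driven delta, following the well-known Ledoit-Wolf (2004) plug-in
# recipe (a sketch; not the exact A, B, C expression in the text).
rng = np.random.default_rng(3)
p, n = 10, 40
A = rng.normal(size=(p, p))
Sigma = A @ A.T / p + np.eye(p)                 # true covariance (simulated)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                               # sample covariance
mu = np.trace(S) / p                            # scale of the identity target
d2 = np.sum((S - mu * np.eye(p))**2) / p        # dispersion around the target
b2bar = np.mean([np.sum((np.outer(x, x) - S)**2) / p for x in Xc]) / n
b2 = min(b2bar, d2)
delta = b2 / d2                                 # estimated shrinkage factor
Sigma_shrunk = delta * mu * np.eye(p) + (1 - delta) * S
print("delta =", round(delta, 3))
print("Frobenius errors (sample, shrunk):",
      np.sum((S - Sigma)**2), np.sum((Sigma_shrunk - Sigma)**2))
```

By construction δ lies in [0, 1], and the shrunk estimator is positive definite even when n is close to p, which is one of the main practical motivations for covariance shrinkage.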

In what follows, we briefly examine the effect of shrinkage estimators from a data generation
process. We generate K(= 3000) sequences of independent random vectors
\[
X_n^{(k)} := \big\{ X_t^{(k)} = (X_{1t}^{(k)}, \dots, X_{qt}^{(k)})^\top \big\}_{t=1,\dots,n}, \qquad k = 1, \dots, K
\]

from a q-dimensional Gaussian distribution (i.e., $X_t^{(k)} \overset{\text{i.i.d.}}{\sim} N_q(\mu, \Sigma)$) where the components of the mean vector $\mu = (\mu_1, \dots, \mu_q)^\top$ and of the covariance matrix $\Sigma = (\sigma_{ij})_{i,j=1,\dots,q}$ are defined as follows:
\[
\mu_i = 0.1\left(1 + \frac{i}{q}\right), \qquad \sigma_{ii} = 1 + \frac{0.1\, i}{q} \qquad \text{and} \qquad \rho = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} = 0.6 \quad (i \ne j).
\]

Consider the following shrinkage estimators for the mean vector and the covariance matrix:
\[
\hat{\mu}_k^S := \delta_1 \mu_0 e + (1-\delta_1)\hat{\mu}^{(k)}, \qquad
\hat{\Sigma}_k^S := \delta_2 \Sigma + (1-\delta_2)\hat{\Sigma}^{(k)}, \qquad
\tilde{\Sigma}_k^S := \delta_2 I + (1-\delta_2)\hat{\Sigma}^{(k)}
\]
where δ₁, δ₂ (= 0, 0.4, 0.8) are the shrinkage factors; µ₀ = 0.15 is the shrinkage target mean; $\hat{\mu}^{(k)}$
and Σ̂(k) are sample mean vector and sample covariance matrix based on X(k) n , respectively. Note that
Σ̂Sk is the case that the shrinkage target covariance matrix corresponds to the true covariance matrix
(i.e., S = Σ), and Σ̃Sk is the case that the shrinkage target does not equal the true covariance matrix
(i.e., S , Σ).
Based on the above $\hat{\mu}_k^S$ and $\hat{\Sigma}_k^S$ (or $\tilde{\Sigma}_k^S$), we define an empirical version of the mean squared error (MSE) for the shrinkage portfolio estimator as follows:
\[
\widehat{\mathrm{trV}}^S := \frac{1}{K} \sum_{k=1}^{K} \big\| \sqrt{n}\,\big\{ g(\hat{\mu}_k^S, \hat{\Sigma}_k^S) - g(\mu, \Sigma) \big\} \big\|^2, \qquad
\widetilde{\mathrm{trV}}^S := \frac{1}{K} \sum_{k=1}^{K} \big\| \sqrt{n}\,\big\{ g(\hat{\mu}_k^S, \tilde{\Sigma}_k^S) - g(\mu, \Sigma) \big\} \big\|^2
\]
where g(·, ·) is the function of µ and Σ which corresponds to the portfolio weight defined by (3.1.2). Table 3.1 shows the empirical MSE $\widehat{\mathrm{trV}}^S$ for q = 3, 10, 30 and n = 10, 30, 100 with q < n.

Table 3.1: Empirical version of the mean squared error ($\widehat{\mathrm{trV}}^S$) for the shrinkage portfolio estimator when the asset size q = 3, 10, 30, the sample size n = 10, 30, 100 with q < n and the shrinkage factors δ₁, δ₂ = 0, 0.4, 0.8.

n 10 30 100
q δ1 \ δ2 0 0.4 0.8 0 0.4 0.8 0 0.4 0.8
0 9.954 2.994 1.945 3.489 2.333 1.861 2.685 2.097 1.807
3 0.4 4.275 1.229 0.720 1.767 1.009 0.699 1.433 0.937 0.697
0.8 1.439 0.351 0.114 0.918 0.362 0.139 0.882 0.432 0.221
0 - - - 33.448 12.221 8.590 12.485 9.546 8.396
10 0.4 - - - 12.937 4.617 3.147 5.225 3.753 3.169
0.8 - - - 2.807 0.913 0.516 1.920 1.160 0.848
0 - - - - - - 83.558 36.967 27.585
30 0.4 - - - - - - 30.871 13.770 10.307
0.8 - - - - - - 5.832 3.177 2.582

The shrinkage effect for each n and q can be seen. Note that the empirical MSE of the usual sample portfolio estimator corresponds to the case δ₁ = δ₂ = 0. As for the shrinkage effect on the mean, a significant improvement (compare the results across δ₁) can be seen even if the shrinkage target mean vector µ₀e does not coincide with the true mean vector µ. Table 3.2 shows the empirical MSE $\widetilde{\mathrm{trV}}^S$ for q = 3, 10, 30 and n = 10, 30, 100 with q < n.
This result shows the risk of trusting the shrinkage target too much. Especially for δ₂ = 0.8, the empirical MSEs of the shrinkage estimator are sometimes larger than that of the usual sample portfolio estimator (i.e., δ₁ = δ₂ = 0). However, if we set δ₁ = δ₂ = 0.4, we obtain an improvement over the usual sample portfolio estimator in terms of the empirical MSE. This phenomenon is one of the reasons the shrinkage estimator is useful for the portfolio selection problem.
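The experiment behind Tables 3.1 and 3.2 can be reproduced on a small scale. The sketch below (Python/NumPy, K = 300 replications instead of 3000) uses the tangency-type weight g(µ, Σ) = Σ⁻¹µ/(e′Σ⁻¹µ) as a stand-in for the portfolio function of (3.1.2), with the means, variances, correlation and shrinkage targets as specified in the text:

```python
import numpy as np

# Scaled-down version of the shrinkage experiment (K = 300, q = 3, n = 30);
# g(mu, Sigma) = Sigma^{-1} mu / (e' Sigma^{-1} mu) is a tangency-type
# stand-in for the portfolio weight function in (3.1.2).
rng = np.random.default_rng(4)
q, n, K = 3, 30, 300
mu = 0.1 * (1 + np.arange(1, q + 1) / q)
sd = np.sqrt(1 + 0.1 * np.arange(1, q + 1) / q)
Sigma = 0.6 * np.outer(sd, sd)                  # correlation 0.6 off-diagonal
np.fill_diagonal(Sigma, sd**2)
e = np.ones(q)

def g(mu_, Sig_):
    w = np.linalg.solve(Sig_, mu_)
    return w / (e @ w)

mse = {}
for d1 in (0.0, 0.4, 0.8):
    for d2 in (0.0, 0.4, 0.8):
        acc = 0.0
        for _ in range(K):
            Xk = rng.multivariate_normal(mu, Sigma, size=n)
            mu_s = d1 * 0.15 * e + (1 - d1) * Xk.mean(axis=0)   # shrunk mean
            Sig_s = d2 * Sigma + (1 - d2) * np.cov(Xk, rowvar=False)
            acc += n * np.sum((g(mu_s, Sig_s) - g(mu, Sigma))**2)
        mse[(d1, d2)] = acc / K
print({k: round(v, 3) for k, v in mse.items()})
```

As in Table 3.1, heavier shrinkage toward the true covariance and a sensible mean target drives the empirical MSE down relative to the plain sample estimator at (δ₁, δ₂) = (0, 0).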

3.2.4 Bayesian Estimation
From a Bayesian perspective, the effect of parameter uncertainty on the optimal portfolio problem is
considered by expressing the investor’s problem in terms of the predictive distribution of the future
returns. Suppose that a sequence of random asset returns {X(t) = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} is a
stationary process with E{X(t)} = µ and Var{X(t)} = Σ. Writing the unknown parameter, portfolio
weight and investor’s utility function as θ = (µ, Σ) ∈ Rm × Rm+1 , w = (w1 , . . . , wm )′ ∈ Rm and
U : R → R, then the optimal portfolio weight wopt is obtained by the solution of the following
Table 3.2: Empirical version of the mean squared error ($\widetilde{\mathrm{trV}}^S$) for the shrinkage portfolio estimator when the asset size q = 3, 10, 30, the sample size n = 10, 30, 100 with q < n and the shrinkage factors δ₁, δ₂ = 0, 0.4, 0.8.

n 10 30 100
q δ1 \ δ2 0 0.4 0.8 0 0.4 0.8 0 0.4 0.8
0 9.954 7.126 8.144 3.489 4.435 7.187 2.685 3.732 6.815
3 0.4 4.275 2.863 2.993 1.767 1.895 2.662 1.433 1.632 2.515
0.8 1.439 0.740 0.436 0.918 0.649 0.455 0.882 0.699 0.587
0 - - - 33.448 28.874 35.504 12.485 17.974 32.207
10 0.4 - - - 12.937 10.773 12.847 5.225 6.912 11.696
0.8 - - - 2.807 1.902 1.790 1.920 1.864 2.281
0 - - - - - - 83.558 83.934 112.359
30 0.4 - - - - - - 30.871 30.582 40.481
0.8 - - - - - - 5.832 5.683 7.248

optimization problem.
\[
\max_{w \in \mathbb{R}^m} E\big[U\{w'X(t)\}\big] = \max_{w} \int U\{w'X(t)\}\, p(X(t)|\theta)\, dX(t) \tag{3.2.7}
\]
subject to w′ e = 1, where p(X(t)|θ) is the conditional density of asset returns under a given pa-
rameter θ. Suppose that X = {X(1), . . . , X(n)} is observed from {X(t)}. The empirical version of
(3.2.7) can be written as
\[
\max_{w \in \mathbb{R}^m} E\big[U\{w'X(n+1)\} \,\big|\, X\big] = \max_{w} \int U\{w'X(n+1)\}\, p(X(n+1)|X)\, dX(n+1) \tag{3.2.8}
\]
subject to w′ e = 1, where X(n + 1) is the (yet unobserved) next period return, which implies that
p(X(n + 1)|X) means the predictive return density after observing X. Denoting the conditional
predictive return density under a given parameter θ and the joint posterior density after observing
X by p(X(n + 1)|θ) and p(θ|X), respectively, the predictive return density p(X(n + 1)|X) is given
by
\[
p(X(n+1)|X) = \int p(X(n+1)|\theta)\, p(\theta|X)\, d\theta \tag{3.2.9}
\]

(see, e.g., p. 35 of Rachev et al. (2008)). In addition, by using the continuous version of Bayes’
theorem, the joint posterior density p(θ|X) can be written as
\[
p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)} \propto L(\theta|X)\, p(\theta) \tag{3.2.10}
\]
where L(θ|X) is the likelihood function for θ and p(θ) is a prior distribution of θ (whose parameters are called hyperparameters). Substituting (3.2.9) and (3.2.10) into (3.2.8), the empirical version of the optimization
problem can be written as
\[
\max_{w \in \mathbb{R}^m} \int \left\{ \int U[w'X(n+1)]\, p(X(n+1)|\theta)\, dX(n+1) \right\} L(\theta|X)\, p(\theta)\, d\theta. \tag{3.2.11}
\]

Comparing (3.2.7) and (3.2.11), we can describe the inference about the portfolio weight estimator.
i.i.d. Gaussian Return with Uninformative Prior When X(t)’s are i.i.d. Gaussian random vectors
with mean vector µ and covariance matrix Σ, the likelihood function can be written as
\[
L(\theta|X) \propto |\Sigma|^{-n/2} \exp\left[ -\frac{1}{2} \sum_{t=1}^{n} \{X(t) - \mu\}' \Sigma^{-1} \{X(t) - \mu\} \right]. \tag{3.2.12}
\]

We consider the case when the investor is uncertain about the distribution of θ, and has no particular
prior knowledge of θ. In that case, Jeffreys (1961) introduced the Jeffreys prior, that is,
\[
p(\theta) \propto |\Sigma|^{-\frac{m+1}{2}}. \tag{3.2.13}
\]

In general, the Jeffreys prior of a parameter θ is given by p(θ) ∝ |I(θ)|^{1/2}, where I(θ) is the so-called Fisher information matrix for θ, given by $I(\theta) = -E\left\{ \frac{\partial^2}{\partial\theta\,\partial\theta'} \log p(X(t)|\theta) \right\}$. Then it can be shown that the distribution of X(n + 1)|X follows a multivariate Student's t-distribution with n − m degrees of freedom. The predictive mean and covariance matrix of X(n + 1)|X are given by
\[
\tilde{\mu} = \hat{\mu} \qquad \text{and} \qquad \tilde{\Sigma} = \frac{n+1}{n-m-2}\,\hat{\Sigma} \tag{3.2.14}
\]
where $\hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} X(t)$ and $\hat{\Sigma} = \frac{1}{n}\sum_{t=1}^{n} \{X(t) - \hat{\mu}\}\{X(t) - \hat{\mu}\}'$. Hence, the Bayesian (mean-variance) optimal portfolio estimator is obtained by

g(θ̂bayes ) with θ̂bayes = (µ̃, Σ̃) (3.2.15)

where g is the optimal portfolio function defined in (3.2.3).
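A minimal sketch of the plug-in rule (3.2.14)–(3.2.15) on simulated returns (Python/NumPy; a normalized tangency-type weight stands in for g, which is an assumption of this sketch rather than the book's exact weight function):

```python
import numpy as np

# Predictive moments (3.2.14) under the Jeffreys prior, plugged into a
# normalized tangency-type weight (a stand-in for g; data simulated).
rng = np.random.default_rng(5)
m, n = 4, 120
mu = np.array([0.05, 0.06, 0.04, 0.07])
Sigma = 0.04 * (0.5 * np.eye(m) + 0.5)          # equicorrelated covariance
X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n
mu_tilde = mu_hat                                # predictive mean
Sigma_tilde = (n + 1) / (n - m - 2) * Sigma_hat  # predictive covariance

e = np.ones(m)
w = np.linalg.solve(Sigma_tilde, mu_tilde)
w = w / (e @ w)
print("inflation factor:", (n + 1) / (n - m - 2))
print("weights:", np.round(w, 3))
```

Since Σ̃ is a scalar multiple of Σ̂ here, parameter uncertainty inflates the predictive risk but leaves these normalized weights unchanged; it does affect weights that trade off mean against variance at a fixed risk aversion.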
i.i.d. Gaussian Return with Informative Prior We consider that the investor has informative beliefs
about the mean vector µ and covariance matrix Σ. Suppose that the distributions of µ|Σ and Σ are
as follows:
\[
\mu \,|\, \Sigma \sim N\!\left(\eta, \frac{1}{\tau}\Sigma\right), \qquad \Sigma \sim IW(\Omega, \nu), \tag{3.2.16}
\]

where IW (Ω, ν) is the inverse Wishart distribution with a parameter matrix Ω and a degree of
freedom ν. Under the Gaussianity of X(t), the priors in (3.2.16) are conjugate for µ|Σ and Σ.
Here the prior parameters τ and ν depend on the investor’s confidence about η and Ω. In that case,
it can be shown that the distribution of X(n + 1)|X follows a multivariate Student’s t-distribution
and the predictive mean and covariance matrix of X(n + 1)|X are given by
\[
\tilde{\mu} = \frac{\tau}{n+\tau}\,\eta + \frac{n}{n+\tau}\,\hat{\mu} \tag{3.2.17}
\]
\[
\tilde{\Sigma} = \frac{n+1}{n(\nu+m-1)} \left( \Omega + n\hat{\Sigma} + \frac{n\tau}{n+\tau}\,(\eta - \hat{\mu})(\eta - \hat{\mu})' \right). \tag{3.2.18}
\]
In this case, the Bayesian (mean-variance) optimal portfolio estimator is obtained by substituting
(3.2.17) and (3.2.18) for (3.2.14).
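The update (3.2.17)–(3.2.18) is a few lines of code; in the sketch below η, Ω, τ and ν are hypothetical investor inputs and the returns are simulated:

```python
import numpy as np

# Conjugate update (3.2.17)-(3.2.18); eta, Omega, tau, nu are hypothetical
# investor inputs and the returns are simulated.
rng = np.random.default_rng(6)
m, n, tau, nu = 3, 60, 20.0, 10.0
eta = np.full(m, 0.05)                          # prior mean belief
Omega = 0.02 * np.eye(m)                        # prior scale matrix
X = rng.normal(0.08, 0.2, size=(n, m))

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n
mu_tilde = tau / (n + tau) * eta + n / (n + tau) * mu_hat
Sigma_tilde = (n + 1) / (n * (nu + m - 1)) * (
    Omega + n * Sigma_hat
    + n * tau / (n + tau) * np.outer(eta - mu_hat, eta - mu_hat))
print("mu_hat   :", np.round(mu_hat, 4))
print("mu_tilde :", np.round(mu_tilde, 4))
```

The predictive mean is the weighted compromise τ/(n + τ)·η + n/(n + τ)·µ̂, so a larger τ expresses stronger confidence in the prior belief η.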
The Black–Litterman Model Black and Litterman (1992) applied the theoretical implications of an economic model to specifying a prior in the portfolio choice context. The resulting model, which derives from mixed estimation, is called the Black–Litterman model. According to Aït-Sahalia and Hansen (2009), mixed estimation was first developed by Theil and Goldberger (1961) as a way to update the Bayesian inferences drawn from old data with the information contained in a set of new data. It applies more generally to the problem of combining information from two data sources into a single posterior distribution.
Let X(t) be returns of m-risky assets with the mean µ and covariance matrix Σ. Assuming that
Σ is known, the market capitalization weights wmkt are observed and there exists a risk-free asset
with the return µ f = 0, then the tangency portfolio weight is given by wmkt = (1/γ)Σ−1 µ where γ
is the aggregate risk aversion. Under the capital asset pricing model (CAPM), the mean vector µ
corresponds to the equilibrium risk premium µequil therefore, we can write

µequil = γΣwmkt .

In the Black–Litterman model, the prior distribution of µ is assumed by using the benchmark beliefs
at this equilibrium risk premium, that is,

µ ∼ N(µequil , λΣ) (3.2.19)

where λ measures the strength of the investor’s belief in equilibrium.
On the other hand, the investor has a set of new views or forecasts η ∈ Rk about a subset of k ≤ m
linear combinations of returns PX(t), where P is a k × m matrix selecting and combining returns
into portfolios for which the investor is able to express views. For instance, suppose that there exist
four assets, A, B, C and D (i.e., m = 4), and the investor considers that
• asset A will give 1% monthly return,
• asset B will give 2% monthly return,
• asset C will outperform asset D by 0.5% (i.e., k = 3).
Then, P and η are written as
\[
P = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \qquad
\eta = \begin{pmatrix} 0.01 \\ 0.02 \\ 0.005 \end{pmatrix}.
\]

The new views are assumed to be unbiased but imprecise, with distribution

η|µ ∼ N(Pµ, Ω) (3.2.20)

where Ω expresses the views' accuracy and correlations, which the investor needs to specify.
Combining (3.2.19) and (3.2.20) by use of Bayes’ theorem, the joint posterior density p(µ|η) can
be written as

p(µ|η) ∝ p(η|µ)p(µ),

hence,

µ|η ∼ N(E(µ|η), Var(µ|η))

where

\[
E(\mu|\eta) = \{(\lambda\Sigma)^{-1} + P'\Omega^{-1}P\}^{-1}\{(\lambda\Sigma)^{-1}\mu_{\text{equil}} + P'\Omega^{-1}\eta\}
= \{(\lambda\Sigma)^{-1} + P'\Omega^{-1}P\}^{-1}\left\{\frac{\gamma}{\lambda}\, w_{\text{mkt}} + P'\Omega^{-1}\eta\right\}
\]
\[
\operatorname{Var}(\mu|\eta) = \{(\lambda\Sigma)^{-1} + P'\Omega^{-1}P\}^{-1}.
\]
In addition, the predictive mean and covariance matrix of X(n + 1)|η are given by

µ̃ = E(µ|η) and Σ̃ = {Σ−1 + Var(µ|η)}−1 . (3.2.21)

In this case, the Bayesian (mean-variance) optimal portfolio estimator is obtained by substituting
(3.2.21) for (3.2.14).
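The posterior formulas above can be evaluated directly for the four-asset, three-view example; in the sketch below γ, λ, Σ, w_mkt and the view covariance Ω are all hypothetical inputs:

```python
import numpy as np

# Black-Litterman posterior for the four-asset, three-view example above;
# gamma, lam, Sigma, w_mkt and Omega are hypothetical inputs.
m, k = 4, 3
gamma, lam = 2.5, 0.05
Sigma = 0.04 * (0.6 * np.ones((m, m)) + 0.4 * np.eye(m))
w_mkt = np.array([0.4, 0.3, 0.2, 0.1])
mu_equil = gamma * Sigma @ w_mkt                # CAPM equilibrium premium

P = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., -1.]])
eta = np.array([0.01, 0.02, 0.005])
Omega = 1e-4 * np.eye(k)                        # accuracy of the views

prec = np.linalg.inv(lam * Sigma) + P.T @ np.linalg.inv(Omega) @ P
post_mean = np.linalg.solve(
    prec,
    np.linalg.solve(lam * Sigma, mu_equil) + P.T @ np.linalg.solve(Omega, eta))
post_var = np.linalg.inv(prec)
print("equilibrium premium:", np.round(mu_equil, 4))
print("posterior mean:     ", np.round(post_mean, 4))
```

With a small Ω (precise views) the posterior mean is pulled strongly from the equilibrium premium toward the stated views, which is the intended behavior of the mixed-estimation update.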
3.2.5 Factor Models
3.2.5.1 Static factor models
In practice, most financial portfolios consist of many assets, which implies that we need to consider high-dimensional statistical models. In that case, applying the traditional approach is often complicated and hard. To overcome this problem, it is common to use static factor models to simplify portfolio analysis. In the literature, three types of static factor models are available for studying
asset returns, namely, the macroeconomic factor model, fundamental factor model and statistical
factor model. The macroeconomic factor model uses macroeconomic variables such as growth rate
of GDP, interest rates, inflation rate, and unemployment rate to describe the common behavior of
asset returns. The fundamental factor model uses firm or asset specific attributes such as firm size,
book and market values, and industrial classification to construct common factors. The statistical
factor model treats the common factors as unobservable to be estimated from the return series.
Macroeconomic Factor Model Suppose that there exist m assets and their returns at time t are
expressed as Xi (t) where i = 1, . . . , m and t ∈ Z. Suppose also that there exist q (q < m) kinds of
risk factor F1 (t), . . . , Fq (t) which depend on time t. A general form for the factor model is

X(t) = a + bF (t) + ǫ(t) (3.2.22)

where a = (ai )i=1,...,m is an m vector, b = (bi j )i=1,...,m, j=1,...,q is an m × q matrix (bi j ’s are called factor
loadings), F (t) = (F1 (t), . . . , Fq (t))′ is a vector of q common factors and ǫ(t) = (ǫ1 (t), . . . , ǫm (t))′ is
an m-dimensional error term. For asset returns, the factor F(t) is assumed to be a q-dimensional stationary process such that
E{F (t)} = µF , Var{F (t)} = ΣF

and the error term ǫ(t) is a white noise series and uncorrelated with the factor F (t), that is,

E{ǫ(t)} = 0, Var{ǫ(t)} = Σǫ = diag(σ21 , . . . , σ2m ), Cov{F (t), ǫ(t)} = 0.

Then the mean vector and covariance matrix of the return X(t) is

E{X(t)} = a + bµF , Var{X(t)} = bΣF b′ + Σǫ .

For macroeconomic factor models, the factors are observed and we can apply the least squares
method to the multivariate linear regression (MLR) model ((3.2.22) is a special form of the MLR
model). Observing (X(1), F (1)), . . . , (X(n), F (n)), we have
     
 X(1)′   1 F (1)′  " ′ #  ǫ(1)′ 
 ..  
  .. .. 
 a 
 
X :=  . 
 =  . . 
 ′ +  ...  := GA + E.
    b  
X(n)′ 1 F (n)′ ǫ(n)′
The ordinary least squares (OLS) estimator of A is obtained by
\[
\hat{A} := \begin{pmatrix} \hat{a}' \\ \hat{b}' \end{pmatrix} = (G'G)^{-1}(G'X). \tag{3.2.23}
\]
In addition, the estimator of Σ_ε is obtained by
\[
\hat{\Sigma}_\epsilon = \operatorname{diag}\left[ \frac{1}{n-q-1}\, (X - G\hat{A})'(X - G\hat{A}) \right]. \tag{3.2.24}
\]
Based on (3.2.23), (3.2.24) and
\[
\hat{\mu}_F = \frac{1}{n}\sum_{t=1}^{n} F(t), \qquad
\hat{\Sigma}_F = \frac{1}{n-1}\sum_{t=1}^{n} \{F(t) - \hat{\mu}_F\}\{F(t) - \hat{\mu}_F\}',
\]
the mean-variance optimal portfolio estimator under the macroeconomic factor model is obtained
by

g(θ̂ MF ) = g(µ̂ MF , Σ̂ MF )

where

µ̂ MF = â + b̂µ̂F , Σ̂ MF = b̂Σ̂F b̂′ + Σ̂ǫ

and g is the optimal portfolio function defined in (3.2.3).
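The whole estimation chain — the OLS fit (3.2.23), the residual covariance (3.2.24), and the plug-in moments µ̂_MF and Σ̂_MF — can be sketched on simulated data (the factor process and all parameter values below are hypothetical):

```python
import numpy as np

# OLS estimation of the macroeconomic factor model (3.2.22) and the plug-in
# moments mu_MF and Sigma_MF; factors and returns are simulated, q = 2.
rng = np.random.default_rng(7)
m, q, n = 5, 2, 300
a = rng.normal(0.01, 0.005, m)
b = rng.normal(0.5, 0.3, (m, q))
F = rng.normal(size=(n, q))                     # observed factors
X = a + F @ b.T + rng.normal(0, 0.1, (n, m))    # returns

G = np.hstack([np.ones((n, 1)), F])             # rows (1, F(t)')
A_hat = np.linalg.solve(G.T @ G, G.T @ X)       # stacked (a', b')
a_hat, b_hat = A_hat[0], A_hat[1:].T
resid = X - G @ A_hat
Sigma_eps = np.diag(np.diag(resid.T @ resid / (n - q - 1)))

mu_F, Sigma_F = F.mean(axis=0), np.cov(F, rowvar=False)
mu_MF = a_hat + b_hat @ mu_F
Sigma_MF = b_hat @ Sigma_F @ b_hat.T + Sigma_eps
print("max |b_hat - b| =", np.abs(b_hat - b).max())
```

With q ≪ m the factor structure replaces the m(m + 1)/2 free covariance entries by mq loadings plus m idiosyncratic variances, which is the parsimony argument made above.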


Instead of OLS, there are other methods such as ridge regression, LASSO and the elastic net. The criterion function is as follows:
\[
(\hat{a}, \hat{b}) = \arg\min_{(a,b)} \sum_{t=1}^{n} \{X(t) - a - bF(t)\}'\{X(t) - a - bF(t)\}
+ \lambda \sum_{i=1}^{q} \big\{ (1-\alpha)\|b_i\|^2 + \alpha\|b_i\| \big\} \tag{3.2.25}
\]

where b = (b₁, ..., b_q). In (3.2.25), the value λ = 0 corresponds to the OLS method, α = 0 corresponds to ridge regression, α = 1 corresponds to the (grouped) LASSO, and intermediate values of α give the elastic net. These methods differ from OLS in that they impose penalties on the size of the regression coefficients. Such methods are called regularization, and one of their most important purposes is to avoid overfitting.
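The ridge case, for instance, has a closed form. A sketch that penalizes the loadings b but not the intercept a (data centered first; λ is a hypothetical tuning value):

```python
import numpy as np

# Ridge regression for the factor loadings in closed form, penalizing b
# but not the intercept a (data centered first; lam is hypothetical).
rng = np.random.default_rng(8)
m, q, n, lam = 5, 2, 100, 5.0
b = rng.normal(0.5, 0.3, (m, q))
F = rng.normal(size=(n, q))
X = 0.01 + F @ b.T + rng.normal(0, 0.1, (n, m))

Fc, Xc = F - F.mean(axis=0), X - X.mean(axis=0)
b_ridge = np.linalg.solve(Fc.T @ Fc + lam * np.eye(q), Fc.T @ Xc).T
b_ols = np.linalg.solve(Fc.T @ Fc, Fc.T @ Xc).T
a_ridge = X.mean(axis=0) - b_ridge @ F.mean(axis=0)
print("||b_ridge|| =", np.linalg.norm(b_ridge),
      " ||b_ols|| =", np.linalg.norm(b_ols))
```

The penalty shrinks every direction of the loadings toward zero, which stabilizes the estimate when factors are nearly collinear or n is small.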
Fundamental Factor Model In fundamental factor models, the factor loadings are assumed to be observable asset-specific fundamentals such as industrial classification, market capitalization, book value, and style classification (growth or value), used to construct common factors that explain the excess returns. In the literature, there are two approaches to this model: the BARRA approach proposed by
Rosenberg (1974), founder of BARRA Inc, and the Fama–French approach proposed by Fama and
French (1992). In what follows, we briefly discuss the BARRA approach. Suppose that the factor
model

X(t) = bF (t) + ǫ(t)

where b = (bi j )i=1,...,m, j=1,...,q is given m × q matrix, F (t) = (F1 (t), . . . , Fq (t))′ is an unobserved factor
and ǫ(t) = (ǫ1 (t), . . . , ǫm (t))′ is an error term. In this model, we have to estimate µF , ΣF and Σǫ . To
do so, we use a two-step procedure as follows:
• In step one, the ordinary least squares (OLS) method is used at each time t to obtain preliminary
estimates of F (t), ǫ(t) and Σǫ , namely,

\[
\hat{F}_o(t) = (b'b)^{-1} b' X(t), \qquad \hat{\epsilon}_o(t) = X(t) - b\hat{F}_o(t),
\]
\[
\hat{\Sigma}_{\epsilon,o} = \operatorname{diag}\left[ \frac{1}{n-1} \sum_{t=1}^{n} \hat{\epsilon}_o(t)\, \hat{\epsilon}_o(t)' \right].
\]

• In step two, by using the above Σ̂ǫ,o , we use the generalized least squares (GLS) method to obtain
a refined estimator of F (t), ǫ(t) and Σǫ , namely,

F̂g (t) = (b′ Σ̂−1 −1 ′ −1


ǫ,o b) b Σ̂ǫ,o X(t), ǫ̂g (t) = X(t) − bF̂g (t),
 


 1 Xn 

′
Σ̂ǫ,g = diag  n − 1 ǫ̂g (t) ǫ̂g (t) 
 .
 
t=1
The mean-variance optimal portfolio estimator under the BARRA factor model (fundamental factor
model) is obtained by
g(θ̂FF ) = g(µ̂FF , Σ̂FF )
where
µ̂FF = bµ̂F , Σ̂FF = bΣ̂F b′ + Σ̂ǫ,g ,
and
\[
\hat{\mu}_F = \frac{1}{n}\sum_{t=1}^{n} \hat{F}_g(t), \qquad
\hat{\Sigma}_F = \frac{1}{n-1}\sum_{t=1}^{n} \{\hat{F}_g(t) - \hat{\mu}_F\}\{\hat{F}_g(t) - \hat{\mu}_F\}'.
\]

Statistical Factor Model The statistical factor model aims to identify a few factors that can account
for most of the variations in the covariance or correlation matrix obtained from the observed data.
Consider the general factor model
X(t) = a + bF (t) + ǫ(t)
where both b and F (t) are assumed to be unobservable. In addition, we assume that the factor F (t)
and the error term ǫ(t) satisfy
E{F (t)} = 0, Var{F (t)} = Iq ,
E{ǫ(t)} = 0, Var{ǫ(t)} = Σǫ , Cov{F (t), ǫ(t)} = 0
where Iq is the q×q identity matrix. In this model, it is well known that principal component analysis
(PCA) is available to estimate the optimal portfolio. Denoting the covariance matrix of X(t) as Σ
(nonnegative definite), it is easy to see that
Var{X(t)} = Σ = bb′ + Σǫ .
Let (λ₁, e₁), ..., (λ_m, e_m) be the pairs of eigenvalues and eigenvectors of Σ, where λ₁ > λ₂ > ··· > λ_m > 0 and eᵢ = (e_{i1}, ..., e_{im})′ satisfies ‖eᵢ‖ = 1 for i = 1, ..., m. Based on the first q pairs (λᵢ, eᵢ) (i = 1, ..., q), we have
\[
\Sigma \approx [e_1, \dots, e_q] \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_q \end{pmatrix} \begin{pmatrix} e_1' \\ \vdots \\ e_q' \end{pmatrix} = PDP' = (PD^{1/2})(PD^{1/2})'.
\]
If X(t) is perfectly fitted by F(t), the error term ε(t) must vanish. In that case, we can write
Σ = bb′ = (PD1/2 )(PD1/2 )′
which implies that b = PD^{1/2}. In the same manner, we consider the empirical case. Again let (λ̂₁, ê₁), ..., (λ̂_m, ê_m) be the pairs of eigenvalues and eigenvectors of the sample covariance matrix
\[
\hat{\Sigma} = \frac{1}{n-1}\sum_{t=1}^{n} \{X(t) - \hat{\mu}\}\{X(t) - \hat{\mu}\}' \qquad \text{with} \qquad \hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} X(t)
\]

where λ̂₁ > λ̂₂ > ··· > λ̂_m > 0 and êᵢ = (ê_{i1}, ..., ê_{im})′ satisfies ‖êᵢ‖ = 1 for i = 1, ..., m. Based on the first q pairs (λ̂ᵢ, êᵢ) (i = 1, ..., q), the estimator of the factor loadings is given by
\[
\hat{b} = \hat{P}\hat{D}^{1/2} = [\hat{e}_1, \dots, \hat{e}_q] \begin{pmatrix} \sqrt{\hat{\lambda}_1} & & \\ & \ddots & \\ & & \sqrt{\hat{\lambda}_q} \end{pmatrix}.
\]
In addition, the estimator of Σǫ is given by

Σ̂ǫ = diag[Σ̂ − b̂b̂′ ].

The mean-variance optimal portfolio estimator under the statistical factor model is obtained by

g(θ̂S F ) = g(µ̂S F , Σ̂S F )

where

µ̂S F = µ̂, Σ̂S F = b̂b̂′ + Σ̂ǫ .
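The PCA recipe above in code, on simulated two-factor data:

```python
import numpy as np

# PCA estimation of the statistical factor model: loadings from the first
# q eigen-pairs of the sample covariance matrix (simulated data).
rng = np.random.default_rng(10)
m, q, n = 8, 2, 500
b = rng.normal(0.0, 1.0, (m, q))
F = rng.normal(size=(n, q))
X = F @ b.T + rng.normal(0, 0.3, (n, m))

mu_hat = X.mean(axis=0)
S = (X - mu_hat).T @ (X - mu_hat) / (n - 1)
vals, vecs = np.linalg.eigh(S)                  # ascending eigenvalues
idx = np.argsort(vals)[::-1][:q]                # first q pairs
b_hat = vecs[:, idx] * np.sqrt(vals[idx])       # b_hat = P D^{1/2}
Sigma_eps = np.diag(np.diag(S - b_hat @ b_hat.T))
Sigma_SF = b_hat @ b_hat.T + Sigma_eps
print("fit error ||S - Sigma_SF||_F =", np.linalg.norm(S - Sigma_SF))
```

Note that the loadings are identified only up to rotation: b̂ spans the same column space as b but need not match it elementwise, which is harmless for the plug-in covariance Σ̂_SF.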

3.2.5.2 Dynamic factor models


Features of the traditional factor model defined by (3.2.22) are
• The number of risk factors (q) is finite,
• The risk factors (F1 (t), . . . , Fq (t)) are uncorrelated with each other (i.e., Cov(Fi (t), F j (t)) = 0 for
i , j),
• The risk factor (F (t)) and the error term (ǫ(t)) are also uncorrelated (i.e., Cov(F (t), ǫ(t)) = 0).
Consider the representation of the MA(q) model given by
 
\[
X(t) = a + \sum_{j=0}^{q} A(j)\, U(t-j) = a + (A(1), \dots, A(q)) \begin{pmatrix} U(t-1) \\ \vdots \\ U(t-q) \end{pmatrix} + A(0)\, U(t),
\]

this can be viewed as the classical factor model under the above restrictions by setting b = (A(1), ..., A(q)), F(t) = (U(t−1), ..., U(t−q))′ and ε(t) = A(0)U(t), since the U(t)'s can be assumed uncorrelated with respect to t. On the other hand, in the case of the AR(p) model given by
 
\[
X(t) = \sum_{i=1}^{p} A(i)\, X(t-i) + U(t) = (A(1), \dots, A(p)) \begin{pmatrix} X(t-1) \\ \vdots \\ X(t-p) \end{pmatrix} + U(t),
\]

we need to consider another representation because the risk factors must be uncorrelated. If {X(t)} is a stationary process, we can write
 
X
∞  U(t − 1) 
 
X(t) = Ã( j)U(t − j) = (Ã(1), Ã(2) , . . .)  U(t − 2)
2  + Ã(0)U(t)
 .. 
j=0
.

(e.g., Remark 4 of Brockwell and Davis (2006)), which implies that by setting b =
(Ã(1), Ã(2), . . .), F (t) = (U(t − 1), U(t − 2), . . .)′ and ǫ(t) = Ã(0)U(t), the uncorrelated conditions
are satisfied but the number of risk factors is infinite.
To overcome this problem, Sargent and Sims (1977) and John (1977) proposed the “dynamic
factor model” which is given by
\[
X(t) = \sum_{j=0}^{\infty} A(j)\, Z(t-j) + U(t) := \sum_{j=0}^{\infty} A(j) L^j Z(t) + U(t) := A(L) Z(t) + U(t) \tag{3.2.26}
\]

where X(t) = (X1 (t), . . . , Xm (t))′ is the m-vector of observed dependent variables, U (t) =
(U1 (t), . . . , Um (t))′ is the m-vector of residuals, Z(t) = (Z1 (t), . . . , Zq (t))′ is the q-vector of factors
with q < m, A( j) = (Aab ( j)) is the m × q matrix and L is the lag operator.
Although the dynamic factor model includes time series models such as VAR or VARMA models, cross-correlation among the idiosyncratic components is not allowed (i.e., Cov(Uᵢ(t), Uⱼ(t)) = 0 for i ≠ j is assumed). On the other hand, Chamberlain (1983) and Chamberlain and Rothschild (1983) showed that the assumption that the idiosyncratic components are uncorrelated is unnecessarily strong and can be relaxed; they introduced the approximate factor model, which imposes a weaker restriction on the distribution of the asset returns.
Forni et al. (2000) argue “Factor models are an interesting alternative in that they can provide a
much more parsimonious parameterization. To address properly all the economic issues cited above,
however, a factor model must have two characteristics. First, it must be dynamic, since business
cycle questions are typically dynamic questions. Second, it must allow for cross-correlation among
idiosyncratic components, since orthogonality is an unrealistic assumption for most applications.”
It is easy to see that the dynamic factor model satisfies the first requirement but not the second (i.e., the dynamic factor model has orthogonal idiosyncratic components), while the approximate factor model satisfies the second but not the first (i.e., the approximate factor model is static).
The generalized dynamic factor model introduced by Forni et al. (2000) provides both features.
Definition 3.2.10. [generalized dynamic factor model (Forni et al. (2000))] Suppose that all
the stochastic variables taken into consideration belong to the Hilbert space L2 (Ω, F , P), where
(Ω, F , P) is a given probability space; thus all first and second moments are finite. We will study a
double sequence
{xit , i ∈ N, t ∈ Z},
where

xit = bi1 (L)u1t + bi2 (L)u2t + · · · + biq (L)uqt + ξit .

Remark 3.2.11. For any m ∈ N, a generalized dynamic model {Xi (t), i ∈ N, t ∈ Z} with

Xi (t) = bi1 F1 (t) + · · · + biq Fq (t) + ǫi (t)

can be described as
      
 X1 (t)   b11 · · · b1q   F1 (t)   ǫ1 (t) 
   .   .   . 
X(t) :=  ...  =  .. ..
.
..
.   ..  +  ..  := bF (t) + ǫ(t),
       
Xm (t) bm1 · · · bmq Fq (t) ǫm (t)

which implies that the traditional (static) factor model defined by (3.2.22) with a = 0 is a partial
realization of the generalized factor model.
Forni et al. (2000) imposed the following assumptions (Assumptions 3.2.12–3.2.15) for the
generalized dynamic factor model.
Assumption 3.2.12. (i) The q-dimensional vector process $\{(u_{1t}, \dots, u_{qt})',\ t \in \mathbb{Z}\}$ is orthonormal white noise, i.e., $E(u_{jt}) = 0$, $V(u_{jt}) = 1$ for any j, t, and $\mathrm{Cov}(u_{j_1 t_1}, u_{j_2 t_2}) = 0$ for any $j_1, j_2, t_1, t_2$ with $j_1 \ne j_2$ or $t_1 \ne t_2$.
(ii) $\xi = \{(\xi_{it},\ i \in \mathbb{N},\ t \in \mathbb{Z})\}$ is a double sequence such that, firstly,
\[
\xi_n = \{(\xi_{1t}, \dots, \xi_{nt})',\ t \in \mathbb{Z}\}
\]
is a zero-mean stationary vector process for any n, and, secondly, $\mathrm{Cov}(\xi_{j_1 t_1}, u_{j_2 t_2}) = 0$ for any $j_1, j_2, t_1, t_2$.
(iii) The filters $b_{ij}(L)$ are one-sided in L (i.e., we can write $b_{ij}(L)u_{jt} = \sum_{\ell=0}^{\infty} b_{ij}(\ell) L^{\ell} u_{jt} = \sum_{\ell=0}^{\infty} b_{ij}(\ell)\, u_{j,t-\ell}$) and their coefficients are square summable (i.e., $\sum_{\ell=0}^{\infty} b_{ij}(\ell)^2 < \infty$).
This assumption implies that the sub-process $x_m = \{x_{mt} = (x_{1t}, \dots, x_{mt})',\ t \in \mathbb{Z}\}$ is zero-mean and stationary for any m.
Let $\Sigma_m(\theta)$ denote the spectral density matrix of the vector process $x_{mt} = (x_{1t}, \dots, x_{mt})'$, that is,
\[
\Sigma_m(\theta) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \Gamma_{mk}\, e^{-ik\theta}, \qquad \Gamma_{mk} = \operatorname{Cov}(x_{mt}, x_{m,t-k}).
\]
Let $\sigma_{ij}(\theta)$ be its (i, j)-th component (note that for $m_1 \ne m_2$, the (i, j)-th component of $\Sigma_{m_1}(\theta)$ coincides with that of $\Sigma_{m_2}(\theta)$).
Assumption 3.2.13. For any i ∈ ℕ, there exists a real cᵢ > 0 such that $\sigma_{ii}(\theta) \le c_i$ for any θ ∈ [−π, π].
This assumption expresses the boundedness of the spectral densities, but the bound need not be uniform in i.
Let $\Sigma_m^{\xi}(\theta)$ and $\Sigma_m^{\chi}(\theta)$ denote the spectral density matrices of the vector processes $\xi_{mt} = (\xi_{1t}, \dots, \xi_{mt})'$ and $\chi_{mt} = x_{mt} - \xi_{mt}$ (the variable $\chi_{it} = x_{it} - \xi_{it}$ is the i-th component of $\chi_{mt}$ and is called the common component), respectively. Let $\lambda_{mj}^{\xi} \equiv \lambda_{mj}^{\xi}(\theta)$ and $\lambda_{mj}^{\chi} \equiv \lambda_{mj}^{\chi}(\theta)$ denote the dynamic eigenvalues of $\Sigma_m^{\xi}(\theta)$ and $\Sigma_m^{\chi}(\theta)$, respectively.¹⁰ The $\lambda_{mj}^{\xi}$ and $\lambda_{mj}^{\chi}$ are called the idiosyncratic (dynamic) eigenvalues and the common (dynamic) eigenvalues, respectively.
Assumption 3.2.14. The first idiosyncratic dynamic eigenvalue $\lambda_{m1}^{\xi}$ is uniformly bounded. That is, there exists a real Λ such that $\lambda_{m1}^{\xi}(\theta) \le \Lambda$ for any θ ∈ [−π, π] and m ∈ ℕ.
This assumption expresses the uniform boundedness for all idiosyncratic dynamic eigenvalues.
Assumption 3.2.15. The first q common dynamic eigenvalues diverge almost everywhere in [−π, π]. That is, lim_{m→∞} λ^χ_{mj}(θ) = ∞ for j ≤ q, a.e. in [−π, π].
This assumption guarantees a minimum amount of cross-correlation between the common components (see Forni et al. (2000)).

Suppose that we observed an m × T data matrix

    X_{mT} = (x_{m1}, . . . , x_{mT}). (3.2.27)

Then, the sample mean vector can be easily obtained by

    μ̂_m = (1/T) Σ_{t=1}^{T} x_{mt}. (3.2.28)

Let Σ̂_m(θ) be a consistent periodogram-smoothing or lag-window estimator of the m × m spectral density matrix Σ_m(θ) of x_{mt}, given by

    Σ̂_m(θ) = (1/2π) Σ_{k=−M}^{M} Γ̂_{mk} ω_k e^{−ikθ},   Γ̂_{mk} = (1/(T − k)) Σ_{t=1}^{T−k} (x_{mt} − μ̂_m)(x_{m,t+k} − μ̂_m)′, (3.2.29)

where ω_k = 1 − [|k|/(M + 1)] are the weights corresponding to the Bartlett lag window for a fixed integer M < T. Let λ̂_j(θ) and p̂_j(θ) denote the jth largest eigenvalue and the corresponding row eigenvector of Σ̂_m(θ), respectively. Forni et al. (2005) propose the estimators of the auto-covariance matrices of χ_{mt} and ξ_{mt} (i.e., Γ^χ_{mk} = Cov(χ_{mt}, χ_{m,t+k}), Γ^ξ_{mk} = Cov(ξ_{mt}, ξ_{m,t+k})) as

    Γ̂^χ_{mk} = ∫_{−π}^{π} e^{ikθ} Σ̂^χ_m(θ) dθ,   Γ̂^ξ_{mk} = ∫_{−π}^{π} e^{ikθ} Σ̂^ξ_m(θ) dθ, (3.2.30)

10 The dynamic eigenvalue λ_{mj} of Σ_m is defined as the function associating with any θ ∈ [−π, π] the real nonnegative jth eigenvalue of Σ_m(θ), in descending order of magnitude.
58 PORTFOLIO THEORY FOR DEPENDENT RETURN PROCESSES
where Σ̂^χ_m(θ) and Σ̂^ξ_m(θ) are estimators of the spectral density matrices Σ^χ_m(θ) and Σ^ξ_m(θ), given by

    Σ̂^χ_m(θ) = λ̂_1(θ)p̂*_1(θ)p̂_1(θ) + · · · + λ̂_q(θ)p̂*_q(θ)p̂_q(θ), (3.2.31)
    Σ̂^ξ_m(θ) = λ̂_{q+1}(θ)p̂*_{q+1}(θ)p̂_{q+1}(θ) + · · · + λ̂_m(θ)p̂*_m(θ)p̂_m(θ). (3.2.32)

Note that
• for the actual calculation, the integration is approximated by a sum over the discretized frequencies θ_h = πh/M with h = −M, . . . , M,
• q is the number of dynamic factors; Forni et al. (2000) suggest how to choose q,
• p̂∗ is the adjoint (transposed, complex conjugate) of p̂.
Since, by Assumption 3.2.12, the (true) covariance matrix Γ_{m0} can be decomposed into the sum of the covariance matrices Γ^χ_{m0} = V(χ_{mt}) and Γ^ξ_{m0} = V(ξ_{mt}), we can obtain the covariance matrix estimator by

    Γ̃_{m0} = Γ̂^χ_{m0} + Γ̂^ξ_{m0}. (3.2.33)

From (3.2.28) and (3.2.33), we obtain the mean-variance optimal portfolio estimator under the dynamic factor model as

    g(μ̂_m, Γ̃_{m0}).
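As a rough illustration, the estimation pipeline (3.2.28)–(3.2.33) can be sketched in Python/NumPy. This is our own minimal sketch, not Forni et al.'s implementation: the function names, the discretization of the integral (3.2.30) by a Riemann sum, and the use of column eigenvectors (in place of the row eigenvectors in the text) are illustrative choices.

```python
import numpy as np

def bartlett_spectral_estimator(X, M):
    """Lag-window spectral density estimate (3.2.29) on theta_h = pi*h/M, h = -M..M."""
    m, T = X.shape
    Y = X - X.mean(axis=1, keepdims=True)       # subtract the sample mean (3.2.28)
    thetas = np.pi * np.arange(-M, M + 1) / M
    S = np.zeros((len(thetas), m, m), dtype=complex)
    for k in range(-M, M + 1):
        w = 1.0 - abs(k) / (M + 1.0)            # Bartlett weights
        if k >= 0:
            G = Y[:, :T - k] @ Y[:, k:].T / (T - k)   # \hat{Gamma}_{mk}
        else:
            G = Y[:, -k:] @ Y[:, :T + k].T / (T + k)
        S += (w / (2 * np.pi)) * G[None, :, :] * np.exp(-1j * k * thetas)[:, None, None]
    return thetas, S

def factor_covariance(X, M=10, q=1):
    """Covariance estimator (3.2.33): common + idiosyncratic parts via (3.2.30)-(3.2.32)."""
    thetas, S = bartlett_spectral_estimator(X, M)
    m = X.shape[0]
    G_chi = np.zeros((m, m), dtype=complex)
    G_xi = np.zeros((m, m), dtype=complex)
    for h in range(len(thetas)):
        lam, P = np.linalg.eigh(S[h])           # eigenvalues in ascending order
        Pc, Pi = P[:, -q:], P[:, :-q]           # q leading / remaining eigenvectors
        S_chi = (Pc * lam[-q:]) @ Pc.conj().T   # common part, cf. (3.2.31)
        S_xi = (Pi * lam[:-q]) @ Pi.conj().T    # idiosyncratic part, cf. (3.2.32)
        G_chi += S_chi * (np.pi / M)            # Riemann sum for (3.2.30), k = 0
        G_xi += S_xi * (np.pi / M)
    return (G_chi + G_xi).real                  # \tilde{Gamma}_{m0} in (3.2.33)
```

In practice one would feed `factor_covariance` the observed return panel and pass the result, together with the sample mean, to the portfolio function g(μ̂_m, Γ̃_{m0}).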

3.2.6 High-Dimensional Problems


Many studies demonstrate that inference for the parameters arising in portfolio optimization often fails. The recent literature shows that this phenomenon is mainly due to a high-dimensional asset universe. Typically, such a universe refers to the asymptotics in which the sample size (n) and the sample dimension (m) both go to infinity (see Glombek (2012)). In this section, we discuss asymptotic properties of estimators of the global minimum portfolio (MP), the tangency portfolio (TP) and the naive portfolio (NP) under m, n → ∞. When m goes to infinity, we cannot analyze the portfolio weight in the same way as in Theorem 3.2.5. However, it is possible to discuss the asymptotics of the portfolio mean, the portfolio variance, the Sharpe ratio and so on; here we concentrate on the asymptotics of the portfolio mean and variance under m, n → ∞. The literature treating this problem splits into (I) the case of m/n → y ∈ (0, 1) and (II) the case of m/n → y ∈ (1, ∞).
(I) is the case where m < n, so it is natural to assume that Σ̂^{−1} exists, since rank(Σ̂) = m, where Σ̂ is the sample covariance matrix based on X_1, . . . , X_n. For case (I), we mainly follow Glombek (2012). As we mentioned earlier, the traditional estimator of the mean-variance optimal portfolio (g(μ, Σ)) is based on the sample mean vector μ̂ and the sample covariance matrix Σ̂ (i.e., g(μ̂, Σ̂)). In the large-n, large-m setting, the spectral properties of Σ̂ have been studied by many authors (e.g., Marčenko and Pastur (1967)). In particular, the eigenvalues of the sample covariance matrix differ substantially from those of the true covariance matrix when both n and m are large. That is why optimal portfolio estimation involving the sample covariance matrix should take this difficulty into account. Bai et al. (2009) and El Karoui (2010) discussed this problem under the (n, m)-asymptotics, which correspond to n, m → ∞ and m/n → y ∈ (0, 1).
(II) is the case where Σ̂^{−1} never exists because rank(Σ̂) = n < m; in that case, we cannot construct estimators of portfolio quantities by the usual plug-in method. To overcome this problem, Bodnar et al. (2014) introduced a generalized inverse of the sample covariance matrix as the estimator of the covariance matrix Σ, proposed an estimator of the global minimum variance portfolio and showed its asymptotic properties. For case (II), we will present the result of Bodnar et al. (2014).
3.2.6.1 The case of m/n → y ∈ (0, 1)
Suppose there exist m assets whose random returns are given by X_t = (X_1(t), . . . , X_m(t))′ with E(X_t) = μ and Cov(X_t, X_t) = Σ (assumed nonsingular). Glombek (2012) analyzes the inference for three types of portfolios:

• MP: the global minimum portfolio, whose weight is defined in (3.1.3).
• TP: the tangency portfolio, whose weight is defined in (3.1.7) with μ_f = 0.
• NP: the naive portfolio, whose weights are all equal.

Let w_MP, w_TP, w_NP denote the portfolio weights for MP, TP, NP, respectively. Then we can write

    w_MP = Σ^{−1}e / (e′Σ^{−1}e),   w_TP = Σ^{−1}μ / (e′Σ^{−1}μ),   w_NP = e/m. (3.2.34)

Let μ_MP, μ_TP, μ_NP denote the portfolio means for MP, TP, NP, respectively. Since a portfolio mean is given by μ(w) = w′μ, we can write

    μ_MP = μ′Σ^{−1}e / (e′Σ^{−1}e),   μ_TP = μ′Σ^{−1}μ / (e′Σ^{−1}μ),   μ_NP = μ′e/m. (3.2.35)

Furthermore, let σ²_MP, σ²_TP, σ²_NP denote the portfolio variances for MP, TP, NP, respectively. Since a portfolio variance is given by η²(w) = w′Σw, we can write

    σ²_MP = 1/(e′Σ^{−1}e),   σ²_TP = μ′Σ^{−1}μ / (e′Σ^{−1}μ)²,   σ²_NP = e′Σe/m². (3.2.36)
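The three weights and the associated means and variances in (3.2.34)–(3.2.36) are directly computable. The following NumPy sketch (our own illustration, with made-up parameter values) also checks that the MP variance is the smallest among fully invested portfolios.

```python
import numpy as np

def portfolio_quantities(mu, Sigma):
    """Weights, means and variances of MP, TP and NP, as in (3.2.34)-(3.2.36)."""
    m = len(mu)
    e = np.ones(m)
    Si = np.linalg.inv(Sigma)
    w = {
        "MP": Si @ e / (e @ Si @ e),
        "TP": Si @ mu / (e @ Si @ mu),
        "NP": e / m,
    }
    means = {P: wp @ mu for P, wp in w.items()}
    variances = {P: wp @ Sigma @ wp for P, wp in w.items()}
    return w, means, variances

mu = np.array([1.0, 0.5, 0.8])
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])
w, means, variances = portfolio_quantities(mu, Sigma)
# every weight vector is fully invested, and MP has the smallest variance
assert all(abs(sum(wp) - 1.0) < 1e-10 for wp in w.values())
assert variances["MP"] <= min(variances["TP"], variances["NP"])
```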
We suppose that X_1, . . . , X_n are observed; then the sample mean μ̂ and the sample covariance matrix Σ̂ (assumed nonsingular) are obtained by

    μ̂ = (1/n) Σ_{t=1}^{n} X_t,   Σ̂ = (1/n) Σ_{t=1}^{n} (X_t − μ̂)(X_t − μ̂)′, (3.2.37)

which implies that the estimators of the above quantities are obtained by

    μ̂_MP = μ̂′Σ̂^{−1}e / (e′Σ̂^{−1}e),   μ̂_TP = μ̂′Σ̂^{−1}μ̂ / (e′Σ̂^{−1}μ̂),   μ̂_NP = μ̂′e/m, (3.2.38)
    σ̂²_MP = 1/(e′Σ̂^{−1}e),   σ̂²_TP = μ̂′Σ̂^{−1}μ̂ / (e′Σ̂^{−1}μ̂)²,   σ̂²_NP = e′Σ̂e/m². (3.2.39)

In what follows, we discuss the asymptotic properties of these estimators under the following assumptions:
Assumption 3.2.16. (1) X_t ~ i.i.d. N_m(μ, Σ).
(2) The (n, m)-asymptotics are given by n, m → ∞ and m/n → y ∈ (0, 1).
(3) The (n, m)-limit

    μ_P ≡ μ_P^{(m)} → μ_P^{(∞)} ∈ R \ {0}

exists for P ∈ {MP, TP, NP}.
(4) We have either
a) σ²_P → 0 or
b) σ²_P → (σ_P^{(∞)})² ∈ (0, ∞),
under the (n, m)-asymptotics, where P ∈ {MP, TP, NP}.
Since the Gaussian assumption (1) is often criticized, Bai et al. (2009) discussed the (n, m)-asymptotics without Gaussianity. As we mentioned, the (n, m)-asymptotics (2) guarantee that the sample covariance matrix remains nonsingular in the high-dimensional framework. According to Glombek (2012), Assumption (3) states that the components of the vector μ are summable with respect to the signed measure induced by the weight vector w_P, where P ∈ {MP, TP, NP} (i.e., μ_P^{(∞)} = Σ_{i=1}^{∞} w_{i,P} μ_i < ∞ for μ = (μ_i)_{i=1}^{∞}, w_P = (w_{i,P})_{i=1}^{∞}). Assumption (4a) is satisfied if λ_max(Σ) = o(d), where λ_max(Σ) denotes the largest eigenvalue of Σ. Economically, this condition may be interpreted as perfect diversification, as the variance of the portfolio returns can be made arbitrarily small by increasing the number of considered assets (see Glombek (2012)). Assumption (4b) is a more “natural” case than (4a) because the positivity of this limit requires λ_max(Σ) = O(d) under the (n, m)-asymptotics.
For μ̂_MP, μ̂_TP, μ̂_NP, the consistency under the (n, m)-asymptotics is given as follows:
Theorem 3.2.17 (Theorems 5.8, 5.17 and 5.25 of Glombek (2012)). Under (1)–(3) and (4a) (only for μ̂_TP) of Assumption 3.2.16, we have

    μ̂_MP →^{a.s.} μ_MP^{(∞)},   μ̂_TP →^{a.s.} μ_TP^{(∞)},   μ̂_NP →^{a.s.} μ_NP^{(∞)}.

Under (1)–(3) and (4b) of Assumption 3.2.16, we have

    μ̂_TP →^{p} ( 1 + y/(SR_TP^{(∞)})² ) μ_TP^{(∞)},

where (SR_TP^{(∞)})² (< ∞) is given by lim_{m→∞} SR²_TP with SR²_TP = μ′Σ^{−1}μ.
Proof: See Theorems 5.8, 5.17 and 5.25 of Glombek (2012), Corollary 2 and Lemma 3.1 of Bai et al. (2009) and Theorem 4.6 of El Karoui (2010). □

Note that the estimator μ̂_TP has a bias under the (n, m)-asymptotics in the natural case of Assumption 3.2.16 (4b). The next theorem gives the asymptotic normality of μ̂_MP, μ̂_TP, μ̂_NP under the (n, m)-asymptotics.
Theorem 3.2.18 (Theorems 5.9, 5.18 and 5.26 of Glombek (2012)). Under (1)–(3) and (4) of Assumption 3.2.16, we have

    √n (μ̂_MP − μ_MP) / √{ σ̂²_MP(1 + μ̂′Σ̂^{−1}μ̂) − μ̂²_MP } →^{d} N( 0, 1/(1 − y) ),

    √n ( μ̂_TP − (1 + (m/n)/SR²_TP) μ_TP ) →^{d} N( 0, V^μ_TP(y)/(1 − y) ),

    √n (μ̂_NP − μ_NP)/σ̂_NP →^{d} N(0, 1),

where

    V^μ_TP(y) = (σ_TP^{(∞)})² (μ_TP^{(∞)}/μ_MP^{(∞)}) { 1 + (SR_TP^{(∞)})² − (μ_TP^{(∞)})² }
      + (σ_TP^{(∞)})² [ 2y{1 + (SR_TP^{(∞)})²}{(SR_TP^{(∞)})² − (SR_MP^{(∞)})²} + y²{(SR_TP^{(∞)})² + (SR_MP^{(∞)})² + 1} ] / [ (SR_MP^{(∞)})²(SR_TP^{(∞)})² ],

and (SR_MP^{(∞)})² (< ∞) is given by lim_{m→∞} SR²_MP with SR²_MP = (e′Σ^{−1}μ)²/(e′Σ^{−1}e).
Proof: See Theorems 5.9, 5.18 and 5.26 of Glombek (2012) (see also Frahm (2010)). □

Based on this result, it is possible to consider a (one- or two-sided) test of the hypothesis

    H_0 : μ_P = μ_0  vs.  H_1 : μ_P ≠ μ_0

or to construct a confidence interval for μ_P (in the case of μ_TP, we need an appropriate estimate of the asymptotic variance, using the bootstrap method or a shrinkage estimation method).
Similarly to the case of the mean, the consistency of σ̂²_MP, σ̂²_TP, σ̂²_NP under the (n, m)-asymptotics is given as follows:
Theorem 3.2.19 (Theorems 5.5, 5.15 and 5.23 of Glombek (2012)). Under (1)–(3) and (4a) (only for σ̂²_TP) of Assumption 3.2.16, we have

    σ̂²_MP/σ²_MP →^{a.s.} 1 − y,   σ̂²_TP/σ²_TP →^{a.s.} 1 − y,   σ̂²_NP/σ²_NP →^{a.s.} 1.

Under (1)–(3) and (4b) of Assumption 3.2.16, we have

    σ̂²_TP/σ²_TP →^{p} (1 − y)( 1 + y/(SR_TP^{(∞)})² ).

Proof: See Theorems 5.5, 5.15 and 5.23 of Glombek (2012), Corollary 2 and Lemma 3.1 of Bai et al. (2009) and Theorem 4.6 of El Karoui (2010). □
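The 1 − y limit is easy to reproduce numerically. The following sketch (our own illustration, taking Σ = I_m so that σ²_MP = 1/m) shows the ratio σ̂²_MP/σ²_MP settling near 1 − y rather than 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 400, 200            # m/n -> y = 0.5
e = np.ones(m)
true_var = 1.0 / m         # sigma^2_MP = 1/(e' Sigma^{-1} e) with Sigma = I_m

ratios = []
for _ in range(20):
    X = rng.standard_normal((n, m))              # i.i.d. N(0, I_m) returns
    mu_hat = X.mean(axis=0)
    S = (X - mu_hat).T @ (X - mu_hat) / n        # sample covariance (3.2.37)
    est_var = 1.0 / (e @ np.linalg.solve(S, e))  # hat{sigma}^2_MP in (3.2.39)
    ratios.append(est_var / true_var)

print(np.mean(ratios))   # close to 1 - y = 0.5, not to 1
```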

This result shows that σ̂²_MP and σ̂²_TP are inconsistent estimators of σ²_MP and σ²_TP, while σ̂²_NP is a consistent estimator of σ²_NP, under the (n, m)-asymptotics. The next theorem gives the asymptotic normality of σ̂²_MP, σ̂²_TP, σ̂²_NP under the (n, m)-asymptotics.
Theorem 3.2.20 (Theorems 5.6, 5.16 and 5.24 of Glombek (2012)). Under (1)–(3) and (4) of Assumption 3.2.16, we have

    √n ( σ̂²_MP/σ²_MP − (1 − m/n) ) →^{d} N( 0, 2(1 − y) ),

    √n ( σ̂²_TP/σ²_TP − (1 − m/n)( 1 + (m/n)/SR²_TP ) ) →^{d} N( 0, 2(1 − y)V^σ_TP(y) ),

    √n ( σ̂²_NP/σ²_NP − 1 ) →^{d} N(0, 2),

where

    V^σ_TP(y) = [ (SR_TP^{(∞)})⁴ + 4y(SR_TP^{(∞)})² + y ] / [ (SR_MP^{(∞)})²(SR_TP^{(∞)})² ]
      + [ {(SR_TP^{(∞)})² − (SR_MP^{(∞)})²}{(SR_TP^{(∞)})⁴ + 2(SR_TP^{(∞)})² + 3y} + 2y²{1 + (SR_TP^{(∞)})² + (SR_MP^{(∞)})²} ] / [ (SR_MP^{(∞)})²(SR_TP^{(∞)})² ].

Proof: See Theorems 5.6, 5.16 and 5.24 of Glombek (2012). □


Based on this result, it is possible to consider a (one- or two-sided) test for the hypothesis
H0 : σ2P = σ20 v.s. H1 : σ2P , σ20
or to construct the confidence interval for σ2P (as in the case of σ2T P , we need an appropriate estimate
of the asymptotic variance by using the bootstrap method or the shrinkage estimation method).
3.2.6.2 The case of m/n → y ∈ (1, ∞)
Suppose there exist m assets whose random returns are given by X_t = (X_1(t), . . . , X_m(t))′ with E(X_t) = μ and Cov(X_t, X_t) = Σ (assumed nonsingular). Bodnar et al. (2014) analyze the inference for the global minimum portfolio (MP). Again, the portfolio weight (w_MP) and the (oracle) portfolio variance (σ̃²_MP) for MP are given as follows:

    w_MP = Σ^{−1}e / (e′Σ^{−1}e),   σ̃²_MP = w′_MP Σ w_MP = 1/(e′Σ^{−1}e). (3.2.40)

Suppose that X ≡ (X_1, . . . , X_n) are observed. In case (II), the sample covariance matrix Σ̂ is singular and its inverse does not exist because n < m. Thus, Bodnar et al. (2014) use the following generalized inverse of Σ̂:

    Σ̂* = Σ^{−1/2}(XX′)⁺Σ^{−1/2}, (3.2.41)

where ‘+’ denotes the Moore–Penrose inverse. It can be shown that Σ̂* is a generalized inverse satisfying Σ̂*Σ̂Σ̂* = Σ̂* and Σ̂Σ̂*Σ̂ = Σ̂. Obviously, in the case of y < 1, the generalized inverse Σ̂* coincides with the usual inverse Σ̂^{−1}. Note that Σ̂* cannot be determined in practice since it depends on the unknown matrix Σ. Plugging this Σ̂* in place of Σ̂^{−1} into the true portfolio quantities, the following portfolio estimators are obtained:

    w*_MP = Σ̂*e / (e′Σ̂*e),   (σ̃*_MP)² = (w*_MP)′ Σ (w*_MP). (3.2.42)
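Since Σ̂* in (3.2.41) is infeasible, a practical computation replaces it by the Moore–Penrose inverse Σ̂⁺ of the singular sample covariance matrix, as in the bona fide estimator (3.2.44). A minimal sketch of the resulting plug-in GMV weight (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 100                             # m > n: the sample covariance is singular
X = rng.standard_normal((n, m))
mu_hat = X.mean(axis=0)
S = (X - mu_hat).T @ (X - mu_hat) / n      # rank(S) <= n < m

e = np.ones(m)
S_pinv = np.linalg.pinv(S)                 # Moore-Penrose inverse S^+
w = S_pinv @ e / (e @ S_pinv @ e)          # plug-in GMV weight with S^+

assert abs(w.sum() - 1.0) < 1e-8           # fully invested
```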
We impose the following assumptions:
Assumption 3.2.21. (1) The elements of the matrix X = (X_1, . . . , X_n) have uniformly bounded 4 + ε moments for some ε > 0.
(2) The (n, m)-asymptotics are given by n, m → ∞ and m/n → y ∈ (1, ∞).
This assumption covers the Gaussian case (1) of Assumption 3.2.16, so it is a considerably weaker one. The consistency of (σ̃*_MP)² under the (n, m)-asymptotics is given as follows:
Theorem 3.2.22 (Theorem 2.2 of Bodnar et al. (2014)). Under Assumption 3.2.21, we have

    (σ̃*_MP)² →^{a.s.} ( y²/(y − 1) ) σ²_MP.

Proof: See Theorem 2.2 of Bodnar et al. (2014). □

As in the case of y < 1, (σ̃*_MP)² is an inconsistent estimator of σ²_MP under the (n, m)-asymptotics.
Furthermore, they propose the oracle optimal shrinkage estimator of the portfolio weight for
MP:

    w**_MP = α⁺ w*_MP + (1 − α⁺)b,  with b′e = 1, (3.2.43)

where

    α⁺ = (b − w*_MP)′Σb / ( (b − w*_MP)′Σ(b − w*_MP) ),

and the bona fide optimal shrinkage estimator of the portfolio weight for MP:

    w***_MP = α⁺⁺ Σ̂⁺e/(e′Σ̂⁺e) + (1 − α⁺⁺)b,  with b′e = 1, (3.2.44)

where

    α⁺⁺ = (m/n − 1)R̂_b / [ (m/n − 1)² + m/n + (m/n − 1)R̂_b ],
    R̂_b = (m/n)(m/n − 1)(b′Σ̂b)(e′Σ̂⁺e) − 1,

and Σ̂⁺ is the Moore–Penrose pseudo-inverse of the sample covariance matrix Σ̂.
They investigate the difference between the oracle and the bona fide optimal shrinkage estimators for MP, as well as between the oracle and the bona fide traditional estimators. As a result, the proposed estimator shows a significant improvement and turns out to be robust to deviations from normality.
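The bona fide estimator (3.2.44) is directly computable from data. Below is a sketch (our own illustration; the target b = e/m, i.e., the naive portfolio, is our choice, and the function name is hypothetical):

```python
import numpy as np

def shrinkage_gmv(X, b=None):
    """Bona fide shrinkage GMV weight (3.2.44) with target b satisfying b'e = 1."""
    n, m = X.shape
    e = np.ones(m)
    if b is None:
        b = e / m                               # naive-portfolio target
    S = np.cov(X, rowvar=False, bias=True)      # sample covariance, divisor n
    S_pinv = np.linalg.pinv(S)                  # Moore-Penrose inverse
    c = m / n
    R_b = c * (c - 1.0) * (b @ S @ b) * (e @ S_pinv @ e) - 1.0
    alpha = (c - 1.0) * R_b / ((c - 1.0) ** 2 + c + (c - 1.0) * R_b)
    w_star = S_pinv @ e / (e @ S_pinv @ e)
    return alpha * w_star + (1.0 - alpha) * b   # convex combination (3.2.44)

X = np.random.default_rng(2).standard_normal((60, 120))   # n = 60 < m = 120
w = shrinkage_gmv(X)
assert abs(w.sum() - 1.0) < 1e-8    # combination of two unit-sum weight vectors
```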

3.2.7 Bibliographic Notes


Koenker (2005) provides the asymptotic property of the pessimistic portfolio via quantile regression models. Aı̈t-Sahalia and Hansen (2009) briefly discuss shrinkage estimation in portfolio problems. Gouriéroux and Jasiak (2001), Scherer and Martin (2005), Rachev et al. (2008) and Aı̈t-Sahalia and Hansen (2009) introduce Bayesian methods in portfolio problems. Tsay (2010) gives a general method to analyze factor models. Hastie et al. (2009) introduce estimation methods for linear regression models. Forni et al. (2000, 2005) discuss the generalized dynamic factor model. Glombek (2012) discusses asymptotics for estimators of various portfolio quantities under m, n → ∞ and m/n → y ∈ (0, 1). Bodnar et al. (2014) discuss the corresponding asymptotics under m, n → ∞ and m/n → y ∈ (1, ∞).

3.2.8 Appendix
Proof of Theorem 3.2.3. Denote the (a, b) components of Σ̂ and Σ̃ by Σ̂_ab and Σ̃_ab, respectively. Then

    √n (Σ̂_ab − Σ̃_ab) = (1/√n) { Σ_{t=1}^{n} (X_a(t) − μ̂_a)(X_b(t) − μ̂_b) − Σ_{t=1}^{n} (X_a(t) − μ_a)(X_b(t) − μ_b) }
      = −√n (μ̂_a − μ_a)(μ̂_b − μ_b)
      = o_p(1).

Therefore

    √n (θ̂ − θ) = √n ( (μ̂ − μ)′, vech{Σ̃ − Σ}′ )′ + o_p(1)
      = (1/√n) Σ_{t=1}^{n} ( {X_1(t) − μ_1}, . . . , {X_m(t) − μ_m}, {X_1(t) − μ_1}{X_1(t) − μ_1} − Σ_11, . . . , {X_m(t) − μ_m}{X_m(t) − μ_m} − Σ_mm )′ + o_p(1)
      ≡ (1/√n) Σ_{t=1}^{n} Y(t) + o_p(1) (say),

where {Y(t)} = {(Y_1(t), . . . , Y_{m+r}(t))′; t ∈ N}. In order to prove the theorem we use the following lemma, which is due to Hosoya and Taniguchi (1982).
Lemma 3.2.23 (Theorem 2.2 of Hosoya and Taniguchi (1982)). Let a zero-mean vector-valued second-order stationary process {Y(t)} be such that
(A-1) for a positive constant ε,

    Var{E(Y_a(t + τ)|F_t)} = O(τ^{−2−ε}) uniformly in a = 1, . . . , m + r,

(A-2) for a positive constant δ,

    E|E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)|F_t} − E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)}| = O[{min(τ_1, τ_2)}^{−1−δ}]

uniformly in t, for a_1, a_2 = 1, . . . , m + r and τ_1, τ_2 both greater than 0,
(A-3) {Y(t)} has the spectral density matrix f^Y(λ) = {f^Y_{ab}(λ); a, b = 1, . . . , m + r}, whose elements are continuous at the origin, and f^Y(0) is nondegenerate.
Then (1/√n) Σ_{t=1}^{n} Y(t) →^{L} N(0, 2π f^Y(0)), where F_t is the σ-field generated by {Y(l); l ≤ t}.
Now returning to the proof of Theorem 3.2.3, we check (A-1)–(A-3) above.
(A-1). If a ≤ m, then

    Y_a(t + τ) = X_a(t + τ) − μ_a
      = Σ_{k=1}^{m} Σ_{j=0}^{∞} A_{ak}(j)u_k(t + τ − j)
      = Σ_{k=1}^{m} Σ_{j=0}^{τ−1} A_{ak}(j)u_k(t + τ − j) + Σ_{k=1}^{m} Σ_{j=τ}^{∞} A_{ak}(j)u_k(t + τ − j)
      ≡ A_a(t, τ) + O^{(4)}_{F_t}(τ^{−1−δ}), (say).

By use of a fundamental formula relating E(·) and cum(·) (e.g., Brillinger (2001b)), we obtain

    E|{O^{(4)}_{F_t}(τ^{−1−δ})}⁴| = E| Σ_{k=1}^{m} Σ_{j=τ}^{∞} A_{ak}(j)u_k(t + τ − j) |⁴ = O{(τ^{−1−δ})⁴}

by Assumption 3.2.1. Similarly we can show

    E|{O^{(4)}_{F_t}(τ^{−1−δ})}²| = O{(τ^{−1−δ})²}.

Hence Schwarz's inequality yields

    E|{O^{(4)}_{F_t}(τ^{−1−δ})}³| = O{(τ^{−1−δ})³},   E|{O^{(4)}_{F_t}(τ^{−1−δ})}| = O{(τ^{−1−δ})}.

From (ii) of Assumption 3.2.1 it follows that F_t is equal to the σ-field generated by {U(l); l ≤ t}. Therefore, noting that A_a(t, τ) is independent of F_t, we obtain

    E(Y_a(t + τ)|F_t) = E{A_a(t, τ) + O^{(4)}_{F_t}(τ^{−1−δ})|F_t} = O^{(4)}_{F_t}(τ^{−1−δ}).

Hence, we have

    Var{E(Y_a(t + τ)|F_t)} = Var{O^{(4)}_{F_t}(τ^{−1−δ})} = O(τ^{−2−2δ})   (for a ≤ m).

If a > m, then

    Y_a(t + τ) = A_{a1}(t, τ)A_{a2}(t, τ) + O^{(4)}_{F_t}(τ^{−1−δ}){A_{a1}(t, τ) + A_{a2}(t, τ)} + {O^{(4)}_{F_t}(τ^{−1−δ})}² − Σ_{a1 a2}.

Similarly we obtain

    E(Y_a(t + τ)|F_t) = E(A_{a1}(t, τ)A_{a2}(t, τ)) + {O^{(4)}_{F_t}(τ^{−1−δ})}² − Σ_{a1 a2},

which implies

    Var{E(Y_a(t + τ)|F_t)} = E|{O^{(4)}_{F_t}(τ^{−1−δ})}² − O(τ^{−2−2δ})|² = O(τ^{−4−4δ}).

Thus we have checked (A-1).
(A-2). For a_1, a_2 ≤ m, noting that

    Y_{a1}(t + τ_1)Y_{a2}(t + τ_2) = A_{a1}(t, τ_1)A_{a2}(t, τ_2) + O^{(4)}_{F_t}(τ_1^{−1−δ})A_{a2}(t, τ_2) + O^{(4)}_{F_t}(τ_2^{−1−δ})A_{a1}(t, τ_1) + O^{(4)}_{F_t}(τ_1^{−1−δ})O^{(4)}_{F_t}(τ_2^{−1−δ}),

we can see

    E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)|F_t} = E{A_{a1}(t, τ_1)A_{a2}(t, τ_2)} + O^{(4)}_{F_t}(τ_1^{−1−δ})O^{(4)}_{F_t}(τ_2^{−1−δ}).

On the other hand, since

    E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)} = E{A_{a1}(t, τ_1)A_{a2}(t, τ_2)} + O{(min(τ_1, τ_2))^{−1−δ}},

we have

    E|E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)|F_t} − E{Y_{a1}(t + τ_1)Y_{a2}(t + τ_2)}| = O{(min(τ_1, τ_2))^{−1−δ}}

for a_1, a_2 ≤ m. For the cases (i) a_1 ≤ m, a_2 > m, and (ii) a_1, a_2 > m, repeating the same arguments as above, we can check (A-2).
(A-3). It follows from Assumption 3.2.1.

By Lemma 3.2.23 we have

    (1/√n) Σ_{t=1}^{n} Y(t) →^{L} N(0, 2π f^Y(0)).

Next we evaluate the asymptotic variance of (1/√n) Σ_{t=1}^{n} Y(t) concretely. Write Y_1(t) = (Y_1(t), . . . , Y_m(t))′ and Y_2(t) = (Y_{m+1}(t), . . . , Y_{m+r}(t))′. Then the asymptotic variances of (1/√n) Σ_{t=1}^{n} Y_1(t) and (1/√n) Σ_{t=1}^{n} Y_2(t) are evaluated by Hannan (1970) and Hosoya and Taniguchi (1982), respectively. What we have to evaluate is the asymptotic covariance Cov{ (1/√n) Σ_{t=1}^{n} Y_1(t), (1/√n) Σ_{t=1}^{n} Y_2(t) }, whose typical component for a_1 ≤ m and a_2 > m is

    Cov{ (1/√n) Σ_{t=1}^{n} Y_{a1}(t), (1/√n) Σ_{t=1}^{n} Y_{a2}(t) }
      = (1/n) Σ_{t=1}^{n} Σ_{l=1}^{∞} Σ_{b1,b2,b3=1}^{m} A_{a1,b1}(l)A_{a'2,b2}(l)A_{a'3,b3}(l) c^U_{b1,b2,b3}
      → (1/(2π)²) Σ_{b1,b2,b3=1}^{m} c^U_{b1,b2,b3} ∫∫_{−π}^{π} A_{a1 b1}(λ_1 + λ_2)A_{a'2 b2}(−λ_1)A_{a'3 b3}(−λ_2) dλ_1 dλ_2

when {U(t)} is a non-Gaussian process. In the case where {U(t)} is a Gaussian process, it is easy to see that c^U_{b1,b2,b3} and c^U_{b1,b2,b3,b4} vanish. Hence we have the desired result. □

3.3 Simulation Results


This section provides various numerical studies to investigate the accuracy of the portfolio estimators discussed in Chapter 3. We provide simulation studies for return processes following a vector autoregressive model of order 1 (VAR(1)), a vector moving average model of order 1 (VMA(1)) and a constant conditional correlation model of order (1, 1) (CCC(1,1)), respectively. Then, the estimated efficient frontiers are introduced. In addition, we discuss the inference of portfolio estimators with respect to sample size, distribution, target return and coefficients.
Suppose that {X(t) = (X_1(t), . . . , X_m(t))′; t ∈ Z} follows a VAR(1) or a VMA(1) model, respectively, defined by

    VAR(1) : X(t) = μ + U(t) + Φ{X(t − 1) − μ}, (3.3.1)
    VMA(1) : X(t) = μ + U(t) + ΘU(t − 1), (3.3.2)

where μ is a constant m-vector; {U(t)} follows an i.i.d. multivariate normal distribution N_m(0, I_m) or a standardized multivariate Student-t distribution with 5 degrees of freedom satisfying E{U(t)} = 0 and V{U(t)} = I_m; Φ is an m × m autoregressive coefficient matrix and Θ is an m × m moving average coefficient matrix. Then, under regularity conditions, the mean vector and the covariance matrix can be written as

    VAR(1) : E{X(t)} = μ,   V{X(t)} = (I_m − ΦΦ′)^{−1}, (3.3.3)
    VMA(1) : E{X(t)} = μ,   V{X(t)} = I_m + ΘΘ′. (3.3.4)
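For a symmetric Φ, as in the simulation design of Table 3.3, the VAR(1) covariance in (3.3.3) can be checked against the stationary Lyapunov equation Γ_0 = ΦΓ_0Φ′ + I_m. The sketch below (our own illustration; the closed form relies on Φ being symmetric) verifies this numerically.

```python
import numpy as np

Phi = np.array([[0.2, -0.05],
                [-0.05, 0.1]])        # symmetric AR coefficient, as in Table 3.3
I = np.eye(2)

# stationary covariance from (3.3.3)
Gamma0 = np.linalg.inv(I - Phi @ Phi.T)

# fixed-point iteration of Gamma = Phi Gamma Phi' + I
G = I.copy()
for _ in range(200):
    G = Phi @ G @ Phi.T + I

assert np.allclose(G, Gamma0, atol=1e-10)
```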
On the other hand, suppose that {X(t) = (X_1(t), . . . , X_m(t))′; t ∈ N} follows a CCC(1,1) model, defined by

    CCC(1,1) : X(t) = μ + D(t)ε(t),   D(t) = diag( √h_1(t), . . . , √h_m(t) ), (3.3.5)

where μ is a constant m-vector; {ε(t) = (ε_1(t), . . . , ε_m(t))′} follows an i.i.d. multivariate normal distribution N_m(0, R) or a standardized multivariate Student-t distribution with 5 degrees of freedom satisfying E{ε(t)} = 0 and V{ε(t)} = R; the diagonal elements of R equal 1 and the off-diagonal elements equal R_ij; the vector GARCH equation for h(t) = (h_1(t), . . . , h_m(t))′ satisfies

    h(t) = (a_1, . . . , a_m)′ + diag(A_1, . . . , A_m) ( h_1(t − 1)ε_1²(t − 1), . . . , h_m(t − 1)ε_m²(t − 1) )′ + diag(B_1, . . . , B_m) h(t − 1)
         := a + Ar²(t − 1) + Bh(t − 1) (say).
Then, under regularity conditions, the mean vector and the covariance matrix can be written as

    CCC(1,1) : E{X(t)} = μ,   V{X(t)} = Ω(a, A, B, R) = (ω_ij)_{i,j=1,...,m}, (3.3.6)

where

    ω_ij = a_i/(1 − A_i − B_i)   (i = j),
    ω_ij = R_ij √(a_i a_j) / √{(1 − A_i − B_i)(1 − A_j − B_j)}   (i ≠ j).
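The closed form (3.3.6) is easy to evaluate. The sketch below (our own illustration) reproduces the "True" covariance entries of Table 3.5 from the CCC parameters used there.

```python
import math

def ccc_covariance(a, A, B, R):
    """Stationary covariance matrix Omega(a, A, B, R) of the CCC model (3.3.6)."""
    m = len(a)
    Omega = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                Omega[i][j] = a[i] / (1 - A[i] - B[i])
            else:
                Omega[i][j] = (R[i][j] * math.sqrt(a[i] * a[j])
                               / math.sqrt((1 - A[i] - B[i]) * (1 - A[j] - B[j])))
    return Omega

# parameters of the "True" row in Table 3.5
a, A, B = [0.3, 0.15], [0.2, 0.1], [0.0, 0.0]
R = [[1.0, 0.3], [0.3, 1.0]]
Omega = ccc_covariance(a, A, B, R)
# matches sigma11 = 0.375, sigma12 = 0.075, sigma22 = 0.166 in Table 3.5
assert abs(Omega[0][0] - 0.375) < 1e-9
assert abs(Omega[0][1] - 0.075) < 1e-9
assert abs(Omega[1][1] - 0.15 / 0.9) < 1e-9
```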

3.3.1 Quasi-Maximum Likelihood Estimator


Given {X(1), . . . , X(n); n ≥ 2}, the mean vector and covariance matrix estimators based on the quasi-maximum likelihood estimators are defined by

    VAR(1) : μ̂_QML = μ̂,   Σ̂_QML = (I_m − Φ̂Φ̂′)^{−1}, (3.3.7)
    VMA(1) : μ̂_QML = μ̂,   Σ̂_QML = I_m + Θ̂Θ̂′, (3.3.8)
    CCC(1,1) : μ̂_QML = μ̂,   Σ̂_QML = Ω(â, Â, B̂, R̂), (3.3.9)

where μ̂ = (1/n) Σ_{t=1}^{n} X(t) and Φ̂, Θ̂, â, Â, B̂, R̂ are “(Gaussian) quasi-maximum (log-)likelihood estimators (QMLE or GQMLE)” defined by η̂ = arg max_{η∈E} L_n(η). Here the quasi-likelihood functions for the VAR model and the VMA model are given as follows:

    VAR(1) : L_n^{(var)}(Φ) = Π_{t=2}^{n} p{ Ŷ(t) − ΦŶ(t − 1) }, (3.3.10)
    VMA(1) : L_n^{(vma)}(Θ) = Π_{t=1}^{n} p{ Ŷ(t) + Σ_{j=1}^{t−1} (−1)^j Θ^j Ŷ(t − j) }, (3.3.11)

where p{·} is the probability density of U(t) and Ŷ(t) = X(t) − μ̂. The Gaussian quasi-maximum likelihood estimators (GQMLE) for the CCC model are given as follows:

    (â, Â, B̂) = arg max_{(a,A,B)} ℓ_V^{(ccc)}(a, A, B),   R̂ = arg max_R ℓ_C^{(ccc)}(â, Â, B̂, R), (3.3.12)

where ℓ_V^{(ccc)} and ℓ_C^{(ccc)} are the volatility part and the correlation part of the Gaussian log-likelihood, defined by

    ℓ_V^{(ccc)}(a, A, B) = −(1/2) Σ_{t=1}^{n} { m log(2π) + log|D̂(t)|² + Ŷ(t)′D̂(t)^{−2}Ŷ(t) }, (3.3.13)
    ℓ_C^{(ccc)}(a, A, B, R) = −(1/2) Σ_{t=1}^{n} { log|R| + ε̂(t)′R^{−1}ε̂(t) }, (3.3.14)

where Ŷ(t) = X(t) − μ̂, D̂(t) = diag( √ĥ_1(t), . . . , √ĥ_m(t) ), ε̂(t) = D̂(t)^{−1}{X(t) − μ̂} and

    ĥ_i(t) = a_i   (t = 0),
    ĥ_i(t) = a_i + A_i{X_i(t − 1) − μ̂_i}² + B_i ĥ_i(t − 1)   (t = 1, . . . , n).
ai + Ai {Xi (t − 1) − µ̂i }2 + Bi ĥi (t − 1) t = 1, . . . , n

Tables 3.3–3.5 provide μ̂, Σ̂_QML (or Σ̂_GQML) and the (G)QMLEs for each model in the case of m = 2 and n = 10, 100, 1000. It is easy to see that in all cases the estimation accuracy tends to improve as the sample size increases.

Table 3.3: Estimated mean vector (µ̂), covariance matrix (Σ̂QML ) and VAR coefficient (Φ̂) (with the
standard errors (SE)) for the VAR(1) model

µ Σ Φ
µ1 µ2 σ11 σ12 σ22 Φ11 Φ12 Φ21 Φ22
True 1.000 0.500 1.044 -0.015 1.012 0.200 -0.050 -0.050 0.100
normal distribution
n = 10 0.833 0.227 1.086 -0.049 1.768 0.235 0.274 0.151 -0.598
(0.029) (0.030) (13.75) (6.352) (6.352) (0.354) (0.451) (0.461) (0.356)
n = 100 0.858 0.546 1.014 0.014 1.027 0.108 0.153 -0.052 0.049
(0.033) (0.031) (0.051) (0.030) (0.030) (0.100) (0.103) (0.102) (0.103)
n = 1000 0.969 0.488 1.051 -0.014 1.006 0.215 -0.048 -0.048 0.065
(0.031) (0.031) (0.014) (0.007) (0.007) (0.003) (0.029) (0.032) (0.031)
t distribution with 5 degrees of freedom
n = 10 1.223 0.552 -8.857 -5.238 -1.737 -0.460 -0.415 0.802 0.303
(0.062) (0.059) (14.13) (19.76) (19.76) (0.344) (0.470) (0.478) (0.343)
n = 100 1.117 0.470 1.031 0.004 1.000 0.173 0.023 0.016 0.005
(0.043) (0.032) (0.045) (0.027) (0.027) (0.090) (0.095) (0.094) (0.089)
n = 1000 0.913 0.507 1.047 -0.016 1.017 0.210 -0.065 -0.013 0.112
(0.032) (0.032) (0.012) (0.006) (0.006) (0.027) (0.027) (0.027) (0.027)
Table 3.4: Estimated mean vector (µ̂), covariance matrix (Σ̂QML ) and VMA coefficient (Θ̂) (with the
standard errors (SE)) for the VMA(1) model

µ Σ Θ
µ1 µ2 σ11 σ12 σ22 Θ11 Θ12 Θ21 Θ22
True 1.000 0.500 1.012 -0.015 1.042 0.200 -0.050 -0.050 0.100
normal distribution
n = 10 1.285 0.818 4.643 -6.280 11.961 0.457 -0.432 -1.853 3.282
(0.029) (0.022) (6.870) (5.121) (5.121) (1.620) (1.581) (1.410) (1.699)
n = 100 1.043 0.663 1.018 0.023 1.032 0.116 0.170 -0.071 -0.056
(0.032) (0.030) (0.037) (0.029) (0.029) (0.110) (0.107) (0.108) (0.108)
n = 1000 1.037 0.444 1.005 -0.013 1.066 0.065 -0.100 -0.028 0.238
(0.030) (0.032) (0.007) (0.007) (0.007) (0.032) (0.031) (0.032) (0.030)
t distribution with 5 degrees of freedom
n = 10 0.469 1.209 1.404 -0.106 1.033 0.496 -0.176 0.397 -0.048
(0.099) (0.038) (7.853) (5.057) (5.057) (1.641) (1.453) (1.486) (1.640)
n = 100 1.019 0.362 1.010 -0.007 1.023 0.101 -0.047 -0.019 0.145
(0.037) (0.032) (0.027) (0.026) (0.026) (0.100) (0.098) (0.096) (0.102)
n = 1000 0.982 0.429 1.008 -0.010 1.023 0.081 -0.060 -0.038 0.142
(0.033) (0.031) (0.006) (0.006) (0.006) (0.027) (0.027) (0.028) (0.028)

Table 3.5: Estimated mean vector (µ̂), covariance matrix (Σ̂GQML ) and CCC coefficient (â, Â, B̂, R̂)
(with the standard errors (SE)) for the CCC(1,0) model

µ Σ â Â B̂ R̂
µ1 µ2 σ11 σ12 σ22 a1 a2 A1 A2 B1 B2 R21
True 1.000 0.500 0.375 0.075 0.166 0.300 0.150 0.200 0.100 0.000 0.000 0.300
normal distribution
n = 10 1.258 0.445 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.050 0.904 2.050
(0.31) (0.17) (1.85) (0.57) (1.00) (0.16) (0.06) (0.42) (0.35) (0.42) (0.42) (0.70)
n = 100 0.969 0.492 0.442 0.089 0.214 0.305 0.013 0.309 0.116 0.000 0.830 0.289
(0.25) (0.13) (0.11) (0.03) (0.05) (0.10) (0.05) (0.13) (0.10) (0.21) (0.23) (0.11)
n = 1000 0.967 0.491 0.372 0.069 0.170 0.298 0.156 0.197 0.085 0.000 0.000 0.275
(0.24) (0.12) (0.09) (0.03) (0.04) (0.07) (0.04) (0.06) (0.04) (0.09) (0.19) (0.15)
t distribution with 5 degrees of freedom
n = 10 0.901 0.753 0.117 0.102 0.302 0.116 0.222 0.000 0.265 0.004 0.000 0.545
(0.35) (0.20) (18.5) (0.86) (2.08) (0.46) (0.14) (0.50) (0.36) (0.41) (0.42) (0.61)
n = 100 1.062 0.435 0.819 0.183 0.417 0.458 0.382 0.439 0.083 0.000 0.000 0.314
(0.26) (0.13) (6.80) (0.48) (1.60) (0.20) (0.09) (0.27) (0.23) (0.19) (0.22) (0.11)
n = 1000 1.006 0.515 0.743 0.151 0.295 0.462 0.143 0.378 0.282 0.000 0.231 0.323
(0.25) (0.12) (1.36) (0.06) (0.70) (0.13) (0.07) (0.13) (0.10) (0.08) (0.16) (0.08)

3.3.2 Efficient Frontier


Let X(t) = (X_1(t), . . . , X_m(t))′ be the random returns on m assets at time t with E{X(t)} = μ and V{X(t)} = Σ. Let w = (w_1, . . . , w_m)′ be the vector of portfolio weights. Then the expected portfolio return is given by μ_P := w′μ, while its variance is σ²_P := w′Σw. As already discussed in (3.1.2), the optimal portfolio weight for a given expected return μ_P is expressed as

    w_0 = ((μ_P α − β)/δ) Σ^{−1}μ + ((γ − μ_P β)/δ) Σ^{−1}e, (3.3.15)

where α = e′Σ^{−1}e, β = μ′Σ^{−1}e, γ = μ′Σ^{−1}μ and δ = αγ − β². By using this weight, the variance is expressed as

    σ²_P = (μ²_P α − 2μ_P β + γ)/δ. (3.3.16)

Rewriting (3.3.16), we obtain

    (μ_P − μ_GMV)² = s(σ²_P − σ²_GMV), (3.3.17)

where

    μ_GMV = β/α,   σ²_GMV = 1/α   and   s = δ/α.

Here μ_GMV and σ²_GMV are the expected return and the variance of the global minimum variance (GMV) portfolio (i.e., the portfolio with the smallest risk). The quantity s is called the slope coefficient of the efficient frontier.
By using the estimators of μ and Σ defined in (3.3.3)–(3.3.6) (or the sample covariance matrix), the estimated efficient frontier is given by

    (μ_P − μ̂_GMV)² = ŝ(σ²_P − σ̂²_GMV), (3.3.18)

where

    μ̂_GMV = β̂/α̂,   σ̂²_GMV = 1/α̂   and   ŝ = δ̂/α̂,

with α̂ = e′Σ̂^{−1}e, β̂ = μ̂′Σ̂^{−1}e, γ̂ = μ̂′Σ̂^{−1}μ̂ and δ̂ = α̂γ̂ − β̂².
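The identities (3.3.15)–(3.3.17) can be verified numerically. The following sketch (our own illustration, with made-up μ and Σ) computes w_0 for a target return and checks that its variance lies on the frontier parabola.

```python
import numpy as np

mu = np.array([1.0, 0.5, 0.8])
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])
e = np.ones(3)
Si = np.linalg.inv(Sigma)

alpha = e @ Si @ e
beta = mu @ Si @ e
gamma = mu @ Si @ mu
delta = alpha * gamma - beta ** 2

mu_P = 0.85
w0 = ((mu_P * alpha - beta) / delta) * (Si @ mu) + ((gamma - mu_P * beta) / delta) * (Si @ e)  # (3.3.15)
var_P = w0 @ Sigma @ w0

# frontier checks: achieved mean, full investment, and the parabola (3.3.17)
mu_GMV, var_GMV, s = beta / alpha, 1.0 / alpha, delta / alpha
assert abs(w0 @ mu - mu_P) < 1e-10
assert abs(w0.sum() - 1.0) < 1e-10
assert abs((mu_P - mu_GMV) ** 2 - s * (var_P - var_GMV)) < 1e-10
```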
Figure 3.8 shows the true and the estimated efficient frontiers for each model. Here, the solid lines show the true efficient frontiers, while the plotted points for n = 10, △ for n = 100 and + for n = 1000 show the estimated efficient frontiers. One can see that the estimated efficient frontier tends to come closer to the true efficient frontier as the sample size increases, because the slope coefficient (ŝ), the estimated mean (μ̂_GMV) and the estimated variance (σ̂²_GMV) of the GMV portfolio converge to the true values.

3.3.3 Difference between the True Point and the Estimated Point
Once μ_P is fixed in (3.3.16) or (3.3.17), a point (σ_P, μ_P) is determined on the σ-μ plane. Given μ̂, Σ̂ and μ_P, the optimal portfolio weight estimator is obtained by

    ŵ_{μP} = ((μ_P α̂ − β̂)/δ̂) Σ̂^{−1}μ̂ + ((γ̂ − μ_P β̂)/δ̂) Σ̂^{−1}e, (3.3.19)

with α̂ = e′Σ̂^{−1}e, β̂ = μ̂′Σ̂^{−1}e, γ̂ = μ̂′Σ̂^{−1}μ̂ and δ̂ = α̂γ̂ − β̂². Then, in the same way as the true version, the estimated point (σ̂(μ_P), μ̂(μ_P)) is determined on the σ-μ plane. Here μ̂(μ_P) and σ̂²(μ_P) are the optimal portfolio's conditional mean and variance, i.e.,

    μ̂(μ_P) := E{ŵ′_{μP} X(t)|ŵ_{μP}} = ŵ′_{μP} μ, (3.3.20)
    σ̂²(μ_P) := V{ŵ′_{μP} X(t)|ŵ_{μP}} = ŵ′_{μP} Σ ŵ_{μP}. (3.3.21)
Figure 3.9 shows the degree of dispersion of the point (σ̂(μ_P), μ̂(μ_P)) in the case of μ_P = 0.85, n = 10, 30, 100 and 100 replications of the VAR(1) model. Note that all estimated points lie on the right-hand side of the (true) efficient frontier. It can be seen that each point tends to come closer to the true efficient frontier as the sample size increases. In addition, we show the histograms of the following quantities in Figure 3.10:

    μ̂(μ_P)/μ_P,   σ̂(μ_P)/σ_P,   d̂(μ_P) := {μ̂(μ_P) − μ_P}² + σ̂²(μ_P) − σ²_P.

For each histogram, the shape becomes sharper as the sample size increases. Regarding the ratio σ̂(μ_P)/σ_P, there exist points with σ̂(μ_P)/σ_P < 1 (i.e., σ̂(μ_P) < σ_P). These points must satisfy μ̂(μ_P) < μ_P because all estimated points lie on the right-hand side of the (true) efficient frontier.
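That every estimated point lies to the right of the true frontier is a deterministic consequence of the frontier's optimality: any fully invested weight vector, random or not, has variance at least the frontier variance (3.3.16) evaluated at its own mean. A small check (our own illustration, with made-up μ and Σ):

```python
import numpy as np

mu = np.array([1.0, 0.5, 0.8])
Sigma = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.5, 0.3],
                  [0.1, 0.3, 2.0]])
e = np.ones(3)
Si = np.linalg.inv(Sigma)
alpha, beta, gamma = e @ Si @ e, mu @ Si @ e, mu @ Si @ mu
delta = alpha * gamma - beta ** 2

rng = np.random.default_rng(3)
for _ in range(100):
    w = rng.dirichlet(np.ones(3))      # any fully invested weight (w'e = 1)
    mean_w = w @ mu
    var_w = w @ Sigma @ w
    frontier_var = (mean_w ** 2 * alpha - 2 * mean_w * beta + gamma) / delta  # (3.3.16)
    assert var_w >= frontier_var - 1e-10
```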

Figure 3.8: True and estimated efficient frontiers for each model. The solid lines show the true
efficient frontiers, (n = 10), △ (n = 100) and + (n = 1000) plots show the estimated efficient
frontiers.

3.3.4 Inference of µP
Here we discuss the inference of estimation error when µP is set to specific values. In order to investigate the inference of estimation error, we introduce the following:

Mµ := E{µ̂(µP) − µP}    (3.3.22)
Mσ := E{σ̂(µP) − σ(µP)}    (3.3.23)
M(µ,σ) := E[{µ̂(µP) − µP}² + {σ̂²(µP) − σ²(µP)}] = E{d̂(µP)}    (3.3.24)

Table 3.6 shows Mµ, Mσ and M(µ,σ) in the case of µP = 0.748 (= µGMV), 0.850, 1.000, 1.500, 3.000, n = 10, 30, 100 and 100 replications of the VAR(1) model. Note that we calculate the average of each quantity in place of the expectation. When µP is large, the estimation error is extremely large even if the sample size increases.
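The sample analogues of (3.3.22)–(3.3.24) used for Table 3.6 simply replace the expectations by averages over the replications. A minimal sketch (the function name and the input numbers are illustrative, not the book's simulated data):

```python
import numpy as np

def error_measures(mu_hats, sig_hats, mu_P, sig_P):
    # averages over replications stand in for the expectations
    M_mu = np.mean(mu_hats - mu_P)                           # (3.3.22)
    M_sig = np.mean(sig_hats - sig_P)                        # (3.3.23)
    d_hat = (mu_hats - mu_P)**2 + (sig_hats**2 - sig_P**2)   # d-hat(mu_P)
    return M_mu, M_sig, np.mean(d_hat)                       # (3.3.24)

# toy replicated estimates (illustrative numbers only)
M_mu, M_sig, M_mu_sig = error_measures(
    np.array([0.90, 0.80]), np.array([0.80, 0.70]), 0.85, 0.75)
```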
SIMULATION RESULTS 71

[Figure 3.9 appears here: six scatter plots (n = 10, 30, 100 under normal and t innovations) of estimated points on the portfolio risk-return plane.]

Figure 3.9: Results from 100 replications of the VAR(1) model in the case of n = 10, 30, 100. For each replication, the optimal portfolio with a mean return of 0.85 is estimated. The estimated points are plotted as small dots. The large dot is the true point and the solid line is the (true) efficient frontier.

Table 3.6: Inference of estimation error with respect to µP

n = 10 n = 30 n = 100
µP        Mµ       Mσ       M(µ,σ)       Mµ       Mσ       M(µ,σ)       Mµ       Mσ       M(µ,σ)
normal distribution
0.748 0.017 0.892 4.812 -0.003 0.064 0.023 0.000 0.021 0.004
0.850 -0.020 0.965 6.313 -0.033 0.079 0.100 -0.003 0.035 0.036
1.000 -0.075 1.133 11.032 -0.078 0.088 0.401 -0.008 0.058 0.227
1.500 -0.261 1.971 48.987 -0.226 0.119 3.019 -0.024 0.138 2.106
3.000 -0.817 5.363 363.151 -0.670 0.265 25.516 -0.073 0.397 18.928
t distribution with 5 degrees of freedom
0.748 0.000 0.708 5.105 -0.001 0.091 0.043 0.010 0.036 0.012
0.850 -0.092 0.701 4.491 -0.018 0.097 0.098 0.015 0.078 0.078
1.000 -0.229 0.758 7.307 -0.043 0.116 0.435 0.024 0.139 0.401
1.500 -0.684 1.456 48.368 -0.127 0.248 3.690 0.053 0.328 3.327
3.000 -2.052 4.048 465.153 -0.378 0.718 33.107 0.140 0.909 28.825
[Figure 3.10 appears here: nine histograms (n = 10, 30, 100) of the ratio of portfolio return, the ratio of portfolio risk and the estimation error.]

Figure 3.10: Histograms of µ̂(µP)/µP (upper subplots), σ̂(µP)/σP (middle subplots) and d̂(µP) (lower subplots) with respect to the data of Figure 3.9 (normal distribution)

3.3.5 Inference of Coefficient


Here we discuss the inference of estimation error when the VAR(1) coefficient is set to specific values. In place of the VAR(1) coefficient Φ in (3.3.1), we define

Φ1(κ) = I3 − κ (  0.80  −0.05  −0.04
                 −0.05   0.85  −0.03
                 −0.04  −0.03   0.90 ).    (3.3.25)

Table 3.7 shows Mµ, Mσ and M(µ,σ) defined by (3.3.22)–(3.3.24) in the case of κ = 1.0, 0.1, 0.01, n = 10, 100, µP = 0.850 and 100 replications of the VAR(1) model. Note that when κ is close to 0, the model is near a unit root process. In the case of the near unit root process, the estimation errors appear to be very large. In addition, in place of the VAR(1) coefficient Φ in (3.3.1), we define

Φ2(κ) = κ (  0     −0.05  −0.04
            −0.05   0     −0.03
            −0.04  −0.03   0    ).    (3.3.26)

Table 3.8 shows Mµ, Mσ and M(µ,σ) defined by (3.3.22)–(3.3.24) in the case of κ = −10.0, −1.0, 0, 1.0, 10.0, n = 10, 100, µP = 0.85 and 100 replications of the VAR(1) model. In the case of κ < 0, there exist negative correlations between the assets, and in the case of κ > 0, there exist positive correlations between the assets. If κ = 0, there is no correlation between any of the assets. In Tables 3.7 and 3.8, the estimation errors appear unstable in the case of n = 10, which implies that the estimators do not converge. In the case of n = 100, the estimation error becomes large as the degree of correlation increases, especially for positive correlation.
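The near-unit-root behaviour of Φ1(κ) can be checked directly: as κ → 0, the spectral radius of Φ1(κ) approaches 1, which is where estimation becomes unstable. A small sketch (the helper names are ours):

```python
import numpy as np

B_MAT = np.array([[ 0.80, -0.05, -0.04],
                  [-0.05,  0.85, -0.03],
                  [-0.04, -0.03,  0.90]])

def phi1(kappa):
    # (3.3.25): Phi_1(kappa) = I_3 - kappa * B_MAT
    return np.eye(3) - kappa * B_MAT

def phi2(kappa):
    # (3.3.26): purely off-diagonal coupling, scaled by kappa
    C = np.array([[ 0.00, -0.05, -0.04],
                  [-0.05,  0.00, -0.03],
                  [-0.04, -0.03,  0.00]])
    return kappa * C

def spectral_radius(Phi):
    # the VAR(1) is near a unit root when this approaches 1
    return np.max(np.abs(np.linalg.eigvals(Phi)))
```

For instance, spectral_radius(phi1(0.01)) is close to 1 while spectral_radius(phi1(1.0)) is far below it, matching the near-unit-root discussion above.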
Table 3.7: Inference of estimation error with respect to Φ1 (κ)

n = 10 n = 100
κ        det Σ       σ(µP)       Mµ       Mσ       M(µ,σ)       Mµ       Mσ       M(µ,σ)
normal distribution
1.0 1.0 0.663 -0.269 2.962 47.739 0.022 0.720 0.924
0.1 234.3 1.687 0.096 1.877 381.533 0.000 -1.012 1.038
0.01 208390.1 5.244 -0.114 -3.167 40.837 0.000 -4.571 20.919
t distribution with 5 degrees of freedom
1.0 1.0 0.663 0.271 4.098 166.713 0.073 0.754 0.946
0.1 234.3 1.687 -0.003 -0.053 4.144 0.003 -1.006 1.029
0.01 208390.1 5.244 0.059 -0.521 887.331 0.006 -4.569 20.901

Table 3.8: Inference of estimation error with respect to Φ2 (κ)

n = 10 n = 100
κ        det Σ       σ(µP)       Mµ       Mσ       M(µ,σ)       Mµ       Mσ       M(µ,σ)
normal distribution
-10.0 4.244 1.040 -0.269 2.585 45.649 0.022 0.344 0.523
-1.0 1.010 0.644 0.096 2.920 386.537 0.000 0.029 0.013
0 1.000 0.642 -0.114 1.433 32.856 0.000 0.030 0.020
1.0 1.010 0.644 -0.056 1.146 9.249 -0.001 0.024 0.016
10.0 4.244 1.040 -0.044 0.500 2.005 -0.012 0.045 0.044
t distribution with 5 degrees of freedom
-10.0 4.244 1.040 0.271 3.721 163.767 0.073 0.377 0.520
-1.0 1.010 0.644 -0.003 0.989 5.119 0.003 0.036 0.017
0 1.000 0.642 0.059 4.080 903.705 0.006 0.032 0.020
1.0 1.010 0.644 -0.178 0.929 7.580 0.015 0.050 0.023
10.0 4.244 1.040 0.246 1.746 58.397 -0.012 0.081 0.072

3.3.6 Bibliographic Notes


Tsay (2010) provides simulation techniques for time series data. Ruppert (2004) discusses the application of resampling to portfolio analysis.
Problems
3.1 Let Xt be random returns on m (≥ 3) assets with E(Xt) = µ and Cov(Xt) = Σ and without a risk-free asset. Then, show that the feasible region W = {(x, y) ∈ R² | x = η²(w), y = µ(w), ‖w‖ = 1} is convex to the left.

3.2 Prove Theorem 3.1.18.

3.3 Verify Ω²NG = Ω²G if {U(t)} is Gaussian in Remark 3.2.2 (show that the fourth-order cumulants cU_{b1,b2,b3,b4} vanish).

3.4 Suppose that X1, . . . , Xn ~ i.i.d. Nm(µ, Σ) are obtained. Assuming e′Σ⁻¹e → σ² ∈ (0, ∞) under (n, m)-asymptotics (i.e., n, m → ∞ and m/n → y ∈ (0, 1)), show

e′Σ̂⁻¹e → σ²/(1 − y)

where e = (1, . . . , 1)′ (m-vector) and Σ̂ is the sample covariance matrix defined in (3.2.37).

3.5 Let {X(t)} be a process generated by (3.3.1). Then, show (3.3.3) provided

det(Im − zΦ) ≠ 0 for all z ∈ C such that |z| ≤ 1.

3.6 Let {X(t)} be a process generated by (3.3.5). Then, show (3.3.6) provided |ci| < 1 for i = 1, . . . , m and E{hij(t)} = ρ √(E{hii(t)} E{hjj(t)}) for i ≠ j.

3.7 Verify (3.3.16) and (3.3.17).

3.8 Suppose that {X(t) = (X1(t), X2(t), X3(t))′, t ∈ Z} follows a VAR(1) defined by

X(t) = µ + U(t) + Φ{X(t − 1) − µ}    (3.3.27)

where µ = (0.01, 0.005, 0.003)′; Φ = (φij)i,j=1,2,3 with φ11 = 0.02², φ22 = 0.015², φ33 = 0.01² and φij = −0.5√(φii φjj) for i ≠ j; {U(t)} ~ i.i.d. N3(0, I3).
(i) Generate a sequence of 100 (:= n) observations X(1), . . . , X(100) from (3.3.27).
(ii) For a fixed 0.004 (:= µP), compute

µ̃(µP) = ŵ′µP µ̂ (= µP),    σ̃(µP) = √(ŵ′µP Σ̂ ŵµP).

(iii) Take 1000 (:= B) resamples {Xb(1), . . . , Xb(n)}b=1,...,1000 from {X(1), . . . , X(n)}, and for each b = 1, . . . , B, compute

µ̃b(µP) = ŵ′µP,b µ̂,    σ̃b(µP) = √(ŵ′µP,b Σ̂ ŵµP,b)

where ŵµP,b corresponds to ŵµP based on {Xb(1), . . . , Xb(n)}.
(iv) Get a confidence region

Cα(µP) = {(σ, µ) | L(µP) = (σ − σ̃(µP))² + (µ − µ̃(µP))² ≤ Lα,B(µP)}

for a fixed 0.05 (:= α) where Lα,B(µP) is the [(1 − α)B]-th largest of {L1(µP), . . . , LB(µP)} with

Lb(µP) = (σ̃b(µP) − σ̃(µP))² + (µ̃b(µP) − µ̃(µP))².
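A minimal sketch of the resampling procedure in Problem 3.8 (i)–(iv), reusing (3.3.19) for the weight estimator. The seed, the reduced B, and all helper names are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# (3.3.27) parameters
mu = np.array([0.01, 0.005, 0.003])
phi_d = np.array([0.02**2, 0.015**2, 0.01**2])
Phi = -0.5 * np.sqrt(np.outer(phi_d, phi_d))
np.fill_diagonal(Phi, phi_d)

# (i) generate n = 100 observations from the VAR(1)
n = 100
X, x = np.empty((n, 3)), mu.copy()
for t in range(n):
    x = mu + Phi @ (x - mu) + rng.standard_normal(3)
    X[t] = x

def weight(mu_hat, Sigma_hat, mu_P):
    # optimal portfolio weight estimator (3.3.19)
    e = np.ones(len(mu_hat))
    Si = np.linalg.inv(Sigma_hat)
    a, b = e @ Si @ e, mu_hat @ Si @ e
    g = mu_hat @ Si @ mu_hat
    d = a * g - b**2
    return ((mu_P * a - b) / d) * (Si @ mu_hat) + ((g - mu_P * b) / d) * (Si @ e)

def point(sample, mu_P):
    # estimated (sigma, mu) point for a given sample
    mu_hat = sample.mean(axis=0)
    Sigma_hat = np.cov(sample, rowvar=False)
    w = weight(mu_hat, Sigma_hat, mu_P)
    return w @ mu_hat, np.sqrt(w @ Sigma_hat @ w)

mu_P, alpha_lvl, B = 0.004, 0.05, 500   # B reduced from 1000 for speed
mu_tilde, sig_tilde = point(X, mu_P)    # (ii)

# (iii) i.i.d. resampling of the observed returns
L = np.empty(B)
for b in range(B):
    mu_b, sig_b = point(X[rng.integers(0, n, n)], mu_P)
    L[b] = (sig_b - sig_tilde)**2 + (mu_b - mu_tilde)**2

# (iv) the radius: the [(1 - alpha)B]-th largest of L_1, ..., L_B
L_alpha = np.sort(L)[::-1][int(round((1 - alpha_lvl) * B)) - 1]
```

Note that µ̃(µP) equals µP by construction, as stated in part (ii).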


Chapter 4

Multiperiod Problem for Portfolio Theory

Portfolio optimization is said to be “myopic” when the investor does not know what will happen
beyond the immediate next period. In this framework, basic results about single-period portfolio
optimization (such as mean-variance analysis) are justified for short-term investments without port-
folio rebalancing. Multiperiod problems are much more realistic than single-period ones. In this
framework, we assume that an investor makes a sequence (in the discrete time case) or a continu-
ous function (in the continuous time case) of decisions to maximize a utility function at each time.
The fundamental method to solve this problem is dynamic programming. In this method, a value
function which expresses the expected terminal wealth is introduced. The recursive equation with
respect to the value function is the so-called Bellman equation. The first-order conditions (FOCs)
to satisfy the Bellman equation are a key tool in solving the dynamic problem. The original litera-
ture on dynamic portfolio choice, pioneered by Samuelson (1969) and Fama (1970) in discrete time
and by Merton (1969) in continuous time, produced many important insights into the properties of
optimal portfolio policies.
Section 4.1 discusses multiperiod optimal portfolios in discrete time. In this case, a sequence of
optimal portfolio weights is obtained as the solutions of the FOCs. When the utility function is a
constant relative risk aversion (CRRA) utility or a logarithmic utility, multiperiod optimal portfolios
are derived as a sequence of maximizers of objective functions which are described as conditional
expectations of the return processes. Since it is known that closed-form solutions are not available
except for a few cases of utility functions and return processes, we consider a simulation approach
for the derivation by using the bootstrap. By utilizing the technique, it is possible to estimate the
multiperiod optimal portfolio even if the distribution of the return process is unknown.
In Section 4.2, we consider continuous time problems, which means that an investor is able to re-
balance his/her portfolio continuously during the investment period. Especially, we assume that the
price process of risky assets is a class of diffusion processes with jumps. Similarly to the discrete
time case, multiperiod optimal portfolios are derived from Bellman equations for continuous time
models (the so-called Hamilton–Jacobi–Bellman (HJB) equations). In the case of a CRRA utility
function, it will be shown that multiperiod optimal portfolios correspond to a myopic portfolio. Then
we discuss the estimation of multiperiod optimal portfolios for discretely observed asset prices. It
would be worth considering the discrete observations from continuous time models in practice. In
this setting, since an investor has to estimate multiperiod optimal portfolios based on observations,
it is different from “stochastic control problems.” For this problem, we introduce two types of meth-
ods: the generalized method of moments (GMM) and the threshold estimation method. In general, it
is well known that maximum likelihood estimation is one of the best methods under the parametric
model, but we cannot calculate the maximum likelihood estimation (MLE) for discrete observa-
tions. On the contrary, the above two methods are useful in this case.
Section 4.3 discusses the universal portfolio (UP) introduced by Cover (1991). The idea of this portfolio comes from an approach different from traditional ones such as the mean-variance approach: it is based on information theory as introduced by Shannon (1948) (and many others), and on the Kelly rule of betting (Kelly (1956)).

76 MULTIPERIOD PROBLEM FOR PORTFOLIO THEORY
4.1 Discrete Time Problem
In this section, we present some simulation-based methods for solving discrete time portfolio choice
problems for multiperiod investments. In the same way as the single-period problem, multiperiod
optimal portfolios are defined by a maximizer of an expectation of utility function. The fundamental
method to solve this problem is dynamic programming. In this method, a value function is intro-
duced as the conditional expectation of terminal wealth. Then, the optimal portfolio at each period
is expressed as a maximizer of the value function.

4.1.1 Optimal Portfolio Weights


We now introduce the following notation.
• Trading time t: t = 0, 1, . . . , T where t = 0 and t = T indicate the start point and the end point of the investment.
• Return process on risky assets {Xt}_{t=0}^{T−1}: Xt = (X1,t, . . . , Xm,t)′ is a vector of random returns on m assets from time t to t + 1.
• Return process on a risk-free asset {X0,t}_{t=0}^{T−1}: X0,t is a deterministic return from time t to t + 1.
• Information set process {Zt}_{t=0}^{T−1}: Zt is a vector of state variables at the beginning of time t. The information set indicates the past asset returns and other economic indicators.
• Portfolio weight process¹ {wt}_{t=0}^{T−1}: wt = (w1,t, . . . , wm,t)′ is a portfolio weight at the beginning of time t and measurable (predictable) with respect to the past information Zt.
• Wealth process {Wt}_{t=0}^{T}: Given Wt, the wealth at time t + 1 is defined by

Wt+1 = Wt{1 + w′t Xt + (1 − w′t e)X0,t}    (4.1.1)

for t = 0, 1, . . . , T − 1 where e = (1, . . . , 1)′ (m-vector). This constraint is called a "budget constraint" (Brandt et al. (2005)) or "self-financing" (Pliska (1997)) and describes the dynamics of wealth. The amount 1 + w′t Xt + (1 − w′t e)X0,t specifies the total return of the portfolio from time t to t + 1. In particular, the amount w′t Xt arises from the risky assets allocated by wt and the amount (1 − w′t e)X0,t arises from the risk-free asset allocated by 1 − w′t e. Given a sequence of decisions {wτ}_{τ=t}^{T−1}, it is useful to observe that the terminal wealth WT can be written as a function of the current wealth Wt:

WT = Wt ∏_{τ=t}^{T−1} {1 + w′τ Xτ + (1 − w′τ e)X0,τ}.

• Utility function U: We assume that W ↦ U(W) is differentiable, concave, and strictly increasing for each W ∈ R.
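The budget constraint (4.1.1) and the resulting terminal-wealth product can be sketched as follows (all numbers illustrative; the function name is ours):

```python
import numpy as np

def terminal_wealth(W0, weights, X, X0):
    # iterate the budget constraint (4.1.1):
    # W_{t+1} = W_t {1 + w_t' X_t + (1 - w_t' e) X_{0,t}}
    W = W0
    for w_t, x_t, x0_t in zip(weights, X, X0):
        W *= 1.0 + w_t @ x_t + (1.0 - w_t.sum()) * x0_t
    return W

# two periods, two risky assets, all numbers illustrative:
# fully invested in risky assets, each returning 10% per period
W_T = terminal_wealth(1.0,
                      weights=[np.array([0.5, 0.5])] * 2,
                      X=[np.array([0.1, 0.1])] * 2,
                      X0=[0.02, 0.02])
```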
Under the above setting, we consider an investor’s problem

max_{{wt}_{t=0}^{T−1}} E[U(WT)]

subject to the budget constraints (4.1.1). Following a formulation by dynamic programming (e.g., Bellman (2010)), it is convenient to express the expected terminal wealth in terms of a value function V which varies according to the current time, the current wealth Wt and other state variables Zt ∈ R^K, K < ∞:

V(t, Wt, Zt) = max_{{wu}_{u=t}^{T−1}} Et[U(WT)]

¹{wt} is called a "trading strategy" (Pliska (1997), Luenberger (1997)).
DISCRETE TIME PROBLEM 77
 
 
= max_{wt} Et[ max_{{wu}_{u=t+1}^{T−1}} Et+1[U(WT)] ]
= max_{wt} Et[ V(t + 1, Wt{1 + w′t Xt + (1 − w′t e)X0,t}, Zt+1) ]    (4.1.2)

subject to the terminal condition

V(T, WT, ZT) = U(WT).

The expectations at time t (i.e., Et ) are taken with respect to the joint distribution of asset returns
and next state, conditional on the information available at time t. The recursive equation (4.1.2) is
the so-called Bellman equation and is the basis for any recursive solution of the dynamic portfolio
choice problem. The first-order conditions (FOCs)2 to obtain an optimal solution at each time t
are
 
Et[∂W V(t + 1, Wt{1 + w′t Xt + (1 − w′t e)X0,t}, Zt+1)(Xt − eX0,t)] = 0    (4.1.3)

where ∂W denotes the partial derivative with respect to the wealth W, that is,


∂W V(t + 1, Wt+1, Zt+1) = (∂/∂W) V(t + 1, W, Zt+1) |_{W=Wt+1}.

These optimality conditions assume that the state variable Zt+1 is not impacted by the decision wt .
These FOCs make up a system of nonlinear equations involving integrals that can in general be
solved for wt only numerically. Moreover, the second-order conditions are satisfied if the utility
function is concave.
Example 4.1.1. (CRRA Utility) In general, (4.1.3) can only be solved numerically. However, some analytic progress can be achieved in the case of the CRRA utility introduced in (3.1.16) with R ≠ 1, that is, U(WT) = WT^{1−γ}/(1 − γ). Then, the Bellman equation simplifies to

V(t, Wt, Zt) = ψ(t, Zt) (Wt)^{1−γ}/(1 − γ)

where

ψ(t, Zt) = max_{wt} Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{1−γ} ψ(t + 1, Zt+1)]

with ψ(T, ZT) = 1 (for details see Section 4.1.3). The corresponding FOCs are

Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{−γ} ψ(t + 1, Zt+1)(Xt − eX0,t)] = 0.    (4.1.4)

Although (4.1.4) is simpler than the general case, solutions can still only be obtained numerically. This example also helps to illustrate the difference between the dynamic and myopic (single-period) portfolio choice. If the excess returns Xt − eX0,t are independent of the state variables Zt+1, the T − t period portfolio choice is identical to the single-period portfolio choice at time t. In that case, there is no difference between the dynamic and myopic portfolios. Let wt^dopt and wt^mopt be the optimal portfolio weights based on the dynamic and myopic (single-period) problems, respectively. If the above condition is satisfied, then it follows that

wt^dopt = arg max_{wt} Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{1−γ} ψ(t + 1, Zt+1)]
        = arg max_{wt} Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{1−γ}]
        = wt^mopt.    (4.1.5)

²Here (∂/∂wt)Et[V] = Et[(∂/∂wt)V] is assumed.

In contrast, if the excess returns are not independent of the state variables, the dynamic portfolio choice may be substantially different from the single-period portfolio choice. The differences between the dynamic and myopic policies are called hedging demands because the investor tries to hedge against changes in the investment opportunities through deviations from the single-period portfolio choice. Let wt^hopt be the portfolio weight concerned with hedging demands. Then, we can write

wt^dopt = wt^mopt + wt^hopt

where

wt^dopt = arg max_{wt} Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{1−γ} ψ(t + 1, Zt+1)]
wt^mopt = arg max_{wt} Et[{1 + w′t Xt + (1 − w′t e)X0,t}^{1−γ}].

Example 4.1.2. (Logarithmic Utility) In the case of the logarithmic utility, that is, U(WT) = ln(WT), the dynamic portfolio totally corresponds to the myopic portfolio. The Bellman equation is written as

V(t, Wt, Zt) = max_{wt} Et[ln{1 + w′t Xt + (1 − w′t e)X0,t}]
             + ln Wt + ∑_{τ=t+1}^{T−1} Et[ln{1 + w′τ Xτ + (1 − w′τ e)X0,τ}].    (4.1.6)

Since the second and third terms on the right-hand side of the above equation do not depend on the portfolio weight wt, the optimal portfolio weight wt^dopt is obtained by

wt^dopt = arg max_{wt} Et[ln{1 + w′t Xt + (1 − w′t e)X0,t}]

which corresponds to the single-period optimal portfolio weight wt^mopt. This result leads to the "Kelly rule of betting" (Kelly (1956)).
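The Kelly rule can be illustrated numerically. For a single risky bet that pays +1 with probability p and −1 with probability 1 − p (a toy distribution we assume here, with X0,t = 0), the log-utility (myopic) weight maximizes E[ln(1 + wX)], and the classical closed-form answer is w* = 2p − 1; a grid search recovers it:

```python
import numpy as np

p = 0.6
grid = np.linspace(0.0, 0.99, 9901)   # candidate weights, step 1e-4
# expected log growth E[ln(1 + w X)] under the two-point distribution
expected_log = p * np.log1p(grid) + (1 - p) * np.log1p(-grid)
w_kelly = grid[np.argmax(expected_log)]
```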
The conditions³ for which the myopic portfolio choice corresponds to the dynamic one are summarized as follows:
• The investment opportunities are constant (e.g., in the case of i.i.d. returns).
• The investment opportunities vary with time, but are not able to hedge (e.g., in the case of (4.1.5)).
• The investor has a logarithmic utility on terminal wealth.
As for the other utility function, Li and Ng (2000) analyzed various formulations of the maximiza-
tion of terminal quadratic utility under several hypotheses, provided explicit solutions in simplify-
ing cases, and derived analytical expressions for the multiperiod mean variance efficient frontier.
Leippold et al. (2002) provided an interpretation of the solution to the multiperiod mean variance
problem in terms of an orthogonal set of basis strategies, each with a clear economic interpretation.
They use the analysis to provide analytical solutions to portfolios consisting of both assets and lia-
bilities. More recently, Cvitanic et al. (2008) connect this problem to a specific case of multiperiod
Sharpe ratio maximization.

4.1.2 Consumption Investment


Intermediate consumption has traditionally been a part of the multiperiod optimal investment
problem since Samuelson (1969) and Fama (1970). In this setting, the problem is formulated in
3 For more detail, see Mossin (1968) or Aı̈t-Sahalia and Hansen (2009)
terms of a sequence of consumptions and a terminal wealth. We also discuss the fundamental method
to solve this problem. The future wealth may be influenced not only by the asset return but also by
eventual transaction costs and other constraints such as the desire for intermediate consumption,
minimization of tax impact or the influence of additional capital due to labour income. Suppose that
there exists a consumption process {ct}_{t=0}^{T−1} which is a measurable (predictable) stochastic process.⁴ Here ct (0 < ct < 1) represents the weight of funds consumed by the investor at time t. A consumption investment plan consists of a pair ({wt}_{t=0}^{T−1}, {ct}_{t=0}^{T−1}) where {ct} is a consumption process and {wt} is a portfolio weight process. The investor seeks to choose the consumption investment plan that maximizes the expected utility over T periods. In particular, the investor faces a trade-off between consumption and investment. Consider

max_{{wt}_{t=0}^{T−1}, {ct}_{t=0}^{T−1}} E[ ∑_{t=0}^{T−1} Uc(ct) + UW(WT) ]

subject to the budget constraint⁵

Wt+1 = (1 − ct)Wt{1 + w′t Xt + (1 − w′t e)X0,t}.
Here Uc denotes the utility of consumption c and UW denotes the utility of wealth W. This model is called the "generalized consumption investment problem" (see Pliska (1997)). In this case, the value function V(t, Wt, Zt) (i.e., the Bellman equation) is written as

V(t, Wt, Zt) = max_{{wu}_{u=t}^{T−1}, {cu}_{u=t}^{T−1}} Et[ ∑_{u=t}^{T−1} Uc(cu) + UW(WT) ]
= max_{wt,ct} Et[ Uc(ct) + max_{{wu}_{u=t+1}^{T−1}, {cu}_{u=t+1}^{T−1}} Et+1[ ∑_{u=t+1}^{T−1} Uc(cu) + UW(WT) ] ]
= max_{wt,ct} Et[ Uc(ct) + V(t + 1, (1 − ct)Wt{1 + w′t Xt + (1 − w′t e)X0,t}, Zt+1) ]
= max_{wt,ct} { Uc(ct) + Et[ V(t + 1, (1 − ct)Wt{1 + w′t Xt + (1 − w′t e)X0,t}, Zt+1) ] }

subject to the terminal condition

V(T, WT, ZT) = UW(WT).

The FOCs (which are also called Euler conditions; e.g., Gouriéroux and Jasiak (2001)) are given by

Et[∂W V(t + 1, Wt+1(wt, ct), Zt+1)(Xt − eX0,t)] = 0    (4.1.7)
∂c Uc(ct) + Et[∂W V(t + 1, Wt+1(wt, ct), Zt+1)(−Wt){1 + w′t Xt + (1 − w′t e)X0,t}] = 0    (4.1.8)

where Wt+1(w, c) = (1 − c)Wt{1 + w′Xt + (1 − w′e)X0,t} and ∂c Uc(ct) = (∂/∂c)Uc(c)|_{c=ct}. These are obtained by computing the first-order derivatives with respect to wt and ct. From (4.1.7) and (4.1.8), we have

∂c Uc(ct) = Et[∂W V(t + 1, Wt+1(wt, ct), Zt+1) Wt(1 + X0,t)].    (4.1.9)

This equation is called the "envelope relation" derived by Samuelson (1969), and based on (4.1.7) and (4.1.9), we can obtain the optimal solutions {wt}_{t=0}^{T−1} and {ct}_{t=0}^{T−1} numerically.
The above portfolio problems are considered at the individual level. If the return process {Xt} consists of all assets on a market (i.e., w′t Xt represents a market portfolio) and ct denotes the aggregate consumption of all investors, then it is known in the literature as the consumption-based capital asset pricing model (CCAPM).
4 We assume that Et (ct ) = ct is satisfied for any t.
5 Note that in this case the portfolio is not self-financed since the value of consumption spending does not necessarily
balance the income pattern.
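As a toy illustration of the consumption investment trade-off, consider one period remaining with logarithmic utilities, taking (as an assumption for illustration) Uc applied to the consumed amount ctWt and a single risky asset paying +1 with probability p and −1 otherwise, with X0,t = 0. The objective then separates in (c, w), so a grid search recovers c* = 1/2 together with the Kelly weight 2p − 1:

```python
import numpy as np

# maximize  ln(c W_t) + E[ln((1 - c) W_t {1 + w X})]  over (c, w);
# the two-point return and the log consumption utility are toy assumptions
p, W_t = 0.6, 1.0
cs = np.linspace(0.01, 0.99, 99)      # consumption grid (step 0.01)
ws = np.linspace(0.00, 0.99, 100)     # portfolio weight grid (step 0.01)
C, W = np.meshgrid(cs, ws, indexing="ij")
obj = (np.log(C * W_t)
       + p * np.log((1 - C) * W_t * (1 + W))
       + (1 - p) * np.log((1 - C) * W_t * (1 - W)))
i, j = np.unravel_index(np.argmax(obj), obj.shape)
c_star, w_star = cs[i], ws[j]
```

The separability mirrors the envelope relation (4.1.9): with log utility, the optimal consumption weight does not depend on the investment opportunity.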
4.1.3 Simulation Approach for VAR(1) model
Since it is known that closed-form solutions are obtained only for a few special cases, the recent liter-
ature uses a variety of numerical and approximate solution methods to incorporate realistic features
into the dynamic portfolio problem, such as those of Aı̈t-Sahalia and Brandt (2001) and Brandt et al.
(2005). In what follows, we introduce a procedure to construct dynamic portfolio weights based on
the AR bootstrap following Shiraishi (2012). The simulation algorithm is as follows. First, we generate simulation sample paths of the vector random returns by using the AR bootstrap (Step 1). Based on the bootstrap samples, an optimal portfolio estimator, which is applied from time T − 1 to the end of trading time T, is obtained under a constant relative risk aversion (CRRA) utility function (Step 2). Next we approximate the value function at time T − 1 by linear functions of the bootstrap samples (Step 3). This idea is similar to that of Aı̈t-Sahalia and Brandt (2001) and Brandt et al. (2005). Then, an optimal portfolio weight estimator applied from time T − 2 to time T − 1 is obtained based on the approximated value function (Step 4). In the same manner as above, the optimal portfolio weight estimator at each trading time is obtained recursively (Steps 5–7).

Suppose the existence of a finite number of risky assets indexed by i (i = 1, . . . , m). Let Xt = (X1,t, . . . , Xm,t)′ denote the random returns on m assets from time t to t + 1.⁶ Suppose too that there exists a risk-free asset with the return X0.⁷ Based on the process {Xt}_{t=0}^{T−1} and X0, we consider an investment strategy from time 0 to time T where T (∈ N) denotes the end of the investment time. Let wt = (w1,t, . . . , wm,t)′ be vectors of portfolio weights for the risky assets at the beginning of time t. Here we assume that the portfolio weights wt can be rebalanced at the beginning of time t and are measurable (predictable) with respect to the past information Ft = σ(Xt−1, Xt−2, . . .). Here we make the following assumption.
Assumption 4.1.3. There exists an optimal portfolio weight w̃t ∈ Rm satisfying |w̃′t Xt + (1 − w̃′t e)X0| ≪ 1⁸ almost surely for each time t = 0, 1, . . . , T − 1 where e = (1, . . . , 1)′.
Then the return of the portfolio from time t to t + 1 is written as 1 + X0 + w′t(Xt − X0 e)⁹ and the total return from time 0 to time T (called terminal wealth) is written as

WT := ∏_{t=0}^{T−1} {1 + X0 + w′t(Xt − X0 e)}.

Suppose that a utility function U : x ↦ U(x) is differentiable, concave and strictly increasing for each x ∈ R. Consider an investor's problem

max_{{wt}_{t=0}^{T−1}} E[U(WT)].

In the same way as (4.1.2), the value function Vt can be written as

Vt ≡ max_{{ws}_{s=t}^{T−1}} E[U(WT) | Ft] = max_{wt} E[Vt+1 | Ft]

subject to the terminal condition VT = U(WT). The first-order conditions (FOCs) are

E[∂W U(WT)(Xt − X0 e) | Ft] = 0

where ∂W U(WT) = ∂W U(W)|_{W=WT}. These FOCs make up a system of nonlinear equations involving

⁶Suppose that Si,t is the value of asset i at time t. Then, the return is described as 1 + Xi,t = Si,t+1/Si,t.
⁷Suppose that Bt is the value of the risk-free asset at time t. Then, the return is described as 1 + X0 = Bt/Bt−1.
⁸We assume that the risky assets exclude ultra-high-risk and high-return ones for which, for instance, the asset value Si,t+1 may be larger than 2Si,t.
⁹Assuming that St := (S1,t, . . . , Sm,t)′ = Bt e, the portfolio return is written as {w′t St+1 + (1 − w′t e)Bt+1}/{w′t St + (1 − w′t e)Bt} = 1 + X0 + w′t(Xt − X0 e).
integrals that can in general be solved for wt only numerically. According to the literature (e.g., Brandt et al. (2005)), we can simplify this problem in the case of a constant relative risk aversion (CRRA) utility function,¹⁰ that is,

U(W) = W^{1−γ}/(1 − γ),    γ ≠ 1

where γ denotes the coefficient of relative risk aversion. In this case, the Bellman equation simplifies to

Vt = max_{wt} E[ max_{{ws}_{s=t+1}^{T−1}} E[(1/(1 − γ))(WT)^{1−γ} | Ft+1] | Ft ]
   = max_{wt} E[ max_{{ws}_{s=t+1}^{T−1}} E[(1/(1 − γ)) ∏_{s=0}^{T−1} {1 + X0 + w′s(Xs − X0 e)}^{1−γ} | Ft+1] | Ft ]
   = max_{wt} E[ (1/(1 − γ)) ∏_{s=0}^{t} {1 + X0 + w′s(Xs − X0 e)}^{1−γ}
       × max_{{ws}_{s=t+1}^{T−1}} E[ ∏_{s=t+1}^{T−1} {1 + X0 + w′s(Xs − X0 e)}^{1−γ} | Ft+1 ] | Ft ]
   = max_{wt} E[ (1/(1 − γ))(Wt+1)^{1−γ} max_{{ws}_{s=t+1}^{T−1}} E[(W^T_{t+1})^{1−γ} | Ft+1] | Ft ]
   = max_{wt} E[ U(Wt+1)Ψt+1 | Ft ]

where W^T_{t+1} = ∏_{s=t+1}^{T−1} {1 + X0 + w′s(Xs − X0 e)} and Ψt+1 = max_{{ws}_{s=t+1}^{T−1}} E[(W^T_{t+1})^{1−γ} | Ft+1]. From this, the value function Vt can be expressed as

Vt = U(Wt)Ψt

and Ψt also satisfies a Bellman equation

Ψt = max_{wt} E[{1 + X0 + w′t(Xt − X0 e)}^{1−γ} Ψt+1 | Ft]

subject to the terminal condition ΨT = 1. The corresponding FOCs (in terms of Ψt) are

E[{1 + X0 + w′t(Xt − X0 e)}^{−γ} Ψt+1 (Xt − X0 e) | Ft] = 0.    (4.1.10)

It is easy to see that we cannot obtain the solution of (4.1.10) in explicit form, so we consider an (approximated) solution by the AR bootstrap.
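For a single risky asset in the terminal period (so that Ψt+1 ≡ 1), a sample version of the FOC (4.1.10) can be solved by bisection; the return sample below is made up for illustration and stands in for a bootstrap average:

```python
import numpy as np

X0, gamma = 0.0, 3.0
X = np.array([0.5, 0.5, -0.3])          # hypothetical sampled returns

def foc(w):
    # sample analogue of E[{1 + X0 + w(X - X0)}^{-gamma} (X - X0)]
    r = 1.0 + X0 + w * (X - X0)         # per-scenario gross portfolio return
    return np.mean(r ** (-gamma) * (X - X0))

lo, hi = 0.0, 3.0                       # foc(lo) > 0 > foc(hi): sign change
for _ in range(100):                    # plain bisection on the sign change
    mid = 0.5 * (lo + hi)
    if foc(lo) * foc(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
w_opt = 0.5 * (lo + hi)
```

With several assets the FOC becomes a vector system, which is why the text resorts to numerical solvers throughout.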
Suppose that {Xt = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} is an m-vector AR(1) process defined by

Xt = µ + A(Xt−1 − µ) + ǫt

where µ = (µ1 , . . . , µm )′ is a constant m-dimensional vector, ǫt = (ǫ1 (t), . . . , ǫm (t))′ are independent
and identically distributed (i.i.d.) random m-dimensional vectors with E[ǫt ] = 0 and E[ǫt ǫ′t ] = Γ
(Γ is a nonsingular m by m matrix), and A is a nonsingular m by m matrix. We make the following
assumption.
10 As we mentioned in Section 3.1.4, various types of utility functions have been proposed such as CARA, CRRA, DARA.

Here, we deal with the CRRA utility function because it is easy to handle mathematically. If you consider other utility
functions, the Bellman equation would be more complicated.
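A least-squares VAR(1) fit and residual-bootstrap generation of future paths, in the spirit of the construction below (simplified to a single bootstrap index rather than the pair (b1, b2)), can be sketched as follows (toy data; the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_var1(X):
    # least-squares fit of X_t = mu + A (X_{t-1} - mu) + eps_t
    mu_hat = X.mean(axis=0)
    Y = X - mu_hat
    M = np.linalg.lstsq(Y[:-1], Y[1:], rcond=None)[0]   # Y_s ~ Y_{s-1} M
    A_hat = M.T
    resid = Y[1:] - Y[:-1] @ M                          # recovered errors
    return mu_hat, A_hat, resid

def bootstrap_paths(X, horizon, B):
    # resample residuals i.i.d. and propagate the fitted recursion forward
    mu_hat, A_hat, resid = fit_var1(X)
    paths = np.empty((B, horizon, X.shape[1]))
    for b in range(B):
        y = X[-1] - mu_hat
        for h in range(horizon):
            y = A_hat @ y + resid[rng.integers(len(resid))]
            paths[b, h] = mu_hat + y
    return paths

# toy data: m = 2 assets, 200 observations
X = 0.01 + 0.05 * rng.standard_normal((200, 2))
paths = bootstrap_paths(X, horizon=5, B=50)
```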
Assumption 4.1.4. det{Im − Az} ≠ 0 on {z ∈ C; |z| ≤ 1}.
Given {X−n+1, . . . , X0, X1, . . . , Xt}, the least-squares estimator Â^{(t)} of A is obtained by solving

Γ̂^{(t)} Â^{(t)} = ∑_{s=−n+2}^{t} Ŷ^{(t)}_{s−1}(Ŷ^{(t)}_s)′

where Ŷ^{(t)}_s = Xs − µ̂^{(t)}, Γ̂^{(t)} = ∑_{s=−n+1}^{t} Ŷ^{(t)}_s(Ŷ^{(t)}_s)′ and µ̂^{(t)} = (1/(n + t)) ∑_{s=−n+1}^{t} Xs. Then, the error ǫ̂^{(t)}_s = (ǫ̂^{(t)}_1(s), . . . , ǫ̂^{(t)}_m(s))′ is "recovered" by

ǫ̂^{(t)}_s := Ŷ^{(t)}_s − Â^{(t)}Ŷ^{(t)}_{s−1},    s = −n + 2, . . . , t.

Let F^{(t)}_n(·) denote the distribution which puts mass 1/(n + t) at ǫ̂^{(t)}_s. Let {ǫ^{(b,t)*}_s}_{s=t+1}^{T} (for b = 1, . . . , B) be i.i.d. bootstrapped observations from F^{(t)}_n, where B ∈ N is the number of bootstrap replications. Given {ǫ^{(b,t)*}_s}, define Y^{(b,t)*}_s and X^{(b1,b2,t)*}_s by

Y^{(b,t)*}_s = (Â^{(t)})^{s−t}(Xt − µ̂^{(t)}) + ∑_{k=t+1}^{s} (Â^{(t)})^{s−k} ǫ^{(b,t)*}_k
X^{(b1,b2,t)*}_s = µ̂^{(t)} + Â^{(t)} Y^{(b1,t)*}_{s−1} + ǫ^{(b2,t)*}_s    (4.1.11)

for s = t + 1, . . . , T. Based on the above {X^{(b1,b2,t)*}_s} (b1, b2 = 1, . . . , B; s = t + 1, . . . , T) for each t = 0, . . . , T − 1, we construct an estimator of the optimal portfolio weight w̃t as follows.
Step 1: First, we fix the current time t, which implies that the observed stretch n + t is fixed. Then, we can generate {X^{(b1,b2,t)*}_s} by (4.1.11).
Step 2: Next, for each b0 = 1, . . . , B, we obtain ŵ^{(b0,t)}_{T−1} as the maximizer of

E*_{T−1}[{1 + X0 + w′(X^{(b0,b,t)*}_T − X0 e)}^{1−γ}] = (1/B) ∑_{b=1}^{B} {1 + X0 + w′(X^{(b0,b,t)*}_T − X0 e)}^{1−γ}

or the solution of

E*_{T−1}[{1 + X0 + w′(X^{(b0,b,t)*}_T − X0 e)}^{−γ}(X^{(b0,b,t)*}_T − X0 e)]
= (1/B) ∑_{b=1}^{B} {1 + X0 + w′(X^{(b0,b,t)*}_T − X0 e)}^{−γ}(X^{(b0,b,t)*}_T − X0 e) = 0

with respect to w. Here we introduce the notation "E*_s[·]" as an estimator of the conditional expectation E[·|Fs], defined by E*_s[h(X^{(b0,b,t)*}_{s+1})] = (1/B) ∑_{b=1}^{B} h(X^{(b0,b,t)*}_{s+1}) for any function h of X^{(b0,b,t)*}_{s+1}. This ŵ^{(b0,t)}_{T−1} corresponds to the estimator of the myopic (single-period) optimal portfolio weight.
Step 3: Next, we construct estimators of $\Psi_{T-1}$. Since it is difficult to express the explicit form of $\Psi_{T-1}$, we parameterize it as linear functions of $X_{T-1}$ as follows:
$$\Psi^{(1)}(X_{T-1}, \theta_{T-1}) := [1, X_{T-1}']\,\theta_{T-1}$$
$$\Psi^{(2)}(X_{T-1}, \theta_{T-1}) := [1, X_{T-1}', \mathrm{vech}(X_{T-1} X_{T-1}')']\,\theta_{T-1}.$$
Note that the dimensions of $\theta_{T-1}$ in $\Psi^{(1)}$ and $\Psi^{(2)}$ are $m+1$ and $m(m+1)/2 + m + 1$, respectively. The idea of $\Psi^{(1)}$ and $\Psi^{(2)}$ is inspired by the parameterization of the conditional expectations in Brandt et al. (2005).
In order to construct the estimators of $\Psi^{(i)}$ ($i = 1, 2$), we introduce the conditional least squares estimators of the parameter $\theta_{T-1}^{(i)}$, that is,
$$\hat{\theta}_{T-1}^{(i)} = \arg\min_{\theta} Q_{T-1}^{(i)}(\theta)$$
where
$$Q_{T-1}^{(i)}(\theta) = \frac{1}{B} \sum_{b_0=1}^{B} E_{T-1}^{*}\left[\left(\Psi_{T-1} - \Psi^{(i)}\right)^2\right] = \frac{1}{B} \sum_{b_0=1}^{B} \left[\frac{1}{B} \sum_{b=1}^{B} \left\{\Psi_{T-1}(X_T^{(b_0,b,t)*}) - \Psi^{(i)}(X_{T-1}^{(b_0,b_0,t)*}, \theta)\right\}^2\right]$$
and
$$\Psi_{T-1}(X_T^{(b_0,b,t)*}) = \left\{1 + X_0 + (\hat{w}_{T-1}^{(b_0,t)})'(X_T^{(b_0,b,t)*} - X_0 e)\right\}^{1-\gamma}.$$
Then, by using $\hat{\theta}_{T-1}^{(i)}$, we can compute $\Psi^{(i)}(X_{T-1}^{(b_0,b,t)*}, \hat{\theta}_{T-1}^{(i)})$.
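The conditional least-squares fit of $\Psi^{(1)}$ in Step 3 is an ordinary linear regression of realized continuation values on the basis $[1, X']$. A minimal sketch on simulated data follows; the data-generating function and all values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins: B0 states X_{T-1} (m = 2) and a realised continuation
# value psi for each (a made-up smooth function plus noise).
B0, m = 300, 2
X = rng.normal(0.0, 0.1, size=(B0, m))
psi = 1.0 + X @ np.array([0.5, -0.2]) + 0.01 * rng.normal(size=B0)

# Psi^(1)(x, theta) = [1, x'] theta: fit theta by least squares.
basis = np.column_stack([np.ones(B0), X])       # regressors [1, X']
theta_hat, *_ = np.linalg.lstsq(basis, psi, rcond=None)

def psi1(x, theta):
    """Evaluate the fitted linear approximation Psi^(1) at a state x."""
    return np.concatenate(([1.0], x)) @ theta

print(theta_hat)   # estimate of the (m + 1)-dimensional theta_{T-1}
```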
Step 4: Based on the above $\Psi^{(i)}$, we obtain $\hat{w}_{T-2}^{(b_0,t)}$ as the maximizer of
$$E_{T-2}^{*}\left[\left\{1 + X_0 + w'(X_{T-1}^{(b_0,b,t)*} - X_0 e)\right\}^{1-\gamma} \Psi^{(i)}(X_{T-1}^{(b_0,b,t)*}, \hat{\theta}_{T-1}^{(i)})\right] = \frac{1}{B} \sum_{b=1}^{B} \left\{1 + X_0 + w'(X_{T-1}^{(b_0,b,t)*} - X_0 e)\right\}^{1-\gamma} \Psi^{(i)}(X_{T-1}^{(b_0,b,t)*}, \hat{\theta}_{T-1}^{(i)})$$
or the solution of
$$E_{T-2}^{*}\left[\left\{1 + X_0 + w'(X_{T-1}^{(b_0,b,t)*} - X_0 e)\right\}^{-\gamma} (X_{T-1}^{(b_0,b,t)*} - X_0 e)\, \Psi^{(i)}(X_{T-1}^{(b_0,b,t)*}, \hat{\theta}_{T-1}^{(i)})\right] = \frac{1}{B} \sum_{b=1}^{B} \left\{1 + X_0 + w'(X_{T-1}^{(b_0,b,t)*} - X_0 e)\right\}^{-\gamma} (X_{T-1}^{(b_0,b,t)*} - X_0 e)\, \Psi^{(i)}(X_{T-1}^{(b_0,b,t)*}, \hat{\theta}_{T-1}^{(i)}) = 0$$
with respect to $w$. This $\hat{w}_{T-2}^{(b_0,t)}$ does not correspond to the estimator of the myopic (single-period) optimal portfolio weight, due to the effect of $\Psi^{(i)}$.
Step 5: In the same manner as Steps 3–4, we can obtain $\hat{\theta}_s^{(i)}$ and $\hat{w}_s^{(b_0,t)}$, recursively, for $s = T-2, T-3, \ldots, t+1$.
Step 6: Then, we define an optimal portfolio weight estimator at time $t$ as $\hat{w}_t^{(t)} := \hat{w}_t^{(b_0,t)}$ by Step 4. Note that $\hat{w}_t^{(t)}$ is obtained as only one solution because $X_{t+1}^{(b_0,b,t)*}$ $(= \hat{\mu}^{(t)} + \hat{A}^{(t)}(X_t - \hat{\mu}^{(t)}) + \epsilon_{t+1}^{(b,t)*})$ is independent of $b_0$.
Step 7: For each time $t = 0, 1, \ldots, T-1$, we obtain $\hat{w}_t^{(t)}$ by Steps 1–6. Finally, we can construct an optimal investment strategy as $\{\hat{w}_t^{(t)}\}_{t=0}^{T-1}$.
Remark 4.1.5. As we mentioned in Section 3.1.4, various types of utility functions have been proposed, such as CARA, CRRA and DARA. Here, we deal with the CRRA utility function because it is easy to handle mathematically. For other utility functions, the FOCs would be more complicated and could only be solved numerically (e.g., by generating sample paths).
Regarding the validity of the approximation of the discrete time multiperiod optimal portfolio under the CRRA utility function, we need to evaluate the method from three viewpoints.
The first viewpoint is the possibility of model misspecification. The quality of the ARB approximation becomes poor when the model assumptions are violated. For instance, if the order of the autoregressive process is misspecified, the ARB method may be invalid; on the other hand, the moving block bootstrap (MBB) would still give a valid approximation.
The second viewpoint is the validity of the parameterization of $\Psi$ in Step 3. Brandt et al. (2005) parameterize conditional expectations of the asset return processes as a vector of polynomials in the state variables, and estimate the parameters. The key feature of the parameterization is its linearity in the (nonlinear) functions of the state variables. Although the parameter space may not include the true conditional expectations, they conclude from simulation results that it works well.
The third viewpoint is the length of the trading term $T$. The algorithm proceeds recursively backward, and due to the recursive calculations the estimation errors propagate (and accumulate) from the end of the investment horizon $T$ to the current time.

4.1.4 Bibliographic Notes


Aı̈t-Sahalia and Hansen (2009), Nicolas (2011) and Pliska (1997) give a general method to solve
multiperiod portfolio problems. Gouriéroux and Jasiak (2001) introduce some econometric meth-
ods and models applicable to portfolio management, including CCAPM. Aı̈t-Sahalia and Hansen
(2009), Brandt et al. (2005) and Shiraishi (2012) investigate a simulation approach to solve dy-
namic portfolio problems. Shiraishi et al. (2012) discuss an optimal portfolio for the GPIF.

4.2 Continuous Time Problem


Merton (1969) considered a problem of maximizing the infinite-horizon expected utility of con-
sumption by investing in a set of risky assets and a riskless asset. That is, the investor selects the
amounts to be held in the m risky assets and the riskless asset at times t ∈ [0, ∞), as well as the
investor’s consumption path. The available investment opportunities consist of a riskless asset with
price $S_{0,t}$ and $m$ risky assets with prices $S_t = (S_{1,t}, \ldots, S_{m,t})'$. In this section, following Aı̈t-Sahalia et al. (2009), we will consider optimal consumption and portfolio rules when the asset price process $\{S_t\}$ is a diffusion model with jumps.
Before the discussion of the portfolio problem, we introduce three stochastic processes: the Lévy
process, standard Brownian motion and the compound Poisson process.
Definition 4.2.1. (Lévy process) A stochastic process {Xt }t≥0 on Rm defined on a probability space
(Ω, F , P) is a Lévy process if the following conditions are satisfied.
(1) For any choice of n ≥ 1 and 0 ≤ t0 < t1 < · · · < tn , random variables Xt0 , Xt1 − Xt0 , . . . , Xtn −
Xtn−1 are independent (independent increments property).
(2) X0 = 0 a.s.
(3) The distribution of X s+t − Xt does not depend on s (temporal homogeneity or stationary incre-
ments property).
(4) It is stochastically continuous.
(5) There is Ω0 ∈ F with P[Ω0 ] = 1 such that, for every ω ∈ Ω0 , Xt (ω) is right-continuous in t ≥ 0
and has left limits in t > 0.
Definition 4.2.2. (standard Brownian motion) A stochastic process $\{W_t\}_{t\geq 0}$ on $\mathbb{R}^m$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is a standard Brownian motion if it is a Lévy process and
(1) for $t > 0$, $W_t$ has a Gaussian distribution with mean vector $0$ and covariance matrix $tI_m$ ($I_m$ is the identity matrix);
(2) there is $\Omega_0 \in \mathcal{F}$ with $P[\Omega_0] = 1$ such that, for every $\omega \in \Omega_0$, $W_t(\omega)$ is continuous in $t$.
Definition 4.2.3. (compound Poisson process) Let $\{N_t\}_{t\geq 0}$ be a Poisson process with constant intensity parameter $\lambda > 0$ and $\{Z_n\}_{n=1,2,\ldots}$ be a sequence of independent and identically distributed (i.i.d.) random vectors (jump amplitudes) with probability density $f(z)$ on $\mathbb{R}^m \setminus \{0\}$. Assume that $\{N_t\}$ and $\{Z_n\}$ are independent. Define
$$Y_0 = 0, \qquad Y_t = \sum_{n=1}^{N_t} Z_n.$$
Then the stochastic process $\{Y_t\}_{t\geq 0}$ on $\mathbb{R}^m$ is a compound Poisson process associated with $\lambda$ and $f$.
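A compound Poisson value at a fixed time can be sampled directly from the definition: draw $N_T \sim \mathrm{Poisson}(\lambda T)$ and sum $N_T$ i.i.d. jumps. A small sketch follows; the intensity and the jump law are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def compound_poisson(T, lam, jump_sampler):
    """Sample Y_T = sum_{n=1}^{N_T} Z_n with N_T ~ Poisson(lam * T)."""
    N_T = rng.poisson(lam * T)
    return sum(jump_sampler() for _ in range(N_T))

# Example (m = 1): intensity lam = 2 and jumps Z_n ~ N(0.1, 0.05^2),
# both arbitrary choices for illustration.
draws = np.array([compound_poisson(1.0, 2.0, lambda: rng.normal(0.1, 0.05))
                  for _ in range(5000)])
print(draws.mean())   # close to E[Y_1] = lam * E[Z_n] = 0.2
```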
According to Sato (1999), the most basic stochastic process modeled for continuous random
motions is Brownian motion and that for jumping random motions is the Poisson process. These two
belong to a class of Lévy processes. Lévy processes are, speaking only of essential points, stochastic
processes with stationary independent increments.
The following results guarantee that any Lévy process can be characterized by (A, ν, γ), which
is called the generating triplet of the process.
Lemma 4.2.4. (Theorems 7.10 and 8.1 of Sato (1999)) Let $\{X_t\}_{t\geq 0}$ be a $d$-dimensional Lévy process and denote its characteristic function by $\mu(u) := E[e^{iu'X_1}]$. Then, we can write uniquely
$$\mu(u) = \exp\left[-\frac{1}{2} u' A u + i\gamma' u + \int_{\mathbb{R}^d} \left(e^{iu'x} - 1 - iu'x\, I_{\{x:\, \|x\| \leq 1\}}(x)\right) \nu(dx)\right]$$
where $A$ (called the Gaussian covariance matrix) is a symmetric nonnegative-definite $d \times d$ matrix, $\nu$ (called the Lévy measure of $\{X_t\}$) is a measure on $\mathbb{R}^d$ satisfying
$$\nu(\{0\}) = 0 \quad \text{and} \quad \int_{\mathbb{R}^d} \left(\|x\|^2 \wedge 1\right) \nu(dx) < \infty,$$
and $\gamma \in \mathbb{R}^d$. In addition, if $\nu$ satisfies $\int_{\|x\| \leq 1} \|x\|\, \nu(dx) < \infty$, then we can rewrite
$$\mu(u) = \exp\left[-\frac{1}{2} u' A u + i\gamma_0' u + \int_{\mathbb{R}^d} \left(e^{iu'x} - 1\right) \nu(dx)\right]$$
with $\gamma_0 \in \mathbb{R}^d$.¹¹
Remark 4.2.5. Let $\{Y_t\}$ be a ($d$-dimensional) compound Poisson process with intensity parameter $\lambda > 0$ and probability density $f(z)$ of $Z_n$. Then, the characteristic function can be written as
$$\mu(u) = E[e^{iu'Y_1}] = \exp\left[\lambda\left\{E(e^{iu'Z_n}) - 1\right\}\right] = \exp\left[\lambda \int_{\mathbb{R}^d} \left(e^{iu'z} - 1\right) f(z)\, dz\right], \quad (4.2.1)$$
which implies that this process belongs to the class of Lévy processes, with
$$A \equiv 0, \quad \gamma \equiv 0 \quad \text{and} \quad \nu(dz) \equiv \lambda f(z)\, dz.$$
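Formula (4.2.1) can be checked by simulation: compare the empirical characteristic function of $Y_1$ with the closed form. A quick Monte Carlo sketch for $m = 1$ with standard Gaussian jumps follows; the intensity and evaluation point are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo check of (4.2.1) for m = 1 with Z_n ~ N(0, 1).
lam, u, n_sim = 1.5, 0.7, 20000
N = rng.poisson(lam, size=n_sim)                        # N_1 per replicate
Y1 = np.array([rng.normal(0.0, 1.0, size=k).sum() for k in N])

empirical = np.mean(np.exp(1j * u * Y1))
# For Z ~ N(0,1), E[e^{iuZ}] = exp(-u^2/2), so (4.2.1) gives:
theoretical = np.exp(lam * (np.exp(-u**2 / 2.0) - 1.0))
print(abs(empirical - theoretical))   # small Monte Carlo error
```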

4.2.1 Optimal Consumption and Portfolio Weights


Suppose that asset prices follow the exponential Lévy dynamics
$$\frac{dS_{0,t}}{S_{0,t}} = r\, dt, \quad (4.2.2)$$
$$\frac{dS_{i,t}}{S_{i,t-}} = (r + R_i)\, dt + \sum_{j=1}^{m} \sigma_{ij}\, dW_{j,t} + dY_{i,t}, \quad i = 1, \ldots, m \quad (4.2.3)$$
with a constant rate of interest $r \geq 0$. The quantities $R_i$ and $\sigma_{ij}$ are constant parameters. We write $R = (R_1, \ldots, R_m)'$ and $\Sigma = \sigma\sigma'$ where $\sigma = (\sigma_{ij})_{i,j=1,\ldots,m}$. We assume that $\Sigma$ is a nonsingular matrix,
¹¹See p. 39 of Sato (1999).
$\boldsymbol{W}_t = (W_{1,t}, \ldots, W_{m,t})'$ is an $m$-dimensional standard Brownian motion and $Y_t = (Y_{1,t}, \ldots, Y_{m,t})'$ is an $m$-dimensional compound Poisson process with intensity parameter $\lambda > 0$ and probability density $f(z)$ of $Z_n$. Let $W_t$ denote an investor's wealth at time $t$, starting with the initial endowment $W_0$. Writing $H_{0,t}$ and $H_{i,t}$ ($i = 1, \ldots, m$) for the number of shares of the riskless and risky assets held at time $t$, the wealth $W_t$ can be written as
$$W_t = H_{0,t} S_{0,t} + \sum_{i=1}^{m} H_{i,t} S_{i,t-} = H_{0,t} S_{0,t} + H_t' S_{t-} \quad (4.2.4)$$
where $H_t = (H_{1,t}, \ldots, H_{m,t})'$, $S_t = (S_{1,t}, \ldots, S_{m,t})'$ and $S_{0,t}, S_{1,t}, \ldots, S_{m,t}$ are defined in (4.2.2) and (4.2.3). We assume that the investor consumes continuously at the rate $C_t$ at time $t$. In the absence of any income derived outside his investments in these assets, we can write for a small period $[t-h, t)$
$$-C_t h \approx (H_{0,t} - H_{0,t-h}) S_{0,t} + (H_t - H_{t-h})' S_{t-}$$
$$= (H_{0,t} - H_{0,t-h})(S_{0,t} - S_{0,t-h}) + (H_t - H_{t-h})'(S_{t-} - S_{t-h}) + (H_{0,t} - H_{0,t-h}) S_{0,t-h} + (H_t - H_{t-h})' S_{t-h}. \quad (4.2.5)$$
Taking the limits as $h \to 0$, we have
$$-C_t\, dt = dH_{0,t}\, dS_{0,t} + (dH_t)'\, dS_t + dH_{0,t} S_{0,t} + (dH_t)' S_{t-}. \quad (4.2.6)$$
Using Itô's lemma and plugging in (4.2.2), (4.2.3) and (4.2.6), the investor's wealth $W_t \equiv X(S_{0,t}, S_t, H_{0,t}, H_t)$ follows the dynamics
$$dW_t = dH_{0,t} S_{0,t} + (dH_t)' S_{t-} + H_{0,t}\, dS_{0,t} + H_t'\, dS_t + dH_{0,t}\, dS_{0,t} + (dH_t)'\, dS_t \quad (4.2.7)$$
$$= -C_t\, dt + H_{0,t}\, dS_{0,t} + H_t'\, dS_t. \quad (4.2.8)$$

Let $w_{0,t}$ and $w_t = (w_{1,t}, \ldots, w_{m,t})'$ denote the portfolio weights invested at time $t$ in the riskless and the $m$ risky assets. The portfolio weights can be written as
$$w_{0,t} = \frac{H_{0,t} S_{0,t}}{W_t}, \qquad w_{i,t} = \frac{H_{i,t} S_{i,t-}}{W_t}, \quad i = 1, \ldots, m, \quad (4.2.9)$$
which implies that the portfolio weights satisfy
$$w_{0,t} + \sum_{i=1}^{m} w_{i,t} = \frac{1}{W_t}\left(H_{0,t} S_{0,t} + H_t' S_{t-}\right) = 1 \quad (4.2.10)$$

by (4.2.4). By using $w_{0,t}$ and $w_t$ in place of $H_{0,t}$ and $H_t$, we can rewrite (4.2.8) as
$$dW_t = -C_t\, dt + w_{0,t} W_t \frac{dS_{0,t}}{S_{0,t}} + \sum_{i=1}^{m} w_{i,t} W_t \frac{dS_{i,t}}{S_{i,t-}}$$
$$= -C_t\, dt + w_{0,t} W_t\, r\, dt + \sum_{i=1}^{m} w_{i,t} W_t \left\{(r + R_i)\, dt + \sum_{j=1}^{m} \sigma_{ij}\, dW_{j,t} + dY_{i,t}\right\}$$
$$= (rW_t + w_t' R W_t - C_t)\, dt + W_t w_t' \sigma\, d\boldsymbol{W}_t + W_t w_t'\, dY_t. \quad (4.2.11)$$
Here the second equality comes from (4.2.2) and (4.2.3).
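The wealth dynamics (4.2.11) can be simulated with a simple Euler scheme that adds a Brownian term and, with probability $\lambda\, dt$, a jump in each step. The sketch below is for $m = 1$; the parameter values and the constant weight/consumption policy are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Euler scheme for (4.2.11) with m = 1, a constant weight w and a
# proportional consumption rule C_t = c * W_t (an assumed policy).
r, R, sigma, lam = 0.02, 0.04, 0.2, 1.0
w, c, W0 = 0.5, 0.03, 1.0
dt, n_steps = 1e-3, 1000

W = W0
for _ in range(n_steps):
    dN = rng.poisson(lam * dt)                 # number of jumps in the step
    Z = rng.normal(-0.02, 0.05) if dN else 0.0 # jump increment of Y_t
    dW = ((r * W + w * R * W - c * W) * dt
          + W * w * sigma * np.sqrt(dt) * rng.normal()
          + W * w * Z)
    W += dW

print(W)   # terminal wealth along one simulated path
```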


The investor's problem at time $t$ is then to pick the consumption and portfolio weight processes $\{C_s, w_s\}_{t \leq s \leq \infty}$ which maximize the infinite-horizon expected utility of consumption, discounted at rate $\beta$ (the so-called value function)
$$V(W_t, t) = \max_{\{C_s, w_s\}_{t \leq s \leq \infty}} E_t\left[\int_t^{\infty} e^{-\beta s} U(C_s)\, ds\right] \quad (4.2.12)$$
subject to $\lim_{t\to\infty} E[V(W_t, t)] = 0$ and (4.2.11) with given $W_t$. Here $U(\cdot)$ denotes a utility function of consumption $C_s$ and $E_t(\cdot)$ denotes the conditional expectation, that is, $E_t(\cdot) = E(\cdot|\mathcal{F}_t)$. For a small period $[t, t+h)$, this value function can be written as
$$V(W_t, t) = \max_{\{C_s, w_s\}_{s \in [t,t+h)}} E_t\left[\int_t^{t+h} e^{-\beta s} U(C_s)\, ds + \max_{\{C_s, w_s\}_{s \in [t+h,\infty]}} E_{t+h}\left\{\int_{t+h}^{\infty} e^{-\beta s} U(C_s)\, ds\right\}\right]$$
$$= \max_{\{C_s, w_s\}_{s \in [t,t+h)}} E_t\left[\int_t^{t+h} e^{-\beta s} U(C_s)\, ds + V(W_{t+h}, t+h)\right].$$

Then, it follows that for a small period $[t, t+h)$
$$0 \approx \max_{\{C_s, w_s\}_{s \in [t,t+h)}} E_t\left[e^{-\beta t} U(C_t)\, h + V(W_{t+h}, t+h) - V(W_t, t)\right].$$
Taking the limits as $h \to 0$, we have
$$0 = \max_{\{C_t, w_t\}}\left[e^{-\beta t} U(C_t) + E_t\left\{\frac{dV(W_t, t)}{dt}\right\}\right]. \quad (4.2.13)$$
Based on Itô's lemma for diffusion processes with jumps (see, e.g., Shreve (2004)), $dV(W_t, t)$ can be written as
$$dV(W_t, t) = \frac{\partial V(W_t, t)}{\partial t}\, dt + \frac{\partial V(W_t, t)}{\partial W_t}\, dW_t^{(C)} + \frac{1}{2} \frac{\partial^2 V(W_t, t)}{\partial W_t^2}\, (dW_t^{(C)})^2 + 1_{\{\Delta N_t = 1\}}\left\{V(W_t + W_t w_t' Z_{N_t}, t) - V(W_t, t)\right\} \quad (4.2.14)$$
where $dW_t^{(C)}$ denotes the continuous part of $dW_t$ (i.e., $dW_t^{(C)} = (rW_t + w_t' R W_t - C_t)\, dt + W_t w_t' \sigma\, d\boldsymbol{W}_t$) and the fourth term on the right-hand side describes the jump effect. Since jumps occur discretely, the number of jumps from $t-$ to $t$ is $0$ or $1$. If there is no jump in this period (i.e., $\Delta N_t = N_t - N_{t-} = 0$), the jump effect in $dV$ vanishes; otherwise (i.e., $\Delta N_t = N_t - N_{t-} = 1$), the jump effect in $dV$ is $V(W_t + \Delta Y_t, t) - V(W_t, t)$ where
$$\Delta Y_t := W_t w_t'(Y_t - Y_{t-}) = W_t w_t' \left(\sum_{n=1}^{N_t} Z_n - \sum_{n=1}^{N_{t-}} Z_n\right) = W_t w_t' Z_{N_t}$$
given $W_t$, $w_t$, $Y_{t-}$ and $\Delta N_t = N_t - N_{t-} = 1$. From the definition of $W_t$, the conditional expectations are given by
$$E_t\left(dW_t^{(C)}\right) = (rW_t + w_t' R W_t - C_t)\, dt + W_t w_t' \sigma E(d\boldsymbol{W}_t) = (rW_t + w_t' R W_t - C_t)\, dt \quad (4.2.15)$$
and
$$E_t\left\{(dW_t^{(C)})^2\right\} = (rW_t + w_t' R W_t - C_t)^2\, dt^2 + W_t^2 w_t' \sigma E(d\boldsymbol{W}_t\, d\boldsymbol{W}_t') \sigma' w_t + 2(rW_t + w_t' R W_t - C_t) W_t w_t' \sigma E(d\boldsymbol{W}_t)\, dt = W_t^2 w_t' \Sigma w_t\, dt + o(dt). \quad (4.2.16)$$

From the definitions of $N_t$ and $Z_{N_t}$, the other conditional expectation is given by
$$E_t\left[1_{\{\Delta N_t = 1\}}\left\{V(W_t + W_t w_t' Z_{N_t}, t) - V(W_t, t)\right\}\right] = P\{\Delta N_t = 1 | N_{t-}\}\, E_t\left[V(W_t + W_t w_t' Z_{N_t}, t) - V(W_t, t)\right]$$
$$= \lambda\, dt\, e^{-\lambda dt} \int \left\{V(W_t + W_t w_t' z, t) - V(W_t, t)\right\} \nu(dz) = \lambda\, dt \int \left\{V(W_t + W_t w_t' z, t) - V(W_t, t)\right\} \nu(dz) + o(dt) \quad (4.2.17)$$
where $e^{-\lambda dt} = 1 - \lambda\, dt + o(dt)$ by Taylor expansion. Hence, from (4.2.14)–(4.2.17), we can write
$$E_t\left\{\frac{dV(W_t, t)}{dt}\right\} = \frac{\partial V(W_t, t)}{\partial t} + \frac{\partial V(W_t, t)}{\partial W_t}\, E_t\left(\frac{dW_t^{(C)}}{dt}\right) + \frac{1}{2} \frac{\partial^2 V(W_t, t)}{\partial W_t^2}\, E_t\left\{\frac{(dW_t^{(C)})^2}{dt}\right\} + \frac{1}{dt}\, E_t\left[1_{\{\Delta N_t = 1\}}\left\{V(W_t + W_t w_t' Z_{N_t}, t) - V(W_t, t)\right\}\right]$$
$$= \frac{\partial V(W_t, t)}{\partial t} + (rW_t + w_t' R W_t - C_t)\, \frac{\partial V(W_t, t)}{\partial W_t} + \frac{1}{2} W_t^2 w_t' \Sigma w_t\, \frac{\partial^2 V(W_t, t)}{\partial W_t^2} + \lambda \int \left\{V(W_t + W_t w_t' z, t) - V(W_t, t)\right\} \nu(dz) + o(1).$$

Summarizing the above argument, the Bellman equation (4.2.13) can be written as
$$0 = \max_{\{C_t, w_t\}} \left\{\phi(C_t, w_t; W_t, t)\right\} \quad (4.2.18)$$
where
$$\phi(C_t, w_t; W_t, t) = e^{-\beta t} U(C_t) + \frac{\partial V(W_t, t)}{\partial t} + (rW_t + w_t' R W_t - C_t)\, \frac{\partial V(W_t, t)}{\partial W_t} + \frac{1}{2} W_t^2 w_t' \Sigma w_t\, \frac{\partial^2 V(W_t, t)}{\partial W_t^2} + \lambda \int \left\{V(W_t + W_t w_t' z, t) - V(W_t, t)\right\} \nu(dz)$$
subject to $\lim_{t\to\infty} E[V(W_t, t)] = 0$.


Although the value function $V$ is time dependent, for the infinite-horizon problem it can be written from (4.2.12) as
$$e^{\beta t} V(W_t, t) = \max_{\{C_s, w_s\}_{t \leq s \leq \infty}} E_t\left[\int_t^{\infty} e^{-\beta(s-t)} U(C_s)\, ds\right] = \max_{\{C_{t+u}, w_{t+u}\}_{0 \leq u \leq \infty}} E_t\left[\int_0^{\infty} e^{-\beta u} U(C_{t+u})\, du\right] = \max_{\{C_u, w_u\}_{0 \leq u \leq \infty}} E_0\left[\int_0^{\infty} e^{-\beta u} U(C_u)\, du\right] := L(W_t)$$
by the Markov property, where $L(W_t)$ is independent of time. Thus, (4.2.18) reduces to the following equation in terms of the time-homogeneous value function $L$:
$$0 = \max_{\{C_t, w_t\}} \left\{\phi(C_t, w_t; W_t)\right\} \quad (4.2.19)$$
where
$$\phi(C_t, w_t; W_t) = U(C_t) - \beta L(W_t) + (rW_t + w_t' R W_t - C_t)\, \frac{\partial L(W_t)}{\partial W_t} + \frac{1}{2} W_t^2 w_t' \Sigma w_t\, \frac{\partial^2 L(W_t)}{\partial W_t^2} + \lambda \int \left\{L(W_t + W_t w_t' z) - L(W_t)\right\} \nu(dz) \quad (4.2.20)$$
subject to $\lim_{t\to\infty} E[e^{-\beta t} L(W_t)] = 0$.
In order to solve the maximization problem in (4.2.19), we calculate the FOCs (first-order conditions) for $C_t$ and $w_t$ separately. As for $C_t$, it follows that
$$\frac{\partial \phi(C_t, w_t; W_t)}{\partial C_t} = \frac{\partial U(C_t)}{\partial C_t} - \frac{\partial L(W_t)}{\partial W_t} = 0.$$
Given a wealth $W_t$, the optimal consumption choice $C_t^*$ is therefore
$$C_t^* = \left[\frac{\partial U}{\partial C}\right]^{-1}\left(\frac{\partial L(W_t)}{\partial W_t}\right).$$
As for $w_t$, the optimal portfolio weight $w_t^*$ is obtained as the maximizer of the following objective function:
$$\frac{\partial L(W_t)}{\partial W_t}\, w_t' R W_t + \frac{1}{2} \frac{\partial^2 L(W_t)}{\partial W_t^2}\, W_t^2 w_t' \Sigma w_t + \lambda \int \left\{L(W_t + W_t w_t' z) - L(W_t)\right\} \nu(dz)$$
$$:= -\frac{\partial L(W_t)}{\partial W_t} W_t \left\{-w_t' R + \frac{A}{2} w_t' \Sigma w_t + \lambda \psi(w_t)\right\}$$
given a wealth $W_t$, where
$$A = -\left(\frac{\partial L(W_t)}{\partial W_t}\right)^{-1} \frac{\partial^2 L(W_t)}{\partial W_t^2}\, W_t \quad (4.2.21)$$
$$\psi(w_t) = -\left(\frac{\partial L(W_t)}{\partial W_t} W_t\right)^{-1} \int \left\{L(W_t + W_t w_t' z) - L(W_t)\right\} \nu(dz). \quad (4.2.22)$$
If $L$ is a strictly convex function for $x > 0$, it satisfies $-\frac{\partial L(W_t)}{\partial W_t} W_t < 0$ (so that the max becomes a min) and $A$ is a constant, which implies that
$$w_t^* := \arg\min_{\{w_t\}} g(w_t)$$
where
$$g(w_t) = -w_t' R + \frac{A}{2} w_t' \Sigma w_t + \lambda \psi(w_t).$$
Example 4.2.6. (CRRA power utility) Suppose that the utility function $U$ is a power utility,¹² that is,
$$U(c) = \begin{cases} \dfrac{c^{1-\gamma}}{1-\gamma} & \text{if } c > 0 \\ -\infty & \text{if } c \leq 0 \end{cases}$$
with CRRA coefficient $\gamma \in (0,1) \cup (1,\infty)$. Following Aı̈t-Sahalia et al. (2009), we look for a solution to (4.2.19) of the form
$$L(x) = K^{-\gamma} \frac{x^{1-\gamma}}{1-\gamma}$$
where $K$ is a constant; then
$$\frac{\partial L(x)}{\partial x} = (1-\gamma) \frac{L(x)}{x}, \qquad \frac{\partial^2 L(x)}{\partial x^2} = -\gamma(1-\gamma) \frac{L(x)}{x^2}.$$
¹²Aı̈t-Sahalia et al. (2009) also considered the exponential and log utility cases, briefly.
Then (4.2.21) and (4.2.22) reduce to
$$A = -\left\{(1-\gamma) \frac{L(W_t)}{W_t}\right\}^{-1} \left\{-\gamma(1-\gamma) \frac{L(W_t)}{W_t^2}\right\} W_t = \gamma \quad (4.2.23)$$
$$\psi(w_t) = -\left\{(1-\gamma) \frac{L(W_t)}{W_t} W_t\right\}^{-1} \int L(W_t)\left\{\left(1 + w_t' z\right)^{1-\gamma} - 1\right\} \nu(dz) = -\frac{1}{1-\gamma} \int \left\{\left(1 + w_t' z\right)^{1-\gamma} - 1\right\} \nu(dz). \quad (4.2.24)$$
In this case, $g(w_t)$ is time independent for given constants $R$, $A(=\gamma)$, $\Sigma$, $\lambda$ and a fixed measure $\nu$, so it is clear that any optimal solution will be time independent. Furthermore, the objective function is state independent, so any optimal solution will also be state independent. In addition, according to Aı̈t-Sahalia et al. (2009), the constant $K$ will be fully determined once we have solved for the optimal portfolio weights $w_t^*$.
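Under the CRRA specification, $\psi(w)$ in (4.2.24) is an expectation over the jump-size law and can be approximated by Monte Carlo, after which $g(w)$ can be minimized numerically. The sketch below is for $m = 1$; the parameter values and the lognormal-type jump law are our own choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch of the CRRA case for m = 1: approximate psi(w) by a Monte Carlo
# average over draws from the jump-size law, then minimise g(w) on a grid.
gamma, R, Sigma, lam = 4.0, 0.05, 0.04, 0.5
Z = np.exp(rng.normal(-0.05, 0.1, size=10000)) - 1.0   # jump sizes z > -1

def psi(w):
    # psi(w) = -E[(1 + w z)^{1-gamma} - 1] / (1 - gamma)
    return -np.mean((1.0 + w * Z) ** (1.0 - gamma) - 1.0) / (1.0 - gamma)

def g(w):
    return -w * R + 0.5 * gamma * Sigma * w ** 2 + lam * psi(w)

grid = np.linspace(0.0, 1.0, 101)
w_star = grid[np.argmin([g(w) for w in grid])]
print(w_star)   # compare with the no-jump answer R / (gamma * Sigma) = 0.3125
```

With negative-mean jumps, the minimizer is pulled below the no-jump Merton ratio, as one would expect.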

4.2.2 Estimation

As discussed previously, under the convexity of $L$, we can write the optimal portfolio weights as
$$w^* := \arg\min_{w} g(w) \quad (4.2.25)$$
where
$$g(w) = -w' R + \frac{A}{2} w' \Sigma w + \lambda \psi(w).$$
Note that $g(w)$ is time independent under some appropriate utility functions such as the CRRA power utility (see Example 4.2.6), which implies that the optimal portfolio $w^*$ is also time independent. We now consider the case where the objective function $g$ is parameterized by $\theta \in \Theta \subset \mathbb{R}^q$, that is,
$$g(w) \equiv g_\theta(w) = -w' R_\theta + \frac{A}{2} w' \Sigma_\theta w + \lambda_\theta \psi_\theta(w).$$
Then, the optimal portfolio weights (4.2.25) can be written as
$$w^*(\theta) := \arg\min_{w} g_\theta(w). \quad (4.2.26)$$
Therefore, once a parameter estimator $\hat{\theta}$ of $\theta$ is constructed from discretely observed data $\{S_t\}$, we can obtain the optimal portfolio estimator $w^*(\hat{\theta})$ by the plug-in method.

Suppose that the asset price processes $\{S_t = (S_{1,t}, \ldots, S_{m,t})'\}$ are defined by (4.2.3) with $r = 0$ and $S_{i,t} > 0$ for $i = 1, \ldots, m$ and $t \in [0, T]$. Write
$$X_{i,t} := \log S_{i,t} \quad \text{and} \quad X_{i,0} = 0.$$
Based on Itô's lemma for diffusion processes with jumps, it follows that
$$X_{i,t} = \log S_{i,0} + \int_0^t \frac{\partial \log S_i}{\partial S_i}\, dS_{i,s}^{(C)} + \frac{1}{2} \int_0^t \frac{\partial^2 \log S_i}{\partial S_i^2}\, (dS_{i,s}^{(C)})^2 + \sum_{0 < s \leq t} (\log S_{i,s} - \log S_{i,s-})$$
$$= \log S_{i,0} + \int_0^t \frac{dS_{i,s}^{(C)}}{S_{i,s-}} - \frac{1}{2} \int_0^t \left(\frac{dS_{i,s}^{(C)}}{S_{i,s-}}\right)^2 + \sum_{0 < s \leq t} (\log S_{i,s} - \log S_{i,s-})$$
where
$$\log S_{i,0} = X_{i,0} = 0, \qquad \frac{dS_{i,t}^{(C)}}{S_{i,t-}} = R_i\, dt + \sum_{j=1}^{m} \sigma_{ij}\, dW_{j,t}, \qquad \left(\frac{dS_{i,t}^{(C)}}{S_{i,t-}}\right)^2 = \left(\sum_{j=1}^{m} \sigma_{ij}^2\right) dt = \Sigma_{ii}\, dt$$
with $\sigma\sigma' = \left(\sum_{j=1}^{m} \sigma_{i_1 j}\, \sigma_{i_2 j}\right)_{i_1,i_2=1,\ldots,m} = \left(\Sigma_{i_1 i_2}\right)_{i_1,i_2=1,\ldots,m} = \Sigma$.
Denote
$$\mu := R - \frac{1}{2}(\Sigma_{11}, \ldots, \Sigma_{mm})', \qquad M_t := (M_{1,t}, \ldots, M_{m,t})', \qquad U_r := (U_{1,r}, \ldots, U_{m,r})'$$
where
$$\sum_{0 < s \leq t} (\log S_{i,s} - \log S_{i,s-}) = \sum_{0 < s \leq t} \left\{\log\left(S_{i,s-}(1 + Y_{i,s} - Y_{i,s-})\right) - \log S_{i,s-}\right\} = \sum_{0 < s \leq t} \left\{\log\left(S_{i,s-}(1 + 1_{\{N_s - N_{s-} = 1\}} Z_{i,N_s})\right) - \log S_{i,s-}\right\}$$
$$= \sum_{0 < s \leq t} \log\left(1 + 1_{\{N_s - N_{s-} = 1\}} Z_{i,N_s}\right) := \sum_{0 < s \leq t} 1_{\{N_s - N_{s-} = 1\}} U_{i,N_s} \text{ (say)} = \sum_{r=1}^{N_t} U_{i,r} = M_{i,t} \text{ (say)}.$$
Then, we can write
$$X_t := (X_{1,t}, \ldots, X_{m,t})' = \mu t + \sigma W_t + M_t, \qquad M_t := \sum_{r=1}^{N_t} U_r. \quad (4.2.27)$$
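Model (4.2.27) is straightforward to simulate on a grid by summing a drift term, Gaussian increments and compound Poisson increments. Below is a univariate sketch; the parameter values and the Gaussian law for $U_r$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Discrete simulation of (4.2.27) for m = 1:
# X_t = mu * t + sigma * W_t + M_t, with M_t a compound Poisson sum of U_r.
mu, sigma, lam = 0.03, 0.2, 1.0
h, n = 0.01, 1000                        # step size, number of increments

increments = (mu * h
              + sigma * np.sqrt(h) * rng.normal(size=n)
              + np.array([rng.normal(-0.05, 0.1, rng.poisson(lam * h)).sum()
                          for _ in range(n)]))
X = np.concatenate([[0.0], np.cumsum(increments)])   # X_0 = 0
print(X.shape)
```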

In what follows, we will consider estimation theory for the parameter $\theta := (\mu', \mathrm{vec}(\sigma)', \xi')'$ (where $\xi$ is a parameter which characterizes the intensity parameter $\lambda_\xi$ of $N_t$ and the density $f_\xi(u)$ of $U_r$) from discrete observations $\{X_t = (X_{1,t}, \ldots, X_{m,t})' = (\log S_{1,t}, \ldots, \log S_{m,t})';\ t = t_1, \ldots, t_n\}$. In particular, we will focus on two types of estimation methods: the generalized method of moments (GMM) and the threshold estimation method.

4.2.2.1 Generalized method of moments (GMM)


Suppose that the log asset price process $\{X_t = (X_{1,t}, \ldots, X_{m,t})'\}$ is defined by (4.2.27) and
$$U_r \overset{\text{i.i.d.}}{\sim} N(\beta, \Sigma_J).$$
In this case, we can treat the parameter vector as $\theta := (\mu', \mathrm{vech}(\Sigma_C)', \lambda, \beta', \mathrm{vech}(\Sigma_J)')' \in \Theta \subset \mathbb{R}^q$ where $\Sigma_C = \sigma\sigma'$ and $q = 1 + 3m + m^2$.
We assume that the observations are $X_0, X_1, \ldots, X_n$ for $n \in \mathbb{N}$. Then, the conditional moments can be written as
$$m_t^{(1)}(\theta) := \left(m_{i,t}^{(1)}(\theta)\right)_{i=1,\ldots,m} := E_\theta(X_t | X_{t-1}) = X_{t-1} + \mu + \lambda\beta$$
$$m_t^{(2)}(\theta) := \left(m_{ij,t}^{(2)}(\theta)\right)_{i,j=1,\ldots,m} := E_\theta\left[\left\{X_t - m_t^{(1)}(\theta)\right\}\left\{X_t - m_t^{(1)}(\theta)\right\}' \,\middle|\, X_{t-1}\right] = \Sigma_C + \lambda\tilde{\Sigma}_J \quad (4.2.28)$$
$$m_t^{(e)}(\theta) := \left(m_{i,t}^{(e)}(\theta)\right)_{i=1,\ldots,m} := \left(E_\theta(e^{X_{i,t}} | X_{t-1})\right)_{i=1,\ldots,m} = \left(\exp\left\{X_{i,t-1} + \mu_i + \frac{1}{2}(\Sigma_C)_{ii} + \lambda\left(e^{\beta_i + \frac{1}{2}(\Sigma_J)_{ii}} - 1\right)\right\}\right)_{i=1,\ldots,m}$$
where $\tilde{\Sigma}_J = \Sigma_J + \beta\beta'$. By using these conditional moments, we introduce
$$h(X_t, X_{t-1}; \theta) := \begin{pmatrix} h_1(X_t, X_{t-1}; \theta) \\ h_2(X_t, X_{t-1}; \theta) \\ h_e(X_t, X_{t-1}; \theta) \end{pmatrix} := \begin{pmatrix} X_t - m_t^{(1)}(\theta) \\ \mathrm{vech}\left[\left\{X_t - m_t^{(1)}(\theta)\right\}\left\{X_t - m_t^{(1)}(\theta)\right\}' - m_t^{(2)}(\theta)\right] \\ \left(e^{X_{i,t}} - m_{i,t}^{(e)}(\theta)\right)_{i=1,\ldots,m} \end{pmatrix}. \quad (4.2.29)$$

Then, the $p(= 2m + m(m+1)/2)$-dimensional stochastic process $\{h(X_t, X_{t-1}; \theta) : t = 1, 2, \ldots\}$ satisfies
$$E_\theta\left\{h(X_t, X_{t-1}; \theta) \,|\, X_{t-1}\right\} = 0 \quad \text{almost everywhere (a.e.)}$$
for all $t = 1, 2, \ldots, n$. Moreover, we introduce a $q(= 1 + 3m + m^2)$-dimensional stochastic process $\{G_n(\theta) : n = 1, 2, \ldots\}$ by
$$G_n(\theta) = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta)$$
where $a(X_{t-1}; \theta)$ is a $q \times p$-dimensional function which depends only on $X_{t-1}$ and $\theta$, and is differentiable with respect to $\theta$. Then, $G_n(\theta)$ is an unbiased martingale estimating function satisfying $E_\theta\{G_n(\theta)\} = 0$ and
$$E_\theta\{G_n(\theta) \,|\, \mathcal{F}_{n-1}\} = G_{n-1}(\theta) \quad \text{a.e.}$$
for $n = 1, 2, \ldots$, where $\mathcal{F}_{n-1}$ is the $\sigma$-field generated by the observations $X_0, X_1, \ldots, X_{n-1}$.
If $\|\mathrm{Var}_\theta\{a(X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta)\}\| < \infty$, the martingale central limit theorem
$$\frac{1}{\sqrt{n}} G_n(\theta) \overset{d}{\to} N(0, V(\theta))$$
holds, where $V(\theta)$ is a $q \times q$ matrix given by
$$V(\theta) = E_\theta\left\{a(X_{t-1}; \theta)\, K_t(\theta)\, a(X_{t-1}; \theta)'\right\}$$
with $K_t(\theta) = E_\theta\{h(X_t, X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta)' \,|\, X_{t-1}\}$. Let $\hat{\theta}_{GMM}$ be a solution of the estimating equation $G_n(\theta) = 0$. By Taylor expansion, we obtain
$$0 = G_n(\hat{\theta}_{GMM}) = G_n(\theta_0) + \dot{G}_n(\tilde{\theta})(\hat{\theta}_{GMM} - \theta_0)$$
where $\dot{G}_n(\theta) = \partial G_n(\theta)/\partial\theta'$, $|\tilde{\theta} - \theta_0| \leq |\hat{\theta}_{GMM} - \theta_0|$ and $\theta_0$ is the true parameter. Then, assuming that
$$\frac{1}{n} \dot{G}_n(\tilde{\theta}) \overset{p}{\to} E_{\theta_0}\left\{a(X_{t-1}; \theta_0)\, \dot{h}(X_t, X_{t-1}; \theta_0)\right\} := S(\theta_0)$$
where $\dot{h}(\cdot, \cdot; \theta) = \partial h(\cdot, \cdot; \theta)/\partial\theta'$, we have
$$\sqrt{n}(\hat{\theta}_{GMM} - \theta_0) \overset{d}{\to} N\left(0, S(\theta_0)^{-1} V(\theta_0)\left(S(\theta_0)^{-1}\right)'\right) \quad (4.2.30)$$
as $n \to \infty$.
We now consider how to choose the best $G_n$ in the class of unbiased martingale estimating functions. Let
$$\mathcal{G}_n := \left\{G_n(\theta) : G_n(\theta) = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta)\right\}$$
be a class of estimating functions satisfying (4.2.30). Here $a$ is a function to be chosen and $h$ is the fixed function defined in (4.2.29). For any $G_n(\theta)$, we define
$$I_{G_n}(\theta) := \bar{G}_n(\theta)' \langle G(\theta) \rangle_n^{-1} \bar{G}_n(\theta)$$
where
$$\bar{G}_n(\theta) = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, E_\theta\left\{\dot{h}(X_t, X_{t-1}; \theta) \,|\, X_{t-1}\right\} \quad (4.2.31)$$
$$\langle G(\theta) \rangle_n = \langle G(\theta), G(\theta) \rangle_n = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, K_t(\theta)\, a(X_{t-1}; \theta)'. \quad (4.2.32)$$
Then, it follows that
$$\frac{1}{n} I_{G_n}(\theta) \overset{p}{\to} S(\theta) V(\theta)^{-1} S(\theta)',$$
which implies that the inverse of $I_{G_n}(\theta)$ estimates the covariance matrix of the asymptotic distribution of the estimator $\hat{\theta}_{GMM}$. If an estimating function $G_n^* \in \mathcal{G}_n$ satisfies
$$I_{G_n^*}(\theta) \geq I_{G_n}(\theta)$$
(i.e., $I_{G_n^*}(\theta) - I_{G_n}(\theta)$ is nonnegative definite) for all $\theta \in \Theta$, all $G_n \in \mathcal{G}_n$ and all $n \in \mathbb{N}$, then $G_n^*$ is called optimal in $\mathcal{G}_n$.
Lemma 4.2.7. (Heyde (1988), Theorem 2.1 of Heyde (1997)) $G_n^* \in \mathcal{G}_n$ is an optimal estimating function within $\mathcal{G}_n$ if
$$\bar{G}_n(\theta)^{-1} \langle G(\theta), G^*(\theta) \rangle_n$$
is a constant matrix for all $G_n \in \mathcal{G}_n$.

Let $G_n, G_n^* \in \mathcal{G}_n$ be
$$G_n(\theta) = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta), \qquad G_n^*(\theta) = \sum_{t=1}^{n} a^*(X_{t-1}; \theta)\, h(X_t, X_{t-1}; \theta),$$
respectively. Then, it follows that
$$\langle G(\theta), G^*(\theta) \rangle_n = \sum_{t=1}^{n} a(X_{t-1}; \theta)\, K_t(\theta)\, a^*(X_{t-1}; \theta)'. \quad (4.2.33)$$
From (4.2.31) and (4.2.33), $\bar{G}_n(\theta)^{-1} \langle G(\theta), G^*(\theta) \rangle_n$ is constant for all $G_n \in \mathcal{G}_n$ if
$$a^*(X_{t-1}; \theta) = -\dot{H}_t(\theta)' K_t(\theta)^{-1}$$
where $\dot{H}_t(\theta) = E_\theta\{\dot{h}(X_t, X_{t-1}; \theta) \,|\, X_{t-1}\}$. An optimal estimating function is thus
$$G_n^*(\theta) = -\sum_{t=1}^{n} \dot{H}_t(\theta)' K_t(\theta)^{-1} h(X_t, X_{t-1}; \theta).$$

From (4.2.29), $\dot{H}_t(\theta)$ is given by
$$\dot{H}_t(\theta) = \begin{pmatrix} I_m & 0_{m \times m(m+1)/2} & \beta & \lambda I_m & 0_{m \times m(m+1)/2} \\ 0_{m(m+1)/2 \times m} & I_{m(m+1)/2} & \mathrm{vech}(\tilde{\Sigma}_J) & \dot{H}_t^{(2,\beta)}(\theta) & \lambda I_{m(m+1)/2} \\ \mathrm{diag}(m_t^{(e)}(\theta)) & \dot{H}_t^{(e,\Sigma_C)}(\theta) & \dot{H}_t^{(e,\lambda)}(\theta) & \dot{H}_t^{(e,\beta)}(\theta) & \dot{H}_t^{(e,\Sigma_J)}(\theta) \end{pmatrix} \quad (4.2.34)$$
where
$$\dot{H}_t^{(2,\beta)}(\theta) = \left(\dot{H}_{ij,k,t}^{(2,\beta)}(\theta)\right)_{i,j,k=1,\ldots,m,\, i \leq j} = \left(\lambda\beta_i\, \delta(j,k) + \lambda\beta_j\, \delta(i,k)\right)_{i,j,k=1,\ldots,m,\, i \leq j}$$
$$\dot{H}_t^{(e,\Sigma_C)}(\theta) = \left(\dot{H}_{i,jk,t}^{(e,\Sigma_C)}(\theta)\right)_{i,j,k=1,\ldots,m,\, j \leq k} = \left(\frac{1}{2} m_{i,t}^{(e)}(\theta)\, \delta(i,j,k)\right)_{i,j,k=1,\ldots,m,\, j \leq k}$$
$$\dot{H}_t^{(e,\lambda)}(\theta) = \left(\dot{H}_{i,t}^{(e,\lambda)}(\theta)\right)_{i=1,\ldots,m} = \left(\left(\exp\left\{\beta_i + \frac{1}{2}(\Sigma_J)_{ii}\right\} - 1\right) m_{i,t}^{(e)}(\theta)\right)_{i=1,\ldots,m}$$
$$\dot{H}_t^{(e,\beta)}(\theta) = \left(\dot{H}_{i,j,t}^{(e,\beta)}(\theta)\right)_{i,j=1,\ldots,m} = \left(\lambda \exp\left\{\beta_i + \frac{1}{2}(\Sigma_J)_{ii}\right\} m_{i,t}^{(e)}(\theta)\, \delta(i,j)\right)_{i,j=1,\ldots,m}$$
$$\dot{H}_t^{(e,\Sigma_J)}(\theta) = \left(\dot{H}_{i,jk,t}^{(e,\Sigma_J)}(\theta)\right)_{i,j,k=1,\ldots,m,\, j \leq k} = \left(\frac{\lambda}{2} \exp\left\{\beta_i + \frac{1}{2}(\Sigma_J)_{ii}\right\} m_{i,t}^{(e)}(\theta)\, \delta(i,j,k)\right)_{i,j,k=1,\ldots,m,\, j \leq k}$$
for $\delta(i,j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$ and $\delta(i,j,k) = \begin{cases} 1 & \text{if } i = j = k \\ 0 & \text{otherwise.} \end{cases}$
Similarly, from (4.2.29), $K_t(\theta)$ is given by
$$K_t(\theta) = \begin{pmatrix} m_t^{(2)}(\theta) & K_t^{(1,2)}(\theta)' & K_t^{(1,e)}(\theta)' \\ K_t^{(1,2)}(\theta) & K_t^{(2,2)}(\theta) & K_t^{(2,e)}(\theta)' \\ K_t^{(1,e)}(\theta) & K_t^{(2,e)}(\theta) & K_t^{(e,e)}(\theta) \end{pmatrix} \quad (4.2.35)$$
where
$$K_t^{(1,2)}(\theta) = \left(K_{i,jk,t}^{(1,2)}(\theta)\right)_{i,j,k=1,\ldots,m,\, j \leq k} = \left(\lambda\left\{\beta_i(\Sigma_J)_{jk} + \beta_j(\Sigma_J)_{ik} + \beta_k(\Sigma_J)_{ij} + \beta_i\beta_j\beta_k\right\}\right)_{i,j,k=1,\ldots,m,\, j \leq k}$$
$$K_t^{(2,2)}(\theta) = \left(K_{ij,kl,t}^{(2,2)}(\theta)\right)_{i,j,k,l=1,\ldots,m,\, i \leq j,\, k \leq l} = \left(m_{ik,t}^{(2)}(\theta)\, m_{jl,t}^{(2)}(\theta) + m_{il,t}^{(2)}(\theta)\, m_{jk,t}^{(2)}(\theta) + \lambda\left\{(\tilde{\Sigma}_J)_{ij}(\tilde{\Sigma}_J)_{kl} + (\tilde{\Sigma}_J)_{ik}(\tilde{\Sigma}_J)_{jl} + (\tilde{\Sigma}_J)_{il}(\tilde{\Sigma}_J)_{jk} - 2\beta_i\beta_j\beta_k\beta_l\right\}\right)_{i,j,k,l=1,\ldots,m,\, i \leq j,\, k \leq l}$$
$$K_t^{(1,e)}(\theta) = \left(K_{i,j,t}^{(1,e)}(\theta)\right)_{i,j=1,\ldots,m} = \left(\left\{(\Sigma_C)_{ij} - \lambda\beta_i + \lambda\left(\beta_i + (\Sigma_J)_{ij}\right) e^{\beta_j + \frac{1}{2}(\Sigma_J)_{jj}}\right\} m_{j,t}^{(e)}(\theta)\right)_{i,j=1,\ldots,m}$$
$$K_t^{(2,e)}(\theta) = \left(K_{ij,k,t}^{(2,e)}(\theta)\right)_{i,j,k=1,\ldots,m,\, i \leq j} = \left(\left[(\Sigma_C)_{ij} + \lambda\left\{(\Sigma_J)_{ij} + \left(\beta_i + (\Sigma_J)_{ik}\right)\left(\beta_j + (\Sigma_J)_{jk}\right)\right\} e^{\beta_k + \frac{1}{2}(\Sigma_J)_{kk}}\right] m_{k,t}^{(e)}(\theta) + \frac{K_{i,k,t}^{(1,e)}(\theta)\, K_{j,k,t}^{(1,e)}(\theta)}{m_{k,t}^{(e)}(\theta)} - m_{ij,t}^{(2)}(\theta)\, m_{k,t}^{(e)}(\theta)\right)_{i,j,k=1,\ldots,m,\, i \leq j}$$
$$K_t^{(e,e)}(\theta) = \left(K_{i,j,t}^{(e,e)}(\theta)\right)_{i,j=1,\ldots,m} = \left(\left[\exp\left\{(\Sigma_C)_{ij} + \lambda\left(e^{\beta_i + \frac{1}{2}(\Sigma_J)_{ii}} - 1\right)\left(e^{\beta_j + \frac{1}{2}(\Sigma_J)_{jj}} - 1\right)\right\} - 1\right] m_{i,t}^{(e)}(\theta)\, m_{j,t}^{(e)}(\theta)\right)_{i,j=1,\ldots,m}.$$


Let $\hat{\theta}_{GMM}$ be a solution of the estimating equation $G_n^*(\theta) = 0$. Then, we have
$$\sqrt{n}(\hat{\theta}_{GMM} - \theta_0) \overset{d}{\to} N\left(0, V^*(\theta_0)^{-1}\right) \quad (4.2.36)$$
as $n \to \infty$, where
$$\frac{1}{n} I_{G_n^*}(\theta) = \frac{1}{n} \bar{G}_n^*(\theta)' \langle G^*(\theta) \rangle_n^{-1} \bar{G}_n^*(\theta) = \frac{1}{n} \sum_{t=1}^{n} \dot{H}_t(\theta)' K_t(\theta)^{-1} \dot{H}_t(\theta) \overset{p}{\to} E_\theta\left\{\dot{H}_t(\theta)' K_t(\theta)^{-1} \dot{H}_t(\theta)\right\} = V^*(\theta).$$

Example 4.2.8. (CRRA power utility) Consider the case of Example 4.2.6. Let $\theta = (\mu', \mathrm{vech}(\Sigma^{(C)})', \lambda, \beta', \mathrm{vech}(\Sigma^{(J)})')'$. Then, the optimal portfolio can be written as
$$w^*(\theta) = \arg\min_{w \in \mathbb{R}^m} g_\theta(w)$$
where
$$g_\theta(w) = -w'\left\{\mu + \frac{1}{2}\left(\Sigma_{11}^{(C)}, \ldots, \Sigma_{mm}^{(C)}\right)'\right\} + \frac{\gamma}{2} w' \Sigma^{(C)} w + \lambda \psi_\theta(w)$$
and
$$\psi_\theta(w) = -\frac{1}{1-\gamma} \int_{z > -1} \left\{\left(1 + w'z\right)^{1-\gamma} - 1\right\} f_\theta(z)\, dz$$
$$f_\theta(z) = \phi_{(\beta, \Sigma^{(J)})}\left(\log(1+z_1), \ldots, \log(1+z_m)\right) \prod_{i=1}^{m} \frac{1}{1+z_i}.$$
Here $\phi_{(\beta, \Sigma^{(J)})}$ denotes the p.d.f. of $N(\beta, \Sigma^{(J)})$. Then, from (4.2.36) and the $\delta$-method, we have
$$\sqrt{n}\left(w^*(\hat{\theta}_{GMM}) - w^*(\theta_0)\right) \overset{d}{\to} N\left(0, \dot{w}^*(\theta_0)\, V^*(\theta_0)^{-1} \left(\dot{w}^*(\theta_0)\right)'\right)$$
where $\dot{w}^*(\theta) = \partial w^*(\theta)/\partial\theta'$.

4.2.2.2 Threshold estimation method


Suppose that the log asset price process $\{X_t = (X_{1,t}, \ldots, X_{m,t})'\}$ is defined by (4.2.27). In addition, we assume that the intensity parameter $\lambda$ of $N_t$ and the density $f(u)$ of $U_r$ are parameterized by $\xi$ (i.e., $\lambda \equiv \lambda_\xi$ and $f(u) \equiv f_\xi(u)$). Then, $M_t = \sum_{r=1}^{N_t} U_r$ is a vector-valued compound Poisson process as in Definition 4.2.3, and it has the homogeneous Poisson random measure
$$\mu_\xi(dt, du) = \sum_{s > 0} 1_{\{\Delta M_s \neq 0\}}\, \epsilon_{(s, \Delta M_s)}(dt, du)$$
where $\Delta M_s = M_s - M_{s-}$ and $\epsilon_a$ is the Dirac measure at $a$. There exists the predictable compensator
$$E[\mu_\xi(dt, du)] = \nu_\xi(du)\, dt = \lambda_\xi f_\xi(u)\, du\, dt.$$
In this case, we can treat the parameter vector as $\theta := (\mu', \mathrm{vec}(\sigma)', \xi')' \in \Theta$ ($\Theta$ a compact space) and we assume that there exists a true parameter $\theta_0 := (\mu_0', \mathrm{vec}(\sigma_0)', \xi_0')'$ in $\Theta$.
Suppose that the processes Xt = (X1,t , . . . , Xm,t )′ are observed at the discrete times 0 = t0 < t1 <
t2 < · · · < tn = T in the interval [0, T ], and we assume that the sampling intervals are ti − ti−1 = hn
for i = 1, . . . , n. We make the following assumption for the sampling scheme.
Assumption 4.2.9. hn → 0, T = tn = nhn → ∞ and nh2n → 0
According to Shimizu and Yoshida (2006), this sampling scheme is a so-called rapidly increasing experimental design. The estimation of discretely observed diffusion processes under similar sampling schemes has been studied extensively by several authors, such as Prakasa Rao (1983, 1988), Florens-Zmirou (1989), Yoshida (1992) and Kessler (1997). On the other hand, the estimation of diffusion processes with jumps has been studied by, for example, Sørensen (1991), Aı̈t-Sahalia (2004) and Shimizu and Yoshida (2006).
Shimizu and Yoshida (2006) tackled this problem and proposed an approximation of the log-likelihood of $\{X_{t_i}\}_{i=1}^{n}$. They proposed an asymptotic filter to judge whether or not a jump has occurred in an observational interval $(t_{i-1}, t_i]$ of the form
$$H_i^n := \left\{\omega \in \Omega;\ |\Delta_i^n X| > h_n^\rho\right\}$$
where $\Delta_i^n X = X_{t_i} - X_{t_{i-1}}$ and $\rho$ satisfies $0 < \rho < \frac{1}{2}$. Let $J_i^n$ be the number of jumps in the interval $(t_{i-1}, t_i]$ and, for $k = 0, 1$, denote
$$D_{i,k}^n := H_i^n \cap \{J_i^n = k\}, \qquad D_{i,2}^n := H_i^n \cap \{J_i^n \geq 2\},$$
$$C_{i,k}^n := (H_i^n)^c \cap \{J_i^n = k\}, \qquad C_{i,2}^n := (H_i^n)^c \cap \{J_i^n \geq 2\}.$$

Then, we can write
$$H_i^n = \bigcup_{k=0}^{2} D_{i,k}^n, \qquad (H_i^n)^c = \bigcup_{k=0}^{2} C_{i,k}^n.$$
Under some regularity conditions, such as Lipschitz continuity and boundedness, Shimizu and Yoshida (2006) and Shimizu (2009) showed the following result.
Lemma 4.2.10. For any $p \geq 1$ and some $\gamma > 0$, as $n \to \infty$,
$$P(D_{i,0}^n \,|\, \mathcal{F}_{t_{i-1}}) = R_1(h_n^{p(1/2-\rho)}, X_{t_{i-1}}), \qquad P(C_{i,0}^n \,|\, \mathcal{F}_{t_{i-1}}) = e^{-\lambda_0 h_n} - P(D_{i,0}^n \,|\, \mathcal{F}_{t_{i-1}}),$$
$$P(D_{i,1}^n \,|\, \mathcal{F}_{t_{i-1}}) = R_2(h_n, X_{t_{i-1}}), \qquad P(C_{i,1}^n \,|\, \mathcal{F}_{t_{i-1}}) = R_3(h_n^{1+\gamma}, X_{t_{i-1}}),$$
$$P(D_{i,2}^n \,|\, \mathcal{F}_{t_{i-1}}) \leq \lambda_0^2 h_n^2, \qquad P(C_{i,2}^n \,|\, \mathcal{F}_{t_{i-1}}) \leq \lambda_0^2 h_n^2$$
where $R_j : \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}$ ($j = 1, 2, 3$) are functions for which there exists a constant $C$ such that
$$R_j(u_n, s) \leq u_n C (1 + |s|)^C$$
for any real-valued sequence $\{u_n\}_{n \in \mathbb{N}}$ and all $s \in \mathbb{R}^m$.


This result implies that, as $n \to \infty$,
$$1_{H_i^n} \approx 1_{\{J_i^n = 1\}}, \qquad 1_{(H_i^n)^c} \approx 1_{\{J_i^n = 0\}}.$$
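The filter's behaviour is easy to see in simulation: when jumps are much larger than the diffusion scale $\sigma\sqrt{h_n}$, thresholding at $h_n^\rho$ recovers the jump/no-jump classification almost perfectly. Below is a univariate sketch; $\rho$ and all parameters are our own choices.

```python
import numpy as np

rng = np.random.default_rng(8)

# The asymptotic filter in miniature (univariate): flag increment i as a
# jump when |Delta_i X| > h^rho.
h, n, rho = 0.001, 2000, 0.4
sigma, lam = 0.2, 5.0

jumps = np.array([rng.normal(0.5, 0.1, rng.poisson(lam * h)).sum()
                  for _ in range(n)])
dX = sigma * np.sqrt(h) * rng.normal(size=n) + jumps

flagged = np.abs(dX) > h ** rho          # the events H_i^n
true_jump = jumps != 0.0
agreement = np.mean(flagged == true_jump)
print(agreement)   # high: h^rho sits between the diffusion and jump scales
```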

Based on this asymptotic filter $H_i^n$, they introduced the estimator $\hat{\theta}_{TH} = (\hat{\mu}_{TH}', \mathrm{vec}(\hat{\sigma}_{TH})', \hat{\xi}_{TH}')'$ of $\theta$ by
$$\ell_n(\hat{\theta}_{TH}) = \sup_{\theta \in \Theta} \ell_n(\theta)$$
where $\ell_n(\theta)$ is a contrast function defined by
$$\ell_n(\theta) = \ell_n^{(C)}(\theta) + \ell_n^{(J)}(\theta)$$
$$\ell_n^{(C)}(\theta) = -\frac{1}{2h_n} \sum_{i=1}^{n} (\Delta_i^n X - h_n \mu)' \Sigma^{-1} (\Delta_i^n X - h_n \mu)\, 1_{(H_i^n)^c} - \frac{1}{2} \sum_{i=1}^{n} \log \det(\Sigma)\, 1_{(H_i^n)^c}$$
$$\ell_n^{(J)}(\theta) = \sum_{i=1}^{n} \log f_\xi(\Delta_i^n X)\, 1_{H_i^n} - n h_n \int_{\mathbb{R}^m} f_\xi(u)\, du$$
where $\Sigma = \sigma\sigma'$.
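Maximizing the continuous part $\ell_n^{(C)}$ over $\mu$ and $\Sigma$ has a closed form once the increments are split by the filter, which yields simple threshold estimators of the drift and diffusion. Below is a univariate sketch with illustrative parameters; the jump-density part of the contrast is omitted.

```python
import numpy as np

rng = np.random.default_rng(9)

# Univariate toy model: dX = mu0 * h + sigma0 * sqrt(h) * N(0,1) + jumps.
h, n, rho = 0.001, 50000, 0.4
mu0, sigma0, lam = 0.3, 0.2, 5.0

jumps = np.array([rng.normal(0.5, 0.1, rng.poisson(lam * h)).sum()
                  for _ in range(n)])
dX = mu0 * h + sigma0 * np.sqrt(h) * rng.normal(size=n) + jumps

cont = dX[np.abs(dX) <= h ** rho]            # increments judged jump-free
mu_hat = cont.mean() / h                     # drift estimator
Sigma_hat = np.mean((cont - h * mu_hat) ** 2) / h   # diffusion estimator
print(mu_hat, Sigma_hat)   # near mu0 = 0.3 and sigma0^2 = 0.04
```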
This estimation method is called the threshold estimation method. It originated in the paper by Mancini (2001); the idea was also independently introduced by Shimizu (2002) and published later by Shimizu and Yoshida (2006). It is one of the useful techniques in inference for jump processes from discrete observations. In this method, statistics are constructed via the asymptotic filter $H_i^n$: we judge that a jump has occurred in the interval $(t_{i-1}, t_i]$ if the event $H_i^n$ occurs; otherwise we treat the interval as containing no jump. In other words, $h_n^\rho$ acts as a threshold, and the existence of a jump in the interval is judged by whether $|\Delta_i^n X|$ has exceeded this threshold. That is why this method is called the threshold estimation method; based on the two groups of observations (jump and no-jump), an approximation of the log-likelihood is constructed. Then, Shimizu and Yoshida (2006) showed the following.
Lemma 4.2.11. (Theorem 2.1 of Shimizu and Yoshida (2006) and Theorem 3.6 of Shimizu (2009)) Under Assumptions A.1–A.11 of Shimizu and Yoshida (2006) and Assumption 4.2.9, it follows that
$$\begin{pmatrix} \sqrt{nh_n}(\hat{\mu}_{TH} - \mu_0) \\ \sqrt{n}(\mathrm{vec}(\hat{\sigma}_{TH}) - \mathrm{vec}(\sigma_0)) \\ \sqrt{nh_n}(\hat{\xi}_{TH} - \xi_0) \end{pmatrix} \overset{\mathcal{L}}{\to} N\left(0, K(\theta_0)^{-1}\right) \quad (4.2.37)$$
as $n \to \infty$, where
$$K(\theta) = \begin{pmatrix} K_1(\theta) & 0 & 0 \\ 0 & K_2(\theta) & 0 \\ 0 & 0 & K_3(\theta) \end{pmatrix}$$
and
$$K_1(\theta) = \Sigma^{-1}$$
$$K_2(\theta) = \left(\frac{1}{2} \mathrm{tr}\left\{\left(\partial_{(\mathrm{vec}(\sigma))_i} \Sigma^{-1}\right) \Sigma \left(\partial_{(\mathrm{vec}(\sigma))_j} \Sigma^{-1}\right) \Sigma\right\}\right)_{i,j=1,\ldots,m^2}$$
$$K_3(\theta) = \int_{\mathbb{R}^m} \frac{\left(\partial_\xi f_\xi(u)\right) \left(\partial_\xi f_\xi(u)\right)'}{f_\xi(u)}\, du.$$

Moreover, Shimizu and Yoshida (2006) argue that the asymptotic efficiency for θ̂T H is obtained
since K1 and K3 correspond to the asymptotic variance of the estimator for the continuously ob-
served ergodic diffusion processes with jumps.

Example 4.2.12. (CRRA power utility) Consider the case of Example 4.2.6. Let θ = (µ', vec(σ)', ξ')'. Then, the optimal portfolio can be written as

w^*(\theta) = \arg\min_{w\in\mathbb{R}^m} g_\theta(w)

where

g_\theta(w) = -w'\left\{\mu + \frac{1}{2}(\Sigma_{11},\dots,\Sigma_{mm})'\right\} + \frac{\gamma}{2}\,w'\Sigma w + \lambda\psi_\theta(w)

and

\Sigma = \sigma\sigma'

\psi_\theta(w) = -\frac{1}{1-\gamma}\int_{z>-1}\left\{(1+w'z)^{1-\gamma}-1\right\} f_\xi\left(\log(1+z_1),\dots,\log(1+z_m)\right)\prod_{i=1}^{m}\frac{1}{1+z_i}\,dz.
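As a minimal numerical sketch, suppose there is no jump component (λ = 0); then g_θ(w) reduces to a strictly convex quadratic and the plug-in optimal portfolio solves a linear system. The parameter values µ, Σ and γ below are invented for illustration.

```python
import numpy as np

# Illustrative parameters (not from the text): 2 assets, CRRA index gamma
mu = np.array([0.05, 0.03])
Sigma = np.diag([0.04, 0.02])
gamma = 2.0

# With lambda = 0 the criterion is
#   g(w) = -w'(mu + (1/2) diag(Sigma)) + (gamma/2) w' Sigma w,
# so the first-order condition gamma * Sigma w = mu + (1/2) diag(Sigma)
# gives the plug-in optimal portfolio in closed form.
lin = mu + 0.5 * np.diag(Sigma)
w_star = np.linalg.solve(gamma * Sigma, lin)

grad_at_opt = -lin + gamma * Sigma @ w_star   # should be (numerically) zero
```

With a nonzero jump intensity λ, the extra term λψ_θ(w) would have to be evaluated by numerical integration or Monte Carlo over the jump-size density, and the minimization done numerically.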
98 MULTIPERIOD PROBLEM FOR PORTFOLIO THEORY
Then, from (4.2.37) and the δ-method, we have

\sqrt{nh_n}\left(w^*(\hat{\theta}_{TH}) - w^*(\theta_0)\right) \xrightarrow{d} N\left(0,\ \dot{w}^*(\theta_0)\,\tilde{K}(\theta_0)^{-1}\,(\dot{w}^*(\theta_0))'\right)

where

\tilde{K}(\theta)^{-1} = \begin{pmatrix} K_1(\theta)^{-1} & 0 & 0 \\ 0 & K_2(\theta)^{-1} & 0 \\ 0 & 0 & K_3(\theta)^{-1} \end{pmatrix}.

4.2.3 Bibliographic Notes


Merton (1969) introduced portfolio theory for continuous time models, and his results are summarized in Merton (1992). In addition, Aït-Sahalia et al. (2009) extended his results to a diffusion model with jumps. Chapter 4 of Aït-Sahalia and Hansen (2009) and Heyde (1997) provide an introduction and some applications of GMM. Shimizu and Yoshida (2006) and Shimizu (2009) (in Japanese) introduced the threshold method and showed its asymptotic properties.

4.3 Universal Portfolio


The mean-variance approach has two problems pointed out by finance practitioners, private in-
vestors and researchers. The first problem is related to distributional assumptions concerning the
behaviour of asset prices, and the second problem is related to the selection of the optimal portfo-
lio depending on some objective function and/or utility function defined according to the investor’s
goal. To overcome these problems, a different approach based on information theory to portfolio
selection was proposed by Cover (1991). The main characteristic is that no distributional assump-
tions on the sequence of price relatives are required. To emphasize this independence from statistical
assumptions, such portfolios are called “universal portfolios.” Cover (1991) showed that universal portfolios achieve the same asymptotic growth rate as the best constant rebalanced portfolio in hindsight, which achieves the optimal growth rate.
Moreover, Cover and Ordentlich (1996) considered adding side information, such as discrete-valued explanatory variables, and obtained precise bounds on the ratio of the wealth given by the universal portfolio to the best wealth achievable by a constant rebalanced portfolio given hindsight. Gaivoronski and Stella (2000) introduced ideas of stochastic optimization into the definition of universal portfolios. Blum and Kalai (1999) addressed the original formulation's lack of consideration of transaction costs.

4.3.1 µ-Weighted Universal Portfolios


Suppose that a stock market can be represented by x_1, x_2, \dots ∈ \mathbb{R}^m_+, where m ∈ \mathbb{N} is the number of stocks in the market, x_i = (x_{i1}, x_{i2}, \dots, x_{im})' for each i ∈ \mathbb{N} is the vector of price relatives for all stocks on day i, and x_{ij} ≥ 0 is the price relative for stock j (= 1, 2, \dots, m) on day i. Note that the price relative x_{ij} is the ratio of the price at the end of day i to the price at the beginning of day i. For instance, x_{ij} = 1.03 means that the jth stock went up 3 percent on day i.
In this section, a portfolio b = (b_1, b_2, \dots, b_m)', b_i ≥ 0, \sum_{i=1}^{m} b_i = 1, is the proportion of the current wealth invested in each of the m stocks. We denote the set of (m − 1)-dimensional portfolios by

B = \left\{ b \in \mathbb{R}^m : \sum_{j=1}^{m} b_j = 1,\ b_j \ge 0 \right\}.

According to Cover and Thomas (2006), this set is called the probability simplex in \mathbb{R}^m. For any b ∈ B, the wealth relative on day i (i.e., the ratio of the wealth at the end of day i to the wealth at the beginning of day i) is given by

S_i = b'x_i = \sum_{j=1}^{m} b_j x_{ij}.

Consider an arbitrary (nonrandom) sequence of stock vectors x_1, x_2, \dots, x_n ∈ \mathbb{R}^m_+ for n ∈ \mathbb{N}


(i.e., the investment period is n days). According to Cover and Thomas (2006), a realistic target is
the growth achieved by the best constant rebalanced portfolio strategy in hindsight (i.e., the best
constant rebalanced portfolio on the known sequence of stock market vectors). For a constant rebal-
anced portfolio b ∈ B, the wealth is described as follows:
Definition 4.3.1. At the end of n days, a “constant rebalanced portfolio” b = (b_1, \dots, b_m)' achieves wealth:

S_n(b, x^n) = \prod_{i=1}^{n} b'x_i = \prod_{i=1}^{n}\sum_{j=1}^{m} b_j x_{ij}

where x^n is the sequence of n vectors (i.e., x^n = \{x_1, \dots, x_n\}).
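Definition 4.3.1 translates directly into code; the price relatives below are made up for illustration.

```python
import numpy as np

def crp_wealth(b, x):
    """S_n(b, x^n) = prod_{i=1}^n b'x_i for a constant rebalanced
    portfolio b over an n x m array x of price relatives."""
    return float(np.prod(np.asarray(x, dtype=float) @ np.asarray(b, dtype=float)))

# three days, two stocks (made-up price relatives)
x = np.array([[1.10, 0.95],
              [0.90, 1.05],
              [1.20, 1.00]])
S_half = crp_wealth([0.5, 0.5], x)   # rebalance to 50/50 every day
S_one = crp_wealth([1.0, 0.0], x)    # keep all wealth in stock 1
```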


Within the class of constant rebalanced portfolios B, there exists the best constant rebalanced
portfolio (in hindsight), which is defined as follows:
Definition 4.3.2. At the end of n days, the “best constant rebalanced portfolio” b∗ = (b∗1 , . . . , b∗m )′
achieves wealth:
Y
n
S n∗ (xn ) = max S n (b, xn ) = max b′ xi .
b∈B b∈B
i=1
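For m = 2 assets the hindsight maximization in Definition 4.3.2 is one-dimensional, so a simple grid search over the simplex suffices as a sketch; the alternating two-asset market below is a made-up toy example.

```python
import numpy as np

def best_crp_2assets(x, grid=10001):
    """Hindsight best constant rebalanced portfolio for m = 2 assets
    by grid search over the simplex {(t, 1-t) : 0 <= t <= 1}."""
    x = np.asarray(x, dtype=float)
    ts = np.linspace(0.0, 1.0, grid)
    # log-wealth of every candidate portfolio, evaluated in one pass
    logS = np.log(np.outer(x[:, 0], ts) + np.outer(x[:, 1], 1.0 - ts)).sum(axis=0)
    k = int(np.argmax(logS))
    return np.array([ts[k], 1.0 - ts[k]]), float(np.exp(logS[k]))

# a stock that alternately doubles and halves versus one that stays flat
x = np.array([[1.0, 2.0],
              [1.0, 0.5]] * 3)
b_star, S_star = best_crp_2assets(x)
```

At b* = (0.5, 0.5) each two-day cycle multiplies wealth by 1.5 × 0.75 = 1.125, even though neither individual stock gains over a cycle — the volatility-pumping effect of constant rebalancing.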

In this section, we will find an optimal universal strategy that asymptotically achieves the same wealth without advance knowledge of the distribution of the stock market vector. This optimal universal strategy should belong to the class of causal (nonanticipating) portfolio strategies that depend only on the past values of the stock market sequence. For a causal (nonanticipating) portfolio strategy, the wealth is described as follows:
Definition 4.3.3. At the end of n days, a “causal (nonanticipating) portfolio” strategy \{\hat{b}_i = (\hat{b}_{i1}, \dots, \hat{b}_{im})';\ i = 1, \dots, n\} achieves wealth:

\hat{S}_n(x^n) = \prod_{i=1}^{n}\hat{b}_i'x_i.

Note that the best constant rebalanced portfolio b^* is determined by the stock market sequence x^n = \{x_1, \dots, x_n\}, i.e., b^* \equiv b^*(x^n). Similarly, each causal (nonanticipating) portfolio \hat{b}_i is determined by the past stock market sequence x^{i-1} = \{x_1, \dots, x_{i-1}\}, i.e., \hat{b}_i \equiv \hat{b}_i(x^{i-1}).
It would be desirable if a causal (nonanticipating) portfolio strategy {b̂i } yielded wealth in some
sense close to the wealth obtained by means of the best constant rebalanced portfolio b∗ . One such
strategy was proposed by Cover (1991) under the name of the universal portfolio and the definition
is as follows:
Definition 4.3.4. The universal portfolio strategy is the performance-weighted strategy specified by

\hat{b}_1 = \left(\frac{1}{m}, \frac{1}{m}, \dots, \frac{1}{m}\right)', \qquad \hat{b}_{k+1} = \frac{\int_{b\in B} b\,S_k(b, x^k)\,db}{\int_{b\in B} S_k(b, x^k)\,db}, \qquad k = 1, 2, \dots, n \qquad (4.3.1)

where S_k(b, x^k) = \prod_{i=1}^{k} b'x_i and S_0(b, x^0) = 1.
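The integrals in (4.3.1) can be approximated by Monte Carlo, since a Dirichlet(1, …, 1) draw is uniform on the simplex B; the sample size and the one-day toy market below are illustrative assumptions.

```python
import numpy as np

def universal_portfolio_next(x_past, n_samples=200_000, seed=0):
    """Monte Carlo sketch of Cover's update (4.3.1): draw portfolios
    uniformly from the simplex and weight them by past wealth S_k(b, x^k)."""
    x_past = np.asarray(x_past, dtype=float)
    rng = np.random.default_rng(seed)
    m = x_past.shape[1]
    b = rng.dirichlet(np.ones(m), size=n_samples)   # uniform on the simplex
    S = np.prod(b @ x_past.T, axis=1)               # S_k(b, x^k) per sample
    return (b * S[:, None]).sum(axis=0) / S.sum()   # wealth-weighted average

x_past = np.array([[1.0, 2.0]])     # one observed day: stock 2 doubled
b_hat = universal_portfolio_next(x_past)
```

For m = 2 and this single observation the update can be computed exactly: \hat{b}_2 = (5/6)/(3/2) = 5/9, so the estimate should land near (4/9, 5/9).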
The following result establishes the optimality of the best constant rebalanced portfolio: S_n^*(x^n) (the wealth of the best constant rebalanced portfolio) is at least \hat{S}_n(x^n) (the wealth of the universal portfolio strategy) for every sequence of outcomes from the stock market.

Theorem 4.3.5. Let x_1, x_2, \dots, x_n be a sequence of nonrandom stock vectors in \mathbb{R}^m_+. Let S_n^*(x^n) be the wealth achieved by the best constant rebalanced portfolio defined by Definition 4.3.2, and let \hat{S}_n(x^n) be the wealth achieved by the universal portfolio strategy defined by Definitions 4.3.3 and 4.3.4. Then

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \ge 1.

Proof: From (4.3.1), for k = 1, 2, \dots, n, it follows that

\hat{b}_k'x_k = \frac{\int_{b\in B} b'x_k\,S_{k-1}(b, x^{k-1})\,db}{\int_{b\in B} S_{k-1}(b, x^{k-1})\,db} = \frac{\int_{b\in B}\prod_{i=1}^{k} b'x_i\,db}{\int_{b\in B}\prod_{i=1}^{k-1} b'x_i\,db} \qquad (4.3.2)

hence, we have

\hat{S}_n(x^n) = \prod_{k=1}^{n}\hat{b}_k'x_k = \prod_{k=1}^{n}\frac{\int_{b\in B}\prod_{i=1}^{k} b'x_i\,db}{\int_{b\in B}\prod_{i=1}^{k-1} b'x_i\,db} = \frac{\int_{b\in B}\prod_{i=1}^{n} b'x_i\,db}{\int_{b\in B} db} \qquad (4.3.3)

\le \frac{\max_{b\in B}\prod_{i=1}^{n} b'x_i\,\int_{b\in B} db}{\int_{b\in B} db} = S_n^*(x^n). \qquad \square

Cover (1991) and Cover and Ordentlich (1996) showed an upper bound for the ratio of the wealth achieved by the best constant rebalanced portfolio to the wealth achieved by the universal portfolio strategy.
Theorem 4.3.6. (Theorem 1 of Cover and Ordentlich (1996)) Let x^n be a stock market sequence \{x_1, x_2, \dots, x_n\}, x_i ∈ \mathbb{R}^m_+, of length n with m assets. Let S_n^*(x^n) be the wealth achieved by the best constant rebalanced portfolio defined by Definition 4.3.2, and let \hat{S}_n(x^n) be the wealth achieved by the universal portfolio defined by Definitions 4.3.3 and 4.3.4. Then

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \le \binom{n+m-1}{m-1} \le (n+1)^{m-1}.

Proof of Theorem 4.3.6. See Appendix.

Remark 4.3.7. Theorems 4.3.5 and 4.3.6 immediately imply that

\frac{1}{n}\log 1 \le \frac{1}{n}\log\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \le \frac{m-1}{n}\log(n+1).

Therefore, it follows that

\frac{1}{n}\log\hat{S}_n(x^n) - \frac{1}{n}\log S_n^*(x^n) \to 0

as n → ∞. Thus, the universal portfolio achieves the same asymptotic growth rate of wealth as the best constant rebalanced portfolio.
Cover and Ordentlich (1996) generalized the definition of the universal portfolio given in Definition 4.3.4 as follows:
Definition 4.3.8. The µ-weighted universal portfolio strategy is specified by

\hat{b}_{k+1} = \frac{\int_{b\in B} b\,S_k(b, x^k)\,d\mu(b)}{\int_{b\in B} S_k(b, x^k)\,d\mu(b)}, \qquad k = 0, 1, 2, \dots, n \qquad (4.3.4)

where \int_{b\in B} d\mu(b) = 1, S_k(b, x^k) = \prod_{i=1}^{k} b'x_i and S_0(b, x^0) = 1.
Note that if µ is symmetric, then \hat{b}_1 = \left(\frac{1}{m}, \frac{1}{m}, \dots, \frac{1}{m}\right)'. The overall portfolio algorithm is specified by the choice of µ, while the portfolio action at each time depends on µ and the sequence x^k observed so far. It is easy to see that the choice dµ(b) = (m−1)!\,db corresponds to the uniform weighted universal portfolio defined in (4.3.1).
Remark 4.3.9. Cover and Ordentlich (1996) considered the case when the measure µ is the Dirichlet (1/2, 1/2, \dots, 1/2) distribution, which has a density with respect to the Lebesgue (uniform) measure on the simplex B given by

d\mu(b) = \frac{\Gamma\left(\frac{m}{2}\right)}{\left\{\Gamma\left(\frac{1}{2}\right)\right\}^m}\prod_{j=1}^{m} b_j^{-1/2}\,db.

Then, similarly to Theorem 4.3.5, it follows that

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \ge 1

and similarly to Theorem 4.3.6, it is seen that

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \le 2(n+1)^{(m-1)/2}
because

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \le \max_{j^n\in\{1,\dots,m\}^n}\frac{\prod_{r=1}^{m}\left(n_r(j^n)/n\right)^{n_r(j^n)}}{\int_{b\in B}\prod_{i=1}^{n} b_{j_i}\,d\mu(b)}

\le \max_{j^n\in\{1,\dots,m\}^n}\frac{\prod_{r=1}^{m}\left(n_r(j^n)/n\right)^{n_r(j^n)}}{\frac{\Gamma\left(\frac{m}{2}\right)}{\Gamma\left(\sum_{r=1}^{m}\left(n_r(j^n)+\frac{1}{2}\right)\right)}\prod_{r=1}^{m}\frac{\Gamma\left(n_r(j^n)+\frac{1}{2}\right)}{\Gamma\left(\frac{1}{2}\right)}}

= \max_{j^n\in\{1,\dots,m\}^n}\frac{\left\{\Gamma\left(\frac{1}{2}\right)\right\}^m\Gamma\left(n+\frac{m}{2}\right)}{\Gamma\left(\frac{m}{2}\right)}\prod_{r=1}^{m}\frac{\left(n_r(j^n)/n\right)^{n_r(j^n)}}{\Gamma\left(n_r(j^n)+\frac{1}{2}\right)} \qquad (4.3.5)

\le \frac{\Gamma\left(\frac{1}{2}\right)\Gamma\left(n+\frac{m}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(n+\frac{1}{2}\right)} \le 2(n+1)^{(m-1)/2}

where n_r(j^n) is the number of occurrences of r in j^n ∈ \{1,\dots,m\}^n. For details of the above proof, see Theorem 2 and Lemmas 3–5 of Cover and Ordentlich (1996). Thus the µ-weighted universal portfolio with Dirichlet (1/2, 1/2, \dots, 1/2) distribution also achieves the same asymptotic growth rate of wealth as the best constant rebalanced portfolio, i.e.,

\frac{1}{n}\log\hat{S}_n(x^n) - \frac{1}{n}\log S_n^*(x^n) \to 0

as n → ∞.

4.3.2 Universal Portfolios with Side Information


So far, we have considered the portfolio selection problem in the case where no additional information is available concerning the stock market. However, it is common that investors adjust (i.e., rebalance) their portfolios by use of various sources of information concerning the stock market, which is called side information. Following Cover and Ordentlich (1996), we model this side information as a finite-valued variable y made available at the start of each investment period. Thus the formal domain of our market model is a sequence of pairs \{(x_i, y_i)\}, where x_i is the vector of price relatives for all stocks on day i and y_i ∈ Y = \{1, 2, \dots, k\} denotes the state of the side information on day i. One example of side information is to set y_i = j if stock j has outperformed the other stocks in the previous r trading days. Another is to use y_i to reflect whether the moving average of the last r trading days is greater or less than the average of the price relatives on the previous trading day.
The constant rebalanced portfolio defined in Definition 4.3.1 is extended to the state constant rebalanced portfolio b(y_i), which depends only on the side information y_i ∈ Y = \{1, 2, \dots, k\}, and the wealth is described as follows:
Definition 4.3.10. At the end of n days, a “state constant rebalanced portfolio” b(1), b(2), \dots, b(k) ∈ B achieves wealth:

S_n(b(\cdot), x^n|y^n) = \prod_{i=1}^{n} b(y_i)'x_i = \prod_{i=1}^{n}\sum_{j=1}^{m} b_j(y_i)x_{ij}

where x^n is the sequence of n vectors (i.e., x^n = \{x_1, \dots, x_n\}), y^n is the sequence of n state variables (i.e., y^n = \{y_1, \dots, y_n\}) and b(\cdot) is the collection of state constant rebalanced portfolios (i.e., b(\cdot) = \{b(1), \dots, b(k)\} ∈ B^k).
Similarly to Definition 4.3.2, we define the best state constant rebalanced portfolio (in hindsight) within the class of state constant rebalanced portfolios as follows:
Definition 4.3.11. At the end of n days, the “best state constant rebalanced portfolio” b^*(\cdot) = \{b^*(1), b^*(2), \dots, b^*(k)\} achieves wealth:

S_n^*(x^n|y^n) = \max_{b(\cdot)\in B^k} S_n(b(\cdot), x^n|y^n) = \max_{b(\cdot)\in B^k}\prod_{i=1}^{n} b'(y_i)x_i.

Note that a state constant rebalanced portfolio for k states and m stocks has k(m − 1) degrees of freedom: m − 1 degrees of freedom for each of the k portfolios that must be specified, since the requirement that the entries sum to one (i.e., \sum_{j=1}^{m} b_j(y) = 1) leaves each portfolio vector with m − 1 free entries. Here m is the number of stocks and k is the number of states of side information.
Cover and Ordentlich (1996) introduced µ-weighted universal portfolios with side information for |Y| = k > 1, similarly to Definition 4.3.8, as follows:
Definition 4.3.12. The µ-weighted universal portfolio with side information is specified by

\hat{b}_i(y) = \frac{\int_{b\in B} b\,S_{i-1}(b, x^{i-1}|y)\,d\mu(b)}{\int_{b\in B} S_{i-1}(b, x^{i-1}|y)\,d\mu(b)}, \qquad i = 1, 2, \dots, n, \quad y \in Y = \{1, \dots, k\} \qquad (4.3.6)

where \int_{b\in B} d\mu(b) = 1 and S_i(b, x^i|y) is the wealth obtained by the constant rebalanced portfolio b along the subsequence \{j \le i : y_j = y\}, given by

S_i(b, x^i|y) = \prod_{j\in\{j\le i,\ y_j=y\}} b'x_j

with S_0(b, x^0|y) = 1.
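A small sketch of the wealth in Definition 4.3.10: on each day the portfolio indexed by the observed state is applied. The market data and the state portfolios below are invented for illustration.

```python
import numpy as np

def state_partitioned_wealth(x, y, b_of_state):
    """Wealth of a state constant rebalanced portfolio: on day i the
    portfolio b(y_i) selected by the side information is used."""
    S = 1.0
    for xi, yi in zip(np.asarray(x, dtype=float), y):
        S *= float(b_of_state[yi] @ xi)
    return S

x = np.array([[1.2, 0.9], [0.8, 1.1], [1.2, 0.9]])
y = [1, 2, 1]                              # side-information states
b_of_state = {1: np.array([1.0, 0.0]),     # all in stock 1 when y = 1
              2: np.array([0.0, 1.0])}     # all in stock 2 when y = 2
S3 = state_partitioned_wealth(x, y, b_of_state)
```

Because the wealth factorizes over the k state subsequences, the side-information universal portfolio in (4.3.6) is just Cover's update run separately on each subsequence.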


Similarly to (4.3.2), it follows from (4.3.6) that

\hat{b}_i(y)'x_i = \frac{\int_{b\in B} b'x_i\,S_{i-1}(b, x^{i-1}|y)\,d\mu(b)}{\int_{b\in B} S_{i-1}(b, x^{i-1}|y)\,d\mu(b)} = \frac{\int_{b\in B}\prod_{j\in\{j\le i,\ y_j=y\}} b'x_j\,d\mu(b)}{\int_{b\in B}\prod_{j\in\{j\le i-1,\ y_j=y\}} b'x_j\,d\mu(b)} \qquad (4.3.7)

for i = 1, 2, \dots, n; hence, at the end of n days, the µ-weighted universal portfolio with side information \{\hat{b}_i(y)\}_{i=1,\dots,n,\ y=1,\dots,k} achieves wealth

\hat{S}_n(x^n|y^n) = \prod_{i=1}^{n}\hat{b}_i(y_i)'x_i = \prod_{y=1}^{k}\prod_{i\in\{i\le n,\ y_i=y\}}\hat{b}_i(y)'x_i

= \prod_{y=1}^{k}\prod_{i\in\{i\le n,\ y_i=y\}}\frac{\int_{b\in B}\prod_{j\in\{j\le i,\ y_j=y\}} b'x_j\,d\mu(b)}{\int_{b\in B}\prod_{j\in\{j\le i-1,\ y_j=y\}} b'x_j\,d\mu(b)}

= \prod_{y=1}^{k}\frac{\int_{b\in B}\prod_{j\in\{j\le n,\ y_j=y\}} b'x_j\,d\mu(b)}{\int_{b\in B} d\mu(b)} = \prod_{y=1}^{k}\int_{b\in B} S_n(b, x^n|y)\,d\mu(b). \qquad (4.3.8)

By using this result, we have the following.

Theorem 4.3.13. (Theorem 3 of Cover and Ordentlich (1996)) Let x^n be a stock market sequence \{x_1, x_2, \dots, x_n\}, x_i ∈ \mathbb{R}^m_+, of length n with m assets, and y^n be a state sequence \{y_1, y_2, \dots, y_n\}, y_i ∈ Y = \{1, 2, \dots, k\}. Let S_n^*(x^n|y^n) be the wealth achieved by the best state constant rebalanced portfolio defined by Definition 4.3.11, and let \hat{S}_n(x^n|y^n) be the wealth achieved by the µ-weighted universal portfolio with side information defined by (4.3.8).
(i) The uniform weighted universal portfolio with side information (i.e., dµ(b) ≡ (m − 1)!\,db) satisfies

1 \le \frac{S_n^*(x^n|y^n)}{\hat{S}_n(x^n|y^n)} \le \prod_{r=1}^{k}\left(n_r(y^n)+1\right)^{m-1} \le (n+1)^{k(m-1)}

for all n ∈ \mathbb{N}, x^n ∈ (\mathbb{R}^m_+)^n, y^n ∈ \{1, 2, \dots, k\}^n.
(ii) The Dirichlet (1/2, 1/2, \dots, 1/2) weighted universal portfolio with side information (i.e., d\mu(b) = \frac{\Gamma\left(\frac{m}{2}\right)}{\left\{\Gamma\left(\frac{1}{2}\right)\right\}^m}\prod_{j=1}^{m} b_j^{-1/2}\,db) satisfies

1 \le \frac{S_n^*(x^n|y^n)}{\hat{S}_n(x^n|y^n)} \le 2^k\prod_{r=1}^{k}\left(n_r(y^n)+1\right)^{(m-1)/2} \le 2^k(n+1)^{k(m-1)/2}

for all n ∈ \mathbb{N}, x^n ∈ (\mathbb{R}^m_+)^n, y^n ∈ \{1, 2, \dots, k\}^n.
Here the quantity n_r(y^n) denotes the number of times y_j = r in the sequence y^n = \{y_1, \dots, y_n\}.

Remark 4.3.14. Theorem 4.3.13 immediately implies that:
(i) For the uniform weighted universal portfolio with side information

\frac{1}{n}\log 1 \le \frac{1}{n}\log\frac{S_n^*(x^n|y^n)}{\hat{S}_n(x^n|y^n)} \le \frac{k(m-1)}{n}\log(n+1).

(ii) For the Dirichlet (1/2, 1/2, \dots, 1/2) weighted universal portfolio with side information

\frac{1}{n}\log 1 \le \frac{1}{n}\log\frac{S_n^*(x^n|y^n)}{\hat{S}_n(x^n|y^n)} \le \frac{1}{n}\left\{k\log 2 + \frac{k(m-1)}{2}\log(n+1)\right\}.

Therefore, for both portfolio strategies, it follows that

\frac{1}{n}\log\hat{S}_n(x^n|y^n) - \frac{1}{n}\log S_n^*(x^n|y^n) \to 0

for k, m < ∞, as n → ∞. Thus, the universal portfolios achieve the same asymptotic growth rate of wealth as the best state constant rebalanced portfolio.

4.3.3 Successive Constant Rebalanced Portfolios


Gaivoronski and Stella (2000) applied ideas from stochastic programming to the definition of universal portfolios. Successive constant rebalanced portfolios (SCRP), which they introduced, are derived using a different set of ideas than Cover's universal portfolio, originating in nonstationary and stochastic optimization. Similar to Cover's universal portfolio, SCRP do not depend on statistical assumptions about the distribution of price relatives and exhibit similar asymptotic behaviour; that is, they achieve the same asymptotic growth rate of wealth as the best constant rebalanced portfolio (BCRP). Moreover, SCRP have an advantage over Cover's universal portfolio with respect to computational cost: Cover's universal portfolio relies on multidimensional integration, which is notoriously difficult to perform except for a few stocks, whereas SCRP are computable for fairly large numbers of stocks.
Denote

F_n(b) := \frac{1}{n}\sum_{i=1}^{n} f_i(b), \qquad f_i(b) = \log(b'x_i) \qquad (4.3.9)

as functions of b ∈ B for a given sequence of price relatives x^n = \{x_1, \dots, x_n\}.


Note that we do not need to assume stationarity of the market; we only need boundedness of the sequence of price relatives x^n. Then, the best constant rebalanced portfolio b^* defined in Definition 4.3.2 is obtained as the solution of the following optimization problem:

\max_{b\in B} F_n(b) = \max_{b\in B}\frac{1}{n}\sum_{i=1}^{n} f_i(b) = \frac{1}{n}\log\left(\max_{b\in B}\prod_{i=1}^{n} b'x_i\right).

By using the function F_n(b), the successive constant rebalanced portfolio is defined as follows:
Definition 4.3.15. The successive constant rebalanced portfolio \tilde{b}^n = \{\tilde{b}_1, \dots, \tilde{b}_n\} is defined through the following procedure:
1. At the beginning of the first trading period take

\tilde{b}_1 = \left(\frac{1}{m}, \frac{1}{m}, \dots, \frac{1}{m}\right)'.

2. At the beginning of trading period i = 2, \dots, n, the sequence of price relatives x^{i-1} = \{x_1, \dots, x_{i-1}\} is available. Compute \tilde{b}_i as the solution of the optimization problem

\max_{b\in B} F_{i-1}(b)

where F_i(b) is defined in (4.3.9).
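Step 2 of the procedure is an ordinary concave maximization over the simplex; for m = 2 a grid search is enough for a sketch. The two-day market below is a made-up example.

```python
import numpy as np

def scrp_next(x_past, grid=2001):
    """SCRP step 2 (sketch, m = 2): maximize F_{i-1}(b) = mean of
    log(b'x_k) over the simplex {(t, 1-t)} by grid search; with no
    data yet, return the uniform portfolio of step 1."""
    x_past = np.asarray(x_past, dtype=float)
    if x_past.size == 0:
        return np.array([0.5, 0.5])
    ts = np.linspace(0.0, 1.0, grid)
    F = np.log(np.outer(x_past[:, 0], ts)
               + np.outer(x_past[:, 1], 1.0 - ts)).mean(axis=0)
    k = int(np.argmax(F))
    return np.array([ts[k], 1.0 - ts[k]])

# two observed days: stock 2 doubled, then halved, while stock 1 was flat
x_past = np.array([[1.0, 2.0], [1.0, 0.5]])
b_tilde = scrp_next(x_past)
```

In a higher-dimensional market the same maximization would be carried out with a convex solver, which is what makes SCRP tractable for many stocks.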


Let \tilde{S}_n(x^n) be the wealth achieved by the successive constant rebalanced portfolio at the end of n days, namely,

\tilde{S}_n(x^n) = \prod_{i=1}^{n}\tilde{b}_i'x_i.

Then, Gaivoronski and Stella (2000) showed the following result.


Theorem 4.3.16. (Theorem 2 of Gaivoronski and Stella (2000)) Suppose that the sequence of price relatives x^n = \{x_1, \dots, x_n\} satisfies the following conditions:
1. Asymptotic independence:

\liminf_{n\in\mathbb{N}}\ \lambda_{\min}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{x_i x_i'}{\|x_i\|^2}\right) \ge \delta > 0 \qquad (4.3.10)

where \lambda_{\min}(A) is the smallest eigenvalue of the matrix A.
2. Uniform boundedness: there exist 0 < x_-, x_+ < ∞ such that

0 < x_- \le x_{ij} \le x_+ < ∞ \qquad (4.3.11)

for all i = 1, \dots, n and all j = 1, \dots, m.
Then
1. The asymptotic rate of growth of the wealth \tilde{S}_n(x^n) obtained by the successive constant rebalanced portfolio \tilde{b}^n coincides with the asymptotic growth rate of the wealth S_n^*(x^n) obtained by the best constant rebalanced portfolio b^* up to the first order of the exponent, i.e.,

\frac{1}{n}\log S_n^*(x^n) - \frac{1}{n}\log\tilde{S}_n(x^n) \to 0

as n → ∞.
2. The following inequality is satisfied:

\frac{\tilde{S}_n(x^n)}{S_n^*(x^n)} \ge C(n-1)^{-2K^2/\delta}

where K = \sup_{n\in\mathbb{N},\,b\in B}\|\partial f_n(b)/\partial b\|, while C and δ are constants.
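The asymptotic independence condition (4.3.10) can be checked empirically by computing the smallest eigenvalue of the normalized second-moment matrix; the two toy markets below (one degenerate, one not) are invented for illustration.

```python
import numpy as np

def asymptotic_independence_stat(x):
    """Smallest eigenvalue of (1/n) sum_i x_i x_i' / ||x_i||^2, the
    quantity whose liminf must stay bounded away from 0 in (4.3.10)."""
    x = np.asarray(x, dtype=float)
    u = x / np.linalg.norm(x, axis=1, keepdims=True)    # x_i / ||x_i||
    M = (u[:, :, None] * u[:, None, :]).mean(axis=0)    # average outer product
    return float(np.linalg.eigvalsh(M).min())

# degenerate market: both price relatives always move in lockstep
x_bad = np.array([[1.10, 1.10], [0.90, 0.90], [1.05, 1.05]])
lam_bad = asymptotic_independence_stat(x_bad)           # essentially zero

# non-degenerate market: the two stocks move differently
x_good = np.array([[1.2, 0.9], [0.9, 1.2], [1.0, 1.0]])
lam_good = asymptotic_independence_stat(x_good)         # bounded away from zero
```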
Consider the computation of the SCRP. When the data size n is not yet large, each new data point may bring about a substantial change in the SCRP. In this case, some smoothing is necessary to avoid such substantial changes. To overcome this problem, Gaivoronski and Stella (2000) introduced another portfolio, referred to as the weighted successive constant rebalanced portfolio (WSCRP). The WSCRP is defined as a linear combination of the previous portfolio and the new SCRP.
Definition 4.3.17. The weighted successive constant rebalanced portfolio \tilde{\tilde{b}}^n = \{\tilde{\tilde{b}}_1, \dots, \tilde{\tilde{b}}_n\} is defined through the following procedure:
1. At the beginning of the first trading period take

\tilde{\tilde{b}}_1 = \left(\frac{1}{m}, \frac{1}{m}, \dots, \frac{1}{m}\right)'.

2. At the beginning of trading period i = 2, \dots, n, the sequence of price relatives x^{i-1} = \{x_1, \dots, x_{i-1}\} is available. Compute \tilde{b}_i as the solution of the optimization problem

\max_{b\in B} F_{i-1}(b)

where F_i(b) is defined in (4.3.9).
3. Take the current portfolio at stage i as a linear combination of the previous portfolio \tilde{\tilde{b}}_{i-1} and \tilde{b}_i:

\tilde{\tilde{b}}_i = \gamma\tilde{\tilde{b}}_{i-1} + (1-\gamma)\tilde{b}_i

where γ ∈ (0, 1) is the weighting parameter.
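Step 3 is a plain convex combination, which automatically keeps the smoothed portfolio in the simplex B; the weights and γ below are illustrative numbers.

```python
import numpy as np

def wscrp_update(b_prev, b_scrp, gamma=0.7):
    """WSCRP step 3: smooth the new SCRP solution with the previous
    portfolio via b~~_i = gamma * b~~_{i-1} + (1 - gamma) * b~_i."""
    return (gamma * np.asarray(b_prev, dtype=float)
            + (1.0 - gamma) * np.asarray(b_scrp, dtype=float))

b_prev = np.array([0.5, 0.5])   # yesterday's smoothed portfolio
b_scrp = np.array([0.9, 0.1])   # today's (possibly jumpy) SCRP solution
b_new = wscrp_update(b_prev, b_scrp)
```

Larger γ damps the turnover induced by early, noisy SCRP solutions at the cost of slower adaptation, which is the trade-off reflected in the 1/(1 − γ) factor of Theorem 4.3.18.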
Let \tilde{\tilde{S}}_n(x^n) be the wealth achieved by the WSCRP at the end of n days, namely,

\tilde{\tilde{S}}_n(x^n) = \prod_{i=1}^{n}\tilde{\tilde{b}}_i'x_i.

The following theorem describes the asymptotic properties of the WSCRP.


Theorem 4.3.18. (Theorem 3 of Gaivoronski and Stella (2000)) Suppose that the sequence of price relatives x^n = \{x_1, \dots, x_n\} satisfies the asymptotic independence condition (4.3.10) and the uniform boundedness condition (4.3.11). Then
1. The asymptotic rate of growth of the wealth \tilde{\tilde{S}}_n(x^n) obtained by the weighted successive constant rebalanced portfolio \tilde{\tilde{b}}^n coincides with the asymptotic growth rate of the wealth S_n^*(x^n) obtained by the best constant rebalanced portfolio b^* up to the first order of the exponent, i.e.,

\frac{1}{n}\log S_n^*(x^n) - \frac{1}{n}\log\tilde{\tilde{S}}_n(x^n) \to 0

as n → ∞.
2. The following inequality is satisfied:

\frac{\tilde{\tilde{S}}_n(x^n)}{S_n^*(x^n)} \ge C(n-1)^{-5K^2/(\delta(1-\gamma))}

where K = \sup_{n\in\mathbb{N},\,b\in B}\|\partial f_n(b)/\partial b\|, while C and δ are constants.
Furthermore, Dempster et al. (2009) extended the SCRP to the case with side information. The mixture successive constant rebalanced portfolio (MSCRP) combines the idea of the universal portfolio with side information in Definition 4.3.12 with the idea of the SCRP in Definition 4.3.15.

4.3.4 Universal Portfolios with Transaction Costs


Blum and Kalai (1999) extended Cover's model by adding a fixed percentage commission cost 0 ≤ c ≤ 1 to each transaction. In what follows, we consider the wealth of a constant rebalanced portfolio at the end of n days including commission costs (say S_n^c(b, x)). For simplicity, we assume that the commission is charged only for purchases and not for sales (according to Blum and Kalai (1999), this can be assumed without loss of generality).
Let p_i = (p_{i1}, \dots, p_{im})' be the vector of prices for all stocks at the end of day i and at the beginning of day i + 1.¹³ Let h_i = (h_{i1}, \dots, h_{im})' and h_i^- = (h_{i1}^-, \dots, h_{im}^-)' be the vectors of the numbers of shares held for all stocks after and before rebalancing on day i. In the case of the “constant rebalanced portfolio” strategy, the portfolio weights are rebalanced so that

b_j = \frac{p_{1j}h_{1j}}{S_1} = \cdots = \frac{p_{nj}h_{nj}}{S_n}

where S_i = \sum_{l=1}^{m} p_{il}h_{il} for i = 1, \dots, n and j = 1, \dots, m. Writing b_i^- = (b_{i1}^-, \dots, b_{im}^-)' for the portfolio weights before rebalancing on day i, it follows that

b_{ij}^- = \frac{p_{ij}h_{i-1,j}}{S_i^-}

¹³ Note that x_{ij} = p_{ij}/p_{i-1,j} for i = 1, 2, \dots, n, j = 1, \dots, m, with p_{0j} = 1.
where S_i^- = \sum_{l=1}^{m} p_{il}h_{i-1,l}. Using the above notation, the transaction cost can be written as

c\sum_{j:h_{ij}>h_{i-1,j}}(h_{ij}-h_{i-1,j})p_{ij} = c\sum_{j:S_i b_j > S_i^- b_{ij}^-}(S_i b_j - S_i^- b_{ij}^-).

Here the reason we restrict the sum to h_{ij} > h_{i-1,j} is the assumption that commission is charged only for purchases and not for sales (i.e., if h_{ij} > h_{i-1,j}, asset j is purchased on day i). In addition, it follows that

h_{ij} > h_{i-1,j} \Leftrightarrow p_{ij}h_{ij} > p_{ij}h_{i-1,j} \Leftrightarrow S_i b_j > S_i^- b_{ij}^-.

Denoting \alpha_i = S_i/S_i^-, the transaction cost is described as

cS_i^-\sum_{j:\alpha_i b_j > b_{ij}^-}(\alpha_i b_j - b_{ij}^-)

which implies that

\alpha_i = \frac{S_i}{S_i^-} = \frac{S_i^- - (\text{the transaction cost})}{S_i^-} = 1 - c\sum_{j:\alpha_i b_j > b_{ij}^-}(\alpha_i b_j - b_{ij}^-). \qquad (4.3.12)
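Equation (4.3.12) determines α_i only implicitly, since the set of purchased assets depends on α_i itself; a simple fixed-point iteration resolves this. The target weights, drifted weights, and commission rate below are illustrative.

```python
import numpy as np

def rebalance_factor(b, b_minus, c, iters=100):
    """Solve (4.3.12) for alpha_i by fixed-point iteration: commission
    c is charged only on purchases, i.e., assets with alpha*b_j > b-_j."""
    b = np.asarray(b, dtype=float)
    b_minus = np.asarray(b_minus, dtype=float)
    alpha = 1.0
    for _ in range(iters):
        buy = alpha * b > b_minus                       # purchased assets
        alpha = 1.0 - c * np.sum(alpha * b[buy] - b_minus[buy])
    return float(alpha)

b = np.array([0.5, 0.5])        # target constant rebalanced weights
b_minus = np.array([0.7, 0.3])  # drifted weights before rebalancing
alpha = rebalance_factor(b, b_minus, c=0.01)
```

In this example only asset 2 is purchased, so α solves α = 1 − c(0.5α − 0.3), i.e., α = 1.003/1.005 ≈ 0.99801; the iteration converges because the map is a contraction for small c.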

Based on this α_i, for a constant rebalanced portfolio b (∈ B) with transaction costs, the wealth is described as follows:
Definition 4.3.19. At the end of n days, a constant rebalanced portfolio b = (b_1, \dots, b_m)' with transaction costs achieves wealth:

S_n^c(b, x^n) = \prod_{i=1}^{n}\alpha_i b'x_i = \prod_{i=1}^{n}\alpha_i\sum_{j=1}^{m} b_j x_{ij}

where α_i satisfies Equation (4.3.12) for each i = 1, 2, \dots, n.


In the same manner as Definition 4.3.2, we define the best constant rebalanced portfolio (in hindsight) with transaction costs as follows:
Definition 4.3.20. At the end of n days, the best constant rebalanced portfolio with transaction costs b^{c*} = (b_1^{c*}, \dots, b_m^{c*})' achieves wealth:

S_n^{c*}(x^n) = \max_{b\in B} S_n^c(b, x^n) = \max_{b\in B}\prod_{i=1}^{n}\alpha_i b'x_i.

Blum and Kalai (1999) proposed the universal portfolio with transaction costs by a slight modification of Definition 4.3.4:
Definition 4.3.21. The universal portfolio with transaction costs on day i is specified by

\hat{b}_1^c = \left(\frac{1}{m}, \frac{1}{m}, \dots, \frac{1}{m}\right)', \qquad \hat{b}_i^c = \frac{\int_{b\in B} b\,S_{i-1}^c(b, x^{i-1})\,db}{\int_{b\in B} S_{i-1}^c(b, x^{i-1})\,db}, \qquad i = 2, 3, \dots, n

where S_{i-1}^c(b, x^{i-1}) = \prod_{l=1}^{i-1}\alpha_l b'x_l and α_l satisfies Equation (4.3.12) for each l.
Moreover, for a causal (nonanticipating) portfolio strategy with transaction costs, the wealth is described as follows:
Definition 4.3.22. At the end of n days, a causal (nonanticipating) portfolio strategy \{\hat{b}_i = (\hat{b}_{i1}, \dots, \hat{b}_{im})';\ i = 1, \dots, n\} with transaction costs achieves wealth:

\hat{S}_n^c(x^n) = \prod_{i=1}^{n}\hat{\alpha}_i\hat{b}_i'x_i

where \hat{\alpha}_i satisfies

\hat{\alpha}_i = 1 - c\sum_{j:\hat{\alpha}_i\hat{b}_{ij}>\hat{b}_{ij}^-}(\hat{\alpha}_i\hat{b}_{ij}-\hat{b}_{ij}^-)

and

\hat{b}_{ij}^- = \frac{p_{ij}\hat{h}_{i-1,j}}{\sum_{r=1}^{m} p_{ir}\hat{h}_{i-1,r}} = \frac{\frac{p_{ij}}{p_{i-1,j}}\hat{b}_{i-1,j}\hat{S}_{i-1}^c(x^{i-1})}{\sum_{r=1}^{m}\frac{p_{ir}}{p_{i-1,r}}\hat{b}_{i-1,r}\hat{S}_{i-1}^c(x^{i-1})} = \frac{x_{ij}\hat{b}_{i-1,j}}{\sum_{r=1}^{m} x_{ir}\hat{b}_{i-1,r}}.

Similarly to the case without transaction costs, Blum and Kalai (1999) derived an upper bound for the ratio S_n^{c*}(x^n)/\hat{S}_n^c(x^n) using only the following natural properties of optimal rebalancing:
1. The costs paid changing from distribution b_1 to b_3 are no more than the costs paid changing from b_1 to b_2 and then from b_2 to b_3.
2. The cost, per dollar, of changing from a distribution b to a distribution (1 − α)b + αb_0 (where b_0 ∈ B and 0 < α < 1) is no more than αc, simply because we are moving at most an α fraction of our money.
3. An investment strategy I which invests an initial fraction α of its money according to investment strategy I_1 and an initial 1 − α of its money according to I_2 will achieve at least α times the wealth of I_1 plus 1 − α times the wealth of I_2. (In fact, I may do even better by occasionally saving commission costs if, for instance, strategy I_1 says to sell stock A and strategy I_2 says to buy it.)
Theorem 4.3.23. (Theorem 2 of Blum and Kalai (1999)) Let x^n be a stock market sequence \{x_1, x_2, \dots, x_n\}, x_i ∈ \mathbb{R}^m_+, of length n with m assets. Let S_n^{c*}(x^n) be the wealth achieved by the best constant rebalanced portfolio with transaction costs defined by Definition 4.3.20, and let \hat{S}_n^c(x^n) be the wealth achieved by the universal portfolio with transaction costs defined by Definitions 4.3.21 and 4.3.22. Then

\frac{S_n^{c*}(x^n)}{\hat{S}_n^c(x^n)} \le \binom{(1+c)n+m-1}{m-1} \le ((1+c)n+1)^{m-1}.
Remark 4.3.24. In the same way as Theorem 4.3.5, we have

\frac{S_n^{c*}(x^n)}{\hat{S}_n^c(x^n)} \ge 1.

Combining this with Theorem 4.3.23 immediately gives

\frac{1}{n}\log 1 \le \frac{1}{n}\log\frac{S_n^{c*}(x^n)}{\hat{S}_n^c(x^n)} \le \frac{m-1}{n}\log((1+c)n+1).

Therefore, it follows that

\frac{1}{n}\log\hat{S}_n^c(x^n) - \frac{1}{n}\log S_n^{c*}(x^n) \to 0

as n → ∞. Thus, the universal portfolio achieves the same asymptotic growth rate as the best constant rebalanced portfolio even in the presence of commission.

4.3.5 Bibliographic Notes


Cover (1991) introduced universal portfolios, which Cover and Ordentlich (1996) extended to the case with side information. These results, along with an introduction to information theory, are summarized in Cover and Thomas (2006). Regarding properties of the Dirichlet distribution, we follow Ng et al. (2011). Gaivoronski and Stella (2000) introduced successive constant rebalanced portfolios, and Dempster et al. (2009) extended their result to the case with side information. The effect of transaction costs is addressed by Blum and Kalai (1999).
4.3.6 Appendix
Proof of Theorem 4.3.6. This proof originally comes from Cover and Thomas (2006) and Cover and Ordentlich (1996); we fill in the details or partially modify them.
Rewrite S_n^*(x^n) as

S_n^*(x^n) = \prod_{i=1}^{n} b^{*\prime}x_i = \prod_{i=1}^{n}\sum_{j=1}^{m} b_j^* x_{ij}

= (b_1^* x_{11} + \cdots + b_m^* x_{1m})\cdots(b_1^* x_{n1} + \cdots + b_m^* x_{nm})

= \sum_{j_1,\dots,j_n=1}^{m}\prod_{i=1}^{n} b_{j_i}^* x_{ij_i} = \sum_{j^n=\{j_1,\dots,j_n\}\in\{1,\dots,m\}^n}\prod_{i=1}^{n} b_{j_i}^* x_{ij_i}. \qquad (4.3.13)

From (4.3.3), similarly to the above, we can rewrite

\hat{S}_n(x^n) = \frac{\int_{b\in B}\prod_{i=1}^{n} b'x_i\,db}{\int_{b\in B} db} = \frac{\sum_{j^n\in\{1,\dots,m\}^n}\int_{b\in B}\prod_{i=1}^{n} b_{j_i} x_{ij_i}\,db}{\int_{b\in B} db}. \qquad (4.3.14)

From (4.3.13), (4.3.14) and Lemma 1 of Cover and Ordentlich (1996) (or Lemma 16.7.1 of Cover and Thomas (2006)), we have

\frac{S_n^*(x^n)}{\hat{S}_n(x^n)} \le \max_{j^n\in\{1,\dots,m\}^n}\frac{\prod_{i=1}^{n} b_{j_i}^*\int_{b\in B} db}{\int_{b\in B}\prod_{i=1}^{n} b_{j_i}\,db} \qquad (4.3.15)

since the products of the x_{ij_i}'s factor out of the numerator and denominator.
Next, we consider an upper bound of \prod_{i=1}^{n} b_{j_i}^* for each j^n. For a fixed j^n = \{j_1,\dots,j_n\}, let n_r(j^n) (r = 1,\dots,m) denote the number of occurrences of r in j^n. Then, for any b ∈ B, it is seen that \prod_{i=1}^{n} b_{j_i} = \prod_{r=1}^{m} b_r^{n_r(j^n)}. In order to find the maximizer of \prod_{r=1}^{m} b_r^{n_r(j^n)}, we consider the Lagrangian

L(b) = \prod_{r=1}^{m} b_r^{n_r(j^n)} + \lambda\left(\sum_{j=1}^{m} b_j - 1\right).

Differentiating this with respect to b_i yields

\frac{\partial L}{\partial b_i} = n_i(j^n)\,b_i^{n_i(j^n)-1}\prod_{r\ne i} b_r^{n_r(j^n)} + \lambda, \qquad i = 1, 2, \dots, m.

Setting the partial derivatives equal to 0 for a maximum, we have

\arg\max_{b\in B} L(b) = \left(\frac{n_1(j^n)}{n},\dots,\frac{n_m(j^n)}{n}\right)',

which implies

\prod_{i=1}^{n} b_{j_i}^* \le \prod_{r=1}^{m}\left(\frac{n_r(j^n)}{n}\right)^{n_r(j^n)}. \qquad (4.3.16)
Next, we consider the calculation of

\int_{b\in B} db \qquad\text{and}\qquad \int_{b\in B}\prod_{i=1}^{n} b_{j_i}\,db = \int_{b\in B}\prod_{r=1}^{m} b_r^{n_r(j^n)}\,db.

Define

C = \left\{c = (c_1,\dots,c_{m-1}) \in \mathbb{R}^{m-1} : \sum_{j=1}^{m-1} c_j \le 1,\ c_j \ge 0\right\}.

Since a uniform distribution over B induces a uniform distribution over C, we have

\int_{b\in B} db = \int_{c\in C} dc = \int_{c\in C}\prod_{r=1}^{m-1} c_r^{1-1}\left(1-\sum_{r=1}^{m-1} c_r\right)^{1-1}\prod_{r=1}^{m-1} dc_r = \frac{\prod_{r=1}^{m}\Gamma(1)}{\Gamma\left(\sum_{r=1}^{m} 1\right)} = \frac{1}{(m-1)!} \qquad (4.3.17)

and

\int_{b\in B}\prod_{r=1}^{m} b_r^{n_r(j^n)}\,db = \int_{c\in C}\prod_{r=1}^{m-1} c_r^{n_r(j^n)}\left(1-\sum_{r=1}^{m-1} c_r\right)^{n_m(j^n)} dc

= \int_{c\in C}\prod_{r=1}^{m-1} c_r^{n_r(j^n)+1-1}\left(1-\sum_{r=1}^{m-1} c_r\right)^{n_m(j^n)+1-1}\prod_{r=1}^{m-1} dc_r

= \frac{\prod_{r=1}^{m}\Gamma\left(n_r(j^n)+1\right)}{\Gamma\left(\sum_{r=1}^{m}\left(n_r(j^n)+1\right)\right)} = \frac{\prod_{r=1}^{m}\Gamma\left(n_r(j^n)+1\right)}{\Gamma(n+m)} = \frac{\prod_{r=1}^{m} n_r(j^n)!}{(n+m-1)!}
from the basic properties of the Dirichlet distribution (see, e.g., Ng et al. (2011)). Moreover, by Theorem 11.1.3 of Cover and Thomas (2006), it follows that

\frac{n!}{\prod_{r=1}^{m} n_r(j^n)!} = \binom{n}{n_1(j^n),\dots,n_m(j^n)} \le 2^{nH(P)}

where H(P) is the entropy of a random variable X taking the value a_r with probability n_r(j^n)/n for r = 1, \dots, m, which implies that

H(P) = -\sum_{r=1}^{m}\frac{n_r(j^n)}{n}\log_2\frac{n_r(j^n)}{n} = -\log_2\prod_{r=1}^{m}\left(\frac{n_r(j^n)}{n}\right)^{n_r(j^n)/n}

and

2^{nH(P)} = 2^{\log_2\prod_{r=1}^{m}\left(n_r(j^n)/n\right)^{-n_r(j^n)}} = \prod_{r=1}^{m}\left(\frac{n_r(j^n)}{n}\right)^{-n_r(j^n)}.

Thus, we have

\int_{b\in B}\prod_{i=1}^{n} b_{j_i}\,db = \frac{n!}{(n+m-1)!}\cdot\frac{\prod_{r=1}^{m} n_r(j^n)!}{n!} \ge \frac{n!}{(n+m-1)!}\prod_{r=1}^{m}\left(\frac{n_r(j^n)}{n}\right)^{n_r(j^n)}. \qquad (4.3.18)
From (4.3.15), (4.3.16), (4.3.17) and (4.3.18), it follows that
\frac{S_n^*(x^n)}{\hat S_n(x^n)} \leq \max_{j^n \in \{1,\ldots,m\}^n} \frac{\prod_{r=1}^{m} \big( n_r(j^n)/n \big)^{n_r(j^n)} \int_B db}{\int_B \prod_{i=1}^{n} b_{j_i}\, db}
= \max_{j^n \in \{1,\ldots,m\}^n} \frac{\prod_{r=1}^{m} \big( n_r(j^n)/n \big)^{n_r(j^n)}}{(m-1)! \int_B \prod_{i=1}^{n} b_{j_i}\, db}
\leq \max_{j^n \in \{1,\ldots,m\}^n} \frac{\prod_{r=1}^{m} \big( n_r(j^n)/n \big)^{n_r(j^n)}}{\frac{(m-1)!\, n!}{(n+m-1)!} \prod_{r=1}^{m} \big( n_r(j^n)/n \big)^{n_r(j^n)}}
= \frac{(n+m-1)!}{(m-1)!\, n!}
= \binom{n+m-1}{m-1}
= \frac{n+m-1}{m-1} \cdot \frac{n+m-2}{m-2} \cdots \frac{n+1}{1}
= \Big( \frac{n}{m-1} + 1 \Big) \Big( \frac{n}{m-2} + 1 \Big) \cdots \Big( \frac{n}{1} + 1 \Big)
\leq (n+1)(n+1) \cdots (n+1)
= (n+1)^{m-1}. \qquad \Box
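The closing combinatorial steps of the proof can be verified directly. This Python sketch (our own check, not part of the text) confirms both the telescoping product identity for the binomial coefficient and the final (n + 1)^{m-1} bound:

```python
import math
from fractions import Fraction

def telescoping(n, m):
    """(n/(m-1)+1)(n/(m-2)+1)...(n/1+1) in exact arithmetic."""
    prod = Fraction(1)
    for k in range(1, m):
        prod *= Fraction(n, k) + 1
    return prod

for n in range(1, 40):
    for m in range(2, 8):
        c = math.comb(n + m - 1, m - 1)
        # the telescoping product equals (n+m-1 choose m-1)
        assert telescoping(n, m) == c
        # each factor n/k + 1 <= n + 1, giving the (n+1)^(m-1) bound
        assert c <= (n + 1) ** (m - 1)
```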

Problems
4.1 Verify (4.1.6).

4.2 Verify (4.2.1).

4.3 Consider why −Ct h is approximated by (H0,t − H0,t−h )S 0,t + (Ht − Ht−h )′ St− for a small period
[t − h, t) in the absence of any income.

4.4 Show (4.2.7) by using Ito’s lemma where dWt is defined by (4.2.4).

4.5 Verify (4.2.15) and (4.2.16). (Hint: Show E(dWt ) = 0 and E(dWt dWt′ ) = Im dt.)

4.6 Verify (4.2.28), (4.2.34) and (4.2.35).

4.7 Verify (4.3.5) (see Theorem 2 and Lemmas 3–5 of Cover and Ordentlich (1996)).

4.8 Show Theorem 4.3.13 (see Theorem 3 of Cover and Ordentlich (1996)).

4.9 Show Theorem 4.3.23 (see Theorem 2 of Blum and Kalai (1999)).

4.10 Obtain the monthly stock prices of Toyota, Honda and Komatsu from December
1983 to January 2013 (350 observations). The simple prices are given in the file
stock-price-data.txt.
(i) Let xi = (xi1 , xi2 , xi3 )′ be the vector of the price relatives for all stocks, where i = 1, . . . , 349, j = 1 (Toyota), 2 (Honda), 3 (Komatsu) and
xi j = (stock price of stock j in month i + 1)/(stock price of stock j in month i).
Then, calculate the wealths of the constant rebalanced portfolios S n (b, xn ) for b =
(1, 0, 0)′ , b = (0, 1, 0)′ , b = (0, 0, 1)′ , b = (1/3, 1/3, 1/3)′ and n = 1, 2, . . . , 349.
(ii) Calculate the wealths of the best constant rebalanced portfolio S n∗ (xn ) for n = 1, 2, . . . , 349.
(iii) Calculate the wealths of the (approximated) uniform weighted universal portfolio S n (b̂n , xn )
for n = 1, 2, . . . , 349 where an approximated form of the UP is defined by
\hat b_{k+1} = \frac{\sum_{b_1=0}^{N} \sum_{b_2=0}^{N-b_1} \Big( \frac{b_1}{N}, \frac{b_2}{N}, 1 - \frac{b_1}{N} - \frac{b_2}{N} \Big)' S_k\Big( \Big( \frac{b_1}{N}, \frac{b_2}{N}, 1 - \frac{b_1}{N} - \frac{b_2}{N} \Big)', x^k \Big)}{\sum_{b_1=0}^{N} \sum_{b_2=0}^{N-b_1} S_k\Big( \Big( \frac{b_1}{N}, \frac{b_2}{N}, 1 - \frac{b_1}{N} - \frac{b_2}{N} \Big)', x^k \Big)}

for k ≥ 1 and N = 100.
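The grid approximation of Problem 4.10 can be sketched in a few lines of Python. This is an illustrative implementation under our own naming (the price relatives below are hypothetical placeholders, not the Toyota/Honda/Komatsu data):

```python
import itertools

def crp_wealth(b, xs):
    """Wealth S_k(b, x^k) of a constant rebalanced portfolio b over
    price-relative vectors xs = [x_1, ..., x_k]."""
    w = 1.0
    for x in xs:
        w *= sum(bi * xi for bi, xi in zip(b, x))
    return w

def universal_weights(xs, N=100):
    """Approximate uniform-weighted universal portfolio b-hat_{k+1} for m = 3,
    averaging over the grid b = (b1/N, b2/N, 1 - b1/N - b2/N)."""
    num = [0.0, 0.0, 0.0]
    den = 0.0
    for b1, b2 in itertools.product(range(N + 1), repeat=2):
        if b1 + b2 > N:
            continue
        b = (b1 / N, b2 / N, 1 - b1 / N - b2 / N)
        s = crp_wealth(b, xs)
        den += s
        for i in range(3):
            num[i] += b[i] * s
    return [v / den for v in num]

# hypothetical price relatives for two months
xs = [(1.02, 0.99, 1.01), (0.98, 1.03, 1.00)]
w = universal_weights(xs, N=20)
assert abs(sum(w) - 1.0) < 1e-10  # weights stay on the simplex
```

Before any data arrive, the weights are the centroid (1/3, 1/3, 1/3); as wealth accumulates, they tilt toward better-performing constant rebalanced portfolios.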


Chapter 5

Portfolio Estimation Based on Rank Statistics

5.1 Introduction to Rank-Based Statistics


5.1.1 History of Ranks
5.1.1.1 Wilcoxon’s signed rank and rank sum tests
Even at the introductory level, students who learn statistics should hear about Wilcoxon’s signed
rank and rank sum tests. Wilcoxon (1945) described these tests for the first time to deal with the
following two kinds of problems.
(a) We may have a number of paired replications for each of the two treatments, which are unpaired.
(b) We may have a number of paired comparisons leading to a series of differences, some of which
may be positive and some negative.
The unpaired experiments (a) are formulated by the following two-sample location problem.
Let Xi (i = 1, . . . , N) be i.i.d. with unspecified (nonvanishing) density f . Under the null hypothesis
H0 , we observe Zi = Xi (i = 1, . . . , N), whereas, under the alternative H1 , Zi = Xi (i = 1, . . . , m)
and Zi = Xi − θ (i = m + 1, . . . , N) for some θ ≠ 0. As an example of the unpaired experiments,
Wilcoxon (1945) introduced the following Table 5.1, which gives the results of fly spray tests on
two preparations in terms of percentage mortality, where m = 8 and N = 16.

Table 5.1: Fly spray tests on two preparations

Sample A Sample B
Percent kill Rank Percent kill Rank
68 12.5 60 4
68 12.5 67 10
59 3 61 5
72 15 62 6
64 8 67 10
67 10 63 7
70 14 56 1
74 16 58 2
Total 542 91 494 45

On the other hand, the paired example is described by the following symmetry test. Suppose that
the independent observations (X1 , Y1 ) , . . . , (XN , YN ), are given. If the observations are exchangeable,
that is, the pairs (Xi , Yi ) and (Yi , Xi ) are equal in distribution, then Xi −Yi is symmetrically distributed
about zero. As the example for the paired experiment, Wilcoxon (1945) also gave the data obtained
in a seed treatment experiment on wheat, which is shown in Table 5.2. The data are taken from a
randomized block experiment with eight replications of treatments A and B.

Table 5.2: Seed treatment experiment on wheat

Block A B A−B Rank


1 209 151 58 8
2 200 168 32 7
3 177 147 30 6
4 169 164 5 1
5 159 166 −7 −3
6 169 163 6 2
7 187 176 11 5
8 198 188 10 4

Now, we introduce the rank and signed rank statistics to test the above two types of experiments.
Let X1 , . . . , XN be a set of real-valued observations which are coordinates of a random vector X ∈
RN . Consider the space Y ⊆ RN defined by
n o
Y = y | y ∈ R N , y1 ≤ y2 ≤ · · · ≤ y N .

The order statistic of X is the random vector X( ) = (X(1) , . . . , X(N) )′ ∈ Y consisting of the ordered
values of the coordinates of X. More precisely, the ith order statistic X(i) is the ith value of the observations
positioned in increasing order. The rank Ri of Xi among X1 , . . . , XN is its position number in the
order statistic. That is, if X1 , . . . , XN are all different (which holds with probability one if X has a density),
then Ri is defined by the equation

Xi = X(Ri ) .

Otherwise, the rank Ri is defined as the average of all indices j such that Xi = X( j) , which is
sometimes called midrank.
We write the vector of ranks as RN = (R1 , . . . , RN )′ . A rank statistic S (RN ) is any function
of the ranks RN . A linear rank statistic is a rank statistic of the special form \sum_{i=1}^{N} a(i, R_i), where
(a(i, j))_{i, j=1,\ldots,N} is a given N × N matrix. Moreover, a simple linear rank statistic belongs to the
subclass of linear rank statistics, which take the form

\sum_{i=1}^{N} c_i a_{R_i},

where c = (c1 , . . . , cN )′ and a = (a1 , . . . , aN )′ are given vectors in RN and called the coefficients and
scores, respectively.
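As an illustration, the following Python sketch (helper names are ours) computes midranks and evaluates a simple linear rank statistic. With coefficients c = (0, . . . , 0, 1, . . . , 1)′ and identity scores it reproduces the sample B rank total of 45 from Table 5.1:

```python
def midranks(xs):
    """Rank of each x_i; tied values receive the average of their positions."""
    out = []
    for xi in xs:
        less = sum(1 for xj in xs if xj < xi)
        tied = sum(1 for xj in xs if xj == xi)
        out.append(less + (tied + 1) / 2)
    return out

def simple_linear_rank_stat(c, score, xs):
    """sum_i c_i a_{R_i}, where score(r) maps a (mid)rank to its score."""
    return sum(ci * score(r) for ci, r in zip(c, midranks(xs)))

# Percent kill from Table 5.1: sample A first, then sample B
xs = [68, 68, 59, 72, 64, 67, 70, 74, 60, 67, 61, 62, 67, 63, 56, 58]
c = [0] * 8 + [1] * 8                 # coefficients picking out sample B
W = simple_linear_rank_stat(c, lambda r: r, xs)
assert W == 45                        # the sample B rank total in Table 5.1
```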
The most popular simple linear rank statistic is the Wilcoxon statistic

W = \sum_{i=m+1}^{N} R_i

which has coefficients c = (0, . . . , 0, 1, . . . , 1)′ and scores a = (1, . . . , N)′ . Unlike the Student test,
the Wilcoxon test does not require any assumption on the density f . Wilcoxon (1945) mainly con-
sidered his test a robust, “quick and easy” solution for the location shift problem to be used when
everything else fails. For determining the significance of differences in unpaired experiments (a), he
evaluated the probability of
  N/2  


 X XN  


  Ri ,  ≤ x N(N + 2) N(N + 1)

 min  R i  
 , x= ,..., , (5.1.1)

 
 8 4
i=1 i=N/2+1
Figure 5.1: The probability (5.1.1) of the smaller rank total

which is shown in Figure 5.1 with N = 16. The smaller rank total of the fly spray test in Table 5.1
is 45 and the probability that this total is equal to or less than 45 is only 1.476%. To compute the
probability of the smaller rank total in R, type
nn<-8 # value of m
mm<-nn*2 # value of N
ss<-nn*(nn+1)/2 # minimum of smaller rank total
tt<-mm*(mm+1)/2 # total of all rank
ll<-tt-ss*2
uu<-dwilcox(0:ll,nn,nn)
lll<-length(uu)
mmm<-(lll+1)/2
probability<-cumsum(c(uu[1:(mmm-1)]+uu[lll:(mmm+1)],uu[mmm]))
jjj<-(1:mmm)+ss-1
plot(probability,xaxt='n',xlab='Smaller rank total')
cc<-1:mmm
axis(1,jjj,at=cc)
PSRT<-function(x)
{
probability[x-ss+1]
}
PSRT(45)
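The 1.476% figure can also be cross-checked by brute-force enumeration. This Python sketch (an independent check; the book's own computation uses R's dwilcox above) enumerates all C(16, 8) = 12870 assignments of ranks to sample A:

```python
from itertools import combinations

# Exact distribution of the smaller rank total for m = n = 8 (Table 5.1 setting)
N, m = 16, 8
total = N * (N + 1) // 2          # 136, the sum of all ranks
hits = count = 0
for ranks_a in combinations(range(1, N + 1), m):
    s = sum(ranks_a)
    count += 1
    if min(s, total - s) <= 45:   # smaller rank total at most 45
        hits += 1
p = hits / count
assert abs(p - 0.01476) < 1e-4    # the 1.476% quoted in the text
```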
The sign of a number x is defined by the function

sign(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{if } x = 0, \\ -1, & \text{if } x < 0. \end{cases}

The absolute rank R+i of an observation Xi in a sample X1 , . . . , XN is defined as the rank of
|Xi | among the absolute values |X1 | , . . . , |XN | of the sample. Denote the vectors of absolute values,
absolute ranks and signs by |X| = (|X1 | , . . . , |XN |)′ , R+N = (R+1 , . . . , R+N )′ and sign(X) =
(sign(X1 ) , . . . , sign(XN ))′ , respectively. The ordinary ranks RN can always be derived from the
combined set (R+N , sign(X)) of absolute ranks and signs; therefore the vector (R+N , sign(X)) carries
more information about the observations X than the ordinary ranks RN . Obviously, if there is no tied pair
of observations (which holds with probability one if X has a density), then

X_i = |X|_{(R_i^+)}\, \text{sign}(X_i), \qquad i = 1, \ldots, N.


 
A signed rank statistic is any function S (R+N , sign(X)) of absolute ranks and signs. A linear
signed rank statistic is a signed rank statistic of the form \sum_{i=1}^{N} a(i, R_i^+, \text{sign}(X_i)). Furthermore, a
simple linear signed rank statistic has the form

\sum_{i=1}^{N} c_i a_{R_i^+} \text{sign}(X_i).

The main attractive feature of signed rank statistics is their simplicity and distribution freeness over
the set of all symmetric distributions.
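With c = (1, . . . , 1)′ and identity scores, this statistic applied to the A − B differences of Table 5.2 can be computed directly; the Python sketch below (helper names ours) reproduces the absolute ranks and the smaller rank total of 3 quoted later in the text:

```python
def sign(x):
    return 1 if x > 0 else (-1 if x < 0 else 0)

def absolute_ranks(xs):
    """Rank of |x_i| among |x_1|, ..., |x_N| (no ties assumed)."""
    order = sorted(abs(x) for x in xs)
    return [order.index(abs(x)) + 1 for x in xs]

# A - B differences from the seed treatment experiment of Table 5.2
d = [58, 32, 30, 5, -7, 6, 11, 10]
r_plus = absolute_ranks(d)
assert r_plus == [8, 7, 6, 1, 3, 2, 5, 4]          # |.|-ranks as in Table 5.2
w = sum(r * sign(x) for r, x in zip(r_plus, d))    # signed rank statistic
assert w == 30
# smaller rank total of + or -: here the "-" total is 3
assert sum(r for r, x in zip(r_plus, d) if x < 0) == 3
```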
Again, the most popular simple linear signed rank statistic is the Wilcoxon signed rank statistic
W = \sum_{i=1}^{N} R_i^+ \text{sign}(X_i)

which has coefficients c = (1, . . . , 1)′ and scores a = (1, . . . , N)′ . For determining the significance

of differences in paired experiments (b), Wilcoxon (1945) evaluated the probability of


\min\Big( \sum_{i=1}^{N} R_i^+ \max\{\text{sign}(X_i), 0\}, \ \sum_{i=1}^{N} R_i^+ \max\{-\text{sign}(X_i), 0\} \Big) \leq x, \qquad x = 0, \ldots, \frac{N(N+1)}{4}, \qquad (5.1.2)

which is shown in Figure 5.2 with N = 8. The smaller rank total of + or − in the seed treatment
experiment (Table 5.2) is 3 and the probability that this total is equal to or less than 3 is 3.906%. To
compute the probability of the smaller rank total of + or − in R, type
nn<-8 # value of N
tt<-nn*(nn+1)/2 # total of all rank
uu<-dsignrank(0:tt,nn)
lll<-length(uu)
mmm<-(lll+1)/2
probability<-cumsum(c(uu[1:(mmm-1)]+uu[lll:(mmm+1)],uu[mmm]))
plot(probability,xaxt='n',xlab='Smaller rank total of + or -')
jjj<-(0:(mmm-1))
cc<-1:mmm
axis(1,jjj,at=cc)
PSRN<-function(x)
{
probability[x+1]
}
PSRN(3)
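As with the rank sum test, the 3.906% figure can be cross-checked by exact enumeration (a Python sketch independent of the R code above), walking through all 2^8 sign patterns of the absolute ranks:

```python
from itertools import product

# Exact distribution of the smaller signed-rank total for N = 8 (Table 5.2)
N = 8
hits = count = 0
for signs in product([1, -1], repeat=N):
    w_plus = sum(r for r, s in zip(range(1, N + 1), signs) if s > 0)
    w_minus = N * (N + 1) // 2 - w_plus
    count += 1
    if min(w_plus, w_minus) <= 3:
        hits += 1
p = hits / count
assert abs(p - 0.03906) < 1e-4    # the 3.906% quoted in the text
```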

Figure 5.2: The probability (5.1.2) of the smaller rank total of + or −

5.1.1.2 Hodges–Lehmann and Chernoff–Savage


Eleven years after Wilcoxon’s seminal paper (Wilcoxon (1945)), a totally unexpected result was
published by Hodges and Lehmann (1956). They derived the minimum Pitman efficiency of the
Wilcoxon tests. This result is based on lecture notes of Pitman (1948), and Andrews (1954), who
considered the asymptotic relative efficiency of the H test proposed by Kruskal and Wallis (1952),
which will be defined in (5.1.3) below. The H test is a direct generalization of the two-sided Wilcoxon
test (θ ≠ θ0 = 0) and is similar to a classical F test, with ranks replacing the original observations.
The K-sample problem is expressed formally as follows. Let Xi j for i = 1, . . . , K and j =
1, . . . , ni be a set of independent random variables. The probability distribution functions
 of
 Xi j ,
j = 1, . . . , ni are common and denoted by Gi , so that Gi (x) is the probabilities Pi Xi j ≤ x , j =
1, . . . , ni . The set of admissible hypotheses designates that each Gi belongs to some class G of
distribution functions. To avoid the problems of ties, we assume throughout that the class is the
class of continuous distribution functions. The hypothesis to be tested, say H0 , specifies that for an
element G of G, we have G1 (x) = G2 (x) = · · · = G K (x) = G(x) for all real x. Alternative to H0 is
the hypotheses that each Gi belongs to G but H0 does not hold. If we consider the translation-type
alternatives, the class of admissible hypotheses will be Gi (x) = G(x + θi ), i = 1, . . . , K for some
arbitrary choice of G ∈ G and real numbers θ1 , . . . , θK . All θi = 0 yield hypothesis H0 , while if
θi1 , θi2 for some pair (i1 , i2 ) then an alternative to H0 is given.
The H test is based on the statistic
XK !2
12 N+1
H= n i Ri − , (5.1.3)
N(N + 1) i=1 2
where R̄i is the average rank of the members of the ith sample obtained after ranking all of the
N = n1 + · · · + nK observations. The sequence of local alternative hypotheses An specifies, for each
i = 1, . . . , K, that Gi (x) = G(x + θi /√n), with G ∈ G, and for some pair (i1 , i2 ) that θi1 ≠ θi2 . The
limiting probability distributions will then be found as n → ∞. Therefore, for each index n, we
assume that ni = si n with si a positive integer. Since the limiting distribution of H under the local
alternative An depends on the quantity

\lim_{n\to\infty} \sqrt{n} \int_{\mathbb{R}} \big\{ G(x + t/\sqrt{n}) - G(x) \big\}\, dG(x),

we assume the following.


Assumption 5.1.1. The distribution function G has the density g(x) = G′ (x) except on a set of
G-measure zero, and for any real number t, the difference quotient is bounded, |G(x + t) − G(x)|/|t| ≤ l(x),
by a function l for which \int_{\mathbb{R}} l(x)\, dG(x) < \infty.
A direct application of the Lebesgue bounded convergence theorem and the definition of the
derivative proves that, under Assumption 5.1.1, we have

\lim_{n\to\infty} \sqrt{n} \int_{\mathbb{R}} \big\{ G(x + t/\sqrt{n}) - G(x) \big\}\, dG(x) = t \int_{\mathbb{R}} G'(x)\, dG(x) = t \int_{\mathbb{R}} g^2(x)\, dx. \qquad (5.1.4)

Now, we have the following lemma.


Lemma 5.1.2. Suppose that Assumption 5.1.1 is satisfied. Then, under the local hypothesis An , the
limiting distribution of the test statistic H is \chi^2_{K-1}(\lambda^H), where

\lambda^H = 12 \Big( \int_{\mathbb{R}} g^2(x)\, dx \Big)^2 \sum_{i=1}^{K} s_i \big( \theta_i - \bar\theta \big)^2, \qquad \bar\theta = \frac{\sum_{i=1}^{K} s_i \theta_i}{\sum_{i=1}^{K} s_i},

and \chi^2_r(\lambda) denotes the possibly noncentral chi square distribution with r degrees of freedom and
noncentrality parameter \lambda.
The concept of asymptotic relative efficiency of one consistent test with respect to another is due
to Pitman (1948). We consider the comparison of the H test with respect to the ordinary analysis of
variance F test, where the classical F statistic in this instance is defined by
F = \frac{ \frac{1}{K-1} \sum_{i=1}^{K} n_i \big( \bar X_{i\cdot} - \bar X \big)^2 }{ \frac{1}{\sum_{i=1}^{K} (n_i - 1)} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \big( X_{ij} - \bar X_{i\cdot} \big)^2 }.

If Assumption 5.1.1 is satisfied and \sigma_G^2 = \int_{\mathbb{R}} x^2\, dG(x) - \big( \int_{\mathbb{R}} x\, dG(x) \big)^2 exists, then it is well known
that the limiting distribution of F under the local alternative An is \chi^2_{K-1}(\lambda^F), where

\lambda^F = \sum_{i=1}^{K} s_i \Big( \frac{\theta_i - \bar\theta}{\sigma_G} \Big)^2.

Suppose that two consistent tests T and T ′ require N and N ′ observations, respectively, to attain the
power β at level of significance α for testing the hypothesis H0 against hypothesis An . The difference
in the sample sizes N and N ′ results from the fact that we demand that the tests yield a common
power for a given alternative An . The asymptotic relative efficiency of T ′ with respect to T is defined
to be
\varepsilon_{T',T}(\alpha, \beta, H_0, \{A_n\}) = \lim_{N\to\infty} \frac{N}{N'} = \lim_{n\to\infty} \frac{\sum_{i=1}^{K} n_i}{\sum_{i=1}^{K} n_i'} = \lim_{n\to\infty} \frac{\sum_{i=1}^{K} s_i n}{\sum_{i=1}^{K} s_i n'} = \lim_{n\to\infty} \frac{n}{n'}. \qquad (5.1.5)
Note that under the null hypothesis H0 , we have θ1 = θ2 = · · · = θK , and consequently λ^H = λ^F = 0.
That is, under H0 , both of the limiting distributions of H and F are the ordinary chi square distribution
\chi^2_{K-1}. The alternative hypothesis An states that Gi (x) = G(x + θi /√n) and so is characterized by
the numbers θi /√n. Let n and n′ index the sample sizes for the H test and F test, respectively. To have
the same alternatives for each test we must have θi /√n = θi′ /√n′. If the level of significance is fixed at α
and the limiting power fixed at β, then we must have λ^H = λ^F to achieve the same limiting power for
the two tests. The substitution θi′ = θi √(n′/n) (θ̄′ = θ̄ √(n′/n)) in λ^H along with the requirement λ^H = λ^F
gives

12 \frac{n'}{n} \Big( \int_{\mathbb{R}} g^2(x)\, dx \Big)^2 \sum_{i=1}^{K} s_i \big( \theta_i - \bar\theta \big)^2 = \frac{1}{\sigma_G^2} \sum_{i=1}^{K} s_i \big( \theta_i - \bar\theta \big)^2,

and consequently the following lemma.


Lemma 5.1.3. If Assumption 5.1.1 is satisfied and \sigma_G^2 exists, then the asymptotic relative efficiency
of the H test with respect to the F test for testing the hypothesis H0 against An is

\varepsilon_{H,F}(\alpha, \beta, H_0, \{A_n\}) = 12 \sigma_G^2 \Big( \int_{\mathbb{R}} g^2(x)\, dx \Big)^2.

Note that εH,F does not depend on the number K of groups, and the case with K = 2 corresponds
to the comparison of the Wilcoxon test based on the rank sum of X2 j ’s among the set of N-ordered
observations with respect to Student’s t-test. Therefore, the relative asymptotic efficiency of the
Wilcoxon test relative to the t-test is also given by
\varepsilon_{W,t} = 12 \sigma_G^2 \Big( \int_{\mathbb{R}} g^2(x)\, dx \Big)^2,

which coincides with Pitman’s computation. Some particular values for εW,t are shown in Table 5.3.
Furthermore, the lower limit of the relative asymptotic efficiency εW,t was given by Hodges and
Lehmann (1956). Suppose that g(x) has mean 0 and variance 1, and let h(x) = σ⁻¹ g((x − µ)/σ); then h(x)
has mean µ and variance σ². Since \sigma^2 \big( \int_{\mathbb{R}} h^2(x)\, dx \big)^2 = \big( \int_{\mathbb{R}} g^2(x)\, dx \big)^2, the value of εW,t is invariant
under a change of location or scale, and we may take µ = 0 and σ² = 1. Then, the problem of
minimizing εW,t reduces to that of minimizing

\Big( \int_{\mathbb{R}} g^2(x)\, dx \Big)^2

subject to the conditions

\int_{\mathbb{R}} x g(x)\, dx = 0, \qquad \int_{\mathbb{R}} g(x)\, dx = \int_{\mathbb{R}} x^2 g(x)\, dx = 1 \qquad \text{and} \qquad g(x) \geq 0 \ \text{for all } x.

According to the method of undetermined multipliers, it is sufficient to minimize

\int_{\mathbb{R}} \big\{ g^2(x) - ( a + b x^2 ) g(x) \big\}\, dx.

For nonnegative g(x), this is achieved by setting

g(x) = \begin{cases} -\dfrac{a + b x^2}{2}, & a + b x^2 \leq 0, \\ 0, & \text{otherwise.} \end{cases}
From the conditions, we have the constants a = -\frac{3}{2\sqrt{5}}, b = \frac{3}{10\sqrt{5}}, and so

g(x) = -\frac{3}{20\sqrt{5}} \big( x^2 - 5 \big), \qquad \varepsilon_{W,t} = \frac{108}{125}.

Table 5.3: Some particular values for εW,t

Distribution      g(x)                                              σG²          εW,t
Normal            (1/√(2πσ²)) exp{−(x−µ)²/(2σ²)}                    σ²           3/π ≈ 0.955
Uniform           1/(b−a) if x ∈ [a, b]                             (b−a)²/12    1
Gamma             x^{k−1} e^{−x/θ} / {Γ(k)θ^k} if x ≥ 0             kθ²          12k{(2k−2)!}² / (2^{4k−2}{(k−1)!}⁴)
Hodges–Lehmann    −(3/(20√5))(x²−5) if x² ≤ 5                       1            108/125 = 0.864

On the other hand, the asymptotic relative efficiency of the following c1 -test relative to the t-
test has been considered in Chernoff and Savage (1958). Let X1 , . . . , Xm and Y1 , . . . , Yn be ordered
observations from the absolutely continuous cumulative distribution functions F(x) and G(x). And
let N = m + n and λN = m/N, and assume that for all N the inequalities 0 < λ0 ≤ λN ≤ 1 − λ0 < 1
hold for some fixed λ0 ≤ 1/2. The combined population cumulative distribution function is H(x) =
λN F(x) + (1 − λN )G(x).
Let Fm (x) and Gn (x) be the sample cumulative distribution functions of the X’s and Y’s, respec-
tively, namely, Fm (x) = (the number of Xi ≤ x) /m and Gn (x) = (the number of Yi ≤ x) /n. Define
the combined sample cumulative distribution function HN (x) = λN Fm (x) + (1 − λN )Gn (x). If the
ith smallest in the combined sample is an X, let zNi = 1, and otherwise, let zNi = 0. Then, many
nonparametric test statistics are of the form

m T_N = \sum_{i=1}^{N} a_{Ni} z_{Ni}, \qquad (5.1.6)
 
where the a_{Ni} are given numbers. As a special case, we can consider a_{Ni} = a(i/N), and the Wilcoxon
test is given with a_{Ni} = i/N. However, the representation
T_N = \int_{\mathbb{R}} J_N\{H_N(x)\}\, dF_m(x) \qquad (5.1.7)

is employed in Chernoff and Savage (1958). Equations (5.1.6) and (5.1.7) are equivalent when
a_{Ni} = J_N(i/N). While J_N need be defined only at 1/N, 2/N, . . . , N/N, it is convenient to extend its domain
of definition to (0, 1] by some convention, such as letting J_N be constant on (i/N, (i+1)/N]. And let I_N be
the interval in which 0 < H_N(x) < 1.
In the following, we will examine translation local alternatives only. That is, we assume that
there is a distribution function Ψ (x) which does not depend on N and has a density ψ(x), and F(x) =
Ψ (x − θN ) and G(x) = Ψ (x − ϕN ). And we test the hypothesis that ∆N = θN −ϕN = 0 versus the local
alternative of the form ∆N = cN −1/2 . We also assume that 0 < limN→∞ λN = λ < 1. Furthermore,
we assume the following conditions.
Assumption 5.1.4. (1) J(H) = \lim_{N\to\infty} J_N(H) exists for 0 < H < 1 and is not constant.

(2) \int_{I_N} \{ J_N(H_N) - J(H_N) \}\, dF_m(x) = o_P(N^{-1/2}).

(3) J_N(1) = o(\sqrt{N}).

(4) |J^{(i)}(H)| = |d^i J / dH^i| \leq K \{ H(1-H) \}^{-i-1/2+\delta} for i = 0, 1, 2, and for some \delta > 0.
Now, we have the following asymptotic normality of the c1 -test which is due to Chernoff and
Savage (1958).
Lemma 5.1.5. If the conditions of Assumption 5.1.4 are satisfied, then the equation

\lim_{N\to\infty} P\Big( \frac{T_N - \mu_N(\Delta_N)}{\sigma_N(\Delta_N)} \leq x \Big) = \Phi(x)

holds uniformly with respect to λN , θN , and ϕN for ∆N = θN − ϕN in some neighbourhood of 0, where
Φ(x) is the distribution function of the standard normal and

\mu_N(\Delta_N) = \int_{\mathbb{R}} J\{H(x)\}\, dF(x), \qquad (5.1.8)

\lim_{N\to\infty} \frac{\lambda_N}{1-\lambda_N} N \sigma_N^2(\Delta_N) = 2 \iint_{0<x<y<1} x(1-y)\, J^{(1)}(x) J^{(1)}(y)\, dx\, dy = \int_0^1 J^2(x)\, dx - \Big( \int_0^1 J(x)\, dx \Big)^2. \qquad (5.1.9)

Note that if J is the inverse of some cumulative distribution function L, the right-hand side of
(5.1.9) is

\int_0^1 \big\{ L^{-1}(x) \big\}^2 dx - \Big( \int_0^1 L^{-1}(x)\, dx \Big)^2 = \int_{\mathbb{R}} x^2\, dL(x) - \Big( \int_{\mathbb{R}} x\, dL(x) \Big)^2 = \sigma_L^2 \qquad (5.1.10)
(Problem 5.2).
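Identity (5.1.10) can be spot-checked for L = Φ, the standard normal, using only the Python standard library (an illustrative sketch; the midpoint rule handles the integrable endpoint singularities of L⁻¹):

```python
from statistics import NormalDist

# For L = Phi, sigma_L^2 = 1: check the first two moments of L^{-1} on (0, 1)
inv = NormalDist().inv_cdf
n = 100000
m1 = m2 = 0.0
for i in range(n):
    u = (i + 0.5) / n          # midpoint rule on (0, 1)
    q = inv(u)
    m1 += q
    m2 += q * q
m1 /= n
m2 /= n
assert abs(m1) < 1e-6          # integral of L^{-1} is the mean, 0
assert abs(m2 - 1.0) < 1e-2    # integral of (L^{-1})^2 is the variance, 1
```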
Although the definition (5.1.5) of asymptotic relative efficiency is easily interpreted, it is often
useful to turn to equivalent expressions in order to evaluate Pitman efficiency for a given pair of test
procedures in a particular setting. Assume that T N(i) , i = 1, 2 are test statistics for testing ∆N = 0
against a local alternative ∆N = cN −1/2 which satisfy the following assumption.
Assumption 5.1.6. There are functions \mu_N^{(i)}(\Delta_N) and \sigma_N^{(i)}(\Delta_N) such that, for ∆N in some neighbourhood of 0,

(1) \lim_{N\to\infty} P_{\Delta_N}\Big( \frac{T_N^{(i)} - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(\Delta_N)} \leq x \Big) = \Phi(x),

(2) \lim_{N\to\infty} \frac{\sigma_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(0)} = 1,

(3) E_{T^{(i)}} = \lim_{N\to\infty} \Big( \frac{\mu_N^{(i)}(\Delta_N) - \mu_N^{(i)}(0)}{\Delta_N N^{1/2} \sigma_N^{(i)}(0)} \Big)^2 exists and is independent of c.

The quantity E_{T^{(i)}} is called the efficacy of the procedure based on the sequence of statistics T^{(i)}.
Let 0 < α < 1 be fixed and let d_N^{(i)}, i = 1, 2, be sequences of critical values for the respective tests
such that

\alpha = \lim_{N\to\infty} P_0\big( T_N^{(i)} \geq d_N^{(i)} \big) = \lim_{N\to\infty} P_0\Big( \frac{T_N^{(i)} - \mu_N^{(i)}(0)}{\sigma_N^{(i)}(0)} \geq \frac{d_N^{(i)} - \mu_N^{(i)}(0)}{\sigma_N^{(i)}(0)} \Big).

This requires, along with Assumption 5.1.6 (1), that

\lim_{N\to\infty} \frac{d_N^{(i)} - \mu_N^{(i)}(0)}{\sigma_N^{(i)}(0)} = z_\alpha,
where Φ(zα ) = 1 − α. On the other hand, the respective power functions for T_N^{(i)}, i = 1, 2, are given
by

\beta_{T_N^{(i)}}(\Delta_N) = \lim_{N\to\infty} P_{\Delta_N}\big( T_N^{(i)} \geq d_N^{(i)} \big) = \lim_{N\to\infty} P_{\Delta_N}\Big( \frac{T_N^{(i)} - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(\Delta_N)} \geq \frac{d_N^{(i)} - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(\Delta_N)} \Big).

In order for the tests to have the same limiting power function, i.e.,

\lim_{N\to\infty} \beta_{T_N^{(1)}}(\Delta_N) = \lim_{N\to\infty} \beta_{T_N^{(2)}}(\Delta_N),

we must have

\lim_{N\to\infty} \frac{d_N^{(1)} - \mu_N^{(1)}(\Delta_N)}{\sigma_N^{(1)}(\Delta_N)} = \lim_{N\to\infty} \frac{d_N^{(2)} - \mu_N^{(2)}(\Delta_N)}{\sigma_N^{(2)}(\Delta_N)}.

Now, note that

\lim_{N\to\infty} \frac{d_N^{(i)} - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(\Delta_N)} = \lim_{N\to\infty} \frac{d_N^{(i)} - \mu_N^{(i)}(0) + \mu_N^{(i)}(0) - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(0)} \cdot \frac{\sigma_N^{(i)}(0)}{\sigma_N^{(i)}(\Delta_N)} = z_\alpha + \lim_{N\to\infty} \frac{\mu_N^{(i)}(0) - \mu_N^{(i)}(\Delta_N)}{\sigma_N^{(i)}(0)},

hence

1 = \lim_{N\to\infty} \frac{\mu_N^{(1)}(\Delta_N) - \mu_N^{(1)}(0)}{\sqrt{N^{(1)}}\, \Delta_N\, \sigma_N^{(1)}(0)} \cdot \frac{\sqrt{N^{(2)}}\, \Delta_N\, \sigma_N^{(2)}(0)}{\mu_N^{(2)}(\Delta_N) - \mu_N^{(2)}(0)} \cdot \sqrt{\frac{N^{(1)}}{N^{(2)}}} = \lim_{N\to\infty} \sqrt{\frac{E_{T^{(1)}}}{E_{T^{(2)}}}} \cdot \sqrt{\frac{N^{(1)}}{N^{(2)}}}.

Now, we have the following lemma, which is due to Noether (1955).


Lemma 5.1.7. The asymptotic relative efficiency (5.1.5) of T^{(1)} with respect to T^{(2)} satisfies the
following relationship:

\varepsilon_{T^{(1)},T^{(2)}}(\alpha, \beta, 0, \{\Delta_N\}) = \lim_{N\to\infty} \frac{N^{(2)}}{N^{(1)}} = \lim_{N\to\infty} \frac{E_{T^{(1)}}}{E_{T^{(2)}}}.
From this lemma and Equation (5.1.8), we see that, for the c1-test T_N,

\mu_{T_N}(\Delta_N) - \mu_{T_N}(0) = \int_{\mathbb{R}} \big[ J\{H(x - \theta_N)\} - J\{\Psi(x)\} \big]\, d\Psi(x) = (1 - \lambda_N) \Delta_N \int_{\mathbb{R}} J^{(1)}\{\Psi(x)\}\, \psi^2(x)\, dx + o(\Delta_N),

since

H(x + \theta_N) - \Psi(x) = (1 - \lambda_N)\, \Psi'(x) \Delta_N + o(\Delta_N).

Therefore, if J is the inverse of some cumulative distribution function L which has a density l(x),
then the c1-test satisfies the conditions of Assumption 5.1.6, with

E_{T_N} = \frac{\lambda(1-\lambda)}{\sigma_L^2} \Big[ \int_{\mathbb{R}} J^{(1)}\{\Psi(x)\}\, \psi^2(x)\, dx \Big]^2 = \frac{\lambda(1-\lambda)}{\sigma_L^2} \Big\{ \int_{\mathbb{R}} \psi\big[ \Psi^{-1}\{L(x)\} \big]\, dx \Big\}^2.

On the other hand, if Ψ has finite second moment \sigma_\Psi^2, for the t-test

t_N = \frac{\bar X - \bar Y}{S \sqrt{\frac{m+n}{mn}}},

we define

\mu_{t_N}(\Delta_N) = \frac{\sqrt{N}\, \Delta_N \sqrt{\lambda(1-\lambda)}}{\sigma_\Psi}, \qquad \sigma_{t_N}(\Delta_N) \equiv 1, \qquad V_N = \frac{\bar X - \bar Y - \Delta_N}{S \sqrt{\frac{m+n}{mn}}};

then we have

\frac{t_N - \mu_{t_N}(\Delta_N)}{\sigma_{t_N}(\Delta_N)} - V_N = c \Big( \frac{\sqrt{\lambda_N(1-\lambda_N)}}{S} - \frac{\sqrt{\lambda(1-\lambda)}}{\sigma_\Psi} \Big).

Therefore, like V_N, (t_N - \mu_{t_N}(\Delta_N))/\sigma_{t_N}(\Delta_N) has a limiting standard normal distribution; hence Assumption 5.1.6
is satisfied with

E_{t_N} = \frac{\lambda(1-\lambda)}{\sigma_\Psi^2}.

Consequently, the asymptotic relative efficiency of the c1-test relative to the t-test is given by

\varepsilon_{c_1,t} = \frac{\sigma_\Psi^2}{\sigma_L^2} \Big\{ \int_{\mathbb{R}} \psi\big[ \Psi^{-1}\{L(x)\} \big]\, dx \Big\}^2.
 
Suppose that both \tilde\Psi and \tilde L have mean 0 and variance 1, and let \psi(x) = \sigma_\Psi^{-1} \tilde\psi\big( \frac{x - \mu_\Psi}{\sigma_\Psi} \big) and l(x) = \sigma_L^{-1} \tilde l\big( \frac{x - \mu_L}{\sigma_L} \big). Then we see that

\Big\{ \int_{\mathbb{R}} \tilde\psi\big[ \tilde\Psi^{-1}\{\tilde L(x)\} \big]\, dx \Big\}^2 = \frac{\sigma_\Psi^2}{\sigma_L^2} \Big\{ \int_{\mathbb{R}} \psi\big[ \Psi^{-1}\{L(x)\} \big]\, dx \Big\}^2,

so the value of \varepsilon_{c_1,t} is invariant under a change of location or scale with respect to both Ψ and L.
If we take L(x) = Φ(x), the distribution function of the standard normal, the problem becomes
the minimization of

\int_{\mathbb{R}} J^{(1)}\{\Psi(x)\}\, \psi^2(x)\, dx = \int_{\mathbb{R}} \psi\big[ \Psi^{-1}\{\Phi(x)\} \big]\, dx

subject to the restrictions

0 = \int_{\mathbb{R}} x \psi(x)\, dx, \qquad (5.1.11)

1 = \int_{\mathbb{R}} x^2 \psi(x)\, dx, \qquad (5.1.12)

where ϕ (x) is the density of the standard normal distribution. Chernoff and Savage (1958) proved
the following lemma using variational methods.
Lemma 5.1.8. If Ψ (x) is a cumulative distribution function which has a density and finite second-
order moment, then εc1 ,t ≥ 1, and εc1 ,t = 1 only if Ψ (x) is normal.
A simpler proof of this result has been provided by Gastwirth and Wolff (1968). Since
J^{(1)}\{\Psi(x)\} = \big[ \varphi\big( \Phi^{-1}(\Psi(x)) \big) \big]^{-1}, if we put Y = \varphi\big( \Phi^{-1}(\Psi(X)) \big) / \psi(X), then applying Jensen's inequality
yields

\int_{\mathbb{R}} J^{(1)}\{\Psi(x)\}\, \psi^2(x)\, dx = E_\psi\big( Y^{-1} \big) \geq \big\{ E_\psi(Y) \big\}^{-1}

with equality iff Y ≡ c1 , a.e. Furthermore, after integrating by parts, applying the Cauchy–Schwarz
inequality leads to

E_\psi(Y) = \int_{\mathbb{R}} \varphi\big( \Phi^{-1}(\Psi(x)) \big)\, dx = \int_{\mathbb{R}} x\, \Phi^{-1}(\Psi(x))\, \psi(x)\, dx \leq \Big[ \int_{\mathbb{R}} x^2 \psi(x)\, dx \int_{\mathbb{R}} \big\{ \Phi^{-1}(\Psi(x)) \big\}^2 \psi(x)\, dx \Big]^{1/2} = 1 \qquad (5.1.13)

with equality iff \Phi^{-1}(\Psi(x)) = c_2 x, a.e. Taking account of the restrictions (5.1.11) and (5.1.12)
completes the proof.

5.1.2 Maximal Invariants


5.1.2.1 Invariance of sample space, parameter space and tests
It is natural that we impose symmetric restrictions on statistical procedures, since many statistical
problems exhibit symmetries. Mathematically, symmetry is described by invariance under a suitable
group of transformations (see, e.g., Lehmann and Romano (2005).) Let X be a random variable
which is distributed according to a probability distribution Pθ , θ ∈ Ω, and let g be a transformation
of the sample space X. We assume that all transformations which are considered in connection with
invariance are one-to-one transformations of X onto itself. The random variable denoted by gX takes
on the value gx when X = x. Suppose that when the distribution of X is Pθ , θ ∈ Ω, the distribution
of Y = gX is Pθ′ , where θ′ is also in Ω. Then, the element θ′ of Ω will be denoted by ḡθ, so that

P_\theta\{gX \in A\} = P^X_\theta\{gX \in A\} = P^Y_{\bar g\theta}\{Y \in A\} = P_{\bar g\theta}\{X \in A\}.

This relation can be written as P_\theta(g^{-1}A) = P_{\bar g\theta}(A), and hence P_\theta(B) = P_{\bar g\theta}(gB).
We say that the parameter set Ω remains invariant under g (or is preserved by g) if
(i) ḡθ ∈ Ω for all θ ∈ Ω,
(ii) for any θ′ ∈ Ω, there exists θ ∈ Ω such that ḡθ = θ′ ,
and these two conditions can be expressed by the equation

ḡΩ = Ω.

Note that if the distributions Pθ corresponding to different values of θ are distinct, then the transformation
ḡ of Ω onto itself defined in this way is a one-to-one transformation.
We say that the problem of testing H : θ ∈ ΩH against K : θ ∈ ΩK remains invariant under a
transformation g if ḡ preserves both ΩH and ΩK , so that the equations

ḡΩH = ΩH and ḡΩK = ΩK

hold.
Any class of transformations leaving the problem invariant can be naturally extended to a group
G. Furthermore, the class of induced transformations ḡ also forms a group Ḡ (see, e.g., Lemma
6.1.1 of Lehmann and Romano (2005)). In the presence of symmetries in both sample space and
parameter space, which are represented by the groups G and Ḡ, it is natural to restrict attention to
tests φ, which are also symmetric, that is, which satisfy
φ (gx) = φ (x) , for all x∈X and g ∈ G,
and we say that such tests are invariant under G. In such a case, the principle of invariance restricts
our attention to invariant tests and to obtaining the best of these tests. For this purpose, it is
convenient to characterize the totality of invariant tests.
Two points x1 , x2 are considered equivalent under G, and denoted by
x1 ∼ x2 (mod G) ,
if there exists a transformation g ∈ G for which x2 = gx1 . A function M is said to be maximal
invariant if it is invariant and if
M (x1 ) = M (x2 ) implies x2 = gx1 for some g ∈ G.
That is, the maximal invariant function M is constant on each orbit and takes a different value on
each orbit.
Example 5.1.9 (Example 6.2.2 of Lehmann and Romano (2005)). (i) Let x = (x1 , . . . , xn )′ . And
let G be the set of n! permutations of the elements of x. Since a permutation of the xi only
changes their order, the order statistics x(1) ≤ · · · ≤ x(n) are invariant. Furthermore, since two
points which have the same order statistics can be obtained from each other by a permutation of
the elements, the order statistics are maximal invariant.
(ii) Suppose that we restrict our attention to points which have n different elements. Let G be
the totality of transformations x̃i = f (xi ), where f is continuous and strictly increasing. If we
consider x1 , . . . , xn as n values on the real line, since such transformations preserve their order,
the ranks (r1 , . . . , rn )′ of (x1 , . . . , xn )′ are invariant. On the other hand, consider the case where
x1 , . . . , xn and x̃1 , . . . , x̃n have the same order xi1 < · · · < xin and x̃i1 < · · · < x̃in . We may then
construct f with x̃i = f (xi ) for all i = 1, . . . , n by letting f (x) be the linear interpolation between
f (xik ) and f (xik+1 ) for xik ≤ x ≤ xik+1 , and

f(x) = \begin{cases} x + (\tilde x_{i_1} - x_{i_1}), & x \leq x_{i_1}, \\ x + (\tilde x_{i_n} - x_{i_n}), & x \geq x_{i_n}. \end{cases}

Then, since such f is continuous and strictly increasing, the ranks are maximal invariant.
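The invariance half of part (ii) is easy to demonstrate computationally; this Python sketch (our own illustration) checks that ranks are unchanged under a strictly increasing transformation:

```python
def ranks(xs):
    """Position numbers in the order statistic (no ties assumed)."""
    order = sorted(xs)
    return [order.index(x) + 1 for x in xs]

# Ranks are invariant under any continuous strictly increasing f
f = lambda x: x ** 3 + 2 * x          # strictly increasing on the real line
xs = [0.3, -1.2, 2.5, 0.7]
assert ranks(xs) == ranks([f(x) for x in xs])
```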

5.1.2.2 Most powerful invariant test


If there exist symmetries, we may restrict our attention to invariance tests. Then, we are interested
in the most powerful invariance test. We have the following result.
Theorem 5.1.10 (Theorem 6.3.1 of Lehmann and Romano (2005)). Consider the problem of
testing Ω0 against Ω1 , which remains invariant under a finite group G = {g1 , . . . , gN }. Assume that Ḡ is
transitive over Ω0 and Ω1 . Then, there exists a uniformly most powerful (UMP) invariant test of
Ω0 against Ω1 , which rejects Ω0 when

\frac{\frac{1}{N} \sum_{i=1}^{N} p_{\bar g_i \theta_1}(x)}{\frac{1}{N} \sum_{i=1}^{N} p_{\bar g_i \theta_0}(x)}

is sufficiently large, where θ0 and θ1 are any elements of Ω0 and Ω1 , respectively.


Sufficient statistics make a problem simple by reducing the sample space. On the other hand,
invariance makes a problem simple by reducing the data to a maximal invariant statistic M, whose
distribution may depend only on a function of the parameter. Therefore, invariance may also shrink
the parameter space. This fact is described in the following.
Theorem 5.1.11 (Theorem 6.3.2 of Lehmann and Romano (2005)). Let M(x) be invariant under
G, and let ν(θ) be maximal invariant under the induced group Ḡ. Then, the distribution of M(X)
depends only on ν(θ).
The two-sample problem of testing the equality of two distributions appears in the comparison of
a treatment with a control. The null hypothesis of no treatment effect is tested against the alternative
hypothesis, which is a beneficial effect. The observations consist of two samples X1 , . . . , Xm and
Y1 , . . . , Yn from two distributions with continuous cumulative distribution functions F and J. Then,
we test the null hypothesis

H1 : J = F.

Here, we consider testing H1 against one-sided alternatives where the Y’s are stochastically larger
than the X’s, i.e.,

K1 : J(z) ≤ F(z) for all z, and J ≠ F.

The two-sample problem of testing H1 against K1 remains invariant under the group G:
 
\tilde x_i = \rho(x_i), \quad \tilde y_j = \rho(y_j), \qquad i = 1, \ldots, m, \ j = 1, \ldots, n,

where ρ is a continuous and strictly increasing function on the real line. These transformations pre-
serve the continuity of a distribution. Furthermore, they preserve either the property of two variables
being identically distributed or the property of one being stochastically larger than the other. There-
fore, this test is invariant under the group G. As we saw in Example 5.1.9, a maximal invariant under
G is the set of ranks
 
$$(R'; S') = \left(R'_1, \ldots, R'_m; S'_1, \ldots, S'_n\right)$$
of $X_1, \ldots, X_m; Y_1, \ldots, Y_n$ in the combined sample. The distribution of $\left(R'_1, \ldots, R'_m; S'_1, \ldots, S'_n\right)$ is
symmetric in the first m variables for X’s, and in the last n variables for Y’s, for all distributions F
and J. Therefore, sufficient statistics for (R′ ; S ′ ) are the ordered X-ranks and the ordered Y-ranks,
i.e.,

R1 < · · · < Rm and S 1 < · · · < S n .

Since each of them determines the other, one of these sets alone is enough. Thus, any invariance test
is a rank test, namely, depends only on the ranks of the observations, e.g., on (S 1 , . . . , S n ).
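Since any invariant test depends on the data only through the ranks, this can be made concrete with a small plain-Python sketch (data and function names are ours, hypothetical; ties assumed absent): the Wilcoxon rank-sum statistic uses only the Y-ranks $(S_1, \ldots, S_n)$ in the combined sample, and is therefore unchanged by any strictly increasing transformation ρ.

```python
def combined_ranks(xs, ys):
    """Ranks of xs and ys in the pooled sample (ties assumed absent)."""
    pooled = sorted(xs + ys)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    return [rank[v] for v in xs], [rank[v] for v in ys]

def wilcoxon_rank_sum(xs, ys):
    """Sum of the Y-ranks (S_1, ..., S_n); large values support K_1."""
    return sum(combined_ranks(xs, ys)[1])

x, y = [1.2, 0.4, 2.1], [3.3, 2.8, 0.9]        # hypothetical samples
rho = lambda v: v ** 3                          # a strictly increasing transformation
# invariance: the statistic depends only on the ranks, so rho leaves it unchanged
assert wilcoxon_rank_sum(x, y) == wilcoxon_rank_sum([rho(v) for v in x], [rho(v) for v in y])
print(wilcoxon_rank_sum(x, y))                  # prints 13
```

Here the Y-ranks in the pooled sample are (6, 5, 2), so the statistic equals 13, and cubing each observation does not change it.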

5.1.3 Efficiency of Rank-Based Statistics


5.1.3.1 Least favourable density and most powerful test
Along with Sidak et al. (1999), we introduce the concept of the least favourable density and the most
powerful test. First, we consider a simple null hypothesis p and simple alternative q. We denote the
set of critical functions whose size does not exceed α by $M(\alpha, p)$, namely,
$$\int \Psi(x)\, dP(x) \le \alpha \quad \text{if and only if} \quad \Psi \in M(\alpha, p),$$

where dP(x) = p(x)dx. Furthermore, we call


$$\beta(\alpha, p, q) = \sup_{\Psi \in M(\alpha, p)} \int \Psi(x)\, dQ(x)$$

the envelope power function for p against q at level α, where, again, dQ(x) = q(x)dx.
INTRODUCTION TO RANK-BASED STATISTICS 127
In the case that the hypothesis is composite, that is, H = {p}, we denote by M (α, H) the set of
critical functions such that
$$\sup_{p \in H} \int \Psi(x)\, dP(x) \le \alpha,$$

and call
$$\beta(\alpha, H, q) = \sup_{\Psi \in M(\alpha, H)} \int \Psi(x)\, dQ(x)$$

the envelope power function for H against q at level α.


Finally, in the case that the alternative is also composite, that is, K = {q}, we call
$$\beta(\alpha, H, K) = \sup_{\Psi \in M(\alpha, H)} \inf_{q \in K} \int \Psi(x)\, dQ(x)$$

the envelope power function for H against K at level α.


If we consider testing p against q, a test Ψ 0 is said to be most powerful at level α if Ψ 0 ∈ M (α, p)
and if
$$\beta(\alpha, p, q) = \int \Psi^0(x)\, dQ(x). \tag{5.1.14}$$

Furthermore, if we consider testing H against q, a test $\Psi^0$ is said to be most powerful at level α if $\Psi^0 \in M(\alpha, H)$ and if
$$\beta(\alpha, H, q) = \int \Psi^0(x)\, dQ(x). \tag{5.1.15}$$

Finally, if we consider testing H against K, a test Ψ 0 ∈ M (α, H) is called uniformly most powerful
(UMP) at level α if (5.1.15) holds for all q ∈ K, while a test Ψ 0 ∈ M (α, H) is called maxmin most
powerful at level α if
$$\beta(\alpha, H, K) = \inf_{q \in K} \int \Psi^0(x)\, dQ(x).$$

If there is a p0 ∈ H such that the most powerful test Ψ 0 for p0 against q belongs to M (α, H),
then
$$\beta\left(\alpha, p^0, q\right) = \int \Psi^0(x)\, dQ(x) = \sup_{\Psi \in M(\alpha, H)} \int \Psi(x)\, dQ(x) \le \sup_{\Psi \in M(\alpha, p)} \int \Psi(x)\, dQ(x) = \beta(\alpha, p, q)$$

for every p ∈ H. Therefore, such a p0 is called a least favourable density in H for testing H against
q at level α. Similarly, if there are p0 ∈ H and q0 ∈ K such that the most powerful test Ψ 0 for p0
against q0 belongs to M (α, H) and
$$\int \Psi^0(x)\, dQ(x) \ge \int \Psi^0(x)\, dQ^0(x) \quad \text{for all } q \in K,$$

then
$$\beta\left(\alpha, p^0, q^0\right) = \int \Psi^0(x)\, dQ^0(x) \le \int \Psi^0(x)\, dQ(x) = \sup_{\Psi \in M(\alpha, H)} \int \Psi(x)\, dQ(x) \le \sup_{\Psi \in M(\alpha, p)} \int \Psi(x)\, dQ(x) = \beta(\alpha, p, q)$$
 
for every p ∈ H and q ∈ K. Hence, we call p0 , q0 a least favourable pair of densities for H against
K at level α.
The following basic lemma shows that the most powerful test for testing a simple hypothesis against a simple alternative can be found quite easily (see, e.g., Lemma 2.3.4.1 of Sidak et al. (1999)).
Lemma 5.1.12 (Neyman–Pearson lemma). In testing p against q at level α,
$$\Psi^0(x) = \begin{cases} 1, & \text{if } q(x) > k p(x), \\ 0, & \text{if } q(x) < k p(x), \end{cases}$$
is the most powerful test, where k and the value of $\Psi^0(x)$ for x such that $q(x) = k p(x)$ are determined so that
$$\int \Psi^0(x)\, dP(x) = \alpha.$$

Example 5.1.13 (Example 3.8.1 of Lehmann and Romano (2005)). In the case that an item should
be satisfactory, some characteristic X of the item may be expected to exceed a given constant u. Let
p = P (X ≤ u)
be the probability of an item being defective. We would like to test the hypothesis H : p ≥ p0 .
Furthermore, let X1 , . . . , Xn be characteristics of n sample items, that is, i.i.d. copies of X. The
distributions of X1 , . . . , Xn can be determined by p and the conditional distributions
P− (x) = P (X ≤ x | X ≤ u) and P+ (x) = P (X ≤ x | X > u) .
We assume that $P_-$ and $P_+$ have densities $p_-$ and $p_+$. Then, the joint density of $X_1, \ldots, X_n$ is given by
$$p^m (1 - p)^{n-m}\, p_-\left(x_{i_1}\right) \cdots p_-\left(x_{i_m}\right) p_+\left(x_{j_1}\right) \cdots p_+\left(x_{j_{n-m}}\right),$$
where the sample points $x_1, \ldots, x_n$ satisfy
$$x_{i_1}, \ldots, x_{i_m} \le u \le x_{j_1}, \ldots, x_{j_{n-m}}.$$
Now, we consider a fixed alternative to H, say $\{p_1, P_-, P_+\}$ with $p_1 < p_0$. On the other hand, we consider $\{p_0, P_-, P_+\}$ as a candidate for the least favourable distribution. Then, the test $\Psi^0$ in Lemma 5.1.12 becomes
$$\Psi^0(x) = \begin{cases} 1, & \left(\dfrac{p_1}{p_0}\right)^m \left(\dfrac{1-p_1}{1-p_0}\right)^{n-m} > C, \\ 0, & \left(\dfrac{p_1}{p_0}\right)^m \left(\dfrac{1-p_1}{1-p_0}\right)^{n-m} < C, \end{cases}$$

which, since $p_1 < p_0$ makes the likelihood ratio decreasing in m, is equivalent to
$$\Psi^0(x) = \begin{cases} 1, & m < C', \\ 0, & m > C'. \end{cases}$$
Therefore, the test rejects when M < C, and when M = C with probability γ, where
P {M < C} + γP {M = C} = α for p = p0 .
The distribution of M is the binomial distribution b(n, p), so it does not depend on $P_+$ and $P_-$. Hence, the power function of the test depends only on p. Furthermore, the power function is a decreasing
function of p, which takes on its maximum under H for p = p0 . As a consequence, p = p0 is the
least favourable and Ψ 0 is the most powerful. Furthermore, Ψ 0 is UMP since the test is independent
of the particular choice of the alternative.
If we express the test statistic M in terms of the variables Zi = Xi − u, then the test statistic M is
the number of variables Zi which are less than or equal to 0. This test is the so-called sign test.
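A minimal sketch of the sign test just described (data and helper name are hypothetical, our own): M counts the $Z_i = X_i - u$ that are $\le 0$, and under $p = p_0$ it is binomial $b(n, p_0)$, so the test can be calibrated from binomial probabilities alone.

```python
import math

def sign_test_pvalue(xs, u, p0):
    """P(M <= m) under Binomial(n, p0), where M = #{i : X_i <= u}; reject for small M."""
    n = len(xs)
    m = sum(1 for x in xs if x <= u)   # the sign-test statistic M
    return sum(math.comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(m + 1))

xs = [5.1, 4.8, 6.0, 5.7, 4.9, 6.3, 5.5, 6.1]          # hypothetical measurements
print(round(sign_test_pvalue(xs, u=5.0, p0=0.5), 4))   # m = 2 of n = 8; prints 0.1445
```

With m = 2 defectives out of n = 8, the left-tail probability is $\sum_{k=0}^{2}\binom{8}{k}2^{-8} = 37/256 \approx 0.1445$.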
Let $H_0$ be the system of densities of the form
$$p(x) = \prod_{i=1}^{N} f(x_i),$$

where f(x) is an arbitrary one-dimensional density. Now, we consider testing $H_0$ against some simple alternative q. Assume that for some $p_0 \in H_0$, the likelihood ratio $q(x)/p_0(x)$ is a function of the ranks $r = (r_1, \ldots, r_N)$ only, that is,
$$\frac{q(x)}{p_0(x)} = h(r).$$
Then, the test
$$\Psi^0(x) = \begin{cases} 1, & \text{if } h(r) > k, \\ 0, & \text{if } h(r) < k, \end{cases}$$
is most powerful at level $\alpha = \int \Psi^0(x)\, dP_0(x)$ for $p_0$ against q. Moreover, the distribution of $\Psi^0(X)$
is invariant over H0 , since Ψ 0 (X) = Ψ 0 (RN ) and the distribution of RN = (R1 , . . . , RN )′ does not
depend on the choice of p ∈ H0 . Therefore, we see that
$$\int \Psi^0(x)\, dP(x) = \alpha \quad \text{for all } p \in H_0,$$

and consequently p0 is least favourable for H0 against q and Ψ 0 is most powerful for H0 against q.
If we put $K = \{q\}$, where the q are all the densities such that
$$q(x) = p(x)\, h(r), \qquad p \in H_0,$$
then $\int \Psi^0(x)\, dQ(x)$ does not depend on the choice of $q \in K$; hence $\Psi^0$ is most powerful for $H_0$ against K, too.
More generally, if $K = \{q_d, d \in W\}$ and
$$\frac{q_d(x)}{p_0(x)} = h_d(r), \qquad d \in W, \tag{5.1.16}$$
for some density p0 , then the maximin most powerful test for H0 against K may be found within
the family of rank tests. However, the relation (5.1.16) does not hold for any important practical
problem.

5.1.3.2 Asymptotically most powerful rank test


Observing $X_i = \alpha + \beta c_i + \sigma Y_i$, Hájek (1962) considered testing the hypothesis β = 0 against the alternative β > 0. Let $\left\{\left(X_{m,1}, \ldots, X_{m,N_m}\right)', m \ge 1\right\}$ be a sequence of random vectors, where the X's are independent and
$$P\left(X_{m,i} \le x \mid \alpha, \beta, \sigma\right) = F\left(\frac{x - \alpha - \beta c_{m,i}}{\sigma}\right), \qquad 1 \le i \le N_m, \ m \ge 1,$$
where F is a known distribution function of the residuals Ym,i , which has the density f (x) = F ′ (x),
cm,i are known constants, α ∈ R and σ > 0 are nuisance parameters, and β is the parameter of
interest. Suppose that the square root of the probability density ( f (x))1/2 possesses a quadratically
integrable derivative, which means that
$$0 < I(f) = \int_{\mathbb{R}} \left\{\frac{f'(x)}{f(x)}\right\}^2 f(x)\, dx < \infty \tag{5.1.17}$$
(as usual, I(f) is the Fisher information of the location parameter family $\{f_\theta(x) = f(x - \theta) \mid \theta \in \mathbb{R}\}$), since
$$\left\{f(x)^{1/2}\right\}' = \frac{1}{2} \frac{f'(x)}{\{f(x)\}^{1/2}}.$$

Hájek (1962) defines a class of rank tests which are asymptotically most powerful for given f . We
assume the following condition for the constants cm,i .
Assumption 5.1.14. (i) Noether condition:
$$\lim_{m\to\infty} \frac{\max_{1 \le i \le N_m} \left(c_{m,i} - \bar{c}_m\right)^2}{\sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2} = 0, \tag{5.1.18}$$
where $\bar{c}_m = N_m^{-1} \sum_{i=1}^{N_m} c_{m,i}$;
(ii) boundedness condition:
$$\sup_m \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 < \infty. \tag{5.1.19}$$

Define
$$d_m^2 = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 I(f) \quad \text{and} \quad d^2 = \lim_{m\to\infty} d_m^2, \tag{5.1.20}$$
and let $F^{-1}(u) = \inf\{x \mid F(x) \ge u\}$, 0 < u < 1, and put
$$\psi(u) = -\frac{f'\left(F^{-1}(u)\right)}{f\left(F^{-1}(u)\right)}. \tag{5.1.21}$$

From (5.1.17), it follows that
$$\int_0^1 \psi(u)\, du = \int_{\mathbb{R}} f'(x)\, dx = 0 \tag{5.1.22}$$
and
$$\int_0^1 \psi^2(u)\, du = I(f) < \infty. \tag{5.1.23}$$
If f corresponds to ψ, written $f \to \psi$, then $f_1(x) = \frac{1}{\sigma} f\left(\frac{x-\alpha}{\sigma}\right)$ corresponds to $\psi_1(u) = \frac{1}{\sigma}\psi(u)$,
$$\frac{1}{\sigma} f\left(\frac{x - \alpha}{\sigma}\right) \to \frac{1}{\sigma} \psi(u);$$
i.e., the map is invariant under translation, and a change of scale only introduces a multiplicative constant, which satisfies $\sigma^2 I(f_1) = I(f)$.
Now, let $R_{m,i}$ be the rank of $X_{m,i}$; that is, $X_{m,i}$ is the $R_{m,i}$-th smallest among $X_{m,1}, \ldots, X_{m,N_m}$. Let $\psi_m(u)$ be a function defined on (0, 1) which is constant over the intervals $\left(\frac{i-1}{N_m}, \frac{i}{N_m}\right)$, $1 \le i \le N_m$, namely,
$$\psi_m(u) = \psi_m\left(\frac{i}{N_m + 1}\right) := \psi_{m,i}, \qquad \frac{i-1}{N_m} < u < \frac{i}{N_m}. \tag{5.1.24}$$
Consider the rank statistics $S_m$ of the form
$$S_m = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi_m\left(\frac{R_{m,i}}{N_m + 1}\right), \qquad m \ge 1. \tag{5.1.25}$$

Since the distribution of S m does not depend on the ordering of ψm,i , without loss of generality, we
may suppose that $\psi_m(u)$ is nondecreasing in u. Furthermore, assume that
$$\lim_{m\to\infty} \int_0^1 \{\psi_m(u) - \psi(u)\}^2\, du = 0 \tag{5.1.26}$$

and ψm (u) is nonconstant and quadratically integrable.


Here, we consider a particular alternative $\{\alpha = \alpha_0, \beta = \beta_0, \sigma = \sigma_0\}$ and associate with it the particular null hypothesis $\{\alpha = \alpha_0 + \beta_0 \bar{c}_m, \beta = 0, \sigma = \sigma_0\}$. Denoting the distributions under the null and alternative hypotheses by $P_m$ and $Q_m$, respectively, assume that
$$P_m\left(X_{m,i} \le x_i,\ 1 \le i \le N_m\right) = \prod_{i=1}^{N_m} P_{m,i}\left(X_{m,i} \le x_i\right),$$
$$Q_m\left(X_{m,i} \le x_i,\ 1 \le i \le N_m\right) = \prod_{i=1}^{N_m} Q_{m,i}\left(X_{m,i} \le x_i\right)$$
and put
$$r_{m,i} = \begin{cases} \dfrac{dQ_{m,i}}{dP_{m,i}} & \text{on the sets } B_{m,i}, \\ 0 & \text{otherwise,} \end{cases}$$
where $B_{m,i}$ satisfies $P_{m,i}\left(B_{m,i}\right) = 1$ and $Q_{m,i}$ is absolutely continuous with respect to $P_{m,i}$ on $B_{m,i}$. If $p_{m,i}$ and $q_{m,i}$ are densities corresponding to the distributions $P_{m,i}$ and $Q_{m,i}$, respectively, then
$$r_{m,i}(x) = \begin{cases} \dfrac{q_{m,i}(x)}{p_{m,i}(x)} & \text{if } p_{m,i}(x) > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Put
$$W_m = 2 \sum_{i=1}^{N_m} \left(r_{m,i}^{1/2} - 1\right), \tag{5.1.27}$$
$$L_m = \sum_{i=1}^{N_m} \log r_{m,i}$$

and suppose that for every ε > 0,
$$\lim_{m\to\infty} \max_{1 \le i \le N_m} P_{m,i}\left(\left|r_{m,i} - 1\right| > \varepsilon\right) = 0. \tag{5.1.28}$$

 
We say that $\mathcal{L}(Y_m \mid P_m)$ is $AN\left(a_m, b_m^2\right)$ if the distribution of $b_m^{-1}(Y_m - a_m)$ under $P_m$ tends to the standard normal distribution; similar notation will be used for $\mathcal{L}(Y_m \mid Q_m)$. The following lemma is due to LeCam.
Lemma 5.1.15 (LeCam's third lemma). Assume that (5.1.28) holds true, and that $\mathcal{L}(W_m \mid P_m)$ is $AN\left(-\frac{1}{4}\sigma^2, \sigma^2\right)$. Then
(i) the probability distributions $Q_m$ are contiguous to the probability measures $P_m$,
(ii) $W_m - L_m \xrightarrow{P_m} \frac{1}{4}\sigma^2$ and $\mathcal{L}(L_m \mid P_m)$ is $AN\left(-\frac{1}{2}\sigma^2, \sigma^2\right)$,
(iii) if $\mathcal{L}(Y_m \mid P_m)$ is $AN\left(a, b^2\right)$, and $\mathcal{L}(Y_m, L_m \mid P_m)$ tends to a bivariate normal distribution with correlation coefficient ρ, then
$$\mathcal{L}(Y_m \mid Q_m) \text{ is } AN\left(a + \rho b \sigma, b^2\right).$$

Setting
$$s(x) = \{f(x)\}^{1/2},$$
we have
$$s'(x) = \frac{1}{2} \frac{f'(x)}{\{f(x)\}^{1/2}},$$
2 { f (x)} 12 

so that if (5.1.17) is satisfied, s (x) has a quadratically integrable derivative. Concerning functions
possessing a quadratically integrable derivative, Hájek (1962) formulated the following lemma.
Lemma 5.1.16 (Hájek (1962), Lemma 4.3). Let s(x), $x \in \mathbb{R}$, be an absolutely continuous function possessing a quadratically integrable derivative s′(x). Then
$$\lim_{h\to 0} \int \left\{\frac{s(x - h) - s(x)}{h} + s'(x)\right\}^2 dx = 0.$$
We can express (5.1.27) as
$$W_m = 2 \sum_{i=1}^{N_m} \left\{\frac{s\left(Y_{m,i} - \gamma c_{m,i} + \gamma \bar{c}_m\right)}{s\left(Y_{m,i}\right)} - 1\right\}, \tag{5.1.29}$$

where
$$Y_{m,i} = \frac{X_{m,i} - \alpha_0 - \beta_0 \bar{c}_m}{\sigma_0} \quad \text{and} \quad \gamma = \frac{\beta_0}{\sigma_0}. \tag{5.1.30}$$
σ0
Because of Lemma 5.1.16, we can see that
$$2 E_{P_m}\left\{\frac{s\left(Y_{m,i} - h\right)}{s\left(Y_{m,i}\right)} - 1\right\} = -h^2 \int_{\mathbb{R}} \left\{\frac{s(x-h) - s(x)}{h}\right\}^2 dx \sim -h^2 \int_{\mathbb{R}} \left\{s'(x)\right\}^2 dx = -\frac{h^2}{4} \int_{\mathbb{R}} \left\{\frac{f'(x)}{f(x)}\right\}^2 f(x)\, dx = -\frac{h^2}{4} I(f),$$
where the sign $a \sim b$ means that $\lim_{h\to 0} a/b = 1$. Letting $h = \gamma\left(c_{m,i} - \bar{c}_m\right)$, from (5.1.29), we have

$$E_{P_m}(W_m) \sim -\frac{\gamma^2 I(f)}{4} \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 = -\frac{\gamma^2 d_m^2}{4}, \tag{5.1.31}$$
where $d_m^2$ and γ are given by (5.1.20) and (5.1.30), respectively.


We now approximate $W_m$ by
$$T_m = -\sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)}.$$

Since the $Y_{m,i}$ are independent, we have
$$\operatorname{Var}_{P_m}(T_m) = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 \operatorname{Var}_{P_m}\left\{\frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)}\right\} = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 E_{P_m}\left\{\frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)}\right\}^2 = d_m^2.$$

Setting $h_{m,i} = \gamma\left(c_{m,i} - \bar{c}_m\right)$, we see that
$$E_{P_m}\left|W_m - E_{P_m}(W_m) - \gamma T_m\right|^2 = 4 \sum_{i=1}^{N_m} \operatorname{Var}_{P_m}\left\{\frac{s\left(Y_{m,i} - h_{m,i}\right)}{s\left(Y_{m,i}\right)} - 1 + \frac{h_{m,i}}{2} \frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)}\right\}$$
$$\le 4 \sum_{i=1}^{N_m} E_{P_m}\left\{\frac{s\left(Y_{m,i} - h_{m,i}\right)}{s\left(Y_{m,i}\right)} - 1 + \frac{h_{m,i}}{2} \frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)}\right\}^2$$
$$\le 4 \gamma^2 \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 \int \left\{\frac{s\left(x - h_{m,i}\right) - s(x)}{h_{m,i}} + s'(x)\right\}^2 dx.$$

From (5.1.18), (5.1.19) and Lemma 5.1.16, it follows that
$$\lim_{m\to\infty} E_{P_m}\left|W_m - E_{P_m}(W_m) - \gamma T_m\right|^2 = 0. \tag{5.1.32}$$

Introduce the statistic
$$T_m^* = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi_m\left(F\left(Y_{m,i}\right)\right),$$

where $\psi_m$ satisfies condition (5.1.26). Then, from Theorem 3.2 of Hájek (1961), it can be seen that
$$\lim_{m\to\infty} E_{P_m}\left|S_m - T_m^*\right|^2 = 0,$$
where $S_m$ is given by (5.1.25). Therefore, it follows that
$$S_m - T_m^* \xrightarrow{P_m} 0. \tag{5.1.33}$$

Furthermore, we see that
$$\lim_{m\to\infty} E_{P_m}\left|T_m - T_m^*\right|^2 = \lim_{m\to\infty} \operatorname{Var}_{P_m}\left(T_m - T_m^*\right) = \lim_{m\to\infty} \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 \operatorname{Var}_{P_m}\left\{\frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)} + \psi_m\left(F\left(Y_{m,i}\right)\right)\right\}$$
$$\le \lim_{m\to\infty} \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 E_{P_m}\left\{\frac{f'\left(Y_{m,i}\right)}{f\left(Y_{m,i}\right)} + \psi_m\left(F\left(Y_{m,i}\right)\right)\right\}^2 = \lim_{m\to\infty} \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 \int_0^1 \{-\psi(u) + \psi_m(u)\}^2\, du = 0. \tag{5.1.34}$$

Now, (5.1.31), (5.1.32), (5.1.33) and (5.1.34) imply
$$W_m + \frac{\gamma^2 d_m^2}{4} - \gamma T_m \xrightarrow{P_m} 0 \tag{5.1.35}$$
and
$$T_m - T_m^* \xrightarrow{P_m} 0, \qquad T_m - S_m \xrightarrow{P_m} 0. \tag{5.1.36}$$
It can be shown that $S_m$ given by (5.1.25) satisfies Lindeberg's condition (see Lemma 4.1 of Hájek (1961)), so that
$$\mathcal{L}(S_m \mid P_m) \text{ is } AN\left(0, d_m^2\right). \tag{5.1.37}$$
Let $\varepsilon_m$ and $Q_m(\alpha, \beta, \sigma)$ be the size and power of the test based on the critical region
$$S_m > K_\varepsilon d, \qquad m \ge 1, \tag{5.1.38}$$
where d is given by (5.1.20) and $K_\varepsilon$ satisfies $\Phi(K_\varepsilon) = 1 - \varepsilon$.
Definition 5.1.17. If $\lim_{m\to\infty} \varepsilon_m = \varepsilon$ and
$$\liminf_{m\to\infty} \left\{Q_m(\alpha, \beta, \sigma) - Q_m^*(\alpha, \beta, \sigma)\right\} \ge 0, \qquad \alpha \in \mathbb{R},\ \beta > 0,\ \sigma > 0,$$
where $Q_m^*(\alpha, \beta, \sigma)$ is the power of any other test of limiting size ε, then we say that the test based on (5.1.38) is asymptotically uniformly most powerful (uniformity is meant with respect to α, β and σ).
From (5.1.20), (5.1.35), (5.1.36) and (5.1.37), it follows that
$$\mathcal{L}(W_m \mid P_m) \text{ is } AN\left(-\frac{\gamma^2 d^2}{4}, \gamma^2 d^2\right). \tag{5.1.39}$$
Making use of Lemma 5.1.15, we conclude that the distributions $Q_m$ are contiguous to the distributions $P_m$, and from (5.1.20), (5.1.35), (5.1.36) and (5.1.39), we see that
$$L_m + \frac{\gamma^2 d^2}{2} - \gamma S_m \xrightarrow{P_m} 0. \tag{5.1.40}$$
Now the contiguity implies that
$$L_m + \frac{\gamma^2 d^2}{2} - \gamma S_m \xrightarrow{Q_m} 0.$$
Consequently, the best critical region for distinguishing the particular alternative $Q_m$ from the particular null hypothesis $P_m$, which is known to be based on the statistic $L_m$ (see the Neyman–Pearson lemma, Lemma 5.1.12), is asymptotically approximated by the test (5.1.38). Since the statistic $S_m$ is distribution free, the test has the correct limiting size for all possible particular hypotheses, and therefore, it is asymptotically uniformly most powerful.
Using (5.1.40), we see that $\mathcal{L}(S_m, L_m \mid P_m)$ tends to a degenerate bivariate normal distribution $N(\mu, \Sigma)$, where $\mu = \left(0, -\frac{\gamma^2 d^2}{2}\right)'$ and $\Sigma = \begin{pmatrix} d^2 & \gamma d^2 \\ \gamma d^2 & \gamma^2 d^2 \end{pmatrix}$, with correlation coefficient ρ = 1. Consequently, from (5.1.37), (5.1.39) and Lemma 5.1.15, it follows that
$$\mathcal{L}(S_m \mid Q_m) \text{ is } AN\left(\gamma d^2, d^2\right).$$
Now, we can state the following result.
Now, we can state the following result.
Theorem 5.1.18 (Hájek (1962), Theorem 1.1). Suppose that (5.1.17) and Assumption 5.1.14 hold, and let ψ(u) given by (5.1.21) satisfy (5.1.26). Then the critical region (5.1.38) provides an asymptotically uniformly most powerful test of limiting size ε, and the asymptotic power equals
$$1 - \Phi\left(K_\varepsilon - \frac{\beta}{\sigma} d\right). \tag{5.1.41}$$
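The power formula (5.1.41) is easy to evaluate numerically. The sketch below is our own, with purely illustrative values of β, σ and d, using Python's `statistics.NormalDist` for Φ and $\Phi^{-1}$.

```python
from statistics import NormalDist

nd = NormalDist()
eps = 0.05
K = nd.inv_cdf(1 - eps)                 # K_eps, defined by Phi(K_eps) = 1 - eps

def power(beta, sigma, d):
    """Asymptotic power (5.1.41): 1 - Phi(K_eps - (beta/sigma) d)."""
    return 1 - nd.cdf(K - (beta / sigma) * d)

print(round(power(0.5, 1.0, 6.0), 3))   # hypothetical beta, sigma, d; prints 0.912
```

At β = 0 the power reduces to the size ε, and it increases monotonically in $\beta d / \sigma$, as the formula indicates.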
As candidates for the step function $\psi_m(u)$ we can consider, e.g.:
(i) The step functions $\psi_m$ form an $N_m$-dimensional subspace of the space of quadratically integrable functions on (0, 1). The projection of ψ onto this subspace is
$$\psi_m^0\left(\frac{i}{N_m + 1}\right) = N_m \int_{(i-1)/N_m}^{i/N_m} \psi(u)\, du, \qquad 1 \le i \le N_m.$$

(ii) Let $X_{m,(1)} < \cdots < X_{m,(N_m)}$ be the ordered sample referring to F, with α = β = 0 and σ = 1, so that $Z_{m,(1)} < \cdots < Z_{m,(N_m)}$, where $Z_{m,(i)} = F\left(X_{m,(i)}\right)$, is an ordered sample from the uniform distribution over (0, 1). Put
$$\psi_m^+\left(\frac{i}{N_m + 1}\right) = E\left\{-\frac{f'\left(X_{m,(i)}\right)}{f\left(X_{m,(i)}\right)}\right\} = E\left\{\psi\left(Z_{m,(i)}\right)\right\}, \qquad 1 \le i \le N_m.$$

(iii) Another step function of particular interest is defined by
$$\psi_m^*\left(\frac{i}{N_m + 1}\right) = \psi\left(\frac{i}{N_m + 1}\right), \qquad 1 \le i \le N_m.$$

We extend the definitions of ψ0m , ψ+m and ψ∗m to the whole interval (0, 1) according to (5.1.24). Then,
it can be shown that under (5.1.17), (5.1.18) and (5.1.19), ψ0m and ψ+m satisfy the condition (5.1.26)
(see Hájek (1961)). Moreover, suppose that (5.1.17), (5.1.18) and (5.1.19) hold, and that ψ (u) is
either monotone or continuous and bounded. Then, ψ∗m also satisfies the condition (5.1.26) (see
Hájek (1961)).
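For concreteness, the candidate step scores can be computed numerically. The sketch below is our own (with a hypothetical $N_m$), taking f standard normal, so that $\psi(u) = \Phi^{-1}(u)$ (the van der Waerden scores) and I(f) = 1; condition (5.1.26) then forces the step-function integral $\int_0^1 \psi_m^2(u)\, du$ toward I(f).

```python
from statistics import NormalDist

psi = NormalDist().inv_cdf      # psi(u) = Phi^{-1}(u) when f is standard normal, I(f) = 1
Nm = 200                        # hypothetical number of observations

# (iii): psi*_m(i/(N_m + 1)) = psi(i/(N_m + 1)) -- the van der Waerden scores
star = [psi(i / (Nm + 1)) for i in range(1, Nm + 1)]

# (i): psi^0_m(i/(N_m + 1)) = N_m * int_{(i-1)/N_m}^{i/N_m} psi(u) du, by midpoint quadrature
def psi0(i, grid=200):
    return sum(psi((i - 1) / Nm + (k + 0.5) / (Nm * grid)) for k in range(grid)) / grid

# the step-function integral of psi_m^2, which approaches I(f) = 1 as N_m grows
print(sum(s * s for s in star) / Nm)
```

The printed value lies a little below 1 because the extreme tails of $\Phi^{-1}$ are not yet captured at this $N_m$; it approaches I(f) = 1 as $N_m \to \infty$, as (5.1.26) requires.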
Under the conditions above, all the critical regions $S_m^0 > K_\varepsilon d_m$, $S_m^+ > K_\varepsilon d_m$ and $S_m^* > K_\varepsilon d_m$ provide an asymptotically most powerful test of limiting size ε and of asymptotic power (5.1.41), where
$$S_m^0 = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi_m^0\left(\frac{R_{m,i}}{N_m + 1}\right), \qquad m \ge 1,$$
$$S_m^+ = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi_m^+\left(\frac{R_{m,i}}{N_m + 1}\right), \qquad m \ge 1,$$
$$S_m^* = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi_m^*\left(\frac{R_{m,i}}{N_m + 1}\right), \qquad m \ge 1.$$

If the true density, say g(x), differs from the supposed one f(x), then the tests based on $S_m$ are no longer asymptotically most powerful. Denote
$$\varphi(u) = -\frac{g'\left(G^{-1}(u)\right)}{g\left(G^{-1}(u)\right)},$$
where $G(x) = \int_{-\infty}^x g(y)\, dy$. Assume that
$$\int_0^1 \varphi^2(u)\, du = \int_{\mathbb{R}} \left\{\frac{g'(x)}{g(x)}\right\}^2 g(x)\, dx = I(g) < \infty, \tag{5.1.42}$$

and introduce the correlation coefficient
$$\rho = \frac{\int_0^1 \psi(u)\varphi(u)\, du}{\left\{\int_0^1 \psi^2(u)\, du \int_0^1 \varphi^2(u)\, du\right\}^{1/2}} = \frac{\int_0^1 \psi(u)\varphi(u)\, du}{\sqrt{I(f)\, I(g)}}. \tag{5.1.43}$$
Let $P_m$ correspond to G and $\{\alpha = \alpha_0 + \beta_0 \bar{c}_m, \beta = 0, \sigma = \sigma_0\}$, and let $Q_m$ correspond to G and $\{\alpha = \alpha_0, \beta = \beta_0, \sigma = \sigma_0\}$. Introduce the statistics
$$L_m = \sum_{i=1}^{N_m} \log \left\{\frac{g\left(Y_{m,i} - \gamma c_{m,i} + \gamma \bar{c}_m\right)}{g\left(Y_{m,i}\right)}\right\},$$
$$\widetilde{T}_m = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \varphi\left(G\left(Y_{m,i}\right)\right)$$
and
$$T_m = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right) \psi\left(G\left(Y_{m,i}\right)\right).$$

Then, an argument similar to the above leads to
$$L_m + \frac{\gamma^2 \tilde{d}^2}{2} - \gamma \widetilde{T}_m \xrightarrow{P_m} 0,$$
$$\mathcal{L}\left(\widetilde{T}_m \mid P_m\right) \text{ is } AN\left(0, \tilde{d}_m^2\right) \quad \text{and} \quad \mathcal{L}(T_m \mid P_m) \text{ is } AN\left(0, d_m^2\right),$$
where $d_m^2$ is given by (5.1.20),
$$\tilde{d}_m^2 = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 I(g) \quad \text{and} \quad \tilde{d}^2 = \lim_{m\to\infty} \tilde{d}_m^2.$$

Furthermore, since under $P_m$,
$$\operatorname{Cov}\left(T_m, \widetilde{T}_m\right) = \sum_{i=1}^{N_m} \left(c_{m,i} - \bar{c}_m\right)^2 \int_0^1 \psi(u)\varphi(u)\, du,$$
we can see that $\mathcal{L}(T_m, L_m \mid P_m)$ tends to a degenerate bivariate normal distribution $N(\mu, \Sigma)$, where $\mu = \left(0, -\frac{\gamma^2 \tilde{d}^2}{2}\right)'$ and $\Sigma = \begin{pmatrix} d^2 & \rho \gamma \tilde{d} d \\ \rho \gamma \tilde{d} d & \gamma^2 \tilde{d}^2 \end{pmatrix}$, with the correlation coefficient ρ given by (5.1.43), and consequently, it follows that
$$\mathcal{L}(T_m \mid Q_m) \text{ is } AN\left(\rho \gamma \tilde{d} d, d^2\right).$$

Now, we have the following result.


Theorem 5.1.19 (Hájek (1962), Theorem 6.1). Let the same conditions as in Theorem 5.1.18 and (5.1.42) be satisfied. Then the asymptotic power of the test $S_m > K_\varepsilon d$ equals
$$1 - \Phi\left(K_\varepsilon - \rho \frac{\beta}{\sigma} \tilde{d}\right),$$
where ρ is given by (5.1.43).
In the two-sample problem where $N_m / \widetilde{N}_m \to \lambda$, 0 < λ < 1, the efficiency may be interpreted as the ratio of sample sizes needed to attain the same power. However, in the general regression problem, the sample size $N_m$ should be replaced by the sum $\sum_{i=1}^{N_m}\left(c_{m,i} - \bar{c}_m\right)^2$.
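Theorem 5.1.19 quantifies the loss from using the wrong score function through ρ. A classical instance, checked numerically below in our own sketch: Wilcoxon scores ψ(u) = 2u − 1 (optimal for the logistic density, with I(f) = 1/3) under a true standard normal g (so $\varphi(u) = \Phi^{-1}(u)$ and I(g) = 1) give $\rho^2 = 3/\pi \approx 0.955$, the familiar Pitman efficiency of the Wilcoxon test under normality.

```python
import math
from statistics import NormalDist

inv = NormalDist().inv_cdf                        # phi(u) = Phi^{-1}(u): true g standard normal
n = 100000
us = [(i + 0.5) / n for i in range(n)]            # midpoint rule on (0, 1)
num = sum((2 * u - 1) * inv(u) for u in us) / n   # int_0^1 psi(u) phi(u) du with psi(u) = 2u - 1
rho = num / math.sqrt((1 / 3) * 1.0)              # divide by sqrt(I(f) I(g))
print(round(rho ** 2, 3), round(3 / math.pi, 3))
```

The exact value of the numerator is $1/\sqrt{\pi}$, so $\rho^2 = 3/\pi$; the quadrature reproduces it to three decimals.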
5.1.4 U-Statistics for Stationary Processes
U-statistics play an important role in nonparametric and semiparametric rank-based inference.
Yoshihara (1976) investigated the asymptotic properties of U-statistics for strictly stationary, ab-
solutely regular processes.
Let $\{X_t, t \in \mathbb{Z}\}$ be a d-dimensional strictly stationary process defined on a probability space $(\Omega, \mathcal{A}, P)$. For $a \le b$, let $\mathcal{F}_a^b$ denote the σ-algebra generated by $\{X_a, \ldots, X_b\}$. The sequence is said to be absolutely regular if
$$\beta(N) = E\left\{\sup_{A \in \mathcal{F}_N^\infty} \left|P\left(A \mid \mathcal{F}_{-\infty}^0\right) - P(A)\right|\right\} \downarrow 0 \tag{5.1.44}$$

as $N \to \infty$. It can be shown that
$$\beta(N) = \frac{1}{2} V\left(P_{0,N}, P_{1,N}\right) = \frac{1}{2} \operatorname{Var}\left(P_{0,N} - P_{1,N}\right),$$
where $P_{0,N}$ is the measure induced by the process $\{X_t\}$ on the σ-algebra $\mathcal{F}_{-\infty}^0 \cup \mathcal{F}_N^\infty$, and $P_{1,N}$ the measure defined for $A \in \mathcal{F}_N^\infty$, $B \in \mathcal{F}_{-\infty}^0$ by the equality
$$P_{1,N}(A \cap B) = P_{0,N}(A)\, P_{0,N}(B).$$

On the other hand, $\{X_t\}$ is said to satisfy the φ-mixing condition if
$$\varphi(N) = \sup_{B \in \mathcal{F}_{-\infty}^0,\, A \in \mathcal{F}_N^\infty} \frac{1}{P_{0,N}(B)} \left|P_{0,N}(A \cap B) - P_{1,N}(A \cap B)\right| \downarrow 0$$
and the strong mixing condition if
$$\alpha(N) = \sup_{B \in \mathcal{F}_{-\infty}^0,\, A \in \mathcal{F}_N^\infty} \left|P_{0,N}(A \cap B) - P_{1,N}(A \cap B)\right| \downarrow 0.$$

Since $\alpha(N) \le \beta(N) \le \varphi(N)$, if $\{X_t\}$ is φ-mixing, then it is absolutely regular, and if $\{X_t\}$ is absolutely regular, then it is strongly mixing. Ibragimov and Solev (1969) (see also Ibragimov and Rozanov (1978), Section IV.4) gave a complete description of the stationary Gaussian processes satisfying (5.1.44).
Theorem 5.1.20 (Ibragimov and Solev (1969)). A stationary Gaussian process $\{X_t, t \in \mathbb{Z}\}$ is absolutely regular if and only if the process has a spectral density f(λ), representable as
$$f(\lambda) = |P(\lambda)|^2 a(\lambda),$$
where P(λ) is a polynomial with roots on the circle $|z| = 1$, and the coefficients $a_j$ of the Fourier series $\sum_{j \in \mathbb{Z}} a_j e^{i\lambda j}$ of the function $\log a(\lambda)$ satisfy
$$\sum_{j \in \mathbb{Z}} |j|\, \left|a_j\right| < \infty.$$

We denote the distribution function of $X_t$ by F(x), $x \in \mathbb{R}^d$, and consider a functional
$$\theta(F) = \int \cdots \int_{\mathbb{R}^{dp}} g\left(x_1, \ldots, x_p\right) dF(x_1) \cdots dF\left(x_p\right),$$
where $|\theta(F)| < \infty$ and $g\left(x_1, \ldots, x_p\right)$ is symmetric in its $p\, (\ge 1)$ arguments. As an estimator of θ(F), we define the U-statistic
$$U_N = \binom{N}{p}^{-1} \sum_{1 \le t_1 < \cdots < t_p \le N} g\left(X_{t_1}, \ldots, X_{t_p}\right), \qquad N \ge p.$$
Alternatively, we also consider von Mises's differentiable statistical functional $\theta(F_N)$, defined by
$$\theta(F_N) = \int \cdots \int_{\mathbb{R}^{dp}} g\left(x_1, \ldots, x_p\right) dF_N(x_1) \cdots dF_N\left(x_p\right) = \frac{1}{N^p} \sum_{t_1=1}^{N} \cdots \sum_{t_p=1}^{N} g\left(X_{t_1}, \ldots, X_{t_p}\right).$$
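A quick sketch of a U-statistic (our own code, hypothetical data): with the symmetric kernel $g(x_1, x_2) = (x_1 - x_2)^2/2$ and p = 2, θ(F) is the variance of F, and $U_N$ reduces to the usual unbiased sample variance.

```python
from itertools import combinations
from math import comb

def u_statistic(xs, g, p):
    """U_N = C(N, p)^{-1} * sum over 1 <= t_1 < ... < t_p <= N of g(X_{t_1}, ..., X_{t_p})."""
    return sum(g(*c) for c in combinations(xs, p)) / comb(len(xs), p)

g = lambda x1, x2: (x1 - x2) ** 2 / 2           # symmetric kernel; theta(F) = Var(X)
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations
mean = sum(xs) / len(xs)
s2 = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)   # unbiased sample variance
print(u_statistic(xs, g, 2), s2)                # the two numbers coincide
```

The von Mises version $\theta(F_N)$ would instead average g over all ordered pairs (with repetition), giving the divisor N rather than N − 1; the two differ by $O_P(N^{-1})$, consistent with Theorem 5.1.22 below.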

In what follows we assume that $\{X_t\}$ is a d-dimensional strictly stationary process which satisfies absolute regularity. For every m ($0 \le m \le p$), let
$$g_m(x_1, \ldots, x_m) = \int \cdots \int_{\mathbb{R}^{d(p-m)}} g\left(x_1, \ldots, x_p\right) dF(x_{m+1}) \cdots dF\left(x_p\right).$$
Then we have $g_0 = \theta(F)$ and $g_p = g$. Furthermore, let
$$\sigma^2 = \sigma^2(F) = \left[E\left\{g_1^2(X_1)\right\} - \theta^2(F)\right] + 2 \sum_{k=1}^{\infty} \left[E\left\{g_1(X_1)\, g_1(X_{k+1})\right\} - \theta^2(F)\right].$$

Now, we introduce the following assumption.


Assumption 5.1.21. We assume that for some r > 2,
(i)
$$\mu_r = \int \cdots \int_{\mathbb{R}^{dp}} \left|g\left(x_1, \ldots, x_p\right)\right|^r dF(x_1) \cdots dF\left(x_p\right) \le M_0 < \infty,$$
(ii) and for all integers $t_1 < \cdots < t_p$,
$$\nu_r = E\left|g\left(X_{t_1}, \ldots, X_{t_p}\right)\right|^r \le M_0 < \infty.$$

Under this assumption, Yoshihara (1976) proved the two weak convergence results below, namely, the central limit theorem (CLT) and the functional central limit theorem (FCLT).
Theorem 5.1.22 (Yoshihara (1976), Theorem 1). Suppose that there is a positive number δ such that Assumption 5.1.21 holds for r = 2 + δ and
$$\beta(N) = O\left(N^{-(2+\delta')/\delta'}\right)$$
for some δ′ (0 < δ′ < δ). If σ² > 0, then we have
$$\lim_{N\to\infty} P\left\{\frac{\sqrt{N}\, \{U_N - \theta(F)\}}{p\sigma} \le z\right\} = \Phi(z) \tag{5.1.45}$$
for all $z \in \mathbb{R}$. Furthermore, we have
$$\sqrt{N}\, |\theta(F_N) - U_N| \xrightarrow{P} 0;$$
hence, (5.1.45) holds with $U_N$ replaced by $\theta(F_N)$.


Secondly, let C be the space of all continuous real-valued functions on [0, 1], with the uniform topology, i.e., for $g, h \in C$,
$$\rho(g, h) = \sup_{0 \le u \le 1} |g(u) - h(u)|,$$
and let $W = \{W(u), 0 \le u \le 1\}$ be a standard Brownian motion. Suppose σ² > 0. For every $N \ge p$, define the random elements $X_N = \{X_N(u), 0 \le u \le 1\}$ and $X_N^* = \left\{X_N^*(u), 0 \le u \le 1\right\}$ in C by
$$X_N(u) = \begin{cases} 0 & \text{for } 0 \le u \le \frac{p-1}{N}, \\ \dfrac{k\{U_k - \theta(F)\}}{\sqrt{N}\, p\sigma} & \text{for } u = \frac{k}{N},\ p \le k \le N, \\ \text{linearly interpolated} & \text{for } u \in \left[\frac{k}{N}, \frac{k+1}{N}\right],\ p - 1 \le k \le N - 1, \end{cases}$$
and
$$X_N^*(u) = \begin{cases} 0 & \text{for } u = 0, \\ \dfrac{k\{\theta(F_k) - \theta(F)\}}{\sqrt{N}\, p\sigma} & \text{for } u = \frac{k}{N},\ 1 \le k \le N, \\ \text{linearly interpolated} & \text{for } u \in \left[\frac{k}{N}, \frac{k+1}{N}\right],\ 0 \le k \le N - 1. \end{cases}$$

Theorem 5.1.23 (Yoshihara (1976), Theorem 2). Suppose that there is a positive number δ such that Assumption 5.1.21 holds for r = 4 + δ and
$$\beta(N) = O\left(N^{-3(4+\delta')/(2+\delta')}\right)$$
for some δ′ (0 < δ′ < δ). Then, both $X_N$ and $X_N^*$ converge weakly to W. Furthermore, we have
$$\rho\left(X_N, X_N^*\right) \xrightarrow{P} 0.$$

5.2 Semiparametrically Efficient Estimation in Time Series


5.2.1 Introduction to Rank-Based Theory in Time Series
5.2.1.1 Testing for randomness against ARMA alternatives
Nonparametric methods have been developed for the analysis of univariate and multivariate observations without distributional assumptions (e.g., the Gaussian assumption). Nonparametric procedures are also desirable in time series analysis. Hallin et al. (1985a) carried out a systematic, time-series-oriented study of testing for randomness against ARMA alternatives. They proposed statistics of the form
$$S_N = (N - p)^{-1} \sum_{t=p+1}^{N} a_N\left(R_t^{(N)}, R_{t-1}^{(N)}, \ldots, R_{t-p}^{(N)}\right), \tag{5.2.1}$$
where $a_N(\cdot)$ is some given score function and $R_t^{(N)}$ is the rank of the observation at time t in the observed series $X^{(N)} = (X_1, \ldots, X_N)$. These statistics are so-called linear serial rank statistics. As special cases, for example, we can consider:
the run statistic (with respect to the median),
$$a_N(i_1, i_2) = \begin{cases} 1 & \text{if } (2i_1 - N - 1)(2i_2 - N - 1) < 0, \\ 0 & \text{if } (2i_1 - N - 1)(2i_2 - N - 1) \ge 0; \end{cases}$$

the turning point statistic,
$$a_N(i_1, i_2, i_3) = \begin{cases} 1 & \text{if } i_1 > i_2 < i_3, \\ 1 & \text{if } i_1 < i_2 > i_3, \\ 0 & \text{elsewhere;} \end{cases}$$
and Spearman's rank correlation coefficient (up to additive and multiplicative constants),
$$a_N\left(i_1, \ldots, i_{p+1}\right) = \frac{i_1\, i_{p+1}}{(N + 1)^2}.$$
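The linear serial rank statistics above can be sketched directly from their score functions (our own helper names, hypothetical data, ties assumed absent):

```python
def ranks(xs):
    """Ranks of xs (no ties): position of each value in the sorted sample, starting at 1."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def serial_rank_statistic(xs, a, p):
    """S_N = (N - p)^{-1} * sum_{t=p+1}^{N} a(R_t, R_{t-1}, ..., R_{t-p})."""
    r, N = ranks(xs), len(xs)
    return sum(a(*[r[t - j] for j in range(p + 1)]) for t in range(p, N)) / (N - p)

xs = [3.1, 0.2, 5.4, 1.7, 4.9, 2.6, 6.0]   # hypothetical observed series
N = len(xs)
run = lambda i1, i2: 1 if (2 * i1 - N - 1) * (2 * i2 - N - 1) < 0 else 0
turning = lambda i1, i2, i3: 1 if (i1 > i2 < i3) or (i1 < i2 > i3) else 0
print(serial_rank_statistic(xs, run, 1), serial_rank_statistic(xs, turning, 2))
```

For this alternating series the run statistic equals 5/6 and the turning point statistic equals 1, reflecting the strong negative serial dependence in the ranks.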
 
Put $U_t = F(X_t)$ and assume that there exists a function $J = J\left(v_{p+1}, v_p, \ldots, v_1\right)$, defined over $[0, 1]^{p+1}$, satisfying
$$0 < \int_{[0,1]^{p+1}} J\left(v_{p+1}, \ldots, v_1\right) dv_{p+1} \cdots dv_1 < \infty$$
and
$$\lim_{N\to\infty} E\left[\left\{J\left(U_{p+1}, \ldots, U_1\right) - a_N\left(R_{p+1}^{(N)}, \ldots, R_1^{(N)}\right)\right\}^2 \,\middle|\, H_0^{(N)}\right] = 0, \tag{5.2.2}$$
where $E\left(\cdot \mid H_0^{(N)}\right)$ indicates expectation under the null hypothesis, which is defined below. This assumption is satisfied most of the time when $a_N$ is of the form
$$a_N\left(i_1, i_2, \ldots, i_{p+1}\right) = J\left(\frac{i_1}{N+1}, \frac{i_2}{N+1}, \ldots, \frac{i_{p+1}}{N+1}\right).$$
Such a function J is called a score-generating function associated with the serial rank statistic $S_N$.
Let $\{\varepsilon_t, t \in \mathbb{Z}\}$ be i.i.d. random variables with $E(\varepsilon_t) = 0$, $E\left(\varepsilon_t^2\right) = \sigma^2$, and assume that $\varepsilon_t$ has a density f(x) which has a derivative f′(x) a.e. Similar to (5.1.21), we put
$$\psi(u) = -\frac{f'\left(F^{-1}(u)\right)}{f\left(F^{-1}(u)\right)},$$
that is, $\phi(x) = \psi(F(x))$ a.e. This function can also be written as $\phi(x) = -f'(x)/f(x)$ a.e.
Now, we assume the following condition on the innovation process $\{\varepsilon_t, t \in \mathbb{Z}\}$.
Assumption 5.2.1. (i) $\varepsilon_t$ has finite moments up to the third order.
(ii) The derivative f′(x) satisfies $\int_{\mathbb{R}} |f'(x)|\, dx < \infty$.
(iii) The Fisher information I(f) from (5.1.17) satisfies $0 < I(f) < \infty$.
(iv) φ(x) has a derivative φ′(x) a.e. which satisfies a Lipschitz condition $|\phi'(x) - \phi'(y)| < A|x - y|$ a.e.
Under these conditions, we have
$$\int_{\mathbb{R}} \phi(x) f(x)\, dx = 0 \quad \text{and} \quad \int_{\mathbb{R}} \phi^2(x) f(x)\, dx = I(f)$$
(see also (5.1.22) and (5.1.23)). It is also easy to check that
$$\int_{\mathbb{R}} x \phi(x) f(x)\, dx = 1 \quad \text{and} \quad \int_{\mathbb{R}} \phi'(x) f(x)\, dx = I(f).$$
If we put $f_\sigma(x) = \sigma^{-1} f(x/\sigma)$, then $\sigma^2 I(f_\sigma) = I(f)$; therefore, $\sigma^2 I(f)$ is invariant under scale transformations.
Let $a_1, \ldots, a_{p_1}, b_1, \ldots, b_{p_2}$ be an arbitrary $(p_1 + p_2)$-tuple of real numbers, and consider the sequence
$$X_t^{(N)} - \frac{1}{\sqrt{N}} \sum_{i=1}^{p_1} a_i X_{t-i}^{(N)} = \varepsilon_t + \frac{1}{\sqrt{N}} \sum_{i=1}^{p_2} b_i \varepsilon_{t-i}, \qquad t \in \mathbb{Z},\ N \ge 1, \tag{5.2.3}$$
of stochastic difference equations. For N sufficiently large, all the roots of the characteristic equation
$$z^{p_1} - \frac{1}{\sqrt{N}} \sum_{i=1}^{p_1} a_i z^{p_1 - i} = 0, \qquad z \in \mathbb{C},$$
lie inside the unit circle, and (5.2.3) generates a sequence of stationary processes $\left\{X_t := X_t^{(N)}, t \in \mathbb{Z}\right\}$. Denote by $x^{(N)} = \left(x_1^{(N)}, \ldots, x_N^{(N)}\right)'$ an observed realization of $X^{(N)} = \left(X_1^{(N)}, \ldots, X_N^{(N)}\right)'$.
Denote by $H_0^{(N)}$ the sequence of simple (null) hypotheses under which $a_1 = \cdots = a_{p_1} = b_1 = \cdots = b_{p_2} = 0$. Then, the processes $\left\{X_t^{(N)}\right\}$ all coincide with the white noise process $\{\varepsilon_t\}$, which has the likelihood function
$$l_0^N\left(x^{(N)}\right) = \prod_{t=1}^{N} f\left(x_t^{(N)}\right).$$
On the other hand, denote by $H_1^{(N)}$ the corresponding sequence of alternative hypotheses, under which both $a_{p_1}$ and $b_{p_2}$ are different from zero. Then, the processes $\left\{X_t^{(N)}\right\}$ are ARMA($p_1, p_2$) processes, whose likelihood function is denoted by $l_1^N\left(x^{(N)}\right)$ (the exact expression of this likelihood function can be found in Hallin et al. (1985a), (3.3)).
In what follows, for simplicity, we write $x, x_t, X$ and $X_t$ for $x^{(N)}, x_t^{(N)}, X^{(N)}$ and $X_t^{(N)}$. Now, consider the likelihood ratio
$$L_N(x) = \begin{cases} l_1^N(x)/l_0^N(x) & \text{if } l_0^N(x) > 0, \\ 1 & \text{if } l_1^N(x) = l_0^N(x) = 0, \\ \infty & \text{if } l_1^N(x) > l_0^N(x) = 0; \end{cases}$$
then, we have the following lemma.


Lemma 5.2.2 (Hallin et al. (1985a), Proposition 3.1). Under $H_0^{(N)}$, $\log L_N(X) = L_0^N(X) - d^2/2 + o_P(1)$, where
$$L_0^N(X) = \frac{1}{\sqrt{N}} \sum_{t=p+1}^{N} \phi(X_t) \sum_{i=1}^{p} d_i X_{t-i},$$
$$d_i = \begin{cases} a_i + b_i & 1 \le i \le \min\{p_1, p_2\}, \\ a_i & p_2 < i \le p_1 \text{ if } p_2 < p_1, \\ b_i & p_1 < i \le p_2 \text{ if } p_1 < p_2, \end{cases}$$
$$p = \max\{p_1, p_2\} \quad \text{and} \quad d^2 = \sum_{i=1}^{p} d_i^2\, \sigma^2 I(f).$$
Furthermore, $L_0^N(X)$ is asymptotically normal, with mean zero and variance d².
From LeCam’s first lemma (see, e.g., Sidak et al. (1999), Corollary 7.1.1), we can immediately
see that this lemma implies that H1(N) is contiguous to H0(N) .
Next, we consider the joint asymptotic normality of
$$\begin{pmatrix} \sqrt{N}\, \{S_N(X) - m_N\} \\ \log L_N(X) \end{pmatrix}$$
under $H_0^{(N)}$, where
$$m_N = E\left(S_N(X) \mid H_0^{(N)}\right) = \frac{1}{N(N-1)\cdots(N-p)} \sum_{1 \le i_1, \ldots, i_{p+1} \le N} a_N\left(i_1, \ldots, i_{p+1}\right). \tag{5.2.4}$$

Define
$$\mathcal{S}_N(X) = \frac{1}{\sqrt{N - p}} \sum_{t=p+1}^{N} J\left(F(X_t), F(X_{t-1}), \ldots, F\left(X_{t-p}\right)\right)$$
and
$$M_N(X) = \frac{\sqrt{N - p}}{N(N-1)\cdots(N-p)} \sum_{1 \le t_1, \ldots, t_{p+1} \le N} J\left(F\left(X_{t_1}\right), \ldots, F\left(X_{t_{p+1}}\right)\right),$$
and let
$$\Delta_N(X) = \sqrt{N - p}\, \{S_N(X) - m_N\} - \{\mathcal{S}_N(X) - M_N(X)\}.$$
Then, it can be shown that $\lim_{N\to\infty} E\left\{\Delta_N^2(X)\right\} = 0$ (see Hallin et al. (1985a), Section 4.1). Therefore, $\sqrt{N - p}\, \{S_N(X) - m_N\}$ and $\{\mathcal{S}_N(X) - M_N(X)\}$ are asymptotically equivalent (under $H_0^{(N)}$).
Now, we consider U-statistics which are asymptotically equivalent to $N^{-1/2}\{\mathcal{S}_N(X) - M_N(X)\}$ and $N^{-1/2} L_0^N(X)$. Define the (p + 1)-dimensional random variables
$$Y_t = \left(Y_{t,1}, \ldots, Y_{t,p+1}\right)' = \left(U_t, \ldots, U_{t-p}\right)', \qquad p + 1 \le t \le N.$$
The $Y_t$'s are identically distributed (uniformly over $[0, 1]^{p+1}$, under $H_0^{(N)}$), but, of course, they are not independent. However, they constitute an absolutely regular process since they are p-dependent (see Yoshihara (1976) and Section 5.1.4).
Define
$$G(y) = G\left(y_1, \ldots, y_{p+1}\right) = \phi\left(F^{-1}(y_1)\right) \sum_{i=1}^{p} d_i F^{-1}\left(y_{i+1}\right)$$
and
$$\Phi^L\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = \sum_{j=1}^{p+1} G\left(Y_{t_j}\right) / (p + 1) = \sum_{j=1}^{p+1} G\left(Y_{t_j,1}, \ldots, Y_{t_j,p+1}\right) / (p + 1),$$

where $\Phi^L$ is the kernel of a U-statistic which is asymptotically equivalent to $N^{-1/2} L_0^N$. Similarly, we define the kernels
$$\Phi^S\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = \sum_{j=1}^{p+1} J\left(Y_{t_j}\right) / (p + 1) = \sum_{j=1}^{p+1} J\left(Y_{t_j,1}, \ldots, Y_{t_j,p+1}\right) / (p + 1)$$
for $N^{-1/2} \mathcal{S}_N$, and
$$\Phi^M\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = \frac{1}{(p+1)!} \sum_{j} J\left(Y_{j_1,1}, \ldots, Y_{j_{p+1},1}\right)$$
for $N^{-1/2} M_N$, where the summation $\sum_j$ extends over all possible (p + 1)! permutations $\left(j_1, \ldots, j_{p+1}\right)$ of $\left(t_1, \ldots, t_{p+1}\right)$. The corresponding U-statistics satisfy

$$\binom{N-p}{p+1}^{-1} \sum_{p+1 \le t_1 < \cdots < t_{p+1} \le N} \Phi^L\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = N^{-1/2} L_0^N + o_P\left(N^{-1/2}\right),$$
$$\binom{N-p}{p+1}^{-1} \sum_{p+1 \le t_1 < \cdots < t_{p+1} \le N} \Phi^S\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = N^{-1/2} \mathcal{S}_N + o_P\left(N^{-1/2}\right),$$
$$\binom{N-p}{p+1}^{-1} \sum_{p+1 \le t_1 < \cdots < t_{p+1} \le N} \Phi^M\left(Y_{t_1}, \ldots, Y_{t_{p+1}}\right) = \frac{1}{(N-p)\cdots(N-2p)} \sum_{p+1 \le t_1, \ldots, t_{p+1} \le N} J\left(U_{t_1}, \ldots, U_{t_{p+1}}\right) = N^{-1/2} M_N + o_P\left(N^{-1/2}\right).$$

Therefore, we can see that, for arbitrary coefficients α and β,
$$U_N^{\alpha\beta} = N^{-1/2}\left[\alpha\, \{\mathcal{S}_N(X) - M_N(X)\} + \beta L_0^N(X)\right]$$
is (up to $o_P\left(N^{-1/2}\right)$ terms) a sequence of U-statistics with kernel
$$\Phi^{\alpha\beta} = \alpha\left(\Phi^S - \Phi^M\right) + \beta \Phi^L.$$

From Theorem 1 of Yoshihara (1976) (see Section 4.2 of Hallin et al. (1985a) and Section 5.1.4), it follows that $\sqrt{N}\left\{U_N^{\alpha\beta} - E\left(U_N^{\alpha\beta}\right)\right\}$ is asymptotically normal with mean zero and variance
$$\sigma_{\alpha\beta}^2 = \alpha^2 V^2 + 2\alpha\beta \sum_{i=1}^{p} d_i C_i + \beta^2 d^2,$$

where
$$ V^2 = \int_{[0,1]^{p+1}} J^{*2}(v_{p+1},\ldots,v_1)\, dv_1 \cdots dv_{p+1} + 2 \sum_{j=1}^{p} \int_{[0,1]^{p+1+j}} J^*(v_{p+1},\ldots,v_1)\, J^*(v_{p+1+j},\ldots,v_{1+j})\, dv_1 \cdots dv_{p+1+j} $$
and
$$ C_i = \int_{[0,1]^{p+1}} J^*(v_{p+1},\ldots,v_1) \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(v_{p+1-j})\right) F^{-1}(v_{p+1-j-i})\, dv_1 \cdots dv_{p+1}, \qquad (5.2.5) $$
with
$$ J^*(u_{p+1},\ldots,u_1) = J(u_{p+1},\ldots,u_1) - \sum_{k=1}^{p+1} \int_{[0,1]^{p}} J(v_{p+1},\ldots,v_{k+1},u_k,v_{k-1},\ldots,v_1)\, dv_1 \cdots dv_p + p \int_{[0,1]^{p+1}} J(v_{p+1},\ldots,v_1)\, dv_1 \cdots dv_{p+1}. \qquad (5.2.6) $$
Obviously, $E\{J^*(U_{p+1},\ldots,U_1)\} = 0$. Now, from the Cramér–Wold device, we see that, under $H_0^{(N)}$,
$$ \begin{pmatrix} \sqrt{N}\{S_N(X) - m_N\} \\ \log L_N(X) \end{pmatrix} $$
is asymptotically normal with mean $(0, -d^2/2)'$ and covariance matrix
$$ \begin{pmatrix} V^2 & \sum_{i=1}^{p} d_i C_i \\ \sum_{i=1}^{p} d_i C_i & d^2 \end{pmatrix}. $$

Finally, applying Le Cam's third lemma (see, e.g., Sidak et al. (1999), Corollary 7.1.3), we can immediately derive the following results.
Theorem 5.2.3. Under $H_1^{(N)}$, $\sqrt{N}\{S_N(X) - m_N\}$ is asymptotically normal with mean $\sum_{i=1}^{p} d_i C_i$ and variance $V^2$.
Note that the score-generating function
$$ \left\{ \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(v_{p+1-j})\right) F^{-1}(v_{p+1-j-i});\ i = 1,\ldots,p \right\} \qquad (5.2.7) $$
constitutes a $p$-tuple of $L_2$-functions with norms
$$ \left[ \int_{[0,1]^{p+1}} \left\{ \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(v_{p+1-j})\right) F^{-1}(v_{p+1-j-i}) \right\}^2 dv_{p+1} \cdots dv_1 \right]^{1/2} = \sqrt{(p+1-i)\,\sigma^2 I(f)}, \qquad i = 1,\ldots,p. $$

From (5.2.5), it follows that
$$ J_0^*(u_{p+1},\ldots,u_1) = \{\sigma^2 I(f)\}^{-1} \sum_{i=1}^{p} \frac{C_i}{(p+1-i)} \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(u_{p+1-j})\right) F^{-1}(u_{p+1-j-i}) \qquad (5.2.8) $$
is the projection of $J^*(u_{p+1},\ldots,u_1)$ from (5.2.6) onto the linear $L_2$-space spanned by the functions from (5.2.7). Therefore, any square integrable score-generating function $J^*$ of the form (5.2.6) can be decomposed into $J^* = J_0^* + J_\perp^*$, where $J_\perp^*$ satisfies
$$ \int_{[0,1]^{p+1}} J_\perp^*(v_{p+1},\ldots,v_1) \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(v_{p+1-j})\right) F^{-1}(v_{p+1-j-i})\, dv_{p+1} \cdots dv_1 = 0 $$

for $i = 1,\ldots,p$. This leads to the same values of the $C_i$'s whether we use $J^*$ or $J_0^*$ in (5.2.5). Denote by $S_N$, $S_N^0$ and $S_N^\perp$ the linear serial rank statistics associated with $J^*$, $J_0^*$ and $J_\perp^*$; denote by $m_N$, $m_N^0$ and $m_N^\perp$ their means; denote by $V^2$, $V_0^2$ and $V_\perp^2$ the asymptotic variances of $\sqrt{N}(S_N - m_N)$, $\sqrt{N}(S_N^0 - m_N^0)$ and $\sqrt{N}(S_N^\perp - m_N^\perp)$, respectively. It is easy to check that
$$ \sqrt{N}(S_N - m_N) = \sqrt{N}\left(S_N^0 - m_N^0\right) + \sqrt{N}\left(S_N^\perp - m_N^\perp\right) + o_P(1), $$
with $\lim_{N\to\infty} N E\{(S_N^0 - m_N^0)(S_N^\perp - m_N^\perp)\} = 0$ (under $H_0^{(N)}$). Hence we have $V^2 = V_0^2 + V_\perp^2$.
Now, we can evaluate the asymptotic relative efficiency (ARE) of two linear serial rank statistics. Let $S_N^{(1)}$ and $S_N^{(2)}$ have asymptotic normal distributions under $H_1^{(N)}$ with means $\sum_{i=1}^{p} d_i C_i^{(1)}$ and $\sum_{i=1}^{p} d_i C_i^{(2)}$, and with variances $V^{(1)2}$ and $V^{(2)2}$. Then, the ARE of $S_N^{(1)}$ with respect to $S_N^{(2)}$ is given by
$$ e\left(S_N^{(1)}, S_N^{(2)}\right) = \left( \frac{V^{(2)} \sum_{i=1}^{p} d_i C_i^{(1)}}{V^{(1)} \sum_{i=1}^{p} d_i C_i^{(2)}} \right)^2 $$
(see also (5.1.5)). We shall denote by $H_d^{(N)}$ a sequence of alternatives characterized by the coefficients $d = (d_1,\ldots,d_p)'$. Now, we have the following result.
Lemma 5.2.4 (Hallin et al. (1985a), Lemma 5.1). Let $S_N$ and $S_N^0$ be linear serial rank statistics with score-generating functions $J^*(u_{p+1},\ldots,u_1)$ (from (5.2.6)) and $J_0^*(u_{p+1},\ldots,u_1)$ (from (5.2.8)), respectively. Then, we have
$$ e\left(S_N^0, S_N\right) = \frac{V^2 \left(\sum_{i=1}^{p} d_i C_i\right)^2}{V_0^2 \left(\sum_{i=1}^{p} d_i C_i\right)^2} = \frac{V_0^2 + V_\perp^2}{V_0^2} \ge 1 $$
for any alternative $H_d^{(N)}$.
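As a quick numerical illustration of the lemma, the ARE formula above can be evaluated directly. The following is a minimal Python sketch (the function name `are` and the example values are ours, not from the text); it checks that, when both statistics share the same $C_i$'s and $V^2 = V_0^2 + V_\perp^2$, the efficiency reduces to $(V_0^2 + V_\perp^2)/V_0^2 \ge 1$:

```python
import math

def are(d, c1, c2, v1_sq, v2_sq):
    """Pitman ARE of S_N^(1) relative to S_N^(2):
    e = ( V^(2) * sum(d_i C_i^(1)) / ( V^(1) * sum(d_i C_i^(2)) ) )^2."""
    num = math.sqrt(v2_sq) * sum(di * ci for di, ci in zip(d, c1))
    den = math.sqrt(v1_sq) * sum(di * ci for di, ci in zip(d, c2))
    return (num / den) ** 2

# Lemma 5.2.4: S^0 (variance V0^2) versus S (variance V0^2 + Vperp^2), same C_i:
d = [0.5, -0.3]
c = [1.2, 0.8]
v0_sq, vperp_sq = 2.0, 0.7
e = are(d, c, c, v0_sq, v0_sq + vperp_sq)
assert abs(e - (v0_sq + vperp_sq) / v0_sq) < 1e-12   # e = (V0^2 + Vperp^2)/V0^2 >= 1
```

With the hypothetical values above, $e = 2.7/2.0 = 1.35 \ge 1$, as the lemma asserts.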


 
From this lemma we can restrict ourselves to the score functions $J_0^*(u_{p+1},\ldots,u_1)$ of the form (5.2.8). Straightforward calculation leads to
$$ V_0^2 = \{\sigma^2 I(f)\}^{-1} \sum_{i=1}^{p} C_i^2. $$
Thus, the values of $(C_1,\ldots,C_p)'$ that maximize
$$ \frac{\left(\sum_{i=1}^{p} d_i C_i\right)^2}{V_0^2} = \frac{\sigma^2 I(f) \left(\sum_{i=1}^{p} d_i C_i\right)^2}{\sum_{i=1}^{p} C_i^2} $$
are parallel to $(d_1,\ldots,d_p)'$.
 
A test statistic $\bar{S}_N$ such that $e(\bar{S}_N, S_N) \ge 1$ for any linear serial rank statistic $S_N$ is asymptotically the most efficient statistic (in Pitman's sense) within the class of linear serial rank statistics for testing randomness $H_0^{(N)}$ against the ARMA alternative $H_d^{(N)}$. Therefore, we can establish the following result.
Theorem 5.2.5 (Hallin et al. (1985a), Proposition 5.1). An asymptotically most efficient linear serial rank test for $H_0^{(N)}$ against $H_d^{(N)}$ is any statistic $S_N^d$ with score-generating function (up to additive and multiplicative constants) given by
$$ J_d^*(u_{p+1},\ldots,u_1) = \sum_{i=1}^{p} \frac{d_i}{(p+1-i)} \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(u_{p+1-j})\right) F^{-1}(u_{p+1-j-i}). $$
It should also be noted that, under $H_{\tilde d}^{(N)}$, $\tilde d = (\tilde d_1,\ldots,\tilde d_p)' \in \mathbb{R}^p$, $\sqrt{N}(S_N^d - m_N^d)$ is asymptotically normal with mean $\sigma^2 I(f) \sum_{i=1}^{p} d_i \tilde d_i$ and variance $\sigma^2 I(f) \sum_{i=1}^{p} d_i^2$.
As an alternative approach, dependence coefficients that measure all types of dependence between random vectors $X$ and $Y$ in arbitrary dimension have recently been proposed. Distance correlation and distance covariance (Székely et al. (2007)), and Brownian covariance (Székely and Rizzo (2009)), provide a new approach to the problem of measuring dependence and testing the joint independence of random vectors in arbitrary dimension.
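To make this concrete, the sample distance correlation of Székely et al. (2007) can be computed from doubly centred pairwise distance matrices. The following Python sketch (standard library only; function names are ours) illustrates the computation for univariate samples and shows that it detects a nonlinear dependence that ordinary correlation would largely miss:

```python
import math
import random

def _centered_dist(z):
    # pairwise distance matrix with row, column and grand means removed
    n = len(z)
    d = [[abs(z[k] - z[l]) for l in range(n)] for k in range(n)]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

def distance_correlation(x, y):
    # sample dCor = dCov / sqrt(dVar_x * dVar_y), as in Szekely et al. (2007)
    n = len(x)
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n ** 2
    dvar_x = sum(a * a for r in A for a in r) / n ** 2
    dvar_y = sum(b * b for r in B for b in r) / n ** 2
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100)]
y = [xi ** 2 + 0.1 * random.gauss(0, 1) for xi in x]  # dependent on x, nearly uncorrelated
```

Here `distance_correlation(x, y)` is clearly positive even though the linear correlation of `x` and `y` is near zero; it is (close to) zero only under independence.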
146 PORTFOLIO ESTIMATION BASED ON RANK STATISTICS
5.2.1.2 Testing an ARMA model against other ARMA alternatives
Testing for randomness is also important in the econometric field for checking hypotheses such as market efficiency, the life-cycle permanent income hypothesis, etc. However, testing a given ARMA model against other ARMA models is more important in the various identification and validation steps of a time series model building procedure. Regarding testing one ARMA model against another, the most popular approach is based on the Akaike information criterion (AIC): when comparing two models, the one with the lower AIC is generally preferred. Alternatively, Hallin and Puri (1988, 1992) considered the following rank-based methods.
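For instance, the AIC comparison can be carried out directly for autoregressive candidates fitted by conditional least squares. The following Python sketch (function names and the simulated example are ours, and the Gaussian AIC is computed only up to an additive constant) prefers the true AR(2) specification over an AR(1) fit:

```python
import math
import random

def solve(M, v):
    # Gaussian elimination with partial pivoting (small systems only)
    k = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    out = [0.0] * k
    for c in reversed(range(k)):
        out[c] = (M[c][k] - sum(M[c][j] * out[j] for j in range(c + 1, k))) / M[c][c]
    return out

def ar_aic(x, p):
    # conditional least-squares AR(p) fit; Gaussian AIC up to an additive constant
    n = len(x)
    rows = [[x[t - j] for j in range(1, p + 1)] for t in range(p, n)]
    y = x[p:]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    phi = solve(XtX, Xty)
    resid = [yi - sum(phi[j] * r[j] for j in range(p)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / len(resid)
    return len(resid) * math.log(sigma2) + 2 * (p + 1)

random.seed(1)
x = [0.0, 0.0]
for _ in range(500):
    x.append(0.6 * x[-1] - 0.3 * x[-2] + random.gauss(0, 1))  # a true AR(2) path
```

In this simulation `ar_aic(x, 2) < ar_aic(x, 1)`, so the AIC correctly selects the AR(2) model.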
For simplicity, we assume that all densities $f, g$ are absolutely continuous with distribution functions $F, G$, so that the derivatives $f'(x) = df(x)/dx$ exist for almost all $x$. Furthermore, assume that all of them are of the form
$$ f(x) = f_\sigma(x) = \frac{1}{\sigma} f_1\!\left(\frac{x}{\sigma}\right) > 0, \qquad x \in \mathbb{R}, $$
with
$$ \int_{\mathbb{R}} x f_1(x)\, dx = 0, \qquad \int_{\mathbb{R}} x^2 f_1(x)\, dx = 1. $$
Defining $\phi_f(x) = -f'(x)/f(x)$, we also assume that the Fisher information
$$ \sigma^2 I(f) = \sigma^2 \int_{\mathbb{R}} \phi_f^2(x) f(x)\, dx = \int_{\mathbb{R}} \phi_{f_1}^2(x) f_1(x)\, dx = I(f_1) $$

is finite. Since $f$ is always strictly positive, $F$ is strictly increasing, and the inverse $F^{-1}$ is well defined. Denote by $H^{(N)}(A, B; f)$ the hypothesis under which the observed series $X^{(N)} = (X_1,\ldots,X_N)'$ is generated by the ARMA($p_1, q_1$) model
$$ A(L) X_t = B(L) \varepsilon_t, \qquad t \in \mathbb{Z}, \qquad (5.2.9) $$
where $\{\varepsilon_t\}$ is a white noise process with density function $f$ and
$$ A(L) = 1 - A_1 L - \cdots - A_{p_1} L^{p_1}, \qquad B(L) = 1 + B_1 L + \cdots + B_{q_1} L^{q_1}. $$


(N)
Assume that under the null hypothesis $f$ remains unspecified, which is denoted by $H^{(N)}(A, B; \cdot) = \cup_f H^{(N)}(A, B; f)$. Consider the filtered series $Z^{(N)} = (Z_1,\ldots,Z_N)'$ associated with the observed series $X^{(N)}$, where $Z_t = \{B(L)\}^{-1} A(L) X_t$ is the filtered process. Clearly, $X^{(N)}$ is generated by the ARMA($p_1, q_1$) model (5.2.9) if and only if $Z^{(N)}$ is a white noise series.
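The filtering step $Z_t = \{B(L)\}^{-1} A(L) X_t$ is a simple recursion once rewritten as $B(L) Z_t = A(L) X_t$. Below is a minimal Python sketch (our own function names, with pre-sample values set to zero); as a sanity check, filtering an ARMA(1,1) path generated with zero initial conditions by its own polynomials recovers the innovations exactly:

```python
import random

def filter_series(x, A, B):
    """Z_t = {B(L)}^{-1} A(L) X_t with A(L) = 1 - A_1 L - ... - A_p L^p
    and B(L) = 1 + B_1 L + ... + B_q L^q; pre-sample X and Z are zero."""
    z = []
    for t in range(len(x)):
        val = x[t]
        for i, a in enumerate(A, start=1):   # A(L) X_t part
            if t - i >= 0:
                val -= a * x[t - i]
        for j, b in enumerate(B, start=1):   # invert B(L) recursively
            if t - j >= 0:
                val -= b * z[t - j]
        z.append(val)
    return z

random.seed(2)
eps = [random.gauss(0, 1) for _ in range(200)]
x = []
for t in range(200):
    prev_x = x[t - 1] if t >= 1 else 0.0
    prev_e = eps[t - 1] if t >= 1 else 0.0
    x.append(0.5 * prev_x + eps[t] + 0.4 * prev_e)   # ARMA(1,1), zero initial conditions

z = filter_series(x, [0.5], [0.4])
```

Here `z` coincides with `eps` up to floating-point error, confirming that the filter whitens the series exactly when the true polynomials are used.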


Here, the alternative hypotheses of interest are those under which $X^{(N)}$ is an observed series from some ARMA model
$$ \alpha(L) X_t = \beta(L) \varepsilon_t, \qquad t \in \mathbb{Z}, $$

distinct from (5.2.9). In order to investigate locally optimal procedures, Hallin and Puri (1988, 1992) considered sequences of alternatives that are contiguous to the null hypotheses. Let
$$ \alpha^{(N)}(L) X_t = \beta^{(N)}(L) \varepsilon_t, \qquad t \in \mathbb{Z}, \qquad (5.2.10) $$
$$ \alpha^{(N)}(L) = 1 - \alpha_1^{(N)} L - \cdots - \alpha_{p_2}^{(N)} L^{p_2}, \qquad \beta^{(N)}(L) = 1 + \beta_1^{(N)} L + \cdots + \beta_{q_2}^{(N)} L^{q_2}, $$
be a sequence of ARMA($p_2, q_2$) models, with $p_2 \ge p_1$, $q_2 \ge q_1$, and
$$ \alpha_i^{(N)} = \begin{cases} A_i + N^{-1/2}\gamma_i & 1 \le i \le p_1, \\ N^{-1/2}\gamma_i & p_1 + 1 \le i \le p_2, \end{cases} \qquad \beta_i^{(N)} = \begin{cases} B_i + N^{-1/2}\delta_i & 1 \le i \le q_1, \\ N^{-1/2}\delta_i & q_1 + 1 \le i \le q_2. \end{cases} $$
Denote $A = (A_1,\ldots,A_{p_1},0,\ldots,0)' \in \mathbb{R}^{p_2}$, $B = (B_1,\ldots,B_{q_1},0,\ldots,0)' \in \mathbb{R}^{q_2}$, $\theta = (A', B')' \in \mathbb{R}^{p_2+q_2}$. And let $\Theta$ be the open subset for which (5.2.9) constitutes a valid, causal and invertible ARMA model of orders $p_1$ and $q_1$, i.e., satisfies $A_{p_1} \ne 0$, $B_{q_1} \ne 0$, no common roots, causality and invertibility. Furthermore, let $\gamma = (\gamma_1,\ldots,\gamma_{p_2})' \in \mathbb{R}^{p_2}$, $\delta = (\delta_1,\ldots,\delta_{q_2})' \in \mathbb{R}^{q_2}$ and $\tau = (\gamma', \delta')' \in \mathbb{R}^{p_2+q_2}$. Denote by $H^{(N)}(\theta + N^{-1/2}\tau; f)$ a sequence of hypotheses under which $X^{(N)}$ is generated by some ARMA($p, q$) model with $p_1 \le p \le p_2$, $q_1 \le q \le q_2$, coefficients of the form $A + N^{-1/2}\gamma$, $B + N^{-1/2}\delta$ and innovation density $f$, which approaches $H^{(N)}(\theta; f)$ in some sense. If $X^{(N)}$ is an observed series from (5.2.10), then the filtered series $Z^{(N)}$ is generated by the model
$$ \alpha^{(N)}(L) \{A(L)\}^{-1} Z_t = \beta^{(N)}(L) \{B(L)\}^{-1} \varepsilon_t, \qquad t \in \mathbb{Z}. $$
Using the relation
$$ \alpha^{(N)}(L) = A(L) - N^{-1/2}\gamma(L), \qquad \beta^{(N)}(L) = B(L) + N^{-1/2}\delta(L), $$
with $\gamma(L) = \sum_{i=1}^{p_2} \gamma_i L^i$ and $\delta(L) = \sum_{i=1}^{q_2} \delta_i L^i$, we can rewrite
$$ Z_t - N^{-1/2} a(L) Z_t = \varepsilon_t + N^{-1/2} b(L) \varepsilon_t, \qquad (5.2.11) $$


where
$$ a(L) = \sum_{i=1}^{\infty} a_i L^i = \gamma(L) \{A(L)\}^{-1}, \qquad b(L) = \sum_{i=1}^{\infty} b_i L^i = \delta(L) \{B(L)\}^{-1}. $$

Let $\tilde{a}_i$ and $\tilde{b}_i$ be the Green's functions associated with $A(L)$ and $B(L)$, namely,
$$ \sum_{i=0}^{\infty} \tilde{a}_i L^i = \{A(L)\}^{-1} \quad \text{and} \quad \sum_{i=0}^{\infty} \tilde{b}_i L^i = \{B(L)\}^{-1}. $$
Then we have the relations for the coefficients
$$ a_i = \sum_{j=1}^{\min(p_2,i)} \gamma_j \tilde{a}_{i-j} \quad \text{and} \quad b_i = \sum_{j=1}^{\min(q_2,i)} \delta_j \tilde{b}_{i-j}. \qquad (5.2.12) $$
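The Green's functions and relation (5.2.12) are easy to compute by power-series recursion. A small Python sketch (our own function name `green`; the example coefficients are hypothetical) computes $\tilde a_i$ from $A(L)$ and then builds $a_i = \sum_{j=1}^{\min(p_2,i)} \gamma_j \tilde a_{i-j}$; the result can be verified against the defining identity $A(L)\,a(L) = \gamma(L)$:

```python
def green(coeffs, n):
    # power-series coefficients g_0, g_1, ... of {A(L)}^{-1}
    # for A(L) = 1 - c_1 L - ... - c_p L^p  (pass [-B_1, ..., -B_q] to invert
    # a polynomial written as 1 + B_1 L + ... + B_q L^q)
    g = [1.0]
    for i in range(1, n):
        g.append(sum(coeffs[j - 1] * g[i - j]
                     for j in range(1, min(len(coeffs), i) + 1)))
    return g

A = [0.5, -0.2]        # A(L) = 1 - 0.5 L + 0.2 L^2
gamma = [0.3, 0.1]     # gamma(L) = 0.3 L + 0.1 L^2
atilde = green(A, 12)  # Green's function of A(L)

# relation (5.2.12): a_i = sum_{j=1}^{min(p2, i)} gamma_j * atilde_{i-j}  (a[0] is a_1)
a = [sum(gamma[j - 1] * atilde[i - j] for j in range(1, min(len(gamma), i) + 1))
     for i in range(1, 12)]
```

One can check that $a_1 = \gamma_1$, $a_2 - 0.5\,a_1 = \gamma_2$, and $a_i - 0.5\,a_{i-1} + 0.2\,a_{i-2} = 0$ for $i \ge 3$, i.e., $A(L)\,a(L) = \gamma(L)$ term by term.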

Consider the white noise hypothesis with specified density $f$, $H^{(N)}(0; f)$, and the generalized alternative ARMA hypothesis $H^{(N)}(N^{-1/2}c; f)$ with $a = (a_1, a_2, \ldots)'$, $b = (b_1, b_2, \ldots)'$ and $c = (a', b')'$, under which $Z^{(N)}$ is given by the realization of the generalized ARMA process (5.2.11). Now, we can reduce testing $H^{(N)}(\theta; f)$ against $H^{(N)}(\theta + N^{-1/2}\tau; f)$ to the problem of testing $H^{(N)}(0; f)$ against $H^{(N)}(N^{-1/2}c; f)$.
Denote by $L^{(N)}_{\theta;f}$ and $L^{(N)}_{\theta+N^{-1/2}\tau;f}$ the likelihood functions of $X^{(N)}$ under $H^{(N)}(\theta; f)$ and $H^{(N)}(\theta + N^{-1/2}\tau; f)$, respectively. Furthermore, let $L^{(N)}_{0;f}$ and $L^{(N)}_{N^{-1/2}c;f}$ be the likelihood functions of $Z^{(N)}$ under $H^{(N)}(0; f)$ and $H^{(N)}(N^{-1/2}c; f)$, respectively. Then, the log-likelihood ratios $\Lambda^{(N)}_{\theta,\tau;f}$ and $\Lambda^{(N)}_{0,c;f}$ satisfy the relation
$$ \Lambda^{(N)}_{\theta,\tau;f} := \log \frac{L^{(N)}_{\theta+N^{-1/2}\tau;f}\left(X^{(N)}\right)}{L^{(N)}_{\theta;f}\left(X^{(N)}\right)} = \Lambda^{(N)}_{0,c;f} := \log \frac{L^{(N)}_{N^{-1/2}c;f}\left(Z^{(N)}\right)}{L^{(N)}_{0;f}\left(Z^{(N)}\right)}. $$

Then, the result of Lemma 5.2.2 can be extended as follows.


Lemma 5.2.6. (Proposition 2.2 of Hallin and Puri (1988)) Under $H^{(N)}(0; f)$, the log-likelihood ratio $\Lambda^{(N)}_{0,c;f}$ is asymptotically normal, with mean $-\frac{1}{2} \sum_{i=1}^{\infty} (a_i + b_i)^2 \sigma^2 I(f)$ and variance $\sum_{i=1}^{\infty} (a_i + b_i)^2 \sigma^2 I(f)$. Therefore, the sequences of hypotheses $H^{(N)}(N^{-1/2}c; f)$ and $H^{(N)}(0; f)$ are contiguous.
Now, we consider again the linear serial rank statistics $S_N$ of order $p$ defined in (5.2.1), whose expectation under $H^{(N)}(0; f)$ is given by $m_N$ in (5.2.4). We assume that there exists a function $J = J(v_{p+1}, v_p, \ldots, v_1)$, defined over $[0,1]^{p+1}$, satisfying
$$ 0 < \int_{[0,1]^{p+1}} \left| J(v_{p+1},\ldots,v_1) \right|^{2+\delta} dv_{p+1} \cdots dv_1 < \infty $$
for some $\delta > 0$, and the approximation relationship (5.2.2) under $H^{(N)}(0; f)$. Furthermore, we can assume without loss of generality that
$$ \int_{[0,1]^{p}} J(v_{p+1},\ldots,v_1) \prod_{j \ne i} dv_j = 0, \qquad i = 1,\ldots,p+1. $$
Then, we have the following generalization of the above result.


Proposition 5.2.7. (Proposition 2.3 of Hallin and Puri (1988)) The linear serial rank statistic $S_N$ of order $p$ satisfies that $\sqrt{N}(S_N - m_N)$ is asymptotically normal, with mean 0 and variance $V^2$ under $H^{(N)}(0; f)$, while it is asymptotically normal with mean $\sum_{i=1}^{p} (a_i + b_i) C_i$ and variance $V^2$ under $H^{(N)}(N^{-1/2}c; f)$, where
$$ V^2 = \int_{[0,1]^{2p+1}} \left\{ J^2(v_{p+1},\ldots,v_1) + 2 \sum_{j=1}^{p} J(v_{p+1},\ldots,v_1)\, J(v_{p+1+j},\ldots,v_{1+j}) \right\} dv_1 \cdots dv_{2p+1} $$
and
$$ C_i = \int_{[0,1]^{p+1}} J(v_{p+1},\ldots,v_1) \sum_{j=0}^{p-i} \phi\!\left(F^{-1}(v_{1+j})\right) F^{-1}(v_{1+j+i})\, dv_1 \cdots dv_{p+1}. $$
 
Note that this result implies that the asymptotic distribution of $S_N$ under $H^{(N)}(N^{-1/2}c; f)$ is exactly the same as that under the $p$th-order truncation of $H^{(N)}(N^{-1/2}c; f)$, which is obtained by letting $a_{p+1} = a_{p+2} = \cdots = b_{p+1} = b_{p+2} = \cdots = 0$ in (5.2.11). That is, a finite-order serial rank statistic cannot contain all the information for testing $H^{(N)}(0; f)$ against $H^{(N)}(N^{-1/2}c; f)$ in general. Therefore, the asymptotically most powerful test for $H^{(N)}(0; f)$ against $H^{(N)}(N^{-1/2}c; f)$ does not belong to the class of linear serial rank statistics. However, we can still construct the asymptotically most powerful test based on ranks.
For $a, b \in l_2$, we denote their inner product by $\langle a, b\rangle = \sum_{i=1}^{\infty} a_i b_i$ and the corresponding norm by $\|a\| = \langle a, a\rangle^{1/2}$. We define the rank autocorrelation coefficient associated with density $f$ ($f$-rank autocorrelation) of order $i$ as
$$ r_{i;f}^{(N)} = \frac{1}{s^{(N)}} \left[ \frac{1}{N-i} \sum_{t=i+1}^{N} \phi\!\left( F^{-1}\!\left( \frac{R_t^{(N)}}{N+1} \right) \right) F^{-1}\!\left( \frac{R_{t-i}^{(N)}}{N+1} \right) - m^{(N)} \right], $$
where
$$ m^{(N)} = \frac{1}{N(N-1)} \mathop{\sum\sum}_{1 \le i_1, i_2 \le N} \phi\!\left( F^{-1}\!\left( \frac{i_1}{N+1} \right) \right) F^{-1}\!\left( \frac{i_2}{N+1} \right) $$
and
$$ \left\{ s^{(N)} \right\}^2 = \frac{1}{N(N-1)} \mathop{\sum\sum}_{1 \le i_1, i_2 \le N} \left\{ \phi\!\left( F^{-1}\!\left( \frac{i_1}{N+1} \right) \right) F^{-1}\!\left( \frac{i_2}{N+1} \right) \right\}^2 $$
$$ \quad + \frac{2(N-2i)}{(N-i)\,N(N-1)(N-2)} \mathop{\sum\sum\sum}_{1 \le i_1, i_2, i_3 \le N} \phi\!\left( F^{-1}\!\left( \frac{i_1}{N+1} \right) \right) F^{-1}\!\left( \frac{i_2}{N+1} \right) \phi\!\left( F^{-1}\!\left( \frac{i_2}{N+1} \right) \right) F^{-1}\!\left( \frac{i_3}{N+1} \right) $$
$$ \quad + \frac{N^2 - N(2i+3) + i^2 + 5i}{N(N-1)(N-2)(N-3)(N-i)} \mathop{\sum\sum\sum\sum}_{1 \le i_1, i_2, i_3, i_4 \le N} \phi\!\left( F^{-1}\!\left( \frac{i_1}{N+1} \right) \right) \phi\!\left( F^{-1}\!\left( \frac{i_2}{N+1} \right) \right) F^{-1}\!\left( \frac{i_3}{N+1} \right) F^{-1}\!\left( \frac{i_4}{N+1} \right) $$
$$ \quad - (N-i) \left\{ m^{(N)} \right\}^2. $$
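For the Gaussian (van der Waerden) score, $\phi(F^{-1}(u)) = F^{-1}(u) = \Phi^{-1}(u)$, so the numerator of $r_{i;f}^{(N)}$ is simply a lag-$i$ product of normal scores of the ranks. The Python sketch below (standard library only; our own names, and a simplified studentization in place of the exact $m^{(N)}$ and $s^{(N)}$ above) illustrates the statistic:

```python
import random
from statistics import NormalDist

def ranks(x):
    order = sorted(range(len(x)), key=lambda t: x[t])
    r = [0] * len(x)
    for pos, t in enumerate(order):
        r[t] = pos + 1   # rank of X_t among X_1, ..., X_N
    return r

def vdw_rank_autocorr(x, i):
    # van der Waerden f-rank autocorrelation of order i (simplified normalization,
    # not the exact m^(N), s^(N) of the text)
    N = len(x)
    ppf = NormalDist().inv_cdf
    s = [ppf(r / (N + 1)) for r in ranks(x)]   # normal scores phi(F^{-1}(R_t/(N+1)))
    num = sum(s[t] * s[t - i] for t in range(i, N)) / (N - i)
    den = sum(v * v for v in s) / N
    return num / den

random.seed(3)
noise = [random.gauss(0, 1) for _ in range(300)]
ar = [0.0]
for _ in range(299):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))
```

For the white noise path the lag-one coefficient is close to zero, while for the AR(1) path it is close to the autoregressive coefficient 0.8.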

Then, we have the following result.


Proposition 5.2.8. (Proposition 2.4 of Hallin and Puri (1988)) We consider the test statistic
$$ S^{(N)*} = \frac{1}{\sqrt{N \sum_{i=1}^{N-1} (a_i + b_i)^2}} \sum_{i=1}^{N-1} \sqrt{N-i}\, (a_i + b_i)\, r_{i;f}^{(N)}. $$
Then, we have
(i) $\sqrt{N} S^{(N)*}$ is asymptotically normal, with mean 0 and variance 1 under $H^{(N)}(0; f)$, and with mean $\langle (a+b), \tilde{a} + \tilde{b} \rangle\, \sigma^2 I(f) / \|a+b\|$ and variance 1 under $H^{(N)}(N^{-1/2}\tilde{c}; f)$ (with $\tilde{a}, \tilde{b} \in l_2$, $\tilde{c} = (\tilde{a}', \tilde{b}')'$).
(ii) $S^{(N)*}$ provides an asymptotically most powerful test for $H^{(N)}(0; f)$ against the general linear alternative $H^{(N)}(N^{-1/2}c; f)$ (among all tests of given level $\alpha$), that is,
$$ \varphi_\alpha^{(N)}(a, b, f) = 1 \quad \text{if} \quad \sum_{i=1}^{N-1} \sqrt{N-i}\, (a_i + b_i)\, r_{i;f}^{(N)} > \sqrt{\sum_{i=1}^{N-1} (a_i + b_i)^2}\; k_{1-\alpha}, $$
where $k_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution.

5.2.2 Tangent Space


Now, we briefly review the calculation of the tangent space (see Bickel et al. (1998), Chapter 3 for details).
Let $\mu$ be a fixed $\sigma$-finite measure on $(\mathcal{X}, \mathcal{B})$, $\mathcal{M}$ be the set of all probability measures on $(\mathcal{X}, \mathcal{B})$ and $\mathcal{M}_\mu$ those dominated by $\mu$, that is,
$$ \mathcal{M}_\mu = \{ P \in \mathcal{M} \mid P \ll \mu \}. $$

Furthermore, let $X_1, \ldots, X_n$ be i.i.d. with common distribution $P \in \mathcal{P}$, where $\mathcal{P}$ is parametric, so that
$$ \mathcal{P} = \{ P_\theta \mid \theta \in \Theta \subset \mathbb{R}^k \}. $$
Then, we can write
$$ p(\theta) = p(\cdot, \theta) = \frac{dP_\theta}{d\mu}(\cdot), \qquad l(\theta) = \log p(\theta), $$
which are the density and log-likelihood of $P_\theta$, respectively. It is convenient to consider $\mathcal{P}$ as a subset of $L_2(\mu)$ in terms of the embedding $P \to s$, where
$$ p \equiv \frac{dP}{d\mu}, \qquad s \equiv \sqrt{p}, \qquad p(\theta) \to s(\theta). $$
Then, we have the following definitions.
Definition 5.2.9. We say that θ0 is a regular point of the parametrization θ → Pθ if θ0 is an interior
point of Θ, and
(i) the map θ → s (θ) from Θ to L2 (µ) is Fréchet differentiable at θ0 , that is, there exists a vector
ṡ (θ0 ) = ( ṡ1 (θ0 ) , . . . , ṡk (θ0 ))′ of elements of L2 (µ) such that

$$ \left\| s(\theta_0 + h) - s(\theta_0) - \dot{s}(\theta_0)' h \right\| = o(|h|) \quad \text{as } h \to 0, $$
(ii) the $k \times k$ matrix $\int \dot{s}(\theta_0)\, \dot{s}(\theta_0)'\, d\mu$ is nonsingular,
where $|\cdot|$ is the Euclidean norm and $\|\cdot\|$ is the Hilbert norm in $L_2(\mu)$.
Definition 5.2.10. A parametrization θ → Pθ is regular if
(i) every point of Θ is regular,
(ii) the map $\theta \to \dot{s}_i(\theta)$ is continuous from $\Theta$ to $L_2(\mu)$ for $i = 1,\ldots,k$.
Define the Fisher information matrix of $\theta$ by
$$ I(\theta) = 4 \int \dot{s}(\theta)\, \dot{s}(\theta)'\, d\mu $$
and the score function $\dot{l}$ of an observation by
$$ \dot{l}(\theta) = 2\, \frac{\dot{s}(\theta)}{s(\theta)}\, \chi\{s(\theta) > 0\} = \frac{\dot{p}(\theta)}{p(\theta)}\, \chi\{p(\theta) > 0\}, \qquad (5.2.13) $$
where $\dot{p}(\theta) = 2 s(\theta) \dot{s}(\theta)$ and $\chi$ denotes the indicator function. If $\theta$ is a regular point, then $\dot{l}(\theta) \in L_2(P_\theta)$, and therefore we have another representation of the Fisher information matrix for $\theta$ as
$$ I(\theta) = \int \dot{l}(\theta)\, \dot{l}(\theta)'\, dP_\theta. $$

Furthermore, we can consider another local embedding of $\mathcal{P}$ into $L_2(P_{\theta_0})$ by
$$ P_\theta \leftrightarrow r(\theta) \equiv 2 \left( \frac{s(\theta)}{s(\theta_0)} - 1 \right) \chi\{s(\theta_0) > 0\}. $$

Suppose ν is a Euclidean parameter defined on a regular parametric model P = {Pθ | θ ∈ Θ}.


We can identify ν with the parametric function q : Θ → Rm defined by

q (θ) = ν (Pθ ) .

Fix $P = P_\theta$ and suppose $q$ has a total differential matrix $\dot{q}_{m \times k}$ at $\theta$. Then, we can define the information bound and the efficient influence function for $\nu$ in terms of
$$ I^{-1}(P \mid \nu, \mathcal{P}) = \dot{q}(\theta)\, I^{-1}(\theta)\, \dot{q}(\theta)' \quad \text{and} \quad \tilde{l}(\cdot, P \mid \nu, \mathcal{P}) = \dot{q}(\theta)\, I^{-1}(\theta)\, \dot{l}(\theta), $$
respectively.
Now, we can define the tangent space in Hilbert space, which plays an important role in semiparametric inference. Let $H$ be a Hilbert space with norm $\|\cdot\|$ and inner product $\langle\cdot,\cdot\rangle$, and $V$ be a subset of $H$. We use the notation $a_n = b_n + o(\varepsilon_n)$ for $a_n, b_n \in H$ to mean $\|a_n - b_n\| = o(\varepsilon_n)$.

We call V a k-dimensional surface in H if it can be represented as the image of the open unit
sphere in Rk under a continuously Fréchet differentiable map which is of rank k. Namely, we can
write

V = {v (η) | |η| < 1}

with
(i) v (η + ∆) = v (η) + ∆′ v̇ (η) + o (|∆|),
(ii) v̇ = (v̇1 , . . . , v̇k )′ ∈ H k ,
(iii) dim [v̇] = k,
where [W ] denotes the closed linear span of W .
Any surface may have many representations (parametrizations). However, if η → v (η) and γ →
g (γ) are two representations of V with g (γ0 ) = v (η0 ) and v−1 {v (η0 )} = η0 , then the application
of the chain rule and inverse function theorem leads to v̇ (η0 ) = M ġ (γ0 ), where M is a k × k
nonsingular matrix. Therefore, [v̇ (η0 )] = [ġ (γ0 )] is independent of the parametrization. We call it
the tangent space of V at v0 = v (η0 ), and write it V̇ (or V̇ (v0 )).
Now, we first study P as a subset S of Hilbert space L2 (µ) via the correspondence P ↔ s.
Next, we will consider P as a subset of L2 (P0 ) in terms of the correspondence P ↔ r. If we take
H = L2 (µ) and V = S, we have the following. We call P a regular parametric model if it has a
regular parametrization in the sense of Definition 5.2.10.
Proposition 5.2.11. (Proposition 3.2.1 of Bickel et al. (1998))
(i) $\mathcal{P} = \{ P_\theta \mid \theta \in \Theta \subset \mathbb{R}^k \}$ is a regular parametric model and $\Theta$ is a surface in $\mathbb{R}^k$ if and only if $\mathcal{S}$ is a $k$-dimensional surface in $L_2(\mu)$.
(ii) Then, we can represent the projection operator onto $\dot{\mathcal{S}}$ by
$$ \Pi(h \mid \dot{\mathcal{S}}) = 4 \langle h, \dot{s}\rangle' I^{-1} \dot{s}, $$
where $\dot{s} = (\dot{s}_1, \ldots, \dot{s}_k)'$ and $\langle h, \dot{s}\rangle = (\langle h, \dot{s}_1\rangle, \ldots, \langle h, \dot{s}_k\rangle)'$.
Next, we consider the image of $\mathcal{S}$ under the mapping
$$ s \to r = 2 \left( \frac{s}{s_0} - 1 \right) \chi\{s_0 > 0\}, $$
which maps $s_0$ into zero. This image is a subset of $L_2(P_0)$. However, we call this set $\mathcal{P}$ following a familiar abuse of notation, since we do not lose anything by identifying this image with $\mathcal{P}$. By (5.2.13), we have
$$ \dot{\mathcal{P}} = \left[ \frac{2\dot{s}}{s_0} \right] = [\dot{l}_1, \ldots, \dot{l}_k] $$
and
$$ \Pi_0(h \mid \dot{\mathcal{P}}) = \langle h, \dot{l}\rangle_0'\, I^{-1} \dot{l}, $$
where $\dot{l} = (\dot{l}_1, \ldots, \dot{l}_k)'$ and
$$ \langle h, \dot{l}\rangle_0 = \left( \langle h, \dot{l}_1\rangle_0, \ldots, \langle h, \dot{l}_k\rangle_0 \right)' = \left( \int h \dot{l}_1\, dP_0, \ldots, \int h \dot{l}_k\, dP_0 \right)'. $$

Note that for any $\dot{\mathcal{S}}$ and corresponding $\dot{\mathcal{P}}$, these projection operators are related by the identities
$$ \Pi_0(h \mid \dot{\mathcal{P}}) = s_0^{-1} \Pi(h s_0 \mid \dot{\mathcal{S}}) \quad \text{for } h \in L_2(P_0), $$
$$ \Pi(t \mid \dot{\mathcal{S}}) = s_0\, \Pi_0(t/s_0 \mid \dot{\mathcal{P}}) \quad \text{for } t \in L_2(\mu). $$
For the general nonparametric case, we have the following definition.
Definition 5.2.12. If $v_0 \in V \subset H$, then let the tangent set at $v_0$ be the union of all the (1-dimensional) tangent spaces of surfaces (curves) $C \subset V$ passing through $v_0$, and denote it by $\dot{V}^0$. Furthermore, we call the closed linear span $[\dot{V}^0]$ of the tangent set $\dot{V}^0$ the tangent space of $V$, and denote it by $\dot{V}$.
Suppose that P is a regular parametric, semiparametric or nonparametric model, and ν : P →
Rm is a Euclidean parameter. Let ν also denote the corresponding map from S to Rm given by
ν (s (P)) = ν (P). Now, we give the following definition.
Definition 5.2.13. For m = 1, we say that ν is pathwise differentiable on S at s0 if there exists a
bounded linear functional ν̇ (s0 ) : Ṡ → R such that
$$ \nu(s(\eta)) = \nu(s_0) + \eta\, \dot{\nu}(s_0)(t) + o(|\eta|) \qquad (5.2.14) $$
for any curve $s(\cdot)$ in $\mathcal{S}$ which passes through $s_0 = s(0)$ with $\dot{s}(0) = t$.
Defining the bounded linear functional $\dot{\nu}(P_0): \dot{\mathcal{P}} \to \mathbb{R}$ by
$$ \dot{\nu}(P_0)(h) = \dot{\nu}(s_0)\!\left( \tfrac{1}{2} h s_0 \right), $$
(5.2.14) holds if and only if
$$ \nu(P_\eta) = \nu(P_0) + \eta\, \dot{\nu}(P_0)(h) + o(|\eta|), \qquad (5.2.15) $$
where $\{P_\eta\}$ is the curve in $\mathcal{P}$ corresponding to $\{s(\eta)\}$ in $\mathcal{S}$ and $h = 2t/s_0$. We call (5.2.15) pathwise
differentiability of ν on P at P0 . To simplify, we write ν̇ (h) for ν̇ (P0 ) (h) and ν̇ (t) for ν̇ (s0 ) (t),
ignoring the dependence of $\dot{\nu}(P_0)$ on $P_0$. The functional $\dot{\nu}$ is uniquely determined by (5.2.15) on $\dot{\mathcal{P}}^0$ and hence on $\dot{\mathcal{P}}$. Naturally, we often describe the parameter $\nu$ as the restriction to $\mathcal{P}$ of a parameter $\tilde{\nu}$ defined on a larger model $\mathcal{M} \supset \mathcal{P}$. That is, if $\tilde{\nu}$ is pathwise differentiable on $\mathcal{M}$ with derivative $\dot{\tilde{\nu}}: \dot{\mathcal{M}} \to \mathbb{R}$, then we have $\dot{\tilde{\nu}} = \dot{\nu}$ on $\dot{\mathcal{P}}$.
Now, we have the following results (see Section 3.3 of Bickel et al. (1998) for the proofs).
Theorem 5.2.14. (Theorem 3.3.1 of Bickel et al. (1998)) For $m = 1$, let $\nu$ be pathwise differentiable on $\mathcal{P}$ at $P_0$. For any regular parametric submodel $\mathcal{Q}$ for which $I^{-1}(P_0 \mid \nu, \mathcal{Q})$ is defined, the efficient influence function $\tilde{l}(\cdot, P_0 \mid \nu, \mathcal{Q})$ satisfies
$$ \tilde{l}(\cdot, P_0 \mid \nu, \mathcal{Q}) = \Pi_0\!\left( \dot{\nu}(P_0) \mid \dot{\mathcal{Q}} \right). \qquad (5.2.16) $$
Therefore,
$$ I^{-1}(P_0 \mid \nu, \mathcal{Q}) = \left\| \Pi_0\!\left( \dot{\nu}(P_0) \mid \dot{\mathcal{Q}} \right) \right\|_0^2 \le \left\| \dot{\nu}(P_0) \right\|_0^2 \qquad (5.2.17) $$
with equality if and only if
$$ \dot{\nu}(P_0) \in \dot{\mathcal{Q}}. \qquad (5.2.18) $$
Next, we consider the case that ν = (ν1 , . . . , νm )′ is m-dimensional. If each ν j is pathwise dif-
ferentiable with derivative ν̇ j = ν̇ j (P0 ), then we call ν pathwise differentiable on P at P0 with
derivative ν̇ = (ν̇1 , . . . , ν̇m )′ . Now, we extend the above results to m ≥ 1. Let
$$ \tilde{l}_j = \dot{\nu}_j(P_0), \quad j = 1,\ldots,m, \quad \text{and} \quad \tilde{l} = \left( \tilde{l}_1, \ldots, \tilde{l}_m \right)'. \qquad (5.2.19) $$

Corollary 5.2.15. (Corollary 3.3.1 of Bickel et al. (1998)) Suppose that


(i) ν is pathwise differentiable on P at P0 ,
(ii) Q is any regular parametric submodel where I −1 (P0 | ν, Q) is defined.
Then, the efficient influence function $\tilde{l}(\cdot, P_0 \mid \nu, \mathcal{Q})$ for $\nu$ in $\mathcal{Q}$ satisfies (5.2.16), and therefore,
$$ I^{-1}(P_0 \mid \nu, \mathcal{Q}) = \left\langle \Pi_0\!\left( \dot{\nu}(P_0) \mid \dot{\mathcal{Q}} \right), \Pi_0\!\left( \dot{\nu}'(P_0) \mid \dot{\mathcal{Q}} \right) \right\rangle_0 \le \langle \tilde{l}, \tilde{l}' \rangle_0, $$
where the order $A \le B$ means that $B - A$ is nonnegative definite. The equality holds if and only if $\tilde{l}_j \in \dot{\mathcal{Q}}$, $j = 1,\ldots,m$, and in this case, the efficient influence function $\tilde{l}(\cdot, P_0 \mid \nu, \mathcal{Q})$ equals $\tilde{l}$.


In addition, if
(iii) $[\tilde{l}_j : j = 1,\ldots,m] \subset \dot{\mathcal{P}}^0$,
then for each $a \in \mathbb{R}^m$ there exists a one-dimensional regular parametric submodel $\mathcal{Q}$ such that
$$ a'\, I^{-1}(P_0 \mid \nu, \mathcal{Q})\, a = a'\, E\!\left( \tilde{l}\tilde{l}' \right) a. $$

If $\mathcal{Q}$ is a regular parametric submodel of $\mathcal{P}$ satisfying (5.2.18), that is, giving the equality in (5.2.17), we call $\mathcal{Q}$ least favourable. For $m = 1$, from Theorem 5.2.14, under the condition $\dot{\nu}(P_0) \in \dot{\mathcal{P}}$, there exists a least favourable $\mathcal{Q}$ which may be chosen to be one-dimensional. We extend this idea to the general $m$-dimensional case, in which no least favourable submodel $\mathcal{Q}$ may exist; instead, a least favourable sequence $\mathcal{Q}_j$ does.
Suppose that $\nu$ is pathwise differentiable as in Theorem 5.2.14 and Corollary 5.2.15, and
(iii)$'$
$$ [\dot{\nu}_j(P_0) : j = 1,\ldots,m] \subset \overline{\dot{\mathcal{P}}^0}, \qquad (5.2.20) $$
where $\overline{W}$ denotes the closure of $W$.


Under these conditions, we have the following definition.
Definition 5.2.16. (i) We call $\tilde{l} = \tilde{l}(\cdot, P_0 \mid \nu, \mathcal{P})$ defined by (5.2.19) the efficient influence function for $\nu$, that is,
$$ \tilde{l} = \tilde{l}(\cdot, P_0 \mid \nu, \mathcal{P}) = \dot{\nu}(P_0). $$
(ii) The information matrix $I(P_0 \mid \nu, \mathcal{P})$ for $\nu$ in $\mathcal{P}$ is defined as the inverse of
$$ I^{-1}(P_0 \mid \nu, \mathcal{P}) = \langle \tilde{l}, \tilde{l}' \rangle_0 = E_0\!\left( \tilde{l}\tilde{l}' \right). $$

The parameters are often defined implicitly through the parametrization in semiparametric models. We consider the information bounds and efficient influence functions for such parameters. This approach is equivalent to the approach for parameters which are explicitly defined as functions of $P$. Each of them has its own advantages depending on the example considered. Here, it is convenient to present the semiparametric model in terms of $\mathcal{P}$ rather than $\mathcal{S}$. Let
$$ \mathcal{P} = \left\{ P_{(\nu,g)} \mid \nu \in N,\ g \in G \right\}, $$
where $N$ is an open subset of $\mathbb{R}^m$ and $G$ is general. Fix $P_0 = P_{(\nu_0,g_0)}$ and let $\dot{l}_1$ denote the vector of partial derivatives of $\log p(\cdot, \nu, g)$ with respect to $\nu$ at $P_0$, which are just the score functions for the parametric model
$$ \mathcal{P}_1 = \left\{ P_{(\nu,g_0)} \mid \nu \in N \right\}. $$
Also, let
$$ \mathcal{P}_2 = \left\{ P_{(\nu_0,g)} \mid g \in G \right\}. $$

Then we have the following results (see Section 3.4 of Bickel et al. (1998) for the proofs).
Theorem 5.2.17. (Theorem 3.4.1 of Bickel et al. (1998)) Suppose that
(i) P1 is regular,
(ii) ν is a 1-dimensional parameter,
and let
$$ \dot{l}_1^* = \dot{l}_1 - \Pi_0\!\left( \dot{l}_1 \mid \dot{\mathcal{P}}_2 \right). $$
Then, we have the following.
(A) If $\mathcal{Q} = \{ P_{(\nu,g_\gamma)} \mid \nu \in N, \gamma \in \Gamma \}$ is a regular parametric submodel of $\mathcal{P}$ with $P_0 \in \mathcal{Q}$, then
$$ I(P_0 \mid \nu, \mathcal{Q}) \ge \left\| \dot{l}_1^* \right\|_0^2 = E_0\!\left( \dot{l}_1^{*2} \right) $$
with equality if and only if $\dot{l}_1^* \in \dot{\mathcal{Q}}$.
(B) If $\dot{\mathcal{P}} = \dot{\mathcal{P}}_1 + \dot{\mathcal{P}}_2$, $\dot{l}_1^* \ne 0$, and $\nu$ is pathwise differentiable on $\mathcal{P}$ at $P_0$, then the efficient influence function for $\nu$ is given by
$$ \tilde{l}_1 = \left\| \dot{l}_1^* \right\|_0^{-2} \dot{l}_1^* \equiv I^{-1}(P_0 \mid \nu, \mathcal{P})\, \dot{l}_1^*. $$

From these results, we have the following definition.


Definition 5.2.18. The efficient score function for $\nu$ in $\mathcal{P}$ is given by $\dot{l}_1^*$ and written as $\dot{l}^*(\cdot, P_0 \mid \nu, \mathcal{P})$.
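The projection $\Pi_0(\dot{l}_1 \mid \dot{\mathcal{P}}_2)$ and the resulting efficient score can be illustrated numerically. In the Gaussian location-scale model $N(\nu, \sigma^2)$, the location score $\dot{l}_1(x) = (x-\nu)/\sigma^2$ is orthogonal to the scale (nuisance) score, so the efficient score for $\nu$ coincides with $\dot{l}_1$ and no information is lost (adaptivity). A Monte Carlo Python sketch of this projection (all names ours; inner products estimated by sample averages):

```python
import random

random.seed(4)
nu, sigma = 0.0, 1.5
xs = [random.gauss(nu, sigma) for _ in range(100000)]

l1 = [(x - nu) / sigma ** 2 for x in xs]                      # location score
l2 = [((x - nu) ** 2 - sigma ** 2) / sigma ** 3 for x in xs]  # nuisance (scale) score

def mean(v):
    return sum(v) / len(v)

c12 = mean([a * b for a, b in zip(l1, l2)])   # <l1, l2>_0, ~0 in the Gaussian case
i22 = mean([b * b for b in l2])               # ||l2||_0^2
l1_star = [a - (c12 / i22) * b for a, b in zip(l1, l2)]  # residual = efficient score
info_eff = mean([a * a for a in l1_star])     # ~ I(P_0 | nu, P) = 1/sigma^2
```

Here `info_eff` is close to $1/\sigma^2 \approx 0.444$: projecting out the scale direction costs no location information for a Gaussian density.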
Now, we consider the case that ν is m-dimensional with m > 1. The efficient score function
vector is given by
   
$$ \dot{l}_1^* = \left( \dot{l}_{11}^*, \ldots, \dot{l}_{1m}^* \right)' = \dot{l}_1 - \Pi_0\!\left( \dot{l}_1 \mid \dot{\mathcal{P}}_2 \right). $$

Then, we have the following results.


Corollary 5.2.19 (Corollary 3.4.1 of Bickel et al. (1998)). Suppose P1 is regular.
(A) If $\mathcal{Q}$ is a regular parametric submodel as in Theorem 5.2.17 where $I^{-1}(P_0 \mid \nu, \mathcal{Q})$ is defined, then we have
$$ I(P_0 \mid \nu, \mathcal{Q}) \ge E_0\!\left( \dot{l}_1^* \dot{l}_1^{*\prime} \right) $$
with equality if and only if $[\dot{l}_1^*]$ is contained in $\dot{\mathcal{Q}}$.
(B) If $\dot{\mathcal{P}} = \dot{\mathcal{P}}_1 + \dot{\mathcal{P}}_2$, $\nu: \mathcal{P} \to \mathbb{R}^m$ is pathwise differentiable on $\mathcal{P}$ at $P_0$ and (5.2.20) holds, then the efficient influence function for $\nu$ is given by
$$ \tilde{l}_1 = I^{-1}(P_0 \mid \nu, \mathcal{P})\, \dot{l}_1^* $$
and
$$ I(P_0 \mid \nu, \mathcal{P}) = E_0\!\left( \dot{l}_1^* \dot{l}_1^{*\prime} \right). $$
SEMIPARAMETRICALLY EFFICIENT ESTIMATION IN TIME SERIES 155
5.2.3 Introduction to Semiparametric Asymptotic Optimal Theory
The assumption that the innovation densities underlying these models are known seems quite un-
realistic. If these densities remain unspecified, the model becomes a semiparametric one, and rank-
based inference methods naturally come into the picture. Rank-based inference methods under very
general conditions are known to achieve semiparametric efficiency bounds.
Here, we briefly review semiparametrically efficient estimation theory. Consider the sequence
of “semiparametric models”
$$ \mathcal{E}^{(N)} := \left( \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)} = \left\{ P^{(N)}_{\eta,g} : \eta \in \Theta,\ g \in \mathcal{G} \right\} \right), \quad N \in \mathbb{N}, $$

where $\eta$ denotes a finite-dimensional parameter with $\Theta$ an open subset of $\mathbb{R}^k$, $g$ an infinite-dimensional nuisance parameter and $\mathcal{G}$ an arbitrary set. This sequence of models is studied in the neighbourhood of arbitrary parameter values $(\eta_0, g_0) \in \Theta \times \mathcal{G}$. Denote expectations under $P^{(N)}_{\eta,g}$ by $E^{(N)}_{\eta,g}$. Associated with $\mathcal{E}^{(N)}$, consider the fixed-$g$ sequences of "parametric submodels"
$$ \mathcal{E}^{(N)}_g := \left( \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_g = \left\{ P^{(N)}_{\eta,g} : \eta \in \Theta \right\} \right), \quad N \in \mathbb{N}. $$
Throughout this section we assume that the sequence of parametric models $\{\mathcal{E}^{(N)}_{g_0}\}$ is locally asymptotically normal (LAN) at $\eta_0$, with central sequence $\Delta^{(N)}(\eta_0, g_0)$ and Fisher information $I(\eta_0, g_0)$.
More precisely, we have the following assumption.
Assumption 5.2.20 (Hallin and Werker (2003), Assumption A). For any bounded sequence {τN } in
Rk , we have

$$ \frac{dP^{(N)}_{\eta_0 + \tau_N/\sqrt{N},\, g_0}}{dP^{(N)}_{\eta_0, g_0}} = \exp\left\{ \tau_N' \Delta^{(N)}(\eta_0, g_0) - \frac{1}{2} \tau_N' I(\eta_0, g_0)\, \tau_N + r_N \right\}, $$
where, under $P^{(N)}_{\eta_0,g_0}$, $\Delta^{(N)}(\eta_0, g_0) \overset{\mathcal{L}}{\to} N(0, I(\eta_0, g_0))$ and $r_N \overset{P}{\to} 0$, as $N \to \infty$.
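For i.i.d. Gaussian observations with unit variance and location $\eta$, the LAN expansion holds exactly, with $\Delta^{(N)} = N^{-1/2}\sum_t (X_t - \eta_0)$, $I(\eta_0) = 1$ and $r_N = 0$; the following Python sketch (our own illustrative example, not from the text) verifies this numerically:

```python
import math
import random

random.seed(5)
N, eta0, tau = 400, 0.3, 1.2
x = [random.gauss(eta0, 1.0) for _ in range(N)]

def loglik(eta):
    # Gaussian log-likelihood with unit variance
    return sum(-0.5 * (xi - eta) ** 2 - 0.5 * math.log(2 * math.pi) for xi in x)

llr = loglik(eta0 + tau / math.sqrt(N)) - loglik(eta0)   # log-likelihood ratio
delta = sum(xi - eta0 for xi in x) / math.sqrt(N)        # central sequence Delta^(N)
lan = tau * delta - 0.5 * tau ** 2 * 1.0                 # tau' Delta - (1/2) tau' I tau
```

Here `llr` and `lan` agree up to floating-point error, i.e., $r_N = 0$ for this model; in general $r_N$ only vanishes asymptotically.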
More generally, we consider the sequences of "parametric experiments"
$$ \mathcal{E}^{(N)}_q(g_0) := \left( \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_q = \left\{ P^{(N)}_{\eta,q(\lambda)} : \eta \in \Theta,\ \lambda \in (-1,1)^k \right\} \right), \quad N \in \mathbb{N}, \qquad (5.2.21) $$
with maps of the form $q: (-1,1)^k \to \mathcal{G}$. Let $\mathcal{Q}$ denote the set of all maps $q$ such that
(i) $q(0) = g_0$, and
(ii) the sequence $\mathcal{E}^{(N)}_q(g_0)$ is locally asymptotically normal (LAN) with respect to $(\eta', \lambda')'$ at $(\eta_0', 0')'$.
The central sequence and Fisher information matrix associated with the LAN (at $(\eta_0', 0')'$) property (ii) naturally decompose into $\eta$- and $\lambda$-parts, namely, into
$$ \left( \Delta^{(N)}(\eta_0, g_0)',\ H_q^{(N)}(\eta_0, g_0)' \right)' \qquad (5.2.22) $$
and
$$ \mathcal{I}_q(\eta_0, g_0) := \begin{pmatrix} I(\eta_0, g_0) & C_q(\eta_0, g_0)' \\ C_q(\eta_0, g_0) & I_{H_q}(\eta_0, g_0) \end{pmatrix}. $$

Then, the "influence function" for the parameter $\eta$ is defined as the $\eta$-component of
$$ \mathcal{I}_q(\eta_0, g_0)^{-1} \left( \Delta^{(N)}(\eta_0, g_0)',\ H_q^{(N)}(\eta_0, g_0)' \right)'. $$
After elementary algebra, that influence function for $\eta$ takes the form
$$ \left\{ I(\eta_0, g_0) - C_q(\eta_0, g_0)'\, I_{H_q}^{-1}(\eta_0, g_0)\, C_q(\eta_0, g_0) \right\}^{-1} \Delta_q^{(N)}(\eta_0, g_0), \qquad (5.2.23) $$
with
$$ \Delta_q^{(N)}(\eta_0, g_0) := \Delta^{(N)}(\eta_0, g_0) - C_q(\eta_0, g_0)'\, I_{H_q}^{-1}(\eta_0, g_0)\, H_q^{(N)}(\eta_0, g_0). $$

"Asymptotically efficient inference" on $\eta$ (in the parametric submodel $\mathcal{E}^{(N)}_q(g_0)$, at $(\eta_0', 0')'$) should be based on this influence function. Note that $\Delta_q^{(N)}(\eta_0, g_0)$ corresponds to the residual of the regression, with respect to the $\lambda$-part $H_q^{(N)}$, of the $\eta$-part $\Delta^{(N)}$ of the central sequence (5.2.22), in the covariance $\mathcal{I}_q(\eta_0, g_0)$.
The "residual Fisher information" (the asymptotic covariance matrix of $\Delta_q^{(N)}(\eta_0, g_0)$) is given by
$$ I(\eta_0, g_0) - C_q(\eta_0, g_0)'\, I_{H_q}^{-1}(\eta_0, g_0)\, C_q(\eta_0, g_0); $$
hence the "loss of $\eta$-information" (due to the nonspecification of $\lambda$ in $\mathcal{E}^{(N)}_q(g_0)$) is
$$ C_q(\eta_0, g_0)'\, I_{H_q}^{-1}(\eta_0, g_0)\, C_q(\eta_0, g_0). $$
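In the scalar case ($k = 1$, one nuisance direction), the residual information and the orthogonality of $\Delta_q^{(N)}$ to $H_q^{(N)}$ can be checked by simulation. The Python sketch below (hypothetical numerical values for the information blocks) draws $(\Delta, H)$ from the limiting normal law with covariance $\mathcal{I}_q$ and verifies that the regression residual has variance $I - C' I_{H_q}^{-1} C$ and is uncorrelated with $H$:

```python
import math
import random

random.seed(6)
I_eta, C, I_H = 2.0, 0.8, 1.0     # hypothetical information blocks
vals_dq, vals_h = [], []
for _ in range(100000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    H = math.sqrt(I_H) * z1
    Delta = (C / math.sqrt(I_H)) * z1 + math.sqrt(I_eta - C ** 2 / I_H) * z2
    Dq = Delta - (C / I_H) * H    # residual of the regression of Delta on H
    vals_dq.append(Dq)
    vals_h.append(H)

res_info = sum(v * v for v in vals_dq) / len(vals_dq)   # ~ I - C^2 / I_H = 1.36
loss = I_eta - res_info                                  # ~ C^2 / I_H = 0.64
orth = sum(a * b for a, b in zip(vals_dq, vals_h)) / len(vals_h)  # ~ 0
```

The simulated `res_info` and `loss` match the two displayed formulas, and `orth` is near zero, reflecting the orthogonality of the residual to the nuisance direction.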

Does there exist a $q_{lf} \in \mathcal{Q}$ such that the corresponding $\Delta^{(N)}_{q_{lf}}$ is asymptotically orthogonal to any $H_q(\eta_0, g_0)$, $q \in \mathcal{Q}$? If such a $q_{lf}$ exists, it can be considered "least favourable," since the corresponding loss of information is maximal. We then say that $H_{q_{lf}}$ is a "least favourable direction," and $\Delta^{(N)}_{q_{lf}}(\eta_0, g_0)$ a "semiparametrically efficient (at $(\eta_0, g_0)$) central sequence" for the full experiment $\mathcal{E}^{(N)}$.
Consider fixed-$\eta$ subexperiments, denoted by
$$ \mathcal{E}^{(N)}_{\eta_0} := \left( \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_{\eta_0} = \left\{ P^{(N)}_{\eta_0,g} : g \in \mathcal{G} \right\} \right), \quad N \in \mathbb{N}, $$
corresponding to the sequence of nonparametric experiments for a fixed value $\eta_0 \in \Theta$. Here we focus on the situation where these experiments $\{\mathcal{E}^{(N)}_{\eta_0}, N \in \mathbb{N}\}$ enjoy some distribution-freeness or invariance structure. More precisely, assume that there exists, for all $\eta_0$, a sequence of $\sigma$-fields $\{\mathcal{B}^{(N)}(\eta_0), N \in \mathbb{N}\}$ with $\mathcal{B}^{(N)}(\eta_0) \subset \mathcal{A}^{(N)}$, such that the restriction of $P^{(N)}_{\eta_0,g}$ to $\mathcal{B}^{(N)}(\eta_0)$ does not depend on $g \in \mathcal{G}$. We denote this restriction by $P^{(N)}_{\eta_0,g|\mathcal{B}^{(N)}(\eta_0)}$. This distribution-freeness of $\mathcal{B}^{(N)}(\eta_0)$ generally follows from some invariance property, under which $\mathcal{B}^{(N)}(\eta_0)$ is generated by the orbits of some group acting on $(\mathcal{X}^{(N)}, \mathcal{A}^{(N)})$. Hallin and Werker (2003) have shown how this invariance property leads to the fundamental and quite appealing consequence that semiparametric efficiency bounds in this context can be achieved by distribution-free, hence rank-based, procedures.
Proposition 5.2.21 (Hallin and Werker (2003), Proposition 2.1). Fix $q \in \mathcal{Q}$, and consider the model $\mathcal{E}^{(N)}_q(g_0)$ in (5.2.21). Let $\mathcal{B}^{(N)}(\eta_0) \subset \mathcal{A}^{(N)}$ be a sequence of $\sigma$-fields such that $P^{(N)}_{\eta_0,q(\lambda)|\mathcal{B}^{(N)}(\eta_0)}$ does not depend on $\lambda$. If the $\lambda$-part $H_q^{(N)}(\eta_0, g_0)$ of the central sequence is uniformly integrable under $P^{(N)}_{\eta_0,g_0}$, we have
$$ E^{(N)}_{\eta_0,g_0}\left\{ H_q^{(N)}(\eta_0, g_0) \mid \mathcal{B}^{(N)}(\eta_0) \right\} = o_{L^1}(1). $$

Note that this proposition states that, if $P^{(N)}_{\eta_0,g|\mathcal{B}^{(N)}(\eta_0)}$ does not depend on $g \in \mathcal{G}$, then under $P^{(N)}_{\eta_0,g_0}$, the conditional (upon $\mathcal{B}^{(N)}(\eta_0)$) expectation of the $\lambda$-part $H_q^{(N)}(\eta_0, g_0)$ of the central sequence associated with experiment (5.2.21) converges to zero in $L^1$ norm, as $N \to \infty$. Moreover, a uniformly integrable version of $H_q^{(N)}(\eta_0, g_0)$ always exists.
Lemma 5.2.22 (Hallin and Werker (2003), Lemma 2.2). Denote by $\Delta_\eta^{(N)}$ an arbitrary central sequence in some locally asymptotically normal sequence of experiments $\mathcal{E}^{(N)}$, with parameter $\eta$ and probability distributions $P_\eta^{(N)}$. Then, there exists a sequence $\bar{\Delta}_\eta^{(N)} = (\bar{\Delta}_{\eta,1}^{(N)}, \ldots, \bar{\Delta}_{\eta,k}^{(N)})'$, $N \in \mathbb{N}$, such that
(i) for any $p \in [1, \infty)$, the sequences $|\bar{\Delta}_{\eta,i}^{(N)}|^p$, $N \in \mathbb{N}$, $i = 1,\ldots,k$, are uniformly integrable;
(ii) $\Delta_\eta^{(N)} - \bar{\Delta}_\eta^{(N)} = o_P(1)$ under $P_\eta^{(N)}$ as $N \to \infty$; therefore, for any $p \in [1, \infty)$, $\bar{\Delta}_\eta^{(N)}$ constitutes a uniformly $p$th-order integrable version of the central sequence.
From Proposition 5.2.21, we see that for all $q \in \mathcal{Q}$, the $\lambda$-part $H^{(N)}_q(\eta_0, g_0)$ of the central sequence is asymptotically orthogonal to the entire space of $\mathcal{B}^{(N)}(\eta_0)$-measurable variables. Therefore,
$$H^{(N)}_{lf}(\eta_0, g_0) := \Delta^{(N)}(\eta_0, g_0) - E^{(N)}_{\eta_0,g_0}\left\{ \Delta^{(N)}(\eta_0, g_0) \,\Big|\, \mathcal{B}^{(N)}(\eta_0) \right\}$$
is an obvious candidate as the least favourable direction, and, automatically,
$$\Delta^{(N)}_{lf}(\eta_0, g_0) := E^{(N)}_{\eta_0,g_0}\left\{ \Delta^{(N)}(\eta_0, g_0) \,\Big|\, \mathcal{B}^{(N)}(\eta_0) \right\} \qquad (5.2.24)$$
is a candidate as a semiparametrically efficient (at $(\eta_0, g_0)$) central sequence. Note that the efficient score is obtained by taking the residual of the projection of the parametric score on the tangent space generated by $g_0$. The following condition is thus sufficient for $H^{(N)}_{lf}(\eta_0, g_0)$ to be least favourable.

Condition 5.2.23 (LF1). $H^{(N)}_{lf}(\eta_0, g_0)$ can be obtained as the $\lambda$-part of the central sequence $\left(\Delta^{(N)}(\eta_0, g_0)', H^{(N)}_{lf}(\eta_0, g_0)'\right)'$ of some parametric submodel $\mathcal{E}^{(N)}_{lf}(g_0)$.
More precisely, we have the following result.

Proposition 5.2.24 (Hallin and Werker (2003), Proposition 2.4). Assume that condition (LF1) holds. Then $H^{(N)}_{lf}(\eta_0, g_0)$ is a least favourable direction, and $\Delta^{(N)}_{lf}(\eta_0, g_0)$ a semiparametrically efficient (at $g_0$) central sequence.

Furthermore, assuming that condition (LF1) holds, the information matrix in $\mathcal{E}^{(N)}_{lf}(g_0)$ has the form
$$\tilde{I}(\eta_0, g_0) = \begin{pmatrix} I(\eta_0, g_0) & I(\eta_0, g_0) - I(\eta_0, g_0)^* \\ I(\eta_0, g_0) - I(\eta_0, g_0)^* & I(\eta_0, g_0) - I(\eta_0, g_0)^* \end{pmatrix}, \qquad (5.2.25)$$
where $I^*(\eta, g)$ is the variance of the limiting distribution of $E^{(N)}_{\eta,g}\{\Delta^{(N)}(\eta, g) \mid \mathcal{B}^{(N)}(\eta)\}$. Therefore, up to $o_P(1)$ terms, the efficient influence function is the $\lambda$-part of
$$\begin{pmatrix} I(\eta, g) & I(\eta, g) - I(\eta, g)^* \\ I(\eta, g) - I(\eta, g)^* & I(\eta, g) - I(\eta, g)^* \end{pmatrix}^{-1} \begin{pmatrix} \Delta^{(N)}(\eta, g) \\ H^{(N)}_{lf}(\eta, g) \end{pmatrix}.$$
Still under (LF1), this admits the asymptotic representation
$$I^*(\eta, g)^{-1}\, E^{(N)}_{\eta,g}\left\{ \Delta^{(N)}(\eta, g) \,\Big|\, \mathcal{B}^{(N)}(\eta) \right\} + o_P(1),$$
as $N \to \infty$, under $P^{(N)}_{\eta,g}$, for any $(\eta, g) \in \Theta \times \mathcal{G}$.


Proposition 5.2.24 deals with a fixed value of $g$, hence with semiparametric efficiency at given $g$. If efficiency is to be attained over some class $\mathcal{F}$ of values of $g$, the following additional requirement is needed.

Condition 5.2.25 (LF2). For all $g \in \mathcal{F} \subset \mathcal{G}$, there exists a version of the influence function (5.2.23) for $\eta$ in $\mathcal{E}^{(N)}_{lf}(g)$ that does not depend on $g$.

General conditions under which (LF2) is satisfied are not easily formulated; we discuss them for the specific case of rank-based inference in the next section.
Now we give a precise statement of a set of assumptions that are sufficient for condition (LF1)
to hold. As before, fix η0 ∈ Θ and g0 ∈ G, and consider the following set of assumptions.
Assumption 5.2.26 (Hallin and Werker (2003), Assumption B). (i) Let $H^{(N)}_{lf}(\eta_0, g_0)$ be such that $\left(\Delta^{(N)}(\eta_0, g_0)', H^{(N)}_{lf}(\eta_0, g_0)'\right)'$ is asymptotically normal (automatically, the limiting distribution then is $N\left(0, \tilde{I}(\eta_0, g_0)\right)$), under $P^{(N)}_{\eta_0,g_0}$, as $N \to \infty$.
(ii) There exists a function $q_{lf} : (-1, 1)^r \to \mathcal{G}$ such that, for any sequence $\eta_N = \eta_0 + O\left(N^{-1/2}\right)$, the sequence of experiments
$$\left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \left\{ P^{(N)}_{\eta_N, q_{lf}(\lambda)} : \lambda \in (-1, 1)^r \right\} \right\}$$
is LAN at $\lambda = 0$ with central sequence $H^{(N)}_{lf}(\eta_N, g_0)$ and Fisher information matrix $I(\eta_0, g_0) - I(\eta_0, g_0)^*$.
(iii) The sequence $H^{(N)}_{lf}(\eta_0, g_0)$ satisfies a local asymptotic linearity property, in the sense that, for any bounded sequence $\tau_N$ in $\mathbb{R}^k$, we have
$$H^{(N)}_{lf}(\eta_0, g_0) - H^{(N)}_{lf}\!\left(\eta_0 + \frac{\tau_N}{\sqrt{N}}, g_0\right) = \left\{ I(\eta_0, g_0) - I(\eta_0, g_0)^* \right\}\tau_N + o_{P_{\eta_0,g_0}}(1)$$
under $P^{(N)}_{\eta_0,g_0}$ as $N \to \infty$.

Let $\tau_N$ and $\rho_N$ be bounded sequences in $\mathbb{R}^k$. The contiguity of $P^{(N)}_{\eta_0+\tau_N/\sqrt{N},\, g_0}$ and $P^{(N)}_{\eta_0,g_0}$ follows from Assumption 5.2.20 via Le Cam's first lemma. Using Assumption 5.2.26 (ii) and (iii), observe that
$$\log\frac{dP^{(N)}_{\eta_0+\tau_N/\sqrt{N},\, q_{lf}(\rho_N/\sqrt{N})}}{dP^{(N)}_{\eta_0,g_0}} = \log\frac{dP^{(N)}_{\eta_0+\tau_N/\sqrt{N},\, q_{lf}(\rho_N/\sqrt{N})}}{dP^{(N)}_{\eta_0+\tau_N/\sqrt{N},\, g_0}} + \log\frac{dP^{(N)}_{\eta_0+\tau_N/\sqrt{N},\, g_0}}{dP^{(N)}_{\eta_0,g_0}}$$
$$= \rho_N' H^{(N)}_{lf}\!\left(\eta_0 + \frac{\tau_N}{\sqrt{N}}, g_0\right) - \frac{1}{2}\rho_N'\left\{ I(\eta_0, g_0) - I(\eta_0, g_0)^* \right\}\rho_N + \tau_N'\Delta^{(N)}(\eta_0, g_0) - \frac{1}{2}\tau_N' I(\eta_0, g_0)\tau_N + o_P(1)$$
$$= \rho_N' H^{(N)}_{lf}(\eta_0, g_0) + \tau_N'\Delta^{(N)}(\eta_0, g_0) - \rho_N'\left\{ I(\eta_0, g_0) - I(\eta_0, g_0)^* \right\}\tau_N - \frac{1}{2}\rho_N'\left\{ I(\eta_0, g_0) - I(\eta_0, g_0)^* \right\}\rho_N - \frac{1}{2}\tau_N' I(\eta_0, g_0)\tau_N + o_P(1).$$
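The expansion above is simply the joint LAN quadratic form associated with the information matrix (5.2.25). A quick numerical sanity check of this algebra, using arbitrary hypothetical values for $\Delta^{(N)}$, $H^{(N)}_{lf}$, $I$ and $I^*$ (taking the two dimensions equal for simplicity; all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # common dimension of tau_N and rho_N (illustrative)

# Hypothetical information matrices: I symmetric positive definite,
# D = I - I* symmetric positive definite as well.
A = rng.standard_normal((k, k))
I_mat = A @ A.T + k * np.eye(k)
B = rng.standard_normal((k, k))
D = 0.1 * (B @ B.T) + 0.1 * np.eye(k)      # stands in for I - I*

delta = rng.standard_normal(k)             # stands in for Delta^(N)(eta0, g0)
H = rng.standard_normal(k)                 # stands in for H_lf^(N)(eta0, g0)
tau = rng.standard_normal(k)
rho = rng.standard_normal(k)

# Expanded form of the log-likelihood ratio (dropping the o_P(1) terms).
expanded = (rho @ H + tau @ delta - rho @ D @ tau
            - 0.5 * rho @ D @ rho - 0.5 * tau @ I_mat @ tau)

# Joint LAN form: zeta'(Delta, H)' - (1/2) zeta' Itilde zeta, Itilde as in (5.2.25).
zeta = np.concatenate([tau, rho])
score = np.concatenate([delta, H])
Itilde = np.block([[I_mat, D], [D, D]])
joint = zeta @ score - 0.5 * zeta @ Itilde @ zeta

assert np.isclose(expanded, joint)
```

The two expressions agree exactly, confirming that the cross term $-\rho_N'(I - I^*)\tau_N$ carries no factor $\frac{1}{2}$.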
Assumption 5.2.26 (i) asserts the asymptotic normality of the central sequence. Therefore, we have the following result.

Proposition 5.2.27 (Hallin and Werker (2003), Proposition 2.5). Suppose that Assumptions 5.2.20 and 5.2.26 are satisfied. Then the model
$$\left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \left\{ P^{(N)}_{\eta, q_{lf}(\lambda)} : \eta \in \Theta, \lambda \in (-1, 1)^r \right\} \right\}$$
is locally asymptotically normal at $(\eta_0, 0)$ with central sequence $\left(\Delta^{(N)}(\eta_0, g_0)', H^{(N)}_{lf}(\eta_0, g_0)'\right)'$ and the Fisher information matrix (5.2.25), and condition (LF1) is satisfied.
An important property of some semiparametric models is adaptiveness. If inference problems are, locally and asymptotically, no more difficult in the semiparametric model than in the underlying parametric model, then adaptiveness holds. In the above notation, adaptiveness occurs at $(\eta_0, g_0)$ if $I(\eta_0, g_0)^* = I(\eta_0, g_0)$. Hence, if, under $P^{(N)}_{\eta_0,g_0}$,
$$\Delta^{(N)}(\eta_0, g_0) = E^{(N)}_{\eta_0,g_0}\left\{ \Delta^{(N)}(\eta_0, g_0) \,\Big|\, \mathcal{B}^{(N)}(\eta_0) \right\} + o_P(1), \qquad (5.2.26)$$
or if there exists any $\mathcal{B}^{(N)}(\eta_0)$-measurable version of the central sequence $\Delta^{(N)}(\eta_0, g_0)$, then adaptiveness holds. These statements are made rigorous in the following proposition.
Proposition 5.2.28 (Hallin and Werker (2003), Proposition 2.6). Fix $q \in \mathcal{Q}$, that is, let $\left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_q = \left\{ P^{(N)}_{\eta,q(\lambda)} : \eta \in \Theta, \lambda \in (-1, 1)^k \right\} \right\}$ denote an arbitrary submodel of the semiparametric model that satisfies the LAN property at $(\eta_0', 0')'$ with central sequence $\left(\Delta^{(N)}(\eta_0, g_0)', H^{(N)}_q(\eta_0, g_0)'\right)'$. If a $\mathcal{B}^{(N)}(\eta_0)$-measurable version $\Delta^{(N)}_{\mathcal{B}}(\eta_0, g_0)$ of the central sequence $\Delta^{(N)}(\eta_0, g_0)$ exists, then, under $P^{(N)}_{\eta_0,g_0}$, we have, for any $q \in \mathcal{Q}$,
$$\begin{pmatrix} \Delta^{(N)}(\eta_0, g_0) \\ H^{(N)}_q(\eta_0, g_0) \end{pmatrix} \xrightarrow{L} N\left( 0, \begin{pmatrix} I(\eta_0, g_0) & 0 \\ 0 & I_{H_q}(\eta_0, g_0) \end{pmatrix} \right)$$
as $N \to \infty$. Therefore, the model $\mathcal{E}^{(N)}$ is adaptive at $(\eta_0, g_0)$.
Note that under condition (LF1), adaptiveness holds at (η0 , g0 ) if and only if (5.2.26) holds, that
is, if and only if a B (N) (η0 )-measurable version of the central sequence ∆(N) (η0 , g0 ) exists.

5.2.4 Semiparametrically Efficient Estimation in Time Series, and Multivariate Cases


In this section, we deal with the special cases of the general results of the previous section, namely,
we consider semiparametrically efficient estimation in univariate time series, and the multivariate
case.

5.2.4.1 Rank-based optimal influence functions (univariate case)


First, we consider semiparametric models for univariate time series. In this case, the randomness is described by the unspecified probability density $f$ of a sequence of i.i.d. innovation random variables $\varepsilon^{(N)}_1, \ldots, \varepsilon^{(N)}_N$. Therefore, the sequence of semiparametric models is given by
$$\mathcal{E}^{(N)} := \left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)} = \left\{ P^{(N)}_{\theta,f} : \theta \in \Theta, f \in \mathcal{F} \right\} \right\}, \quad N \in \mathbb{N}, \qquad (5.2.27)$$
with $\Theta$ an open subset of $\mathbb{R}^k$ and $\mathcal{F}$ a set of densities. We assume that the parametric model is locally asymptotically normal (LAN) for each fixed density $f \in \mathcal{F}$. Furthermore, we assume that $\mathcal{F}$ is a subset of the class $\mathcal{F}_0$ of all nonvanishing densities over the real line, so that a rank-based optimal inference function is obtained. (If we restrict ourselves instead to the class of nonvanishing densities that are symmetric with respect to the origin, a signed-rank-based inference function results; the class of nonvanishing densities with median zero yields a sign- and rank-based inference function.) As a consequence, we impose the following assumption.
Assumption 5.2.29 (Hallin and Werker (2003), Assumption I). (i) For all $f \in \mathcal{F}$, the parametric submodel $\mathcal{E}^{(N)}_f := \left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_f = \left\{ P^{(N)}_{\theta,f} : \theta \in \Theta \right\} \right\}$ is locally asymptotically normal (LAN) in $\theta$ at all $\theta \in \Theta$, with central sequence $\Delta^{(N)}(\theta, f)$ of the form
$$\Delta^{(N)}(\theta, f) := \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right), \qquad (5.2.28)$$
where the function $J_{\theta,f}$ is allowed to depend on $\theta$ and $f$, and the residuals $\varepsilon^{(N)}_t(\theta)$, $t = 1, \ldots, N$, are an invertible function of the observations such that $\varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta)$ under $P^{(N)}_{\theta,f}$ are i.i.d. with density $f$. The Fisher information matrix corresponding to this LAN condition is denoted by $I(\theta, f)$.
(ii) The class $\mathcal{F}$ is a subset of the class $\mathcal{F}_0$ of all densities $f$ over the real line such that $f(x) > 0$, $x \in \mathbb{R}$.
Under Assumption 5.2.29(ii), for fixed $\theta$ and $N$, the nonparametric model
$$\mathcal{E}^{(N)}_{\theta} := \left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_{\theta} = \left\{ P^{(N)}_{\theta,f} : f \in \mathcal{F} \right\} \right\}$$
is generated by the group of continuous order-preserving transformations acting on $\varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta)$, namely, the transformations $\mathcal{G}^{(N)}_h$ defined by
$$\mathcal{G}^{(N)}_h\left\{ \varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta) \right\} := \left[ h\left\{ \varepsilon^{(N)}_1(\theta) \right\}, \ldots, h\left\{ \varepsilon^{(N)}_N(\theta) \right\} \right],$$
where $h$ belongs to the set $\mathcal{H}$ of all functions $h : \mathbb{R} \to \mathbb{R}$ that are continuous, strictly increasing, and satisfy $\lim_{x \to \pm\infty} h(x) = \pm\infty$. The corresponding maximal invariant $\sigma$-algebra is the $\sigma$-algebra $\sigma\left\{ R^{(N)}_1(\theta), \ldots, R^{(N)}_N(\theta) \right\}$ generated by the ranks $R^{(N)}_1(\theta), \ldots, R^{(N)}_N(\theta)$ of the residuals $\varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta)$. This $\sigma$-algebra is distribution-free under $\mathcal{E}^{(N)}_{\theta}$ and plays the role of $\mathcal{B}^{(N)}(\theta)$.
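The invariance described above is easy to see computationally: the ranks of the residuals are unchanged by any $h \in \mathcal{H}$. A minimal sketch (NumPy; the residuals and the particular transformation $h$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def ranks(x):
    """1-based ranks R_1, ..., R_N of the entries of x (no ties a.s.)."""
    return np.argsort(np.argsort(x)) + 1

# Hypothetical residuals from some fitted model.
eps = rng.standard_normal(10)

# A member of H: continuous, strictly increasing, h(x) -> +-inf as x -> +-inf.
h = lambda x: x ** 3 + 2.0 * x

# The ranks are a maximal invariant: applying h leaves them unchanged.
assert np.array_equal(ranks(eps), ranks(h(eps)))
```

Any statistic that is a function of the ranks alone therefore has the same distribution under every $f \in \mathcal{F}_0$.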

Here, we need conditions on the function $J_{\theta,f}$ ensuring that either condition (LF1) alone, or conditions (LF1) and (LF2) together, hold. Let $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p$ be an arbitrary $(p+1)$-tuple of i.i.d. random variables with density $f$, independent of the residuals $\varepsilon^{(N)}_t(\theta)$. Expectations with respect to $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p$ are denoted by $E_f(\cdot)$; they do not depend on $N$ and $\theta$. We now introduce the following set of assumptions on the function $J_{\theta,f}$.
Assumption 5.2.30 (Hallin and Werker (2003), Assumption J). (i) The function $J_{\theta,f}$ is such that $0 < E_f\left[ \left\{ J_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right) \right\}^2 \right] < \infty$ and
$$E_f\left\{ J_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right) \,\Big|\, \varepsilon_1, \ldots, \varepsilon_p \right\} = 0, \quad (\varepsilon_1, \ldots, \varepsilon_p)\text{-a.e.} \qquad (5.2.29)$$
(ii) The function $J_{\theta,f}$ is componentwise monotone increasing with respect to all its arguments, or a linear combination of such functions.
(iii) For all sequences $\theta_N = \theta_0 + O\left(1/\sqrt{N}\right)$, we have, under $P^{(N)}_{\theta_0,f}$, as $N \to \infty$,
$$\frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} \left\{ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta_N), \ldots, \varepsilon^{(N)}_{t-p}(\theta_N) \right) - J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta_0), \ldots, \varepsilon^{(N)}_{t-p}(\theta_0) \right) \right\} = -I_f(\theta_0)\,\sqrt{N}\,(\theta_N - \theta_0) + o_P(1), \qquad (5.2.30)$$
where the matrix-valued function $\theta \mapsto I_f(\theta) := E_f\left\{ J_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right) J_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right)' \right\}$ is continuous in $\theta$ for all $f$. Furthermore, let
$$J^*_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) := J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) - E_f\left\{ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \varepsilon_1, \ldots, \varepsilon_p \right) \,\Big|\, \varepsilon^{(N)}_t(\theta) \right\}. \qquad (5.2.31)$$
Then, we have, under $P^{(N)}_{\theta_0,f}$,
$$\frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} \left\{ J^*_{\theta,f}\left( \varepsilon^{(N)}_t(\theta_N), \ldots, \varepsilon^{(N)}_{t-p}(\theta_N) \right) - J^*_{\theta,f}\left( \varepsilon^{(N)}_t(\theta_0), \ldots, \varepsilon^{(N)}_{t-p}(\theta_0) \right) \right\} = -I^*_f(\theta_0)\,\sqrt{N}\,(\theta_N - \theta_0) + o_P(1),$$
where $\theta \mapsto I^*_f(\theta) := E_f\left\{ J^*_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right) J^*_{\theta,f}\left( \varepsilon_0, \varepsilon_1, \ldots, \varepsilon_p \right)' \right\}$ is also continuous in $\theta$ for all $f$.
Equation (5.2.29) in Assumption 5.2.30(i), together with the martingale central limit theorem, yields the asymptotic normality of the parametric central sequence $\Delta^{(N)}(\theta, f)$:
$$\frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) \xrightarrow{L} N\left( 0, I_f(\theta) \right).$$
The existence of rank-based versions of the score functions is also obtained from Assumption 5.2.30(ii), namely, the existence of functions $a^{(N)}_f : \{1, \ldots, N\}^{p+1} \to \mathbb{R}^k$ such that
$$\lim_{N\to\infty} E_f\left[ \left\{ J_{\theta,f}\left( \varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_{p+1}(\theta) \right) - a^{(N)}_f\left( R^{(N)}_1, \ldots, R^{(N)}_{p+1} \right) \right\}^2 \right] = 0, \qquad (5.2.32)$$
where $R^{(N)}_1, \ldots, R^{(N)}_N$ are the ranks of the residuals $\varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta)$. The classical choices for $a^{(N)}_f$ are the exact scores
$$a^{(N)}_f\left( R^{(N)}_t, \ldots, R^{(N)}_{t-p} \right) := E_f\left\{ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) \,\Big|\, R^{(N)}_1, \ldots, R^{(N)}_N \right\}$$
and the approximate scores
$$a^{(N)}_f\left( R^{(N)}_t, \ldots, R^{(N)}_{t-p} \right) := J_{\theta,f}\left( F^{-1}\!\left( \frac{R^{(N)}_t}{N+1} \right), \ldots, F^{-1}\!\left( \frac{R^{(N)}_{t-p}}{N+1} \right) \right),$$
where $F$ is the distribution function corresponding to $f$. The smoothness conditions for $J_{\theta,f}$ and $J^*_{\theta,f}$ that are needed for Assumption 5.2.26(iii) follow from Assumption 5.2.30(iii). Recall that the LAN condition with (5.2.30) is equivalent to the ULAN condition, that is, the LAN condition holding uniformly over $1/\sqrt{N}$-neighbourhoods of $\theta$.
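As an illustration of the approximate scores, the following sketch evaluates $J_{\theta,f}\left\{ F^{-1}(R_t/(N+1)), F^{-1}(R_{t-p}/(N+1)) \right\}$ for $p = 1$, under the assumption of a standard Gaussian $f$, for which $-f'/f(x) = x$ and hence $J(e_0, e_1) = e_0 e_1$ (this choice of $f$ and $J$ is ours, made for concreteness):

```python
import numpy as np
from statistics import NormalDist

def ranks(x):
    return np.argsort(np.argsort(x)) + 1  # 1-based ranks

def approximate_scores(eps, p=1):
    """Approximate scores J(F^{-1}(R_t/(N+1)), ..., F^{-1}(R_{t-p}/(N+1)))
    for the illustrative Gaussian score J(e0, e1) = e0 * e1, with F^{-1}
    the standard normal quantile function."""
    N = len(eps)
    R = ranks(eps)
    Finv = np.array([NormalDist().inv_cdf(r / (N + 1)) for r in R])
    # One score per t = p+1, ..., N, pairing positions t and t-p.
    return Finv[p:] * Finv[:-p]

eps = np.random.default_rng(2).standard_normal(50)
a = approximate_scores(eps)
# Rank-based: any continuous, strictly increasing transformation of the
# residuals leaves the approximate scores unchanged.
assert np.allclose(a, approximate_scores(np.tanh(eps) + eps))
```

Because the scores depend on the residuals only through their ranks, their joint distribution is the same for every nonvanishing density $f$.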
In what follows, we write $\varepsilon^{(N)}_t$ for $\varepsilon^{(N)}_t(\theta_0)$ when no confusion can arise. Recall that $\varepsilon^{(N)}_1, \ldots, \varepsilon^{(N)}_N$ are i.i.d. with density $f$ under $P^{(N)}_{\theta_0,f}$. Now, we have the following results (see Section 4.1 of Hallin et al. (1985a)).
Proposition 5.2.31 (Hallin and Werker (2003), Proposition 3.1). Suppose that Assumptions 5.2.30(i), (ii) are satisfied, and let $a^{(N)}_f$ be any function satisfying (5.2.32). Writing
$$S^{(N)}_f(\theta_0) := \frac{1}{N-p} \sum_{t=p+1}^{N} a^{(N)}_f\left( R^{(N)}_t, \ldots, R^{(N)}_{t-p} \right),$$
$$m^{(N)}_f := \left\{ N(N-1)\cdots(N-p) \right\}^{-1} \sum\cdots\sum_{1 \le i_0 \ne \cdots \ne i_p \le N} a^{(N)}_f\left( i_0, \ldots, i_p \right),$$
define
$$\tilde{\Delta}^{(N)}(\theta_0, f) := \sqrt{N-p}\,\left\{ S^{(N)}_f(\theta_0) - m^{(N)}_f \right\}. \qquad (5.2.33)$$
Then,
$$\tilde{\Delta}^{(N)}(\theta_0, f) = \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J_{\theta,f}\left( \varepsilon^{(N)}_t, \ldots, \varepsilon^{(N)}_{t-p} \right) - \frac{\sqrt{N-p}}{N(N-1)\cdots(N-p)} \sum\cdots\sum_{1 \le i_0 \ne \cdots \ne i_p \le N} J_{\theta,f}\left( \varepsilon^{(N)}_{i_0}, \ldots, \varepsilon^{(N)}_{i_p} \right) + o_{L^2}(1),$$
under $P^{(N)}_{\theta_0,f}$, as $N \to \infty$.
The U-statistic with kernel $J_{\theta,f}$ can be rewritten in a simpler form.

Corollary 5.2.32. Under the same conditions as in Proposition 5.2.31, we also have
$$\tilde{\Delta}^{(N)}(\theta_0, f) = \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J^*_{\theta,f}\left( \varepsilon^{(N)}_t, \ldots, \varepsilon^{(N)}_{t-p} \right) + o_P(1),$$
with $J^*_{\theta,f}$ defined in (5.2.31).
Then, we can construct a candidate for the least favourable direction in the present case, generated by the sequence
$$H^{(N)}_{lf}(\theta, f) = \Delta^{(N)}(\theta, f) - E^{(N)}_{\theta,f}\left\{ \Delta^{(N)}(\theta, f) \,\Big|\, \mathcal{B}^{(N)}(\theta) \right\}$$
$$= \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) - E^{(N)}_{\theta,f}\left\{ \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) \,\bigg|\, R^{(N)}_1(\theta), \ldots, R^{(N)}_N(\theta) \right\}$$
$$= \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} \left[ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) - E^{(N)}_{\theta,f}\left\{ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right) \,\Big|\, R^{(N)}_1(\theta), \ldots, R^{(N)}_N(\theta) \right\} \right].$$

Now, we have the following result.


Proposition 5.2.33 (Hallin and Werker (2003), Proposition 3.3). Consider the semiparametric model (5.2.27). Assume that Assumptions 5.2.29 and 5.2.30 are satisfied. Then, there exists a mapping $q : (-1, 1)^k \to \mathcal{F}$ such that the parametric model
$$\mathcal{E}^{(N)}_q := \left\{ \mathcal{X}^{(N)}, \mathcal{A}^{(N)}, \mathcal{P}^{(N)}_q = \left\{ P^{(N)}_{\theta,q(\lambda)} : \theta \in \Theta, \lambda \in (-1, 1)^k \right\} \right\}, \quad N \in \mathbb{N},$$
is locally asymptotically normal at $(\theta_0, 0)$ with central sequence $\left(\Delta^{(N)}(\theta_0, f)', H^{(N)}_{lf}(\theta_0, f)'\right)'$ satisfying
$$\begin{pmatrix} \Delta^{(N)}(\theta_0, f) \\ H^{(N)}_{lf}(\theta_0, f) \end{pmatrix} \xrightarrow{L} N\left( 0, \begin{pmatrix} I_f(\theta_0) & I_f(\theta_0) - I^*_f(\theta_0) \\ I_f(\theta_0) - I^*_f(\theta_0) & I_f(\theta_0) - I^*_f(\theta_0) \end{pmatrix} \right),$$
where $I^*_f(\theta_0)$ is the variance of the limiting distribution of $E^{(N)}_{\theta,f}\left\{ \Delta^{(N)}(\theta, f) \mid \mathcal{B}^{(N)}(\theta) \right\}$. Hence, $H^{(N)}_{lf}(\theta, f)$ satisfies condition (LF1), and the corresponding influence function for $\theta$ is $I^*_f(\theta_0)^{-1}\tilde{\Delta}^{(N)}(\theta, f)$. It follows that semiparametrically efficient inference at $\theta$ and $f$ can be based on the efficient central sequence
$$\Delta^{*(N)}(\theta, f) := \frac{1}{\sqrt{N-p}} \sum_{t=p+1}^{N} J^*_{\theta,f}\left( \varepsilon^{(N)}_t, \ldots, \varepsilon^{(N)}_{t-p} \right),$$
of which $\tilde{\Delta}^{(N)}(\theta, f)$ defined in (5.2.33) provides a rank-based version.


This result is obtained without explicitly using the tangent space arguments of Section 5.2.2 and without computing the efficient score function. The efficient score is obtained by taking the residual of the projection of the parametric score on the tangent space generated by $f$. In this case, however, $f$ is completely unrestricted, so that the tangent space consists of all square-integrable functions of $\varepsilon^{(N)}_t$ which have mean 0 and vanish where $f$ vanishes. Therefore, the projection of $J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \ldots, \varepsilon^{(N)}_{t-p}(\theta) \right)$ onto this space is given by $E_f\left\{ J_{\theta,f}\left( \varepsilon^{(N)}_t(\theta), \varepsilon_1, \ldots, \varepsilon_p \right) \,\Big|\, \varepsilon^{(N)}_t(\theta) \right\}$, which generates the least favourable submodel. As a consequence, $J^*_{\theta,f}$ indeed becomes the efficient score function.
Since it depends on the unknown density $f$, we have to substitute some adequate estimator $\hat{f}_N$ for $f$ in the influence function $I^*_f(\theta_0)^{-1}\tilde{\Delta}^{(N)}(\theta, f)$ in order to satisfy condition (LF2). Define $\mathcal{C}$ as the class of all densities $f$ in $\mathcal{F}$ which satisfy the following assumption.

Assumption 5.2.34 (Assumption in Section 3.3 of Hallin and Werker (2003)). (i) $J_{\theta,f}$ satisfies Assumption 5.2.30.
(ii) There exists an estimator $\hat{f}_N$, measurable with respect to the order statistics of the residuals $\varepsilon^{(N)}_1(\theta), \ldots, \varepsilon^{(N)}_N(\theta)$, such that
$$E_f\left[ \left\{ a^{(N)}_{\hat{f}_N}\left( R^{(N)}_1, \ldots, R^{(N)}_{p+1} \right) - a^{(N)}_f\left( R^{(N)}_1, \ldots, R^{(N)}_{p+1} \right) \right\}^2 \,\bigg|\, \hat{f}_N \right] = o_P(1)$$
as $N \to \infty$, where the rank scores $a^{(N)}_f$ are defined in (5.2.32).
Therefore, the class $\mathcal{C}$ contains all densities $f \in \mathcal{F}$ such that Assumption 5.2.30 holds and, furthermore, the rank scores $a^{(N)}_f$ can be estimated consistently. A possible estimator is given in (1) of Sidak et al. (1999), Section 8.5.2, as follows. Let $X = (X_1, \ldots, X_N)'$ be a random sample from the density $f$, with score function
$$\varphi(u, f) = -\frac{f'\left( F^{-1}(u) \right)}{f\left( F^{-1}(u) \right)}, \quad 0 < u < 1,$$
and let $X_{(1)}, \ldots, X_{(N)}$ be its order statistics. Put $m_N = \left[ N^{3/4}\epsilon_N^{-2} \right]$ and $n_N = \left[ N^{1/4}\epsilon_N^{3} \right]$, where $\{\epsilon_N\}$ is some sequence of positive numbers such that
$$\epsilon_N \to 0, \quad N^{1/4}\epsilon_N^{3} \to \infty.$$
Furthermore, put
$$h_{Nj} = \left[ \frac{jN}{n_N+1} \right], \quad 1 \le j \le n_N,$$
where $[x]$ denotes the largest integer not exceeding $x$, and let
$$\tilde{\varphi}_N(u, X) = \begin{cases} \dfrac{2\, m_N n_N}{N+1}\left\{ \left( X_{(h_{Nj}+m_N)} - X_{(h_{Nj}-m_N)} \right)^{-1} - \left( X_{(h_{N,j+1}+m_N)} - X_{(h_{N,j+1}-m_N)} \right)^{-1} \right\}, & \dfrac{h_{Nj}}{N} \le u < \dfrac{h_{N,j+1}}{N},\ 1 \le j \le n_N, \\[2mm] 0, & \text{otherwise}. \end{cases}$$
Then $\tilde{\varphi}_N(u, X)$ is a consistent estimate of $\varphi(u, f)$, in the sense that
$$\lim_{N\to\infty} P\left\{ \int_0^1 \left| \tilde{\varphi}_N(u, X) - \varphi(u, f) \right|^2 du > \epsilon \right\} = 0$$
for every $\epsilon > 0$ and every density $f$ satisfying $I(f) < \infty$.


We denote by $\hat{I}^{(N)*}_{\hat{f}_N}(\theta) := \mathrm{Var}\left\{ \tilde{\Delta}^{(N)}\left(\theta, \hat{f}_N\right) \,\Big|\, \hat{f}_N \right\}$ the covariance matrix of $\tilde{\Delta}^{(N)}\left(\theta, \hat{f}_N\right)$ conditional on $\hat{f}_N$. Since $\tilde{\Delta}^{(N)}\left(\theta, \hat{f}_N\right)$ is conditionally distribution-free, $\hat{I}^{(N)*}_{\hat{f}_N}(\theta)$ does not depend on $f$ and can be computed from the observations. Moreover, $\hat{I}^{(N)*}_{\hat{f}_N}(\theta)$ is a consistent estimator of $I^*_f(\theta)$. The following result shows that conditions (LF1) and (LF2) are satisfied for all $f \in \mathcal{C}$.

Proposition 5.2.35 (Hallin and Werker (2003), Proposition 3.4). For all $f \in \mathcal{C}$ and $\theta \in \Theta$, under $P^{(N)}_{\theta,f}$,
$$\hat{I}^{(N)*}_{\hat{f}_N}(\theta)^{-1}\, \tilde{\Delta}^{(N)}\!\left(\theta, \hat{f}_N\right) = I^*_f(\theta)^{-1}\, \tilde{\Delta}^{(N)}(\theta, f) + o_P(1).$$
Therefore, conditions (LF1) and (LF2) are satisfied, and $\Delta^{*(N)}\left(\theta, \hat{f}_N\right)$ (as well as $\tilde{\Delta}^{(N)}\left(\theta, \hat{f}_N\right)$) is a version of the efficient central sequence for $\mathcal{E}^{(N)}$ at $f$ and $\theta$.
As a simple example, we consider the following MA(1) process.

Example 5.2.36 (Hallin and Werker (2003), Example 3.1; MA(1) process). Let $Y^{(N)}_1, \ldots, Y^{(N)}_N$ be a finite realization of the MA(1) process
$$Y_t = \varepsilon_t + \theta\varepsilon_{t-1}, \quad t \in \mathbb{N},$$
where $\theta \in (-1, 1)$ and $\{\varepsilon_t, t \ge 0\}$ are i.i.d. random variables with common density $f$, and assume that the starting value $\varepsilon_0$ is observed. Furthermore, we have to assume that $f$ is absolutely continuous with finite variance $\sigma^2_f$ and finite Fisher information for location $I(f) := \int_{\mathbb{R}} \left\{ f'(x)/f(x) \right\}^2 f(x)\,dx < \infty$. Under these assumptions, this MA(1) model is locally asymptotically normal (LAN) with univariate central sequence
$$\Delta^{(N)}(\theta, f) = \frac{1}{\sqrt{N-1}} \sum_{t=2}^{N} \frac{-f'}{f}\left\{ \varepsilon^{(N)}_t(\theta) \right\}\, \varepsilon^{(N)}_{t-1}(\theta),$$
where the residuals $\left\{ \varepsilon^{(N)}_t(\theta), 1 \le t \le N \right\}$ are recursively computed from $\varepsilon^{(N)}_t(\theta) := Y_t - \theta\varepsilon^{(N)}_{t-1}(\theta)$ with initial value $\varepsilon^{(N)}_0(\theta) = \varepsilon_0$. We see that this central sequence is of the form (5.2.28).

For this model, the function $J_{\theta,f}$ is given by
$$J_{\theta,f}(\varepsilon_0, \varepsilon_1) := \frac{-f'}{f}(\varepsilon_0)\,\varepsilon_1,$$
which obviously satisfies Assumption 5.2.30(i). Let $\mu_f := \int_{\mathbb{R}} x f(x)\,dx$. Then, we immediately obtain the efficient score function
$$J^*_{\theta,f}(\varepsilon_0, \varepsilon_1) = \frac{-f'}{f}(\varepsilon_0)\left( \varepsilon_1 - \mu_f \right).$$
Observing that
$$\varepsilon^{(N)}_t(\theta_N) = Y^{(N)}_t - \theta_N\varepsilon^{(N)}_{t-1}(\theta_N) = \varepsilon^{(N)}_t(\theta_0) + \theta_0\varepsilon^{(N)}_{t-1}(\theta_0) - \theta_N\varepsilon^{(N)}_{t-1}(\theta_N),$$
Assumption 5.2.30(iii) follows from standard arguments.

In the case where $\mu_f = 0$, the efficient score function $J^*_{\theta,f}$ and the parametric score function $J_{\theta,f}$ coincide, and the model is adaptive, which is a well-known result.
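The residual recursion and the central sequence of Example 5.2.36 are straightforward to implement; the sketch below assumes a standard Gaussian $f$, so that $-f'/f(x) = x$ (an illustrative choice, not part of the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def ma1_central_sequence(y, eps0, theta):
    """Central sequence of Example 5.2.36 for a standard Gaussian f,
    for which -f'/f(x) = x.  Residuals follow the recursion
    eps_t(theta) = Y_t - theta * eps_{t-1}(theta), with eps_0(theta) = eps0."""
    N = len(y)
    eps = np.empty(N + 1)
    eps[0] = eps0
    for t in range(1, N + 1):
        eps[t] = y[t - 1] - theta * eps[t - 1]
    # Delta^(N) = (N-1)^{-1/2} * sum_{t=2}^N (-f'/f)(eps_t) * eps_{t-1}
    return float((eps[2:] * eps[1:-1]).sum() / np.sqrt(N - 1))

# Simulate an MA(1) path with theta0 = 0.4 and an observed starting value eps_0.
theta0 = 0.4
innov = rng.standard_normal(501)          # eps_0, eps_1, ..., eps_500
y = innov[1:] + theta0 * innov[:-1]
delta = ma1_central_sequence(y, innov[0], theta0)
# At the true theta, delta is approximately N(0, I_f(theta)) for large N.
```

At the true parameter value, the recursion recovers the innovations exactly, and the statistic is approximately standard normal here since $I_f(\theta) = 1$ for Gaussian innovations.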

5.2.4.2 Rank-based optimal estimation for elliptical residuals


We call the distribution of a $k$-dimensional random variable $X$ spherical if, for some $\theta \in \mathbb{R}^k$, the distribution of $X - \theta$ is invariant under orthogonal transformations. For multivariate normal variables, sphericity is equivalent to the covariance matrix $\Sigma$ of $X$ being proportional to the identity matrix $I_k$. For elliptical random variables, for which finite second-order moments need not exist, sphericity is equivalent to the shape matrix $V$ being equal to the identity matrix $I_k$. Let $X^{(n)} = \left( X^{(n)\prime}_1, \ldots, X^{(n)\prime}_n \right)'$, $n \in \mathbb{N}$, be a triangular array of $k$-dimensional observations, where $X^{(n)}_1, \ldots, X^{(n)}_n$ are i.i.d. random variables with elliptical density
$$f_{(\theta,\sigma^2,V;f_1)}(x) := \frac{c_{k,f_1}}{\sigma^k |V|^{1/2}}\, f_1\!\left[ \frac{1}{\sigma}\left\{ (x - \theta)' V^{-1} (x - \theta) \right\}^{1/2} \right], \quad x \in \mathbb{R}^k, \qquad (5.2.34)$$
where $\theta \in \mathbb{R}^k$ is a location parameter, $\sigma^2 \in \mathbb{R}^+ := (0, \infty)$ is a scale parameter, $V := (V_{ij})$ is a shape parameter (a symmetric positive definite real $k \times k$ matrix with $V_{11} = 1$), $f_1 : \mathbb{R}^+_0 \to \mathbb{R}^+$ is the infinite-dimensional parameter (an a.e. strictly positive function), and the constant $c_{k,f_1}$ is a normalization factor depending on the dimension $k$ and $f_1$. The function $f_1$ is called a radial density, although it need not integrate to one and therefore is not a probability density. We denote the modulus of the centred and sphericized observations $Z^{(n)}_i = Z^{(n)}_i(\theta, V) := V^{-1/2}\left( X^{(n)}_i - \theta \right)$, $i = 1, \ldots, n$, by $d^{(n)}_i = d^{(n)}_i(\theta, V) := \left\| Z^{(n)}_i(\theta, V) \right\|$. If the $X^{(n)}_i$ have density (5.2.34), then the moduli $\left\{ d^{(n)}_i \right\}$ are i.i.d. with density function
$$r \mapsto \frac{1}{\sigma}\tilde{f}_{1,k}(r/\sigma) := \frac{1}{\sigma\,\mu_{k-1;f_1}}\left( \frac{r}{\sigma} \right)^{k-1} f_1\!\left( \frac{r}{\sigma} \right)\chi_{\{r>0\}}$$
and distribution function
$$r \mapsto \tilde{F}_{1,k}(r/\sigma) := \int_0^{r/\sigma} \tilde{f}_{1,k}(s)\,ds, \qquad (5.2.35)$$
provided that
$$\mu_{k-1;f_1} := \int_0^{\infty} r^{k-1} f_1(r)\,dr < \infty.$$

Let $\mathrm{vech}\,M = \left( M_{11}, \left( \mathrm{vech}^{\circ} M \right)' \right)'$ denote the $\frac{k(k+1)}{2}$-dimensional vector of the upper-triangular elements of a $k \times k$ symmetric matrix $M = (M_{ij})$, and let $\eta = \left( \theta', \sigma^2, \left( \mathrm{vech}^{\circ} V \right)' \right)'$. The parameter space then becomes $\Theta := \mathbb{R}^k \times \mathbb{R}^+ \times \mathcal{V}_k$, where $\mathcal{V}_k$ is the set of values of $\mathrm{vech}^{\circ} V$ in $\mathbb{R}^{\frac{k(k-1)}{2}}$. We denote the distribution of $X^{(n)}$ under given values of $\eta = \left( \theta', \sigma^2, \left( \mathrm{vech}^{\circ} V \right)' \right)'$ and $f_1$ by $P^{(n)}_{\eta;f_1}$ or $P^{(n)}_{\theta,\sigma^2,V;f_1}$.

We denote the rank of $d^{(n)}_i = d^{(n)}_i(\theta, V)$ among $d^{(n)}_1, \ldots, d^{(n)}_n$ by $R^{(n)}_i = R^{(n)}_i(\theta, V)$. Under $P^{(n)}_{\eta;f_1}$, the vector $\left( R^{(n)}_1, \ldots, R^{(n)}_n \right)$ is uniformly distributed over the $n!$ permutations of $(1, \ldots, n)$. Furthermore, let $U^{(n)}_i = U^{(n)}_i(\theta, V) := Z^{(n)}_i / d^{(n)}_i$. Then, under $P^{(n)}_{\eta;f_1}$, the vectors $U^{(n)}_i$ are i.i.d., uniformly distributed over the unit sphere, and independent of the ranks $R^{(n)}_i$. The vectors $U^{(n)}_i$ can be considered multivariate signs of the centred observations $X_i - \theta$, since they are invariant under transformations of $X_i - \theta$ that preserve half-lines through the origin (Problem 5.6).
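The moduli, multivariate signs and ranks are cheap to compute; a short sketch for Gaussian (hence elliptical) data with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
theta = np.array([1.0, -2.0, 0.5])                  # hypothetical location
V = np.array([[1.0, 0.3, 0.0],
              [0.3, 2.0, 0.1],
              [0.0, 0.1, 1.5]])                     # hypothetical scatter/shape

# Gaussian observations are a special case of the elliptical family (5.2.34).
X = rng.multivariate_normal(theta, V, size=n)

# Z_i = V^{-1/2}(X_i - theta), moduli d_i = ||Z_i||, signs U_i = Z_i / d_i.
w, P = np.linalg.eigh(V)
V_inv_sqrt = P @ np.diag(w ** -0.5) @ P.T           # symmetric inverse square root
Z = (X - theta) @ V_inv_sqrt
d = np.linalg.norm(Z, axis=1)
U = Z / d[:, None]

R = np.argsort(np.argsort(d)) + 1                   # ranks R_i of the moduli
```

Each $U_i$ has unit norm by construction, and under the true $(\theta, V)$ the signs are uniform on the sphere and independent of the ranks of the moduli.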
Hallin and Paindaveine (2006) considered testing the hypothesis that the shape matrix $V$ is equal to some given value $V_0$. As a special case, taking $V_0 = I_k$ yields a test of sphericity. The shape matrix $V$ is thus the parameter of interest, while $\theta$, $\sigma^2$ and $f_1$ are nuisance parameters. We would like to employ test statistics whose null distributions remain invariant under variations of $\theta$, $\sigma^2$ and $f_1$. When $\theta$ is known, this is achieved by test statistics based on the signs $U^{(n)}_i$ and ranks $R^{(n)}_i$ computed from $Z^{(n)}_i(\theta, V_0)$, $i = 1, \ldots, n$. These tests are invariant under monotone radial transformations (including scale transformations) and under rotations and reflections of the observations with respect to $\theta$.
The statistical experiments involve the nonparametric family
$$\mathcal{P}^{(n)} := \bigcup_{f_1 \in \mathcal{F}_A} \mathcal{P}^{(n)}_{f_1} := \bigcup_{f_1 \in \mathcal{F}_A} \bigcup_{\sigma>0} \mathcal{P}^{(n)}_{\sigma^2;f_1} := \bigcup_{f_1 \in \mathcal{F}_A} \bigcup_{\sigma>0} \left\{ P^{(n)}_{\theta,\sigma^2,V;f_1} \,\Big|\, \theta \in \mathbb{R}^k,\ V \in \mathcal{V}_k \right\},$$
where $f_1$ ranges over the set $\mathcal{F}_A$ of standardized radial densities satisfying the assumptions below (Assumptions (A1) and (A2) of Hallin and Paindaveine (2006)). The semiparametric structure is induced by the partition of $\mathcal{P}^{(n)}$ into the collection of parametric subexperiments $\mathcal{P}^{(n)}_{\sigma^2;f_1}$.
We need uniform local asymptotic normality (ULAN) with respect to $\eta = \left( \theta', \sigma^2, \left( \mathrm{vech}^{\circ} V \right)' \right)'$ of the family $\mathcal{P}^{(n)}_{f_1}$ (see Hallin and Paindaveine (2006)). For this purpose, denote by $L^2(\Omega, \lambda)$ the space of measurable functions $h : \Omega \to \mathbb{R}$ satisfying $\int_{\Omega} \{h(x)\}^2\, d\lambda(x) < \infty$. In particular, we consider the space $L^2(\mathbb{R}^+, \mu_l)$ of measurable functions $h : \mathbb{R}^+ \to \mathbb{R}$ satisfying $\int_0^{\infty} \{h(x)\}^2 x^l\, dx < \infty$, and the space $L^2(\mathbb{R}, \nu_l)$ of measurable functions $h : \mathbb{R} \to \mathbb{R}$ satisfying $\int_{-\infty}^{\infty} \{h(x)\}^2 e^{lx}\, dx < \infty$. Note that $g \in L^2(\Omega, \lambda)$ admits the weak derivative $T$ iff $\int_{\Omega} g(x)\varphi'(x)\,dx = -\int_{\Omega} T(x)\varphi(x)\,dx$ for all infinitely differentiable, compactly supported functions $\varphi$ on $\Omega$. Define the mappings $x \mapsto f^{1/2}_1(x)$ and $x \mapsto f^{1/2}_{1;\exp}(x)$, and let
$$\psi_{f_1}(r) := -2\,\frac{\left( f^{1/2}_1 \right)'(r)}{f^{1/2}_1(r)} \quad\text{and}\quad \varphi_{f_1}(r) := -2\,\frac{\left( f^{1/2}_{1;\exp} \right)'(\log r)}{f^{1/2}_{1;\exp}(\log r)},$$
where $\left( f^{1/2}_1 \right)'$ and $\left( f^{1/2}_{1;\exp} \right)'$ stand for the weak derivatives of $f^{1/2}_1$ in $L^2(\mathbb{R}^+, \mu_{k-1})$ and of $f^{1/2}_{1;\exp}$ in $L^2(\mathbb{R}, \nu_k)$, respectively.
The following assumptions (Assumptions (A1) and (A2) of Hallin and Paindaveine (2006)) imply the finiteness of the radial Fisher information for location,
$$\mathcal{I}_k(f_1) := E_{P^{(n)}_{\eta;f_1}}\left[ \varphi^2_{f_1}\left\{ d^{(n)}_i(\theta, V)/\sigma \right\} \right] = \int_0^1 \varphi^2_{f_1}\left\{ \tilde{F}^{-1}_{1,k}(u) \right\} du,$$
and of the radial Fisher information for shape (and scale),
$$\mathcal{J}_k(f_1) := E_{P^{(n)}_{\eta;f_1}}\left[ \psi^2_{f_1}\left\{ d^{(n)}_i(\theta, V)/\sigma \right\} \left\{ d^{(n)}_i(\theta, V)/\sigma \right\}^2 \right] = \int_0^1 K^2_{f_1}(u)\, du,$$
where $K_{f_1}(u) := \psi_{f_1}\left\{ \tilde{F}^{-1}_{1,k}(u) \right\} \tilde{F}^{-1}_{1,k}(u)$.

Assumption 5.2.37 ((A1)). The mapping $x \mapsto f^{1/2}_1(x)$ is in the Sobolev space $W^{1,2}(\mathbb{R}^+, \mu_{k-1})$ of order 1 on $L^2(\mathbb{R}^+, \mu_{k-1})$.

Assumption 5.2.38 ((A2)). The mapping $x \mapsto f^{1/2}_{1;\exp}(x)$ is in the Sobolev space $W^{1,2}(\mathbb{R}, \nu_k)$ of order 1 on $L^2(\mathbb{R}, \nu_k)$.
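As a concrete check of these definitions, take the Gaussian radial density $f_1(r) = \exp(-r^2/2)$ (our choice for illustration), for which $\psi_{f_1}(r) = r$; then $K_{f_1}(u) = \left\{ \tilde{F}^{-1}_{1,k}(u) \right\}^2$ is the $\chi^2_k$ quantile function, and $\mathcal{J}_k(f_1) = E\left[ (\chi^2_k)^2 \right] = k(k+2)$. A Monte Carlo evaluation of $\int_0^1 K^2_{f_1}(u)\,du$:

```python
import numpy as np

# Gaussian radial density f1(r) = exp(-r^2 / 2): psi_{f1}(r) = r, so
# K_{f1}(u) = {F~^{-1}_{1,k}(u)}^2 is the chi-square_k quantile function and
# J_k(f1) = int_0^1 K^2(u) du = E[(chi^2_k)^2] = k(k + 2).
rng = np.random.default_rng(5)
k = 3
chi2 = rng.chisquare(k, size=400_000)   # draws of K_{f1}(U), U ~ Uniform(0, 1)
Jk_mc = float(np.mean(chi2 ** 2))       # Monte Carlo estimate of J_k(f1)
assert abs(Jk_mc - k * (k + 2)) < 0.5
```

For $k = 3$ the estimate is close to $k(k+2) = 15$, matching the known value of $\mathcal{J}_k(f_1)$ in the Gaussian case.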
Furthermore, we use the following notation. The $l$th vector in the canonical basis of $\mathbb{R}^k$ is denoted by $e_l$, and the $k \times k$ identity matrix is denoted by $I_k$. Let
$$K_k := \sum_{i,j=1}^{k} \left( e_i e_j' \right) \otimes \left( e_j e_i' \right)$$
denote the so-called commutation matrix, let
$$J_k := \sum_{i,j=1}^{k} \left( e_i e_j' \right) \otimes \left( e_i e_j' \right) = (\mathrm{vec}\,I_k)(\mathrm{vec}\,I_k)',$$
let $M_k$ be the $\frac{k(k-1)}{2} \times k^2$ matrix such that $M_k'\left( \mathrm{vech}^{\circ} V \right) = \mathrm{vec}\,V$ for any symmetric $k \times k$ matrix $V = (v_{ij})$ with $v_{11} = 0$, and let $V^{\otimes 2}$ denote the Kronecker product $V \otimes V$. Then, we have the following ULAN result.
Theorem 5.2.39 (Proposition 2.1 of Hallin and Paindaveine (2006)). Under Assumptions (A1) and (A2), the family $\mathcal{P}^{(n)}_{f_1} = \left\{ P^{(n)}_{\eta;f_1} \mid \eta \in \Theta \right\}$ is ULAN with central sequence
$$\Delta^{(n)}_{f_1}(\eta) := \begin{pmatrix} \Delta^{(n)}_{f_1;1}(\eta) \\ \Delta^{(n)}_{f_1;2}(\eta) \\ \Delta^{(n)}_{f_1;3}(\eta) \end{pmatrix} := \begin{pmatrix} n^{-1/2}\,\dfrac{1}{\sigma} \displaystyle\sum_{i=1}^{n} \varphi_{f_1}\!\left( \dfrac{d^{(n)}_i}{\sigma} \right) V^{-1/2}\, U^{(n)}_i \\[3mm] \dfrac{1}{2}\, n^{-1/2} \begin{pmatrix} \sigma^{-2}\, (\mathrm{vec}\, I_k)' \\ M_k \left( V^{\otimes 2} \right)^{-1/2} \end{pmatrix} \displaystyle\sum_{i=1}^{n} \mathrm{vec}\left\{ \psi_{f_1}\!\left( \dfrac{d^{(n)}_i}{\sigma} \right) \dfrac{d^{(n)}_i}{\sigma}\, U^{(n)}_i U^{(n)\prime}_i - I_k \right\} \end{pmatrix}$$
and full-rank information matrix
$$\Gamma_{f_1}(\eta) := \begin{pmatrix} \Gamma_{f_1;11}(\eta) & 0 & 0 \\ 0 & \Gamma_{f_1;22}(\eta) & \Gamma'_{f_1;32}(\eta) \\ 0 & \Gamma_{f_1;32}(\eta) & \Gamma_{f_1;33}(\eta) \end{pmatrix},$$
where
$$\Gamma_{f_1;11}(\eta) := \frac{1}{k\sigma^2}\,\mathcal{I}_k(f_1)\, V^{-1}, \qquad \Gamma_{f_1;22}(\eta) := \frac{1}{4\sigma^4}\left\{ \mathcal{J}_k(f_1) - k^2 \right\}, \qquad \Gamma_{f_1;32}(\eta) := \frac{1}{4k\sigma^2}\left\{ \mathcal{J}_k(f_1) - k^2 \right\} M_k\, \mathrm{vec}\,V^{-1},$$
and
$$\Gamma_{f_1;33}(\eta) := \frac{1}{4}\, M_k \left( V^{\otimes 2} \right)^{-1/2} \left\{ \frac{\mathcal{J}_k(f_1)}{k(k+2)} \left( I_{k^2} + K_k + J_k \right) - J_k \right\} \left( V^{\otimes 2} \right)^{-1/2} M_k'.$$
More precisely, for any $\eta^{(n)} = \left( \theta^{(n)\prime}, \sigma^{2(n)}, \left( \mathrm{vech}^{\circ} V^{(n)} \right)' \right)' = \eta + O\left( n^{-1/2} \right)$ and any bounded sequence $\tau^{(n)} := \left( t^{(n)\prime}, s^{(n)}, \left( \mathrm{vech}^{\circ} v^{(n)} \right)' \right)' = \left( \tau^{(n)\prime}_1, \tau^{(n)}_2, \tau^{(n)\prime}_3 \right)'$ in $\mathbb{R}^{k+1+\frac{k(k-1)}{2}}$, we have
$$\Lambda^{(n)}_{\eta^{(n)}+n^{-1/2}\tau^{(n)} / \eta^{(n)};f_1} := \log\frac{dP^{(n)}_{\eta^{(n)}+n^{-1/2}\tau^{(n)};f_1}}{dP^{(n)}_{\eta^{(n)};f_1}} = \tau^{(n)\prime}\, \Delta^{(n)}_{f_1}\!\left( \eta^{(n)} \right) - \frac{1}{2}\,\tau^{(n)\prime}\, \Gamma_{f_1}(\eta)\, \tau^{(n)} + o_P(1)$$
and
$$\Delta^{(n)}_{f_1}\!\left( \eta^{(n)} \right) \xrightarrow{L} N\left( 0, \Gamma_{f_1}(\eta) \right)$$
under $P^{(n)}_{\eta^{(n)};f_1}$, as $n \to \infty$.
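The commutation matrix $K_k$ and the matrix $J_k$ appearing in $\Gamma_{f_1;33}$ can be assembled directly from their definitions; the sketch below also verifies the characteristic property $K_k\, \mathrm{vec}(A) = \mathrm{vec}(A')$:

```python
import numpy as np

def commutation(k):
    """K_k = sum_{i,j} (e_i e_j') kron (e_j e_i')."""
    K = np.zeros((k * k, k * k))
    E = np.eye(k)
    for i in range(k):
        for j in range(k):
            K += np.kron(np.outer(E[i], E[j]), np.outer(E[j], E[i]))
    return K

k = 3
Kk = commutation(k)
vecI = np.eye(k).reshape(-1, order="F")
Jk = np.outer(vecI, vecI)                  # J_k = (vec I_k)(vec I_k)'

# Characteristic property: K_k vec(A) = vec(A') (column-stacking vec).
A = np.random.default_rng(6).standard_normal((k, k))
vec = lambda M: M.reshape(-1, order="F")
assert np.allclose(Kk @ vec(A), vec(A.T))
```

The column-stacking convention for $\mathrm{vec}$ matters here; with it, $(B \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(C X B')$, from which the asserted property follows.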

Test statistics based on $f_1$-efficient central sequences are valid only under a correctly specified radial density. A correct specification $f_1$ of the actual radial density $g_1$ is, however, unrealistic. Therefore, we have to consider the problem from a semiparametric point of view, where $g_1$ is a nuisance parameter. Within the family of distributions $\bigcup_{\sigma^2} \bigcup_{V} \bigcup_{g_1} \left\{ P^{(n)}_{\theta,\sigma^2,V;g_1} \right\}$, where $\theta$ is fixed, we consider the null hypothesis $H_0(\theta, V_0)$ under which $V = V_0$; that is, $\theta$ and $V = V_0$ are fixed, while $\sigma^2$ and the radial density $g_1$ remain unspecified. For the scalar nuisance $\sigma^2$, the efficient central sequence is obtained by means of a simple projection. For the infinite-dimensional nuisance $g_1$, one can in principle take the projection of the central sequence along an adequate tangent space. Here, however, we exploit the appropriate group invariance structures, which yield the same result by conditioning the central sequence on maximal invariants such as ranks or signs, in line with Hallin and Werker (2003).
The null hypothesis H0 (θ, V0 ) is invariant under the following two groups of transformations
acting on the observations X1 , . . . , Xn .
(i) The group $\mathcal{G}^{(n)}_{\mathrm{orth}}, \circ := \mathcal{G}^{(n)}_{\mathrm{orth};\theta,V_0}, \circ$ of $V_0$-orthogonal transformations centred at $\theta$, consisting of all transformations of the form
$$X^{(n)} \mapsto G_O(X_1, \ldots, X_n) = G_O\!\left( \theta + d^{(n)}_1(\theta, V_0)\, V_0^{1/2} U^{(n)}_1(\theta, V_0),\, \ldots,\, \theta + d^{(n)}_n(\theta, V_0)\, V_0^{1/2} U^{(n)}_n(\theta, V_0) \right)$$
$$:= \left( \theta + d^{(n)}_1(\theta, V_0)\, V_0^{1/2} O\, U^{(n)}_1(\theta, V_0),\, \ldots,\, \theta + d^{(n)}_n(\theta, V_0)\, V_0^{1/2} O\, U^{(n)}_n(\theta, V_0) \right),$$
where $O$ is an arbitrary $k \times k$ orthogonal matrix. This group contains the rotations around $\theta$ in the metric associated with $V_0$; it also contains the reflection with respect to $\theta$, namely, the mapping $(X_1, \ldots, X_n) \mapsto (\theta - (X_1 - \theta), \ldots, \theta - (X_n - \theta))$.
(ii) The group $\mathcal{G}^{(n)}, \circ := \mathcal{G}^{(n)}_{\theta,V_0}, \circ$ of continuous monotone radial transformations, consisting of all transformations of the form
$$X^{(n)} \mapsto G_h(X_1, \ldots, X_n) = G_h\!\left( \theta + d^{(n)}_1(\theta, V_0)\, V_0^{1/2} U^{(n)}_1(\theta, V_0),\, \ldots,\, \theta + d^{(n)}_n(\theta, V_0)\, V_0^{1/2} U^{(n)}_n(\theta, V_0) \right)$$
$$:= \left( \theta + h\!\left\{ d^{(n)}_1(\theta, V_0) \right\} V_0^{1/2} U^{(n)}_1(\theta, V_0),\, \ldots,\, \theta + h\!\left\{ d^{(n)}_n(\theta, V_0) \right\} V_0^{1/2} U^{(n)}_n(\theta, V_0) \right),$$
where $h : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ is continuous and monotone increasing, with $h(0) = 0$ and $\lim_{r\to\infty} h(r) = \infty$. This group contains the subgroup of scale transformations $(X_1, \ldots, X_n) \mapsto (\theta + \alpha(X_1 - \theta), \ldots, \theta + \alpha(X_n - \theta))$, $\alpha > 0$.
The group $\mathcal{G}^{(n)}, \circ$ of continuous monotone radial transformations is a generating group for the family of distributions $\bigcup_{\sigma^2} \bigcup_{f_1} \left\{ P^{(n)}_{\theta,\sigma^2,V_0;f_1} \right\}$, that is, a generating group for the null hypothesis $H_0(\theta, V_0)$ under consideration. Therefore, by the invariance principle, we may restrict attention to test statistics that are measurable with respect to the corresponding maximal invariant; in this case, it is the vector
$$\left( R^{(n)}_1(\theta, V_0), \ldots, R^{(n)}_n(\theta, V_0),\, U^{(n)}_1(\theta, V_0), \ldots, U^{(n)}_n(\theta, V_0) \right),$$
where $R^{(n)}_i(\theta, V_0)$ denotes the rank of $d^{(n)}_i(\theta, V_0)$ among $d^{(n)}_1(\theta, V_0), \ldots, d^{(n)}_n(\theta, V_0)$. The resulting signed rank test statistics are invariant under $\mathcal{G}^{(n)}, \circ$, and hence distribution-free under $H_0(\theta, V_0)$.

Combining invariance and optimality arguments, by considering a signed rank-based version of the $f_1$-efficient central sequence for shape, we construct test statistics for the null hypothesis $H_0(\theta, V_0)$. Note that central sequences are always defined up to $o_P(1)$ terms under $P^{(n)}_{\eta;f_1}$ as $n \to \infty$.
We consider the signed rank version ∆(n)
f1
(η) of the shape efficient central sequence

∆∗(n)
f1
(η) = ∆(n)
f1 :3
(η) − Γ f1 :32 (η) Γ−1 (n)
f1 :22 (η) ∆ f1 :2 (η)
 (n)  (n)
1 −1/2   X n  d  d  ′
⊗2 −1/2 ⊥
= n Mk V Jk ψ f1  i  i vec Ui(n) Ui(n) ,
2 i=1
σ σ

where Jk⊥ := Ik2 − 1k Jk , in our nonparametric test. For this purpose, we use the f1 -score version,
based on the scores K = K f1 , of statistic
$$\Delta_K^{(n)}(\eta) := \frac{1}{2} n^{-1/2} M_k \left(V^{\otimes 2}\right)^{-1/2} J_k^{\perp} \sum_{i=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) \mathrm{vec}\!\left(U_i^{(n)} U_i^{(n)\prime}\right)$$
$$= \frac{1}{2} n^{-1/2} M_k \left(V^{\otimes 2}\right)^{-1/2} \sum_{i=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) \mathrm{vec}\!\left(U_i^{(n)} U_i^{(n)\prime} - \frac{1}{k} I_k\right)$$
$$= \frac{1}{2} n^{-1/2} M_k \left(V^{\otimes 2}\right)^{-1/2} \left[\sum_{i=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) \mathrm{vec}\!\left(U_i^{(n)} U_i^{(n)\prime}\right) - \frac{m_K^{(n)}}{k}\, \mathrm{vec}\, I_k\right],$$
where $R_i^{(n)} = R_i^{(n)}(\theta, V)$ denotes the rank of $d_i^{(n)} = d_i^{(n)}(\theta, V)$ among $d_1^{(n)}, \ldots, d_n^{(n)}$, $U_i^{(n)} = U_i^{(n)}(\theta, V)$ and $m_K^{(n)} = \sum_{i=1}^n K\!\left(\frac{i}{n+1}\right)$. The following asymptotic representation result shows that $\Delta^{(n)}_{f_1}(\eta)$ is indeed another version of the efficient central sequence $\Delta^{*(n)}_{f_1}(\eta)$.

Lemma 5.2.40 (Lemma 4.1 of Hallin and Paindaveine (2006)). Assume that the score function $K: (0,1) \to \mathbb{R}$ is continuous and square integrable, and that it can be expressed as the difference of two monotone increasing functions. Then, defining
$$\Delta^{*(n)}_{K;g_1}(\eta) := \frac{1}{2} n^{-1/2} M_k \left(V^{\otimes 2}\right)^{-1/2} J_k^{\perp} \sum_{i=1}^n K\!\left(\tilde{G}_{1k}\!\left(\frac{d_i^{(n)}}{\sigma}\right)\right) \mathrm{vec}\!\left(U_i^{(n)} U_i^{(n)\prime}\right),$$
we have $\Delta_K^{(n)}(\eta) = \Delta^{*(n)}_{K;g_1}(\eta) + o_{L^2}(1)$ as $n \to \infty$, under $P^{(n)}_{\eta; g_1}$, where $\tilde{G}_{1k}$ is defined in the same manner as $\tilde{F}_{1k}$ in (5.2.4.2).
Now, we can propose the class of test statistics. Let $K: (0,1) \to \mathbb{R}$ be some score function as in Lemma 5.2.40, and write $E\{K(U)\}$ and $E\{K^2(U)\}$ for $\int_0^1 K(u)\, du$ and $\int_0^1 K^2(u)\, du$, respectively. Then, the $K$-score version of the statistic for testing $\mathcal{H}_0: V = V_0$ is given by
$$Q_K = Q_K^{(n)} := \frac{k(k+2)}{2n E\{K^2(U)\}} \sum_{i,j=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) K\!\left(\frac{R_j^{(n)}}{n+1}\right) \left\{\left(U_i^{(n)\prime} U_j^{(n)}\right)^2 - \frac{1}{k}\right\},$$
where $R_i^{(n)} = R_i^{(n)}(\theta, V_0)$ and $U_i^{(n)} = U_i^{(n)}(\theta, V_0)$. These tests can be rewritten as
$$Q_K = \frac{nk(k+2)}{2 E\{K^2(U)\}} \left(\mathrm{tr}\, S_K^2 - \frac{1}{k}\, \mathrm{tr}^2 S_K\right)$$
$$= \frac{nk(k+2)\, E\{K(U)\}^2}{2 E\{K^2(U)\}} \left\| \frac{S_K}{\mathrm{tr}\, S_K} - \frac{1}{k} I_k \right\|^2 + o_P(1)$$
$$= \frac{k(k+2)}{2 E\{K^2(U)\}} \left\| n^{-1/2} \sum_{i=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) \left(U_i^{(n)} U_i^{(n)\prime} - \frac{1}{k} I_k\right) \right\|^2 + o_P(1),$$
as $n \to \infty$ under any distribution, where
$$S_K = S_K^{(n)} := \frac{1}{n} \sum_{i=1}^n K\!\left(\frac{R_i^{(n)}}{n+1}\right) U_i^{(n)} U_i^{(n)\prime}.$$

On the other hand, local optimality under the radial density $f_1$ is achieved by the test based on the statistic
$$Q_{f_1} = Q_{K_{f_1}}^{(n)} := \frac{k(k+2)}{2n I_k(f_1)} \sum_{i,j=1}^n K_{f_1}\!\left(\frac{R_i^{(n)}}{n+1}\right) K_{f_1}\!\left(\frac{R_j^{(n)}}{n+1}\right) \left\{\left(U_i^{(n)\prime} U_j^{(n)}\right)^2 - \frac{1}{k}\right\},$$
which can be rewritten as
$$Q_{f_1} = \frac{nk(k+2)}{2 I_k(f_1)} \left(\mathrm{tr}\, S_{f_1}^2 - \frac{1}{k}\, \mathrm{tr}^2 S_{f_1}\right) = \frac{nk(k+2)}{2 I_k(f_1)} \left\| \frac{S_{f_1}}{\mathrm{tr}\, S_{f_1}} - \frac{1}{k} I_k \right\|^2 + o_P(1),$$
as $n \to \infty$ under any distribution, where
$$S_{f_1} = S_{f_1}^{(n)} := \frac{1}{n} \sum_{i=1}^n K_{f_1}\!\left(\frac{R_i^{(n)}}{n+1}\right) U_i^{(n)} U_i^{(n)\prime}.$$

To describe the asymptotic behaviour of $Q_{f_1}$ and $Q_K$, we define the quantities
$$I_k(f_1, g_1) := \int_0^1 K_{f_1}(u) K_{g_1}(u)\, du,$$
which is a measure of cross-information, and
$$I_k(K; g_1) := \int_0^1 K(u) K_{g_1}(u)\, du.$$
We define the rank-based tests $\phi^{(n)}_{f_1}$ and $\phi^{(n)}_K$, which consist in rejecting $\mathcal{H}_0: V = V_0$ when $Q_{f_1}$ and $Q_K$, respectively, exceed the $\alpha$-upper quantile of a chi-square distribution with $(k-1)(k+2)/2$ degrees of freedom. Now, we have the following result.
Theorem 5.2.41 (Proposition 4.1 of Hallin and Paindaveine (2006)). Let $K$ be a continuous, square integrable score function defined on $(0,1)$ that can be expressed as the difference of two monotone increasing functions. Furthermore, let $f_1$, satisfying Assumptions (A1) and (A2), be such that $K_{f_1}$ is continuous and can be expressed as the difference of two monotone increasing functions. Then, we have the following.
(i) The test statistics $Q_{f_1}$ and $Q_K$ are asymptotically chi-square with $(k-1)(k+2)/2$ degrees of freedom under $\cup_{\sigma^2} \cup_{g_1} \left\{P^{(n)}_{\theta, \sigma^2, V_0; g_1}\right\}$. On the other hand, under $\cup_{\sigma^2} \left\{P^{(n)}_{\theta, \sigma^2, V_0 + n^{-1/2} v; g_1}\right\}$, $Q_{f_1}$ and $Q_K$ are asymptotically noncentral chi-square with $(k-1)(k+2)/2$ degrees of freedom, with noncentrality parameters
$$\frac{I_k^2(f_1, g_1)}{2k(k+2)\, I_k(f_1)} \left[\mathrm{tr}\left\{\left(V_0^{-1} v\right)^2\right\} - \frac{1}{k} \left\{\mathrm{tr}\left(V_0^{-1} v\right)\right\}^2\right]$$
and
$$\frac{I_k^2(K, g_1)}{2k(k+2)\, E\{K^2(U)\}} \left[\mathrm{tr}\left\{\left(V_0^{-1} v\right)^2\right\} - \frac{1}{k} \left\{\mathrm{tr}\left(V_0^{-1} v\right)\right\}^2\right],$$
respectively.
(ii) The sequences of tests $\phi^{(n)}_{f_1}$ and $\phi^{(n)}_K$ have asymptotic level $\alpha$ under $\cup_{\sigma^2} \cup_{g_1} \left\{P^{(n)}_{\theta, \sigma^2, V_0; g_1}\right\}$.
(iii) Furthermore, the sequence of tests $\phi^{(n)}_{f_1}$ is locally asymptotically maximin efficient for $\cup_{\sigma^2} \cup_{g_1} \left\{P^{(n)}_{\theta, \sigma^2, V_0; g_1}\right\}$ against alternatives of the form $\cup_{\sigma^2} \cup_{V \neq V_0} \left\{P^{(n)}_{\theta, \sigma^2, V; f_1}\right\}$.
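For readers who want to experiment, the statistic $Q_K$ can be computed directly from data. The following Python sketch is ours, not the book's (the function name `shape_rank_statistic`, the default Wilcoxon-type score $K(u) = u$, and the use of a Cholesky root of $V_0$ are illustrative assumptions); note that $(U_i' U_j)^2$ does not depend on which square root of $V_0^{-1}$ is used.

```python
import numpy as np

def shape_rank_statistic(X, theta, V0, K=lambda u: u, EK2=1.0 / 3.0):
    """Sketch of the rank-based shape statistic Q_K for H0: V = V0.

    K is the score function and EK2 = E{K^2(U)} with U ~ Uniform(0,1);
    for the Wilcoxon-type score K(u) = u we have EK2 = 1/3.
    """
    n, k = X.shape
    L = np.linalg.cholesky(V0)               # V0 = L L' (one choice of root)
    Z = np.linalg.solve(L, (X - theta).T).T  # rows are L^{-1}(X_i - theta)
    d = np.linalg.norm(Z, axis=1)            # radial distances d_i
    U = Z / d[:, None]                       # standardized directions U_i
    R = np.argsort(np.argsort(d)) + 1        # ranks R_i of d_i among d_1, ..., d_n
    w = K(R / (n + 1.0))                     # scores K(R_i / (n + 1))
    G = (U @ U.T) ** 2 - 1.0 / k             # (U_i' U_j)^2 - 1/k
    return k * (k + 2) / (2.0 * n * EK2) * (w[:, None] * w[None, :] * G).sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))            # spherical sample: H0 holds with V0 = I_3
QK = shape_rank_statistic(X, np.zeros(3), np.eye(3))
```

Under the null, the value would be referred to the chi-square critical value of Theorem 5.2.41; the double sum makes the cost $O(n^2)$, which is unproblematic at these sample sizes.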

5.3 Asymptotic Theory of Rank Order Statistics for ARCH Residual Empirical Processes
This section develops the asymptotic theory of a class of rank order statistics {T N } for a two-sample
problem pertaining to empirical processes based on the squared residuals from two classes of ARCH
models. The results can be applied to testing the innovation distributions of two financial returns
generated by different mechanisms (e.g., different countries and/or different industries, etc.).
An important result is that, unlike the residuals of ARMA models, the asymptotics of $\{T_N\}$ depend on those of the ARCH volatility estimators. Such asymptotics provide a useful guide to the reliability of confidence intervals, asymptotic relative efficiency and the ARCH affection.
As we explained in the previous chapters, the ARCH model is one of the most fundamental
econometric models to describe financial returns. For an ARCH(p) model, Horváth et al. (2001)
derived the asymptotic distribution of the empirical process based on the squared residuals. Then
they showed that, unlike the residuals of ARMA models, those residuals do not behave like asymp-
totically independent random variables, and the asymptotic distribution involves a term depending
on estimators of the volatility parameters of the model.
In i.i.d. settings, a two-sample problem is one of the important statistical problems. For this prob-
lem, a class of rank order statistics plays a principal role since it provides locally most powerful rank
tests. The study of the asymptotic properties is a fundamental and essential part of nonparametric
statistics. Many authors have contributed to the development, and numerous theorems have been
formulated to show the asymptotic normality of a properly normalized rank order statistic in many
testing problems. The classical limit theorem is the celebrated Chernoff and Savage (1958) theorem,
which is widely used to study the asymptotic power and power efficiency of a class of two-sample
tests. Further refinements of the theorem are given by Hájek and Šidák (1967) and Puri and Sen
(1971). Specifically, Puri and Sen (1971) formulated the Chernoff–Savage theorem under less strin-
gent conditions on the score-generating functions.
This section discusses the asymptotic theory of the two-sample rank order statistics {T N } for
ARCH residual empirical processes based on the techniques of Puri and Sen (1971) and Horváth
et al. (2001). Since the asymptotics of the residual empirical processes are different from those for the usual ARMA processes, the limiting distribution of $\{T_N\}$ is greatly different from that of the ARMA case (and, of course, of the i.i.d. case).
Suppose that a class of ARCH($p_X$) models is characterized by the equations
$$X_t = \begin{cases} \sigma_t(\theta_X)\,\epsilon_t, \quad \sigma_t^2(\theta_X) = \theta_X^0 + \sum_{i=1}^{p_X} \theta_X^i X_{t-i}^2, & \text{for } t = 1, \cdots, m, \\ 0, & \text{for } t = -p_X + 1, \cdots, 0, \end{cases} \tag{5.3.1}$$
where $\{\epsilon_t\}$ is a sequence of i.i.d. $(0,1)$ random variables with fourth-order cumulant $\kappa_4^X$, $\theta_X = (\theta_X^0, \theta_X^1, \cdots, \theta_X^{p_X})' \in \Theta_X \subset \mathbb{R}^{p_X+1}$ is an unknown parameter vector satisfying $\theta_X^0 > 0$, $\theta_X^i \geq 0$, $i = 1, \cdots, p_X - 1$, $\theta_X^{p_X} > 0$, and $\epsilon_t$ is independent of $X_s$, $s < t$. Denote by $F(x)$ the distribution function of $\epsilon_t^2$, and we assume that $f(x) = F'(x)$ exists and is continuous on $(0, \infty)$.
Suppose that another class of ARCH($p_Y$) models, independent of $\{X_t\}$, is characterized similarly by the equations
$$Y_t = \begin{cases} \sigma_t(\theta_Y)\,\xi_t, \quad \sigma_t^2(\theta_Y) = \theta_Y^0 + \sum_{i=1}^{p_Y} \theta_Y^i Y_{t-i}^2, & \text{for } t = 1, \cdots, n, \\ 0, & \text{for } t = -p_Y + 1, \cdots, 0, \end{cases} \tag{5.3.2}$$
where $\{\xi_t\}$ is a sequence of i.i.d. $(0,1)$ random variables with fourth-order cumulant $\kappa_4^Y$, $\theta_Y = (\theta_Y^0, \theta_Y^1, \cdots, \theta_Y^{p_Y})' \in \Theta_Y \subset \mathbb{R}^{p_Y+1}$, $\theta_Y^0 > 0$, $\theta_Y^i \geq 0$, $i = 1, \cdots, p_Y - 1$, $\theta_Y^{p_Y} > 0$, are unknown parameters, and $\xi_t$ is independent of $Y_s$, $s < t$. The distribution function of $\xi_t^2$ is denoted by $G(x)$, and we assume that $g(x) = G'(x)$ exists and is continuous on $(0, \infty)$. For (5.3.1) and (5.3.2), we assume that $\theta_X^1 + \cdots + \theta_X^{p_X} < 1$ and $\theta_Y^1 + \cdots + \theta_Y^{p_Y} < 1$ for stationarity (see Milhøj (1985)).
Now we are interested in the two-sample problem of testing
$$H_0: F(x) = G(x) \ \text{for all } x \quad \text{against} \quad H_A: F(x) \neq G(x) \ \text{for some } x. \tag{5.3.3}$$

First, consider the estimation of $\theta_X$ and $\theta_Y$. Letting
$$Z_{X,t} = X_t^2, \quad Z_{Y,t} = Y_t^2, \quad \zeta_{X,t} = (\epsilon_t^2 - 1)\sigma_t^2(\theta_X), \quad \zeta_{Y,t} = (\xi_t^2 - 1)\sigma_t^2(\theta_Y),$$
$$W_{X,t} = (1, Z_{X,t}, \cdots, Z_{X,t-p_X+1})' \quad \text{and} \quad W_{Y,t} = (1, Z_{Y,t}, \cdots, Z_{Y,t-p_Y+1})',$$
we have the following autoregressive representations:
$$Z_{X,t} = \theta_X' W_{X,t-1} + \zeta_{X,t}, \quad Z_{Y,t} = \theta_Y' W_{Y,t-1} + \zeta_{Y,t}. \tag{5.3.4}$$
Here, note that $E[\zeta_{X,t} \mid \mathcal{F}_{t-1}^X] = E[\zeta_{Y,t} \mid \mathcal{F}_{t-1}^Y] = 0$, where $\mathcal{F}_t^X = \sigma\{X_t, X_{t-1}, \cdots\}$ and $\mathcal{F}_t^Y = \sigma\{Y_t, Y_{t-1}, \cdots\}$. Suppose that $Z_{X,1}, \cdots, Z_{X,m}$ and $Z_{Y,1}, \cdots, Z_{Y,n}$ are observed stretches from $\{Z_{X,t}\}$ and $\{Z_{Y,t}\}$, respectively. Let
$$L_m(\theta_X) = \sum_{t=1}^m (Z_{X,t} - \theta_X' W_{X,t-1})^2 \quad \text{and} \quad L_n(\theta_Y) = \sum_{t=1}^n (Z_{Y,t} - \theta_Y' W_{Y,t-1})^2 \tag{5.3.5}$$

be the least squares penalty functions. Then the conditional least squares estimators of $\theta_X$ and $\theta_Y$ are given by
$$\hat{\theta}_{X,m} = \arg\min_{\theta_X \in \Theta_X} L_m(\theta_X), \quad \hat{\theta}_{Y,n} = \arg\min_{\theta_Y \in \Theta_Y} L_n(\theta_Y),$$
respectively. In what follows we assume that $\hat{\theta}_{X,m}$ and $\hat{\theta}_{Y,n}$ are consistent and asymptotically normal with rates $m^{-1/2}$ and $n^{-1/2}$, respectively, i.e.,
$$\sqrt{m}\, \|\hat{\theta}_{X,m} - \theta_X\| = O_p(1), \quad \sqrt{n}\, \|\hat{\theta}_{Y,n} - \theta_Y\| = O_p(1), \tag{5.3.6}$$
where ||.|| denotes the Euclidean norm. For the validity of (5.3.6), Tjøstheim (1986, pp. 254–256)
gave a set of sufficient conditions.
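The conditional least squares step is easy to reproduce numerically. The sketch below is ours (the helper names are illustrative): it simulates the ARCH(1) special case of (5.3.1) and recovers $\theta_X$ by ordinary least squares in the autoregressive representation (5.3.4), i.e., regressing $Z_t = X_t^2$ on $W_{t-1} = (1, Z_{t-1})'$.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_arch1(theta0, theta1, m):
    """Simulate X_t = sigma_t eps_t with sigma_t^2 = theta0 + theta1 X_{t-1}^2, X_0 = 0."""
    x = np.zeros(m)
    x_prev = 0.0
    for t in range(m):
        sig2 = theta0 + theta1 * x_prev ** 2
        x[t] = np.sqrt(sig2) * rng.standard_normal()
        x_prev = x[t]
    return x

def cls_arch1(x):
    """Conditional least squares via Z_t = theta' W_{t-1} + zeta_t, with Z_t = X_t^2."""
    Z = x ** 2
    W = np.column_stack([np.ones(len(Z) - 1), Z[:-1]])  # W_{t-1} = (1, Z_{t-1})'
    theta_hat, *_ = np.linalg.lstsq(W, Z[1:], rcond=None)
    return theta_hat

x = simulate_arch1(1.0, 0.3, 20000)
theta_hat = cls_arch1(x)
```

With a long simulated stretch, the estimate lands close to the true $(\theta_X^0, \theta_X^1) = (1.0, 0.3)$, consistently with (5.3.6).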
The corresponding empirical squared residuals are
$$\hat{\epsilon}_t^2 = X_t^2 / \hat{\sigma}_t^2(\hat{\theta}_X), \; t = 1, \cdots, m \quad \text{and} \quad \hat{\xi}_t^2 = Y_t^2 / \hat{\sigma}_t^2(\hat{\theta}_Y), \; t = 1, \cdots, n, \tag{5.3.7}$$
where $\hat{\sigma}_t^2(\hat{\theta}_X) = \hat{\theta}_X^0 + \sum_{i=1}^{p_X} \hat{\theta}_X^i X_{t-i}^2$ and $\hat{\sigma}_t^2(\hat{\theta}_Y) = \hat{\theta}_Y^0 + \sum_{i=1}^{p_Y} \hat{\theta}_Y^i Y_{t-i}^2$.
In line with Puri and Sen (1971), we describe the setup. Let $N = m + n$ and $\lambda_N = m/N$. For (5.3.7), $m$ and $n$ are assumed to be such that $0 < \lambda_0 \leq \lambda_N \leq 1 - \lambda_0 < 1$ holds for some fixed $\lambda_0 \leq \frac{1}{2}$. Then the combined distribution is defined by
$$H_N(x) = \lambda_N F(x) + (1 - \lambda_N) G(x).$$
If $\hat{F}_m(x)$ and $\hat{G}_n(x)$ denote the empirical distribution functions of $\{\hat{\epsilon}_t^2\}$ and $\{\hat{\xi}_t^2\}$, the empirical counterpart of $H_N(x)$ is
$$\hat{H}_N(x) = \lambda_N \hat{F}_m(x) + (1 - \lambda_N) \hat{G}_n(x). \tag{5.3.8}$$


Write $\hat{B}_m(x) = \sqrt{m}\,(\hat{F}_m(x) - F(x))$ and $\hat{B}_n(x) = \sqrt{n}\,(\hat{G}_n(x) - G(x))$. Then they are expressed as
$$\hat{B}_m(x) = \frac{1}{\sqrt{m}} \sum_{t=1}^m [\chi(\hat{\epsilon}_t^2 \leq x) - F(x)] \tag{5.3.9}$$
and
$$\hat{B}_n(x) = \frac{1}{\sqrt{n}} \sum_{t=1}^n [\chi(\hat{\xi}_t^2 \leq x) - G(x)], \tag{5.3.10}$$
where $\chi(\cdot)$ is the indicator function. Horváth et al. (2001) showed that (5.3.9) has the representation
$$\hat{B}_m(x) = \epsilon_m(x) + A_X\, x f(x) + \eta_m(x), \tag{5.3.11}$$
where
$$\epsilon_m(x) = \frac{1}{\sqrt{m}} \sum_{t=1}^m [\chi(\epsilon_t^2 \leq x) - F(x)], \quad A_X = \sum_{i=0}^{p_X} \sqrt{m}\,(\hat{\theta}_{X,m}^i - \theta_X^i)\, \tau_{i,X}$$
and $\sup_x |\eta_m(x)| = o_p(1)$, with
$$\tau_{0,X} = E[1/\sigma_t^2(\theta_X)] \quad \text{and} \quad \tau_{i,X} = E[\sigma_{t-i}^2(\theta_X)\epsilon_{t-i}^2 / \sigma_t^2(\theta_X)], \quad 1 \leq i \leq p_X.$$
Similarly,
$$\hat{B}_n(x) = \epsilon_n(x) + A_Y\, x g(x) + \eta_n(x), \tag{5.3.12}$$
where
$$\epsilon_n(x) = \frac{1}{\sqrt{n}} \sum_{t=1}^n [\chi(\xi_t^2 \leq x) - G(x)], \quad A_Y = \sum_{i=0}^{p_Y} \sqrt{n}\,(\hat{\theta}_{Y,n}^i - \theta_Y^i)\, \tau_{i,Y} \tag{5.3.13}$$
and $\sup_x |\eta_n(x)| = o_p(1)$, with
$$\tau_{0,Y} = E[1/\sigma_t^2(\theta_Y)] \quad \text{and} \quad \tau_{i,Y} = E[\sigma_{t-i}^2(\theta_Y)\xi_{t-i}^2 / \sigma_t^2(\theta_Y)], \quad 1 \leq i \leq p_Y.$$
Let $F_m(x) = m^{-1} \sum_{t=1}^m \chi(\epsilon_t^2 \leq x)$, $G_n(x) = n^{-1} \sum_{t=1}^n \chi(\xi_t^2 \leq x)$ and $\tilde{H}_N(x) = \lambda_N F_m(x) + (1 - \lambda_N) G_n(x)$. From (5.3.11) and (5.3.12) it follows that
$$\hat{H}_N(x) = \tilde{H}_N(x) + m^{-1/2} \lambda_N A_X\, x f(x) + n^{-1/2} (1 - \lambda_N) A_Y\, x g(x) + \eta_N^*(x), \tag{5.3.14}$$
where $\eta_N^*(x) = m^{-1/2} \lambda_N \eta_m(x) + n^{-1/2} (1 - \lambda_N) \eta_n(x)$. The expression (5.3.14) is fundamental and will be used later.
Define $S_{N,i} = 1$ if the $i$th smallest of the combined residuals $\hat{\epsilon}_1^2, \cdots, \hat{\epsilon}_m^2, \hat{\xi}_1^2, \cdots, \hat{\xi}_n^2$ is from $\hat{\epsilon}_1^2, \cdots, \hat{\epsilon}_m^2$, and otherwise define $S_{N,i} = 0$. For the testing problem (5.3.3), consider a class of rank order statistics of the form
$$T_N = \frac{1}{m} \sum_{i=1}^N \varphi_{N,i}\, S_{N,i},$$
where the $\varphi_{N,i}$ are given constants called weights or scores. This definition is conventionally used. However, we use its equivalent form given by
$$T_N = \int J\left(\frac{N}{N+1} \hat{H}_N(x)\right) d\hat{F}_m(x), \tag{5.3.15}$$
where $\varphi_{N,i} = J(i/(N+1))$ and $J(u)$, $0 < u < 1$, is a continuous function. The class of $T_N$ is sufficiently rich since it includes the following typical examples:
(i) Wilcoxon's two-sample test with $J(u) = u$, $0 < u < 1$;
(ii) van der Waerden's two-sample test with $J(u) = \Phi^{-1}(u)$, $0 < u < 1$, where $\Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^x e^{-t^2/2}\, dt$;
(iii) Mood's two-sample test with $J(u) = (u - \frac{1}{2})^2$, $0 < u < 1$;
(iv) Klotz's normal scores test with $J(u) = (\Phi^{-1}(u))^2$, $0 < u < 1$.

Examples (i) and (ii) are tests for location, and (iii) and (iv) are tests for scale when $\{\hat{\epsilon}_t^2\}$ and $\{\hat{\xi}_t^2\}$ are assumed to have the same median. The purpose of this section is to elucidate the asymptotics of (5.3.15). For this we need the following regularity conditions.
Assumption 5.3.1. (i) $J(u)$ is not constant and has a continuous derivative $J'(u)$ on $(0,1)$.
(ii) There exists a constant $K > 0$ such that
$$|J(u)| \leq K[u(1-u)]^{-(1/2)+\delta} \quad \text{and} \quad |J'(u)| \leq K[u(1-u)]^{-(3/2)+\delta}$$
for some $\delta > 0$.
(iii) $x f(x)$, $x g(x)$, $x f'(x)$ and $x g'(x)$ are uniformly bounded, continuous and integrable functions on $(0, \infty)$.
(iv) There exists a constant $c > 0$ such that $F(x) \geq c\{x f(x)\}$ and $G(x) \geq c\{x g(x)\}$ for all $x > 0$.
Returning to the model of $\{X_t\}$ and $\{Y_t\}$, we impose a further condition. Write
$$A_{X,t} = \begin{pmatrix} \theta_X^1 \epsilon_t^2 & \cdots & \theta_X^{p_X-1} \epsilon_t^2 & \theta_X^{p_X} \epsilon_t^2 \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix} \quad \text{and} \quad A_{Y,t} = \begin{pmatrix} \theta_Y^1 \xi_t^2 & \cdots & \theta_Y^{p_Y-1} \xi_t^2 & \theta_Y^{p_Y} \xi_t^2 \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}.$$
We introduce the tensor product $A_{X,t}^{\otimes 3} = A_{X,t} \otimes A_{X,t} \otimes A_{X,t}$ (e.g., Hannan (1970, pp. 516–518)), and define $\Sigma_{X,3} = E[A_{X,t}^{\otimes 3}]$ and $\Sigma_{Y,3} = E[A_{Y,t}^{\otimes 3}]$.
Assumption 5.3.2. $E|\epsilon_t|^8 < \infty$ and $\|\Sigma_{X,3}\| < 1$; $E|\xi_t|^8 < \infty$ and $\|\Sigma_{Y,3}\| < 1$, where $\|\cdot\|$ is the spectral matrix norm.


From Chen and An (1998), we see that Assumption 5.3.2 implies $E(Z_{X,t}^3) < \infty$ and $E(Z_{Y,t}^3) < \infty$. In the case when $p_X = 1$ and $\{\epsilon_t\}$ is Gaussian, if $\|\Sigma_{X,3}\| < 1$, then $\theta_X^1 < 15^{-1/3} \approx 0.4$.
Before we state the main result, note that the matrices
$$U_X = E[W_{X,t-1} W_{X,t-1}'], \quad U_Y = E[W_{Y,t-1} W_{Y,t-1}'], \tag{5.3.16}$$
$$R_X = 2E[\sigma_t^4(\theta_X) W_{X,t-1} W_{X,t-1}'] \quad \text{and} \quad R_Y = 2E[\sigma_t^4(\theta_Y) W_{Y,t-1} W_{Y,t-1}']$$
are positive definite (see Exercise 5.1).


Recalling (5.3.4), we observe that
$$\frac{\partial}{\partial \theta_X^0} L_m(\theta_X) = -2 \sum_{t=1}^m (\epsilon_t^2 - 1)\sigma_t^2(\theta_X) = -2 \sum_{t=1}^m \phi_X(\epsilon_t^2)\, \vartheta_{t,X}^0 \quad \text{(say)},$$
$$\frac{\partial}{\partial \theta_X^i} L_m(\theta_X) = -2 \sum_{t=1}^m (\epsilon_t^2 - 1)\sigma_t^2(\theta_X) Z_{X,t-i} = -2 \sum_{t=1}^m \phi_X(\epsilon_t^2)\, \vartheta_{t,X}^i, \quad 1 \leq i \leq p_X \quad \text{(say)},$$
$$\frac{\partial}{\partial \theta_Y^0} L_n(\theta_Y) = -2 \sum_{t=1}^n (\xi_t^2 - 1)\sigma_t^2(\theta_Y) = -2 \sum_{t=1}^n \phi_Y(\xi_t^2)\, \vartheta_{t,Y}^0 \quad \text{(say)},$$
$$\frac{\partial}{\partial \theta_Y^i} L_n(\theta_Y) = -2 \sum_{t=1}^n (\xi_t^2 - 1)\sigma_t^2(\theta_Y) Z_{Y,t-i} = -2 \sum_{t=1}^n \phi_Y(\xi_t^2)\, \vartheta_{t,Y}^i, \quad 1 \leq i \leq p_Y \quad \text{(say)},$$
where $\phi_X(u) = \phi_Y(u) = u - 1$. Write $\vartheta_{t,X} = (\vartheta_{t,X}^0, \cdots, \vartheta_{t,X}^{p_X})'$ and $\vartheta_{t,Y} = (\vartheta_{t,Y}^0, \cdots, \vartheta_{t,Y}^{p_Y})'$. Then, using standard arguments, it is seen that the $i$th elements of $\hat{\theta}_{X,m}$ and $\hat{\theta}_{Y,n}$ admit the stochastic expansions
$$\hat{\theta}_{X,m}^i - \theta_X^i = \frac{1}{m} \sum_{t=1}^m U_{t,X}^i\, \phi_X(\epsilon_t^2) + o_p(m^{-1/2}), \quad 0 \leq i \leq p_X, \quad \text{and}$$
$$\hat{\theta}_{Y,n}^i - \theta_Y^i = \frac{1}{n} \sum_{t=1}^n U_{t,Y}^i\, \phi_Y(\xi_t^2) + o_p(n^{-1/2}), \quad 0 \leq i \leq p_Y, \tag{5.3.17}$$
where $U_{t,X}^i$ and $U_{t,Y}^i$ are the $i$th elements of $U_X^{-1}\vartheta_{t,X}$ and $U_Y^{-1}\vartheta_{t,Y}$, respectively. Write $\alpha_X^i = E(U_{t,X}^i)$, $0 \leq i \leq p_X$, $\alpha_Y^i = E(U_{t,Y}^i)$, $0 \leq i \leq p_Y$, and $\tau_X = (\tau_{0,X}, \tau_{1,X}, \cdots, \tau_{p_X,X})'$, $\tau_Y = (\tau_{0,Y}, \tau_{1,Y}, \cdots, \tau_{p_Y,Y})'$ (recall (5.3.11) and (5.3.12)).
The following theorem is due to Chandra and Taniguchi (2003). We give only the main line of the proof, since the detailed proof is very technical.
Theorem 5.3.3. Suppose that Assumptions 5.3.1 and 5.3.2 hold and that $\hat{\theta}_{X,m}$ and $\hat{\theta}_{Y,n}$ are the conditional least squares estimators of $\theta_X$ and $\theta_Y$ satisfying (5.3.6). Then
$$N^{1/2}(T_N - \mu_N)/\sigma_N \xrightarrow{d} N(0,1) \quad \text{as } N \to \infty,$$
where
$$\mu_N = \int J[H_N(x)]\, dF(x) \quad \text{and} \quad \sigma_N^2 = \sigma_{1N}^2 + \sigma_{2N}^2 + \sigma_{3N}^2 + \gamma_N \ (\neq 0), \quad \text{with}$$
$$\sigma_{1N}^2 = 2(1 - \lambda_N)\left\{\iint_{x<y} A_N(x,y)\, dF(x)\, dF(y) + \frac{1 - \lambda_N}{\lambda_N} \iint_{x<y} B_N(x,y)\, dG(x)\, dG(y)\right\},$$
$$\sigma_{2N}^2 = \omega_{X,N}' U_X^{-1} R_X U_X^{-1} \omega_{X,N}, \quad \sigma_{3N}^2 = \omega_{Y,N}' U_Y^{-1} R_Y U_Y^{-1} \omega_{Y,N}$$
and
$$\gamma_N = 2(1 - \lambda_N)\left[\frac{1 - \lambda_N}{\lambda_N} \sum_{i=0}^{p_X} \tau_{i,X} \iint h_X^i(x)\, \delta_{f,N}(x,y)\, dG(x)\, dG(y) + \sum_{i=0}^{p_Y} \tau_{i,Y} \iint h_Y^i(x)\, \delta_{g,N}(x,z)\, dF(x)\, dF(z)\right], \tag{5.3.18}$$
where
$$A_N(x,y) = G(x)(1 - G(y)) J'[H_N(x)] J'[H_N(y)],$$
$$B_N(x,y) = F(x)(1 - F(y)) J'[H_N(x)] J'[H_N(y)],$$
$$\omega_{X,N} = -\lambda_N^{-1/2}(1 - \lambda_N) \int x f(x) J'[H_N(x)]\, dG(x) \times \tau_X,$$
$$\omega_{Y,N} = (1 - \lambda_N)^{1/2} \int z g(z) J'[H_N(z)]\, dF(z) \times \tau_Y,$$
$$\delta_{f,N}(x,y) = y f(y) J'[H_N(x)] J'[H_N(y)],$$
$$\delta_{g,N}(x,z) = z g(z) J'[H_N(x)] J'[H_N(z)],$$
$$h_X^i(x) = \alpha_X^i \int_0^x \phi_X(u) f(u)\, du, \quad 0 \leq i \leq p_X,$$
$$h_Y^i(x) = \alpha_Y^i \int_0^x \phi_Y(u) g(u)\, du, \quad 0 \leq i \leq p_Y.$$

Sketch of Proof. First, we rewrite the integrand of $T_N$ as
$$J\left(\frac{N}{N+1} \hat{H}_N\right) = J[H_N] + (\hat{H}_N - H_N) J'[H_N] - \frac{1}{N+1} \hat{H}_N J'[H_N] + \left\{J\left(\frac{N}{N+1} \hat{H}_N\right) - J[H_N] - \left(\frac{N}{N+1} \hat{H}_N - H_N\right) J'[H_N]\right\},$$
and $d\hat{F}_m = d(\hat{F}_m - F + F)$. Then it is seen that $T_N$ can be expressed as
$$T_N = \mu_N + B_{1N} + B_{2N} + \sum_{i=1}^3 C_{iN}, \tag{5.3.19}$$
where
$$B_{1N} = \int J[H_N]\, d(\hat{F}_m - F)(x), \quad B_{2N} = \int (\hat{H}_N - H_N) J'[H_N]\, dF(x),$$
$$C_{1N} = -\frac{1}{N+1} \int \hat{H}_N J'[H_N]\, d\hat{F}_m(x), \quad C_{2N} = \int (\hat{H}_N - H_N) J'[H_N]\, d(\hat{F}_m - F)(x),$$
$$C_{3N} = \int \left\{J\left(\frac{N}{N+1} \hat{H}_N\right) - J[H_N] - \left(\frac{N}{N+1} \hat{H}_N - H_N\right) J'[H_N]\right\} d\hat{F}_m(x).$$
The theorem is established if we show
$$\text{(i)} \quad \sqrt{N}(B_{1N} + B_{2N})/\sigma_N \xrightarrow{d} N(0,1) \ \text{as } N \to \infty, \tag{5.3.20}$$
$$\text{(ii)} \quad C_{iN} = o_p(N^{-1/2}), \quad i = 1, 2, 3.$$
Although the proof of (ii) is technically very complicated, it can be understood intuitively; hence we omit it (see Chandra and Taniguchi (2003) for details). In what follows we prove (i). From (5.3.11) we observe that
$$B_{1N} = \int J[H_N]\, d(F_m - F)(x) + m^{-1/2} A_X \int J[H_N]\, d[x f(x)] + \text{lower order terms}. \tag{5.3.21}$$
Then, integrating $B_{2N}$ by parts and adding it to (5.3.21), we obtain
$$\sqrt{N}(B_{1N} + B_{2N}) = \sqrt{N}(1 - \lambda_N) \left\{\int s(x)\, d(F_m - F) - \int s^*(x)\, d(G_n - G) - m^{-1/2} A_X \int x f(x) J'[H_N]\, dG(x) + n^{-1/2} A_Y \int z g(z) J'[H_N]\, dF(z)\right\} + \text{lower order terms}$$
$$= a_N + b_N + c_N + d_N + \text{lower order terms (say)}, \tag{5.3.22}$$
where
$$s(x) = \int_{x_0}^x J'[H_N(y)]\, dG(y), \quad s^*(x) = \int_{x_0}^x J'[H_N(y)]\, dF(y)$$
and $\lambda_N s^*(x) + (1 - \lambda_N) s(x) = J[H_N(x)] - J[H_N(x_0)]$, with $x_0$ determined arbitrarily, say, by $H_N(x_0) = 1/2$. Since $a_N$ and $b_N$ are mutually independent, using the result of Puri and Sen (1971, pp. 97–99), we obtain
$$\sigma_{1N}^2 = \mathrm{Var}(a_N + b_N). \tag{5.3.23}$$
From Tjøstheim (1986) it follows that
$$\mathrm{Var}[\sqrt{m}(\hat{\theta}_{X,m} - \theta_X)] = U_X^{-1} R_X U_X^{-1}$$
and
$$\mathrm{Var}[\sqrt{n}(\hat{\theta}_{Y,n} - \theta_Y)] = U_Y^{-1} R_Y U_Y^{-1}. \tag{5.3.24}$$
Recalling (5.3.11), (5.3.12) and (5.3.22), we get
$$\sigma_{2N}^2 = \mathrm{Var}(c_N) \quad \text{and} \quad \sigma_{3N}^2 = \mathrm{Var}(d_N). \tag{5.3.25}$$

Next, we compute the covariance terms. Since $\{X_t\}$ and $\{Y_t\}$ are independent, we have only to evaluate
$$L_{1N} = 2E[a_N c_N] \quad \text{and} \quad L_{2N} = 2E[b_N d_N].$$
From (5.3.22) we obtain
$$L_{1N} = 2N m^{-1} (1 - \lambda_N)^2 \iint E\left\{\left(\sqrt{m}(F_m - F)(x)\right) A_X\right\} \delta_{f,N}(x,y)\, dG(x)\, dG(y), \tag{5.3.26}$$
for which it is necessary to find $E\{\cdot\}$. Using the result of Horváth et al. (2001), it follows from (5.3.11) and (5.3.17) that
$$E\left\{\left(\sqrt{m}(F_m - F)(x)\right) A_X\right\} = \sum_{i=0}^{p_X} \tau_{i,X}\, h_X^i(x).$$
Thus,
$$L_{1N} = 2\, \frac{(1 - \lambda_N)^2}{\lambda_N} \sum_{i=0}^{p_X} \tau_{i,X} \iint h_X^i(x)\, \delta_{f,N}(x,y)\, dG(x)\, dG(y) \tag{5.3.27}$$
and analogously
$$L_{2N} = 2(1 - \lambda_N) \sum_{i=0}^{p_Y} \tau_{i,Y} \iint h_Y^i(x)\, \delta_{g,N}(x,z)\, dF(x)\, dF(z). \tag{5.3.28}$$
Adding (5.3.27) and (5.3.28) yields $\gamma_N$.


For (5.3.20), using the central limit theorems given by Horváth et al. (2001) and Tjøstheim (1986), we may conclude that
$$\sqrt{N}(B_{1N} + B_{2N})/\sigma_N \xrightarrow{d} N(0,1) \quad \text{as } N \to \infty, \tag{5.3.29}$$
leading to (i). $\square$
The asymptotic distribution of $\{T_N\}$ given in the theorem provides useful information on the reliability of confidence intervals, asymptotic relative efficiency (ARE) and the ARCH affection.
The test statistic $T_N$ was constructed from the empirical squared residuals $\{\hat{\epsilon}_t^2\}$ and $\{\hat{\xi}_t^2\}$. Likewise, if we construct it replacing $\{\hat{\epsilon}_t^2\}$ and $\{\hat{\xi}_t^2\}$ by $\{\epsilon_t^2\}$ and $\{\xi_t^2\}$, respectively, then it becomes the usual rank order statistic given in Puri and Sen (1971), which is denoted by $T_N^{PS}$. Here we consider the problem of constructing approximate confidence intervals based on $T_N^{PS}$ and $T_N$ in the i.i.d. and in our ARCH residual settings, respectively. Some important features of $T_N$ will be elucidated.
For this, let us consider the ARCH(1) model
$$X_t = \begin{cases} \sigma_t(\theta_X)\,\epsilon_t, \quad \sigma_t^2(\theta_X) = \theta_X^0 + \theta_X^1 X_{t-1}^2, & \text{for } t = 1, \cdots, m, \\ 0, & \text{for } t \leq 0, \end{cases}$$
where $\theta_X^0 > 0$, $0 \leq \theta_X^1 < 1$, $\{\epsilon_t\}$ is a sequence of i.i.d. $(0,1)$ random variables with fourth-order cumulant $\kappa_4^X$, and $\epsilon_t$ is independent of $X_s$, $s < t$.
Another ARCH(1) model, independent of $\{X_t\}$, is given by
$$Y_t = \begin{cases} \sigma_t(\theta_Y)\,\xi_t, \quad \sigma_t^2(\theta_Y) = \theta_Y^0 + \theta_Y^1 Y_{t-1}^2, & \text{for } t = 1, \cdots, n, \\ 0, & \text{for } t \leq 0, \end{cases}$$
where $\theta_Y^0 > 0$, $0 \leq \theta_Y^1 < 1$, $\{\xi_t\}$ is a sequence of i.i.d. $(0,1)$ random variables with fourth-order cumulant $\kappa_4^Y$, and $\xi_t$ is independent of $Y_s$, $s < t$.
Recall that $F(x)$ and $G(x)$ are the distribution functions of $\epsilon_t^2$ and $\xi_t^2$, respectively. Consider the scale problem in the case of $G(x) = F(\theta x)$, $\theta \in [1, \infty)$. Henceforth it is assumed that $F$ is arbitrary and has finite variance $\sigma^2(F)$. The two-sample testing problem for scale is described as
$$H_0: \theta = 1 \quad \text{vs} \quad H_A: \theta > 1. \tag{5.3.30}$$
For this, we use Wilcoxon’s test with J(u) = u, 0 < u < 1. Assume that m = n = N/2. Then,
from Theorem 5.3.3 it follows that the asymptotic mean and variance under H0 are, respectively,
given by
Z
µ(θ) = F(x)dF(x), and σ2 (1) = σ21 (1) + σ22 (1) + σ23 (1) + γ(1),

where
Z 1 Z 1 !2 Z !2
1 CX
σ21 (1) = 2
u du − udu = , σ22 (1) = 3
x f (x)dx ,
0 0 12 2
Z !2
CY
σ23 (1) = z f 3 (z)dz and
2
" "
γ(1) = k1 p(x)y f 2 (x) f 3 (y)dxdy + k2 p(x)z f 2 (x) f 3 (z)dxdz,

with

C x = τ′X UX−1 RX UX−1 τ x , CY = τ′Y UY−1 RY UY−1 τY ,


k1 = (τ0,X + τ1,X )(α0X + α1X ), k2 = (τ0,Y + τ1,Y )(α0Y + α1Y ) and
Z x
p(x) = (u − 1) f (u)du,
0

where τX = (τ0,X , τ1,X ) and τY = (τ0,Y , τ1,Y )′ . Writing µ = 1/2, σ2 = σ2 (1) and σ21 = 1/12, we

have, under H0 ,
√ d √ d
N(T NPS − µ)/σ1 −→ N(0, 1) and N(T N − µ)/σ −→ N(0, 1). (5.3.31)

For $\alpha \in (0,1)$, write $\lambda_\alpha = \Phi^{-1}(1 - \alpha)$, where $\Phi(\cdot)$ is the distribution function of $N(0,1)$. The $(1 - \alpha)$ asymptotic coverage probabilities are expressed as
$$P\left\{\mu - N^{-1/2}\sigma_1 \lambda_{\alpha/2} \leq T_N^{PS} \leq \mu + N^{-1/2}\sigma_1 \lambda_{\alpha/2}\right\} \approx 1 - \alpha,$$
$$P\left\{\mu - N^{-1/2}\sigma \lambda_{\alpha/2} \leq T_N \leq \mu + N^{-1/2}\sigma \lambda_{\alpha/2}\right\} \approx 1 - \alpha.$$
Hence the approximate theoretical $(1 - \alpha)$ confidence intervals are
$$\left(\frac{1}{2} - N^{-1/2} \frac{1}{2\sqrt{3}} \lambda_{\alpha/2},\ \frac{1}{2} + N^{-1/2} \frac{1}{2\sqrt{3}} \lambda_{\alpha/2}\right) \quad \text{and} \tag{5.3.32}$$
$$\left(\frac{1}{2} - N^{-1/2} \sigma \lambda_{\alpha/2},\ \frac{1}{2} + N^{-1/2} \sigma \lambda_{\alpha/2}\right), \tag{5.3.33}$$
respectively.
Chandra and Taniguchi (2003) evaluated the interval (5.3.32) numerically, for three types of the distribution function $F^*$ of $\epsilon_t$:
$$\text{(i)} \quad F^* = F_N^* \ \text{(normal)}; \quad F_N^*(y) = (2\pi)^{-1/2} \int_{-\infty}^y \exp\left(-\frac{t^2}{2}\right) dt.$$
$$\text{(ii)} \quad F^* = F_{DE}^* \ \text{(double exponential)}; \quad F_{DE}^*(y) = \frac{1}{2} \int_{-\infty}^y e^{-|t|}\, dt.$$
$$\text{(iii)} \quad F^* = F_L^* \ \text{(logistic)}; \quad F_L^*(y) = \left(1 + e^{-\pi y/\sqrt{3}}\right)^{-1}.$$
If $\epsilon_t$ has the distribution $F^*$, then $\epsilon_t^2$ has the distribution
$$F(x) \equiv P(\epsilon_t^2 \leq x) = \begin{cases} 2F^*(\sqrt{x}) - 1, & x \geq 0, \\ 0, & x < 0. \end{cases} \tag{5.3.34}$$
For (i)–(iii), the corresponding distribution $F$ is, respectively,
$$F_N(x) = 2F_N^*(\sqrt{x}) - 1,$$
$$F_{DE}(x) = 1 - e^{-\sqrt{x}}, \tag{5.3.35}$$
$$F_L(x) = 1 - \frac{2}{1 + e^{\pi\sqrt{x}/\sqrt{3}}}$$
(see Exercise 5.2).
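Relations (5.3.34)–(5.3.35) are easy to check by simulation. The sketch below is ours: it draws standard double-exponential innovations (density $e^{-|t|}/2$; note this has variance 2, so the check concerns the distributional identity rather than the unit-variance normalization), squares them, and compares the empirical distribution of $\epsilon_t^2$ with $F_{DE}(x) = 1 - e^{-\sqrt{x}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.laplace(0.0, 1.0, size=200_000)        # F*_DE: density e^{-|t|}/2
eps2 = np.sort(eps ** 2)

F_theory = 1.0 - np.exp(-np.sqrt(eps2))          # F_DE(x) = 2 F*_DE(sqrt(x)) - 1
F_emp = np.arange(1, len(eps2) + 1) / len(eps2)  # empirical cdf at the order statistics
max_gap = np.abs(F_emp - F_theory).max()         # Kolmogorov-Smirnov-type distance
```

With this sample size the maximum gap is of order $n^{-1/2}$, i.e., well below 0.01.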


For $\alpha = 0.05$ and for $F = F_N$, $F_{DE}$ and $F_L$, the confidence interval (5.3.32) is
$$(0.4747,\ 0.5253), \quad N = 1000,$$
$$(0.4821,\ 0.5179), \quad N = 2000. \tag{5.3.36}$$

In the case of ARCH, we have to evaluate $\sigma$ in (5.3.33). For this, set $\theta^0 \equiv \theta_X^0 = \theta_Y^0 = 30$ and $\theta^1 \equiv \theta_X^1 = \theta_Y^1 = 0.1, 0.5$. For $m = n = 500, 1000$ (i.e., $N = 1000, 2000$), we generated realizations of $X_t$ and $Y_t$ with 100 replications. Chandra and Taniguchi (2003) evaluated the sample version of $\sigma$, and obtained Table 5.4.

Table 5.4: 95% confidence intervals based on $T_N$ in (5.3.33)

                m = n = 500, θ^0 = 30                     m = n = 1000, θ^0 = 30
Dist.     θ^1 = 0.1          θ^1 = 0.5           θ^1 = 0.1          θ^1 = 0.5
F_N       (0.3642, 0.6344)   (0.3216, 0.6672)    (0.3746, 0.6240)   (0.3632, 0.6354)
F_DE      (0.4411, 0.5590)   (0.4163, 0.5838)    (0.4580, 0.5420)   (0.4401, 0.5599)
F_L       (0.4745, 0.5254)   (0.4735, 0.5264)    (0.4818, 0.5181)   (0.4740, 0.5259)

The difference between (5.3.36) and Table 5.4 is due to the effect of the volatility estimators $\hat{\theta}_{X,m}$ and $\hat{\theta}_{Y,n}$. Table 5.4 implies that the Wilcoxon test is preferable if the underlying distribution is logistic.
Next we discuss the asymptotic relative efficiency (ARE), due to Pitman, of $T_N$ for the distributions $F_N$, $F_{DE}$ and $F_L$. Suppose that $T_N$ is a test statistic based on $N$ observations for testing $H_0: \theta = \theta_0$ vs $H_A: \theta > \theta_0$ with critical region $T_N \geq \lambda_{N,\alpha}$. Further, we assume that
(i) $\lim_{N \to \infty} P_{\theta_0}(T_N \geq \lambda_{N,\alpha}) = \alpha$, where $0 < \alpha < 1$ is a given level;
(ii) there exist functions $\mu_N(\theta)$ and $\sigma_N(\theta)$ such that
$$\sqrt{N}\left(T_N - \mu_N(\theta)\right)/\sigma_N(\theta) \xrightarrow{d} N(0,1) \tag{5.3.37}$$
uniformly in $\theta \in [\theta_0, \theta_0 + \epsilon]$, $\epsilon > 0$;
(iii) $\mu_N'(\theta_0) > 0$;
(iv) for a sequence $\{\theta_N = \theta_0 + N^{-1/2}\delta,\ \delta > 0\}$ such that $\theta_N \to \theta_0$ as $N \to \infty$,
$$\lim_{N \to \infty} [\mu_N'(\theta_N)/\mu_N'(\theta_0)] = 1, \quad \lim_{N \to \infty} [\sigma_N(\theta_N)/\sigma_N(\theta_0)] = 1, \quad \lim_{N \to \infty} [\mu_N'(\theta_0)/\sigma_N(\theta_0)] = c > 0. \tag{5.3.38}$$
Then, the asymptotic power is given by $1 - \Phi(\lambda_\alpha - \delta c)$, where $\lambda_\alpha = \Phi^{-1}(1 - \alpha)$. The quantity $c$ defined by (5.3.38) is called the efficacy of $T_N$. Let $T_N^{(1)}$ and $T_N^{(2)}$ be sequences of statistics with efficacies $c_1$ and $c_2$, respectively, and define $e(T_N^{(2)}, T_N^{(1)}) \equiv c_2^2/c_1^2$. We call $e(T_N^{(2)}, T_N^{(1)})$ the asymptotic relative efficiency (ARE) of $T_N^{(2)}$ relative to $T_N^{(1)}$. In particular, if we use the Wilcoxon test $T_N = T_N(F)$ based on an innovation distribution $F$, $e(F_2, F_1)$ means the ARE of $T_N(F_2)$ relative to $T_N(F_1)$. In our setting, Chandra and Taniguchi (2003) evaluated $e(\cdot,\cdot)$ and $\delta_F = \sigma_1/\sigma$ (recall (5.3.31)) numerically, in the case of $m = n = 500$ and $\theta^0 = 300$.
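The asymptotic power formula is one line of code. The sketch below is ours (the values of the efficacy $c$ and the local drift $\delta$ are arbitrary illustrations, not from the book); it uses only the standard library.

```python
from statistics import NormalDist

def pitman_power(c, delta, alpha=0.05):
    """Asymptotic power 1 - Phi(lambda_alpha - delta * c) of a test with efficacy c."""
    nd = NormalDist()
    lam = nd.inv_cdf(1.0 - alpha)        # lambda_alpha = Phi^{-1}(1 - alpha)
    return 1.0 - nd.cdf(lam - delta * c)

p = pitman_power(c=1.0, delta=2.0)       # power against the local alternative delta = 2
```

At $\delta = 0$ the function returns the level $\alpha$, and the power increases with $\delta c$, which is why the ratio of squared efficacies measures relative efficiency.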

Table 5.5: ARE

ARE             θ^1 = 0.1    θ^1 = 0.5
e(F_N, F_L)     0.4408       0.1972
e(F_DE, F_L)    0.9698       0.4820

Table 5.6: ARCH affection

           θ^1 = 0.1    θ^1 = 0.5
δ_{F_N}    0.5685       0.3795
δ_{F_DE}   0.8292       0.7921
δ_{F_L}    0.9684       0.9432

Table 5.5 shows that the Wilcoxon test is good in the case of $F_L$. Also, $\delta_F$ shows an ARCH affection. For the Wilcoxon test, the ARCH affection becomes large if $F = F_N$ and if $\theta^1$ becomes large. But for $F = F_L$, the ARCH affection becomes small (see Table 5.6).

5.4 Independent Component Analysis


5.4.1 Introduction to Independent Component Analysis
5.4.1.1 The foregoing model for financial time series
In many financial applications, e.g., asset pricing, portfolio selection, hedging and risk management, modeling the time-varying conditional covariance matrix of asset returns is widely invoked. However, multivariate volatility models have two main difficulties: the high dimensionality of the parameter space, and the restriction that the conditional covariance matrix must be positive definite at every time point.
The $K$-factor GARCH model can reduce the number of parameters and parameter constraints (see, e.g., Lin (1992)). Let $\varepsilon_t$ be a $d \times 1$ vector of random variables, $\mathcal{F}_t$ be the $\sigma$-field generated by $\{\varepsilon_s\}_{s=-\infty}^t$, and $H_t$ be a covariance matrix of $\varepsilon_t$ that is measurable with respect to $\mathcal{F}_{t-1}$. Then, the conditional distribution of a multivariate GARCH model can be expressed as
$$\varepsilon_t \mid \mathcal{F}_{t-1} \sim (0, H_t), \tag{5.4.1}$$
namely, it is some arbitrary multivariate distribution which has mean vector $0$ and covariance matrix $H_t$. There are many possible ways to parametrize $H_t$ as a function of past values. Bollerslev et al. (1988) gave one simple formulation where each element of the covariance matrix depends solely on its own past values. That is, let $h_{ij,t}$ be the $(i,j)$th element of $H_t$; then $h_{ij,t}$ can be represented by
$$h_{ij,t} = c_{ij} + a_{ij}\, \varepsilon_{i,t-1} \varepsilon_{j,t-1} + b_{ij}\, h_{ij,t-1} \quad \text{for all } i, j.$$


This model is the so-called "diagonal multivariate GARCH," since it is obtained as a diagonal representation in the vec model (see, e.g., Engle and Kroner (1995)). We may require that $H_t$ be positive definite. However, this can be hard to check even in the diagonal representation. Alternatively, Engle and Kroner (1995) proposed a multivariate GARCH $(p, q, K)$ model defined by
$$H_t = \Omega + \sum_{k=1}^K \sum_{j=1}^q A_{k,j}'\, \varepsilon_{t-j} \varepsilon_{t-j}'\, A_{k,j} + \sum_{k=1}^K \sum_{j=1}^p B_{k,j}'\, H_{t-j}\, B_{k,j}, \tag{5.4.2}$$
where $A_{k,j}$ and $B_{k,j}$ are $d \times d$ parameter matrices and $\Omega$ is a $d \times d$ symmetric parameter matrix. They showed that this is positive definite under very weak conditions, and this representation is sufficiently general since it includes all positive definite diagonal representations. However, the number of parameters easily becomes huge as the dimension $d$ of the variables increases. Engle (1987) suggested a $K$-factor structure for the covariance matrix, and Lin (1992) adapted it to the following $K$-factor GARCH model.
Definition 5.4.1. A multivariate GARCH $(p, q, K)$ model in (5.4.1) and (5.4.2) is said to be a $K$-factor model if it can be represented as
$$A_{kj} = \alpha_{kj}\, f_k g_k' \quad \text{and} \quad B_{kj} = \beta_{kj}\, f_k g_k'$$
with $d \times 1$ vectors $f_k, g_k$ $(k = 1, \ldots, K)$ which satisfy
$$f_k' g_l = \begin{cases} 0 & (k \neq l) \\ 1 & (k = l). \end{cases}$$
Therefore, a $K$-factor GARCH $(p, q)$ model can be rewritten as
$$H_t = \Omega + \sum_{k=1}^K g_k g_k' \left(\sum_{j=1}^q \alpha_{kj}^2\, f_k' \varepsilon_{t-j} \varepsilon_{t-j}' f_k + \sum_{j=1}^p \beta_{kj}^2\, f_k' H_{t-j} f_k\right).$$
Hence, the time-varying part of the covariance matrix in the $K$-factor GARCH $(p, q)$ model is reduced to rank $K$. The following properties of the $K$-factor GARCH $(p, q)$ model were listed in Lin (1992).
Property 5.4.2. Let the $k$th factor be $\eta_{k,t} = f_k' \varepsilon_t$. Then, we have the following.
(i) The $k$th factor $\eta_{k,t}$ follows the GARCH $(p, q)$ process
$$\eta_{k,t} \mid \mathcal{F}_{t-1} \sim (0, h_{k,t}), \quad h_{k,t} = \omega_k + \sum_{j=1}^q \alpha_{k,j}^2 \eta_{k,t-j}^2 + \sum_{j=1}^p \beta_{k,j}^2 h_{k,t-j},$$
where $h_{k,t} = f_k' H_t f_k$ and $\omega_k = f_k' \Omega f_k$. Furthermore, the conditional covariance of $\varepsilon_t$ can be rewritten as
$$H_t = \Omega^* + \sum_{k=1}^K g_k g_k'\, h_{k,t},$$
where $\Omega^* = \Omega - \sum_{k=1}^K g_k g_k'\, \omega_k$, and therefore it depends only on the $K$ conditional variances of the common factors.
(ii) The conditional covariance of any two factors satisfies
$$E\left(\eta_{k,t}\, \eta_{j,t} \mid \mathcal{F}_{t-1}\right) = f_k' \Omega f_j \quad \text{for } k \neq j, \ k, j = 1, \ldots, K,$$
so that it is time invariant.
(iii) Linear combinations of $\varepsilon_t$ still follow the $K$-factor GARCH $(p, q)$ process with the same common factors as $\varepsilon_t$. Namely, let $P$ be a $d \times m$ matrix with full column rank $m$; then
$$P' \varepsilon_t \mid \mathcal{F}_{t-1} \sim (0, \tilde{H}_t), \quad \tilde{H}_t = \tilde{\Omega} + \sum_{k=1}^K \tilde{g}_k \tilde{g}_k'\, h_{k,t},$$
where $\tilde{\Omega} = P' \Omega^* P$ and $\tilde{g}_k = P' g_k$.
(iv) The covariance stationarity condition of the $K$-factor GARCH $(p, q)$ process is given by
$$\sum_{j=1}^q \alpha_{kj}^2 + \sum_{j=1}^p \beta_{kj}^2 < 1 \quad \text{for } k = 1, \ldots, K.$$
Moreover, the stationary covariance matrix of $\varepsilon_t$ is obtained as
$$E\left(\varepsilon_t \varepsilon_t'\right) = \Omega^* + \sum_{k=1}^K g_k g_k'\, \sigma_k^2,$$
where $\sigma_k^2 = \omega_k \Big/ \left(1 - \sum_{j=1}^q \alpha_{kj}^2 - \sum_{j=1}^p \beta_{kj}^2\right)$.
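Property (iv) can be checked by simulation. The sketch below is ours (the parameter values are arbitrary illustrations): it runs the univariate factor recursion of Property (i) with $p = q = 1$ and compares the sample variance of $\eta_{k,t}$ with the stationary value $\sigma_k^2 = \omega_k/(1 - \alpha^2 - \beta^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
omega, alpha, beta = 1.0, 0.3, 0.5   # alpha^2 + beta^2 = 0.34 < 1: stationarity holds
T = 200_000
eta = np.empty(T)
h = omega / (1.0 - alpha ** 2 - beta ** 2)   # start the recursion at the stationary variance
for t in range(T):
    eta[t] = np.sqrt(h) * rng.standard_normal()
    h = omega + alpha ** 2 * eta[t] ** 2 + beta ** 2 * h

sigma2_theory = omega / (1.0 - alpha ** 2 - beta ** 2)
sigma2_sample = eta.var()
```

Over a long run the sample variance settles near the theoretical stationary variance, as Property (iv) predicts.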
Lin (1992) proposed four alternative estimators for the K-factor GARCH (p, q) model, and gave
the Monte Carlo comparison of their finite sample properties.
Alternatively, the observations yt (t = 1, . . . , T ) of the multivariate full-factor GARCH model
are given by the following equations (see Vrontos et al. (2003)):
yt = µ + ε t ,
εt = W x t ,
xt | Ft−1 ∼ NN (0, Σt ) ,
where µ is the N × 1 constant mean vector, εt is the N × 1 innovation vector, W is the N × N
 element xk,t (k = 1, . . . , N)
parameter matrix and xt is the N × 1 factor vector. Furthermore, the kth
of the factors xt are GARCH (1, 1) processes; therefore Σ = diag σ21,t , . . . , σ2N,t is the N × N
diagonal covariance matrix which satisfies
σ2k,t = ωk + αk x2k,t−1 + βk σ2k,t−1 , k = 1, . . . , N, t = 1, . . . , T,

where σ2k,t is the conditional variance of the kth factor xk,t at time t and coefficients satisfy ωk > 0,
αk ≥ 0, βk > 0, k = 1, . . . , N. Since the factor vector xt follows a conditional normal distribution,
the innovation vector εt satisfies εt | Ft−1 ∼ NN (0, Ht ) with
H_t = W Σ_t W′ = W Σ_t^{1/2} ( W Σ_t^{1/2} )′ := L_t L_t′ .
In this case we can naturally assume that W is triangular and has unit diagonal elements, namely,
wi j = 0 for j > i and wii = 1, (i = 1, . . . , N). Under these assumptions, the conditional covariance
matrix Ht has the (i, j)th element
h_{ij,t} = Σ_{k=1}^{min{i,j}} w_{ik} w_{jk} σ²_{k,t} ,

and is always positive definite if the factor conditional variance σ2k,t (k = 1, . . . , N) is well defined.
In this model the conditional correlations are given by
ρ_{ij,t} = Σ_{k=1}^{min{i,j}} w_{ik} w_{jk} σ²_{k,t} / { ( Σ_{k=1}^{i} w²_{ik} σ²_{k,t} )^{1/2} ( Σ_{k=1}^{j} w²_{jk} σ²_{k,t} )^{1/2} } .
INDEPENDENT COMPONENT ANALYSIS 183
Therefore, it can be considered as a generalization of the constant correlation coefficient model.
For the sake of convenience, we additionally assume that xk,0 and σ2k,0 are known, and αk = α,
βk = β (k = 1, . . . , N). Then, the unconditional covariances are given by U = E (Ht ) with the
(i, j)th element
u_{ij} = Σ_{k=1}^{min{i,j}} w_{ik} w_{jk} ω_k / ( 1 − α − β ) .
The likelihood of the multivariate full-factor GARCH model for observations y = (y1 , . . . , yT ) can
be written as
 
l ( y | θ ) := (2π)^{−TN/2} Π_{t=1}^{T} |H_t|^{−1/2} exp{ −(1/2) Σ_{t=1}^{T} ( y_t − µ )′ H_t^{−1} ( y_t − µ ) } ,

where the number of parameters is 2N + 2 + N(N−1)/2 and the parameter vector is

θ = ( µ_1 , . . . , µ_N , ω_1 , . . . , ω_N , α, β, w_{21} , w_{31} , w_{32} , . . . , w_{N1} , . . . , w_{N,N−1} )′ .

Note that the factors xk,t are not parameters to be estimated but are given by xt = W −1 εt . Vrontos
et al. (2003) used the classical and Bayesian techniques for the estimation of the model parameters.
They showed that maximum likelihood estimation and the construction of well-mixing MCMC
algorithms are easily performed.
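The triangular structure of W makes the element formula for H_t easy to verify directly. The sketch below, with a hypothetical lower triangular W (unit diagonal) and hypothetical factor variances, checks that W Σ_t W′ matches the elementwise expression h_{ij,t} = Σ_{k≤min(i,j)} w_{ik} w_{jk} σ²_{k,t} and that H_t is positive definite.

```python
import numpy as np

# Sketch: conditional covariance H_t = W Sigma_t W' of the full-factor GARCH
# model, with W lower triangular and unit diagonal (hypothetical numbers).
N = 3
W = np.array([[1.0, 0.0, 0.0],
              [0.4, 1.0, 0.0],
              [0.2, 0.5, 1.0]])
sigma2_t = np.array([0.9, 1.1, 0.7])   # conditional factor variances at time t
H_t = W @ np.diag(sigma2_t) @ W.T

def h_elem(i, j):
    # (i, j)th element: sum_{k=1}^{min(i,j)} w_ik w_jk sigma2_{k,t}
    # (0-based indices here)
    return sum(W[i, k] * W[j, k] * sigma2_t[k] for k in range(min(i, j) + 1))

assert np.allclose([[h_elem(i, j) for j in range(N)] for i in range(N)], H_t)
assert np.all(np.linalg.eigvalsh(H_t) > 0)   # always positive definite
```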
Another type of the multivariate GARCH model in financial time series is described by the
orthogonal GARCH type. If we assume that xt is normally distributed and has the conditional
covariance matrix Vt which is measurable with respect to Ft−1 , then the multivariate GARCH model
is given by

xt | Ft−1 ∼ N (0, Vt ) ,

where we also assume that x_t is second order stationary, so that V = E ( V_t ) exists. The problem in multivariate GARCH modeling is how to parametrize V_t as a function of F_{t−1} such that the model is sufficiently general while feasible estimation methods can be applied. For this purpose van der Weide (2002) suggested the generalized orthogonal GARCH (GO-GARCH) model. The key assumption of GO-GARCH modeling is given by the following.
Assumption 5.4.3. The observed process xt is given by a linear combination of uncorrelated com-
ponents yt , namely,

xt = Zyt ,

where the linear link map Z is constant over time and invertible.

We can assume that the unobserved components satisfy E ( y_t y_t′ ) = I_N without loss of generality.
Therefore, the observed process satisfies

V = E ( x_t x_t′ ) = ZZ′ . (5.4.3)

A simple example is the GO-GARCH (1, 1) model

xt = Zyt
yt | Ft−1 ∼ N (0, Ht ) ,

where each component is a GARCH (1, 1) process, so that



H_t = diag( h_{1,t} , . . . , h_{N,t} ) ,

h_{k,t} = ( 1 − α_k − β_k ) + α_k y²_{k,t−1} + β_k h_{k,t−1} , k = 1, . . . , N.

Additionally, if we assume H_0 = I_N , then the unobserved components have unit variances, i.e., E ( y_t y_t′ ) = I_N . Therefore, the unconditional covariance V of the observed process x_t is given by
(5.4.3). On the other hand, the conditional covariances of xt satisfy

Vt = ZHt Z ′ .

Let P and Λ = diag (λ1 , . . . , λN ) denote matrices of the orthonormal eigenvectors and the eigenval-
ues of the unconditional covariance matrix V = ZZ ′ , respectively. Then, the following properties
are useful for the estimation procedure.
Property 5.4.4 (Lemma 2 of van der Weide (2002)). Let Z be the map that links the uncorrelated
components yt with the observed process xt . Then, there exists an orthogonal matrix U which
satisfies

P Λ1/2 U = Z.

For the estimation of U , we can restrict the determinant of U to be 1 without loss of generality.
Then, we also have the following property.
Property 5.4.5 (Lemma 3 of van der Weide (2002)). Every N-dimensional orthogonal matrix U with det {U } = 1 can be represented as a product of N(N−1)/2 rotation matrices, namely,

U = Π_{i<j} R_{ij} ( θ_{ij} ) , −π < θ_{ij} ≤ π,

where R_{ij} ( θ_{ij} ) is a rotation matrix in the plane spanned by the unit vectors e_i and e_j over an angle θ_{ij} .
The matrices P and Λ will be estimated from the sample version of the covariance matrix V .
On the other hand, conditional information is required for the estimation of U . The parameters which will be estimated from the conditional information include the vector θ of rotation angles and the parameters (α, β) for the N univariate GARCH (1, 1) processes. The log-likelihood for the GO-
GARCH model is given by

L_{θ,α,β} = −(1/2) Σ_{t=1}^{T} { N log (2π) + log |V_t| + x_t′ V_t^{−1} x_t }

= −(1/2) Σ_{t=1}^{T} { N log (2π) + log | Z_θ H_t Z_θ′ | + y_t′ Z_θ′ ( Z_θ H_t Z_θ′ )^{−1} Z_θ y_t }

= −(1/2) Σ_{t=1}^{T} { N log (2π) + log | Z_θ Z_θ′ | + log |H_t| + y_t′ H_t^{−1} y_t } ,

where Z_θ Z_θ′ = P ΛP′ is independent of θ.
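Properties 5.4.4 and 5.4.5 can be illustrated numerically: whatever rotation angles are chosen, the map Z = P Λ^{1/2} U reproduces the unconditional covariance V = ZZ′. The covariance matrix and the angles below are hypothetical.

```python
import numpy as np

# Sketch of the GO-GARCH link Z = P Lambda^{1/2} U (Property 5.4.4), with U
# built as a product of planar rotations (Property 5.4.5).
V = np.array([[2.0, 0.5, 0.2],
              [0.5, 1.5, 0.3],
              [0.2, 0.3, 1.0]])
lam, P = np.linalg.eigh(V)            # eigenvalues / orthonormal eigenvectors

def rotation(i, j, psi, N=3):
    # R_ij(psi): rotation in the (e_i, e_j) plane
    R = np.eye(N)
    R[i, i] = R[j, j] = np.cos(psi)
    R[i, j], R[j, i] = -np.sin(psi), np.sin(psi)
    return R

angles = {(0, 1): 0.3, (0, 2): -0.7, (1, 2): 1.1}   # hypothetical theta_ij
U = np.eye(3)
for (i, j), psi in angles.items():    # U = product of N(N-1)/2 rotations
    U = U @ rotation(i, j, psi)

Z = P @ np.diag(np.sqrt(lam)) @ U
assert np.allclose(Z @ Z.T, V)        # Z Z' = V for any choice of angles
assert np.isclose(np.linalg.det(U), 1.0)
```

The angles only affect the link map Z itself, not the implied unconditional covariance, which is why conditional information is needed to estimate U.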


The orthogonal GARCH model uses principal components which are unconditionally uncorrelated. On the other hand, Fan et al. (2008) proposed an alternative approach which is based on the conditionally uncorrelated components (CUCs). Let Z_t = ( Z_{1,t} , . . . , Z_{N,t} )′ be N CUCs which satisfy the conditions
conditions

E ( Z_{k,t} | F_{t−1} ) = 0,
Var ( Z_{k,t} ) = 1,
E ( Z_{k,t} Z_{l,t} | F_{t−1} ) = 0 for all k ≠ l. (5.4.4)

Put σ²_{k,t} = Var ( Z_{k,t} | F_{t−1} ). Then we have

Var ( Z_t | F_{t−1} ) = diag( σ²_{1,t} , . . . , σ²_{N,t} ) .

We assume that each component of εt is a linear combination of N CUCs Zt , namely,

εt = AZt , (5.4.5)

where A is a constant matrix. For simplicity we suppose that Var ( ε_t ) = I_N . In practice, this can be achieved by replacing ε_t by S^{−1/2} ε_t , where S is a sample covariance matrix of ε_t . Therefore, A is an N × N orthogonal matrix with N(N−1)/2 free elements, so that Z_t = A′ ε_t , since

IN = Var (εt ) = AVar (Zt ) A′ = AA′ .

This assumption is not essential and is introduced to reduce the number of free parameters in A from N² to N(N−1)/2. Let ζ_{1,t} = b_1′ ε_t and ζ_{2,t} = b_2′ ε_t be any two portfolios and b_j′ A = ( b_{1,j} , . . . , b_{N,j} ) ( j = 1, 2). Then we have

Var ( ζ_{1,t} | F_{t−1} ) = Σ_{k=1}^{N} b²_{k,1} σ²_{k,t} ,

Cov ( ζ_{1,t} , ζ_{2,t} | F_{t−1} ) = Σ_{k=1}^{N} b_{k,1} b_{k,2} σ²_{k,t} ,

so that the volatility of any portfolio can be reduced to N univariate volatility models. Let A = ( a_1 , . . . , a_N ). Since Z_{k,t} = a_k′ ε_t and a_1 , . . . , a_N are N orthogonal vectors, our goal is the estimation of the orthogonal matrix A. The condition (5.4.4) is equivalent to
Σ_{B∈B_t} E { Z_{k,t} Z_{l,t} I (B) } = 0 (5.4.6)

for any π-class (closed under finite intersections) Bt ⊂ Ft−1 such that the σ-algebra which is gener-
ated by Bt is equal to Ft−1 . In practice, we use some simple Bt , namely, B which is the collection of
all the balls that are centred at the origin in RN . Let u0 be a prescribed integer and
Ψ (A) = Σ_{1≤k<l≤N} Σ_{B∈B} Σ_{u=1}^{u_0} ω (B) E { a_k′ ε_t ε_t′ a_l I ( ε_{t−u} ∈ B ) }

= Σ_{1≤k<l≤N} Σ_{B∈B} Σ_{u=1}^{u_0} ω (B) Corr ( a_k′ ε_t , a_l′ ε_t | ε_{t−u} ∈ B ) P ( ε_{t−u} ∈ B ) ,
where ω (·) is a weight function which satisfies Σ_{B∈B} ω (B) < ∞. We can understand Ψ (A) as
a collective conditional correlation measure among the N directions a1 , . . . , aN . Since the order
of a1 , . . . , aN is arbitrary and ak may be replaced by −al , we measure the distance between two
orthogonal matrices A and B by

D (A, B) = 1 − (1/N) Σ_{k=1}^{N} max_{1≤l≤N} | a_k′ b_l | .

For any two orthogonal matrices A, B, D (A, B) ≥ 0 and if the columns of A are obtained from
a permutation of the columns of B or their reflections, then D (A, B) = 0. The sample version of
Ψ (A) is given by
 
Ψ_T (A) = Σ_{1≤k<l≤N} Σ_{B∈B} Σ_{u=1}^{u_0} ω (B) a_k′ { ( 1/(T − u) ) Σ_{t=u+1}^{T} ε_t ε_t′ I ( ε_{t−u} ∈ B ) } a_l . (5.4.7)
Let A_0 be the unique, under the D-distance, minimizer of Ψ (A). We say the estimator Â of A_0 is consistent if the D-distance between Â and A_0 converges to 0 in probability, since Ψ_T (A) does not distinguish between orthogonal matrices A and B as long as D (A, B) = 0. Fan et al. (2008) proposed estimation methods using Ψ_T (A) and iterative algorithms and showed that the resulting estimator is consistent.
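The two ingredients of this estimation procedure — the D-distance and the sample criterion Ψ_T(A) of (5.4.7) — can be sketched as follows. The ball radii, the uniform weights ω(B) and the placeholder innovations are illustrative choices, not taken from Fan et al. (2008); the absolute value in Ψ_T reflects its reading as a collective correlation measure.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, u0 = 400, 2, 2
eps = rng.standard_normal((T, N))     # placeholder innovations

def D(A, B):
    # D(A, B) = 1 - (1/N) sum_k max_l |a_k' b_l|; zero when the columns of A
    # are a signed permutation of the columns of B
    return 1.0 - np.mean([np.max(np.abs(A[:, k] @ B)) for k in range(A.shape[0])])

def Psi_T(A, radii=(0.5, 1.5)):
    total = 0.0
    for k in range(N):
        for l in range(k + 1, N):
            for r in radii:           # B: balls centred at the origin
                w = 1.0 / len(radii)  # omega(B), uniform for simplicity
                for u in range(1, u0 + 1):
                    ind = np.linalg.norm(eps[:T - u], axis=1) <= r
                    # (1/(T-u)) sum_t eps_t eps_t' I(eps_{t-u} in B)
                    M = (eps[u:].T * ind) @ eps[u:] / (T - u)
                    total += w * abs(A[:, k] @ M @ A[:, l])
    return total

Pm = np.array([[0.0, -1.0], [1.0, 0.0]])   # signed permutation of I_2
assert np.isclose(D(np.eye(2), Pm), 0.0)   # D vanishes on signed permutations
assert Psi_T(np.eye(2)) >= 0.0
```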
Once the CUCs are identified, we can fit each σ²_{k,t} with an appropriate univariate volatility model such as the following extended GARCH (1, 1) model:

Z_{k,t} = σ_{k,t} η_{k,t} ,

σ²_{k,t} = ω_k + Σ_{l=1}^{N} α_{kl} Z²_{l,t−1} + β_k σ²_{k,t−1} ,

where η_{k,t} is a sequence of i.i.d. random variables with mean 0 and variance 1, and η_{k,t} is independent of F_{t−1} . The coefficients satisfy ω_k > 0, α_{kl} ≥ 0, β_k ≥ 0, and ω_k = 1 − Σ_{l=1}^{N} α_{kl} − β_k to ensure that Var ( Z_{k,t} ) = 1. The extended GARCH (1, 1) model contains N − 1 extra terms Σ_{l≠k} α_{kl} Z²_{l,t−1} compared with the standard GARCH (1, 1) model. These terms can capture the dependence between the kth CUC and the other CUCs, which are called the causal components if α_{kl} > 0. On the other hand, the conditionally uncorrelated condition (5.4.4) still holds. If 0 ≤ β_k < 1, the extended GARCH (1, 1) model can be rewritten as
model can be rewritten as

σ²_{k,t} = Var ( Z_{k,t} | F_{t−1} ) = ω_k / ( 1 − β_k ) + Σ_{l=1}^{N} α_{kl} Σ_{s=1}^{∞} β_k^{s−1} Z²_{l,t−s}

= ( 1 − Σ_{l=1}^{N} α_{kl} − β_k ) / ( 1 − β_k ) + Σ_{l=1}^{N} α_{kl} Σ_{s=1}^{∞} β_k^{s−1} Z²_{l,t−s} . (5.4.8)

The quasi-Gaussian log-likelihood for θ_k = ( α_{k1} , . . . , α_{kN} , β_k )′ is given by

l_k ( θ_k ) = − Σ_{t=ν+1}^{T} { log σ²_{k,t} ( θ_k ) + Z²_{k,t} / σ²_{k,t} ( θ_k ) }
for a prescribed integer ν, where σ2k,t (θk ) is given by the truncated version of Equation (5.4.8). For
the selection of the causal components, a sequential method can be used. We start with the standard GARCH (1, 1) model, namely, with the constraints α_{kk} > 0 and α_{kl} = 0 for k ≠ l. Next, we compute the QMLE θ̂_k^{(1)} with the constraints α_{kk} > 0, α_{kl} > 0, α_{km} = 0 for m ∉ {k, l}. Then, we choose l_1 ≠ k to maximize l_k ( θ̂_k^{(1)} ) and denote the maximized value by l_k (1). If the model already contains d − 1 terms Z_{l_1,t−1} , . . . , Z_{l_{d−1},t−1} , then we choose an additional term Z_{l_d,t−1} among l_d ∉ {k, l_1 , . . . , l_{d−1}} which maximizes the quasi-Gaussian log-likelihood l_k (d) = l_k ( θ̂_k^{(d)} ) in the same manner. Finally, putting

BICk (d) = −lk (d) + (d + 2) log (T − ν),

we choose dk which minimizes BICk (d) over 0 ≤ d ≤ N − 1.
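A minimal numerical sketch of the extended GARCH (1, 1) recursion, the quasi-Gaussian log-likelihood l_k(θ_k) and BIC_k(d) follows. The CUC series and the coefficients are placeholders, and ω_k is set to 1 − Σ_l α_{kl} − β_k as in the text.

```python
import numpy as np

# Sketch: extended GARCH(1,1) variance recursion and quasi-likelihood for
# one CUC (k = 0); data and coefficients are hypothetical.
rng = np.random.default_rng(1)
T, N, nu = 300, 2, 10
Z = rng.standard_normal((T, N))       # placeholder CUC series
alpha = np.array([0.10, 0.05])        # alpha_{k1}, ..., alpha_{kN}
beta = 0.8
omega = 1.0 - alpha.sum() - beta      # ensures Var(Z_{k,t}) = 1
assert omega > 0

sig2 = np.empty(T)
sig2[0] = 1.0
for t in range(1, T):
    # sigma2_{k,t} = omega_k + sum_l alpha_kl Z^2_{l,t-1} + beta_k sigma2_{k,t-1}
    sig2[t] = omega + alpha @ Z[t - 1] ** 2 + beta * sig2[t - 1]

# l_k(theta_k) = -sum_{t=nu+1}^T { log sigma2_{k,t} + Z^2_{k,t} / sigma2_{k,t} }
lk = -np.sum(np.log(sig2[nu:]) + Z[nu:, 0] ** 2 / sig2[nu:])

d = 1                                  # one causal-component term included
bic = -lk + (d + 2) * np.log(T - nu)   # BIC_k(d)
```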


A natural question is whether the CUCs Z_{1,t} , . . . , Z_{N,t} exist or not. To test the existence of the CUCs, we consider the null hypothesis

H_0 : ε_t = A Z_t and Z_t = diag( σ_{1,t} , . . . , σ_{N,t} ) η_t ,

where A′A = I_N , η_t = ( η_{1,t} , . . . , η_{N,t} )′ , and η_{1,t} , . . . , η_{N,t} are mutually independent and each of
them is a sequence of i.i.d. random variables. This assumption is stronger than the conditionally uncorrelated condition (5.4.4); the assumption of independence is needed for the bootstrap method. Let Â = ( â_1 , . . . , â_N ) be the estimator which minimizes Ψ_T (A) in (5.4.7), Ẑ_{k,t} = â_k′ ε_t and θ̂_k be an estimator for θ_k . The standardized residuals η̂_{k,t} , t = ν + 1, . . . , T, are obtained by standardizing the raw residuals, that is,

η̂_{k,t} = Ẑ_{k,t} / σ_{k,t} ( θ̂_k ) , t = ν + 1, . . . , T.
Then, the following bootstrap sampling scheme can be executed.
(i) Draw η*_{k,t} , t = −∞, . . . , T, by random sampling with replacement from the standardized residuals η̂_{k,ν+1} , . . . , η̂_{k,T} for k = 1, . . . , N.

(ii) Draw Z*_{k,t} = σ*_{k,t} η*_{k,t} , t = −∞, . . . , T, for k = 1, . . . , N, where

σ*²_{k,t} = 1 − β̂_k − Σ_{l=1}^{N} α̂_{kl} + Σ_{l=1}^{N} α̂_{kl} Z*²_{l,t−1} + β̂_k σ*²_{k,t−1} .

(iii) Let ε*_t = Â ( Z*_{1,t} , . . . , Z*_{N,t} )′ for t = 1, . . . , T.

Note that the bootstrap sample ε*_t is drawn from the model in which â_k′ ε*_t is its genuine CUC. Furthermore, if Z_{k,t} and Z_{l,t} are not conditionally uncorrelated, the left hand side of Equation (5.4.6) is equal to a positive constant (instead of 0), so that larger values of Ψ_T ( Â ) imply that the CUCs do not exist. We define Ψ*_T (A) by replacing ε_t with ε*_t in Equation (5.4.7), and compute the bootstrap estimator A* = ( a*_1 , . . . , a*_N ) in the same manner as Â with Ψ_T (A) replaced by Ψ*_T (A). Then, we reject H_0 if Ψ_T ( Â ) is greater than the [nα]th largest value of Ψ*_T ( A* ) over n replications of the above bootstrap resampling, where α ∈ (0, 1) is the size of the test. Similarly, a bootstrap approximation for a 1 − α confidence set of the transformation matrix A can be obtained from

{ A | D ( A, Â ) ≤ c_α , A′A = I_N } ,

where c_α is the [nα]th largest value of D ( A*, Â ). A bootstrap confidence interval for each element of θ_k can be constructed in the same way.
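Steps (i)–(iii) of the bootstrap scheme can be sketched as below. A finite burn-in period stands in for the idealized start at t = −∞, and the fitted quantities (Â, α̂_{kl}, β̂_k, residuals) are placeholders.

```python
import numpy as np

# Sketch of bootstrap steps (i)-(iii): resample standardized residuals,
# rebuild Z*_{k,t} through the fitted variance recursion, then
# eps*_t = A_hat Z*_t.  All fitted values are hypothetical.
rng = np.random.default_rng(2)
T, N, burn = 200, 2, 100
eta_hat = rng.standard_normal((T, N))   # placeholder standardized residuals
A_hat = np.eye(N)
alpha_hat = np.full((N, N), 0.05)       # alpha_hat_{kl}
beta_hat = np.full(N, 0.8)

# (i) resample with replacement, separately for each component k
idx = rng.integers(0, T, size=(burn + T, N))
eta_star = eta_hat[idx, np.arange(N)]

# (ii) Z*_{k,t} = sigma*_{k,t} eta*_{k,t} with the fitted recursion
Z_star = np.zeros((burn + T, N))
sig2 = np.ones(N)
for t in range(1, burn + T):
    sig2 = (1 - beta_hat - alpha_hat.sum(axis=1)
            + alpha_hat @ Z_star[t - 1] ** 2 + beta_hat * sig2)
    Z_star[t] = np.sqrt(sig2) * eta_star[t]

# (iii) bootstrap innovations for t = 1, ..., T (burn-in discarded)
eps_star = Z_star[burn:] @ A_hat.T
```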

5.4.1.2 ICA modeling for financial time series


The sequence of innovations from financial asset returns can be regarded as an additive mixture
of several independent latent economic variables which act on the financial market. Independent
component analysis (ICA) provides a solution to extract these latent sources from the observed
data under few assumptions on these sources. Let st be mutually independent latent sources. For
simplicity, we assume a full rank linear mixing transformation M0 of the latent sources st exists
and gives an empirically sufficient representation of observations et , namely, et = M0 st . Simi-
larly, the separating matrix W0 is defined by st = W0 et . The traditional independent component
estimation is achieved by maximization of non-Gaussianity. This concept is motivated by the cen-
tral limit theorem, which confirms that the sums of independent non-Gaussian random variables
are closer to Gaussian than the original random variables. For given observations et , we seek the
components of W et which are mutually independent (non-Gaussian) for some separating matrix
W . This estimation of W can be carried out by minimizing the mutual information of W et . For
instance, the Kullback–Leibler divergence measures the difference between the joint distribution of
its components and the product of the marginal distributions of its components.
The approach to multivariate volatility modeling based on the concept of dynamic orthogonal
components (DOCs) was proposed by Matteson and Tsay (2011). This approach is an extension of
PCA in which orthogonality in cross dependence over time is also sought. However, it is not as strin-

gent as ICA, since it focuses on orthogonality only up to the fourth order. Let yt = y1,t , . . . , yN,t ′
denote a vector of returns, or log returns, of N assets at time t. The return vector can be partitioned
as
yt = µ t + e t , (5.4.9)
in which µt = E (yt | Ft−1 ) is the conditional mean of yt given Ft−1 , and et is the mean zero inno-
vation vector. The conditional covariance matrix of the returns given Ft−1 , Σt = Cov (yt | Ft−1 ) =
Cov (et | Ft−1 ) is often referred to as a volatility matrix in the finance literature.
For a stationary vector time series xt , assume that there exist DOCs st such that xt = M st
where M is a nonsingular mixing matrix. Furthermore, let U denote an uncorrelating matrix so
that z_t = U x_t is an uncorrelated observation. It is straightforward to find an invertible linear transform U which provides the uncorrelated random variables. Namely, let Σ = Cov ( x_t ) denote the
positive definite unconditional covariance matrix of xt and define zt = U xt . Then Cov (zt ) equals
the N × N identity matrix IN . Let Γ and Λ denote the orthogonal matrix of eigenvectors and the
diagonal matrix of corresponding eigenvalues of Σ, respectively. Then, the invertible matrix U
can be chosen as the inverse symmetric square root of the unconditional covariance matrix defined as U = Σ^{−1/2} = ΓΛ^{−1/2} Γ′ , or simply, in connection with principal component analysis, as
U = Λ−1/2 Γ′ . The relationship between st and zt is given by

st = M −1 xt = M −1 U −1 zt ≡ W zt , (5.4.10)

where W = M −1 U −1 .
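The two whitening choices mentioned above can be checked in a few lines; the covariance matrix is illustrative.

```python
import numpy as np

# Sketch: both U = Sigma^{-1/2} = Gamma Lambda^{-1/2} Gamma' and the PCA form
# U = Lambda^{-1/2} Gamma' whiten x_t, i.e., Cov(U x_t) = I_N.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # hypothetical Cov(x_t)
lam, Gamma = np.linalg.eigh(Sigma)

U = Gamma @ np.diag(lam ** -0.5) @ Gamma.T      # symmetric inverse square root
assert np.allclose(U @ Sigma @ U.T, np.eye(2))  # Cov(z_t) = U Sigma U' = I

U_pca = np.diag(lam ** -0.5) @ Gamma.T          # PCA-style whitening
assert np.allclose(U_pca @ Sigma @ U_pca.T, np.eye(2))
```

Either choice makes Cov(z_t) the identity; the two differ only by a rotation.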
We begin with the specification of DOCs in mean for time series with multivariate serial corre-
lation. To simplify the problem, we assume that we observe x_t = y_t − E ( y_t ) instead of y_t , and that Σ_t is constant. Let M_1 denote the mixing matrix associated with DOCs in mean, namely, x_t = M_1 s_t .
Then, combining (5.4.9) and (5.4.10), we have

s_t = M_1^{−1} x_t = M_1^{−1} µ_t + M_1^{−1} e_t = µ̃_t + ẽ_t . (5.4.11)

Here, the components of s_t are dynamically orthogonal in mean, that is, Cov ( s_t , s_{t−l} ) is diagonal for all lags l = 0, 1, 2, . . .. The remaining marginal serial correlation can be modeled by using a univariate ARMA model. In the corresponding vector ARMA model, we can obtain the version of ẽ_t in Equation (5.4.11) in which Cov ( ẽ_t ) is diagonal. Therefore, we can obtain the version of µ̃_t in which the coefficient matrices specifying µ̃_t are also diagonal.
Next, we focus on volatility modeling. To simplify the discussion, we assume that µt is known,
namely, we assume that xt = et = yt − E (yt | Ft−1 ) is observed. Let M2 denote the mixing matrix
associated with DOCs in volatility, namely, xt = M2 st . Then, combining (5.4.9) and (5.4.10), we
have

Σt = Cov (xt | Ft−1 ) = M2 Cov (st | Ft−1 ) M2′ .


 
Here, the components of s_t are dynamically orthogonal in volatility, that is, Cov ( s²_t , s²_{t−l} ) is diagonal, and Cov ( s_{i,t} s_{j,t} , s_{i,t−l} s_{j,t−l} ) = 0 for i ≠ j, for all lags l = 0, 1, 2, . . .. For linear generalized
autoregressive conditional heteroscedastic time series, we can obtain the version of the conditional
covariance process Cov (st | Ft−1 ) in which Cov (st | Ft−1 ) is diagonal at all time points. The re-
maining marginal conditional heteroscedasticity can be modeled by univariate models such as the
GARCH model. Then, we obtain the version of Cov (st | Ft−1 ) in which all coefficient matrices for
the corresponding matrix models of Cov (st | Ft−1 ) are diagonal.
For any DOC specification we require that x_t satisfy the following standard regularity condition.
Assumption 5.4.6. The process x_t is stationary and ergodic with E ‖x_t‖² < ∞.

Without loss of generality, we can assume that s_t is standardized, that is, E ( s_{i,t} ) = 0 and Var ( s_{i,t} ) = 1 for i = 1, . . . , N. The fundamental motivation is the fact that, empirically, the dynamics of x_t can often be well approximated by an invertible linear combination of DOCs, namely,
xt = M st . For theoretical and practical considerations, it is convenient to work with uncorrelated
random variables, so that it is useful to start with uncorrelated process zt instead of xt in Equation
(5.4.10).
We test for significant lagged cross correlations in st for DOCs in mean and significant lagged
cross correlations in s²_t for DOCs in volatility. Let h denote either the identity transformation h (x) = x for DOCs in mean or the square transformation h (x) = x² for DOCs in volatility, and let h_{i,t−l} = h_i ( s_{t−l} ) and ρ^h_{i,j} (l) = Corr ( h_{i,t} , h_{j,t−l} ). Then, the joint lag m null and alternative hypotheses to test for the existence of DOCs are given by

H_0 : ρ^h_{i,j} (l) = 0 for all i ≠ j, l = 0, . . . , m,
H_A : ρ^h_{i,j} (l) ≠ 0 for some i ≠ j, l = 0, . . . , m.

Matteson and Tsay (2011) suggested using the Ljung–Box-type test statistic

Q⁰_N (m) = T Σ_{i<j} ρ^h_{i,j} (0)² + T (T + 1) Σ_{k=1}^{m} Σ_{i≠j} ρ^h_{i,j} (k)² / (T − k) ,

where T is the sample size of the observations. Under H_0 , Q⁰_N is asymptotically distributed as a chi square distribution with N(N−1)/2 + mN (N − 1) degrees of freedom. The null hypothesis is rejected for a large value of Q⁰_N . If H_0 is rejected, we should consider an alternative modeling procedure.
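A direct computation of Q⁰_N(m) on placeholder data (with h the identity, i.e., the DOC-in-mean version of the test) might look as follows.

```python
import numpy as np

# Sketch: Ljung-Box-type statistic Q0_N(m) on cross correlations of h(s_t),
# here with h the identity; the series are placeholders.
rng = np.random.default_rng(3)
T, N, m = 400, 3, 2
h = rng.standard_normal((T, N))
h = (h - h.mean(0)) / h.std(0)        # standardize each component

def rho(i, j, lag):
    # sample estimate of Corr(h_{i,t}, h_{j,t-lag}) for standardized series
    if lag == 0:
        return np.mean(h[:, i] * h[:, j])
    return np.mean(h[lag:, i] * h[:T - lag, j])

Q = T * sum(rho(i, j, 0) ** 2 for i in range(N) for j in range(i + 1, N))
Q += T * (T + 1) * sum(rho(i, j, k) ** 2 / (T - k)
                       for k in range(1, m + 1)
                       for i in range(N) for j in range(N) if i != j)

df = N * (N - 1) // 2 + m * N * (N - 1)   # chi-square df under H0
```

Comparing Q against the chi-square quantile with df degrees of freedom gives the test decision.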
Given the uncorrelated process zt , the separating matrix W is orthogonal since, from Equation
(5.4.10), we have

IN = Var (st ) = W Var (zt ) W ′ = W W ′ .

Therefore, W has p = N(N−1)/2 free elements, and the orthogonality of W implies that it represents a
rotation in the N-dimensional space and can be parametrized by the vector θ of the rotation angles
with length p.
Let O (N) denote the group of all N × N orthogonal matrices, SO (N) the subgroup (rotation group) whose determinant equals 1, and ξ_1 , . . . , ξ_N the canonical basis of R^N for N ≥ 2. Furthermore, let Q_{i,j} (ψ) denote a rotation of all vectors lying in the ( ξ_i , ξ_j ) plane of R^N by an angle ψ which is oriented such that the rotation from ξ_i to ξ_j becomes positive; namely, Q_{i,j} (ψ) is the identity matrix I_N in which the (i, i)th and ( j, j)th elements are replaced by cos ψ, the (i, j)th element is replaced by − sin ψ and the ( j, i)th element is replaced by sin ψ.
If θ is a vector of rotation angles with length p = N(N−1)/2, indexed by (i, j) : 1 ≤ i < j ≤ N, then any rotation W ∈ SO (N) can be written in the form

W_θ = Q^{(N)} · · · Q^{(2)} ,

in which Q^{(k)} = Q_{1,k} ( θ_1^k ) · · · Q_{k−1,k} ( θ_{k−1}^k ). Moreover, we assume that there exists a unique inverse mapping of W ∈ SO (N) into θ ∈ [−π, π]^p and that the mapping is continuous.
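The parametrization W_θ = Q^{(N)} · · · Q^{(2)} can be implemented directly; the angles below are hypothetical, and the checks confirm that the product is orthogonal with determinant 1.

```python
import numpy as np

def Q(i, j, psi, N):
    # Q_{i,j}(psi): rotation in the (xi_i, xi_j) plane of R^N
    R = np.eye(N)
    R[i, i] = R[j, j] = np.cos(psi)
    R[i, j], R[j, i] = -np.sin(psi), np.sin(psi)
    return R

def W_theta(theta, N):
    # theta: dict of the p = N(N-1)/2 angles keyed by (i, j), i < j (0-based)
    W = np.eye(N)
    for k in range(N - 1, 0, -1):     # Q^{(N)} ... Q^{(2)}
        Qk = np.eye(N)
        for i in range(k):            # Q^{(k)} = Q_{1,k} ... Q_{k-1,k}
            Qk = Qk @ Q(i, k, theta[(i, k)], N)
        W = W @ Qk
    return W

theta = {(0, 1): 0.4, (0, 2): -0.2, (1, 2): 0.9}   # hypothetical angles
W = W_theta(theta, 3)
assert np.allclose(W @ W.T, np.eye(3))   # W is orthogonal
assert np.isclose(np.linalg.det(W), 1.0) # and lies in SO(N)
```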
Given an uncorrelated mean 0 vector time series zt , t = 1, . . . , T , define st (θ) = Wθ zt and
b {·} denote the sample expectation operator. We begin by estimating the parameter vector of
let E
rotation angles θ. Let h be a suitably chosen vector function h : RN → RN with hi (s) = hi (si ),
i = 1, . . . , N. Our objective in estimating θ is to make the lagged sample cross covariance function
of the transformed process
Γ̂_{h(s(θ))} = Ê { h ( s_t (θ) ) h ( s_{t−l} (θ) )′ } − Ê { h ( s_t (θ) ) } Ê { h ( s_{t−l} (θ) ) }′
b {h (st−l (θ))}′

as close to diagonal as possible for some finite set of lags l ∈ N_0 ⊂ ℕ_0 , in which lag 0 is included.
To estimate DOCs in mean, let hi (s) = si , namely, the identity function of each element. On
the other hand, to estimate DOCs in volatility, let hi (s) = s2i , or if the observations zt exhibit heavy
tails, let



h_i (s) = Huber_c (s) = { s²_i if |s_i| ≤ c ; 2 |s_i| c − c² if |s_i| > c }
for some 0 < c < ∞, which is a continuously differentiable version of Huber’s function.
Now, we define the estimator in line with the generalized method of moments (GMM) literature.
Let f ( z_t , θ ) denote a vectorized array of orthogonality constraints whose elements are given by

f^l_{i,j} ( z_t , θ ) = h_i ( s_t (θ) ) h_j ( s_{t−l} (θ) ) − Ê { h_i ( s_t (θ) ) } Ê { h_j ( s_{t−l} (θ) ) } .

Note that the constraint f is indexed by i < j for l = 0 and i ≠ j for l ≠ 0, since the cross covariance function is symmetric at lag 0. Therefore, the length of f is q = p ( 2 |N_0| − 1 ), where |N_0| denotes the number of elements of N_0 . We apply the weights
φ_l = ( 1 − l/|N_0| ) / { p Σ_l ( 1 − l/|N_0| ) } for l = 0,

φ_l = ( 1 − l/|N_0| ) / { 2p Σ_l ( 1 − l/|N_0| ) } for l ≠ 0,

which are arranged into a diagonal weight matrix as

Φ = diag( φ_{l_1} , . . . , φ_{l_1} , . . . , φ_{l_{|N_0|}} , . . . , φ_{l_{|N_0|}} ) .

Letting f̃_T (θ) = Ê { f ( z_t , θ ) } , define the objective function as

J_T (θ) = f̃_T (θ)′ Φ f̃_T (θ)

and the estimator of θ as θ̂_T = arg min_θ J_T (θ). Then, the estimator of the separating matrix is given by W_{θ̂_T} and the estimator of the vector time series of DOCs is given by ŝ_t = W_{θ̂_T} z_t .
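For N = 2 the estimator reduces to a one-dimensional search over a single rotation angle, which makes the construction easy to sketch. The weights and the lag set below are simplified placeholders rather than the exact weights φ_l of the text, and h is the identity (DOCs in mean).

```python
import numpy as np

# Sketch: moment objective J_T(theta) for N = 2 with lags {0, 1} and a simple
# grid minimization (placeholder data and ad hoc weights).
rng = np.random.default_rng(4)
T = 500
z = rng.standard_normal((T, 2))       # uncorrelated input series

def J_T(theta):
    c, s = np.cos(theta), np.sin(theta)
    W = np.array([[c, -s], [s, c]])
    st = z @ W.T                      # s_t(theta) = W_theta z_t
    f = []
    # lag 0: only i < j, since the covariance is symmetric at lag 0
    f.append(np.mean(st[:, 0] * st[:, 1]) - st[:, 0].mean() * st[:, 1].mean())
    # lag 1: both (i, j) = (0, 1) and (1, 0)
    for i, j in [(0, 1), (1, 0)]:
        f.append(np.mean(st[1:, i] * st[:-1, j])
                 - st[:, i].mean() * st[:, j].mean())
    f = np.array(f)
    phi = np.array([0.5, 0.25, 0.25])  # simple positive weights
    return f @ (phi * f)               # f' Phi f

grid = np.linspace(-np.pi, np.pi, 181)
theta_hat = grid[np.argmin([J_T(t) for t in grid])]
```

In practice a proper numerical optimizer replaces the grid search, and the weights φ_l of the text replace the ad hoc ones.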
The magnitude, sign and order of DOCs are ambiguous when they are associated with estimating
W and s. The magnitude of the DOCs has already been fixed by assuming Var (s) = IN . On the
other hand, letting P± denote a signed permutation matrix, the mixing model x = M s is equivalent
to

x = M P±′ P± s = M P±′ (P± s)

in which P± s are new DOCs and M P±′ is the new mixing matrix. However, the identification of
the DOCs up to a signed permutation is sufficient for the purpose of extracting the independent
univariate time series.
In the following, without loss of generality, we assume that E ( x_t ) = 0 and Cov ( x_t ) = I_N . Let

J (θ) = E {f (xt , θ)}′ ΦE {f (xt , θ)} (5.4.12)

and let Θ denote a sufficiently large compact subset of the space Θ (see the online supplemen-
tary materials of Matteson and Tsay (2011) for the detailed definition). We assume the following
regularity conditions.
Assumption 5.4.7.
sup_{θ∈Θ} E ‖ h ( W_θ x_t ) ‖² < ∞

and

sup_{θ∈Θ} E ‖ { ∂h ( W_θ x_t ) / ∂θ } x_t ‖² < ∞.

Now, we have the following results.


Theorem 5.4.8 (Theorem 1 of Matteson and Tsay (2011)). Suppose that N_0 ⊂ ℕ_0 is fixed and
finite, Φ is constant and positive definite, and h is measurable and continuously differentiable.
Furthermore, suppose that there exists a unique minimizer θ 0 ∈ Θ of Equation (5.4.12) and Wθ 0
satisfies the conditions for a unique continuous inverse to exist. If Assumption 5.4.6 holds for xt and
h is chosen for which Assumption 5.4.7 is satisfied, then we have
θ̂_T → θ_0 a.s. as T → ∞.

Note that if DOCs exist, then there exists θ_0 ∈ Θ such that E { f ( x_t , θ_0 ) } = 0.
Also note that f̃_T (θ) is continuously differentiable with respect to θ on Θ by assumption. Denote the q × p gradient matrix of f̃_T (θ) by F_T (θ) = ∂f̃_T (θ)/∂θ and let V_T = T Var ( f̃_T ( θ_0 ) ) be a nonrandom and positive definite q × q matrix. Since both V_T and θ_0 are unknown, we need weakly consistent estimators to use the following theorem for statistical inference.
Theorem 5.4.9 (Theorem 2 of Matteson and Tsay (2011)). Suppose that h ( W_θ x_t ) is strong mixing with geometric rate, E { (∂/∂θ) f ( x_t , θ_0 ) } has linearly independent columns and there exists an estimator V̂_T of V_T such that V̂_T → V_T in probability as T → ∞. Then, under the conditions of Theorem 5.4.8, we have

{ F_T ( θ̂_T )′ Φ V̂_T Φ F_T ( θ̂_T ) }^{−1/2} { F_T ( θ̂_T )′ Φ F_T ( θ̂_T ) } √T ( θ̂_T − θ_0 ) →^L N ( 0, I_p )

as T → ∞.
Note that x_t is strong mixing with geometric rate if there exist constants K > 0 and α ∈ (0, 1) such that

sup_{A ∈ σ{x_t | t ≤ 0}, B ∈ σ{x_t | t > k}} | P ( A ∩ B ) − P (A) P (B) | = φ_k ≤ K α^k ,

in which φ_k is referred to as the mixing rate function.

5.4.1.3 ICA modeling in frequency domain for time series


Regarding the analysis of multiple time series, only a few ICA methods have been applied satisfactorily. Pham and Garat (1997) suggested ICA based on a quasi-maximum likelihood approach in the settings of independent non-Gaussian sources and correlated sources. Denote the observation record by X (1) , . . . , X (T ), where each X (t) is a random vector of K components X_1 (t) , . . . , X_K (t) corresponding to K observation channels. Assume that each observation X_i (t) is a linear combination of K independent sources, namely, X (t) = A S (t), where A is some unknown matrix. Furthermore, {S (t) , t = 1, . . . , T } is a stationary sequence of random vectors with K components S_1 (t) , . . . , S_K (t), and the sequences {S_k (1) , . . . , S_k (T )}, k = 1, . . . , K, are mutually independent. The source separation problem is to reconstruct the sources {S_k (1) , . . . , S_k (T )} from the observations {X (1) , . . . , X (T )} based only on the assumption of independence between the sources. To
write down the likelihood and derive a separation procedure, first, Pham and Garat (1997) assume
that the sources are white in the sense that the S i (t) at different times t are independent and iden-
tically distributed. Of course this assumption is unrealistic, but it will be used only as a working assumption. Furthermore, one needs the density functions of the sources, which are unknown in a
blind context. However, we assume for the moment that these densities are known up to a scale
factor to avoid scaling ambiguity. Namely, the source S k (t) has density fk (·/σk ) /σk , where fk (·)
is known and σ_k is unknown. Then, S (t) has density Π_{k=1}^{K} f_k { s_k (t) /σ_k } /σ_k and X (t) is related to
S (t) through X (t) = AS (t). Here, we assume that A is invertible. Now, we can write down the
log-likelihood of the observation data as
L_T = T [ Σ_{k=1}^{K} Ê ( log [ f_k { e_k′ A^{−1} X / σ_k } / σ_k ] ) − log |det A| ] , (5.4.13)
where Ê is the time average operator Ê { g (X) } = [ g { X (1) } + · · · + g { X (T ) } ] /T and e_k is the kth column of the identity matrix of order K. Noting dA^{−1} = −A^{−1} dA A^{−1} and d log |det A| = tr { A^{−1} dA }, and putting ψ̃_k (x) = − d log { f_k (x) } / dx, we see that

T^{−1} dL_T = Σ_{k=1}^{K} Ê { ψ̃_k ( e_k′A^{−1}X / σ_k ) ( e_k′A^{−1} dA A^{−1}X / σ_k ) } − tr { A^{−1} dA }

+ Σ_{k=1}^{K} [ Ê { ψ̃_k ( e_k′A^{−1}X / σ_k ) ( e_k′A^{−1}X / σ²_k ) } − 1/σ_k ] dσ_k .

Letting ∂_{ij} be the (i, j)th element of the matrix A^{−1} dA, the first two terms of the above right hand side can be rewritten as

Σ_{i=1}^{K} Σ_{j=1}^{K} Ê { ψ̃_i ( e_i′A^{−1}X / σ_i ) ( e_j′A^{−1}X / σ_i ) } ∂_{ij} − Σ_{i=1}^{K} ∂_{ii} .

Since dL_T must vanish for all dA (or equivalently all ∂_{ij} ) and dσ_i at the maximum likelihood estimators Â and σ̂_i of A and σ_i , we have the estimating equations

Ê { ψ̃_i ( e_i′Â^{−1}X / σ̂_i ) e_j′Â^{−1}X } = 0, i ≠ j, i, j = 1, . . . , K,

σ̂_i = Ê { ψ̃_i ( e_i′Â^{−1}X / σ̂_i ) e_i′Â^{−1}X } , i = 1, . . . , K.

Note that post-multiplying A by a diagonal matrix and dividing each σ_i by the corresponding diagonal element does not change the likelihood; therefore, the parameter set {A, σ_1 , . . . , σ_K } is redundant. The first K (K − 1) equations above in fact determine Â up to scaling, and once some scaling convention is adopted to make it unique, the last K equations merely serve to estimate the σ_i . Since we are not interested in the values of σ_i , we just drop these equations. On the other hand, since the densities of the sources are unknown in the blind separation context, we have to take some a priori function ψ_i instead of ψ̃_i , which also includes the factor σ_i . Thus, the separation procedure will be proposed
to solve the system of estimation equations
Ê { ψ_i ( e_i′Â^{−1}X ) e_j′Â^{−1}X } = 0, i ≠ j, i, j = 1, . . . , K (5.4.14)

with respect to Â. Note that by choosing the ψ_i a priori, the above procedure is no longer maximum likelihood. Nevertheless, the procedure can be justified. For instance, if the sources are centred, stationary and ergodic, the left hand sides of (5.4.14) converge to the expectations of ψ_i ( S_i (t) /σ̃_i ) S_j (t), i ≠ j, where the σ̃_i are some scaling factors, and these expectations vanish since S_i (t) has zero mean and the S_i (t) are mutually independent.
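The estimating equations (5.4.14) can be evaluated numerically. In the sketch below ψ_i(x) = tanh(x) is a common a priori choice of score function (an assumption, not prescribed by the text), and the data are simulated from a hypothetical mixing matrix; at the true A the left hand sides are close to zero.

```python
import numpy as np

# Sketch: left-hand sides of the estimating equations (5.4.14) with
# psi_i = tanh (assumed score) on simulated non-Gaussian sources.
rng = np.random.default_rng(6)
T, K = 1000, 2
S = rng.laplace(size=(T, K))          # independent zero-mean sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
X = S @ A.T                           # X(t) = A S(t)

def lhs(A_hat):
    Y = X @ np.linalg.inv(A_hat).T    # rows: A_hat^{-1} X(t)
    psi = np.tanh(Y)
    # M[i, j] = E_hat{ psi_i(e_i' A^{-1} X) e_j' A^{-1} X }
    M = psi.T @ Y / T
    return M - np.diag(np.diag(M))    # keep only the i != j entries

# at the true mixing matrix the off-diagonal moments are near zero
assert np.max(np.abs(lhs(A))) < 0.2
```

A separation algorithm would adjust A_hat until these off-diagonal moments vanish.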
Next, Pham and Garat (1997) considered the derivation of the separation procedure with the
maximum likelihood approach for temporally correlated sources. A well-known technique to decor-
relate a stationary signal is given by performing a discrete Fourier transform, which transforms
X (1) , . . . , X (T ) into
d_X ( k/T ) = ( 1/√T ) Σ_{t=1}^{T} X (t) e^{−i2πtk/T} , k = 0, . . . , T − 1.
         
It is well known that for k ≠ l, E { d_X ( k/T ) d_X ( l/T )′ } → 0 as T → ∞, and E { d_X ( k/T ) d_X ( k/T )′ } converges to the spectral density matrix of the process, f_X (λ), at λ = k/T. It is also known that these random vectors are asymptotically Gaussian under mild conditions (see, e.g., Brillinger (2001c)). Therefore, we can regard { d_X ( k/T ) , 0 ≤ k ≤ T/2 } as independent Gaussian random vectors, as is the usual practice in the time series literature. The joint probability density function of d_X (0) , d_X (1/T ) , . . . for T even is given by
  ′  −1  

 ′ −1 1 1 1 
1  dX (0) fX (0) dX (0) + dX 2 fX 2 dX 2 
 

q n  o exp 
 − 


 2 

det 4π2 f (0) f 1
X X 2
T
−1  !′ !−1 !
Y
2
1 

 k k k 

· n  o exp   −dX fX dX 
 ,
k  T T T 
k=1 det πfX T
 
whereas the terms corresponding to dX 12 simply disappear when T is odd. Since fX (λ) =
Adiag {g1 (λ) , . . . , gK (λ)} A′ , where the gi are the spectral densities of the sources, taking the loga-
rithm reduces to
   2 
K T −1 
 e′ A−1 dX k !

1 XX 
 i T k 


LT = 
   + log g i 
 − T log |det A| + constant.

2 i=1 k=0 
 g i
k T 


T

For the moment, we make the working assumption that the spectral densities of the sources are known up to constants, so that g_i (λ) = η_i h_i (λ), where the h_i are known functions and the η_i are unknown parameters. The ML approach maximizes L_T with respect to η_1 , . . . , η_K and A; hence we have
   
dL_T = Σ_{i=1}^{K} Σ_{k=0}^{T−1} e_i′A^{−1} dA A^{−1} d_X ( k/T ) e_i′A^{−1} d_X ( k/T ) / { η_i h_i ( k/T ) } − T tr { A^{−1} dA }

+ (1/2) Σ_{i=1}^{K} [ Σ_{k=0}^{T−1} | e_i′A^{−1} d_X ( k/T ) |² / { η_i h_i ( k/T ) } − T ] dη_i / η_i .

Again, introducing the elements ∂_{ij} of the matrix A^{−1} dA, we can rewrite the first two terms as

Σ_{i=1}^{K} Σ_{j=1}^{K} Σ_{k=0}^{T−1} e_j′A^{−1} d_X ( k/T ) e_i′A^{−1} d_X ( k/T ) / { η_i h_i ( k/T ) } ∂_{ij} − T Σ_{i=1}^{K} ∂_{ii} .

The differential of L_T must vanish for all infinitesimal increments dA, dη_1, . . . , dη_K at the ML estimators Â, η̂_1, . . . , η̂_K. Therefore the estimating functions are given by
   
\[
  \sum_{k=0}^{T-1} \frac{e_j' \hat{A}^{-1} d_X(\frac{k}{T})\; \overline{e_i' \hat{A}^{-1} d_X(\frac{k}{T})}}{h_i(\frac{k}{T})} = 0,
  \qquad i \neq j = 1, \ldots, K,
\]
\[
  \hat{\eta}_i = \frac{1}{T} \sum_{k=0}^{T-1} \frac{\bigl|e_i' \hat{A}^{-1} d_X(\frac{k}{T})\bigr|^2}{h_i(\frac{k}{T})},
  \qquad i = 1, \ldots, K.
\]

The second set of equations determines the scale factors and is not of interest. The estimated mixing matrix is obtained from the first set of equations (up to scale factors). If we approximate the sums by integrals, we obtain

\[
  \int_0^1 \frac{e_j' \hat{A}^{-1} d_X(\lambda)\; \overline{e_i' \hat{A}^{-1} d_X(\lambda)}}{h_i(\lambda)}\, d\lambda = 0,
  \qquad i \neq j = 1, \ldots, K.
\]
Let φ_{i,t}, t ∈ ℤ denote the impulse response of the filter φ_i whose frequency response is h_i(λ)^{-1}, namely,

\[
  h_i(\lambda)^{-1} = \sum_{t \in \mathbb{Z}} \varphi_{i,t}\, e^{-i 2\pi t \lambda}.
\]

Then √T d_X(λ) h_i(λ)^{-1} becomes the Fourier series whose coefficients are the convolution of X(t) with φ_i, where X(t) = 0 for t < 1 or t > T by convention.
Therefore, using the Parseval equality, the above equations can be rewritten as

\[
  \hat{E}\bigl\{ \bigl(e_j' \hat{A}^{-1} X\bigr)\, \bigl(\varphi_i * e_i' \hat{A}^{-1} X\bigr) \bigr\} = 0,
  \qquad i \neq j = 1, \ldots, K, \tag{5.4.15}
\]

where Ê is the time average operator as in (5.4.13) and ∗ denotes the convolution operator. The equivalent form of these equations is given by

\[
  \frac{1}{T} \sum_{k=1-T}^{T-1} \varphi_{i,k} \sum_{t=\max\{1,1-k\}}^{\min\{T,T-k\}}
  e_j' \hat{A}^{-1} X(t)\; e_i' \hat{A}^{-1} X(t-k).
\]

Using the sample autocovariance matrices of the observed process

\[
  R_T(k) = \frac{1}{T} \sum_{t=\max\{1,1-k\}}^{\min\{T,T-k\}} X(t)\, X(t-k)',
  \qquad \bigl(R_T(k) = 0 \ \text{for } |k| \ge T\bigr),
\]
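A direct implementation of this zero-padded sample autocovariance (our sketch; the function name and the test data are ours) can look as follows:

```python
import numpy as np

def sample_autocov(X, k):
    """R_T(k) = (1/T) sum_t X(t) X(t-k)', with X(t) = 0 outside t = 1, ..., T."""
    T, m = X.shape
    if abs(k) >= T:
        return np.zeros((m, m))
    if k >= 0:
        # pairs (X(t), X(t-k)) with both time indices inside 1..T
        return X[k:].T @ X[:T - k] / T
    return sample_autocov(X, -k).T               # R_T(-k) = R_T(k)'

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
assert np.allclose(sample_autocov(X, -1), sample_autocov(X, 1).T)
assert np.allclose(sample_autocov(X, 0), X.T @ X / 200)
```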

the estimating equation for A is given by

\[
  \sum_{k=1-T}^{T-1} \varphi_{i,k}\; e_j' \hat{A}^{-1} R_T(k)\, \hat{A}^{-1\prime} e_i = 0,
  \qquad i \neq j = 1, \ldots, K.
\]

If we permute and scale both Â and A by the same convention, we can expect that δ = I − Â^{-1}A is small. Therefore, e_i′Â^{-1}X(t) = S_i(t) − Σ_{j=1}^{K} δ_{ij} S_j(t), where S_i(t) = e_i′A^{-1}X(t) and δ_{ij} denotes the (i, j)th element of δ. Hence, we can write (5.4.15) approximately as

\[
  \hat{E}\{S_j (\varphi_i * S_i)\}
  - \sum_{k=1}^{K} \Bigl[ \hat{E}\{S_k (\varphi_i * S_i)\}\, \delta_{jk} + \hat{E}\{S_j (\varphi_i * S_k)\}\, \delta_{ik} \Bigr] = 0,
  \qquad i \neq j = 1, \ldots, K.
\]
If we assume ergodicity, Ê{S_j(φ_i ∗ S_k)} converges to E{S_j(t)(φ_{i,t} ∗ S_k(t))} as T → ∞, which vanishes for j ≠ k and equals ∫_{−1/2}^{1/2} f_j^*(λ) h_i^{-1}(λ) dλ for j = k, where f_j^*(λ) is the true spectral density of the process S_j(t). Therefore, the system of equations (5.4.15) can be further written approximately as

\[
  \hat{E}\{S_j (\varphi_i * S_i)\} \approx
  \int_{-1/2}^{1/2} f_i^*(\lambda) h_i^{-1}(\lambda)\, d\lambda\; \delta_{ji}
  + \int_{-1/2}^{1/2} f_j^*(\lambda) h_i^{-1}(\lambda)\, d\lambda\; \delta_{ij},
  \qquad i \neq j = 1, \ldots, K.
\]

Let ∆ and Ψ be the K(K−1)-vectors (δ_{12}, δ_{21}, δ_{13}, δ_{31}, . . .)′ and (Ψ_{12}, Ψ_{21}, Ψ_{13}, Ψ_{31}, . . .)′ with Ψ_{ij} = S_j(φ_i ∗ S_i). Then the above system of equations can be gathered into ∆ ≈ H^{-1} Ê(Ψ), where H is the block diagonal matrix with blocks

\[
  H_{(ij)} = \begin{pmatrix}
    \int_{-1/2}^{1/2} f_j^*(\lambda) h_i^{-1}(\lambda)\, d\lambda & \int_{-1/2}^{1/2} f_i^*(\lambda) h_i^{-1}(\lambda)\, d\lambda \\
    \int_{-1/2}^{1/2} f_j^*(\lambda) h_j^{-1}(\lambda)\, d\lambda & \int_{-1/2}^{1/2} f_i^*(\lambda) h_j^{-1}(\lambda)\, d\lambda
  \end{pmatrix}
\]

for the pairs (i, j), i ≠ j = 1, . . . , K. Since Ê(Ψ) is the average of the stationary process Ψ(t), under mild conditions the central limit theorem leads to √T Ê(Ψ) →^D N(0, G), where
\[
  G = \lim_{T\to\infty} \frac{1}{T}\, E\Bigl\{ \sum_{t=1}^{T} \sum_{s=1}^{T} \Psi(t)\, \Psi(s)' \Bigr\}
\]

is the block diagonal matrix with blocks

\[
  G_{(ij)} = \sum_{k=-\infty}^{\infty} \begin{pmatrix}
    \gamma_j(k)\, \tilde{\gamma}_i(k) & \bar{\gamma}_j(k)\, \bar{\gamma}_i(-k) \\
    \bar{\gamma}_i(k)\, \bar{\gamma}_j(-k) & \gamma_i(k)\, \tilde{\gamma}_j(k)
  \end{pmatrix}
\]

for the pairs (i, j), i ≠ j = 1, . . . , K, where

\[
  \gamma_i(k) = E\{S_i(t)\, S_i(t+k)\}, \qquad
  \tilde{\gamma}_i(k) = E\bigl\{ \bigl(\varphi_{i,t} * S_i(t)\bigr) \bigl(\varphi_{i,t+k} * S_i(t+k)\bigr) \bigr\}
\]

are autocovariance functions and


 
\[
  \bar{\gamma}_i(k) = E\bigl\{ S_i(t)\, \bigl(\varphi_{i,t+k} * S_i(t+k)\bigr) \bigr\}
\]

is the cross-covariance function of the processes S_i(t) and φ_{i,t} ∗ S_i(t). Note that the spectral density of the process φ_{i,t} ∗ S_i(t) is f_i^*(λ) h_i^{-2}(λ), and the cross-spectral density between S_i(t) and φ_{i,t} ∗ S_i(t) is f_i^*(λ) h_i^{-1}(λ). Therefore, Parseval's identity leads to

\[
  G_{(ij)} = \begin{pmatrix}
    \int_{-1/2}^{1/2} f_j^*(\lambda) f_i^*(\lambda) h_i^{-2}(\lambda)\, d\lambda &
    \int_{-1/2}^{1/2} f_j^*(\lambda) f_i^*(\lambda) h_j^{-1}(\lambda) h_i^{-1}(\lambda)\, d\lambda \\
    \int_{-1/2}^{1/2} f_i^*(\lambda) f_j^*(\lambda) h_i^{-1}(\lambda) h_j^{-1}(\lambda)\, d\lambda &
    \int_{-1/2}^{1/2} f_i^*(\lambda) f_j^*(\lambda) h_j^{-2}(\lambda)\, d\lambda
  \end{pmatrix}
\]
 
with the pairs (i, j), i ≠ j = 1, . . . , K. Since H and G are block diagonal, the pairs (δ_{ij}, δ_{ji}) are asymptotically independent for different (i_1, j_1) ≠ (i_2, j_2) and have covariance matrices H_{(ij)}^{-1} G_{(ij)} H_{(ij)}^{-1\prime}/T. Furthermore, defining

\[
  J_{(ij)} = \begin{pmatrix}
    \int_{-1/2}^{1/2} f_j^*(\lambda)\, f_i^{*\,-1}(\lambda)\, d\lambda & 1 \\
    1 & \int_{-1/2}^{1/2} f_i^*(\lambda)\, f_j^{*\,-1}(\lambda)\, d\lambda
  \end{pmatrix},
\]

we have

\[
  \begin{pmatrix} G_{(ij)} & H_{(ij)} \\ H_{(ij)}' & J_{(ij)} \end{pmatrix}
  = \int_{-1/2}^{1/2}
  \begin{pmatrix} h_i^{-1}(\lambda) \\ h_j^{-1}(\lambda) \\ f_i^{*\,-1}(\lambda) \\ f_j^{*\,-1}(\lambda) \end{pmatrix}
  \begin{pmatrix} h_i^{-1}(\lambda) & h_j^{-1}(\lambda) & f_i^{*\,-1}(\lambda) & f_j^{*\,-1}(\lambda) \end{pmatrix}
  f_i^*(\lambda)\, f_j^*(\lambda)\, d\lambda,
\]

where the right-hand side of this equation is nonnegative definite, so that H_{(ij)}^{-1} G_{(ij)} H_{(ij)}^{-1\prime} ≥ J_{(ij)}^{-1}. Equality holds when each h_i(λ) is proportional to the corresponding f_i^*(λ). In that case, the asymptotic covariance matrix of the estimator attains its minimum and is given by J^{-1}/T, where J is the block diagonal matrix whose blocks are J_{(ij)} for the pairs (i, j), i ≠ j = 1, . . . , K.
To derive an iterative algorithm for the solution of (5.4.15), we expand (5.4.15) around an initial estimator Ã of A and construct the next-step estimator Ã(I − δ̃)^{-1} as the solution of

\[
  \hat{E}\{\tilde{S}_j (\varphi_i * \tilde{S}_i)\}
  = \sum_{k=1}^{K} \Bigl[ \hat{E}\{\tilde{S}_j (\varphi_i * \tilde{S}_k)\}\, \tilde{\delta}_{ik}
  + \hat{E}\{\tilde{S}_k (\varphi_i * \tilde{S}_i)\}\, \tilde{\delta}_{jk} \Bigr],
\]
where S̃_i(t) = e_i′Ã^{-1}X(t) and δ̃_{ij} denotes the (i, j)th element of δ̃. We can expect that S̃_i(t) is close to S_i(t), and we approximate Ê{S̃_j(φ_i ∗ S̃_k)} by E{S_j(t)(φ_{i,t} ∗ S_k(t))}, which vanishes for j ≠ k. Therefore, δ̃_{ij} can be computed from the system of equations

\[
  \hat{E}\{\tilde{S}_j (\varphi_i * \tilde{S}_i)\}
  = \hat{E}\{\tilde{S}_j (\varphi_i * \tilde{S}_j)\}\, \tilde{\delta}_{ij}
  + \hat{E}\{\tilde{S}_i (\varphi_i * \tilde{S}_i)\}\, \tilde{\delta}_{ji},
  \qquad i \neq j = 1, \ldots, K,
\]

which yields the new estimator of A, and we can iterate the procedure until convergence.
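As a rough numerical sketch of one such iteration (our own illustration; the AR(1) sources, the mixing matrix and the short filters φ_i below are arbitrary choices, not from the book):

```python
import numpy as np

def hat_E(a, b):
    # time-average operator Ê{a_t b_t}
    return float(np.mean(a * b))

def one_step(A_tilde, X, phis):
    """One iteration of the fixed-point scheme in the text: with S~ = A~^{-1} X,
    solve, for each pair i < j, the 2x2 linear system
      Ê{S~_j (phi_i * S~_i)} = Ê{S~_j (phi_i * S~_j)} d_ij + Ê{S~_i (phi_i * S~_i)} d_ji
      Ê{S~_i (phi_j * S~_j)} = Ê{S~_j (phi_j * S~_j)} d_ij + Ê{S~_i (phi_j * S~_i)} d_ji
    and update A~ <- A~ (I - delta)^{-1}."""
    K = A_tilde.shape[0]
    S = np.linalg.solve(A_tilde, X)                      # rows S~_i(t)
    conv = [[np.convolve(S[k], phis[i], mode="same") for k in range(K)]
            for i in range(K)]                           # conv[i][k] = phi_i * S~_k
    delta = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            M = np.array([[hat_E(S[j], conv[i][j]), hat_E(S[i], conv[i][i])],
                          [hat_E(S[j], conv[j][j]), hat_E(S[i], conv[j][i])]])
            rhs = np.array([hat_E(S[j], conv[i][i]), hat_E(S[i], conv[j][j])])
            delta[i, j], delta[j, i] = np.linalg.solve(M, rhs)
    return A_tilde @ np.linalg.inv(np.eye(K) - delta)

rng = np.random.default_rng(2)
T, K = 4000, 2
s = np.zeros((K, T))                                     # two AR(1) sources
for t in range(1, T):
    s[0, t] = 0.8 * s[0, t - 1] + rng.standard_normal()
    s[1, t] = -0.5 * s[1, t - 1] + rng.standard_normal()
A = np.array([[1.0, 0.6], [0.3, 1.0]])
X = A @ s
phis = [np.array([1.0, -0.8]), np.array([1.0, 0.5])]     # rough inverse-spectrum filters
A_new = one_step(A, X, phis)
assert A_new.shape == (2, 2) and np.all(np.isfinite(A_new))
```

In practice the update is repeated, re-solving the system at each step, until δ̃ becomes negligible.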
Shumway and Der (1985) considered the problem of simultaneously estimating two convolved vector time series observed with additive noise. For instance, in the problem of monitoring nuclear tests, the observed time series recorded by seismometers contain convolutions. We discuss the problem of estimating an n × 1 stochastic vector x(t) = (x_1(t), . . . , x_n(t))′ in terms of the linear model

\[
  y(t) = a(t) \otimes x(t) + v(t)
\]

for t = 0, ±1, ±2, . . . , where a(t) is an N × n matrix of fixed functions which satisfy the condition

\[
  \sum_{t=-\infty}^{\infty} |t|\, |a_{ij}(t)| < \infty \tag{5.4.16}
\]

for i = 1, . . . , N, j = 1, . . . , n, and the notation ⊗ denotes convolution. Furthermore, the n × 1 signal process x(t) and the N × 1 noise process v(t) are zero-mean strong mixing stationary processes, uncorrelated with each other. The spectral density matrices of y(t), x(t) and v(t) are absolutely continuous, positive definite, and denoted by f_y(λ), f_x(λ) and f_v(λ), respectively. The problem of estimating the vector process x(t) with minimum mean squared error can be achieved by a linear estimator of the form
\[
  \hat{x}(t) = \sum_{u=-\infty}^{\infty} h(u)\, y(t-u), \tag{5.4.17}
\]

which satisfies ⟨x(t) − x̂(t), y(t)⟩ = 0, so that ⟨x(t), y(t)⟩ = ⟨x̂(t), y(t)⟩ by orthogonality.
Therefore, the Fourier transform of h (t) is given by the n × N matrix

\[
  H(\lambda) = f_x(\lambda)\, A^*(\lambda)\, f_y^{-1}(\lambda)
  = \bigl\{ f_x^{-1}(\lambda) + A^*(\lambda) f_v^{-1}(\lambda) A(\lambda) \bigr\}^{-1} A^*(\lambda)\, f_v^{-1}(\lambda), \tag{5.4.18}
\]

where A∗ denotes the complex conjugate transpose of the matrix A, and the matrix A (λ) is defined
as the Fourier transform
\[
  A(\lambda) = \sum_{t=-\infty}^{\infty} a(t) \exp(-i\lambda t), \qquad -\pi \le \lambda \le \pi, \tag{5.4.19}
\]

of the N × n matrix-valued function a(t). The mean squared error of the estimator is

\[
\begin{aligned}
  \langle x(t) - \hat{x}(t),\, x(t) - \hat{x}(t) \rangle
  &= E\bigl[ \{x(t) - \hat{x}(t)\}\{x(t) - \hat{x}(t)\}' \bigr] \\
  &= \int_{-\pi}^{\pi} \bigl\{ H(\lambda) f_y(\lambda) H^*(\lambda) - 2 f_x(\lambda) A^*(\lambda) H^*(\lambda) + f_x(\lambda) \bigr\}\, \frac{d\lambda}{2\pi} \\
  &= \int_{-\pi}^{\pi} f_x(\lambda)\, \bigl\{ I - A^*(\lambda) H^*(\lambda) \bigr\}\, \frac{d\lambda}{2\pi} \\
  &= \int_{-\pi}^{\pi} \bigl\{ f_x^{-1}(\lambda) + A^*(\lambda) f_v^{-1}(\lambda) A(\lambda) \bigr\}^{-1}\, \frac{d\lambda}{2\pi}. \tag{5.4.20}
\end{aligned}
\]
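The two expressions for H(λ) in (5.4.18), together with the last equality in (5.4.20), can be verified numerically at a fixed frequency (our illustration, using f_y = A f_x A* + f_v; all matrices below are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 2, 3

def random_hpd(d):
    # random Hermitian positive definite matrix
    B = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return B @ B.conj().T + d * np.eye(d)

fx = random_hpd(n)                          # f_x(lambda): n x n signal spectrum
fv = random_hpd(N)                          # f_v(lambda): N x N noise spectrum
A = rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))

fy = A @ fx @ A.conj().T + fv               # spectrum of y(t) = a(t) (x) x(t) + v(t)
H1 = fx @ A.conj().T @ np.linalg.inv(fy)
H2 = np.linalg.inv(np.linalg.inv(fx) + A.conj().T @ np.linalg.inv(fv) @ A) \
     @ A.conj().T @ np.linalg.inv(fv)
assert np.allclose(H1, H2)                  # both forms of (5.4.18) agree

# the integrand identity in (5.4.20)
mse = fx @ (np.eye(n) - A.conj().T @ H1.conj().T)
assert np.allclose(mse, np.linalg.inv(np.linalg.inv(fx) + A.conj().T @ np.linalg.inv(fv) @ A))
```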
Note that the filter response (5.4.18) becomes the best linear unbiased estimator when f x−1 (λ) = 0.
Replacing h(u) by

\[
  h_M(u) = \frac{1}{M} \sum_{m=0}^{M-1} H(\lambda_m) \exp(i \lambda_m u)
\]

for λ_m = 2πm/M, an approximate version of (5.4.17) can be given.
To construct the approximation of likelihood, we define the discrete Fourier transform (DFT) of
the vector series y (t), x (t) and v (t) as

\[
  Y(k) = T^{-1/2} \sum_{t=0}^{T-1} y(t) \exp(-i\lambda_k t), \qquad
  X(k) = T^{-1/2} \sum_{t=0}^{T-1} x(t) \exp(-i\lambda_k t), \qquad
  V(k) = T^{-1/2} \sum_{t=0}^{T-1} v(t) \exp(-i\lambda_k t)
\]

for λ_k = 2πk/T, k = 0, 1, . . . , T − 1, respectively. Then, we can see that

Y (k) = A (λk ) X (k) + V (k) + oP (1) , (5.4.21)

where A(λ) is the usual two-sided infinite Fourier transform defined in (5.4.19). If the signal process x(t) and the noise process v(t) are strictly stationary and mixing, the vectors X(k) and V(k) are approximately independent complex normal variables whose covariance matrices are given by f_x(λ_k) and f_v(λ_k) for k = 1, . . . , T/2 − 1, and real multivariate normal for k = 0, T/2, when T is even. As before, the estimator of X(k) can be constructed as

\[
  \hat{X}(k) = H(\lambda_k)\, Y(k), \tag{5.4.22}
\]

where H(λ) is defined in (5.4.18). Since the approximation in (5.4.21) is obtained by the DFT, we can invert it to obtain an approximation to (5.4.17), namely,

\[
  \hat{x}(t) \approx T^{-1/2} \sum_{k=0}^{T-1} \hat{X}(k) \exp(i\lambda_k t).
\]

Now, we consider the application of this general theory to the particular model of interest

\[
  y_{ij}(t) = w_{ij}(t) \otimes r_i(t) \otimes x_j(t) + v_{ij}(t)
\]

with w_{ij}(t) a known function, which is rewritten in the frequency domain as

\[
  Y_{ij}(k) = W_{ij}(\lambda_k)\, R_i(\lambda_k)\, X_j(k) + V_{ij}(k) + o_P(1), \tag{5.4.23}
\]

where the argument λ_k indicates the two-sided infinite Fourier transform and the argument k indicates the DFT. To simplify, we assume that the series x_j(t) are uncorrelated and stationary with spectral densities f_{x_j}(λ), and that the noise series are uncorrelated and stationary with identical spectral density f_v(λ). Let y_j(t) = (y_{1j}(t), . . . , y_{mj}(t))′, h_j(t) = (h_{1j}(t), . . . , h_{mj}(t))′ and

\[
  \hat{x}_j(t) = \sum_{u=-\infty}^{\infty} h_j(u)'\, y_j(t-u),
\]

where h_j(t) is the Fourier transform of

\[
  H_j(\lambda) = \bigl\{ f_{x_j}^{-1}(\lambda) + A_j^*(\lambda) f_v^{-1}(\lambda) A_j(\lambda) \bigr\}^{-1} A_j^*(\lambda)\, f_v^{-1}(\lambda)
  = \bigl\{ f_{x_j}^{-1}(\lambda) f_v(\lambda) + A_j^*(\lambda) A_j(\lambda) \bigr\}^{-1} A_j^*(\lambda) \tag{5.4.24}
\]
with A_j(λ) = (W_{1j}(λ)R_1(λ), . . . , W_{mj}(λ)R_m(λ))′. Therefore, in this case, the estimator in (5.4.22) becomes

\[
  \hat{X}_j(k) = D_j^{-1}(\lambda_k) \sum_{i=1}^{m} \overline{W_{ij}(\lambda_k)\, R_i(\lambda_k)}\; Y_{ij}(k), \tag{5.4.25}
\]

where

\[
  D_j(\lambda_k) = \sum_{i=1}^{m} \bigl| W_{ij}(\lambda_k) \bigr|^2 \bigl| R_i(\lambda_k) \bigr|^2 + \theta_j(\lambda_k)
\]

and

\[
  \theta_j(\lambda_k) = f_{x_j}^{-1}(\lambda_k)\, f_v(\lambda_k) \tag{5.4.26}
\]

is the inverse of the signal-to-noise ratio at frequency λ_k. Furthermore, the mean squared error (5.4.20) in this case becomes

\[
  E\bigl[ \{x_j(t) - \hat{x}_j(t)\}^2 \bigr]
  = \int_{-\pi}^{\pi} \bigl\{ f_{x_j}^{-1}(\lambda) + A_j^*(\lambda) f_v^{-1}(\lambda) A_j(\lambda) \bigr\}^{-1} \frac{d\lambda}{2\pi}
  = \int_{-\pi}^{\pi} D_j^{-1}(\lambda)\, f_v(\lambda)\, \frac{d\lambda}{2\pi}
  := \int_{-\pi}^{\pi} \sigma_j^2(\lambda)\, \frac{d\lambda}{2\pi} \tag{5.4.27}
\]

with

\[
  \sigma_j^2(\lambda) = D_j^{-1}(\lambda)\, f_v(\lambda). \tag{5.4.28}
\]

Up to now, we have assumed that the fixed functions r_i(t), i = 1, . . . , m and the spectral density functions f_v(λ) and f_{x_j}(λ) are known in advance. In practice, however, these parameters must be estimated, e.g., by maximizing the likelihood function. We consider the frequency domain approximation of the likelihood function. Namely, we maximize the likelihood function corresponding to the approximated model (5.4.23) with respect to the parameters R_i(λ_k), f_v(λ_k) and f_{x_j}(λ_k) for i = 1, . . . , m, j = 1, . . . , n, k = 0, 1, . . . , T/2, which all together are denoted by Φ. From (5.4.23), the frequency domain observations Y_{ij}(k) are approximately independent complex Gaussian random variables with zero means and variances

\[
  \eta_{ij}^2(\lambda_k) = \bigl| W_{ij}(\lambda_k) \bigr|^2 \bigl| R_i(\lambda_k) \bigr|^2 f_{x_j}(\lambda_k) + f_v(\lambda_k)
\]

under the condition (5.4.16) for a_{ij}(t) = w_{ij}(t) ⊗ r_i(t). Therefore, the approximation of the log-likelihood is given by

\[
  L(Y; \Phi) \approx - \sum_{ijk} \delta_k \log\bigl\{ \eta_{ij}^2(\lambda_k) \bigr\}
  - \sum_{ijk} \delta_k \frac{\bigl| Y_{ij}(k) \bigr|^2}{\eta_{ij}^2(\lambda_k)}, \tag{5.4.29}
\]

where

\[
  \delta_k = \begin{cases} \tfrac{1}{2} & \bigl( k = 0, \tfrac{T}{2} \bigr), \\ 1 & \text{(otherwise)}, \end{cases}
\]

and (5.4.29) is termed the incomplete-data likelihood in Dempster et al. (1977). On the other hand,
if the joint likelihood of the unobserved components X_j(k) and V_{ij}(k) in (5.4.23) is available, it should be of the form

\[
  L(X, V; \Phi) = - \sum_{jk} \delta_k \log\bigl\{ f_{x_j}(\lambda_k) \bigr\}
  - \sum_{jk} \delta_k \frac{\bigl| X_j(k) \bigr|^2}{f_{x_j}(\lambda_k)}
  - \sum_{ijk} \delta_k \log\{ f_v(\lambda_k) \}
  - \sum_{ijk} \delta_k \frac{\bigl| Y_{ij}(k) - W_{ij}(\lambda_k) R_i(\lambda_k) X_j(k) \bigr|^2}{f_v(\lambda_k)}, \tag{5.4.30}
\]

which is termed the complete-data likelihood in Dempster et al. (1977), and will be maximized as a function of the unknown parameters f_{x_j}(λ_k), f_v(λ_k) and R_i(λ_k).
Dempster et al. (1977) proposed an EM algorithm which involves two steps, namely, the expectation step (E-step) and the maximization step (M-step). Assume that there exist two sample spaces Y and X, and that the observed data y are a realization from Y, termed the incomplete data. The corresponding x in X is not observed directly, but only indirectly through y. That is, there is a mapping x → y(x) from X to Y, and x is known only to lie in X(y), the subset of X determined by the equation y = y(x). We refer to x as the complete data, even though in certain examples x includes the parameters.
We assume that a family of sampling densities f(x | Φ) depends on an r × 1 parameter vector Φ ∈ Ω, and derive its corresponding family of sampling densities g(y | Φ). The relationship between the complete-data specification f(x | Φ) and the incomplete-data specification g(y | Φ) is given by

\[
  g(y \mid \Phi) = \int_{X(y)} f(x \mid \Phi)\, dx.
\]

Our purpose is to find the value Φ* of Φ that maximizes

\[
  L(\Phi) = \log g(y \mid \Phi). \tag{5.4.31}
\]

The conditional density of x given y and Φ is denoted by

\[
  k(x \mid y, \Phi) = \frac{f(x \mid \Phi)}{g(y \mid \Phi)}.
\]

Therefore, (5.4.31) can be written as

\[
  L(\Phi) = \log f(x \mid \Phi) - \log k(x \mid y, \Phi).
\]
We now define a new function

\[
  Q(\Phi' \mid \Phi) = E\bigl\{ \log f(x \mid \Phi') \mid y, \Phi \bigr\}, \tag{5.4.32}
\]

which is assumed to exist for all pairs (Φ′, Φ). Furthermore, we assume f(x | Φ) > 0 almost everywhere in X for all Φ ∈ Ω.
Now we introduce the EM iteration Φ^{(p)} → Φ^{(p+1)} as follows:

E-step Compute Q(Φ | Φ^{(p)}).

M-step Choose Φ^{(p+1)} which maximizes Q(Φ | Φ^{(p)}) among Φ ∈ Ω.

We would like to choose Φ* to maximize log f(x | Φ). However, since we do not know log f(x | Φ), we maximize instead its current expectation given the data y and the current fit Φ^{(p)}. In addition, it is convenient to write

\[
  H(\Phi' \mid \Phi) = E\bigl\{ \log k(x \mid y, \Phi') \mid y, \Phi \bigr\}.
\]

Therefore, (5.4.32) becomes

\[
  Q(\Phi' \mid \Phi) = L(\Phi') + H(\Phi' \mid \Phi).
\]
Then, we have the following lemma.
Lemma 5.4.10 (Lemma 1 of Dempster et al. (1977)). For any pair (Φ′, Φ) in Ω × Ω, we have

\[
  H(\Phi' \mid \Phi) \le H(\Phi \mid \Phi),
\]

where the equality holds if and only if k(x | y, Φ′) = k(x | y, Φ) almost everywhere.


In general, the iterative algorithm should be applicable to any starting point. Therefore, we define
a mapping Φ → M (Φ) from Ω to Ω such that each step Φ(p) → Φ(p+1) is given by
 
Φ(p+1) = M Φ(p) .

Definition 5.4.11. An iterative algorithm with mapping M (Φ) is called a generalized EM (GEM)
algorithm if

Q (M (Φ) | Φ) ≥ Q (Φ | Φ)

for every Φ ∈ Ω.
It should be noted that the definition of the EM algorithm requires

\[
  Q(M(\Phi) \mid \Phi) \ge Q(\Phi' \mid \Phi)
\]

for every pair (Φ′, Φ) in Ω × Ω; that is, Φ′ = M(Φ) maximizes Q(Φ′ | Φ). Now, we have the following results.
Theorem 5.4.12 (Theorem 1 of Dempster et al. (1977)). For every GEM algorithm

L (M (Φ)) ≥ L (Φ) for all Φ ∈ Ω

where the equality holds if and only if

Q (M (Φ) | Φ) = Q (Φ | Φ)

and

k (x | y, M (Φ)) = k (x | y, Φ)

almost everywhere.
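As a toy illustration of the E- and M-steps and of this ascent property (our own example: a two-component Gaussian mixture with known unit variances and equal weights, a model not treated in the text):

```python
import math, random

random.seed(4)
# data: mixture of N(0, 1) and N(4, 1) with equal weights
data = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(4, 1) for _ in range(150)]

def phi(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def loglik(mu1, mu2):
    return sum(math.log(0.5 * phi(x, mu1) + 0.5 * phi(x, mu2)) for x in data)

mu1, mu2 = -1.0, 1.0
lls = [loglik(mu1, mu2)]
for _ in range(25):
    # E-step: posterior responsibilities, i.e., k(x | y, Phi^(p))
    r = [0.5 * phi(x, mu1) / (0.5 * phi(x, mu1) + 0.5 * phi(x, mu2)) for x in data]
    # M-step: maximize Q(Phi | Phi^(p)) -> responsibility-weighted means
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - sum(r))
    lls.append(loglik(mu1, mu2))

# the ascent property of Theorem 5.4.12: every step is likelihood-nondecreasing
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
assert mu2 > mu1
```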
Corollary 5.4.13 (Corollary 1 of Dempster et al. (1977)). Assume that there exists some Φ∗ ∈ Ω
such that L (Φ∗ ) ≥ L (Φ) for all Φ ∈ Ω. Then, for every GEM algorithm, we have
(i) L (M (Φ∗ )) = L (Φ∗ )
(ii) Q (M (Φ∗ ) | Φ∗ ) = Q (Φ∗ | Φ∗ )
(iii) k (x | y, M (Φ∗ )) = k (x | y, Φ∗ ) almost everywhere.

Corollary 5.4.14 (Corollary 2 of Dempster et al. (1977)). Assume that there exists some Φ∗ ∈ Ω
such that L (Φ∗ ) > L (Φ) for all Φ ∈ Ω, Φ , Φ∗ . Then, for every GEM algorithm, we have

M (Φ∗ ) = Φ∗ .

Theorem 5.4.15 (Theorem 2 of Dempster et al. (1977)). Let Φ^{(p)}, p = 0, 1, 2, . . . be a sequence of a GEM algorithm such that

(i) the sequence {L(Φ^{(p)})} is bounded, and

(ii) Q(Φ^{(p+1)} | Φ^{(p)}) − Q(Φ^{(p)} | Φ^{(p)}) ≥ λ‖Φ^{(p+1)} − Φ^{(p)}‖² for some scalar λ > 0 and all p.

Then, the sequence Φ^{(p)} converges to some Φ* in the closure of Ω.
Shumway and Der (1985) applied the EM algorithm in the form of

\[
  Q(\Phi \mid \Phi^{(p)}) = E_p\{ L(X, V; \Phi) \mid Y \} \tag{5.4.33}
\]

using (5.4.30), where E_p denotes the expectation with respect to the density determined by the parameter Φ^{(p)}. From the orthogonality arguments in (5.4.25) and (5.4.27), we have

\[
  E\{ X_j(k) \mid Y \} = \hat{X}_j(k) \qquad \text{and} \qquad \operatorname{Var}\{ X_j(k) \mid Y \} = \sigma_j^2(\lambda_k).
\]

Therefore, we can rewrite (5.4.33) as

\[
\begin{aligned}
  Q(\Phi \mid \Phi^{(p)}) = &- \sum_{jk} \delta_k \log\bigl\{ f_{x_j}(\lambda_k) \bigr\}
  - \sum_{jk} \delta_k \frac{\bigl| \hat{X}_j(k) \bigr|^2 + \sigma_j^2(\lambda_k)}{f_{x_j}(\lambda_k)} \\
  &- \sum_{ijk} \delta_k \log\{ f_v(\lambda_k) \}
  - \sum_{ijk} \delta_k \frac{\bigl| Y_{ij}(k) - W_{ij}(\lambda_k) R_i(\lambda_k) \hat{X}_j(k) \bigr|^2
    + \bigl| W_{ij}(\lambda_k) R_i(\lambda_k) \bigr|^2 \sigma_j^2(\lambda_k)}{f_v(\lambda_k)},
\end{aligned}
\]

and so the maximization step leads to


\[
  \hat{R}_i(\lambda_k) = C_i(k)^{-1} \sum_{j=1}^{n} \overline{W_{ij}(\lambda_k)\, \hat{X}_j(k)}\; Y_{ij}(k), \tag{5.4.34}
\]

where

\[
  C_i(k) = \sum_{j=1}^{n} \Bigl\{ \bigl| \hat{X}_j(k) \bigr|^2 + \sigma_j^2(\lambda_k) \Bigr\} \bigl| W_{ij}(\lambda_k) \bigr|^2 \tag{5.4.35}
\]

for the i = 1, . . . , m receiver functions. On the other hand, the signal and noise spectral densities are estimated by

\[
  \hat{f}_{x_j}(\lambda_k) = \bigl| \hat{X}_j(k) \bigr|^2 + \sigma_j^2(\lambda_k) \tag{5.4.36}
\]

for j = 1, . . . , n and

\[
  \hat{f}_v(\lambda_k) = \frac{1}{mn} \sum_{ij} \Bigl\{ \bigl| Y_{ij}(k) - W_{ij}(\lambda_k) \hat{R}_i(\lambda_k) \hat{X}_j(k) \bigr|^2
  + \bigl| W_{ij}(\lambda_k) \hat{R}_i(\lambda_k) \bigr|^2 \sigma_j^2(\lambda_k) \Bigr\}. \tag{5.4.37}
\]
mn  ij 

Shumway and Der (1985) suggested the following EM algorithm for estimating the unknown parameters and obtaining the deconvolved series x̂_j(t), j = 1, . . . , n.

(i) Specify an initial guess of the receiver functions R_i(t), which can be taken as simple delta functions for the first iteration.
(ii) Initialize f_v(λ) and the noise-to-signal ratios θ_j(λ), j = 1, . . . , n in Equation (5.4.26).
(iii) Calculate X̂_j(k) and σ_j²(λ) using Equations (5.4.25)–(5.4.28).
(iv) Calculate R̂_i(λ) using Equations (5.4.34) and (5.4.35).
(v) Compute f̂_{x_j}(λ) and f̂_v(λ) using (5.4.36) and (5.4.37).
(vi) Obtain the time-domain versions of X̂_j(k) and R̂_i(λ) by the inverse DFT. Then, return to Step (iii) for the next iteration.
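A minimal frequency-domain sketch of steps (iii)–(v) (our own toy setup, not from the book: n = 1 signal, m = 2 receivers, w_{ij} and r_i taken as delta functions, and θ initialized as if f_x ≈ 1):

```python
import numpy as np

rng = np.random.default_rng(5)
T, m = 256, 2                                  # m receivers, n = 1 signal
x = rng.standard_normal(T)
y = x + 0.5 * rng.standard_normal((m, T))      # y_i(t) = x(t) + v_i(t)

Y = np.fft.fft(y, axis=1) / np.sqrt(T)         # DFTs Y_i(k)
R = np.ones((m, T), dtype=complex)             # step (i): delta receiver functions
fv = np.full(T, 0.25)                          # step (ii): initial noise spectrum
theta = fv / 1.0                               # theta = f_x^{-1} f_v, assuming f_x ~ 1

for _ in range(5):
    # step (iii): signal estimate (5.4.25) and its error spectrum (5.4.28)
    D = np.sum(np.abs(R) ** 2, axis=0) + theta
    Xhat = np.sum(np.conj(R) * Y, axis=0) / D
    sig2 = fv / D
    # step (iv): receiver functions (5.4.34)-(5.4.35), with W_ij = 1
    C = np.abs(Xhat) ** 2 + sig2
    R = np.conj(Xhat) * Y / C
    # step (v): signal and noise spectra (5.4.36)-(5.4.37)
    fx = np.abs(Xhat) ** 2 + sig2
    resid = np.abs(Y - R * Xhat) ** 2 + np.abs(R) ** 2 * sig2
    fv = resid.mean(axis=0)
    theta = fv / fx

assert np.all(fx > 0) and np.all(fv > 0)
```

The time-domain deconvolved series of step (vi) would then be obtained with `np.fft.ifft`.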
5.5 Rank-Based Optimal Portfolio Estimation
In this section, we propose two different multivariate rank-based portfolio estimation procedures. Let y_t = (y_{1,t}, . . . , y_{N,t})′ be the vector of returns (or log returns) of an N-tuple of assets at time point t. Assume that y_t is of the form
that yt is of the form

\[
  y_t = \mu_t + e_t,
\]

where µt = E {yt | Ft−1 } is the conditional expectation of yt given the sigma-field Ft−1 generated by
the past returns yt−1 , yt−2 , . . ., and {et } is the mean zero innovation process. For simplicity, assume
that µt is known or, equivalently, that we directly observe xt = et = yt − E {yt | Ft−1 }. In practice,
of course, µt is unknown and has to be estimated in some way, for instance by the DOCs in mean
discussed in Section 5.4.1.2. However, we concentrate on volatility issues.

5.5.1 Portfolio Estimation Based on Ranks for Independent Components


For a stationary vector time series xt , assume that there exist DOCs in volatility st such that xt =
M st where M is a nonsingular mixing matrix. Furthermore, let L denote an uncorrelating matrix so
that zt = Lxt is an uncorrelated observation. Namely, let Σ = Cov (xt ) denote the positive definite
unconditional covariance matrix of xt , and let Γ and Λ denote the orthogonal matrix of eigenvectors
and the diagonal matrix of corresponding eigenvalues of Σ, respectively. Then, the invertible matrix
L can be chosen as the inverse symmetric square root of the unconditional covariance matrix defined
as L = Σ^{-1/2} = ΓΛ^{-1/2}Γ′, or simply, in connection with principal component analysis, as L = Λ^{-1/2}Γ′. The relationship between s_t and z_t is given by

\[
  s_t = M^{-1} x_t = M^{-1} L^{-1} z_t \equiv W z_t,
\]

where W = M^{-1} L^{-1}.
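A small numerical illustration of the uncorrelating transformation (ours; the mixing matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.5, 0.5],
              [0.5, 0.0, 1.0]])                # arbitrary nonsingular mixing
x = rng.standard_normal((5000, 3)) @ A.T       # correlated observations

Sigma = np.cov(x, rowvar=False)
lam, Gamma = np.linalg.eigh(Sigma)             # Sigma = Gamma Lambda Gamma'
L = Gamma @ np.diag(lam ** -0.5) @ Gamma.T     # L = Sigma^{-1/2} = Gamma Lambda^{-1/2} Gamma'
assert np.allclose(L @ Sigma @ L.T, np.eye(3), atol=1e-8)

z = x @ L.T                                    # z_t = L x_t is uncorrelated
assert np.allclose(np.cov(z, rowvar=False), np.eye(3), atol=1e-6)
```

The PCA variant simply uses L = Λ^{-1/2}Γ′ instead of the symmetric square root.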
The volatility modeling is given by

\[
  \Sigma_t = \operatorname{Cov}(x_t \mid \mathcal{F}_{t-1}) = M \operatorname{Cov}(s_t \mid \mathcal{F}_{t-1})\, M'
  = L^{-1} W^{-1} \operatorname{Cov}(s_t \mid \mathcal{F}_{t-1})\, W^{-1\prime} L^{-1\prime}
  = L^{-1} W^{-1} A_t W^{-1\prime} L^{-1\prime},
\]

where s_t is assumed to be conditionally uncorrelated given F_{t−1} for every time point t, and therefore A_t = Cov(s_t | F_{t−1}) = diag(σ²_{1,t}, . . . , σ²_{N,t}) with A_{ii,t} = σ²_{i,t} = Cov(s_{i,t} | F_{t−1}). Otherwise, s_t is assumed to be dynamically orthogonal in volatility as in Subsection 5.4.1.2.
We can obtain an invertible consistent estimator L̂ of the uncorrelating transformation L from the orthogonal eigenvectors and eigenvalues of the estimated sample covariance matrix Σ̂. Furthermore, we can apply the estimator W_{θ̂_T} of the separating matrix given by DOCs in volatility as in Section 5.4.1.2, and obtain the estimator of the vector time series of DOCs, ŝ_t = W_{θ̂_T} z_t.
 
For the conditionally uncorrelated process At = Cov (st | Ft−1 ) = diag σ21,t , . . . , σ2N,t , univari-
ate models such as GARCH can be used for modeling. Hirukawa et al. (2010) considered that each

conditionally uncorrelated signal si,t | t ∈ Z , i = 1, . . . , N is a univariate LARCH (∞) process

si,t = σi,t εit (5.5.1)


X

σi,t = bi,0 + bi, j si,t− j , t ∈ Z.
j=1

In order to characterize semiparametric efficiency, we need the uniform local asymptotic normality
(ULAN) result for LARCH models. We consider a class of semiparametric LARCH models of the
form

st = σt (η) εt
\[
  \sigma_t(\eta) = b_0(\eta) + \sum_{j=1}^{\infty} b_j(\eta)\, s_{t-j}, \qquad t \in \mathbb{Z},
\]

where the ε_t's are i.i.d. with mean 0, variance 1, and unspecified density g, b_0(η) ≠ 0, and η ∈ H ⊂ ℝ^r. Denote by P^{(T)}_{η,g} the probability distribution of the observation under parameter value η and density g. Let η_0 be some fixed value of η, i.e., the true parameter value, and consider sequences η̃_T such that, for some bounded sequence h_T, √T(η̃_T − η_0) − h_T → 0. Then, the log-likelihood ratio Λ_T(η̃_T, η_0) for P^{(T)}_{η̃_T,g} with respect to P^{(T)}_{η_0,g} is given by

\[
  \Lambda_T(\tilde{\eta}_T, \eta_0) = \sum_{t=1}^{T}
  \Bigl[ l_{1 + W_{Tt}'(\tilde{\eta}_T - \eta_0)}\bigl( \varepsilon_t(\eta_0) \bigr) - l_1\bigl( \varepsilon_t(\eta_0) \bigr) \Bigr],
\]

where l_σ(x) := log{σ^{-1} g(x/σ)}, ε_t(η) := s_t/σ_t(η), σ_t(η) := b_0(η) + Σ_{j=1}^{∞} b_j(η) s_{t−j}, and W_{Tt} := (W_{Tt1}, . . . , W_{Ttr})′ with

\[
  W_{Ttk} := W_t(\eta_0, \tilde{\eta}_T)_k
  = \sigma_t(\eta_0)^{-1} \bigl( \tilde{\eta}_{T,k} - \eta_{0,k} \bigr)^{-1}
  \bigl\{ \sigma_t\bigl( \tilde{\eta}_T^{k} \bigr) - \sigma_t\bigl( \tilde{\eta}_T^{k-1} \bigr) \bigr\},
  \qquad \tilde{\eta}_T^{k} := \bigl( \tilde{\eta}_{T,1}, \ldots, \tilde{\eta}_{T,k}, \eta_{0,k+1}, \ldots, \eta_{0,r} \bigr)',
  \quad k = 0, \ldots, r.
\]

Then, we can easily see that

\[
  W_{Tt}'(\tilde{\eta}_T - \eta_0) = \sigma_t(\eta_0)^{-1} \bigl\{ \sigma_t(\tilde{\eta}_T) - \sigma_t(\eta_0) \bigr\}.
\]

Define

\[
  W_t := W_t(\eta_0) := \sigma_t(\eta_0)^{-1} \operatorname{grad}_\eta \{ \sigma_t(\eta) \} \big|_{\eta = \eta_0},
\]

and assume that x ↦ g(x) admits a derivative x ↦ Dg(x) and the score for scale ψ(x) = −{1 + x Dg(x)/g(x)} satisfies

\[
  0 < I_s(g) := E\{ \psi^2(\varepsilon_t) \} = \int \Bigl( 1 + x \frac{Dg(x)}{g(x)} \Bigr)^2 g(x)\, dx < \infty,
\]

where I_s(g) is the Fisher information of scale.
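For the standard normal density g, Dg(x)/g(x) = −x, so ψ(x) = x² − 1 and I_s(g) = E(ε² − 1)² = 2; this can be checked numerically (our illustration, not from the book):

```python
import numpy as np

# For standard normal g: Dg(x)/g(x) = -x, hence psi(x) = -(1 + x * Dg/g) = x^2 - 1
# and I_s(g) = E{psi(eps)^2} = E{(eps^2 - 1)^2} = 3 - 2*1 + 1 = 2.
x = np.linspace(-10.0, 10.0, 200_001)
g = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
integrand = (1 + x * (-x))**2 * g              # (1 + x Dg(x)/g(x))^2 g(x)
I_s = integrand.sum() * (x[1] - x[0])          # simple Riemann sum
assert abs(I_s - 2.0) < 1e-3
```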
Furthermore, denote the score and the Fisher information matrix by

\[
  \dot{l}_t(\eta) = W_t(\eta)\, \psi\bigl( \varepsilon_t(\eta) \bigr) \qquad \text{and} \qquad
  I_g(\eta) = E_{\eta,g}\bigl\{ \dot{l}_t(\eta)\, \dot{l}_t'(\eta) \bigr\},
\]

respectively, and assume that several regularity conditions hold (see Drost et al. (1997) and Hirukawa et al. (2010) for details). Then, we have the following ULAN result for LARCH models.
Theorem 5.5.1 (ULAN). Let the several regularity conditions as in Drost et al. (1997) hold. Then, under P^{(T)}_{η_0,g}, for any bounded sequence h_T ∈ ℝ^r, we have

\[
  \Lambda_T(\tilde{\eta}_T, \eta_0)
  = h_T' T^{-1/2} \sum_{t=1}^{T} \dot{l}_t(\eta_0)
  - \frac{1}{2} T^{-1} \sum_{t=1}^{T} \bigl\{ h_T' \dot{l}_t(\eta_0) \bigr\}^2 + R_T, \tag{5.5.2}
\]

with

\[
  R_T \xrightarrow{P} 0 \qquad \text{and} \qquad
  \Delta^{(T)}(\eta_0, g) := T^{-1/2} \sum_{t=1}^{T} \dot{l}_t(\eta_0) \xrightarrow{D} N\bigl( 0, I_g(\eta_0) \bigr).
\]

Furthermore, the quadratic approximation (5.5.2) holds uniformly over T^{-1/2}-neighbourhoods of η_0.
Since σ_t(η) admits the Volterra series expansion

\[
  \sigma_t(\eta) = b_0(\eta) \Bigl\{ 1 + \sum_{k=1}^{\infty} \sum_{j_1,\ldots,j_k=1}^{\infty}
  b_{j_1}(\eta) \cdots b_{j_k}(\eta)\, \varepsilon_{t-j_1}(\eta) \cdots \varepsilon_{t-j_1-\cdots-j_k}(\eta) \Bigr\},
\]

the central sequence

\[
  \Delta^{(T)}(\eta_0, g) = T^{-1/2} \sum_{t=1}^{T} \dot{l}_t(\eta_0)
  = T^{-1/2} \sum_{t=1}^{T} W_t(\eta_0)\, \psi\bigl( \varepsilon_t(\eta_0) \bigr)
  = T^{-1/2} \sum_{t=1}^{T} \sigma_t(\eta_0)^{-1} \operatorname{grad}_\eta \{ \sigma_t(\eta) \} \big|_{\eta=\eta_0}\, \psi\bigl( \varepsilon_t(\eta_0) \bigr)
\]

depends on infinitely many lagged values of ε_t. Therefore, we replace it with an approximate central sequence of the form

\[
  \tilde{\Delta}^{(T)}(\eta_0, g)
  = (T-p)^{-1/2} \sum_{t=p+1}^{T} \tilde{\sigma}_t(\eta_0)^{-1} \operatorname{grad}_\eta \{ \tilde{\sigma}_t(\eta) \} \big|_{\eta=\eta_0}\, \psi\bigl( \varepsilon_t(\eta_0) \bigr)
  =: (T-p)^{-1/2} \sum_{t=p+1}^{T} J_{\eta_0,g}\bigl( \varepsilon_t(\eta_0), \ldots, \varepsilon_{t-p}(\eta_0) \bigr),
\]

where, with p sufficiently large,

\[
  \tilde{\sigma}_t(\eta) := b_0(\eta) \Bigl\{ 1 + \sum_{k=1}^{\infty} \sum_{1 \le j_1 + \cdots + j_k \le p}
  b_{j_1}(\eta) \cdots b_{j_k}(\eta)\, \varepsilon_{t-j_1}(\eta) \cdots \varepsilon_{t-j_1-\cdots-j_k}(\eta) \Bigr\}.
\]

Finally, we can estimate the σ_{i,t}(η), i = 1, . . . , N corresponding to the N univariate LARCH(∞) models (5.5.1) from the estimated DOCs ŝ_{i,t}, i = 1, . . . , N, which are the elements of ŝ_t = W_{θ̂_T} z_t, via the univariate rank-based estimation methods developed in Section 5.2.4.1. This implies, of course, N distinct rankings, i.e., one for each univariate LARCH process. Then, the estimated volatility is given by

\[
  \hat{\Sigma}_t = \hat{L}^{-1} \hat{W}_{\theta_T}^{-1} \hat{A}_t \hat{W}_{\theta_T}^{-1\prime} \hat{L}^{-1\prime},
\]

where Â_t = diag(σ̂²_{1,t}, . . . , σ̂²_{N,t}).

5.5.2 Portfolio Estimation Based on Ranks for Elliptical Residuals


Assume that the joint distribution P^{(T)}_{η,f} of the observations is characterized by a model involving a parameter η and some unobserved multivariate white noise process ε_t with density f. More specifically, in the present context, assume that the observation at time t is given by

\[
  e_t = \Sigma_t(\eta)\, \varepsilon_t,
\]

where Σ_t is some unobserved positive definite matrix-valued process characterizing the joint volatility of e_t, and the ε_t's are i.i.d. with density f.
Furthermore, we assume that the density f of the noise {ε_t} is elliptically symmetric. That is, let V be a symmetric positive definite N × N matrix and let f_1 : ℝ_0^+ → ℝ^+ be such that f_1 > 0 a.e. and ∫_0^∞ r^{N−1} f_1(r) dr < ∞; then we assume that the innovation density is of the form

\[
  f(x; V, f_1) := c_{N,f_1} (\det V)^{-1/2} f_1\bigl( \|x\|_V \bigr), \qquad x \in \mathbb{R}^N, \tag{5.5.3}
\]

where ‖x‖_V := (x′V^{-1}x)^{1/2} denotes the norm of x in the metric associated with V, the constant c_{N,f_1} := (ω_N μ_{N−1;f_1})^{-1} is the normalization factor, ω_N stands for the (N−1)-dimensional Lebesgue measure of the unit sphere S^{N−1} ⊂ ℝ^N, and μ_{l;f_1} := ∫_0^∞ r^l f_1(r) dr.

Moreover, assume that there exists a residual function ε^{(T)}(η) := (ε_1(η), . . . , ε_T(η))′, where ε_t(η) is measurable with respect to η, the past observations and a p-tuple of initial values, such that, under P^{(T)}_{η,f}, ε_t(η) coincides with ε_t.

Define the V-standardized residuals as

\[
  Z_t^{(T)}(\eta, V) := V^{-1/2} \varepsilon_t^{(T)}(\eta).
\]

Let R_t^{(T)} = R_t^{(T)}(η, V) denote the ranks of d_t^{(T)} = d_t^{(T)}(η, V) := ‖Z_t^{(T)}‖ among d_1, . . . , d_T, and define the multivariate signs U_t as

\[
  U_t = U_t^{(T)}(\eta, V) := \|Z_t\|^{-1} Z_t.
\]

Under P^{(T)}_{η,f}, the signs U_t^{(T)}(η, V) are i.i.d. and uniformly distributed over the unit sphere, the ranks R_t^{(T)}(η, V) are uniformly distributed over the T! permutations of {1, . . . , T}, and signs and ranks are mutually independent.
However, since the true value of the shape matrix V is unknown, the genuine ranks R_t^{(T)}(η_0, V) and signs U_t^{(T)}(η_0, V) associated with some given parameter value η_0 cannot be computed from the residuals (ε_1(η_0), . . . , ε_T(η_0)). Therefore, we need some estimator V̂ of the shape matrix V which satisfies the following assumption.
Assumption 5.5.2 (Assumptions (D1)–(D2) of Hallin and Paindaveine (2005)). Let ε_1, . . . , ε_T be i.i.d. with density f satisfying (5.5.3). The sequence V̂ = V̂^{(T)} := V̂^{(T)}(ε_1, . . . , ε_T) of estimators of V is such that

(i) √T(V̂^{(T)} − aV) = O_P(1) as T → ∞ for some positive real a,
(ii) V̂^{(T)} is invariant under permutations and under reflections, with respect to the origin in ℝ^N, of the ε_t's, and
(iii) the estimator V̂ is quasi-affine-equivariant, in the sense that, for all T and all N × N full-rank matrices M, V̂(M) = d M V̂ M′, where V̂(M) stands for the statistic V̂ computed from the T-tuple (M ε_1, . . . , M ε_T), and d denotes some positive scalar that may depend on M and (ε_1, . . . , ε_T).
Then, we can instead use the pseudo-Mahalanobis signs and ranks defined in Hallin and Paindaveine (2002) as

\[
  W_t(\eta_0) = W_t^{(T)}(\eta_0) := \hat{V}^{-1/2} \varepsilon_t(\eta_0) \big/ \bigl\| \hat{V}^{-1/2} \varepsilon_t(\eta_0) \bigr\|,
\]

where V̂ is an appropriate estimator of the shape V which satisfies Assumption 5.5.2. The pseudo-Mahalanobis ranks R̂_t(η_0) := R̂_t^{(T)}(η_0) are similarly defined as the ranks of the pseudo-Mahalanobis distances

\[
  d_t\bigl( \eta_0, \hat{V} \bigr) := \bigl\| \hat{V}^{-1/2} \varepsilon_t(\eta_0) \bigr\|.
\]
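The pseudo-Mahalanobis signs, distances and ranks can be computed as follows (our sketch; here the plain sample covariance plays the role of the shape estimator V̂, which is a simplification valid up to scale for elliptical densities with finite second moments):

```python
import numpy as np

rng = np.random.default_rng(8)
T, N = 300, 3
mix = np.array([[1.0, 0.4, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
eps = rng.standard_normal((T, N)) @ mix.T      # elliptical (Gaussian) residuals

V_hat = np.cov(eps, rowvar=False)              # simplistic plug-in shape estimate
lam, G = np.linalg.eigh(V_hat)
V_inv_sqrt = G @ np.diag(lam ** -0.5) @ G.T    # V_hat^{-1/2}

Z = eps @ V_inv_sqrt.T                         # standardized residuals
d = np.linalg.norm(Z, axis=1)                  # pseudo-Mahalanobis distances
U = Z / d[:, None]                             # pseudo-Mahalanobis signs (unit vectors)
ranks = d.argsort().argsort() + 1              # ranks in {1, ..., T}

assert np.allclose(np.linalg.norm(U, axis=1), 1.0)
assert np.array_equal(np.sort(ranks), np.arange(1, T + 1))
```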
Finally, nonparametric signed rank test statistics can be based on lag-i rank-based generalized cross-covariance matrices of the form

\[
  \tilde{\Gamma}_{i;J}^{(T)}\{\eta\} := \hat{V}^{\prime\,-1/2}
  \Biggl\{ \frac{1}{T-i} \sum_{t=i+1}^{T}
  J_1\Bigl( \frac{\hat{R}_t(\eta_0)}{T+1} \Bigr) J_2\Bigl( \frac{\hat{R}_{t-i}(\eta_0)}{T+1} \Bigr)
  W_t(\eta_0)\, W_{t-i}'(\eta_0) \Biggr\} \hat{V}^{\prime\,1/2},
\]

where the score functions J_1 and J_2 satisfy the following assumption.


Assumption 5.5.3 (Assumptions (C) of Hallin and Paindaveine (2005)). The score functions J_l : (0,1) → ℝ, l = 1, 2, are continuous differences of two monotone increasing functions, and satisfy ∫_0^1 {J_l(u)}² du < ∞, l = 1, 2.
Let φ_f := −2 (f^{1/2})′/f^{1/2}, where (f^{1/2})′ is the weak derivative of f^{1/2} in L²(ℝ_0^+, μ_{N−1}), which stands for the space of all measurable functions h : ℝ_0^+ → ℝ satisfying ∫_0^∞ {h(r)}² r^{N−1} dr < ∞. The score functions which yield locally and asymptotically optimal procedures are of the form J_1 := φ_{f̃_1} ∘ F̃_N^{-1} and J_2 := F̃_N^{-1} for some radial density f̃_1, where F̃_N stands for the cdf associated with the radial pdf

\[
  \tilde{f}_N(r) = \bigl( \mu_{N-1;\tilde{f}_1} \bigr)^{-1} r^{N-1} \tilde{f}_1(r)\, \chi_{\{r>0\}}, \qquad r \in \mathbb{R}.
\]

Then, the above assumption can be rewritten as follows.

Assumption 5.5.4 (Assumptions (C′) of Hallin and Paindaveine (2005)). The radial density f̃_1 is such that φ_{f̃_1} is the continuous difference of two monotone increasing functions, μ_{N−1;f̃_1} < ∞, and

\[
  \int_0^\infty \bigl\{ \varphi_{\tilde{f}_1}(r) \bigr\}^2 r^{N-1} \tilde{f}_1(r)\, dr < \infty.
\]

Problems
5.1 Show (5.1.4) by applying the Lebesgue bounded convergence theorem.
5.2 Show (5.1.10), where σ²_L corresponds to the variance of the random variable X which has distribution function L.
5.3 Integrating by parts, and then applying the Cauchy–Schwarz inequality, show (5.1.13).
5.4 Verify (5.1.22) and (5.1.23).
5.5 Verify the relations for the coefficients (5.2.12).
5.6 Show that the vectors U_i^{(n)} are invariant under transformations of X_i − θ which preserve half-lines through the origin.
5.7 Show that matrices UX and RX in (5.3.16) are positive definite.
5.8 Show (5.3.35).
5.9 Using the Parseval equality, verify (5.4.15).
5.10 Verify (5.4.24) and (5.4.27).
Chapter 6

Portfolio Estimation Influenced by Non-Gaussian Innovations and Exogenous Variables

The traditional theory of portfolio estimation assumes that the asset returns are independent normal
random variables. However, as we saw in the previous chapters, the returns are not necessarily either
normal or independent. Also Gouriéroux and Jasiak (2001) studied financial time series returns, and
showed how these series can be modeled as ARMA processes. Aït-Sahalia et al. (2011) discussed
how stock returns sampled at a high frequency show AR dependence. Discussion on the asymmetry
of the distribution of returns can be found in Adcock and Shutes (2005) and the references therein.
De Luca et al. (2008) observed that the distribution of the European market index returns conditional
on the sign of the one-day lagged US return is skew-normal. Hence, time series models with skew-
normal innovations are important. In view of this we assume that the return processes are generated
by non-Gaussian linear processes.
In Section 6.1, we suppose that the return process is generated by a linear process with skew-
normal distribution characterized by the skewness parameter δ. Let ĝ be the generalized mean-
variance portfolio estimator (recall Section 3.2). Theorem 3.2.5 showed that ĝ is asymptotically
normal with zero mean and variance matrix V(δ). We evaluate the influence of δ on V(δ), and
investigate the robustness of the estimator.
If the return process is non-Gaussian, it depends on higher-order cumulant spectra. Hence, it is natural to introduce a class G1 of portfolio estimators which depend on higher-order sample cumulant spectra. Then, we derive the asymptotic distribution of g̃ ∈ G1, and compare it with that of ĝ. Numerical studies illuminate some interesting features of g̃ in comparison with ĝ.
When we construct a portfolio on some assets X, we often use the information of exogenous
variables Z. Section 6.3 introduces a class G2 of portfolio estimators which depend on the sample versions of E(X), cov(X, X), cov(X, Z) and cum(X, X, Z). Then we derive the asymptotic distribution of ǧ ∈ G2, and elucidate the influence of exogenous variables on it.

6.1 Robust Portfolio Estimation under Skew-Normal Return Processes


De Luca et al. (2008) introduced a class of skew-GARCH models in view of empirical research on
European stock markets. Hence, it is natural to introduce a class of time series models generated by
skew-normal innovations. Suppose that {X(t) = (X1 (t), · · · , Xm (t))′ ; t ∈ Z} is the return process of
m assets generated by

X(t) = Σ_{j=0}^{∞} A(j) U(t − j) + µ,   (6.1.1)

where {U(t)} ~ i.i.d. (0, K) and the A(j)'s are m × m matrices satisfying Assumption 3.2.1. Hence µ = E{X(t)}. Denoting Σ = Var{X(t)}, we define θ = (θ1, . . . , θ_{m+r})′ by

θ = ( µ′, vech(Σ)′ )′,
where r = m(m + 1)/2. As in Section 3.2, we assume that the optimal portfolio weight on
X(t) is written as g(θ) where g : θ → Rm−1 is a smooth function. From the partial realization
{X(1), . . . , X(n)}, a natural estimator of θ is

θ̂ = ( µ̂′, vech(Σ̂)′ )′,

where µ̂ = n^{−1} Σ_{t=1}^{n} X(t) and Σ̂ = n^{−1} Σ_{t=1}^{n} {X(t) − µ̂}{X(t) − µ̂}′. Then we use g(θ̂) as an estimator
of g(θ), which is called a generalized mean-variance portfolio estimator. Under Assumptions 3.2.1
and 3.2.4, Theorem 3.2.5 yields

√n { g(θ̂) − g(θ) } →^L N( 0, (∂g/∂θ′) Ω (∂g/∂θ′)′ ).   (6.1.2)
In what follows we introduce the multivariate skew-normal distribution. An m-dimensional ran-
dom vector U is said to have a multivariate skew-normal distribution, denoted by S Nm (µ, Γ, α), if
its probability density function (with respect to the Lebesgue measure) is given by


2 φm(u; µ, Γ) Φ( α′(u − µ) ),  u ∈ R^m,   (6.1.3)

where φm (u; µ, Γ) is the probability density of Nm (µ, Γ), Φ(·) is the distribution function of N(0, 1)
and α is an m-dimensional vector. Let

δ = (1 + α′ Γα)−1/2 Γα,

which regulates the skewness of the distribution. If δ = 0, then it reduces to the ordinary Nm(µ, Γ).
The moment generating function of S Nm (µ, Γ, α) is

M(s) = 2 ∫_{R^m} exp(s′u) φm(u; µ, Γ) Φ( α′(u − µ) ) du
     = 2 exp( s′µ + (s′Γs)/2 ) Φ(δ′s).   (6.1.4)

The cumulants of SNm(µ, Γ, α) can be computed from the moment generating function. Let κj denote the jth-order cumulant matrix (j = 1, . . . , 4), so that κ1 is m × 1, κ2 is m × m, κ3 is m² × m and κ4 is m² × m². Concretely,

κ1 = ∂/∂s log M(s)|_{s=0} = µ + √(2/π) δ,   (6.1.5)

κ2 = ∂²/∂s∂s′ log M(s)|_{s=0} = Γ − (2/π) δ ⊗ δ′,   (6.1.6)

κ3 = ∂³/∂s∂s′∂s log M(s)|_{s=0} = √(2/π) (4/π − 1) (Im ⊗ δ)(δ ⊗ δ′),   (6.1.7)

κ4 = ∂⁴/∂s∂s′∂s∂s′ log M(s)|_{s=0} = ( 8(π − 3)/π² ) δ ⊗ δ′ ⊗ δ ⊗ δ′.   (6.1.8)
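The first two cumulants (6.1.5)–(6.1.6) can be checked by Monte Carlo using the well-known convolution representation of the skew-normal law, U = µ + δ|Z0| + W with Z0 ~ N(0, 1) and W ~ Nm(0, Γ − δδ′) independent. The following Python sketch uses illustrative parameter values (m = 2, with Γ and δ chosen so that Γ − δδ′ is positive definite); it is not code from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the book): m = 2.
mu = np.array([0.0, 0.0])
Gamma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
delta = np.array([0.5, 0.2])   # must keep Gamma - delta delta' positive definite

# Convolution representation of SN_m(mu, Gamma, alpha):
#   U = mu + delta*|Z0| + W,  Z0 ~ N(0,1),  W ~ N_m(0, Gamma - delta delta').
n = 200_000
Z0 = np.abs(rng.standard_normal(n))
W = rng.multivariate_normal(np.zeros(2), Gamma - np.outer(delta, delta), size=n)
U = mu + np.outer(Z0, delta) + W

# Cumulants (6.1.5)-(6.1.6):
kappa1 = mu + np.sqrt(2 / np.pi) * delta                 # mean vector
kappa2 = Gamma - (2 / np.pi) * np.outer(delta, delta)    # covariance matrix

print(np.max(np.abs(U.mean(axis=0) - kappa1)))           # small Monte Carlo error
print(np.max(np.abs(np.cov(U.T, bias=True) - kappa2)))   # small Monte Carlo error
```

With 200,000 draws both discrepancies are of order 10^{-3}, consistent with the stated cumulants.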
Next we evaluate the derivative of the asymptotic variance of g(θ̂) with respect to the skew-normal perturbation δ. Suppose that the innovations U(t) in (6.1.1) are i.i.d. SNm(µ, Γ, α) with µ = −√(2/π) δ. Then E{U(t)} = 0 and Var{U(t)} = Γ − (2/π)δδ′. Write δ as δ = δd, where d = (d1, . . . , dm)′ with ||d|| = 1 and δ is a positive scalar. Here d represents the direction of the skewness, and δ represents its intensity. In this setting, regarding the matrix Ω as a function of δ, we write

Ω = Ω(δ) = [ Ω1(δ)   Ω3(δ)
             Ω3(δ)′  Ω2(δ) ].

Let

V(δ) = (∂g/∂θ′) Ω(δ) (∂g/∂θ′)′,    D = dd′,

and

A(λ) = Σ_{j=0}^{∞} A(j) e^{ijλ}.

The following two propositions are due to Taniguchi et al. (2012).


Proposition 6.1.1. Under Assumptions 3.2.1 and 3.2.4,

Ω̃(δ) ≡ ∂/∂δ Ω(δ) = [ Ω̃1(δ)   Ω̃3(δ)
                      Ω̃3(δ)′  Ω̃2(δ) ],

where

Ω̃1(δ) = −(4δ/π) A(0) D A(0)′,

Ω̃2(δ) = { −(4δ/π) Σ_{p,q,u,v=1}^{m} ( D_{pq} K_{uv} + D_{uv} K_{pq} )
    × ∫_{−π}^{π} { A_{a1 p}(λ) A_{a3 q}(λ) A_{a2 u}(λ) A_{a4 v}(λ) + A_{a1 p}(λ) A_{a4 q}(λ) A_{a2 u}(λ) A_{a3 v}(λ) } dλ
  + (1/2π)² Σ_{b1,b2,b3,b4=1}^{m} c̃^U_{b1 b2 b3 b4} ∫∫_{−π}^{π} A_{a1 b1}(λ1) A_{a2 b2}(−λ1) A_{a3 b3}(λ2) A_{a4 b4}(−λ2) dλ1 dλ2 ;
    a1, a2, a3, a4 = 1, . . . , m, a1 ≥ a2, a3 ≥ a4 },

Ω̃3(δ) = { (1/(2π)²) Σ_{b1,b2,b3=1}^{m} c̃^U_{b1 b2 b3} ∫∫_{−π}^{π} A_{a1 b1}(λ1 + λ2) A_{a2 b2}(−λ1) A_{a3 b3}(−λ2) dλ1 dλ2 ;
    a1, a2, a3 = 1, . . . , m, a2 ≥ a3 }.

Here, D_{p1 p2}, K_{p1 p2} and A_{p1 p2}(λ) are the (p1, p2)-components of D, K and A(λ), respectively, and the quantities c̃^U_{b1 b2 b3}, c̃^U_{b1 b2 b3 b4} can be obtained from the derivative matrices κ̃3 = ∂κ3/∂δ and κ̃4 = ∂κ4/∂δ; in particular,

κ̃3 = 3δ² √(2/π) (4/π − 1) (Im ⊗ d) D.
Differentiation of V(δ) yields

∂/∂δ V(δ) = (∂g/∂θ′)(δ) Ω̃(δ) (∂g/∂θ′)(δ)′
          + (∂²g/∂δ∂θ′)(δ) Ω(δ) (∂g/∂θ′)(δ)′ + (∂g/∂θ′)(δ) Ω(δ) (∂²g/∂δ∂θ′)(δ)′.   (6.1.9)

From Proposition 6.1.1 and (6.1.9) we have,


Proposition 6.1.2. Under Assumptions 3.2.1 and 3.2.4, if

(∂²g/∂δ∂θ′)(δ) |_{δ=0} = 0,   (6.1.10)

then ∂V(δ)/∂δ |_{δ=0} = 0, i.e., the asymptotic distribution of the portfolio estimator g(θ̂) is robust against small deviations from normality.
The classical mean-variance portfolio coefficient is given by

max_{θ, θ0}  θ′µ + θ0 R0 − β θ′Σθ   subject to   Σ_{j=0}^{m} θj = 1,   (6.1.11)

where R0 is the risk-free interest rate and β is a positive constant representing risk aversion. Then
the solution of (6.1.11) is given by

g_MV(θ) = (1/2β) Σ^{−1} (µ − R0 e),
θ0 = 1 − e′ g_MV(θ),   (6.1.12)

where e = (1, . . . , 1)′ (an m-vector of ones). It is not difficult to show (see Problem 6.3) that

∂²g_MV(θ)/∂δ∂θj = O(δ),  j = 1, . . . , r,   (6.1.13)

which, together with Proposition 6.1.2, implies that the classical mean-variance portfolio estimator g_MV(θ̂) is robust against a small perturbation of the skewness parameter δ.
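In practice, (6.1.12) is evaluated at the estimate θ̂ = (µ̂′, vech(Σ̂)′)′. A minimal numerical sketch of the closed form, with illustrative inputs (three assets; numbers are not from the book), is:

```python
import numpy as np

# Illustrative inputs: 3 assets.
mu = np.array([0.05, 0.07, 0.06])              # mean returns
Sigma = np.array([[0.040, 0.006, 0.004],
                  [0.006, 0.090, 0.010],
                  [0.004, 0.010, 0.060]])      # return covariance
R0, beta = 0.02, 2.0                           # risk-free rate, risk aversion
e = np.ones(3)

# (6.1.12): g_MV = (1/(2*beta)) * Sigma^{-1} (mu - R0*e),  theta0 = 1 - e' g_MV
g_mv = np.linalg.solve(Sigma, mu - R0 * e) / (2 * beta)
theta0 = 1.0 - e @ g_mv

print(g_mv, theta0)
```

The first-order condition µ − R0 e = 2β Σ g_MV holds exactly, and the weights (including the risk-free weight θ0) sum to one by construction.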
Further, Taniguchi et al. (2012) evaluated

∂²/∂δ² V(δ) |_{δ=0},   (6.1.14)

which measures a sort of second-order sensitivity with respect to the asymmetry parameter δ. They observed that if A(λ)^{−1} or A(λ) has near unit roots, (6.1.14) shows high sensitivity with respect to δ.
In Theorem 8.3.6, we will establish the LAN for vector-valued non-Gaussian linear processes.
If we assume that the innovation distribution is SNm with asymmetry δ, then the central sequence
∆n (h; θ) in (8.3.14) depends on δ. Denoting ∆n (h; θ) by ∆n (δ), we expand it as

∆n (δ) = ∆n (0) + (A) + higher-order terms in δ. (6.1.15)

It is seen that E{(A)} = 0, and (A) is of the form δ ′ (·)δ. Because the central sequence is a key quantity
which describes the optimality of various statistical procedures, Taniguchi et al. (2012) introduced
an influence measure criterion (IMC) of δ on the central sequence by IMC=Var{(A)}. They observed
that, if A(λ)−1 or A(λ) has near unit roots, IMC becomes large.
6.2 Portfolio Estimators Depending on Higher-Order Cumulant Spectra
In the previous section we introduced the generalized mean variance portfolio g(θ) for a non-
Gaussian linear process (6.1.1). Throughout this section we assume that the coefficient matrices
{A(j)} satisfy Assumption 3.2.1 and {U(t)} has the kth-order cumulant for k = 2, 3, . . .. Then the process {X(t)} becomes a kth-order stationary one (e.g., Brillinger (2001b, p. 34)). For 2 ≤ l ≤ k, denote the lth-order joint cumulant of {X_{a1}(0), X_{a2}(t1), . . . , X_{al}(t_{l−1})} by c^{(l)}_{a1···al}(t1, . . . , t_{l−1}). The lth-order cumulant spectral density is defined as

f^{(l)}_{a1···al}(λ1, . . . , λ_{l−1}) = (2π)^{−l+1} Σ_{t1,...,t_{l−1}=−∞}^{∞} exp{ −i Σ_{j=1}^{l−1} λj tj } c^{(l)}_{a1···al}(t1, . . . , t_{l−1}).   (6.2.1)
The lth-order cumulant spectral distribution is defined as

F^{(l)}_{a1···al}(A) = ∫_A f^{(l)}_{a1···al}(λ1, . . . , λ_{l−1}) dλ1 · · · dλ_{l−1},   (6.2.2)

where A is a Borel set in S^{l−1} = [−π, π]^{l−1}.


Let us consider the problem of constructing a portfolio for the return process (6.1.1). Suppose
that U(·) is a utility function and α = (α1 , . . . , αm )′ is a portfolio coefficient vector satisfying
α1 + · · · + αm = 1. We seek the optimal portfolio α which maximizes E[U(α′X(t))], i.e.,

αopt ≡ arg max_α E[ U( α′X(t) ) ].   (6.2.3)

Expanding U(·) in a Taylor expansion, we obtain the following approximation:


E[U(M(t))] ≈ U(c_{1M}(t)) + (1/2!) D²U(c_{1M}(t)) c_{2M}(t) + (1/3!) D³U(c_{1M}(t)) c_{3M}(t)
           + · · · + (1/k!) D^k U(c_{1M}(t)) c_{kM}(t),   (6.2.4)

where M(t) = α′X(t) and c_{lM}(t) is the lth-order cumulant of M(t). If we use the right-hand-side approximation of (6.2.4), then it is seen that αopt depends on the mean µ and the higher-order cumulant spectral densities f^{(l)} = { f^{(l)}_{a1···al} ; 1 ≤ aj ≤ m }, 2 ≤ l ≤ k. For 2 ≤ l ≤ k, write

F^{(l)} = { F^{(l)}_{a1···al} ; 1 ≤ aj ≤ m },   (6.2.5)

θl( F^{(l)} ) = { ∫···∫_{−π}^{π} hl(λ1, . . . , λ_{l−1}) dF^{(l)}_{a1···al}(λ1, . . . , λ_{l−1}) ; 1 ≤ aj ≤ m }
            = { θl( F^{(l)}_{a1···al} ) ; 1 ≤ aj ≤ m } (say),   (6.2.6)

where the hl(·)'s are real-valued functions whose total variations are bounded. Therefore we can describe the optimal portfolio αopt as a function of µ and the F^{(l)}, i.e.,

g : ( µ, θ2(F^{(2)}), . . . , θk(F^{(k)}) ) → R^m.   (6.2.7)
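The cumulant expansion (6.2.4) can be checked in a case where the exact expected utility is available: for the exponential utility U(x) = −e^{−x} and a Gaussian portfolio return M ~ N(c1, c2), we have E[U(M)] = −exp(−c1 + c2/2), while all cumulants of order three and higher vanish. The following sketch (illustrative numbers, not from the book) compares the truncated expansion with the exact value:

```python
import numpy as np

c1, c2, c3 = 0.05, 0.01, 0.0    # cumulants of M(t); Gaussian return => c3 = 0

U   = lambda x: -np.exp(-x)     # exponential utility
D2U = lambda x: -np.exp(-x)     # second derivative of U
D3U = lambda x:  np.exp(-x)     # third derivative of U

# Truncation of (6.2.4) at k = 3:
approx = U(c1) + D2U(c1) * c2 / 2 + D3U(c1) * c3 / 6
# Exact expected utility for M ~ N(c1, c2):
exact = -np.exp(-c1 + c2 / 2)

print(abs(exact - approx))      # ~1e-5
```

For small c2 the truncation error is tiny, which is what makes cumulant-based objectives such as (6.2.3) tractable.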

To estimate g(·), we construct estimators of the θl(F^{(l)}). Suppose that an observed stretch {X(1), . . . , X(n)} from (6.1.1) is available. For this we introduce the finite Fourier transforms

d^n_a(λ) ≡ Σ_{t=0}^{n−1} X_a(t) exp{−iλt}.   (6.2.8)
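At the Fourier frequencies λr = 2πr/n, the finite Fourier transform (6.2.8) coincides with the standard DFT, and Parseval's identity gives Σr |d(λr)|² = n Σt Xa(t)². A short numerical check (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128
X = rng.standard_normal(n)          # one component X_a(0), ..., X_a(n-1)

# (6.2.8): d(λ) = Σ_{t=0}^{n-1} X_a(t) exp{-iλt}, at λ_r = 2πr/n
t = np.arange(n)
d = np.array([np.sum(X * np.exp(-1j * (2 * np.pi * r / n) * t)) for r in range(n)])

print(np.max(np.abs(d - np.fft.fft(X))))                  # ≈ 0: same convention as the FFT
print(np.sum(np.abs(d) ** 2) / n ** 2 - np.mean(X ** 2))  # ≈ 0: Parseval's identity
```

In practice one computes d^n_a via the FFT; the Parseval relation is what makes frequency-domain functionals such as (6.2.6) match their time-domain counterparts.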
Define F^{n,(l)}_{a1···al} as

F^{n,(l)}_{a1···al}(λ1, . . . , λ_{l−1}) = (2π/n)^{l−1} Σ_{2πrj/n ≤ λj ; j=1,...,l−1} I^n_{a1···al}( 2πr1/n, . . . , 2πrl/n ),   (6.2.9)

where

I^n_{a1···al}( 2πr1/n, . . . , 2πrl/n ) = (2π)^{−l+1} n^{−1} η( Σ_{j=1}^{l} λj ) Π_{j=1}^{l} d^n_{aj}( 2πrj/n )   (6.2.10)

(the lth-order periodogram). Here η(·) is the Kronecker comb:

η(λ) = { 1,  λ ≡ 0 (mod 2π),
         0,  λ ≢ 0 (mod 2π).   (6.2.11)
To estimate θl( F^{(l)}_{a1···al} ) we introduce

θl( F^{n,(l)}_{a1···al} ) = ∫···∫_{−π}^{π} hl(λ1, . . . , λ_{l−1}) dF^{n,(l)}_{a1···al}(λ1, . . . , λ_{l−1})
  = (2π/n)^{l−1} Σ* hl( 2πr1/n, . . . , 2πr_{l−1}/n ) I^n_{a1···al}( 2πr1/n, . . . , 2πrl/n ),   (6.2.12)

where Σ* extends over (r1, . . . , rl) satisfying

−n/2 + 1 ≤ r1 ≤ n/2, . . . , −n/2 + 1 ≤ rl ≤ n/2

and r1 + · · · + rl ≡ 0 (mod n). The mean vector µ = (µ1, . . . , µm)′ is estimated by µ̂ ≡ (µ̂1, . . . , µ̂m)′ with µ̂j = n^{−1} d^n_j(0), j = 1, . . . , m.
 
To evaluate the expectation of θl( F^{n,(l)}_{a1···al} ), we prepare the following lemmas.
Lemma 6.2.1. (Theorem 2.3.2 of Brillinger (2001b, p. 21)) For a two-way array of random variables Yij, j = 1, 2, . . . , Ji, i = 1, 2, . . . , I, and

Zi = Π_{j=1}^{Ji} Yij,

the joint cumulant of {Z1, Z2, . . . , ZI} is given by

Σ_ν [ cum(Yij ; ij ∈ ν1) · · · cum(Yij ; ij ∈ νp) ],   (6.2.13)

where the summation is over all indecomposable partitions ν = ν1 ∪ · · · ∪ νp of the two-way array {(i, j) ; j = 1, . . . , Ji, i = 1, . . . , I}.
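For zero-mean jointly Gaussian Yij, the only nonvanishing cumulants in (6.2.13) are covariances, and the indecomposable partitions of the 2 × 2 array {(1,1), (1,2), (2,1), (2,2)} are exactly the two pairings that hook row 1 to row 2; the pairing within rows is decomposable and is excluded. Hence cum(Y11 Y12, Y21 Y22) = cov(Y11, Y21) cov(Y12, Y22) + cov(Y11, Y22) cov(Y12, Y21), which the following Monte Carlo sketch (with an arbitrary covariance S = AA′) verifies:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed covariance S = A A' (positive semidefinite by construction).
A = np.array([[1.0, 0.2, 0.0, 0.1],
              [0.3, 1.0, 0.2, 0.0],
              [0.0, 0.1, 1.0, 0.3],
              [0.2, 0.0, 0.1, 1.0]])
S = A @ A.T

n = 400_000
Y = rng.multivariate_normal(np.zeros(4), S, size=n)

# Z1 = Y1*Y2 (row 1), Z2 = Y3*Y4 (row 2); only the two cross-row pairings survive.
Z1, Z2 = Y[:, 0] * Y[:, 1], Y[:, 2] * Y[:, 3]
mc = np.cov(Z1, Z2)[0, 1]
theory = S[0, 2] * S[1, 3] + S[0, 3] * S[1, 2]
print(mc, theory)
```

This is the Gaussian special case (the Isserlis formula); for non-Gaussian data the higher-order cumulants in (6.2.13) contribute as well.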
Lemma 6.2.2. (Theorem 4.3.1 of Brillinger (2001b, p. 92)) Under Assumption 3.2.1,

cum{ d^n_{a1}(λ1), d^n_{a2}(λ2), . . . , d^n_{al}(λl) } = (2π)^{l−1} Δn( Σ_{j=1}^{l} λj ) f^{(l)}_{a1···al}(λ1, . . . , λ_{l−1}) + O(1),

where

λj = 2πrj/n,  rj ∈ Z,

Δn(λ) = Σ_{t=0}^{n−1} exp(−iλt) = { n,  λ ≡ 0 (mod 2π),
                                    0,  λ ≢ 0 (mod 2π),

and the error term O(1) is uniform in λ1, . . . , λl.


From Lemmas 6.2.1 and 6.2.2, it follows that

E[ θl( F^{n,(l)}_{a1···al} ) ] = θl( F^{(l)}_{a1···al} )
  + Σ_ν ∫_{S^l} h(λ1, . . . , λl) Π_{j=1}^{p} f_{ {αj ; αj ∈ νj} }(λ_{ji} ; ji ∈ νj)
    × Π_{j=1}^{p} η( Σ_{i=1}^{|νj|} λ_{ji} ) Π_{k=1}^{l} dλk + O(n^{−1}),   (6.2.14)

where Σ_ν is over all ν = (ν1, ν2, . . . , νp), p > 1 and |νj| > 1, and |νj| denotes the number of elements in νj. Then we can see that

E[ θl( F^{n,(l)}_{a1···al} ) ] = θl( F^{(l)}_{a1···al} ) + O(1) + O(n^{−1}),   (6.2.15)

which implies that θl( F^{n,(l)}_{a1···al} ) is not asymptotically unbiased. Keenan (1987) introduced a modification of θl( F^{n,(l)}_{a1···al} ) which is asymptotically unbiased, as follows.
tion of θl Fa1 ···al to be asymptotically unbiased as follows.
Let Ω^{(l)} be the collection of (λ1, . . . , λl), λj = 2πrj/n, in [−π, π]^l such that Σ_{j=1}^{l} λj ≡ 0 (mod 2π), but such that no proper subset {ij ; j = 1, . . . , s} of {1, 2, . . . , l} satisfies Σ_{j=1}^{s} λ_{ij} ≡ 0 (mod 2π). Let θl( F̂^{n,(l)}_{a1···al} ) be defined in the same way as θl( F^{n,(l)}_{a1···al} ), except that the sum is now over the (2πr1/n, . . . , 2πrl/n) contained in Ω^{(l)}.
The following three propositions are due to Hamada and Taniguchi (2014).
Proposition 6.2.3. Under Assumption 3.2.1, it holds that

E[ θl( F̂^{n,(l)}_{a1···al} ) ] = θl( F^{(l)}_{a1···al} ) + O(n^{−1}),   (6.2.16)

i.e., θl( F̂^{n,(l)}_{a1···al} ) is an asymptotically unbiased estimator of θl( F^{(l)}_{a1···al} ).
Write

θ = vec( µ, θ2(F^{(2)}), . . . , θl(F^{(l)}) ),
θ̂ = vec( µ̂, θ2(F̂n^{(2)}), . . . , θl(F̂n^{(l)}) ),   (6.2.17)

where θl(F̂n^{(l)}) = { θl( F̂^{n,(l)}_{a1···al} ) ; 1 ≤ aj ≤ m }. By use of Lemmas 6.2.1 and 6.2.2, it is not difficult to show the following.
Proposition 6.2.4. Under Assumption 3.2.1, it holds that

√n (θ̂ − θ) →^d N(0, Σ),  (n → ∞),   (6.2.18)

where the elements of Σ corresponding to the variance–covariance matrix between µ̂ and µ̂, µ̂ and θp(F̂n^{(p)}), θp(F̂n^{(p)}) and θq(F̂n^{(q)}), p, q ≤ l, are, respectively, given by

n Cov( µ̂a, µ̂b ) = 2π f^{(2)}_{ab}(0),

n Cov( µ̂a, θq( F̂^{n,(q)}_{b1···bq} ) ) = ∫···∫_{Ω^{(q)}} hq(λ1, . . . , λ_{q−1}) f^{(q+1)}_{a b1···bq}(0, λ1, . . . , λ_{q−1}) dλ1 · · · dλ_{q−1},

n Cov( θp( F̂^{n,(p)}_{a1···ap} ), θq( F̂^{n,(q)}_{b1···bq} ) ) = Σ_Δ ∫···∫_{M^{p+q−r−2}} hp(λ1, . . . , λp) hq(λ_{p+1}, . . . , λ_{p+q})
    × η( Σ_{j=1}^{|Δ1|} λ_{1j} ) f_{ {a_{1j} ; a_{1j} ∈ Δ1} }(λ_{1j} ; λ_{1j} ∈ Δ1) · · ·
    × η( Σ_{j=1}^{|Δr|} λ_{rj} ) f_{ {a_{rj} ; a_{rj} ∈ Δr} }(λ_{rj} ; λ_{rj} ∈ Δr) dλ1 · · · dλ_{p+q},

where M^{p+q−r−2} = [−π, π]^{p+q−r−2}, and the summation Σ_Δ is extended over all indecomposable partitions Δ = (Δ1, . . . , Δr) of the two-way array {(i, ji) ; i = 1, 2, j1 = 1, 2, . . . , p, j2 = 1, 2, . . . , q}.
Proposition 6.2.4, together with the δ-method (e.g., Proposition 6.4.3 of Brockwell and Davis (2006)), yields the following.
Proposition 6.2.5. Under Assumptions 3.2.1 and 3.2.4,

√n [ g(θ̂) − g(θ) ] →^d N( 0, (∂g/∂θ′) Σ (∂g/∂θ′)′ ),   (6.2.19)

as n → ∞.
In what follows we provide a numerical example.
Example 6.2.6. Let the return process {X(t) = (X1(t), X2(t))′} be generated by

X1(t) = µ1,
X2(t) − µ2 = ρ( X2(t − 1) − µ2 ) + u(t),   (6.2.20)

where the u(t)'s are i.i.d. t(ν) (Student's t-distribution with ν degrees of freedom). The portfolio is defined as

M(t) = (1 − α) X1(t) + α X2(t),

whose cumulants are c_{1M} = (1 − α)µ1 + αµ2, c_{2M} = α² c_{2X2} = α² σ2² and c_{3M} = α³ c_{3X2}, where σ2² = Var{X2(t)} and c_{3X2} = ∫∫_{−π}^{π} f^{(3)}_{222}(λ1, λ2) dλ1 dλ2. We are interested in the optimal portfolio αopt given by the following cubic utility:
αopt = arg max_α E[ M(t) − γα² { (X2(t) − µ2)² − (1/√n) h3( X2(t) − µ2 ) } ],   (6.2.21)

where γ > 0 is a risk aversion coefficient and h3(·) is the Hermite polynomial of order 3. We approximate the expectation in (6.2.21) by an Edgeworth expansion as

(6.2.21) ≈ (1 − α)µ1 + αµ2 − γα²σ2² + ∫_{−∞}^{∞} { −γα²σ2² h2(y) + γα² (1/√n) h3(y) } ( 1 + (β/6) h3(y) ) φ(y) dy
        ≈ (1 − α)µ1 + αµ2 − γα²σ2² + γα² β/√n,   (6.2.22)

which implies that αopt for sufficiently large n is

αopt ≈ (µ2 − µ1)/(2γσ2²) + (β/√n) (µ2 − µ1)/(2γσ2⁴),   (6.2.23)

while we know that the classical mean-variance optimal portfolio is given by

αcopt = (µ2 − µ1)/(2γσ2²).   (6.2.24)
Returning to (6.2.20), we set µ1 = 0.3, µ2 = 0.4, γ = 0.5 and ν = 10. For each ρ = −0.9(0.12)0.9, we generated {X(1), . . . , X(100)} from (6.2.20), and estimated (µ, σ2², β) with µ = µ2 − µ1 by

µ̂ = (1/100) Σ_{t=1}^{100} { X2(t) − X1(t) },

Figure 6.1: Values of MAE and MAEc

σ̂2² = (1/100) Σ_{t=1}^{100} { X2(t) − µ̂2 }²,   (6.2.25)

β̂ = (1/100) Σ_{t=1}^{100} { X2(t) }³ − 3µ̂2σ̂2² − µ̂2³,

where µ̂2 = 100^{−1} Σ_{t=1}^{100} X2(t). The estimated portfolios are

α̂opt = µ̂/σ̂2² + β̂µ̂/(10 σ̂2⁴),   (6.2.26)

α̂copt = µ̂/σ̂2².   (6.2.27)
Then we repeated the above procedure 10,000 times in the simulation. We write α̂opt,m and α̂copt,m for α̂opt and α̂copt in the mth replication, respectively. Their mean absolute errors (MAE) are

MAE = (1/10,000) Σ_{m=1}^{10,000} | α̂opt,m − αopt |,   (6.2.28)

and

MAEc = (1/10,000) Σ_{m=1}^{10,000} | α̂copt,m − αcopt |.   (6.2.29)

We plotted MAEc (solid line) and MAE (dotted line) in Figure 6.1. The figure shows that α̂opt is better than α̂copt except for values of ρ near the unit root (ρ = 1).
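A reduced-scale version of this experiment can be scripted directly. The sketch below departs from the book's settings for speed (n = 200, 300 replications, a single ρ = 0.3, and only the classical estimator (6.2.27)); it checks that α̂copt is centred near the population value αcopt = (µ2 − µ1)/(2γσ2²), where σ2² = (ν/(ν − 2))/(1 − ρ²) for t(ν) innovations:

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, gamma, nu, rho, n, reps = 0.3, 0.4, 0.5, 10, 0.3, 200, 300

a_c = np.empty(reps)
for r in range(reps):
    u = rng.standard_t(nu, size=n + 50)
    x2 = np.empty(n + 50)
    x2[0] = mu2
    for s in range(1, n + 50):
        x2[s] = mu2 + rho * (x2[s - 1] - mu2) + u[s]   # AR(1) of (6.2.20)
    x2 = x2[50:]                        # discard burn-in
    mu_hat = np.mean(x2) - mu1          # X1(t) = mu1 is constant
    s2_hat = np.mean((x2 - np.mean(x2)) ** 2)
    a_c[r] = mu_hat / s2_hat            # (6.2.27); note 2*gamma = 1 here

sigma2 = (nu / (nu - 2)) / (1 - rho ** 2)        # population Var{X2(t)}
alpha_c = (mu2 - mu1) / (2 * gamma * sigma2)     # (6.2.24)
print(alpha_c, a_c.mean())
```

Since the t-distribution is symmetric, the true third cumulant is zero and the correction term in (6.2.26) only matters through estimation noise, which is why the comparison in Figure 6.1 is delicate near ρ = 1.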

6.3 Portfolio Estimation under the Utility Function Depending on Exogenous Variables
In the estimation of portfolios, it is natural to assume that the utility function depends on exogenous
variables. From this point of view, in this section, we develop estimation under this setting, where
the optimal portfolio function depends on moments of the return process, and cumulants between
the return process and exogenous variables. Introducing a sample version of it as a portfolio estima-
tor, we derive the asymptotic distribution. Then, an influence of exogenous variables on the return
process is illuminated by a few examples.
Suppose the existence of a finite number of assets indexed by i (i = 1, . . . , m). Let X(t) = (X1(t), . . . , Xm(t))′ denote the random returns on the m assets at time t, and let Z(t) = (Z1(t), . . . , Zq(t))′ denote the exogenous variables influencing the utility function at time t. We write

Y(t) ≡ ( X(t)′, Z(t)′ )′ = ( Y1(t), . . . , Y_{m+q}(t) )′ (say).

Since it is empirically observed that {X(t)} is non-Gaussian and dependent, we will assume that it is a non-Gaussian vector-valued linear process with third-order cumulants. Also, suppose that there exists a risk-free asset whose return is denoted by Y0(t). Let α0 and α = (α1, . . . , αm, 0, . . . , 0)′ (with q trailing zeros)
be the portfolio weights at time t, and the portfolio is M(t) = Y (t)′ α + Y0 (t)α0 , whose higher-order
cumulants are written as

c_{1M}(t) = cum{M(t)} = c^{a1} α_{a1} + Y0(t) α0,
c_{2M}(t) = cum{M(t), M(t)} = c^{a2 a3} α_{a2} α_{a3},   (6.3.1)
c_{3M}(t) = cum{M(t), M(t), M(t)} = c^{a4 a5 a6} α_{a4} α_{a5} α_{a6},

where c^{a1} = E{Y_{a1}(t)}, c^{a2 a3} = cum{Y_{a2}(t), Y_{a3}(t)} and c^{a4 a5 a6} = cum{Y_{a4}(t), Y_{a5}(t), Y_{a6}(t)}. Here we use Einstein's summation convention. For a utility function U(·), the expected utility can be approximated as

E[U(M(t))] ≈ U(c_{1M}(t)) + (1/2!) D²U(c_{1M}(t)) c_{2M}(t) + (1/3!) D³U(c_{1M}(t)) c_{3M}(t),   (6.3.2)
by Taylor expansion of order 3. The approximate optimal portfolio may be described as

max_{α0, α} [ the right-hand side of (6.3.2) ]   subject to   α0 + Σ_{j=1}^{m} αj = 1.   (6.3.3)
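A small sketch of solving a problem of the form (6.3.3) numerically, with the cumulant contractions of (6.3.1) written as Einstein sums via np.einsum. The inputs are hypothetical (two risky assets, no risk-free asset, and a quadratic truncation of (6.3.2)); the third-order term c^{a4a5a6} α_{a4} α_{a5} α_{a6} would be added analogously with einsum('ijk,i,j,k->', ...):

```python
import numpy as np

# Hypothetical inputs (not from the book): two risky assets.
mu = np.array([0.05, 0.08])
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
gamma = 2.0

def objective(a):
    c1 = np.einsum('i,i->', mu, a)            # c^{a1} alpha_{a1}
    c2 = np.einsum('ij,i,j->', Sigma, a, a)   # c^{a2 a3} alpha_{a2} alpha_{a3}
    return c1 - gamma / 2 * c2                # quadratic truncation of (6.3.2)

# Enforce the budget constraint alpha1 + alpha2 = 1 by a grid over alpha1.
grid = np.linspace(0.0, 1.0, 10_001)
vals = [objective(np.array([a, 1.0 - a])) for a in grid]
a_star = grid[int(np.argmax(vals))]

# Closed form for the two-asset quadratic problem, for comparison:
a_exact = ((mu[0] - mu[1]) / gamma + Sigma[1, 1] - Sigma[0, 1]) \
          / (Sigma[0, 0] - 2 * Sigma[0, 1] + Sigma[1, 1])
print(a_star, a_exact)   # both ≈ 0.591
```

The einsum formulation scales directly to m assets and to the skewness term, which is the point of writing (6.3.1) in index notation.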

If the utility function U(·) depends on the exogenous variable Z, then the solution α = αopt of (6.3.3) depends on Z and the return of assets X, i.e., αopt = αopt(X, Z). However, the form of αopt(X, Z) is too general to handle. Hence, we restrict ourselves to the case when αopt is an m-dimensional smooth function:

g(θ) : g[ E(X), cov(X, X), cov(X, Z), cum(X, X, Z) ] → R^m.   (6.3.4)

We estimate g(θ) by its sample version:

ĝ ≡ g(θ̂) = g[ Ê(X), ĉov(X, X), ĉov(X, Z), ĉum(X, X, Z) ],   (6.3.5)

where (·)̂ denotes the sample version of (·).
Let {Y(t) = (Y1(t), . . . , Y_{m+q}(t))′} be an (m + q)-dimensional linear process defined by

Y(t) = Σ_{j=0}^{∞} G(j) e(t − j) + µ,   (6.3.6)

where {e(t)} is an (m + q)-dimensional stationary process such that E{e(t)} = 0 and E{e(s)e(t)′} = δ(s, t)K, with K a nonsingular (m + q) × (m + q) matrix, the G(j)'s are (m + q) × (m + q) matrices and µ = (µ1, . . . , µ_{m+q})′. Assuming that {e(t)} has cumulants of all orders, let Q^e_{a1···aj}(t1, . . . , t_{j−1}) be the joint jth-order cumulant of e_{a1}(t), e_{a2}(t + t1), . . . , e_{aj}(t + t_{j−1}). In what follows we impose
Assumption 6.3.1. (i) For each j = 1, 2, 3, . . . ,

Σ_{t1,...,t_{j−1}=−∞}^{∞} Σ_{a1,...,aj=1}^{m+q} | Q^e_{a1···aj}(t1, . . . , t_{j−1}) | < ∞;   (6.3.7)

(ii)

Σ_{j=0}^{∞} ||G(j)|| < ∞.   (6.3.8)

Letting Q^Y_{a1···aj}(t1, . . . , t_{j−1}) be the joint jth-order cumulant of Y_{a1}(t), Y_{a2}(t + t1), . . . , Y_{aj}(t + t_{j−1}), we define the jth-order cumulant spectral density of {Y(t)} by

f_{a1···aj}(λ1, . . . , λ_{j−1}) = (1/2π)^{j−1} Σ_{t1,...,t_{j−1}=−∞}^{∞} exp{ −i(λ1t1 + · · · + λ_{j−1}t_{j−1}) } Q^Y_{a1···aj}(t1, . . . , t_{j−1}).   (6.3.9)

For an observed stretch {Y(1), . . . , Y(n)}, we introduce

ĉ_{a1} = (1/n) Σ_{s=1}^{n} Y_{a1}(s),
ĉ_{a2 a3} = (1/n) Σ_{s=1}^{n} ( Y_{a2}(s) − ĉ_{a2} )( Y_{a3}(s) − ĉ_{a3} ),
ĉ_{a4 a5} = (1/n) Σ_{s=1}^{n} ( Y_{a4}(s) − ĉ_{a4} )( Y_{a5}(s) − ĉ_{a5} ),
ĉ_{a6 a7 a8} = (1/n) Σ_{s=1}^{n} ( Y_{a6}(s) − ĉ_{a6} )( Y_{a7}(s) − ĉ_{a7} )( Y_{a8}(s) − ĉ_{a8} ),

where 1 ≤ a1, a2, a3, a4, a6, a7 ≤ m and m + 1 ≤ a5, a8 ≤ m + q. Write the quantities that appeared in (6.3.4) and (6.3.5) by

θ̂ = ( ĉ_{a1}, ĉ_{a2 a3}, ĉ_{a4 a5}, ĉ_{a6 a7 a8} ),
θ = ( c_{a1}, c_{a2 a3}, c_{a4 a5}, c_{a6 a7 a8} ),   (6.3.10)

where c_{a1···aj} = Q^Y_{a1···aj}(0, . . . , 0). Then dim θ = dim θ̂ = a + b + c + d, where a = m, b = m(m + 1)/2, c = mq and d = m(m + 1)q/2. As in Proposition 6.2.5, it is not difficult to show the following Proposition 6.3.2, whose proof, together with those of Propositions 6.3.3 and 6.3.5, is given in Hamada et al. (2012).
Proposition 6.3.2. Under Assumptions 3.2.4 and 6.3.1,

√n ( g(θ̂) − g(θ) ) →^d N( 0, (Dg) Ω (Dg)′ ),  (n → ∞),   (6.3.11)

where

Ω = [ Ω11 Ω12 Ω13 Ω14
      Ω21 Ω22 Ω23 Ω24
      Ω31 Ω32 Ω33 Ω34
      Ω41 Ω42 Ω43 Ω44 ],   (6.3.12)
and the typical element of Ωij corresponding to the covariance between ĉΔ and ĉ∇ is denoted by V{(Δ)(∇)}, with

V{(a1)(a1′)} = Ω11 = 2π f_{a1 a1′}(0),

V{(a2 a3)(a1′)} = Ω12 = 2π ∫_{−π}^{π} f_{a2 a3 a1′}(λ, 0) dλ (= Ω21),

V{(a4 a5)(a1′)} = Ω13 = 2π ∫_{−π}^{π} f_{a4 a5 a1′}(λ, 0) dλ (= Ω31),

V{(a6 a7 a8)(a1′)} = Ω14 = 2π ∫∫_{−π}^{π} f_{a6 a7 a8 a1′}(λ1, λ2, 0) dλ1 dλ2 (= Ω41),

V{(a2 a3)(a2′ a3′)} = Ω22 = 2π ∫∫_{−π}^{π} f_{a2 a3 a2′ a3′}(λ1, λ2, −λ2) dλ1 dλ2
  + 2π ∫_{−π}^{π} { f_{a2 a2′}(λ) f_{a3 a3′}(−λ) + f_{a2 a3′}(λ) f_{a3 a2′}(−λ) } dλ,

V{(a2 a3)(a4′ a5′)} = Ω23 = 2π ∫∫_{−π}^{π} f_{a2 a3 a4′ a5′}(λ1, λ2, −λ2) dλ1 dλ2
  + 2π ∫_{−π}^{π} { f_{a2 a4′}(λ) f_{a3 a5′}(−λ) + f_{a2 a5′}(λ) f_{a3 a4′}(−λ) } dλ (= Ω32),

V{(a2 a3)(a6′ a7′ a8′)} = Ω24 = 2π ∫∫∫_{−π}^{π} f_{a2 a3 a6′ a7′ a8′}(λ1, λ2, λ3, −λ3) dλ1 dλ2 dλ3 (= Ω42),

V{(a4 a5)(a4′ a5′)} = Ω33 = 2π ∫∫_{−π}^{π} f_{a4 a5 a4′ a5′}(λ1, λ2, −λ2) dλ1 dλ2
  + 2π ∫_{−π}^{π} { f_{a4 a4′}(λ) f_{a5 a5′}(−λ) + f_{a4 a5′}(λ) f_{a5 a4′}(−λ) } dλ,

V{(a4 a5)(a6′ a7′ a8′)} = Ω34 = 2π ∫∫∫_{−π}^{π} f_{a4 a5 a6′ a7′ a8′}(λ1, λ2, λ3, −λ3) dλ1 dλ2 dλ3 (= Ω43),

V{(a6 a7 a8)(a6′ a7′ a8′)} = Ω44 = 2π ∫∫∫∫_{−π}^{π} f_{a6 a7 a8 a6′ a7′ a8′}(λ1, λ2, λ3, λ4, −λ3 − λ4) dλ1 · · · dλ4
  + 2π ∫∫∫_{−π}^{π} Σ_{ν1} f_{a_{i1} a_{i2} a_{i3} a_{i4}}(λ1, λ2, λ3) f_{a_{i5} a_{i6}}(−λ2 − λ3) dλ1 dλ2 dλ3
  + 2π ∫∫∫_{−π}^{π} Σ_{ν2} f_{a_{i1} a_{i2} a_{i3}}(λ1, λ2) f_{a_{i4} a_{i5} a_{i6}}(λ3, −λ2 − λ3) dλ1 dλ2 dλ3
  + 2π ∫∫_{−π}^{π} Σ_{ν3} f_{a_{i1} a_{i2}}(λ1) f_{a_{i3} a_{i4}}(λ2) f_{a_{i5} a_{i6}}(−λ1 − λ2) dλ1 dλ2.

Next we investigate the influence of the exogenous variables Z(t) on the asymptotics of the portfolio estimator g(θ̂). Assume that the exogenous variables have "shot noise" in the frequency domain, i.e.,

Z_{aj}(λ) = δ( λ_{aj} − λ ),   (6.3.13)

where δ(·) is the Dirac delta function with period 2π and λ_{aj} ≠ 0; hence Z_{aj}(λ) has a single peak at λ ≡ λ_{aj} (mod 2π).
Proposition 6.3.3. For (6.3.13), denote Ωij and V{(Δ)(∇)} in Proposition 6.3.2 by Ω′ij and V′{(Δ)(∇)}, respectively. That is, Ω′ij and V′{(Δ)(∇)} represent the asymptotic variances when the exogenous variables are shot noise. Then

V′{(a4 a5)(a1′)} = Ω′13 = 0 (= Ω′31),
V′{(a6 a7 a8)(a1′)} = Ω′14 = 2π f_{a6 a7 a1′}( λ_{a8}, 0 ) (= Ω′41),
V′{(a2 a3)(a4′ a5′)} = Ω′23 = 2π f_{a2 a3 a4′}( λ_{a5′}, 0 ) + 2π f_{a2 a4′}(λ_{a5}) f_{a3 a5′}(−λ_{a5′})
    + 2π f_{a2 a5′}(−λ_{a5}) f_{a3 a4′}(λ_{a5}) (= Ω′32),
V′{(a2 a3)(a6′ a7′ a8′)} = Ω′24 = 2π ∫∫_{−π}^{π} f_{a2 a3 a6′ a7′ a8′}( −λ_{a8′}, λ1, λ2, −λ2 ) dλ1 dλ2 (= Ω′42),   (6.3.14)
V′{(a4 a5)(a4′ a5′)} = Ω′33 = 2π f_{a4 a5 a4′ a5′}( λ_{a5′}, λ_{a5}, −λ_{a5} ) + 2π f_{a4 a4′}(λ_{a5}) f_{a5 a5′}(−λ_{a5})
    + 2π f_{a4 a5′}(λ_{a5}) f_{a5 a4′}(−λ_{a5}),
V′{(a4 a5)(a6′ a7′ a8′)} = Ω′34 = 2π ∫_{−π}^{π} f_{a4 a5 a6′ a7′ a8′}( −λ_{a8′}, λ_{a5}, λ, −λ ) dλ (= Ω′43).

In (6.3.14), Ω′33 shows the influence of the exogenous variables. It is easily seen that if {X(t)} has near unit roots, and if λ_{aj} is near zero, then the magnitude of Ω′33 becomes large; hence, in such cases, the asymptotics of g(θ̂) become sensitive.
In what follows we consider the case when the sequence of exogenous variables {Z(t) = (Z1(t), . . . , Zq(t))′} is nonrandom and satisfies Grenander's conditions (G1)–(G4) with

a^{(n)}_{j,k}(h) = Σ_{t=1}^{n−h} Zj(t + h) Zk(t):   (6.3.15)

(G1) lim_{n→∞} a^{(n)}_{j,j}(0) = ∞, (j = 1, 2, . . . , q),
(G2) lim_{n→∞} Zj(n + 1)² / a^{(n)}_{j,j}(0) = 0, (j = 1, 2, . . . , q),
(G3) a^{(n)}_{j,k}(h) / √( a^{(n)}_{j,j}(0) a^{(n)}_{k,k}(0) ) = ρ_{j,k}(h) + o(n^{−1/2}), for j, k = 1, . . . , q, h ∈ Z,
(G4) the matrix Φ(0) = { ρ_{j,k}(0) : j, k = 1, . . . , q } is regular.

Under Grenander's conditions, there exists a Hermitian matrix function M(λ) = { M_{j,k}(λ) : j, k = 1, . . . , q } with positive semidefinite increments such that

Φ(h) = ∫_{−π}^{π} e^{ihλ} dM(λ).   (6.3.16)

Here M(λ) is called the regression spectral measure of {Z(t)}.
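Condition (G3) can be seen concretely for a sinusoidal regressor: for Z(t) = cos(bt), the normalized quantity a^{(n)}(h)/a^{(n)}(0) tends to ρ(h) = cos(bh), so the regression spectral measure M in (6.3.16) puts equal mass at the two frequencies ±b. A quick numerical check (b = 1 is an arbitrary illustrative choice):

```python
import numpy as np

b, n = 1.0, 5000
t = np.arange(1, n + 1)
Z = np.cos(b * t)

def a_n(h):
    # (6.3.15) with j = k: sum_{t=1}^{n-h} Z(t+h) Z(t)
    return np.sum(Z[h:] * Z[:n - h])

for h in (1, 2, 3):
    print(h, a_n(h) / a_n(0), np.cos(b * h))   # ratio ≈ cos(bh)
```

The agreement is of order 1/n, consistent with the o(n^{−1/2}) remainder in (G3).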


We discuss the asymptotics for the sample versions of cov(X, Z) and cum(X, X, Z), which are defined by

Â_{j,k} = Σ_{t=1}^{n} Xj(t) Zk(t) / √( n Σ_{t=1}^{n} Zk(t)² ),
                                                            (6.3.17)
B̂_{j,m,k} = Σ_{t=1}^{n} Xj(t) Xm(t) Zk(t) / √( n Σ_{t=1}^{n} Zk(t)² ),

respectively. For this we need the following assumption.


Assumption 6.3.4. There exists a constant b > 0 such that

det{ f(λ) } ≥ b,  for all λ ∈ [−π, π],   (6.3.18)

where f(λ) is the spectral density matrix of {X(t)}.
Proposition 6.3.5. Suppose that Assumptions 6.3.1 and 6.3.4 and Grenander's conditions are satisfied. Then the following hold.
(i)

√n ( Â_{j,k} − A_{j,k} ) →^d N(0, ΩA),   (6.3.19)

where the covariance between the (j, k)th and (j′, k′)th elements of ΩA is given by 2π ∫_{−π}^{π} f_{jj′}(λ) dM_{k,k′}(λ).
(ii)

√n ( B̂_{j,m,k} − B_{j,m,k} ) →^d N(0, ΩB),   (6.3.20)

where ΩB = { V(j, m, k : j′, m′, k′) } with

V(j, m, k : j′, m′, k′) = 2π ∫_{−π}^{π} [ ∫_{−π}^{π} { f_{jm′}(λ − λ1) f_{mj′}(λ1) + f_{jj′}(λ − λ1) f_{mm′}(λ1) } dλ1
    + ∫∫_{−π}^{π} f_{jmj′m′}(λ1, λ2 − λ, λ2) dλ1 dλ2 ] dM_{k,k′}(λ).

The results above show the influence of {Z(t)} on the return process {X(t)}, which is reflected in the asymptotics of g(θ̂). As a simple example, let {X(t)} and {Z(t)} be generated by

X(t) − aX(t − 1) = u(t),
Z(t) = cos(bt) + cos(0.25bt),   (6.3.21)

where the u(t)'s are i.i.d. N(0, 1) variables. Figure 6.2 plots V = V(j, m, k : j′, m′, k′) in Proposition 6.3.5 with respect to a and b. The figure shows that if b ≈ 0 and |a| ≈ 1, then V becomes large,

Figure 6.2: Values of V for a = −0.9(0.1)0.9, b = −3.14(0.33)3.14

which implies that if the return process has near unit roots, and if the exogenous variables are nearly constant, then the asymptotics of g(θ̂) will be sensitive in such a situation.
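This sensitivity can be reproduced in a small simulation of (6.3.21) and the statistic Â of (6.3.17): with a slowly varying Z(t), the Monte Carlo variance of Â is far larger when the AR coefficient a is near one. (Illustrative settings, not from the book: n = 200, 400 replications, b = 0.1.)

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, b = 200, 400, 0.1
t = np.arange(1, n + 1)
Z = np.cos(b * t) + np.cos(0.25 * b * t)    # slowly varying exogenous sequence

def sample_A_hat(a):
    """Monte Carlo draws of A-hat in (6.3.17) for AR(1) returns X."""
    out = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal(n + 100)
        x = np.zeros(n + 100)
        for s in range(1, n + 100):
            x[s] = a * x[s - 1] + u[s]       # X(t) - a X(t-1) = u(t)
        out[r] = np.sum(x[100:] * Z) / np.sqrt(n * np.sum(Z ** 2))
    return out

v_low, v_high = sample_A_hat(0.1).var(), sample_A_hat(0.9).var()
print(v_low, v_high)    # the variance is much larger for |a| near 1
```

This mirrors Figure 6.2: the variance is driven by the AR(1) spectrum evaluated at the low frequencies where the regression spectral measure of Z concentrates.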
6.4 Multi-Step Ahead Portfolio Estimation
This section addresses the problem of estimation for multi-step ahead portfolio predictors in the
case when the asset return process is a vector-valued non-Gaussian process. First, to grasp the idea
simply, we mention the multi-step prediction for scalar-valued stationary processes.
Let {Xt : t ∈ Z} be a stationary process with mean zero and spectral density g(λ). Write the spectral representation as

Xt = ∫_{−π}^{π} e^{−itλ} dZ(λ),  E{ |dZ(λ)|² } = g(λ) dλ,

and assume

∫_{−π}^{π} log g(λ) dλ > −∞.

Let Ag(z) = Σ_{j=0}^{∞} a^{(g)}_j z^j, z ∈ C, be the response function satisfying

g(λ) = (1/2π) | Ag(e^{iλ}) |².

We are now interested in linear prediction of Xt based on Xt−1, Xt−2, . . ., and seek the linear predictor X̂t = Σ_{j≥1} bj Xt−j which minimizes E|Xt − X̂t|². The best linear predictor is given by

X̂t^{best} = ∫_{−π}^{π} e^{−itλ} { Ag(e^{iλ}) − Ag(0) } / Ag(e^{iλ}) dZ(λ)   (6.4.1)

(e.g., Chapter 8 of Grenander and Rosenblatt (1957)). Although X̂t^{best} is the most plausible predictor, in practice, model selection and estimation for g(λ) are needed. Hence the true spectral density g(λ) is likely to be misspecified. This leads us to a misspecified prediction problem when a conjectured spectral density

f(λ) = (1/2π) | Af(e^{iλ}) |²

is fitted to g(λ). From (6.4.1), the best linear predictor computed on the basis of the conjectured spectral density f(λ) is
X̂t^f = ∫_{−π}^{π} e^{−itλ} { Af(e^{iλ}) − Af(0) } / Af(e^{iλ}) dZ(λ).   (6.4.2)

Grenander and Rosenblatt (1957) evaluated the prediction error as

E{ Xt − X̂t^f }² = ( |Af(0)|² / 2π ) ∫_{−π}^{π} g(λ)/f(λ) dλ.   (6.4.3)

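For a scalar AR(1) with spectral density g(λ) = σ²/(2π|1 − ae^{iλ}|²), minimizing (6.4.3) over the one-coefficient h-step model f(λ) = (1/2π)|1 − θ1 e^{ihλ}|^{−2} recovers the familiar h-step predictor coefficient θ1 = a^h, with minimal error γ(0)(1 − a^{2h}). A numerical sketch (illustrative values a = 0.6, h = 3):

```python
import numpy as np

a, h, sig2 = 0.6, 3, 1.0
N = 20_000
lam = -np.pi + 2 * np.pi * np.arange(N) / N
g = sig2 / (2 * np.pi * np.abs(1 - a * np.exp(1j * lam)) ** 2)   # AR(1) spectrum

def pred_error(theta1):
    # (6.4.3) with A_f(0) = 1: the 2π factors in g and 1/f cancel, leaving
    # ∫ g(λ) |1 - theta1 e^{ihλ}|² dλ, computed here by a Riemann sum.
    return np.sum(g * np.abs(1 - theta1 * np.exp(1j * h * lam)) ** 2) * (2 * np.pi / N)

grid = np.linspace(-0.99, 0.99, 1981)
theta_star = grid[int(np.argmin([pred_error(th) for th in grid]))]

gamma0 = sig2 / (1 - a ** 2)
print(theta_star, a ** h)                                 # ≈ 0.216 = a**h
print(pred_error(a ** h), gamma0 * (1 - a ** (2 * h)))    # matching minimal errors
```

The same objective reappears below as (6.4.6); this is the scalar prototype of the fits used for multi-step portfolio prediction.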
The setting is very wide, and is applicable to the problem of h-step ahead prediction, i.e., prediction of X_{t+h} based on a linear combination of Xt, Xt−1, . . .. This can be grasped by fitting the misspecified spectral density

fθ(λ) = (1/2π) | 1 − θ1 e^{ihλ} − θ2 e^{i(h+1)λ} − · · · |^{−2},   θ = (θ1, θ2, . . .)′,   (6.4.4)

to g(λ). The best h-step ahead linear predictor of X_{t+h} is given in the form

θ1 Xt + θ2 Xt−1 + θ3 Xt−2 + · · · ,   (6.4.5)

where θ = (θ1, θ2, . . .)′ is determined by the value which minimizes the misspecified prediction error (6.4.3):

(1/2π) ∫_{−π}^{π} g(λ)/fθ(λ) dλ,   (6.4.6)
with respect to θ. Since θ is actually unknown, we estimate it by a Whittle estimator. In what
follows, we describe the details in vector form. Let {X(t) = (X1(t), . . . , Xm(t))′ : t ∈ Z} be a zero-mean stationary process with m × m spectral density matrix

g(λ) = { g_{jk}(λ) : j, k = 1, 2, . . . , m } = (1/2π) Ag(e^{iλ}) K Ag(e^{iλ})*,

where Ag(e^{iλ}) = Σ_{j=0}^{∞} A(j) e^{ijλ} and K is a nonsingular m × m matrix. Let the spectral representation of {X(t)} be

X(t) = ∫_{−π}^{π} e^{−itλ} dZ(λ),

where {Z(λ) = (Z1(λ), . . . , Zm(λ))′ : −π ≤ λ ≤ π} is an orthogonal increment process satisfying

E[ dZj(λ) dZ̄k(µ) ] = { g_{jk}(λ) dλ  (if λ = µ),
                       0             (if λ ≠ µ).
Hannan (1970) (Theorem 1 in Chapter III) gave the best linear predictor of X(t + 1) based on X(t), X(t − 1), . . . as

X̂(t)^{best} = ∫_{−π}^{π} e^{itλ} { Ag(e^{iλ}) − Ag(0) } Ag(e^{−iλ})^{−1} dZ(λ).   (6.4.7)

It is well known that the prediction error is

tr E[ { X(t + 1) − X̂(t)^{best} }{ X(t + 1) − X̂(t)^{best} }′ ]
  = exp[ (1/2πm) ∫_{−π}^{π} tr{ log det( 2π g(λ) ) } dλ ].   (6.4.8)
As in the scalar case, we consider the problem of misspecified prediction based on a conjectured spectral density matrix

f(λ) = (1/2π) Af(e^{iλ}) Af(e^{iλ})*,

where Af(z) = Σ_{j=0}^{∞} a^{(f)}_j z^j, (z ∈ C). Then the pseudo best linear predictor computed on the basis of f(λ) is given by

X̂(t)^{p-best(f)} = ∫_{−π}^{π} e^{−itλ} { Af(e^{iλ}) − Af(0) } Af(e^{iλ})^{−1} dZ(λ),

where Af(0) = Im (the m × m identity matrix). As in (6.4.3), the prediction error satisfies

tr E[ ( X(t + 1) − X̂(t)^{p-best(f)} )( X(t + 1) − X̂(t)^{p-best(f)} )′ ] ∝ ∫_{−π}^{π} tr{ f^{−1}(λ) g(λ) } dλ.   (6.4.9)

We can apply this to the h-step ahead prediction. Consider the problem of prediction for X(t + h) by the linear combination

X̂(t + h) = Σ_{j=1}^{L} Φ(j) X(t − j + 1),   (6.4.10)

where the Φ(j)'s are m × m matrices. This problem can be understood if we fit

fθ(λ) = (1/2π) { Γθ(λ)^{−1} Γθ(λ)^{*−1} }   (6.4.11)

to g(λ), where Γθ(λ) = Im − Σ_{j=1}^{L} Φ(j) e^{i(h+j−1)λ}. Here

θ = ( vec(Φ(1))′, . . . , vec(Φ(L))′ )′ ∈ Θ ⊂ R^r,

where r = dim θ = L × m². The best h-step ahead linear predictor is given by X̂(t)^{p-best(fθ)}, where θ is defined by

θ = arg min_θ ∫_{−π}^{π} tr{ fθ(λ)^{−1} g(λ) } dλ.
Because θ is unknown, we estimate it by a Whittle estimator. Suppose that an observed stretch {X(t − n + 1), X(t − n + 2), . . . , X(t)} is available. Let

In(λ) = (1/2πn) { Σ_{j=1}^{n} X(t − n + j) e^{ijλ} } { Σ_{j=1}^{n} X(t − n + j) e^{ijλ} }*.

The Whittle estimator for θ is defined by

θ̂W = arg min_{θ∈Θ} ∫_{−π}^{π} tr{ fθ(λ)^{−1} In(λ) } dλ.

To state the asymptotics of θ̂W, we assume

Assumption 6.4.1. (i) All the components of g(λ) are square integrable.
(ii) g(λ) ∈ Lip(α), α > 1/2.
(iii) The fourth-order cumulants of {X(t)} exist.
(iv) The minimizer θ is unique, and θ ∈ Int Θ.
(v) The r × r matrix

Σθ = [ ∫_{−π}^{π} ( ∂²/∂θi∂θj ) tr{ fθ(λ)^{−1} g(λ) } |_{θ=θ} dλ ; i, j = 1, . . . , r ]

is nonsingular.
As an estimator of \hat{X}(t+h)_{p\text{-}best(f_\theta)}, we can use

    \hat{X}(t+h)_{p\text{-}best(f_{\hat\theta_W})}.    (6.4.12)

Next we are interested in the construction of a portfolio on \{X(t)\}. For a utility function we can define the optimal portfolio weight \alpha_{opt} = (\alpha_1, \ldots, \alpha_m)' and its estimator \hat\alpha_{opt} based on X(t), \ldots, X(t-n+1) (recall Section 3.2). The asymptotics of \hat\theta_W and \hat\alpha_{opt} are established in Theorems 8.2.2 and 3.2.5, respectively. Then we can estimate the h-step ahead portfolio value \alpha_{opt}'X(t+h) by

    \hat\alpha_{opt}'\hat{X}(t+h)_{p\text{-}best(f_{\hat\theta_W})}.    (6.4.13)

Letting PE \equiv \alpha_{opt}'\hat{X}(t+h)_{p\text{-}best(f_\theta)} - \alpha_{opt}'X(t+h), we can evaluate the accuracy of \hat\alpha_{opt}'\hat{X}(t+h)_{p\text{-}best(f_{\hat\theta_W})} - \alpha_{opt}'X(t+h) as follows.

Proposition 6.4.2. (Hamada and Taniguchi (2014)) Under Assumption 6.4.1, it holds that
(i) \hat\alpha_{opt}'\hat{X}(t+h)_{p\text{-}best(f_{\hat\theta_W})} - \alpha_{opt}'X(t+h) = PE + o_p(1),
(ii) E\{PE^2\} = \alpha_{opt}'\big[\int_{-\pi}^{\pi}\Gamma_\theta(\lambda)g(\lambda)\Gamma_\theta^{*}(\lambda)\,d\lambda\big]\alpha_{opt}.

If one wants to evaluate the long-run behaviour of a pension portfolio, this result will be useful.
6.5 Causality Analysis

Granger (1969) introduced a useful way to describe the relationship between two variables when one is causing the other. The concept is called Granger causality (G-causality for short), which has been popularized and developed in a variety of works, e.g., Sims (1972), Geweke (1982), Hosoya (1991), Lütkepohl (2007), Jeong et al. (2012), etc. The applications are growing, e.g., Sato et al. (2010) in fMRI data analysis, Shojaie and Michailidis (2010) in gene network analysis, etc.
Let X = \{X(t) = (X_1(t), \ldots, X_p(t))' : t \in \mathbb{Z}\} and Y = \{Y(t) = (Y_1(t), \ldots, Y_q(t))' : t \in \mathbb{Z}\} be two time series which constitute a (p+q)-dimensional stationary process \{S(t) = (X(t)', Y(t)')' : t \in \mathbb{Z}\} with mean zero and spectral density matrix

    f_s(\lambda) = \begin{pmatrix} f_{XX}(\lambda) & f_{XY}(\lambda) \\ f_{YX}(\lambda) & f_{YY}(\lambda) \end{pmatrix}.

Now we are interested in whether Y causes X or not. In what follows we denote by \sigma\{\cdots\} the \sigma-field generated by \{\cdots\}. Let \hat{X}(t|\sigma\{\cdots\}) be the linear projection (predictor) of X(t) on \sigma\{\cdots\}.
Definition 6.5.1. If

    \mathrm{tr}\,E[\{X(t) - \hat{X}(t|\sigma\{S(t-1), S(t-2), \ldots\})\}\{X(t) - \hat{X}(t|\sigma\{S(t-1), S(t-2), \ldots\})\}']
    < \mathrm{tr}\,E[\{X(t) - \hat{X}(t|\sigma\{X(t-1), X(t-2), \ldots\})\}\{X(t) - \hat{X}(t|\sigma\{X(t-1), X(t-2), \ldots\})\}'],    (6.5.1)

we say that Y Granger causes X, denoted by Y ⇒ X. If equality holds in (6.5.1), we say that Y does not Granger cause X, denoted by Y ⇏ X.
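Definition 6.5.1 can be checked empirically by comparing sample one-step prediction errors. The sketch below is illustrative only (a hypothetical bivariate model with one lag, scalar X and Y for brevity): it regresses X(t) on the past of S(t) = (X(t)', Y(t)')' and on the past of X alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal()
    x[t] = 0.3 * x[t - 1] + 0.6 * y[t - 1] + rng.standard_normal()

def one_step_error(target, regressors):
    """Sample analogue of the prediction-error trace in (6.5.1), one lag."""
    Z = np.column_stack(regressors)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return np.var(target - Z @ coef)

full = one_step_error(x[1:], [x[:-1], y[:-1]])   # predict X from past (X, Y)
own = one_step_error(x[1:], [x[:-1]])            # predict X from past X only
print(full < own)   # strict inequality: Y Granger causes X in this design
```

Because x(t) loads directly on y(t-1), including the past of Y strictly reduces the one-step error, matching the inequality in (6.5.1).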
Next we express (6.5.1) in terms of f_s(\lambda). For this we assume that \{S(t)\} is generated by

    S(t) = \sum_{j=0}^{\infty}A(j)U(t-j),    (6.5.2)

where A(0) = I_{p+q} and the U(t) \equiv (\epsilon(t)', \eta(t)')' are i.i.d. (0, G) random vectors. Here \epsilon(t) = (\epsilon_1(t), \ldots, \epsilon_p(t))', \eta(t) = (\eta_1(t), \ldots, \eta_q(t))', and the covariance matrix G is of the form

    G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix},    (6.5.3)

where G_{11}: p \times p, G_{12}: p \times q, G_{21}: q \times p, G_{22}: q \times q.


Assumption 6.5.2. (i) \det\{\sum_{j=0}^{\infty}A(j)z^j\} \neq 0 for all z \in \mathbb{C} with |z| \leq 1.
(ii) \sum_{j=0}^{\infty}\|A(j)\| < \infty.
Now we evaluate the right-hand side (RHS) of (6.5.1). We use [B(z)]_+ to mean that we retain only the terms of the expansion with positive powers of z. [B(z)]_- is formed similarly, taking only nonpositive powers of z, so that B(z) = [B(z)]_+ + [B(z)]_-. Let h_{11}(\lambda) = \sum_{j=1}^{\infty}H_{11}(j)e^{-ij\lambda}, where the H_{11}(j)'s are p \times p matrices. Then,

    \text{RHS of (6.5.1)} = (2\pi)^{-1}\int_{-\pi}^{\pi}\mathrm{tr}\big[\{I_p - h_{11}(\lambda)\}A_{11}(\lambda)G_{11}^{1/2}\big\{\{I_p - h_{11}(\lambda)\}A_{11}(\lambda)G_{11}^{1/2}\big\}^{*}\big]\,d\lambda
    = (2\pi)^{-1}\int_{-\pi}^{\pi}\mathrm{tr}\big[\{[e^{i\lambda}A_{11}(\lambda)]_+ + [e^{i\lambda}A_{11}(\lambda)]_- - e^{i\lambda}h_{11}(\lambda)A_{11}(\lambda)\}G_{11}\{\,''\,\}^{*}\big]\,d\lambda,    (6.5.4)

where \{\,''\,\} repeats the preceding braced factor.
Because [e^{i\lambda}A_{11}(\lambda)]_+ \in H(e^{i\lambda}, e^{2i\lambda}, \ldots), [e^{i\lambda}A_{11}(\lambda)]_- \in H(1, e^{-i\lambda}, e^{-2i\lambda}, \ldots), and e^{i\lambda}h_{11}(\lambda)A_{11}(\lambda) \in H(1, e^{-i\lambda}, e^{-2i\lambda}, \ldots), where H(\ldots) is the Hilbert space generated by (\ldots), it is seen that if

    [e^{i\lambda}A_{11}(\lambda)]_- - e^{i\lambda}h_{11}(\lambda)A_{11}(\lambda) = 0,    (6.5.5)

then (6.5.4) attains the minimum

    \frac{1}{2\pi}\int_{-\pi}^{\pi}\mathrm{tr}\,[e^{i\lambda}A_{11}(\lambda)]_+ G_{11}[e^{i\lambda}A_{11}(\lambda)]_+^{*}\,d\lambda = \mathrm{tr}\,G_{11}.    (6.5.6)

Next, the variance matrix of \epsilon(t) - G_{12}G_{22}^{-1}\eta(t) is given by

    G_{11} - G_{12}G_{22}^{-1}G_{21}.    (6.5.7)

Then, (RHS − LHS) of (6.5.1) is \mathrm{tr}\,G_{12}G_{22}^{-1}G_{21}. Hence, if

    \mathrm{tr}\,G_{12}G_{22}^{-1}G_{21} = 0,    (6.5.8)

then Y ⇏ X.
It is known that

    \det G = \exp\Big[\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\{\det 2\pi f_s(\lambda)\}\,d\lambda\Big]    (6.5.9)

(e.g., Hannan (1970, p. 162)). We observe that

    \det\{I - G_{11}^{-1/2}G_{12}G_{22}^{-1}G_{21}G_{11}^{-1/2}\}
    = \exp\Big[\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\{\det 2\pi f_s(\lambda)\}\,d\lambda\Big]
    - \exp\Big[\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\{\det 2\pi f_{XX}(\lambda)\det 2\pi f_{YY}(\lambda)\}\,d\lambda\Big].    (6.5.10)
If (6.5.10) = 0, then Y ⇏ X. To estimate the LHS of (6.5.10), we substitute a nonparametric spectral estimator \hat{f}_n(\lambda) into the RHS. We use

    \hat{f}_n(\lambda) = \int_{-\pi}^{\pi}W_n(\lambda - \mu)I_n(\mu)\,d\mu,    (6.5.11)

where I_n(\mu) = \frac{1}{2\pi n}\big\{\sum_{t=1}^{n}S(t)e^{it\mu}\big\}\big\{\sum_{t=1}^{n}S(t)e^{it\mu}\big\}^{*}, and W_n(\cdot) satisfies the following.

Assumption 6.5.3. (i) W(x) is bounded, even, nonnegative and such that \int_{-\infty}^{\infty}W(x)\,dx = 1.
(ii) For M = O(n^{\alpha}), (1/4 < \alpha < 1/2), the function W_n(\lambda) \equiv MW(M\lambda) can be expanded as

    W_n(\lambda) = \frac{1}{2\pi}\sum_{l}w\Big(\frac{l}{M}\Big)\exp(-il\lambda),

where w(x) is a continuous, even function with w(0) = 1, |w(x)| \leq 1 and \int_{-\infty}^{\infty}w(x)^2\,dx < \infty, and satisfies

    \lim_{x\to 0}\frac{1 - w(x)}{|x|^2} = \kappa_2 < \infty.
Under this assumption we can check the assumptions of Theorems 9 and 10 in Hannan (1970), whence

    \hat{f}_n(\lambda) - f(\lambda) = O_p\{\sqrt{M/n}\},    (6.5.12)

uniformly in \lambda \in [-\pi, \pi].


Let m = p + q, and let K(\cdot) be a function defined on f(\lambda).

Assumption 6.5.4. K : D \to \mathbb{R} is holomorphic, where D is an open subset of \mathbb{C}^{m^2}.
Theorem 6.5.5. Suppose that (HT1)–(HT6) in Section 8.1 hold. Under Assumptions 6.5.2–6.5.4, it holds that

    T_n \equiv \sqrt{n}\Big[\int_{-\pi}^{\pi}K\{\hat{f}_n(\lambda)\}\,d\lambda - \int_{-\pi}^{\pi}K\{f(\lambda)\}\,d\lambda\Big]    (6.5.13)

is asymptotically normal with mean zero and variance \nu_1(f) + \nu_2(Q^s), where

    \nu_1(f) = 4\pi\int_{-\pi}^{\pi}\mathrm{tr}\big[f(\lambda)K^{(1)}\{f(\lambda)\}\big]^2\,d\lambda,    (6.5.14)

and

    \nu_2(Q^s) = 2\pi\sum_{r,t,u,s=1}^{m}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi}K^{(1)}_{rt}(\lambda_1)K^{(1)}_{us}(\lambda_2)Q^{s}_{rtus}(-\lambda_1, \lambda_2, -\lambda_2)\,d\lambda_1\,d\lambda_2.

Here K^{(1)}\{f(\lambda)\} is the first derivative of K\{f(\lambda)\} at f(\lambda) (see Magnus and Neudecker (1999)), and K^{(1)}_{rt}(\lambda) is the (r,t)th component of K^{(1)}\{f(\lambda)\}. Also Q^{s}_{rtus}(\cdot) is the (r,t,u,s)th component of the fourth-order cumulant spectral density (see Theorem 8.1.1).

Proof. Expanding K\{\hat{f}_n(\lambda)\} in a Taylor series at f(\lambda), the proof follows from (6.5.12) and Theorem 8.1.1 (see also Taniguchi et al. (1996)). □
Recalling (6.5.10), we see that if

    H : \int_{-\pi}^{\pi}\log\det\{I - f_{XX}^{-1/2}(\lambda)f_{XY}(\lambda)f_{YY}(\lambda)^{-1}f_{YX}(\lambda)f_{XX}^{-1/2}(\lambda)\}\,d\lambda = 0,    (6.5.15)

then Y ⇏ X. Let

    K_G\{f_s(\lambda)\} = \log\det\{I - f_{XX}^{-1/2}(\lambda)f_{XY}(\lambda)f_{YY}(\lambda)^{-1}f_{YX}(\lambda)f_{XX}^{-1/2}(\lambda)\}.

In view of Theorem 6.5.5, we introduce

    GT \equiv \sqrt{n}\Big[\int_{-\pi}^{\pi}K_G\{\hat{f}_n(\lambda)\}\,d\lambda - \int_{-\pi}^{\pi}K_G\{f_s(\lambda)\}\,d\lambda\Big]\Big/\sqrt{\nu_1(f_s) + \nu_2(Q^s)}.    (6.5.16)

Under H, it is seen that

    GT_0 \equiv \sqrt{n}\int_{-\pi}^{\pi}K_G\{\hat{f}_n(\lambda)\}\,d\lambda\Big/\sqrt{\nu_1(f_s) + \nu_2(Q^s)}    (6.5.17)

is asymptotically N(0,1). Because the denominator is unknown, we may propose

    \widehat{GT} \equiv \sqrt{n}\int_{-\pi}^{\pi}K_G\{\hat{f}_n(\lambda)\}\,d\lambda\Big/\sqrt{\nu_1(\hat{f}_n) + \hat\nu_2(Q^s)},    (6.5.18)
where \hat\nu_2(Q^s) is a consistent estimator of \nu_2(Q^s), which is given by Taniguchi et al. (1996). If the process \{S(t)\} is Gaussian, we observe that

    \widehat{GT}' \equiv \sqrt{n}\int_{-\pi}^{\pi}K_G\{\hat{f}_n(\lambda)\}\,d\lambda\Big/\sqrt{\nu_1(\hat{f}_n)}    (6.5.19)

is asymptotically N(0,1) under H. Hence, we propose the test of H given by the critical region

    [|\widehat{GT}| > t_\alpha],    (6.5.20)

where t_\alpha is defined by

    \int_{t_\alpha}^{\infty}(2\pi)^{-1/2}\exp\Big(-\frac{x^2}{2}\Big)\,dx = \frac{\alpha}{2}.
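As a rough numerical illustration of the measure K_G (not the full test (6.5.18): the cumulant term \hat\nu_2(Q^s) is omitted and all settings are hypothetical), the sketch below smooths the matrix periodogram of two independent series with a Daniell-type window and evaluates \int K_G\,d\lambda. With scalar blocks, K_G reduces to \log(1 - |\text{coherence}|^2), which should be near zero under H.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4096
s = rng.standard_normal((n, 2))   # independent components: H holds

# Matrix periodogram at the positive Fourier frequencies.
d = np.fft.fft(s, axis=0)[1 : n // 2]
I = np.einsum("ja,jb->jab", d, d.conj()) / (2 * np.pi * n)

# Daniell-type smoothing over 2M+1 neighbouring frequencies (a crude W_n).
M = 64
kern = np.ones(2 * M + 1) / (2 * M + 1)
f = np.empty_like(I)
for a in range(2):
    for b in range(2):
        f[:, a, b] = np.convolve(I[:, a, b], kern, mode="same")

# K_G for scalar blocks: log(1 - squared coherence).
coh2 = np.abs(f[:, 0, 1]) ** 2 / (f[:, 0, 0].real * f[:, 1, 1].real)
KG = np.log(1.0 - coh2)
integral = 2.0 * np.pi * np.mean(KG)   # symmetric extension to (-pi, pi]
print(abs(integral) < 0.5)
```

Under independence the smoothed squared coherence concentrates near 1/(2M+1), so the integral is close to (but slightly below) zero; a formal test would studentize it as in (6.5.19).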

6.6 Classification by Quantile Regression


Discriminant analysis has been developed in various fields, e.g., for multivariate data, regression, time series data, etc. Taniguchi and Kakizawa (2000) develop discriminant analysis for time series in view of statistical asymptotic theory based on local asymptotic normality. Shumway (2003) introduces the use of time-varying spectra to develop a discriminant procedure for nonstationary time series. Taniguchi et al. (2008) provide much detail on discriminant analysis for financial time series. Hirukawa (2006) examines hierarchical clustering by a disparity between spectra for daily log-returns of 13 companies. The results show that the proposed method can classify the type of industry clearly.
In discriminant analysis, with observations in hand, we try to find which category a sample belongs to. Without loss of generality, we discuss the case of two categories, Π1 and Π2. First, assuming that the two categories are mutually exclusive, we evaluate the misclassification probability P(1|2) that a sample is misclassified into Π1 although it belongs to Π2, and vice versa for P(2|1). Then we assess the goodness of a rule by minimizing the total misclassification probability:

    P(1|2) + P(2|1).    (6.6.1)

Quantile regression is a methodology for estimating models of conditional quantile functions, and examines how covariates influence the entire response distribution. There is a large theoretical literature in this field. Koenker and Bassett (1978) propose quantile regression methods for nonrandom regressors. Koenker and Xiao (2006) extend the results to linear quantile autoregression models. Koenker (2005) provides a comprehensive account of his 25 years of research on quantiles and its great contributions to statistics and econometrics.
In this section we discuss discriminant analysis for quantile regression models. First, we introduce a classification statistic for two categories, and propose the classification rule. Second, we evaluate the misclassification probabilities P(i|j), i \neq j, which will be shown to converge to zero. Finally, we examine a finer measure of the goodness of the statistic by evaluating P(i|j) when the categories are contiguous.
Let \{Y_t\} be generated by

    Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + u_t,    (6.6.2)

where the u_t's are i.i.d. random variables with mean 0, variance \sigma^2 and probability distribution function F(\cdot), and f(u) = dF(u)/du. Here we assume that B(z) \equiv 1 - \sum_{j=1}^{p}\beta_j z^j \neq 0 on D = \{z \in \mathbb{C} : |z| \leq 1\}. Letting \rho_\tau(u) = u(\tau - \chi(u < 0)), where \chi(\cdot) is the indicator function, we define

    \beta(\tau) = (\beta_0(\tau), \beta_1(\tau), \ldots, \beta_p(\tau))'
by

    \beta(\tau) \equiv \arg\min_{\beta}E\{\rho_\tau(Y_t - \beta' Y_{t-1})\},    (6.6.3)

where Y_{t-1} \equiv (1, Y_{t-1}, \ldots, Y_{t-p})'.


Suppose that Y_1, \ldots, Y_n belong to one of two categories, Π1 or Π2, as follows:

    Π1 : \beta(\tau) = \beta_1(\tau),
    Π2 : \beta(\tau) = \beta_2(\tau)\ (\neq \beta_1(\tau)).    (6.6.4)

We introduce the following classification statistic:

    D \equiv \sum_{t=p+1}^{n}\big[\rho_\tau\{Y_t - \beta_2(\tau)'Y_{t-1}\} - \rho_\tau\{Y_t - \beta_1(\tau)'Y_{t-1}\}\big].    (6.6.5)

We classify (Y_1, \ldots, Y_n) into Π1 if D > 0, and into Π2 if D \leq 0.


Theorem 6.6.1. The classification rule based on D is consistent, i.e.,

    \lim_{n\to\infty}P(1|2) = \lim_{n\to\infty}P(2|1) = 0.    (6.6.6)

Proof. We use Knight's identity (Knight (1998)):

    \rho_\tau(u - v) - \rho_\tau(u) = -v\psi_\tau(u) + \int_{0}^{v}\{I(u \leq s) - I(u < 0)\}\,ds,    (6.6.7)

where \psi_\tau(u) = \tau - \chi(u < 0). Under Π1, write u_{t\tau} \equiv Y_t - \beta_1(\tau)'Y_{t-1} and d(t) \equiv \beta_2(\tau) - \beta_1(\tau). Then

    D = \sum_{t=p+1}^{n}\big[\rho_\tau\{u_{t\tau} - d(t)'Y_{t-1}\} - \rho_\tau(u_{t\tau})\big]    (6.6.8)
    = -\sum_{t=p+1}^{n}d(t)'Y_{t-1}\psi_\tau(u_{t\tau})    (6.6.9)
    + \sum_{t=p+1}^{n}\int_{0}^{d(t)'Y_{t-1}}\{I(u_{t\tau} \leq s) - I(u_{t\tau} < 0)\}\,ds.    (6.6.10)


Since \{d(t)'Y_{t-1}\psi_\tau(u_{t\tau})\} is a martingale difference sequence, it is seen that (6.6.9) = O_p(\sqrt{n}). Let \mathcal{F}_t be the \sigma-field generated by Y_t, Y_{t-1}, \ldots. Then we observe that

    A_t \equiv E\Big[\int_{0}^{d(t)'Y_{t-1}}\{I(u_{t\tau} \leq s) - I(u_{t\tau} < 0)\}\,ds\ \Big|\ \mathcal{F}_{t-1}\Big]
    = \int_{0}^{d(t)'Y_{t-1}}\{F(s) - F(0)\}\,ds > 0 \quad a.e.,    (6.6.11)

and

    E|A_t|^2 \leq E|d(t)'Y_{t-1}|^2 < \infty.    (6.6.12)

Now (6.6.10) is written as

    \sum_{t=p+1}^{n}\Big[\int_{0}^{d(t)'Y_{t-1}}\{I(u_{t\tau} \leq s) - I(u_{t\tau} < 0)\}\,ds - A_t\Big]    (6.6.13)
    + \sum_{t=p+1}^{n}A_t.    (6.6.14)

The term (6.6.13) is a martingale, whose order is O_p(\sqrt{n}), and (6.6.14) = O_p(n) > 0 a.e. Under Π1,

    P(2|1) = P(D \leq 0) = P\{(6.6.9) \leq -(6.6.10)\} = P\{O_p(\sqrt{n}) \leq -O_p(n)\} \longrightarrow 0 \quad \text{as } n \to \infty,    (6.6.15)

which proves the theorem. □
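Knight's identity (6.6.7), the key step of this proof, is easy to verify numerically. The sketch below (check values chosen arbitrarily) compares both sides, approximating the integral by a Riemann sum.

```python
import numpy as np

def rho(u, tau):
    """Check function: rho_tau(u) = u (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def psi(u, tau):
    """psi_tau(u) = tau - 1{u < 0}."""
    return tau - (u < 0)

def knight_rhs(u, v, tau, grid=200000):
    # -v psi_tau(u) + int_0^v {1(u <= s) - 1(u < 0)} ds   (Riemann sum)
    s = np.linspace(0.0, v, grid)
    integrand = (u <= s).astype(float) - float(u < 0)
    return -v * psi(u, tau) + integrand.mean() * v

checks = [(0.7, 1.5, 0.3), (-0.4, 0.9, 0.8), (1.2, -2.0, 0.5)]
ok = all(abs(rho(u - v, tau) - rho(u, tau) - knight_rhs(u, v, tau)) < 1e-3
         for u, v, tau in checks)
print(ok)
```

The agreement up to discretization error holds for positive and negative v alike, which is what makes the decomposition of D into (6.6.9) and (6.6.10) valid term by term.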


Next we evaluate a finer measure of the goodness of D. For this we suppose that Π1 and Π2 are contiguous, i.e.,

    \beta_2(\tau) - \beta_1(\tau) = \frac{1}{\sqrt{n}}\delta.    (6.6.16)
Lemma 6.6.2. (Koenker and Xiao (2006)) Under Π1,

    D \xrightarrow{d} -\delta' W(\tau) + \frac{1}{2}\delta'\Psi(\tau)\delta,    (6.6.17)

where W(\tau) \sim N_p(0, \tau(1-\tau)\Omega_0) with \Omega_0 = E[Y_{t-1}Y_{t-1}'], and

    \Psi(\tau) = f\{F^{-1}(\tau)\}\Omega_0.

From this we obtain

Theorem 6.6.3. Under (6.6.16),

    \lim_{n\to\infty}P(2|1) = \lim_{n\to\infty}P(1|2) = F_{\delta,\tau}\Big[-\frac{1}{2}\delta'\Psi(\tau)\delta\Big],    (6.6.18)

where F_{\delta,\tau}(\cdot) is the distribution function of N(0, \tau(1-\tau)\delta'\Omega_0\delta).
In practice we do not know \beta_1(\tau) and \beta_2(\tau). But if we can use training samples \{Y_t^{(1)}; t = 1, \ldots, n_1\} and \{Y_t^{(2)}; t = 1, \ldots, n_2\} from Π1 and Π2, respectively, the estimators are given by

    \hat\beta_i(\tau) = \arg\min_{\beta_i(\tau)}\sum_{t=1}^{n_i}\rho_\tau\{Y_t^{(i)} - \beta_i(\tau)'Y_{t-1}^{(i)}\}, \quad i = 1, 2.

Then we introduce an estimated version \hat{D} of D by

    \hat{D} \equiv \sum_{t=p+1}^{n}\big[\rho_\tau\{Y_t - \hat\beta_2(\tau)'Y_{t-1}\} - \rho_\tau\{Y_t - \hat\beta_1(\tau)'Y_{t-1}\}\big].    (6.6.19)
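A minimal end-to-end sketch of this procedure follows (all settings hypothetical: simulated AR(1) categories with \tau = 0.5, and a generic optimizer for the check-loss minimization rather than a dedicated quantile-regression routine).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
tau = 0.5

def rho(u, tau):
    """Check function rho_tau(u) = u (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def simulate(b0, b1, n):
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = b0 + b1 * y[t - 1] + rng.standard_normal()
    return y

def fit(y, tau):
    """Minimize the empirical check loss for an AR(1) design."""
    Z = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    loss = lambda b: rho(y[1:] - Z @ b, tau).sum()
    return minimize(loss, x0=np.zeros(2), method="Nelder-Mead").x

beta1_hat = fit(simulate(0.0, 0.2, 800), tau)   # training sample from Pi_1
beta2_hat = fit(simulate(0.0, 0.7, 800), tau)   # training sample from Pi_2

y = simulate(0.0, 0.2, 800)                     # fresh stretch, truly from Pi_1
Z = np.column_stack([np.ones(len(y) - 1), y[:-1]])
D_hat = (rho(y[1:] - Z @ beta2_hat, tau) - rho(y[1:] - Z @ beta1_hat, tau)).sum()
print(D_hat > 0)   # D_hat > 0: classified into Pi_1
```

Since the fresh stretch comes from Π1, the check loss under the wrong coefficients \hat\beta_2(\tau) is larger, so \hat{D} is positive, as Theorem 6.6.1 predicts.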

Chen et al. (2016) applied the discriminant method above to the monthly mean maximum tem-
perature at Melbourne for 1994–2015. The results illuminate interesting features of weather change.

6.7 Portfolio Estimation under Causal Variables


This section discusses the problem of portfolio estimation under causal variables and its application to the Japanese pension investment portfolio; it is mainly based on Kobayashi et al. (2013).
First, we outline canonical correlation analysis (CCA). Let X = (X_1, \ldots, X_p)' and Y = (Y_1, \ldots, Y_q)' be random vectors with

    E\begin{pmatrix}X\\Y\end{pmatrix} = 0, \quad \mathrm{Cov}\Big\{\begin{pmatrix}X\\Y\end{pmatrix}, \begin{pmatrix}X\\Y\end{pmatrix}\Big\} = \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}, \quad \text{a } (p+q)\times(p+q) \text{ matrix}.

We want to find linear combinations \alpha'X and \beta'Y whose correlation is largest, say \xi_1 = \alpha'X, \eta_1 = \beta'Y and \lambda_1 = \mathrm{Corr}(\xi_1, \eta_1). We call \lambda_1 the first canonical correlation, and \xi_1, \eta_1 the first canonical variates. Next we find second linear combinations \xi_2 = \alpha_2'X and \eta_2 = \beta_2'Y such that, of all combinations uncorrelated with \xi_1 and \eta_1, these have maximum correlation \lambda_2. This procedure is continued. Then we get \xi_j = \alpha_j'X and \eta_j = \beta_j'Y whose correlation is \lambda_j. We call \lambda_j the jth canonical correlation, and \xi_j, \eta_j the jth canonical variates. It is known that the \lambda_j's (\lambda_1 \geq \lambda_2 \geq \cdots) are the jth eigenvalues of

    \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1/2}    (6.7.1)

and

    \Sigma_{YY}^{-1/2}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1/2}.    (6.7.2)

The vectors \alpha_j and \beta_j are the jth eigenvectors of (6.7.1) and (6.7.2), respectively, corresponding to the jth eigenvalue \lambda_j (e.g., Anderson (2003), Brillinger (2001a)).
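A numerical sketch of the eigenproblem (6.7.1) follows (the data-generating design is hypothetical; note that the eigenvalues of (6.7.1) are the squared canonical correlations).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 5000, 3, 2
f0 = rng.standard_normal(n)                      # common latent factor
X = np.column_stack([f0, rng.standard_normal(n), rng.standard_normal(n)])
Y = np.column_stack([f0 + 0.5 * rng.standard_normal(n), rng.standard_normal(n)])

S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy = S[:p, :p], S[:p, p:]
Syx, Syy = S[p:, :p], S[p:, p:]

def inv_sqrt(A):
    w, V = np.linalg.eigh(A)
    return V @ np.diag(w ** -0.5) @ V.T

# Matrix in (6.7.1); its eigenvalues are the squared canonical correlations.
M = inv_sqrt(Sxx) @ Sxy @ np.linalg.inv(Syy) @ Syx @ inv_sqrt(Sxx)
lam, vecs = np.linalg.eigh(M)                    # ascending order
rho1 = np.sqrt(lam[-1])                          # first canonical correlation
alpha1 = inv_sqrt(Sxx) @ vecs[:, -1]             # first canonical vector for X
print(rho1)   # near 1/sqrt(1.25) ~ 0.894 by construction
```

Only the first components of X and Y carry the shared factor, so the population first canonical correlation is \mathrm{Corr}(f_0, f_0 + 0.5\epsilon) = 1/\sqrt{1.25}, which the sample eigenvalue recovers.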

In what follows we apply the CCA above to the portfolio problem. Let X = \{X(t) = (X_1(t), \ldots, X_p(t))'\} be an asset return process which consists of p asset returns. We write \mu_X = E\{X(t)\} and \Sigma_{XX} = E[(X(t)-\mu_X)(X(t)-\mu_X)']. For a portfolio coefficient vector \alpha = (\alpha_1, \ldots, \alpha_p)', the expectation and variance of the portfolio \alpha'X(t) are, respectively, given by \mu(\alpha) = \mu_X'\alpha and \eta^2(\alpha) = \alpha'\Sigma_{XX}\alpha. The classical mean-variance portfolio is defined by the optimization problem

    \max_{\alpha}\{\mu(\alpha) - \beta\eta(\alpha)\} \quad \text{subject to } e'\alpha = 1,    (6.7.3)

where e = (1, \ldots, 1)' (p \times 1 vector) and \beta is a given positive number. The solution for \alpha is given by

    \alpha_{MV} = \frac{1}{2\beta}\Big\{\Sigma_{XX}^{-1}\mu_X - \frac{e'\Sigma_{XX}^{-1}\mu_X}{e'\Sigma_{XX}^{-1}e}\Sigma_{XX}^{-1}e\Big\} + \frac{\Sigma_{XX}^{-1}e}{e'\Sigma_{XX}^{-1}e},    (6.7.4)

which is specified by the asset return X.
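Formula (6.7.4) can be checked directly: the first (risk-adjustment) term is orthogonal to e, so the budget constraint e'\alpha = 1 holds automatically. The inputs below are hypothetical.

```python
import numpy as np

mu_X = np.array([0.05, 0.08, 0.03])              # hypothetical mean returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.06]])           # hypothetical covariance
beta = 2.0
e = np.ones(3)

Si = np.linalg.inv(Sigma)
a = e @ Si @ e                                   # e' Sigma^{-1} e
# (6.7.4): first term orthogonal to e, second term sums to one.
alpha_mv = (Si @ mu_X - (e @ Si @ mu_X) / a * (Si @ e)) / (2 * beta) + Si @ e / a
print(float(e @ alpha_mv))   # equals 1 up to rounding (budget constraint)
```

Larger values of \beta shrink the first term, so the weights approach the global minimum variance portfolio \Sigma_{XX}^{-1}e/(e'\Sigma_{XX}^{-1}e).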

However, in the real world, if we observe some other causal variables, we may use their information to construct the optimal portfolio. Let Y = \{Y(t) = (Y_1(t), \ldots, Y_q(t))'\} be a q-dimensional causal return process with \mu_Y = E\{Y(t)\} and \Sigma_{YY} = E[(Y(t)-\mu_Y)(Y(t)-\mu_Y)']. If we want to construct a portfolio \alpha'X(t) incorporating the best linear information from Y(t), we are led to find the CCA variate \alpha given by

    \max_{\alpha,\beta}\mathrm{Corr}\{\alpha'X(t), Y(t)'\beta\}.    (6.7.5)

This is exactly CCA portfolio estimation by use of causal variables.

To develop the asymptotic theory we assume

Assumption 6.7.1. The process \{Z(t) = (X(t)', Y(t)')'\} is generated by the (p+q)-dimensional linear process

    Z(t) - \mu = \sum_{j=0}^{\infty}A(j)U(t-j),    (6.7.6)

where the A(j)'s satisfy \sum_{j=0}^{\infty}\|A(j)\|^2 < \infty, and E[U(t)] = 0, E[U(t)U(s)'] = \delta(t,s)K. Here \delta(t,s) is the Kronecker delta function, \mu = (\mu_X', \mu_Y')' and K is a positive definite matrix.
Write

    \Sigma = E[\{Z(t) - E(Z(t))\}\{Z(t) - E(Z(t))\}'] = \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}.

Let the \lambda_j's (\lambda_1 \geq \lambda_2 \geq \cdots) be the jth eigenvalues of (6.7.1) and (6.7.2), and let \alpha_j and \beta_j be the jth eigenvectors of (6.7.1) and (6.7.2), respectively. Then we can see that the solution of (6.7.5) is attained at \alpha = \alpha_1 and \beta = \beta_1.

Suppose that the observed stretch \{Z(1), \ldots, Z(n)\} from (6.7.6) is available. Then, the sample versions of \Sigma and (\mu_X', \mu_Y')' = E(Z(t)) are, respectively, given by

    \hat\Sigma = \frac{1}{n}\sum_{t=1}^{n}\{Z(t) - \bar{Z}\}\{Z(t) - \bar{Z}\}' = \begin{pmatrix}\hat\Sigma_{XX} & \hat\Sigma_{XY}\\ \hat\Sigma_{YX} & \hat\Sigma_{YY}\end{pmatrix},    (6.7.7)

    \bar{Z} \equiv \frac{1}{n}\sum_{t=1}^{n}Z(t) = \begin{pmatrix}\hat\mu_X\\ \hat\mu_Y\end{pmatrix}.    (6.7.8)

Naturally we can introduce the following estimator for \alpha_{MV} given by (6.7.4):

    \hat\alpha_{MV} = \frac{1}{2\beta}\Big\{\hat\Sigma_{XX}^{-1}\hat\mu_X - \frac{e'\hat\Sigma_{XX}^{-1}\hat\mu_X}{e'\hat\Sigma_{XX}^{-1}e}\hat\Sigma_{XX}^{-1}e\Big\} + \frac{\hat\Sigma_{XX}^{-1}e}{e'\hat\Sigma_{XX}^{-1}e}.    (6.7.9)

Also we estimate \alpha_1 by solving

    \hat\Sigma_{XX}^{-1/2}\hat\Sigma_{XY}\hat\Sigma_{YY}^{-1}\hat\Sigma_{YX}\hat\Sigma_{XX}^{-1/2}\hat\alpha_1 = \hat\lambda_1\hat\alpha_1.    (6.7.10)

If we use information from Y(t), we may introduce the following estimator for the portfolio coefficient \alpha:

    \hat\alpha = (1-\varepsilon)\hat\alpha_{MV} + \varepsilon\hat\alpha_1,    (6.7.11)

where \varepsilon > 0 is chosen appropriately. Because the length of the portfolio vector is normalized to 1, the final form of the causal portfolio estimator is given by

    \hat\alpha_\varepsilon = \frac{\hat\alpha}{\|\hat\alpha\|}.    (6.7.12)
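The combination (6.7.11)-(6.7.12) itself is a two-line computation; the sketch below uses hypothetical estimates in place of (6.7.9) and (6.7.10).

```python
import numpy as np

alpha_mv_hat = np.array([0.70, 0.20, 0.10])   # hypothetical (6.7.9) output
alpha_1_hat = np.array([0.50, 0.30, 0.40])    # hypothetical (6.7.10) eigenvector

eps = 0.05
alpha_hat = (1 - eps) * alpha_mv_hat + eps * alpha_1_hat   # (6.7.11)
alpha_eps = alpha_hat / np.linalg.norm(alpha_hat)          # (6.7.12)
print(float(np.linalg.norm(alpha_eps)))   # unit length by construction
```

Small \varepsilon keeps the estimator close to the mean-variance weights while tilting it toward the direction most correlated with the causal variables.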
Next we consider the asymptotic properties of \hat\alpha_{MV} and \hat\alpha_1. Recalling (6.7.6), assume that the innovation process \{U(t)\} satisfies the conditions (HT1)–(HT6) in Section 8.1. Then, it holds that

    \hat\Sigma - \Sigma = \int_{-\pi}^{\pi}\{I_Z(\lambda) - f_Z(\lambda)\}\,d\lambda,    (6.7.13)

where I_Z(\lambda) is the periodogram matrix of \{Z(1), \ldots, Z(n)\} and f_Z(\lambda) is the spectral density matrix of \{Z(t)\}. For simplicity, in what follows we assume \mu = 0. Then,

    \alpha_{MV} = \frac{1}{e'\Sigma_{XX}^{-1}e}\Sigma_{XX}^{-1}e, \quad \hat\alpha_{MV} = \frac{1}{e'\hat\Sigma_{XX}^{-1}e}\hat\Sigma_{XX}^{-1}e.    (6.7.14)

Letting L(\hat\Sigma) = \hat\Sigma_{XX}^{-1/2}\hat\Sigma_{XY}\hat\Sigma_{YY}^{-1}\hat\Sigma_{YX}\hat\Sigma_{XX}^{-1/2}, it follows from the proof of Theorem 9.2.4 of Brillinger (2001a) that

    \hat\alpha_1 - \alpha_1 = \sum_{k\neq 1}\big[\alpha_k'\{L(\hat\Sigma) - L(\Sigma)\}\alpha_1\big]\alpha_k/(\lambda_1 - \lambda_k) + \text{lower-order terms}.    (6.7.15)

In view of (6.7.14) and (6.7.15), we may write

    \sqrt{n}\begin{pmatrix}\hat\alpha_{MV} - \alpha_{MV}\\ \hat\alpha_1 - \alpha_1\end{pmatrix} = \sqrt{n}\{H(\hat\Sigma) - H(\Sigma)\} + o_p(1)
    = \sqrt{n}\,\partial H(\Sigma)\{\mathrm{vech}(\hat\Sigma) - \mathrm{vech}(\Sigma)\} + o_p(1),    (6.7.16)

where H(\cdot) is a smooth function and \partial H(\Sigma) is the derivative with respect to \Sigma. Noting (6.7.13), (6.7.16) and Theorem 8.1.1 in Chapter 8, we can see that

    \sqrt{n}\begin{pmatrix}\hat\alpha_{MV} - \alpha_{MV}\\ \hat\alpha_1 - \alpha_1\end{pmatrix} \xrightarrow{L} N(0, V), \quad (n\to\infty),    (6.7.17)

where V is expressed as integral functionals of f_Z(\lambda) and the fourth-order cumulant spectra of \{Z(t)\}. Hence,

    \sqrt{n}(\hat\alpha - \alpha) \xrightarrow{L} N(0, F'VF), \quad (n\to\infty),    (6.7.18)

where F = [(1-\varepsilon)I_p, \varepsilon I_p]'.

We next apply the results above to the portfolio problem for the Japanese government pension investment fund (GPIF). In 2012, GPIF invested the pension reserve fund in the following five assets: S_1 (government bonds), S_2 (domestic stocks), S_3 (foreign bonds), S_4 (foreign stocks) and S_5 (short-term funds), with portfolio coefficient \hat\alpha_{MV}^{GPIF} = (0.67, 0.11, 0.08, 0.09, 0.05)'. We write the price processes as S_1 = \{S_1(t)\}, S_2 = \{S_2(t)\}, S_3 = \{S_3(t)\}, S_4 = \{S_4(t)\} and S_5 = \{S_5(t)\}. Letting X_i(t) = \log S_i(t+1)/S_i(t), i = 1, \ldots, 4, we introduce the return process on the first four assets:

    X(t) = (X_1(t), X_2(t), X_3(t), X_4(t))'.    (6.7.19)

As exogenous variables, we take the consumer price index S_6(t) and the average wage S_7(t). Hence the exogenous return process is

    Y(t) = (\log S_6(t+1)/S_6(t), \log S_7(t+1)/S_7(t))'.    (6.7.20)

Suppose that the monthly observations of \{X(t)\} and \{Y(t)\} are available for t = 1, \ldots, 502 (December 31, 1970 – October 31, 2012).

Considering the variables Y_1(t) and Y_2(t), Kobayashi et al. (2013) calculated \hat\alpha(\varepsilon) in (6.7.12). For \varepsilon = 0.01,

    \hat\alpha(0.01) = (0.902, 0.013, 0.084, 0.001)'.    (6.7.21)

Next we investigate the predictive property of 2012's GPIF portfolio coefficient (0.67, 0.11, 0.08, 0.09)'. For this, let

    \alpha_{GPIF}(\varepsilon) = (0.67+\varepsilon, 0.11+\varepsilon, 0.08+\varepsilon, 0.09+\varepsilon)'.    (6.7.22)

Introduce the following \varepsilon-perturbed portfolio:

    \varphi_\varepsilon(t) \equiv \alpha_{GPIF}(\varepsilon)'X(t).    (6.7.23)

We measure the predictive goodness of 2012's GPIF portfolio by

    \min_{\beta_1,\ldots,\beta_p}E[\{\varphi_0(t) - \beta_1\varphi_\varepsilon(t-1) - \cdots - \beta_p\varphi_\varepsilon(t-p)\}^2].    (6.7.24)

The estimated version of (6.7.24) is given by

    PE(\varepsilon, m, p) \equiv \frac{1}{n-p}\sum_{t=p+1}^{502}\Big\{\varphi_0(t) - \hat\beta_{LS}'\begin{pmatrix}\varphi_\varepsilon(t-1)\\ \vdots\\ \varphi_\varepsilon(t-p)\end{pmatrix}\Big\}^2,    (6.7.25)

where

    \hat\beta_{LS} = \Big[\sum_{t=502-m}^{502}\begin{pmatrix}\varphi_\varepsilon(t-1)\\ \vdots\\ \varphi_\varepsilon(t-p)\end{pmatrix}\big(\varphi_\varepsilon(t-1), \ldots, \varphi_\varepsilon(t-p)\big)\Big]^{-1}\Big[\sum_{t=502-m}^{502}\begin{pmatrix}\varphi_\varepsilon(t-1)\\ \vdots\\ \varphi_\varepsilon(t-p)\end{pmatrix}\varphi_0(t)\Big].
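The criterion can be sketched as follows. All settings are hypothetical (a toy return series, a simple additive perturbation instead of the \varepsilon-tilted weights, and an in-sample regression over the whole stretch rather than the last m observations).

```python
import numpy as np

rng = np.random.default_rng(5)

def pe(phi0, phi_eps, p):
    """Regress phi0(t) on p lags of phi_eps by OLS; average squared error."""
    n = len(phi0)
    Z = np.column_stack([phi_eps[p - j - 1 : n - j - 1] for j in range(p)])
    y = phi0[p:]
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.mean((y - Z @ b) ** 2)

# Toy portfolio value series; phi_eps perturbs it with noise of size eps.
phi0 = np.cumsum(rng.standard_normal(300)) * 0.01
errors = []
for eps in [0.0, 0.02, 0.05]:
    phi_eps = phi0 + eps * rng.standard_normal(300)
    errors.append(pe(phi0, phi_eps, p=3))
print(errors[0] <= min(errors))  # eps = 0 gives the smallest error here
```

Perturbing the regressors can only degrade the fit in this design, which mirrors the finding below that PE(\varepsilon, 50, p) is minimized at \varepsilon = 0.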

For m = 50, n = 502, p = 1, 2, . . . , and |ε| ≤ 0.05, Kobayashi et al. (2013) calculated PE(ε, m, p)
and showed that PE(ε, 50, p) is minimized at ε = 0 and p = 3, which implies that 2012’s GPIF
portfolio has an optimal predictive property for p = 3.

Problems
6.1. Verify (6.1.4).

6.2. Verify (6.1.5)–(6.1.8).

6.3. A concrete expression is possible for (6.1.13). Let \Sigma = \{\Sigma_{ij}; i, j = 1, \ldots, m\}. Then, show that

    \frac{\partial^2 g_{MV}(\theta)}{\partial\delta\partial\mu'} = \frac{2\delta}{\pi\beta}\Sigma^{-1}\tilde{D}\Sigma^{-1},

    \frac{\partial^2 g_{MV}(\theta)}{\partial\delta\partial\Sigma_{ii}} = -\frac{2\delta}{\pi\beta}\Sigma^{-1}\big[\tilde{D}\Sigma^{-1}E_{ii} + E_{ii}\Sigma^{-1}\tilde{D}\big]\Sigma^{-1}(\mu - R_0 e), \quad i = 1, \ldots, m,

    \frac{\partial^2 g_{MV}(\theta)}{\partial\delta\partial\Sigma_{ij}} = -\frac{2\delta}{\pi\beta}\Sigma^{-1}\big[\tilde{D}\Sigma^{-1}E_{ij} + E_{ij}\Sigma^{-1}\tilde{D}\big]\Sigma^{-1}(\mu - R_0 e), \quad i, j = 1, \ldots, m,\ i > j,

where E_{ii} is an m \times m matrix whose elements are all 0 except for the (i,i)th element, which is 1, E_{ij} is an m \times m matrix whose elements are all 0 except for the (i,j)th and (j,i)th elements, which are 1, and

    \tilde{D} = \sum_{j=0}^{\infty}A(j)DA(j)'.    (6.7.26)

6.4. In Proposition 6.3.2, verify the expressions for Ω12 , Ω14 and Ω22 .

6.5. In Proposition 6.3.3, verify the expressions (6.3.14).

6.6. Show that the best linear predictor X̂(t)best is given by (6.4.1).

6.7. Show that the prediction error of the pseudo best linear predictor is given by (6.4.9).

6.8. Prove that the RHS of (6.5.1) is equal to (6.5.4).

6.9. Prove Theorem 6.6.3.

6.10. Derive the formula (6.7.4) for the classical mean-variance portfolio α MV .

6.11. Establish the result (6.7.17).


Chapter 7

Numerical Examples

This chapter summarizes numerical examples, applying the theories introduced in previous chapters. We also introduce additional practical new theories relevant to portfolio analysis. The numerical examples include the following five real data analyses. In Section 7.1, based on the theory of portfolio estimators discussed in Chapter 3, we analyze time series data of six Japanese companies' monthly log-returns. In Section 7.2, as an application of the multiperiod portfolio optimization problem introduced in Chapter 4, we perform portfolio analysis for the Japanese government pension investment fund (GPIF). Section 7.3 presents microarray data analysis using rank order statistics, which was introduced in Chapter 5. Section 7.4 presents DNA sequencing data analysis using a mean-variance portfolio, the spectral envelope (Stoffer et al. (1993)) and a mean-diversification portfolio. Finally, corticomuscular coupling analysis by the CHARN model discussed in Chapter 2 is introduced, and we apply a mean-variance portfolio, the spectral envelope and a mean-diversification portfolio to the analysis.

7.1 Real Data Analysis for Portfolio Estimation


This section provides various numerical studies to investigate the accuracy of the portfolio estima-
tors discussed in Chapter 3. We provide real data analysis based on Toyota, Honda, Nissan, Fanuc,
Denso and Komatsu’s monthly log-return. We introduce a procedure to construct a confidence re-
gion of the efficient frontier by using the AR bootstrap method. In addition, assuming a locally
stationary process, nonparametric portfolio estimators are provided.

7.1.1 Introduction
As we mentioned in Chapter 3, the mean-variance portfolio is identified by the mean vector (µ) and
the covariance matrix (Σ) of the asset returns. Since we cannot know µ and Σ, we construct these
estimators µ̂ and Σ̂. This study aims to evaluate the effect of the major portfolio quantities (such as
the efficient frontier, portfolio reconstruction) caused by the estimators µ̂ and Σ̂. We first consider
the selection of the underlying model for the asset returns. Then, based on the selected model, the
confidence region for a point of an estimated efficient frontier is constructed. On the other hand,
assuming a locally stationary process, we consider a sequence of estimated portfolio weights to
optimize the estimated time-varying mean vector and covariance matrix.

7.1.2 Data
The data consist of Toyota, Honda, Nissan, Fanuc, Denso and Komatsu's monthly log-returns from December 1983 to January 2013, inclusive, so there are 350 months of data.
Figure 7.1 shows the time series plots of the returns for each company. The plots look stationary
but they may not be independent and identically distributed (i.i.d.). Figure 7.2 shows the autocorrela-
tion plots of the returns. There is no evidence of long memory effect but there exist some short-term
autocorrelations. Figure 7.3 shows normal probability plots of the returns. There is evidence of
heavy tails.

Figure 7.1: Time series plots of monthly log-returns for six companies (panels: Toyota, Honda, Nissan, Fanuc, Denso, Komatsu; vertical axis: return)

Figure 7.2: Autocorrelation plots of monthly log-returns for six companies (panels: Toyota, Honda, Nissan, Fanuc, Denso, Komatsu; horizontal axis: Lag; vertical axis: ACF)


Figure 7.3: Normal probability plots of monthly log-returns for six companies (panels: Toyota, Honda, Nissan, Fanuc, Denso, Komatsu; horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles)

7.1.3 Method
7.1.3.1 Model selection
Here we discuss the procedure of order specification, estimation and model checking to build a vector AR model for a given time series. Suppose that \{X(t) = (X_1(t), \ldots, X_m(t))'; t \in \mathbb{N}\} follows a VAR(i) model:

    X(t) = \mu + U(t) + \sum_{j=1}^{i}\Phi^{(i)}(j)\{X(t-j) - \mu\}.

Given \{X(1), \ldots, X(n)\}, the residual can be written as

    \hat{U}^{(i)}(t) = \hat{Y}(t) - \sum_{j=1}^{i}\hat\Phi^{(i)}(j)\hat{Y}(t-j), \quad i+1 \leq t \leq n,

where \hat{Y}(t) = X(t) - \hat\mu with \hat\mu = (1/n)\sum_{t=1}^{n}X(t), and \hat\Phi^{(i)}(j) for j = 1, \ldots, i is the OLS estimator of \Phi^{(i)}(j). By using the residuals, the following information criteria and the final prediction error (FPE) are available for the selection of the order i:

    \mathrm{AIC}(i) = \ln|\tilde\Sigma_i| + \frac{2m^2 i}{n},    (7.1.1)
    \mathrm{BIC}(i) = \ln|\tilde\Sigma_i| + \frac{m^2 i\ln(n)}{n},    (7.1.2)
    \mathrm{HQ}(i) = \ln|\tilde\Sigma_i| + \frac{2m^2 i\ln(\ln(n))}{n},    (7.1.3)
    \mathrm{FPE}(i) = |\tilde\Sigma_i|\Big(\frac{n + i^*}{n - i^*}\Big)^m,    (7.1.4)
where

    \tilde\Sigma_i = \frac{1}{n}\sum_{t=i+1}^{n}\hat{U}^{(i)}(t)\hat{U}^{(i)}(t)', \quad i \geq 0,

i^* is the total number of parameters in each model, and i assigns the lag order. The Akaike information criterion (AIC), the Schwarz–Bayesian criterion (BIC) and the Hannan and Quinn criterion (HQ) were proposed by Akaike (1973), Schwarz (1978) and Hannan and Quinn (1979), respectively.
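The criteria (7.1.1)–(7.1.4) can be computed from OLS residual covariances. A sketch for AIC on simulated VAR(1) data follows (BIC, HQ and FPE only change the penalty term; the model, sample size and seed are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 400, 2
Phi = np.array([[0.5, 0.1], [0.0, 0.4]])   # true VAR(1) coefficient
X = np.zeros((n, m))
for t in range(1, n):
    X[t] = X[t - 1] @ Phi.T + rng.standard_normal(m)

def sigma_tilde(X, i):
    """OLS residual covariance of a fitted VAR(i), mean subtracted."""
    nn = len(X)
    Y = X - X.mean(axis=0)
    Z = np.hstack([Y[i - j - 1 : nn - j - 1] for j in range(i)])  # lagged design
    B, *_ = np.linalg.lstsq(Z, Y[i:], rcond=None)
    U = Y[i:] - Z @ B
    return U.T @ U / nn

def aic(X, i):    # (7.1.1); BIC/HQ/FPE differ only in the penalty
    return np.log(np.linalg.det(sigma_tilde(X, i))) + 2 * m * m * i / n

best = min(range(1, 6), key=lambda i: aic(X, i))
print(best)
```

As the lag order grows, the log-determinant of the residual covariance falls while the penalty 2m^2 i/n rises; the selected order balances the two.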

7.1.3.2 Confidence region


Next, we discuss the procedure for constructing the confidence region for an estimated efficient frontier. In general, when statistics are computed from a randomly chosen sample, these statistics are random variables, and we cannot evaluate the influence of the randomness exactly if we do not know the population. However, if the sample is a good representative of the population, we can simulate sampling from the population by sampling from the sample, which is usually called "resampling." The "bootstrap" introduced by Efron (1979) is one of the most popular resampling methods and has found application to a number of statistical problems, including construction of confidence regions. However, it is difficult to apply the bootstrap directly to dependent data, because the idea is to simulate sampling from the population by sampling from the sample under the i.i.d. assumption. Here we apply a "(stationary) AR bootstrap." The idea is to resample the residuals from the fitted model assuming an AR model.1 In the autoregressive bootstrap (ARB) method, we assume that X1, . . . , Xn
are observed from the AR(p) model. Then, based on X1 , . . . , Xn and the least squares estimators of
the AR coefficients, the “residuals” from the fitted model are constructed. Under the correct fitted
model, the residuals turn out to be “approximately independent” and then we resample the residuals
to define the bootstrap observations through an estimated version of the structural equation of the
fitted AR(p) model. Properties of the ARB have been investigated by many authors. In particular, Bose (1988) showed that the ARB approximation is second-order correct. Although the above method assumes stationarity of the underlying AR model, nonstationary cases have also been discussed. As for bootstrapping an explosive AR process, Datta (1995) showed that the validity of the ARB critically depends on the initial values. As for bootstrapping an unstable AR process (e.g., a unit root process), Datta (1996) provides conditions on the resample size m for the "m out of n" ARB to ensure validity of the bootstrap approximation. In a similar way, the validity of the bootstrap approximation for stationary autoregressive and moving average (ARMA) processes is discussed by Kreiss and Franke (1992) and Allen and Datta (1999), among others. For details about the validity of the ARB, see, e.g., Lahiri (2003).
Suppose that a vector stochastic process which describes Toyota, Denso and Komatsu's returns follows a VAR(2) model2:

    X(t) = \mu + U(t) + \Phi^{(2)}(1)\{X(t-1) - \mu\} + \Phi^{(2)}(2)\{X(t-2) - \mu\}.

Then the residual can be written as

    \hat{U}^{(2)}(t) = \hat{Y}(t) - \hat\Phi^{(2)}(1)\hat{Y}(t-1) - \hat\Phi^{(2)}(2)\hat{Y}(t-2), \quad 3 \leq t \leq n.

Let G_n(\cdot) denote the distribution which puts mass 1/(n-2) at each \hat{U}^{(2)}(t). Then, the centering-adjusted distribution is given by F_n(x) = G_n(x + \bar{U}), where x \in \mathbb{R}^3 and \bar{U} = \frac{1}{n-2}\sum_{t=3}^{n}\hat{U}^{(2)}(t).

1 Other resampling methods for dependent data are known as the Jackknife, the (Moving) Block bootstrap, the Model-Based bootstrap (resampling including the AR bootstrap and the Sieve bootstrap) and the Wild bootstrap.
2 See Table 7.1.

Let \{U^*(t); t = 1, \ldots, n\} be a sequence of i.i.d. observations from F_n(\cdot). Given \{U^*(t)\}, we can generate
a sequence of resampled returns \{X^*(t) : t = 1, \ldots, n\} by

    X^*(t) = \hat\mu + U^*(t) + \hat\Phi^{(2)}(1)\{X^*(t-1) - \hat\mu\} + \hat\Phi^{(2)}(2)\{X^*(t-2) - \hat\mu\},    (7.1.5)

where X^*(1) = \hat\mu + U^*(1) and X^*(2) = \hat\mu + U^*(2) + \hat\Phi^{(2)}(1)\{X^*(1) - \hat\mu\}.

For each sequence of resampled returns \{X^*(t) : t = 1, \ldots, n\}, the sample mean vector \mu^* and the sample covariance matrix \Sigma^* are calculated, and the optimal portfolio weight with mean return \mu_P is estimated by

    w_{\mu_P}^* = \frac{\mu_P\alpha^* - \beta^*}{\delta^*}(\Sigma^*)^{-1}\mu^* + \frac{\gamma^* - \mu_P\beta^*}{\delta^*}(\Sigma^*)^{-1}e    (7.1.6)

with \alpha^* = e'(\Sigma^*)^{-1}e, \beta^* = \mu^{*\prime}(\Sigma^*)^{-1}e, \gamma^* = \mu^{*\prime}(\Sigma^*)^{-1}\mu^* and \delta^* = \alpha^*\gamma^* - (\beta^*)^2.

For N resamplings of {X∗(t)}, N versions of the mean vector µ∗ and the covariance matrix Σ∗
are obtained. Based on these statistics, N resampled efficient frontiers can be calculated by

(µP − µ∗GMV)^2 = s∗(σP^2 − σ∗GMV^2)       (7.1.7)

where

µ∗GMV = β∗/α∗,   σ∗GMV^2 = 1/α∗   and   s∗ = δ∗/α∗.

Moreover, plugging a fixed σP into (7.1.7), N values of µP are obtained. Denoting these µP in descending
order as µP^(b)(σP), b = 1, . . . , N (i.e., µP^(1)(σP) ≥ µP^(2)(σP) ≥ · · · ≥ µP^(N)(σP)), we can obtain the
p-percentile of µP by µP^([pN/100])(σP).
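The frontier formula (7.1.7) and the percentile step can be translated into code directly. A hedged sketch, with illustrative names, assuming the resampled moments µ∗ and Σ∗ are given as lists of arrays:

```python
import numpy as np

def frontier_percentile(mu_stars, Sigma_stars, sigma_P, p):
    """Empirical p-percentile of the resampled frontier returns (7.1.7)
    at a fixed risk level sigma_P."""
    e = np.ones(mu_stars[0].shape[0])
    vals = []
    for mu, Sig in zip(mu_stars, Sigma_stars):
        Sinv = np.linalg.inv(Sig)
        alpha = e @ Sinv @ e        # alpha* = e'(Sigma*)^-1 e
        beta = mu @ Sinv @ e        # beta*  = mu*'(Sigma*)^-1 e
        gamma = mu @ Sinv @ mu      # gamma* = mu*'(Sigma*)^-1 mu*
        delta = alpha * gamma - beta ** 2
        mu_gmv, var_gmv, s = beta / alpha, 1.0 / alpha, delta / alpha
        # frontier (7.1.7): (mu_P - mu_GMV)^2 = s*(sigma_P^2 - sigma_GMV^2)
        vals.append(mu_gmv + np.sqrt(max(s * (sigma_P ** 2 - var_gmv), 0.0)))
    return np.percentile(vals, p)
```

Calling this for a grid of `sigma_P` values at p = 90, 95, 99 traces out the confidence regions plotted in Figure 7.6.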

7.1.3.3 Locally stationary estimation


Dahlhaus (1996c) estimated, nonparametrically, the covariance of lag k at time u of locally sta-
tionary processes with mean vector µ = 0. Shiraishi and Taniguchi (2007) generalized his re-
sults to the case of a nonzero mean vector µ = µ(u). In this section, we suppose that X(t, n) =
(X1(t, n), . . . , Xm(t, n))′ (t = 1, . . . , n; n ∈ N) follows the locally stationary process

X(t, n) = µ(t/n) + ∫_{−π}^{π} exp(iλt) A◦(t, n, λ) dξ(λ).       (7.1.8)

Let µ(u), Σ(u) denote the local mean vector and local covariance matrix at time u = t/n ∈ [0, 1].
Then, the nonparametric estimators are given by

µ̂(t/n) := (1/(bn n)) Σ_{s=[t−bn n/2]+1}^{[t+bn n/2]} K((t − s)/(bn n)) X(s, n),       (7.1.9)

Σ̂(t/n) := (1/(bn n)) Σ_{s=[t−bn n/2]+1}^{[t+bn n/2]} K((t − s)/(bn n)) {X(s, n) − µ̂(t/n)}{X(s, n) − µ̂(t/n)}′,       (7.1.10)

respectively, where K : R → [0, ∞) is a kernel function and bn is a bandwidth. Here, for n < t < n +
[bn n/2], we define X(t, n) ≡ X(2n−t, n), and, for 1 > t > −[bn n/2], we define X(t, n) ≡ X(2−t, n).
By using µ̂(t/n) and Σ̂(t/n) for t = 1, . . . , n, a (local) global minimum variance (GMV) portfolio
weight estimator is given by

ŵGMV(t/n) := Σ̂−1(t/n) e / (e′ Σ̂−1(t/n) e).       (7.1.11)
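The estimators (7.1.9)–(7.1.11), with the kernel K(x) = 6(1/4 − x²) used later in this section, can be sketched as follows. This is a rough illustration with hypothetical names; the integer-part and boundary-reflection conventions follow the text but may differ in fine detail:

```python
import numpy as np

def local_moments(X, t, bn):
    """Kernel estimates (7.1.9)-(7.1.10) of the local mean and covariance at
    time t (1-based), using K(x) = 6(1/4 - x^2) on [-1/2, 1/2] and the
    reflecting boundary extension X(t, n) = X(2n - t, n), X(2 - t, n)."""
    n, m = X.shape
    h = int(bn * n / 2)
    s = np.arange(t - h + 1, t + h + 1)
    # reflect indices falling outside 1..n (X itself is 0-indexed)
    idx = np.where(s < 1, 2 - s, np.where(s > n, 2 * n - s, s)) - 1
    w = 6.0 * (0.25 - ((t - s) / (bn * n)) ** 2) / (bn * n)
    w = np.clip(w, 0.0, None)                      # guard float round-off
    mu = (w[:, None] * X[idx]).sum(axis=0)
    C = X[idx] - mu
    Sigma = (w[:, None, None] * (C[:, :, None] * C[:, None, :])).sum(axis=0)
    return mu, Sigma

def local_gmv_weight(Sigma):
    """Local GMV weight (7.1.11): Sigma^-1 e / (e' Sigma^-1 e)."""
    e = np.ones(Sigma.shape[0])
    x = np.linalg.solve(Sigma, e)
    return x / (e @ x)
```

Evaluating `local_gmv_weight(local_moments(X, t, bn)[1])` for t = 1, …, n produces the transition maps of Figure 7.7.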
240 NUMERICAL EXAMPLES
7.1.4 Results and Discussion
Results for Model Selection Table 7.1 shows the selected orders of the VAR model for each choice of
3 returns out of the 6 returns, based on AIC, HQ, BIC and FPE. In addition, we plot AIC(i), HQ(i), BIC(i) and

Table 7.1: Selected orders for 3 returns within T (Toyota), H (Honda), N (Nissan), F (Funac), D
(Denso) and K (Komatsu) based on AIC, HQ, BIC and FPE

Asset1 T T T T T T T T T T H H H H H H N N N F
Asset2 H H H H N N N F F D N N N F F D F F D D
Asset3 N F D K F D K D K K F D K D K K D K K K
AIC 2 1 2 3 1 2 1 1 1 2 2 2 2 1 2 2 2 1 7 2
HQ 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 2
BIC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
FPE 2 1 2 3 1 2 1 1 1 2 2 2 2 1 2 2 2 1 6 2

FPE(i) for i = 1, . . . , 10, for the returns of Toyota, Denso and Komatsu in Figure 7.4. One can see
that AIC, HQ and FPE attain their smallest values at i = 2, while for BIC the smallest value is
BIC(1), with BIC(2) close to it.
Figure 7.4: AIC, HQ, BIC and FPE for the returns of Toyota, Denso and Komatsu

In particular, the fitted model for the returns of Toyota, Denso and Komatsu is a VAR(2) model with
the following coefficient matrices:

            ( −0.1122  −0.0108   0.1465 )              ( −0.1265   0.0236   0.0736 )
Φ̂(2)(1) =  (  0.0997  −0.1898   0.1551 ) ,  Φ̂(2)(2) = ( −0.0294  −0.1773   0.2037 ) ,
            ( −0.0038   0.1203   0.0468 )              (  0.0837  −0.0038   0.0258 )

      ( 0.0039 )
µ̂ =  ( 0.0020 ) .
      ( 0.0041 )
Results for Confidence Region In Figure 7.5, the points (σ∗(µP), µ∗(µP)) are plotted from 10,000
bootstrap resamples, where

µ∗(µP) = w∗µP′ µ̂,   σ∗(µP) = √(w∗µP′ Σ̂ w∗µP)       (7.1.12)

and µP = 0.008.

Figure 7.5: Results from 10,000 bootstrap resamples of a VAR(2) model. For each resample, the
optimal portfolio with a mean return of 0.008 is estimated. The resampled points are plotted as a
small dot. The large dot is the estimated point and the solid line is the (estimated) efficient frontier.

In Figure 7.6, for the N = 10,000 resampling results, we plot the 90, 95 and 99 percentile confidence
regions (i.e., µP^(9000)(σP), µP^(9500)(σP) and µP^(9900)(σP) for each σP), respectively.

Figure 7.6: Confidence region of efficient frontier. The solid line is the estimated efficient fron-
tier. The upper, middle and lower dotted lines are the 90, 95 and 99 percentile confidence regions,
respectively.
Results for Locally Stationary Estimation Figure 7.7 shows the (local) GMV portfolio weight
estimators ŵGMV(t/n) and the traditional GMV portfolio weight

ŵ(traditional)GMV := Σ̂−1 e / (e′ Σ̂−1 e),       (7.1.13)

where Σ̂ is the sample covariance matrix. The data are the monthly log-returns of Toyota, Honda,
Nissan, Fanuc, Denso and Komatsu. Note that we apply the kernel function K(x) = 6(1/4 − x^2) for
−1/2 ≤ x ≤ 1/2 (this function satisfies Assumption 3 of Shiraishi and Taniguchi (2007)), and the
bandwidths bn = 0.1, 0.3, 0.5 (according to Shiraishi and Taniguchi (2007), the optimal bandwidth
is of order O(n^{−1/5})). One can see that the portfolio weights change dramatically as the bandwidth
becomes small.
Based on these portfolio weights, the sequences of portfolio returns (i.e., X(t, n)′ ŵGMV(t/n)
or X(t, n)′ ŵ(traditional)GMV) are plotted in Figure 7.8. The variance of the portfolio returns appears
to decrease as the bandwidth becomes small. Table 7.2 shows the mean and standard deviation of
the portfolio returns. As the bandwidth becomes small, the mean tends to become small, but the
standard deviation also tends to become small. This phenomenon implies that locally stationary
fitting works well.
Figure 7.7 and Table 7.2 illustrate the merits and drawbacks of investigating portfolio selection
problems under the locally stationary model. The merit is that we can capture the actual phenomenon
better than an analysis under the stationary model. The drawback is the difficulty of determining
the bandwidth. In addition, we must take care not to capture any jump phenomenon under
the locally stationary model.
Figure 7.7: Transition map for (local) GMV portfolio weight (1:Toyota, 2:Honda, 3:Nissan, 4:Fanuc,
5:Denso and 6:Komatsu)

7.1.5 Conclusions
In this section, two types of portfolio analysis have been investigated. The first one is a construc-
tion of a confidence region for an estimated efficient frontier. Model selection for Toyota, Denso
and Komatsu’s returns is considered from a VAR(p) model and p = 2 is selected based on some
APPLICATION FOR PENSION INVESTMENT 243
Figure 7.8: Time series plots of returns for (local) GMV portfolio

Table 7.2: Mean and standard deviation (SD) of returns for (local) GMV portfolio

Portfolio Type bn Mean SD


local 0.1 0.00114 0.04978
0.3 0.00238 0.05417
0.5 0.00345 0.05798
traditional - 0.00354 0.05914

information criteria. Assuming the VAR(2) model for these asset returns, a confidence region for the
estimated efficient frontier is constructed by using the autoregressive bootstrap (ARB). The result
shows that the width of the confidence region tends to increase as the portfolio risk increases.
The second one is the construction of a sequence of estimated portfolio weights under the assumption
of a locally stationary process. By using the nonparametric estimation method, the local mean vector
and local covariance matrix at each time are estimated. As a result, we found that the sequence
of the estimated portfolios is able to grasp the actual phenomenon, but determining the bandwidth
remains difficult.

7.1.6 Bibliographic Notes


Lahiri (2003) introduces various resampling methods for dependent data. Hall (1992) gives the
accuracy of the bootstrap theoretically by using the Edgeworth expansion.

7.2 Application for Pension Investment


In this section, we evaluate a multiperiod effect in portfolio analysis; this study is thus an appli-
cation of the multiperiod portfolio optimization problem in Chapter 4.
7.2.1 Introduction
A portfolio analysis for the Japanese government pension investment fund (GPIF) is discussed in
this section. In actuarial science, “required reserve fund” for pension insurance has been investi-
gated for a long time. The traditional approach focuses on the expected value of future obligations
and interest rate. Then, the investment strategy is determined for exceeding the expected value of
interest rate. Recently, solvency for the insurer is defined in terms of random values of future obli-
gations (e.g., Olivieri and Pitacco (2000)). In this section, assuming that the reserve fund is defined
in terms of the random interest rate and the expected future obligations, we consider optimal port-
folios by optimizing the randomized reserve fund. Shiraishi et al. (2012) proposed three types of
the optimal portfolio weights based on the maximization of the Sharpe ratio under the functional
forms for the portfolio mean and variance, two of them depending on the Reserve Fund at the end-
of-period target (e.g., 100 years). Regarding the structure of the return processes, we take account
of the asset structure of the GPIF. On the asset structure of the GPIF, the asset classes are split into
cash, domestic and foreign bonds, domestic and foreign equity. The first three classes (i.e., cash,
domestic and foreign bonds) are assumed to be short memory while the other classes (i.e., domestic
and foreign equity) are long memory. To obtain the optimal portfolio weights, we rely on the
bootstrap. For the short memory returns, we use a wild bootstrap (WB). Early work focused on
providing first- and second-order theoretical justification for the WB in the classical linear regression
model (see, e.g., Wu (1986)). Gonçalves and Kilian (2004) show that the WB is applicable to the linear
regression model with conditional heteroscedasticity, such as stationary ARCH, GARCH, and stochastic
volatility effects. For the long memory returns, we apply a sieve bootstrap (SB). Bühlmann (1997)
establishes consistency of the autoregressive sieve bootstrap. Assuming that the long memory pro-
cess can be written as AR(∞) and MA(∞) processes, we estimate the long memory parameter by
means of Whittle’s approximate likelihood (Beran (1994)). Given this estimator, the residuals are
computed and resampled for the construction of the bootstrap samples, from which the optimal
portfolio estimated weights are obtained.

7.2.2 Data
The Government Pension Investment Fund (GPIF) of Japan was established on April 1, 2006 as
an independent administrative institution with the mission of managing and investing the Reserve
Fund of the employees’ pension insurance and the national pension (www.gpif.go.jp for more in-
formation). It is the world’s largest pension fund ($1.4 trillion in assets under management as of
December 2009) and its mission is managing and investing the Reserve Funds in safe and efficient
investment with a long-term perspective. Business management targets to be achieved by GPIF are
set by the Minister of Health, Labour and Welfare based on the law on the general rules of indepen-
dent administrative agencies.
Available data is monthly log-returns from January 31, 1971 to October 31, 2009 (466 obser-
vations) of the five types of assets: Domestic Bond (DB), Domestic Equity (DE), Foreign Bond
(FB), Foreign Equity (FE) and cash. We suppose that cash and bonds are gathered in the short-
memory panel XtS = (Xt(DB) , Xt(FB) , Xt(cash) ) and follow (7.2.6), and equities are gathered into the
long-memory panel XtL = (Xt(DE) , Xt(FE) ) and follow (7.2.7).

7.2.3 Method
Let S i,t be the price of the ith asset at time t (i = 1, . . . , m), and Xi,t be its log-return. Time runs from
0 to T. In this section, we consider that today is T0 and T is the end-of-period target. Hence the
past and present observations run for t = 0, . . . , T0, and the future until the end-of-period target for
t = T0 + 1, . . . , T. The price S i,t can be written as

S i,t = S i,t−1 exp(Xi,t) = S i,0 exp(Σ_{s=1}^{t} Xi,s),       (7.2.1)

where S i,0 is the initial price. Let Fi,t denote the Reserve Fund on asset i at time t, defined by

Fi,t = Fi,t−1 exp(Xi,t) − ci,t,

where ci,t denotes the maintenance cost at time t. By recursion, Fi,t can be written as

Fi,t = Fi,t−1 exp(Xi,t) − ci,t
    = Fi,t−2 exp(Σ_{s=t−1}^{t} Xi,s) − Σ_{s=t−1}^{t} ci,s exp(Σ_{s′=s+1}^{t} Xi,s′)
    = Fi,0 exp(Σ_{s=1}^{t} Xi,s) − Σ_{s=1}^{t} ci,s exp(Σ_{s′=s+1}^{t} Xi,s′),       (7.2.2)

where Fi,0 = S i,0.


We gather the Reserve Funds in the vector Ft = (F1,t, . . . , Fm,t)′. Let Ft(w) = w′Ft be a portfolio
formed by the m Reserve Funds, which depends on the vector of weights w = (w1, . . . , wm)′. The
portfolio Reserve Fund can be expressed as a function of all past returns:

Ft(w) := Σ_{i=1}^{m} wi Fi,t
       = Σ_{i=1}^{m} wi { Fi,0 exp(Σ_{s=1}^{t} Xi,s) − Σ_{s=1}^{t} ci,s exp(Σ_{s′=s+1}^{t} Xi,s′) }.       (7.2.3)

We are interested in maximizing Ft(w) at the end-of-period target, FT(w):

FT(w) = Σ_{i=1}^{m} wi { Fi,T0 exp(Σ_{s=T0+1}^{T} Xi,s) − Σ_{s=T0+1}^{T} ci,s exp(Σ_{s′=s+1}^{T} Xi,s′) }.       (7.2.4)

It depends on the future returns, the maintenance cost and the portfolio weights. While the first
two are assumed to be constant from T0 to T (the constant return can be seen as the average return
over the T − T0 periods), we focus on the optimality of the weights, which we denote by wopt.
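The closed form (7.2.4) for a single asset (equivalently, iterating Fi,t = Fi,t−1 exp(Xi,t) − ci,t) can be checked numerically. A sketch with illustrative names:

```python
import numpy as np

def terminal_fund(F0, returns, costs):
    """Reserve Fund at the end-of-period target for one asset:
    F_T = F0 * exp(sum_s X_s) - sum_s c_s * exp(sum_{s' > s} X_s'),
    where `returns` and `costs` cover t = T0+1, ..., T."""
    returns = np.asarray(returns, float)
    costs = np.asarray(costs, float)
    # rev_cum[s] = cumulative return strictly after period s (zero for the last period)
    rev_cum = np.concatenate([np.cumsum(returns[::-1])[::-1][1:], [0.0]])
    return F0 * np.exp(returns.sum()) - np.sum(costs * np.exp(rev_cum))
```

Running the one-step recursion forward gives the same value, which is a useful sanity check on the algebra in (7.2.2).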
In the first specification, the estimation of the optimal portfolio weights is based on the maxi-
mization of the Sharpe ratio

wopt = arg max_w µ(w)/σ(w),       (7.2.5)

under different functional forms of the expectation µ(w) and the risk σ(w) of the portfolio. Shiraishi
et al. (2012) proposed three functional forms, two of them depending on the Reserve Fund.
Definition 7.2.1. The first one is the traditional form based on the returns:

µ(w) = w′E(XT)   and   σ(w) = √(w′V(XT)w),

where E(XT) and V(XT) are the expectation and the covariance matrix of the returns at the end-
of-period target.
Definition 7.2.2. The second form for the portfolio expectation and risk is based on the vector of
Reserve Funds:

µ(w) = w′E(FT)   and   σ(w) = √(w′V(FT)w),

where E(FT) and V(FT) indicate the mean and covariance of the Reserve Funds at time T.
Definition 7.2.3. Last, we consider the case where the portfolio risk depends on the lower partial
moments of the Reserve Funds at the end-of-period target:

µ(w) = w′E(FT)   and   σ(w) = E[{F̃ − FT(w)} I{FT(w) < F̃}],

where F̃ is a given value.
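The text does not prescribe a particular optimizer for (7.2.5). As an illustration only, a crude Monte-Carlo search over long-only weights for the form of Definition 7.2.1 (all names are hypothetical):

```python
import numpy as np

def sharpe_optimal_weight(mean_vec, cov_mat, n_draws=20000, seed=0):
    """Monte-Carlo search for the weight maximising mu(w)/sigma(w) of
    Definition 7.2.1 over the simplex (long-only sketch; a gradient-based
    or closed-form solution could equally be used)."""
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(len(mean_vec)), size=n_draws)  # candidate weights
    mu = W @ mean_vec
    sig = np.sqrt(np.einsum('ij,jk,ik->i', W, cov_mat, W))
    return W[np.argmax(mu / sig)]
```

The same search applies to Definitions 7.2.2 and 7.2.3 once the bootstrapped moments of the Reserve Funds replace those of the returns.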


The m returns split into p- and q-dimensional vectors {XtS; t ∈ Z} and {XtL; t ∈ Z}, where S
and L stand for short and long memory, respectively. The short memory returns correspond to cash
and domestic and foreign bonds, which we generically denote by bonds. The long memory returns
correspond to domestic and foreign equity, which we denote as equity. Cash and bonds follow the
nonlinear model

XtS = µS + H(Xt−1S, . . . , Xt−mS) ǫtS,       (7.2.6)

where µS is a vector of constants, H : Rmp → Rp × Rp is a positive definite matrix-valued measurable
function, and ǫtS = (ǫ1,tS, . . . , ǫp,tS)′ are i.i.d. random vectors with mean 0 and covariance matrix
ΣS. By contrast, equity returns follow a long memory nonlinear model
XtL = Σ_{ν=0}^{∞} φν ǫLt−ν,   ǫtL = Σ_{ν=0}^{∞} ψν XLt−ν,       (7.2.7)

where

φν = Γ(ν + d)/(Γ(ν + 1)Γ(d))   and   ψν = Γ(ν − d)/(Γ(ν + 1)Γ(−d))

with −1/2 < d < 1/2, and ǫtL = (ǫ1,tL, . . . , ǫq,tL)′ are i.i.d. random vectors with mean 0 and covariance
matrix ΣL. We estimate the optimal portfolio weights by means of a bootstrap. Let the super-indexes
matrix Σ . We estimate the optimal portfolio weights by means of a bootstrap. Let the super-indexes
(S , b) and (L, b) denote the bootstrapped samples for the bonds and equity, respectively, and B the
total number of bootstrapped samples. Below we show the bootstrap procedure for both types of
assets.
Bootstrap procedure for Xt(S,b)
Step 1: Generate the i.i.d. sequences {ǫt(S,b)} for t = T0 + 1, . . . , T and b = 1, . . . , B from N(0, Ip).
Step 2: Let YtS = XtS − µ̂S, where µ̂S = (1/T0) Σ_{s=1}^{T0} XsS. Generate the i.i.d. sequences {Yt(S,b)} for
t = T0 + 1, . . . , T and b = 1, . . . , B from the empirical distribution of {YtS}.
Step 3: Compute {Xt(S,b)} for t = T0 + 1, . . . , T and b = 1, . . . , B as

Xt(S,b) = µ̂S + Yt(S,b) ⊙ ǫt(S,b),

where ⊙ denotes the cellwise product.
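Steps 1–3 can be sketched as follows for a single bootstrap replicate (illustrative names only):

```python
import numpy as np

def wild_bootstrap(XS, horizon, rng):
    """Steps 1-3 of the short-memory (wild) bootstrap: resample centred
    returns from their empirical distribution, multiply cellwise by
    standard normal noise, then add back the sample mean."""
    T0, p = XS.shape
    mu_hat = XS.mean(axis=0)                        # Step 2: sample mean
    Y = XS - mu_hat                                 # Step 2: centred returns
    eps = rng.standard_normal((horizon, p))         # Step 1: N(0, I_p) noise
    Y_star = Y[rng.integers(0, T0, size=horizon)]   # Step 2: empirical draw
    return mu_hat + Y_star * eps                    # Step 3: cellwise product
```

Repeating the call B times yields the B bootstrapped short-memory panels.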



Bootstrap procedure for Xt(L,b)
Step 1: Estimate d̂ from the observed returns by means of Whittle's approximate likelihood:

d̂ = arg min_{d∈(0,1/2)} L(d, Σ)

where

L(d, Σ) = (2/T0) Σ_{j=1}^{(T0−1)/2} [ log det f(λj,T0, d, Σ) + tr{ f(λj,T0, d, Σ)−1 I(λj,T0) } ],
f(λ, d, Σ) = (|1 − exp(iλ)|^{−2d} / (2π)) Σ,
I(λ) = | (2πT0)^{−1/2} Σ_{t=1}^{T0} XtL e^{itλ} |^2   and
λj,T0 = 2πj/T0.
Step 2: Compute {ǫ̂tL} for t = 1, . . . , T0:

ǫ̂tL = Σ_{k=0}^{t−1} πk XLt−k,   where   πk = Γ(k − d̂)/(Γ(k + 1)Γ(−d̂)) for k ≤ 100,   πk = k^{−d̂−1}/Γ(−d̂) for k > 100.

Step 3: Generate {ǫt(L,b)} for t = T0 + 1, . . . , T and b = 1, . . . , B from the empirical distribution of
{ǫ̂tL}.
Step 4: Generate {Xt(L,b)} for t = T0 + 1, . . . , T and b = 1, . . . , B as

Xt(L,b) = Σ_{k=0}^{t−T0−1} τk ǫ(L,b)t−k + Σ_{k=t−T0}^{t−1} τk ǫ̂Lt−k.
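The truncated AR(∞) filter of Step 2 can be sketched as follows for a single long-memory series. This is an illustration with hypothetical names; the switch to the asymptotic form of πk beyond k = 100 follows the display above:

```python
import numpy as np
from math import gamma

def pi_coeffs(d, K):
    """AR(infinity) coefficients pi_k of the fractional filter, using the
    exact Gamma-function form for k <= 100 and the asymptotic form
    k^(-d-1)/Gamma(-d) for k > 100, as in Step 2."""
    out = np.empty(K)
    for k in range(K):
        if k <= 100:
            out[k] = gamma(k - d) / (gamma(k + 1) * gamma(-d))
        else:
            out[k] = k ** (-d - 1) / gamma(-d)
    return out

def sieve_residuals(XL, d):
    """epsilon-hat_t = sum_{k=0}^{t-1} pi_k X^L_{t-k}, for t = 1, ..., T0."""
    T0 = len(XL)
    pis = pi_coeffs(d, T0)
    return np.array([pis[:t + 1] @ XL[t::-1] for t in range(T0)])
```

The residuals returned here are the ones resampled in Step 3 to build the bootstrapped long-memory paths of Step 4.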


We gather Xt(S,b) and Xt(L,b) into Xt(b) = (Xt(S,b), Xt(L,b)) = (X1,t(b), . . . , Xp+q,t(b)) for t = T0 +
1, . . . , T and b = 1, . . . , B. The bootstrapped Reserve Funds FT(b) = (F1,T(b), . . . , Fp+q,T(b)) are given by

Fi,T(b) = Fi,T0 exp(Σ_{s=T0+1}^{T} Xi,s(b)) − Σ_{s=T0+1}^{T} ci,s exp(Σ_{s′=s+1}^{T} Xi,s′(b)).

And the bootstrapped Reserve Fund portfolio is

FT(b)(w) = w′FT(b) = Σ_{i=1}^{p+q} wi Fi,T(b).

Finally, the estimated portfolio weights that give the optimal portfolio are

ŵopt = arg max_w µ(b)(w)/σ(b)(w),

where µ(b)(w) and σ(b)(w) may take any of the three forms introduced earlier, but evaluated on the
bootstrapped returns or Reserve Funds.

Based on the observed data {(X1S, X1L), . . . , (X466S, X466L)} (where XtS = (Xt(DB), Xt(FB), Xt(cash))
and XtL = (Xt(DE), Xt(FE))), we estimate the optimal portfolio weights, ŵopt1, ŵopt2 and ŵopt3,
corresponding to the three forms (Definition 7.2.1–Definition 7.2.3) for the expectation (i.e.,
µ(b)(w) ≡ w′E(b)(XT(b)) or µ(b)(w) ≡ w′E(b)(FT(b))) and the risk (i.e., σ(b)(w) ≡ √(w′V(b)(XT(b))w),
σ(b)(w) ≡ √(w′V(b)(FT(b))w) or σ(b)(w) ≡ E(b)[{F̃ − FT(b)(w)} I{FT(b)(w) < F̃}]) of the Sharpe
ratio, where

E(b)(T(b)) := (1/B) Σ_{b=1}^{B} T(b)   and   V(b)(T(b)) := (1/B) Σ_{b=1}^{B} {T(b) − E(b)(T(b))}{T(b) − E(b)(T(b))}′
for any bootstrapped sequence {T(b)}. In addition, we compute the trajectory of the optimal Reserve
Fund for t = T0 + 1, . . . , T. For liquidity reasons, the portfolio weight for cash is kept constant
at 5%. The target period is fixed at 100 years, and the maintenance cost is based on the 2004 Pension
Reform.

7.2.4 Results and Discussion


Table 7.3 shows the estimated optimal portfolio weights for the three different choices of the port-
folio expectation and risk. The weight of domestic bonds is very high and clearly dominates the
other assets. Domestic bonds are low risk and medium return, in contrast with equity, which
shows a higher return but also higher risk, and with foreign bonds, which show a low return
and risk. Therefore, in a sense, domestic bonds are a compromise between the characteristics of the
other equities and bonds. Figure 7.9 shows the trajectory of the future Reserve Fund for different

Table 7.3: Estimated optimal portfolio weights

DB DE FB FE cash
Returns (ŵopt1 ) 0.95 0.00 0.00 0.00 0.05
Reserve Fund (ŵopt2 ) 0.75 0.00 0.20 0.00 0.05
Low Partial (ŵopt3 ) 0.85 0.10 0.00 0.00 0.05

values of the yearly return (assumed to be constant from T 0 + 1 to T ) ranging from 2.7% to 3.7%.
Since the investment term is extremely long, 100 years, the Reserve Fund is quite sensitive to the
choice of the yearly return. In the 2004 Pension Reform, authorities assumed a yearly return of the
portfolio of 3.2%, which corresponds to the middle thick line of the figure.
This simulation result shows that the portfolio weight of domestic bonds is quite high. The
reason is that the investment term is extremely long (100 years). Because the investment risk for
the Reserve Fund is exponentially amplified year by year, the portfolio selection problem for the
Reserve Fund is quite sensitive to the year-based portfolio risk. Note that the above result is obtained
only under the assumption without reconstruction of the portfolio weights. If we can relax this
assumption, the result would be significantly changed. This assumption depends on the liquidity of
the portfolio assets.

7.2.5 Conclusions
In this section, optimal portfolio weights to invest for five asset categories such as Domestic Bond
(DB), Domestic Equity (DE), Foreign Bond (FB), Foreign Equity (FE) and cash are investigated. In
this study, selected optimal portfolio weights are fixed at the end-of-period target (e.g., 100 years),
and the objective function for the optimization is defined based on the randomized (future) Reserve
Fund. The future expected return and the future risk are estimated by using some bootstrap methods.
The result shows that the weight of DB is very high, because DB has very low risk and risk is highly
amplified over the long investment horizon considered in this study.

7.3 Microarray Analysis Using Rank Order Statistics for ARCH Residual
Statistical two-group comparisons are widely used to identify significant differentially expressed
(DE) signatures against a therapy response in microarray data analysis. For DE analysis, we applied
rank order statistics based on an autoregressive conditional heteroskedasticity (ARCH) residual em-
pirical process, introduced in Chapter 5. This approach was considered for publicly available datasets
and compared with two-group comparisons based on the original data and on autoregressive (AR)
residuals (Solvang and Taniguchi (2017a)). The significant DE genes by the ARCH and AR residuals were
MICROARRAY ANALYSIS USING RANK ORDER STATISTICS FOR ARCH RESIDUAL 249

Figure 7.9: Trajectory of the Future Reserve Fund for different values of the yearly return ranging
from 2.7% to 3.7%.

reduced by about 20–30% relative to the genes selected from the original data. Almost 100% of the
genes selected by ARCH residuals are covered by the genes selected from the original data, unlike the
genes selected by AR residuals. GO enrichment and pathway analyses indicate consistent biological
characteristics between the genes obtained from ARCH residuals and from the original data. ARCH
residual array data might thus contribute to refining the number of significant DE genes while detecting
the same biological features as ordinary microarray data.

7.3.1 Introduction
Microarray technology provides a high-throughput way to simultaneously investigate gene expres-
sion information on a whole genome level. In the field of cancer research, the genome-wide ex-
pression profiling of tumours has become an important tool to identify gene sets and signatures that
can be used as clinical endpoints, such as survival and therapy response (Zhao et al. (2011)). When
we are contrasting expressions between different groups or conditions (i.e., the response is
polytomous), such important genes are described as differentially expressed (DE) (Yang et al. (2005)).
To identify important genes, a statistical scheme is required that measures and captures evidence
for a DE per gene. If the response consists of binary data, the DE is measured using a two-group
comparison for which such statistical methods as t-statistics, the statistical analysis of microarrays
(SAM) (Tusher et al. (2001)), fold change, and B statistics have been proposed (Campain and Yang
(2010)). The p-value of the statistics is calculated to assess the significance of the DE genes. The
p-value per gene is ranked in ascending order; however, selecting significant genes must be con-
sidered by multiple testing corrections, e.g., false discovery rate (FDR) (Benjamini and Hochberg
(1995)), to avoid type I errors. Even if significant DE genes are identified by the FDR procedure,
the gene list may still include too many genes, given that a statistical test is applied to a substantial
number of probes across the whole genome. Such a long list of significant DE genes complicates
capturing gene signatures that should provide robust clinical and pathological prognostic
and predictive factors to guide patient decision-making and the selection of treatment options.
As one approach to this challenge, rank order statistics for two-sample problems pertaining to
empirical processes, based on the residuals from autoregressive conditional heteroskedasticity (ARCH)
models, are proposed to refine the significant DE gene list. The ARCH process was proposed
by Engle (1982), and the model was developed in much research to investigate a daily return series
from finance domains. The series indicate time-inhomogeneous fluctuations and sudden changes
of variance called volatility in finance. Financial analysts have attempted more suitable time series
modeling for estimating this volatility. Chandra and Taniguchi (2003) proposed rank order statistics
and the theory provided an idea for applying residuals from two classes of ARCH models to test the
innovation distributions of two financial returns generated by such varied mechanisms as different
countries and/or industries. Empirical residuals called “innovation” generally perturb systems be-
hind data. Theories of innovation approaches to time series analysis have historically been closely
related to the idea of predicting dynamic phenomena from time series observations. Wiener’s theory
is a well-known example that deems prediction error to be a source of information for improving
the predictions of future phenomena. In a sense, innovation is a more positive label than prediction
error (Ozaki and Iino (2001)). As we will see, the innovation distribution of an ARCH process resembles
the sequence of expression levels along the whole genomic location. As for applying the time indices
of the ARCH model to genomic locations, time series mining has been practically used in DNA sequence
data analysis (Stoffer et al. (2000)) and microarray data analysis (Koren et al. (2007)). To
investigate the data’s properties, we believe that innovation analysis is more effective than analysis
just based on the original data. While the original idea in Chandra and Taniguchi (2003) was based
on squared residuals from an ARCH model, not-squared empirical residuals are also theoretically
applicable, as introduced in Lee and Taniguchi (2005). In this study, we apply this idea to test DEs
between two sample groups in microarray datasets that we assume to be generated by different
biological conditions.
To investigate whether ARCH residuals can consistently refine a list of significant DE genes,
we apply publicly available datasets called Affy947 (van Vliet et al. (2008)) for breast cancer re-
search to compare significant gene signatures. As a statistical test for two-group comparisons, the
estrogen receptor (ER) is applied in clinical outcomes to identify prognostic gene expression signa-
tures. Estrogen is an important regulator of the development, the growth, and the differentiation of
normal mammary glands. It is well documented that endogenous estrogen plays a major role in the
development and progression of breast cancer. ER expression in breast tumours is frequently used
to group breast cancer patients in clinical settings, both as a prognostic indicator and to predict the
likelihood of response to treatment with antiestrogen (Rezaul et al. (2010)). If the cancer is ER+,
hormone therapy using medication slows or stops the growth of breast cancer cells. If the cancer is
ER−, then hormonal therapy is unlikely to succeed. Based on these two categorical factors for ER
status, we applied our proposed statistical test to the expression levels for each genomic location.
After identifying significant DE genes, biological enrichment analyses use the gene list and seek bi-
ological processes and interconnected pathways. These analyses support the consistency for refined
gene lists obtained by ARCH residuals.

7.3.2 Data
Due to the extensive usage of microarray technology, in recent years publicly avail-
able datasets have exploded (Campain and Yang (2010)), including the Gene Expression
Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) (Edgar et al. (2002)) and ArrayExpress
(http://www.ebi.ac.uk/microarray-as/ae/). In this study for breast cancer research, we used five
different expression datasets, collectively called the Affy947 expression dataset (van Vliet et al.
(2008)). These datasets, which all measure the Human Genome HG U133A Affymetrix arrays, are
normalized using the same protocol and are assessable from GEO with the following identifiers:
GSE6532 for the Loi et al. dataset (Loi et al. (2007), Loi), GSE3494 for the Miller et al. dataset
(Miller et al. (2005), Mil), GSE7390 for the Desmedt et al. dataset (Desmedt et al. (2007), Des),
and GSE5327 for the Minn et al. dataset (Minn et al. (2005), Min). The Chin et al. dataset (Chin
et al. (2006), Chin) is available from ArrayExpress. This pooled dataset was preprocessed and nor-
malized, as described in Zhao et al. (2014). Microarray quality control assessment was carried out
using the R AffyPLM package from the Bioconductor web site (http://www.bioconductor.org, Bol-
stad et al. (2005)). The Relative Log Expression (RLE) and Normalized Unscaled Standard Errors
(NUSE) tests were applied. Chip pseudo-images were produced to assess artifacts on the arrays
that did not pass the preceding quality control tests. The selected arrays were normalized by a
three-step procedure using the RMA expression measure algorithm (http://www.bioconductor.org,
Irizarry et al. (2003)): RMA background correction convolution, the median centering of each gene
separately across arrays for each dataset, and the quantile normalization of all arrays. Gene mean
centering effectively removes many dataset-specific biases, allowing for effective integration of
multiple datasets (Sims et al. (2008)). The total number of probes is 22,268 for these microarray data.
Against all probes that covered the whole genome, we use probes that correspond to the intrinsic
signatures that were obtained by classifying breast tumours into five molecular subtypes (Perou et al.
(2000)). We extracted 777 probes from the whole 22K probes for the microarray datasets using
the intrinsic annotation included in the R codes in Zhao et al. (2014). As the response contrasting expression between the two groups, we used the status of a hormone receptor, the estrogen receptor (ER), which, like the progesterone receptor, indicates whether hormone therapy is likely to be effective and is critical for determining prognostic and predictive factors. ER status is used to classify breast tumours into ER-positive (ER+) and ER-negative (ER−) diseases. The two upper panels in Figure 7.10 present the mean of the microarray data obtained by averaging all of the previously described samples (Desmedt et al. (2007)). The left and right plots correspond to the ER+ and ER− samples, respectively. The data for ER− show more fluctuation than those for ER+. The two lower panels show histograms of the averaged data for ER+ and ER−; both exhibit sharper peaks and heavier tails than an ordinary Gaussian distribution.

Figure 7.10: Upper: Mean of expression levels for ER+ (left) and ER− (right) across all Des sam-
ples. Lower: Histograms of expression levels for ER+ (left) and ER− (right)
252 NUMERICAL EXAMPLES
7.3.3 Method
Denote the sample and the genomic location in the microarray data x_{ij} by i and j, respectively. The samples are divided into two biologically different groups: one group comprises breast cancer tumours driven by ER+ and the other comprises breast cancer tumours driven by ER−. We apply two-group comparison testing to identify significantly different expression levels between the ER+ and ER− groups for each gene (genomic location). As the statistical test, we propose the rank order statistics for the ARCH residual empirical process introduced in Section 7.3.3.1. For comparison with the ARCH model's performance, we also apply the two-group comparison testing to the original array data and to the residuals obtained by an ordinary AR (autoregressive) model. Both methods are summarized in Section 7.3.3.2. For the obtained significant DE gene lists, biologists or medical scientists require further analysis for biological interpretation, to investigate the underlying biological processes or networks. In this section, we apply GO (gene ontology) analysis, described in Section 7.3.3.3, and pathway analysis, described in Section 7.3.3.4, methods that are generally used to investigate specific genes or relationships among gene groups. Our proposed algorithm is summarized in Figure 7.11.

7.3.3.1 The rank order statistic for the ARCH residual empirical process
Suppose that a class of ARCH(p) models is generated following Equation (5.3.1) and that another class of ARCH(p) models, independent of {Xt }, is generated similarly following Equation (5.3.2). We are interested in the two-sample testing problem given by (5.3.3). In this section, F(x) and G(x) correspond to the distributions of the expression data for samples driven by ER+ and ER−, respectively.
For this testing problem, we consider a class of rank order statistics, such as Wilcoxon's two-sample test. The statistic is derived from the squared empirical residuals ε̂_t² = X_t²/σ_t²(θ̂_X), t = 1, · · · , n, and ξ̂_t² = Y_t²/σ_t²(θ̂_Y), t = 1, · · · , n. Because Lee and Taniguchi (2005) developed the asymptotic theory for the non-squared empirical residuals, we may apply their results to ε̂_t and ξ̂_t.
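As a concrete illustration, the empirical residuals of an ARCH(1) model can be computed as follows. This is a minimal sketch: in the actual analysis the parameters would come from a fitted model (e.g., garchFit) with the order chosen by AIC, whereas here we simulate with known parameters and standardize back.

```python
import math
import random

def arch1_empirical_residuals(x, omega, alpha):
    """Empirical residuals e_t = X_t / sigma_t for an ARCH(1) model,
    where sigma_t^2 = omega + alpha * X_{t-1}^2."""
    resid = []
    for t in range(1, len(x)):
        sigma2 = omega + alpha * x[t - 1] ** 2
        resid.append(x[t] / math.sqrt(sigma2))
    return resid

# Simulate an ARCH(1) series with known parameters, then standardize it back.
random.seed(1)
omega, alpha = 0.5, 0.3
x = [0.0]
for _ in range(500):
    sigma = math.sqrt(omega + alpha * x[-1] ** 2)
    x.append(sigma * random.gauss(0.0, 1.0))

eps = arch1_empirical_residuals(x, omega, alpha)
# With the true parameters, the residuals should look roughly standard normal.
mean = sum(eps) / len(eps)
var = sum((e - mean) ** 2 for e in eps) / len(eps)
```

With the true parameters the residuals coincide with the underlying innovations, so their sample mean and variance should be near 0 and 1.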

7.3.3.2 Two-group comparison for microarray data


To obtain the empirical residuals, the ARCH model is applied to the vector {x_{i1}, x_{i2}, · · · , x_{iL}} for the ith sample, where L is the total number of genomic locations in the microarray data. Assuming that the ER+ and ER− samples correspond to the distributions F(x) and G(x) as above, the orders pX and pY of the ARCH models are identified by model selection using the Akaike Information Criterion (AIC), where the model with the minimum AIC is defined as the best fit model (Akaike (1974)) (see 1 in Figure 7.11). According to the response, the empirical residuals are grouped as ε_{ij}^+ and ξ_{ij}^−. The Wilcoxon statistic is applied as the rank order statistic to these two groups at each genomic location j, and the p-value is calculated (see 2 in Figure 7.11). The p-value vector is adjusted for multiple testing using the FDR (Benjamini and Hochberg (1995)) (see 3 in Figure 7.11).
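The per-location Wilcoxon test with Benjamini–Hochberg FDR adjustment can be sketched in a few lines. This is a self-contained illustration with toy numbers; the study itself uses R's wilcox.test and an FDR routine.

```python
import math
from statistics import NormalDist

def wilcoxon_rank_sum_p(a, b):
    """Two-sided p-value for the Wilcoxon rank-sum test via the normal
    approximation (midranks for ties)."""
    n1, n2 = len(a), len(b)
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(combined):          # assign midranks to tied runs
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    w = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 2 * (1 - NormalDist().cdf(abs((w - mu) / sigma)))

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank_from_top, i in enumerate(reversed(order)):
        k = m - rank_from_top         # 1-based rank, largest p first
        prev = min(prev, pvals[i] * m / k)
        adj[i] = prev
    return adj

# Separated groups at one location give a small p-value; BH then adjusts
# across locations.
pvals = [wilcoxon_rank_sum_p([1.2, 0.8, 1.5, 1.1], [3.2, 2.9, 3.8, 3.5]),
         wilcoxon_rank_sum_p([1.0, 1.1, 0.9, 1.2], [1.1, 0.9, 1.2, 1.0])]
adjusted = bh_fdr(pvals)
```

In the pipeline this pair of functions would be run once per genomic location j, comparing the residual groups ε_{ij}^+ and ξ_{ij}^−.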
For comparison with the ARCH model's performance, we applied the same procedure to the original data and to the residuals of the autoregressive (AR) model. The AR model represents the current value as a weighted average of past values, x_{ij} = Σ_{k=1}^{K} β_{ik} x_{i,j−k} + w_{ij}, where β_{ik}, K, and w_{ij} are the AR coefficients, the AR order, and the error term, respectively. The AR model is widely applied in time series analysis and signal processing in economics, engineering, and science. In this study, we apply it to the expression data of the two groups, ER+ and ER−. The AR order of the best fit model is identified by the AIC. The empirical residuals w_{ij}^+ for ER+ and w_{ij}^− for ER− are obtained by subtracting the model predictions from the data. These procedures are summarized as follows: 1) take the original microarray and ER data from one study; 2) apply the ARCH and AR models to the original data for each sample and identify the best fit model among the model candidates with 1–10 time lags; 3) obtain the residuals of the best fit model by subtracting its predictions from the original data; 4) apply the Wilcoxon statistic to the original data and to the ARCH and AR residuals; 5) list the p-values and identify the significant FDR (5%) corrected genes; 6) apply gene ontology analysis and pathway analysis (see the details in Sections 7.3.3.3 and 7.3.3.4) for biological interpretation of the obtained gene list (see 4 in Figure 7.11). The computations were carried out in R with the garchFit function for ARCH fitting, the ar.ols function for AR fitting, wilcox.test as the rank-sum test, and fdr.R for the FDR adjustment.

Figure 7.11: Procedures for generating residual array data and analysis by rank order statistics
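Step 3 of the procedure, fitting an AR model and taking its residuals, can be illustrated for order one. This is a sketch without an intercept; ar.ols handles general orders and intercepts.

```python
import random

def ar1_fit_and_residuals(x):
    """Fit x_j = beta * x_{j-1} + w_j by ordinary least squares and return
    (beta_hat, residuals): a bare-bones version of AR(1) fitting."""
    num = sum(x[j] * x[j - 1] for j in range(1, len(x)))
    den = sum(x[j - 1] ** 2 for j in range(1, len(x)))
    beta = num / den
    resid = [x[j] - beta * x[j - 1] for j in range(1, len(x))]
    return beta, resid

# On an exactly AR(1)-generated sequence, OLS recovers beta and the
# residuals approximate the innovations.
random.seed(0)
w = [random.gauss(0, 1) for _ in range(300)]
x = [0.0]
for t in range(300):
    x.append(0.6 * x[-1] + w[t])
beta_hat, resid = ar1_fit_and_residuals(x)
```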

7.3.3.3 GO analysis
To investigate the gene product attributes from the gene list, gene ontology (GO) analysis was per-
formed to find specific gene sets that are statistically associated among several biological categories.
GO is designed as a functional annotation database to capture the known relationships between bio-
logical terms and all the genes that are instances of those terms. It is widely used by many functional
enrichment tools and is highly regarded both for its comprehensiveness and its unified approach for
annotating genes in different species to the same basic set of underlying functions (Glass and Gir-
van (2012)). Many tools have been developed to explore, filter, and search the GO database. In our
study, GOrilla (Eden et al. (2009)) was used as a GO analysis tool. GOrilla is an efficient web-based
interactive user interface that is based on a statistical framework called minimum hypergeometric
(mHG) for enrichment analysis in ranked gene lists, which are naturally represented as functional
genomic information. For each GO term, the method independently identifies the threshold at which
the most significant enrichment is obtained. The significant mHG scores are accurately and tightly
corrected for threshold multiple testing without time-consuming simulations (Eden et al. (2009)).
The tool identifies enriched GO terms in ranked gene lists for background gene sets which are
obtained by the whole genomic location of microarray data. GO consists of three hierarchically
structured vocabularies (ontologies) that describe gene products in terms of their associated bio-
logical processes, cellular components, and molecular functions. The building blocks of GO are
called terms, and the relationship among them can be described by a directed acyclic graph (DAG),
which is a hierarchy where each gene product may be annotated to one or more terms in each ontol-
ogy (Glass and Girvan (2012)). GOrilla requires a list of gene symbols as input data. The obtained
significant Entrez gene lists after FDR correction are converted into gene symbols using a web-based
database called SOURCE (Diehn et al. (2003)), which was developed by the Genetics Department
of Stanford University.
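The mHG idea underlying GOrilla can be sketched as follows: for a ranked gene list, compute the hypergeometric tail probability of the term's occurrences above every cutoff and take the minimum. This is an uncorrected sketch on toy data; GOrilla additionally applies an exact multiple-testing correction.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N, B, n): probability of at least b
    marked genes among the top n of a ranked list of N genes, B marked."""
    return sum(comb(B, k) * comb(N - B, n - k)
               for k in range(b, min(B, n) + 1)) / comb(N, n)

def min_hypergeom(ranked_is_marked):
    """Minimum hypergeometric score: the smallest tail probability over all
    proper prefixes of the ranked list (uncorrected)."""
    N = len(ranked_is_marked)
    B = sum(ranked_is_marked)
    best = 1.0
    b = 0
    for n in range(1, N):
        b += ranked_is_marked[n - 1]
        best = min(best, hypergeom_tail(N, B, n, b))
    return best

# Marked genes concentrated at the top of the ranking give a small score;
# scattered marks do not.
enriched    = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
random_like = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
s_enriched = min_hypergeom(enriched)
s_random = min_hypergeom(random_like)
```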

7.3.3.4 Pathway analysis


As in GO analysis, the identified genes are mapped to well-defined biological pathways. Pathway analysis determines which pathways are overrepresented among genes that show significant variation. The difference from GO analysis is that pathway analysis includes interactions among a given set of genes. Several tools for pathway analysis have been published. In this study, we used a web-based analysis tool called REACTOME, a manually curated open-source, open-data resource of human pathways and reactions (Joshi–Tope et al. (2005)). REACTOME is a fast and sophisticated tool that has grown to include annotations for 7,088 of the 20,774 protein-coding genes in the current Ensembl human genome assembly, 15,107 literature references, and 1,421 small molecules organized into 6,744 reactions collected in 1,481 pathways (Joshi–Tope et al. (2005)).

7.3.4 Simulation Study


To investigate the performance of our proposed algorithm, we performed a simulation study. We first prepared a clinical indicator analogous to ER+ and ER−: the artificial indicator takes the value “1” for 50 samples and “0” for 50 samples. Next, we considered two types of artificial datasets, each consisting of 1,000 arrays and 100 samples. One set of array data (A) was generated from normal distributions; the mean and variance of the distribution were both set to 1.0 to generate the overall array data at once, and the data for arrays 201–400 and 601–800 were then replaced with data generated from a different normal distribution with mean 1.8 and variance 10. The other set of array data (B) was generated by an ARCH model. The model was applied to real array data (the Des data; see the details in Section 7.3.2), and the parameters (mu: the intercept; omega: the constant coefficient of the variance equation; alpha: the coefficients of the variance equation; skew: the skewness of the data; shape: the shape parameter of the conditional distribution, set to 3) were estimated for ER+ and ER−, respectively. We used these parameters and random numbers to generate the simulation data. For the computations, we used the normrnd command of MATLAB® to generate random variables from normal distributions for array data A, and the garchSim function of the R package fGarch for array data B. We iterated 100 times to generate the two array datasets. To the 100 datasets for A and B, we applied the two-group comparison to the original simulation data and to their ARCH residuals and identified the parts significant at 5% FDR.

7.3.5 Results and Discussion


7.3.5.1 Simulation data
For the simulation data and their ARCH residuals, the average numbers of identified 5% FDR significant parts and of overlapping parts are summarized in Table 7.4. For the simulation data generated from normal distributions (array set A), the numbers of significant parts for the original series and for the ARCH residuals hardly differed, and the parts identified by the two analyses were mostly the same. On the other hand, for the simulation data generated by the ARCH model (array set B), the original series yielded more significant parts than the ARCH residuals: the number of significant parts for the ARCH residuals was about 30% less than that for the original series. The overlap ratio was lower than in case A; however, over 60% of the parts were covered.

Table 7.4: Summary of the average for the identified significant parts

              1. Original series   2. ARCH residuals   Overlapped # of 2 with 1   Ratio for the overlapped #
Array set A          30.2                29.8                   29.3                       98.2%
Array set B         512.2               338.9                  212.1                       62.6%

7.3.5.2 Affy947 expression dataset


Based on the method explained in Section 7.3.3.2, the best fit AR and ARCH models were selected by the AIC for each sample. The estimated orders of all the best fit models for all the studies are summarized in Supplementary Table 1 https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-tXqB02h rxlvVlXqsa?dl=0. Figure 7.10 summarizes the ratio of the number of samples for each selected order to the total number of samples. These figures suggest that the most frequently selected order was one, while the ER+ samples tended to require more complicated models than the ER− samples.
Using residuals obtained by the best fit ARCH and AR models and the original data, we ap-
plied the Wilcoxon statistic to compare DEs between two groups divided by ER+ and ER−. The
significant genomic locations were assessed by an FDR. The locations were mapped on Entrez
gene IDs according to the Affy probes presented in the original microarray data and converted into
gene symbols by SOURCE. The identified genes in the original data and the ARCH residual anal-
yses are listed in Supplementary Table 2 https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-
tXqB02h rxlvVlXqsa?dl=0. Based on these gene lists, we investigated the overlapped significant
genes for the original data with significant genes for the ARCH and AR residuals and summarized
the results in Table 7.5. About 200–280 significant DE genes were identified with FDR correction in the studies of Des, Mil, Min, and Chin. For Loi, fewer significant genes were identified than in the other studies in all cases. Except for Loi, the number of significant genes for the ARCH residuals was about 20–35% less than the number of genes for the original data. The estimated genes

Table 7.5: Summary for FDR 5% adjusted Entrez genes of five datasets

Model   Data                            Des     Mil     Min     Loi     Chin
-       Original   #EntrezID            245     238     274      53     277
                   (unique)            (186)   (176)   (195)    (47)   (201)
ARCH    Residual   #EntrezID            193     175     207      46     177
                   (unique)            (152)   (139)   (154)    (41)   (133)
                   Overlapped with      100     100     100      98     100
                   original [%]
AR      Residual   #EntrezID            194     161     183      37     178
                   (unique)            (152)   (131)   (141)    (34)   (139)
                   Overlapped with       95      99      87      92      98
                   original [%]

Note: Values in parentheses indicate the number of unique genes, avoiding duplicate and multiple genes from the obtained gene list. Percentages for “overlapped with original” indicate the ratio of significant genes for the ARCH or AR residuals that overlap with significant genes in the original data.

Table 7.6: Identified differentially expressed genes (FDR 5%) and cytobands for ER status in origi-
nal data and empirical ARCH residuals

            Original                            ARCH residuals

cytoband    genes                      cytoband    genes
1p13.3      VAV3, GSTM3, CHI3L2        1p13.3      VAV3, GSTM3
1p32.3 ECHDC2
1p34.1 CTPS1
1p35 IFI6 1p35 IFI6
1p35.3 ATPIF1 1p35.3 ATPIF1
1p35.3-p33 MEAF6
1q21 S100A11, S100A1 1q21 S100A1, S100A11
1q21.1 PEA15 1q21.1 PEA15
1q21.3 CRABP2 1q21.3 CRABP2
1q23.2 COPA 1q23.2 COPA
1q24-q25 CACYBP 1q24-q25 CACYBP
...
Note: Complete table is shown in Table 3 in Solvang and Taniguchi (2017a).

for all the datasets (except for Loi's case) overlapped 100% with the estimated genes for the original data. The numbers of significant genes for the AR residuals in the cases of Mil, Min, and Loi were smaller than those for the ARCH residuals, again amounting to roughly a 20–35% reduction relative to the original data, except for Loi. However, the genes for the AR residuals did not overlap 100% with the genes for the original data, unlike the case of the ARCH residuals, suggesting that empirical ARCH residuals may be more effective than AR residuals at correctly identifying important genes from a long gene list.
To investigate the genes identified by the ARCH residuals that overlap with genes identified by the original data, the corresponding cytobands and gene symbols are summarized in Table 7.6. The total numbers of common genes across the four studies were 132 for the original data and 99 for the ARCH residuals. If we take into
account Loi’s case, the total numbers of common genes across all studies for the original and ARCH
residuals are 12 and 9. The genes obtained by the ARCH residuals were completely covered by the
genes obtained by the original data. The results from the ARCH residuals covered several important genes for breast cancer, such as TP53 in the chromosome 17p region, ERBB2 in the chromosome 17q region, and ESR1 in the chromosome 6q region, even though the number of identified Entrez genes was smaller than the number identified from the original data.
Next, we performed GO enrichment analysis using significant DE gene lists for the original
data and ARCH residual analyses in all studies. To correctly find the enriched GO terms for the
associated genes, a background list was prepared of all the probes included in the original mi-
croarray data. The Entrez genes in the background list were converted into 13,177 gene symbols
without any duplication by SOURCE. As the input gene lists to GOrilla, the numbers of sum-
marized unique genes are shown in the parentheses of Table 7.5. All the associated GO terms
for the original and ARCH residuals in all the studies are summarized in Supplementary Table 3
https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-tXqB02h rxlvVlXqsa?dl=0. Since the estimated gene symbols in Loi's case numbered fewer than half of those in the other studies, few associated GO terms were identified in the biological process and cellular component categories, and no GO terms in the molecular function category. Also, significant DE genes for the ARCH residuals contributed to
finding additional associated GO terms that did not appear in the GO terms for the original data,
e.g., mammary gland epithelial cell proliferation for Des, a single-organism metabolic process for
Des, an organonitrogen compound metabolic process for Mil and Min, and a single-organism devel-
opmental process for Min and Chin, all of which are related to meaningful biological associations
like cellular differentiation, proliferation, and metabolic pathways in cancer cells (Glass and Girvan
(2012)).
Table 7.7 summarizes the common GO terms of the biological processes for Des, Mil,
Min, and Chin and presents 13 terms for the original data. The terms for the ARCH residu-
als mostly overlapped with them except for Mil’s case. As shown in Supplementary Table 3
https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-tXqB02h rxlvVlXqsa?dl=0, two terms in
the molecular function and eight in the cellular components were commonly identified by the orig-
inal data. The GO terms for the ARCH residuals covered them, and more terms were shown in
the molecular function. Furthermore, to investigate the consistency of the refined significant gene
lists, we applied pathway analysis to the significant DE genes for the original and ARCH residuals
listed in Table 7.6. All the identified pathways with Entities FDR (<1.0) and associated genes are
summarized in Supplementary Table 4 https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-
tXqB02h rxlvVlXqsa?dl=0. In the pathway components shown in Supplementary Table 3
https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-tXqB02h rxlvVlXqsa?dl=0, ERBB2 sig-
naling, EGFR, cell-cycle, immune system, metabolic pathway, AKT signaling and Wnt pathway are
well-known important breast cancer signaling pathways (Teschendorff et al. (2007)). We took them
to be representative of important pathways and counted the number of identified pathways related
to these components in the case of the original and ARCH residuals. The number and associated
gene symbols are summarized in Table 7.8. The representative pathways were mostly covered by
the significant DE genes for the ARCH residuals. This result supports the view that the refined gene lists obtained by the ARCH residuals generally capture the differentiation of breast tumours based on ER status, and that the limited DE gene lists for the ARCH residuals did not overlook any important biological information.

7.3.6 Conclusions
We applied a rank order statistic for an ARCH residual empirical process to refine significant DE
genes by two-group comparison in microarray analysis. Our approach considered publicly available
gene expression datasets and the clinical output for ER in addition to the simulation study. We
compared the analysis performance of the ARCH residuals with that of the AR residuals and the original microarray data. While the genes for the AR residuals did not cover 100% of the genes for
Table 7.7: Common associated biological processes among Des, Mil, Min, and Chin for original
and ARCH residuals

Des Mil Min Chin


Associated GO terms
Orig Arch Orig Arch Orig Arch Orig Arch
epithelial cell proliferation + + + + + + + +
response to estrogen + + + + + + + +
epidermis development + + + + + +
regulation of phosphatidylinositol + + + + + + +
3-kinase activity
erythropoietin-mediated signaling + + + + + +
pathway
regulation of lipid kinase activity + + + + + + + +
phosphatidylinositol 3-kinase signaling + + + + + + + +
positive regulation of phosphatidylinositol + + + + + + + +
3-kinase activity
phenylpropanoid catabolic process + + + + + + + +
mast cell differentiation + + + + + + + +
extracellular vesicle + + + + + + +
extracellular vesicular exosome + + + + + + +
extracellular region part + + + + + + +

Table 7.8: Identified important breast cancer signaling pathways and associated gene symbols ob-
tained from original data and ARCH residuals

                     Original                        ARCH residuals

Pathways             number   Gene symbol            number   Gene symbol
ERBB2 signaling 7 ERBB2, KIT, NRG1 6 ERBB2
EGFR pathways 11 FLT1, KIT, PIK3R1, 10 ERBB2, FLT1,
VAV3 PIK3R1
Cell cycle 5 CDH1, MCM3 3 MCM3, NUMA1
Immune system 5 CDH1, KIT, 5 ERBB2, IFI6, STAT6
PIK3R1, STAT6
Metabolic disorder 1 SAT1 1 FZD1
PI3K/AKT signaling 5 KIT, PIK3R1 5 ERBB2, PIK3R1
Wnt pathway 5 FZD1 5 FZD1

the original data analysis, the genes from the ARCH residuals overlapped almost 100% with the original data, and the gene lists were reduced by about 30% relative to those obtained from the original data analysis. We confirmed a similar 30% reduction in the simulation study. In the GO enrichment and pathway analyses, the results from the ARCH residuals were mostly covered by the associated biological terms obtained from the original data analysis and presented additional important GO terms in biological processes. These results suggest that data processing using ARCH residual array data could contribute to refining the significant DE genes that follow the required gene signatures, provide prognostic accuracy, and guide clinical decisions.
PORTFOLIO ESTIMATION FOR SPECTRAL DENSITY OF TIME SERIES DATA 259
7.4 Portfolio Estimation for Spectral Density of Categorical Time Series Data
Stoffer et al. (1993) proposed a way of scaling categorical time series data and presented a frequency domain framework called the Spectral Envelope (SpecEnv). The SpecEnv can be understood as the largest proportion of the total power achieved by optimizing the scaling parameters. We consider this similar in concept to the portfolio approach to measuring risk diversification proposed by Meucci (2010). In this section, we explore the relationship between the SpecEnv and the diversification index and, moreover, provide a portfolio estimation of spectral density. Furthermore, we apply this portfolio estimation to the study of DNA sequences and confirm its consistency with the results shown in Stoffer et al. (2000) and Solvang and Taniguchi (2017b).

7.4.1 Introduction
Stoffer et al. (1993) presented a method of scaling categorical time series data and a framework of
spectral analysis called Spectral Envelope (SpecEnv). The SpecEnv approach envelops the standard-
ized spectrum of any scaled process. For a categorical time series, the spectral envelope discovers
the efficient periodic components in it. The method was applied to DNA sequence data for a specific
gene and used to explore the emphasized periodic feature, the strong-weak bonding alphabet of the
sequence, and the coding/noncoding areas (Stoffer et al. (2000)). The SpecEnv and the optimum
scales correspond to the largest eigenvalue and the eigenvector of the matrix for the standardized
spectral density of the scaled categorical data. In other words, since the scaling characterizes the
frequency-based categorical contribution to maximize the diversification of the data, it comes to
serve, for example, as principal component analysis of multivariate time series. We find similar at-
tempts to this approach in the principal portfolio proposed by Rudin and Morgan (2006) and the
diversification analysis proposed by Meucci (2010). They introduced a diversification index that
represents the effective number of bets in a portfolio, based on maximizing an entropy calculated from the contribution of each asset to the total portfolio.
minimum cross-entropy have been extensively applied in finance (Zhou et al. (2013)). For a robust
portfolio problem, Xu et al. (2014) pointed out that entropy serves as a measure of portfolio di-
versification and suggested that the maximum entropy method provides better performance than a
classical mean-variance model. In this section, we explore the relationship between SpecEnv and
the diversification index and provide a procedure to estimate a portfolio for the spectral density of
categorical time series data. We also apply the proposed procedure to the DNA sequence data used
for the analysis in Stoffer et al. (2000).

7.4.2 Method
7.4.2.1 Spectral Envelope
First, we briefly summarize the fundamental theory of SpecEnv according to Stoffer et al. (1993).
A stationary time series {X_t, t = 0, ±1, ±2, · · · } is decomposed into orthogonal components at frequency λ measured in cycles per unit of time, where −π ≤ λ ≤ π. When we observe a numerical time series X_t, t = 1, · · · , n, the periodogram is defined as

    I_n(λ) = (2πn)^{-1} | Σ_{t=1}^{n} X_t e^{itλ} |².

Since the periodogram is not a consistent estimator of the spectral density, it is smoothed as

    f(λ_j) = Σ_{q=−m}^{m} h_q I_n(λ_{j+q}),

where j = 1, 2, . . . , [n/2] ([n/2] is the greatest integer less than or equal to n/2) and the weights are chosen so that h_q = h_{−q} ≥ 0 and Σ_{q=−m}^{m} h_q = 1. We use the case h_q = 1/(2m + 1) for q = 0, ±1, · · · , ±m, known as the Daniell kernel smoother (a moving average).
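The periodogram and its Daniell smoothing can be illustrated by direct computation (a sketch; in practice the fast Fourier transform is used rather than this direct summation).

```python
import cmath
import math

def periodogram(x, lam):
    """I_n(lambda) = (2*pi*n)^(-1) * |sum_t x_t e^{i t lambda}|^2."""
    n = len(x)
    s = sum(x[t] * cmath.exp(1j * (t + 1) * lam) for t in range(n))
    return abs(s) ** 2 / (2 * math.pi * n)

def daniell_smooth(values, m):
    """Equal-weight moving average 1/(2m+1), truncated at the ends."""
    out = []
    for j in range(len(values)):
        lo, hi = max(0, j - m), min(len(values) - 1, j + m)
        out.append(sum(values[lo:hi + 1]) / (hi - lo + 1))
    return out

# A pure cosine of period 8 concentrates power at frequency 2*pi/8.
n = 128
x = [math.cos(2 * math.pi * t / 8) for t in range(n)]
freqs = [2 * math.pi * j / n for j in range(1, n // 2 + 1)]
pgram = [periodogram(x, lam) for lam in freqs]
peak = freqs[max(range(len(pgram)), key=lambda j: pgram[j])]
smoothed = daniell_smooth(pgram, 2)
```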
Now, let X_t, t = 0, ±1, ±2, . . . , be a categorical-valued time series with finite state-space c = {c_1, c_2, · · · , c_k}. Assume that X_t is stationary and p_j = Pr{X_t = c_j} > 0 for j = 1, · · · , k. For β = (β_1, β_2, · · · , β_k)^t ∈ R^k, X_t(β) corresponds to the scaling that assigns the category c_j the numerical value β_j, which is called the scaling parameter and detects a sort of weight for each category. The spectral envelope aims to find the optimal scaling β maximizing the power at each frequency λ of interest, λ ∈ (−π, π], relative to the total power σ²(β) = var{X_t(β)}. That is, we choose β(λ), at each λ of interest, so that

    s(λ) = sup_β { f(λ; β) / σ²(β) }                    (7.4.1)
over all β not proportional to 1k , the k × 1 vector of ones. The variable s(λ) is called the Spectral
Envelope (SpecEnv) for a frequency λ. When Xt = c j , we then define a k-dimensional time series
Yt by Yt = e j , where e j indicates the k × 1 vector with 1 in the jth row and zeros elsewhere. The
time series Xt (β) can be obtained from Yt by the relationship Xt (β) = βt Yt . Assume that the vector
process Yt has a continuous spectral density matrix fY (λ) for each λ. The relationship Xt (β) = βt Yt
implies that f_X(λ; β) = β^t f_Y^Re(λ) β, where f_Y^Re(λ) denotes the real part of f_Y(λ). Denoting the variance–covariance matrix of Y_t by V, formula (7.4.1) can be expressed as

    s(λ) = sup_β { β^t f_Y^Re(λ) β / β^t V β }.
In practice, Y_t is formed from the given categorical time series data, the fast Fourier transform of Y_t is calculated, the periodogram is smoothed, the variance–covariance matrix S of Y_t is calculated, and, finally, the largest eigenvalue and the corresponding eigenvector of the matrix 2n^{-1} S^{-1/2} f^Re S^{-1/2} are obtained. The sample spectral envelope corresponds to this eigenvalue, and the optimal sample scaling is obtained by multiplying S^{-1/2} by the eigenvector.
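The computational kernel of this last step, extracting the largest eigenvalue and its eigenvector of a symmetric matrix, can be sketched with power iteration. The matrix below is a small hand-made stand-in for 2n^{-1} S^{-1/2} f^Re S^{-1/2}; statistical software would use a standard eigenroutine instead.

```python
def power_iteration(A, iters=200):
    """Largest eigenvalue and eigenvector of a symmetric matrix via power
    iteration (assumes a dominant eigenvalue)."""
    k = len(A)
    v = [1.0] * k
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        # Rayleigh quotient estimate of the eigenvalue.
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(k)) for i in range(k))
    return lam, v

# Symmetric 3x3 toy matrix; its eigenvalues are 1, 2, and 4.
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
lam, v = power_iteration(A)
```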

7.4.2.2 Diversification analysis


Meucci (2010) provided a diversification index as a measurement for risk diversification management. He first represented a generic portfolio by the weights w of each asset, the N-dimensional return of assets R, and the return on the portfolio R_w = w^t R. For the covariance matrix Σ of the N-dimensional return of assets, principal component analysis (PCA) is applied via Σ = EΛE^t, where Λ = diag(ν_1, · · · , ν_N) contains the eigenvalues and E = (e_1, · · · , e_N) the eigenvectors. The eigenvectors define a set of N uncorrelated portfolios, and the returns of the principal portfolios are R̃ = E^{-1}R. Then, the generic portfolio can be considered a combination of the principal portfolios with weights w̃ = E^{-1}w. Since the principal portfolios are uncorrelated, the total portfolio variance is Var(R_w) = Σ_{i=1}^{N} w̃_i² ν_i, and the risk contribution from each principal portfolio to the total portfolio variance is p_i = w̃_i² ν_i / Var(R_w). Using the contributions p_i, he introduced an entropy as the portfolio diversification:

    N_Ent ≡ exp( − Σ_{i=1}^{N} p_i ln p_i ).

If all of the risk is completely due to a single principal portfolio, that is, p_i = 1 and p_j = 0 (j ≠ i), we obtain N_Ent = 1. On the other hand, if the risk is spread homogeneously among the N principal portfolios (that is, p_i = 1/N for all i), we obtain N_Ent = N. To optimize diversification, Meucci presented the mean-diversification efficient frontier

    w_ϕ ≡ arg max_{w∈C} { ϕ w^t μ + (1 − ϕ) N_Ent(w) }                    (7.4.2)

where μ denotes the estimated expected returns, C is a set of investment constraints, and ϕ ∈ [0, 1]. For small values of ϕ, diversification is the main concern, whereas as ϕ → 1, the focus shifts to the expected returns (Meucci (2010)).
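The two limiting cases of N_Ent are easy to verify numerically with a minimal sketch of the entropy-based index.

```python
import math

def n_ent(p):
    """Effective number of bets: exp(-sum p_i ln p_i) for risk contributions
    p_i summing to one (terms with p_i = 0 contribute nothing)."""
    return math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0))

def risk_contributions(w_tilde, nu):
    """p_i = w~_i^2 nu_i / Var(R_w), from principal-portfolio weights and
    eigenvalue variances."""
    var = sum(w * w * v for w, v in zip(w_tilde, nu))
    return [w * w * v / var for w, v in zip(w_tilde, nu)]

# Concentrated risk gives N_Ent = 1; homogeneous risk gives N_Ent = N.
p_conc = [1.0, 0.0, 0.0, 0.0]
p_unif = [0.25, 0.25, 0.25, 0.25]
```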

7.4.2.3 An extension of SpecEnv to the mean-diversification efficient frontier


For the mean-diversification efficient frontier in Meucci (2010), if we take ϕ = 0 and if N_Ent(·) is the spectral density of the portfolio, then the frontier leads to the spectral envelope. If ϕ ≠ 0 and the entropy N_Ent(·) is the variance of the portfolio, Equation (7.4.2) leads to the mean-variance portfolio for the periodogram at λ:

    w_ϕ ≡ arg max_{w∈C} { ϕ w^t μ(λ) + (1 − ϕ) w^t Var(f̃_Y^Re(λ)) w }                    (7.4.3)

where μ(λ) denotes the mean vector of the periodogram smoothed by the Daniell kernel around λ and Var(f̃_Y^Re(λ)) denotes the variance of the periodogram. Although Meucci (2010) estimated the optimum portfolio for diversification by the principal portfolios, we simplify using the following mean-variance analysis (Taniguchi et al. (2008)):

    max_w { μ^t w − β w^t V w }  subject to  e^t w = 1,                    (7.4.4)

where e = (1, 1, · · · , 1)^t (the m × 1 vector of ones) and β is a given number. If we substitute (7.4.3) into (7.4.4), the solution for w is given by

    w(λ) = (ϕ/(1 − ϕ)) [ Var(f̃_Y^Re(λ))^{-1} μ(λ) − ( e^t Var(f̃_Y^Re(λ))^{-1} μ(λ) / e^t Var(f̃_Y^Re(λ))^{-1} e ) Var(f̃_Y^Re(λ))^{-1} e ]
           + Var(f̃_Y^Re(λ))^{-1} e / ( e^t Var(f̃_Y^Re(λ))^{-1} e ).                    (7.4.5)

For a small value of ϕ, we obtain the portfolio return and variance as μ(λ)^t w(λ) and w(λ)^t Var(f̃_Y^Re(λ)) w(λ).
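Formula (7.4.5) can be checked on a toy two-category example; by construction the resulting weights satisfy e^t w = 1, and at ϕ = 0 they reduce to the minimum-variance weights. The 2 × 2 matrix below is a hand-made stand-in for Var(f̃_Y^Re(λ)), not a real spectral estimate.

```python
def solve_weights(V, mu, phi):
    """Weights of form (7.4.5) for a 2x2 covariance matrix V and mean
    vector mu; they sum to 1 by construction."""
    # Invert the 2x2 matrix by hand.
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    Vinv = [[V[1][1] / det, -V[0][1] / det],
            [-V[1][0] / det, V[0][0] / det]]
    def mat_vec(M, x):
        return [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]
    Vinv_mu = mat_vec(Vinv, mu)
    Vinv_e = mat_vec(Vinv, [1.0, 1.0])
    etVinv_e = sum(Vinv_e)
    etVinv_mu = sum(Vinv_mu)
    c = phi / (1 - phi)
    return [c * (Vinv_mu[i] - etVinv_mu / etVinv_e * Vinv_e[i]) + Vinv_e[i] / etVinv_e
            for i in range(2)]

V = [[2.0, 0.5], [0.5, 1.0]]
mu = [0.3, 0.1]
w = solve_weights(V, mu, phi=0.1)
```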
In applying the SpecEnv to categorical data, one category is fixed as a base, so the number of variables in the scaling parameter vector β is the total number of categories − 1, and Y_t is formed as a matrix of size (number of observations) × (number of categories − 1). On the other hand, in the portfolio study, we should not fix a “base” for the factorization, and the number of weights estimated by Equation (7.4.5) should equal the number of categories.
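The two encoding conventions can be illustrated with a small indicator-matrix sketch (a hypothetical helper, not from the original analysis).

```python
def one_hot(seq, categories, drop_base=None):
    """Indicator series Y_t for a categorical sequence; optionally drop a
    base category (SpecEnv convention) or keep all columns (portfolio
    convention)."""
    cols = [c for c in categories if c != drop_base]
    return [[1 if s == c else 0 for c in cols] for s in seq]

seq = list("ACGTAC")
Y_spec = one_hot(seq, ["A", "C", "G", "T"], drop_base="T")   # n x (k-1)
Y_port = one_hot(seq, ["A", "C", "G", "T"])                  # n x k
```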

7.4.3 Data
7.4.3.1 Simulation data
To confirm the performance of our proposed extension of SpecEnv to the mean-diversification ef-
ficiency frontier, we conducted a simulation study. We considered sequence data including four
numbers (1, 2, 3 and 4) . We first generated 1,000 random four-character sequences. Next, the num-
ber “3” was placed instead of the number located at every ten and every five positions in the entire
sequence. For the procedure, we used the following R code:
tmp <- sample(c(1, 2, 3, 4), 1000, replace = TRUE)
l31 <- seq(from = 1, to = 1000, by = 5);  tmp[l31] <- 3
l32 <- seq(from = 1, to = 1000, by = 10); tmp[l32] <- 3
7.4.3.2 DNA sequence data
As a case of real data analysis in this section, we used the DNA sequence for Epstein–Barr virus
(EBV), which was demonstrated in Stoffer et al. (2000) as an example of categorical time series
data. A DNA sequence is represented by four letters, termed base pairs (bp), from the finite set of
alphabetic characters related to two purines, adenine (A) and guanine (G), and two pyrimidines, cy-
tosine (C) and thymine (T). The order of the alphabet (nucleotide) contains the genetic information
specific to the organism. Translating the information stored in the protein-coding sequences (CDS) of the DNA is an important process. CDS are dispersed throughout the sequence and are separated by noncoding regions (Stoffer (2005)). Stoffer et al. (2000) clarified the optimal periodic length of the DNA sequence for specific genes and identified the most favourable alphabets, such as the strong-weak (S-W) alphabet related to regional fluctuations or codon context, for use in individual amino acid shifting. The entire EBV DNA sequence they applied consists of approximately 172,000 bp.
262 NUMERICAL EXAMPLES
We use a part of the sequence, in the region of 1736 – 5689 bp, related to the gene labeled BNRF1 shown in Stoffer et al. (2000). They analyzed the entire 3954-bp sequence of the gene and its four 1000-bp windows.

7.4.4 Results and Discussion
7.4.4.1 Simulation study
The generated sequence is summarized in Box 1. The total number of generated data for each
category was 209 for 1, 184 for 2, 414 for 3 and 193 for 4.
Box 1. Generated artificial sequence data
333233333433232311323442132224343223442432332332223
113133342313323442333143312443413432344331333211233
433344143411332111334113431131111312123311334442342
143141234421313123434431242321223133133131341333342
133423312313421131334331333314331122343423224234324
323413333232221324213213433443344113314433113321233
332333432331143332333124311443413332123311143311232
124311313322233341314243432333333314333142332131323
223224432211341433311133321344433221431322313443222
333422324243242231132344123113232432311323331131212
341443121132133334313212333421334223214431143342433
133232411332343333132112341233234231324311413443334
444341333331234114333133432331413312423434233143341
413314431143311133144233242322343323434221324413214
133412342113431232232334233431333112313333142232443
322113114132243341343132134443332333331431143342233
442433332324113221134444324223134232233342133234131
332314313444431423313233433131213331413234234244323
343311131331313413334334341332433434131142313223423
3341213121334314313433234431141
For these data, we first applied the SpecEnv approach. We summarize the calculated periodogram and SpecEnv in Figure 7.12. The obtained SpecEnv contained two clear peaks at the 0.2 and 0.4 frequencies (202nd and 404th positions). In this calculation, the factor for the fourth category was treated as a base. The obtained scaling parameters were 0.018 for 1, 0.087 for 2, and 0.996 for 3 at the 0.2 frequency, and 0.013 for 1, −0.002 for 2, and 1.00 for 3 at the 0.4 frequency. Next, we applied the portfolio study to the data. The portfolio weights were obtained for the two clear peaks at the 0.2 and 0.4 frequencies. The portfolio returns µ(f̃_Y^Re)^t w(λ) and the variances under the obtained weights are plotted for small values ϕ = 0, · · · , 0.1 in Figure 7.13 (middle and lower). The configurations of those plots did not differ appreciably with the degree ϕ that divides mean and diversification. The portfolio weights for categories 1 and 2 are close at the first peak, and the weights for category 3 are far from those for categories 1 and 2 at the second peak. The tendency of the obtained weights is consistent with the obtained scaling parameters of SpecEnv. The portfolio variances indicate two large peaks at the 0.2 and 0.4 frequencies, with the second peak twice as high as the first. The frequencies of these two maximum peaks are consistent with the frequencies of the two largest SpecEnv peaks. The configurations of the obtained portfolio returns are likewise consistent for ϕ = 0, · · · , 0.1.
As we mentioned in Section 7.4.2.3, µ(λ) is the mean vector of the periodogram smoothed by the Daniell kernel around λ. If we simply used the periodogram without Daniell smoothing, the portfolio weights would not remain consistent as ϕ increases, as shown in Figure 7.14
Figure 7.12: Plots of the periodogram and the calculated SpecEnv for the simulation data. (Upper) Smoothed periodogram for the simulation data (bandwidth = 0.00923); solid line for 1, dotted line for 2 and broken line for 3. (Lower) SpecEnv for the simulation data; two peaks appear at the 0.2 and 0.4 frequencies.

(Upper). The estimates for portfolio variance and returns summarized in Figure 7.14 (Middle) and (Lower) also fluctuated for ϕ > 0.075. These results suggest that the portfolio estimation depends on the smoothing of the periodogram. To confirm this, a further step would be to compare the portfolio estimates obtained when other smoothing functions are applied to the periodogram.
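Since the discussion above hinges on Daniell smoothing of the periodogram, a short sketch may be helpful. This is the plain Daniell (moving-average) kernel; note that R's spec.pgram uses a modified Daniell variant with half-weights at the endpoints, so this is an illustration rather than a reproduction of the computation in the text:

```python
import numpy as np

def daniell_smooth(I, m):
    """Smooth periodogram ordinates I with a Daniell kernel of half-width m:
    each ordinate is replaced by the average of its 2m + 1 neighbours."""
    kernel = np.ones(2 * m + 1) / (2 * m + 1)
    # 'same' keeps the output aligned with the input frequencies
    return np.convolve(I, kernel, mode='same')
```

Interior ordinates become local averages over 2m + 1 neighbouring frequencies; the few ordinates at either end are attenuated by the zero padding of the convolution.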

7.4.4.2 DNA sequence for the BNRF1 gene
Following Stoffer's article, we first performed SpecEnv on the entire 3954 bp of the BNRF1 gene and on its four 1000-bp sections, using (formula) with m = 7. The estimated SpecEnv are illustrated in Figure 7.15 for the entire sequence and in Figures 7.16, 7.17, 7.18 and 7.19 for the four sections. Just as they found, we confirmed the strong peak at frequency 1/3 for the entire coding sequence in Figure 7.15; this same peak appears in the first three sections in Figures 7.16, 7.17, and 7.18, while there is no peak in Figure 7.19. This supports the conjecture that the fourth section is actually protein noncoding. The optimal scalings for the first three sections were (b) A = 0.05, C = −0.68, G = −0.72, T = 0; (c) A = −0.06, C = −0.69, G = −0.71, T = 0; (d) A = −0.17, C
Figure 7.13: Portfolio estimates for the simulation data, at the peaks near frequencies 0.195 and 0.403. (Upper) Estimated portfolio weights of the simulation data for ϕ = 0, · · · , 0.1. (Middle) Portfolio variances for ϕ = 0, · · · , 0.1. (Lower) Portfolio returns for ϕ = 0, · · · , 0.1.
Figure 7.14: Results of the portfolio estimates using µ(λ) from the periodogram without a Daniell smoother, at the peaks near frequencies 0.195 and 0.403. (Upper) Estimated portfolio weights for ϕ = 0, · · · , 0.1. (Middle) Portfolio variances for ϕ = 0, · · · , 0.1. (Lower) Portfolio returns for ϕ = 0, · · · , 0.1.
= −0.58, G = −0.80, T = 0. The first two sections show the consistent strong-weak bonding alphabet, with the combination of C and G against the combination of A and T (strong for C&G and weak for A&T), and the third shows consistent minor bonding, as mentioned in Stoffer et al. (2000).

Figure 7.15: Estimated SpecEnv for the BNRF1 gene of EBV ((a) the entire 3954-bp sequence)

Figure 7.16: Estimated SpecEnv for the BNRF1 gene of EBV ((b) first 1000-bp section)

Next, we applied the portfolio estimation to the four sections of the DNA sequence data. The results are summarized in Figures 7.20 – 7.23. For ϕ = 0, . . . , 0.1, the configurations of the obtained variance were closely similar to the obtained SpecEnv for each ϕ in the corresponding section. The maximum peaks in the first, second and third sections stood at the 1/3 frequency, as did those of SpecEnv. Furthermore, we compared the portfolio weights of the four categories based on the maximum peaks at the 1/3 frequency. The results are summarized in Figures 7.24 – 7.27. We found that
the weights for C and G were close in the first and second sections in Figures 7.24 and 7.25. This
suggests a strong combination of C and G in the sequence sections. These results are consistent
with those of Stoffer et al. (2000). On the other hand, the weights in the third section, as shown in
Figure 7.26, did not indicate a strong relationship between C and G. For the third section, Stoffer
Figure 7.17: Estimated SpecEnv for the BNRF1 gene of EBV ((c) second 1000-bp section)

Figure 7.18: Estimated SpecEnv for the BNRF1 gene of EBV ((d) third 1000-bp section)

et al. (2000) showed some minor departure from the strong-weak bonding alphabet. In the fourth
section, shown in Figure 7.27, the weights for A-C and G-T were close at ϕ = 0. However, when
ϕ increases, the disparity between the weights for A and C also increases. For the fourth section,
Stoffer et al. (2000) also showed that no signal was present in the section and concluded that the
fourth section of BNRF1 of Epstein–Barr was actually noncoding.
In the case of SpecEnv for the second section, the obtained scaling parameters indicated strong
C-G bonding. The portfolio weights in Figure 7.25 also indicated the same characteristics. The port-
folio variance for the second section showed consistency for ϕ = 0, · · · , 0.1. To confirm the possible
consistency for larger ϕ, we calculated the variations for 0.1 ≤ ϕ ≤ 0.9, shown in Figure 7.28. The
peaks at the 1/3 frequency were not different until ϕ = 0.6; however, as shown in Figure 7.29 the
portfolio weights actually changed from ϕ = 0.3. For ϕ > 0.5, the divergence between the weights
became apparent. The portfolio variance in the same range for ϕ showed several peaks.
Figure 7.19: Estimated SpecEnv for the BNRF1 gene of EBV ((e) fourth 1000-bp section)

Figure 7.20: Estimated portfolio variance plots for the first 1000-bp section

7.4.5 Conclusions
We considered an extension of the current mean-diversification efficient frontier based on SpecEnv. A simulation study was performed to explore the relationship between SpecEnv and the proposed portfolio approach. Furthermore, we applied the proposed approach to real DNA
Figure 7.21: Estimated portfolio variance plots for the second 1000-bp section

Figure 7.22: Estimated portfolio variance plots for the third 1000-bp section

sequence data of the BNRF1 gene of the Epstein–Barr virus (EBV). We found a consistent tendency between the results of SpecEnv and portfolio estimation. Our findings support the idea that the scaling parameters of SpecEnv correspond to the portfolio weights, suggesting that the proposed approach is a useful method of portfolio estimation for categorical time series data.
Figure 7.23: Estimated portfolio variance plots for the fourth 1000-bp section

Figure 7.24: Estimated portfolio weights (A, C, G, T) at the 1/3-frequency peak for the first 1000-bp section, ϕ = 0, · · · , 0.1

7.5 Application to Real-Value Time Series Data for Corticomuscular Functional Coupling
for SpecEnv and the Portfolio Study
7.5.1 Introduction
Corticomuscular functional coupling (CMC) forms the functional connection between the brain and the peripheral organs. The evaluation of CMC is important for revealing the pathophysiological mechanisms of many movement disorders, such as paresis and involuntary movements like dystonia, tremors, and myoclonus (Mima and Hallett (1999)). The mutual relationship between electromyograms (EMGs) and electroencephalograms (EEGs), which can be recorded easily from the scalp, has been studied in a number of cases (for a review, see Mima and Hallett (1999)). In many of the previous studies on CMC, a linear model, such as coherence (Ohara et al. (2000)) or a multivariate autoregressive model (Honda et al. (2011)), has often been used to investigate the relationship between EEG and EMG time series.
The physiological natures of EEG and EMG are significantly different. Namely, EEG is the summation of postsynaptic potentials of neurons, which are characterized by relatively gradually

Figure 7.25: Estimated portfolio weights at the 1/3-frequency peak for the second 1000-bp section, ϕ = 0, · · · , 0.1

Figure 7.26: Estimated portfolio weights at the 1/3-frequency peak for the third 1000-bp section, ϕ = 0, · · · , 0.1

changing signals. By contrast, EMG is the summation of action potentials of the muscle spindle, which are impulse-like signals. Compared with EEG, the electrocorticogram (ECoG) more accurately reflects the electrical potential generated just below the recording electrode, thus making it possible to observe more directly the relationship between the cortex and other organs, such as muscles. Kato et al. (2006) proposed a multiplicatively modulated exponential autoregressive model (mmExpAR) to predict the dynamics of EMG modulated by ECoG. The mmExpAR belongs to the class of multiplicatively modulated nonlinear autoregressive models (mmNAR), for which local asymptotic normality (LAN, Le Cam (1986)) was also established. In this section, we employ these data as a numerical example of continuous-valued time series analysis.

Figure 7.27: Estimated portfolio weights at the 1/3-frequency peak for the fourth 1000-bp section, ϕ = 0, · · · , 0.1
Figure 7.28: Outputs of the portfolio variance for 0.1 ≤ ϕ ≤ 0.9

Figure 7.29: Outputs of the portfolio weights for 0.1 ≤ ϕ ≤ 0.9

For the ECoG data, the subject was a 23-year-old female suffering from intractable seizures. Chronic subdural electrodes were implanted for surgical evaluation. The ECoG data are from a monopolar recording obtained using platinum grid subdural electrodes (diameter 3 mm, interelectrode distance 1 cm) placed on the hand area of the left primary motor cortex, with the right mastoid as a reference. The location of the hand area was confirmed by electrical stimulation. Surface EMG data were obtained using a recorder on the right extensor carpi radialis muscle. The amplifier setting was 0.03–120 Hz (−3 dB) and the sampling frequency was 500 Hz. The motor task for the subject was isometric contraction of the right extensor carpi radialis muscle. The total number of data points was 20,000 sample points over a 40-s duration. In light of the inhomogeneous characteristics of the entire dataset, we divided the data into stationary epochs of 1000 sample points each (i.e., 20 epochs). The ECoG and EMG data for the second epoch are shown in Figure 7.30. To investigate the relationship between the cortex and other organs such as muscles, we used a dataset consisting of data from
Figure 7.30: Original ECoG (upper) and EMG (lower) data for one epoch

ECoG and an EMG. The nonlinear and non-Gaussian characteristics of the EMG data make it dif-
ficult to handle the data in a linear model. Kato et al. (2006) proposed a nonlinear model called the
Multiplicatively Modulated Nonlinear Autoregressive Model (mmNAR) to predict the dynamics for
generating ECoG. The model was presented by a radial basis function (RBF) of exogenous variables
such as EMG.
Let {z_t : t = 1, · · · , N} be a time series of interest, and let {y_t : t = 1, · · · , N} be a series of exogenous variables. Suppose that {z_t} is generated by

z_t = Φ_0 + Σ_{p=1}^{P} Φ_p z_{t−p} + ε_t,    (7.5.1)

where the ε_t's are i.i.d. random variables, and

Φ_0 ≡ Φ_0(Z_{t−1}, C_{z,1}, · · · , C_{z,t_z}) = φ_{0,0} + Σ_{j=1}^{t_z} φ_{0,j} exp{−γ_{z,j} ‖Z_{t−1} − C_{z,j}‖²},    (7.5.2)

Φ_p ≡ Φ_p(Z_{t−1}, Y_{t−l}, C_{z,1}, · · · , C_{z,t_z}, C_{y,1}, · · · , C_{y,t_y})
    = φ_{p,0} + Σ_{j=1}^{t_z} φ_{p,j} exp{−γ_{z,j} ‖Z_{t−1} − C_{z,j}‖²} + Σ_{i=1}^{t_y} φ_{p,i} exp{−γ_{y,i} ‖Y_{t−l} − C_{y,i}‖²},    (7.5.3)

with

Z_{t−1} = (z_{t−1}, ∆z_{t−1}, · · · , ∆^{k_z−1} z_{t−1})^T (k_z = 2, 3, · · · , K_z),    (7.5.4)
Y_{t−l} = (y_{t−l}, ∆y_{t−l}, · · · , ∆^{k_y−1} y_{t−l})^T (k_y = 2, 3, · · · , K_y),

where t_z and t_y are the numbers of centres in the RBFs of z_t and y_t, and φ_{i,j}, {γ_{z,j} : γ_{z,j} > 0} and {γ_{y,j} : γ_{y,j} > 0} are unknown parameters. Also, l indicates a lag at which the exogenous variables have significant time delays. The RBF centres are

C_{z,j} ≡ (c_{z,j}, c_{z,j}^{(1)}, · · · , c_{z,j}^{(k_z−1)})^T,    (7.5.5)
C_{y,j} ≡ (c_{y,j}, c_{y,j}^{(1)}, · · · , c_{y,j}^{(k_y−1)})^T,

where the difference orders are denoted by k_z and k_y.

The mmNAR model is understood as a special form of vector conditional heteroscedastic autore-
gressive nonlinear (CHARN) models (Härdle et al. (1998)) introduced in Theorem 2.1.36.
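To make the radial-basis-function structure of (7.5.2)–(7.5.3) concrete, here is a small Python sketch of one coefficient evaluation; the function and argument names are illustrative, not taken from the original implementation:

```python
import numpy as np

def rbf_coefficient(phi0, phis, gammas, state, centers):
    """Evaluate Phi = phi_0 + sum_j phi_j * exp(-gamma_j * ||state - C_j||^2),
    the RBF form of an mmNAR coefficient (cf. (7.5.2))."""
    total = phi0
    for p, g, c in zip(phis, gammas, centers):
        # each centre contributes a Gaussian bump around C_j
        total += p * np.exp(-g * np.sum((state - c) ** 2))
    return total
```

The coefficient thus varies smoothly with the state vector Z_{t−1} (or the exogenous vector Y_{t−l}), which is what makes the autoregression "multiplicatively modulated."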
In this section, we focus on residual analysis. Stoffer et al. (2000) introduced the Spectral Envelope (SpecEnv) approach for real-valued time series data and, as an application, performed a simple diagnostic procedure using SpecEnv to aid residual analysis in time series model fitting. The method was motivated by a study of Tiao and Tsay (1994), in which a second-order moving average model (MA(2)) was fitted to quarterly US real GNP data and the residuals from the model fit appeared to be uncorrelated. The SpecEnv revealed that the residuals from the MA(2) fit were not i.i.d. and that there was considerable power at the low frequencies. Based on this, we consider residual analysis from two methodological viewpoints: SpecEnv and the mean-diversification efficient frontier defined in (7.4.4).

7.5.2 Method
We first obtain the residuals by fitting the mmNAR and exponential AR (ExpAR) models (Haggan and Ozaki (1981)) to the EMG data (the ECoG data are used as the exogenous variable in the mmNAR model). The ExpAR model is given by

z_t = Σ_{p=1}^{P} { φ_p + π_p exp(−γ z_{t−1}²) } z_{t−p} + ǫ_t,

where P is the AR order, φ_p and π_p are AR parameters, γ is a scaling parameter, and the ǫ_t's are i.i.d. random variables. We consider some transformations of the residuals and stack them as a vector. For this vector, we then calculate the SpecEnv s(λ) = sup_β { β^t f_X^Re(λ) β / β^t V β } shown in Section
7.4.2.1. We investigate the maximum SpecEnv estimates and the scaling parameters at the frequency of the maximum SpecEnv peak. Furthermore, we apply the portfolio study to the vector by the procedure introduced in Section 7.4.2.3. The portfolio weights, returns and variances are estimated for each ϕ. Finally, we consider the correspondence of the portfolio weights to the scaling parameters and their interpretation with respect to the residuals.
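As a rough illustration of the first step, an ExpAR series in the sense of Haggan and Ozaki (1981) can be simulated as follows; the parameters are arbitrary, and this is a sketch rather than the fitting procedure used in the study:

```python
import numpy as np

def simulate_expar(n, phi, pi_, gamma, seed=0):
    """Simulate z_t = sum_p {phi_p + pi_p exp(-gamma z_{t-1}^2)} z_{t-p} + eps_t
    with i.i.d. standard normal innovations."""
    rng = np.random.default_rng(seed)
    phi, pi_ = np.asarray(phi, float), np.asarray(pi_, float)
    P = len(phi)
    z = np.zeros(n + P)
    eps = rng.standard_normal(n + P)
    for t in range(P, n + P):
        # state-dependent AR coefficients, modulated by the previous value
        coef = phi + pi_ * np.exp(-gamma * z[t - 1] ** 2)
        z[t] = coef @ z[t - P:t][::-1] + eps[t]
    return z[P:]   # drop the burn-in of the initial zeros
```

Residuals of a fitted model of this form would then be collected and transformed as described above.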

7.5.3 Results and Discussion
In this study, we use the second epoch (epoch = 2) of the whole time series data shown in Kato et al. (2006). The ExpAR and mmNAR models were applied to the data, and the best-fitting models were identified through the optimum AR orders of ExpAR and mmNAR and the time lags of mmNAR. For the best-fitting models, the obtained log-likelihood and AIC were −1.25 × 10^4 and 2.49 × 10^4 for ExpAR, and −1.05 × 10^4 and 2.10 × 10^4 for mmNAR. The minimum AIC selected 5 as the optimum AR order for ExpAR and 69 as the optimum time lag for the exogenous variable (ECoG) in mmNAR, following Kato et al. (2006). The autocorrelation of the residuals was assessed by a modified Q statistic (Ljung and Box (1978)). The Q values were 26.2 for ExpAR and 26.9 for mmNAR, neither of which exceeded the critical value of the chi-squared distribution (46.9 = χ²_0.05(25); Kato et al. (2006)).
Next, we applied the residual diagnostics to the obtained residual series shown in Figure 7.31. For the residuals, a k-dimensional vector of continuous real-valued transformations should be set. We fixed the transformed set (vector) G = {x, |x|, x²} for the residual series x. The SpecEnv for the transformed set was calculated (smoothed using the Daniell kernel with m = 7) and is shown in Figure 7.32. For both the ExpAR and mmNAR models, there were considerable peaks in the lower frequency domain. This suggests that spectral power remains at low frequencies, associated with long-range dependence. The peak for mmNAR was higher than the peak for ExpAR; however, the SpecEnv over the middle and higher frequency domains for mmNAR was comparatively lower than the SpecEnv over the same domains for ExpAR. This suggests a better fit of mmNAR to the
Figure 7.31: Plots and histograms of the residuals. (a) Residual series from the ExpAR model (upper) and the mmNAR model (lower). (b) Histograms of the residuals for ExpAR (upper) and mmNAR (lower).

data. The scaling parameters at the maximum peak for ExpAR are β(0.011) = (1, 17.5, −3.0) (at the second peak, β(0.04) = (1, 5.4, 0.086)), which leads to the optimum transformation y = x + 17.5|x| − 3.0x², using the same notation as in Stoffer et al. (2000). Also, for mmNAR, the estimated scaling parameters at the maximum peak were β(0.011) = (1, 38.5, −8.7), which leads to the optimum transformation y = x + 38.5|x| − 8.7x².
Figure 7.32: SpecEnv for the transformed residuals (x, |x|, x²) from (a) the ExpAR model and (b) the mmNAR model

Finally, we calculated the portfolio weights, variances and returns for the transformed residual set. Figure 7.33 shows the estimates for ϕ = 0, . . . , 0.8. The configurations of those plots indicate higher variance at lower frequencies; that is, long-range dependence still remained in the prediction error. We see a tendency consistent with the SpecEnv results. In Figure 7.33, the plot for each ϕ indicates that the frequency of the maximum peak differs from the frequency of the maximum SpecEnv. We reason as follows: in the calculation procedure, the periodograms for SpecEnv were the same as the periodograms for the portfolio calculation, but when calculating the mean and variance of the periodogram, the portfolio procedure requires aggregating periodogram vectors at several sequential frequency points around λ, whereas in the case of SpecEnv we did not summarize the periodograms within a range and the eigenvalue of the periodogram was calculated directly for each λ. The SpecEnv and the portfolio variance for ϕ = 0 should theoretically be equivalent; the discrepancy here was caused by the different numerical algorithms. On the other hand, the estimates for ϕ = 0, . . . , 0.5 did not show a clear distinction, but the estimates for ϕ ≥ 0.6 for ExpAR and for ϕ = 0.65 for mmNAR become more complicated. The estimated portfolio weights are illustrated in Figure 7.34. The weights for the ExpAR model were calculated based on the maximum portfolio variance at the frequency, which was close to the frequency
Figure 7.33: The estimated portfolio variance for ϕ = 0, . . . , 0.8: (a) residuals from the ExpAR model; (b) residuals from the mmNAR model

indicating the maximum SpecEnv peak (see Figure 7.32). In the case of mmNAR, the weights were calculated at the maximum portfolio variance for each ϕ, because the maximum peak of the portfolio variance stands at a frequency similar to that of the maximum SpecEnv. The degrees of the weights indicate the tendency |w_{|x|}| > w_x > w_{x²} in both models, where w_{|x|}, w_x and w_{x²} denote the portfolio weights for |x|, x and x². These results were consistent with the estimated scaling parameters
Figure 7.34: The estimated portfolio weights for the transformed residuals (x, |x|, x²): (a) ExpAR model; (b) mmNAR model

of SpecEnv. In Figure 7.34, the vertical line marks the point at which the frequency of the maximum portfolio variance changed. The weights changed dramatically for ϕ > 0.5. Finally, we show the estimated portfolio returns in Figure 7.35. The estimates spread more widely for ϕ > 0.2 in the case of ExpAR and for ϕ > 0.3 in the case of mmNAR. The portfolio returns are interpreted as the mean summarizing the portfolio effect. On the other hand, an increase in prediction error is undesirable for time series modeling. For sufficiently small ϕ (ϕ < 0.2), the portfolio weights correspond to the scaling parameters of SpecEnv.
Figure 7.35: The estimated portfolio returns for ϕ = 0, . . . , 0.8: (a) residuals from the ExpAR model; (b) residuals from the mmNAR model

Problems
7.1 Assume that the kernel function K(·) in (7.1.9) is continuously differentiable and satisfies K(x) = K(−x), ∫K(x)dx = 1 and K(x) = 0 for x ∉ [−1/2, 1/2], and that the bandwidth b_n is of order b_n = O(n^{−1/5}). Then, under twice differentiability of µ, show that (see, e.g., Shiraishi and Taniguchi (2007))

E{µ̂(t/n)} = µ(t/n) + O(b_n²) + O(1/(b_n n)).
7.2 Verify (7.2.1)-(7.2.4).

7.3 Generate artificial indicator data comprising '1' for 50 samples and '0' for 50 samples. For the two-group samples, generate 1,000-array data of 100 samples, using normal distributions and ARCH models with different settings. Show the histograms of the two-group samples.

7.4 To the same data generated in 7.3, apply rank order statistics to compare the two groups for each array. Plot the empirical distributions of the statistics using the values obtained over the 1,000 arrays.

7.5 For the same data generated in 7.3, fit an ARCH model and obtain the empirical residuals for each array. Calculate the rank order statistics to compare the two groups for each array. Show the empirical distributions of the statistics.

7.6 Take the first row of the data obtained in 7.5 and reproduce the array data by shuffling the samples. To the reproduced data, re-apply the rank order statistics for the two groups (samples 1–50 and 51–100). Repeat the procedure 1,000 times and plot the empirical distribution of the statistics. Calculate the p-value of the statistic of the original data obtained on the same row in 7.5.

7.7 Download the five (Des, Loi, Mil, Min and Chin) cohorts in Affy947 expression and the clinical
data from the R code in Zhao et al. (2014).
a. To the GARCH residuals, apply other statistics for two-group comparisons, e.g. t statistics,
Fold Change, and B statistics, instead of rank order statistics and identify the significant genes.
b. Show the common genes list among significant genes obtained by those statistics and check
the common biological terms by GO analysis by the list.

7.8 What is the benefit of considering portfolio diversification?

7.9 What is the relationship between diversification measurement and information theory?

7.10 What are principal portfolios?

7.11 For the entropy

NENT ≡ exp( − Σ_{i=1}^{N} p_i ln p_i ),

if it takes a low value, what does it mean for the risk factors of the portfolio? If it achieves its maximum value, how is the portfolio constructed?
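To illustrate the quantity in Problem 7.11 (without answering it), here is a short Python sketch of the diversification entropy; the function name is ours:

```python
import numpy as np

def nent(p):
    """Exponential entropy NENT = exp(-sum_i p_i ln p_i) for a probability
    vector p; it ranges from 1 (all weight on one factor) to N (weight
    spread evenly over N factors)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * ln 0 is taken as 0
    return float(np.exp(-np.sum(p * np.log(p))))
```

NENT is often read as an "effective number" of independent risk factors in the portfolio.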
7.12 Verify (7.4.2).

7.13 Verify (7.4.3).

7.14 Verify (7.4.5).

7.15 Generate simulation data using the R code shown in Section 7.4.3.1 by changing the parameter
’by’ for l31 and l32.
a. Apply SpecEnv to the data and see the difference between parameter settings.
b. Apply (7.4.3) – (7.4.5) and examine the differences in the scaling parameters and the portfolio weights between parameter settings.

7.16 Download another DNA sequence dataset, called bnrf1hvs, from "tsa3.rda" of the library astsa (Applied Statistical Time Series Analysis, the software accompanying R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications: With R Examples, Springer) in R (Comprehensive R Archive Network, http://cran.r-project.org). Re-apply SpecEnv and the proposed portfolio estimation given by (7.4.3) – (7.4.5) to the data, and show the scaling parameters and the portfolio weights.
Chapter 8

Theoretical Foundations and Technicalities

This chapter provides theoretical foundations of stochastic processes and optimal statistical inference.

8.1 Limit Theorems for Stochastic Processes


Let {X(t) = (X_1(t), . . . , X_m(t))′ : t ∈ Z} be generated by

X(t) = Σ_{j=0}^{∞} A(j)U(t − j), t ∈ Z,    (8.1.1)

where {U (t) = (u1 (t), . . . , um (t))′ } is a sequence of m-vector random variables with

E{U (t)} = 0, E{U (t)U (s)′ } = 0 (t ≠ s),

and

Var{U (t)} = K = {Kab : a, b = 1, . . . , m} for all t ∈ Z,

and A( j) = {Aab ( j) : a, b = 1, . . . , m}, j ∈ Z, are m × m matrices with A(0) = Im . We set down the
following assumption.

(HT0)

Σ_{j=0}^{∞} tr A( j)KA( j)′ < ∞.

Then, the process {X(t)} is a second-order stationary process with spectral density matrix

f (λ) = { fab (λ) : a, b = 1, . . . , m} ≡ (2π)−1 A(λ)KA(λ)∗ ,


where A(λ) = Σ_{j=0}^{∞} A( j)e^{ijλ}.
For an observed stretch {X(1), . . . , X(n)}, let
I_n(λ) = (2πn)^{−1} ( Σ_{t=1}^{n} X(t)e^{itλ} ) ( Σ_{t=1}^{n} X(t)e^{itλ} )*, (8.1.2)
which is called the periodogram matrix. In time series analysis, we often encounter many statistics
based on

F_n = ∫_{−π}^{π} tr{φ(λ)I_n(λ)}dλ, and F ≡ ∫_{−π}^{π} tr{φ(λ)f(λ)}dλ, (8.1.3)

where φ(λ) is an appropriate matrix-valued function. Hence, we will elucidate the asymptotics of
F_n. Here we mention two types of approaches. The first one is due to Brillinger (2001b). Assume
that the process {X(t)} is strictly stationary and that all moments of it exist. Write

c^X_{a_1···a_k}(t_1, . . . , t_{k−1}) ≡ cum{X_{a_1}(0), X_{a_2}(t_1), . . . , X_{a_k}(t_{k−1})}, (8.1.4)

for a_1, . . . , a_k = 1, . . . , m, k = 1, 2, . . .. Throughout this book, we denote by cum(Y_1, . . . , Y_k) the
joint cumulant of random variables Y_1, . . . , Y_k, which is defined as the coefficient of (it_1) · · · (it_k) in
the Taylor expansion of

log[E{exp(i Σ_{j=1}^{k} t_j Y_j)}] (8.1.5)

(for the fundamental properties, see Theorem 2.3.1 of Brillinger (2001b)). The following condition
is introduced.

(B1) For each j = 1, 2, . . . , k − 1 and any k-tuple a_1, a_2, . . . , a_k, we have

Σ_{t_1, . . . , t_{k−1} = −∞}^{∞} (1 + |t_j|) |c^X_{a_1···a_k}(t_1, . . . , t_{k−1})| < ∞, k = 2, 3, . . . . (8.1.6)


Under (B1), Brillinger (2001b) evaluated the Jth-order cumulant c_n(J) of √n(F_n − F), and
showed that c_n(J) → 0 as n → ∞ for all J ≥ 3, which leads to the asymptotic normality of
√n(F_n − F).
The second approach is due to Hosoya and Taniguchi (1982). Recall the linear process (8.1.1).
Let F_t = σ{U (s); s ≤ t}. Assuming that {U (t)} is fourth-order stationary, write c^U_{a_1···a_4}(t_1, t_2, t_3) =
cum{u_{a_1}(0), u_{a_2}(t_1), u_{a_3}(t_2), u_{a_4}(t_3)}. Hosoya and Taniguchi (1982) introduced the following condi-
tions.

(HT1) For each a_1, a_2 and s, there exists ε > 0 such that

Var[ E{u_{a_1}(t)u_{a_2}(t + s) | F_{t−τ}} − δ(s, 0)K_{a_1 a_2} ] = O(τ^{−2−ε}),

uniformly in t.

(HT2) For each a_j and t_j, j = 1, . . . , 4 (t_1 ≤ t_2 ≤ t_3 ≤ t_4), there exists η > 0 such that

E| E{u_{a_1}(t_1) u_{a_2}(t_2) u_{a_3}(t_3) u_{a_4}(t_4) | F_{t_1−τ}} − E{u_{a_1}(t_1) u_{a_2}(t_2) u_{a_3}(t_3) u_{a_4}(t_4)} | = O(τ^{−1−η}),

uniformly in t_1.

(HT3) For any ρ > 0 and for any integer L ≥ 0, there exists B_ρ > 0 such that

E[ T (n, s)^2 χ{ T (n, s) > B_ρ } ] < ρ,

uniformly in n and s, where

T (n, s) = [ Σ_{a_1, a_2 = 1}^{m} Σ_{r=0}^{L} { (1/√n) Σ_{t=1}^{n} ( u_{a_1}(t + s)u_{a_2}(t + s + r) − K_{a_1 a_2} δ(0, r) ) }^2 ]^{1/2}.

(HT4) The spectral densities faa (λ), a = 1, . . . , m, are square integrable.


(HT5) f (λ) ∈ Lip(α), the Lipschitz class of degree α, α > 1/2.

(HT6) For each a_1, . . . , a_4 = 1, . . . , m,

Σ_{t_1, t_2, t_3 = −∞}^{∞} |c^U_{a_1···a_4}(t_1, t_2, t_3)| < ∞.

In much of the literature a "higher-order martingale difference condition" for the innovation process is
assumed, i.e.,

E{u_{a_1}(t) | F_{t−1}} = 0 a.s., E{u_{a_1}(t)u_{a_2}(t) | F_{t−1}} = K_{a_1 a_2} a.s.,
E{u_{a_1}(t)u_{a_2}(t)u_{a_3}(t) | F_{t−1}} = γ_{a_1 a_2 a_3} a.s., and so on (8.1.7)

(e.g., Dunsmuir (1979), Dunsmuir and Hannan (1976)). What (HT1) and (HT2) mean amounts to a
kind of "asymptotically" higher-order martingale condition, which seems more natural than (8.1.7).
It is seen that √n(F_n − F) can be approximated by a finite linear combination of
√n { (1/n) Σ_{t=1}^{n−s} X_{a_1}(t) X_{a_2}(t + s) − c_{a_1 a_2}(s) }, (8.1.8)
where s = 0, 1, . . . , n − 1, a_1, a_2 = 1, . . . , m. Recalling that {X(t)} is generated by (8.1.1), we see
that (8.1.8) can be approximated by a finite linear combination of

(1/√n) Σ_{t=1}^{n} { u_{a_1}(t) u_{a_2}(t + l) − δ(l, 0)K_{a_1 a_2} }, l ≤ s. (8.1.9)

Applying Brown's central limit theorem (Theorem 2.1.29) to (8.1.9), Hosoya and Taniguchi (1982)
elucidated the asymptotics of F_n. The two approaches are summarized as follows.
Theorem 8.1.1. (Brillinger (2001b), Hosoya and Taniguchi (1982)) Suppose that one of the follow-
ing conditions (B) and (HT) holds.
(B) {X(t) : t ∈ Z} is strictly stationary, and satisfies the condition (B1).
(HT) {X(t) : t ∈ Z} is generated by (8.1.1), and satisfies the conditions (HT0)–(HT6).
Let φ j (λ), j = 1, . . . , q, be m × m matrix-valued continuous functions on [−π, π] such that φ j (λ) =
φ j (λ)∗ and φ j (−λ) = φ j (λ)′ . Then
(i) for each j = 1, . . . , q,
∫_{−π}^{π} tr{φ_j(λ)I_n(λ)}dλ →^p ∫_{−π}^{π} tr{φ_j(λ)f(λ)}dλ, (8.1.10)

as n → ∞,
(ii) the quantities

√n ∫_{−π}^{π} tr[ φ_j(λ){I_n(λ) − f(λ)} ]dλ, j = 1, . . . , q

have, asymptotically, a normal distribution with zero mean vector and covariance matrix V
whose ( j, l)th element is
4π ∫_{−π}^{π} tr{ f(λ)φ_j(λ)f(λ)φ_l(λ) }dλ
+ 2π Σ_{r,t,u,v=1}^{m} ∫∫_{−π}^{π} φ^{( j)}_{rt}(λ_1) φ^{(l)}_{uv}(λ_2) Q^X_{rtuv}(−λ_1, λ_2, −λ_2) dλ_1 dλ_2,
where φ^{( j)}_{rt}(λ) is the (r, t)th element of φ_j(λ), and

Q^X_{rtuv}(λ_1, λ_2, λ_3) = (2π)^{−3} Σ_{t_1, t_2, t_3 = −∞}^{∞} exp{−i(λ_1 t_1 + λ_2 t_2 + λ_3 t_3)} c^X_{rtuv}(t_1, t_2, t_3).

Since the results above are very general, we can apply them to various problems in time series
analysis.
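To make the periodogram matrix (8.1.2) and the convergence (8.1.10) concrete, here is a small Python sketch (our own illustration, not from the text); it takes i.i.d. N(0, I_2) data and φ(λ) = I_m, so that f(λ) = I_2/(2π) and F = ∫ tr f(λ)dλ = m = 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def periodogram_matrix(X, lam):
    """Periodogram matrix I_n(lambda) of (8.1.2) for an (n, m) data array X."""
    n = X.shape[0]
    w = np.exp(1j * lam * np.arange(1, n + 1))     # e^{it lambda}, t = 1..n
    d = (X * w[:, None]).sum(axis=0)               # m-vector finite Fourier transform
    return np.outer(d, d.conj()) / (2 * np.pi * n)

# Riemann-sum approximation of F_n = int tr{I_n(lambda)} dlambda over (-pi, pi),
# using symmetry of the trace in lambda to work on (0, pi) only.
n, m = 4096, 2
X = rng.standard_normal((n, m))
freqs = 2 * np.pi * np.arange(1, n // 2) / n
Fn = 2 * np.pi * np.mean(
    [np.trace(periodogram_matrix(X, lam)).real for lam in freqs]
)
print(Fn)   # close to F = 2 for large n, in line with (8.1.10)
```

The periodogram itself is not a consistent estimator of f(λ) at a fixed frequency, but the smoothed functional F_n converges, which is exactly what part (i) of Theorem 8.1.1 asserts.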
Let X = {Xt } be a square integrable martingale, X (c) the continuous martingale component of X
and ΔX_t = X_t − X_{t−}. The quadratic variation process of X is defined by

[X]_t ≡ ⟨X^{(c)}⟩_t + Σ_{s≤t} (ΔX_s)^2. (8.1.11)

For another square integrable martingale Y = {Y_t}, define

[X, Y]_t ≡ (1/4){ [X + Y, X + Y]_t − [X − Y, X − Y]_t }. (8.1.12)
If X = {X_t = (X_{1,t}, . . . , X_{m,t})′} is an m-dimensional vector-valued square integrable martingale,
the quadratic variation [X] ≡ [X, X] is defined as the R^m × R^m-valued process whose ( j, k)th
component is [X_{j,t}, X_{k,t}].
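For a discretely sampled path these objects can be mimicked by realized sums of increments; the following Python sketch (a discrete-time illustration of ours, not the continuous-time definition itself) checks the polarization identity (8.1.12) on simulated Brownian paths:

```python
import numpy as np

rng = np.random.default_rng(1)

def quadratic_variation(path):
    """Realized quadratic variation: sum of squared increments of a path."""
    return float(np.sum(np.diff(path) ** 2))

def quadratic_covariation(x, y):
    """Cross-variation [X, Y]_t via the polarization identity (8.1.12)."""
    return 0.25 * (quadratic_variation(x + y) - quadratic_variation(x - y))

# For a standard Brownian motion on [0, 1], [W]_1 = 1; for two independent
# Brownian motions W and W~, [W, W~]_1 = 0.
n = 200_000
W1 = np.concatenate(([0.0], np.cumsum(rng.standard_normal(n)) * np.sqrt(1 / n)))
W2 = np.concatenate(([0.0], np.cumsum(rng.standard_normal(n)) * np.sqrt(1 / n)))
print(quadratic_variation(W1))        # close to 1
print(quadratic_covariation(W1, W2))  # close to 0
```

Since Brownian motion has no jumps, the discontinuous sum in (8.1.11) vanishes here and the realized quadratic variation approximates ⟨W⟩_1 = 1.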
We now mention the concept of stable and mixing convergence. Let {Xn } be a sequence of
random variables on a probability space (Ω, F , P) converging in distribution to a random variable
X. We say that the convergence is stable if for all the continuity points x of the distribution function
of X and all A ∈ F

lim_{n→∞} P[{X_n ≤ x} ∩ A] = Q_x(A) (8.1.13)

exists and if Q_x(A) → P(A) as x → ∞. We denote this convergence by

X_n →^d X (stably). (8.1.14)

If

lim_{n→∞} P[{X_n ≤ x} ∩ A] = P(X ≤ x)P(A), (8.1.15)

for all A ∈ F and for all the points of continuity of the distribution function of X, the limit is said to
be mixing, and we write

X_n →^d X (mixing). (8.1.16)

The following is due to Sørensen (1991).


Theorem 8.1.2. Let X = {Xt = (X1 (t), . . . , Xm (t))′ } be an m-dimensional square integrable Ft -
martingale with the quadratic variation matrix [X], and set Ht ≡ E(Xt Xt′ ). Suppose that there
exist nonrandom variables kit , i = 1, . . . , m, with kit > 0 and kit → ∞ as t → ∞. Assume that the
following conditions hold:

(i) k_{it}^{−1} E[ sup_{s≤t} |ΔX_i(s)| ] → 0, 1 ≤ i ≤ m, (8.1.17)

(ii) K_t^{−1} [X]_t K_t^{−1} →^p η^2,
where K_t ≡ diag(k_{1t}, . . . , k_{mt}) and η^2 is a random nonnegative definite matrix, and

(iii) K_t^{−1} H_t K_t^{−1} → Σ,
STATISTICAL ASYMPTOTIC THEORY 287
where Σ is a positive definite matrix. Then,

K_t^{−1} X_t →^d Z (stably), (8.1.18)

where the distribution of Z is the normal variance mixture with characteristic function

Q(u) = E[ exp( −(1/2) u′ η^2 u ) ], u = (u_1, . . . , u_m)′. (8.1.19)

Also, conditionally on [det η^2 > 0],

(K_t [X]_t^{−1} K_t)^{1/2} K_t^{−1} X_t →^d N(0, I_m) (mixing), (8.1.20)

and

X_t′ [X]_t^{−1} X_t →^d χ^2_m (mixing). (8.1.21)

8.2 Statistical Asymptotic Theory


This section discusses asymptotic properties of fundamental estimators for general stochastic pro-
cesses.
Recall the vector-valued linear process

X(t) = Σ_{j=0}^{∞} A( j)U (t − j), (8.2.1)

defined by (8.1.1) satisfying (HT0)–(HT6) with spectral density matrix f (λ). Suppose that f (λ) is
parameterized as fθ (λ), θ = (θ1 , . . . , θr )′ ∈ Θ, where Θ is a compact set of Rr . We are now interested
in estimation of θ based on {X(1), . . . , X(n)}. For this we introduce
D(f_θ, I_n) ≡ ∫_{−π}^{π} [ log det{f_θ(λ)} + tr{I_n(λ) f_θ(λ)^{−1}} ] dλ. (8.2.2)

If we assume that {X(t)} is Gaussian, it is known (cf. Section 7.2 of Taniguchi and Kakizawa
(2000)) that −(n/4π) D(f_θ, I_n) is an approximation of the log-likelihood. However, henceforth, we do
not assume Gaussianity of {X(t)}; we call θ̂ ≡ arg min_{θ∈Θ} D(f_θ, I_n) a quasi-Gaussian maximum
likelihood estimator, and write it as θ̂_QGML.
Assumption 8.2.1. (i) The true value θ0 of θ belongs to Int Θ.
(ii) D(fθ , fθ0 ) ≥ D(fθ0 , fθ0 ) for all θ ∈ Θ, and the equality holds if and only if θ = θ0 .
(iii) fθ (λ) is continuously twice differentiable with respect to θ.
(iv) The r × r matrix

Γ(θ_0) = [ (1/4π) ∫_{−π}^{π} tr{ f_{θ_0}(λ)^{−1} (∂/∂θ_j) f_{θ_0}(λ) f_{θ_0}(λ)^{−1} (∂/∂θ_l) f_{θ_0}(λ) } dλ ; j, l = 1, . . . , r ]

is nonsingular.
In view of Theorem 8.1.1 we can show the asymptotics of θ̂_QGML. We briefly mention the outline of the
derivation. Since f_θ is smooth with respect to θ, and θ_0 is unique, the consistency of θ̂_QGML follows
from (i) of Theorem 8.1.1. Expanding the right-hand side of

0 = (∂/∂θ) D(f_{θ̂_QGML}, I_n) (8.2.3)
at θ_0, we observe that

0 = (∂/∂θ) D(f_{θ_0}, I_n) + (∂^2/∂θ∂θ′) D(f_{θ_0}, I_n)(θ̂_QGML − θ_0) + lower order terms, (8.2.4)

leading to

√n(θ̂_QGML − θ_0) = −[ (∂^2/∂θ∂θ′) D(f_{θ_0}, I_n) ]^{−1} [ √n (∂/∂θ) D(f_{θ_0}, I_n) ] + lower order terms. (8.2.5)

Application of (i) and (ii) in Theorem 8.1.1 to (∂^2/∂θ∂θ′) D(f_{θ_0}, I_n) and √n (∂/∂θ) D(f_{θ_0}, I_n), respectively,
yields
Theorem 8.2.2. (Hosoya and Taniguchi (1982)) Under Assumption 8.2.1 and (HT0)–(HT6), the
following assertions hold true.
(i) θ̂_QGML →^p θ_0 as n → ∞. (8.2.6)

(ii) The distribution of √n(θ̂_QGML − θ_0) tends to the normal distribution with zero mean vector
and covariance matrix

Γ(θ_0)^{−1} + Γ(θ_0)^{−1} Π(θ_0) Γ(θ_0)^{−1}, (8.2.7)

where Π(θ) = {Π jl (θ)} is an r × r matrix whose ( j, l)th element is


Π_{jl}(θ) = (1/8π) Σ_{q,s,u,v=1}^{m} ∫∫_{−π}^{π} (∂/∂θ_j) f_θ^{(q,s)}(λ_1) (∂/∂θ_l) f_θ^{(u,v)}(λ_2) Q^X_{qsuv}(−λ_1, λ_2, −λ_2) dλ_1 dλ_2.

Here f_θ^{(q,s)}(λ) is the (q, s)th element of f_θ(λ)^{−1}.
The results above are very general, and can be applied to a variety of non-Gaussian vector pro-
cesses (not only VAR, VMA and VARMA, but also second-order stationary “nonlinear processes”).
It may be noted that the Wold decomposition theorem (e.g., Hannan (1970, p.137)) enables us to
represent second-order stationary nonlinear processes as linear processes as in (8.1.1).
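As a toy illustration of the criterion (8.2.2) (our own Python sketch, not from the text), the quasi-Gaussian estimator can be computed for a scalar AR(1) model with unit innovation variance by minimizing a Riemann-sum version of D(f_θ, I_n) over a grid; the innovations here are deliberately non-Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a scalar AR(1): X(t) = theta0 * X(t-1) + u(t), with uniform
# (non-Gaussian) innovations to stress that Gaussianity is not assumed.
theta0, n = 0.6, 2000
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)   # mean 0, variance 1
X = np.zeros(n)
for t in range(1, n):
    X[t] = theta0 * X[t - 1] + u[t]

lam = 2 * np.pi * np.arange(1, n // 2) / n
I = np.abs(np.fft.fft(X)[1 : n // 2]) ** 2 / (2 * np.pi * n)  # periodogram

def D(theta):
    """Riemann-sum version of the criterion (8.2.2) for AR(1), sigma^2 = 1."""
    f = 1.0 / (2 * np.pi * np.abs(1 - theta * np.exp(1j * lam)) ** 2)
    return np.mean(np.log(f) + I / f)

grid = np.linspace(-0.95, 0.95, 381)
theta_hat = grid[np.argmin([D(th) for th in grid])]
print(theta_hat)   # close to the true value 0.6
```

This is the classical Whittle-type estimator: for the AR(1) spectral density with unit innovation variance the log-determinant term is constant in θ, so the fit is driven by matching the periodogram to f_θ.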
Although the model (8.1.1) is very general, the estimation criterion D(fθ0 , In ) is a quadratic form
motivated by Gaussian log-likelihood. In what follows we introduce an essentially non-Gaussian
likelihood for linear processes.
Let {X(t)} be generated by

X(t) = Σ_{j=0}^{∞} A_θ( j)U (t − j), (8.2.8)

where U (t)’s are i.i.d. with probability density function p(u), and θ ∈ Θ (an open set) ⊂ Rr .
Assumption 8.2.3. (i) lim_{||u||→∞} p(u) = 0, ∫ u p(u)du = 0, ∫ uu′ p(u)du = I_m and ∫ ||u||^4 p(u)du < ∞.
(ii) The derivatives Dp and D^2 p ≡ D(Dp) exist on R^m, and every component of D^2 p satisfies the
Lipschitz condition.
(iii) ∫ ||φ(u)||^4 p(u)du < ∞ and ∫ D^2 p(u)du = 0, where φ(u) = p^{−1} Dp.
We impose the following conditions on Aθ ( j)’s.
Assumption 8.2.4. (i) There exists 0 < ρ_A < 1 such that

|A_θ( j)| = O(ρ_A^j), j ∈ N.

(ii) Every A_θ( j) = {A_{θ,ab}( j)} is continuously twice differentiable with respect to θ, and the
derivatives satisfy

|∂_{i_1} · · · ∂_{i_k} A_{θ,ab}( j)| = O(γ_A^j), k = 1, 2, j ∈ N,

for some 0 < γ_A < 1 and for a, b = 1, . . . , m.
(iii) Every ∂_{i_1}∂_{i_2} A_{θ,ab}( j) satisfies the Lipschitz condition for all i_1, i_2 = 1, . . . , r and j ∈ N.
(iv) det{ Σ_{j=0}^{∞} A_θ( j)z^j } ≠ 0 for |z| ≤ 1 and

( Σ_{j=0}^{∞} A_θ( j)z^j )^{−1} = I_m + B_θ(1)z + B_θ(2)z^2 + · · · , |z| ≤ 1,

where |B_θ( j)| = O(ρ_B^j), j ∈ N, for some 0 < ρ_B < 1.
(v) Every B_θ( j) = {B_{θ,ab}( j)} is continuously twice differentiable with respect to θ, and the
derivatives satisfy

|∂_{i_1} · · · ∂_{i_k} B_{θ,ab}( j)| = O(γ_B^j), k = 1, 2, j ∈ N,

for some 0 < γ_B < 1 and for a, b = 1, . . . , m.
(vi) Every ∂_{i_1}∂_{i_2} B_{θ,ab}( j) satisfies the Lipschitz condition for all i_1, i_2 = 1, . . . , r, a, b = 1, . . . , m
and j ∈ N.
If {X(t)} is a regular VARMA model, then the assumptions above are satisfied. Hence, Assump-
tions 8.2.3 and 8.2.4 are natural ones.
The likelihood function of {U (s), s ≤ 0, X(1), . . . , X(n)} is given in the form of

dQ_{n,θ} = [ Π_{t=1}^{n} p{ Σ_{j=0}^{t−1} B_θ( j)X(t − j) + Σ_{r=0}^{∞} C_θ(r, t)U (−r) } ] dQ_u, (8.2.9)

where Qu is the probability distribution of {U (s), s ≤ 0}. Since {U (s), s ≤ 0} are unobservable, we
use the “quasi-likelihood”
 
L_n(θ) = Π_{t=1}^{n} p{ Σ_{j=0}^{t−1} B_θ( j)X(t − j) } (8.2.10)

for estimation of θ. A quasi-maximum likelihood estimator θ̂QML of θ is defined as a solution of the


equation

(∂/∂θ) Σ_{t=1}^{n} log p{ Σ_{j=0}^{t−1} B_θ( j)X(t − j) } = 0 (8.2.11)
with respect to θ. Henceforth we denote the true value of θ by θ0 .


Theorem 8.2.5. (Taniguchi and Kakizawa (2000, p. 72)) Under Assumptions 8.2.3 and 8.2.4, the
following statements hold.
(i) There exists a statistic θ̂_QML that solves (8.2.11) such that for some c_1 > 0,

P_{n,θ_0}[ √n ||θ̂_QML − θ_0|| ≤ c_1 log n ] = 1 − o(1), (8.2.12)

uniformly for θ_0 ∈ C, where C is a compact subset of Θ.
(ii)
√n(θ̂_QML − θ_0) →^d N(0, Γ(θ_0)^{−1}), (8.2.13)

where Γ(θ) is an r × r matrix whose (i_1, i_2)th component is given by

Γ_{i_1 i_2}(θ) = tr[ F (p) Σ_{j_1=1}^{∞} Σ_{j_2=1}^{∞} {∂_{i_1} B_θ( j_1)} R( j_1 − j_2) {∂_{i_2} B_θ( j_2)}′ ].

Here R( j) = E{X(s)X(s + j)′} and F (p) = ∫ φ(u)φ(u)′ p(u)du.
Next we return to the CHARN model (2.1.36):
Xt = Fθ (Xt−1 , . . . , Xt−p ) + Hθ (Xt−1 , . . . , Xt−p )Ut , (8.2.14)
where Ut ’s are i.i.d. with probability density function p(u).
Assumption 8.2.6. (i) All the conditions (i)–(iv) for stationarity in Theorem 2.1.20 hold.
(ii) Eθ ||Fθ (Xt−1 , . . . , Xt−p )||2 < ∞, Eθ ||Hθ (Xt−1 , . . . , Xt−p )||2 < ∞, for all θ ∈ Θ.
(iii) There exists c > 0 such that

c ≤ ||H_{θ′}^{−1/2}(x) H_θ(x) H_{θ′}^{−1/2}(x)|| < ∞,

for all θ, θ′ ∈ Θ and for all x ∈ R^{mp}.
(iv) Hθ and Fθ are continuously differentiable with respect to θ, and their derivatives ∂ j Hθ
and ∂ j Fθ (∂ j = ∂/∂θ j , j = 1, . . . , r) satisfy the condition that there exist square-integrable
functions A j and B j such that
||∂ j Hθ || ≤ A j and ||∂ j Fθ || ≤ B j , j = 1, . . . , r, for all θ ∈ Θ.
(v) The innovation density p(·) satisfies lim_{||u||→∞} ||u|| p(u) = 0 and ∫ uu′ p(u)du = I_m, where I_m is the
m × m identity matrix.
(vi) The continuous derivative Dp of p(·) exists on R^m, and ∫ ||p^{−1} Dp||^4 p(u)du < ∞, ∫ ||u||^2
||p^{−1} Dp||^2 p(u)du < ∞.
Write η_t(θ) ≡ log[ p{H_θ^{−1}(X_t − F_θ)} {det H_θ}^{−1} ]. We further impose the following assumption.
Assumption 8.2.7. (i) The true value θ0 of θ is the unique maximum of
lim_{n→∞} n^{−1} Σ_{t=p}^{n} E{η_t(θ)}

with respect to θ ∈ Θ.
(ii) ηt (θ)’s are three times differentiable with respect to θ, and there exist functions Qti j =
Qti j (X1 , . . . , Xt ) and T it jk = T it jk (X1 , . . . , Xt ) such that

|∂i ∂ j ηt (θ)| ≤ Qti j , EQti j < ∞,


and
|∂i ∂ j ∂k ηt (θ)| ≤ T it jk , ET it jk < ∞.

Let Φ_t ≡ H_θ^{−1}(X_t − F_θ), and write

W_t = ( −vec′(H_θ^{−1} ∂_1 H_θ), ∂_1 F_θ′ · H_θ^{−1} ; . . . ; −vec′(H_θ^{−1} ∂_r H_θ), ∂_r F_θ′ · H_θ^{−1} ),
an r × (m^2 + m) matrix whose jth row is ( −vec′(H_θ^{−1} ∂_j H_θ), ∂_j F_θ′ · H_θ^{−1} ),

Ψ_t = ( {Φ_t ⊗ I_m} p^{−1}(Φ_t)Dp(Φ_t) + vec I_m ; p^{−1}(Φ_t)Dp(Φ_t) ), an (m^2 + m) × 1 vector,

F (p) = E{Ψ_t Ψ_t′}, an (m^2 + m) × (m^2 + m) matrix,
Γ(p, θ) = E{W_t F (p) W_t′}, an r × r matrix.

For two hypothetical values θ, θ′ ∈ Θ, let

Λ_n(θ, θ′) ≡ Σ_{t=p}^{n} log [ p{H_{θ′}^{−1}(X_t − F_{θ′})} det H_θ ] / [ p{H_θ^{−1}(X_t − F_θ)} det H_{θ′} ]. (8.2.15)

The maximum likelihood estimator of θ is given by

θ̂_ML ≡ arg max_θ Λ_n(θ̄, θ), (8.2.16)

where θ̄ ∈ Θ is some fixed value.


Theorem 8.2.8. (Kato et al. (2006)) Under Assumptions 8.2.6 and 8.2.7, the following statements
hold.
(i) θ̂_ML →^p θ_0 as n → ∞,
(ii) √n(θ̂_ML − θ_0) →^d N(0, Γ(p, θ_0)^{−1}), as n → ∞.
Let X = {X_t, F_t, t ≥ 0} be an m-dimensional semimartingale with the canonical representation

X_t = X_0 + α_t + X_t^{(c)} + ∫_0^t ∫_{||x||>1} x dμ + ∫_0^t ∫_{||x||≤1} x d(μ − ν), (8.2.17)

where X (c) ≡ {Xt(c) } is the continuous martingale component of Xt , the components of α ≡ {αt }
are processes of bounded variation (Definition 2.1.17), the random measure µ is defined by
μ(ω; dt, dx) = Σ_s χ{ΔX_s(ω) ≠ 0} ε_{(s, ΔX_s(ω))}(dt, dx),

where ǫa is the Dirac measure at a and ν is the compensator of µ (see Jacod and Shiryaev (1987),
Prakasa Rao (1999)). Letting β ≡ hX (c) , X (c) i, the triple (α, β, ν) is called the triplet of P-local
characteristics of the semimartingale X.
Because the notation and regularity conditions for the asymptotic likelihood theory of semimartin-
gales are unnecessarily complicated, in what follows we state only the essential line of the discussion. For
details, see Prakasa Rao (1999) and Sørensen (1991), etc.
Suppose that the triple (α, β, ν) of (8.2.17) depends on unknown parameter θ = (θ1 , . . . , θ p )′ ∈ Θ
(open set) ⊂ R p , i.e., (αθ , βθ , νθ ). Then, based on the observation {X s : 0 ≤ s ≤ t} from (8.2.17),
the log-likelihood function is given by
l_t(θ) = ∫_0^t G_s(θ) dX_s^{(c)} + ∫_0^t A_s(θ) ds
+ ∫_0^t ∫_{R^m} B_s(θ, X_{s−}, x)(μ − ν_θ)(dx, ds)
+ ∫_0^t ∫_{R^m} C_s(θ, X_{s−}, x) μ(dx, ds), (8.2.18)

where G_s(·), A_s(·), B_s(·, ·, ·) and C_s(·, ·, ·) are smooth with respect to θ, and F_s-measurable with respect
to the other arguments. Let

I_t(θ) ≡ ⟨ (∂/∂θ) l(θ) ⟩_t, J_t(θ) ≡ [ (∂/∂θ) l(θ) ]_t, j_t(θ) ≡ −(∂^2/∂θ∂θ′) l_t(θ). (8.2.19)
It is seen that

Eθ {jt (θ)} = Eθ {It (θ)} = Eθ {Jt (θ)} = it (θ), (say). (8.2.20)

Let Dt (θ) = diag(dt (θ)11, . . . , dt (θ) pp ), where dt (θ) j j is the jth diagonal element of it (θ). Based
on the law of large numbers (Theorem 2.1.27) and the central limit theorem (Theorem 8.1.2), the
following results can be proved under appropriate regularity conditions (see Prakasa Rao (1999)).
Theorem 8.2.9. (1) Under P = P_θ, almost surely on the set {the minimum eigenvalue of I_t(θ) →
∞}, it holds that the likelihood equation

(∂/∂θ) l_t(θ) = 0 (8.2.21)

has a solution θ̂_t satisfying θ̂_t →^p θ as t → ∞.
(2) Assume that

D_t(θ)^{−1/2} j_t(θ̃_t) D_t(θ)^{−1/2} →^p η^2(θ) under P_θ as t → ∞, (8.2.22)

for any θ̃_t lying on the line segment connecting θ and θ̂_t. Then, conditionally on the event
{det(η^2(θ)) > 0},
D_t(θ)^{1/2}(θ̂_t − θ) →^d η^{−2}(θ)Z (stably), (8.2.23)

[ D_t(θ)^{−1/2} j_t(θ̂_t) D_t(θ)^{−1/2} ]^{1/2} D_t(θ)^{1/2}(θ̂_t − θ) →^d N(0, I_p) (mixing),

and

2[ l_t(θ̂_t) − l_t(θ) ] →^d χ^2_p (mixing), (8.2.24)

under P_θ as t → ∞. Here the random vector Z has the characteristic function

φ(u) = E_θ[ exp{ −(1/2) u′ η^2(θ)u } ], u = (u_1, . . . , u_p)′. (8.2.25)

Example 8.2.10. Let {Xt } be generated by

dXt = θXt dt + σdWt + dNt , t ≥ 0, X0 = x0 (8.2.26)

where {Wt } is a standard Wiener process and {Nt } is a Poisson process with intensity λ. Assuming
σ > 0 is known, we estimate (θ, λ) by the maximum likelihood estimator (MLE) from the observation
{X s : 0 ≤ s ≤ t}. The MLE are
θ̂_t = ∫_0^t X_{s−} dX_s^{(c)} / ∫_0^t X_s^2 ds, (8.2.27)

and

λ̂_t = N_t / t, (8.2.28)

where X_t^{(c)} = X_t − Σ_{s≤t} ΔX_s and N_t is the number of jumps before t.
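Example 8.2.10 can be imitated numerically; the following Python sketch (an Euler discretization with illustrative parameter values of our own choosing, not from the text) computes discretized versions of (8.2.27) and (8.2.28):

```python
import numpy as np

rng = np.random.default_rng(3)

# Euler scheme for dX = theta*X dt + sigma dW + dN of (8.2.26), with the
# illustrative values theta = -0.5, sigma = 1, lambda = 2.
theta, sigma, lam_true = -0.5, 1.0, 2.0
T, n = 100.0, 200_000
dt = T / n
jumps = rng.random(n) < lam_true * dt            # Bernoulli jump indicators
dW = rng.standard_normal(n) * np.sqrt(dt)
x, num, den = 0.0, 0.0, 0.0
for k in range(n):
    dXc = theta * x * dt + sigma * dW[k]         # continuous-part increment
    num += x * dXc                               # approximates int X_{s-} dX_s^(c)
    den += x * x * dt                            # approximates int X_s^2 ds
    x += dXc + jumps[k]
theta_hat = num / den                            # discretized (8.2.27)
lam_hat = jumps.sum() / T                        # (8.2.28): N_T / T
print(theta_hat, lam_hat)   # near -0.5 and 2, up to discretization/sampling error
```

Separating the continuous increment from the jump increment is what makes the drift estimator feasible here: (8.2.27) integrates against dX^(c), so the jumps enter only through the counting estimator (8.2.28).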
8.3 Statistical Optimal Theory
Portfolio Estimation Influenced by Non-Gaussian Innovations and Exogenous Variables
In the previous section we discussed the asymptotics of MLE for very general stochastic pro-
cesses. Here we show the optimality of MLE, etc.
Lucien Le Cam introduced the concept of local asymptotic normality (LAN) for the likelihood
ratio of regular statistical models. Once LAN is proved, the asymptotic optimality of estimators and
tests is described in terms of the LAN property. In what follows we state general LAN theorems due
to Le Cam (1986) and Swensen (1985).
Let P0,n and P1,n be two sequences of probability measures on certain measurable spaces
(Ωn , Fn ). Suppose that there is a sequence Fn,k , k = 1, . . . , kn , of sub σ-algebras of Fn such that
Fn,k ⊂ Fn,k+1 and such that Fn,kn = Fn . Let Pi,n,k be the restriction of Pi,n to Fn,k , and let γn,k
be the Radon–Nikodym density taken on Fn,k of the part of P1,n,k that is dominated by P0,n,k . Let
Y_{n,k} = (γ_{n,k}/γ_{n,k−1})^{1/2} − 1, where γ_{n,0} = 1 and n = 1, 2, . . . . Then the logarithm of the likelihood ratio

Λ_n = log (dP_{1,n} / dP_{0,n}) (8.3.1)

taken on F_n is written as Λ_n = 2 Σ_k log(Y_{n,k} + 1).
Theorem 8.3.1. (Le Cam (1986)) Suppose that under P_{0,n} the following conditions are satisfied:

(L1) max_k |Y_{n,k}| →^p 0,
(L2) Σ_k Y_{n,k}^2 →^p τ^2/4,
(L3) Σ_k E(Y_{n,k}^2 + 2Y_{n,k} | F_{n,k−1}) →^p 0, and
(L4) Σ_k E{ Y_{n,k}^2 χ(|Y_{n,k}| > δ) | F_{n,k−1} } →^p 0 for some δ > 0.

Then, Λ_n →^d N(−τ^2/2, τ^2) as n → ∞.
Remark 8.3.2. If both P0,n,k and P1,n,k are absolutely continuous with respect to Lebesgue measure
µk on Rk , the condition (L3) above is always satisfied (Problem 8.1).
A version of the theorem is given with replacing the Yn,k by martingale differences.
Theorem 8.3.3. (Swensen (1985)) Suppose that there exists an adapted stochastic process
{W_{n,k}, F_{n,k}} which satisfies

(S1) E(W_{n,k} | F_{n,k−1}) = 0 a.s.,
(S2) E{ Σ_k (Y_{n,k} − W_{n,k})^2 } → 0,
(S3) sup_n E( Σ_k W_{n,k}^2 ) < ∞,
(S4) max_k |W_{n,k}| →^p 0,
(S5) Σ_k W_{n,k}^2 →^p τ^2/4, and
(S6) Σ_k E{ W_{n,k}^2 χ(|W_{n,k}| > δ) | F_{n,k−1} } →^p 0 for some δ > 0,

where →^p in (S4)–(S6) are under P_{0,n}. Then the conditions (L1), (L2) and (L4) of Theorem 8.3.1
hold. Further, if (L3) of Theorem 8.3.1 is satisfied, then

Λ_n →^d N(−τ^2/2, τ^2)

as n → ∞.
Remark 8.3.4. (Kreiss (1990)) The conditions (S5) and (S6) can be replaced by

(S5)′ Σ_k E(W_{n,k}^2 | F_{n,k−1}) →^p τ^2/4,
(S6)′ Σ_k E{ W_{n,k}^2 χ(|W_{n,k}| > ε) | F_{n,k−1} } →^p 0 for every ε > 0,
under P0,n (Problem 8.2).
Using Theorem 8.3.3, we shall prove the LAN theorem for the m-vector non-Gaussian linear
process defined by (8.2.8):

(LP) X(t) = Σ_{j=0}^{∞} A_θ( j)U (t − j)

satisfying Assumption 8.2.3. First, we give a brief review of LAN results for stochastic processes.
Swensen (1985) showed that the likelihood ratio of an autoregressive process of finite order
with a trend is LAN. For ARMA processes, Hallin et al. (1985a) and Kreiss (1987) proved the LAN
property, and applied the results to test and estimation theory. Kreiss (1990) showed the LAN for a
class of autoregressive processes with infinite order. Also, Garel and Hallin (1995) established the
LAN property for multivariate ARMA models with a linear trend.
Returning to (LP), let
A_θ(z) = Σ_{j=0}^{∞} A_θ( j)z^j (A_θ(0) = I_m), |z| ≤ 1, (8.3.2)

where Aθ ( j) = {Aθ,ab ( j); a, b = 1, . . . , m}, j ∈ N. We make the following assumptions.


Assumption 8.3.5. (i) For some d (0 < d < 1/2), the coefficient matrices Aθ ( j) satisfy

|Aθ ( j)| = O( j−1+d ), j ∈ N (8.3.3)

where |Aθ ( j)| denotes the sum of the absolute values of the entries of Aθ ( j).
(ii) Every Aθ ( j) is continuously two times differentiable with respect to θ, and the derivatives
satisfy

|∂i1 ∂i2 · · · ∂ik Aθ,ab ( j)| = O{ j−1+d (log j)k }, k = 0, 1, 2 (8.3.4)

for a, b = 1, . . . , m, where ∂_i = ∂/∂θ_i.


(iii) det A_θ(z) ≠ 0 for all |z| ≤ 1 and A_θ(z)^{−1} can be expanded as

Aθ (z)−1 = Im + Bθ (1)z + Bθ (2)z2 + · · · , (8.3.5)

where Bθ ( j) = {Bθ,ab( j); a, b = 1, . . . , m}, j = 1, 2, . . . , satisfy

|Bθ ( j)| = O( j−1−d ). (8.3.6)

(iv) Every Bθ ( j) is continuously two times differentiable with respect to θ, and the derivatives
satisfy

|∂i1 ∂i2 · · · ∂ik Bθ,ab( j)| = O{ j−1−d (log j)k }, k = 0, 1, 2 (8.3.7)

for a, b = 1, . . . , m.
Under Assumption 8.3.5, our model (LP) is representable as
Σ_{j=0}^{∞} B_θ( j)X(t − j) = U (t), B_θ(0) = I_m. (8.3.8)

Then it follows that

U (t) = Σ_{j=0}^{t−1} B_θ( j)X(t − j) + Σ_{r=0}^{∞} C_θ(r, t)U (−r), (8.3.9)
where

C_θ(r, t) = Σ_{r′=0}^{∞} B_θ(r′ + t)A(r − r′).

From Assumption 8.3.5, we can show

Cθ (r, t) = O(t−d/2 )O(r−1+d ) (8.3.10)

(Problem 8.3).
Let Qn,θ and Qu be the probability distributions of {U (s), s ≤ 0, X(1), . . . , X(n)} and {U (s), s ≤
0}, respectively. Then, from (8.3.9) it follows that
 
dQ_{n,θ} = [ Π_{t=1}^{n} p{ Σ_{j=0}^{t−1} B_θ( j)X(t − j) + Σ_{r=0}^{∞} C_θ(r, t)U (−r) } ] dQ_u. (8.3.11)

For two hypothetical values θ, θ′ ∈ Θ, the log-likelihood ratio is

Λ_n(θ, θ′) ≡ log (dQ_{n,θ′} / dQ_{n,θ})
= Σ_{k=1}^{n} log [ p{ U (k) + Σ_{j=0}^{k−1} (B_{θ′}( j) − B_θ( j))X(k − j)
+ Σ_{r=0}^{∞} (C_{θ′}(r, k) − C_θ(r, k))U (−r) } / p{U (k)} ]. (8.3.12)

We denote by H(p; θ) the hypothesis under which the underlying parameter is θ ∈ Θ and the
probability density of U_t is p(·). Introduce the contiguous sequence

θ_n = θ + n^{−1/2} h, h = (h_1, . . . , h_p)′ ∈ H ⊂ R^p, (8.3.13)
and let

F (p) = ∫ φ(u)φ(u)′ p(u)du (Fisher information matrix),

and

R(t) = E{X(s)X(t + s)′}, t ∈ Z,

where φ(u) = p^{−1} (d/du) p(u).
Now we can state the local asymptotic normality (LAN) for the vector process (LP) ((8.2.8)).
(For a proof, see Taniguchi and Kakizawa (2000)).
Theorem 8.3.6. For the vector-valued non-Gaussian linear process, suppose that Assumptions
8.2.3 and 8.3.5 hold. Then the sequence of experiments

En = {RZ , BZ, {Qn,θ : θ ∈ Θ ⊂ R p }}, n ∈ N,

is locally asymptotically normal and equicontinuous on compact subset C of H. That is,


(1) For all θ ∈ Θ, the log-likelihood ratio Λ_n(θ, θ_n) has, under H(p; θ) as n → ∞, the asymptotic
expansion

Λ_n(θ, θ_n) = Δ_n(h; θ) − (1/2) Γ_h(θ) + o_p(1), (8.3.14)

where

Δ_n(h; θ) = (1/√n) Σ_{k=1}^{n} φ{U (k)}′ Σ_{j=1}^{k−1} B_{h′∂θ}( j) X(k − j), (8.3.15)

and

Γ_h(θ) = tr{ F (p) Σ_{j_1=1}^{∞} Σ_{j_2=1}^{∞} B_{h′∂θ}( j_1) R( j_1 − j_2) B_{h′∂θ}( j_2)′ }

with

B_{h′∂θ}( j) = Σ_{l=1}^{p} h_l ∂_l B_θ( j).

(2) Under H(p; θ),

Δ_n(h; θ) →^d N(0, Γ_h(θ)). (8.3.16)

(3) For all n ∈ N and all h ∈ H, the mapping

h → Qn,θn

is continuous with respect to the variational distance

kP − Qk = sup{|P(A) − Q(A)| : A ∈ BZ }.

Henceforth we denote the distribution law of a random vector Y_n under Q_{n,θ} by L(Y_n | Q_{n,θ}), and
its weak convergence to L by L(Y_n | Q_{n,θ}) →^d L. Define the class A of sequences {T_n} of estimators of θ
as follows:

A = [ {T_n} : L{ √n(T_n − θ_n) | Q_{n,θ_n} } →^d L_θ(·), a probability distribution ], (8.3.17)

where L_θ(·) in general depends on {T_n}.


Let L be the class of all loss functions l : Rq → [0, ∞) of the form l(x) = τ(|x|) which satisfies
τ(0) = 0 and τ(a) ≤ τ(b) if a ≤ b. Typical examples are l(x) = χ(|x| > a) and l(x) = |x| p , p ≥ 1.
Definition 8.3.7. Assume that the LAN property (8.3.14) holds. Then a sequence {θ̂n } of estimators
of θ is said to be a sequence of asymptotically centering estimators if

√n(θ̂_n − θ) − Γ(θ)^{−1} Δ_n = o_p(1) in Q_{n,θ}. (8.3.18)

The following theorem can be proved along the lines of Jeganathan (1995).
Theorem 8.3.8. Assume that LAN property (8.3.14) holds, and ∆ ≡ N(0, Γ(θ)). Suppose that {T n } ∈
A. Then, the following statements hold.
(i) For any l ∈ L satisfying E{l(∆)} < ∞,

lim inf_{n→∞} E[ l{ √n(T_n − θ) } | Q_{n,θ} ] ≥ E{l(Γ^{−1}∆)}. (8.3.19)

(ii) If, for a nonconstant l ∈ L satisfying E{l(∆)} < ∞,

lim sup_{n→∞} E[ l{ √n(T_n − θ) } | Q_{n,θ} ] ≤ E{l(Γ^{−1}∆)}, (8.3.20)

then {T_n} is a sequence of asymptotically centering estimators.


From Theorem 8.3.8, we will define the asymptotic efficiency as follows.
Definition 8.3.9. A sequence of estimators {θ̂n } ∈ A of θ is said to be asymptotically efficient if it
is a sequence of asymptotically centering estimators.
Recall Theorem 8.2.5. Under Assumptions 8.2.3 and 8.2.4, it is shown that

√n(θ̂_QML − θ_0) − Γ(θ_0)^{−1} Δ_n = o_p(1) in Q_{n,θ_0}. (8.3.21)

Let Sn be a p-dimensional statistic based on {X(1), . . . , X(n)}. The following lemma is called
Le Cam’s third lemma (e.g., Hájek and Šidák (1967)).
Lemma 8.3.10. Assume that the LAN property (8.3.14) holds. If (S_n′, Λ_n(θ_0, θ_n))′ →^d N_{p+1}(m, Σ)
− N p+1 (m, Σ)
under Qn,θ0 , where m′ = (µ′ , −σ22 /2) and
!
Σ11 σ12
Σ= ′ ,
σ12 σ22

then

S_n →^d N_p(μ + σ_{12}, Σ_{11}) under Q_{n,θ_n}. (8.3.22)

From (8.3.21) we observe that

{ √n(θ̂_QML − θ_0)′, Λ_n(θ_0, θ_n) }′ →^d N_{p+1}(m, Σ), under Q_{n,θ_0},

where m′ = (0′, −h′Γ(θ_0)h/2) and

Σ = ( Γ(θ_0)^{−1}  h ; h′  h′Γ(θ_0)h ).

By Le Cam's third lemma, it is seen that

L{ √n(θ̂_QML − θ_n) | Q_{n,θ_n} } →^d N_p(0, Γ(θ_0)^{−1}),

leading to θ̂_QML ∈ A. Hence θ̂_QML is asymptotically efficient.


In what follows we denote by En,h (·) the expectation with respect to Qn,θn . Let L0 be a linear
subspace of R p . Consider the problem of testing (H, A):

H : h ∈ L0 against A : h ∈ R p − L0 . (8.3.23)

Definition 8.3.11. A sequence of tests ϕn , n ∈ N, is asymptotically unbiased of level α ∈ [0, 1] for


(H, A) if

lim sup En,h (ϕn ) ≤ α if h ∈ L0 ,


n→∞
lim inf En,h (ϕn ) ≥ α if h ∈ R p − L0 .
n→∞

The following results are essentially due to Strasser (1985).


Theorem 8.3.12. Suppose that the LAN property (8.3.14) holds for the non-Gaussian vector linear
process (8.2.8). Let πL0 be the orthogonal projection of R p onto L0 , id the identity map and ∆˜ n ≡
Γ−1/2 ∆n . Then the following assertions hold true.
(i) For α ∈ [0, 1], we can choose k_α so that the sequence of tests

ϕ*_n = 1 if ||(id − π_{L_0}) ◦ Δ̃_n|| > k_α, and ϕ*_n = 0 if ||(id − π_{L_0}) ◦ Δ̃_n|| ≤ k_α,
is asymptotically unbiased of level α for (H, A).
(ii) If {ϕn } is another sequence that is asymptotically unbiased of level α for (H, A), then

lim sup inf En,h (ϕn ) ≤ lim inf En,h (ϕ∗n ),


n→∞ h∈Bc n→∞ h∈Bc

where Bc = {h ∈ R p : kh − πL0 (h)k = c}. If the equality holds, we say that {ϕn } is locally
asymptotically optimal.
(iii) If {ϕn } is any sequence of tests satisfying for at least one c > 0,

lim sup En,0 (ϕn ) ≤ α,


n→∞

and

lim inf inf En,h (ϕn ) ≥ lim En,h (ϕ∗n )


n→∞ h∈Bc n→∞

whenever h ∈ Bc and h ∈ L⊥0 , then {ϕn } converges in distribution to the distribution limit of
ϕ∗n under Qn,θ .
Let M(B) be the linear space spanned by the columns of a matrix B. Consider the problem of
testing

H : √n(θ − θ_0) ∈ M(B) (8.3.24)

for some given p × (p − l) matrix B of full rank, and given vector θ0 ∈ R p . For this we introduce the
test statistic

ψ∗n = n(θ̂QML − θ0 )′ [Γ(θ̂QML) − Γ(θ̂QML)B{B′Γ(θ̂QML )B}−1 B′ Γ(θ̂QML )](θ̂QML − θ0 ). (8.3.25)

Then, it can be shown that ψ*_n is asymptotically equivalent to ϕ*_n in Theorem 8.3.12; hence, the test
rejecting H whenever ψ*_n exceeds the upper α-quantile χ^2_{l;α} of a chi-square distribution with l degrees of
freedom has asymptotic level α, and is locally asymptotically optimal.
For the non-Gaussian vector linear process (8.2.8), we established the LAN theorem (Theo-
rem 8.3.6) and the description of the optimal estimator (Theorem 8.3.8) and optimal test (Theorem
8.3.12). Hereafter we call the integration of these results LAN-OPT for short. LAN-OPT has been
developed for various stochastic processes. We summarize them as follows.
LAN-OPT results
(1) Time series regression model with long-memory disturbances:

Xt = zt′ β + ut , (8.3.26)

where zt ’s are observable nonrandom regressors, {ut } an unobservable FARIMA disturbance with
unknown parameter η, and θ ≡ (β ′ , η ′ )′ (Hallin et al. (1999)).
(2) ARCH(∞)-SM model {X_t}:

X_t − β′z_t = s_t u_t,
s_t^2 = b_0 + Σ_{j=1}^{∞} b_j (X_{t−j} − β′z_{t−j})^2, (8.3.27)

where {ut } ∼ i.i.d. (0, 1), and zt is σ{ut−1 , ut−2 , . . . }-measurable. Here b j = b j (η), j = 0, 1, · · · ,
and β and η are unknown parameter vectors, and θ = (β ′ , η ′ )′ (Lee and Taniguchi (2005)).
(3) Vector CHARN model {Xt } (defined by (8.2.14)):

Xt = Fθ (Xt−1 , . . . , Xt−p ) + Hθ (Xt−1 , . . . , Xt−p )Ut , (8.3.28)

(Kato et al. (2006)).


(4) Non-Gaussian locally stationary process {Xt,n }:
Z π
Xt,n = exp(iλt)Aθt,n (λ)dξ(λ), (8.3.29)
−π

where Aθt,n (λ) is the response function and {ξ(λ)} is an orthogonal increment process (Hirukawa
and Taniguchi (2006)).
(5) Vector non-Gaussian locally stationary process {Xt }:
Xt,n = µ(θ, t/n) + ∫_{−π}^{π} eiλt Aθ,t,n (λ) dξ(λ),     (8.3.30)

where µ(θ, t/n) is the mean function, and Aθ,t,n (λ) is the matrix-valued response function. Here
{ξ(λ)} is a vector-valued orthogonal increment process.
Next we will discuss the optimality of estimators for semimartingales. Let X = {X s : s ≤ t}
be generated by the semimartingale (8.2.17) with the triple (αθ , βθ , νθ ), θ ∈ Θ ⊂ R p . The log-
likelihood function based on X is given by lt (θ) in (8.2.18). Introduce the contiguous sequence

θt = θ + δt h, h = (h1 , . . . , h p )′ ∈ H ⊂ R p , (8.3.31)

where δt = diag(δ1,t , . . . , δ p,t ), (p × p-diagonal matrix), and δ j,t ’s tend to 0 as t → ∞. Let Λt (θ, θt ) ≡
lt (θt )−lt (θ). Under appropriate regularity conditions, we get the following theorem (see Prakasa Rao
(1999)).
Theorem 8.3.13. For every h ∈ R p ,

Λt (θ, θt ) = h′ ∆t (θ) − (1/2) h′ Gt (θ)h + oPθ (1),     (8.3.32)

and

(∆t (θ), Gt (θ)) →d (G(θ)1/2 W, G(θ)) as t → ∞,     (8.3.33)

where

∆t (θ) = δt (∂/∂θ) lt (θ),   Gt (θ) = δt′ [ (∂/∂θ) l(θ) ]t δt .

Here G(θ) and W are independent, and L{W } = N(0, I p ).


If (8.3.32) and (8.3.33) hold, the experiment (Ω, F , {Ft : t ≥ 0}, {Pθ : θ ∈ Θ}) is said to be
locally asymptotic mixed normal (LAMN).
Definition 8.3.14. An estimator θ̃t of θ is said to be regular at θ if
L(δt−1 (θ̃t − θt ), Gt (θ) | Pθ+δt h ) →d (S, G),     (8.3.34)

as t → ∞, where G = G(θ) and S is a random vector.


Let L be the class of loss functions used in Theorem 8.3.8. The following theorem is due to
Basawa and Scott (1983).
Theorem 8.3.15. Assume l ∈ L. Then,

lim inf_{t→∞} Eθ [l{δt−1 (θ̃t − θ)}] ≥ E[l{G−1/2 W }],     (8.3.35)

for any regular estimator θ̃t of θ.


From Theorem 8.3.15, it is natural to introduce the asymptotic efficiency of regular estimators
as follows.
300 THEORETICAL FOUNDATIONS AND TECHNICALITIES
Definition 8.3.16. A regular estimator θ̃t of θ is asymptotically efficient if
L{δt−1 (θ̃t − θ)|Pθ } →d G−1/2 W as t → ∞.     (8.3.36)

Recalling Theorem 8.2.9, we have the following.


Theorem 8.3.17. Let θ̂t be the maximum likelihood estimator defined by (8.2.21). Then θ̂t is asymp-
totically efficient.

8.4 Statistical Model Selection


So far we have introduced many parametric models for stochastic processes which are specified by
unknown parameter θ ∈ Θ ⊂ R p . Assuming that p = dim θ is known, we discussed the optimal
estimation of θ. However, in practice, the order p of parametric models must be inferred from the
data. The best known rule for determining the true value p0 of p is probably Akaike’s information
criterion (AIC) (e.g., Akaike (1974)). A number of alternative criteria have been proposed by Akaike
(1977, 1978), Schwarz (1978) and Hannan and Quinn (1979), among others. For fitting AR(p)
models, Shibata (1976) derived the asymptotic distribution of the selected order p̂ by AIC, and
showed that AIC has a tendency to overestimate p0 . There are other criteria which attempt to correct
the overfitting nature of AIC. Akaike (1977, 1978) and Schwarz (1978) proposed the Bayesian
information criterion (BIC), and Hannan and Quinn (1979) proposed the Hannan and Quinn (HQ)
a.s.
criterion. If p̂ is the selected order by BIC or HQ, then it is shown that p̂ −−→ p0 . We say that BIC
and HQ are consistent. This property is not shared by AIC, i.e., AIC is inconsistent. However, this
does not detract from AIC. For autoregressive models, Shibata (1980) showed that order
selection by AIC is asymptotically efficient in a certain sense, while order selection by BIC or HQ
is not so.
In this section we propose a generalized AIC (GAIC) for general stochastic models, which in-
cludes the original AIC as a special case. The general stochastic models concerned are characterized
by a structure g. As examples of g we can take the probability distribution function for the i.i.d. case,
the trend function for the regression model, the spectral density function for the stationary process,
and the dynamic system function for the nonlinear time series model. Thus, our setting is very gen-
eral. We will fit a class of parametric models { fθ : dim θ = p} to g by use of a measure of disparity
D( fθ , g). Concrete examples of D( fθ , g) were given in Taniguchi (1991, p. 2–3). Based on D( fθ , g)
we propose a generalized AIC (GAIC(p)) which determines the order p.
In the case of the Gaussian AR(p0 ) model, Shibata (1976) derived the asymptotic distribution
of selected order p̂ by the usual AIC. The results essentially rely on the structure of Gaussian
AR(p0 ) and the AIC. Without Gaussianity, Quinn (1988) gave formulae for the limiting probability
of overestimating the order of a multivariate AR model by use of AIC.
In this subsection we derive the asymptotic distribution of selected order p̂ by GAIC(p). The
results are applicable to very general statistical models and various disparities D( fθ , g).
We recognize that lack of information, data contamination, and other factors beyond our control
make it virtually certain that the true model is not strictly correct. In view of this, we may suppose
that the true model g would be incompletely specified by uncertain prior information (e.g., Bancroft
(1964)) and be contiguous to a fundamental parametric model fθ0 with dim θ0 = p0 (see Koul and
Saleh (1993), Claeskens and Hjort (2003), Saleh (2006)). One plausible parametric description for
g is

g = f(θ0 ,h/√n) , h = (h1 , · · · , hK−p0 )′ ,     (8.4.1)

where n is the sample size, and the true order is K. Then we may write g = fθ0 + O(n−1/2 ). In this
setting, Claeskens and Hjort (2003) proposed a focused information criterion (FIC) based on the
STATISTICAL MODEL SELECTION 301
MSE of the deviation of a class of submodels, and discussed properties of FIC. In what follows, we
will derive the asymptotic distribution of p̂ by GAIC under (8.4.1).
We now introduce the GAIC as follows. Let Xn = (X1 , · · · , Xn )′ be a collection of m-
dimensional random vectors Xt which are not necessarily i.i.d. Hence, our results can be applied
to various regression models, multivariate models and time series models. Let g be a function which
characterizes {Xt }. We will fit a class of parametric models P = { fθ : θ ∈ Θ ⊂ R p } to g by use of
a measure of disparity D( fθ , g), which takes the minimum if and only if fθ = g. Based on the ob-
served stretch Xn we estimate θ by the value θ̂n = θ̂n (Xn ) which minimizes D( fθ , ĝn ) with respect
to θ, where ĝn = ĝn (Xn ) is an appropriate estimator of g. Nearness between fθ̂n and g is measured
by
Eθ̂n {D( fθ̂n , g)}, (8.4.2)

where Eθ̂n (·) is the expectation of (·) with respect to the asymptotic distribution of θ̂n . In what
follows we give the derivation of GAIC. For this, all the appropriate regularity conditions for the
ordinary asymptotics of θ̂n and ĝn are assumed (e.g., smoothness of the model and the central limit
theorem for θ̂n ). Define the pseudo true value θ0 of θ by the requirement
D( fθ0 , g) = min_{θ∈Θ} D( fθ , g).     (8.4.3)

Expanding D( fθ̂n , g) around θ0 we have,



Eθ̂n {D( fθ̂n , g)} ≈ D( fθ0 , g) + (θ̂n − θ0 )′ (∂/∂θ) D( fθ0 , g)
                + (1/2)(θ̂n − θ0 )′ (∂2 /∂θ∂θ′ ) D( fθ0 , g)(θ̂n − θ0 ),     (8.4.4)
where ≈ means that the left-hand side is approximated by the asymptotic expectation of the right-
hand side. From (8.4.3) it follows that

(∂/∂θ) D( fθ0 , g) = 0,     (8.4.5)
which leads to
Eθ̂n {D( fθ̂n , g)} ≈ D( fθ0 , g) + (1/2)(θ̂n − θ0 )′ (∂2 /∂θ∂θ′ ) D( fθ0 , g)(θ̂n − θ0 ).     (8.4.6)
On the other hand,
D( fθ0 , g) = D( fθ̂n , ĝn ) + {D( fθ0 , ĝn ) − D( fθ̂n , ĝn )} + {D( fθ0 , g) − D( fθ0 , ĝn )}
           ≈ D( fθ̂n , ĝn ) + (1/2)(θ̂n − θ0 )′ (∂2 /∂θ∂θ′ ) D( fθ̂n , ĝn )(θ̂n − θ0 ) + M,     (8.4.7)
where M = D( fθ0 , g) − D( fθ0 , ĝn ). From (8.4.6) and (8.4.7) it holds that
Eθ̂n {D( fθ̂n , g)} ≈ D( fθ̂n , ĝn ) + (1/2)(θ̂n − θ0 )′ J(θ0 )(θ̂n − θ0 ) + M,     (8.4.8)
where J(θ0 ) = (∂2 /∂θ∂θ′ ) D( fθ0 , g). Under natural regularity conditions, we can show

√n(θ̂n − θ0 ) →d N(0, J(θ0 )−1 I(θ0 )J(θ0 )−1 ),     (8.4.9)

where I(θ0 ) = limn→∞ Var[√n (∂/∂θ) D( fθ0 , ĝn )]. Since the asymptotic mean of M is zero, (8.4.8) and
(8.4.9) validate that
T n (p) = D( fθ̂n , ĝn ) + (1/n) tr{J(θ̂n )−1 I(θ̂n )}     (8.4.10)
is an “asymptotically unbiased estimator” of Eθ̂n {D( fθ̂n , g)}. If g ∈ { fθ : θ ∈ Θ}, i.e., the true
model belongs to the family of parametric models, it often holds that I(θ0 ) = J(θ0 ). Hence T n′ (p) =
D( fθ̂n , ĝn ) + n−1 p is an “asymptotically unbiased estimator” of Eθ̂n {D( fθ̂n , g)}. Multiplying T n′ (p) by
n we call

GAIC(p) = nD( fθ̂n , ĝn ) + p (8.4.11)

a generalized Akaike information criterion (GAIC). The essential philosophy of AIC is to use an "asymptotically unbiased estimator" of the disparity between the true structure g and the estimated model fθ̂n .
GAIC is very general, and can be applied to a variety of statistical models.
Example 8.4.1. [AIC in i.i.d. case] Let X1 , · · · , Xn be a sequence of i.i.d. random variables with probability density g(·). Assume that g ∈ P = { fθ : θ ∈ Θ ⊂ R p }. We set D( fθ , g) = −∫ g(x) log fθ (x)dx and ∫ ĝn (x)dx = the empirical distribution function. Then D( fθ , ĝn ) = −n−1 Σ_{t=1}^{n} log fθ (Xt ). Let θ̂n be the maximum likelihood estimator (MLE) of θ. The GAIC is given by

GAIC(p) = − log Π_{t=1}^{n} fθ̂n (Xt ) + p,     (8.4.12)

which is equivalent to the original AIC

AIC(p) = −2 log (maximum likelihood) + 2p, (8.4.13)

(see Akaike (1974)).
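The equivalence of (8.4.12) and (8.4.13) is easy to check numerically: n D( fθ̂n , ĝn ) = −Σ log fθ̂n (Xt ) is minus the maximized log likelihood. A minimal sketch (the two nested Gaussian models and all constants are illustrative, not from the text) compares AIC for p = 1 (mean fixed at zero) against p = 2 (free mean):

```python
import math
import random

def gaussian_aic(x, mean_free):
    """AIC(p) = -2 log(maximum likelihood) + 2p for i.i.d. Gaussian data;
    p = 2 when the mean is free, p = 1 when it is fixed at zero."""
    n = len(x)
    mu = sum(x) / n if mean_free else 0.0
    s2 = sum((xi - mu) ** 2 for xi in x) / n           # Gaussian MLE of sigma^2
    max_loglik = -0.5 * n * (math.log(2 * math.pi * s2) + 1.0)
    return -2.0 * max_loglik + 2 * (2 if mean_free else 1)

random.seed(1)
x = [random.gauss(1.0, 1.0) for _ in range(200)]       # true mean is 1, not 0
aic = {1: gaussian_aic(x, mean_free=False), 2: gaussian_aic(x, mean_free=True)}
# The free-mean model wins: aic[2] < aic[1].
```

The log-likelihood penalty 2p is what distinguishes the two fits only when the extra parameter buys little; here the mean shift is large, so the p = 2 model is selected by a wide margin.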


Example 8.4.2. [AIC for ARMA models] Suppose that {X1 , · · · , Xn } is an observed stretch from a
Gaussian causal ARMA(p1 , p2 ) process with spectral density g(λ). Let
P = { fθ : fθ (λ) = (σ2 /2π) |1 + Σ_{j=1}^{p2} a j ei jλ |2 / |1 + Σ_{j=1}^{p1} b j ei jλ |2 }     (8.4.14)

be a class of parametric causal ARMA(p1 , p2 ) spectral models, where θ = (a1 , . . . , a p2 , b1 , . . . , b p1 )′ .


In this case we set
D( fθ , g) = (1/4π) ∫_{−π}^{π} { log fθ (λ) + g(λ)/ fθ (λ) } dλ,     (8.4.15)

and
ĝn (λ) = (1/2πn) | Σ_{t=1}^{n} Xt eitλ |2     (periodogram).

Write fθ (λ) = (σ2 /2π) hθ (λ). For hθ (λ), it is shown that
∫_{−π}^{π} log hθ (λ)dλ = 0,     (8.4.16)

(e.g., Brockwell and Davis (2006, p. 191)). Then

θ̂n = arg min_{θ∈Θ} D( fθ , ĝn ) = arg min_{θ∈Θ} ∫_{−π}^{π} ĝn (λ)/hθ (λ) dλ,     (8.4.17)
which is called a quasi-Gaussian MLE of θ. For σ2 , its quasi-Gaussian MLE σ̂2n (p), (p = p1 + p2 ), is given by

(∂/∂σ2 ) ∫_{−π}^{π} [ log σ2 + ĝn (λ) / { (σ2 /2π) hθ̂n (λ) } ] dλ |σ2 =σ̂2n (p) = 0,     (8.4.18)

which leads to

σ̂2n (p) = ∫_{−π}^{π} ĝn (λ)/hθ̂n (λ) dλ.     (8.4.19)

From (8.4.15) and (8.4.19) it is seen that the GAIC is equivalent to

GAIC(p) = (n/2) log σ̂2n (p) + p.     (8.4.20)
For simplicity, we assumed Gaussianity of the process; nevertheless, since (8.4.20) is based on the quasi-Gaussian likelihood, it is possible to use (8.4.20) for non-Gaussian ARMA processes.
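For pure AR(p) models the minimization (8.4.17) reduces to the Yule–Walker equations in the sample autocovariances, and σ̂2n (p) of (8.4.19) coincides with the Levinson–Durbin innovation variance (a standard equivalence, since ∫ ĝn (λ)eikλ dλ = γ̂(k)). The sketch below uses this time-domain route; the simulated AR(2) model and all constants are illustrative, not from the text:

```python
import math
import random

def autocov(x, max_lag):
    """Biased sample autocovariances gamma_hat(0..max_lag)."""
    n = len(x)
    xbar = sum(x) / n
    return [sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def innovation_variances(gamma, L):
    """Levinson-Durbin recursion: sigma2[p] = minimized (8.4.19) for AR(p), p = 0..L."""
    sigma2, phi = [gamma[0]], []
    for p in range(1, L + 1):
        k = (gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))) / sigma2[-1]
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2.append(sigma2[-1] * (1.0 - k * k))
    return sigma2

random.seed(7)
n, L = 400, 10
x = [0.0, 0.0]
for _ in range(n + 100):                  # AR(2) with a 100-step burn-in
    x.append(0.6 * x[-1] - 0.5 * x[-2] + random.gauss(0.0, 1.0))
x = x[-n:]

sigma2 = innovation_variances(autocov(x, L), L)
gaic = [0.5 * n * math.log(sigma2[p]) + p for p in range(L + 1)]   # criterion (8.4.20)
p_hat = min(range(L + 1), key=lambda p: gaic[p])
```

Only σ̂2n (p) is needed for (8.4.20), so the fitted AR coefficients themselves are discarded; with a strong AR(2) signal the selected order is at least the true order 2, while occasional overestimation reflects the inconsistency of AIC discussed below.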
Example 8.4.3. [AIC for Regression Model] Let X1 , · · · , Xn be a sequence of random variables
generated by

Xt = β1 + β2 z2t + · · · + β p z pt + ut , (8.4.21)

where ut ’s are i.i.d. with probability density g(·), and {z jt }, j = 2, · · · , p, are nonrandom variables satisfying Σ_{t=1}^{n} z2jt = O(n). As in Example 8.4.1, we set

D( fθ , g) = −∫ g(x) log fθ (x)dx,
∫ ĝn (x)dx = the empirical distribution function,

which lead to

D( fθ , ĝn ) = −(1/n) Σ_{t=1}^{n} log fθ (Xt − β1 − β2 z2t − · · · − β p z pt ).     (8.4.22)

If we take fθ (ut ) = (2πσ2 )−1/2 exp(−u2t /2σ2 ) and θ = (β1 , · · · , β p , σ2 )′ , then


GAIC(p) = (n/2) log σ̂2 (p) + p,     (8.4.23)

where

σ̂2 (p) = (1/n) Σ_{t=1}^{n} (Xt − β̂1 − β̂2 z2t − · · · − β̂ p z pt )2 ,

and (β̂1 , · · · , β̂ p ) is the least squares estimator of (β1 , · · · , β p ).
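A direct numerical sketch of this example (polynomial regressors z jt = x_t^{j−1} and all constants are illustrative choices, not from the text): generate data from a quadratic trend, fit nested polynomial models by least squares via the normal equations, and minimize (8.4.23).

```python
import math
import random

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting (for the normal equations)."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * m
    for c in reversed(range(m)):
        beta[c] = (M[c][m] - sum(M[c][k] * beta[k] for k in range(c + 1, m))) / M[c][c]
    return beta

def reg_gaic(x, y, p):
    """GAIC(p) = (n/2) log sigma_hat^2(p) + p, with regressors z_jt = x_t^(j-1), j = 1..p."""
    n = len(y)
    Z = [[xi ** j for j in range(p)] for xi in x]
    A = [[sum(zr[i] * zr[j] for zr in Z) for j in range(p)] for i in range(p)]
    b = [sum(Z[t][i] * y[t] for t in range(n)) for i in range(p)]
    beta = solve_linear(A, b)
    s2 = sum((y[t] - sum(bj * zj for bj, zj in zip(beta, Z[t]))) ** 2 for t in range(n)) / n
    return 0.5 * n * math.log(s2) + p

random.seed(3)
x = [t / 100.0 for t in range(300)]
y = [1.0 + 2.0 * xi - 1.5 * xi ** 2 + random.gauss(0.0, 0.3) for xi in x]  # true p = 3
p_hat = min(range(1, 7), key=lambda p: reg_gaic(x, y, p))
```

With the curvature well above the noise level, the selected order is never below the true p = 3, though it may occasionally exceed it, in line with the overestimation probabilities derived below.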


Example 8.4.4. [GAIC for CHARN Model] Let {X1 , · · · , Xn } be generated by the following CHARN
model:

Xt = ψθ (Xt−1 , · · · , Xt−q ) + hθ (Xt−1 , . . . , Xt−r )ut , (8.4.24)

where ψθ : Rq → R is a measurable function, hθ : Rr → R is a positive measurable function,


and {ut } is a sequence of i.i.d. random variables with known probability density p(·), and ut is
independent of {X s : s < t}, and θ = (θ1 , · · · , θ p )′ . For the asymptotic theory, ψθ (·) and hθ (·) are
assumed to satisfy appropriate regularity conditions (see Kato et al. (2006)). Then we set

fθ (x1 , · · · , xn ) = the joint pdf model for (X1 , · · · , Xn ) under (8.4.24),

and

g(x1 , · · · , xn ) = the true joint pdf of (X1 , · · · , Xn ).

If we take
D( fθ , g) = −∫ · · · ∫ g(x1 , · · · , xn ) log fθ (x1 , · · · , xn ) dx1 · · · dxn ,
∫ · · · ∫ ĝn (x1 , · · · , xn ) dx1 · · · dxn = the empirical distribution function.

Then,

D( fθ , ĝn ) = −(1/n) Σ_{t=max(q,r)}^{n} log[ p{hθ−1 (Xt − ψθ )} hθ−1 ],

leading to

GAIC(p) = −2 Σ_{t=max(q,r)}^{n} log[ p{hθ̂ML−1 (Xt − ψθ̂ML )} hθ̂ML−1 ] + 2p,     (8.4.25)

where θ̂ ML is the MLE of θ.


As we saw above, GAIC is a very general information criterion which determines the order of
model. By scanning p successively from zero to some upper limit L, the order of model is chosen
by the p that gives the minimum of GAIC(p), p = 0, 1, · · · , L. We denote the selected order by p̂.
To describe the asymptotics of GAIC, we introduce natural regularity conditions for fθ and θ̂n ,
which a wide class of statistical models satisfy. In what follows we write θ as θ(p) if we need to
emphasize the dimension.
Assumption 8.4.5. (i) fθ (·) is continuously twice differentiable with respect to θ.
(ii) The fitted model is nested, i.e., θ(p + 1) = (θ(p)′, θ p+1 )′ .
(iii) As n → ∞, the estimator θ̂n satisfies (8.4.9), and I(θ) = J(θ) if g = fθ .
(iv) If g = fθ(p0 ) , then D( fθ(p) , g) is uniquely minimized at p = p0 .
The assumption (iii) is not restrictive. For Examples 8.4.1 and 8.4.3, the estimators of ML type
are known to satisfy it. For Example 8.4.2, the q-MLE satisfies it (see Taniguchi and Kakizawa
(2000, p. 58)). For Example 8.4.4, by use of the asymptotic theory for CHARN models (see Kato
et al. (2006)), we can see that (iii) is satisfied.
In the case of the Gaussian AR(p0 ) model, Shibata (1976) derived the asymptotic distribution of
selected order p̂ by the usual AIC. The results essentially rely on the structure of Gaussian AR(p)
and AIC. In what follows we derive the asymptotic distribution of selected order p̂ by GAIC(p) for
very general statistical models including Examples 8.4.1–8.4.4. Let αi = P(χ2i > 2i), and let
ρ(l, {αi }) = Σl∗ Π_{i=1}^{l} (1/ri !) (αi /i)^{ri} ,     (8.4.26)

where Σl∗ extends over all l-tuples (r1 , · · · , rl ) of nonnegative integers satisfying r1 + 2r2 + · · · + lrl = l.
Theorem 8.4.6. (Taniguchi and Hirukawa (2012)) Suppose g = fθ(p0 ) . Then, under Assumption
8.4.5, the asymptotic distribution of p̂ selected by GAIC(p) is given by



0, 0 ≤ p < p0 ,
lim P( p̂ = p) =  (8.4.27)
n→∞ ρ(p − p0 , {αi })ρ(L − p, {1 − αi }), p0 ≤ p ≤ L,

where ρ(0, {αi }) = ρ(0, {1 − αi }) = 1.


Proof. We write θ̂n and θ0 as θ̂n (p) and θ0 (p) to emphasize their dimension. For 0 ≤ p < p0 ,

P{ p̂ = p} ≤ P{GAIC(p) < GAIC(p0 )}


= P{nD( fθ̂n (p) , ĝn ) − nD( fθ̂n (p0 ) , ĝn ) < p0 − p}. (8.4.28)

As n → ∞, D( fθ̂n (p) , ĝn ) − D( fθ̂n (p0 ) , ĝn ) tends to D( fθ0 (p) , g) − D( fθ0 (p0 ) , g) in probability, and this limit is strictly positive by (iv) of Assumption 8.4.5. It follows that n{D( fθ̂n (p) , ĝn ) − D( fθ̂n (p0 ) , ĝn )} diverges to infinity, which implies that the right-hand side of (8.4.28) converges to zero. Hence,

lim_{n→∞} P{ p̂ = p} = 0, for 0 ≤ p < p0 .     (8.4.29)

Next, we examine the case of p0 ≤ p ≤ L. First,

GAIC(p) = nD( fθ̂n (p) , ĝn ) + p
        = n[ D( fθ0 (p) , ĝn ) − (θ0 (p) − θ̂n (p))′ (∂/∂θ) D( fθ̂n (p) , ĝn )
            − (1/2)(θ0 (p) − θ̂n (p))′ (∂2 /∂θ∂θ′ ) D( fθ̂n (p) , ĝn )(θ0 (p) − θ̂n (p)) ] + p
            + lower order terms
        = nD( fθ0 (p) , ĝn ) − (1/2) √n(θ̂n (p) − θ0 (p))′ (∂2 /∂θ∂θ′ ) D( fθ̂n (p) , g) √n(θ̂n (p) − θ0 (p)) + p
            + lower order terms.     (8.4.30)

Since D( fθ0 (m) , ĝn ) = D( fθ0 (l) , ĝn ) for p0 ≤ m < l ≤ L, we investigate the asymptotics of

Qn (p) = (1/2) √n(θ̂n (p) − θ0 (p))′ (∂2 /∂θ∂θ′ ) D( fθ0 (p) , g) √n(θ̂n (p) − θ0 (p)),  (p0 ≤ p ≤ L).     (8.4.31)

For p0 ≤ m < l ≤ L, write the l × l Fisher information matrix I{θ0 (l)}, and divide it as a 2 × 2 block matrix:

I{θ0 (l)} = [ F11  F12 ; F21  F22 ],  F11 : m × m, F22 : (l − m) × (l − m),     (8.4.32)

where F11 = I{θ0 (m)} is the m × m Fisher information matrix. Also we write

Z1 = (Z11′ , Z21′ )′ = √n ( {(∂/∂θ(m)) D( fθ0 (m) , ĝn )}′ , {(∂/∂θl−m (l)) D( fθ0 (l) , ĝn )}′ )′ ,  Z11 : m × 1, Z21 : (l − m) × 1,     (8.4.33)
where θl−m (l) is defined by θ(l) = (θ(m)′ , θl−m (l)′ )′ . The matrix formula leads to
I{θ0 (l)}−1 = [ F11−1 + F11−1 F12 F22·1−1 F21 F11−1 ,  −F11−1 F12 F22·1−1 ;  −F22·1−1 F21 F11−1 ,  F22·1−1 ],     (8.4.34)

where F22·1 = F22 − F21 F11−1 F12 . From the general asymptotic theory,

√n(θ̂n (l) − θ0 (l)) = I{θ0 (l)}−1 Z1 + o p (1),     (8.4.35)
√n(θ̂n (m) − θ0 (m)) = F11−1 Z11 + o p (1).     (8.4.36)

From (8.4.34)–(8.4.36) it follows that

Qn (l) − Qn (m) = (1/2) (Z21 − F21 F11−1 Z11 )′ F22·1−1 (Z21 − F21 F11−1 Z11 )     (8.4.37)
for p0 ≤ m < l ≤ L. Here it is seen that

Z21 − F21 F11−1 Z11 →d N(0, F22·1 ),     (8.4.38)
and Z21 − F21 F11−1 Z11 is asymptotically independent of Z11 , i.e., it is asymptotically independent of √n(θ̂n (m) − θ0 (m)) (recall (8.4.36)). Therefore, for any m and l satisfying p0 ≤ m < l ≤ L, it holds
that, as n → ∞,

Qn (l) − Qn (m) →d (1/2) χ2l−m ,     (8.4.39)

Qn (l) − Qn (m) is asymptotically independent of √n(θ̂n (m) − θ0 (m)).     (8.4.40)

Now we turn to evaluate P( p̂ = p) in the case when p0 ≤ p ≤ L. For p0 ≤ p ≤ L, from (8.4.30) and
(8.4.33) it follows that

P( p̂ = p) = P{GAIC(p) ≤ GAIC(m), p0 ≤ m ≤ L} + o(1)


= P{−Qn (p) + p ≤ −Qn (m) + m, p0 ≤ m ≤ L} + o(1)
= P{−Qn (p) + p ≤ −Qn (m) + m, p0 ≤ m < p, and
− Qn (p) + p ≤ −Qn (m) + m, p < m ≤ L} + o(1)
= P{p − m ≤ Qn (p) − Qn (m), p0 ≤ m < p, and
Qn (m) − Qn (p) ≤ m − p, p < m ≤ L} + o(1) (8.4.41)
= (A) + o(1), (say).

From (8.4.39) and (8.4.40) we can express (8.4.41) as

(A) = P[S 1 > 0, · · · , S p−p0 > 0] · P[S 1 < 0, · · · , S L−p < 0] + o(1), (8.4.42)

where S i = (Y1 − 2) + · · · + (Yi − 2) with Yi ∼ i.i.d. χ21 . Applying Shibata (1976) and Spitzer (1956)
to (8.4.42) we can see that, for p0 ≤ p ≤ L,

lim_{n→∞} P( p̂ = p) = ρ(p − p0 , {αi })ρ(L − p, {1 − αi }).

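The limiting distribution (8.4.27) is explicitly computable. Note that ρ(l, {ai }) is the coefficient of x^l in exp(Σi (ai /i)x^i ), so differentiating the generating function gives the recursion l·cl = Σ_{i=1}^{l} ai cl−i with c0 = 1, avoiding the enumeration of partitions, and the tail probabilities αi = P(χ2i > 2i) have closed forms. A minimal sketch (function names are illustrative; L and p0 match the first column of Table 8.1 below):

```python
import math

def chi2_sf(k, x):
    """P(chi^2_k > x): finite Poisson sum for even k, erfc-based form for odd k."""
    if k % 2 == 0:
        return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(k // 2))
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.exp(-x / 2) * sum((x / 2) ** (j - 0.5) / math.gamma(j + 0.5)
                                         for j in range(1, (k - 1) // 2 + 1))

def rho(l, a):
    """rho(l, {a_i}) of (8.4.26), via the recursion l*c_l = sum_{i=1}^l a_i c_{l-i}."""
    c = [1.0]
    for m in range(1, l + 1):
        c.append(sum(a[i - 1] * c[m - i] for i in range(1, m + 1)) / m)
    return c[l]

L, p0 = 10, 1
alpha = [chi2_sf(i, 2 * i) for i in range(1, L + 1)]
probs = [rho(p - p0, alpha) * rho(L - p, [1.0 - a for a in alpha])
         for p in range(p0, L + 1)]
# probs[0] is lim P(p_hat = p0), about 0.719 as in Table 8.1; the probs sum to 1.
```

That the probabilities over p0 ≤ p ≤ L sum to one follows from multiplying the two generating functions: exp(Σ(αi /i)x^i ) · exp(Σ((1 − αi )/i)x^i ) = 1/(1 − x), whose coefficients are all 1.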

Next we derive the asymptotic distribution of the selected order under contiguous models. We recognize that lack of information, data contamination, and other factors beyond our control make it virtually certain that the true model g is not strictly correct; it would be incompletely specified by uncertain prior information (e.g., Bancroft (1964)), and be contiguous to a fundamental parametric model fθ0 .
We may suppose that the true model is of the form
g(x) = fθ0 (x) + (1/√n) h(x),     (8.4.43)
where dim θ0 = p0 and n is the sample size, and h(x) is a perturbation function. To be more specific,
if we make a parametric version of (8.4.43), it is

g(x) = f(θ0 ,h/ √n) (x), (8.4.44)

where h = (h1 , · · · , hK−p0 )′ . In practice, it is natural that the true model is (8.4.44), and that the true
order of (8.4.44) is K. In the analysis of preliminary test estimation, models of the form (8.4.44) are
often used (e.g., Koul and Saleh (1993) and Saleh (2006)).
In what follows, we derive the asymptotic distribution of the selected order p̂ by GAIC under
the models (8.4.44). Suppose that the observed stretch Xn = (X1 , · · · , Xn )′ specified by the true structure f(θ0 ,h/√n) has the probability distribution P(n)(θ0 ,η) with η = h/√n.
Assumption 8.4.7. (i) If h = 0, the model fθ = f(θ,0) satisfies Assumption 8.4.5.
(ii) {P(n)(θ0 ,η) } has the LAN property, i.e., it satisfies

log [ dP(n)(θ0 ,h/√n) / dP(n)(θ0 ,0) ] = h′ ∆n − (1/2) h′ Γ(θ0 )h + o p (1),     (8.4.45)

∆n →d N(0, Γ(θ0 ))  under P(n)(θ0 ,0) ,

where Γ(θ0 ) is the Fisher information matrix.


We assume p0 < K ≤ L and fit the model fθ(p) , 0 ≤ p ≤ L to the data from (8.4.44). We write
(θ0 , 0) as θ0 (p), p ≥ p0 , if the dimension of the fitted model is p. When p0 < l ≤ L, let

Z1l = (Z11(l−1)′ , Z21(l) )′ = √n ( {(∂/∂θ(l − 1)) D( fθ0 (l−1) , ĝn )}′ , (∂/∂θl ) D( fθ0 (l) , ĝn ) )′ ,  Z11(l−1) : (l − 1) × 1, Z21(l) : scalar.     (8.4.46)

Assumption 8.4.8. Under P(n)(θ0 ,0) , for p0 < l ≤ L, the random vectors (Z1l′ , ∆′n )′ are asymptotically normal with

lim_{n→∞} Cov(Z1l , ∆n ) = {σ j,k (l); j = 1, · · · , l, k = 1, · · · , K − p0 }, (say)  (an l × (K − p0 )-matrix),     (8.4.47)

and
lim_{n→∞} Var(Z1l ) = [ F11(l−1)  F12(l−1) ; F21(l−1)  F22(l) ],  F11(l−1) : (l − 1) × (l − 1), F22(l) : 1 × 1.     (8.4.48)
Write

Σ(l−1) = { σ j,k (l) : j = 1, · · · , l − 1, k = 1, · · · , K − p0 }  (an (l − 1) × (K − p0 ) matrix),
σ(l) = (σl,1 (l), · · · , σl,K−p0 (l))′ ,
F22·1(l) = F22(l) − F21(l−1) F11(l−1)−1 F12(l−1) ,  l = p0 + 1, · · · , L,

and

µl = ( σ(l)′ h − F21(l−1) F11(l−1)−1 Σ(l−1) h ) / √F22·1(l) .

Letting W1 , · · · , WL−p0 ∼ i.i.d. N(0, 1), define

S j,m = (Wm − µm+p0 )2 + · · · + (Wm− j+1 − µm+p0 − j+1 )2 − 2 j,


(1 ≤ m ≤ L − p0 , j = 1, · · · , m),
T j,k = (Wk+1 − µk+p0 +1 )2 + · · · + (Wk+ j − µk+p0 + j )2 − 2 j,
(0 ≤ k ≤ L − p0 − 1, j = 1, · · · , L − p0 − k).

If 0 ≤ p < p0 , we can prove


lim_{n→∞} P(n)(θ0 ,h/√n) { p̂ = p} = 0,     (8.4.49)

as in the proof of Theorem 8.4.6. Next, we discuss the case of p0 ≤ p ≤ L. Under P = P(n)(θ0 ,0) , the
probability P( p̂ = p) is evaluated in (8.4.41). Recall (8.4.37) and (8.4.38). From Assumptions 8.4.7
and 8.4.8, applying Le Cam’s third lemma (Lemma 8.3.10) to (8.4.38), it is seen that
Z21(l) − F21(l−1) F11(l−1)−1 Z11(l−1) →d N( σ(l)′ h − F21(l−1) F11(l−1)−1 Σ(l−1) h, F22·1(l) )     (8.4.50)

under P(n)(θ0 ,h/√n) . Then, returning to (8.4.41), we obtain the following theorem.

Theorem 8.4.9. (Taniguchi and Hirukawa (2012)) Assume Assumptions 8.4.7 and 8.4.8. Let p̂ be a selected order by GAIC(p). Then, under P(n)(θ0 ,h/√n) , the asymptotic distribution of p̂ is given by

lim_{n→∞} P(n)(θ0 ,h/√n) { p̂ = p} = { 0, for 0 ≤ p < p0 ;  β(p − p0 )γ(p − p0 ), for p0 ≤ p ≤ L },     (8.4.51)

where

β(m) = P{ ∩_{j=1}^{m} (S j,m > 0) },  γ(k) = P{ ∩_{j=1}^{L−p0 −k} (T j,k ≤ 0) },  β(0) = γ(L − p0 ) = 1.

Let {Xt : t = 1, · · · , n} be a stationary process with spectral density fθ (λ) = f(θ0 ,h/ √n) (λ).
Here dim θ0 = p0 and dim h = K − p0 . We use the criterion (8.4.20) based on (8.4.15). Assume
Assumptions 8.4.7 and 8.4.8. Then it can be shown that

σ j,k (l) = (1/4π) ∫_{−π}^{π} (∂/∂θ j ) log f(θ0 ,0l−p0 ) (λ) · (∂/∂θ p0 +k ) log f(θ0 ,0K−p0 ) (λ) dλ,     (8.4.52)
l = p0 + 1, · · · , L, j = 1, · · · , l, k = 1, · · · , K − p0 , and 0r = (0, · · · , 0)′ (r zeros). Write the ( j, k)th element of the right-hand side of (8.4.48) as F j,k (l). Then,
F j,k (l) = (1/4π) ∫_{−π}^{π} (∂/∂θ j ) log f(θ0 ,0l−p0 ) (λ) · (∂/∂θk ) log f(θ0 ,0l−p0 ) (λ) dλ,     (8.4.53)
where l = p0 + 1, · · · , L, j, k = 1, · · · , l. In principle we can express the key quantities µl in terms
of spectral density fθ . However, even for simple ARMA models, it is very difficult to express µl by
θ explicitly. To avoid this problem, we consider the following exponential spectral density model:
fθ(L) (λ) = σ2 exp( Σ_{j=1}^{L} θ j cos( jλ) )
         = σ2 exp( Σ_{j=1}^{p0} θ j cos( jλ) + (1/√n) Σ_{j=1}^{L−p0} h j cos{(p0 + j)λ} ).     (8.4.54)

The form of the true model is

fθ (λ) = f(θ0 ,h/√n) (λ) ≡ σ2 exp( Σ_{j=1}^{p0} θ0, j cos( jλ) + (1/√n) Σ_{j=1}^{K−p0} h j cos{(p0 + j)λ} ),     (8.4.55)

where θ0 = (θ0,1 , · · · , θ0,p0 )′ and h = (h1 , · · · , hK−p0 )′ . Then it is seen that

µl = { (1/√8) hl−p0 , for l = p0 + 1, · · · , K;  0, for l = K + 1, · · · , L.     (8.4.56)

From Theorem 8.4.9 and (8.4.56) we can see that

lim_{n→∞} P(n)(θ0 ,h/√n) { p̂ = p} tends to 0 as ‖h‖ becomes large, for 0 ≤ p < K,     (8.4.57)
lim_{n→∞} P(n)(θ0 ,h/√n) [ p̂ ∈ {K, K + 1, · · · , L}] tends to 1 as ‖h‖ becomes large.     (8.4.58)

Although, if h = 0, we saw that GAIC is not a consistent order selection criterion in Theorem
8.4.6, the results (8.4.57) and (8.4.58) imply
(i) GAIC is admissible in the sense of (8.4.57) and (8.4.58).
Many other criteria have been proposed. We mention two typical ones. Akaike (1977, 1978) and
Schwarz (1978) proposed

BIC = −2 log(maximum likelihood) + (log n)(number of parameters). (8.4.59)

In the setting of Example 8.4.2, Hannan and Quinn (1979) suggested the criterion
HQ(p) = (n/2) log σ̂2n (p) + cp · log log n,  c > 1.     (8.4.60)
When h = 0, it is known that the criteria BIC and HQ are consistent for “the true order p0 .” Here
we summarize the results following the proofs of Theorems 8.4.6 and 8.4.9.
(ii) BIC still retains consistency to “pretended order p0 (not K),” and

(iii) HQ still retains consistency to “pretended order p0 (not K).”

The statements (i)–(iii) characterize the aims of GAIC(AIC), BIC and HQ. That is, GAIC aims
to select contiguous models (8.4.55) with the orders K, K + 1, · · · , L including the true one, if our
model is “incompletely specified.” On the other hand, BIC and HQ aim to select the true model fθ0
if our model is “completely specified.”
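These three criteria differ only in the per-parameter penalty added to (n/2) log σ̂2n (p): 1 for GAIC/AIC, c log log n for HQ, and (log n)/2 for BIC (after dividing (8.4.59) by 2). Since σ̂2n (p) is nonincreasing in p, a heavier penalty can never select a larger order. A minimal sketch of this comparison (the innovation-variance path `sigma2` below is synthetic and purely illustrative, not from the text):

```python
import math

def select_order(sigma2, n, penalty):
    """argmin_p (n/2) log sigma2[p] + penalty * p (smallest p on ties)."""
    crit = [0.5 * n * math.log(s) + penalty * p for p, s in enumerate(sigma2)]
    return min(range(len(crit)), key=lambda p: crit[p])

n = 500
# Synthetic nonincreasing variance path: big drops up to order 2, tiny gains after.
sigma2 = [2.0, 1.2, 1.0, 0.995, 0.993, 0.9929, 0.9928]

aic_p = select_order(sigma2, n, 1.0)                           # GAIC/AIC penalty
hq_p = select_order(sigma2, n, 1.01 * math.log(math.log(n)))   # HQ with c = 1.01
bic_p = select_order(sigma2, n, 0.5 * math.log(n))             # BIC/2 penalty
# Here aic_p = 3 while hq_p = bic_p = 2: heavier penalties never pick a larger order.
```

The ordering bic_p ≤ hq_p ≤ aic_p holds for any nonincreasing σ̂2n (p) path once log n > 2, which is the mechanical reason BIC and HQ never overestimate in the limit while AIC-type criteria may.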
Now, we evaluate the asymptotic distribution of the selected order in (8.4.27) and (8.4.51). The
asymptotic probabilities limn→∞ P( p̂ = p) under the null hypothesis θ = θ0 for L = 10 are listed
in Table 8.1. In this table the elements for p = p0 correspond to correctly estimated cases while
those for p > p0 correspond to overestimated cases. It is seen that the asymptotic probabilities
of overestimations are not equal to zero, hence GAIC is not a consistent order selection criterion
in contrast with BIC and HQ, which are consistent order selection criteria. However, they tend to
zero as p − p0 becomes large. Moreover, the asymptotic probabilities of correct selections, p = p0 ,
1 ≤ p0 ≤ L for L = 5, 8, 10 are plotted in Figure 8.1. We see that the correct selection probability
becomes larger as L becomes smaller for each p0 . However, in the first place, the setting of L < p0
causes the meaningless situation where we always underestimate the order.

Table 8.1: The asymptotic distribution of selected order under the null hypothesis, θ = θ0

p\p0 1 2 3 4 5 6 7 8 9 10
1 0.719 0 0 0 0 0 0 0 0 0
2 0.113 0.721 0 0 0 0 0 0 0 0
3 0.058 0.114 0.724 0 0 0 0 0 0 0
4 0.035 0.058 0.115 0.729 0 0 0 0 0 0
5 0.023 0.036 0.059 0.116 0.735 0 0 0 0 0
6 0.016 0.024 0.036 0.060 0.117 0.745 0 0 0 0
7 0.012 0.017 0.024 0.037 0.061 0.120 0.760 0 0 0
8 0.009 0.012 0.017 0.025 0.038 0.063 0.124 0.787 0 0
9 0.007 0.010 0.013 0.019 0.027 0.041 0.067 0.133 0.843 0
10 0.006 0.009 0.011 0.016 0.022 0.032 0.048 0.080 0.157 1

Figure 8.1: The asymptotic probabilities of correct selections (curves for L = 5, 8, 10, plotted against p0 from 0 to 10; vertical axis from 0.70 to 1.00)

Next, we evaluate the asymptotic probabilities lim_{n→∞} P(n)(θ0 ,h/√n) { p̂ = p} under the local alternative hypothesis θ = (θ0′ , h′ /√n)′ . We consider the exponential spectral density model (8.4.54) where the local alternative (true) model is given by (8.4.55). Here we assume that hl ≡ h are constant for 1 ≤ l ≤ K − p0 and hl ≡ 0 for K − p0 + 1 ≤ l ≤ L − p0 . Since the exact probabilities of β(p − p0 ) and γ(p − p0 ) in (8.4.51) are complicated, we replace them by the empirical relative frequencies of the events {∩_{j=1}^{p−p0} (S j,p−p0 > 0)} and {∩_{j=1}^{L−p} (T j,p−p0 ≤ 0)} over 1000 Monte Carlo iterations. To simplify the
Table 8.2: The asymptotic distribution of selected order under the local alternative hypothesis θ = (θ0′ , h′ /√n)′ for h = 1

u\R 1 2 3 4 5 6 7 8 9 10
0 0.748 0.714 0.727 0.73 0.689 0.704 0.709 0.717 0.705 0.724
1 0.101 0.123 0.117 0.101 0.128 0.114 0.126 0.109 0.122 0.106
2 0.054 0.054 0.051 0.057 0.072 0.054 0.066 0.066 0.051 0.059
3 0.028 0.032 0.038 0.037 0.029 0.044 0.043 0.026 0.033 0.031
4 0.023 0.026 0.019 0.026 0.016 0.022 0.018 0.024 0.023 0.028
5 0.015 0.015 0.012 0.011 0.019 0.012 0.013 0.014 0.023 0.018
6 0.01 0.009 0.009 0.015 0.015 0.018 0.009 0.014 0.015 0.009
7 0.008 0.01 0.013 0.008 0.007 0.011 0.007 0.008 0.01 0.008
8 0.006 0.005 0.005 0.004 0.013 0.005 0.003 0.009 0.005 0.006
9 0.003 0.007 0.006 0.006 0.008 0.009 0.002 0.005 0.006 0.007
10 0.004 0.005 0.003 0.005 0.004 0.007 0.004 0.008 0.007 0.004

Table 8.3: The asymptotic distribution of selected order under the local alternative hypothesis θ = (θ0′ , h′ /√n)′ for h = 10

u\R 1 2 3 4 5 6 7 8 9 10
0 0.464 0.366 0.262 0.233 0.213 0.205 0.163 0.154 0.124 0.111
1 0.332 0.208 0.173 0.137 0.094 0.088 0.071 0.056 0.065 0.052
2 0.067 0.236 0.166 0.149 0.103 0.085 0.06 0.07 0.05 0.051
3 0.041 0.067 0.242 0.149 0.107 0.094 0.078 0.058 0.056 0.041
4 0.027 0.048 0.053 0.182 0.133 0.105 0.099 0.077 0.058 0.048
5 0.028 0.023 0.036 0.051 0.201 0.127 0.107 0.08 0.062 0.066
6 0.011 0.019 0.022 0.035 0.056 0.167 0.13 0.092 0.081 0.06
7 0.015 0.007 0.017 0.022 0.04 0.05 0.192 0.137 0.108 0.087
8 0.007 0.009 0.014 0.017 0.032 0.033 0.054 0.179 0.135 0.11
9 0.004 0.01 0.007 0.012 0.006 0.025 0.025 0.054 0.202 0.114
10 0.004 0.007 0.008 0.013 0.015 0.021 0.021 0.043 0.059 0.26

notations we write Q ≡ L − p0 , R ≡ K − p0 and u = p − p0 ; then β(p − p0 ) and γ(p − p0 ) in (8.4.51)


depend on (p, p0 , L) through u and Q. The asymptotic probabilities lim_{n→∞} P(n)(θ0 ,h/√n) { p̂ = u + p0 } under the local alternative hypothesis θ = (θ0′ , h′ /√n)′ for h = 1, 10, 100 are listed in Tables 8.2–8.4. Here we fix Q = 10 and evaluate the asymptotic probabilities for each u = 0, . . . , Q and
R = 1, . . . , Q. In these tables the elements for u = R correspond to correctly estimated cases, while
those for u > R and u < R correspond to overestimated and underestimated cases, respectively.
Furthermore, the cases of u = 0 correspond to the acceptance of the null hypothesis under the local
alternative hypothesis. From these tables we see that GAIC tends to select the null hypothesis u = 0 for small h (see Table 8.2). On the other hand, GAIC tends to select the local alternative hypothesis u = R (the true model) and does not select u < R (smaller orders than the true one) for large h (see Table 8.4).
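The quantities β(u)γ(u) in (8.4.51) can be estimated exactly as described above, by Monte Carlo over W1 , · · · , WQ : draw the standard normals, subtract the shift vector µ, and count the relative frequency of {∩ j (S j,u > 0)} ∩ {∩ j (T j,u ≤ 0)}. A minimal sketch (replication count, seed, and the shift value 2.0 are arbitrary illustrative choices):

```python
import random

def beta_gamma_mc(mu, Q, u, reps=50000, seed=11):
    """Monte Carlo estimate of beta(u)*gamma(u) in (8.4.51);
    mu[m] is the shift of W_{m+1}, i.e., mu_{p0+m+1} in the text's notation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        w = [rng.gauss(0.0, 1.0) - mu[m] for m in range(Q)]   # the (W - mu) variables
        # S_{j,u} = (W_u - mu)^2 + ... + (W_{u-j+1} - mu)^2 - 2j > 0 for j = 1..u
        s_ok = all(sum(w[u - i] ** 2 for i in range(1, j + 1)) > 2 * j
                   for j in range(1, u + 1))
        # T_{j,u} = (W_{u+1} - mu)^2 + ... + (W_{u+j} - mu)^2 - 2j <= 0 for j = 1..Q-u
        t_ok = all(sum(w[u + i - 1] ** 2 for i in range(1, j + 1)) <= 2 * j
                   for j in range(1, Q - u + 1))
        hits += s_ok and t_ok
    return hits / reps

Q = 10
p_null = beta_gamma_mc([0.0] * Q, Q, u=0)    # h = 0: approaches the u = 0 value ~0.72
p_shift = beta_gamma_mc([2.0] * Q, Q, u=0)   # large shifts push mass away from u = 0
```

With zero shifts this reproduces (up to Monte Carlo error) the u = 0 entries of Table 8.2, and increasing the shifts drives the u = 0 probability down, matching the pattern across Tables 8.2–8.4.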
At first sight, GAIC may seem to perform badly since it is unstable and inconsistent in contrast with BIC and HQ, which are consistent order selection criteria of p0 . If ‖h‖ is small, the parameters in the local alternative part are negligible, so we expect to select the null hypothesis p0 . In such cases GAIC tends to select p0 , although the asymptotic probabilities of overestimation are not equal to zero. Therefore, GAIC is not so bad if we restrict ourselves to the negligible alternative and null hypotheses where BIC and HQ are surely desirable. However, ‖h‖ is clearly not always small.
Table 8.4: The asymptotic distribution of selected order under the local alternative hypothesis θ = (θ0′ , h′ /√n)′ for h = 100

u\R 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0
1 0.745 0 0 0 0 0 0 0 0 0
2 0.105 0.715 0 0 0 0 0 0 0 0
3 0.045 0.109 0.718 0 0 0 0 0 0 0
4 0.041 0.062 0.127 0.731 0 0 0 0 0 0
5 0.018 0.034 0.052 0.108 0.743 0 0 0 0 0
6 0.014 0.025 0.044 0.054 0.127 0.755 0 0 0 0
7 0.012 0.011 0.025 0.041 0.052 0.124 0.777 0 0 0
8 0.008 0.018 0.022 0.033 0.029 0.053 0.106 0.778 0 0
9 0.005 0.013 0.008 0.018 0.026 0.039 0.064 0.136 0.86 0
10 0.007 0.013 0.004 0.015 0.023 0.029 0.053 0.086 0.14 1

If ‖h‖ is large, the parameters in the local alternative part are evidently not ignorable, so we should select an order no smaller than the true order K to avoid misspecification. In such cases GAIC tends to
select the true order K and does not underestimate, so we are guarded from misspecification. On the
other hand, BIC and HQ still retain the consistency to “pretended order p0 ” for any bounded h. Note
that such a situation may happen not only for the small sample but also for the large sample since h
does not depend on n. Therefore, we can conclude that GAIC balances the order of the model with
the magnitude of local alternative. Furthermore, GAIC protects us from misspecification under the
unignorable alternative, while BIC and HQ always mislead us to underestimation.

8.5 Efficient Estimation for Portfolios


In this section, we discuss the asymptotic efficiency of estimators for optimal portfolio weights. In Section 3.2, the asymptotic distribution of the portfolio weight estimator ĝ was given for the processes considered there. Shiraishi and Taniguchi (2008) showed that the usual portfolio weight estimators are not asymptotically efficient if the returns are dependent. Furthermore, they proposed maximum likelihood type estimators for the optimal portfolio weight g, which are shown to be asymptotically efficient.
Although Shiraishi and Taniguchi (2008) dropped Gaussianity and independence for the return processes, many empirical studies show that real time series data are generally nonstationary. To describe this phenomenon, Dahlhaus (1996a) introduced an important class of locally stationary processes, and discussed kernel methods for local estimators of the covariance structure. Regarding the parametric approach, Dahlhaus (1996b) developed asymptotic theory for a Gaussian locally stationary process, and derived the LAN property. Furthermore, Hirukawa and Taniguchi (2006) dropped the Gaussian assumption, i.e., they proved the LAN theorem for a non-Gaussian locally stationary process with mean vector µ = 0, and applied the results to asymptotic estimation. Regarding the application to the portfolio selection problem, Shiraishi and Taniguchi (2007) discussed the asymptotic properties of the estimator ĝ when the returns are vector-valued locally stationary processes with time-varying mean vector. Since the returns are nonstationary, the quantities µ, Σ, µ̂, Σ̂ depend on the time when the portfolio is constructed. Assuming a parametric structure µθ and Σθ, they proved the LAN theorem for non-Gaussian locally stationary processes with nonzero mean vector. In this setting they propose a parametric estimator ĝ = g(µθ̂, Σθ̂), where θ̂ is a quasi-maximum likelihood estimator of θ. It is then shown that ĝ is asymptotically efficient.
EFFICIENT ESTIMATION FOR PORTFOLIOS 313
8.5.1 Traditional mean variance portfolio estimators
Suppose that a sequence of random returns {X(t) = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} is a stationary process.
Writing µ = E{X(t)} and Σ = Var{X(t)} = E[{X(t) − µ}{X(t) − µ}′ ], we define (m + r)-vector
parameter θ = (θ1 , . . . , θm+r )′ by

θ = (µ′ , vech(Σ)′)′

where r = m(m + 1)/2. We consider estimating a general function g(θ) of θ, which expresses an
optimal portfolio weight for a mean variance model such as (3.1.2), (3.1.3), (3.1.5) and (3.1.7). Here
it should be noted that the portfolio weight w = (w1 , . . . , wm )′ satisfies the restriction w′ e = 1. Then
we have only to estimate the subvector (w1 , . . . , wm−1 )′ . Hence we assume that the function g(·) is
(m − 1)-dimensional, i.e.,

g : θ → Rm−1 . (8.5.1)

In this setting we address the problem of statistical estimation for g(θ), which describes various
mean variance optimal portfolio weights.
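As one concrete instance of such a function g, the sketch below (our own illustration; the specific weight formula is the global minimum-variance portfolio, only one of the mean variance weights referenced above) unpacks θ = (µ′, vech(Σ)′)′ and returns the first m − 1 weights, the last one being determined by w′e = 1.

```python
import numpy as np

def g(theta, m):
    """Illustrative instance of the map in (8.5.1): global minimum-variance
    weights w = Sigma^{-1} e / (e' Sigma^{-1} e), reported without the last
    component, since w'e = 1 makes w_m redundant."""
    Sigma = np.zeros((m, m))
    i, j = np.triu_indices(m)
    Sigma[j, i] = theta[m:]                    # fill lower triangle in vech order
    Sigma = Sigma + Sigma.T - np.diag(np.diag(Sigma))
    e = np.ones(m)
    w = np.linalg.solve(Sigma, e)
    w /= e @ w
    return w[:-1]

# theta = (mu', vech(Sigma)')' for m = 3 assets (illustrative numbers)
theta = np.array([0.05, 0.03, 0.04,                      # mu
                  0.10, 0.02, 0.01, 0.08, 0.02, 0.09])   # vech(Sigma)
w_head = g(theta, 3)
w = np.append(w_head, 1.0 - w_head.sum())                # recover w_3 from w'e = 1
```

The recovered full weight vector sums to one by construction, and by definition it has no larger variance than any other fully invested portfolio, e.g., the equally weighted one.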
Let the return process {X(t) = (X1 (t), . . . , Xm (t))′ ; t ∈ Z} be an m-vector linear process
X(t) = Σ_{j=0}^{∞} A(j) U(t − j) + µ    (8.5.2)

where {U(t) = (u1(t), . . . , um(t))′} is a sequence of independent and identically distributed (i.i.d.) m-vector random variables with E{U(t)} = 0, Var{U(t)} = K (for short, {U(t)} ∼ i.i.d. (0, K)) and finite fourth-order cumulants. Here

A( j) = (Aab ( j))a,b=1,...,m , A(0) = Im , µ = (µa )a=1,...,m , K = (Kab )a,b=1,...,m .

The class of {X(t)} includes that of non-Gaussian vector-valued causal ARMA models. Hence it
is sufficiently rich. The process {X(t)} is a second-order stationary process with spectral density
matrix

f(λ) = (fab(λ))a,b=1,...,m = (1/2π) A(λ) K A(λ)∗

where A(λ) = Σ_{j=0}^{∞} A(j) e^{ijλ}.
From the partial realization {X(1), . . . , X(n)}, we introduce

θ̂ = (µ̂′ , vech(Σ̂)′ )′ , θ̃ = (µ̂′ , vech(Σ̃)′ )′

where

µ̂ = (1/n) Σ_{t=1}^{n} X(t),  Σ̂ = (1/n) Σ_{t=1}^{n} {X(t) − µ̂}{X(t) − µ̂}′,  Σ̃ = (1/n) Σ_{t=1}^{n} {X(t) − µ}{X(t) − µ}′.
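A minimal numerical sketch of these sample estimators (our own illustration with arbitrary parameter values; Σ̃ is not computed since it requires the unknown µ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 400, 3
mu_true = np.array([0.04, 0.02, 0.03])
Sigma_true = np.array([[0.10, 0.02, 0.01],
                       [0.02, 0.08, 0.02],
                       [0.01, 0.02, 0.09]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)   # row t is X(t)'

mu_hat = X.mean(axis=0)                    # (1/n) sum_t X(t)
D = X - mu_hat
Sigma_hat = (D.T @ D) / n                  # divisor n, matching the text

i, j = np.triu_indices(m)
theta_hat = np.concatenate([mu_hat, Sigma_hat[j, i]])   # (mu_hat', vech(Sigma_hat)')'
```

Note the divisor n (not n − 1), which matches the definition of Σ̂ above and agrees with `np.cov(..., bias=True)`.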

Since we have already discussed the asymptotic normality in Section 3.2, we discuss the asymptotic
Gaussian efficiency for g(θ̂). Fundamental results concerning the asymptotic efficiency of sample
autocovariance matrices of vector Gaussian processes were obtained by Kakizawa (1999). He com-
pared the asymptotic variance (AV) of sample autocovariance matrices with the inverse of the cor-
responding Fisher information matrix (F −1 ), and gave the condition for the asymptotic efficiency
(i.e., condition for AV = F −1 ). Based on this we will discuss the asymptotic efficiency of g(θ̂).
Suppose that {X(t)} is a zero-mean Gaussian m-vector stationary process with spectral density ma-
trix f (λ), and satisfies the following assumptions.

Assumption 8.5.1. (i) f(λ) is parameterized by η = (η1, . . . , ηq)′ ∈ H ⊂ R^q (i.e., f(λ) = fη(λ)).
(ii) For A^(j)(l) ≡ ∫_{−π}^{π} {∂fη(λ)/∂ηj} e^{ilλ} dλ, j = 1, . . . , q, l ∈ Z, it holds that Σ_{l=−∞}^{∞} ∥A^(j)(l)∥ < ∞.
(iii) q ≥ m(m + 1)/2.
Assumption 8.5.2. There exists a positive constant c (independent of λ) such that fη (λ) − cIm is
positive semi-definite, where Im is the m × m identity matrix.
The limit of the averaged Fisher information matrix is given by
F(η) = (1/4π) ∫_{−π}^{π} ∆(λ)∗ [{fη(λ)^{−1}}′ ⊗ fη(λ)^{−1}] ∆(λ) dλ

where

∆(λ) = (vec{∂fη(λ)/∂η1}, . . . , vec{∂fη(λ)/∂ηq}), an (m² × q) matrix.
Assumption 8.5.3. The matrix F (η) is positive definite.
We introduce an m2 × m(m + 1)/2 matrix
Φ = (vec{ψ11 }, . . . , vec{ψm1 }, vec{ψ22 }, . . . , vec{ψmm }),
where ψab = (1/2)(Eab + Eba) and Eab is the matrix whose (a, b)th element is 1 and all other elements are 0.
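The matrix Φ is easy to build column by column; the following sketch (our own illustration) constructs it for m = 3. One can check that Φ vech(S) = (1/2) vec(S + diag S) for symmetric S, which is a convenient sanity test of the column ordering.

```python
import numpy as np

def phi_matrix(m):
    """Construct Phi column by column: vec(psi_ab) with psi_ab = (E_ab + E_ba)/2,
    in the order vec(psi_11), ..., vec(psi_m1), vec(psi_22), ..., vec(psi_mm)."""
    cols = []
    for b in range(m):                 # vech order: column b, rows a = b, ..., m-1
        for a in range(b, m):
            E = np.zeros((m, m))
            E[a, b] = 1.0
            psi = 0.5 * (E + E.T)
            cols.append(psi.flatten(order="F"))   # vec = stack columns
    return np.column_stack(cols)

Phi = phi_matrix(3)                    # a 9 x 6 matrix
```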
Then we have the following theorem.
Theorem 8.5.4. Under Assumptions 8.5.1–8.5.3 and the Gaussianity of {X(t)}, g(θ̂) is asymptoti-
cally efficient if and only if there exists a matrix C (independent of λ) such that
{ fη (λ)′ ⊗ fη (λ)}Φ = (vec{∂ fη (λ)/∂η1}, . . . , vec{∂ fη (λ)/∂ηq })C. (8.5.3)

Proof of Theorem 8.5.4. Recall Theorem 3.2.5. In the asymptotic variance matrix of √n{g(θ̂) − g(θ)}, the transformation matrix (∂g/∂θ′) does not depend on the goodness of estimation. Thus, if the block diagonal matrix Ω = diag(Ω1, Ω2^G) is minimized in the matrix sense, the estimator g(θ̂) becomes asymptotically efficient. Regarding the Ω2^G part, Kakizawa (1999) gave a necessary and sufficient condition for Ω2^G to attain the lower bound matrix F(η)^{−1}, which is given by condition (8.5.3). Regarding the Ω1 part, it is known that Ω1 = 2πf(0) is equal to the asymptotic variance of the BLUE of µ (e.g., Hannan (1970)). Hence, Ω1 attains the lower bound matrix, which completes the proof.

This theorem implies that if (8.5.3) is not satisfied, the estimator g(θ̂) is not asymptotically efficient. This is a strong warning against using the traditional portfolio estimators, even for Gaussian dependent returns. The interpretation of (8.5.3) is difficult, but Kakizawa (1999) showed that (8.5.3) is satisfied by VAR(p) models. The following are examples which do not satisfy (8.5.3).
Example 8.5.5. (VARMA(p1, p2) process) Consider the m × m spectral density matrix of an m-vector ARMA(p1, p2) process,

f(λ) = (1/2π) Θ{exp(iλ)}^{−1} Ψ{exp(iλ)} Σ Ψ{exp(iλ)}∗ Θ{exp(iλ)}^{−1∗}    (8.5.4)

where Ψ(z) = Im − Ψ1 z − · · · − Ψp2 z^{p2} and Θ(z) = Im − Θ1 z − · · · − Θp1 z^{p1} satisfy det Ψ(z) ≠ 0, det Θ(z) ≠ 0 for all |z| ≤ 1. From Kakizawa (1999) it follows that (8.5.4) does not satisfy (8.5.3) if p1 < p2; hence g(θ̂) is not asymptotically efficient if p1 < p2.
Example 8.5.6. (An exponential model) Consider the m × m spectral density matrix of exponential type,

f(λ) = exp{ Σ_{j≠0} A(j) cos(jλ) }    (8.5.5)

where the A(j)'s are m × m matrices, and exp{·} is the matrix exponential (for the definition, see Bellman (1960)). Since (8.5.5) does not satisfy (8.5.3), g(θ̂) is not asymptotically efficient.
8.5.2 Efficient mean variance portfolio estimators
In the previous subsection, we showed that the traditional mean variance portfolio weight estimator g(θ̂) is generally not asymptotically efficient, even if {X(t)} is a Gaussian stationary process. Therefore, we propose maximum likelihood type estimators for g(θ), which are asymptotically efficient.
Gaussian Efficient Estimator Let {X(t)} be the linear Gaussian process defined by (8.5.2) with spectral density matrix fη(λ), η ∈ H. We assume η = vech(Σ). Denote by In(λ) the periodogram matrix constructed from a partial realization {X(1), . . . , X(n)}:

In(λ) = (1/2πn) [ Σ_{t=1}^{n} {X(t) − µ̂} e^{itλ} ][ Σ_{t=1}^{n} {X(t) − µ̂} e^{itλ} ]∗,  for −π ≤ λ ≤ π.

To estimate η we introduce

D(fη, In) = ∫_{−π}^{π} [ log det fη(λ) + tr{fη(λ)^{−1} In(λ)} ] dλ,

(e.g., Hosoya and Taniguchi (1982)). A quasi-Gaussian maximum likelihood estimator η̂ of η is defined by

η̂ = arg min_{η∈H} D(fη, In).
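A minimal numerical sketch of this quasi-Gaussian (Whittle) estimation, in the scalar special case m = 1 for readability (our own illustration): the periodogram is computed at the Fourier frequencies and the discretized criterion D(fη, In) is minimized over the AR(1) coefficient, with the innovation variance held at its true value to keep the search one-dimensional (an assumption made purely for simplicity).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a univariate AR(1): x(t) = a x(t-1) + u(t)  (illustrative values)
n, a_true, s2_true = 1024, 0.6, 1.0
x = np.zeros(n)
for t in range(1, n):
    x[t] = a_true * x[t - 1] + rng.standard_normal()

# Periodogram I_n(lambda) at the Fourier frequencies 2*pi*k/n, k = 1, ..., n/2 - 1
lam = 2 * np.pi * np.arange(1, n // 2) / n
dft = np.fft.fft(x - x.mean())[1:n // 2]
I = np.abs(dft) ** 2 / (2 * np.pi * n)

def f_ar1(a, s2, lam):
    """AR(1) spectral density f(lambda) = s2 / (2 pi |1 - a e^{i lambda}|^2)."""
    return s2 / (2 * np.pi * np.abs(1 - a * np.exp(1j * lam)) ** 2)

def whittle(a, s2):
    """Discretized D(f_eta, I_n): sum over frequencies of log f + I/f."""
    f = f_ar1(a, s2, lam)
    return np.sum(np.log(f) + I / f)

# Crude grid search (the text uses Newton-Raphson instead)
grid = np.linspace(-0.95, 0.95, 191)
a_hat = grid[np.argmin([whittle(a, s2_true) for a in grid])]
```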

Let {X(t)} be the Gaussian m-vector linear process defined by (8.5.2). For the process {X(t)} and
the true spectral density matrix f (λ) = ( fab (λ))a,b=1,...,m , we impose the following.
Assumption 8.5.7. (i) For each a, b and l,

Var[ E{ua(t)ub(t + l) | Ft−τ} − δ(l, 0)Kab ] = O(τ^{−2−α}),  α > 0,

uniformly in t.

(ii) E| E{ua1(t1)ua2(t2)ua3(t3)ua4(t4) | Ft1−τ} − E{ua1(t1)ua2(t2)ua3(t3)ua4(t4)} | = O(τ^{−1−α}) uniformly in t1, where t1 ≤ t2 ≤ t3 ≤ t4 and α > 0.

(iii) The spectral densities faa(λ) (a = 1, . . . , m) are square integrable.

(iv) Σ_{j1,j2,j3=−∞}^{∞} |c^U_{a1···a4}(j1, j2, j3)| < ∞, where the c^U_{a1···a4}(j1, j2, j3)'s are the joint fourth-order cumulants of ua1(t), ua2(t + j1), ua3(t + j2) and ua4(t + j3).

(v) f(λ) ∈ Lip(α), the Lipschitz class of degree α, α > 1/2.

Then, Hosoya and Taniguchi (1982) showed the following.


Lemma 8.5.8. (Theorem 3.1 of Hosoya and Taniguchi (1982)) Suppose that η exists uniquely and lies in H ⊂ R^q. Under Assumption 8.5.7, we have
(i) p-lim_{n→∞} η̂ = η,
(ii) √n(η̂ − η) →^L N(0, V(η)),
where

V(η) = [ (1/4π) ∫_{−π}^{π} { (∂²/∂η∂η′) tr(fη(λ)^{−1} f(λ)) + (∂²/∂η∂η′) log det fη(λ) } dλ ]^{−1}.
If {X(t)} is Gaussian, then V(η) = F (η)−1 . Therefore, η̂ is Gaussian asymptotically effi-
cient, hence g(θ̃), with θ̃ = (µ̂′ , η̂ ′ )′ , is Gaussian asymptotically efficient. Since the solution of
∂D( fη , In )/∂η = 0 is generally nonlinear with respect to η, we use the Newton–Raphson iteration
procedure. A feasible procedure is

η̂^(1) = vech(Σ̂),
η̂^(k) = η̂^(k−1) − [ ∂²D(fη, In)/∂η∂η′ ]^{−1} [ ∂D(fη, In)/∂η ] |_{η=η̂^(k−1)}  (k ≥ 2).    (8.5.6)

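The efficiency of the one-step update in (8.5.6) can be illustrated on a toy scalar problem (our own illustration, not the portfolio setting): starting from a rough but consistent pilot estimate, a single Newton-Raphson step on a smooth criterion already lands very close to the exact minimizer. Here the criterion is the negative log-likelihood for a Gaussian variance, chosen only because its derivatives are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(2000) * 1.5          # true variance eta0 = 2.25
n = len(x)
s = np.sum(x ** 2)

# Criterion L(eta) = n log eta + s / eta (negative Gaussian log-likelihood in eta)
dL  = lambda e: n / e - s / e ** 2           # first derivative
d2L = lambda e: -n / e ** 2 + 2 * s / e ** 3  # second derivative

eta1 = np.median(x ** 2) / 0.455             # rough consistent pilot (0.455 ~ median of chi^2_1)
eta2 = eta1 - dL(eta1) / d2L(eta1)           # one Newton-Raphson step, as in (8.5.6)
eta_mle = s / n                              # exact minimizer, for comparison
```

After one step the estimate is already within higher-order terms of the exact minimizer, which is the mechanism behind the claim below that √n(η̂^(2) − η) and √n(η̂ − η) share the same limit.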
From Theorem 5.1 of Hosoya and Taniguchi (1982), √n(η̂^(2) − η) has the same limiting distribution as √n(η̂ − η), which implies that η̂^(2) is asymptotically efficient. Therefore, g(θ^(2)) with θ^(2) = (µ̂′, η̂^(2)′)′ becomes Gaussian asymptotically efficient. In (8.5.6), we can express


∂D(fη, In)/∂ηij = (1/2π) ∫_{−π}^{π} tr[ fη(λ)^{−1} Eij {Im − fη(λ)^{−1} In(λ)} ] dλ,    (8.5.7)

∂²D(fη, In)/∂ηij∂ηkl = (1/2π)² ∫_{−π}^{π} tr[ −fη(λ)^{−1} Ekl fη(λ)^{−1} Eij {Im − fη(λ)^{−1} In(λ)} + fη(λ)^{−1} Eij fη(λ)^{−1} Ekl fη(λ)^{−1} In(λ) ] dλ    (8.5.8)

where ηi j is the (i, j)th element of Σ. To make the step (8.5.6) feasible, we may replace fη in (8.5.7)
and (8.5.8) by a nonparametric spectral estimator
f̂η(λ) = ∫_{−π}^{π} Wn(λ − µ) In(µ) dµ    (8.5.9)

where Wn(λ) is a nonnegative, even, periodic function with period 2π whose mass becomes, as n increases, more and more concentrated around λ = 0 (mod 2π) (for details see, e.g., p. 390 of Taniguchi and Kakizawa (2000)). Then, under appropriate regularity conditions ((6.1.17) of Taniguchi and Kakizawa (2000)), it holds that

max_{λ∈[−π,π]} ∥f̂η(λ) − fη(λ)∥E →^p 0

where ∥A∥E is the Euclidean norm of the matrix A: ∥A∥E = {tr(AA∗)}^{1/2}. Therefore, g(θ̂^(2)) with θ̂^(2) = (µ̂′, η̂̂^(2)′)′, where η̂̂^(2) denotes η̂^(2) computed with f̂η in place of fη, is also Gaussian asymptotically efficient, by this result, the asymptotic property of BLUE estimators and the δ-method.
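A discrete analogue of the smoothed estimator (8.5.9) replaces the integral by a circular average of the periodogram over neighbouring Fourier frequencies; the sketch below (our own illustration) uses a flat Daniell-type window of half-width M on a simulated MA(2)-type series.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1024
x = np.convolve(rng.standard_normal(n + 2), [1.0, 0.5, 0.25], mode="valid")  # MA(2)-type series

I = np.abs(np.fft.fft(x - x.mean())) ** 2 / (2 * np.pi * n)   # periodogram at 2*pi*k/n

# Discrete analogue of (8.5.9): f_hat(k) = average of I over the 2M+1 nearest frequencies
M = 15
kernel = np.ones(2 * M + 1) / (2 * M + 1)
I_ext = np.r_[I[-M:], I, I[:M]]            # wrap around: the periodogram is 2*pi-periodic
f_hat = np.convolve(I_ext, kernel, mode="valid")
```

Circular averaging preserves the mean level of the periodogram exactly while reducing its frequency-to-frequency variability, which is the role of Wn above.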
Non-Gaussian Efficient Estimator Next we discuss construction of an efficient estimator when
{X(t)} is non-Gaussian. Fundamental results concerning the asymptotic efficiency of estimators for
unknown parameter θ of vector linear processes were obtained by Taniguchi and Kakizawa (2000).
They showed the quasi-maximum likelihood estimator θ̂QML is asymptotically efficient.
Let {X(t)} be a non-Gaussian m-vector stationary process defined by (8.5.2), where the U(t)'s are i.i.d. m-vector random variables with probability density p(u) > 0 on R^m, and A(j) ≡ Aθ(j) = (Aθ,ab(j))a,b=1,...,m for j ∈ N and µ ≡ µθ are m × m matrices and an m-vector depending on a parameter vector θ = (θ1, . . . , θq)′ ∈ Θ ⊂ R^q, respectively. Here Aθ(0) = Im, the m × m identity matrix. We impose the following assumption.
Assumption 8.5.9. (A1)(i) There exists 0 < ρA < 1 such that ∥Aθ(j)∥ = O(ρA^j), j ∈ N, where ∥A∥ is the sum of the absolute values of the entries of A.

(ii) Every Aθ(j) = {Aθ,ab(j)} is continuously two times differentiable with respect to θ, and the derivatives satisfy

|∂i1 · · · ∂ik Aθ,ab(j)| = O(γA^j),  k = 1, 2,  j ∈ N

for some 0 < γA < 1 and for a, b = 1, . . . , m.

(iii) Every ∂i1∂i2 Aθ,ab(j) satisfies the Lipschitz condition for all i1, i2 = 1, . . . , q, a, b = 1, . . . , m and j ∈ N.

(iv) det{ Σ_{j=0}^{∞} Aθ(j)z^j } ≠ 0 for |z| ≤ 1 and

{ Σ_{j=0}^{∞} Aθ(j)z^j }^{−1} = Im + Bθ(1)z + Bθ(2)z² + · · · ,  |z| ≤ 1,

where ∥Bθ(j)∥ = O(ρB^j), j ∈ N, for some 0 < ρB < 1.

(v) Every Bθ(j) = {Bθ,ab(j)} is continuously two times differentiable with respect to θ, and the derivatives satisfy

|∂i1 · · · ∂ik Bθ,ab(j)| = O(γB^j),  k = 1, 2,  j ∈ N

for some 0 < γB < 1 and for a, b = 1, . . . , m.

(vi) Every ∂i1∂i2 Bθ,ab(j) satisfies the Lipschitz condition for all i1, i2 = 1, . . . , q, a, b = 1, . . . , m and j ∈ N.

(A2)(i)

lim_{∥u∥→∞} p(u) = 0,  ∫ u p(u)du = 0,  ∫ uu′ p(u)du = Im  and  ∫ ∥u∥⁴ p(u)du < ∞.

(ii) The derivatives Dp and D²p ≡ D(Dp) exist on R^m, and every component of D²p satisfies the Lipschitz condition.

(iii)

∫ ∥φ(u)∥⁴ p(u)du < ∞  and  ∫ D²p(u)du = 0,

where φ(u) = p^{−1} Dp.


A quasi-maximum likelihood estimator θ̂QML is defined as a solution of the equation

(∂/∂θ) Σ_{t=1}^{n} log p[ Σ_{j=0}^{t−1} Bθ(j) {X(t − j) − µθ} ] = 0

with respect to θ. Then Taniguchi and Kakizawa (2000) showed the following result.
Lemma 8.5.10. (Theorem 3.1.12 of Taniguchi and Kakizawa (2000)) Under Assumption 8.5.9, θ̂QML
is asymptotically efficient.
As we saw in the previous paragraph, the theory of linear time series is well established. However, linear models such as ARMA models are not sufficient to describe the real world. Using ARCH models, Hafner (1998) gave an extensive analysis of finance data. The results by Hafner (1998) reveal that many relationships in real data are "nonlinear." Thus the analysis of nonlinear time series is becoming an important component of time series analysis. Härdle and Gasser (1984) introduced vector conditional heteroscedastic autoregressive nonlinear (CHARN) models, which include the usual AR and ARCH models. Kato et al. (2006) established the LAN property for the CHARN model.
Suppose that {X(t)} is generated by

X(t) = Fθ (X(t − 1), . . . , X(t − p1 ), µ) + Hθ (X(t − 1), . . . , X(t − p2 ))U (t), (8.5.10)

where Fθ : R^{m(p1+1)} → R^m is a vector-valued measurable function, Hθ : R^{mp2} → R^{m×m} is a positive definite matrix-valued measurable function, and {U(t) = (u1(t), . . . , um(t))′} is a sequence of i.i.d. random variables with E{U(t)} = 0, E∥U(t)∥ < ∞, and U(t) independent of {X(s), s < t}. Henceforth, without loss of generality we assume p1 + 1 = p2 (= p), and make the following
assumption.
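Before stating the assumptions, a minimal simulation sketch of a CHARN model (our own illustration of (8.5.10) in the scalar case): an AR(1) skeleton Fθ with an ARCH(1)-type conditional scale Hθ, with illustrative parameter values chosen so that the scale is bounded away from zero and the process does not explode.

```python
import numpy as np

rng = np.random.default_rng(5)

# Scalar CHARN instance of (8.5.10): F_theta(x) = 0.3 x,
# H_theta(x) = sqrt(0.2 + 0.4 x^2)  (illustrative parameters)
n = 5000
x = np.zeros(n)
u = rng.standard_normal(n)
for t in range(1, n):
    F = 0.3 * x[t - 1]
    H = np.sqrt(0.2 + 0.4 * x[t - 1] ** 2)   # bounded below by sqrt(0.2) > 0
    x[t] = F + H * u[t]
```

The lower bound on H mirrors condition (A-iii) below, and the moderate coefficient sizes keep the growth condition of (A-iv)-type stability satisfied, so the simulated path remains stable.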
Assumption 8.5.11. (A-i) U (t) has a density p(u) > 0 a.e. on Rm .
(A-ii) There exist constants aij ≥ 0, bij ≥ 0, 1 ≤ i ≤ m, 1 ≤ j ≤ p such that, as ∥x∥ → ∞,

∥Fθ(x)∥ ≤ Σ_{i=1}^{m} Σ_{j=1}^{p} aij |xij| + o(∥x∥)  and  ∥Hθ(x)∥ ≤ Σ_{i=1}^{m} Σ_{j=1}^{p} bij |xij| + o(∥x∥).

(A-iii) Hθ(x) is continuous and symmetric on R^{mp}, and there exists a positive constant λ such that

λmin{Hθ(x)} ≥ λ  for all x ∈ R^{mp},

where λmin{(·)} is the minimum eigenvalue of (·).

(A-iv) max_{1≤i≤m} { Σ_{j=1}^{p} aij + E∥U(1)∥ Σ_{j=1}^{p} bij } < 1.
(B-i) For all θ ∈ Θ

Eθ kFθ (X(t − 1), . . . , X(t − p), µ)k2 < ∞ and Eθ kHθ (X(t − 1), . . . , X(t − p))k2 < ∞.

(B-ii) There exists c > 0 such that

c ≤ ∥ Hθ′^{−1/2}(x) Hθ(x) Hθ′^{−1/2}(x) ∥ < ∞

for all θ, θ′ ∈ Θ and for all x ∈ R^{mp}.

(B-iii) Hθ and Fθ are continuously differentiable with respect to θ, and their derivatives ∂jHθ and ∂jFθ (∂j = ∂/∂θj), j = 1, . . . , q, satisfy the condition that there exist square-integrable functions A(j) and B(j) such that

∥∂jHθ∥ ≤ A(j)  and  ∥∂jFθ∥ ≤ B(j)  for j = 1, . . . , q and for all θ ∈ Θ.

(B-iv) p(·) satisfies

lim_{∥u∥→∞} ∥u∥ p(u) = 0  and  ∫ uu′ p(u)du = Im.

(B-v) The continuous derivative Dp of p(·) exists on R^m, and

∫ ∥p^{−1}Dp∥⁴ p(u)du < ∞  and  ∫ ∥u∥² ∥p^{−1}Dp∥² p(u)du < ∞.

Suppose that an observed stretch X^(n) = {X(1), . . . , X(n)} from (8.5.10) is available. We denote the probability distribution of X^(n) by Pn,θ. For two hypothetical values θ, θ′ ∈ Θ, the log-likelihood ratio based on X^(n) is

Λn(θ, θ′) = log (dPn,θ′ / dPn,θ) = Σ_{t=p}^{n} log [ p{Hθ′^{−1}(X(t) − Fθ′)} det Hθ ] / [ p{Hθ^{−1}(X(t) − Fθ)} det Hθ′ ].
As an estimator of θ, we use the following maximum likelihood estimator (MLE):

θ̂ML ≡ arg max_{θ} Λn(θ0, θ)

where θ0 ∈ Θ is some fixed value. Write ηt(θ) ≡ log [ p{Hθ^{−1}(X(t) − Fθ)} {det Hθ}^{−1} ]. We further
impose the following assumption.
Assumption 8.5.12. For i, j, k = 1, . . . , q, there exist functions Qti j = Qti j (X(1), . . . , X(t)) and
T it jk = T it jk (X(1), . . . , X(t)) such that

|∂i ∂ j ηt (θ)| ≤ Qti j , E[Qti j ] < ∞

and

|∂i ∂ j ∂k ηt (θ)| ≤ T it jk , E[T it jk ] < ∞.

Then Kato et al. (2006) showed the following result.


Lemma 8.5.13. (Kato et al. (2006)) Under Assumptions 8.5.11 and 8.5.12, θ̂ ML is asymptotically
efficient.
From Lemma 8.5.10 or 8.5.13, the optimal portfolio weight estimator g(θ̂QML ) or g(θ̂ ML ) is also
asymptotically efficient by the δ-method.
Locally Stationary Process The definition of a locally stationary process is as follows:
Definition 8.5.14. A sequence of stochastic processes X(t, T) = (X1(t, T), . . . , Xm(t, T))′ (t = 1, . . . , T; T ∈ N) is called an m-dimensional non-Gaussian locally stationary process with transfer function matrix A° and mean function vector µ = µ(·) if there exists a representation

X(t, T) = µ(t/T) + ∫_{−π}^{π} exp(iλt) A°(t, T, λ) dξ(λ)    (8.5.11)

with the following properties:


(i) ξ(λ) is a right-continuous complex-valued non-Gaussian vector process on [−π, π] (see Brockwell and Davis (2006)) with ξa(λ) = ξ̄a(−λ) and

cum{dξa1(λ1), . . . , dξak(λk)} = η( Σ_{j=1}^{k} λj ) hk(a1, . . . , ak) dλ1 · · · dλk,

where η( Σ_{j=1}^{k} λj ) = Σ_{l=−∞}^{∞} δ( Σ_{j=1}^{k} λj + 2πl ) is the period-2π extension of the Dirac delta function, cum{. . .} denotes the kth-order cumulant, and h1 = 0, h2(a1, a2) = (1/2π) c_{a1a2}, hk(a1, . . . , ak) = (1/(2π)^{k−1}) c_{a1...ak} for all k > 2.
(ii) There exist a constant K and a 2π-periodic matrix-valued function A(u, λ) = {Aab(u, λ) : a, b = 1, . . . , m} : [0, 1] × R → C^{m×m} with A(u, λ) = Ā(u, −λ) and

sup_{t,λ} | A°ab(t, T, λ) − Aab(t/T, λ) | ≤ K T^{−1}

for all a, b = 1, . . . , m and T ∈ N. A(u, λ) and µ(u) are assumed to be continuous in u.


A typical example is the following ARMA process with time-varying coefficients:

Σ_{j=0}^{p} α(j, u) X(t − j, T) = Σ_{j=0}^{q} β(j, u) u(t − j),  u = t/T,

where u(t) ≡ ∫_{−π}^{π} e^{itλ} dξ(λ) is a stationary non-Gaussian process.
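A minimal simulation sketch of such a locally stationary process (our own illustration): a time-varying AR(1) whose coefficient curve a(u) = 0.2 + 0.5u is an arbitrary choice, so that the local variance, roughly 1/(1 − a(u)²), grows along the sample.

```python
import numpy as np

rng = np.random.default_rng(6)

# Time-varying AR(1): X(t,T) = a(t/T) X(t-1,T) + u(t), a(u) = 0.2 + 0.5 u
T = 4000
a = lambda u: 0.2 + 0.5 * u
x = np.zeros(T)
u_t = rng.standard_normal(T)
for t in range(1, T):
    x[t] = a(t / T) * x[t - 1] + u_t[t]

# Locally, Var X(t,T) ~ 1/(1 - a(t/T)^2), so later segments are more variable
v_early = x[200:1800].var()
v_late = x[2200:3800].var()
```

The contrast between v_early and v_late is exactly the kind of slowly changing second-order structure that the time-varying spectral density fθ(u, λ) below describes.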

Let X(1, T), . . . , X(T, T) be realizations of an m-dimensional locally stationary non-Gaussian process with mean function vector µ(t/T) and transfer function matrix A°. We assume that µ and A°(t, T, λ) are parameterized by an unknown vector θ = (θ1, . . . , θq)′ ∈ Θ, i.e., µ = µθ and A°(t, T, λ) = A°θ(t, T, λ), where Θ is a compact subset of R^q. The time-varying spectral density is given by fθ(u, λ) := Aθ(u, λ) Aθ(u, λ)∗. Introducing the notations ∇i = ∂/∂θi, ∇ = (∇1, . . . , ∇q)′, ∇ij = ∂²/∂θi∂θj, ∇² = (∇ij)i,j=1,...,q, we make the following assumption.
Assumption 8.5.15. (i) µθ and Aθ are uniformly bounded from above and below.

(ii) There exists a constant K with

sup_{t,λ} | ∇^s { A°θ,ab(t, T, λ) − Aθ,ab(t/T, λ) } | ≤ K T^{−1}

for s = 0, 1, 2 and all a, b = 1, . . . , m. The components of Aθ(u, λ), ∇Aθ(u, λ) and ∇²Aθ(u, λ) are differentiable in u and λ with uniformly continuous derivatives ∂/∂u and ∂/∂λ.
Letting

U(t) = (u1(t), . . . , um(t))′ ≡ ∫_{−π}^{π} exp(iλt) dξ(λ),

we further assume the following.


Assumption 8.5.16. (i) U (t)’s are i.i.d. m-vector random variables with mean vector 0 and
variance matrix Im , and the components ua have finite fourth-order cumulant ca1 a2 a3 a4 ≡
cum(ua1 (t), ua2 (t), ua3 (t), ua4 (t)), a1, . . . , a4 = 1, . . . , m. Furthermore U (t) has the probability den-
sity function p(u) > 0.

(ii) p(·) satisfies

lim_{∥u∥→∞} p(u) = 0  and  lim_{∥u∥→∞} ∥u∥ p(u) = 0.

(iii) The continuous derivatives Dp and D2 p ≡ D(Dp)′ of p(·) exist on Rm . Furthermore each com-
ponent of D2 p satisfies the Lipschitz condition.

(iv) All the components of

F(p) = ∫ φ(u)φ(u)′ p(u)du,

E{U(t)U(t)′ φ(U(t))φ(U(t))′}, E{φ(U(t))φ(U(t))′ U(t)} and E{φ(u)φ(u)′φ(u)φ(u)′} are bounded, and

∫ D²p(u)du = 0,  lim_{∥u∥→∞} ∥u∥² Dp(u) = 0,

where φ(u) = p^{−1} Dp(u).

(v) X(t, T) has the MA(∞) and AR(∞) representations

X(t, T) = µθ(t/T) + Σ_{j=0}^{∞} a°θ(t, T, j) U(t − j),

a°θ(t, T, 0) U(t) = Σ_{k=0}^{∞} b°θ(t, T, k) { X(t − k, T) − µθ((t − k)/T) },

where a°θ(t, T, j), b°θ(t, T, k) are m × m matrices with b°θ(t, T, 0) ≡ Im and a°θ(t, T, j) = a°θ(0, T, j) = a°θ(j) for t ≤ 0.

(vi) Every a°θ(t, T, j) is continuously three times differentiable with respect to θ, and the derivatives satisfy

sup_{t,T} { Σ_{j=0}^{∞} (1 + j) | ∇i1 · · · ∇is a°θ(t, T, j) | } < ∞  for s = 0, 1, 2, 3.

(vii) Every b°θ(t, T, k) is continuously three times differentiable with respect to θ, and the derivatives satisfy

sup_{t,T} { Σ_{k=0}^{∞} (1 + k) | ∇i1 · · · ∇is b°θ(t, T, k) | } < ∞  for s = 0, 1, 2, 3.    (8.5.12)

(viii)

a°θ(t, T, 0)^{−1} ∇a°θ(t, T, 0) = (1/4π) ∫_{−π}^{π} f°θ(t, T, λ)^{−1} ∇f°θ(t, T, λ) dλ,

where f°θ(t, T, λ) = A°θ(t, T, λ) A°θ(t, T, λ)∗.

(ix) The components of µθ (u), ∇µθ (u) and ∇2 µθ (u) are differentiable in u with uniformly contin-
uous derivatives.
By Assumption 8.5.16(i) we have

a°θ(t, T, 0) U(t) = Σ_{k=0}^{t−1} b°θ(t, T, k) { X(t − k, T) − µθ((t − k)/T) } + Σ_{r=0}^{∞} c°θ(t, T, r) U(−r),    (8.5.13)

where

c°θ(t, T, r) = Σ_{s=0}^{r} b°θ(t, T, t + s) a°θ(r − s).

From Assumption 8.5.16 (v)–(ix), we can see that

Σ_{r=0}^{∞} | c°θ(t, T, r) | = O(t^{−1}).    (8.5.14)

For the special case of m = 1 and µθ = 0, Hirukawa and Taniguchi (2006) showed the local asymptotic normality (LAN) property for {X(t, T)}. In what follows we show the LAN property for the vector case with µθ ≠ 0. This extension is not straightforward, and is important for applications.
Let Pθ,T and PU be the probability distributions of {U(s), s ≤ 0, X(1, T), . . . , X(T, T)} and {U(s), s ≤ 0}, respectively. It is easy to see that a linear transformation Lθ exists (whose Jacobian is Π_{t=1}^{T} |a°θ(t, T, 0)|^{−1}), which maps {U(s), s ≤ 0, X(1, T), . . . , X(T, T)} into {U(s), s ≤ 0}. Then, recalling (8.5.14), we obtain

dPθ,T = Π_{t=1}^{T} |a°θ(t, T, 0)|^{−1} p[ a°θ(t, T, 0)^{−1} { Σ_{k=0}^{t−1} b°θ(t, T, k) ( X(t − k, T) − µθ((t − k)/T) ) + Σ_{r=0}^{∞} c°θ(t, T, r) U(−r) } ] dPU.

Let a°θ(t, T, 0) U(t) ≡ zθ,t,T. Then zθ,t,T has the probability density function

gθ,t,T(·) = |a°θ(t, T, 0)|^{−1} p{ a°θ(t, T, 0)^{−1} (·) }.

Denote by H(p; θ) the hypothesis under which the underlying parameter is θ ∈ Θ and the probability density of U(t) is p = p(·). We define

θT = θ + h/√T,  h = (h1, . . . , hq)′ ∈ H ⊂ R^q.
For two hypothetical values θ, θT ∈ Θ, the log-likelihood ratio is

ΛT(θ, θT) ≡ log (dPθT,T / dPθ,T) = 2 Σ_{t=1}^{T} log Φt,T(θ, θT),

where

Φ²t,T(θ, θT) = gθT,t,T{ Σ_{k=0}^{t−1} b°θT(t, T, k) X(t − k, T) + Σ_{r=0}^{∞} c°θT(t, T, r) U(−r) } / gθ,t,T{ Σ_{k=0}^{t−1} b°θ(t, T, k) X(t − k, T) + Σ_{r=0}^{∞} c°θ(t, T, r) U(−r) }

= gθT,t,T( zθ,t,T + q^(1)_{t,T} + q^(2)_{t,T} ) / gθ,t,T( zθ,t,T )

with

q^(1)_{t,T} = Σ_{k=1}^{t−1} { (1/√T) ∇h b°θ(t, T, k) + (1/2T) ∇²h b°θ∗(t, T, k) } { X(t − k, T) − µθ((t − k)/T) } + Σ_{r=0}^{∞} (1/√T) ∇h c°θ∗∗(t, T, r) U(−r),    (8.5.15)

q^(2)_{t,T} = − Σ_{k=0}^{t−1} b°θT(t, T, k) { µθT((t − k)/T) − µθ((t − k)/T) }.    (8.5.16)

Here the operators ∇h and ∇²h imply

∇h a°θ(t, T, 0) = Σ_{l=1}^{q} hl ∇l a°θ(t, T, 0),  ∇²h b°θ(t, T, k) = Σ_{l1=1}^{q} Σ_{l2=1}^{q} hl1 hl2 ∇l1,l2 b°θ(t, T, k),

respectively; θ∗ and θ∗∗ are points on the segment between θ and θ + h/√T, and q^(1)_{t,T} corresponds to qt,T in Hirukawa and Taniguchi (2006).
Now we can state the local asymptotic normality (LAN) for our locally stationary processes.
Theorem 8.5.17. [LAN theorem] Suppose that Assumptions 8.5.15 and 8.5.16 hold. Then the se-
quence of experiments

ET = {RZ , BZ , {Pθ,T : θ ∈ Θ ⊂ Rq }}, T ∈ N,

where BZ denotes the Borel σ-field on RZ , is locally asymptotically normal and equicontinuous on
a compact subset C of H. That is
(i) For all θ ∈ Θ, the log-likelihood ratio ΛT(θ, θT) admits, under H(p; θ), as T → ∞, the asymptotic representation

ΛT(θ, θT) = ∆T(h; θ) − (1/2) Γh(θ) + op(1),

where

∆T(h; θ) = (1/√T) Σ_{t=1}^{T} φ(U(t))′ a°θ(t, T, 0)^{−1} Σ_{k=1}^{t−1} ∇h b°θ(t, T, k) { X(t − k, T) − µθ((t − k)/T) }
  − (1/√T) Σ_{t=1}^{T} tr[ a°θ(t, T, 0)^{−1} ∇h a°θ(t, T, 0) { Im + U(t) φ(U(t))′ } ]
  − (1/√T) Σ_{t=1}^{T} φ(U(t))′ a°θ(t, T, 0)^{−1} Σ_{k=0}^{t−1} b°θ(t, T, k) ∇h µθ((t − k)/T)

and

Γh(θ) = (1/4π) ∫_0^1 ∫_{−π}^{π} tr[ F(p) fθ(u, λ)^{−1} ∇h fθ(u, λ) { fθ(u, λ)^{−1} ∇h fθ(u, λ) }′ ] dλ du
  + (1/16π²) ∫_0^1 E[ φ(U(0))′ { ∫_{−π}^{π} fθ(u, λ)^{−1} ∇h fθ(u, λ) dλ } { U(0)U(0)′ − 2Im } { ∫_{−π}^{π} fθ(u, λ)^{−1} ∇h fθ(u, λ) dλ }′ φ(U(0)) ] du
  − (1/16π²) ∫_0^1 [ tr{ ∫_{−π}^{π} fθ(u, λ)^{−1} ∇h fθ(u, λ) dλ } ]² du
  + ∫_0^1 ∇h µθ(u)′ { Aθ(u, 0)^{−1} }′ F(p) Aθ(u, 0)^{−1} ∇h µθ(u) du
  + (1/2π) ∫_0^1 ∫_{−π}^{π} E[ φ(U(0))′ fθ(u, λ)^{−1} ∇h fθ(u, λ) U(0) φ(U(0))′ Aθ(u, 0)^{−1} ∇h µθ(u) ] dλ du.

(ii) Under H(p; θ),

∆T(h; θ) →^L N(0, Γh(θ)).

(iii) For all T ∈ N and all h ∈ H, the mapping h → PθT,T is continuous with respect to the variational distance

∥P − Q∥ = sup{ |P(A) − Q(A)| : A ∈ BZ }.
Proof of Theorem 8.5.17. See Appendix.

Note that the third term of ∆T (h; θ) and the fourth and fifth terms of Γh (θ) vanish if µθ ≡ 0.
Next we estimate the true value θ0 of the unknown parameter θ of the model by the quasi-likelihood estimator

θ̂QML := arg max_{θ∈Θ} LT(θ),

where

LT(θ) = Σ_{t=1}^{T} [ log p( a°θ(t, T, 0)^{−1} Σ_{k=0}^{t−1} b°θ(t, T, k) { X(t − k, T) − µθ((t − k)/T) } ) − log |a°θ(t, T, 0)| ].
Theorem 8.5.18. Under Assumptions 8.5.15 and 8.5.16,

√T ( θ̂QML − θ ) →^L N( 0, V^{−1} ),

where

Vij = (1/4π) ∫_0^1 ∫_{−π}^{π} tr[ fθ(u, λ)^{−1} ∇i fθ(u, λ) F(p) { fθ(u, λ)^{−1} ∇j fθ(u, λ) }′ ] dλ du
  + (1/16π²) ∫_0^1 E[ φ(U(0))′ { ∫_{−π}^{π} fθ(u, λ)^{−1} ∇i fθ(u, λ) dλ } { U(0)U(0)′ − 2Im } { ∫_{−π}^{π} fθ(u, λ)^{−1} ∇j fθ(u, λ) dλ }′ φ(U(0)) ] du
  − (1/16π²) ∫_0^1 tr{ ∫_{−π}^{π} fθ(u, λ)^{−1} ∇i fθ(u, λ) dλ } tr{ ∫_{−π}^{π} fθ(u, λ)^{−1} ∇j fθ(u, λ) dλ } du
  + ∫_0^1 ∇i µθ(u)′ { Aθ(u, 0)^{−1} }′ F(p) Aθ(u, 0)^{−1} ∇j µθ(u) du
  + (1/2π) ∫_0^1 ∫_{−π}^{π} E[ φ(U(0))′ fθ(u, λ)^{−1} ∇i fθ(u, λ) U(0) φ(U(0))′ Aθ(u, 0)^{−1} ∇j µθ(u) ] dλ du.
Proof of Theorem 8.5.18. See Appendix.

Write the optimal portfolio weight function g(θ) as

g(θ) ≡ g( ( µθ(T/T)′, vech{ Σθ(T/T) }′ )′ ),

where

Σθ(T/T) = E[ { X(T, T) − µθ(T/T) } { X(T, T) − µθ(T/T) }′ ] = Σ_{j=0}^{∞} a°θ(T, T, j) a°θ(T, T, j)′.

Then we have the following theorem.


Theorem 8.5.19. Under Assumptions 8.5.15, 8.5.16 and 3.2.4,

√T { g(θ̂QML) − g(θ) } →^L N( 0, (∂g/∂θ′) V^{−1} (∂g/∂θ′)′ ).
Proof of Theorem 8.5.19. The proof follows from Theorem 8.5.18 and the δ-method.

From this theorem and general results by Hirukawa and Taniguchi (2006) on the asymptotic efficiency of θ̂QML, we can see that our estimator g(θ̂QML) for the optimal portfolio weight is √T-consistent and asymptotically efficient. Hence, we obtain a parametric optimal estimator for the portfolio coefficients.
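The δ-method step behind Theorem 8.5.19 is easy to sketch numerically (our own illustration): given an asymptotic covariance V^{−1} for θ̂ (a placeholder identity-scaled matrix here, since the true V of Theorem 8.5.18 depends on the model) and a Jacobian ∂g/∂θ′ computed by finite differences for an illustrative minimum-variance weight function g, the weight estimator's asymptotic covariance is (∂g/∂θ′) V^{−1} (∂g/∂θ′)′.

```python
import numpy as np

m = 3

def unpack(theta):
    """theta = (mu', vech(Sigma)')' -> (mu, Sigma)."""
    mu = theta[:m]
    S = np.zeros((m, m))
    i, j = np.triu_indices(m)
    S[j, i] = theta[m:]                       # lower triangle in vech order
    S = S + S.T - np.diag(np.diag(S))
    return mu, S

def g(theta):
    """Illustrative g: first m-1 global minimum-variance weights."""
    _, S = unpack(theta)
    e = np.ones(m)
    w = np.linalg.solve(S, e)
    return (w / (e @ w))[:-1]

theta0 = np.array([0.05, 0.03, 0.04,
                   0.10, 0.02, 0.01, 0.08, 0.02, 0.09])

def jacobian(fun, th, h=1e-6):
    """Central finite-difference Jacobian (m-1) x q."""
    q = len(th)
    J = np.zeros((m - 1, q))
    for i in range(q):
        e_i = np.zeros(q); e_i[i] = h
        J[:, i] = (fun(th + e_i) - fun(th - e_i)) / (2 * h)
    return J

J = jacobian(g, theta0)
V_inv = 0.5 * np.eye(len(theta0))            # placeholder for V^{-1} of Theorem 8.5.18
avar = J @ V_inv @ J.T                       # asymptotic covariance of sqrt(T){g_hat - g}
```

Since this particular g does not depend on µ, the µ-columns of the Jacobian vanish, and the resulting avar is symmetric positive semidefinite, as an asymptotic covariance must be.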

8.5.3 Appendix
Proof of Theorem 8.5.17. To prove this theorem we check sufficient conditions (S-1)–(S-6) (below) for the LAN given by Swensen (1985). Let Ft be the σ-algebra generated by {U(s), s ≤ 0, X(1, T), . . . , X(t, T)}, t ≤ T. Since X(t, T) fulfills the difference equations (8.5.13), we obtain
a°θ(t, T, 0) = Σ_{k=0}^{t−1} b°θ(t, T, k) A°θ(t − k, T, λ) exp(−iλk) + Σ_{r=0}^{∞} c°θ(t, T, r) exp{−iλ(r + t)},  a.e.    (8.5.17)
Hence we get

Σ_{k=1}^{t−1} ∇b°θ(t, T, k) A°θ(t − k, T, λ) exp(−iλk)
  = ∇a°θ(t, T, 0) − Σ_{k=0}^{t−1} b°θ(t, T, k) ∇A°θ(t − k, T, λ) exp(−iλk) − Σ_{r=0}^{∞} ∇c°θ(t, T, r) exp{−iλ(r + t)},  a.e.

We set down

Yt,T = Φt,T(θ, θT) − 1
  = [ g^{1/2}_{θT,t,T}( zθ,t,T + q^(1)_{t,T} ) − g^{1/2}_{θ,t,T}( zθ,t,T ) ] / g^{1/2}_{θ,t,T}( zθ,t,T )
  + [ g^{1/2}_{θT,t,T}( zθ,t,T + q^(1)_{t,T} + q^(2)_{t,T} ) − g^{1/2}_{θT,t,T}( zθ,t,T + q^(1)_{t,T} ) ] / g^{1/2}_{θ,t,T}( zθ,t,T )
  = Y^(1)_{t,T} + Y^(2)_{t,T}  (say),    (8.5.18)

and

Wt,T = (1/(2√T)) [ φ(U(t))′ a°θ(t, T, 0)^{−1} Σ_{k=1}^{t−1} ∇h b°θT(t, T, k) { X(t − k, T) − µθ((t − k)/T) }
  − tr{ a°θ(t, T, 0)^{−1} ∇h a°θ(t, T, 0) ( Im + U(t) φ(U(t))′ ) } ]
  − (1/(2√T)) φ(U(t))′ a°θ(t, T, 0)^{−1} Σ_{k=0}^{t−1} b°θT(t, T, k) ∇h µθ((t − k)/T)
  = W^(1)_{t,T} + W^(2)_{t,T}  (say).    (8.5.19)

(S-1). E(Wt,T | Ft−1) = 0 a.e. (t = 1, . . . , T).

Since E(W^(1)_{t,T} | Ft−1) = 0, we have

E(Wt,T | Ft−1) = E( W^(1)_{t,T} + W^(2)_{t,T} | Ft−1 )
  = − (1/(2√T)) E{ φ(U(t)) | Ft−1 }′ a°θ(t, T, 0)^{−1} Σ_{k=0}^{t−1} b°θT(t, T, k) ∇h µθ((t − k)/T)
  = 0  (by Assumption 8.5.16),

hence (S-1) holds.


(S-2). E[ { Σ_{t=1}^{T} (Yt,T − Wt,T) }² ] → 0.

First, recalling (8.5.18) and (8.5.19) we have

E{ (Yt,T − Wt,T)² } = E{ ( Y^(1)_{t,T} − W^(1)_{t,T} + Y^(2)_{t,T} − W^(2)_{t,T} )² }
  ≤ 2 E{ ( Y^(1)_{t,T} − W^(1)_{t,T} )² } + 2 E{ ( Y^(2)_{t,T} − W^(2)_{t,T} )² },

and from Hirukawa and Taniguchi (2006)

E[ { Σ_{t=1}^{T} ( Y^(1)_{t,T} − W^(1)_{t,T} ) }² ] → 0.    (8.5.20)

Therefore if we can show

E[ { Σ_{t=1}^{T} ( Y^(2)_{t,T} − W^(2)_{t,T} ) }² ] → 0,

then this completes the proof of (S-2). From a Taylor expansion we can see that
Y^(2)_{t,T} = [ 1 / g^{1/2}_{θ,t,T}( zθ,t,T ) ] [ g^{1/2}_{θT,t,T}( zθ,t,T ) + Dg^{1/2}_{θT,t,T}( zθ,t,T )′ ( q^(1)_{t,T} + q^(2)_{t,T} )
  + (1/2) ( q^(1)_{t,T} + q^(2)_{t,T} )′ D²g^{1/2}_{θT,t,T}{ zθ,t,T + α1 ( q^(1)_{t,T} + q^(2)_{t,T} ) } ( q^(1)_{t,T} + q^(2)_{t,T} )
  − { g^{1/2}_{θT,t,T}( zθ,t,T ) + Dg^{1/2}_{θT,t,T}( zθ,t,T )′ q^(1)_{t,T} + (1/2) ( q^(1)_{t,T} )′ D²g^{1/2}_{θT,t,T}( zθ,t,T + α2 q^(1)_{t,T} ) q^(1)_{t,T} } ]
  = Dg^{1/2}_{θT,t,T}( zθ,t,T )′ q^(2)_{t,T} / g^{1/2}_{θ,t,T}( zθ,t,T ) + (1/2) R1{U(t)}  (say),

where 0 < α1, α2 < 1. Since


Dg^{1/2}_{θT,t,T}( zθ,t,T ) = Dg^{1/2}_{θ,t,T}( zθ,t,T ) + (1/√T) ∇̃ Dg^{1/2}_{θ̃,t,T}( zθ,t,T )′ |_{θ̃=θ̃∗} h,

Dg^{1/2}_{θ,t,T}( zθ,t,T ) = a°θ(t, T, 0)^{−1} Dp{ a°θ(t, T, 0)^{−1} zθ,t,T } / [ 2 |a°θ(t, T, 0)| g^{1/2}_{θ,t,T}( zθ,t,T ) ],

we can see that

Dg^{1/2}_{θT,t,T}( zθ,t,T )′ q^(2)_{t,T} / g^{1/2}_{θ,t,T}( zθ,t,T )
  = Dg^{1/2}_{θ,t,T}( zθ,t,T )′ q^(2)_{t,T} / g^{1/2}_{θ,t,T}( zθ,t,T ) + (1/√T) h′ ∇̃ Dg^{1/2}_{θ̃,t,T}( zθ,t,T ) |_{θ̃=θ̃∗} q^(2)_{t,T} / g^{1/2}_{θ,t,T}( zθ,t,T )
  = Dg^{1/2}_{θ,t,T}( zθ,t,T )′ q^(2)_{t,T} / g^{1/2}_{θ,t,T}( zθ,t,T ) + R2{U(t)}  (say)
  = W^(2)_{t,T} + R2{U(t)} + Op(1/T),

where ∇̃ = ∂/∂θ̃′ and θ̃∗ is a point on the segment between θ and θ + h/√T. Hence,
E[ { Σ_{t=1}^{T} ( Y^(2)_{t,T} − W^(2)_{t,T} ) }² ] = E[ { Σ_{t=1}^{T} ( (1/2) R1{U(t)} + R2{U(t)} + Op(1/T) ) }² ]
  ≤ E[ Σ_{t=1}^{T} { R1{U(t)}² + 4 R2{U(t)}² + Op(1/T²) } ]
  = Σ_{t=1}^{T} E[ R1{U(t)}² ] + 4 Σ_{t=1}^{T} E[ R2{U(t)}² ] + O(1/T).
T

From (8.5.12), (8.5.14) and (8.5.15),

E{ ∥q^(1)_{t,T}∥² } = O(1/T),

and from Assumption 8.5.16 (ix), (8.5.12) and (8.5.16),

E{ ∥q^(2)_{t,T}∥² } = O(1/T).

Hence,

E[ { Σ_{t=1}^{T} ( Y^(2)_{t,T} − W^(2)_{t,T} ) }² ] = o(1).

This, together with (8.5.20), completes the proof of (S-2).


(S-3). sup_T E{ ( Σ_{t=1}^{T} Wt,T )² } < ∞.

We have

E(Wt,T)² = E{ ( W^(1)_{t,T} + W^(2)_{t,T} )² }
  = E{ ( W^(1)_{t,T} )² } + E{ ( W^(2)_{t,T} )² } + 2 E{ W^(1)_{t,T} W^(2)_{t,T} }
  = J^(1)_{t,T} + J^(2)_{t,T} + J^(3)_{t,T}  (say).

From Hirukawa and Taniguchi (2006), it follows that
\[
\begin{aligned}
\sum_{t=1}^{T} J^{(1)}_{t,T}
&= \frac{1}{16\pi}\int_{0}^{1}\int_{-\pi}^{\pi} \mathrm{tr}\Big\{F(p)\,f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\big\{f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\big\}'\Big\}\,d\lambda\,du \\
&\quad + \frac{1}{64\pi}\int_{0}^{1} E\Big[\phi(U(0))'\Big\{\int_{-\pi}^{\pi} f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\,d\lambda\Big\}\big\{U(0)U(0)'-2I_m\big\} \\
&\qquad\qquad\times \Big\{\int_{-\pi}^{\pi} f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\,d\lambda\Big\}\phi(U(0))\Big]\,du \\
&\quad - \frac{1}{64\pi}\int_{0}^{1}\Big[\mathrm{tr}\Big\{\int_{-\pi}^{\pi} f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\,d\lambda\Big\}\Big]^{2}\,du + o(1).
\end{aligned}
\]

From (8.5.19),
\[
J^{(2)}_{t,T} = E\big(W^{(2)}_{t,T}\big)^{2}
= E\Big[\Big\{-\frac{1}{2\sqrt{T}}\,\phi(U(t))'\,a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\,\nabla_h\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\}^{2}\Big].
\]

From Taylor expansion and Assumptions 8.5.15 (i) and 8.5.16 (ix),
\[
\nabla_h\mu_{\theta}\Big(\frac{t-k}{T}\Big) = \nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big) + O\Big(\frac{t}{T}\Big),
\qquad
A_{\theta}\Big(\frac{t-k}{T},\lambda\Big) = A_{\theta}\Big(\frac{t}{T},\lambda\Big) + O\Big(\frac{t}{T}\Big) \quad (8.5.21)
\]
for $k = 0,1,\ldots,t-1$. Hence from Assumption 8.5.15 (i), (8.5.14), (8.5.17) and (8.5.21), we obtain
\[
a^{\circ}_{\theta}(t,T,0) = \sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\,A_{\theta}\Big(\frac{t}{T},0\Big) + O\Big(\frac{t}{T}\Big)
\]
and
\[
a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\,\nabla_h\mu_{\theta}\Big(\frac{t-k}{T}\Big)
= A_{\theta}\Big(\frac{t}{T},0\Big)^{-1}\nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big) + O\Big(\frac{t}{T}\Big).
\]

Therefore, we have
\[
\begin{aligned}
J^{(2)}_{t,T} &= E\Big[\Big\{-\frac{1}{2\sqrt{T}}\,\phi(U(t))'\,A_{\theta}\Big(\frac{t}{T},0\Big)^{-1}\nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big) + O\Big(\frac{t}{T}\Big)\Big\}^{2}\Big] \\
&= \frac{1}{4T}\,\nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big)'\Big\{A_{\theta}\Big(\frac{t}{T},0\Big)^{-1}\Big\}' F(p)\,A_{\theta}\Big(\frac{t}{T},0\Big)^{-1}\nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big) + O\Big(\frac{t^{2}}{T^{3}}\Big)
\end{aligned}
\]
and
\[
\sum_{t=1}^{T} J^{(2)}_{t,T} = \frac{1}{4}\int_{0}^{1} \nabla_h\mu_{\theta}(u)'\big\{A_{\theta}(u,0)^{-1}\big\}' F(p)\,A_{\theta}(u,0)^{-1}\nabla_h\mu_{\theta}(u)\,du + o(1).
\]

Next it is seen that
\[
\begin{aligned}
J^{(3)}_{t,T} &= 2E\big[W^{(1)}_{t,T}W^{(2)}_{t,T}\big] \\
&= -\frac{1}{8\pi T}\,E\Big[\int_{-\pi}^{\pi} \phi(U(t))'\,f_{\theta}\Big(\frac{t}{T},\lambda\Big)^{-1}\nabla_h f_{\theta}\Big(\frac{t}{T},\lambda\Big)\,U(t)\,\phi(U(t))'\,A_{\theta}\Big(\frac{t}{T},0\Big)^{-1}\nabla_h\mu_{\theta}\Big(\frac{t}{T}\Big)\,d\lambda\Big] + O\Big(\frac{t}{T^{2}}\Big),
\end{aligned}
\]
which implies
\[
\sum_{t=1}^{T} J^{(3)}_{t,T} = -\frac{1}{8\pi}\int_{0}^{1}\int_{-\pi}^{\pi} E\Big[\phi(U(t))'\,f_{\theta}(u,\lambda)^{-1}\nabla_h f_{\theta}(u,\lambda)\,U(t)\,\phi(U(t))'\Big]\,A_{\theta}(u,0)^{-1}\nabla_h\mu_{\theta}(u)\,d\lambda\,du + o(1).
\]

(S-5). $\displaystyle \sum_{t=1}^{T} E\big(W_{t,T}^{2}\,\big|\,\mathcal{F}_{t-1}\big) \stackrel{P}{\to} \frac{\Gamma(h,\theta)}{4}$.

We write
\[
W^{(1)}_{t,T} = I^{(1)}_{t,T} + I^{(2)}_{t,T},
\]
where
\[
\begin{aligned}
I^{(1)}_{t,T} &= \frac{1}{2\sqrt{T}}\,\phi(U(t))'\,a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=1}^{t-1}\nabla_h b^{\circ}_{\theta}(t,T,k)\Big\{X(t-k,T)-\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\}, \\
I^{(2)}_{t,T} &= -\frac{1}{2\sqrt{T}}\,\mathrm{tr}\Big[a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_h a^{\circ}_{\theta}(t,T,0)\big\{I_m + U(t)\phi(U(t))'\big\}\Big].
\end{aligned}
\]
From the definition of $\{W_{t,T}\}$ and Assumption 8.5.15 (ii), we can see that
\[
\begin{aligned}
E\big\{W_{t,T}^{2}\,\big|\,\mathcal{F}_{t-1}\big\}
&= E\big\{\big(W^{(1)}_{t,T}+W^{(2)}_{t,T}\big)^{2}\,\big|\,\mathcal{F}_{t-1}\big\} \\
&= E\big\{\big(I^{(1)}_{t,T}\big)^{2}\,\big|\,\mathcal{F}_{t-1}\big\} + 2E\big\{I^{(1)}_{t,T}\big(I^{(2)}_{t,T}+W^{(2)}_{t,T}\big)\,\big|\,\mathcal{F}_{t-1}\big\} + E\big\{\big(I^{(2)}_{t,T}+W^{(2)}_{t,T}\big)^{2}\big\} \\
&\equiv L^{(1)}_{t,T} + L^{(2)}_{t,T} + E\big\{\big(I^{(2)}_{t,T}+W^{(2)}_{t,T}\big)^{2}\big\}\quad(\text{say}).
\end{aligned}
\]
Hence we get
\[
\begin{aligned}
\mathrm{Var}\Big[\sum_{t=1}^{T} E\big(W_{t,T}^{2}\,\big|\,\mathcal{F}_{t-1}\big)\Big]
&= \sum_{t_1,t_2=1}^{T}\Big\{\mathrm{cum}\big(L^{(1)}_{t_1,T},L^{(1)}_{t_2,T}\big) + \mathrm{cum}\big(L^{(2)}_{t_1,T},L^{(2)}_{t_2,T}\big) + 2\,\mathrm{cum}\big(L^{(1)}_{t_1,T},L^{(2)}_{t_2,T}\big)\Big\} \\
&\equiv L^{(1)} + L^{(2)} + L^{(3)}\quad(\text{say}).
\end{aligned}
\]

According to Hirukawa and Taniguchi (2006), it follows that
\[
L^{(1)} = O\Big(\frac{1}{T}\Big) + o(1).
\]
Writing
\[
\tilde W_{t,T} = a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=1}^{t-1}\nabla_h b^{\circ}_{\theta}(t,T,k)\Big\{X(t-k,T)-\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\},
\]
we observe that
\[
\begin{aligned}
L^{(2)}_{t,T} &= E\big\{I^{(1)}_{t,T}\big(I^{(2)}_{t,T}+W^{(2)}_{t,T}\big)\,\big|\,\mathcal{F}_{t-1}\big\} \\
&= -\frac{1}{4T}\,\tilde W_{t,T}'\,E\Big[\phi(U(t))\phi(U(t))'\Big\{a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_h a^{\circ}_{\theta}(t,T,0)\,U(t)
 + \sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\,\nabla_h\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\}\Big] \\
&= \tilde W_{t,T}'\,\tilde L^{(2)}_{t,T}\quad(\text{say}).
\end{aligned}
\]
Hence
\[
L^{(2)} = \sum_{t_1,t_2=1}^{T} \mathrm{cum}\big(L^{(2)}_{t_1,T},L^{(2)}_{t_2,T}\big)
= \sum_{t_1,t_2=1}^{T} \big(\tilde L^{(2)}_{t_1,T}\big)'\,\mathrm{cum}\big(\tilde W_{t_1,T},\tilde W_{t_2,T}\big)\,\tilde L^{(2)}_{t_2,T}
= o(1),
\]

and similarly $L^{(3)} = o(1)$. Therefore, we can see that
\[
\mathrm{Var}\Big[\sum_{t=1}^{T} E\big(W_{t,T}^{2}\,\big|\,\mathcal{F}_{t-1}\big)\Big] \to 0 \quad\text{as } T\to\infty,
\]
hence
\[
\sum_{t=1}^{T} E\big(W_{t,T}^{2}\,\big|\,\mathcal{F}_{t-1}\big) \stackrel{P}{\to} \frac{\Gamma(h,\theta)}{4}.
\]

(S-4) and (S-6) can be proved similarly as in Hirukawa and Taniguchi (2006). Hence the proofs of (i) and (ii) of the theorem are completed.
Finally, (iii) of the theorem follows from Scheffé's theorem and the continuity of $p(\cdot)$. $\square$
Proof of Theorem 8.5.18. We can see
\[
0 = \frac{1}{\sqrt{T}}\nabla L_T(\theta_0) + \Big[\frac{1}{T}\big\{\nabla^{2} L_T(\theta_0)\big\}' + R_T(\theta^{*})\Big]\sqrt{T}\big(\hat\theta_{QML}-\theta_0\big),
\]
where $R_T(\theta^{*}) = \frac{1}{T}\big[\big\{\nabla^{2}L_T(\theta^{*})\big\}'-\big\{\nabla^{2}L_T(\theta_0)\big\}'\big]$ with $|\theta^{*}-\theta_0| \le |\hat\theta_{QML}-\theta_0|$.

Here it should be noted that the consistency of $\hat\theta_{QML}$ is proved in Theorem 2 of Hirukawa and Taniguchi (2006). Hence $\theta^{*} = \theta_0 + o_p(1)$. Let the $i$th component of $\frac{1}{\sqrt{T}}\nabla L_T(\theta_0)$ and the $(i,j)$th component of $\frac{1}{T}\nabla^{2}L_T(\theta_0)$ be $g^{(T)}_{\theta_0}(i)$ and $h^{(T)}_{\theta_0}(i,j)$, respectively. Then

\[
\begin{aligned}
g^{(T)}_{\theta}(i) &= \frac{1}{\sqrt{T}}\nabla_i L_T(\theta) \\
&= \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\Bigg(\phi(\tilde W_{t,T})'\Bigg[-a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_i a^{\circ}_{\theta}(t,T,0)\,a^{\circ}_{\theta}(t,T,0)^{-1}
\sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\Big\{X(t-k,T)-\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\} \\
&\qquad\qquad + a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=1}^{t-1}\nabla_i b^{\circ}_{\theta}(t,T,k)\Big\{X(t-k,T)-\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\} \\
&\qquad\qquad - a^{\circ}_{\theta}(t,T,0)^{-1}\sum_{k=0}^{t-1} b^{\circ}_{\theta}(t,T,k)\,\nabla_i\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Bigg]
 - \mathrm{tr}\big\{a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_i a^{\circ}_{\theta}(t,T,0)\big\}\Bigg). \quad (8.5.22)
\end{aligned}
\]

Denote the $i$th component of $\Delta_T(e;\theta)$ and the $(i,j)$th component of $\Gamma_e(\theta)$ by $\Delta_T(e;\theta)(i)$ and $\Gamma_e(\theta)(i,j)$, where $e = (1,\ldots,1)'$ ($q\times 1$ vector), respectively. Then, under Assumptions 8.5.15 and 8.5.16, it can be shown that

\[
\begin{aligned}
(8.5.22) &= \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\big\{\phi(U(t)) + O_p(t^{-1})\big\}'\sum_{k=1}^{t-1} a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_i b^{\circ}_{\theta}(t,T,k)\Big\{X(t-k,T)-\mu_{\theta}\Big(\frac{t-k}{T}\Big)\Big\} \\
&\quad - \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\Big[\big\{\phi(U(t)) + O_p(t^{-1})\big\}'\,a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_i a^{\circ}_{\theta}(t,T,0)\,\big\{U(t)+O_p(t^{-1})\big\}
 + \mathrm{tr}\big\{a^{\circ}_{\theta}(t,T,0)^{-1}\nabla_i a^{\circ}_{\theta}(t,T,0)\big\}\Big] \\
&\quad - \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\big\{\phi(U(t)) + O_p(t^{-1})\big\}'\sum_{k=1}^{t-1} a^{\circ}_{\theta}(t,T,0)^{-1} b^{\circ}_{\theta}(t,T,k)\,\nabla_i\mu_{\theta}\Big(\frac{t-k}{T}\Big) \\
&= \Delta_T(e;\theta)(i) + O_p\Big(\frac{\log T}{\sqrt{T}}\Big).
\end{aligned}
\]
Similarly,
\[
h^{(T)}_{\theta}(i,j) = -\Gamma_e(\theta)(i,j) + O_p\Big(\frac{\log T}{\sqrt{T}}\Big).
\]
T
Hence we can see that
\[
\sqrt{T}\big(\hat\theta_{QML}-\theta_0\big)
= -\Big[\frac{1}{T}\big\{\nabla^{2}L_T(\theta_0)\big\}' + R_T(\theta^{*})\Big]^{-1}\frac{1}{\sqrt{T}}\nabla L_T(\theta_0)
\stackrel{\mathcal{L}}{\to} \Gamma_e(\theta_0)^{-1} Y,
\]
where $Y \sim N(0,\Gamma_e(\theta_0))$. Hence,
\[
\sqrt{T}\big(\hat\theta_{QML}-\theta_0\big) \stackrel{\mathcal{L}}{\to} N(0, V^{-1}),
\]
where $V = \Gamma_e(\theta_0)$. $\square$

8.6 Shrinkage Estimation


This section develops a comprehensive study of shrinkage estimation for dependent observations. We discuss the shrinkage estimation of the mean and autocovariance functions for second-order stationary processes. This approach is extended to the introduction of a shrinkage predictor for stationary processes. Then, we show that the shrinkage estimators and predictors improve the mean squared error of the nonshrunken fundamental estimators and predictors.
Let $X(1),\cdots,X(n)$ be a sequence of independent and identically distributed random vectors distributed as $N_k(\theta, I_k)$, where $I_k$ is the $k\times k$ identity matrix. The sample mean $\bar X_n = n^{-1}\sum_{t=1}^{n} X(t)$ seems the most fundamental and natural estimator of $\theta$. However, if $k \ge 3$, Stein (1956) showed that $\bar X_n$ is not admissible with respect to the mean squared error loss function. Furthermore, James and Stein (1961) proposed a shrinkage estimator $\hat\theta_n$ defined by
\[
\hat\theta_n = \Big(1 - \frac{k-2}{n\|\bar X_n\|^{2}}\Big)\bar X_n, \quad (8.6.1)
\]
which improves on $\bar X_n$ with respect to mean squared error when $k \ge 3$.
In what follows we investigate the mean squared error of $\hat\theta_n$ and $\bar X_n$ when $X(1),\cdots,X(n)$ are from a $k$-dimensional vector stochastic process.
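Before turning to the dependent case, the i.i.d. improvement claimed for (8.6.1) can be checked by a small simulation. The following sketch compares the risks $nE\|\bar X_n-\theta\|^{2}$ and $nE\|\hat\theta_n-\theta\|^{2}$; all settings ($k = 5$, $n = 20$, 2000 replications, $\theta = 0$) are illustrative choices, not from the text.

```python
# Monte Carlo comparison of the sample mean and the James-Stein estimator (8.6.1)
# in the i.i.d. Gaussian case.  k, n, reps and theta are illustrative choices.
import random

random.seed(1)
k, n, reps = 5, 20, 2000
theta = [0.0] * k

mse_mean = mse_js = 0.0
for _ in range(reps):
    # sample mean of n i.i.d. N_k(theta, I_k) vectors
    xbar = [sum(random.gauss(theta[i], 1.0) for _ in range(n)) / n for i in range(k)]
    norm2 = sum(v * v for v in xbar)
    shrink = 1.0 - (k - 2) / (n * norm2)      # shrinkage factor of (8.6.1)
    mse_mean += n * sum((v - t) ** 2 for v, t in zip(xbar, theta))
    mse_js += n * sum((shrink * v - t) ** 2 for v, t in zip(xbar, theta))

mse_mean /= reps
mse_js /= reps
print(mse_mean, mse_js)
```

With $\theta = 0$ the risk of the sample mean is $k$, while the shrinkage estimator's estimated risk comes out markedly smaller, in line with the discussion below.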
Suppose that $\{X(t) : t\in\mathbb{Z}\}$ is a $k$-dimensional Gaussian stationary process with mean vector $E\{X(t)\} = \theta$ and autocovariance matrix $S(\ell) = E[\{X(t)-\theta\}\{X(t+\ell)-\theta\}']$, such that
\[
\sum_{\ell=-\infty}^{\infty} \|S(\ell)\| < \infty, \quad (8.6.2)
\]
where $\|\cdot\|$ is the Euclidean norm. The process $\{X(t)\}$ is therefore assumed to be short memory. It has the spectral density matrix
\[
f(\lambda) = \frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty} S(\ell)e^{-i\ell\lambda}.
\]

Next we investigate the comparison of $\bar X_n$ and the Stein–James estimator $\hat\theta_n$ in terms of mean squared error. For this we introduce the Cesàro sum approximation for the spectral density, which is defined by
\[
f^{c}_{n}(\lambda) = \frac{1}{2\pi}\sum_{\ell=-n+1}^{n-1}\Big(1-\frac{|\ell|}{n}\Big)S(\ell)e^{-i\ell\lambda},
\]
and which for short we call the C-spectral density matrix. Then
\[
\mathrm{Var}(\bar X_n) = \frac{2\pi}{n} f^{c}_{n}(0). \quad (8.6.3)
\]
Since $f(\lambda)$ is evidently continuous with respect to $\lambda$, the C-spectral density matrix $f^{c}_{n}(\lambda)$ converges uniformly to $f(\lambda)$ on $[-\pi,\pi]$ as $n\to\infty$ (Hannan (1970, p. 507)). Hence
\[
\lim_{n\to\infty} \mathrm{Var}(\sqrt{n}\,\bar X_n) = 2\pi f(0). \quad (8.6.4)
\]
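As a numerical illustration of (8.6.3)–(8.6.4), for a scalar AR(1) process the Cesàro sum $2\pi f^{c}_{n}(0)$ can be computed directly from $S(\ell)$ and compared with its limit $2\pi f(0) = \sum_{\ell} S(\ell)$. The coefficient $a = 0.5$ and the truncation points below are illustrative choices.

```python
# Check of (8.6.3)-(8.6.4) for a scalar AR(1) process with unit innovation variance:
# S(l) = a^{|l|}/(1 - a^2) and 2*pi*f(0) = sum_l S(l) = 1/(1 - a)^2.  a is illustrative.
a = 0.5

def S(l):
    # autocovariance function of the AR(1) process
    return a ** abs(l) / (1 - a * a)

def cesaro_2pi_f0(n):
    # 2*pi*f_n^c(0), the C-spectral density at frequency zero
    return sum((1 - abs(l) / n) * S(l) for l in range(-n + 1, n))

target = 1.0 / (1 - a) ** 2      # 2*pi*f(0)
print(cesaro_2pi_f0(10), cesaro_2pi_f0(1000), target)
```

The Cesàro sum approaches the limiting value $2\pi f(0)$ as $n$ grows, which is exactly the convergence used in (8.6.4).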
Let $\nu_1,\cdots,\nu_k$ ($\nu_1 < \cdots < \nu_k$) and $\nu_{1,n},\cdots,\nu_{k,n}$ ($\nu_{1,n} \le \cdots \le \nu_{k,n}$) be the eigenvalues of $2\pi f(0)$ and $2\pi f^{c}_{n}(0)$, respectively. From Magnus and Neudecker (1999, p. 163) it follows that $\nu_{j,n} \to \nu_j$ as $n\to\infty$ for $j = 1,\cdots,k$. We will evaluate
\[
\mathrm{DMSE}_n \equiv E[n\|\bar X_n-\theta\|^{2}] - E[n\|\hat\theta_n-\theta\|^{2}].
\]
Since the behaviour of $\mathrm{DMSE}_n$ when $\theta = 0$ is very different from that when $\theta \ne 0$, we first give the result for $\theta = 0$. The following four theorems are due to Taniguchi and Hirukawa (2005).
Theorem 8.6.1. Suppose that $\theta = 0$ and (8.6.2) holds. Then we have the following.
(i) $(k-2)\Big\{2-\Big(\dfrac{\nu_{k,n}}{\nu_{1,n}}\Big)^{k/2}\nu_{k,n}^{-1}\Big\} \le \mathrm{DMSE}_n \le (k-2)\Big\{2-\Big(\dfrac{\nu_{1,n}}{\nu_{k,n}}\Big)^{k/2}\nu_{1,n}^{-1}\Big\}$,
which implies that $\hat\theta_n$ improves upon $\bar X_n$ if $2 > (\nu_{k,n}/\nu_{1,n})^{k/2}\nu_{k,n}^{-1}$.
(ii) $(k-2)\Big\{2-\Big(\dfrac{\nu_{k}}{\nu_{1}}\Big)^{k/2}\nu_{k}^{-1}\Big\} \le \lim_{n\to\infty}\mathrm{DMSE}_n \le (k-2)\Big\{2-\Big(\dfrac{\nu_{1}}{\nu_{k}}\Big)^{k/2}\nu_{1}^{-1}\Big\}$,
which implies that $\hat\theta_n$ improves upon $\bar X_n$ asymptotically if $2 > (\nu_k/\nu_1)^{k/2}\nu_k^{-1}$.
Proof. From the definition of $\mathrm{DMSE}_n$, we can see that
\[
\begin{aligned}
\mathrm{DMSE}_n &= 2(k-2)E\Big\langle \frac{\bar X_n}{\|\bar X_n\|^{2}},\ \bar X_n\Big\rangle - \frac{(k-2)^{2}}{n}E\Big\langle \frac{\bar X_n}{\|\bar X_n\|^{2}},\ \frac{\bar X_n}{\|\bar X_n\|^{2}}\Big\rangle \\
&= (k-2)\Big\{2 - E\Big(\frac{k-2}{n}\,\frac{1}{\|\bar X_n\|^{2}}\Big)\Big\}, \quad (8.6.5)
\end{aligned}
\]
where $\langle A, B\rangle$ is the inner product of the vectors $A$ and $B$. Write $Z_n = \sqrt{n}\,\bar X_n$. From (8.6.3) and the Gaussianity of $\{X(t)\}$, it follows that
\[
Z_n \sim N(0, P V_n P'),
\]
where $V_n = \mathrm{diag}(\nu_{1,n},\cdots,\nu_{k,n})$ and $P$ is a $k\times k$ orthogonal matrix which diagonalizes $2\pi f^{c}_{n}(0)$. Therefore
\[
\mathrm{DMSE}_n = (k-2)\Big\{2 - (k-2)E\Big(\frac{1}{\|Z_n\|^{2}}\Big)\Big\}. \quad (8.6.6)
\]
Since (8.6.6) is invariant if $Z_n$ is transformed to $P'Z_n$, we may assume that $Z_n \sim N(0, V_n)$. Next we evaluate
\[
E\Big(\frac{1}{\|Z_n\|^{2}}\Big) = \int_{\mathbb{R}^{k}} \frac{1}{\|z\|^{2}}\,\frac{1}{(2\pi)^{k/2}|V_n|^{1/2}}\exp\Big(-\frac{1}{2}z'V_n^{-1}z\Big)\,dz, \quad (8.6.7)
\]
where $z = (z_1,\cdots,z_k)' \in \mathbb{R}^{k}$. First, the upper bound of (8.6.7) is evaluated as
\[
UP \equiv \int_{\mathbb{R}^{k}} \frac{1}{\|z\|^{2}}\,\frac{1}{(2\pi)^{k/2}\nu_{1,n}^{k/2}}\exp\Big(-\frac{1}{2}\,\frac{z'z}{\nu_{k,n}}\Big)\,dz.
\]

Consider the polar transformation
\[
\begin{aligned}
z_1 &= r\sin\rho_1 \\
z_2 &= r\cos\rho_1\sin\rho_2 \\
&\ \ \vdots \\
z_{k-1} &= r\cos\rho_1\cdots\cos\rho_{k-2}\sin\rho_{k-1} \\
z_k &= r\cos\rho_1\cdots\cos\rho_{k-2}\cos\rho_{k-1},
\end{aligned}
\]
where $-\pi/2 < \rho_i \le \pi/2$ for $i = 1,\cdots,k-2$, and $-\pi < \rho_{k-1} \le \pi$. The Jacobian is
\[
r^{k-1}\cos^{k-2}\rho_1\cos^{k-3}\rho_2\cdots\cos\rho_{k-2}
\]
(Anderson (1984, Chapter 7)). By this transformation we observe that
\[
UP = \frac{1}{k-2}\,\nu_{1,n}^{-k/2}\,\nu_{k,n}^{k/2-1},
\]
which, together with (8.6.6), implies that
\[
\mathrm{DMSE}_n \ge (k-2)\big(2 - \nu_{1,n}^{-k/2}\nu_{k,n}^{k/2-1}\big). \quad (8.6.8)
\]
Similarly we can show that
\[
\mathrm{DMSE}_n \le (k-2)\big(2 - \nu_{k,n}^{-k/2}\nu_{1,n}^{k/2-1}\big). \quad (8.6.9)
\]
Statement (i) follows from (8.6.8) and (8.6.9). Taking the limit as $n\to\infty$ in (i) leads to (ii). $\square$
We examine the upper and lower bounds for $\lim_{n\to\infty}\mathrm{DMSE}_n$ numerically. Let
\[
L = (k-2)\Big\{2-\Big(\frac{\nu_k}{\nu_1}\Big)^{k/2}\frac{1}{\nu_k}\Big\}, \qquad
U = (k-2)\Big\{2-\Big(\frac{\nu_1}{\nu_k}\Big)^{k/2}\frac{1}{\nu_1}\Big\}. \quad (8.6.10)
\]
Example 8.6.2.
(i) (Vector autoregression) Suppose that $\{X(t)\}$ has the spectral density matrix
\[
f(\lambda) = \begin{pmatrix}
(2\pi)^{-1}|1-\eta_1 e^{i\lambda}|^{-2} & 0 & 0 \\
0 & (2\pi)^{-1}|1-\eta_2 e^{i\lambda}|^{-2} & 0 \\
0 & 0 & (2\pi)^{-1}|1-\eta_3 e^{i\lambda}|^{-2}
\end{pmatrix},
\]
where $0 \le \eta_1 < \eta_2 < \eta_3 < 1$. In this case,
\[
\nu_1 = |1-\eta_1|^{-2}, \quad \nu_2 = |1-\eta_2|^{-2}, \quad \nu_3 = |1-\eta_3|^{-2},
\]
so that
\[
L = 2-\Big(\frac{|1-\eta_1|}{|1-\eta_3|}\Big)^{3}|1-\eta_3|^{2}, \qquad
U = 2-\Big(\frac{|1-\eta_3|}{|1-\eta_1|}\Big)^{3}|1-\eta_1|^{2}.
\]
(ii) (Vector moving average) Suppose that $\{X(t)\}$ has the spectral density matrix
\[
f(\lambda) = \begin{pmatrix}
(2\pi)^{-1}|1-\eta_1 e^{i\lambda}|^{2} & 0 & 0 \\
0 & (2\pi)^{-1}|1-\eta_2 e^{i\lambda}|^{2} & 0 \\
0 & 0 & (2\pi)^{-1}|1-\eta_3 e^{i\lambda}|^{2}
\end{pmatrix},
\]
where $0 \le \eta_1 < \eta_2 < \eta_3 < 1$. In this case,
\[
\nu_1 = |1-\eta_3|^{2}, \quad \nu_2 = |1-\eta_2|^{2}, \quad \nu_3 = |1-\eta_1|^{2},
\]
so that
\[
L = 2-\Big(\frac{|1-\eta_1|}{|1-\eta_3|}\Big)^{3}\frac{1}{|1-\eta_1|^{2}}, \qquad
U = 2-\Big(\frac{|1-\eta_3|}{|1-\eta_1|}\Big)^{3}\frac{1}{|1-\eta_3|^{2}}.
\]
Figures 8.2 (a) and (b) show $L$ and $U$, respectively.

Figure 8.2: (a) $L$ (left-hand side), (b) $U$ (right-hand side)

Figure 8.2 shows that $\hat\theta_n$ improves upon $\bar X_n$ if $\nu_1 \approx \nu_3$ and $\nu_1$ and $\nu_3$ are not close to 0.
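The bounds $L$ and $U$ of (8.6.10) are easy to evaluate numerically. The sketch below does so for the vector autoregression of Example 8.6.2(i); the $\eta$ values are illustrative choices.

```python
# Bounds L and U of (8.6.10) for the vector autoregression of Example 8.6.2(i),
# where nu_j = |1 - eta_j|^{-2} and k = 3.  The eta values are illustrative.
def bounds(nu, k):
    nu = sorted(nu)
    nu1, nuk = nu[0], nu[-1]
    L = (k - 2) * (2 - (nuk / nu1) ** (k / 2) / nuk)
    U = (k - 2) * (2 - (nu1 / nuk) ** (k / 2) / nu1)
    return L, U

etas = (0.1, 0.2, 0.3)
nus = [abs(1 - e) ** (-2) for e in etas]
L, U = bounds(nus, 3)
print(L, U)   # L > 0 means the Stein-James estimator improves on the sample mean
```

Here the eigenvalues are of comparable size and not close to 0, so $L > 0$, matching the reading of Figure 8.2.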
Next we discuss the case of $\theta \ne 0$.

Theorem 8.6.3. Suppose that $\theta \ne 0$ and that (8.6.2) holds.
(i) Then
\[
\mathrm{DMSE}_n = \frac{1}{n}\triangle_n(k,\theta) + o\Big(\frac{1}{n}\Big), \quad (8.6.11)
\]
where
\[
\triangle_n(k,\theta) = \frac{2(k-2)}{\|\theta\|^{2}}\Big[\mathrm{tr}\{2\pi f^{c}_{n}(0)\} - \frac{2\theta'\{2\pi f^{c}_{n}(0)\}\theta}{\|\theta\|^{2}} - \frac{k-2}{2}\Big]. \quad (8.6.12)
\]
Hence,
\[
\triangle_n(k,\theta) \ge \frac{2(k-2)}{\|\theta\|^{2}}\Big(\sum_{j=1}^{k-1}\nu_{j,n} - \nu_{k,n} - \frac{k-2}{2}\Big), \quad (8.6.13)
\]
which implies that $\hat\theta_n$ improves upon $\bar X_n$ up to $O(n^{-1})$ if
\[
\sum_{j=1}^{k-1}\nu_{j,n} - \nu_{k,n} - \frac{k-2}{2} > 0. \quad (8.6.14)
\]
(ii) Taking the limit of (8.6.13), we have
\[
\lim_{n\to\infty}\triangle_n(k,\theta) \ge \frac{2(k-2)}{\|\theta\|^{2}}\Big(\sum_{j=1}^{k-1}\nu_{j} - \nu_{k} - \frac{k-2}{2}\Big), \quad (8.6.15)
\]
which implies that $\hat\theta_n$ improves upon $\bar X_n$ asymptotically up to $O(n^{-1})$ if
\[
\sum_{j=1}^{k-1}\nu_{j} - \nu_{k} - \frac{k-2}{2} > 0.
\]
Proof. First, we obtain
\[
\begin{aligned}
\mathrm{DMSE}_n &= n\Big\{E\langle \bar X_n-\theta,\ \bar X_n-\theta\rangle
 - E\Big\langle \bar X_n-\theta-\frac{k-2}{n\|\bar X_n\|^{2}}\bar X_n,\ \bar X_n-\theta-\frac{k-2}{n\|\bar X_n\|^{2}}\bar X_n\Big\rangle\Big\} \\
&= n\Big\{2E\Big\langle \bar X_n-\theta,\ \frac{k-2}{n\|\bar X_n\|^{2}}\bar X_n\Big\rangle - \frac{(k-2)^{2}}{n}E\Big(\frac{1}{n\|\bar X_n\|^{2}}\Big)\Big\}.
\end{aligned}
\]
We can write $\bar X_n - \theta = (1/\sqrt{n})Z_n$, where $Z_n \sim N\big(0, 2\pi f^{c}_{n}(0)\big)$. Hence,
\[
\mathrm{DMSE}_n = \frac{2(k-2)}{\sqrt{n}}E\Big\langle Z_n,\ \frac{\theta+\frac{1}{\sqrt{n}}Z_n}{\|\theta+\frac{1}{\sqrt{n}}Z_n\|^{2}}\Big\rangle - \frac{(k-2)^{2}}{n}E\Big(\frac{1}{\|\bar X_n\|^{2}}\Big)
= (I) + (II), \quad(\text{say}).
\]
From Theorem 4.5.1 of Brillinger (1981, p. 98), there exists a constant $C > 0$ such that
\[
\limsup_{n\to\infty} \|Z_n\| \le C\sqrt{\log n}
\]
almost everywhere, which shows that, almost everywhere,
\[
\frac{1}{\|\theta+\frac{1}{\sqrt{n}}Z_n\|^{2}}
= \frac{1}{\|\theta\|^{2}+\frac{2}{\sqrt{n}}\langle Z_n,\theta\rangle+O\big(\frac{\log n}{n}\big)}
= \frac{1}{\|\theta\|^{2}}\Big\{1-\frac{2}{\sqrt{n}\,\|\theta\|^{2}}\langle Z_n,\theta\rangle+O\Big(\frac{\log n}{n}\Big)\Big\}.
\]
From this and by Fatou's lemma it is seen that
\[
\begin{aligned}
(I) &= \frac{2(k-2)}{\sqrt{n}}E\Big\{\Big\langle Z_n,\ \Big(\theta+\frac{1}{\sqrt{n}}Z_n\Big)\Big(\|\theta\|^{-2}-\frac{2}{\sqrt{n}}\|\theta\|^{-4}\langle Z_n,\theta\rangle\Big)\Big\rangle\Big\} + o(n^{-1}) \\
&= \frac{2(k-2)}{\sqrt{n}}\Big\{E\Big\langle Z_n,\ \frac{1}{\sqrt{n}}Z_n\Big\rangle\|\theta\|^{-2} - E\big(\langle Z_n,\theta\rangle^{2}\big)\frac{2}{\sqrt{n}}\|\theta\|^{-4}\Big\} + o(n^{-1}) \\
&= \frac{2(k-2)}{n\|\theta\|^{2}}\Big[\mathrm{tr}\{2\pi f^{c}_{n}(0)\} - \frac{2}{\|\theta\|^{2}}E\big(\langle Z_n,\theta\rangle^{2}\big)\Big] + o(n^{-1}) \\
&= \frac{2(k-2)}{n\|\theta\|^{2}}\Big[\mathrm{tr}\{2\pi f^{c}_{n}(0)\} - \frac{2\theta'\{2\pi f^{c}_{n}(0)\}\theta}{\|\theta\|^{2}}\Big] + o(n^{-1}).
\end{aligned}
\]
By a similar argument we have that
\[
(II) = -\frac{(k-2)^{2}}{n}\|\theta\|^{-2} + o(n^{-1}).
\]
Then, the assertions (8.6.11) and (8.6.12) follow. Inequality (8.6.13) follows from the fact that
\[
\nu_{k,n} = \max_{\theta}\frac{\theta'\{2\pi f^{c}_{n}(0)\}\theta}{\|\theta\|^{2}}. \qquad \square
\]

Remark 8.6.4. If $\{X(t)\}$ is a sequence of i.i.d. Gaussian random vectors with mean $\theta$ and identity variance matrix, then the right-hand side of (8.6.13) becomes $(k-2)^{2}/\|\theta\|^{2} > 0$, which implies that $\hat\theta_n$ always improves upon $\bar X_n$.
Remark 8.6.5. Although in the case of $\theta \ne 0$, $\mathrm{DMSE}_n$ is of order $O(n^{-1})$, if $\|\theta\|$ is very near to 0, or if all the eigenvalues $\nu_{j,n}$ or $\nu_j$ are very large, so that (8.6.13) and (8.6.15) become large, Theorem 8.6.3 implies that $\hat\theta_n$ improves upon $\bar X_n$ substantially even if $n$ is fairly large.
Remark 8.6.6. If $k = 3$, then the lower bound (8.6.15) is
\[
\frac{2}{\|\theta\|^{2}}\big(\nu_1+\nu_2-\nu_3-1/2\big). \quad (8.6.16)
\]
If, in Example 8.6.2(i), the autoregression coefficients $\eta_1$, $\eta_2$ and $\eta_3$ are near to 1 with $0 \le \eta_1 < \eta_2 < \eta_3 < 1$, then (8.6.16) $\to \infty$ as $\eta_1 \to 1$. Hence $\hat\theta_n$ improves upon $\bar X_n$ greatly in such a case. In financial time series analysis, many cases show "near unit root" behaviour, and therefore it is important to consider shrinkage estimators.
In what follows we introduce Stein–James estimators for long-memory processes. Suppose that $\{X(t) : t\in\mathbb{Z}\}$ is a $k$-dimensional Gaussian stationary process with autocovariance matrix $S(\ell)$, but where the $S(\ell)$'s do not satisfy (8.6.2). Instead of that we assume that
\[
S(\ell) \sim \ell^{2d-1}, \quad \ell\in\mathbb{Z}, \quad (8.6.17)
\]
where $0 < d < \frac{1}{2}$, and $\sim$ means that $S(\ell)/\ell^{2d-1}$ tends to some limit as $|\ell|\to\infty$. Recalling (8.6.3), we see that
\[
\mathrm{var}(n^{1/2-d}\,\bar X_n) = \frac{1}{n^{2d}}\sum_{\ell=-n+1}^{n-1}\Big(1-\frac{|\ell|}{n}\Big)S(\ell) = H_n \quad(\text{say}), \quad (8.6.18)
\]
and from (8.6.17) that there exists a matrix $H$ such that $\lim_{n\to\infty}H_n = H$. Let $u_1,\cdots,u_k$ ($u_1 < \cdots < u_k$) and $u_{1,n},\cdots,u_{k,n}$ ($u_{1,n} \le \cdots \le u_{k,n}$) be the eigenvalues of $H$ and $H_n$, respectively. Write
\[
\mathrm{DMSE}^{L}_{n} \equiv E\big(n^{1-2d}\|\bar X_n-\theta\|^{2}\big) - E\big(n^{1-2d}\|\tilde\theta_n-\theta\|^{2}\big), \quad (8.6.19)
\]
where $\tilde\theta_n = \big\{1-(k-2)/(n^{1-2d}\|\bar X_n\|^{2})\big\}\bar X_n$. For $\theta = 0$, we can prove, similarly to Theorem 8.6.1, that
\[
\begin{aligned}
(k-2)\Big\{2-\Big(\frac{u_{k,n}}{u_{1,n}}\Big)^{k/2}\frac{1}{u_{k,n}}\Big\} &\le \mathrm{DMSE}^{L}_{n} \le (k-2)\Big\{2-\Big(\frac{u_{1,n}}{u_{k,n}}\Big)^{k/2}\frac{1}{u_{1,n}}\Big\}, \\
(k-2)\Big\{2-\Big(\frac{u_{k}}{u_{1}}\Big)^{k/2}\frac{1}{u_{k}}\Big\} &\le \lim_{n\to\infty}\mathrm{DMSE}^{L}_{n} \le (k-2)\Big\{2-\Big(\frac{u_{1}}{u_{k}}\Big)^{k/2}\frac{1}{u_{1}}\Big\}
\end{aligned} \quad (8.6.20)
\]
if $k \ge 3$.
Next we evaluate the lower and upper bounds of (8.6.20) more explicitly. We assume that the spectral density matrix of $\{X(t)\}$ is diagonalized by an orthogonal matrix to become
\[
\begin{pmatrix}
\frac{1}{2\pi}\frac{1}{|1-e^{i\lambda}|^{2d}}g_{11}(\lambda) & & 0 \\
& \ddots & \\
0 & & \frac{1}{2\pi}\frac{1}{|1-e^{i\lambda}|^{2d}}g_{kk}(\lambda)
\end{pmatrix}, \quad (8.6.21)
\]
where the $g_{jj}(\lambda)$'s are bounded away from zero and are continuous at $\lambda = 0$, with
\[
g_{11}(0) < \cdots < g_{kk}(0).
\]
In this case, from Samarov and Taqqu (1988) it follows that the matrix $H$ becomes
\[
\begin{pmatrix}
g_{11}(0)\frac{\Gamma(1-2d)}{\Gamma(d)\Gamma(1-d)d(1+2d)} & & 0 \\
& \ddots & \\
0 & & g_{kk}(0)\frac{\Gamma(1-2d)}{\Gamma(d)\Gamma(1-d)d(1+2d)}
\end{pmatrix}. \quad (8.6.22)
\]
This is summarized in Theorem 8.6.7.
Theorem 8.6.7. Suppose that $\theta = 0$ and that (8.6.21) holds. Then,
\[
\begin{aligned}
(k-2)\Big[2-\Big\{\frac{g_{kk}(0)}{g_{11}(0)}\Big\}^{k/2}\frac{\Gamma(d)\Gamma(1-d)d(1+2d)}{g_{kk}(0)\Gamma(1-2d)}\Big]
&\le \lim_{n\to\infty}\mathrm{DMSE}^{L}_{n} \\
&\le (k-2)\Big[2-\Big\{\frac{g_{11}(0)}{g_{kk}(0)}\Big\}^{k/2}\frac{\Gamma(d)\Gamma(1-d)d(1+2d)}{g_{11}(0)\Gamma(1-2d)}\Big]. \quad (8.6.23)
\end{aligned}
\]
Let us examine the upper and lower bounds $U$ and $L$ of (8.6.23).
Example 8.6.8. Let $k = 3$ and $g_{11}(\lambda) = 1$, $g_{22}(\lambda) = |1-\eta_1 e^{i\lambda}|^{-2}$, $g_{33}(\lambda) = |1-\eta_2 e^{i\lambda}|^{-2}$ with $0 < \eta_1 < \eta_2 < 1$. Then
\[
L = 2-\frac{1}{|1-\eta_2|}\,\frac{\Gamma(d)\Gamma(1-d)d(1+2d)}{\Gamma(1-2d)}, \qquad
U = 2-|1-\eta_2|^{3}\,\frac{\Gamma(d)\Gamma(1-d)d(1+2d)}{\Gamma(1-2d)}. \quad (8.6.24)
\]
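The bounds (8.6.24) involve only Gamma functions and $|1-\eta_2|$, so they can be evaluated with the standard library `math.gamma`; the $(d,\eta_2)$ values below are illustrative choices.

```python
# Bounds (8.6.24) of Example 8.6.8 as functions of (d, eta2); k = 3, so k - 2 = 1.
# The chosen (d, eta2) values are illustrative.
import math

def kappa(d):
    # the constant Gamma(d)Gamma(1-d)d(1+2d)/Gamma(1-2d) appearing in (8.6.22)
    return math.gamma(d) * math.gamma(1 - d) * d * (1 + 2 * d) / math.gamma(1 - 2 * d)

def LU(d, eta2):
    L = 2 - kappa(d) / abs(1 - eta2)
    U = 2 - abs(1 - eta2) ** 3 * kappa(d)
    return L, U

print(LU(0.4, 0.2))   # L grows toward 2 as d increases toward 1/2
```

Since $\Gamma(1-2d)\to\infty$ as $d\to 1/2$, the constant shrinks and $L$ approaches 2, which is the numerical counterpart of the observation below about Figure 8.3.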

Figure 8.3: (a) L (left-hand side), (b) U (right-hand side)

We plotted L and U in Figure 8.3. We observe that, if d increases toward 1/2 and if η2 is not
large, θ̃n improves upon X̄n . Next we discuss the case of θ , 0.
Theorem 8.6.9. Suppose that $\theta \ne 0$ and that (8.6.21) holds. Then
\[
\mathrm{DMSE}^{L}_{n} = \frac{1}{n^{1-2d}}\triangle^{L}_{n}(k,\theta) + o\Big(\frac{1}{n^{1-2d}}\Big),
\]
where
\[
\triangle^{L}_{n}(k,\theta) = \frac{2(k-2)}{\|\theta\|^{2}}\Big\{\mathrm{tr}(H_n) - \frac{2\theta' H_n\theta}{\|\theta\|^{2}} - \frac{k-2}{2}\Big\}.
\]
Hence,
\[
\triangle^{L}_{n}(k,\theta) \ge \frac{2(k-2)}{\|\theta\|^{2}}\Big(\sum_{j=1}^{k-1}u_{j,n} - u_{k,n} - \frac{k-2}{2}\Big). \quad (8.6.25)
\]
Taking the limit of (8.6.25), we obtain
\[
\lim_{n\to\infty}\triangle^{L}_{n}(k,\theta) \ge \frac{2(k-2)}{\|\theta\|^{2}}\Big(\sum_{j=1}^{k-1}u_{j} - u_{k} - \frac{k-2}{2}\Big). \quad (8.6.26)
\]
If the right-hand sides of (8.6.25) and (8.6.26) are positive, $\tilde\theta_n$ improves upon $\bar X_n$.
Proof. Write $\bar X_n - \theta = (1/n^{1/2-d})Z_n$, and replace $2\pi f^{c}_{n}(0)$ by $H_n$, in the proof of Theorem 8.6.3. From Theorem 2 of Taqqu (1977) it follows that there exists $C > 0$ such that
\[
\|Z_n\| \le C\sqrt{\log\log n}
\]
almost surely; this is the law of the iterated logarithm for the sum of long-memory processes. Then, along similar lines to the proof of Theorem 8.6.3, the results follow. $\square$

In the context of Example 8.6.8, we observe that $\tilde\theta_n$ improves upon $\bar X_n$ when $\|\theta\|$ decreases toward zero or $d$ increases toward 0.5.
So far we assumed that the mean of the concerned process is a constant vector. However, it is more natural that the mean structure is expressed as a regression form on other variables. In what follows, we consider the linear regression model of the form
\[
y(t) = B'x(t) + \varepsilon(t), \quad t = 1,\cdots,n, \quad (8.6.27)
\]
where $y(t) = (y_1(t),\cdots,y_p(t))'$, $B \equiv (b_{ij})$ is a $q\times p$ matrix of unknown coefficients and $\{\varepsilon(t)\}$ is a $p$-dimensional Gaussian stationary process with mean vector $E[\varepsilon(t)] = 0$ and spectral density matrix $f(\lambda)$. Here $x(t) = (x_1(t),\cdots,x_q(t))'$ is a $q\times 1$ vector of regressors satisfying the assumptions below. Let $d_i^{2}(n) \equiv \sum_{t=1}^{n}\{x_i(t)\}^{2}$, $i = 1,\cdots,q$.

Assumption 8.6.10. $\lim_{n\to\infty} d_i^{2}(n) = \infty$, $i = 1,\cdots,q$.

Assumption 8.6.11. $\lim_{n\to\infty} \dfrac{\{x_i(n)\}^{2}}{d_i^{2}(n)} = 0$, $i = 1,\cdots,q$, and, for some $\kappa_i > 0$,
\[
\frac{x_i(n)}{n^{\kappa_i}} = O(1), \quad i = 1,\cdots,q.
\]

Assumption 8.6.12. For each $i, j = 1,\cdots,q$, there exists the limit
\[
\rho_{ij}(h) \equiv \lim_{n\to\infty}\frac{\sum_{t=1}^{n} x_i(t)x_j(t+h)}{d_i(n)d_j(n)}.
\]
Let $R(h) \equiv (\rho_{ij}(h))$.

Assumption 8.6.13. $R(0)$ is nonsingular.

From Assumptions 8.6.10–8.6.13, we can express $R(h)$ as
\[
R(h) = \int_{-\pi}^{\pi} e^{ih\lambda}\,dM(\lambda), \quad (8.6.28)
\]
where $M(\lambda)$ is a matrix function whose increments are Hermitian nonnegative, and which is uniquely defined if it is required to be continuous from the right and null at $\lambda = -\pi$. Assumptions 8.6.10–8.6.13 are a slight modification of Grenander's condition. Now we make an additional assumption.

Assumption 8.6.14. $\{\varepsilon(t)\}$ has an absolutely continuous spectrum which is piecewise continuous with no discontinuities at the jumps of $M(\lambda)$.
We rewrite (8.6.27) in the tensor notation
\[
y = (I_p \otimes X)\beta + \varepsilon = U\beta + \varepsilon, \quad (8.6.29)
\]
wherein $y_i(t)$ is in row $(i-1)n+t$ of $y$, $\varepsilon_i(t)$ is in row $(i-1)n+t$ of $\varepsilon$, $X$ has $x_j(t)$ in row $t$, column $j$, $\beta$ has $b_{ij}$ in row $(j-1)q+i$, and $U \equiv I_p \otimes X$.
If we are interested in estimation of $\beta$ based on $y$, the most fundamental candidate is the least squares estimator $\hat\beta_{LS} = (U'U)^{-1}U'y$. When the $\varepsilon(t)$'s are i.i.d., it is known that $\hat\beta_{LS}$ is not admissible if $pq \ge 3$. Stein (1956) proposed an alternative estimator,
\[
\hat\beta_{JS} \equiv \Big(1 - \frac{c}{\|(I_p\otimes D_n)(\hat\beta_{LS}-b)\|^{2}}\Big)(\hat\beta_{LS}-b) + b, \quad (8.6.30)
\]
which is called the James–Stein estimator for $\beta$. Here $c > 0$, $D_n \equiv \mathrm{diag}(d_1(n),\cdots,d_q(n))$, and $b$ is a preassigned $pq\times 1$ vector toward which we shrink $\hat\beta_{LS}$. Stein showed that the risk of $\hat\beta_{JS}$ is everywhere smaller than that of $\hat\beta_{LS}$ under the MSE loss function
\[
\mathrm{MSE}_n(\hat\beta) \equiv E[\|(I_p\otimes D_n)(\hat\beta-\beta)\|^{2}]
\]
if $pq \ge 3$ and $c = pq-2$.
In what follows, we compare the risk of $\hat\beta_{JS}$ with that of $\hat\beta_{LS}$ when $\{\varepsilon(t)\}$ is a Gaussian stationary process. From Gaussianity of $\{\varepsilon(t)\}$, it is seen that
\[
(I_p\otimes D_n)(\hat\beta_{LS}-b) \sim N\big((I_p\otimes D_n)(\beta-b),\, C_n\big), \quad (8.6.31)
\]
where
\[
C_n \equiv (I_p\otimes D_n)(U'U)^{-1}U'\,\mathrm{cov}(\varepsilon)\,U(U'U)^{-1}(I_p\otimes D_n).
\]
From Theorem 8 of Hannan (1970, p. 216), if Assumptions 8.6.10–8.6.14 hold, it follows that
\[
C \equiv \lim_{n\to\infty} C_n = (I_p\otimes R(0))^{-1}\int_{-\pi}^{\pi} 2\pi f(\lambda)\otimes M'(d\lambda)\,(I_p\otimes R(0))^{-1}. \quad (8.6.32)
\]
Let $\nu_{1,n},\cdots,\nu_{pq,n}$ ($\nu_{1,n} \le \cdots \le \nu_{pq,n}$) and $\nu_1,\cdots,\nu_{pq}$ ($\nu_1 \le \cdots \le \nu_{pq}$) be the eigenvalues of $C_n$ and $C$, respectively. We evaluate
\[
\begin{aligned}
\mathrm{DMSE}_n &\equiv E[\|(I_p\otimes D_n)(\hat\beta_{LS}-\beta)\|^{2}] - E[\|(I_p\otimes D_n)(\hat\beta_{JS}-\beta)\|^{2}] \\
&= -c^{2}E\Big[\frac{1}{\|(I_p\otimes D_n)(\hat\beta_{LS}-b)\|^{2}}\Big]
 + 2c\Big\{1 - E\Big[\frac{\langle (I_p\otimes D_n)(\beta-b),\ (I_p\otimes D_n)(\hat\beta_{LS}-b)\rangle}{\|(I_p\otimes D_n)(\hat\beta_{LS}-b)\|^{2}}\Big]\Big\}. \quad (8.6.33)
\end{aligned}
\]
Because the behaviour of $\mathrm{DMSE}_n$ in the case of $\beta-b = 0$ is very different from that in the case of $\beta-b \ne 0$, first, we give the result for $\beta-b = 0$. Theorems 8.6.15 and 8.6.16 are generalizations of Theorems 8.6.1 and 8.6.3, respectively (for the proofs, see Senda and Taniguchi (2006)).
Theorem 8.6.15. In the case of $\beta-b = 0$, suppose that Assumptions 8.6.10–8.6.14 hold and that $pq \ge 3$. Then,
(i)
\[
c\Big\{2-\frac{c}{pq-2}\Big(\frac{\nu_{pq,n}}{\nu_{1,n}}\Big)^{pq/2}\frac{1}{\nu_{pq,n}}\Big\}
\le \mathrm{DMSE}_n \le
c\Big\{2-\frac{c}{pq-2}\Big(\frac{\nu_{1,n}}{\nu_{pq,n}}\Big)^{pq/2}\frac{1}{\nu_{1,n}}\Big\}, \quad (8.6.34)
\]
which implies that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ if the left-hand side of (8.6.34) is positive.
(ii)
\[
c\Big\{2-\frac{c}{pq-2}\Big(\frac{\nu_{pq}}{\nu_{1}}\Big)^{pq/2}\frac{1}{\nu_{pq}}\Big\}
\le \lim_{n\to\infty}\mathrm{DMSE}_n \le
c\Big\{2-\frac{c}{pq-2}\Big(\frac{\nu_{1}}{\nu_{pq}}\Big)^{pq/2}\frac{1}{\nu_{1}}\Big\}, \quad (8.6.35)
\]
which implies that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ asymptotically if the left-hand side of (8.6.35) is positive.
From this theorem we observe that, if $\nu_{1,n} \approx \cdots \approx \nu_{pq,n} \nearrow \infty$ ($\nu_1 \approx \cdots \approx \nu_{pq} \nearrow \infty$), then $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ enormously.
Next, we discuss the case of $\beta-b \ne 0$. Letting $d(n)^{2} \equiv \sum_{i=1}^{q} d_i^{2}(n)$, we note that
\[
\|(I_p\otimes D_n)(\beta-b)\|^{2} = O(d^{2}(n)).
\]

Theorem 8.6.16. In the case of $\beta-b \ne 0$, suppose that Assumptions 8.6.10–8.6.14 hold and that $pq \ge 3$. Then,
(i)
\[
\mathrm{DMSE}_n = \frac{2c}{\|(I_p\otimes D_n)(\beta-b)\|^{2}}\{\Delta_n(c,\beta-b) + o(1)\}, \quad (8.6.36)
\]
where
\[
\Delta_n(c,\beta-b) = \mathrm{tr}\,C_n - \frac{2(\beta-b)'(I_p\otimes D_n)C_n(I_p\otimes D_n)(\beta-b)}{\|(I_p\otimes D_n)(\beta-b)\|^{2}} - \frac{c}{2}.
\]
Here we also have the inequality
\[
\Delta_n(c,\beta-b) \ge \sum_{j=1}^{pq-1}\nu_{j,n} - \nu_{pq,n} - \frac{c}{2}, \quad (8.6.37)
\]
which implies that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ up to $1/d^{2}(n)$ order if the right-hand side of (8.6.37) is positive.
(ii) Taking the limit of (8.6.37), we have
\[
\lim_{n\to\infty}\Delta_n(c,\beta-b) \ge \sum_{j=1}^{pq-1}\nu_{j} - \nu_{pq} - \frac{c}{2}, \quad (8.6.38)
\]
which implies that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ asymptotically up to $1/d^{2}(n)$ order if the right-hand side of (8.6.38) is positive.
From (8.6.38) we observe that if $\nu_1 \approx \nu_2 \approx \cdots \approx \nu_{pq}$ and $\nu_1 \nearrow \infty$, then $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ enormously.
Example 8.6.17. Let $p = 1$ and $x_i(t) = \cos\lambda_i t$, $i = 1,\cdots,q$, with $0 \le \lambda_1 < \cdots < \lambda_q \le \pi$ (time series regression model with cyclic trend). Then $d_i^{2}(n) = O(n)$. We find that $R(h) = \mathrm{diag}(\cos\lambda_1 h,\cdots,\cos\lambda_q h)$, because
\[
\rho_{ij}(h) = \begin{cases} \cos\lambda_i h & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}
\]
Hence $M(\lambda)$ has jumps at $\lambda = \pm\lambda_i$, $i = 1,\cdots,q$. The jump at $\lambda = \pm\lambda_i$ is $1/2$ at the $i$th diagonal and 0's elsewhere. Suppose that $\{\varepsilon(t)\}$ is a Gaussian stationary process with spectral density $f(\lambda)$ satisfying Assumption 8.6.14. Then it follows that
\[
C = \mathrm{diag}(2\pi f(\lambda_1),\cdots,2\pi f(\lambda_q)).
\]
Let $f(\lambda_{(j)})$ be the $j$th smallest value of $f(\lambda_1),\cdots,f(\lambda_q)$. Then
\[
\nu_j = 2\pi f(\lambda_{(j)}), \quad j = 1,\cdots,q.
\]
Thus the right-hand side of (8.6.38) is
\[
2\pi\Big\{\sum_{j=1}^{q-1} f(\lambda_{(j)}) - f(\lambda_{(q)})\Big\} - \frac{c}{2}. \quad (8.6.39)
\]
If all the values of the $f(\lambda_{(j)})$'s are near and large, then (8.6.39) becomes large, which means that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ enormously. Because our model is very fundamental in econometrics, etc., the results seem important.
Next we give numerical comparisons between $\hat\beta_{LS}$ and $\hat\beta_{JS}$ by Monte Carlo simulations in the models of Example 8.6.17. We approximate
\[
\mathrm{MSE}_n(\hat\beta) = E[\|(I_p\otimes D_n)(\hat\beta-\beta)\|^{2}]
\]
by
\[
\frac{1}{rep-1}\sum_{j=1}^{rep}\Big\|(I_p\otimes D_n)\Big(\hat\beta_j-\frac{1}{rep}\sum_{k=1}^{rep}\hat\beta_k\Big)\Big\|^{2}
+ \Big\|(I_p\otimes D_n)\Big(\frac{1}{rep}\sum_{k=1}^{rep}\hat\beta_k-\beta\Big)\Big\|^{2}, \quad (8.6.40)
\]
where $\hat\beta_k$ is the estimate in the $k$th simulation. For $n = 100$, $q = 10$, $\lambda_j = 10^{-1}\pi(j-1)$, and $rep = 1000$, we calculate (8.6.40).

Table 8.5: MSE for $\hat\beta_{LS}$ and $\hat\beta_{JS}$

                              θ = 0.1    θ = 0.9
    MSE of $\hat\beta_{LS}$    10.234    107.623
    MSE of $\hat\beta_{JS}$     5.102     94.770

Table 8.5 shows the values of MSE for the case when $\beta-b = (0.1,\cdots,0.1)'$ and the model is AR(1) with coefficient $\theta = 0.1$ and $0.9$. We observe that $\hat\beta_{JS}$ improves $\hat\beta_{LS}$ in both cases.
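A minimal Monte Carlo in the spirit of Example 8.6.17 can be sketched as follows. For simplicity it uses i.i.d. $N(0,1)$ errors instead of the AR(1) errors behind Table 8.5, and smaller $q$ and $rep$, so the numbers are only indicative; all settings are illustrative choices.

```python
# Monte Carlo sketch in the spirit of Example 8.6.17 (p = 1, cyclic-trend regressors),
# but with i.i.d. N(0,1) errors and smaller q and rep; all settings are illustrative.
import math, random

random.seed(1)
n, q, rep = 100, 5, 500
lam = [math.pi * j / 10 for j in range(q)]            # lambda_j = pi*(j-1)/10
X = [[math.cos(lam[j] * t) for j in range(q)] for t in range(1, n + 1)]
d2 = [sum(X[t][j] ** 2 for t in range(n)) for j in range(q)]   # d_j(n)^2
beta = [0.1] * q          # true coefficients; shrinkage target b = 0
c = q - 2                 # Stein's choice c = pq - 2 with p = 1

mse_ls = mse_js = 0.0
for _ in range(rep):
    y = [sum(X[t][j] * beta[j] for j in range(q)) + random.gauss(0, 1) for t in range(n)]
    # cosines at distinct Fourier frequencies are orthogonal, so OLS is diagonal
    b_ls = [sum(X[t][j] * y[t] for t in range(n)) / d2[j] for j in range(q)]
    s = sum(d2[j] * b_ls[j] ** 2 for j in range(q))   # ||(I_p x D_n) b_ls||^2
    shrink = 1 - c / s                                # James-Stein factor of (8.6.30)
    mse_ls += sum(d2[j] * (b_ls[j] - beta[j]) ** 2 for j in range(q))
    mse_js += sum(d2[j] * (shrink * b_ls[j] - beta[j]) ** 2 for j in range(q))

ml, mj = mse_ls / rep, mse_js / rep
print(ml, mj)
```

The least squares risk is close to $q$ here, while the James–Stein estimator comes out smaller, in qualitative agreement with Table 8.5.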
8.7 Shrinkage Interpolation for Stationary Processes
Interpolation is an important issue in statistics, e.g., in missing data analysis, so we can apply the results to estimation of portfolio coefficients when an asset value is missing. This section discusses the interpolation problem. When the spectral density of the time series is known, the best interpolators for missing data are discussed in Grenander and Rosenblatt (1957) and Hannan (1970). However, the spectral density is usually unknown, so in order to get the interpolator, we may use a pseudo spectral density. In this case, Taniguchi (1981) discussed the misspecified interpolation problem and showed its robustness. The pseudo interpolator seems good but, similarly to the prediction problem, it is worth considering whether a better pseudo interpolator with respect to mean squared interpolation error (MSIE) exists. In this section, we show that the pseudo interpolator is not optimal with respect to MSIE by proposing a pseudo shrinkage interpolator motivated by James and Stein (1961). Under appropriate conditions, it improves the usual pseudo interpolator in the sense of MSIE. This section is based on the results of Suto and Taniguchi (2016).
Let $\{X(t); t\in\mathbb{Z}\}$ be a $q$-dimensional stationary process with mean 0 and true spectral density matrix $g(\lambda)$. We discuss the interpolation problem in which, having observed the data $\{X(t); t\in\mathbb{Z}, t\ne 0\}$, we estimate $X(0)$. According to Hannan (1970), the optimal response function of the interpolating filter is given as
\[
h(\lambda) = I_q - \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} g(\mu)^{-1}d\mu\Big)^{-1} g(\lambda)^{-1}, \quad (8.7.1)
\]
where $I_q$ is the $q\times q$ identity matrix, and the interpolation error matrix is
\[
\Sigma = \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi}\{2\pi g(\lambda)\}^{-1}d\lambda\Big)^{-1}. \quad (8.7.2)
\]
In actual analysis, it is natural that the true spectral density $g(\lambda)$ is unknown, and we often use a pseudo spectral density $f(\lambda)$. In this case, we obtain the pseudo response function of the interpolating filter
\[
h^{f}(\lambda) = I_q - \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\mu)^{-1}d\mu\Big)^{-1} f(\lambda)^{-1}, \quad (8.7.3)
\]
and the pseudo interpolation error matrix
\[
\Sigma_f = \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)^{-1}d\lambda\Big)^{-1}\Big(\int_{-\pi}^{\pi} f(\lambda)^{-1}g(\lambda)f(\lambda)^{-1}d\lambda\Big)\Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)^{-1}d\lambda\Big)^{-1} \quad (8.7.4)
\]
(cf. Taniguchi (1981)). In what follows, we propose shrinking the interpolator, motivated by James and Stein (1961). The usual pseudo interpolator is given by
\[
\hat X^{f}(0) = \int_{-\pi}^{\pi} h^{f}(\lambda)\,dZ(\lambda),
\]
where $Z(\lambda)$ is the spectral measure of $X(t)$. We introduce the shrinkage version of $\hat X^{f}(0)$,
\[
\tilde X^{f}(0) = \Big(1-\frac{c}{E[\|\hat X^{f}(0)\|^{2}]}\Big)\hat X^{f}(0), \quad (8.7.5)
\]
where $c$ is a constant which we call the shrinkage constant. Then we can evaluate the mean squared interpolation error of (8.7.5) as follows.
Theorem 8.7.1.
(i)
\[
E[\|X(0)-\tilde X^{f}(0)\|^{2}] = E[\|X(0)-\hat X^{f}(0)\|^{2}] + \frac{2c}{E[\|\hat X^{f}(0)\|^{2}]}B(f,g) + \frac{c^{2}}{E[\|\hat X^{f}(0)\|^{2}]}, \quad (8.7.6)
\]
where
\[
\begin{aligned}
B(f,g) &= \mathrm{tr}\int_{-\pi}^{\pi}\big(I_q-h^{f}(\lambda)\big)g(\lambda)h^{f}(\lambda)^{*}d\lambda \\
&= \mathrm{tr}\Bigg[\Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)^{-1}d\lambda\Big)^{-1}\Big(\int_{-\pi}^{\pi} f(\lambda)^{-1}g(\lambda)d\lambda\Big) \\
&\qquad - \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)^{-1}d\lambda\Big)^{-1}\Big(\int_{-\pi}^{\pi} f(\lambda)^{-1}g(\lambda)f(\lambda)^{-1}d\lambda\Big)\Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} f(\lambda)^{-1}d\lambda\Big)^{-1}\Bigg]. \quad (8.7.7)
\end{aligned}
\]
(ii) If $B(f,g) \ne 0$, the optimal shrinkage constant is $c = -B(f,g)$. Then
\[
E[\|X(0)-\tilde X^{f}(0)\|^{2}] = E[\|X(0)-\hat X^{f}(0)\|^{2}] - \frac{B(f,g)^{2}}{E[\|\hat X^{f}(0)\|^{2}]}, \quad (8.7.8)
\]
which implies that the shrinkage interpolator improves the usual pseudo interpolator in the sense of MSIE.
Proof. (i) For vectors $X$ and $Y$, define the inner product as $\langle X, Y\rangle = \mathrm{tr}(XY^{*})$. Then
\[
\begin{aligned}
E[\|X(0)-\tilde X^{f}(0)\|^{2}]
&= E\Big[\Big\|X(0)-\hat X^{f}(0)+\frac{c}{E[\|\hat X^{f}(0)\|^{2}]}\hat X^{f}(0)\Big\|^{2}\Big] \\
&= E[\|X(0)-\hat X^{f}(0)\|^{2}] + 2E\Big[\Big\langle X(0)-\hat X^{f}(0),\ \frac{c}{E[\|\hat X^{f}(0)\|^{2}]}\hat X^{f}(0)\Big\rangle\Big]
 + \frac{c^{2}}{(E[\|\hat X^{f}(0)\|^{2}])^{2}}E[\|\hat X^{f}(0)\|^{2}] \\
&= E[\|X(0)-\hat X^{f}(0)\|^{2}] + \frac{2c}{E[\|\hat X^{f}(0)\|^{2}]}E\Big[\mathrm{tr}\Big\{\Big(\int_{-\pi}^{\pi}\big(I_q-h^{f}(\lambda)\big)dZ(\lambda)\Big)\Big(\int_{-\pi}^{\pi} h^{f}(\lambda)dZ(\lambda)\Big)^{*}\Big\}\Big]
 + \frac{c^{2}}{E[\|\hat X^{f}(0)\|^{2}]} \\
&= (\text{RHS of } (8.7.6)).
\end{aligned}
\]
(ii) The RHS of (8.7.6) is a quadratic form in $c$. Thus the shrinkage constant $c$ which minimizes the RHS of (8.7.6) is $c = -B(f,g)$, and we get the result. $\square$
Remark 8.7.2. $B(f,g)$ can be regarded as a disparity between $f(\lambda)$ and $g(\lambda)$ (see Suto et al. (2016)).

Because the above shrinkage interpolator (8.7.5) includes the unknown value $E[\|\hat X^{f}(0)\|^{2}]$, we have to estimate $E[\|\hat X^{f}(0)\|^{2}]$. Since we can see that
\[
E[\|\hat X^{f}(0)\|^{2}] = E\Big[\mathrm{tr}\Big\{\Big(\int_{-\pi}^{\pi} h^{f}(\lambda)dZ(\lambda)\Big)\Big(\int_{-\pi}^{\pi} h^{f}(\lambda)dZ(\lambda)\Big)^{*}\Big\}\Big]
= \mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda,
\]
we propose an estimator of $E[\|\hat X^{f}(0)\|^{2}]$ as
\[
\hat E[\|\hat X^{f}(0)\|^{2}] = \mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)I_{n,X}(\lambda)h^{f}(\lambda)^{*}d\lambda, \quad (8.7.9)
\]
where $I_{n,X}(\lambda)$ is the periodogram defined by
\[
I_{n,X}(\lambda) = \frac{1}{2\pi n}\Big(\sum_{t=1}^{n} X(t)e^{it\lambda}\Big)\Big(\sum_{t=1}^{n} X(t)e^{it\lambda}\Big)^{*}.
\]
The goodness of the estimator (8.7.9) is seen as follows. For this, we need the following assumption.

Assumption 8.7.3. Let $\{X(t)\}$ be generated by
\[
X(t) = \sum_{j=0}^{\infty} G(j)e(t-j),
\]
where the $e(j)$'s are $q$-dimensional random vectors with $E(e(j)) = 0$, $E(e(t)e(s)') = \delta(t,s)K$, $\det K > 0$, and the $G(j)$'s satisfy $\sum_{j=0}^{\infty}\mathrm{tr}\,G(j)KG(j)' < \infty$. Suppose that the Hosoya and Taniguchi (1982) conditions (HT1)–(HT6) are satisfied (see Section 8.1).

Under this assumption, using Theorem 8.1.1, we can see that the estimator (8.7.9) is a $\sqrt{n}$-consistent estimator of $E[\|\hat X^{f}(0)\|^{2}]$. Then we can consider the sample version of (8.7.5),
\[
\tilde X^{f}_{S}(0) = \Big(1-\frac{c}{\hat E[\|\hat X^{f}(0)\|^{2}]}\Big)\hat X^{f}(0), \quad (8.7.10)
\]
and we can evaluate the MSIE as follows.

Theorem 8.7.4. Under Assumption 8.7.3, if $B(f,g) \ne 0$, the optimal shrinkage constant is $c = -B(f,g)$. Then
\[
E[\|X(0)-\tilde X^{f}_{S}(0)\|^{2}] = E[\|X(0)-\hat X^{f}(0)\|^{2}] - \frac{B(f,g)^{2}}{E[\|\hat X^{f}(0)\|^{2}]} + O\Big(\frac{1}{\sqrt{n}}\Big), \quad (8.7.11)
\]
which implies that the shrinkage interpolator improves the usual pseudo interpolator asymptotically in the sense of MSIE.
Proof. Let
\[
s_n = \sqrt{n}\big\{\hat E[\|\hat X^{f}(0)\|^{2}] - E[\|\hat X^{f}(0)\|^{2}]\big\}
= \sqrt{n}\Big[\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)I_{n,X}(\lambda)h^{f}(\lambda)^{*}d\lambda - \mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda\Big].
\]
Then, by Lemma A.2.2 of Hosoya and Taniguchi (1982), we can see
\[
s_n^{2} < \infty \quad \text{a.s.}
\]
From this, we obtain
\[
\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)I_{n,X}(\lambda)h^{f}(\lambda)^{*}d\lambda
= \mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda + \frac{d}{\sqrt{n}} \quad \text{a.s.}, \quad (8.7.12)
\]
where $d$ is a positive constant. Thus
\[
\begin{aligned}
\frac{1}{\hat E[\|\hat X^{f}(0)\|^{2}]}
&= \frac{1}{\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda + \frac{d}{\sqrt{n}}} \\
&= \frac{1}{\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda}\cdot
\frac{1}{1+\frac{d}{\sqrt{n}}\Big(\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda\Big)^{-1}} \\
&= \frac{1}{\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda}
\Big\{1-\frac{d}{\sqrt{n}}\Big(\mathrm{tr}\int_{-\pi}^{\pi} h^{f}(\lambda)g(\lambda)h^{f}(\lambda)^{*}d\lambda\Big)^{-1}+O\Big(\frac{1}{n}\Big)\Big\} \quad \text{a.s.} \quad (8.7.13)
\end{aligned}
\]
Substituting (8.7.13) into $\tilde X^{f}_{S}(0)$, we can evaluate the interpolation error of $\tilde X^{f}_{S}(0)$ as
\[
\begin{aligned}
E[\|X(0)-\tilde X^{f}_{S}(0)\|^{2}]
&= E\Big[\Big\|X(0)-\hat X^{f}(0)+\frac{c}{E[\|\hat X^{f}(0)\|^{2}]}\hat X^{f}(0)\Big\|^{2}\Big] + \frac{d^{2}}{n}\frac{c^{2}}{E[\|\hat X^{f}(0)\|^{2}]} \\
&\quad - \frac{2dc}{\sqrt{n}\,(E[\|\hat X^{f}(0)\|^{2}])^{2}}\,\mathrm{tr}\int_{-\pi}^{\pi} g(\lambda)h^{f}(\lambda)^{*}d\lambda
 + \frac{2dc}{\sqrt{n}\,E[\|\hat X^{f}(0)\|^{2}]}
 - \frac{2dc^{2}}{\sqrt{n}\,(E[\|\hat X^{f}(0)\|^{2}])^{2}} + O\Big(\frac{1}{n}\Big) \\
&= E[\|X(0)-\tilde X^{f}(0)\|^{2}] + O\Big(\frac{1}{\sqrt{n}}\Big),
\end{aligned}
\]
which completes the proof. $\square$
Here B(f , g) depends on the unknown true spectral density g(λ). To use the shrinkage interpola-
tor practically, we must estimate B(f , g). Because B(f , g) is written as (8.7.7), which is the integral
functional of g(λ), we propose
[
B(f , g) = B(f , In,X ) (8.7.14)

as an estimator of $B(f,g)$. Under Assumption 8.7.3, we can also see that the estimator (8.7.14) is a $\sqrt{n}$-consistent estimator due to Theorem 8.1.1. Therefore we can evaluate the MSIE of the practical shrinkage interpolator
$$\tilde{\tilde{X}}_S^f(0) = \left(1 - \frac{-\widehat{B(f,g)}}{\hat{E}[\|\hat{X}^f(0)\|^2]}\right)\hat{X}^f(0). \tag{8.7.15}$$
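Because $B(f,g)$ is the integral functional (8.7.7) of $g(\lambda)$, the plug-in estimator (8.7.14) simply substitutes the periodogram $I_{n,X}$ for the unknown $g$. A minimal scalar sketch of this substitution (the function names and the weight function $\varphi$ are illustrative, not from the text):

```python
import numpy as np

def periodogram(x, lam):
    """I_n(lambda) = |sum_t X(t) exp(-i t lambda)|^2 / (2 pi n)."""
    n = len(x)
    dft = np.sum(x * np.exp(-1j * np.arange(n) * lam))
    return np.abs(dft) ** 2 / (2 * np.pi * n)

def integral_functional(x, phi):
    """Estimate int_{-pi}^{pi} phi(lambda) g(lambda) dlambda by replacing g
    with the periodogram; by 2*pi-periodicity this equals a Riemann sum over
    the nonzero Fourier frequencies lambda_j = 2 pi j / n, j = 1, ..., n-1."""
    n = len(x)
    freqs = 2 * np.pi * np.arange(1, n) / n
    return (2 * np.pi / n) * sum(phi(lam) * periodogram(x, lam) for lam in freqs)
```

With $\varphi \equiv 1$ the estimate reduces, by Parseval's identity, to the sample variance $(1/n)\sum_t (X(t)-\bar{X})^2$, which gives a quick correctness check.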
Theorem 8.7.5. Under Assumption 8.7.3, the MSIE of the practical shrinkage estimator $\tilde{\tilde{X}}_S^f(0)$ is evaluated as
$$E[\|X(0) - \tilde{\tilde{X}}_S^f(0)\|^2] = E[\|X(0) - \hat{X}^f(0)\|^2] - \frac{B(f,g)^2}{E[\|\hat{X}^f(0)\|^2]} + O\!\left(\frac{1}{\sqrt{n}}\right), \tag{8.7.16}$$
which implies that the shrinkage interpolator improves the usual pseudo interpolator asymptotically
in the sense of MSIE.
Proof. Similarly to the previous proof, we can see
$$\widehat{B(f,g)} = B(f,g) + \frac{\tilde{d}}{\sqrt{n}}, \quad (\tilde{d} > 0), \quad \text{a.s.}$$
Then we can evaluate the interpolation error of the practical shrinkage interpolator $\tilde{\tilde{X}}_S^f(0)$ as (8.7.16). 
Remark 8.7.6. For the problem of prediction, we can construct a shrinkage predictor as in (8.7.15),
and show the result as in Theorem 8.7.5 (see Hamada and Taniguchi (2014)).
In what follows, we see that the pseudo shrinkage interpolator is useful under the misspecifica-
tion of spectral density. Assume the scalar-valued data {X(t)} are generated by the AR(1) model

X(t) = 0.3X(t − 1) + ε(t), ε(t) ∼ i.i.d. t(10), (8.7.17)

where t(10) is Student's t-distribution with 10 degrees of freedom. The true spectral density is
$$g(\lambda) = \frac{1.25}{2\pi}\,\frac{1}{|1 - 0.3 e^{i\lambda}|^2},$$
where $1.25 = 10/(10-2)$ is the variance of the $t(10)$ innovation.
Here we consider fitting the pseudo parametric spectral density
$$f(\lambda, \theta) = \frac{1.25}{2\pi}\,\frac{1}{|1 - (0.3 + \theta)e^{i\lambda} + 0.3\theta e^{2i\lambda}|^2}.$$
Since $1 - (0.3+\theta)z + 0.3\theta z^2 = (1 - 0.3z)(1 - \theta z)$, we have $f(\lambda, 0) = g(\lambda)$.
Figure 8.4 shows the behaviours of g(λ) and f (λ, θ), (θ = 0.1, 0.3, 0.9).
Figure 8.4: Plots of $g(\lambda)$ and $f(\lambda, \theta)$, $(\theta = 0.1, 0.3, 0.9)$.
Figure 8.5: Misspecified interpolation error with $\theta$.
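The disparity seen in Figure 8.4 is easy to check numerically. A small sketch evaluating $g(\lambda)$ and $f(\lambda, \theta)$ on a grid (the grid size is ours; everything else follows the formulas above):

```python
import numpy as np

def g(lam):
    # true AR(1) spectral density; 1.25 is the variance of the t(10) innovation
    return 1.25 / (2 * np.pi) / np.abs(1 - 0.3 * np.exp(1j * lam)) ** 2

def f(lam, theta):
    # pseudo (possibly misspecified) parametric spectral density
    return 1.25 / (2 * np.pi) / np.abs(
        1 - (0.3 + theta) * np.exp(1j * lam) + 0.3 * theta * np.exp(2j * lam)
    ) ** 2

lams = np.linspace(-np.pi, np.pi, 512)
for theta in (0.1, 0.3, 0.9):
    disparity = np.max(np.abs(f(lams, theta) - g(lams)))
    print(f"theta={theta}: max |f - g| = {disparity:.3f}")
```

Since $f(\lambda, \theta) = g(\lambda)/|1 - \theta e^{i\lambda}|^2$, the disparity concentrates near $\lambda = 0$ and grows rapidly as $\theta$ approaches 1.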
From Figure 8.4, when |θ| becomes close to 1, the disparity between f(λ, θ) and g(λ) becomes large. Figure 8.5 also shows that the misspecified interpolation error becomes large when |θ| is near 1. Since θ indicates the disparity between f(λ, θ) and g(λ), we call θ the disparity parameter. As we saw, the misspecification of the spectral density is troublesome; thus we should note its influence.
In this situation, we see that the practical shrinkage interpolator defined by (8.7.15) is more useful than the usual interpolator. Furthermore, we observe that the shrinkage interpolator is effective when the misspecification of the spectral density is large.
First, we generate the data {X(−n), . . . , X(−1), X(0), X(1), . . . , X(n)} from the model (8.7.17). Next, by using {X(−n), . . . , X(−1), X(1), . . . , X(n)} without X(0) and the pseudo spectral density f(λ, θ), we calculate the usual interpolator $\hat{X}^f(0)$. Then, we calculate the shrinkage interpolator $\tilde{\tilde{X}}_S^f(0)$ defined by (8.7.15). We make 500 replications, and plot the values of the sample MSIE such that
$$\widehat{\mathrm{MSUIE}} = \frac{1}{500}\sum_{i=1}^{500}\big(X_i(0) - \hat{X}_i^f(0)\big)^2$$
and
$$\widehat{\mathrm{MSSIE}} = \frac{1}{500}\sum_{i=1}^{500}\big(X_i(0) - \tilde{\tilde{X}}_{S,i}^f(0)\big)^2$$
with θ, where the subscript i indexes the replications. Figure 8.6 shows that the shrinkage interpolator usually improves the usual interpolator in the sense of MSIE, and that when the misspecification of the spectral density is large, the improvement by the shrinkage interpolator becomes large. We can say that the shrinkage interpolator is useful under misspecification of the spectrum.
Figure 8.6: Plots of $\widehat{\mathrm{MSUIE}}$ and $\widehat{\mathrm{MSSIE}}$.
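The replication experiment above can be sketched for the misspecified-error part (Figure 8.5). The closed-form interpolator below uses the classical fact that under an AR(2) pseudo spectrum the Fourier coefficients of $1/f$ vanish beyond lag 2, so the pseudo interpolator $\hat{X}^f(0)$ involves only $X(\pm 1), X(\pm 2)$; this formula is our own derivation, and the shrinkage step is omitted because $\widehat{B(f,g)}$ depends on the functional (8.7.7) not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(a=0.3, burn=300):
    """Generate (X(-2), ..., X(2)) from X(t) = 0.3 X(t-1) + eps(t),
    eps(t) ~ i.i.d. t(10), as in model (8.7.17)."""
    eps = rng.standard_t(df=10, size=burn + 5)
    x = np.zeros(burn + 5)
    for t in range(1, burn + 5):
        x[t] = a * x[t - 1] + eps[t]
    return x[-5:]

def pseudo_interpolator(x, theta):
    """Linear interpolator of X(0) under the pseudo AR(2) spectral density
    f(lambda, theta): the coefficients are ratios of Fourier coefficients of
    1/f, which vanish beyond lag 2 (our derivation, for illustration)."""
    xm2, xm1, xp1, xp2 = x[0], x[1], x[3], x[4]
    phi1, phi2 = 0.3 + theta, -0.3 * theta
    denom = 1.0 + phi1 ** 2 + phi2 ** 2
    return (phi1 * (1 - phi2) * (xm1 + xp1) + phi2 * (xm2 + xp2)) / denom

data = [simulate_ar1() for _ in range(500)]  # 500 replications
for theta in (0.0, 0.5, 0.9):
    msie = np.mean([(x[2] - pseudo_interpolator(x, theta)) ** 2 for x in data])
    print(f"theta={theta}: sample MSIE = {msie:.3f}")
```

At $\theta = 0$ the pseudo spectrum coincides with $g$, so the sample MSIE is smallest there and increases with the disparity parameter $\theta$, as in Figure 8.5.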
Problems
8.1. Show the statement of Remark 8.3.2.
8.2. Show the statement of Remark 8.3.4.
8.3. Verify (8.3.10).
8.4. Derive GAIC(p) in (8.4.20).
8.5. Verify (8.4.23).
8.6. Show GAIC(p) for CHARN given by (8.4.25).
8.7. Show that (8.5.4) does not satisfy (8.5.3).
8.8. Show that (8.5.5) does not satisfy (8.5.3).
8.9. Show the formula (8.6.32).
8.10. Generate X(1), ..., X(100) from N3(0, I3), and repeat this procedure 1000 times. Calculating X̂n and θ̂n in (8.6.1) 1000 times, compare their empirical MSEs.
8.11. Generate X(1), ..., X(100) from a Gaussian stationary process with the spectral density matrix given in (i) of Example 8.6.2 for the case when η1 = 0.1, η2 = 0.5 and η3 = 0.8. Then, repeat this procedure 1000 times. Calculating X̄n and θ̂n in (8.6.1), compare their empirical MSEs.
8.12. Verify the formula (8.6.39).

Bibliography

C. Acerbi. Coherent representations of subjective risk-aversion. In G. Szegö (ed.), Risk Measures for the 21st Century. John Wiley & Sons Inc., New York, 2004.
Carlo Acerbi and Dirk Tasche. On the coherence of expected shortfall. (cond-mat/0104295), April
2001.
C. J. Adcock and K. Shutes. An analysis of skewness and skewness persistence in three emerging
markets. Emerging Markets Review, 6(4):396–418, 2005.
Yacine Aı̈t-Sahalia. Disentangling diffusion from jumps. Journal of Financial Economics, 74(3):
487–528, 2004.
Yacine Aı̈t-Sahalia and Michael W. Brandt. Variable selection for portfolio choice. The Journal of
Finance, 56(4):1297–1351, 2001.
Yacine Aı̈t-Sahalia and Lars Hansen, editors. Handbook of Financial Econometrics, Volume 1:
Tools and Techniques (Handbooks in Finance). North Holland, 2009.
Yacine Aı̈t-Sahalia, Julio Cacho-Diaz, and T.R. Hurd. Portfolio choice with jumps: A closed-form
solution. Ann. Appl. Probab., 19(2):556–584, 2009.
Yacine Aı̈t-Sahalia, Per A. Mykland, and Lan Zhang. Ultra high frequency volatility estimation
with dependent microstructure noise. J. Econometrics, 160(1):160–175, 2011.
H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N.
Petrov and F. Csadki, (eds.) 2nd International Symposium on Information Theory, pages 267–
281. Budapest: Akademiai Kiado, 1973.
Hirotugu Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Con-
trol, AC-19:716–723, 1974. System identification and time-series analysis.
Hirotugu Akaike. On entropy maximization principle. In Applications of Statistics (Proc. Sympos.,
Wright State Univ., Dayton, Ohio, 1976), pages 27–41. North-Holland, Amsterdam, 1977.
Hirotugu Akaike. A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math.,
30(1):9–14, 1978.
Gordon J. Alexander and Alexandre M. Baptista. Economic implications of using a mean-var
model for portfolio selection: A comparison with mean-variance analysis. Journal of Economic
Dynamics and Control, 26(7):1159–1193, 2002.
M. Allais. Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'école américaine. Econometrica, 21:503–546, 1953.
Michael Allen and Somnath Datta. A note on bootstrapping M-estimators in ARMA models. J.
Time Ser. Anal., 20(4):365–379, 1999.
T. W. Anderson. The Statistical Analysis of Time Series. John Wiley & Sons, Inc., New York,
1971.
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability
and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons Inc.,
New York, second edition, 1984.
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability
and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third edition, 2003.
Fred C. Andrews. Asymptotic behavior of some rank tests for analysis of variance. Ann. Math.
Statistics, 25:724–736, 1954.
Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk.
Mathematical Finance, 9(3):203–228, 1999.
Alexander Aue, Siegfried Hörmann, Lajos Horváth, Marie Hušková, and Josef G. Steinebach.
Sequential testing for the stability of high-frequency portfolio betas. Econometric Theory, 28
(4):804–837, 2012.
Zhidong Bai, Huixia Liu, and Wing-Keung Wong. Enhancement of the applicability of
Markowitz’s portfolio optimization by utilizing random matrix theory. J. Math. Finance, 19
(4):639–667, 2009.
T. A. Bancroft. Analysis and inference for incompletely specified models involving the use of
preliminary test(s) of significance. Biometrics, 20:427–442, 1964.
Gopal Basak, Ravi Jagannathan, and Guoqiang Sun. A direct test for the mean variance efficiency
of a portfolio. J. Econom. Dynam. Control, 26(7-8):1195–1215, 2002.
Ishwar V. Basawa and David John Scott. Asymptotic Optimal Inference for Nonergodic Models,
volume 17 of Lecture Notes in Statistics. Springer-Verlag, New York, 1983.
G. W. Bassett Jr., R. Koenker, and G. Kordas. Pessimistic portfolio allocation and Choquet ex-
pected utility. Journal of Financial Econometrics, 2(4):477–492, 2004.
Richard Bellman. Introduction to Matrix Analysis. McGraw-Hill Book Co., Inc., New York, 1960.
Richard Bellman. Dynamic Programming (Princeton Landmarks in Mathematics). Princeton Univ
Pr, 2010.
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and pow-
erful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Method-
ological), pages 289–300, 1995.
Jan Beran. Statistics for Long-Memory Processes, volume 61 of Monographs on Statistics and
Applied Probability. Chapman & Hall, New York, 1994.
P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and Adaptive Estimation for
Semiparametric Models. Johns Hopkins series in the mathematical sciences. Springer, New
York, 1998.
F. Black and R. Litterman. Global portfolio optimization. Financial Analysts Journal, 48(5):
28–43, 1992.
Avrim Blum and Adam Kalai. Universal portfolios with and without transaction costs. Machine
Learning, 35(3):193–205, 1999.
Taras Bodnar, Nestor Parolya, and Wolfgang Schmid. Estimation of the global minimum variance
portfolio in high dimensions. arXiv preprint arXiv:1406.0437, 2014.
Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. J. Econometrics, 31(3):
307–327, 1986.
Tim Bollerslev, Robert F. Engle, and Jeffrey M. Wooldridge. A capital asset pricing model with
time-varying covariances. Journal of Political Economy, 96(1):116–131, 1988.
B. M. Bolstad, F. Collin, J. Brettschneider, K. Simpson, L. Cope, R. A. Irizarry, and T. P. Speed.
Quality assessment of affymetrix genechip data. In Bioinformatics and Computational Biology
Solutions Using R and Bioconductor, pages 33–47. Springer, 2005.
Arup Bose. Edgeworth correction by bootstrap in autoregressions. Ann. Statist., 16(4):1709–1722,
1988.
Michael W. Brandt, Amit Goyal, Pedro Santa-Clara, and Jonathan R. Stroud. A simulation ap-
proach to dynamic portfolio choice with an application to learning about return predictability.
Review of Financial Studies, 18(3):831–873, 2005.
Douglas T. Breeden. An intertemporal asset pricing model with stochastic consumption and in-
vestment opportunities. Journal of Financial Economics, 7(3):265–296, 1979.
David R. Brillinger. Time Series: Data Analysis and Theory, volume 36. SIAM, 2001a.
David R. Brillinger. Time Series, volume 36 of Classics in Applied Mathematics. Society for
Industrial and Applied Mathematics (SIAM), Philadelphia, 2001b. Data analysis and theory,
Reprint of the 1981 edition.
D. R. Brillinger. Time Series: Data Analysis and Theory. Classics in Applied Mathematics. Society
for Industrial and Applied Mathematics, 2001c.
Peter J. Brockwell and Richard A. Davis. Time Series: Theory and Methods. Springer Series in
Statistics. Springer, New York, 2006. Reprint of the second (1991) edition.
B. M. Brown. Martingale central limit theorems. Ann. Math. Statist., 42:59–66, 1971.
Peter Bühlmann. Sieve bootstrap for time series. Bernoulli, 3(2):123–148, 1997.
Anna Campain and Yee Hwa Yang. Comparison study of microarray meta-analysis methods. BMC
Bioinformatics, 11(1):1, 2010.
John Y. Campbell, Andrew W. Lo, and Archie Craig Mackinlay. The Econometrics of Financial
Markets. Princeton Univ Pr, 1996.
Gary Chamberlain. Funds, factors, and diversification in arbitrage pricing models. Econometrica,
51(5):1305–1323, 1983.
Gary Chamberlain and Michael Rothschild. Arbitrage, factor structure, and mean-variance analysis
on large asset markets. Econometrica, 51(5):1281–1304, 1983.
S. Ajay Chandra and Masanobu Taniguchi. Asymptotics of rank order statistics for arch residual
empirical processes. Stochastic Processes and Their Applications, 104(2):301–324, 2003.
C.W.S. Chen, Yi-Tung Hsu, and M. Taniguchi. Discriminant analysis for quantile regression.
Submitted for publication, 2016.
Min Chen and Hong Zhi An. A note on the stationarity and the existence of moments of the garch
model. Statistica Sinica, pages 505–510, 1998.
Herman Chernoff and I. Richard Savage. Asymptotic normality and efficiency of certain nonpara-
metric test statistics. Ann. Math. Statist., 29:972–994, 1958.
S. H. Chew and K. R. MacCrimmmon. Alpha-nu choice theory: A generalization of expected
utility theory. Working Paper No. 669, University of British Columbia, Vancouver, 1979.
Koei Chin, Sandy DeVries, Jane Fridlyand, Paul T. Spellman, Ritu Roydasgupta, Wen-Lin Kuo,
Anna Lapuk, Richard M. Neve, Zuwei Qian, and Tom Ryder. Genomic and transcriptional
aberrations linked to breast cancer pathophysiologies. Cancer Cell, 10(6):529–541, 2006.
Ondřej Chochola, Marie Hušková, Zuzana Prášková, and Josef G. Steinebach. Robust monitoring
of CAPM portfolio betas. J. Multivariate Anal., 115:374–395, 2013.
Ondřej Chochola, Marie Hušková, Zuzana Prášková, and Josef G. Steinebach. Robust monitoring
of CAPM portfolio betas II. J. Multivariate Anal., 132:58–81, 2014.
G. Choquet. Théorie des capacitiés. Annals Institut Fourier, 5:131–295, 1953.
Gerda Claeskens and Nils Lid Hjort. The focused information criterion. Journal of the American
Statistical Association, 98(464):900–916, 2003.
G. Connor. The three types of factor models: A comparison of their explanatory power. Financial
Analysts Journal, pages 42–46, 1995.
Thomas M. Cover. Universal portfolios. Math. Finance, 1(1):1–29, 1991.
Thomas M. Cover and Erik Ordentlich. Universal portfolios with side information. IEEE Trans.
Inform. Theory, 42(2):348–363, 1996.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience [John
Wiley & Sons], Hoboken, NJ, second edition, 2006.
Jaksa Cvitanic, Ali Lazrak, and Tan Wang. Implications of the Sharpe ratio as a performance
measure in multi-period settings. Journal of Economic Dynamics and Control, 32(5):1622–
1649, May 2008.
R. Dahlhaus. On the Kullback-Leibler information divergence of locally stationary processes.
Stochastic Process. Appl., 62(1):139–168, 1996a.
R. Dahlhaus. Maximum likelihood estimation and model selection for locally stationary processes.
J. Nonparametr. Statist., 6(2-3):171–191, 1996b.
R. Dahlhaus. Asymptotic Statistical Inference for Nonstationary Processes with Evolutionary Spectra, volume 115 of Lecture Notes in Statistics. Springer, New York, 1996c.
Somnath Datta. Limit theory and bootstrap for explosive and partially explosive autoregression.
Stochastic Process. Appl., 57(2):285–304, 1995.
Somnath Datta. On asymptotic properties of bootstrap for AR(1) processes. J. Statist. Plann.
Inference, 53(3):361–374, 1996.
Giovanni De Luca, Marc G. Genton, and Nicola Loperfido. A multivariate skew-GARCH model.
In Econometric Analysis of Financial and Economic Time Series. Part A, volume 20 of Adv.
Econom., pages 33–57. Emerald/JAI, Bingley, 2008.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the
EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1):1–38, 1977. With discussion.
M. A. H. Dempster, Gautam Mitra, and Georg Pflug, editors. Quantitative Fund Management.
Chapman & Hall/CRC Financial Mathematics Series. CRC Press, Boca Raton, FL, 2009.
Reprinted from Quant. Finance 7 (2007), no. 1, no. 2 and no. 4.
Christine Desmedt, Fanny Piette, Sherene Loi, Yixin Wang, Françoise Lallemand, Ben-
jamin Haibe-Kains, Giuseppe Viale, Mauro Delorenzi, Yi Zhang, and Mahasti Saghatchian
d’Assignies. Strong time dependence of the 76-gene prognostic signature for node-negative
breast cancer patients in the transbig multicenter independent validation series. Clinical Can-
cer Research, 13(11):3207–3214, 2007.
Maximilian Diehn, Gavin Sherlock, Gail Binkley, Heng Jin, John C Matese, Tina Hernandez-
Boussard, Christian A. Rees, J. Michael Cherry, David Botstein, and Patrick O. Brown. Source:
A unified genomic resource of functional annotations, ontologies, and gene expression data.
Nucleic Acids Research, 31(1):219–223, 2003.
Feike C. Drost, Chris A. J. Klaassen, and Bas J. M. Werker. Adaptive estimation in time-series
models. Ann. Statist., 25(2):786–817, 04 1997.
W. Dunsmuir. A central limit theorem for parameter estimation in stationary vector time series and
its application to models for a signal observed with noise. Ann. Statist., 7(3):490–506, 1979.
W. Dunsmuir and E. J. Hannan. Vector linear time series models. Advances in Appl. Probability,
8(2):339–364, 1976.
Eran Eden, Roy Navon, Israel Steinfeld, Doron Lipson, and Zohar Yakhini. GOrilla: A tool for
discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics, 10
(1):1, 2009.
Ron Edgar, Michael Domrachev, and Alex E. Lash. Gene expression omnibus: Ncbi gene expres-
sion and hybridization array data repository. Nucleic Acids Research, 30(1):207–210, 2002.
B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1–26, 1979.
Noureddine El Karoui. High-dimensionality effects in the Markowitz problem and other quadratic
programs with linear constraints: Risk underestimation. Ann. Statist., 38(6):3487–3566, 2010.
E. J. Elton, M. J. Gruber, S. J. Brown, and W. N. Goetzmann. Modern Portfolio Theory and Investment Analysis. John Wiley & Sons Inc., New York, 7th edition, 2007.
Robert F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica: Journal of the Econometric Society, pages 987–1007,
1982.
Robert F. Engle. Multivariate ARCH with Factor Structures: Cointegration in Variance. University of California Press, 1987.
Robert F. Engle and Kenneth F. Kroner. Multivariate simultaneous generalized arch. Econometric
Theory, 11(1):122–150, 1995.
E. F. Fama and K. R. French. The cross-section of expected stock returns. The Journal of Finance,
47(2):427–465, 1992.
Eugene F. Fama. Multiperiod consumption-investment decisions. American Economic Review, 60
(1):163–74, March 1970.
Jianqing Fan, Mingjin Wang, and Qiwei Yao. Modelling multivariate volatilities via condition-
ally uncorrelated components. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), 70(4):679–702, 2008.
Hsing Fang and Tsong-Yue Lai. Co-kurtosis and capital asset pricing. The Financial Review, 32
(2):293, 1997.
Peter C. Fishburn. Mean-risk analysis with risk associated with below-target returns. American
Economic Review, 67(2):116–26, March 1977.
Daniéle Florens-Zmirou. Approximate discrete-time schemes for statistics of diffusion processes.
Statistics, 20(4):547–557, 1989.
Mario Forni, Marc Hallin, Marco Lippi, and Lucrezia Reichlin. The generalized dynamic-factor
model: Identification and estimation. Review of Economics and Statistics, 82(4):540–554, 2000.
Mario Forni, Marc Hallin, Marco Lippi, and Lucrezia Reichlin. The generalized dynamic fac-
tor model: One-sided estimation and forecasting. J. Amer. Statist. Assoc., 100(471):830–840,
2005.
Gabriel Frahm. Linear statistical inference for global and local minimum variance portfolios.
Statistical Papers, 51(4):789–812, 2010.
P. A. Frost and J. E. Savarino. An empirical Bayes approach to efficient portfolio selection. Journal
of Financial and Quantitative Analysis, 21(3):293–305, 1986.
Alexei A. Gaivoronski and Fabio Stella. Stochastic nonstationary optimization for finding univer-
sal portfolios. Ann. Oper. Res., 100:165–188 (2001), 2000. Research in stochastic programming
(Vancouver, BC, 1998).
Bernard Garel and Marc Hallin. Local asymptotic normality of multivariate ARMA processes with
a linear trend. Ann. Inst. Statist. Math., 47(3):551–579, 1995.
Joseph L. Gastwirth and Stephen S. Wolff. An elementary method for obtaining lower bounds on
the asymptotic power of rank tests. Ann. Math. Statist., 39:2128–2130, 1968.
John Geweke. Measurement of linear dependence and feedback between multiple time series.
Journal of the American Statistical Association, 77(378):304–313, 1982.
R. Giacometti and S. Lozza. Risk measures for asset allocation models, ed. Szegö. Risk Measures
for the 21st Century. John Wiley & Sons Inc., New York, 2004.
Liudas Giraitis, Piotr Kokoszka, and Remigijus Leipus. Stationary ARCH models: Dependence
structure and central limit theorem. Econometric Theory, 16(1):3–22, 2000.
Kimberly Glass and Michelle Girvan. Annotation enrichment analysis: An alternative method for
evaluating the functional properties of gene sets. arXiv preprint arXiv:1208.4127, 2012.
Konstantin Glombek. High-Dimensionality in Statistics and Portfolio Optimization. BoD – Books
on Demand, 2012.
Sı́lvia Gonçalves and Lutz Kilian. Bootstrapping autoregressions with conditional heteroskedas-
ticity of unknown form. J. Econometrics, 123(1):89–120, 2004.
Christian Gouriéroux and Joann Jasiak. Financial Econometrics: Problems, Models, and Methods
(Princeton Series in Finance). Princeton Univ Pr, illustrated edition, 2001.
Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral meth-
ods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
Ulf Grenander and Murray Rosenblatt. Statistical Analysis of Stationary Time Series. John Wiley
& Sons, New York, 1957.
Christian M. Hafner. Nonlinear Time Series Analysis with Applications to Foreign Exchange Rate
Volatility. Contributions to Economics. Physica-Verlag, Heidelberg, 1998.
Valérie Haggan and Tohru Ozaki. Modelling nonlinear random vibrations using an amplitude-
dependent autoregressive time series model. Biometrika, 68(1):189–196, 1981.
Jaroslav Hájek. Some extensions of the Wald-Wolfowitz-Noether theorem. Ann. Math. Statist., 32:506–523, 1961.
Jaroslav Hájek. Asymptotically most powerful rank-order tests. Ann. Math. Statist., 33:1124–1147,
1962.
Jaroslav Hájek and Zbyněk Šidák. Theory of Rank Tests. Academic Press, New York, 1967.
P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press Inc. [Har-
court Brace Jovanovich Publishers], New York, 1980. Probability and Mathematical Statistics.
Peter Hall. The Bootstrap and Edgeworth Expansion. Springer, New York, pages xiv+352, 1992.
Marc Hallin and Davy Paindaveine. Optimal tests for multivariate location based on interdirections
and pseudo-Mahalanobis ranks. Ann. Statist., 30(4):1103–1133, 2002.
Marc Hallin and Davy Paindaveine. Affine-invariant aligned rank tests for the multivariate general linear model with VARMA errors. Journal of Multivariate Analysis, 93(1):122–163, 2005.
Marc Hallin and Davy Paindaveine. Semiparametrically efficient rank-based inference for shape.
I. Optimal rank-based tests for sphericity. Ann Statistics, 34(6):2707–2756, 2006.
Marc Hallin and Madan L. Puri. Optimal rank-based procedures for time series analysis: Testing an ARMA model against other ARMA models. Ann. Statist., 16(1):402–432, 1988.
Marc Hallin and Madan L. Puri. Rank tests for time series analysis: A survey. In New Directions in Time Series Analysis, Part I, volume 45 of IMA Vol. Math. Appl., pages 111–153. Springer, New York, 1992.
Marc Hallin and Bas J. M. Werker. Semi-parametric efficiency, distribution-freeness and invari-
ance. Bernoulli, 9(1):137–165, 2003.
Marc Hallin, Jean-François Ingenbleek, and Madan L. Puri. Linear serial rank tests for randomness against ARMA alternatives. Ann. Statist., 13(3):1156–1181, 1985a.
Marc Hallin, Masanobu Taniguchi, Abdeslam Serroukh, and Kokyo Choy. Local asymptotic nor-
mality for regression models with long-memory disturbance. Ann. Statist., 27(6):2054–2080,
1999.
Kenta Hamada and Masanobu Taniguchi. Statistical portfolio estimation for non-Gaussian return
processes. Advances in Science, Technology and Environmentology, B10:69–83, 2014.
Kenta Hamada, Dong Wei Ye, and Masanobu Taniguchi. Statistical portfolio estimation under the
utility function depending on exogenous variables. Adv. Decis. Sci., pages Art. ID 127571, 15,
2012.
E. J. Hannan. Multiple Time Series. John Wiley and Sons, Inc., New York, 1970.
E. J. Hannan and B. G. Quinn. The determination of the order of an autoregression. J. Roy. Statist.
Soc. Ser. B, 41(2):190–195, 1979.
W. Härdle and T. Gasser. Robust non-parametric function fitting. Journal of the Royal Statistical
Society. Series B (Methodological), 46(1):42–51, 1984.
W. Härdle, A. Tsybakov, and L. Yang. Nonparametric vector autoregression. J. Statist. Plann.
Inference, 68(2):221–245, 1998.
Campbell R. Harvey and Akhtar Siddique. Conditional skewness in asset pricing tests. Journal of
Finance, pages 1263–1295, 2000.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
Springer Series in Statistics. Springer, New York, second edition, 2009. Data mining, inference,
and prediction.
Edwin Hewitt and Karl Stromberg. Real and Abstract Analysis. A Modern Treatment of the Theory
of Functions of a Real Variable. Springer-Verlag, New York, 1965.
C. C. Heyde. Fixed sample and asymptotic optimality for classes of estimating functions. Contem-
porary Mathematics. 80:241–247, 1988.
Christopher C. Heyde. Quasi-Likelihood and Its Application. Springer Series in Statistics.
Springer-Verlag, New York, 1997. A general approach to optimal parameter estimation.
Junichi Hirukawa. Cluster analysis for non-Gaussian locally stationary processes. Int. J. Theor.
Appl. Finance, 9(1):113–132, 2006.
Junichi Hirukawa and Masanobu Taniguchi. LAN theorem for non-Gaussian locally stationary
processes and its applications. J. Statist. Plann. Inference, 136(3):640–688, 2006.
Junichi Hirukawa, Hiroyuki Taniai, Marc Hallin, and Masanobu Taniguchi. Rank-based inference
for multivariate nonlinear and long-memory time series models. Journal of the Japan Statistical
Society, 40(1):167–187, 2010.
J. L. Hodges and Erich Leo Lehmann. The efficiency of some nonparametric competitors of the t-test. Ann. Math. Statist., 27:324–335, 1956.
M. Honda, H. Kato, S. Ohara, A. Ikeda, Y. Inoue, T. Mihara, K. Baba, and H. Shibasaki. Stochastic
time-series analysis of intercortical functional coupling during sustained muscle contraction.
31st Annual Meeting Society for Neuroscience Abstracts, San Diego, Nov 10–15, 2001.
Lajos Horváth, Piotr Kokoszka, and Gilles Teyssiere. Empirical process of the squared residuals
of an arch sequence. Annals of Statistics, pages 445–469, 2001.
Yuzo Hosoya. The decomposition and measurement of the interdependency between second-order
stationary processes. Probability Theory and Related Fields, 88(4):429–444, 1991.
Yuzo Hosoya and Masanobu Taniguchi. A central limit theorem for stationary processes and the
parameter estimation of linear processes. Ann. Statist., 10(1):132–153, 1982.
Soosung Hwang and Stephen E. Satchell. Modelling emerging market risk premia using higher
moments. Return Distributions in Finance, page 75, 1999.
I. A. Ibragimov and V. N. Solev. A certain condition for the regularity of Gaussian stationary sequence. Zap. Naučn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI), 12:113–125, 1969.
I.A. Ibragimov. A central limit theorem for a class of dependent random variables. Theory of
Probability & Its Applications, 8(1):83–89, 1963.
Ilʹdar Abdulovich Ibragimov and Y. A. Rozanov. Gaussian Random Processes, volume 9 of Applications of Mathematics. Springer-Verlag, New York, 1978.
Rafael A. Irizarry, Benjamin M. Bolstad, Francois Collin, Leslie M. Cope, Bridget Hobbs, and
Terence P. Speed. Summaries of affymetrix genechip probe level data. Nucleic Acids Research,
31(4):e15–e15, 2003.
Jean Jacod and Albert N. Shiryaev. Limit Theorems for Stochastic Processes, volume 288.
Springer-Verlag, Berlin, 1987.
W. James and Charles Stein. Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math.
Statist. and Prob., Vol. I, pages 361–379, Berkeley, Calif., 1961. Univ. California Press.
Harold Jeffreys. Theory of Probability. Third edition. Clarendon Press, Oxford, 1961.
P. Jeganathan. Some aspects of asymptotic theory with applications to time series models. Econo-
metric Theory, 11(5):818–887, 1995. Trending multiple time series (New Haven, CT, 1993).
Kiho Jeong, Wolfgang K. Härdle, and Song Song. A consistent nonparametric test for causality in
quantile. Econometric Theory, 28(04):861–887, 2012.
J. David Jobson and Bob Korkie. Estimation for Markowitz efficient portfolios. Journal of the
American Statistical Association, 75(371):544–554, 1980.
J. D. Jobson and R. M. Korkie. Putting Markowitz theory to work. The Journal of Portfolio
Management, 7(4):70–74, 1981.
J. D. Jobson, B. Korkie, and V. Ratti. Improved estimation for Markowitz portfolios using James-
Stein type estimators. 41:279–284, 1979.
Harry Joe. Dependence Modeling with Copulas, volume 134 of Monographs on Statistics and
Applied Probability. CRC Press, Boca Raton, FL, 2015.
John Geweke. The dynamic factor analysis of economic time series. In Latent Variables in Socio-Economic Models, 1977.
P. Jorion. Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative
Analysis, 21(3):279–292, 1986.
Yoshihide Kakizawa. Note on the asymptotic efficiency of sample covariances in Gaussian vector
stationary processes. J. Time Ser. Anal., 20(5):551–558, 1999.
Iakovos Kakouris and Berç Rustem. Robust portfolio optimization with copulas. European J.
Oper. Res., 235(1):28–37, 2014.
Shinji Kataoka. A stochastic programming model. Econometrica, 31:181–196, 1963.
H. Kato, M. Taniguchi, and M. Honda. Statistical analysis for multiplicatively modulated nonlinear
autoregressive model and its applications to electrophysiological signal analysis in humans.
Signal Processing, IEEE Transactions on, 54(9):3414–3425, 2006.
Daniel MacRae Keenan. Limiting behavior of functionals of higher-order sample cumulant spec-
tra. Ann. Statist., 15(1):134–151, 1987.
J. L. Kelly. A new interpretation of information rate. Bell. System Tech. J., 35:917–926, 1956.
Mathieu Kessler. Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist.,
24(2):211–229, 1997.
Keith Knight. Limiting distributions for L1 regression estimators under general conditions. The
Annals of Statistics, 26(2):755–770, 1998.
I. Kobayashi, T. Yamashita, and M. Taniguchi. Portfolio estimation under causal variables. Special
Issue on the Financial and Pension Mathematical Science. ASTE, B9:85–100, 2013.
Roger Koenker. Quantile Regression (Econometric Society Monographs). Cambridge University
Press, 2005.
Roger Koenker and Gilbert Bassett, Jr. Regression quantiles. Econometrica, 46(1):33–50, 1978.
Roger Koenker and Zhijie Xiao. Quantile autoregression. J. Amer. Statist. Assoc., 101(475):980–
990, 2006.
Hiroshi Konno and Hiroaki Yamazaki. Mean-absolute deviation portfolio optimization model and
its applications to Tokyo stock market. Management Science, 37(5):519–531, May 1991.
Amnon Koren, Itay Tirosh, and Naama Barkai. Autocorrelation analysis reveals widespread spatial
biases in microarray experiments. BMC Genomics, 8(1):164, 2007.
Hira L. Koul and A. K. Md. E. Saleh. R-estimation of the parameters of autoregressive [AR(p)]
models. Ann. Statist., 21(1):534–551, 1993.
Alan Kraus and Robert H. Litzenberger. Skewness preference and the valuation of risk assets. The Journal of Finance, 31(4):1085–1100, 1976.
Jens-Peter Kreiss. On adaptive estimation in stationary ARMA processes. Ann. Statist., 15(1):
112–133, 1987.
Jens-Peter Kreiss. Local asymptotic normality for autoregression with infinite order. J. Statist.
Plann. Inference, 26(2):185–219, 1990.
Jens-Peter Kreiss and Jürgen Franke. Bootstrapping stationary autoregressive moving-average
models. J. Time Ser. Anal., 13(4):297–317, 1992.
William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
S. N. Lahiri. Resampling Methods for Dependent Data. Springer Series in Statistics. Springer-
Verlag, New York, 2003.
G. J. Lauprete, A. M. Samarov, and R. E. Welsch. Robust portfolio optimization. Metrika, 55(1):
139–149, 2002.
Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics.
Springer-Verlag, New York, 1986.
O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an
application to portfolio selection. Journal of Empirical Finance, 10(5):603–621, 2003.
O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio
Management, 30(4):110–119, 2004.
Sangyeol Lee and Masanobu Taniguchi. Asymptotic theory for ARCH-SM models: LAN and
residual empirical processes. Statist. Sinica, 15(1):215–234, 2005.
Erich Leo Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics.
Springer, 2005.
Markus Leippold, Fabio Trojani, and Paolo Vanini. A geometric approach to multiperiod mean
variance optimization of assets and liabilities. Journal of Economic Dynamics and Control,
28:1079–1113, 2004.
Duan Li and Wan-Lung Ng. Optimal dynamic portfolio selection: Multiperiod mean-variance
formulation. Mathematical Finance, 10(3):387–406, 2000.
Kian-Guan Lim. A new test of the three-moment capital asset pricing model. Journal of Financial
and Quantitative Analysis, 24(2):205–216, 1989.
Wen-Ling Lin. Alternative estimators for factor GARCH models–A Monte Carlo comparison.
Journal of Applied Econometrics, 7(3):259–279, 1992.
John Lintner. The valuation of risk assets and the selection of risky investments in stock portfolios
and capital budgets: A reply. The Review of Economics and Statistics, 51(2):222–224, May
1969.
Greta M. Ljung and George E. P. Box. On a measure of lack of fit in time series models.
Biometrika, 65(2):297–303, 1978.
Sherene Loi, Benjamin Haibe-Kains, Christine Desmedt, Françoise Lallemand, Andrew M. Tutt,
Cheryl Gillet, Paul Ellis, Adrian Harris, Jonas Bergh, and John A. Foekens. Definition of
clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through
genomic grade. Journal of Clinical Oncology, 25(10):1239–1246, 2007.
Zudi Lu and Zhenyu Jiang. L1 geometric ergodicity of a multivariate nonlinear AR model with an
ARCH term. Statist. Probab. Lett., 51(2):121–130, 2001.
Zudi Lu, Dag Johan Steinskog, Dag Tjøstheim, and Qiwei Yao. Adaptively varying-coefficient
spatiotemporal models. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 71(4):859–880, 2009.
David G. Luenberger. Investment Science. Oxford University Press, illustrated edition, 1997.
Helmut Lütkepohl. General-to-specific or specific-to-general modelling? An opinion on current
econometric terminology. Journal of Econometrics, 136(1):319–324, 2007.
Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statis-
tics and Econometrics. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd.,
Chichester, 1999. Revised reprint of the 1988 original.
Yannick Malevergne and Didier Sornette. Extreme Financial Risks: From Dependence to Risk
Management (Springer Finance). Springer, 2006 edition, 2005.
Cecilia Mancini. Disentangling the jumps of the diffusion in a geometric jumping Brownian mo-
tion. Giornale dell’ Istituto Italiano degli Attuari, 64(44):19–47, 2001.
Vladimir A. Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets
of random matrices. Sbornik: Mathematics, 1(4):457–483, 1967.
H. Markowitz. Portfolio selection. The Journal of Finance, 7:77–91, 1952.
Harry M. Markowitz. Portfolio Selection: Efficient Diversification of Investments. Cowles Foun-
dation for Research in Economics at Yale University, Monograph 16. John Wiley & Sons Inc.,
New York, 1959.
David S. Matteson and Ruey S. Tsay. Dynamic orthogonal components for multivariate time series.
Journal of the American Statistical Association, 106(496), 2011.
Alexander J. McNeil, Ruediger Frey, and Paul Embrechts. Quantitative Risk Management: Con-
cepts, Techniques and Tools (Princeton Series in Finance). Princeton University Press, revised
edition, 2015.
Robert C. Merton. Lifetime portfolio selection under uncertainty: The continuous-time case. The
Review of Economics and Statistics, 51(3):247–257, August 1969.
Robert C. Merton. An intertemporal capital asset pricing model. Econometrica, 41:867–887, 1973.
Robert C. Merton. Continuous-Time Finance (Macroeconomics and Finance). Wiley-Blackwell,
1992.
Attilio Meucci. Managing diversification. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1358533, 2010.
Anders Milhøj. The moment structure of ARCH processes. Scandinavian Journal of Statistics, pages
281–292, 1985.
Lance D. Miller, Johanna Smeds, Joshy George, Vinsensius B. Vega, Liza Vergara, Alexander
Ploner, Yudi Pawitan, Per Hall, Sigrid Klaar, and Edison T. Liu. An expression signature for
p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient
survival. Proceedings of the National Academy of Sciences of the United States of America,
102(38):13550–13555, 2005.
Tatsuya Mima and Mark Hallett. Corticomuscular coherence: A review. Journal of Clinical Neu-
rophysiology, 16(6):501, 1999.
Andy J. Minn, Gaorav P. Gupta, Peter M. Siegel, Paula D. Bos, Weiping Shu, Dilip D. Giri, Agnes
Viale, Adam B. Olshen, William L. Gerald, and Joan Massagué. Genes that mediate breast
cancer metastasis to lung. Nature, 436(7050):518–524, 2005.
J. Mossin. Equilibrium in a capital asset market. Econometrica, 34:768–783, 1966.
J. Mossin. Optimal multiperiod portfolio policies. The Journal of Business, 41(2):215–229, 1968.
Kai Wang Ng, Guo-Liang Tian, and Man-Lai Tang. Dirichlet and Related Distributions. Wiley
Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 2011. Theory, methods
and applications.
Nicolas Chapados. Portfolio Choice Problems: An Introductory Survey of Single and
Multiperiod Models (Springer Briefs in Electrical and Computer Engineering). Springer,
2011.
Gottfried E. Noether. On a theorem of Pitman. The Annals of Mathematical Statistics, 26(1):
64–68, 1955.
Shinji Ohara, Takashi Nagamine, Akio Ikeda, Takeharu Kunieda, Riki Matsumoto, Waro Taki,
Nobuo Hashimoto, Koichi Baba, Tadahiro Mihara, Stephan Salenius et al. Electrocorticogram–
electromyogram coherence during isometric contraction of hand muscle in human. Clinical
Neurophysiology, 111(11):2014–2024, 2000.
Annamaria Olivieri and Ermanno Pitacco. Solvency requirements for life annuities. In Proceedings
of the AFIR 2000 Colloquium, Tromsø, Norway, pages 547–571, 2000.
Tohru Ozaki and Mitsunori Iino. An innovation approach to non-Gaussian time series analysis.
Journal of Applied Probability, pages 78–92, 2001.
Helder P. Palaro and Luiz Koodi Hotta. Using conditional copula to estimate value at risk. Journal
of Data Science, 4:93–115, 2006.
Charles M. Perou, Therese Sørlie, Michael B. Eisen, Matt van de Rijn, Stefanie S. Jeffrey, Chris-
tian A. Rees, Jonathan R. Pollack, Douglas T. Ross, Hilde Johnsen, and Lars A. Akslen. Molec-
ular portraits of human breast tumours. Nature, 406(6797):747–752, 2000.
Georg Ch. Pflug. Some remarks on the value-at-risk and the conditional value-at-risk. In Proba-
bilistic Constrained Optimization, pages 272–281. Springer, 2000.
Dinh Tuan Pham and Philippe Garat. Blind separation of mixture of independent sources through
a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7):1712–1725, 1997.
E. J. G. Pitman. Lecture Notes on Nonparametric Statistical Inference: Lectures Given for the
University of North Carolina. Mathematisch Centrum, 1948.
Stanley R. Pliska. Introduction to Mathematical Finance: Discrete Time Models. Wiley, first
edition, 1997.
Vassilis Polimenis. The distributional CAPM: Connecting risk premia to return distributions. UC
Riverside Anderson School Working Paper, (02-04), 2002.
B. L. S. Prakasa Rao. Asymptotic theory for nonlinear least squares estimator for diffusion pro-
cesses. Math. Operationsforsch. Statist. Ser. Statist., 14(2):195–209, 1983.
B. L. S. Prakasa Rao. Statistical inference from sampled data for stochastic processes. In Statistical
Inference from Stochastic Processes (Ithaca, NY, 1987), volume 80 of Contemp. Math., pages
249–284. Amer. Math. Soc., Providence, RI, 1988.
B. L. S. Prakasa Rao. Semimartingales and Their Statistical Inference, volume 83 of Monographs
on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL, 1999.
J. W. Pratt. Risk aversion in the small and in the large. Econometrica, 32:122–136, 1964.
Jean-Luc Prigent. Portfolio Optimization and Performance Analysis. Chapman & Hall/CRC Fi-
nancial Mathematics Series. Chapman & Hall/CRC, Boca Raton, FL, 2007.
Madan Lal Puri and Pranab Kumar Sen. Nonparametric Methods in Multivariate Analysis (Wiley
Series in Probability and Statistics). Wiley, first edition, 1971.
John Quiggin. A theory of anticipated utility. Journal of Economic Behavior & Organization, 3
(4):323–343, December 1982.
B. G. Quinn. A note on AIC order determination for multivariate autoregressions. J. Time Ser.
Anal., 9(3):241–245, 1988.
Svetlozar T. Rachev, John S. J. Hsu, Biliana S. Bagasheva, and Frank J. Fabozzi. Bayesian
Methods in Finance (Frank J. Fabozzi Series). Wiley, first edition, 2008.
Karim Rezaul, Jay Kumar Thumar, Deborah H. Lundgren, Jimmy K. Eng, Kevin P. Claffey, Lori
Wilson, and David K. Han. Differential protein expression profiles in estrogen receptor positive
and negative breast cancer tissues using label-free quantitative proteomics. Genes & Cancer, 1
(3):251–271, 2010.
R. T. Rockafellar and S. P. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:
21–42, 2000.
B. Rosenberg. Extra-market components of covariance in security returns. Journal of Financial
and Quantitative Analysis, 9:263–274, 1974.
Stephen A. Ross. The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13
(3):341–360, December 1976.
A. D. Roy. Safety-first and the holding of assets. Econometrica, 20:431–449, 1952.
Mark E. Rubinstein. The fundamental theorem of parameter-preference security valuation. Journal
of Financial and Quantitative Analysis, 8(1):61–69, 1973.
Alexander M. Rudin and Jonathan S. Morgan. A portfolio diversification index. The Journal of
Portfolio Management, 32(2):81–89, 2006.
David Ruppert. Statistics and Finance. Springer Texts in Statistics. Springer-Verlag, New York,
2004. An introduction.
A. K. Md. Ehsanes Saleh. Theory of Preliminary Test and Stein-Type Estimation with Applications.
Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken,
NJ, 2006.
Alexander Samarov and Murad S. Taqqu. On the efficiency of the sample mean in long-memory
noise. Journal of Time Series Analysis, 9(2):191–200, 1988.
Paul A. Samuelson. Lifetime portfolio selection by dynamic stochastic programming. Review of
Economics and Statistics, 51(3):239–246, August 1969.
Paul A. Samuelson. The fundamental approximation theorem of portfolio analysis in terms of
means, variances, and higher moments. Review of Economic Studies, 37(4):537–542, October
1970.
Thomas J. Sargent and Christopher A. Sims. Business cycle modeling without pretending to have
too much a priori economic theory. New Methods in Business Cycle Research, 1:145–168, 1977.
João R. Sato, André Fujita, Elisson F. Cardoso, Carlos E. Thomaz, Michael J. Brammer, and Edson
Amaro. Analyzing the connectivity between regions of interest: An approach based on cluster
Granger causality for fMRI data analysis. Neuroimage, 52(4):1444–1455, 2010.
Kenichi Sato. Lévy Processes and Infinitely Divisible Distributions, volume 68 of Cambridge
Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1999. Translated
from the 1990 Japanese original, Revised by the author.
Bernd Michael Scherer and R. Douglas Martin. Introduction to Modern Portfolio Optimization
with NUOPT and S-PLUS. Springer-Verlag, first edition, corrected second printing, 2005.
David Schmeidler. Subjective probability and expected utility without additivity. Econometrica,
57(3):571–587, May 1989.
Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
Motohiro Senda and Masanobu Taniguchi. James-Stein estimators for time series regression mod-
els. J. Multivariate Anal., 97(9):1984–1996, 2006.
C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423,
623–656, 1948.
W. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. The
Journal of Finance, 19:425–442, 1964.
Ritei Shibata. Selection of the order of an autoregressive model by Akaike’s information criterion.
Biometrika, 63(1):117–126, 1976.
Ritei Shibata. Asymptotically efficient selection of the order of the model for estimating parameters
of a linear process. Ann. Statist., 8(1):147–164, 1980.
Y. Shimizu. Estimation of diffusion processes with jumps from discrete observations. Master's
thesis, Graduate School of Mathematical Sciences, The University of Tokyo, 2003.
Yasutaka Shimizu. Threshold estimation for jump-type stochastic processes from discrete obser-
vations (in Japanese). Proc. Inst. Statist. Math., 57(1):97–118, 2009.
Yasutaka Shimizu and Nakahiro Yoshida. Estimation of parameters for diffusion processes with
jumps from discrete observations. Stat. Inference Stoch. Process., 9(3):227–277, 2006.
Hiroshi Shiraishi. A simulation approach to statistical estimation of multiperiod optimal portfolios.
Adv. Decis. Sci., Art. ID 341476, 13 pages, 2012.
Hiroshi Shiraishi and Masanobu Taniguchi. Statistical estimation of optimal portfolios for locally
stationary returns of assets. Int. J. Theor. Appl. Finance, 10(1):129–154, 2007.
Hiroshi Shiraishi and Masanobu Taniguchi. Statistical estimation of optimal portfolios for non-
Gaussian dependent returns of assets. J. Forecast., 27(3):193–215, 2008.
Hiroshi Shiraishi, Hiroaki Ogata, Tomoyuki Amano, Valentin Patilea, David Veredas, and
Masanobu Taniguchi. Optimal portfolios with end-of-period target. Adv. Decis. Sci., Art. ID
703465, 13 pages, 2012.
Albert N. Shiryaev. Essentials of Stochastic Finance, volume 3 of Advanced Series on Statistical
Science & Applied Probability. World Scientific Publishing Co. Inc., River Edge, NJ, 1999.
Facts, models, theory, translated from the Russian manuscript by N. Kruzhilin.
Ali Shojaie and George Michailidis. Discovering graphical Granger causality using the truncating
lasso penalty. Bioinformatics, 26(18):i517–i523, 2010.
Steven E. Shreve. Stochastic Calculus for Finance. II. Springer Finance. Springer-Verlag, New
York, 2004. Continuous-time models.
R. H. Shumway and Z. A. Der. Deconvolution of multiple time series. Technometrics, 27(4):385,
1985.
Robert H. Shumway. Time-frequency clustering and discriminant analysis. Statistics & Probability
Letters, 63(3):307–314, 2003.
Z. Sidak, P. K. Sen, and Jaroslav Hájek. Theory of Rank Tests. Probability and Mathematical
Statistics. Elsevier Science, San Diego, 1999.
Andrew H. Sims, Graeme J. Smethurst, Yvonne Hey, Michal J. Okoniewski, Stuart D. Pepper, An-
thony Howell, Crispin J. Miller, and Robert B. Clarke. The removal of multiplicative, systematic
bias allows integration of breast cancer gene expression datasets–improving meta-analysis and
prediction of prognosis. BMC Medical Genomics, 1(1):1, 2008.
Christopher A. Sims. Money, income, and causality. The American Economic Review, 62(4):
540–552, 1972.
Hiroko Kato Solvang and Masanobu Taniguchi. Microarray analysis using rank order statistics for
ARCH residual empirical process. Open Journal of Statistics, 7(1):54–71, 2017a.
Hiroko Kato Solvang and Masanobu Taniguchi. Portfolio estimation for spectral density of cate-
gorical time series data. Far East Journal of Theoretical Statistics, 53(1):19–33, 2017b.
Michael Sørensen. Likelihood methods for diffusions with jumps. In Statistical Inference in
Stochastic Processes, volume 6 of Probab. Pure Appl., pages 67–105. Dekker, New York, 1991.
Frank Spitzer. A combinatorial lemma and its application to probability theory. Trans. Amer. Math.
Soc., 82:323–339, 1956.
Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal dis-
tribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability, 1954–1955, vol. I, pages 197–206, Berkeley, 1956. University of California Press.
David S. Stoffer. Nonparametric frequency detection and optimal coding in molecular biology. In
Modeling Uncertainty, pages 129–154. Springer, 2005.
David S. Stoffer, David E. Tyler, and Andrew J. McDougall. Spectral analysis for categorical time
series: Scaling and the spectral envelope. Biometrika, 80(3):611–622, 1993.
David S. Stoffer, David E. Tyler, and David A. Wendt. The spectral envelope and its applications.
Statist. Sci., 15(3):224–253, 2000.
William F. Stout. Almost Sure Convergence. Academic Press [A subsidiary of Harcourt Brace
Jovanovich, Publishers], New York, 1974. Probability and Mathematical Statistics, Vol. 24.
Helmut Strasser. Mathematical Theory of Statistics, volume 7 of de Gruyter Studies in Mathemat-
ics. Walter de Gruyter & Co., Berlin, 1985. Statistical experiments and asymptotic decision
theory.
Frantisek Stulajter. Comparison of different copula assumptions and their application in portfolio
construction. Ekonomická Revue-Central European Review of Economic Issues, 12, 2009.
Y. Suto and M. Taniguchi. Shrinkage interpolation for stationary processes. ASTE, Research
Institute for Science and Engineering, Waseda University Special Issue “Financial and Pension
Mathematical Science,” 13:35–42, 2016.
Y. Suto, Y. Liu, and M. Taniguchi. Asymptotic theory of parameter estimation by a contrast
function based on interpolation error. Stat. Inference Stoch. Process., 19(1):93–110, 2016.
Anders Rygh Swensen. The asymptotic distribution of the likelihood ratio for autoregressive time
series with a regression trend. J. Multivariate Anal., 16(1):54–70, 1985.
Gábor J. Székely and Maria L. Rizzo. Brownian distance covariance. Ann. Appl. Stat., 3(4):
1236–1265, 2009.
Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by
correlation of distances. Ann. Statist., 35(6):2769–2794, 2007.
H. Taniai and T. Shiohama. Statistically efficient construction of α-risk-minimizing portfolio.
Advances in Decision Sciences, 2012.
Masanobu Taniguchi. Robust regression and interpolation for time series. Journal of Time Series
Analysis, 2(1):53–62, 1981.
Masanobu Taniguchi. Higher Order Asymptotic Theory for Time Series Analysis. Springer, 1991.
Masanobu Taniguchi and Junichi Hirukawa. The Stein-James estimator for short- and long-
memory Gaussian processes. Biometrika, 92(3):737–746, 2005.
Masanobu Taniguchi and Junichi Hirukawa. Generalized information criterion. J. Time Series
Anal., 33(2):287–297, 2012.
Masanobu Taniguchi and Yoshihide Kakizawa. Asymptotic Theory of Statistical Inference for Time
Series. Springer Series in Statistics. Springer-Verlag, New York, 2000.
Masanobu Taniguchi, Madan L. Puri, and Masao Kondo. Nonparametric approach for non-
Gaussian vector stationary processes. J. Multivariate Anal., 56(2):259–283, 1996.
Masanobu Taniguchi, Junichi Hirukawa, and Kenichiro Tamaki. Optimal Statistical Inference in
Financial Engineering. Chapman & Hall/CRC, Boca Raton, FL, 2008.
Masanobu Taniguchi, Alexandre Petkovic, Takehiro Kase, Thomas DiCiccio, and Anna Clara
Monti. Robust portfolio estimation under skew-normal return processes. The European Journal
of Finance, 1–22, 2012.
Murad S. Taqqu. Law of the iterated logarithm for sums of non-linear functions of Gaussian
variables that exhibit a long range dependence. Probability Theory and Related Fields, 40(3):
203–238, 1977.
L. Telser. Safety first and hedging. Review of Economic Studies, 23:1–16, 1955.
Andrew E. Teschendorff, Michel Journée, Pierre A. Absil, Rodolphe Sepulchre, and Carlos Caldas.
Elucidating the altered transcriptional programs in breast cancer using independent component
analysis. PLoS Comput Biol, 3(8):e161, 2007.
H. Theil and A. S. Goldberger. On pure and mixed statistical estimation in economics. Interna-
tional Economic Review, 2(1):65–78, 1961.
George C. Tiao and Ruey S. Tsay. Some advances in non-linear and adaptive modelling in time-
series. Journal of Forecasting, 13(2):109–131, 1994.
Dag Tjøstheim. Estimation in nonlinear time series models. Stochastic Process. Appl., 21(2):
251–273, 1986.
Howell Tong. Non-linear time series: A dynamical system approach. Oxford University Press,
1990.
Ruey S. Tsay. Analysis of Financial Time Series. Wiley Series in Probability and Statistics. John
Wiley & Sons Inc., Hoboken, NJ, third edition, 2010.
Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays
applied to the ionizing radiation response. Proceedings of the National Academy of Sciences,
98(9):5116–5121, 2001.
Amos Tversky and Daniel Kahneman. Prospect theory: An analysis of decision under risk. Econo-
metrica, 47(2):263–292, 1979.
Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of
uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, October 1992.
Roy van der Weide. GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal
of Applied Econometrics, 17(5):549–564, 2002.
Martin H. van Vliet, Fabien Reyal, Hugo M. Horlings, Marc J. van de Vijver, Marcel J. T. Rein-
ders, and Lodewyk F. A. Wessels. Pooling breast cancer datasets has a synergetic effect on
classification performance and improves signature stability. BMC Genomics, 9(1):375, 2008.
John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton
University Press, Princeton, NJ, 1944.
I. D. Vrontos, P. Dellaportas, and D. N. Politis. A full-factor multivariate GARCH model. The
Econometrics Journal, 6(2):312–334, 2003.
Shouyang Wang and Yusen Xia. Portfolio Selection and Asset Pricing, volume 514 of Lecture
Notes in Economics and Mathematical Systems. Springer-Verlag, Berlin, 2002.
Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83,
1945.
C.-F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. Ann.
Statist., 14(4):1261–1350, 1986. With discussion and a rejoinder by the author.
Yingying Xu, Zhuwu Wu, Long Jiang, and Xuefeng Song. A maximum entropy method for a
robust portfolio problem. Entropy, 16(6):3401–3415, 2014.
Menahem E. Yaari. The dual theory of choice under risk. Econometrica, 55(1):95–115, January
1987.
Yee Hwa Yang, Yuanyuan Xiao, and Mark R. Segal. Identifying differentially expressed genes
from microarray experiments via statistic synthesis. Bioinformatics, 21(7):1084–1093, 2005.
Nakahiro Yoshida. Estimation for diffusion processes from discrete observation. J. Multivariate
Anal., 41(2):220–242, 1992.
Ken-ichi Yoshihara. Limiting behavior of U-statistics for stationary, absolutely regular processes.
Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 35(3):237–252, 1976.
Xi Zhao, Einar Andreas Rødland, Therese Sørlie, Bjørn Naume, Anita Langerød, Arnoldo Frigessi,
Vessela N. Kristensen, Anne-Lise Børresen-Dale, and Ole Christian Lingjærde. Combining
gene signatures improves prediction of breast cancer survival. PLoS One, 6(3):e17845, 2011.
Xi Zhao, Einar Andreas Rødland, Therese Sørlie, Hans Kristian Moen Vollan, Hege G. Russnes,
Vessela N. Kristensen, Ole Christian Lingjærde, and Anne-Lise Børresen-Dale. Systematic
assessment of prognostic gene signatures for breast cancer shows distinct influence of time and
ER status. BMC Cancer, 14(1):1, 2014.
Rongxi Zhou, Ru Cai, and Guanqun Tong. Applications of entropy in finance: A review. Entropy,
15(11):4909–4931, 2013.
Author Index

Šidák, Z., 170 Box, G.E.P., 274


Brammer, M.J., 224
Absil, P., 257 Brandt, M.W., 76, 80, 81, 83, 84
Acerbi, C., 20, 35 Breeden, D.T., 26
Adcock, C.J., 207 Brettschneider, J., 251
Ait-Sahalia, Y., 41, 50, 63, 78, 80, 84, 89, Brillinger, D.R., 8, 64, 193, 211, 212, 230,
90, 96, 98, 207 232, 284, 285
Akaike, H., 238, 252, 300, 302, 309 Brockwell, P.J., 8, 44, 55, 214, 302, 319
Akslen, L.A., 251 Brown, B.M., 16
Alexander, G.J., 27 Brown, P.O., 254
Allais, M., 31 Brown, S.J., 38
Allen, M., 238 Bühlmann, P., 244
Amano, T., 84, 244, 245 Børresen-Dale, A.L., 249, 251, 280
Amaro, E., 224
An, Z., 174 Cacho-Diaz, J., 84, 89, 90, 98
Anderson, T.W., 8, 230, 333 Cai, R., 259
Andrews, F.C., 117 Caldas, C., 257
Artzner, P., 20, 34, 35 Campain, A., 249, 250
Aue, A., 26 Campbell, J.Y., 41
Cardoso, E.F., 224
Baba, K., 270 Chamberlain, G., 26, 56
Bagasheva, B.S., 49, 63 Chandra, S.A., 174, 176, 178–180, 250
Bai, Z., 58, 60, 61 Chapados, N., 84
Bancroft, T.A., 300, 307 Chen, C.W.S., 229
Baptista, A.M., 27 Chen, M., 174
Barkai, N., 250 Chernoff, H., 120, 121, 123, 170
Basak, G., 1, 41 Cherry, J.M., 254
Basawa, I.V., 299 Chew, S.H., 31
Bassett, G., 20, 32, 35, 41, 45, 46, 227 Chin, K., 251
Bellman, R., 76, 314 Chochola, O., 26
Benjamini, Y., 249, 252 Choquet, G., 32
Beran, J., 244 Choy, K., 298
Bergh, J., 251 Chu, G., 249
Bickel, P.J., 149, 151–154 Claeskens, G., 300
Binkley, G., 254 Claffey, K.P., 250
Birney, E., 254 Clarke, R.B., 251
Black, F., 50 Collin, F., 251
Blum, A., 98, 106–108, 111 Connor, G., 41
Bodnar, T., 58, 62, 63 Cope, L., 251
Bollerslev, T., 13, 180 Cover, T.M., 2, 75, 98–103, 108–111
Bolstad, B.M., 251 Cvitanic, J., 78
Bos, P.D., 251
Bose, A., 238 d’Assignies, M.S., 251
Botstein, D., 254 D’Eustachio, P., 254

Dahlhaus, R., 239, 312 Gasser, T., 317
Datta, S., 238 Gastwirth, J.L., 124
Davis, R.A., 8, 44, 55, 214, 302, 319 Genton, M.G., 207
de Bono, B., 254 George, J., 251
De Luca, G., 207 Gerald, W.L., 251
Delbaen, F., 20, 34, 35 Geweke, J., 224
Dellaportas, P., 182, 183 Giacometti, R., 20, 33
Delorenzi, M., 251 Gillespie, M., 254
Dempster, A.P., 198–200 Gillet, C., 251
Dempster, M.A.H., 106, 108 Giraitis, L., 13
Der, Z.A., 196, 201 Giri, D.D., 251
Desmedt, C., 251 Girvan, M., 254, 257
DeVries, S., 251 Glass, K., 254, 257
DiCiccio, T., 209, 210 Glombek, K., 60
Diehn, M., 254 Glombek, K., 58–61, 63
Domrachev, M., 250 Goetzmann, W.N., 38
Drost, F. C., 203 Goldberger, A.S., 50
Dunsmuir, W., 285 Gonçalves, S., 244
Gopinah, G.R., 254
Eber, J.M., 20, 34, 35 Gourieroux, C., 63, 79, 84, 207
Eden, E., 254 Goyal, A., 76, 80, 81, 83, 84
Edgar, R., 250 Granger, C.W.J., 224
Efron, B., 238 Grenander, U., 221, 342
Eisen, M.B., 251 Gruber, M.J., 38
EL Karoui, N., 58, 60, 61 Gupta, G.P., 251
Ellis, P., 251
Elton, E.J., 38 Härdle, W., 14, 224, 274, 317
Embrechts, P., 38 Hafner, C.M., 317
Eng, J.K., 250 Haggan, V., 274
Engle, R.F., 6, 180, 181, 250 Haibe-Kains, B., 251
Hajek, J., 126, 128–130, 132–136, 141, 144,
Fabozzi, F.J., 49, 63 163, 297
Fama, E.F., 53, 75, 78 Hall, P., 15, 243, 251
Fan, J., 184, 186 Hallett, M., 270
Fang, H., 27 Hallin, M., 11, 12, 56–58, 63, 139, 141–143,
Fishburn, P.C., 33 145, 146, 148, 149, 155–163,
Florens-Zmirou, D., 96 165–168, 170, 202, 203, 205, 206,
Foekens, J.A., 251 294, 298
Forni, M., 11, 12, 56–58, 63 Hamada, K., 213, 217, 223, 346
Frahm, G., 61 Han, D.K., 250
Franke, J., 238 Hannan, E.J., 8–10, 15, 65, 174, 222, 225,
French, K.R., 53 226, 238, 285, 288, 300, 309, 314,
Frey, R., 38 332, 339, 342
Fridlyand, J., 251 Hansen, L., 41, 50, 63, 78, 84, 98
Friedman, J., 63 Harris, A., 251
Frigessi, A., 249 Harvey, C.R., 27
Frost, P.A., 47 Hashimoto, N., 270
Fujita, A., 224 Hastie, T., 63
Heath, D., 20, 34, 35
Gaivoronski, A.A., 98, 104–106, 108 Hernandez-Boussard, T., 254
Garat, P., 191, 192 Hewitt, E., 10
Garel, B., 294 Hey, Y., 251
Heyde, C.C., 15, 93, 98 Kakizawa, Y., 8, 9, 227, 287, 289, 295, 304,
Hirukawa, J., 202, 203, 227, 261, 299, 305, 313, 314, 316, 317
308, 312, 321, 322, 324, 326, 327, Kakouris, I., 36, 38
329, 330, 332 Kalai, A., 98, 106–108, 111
Hjort, N.L., 300 Kase, T., 209, 210
Hobbs, B., 251 Kataoka, S., 20, 33
Hochberg, Y., 249, 252 Kato, H., 14, 248, 270, 271, 273, 274, 291,
Hodges, J.L., 117, 119 298, 304, 317, 319
Honda, M., 14, 270, 271, 273, 274, 291, 298, Keenan, D.M., 213
304, 317, 319 Kelly, J.L., 2, 75, 78
Horlings, H.M., 250 Kessler, M., 96
Horváth, L., 26, 170–172, 177 Kilian, L., 244
Hosoya, Y., 63, 65, 224, 284, 285, 288, 315, Klaar, S., 251
316, 344 Klaassen, C.A.J., 149, 151–154, 203
Hotta, L.K., 36 Knight, K., 228
Howell, A., 251 Kobayashi, I., 229, 233, 234
Hsu, J.S.J., 49, 63 Koenker, R., 20, 32, 35, 38, 41, 45, 46, 63,
Hsu, Y.T., 229 227, 229
Hurd, T.R., 84, 89, 90, 98 Kokoszka, P., 13, 170–172, 177
Hušková, M., 26 Kondo, M., 226, 227
Hwang, S., 27 Konno, H., 33
Hörmann, S., 26 Kordas, G., 20, 32, 35, 41, 45, 46
Hájek, J., 170 Koren, A., 250
Korkie, B., 1, 41, 47
Ibragimov, I.A., 16, 137 Korkie, R.M., 41, 47
Iino, M., 250 Koul, H.L., 300, 307
Ikeda, A., 270 Kraus, A., 26
Ingenbleek, J.F., 139, 141–143, 145, 161, Kreiss, J.P., 238, 293, 294
294 Kristensen, V., 249, 251, 280
Inoue, Y., 270 Kroner, K.F., 181
Irizarry, R.A., 251 Kruskal, W.H., 117
Jacod, J., 291 Kunieda, T., 270
Jagannathan, R., 41 Kuo, W.L., 251
James, W., 47, 331, 342
Lahiri, S.N., 238, 243
Jasiak, J., 63, 79, 84, 207
Lai, T.Y., 27
Jassal, B., 254
Laird, N.M., 198–200
Jeffrey, S.S., 251
Lallemand, F., 251
Jeffreys, H., 50
Langerød, A., 249
Jeganathan, P., 296
Lapuk, A., 251
Jeong, K., 224
Lash, A.E., 250
Jiang, R., 259
Lauprete, G.J., 41
Jiang, Z., 12, 14
Lazrak, A., 78
Jin, H., 254
LeCam, L., 1, 271, 293
Jobson, J., 1, 41, 47
Ledoit, O., 41, 47
Joe, H., 36–38
Lee, S., 250, 252, 298
John, G., 55
Lehmann, E.L., 117, 119, 124–126, 128
Johnsen, H., 251
Leippold, M., 78
Jorion, P., 41, 47
Leipus, R., 13
Joshi-Tope, G., 254
Lewis, S., 254
Journée, M., 257
Li, D., 78
Kahneman, D., 32 Lim, K.G., 27
Lin, W.L., 180–182 Navon, R., 254
Lingjærde, O.C., 249, 251, 280 Neudecker, H., 226, 332
Lintner, J., 24 Neve, R.M., 251
Lippi, M., 11, 12, 56–58, 63 Ng, K.W., 108, 110
Lipson, D., 254 Ng, W.L., 78
Litterman, R., 50 Noether, G.E., 122
Litzenberger, R.H., 26
Liu, E.T., 251 Ogata, H., 84, 244, 245
Liu, H., 58, 60, 61 Ohara, S., 270
Liu, Y., 343 Okoniewski, M.J., 251
Ljung, G.M., 274 Olivieri, A., 244
Lo, A.W., 41 Olshen, A.B., 251
Loi, S., 251 Ordentlich, E., 98, 100–103, 108, 109, 111
Loperfido, N., 207 Ozaki, T., 250, 274
Lozza, S., 20, 33
Lu, Z., 12, 14 Paindaveine, D., 165, 166, 168, 170, 205,
Luenberger, D.G., 24, 26–28, 30, 38, 76 206
Lütkepohl, H., 224 Palaro, H.P., 36
Lundgren, D.H., 250 Parolya, N., 58, 62, 63
Pastur, L.A., 58
MacCrimmon, K.R., 31 Patilea, V., 84, 244, 245
Mackinlay, A.C., 41 Pawitan, Y., 251
Magnus, J.R., 226, 332 Pepper, S.D., 251
Mancini, C., 97 Perou, C.M., 251
Markowitz, H.M., 1, 19, 21, 26, 33, 38 Petkovic, A., 209, 210
Martin, R.D., 63 Pflug, G.C., 36, 106, 108
Marčenko, V.A., 58 Pham, D.T., 191, 192
Massagué, J., 251 Piette, F., 251
Matese, J.C., 254 Pitacco, E., 244
Matsumoto, R., 270 Pitman, E.J.G., 117, 118
Matteson, D.S., 187, 189–191 Pliska, S.R., 76, 79, 84
Matthews, L., 254 Ploner, A., 251
McDougall, A.J., 235, 259 Polimenis, V., 27
McNeil, A.J., 38 Politis, D.N., 182, 183
Merton, R.C., 26, 75, 84, 98 Pollack, J.R., 251
Meucci, A., 259–261 Prakasa Rao, B.L.S., 96, 291, 292, 299
Michailidis, G., 224 Pratt, J.M., 29
Mihara, T., 270 Prigent, J.L., 29–31, 38
Milhøj, A., 171 Prášková, Z., 26
Miller, C.J., 251 Puri, M.L., 139, 141–143, 145, 146, 148,
Miller, L.D., 251 149, 161, 170, 172, 176, 177, 226,
Mima, T., 270 227, 294
Minn, A.J., 251
Mitra, G., 106, 108 Qian, Z., 251
Monti, A.C., 209, 210 Quiggin, J., 32
Morgan, J.S., 259 Quinn, B.G., 238, 300
Morgenstern, O., 20, 29
Rachev, S.T., 49, 63
Mossin, J., 24, 78
Ratti, V., 41, 47
Mykland, P.A., 207
Rees, C.A., 251, 254
Nagamine, T., 270 Reichlin, L., 11, 12, 56–58, 63
Naume, B., 249 Reinders, J.T., 250
Reyal, F., 250 Shiohama, T., 45, 46
Rezaul, K., 250 Shiraishi, H., 41, 80, 84, 239, 242, 244, 245,
Ritov, Y., 149, 151–154 279, 312
Rockafellar, R.T., 20, 35, 36, 38 Shiryaev, A.N., 8, 291
Romano, J.P., 124–126, 128 Shojaie, A., 224
Rosenblatt, M., 221, 342 Shreve, S.E., 87
Ross, D.T., 251 Shu, W., 251
Ross, S.A., 2, 19 Shumway, R.H., 196, 201, 227
Rothschild, M., 56 Shutes, K., 207
Roy, A.D., 20, 33 Šidák, Z., 126, 128, 141, 144, 163, 297
Roydasgupta, R., 251 Siddique, A., 27
Rozanov, Y.A., 137 Siegel, P.M., 251
Rubin, D.B., 198–200 Simpson, K., 251
Rubinstein, M.E., 26 Sims, A.H., 251
Rudin, A.M., 259 Sims, C.A., 55, 224
Ruppert, D., 26–28, 30, 38, 73 Smeds, J., 251
Russnes, H.G., 251, 280 Smethurst, G.J., 251
Rustem, B., 36, 38 Solev, V.N., 137
Ryder, T., 251 Song, S., 224
Ryudiger, F., 38 Song, X., 259
Rødland, E.A., 249, 251, 280 Speed, T.P., 251
Spellman, P.T., 251
Saleh, A.K.M.E., 300, 307 Spitzer, F., 306
Salenius, S., 270 Stein, C., 46, 47, 331, 339
Samarov, A., 41, 336 Stein, L., 254
Samuelson, P.A., 31, 75, 78, 79 Steinebach, J.G., 26
Santa-Clara, P., 76, 80, 81, 83, 84 Steinfeld, I., 254
Sargent, T.J., 55 Steinskog, D.J., 12
Satchell, S.E., 27 Stella, F., 98, 104–106, 108
Sato, J.R., 224 Stoffer, D.S., 235, 250, 259, 261, 262, 266,
Sato, K., 85 267, 274, 275
Savage, I.R., 120, 121, 123, 170 Stout, W.F., 15
Savarino, J.E., 47 Strasser, H., 297
Stromberg, K., 10
Scherer, B.M., 63 Stroud, J.R., 76, 80, 81, 83, 84
Schmeidler, D., 32 Stulajter, F., 36
Schmid, W., 58, 62, 63 Sun, G., 41
Schmidt, E., 254 Suto, Y., 342, 343
Schwarz, G., 238, 300, 309 Swensen, A.R., 293, 294, 324
Scott, D.J., 299 Sørensen, M., 13, 96, 286, 291
Segal, M.R., 249 Sørlie, T., 249, 251, 280
Sen, P.K., 126, 128, 141, 144, 163, 170, 172,
176, 177 Taki, W., 270
Senda, M., 339 Tamaki, K., 227, 261
Sepulchre, R., 257 Tang, M.L., 108, 110
Serroukh, A., 298 Taniai, H., 45, 46, 202, 203
Shannon, C.E., 2, 75 Taniguchi, M., 8, 9, 14, 41, 63, 65, 84, 174,
Sharpe, W., 2, 19, 23 176, 178–180, 202, 203, 209, 210,
Sherlock, G., 254 213, 217, 223, 226, 227, 229, 233,
Shibasaki, H., 270 234, 239, 242, 244, 245, 248, 250,
Shibata, R., 300, 304, 306 252, 261, 271, 273, 274, 279, 284,
Shimizu, Y., 96–98 285, 287–289, 291, 295, 298–300,
304, 305, 308, 312, 315–317, 319, Wendt, D.A., 250, 259, 261, 262, 266, 267,
321, 322, 324, 326, 327, 329, 330, 274, 275
332, 339, 342–344, 346 Werker, B.J.M., 155–163, 167, 203
Taqqu, M.S., 336, 338 Wessels, L.F.A., 250
Tasche, D., 20, 35 Wilcoxon, F., 113, 114, 116, 117
Telser, L., 20, 33 Wilson, L., 250
Teschendorff, A.E., 257 Wolf, M., 41, 47
Teyssiere, G., 170–172, 177 Wolff, S.S., 124
Theil, H., 50 Wong, W.K., 58, 60, 61
Thomas, J.A., 98, 99, 108–110 Wooldridge, J.M., 180
Thomaz, C.E., 224 Wu, C.F.J., 244
Thumar, J.K., 250 Wu, G.R., 254
Tian, G.L., 108, 110 Wu, Z., 259
Tiao, G.C., 274
Tibshirani, R., 63, 249 Xia, Y., 33, 38
Tirosh, I., 250 Xiao, Y., 249
Tjøstheim, D., 12, 172, 176, 177 Xiao, Z., 227, 229
Tong, G., 259 Xu, Y., 259
Tong, H., 6
Trojani, F., 78 Yaari, M.E., 32
Tsay, R.S., 63, 73, 187, 189–191, 274 Yakhini, Z., 254
Tsybakov, A., 14, 274 Yamashita, T., 229, 233, 234
Tusher, V.G., 249 Yamazaki, H., 33
Tutt, A.M., 251 Yang, L., 14, 274
Tversky, A., 32 Yang, Y.H., 249, 250
Tyler, D.E., 235, 250, 259, 261, 262, 266, Yao, Q., 12, 184, 186
267, 274, 275 Ye, D.W., 217
Yoshida, N., 96–98
Uryasev, S.P., 20, 35, 36, 38 Yoshihara, K., 137–139, 142, 143

van de Rijn, M., 251 Zhang, Y., 251


van de Vijver, M.J., 250 Zhao, X., 249, 251, 280
van der Weide, R., 183, 184 Zhou, R., 259
van Vliet, M.H., 250
Vanini, P., 78
Vastrik, I., 254
Vega, V.B., 251
Veredas, D., 84, 244, 245
Vergara, L., 251
Viale, A., 251
Viale, G., 251
Vollan, H.K.M., 251, 280
von Neumann, J., 20, 29
Vrontos, I.D., 182, 183

Wallis, W.A., 117


Wang, M., 184, 186
Wang, S., 33, 38
Wang, T., 78
Wang, Y., 251
Wellner, J.A., 149, 151–154
Welsch, R.E., 41
Subject Index

D-distance, 186 asymptotic growth rate, 100, 101, 104–106


F statistic, 118 asymptotic normality, 61, 121
H test, 117 asymptotic power, 134
K-factor GARCH, 181 asymptotic relative efficiency, 117–123, 145,
K-factor GARCH model, 180 177
K-sample problem, 117 asymptotically centering, 296
P-local characteristics, 291 asymptotically efficient, 297
U-statistic, 137 asymptotically equivalent, 142
U-statistics, 142, 143, 161 asymptotically most powerful, 130
α-CVaR, 36 asymptotically most powerful test, 148, 149
α-risk, 20, 34 asymptotically orthogonal, 156
δ-method, 44, 98, 316 asymptotically unbiased estimator, 302
µ-weighted universal portfolio, 100 asymptotically unbiased of level α, 297
µ-weighted universal portfolio with side asymptotically uniformly most powerful,
information, 102 134
φ-mixing, 137 autocovariance function, 9
c1-test, 121 autoregressive conditional heteroscedastic
h-step ahead prediction, 221 model (ARCH(q)), 2, 6, 170, 250
p-dependent, 142 autoregressive moving average process of
(n, m)-asymptotics, 58, 59 order (p, q) (ARMA(p, q)), 146
autoregressive moving average process of
absolute rank, 116 order (p, q) (ARMA(p, q)), 6
absolute risk aversion (ARA), 30 autoregressive process of order p (AR(p)), 6
absolutely continuous, 9, 46, 131, 293
absolutely regular, 137 bandwidth, 239
absolutely regular process, 142 BARRA approach, 53
adapted, 12 Bayesian estimation, 41
adapted stochastic process, 293 Bayesian information criterion (BIC), 238,
adaptive varying-coefficient spatiotemporal 300
model, 12 Bayesian optimal portfolio estimator, 50, 51
adaptiveness, 158, 159 Bellman equation, 75, 77, 81, 88
admissible, 309 best constant rebalanced portfolio (BCRP),
Affy947 expression dataset, 250 104–106
Akaike’s information criterion (AIC), 238, best constant rebalanced portfolio with
300 transaction costs, 107, 108
approximate factor model, 56 best constant rebalanced portfolio (BCRP),
AR bootstrap (ARB), 80, 81, 84, 235, 238 99–101
arbitrage opportunity, 28 best critical region, 134
arbitrage pricing theory (APT), 19, 27 best linear predictor, 221
ARCH(∞), 13 best linear unbiased estimator (BLUE), 197
ARCH(∞)-SM model, 2, 298 best state constant rebalanced portfolio
Arrow–Pratt measure, 29 (BSCRP), 102, 104
asymptotic efficiency, 97 beta, 25
asymptotic filter, 96 Black–Litterman model, 50

blind separation, 192 constant relative risk aversion (CRRA), 30
BLUE, 314 consumption process, 79, 86
bootstrapped reserve fund, 247 consumption-based CAPM (CCAPM), 26,
Brownian motion process, 7, 84 79
budget constraint, 76, 79 contiguity, 134
contiguous, 131, 227
cancer cells, 257 continuity, 29
cancer-signaling pathways, 257 continuous time stochastic process, 5
canonical correlation, 230 continuously differentiable, 44
canonical correlation analysis (CCA), 2, 230 contrast function, 96
canonical representation, 291 convex risk measure, 34
canonical variate, 230 copula, 20, 36
capital asset pricing model (CAPM), 2, 19, copula density, 37
23 corticomuscular functional coupling (CMC),
capital market line (CML), 24 270
Cauchy–Schwarz inequality, 124 counting process, 6
causal (nonanticipating) portfolio, 99 Cramer–Wold device, 144
causal (nonanticipating) portfolio with critical function, 127
transaction costs, 107 critical region, 134
causal variable, 2 critical value, 121
central sequence, 155, 157–159, 162, 164, cross information, 169
166, 167, 210 CRRA utility, 75, 77, 80, 81, 83, 89, 95
certainty equivalent, 29 cumulant, 43
Cesàro sum, 331 cumulant spectra, 207
chain rule, 151 cyclic trend, 341
characteristic equation, 141 cylinder set, 14
characteristic function, 85
Choquet distortion measure, 32 decreasing absolute risk aversion (DARA),
Choquet expected utility, 32 30
classification, 227 diagonal multivariate GARCH, 181
classification rule, 227 differentially expressed (DE), 249
coherence, 20 diffusion, 7
coherent risk measure, 34 diffusion process, 7
common (dynamic) eigenvalue, 57 diffusion process with jump, 84, 87, 90, 96
compensator, 13, 95, 291 Dirac measure, 95
compound Poisson process, 84, 95 Dirichlet distribution, 101
commutation matrix, 166 discount rate, 86
conditional heteroscedastic autoregressive discrete Fourier transform (DFT), 192, 197
nonlinear (CHARN), 2, 14, 235 discrete time stochastic process, 5
conditional least squares estimator, 83 discriminant analysis, 227
conditional value at risk (CVaR), 20 disparity, 227
conditionally uncorrelated components dispersion measure, 20, 33
(CUCs), 184, 186 distance correlation, 145
confidence region, 241 distance covariance, 145
conjugate prior, 50 distribution-CAPM, 27
consistency, 41, 60, 309 DNA sequence, 2
constant absolute risk aversion (CARA), 30 DNA sequence for Epstein–Barr virus
constant conditional correlation (CCC) (EBV), 261
model, 65 Doob’s martingale convergence theorem, 15
constant rebalanced portfolio (CRP), 99 drift, 7
constant rebalanced portfolio with dynamic factor model, 11, 55
transaction cost, 107
dynamic orthogonal components (DOCs), Fisher information, 130, 150, 155, 163, 166
187, 191, 202 Fisher information matrix, 155, 158, 159,
dynamic portfolio, 78 305
dynamic programming, 75 focused information criterion (FIC), 300
four moments CAPM, 27
Edgeworth expansion, 214 Fourier transform, 196
efficiency, 126 Fréchet differentiable, 150
efficient central sequence, 162, 168 full-factor GARCH, 182
efficient central sequences, 167 functional central limit theorem (FCLT), 138
efficient frontier, 21, 235 fundamental factor model, 53
efficient influence function, 150, 152, 153,
157 Gaussian copula, 37
efficient portfolio, 21 Gaussian process, 9
efficient score, 157 gene expression omnibus (GEO), 250
efficient score function, 154, 164 gene ontology (GO) analysis, 254
Einstein's summation convention, 216 generalized AIC (GAIC), 2, 300
elastic-net, 53 generalized ARMA process, 147
electromyograms (EMGs), 270 generalized autoregressive conditional
electroencephalogram (EEG), 270 heteroscedastic model
elliptical density, 164 (GARCH(p, q)), 13
elliptical random variable, 164 generalized consumption investment
elliptical residuals, 204 problem, 79
EM algorithm, 199, 200 generalized dynamic factor model, 56
empirical distribution function, 302 generalized EM (GEM) algorithm, 200
end-of-period target, 244 generalized inverse matrix, 62
envelope power function, 126 generalized least squares (GLS), 53
envelope relation, 79 generalized method of moments (GMM), 75,
ergodic, 14 91, 190
estimated efficient frontier, 69 generalized orthogonal GARCH
estimating equation, 92, 192 (GO-GARCH) model, 183
Euler condition, 79 global minimum variance portfolio, 22, 58,
exogenous variable, 207 59, 69, 239, 242
expected excess rate of return, 25 Granger causality, 224
expected shortfall (ES), 20 Grenander’s condition, 220, 338
expected utility, 20, 30 group, 124
exponential AR (ExpAR) models, 274 group invariance, 167

factor, 52 Hamilton–Jacobi–Bellman (HJB) equation,


factor analysis, 41 75
factor loading, 28, 52 Hannan–Quinn criterion (HQ), 238, 300
factor model, 19, 27 hyperbolic absolute risk aversion (HARA), 30
false discovery rate (FDR), 249 hedging demand, 78
Fama–French approach, 53 Hermite polynomial, 214
FARIMA model, 298 Hermitian nonnegative, 338
Fatou’s lemma, 335 high-dimensional problem, 41
feasible region, 21 homogeneous Poisson process, 6, 84
feasible set, 20 homogeneous Poisson random measure, 95
filtration, 12 Huber’s function, 190
final prediction error (FPE), 237 hyper-parameter, 49
finite Fourier transform, 211
first-order condition (FOC), 75, 77, 80, 89 idiosyncratic (dynamic) eigenvalue, 57
first-order stochastic dominance, 31 idiosyncratic risk, 26
independence, 29 linear projection, 224
independent component analysis (ICA), 187 linear rank statistics, 114
inference function, 162 linear serial rank statistic, 139, 145
infinite time horizon problem, 88 linear serial rank statistics, 144, 148
influence function, 157 Lipschitz condition, 140
influence measure criterion (IMC), 210 Lipschitz continuity, 96
information bound, 150, 153 local alternative, 120
information matrix, 153, 166 local asymptotic linearity, 158
information set process, 76 local asymptotic normality (LAN), 1, 210,
information theory, 98 271, 293
informative prior, 50 local martingale, 15
innovation process, 140 local optimality, 169
intensity parameter, 84 locally asymptotic mixed normal (LAMN),
intercept of factor model, 27 299
interest rate, 244 locally asymptotically maximin efficient, 170
interpolation, 342 locally asymptotically normal (LAN), 155,
intertemporal CAPM (ICAPM), 26 156, 158, 159, 162, 164
intrinsic annotation, 251 locally asymptotically optimal, 298
invariance, 156 locally stationary process, 2, 235, 239, 299
invariant, 15, 124, 129 logarithmic utility, 75, 78
invariant principle, 168 long memory process, 2
invariant test, 125 loss function, 296
inverse Wishart distribution, 50 lottery, 28
invertibility condition, 42
Itô’s lemma, 86, 87, 90 m-vector process, 5
macroeconomic factor model, 52
James–Stein estimator, 331 Markov property, 88
Japanese Government Pension Investment Markov time, 12
Fund (GPIF), 84, 235, 244 martingale, 7
Jeffreys prior, 50 martingale central limit theorem, 92, 160
Jensen’s inequality, 124 martingale difference, 16
maximal invariant, 124, 125, 160, 167
kernel function, 239 maximum likelihood estimator (MLE), 3,
Kronecker product, 166 291, 292
mean absolute deviation model, 33
Lévy measure, 85 mean absolute error (MAE), 215
Lévy process, 84 mean semivariance model, 33
LASSO, 53 mean squared error (MSE), 48, 196, 339
least favourable, 126, 128, 129, 153, 156, mean squared interpolation error (MSIE),
157, 161, 162 342
least favourable density, 126, 127 mean target model, 33
least favourable direction, 156 mean variance efficient portfolio, 21
least favourable distribution, 128 mean-CVaR portfolio, 1, 35, 36
least squares estimator, 339 mean-ES portfolio, 35
Lebesgue bounded convergence theorem, mean-TCE portfolio, 35
118 mean-VaR portfolio, 1, 34
LeCam’s first lemma, 141, 158 mean-variance portfolio, 1, 208
LeCam’s third lemma, 144, 297 mean-variance-skewness (MVS) model, 31
limit theorem, 283 measure of disparity, 300
Lindeberg condition, 16, 134 measure preserving, 15
linear interpolation, 125 microarray data, 2, 235, 248, 249
linear process, 2 midrank, 114
minimum Pitman efficiency, 117 orthogonal GARCH, 183
minimum-variance portfolio, 22 orthogonal increment process, 9
minimum-variance set, 21
misspecification, 346 parametric experiments, 155
misspecified prediction error, 222 parametric submodels, 155
mixing, 286 pathway analysis, 254
mixing transformation, 187 pathwise differentiable, 152
mixture successive constant rebalanced pension investment, 243
portfolio (MSCRP), 106 periodogram, 283
modern portfolio theory, 19 permutation, 125
molecular subtypes of breast tumours, 251 pessimistic, 35
monthly log-return, 235 pessimistic portfolio, 1, 32, 35, 44
Moore–Penrose inverse, 62 pointwise ergodic theorem, 15
Moore–Penrose pseudo-inverse, 63 portfolio weight process, 76
most efficient statistic, 145 portfolio weight process, 76
most powerful, 126–129 posterior density, 49
most powerful invariant test, 125 power function, 128
most powerful test, 126, 127 power functions, 122
moving average process of order q (MA(q)), predictable process, 15
6 predictive return density, 49
moving block bootstrap (MBB), 84 price relative, 98
multiperiod problem, 1 principal component analysis (PCA), 54, 188
multiplicatively modulated exponential prior distribution, 49
autoregressive model probability simplex, 98
(mmExpAR), 271 probability space, 5
multiplicatively modulated nonlinear projection, 157
autoregressive model (mmNAR), projection operator, 151, 152
271 prospect theory, 32
multivariate Student t-distribution, 50 pseudo-Mahalanobis signs and ranks, 205
myopic portfolio, 75, 78
quadratic characteristic, 15
naive portfolio, 58, 59 quadratic variation process, 286
Neyman–Pearson lemma, 128, 134 quadratically integrable derivative, 129, 132
non-Gaussian linear process, 41, 211 quantile, 35
non-Gaussian stationary process, 1 quantile regression, 41, 45, 227
noncentral chi square distribution, 118 quasi-Gaussian maximum likelihood
nonlinear autoregressive model, 6 estimator, 66, 67, 287
nonparametric estimator, 239 quasi-maximum likelihood, 191
nonparametric portfolio estimator, 235
nonparametric test statistic, 120 radial basis function (RBF), 273
nonsystematic risk, 26 Radon–Nikodym density, 293
nuisance parameter, 129 random coefficient regression, 45
rank, 113, 125, 165, 167, 205
optimal consumption, 89 rank autocorrelation coefficient, 148
optimal estimating function, 93 rank dependent expected utility (RDEU)
optimal portfolio, 1, 211 theory, 30, 32
optimal portfolio estimator, 41 rank order statistics, 173
optimal portfolio weight, 42 rank order statistics for ARCH residual, 250
optimal shrinkage factor, 47 rank statistic, 114, 131
order statistic, 114 ranks, 129
order statistics, 125 rapidly increasing experimental design, 96
ordinary least squares (OLS), 52, 53, 237 rational, 28
regression spectral measure, 219 Sharpe ratio, 245
regular, 299 shrinkage estimation, 3, 41
regular parametric model, 150 shrinkage estimator, 47, 331
regular parametric submodel, 152 shrinkage factor, 47
regular point of the parametrization, 150 shrinkage interpolator, 342
relative risk aversion (RRA), 30 shrinkage sample mean, 47
required reserve fund, 244 shrinkage target, 47
resampled efficient frontier, 239 side information, 98, 101
resampled return, 239 sieve bootstrap (SB), 244
reserve fund, 244, 245 sign, 165, 167, 205
residual analysis for fitting time series sign function, 116
modeling, 274 sign test, 128
residual Fisher information, 156 signed rank statistic, 116
return process, 2, 42, 207 skew-GARCH, 207
ridge regression, 53 skew-normal distribution, 2, 207
risk aversion, 29 Sklar’s theorem, 36
risk aversion coefficient, 214 Sobolev space, 166
risk diversification, 259 Spearman’s rank correlation coefficient, 140
risk measure, 20 specific risk, 26
risk premium, 25, 29 spectral density matrix, 9, 42, 57
risk-free asset, 76, 216 spectral density of exponential type, 310
risky asset, 76 spectral distribution matrix, 9
robust, 114, 210 spectrum envelope (SpecEnv), 235, 259
run statistic, 139 square summable, 56
stable, 286
safety first, 33 state constant rebalanced portfolio (SCRP),
safety measure, 20, 33 102
sampling scheme, 95 state space, 5
Scheffe’s theorem, 330 stationary process, 41
score function, 139, 150, 160, 169 statistical factor model, 54
score generating function, 145 statistical model selection, 300
score generation function, 140 stochastic basis, 12
second-order stochastic dominance, 31 stochastic dominance, 31
second-order stationary, 9, 42 stochastic integral, 9
second-order stationary process, 63 stochastic process, 5
security market line (SML), 25 stopping time, 13
self-exciting threshold autoregressive model strictly stationary, 8, 137
(SETAR(k; p, …, p)), 6 strong mixing, 191
self-financing, 76 Student t-distribution, 214
semi-parametrically efficient inference, 162 Student t-test, 119
semimartingale, 13 successive constant rebalanced portfolio
semiparametric efficiency, 156, 157 (SCRP), 104, 105
semiparametric inference, 151 surface, 151, 152
semiparametric model, 153, 162 symmetric restriction, 124
semiparametric models, 155 systematic risk, 26
semiparametrically efficient, 157
semiparametrically efficient central tail conditional expectation (TCE), 20
sequence, 156 tangency portfolio, 23, 58, 59
separating matrix, 187 tangent space, 149, 151, 152, 157, 167
sequence of experiments, 295 terminal condition, 77, 79–81
sequencing data analysis, 235 terminal wealth, 76, 80
shape efficient central sequence, 168 threshold estimation method, 75, 97
time series multifactor model, 27
time-varying spectral density, 320
trading time, 76
training sample, 229
transaction cost, 98, 107
transformation, 124
turning point statistic, 139
two-sample location problem, 113

unbiased martingale estimating function, 92


uncorrelated, 52
uniform local asymptotic normality
(ULAN), 165, 166, 202, 203
uniformly most powerful (UMP), 125, 127,
128
uninformative prior, 50
unit root, 220
universal portfolio (UP), 2, 75, 98, 99
universal portfolio with transaction cost, 108
universal portfolio with transaction costs,
107
utility, 29, 76, 215

value at risk (VaR), 20, 34


value function, 75, 76, 79, 87
VAR(p), 11, 237, 238
VaR-CAPM, 27
variational distance, 296
VARMA(p, q), 11
vector GARCH equation, 66
vector linear process, 11, 42
vector-valued ARMA, 42
Volterra series expansion, 204
von Mises’s differentiable statistical
functional, 138

wealth process, 86
wealth relative, 98
weight process, 76
weighted successive constant rebalanced
portfolio (WSCRP), 105, 106
weighted utility theory, 31
white noise, 52
Whittle estimator, 222
Whittle’s approximate likelihood, 246
Wiener process, 7
Wilcoxon signed rank statistic, 116
Wilcoxon statistic, 114
Wilcoxon test, 114, 119, 120
Wilcoxon’s rank sum test, 113
Wilcoxon’s signed rank test, 113
wild bootstrap (WB), 244
