You are on page 1of 14

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO.

3, FEBRUARY 1, 2017 663

Probabilistic Tensor Canonical Polyadic


Decomposition With Orthogonal Factors
Lei Cheng, Yik-Chung Wu, and H. Vincent Poor, Fellow, IEEE

Abstract—Tensor canonical polyadic decomposition (CPD), number of rank-1 component R is defined as the tensor rank
which recovers the latent factor matrices from multidimensional [14]. Under rather mild conditions, the CPD is unique up to a
data, is an important tool in signal processing. In many applica- trivial scalar and permutation ambiguity [14], and this fact has
tions, some of the factor matrices are known to have orthogonality underlain its importance in signal processing [1]–[13].
structure, and this information can be exploited to improve the
accuracy of latent factors recovery. However, existing methods for To find the factor matrices in CPD, a common approach is to
CPD with orthogonal factors all require the knowledge of tensor solve min{A ( n ) }Nn = 1 X − [[A(1) , A(2) , . . . , A(N ) ]]2F . Unfortu-
rank, which is difficult to acquire, and have no mechanism to han- nately, it can be seen from (1) that all the factor matrices are
dle outliers in measurements. To overcome these disadvantages, in nonlinearly coupled, and thus a closed-form solution does not
this paper, a novel tensor CPD algorithm based on the probabilistic exist. Consequently, the most popular solution is the alternating
inference framework is devised. In particular, the problem of ten- least squares (ALS) method, which iteratively optimizes one
sor CPD with orthogonal factors is interpreted using a probabilistic factor matrix at a time while holding the other factor matri-
model, based on which an inference algorithm is proposed that al-
ces fixed [14], [15]. However, the ALS method does not take
ternatively estimates the factor matrices, recovers the tensor rank,
and mitigates the outliers. Simulation results using synthetic data into account the potential orthogonality structure in the factor
and real-world applications are presented to illustrate the excellent matrices, which can be found in a variety of applications. For
performance of the proposed algorithm in terms of accuracy and example, the zero-mean uncorrelated signals in wireless com-
robustness. munications [5]–[9], the prewhitening procedure in ICA [1], [4],
and the basis matrices in linear image coding [12], [13], all give
Index Terms—Multidimensional signal processing, orthogo-
nal constraints, robust estimation, tensor canonical polyadic rise to orthogonal factors in the tensor model. Interestingly, the
decomposition. uniqueness of tensor CPD incorporating orthogonal factors is
guaranteed under an even milder condition than the case without
I. INTRODUCTION orthogonal factors. Pioneering work [17] formally established
this fact, and extended the conventional methods to account for
ANY problems in signal processing, such as indepen-
M dent component analysis (ICA) with matrix-based mod-
els [1]–[4], blind signal estimation in wireless communications
the orthogonality structure, among which the orthogonality con-
strained ALS (OALS) algorithm1 shows remarkable efficiency
in terms of accuracy and complexity.
[5]–[9], localization in array signal processing [10], [11], and
However, there are at least two major challenges the algo-
linear image coding [12], [13], eventually reduce to the issue
rithms in [17] (including the OALS) face in practical applica-
of finding a set of factor matrices {A(n ) ∈ C I n ×R }Nn =1 from a tions. Firstly, these algorithms are least-squares based, and thus
complex-valued tensor X ∈ C I 1 ×I 2 ×...×I N that satisfy
lack robustness to outliers in measurements, such as ubiquitous

R impulsive noise in sensor arrays or networks [18], [19], and
X = :,r ◦ A:,r ◦ · · · ◦ A:,r
A(1) (2) (N )
salt-and-pepper noise in images [20]. Secondly, knowledge of
r =1 tensor rank is a prerequisite to implement these algorithms. Un-
fortunately, tensor rank acquisition from tensor data is known
 [[A(1) , A(2) , . . . , A(N ) ]] (1)
to be NP-hard [14]. Even though for applications in wireless
where A:,r ∈ C I n ×1 is the rth column of the factor matrix
(n ) communications, where the tensor rank can be assumed to be
A(n ) , and ◦ denotes the vector outer product. This decomposi- known as it is related to the number of users or sensors, existing
tion is called canonical polyadic decomposition (CPD), and the decomposition algorithms are still susceptible to degradation
caused by network dynamics, e.g., users joining and leaving the
network, sudden sensor failures, etc.
Manuscript received November 9, 2015; revised May 21, 2016 and July 17,
2016; accepted August 15, 2016. Date of publication August 29, 2016; date
In order to overcome the disadvantages presented in exist-
of current version November 23, 2016. The associate editor coordinating the ing methods, we devise a novel algorithm for complex-valued
review of this manuscript and approving it for publication was Prof. Lei Huang. tensor CPD with orthogonal factors based on the probabilis-
This research was supported in part by the U. S. Army Research Office under tic inference framework. Probabilistic inference is well-known
MURI Grant W911NF-11-1-0036, and in part by the U.S. National Science
Foundation under Grant CCF-1420575.
for providing an alternative formulation to principal compo-
L. Cheng and Y.-C. Wu are with the Department of Electrical and Electronic nent analysis (PCA)[22]. With the inception of probabilistic
Engineering, The University of Hong Kong, Pokfulam, Hong Kong (e-mail: PCA, not only is the conventional singular value decomposition
leicheng@eee.hku.hk; ycwu@eee.hku.hk). (SVD) linked to statistical inference over a probabilistic model,
H. V. Poor is with the Department of Electrical Engineering, Princeton Uni-
versity, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
Color versions of one or more of the figures in this paper are available online 1 It was called the “first kind of ALS algorithm for tensor CPD with orthogonal
at http://ieeexplore.ieee.org. factors (ALS1-CPO)” in [17]. For brevity of discussion, we just call it the OALS
Digital Object Identifier 10.1109/TSP.2016.2603969 algorithm in this paper.

1053-587X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
664 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

advances in Bayesian statistics and machine learning can also be A. Motivating Example 1: Blind Receiver Design for
incorporated to achieve automatic relevance determination [23] DS-CDMA Systems
and outlier removal [24]. Although the probabilistic approach
In a direct-sequence code division multiple access (DS-
is well established in matrix decomposition, extension to the
CDMA) system, the transmitted signal sr (k) from the rth user
tensor counterpart faces its unique challenges, since all the fac-
at the kth symbol period is multiplied by a spreading sequence
tor matrices are nonlinearly coupled via multiple Khatri-Rao
[c1r , c2r , · · · , cZ r ] where cz r is the zth chip of the applied
products [14].
spreading code. Assuming R users transmit their signals si-
In this paper, we propose a probabilistic CPD algorithm for
multaneously to a base station (BS) equipped with M receive
complex-valued tensors with some of the factors being orthog-
antennas, the received data is given by
onal, under unknown tensor rank and in the presence of out-
liers in the observations. In particular, the tensor CPD problem 
R
is reformulated as an inference problem over a probabilistic ym z (k) = hm r cz r sr (k) + wm z (k),
model, wherein the uniform distribution over the Stiefel man- r =1
ifold is leveraged to encode the orthogonality structure. Since
the complicatedly coupled factor matrices in the probabilis- 1 ≤ m ≤ M, 1 ≤ z ≤ Z, (2)
tic model lead to analytically intractable integrations in exact where hm r denotes the flat fading channel between the rth user
Bayesian inference, variational inference is exploited to give and the mth receive antenna at the base station, and wm z (k)
an alternative solution. This results in an efficient algorithm denotes white Gaussian noise. By introducing H ∈ C M ×R
that alternatively estimates the factor matrices, recovers the ten- with its (m, r)th element being hm r , and C ∈ C Z ×R with its
sor rank and mitigates the outliers. Interestingly, the OALS (z, r)th element being cz r , the model (2) can be written in ma-
in [17] can be interpreted as a special case of the proposed 
trix form as Y(k) = R r =1 H:,r ◦ C:,r sr (k) + W(k), where
algorithm. M ×Z
The remainder of this paper is organized as follows. Section II Y(k), W(k) ∈ C are matrices with their (m, z)th ele-
presents the motivating examples and the problem formulation. ments being ym z (k) and wm z (k), respectively. After collecting
In Section III, the CPD problem is interpreted using probability T samples along the time dimension and defining S ∈ C T ×R
density functions, and the corresponding probabilistic model with its (k, r)th element being sr (k), the system model can be
is established. In Section IV, based on a variational inference further written in tensor form as [5]
framework, a robust algorithm for tensor CPD with orthogonal 
R
factors is derived, and its relationship to the OALS algorithm is Y= H:,r ◦ C:,r ◦ S:,r + W
revealed. Simulation results using synthetic data and real-world r =1
applications are reported in Section V. Finally, conclusions are
drawn in Section VI. = [[H, C, S]] + W (3)
Notation: Boldface lowercase and uppercase letters will be
where Y ∈ C M ×Z ×T and W ∈ C M ×Z ×T are third-order ten-
used for vectors and matrices, respectively. Tensors are writ-
sors, which take ym z (k) and wm z (k) as their (m, z, k)th ele-
ten as calligraphic√letters. E[·] denotes the expectation of its
ments, respectively.
argument and j  −1. Superscripts T , ∗ and H denote trans- It is shown in [5] that under certain mild conditions, the CPD
pose, conjugate and Hermitian respectively. δ(·) denotes the of tensor Y, which solves minH ,C ,S Y − [[H, C, S]]2F , can
Dirac delta function. The operator Tr (A) denotes the trace blindly recover the transmitted signals S. Furthermore, since
of a matrix A and ·F represents the Frobenius norm of the transmitted signals are usually uncorrelated and with zero
the argument. The symbol ∝ represents a linear scalar re- mean, the orthogonality structure2 of S can further be taken into
lationship between two real-valued functions. CN (x|u, R) account to give better performance for blind signal recovery [9].
stands for the probability density function of a circularly- Similar models can also be found in blind data detection for
symmetric complex Gaussian vector x with mean u and cooperative communication systems [6]–[7], and in topology
covariance matrix R. CMN (X|M, Σr , Σc ) denotes the learning for wireless sensor networks (WSNs) [8].
complex-valued matrix normal probability density func-
tion p(X) ∝ exp{−Tr(Σ−1 H −1
c (X − M) Σr (X − M))}, and
VMF(X|F) stands for the complex-valued von Mises-Fisher B. Motivating Example 2: Linear Image Coding for a
matrix probability density function p(X) ∝ exp{−Tr(FXH + Collection of Images
XFH )}. The N × N diagonal matrix with diagonal compo- Given a collection of images representing a class of objects,
nents a1 through aN is represented as diag{a1 , a2 , . . . , aN }, linear image coding extracts the commonalities of these images,
while IM represents the M × M identity matrix. The (i, j)th which is important in image compression and recognition [12],
element and the jth column of a matrix A is represented by Ai,j [13]. The kth image of size M × Z naturally corresponds to a
and A:,j , respectively. matrix B(k) with its (m, z)th element being the image’s inten-
sity at that position. Linear image coding seeks the orthogonal
basis matrices U ∈ C M ×R and V ∈ C Z ×R that capture the di-
II. MOTIVATING EXAMPLES AND PROBLEM FORMULATION rections of the largest R variances in the image data, and this
Tensor CPD with orthogonal factors has been widely ex-
ploited in various signal processing applications [1]–[13]. In
this section, we briefly mention two motivating examples, and 2 Strictly speaking, S is only approximately orthogonal. But the approxima-
then we give the general problem formulation. tion gets better and better when the observation length T increases.
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 665

problem can be written as [12], [13] generality, our goal is to estimate an N -tuplet of factor matri-
ces (Ξ(1) , Ξ(2) , . . . , Ξ(N ) ) with the first P (where P < N ) of

K
them being orthonormal, based on the observation Y and in the
min B(k)−Udiag{d1 (k), . . . , dR (k)}VT 2F
U,V ,{d r (k )}R
r=1
absence of the knowledge of noise power β −1 , outlier statistics
k =1
and the tensor rank R. In particular, since we do not know the
s.t. UH U = IR , VH V = IR . (4) exact value of R, it is assumed that there are L columns in each
factor matrix Ξ(n ) , where L is the maximum possible value of
Obviously, if there is only one image (i.e., K = 1), problem
the tensor rank R. Thus, the problem to be solved can be stated
(4) is equivalent to the well-studied SVD problem. Notice that
as
the expression inside the Frobenius norm in (4) can be written as

B(k) − R r =1 U:,r ◦ V:,r dr (k). Further introducing the matrix min βY − [[Ξ(1) , Ξ(2) , . . . , Ξ(N ) ]] − E2F
{Ξ ( n ) }N
n = 1 ,E
D with its (k, r)th element being dr (k), it is easy to see that
problem (4) can be rewritten in tensor form as

L 
N
 2 + γl
(n )H (n )
Ξ:,l Ξ:,l
 
  n =1
   l=1
 R

min   (n )H (n )
 B − U :,r ◦ V :,r ◦ D:,r  s.t. Ξ Ξ
= IL , n = 1, 2, . . . , P, (8)
U,V ,D  
 r =1
     (n )H (n )
  where the regularization term Ll=1 γl ( N n =1 Ξ:,l Ξ:,l ) is
=[[U,V ,D ]] F added to control the complexity of the model and avoid over-
s.t. H H
U U = IR , V V = IR , (5) fitting of noise [21], since more columns (thus more degrees
of freedom) in Ξ(n ) than the true model are introduced, and
where B ∈ C M ×Z ×K is a third-order tensor with B(k) as its kth {γl }Ll=1 are regularization parameters trading off the relative
slice. Therefore, linear image coding for a collection of images importance of the square error term and the regularization term.
is equivalent to solving a tensor CPD with two orthonormal Existing algorithms [17] for tensor CPD with orthonormal
factor matrices. factors cannot be used to solve problem (8), since they have no
mechanism to handle outliers E. Furthermore, the choice of reg-
C. Problem Formulation ularization parameters plays an important role, since setting γl
too large results in excessive residual squared error, while setting
From the two motivating examples above, we take a step fur-
γl too small risks overfitting of noise. In general, determining the
ther and consider a generalized problem in which the observed
optimal regularization parameters (e.g., using cross-validation
data tensor Y ∈ C I 1 ×I 2 ×...×I N obeys the following model:
[27], or the L-curve [28]) requires exhaustive search, and thus is
Y = [[A(1) , A(2) , . . . , A(N ) ]] + W + E (6) computationally demanding. To overcome these problems, we
propose a novel algorithm based on the framework of proba-
where W represents an additive noise tensor
with each ele- bilistic inference, which effectively mitigates the outliers E and
ment wi 1 ,i 2 ,...,i N ∼ CN wi 1 ,i 2 ,...,i N |0, β −1 and with correla-
automatically learns the regularization parameters.
tion E(wi∗1 ,i 2 ,...,i N wτ 1 ,τ 2 ,...,τ N ) = β −1 Nn =1 δ(τn − in ); E de-
notes potential outliers in measurements with each element III. PROBABILISTIC MODEL FOR TENSOR CPD
ei 1 ,i 2 ,...,i N taking an unknown value if an outlier emerges, and WITH ORTHOGONAL FACTORS
otherwise taking the value zero. Since the number of orthogonal
factor matrices could be known a priori in specific application, Before solving problem (8), we interpret different terms in (8)
it is assumed that {A(n ) }Pn =1 are known to be orthogonal where as probability density functions, based on which a probabilistic
P < N , while the remaining factor matrices are unconstrained. model that encodes our knowledge of the observation and the
Due to the orthogonality structure of the first P factor matrices unknowns can be established.
{A(n ) }Pn =1 , they can be written as A(n ) = U(n ) Λ(n ) where Firstly, since the elements of the additive noise W is white,
U(n ) is an orthonormal matrix and Λ(n ) is a diagonal matrix. zero-mean and circularly-symmetric complex Gaussian, the
Putting A(n ) = U(n ) Λ(n ) for 1 ≤ n ≤ P into the definition of squared error term in problem (8) can be interpreted as the
the tensor CPD in (1), it is easy to show that negative log of the likelihood given by [21]

[[A(1) , A(2) , . . . , A(N ) ]] = [[Ξ(1) , Ξ(2) , . . . , Ξ(N ) ]] (7) p Y | Ξ(1) , Ξ(2) , . . . , Ξ(N ) , E, β

with Ξ(n ) = U(n ) Π for 1 ≤ n ≤ P , Ξ(n ) = A(n ) Π for ∝ exp − βY − [[Ξ(1) , Ξ(2) , . . . , Ξ(N ) ]] − E2F . (9)
P + 1 ≤ n ≤ N − 1, and Ξ(N ) = A(N ) Λ(1) Λ(2) · · · Λ(P ) Π,
where Π ∈ C R ×R is a permutation matrix. From (7), it can Secondly, the regularization term in problem (8) can be in-
be seen that up to the scaling and permutation indeterminacy, terpreted as arising from a circularly-symmetric complex Gaus-
sian prior distribution over the columns of the factor matri-
the tensor CPD under orthogonal constraints is equivalent to L (n ) −1
that under orthonormal constraints. In general, the scaling and ces, i.e., Nn =1 l=1 CN (Ξ:,l | 0I n ×1 , γl IL ) [21]. Note that
permutation ambiguity can be easily resolved using side in- the columns of the factor matrices are independent of each
formation [5]. On the other hand, for those applications that other, and the lth columns in all factor matrices {Ξ(n ) }N n =1
seek the subspaces spanned by the factor matrices, such as lin- share the same variance γl−1 . This has the physical interpre-
ear image coding described in Section II.B, the scaling and tation that if γl is large, the lth columns in all Ξ(n ) ’s will
permutation ambiguity can be ignored. Thus, without loss of be effectively “switched off”. On the other hand, for the first
666 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

P factor matrices {Ξ(n ) }Pn =1 , there are additional hard con-


straints in problem (8), which correspond to the Stiefel manifold
[33] VL (C I n ) = {A ∈ C I n ×L : AH A = IL } for 1 ≤ n ≤ P .
(n )H
Since the orthonormal constraints result in Ξ:,l Ξ:,l = 1, the
hard constraints would dominate the Gaussian distribution of the
columns in {Ξ(n ) }Pn =1 . Therefore, Ξ(n ) can be interpreted as
being uniformly distributed over the Stiefel manifold VL (C I n )
for 1 ≤ n ≤ P , and Gaussian distributed for P + 1 ≤ n ≤ N :

P
p(Ξ(1) , Ξ(2) , · · · , Ξ(P ) ) ∝ IV L (C I n ) (Ξ(n ) ),
n =1

p(Ξ (P +1)
,Ξ (P +2)
,··· ,Ξ (N )
)

N 
L
(n )
Fig. 1. Probabilistic model for tensor CPD with orthogonal factors.
= CN Ξ:,l |0I n ×1 , γl−1 IL , (10)
n =P +1 l=1
IV. VARIATIONAL INFERENCE FOR TENSOR FACTORIZATION
where IV L (C I n ) (Ξ(n ) ) is an indicator function with
IV L (C I n ) (Ξ(n ) ) = 1 when Ξ(n ) ∈ VL (C I n ), and otherwise Let Θ be a set containing the factor matrices {Ξ(n ) }N n =1 , and
I 1 ,...,I N
IV L (C I n ) (Ξ(n ) ) = 0. For the parameters β and {γl }Ll=1 , which other variables E, {γl }l=1 , {ζi 1 ,...,i N }i 1 =1,...,i N =1 , β . From the
L

correspond to the inverse noise power and the variances of probabilistic model established above, the marginal probability
columns in the factor matrices, since we have no informa- density functions of the unknown factor matrices {Ξ}N n =1 are
tion about their distributions, a non-informative Jeffrey’s prior given by
[27] is imposed on them, i.e., p(β) ∝ β −1 and p(γl ) ∝ γl−1 for 
p(Y, Θ)
l = 1, . . . , L. p(Ξ(n ) |Y) = dΘ\Ξ(n ) , n = 1, 2, . . . , N, (13)
p(Y)
Finally, although the generative model for outliers Ei 1 ,...,i N
is unknown, the rare occurrence of outliers motivates us to em- where
ploy Student’s t distribution as its prior [27], i.e., p(Ei 1 ,...,i N ) = 

P   
N
T (Ei 1 ,...,i N |0, ci 1 ,...,i N , di 1 ,...,i N ). To facilitate the Bayesian in- p(Y, Θ) ∝ IV L (C I n ) Ξ (n )
exp In − 1 ln β
ference procedure, Student’s t distribution can be equivalently n =1 n =1
represented as a Gaussian scale mixture as follows [34]:

N 
L 
N
T (Ei 1 ,...,i N | 0, ci 1 ,...,i N , di 1 ,...,i N ) + In + 1 ln γl − Tr Γ (n )H
Ξ (n )
Ξ


n =P +1 l=1 n =P +1
= CN Ei 1 ,...,i N | 0, ζi−1 1 ,...,i N

I1 
IN
 
+ ··· (ci 1 ,...,i N − 1) ln ζi 1 ,...,i N − di 1 ,...,i N ζi 1 ,...,i N
× gamma (ζi 1 ,...,i N | ci 1 ,...,i N , di 1 ,...,i N ) dζi 1 ,...,i N . (11)
i 1 =1 i N =1
This means that Student’s t distribution can be obtained by
mixing an infinite number of zero-mean circularly-symmetric 
I1 
IN

+ ··· ln ζi 1 ,...,i N − ζi 1 ,...,i N Ei∗1 ,...,i N Ei 1 ,...,i N


complex Gaussian distributions where the mixing distribution
i 1 =1 i N =1
on the precision ζi 1 ,...,i N is the gamma distribution with param- 
eters ci 1 ,...,i N and di 1 ,...,i N . In addition, since the statistics of
outliers such as means and correlations are generally unavailable − βY − [[Ξ(1) , Ξ(2) , . . . , Ξ(N ) ]] − E2F (14)
in practice, we set the hyper-parameters ci 1 ,...,i N and di 1 ,...,i N
as 10−6 to produce a non-informative prior on Ei 1 ,...,i N , and with Γ = diag{γ1 , · · · , γR }.
assume outliers are independent of each other: Since the factor matrices and other variables are nonlinearly

I1 
IN

coupled in (14), the multiple integrations in (13) are analytically
p (E) = · · · T Ei 1 ,...,i N |0, ci 1 ,...,i N =10−6 , di 1 ,...,i N =10−6 . intractable, which prohibits exact Bayesian inference. To han-
i 1 =1 i N =1 dle this problem, Monte Carlo statistical methods [25], [26], in
(12) which a large number of random samples are generated from the
The complete probabilistic model is shown in Fig. 1. Notice joint distributions and marginalization is approximated by op-
that the proposed probabilistic model in this paper is different erations on samples, can be explored. These Monte Carlo based
from that of existing works on tensor decompositions [30]– approximations can approach the exact multiple integrations
[32]. In particular, existing tensor probabilistic models do not when the number of samples approaches infinity, which how-
take orthogonality structure into account. Furthermore, existing ever is computationally demanding [27]. More recently, varia-
tenor decompositions [30]–[32], [43]–[45] are designed for real- tional inference, in which another distribution that is close to the
valued tensors only, and thus cannot process the complex-valued true posterior distribution in the Kullback-Leibler (KL) diver-
data arising in applications such as wireless communications gence sense is sought, has been exploited to give deterministic
[5]–[9] and functional magnetic resonance imaging [3]. approximations to the intractable multiple integrations [29].
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 667

More specifically, in variational inference, a variational distri-


bution with probability density function Q(Θ) that is the closest
among a given set of distributions to the true posterior distribu-
tion p(Θ | Y) = p(Θ, Y)/p(Y) in the KL divergence sense is
sought [29]:
 

p (Θ | Y)
KL Q (Θ) p (Θ | Y)  −EQ (Θ ) ln . (15)
Q (Θ)
The KL divergence vanishes when Q(Θ) = p(Θ | Y) if no
constraint is imposed on Q(Θ), which however leads us back
to the original intractable posterior distribution. A common ap-
proach is to apply the mean field approximation, which as-
sumes that the variational probability density takes a fully fac-
torized form Q(Θ) = k Q(Θk ), Θk ∈ Θ. Furthermore, to
facilitate the manipulation of hard constraints on the first P
factor matrices, their variational densities are assumed to take Fig. 2. Unfolding operation for a third-order tensor.
a Dirac delta functional form Q(Ξ(k ) ) = δ(Ξ(k ) − Ξ̂(k ) ) for
k = 1, 2, . . . , P , where Ξ̂(k ) is a parameter to be derived.
Under these approximations, the probability density functions U(k ) [A]2F = Tr(U(k ) [A](U(k ) [A])H ) [14], where the unfold-
Q(Θk ) of the variational distribution can be obtained analyti- ing operation {U(k ) [A]}k =1,2,...,N on an N th-order tensor
cally via [29] A ∈ C I 1 ×···×I N along its kth mode is specified as U(k ) [A]
     N
Q(Ξ(k ) ) = δ Ξ(k ) −arg max E Θ
= Ξ ( k ) Q (Θ j ) ln p (Y, Θ) , = Ii 11 =1 · · · Ii nN=1 ai 1 ,...,i N eIi kk [ eIi nn ]T . In this ex-
n =1,n
= k
Ξ(k ) j
   pression, the elementary vector eIi nn ∈ RI n ×1 is all zeroes ex-
Ξ̂ ( k )
cept for a 1 at the ithn location, and the multiple Khatri-
k = 1, 2, . . . , P, N
(16) Rao products A(n ) = A(N ) · · · A(k +1) A(k −1)
n =1,n
= k
and · · · A(1) . For example, the unfolding operation for a third-
  
Q(Θk ) ∝ exp E j
= k Q (Θ j ) ln p (Y, Θ) , Θk ∈Θ\{Ξ(k ) }Pk=1 . order tensor is illustrated in Fig. 2. After expanding the square
of the Frobenius norm and taking expectations, the parameter
(17) Ξ̂(k ) for each variational density in {Q(Ξ(k ) )}Pk=1 can be ob-
Obviously, these variational distributions are coupled in the tained from the problem (19) shown at the bottom of this page.
sense that the computation of the variational distribution of one Using the fact that the feasible set for parameter Ξ(k ) is the
parameter requires the knowledge of the variational distribu- Stiefel manifold VL (C I k ), i.e., Ξ(k )H Ξ(k ) = IL , the term G(k )
tions of other parameters. Therefore, these variational distribu-
is irrelevant to the factor matrix of interest Ξ(k ) . Consequently,
tions should be updated iteratively. In the following, an explicit
problem (19) is equivalent to
expression for each Q (·) is derived.

A. Derivation for Q(Ξ(k ) ), 1 ≤ k ≤ P Ξ̂(k ) = arg max Tr F(k ) Ξ(k )H + Ξ(k ) F(k )H , (20)
Ξ ( k ) ∈V L (C I k )
By substituting (14) into (16) and only keeping the terms
relevant to Ξ(k ) (1 ≤ k ≤ P ), we directly have where F(k ) was defined in the first line of (19). Problem (20) is a
 non-convex optimization problem, as its feasible set VL (C I k ) is
Ξ̂(k ) = arg max E Θ
= Ξ ( k ) Q (Θ j ) non-convex [37]. While in general (20) can be solved by numer-
Ξ ( k ) ∈V L (C I k ) j
ical iterative algorithms based on a geometric approach or the
 alternating direction method of multipliers [37], a closed-form
− βY − [[Ξ(1) , · · · , Ξ(N ) ]] − E2F . (18)
optimal solution can be obtained by noticing that the objective
To expand the square of the Frobenius norm inside function in (20) has the same functional form as the log of
the expectation in (18), we use the result that A2F = the von Mises-Fisher matrix distribution with parameter matrix


  N ∗
(k )
Ξ̂ = arg max Tr EQ (β ) [β]U (k )
Y − EQ (E) [E] EQ (Ξ ( n ) ) [Ξ(n ) ] Ξ(k )H + Ξ(k ) F(k )H
Ξ ( k ) ∈V L (C I k ) n =1,n
= k
  
F ( k )
 
 N T  N ∗ 
− Tr Ξ(k )H Ξ(k ) EQ (β ) [β]E N (n ) ) Ξ(n ) Ξ(n ) + E L Q (γ l ) [Γ] (19)
n = 1 , n
= k Q (Ξ n =1,n
= k n =1,n
= k l= 1
  
G ( k )
668 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

F(k ) , and the feasible set in (20) also coincides with the support where D[A] is a diagonal matrix taking the diagonal element
of this von Mises-Fisher matrix distribution [35]. As a result, N
from A, and the multiple Hadamard products A(n ) =
we have n =1,n
= k
 
A(N ) · · · A(k +1) A(k −1) · · · A(1) .
Ξ̂(k ) = arg max ln VMF Ξ(k ) | F(k ) . (21)
Ξ(k )
Proof: See Appendix B. 
Then, the closed-form solution for problem (21) can be ac-
quired using Property 1 below, which has been proved in [33]. C. Derivation for Q (E)
Property 1: Suppose the matrix A ∈ C κ 1 ×κ 2 follows a von
Mises-Fisher matrix distribution with parameter matrix F ∈ The variational density Q (E) can be obtained by taking only
C κ 1 ×κ 2 . If F = UΦVH is the SVD of the matrix F, then the the terms relevant to E after substituting (14) into (17), and can
unique mode of VMF (A | F) is UVH . be expressed as
From Property 1, it is easy to conclude that Ξ̂(k ) = 

I1 
IN ! !2
Υ(k ) Π(k )H , where Υ(k ) and Π(k ) are the left-orthonormal ! !
Q (E)∝ ··· exp E Θ Q (Θ j ) − ζi 1 ,...,i N !Ei 1 ,...,i N !
matrix and right-orthonormal matrix from the SVD of F(k ) , j
= E
i 1 =1 i N =1
respectively. "
! 
L 
N !2

! (n ) !
B. Derivation for Q Ξ(k ) , P + 1 ≤ k ≤ N −β !Yi 1 ,...,i n − Ξi n ,l − Ei 1 ,...,i N ! . (26)

l=1 n =1
Using (14) and (17), the variational density Q Ξ(k ) (P +
1 ≤ k ≤ N ) is derived in Appendix A to be a circularly- After taking expectations, the term inside the exponent of
symmetric complex matrix normal distribution [35] as (26) is

Q Ξ(k ) = CMN (Ξ(k ) | M(k ) , II k , Σ(k ) ) (22)  


− Ei∗1 ,...,i N EQ (β ) [β] + EQ (ζ i 1 , . . . , i N ) [ζi 1 ,...,i N ] Ei 1 ,...,i N
where   

 p i 1 , . . . , i N
(k )
Σ = EQ (β ) [β] E N Q (Ξ ( n ) )
n = 1 , n
= k

−1 + 2Re Ei∗1 ,...,i N pi 1 ,...,i N


 N
T  N
∗ 
Ξ(n ) Ξ(n ) + E L [Γ] "
n =1,n
= k n =1,n
= k l= 1 Q (γ l )

L 
N  
×EQ (β ) [β]p−1
(n )
(23) i 1 ,...,i N Yi 1 ,...,i N − EQ (Ξ ( n ) ) Ξi n ,l .
   
l=1 n =1

M(k ) = EQ (β ) [β] U(k ) Y − EQ (E) [E]
m i 1 , . . . , i N
 N ∗
(27)
× EQ (Ξ ( n ) ) [Ξ(n ) ] Σ(k ) . (24)
n =1,n
= k
Since (27) is a quadratic function with respect to Ei 1 ,...,i N , it
Due to the fact that Q(Ξ(k ) ) is Gaussian, the parameter M(k )
is easy to show that
is both
the

expectation and the mode of the variational density
Q Ξ(k ) .

I1 IN

To calculate M(k ) , some expectation computations are re- Q (E) = ··· CN Ei 1 ,...,i N | mi 1 ,...,i N , p−1
i 1 ,...,i N .
quired as shown in (23) and (24). For those with the form i 1 =1 i N =1
EQ (Θ k ) [Θk ] where Θk ∈ Θ, the value can be easily ob- (28)
tained if the corresponding Q(Θk ) is available. The remain- Notice that from (27), the computation of outlier mean
ing challenge stems from the expectation E N Q (Ξ ( n ) ) mi 1 ,...,i N can be rewritten as mi 1 ,...,i N = n1 n2 , where
n = 1 , n
= k
N N (n ) ∗
[( Ξ (n ) T
) ( Ξ ) ] in (23). But its calculation
−1
n =1,n
= k n =1,n
= k
EQ (ζ i 1 , . . . , i N ) [ζi 1 ,...,i N ]
becomes straightforward after exploiting the orthonormal struc- n1 =
−1
−1
ture of {Ξ̂(k ) }Pk=1 and the property of multiple Khatri-Rao prod- EQ (ζ i 1 , . . . , i N ) [ζi 1 ,...,i N ] + EQ (β ) [β]
ucts, as presented in the following property.
Property 2: Suppose the matrix A(n ) ∈ C κ n ×ρ ∼ δ(A(n ) −  (n )
and n2 = Yi 1 ,...,i N − Ll=1 ( N n =1 EQ (Ξ ( n ) ) [Ξi n ,l ]). From the
 ) for 1 ≤ n ≤ P , where Â(n ) ∈ Vρ (C κ n ) and P < N , and
(n )
general data model in (6), it can be seen that n2 consists of
the matrix A(n ) ∈ C κ n ×ρ ∼ CMN (A(n ) | M(n ) , Iκ n , Σ(n ) ) the estimated outliers plus noise. On the other hand, since
for P + 1 ≤ n ≤ N . Then,
−1
EQ (ζ i 1 , . . . , i N ) [ζi 1 ,...,i N ] and (EQ (β ) [β])−1 can be interpreted
 N T  N ∗ 
as the estimated power of the outliers and the noise respectively,
E N p(A ( n ) )
A(n ) A(n )
n = 1 , n
= k n =1,n
= k n =1,n
= k n1 represents the strength of the outliers in the estimated out-
 N (n )H (n )
∗  liers plus noise. Therefore, if the estimated power of the out-
=D M M + κn Σ(n ) (25) liers (EQ (ζ i 1 , . . . , i N ) [ζi 1 ,...,i N ])−1 goes to zero, the outlier mean
n =P +1,n
= k mi 1 ,...,i N becomes zero accordingly.
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 669

D. Derivations for Q (γl ) , Q (ζi 1 ,...,i N ) , and Q (β) tion for each variable in Θ should be iteratively updated. The
iterative algorithm is summarized as follows.
Using (14) and (17) again, the variational density Q (γl ) can
Initializations:
be expressed as
 N Choose L > R and initial values {Ξ̂(n ,0) }Pn =1 , {M(n ,0) ,
 Σ(n ,0) }N ˜0 ˜0
n =P +1 , b̃l , {c̃i 1 ,...,i N , di 1 ,...,i N } and f for all l and
0 0
Q (γl ) ∝ exp In −1 ln γl i1 , · · · , iN .
 N
n =P +1
   Let ãl = N n =P +1 In and ẽ = n =1 In .
ã l
Iterations: For the tth iteration (t ≥ 1),
" Update the statistics of outliers: {pi 1 ,··· ,i N , mi 1 ,··· ,i N }Ii 11 =1,···
,...,I N
,i N =1

N  
(n )H (n )
− γl EQ (Ξ ( n ) ) Ξ:,l Ξ:,l (29)

n =P +1
  ẽ c̃t−1
i 1 ,··· ,i N
pti 1 ,··· ,i N = + t−1 (33)
b̃ l f˜t−1 d˜i 1 ,··· ,i N
which has the same functional form as the probability density

function of the gamma distribution, i.e., Q(γl ) = gamma(γl | mti 1 ,··· ,i N = Yi 1 ,··· ,i N
ãl , b̃l ). Since EQ (γ l ) [γl ] = ãl /b̃l is required for updating the f˜t−1 pti 1 ,··· ,i N
variational distributions of other variables in Θ, we need P N "
L  (n ,t−1) 
to compute ãl and b̃l . While computation of ãl is straight- − Ξ̂i n ,l
(n ,t−1)
Mi n ,l
forward, the computation of b̃l can be facilitated by us- l=1 n =1 n =P +1
ing the correlation property of the matrix normal distribu- (34)
(n )H (n ) (n )H (n ) (n )
tion EQ (Ξ ( n ) ) [Ξ:,l Ξ:,l ] = M:,l M:,l + In Σl,l [35] for
P + 1 ≤ n ≤ N.
Similarly, using (14) and (17), the variational densities Update the statistics of factor matrices: {M(k ) , Σ(k ) }N
k =P +1
Q (ζi 1 ,...,i N ) and Q (β) can be found to be gamma distributions
as

Σ(k ,t) =
Q (ζi 1 ,...,i N ) = gamma ζi 1 ,...,i N | c̃i 1 ,...,i N , d˜i 1 ,...,i N (30)
ẽ   ∗ 

D
N
M(n ,t−1)H M(n ,t−1) +In Σ(n ,t−1)
Q (β) = gamma β | ẽ, f˜ (31) f˜t−1 n =P +1,n
= k

with parameters c̃i 1 ,...,i N = ci 1 ,...,i N + 1, d˜i 1 ,...,i N = di 1 ,...,i N   −1


N ã1 ãL
+ (mi 1 ,...,i N )∗ mi 1 ,...,i N + p−1
i 1 ,...,i N , ẽ = n =1 In , and f =
˜ + diag t−1 , . . . , t−1 (35)
b̃1 b̃L
E N Q (Ξ ( n ) )Q (E) [Y −[[Ξ ,· · ·, Ξ ]]−EF ]. For c̃i 1 ,...,i N ,
(1) (N ) 2
n=1

d˜i 1 ,...,i N and ẽ, the computations are straightforward. f˜ is de- ẽ (k )  

M(k ,t) = U Y − Mt
rived in Appendix C to be ˜
f t−1
   P ∗

I1 
IN
×
N
M(n ,t−1) Ξ̂(n ,t−1) Σ(k ,t) (36)
f˜ = Y − M2F + ··· p−1
i 1 ,...,i N n =P +1,n
= k n =1
i 1 =1 i N =1
  N  ∗ 
+ Tr D M(n )H M(n ) + In Σ(n ) Update the orthonormal factor matrices {Ξ̂(k ) }Pk=1
n =P +1
    
− 2Re Tr U(1) Y − M
N
M(n )  (k ,t) (k ,t)  ẽ (k )  
n =P +1 Υ ,Π = SVD U Y − Mt
f˜t−1
 P
∗  "
Ξ̂(n ) Ξ̂(1)H (32)  N
  P
∗
n =2
× M (n ,t)
Ξ̂ (n ,t−1)
n =P +1 n =1,n
= k
where M is a tensor with its (i1 , . . . , iN ) element being th

mi 1 ,...,i N , and Re(·) denotes the real part of its argument. Al- Ξ̂(k ,t) = Υ(k ,t) Π(k ,t)H (37)
though Eq. (32) for computing f˜ is complicated, its meaning is
clear when we refer to its definition below (31), from which it
can be seen that f˜ represents the estimate of the overall noise Update {b̃l }Ll=1 , {d˜i 1 ,...,i N }Ii 11 =1,...,i
,...,I N
N =1
and f˜
power.

N
E. Summary of the Iterative Algorithm b̃tl = M:,l
(n ,t)H
M:,l
(n ,t)
+ In Σl,l
(n ,t)
(38)
From the expressions for Q(Θk ) evaluated above, it is seen n =P +1
that the calculation of a particular Q (Θk ) relies on the statistics c̃ti 1 ,...,i N = c̃0i 1 ,...,i N + 1 (39)
of other variables in Θ. As a result, the variational distribu-
670 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017



d˜ti 1 ,...,i N = d˜0i 1 ,...,i N + mti 1 ,...,i N mti 1 ,...,i N + 1/pti 1 ,...,i N TABLE I
THREE DIFFERENT OUTLIER MODELS
(40)

I1 
IN
f˜t = Y − Mt 2F + ··· (pti 1 ,...,i N )−1
i 1 =1 i n =1
   ∗ 
N
+ Tr D M(n ,t)H M(n ,t) +In Σ(n ,t)
n =P +1 gorithm in [17]. In this regard, the proposed algorithm not only

  N
 provides a probabilistic interpretation of the OALS algorithm,
− 2Re Tr U(1) Y − Mt M(n ,t) but also has the additional properties in automatic rank determi-
n =P +1 nation, outlier removal and learning of the noise power.
 4) Computational Complexity: For each iteration, the com-
 P ∗
Ξ̂(n ,t) (1,t)H plexity is dominated by updating each factor matrix, costing
Ξ̂ (41) 
n =2 O( N 2
n =1 In L + N N n =1 In L). Thus, the overall complex-
N
ity is about O(q( N I
n =1 n L 2
+ N n =1 In L)) where q is the
Until Convergence number of iterations needed for convergence. On the other hand,
for the OALS algorithm with exact tensor rank R, its complexity
 N
F. Further Discussions is O(m( N 2
n =1 In R + N n =1 In R)) where m is the number
To gain further insight from the above proposed CPD algo- of iterations needed for convergence. Therefore, for each itera-
rithm, discussions of its convergence property, automatic rank tion, the complexity of the proposed algorithm is comparable to
determination, relationship to the OALS algorithm and compu- that of the OALS algorithm.
tational complexity are presented in the following.
1) Convergence Property: Although the functional mini- V. SIMULATION RESULTS AND DISCUSSIONS
mization of the KL divergence in (15) is non-convex over the In this section, numerical simulations are presented to as-
mean-field family Q(Θ) = k Q(Θk ), it is convex with re-
sess the performance of the proposed algorithm (labeled as
spect to a single variational density Q(Θk ) when the others
VB) using synthetic data and two applications, in compari-
{Q(Θj )|j
= k} are fixed [29]. Therefore, the proposed algo-
son with various state-of-the-art tensor CPD algorithms. The
rithm, which iteratively updates the optimal solution for each
algorithms being compared include the ALS [15], the simulta-
Θk , is essentially a coordinate-descent algorithm in the func-
neous diagonalization method for coupled tensor CPD (labeled
tional space of variational distributions with each update solving
as SD) [40], the direct algorithm for CPD followed by enhanced
a convex problem. This guarantees monotonic decrease of the
ALS (labeled as DIAG-A) [41], the Bayesian tensor CPD (la-
KL divergence in (15), and the proposed algorithm is guaranteed
beled as BCPD) [32], the robust iteratively reweighed ALS
to converge to at least a stationary point [39, Theorem 2.1].
(labeled as IRALS) [42], and the OALS algorithm (labeled as
2) Automatic Rank Determination: The automatic rank de-
OALS) [17]. In all experiments, three outlier models are con-
termination for the tensor CPD uses an idea from the Bayesian
sidered, and they are listed in Table I. For all the simulated
model selection (or Bayesian Occam’s razor) [27, pp. 157].
More specifically, the parameters {γl }Ll=1 control the model algorithms, the initial factor matrix Ξ̂(n ,0) is set as the matrix
complexity, and their optimal variational densities are obtained consisting of L leading left singular vectors of U(n ) [Y] where
together with those of other parameters by minimizing the KL L = max{I1 , I2 , . . . , IN } for the proposed algorithm and the
divergence. After convergence, if some E[γl ] are very large, BCPD, and L = R for other algorithms. The initial parameters
e.g., 106 , this indicates that their corresponding columns in of the proposed algorithm {c̃0i 1 ,...,i N , d˜0i 1 ,...,i N } are set as 10−6
 ˜0 N In ,
{M(n ) }Nn =P +1 can be “switched off”, as they play no role in for all i1 , . . . , iN , b̃0l = N n =P +1 In for all l, f = n =1
explaining the data. Furthermore, according to the definition of and {Σ(n ,0) }N n =P +1 are all set to be I L . All the algorithms ter-
the tensor CPD in (1), the corresponding columns in {Ξ̂(n ) }Pn =1 minate at the tth iteration when [[A(1,t) , A(2,t) , . . . , A(N ,t) ]] −
should also be pruned accordingly. Finally, the learned tensor [[A(1,t−1) , A(2,t−1) , . . . , A(N ,t−1) ]]2F < 10−6 or the iteration
rank R is the number of remaining columns in each estimated number exceeds 2000.
factor matrix Ξ̂(n ) .
3) Relationship to the OALS: If the tensor rank R is known,
the regularization term in (8) is not needed, and consequently A. Validation on Synthetic Data
there are no parameters {ãl , b̃l }Ll=1 . Further restricting Q(Ξ(k ) ) Synthetic tensors are used in this subsection to assess
to be δ(Ξ(k ) − M(k ) ) for P + 1 ≤ k ≤ N , it can be shown that the performance of the proposed algorithm on convergence,
all the equations in the above algorithm still hold except that rank learning ability and factor matrix recovery under dif-
the term In Σ(n ,t−1) in (35) and the term In Σ(n ,t) in (41) are ferent outlier models. A complex-valued third-order tensor
removed. Then, the proposed algorithm is a robust version of [[A(1) , A(2) , A(3) ]] ∈ C 12×12×12 with rank R = 5 is consid-
OALS, even covering the case of P = N . If we further have ered, where the orthogonal factor matrix A(1) is constructed
the knowledge that outliers do not exist, only (35)–(37) remain. from the R leading left singular vectors of a matrix drawn
Interestingly, this resulting algorithm is exactly the OALS al- from CMN (A|012×5, , I12×12 , I5×5 ), and the factor matrices
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 671

Fig. 3. Convergence of the proposed algorithm under different outlier models.

{A(n ) }3n =2 are drawn from CMN (A| 012×5, , I12×12 , I5×5 ).
Parameters for outlier models are set as π = 0.05, σe2 =
100, H = 10 arg maxi 1 ,...,i N |[[A(1) , A(2) , A(3) ]]i 1 ,...,i N |, μ =
3, λ = 1/50 and ν = 10. The signal-to-noise ratio (SNR) is
defined as 10 log10 ([[A(1) , A(2) , A(3) ]]2F / W2F ) [5], [17].
Each result in this subsection is obtained by averaging 500
Monte-Carlo runs.
Fig. 3 presents the convergence performance of the
proposed algorithm under different outlier models, where
the mean-square-error (MSE) [[Ξ̂(1) , Ξ̂(2) , Ξ̂(3) ]] − [[A(1) ,
A(2) , A(3) ]]2F is chosen as the assessment criterion. From
Fig. 3, it can be seen that the MSEs decrease significantly in
the first few iterations and converge to stable values quickly,
demonstrating the rapid convergence property. Furthermore, by
comparing the simulation results with outliers to that without
outliers, it is clear that the proposed algorithm is effective in
mitigating outliers. Fig. 4. Rank determination using (a) the proposed method and (b) the
For tensor rank learning, the simulation results of the pro- Bayesian tensor CPD [32].
posed algorithm are shown in Fig. 4(a), while those of the
Bayesian tensor CPD algorithm are shown in Fig 4(b). Each
vertical bar in the figures shows the mean and standard de- orthogonal factor matrix A(1) under different outlier mod-
viation of rank estimates, with the red horizontal dotted lines els. The criterion is set as the best congruence ratio defined
indicating the true tensor rank. The percentages of correct es- as minΔ P A(1) − Ξ̂(1) PΔF /A(1) F , where the diagonal
timates are also shown on top of the figures. From Fig. 4(a), matrix Δ and the permutation matrix P are found via the greedy
it is seen that the proposed method can recover the true ten- least-squares column matching algorithm [5]. From Fig. 5(a), it
sor rank with 100% accuracy when SNR ≥ 5 dB, both with or is seen that both the proposed algorithm and OALS perform bet-
without outliers. This shows the accuracy and robustness of the ter than other algorithms when outliers are absent. This shows
proposed algorithm when the noise power is moderate. Even the importance of incorporating the orthogonality information
though the performance at low SNRs is not as impressive as that of the factor matrix. On the other hand, while OALS offers the
at high SNRs, it can be observed that the proposed algorithm same performance as the proposed algorithm when there is no
still gives estimates close to the true tensor rank with the true outlier, its performance is significantly different in the presence
rank lying mostly within one standard deviation from the mean of outliers, as presented in Fig. 5(b)– (d). Furthermore, since all
estimate. On the other hand, in Fig. 4(b), it is observed that the algorithms except VB and IRALS have not taken the out-
while the Bayesian tensor CPD algorithm performs nearly the liers into account, their performances degrade significantly as
same as the proposed algorithm without outliers, it gives tensor shown in Fig. 5(b)–(d). Even though the IRALS uses the robust
rank estimates very far away from the true value when outliers lp (0 < p ≤ 1) norm optimization to alleviate the effects of out-
are present. liers, it cannot learn the statistical information of the outliers,
Fig. 5 compares the proposed algorithm to other state-of- leading to its worse performance in outliers mitigation than that
the-art CPD algorithms in terms of recovery accuracy of the of the proposed algorithm.
672 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

Fig. 5. Performance of factor matrix recovery versus SNR under different outlier models. (a) No outliers, (b) Bernoulli-Gaussian, (c) Bernoulli-Uniform, (d)
Bernoulli-Student’s t.

B. Blind Data Detection for DS-CDMA Systems C. Linear Image Coding for Face Images

In this subsection, we consider an uplink DS-CDMA In this subsection, we conduct experiments on 165 face im-
system, in which R = 5 users communicate with the BS ages from the Yale Face Database3 [38], representing different
equipped with M = 8 antennas over flat fading channels hm r ∼ facial expressions (also with or without sunglasses) of 15 people
CN (hm r |0, 1). The transmitted data sr (k) are random binary (11 images for each person). In each classification experiment,
phase-shift keying (BPSK) symbols. The spreading code is of we randomly choose two people’s images. Among these 22 im-
length Z = 6, and with each code element cz r ∼ CN (cz r |0, 1). ages, 12 (6 from each person) are used for training. In particular,
After observing the received tensor Y ∈ C 8×6×100 , the pro- each image is of size 240 × 320, and the training data can be
posed algorithm and other state-of-the-art tensor CPD algo- naturally represented by a third-order tensor Y ∈ R240×320×12 .
rithms, combined with ambiguity removal and constellation Various state-of-the-art tensor CPD algorithms and the proposed
mapping [5], [9], are executed to blindly detect the transmit- algorithm are run to learn the two orthogonal basis matrices (see
ted data. Their performance is measured in terms of bit error (4)). Then, the feature vectors of these 12 training images, which
rate (BER). are obtained by projecting them onto the multilinear subspaces
The BERs versus SNR under different outlier models are spanned by the two orthogonal basis matrices, are used to train
presented in Fig. 6, which are averaged over 10000 indepen- a support vector machine (SVM) classifier. For the 10 testing
dent trials. The parameter settings for different outlier models images, their feature vectors are fed into the SVM classifier
are the same as those in the last subsection. It is seen from to determine which person is in each image. The parameters
Fig. 6(a) that when there are no outliers, the proposed algorithm of various outlier models are: π = 0.05, σe2 = 100, H = 100,
and OALS behave the same, and both outperform other CPDs. μ = 1, λ = 1/1000 and ν = 20.
However, when outliers exist, it is seen from Fig. 6(b)–(d) that
the proposed algorithm performs significantly better than other
3 http://vision.ucsd.edu/content/yale-face-database
algorithms.
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 673

Fig. 6. BER versus SNR under different outlier models. (a) No outliers, (b) Bernoulli-Gaussian, (c) Bernoulli-Uniform, (d) Bernoulli-Student’s t.

TABLE II
CLASSIFICATION ERROR AND CPD COMPUTATION TIME IN FACE RECOGNITION

Since the tensor rank is not known in the image data, it should as 12 when outliers exist. On the other hand, no matter whether
be carefully chosen. For the algorithms (ALS, SD, IRALS, there are outliers or not, the proposed algorithm automatically
DIAG-A and OALS) that cannot automatically determine the learns the appropriate tensor rank without exhaustive search,
rank, it can be obtained by first running the algorithms with and thus saves considerable computational complexity.
tensor rank ranges from 1 to 12, and then finding the knee point The average classification errors of 10 independent experi-
of the reconstruction error decrement [27]. When there are no ments and the corresponding average CPD computation times
outliers, it is able to find the appropriate tensor rank. However, (benchmarked in Matlab on a personal computer with an i7
when outliers exist, the knee point cannot be found and we set CPU) are shown in Table II, and it can be seen that the pro-
the rank as the upper bound 12. For the BCPD, although it learns posed algorithm provides the smallest classification error under
the appropriate rank when there are no outliers, it learns the rank all considered scenarios.
674 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

VI. CONCLUSION APPENDIX B


In this paper, a probabilistic CPD algorithm with orthogonal PROOF OF PROPERTY 2
factors has been proposed for complex-valued tensors, under To prove Property 2, we first introduce the following lemma.
unknown tensor rank and in the presence of outliers. It has Lemma 1: Suppose the matrix A(n ) ∈ C κ n ×ρ (1 ≤ n ≤
been shown that, without knowledge of noise power and outlier N N
statistics, the proposed algorithm alternatively estimates the fac- N ). Then, if N ≥ 2, ( A(n ) )T ( A(n ) )∗ can be computed
n =1 n =1
tor matrices, recovers the tensor rank and mitigates the outliers. as
Interestingly, the widely used OALS algorithm in [17] has been
shown to be a special case of the proposed algorithm. Simulation  T  ∗
N N N
results using synthetic data and tests on real data have demon- A (n )
A (n )
= A(n )T A(n )∗ . (45)
n =1 n =1 n =1
strated the excellent performance of the proposed algorithm in
terms of accuracy and robustness.
Proof: Mathematical induction is used to prove this lemma.
APPENDIX A For N = 2, it is proved in [36] that
DERIVATIONS OF Q(Ξ(k ) ), P + 1 ≤ k ≤ N
By substituting (14) into (17) and taking only the terms rele-  T  ∗
vant to Ξ(k ) , we directly have A(2) A(1) A(2) A(1)
    
Q(Ξ ) ∝ exp E Θ
= Ξ ( k ) Q (Θ j ) [
(k )
= A(2)T A(2)∗ A(1)T A(1)∗ . (46)
j


−βY − [[Ξ(1) , · · · , Ξ(N ) ]] − E2F − Tr(ΓΞ(k )H Ξ(k ) ) . Without loss of generality, assume that (45) holds for some
(42) K ≥ 2, i.e.,
Using the fact that A2F = U(n ) [A]2F = Tr(U(n ) [A]  T  ∗ K
(U(n ) [A])H ), the square of the Frobenius norm inside expec- K
A(n )
K
A(n ) = A(n )T A(n )∗ . (47)
tation in (42) can be expressed as n =1 n =1 n =1
 T  ∗
N N
Tr Ξ(k ) Ξ(n ) Ξ(n ) Ξ(k )H Now consider
K
[A(K +1) ( A(n ) )]T [A(K +1) (
K
n =1,n
= k n =1,n
= k
n =1 n =1
 T  A(n ) )]∗ . Treating
K
A(n ) as a matrix and using (46),
N   H n =1
−Ξ (k )
Ξ (n ) (k )
U Y −E K K
n =1,n
= k we have [A(K +1) ( A(n ) )]T [A(K +1) ( A(n ) )]∗ =
 ∗ n =1 n =1
  K K
− U(k ) Y − E
N
Ξ(n ) Ξ(k )H (A(K +1)T A(K +1)∗ ) [( A(n ) )T ( A(n ) )∗ ]. Further
n =1,n
= k n =1 n =1
using (47), we have
   H
+ U(k ) Y − E U(k ) Y − E . (43)  T 
K +1 K +1
∗
A (n )
A(n )
By substituting (43) into (42), and distributing the expecta- n =1 n =1

tions into various terms, (42) becomes (44) shown at the bottom    K 
of the page. = A(K +1)T A(K +1)∗ A(n )T A(n )∗
n =1
After completing the square over Ξ(k ) , it can be seen
K +1
that (44) corresponds to the functional form of a circularly- = A(n )T A(n )∗ . (48)
symmetric complex matrix normal distribution [35]. In par- n =1

ticular, with CMN (X|M, Σr , Σc ) denoting the distribution


−1
(k∝)
exp{−Tr(Σc (k(X
p(X) − M)H Σ−1r (X − M))}, we have Then, (45) is shown to hold for N = K + 1. Thus, by math-
Q Ξ = CMN (Ξ |M(k ) , II k , Σ(k ) ).
)
ematical induction, (45) holds for any N ≥ 2. 


  N T  N ∗  
Q(Ξ ) ∝ exp − Tr Ξ(k ) EQ (β ) [β] E N
(k )
( n ) Ξ (n )
Ξ(n )
+E L [Γ] Ξ(k )H
n = 1 , n
= k Q (Ξ ) n =1,n
= k n =1,n
= k l = 1 Q (γ l )
  
−1
[Σ ]( k )

"H 
   N ∗
(k ) −1 (k ) −1 (k )H
− Ξ [Σ ]
(k )
EQ (β ) [β] U (k )
Y − EQ (E) [E] (n )
EQ (Ξ ( n ) ) [Ξ ] Σ (k )
− M [Σ ] Ξ
(k )
. (44)
n =1,n
= k
  
M ( k )
CHENG et al.: PROBABILISTIC TENSOR CANONICAL POLYADIC DECOMPOSITION WITH ORTHOGONAL FACTORS 675

 
 
H (1)    N T  N ∗ 
˜
f = Tr EQ (E) U(1) Y − E U Y − E +EQ (Ξ ( 1 ) ) [Ξ(1) ] E N Q (Ξ ( n ) ) Ξ(n ) Ξ(n ) EQ (Ξ ( 1 ) ) [Ξ(1) ]H
n=2 n =2 n =2
     
w1 w2
 N T  H
− EQ (Ξ ( 1 ) ) [Ξ(1) ] EQ (Ξ ( n ) ) [Ξ(n ) ] U(1) Y − EQ (E) [E]
n =2

  N ∗
− U Y − EQ (E) [E]
(1)
EQ (Ξ ( n ) ) [Ξ ] EQ (Ξ ( 1 ) ) [Ξ ]
(n ) (1) H
. (51)
n =2

Using Lemma 1 and taking expectations, it is easy to prove [2] J. F. Cardoso, “High-order contrasts for independent component analysis,”
that Neural Comput., vol. 11, no. 1, pp. 157–192, 1999.
[3] C. F. Beckmann and S. M. Smith, “Tensorial extensions of independent
 N T 
N
∗  component analysis for multisubject FMRI analysis,” Neuroimage, vol. 25,
E N p(A ( n ) )
A (n )
A (n ) no. 1, pp. 294–231, 2005.
n = 1 , n
= k n =1,n
= k n =1,n
= k [4] L. De Lathauwer, “Algebraic methods after prewhitening,” in Handbook
  of Blind Source Separation, Independent Component Analysis and Appli-
P   cations. New York, NY, USA: Academic, 2010, pp. 155–177.
= Ep(A ( n ) ) A(n )T A(n )∗ [5] N. D. Sidiropoulos, G. B. Giannakis, and R. Bro, “Blind PARAFAC re-
n =1,n
= k
ceivers for DS-CDMA systems,” IEEE Trans. Signal Process., vol. 48,
 
N   no. 3, pp. 810–823, 2000.
Ep(A ( n ) ) A (n )T
A (n )∗
. (49) [6] C. A. R. Fernandes, A. L. F. de Almeida, and D. B. da Costa, “Uni-
n =P +1,n
= k fied tensor modeling for blind receivers in multiuser uplink coopera-
tive systems,” IEEE Signal Process. Lett., vol. 19, no. 5, pp. 247–250,
Since the matrix A(n ) ∼ δ(A(n ) − Â(n ) ) for 1 ≤ n ≤ May 2012.
[7] A. Y. Kibangou and A. De Almeida, “Distributed PARAFAC based DS-
P where Â(n ) ∈ Vρ (C κ n ), and the matrix A(n ) ∼ CDMA blind receiver for wireless sensor networks,” in Proc. IEEE Int.
CMN (A (n ) | M(n ) , Iκ n , Σ(n ) ) for P + 1 ≤ n ≤ N , we have Conf. Signal Process. Adv. Wireless Commun., Marrakech, Morocco, Jun.
20-23, 2010, pp. 1–5.
Ep(A ( n ) ) A(n )T A(n )∗ = Â(n )T Â(n )∗ = Iρ for 1 ≤ n ≤ P ,
  [8] A. L. F. de Almeida, A. Y. Kibangou, S. Miron, and D. C. Araujo, “Joint
and Ep(A ( n ) ) A(n )T A(n )∗ = M(n )T M(n )∗ + κn Σ(n )∗ for data and connection topology recovery in collaborative wireless sensor
P + 1 ≤ n ≤ N . Then, (49) becomes networks,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
Vancouver, BC, Canada, May 26–31, 2013, pp. 5303–5307.
 T  "
N N
∗ [9] M. Sorensen, L. D. Lathauwer, and L. Deneire, “PARAFAC with orthog-
onality in one mode and applications in DS-CDMA systems,” in Proc.
E N p(A ( n ) )
A (n )
A (n )
IEEE Int. Conf. Acoust., Speech, Signal Process., Dallas, TX, USA, Mar.
n = 1 , n
= k n =1,n
= k n =1,n
= k
2010, pp. 4142–4145.
  ∗  [10] D. Nion and N. D. Sidiropoulos, “Tensor algebra and multidimensional
N
= Iρ M(n )H M(n ) + κn Σ(n ) harmonic retrieval in signal processing for MIMO radar,” IEEE Trans.
n =P +1,n
= k Signal Process., vol. 58, no. 11, pp. 5693–5705, Nov. 2010.
  ∗  [11] W. Sun, H. C. So, F. K. W. Chan, and L. Huang, “Tensor approach
N for eigenvector-based multi-dimensional harmonic retrieval,” IEEE Trans.
=D M(n )H M(n ) + κn Σ(n ) . (50) Signal Process., vol. 61, no. 13, pp. 3378–3388, Jul. 2013.
n =P +1,n
= k
[12] B. Pesquet-Popescu, J.-C. Pesquet, and A. P. Petropulu, “Joint singular
value decomposition—A new tool for separable representation of images,”
APPENDIX C in Proc. IEEE Int. Conf. Image Process., Thessaloniki, Greece, Oct. 2001,
pp. 569–572.
DERIVATION OF f˜ [13] A. Shashua and A. Levin, “Linear image coding for regression and clas-
sification using the tensor-rank principle,” in Proc. IEEE Comput. Soc.
Recall that $\big\|\boldsymbol{\mathcal{Y}} - [\![\boldsymbol{\Xi}^{(1)},\cdots,\boldsymbol{\Xi}^{(N)}]\!] - \boldsymbol{\mathcal{E}}\big\|_{F}^{2}$ is expressed in (43). Using this result and taking expectations with respect to $Q(\boldsymbol{\Xi}^{(n)})$ for $1 \le n \le N$ and $Q(\boldsymbol{\mathcal{E}})$, we obtain (51) shown at the top of this page. Since $Q(\mathcal{E}_{i_1,\ldots,i_N}) = \mathcal{CN}\big(\mathcal{E}_{i_1,\ldots,i_N}\,\big|\,m_{i_1,\ldots,i_N},\, p_{i_1,\ldots,i_N}^{-1}\big)$, it is easy to show that
$$
\operatorname{Tr}(\mathbf{w}_1) = \operatorname{Tr}\Big(\big(\mathbf{U}^{(1)}[\boldsymbol{\mathcal{Y}}-\boldsymbol{\mathcal{M}}]\big)^{H}\,\mathbf{U}^{(1)}[\boldsymbol{\mathcal{Y}}-\boldsymbol{\mathcal{M}}]\Big) + \sum_{i_1,\ldots,i_N}^{I_1,\cdots,I_N} p_{i_1,\ldots,i_N}^{-1}
= \|\boldsymbol{\mathcal{Y}}-\boldsymbol{\mathcal{M}}\|_{F}^{2} + \sum_{i_1,\ldots,i_N}^{I_1,\cdots,I_N} p_{i_1,\ldots,i_N}^{-1},
$$
where $\boldsymbol{\mathcal{M}}$ is a tensor with its $(i_1,\ldots,i_N)$th element being $m_{i_1,\ldots,i_N}$. On the other hand, using Property 2, we have $\mathbf{w}_2 = \mathcal{D}\big[\mathop{\circledast}\limits_{n=P+1}^{N}\big(\mathbf{M}^{(n)H}\mathbf{M}^{(n)} + I_n\boldsymbol{\Sigma}^{(n)}\big)^{*}\big]$. Substituting these two results into (51), we have (32).
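As a sanity check on the first of these results, the following short sketch (illustrative only, real-valued, with hypothetical shapes and sample count) verifies numerically the elementwise expectation that produces the Frobenius-norm-plus-precision-sum form of $\operatorname{Tr}(\mathbf{w}_1)$ above: averaging $(y - e)^2$ over $e \sim \mathcal{N}(m, p^{-1})$ gives $(y - m)^2 + p^{-1}$, summed over all tensor entries.

```python
# Illustrative sanity check (not from the paper) of the elementwise expectation
# E[(y - e)^2] = (y - m)^2 + 1/p when e ~ N(m, 1/p); summed over all entries this
# yields the Frobenius-norm-plus-precision-sum form appearing in Tr(w1) above.
# The tensor shape, distributions, and sample count below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
shape = (4, 3, 5)                          # an arbitrary 3-way tensor
Y = rng.standard_normal(shape)
M = rng.standard_normal(shape)             # posterior means m_{i1,...,iN}
p = rng.uniform(0.5, 2.0, size=shape)      # posterior precisions p_{i1,...,iN}

n_mc = 50000
samples = M + rng.standard_normal((n_mc,) + shape) / np.sqrt(p)   # E ~ N(m, 1/p)
mc_estimate = np.mean(np.sum((Y - samples) ** 2, axis=(1, 2, 3)))
closed_form = np.sum((Y - M) ** 2) + np.sum(1.0 / p)
print(mc_estimate, closed_form)            # agree up to Monte Carlo error
```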
Lei Cheng received the B.Eng. degree from Zhejiang University, in 2013. He is currently working toward the Ph.D. degree at the University of Hong Kong, Pokfulam, Hong Kong. His research interests include general areas of signal processing and machine learning, and in particular statistical inference for multidimensional data.
Yik-Chung Wu received the B.Eng. degree in electrical and electronic engineering in 1998 and the M.Phil. degree in 2001 from The University of Hong Kong (HKU), Pokfulam, Hong Kong, and the Ph.D. degree from Texas A&M University, College Station, TX, USA, in 2005, for which he received the Croucher Foundation scholarship in 2002. From August 2005 to August 2006, he was with Thomson Corporate Research, Princeton, NJ, USA, as a Member of Technical Staff. Since September 2006, he has been with HKU, where he is currently an Associate Professor. He was a Visiting Scholar at Princeton University, Princeton, NJ, USA, in the summers of 2011 and 2015. His research interests include general areas of signal processing, machine learning, and communication systems, and in particular distributed signal processing and robust optimization theories with applications to communication systems and smart grid. He served as an Editor for IEEE COMMUNICATIONS LETTERS, and is currently an Editor for IEEE TRANSACTIONS ON COMMUNICATIONS and the Journal of Communications and Networks.
H. Vincent Poor (S'72–M'77–SM'82–F'87) received the Ph.D. degree in electrical engineering and computer science from Princeton University, Princeton, NJ, USA, in 1977. From 1977 to 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton, where he is the Michael Henry Strater University Professor of electrical engineering. From 2006 to 2016, he served as the Dean of Princeton's School of Engineering and Applied Science. His research interests include the areas of statistical signal processing, stochastic analysis, and information theory, and their applications in wireless networks and related fields. Among his publications in these areas is the recent book Mechanisms and Games for Dynamic Spectrum Allocation (Cambridge University Press, 2014). He is a Member of the National Academy of Engineering and the National Academy of Sciences, and a Foreign Member of the Royal Society. He is also a Fellow of the American Academy of Arts and Sciences and the National Academy of Inventors, and of other national and international academies. He received the Technical Achievement and Society Awards of the IEEE Signal Processing Society in 2007 and 2011, respectively. Recent recognition of his work includes the 2014 URSI Booker Gold Medal, the 2015 EURASIP Athanasios Papoulis Award, the 2016 John Fritz Medal, and honorary doctorates from Aalborg University, Aalto University, the Hong Kong University of Science and Technology, and the University of Edinburgh.