
FuSSO: Functional Shrinkage and Selection Operator

Junier B. Oliva Barnabas Poczos Timothy Verstynen Aarti Singh



Jeff Schneider Fang-Cheng Yeh Wen-Yih Tseng

Carnegie Mellon University National Taiwan University

Abstract

We present the FuSSO, a functional analogue to the LASSO, that efficiently finds a sparse set of functional input covariates to regress a real-valued response against. The FuSSO does so in a semi-parametric fashion, making no parametric assumptions about the nature of input functional covariates and assuming a linear form to the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via proof of asymptotic sparsistency under various conditions. Furthermore, we observe good results on both synthetic and real-world data.

1 Introduction

Modern data collection has allowed us to collect not just more data, but more complex data. In particular, complex objects like sets, distributions, and functions are becoming prevalent in many domains. It would be beneficial to perform machine learning tasks using these complex objects. However, many existing techniques cannot handle complex, possibly infinite dimensional, objects; hence one often resorts to the ad-hoc technique of representing these complex objects by arbitrary summary statistics.

In this paper, we look to perform a regression task when dealing with functional data. Specifically, we look to regress a mapping that takes in many functional input covariates and outputs a real value. Moreover, since we are considering many functional covariates (possibly many more than the number of instances in one's data), we look to find an estimator that performs feature selection by only regressing on a subset of all possible input functional covariates. To this end we present the Functional Shrinkage and Selection Operator (FuSSO), for performing sparse functional regression in a principled, semi-parametric manner.

Indeed, there are a multitude of applications and domains where the study of a mapping that takes in a functional input and outputs a real value is of interest. That is, if I is some class of input functions with domain R and range R, then one may be interested in a mapping h : I → R with h(f) = Y (Figure 1(a)). Examples include: a mapping that takes in the time-series of a commodity's price in the past (f is a function with a domain of time and a range of price) and outputs the expected price of the commodity in the near future; also, a mapping that takes a patient's cardiac-monitor time-series and outputs a health index. Recently, work by [8] has explored this type of regression problem when the input function is a distribution. Furthermore, the general case of an arbitrary functional input is related to functional analysis [2].

However, often the response one is interested in regressing is expected to depend on not just one, but many functions. That is, it may be fruitful to consider a mapping h : I_1 × ... × I_p → R with h(f_1, ..., f_p) = Y (Figure 1(b)). For instance, this is likely the case in regressing the future price of a commodity, since the commodity's future price depends not only on the history of its own price, but also on the histories of other commodities' prices. A response's dependence on multiple functional covariates is especially common in neurological data, where thousands of voxels in the brain may each contain a corresponding function. In fact, in such domains it is not uncommon for the number of input functional covariates to far exceed the number of training instances one has in a data-set. Thus, it would be beneficial to have an estimator that is sparse in the number of functional covariates used to regress the response against. That is, to find an estimate, h_s, that depends on a small subset {i_1, ..., i_S} ⊆ {1, ..., p}, such that h(f_1, ..., f_p) = h_s(f_{i_1}, ..., f_{i_S}) (Figure 1(c)).

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.



Figure 1: (a) Model where mapping takes in a function f and produces a real Y . (b) Model where response Y is
dependent on multiple input functions f1 , . . . , fp . (c) Sparse model where response Y is dependent on a sparse
subset of input functions f1 , . . . , fp .

Here we present a semi-parametric estimator to perform sparse regression with multiple input functional covariates and a real-valued response, the FuSSO: Functional Shrinkage and Selection Operator. No parametric assumptions are made on the nature of the input functions. We shall assume that the response is the result of a sparse set of linear combinations of the input functions and other non-parametric functions {g_j}: Y = Σ_j ⟨f_j, g_j⟩. The resulting method is a LASSO-like [10] estimator that effectively zeros out entire functions from consideration in regressing the response.

Our contributions are as follows. We introduce the FuSSO, an estimator for performing regression with many functional covariates and a real-valued response. Furthermore, we provide a theoretical backing of the FuSSO estimator via proof of asymptotic sparsistency under certain conditions. We also illustrate the estimator with applications on synthetic data as well as in regressing the age of a subject when given orientation distribution function (dODF) [14] data for the subject's white matter.

2 Related Work

As previously mentioned, [8] recently explored regression with a mapping that takes in a probability density function and outputs a real value. Furthermore, [7] studies the case when both the input and the output are distributions. In addition, functional analysis relates to the study of functional data [2]. In all these works, the mappings studied take in only one functional covariate. Based on them, it is not immediately evident how to expand on these ideas to develop an estimator that simultaneously performs regression and feature selection with multiple functional covariates.

To the best of our knowledge, there has been no prior work studying sparse mappings that take multiple functional inputs and produce a real-valued output. LASSO-like regression estimators that work with functional data include the following. In [6], one has a functional output and several real-valued covariates; there, the estimator finds a sparse set of functions to scale by the real-valued covariates to produce the functional response. Also, [17, 3] study the case when one has one functional covariate f and one real-valued response that is linearly dependent on f and some function g: Y = ⟨f, g⟩ = ∫ f g. In [17] the estimator searches for sparsity across wavelet-basis projection coefficients. In [3], sparsity is achieved in the time (input) domain of the dth derivative of g; i.e. [D^d g](t) = 0 for many values of t, where D^d is the differential operator. Hence, roughly speaking, [17, 3] look for sparsity across the frequency and time domains respectively, for the regressing function g. However, these methods do not consider the case where one has many input functional covariates {f_1, ..., f_p} and needs to choose among them. That is, [17, 3] do not provide a method to select among functional covariates in an analogous fashion to how the LASSO selects among real-valued covariates.

Lastly, it is worth noting that in our estimator we will have an additive linear model, Σ_j ⟨f_j, g_j⟩, where we search for {g_j} in a broad, non-parametric family such that many g_j are the zero function. Such a task is similar in nature to the SpAM estimator [9], in which one also has an additive model Σ_j g_j(X_j) (over the dimensions of a real vector X) and searches for {g_j} in a broad, non-parametric family such that many g_j are the zero function. Note though, that in the SpAM model, the {g_j} functions are applied to real covariates via a function evaluation. In the FuSSO model, {g_j} are applied to functional covariates via an inner product; that is, FuSSO works over functional, not real-valued, covariates, unlike SpAM.


3 Model

To better understand FuSSO's model we draw several analogies to real-valued linear regression and the Group-LASSO [16]. Note that although for simplicity we focus on functions over a one-dimensional domain, it is straightforward to extend the estimator and results to the multidimensional case. Consider a model for typical real-valued linear regression with a data-set of input-output pairs {(X_i, Y_i)}_{i=1}^N:

    Y_i = ⟨X_i, w⟩ + ε_i,

where Y_i ∈ R, X_i ∈ R^d, w ∈ R^d, ε_i ~ N(0, σ²) i.i.d., and ⟨X_i, w⟩ = Σ_{j=1}^d X_{ij} w_j. If instead one were working with functional data {(f^(i), Y_i)}_{i=1}^N, where f^(i) : [0, 1] → R and f^(i) ∈ L_2[0, 1], one may similarly consider a linear model:

    Y_i = ⟨f^(i), g⟩ + ε_i,

where g : [0, 1] → R and ⟨f^(i), g⟩ = ∫_0^1 f^(i)(t) g(t) dt. If φ = {φ_m}_{m=1}^∞ is an orthonormal basis for L_2[0, 1] [11], then we have that

    f^(i)(x) = Σ_{m=1}^∞ α_m^(i) φ_m(x),    (1)

where α_m^(i) = ∫_0^1 f^(i)(t) φ_m(t) dt. Similarly, g(x) = Σ_{m=1}^∞ β_m φ_m(x), where β_m = ∫_0^1 g(t) φ_m(t) dt. Thus,

    Y_i = ⟨f^(i), g⟩ + ε_i
        = ⟨ Σ_{m=1}^∞ α_m^(i) φ_m, Σ_{k=1}^∞ β_k φ_k ⟩ + ε_i
        = Σ_{m=1}^∞ Σ_{k=1}^∞ α_m^(i) β_k ⟨φ_m, φ_k⟩ + ε_i
        = Σ_{m=1}^∞ α_m^(i) β_m + ε_i,

where the last step follows from the orthonormality of φ.

Going back to the real-valued covariate case, if instead of having one feature vector per data instance, X_i ∈ R^d, one had p feature vectors associated with each data instance, {X_{ij} | 1 ≤ j ≤ p, X_{ij} ∈ R^d}, an additive linear model may be used for regression:

    Y_i = Σ_{j=1}^p ⟨X_{ij}, w_j⟩ + ε_i,  where w_1, ..., w_p ∈ R^d.

Similarly, in the functional case one may have p functions associated with data instance i, {f_j^(i) | 1 ≤ j ≤ p, f_j^(i) ∈ L_2[0, 1]}. Then, an additive linear model would be:

    Y_i = Σ_{j=1}^p ⟨f_j^(i), g_j⟩ + ε_i = Σ_{j=1}^p Σ_{m=1}^∞ α_{jm}^(i) β_{jm} + ε_i,    (2)

where g_1, ..., g_p ∈ L_2[0, 1], and α_{jm}^(i) and β_{jm} are the projection coefficients of f_j^(i) and g_j respectively.

Suppose that one has few observations relative to the number of features (N ≪ p). In the real-valued case, in order to effectively find a solution for w = (w_1^T, ..., w_p^T)^T, one may search for a group-sparse solution in which many w_j = 0. To do so, one may consider the following Group-LASSO regression:

    w* = argmin_w (1/(2N)) ||Y − Σ_{j=1}^p X_j w_j||² + λ_N Σ_{j=1}^p ||w_j||,    (3)

where X_j is the N × d matrix X_j = [X_{1j} ... X_{Nj}]^T, Y = (Y_1, ..., Y_N)^T, and ||·|| is the Euclidean norm.

If in the functional case (2) one also has N ≪ p, one may set up a similar optimization to (3), whose direct analogue is:

    g* = argmin_g (1/(2N)) Σ_{i=1}^N ( Y_i − Σ_{j=1}^p ⟨f_j^(i), g_j⟩ )²    (4)
             + λ_N Σ_{j=1}^p ||g_j||;    (5)

equivalently,

    β* = argmin_β (1/(2N)) Σ_{i=1}^N ( Y_i − Σ_{j=1}^p Σ_{m=1}^∞ α_{jm}^(i) β_{jm} )²    (6)
             + λ_N Σ_{j=1}^p √( Σ_{m=1}^∞ β_{jm}² ),    (7)

where g = {g_j}_{j=1}^p = { Σ_{m=1}^∞ β_{jm} φ_m }_{j=1}^p.

However, it is infeasible to directly observe the functional inputs {f_j^(i) | 1 ≤ i ≤ N, 1 ≤ j ≤ p}. Thus, we shall instead assume that one observes {y⃗_j^(i) | 1 ≤ i ≤ N, 1 ≤ j ≤ p}, where

    y⃗_j^(i) = f⃗_j^(i) + ξ⃗_j^(i),    (8)
    f⃗_j^(i) = ( f_j^(i)(1/n), f_j^(i)(2/n), ..., f_j^(i)(1) )^T,    (9)
    ξ⃗_j^(i) ~ N(0, σ_ξ² I_n)  i.i.d.    (10)

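The simple averaging estimate introduced next in (11) relies on the fact that an orthonormal basis of L_2[0, 1] remains (essentially) orthonormal when restricted to the observation grid {1/n, ..., 1}. The sketch below is a minimal illustration of this point and is not code from the paper; it assumes the trigonometric basis that Section 4.1 later adopts, and all function names are our own.

```python
import numpy as np

def trig_basis(m, x):
    """Evaluate the m-th trigonometric basis function (m = 1, 2, ...) at points x in [0, 1]:
    phi_1(x) = 1, phi_{2k}(x) = sqrt(2) cos(2 pi k x), phi_{2k+1}(x) = sqrt(2) sin(2 pi k x)."""
    if m == 1:
        return np.ones_like(x)
    k = m // 2
    trig = np.cos if m % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * x)

n = 50                               # number of grid points per observed function
grid = np.arange(1, n + 1) / n       # the grid {1/n, 2/n, ..., 1} from (9)
M = 9                                # number of basis functions to inspect
Phi = np.stack([trig_basis(m, grid) for m in range(1, M + 1)])   # M x n matrix of basis values

# Discrete orthonormality on the grid: (1/n) * Phi Phi^T is the identity for indices
# below n, which is what makes the plug-in estimate (1/n) * phi_m^T y in (11) sensible.
gram = Phi @ Phi.T / n
print(np.allclose(gram, np.eye(M), atol=1e-8))   # True
```

This discrete orthonormality is exactly the content of Lemma 2 in Section 4.2.1.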
That is, we observe a grid of n noisy values for each functional input. Then, one may estimate α_{jm}^(i) as:

    α̃_{jm}^(i) = (1/n) φ⃗_m^T y⃗_j^(i) = (1/n) φ⃗_m^T ( f⃗_j^(i) + ξ⃗_j^(i) ) = ᾱ_{jm}^(i) + η_{jm}^(i),    (11)

where φ⃗_m = ( φ_m(1/n), φ_m(2/n), ..., φ_m(1) )^T. Furthermore, we may truncate the number of basis functions used to express f_j^(i) to M_n, estimating it as:

    f̂_j^(i)(x) = Σ_{m=1}^{M_n} α̃_{jm}^(i) φ_m(x).    (12)

Using the truncated estimate (12), one has:

    ⟨f̂_j^(i), g_j⟩ = Σ_{m=1}^{M_n} α̃_{jm}^(i) β_{jm},   and   ||f̂_j^(i)|| = √( Σ_{m=1}^{M_n} (α̃_{jm}^(i))² ).

Hence, using the approximation (12), (7) becomes:

    β̃ = argmin_β (1/(2N)) Σ_{i=1}^N ( Y_i − Σ_{j=1}^p Σ_{m=1}^{M_n} α̃_{jm}^(i) β_{jm} )²    (13)
             + λ_N Σ_{j=1}^p √( Σ_{m=1}^{M_n} β_{jm}² )    (14)
       = argmin_β (1/(2N)) ||Y − Σ_{j=1}^p Ã_j β_j||² + λ_N Σ_{j=1}^p ||β_j||,    (15)

where Ã_j is the N × M_n matrix with values Ã_j(i, m) = α̃_{jm}^(i) and β_j = (β_{j1}, ..., β_{jM_n})^T. Note that one need not consider projection coefficients β_{jm} for m > M_n, since such projection coefficients will not decrease the MSE term in (13) (because α̃_{jm}^(i) = 0 for m > M_n in the truncated estimate), and β_{jm} ≠ 0 for m > M_n only increases the norm penalty term in (14). Hence we see that our sparse functional estimates are given by a Group-LASSO problem on the projection coefficients.

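In computational terms, the estimator therefore has two steps: project each observed vector y⃗_j^(i) onto the first M_n basis functions as in (11), and solve the Group-LASSO problem (15) over the resulting coefficient blocks. The sketch below is a minimal illustration of this pipeline under our own simplifying assumptions (a plain proximal-gradient solver with a fixed step size, and M_n and λ taken as given rather than chosen by cross-validation); it is not the authors' implementation, and all function names are ours.

```python
import numpy as np

def trig_basis(m, x):
    # phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)
    if m == 1:
        return np.ones_like(x)
    k = m // 2
    trig = np.cos if m % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * x)

def projection_coeffs(y_grid, M_n):
    """Estimate alpha-tilde_{jm}^{(i)} = (1/n) phi_m^T y_j^{(i)} as in (11).
    y_grid has shape (N, p, n); the result has shape (N, p, M_n)."""
    N, p, n = y_grid.shape
    grid = np.arange(1, n + 1) / n
    Phi = np.stack([trig_basis(m, grid) for m in range(1, M_n + 1)])   # (M_n, n)
    return np.einsum('ijn,mn->ijm', y_grid, Phi) / n

def fusso(y_grid, Y, lam, M_n, n_iter=500):
    """Solve (15): a group lasso over the p blocks of projection coefficients,
    via proximal gradient descent with block soft-thresholding."""
    A = projection_coeffs(y_grid, M_n)            # (N, p, M_n)
    N, p, _ = A.shape
    X = A.reshape(N, p * M_n)                     # design matrix [A_1 ... A_p]
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / N)  # 1 / Lipschitz constant of the MSE term
    beta = np.zeros((p, M_n))
    for _ in range(n_iter):
        resid = X @ beta.ravel() - Y
        grad = (X.T @ resid / N).reshape(p, M_n)
        z = beta - step * grad
        norms = np.linalg.norm(z, axis=1, keepdims=True)
        # block soft-threshold: shrink each group's norm by step * lam, zeroing small groups
        shrink = np.clip(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0, None)
        beta = shrink * z
    return beta       # row j is beta_j; the estimated support is {j : ||beta_j|| > 0}
```

In this sketch λ and M_n are inputs; in practice both would be selected by cross-validation, as is done in Section 5.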
Extensions. It is useful to note that there are several straightforward extensions to the FuSSO as presented. First, we note that it may be possible to estimate the inner product of a function f_j^(i) and g_j as ∫ f_j^(i) g_j ≈ ⟨y⃗_j^(i), (1/n) g⃗_j⟩, where g⃗_j = ( g_j(1/n), ..., g_j(1) )^T. This effectively allows one to use a naive approach of simply running the Group-LASSO on the y⃗_j^(i) feature vectors directly (we'll refer to this method as Y-GL). It is important to note, however, that Y-GL will be less robust to noise, and less adaptive (and efficient) with respect to smoothness, than the FuSSO. Furthermore, we note that it is not necessary for the FuSSO to have observations of the input functions on a grid, since one may estimate projection coefficients in the case of an irregular design [11]. Moreover, we may also estimate projection coefficients for density functions using samples drawn from the pdf. Note that Y-GL would fail to estimate our model in the irregular-design case, and would not be possible in the case where the functions are pdfs. Also, a two-stage estimator as described in [5], where one first uses the regularization penalty with a large λ to find the support and then solves the optimization problem with a smaller λ on just the estimated support, may be more efficient at estimating the response. Furthermore, a problem analogous to (15) may be framed to perform logistic regression and classification.

4 Theory

Next, we show that the FuSSO is able to recover the correct sparsity pattern asymptotically; i.e., that the FuSSO estimate is sparsistent. In order to do so, we shall show that with high probability there is an optimal solution to our optimization problem (15) with the correct sparsity pattern. We follow a similar argument to [12, 9]. We shall use a witness technique to show that there is a coefficient/subgradient pair (β̃, ũ) such that supp(β̃) = supp(β*), for the true response-generating β*. Let Ω(β) = Σ_{j=1}^p ||β_j||_2 be our penalty term (14). Let S denote the true set of non-zero functions, i.e. S = {j : β_j* ≠ 0}, with s = |S|. First, we fix β̃_{S^c} = 0 and set ũ_S = [∂Ω(β̃)]_S. Note that for a vector β′, ∂Ω(β′) = {u} where u_j = β′_j / ||β′_j||_2 if β′_j ≠ 0, and ||u_j||_2 ≤ 1 if β′_j = 0. We shall show that with high probability, ∀ j ∈ S, β̃_j ≠ 0 and ∀ j ∈ S^c, ||ũ_j||_2 < 1, thus showing that there is an optimal solution to our optimization problem (15) that has the true sparsity pattern with high probability.

First, we elaborate on our assumptions.

4.1 Assumptions

Let φ be the trigonometric basis: φ_1(x) ≡ 1 and, for k ≥ 1, φ_{2k}(x) ≡ √2 cos(2πkx), φ_{2k+1}(x) ≡ √2 sin(2πkx).

Let D = {({y⃗_j^(i)}_{j=1}^p, Y_i)}_{i=1}^N, where y⃗_j^(i) is as in (8) and Y_i = Σ_{j=1}^p Σ_{m=1}^∞ α_{jm}^(i) β*_{jm} + ε_i as in (2). Assume that ∀ 1 ≤ i ≤ N, 1 ≤ j ≤ p: α_j^(i) ∈ Θ(γ, Q), where

    Θ(γ, Q) = { θ : Σ_{k=1}^∞ c_k² θ_k² ≤ Q },   with c_k = k^γ if k is even or k = 1, and c_k = (k − 1)^γ otherwise,

    α_j^(i) = { α_{jm}^(i) ∈ R : α_{jm}^(i) = ∫_0^1 f_j^(i) φ_m, m ∈ N_+ },

for 0 < Q < ∞ and 1/2 < γ < ∞. Furthermore, assume that for the true β* generating the observed responses Y_i, ∀ 1 ≤ j ≤ p: β_j* ∈ Θ(γ, Q).

Let A_j be the N × M_n matrix with entries A_j(i, m) = α_{jm}^(i). Let A_S denote the matrix made by horizontally concatenating the A_j matrices with j ∈ S; i.e. A_S = [A_{j_1} ... A_{j_s}], where {j_1, ..., j_s} = S and j_i < j_k for i < k. Suppose the following:

    Λ_max( (1/N) A_S^T A_S ) ≤ C_max < ∞,    (16)
    Λ_min( (1/N) A_S^T A_S ) ≥ C_min > 0.    (17)

Also, suppose ∃ ε ∈ (0, 1] s.t. ∀ j ∈ S^c

    Λ_max( (1/N) A_j^T A_j ) ≤ C_max < ∞,    (18)
    || ( (1/N) A_j^T A_S ) ( (1/N) A_S^T A_S )^{-1} || ≤ (1 − ε)/√s.    (19)

Let Ā_j be the N × M_n matrix with entries Ā_j(i, m) = ᾱ_{jm}^(i) = (1/n) φ⃗_m^T f⃗_j^(i). Let H_j be the N × M_n matrix with entries H_j(i, m) = η_{jm}^(i) = (1/n) φ⃗_m^T ξ⃗_j^(i). Thus, Ã_j = Ā_j + H_j. Furthermore, let E_j = Ā_j − A_j. Then, Ã_j = A_j + E_j + H_j.

In addition to the aforementioned assumptions, we shall further assume the following:

    ∃ a < 1/2 s.t. p M_n n^{a−1/2} e^{−n^{1−2a}/2} → 0,    (20)
    ρ_N ≡ min_{j∈S} ||β_j*|| > 0,    (21)
    (1/ρ_N) √(s M_n) ( n^{−γ+1/2} + n^{−a} ) → 0,    (22)
    (1/ρ_N) ( s^{3/2} M_n^{1/2−2γ} + √( log(s M_n)/N ) ) → 0,    (23)
    λ_N √(s M_n) / ρ_N → 0,    (24)
    (1/ρ_N) ( s √M_n n^{−γ+1/2} + s √( log(N)/n ) ) → 0,    (25)
    (1/λ_N) ( √(s M_n) / n^{γ+a−1/2} + √( s M_n log(N) ) / n^{a+1/2} ) → 0,    (26)
    (1/λ_N) √( M_n log((p − s) M_n) / N ) → 0,    (27)
    √s / ( λ_N N M_n^{2γ−1/2} ) → 0,    (28)

and we assume σ_ξ ≤ 1 for the sake of simplification. We may further simplify our assumptions if we take n = N^{1/2} and choose M_n optimally for function estimation: M_n ≍ n^{1/(2γ+1)} = N^{1/(4γ+2)}. Furthermore, take s = O(1), ρ_N ≍ 1, and γ = 2. Under these conditions, our assumptions reduce to 1/10 < a and to the following quantities going to zero:

    p N^{(10a−3)/20} e^{−N^{(1−2a)/2}/2},   λ_N N^{7/10},   √p N^{−1/20},   N^{−9/20} √(log N) / λ_N,   N^{−1/10} log(pN) / λ_N².

4.2 Sparsistency

Theorem 1. P( Ŝ_N = S ) → 1, where Ŝ_N = {j : β̃_j ≠ 0}.

First, we state some lemmas, whose proofs may be found in the supplementary materials.

4.2.1 Lemmata

Lemma 1. Let X be a non-negative r.v. and C be a measurable event; then E[X | C] P(C) ≤ E[X].

Lemma 2. (1/n) Σ_{k=1}^n φ_m(k/n) φ_l(k/n) = I{l = m}, for 1 ≤ l, m ≤ n − 1.

Lemma 3. Let H_j^(i) be the rows of H_j; then H_j^(i) ~ N(0, (σ_ξ²/n) I) i.i.d., and likewise H_S^(i) ~ N(0, (σ_ξ²/n) I) i.i.d.

Lemma 4. P( ||H||_max ≥ n^{−a} ) ≤ 2 p M_n n^{a−1/2} e^{−n^{1−2a}/(2σ_ξ²)}.

Lemma 5. ||E_j||_max ≤ C_Q n^{−γ+1/2}, where C_Q ∈ (0, ∞) is a constant depending on Q.

Lemma 6. ||β_S*||_2² ≤ Q s.

Lemma 7. ∃ N_0, n_0, C_min, C_max with 0 < C_min ≤ C_max < ∞, and 0 < ε ≤ 1, s.t. if ||H||_max < n^{−a}, N > N_0, and n > n_0, then

    Λ_max( (1/N) Ã_S^T Ã_S ) ≤ C_max < ∞,    (29)
    Λ_min( (1/N) Ã_S^T Ã_S ) ≥ C_min > 0,    (30)
    ∀ j ∈ S^c,  || ( (1/N) Ã_j^T Ã_S ) ( (1/N) Ã_S^T Ã_S )^{-1} || ≤ (1 − ε/2)/√s.    (31)

4.2.2 Proof of Theorem 1

Proposition 1. P( ∀ j ∈ S, β̃_j ≠ 0 ) → 1.

Proof. Recall that by (21), ρ_N = min_{j∈S} ||β_j*|| > 0. Thus, to prove that ∀ j ∈ S, β̃_j ≠ 0, it suffices to show that ||β̃_S − β_S*||_∞ ≤ ρ_N/2. To do so we show that P( ||β̃_S − β_S*||_∞ > ρ_N/2 ) → 0. Let B be the event that ||H||_max < n^{−a}. Note that:

    P( ||β̃_S − β_S*||_∞ > ρ_N/2 ) ≤ P( ||β̃_S − β_S*||_∞ > ρ_N/2 | B ) P(B) + P(B^c).

Furthermore,

    P( ||β̃_S − β_S*||_∞ > ρ_N/2 | B ) P(B) ≤ (2/ρ_N) E[ ||β̃_S − β_S*||_∞ | B ] P(B).

Then, looking at the stationarity condition for the support S:

    (1/N) Ã_S^T ( Ã_S β̃_S − Y ) + λ_N ũ_S = 0.    (32)
Let V be the N × 1 vector with entries V_i = Σ_{j∈S} Σ_{m=M_n+1}^∞ α_{jm}^(i) β*_{jm}; i.e., the error from truncation. Then, using (32) and Y = A_S β_S* + V + ε,

    (1/N) Ã_S^T ( Ã_S (β̃_S − β_S*) + (E_S + H_S) β_S* − V − ε ) + λ_N ũ_S = 0,

so that

    (1/N) Ã_S^T Ã_S (β̃_S − β_S*) = −(1/N) Ã_S^T (E_S + H_S) β_S* + (1/N) Ã_S^T V + (1/N) Ã_S^T ε − λ_N ũ_S.    (33)

Let Σ̃_SS = (1/N) Ã_S^T Ã_S. We see that

    ||β̃_S − β_S*||_∞ ≤ || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) (E_S + H_S) β_S* ||_∞ + || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) V ||_∞
                        + || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) ε ||_∞ + λ_N || Σ̃_SS^{-1} ũ_S ||_∞.

Thus, we proceed to bound each term on the right-hand side in expectation. First, note that

    || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) (E_S + H_S) β_S* ||_∞ ≤ || Σ̃_SS^{-1} ||_∞ || ( (1/N) Ã_S^T ) (E_S + H_S) β_S* ||_∞
        = || Σ̃_SS^{-1} ||_∞ || (1/N) (A_S + E_S + H_S)^T (E_S + H_S) β_S* ||_∞
        ≤ ( || Σ̃_SS^{-1} ||_∞ / N ) ( || A_S^T (E_S + H_S) β_S* ||_∞ + || (E_S + H_S)^T (E_S + H_S) β_S* ||_∞ ),

where || Σ̃_SS^{-1} ||_∞ ≤ √(s M_n) || Σ̃_SS^{-1} ||_2 ≤ √(s M_n)/C_min. Thus,

    E[ ( || Σ̃_SS^{-1} ||_∞ / N ) || A_S^T (E_S + H_S) β_S* ||_∞ | B ] P(B)
        ≤ ( √(s M_n) / (C_min N) ) || A_S^T ||_∞ E[ || (E_S + H_S) β_S* ||_∞ | B ] P(B)
        ≤ ( √Q s M_n / C_min ) ( || E_S β_S* ||_∞ + E[ || H_S β_S* ||_∞ | B ] P(B) ),

noting that || A_S^T ||_∞ ≤ N √Q. Moreover, by Lemma 1, E[ || H_S β_S* ||_∞ | B ] P(B) ≤ E[ || H_S β_S* ||_∞ ]. Also, H_S β_S* is normally distributed and

    Var[ H_S^(i)T β_S* ] = Σ_{j=1}^{s M_n} Var[ H_{Sj}^(i) β*_{Sj} ] = (σ_ξ²/n) || β_S* ||_2² ≤ σ_ξ² Q s / n.

Hence, by a Gaussian maximal inequality (e.g. [13]) we have

    E[ || H_S β_S* ||_∞ ] ≤ √( 2 σ_ξ² Q s log(N) / n ).

Furthermore (unless otherwise specified, X^(i) denotes the i-th row of a matrix X and X_j its j-th column),

    || E_S β_S* ||_∞ = max_{1≤i≤N} | E_S^(i)T β_S* | ≤ || β_S* ||_2 max_{1≤i≤N} || E_S^(i) ||_2
        ≤ √(Q s) · C_Q √(s M_n) n^{−γ+1/2} = √Q C_Q s √M_n n^{−γ+1/2}.

Thus, E[ ( || Σ̃_SS^{-1} ||_∞ / N ) || A_S^T (E_S + H_S) β_S* ||_∞ | B ] P(B) = O( √(s M_n) ( s √M_n n^{−γ+1/2} + √( s log(N)/n ) ) ).

Furthermore,

    E[ || (E_S + H_S)^T (E_S + H_S) β_S* ||_∞ | B ] P(B)
        = E[ max_{j ≤ s M_n} | (E_{Sj} + H_{Sj})^T ( (E_S + H_S) β_S* ) | | B ] P(B)
        ≤ E[ max_{j ≤ s M_n} || E_{Sj} + H_{Sj} ||_1 || (E_S + H_S) β_S* ||_∞ | B ] P(B)
        = E[ || E_S + H_S ||_1 || (E_S + H_S) β_S* ||_∞ | B ] P(B).

Then, given that B occurs, || E_S + H_S ||_1 ≤ || E_S ||_1 + || H_S ||_1 ≤ N ( C_Q n^{−γ+1/2} + n^{−a} ), and, as before, E[ || (E_S + H_S) β_S* ||_∞ | B ] ≤ C_2 s √M_n n^{−γ+1/2} + C_3 √( s log(N)/n ). Moreover, given that B occurs,

    E[ || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) (E_S + H_S) β_S* ||_∞ | B ] P(B)
        = O( √(s M_n) ( s √M_n n^{−γ+1/2} + √( s log(N)/n ) ) ( 1 + n^{−γ+1/2} + n^{−a} ) )
        = O( √(s M_n) ( s √M_n n^{−γ+1/2} + √( s log(N)/n ) ) ).

The next two terms are bounded as follows (see the supplementary materials for proofs).

Lemma 8. E[ || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) V ||_∞ | B ] P(B) = O( s^{3/2} / M_n^{2γ−1/2} ).

Lemma 9. E[ || Σ̃_SS^{-1} ( (1/N) Ã_S^T ) ε ||_∞ | B ] P(B) = O( √( log(s M_n)/N ) ).

Lastly, || ũ_S ||_∞ = max_{j∈S} || u_j ||_∞ ≤ max_{j∈S} || u_j ||_2 ≤ 1, so that

    λ_N || Σ̃_SS^{-1} ũ_S ||_∞ ≤ λ_N || Σ̃_SS^{-1} ||_∞ || ũ_S ||_∞ ≤ λ_N √(s M_n) / C_min.
Keeping only leading terms,

    E[ ||β̃_S − β_S*||_∞ | B ] P(B) = O( s^{3/2} √M_n n^{−γ+1/2} + s √( M_n log(N)/n ) )
                                      + O( s^{3/2} / M_n^{2γ−1/2} + √( log(s M_n)/N ) + λ_N √(s M_n) ).

Hence, by assumptions (20)-(24) we have P( ||β̃_S − β_S*||_∞ > ρ_N/2 ) → 0.

One may similarly look at the stationarity condition for j ∈ S^c to analyze ũ_j:

    0 = (1/N) Ã_j^T ( Ã_S β̃_S − Y ) + λ_N ũ_j
      = (1/N) Ã_j^T ( Ã_S (β̃_S − β_S*) + (E_S + H_S) β_S* − V − ε ) + λ_N ũ_j.

Thus,

    ũ_j = −(1/(λ_N N)) Ã_j^T Ã_S (β̃_S − β_S*) − (1/(λ_N N)) Ã_j^T (E_S + H_S) β_S* + (1/(λ_N N)) Ã_j^T (V + ε)
        = (1/λ_N) Σ̃_{jS} Σ̃_SS^{-1} ( (1/N) Ã_S^T (E_S + H_S) β_S* − (1/N) Ã_S^T V − (1/N) Ã_S^T ε + λ_N ũ_S )
          − (1/(λ_N N)) Ã_j^T (E_S + H_S) β_S* + (1/(λ_N N)) Ã_j^T (V + ε),

where Σ̃_{jS} = (1/N) Ã_j^T Ã_S and we have used (33). We wish to show that ∀ j ∈ S^c, ũ_j satisfies the KKT conditions, that is:

Proposition 2. P( max_{j∈S^c} ||ũ_j||_2 < 1 ) → 1.

Proof. Let μ_j^H ≡ E[ ũ_j | H ]. We proceed as follows:

    P( max_{j∈S^c} ||ũ_j||_2 < 1 )
        ≥ P( max_{j∈S^c} ( ||μ_j^H||_2 + ||ũ_j − μ_j^H||_2 ) < 1 )
        ≥ P( max_{j∈S^c} ( ||μ_j^H||_2 + √M_n ||ũ_j − μ_j^H||_∞ ) < 1 )
        ≥ P( max_{j∈S^c} ||μ_j^H||_2 < 1 − ε/2,  max_{j∈S^c} ||ũ_j − μ_j^H||_∞ < ε/(2 √M_n) )
        ≥ 1 − P( max_{j∈S^c} ||μ_j^H||_2 ≥ 1 − ε/2 ) − P( max_{j∈S^c} ||ũ_j − μ_j^H||_∞ ≥ ε/(2 √M_n) ).

We obtain the following results:

Lemma 10. P( max_{j∈S^c} ||μ_j^H||_2 ≥ 1 − ε/2 ) → 0.

Lemma 11. P( max_{j∈S^c} ||ũ_j − μ_j^H||_∞ ≥ ε/(2 √M_n) ) → 0.

Hence, we have that P( max_{j∈S^c} ||ũ_j||_2 < 1 ) → 1.

5 Experiments

5.1 Synthetic Data

We tested the FuSSO on synthetic data-sets D = {({y⃗_j^(i)}_{j=1}^p, Y_i)}_{i=1}^N (where y⃗_j^(i) is as in (8)). The experiments performed were as follows. First, we fix N, n, p, and s. For i = 1, ..., N, j = 1, ..., p, we create random functions using a maximum of M projection coefficients as follows: 1) set a_{jm} ~ Unif[−1, 1] for m = 1, ..., M; 2) set a_{jm} = a_{jm}/c_m², where c_m = m if m = 1 or m is even, and c_m = m − 1 if m is odd; 3) set a_{jm} = a_{jm}/||a_j||; 4) set α_j^(i) = a_j. (See Figures 2(a), 2(c) for typical functions.) Similarly, we generate β_j for j = 1, ..., s; for j = s + 1, ..., p, we set β_j = 0. Then, we generate Y_i as Y_i = Σ_{j=1}^p ⟨α_j^(i), β_j⟩ + ε_i = Σ_{j=1}^s ⟨α_j^(i), β_j⟩ + ε_i, where ε_i ~ N(0, .1) i.i.d. Also, a grid of n noisy function evaluations was generated to make y⃗_j^(i) as in (8), with σ_ξ = .1. These were then used to compute α̃_{jm}^(i) for m = 1, ..., M_n as in (11); M_n was chosen by cross-validation. (See Figures 2(a), 2(c) for typical noisy observations and function estimates for n = 5 and n = 25 respectively.)

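A sketch of the data-generating procedure just described is given below. It is our own illustration, not the authors' code: M = 20 is an assumed cap on the number of projection coefficients (the text does not fix it), the response noise N(0, .1) is read as having variance 0.1, and the grid noise level σ_ξ = .1 is read as a standard deviation. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def trig_basis(m, x):
    # phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)
    if m == 1:
        return np.ones_like(x)
    k = m // 2
    trig = np.cos if m % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * x)

def c(m):
    # c_m = m if m = 1 or m is even, c_m = m - 1 if m is odd (step 2 above)
    return m if (m == 1 or m % 2 == 0) else m - 1

def make_dataset(N=50, n=5, p=100, s=5, M=20, sigma_xi=0.1, sigma_eps=np.sqrt(0.1)):
    """Generate D = {({y_j^(i)}_{j=1..p}, Y_i)}_{i=1..N} following steps 1)-4) of Section 5.1."""
    cm = np.array([c(m) for m in range(1, M + 1)], dtype=float)

    # steps 1)-4): raw Unif[-1, 1] coefficients, damped by c_m^2, then normalized per function
    a = rng.uniform(-1.0, 1.0, size=(N, p, M)) / cm**2
    a /= np.linalg.norm(a, axis=2, keepdims=True)            # alpha_j^{(i)} = a_j

    # beta_j is built the same way but is nonzero only for the first s functions
    beta = rng.uniform(-1.0, 1.0, size=(p, M)) / cm**2
    beta /= np.linalg.norm(beta, axis=1, keepdims=True)
    beta[s:] = 0.0

    # responses Y_i = sum_j <alpha_j^{(i)}, beta_j> + eps_i
    Y = np.einsum('ijm,jm->i', a, beta) + sigma_eps * rng.standard_normal(N)

    # noisy grid observations (8): evaluate each function on {1/n, ..., 1} and add noise
    grid = np.arange(1, n + 1) / n
    Phi = np.stack([trig_basis(m, grid) for m in range(1, M + 1)])    # (M, n)
    f_grid = np.einsum('ijm,mn->ijn', a, Phi)
    y_grid = f_grid + sigma_xi * rng.standard_normal((N, p, n))
    return y_grid, Y, beta
```

The group-lasso sketch given at the end of Section 3 can then be fit on (y_grid, Y) over a range of λ to trace out regularization paths like those in Figure 2.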
We fixed s = 5 and chose the following configurations for the other parameters: (p, N, n) ∈ {(100, 50, 5), (1000, 500, 25), (20000, 500, 25)}. For each (p, N, n) configuration, 100 random trials were performed. We recorded r, the fraction of trials in which some λ value was able to recover the correct sparsity pattern (i.e. only the first 5 functions are in the support). We also recorded the mean relative length of the range of λ values that recovered the correct support, ∆̄ = (1/100) Σ_{t=1}^{100} ∆^(t), where ∆^(t) = (λ_f^(t) − λ_l^(t))/λ_max^(t), λ_f^(t) is the largest λ value found to recover the correct support in the t-th trial, λ_l^(t) is the smallest such λ, and λ_max^(t) is the smallest λ to produce β̃ = 0 (∆^(t) is taken to be zero if no λ recovered the correct support). The results were as follows:

    (p, N, n)          r      ∆̄
    (100, 50, 5)       .68    .2125
    (1000, 500, 25)    1      .4771
    (20000, 500, 25)   1      .4729

Hence we see that even when the number of observations per function is small (5 or 25) and the total number of input functional covariates is large (we were able to test up to 20000), the FuSSO can recover the correct support. Also, to illustrate the point that running the Group-LASSO on the y⃗_j^(i) features (Y-GL) is less robust to noise and less adaptive to smoothness, we ran noisier trials using the configuration (p, N, n) = (1000, 500, 25). We increased the standard deviation of the noise on the grid function observations and on the response to 5 and 1 respectively. Under these conditions the FuSSO was able to recover the support in 49% of the trials, whereas Y-GL recovered the support in 32% of the trials. Furthermore, the FuSSO had a mean ∆ of .0743 compared to .0254 for Y-GL.

[Figure 2 (plots omitted): (a) typical function, observations, and estimate at n = 5; (b) regularization path at p = 100, N = 50, n = 5; (c) typical function, observations, and estimate at n = 25; (d) regularization path at p = 1000, N = 500, n = 25.]

Figure 2: (a)(c) Two typical functions, noisy observations, and estimates. (b)(d) Regularization paths showing the norms of β̃_j (in red for j in the support, blue otherwise) for a range of λ; the rightmost vertical line indicates the largest λ able to recover the support, the leftmost line the smallest such λ.

5.2 Neurological Data

We also tested the FuSSO estimator on a neurological data-set with a total of 89 subjects [14]. Subjects ranged in age from 18 to 60 years old (Figure 3(b)). Our goal was to learn a regression that maps the dODFs at each white-matter voxel for each subject to the subject's age. The dODF is a function representing the amount of water molecules, or spins, undergoing diffusion in different orientations over the sphere S² [15]. I.e., each dODF is a function with a 2-d domain (of azimuth and elevation spherical coordinates) and a range of reals representing the strength of water diffusion at the given orientations (see Figure 3(a)). Data was provided for each subject in a template space for white-matter voxels; a total of over 25 thousand voxels' dODFs were regressed on (i.e. p ≈ 25000). We also compared regression using the FuSSO and functional covariates to using the LASSO and real-valued covariates. We used the non-functional collection of quantitative anisotropy (QA) values for the same white-matter voxels as with the dODF functions. QA values are the estimated amount of spins that undergo diffusion in the direction of the principal fiber orientation, i.e., the peak of the dODF; QA has been used as a measure of white-matter integrity in the underlying voxel, hence making for a descriptive and effective summary statistic of a dODF function for age regression [15]. The projection coefficients for the dODFs at each voxel were estimated using the cosine basis. The FuSSO estimator gave a cross-validated MSE of 70.855, where the variance of age was 156.4265; selected voxels in the support may be seen in Figure 3(c). The LASSO estimate using QA values gave a cross-validated MSE of 77.1302. Thus, one may see that considering the entire functional data gave us better results for age regression. We note that we were unable to use the naive approach of Y-GL in this case because of memory constraints and the fact that the function evaluation points did not lie on a 2-d square grid.

[Figure 3 (plots omitted): (a) example ODF; (b) histogram of subject ages; (c) voxels in the support; (d) histogram of held-out absolute errors.]

Figure 3: (a) An example ODF for a voxel. (b) Histogram of ages for subjects. (c) Voxels in the support of the model, shown in blue. (d) Histogram of held-out error magnitudes.

6 Conclusion

In conclusion, this paper presents the FuSSO, a functional analogue to the LASSO. The FuSSO allows one to efficiently find a sparse set of functional input covariates to regress a real-valued response against. The FuSSO makes no parametric assumptions about the nature of the input functional covariates and assumes a linear form to the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via proof of asymptotic sparsistency.

Acknowledgements

This work is supported in part by NSF grants IIS1247658 and IIS1250350.

References

[1] Rajendra Bhatia. Matrix Analysis, volume 169. Springer, 1997.

[2] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, 2006.

[3] Gareth M. James, Jing Wang, and Ji Zhu. Functional linear regression that's interpretable. The Annals of Statistics, pages 2083-2108, 2009.

[4] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, 1991.

[5] Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374-393, 2007.

[6] Nicola Mingotti, Rosa E. Lillo, and Juan Romo. Lasso variable selection in functional regression. 2013.

[7] Junier B. Oliva, Barnabas Poczos, and Jeff Schneider. Distribution to distribution regression. In International Conference on Machine Learning (ICML), pages 1049-1057. ICML, 2013.

[8] B. Poczos, A. Rinaldo, A. Singh, and L. Wasserman. Distribution-free distribution regression. AISTATS, 2013.

[9] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009-1030, 2009.

[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

[11] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.

[12] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. arXiv preprint math/0605740, 2006.

[13] Larry Wasserman. Probability inequalities. http://www.stat.cmu.edu/~larry/=stat705/Lecture2.pdf, August 2012.

[14] Fang-Cheng Yeh and Wen-Yih Isaac Tseng. NTU-90: a high angular resolution brain atlas constructed by q-space diffeomorphic reconstruction. NeuroImage, 58(1):91-99, 2011.

[15] Fang-Cheng Yeh, Van Jay Wedeen, W.-Y. Tseng, et al. Generalized q-sampling imaging. IEEE Transactions on Medical Imaging, 29(9):1626, 2010.

[16] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

[17] Yihong Zhao, R. Todd Ogden, and Philip T. Reiss. Wavelet-based lasso in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600-617, 2012.

