Algorithm 1 CoFineUCB
1: input: λ, λ̃, U, c_t(·), c̃_t(·)
2: for t = 1, …, T do
3:   Define X_t ≡ [x_1, x_2, …, x_{t−1}]
4:   Define X̃_t ≡ U^⊤ X_t
5:   Define Y_t ≡ [ŷ_1, ŷ_2, …, ŷ_{t−1}]
6:   M̃_t ← λ̃ I_K + X̃_t X̃_t^⊤
7:   w̃_t ← M̃_t^{−1} X̃_t Y_t^⊤   // least squares on coarse level
8:   M_t ← λ I_D + X_t X_t^⊤
9:   w_t ← M_t^{−1} (X_t Y_t^⊤ + λ U w̃_t)   // least squares on fine level
10:  Define μ_t(x) ≡ w_t^⊤ x
11:  x_t ← argmax_{x ∈ X_t} μ_t(x) + c_t(x) + c̃_t(x)   // play action with highest upper confidence bound
12:  Recommend x_t, observe reward ŷ_t
13: end for

For simplicity and practical relevance, we focus on two-level hierarchies.

4. Algorithm & Main Results

We now present a bandit algorithm that exploits feature hierarchies. Our algorithm, CoFineUCB, is an upper confidence bound algorithm that generalizes the well-studied LinUCB algorithm and automatically trades off between exploring the coarse and full feature spaces. CoFineUCB is described in Algorithm 1. At each iteration t, CoFineUCB estimates the user's preferences in the subspace, w̃_t, as well as in the full feature space, w_t. Both estimates are computed via regularized least-squares regression. First, w̃_t is estimated via

  w̃_t = argmin_w̃ Σ_{τ=1}^{t−1} (w̃^⊤ x̃_τ − ŷ_τ)² + λ̃ ‖w̃‖²,   (4)

where x̃_τ ≡ U^⊤ x_τ denotes the projected features of the action taken at time τ. Then w_t is estimated via

  w_t = argmin_w Σ_{τ=1}^{t−1} (w^⊤ x_τ − ŷ_τ)² + λ ‖w − U w̃_t‖²,   (5)

which regularizes w_t toward the projection of w̃_t back into the full space. Both optimization problems have closed-form solutions (Lines 7 & 9 in Algorithm 1).

CoFineUCB is an optimistic algorithm that chooses the action with the largest potential reward (given some target confidence). Selecting such an action requires computing confidence intervals around the mean estimate w_t. We maintain confidence intervals for both the full space and the subspace, denoted c_t(·) and c̃_t(·), respectively. Intuitively, a valid 1 − δ confidence interval should satisfy the property that

  |x^⊤ (w_t − w*)| ≤ c_t(x) + c̃_t(x)   (6)

holds with probability at least 1 − δ.

We will show that the following definitions of c_t(·) and c̃_t(·) yield a valid 1 − δ confidence interval:

  c̃_t(x) = α̃_t^{(v)} ‖U^⊤ M_t^{−1} x‖_{M̃_t^{−1}} + α̃_t^{(b)} ‖M̃_t^{−1} U^⊤ M_t^{−1} x‖,   (7)

  c_t(x) = α_t^{(v)} ‖x‖_{M_t^{−1}} + α_t^{(b)} ‖M_t^{−1} x‖,   (8)

where α̃_t^{(v)}, α̃_t^{(b)}, α_t^{(v)}, and α_t^{(b)} are coefficients that must be set properly (Lemma 1).

Broadly speaking, there are two types of uncertainty affecting an estimate, w_t^⊤ x, of the utility of x: variance and bias. In our setting, variance is due to the stochasticity of user feedback ŷ_t. Bias, on the other hand, is due to regularization when estimating w̃_t and w_t. Intuitively, as our algorithm receives more feedback, it becomes less uncertain (w.r.t. both bias and variance) of its estimates, w̃_t and w_t. This notion of uncertainty is captured via the inverse feature covariance matrices M̃_t and M_t (Lines 6 & 8 in Algorithm 1). Table 1 provides an interpretation of the four sources of uncertainty described in (7) and (8).

Lemma 1 below describes how to set the coefficients such that c_t(x) + c̃_t(x) is a valid 1 − δ confidence bound.

Lemma 1. Define S̃ = ‖w̃*‖ and S_⊥ = ‖w*_⊥‖, and let

  α_t^{(v)} = √( log( det(M_t)^{1/2} det(λ I_D)^{−1/2} / δ ) )

  α̃_t^{(v)} = λ √( log( det(M̃_t)^{1/2} det(λ̃ I_K)^{−1/2} / δ ) )

  α_t^{(b)} = √(2λ) S_⊥

  α̃_t^{(b)} = λ √λ̃ S̃.

Then (6) is a valid 1 − δ confidence interval.

With the confidence intervals defined, we are now ready to present our main result on the regret bound.

Theorem 1. Define c̃_t(·) and c_t(·) as in (7), (8) and Lemma 1. For λ ≥ max_x ‖x‖² and λ̃ ≥ max_x ‖x̃‖², with probability 1 − δ, CoFineUCB achieves regret

  R_T(w*) ≤ ( β_T √D + β̃_T √K ) √( 2T log(1 + T) ),

where

  β_T = √( D log((1 + T/λ)/δ) ) + √(2λ) S_⊥   (9)

  β̃_T = √( K log((1 + T/λ̃)/δ) ) + √λ̃ S̃.   (10)

Lemma 1 and Theorem 1 are proved in Appendix A.

Theorem 1 essentially bounds the regret as

  R_T(w*) = O( ( √λ̃ ‖w̃*‖ K + √(2λ) ‖w*_⊥‖ D ) √T ),   (11)
Hierarchical Exploration for Accelerating Contextual Bandits
ignoring log factors. In contrast, the conventional LinUCB algorithm only explores in the full feature space and achieves an analogous regret bound of

  R_T(w*) = O( √λ ‖w*‖ D √T ).   (12)

Comparing (11) with (12) suggests that, when K ≪ D and ‖w*_⊥‖ is small, CoFineUCB suffers much less regret due to more efficient exploration. Depending on U, ‖w̃*‖ can also be much smaller than ‖w*‖. Section 5 describes an approach for computing such a U.

Intuitively, CoFineUCB enjoys a superior regret bound to LinUCB due to its use of tighter confidence regions. Figure 2 depicts a comparative example. LinUCB employs ellipsoid confidence regions. CoFineUCB utilizes confidence regions that are essentially the convolution of two ellipsoids, B̃ ⊗ B_⊥, as illustrated in Figure 2.

Figure 2. An example of confidence regions utilized by CoFineUCB and LinUCB. B denotes the ellipsoid confidence region used by LinUCB. CoFineUCB maintains two ellipsoid confidence regions, B̃ and B_⊥, for the subspace and full space, respectively. The joint confidence region of CoFineUCB is essentially the convolution of B̃ and B_⊥, B̃ ⊗ B_⊥, which can be much smaller than B.

the feature space such that ‖w*‖ in the new space is small. Thus, running LinUCB in the altered feature space yields an improved bound on the regret (12), which is linear in ‖w*‖.

where w̃ ≡ (U^⊤ U)^{−1} U^⊤ w, and C̃ = max_x ‖U^⊤ x‖ constrains the magnitude of U. It is difficult to optimize (13) directly, so we approximate it using a smooth formulation,⁴

  argmin_{U ∈ span(U_0): ‖U‖²_Fro = K} Σ_{w ∈ W} ‖w̃‖²,   (14)

where we now constrain U via ‖U‖²_Fro = K.

We further restrict U to be U ≡ U_0 Ω^{1/2} for Ω ⪰ 0. Under this restriction, (14) is equivalent to

  argmin_{Ω: trace(Ω) = K} Σ_{w ∈ W} ‖Ω^{−1/2} w̃_0‖²,   (15)

where w̃_0 ≡ (U_0^⊤ U_0)^{−1} U_0^⊤ w = U_0^⊤ w. This formulation is akin to multi-task structure learning, where W_0 would denote the various tasks and Ω denotes feature relationships common across tasks (Argyriou et al., 2007; Zhang & Yeung, 2010). One can show that (15) is convex and is minimized by

  Ω = ( K / trace( (W̃_0 W̃_0^⊤)^{1/2} ) ) (W̃_0 W̃_0^⊤)^{1/2}.   (16)

6.1. Baseline Approaches

Mean-Regularized One simple approach is to regularize to w̄ (e.g., the mean of W) when estimating w_t in LinUCB. The estimation problem can be written as

  w_t = argmin_w Σ_{τ=1}^{t−1} (w^⊤ x_τ − ŷ_τ)² + λ ‖w − w̄‖².   (17)

Typically, ‖w* − w̄‖ < ‖w*‖, implying lower regret.

Reshape Another approach is to use LinUCB with a feature space "reshaped" via a transform U_D ∈ R^{D×D}:

  w_t = argmin_w Σ_{τ=1}^{t−1} (w^⊤ U_D^⊤ x_τ − ŷ_τ)² + λ ‖w‖².   (18)

As in the mean-regularization approach above, here we would like the representation of w* in the reshaped space to have a small norm. In our experiments, we use U_D = LearnU(W, D) (Algorithm 2).

We can incorporate such reshaping into CoFineUCB. We first project W into the space defined by U_D, denoted by Ŵ,⁵ then compute U via LearnU(Ŵ, K). During model estimation, we replace (5) with

  w_t = argmin_w Σ_{τ=1}^{t−1} (w^⊤ U_D^⊤ x_τ − ŷ_τ)² + λ ‖w − U w̃_t‖².

Incorporating reshaping into CoFineUCB can lead to a decrease in S_⊥ = ‖ŵ*_⊥‖. We found the modification to be quite effective in practice; all our experiments in the following sections employ this variant of CoFineUCB.

SubspaceUCB Finally, we can simply ignore the full space and only apply LinUCB in the subspace. While the method seems to perform well given a good subspace (as seen in (Li et al., 2010; Chapelle & Li, 2011; Yue & Guestrin, 2011), among others), it can yield linear regret if the residual of the user's preference is strong, as we will see in the experiments.

6.2. Experimental Setting

We employ the submodular bandit extension of linear stochastic bandits (Yue & Guestrin, 2011) to model the news recommendation setting. Here, the algorithm must choose a set of L actions, receiving rewards based on both the quality and the diversity of the actions chosen (L = 1 is the conventional bandit setting). This structured action space leads to a more realistic setting for content recommendation, since recommender systems often must recommend multiple items at a time. It is straightforward to extend CoFineUCB to the submodular bandit setting (see Appendix C).

6.3. Simulations

We performed simulation evaluations using data collected from a previous user study in personalized news recommendation (Yue & Guestrin, 2011). The data includes featurized articles (D = 100) and N = 77 user profiles. We employed leave-one-out validation: for each user, the transformations U_D and U (K = 5) were trained using the remaining users' profiles. For each user, we ran 25 simulations (T = 10000). All algorithms used the same U and U_D projections, where applicable. We also compared with a variant of CoFineUCB, CoFineUCB-focus, which scales down exploration in the full space c_t by a factor of 0.25.

Figure 3(a) shows the cumulative regret of each algorithm averaged over all users when recommending one article per iteration (L = 1). All algorithms dramatically outperform Naive LinUCB, with the exception of Mean-Regularized, which performs almost identically. While Reshape shows good eventual convergence behavior, it incurs higher initial regret than the CoFineUCB algorithms and SubspaceUCB. The trends also hold when recommending multiple articles per iteration (L = 5), as seen in Figure 3(b).

The performance of the two variants of CoFineUCB and SubspaceUCB demonstrates the benefit of exploring in the subspace. However, Figure 3(c) reveals the critical shortfall of SubspaceUCB by comparing average cumulative regret for the ten users with the largest residual ‖w*_⊥‖. For these atypical users, the subspace is not sufficient to adequately learn their preferences, resulting in linear regret for SubspaceUCB.

Figure 3(d) shows the behavior of CoFineUCB as we vary K. Larger subspaces require more exploration, which in general leads to increased regret.

Figure 3(e) shows the behavior of CoFineUCB as we vary the scaling of exploration in the full space c_t (CoFineUCB-focus is the special case where the scaling factor is 0.25). More conservative exploration in the full space tends to reduce regret. However, no exploration of the full space can lead to higher regret.
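To make the estimation steps concrete, the coupled coarse/fine ridge regressions of Algorithm 1 (Lines 6–9) can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming, not the authors' implementation; the confidence terms (7)–(8) and their Lemma 1 coefficients are omitted.

```python
import numpy as np

def cofine_estimates(X, y, U, lam, lam_tilde):
    """Coupled coarse/fine ridge regressions (Lines 6-9 of Algorithm 1).

    X: (t-1, D) rows of past action features; y: (t-1,) observed rewards;
    U: (D, K) subspace projection; lam, lam_tilde: regularization weights.
    Returns the coarse estimate w_tilde, the fine estimate w, and the
    covariance matrices used to build confidence intervals.
    """
    D, K = U.shape
    X_tilde = X @ U                                         # projected features (Line 4)
    M_tilde = lam_tilde * np.eye(K) + X_tilde.T @ X_tilde   # Line 6
    w_tilde = np.linalg.solve(M_tilde, X_tilde.T @ y)       # Line 7: coarse least squares
    M = lam * np.eye(D) + X.T @ X                           # Line 8
    # Line 9: fine least squares, regularized toward the lifted coarse estimate
    w = np.linalg.solve(M, X.T @ y + lam * (U @ w_tilde))
    return w_tilde, w, M_tilde, M
```

The UCB selection step (Line 11) would then score each candidate x by w^⊤x plus confidence terms computed from M and M̃.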
⁵ Ŵ ≡ (U_D^⊤ U_D)^{−1} U_D^⊤ W.

Synthetic Dataset. We used a 25-dimensional synthetic dataset to study the effect of mismatch between
[Figure 3 appears here: log-scale plots of cumulative regret vs. iterations for Naive, Mean-Regularized, Reshape, CoFineUCB, SubspaceUCB, and CoFineUCB-focus. Panel titles: (a) All users simulation (L = 1); (b) All users simulation (L = 5); (c) Atypical users simulation (L = 5).]

Figure 3. (a)–(e) Cumulative regret results for news recommendation simulation. (f) Comparison over preference vectors with varying projection residuals using synthetic simulation.
w* and U. This dataset allows for a more systematic analysis by forcing every x and w* to have unit norm. For residual magnitude β ∈ [0, 1], we sampled w* uniformly in a 5-dimensional subspace with magnitude √(1 − β²), and uniformly in the remaining dimensions with magnitude β. Figure 3(f) shows that the regret of both SubspaceUCB and CoFineUCB-focus increases with the residual, with SubspaceUCB exhibiting a more dramatic increase, beyond that of even Naive LinUCB.

Comparison              #Users   Win/Tie/Lose   Gain/Day
CoFineUCB v. Naive      27       24 / 1 / 3     0.69
CoFineUCB v. Reshape    30       21 / 3 / 6     0.27

Table 2. User study comparing CoFineUCB with two baselines. All results satisfy 95% statistical confidence.
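The sampling procedure used for Figure 3(f) is easy to reproduce. A minimal sketch follows; the routine name is ours, and we realize "uniformly" by normalizing Gaussian draws, which yields directions uniform on the corresponding spheres.

```python
import numpy as np

def sample_preference(beta, D=25, K=5, rng=None):
    """Sample a unit-norm w* whose residual, i.e. the part outside the
    first K coordinates (standing in for the subspace), has norm beta."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(D)
    sub = rng.normal(size=K)        # direction inside the subspace
    rest = rng.normal(size=D - K)   # direction in the residual dimensions
    w[:K] = np.sqrt(1.0 - beta**2) * sub / np.linalg.norm(sub)
    w[K:] = beta * rest / np.linalg.norm(rest)
    return w
```

By construction ‖w*‖ = 1 and the residual block has norm exactly β.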
6.4. User Study

Our user study design follows the study conducted by Yue & Guestrin (2011). We presented each user with ten articles per day over ten days from January 21, 2012 to February 8, 2012. Each day comprised approximately ten thousand articles. We represented articles using D = 100 features corresponding to topics learned via latent Dirichlet allocation (Blei et al., 2003). For each day, the articles shown were selected using an interleaving of two bandit algorithms. Each user was instructed to briefly skim each article and mark it as "interested in reading in detail" or "not interested".

We conducted the user study in two phases. Prior to the first phase, we conducted a preliminary study to collect preferences for constructing U (K = 5). In the first phase, we compared CoFineUCB with Naive. Afterwards, we took all the user profiles learned so far to estimate a reshaping of the full space U_D, and compared against Reshape. Due to the short duration of each session (T = 10), we did not expect a meaningful comparison between CoFineUCB and SubspaceUCB, so we omitted it (we expect both methods to perform equally well in early iterations, as seen in the simulation experiments). For each user session, we counted the total number of liked articles recommended by each algorithm. An algorithm wins a session if the user liked more of the articles it recommended.

Table 2 shows that, over the two stages, about 80% of the users preferred CoFineUCB. We see a smaller gain against Reshape, a stronger baseline. On average, users liked over half an additional article per day from CoFineUCB over Naive, and about a quarter additional per day over Reshape. These results show that CoFineUCB is effective in reducing the amount of exploration required.

[Figure 4 appears here: columns of word clouds with bar charts of per-dimension preference weights for CoFineUCB and Naive LinUCB.]

Figure 4. Each column of word clouds represents a dimension in the subspace. The bar lengths denote the magnitude in each dimension of preference vectors learned by CoFineUCB (blue) and Naive LinUCB (red). The rightmost column shows the norm of the residual w*_⊥ of weight vectors learned by CoFineUCB and Naive LinUCB.

Figure 4 shows a representation of four dimensions of U learned from user profiles. Each dimension is a combination of features, i.e., topics from LDA. In the top row, the i-th word cloud contains representative words from topics associated with high positive weights in the i-th column of U, and the bottom row those with high negative weights. Examining Figure 4 can reveal tendencies in the user preferences collected in our study; for example, the third column shows that users interested in Republican politics also tend to follow healthcare debates, but tend to be uninterested in videogaming. Figure 4 also shows a comparison of weights estimated by CoFineUCB and Naive LinUCB for one user. Since Naive LinUCB does not utilize the subspace, the weights it estimates tend to have a much higher residual norm, whereas CoFineUCB puts higher weight on the subspace dimensions.

7. Related Work

Optimizing recommender systems via user feedback has become increasingly popular in recent years (El-Arini et al., 2009; Li et al., 2010; 2011; Yue & Guestrin, 2011; Ahmed et al., 2012). Most prior work does not address the issue of exploration and often trains with pre-collected feedback, which may lead to a biased model.

The exploration-exploitation tradeoff inherent in learning from user feedback is naturally modeled as a contextual bandit problem (Langford & Zhang, 2007; Li et al., 2010; Slivkins, 2011; Chapelle & Li, 2011; Krause & Ong, 2011). In contrast to most prior work, we focus on principled approaches for encoding prior knowledge for accelerated bandit learning.

Our work builds upon a long line of research on linear stochastic bandits (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011). Although often practical, one limitation is the assumption of realizability; in other words, we assume that the true model of user behavior lies within our class.

The use of hierarchies in bandit learning is not new. For instance, the work of Pandey et al. (2007a;b) encodes prior knowledge by hierarchically clustering articles into a taxonomy. However, their setting is feature-free, which can make it difficult to generalize to new articles and users. In contrast, our approach makes use of readily available feature-based prior knowledge, such as the learned preferences of existing users.

Another related line of work is that of sparse linear bandits (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). The assumption is that the true w* is sparse, and one can achieve regret bounds that depend on the sparsity of w*. In contrast, we consider settings where user profiles are not necessarily sparse but can be well-approximated by a low-rank subspace.

It may be possible to integrate our feature hierarchy approach with other bandit learning algorithms, such as Thompson sampling (Chapelle & Li, 2011). Thompson sampling is a probability matching algorithm that samples w_t from the posterior distribution. Using feature hierarchies, one can define a hierarchical sampling approach that first samples w̃_t in the subspace, and then samples w_t around w̃_t in the full space.

Our approach can be applied to many structured classes of bandit problems (e.g., Streeter & Golovin (2008); Cesa-Bianchi & Lugosi (2009)), assuming that
actions can be featurized and modeled linearly. For instance, our experiments demonstrated substantial improvements upon naive UCB algorithms for the linear submodular bandit problem (Yue & Guestrin, 2011).

The problem of learning a good subspace U is related to finding a good regularization structure for multi-task learning (Argyriou et al., 2007; Zhang & Yeung, 2010). Given a sample of user profiles (task weights), our goal is essentially to learn a regularization structure so that future users (tasks) are solved efficiently. However, the coarse subspace of our feature hierarchy was estimated using a relatively small number of imperfectly estimated existing user profiles. A more general problem would be to learn the feature hierarchy on-the-fly as an online learning problem itself.

8. Conclusion

We have presented a general approach to encoding prior knowledge for accelerating contextual bandit learning. In particular, our approach employs a coarse-to-fine feature hierarchy which dramatically reduces the amount of exploration required. We evaluated our approach in the setting of personalized news recommendation, where we showed significant improvements over existing approaches for encoding prior knowledge.

Acknowledgements. The authors thank the anonymous reviewers for their helpful comments. The authors also thank Khalid El-Arini for help with data collection and processing. This work was supported in part by ONR (PECASE) N000141010672, ONR Young Investigator Program N00014-08-1-0752, and by the Intel Science and Technology Center for Embedded Computing.

References

Abbasi-Yadkori, Yasin, Pál, David, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Neural Information Processing Systems (NIPS), 2011.

Abbasi-Yadkori, Yasin, Pál, David, and Szepesvári, Csaba. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Ahmed, Amr, Teo, Choon Hui, Vishwanathan, S.V.N., and Smola, Alexander. Fair and balanced: Learning to present news stories. In ACM Conference on Web Search and Data Mining (WSDM), 2012.

Argyriou, Andreas, Micchelli, Charles A., Pontil, Massimiliano, and Ying, Yiming. A spectral regularization framework for multi-task structure learning. In Neural Information Processing Systems (NIPS), 2007.

Blei, David, Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993–1022, 2003.

Carpentier, Alexandra and Munos, Rémi. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Combinatorial bandits. In Conference on Learning Theory (COLT), 2009.

Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems (NIPS), 2011.

Dani, Varsha, Hayes, Thomas, and Kakade, Sham. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory (COLT), 2008.

El-Arini, Khalid, Veda, Gaurav, Shahaf, Dafna, and Guestrin, Carlos. Turning down the noise in the blogosphere. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2009.

Krause, Andreas and Ong, Cheng Soon. Contextual Gaussian process bandit optimization. In Neural Information Processing Systems (NIPS), 2011.

Langford, John and Zhang, Tong. The epoch-greedy algorithm for contextual multi-armed bandits. In Neural Information Processing Systems (NIPS), 2007.

Li, Lei, Wang, Dingding, Li, Tao, Knox, Daniel, and Padmanabhan, Balaji. SCENE: A scalable two-stage personalized news recommendation system. In ACM Conference on Information Retrieval (SIGIR), 2011.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference (WWW), 2010.

Pandey, Sandeep, Agarwal, Deepak, Chakrabarti, Deepayan, and Josifovski, Vanja. Bandits for taxonomies: A model-based approach. In SIAM Conference on Data Mining (SDM), 2007a.

Pandey, Sandeep, Chakrabarti, Deepayan, and Agarwal, Deepak. Multi-armed bandit problems with dependent arms. In International Conference on Machine Learning (ICML), 2007b.

Rusmevichientong, Paat and Tsitsiklis, John. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Slivkins, Aleksandrs. Contextual bandits with similarity information. In Conference on Learning Theory (COLT), 2011.

Streeter, Matthew and Golovin, Daniel. An online algorithm for maximizing submodular functions. In Neural Information Processing Systems (NIPS), 2008.

Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Neural Information Processing Systems (NIPS), 2011.

Zhang, Yu and Yeung, Dit-Yan. A convex formulation for learning task relationships in multi-task learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.