
Tutorial on Bandits Games

Sébastien Bubeck

Online Learning with Full Information

[Slide figures: an adversary assigns a loss to each of $d$ actions (1: CNN, 2: NBC, . . . , d: ABC); the player picks an arm $A \in \{1, \ldots, d\}$ and suffers the corresponding loss. Feedback: the losses of all arms $1, \ldots, d$ are revealed.]

Online Learning with Bandit Feedback

[Slide figures: the same game, but after the player picks an arm $A \in \{1, \ldots, d\}$, only the loss suffered on the chosen arm is revealed as feedback.]

Some Applications

Computer Go

Brain computer interface

Medical trials

Packet routing

Ads placement

Dynamic allocation

Notation

For each round $t = 1, 2, \ldots, n$:

1. The player chooses an arm $I_t \in \{1, \ldots, d\}$, possibly with the help of an external randomization.
2. Simultaneously the adversary chooses a loss vector $\ell_t = (\ell_{1,t}, \ldots, \ell_{d,t}) \in [0,1]^d$.
3. The player incurs the loss $\ell_{I_t,t}$, and observes:
   - the loss vector $\ell_t$ in the full information setting,
   - only the loss incurred $\ell_{I_t,t}$ in the bandit setting.

Goal: minimize the cumulative loss incurred. We consider the regret:
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d} \mathbb{E}\sum_{t=1}^n \ell_{i,t}.$$


Exponential Weights (EW, EWA, MW, Hedge, etc.)

Draw $I_t$ at random from $p_t$ where
$$p_t(i) = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \ell_{i,s}\right)}{\sum_{j=1}^d \exp\left(-\eta \sum_{s=1}^{t-1} \ell_{j,s}\right)}.$$

Theorem (Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [1997])
Exp satisfies
$$R_n \leq \sqrt{\frac{n \log d}{2}}.$$
Moreover, for any strategy,
$$\sup_{\text{adversaries}} R_n \geq \sqrt{\frac{n \log d}{2}} + o\left(\sqrt{n \log d}\right).$$
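To make the update concrete, here is a minimal NumPy sketch of the exponential-weights (Hedge) strategy with full-information feedback. The learning rate and the random seed are illustrative choices, not part of the slides; the classical tuning $\eta = \sqrt{8\log(d)/n}$ matches the $\sqrt{(n/2)\log d}$ bound above.

```python
import numpy as np

def exponential_weights(losses, eta, seed=0):
    """Full-information exponential weights on a d x n matrix of losses in [0, 1].

    losses[i, t] is the loss of arm i at round t.  Returns the arms played and
    the player's cumulative loss."""
    d, n = losses.shape
    cum = np.zeros(d)                       # cumulative losses of each arm
    rng = np.random.default_rng(seed)
    played, total = [], 0.0
    for t in range(n):
        logits = -eta * cum
        p = np.exp(logits - logits.max())   # subtract the max for numerical stability
        p /= p.sum()
        i = rng.choice(d, p=p)              # draw I_t from p_t
        played.append(i)
        total += losses[i, t]
        cum += losses[:, t]                 # full information: every loss is observed
    return played, total
```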

Magic trick for bandit feedback

$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_t(i)}\,\mathbb{1}_{I_t = i}$$
is an unbiased estimate of $\ell_{i,t}$. We call Exp3 the Exp strategy run on the estimated losses.

Theorem (Auer, Cesa-Bianchi, Freund and Schapire [2003])
Exp3 satisfies
$$R_n \leq \sqrt{2nd\log d}.$$
Moreover, for any strategy,
$$\sup_{\text{adversaries}} R_n \geq \frac{1}{4}\sqrt{nd} + o\left(\sqrt{nd}\right).$$
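A minimal sketch of Exp3 (an assumption-laden illustration, not the slides' own code): the only change with respect to the full-information sketch above is that the losses are replaced by the importance-weighted estimates $\tilde{\ell}_{i,t}$.

```python
import numpy as np

def exp3(loss_fn, d, n, eta, seed=0):
    """Exp3: exponential weights run on importance-weighted loss estimates.

    loss_fn(i, t) returns the loss in [0, 1] of arm i at round t; only the
    played arm's loss is observed.  A learning rate of order sqrt(log(d)/(n*d))
    matches the sqrt(2 n d log d) bound above."""
    cum_est = np.zeros(d)                   # cumulative estimated losses
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(n):
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())
        p /= p.sum()
        i = rng.choice(d, p=p)              # play I_t ~ p_t
        loss = loss_fn(i, t)                # bandit feedback
        total += loss
        cum_est[i] += loss / p[i]           # tilde-ell: ell/p on the played arm, 0 elsewhere
    return total
```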


High probability bounds

What about bounds directly on the true regret
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t}\;?$$

Auer et al. [2003] proposed Exp3.P:
$$p_t(i) = (1-\gamma)\frac{\exp\left(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_{i,s}\right)}{\sum_{j=1}^d \exp\left(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_{j,s}\right)} + \frac{\gamma}{d},$$
where
$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_t(i)}\,\mathbb{1}_{I_t = i} + \frac{\beta}{p_t(i)}.$$
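A sketch of Exp3.P following the reconstruction above (the $\beta/p_t(i)$ term and the $\gamma/d$ exploration are taken from the slide; the parameter values are left to the user, e.g. those of the next theorem):

```python
import numpy as np

def exp3p(loss_fn, d, n, eta, gamma, beta, seed=0):
    """Exp3.P: exponential weights on shifted estimates, mixed with uniform exploration."""
    cum_est = np.zeros(d)
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(n):
        logits = -eta * cum_est
        w = np.exp(logits - logits.max())
        p = (1.0 - gamma) * w / w.sum() + gamma / d   # mix with the uniform distribution
        i = rng.choice(d, p=p)
        loss = loss_fn(i, t)
        total += loss
        est = beta / p                        # the beta / p_t(j) term, added for every arm
        est[i] += loss / p[i]                 # importance-weighted observed loss
        cum_est += est
    return total
```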


High probability bounds

Theorem (Auer et al. [2003], Audibert and Bubeck [2011])
Let $\delta \in (0,1)$. With $\beta = \sqrt{\frac{\log(d\delta^{-1})}{nd}}$, $\eta = 0.95\sqrt{\frac{\log d}{nd}}$ and $\gamma = 1.05\sqrt{\frac{d\log d}{n}}$, Exp3.P satisfies with probability at least $1-\delta$:
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t} \leq 5.15\sqrt{nd\log(d\delta^{-1})}.$$

On the other hand, with $\beta = \sqrt{\frac{\log d}{nd}}$, $\eta = 0.95\sqrt{\frac{\log d}{nd}}$ and $\gamma = 1.05\sqrt{\frac{d\log d}{n}}$, Exp3.P satisfies, for any $\delta \in (0,1)$, with probability at least $1-\delta$:
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t} \leq \sqrt{\frac{nd}{\log d}}\log(\delta^{-1}) + 5.15\sqrt{nd\log d}.$$


Other types of normalization

INF (Implicitly Normalized Forecaster) is based on a potential function $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$, increasing, convex, twice continuously differentiable, and such that $(0,1] \subset \psi(\mathbb{R}_-^*)$. At each time step INF computes the new probability distribution as follows:
$$p_t(i) = \psi\left(C_t - \sum_{s=1}^{t-1}\tilde{\ell}_{i,s}\right),$$
where $C_t$ is the unique real number such that $\sum_{i=1}^d p_t(i) = 1$.

$\psi(x) = \exp(\eta x) + \frac{\gamma}{d}$ corresponds exactly to the Exp3 strategy.

$\psi(x) = \left(\frac{\eta}{-x}\right)^2 + \frac{\gamma}{d}$ is the quadratic INF strategy.
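As an illustration of the implicit normalization, here is a sketch of one INF step with the quadratic potential, finding $C_t$ by bisection; the potential form follows the reconstruction above, the code assumes $\gamma < 1$, and the tolerance is an arbitrary choice.

```python
import numpy as np

def inf_step(cum_losses, eta, gamma, tol=1e-10):
    """One INF update: p[i] = psi(C - cum_losses[i]) with the normalizing C found by bisection."""
    d = len(cum_losses)
    psi = lambda x: (eta / (-x)) ** 2 + gamma / d   # quadratic INF potential, defined for x < 0

    def total(C):
        return np.sum(psi(C - cum_losses))

    hi = np.min(cum_losses) - 1e-12      # keeps every argument strictly negative; total(hi) is huge
    lo = hi - 1.0
    while total(lo) > 1.0:               # bracket: as C -> -infinity, total(C) -> gamma < 1
        lo -= 2.0 * (hi - lo)
    while hi - lo > tol:                 # total(C) is increasing in C, so bisect
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    p = psi(lo - cum_losses)
    return p / p.sum()                   # tiny renormalization for numerical safety
```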


Minimax optimal regret bound

Theorem (Audibert and Bubeck [2009], Audibert and Bubeck [2010], Audibert, Bubeck and Lugosi [2011])
Quadratic INF satisfies:
$$R_n \leq 2\sqrt{2nd}.$$

Stochastic Assumption

Assumption (Robbins [1952])
The sequence of losses $(\ell_t)_{1\leq t\leq n}$ is a sequence of i.i.d. random variables.

For historical reasons, in this setting we consider gains rather than losses and we introduce different notation: let $\nu_i$ be the unknown reward distribution underlying arm $i$, $\mu_i$ the mean of $\nu_i$, $\mu^* = \max_{1\leq i\leq d}\mu_i$ and $\Delta_i = \mu^* - \mu_i$. Let $X_{i,s} \sim \nu_i$ be the reward obtained when pulling arm $i$ for the $s$-th time, and $T_i(t) = \sum_{s=1}^t \mathbb{1}_{I_s=i}$ the number of times arm $i$ was pulled up to time $t$. Thus here
$$R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n \mu_{I_t} = \sum_{i=1}^d \Delta_i\,\mathbb{E}\,T_i(n).$$


Optimism in face of uncertainty

General principle: given some observations from an unknown environment, build (with some probabilistic argument) a set of possible environments, then act as if the real environment was the most favorable one in this set.

Application to stochastic bandits: given the past rewards, build confidence intervals for the means $(\mu_i)$ (in particular build upper confidence bounds), then play the arm with the highest upper confidence bound.


UCB (Upper Confidence Bounds)

Theorem (Hoeffding [1963])
Let $X, X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$. Then with probability at least $1-\delta$,
$$\mathbb{E}X \leq \frac{1}{t}\sum_{s=1}^t X_s + \sqrt{\frac{\log(1/\delta)}{2t}}.$$

This directly suggests the famous UCB strategy of Auer, Cesa-Bianchi and Fischer [2002]:
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\left\{\frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\}.$$

Auer et al. proved the following regret bound:
$$R_n \leq \sum_{i : \Delta_i > 0} \frac{10\log n}{\Delta_i}.$$
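A self-contained sketch of this UCB strategy on Bernoulli arms (the environment and the arm means below are purely illustrative choices):

```python
import math
import random

def ucb(pull, d, n):
    """Pull each arm once, then play the arm maximizing empirical mean + sqrt(2 log t / T_i)."""
    counts = [0] * d        # T_i(t-1)
    sums = [0.0] * d        # cumulative reward of arm i
    total = 0.0
    for t in range(1, n + 1):
        if t <= d:
            i = t - 1       # initialization: pull every arm once
        else:
            i = max(range(d), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        total += r
    return total

# illustrative use: three Bernoulli arms with (arbitrary) means 0.3, 0.5, 0.7
means = [0.3, 0.5, 0.7]
total_reward = ucb(lambda i: 1.0 if random.random() < means[i] else 0.0, d=3, n=10000)
```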


Illustration of UCB

[Slide figures: illustration of the UCB indices on a bandit instance.]

Lower bound

For any $p, q \in [0,1]$, let
$$\mathrm{kl}(p,q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$

Theorem (Lai and Robbins [1985])
Consider a consistent strategy, i.e. such that for all $a > 0$ we have $\mathbb{E}T_i(n) = o(n^a)$ if $\Delta_i > 0$. Then for any Bernoulli reward distributions,
$$\liminf_{n\to+\infty} \frac{R_n}{\log n} \geq \sum_{i:\Delta_i>0} \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}.$$

Note that
$$\frac{1}{2\Delta_i} \geq \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)} \geq \frac{\mu^*(1-\mu^*)}{2\Delta_i}.$$


KL-UCB

Theorem (Chernoff's inequality)
Let $X, X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$. Then
$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \leq \mathbb{E}X - \epsilon\right) \leq \exp\left(-t\,\mathrm{kl}(\mathbb{E}X - \epsilon,\, \mathbb{E}X)\right).$$

In particular this implies that with probability at least $1-\delta$:
$$\mathbb{E}X \leq \max\left\{q \in [0,1] : \mathrm{kl}\left(\frac{1}{t}\sum_{s=1}^t X_s,\, q\right) \leq \frac{\log(1/\delta)}{t}\right\}.$$


KL-UCB

Thus Chernoff's bound suggests the KL-UCB strategy of Garivier and Cappé [2011] (see also Honda and Takemura [2010], Maillard, Munos and Stoltz [2011]):
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d} \max\left\{q \in [0,1] : \mathrm{kl}\left(\frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s},\, q\right) \leq \frac{(1+\epsilon)\log t}{T_i(t-1)}\right\}.$$

Garivier and Cappé proved the following regret bound for $n$ large enough:
$$R_n \leq \sum_{i:\Delta_i>0} \frac{\Delta_i(1+2\epsilon)}{\mathrm{kl}(\mu_i,\mu^*)}\log n.$$
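The inner maximization is one-dimensional and easy to carry out numerically: since $q \mapsto \mathrm{kl}(\hat{\mu}, q)$ is increasing on $[\hat{\mu}, 1]$, the KL-UCB index can be computed by bisection, as in this sketch (the clipping constant and the number of bisection steps are arbitrary choices):

```python
import math

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, c=0.0):
    """max{ q in [mean, 1] : kl(mean, q) <= (1 + c) * log(t) / count }, by bisection."""
    level = (1.0 + c) * math.log(t) / count
    lo, hi = mean, 1.0
    for _ in range(50):                  # 50 bisection steps are plenty at double precision
        mid = 0.5 * (lo + hi)
        if kl(mean, mid) > level:
            hi = mid
        else:
            lo = mid
    return lo
```

The arm selection is then the argmax over arms of kl_ucb_index(empirical mean of arm $i$, $T_i(t-1)$, $t$).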


Heavy-tailed distributions

The standard UCB works for all $\sigma^2$-subgaussian distributions (not only bounded distributions), i.e. such that
$$\mathbb{E}\exp\left(\lambda(X - \mathbb{E}X)\right) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right), \quad \forall\lambda \in \mathbb{R}.$$

It is easy to see that this is equivalent to: there exists $\lambda > 0$ such that $\mathbb{E}\exp(\lambda X^2) < +\infty$. What happens for distributions with heavier tails? Can we get logarithmic regret if the distributions only have a finite variance?


Median of means, Alon, Gibbons, Matias and Szegedy [2002]

Lemma
Let $X, X_1, \ldots, X_n$ be i.i.d. random variables such that $\mathbb{E}(X - \mathbb{E}X)^2 \leq 1$. Let $\delta \in (0,1)$, $k = 8\log\frac{1}{\delta}$ and $N = \frac{n}{8\log\frac{1}{\delta}}$. Then with probability at least $1-\delta$,
$$\mathbb{E}X \leq \operatorname{median}\left(\frac{1}{N}\sum_{s=1}^{N} X_s,\; \ldots,\; \frac{1}{N}\sum_{s=(k-1)N+1}^{kN} X_s\right) + \sqrt{\frac{8\log(\delta^{-1})}{n}}.$$
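A sketch of the median-of-means estimator of the lemma (rounding of the block count and the handling of leftover samples are arbitrary choices of the illustration):

```python
import math

def median_of_means(xs, delta):
    """Median of k ~ 8 log(1/delta) block means; xs is a list of i.i.d. samples."""
    n = len(xs)
    k = max(1, int(math.ceil(8.0 * math.log(1.0 / delta))))
    N = n // k                           # block size; leftover samples are simply dropped
    if N == 0:                           # too few samples for that many blocks
        return sum(xs) / n
    block_means = sorted(sum(xs[j * N:(j + 1) * N]) / N for j in range(k))
    return block_means[k // 2]           # the median of the k block means
```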


LT-UCB

This suggests LT-UCB, Bubeck, Cesa-Bianchi and Lugosi [2012]:
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\left\{\operatorname{median}\left(\frac{1}{N_{i,t}}\sum_{s=1}^{N_{i,t}} X_{i,s},\; \ldots,\; \frac{1}{N_{i,t}}\sum_{s=(k_t-1)N_{i,t}+1}^{k_t N_{i,t}} X_{i,s}\right) + \sqrt{\frac{32\log t}{T_i(t-1)}}\right\},$$
with $k_t = 16\log t$ and $N_{i,t} = \frac{T_i(t-1)}{16\log t}$. The following regret bound can be proved for any set of distributions with variance bounded by 1:
$$R_n \leq \sum_{i:\Delta_i>0} \frac{200\log n}{\Delta_i}.$$


Markovian rewards

Assumption
The sequence $(X_{i,t})_{t\geq 1}$ forms an aperiodic irreducible finite-state Markov chain with unknown transition matrix $P_i$.

Again in this framework it is possible to design a UCB strategy with logarithmic regret (Tekin and Liu [2011]), using the following result:

Theorem (Lezaud [1998])
Let $X_1, \ldots, X_t$ be an aperiodic irreducible finite-state Markov chain with transition matrix $P$. Let $\lambda_2$ be the second largest eigenvalue of the multiplicative symmetrization of $P$ and $\gamma = 1 - \lambda_2$. Let $\mu$ be the expectation of $X_1$ under the stationary distribution. There exists $C > 0$ such that for any $\epsilon \in (0,1]$,
$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \geq \mu + \epsilon\right) \leq C\exp\left(-\frac{t\epsilon^2\gamma}{28}\right).$$


Online Lipschitz and Stochastic Optimization

Stochastic multi-armed bandit where $\{1, \ldots, K\}$ is replaced by $\mathcal{X}$. At time $t$, select $x_t \in \mathcal{X}$, then receive a random variable $r_t \in [0,1]$ such that $\mathbb{E}[r_t \mid x_t] = f(x_t)$.

Assumption
$\mathcal{X}$ is equipped with a symmetric function $\ell : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ such that $\ell(x,x) = 0$. $f$ is Lipschitz with respect to $\ell$, that is $|f(x) - f(y)| \leq \ell(x,y)$ for all $x, y \in \mathcal{X}$.

$$R_n = n f^* - \mathbb{E}\sum_{t=1}^n f(x_t), \quad \text{where } f^* = \sup_{x\in\mathcal{X}} f(x).$$


Example in 1d

Where should one sample next?

How to define a high probability upper bound at any state x?

Noiseless case, $r_t = f(x_t)$

[Slide figures: by the Lipschitz property, the evaluation of $f$ at $x_t$ provides a first upper bound on $f$; each new point refines this upper bound.]

Back to the noisy case

UCB in a given domain

[Slide figure: a domain $\mathcal{X}_i$ containing the point $x$ and sampled points $x_t$ with observed rewards $r_t$.]

For a fixed domain $\mathcal{X}_i \ni x$ containing $n_i$ points $\{x_t\} \subset \mathcal{X}_i$, we have that $\sum_{t=1}^{n_i}\left(r_t - f(x_t)\right)$ is a martingale. Thus by Azuma's inequality,
$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2n_i}} \;\geq\; \frac{1}{n_i}\sum_{t=1}^{n_i} f(x_t) \;\geq\; f(x) - \operatorname{diam}(\mathcal{X}_i),$$
since $f$ is Lipschitz (where $\operatorname{diam}(\mathcal{X}_i) = \sup_{x,y\in\mathcal{X}_i}\ell(x,y)$).


High probability upper bound

[Slide figure: the upper confidence bound over the domain $\mathcal{X}_i$, combining the empirical mean, the confidence width $\sqrt{\log(1/\delta)/(2n_i)}$ and the diameter $\operatorname{diam}(\mathcal{X}_i)$.]

With probability at least $1-\delta$,
$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2n_i}} + \operatorname{diam}(\mathcal{X}_i) \;\geq\; \sup_{x\in\mathcal{X}_i} f(x).$$

Tradeoff between the number of points in a domain and the size of the domain. By considering several domains we can derive a tighter upper bound.


A hierarchical decomposition

Use a tree of partitions at all scales:
$$B_i(t) \stackrel{\mathrm{def}}{=} \min\left\{\hat{\mu}_i(t) + \sqrt{\frac{2\log t}{T_i(t)}} + \operatorname{diam}(i),\; \max_{j\in\mathcal{C}(i)} B_j(t)\right\}.$$

Hierarchical Optimistic Optimization (HOO)

[Bubeck, Munos, Stoltz, Szepesvári, 2008, 2011]: Consider a tree of partitions of $\mathcal{X}$, each node $i$ corresponds to a subdomain $\mathcal{X}_i$.

HOO Algorithm: Let $\mathcal{T}_t$ be the set of expanded nodes at round $t$.
- $\mathcal{T}_1 = \{\text{root}\}$ (space $\mathcal{X}$)
- At time $t$, select a leaf $I_t$ of $\mathcal{T}_t$ by maximizing the $B$-values,
- $\mathcal{T}_{t+1} = \mathcal{T}_t \cup \{I_t\}$
- Select $x_t \in \mathcal{X}_{I_t}$
- Observe the reward $r_t$ and update the $B$-values:
$$B_i(t) \stackrel{\mathrm{def}}{=} \min\left\{\hat{\mu}_i(t) + \sqrt{\frac{2\log t}{T_i(t)}} + \operatorname{diam}(i),\; \max_{j\in\mathcal{C}(i)} B_j(t)\right\}.$$

[Slide figure: the followed path of turned-on nodes down the tree, with $B$-values $B_{h,i}$, $B_{h+1,2i-1}$, $B_{h+1,2i}$, the selected node and the pulled point $X_t$.]
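The following is a minimal, didactic sketch of HOO on $\mathcal{X} = [0,1]$ with $\ell(x,y) = |x-y|$; it recomputes $B$-values recursively instead of maintaining them incrementally, so it is simple but inefficient, and the noise model passed to the demo is an arbitrary assumption:

```python
import math
import random

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.count = 0          # T_i: number of rewards received in this subdomain
        self.total = 0.0        # sum of those rewards
        self.children = None    # None until the node is expanded

    def diam(self):
        return self.hi - self.lo            # diameter under ell(x, y) = |x - y|

def b_value(node, t):
    """B_i(t) = min( mu_hat + sqrt(2 log t / T_i) + diam(i), max_{j in C(i)} B_j(t) )."""
    if node.count == 0:
        return float("inf")
    u = node.total / node.count + math.sqrt(2.0 * math.log(t) / node.count) + node.diam()
    if node.children is None:
        return u
    return min(u, max(b_value(c, t) for c in node.children))

def hoo(f, n, noise=0.1, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0)
    rewards = []
    for t in range(1, n + 1):
        # follow the path of maximal B-values down to a leaf of the expanded tree
        path, node = [root], root
        while node.children is not None:
            node = max(node.children, key=lambda c: b_value(c, t))
            path.append(node)
        # expand the selected leaf (split its interval in two)
        mid = (node.lo + node.hi) / 2.0
        node.children = [Node(node.lo, mid), Node(mid, node.hi)]
        # pull a point of the selected subdomain and observe a noisy reward
        x = rng.uniform(node.lo, node.hi)
        r = f(x) + rng.uniform(-noise, noise)
        rewards.append(r)
        for v in path:                       # update counts and sums along the followed path
            v.count += 1
            v.total += r
    return rewards
```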

Example in 1d

$r_t \sim \mathcal{B}(f(x_t))$, a Bernoulli distribution with parameter $f(x_t)$.

[Slide figures: the resulting tree at time $n = 1000$ and at $n = 10000$.]

Analysis of HOO

The near-optimality dimension $d$ of $f$ is defined as follows: let $\mathcal{X}_\epsilon \stackrel{\mathrm{def}}{=} \{x \in \mathcal{X} : f(x) \geq f^* - \epsilon\}$ be the set of $\epsilon$-optimal points. Then $\mathcal{X}_\epsilon$ can be covered by $O(\epsilon^{-d})$ balls of radius $\epsilon$. A similar notion was introduced in [Kleinberg, Slivkins, Upfal, 2008].

Theorem
HOO satisfies:
$$R_n = O\left(n^{\frac{d+1}{d+2}}\right).$$


Example 1:

Assume the function is locally peaky around its maximum:
$$f(x^*) - f(x) = \Theta(\|x^* - x\|).$$

It takes $O(\epsilon^0)$ balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$ with $\ell(x,y) = \|x-y\|$. Thus $d = 0$ and the regret is $O(\sqrt{n})$.


Example 2:

Assume the function is locally quadratic around its maximum:
$$f(x^*) - f(x) = \Theta(\|x^* - x\|^2).$$

For $\ell(x,y) = \|x-y\|$, it takes $O(\epsilon^{-D/2})$ balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$. Thus $d = D/2$ and $R_n = O(n^{\frac{D+2}{D+4}})$.

For $\ell(x,y) = \|x-y\|^2$, it takes $O(\epsilon^0)$ $\ell$-balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$. Thus $d = 0$ and $R_n = O(\sqrt{n})$.


Example

$\mathcal{X} = [0,1]^D$, $\alpha > 0$, and the mean-payoff function $f$ is locally $\alpha$-smooth around (any of) its maxima $x^*$ (in finite number):
$$f(x^*) - f(x) = \Theta(\|x - x^*\|^\alpha) \text{ as } x \to x^*.$$

Theorem
Assume that we run HOO using $\ell(x,y) = \|x-y\|^\beta$.
- Known smoothness: $\beta = \alpha$. $R_n = O(\sqrt{n})$, i.e., the rate is independent of the dimension $D$.
- Smoothness underestimated: $\beta < \alpha$. $R_n = O(n^{(d+1)/(d+2)})$ where $d = D\left(\frac{1}{\beta} - \frac{1}{\alpha}\right)$.
- Smoothness overestimated: $\beta > \alpha$. No guarantee.

Note: UCT corresponds to $\beta = +\infty$.


Path planning

Combinatorial prediction game

[Slide figures: the adversary assigns a loss to each of the $d$ edges of a graph; the player chooses a path and the loss suffered is the sum of the losses of the edges on that path. Feedback: Full Info: the losses of all edges $1, 2, \ldots, d$; Semi-Bandit: the losses of the chosen edges (e.g. $2, 7, \ldots, d-2$); Bandit: only the total loss suffered.]

Notation

[Slide figures: a concept class $\mathcal{S} \subset \{0,1\}^d$ (e.g. the incidence vectors of paths in a graph with $d$ edges) and a loss vector $\ell_t \in \mathbb{R}_+^d$ on the edges.]

The player chooses $V_t \in \mathcal{S}$ and the loss suffered is $\ell_t^\top V_t$.
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^\top V_t - \min_{u\in\mathcal{S}} \mathbb{E}\sum_{t=1}^n \ell_t^\top u.$$


Set of concepts $\mathcal{S} \subset \{0,1\}^d$

Paths, k-sets, matchings, spanning trees, k-sized intervals, parallel bandits.

Key idea

$V_t \sim p_t$, $p_t \in \Delta(\mathcal{S})$. Then, build an unbiased estimate $\tilde{\ell}_t$ of the loss $\ell_t$:
- $\tilde{\ell}_t = \ell_t$ in the full information game,
- $\tilde{\ell}_{i,t} = \dfrac{\ell_{i,t}\, V_{i,t}}{\sum_{V\in\mathcal{S}:\, V_i=1} p_t(V)}$ in the semi-bandit game,
- $\tilde{\ell}_t = P_t^{+} V_t V_t^\top \ell_t$, with $P_t = \mathbb{E}_{V\sim p_t}\left(V V^\top\right)$, in the bandit game.
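A sketch of the three estimators for a small, explicitly enumerated concept class (the matrix encoding of $\mathcal{S}$ and the float types are assumptions of the illustration, not part of the slides):

```python
import numpy as np

def loss_estimate(S, p, v_played, loss, mode):
    """Unbiased loss estimate tilde-ell_t for combinatorial prediction.

    S: (m, d) 0/1 float matrix whose rows are the concepts; p: distribution over rows;
    v_played: the row actually played; loss: the true loss vector ell_t."""
    if mode == "full":
        return loss.copy()                               # everything is observed
    if mode == "semi":
        q = S.T @ p                                      # q[i] = P(V_{i,t} = 1)
        est = np.zeros(len(loss))
        on = v_played > 0.5
        est[on] = loss[on] / q[on]                       # observed coordinates, importance weighted
        return est
    if mode == "bandit":
        P = S.T @ np.diag(p) @ S                         # P_t = E[V V^T]
        observed = float(loss @ v_played)                # only the total loss is observed
        return np.linalg.pinv(P) @ v_played * observed   # P_t^+ V_t (V_t^T ell_t)
    raise ValueError(mode)
```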


Loss assumptions

Definition ($L_\infty$)
We say that the adversary satisfies the $L_\infty$ assumption if $\|\ell_t\|_\infty \leq 1$ for all $t = 1, \ldots, n$.

Definition ($L_2$)
We say that the adversary satisfies the $L_2$ assumption if $\ell_t^\top v \leq 1$ for all $t = 1, \ldots, n$ and $v \in \mathcal{S}$.


Expanded Exponentially weighted average forecaster (Exp2)

$$p_t(v) = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\tilde{\ell}_s^\top v\right)}{\sum_{u\in\mathcal{S}}\exp\left(-\eta\sum_{s=1}^{t-1}\tilde{\ell}_s^\top u\right)}.$$

In the full information game, against $L_2$ adversaries, we have (for some $\eta$) $R_n \leq \sqrt{2dn}$, which is the optimal rate, Dani, Hayes and Kakade [2008]. Thus against $L_\infty$ adversaries we have $R_n \leq d^{3/2}\sqrt{2n}$. But this is suboptimal, Koolen, Warmuth and Kivinen [2010]. Audibert, Bubeck and Lugosi [2011] showed that, for any $\eta$, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$ adversary such that:
$$R_n \geq 0.02\, d^{3/2}\sqrt{n}.$$
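Exp2 is just exponential weights over the enumerated class, so computing $p_t$ is a one-liner when $\mathcal{S}$ is small enough to enumerate (which is its main practical limitation); a sketch:

```python
import numpy as np

def exp2_distribution(S, cum_est, eta):
    """Exp2 weights: S is an (m, d) 0/1 matrix of concepts, cum_est the sum of the
    estimated loss vectors so far; returns the distribution p_t over the rows of S."""
    scores = -eta * (S @ cum_est)        # -eta * sum_s tilde-ell_s^T v, for every v in S
    w = np.exp(scores - scores.max())    # stabilized exponentiation
    return w / w.sum()
```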


Legendre function

Definition
Let $\mathcal{D}$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\operatorname{int}(\mathcal{D})$ and boundary $\partial\mathcal{D}$. We call Legendre any function $F : \mathcal{D} \to \mathbb{R}$ such that:
- $F$ is strictly convex and admits continuous first partial derivatives on $\operatorname{int}(\mathcal{D})$,
- for any $u \in \partial\mathcal{D}$ and any $v \in \operatorname{int}(\mathcal{D})$, we have
$$\lim_{s\to 0,\, s>0} (u-v)^\top \nabla F\left((1-s)u + sv\right) = +\infty.$$


Bregman divergence

Definition
The Bregman divergence $D_F : \mathcal{D} \times \operatorname{int}(\mathcal{D}) \to \mathbb{R}$ associated to a Legendre function $F$ is defined by
$$D_F(u,v) = F(u) - F(v) - (u-v)^\top \nabla F(v).$$
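Two classic instances, as a quick numerical check of the definition (the test vectors are arbitrary): the half squared Euclidean norm gives the squared distance, and the negative entropy gives the KL divergence between distributions.

```python
import numpy as np

def bregman(F, grad_F, u, v):
    """D_F(u, v) = F(u) - F(v) - (u - v)^T grad F(v)."""
    return F(u) - F(v) - (u - v) @ grad_F(v)

sq = lambda x: 0.5 * np.dot(x, x)                # F = half squared Euclidean norm
neg_ent = lambda x: np.sum(x * np.log(x))        # F = negative entropy (positive vectors)

u, v = np.array([0.2, 0.8]), np.array([0.5, 0.5])
d_sq = bregman(sq, lambda x: x, u, v)                       # = 0.5 * ||u - v||^2
d_kl = bregman(neg_ent, lambda x: np.log(x) + 1.0, u, v)    # = KL(u, v) for distributions
```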

CLEB (Combinatorial LEarning with Bregman divergences)

Parameter: $F$ Legendre on $\mathcal{D} \supset \operatorname{Conv}(\mathcal{S})$.

(1) $w'_{t+1} \in \mathcal{D}$ : $\nabla F(w'_{t+1}) = \nabla F(w_t) - \tilde{\ell}_t$

(2) $w_{t+1} \in \operatorname*{argmin}_{w\in\operatorname{Conv}(\mathcal{S})} D_F(w, w'_{t+1})$

(3) $p_{t+1} \in \Delta(\mathcal{S})$ : $w_{t+1} = \mathbb{E}_{V\sim p_{t+1}} V$

[Slide figure: the gradient step from $w_t$ to $w'_{t+1}$ in $\mathcal{D}$, the Bregman projection onto $\operatorname{Conv}(\mathcal{S})$ giving $w_{t+1}$, and the corresponding distribution $p_{t+1} \in \Delta(\mathcal{S})$.]

General regret bound for CLEB

Theorem
If $F$ admits a Hessian $\nabla^2 F$ that is always invertible, then
$$R_n \leq \operatorname{diam}_{D_F}(\mathcal{S}) + \mathbb{E}\sum_{t=1}^n \tilde{\ell}_t^\top\left(\nabla^2 F(w_t)\right)^{-1}\tilde{\ell}_t.$$

Different instances of CLEB: LinExp (Entropy Function)

$\mathcal{D} = [0,+\infty)^d$, $F(x) = \frac{1}{\eta}\sum_{i=1}^d x_i \log x_i$

- Full Info: Hedge
- Semi-Bandit = Bandit: Exp3, Auer et al. [2002]
- Full Info: Component Hedge, Koolen, Warmuth and Kivinen [2010]
- Semi-Bandit: MW, Kale, Reyzin and Schapire [2010]
- Bandit: new algorithm


Different instances of CLEB: LinINF (Exchangeable Hessian)

$\mathcal{D} = [0,+\infty)^d$, $F(x) = \sum_{i=1}^d \int_0^{x_i} \psi^{-1}(s)\,ds$

INF, Audibert and Bubeck [2009]
- $\psi(x) = \exp(\eta x)$ : LinExp
- $\psi(x) = \left(\frac{\eta}{-x}\right)^{q}$, $q > 1$ : LinPoly


Different instances of CLEB: Follow the regularized leader

$\mathcal{D} = \operatorname{Conv}(\mathcal{S})$, then
$$w_{t+1} \in \operatorname*{argmin}_{w\in\mathcal{D}}\left\{\sum_{s=1}^{t}\tilde{\ell}_s^\top w + F(w)\right\}.$$

Particularly interesting choice: $F$ a self-concordant barrier function, Abernethy, Hazan and Rakhlin [2008].


Minimax regret for the full information game

Theorem (Koolen, Warmuth and Kivinen [2010])
In the full information game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary:
$$R_n \leq d\sqrt{2n}.$$
Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that:
$$R_n \geq 0.008\, d\sqrt{n}.$$

Minimax regret for the semi-bandit game

Theorem (Audibert, Bubeck and Lugosi [2011])
In the semi-bandit game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary:
$$R_n \leq d\sqrt{2n}.$$
Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that:
$$R_n \geq 0.008\, d\sqrt{n}.$$

Minimax regret for the bandit game

For the bandit game the situation becomes trickier.
- First, it appears necessary to add some sort of forced exploration on $\mathcal{S}$ to control third-order error terms in the regret bound.
- Second, the control of the quadratic term $\tilde{\ell}_t^\top\left(\nabla^2 F(w_t)\right)^{-1}\tilde{\ell}_t$ is much more involved than previously.


Minimax regret for the bandit game


Currently there exists three approaches to solve these issues: Dani, Hayes and Kakade [2008] construct a barycentric spanner of S (a sort of basis) and play Exp2 mixed with an uniform exploration on the spanner. Regret of order: d 5/2 n. Cesa-Bianchi and Lugosi [2009] use Exp2 with an uniform exploration on S. For good sets S, regret of order: d 2 n. Abernethy, Hazan and Rakhlin [2008] use FTRL with a self-concordant barrier F . They proposed an exploration 2 F (w ) 1 . Regret guided by the structure of the Hessian t of order: d 5/2 n. Theorem (Audibert, Bubeck and Lugosi [2011]) In the bandit game, for any strategy, there exists a subset S {0, 1} and an L -adversary such that: Rn 0.01 d 3/2 n.

