
Tutorial on Bandits Games

Sébastien Bubeck

Online Learning with Full Information

[Slide figures: an adversary assigns a loss to each of $d$ actions (1: CNN, 2: NBC, . . . , d: ABC); the player picks an arm $A \in \{1, \ldots, d\}$ and suffers the corresponding loss. Feedback: the losses of all arms $1, \ldots, d$ are revealed.]

Online Learning with Bandit Feedback

[Slide figures: the same game, but after the player picks an arm $A \in \{1, \ldots, d\}$, only the loss suffered on the chosen arm is revealed as feedback.]

Some Applications

Computer Go

Brain computer interface

Medical trials

Packet routing

Ads placement

Dynamic allocation

Notation

For each round $t = 1, 2, \ldots, n$:

1. The player chooses an arm $I_t \in \{1, \ldots, d\}$, possibly with the help of an external randomization.
2. Simultaneously the adversary chooses a loss vector $\ell_t = (\ell_{1,t}, \ldots, \ell_{d,t}) \in [0,1]^d$.
3. The player incurs the loss $\ell_{I_t,t}$, and observes:
   - the loss vector $\ell_t$ in the full information setting,
   - only the loss incurred $\ell_{I_t,t}$ in the bandit setting.

Goal: minimize the cumulative loss incurred. We consider the regret:
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d} \mathbb{E}\sum_{t=1}^n \ell_{i,t}.$$


Exponential Weights (EW, EWA, MW, Hedge, etc.)

Draw $I_t$ at random from $p_t$ where
$$p_t(i) = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \ell_{i,s}\right)}{\sum_{j=1}^d \exp\left(-\eta \sum_{s=1}^{t-1} \ell_{j,s}\right)}.$$

Theorem (Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [1997])
Exp satisfies
$$R_n \leq \sqrt{\frac{n \log d}{2}}.$$
Moreover, for any strategy,
$$\sup_{\text{adversaries}} R_n \geq \sqrt{\frac{n \log d}{2}} + o\left(\sqrt{n \log d}\right).$$
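To make the update concrete, here is a minimal NumPy sketch of the exponential-weights (Hedge) strategy with full-information feedback. The learning rate and the random seed are illustrative choices, not part of the slides; the classical tuning $\eta = \sqrt{8\log(d)/n}$ matches the $\sqrt{(n/2)\log d}$ bound above.

```python
import numpy as np

def exponential_weights(losses, eta, seed=0):
    """Full-information exponential weights on a d x n matrix of losses in [0, 1].

    losses[i, t] is the loss of arm i at round t.  Returns the arms played and
    the player's cumulative loss."""
    d, n = losses.shape
    cum = np.zeros(d)                       # cumulative losses of each arm
    rng = np.random.default_rng(seed)
    played, total = [], 0.0
    for t in range(n):
        logits = -eta * cum
        p = np.exp(logits - logits.max())   # subtract the max for numerical stability
        p /= p.sum()
        i = rng.choice(d, p=p)              # draw I_t from p_t
        played.append(i)
        total += losses[i, t]
        cum += losses[:, t]                 # full information: every loss is observed
    return played, total
```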

Magic trick for bandit feedback

$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_t(i)}\,\mathbb{1}_{I_t = i}$$
is an unbiased estimate of $\ell_{i,t}$. We call Exp3 the Exp strategy run on the estimated losses.

Theorem (Auer, Cesa-Bianchi, Freund and Schapire [2003])
Exp3 satisfies
$$R_n \leq \sqrt{2nd\log d}.$$
Moreover, for any strategy,
$$\sup_{\text{adversaries}} R_n \geq \frac{1}{4}\sqrt{nd} + o\left(\sqrt{nd}\right).$$
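A minimal sketch of Exp3 (an assumption-laden illustration, not the slides' own code): the only change with respect to the full-information sketch above is that the losses are replaced by the importance-weighted estimates $\tilde{\ell}_{i,t}$.

```python
import numpy as np

def exp3(loss_fn, d, n, eta, seed=0):
    """Exp3: exponential weights run on importance-weighted loss estimates.

    loss_fn(i, t) returns the loss in [0, 1] of arm i at round t; only the
    played arm's loss is observed.  A learning rate of order sqrt(log(d)/(n*d))
    matches the sqrt(2 n d log d) bound above."""
    cum_est = np.zeros(d)                   # cumulative estimated losses
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(n):
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())
        p /= p.sum()
        i = rng.choice(d, p=p)              # play I_t ~ p_t
        loss = loss_fn(i, t)                # bandit feedback
        total += loss
        cum_est[i] += loss / p[i]           # tilde-ell: ell/p on the played arm, 0 elsewhere
    return total
```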


High probability bounds

What about bounds directly on the true regret
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t}\;?$$

Auer et al. [2003] proposed Exp3.P:
$$p_t(i) = (1-\gamma)\frac{\exp\left(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_{i,s}\right)}{\sum_{j=1}^d \exp\left(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_{j,s}\right)} + \frac{\gamma}{d},$$
where
$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_t(i)}\,\mathbb{1}_{I_t = i} + \frac{\beta}{p_t(i)}.$$
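A sketch of Exp3.P following the reconstruction above (the $\beta/p_t(i)$ term and the $\gamma/d$ exploration are taken from the slide; the parameter values are left to the user, e.g. those of the next theorem):

```python
import numpy as np

def exp3p(loss_fn, d, n, eta, gamma, beta, seed=0):
    """Exp3.P: exponential weights on shifted estimates, mixed with uniform exploration."""
    cum_est = np.zeros(d)
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(n):
        logits = -eta * cum_est
        w = np.exp(logits - logits.max())
        p = (1.0 - gamma) * w / w.sum() + gamma / d   # mix with the uniform distribution
        i = rng.choice(d, p=p)
        loss = loss_fn(i, t)
        total += loss
        est = beta / p                        # the beta / p_t(j) term, added for every arm
        est[i] += loss / p[i]                 # importance-weighted observed loss
        cum_est += est
    return total
```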


High probability bounds

Theorem (Auer et al. [2003], Audibert and Bubeck [2011])
Let $\delta \in (0,1)$. With $\beta = \sqrt{\frac{\log(d\delta^{-1})}{nd}}$, $\eta = 0.95\sqrt{\frac{\log d}{nd}}$ and $\gamma = 1.05\sqrt{\frac{d\log d}{n}}$, Exp3.P satisfies with probability at least $1-\delta$:
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t} \leq 5.15\sqrt{nd\log(d\delta^{-1})}.$$

On the other hand, with $\beta = \sqrt{\frac{\log d}{nd}}$, $\eta = 0.95\sqrt{\frac{\log d}{nd}}$ and $\gamma = 1.05\sqrt{\frac{d\log d}{n}}$, Exp3.P satisfies, for any $\delta \in (0,1)$, with probability at least $1-\delta$:
$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\ldots,d}\sum_{t=1}^n \ell_{i,t} \leq \sqrt{\frac{nd}{\log d}}\log(\delta^{-1}) + 5.15\sqrt{nd\log d}.$$


Other types of normalization

INF (Implicitly Normalized Forecaster) is based on a potential function $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$, increasing, convex, twice continuously differentiable, and such that $(0,1] \subset \psi(\mathbb{R}_-^*)$. At each time step INF computes the new probability distribution as follows:
$$p_t(i) = \psi\left(C_t - \sum_{s=1}^{t-1}\tilde{\ell}_{i,s}\right),$$
where $C_t$ is the unique real number such that $\sum_{i=1}^d p_t(i) = 1$.

$\psi(x) = \exp(\eta x) + \frac{\gamma}{d}$ corresponds exactly to the Exp3 strategy.

$\psi(x) = \left(\frac{\eta}{-x}\right)^2 + \frac{\gamma}{d}$ is the quadratic INF strategy.
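As an illustration of the implicit normalization, here is a sketch of one INF step with the quadratic potential, finding $C_t$ by bisection; the potential form follows the reconstruction above, the code assumes $\gamma < 1$, and the tolerance is an arbitrary choice.

```python
import numpy as np

def inf_step(cum_losses, eta, gamma, tol=1e-10):
    """One INF update: p[i] = psi(C - cum_losses[i]) with the normalizing C found by bisection."""
    d = len(cum_losses)
    psi = lambda x: (eta / (-x)) ** 2 + gamma / d   # quadratic INF potential, defined for x < 0

    def total(C):
        return np.sum(psi(C - cum_losses))

    hi = np.min(cum_losses) - 1e-12      # keeps every argument strictly negative; total(hi) is huge
    lo = hi - 1.0
    while total(lo) > 1.0:               # bracket: as C -> -infinity, total(C) -> gamma < 1
        lo -= 2.0 * (hi - lo)
    while hi - lo > tol:                 # total(C) is increasing in C, so bisect
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    p = psi(lo - cum_losses)
    return p / p.sum()                   # tiny renormalization for numerical safety
```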


Minimax optimal regret bound

Theorem (Audibert and Bubeck [2009], Audibert and Bubeck [2010], Audibert, Bubeck and Lugosi [2011])
Quadratic INF satisfies:
$$R_n \leq 2\sqrt{2nd}.$$

Stochastic Assumption

Assumption (Robbins [1952])
The sequence of losses $(\ell_t)_{1\leq t\leq n}$ is a sequence of i.i.d. random variables.

For historical reasons, in this setting we consider gains rather than losses and we introduce different notation: let $\nu_i$ be the unknown reward distribution underlying arm $i$, $\mu_i$ the mean of $\nu_i$, $\mu^* = \max_{1\leq i\leq d}\mu_i$ and $\Delta_i = \mu^* - \mu_i$. Let $X_{i,s} \sim \nu_i$ be the reward obtained when pulling arm $i$ for the $s$-th time, and $T_i(t) = \sum_{s=1}^t \mathbb{1}_{I_s=i}$ the number of times arm $i$ was pulled up to time $t$. Thus here
$$R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n \mu_{I_t} = \sum_{i=1}^d \Delta_i\,\mathbb{E}\,T_i(n).$$


Optimism in face of uncertainty

General principle: given some observations from an unknown environment, build (with some probabilistic argument) a set of possible environments, then act as if the real environment was the most favorable one in this set.

Application to stochastic bandits: given the past rewards, build confidence intervals for the means $(\mu_i)$ (in particular build upper confidence bounds), then play the arm with the highest upper confidence bound.


UCB (Upper Confidence Bounds)

Theorem (Hoeffding [1963])
Let $X, X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$. Then with probability at least $1-\delta$,
$$\mathbb{E}X \leq \frac{1}{t}\sum_{s=1}^t X_s + \sqrt{\frac{\log(1/\delta)}{2t}}.$$

This directly suggests the famous UCB strategy of Auer, Cesa-Bianchi and Fischer [2002]:
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\left\{\frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\}.$$

Auer et al. proved the following regret bound:
$$R_n \leq \sum_{i : \Delta_i > 0} \frac{10\log n}{\Delta_i}.$$
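A self-contained sketch of this UCB strategy on Bernoulli arms (the environment and the arm means below are purely illustrative choices):

```python
import math
import random

def ucb(pull, d, n):
    """Pull each arm once, then play the arm maximizing empirical mean + sqrt(2 log t / T_i)."""
    counts = [0] * d        # T_i(t-1)
    sums = [0.0] * d        # cumulative reward of arm i
    total = 0.0
    for t in range(1, n + 1):
        if t <= d:
            i = t - 1       # initialization: pull every arm once
        else:
            i = max(range(d), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        total += r
    return total

# illustrative use: three Bernoulli arms with (arbitrary) means 0.3, 0.5, 0.7
means = [0.3, 0.5, 0.7]
total_reward = ucb(lambda i: 1.0 if random.random() < means[i] else 0.0, d=3, n=10000)
```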


Illustration of UCB

[Slide figures: illustration of the UCB indices on a bandit instance.]

Lower bound

For any $p, q \in [0,1]$, let
$$\mathrm{kl}(p,q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$

Theorem (Lai and Robbins [1985])
Consider a consistent strategy, i.e. such that for all $a > 0$ we have $\mathbb{E}T_i(n) = o(n^a)$ if $\Delta_i > 0$. Then for any Bernoulli reward distributions,
$$\liminf_{n\to+\infty} \frac{R_n}{\log n} \geq \sum_{i:\Delta_i>0} \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}.$$

Note that
$$\frac{1}{2\Delta_i} \geq \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)} \geq \frac{\mu^*(1-\mu^*)}{2\Delta_i}.$$


KL-UCB

Theorem (Chernoff's inequality)
Let $X, X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$. Then
$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \leq \mathbb{E}X - \epsilon\right) \leq \exp\left(-t\,\mathrm{kl}(\mathbb{E}X - \epsilon,\, \mathbb{E}X)\right).$$

In particular this implies that with probability at least $1-\delta$:
$$\mathbb{E}X \leq \max\left\{q \in [0,1] : \mathrm{kl}\left(\frac{1}{t}\sum_{s=1}^t X_s,\, q\right) \leq \frac{\log(1/\delta)}{t}\right\}.$$


KL-UCB

Thus Chernoff's bound suggests the KL-UCB strategy of Garivier and Cappé [2011] (see also Honda and Takemura [2010], Maillard, Munos and Stoltz [2011]):
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d} \max\left\{q \in [0,1] : \mathrm{kl}\left(\frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s},\, q\right) \leq \frac{(1+\epsilon)\log t}{T_i(t-1)}\right\}.$$

Garivier and Cappé proved the following regret bound for $n$ large enough:
$$R_n \leq \sum_{i:\Delta_i>0} \frac{\Delta_i(1+2\epsilon)}{\mathrm{kl}(\mu_i,\mu^*)}\log n.$$
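The inner maximization is one-dimensional and easy to carry out numerically: since $q \mapsto \mathrm{kl}(\hat{\mu}, q)$ is increasing on $[\hat{\mu}, 1]$, the KL-UCB index can be computed by bisection, as in this sketch (the clipping constant and the number of bisection steps are arbitrary choices):

```python
import math

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, c=0.0):
    """max{ q in [mean, 1] : kl(mean, q) <= (1 + c) * log(t) / count }, by bisection."""
    level = (1.0 + c) * math.log(t) / count
    lo, hi = mean, 1.0
    for _ in range(50):                  # 50 bisection steps are plenty at double precision
        mid = 0.5 * (lo + hi)
        if kl(mean, mid) > level:
            hi = mid
        else:
            lo = mid
    return lo
```

The arm selection is then the argmax over arms of kl_ucb_index(empirical mean of arm $i$, $T_i(t-1)$, $t$).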


Heavy-tailed distributions

The standard UCB works for all $\sigma^2$-subgaussian distributions (not only bounded distributions), i.e. such that
$$\mathbb{E}\exp\left(\lambda(X - \mathbb{E}X)\right) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right), \quad \forall\lambda \in \mathbb{R}.$$

It is easy to see that this is equivalent to: there exists $\lambda > 0$ such that $\mathbb{E}\exp(\lambda X^2) < +\infty$. What happens for distributions with heavier tails? Can we get logarithmic regret if the distributions only have a finite variance?


Median of means, Alon, Gibbons, Matias and Szegedy [2002]

Lemma
Let $X, X_1, \ldots, X_n$ be i.i.d. random variables such that $\mathbb{E}(X - \mathbb{E}X)^2 \leq 1$. Let $\delta \in (0,1)$, $k = 8\log\frac{1}{\delta}$ and $N = \frac{n}{8\log\frac{1}{\delta}}$. Then with probability at least $1-\delta$,
$$\mathbb{E}X \leq \operatorname{median}\left(\frac{1}{N}\sum_{s=1}^{N} X_s,\; \ldots,\; \frac{1}{N}\sum_{s=(k-1)N+1}^{kN} X_s\right) + \sqrt{\frac{8\log(\delta^{-1})}{n}}.$$
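A sketch of the median-of-means estimator of the lemma (rounding of the block count and the handling of leftover samples are arbitrary choices of the illustration):

```python
import math

def median_of_means(xs, delta):
    """Median of k ~ 8 log(1/delta) block means; xs is a list of i.i.d. samples."""
    n = len(xs)
    k = max(1, int(math.ceil(8.0 * math.log(1.0 / delta))))
    N = n // k                           # block size; leftover samples are simply dropped
    if N == 0:                           # too few samples for that many blocks
        return sum(xs) / n
    block_means = sorted(sum(xs[j * N:(j + 1) * N]) / N for j in range(k))
    return block_means[k // 2]           # the median of the k block means
```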


LT-UCB

This suggests LT-UCB, Bubeck, Cesa-Bianchi and Lugosi [2012]:
$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\left\{\operatorname{median}\left(\frac{1}{N_{i,t}}\sum_{s=1}^{N_{i,t}} X_{i,s},\; \ldots,\; \frac{1}{N_{i,t}}\sum_{s=(k_t-1)N_{i,t}+1}^{k_t N_{i,t}} X_{i,s}\right) + \sqrt{\frac{32\log t}{T_i(t-1)}}\right\},$$
with $k_t = 16\log t$ and $N_{i,t} = \frac{T_i(t-1)}{16\log t}$. The following regret bound can be proved for any set of distributions with variance bounded by 1:
$$R_n \leq \sum_{i:\Delta_i>0} \frac{200\log n}{\Delta_i}.$$


Markovian rewards

Assumption
The sequence $(X_{i,t})_{t\geq 1}$ forms an aperiodic irreducible finite-state Markov chain with unknown transition matrix $P_i$.

Again in this framework it is possible to design a UCB strategy with logarithmic regret (Tekin and Liu [2011]), using the following result:

Theorem (Lezaud [1998])
Let $X_1, \ldots, X_t$ be an aperiodic irreducible finite-state Markov chain with transition matrix $P$. Let $\lambda_2$ be the second largest eigenvalue of the multiplicative symmetrization of $P$ and $\gamma = 1 - \lambda_2$. Let $\mu$ be the expectation of $X_1$ under the stationary distribution. There exists $C > 0$ such that for any $\epsilon \in (0,1]$,
$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \geq \mu + \epsilon\right) \leq C\exp\left(-\frac{t\epsilon^2\gamma}{28}\right).$$


Online Lipschitz and Stochastic Optimization

Stochastic multi-armed bandit where $\{1, \ldots, K\}$ is replaced by $\mathcal{X}$. At time $t$, select $x_t \in \mathcal{X}$, then receive a random variable $r_t \in [0,1]$ such that $\mathbb{E}[r_t \mid x_t] = f(x_t)$.

Assumption
$\mathcal{X}$ is equipped with a symmetric function $\ell : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ such that $\ell(x,x) = 0$. $f$ is Lipschitz with respect to $\ell$, that is $|f(x) - f(y)| \leq \ell(x,y)$ for all $x, y \in \mathcal{X}$.

$$R_n = n f^* - \mathbb{E}\sum_{t=1}^n f(x_t), \quad \text{where } f^* = \sup_{x\in\mathcal{X}} f(x).$$


Example in 1d

Where should one sample next?

How to define a high probability upper bound at any state x?

Noiseless case, $r_t = f(x_t)$

[Slide figures: by the Lipschitz property, the evaluation of $f$ at $x_t$ provides a first upper bound on $f$; each new point refines this upper bound.]

Back to the noisy case

UCB in a given domain

[Slide figure: a domain $\mathcal{X}_i$ containing the point $x$ and sampled points $x_t$ with observed rewards $r_t$.]

For a fixed domain $\mathcal{X}_i \ni x$ containing $n_i$ points $\{x_t\} \subset \mathcal{X}_i$, we have that $\sum_{t=1}^{n_i}\left(r_t - f(x_t)\right)$ is a martingale. Thus by Azuma's inequality,
$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2n_i}} \;\geq\; \frac{1}{n_i}\sum_{t=1}^{n_i} f(x_t) \;\geq\; f(x) - \operatorname{diam}(\mathcal{X}_i),$$
since $f$ is Lipschitz (where $\operatorname{diam}(\mathcal{X}_i) = \sup_{x,y\in\mathcal{X}_i}\ell(x,y)$).


High probability upper bound

[Slide figure: the upper confidence bound over the domain $\mathcal{X}_i$, combining the empirical mean, the confidence width $\sqrt{\log(1/\delta)/(2n_i)}$ and the diameter $\operatorname{diam}(\mathcal{X}_i)$.]

With probability at least $1-\delta$,
$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2n_i}} + \operatorname{diam}(\mathcal{X}_i) \;\geq\; \sup_{x\in\mathcal{X}_i} f(x).$$

Tradeoff between the number of points in a domain and the size of the domain. By considering several domains we can derive a tighter upper bound.


A hierarchical decomposition

Use a tree of partitions at all scales:
$$B_i(t) \stackrel{\mathrm{def}}{=} \min\left\{\hat{\mu}_i(t) + \sqrt{\frac{2\log t}{T_i(t)}} + \operatorname{diam}(i),\; \max_{j\in\mathcal{C}(i)} B_j(t)\right\}.$$

Hierarchical Optimistic Optimization (HOO)

[Bubeck, Munos, Stoltz, Szepesvári, 2008, 2011]: Consider a tree of partitions of $\mathcal{X}$, each node $i$ corresponds to a subdomain $\mathcal{X}_i$.

HOO Algorithm: Let $\mathcal{T}_t$ be the set of expanded nodes at round $t$.
- $\mathcal{T}_1 = \{\text{root}\}$ (space $\mathcal{X}$)
- At time $t$, select a leaf $I_t$ of $\mathcal{T}_t$ by maximizing the $B$-values,
- $\mathcal{T}_{t+1} = \mathcal{T}_t \cup \{I_t\}$
- Select $x_t \in \mathcal{X}_{I_t}$
- Observe the reward $r_t$ and update the $B$-values:
$$B_i(t) \stackrel{\mathrm{def}}{=} \min\left\{\hat{\mu}_i(t) + \sqrt{\frac{2\log t}{T_i(t)}} + \operatorname{diam}(i),\; \max_{j\in\mathcal{C}(i)} B_j(t)\right\}.$$

[Slide figure: the followed path of turned-on nodes down the tree, with $B$-values $B_{h,i}$, $B_{h+1,2i-1}$, $B_{h+1,2i}$, the selected node and the pulled point $X_t$.]
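The following is a minimal, didactic sketch of HOO on $\mathcal{X} = [0,1]$ with $\ell(x,y) = |x-y|$; it recomputes $B$-values recursively instead of maintaining them incrementally, so it is simple but inefficient, and the noise model passed to the demo is an arbitrary assumption:

```python
import math
import random

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.count = 0          # T_i: number of rewards received in this subdomain
        self.total = 0.0        # sum of those rewards
        self.children = None    # None until the node is expanded

    def diam(self):
        return self.hi - self.lo            # diameter under ell(x, y) = |x - y|

def b_value(node, t):
    """B_i(t) = min( mu_hat + sqrt(2 log t / T_i) + diam(i), max_{j in C(i)} B_j(t) )."""
    if node.count == 0:
        return float("inf")
    u = node.total / node.count + math.sqrt(2.0 * math.log(t) / node.count) + node.diam()
    if node.children is None:
        return u
    return min(u, max(b_value(c, t) for c in node.children))

def hoo(f, n, noise=0.1, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0)
    rewards = []
    for t in range(1, n + 1):
        # follow the path of maximal B-values down to a leaf of the expanded tree
        path, node = [root], root
        while node.children is not None:
            node = max(node.children, key=lambda c: b_value(c, t))
            path.append(node)
        # expand the selected leaf (split its interval in two)
        mid = (node.lo + node.hi) / 2.0
        node.children = [Node(node.lo, mid), Node(mid, node.hi)]
        # pull a point of the selected subdomain and observe a noisy reward
        x = rng.uniform(node.lo, node.hi)
        r = f(x) + rng.uniform(-noise, noise)
        rewards.append(r)
        for v in path:                       # update counts and sums along the followed path
            v.count += 1
            v.total += r
    return rewards
```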

Example in 1d

$r_t \sim \mathcal{B}(f(x_t))$, a Bernoulli distribution with parameter $f(x_t)$.

[Slide figures: the resulting tree at time $n = 1000$ and at $n = 10000$.]

Analysis of HOO

The near-optimality dimension $d$ of $f$ is defined as follows: let $\mathcal{X}_\epsilon \stackrel{\mathrm{def}}{=} \{x \in \mathcal{X} : f(x) \geq f^* - \epsilon\}$ be the set of $\epsilon$-optimal points. Then $\mathcal{X}_\epsilon$ can be covered by $O(\epsilon^{-d})$ balls of radius $\epsilon$. A similar notion was introduced in [Kleinberg, Slivkins, Upfal, 2008].

Theorem
HOO satisfies:
$$R_n = O\left(n^{\frac{d+1}{d+2}}\right).$$


Example 1:

Assume the function is locally peaky around its maximum:
$$f(x^*) - f(x) = \Theta(\|x^* - x\|).$$

It takes $O(\epsilon^0)$ balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$ with $\ell(x,y) = \|x-y\|$. Thus $d = 0$ and the regret is $O(\sqrt{n})$.


Example 2:

Assume the function is locally quadratic around its maximum:
$$f(x^*) - f(x) = \Theta(\|x^* - x\|^2).$$

For $\ell(x,y) = \|x-y\|$, it takes $O(\epsilon^{-D/2})$ balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$. Thus $d = D/2$ and $R_n = O(n^{\frac{D+2}{D+4}})$.

For $\ell(x,y) = \|x-y\|^2$, it takes $O(\epsilon^0)$ $\ell$-balls of radius $\epsilon$ to cover $\mathcal{X}_\epsilon$. Thus $d = 0$ and $R_n = O(\sqrt{n})$.


Example

$\mathcal{X} = [0,1]^D$, $\alpha > 0$, and the mean-payoff function $f$ is locally $\alpha$-smooth around (any of) its maxima $x^*$ (in finite number):
$$f(x^*) - f(x) = \Theta(\|x - x^*\|^\alpha) \text{ as } x \to x^*.$$

Theorem
Assume that we run HOO using $\ell(x,y) = \|x-y\|^\beta$.
- Known smoothness: $\beta = \alpha$. $R_n = O(\sqrt{n})$, i.e., the rate is independent of the dimension $D$.
- Smoothness underestimated: $\beta < \alpha$. $R_n = O(n^{(d+1)/(d+2)})$ where $d = D\left(\frac{1}{\beta} - \frac{1}{\alpha}\right)$.
- Smoothness overestimated: $\beta > \alpha$. No guarantee.

Note: UCT corresponds to $\beta = +\infty$.


Path planning

Combinatorial prediction game

[Slide figures: the adversary assigns a loss to each of the $d$ edges of a graph; the player chooses a path and the loss suffered is the sum of the losses of the edges on that path. Feedback: Full Info: the losses of all edges $1, 2, \ldots, d$; Semi-Bandit: the losses of the chosen edges (e.g. $2, 7, \ldots, d-2$); Bandit: only the total loss suffered.]

Notation

[Slide figures: a concept class $\mathcal{S} \subset \{0,1\}^d$ (e.g. the incidence vectors of paths in a graph with $d$ edges) and a loss vector $\ell_t \in \mathbb{R}_+^d$ on the edges.]

The player chooses $V_t \in \mathcal{S}$ and the loss suffered is $\ell_t^\top V_t$.
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^\top V_t - \min_{u\in\mathcal{S}} \mathbb{E}\sum_{t=1}^n \ell_t^\top u.$$


Set of concepts $\mathcal{S} \subset \{0,1\}^d$

Paths, k-sets, matchings, spanning trees, k-sized intervals, parallel bandits.

Key idea

$V_t \sim p_t$, $p_t \in \Delta(\mathcal{S})$. Then, build an unbiased estimate $\tilde{\ell}_t$ of the loss $\ell_t$:
- $\tilde{\ell}_t = \ell_t$ in the full information game,
- $\tilde{\ell}_{i,t} = \dfrac{\ell_{i,t}\, V_{i,t}}{\sum_{V\in\mathcal{S}:\, V_i=1} p_t(V)}$ in the semi-bandit game,
- $\tilde{\ell}_t = P_t^{+} V_t V_t^\top \ell_t$, with $P_t = \mathbb{E}_{V\sim p_t}\left(V V^\top\right)$, in the bandit game.
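A sketch of the three estimators for a small, explicitly enumerated concept class (the matrix encoding of $\mathcal{S}$ and the float types are assumptions of the illustration, not part of the slides):

```python
import numpy as np

def loss_estimate(S, p, v_played, loss, mode):
    """Unbiased loss estimate tilde-ell_t for combinatorial prediction.

    S: (m, d) 0/1 float matrix whose rows are the concepts; p: distribution over rows;
    v_played: the row actually played; loss: the true loss vector ell_t."""
    if mode == "full":
        return loss.copy()                               # everything is observed
    if mode == "semi":
        q = S.T @ p                                      # q[i] = P(V_{i,t} = 1)
        est = np.zeros(len(loss))
        on = v_played > 0.5
        est[on] = loss[on] / q[on]                       # observed coordinates, importance weighted
        return est
    if mode == "bandit":
        P = S.T @ np.diag(p) @ S                         # P_t = E[V V^T]
        observed = float(loss @ v_played)                # only the total loss is observed
        return np.linalg.pinv(P) @ v_played * observed   # P_t^+ V_t (V_t^T ell_t)
    raise ValueError(mode)
```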


Loss assumptions

Definition ($L_\infty$)
We say that the adversary satisfies the $L_\infty$ assumption if $\|\ell_t\|_\infty \leq 1$ for all $t = 1, \ldots, n$.

Definition ($L_2$)
We say that the adversary satisfies the $L_2$ assumption if $\ell_t^\top v \leq 1$ for all $t = 1, \ldots, n$ and $v \in \mathcal{S}$.


Expanded Exponentially weighted average forecaster (Exp2)

$$p_t(v) = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\tilde{\ell}_s^\top v\right)}{\sum_{u\in\mathcal{S}}\exp\left(-\eta\sum_{s=1}^{t-1}\tilde{\ell}_s^\top u\right)}.$$

In the full information game, against $L_2$ adversaries, we have (for some $\eta$) $R_n \leq \sqrt{2dn}$, which is the optimal rate, Dani, Hayes and Kakade [2008]. Thus against $L_\infty$ adversaries we have $R_n \leq d^{3/2}\sqrt{2n}$. But this is suboptimal, Koolen, Warmuth and Kivinen [2010]. Audibert, Bubeck and Lugosi [2011] showed that, for any $\eta$, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$ adversary such that:
$$R_n \geq 0.02\, d^{3/2}\sqrt{n}.$$
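Exp2 is just exponential weights over the enumerated class, so computing $p_t$ is a one-liner when $\mathcal{S}$ is small enough to enumerate (which is its main practical limitation); a sketch:

```python
import numpy as np

def exp2_distribution(S, cum_est, eta):
    """Exp2 weights: S is an (m, d) 0/1 matrix of concepts, cum_est the sum of the
    estimated loss vectors so far; returns the distribution p_t over the rows of S."""
    scores = -eta * (S @ cum_est)        # -eta * sum_s tilde-ell_s^T v, for every v in S
    w = np.exp(scores - scores.max())    # stabilized exponentiation
    return w / w.sum()
```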


Legendre function

Definition
Let $\mathcal{D}$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\operatorname{int}(\mathcal{D})$ and boundary $\partial\mathcal{D}$. We call Legendre any function $F : \mathcal{D} \to \mathbb{R}$ such that:
- $F$ is strictly convex and admits continuous first partial derivatives on $\operatorname{int}(\mathcal{D})$,
- for any $u \in \partial\mathcal{D}$ and any $v \in \operatorname{int}(\mathcal{D})$, we have
$$\lim_{s\to 0,\, s>0} (u-v)^\top \nabla F\left((1-s)u + sv\right) = +\infty.$$


Bregman divergence

Definition
The Bregman divergence $D_F : \mathcal{D} \times \operatorname{int}(\mathcal{D}) \to \mathbb{R}$ associated to a Legendre function $F$ is defined by
$$D_F(u,v) = F(u) - F(v) - (u-v)^\top \nabla F(v).$$
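Two classic instances, as a quick numerical check of the definition (the test vectors are arbitrary): the half squared Euclidean norm gives the squared distance, and the negative entropy gives the KL divergence between distributions.

```python
import numpy as np

def bregman(F, grad_F, u, v):
    """D_F(u, v) = F(u) - F(v) - (u - v)^T grad F(v)."""
    return F(u) - F(v) - (u - v) @ grad_F(v)

sq = lambda x: 0.5 * np.dot(x, x)                # F = half squared Euclidean norm
neg_ent = lambda x: np.sum(x * np.log(x))        # F = negative entropy (positive vectors)

u, v = np.array([0.2, 0.8]), np.array([0.5, 0.5])
d_sq = bregman(sq, lambda x: x, u, v)                       # = 0.5 * ||u - v||^2
d_kl = bregman(neg_ent, lambda x: np.log(x) + 1.0, u, v)    # = KL(u, v) for distributions
```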

CLEB (Combinatorial LEarning with Bregman divergences)

Parameter: $F$ Legendre on $\mathcal{D} \supset \operatorname{Conv}(\mathcal{S})$.

(1) $w'_{t+1} \in \mathcal{D}$ : $\nabla F(w'_{t+1}) = \nabla F(w_t) - \tilde{\ell}_t$

(2) $w_{t+1} \in \operatorname*{argmin}_{w\in\operatorname{Conv}(\mathcal{S})} D_F(w, w'_{t+1})$

(3) $p_{t+1} \in \Delta(\mathcal{S})$ : $w_{t+1} = \mathbb{E}_{V\sim p_{t+1}} V$

[Slide figure: the gradient step from $w_t$ to $w'_{t+1}$ in $\mathcal{D}$, the Bregman projection onto $\operatorname{Conv}(\mathcal{S})$ giving $w_{t+1}$, and the corresponding distribution $p_{t+1} \in \Delta(\mathcal{S})$.]

General regret bound for CLEB

Theorem
If $F$ admits a Hessian $\nabla^2 F$ that is always invertible, then
$$R_n \leq \operatorname{diam}_{D_F}(\mathcal{S}) + \mathbb{E}\sum_{t=1}^n \tilde{\ell}_t^\top\left(\nabla^2 F(w_t)\right)^{-1}\tilde{\ell}_t.$$

Different instances of CLEB: LinExp (Entropy Function)

$\mathcal{D} = [0,+\infty)^d$, $F(x) = \frac{1}{\eta}\sum_{i=1}^d x_i \log x_i$

- Full Info: Hedge
- Semi-Bandit = Bandit: Exp3, Auer et al. [2002]
- Full Info: Component Hedge, Koolen, Warmuth and Kivinen [2010]
- Semi-Bandit: MW, Kale, Reyzin and Schapire [2010]
- Bandit: new algorithm


Different instances of CLEB: LinINF (Exchangeable Hessian)

$\mathcal{D} = [0,+\infty)^d$, $F(x) = \sum_{i=1}^d \int_0^{x_i} \psi^{-1}(s)\,ds$

INF, Audibert and Bubeck [2009]
- $\psi(x) = \exp(\eta x)$ : LinExp
- $\psi(x) = \left(\frac{\eta}{-x}\right)^{q}$, $q > 1$ : LinPoly


Different instances of CLEB: Follow the regularized leader

$\mathcal{D} = \operatorname{Conv}(\mathcal{S})$, then
$$w_{t+1} \in \operatorname*{argmin}_{w\in\mathcal{D}}\left\{\sum_{s=1}^{t}\tilde{\ell}_s^\top w + F(w)\right\}.$$

Particularly interesting choice: $F$ a self-concordant barrier function, Abernethy, Hazan and Rakhlin [2008].


Minimax regret for the full information game

Theorem (Koolen, Warmuth and Kivinen [2010])
In the full information game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary:
$$R_n \leq d\sqrt{2n}.$$
Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that:
$$R_n \geq 0.008\, d\sqrt{n}.$$

Minimax regret for the semi-bandit game

Theorem (Audibert, Bubeck and Lugosi [2011])
In the semi-bandit game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary:
$$R_n \leq d\sqrt{2n}.$$
Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that:
$$R_n \geq 0.008\, d\sqrt{n}.$$

Minimax regret for the bandit game

For the bandit game the situation becomes trickier.
- First, it appears necessary to add some sort of forced exploration on $\mathcal{S}$ to control third-order error terms in the regret bound.
- Second, the control of the quadratic term $\tilde{\ell}_t^\top\left(\nabla^2 F(w_t)\right)^{-1}\tilde{\ell}_t$ is much more involved than previously.


Minimax regret for the bandit game


Currently there exists three approaches to solve these issues: Dani, Hayes and Kakade [2008] construct a barycentric spanner of S (a sort of basis) and play Exp2 mixed with an uniform exploration on the spanner. Regret of order: d 5/2 n. Cesa-Bianchi and Lugosi [2009] use Exp2 with an uniform exploration on S. For good sets S, regret of order: d 2 n. Abernethy, Hazan and Rakhlin [2008] use FTRL with a self-concordant barrier F . They proposed an exploration 2 F (w ) 1 . Regret guided by the structure of the Hessian t of order: d 5/2 n. Theorem (Audibert, Bubeck and Lugosi [2011]) In the bandit game, for any strategy, there exists a subset S {0, 1} and an L -adversary such that: Rn 0.01 d 3/2 n.

