A slightly more general setting than the multi-armed bandit. As in that problem, we have a set of actions $\mathcal{A}$. The reward of each action now depends on an additional context vector $x_t$. The formal setup is as follows.
For $t = 1, 2, \dots, T$:
- We are presented with a context vector $x_t$.
- We select an action $a_t \in \mathcal{A}$ according to our policy (which can depend on $x_t$ and the history up to the current moment).
- We see a reward $r_t = r(x_t, a_t)$.
Our goal is to minimize regret, the gap between the expected reward of always playing the best action for each context and the expected reward we actually collect:
$$ R_T = \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \mathbb{E}\big[r(x_t, a)\big] \;-\; \sum_{t=1}^{T} \mathbb{E}\big[r(x_t, a_t)\big]. $$
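To make the protocol concrete, here is a minimal sketch of one standard algorithm for this setting, LinUCB, under the (illustrative, not from the text above) assumption that expected rewards are linear in the context, $\mathbb{E}[r \mid x, a] = \theta_a^\top x$. The environment parameters `theta` and all dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-reward environment (an assumption for illustration;
# the setup above does not fix a reward model): r = theta_a . x + noise.
d, n_actions, T = 5, 3, 2000
theta = rng.normal(size=(n_actions, d))  # true parameters, unknown to the learner

# LinUCB: per-action ridge regression plus an upper-confidence bonus.
alpha = 1.0
A = np.stack([np.eye(d) for _ in range(n_actions)])  # per-action Gram matrices
b = np.zeros((n_actions, d))                         # per-action reward sums

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)  # context vector x_t, revealed to the learner
    ucb = np.empty(n_actions)
    for a in range(n_actions):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]                       # ridge estimate of theta_a
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    a_t = int(np.argmax(ucb))                          # action with highest UCB
    r = theta[a_t] @ x + 0.1 * rng.normal()            # observed noisy reward
    A[a_t] += np.outer(x, x)                           # update chosen arm's model
    b[a_t] += r * x
    # per-round expected regret: best expected reward minus chosen one
    regret += (theta @ x).max() - theta[a_t] @ x

print(regret / T)  # average per-round regret; shrinks as the estimates improve
```

The confidence bonus $\alpha \sqrt{x^\top A_a^{-1} x}$ is what drives exploration: actions whose parameters are poorly estimated in the direction of the current context get an optimistic boost, so they are tried until the uncertainty shrinks.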