todo

A slightly more general setting than the multiarmed bandit. As in that problem, we have a set of actions $\{1, \dots, K\}$. The reward of each action now depends on an additional context vector $x_t \in \mathbb{R}^d$. The formal setup is as follows.

For $t = 1, 2, \dots, T$:

  • We are presented with a context vector $x_t$
  • We select an action $a_t$ according to our policy (which can depend on $x_t$ and the history up to the current moment)
  • We observe the reward $r_t(a_t)$.

Our goal is to minimize regret, measured against the best action in each context:

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \Big(\max_{a} r_t(a) - r_t(a_t)\Big)\right].$$
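The protocol above can be sketched in code. This is a minimal simulation, not part of the original notes: it assumes a linear reward model with hidden per-action parameters `theta` and uses a simple epsilon-greedy policy with least-squares estimates; the constants (`K`, `d`, `T`, the 0.05 exploration rate, the noise scale) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, T = 3, 5, 2000                 # actions, context dimension, horizon
theta = rng.normal(size=(K, d))      # hidden per-action parameters (assumed linear model)

def reward(a, x):
    """Noisy linear reward of action a in context x."""
    return theta[a] @ x + rng.normal(scale=0.1)

# Epsilon-greedy policy with per-action regularized least-squares estimates.
A = [np.eye(d) for _ in range(K)]    # Gram matrices (identity = ridge regularization)
b = [np.zeros(d) for _ in range(K)]  # reward-weighted context sums
cum_regret = 0.0

for t in range(T):
    x = rng.normal(size=d)                                    # context revealed
    est = [np.linalg.solve(A[a], b[a]) @ x for a in range(K)]  # estimated rewards
    a_t = int(rng.integers(K)) if rng.random() < 0.05 else int(np.argmax(est))
    r = reward(a_t, x)                                        # observed reward
    A[a_t] += np.outer(x, x)                                  # update chosen arm's estimate
    b[a_t] += r * x
    # per-round regret vs. the best action in this context (noise-free means)
    cum_regret += max(theta[a] @ x for a in range(K)) - theta[a_t] @ x

print(round(cum_regret / T, 3))      # average per-round regret
```

After an initial learning phase the per-round regret is dominated by the fixed 5% exploration, which is why epsilon-greedy gives linear (not sublinear) regret; the algorithms discussed below improve on this.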

Known results