A slightly more general setting than the multi-armed bandit. As in that problem, we have a set of actions $\mathcal{A}$. The reward of each action now depends on an additional context vector $x_t$. The formal setup is as follows.
For $t = 1, 2, \dots, T$:
- We are presented with a context vector $x_t$.
- We select an action $a_t \in \mathcal{A}$ according to our policy (which can depend on $x_t$ and the history up to the current moment).
- We see a reward $r_t = r(x_t, a_t)$.
Our goal is to minimize regret, the gap between the expected reward of always playing the best action for each context and the expected reward we actually collect:
$$ R_T = \sum_{t=1}^{T} \max_{a \in \mathcal{A}} \mathbb{E}\big[r(x_t, a)\big] \;-\; \sum_{t=1}^{T} \mathbb{E}\big[r(x_t, a_t)\big]. $$
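To make the protocol concrete, here is a minimal sketch of one standard algorithm for this setting, LinUCB, under the (illustrative, not from the text above) assumption that expected rewards are linear in the context, $\mathbb{E}[r \mid x, a] = \theta_a^\top x$. The environment parameters `theta` and all dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-reward environment (an assumption for illustration;
# the setup above does not fix a reward model): r = theta_a . x + noise.
d, n_actions, T = 5, 3, 2000
theta = rng.normal(size=(n_actions, d))  # true parameters, unknown to the learner

# LinUCB: per-action ridge regression plus an upper-confidence bonus.
alpha = 1.0
A = np.stack([np.eye(d) for _ in range(n_actions)])  # per-action Gram matrices
b = np.zeros((n_actions, d))                         # per-action reward sums

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)  # context vector x_t, revealed to the learner
    ucb = np.empty(n_actions)
    for a in range(n_actions):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]                       # ridge estimate of theta_a
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    a_t = int(np.argmax(ucb))                          # action with highest UCB
    r = theta[a_t] @ x + 0.1 * rng.normal()            # observed noisy reward
    A[a_t] += np.outer(x, x)                           # update chosen arm's model
    b[a_t] += r * x
    # per-round expected regret: best expected reward minus chosen one
    regret += (theta @ x).max() - theta[a_t] @ x

print(regret / T)  # average per-round regret; shrinks as the estimates improve
```

The confidence bonus $\alpha \sqrt{x^\top A_a^{-1} x}$ is what drives exploration: actions whose parameters are poorly estimated in the direction of the current context get an optimistic boost, so they are tried until the uncertainty shrinks.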