The task of contextual bandit is to find a policy $\pi$ for deciding what action $a$ to take given a context $x$ or $a = \pi(x)$
The goal is to find a policy which maximizes the reward $V^\pi = E(r_{\pi(x)})$
one problem is how to measure the performance using offline data which was not collected with $\pi$. Instead we have a sample of $(x,a,r_a)$ made by a different policy. For example a policy which uniformally sample from all available actions a regardless of x
Part 4 in A Contextual-Bandit Approach to Personalized News Article Recommendation describes how to test a bandit using offline data. But it was limited to a collection made with fixed (equal) probability for every arm $p(a) = 1/K$. This is a special case of inverse propensity score (IPS) method in which the probability $\hat{p}(a|x)$ for every arm selection a is predicted based on the context x
$\hat{V}^\pi_{\mbox{IPS}} = \hat{E}(\frac{r_a I(\pi(x)=a)}{\hat{p}(a|x)})$
where $\hat{E}$ is averaging over all our samples in the offline data
A different approach is the direct method (DM) which predicts the reward r for every arm selection $\hat{\rho}_a(x)$
$\hat{V}^\pi_{\mbox{DM}} = \hat{E}(\hat{\rho}_{\pi(x)}(x))$
"Doubly Robust Policy Evaluation and Learning", by Miroslav Dudik, John Langford and Lihong Li. In ICML 2011. combine the two methods using
$\hat{V}^\pi_{\mbox{DR}} = \hat{E}(\frac{(r_a - \hat{\rho}_a(x))I(\pi(x)=a)}{\hat{p}(a|x)} + \hat{\rho}_{\pi(x)}(x))$
VW¶
VW has a unit that implements contextual bandit using offline data.
Evaluation a policy on offline data¶
It is not clear to me what is the policy $\pi$ being evaluated or optimized in VW (after all DR is just a way to evaluate it.)
The wiki gives the folowing example
%%writefile /tmp/train.dat
1:2:0.4 | a c
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d
!vw -d /tmp/train.dat --cb 4 -f /tmp/cb.model
prediciton¶
the policy from previous step /tmp/cb.model
can be applied to new test data
%%writefile /tmp/test.dat
1:2 3:5 4:1:0.6 | a c d
1:0.5 2:1:0.4 3:2 4:1.5 | c d
!vw -t -d /tmp/test.dat -i /tmp/cb.model -p /tmp/out
!cat /tmp/out