
VW contextual bandit

The contextual bandit task is to find a policy $\pi$ for deciding which action $a$ to take given a context $x$, i.e. $a = \pi(x)$.

The goal is to find a policy which maximizes the expected reward $V^\pi = E(r_{\pi(x)})$.

One problem is how to measure performance using offline data that was not collected with $\pi$. Instead we have a sample of $(x,a,r_a)$ tuples collected by a different policy, for example one that samples uniformly from all available actions $a$ regardless of $x$.

Part 4 of "A Contextual-Bandit Approach to Personalized News Article Recommendation" describes how to test a bandit using offline data, but it is limited to a collection made with a fixed (equal) probability for every arm, $p(a) = 1/K$. This is a special case of the inverse propensity score (IPS) method, in which the probability $\hat{p}(a|x)$ of selecting each arm $a$ is estimated based on the context $x$:

$\hat{V}^\pi_{\mbox{IPS}} = \hat{E}(\frac{r_a I(\pi(x)=a)}{\hat{p}(a|x)})$

where $\hat{E}$ denotes the average over all the samples in the offline data.
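As a concrete illustration, here is a minimal sketch of the IPS estimate in plain Python (not VW); the function name, the tuple layout of the logged data, and the `policy` callable are all just assumptions of this sketch.

```python
import numpy as np

def ips_estimate(logged, policy):
    """IPS value estimate from logged (x, a, r, p) tuples.

    Each tuple holds a context x, the action a the logging policy took,
    the observed reward r, and p = p_hat(a|x), the (estimated) probability
    that the logging policy chose a given x.  `policy` maps a context to
    an action.  Only samples where the target policy agrees with the
    logged action contribute, re-weighted by 1/p.
    """
    return np.mean([r * (policy(x) == a) / p for x, a, r, p in logged])
```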

A different approach is the direct method (DM), which fits a model $\hat{\rho}_a(x)$ that predicts the reward $r$ of selecting arm $a$ in context $x$:

$\hat{V}^\pi_{\mbox{DM}} = \hat{E}(\hat{\rho}_{\pi(x)}(x))$

The doubly robust (DR) method combines the two, using the reward model as a baseline and correcting it with an IPS-weighted residual:

$\hat{V}^\pi_{\mbox{DR}} = \hat{E}(\frac{(r_a - \hat{\rho}_a(x))I(\pi(x)=a)}{\hat{p}(a|x)} + \hat{\rho}_{\pi(x)}(x))$
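Continuing the sketch above, DM and DR can be written directly from these formulas. This is plain Python, not VW; `rho_hat(a, x)` stands in for whatever reward regressor you fit on the logged data, and all names are illustrative.

```python
import numpy as np

def dm_estimate(logged, policy, rho_hat):
    """Direct method: average the predicted reward of the action pi would take."""
    return np.mean([rho_hat(policy(x), x) for x, a, r, p in logged])

def dr_estimate(logged, policy, rho_hat):
    """Doubly robust: reward-model prediction plus an IPS-weighted residual
    correction on the samples where pi agrees with the logged action."""
    terms = []
    for x, a, r, p in logged:
        terms.append((r - rho_hat(a, x)) * (policy(x) == a) / p
                     + rho_hat(policy(x), x))
    return np.mean(terms)
```

If the reward model is accurate, the residual term is small and the estimate behaves like DM; if the propensities are accurate, the correction removes the reward model's bias, which is why the estimator is called doubly robust.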

VW

VW has a module that implements contextual bandit learning from offline data (the --cb option used below).

Evaluating a policy on offline data

It is not clear to me what policy $\pi$ is being evaluated or optimized in VW (after all, DR is just a way to evaluate it).

The wiki gives the following example. Each training line has the form action:cost:probability | features, where the label records the action taken by the logging policy, the cost observed for it, and the probability with which the logging policy chose it.

In [1]:
%%writefile /tmp/train.dat
1:2:0.4 | a c  
3:0.5:0.2 | b d  
4:1.2:0.5 | a b c  
2:1:0.3 | b c  
3:1.5:0.7 | a d
Overwriting /tmp/train.dat
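To make the label format explicit, here is a small sketch that parses these training lines (plain Python; the helper name is just for illustration).

```python
def parse_cb_line(line):
    """Parse a VW --cb training line of the form 'action:cost:probability | features'."""
    label, features = line.split('|')
    action, cost, prob = label.strip().split(':')
    return int(action), float(cost), float(prob), features.split()

with open('/tmp/train.dat') as f:
    for line in f:
        if line.strip():
            print(parse_cb_line(line))
# e.g. the first line gives (1, 2.0, 0.4, ['a', 'c']):
# action 1 was taken with probability 0.4 and incurred cost 2
```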
In [2]:
!vw -d /tmp/train.dat --cb 4 -f /tmp/cb.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
final_regressor = /tmp/cb.model
using no cache
Reading datafile = /tmp/train.dat
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
*estimate* *estimate*                                                avglossreg last pred  last correct
5.000000   5.000000          1      1.0    known        1        3   4.000000   0.000000   2.000000  
2.500000   0.000000          2      2.0    known        2        3   2.125000   0.000000   0.500000  
2.083333   1.666667          4      4.0    known        2        3   1.672500   0.000000   1.000000  

finished run
number of examples per pass = 5
passes used = 1
weighted example sum = 5
weighted label sum = 0
average loss = 1.69347
best constant = 0
total feature number = 16

Prediction

The policy learned in the previous step, /tmp/cb.model, can now be applied to new test data.

In [3]:
%%writefile /tmp/test.dat
1:2 3:5 4:1:0.6 | a c d  
1:0.5 2:1:0.4 3:2 4:1.5 | c d 
Overwriting /tmp/test.dat
In [4]:
!vw -t -d /tmp/test.dat -i /tmp/cb.model -p /tmp/out
only testing
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = /tmp/out
using no cache
Reading datafile = /tmp/test.dat
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
*estimate* *estimate*                                                avglossreg last pred  last correct
1.000000   1.000000          1      1.0    known        4        4   0.318207   0.435902   1.000000  
1.000000   1.000000          2      2.0    known        2        3   0.426955   0.268082   1.000000  

finished run
number of examples per pass = 2
passes used = 1
weighted example sum = 2
weighted label sum = 0
average loss = 1
best constant = -1
total feature number = 7
In [5]:
!cat /tmp/out
4.000000
2.000000
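As far as I understand VW's --cb prediction output, these two numbers are the action indices the learned policy selects for the two test examples. A minimal way to read them back (file name as above, variable names illustrative):

```python
with open('/tmp/out') as f:
    chosen_actions = [int(float(line)) for line in f if line.strip()]
print(chosen_actions)  # [4, 2]: the arm picked for each test example
```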
