When working in machine learning, one will find that most problems are described in terms of statistical models, where probability distributions determine the characteristics of the task at hand. We rely heavily on these models, and we have built very powerful algorithms, based on standard randomization, that learn from such distributions in order to solve complex tasks such as those found in reinforcement learning. But what if there are aspects of these models that we aren't accounting for? And what if we aren't accounting for these aspects because our current approaches can't make us aware of them?

Unobserved Confounders (UCs) pose a unique challenge to algorithms based on standard randomization. In the paper *Counterfactual Data-Fusion for Online Reinforcement Learners* (2017) by Andrew Forney, Judea Pearl, and Elias Bareinboim, the authors show how counterfactual-based decision making circumvents the problems which arise when UCs are naively averaged out, leading to a coherent fusion of observational and experimental data.

In this post, I want to summarize the background involved in this paper. This includes describing their story of the "Greedy Casino" as well as describing some of the language and tools used in this paper. I plan on following this post up with a description of their solutions to these problems.

## The Greedy Casino

The paper explores their approach through a contrived multi-armed bandit problem which incorporates unobserved confounders. The multi-armed bandit problem is a classic reinforcement learning problem that exemplifies the exploration-exploitation trade-off dilemma. The setup is to imagine a casino with \( K \) slot machines, each providing a random reward sampled from its own, unobservable probability distribution. The gambler, our agent, wants to maximize cumulative reward.

The problem our agent faces is that it must explore the environment (i.e., play the slots) in order to learn the machines' reward distributions. However, this process of exploration is costly; as the agent explores the environment, it inevitably chooses machines which are suboptimal. But, in order to find which machines produce the best rewards, the agent has to first explore the machines, and the more reward samples the agent can acquire, the better its guess as to which machine(s) produce the best rewards. A bit of a catch-22.
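To make the trade-off concrete, here is a minimal ε-greedy bandit sketch. The payout probabilities are made up for illustration; a small ε controls how often the agent explores rather than exploits:

```python
import random

def epsilon_greedy_bandit(payouts, rounds=10000, epsilon=0.1, seed=0):
    """Play a K-armed Bernoulli bandit with an epsilon-greedy strategy."""
    rng = random.Random(seed)
    K = len(payouts)
    pulls = [0] * K   # times each arm has been pulled
    wins = [0] * K    # reward accumulated per arm
    total = 0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(K)  # explore: try a random machine
        else:
            # exploit: play the best machine seen so far (unpulled arms first)
            arm = max(range(K),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else float("inf"))
        reward = 1 if rng.random() < payouts[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    return total, pulls

# Hypothetical payout rates: arm 2 is best, so it should attract most pulls.
total, pulls = epsilon_greedy_bandit([0.2, 0.3, 0.6, 0.4])
```

The catch-22 shows up in ε itself: explore too little and the agent may settle on a suboptimal machine; explore too much and it keeps paying for arms it already knows are bad.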

The paper takes this problem further by introducing unobserved confounders to the mix in the form of a story. It describes a casino which is looking to rip off gamblers while avoiding legal trouble. The casino collects data about its gamblers' characteristics and finds two traits which predict which of four machines a gambler is likely to play: whether or not the machines are all blinking, and whether or not the gambler is drunk. If the four machines are indexed as \( X \in \{ 0, 1, 2, 3 \} \), the gambler's machine choice is modeled by \( f_X (B, D) = B + 2\cdot D \). About half the gamblers are drunk, and the machines are programmed to blink half the time.

Due to a new gambling law, the slot machines in the casino must maintain a 30% win rate. To circumvent this, the casino equips their machines with a sensor capable of determining if the gambler is drunk, and adjusts each machine's payout rate as a function of the machine chosen, the gambler's drunkenness, and whether the machines are blinking (the paper lays out the full payout table).

So, for example, when a drunk ( \( D = 1 \) ) gambler chooses the fourth machine ( \( X = 3 \) ) while it's blinking ( \( B = 1 \) ), their payout rate will be set to \( 0.20 \), so that their win rate is less than the legal minimum. At the same time, if the same drunk gambler picked the same (fourth) machine while it *wasn't* blinking, their win rate would be \( 0.60 \).

The idea, here, is that since the average win rate for each machine is \( 0.40 \), which is well above the legal limit, the casino appears to meet the requirements of the law. However, gamblers are likely to play on machines where their win rate is only \( 0.20 \). If enough gamblers follow this pattern (which the casino's data indicates they will), the casino will be able to pay out less than the legal limit. This is a pretty clever scam, and quite nefarious when you consider how realistic the scenario is.

The paper continues the story by describing how gamblers began observing a win rate of \( 0.20 \), prompting an investigation. The investigators asked random gamblers to play on random machines and collected the data. After compiling their findings, the investigators determined the win rate to be \( 0.40 \) (for the reason described above), and they concluded that the casino was operating legally.
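We can reproduce this discrepancy with a quick computation. The payout table below is my own construction, chosen only to be consistent with the numbers in the story (the naturally-chosen machine pays 0.20, every machine averages 0.40, and machine 3 pays 0.60 to a drunk gambler at a non-blinking machine); the paper specifies its own concrete table:

```python
from itertools import product

# Hypothetical payout table (one of many consistent with the story's numbers):
# machine x in context (B=b, D=d) pays RATES[(x - (b + 2*d)) % 4], so the
# naturally-chosen machine (x == b + 2*d) pays 0.2 and each row averages 0.4.
RATES = [0.2, 0.6, 0.3, 0.5]

def payout(x, b, d):
    """Win probability of machine x in context (B=b, D=d)."""
    return RATES[(x - (b + 2 * d)) % 4]

contexts = list(product((0, 1), repeat=2))  # (b, d): each occurs with prob 1/4

# Gamblers on the floor follow their natural choice f_X(B, D) = B + 2D.
observed = sum(payout(b + 2 * d, b, d) for b, d in contexts) / 4

# Investigators assign machines at random, severing the link to (B, D).
experimental = sum(payout(x, b, d) for b, d in contexts for x in range(4)) / 16
```

Under any such table, `observed` comes out to 0.20 while `experimental` comes out to 0.40: the gamblers and the investigators are both measuring correctly, yet they measure different quantities.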

This is an interesting problem. The approach the investigators used to test the machines (random sampling) seems reasonable, but we find that ignoring the influence of UCs results in estimates which don't reflect the reality of the situation. The paper then describes how a number of other experiments using standard MAB algorithms produce similarly inaccurate findings.

## Confounding Variables

This brings us back to the original multi-armed bandit problem. One of the most ubiquitous examples of reinforcement learning tasks is susceptible to these confounding variables, and our approaches for solving such tasks may perform suboptimally as a result.

A confounder is a variable which influences both the dependent and independent variable in a data generating model. If \( X \) is some independent variable and \( Y \) some dependent variable, we say that \( X \) and \( Y \) are confounded by some other variable \(Z \) whenever \( Z \) is a cause of both \( X \) and \( Y \).

We say \( X \) and \( Y \) are not confounded when \( P( y | do(x)) = P(y | x) \) for all \( x \) and \( y \). You'll notice I'm using the \( do() \) operator from *do-calculus* which marks an intervention (or action) in the model. This allows us to describe both \( P( y | x ) \), where \( x \) is the observed variable, and \( P( y | do(x) ) \), where an action is taken to force a specific value \( x \).

To be more specific, the observational \( P (y|x) \) tells us the distribution of \( Y \) given that I observe variable \( X \) taking value \( x \) while interventional \( P ( y| do(x) ) \) tells us the distribution of \( Y \) if I were to set the value of \( X \) to \( x \). In other words, \( X \) and \( Y \) are not confounded whenever the observationally witnessed association between them is the same as the association that would be measured in a controlled experiment with \( x \) randomized.
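A toy example makes the distinction concrete. The numbers below are invented for illustration; the point is only that conditioning and intervening can disagree, and even reverse, when a confounder \( Z \) is present:

```python
# Toy model with a binary confounder Z that drives both X and Y
# (all probabilities here are made up for illustration).
P_Z1 = 0.5
P_X1_given_Z = {0: 0.8, 1: 0.2}             # Z influences which X we observe
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.7,
                 (1, 0): 0.3, (1, 1): 0.9}  # Z also influences the outcome

def p_y1_given_x(x):
    """Observational P(Y=1 | X=x): condition on having seen X=x."""
    num = den = 0.0
    for z in (0, 1):
        pz = P_Z1 if z else 1 - P_Z1
        px = P_X1_given_Z[z] if x else 1 - P_X1_given_Z[z]
        num += pz * px * P_Y1_given_XZ[(x, z)]
        den += pz * px
    return num / den

def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)): Z keeps its own distribution."""
    return sum((P_Z1 if z else 1 - P_Z1) * P_Y1_given_XZ[(x, z)] for z in (0, 1))
```

With these numbers, \( X = 0 \) looks better observationally (0.58 vs. 0.42), yet intervening shows \( X = 1 \) is actually the better choice (0.6 vs. 0.4), because conditioning on \( X \) smuggles in information about \( Z \).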

In the case of the Greedy Casino, the unobserved confounders in our model (the blinking of the machines and whether or not the gambler was drunk) induced differing observational and experimental distributions. When the investigators selected random gamblers to play on random machines, they forced the value of \( X \), generating \( P(y | do(x)) \). However, the gamblers in the casino observed the "natural" distribution \( P(y | x) \), which differed from that of the experiment (indicating confounding).

## Structural Causal Models

Hopefully by now the problem is somewhat clear. We want to know the true distributions of data that we're interested in, but unobserved confounders pose a challenge. To handle these situations, we use Structural Causal Models (SCMs) to formalize notions of observational and experimental distributions as well as introduce and define counterfactual distributions. Using SCMs, we can formalize the problem of confounding due to the influence of unobserved confounders.

Each SCM \( M \) is associated with a causal diagram \( G \) and encodes a set of endogenous (or observed) variables \( V \) and exogenous (or unobserved) variables \( U \); edges in \( G \) correspond to functional relationships relative to each endogenous variable \( V_i \in V \), namely, \( V_i \leftarrow f_i (PA_i, U_i) \), where \( PA_i \subseteq V \setminus \{ V_i \} \) and \( U_i \subseteq U \); and a probability distribution over the exogenous variables, \( P (U = u) \).

Each \( M \) induces observational distributions, \( P(V = v ) \), experimental distributions, \( P(Y = y | do(X=x)) \) for \( X, Y \subseteq V \), and *counterfactual distributions*, defined:

(Definition) Counterfactual: Let \( X \) and \( Y \) be two subsets of endogenous variables in \( V \). The counterfactual sentence "\( Y \) would be \( y \) (in situation \( U = u \)), had \( X \) been \( x \)" is interpreted as the equality \( Y_x (u) = y \), where \( Y_x(u) \) encodes the solution for \( Y \) in a structural system where, for every \( V_i \in X \), the equation \( f_i \) is replaced with the constant \( x \).

We aim to estimate these counterfactual distributions using the approaches discussed later in the paper. The paper notes that the counterfactual expectation \( \mathbb{E}[Y_{X=x} | X = x'] \) is well-defined, even when \( x \not= x' \), and is read "the expected value of \( Y \) had \( X \) been \( x \), given that \( X \) was observed to be \( x' \)". We note that, in an offline setting, we *can't* estimate counterfactual distributions directly: once we have submitted a subject to condition \( X = x' \), we cannot go back in time before exposure and submit the same subject to a new condition \( X = x \).
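When we do hold the full model, though, the expression can be evaluated mechanically. Here is a toy computation with invented structural equations: condition on the \( u \)'s consistent with having observed \( X = x' \), then solve the mutilated system with \( X \) fixed to \( x \):

```python
# Toy structural system (equations invented for illustration): U is uniform
# over {0, ..., 5}, X is determined by U, and Y by both X and U.
def f_x(u):
    return u % 2                           # natural mechanism for X

def f_y(x, u):
    return 1 if (x + u) % 3 == 0 else 0    # mechanism for Y

def factual_y(u):
    """Y(u): solve the unmutilated system."""
    return f_y(f_x(u), u)

def counterfactual_y(x, u):
    """Y_x(u): replace X's equation with the constant x, then solve for Y."""
    return f_y(x, u)

def e_y_x_given_xprime(x, x_prime, u_values):
    """E[Y_{X=x} | X = x']: average Y_x(u) over the u's consistent with
    having observed X = x' (possible only because we hold the full model)."""
    consistent = [u for u in u_values if f_x(u) == x_prime]
    return sum(counterfactual_y(x, u) for u in consistent) / len(consistent)

# What would have happened under X=1, among situations where X=0 was taken.
cross = e_y_x_given_xprime(1, 0, range(6))
```

The conditioning step is exactly why the quantity is out of reach offline: without the model, observing \( X = x' \) and then rewinding to impose \( X = x \) on the same unit is impossible.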

We can estimate the observational and experimental distributions through *random sampling* and *random experimentation*, respectively. We can compare these distributions to detect confounding bias. The absence of unobserved confounders implies \( P(Y|do(X=x)) = P(Y|X=x) \), so that a non-zero difference between these distributions implies a confounding bias.

The authors take care to drive this point home, contrasting observational and experimental data via the distinction between *actions* and *acts*. Actions represent reactive choices the agents make, while acts represent deliberate choices which sever the causal influences of the system. In the Greedy Casino, the gamblers took actions based on the blinking lights and their drunkenness, carrying the influence of the UCs, while the investigators made acts in order to experiment.

(Definition) Intent: Consider an SCM \( M \) and an endogenous variable \( X \in V \) that is amenable to external interventions and is (naturally) determined by the structural equation \( f_x ( PA_x, U_x ) \), where \( PA_x \subseteq V \) represents the observed parents of \( X \), and \( U_x \subseteq U \) are the UCs of \( X \).

After realization \( PA_x = pa_x \) and \( U_x = u_x \) (without any external intervention), we say that the output of the structural function given the current configuration of all UCs is the agent's intent, \( I = f_x (pa_x, u_x) \).

The agent's chosen action, before its execution, is its intent. That is, intent is the decision an agent takes given the influence of observed variables and unobserved confounders. In the Greedy Casino, gamblers' intents are enacted on the casino floor, leading to their poor observed win rate.

## K-Armed Bandits with Unobserved Confounders

With the definitions in place, we can now explicitly define the multi-armed bandit with unobserved confounders problem. A \(K\)-Armed bandit problem (\( K \in \mathbb{N}, K \geq 2 \)) with unobserved confounders (MABUC, for short) is defined as a model \( M \) with a reward distribution over \( P(u) \) where, for each round \( 0 < t < T, t \in \mathbb{N} \):

- *Unobserved confounders:* \( U_t \) represents the unobserved variable encoding the payout rate and the unobserved influences on the propensity to choose arm \( x_t \) at round \( t \).
- *Intent:* \( I_t \in \{ i_1, \ldots, i_k \} \) represents the agent's intended arm choice at round \( t \) (prior to its final choice, \( X_t \)) such that \( I_t = f_i (pa_{x_t}, u_t) \).
- *Policy:* \( \pi \in \{ x_1, \ldots, x_k \} \) denotes the agent's decision algorithm as a function of its history and current intent, \( f_\pi (h_t, i_t ) \).
- *Choice:* \( X_t \in \{ x_1, \ldots, x_k \} \) denotes the agent's final arm choice that is "pulled" at round \( t \), \( x_t = f_x (\pi_t ) \).
- *Reward:* \( Y_t \in \{ 0, 1 \} \) represents the Bernoulli reward (\( 0 \) for losing, \( 1 \) for winning) from choosing arm \( x_t \) under UC state \( u_t \), as decided by \( y_t = f_y (x_t, u_t ) \).

The paper includes a graphical model representing a prototypical MABUC.

The paper also includes a graphical representation of the agent's history \( H_t \), a data structure containing the agent's observations, experiments, and counterfactual experiences up to time step \( t \). At every round \( t \) of MABUC, the unobserved state \( u_t \) is drawn from \( P(u) \), which then decides the intent \( i_t \); the strategy \( \pi_t \) considers this intent in concert with the game's history \( h_t \) and makes a final arm choice \( x_t \), which is pulled, and the reward \( y_t \) is revealed.
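That round structure can be sketched as a short simulation. The payout mechanism here is my own hypothetical stand-in, consistent with the story's numbers rather than taken from the paper:

```python
import random

rng = random.Random(0)

# Hypothetical payout mechanism, consistent with the story's numbers: the
# intended (natural) arm pays 0.2 and each arm averages 0.4 across contexts.
RATES = [0.2, 0.6, 0.3, 0.5]

def one_round(policy, history):
    """One MABUC round: draw u_t, form intent i_t, let the policy pick x_t,
    reveal reward y_t, and log the round to the history."""
    b, d = rng.randrange(2), rng.randrange(2)               # u_t ~ P(u)
    intent = b + 2 * d                                      # i_t = f_i(pa_t, u_t)
    arm = policy(history, intent)                           # x_t = f_pi(h_t, i_t)
    reward = int(rng.random() < RATES[(arm - intent) % 4])  # y_t = f_y(x_t, u_t)
    history.append((intent, arm, reward))
    return reward

# A naive policy that always follows its intent reproduces the gamblers'
# poor observed win rate (about 0.2).
history = []
wins = sum(one_round(lambda h, i: i, history) for _ in range(5000))
```

Note that the policy sees the intent but never the UC state \( (b, d) \) itself; everything the agent can learn must come from the `(intent, arm, reward)` triples in its history.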

At this point, we can discuss regret in decision theory. When making a decision under uncertainty, the decision maker may feel regret should information about the best course of action arrive *after* taking a fixed decision. In our case, we want the gambler to select the machine which will maximize the win rate, even when that machine isn't necessarily the machine the gambler may intend to choose. Formally,

(Definition) Regret Decision Criterion (RDC): In a MABUC instance with arm choice \( X \), intent \( I = i \), and reward \( Y \), agents should choose the action \( a \) that maximizes their intent-specific reward, or formally: \[ \operatorname{argmax}_a \mathbb{E}[Y_{X=a} | X = i] \]

Briefly, an RDC prescribes that the arm \( X = a \) that maximizes the expected value of reward \( Y \) having conditioned on the intended arm \( X = i \) should be selected, even when \( a \not= i \).
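With the same hypothetical payout table used earlier (in which intent pins down the UC configuration, so the intent-specific rewards are just the table's rows), the RDC choice can be computed directly:

```python
# Hypothetical payout table, as in the casino sketch: under intent i, arm a
# pays RATES[(a - i) % 4], which here plays the role of E[Y_{X=a} | X = i].
RATES = [0.2, 0.6, 0.3, 0.5]

def rdc_choice(intent):
    """argmax_a E[Y_{X=a} | X = intent]."""
    return max(range(4), key=lambda a: RATES[(a - intent) % 4])

# For every intent, RDC picks a different arm than the one intended: the
# counterfactual-best arm pays 0.6 versus 0.2 for following intent.
choices = {i: rdc_choice(i) for i in range(4)}
```

In this construction the optimal play under every intent is an arm other than the intended one, which is exactly the situation where a standard, intent-blind bandit algorithm is left measuring the misleading 0.4 average.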

## Summary

So, at this point the problem is relatively simple to state: it is possible that a statistically-driven problem, such as the multi-armed bandit, includes unobserved confounders capable of skewing our perception of the reality of the problem. What we want to do now is utilize observational *and* experimental data to handle these cases.

The paper itself provides an interesting introduction to these ideas through the story of the Greedy Casino. The multi-armed bandit problem is a classic example in reinforcement learning, and it is a bit disturbing to see how a small, intuitive adjustment to the problem can so thoroughly undermine the naive approaches we rely on.

Our goal, now, is to figure out how we can fuse observational, experimental, and counterfactual data to train our online agents to solve such tasks.