I’ve been thinking a lot about the multi-armed bandit problem lately. Here’s the description from Wikipedia:

> In probability theory, the multi-armed bandit problem is a problem in which a gambler at a row of slot machines (sometimes known as “one-armed bandits”) has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

I’m especially interested in this problem because I think it’s a good exercise for thinking about how society should allocate scarce resources. For example, when trying to combat an infectious disease, we should spend a substantial amount on the immediate response. But we also need to invest in research on long-term solutions like vaccines and diagnostic tools, even if these outcomes are more uncertain. How do we decide how much to allocate to each of these areas over time? This explore-exploit tradeoff is what makes these types of problems difficult to solve.

The traditional problem above has some good solutions, but I recently thought of a twist that hasn’t been researched yet: What if you only know the collective score of a group of machines each round rather than the score of each individual machine (each individual is “ignorant” of its own score)? This seems to be precisely the problem we have in many policy situations: we change some policy but we’re not sure if the effect we observe is due to our change, some other change (possibly from a different time step), or random noise. Running a randomized controlled trial (RCT) can help cancel out the random noise and other effects, but even RCTs have issues with applicability, external validity, and scale.

Below I lay out the problem in Python and propose a few initial solutions, one of which converges. For now, I limit the effect of each machine to a single time step and only change one at a time. This isn’t as complicated as reality but it allows me to reach an initial solution. The code for this post is available here.
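To make the setup concrete, here is a minimal sketch of the problem as I understand it. The `Machine` class, the normal payout distribution, the `sigma` noise level, and the means of 1 through 100 are all assumptions for illustration, not necessarily what the linked code does. The key point is that `group_score` returns only the collective reward, never the individual payouts.

```python
import random


class Machine:
    """A slot machine with a hidden mean payout `mu`.

    The normal distribution and the default sigma are assumptions for this sketch.
    """

    def __init__(self, mu, sigma=1.0):
        self.mu = mu
        self.sigma = sigma

    def pull(self, rng):
        # One noisy payout drawn around the machine's mean.
        return rng.gauss(self.mu, self.sigma)


def group_score(machines, rng):
    """Return only the summed reward; individual payouts stay hidden."""
    return sum(m.pull(rng) for m in machines)


rng = random.Random(0)
machines = [Machine(mu) for mu in range(1, 101)]  # hypothetical: means 1..100
selected = rng.sample(machines, 5)                # five machines per round
score = group_score(selected, rng)
print(f"Collective score: {score:.2f}")
```

Everything the solver is allowed to see is the single number `score`; the `mu` attributes exist only so the simulation can generate payouts.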

## Initial Approach

My initial approach to solving this problem is to just select five random machines, get a score, then replace one of them at random. If the resulting score is an improvement, I keep the new selection of machines for the next step. This approach works fairly well, but it never converges: I keep randomly replacing a machine, and once the selection is already good, a random replacement usually lowers the score.

```
Selected Levers:
Mu: 99
Mu: 100
Mu: 100
Mu: 99
Mu: 99
```
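The loop described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions: machines are represented only by their means, payouts are normal with a made-up `sigma`, the same machine may fill more than one slot, and `random_replacement_search` is a hypothetical name.

```python
import random


def group_score(mus, rng, sigma=1.0):
    """Collective reward of a group; sigma is an assumed noise level."""
    return sum(rng.gauss(mu, sigma) for mu in mus)


def random_replacement_search(all_mus, k=5, steps=2000, seed=0):
    """Swap one random slot per step and keep the swap only if the score improves."""
    rng = random.Random(seed)
    selected = [rng.choice(all_mus) for _ in range(k)]
    best_score = group_score(selected, rng)
    for _ in range(steps):
        candidate = list(selected)
        # Replace one randomly chosen slot with a randomly chosen machine.
        candidate[rng.randrange(k)] = rng.choice(all_mus)
        score = group_score(candidate, rng)
        if score > best_score:
            selected, best_score = candidate, score
    return selected


selected = random_replacement_search(list(range(1, 101)))
print("Selected Levers:")
for mu in selected:
    print(f"Mu: {mu}")
```

Note that `best_score` is itself a single noisy observation, so a lucky draw can make a mediocre group look better than it is. That is one reason a loop like this drifts toward good machines without reliably settling on the best one.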

## Improved Solution

To get around the non-convergence of the initial approach, I add a history of deltas to each machine. Think of the history as the changes “caused” by each machine when it’s added to an existing group. So I start by selecting and pulling five levers, then replace the worst-performing one with the best-performing lever that is not currently selected, then re-score the results with the new lever. The score change is added to the new machine’s deltas, and then the process above is repeated until it settles on the best machine.

In reality, it’s impossible to know if the new machine actually caused the change observed in the score. This is because the new machine replaces an existing one and there’s additional random noise from the probability distributions. The equation for this process would look something like this:

`new_score = old_score - old_machine + new_machine + random_noise`

I only know the `new_score` and `old_score` in this equation; the rest are unknowns. But by assuming that the `new_machine_delta = new_score - old_score`, I am able to create a solution that converges on the best machine anyways. This approach is kind of like an experiment without a control group. I just assume the effect is entirely due to the machine I change. By repeating it enough times, it leads to a fairly accurate result even without a control.
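To illustrate why this attribution can work, here is a small sketch that credits every score change to the swapped-in machine and averages the results. It is deliberately simpler than the procedure described above: swaps are chosen uniformly at random rather than greedily, and the ten means, the `sigma` noise level, and the name `estimate_mean_deltas` are assumptions for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def group_score(mus, rng, sigma=1.0):
    """Collective reward only; sigma is an assumed noise level."""
    return sum(rng.gauss(mu, sigma) for mu in mus)


def estimate_mean_deltas(all_mus, k=5, trials=20000, seed=0):
    """Attribute each score change entirely to the machine that was swapped in."""
    rng = random.Random(seed)
    deltas = defaultdict(list)  # machine index -> history of attributed deltas
    n = len(all_mus)
    for _ in range(trials):
        group = [rng.randrange(n) for _ in range(k)]
        old_score = group_score([all_mus[i] for i in group], rng)
        new_machine = rng.randrange(n)
        group[rng.randrange(k)] = new_machine  # swap out one random slot
        new_score = group_score([all_mus[i] for i in group], rng)
        # new_machine_delta = new_score - old_score
        deltas[new_machine].append(new_score - old_score)
    return {i: mean(history) for i, history in deltas.items()}


all_mus = list(range(91, 101))  # hypothetical means for ten machines
mean_deltas = estimate_mean_deltas(all_mus)
best = max(mean_deltas, key=mean_deltas.get)
print(f"Mu: {all_mus[best]}, Mean Delta: {mean_deltas[best]:.4f}")
```

Each individual delta mixes together the new machine's payout, the removed machine's payout, and noise from both pulls. Averaged over many swaps, the last two contributions are roughly the same for every machine, so the ordering of the mean deltas tracks the ordering of the true means. The absolute delta values will differ from the output in this post because the swapped-out machine is chosen at random here.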

```
Selected Levers:
Mu: 100, Mean Delta: -5.6185
Mu: 100, Mean Delta: -5.6185
Mu: 100, Mean Delta: -5.6185
Mu: 100, Mean Delta: -5.6185
Mu: 100, Mean Delta: -5.6185
```

Below are the final ten machines and their mean deltas. Although they’re not ordered perfectly, this approach still ends up selecting the machine with the highest mean value, which also has the least negative mean delta, after enough steps:

```
Mu: 91, Mean Delta: -6.0507
Mu: 92, Mean Delta: -6.7578
Mu: 93, Mean Delta: -6.0826
Mu: 94, Mean Delta: -5.6559
Mu: 95, Mean Delta: -7.0706
Mu: 96, Mean Delta: -5.7516
Mu: 97, Mean Delta: -8.1045
Mu: 98, Mean Delta: -5.6500
Mu: 99, Mean Delta: -5.6606
Mu: 100, Mean Delta: -5.6185
```

## Conclusion

In reality, the random noise in situations like these could be much, much larger, which might prevent an approach like this from converging. In addition, society often makes many changes at once that can affect multiple future time steps, so it would be hard to associate score changes with a single machine. But this solution is a start, so maybe I can try to address these additional complications in the future.