Bayesian Bandits: Behind the scenes of Facebook’s spend allocation decisioning

This guest post is written by Shamanth Rao, a seasoned mobile user acquisition executive and a popular contributor to the Mobile Dev Memo slack team.

Oftentimes, Facebook’s automated ad spend allocation can be frustrating to make sense of. Because spend is allocated unequally among ads within an ad set container, it can be hard to tell which ads are actually performing better than the others.

Clearly, ads that Facebook considers better get more spend. And while we often see high-performing ads receive the bulk of spend, we also sometimes see ads that don’t do as well get a fair amount of spend.

The algorithmic decisions used to determine which ad is better, and which is worse, are hard to understand.

To know is to be prepared. Today, we’ll talk about how Facebook approaches this mathematically, why Facebook makes the spend allocation decisions in the way it does, and some of the reasons this algorithmic allocation can break down.

Challenges with simple A/B testing approaches

Although they are easy to understand, winner-takes-all A/B tests can waste advertisers’ money. There are three key challenges with A/B tests:

  1. Consumer preferences evolve over time. If a single creative is declared the winner, that implies it should always be shown. This isn’t always the case: the A/B test may have been run during a specific season or on specific days (for instance, a food delivery app may have a salad as its winning creative in summer and hot chocolate as its winning creative in winter). A ‘winner’ may not always be a winner, and a ‘loser’ might not always be a loser. There can be false positives and negatives.
  2. All ads are (sometimes) effective. Real life is probabilistic — a winning creative isn’t the best for 100% of users / impressions (as is sometimes implied in an A/B testing paradigm). The fact that a creative is ‘winning’ only means it performs better most of the time. There may be 10% of the impressions / audiences for which the ‘losing’ creative is better. Some impressions on ‘bad’ ads lead to purchases, and some impressions on ‘good’ ads result in no purchases.
  3. Equal allocation of impressions results in wasted spend. If a winning creative is better 90% of the time and a losing creative is better only 10% of the time, allocating spend 50-50 between them wastes money: the losing creative keeps running far more often than it deserves (the short arithmetic sketch after this list puts a rough number on this).
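To put a rough number on that third point, here’s a back-of-the-envelope sketch in Python. The per-ad purchase rates are invented for illustration; the only point is how many purchases a 50-50 split gives up compared with leaning heavily on the stronger ad.

```python
impressions = 1_000_000
ppm_a, ppm_b = 0.50, 0.35   # invented purchase rates per 1,000 impressions

# 50-50 split between the two ads.
even_split = (0.5 * impressions * ppm_a + 0.5 * impressions * ppm_b) / 1000

# 90-10 split in favor of the stronger ad.
skewed_split = (0.9 * impressions * ppm_a + 0.1 * impressions * ppm_b) / 1000

print(f"purchases with a 50-50 split: {even_split:.0f}")    # 425
print(f"purchases with a 90-10 split: {skewed_split:.0f}")  # 485
```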

The Bayesian Bandits paradigm

Facebook’s algorithm uses a probabilistic Bayesian approach to address the above problems. While Bayesian Bandits is a cool name (for an algorithm), why is it better than less complex algorithms?

In a Bayesian paradigm, you use information you already know (priors) to make predictions about something you want to know. The term ‘bandits’ comes from a class of probability problems modeled on a row of slot machines (‘one-armed bandits’) on a casino floor: the machines appear identical, but each has a different, unknown payout.

In the Bayesian Bandits paradigm, our gambler knows how often the slot machines previously resulted in ‘wins’ — and is confronted with the problem of making decisions about what machines to play in the future.
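Here’s a minimal Python sketch of that decision process using Thompson sampling, one common way to implement a Bayesian Bandit (the payout rates below are invented, and the gambler never sees them directly): the gambler keeps a Beta prior over each machine’s payout rate based on past wins and losses, samples a plausible rate from each posterior, and plays the machine whose sample comes out highest.

```python
import random

# Hypothetical payout rates -- unknown to the gambler.
true_payout = {"machine_1": 0.04, "machine_2": 0.07, "machine_3": 0.05}

# Beta(1, 1) priors: one (wins, losses) count per machine.
wins = {m: 1 for m in true_payout}
losses = {m: 1 for m in true_payout}

for _ in range(10_000):
    # Sample a plausible payout rate for each machine from its posterior...
    sampled = {m: random.betavariate(wins[m], losses[m]) for m in true_payout}
    # ...and play the machine whose sampled rate is highest.
    choice = max(sampled, key=sampled.get)
    if random.random() < true_payout[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

# Plays per machine: most go to machine_2, but the others are never fully abandoned.
print({m: wins[m] + losses[m] - 2 for m in true_payout})
```

Over time, plays concentrate on the machine with the best observed record, while the other machines still get occasional exploratory plays.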

How does all this apply to Facebook ads? 

Much like a gambler is confronted with the problem of which slot machine to play in order to maximize winnings, Facebook’s ‘game’ is to select and prioritize ads it can show to a user, to maximize revenue for itself and for the advertiser.

Facebook could use A/B testing and show different ads to equal numbers of users before shutting off ads with worse performance. But such winner-takes-all A/B testing suffers from the problems outlined above: preferences that shift over time, ads that still work for a minority of users, and spend wasted on equal allocation. Facebook uses the Bayesian Bandits paradigm rather than A/B testing to address these problems.

For what it’s worth, the Bayesian Bandits approach isn’t unique to Facebook ads. Bayesian approaches have seen massive growth over the last couple of decades, as the growth of the internet made measurement of user behavior and interactions ubiquitous.

What was once a neglected statistical technique became far more powerful once the large amounts of data its algorithms require were available. Over the past decade, Bayesian approaches have gone from an arcane mathematical field to one that directly benefits from the ubiquity of data the internet has enabled.

Newspapers looking to decide between different headlines, retailers looking to decide between different packaging, airlines looking to decide between different price points — and of course advertising platforms looking to decide between different ads — all use some flavor of the Bayesian Bandits approach.

How the Bayesian Bandits approach works specifically in Facebook ads

Let’s assume that an advertiser is optimizing for purchases. In this case, its KPI is purchases per thousand impressions (PPM). The algorithm starts by assuming it doesn’t know what the expected PPM of each ad is.

The algorithm displays all ads to a random selection of users and measures their purchases. If an ad (call it A) has more purchases than another ad (call it B) for the same number of impressions, we infer that ad A has a higher probability of having a higher PPM than ad B. 
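Here’s a minimal sketch of that inference step, assuming a simple Beta-Binomial model (the impression and purchase counts are invented for illustration; this is not Facebook’s actual code): each ad’s per-impression purchase rate gets a Beta posterior, and the probability that ad A’s PPM is higher than ad B’s can be estimated by sampling from both posteriors.

```python
import random

def rate_samples(purchases, impressions, n=100_000):
    """Plausible per-impression purchase rates under a Beta(1, 1) prior
    updated with the observed purchases and non-purchasing impressions."""
    return [random.betavariate(1 + purchases, 1 + impressions - purchases)
            for _ in range(n)]

# Invented counts: same impressions, ad A converted slightly more often.
rates_a = rate_samples(purchases=50, impressions=100_000)   # ~0.50 PPM observed
rates_b = rate_samples(purchases=45, impressions=100_000)   # ~0.45 PPM observed

p_a_better = sum(a > b for a, b in zip(rates_a, rates_b)) / len(rates_a)
print(f"P(ad A has the higher PPM) ~ {p_a_better:.2f}")
print(f"expected PPM, ad A ~ {1000 * sum(rates_a) / len(rates_a):.2f}")
print(f"expected PPM, ad B ~ {1000 * sum(rates_b) / len(rates_b):.2f}")
```

With counts like these, the model is only moderately confident (roughly 70%) that A is the better ad.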

Here is how this would look graphically: as a probability distribution function, where the X axis shows PPM (purchases per mille) and the Y axis shows how likely each PPM value is.

So both ads have a very low probability of having a PPM over 0.7, the ‘good’ ad has a high probability of having a PPM around 0.5, and the ‘bad’ ad has a high probability of having a PPM around 0.45.
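If you want to reproduce a picture like that, here is a small sketch (using the same invented counts as above, and assuming numpy, scipy, and matplotlib are available) that plots each ad’s posterior density over PPM:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

def ppm_density(purchases, impressions, ppm_grid):
    """Posterior density over PPM under a Beta(1, 1) prior, obtained by
    rescaling the per-impression rate density onto the per-1,000 axis."""
    rate = ppm_grid / 1000.0
    return beta.pdf(rate, 1 + purchases, 1 + impressions - purchases) / 1000.0

ppm = np.linspace(0.0, 1.0, 500)  # X axis: purchases per 1,000 impressions
plt.plot(ppm, ppm_density(50, 100_000, ppm), label="ad A (~0.50 PPM)")
plt.plot(ppm, ppm_density(45, 100_000, ppm), label="ad B (~0.45 PPM)")
plt.xlabel("PPM (purchases per 1,000 impressions)")
plt.ylabel("posterior density")
plt.legend()
plt.show()
```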

Now that the algorithm knows which ad has higher ‘expected PPM’ (in the above example, it would be ad A), it shows more impressions to ad A and fewer impressions to ad B.

What if ad A’s expected PPM were massively higher than ad B’s (say the PPM range for which we had 90% confidence was 0.8 to 0.9 for ad A, and 0.2 to 0.3 for ad B)?

In this case, the algorithm would show most impressions to ad A, but it would still recognize that ad B has a small non-zero probability of having a PPM of 0.9 and above, and it would therefore continue showing it a small number of impressions.
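To see why the weaker ad never drops to exactly zero, here’s a rough simulation under a Thompson-sampling style allocation (my assumption about the mechanics, not something Facebook documents): for each impression, a plausible purchase rate is sampled for each ad from its posterior, and the impression goes to the ad whose sample is highest. An ad with a clearly worse posterior still wins the occasional draw. The counts below are invented and deliberately small, so both posteriors remain uncertain.

```python
import random

# Invented history: ad A looks clearly better, but both posteriors are still uncertain.
history = {
    "ad_A": {"purchases": 8, "impressions": 10_000},  # ~0.8 PPM observed so far
    "ad_B": {"purchases": 3, "impressions": 10_000},  # ~0.3 PPM observed so far
}

served = {ad: 0 for ad in history}
for _ in range(100_000):
    # Sample a plausible purchase rate for each ad from its Beta posterior.
    # (For simplicity, the posteriors are not updated as impressions are served.)
    sampled = {
        ad: random.betavariate(1 + h["purchases"],
                               1 + h["impressions"] - h["purchases"])
        for ad, h in history.items()
    }
    # The impression goes to the ad whose sampled rate is highest.
    served[max(sampled, key=sampled.get)] += 1

print(served)  # ad_A gets the large majority; ad_B keeps a small, non-zero share
```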

(This of course is a simplified representation, and while Facebook’s actual algorithm does account for decisioning between multiple optimization goals, multiple advertisers and multiple users, the above is the broad approach that Facebook uses).

Why practice diverges from theory — and when this paradigm might break down

All of this makes sense and is intuitive. It’s clear that ‘better’ ads should win in a Bayesian Bandits paradigm. 

So why don’t they always win? This is a source of frustration for most marketers: an ad that performs badly sometimes seems to get a lot of spend, and that seemingly doesn’t fit into the Bayesian Bandits paradigm, where the better ads should win.

Here’s why this can happen. In order for the algorithm to make decisions about expected probabilities, it has to see a large enough volume of purchases to form a reliable estimate. If an ad has only 2 or 3 purchases, the algorithm’s estimate of the probability that its PPM falls in a given range will not be very accurate.

Sometimes the algorithm will see 1 or 2 purchases (or more upstream events such as installs or registrations that it takes as leading indicators for purchases) and infer that the ad is ‘better’, just because it doesn’t (yet) have the 10 or 20 purchases to make a conclusive assessment. 
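Under the same simple Beta-Binomial model assumed in the sketches above, it’s easy to see how shaky those early estimates are: compare the 90% credible interval for PPM after 2 purchases with the interval after 20 purchases at roughly the same observed rate (counts invented, scipy assumed available).

```python
from scipy.stats import beta

def ppm_interval(purchases, impressions, level=0.90):
    """90% credible interval for PPM under a Beta(1, 1) prior."""
    lo, hi = beta.interval(level, 1 + purchases, 1 + impressions - purchases)
    return 1000 * lo, 1000 * hi

print(ppm_interval(2, 4_000))    # roughly 0.2 to 1.6 PPM -- very wide
print(ppm_interval(20, 40_000))  # roughly 0.35 to 0.75 PPM -- much tighter
```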

This of course is why savvy marketers favor and recommend liquidity, or having enough conversion events to let the algorithms make conclusive decisions.

However, this isn’t always possible or easy, especially since it takes time and money for ads to accumulate enough history for the algorithms to form accurate probability estimates.

This is why Facebook’s algorithms can sometimes seem to favor ‘bad’ ads: they simply don’t have enough data to calculate accurate probabilities.

Hopefully understanding what happens ‘under the hood’ can help marketers understand how to work with, and leverage, Facebook’s algorithms better.

Understanding that Facebook’s decisioning is based on past performance helps us understand that the more high-signal events we send Facebook, the better Facebook’s decisioning can be — and the better these decisions can work for you rather than against you. 

As a signboard in a surf shack I visited once proclaimed…

Shamanth Rao is the founder and CEO of the boutique user acquisition firm RocketShip HQ and host of the Mobile User Acquisition Show podcast. He’s managed user acquisition leading up to three exits (Bash Gaming sold for $170mm, PuzzleSocial sold to Zynga, and FreshPlanet sold to GameLoft), and has managed 8 figures in user acquisition spend.

With thanks to Erika Kretzmer for reviewing and critiquing this article.