A/B testing, while a useful and popular tactic, suffers from a few fundamental shortcomings for mobile apps:
- Many mobile apps, especially freemium apps, are operated as services, and thus features can be improved and iterated upon ad infinitum. A/B testing lends itself to a “set it and forget it” mentality, whereby a feature that has been tested once may not be revisited again for further testing, even though other features have evolved since the original test;
- A/B testing is usually implemented using frequentist sensibilities, which impose many questions on tests that are difficult to answer: how long should the test be run for, how large should the sample sizes be, what “statistical significance” should be achieved in comparing the variants, etc. (see here for a thorough overview of the challenges of operating A/B tests with a frequentist mentalist). The answers to questions like these – in addition to being onerous to produce – are often difficult for communicate clearly and concisely;
- A/B tests can produce results that are inconclusive. Since A/B tests are not always trivial to set up and execute, such results are often interpreted as “the feature is perfect”. Continuous testing and optimization is important, but given some hurdle (in terms of effort and time) to further testing, product managers may not feel incentivized to run multiple rounds of A/B tests when the first was not immediately actionable;
- In achieving “statistical significance”, A/B tests expose half of a testing sample to a variant that might not be the optimal one (or, at best, is only as good as the other). For some mobile apps, especially in stages of high growth, this can produce a lot of “wasted” user attention (because the poorly-performing variant is only expunged at the conclusion of the test);
- Many A/B testing software suites – and especially the open source spreadsheet templates that are often used to interpret A/B testing results — don’t handle the comparison of more than one variant very well.
While A/B testing is a competent tool in evaluating variants for a simple process – for example, the best-converting variant of an e-commerce landing page that isn’t likely to change in the future relative to the rest of the site – it’s perhaps not well-suited to dynamic mobile apps operating as services.
Bayesian Bandits Algorithms
An alternative to A/B testing is Bayesian Bandits testing, exemplified through the Multi-Armed Bandit problem. Bayesian testing is an implementation of the Bayesian statistical paradigm, meaning it is a “learning method” that improves continuously as more data becomes available.
A Bayesian Bandits test essentially treats a set of feature variants (called bandits) as random variables with inherent but (a priori) unknown conversion distributions (the distributions are considered uniform distributions before data is available, given that conversion can take the form of 0 for false or 1 for true. More background on this here).
In other words, a Bayesian Bandits test considers the distribution of conversion probabilities for all bandits as equal before it knows anything about how they perform, and it updates this assumption as data is generated through the test. Once data is available, the test favors the bandit that performs the best but allows for the others to continue to be exposed, given that each bandit is considered a stochastic process (ie. poorly-performing bandits can sometimes produce good results).
A Bayesian Bandits test operates in two modes: exploration and exploitation. When a test is exploring, it is gathering data about a bandit that may not be, historically, the best performing. And when the test is exploiting, it is simply choosing the bandit with the best track record (the highest probability of success).
Two popular flavors of Bayesian Bandits testing are epsilon greedy and epsilon decreasing. An epsilon greedy algorithm biases towards the bandit with the best historical performance (exploitation) but allows for a random bandit to be chosen ε percent of the time (exploration) to allow for additional data to be harvested. A common value for ε in an epsilon greedy algorithm is 0.10, meaning that 10% of the time, a user will be exposed to a random bandit and not necessarily the best performing bandit.
An epsilon-decreasing algorithm operates similarly to an epsilon greedy algorithm, except that epsilon is very high early in the test and decreases over time. For example, an epsilon decreasing test might start with 80% exploration and 20% exploitation, meaning that early on, 80% of the time, a random bandit is chosen. This ratio might shift by 5% every X trials until it reaches 10 / 90, meaning that exploration is only conducted 10% of the time.
Note that these are only two examples of Bayesian Bandits algorithms; for a discussion of more examples, see this post.
Since the purpose of a Bandits test is to learn and react to what it has learned, in each instance of the test (ie. each time a user is shown a bandit), the results of that bandit are used to re-calculate its posterior, or its updated success distribution.
Implementing Bayesian Bandits
One attractive aspect of Bandits testing (especially with respect to A/B testing) is its ease of implementation. In psuedo-code, an epsilon greedy test might be implemented as follows:
This simulator visualizes the process of a Bandits test. Note that each bandit begins with a uniform distribution; as the test progresses, the distributions evolve to match each bandit’s inherent probability of success (the black lines represent those probabilities). The distributions here simply reflect the historical performance of each bandit; they are seeded initially as
Beta(a = 1, b =1)
and updated as
Beta(a = 1 + cumulative successes, b = 1 + n trials - cumulative successes)
As the number of trials progresses, each distribution becomes sharper, with a more pronounced peak.
The performance of a Bandits test, measured in terms of how often a sub-optimal variant was exposed to a user instead of the best-performing variant, is essentially opportunity cost and is termed “regret”. This presentation details how regret is calculated for various flavors of Bandits algorithms.
Bayesian Bandits for Mobile
The application of Bandits testing to web-based services, especially display advertising and click-through optimization, is probably obvious. But one of the key advantages of Bandits testing – that variants can be added and removed with little effort – is difficult to reconcile with mobile development cycles since updates to mobile apps aren’t necessarily immediate like they are for web-based apps (some mobile platforms require review periods).
In other words, because mobile clients can’t immediately be updated, the logic and content manipulations required of a Bandits algorithm are shifted to the server, restricting such tests only to information that can be transmitted very quickly (eg. color codes for buttons, text strings, numerical values) or for hard-coded feature variants.
This isn’t to say that content – such as various versions of a sprite, or different loading splash screens – can’t be tested with Bandits, only that the variants in such tests can’t be updated more often than the app is. For instance, a Bandits test evaluating different versions of a new users funnel would require that the variants (and all associated art assets, etc.) are either hard-coded into the app or somehow completely generated via logic determined by the test and delivered from the server.
Despite these restrictions, Bandits tests may be better suited for mobile apps that operate as services than A/B tests because of their ability to be left running indefinitely and react to upstream changes in the app. Freemium apps are generally in a constant state of improvement, either through the addition of new content, data-driven feature additions and changes, or both.
Under such conditions, one-off A/B testing can’t be relied upon to deliver the most optimized version of some aspect of the app indefinitely: all aspects of the product need to continuously evolve. Bandits testing is a simple yet effective means of ensuring that such evolution isn’t staggered across the app.