Two fundamental problems with product A/B testing

I joined the Deconstructor of Fun podcast last week for a discussion around the future of growth teams, and at one point the conversation settled on issues with A/B testing features and user experience elements, especially in mobile apps (the topic was prompted by this article: The future of mobile growth teams). I made essentially two points in the podcast discussion: that teams generally see A/B tests as “one and done,” meaning tests tend not to be revisited after they are completed, and that traffic mix is rarely accounted for when running A/B tests.

In this post, I’ll expand on those two ideas and, more broadly, on what I see as fundamental problems with A/B testing. Many words have been dedicated to A/B testing on Mobile Dev Memo: see A/B testing can kill product growth, It’s time to re-think A/B testing, Problems with A/B testing in mobile marketing, and The Hidden Costs of A/B Testing. I believe A/B testing is a blunt instrument that is often misused and over-utilized, and whose value tends to be exaggerated by product-oriented growth practitioners. I have seen endless rounds of A/B testing paralyze product teams without delivering any material long-term metrics improvements. Slavish devotion to the “test everything” mantra can actually kill growth as a product slowly evolves into a Frankenstein’s monster of independently-tested product components rather than a cohesive user experience.

More broadly, I think there are two fundamental problems with product A/B testing, or rather, the general way in which A/B testing is implemented in many product organizations:

Traffic and creative mix change over time

A/B tests reflect some measured preference at one moment in time, but traffic is very rarely accounted for in A/B testing. What I mean by this is: A/B test groups are often drawn from across the entire user base and not segmented by traffic source. The reason for this is that product changes are often implemented for the whole user base, and so segmenting the A/B groups by source of acquisition wouldn’t be actionable.

But the source of traffic has an impact on product usage, and traffic composition (and the creative that was used to source paid traffic) change over time. Imagine the DAU of a mobile product for which acquisition source skewed heavily towards organic installs at launch and progressively shifted towards paid installs from a number of different channels over the course of a year:

(note that DAU mix here means the composition of the daily user base in terms of acquisition source — not the daily new user traffic mix)

Would the results of an A/B test conducted in February, on a user base that mostly found the app through organic discovery and word of mouth, apply to the user base in December, when the vast majority of DAU was acquired from Facebook, Google, and programmatic channels?

It is not rare for significant — sometimes extreme — differences to exist between the engagement and monetization profiles of organic users and those acquired via paid channels. If a user base changes to the extent depicted above, which isn’t uncommon, then the results from the A/B test conducted in February are obsolete by December. This isn’t to say that a team couldn’t conduct a new test every month or even every week to accommodate the above change, but in my experience, product teams very rarely think about the composition of DAU when planning tests and are loath to revisit already-tested mechanics.
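To make the mix effect concrete, here is a minimal sketch with entirely hypothetical conversion rates and DAU shares: a variant that lifts conversion for organic users but depresses it for paid users will win an aggregate test under a launch-era, organic-heavy mix and lose it under a paid-heavy mix a year later.

```python
# Hypothetical per-segment conversion rates for two variants (illustrative only).
RATES = {
    "organic": {"A": 0.040, "B": 0.050},  # variant B helps organic users...
    "paid":    {"A": 0.030, "B": 0.024},  # ...but hurts paid users
}

def blended_rate(variant, mix):
    """Expected aggregate conversion rate for a variant given a DAU mix.

    mix maps segment name -> share of DAU (shares sum to 1).
    """
    return sum(share * RATES[seg][variant] for seg, share in mix.items())

feb_mix = {"organic": 0.90, "paid": 0.10}  # at launch: mostly organic
dec_mix = {"organic": 0.20, "paid": 0.80}  # a year later: mostly paid

for label, mix in [("Feb", feb_mix), ("Dec", dec_mix)]:
    a, b = blended_rate("A", mix), blended_rate("B", mix)
    print(f"{label}: A={a:.4f}  B={b:.4f}  winner={'B' if b > a else 'A'}")
# Feb: B wins in aggregate; Dec: A wins in aggregate, with no change in either
# variant's per-segment performance -- only the mix moved.
```

Neither variant's behavior changed between February and December; only the composition of DAU did, which is exactly why an unsegmented test result has a shelf life.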

And it’s not just the DAU mix that needs to be considered when retiring A/B test results: it’s the average cohort age of DAU, the daily traffic mix for new users (especially for tests related to the first-time user experience, or FTUE), and even the creative that is used to onboard paid users. As an example, here is a screenshot from a current, live Facebook ad for the mobile game Homescapes, taken from its Facebook ads library:

Homescapes was launched in 2017. Below is a screenshot from a 30-second ad launched in 2017 that is perhaps more indicative of gameplay:

The point is that the intentions and motivations of a user acquired from the first ad might be markedly different from those of a user acquired from the second ad. Ad creative changes over time as audiences saturate and new demographics need to be targeted; A/B tests must capture that shift, but a constant testing cadence is rarely treated as a necessity in A/B testing programs.

Overlapping / chained tests aren’t accounted for

The second fundamental problem with the way most teams approach A/B testing is that multiple, intersecting tests are run at the same time without being accounted for. This is fine if the things being tested — the features, dialogue boxes, copy, screen orderings, etc. — are completely independent of each other, meaning that a user’s reaction to one wouldn’t influence the user’s reaction to another. But this is rarely the case, especially where the FTUE is concerned (let alone the traffic source, as detailed above and at length in this article). Chained A/B tests can influence user behavior throughout the lifecycle; without very thoughtful construction, a series of overlapping A/B tests amalgamates into a single test of the entire user lifecycle, with each path through the tests acting as one variant.

For example, consider the following two users that are exposed to three A/B tests over the first few sessions of their user lifecycles:

A PM might believe that three A/B tests are running simultaneously in the above scenario, but if those product features are not independent of each other, in reality, just one test is being run:

The observable variants here are:

  • Variant 2 of FTUE Test -> Variant 1 of Early Offer Test -> Variant 4 of Chat Interface Test;
  • Variant 1 of FTUE Test -> Variant 5 of Early Offer Test -> Variant 2 of Chat Interface Test;

The reason for this is that the FTUE Test very well may interfere with the Early Offer Test; the results can’t be parsed out as independent of each other. The variant that User 1 saw in the FTUE Test may influence their behavior when exposed to the Early Offer Test, and if that is true, then those two tests — together with the subsequent Chat Interface Test — combine to form just one variant of one test. If each of these tests featured five variants, then there’d be 5^3 or 125 total variants of this test — likely unmanageable from a data volume perspective.
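The variant-explosion arithmetic can be sketched directly. The per-cell sample requirement below is a hypothetical figure chosen for illustration, not a statistical rule:

```python
# Back-of-envelope: how chained, non-independent tests multiply the variant space.

def variant_cells(variants_per_test):
    """Number of distinct user paths when overlapping tests interact."""
    cells = 1
    for v in variants_per_test:
        cells *= v
    return cells

tests = [5, 5, 5]             # FTUE, Early Offer, Chat Interface: 5 variants each
cells = variant_cells(tests)  # 5^3 = 125 distinct paths
per_cell = 2_000              # hypothetical users needed per cell for a readable result
print(cells, cells * per_cell)
# 125 paths -> 250,000 users just to read this one composite "test"
```

Adding a fourth five-variant test to the chain pushes the cell count to 625, which is why interacting tests become unmanageable from a data volume perspective so quickly.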

Bayesian bandit testing has come into vogue as an alternative to the stricter and less flexible A/B/n testing that most product teams orient their development cycles around. But bandit systems require even more infrastructure and deeper product integration than A/B testing mechanics do: implementing them imposes a significant development cost and requires re-orienting product development around maintaining that infrastructure, versus the more discrete process of planning and executing an A/B test.
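For a sense of the mechanics, here is a minimal Thompson sampling sketch of a Beta-Bernoulli bandit (the class name and the conversion rates are hypothetical, and this omits all of the production infrastructure discussed above): instead of splitting traffic evenly until a test concludes, the allocator samples from each variant’s posterior and routes each user to the best draw, shifting traffic toward better-performing variants continuously.

```python
import random

class ThompsonBandit:
    """Illustrative Beta-Bernoulli Thompson sampling allocator (not production code)."""

    def __init__(self, n_arms):
        # Beta(1, 1) priors: one (successes, failures) pair per variant
        self.alpha = [1] * n_arms
        self.beta = [1] * n_arms

    def choose(self):
        # Sample a plausible conversion rate for each arm; serve the best draw
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return draws.index(max(draws))

    def update(self, arm, converted):
        if converted:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Toy loop against hypothetical true conversion rates
random.seed(0)
true_rates = [0.03, 0.05, 0.04]
bandit = ThompsonBandit(len(true_rates))
plays = [0] * len(true_rates)
for _ in range(20_000):
    arm = bandit.choose()
    plays[arm] += 1
    bandit.update(arm, random.random() < true_rates[arm])
print(plays)  # traffic tends to concentrate on the best-converting arm over time
```

Because allocation responds to observed behavior in real time, a bandit naturally re-adapts as the traffic mix drifts, which addresses the staleness problem described earlier, at the cost of permanent infrastructure.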

But it seems clear that this is the direction in which product analytics is heading, especially as mobile advertising becomes more automated and deeper-funnel signaling becomes a more important part of traffic acquisition. If user acquisition depends on product signals in real time, then product signals need to be tuned and optimized in real time, and thus a real-time testing framework is needed to optimize the entire user lifecycle.

Photo by Louis Reed on Unsplash