Problems with A/B testing in mobile marketing

Posted on April 6, 2015 by Eric Benjamin Seufert

A/B testing in app development (and in general) is a controversial subject; despite the data-oriented nature of app development, a common criticism of the constant regimen of feature and aesthetics tests that most apps undergo is that the creative component of the design process is relegated to an afterthought.

But from a statistical and analytical standpoint, A/B testing is problematic for other reasons. The head of Data Analytics at King recently gave an interview in which he identified four problems with A/B testing in free-to-play games:

  1. The Pareto distribution of monetization characteristics in free-to-play games lends itself to skewed results derived from a small number of highly enthusiastic players (the opposite perspective on this point is also relevant: because the behaviors of the most enthusiastic players are those that developers seek to optimize for, tests must be exposed to large numbers of players to capture behaviors from across the entire distribution);
  2. A/B tests are difficult to administer in games for which player experiences should be unified across devices (e.g., a game that runs on Facebook canvas, iPhone, and iPad);
  3. The effects of A/B tests can be difficult to verify, leading to very long testing periods;
  4. A/B test results are often too specific to be applied universally (across a game portfolio), reducing the value of any single test.

These are salient points, and they highlight conceptual problems with A/B testing (especially in the context of freemium products) that go beyond the typical questions raised around the practice, such as the tendency for testers to stop tests as soon as significant effects are observed (see this article for a list of common testing pitfalls).
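That early-stopping pitfall is easy to demonstrate with a small simulation. The sketch below (illustrative Python with invented conversion rates and batch sizes) runs A/A tests, in which both variants share the same true rate, so any "significant" result is a false positive. A tester who peeks at a two-proportion z-test after every batch and stops as soon as |z| exceeds 1.96 will declare spurious winners far more often than the nominal 5%:

```python
import math
import random

random.seed(42)

def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic, assuming equal sample sizes n per arm."""
    p_a, p_b = conv_a / n, conv_b / n
    p = (conv_a + conv_b) / (2 * n)  # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (2 / n))
    return 0.0 if se == 0 else (p_a - p_b) / se

def peeking_test(rate, batch, peeks):
    """A/A test: both arms share the same true conversion rate.
    The tester peeks after every batch and stops at |z| > 1.96."""
    conv_a = conv_b = n = 0
    for _ in range(peeks):
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if abs(z_stat(conv_a, conv_b, n)) > 1.96:
            return True  # stopped early, declared a (spurious) winner
    return False

trials = 1000
false_positives = sum(peeking_test(0.05, 200, 10) for _ in range(trials))
print(f"false positive rate with 10 peeks: {false_positives / trials:.1%}")
# Well above the nominal 5% that a single fixed-horizon test would give.
```

Running the test to a pre-committed sample size, and checking significance only once, restores the advertised error rate.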

But while testing features and stylistic elements of apps is fraught with the potential for misuse and misinterpretation, A/B testing for mobile marketing and performance user acquisition purposes is even more precarious.

One problem with A/B testing marketing materials (such as ad creatives, icons, and app store screenshots) is that it can be difficult to run mobile advertising campaigns that aren’t automatically optimized by ad networks, producing skewed results. Most mobile ad networks operate some version of a Bayesian bandits algorithm that prioritizes the best-performing (with respect to acquisition costs) ad creative. If two marketing variants are being A/B tested against each other, but the traffic being input into the tests is optimized by variant, then the results are incomparable: each variant was fed the traffic that performs best for it, which means the pools of impressions from which each was served are different. Some ad networks (e.g., Facebook) allow campaigns to be run without automatic optimizations being applied, but most don’t.
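To see why network-side optimization skews a test, consider a minimal Thompson-sampling sketch (one common form of a Bayesian bandit). The creative names and "true" install rates below are invented for illustration; the point is only the mechanism by which traffic drifts toward one variant:

```python
import random

random.seed(7)

# Invented "true" install rates for two creatives; the network doesn't
# know these -- it learns them by reallocating impressions.
TRUE_RATE = {"A": 0.04, "B": 0.02}

# One Beta(1, 1) posterior per creative (Thompson sampling).
state = {v: {"alpha": 1, "beta": 1, "impressions": 0} for v in TRUE_RATE}

for _ in range(10_000):
    # Draw a plausible rate from each posterior; serve the highest draw.
    chosen = max(state, key=lambda v: random.betavariate(
        state[v]["alpha"], state[v]["beta"]))
    state[chosen]["impressions"] += 1
    if random.random() < TRUE_RATE[chosen]:
        state[chosen]["alpha"] += 1  # conversion observed
    else:
        state[chosen]["beta"] += 1   # no conversion

for v, s in state.items():
    print(v, s["impressions"], "impressions")
# The stronger creative receives the bulk of the traffic, so the two
# variants were never shown to comparable pools of impressions.
```

Because the network, not the marketer, decides who sees which creative, per-variant metrics from such a campaign can't be read as the results of a controlled experiment.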

But a second and more fundamental problem with A/B testing for app marketing creatives is more subtle. A/B testing aspects of an app’s design requires a view of the overall impact of a change on a user’s behavior over their lifetime in the app (e.g., will an increase in a short-term metric ultimately reduce a long-term metric?). A/B testing of marketing items has the same concerns but also introduces considerations around targeting and total addressable market.

Targeting in app marketing presents a difficult balance to achieve: what’s the optimal level of breadth for a marketing campaign with respect to some optimization metric? When marketers test their creatives, they often seek to optimize the effectiveness of their advertising campaign materials: creative conversion (click-through-rate) performance against cost of acquisition (cost per install). This is understandable, but it’s a local optimization that ignores the ultimate success metric of an app developer: revenue. Does a hyper-optimized campaign delivering low per-user acquisition costs actually benefit the app more than a campaign with higher per-user acquisition costs? Not necessarily.

This is the click-through-rate conundrum: ad conversions (clicks) can move in the opposite direction of platform store conversions (installs from the platform store page) when ad creatives are optimized exclusively for clicks. For this reason, the real success metric for ad performance is click-to-install (click-through-rate * install rate from the platform store page), but even this isn’t necessarily the best metric to use in evaluating ad materials. What matters most for an app developer is total net revenue (unless other strategic initiatives take precedence, e.g., growing an app’s user base prior to a funding round or exit), and that’s how targeting should be defined.
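The conundrum is easiest to see with numbers. The figures below are hypothetical, not benchmarks: one creative is tuned purely for clicks, the other represents the app faithfully, and the click-optimized creative loses once the full click-to-install funnel is measured:

```python
# Hypothetical funnel numbers for two ad creatives; invented purely
# to illustrate the click-through-rate conundrum.
creatives = {
    "clicky":   {"ctr": 0.020, "install_rate": 0.10},  # optimized for clicks
    "faithful": {"ctr": 0.012, "install_rate": 0.30},  # shows real gameplay
}

for name, c in creatives.items():
    cti = c["ctr"] * c["install_rate"]  # click-to-install per impression
    print(f"{name}: {cti:.4f} installs per impression")

# clicky:   0.020 * 0.10 = 0.0020 installs per impression
# faithful: 0.012 * 0.30 = 0.0036 installs per impression
# The lower-CTR creative delivers 80% more installs per impression.
```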

Thus, an A/B test for marketing materials shouldn’t necessarily reveal which ads reduce acquisition costs the most, or produce the most engaged or monetizing users, but rather which materials produce the optimized traffic mix, at scale, that generates the most revenue. This won’t always be the broadest, cheapest, and most clickable campaign, nor will it always be the campaign that produces the cohorts with the strongest engagement or monetization metrics. The optimal marketing variant is the one that ultimately produces the most revenue for the app: size of the user base against user economics (cost of acquisition and lifetime revenue).
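That trade-off can be made concrete with a hypothetical comparison. The CPIs, lifetime revenues, and install volumes below are invented for illustration; they show only that the cheapest, broadest campaign is not automatically the one that maximizes net revenue:

```python
# Two hypothetical campaign variants; CPI and lifetime revenue (LTV)
# figures are invented to illustrate the point, not real benchmarks.
campaigns = {
    "broad":  {"installs": 100_000, "cpi": 1.00, "ltv": 1.20},
    "narrow": {"installs": 40_000,  "cpi": 2.50, "ltv": 4.00},
}

for name, c in campaigns.items():
    net = c["installs"] * (c["ltv"] - c["cpi"])  # net revenue per cohort
    print(f"{name}: net revenue ${net:,.0f}")

# broad:  100,000 * (1.20 - 1.00) = $20,000
# narrow:  40,000 * (4.00 - 2.50) = $60,000
# The smaller, more expensive campaign produces three times the net revenue.
```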