Ad creative shouldn’t be A/B tested

Note: this post is adapted from material in the upcoming Modern Mobile Marketing at Scale workshop taking place in Berlin on April 23. More information here; places are still available for the workshop.

Ad creative management and testing have taken on increased priority for mobile-first companies as platform-driven algorithmic campaign management has become the standard in mobile advertising. Back in September, I wrote the following about the power of creative experimentation in this new landscape:

The point here is that, while creative was always an important input to marketing success on mobile, with manual segment targeting, it wasn’t feasible to test creative against every possible user segment — but now it is. And because of that, there is a direct link between the volume and diversity of creatives and the success of a campaign: the more varied creatives an advertiser is able to produce, the more granularly and specifically the various segments that Facebook and Google construct are able to be evaluated, leading to optimal performance at the level of very targeted segments versus sub-optimal performance at the level of the very broad segment definitions that advertisers would construct manually.

People who work in consumer technology tend to equate “testing” with A/B or multivariate testing: that is, exposing similarly-sized, similarly-composed groups of users to different variants of some aspect of a product and using an average success metric per group to determine which variant performs best. A/B testing has been discussed at length on Mobile Dev Memo, and I won’t revisit its merits and drawbacks as an approach here. But I do want to make the point that A/B testing doesn’t make sense in the context of testing ad creatives: it simply isn’t congruous with the way that ad creatives are utilized in the modern marketing environment.

I’ll begin my argument with a thought experiment: if a company incurred no cost in producing ad creatives and could craft an ideal, perfectly relevant and personalized ad creative for every single person on the planet — meaning that each person’s assigned creative was so accurately tailored to them that they’d click on it immediately, and that their response to any other person’s creative would by definition be less enthusiastic — would there be any reason to A/B test those creatives against each other? With each creative tailored to a specific person, there’d be nothing to test: everyone is getting their own personal creative, and it’s perfect, so nothing can be improved.

A/B testing in this scenario wouldn’t make sense because the groups consist of one person each, and each is optimally served with one creative. The purpose of A/B testing in general is to maximize some average behavior (engagement, monetization, etc.) at the level of the group. But personalization has a different goal: to maximize behavior at the level of the individual. These objectives are similar, but personalization is always preferable to group-level optimization: companies A/B test because they can’t personalize. If the experience of every user were fully optimized for them, there’d be nothing for an A/B test to improve, whereas merely selecting the best-performing A/B test variant does not imply that every single user is being exposed to the variant most relevant to them (in other words: the winning variant of an A/B test simply has the highest average performance across the sample; it does not necessarily perform best for every user).
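To make that difference concrete, here is a minimal sketch with made-up per-user click probabilities (the users, variant names, and numbers are all hypothetical): the A/B “winner” is whichever variant has the best average response, while showing each user their own best variant can, by construction, only match or beat it.

```python
# Toy illustration (hypothetical click probabilities): the A/B winner maximizes the
# *average* response, which is not the same as showing every user the variant that
# is best *for them*.

# Hypothetical per-user click probabilities for two creative variants.
users = {
    "user_1": {"A": 0.09, "B": 0.03},
    "user_2": {"A": 0.08, "B": 0.02},
    "user_3": {"A": 0.01, "B": 0.07},
    "user_4": {"A": 0.02, "B": 0.06},
}

# A/B test logic: pick the single variant with the best average across the group.
avg = {v: sum(u[v] for u in users.values()) / len(users) for v in ("A", "B")}
ab_winner = max(avg, key=avg.get)
ab_expected_clicks = sum(u[ab_winner] for u in users.values())

# Personalization logic: each user sees whichever variant is best for them.
personalized_expected_clicks = sum(max(u.values()) for u in users.values())

print(f"A/B winner: {ab_winner}, expected clicks if everyone sees it: {ab_expected_clicks:.2f}")
print(f"Expected clicks with the per-user best variant: {personalized_expected_clicks:.2f}")
# The personalized total is always >= the A/B winner's total, since each user's best
# variant is at least as good for them as the group-average winner.
```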

Facebook provides a tool for A/B testing ad creative (and other campaign properties) that it calls a Split Test; the purpose of split testing creatives is to assess the performance of one against another. But Facebook also algorithmically optimizes campaigns in real time, distributing traffic to the ad sets and ads that perform best so as to maximize campaign performance. If Facebook dynamically curates and diverts traffic to creatives on the basis of performance — meaning it changes the audiences that see certain ads based on how they respond to them — then isn’t a Split Test an unnecessary and arbitrary exercise: testing a creative against an audience that might never be exposed to it “in the wild”?

What a Split Test does is take a rigid audience definition and force equal exposure to both ad variants within it:

Suppose that Variant 1 works very well on a small, well-defined audience, whereas the audience definition used here is broad. Variant 2 might outperform Variant 1 on the broad audience, and thus Variant 2 would be deployed to the campaign. But Variant 1 still has value with the niche audience and should be used for it — losing this Split Test on this particular audience doesn’t necessarily indict it as a poor performer.

And Facebook’s own algorithmic traffic distribution mechanism recognizes that — assuming no demographic constraints on the campaign, Facebook would dynamically distribute the more niche creative to a better-defined audience as it saw that audience respond. Audiences on Facebook aren’t rigid; they are fluid, changing in accordance with performance. Forcing a rigid audience definition on a Split Test is antithetical to the way Facebook channels traffic.
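Here is a toy worked example of that niche-versus-broad scenario, with invented segment sizes and conversion rates: Variant 1 loses the split test on the combined audience, yet routing it to the niche segment it serves well produces more total conversions than deploying the split-test winner everywhere.

```python
# Toy sketch (hypothetical segment sizes and conversion rates) of the niche-vs-broad
# scenario described above.

segments = {
    # segment: (size, {variant: conversion rate})
    "niche": (1_000, {"Variant 1": 0.080, "Variant 2": 0.030}),
    "rest":  (9_000, {"Variant 1": 0.010, "Variant 2": 0.030}),
}

def total_conversions(choice_per_segment):
    # Expected conversions when each segment is shown its assigned variant.
    return sum(size * rates[choice_per_segment[name]]
               for name, (size, rates) in segments.items())

# Split test on the broad (combined) audience: pick one variant for everyone.
broad_rates = {
    v: sum(size * rates[v] for _, (size, rates) in segments.items())
       / sum(size for _, (size, _) in segments.items())
    for v in ("Variant 1", "Variant 2")
}
split_test_winner = max(broad_rates, key=broad_rates.get)

everyone_gets_winner = total_conversions({"niche": split_test_winner, "rest": split_test_winner})
per_segment_routing  = total_conversions({"niche": "Variant 1", "rest": "Variant 2"})

print(f"Split-test winner on the broad audience: {split_test_winner}")
print(f"Conversions if everyone sees the winner: {everyone_gets_winner:.0f}")
print(f"Conversions with per-segment routing:    {per_segment_routing:.0f}")
# Variant 1 loses the split test overall but is still the right creative for the niche.
```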

The Facebook distribution mechanism performs as below, with audiences being updated and refined in real time as performance data is aggregated:

In the above diagram, Facebook initially distributes traffic to the ads equally but recognizes that the audiences for each are different, and it therefore defines new, creative-specific audiences in real time to optimize performance. A more efficient way of testing ad creatives on Facebook than Split Testing is to simply put the creatives in an ad set and let Facebook manage traffic distribution to them. And if audiences need to be segmented or geo-fenced, that can be accomplished with different ad sets — there’s almost no reason to Split Test creatives, since the audience used in a Split Test won’t be the one exposed to any given creative at scale.
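As a rough approximation of that behavior (Facebook’s actual delivery algorithm is proprietary; Thompson sampling is used here purely as a stand-in, with made-up click-through rates), the sketch below shows how an adaptive allocation starves a weak creative of impressions, whereas a Split Test guarantees it a fixed share of them:

```python
# Rough sketch of adaptive impression allocation via Thompson sampling -- a stand-in
# for Facebook's proprietary delivery algorithm. CTRs and impression counts are
# made up for illustration.
import random

random.seed(0)
true_ctr = {"strong_creative": 0.040, "weak_creative": 0.010}
impressions = 20_000

# Beta(1, 1) priors; alpha accumulates clicks, beta accumulates non-clicks.
alpha = {c: 1 for c in true_ctr}
beta = {c: 1 for c in true_ctr}
served = {c: 0 for c in true_ctr}

for _ in range(impressions):
    # Sample a plausible CTR for each creative and serve the one that samples highest.
    sampled = {c: random.betavariate(alpha[c], beta[c]) for c in true_ctr}
    chosen = max(sampled, key=sampled.get)
    served[chosen] += 1
    if random.random() < true_ctr[chosen]:
        alpha[chosen] += 1
    else:
        beta[chosen] += 1

print("Adaptive allocation:", served)
print("Forced 50/50 split: ", {c: impressions // 2 for c in true_ctr})
# The weak creative ends up with a small fraction of impressions under adaptive
# allocation, versus a guaranteed half under an equal split.
```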

In fact, Split Testing is probably the most expensive way of vetting creatives, since it forces some minimum amount of traffic to be exposed to a given creative, no matter how poorly that specific audience and creative are matched. Allowing Facebook’s algorithmic traffic distribution mechanism to update audiences in real time is not only easier, faster, and more convenient than Split Testing, but it’s also cheaper in most cases, as poorly-performing creatives are simply starved of traffic (versus having some minimum number of impressions fulfilled on them). Split Testing violates the dynamics of Facebook’s ad serving algorithm; it’s a concept that isn’t appropriate for the platform, and it does little more than satisfy a desire to put a checkmark next to a “Tested?” line item on a process checklist.