Avoiding Simpson's paradox in data analysis

Posted on April 28, 2014 by Eric


Simpson's paradox is a statistical phenomenon in which a population, composed of different sub-populations (“classes”) with differing statistical properties, exhibits a relationship at the global, aggregate level that might not reflect what is observed at the level of the individual classes. In other words, Simpson's paradox means that observed relationships between variables in a data set can change, or even reverse, when the data is grouped.

Simpson's paradox manifests when a data set is looked at broadly, without being broken down into meaningful segments. It is the result of what are known as “confounding variables” being omitted from a study. A confounding variable is essentially a variable outside of the focus of a study that correlates with (changes with) both the independent variable and the outcome being measured.

For example, consider a mobile app's user base in which 10,000 people use Android devices and 5,000 people use iOS devices. The overall conversion rate for the user base is 5%: 4% of iOS users convert and 5.5% of Android users convert.


Assuming equivalent monetization (e.g., Android users spend as much as iOS users), a product manager with limited resources might make some drastic decisions based on this data, perhaps prioritizing Android feature development over iOS or even canceling the iOS project altogether.

When the data is broken down by device type, however, a different picture of the user base unfolds.


It is now revealed that iOS tablets convert better than Android tablets and that iOS phones likewise convert better than Android phones. Knowing this, the product manager might make a completely different set of decisions about the future of the product.

In this scenario, device type is a confounding variable: when the data is “filtered” by device type, sub-populations with completely different statistical properties take shape, and aggregates computed across them can't be meaningfully compared.

iOS can beat Android in conversion at the device level but lose overall because the mix of device types differs by platform: tablets convert better than phones overall, and in this user base the share of iOS devices that are tablets (30%) is much lower than the share of Android devices that are tablets (80%), even though Android tablets convert at a lower rate than iOS tablets. Mixing the data together into one giant subject of consideration compares two groups with wholly different properties: apples to oranges.
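To make the reversal concrete, here is a minimal Python sketch. The per-segment counts are hypothetical: they are not the figures from the original chart, and were chosen only so that they reproduce the totals stated above (10,000 Android users at 5.5%, 5,000 iOS users at 4%, and tablet shares of 80% and 30%). The names `segments`, `rate` and `platform_rate` are likewise just illustrative.

```python
# Hypothetical (users, converters) counts per platform and form factor.
# These are NOT the original post's numbers; they are picked to match the
# platform totals and tablet shares quoted in the text.
segments = {
    ("Android", "tablet"): (8_000, 500),
    ("Android", "phone"): (2_000, 50),
    ("iOS", "tablet"): (1_500, 105),
    ("iOS", "phone"): (3_500, 95),
}


def rate(users, converters):
    """Conversion rate for a single segment."""
    return converters / users


def platform_rate(platform):
    """Aggregate conversion rate for a platform across all form factors."""
    users = sum(u for (p, _), (u, _c) in segments.items() if p == platform)
    converters = sum(c for (p, _), (_u, c) in segments.items() if p == platform)
    return rate(users, converters)


# Per-segment view: iOS wins on tablets and on phones.
for (platform, form_factor), (users, converters) in segments.items():
    print(f"{platform:<8}{form_factor:<8}{rate(users, converters):.2%}")

# Aggregate view: Android wins, because 80% of its users are on the
# better-converting form factor (tablets) versus 30% for iOS.
for platform in ("Android", "iOS"):
    print(f"{platform:<8}{'all':<8}{platform_rate(platform):.2%}")
```

With these assumed counts, each platform's aggregate rate is just a weighted average of its segment rates, and the weights (the device mix) are what flip the comparison.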

Confounding variables emerge frequently when analyzing freemium products, for a few reasons:

  1. Size. Freemium products need large user bases to generate appreciable total revenues given the model's inherently low conversion rate. These large user bases are generally comprised of people from all over the world, across all regions of the income spectrum, using a wide array of devices. This diversity renders universal averages almost meaningless;
  2. LTV spectrum. Freemium products benefit from a very long-tail monetization curve. If user spend is taken as a proxy for delight, engagement metrics will likely track spend closely and can therefore serve as boundaries for classification;
  3. Most users won't pay. The previously-cited low inherent conversion rate of freemium products exists as a fundamental distinction between two classes of users: payers and non-payers. For this reason, any measurement of a freemium user base as a whole is immediately flawed, as it skews all metrics toward the vast majority of users who will never pay (which is why the minimum viable metrics model includes both ARPU and ARPPU).
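The payer / non-payer split in the last point can be illustrated with a short Python sketch of the two metrics, using their standard definitions: ARPU divides revenue by all users, ARPPU divides it by paying users only. The spend values below are made up for illustration, and the function names are assumptions rather than anything from a particular analytics library.

```python
def arpu(spend_per_user):
    """Average revenue per user: total revenue over ALL users."""
    return sum(spend_per_user) / len(spend_per_user)


def arppu(spend_per_user):
    """Average revenue per paying user: total revenue over payers only."""
    payers = [spend for spend in spend_per_user if spend > 0]
    return sum(payers) / len(payers) if payers else 0.0


# Hypothetical user base: 100 users, of whom only 5 ever pay.
spend = [0.0] * 95 + [2.0, 3.0, 5.0, 10.0, 80.0]

print(f"ARPU:  {arpu(spend):.2f}")
print(f"ARPPU: {arppu(spend):.2f}")
```

In this made-up user base the two numbers differ by a factor of 20, which is the point: a single blended average says very little about either class of user.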

The key to avoiding Simpson's paradox, and with it conclusions about a user base that don't reflect the way different classes of users actually interact with the product, is to judiciously apply dimensionality to analysis. User segmentation is eminently important in data analysis; especially in freemium products, the “average user” not only does not exist, but its visage serves as a siren leading developers onto perilously misguided feature development paths. Universal data is of no use when a user base exists across a broad, diverse spectrum of “real people”.

But user segmentation isn't only crucial when considering the product development roadmap: if data analysis dictates which features should be prioritized by identifying the most valuable and engaged users, then it also dictates which people should be marketed to in growing the user base. Because of this, the specious conclusions drawn from aggregate-level analysis not only result in the wrong features being built, they also cause more of the wrong users to be brought into the user base.

To avoid this, the basic set of dimensions (“filters”, or user characteristics) used to prioritize feature development should establish the rough set used to market to users. For mobile products, the most basic set generally includes:

  • Location (country);
  • Device (platform, form factor, device model);
  • Acquisition source;
  • Early behavioral cues (such as monetization / engagement milestones);
  • Date of join (for controlling for seasonality).

For some acquisition channels (e.g., Facebook), other demographic data points such as age and gender may be targetable, too.
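As a rough sketch of what applying these dimensions looks like in practice, the Python snippet below groups user records by any combination of dimensions and reports the conversion rate per segment. The record layout (`country`, `platform`, `form_factor`, `source`, `converted`) and the helper name `conversion_by` are assumptions made for this example, and the six records are invented.

```python
from collections import defaultdict


def conversion_by(users, dimensions):
    """Return {segment_key: conversion_rate} for the given dimensions."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [users, converters]
    for user in users:
        key = tuple(user[d] for d in dimensions)
        totals[key][0] += 1
        totals[key][1] += int(user["converted"])
    return {key: conv / n for key, (n, conv) in totals.items()}


# Invented sample records; a real analysis would load these from event data.
users = [
    {"country": "US", "platform": "iOS", "form_factor": "tablet", "source": "organic", "converted": True},
    {"country": "US", "platform": "iOS", "form_factor": "phone", "source": "paid", "converted": False},
    {"country": "DE", "platform": "Android", "form_factor": "tablet", "source": "paid", "converted": True},
    {"country": "DE", "platform": "Android", "form_factor": "tablet", "source": "organic", "converted": False},
    {"country": "US", "platform": "Android", "form_factor": "phone", "source": "paid", "converted": False},
    {"country": "US", "platform": "iOS", "form_factor": "tablet", "source": "paid", "converted": False},
]

# The same data, viewed at two levels of granularity.
print(conversion_by(users, ["platform"]))
print(conversion_by(users, ["platform", "form_factor"]))
```

Passing the dimensions as a list keeps the same function usable for both the feature-prioritization view and the marketing view, so the two are at least drawn from the same segmentation.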

Analyses conducted with these dimensions taken into consideration offer far more reliable insights than the broad “iOS vs. Android” example previously cited. Ultimately the goal of an analysis is to improve a product for the real people using it; if that analysis is being undertaken under a false premise, then real people (and their real pain points) aren't being addressed.

  • Blibbax

    "iOS has a higher share of tablets (30%) than Android (20%)" - in your example, it actually has a much lower share, with Android on 80%.

    • ESeufert

      Whoops, you're right, typo. Fixed that.

  • Mark Ettinger

    Mr. Seufert, it is not the case that the finer ("disaggregated" or "conditional") data is always the basis for a proper decision, as you seem to argue in this article. For an extremely lucid and detailed discussion see "Understanding Simpson's Paradox" (available online) by Judea Pearl. The correct decision depends on causal assumptions expressed via a directed, acyclic graph. The correct graph for your scenario is unclear, in contrast to the examples in Pearl's paper, and therefore we cannot conclude that conditioning on device type is correct.

    • ESeufert

      Point taken, and thanks for the paper, but I didn't actually suggest that *more* dimensionality *always* provides for more valid analysis: the context of this article is the analysis of mobile app user bases, which are often treated as monoliths and not properly segmented. In the example, I simply indicated that, upon adding an additional partition to the data, a product manager "might make a completely set of decisions about the future of the product". The PM might also not -- the point is that considering a mobile app's user base as one homogenous unit (or, in this case, two units: iOS and Android users) isn't sufficiently rigorous for the purposes of making product decisions.

      • Mark Ettinger

        My apologies for reading more into the post than you intended. However that leaves me with the interesting question of what IS the proper decision? Pearl has devoted his career to addressing questions of causality and developing the proper calculus for making correct decisions. I'm fairly new to his theory and machinery so, for me, your example is a fascinating open question. (P.S. Love your book!)

        • ESeufert

          I wish I knew -- I think this is where the marriage of data science and deep product expertise produces 3-sigma results.