Dealing with “whales” in freemium data analysis


The presence of “whales”, or highly-monetizing users (HMUs), in a freemium product’s user base often leads to analytical confusion. Contrasted against the large majority of a freemium product’s user base that never monetizes, HMUs skew parametric statistics, such as a cohort’s mean LTV, to misleading or unworkable values. The low density of HMUs in freemium user bases can sometimes be used as justification for their omission from analysis altogether: considered outliers, they are ignored for the purposes of calculating average LTVs, which may be deemed more useful for setting user acquisition bid targets.

But highly-monetizing users are not outliers: since the purpose of the freemium business model is precisely to produce a small contingent of HMUs, their existence in a freemium product’s user base is not only predictable but necessary for commercial success. A statistical outlier within the context of a data sample is not simply an extreme value; it is a value that contradicts the known distribution of the data sample. In his book, Identification of Outliers, Douglas Hawkins defines an outlier as follows:

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

Generally speaking, outliers are values that change the shape of the known or assumed distribution of data, calling into question the veracity of the data point or the provenance of the result observed. Outlier detection in consumer software is most commonly associated with fraud detection: finding patterns of purchases that are extreme and wouldn’t be predicted by the underlying distribution of purchase data. Highly-monetizing users in freemium are expected, just rarely, and values that occur rarely but as predicted by the distribution are not outliers. (For more background on the definition of outliers and formal techniques for identifying them, see Charu Aggarwal’s book, Outlier Analysis.)

This raises the question: what should be done to accommodate HMUs in freemium data analysis? Consider this discrete probability distribution of LTV values for a freemium app with a 10% overall conversion rate (i.e., 10% of users pay):


(Discrete here simply means that LTVs are limited to a specific number of possibilities – for this product, they are limited, for whatever reason, to increments of $10. The $0 value is excluded here because it would render the graph unreadable.)

According to the graph, the probability of a user having an LTV of $10 is 1.52%, while the probability of having an LTV of $500 is a mere 0.0002% – meaning that, in a cohort of 1,000,000 users, only 2 would be expected to achieve lifetime customer values of $500.

But, according to the graph, the proportion of LTV values above $100 is 2.36% (the graph is truncated at $100 because individual values above that are too small to see). So while only 2 users out of 1,000,000 can be expected to have an LTV of exactly $500, roughly 23,600 users in that cohort can be expected to have an LTV of more than $100.
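The arithmetic behind these expected counts can be checked with a short sketch. The probabilities are the ones quoted above; the function and variable names are illustrative assumptions, not part of any real analytics library.

```python
cohort_size = 1_000_000

# Probabilities quoted in the text, written as fractions rather than percentages.
p_ltv_10 = 0.0152        # P(LTV == $10)  = 1.52%
p_ltv_500 = 0.000002     # P(LTV == $500) = 0.0002%
p_ltv_over_100 = 0.0236  # P(LTV > $100)  = 2.36%


def expected_users(probability, cohort_size):
    """Expected number of users in a cohort that fall into a given LTV bucket."""
    return round(probability * cohort_size)


print("LTV == $10: ", expected_users(p_ltv_10, cohort_size))
print("LTV == $500:", expected_users(p_ltv_500, cohort_size))
print("LTV > $100: ", expected_users(p_ltv_over_100, cohort_size))
```

Any single extreme value is rare, but the tail as a whole accounts for a meaningful number of users.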

These “extreme” LTV values (relative to the lower end of the spectrum) aren’t outliers because they’re expected (albeit rarely) in the above probability distribution. Any analysis of a new cohort entering the product must be grounded in the understanding that extreme LTV values can and will occur; excluding them when calculating averages – which can be misleading when analyzing freemium products, anyway – would ignore the realities of the product. HMUs in freemium aren’t outliers; they’re exactly what the product should be optimized to produce more of.

But LTV is a very specific metric that serves a very explicit purpose: setting a ceiling on per-user marketing costs. Setting advertising bids based on the underlying distribution of LTVs will, over a long enough timeline, capture the value of users at the tails, but this approach may also engender anxiety as individual cohorts of users acquired through advertising don’t conform to the assumed distribution (or if the product is new and the exact distribution of LTVs can’t be deduced). It will also not optimize spending at the level of the individual advertising network or acquisition source, since those distributions will vary.

To accommodate this, there are two techniques that can be used to dull the effect of (presumed) HMUs on bid prices:

  • For individual campaigns or channels, set bid prices based on the median LTV, not the mean.
  • Winsorize the LTV values for a specific channel or cohort by setting all values beyond some threshold to the value at that threshold. For instance, a 10% Winsorization would set all values below the 5th percentile to the value at the 5th percentile, and all values above the 95th percentile to the value at the 95th percentile. Given the shape of a freemium LTV curve, in which most users sit at $0, only the values at the extreme high end would be affected by Winsorization.

Note that these approaches would only be employed when analyzing segmented cohorts of users, not the entire user base.
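As a minimal sketch of the two techniques above, the following compares the mean, the median, and a 10% Winsorized mean for a single channel. The sample LTV values are made up for illustration, and the helper names (percentile, winsorize) and the nearest-rank percentile rule are assumptions of this sketch; other percentile conventions would give slightly different thresholds.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct is an integer 0-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values outside the given percentiles to the values at those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical channel cohort: 100 users, 10% of whom pay (illustrative values only).
channel_ltvs = [0] * 90 + [10, 10, 10, 20, 20, 30, 50, 100, 200, 500]

winsorized = winsorize(channel_ltvs)

print("mean LTV:           ", statistics.mean(channel_ltvs))
print("median LTV:         ", statistics.median(channel_ltvs))
print("winsorized mean LTV:", statistics.mean(winsorized))
```

With this made-up sample, the raw mean is $9.50 and the Winsorized mean is $1.70; only the high end is clamped, because the 5th percentile is already $0. The median is $0 here because fewer than half of the users pay, so how the segment is defined matters a great deal when bidding on the median.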

The problem with dismissing HMUs as mere outliers and thus not representative of the user base in freemium is that those outliers are precisely what a freemium business is dependent on. By taking analytical shortcuts, such as truncating the far end of an LTV distribution’s tail, a freemium product limits the scope of its optimizations to the users that aren’t fanatical about the product.

This tactic – focusing on the densest portion of the LTV distribution – is a vestige of premium product analysis: since everyone pays the same admission fee, optimizing to the lowest common denominator (the greatest number of people) is a prudent strategy. In freemium, that is nonsensical.

Highly-monetizing users are often the backbone of freemium commercial success, and thus the entire LTV distribution needs to be taken into consideration when setting marketing bids, prioritizing a product’s feature backlog, or handling customer support. To dismiss HMUs is to misunderstand the fundamental core of the freemium model: that the power law distribution of LTVs ultimately (and hopefully) creates more revenue than a single price point would.