How much data is needed to predict LTV?

Posted on January 9, 2017 by Eric


One common predicament that arises for freemium mobile app publishers is determining when they've accumulated enough data in a soft launch to feel comfortable that their LTV estimates are reliable. Often, developers will validate early retention and monetization metrics against comparable products and launch them before longer-term data even becomes available.

It's an understandable approach. Projected LTV values are often forecasted over some months (eg. 6 months, 12 months, etc.), and LTV models require data over correspondingly long time horizons to be dependable. In order to get data for, say, a 180-day LTV, a campaign needs to be monitored over 181 days, but with retention decay, the number of users that needs to "flow through" an app to get a sizable sample of Day 180 data can be quite large.

Consider a soft launch that runs a three-day campaign of 500 acquired users per day:

campaign_1

By day 181 of the soft launch -- that is, 180 days after launching the campaign to acquire users for soft launch -- only the 500 users in the first day's cohort could have even potentially reached Day 180, producing at most 500 data points. In reality, far fewer would have. By adding in some daily retention metrics, we can get a sense for how long it can take to accumulate a large data set for longer-term app usage: with a 3% Day 180 retention rate, just 15 of those 500 users would have reached Day 180 by day 181 (and only 45 of the 1,500 users acquired across the whole three-day campaign ever would):

campaign_2

In order to accumulate 500 data points at Day 180 in as little time as possible (ie. wait just 181 days for the results), the campaign's daily volume of acquired users would have to increase to about 16,700 (500 / 3%):

campaign_3

This is probably totally unrealistic for most developers, at least in a single geography for a single platform.
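The arithmetic behind the three scenarios above can be sketched in a few lines of Python. This is only an illustration: the function names `users_at_day` and `dnu_needed` are hypothetical, and the sketch assumes that a cohort acquired on calendar day c reaches a given LTV day on calendar day c plus that LTV day.

```python
import math


def users_at_day(dnu, cohorts, retention, ltv_day, days_waited):
    """Expected number of retained users observed at `ltv_day`.

    Assumption: a cohort acquired on calendar day c (1-based) reaches
    `ltv_day` on calendar day c + ltv_day, so only cohorts with
    c + ltv_day <= days_waited have produced a data point yet.
    """
    mature_cohorts = max(0, min(cohorts, days_waited - ltv_day))
    return mature_cohorts * dnu * retention


def dnu_needed(target_points, retention):
    """Size a single cohort must have so `target_points` users are retained."""
    return math.ceil(target_points / retention)


# Three-day campaign of 500 DNU, observed on day 181 of the soft launch.
print(users_at_day(500, 3, 1.00, 180, 181))  # upper bound with no churn at all
print(users_at_day(500, 3, 0.03, 180, 181))  # with 3% Day 180 retention
print(users_at_day(500, 3, 0.03, 180, 183))  # once all three cohorts mature
print(dnu_needed(500, 0.03))                 # cohort size for 500 data points
```

The last call gives the exact figure that the article rounds to 16,700.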

This exercise raises a question: how much data is needed in order to estimate LTV? Many times, a sample size is decided arbitrarily -- as in the case above in which 500 data points was deemed sufficient. But in reality, the sample size needed to build a robust LTV estimate is dependent on the consistency of the sampled data and the level of certainty that the organization requires around the estimate to feel comfortable in spending money against it. 500 could be enough or 10,000 might be required.

This brings to the fore the concept of the confidence interval*. There are a number of great resources that explain the concept very well, including this set of videos from Khan Academy.

Confidence intervals allow the LTV forecasting process to be approached in a different way. Instead of stipulating which LTV value needs to be forecast in a soft launch and then assigning some arbitrary data set size to determine how long the soft launch should last, the analyst can establish a required level of confidence around some LTV day value (eg. Day 180 LTV), and that requirement ultimately defines every other variable in the process.

That is to say: rather than requiring 500 data points for determining a 180-day LTV (which may or may not be enough and may or may not be collectable in 181 days), the analyst can say: we will wait until we're able to calculate a 95% confidence interval for an actionable (biddable) 90-day LTV value, compare that against industry standards, and determine if our app is viable.

In order to illustrate this process, I built a simple model that forecasts LTV based on some arbitrary assumptions around cumulative monetization and retention. These are:

model_settings

Explaining these variables:

  • The 'days' list contains the days that the model estimates cumulative LTV values for; the 'retention profile' list contains the retention values for those days (eg. 50% retention for Day 1 from above);
  • The 'DNU' and 'cohorts' variables are used to control the number of users that 'flow through' the app (DNU is Daily New Users and cohorts is the number of days that the campaign runs, ie. the number of days that DNU is brought into the app);
  • The 'Waiting Period' variable is the number of days that the app is observed over (that is, the number of days the analyst would wait to see the cohorts' data evolve within the app from the very first day of running a campaign);
  • The 'Starting LTV' value is just an arbitrarily chosen cumulative LTV value that gets assigned to the first day in the Days variable (so in this case, the Cumulative LTV at Day 1 is $0.10). This value gets multiplied by two at each subsequent Day value: Day 7 -- the next value in the Days list variable -- has a cumulative LTV of $0.10 * 2 = $0.20, Day 30 has a cumulative LTV of $0.20 * 2 = $0.40, and so on. These values were chosen arbitrarily.
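As a rough sketch, the settings described above could be written down like this. The dictionary keys and the helper `cumulative_ltv_schedule` are hypothetical names, and the Days list of 1, 7, 30 and 90 is inferred from the values quoted in the text rather than copied from the model itself. The retention profile is left out because only its Day 1 (50%) and Day 90 (4%) values are stated.

```python
# Hypothetical representation of the model settings described in the article.
settings = {
    "days": [1, 7, 30, 90],   # LTV days (inferred from the text)
    "dnu": 500,               # daily new users per cohort
    "cohorts": 180,           # number of days the campaign runs
    "waiting_period": 181,    # days of observation from campaign start
    "starting_ltv": 0.10,     # cumulative LTV at the first LTV day, in dollars
}


def cumulative_ltv_schedule(days, starting_ltv):
    """Double the cumulative LTV at each subsequent LTV day."""
    return {day: starting_ltv * 2 ** i for i, day in enumerate(days)}


schedule = cumulative_ltv_schedule(settings["days"], settings["starting_ltv"])
total_users = settings["dnu"] * settings["cohorts"]

for day, ltv in schedule.items():
    print(f"Day {day}: ${ltv:.2f}")
print(f"Total acquired users: {total_users:,}")
```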

The model generates a population data set based on the total number of people brought into the app (DNU times the number of cohorts, or 500 * 180 = 90,000) for each Day based on a Lomax distribution (a long-tailed probability distribution that starts at 0, much like most LTV distributions) with an expected value of whatever the cumulative LTV value is for that day. The population distribution at Day 90 from the settings above looks like this ($0.80 is the expected value -- again, this is a totally arbitrary value):

day_90_ltv_distribution

A random sample is then drawn from each population based on the retention value for that day. So, for instance, since the retention value at Day 90 from above is 4%, the random sample drawn has a size of 1,800 data points (90 cohorts that have reached 90 days old * 500 DNU per cohort * 4%). In other words, 1,800 points are randomly drawn from the total population of LTV values for users that could have potentially reached Day 90 in the app.
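The population and sampling steps might look something like the sketch below, which uses only the Python standard library. It is not the model behind the graphs here: the Lomax shape parameter of 3.0 is an assumption (the article does not state one), `lomax_population` is a hypothetical helper, and the draws rely on the fact that a Lomax variate is a shifted and scaled Pareto variate.

```python
import random


def lomax_population(size, mean, shape=3.0, seed=0):
    """Draw `size` Lomax-distributed LTV values with the given expected value.

    Assumptions: shape > 1 so the mean exists; the shape value itself is
    illustrative. For a Lomax distribution, mean = scale / (shape - 1).
    """
    rng = random.Random(seed)
    scale = mean * (shape - 1.0)
    # paretovariate() returns values >= 1; subtracting 1 shifts the start to 0.
    return [scale * (rng.paretovariate(shape) - 1.0) for _ in range(size)]


# Population: every acquired user (500 DNU * 180 cohorts), Day 90 mean of $0.80.
population = lomax_population(size=500 * 180, mean=0.80)

# Sample: 90 mature cohorts * 500 DNU * 4% Day 90 retention = 1,800 points.
sample_size = round(90 * 500 * 0.04)
sample = random.Random(1).sample(population, sample_size)

print(len(population), sample_size)
print(f"population mean: ${sum(population) / len(population):.2f}")
```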

Confidence intervals are then drawn (at 95%) for those samples and the LTV values by day are plotted on a graph, along with error bars for the confidence intervals:

graph_1

In this graph, the Y-axis is predicted LTV value and the X-axis is the LTV Day.

The values plotted here are the Predicted LTV values along with their confidence intervals and the percentage error of the interval endpoints (ie. how large the range of the confidence interval is, up or down, as a percentage of the predicted value). As you can see, the error margin at Day 90 is fairly small (6.5% up or down from the predicted value), but given the above model settings, this reasonably narrow confidence interval comes at a price: 181 days of waiting (six months) and 180 days of 500-person cohorts (90,000 acquired users). Assuming an average cost of $1.50 per acquired user at a volume of 500 DNU, this soft launch would have cost the developer $135,000 (90,000 * $1.50) over the course of six months for a robust 90-day LTV value.
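One common way to compute such an interval and its percentage error is the normal approximation for a sample mean, sketched below. This is an assumption about method, since the article does not say how its intervals were calculated, and the approximation can be rough for long-tailed data. The function `mean_ci`, the Lomax shape of 3.0 and the seed are all illustrative choices.

```python
import math
import random
import statistics


def mean_ci(sample, z=1.96):
    """Normal-approximation confidence interval for a mean (z=1.96 is ~95%).

    Returns (mean, lower bound, upper bound, half-width as a share of the mean).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width, half_width / mean


# Hypothetical Day 90 sample: 1,800 Lomax draws with an expected value of $0.80.
rng = random.Random(42)
shape = 3.0                      # assumed shape parameter
scale = 0.80 * (shape - 1.0)     # Lomax mean = scale / (shape - 1)
sample = [scale * (rng.paretovariate(shape) - 1.0) for _ in range(1800)]

mean, low, high, pct_error = mean_ci(sample)
print(f"Day 90 LTV ${mean:.2f}, 95% CI ${low:.2f} to ${high:.2f} "
      f"(+/- {pct_error:.1%})")
```

Because the sample is random, the printed margin will not match the 6.5% in the graph exactly, and it changes with the seed and the assumed shape.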

What happens if the developer only wants to acquire 200 DNU but is willing to wait for 181 days?

graph_2

The confidence interval increases fairly dramatically and the error margin up or down increases to more than 10%.

What happens if the developer only wants to acquire 200 DNU and only wants to wait 120 days (4 months) for Day 90 data?

graph_3

The error margin on the confidence interval for the Day 90 LTV increases to more than 23%, meaning the predicted value is not actionable as a marketing metric.
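The widening across these three scenarios follows from how much Day 90 data each one produces. The sketch below counts sample sizes using the same convention as the 1,800-point example (cohorts old enough to report Day 90, times DNU, times 4% retention) and applies the rule of thumb that the interval width scales with one over the square root of the sample size. The mature cohort counts of 90 and 30 are assumptions based on that convention, and `day90_sample_size` is a hypothetical helper.

```python
import math


def day90_sample_size(dnu, mature_cohorts, retention=0.04):
    """Retained users observed at Day 90 across the mature cohorts."""
    return round(dnu * mature_cohorts * retention)


# (label, DNU, cohorts assumed old enough to have reached Day 90)
scenarios = [
    ("500 DNU, 181 days", 500, 90),
    ("200 DNU, 181 days", 200, 90),
    ("200 DNU, 120 days", 200, 30),
]

baseline = day90_sample_size(500, 90)
for label, dnu, mature_cohorts in scenarios:
    n = day90_sample_size(dnu, mature_cohorts)
    widening = math.sqrt(baseline / n)  # interval width relative to baseline
    print(f"{label}: n = {n}, interval ~{widening:.2f}x the baseline width")
```

Under this rule of thumb the second scenario lands close to the margin shown in its graph, while the third comes out somewhat narrower than the simulated 23%. A gap of that kind is plausible, since each graph reflects a single random draw from a long-tailed distribution.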

The above is for a Day 90 LTV value, which is a shorter horizon than what many developers bid against: Day 180 or Day 365 LTV (ie. expected revenue contributions from a user over half a year or a year). In order to get a sub-10% error margin for a Day 180 predicted LTV, we'd need to run 180 cohorts and wait for 250 days (about eight months):

graph_4

To get the same sub-10% error margin for the confidence interval on a Day 365 LTV estimate, you'd need to wait about 455 days:

graph_5

These validation periods are very long and go beyond what most developers can allocate to a soft launch. This reality leaves developers with a few different options for making the decision to enter a global launch with a limited understanding of long-term LTV:

  1. Only launch apps with a Day 90 LTV that supports marketing. I see many developers choosing this option: they attach their marketing viability decision to their app's Day 90 LTV and use that threshold to decide whether to kill an app in soft launch. In other words: if their Day 90 LTV -- which is verifiable in a reasonable amount of time in soft launch, as per above -- doesn't support profitable user acquisition, then the app is killed. As the app collects more data from older cohorts in global launch, they adjust their marketing bids against longer-term LTV values that they feel have been quantitatively substantiated;
  2. Use LTV curve ratios from comparable apps to estimate a longer-term LTV. Developers releasing apps similar to those they already have in market may have seen a reliable ratio emerge between points on those apps' LTV curves (eg. Day 365 LTV is 4x Day 30 LTV). This is a fairly common approach, but I think it underestimates the impact that small nuances can have on late-stage monetization for apps.
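The second option amounts to a simple multiplication, sketched below with hypothetical numbers. The 4x ratio is the article's own example; the Day 30 LTV of $0.40 is borrowed from the model settings earlier, and the interval of $0.38 to $0.42 is made up for illustration. Note that scaling the interval endpoints by a fixed ratio treats the ratio as exact, which is the weakness of this approach: any uncertainty in the ratio itself is ignored.

```python
def project_ltv(observed_ltv, ratio):
    """Project a longer-term LTV from an observed one using a curve ratio."""
    return observed_ltv * ratio


def project_interval(low, high, ratio):
    """Scale interval endpoints by the same ratio (assumes the ratio is exact)."""
    return low * ratio, high * ratio


day30_ltv = 0.40                 # observed Day 30 LTV (hypothetical)
day30_low, day30_high = 0.38, 0.42
ratio_365_to_30 = 4.0            # "Day 365 LTV is 4x Day 30 LTV"

day365_ltv = project_ltv(day30_ltv, ratio_365_to_30)
day365_low, day365_high = project_interval(day30_low, day30_high, ratio_365_to_30)
print(f"Projected Day 365 LTV: ${day365_ltv:.2f} "
      f"(${day365_low:.2f} to ${day365_high:.2f})")
```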

The approach used to alleviate the cumbersome data requirements for predicting a reliable and statistically sturdy long-term LTV generally depends on the level of risk the organization is willing to take on marketing spend and its overall appetite for uncertainty. Some firms simply have no tolerance for estimates that can't be validated with actual data: for those companies, growth is slow, incremental, but dependable and principled. This mindset is the one I detailed in my recent presentation about app positioning: developers without recourse to marketing loss funding are forced to adopt a disciplined approach to growth that minimizes risk and accommodates the thesis of this article, which is that late-stage LTV estimation is very difficult to do with a high degree of statistical rigor without a lot of money (data) and a lot of time.

* just like many topics in statistics, the concept of the confidence interval is somewhat controversial (I briefly discuss this controversy in Freemium Economics). This article uses the concept of the confidence interval to illustrate a point: that low numbers of data points in long-term LTV measurements create the potential for a high level of variance across those data points.
