The privacy landscape for digital advertising is in the midst of dramatic change. This tectonic shift has been catalyzed in part through regulation, such as GDPR, and in part by platform owners like Google and Apple, which each have specific motivations for wanting to define consumer privacy in very particular ways. Regardless, a wholly new paradigm is being instantiated around how digital advertising is targeted, advertising campaign performance is measured, and what data is or isn’t permissible for use in advertising optimization. As this new paradigm takes root, techniques for safeguarding user privacy are being applied to digital advertising in new and novel ways. One of these techniques is differential privacy.
A primer on differential privacy
Differential privacy is a statistical method that obfuscates the output of some function such that it is nearly impossible to determine whether any given user's data was present in the dataset fed into the function. Put another way: if two datasets pertaining to a group of people exist — one that includes your data, and one that doesn’t — then a differentially private method makes it very difficult for the results of any statistical query to reveal which of the two datasets the query was run against. This very helpful paper describes differential privacy this way: “differential privacy guarantees that hardly any information is gained on individual records upon seeing the output of a computation.”
Cynthia Dwork, one of the pioneers in the field of differential privacy research, formally defines differential privacy in this video as follows:
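The standard statement of that guarantee, as it appears throughout Dwork’s work, is:

```latex
% M provides epsilon-differential privacy if, for all datasets x and y
% differing in one person's data, and for all sets of outputs S:
\Pr[M(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(y) \in S]
```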
The idea of the above is that the differential privacy mechanism, or algorithm, M, provides ε-differential privacy if the probability of seeing some event S differs by at most a factor of e^ε whether dataset x or dataset y is used, where x and y are datasets that are identical except for the presence of one person’s data. Note: ε defines a maximum tolerable divergence between the outputs yielded from the two datasets — its purpose is explained in more detail below.
As Cynthia Dwork describes in this video about how differential privacy is applied to the US Census, differential privacy is really a statistical property: anything that can be learned about a person from a statistical analysis of a dataset can also be learned if that person’s data is absent from the dataset, so long as they belong to the group the dataset describes. In that sense, differential privacy is a property of the analysis of datasets such that no given member’s presence in a group can be determined from the results of that analysis.
The below video also provides a helpful (if dense) overview of differential privacy.
Differential privacy is often implemented using noise: introducing random values into a dataset such that properties like counts, sums, and averages from the dataset are obfuscated, preventing sensitive data from being known precisely. Cynthia Dwork et al formalize the application of noise to datasets for privacy preservation in this seminal paper on differential privacy. Noise is often derived using the Laplace distribution, which is similar to the normal distribution (symmetrical, uni-modal) except that it has a sharper peak. The amount of noise added to a given query is partly determined by the sensitivity of the query: how substantially any single user’s data can change the query output. This article does a good job of explaining how sensitivity is used to determine noise in a differential privacy setting.
The purpose of noise in differential privacy is to hide sensitive and potentially identifiable output from some analysis or query function while still preserving the utility of that function. In this way, differential privacy presents a tradeoff between utility and privacy: the results of a query on a dataset could be obfuscated to such an extent that the properties of the underlying dataset are completely unknowable — eg. by multiplying each value by a random number — but that obviously isn’t helpful in conducting analysis. The aforementioned paper by Dwork et al presents a framework for applying noise that optimizes for accuracy and insight while delivering some desired level of privacy protection. That level of protection is captured in the mathematical formulation of differential privacy by the ε referenced earlier. ε is often described as a measure of privacy loss: its value is chosen by the implementor of an algorithm on the basis of how much utility they are willing to sacrifice in the pursuit of privacy. The higher the value of ε, the less privacy protection the algorithm delivers.
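As a sketch of how the Laplace mechanism works in practice, the following Python snippet (illustrative; the function and data are hypothetical) applies Laplace noise to a simple counting query. A count has sensitivity 1, because adding or removing one person changes the result by at most 1, so noise with scale sensitivity / ε yields ε-differential privacy:

```python
import numpy as np

def dp_count(data, predicate, epsilon, rng):
    # A counting query has sensitivity 1: adding or removing one
    # person's record changes the count by at most 1. Laplace noise
    # with scale sensitivity / epsilon yields epsilon-DP.
    sensitivity = 1.0
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(seed=7)
ages = [34, 35, 29, 41, 35, 35, 52]

# Smaller epsilon -> more noise -> more privacy, less utility.
answer_high_eps = dp_count(ages, lambda a: a == 35, epsilon=5.0, rng=rng)
answer_low_eps = dp_count(ages, lambda a: a == 35, epsilon=0.1, rng=rng)
```

Repeating the low-ε query many times would show answers scattered widely around the true count of 3, while the high-ε answers stay close to it — the privacy-utility tradeoff in miniature.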
What are broad use cases for differential privacy?
The academic literature around differential privacy is somewhat inscrutable, filled with terms like adversary and model inversion attack that don’t map well to the vernacular used in industry. For instance, a common example put forth by many academic studies of differential privacy relates to surveys about smoking. A substantial gap exists between the academic privacy landscape and the practical application of differential privacy to important problems in consumer technology.
But differential privacy is widely used by large consumer technology companies to protect customer privacy. Uber, for instance, implemented differential privacy in its internal analytics stack through a tool called CHORUS, developed in collaboration with researchers at the University of California, Berkeley, which Uber open-sourced. CHORUS rewrites standard SQL queries to enforce differential privacy before they are executed, allowing it to integrate seamlessly with Uber’s existing analytics infrastructure.
Apple similarly has implemented differential privacy into its collection of device usage statistics, such as the popularity of emojis, although its approach has been criticized by researchers for not adequately protecting user privacy. And Google has built a differential privacy tool called RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) that applies differential privacy to the usage data it collects from its Chrome browser. Google has also open-sourced a library, available in multiple languages, that provides differential privacy protection for a set of summary and descriptive statistics. The examples section of the repository walks through a hypothetical use case for differential privacy using animals on a farm that is helpful in understanding the purpose of the concept.
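RAPPOR builds on the classic technique of randomized response, which can be sketched in a few lines of Python (a simplified illustration of the idea, not Google’s actual implementation): each user reports their true answer only with some probability, making any individual report deniable, while aggregate rates remain estimable by inverting the randomization:

```python
import random

def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    # Report the true answer with probability p_truth; otherwise flip a fair coin.
    # Any single report is plausibly deniable.
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_rate(reports, p_truth):
    # Invert the randomization: E[report] = p_truth * rate + (1 - p_truth) * 0.5
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(7)
true_bits = [rng.random() < 0.3 for _ in range(100_000)]  # ~30% truly "yes"
reports = [randomized_response(b, 0.75, rng) for b in true_bits]
estimate_rate(reports, 0.75)  # recovers a rate close to 0.3
```

No individual report can be trusted, yet the population-level estimate remains accurate — the same property Chrome relies on when aggregating usage data.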
In understanding the applicability of differential privacy to various use cases, it’s important to consider the components of a privacy setting that, in combination, lend themselves to privacy abuses for which a solution is relevant:
- Some dataset exists that is dimensionalized to an extent that, despite user data not being indexed by uniquely-identifying information such as PII or device identifiers, users can be identified by some combination of demographic and usage dimensions. This describes the user data sets at most consumer tech companies: while user data might not be identifiable discretely / deterministically through unique service identifiers, certain demographic features can be combined so as to connect that data to a personal identity;
- Some group of people has the ability to query results from that dataset on the basis of various demographic features. This could be employees of the company that owns the dataset, such as in Uber’s case, or a broader, much more heterogeneous group of people that query the dataset through eg. a web interface or API. One non-intuitive aspect of the application of differential privacy is that, given a specific-enough set of query parameters, users might be identified either in their presence in query results or in their absence.
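The second point above can be made concrete with a toy differencing attack (hypothetical records and queries): two aggregate counts, neither of which names a user, can be subtracted to reveal one person’s sensitive attribute:

```python
# Hypothetical dataset: no unique IDs, but demographic dimensions.
records = [
    {"zip": "10001", "age": 35, "smoker": True},
    {"zip": "10001", "age": 42, "smoker": False},
    {"zip": "10002", "age": 35, "smoker": True},
]

def count(rows, pred):
    return sum(1 for r in rows if pred(r))

# Query 1: smokers in zip 10001.
q1 = count(records, lambda r: r["zip"] == "10001" and r["smoker"])

# Query 2: smokers in zip 10001, excluding 35-year-olds. If exactly one
# 35-year-old lives in that zip, the difference isolates that person.
q2 = count(records, lambda r: r["zip"] == "10001" and r["smoker"] and r["age"] != 35)

target_is_smoker = (q1 - q2) == 1
```

Neither query references an individual, yet their difference exposes the sole 35-year-old in zip 10001 — precisely the class of attack that Laplace noise on each count is designed to blunt.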
Mapping the above conditions to use cases, it’s obvious how differential privacy is applicable to digital advertising. The query function in digital advertising can exist in the form of ad targeting: setting targeting parameters such that a subset of an overall audience is exposed to ads based on various user features. And advertising campaign results are often reported so as to include descriptive data about the audience that was reached in a campaign, eg. age, gender, location, etc. These qualities of digital advertising render it vulnerable to reverse-engineering: anyone with access to reporting data can determine which specific users were — or weren’t — exposed to ads.
Two examples illuminate this vulnerability. The first is from this article, in which a digital marketer describes targeting ads to the continent of Antarctica, where only 200 people live during the winter. Through purchasing low-cost banner ads via Google’s display network, the marketer was able to observe which apps the community in Antarctica used on a daily basis, comprising “VPN apps, and also mobile games, gay dating apps, weather, and file transfer apps.” With enough demographic filtering — eg. alternating between genders, age ranges, etc. — it wouldn’t be difficult, starting from such a small population, to match app usage to a specific individual, given existing knowledge of the people who live in Antarctica. The article goes on to detail other techniques for matching uploaded list data to Google using the Customer Match feature in a way that allows an advertising campaign to target a single person.
The second example, which outlines a similar susceptibility from a different angle, is presented in this paper that proposes various privacy attacks that could be perpetrated using Facebook’s advertising tools. In the paper, the author details how she was able to infer the age of a friend — despite the friend having hidden their birthdate from public view — by targeting ads to 1) employees of the firm at which the friend worked who 2) graduated from the same university as did the friend. The campaign with these targeting parameters was cloned once each for ages in the range of 33 to 37 years old. The author of the paper was able to deduce her friend’s age when only the campaign targeted to 35-year-olds received impressions (the author confirmed the veracity of the experiment’s outcome because she had prior knowledge of the friend’s age).
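The mechanics of that attack can be sketched in a few lines (illustrative code; the impression counts are invented, not taken from the paper):

```python
def infer_age(candidate_ages, impressions_by_age):
    # Only the cloned campaign whose age filter matches the target's
    # true age can deliver impressions against this one-person audience.
    delivered = [a for a in candidate_ages if impressions_by_age.get(a, 0) > 0]
    return delivered[0] if len(delivered) == 1 else None

# Reported impressions for the five cloned campaigns (invented numbers).
impressions = {33: 0, 34: 0, 35: 212, 36: 0, 37: 0}
inferred = infer_age(range(33, 38), impressions)  # -> 35
```

The reporting interface never discloses an identity; the identity leaks purely from which campaign variant delivered.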
A recent paper, authored by members of Google’s Chrome and Research teams and titled Measuring Sensitivity of Cohorts Generated by the FLoC API, explores the notion of homogeneity attacks: privacy violations enabled by revelations from group membership, eg. when a specific interest group about a rare disease exposes members of that group as being victims of that disease. The paper — which is a fairly quick read at just six pages — proposes a novel approach to providing privacy based on t-closeness: any given interest-based cohort (FLoC) is revealed only if membership in that cohort doesn’t imply membership in a “sensitive” advertising category at a rate higher than would be expected across the general population, plus some privacy factor (t). Google has already defined sensitive interest-based advertising categories which cannot be specifically targeted, such as “adult and medical websites as well as sites with political or religious content.” The privacy factor proposed in the paper allows the Chrome team to quantify the privacy-utility tradeoff of blocking the transparency of specific cohorts based on sensitivity correlation.
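A simplified reading of that rule can be expressed in code (a sketch of the idea, not the paper’s exact formulation; the function name and the thresholds are hypothetical):

```python
def cohort_is_sensitive(cohort_hits, cohort_size, pop_hits, pop_size, t):
    # Flag a cohort whose rate of visits to a sensitive category exceeds
    # the general-population rate by more than the privacy factor t.
    return cohort_hits / cohort_size > pop_hits / pop_size + t

# Population baseline: 0.5% of visits touch a sensitive category (invented).
cohort_is_sensitive(40, 100, 5, 1000, t=0.1)  # a 40% cohort clearly leaks
cohort_is_sensitive(6, 100, 5, 1000, t=0.1)   # a 6% cohort stays under t
```

Raising t blocks fewer cohorts (more utility, less privacy); lowering it blocks more — the same ε-style dial that appears throughout differential privacy.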
The applicability of differential privacy to digital advertising is manifest, and its use will surely accelerate as the privacy environment changes and profile-based targeting is jettisoned. Ultimately, the privacy-utility tradeoff has real economic impact, and motivated, well-resourced research groups at large companies like Google, Microsoft, and Facebook are dedicating immense effort to building privacy-compliant mechanisms for delivering relevant ads. One of those mechanisms is differential privacy, and its pertinence is only poised to grow in the near term.
Photo by Dayne Topkin on Unsplash