Differential privacy is the study of how to design mechanisms which act on databases (i.e., answer queries) without leaking information about individual records.

Intuitively, a differentially private mechanism should be one which is not excessively reactive to small changes in the dataset. Let $D$ be the set of all databases, and consider a function $g$ which acts on $D$. (Think of a database as just some big data table where, e.g., each row corresponds to a different user.) For instance, given a database $x \in D$, $g(x)$ might be the size of $x$, or the total amount of user deposits in $x$, or it might ask for the number of users with some property (a "counting query"). We want to add noise to $g$ in such a way that, no matter how the database is queried, an adversary cannot back out sensitive information about any individual.

Formally, we call a (randomized) mechanism $f$ which acts on databases $(\epsilon, \delta)$-differentially private if, for all output sets $A \subset \mathrm{range}(f)$ and all databases $x$ and $z$ such that $x$ and $z$ differ by one row,

$$P(f(x) \in A) \le e^{\epsilon} P(f(z) \in A) + \delta.$$

If $δ=0$, we call $f$ $ϵ$-differentially private.

The intuition behind the definition is easier to grasp if we negate it. If $f$ is *not* differentially private, this means there exists some $A$ such that $P(f(x) \in A) \gg P(f(z) \in A)$. That is, changing just a single row of $x$ to obtain $z$ changes the output distribution considerably, meaning that $f$ is sensitive to the data. If $f$ *is* differentially private then, roughly speaking, small changes in the input result in small changes in the output distribution.
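To make the failure mode concrete, here is a toy sketch (the databases and query are hypothetical examples of mine, not from the text): an un-noised counting query cannot be $\epsilon$-differentially private for any finite $\epsilon$, because swapping one row can move an output from probability 1 to probability 0.

```python
import math

def count_query(db):
    # Deterministic counting query: number of users with the sensitive property.
    return sum(1 for row in db if row["has_property"])

# Two neighboring databases differing in exactly one row.
x = [{"has_property": True}, {"has_property": False}]
z = [{"has_property": False}, {"has_property": False}]

# Take the output set A = {1}. With no noise the mechanism is deterministic:
p_x = 1.0 if count_query(x) == 1 else 0.0  # P(f(x) in A) = 1
p_z = 1.0 if count_query(z) == 1 else 0.0  # P(f(z) in A) = 0

# With delta = 0, epsilon-DP requires p_x <= exp(epsilon) * p_z.
# Since p_x = 1 and p_z = 0, no finite epsilon satisfies this.
for epsilon in [0.1, 1.0, 10.0, 100.0]:
    assert p_x > math.exp(epsilon) * p_z
```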

A popular way to write $\epsilon$-differential privacy is as a bound on the likelihood ratio

$$\sup_{A \subset \mathrm{range}(f)} \; \sup_{x_1, x_2 \,:\, x_1 \in \delta_1(x_2)} \frac{P(f(x_1) \in A)}{P(f(x_2) \in A)} \le e^{\epsilon},$$

where $\delta_1(x)$ is the set of databases which differ by at most one row from $x$.
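For a mechanism with finitely many outputs, this supremum can be computed directly. A minimal sketch, using randomized response (a standard example of my choosing, not from the text above), in which a single user reports their true bit with probability $3/4$:

```python
import math

# Output distributions of the mechanism on two neighboring databases.
# Randomized response on one user's bit: report truthfully w.p. 3/4.
p_given_x = {0: 0.25, 1: 0.75}  # the user's true bit is 1
p_given_z = {0: 0.75, 1: 0.25}  # neighboring database: bit flipped to 0

def epsilon_of(p, q):
    # Worst-case log-likelihood ratio over outputs, in both directions.
    return max(abs(math.log(p[a] / q[a])) for a in p)

eps = epsilon_of(p_given_x, p_given_z)
print(eps)  # ln 3 ≈ 1.0986
```

Checking singleton outputs suffices here: for any set $A$, the ratio $P(A)/Q(A)$ is bounded by the largest pointwise ratio over elements of $A$.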

# Example: The Laplace Mechanism

Consider a deterministic function $g : D \to \mathbb{R}^k$. Define

$$\Delta = \sup_{x, y \,:\, x \in \delta_1(y)} \| g(x) - g(y) \|_1 = \sup_{x, y \,:\, x \in \delta_1(y)} \sum_{i=1}^{k} | g(x)_i - g(y)_i |.$$

$\Delta$ is often called the $\ell_1$-*sensitivity* of $g$. (For a counting query, for instance, $\Delta = 1$.) The Laplace mechanism is defined as

$$f(x) = g(x) + (Y_1, \ldots, Y_k),$$

where $Y_1, \ldots, Y_k \sim \mathrm{Lap}(0, \Delta/\epsilon)$ independently. Recall that the Laplace distribution $\mathrm{Lap}(a, b)$ has pdf $p(t) = (2b)^{-1} \exp(-|t - a|/b)$. To show that this mechanism is $\epsilon$-differentially private, we show that

$$\int_A P(f(x) = z) \, dz \le e^{\epsilon} \int_A P(f(y) = z) \, dz$$

for all $x \in \delta_1(y)$, where we abuse notation and write $P(f(x) = z)$ for the density of $f(x)$ at $z$. To see this, it suffices to show that for all $z \in A$, $P(f(x) = z) \le e^{\epsilon} P(f(y) = z)$, and then to integrate over $A$. Note that

$$P(f(x) = z) = \prod_{i=1}^{k} P(f(x)_i = z_i) = \prod_{i=1}^{k} P(Y_i = z_i - g(x)_i) = \prod_{i=1}^{k} \frac{\epsilon}{2\Delta} \exp\left( -\frac{\epsilon}{\Delta} |z_i - g(x)_i| \right).$$

Therefore, by the reverse triangle inequality,

$$\begin{aligned}
\frac{P(f(x) = z)}{P(f(y) = z)} &= \prod_{i=1}^{k} \exp\left( \frac{\epsilon}{\Delta} \left( |z_i - g(y)_i| - |z_i - g(x)_i| \right) \right) \\
&\le \prod_{i=1}^{k} \exp\left( \frac{\epsilon}{\Delta} |g(y)_i - g(x)_i| \right) \\
&= \exp\left( \frac{\epsilon}{\Delta} \sum_{i=1}^{k} |g(y)_i - g(x)_i| \right) \le \exp(\epsilon).
\end{aligned}$$

The last inequality holds because $\sum_{i=1}^{k} |g(y)_i - g(x)_i| = \|g(x) - g(y)\|_1 \le \Delta$ by the definition of sensitivity.

# Composition

One immediate question is how differentially private mechanisms behave under composition. For instance, can we employ multiple differentially private algorithms in tandem and retain differential privacy? Sort of.

Suppose $f_1$ and $f_2$ are $(\epsilon_1, \delta_1)$- and $(\epsilon_2, \delta_2)$-differentially private, respectively. Consider concatenating their outputs: given a database $x$, construct a new mechanism $g$ with $g(x) = (f_1(x), f_2(x))$, where we run $f_1$ and $f_2$ independently of one another. Will $g$ be differentially private?

It turns out that $g$ will be $(ϵ_{1}+ϵ_{2},δ_{1}+δ_{2})$ differentially private. You’ll often see this result referred to as “the epsilons and deltas add up.” Unfortunately, when people cite this result, they typically cite Theorem 1 of this paper, which has an incorrect proof. There is a corrected proof in the book by Dwork and Roth, but I think it’s more complicated than necessary. Here’s a simpler proof.

Note that the range of $g$ is the product space $\mathrm{range}(f_1) \times \mathrm{range}(f_2)$. Since we run $f_1$ and $f_2$ independently, we have, for any $A \times B \subset \mathrm{range}(g)$ and any neighboring databases $x$ and $z$,

$$\begin{aligned}
P(g(x) \in A \times B) &= P((f_1(x), f_2(x)) \in A \times B) \\
&= P(f_1(x) \in A) \, P(f_2(x) \in B) \\
&\le \min\{ e^{\epsilon_1} P(f_1(z) \in A) + \delta_1, 1 \} \, P(f_2(x) \in B) \\
&\le \left( \min\{ e^{\epsilon_1} P(f_1(z) \in A), 1 \} + \delta_1 \right) P(f_2(x) \in B) \\
&\le \min\{ e^{\epsilon_1} P(f_1(z) \in A), 1 \} \, P(f_2(x) \in B) + \delta_1 \\
&\le \min\{ e^{\epsilon_1} P(f_1(z) \in A), 1 \} \left( e^{\epsilon_2} P(f_2(z) \in B) + \delta_2 \right) + \delta_1 \\
&\le \min\{ e^{\epsilon_1} P(f_1(z) \in A), 1 \} \, e^{\epsilon_2} P(f_2(z) \in B) + \delta_2 + \delta_1 \\
&\le e^{\epsilon_1 + \epsilon_2} P(f_1(z) \in A) \, P(f_2(z) \in B) + \delta_1 + \delta_2 \\
&= e^{\epsilon_1 + \epsilon_2} P(g(z) \in A \times B) + \delta_1 + \delta_2.
\end{aligned}$$

Here we've just used basic facts about the min function and the fact that probabilities are at most 1. Of course, we can extend this result by induction and conclude that, given finitely many mechanisms $\{f_i\}$ where $f_i$ is $(\epsilon_i, \delta_i)$-differentially private, the mechanism $g : D \to \prod_i \mathrm{range}(f_i)$ given by $g : x \mapsto (f_1(x), \ldots, f_n(x))$ is $(\sum_i \epsilon_i, \sum_i \delta_i)$-differentially private.
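The composition rule is easy to exercise in code. Here is a minimal sketch (the helper names and the bookkeeping convention of returning an `(answer, (epsilon, delta))` pair are my own, not a standard API), running two Laplace mechanisms on the same toy database and concatenating the results:

```python
import math
import random

def sample_laplace(scale):
    # Inverse-CDF sampling from Lap(0, scale) using one uniform variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(g, x, sensitivity, epsilon):
    # Noised query answer together with its privacy cost (epsilon, 0).
    answer = g(x) + sample_laplace(sensitivity / epsilon)
    return answer, (epsilon, 0.0)

def compose(*results):
    # Concatenate independently run mechanisms: the epsilons and deltas add up.
    answers = tuple(ans for ans, _ in results)
    eps = sum(e for _, (e, _) in results)
    delta = sum(d for _, (_, d) in results)
    return answers, (eps, delta)

db = [1, 0, 1, 1, 0]  # toy database: one bit per user

# Two counting queries, each with l1-sensitivity 1.
r1 = laplace_mechanism(sum, db, sensitivity=1.0, epsilon=0.5)
r2 = laplace_mechanism(lambda d: sum(1 for v in d if v == 0), db,
                       sensitivity=1.0, epsilon=0.25)

answers, (eps, delta) = compose(r1, r2)
# The pair of answers is (0.75, 0)-differentially private.
```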

We can also consider deterministic post-processing, and demonstrate that it does not affect the privacy guarantees. Indeed, suppose $f : D \to U$ is $(\epsilon, \delta)$-differentially private and let $g : U \to V$ be deterministic and invertible. Then for any $W \subset V$ and neighboring databases $x$ and $y$,

$$P(g \circ f(x) \in W) = P(f(x) \in g^{-1}(W)) \le e^{\epsilon} P(f(y) \in g^{-1}(W)) + \delta = e^{\epsilon} P(g \circ f(y) \in W) + \delta,$$

so $g \circ f$ is $(\epsilon, \delta)$-differentially private.
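A small sketch of the same fact (the mechanism output and the rescaling map are hypothetical examples of mine): applying an invertible deterministic map to a private output leaves $(\epsilon, \delta)$ untouched, so the accounting simply passes through.

```python
def postprocess(result, fn):
    # Apply a deterministic map to a mechanism's output;
    # the privacy parameters (epsilon, delta) pass through unchanged.
    answer, (eps, delta) = result
    return fn(answer), (eps, delta)

# Suppose some (0.5, 0)-differentially private mechanism returned 3.7.
noised = (3.7, (0.5, 0.0))

# Rescale the answer with an invertible affine map; still (0.5, 0)-private.
rescaled, (eps, delta) = postprocess(noised, lambda t: 2.0 * t + 1.0)
```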

# DP in practice

Differential privacy is now used in many applications. Apple started using it in macOS Sierra and has since expanded its use to Safari. Google used differential privacy when gathering insights from searches related to Covid-19, and for its Covid-19 mobility reports. Other adopters include LinkedIn, Microsoft, and Uber.