The study of proving performance guarantees for learning algorithms, usually in a supervised learning setting (also in self-supervised learning, but this is a stupid category).

We have a class $\mathcal{F}$, a set of functions which map the feature space $\mathcal{X}$ to a label space $\mathcal{Y}$, which we'll assume is some subset of $\mathbb{R}$. Our algorithm sees training data (usually drawn iid from some distribution, but occasionally the distribution is more complicated) and chooses a function $\hat{f} \in \mathcal{F}$ (via empirical risk minimization, say).
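Concretely, fixing notation the paragraph above leaves implicit (a loss $\ell$ and a sample $S = \{(x_i, y_i)\}_{i=1}^n$ drawn from a distribution $\mathcal{D}$ are assumptions here, not pinned down above), the true risk, the empirical risk, and the ERM rule are

$$
R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(f(x), y)\big], \qquad
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \qquad
\hat{f} \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \hat{R}(f).
$$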

From here, there are several questions we might want to answer.

The first is, naturally, how good is $\hat{f}$? That is, how well does $\hat{f}$ generalize to unseen data? More formally, how close is the empirical risk of $\hat{f}$ to its true risk (see statistical decision theory)? PAC bounds are the most common way to answer this question. This usually leads to bounds which depend on some notion of the complexity of the class $\mathcal{F}$, such as its VC dimension or Rademacher complexity. Another way to answer this question is via PAC-Bayes bounds, which are a Bayesian take (sort of) on traditional PAC bounds. Here, the algorithm outputs a distribution over $\mathcal{F}$ rather than a single function, and the (averaged) generalization gap is bounded in terms of a divergence, typically the KL divergence between that distribution and a data-independent prior.
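To make the shape of these bounds concrete, here is one standard form of each, stated for a loss taking values in $[0,1]$; the exact constants and log factors vary across references, and none of this is fixed by the note itself. With probability at least $1-\delta$ over the sample,

$$
R(f) \;\le\; \hat{R}(f) + 2\,\mathfrak{R}_n(\ell \circ \mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}} \qquad \text{for all } f \in \mathcal{F},
$$

where $\mathfrak{R}_n(\ell \circ \mathcal{F})$ is the Rademacher complexity of the loss class. On the PAC-Bayes side, for any data-independent prior $P$ over $\mathcal{F}$, with probability at least $1-\delta$,

$$
\mathbb{E}_{f \sim Q}\big[R(f)\big] \;\le\; \mathbb{E}_{f \sim Q}\big[\hat{R}(f)\big] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2\sqrt{n}/\delta)}{2n}} \qquad \text{for all posteriors } Q.
$$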

The second is, how close is $\hat{f}$ to the best function in the class, $f^* = \arg\min_{f \in \mathcal{F}} R(f)$, the one that has the lowest risk? This question is usually answered by uniform convergence bounds in the following way. We want to bound $R(\hat{f}) - R(f^*)$, where $R$ is the true risk and $\hat{R}$ is the empirical risk. Write

$$
R(\hat{f}) - R(f^*)
= \underbrace{R(\hat{f}) - \hat{R}(\hat{f})}_{\text{(i)}}
+ \underbrace{\hat{R}(\hat{f}) - \hat{R}(f^*)}_{\text{(ii)}}
+ \underbrace{\hat{R}(f^*) - R(f^*)}_{\text{(iii)}}.
$$

Term (ii) is $\le 0$, since $\hat{f}$ is chosen to minimize empirical risk. Since $f^*$ is independent of the data, term (iii) can be handled via standard concentration inequalities (it's simply the deviation of an average of iid random variables from its mean). Then for term (i) we turn to a uniform convergence bound, since $\hat{f}$ is highly data-dependent.
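As a sketch of how the pieces combine (assuming a loss bounded in $[0,1]$ so that Hoeffding's inequality applies to term (iii); notation as above): with probability at least $1-\delta$,

$$
R(\hat{f}) - R(f^*)
\;\le\; \sup_{f \in \mathcal{F}} \big( R(f) - \hat{R}(f) \big) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
$$

and the remaining supremum is exactly the quantity that the uniform convergence / complexity bounds from the first question control.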