What does it mean that Deep Learning is Non-Parametric?
Most days, I write here about getting my hands dirty with code and keep mathematical insights for my friends and clients. Today, I want to share a small piece of intuition about how deep neural networks work and how they differ from "classical" machine learning models. Concretely:
- We very briefly think about formalizing the problem setup for classification,
- we review the bias-variance decomposition and fly very high over VC (Vapnik-Chervonenkis) dimensions and nonparametrics in the context of the bias-variance tradeoff,
- we look at some literature on interpolating neural networks and double descent,
- finally, we offer a very simple experiment with graphics to intuit over why double descent phenomena happen.
Probably none of this is new to you, but here you are reading it anyway. Stas Bekman had helpful comments on this post before publication. As usual, they helped improve things greatly and any errors are entirely my own. Thank you, Stas!
With this plan in mind, let us dive right in.
Formal setup of the machine learning problem
Let us try, for fun, to formalize our problem a bit.
It starts with data: here we have labelled data, i.e. inputs $x$ (e.g. images) and labels $y$ drawn from some distribution $P_{x,y}$.
Typically we have:
- The distribution $P_{x,y}$ is fixed and unknown. We typically have access to a sample $(x_i, y_i)$, $i = 1,\dots, N$.
- Mathematical analysis typically assumes independence of the samples.
- Often, but not always, we have or assume that the labels $y$ are indeed functions $y = f_{true}(x)$ of the inputs, i.e. that there is no ambiguity in the labels.
What we aim to "train" is some function $f : x \mapsto y$ or more generally estimate the conditional distribution $P(y | x)$. Our candidates come from some parametrized set $\mathcal{F} = \{f_\theta | \theta \in \Theta\}$, where $\theta$ represents the parameters.
To do this, we have some loss (or risk) function $L$ and conceptually we want to minimize the expected loss $E_{P_{x,y}}(L(y, f(x)))$.
The first attempt usually is to minimize the empirical risk or empirical loss $L_{emp}(f) = \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i))$ over $f \in \mathcal{F}$. If our loss function is a negative log likelihood, minimizing $L_{emp}$ means making a maximum likelihood estimate.
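As a minimal sketch of empirical risk minimization (the data, model, and optimizer here are arbitrary stand-ins, not any particular experiment), this is what it looks like in PyTorch:

```python
import torch

# Hypothetical stand-ins: any dataset of (x_i, y_i) pairs and any parametrized f_theta would do.
X = torch.randn(1000, 20)                      # inputs x_i
y = torch.randint(0, 10, (1000,))              # labels y_i
f_theta = torch.nn.Linear(20, 10)              # a (very small) parametrized model f_theta

loss_fn = torch.nn.CrossEntropyLoss()          # a negative log likelihood (of softmax outputs)
opt = torch.optim.SGD(f_theta.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    L_emp = loss_fn(f_theta(X), y)             # empirical risk: average loss over the N samples
    L_emp.backward()
    opt.step()                                 # minimizing L_emp here is maximum likelihood estimation
```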
A very brief look at the bias-variance decomposition and tradeoff
For the least squares loss $L = (f(x) - y)^2$ (the bread-and-butter loss of regression), it is easy to decompose the expected loss into several components. (All this is not crucial to what follows, so you could just skip it if you want. For some reason the bias-variance tradeoff is often mentioned in connection with deep learning as something that is outdated, so I found it instructive to write up some things as briefly as possible. The take-home is that the bias-variance decomposition and tradeoff (for squared error and some other losses) suggests two possible strategies to approach the learning problem, of which one (regularization) seems to be exactly what we do in deep learning. C. Bishop, Pattern Recognition and Machine Learning, treats the bias-variance decomposition in section 3.2, much more gently than I do here.)

The trick is to realize that our training data itself is a random variable, sampled from the $N$-fold product distribution $P_{\mathcal{D}} = \prod_{i=1}^N P_{x,y}$, so our trained model $f$ depends on $\mathcal{D}$, which we write as $f_{\mathcal{D}}$. Then, by cleverly adding $0 = E_{\mathcal{D}}[f_{\mathcal{D}}(x)] - E_{\mathcal{D}}[f_{\mathcal{D}}(x)]$ and using the independence of $(x,y)$ and $\mathcal{D}$, we can decompose the expected squared error of our predictions as $$ \begin{aligned} E_{x,y,\mathcal{D}}[(f_{\mathcal{D}}(x) - y)^2] = & E_{x}[(E_{\mathcal{D}}[f_{\mathcal{D}}(x)] - E[y|x])^2] \\ & + E_{\mathcal{D}, x}[(f_{\mathcal{D}}(x) - E_{\mathcal{D}}[f_{\mathcal{D}}(x)])^2] \\ & + E_{x,y}[(y - E[y | x])^2] \\ = & E_{x}[(Bias_{\mathcal{D}}[f_{\mathcal{D}}]^2 )(x)] \\ & + E_{x}[Var_{\mathcal{D}}[f_{\mathcal{D}}(x)]] \\ & + E_{x}[Var[y | x]], \end{aligned} $$ where the last term, the noise, does not depend on our model. (With the somewhat funny notation I try to emphasize that the bias and variance are pointwise in $x$ here.) To get a good expected squared error of our predictions, we need the sum of the bias (the first term) and the variance (the second term) of our model outputs to be small. Note that all terms are non-negative.
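To make the decomposition concrete, here is a small simulation of my own (not from any of the cited works): we repeatedly draw training sets $\mathcal{D}$, fit polynomials of varying degree, and estimate the bias and variance terms on a grid of test points.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = np.sin                        # E[y|x]
sigma = 0.3                            # noise level, so E_x[Var[y|x]] = sigma**2
x_test = np.linspace(0.0, 2 * np.pi, 50)

def fit_and_predict(deg, n=20):
    """Draw one training set D, fit a degree-deg polynomial, return predictions on x_test."""
    x = rng.uniform(0.0, 2 * np.pi, n)
    y = f_true(x) + sigma * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, deg), x_test)

for deg in [1, 3, 9]:
    preds = np.stack([fit_and_predict(deg) for _ in range(500)])       # 500 draws of D
    bias2 = ((preds.mean(axis=0) - f_true(x_test)) ** 2).mean()        # E_x[Bias_D[f_D]^2]
    var = preds.var(axis=0).mean()                                     # E_x[Var_D[f_D(x)]]
    print(f"degree {deg}: bias^2 {bias2:.4f}  variance {var:.4f}  noise {sigma**2:.4f}")
```

As the degree grows, the bias term shrinks while the variance term grows, which is the tradeoff in its textbook form.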
A similar but more complex decomposition exists for the expected accuracy of a binary classifier, but it is in general hard to obtain for other loss functions, although we might imagine that the modelling choices work similarly there.
The decomposition leads us to a tradeoff because it is relatively easy to get either term to be $0$ or very close to $0$: The bias can be made very small by using the sample mean at each datapoint of $\mathcal{D}$ as the estimate and doing something smart elsewhere, say relying on a regularity argument that assumes the function is Lipschitz or at least uniformly continuous. This is extreme overfitting. (I'm calling it extreme overfitting here. But when we do something sensible instead of "whatever", this can be useful. In fact, the entire reason I'm writing this is that deep neural networks that interpolate the training data have been remarkably successful, and that is exactly the subject of the articles I quote and discuss below. Also note that it is only easy to get $0$ bias at the points $x$ where you have samples.) We can make the variance $0$ by just predicting some function independent of $\mathcal{D}$, e.g. $f_{\mathcal{D}}(x) = 0$, an extreme underfit.
Note that the bias-variance-decomposition of the mean squared error is an equation which holds by virtue of a mathematical theorem (or at least a lemma), so it universally applies. There isn't any regime in which you don't have it.
Loosening up from very rigid and bringing structure in the very wild
In a way, moving from the above extremes to more moderate regimes suggests what can be done in practice, too.
Increasing the set of candidate functions
One approach can be to start with a set of very rigid functions as candidates and then meaningfully enlarge the space of candidate functions to get (if they are nested) a sequence $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \dots$. The key idea is that moving further along the sequence allows the model to fit the data better, and one has to know when to stop. The Structural Risk Minimization principle of Vapnik and Chervonenkis does this, though it does not use the bias-variance decomposition but instead bounds the expected loss (the risk) by the observed loss on $\mathcal{D}$ (the empirical risk) plus a term depending on the training dataset size $N$ and the size or VC dimension of the function set $\mathcal{F}_i$. The typical situation is that one certainly wants $N \geq \textrm{VC-dim}$ but may only have $N \leq 20 \,\textrm{VC-dim}$ samples. (I took this number $20$ from V. Vapnik: The Nature of Statistical Learning Theory, which inspired much of this paragraph.) Note that here, the number of parameters isn't part of the criterion (though it may influence the VC dimension).
Finally, traditional criteria like the Akaike Information Criterion (try to) tell you how many parameters you should "invest" to achieve a low negative log likelihood. But as Bishop finds in his introduction, these tend not to work terribly well.
Non-parametric estimation: Regularization
Classic non-parametric estimation starts from the opposite extreme. If we take some function space, e.g. the Sobolev space $H^1$ of functions with weak derivative in $L^2$ (with respect to whatever measure, either the distribution of $x$ or the Lebesgue measure on the $\mathbb{R}^d$ containing the inputs), we can match the sample mean at each point of any finite sample $\mathcal{D}$ and so obtain zero pointwise bias, but minimizing just the empirical risk is ill-posed: there are infinitely many solutions.
What is done then is to regularize. The most famous case is perhaps adding a norm term, leading to Tikhonov-type regularizations, so our loss might look like $$L_{\lambda} = \frac{1}{N} \sum_{i=1}^N |f(x_i) - y_i|^2 + \lambda |\nabla f|_{L_2}^2. $$
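A minimal sketch of a Tikhonov-type fit (using a penalty on the coefficients of a polynomial feature expansion as a simple stand-in for the gradient seminorm above; the data and feature map are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(3 * x) + 0.2 * rng.normal(size=N)

# A simple finite-dimensional stand-in for the function space: polynomial features.
Phi = np.stack([x ** k for k in range(16)], axis=1)

for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    # Tikhonov-regularized least squares: minimize |Phi w - y|^2 / N + lam * |w|^2,
    # solved in closed form via the regularized normal equations.
    A = Phi.T @ Phi / N + lam * np.eye(Phi.shape[1])
    w = np.linalg.solve(A, Phi.T @ y / N)
    train_mse = np.mean((Phi @ w - y) ** 2)
    # as lam -> 0 the fit interpolates ever more aggressively (the ill-posedness from above),
    # as lam -> infinity it is pushed towards very "smooth" (here: small-coefficient) functions
    print(f"lambda = {lam:g}: train MSE {train_mse:.4f}, coefficient norm {np.linalg.norm(w):.1f}")
```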
If we want to look at it from a bias-variance perspective, we can balance bias (with $\lambda \rightarrow 0$ removing the bias but becoming ill-posed) and variance (as $\lambda \rightarrow \infty$, $f_{\mathcal{D}}$ becomes constant, equal to the mean of the $y_i$ in $\mathcal{D}$; we don't quite get variance $0$ because our regularizing term is just a seminorm). Of particular relevance to striking a good balance here is Grace Wahba's research on regularized regression, in particular on how to find a good $\lambda$.
There is a connection to the nested spaces $\mathcal{F}_i$ of Ansatz functions in the previous section: for a given $\lambda$, the minimizer $f_\lambda$ of $L_\lambda$ has (semi-)norm value $\kappa(\lambda) = |\nabla f_\lambda|$ (for non-unique minimizers, take the supremum over them), and it necessarily minimizes the empirical least squares loss (the first term) within $\mathcal{F}_\kappa = \{ f : |\nabla f| \leq \kappa \}$. So a decreasing sequence of norm bounds $\kappa_i > 0$, obtained from an increasing sequence of weights $\lambda_i$, gives us nested Ansatz spaces.
Many popular regularized regression approaches (e.g. Lasso) fit in this type of framework.
Appreciating the Vapnik-Chervonenkis bound
Let us circle back to the VC bound and make it a bit more formal to sharpen the intuition. A key probabilistic bound for the accuracy (or 0-1 risk) is (this comes from plugging Theorem 3.3 into formula 3.15 in Chapter 3.4 of Vapnik's book and is also the bound given on Wikipedia) $$ \mathrm{Prob}\left(|Acc_{P_{x,y}}-Acc_{\mathcal{D}}| \leq \sqrt{\frac{\textrm{VC-dim}\left(\ln \frac{2N}{\textrm{VC-dim}} + 1 \right) - \ln \frac{\eta}{4}}{N}} \right) \geq 1-\eta. $$
Let us take this apart a bit. The outer part says "with probability at least $1-\eta$" (and we would have to make this more precise), where we think of $\eta$ as small; this just means that we have a probabilistic bound, not an "almost surely" guarantee.
The inner part basically says the accuracy on the full probability distribution is very close to the accuracy on the training set, in the sense that we have a precise bound on the difference that goes to zero as $N$ grows very large.
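To get a feel for the numbers, here is a tiny sketch that just evaluates the right-hand side of the bound above for a made-up VC dimension:

```python
import math

def vc_gap_bound(n, vc_dim, eta=0.05):
    """Right-hand side of the bound: |Acc_P - Acc_D| <= this, with probability >= 1 - eta."""
    return math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) - math.log(eta / 4)) / n)

vc_dim = 1000                              # a made-up VC dimension
for ratio in [1, 5, 20, 100]:              # N as a multiple of the VC dimension
    n = ratio * vc_dim
    print(f"N = {ratio:>3} * VC-dim: accuracy gap <= {vc_gap_bound(n, vc_dim):.3f}")
```

For $N$ close to the VC dimension the bound is vacuous (larger than $1$); it only becomes informative once $N$ is a decent multiple of the VC dimension, which matches the rule of thumb above.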
At face value, this tells us about the risk or accuracy, but what does it say about the model? To my mind, the key message is that our model is so rigid that everything that happens on the test set (or more precisely on the full distribution $P_{x,y}$) is already seen on the training set.
Bayesian aside
The regularization can be interpreted as a maximum-a-posteriori (MAP) estimate in a Bayesian context, or - if we go through the trouble of defining a prior - we may also integrate the estimates over all $f \in \mathcal{F}$.
What does this tell us about deep learning?
So when we pass model.parameters() to our optimizer (unsurprisingly, I'm using PyTorch here), that sounds parametric. But it isn't!
It would seem that this regularization approach is exactly the theoretical framework in which - at times poorly understood - deep learning operates. Our models are large enough to be "morally" non-parametric, and most of the things we do (augmentation, norm layers, dropout) are regularizations, though we do not fully understand them yet. (Incidentally, Andrej Karpathy's excellent recipe for training neural networks suggests putting the theoretical concept of extreme overfitting followed by regularizing into practice in the "overfit" and "regularize" steps of the recipe. Some of the regularization advice seems to have aged better than others, but note that most things done to neural networks count as regularization here.)
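For concreteness, here is what some of these regularization knobs look like in PyTorch (a hypothetical toy model, just to point at where the knobs live):

```python
import torch

# A hypothetical small classifier, just to show the usual regularization knobs:
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.BatchNorm1d(256),     # norm layer
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),       # dropout
    torch.nn.Linear(256, 10),
)
# weight decay adds a Tikhonov-style penalty on the parameters to the optimization
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
# data augmentation (random crops, flips, ...) would act on the inputs before they reach the model
```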
This is also the theme of M. Belkin et al.: Reconciling modern machine learning practice and the bias-variance trade-off and their earlier work: that the key to generalization performance is the regularity or smoothness of a function as measured by a certain function space norm.
It is worth looking closely at the sketch M. Belkin et al. (I think this is the first) provide for the double descent phenomenon:
Some things to notice here:
- M. Belkin et al. put the "classical" and "modern" qualifiers for the regimes in quotation marks. The "modern" regime turns out to be very much non-parametric learning with a regularization we have yet to understand better.
- It would seem that bias-variance thinking is still fully applicable in the many-parameter regime, but non-parametric regression might be a better frame of reference than "limiting the capacity" of the candidate set. (An earlier paper by M. Belkin et al. is titled To understand deep learning we need to understand kernel learning.)
The popular appreciation of the double descent phenomenon seems to be based more on P. Nakkiran et al.: Deep Double Descent, a testament to the (good) paper and also to OpenAI's reach and ability to introduce topics to a broader audience. They make systematic experiments with more realistic networks (M. Belkin et al. cited above use shallower networks). One of the key takeaways for me was that the double descent phenomenon has a pronounced "bump" between the two regimes in experiments with corrupted labels; they report much more modest ones in experiments with clean (well, not intentionally corrupted) labels.
Their opening figure is a graph showing test errors from training a modified ResNet18 on CIFAR10 with corrupted labels for a fixed number of epochs. The modification is that the number of channels is reduced to one $k$th of the original number, for $k$ from 64 down to 1 (so they start with 1/64th of the "original" ResNet18 and eventually reach the original). The label corruption is that in the dataset (in particular once, not per epoch), 15% of the labels are switched to a random wrong class. This is interpreted as a sketch of mis-specification.
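As a sketch of that corruption procedure (my reading of the description above, not the authors' code), corrupting 15% of the labels once might look like this:

```python
import numpy as np

def corrupt_labels(labels, num_classes=10, frac=0.15, seed=0):
    """Switch frac of the labels (once, not per epoch) to a uniformly random *wrong* class."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
    offsets = rng.integers(1, num_classes, size=len(idx))   # offsets 1..num_classes-1, never 0
    labels[idx] = (labels[idx] + offsets) % num_classes     # so the new label is always wrong
    return labels

# e.g. corrupt_labels(np.array(cifar10_train.targets)) for a torchvision CIFAR10 dataset
```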
So what does VC theory have to say about fitting models with noisy labels?
So by the above discussion and bound, we know that for a model in the regime where the VC bound is useful (so small models, the "classical" regime), the test accuracy is (very likely) close to the training accuracy if the training data $\mathcal{D}$ comes from the same distribution $P_{x,y}$ as the test data. In other words, the condition means that we assume here that $P_{x,y}$ has the same level (and kind) of corruption. But this means that if the model learns, it learns not to be distracted too much by the corrupted training data, i.e. on the training data, the correct labels outcrowd the corrupted labels. (This is a bit of a leap because the VC bound actually neither says that the classification error will be of the same proportion as the label corruption rate (hence the qualifier "if the model learns") nor that the error will be "on the same subpopulation" as the corrupted samples.)
Features and learning
One of the things that makes intuiting about deep learning hard is the adaptive nature of the Ansatz space. (And I could probably write an entire book chapter around my intuition of this. In our Deep Learning with PyTorch book, we interpret a small experiment on neural network architecture components. The three things we introduce under the header regularization have very different behaviour there: weight decay and dropout reduce the gap between train and test error, while the most dramatic effect of batch norm is that it lets the train accuracy reach almost 100%. We offer the explanation that the first two have interpretations as statistical regularization in terms of Bayesian inference, but batch norm is more of a convergence helper. Combined with a footnote in the initialization section mentioning FixUp, one of our attentive readers saw this as taking a swipe at batch norm and filed a bug that we were undermining batch norm. I'll admit that perhaps the word doesn't do justice to the rôle of batch norm, and a more in-depth discussion would stress the importance of the regularization of the function space we are optimizing in.)

By adaptive, I mean that we do not have a fixed feature extractor (manually constructed, given by the kernel families used in kernel machines, or some such) to whose features we then apply learning. (While a draft of this blog post was rotting on my hard drive, the neural tangent kernel theory has become quite popular and moves explanations for deep learning quite a bit towards kernel learning.) Quite often, we view the inputs to the final layer as features (in vector representations learned with word2vec-style losses, prototypical networks, benchmark unsupervised learning and such), or we might split a convolutional network at the end of the convolutional layers, before an MLP classifier head, which I recently wrote about as being similar in structure.
Quite opposite to the traditional picture of a learned classifier on top of a fixed feature extractor, E. Hoffer et al. (Fix your classifier, ICLR 2018) even suggested fixing the classifier, i.e. making training about the feature extractor only.
So we might try to make our intuiting easier by pretending we were extracting features. When using some dimension reduction mechanism like t-SNE to visualize the features learned with the noise-free data in (our reproduction of) the headline image experiment of P. Nakkiran et al., adding the label noise would correspond to adding noise to the blobs of points corresponding to each class. With this in mind, we can undertake an experiment that is similar to, but even simpler than, the artificial data experiment in M. Belkin et al.: To understand deep learning we need to understand kernel learning.
An intuition about label noise, capacity, double descent, and the test error, with an experiment
Setting statistics aside, here is some speculation about what might be going on, imagining that we could reproduce the phenomenon depicted in the figure from P. Nakkiran et al. above with prototypical networks, where the capacity is given by the number of prototypes we can have:
- At the very left, around width (the parameter) values 1 to 5, we have fewer prototypes than classes; the model underfits because it simply cannot represent all classes.
- At around width 5, we have 10 (or moderately more) prototypes, but the corrupted labels are outcrowded around every prototype during training, so they don't play a role. (The prototype analogy has the weakness that typically you would prescribe the classes for the prototypes in advance, while the ResNet does not.)
- At width 5 to 10, the prototypes pick up the corrupted labels. As each prototype has a "region of influence" for inference, there is a sizeable space where the corrupted prototypes are very relevant in testing.
- At width beyond 10, we add more prototypes. The prototypes get closer together and the corrupted-label prototypes are now "outcrowded" during inference, so that their "region of influence" gets smaller (because it is much more likely to have 3 of 5 nearby prototypes non-corrupt than to have 3 corrupted ones for the same class).
What would this mean for the bias-variance decomposition? Recall that the decomposition is pointwise in $x$ and takes the bias and variance over the distribution of training datasets, as before. Imagine that you only had two classes, so the prediction and label are either 0 or 1. Then prototypes picking up corrupted labels produce both bias (because you will predict the wrong thing with some probability) and variance (because the regions of bad predictions depend on which labels got corrupted, i.e. on which dataset $\mathcal{D}$ we drew), and making the regions of wrong predictions smaller reduces both.
The role of early stopping in this intuition would be to detect when the model starts picking up corrupted labels.
So it would seem that modern neural networks are essentially non-parametric and that how they work relies on the various regularizations. To use M. Belkin et al.'s formulation, we would wish to better understand the degree to which the various techniques contribute to certain function space norms. It seems hard to conclude that "classical" statistics suggests modern learning would not work.
Using least squares as a model problem, Hastie et al.: Surprises in High-Dimensional Ridgeless Least Squares Interpolation offer a very thorough analysis which might provide intuition for deep learning phenomena, too.
Outcrowding mislabeled data in the interpolating regime
We can make a very simple analogue of the interpolating regime. We consider a binary classification problem with points drawn from 2d standard unit normals that are shifted along the horizontal axis by $\pm 2$. We draw 25% of the points of each category from the distribution of the other.
To get an interpolating regime we use a kernel that has a pronounced spike. In order to have an analytically tractable mass and normalize to $1$, we use the kernel
$$ \phi : \mathbb{R}^2 \rightarrow \mathbb{R}, \qquad x \mapsto \frac{1}{2 \pi |x|} \exp(-|x|) . $$
This has unit mass, goes to infinity at $x=0$ and decays away from the origin:
This means that we can represent the density of each class $C$ as the mean of kernels at that class's samples $x^C_i$, i.e.
$$ d^C(x) = \frac{1}{N} \sum_{i=1}^N \phi(x - x^C_i), $$
Assuming that points from distinct classes do not coincide (which is true almost surely), we can then classify each point according to which $d^C$ is larger or, if we want probabilities, by normalizing the densities at each point: $p^C(x) = d^C(x) / \sum_{C'} d^{C'}(x)$.
This gives us an interpolating solution - at each training point, the label class has infinite density, so it is classified as belonging to this class.
So what happens with the labeling errors? They cause some neighborhood of the corrupted training point to be assigned to the erroneous class. However, the more points of the correct class are nearby, the smaller the area of misclassification becomes. We can try this interactively. As you increase the number of points, the test error decreases.
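Here is a minimal numpy sketch of this experiment (my reconstruction of the setup described above, with the interactive part replaced by a single test-error printout; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_per_class):
    """2d standard normals shifted by +-2 along x; 25% of each class drawn from the other."""
    X, y = [], []
    for cls, shift in [(0, -2.0), (1, 2.0)]:
        flip = rng.random(n_per_class) < 0.25          # these points come from the *other* class
        centers = np.where(flip, -shift, shift)
        pts = rng.normal(size=(n_per_class, 2))
        pts[:, 0] += centers
        X.append(pts)
        y.append(np.full(n_per_class, cls))
    return np.concatenate(X), np.concatenate(y)

def density(x_query, x_train):
    """Mean of the singular kernel phi(x) = exp(-|x|) / (2 pi |x|) centred at the training points."""
    d = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=-1)
    d = np.maximum(d, 1e-12)                           # avoid division by zero at the samples
    return (np.exp(-d) / (2 * np.pi * d)).mean(axis=1)

X_train, y_train = make_data(200)
X_test, y_test = make_data(1000)

d0 = density(X_test, X_train[y_train == 0])
d1 = density(X_test, X_train[y_train == 1])
pred = (d1 > d0).astype(int)
print("test error:", (pred != y_test).mean())
```

Increasing the number of training points per class should show the shrinking of the misclassified neighbourhoods described above.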
So what does this mean? For interpolating solutions, the good training points outcrowd the badly labeled points at test time.
Adversarial examples
But while more data reduces the area of bad classification, and thus the probability of a randomly sampled datapoint hitting it, the distance from a randomly sampled point to the nearest bad sample also decreases. This means that in addition to exploiting the poor continuity properties of models (i.e. that a small change in input can cause a large change in the extracted features), the interpolating regime makes adversarial examples easier to produce, because we only need a small change in the features.
Feature noise is similar to label noise
But doesn't double descent also happen in the absence of corrupted labels? Do we "just" need to be extra careful with our training data?
Well, it is not this easy. High-dimensional features can be inherently more noisy than low-dimensional features: Imagine having a two-class linear classification problem in a high-dimensional space, say of dimension $d$. So we have some classifier with vector $v \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. Given an input $x \in \mathbb{R}^d$, the class is $1$ if $x \cdot v + b \geq 0$ and the class is $0$ otherwise. If we know a priori that the inputs are bounded, we may find class prototypes $\hat x_C$ and classify using the distances $|x - \hat x_C|$. But then the part of the vector in the $d-1$ dimensional null space of the linear classifier, i.e. the space $\{ x' : x' \cdot v = 0 \}$ of vectors we can add to the input without changing the result, may contribute significantly to this distance, making $|x - \hat x_C|$ a rather noisy estimate of the more relevant projected distance $|(x - \hat x_C) \cdot v|$.
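As a quick numerical illustration of this point (a toy check with made-up numbers, not one of the experiments in this post): for a random input in high dimension, the full distance to a prototype is dominated by the null-space components, while the projected distance stays moderate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # feature dimension
v = np.zeros(d); v[0] = 1.0              # the classifier only looks at the first coordinate

prototype = np.zeros(d)                  # a hypothetical class prototype
x = rng.normal(size=d)                   # a random input, mostly "null-space" content

full_dist = np.linalg.norm(x - prototype)        # grows like sqrt(d)
projected_dist = abs((x - prototype) @ v)        # stays O(1)
print(f"full distance {full_dist:.2f} vs projected (relevant) distance {projected_dist:.2f}")
```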
If we want to stay in two dimensions, we can instead scale up the noisy dimension. This gives us our second experiment. We draw independent random 2d points with standard deviation $0.5$ in the "feature dimension" and $5$ in the noise dimension. The two classes are separated by adding $\pm 1$ in the feature dimension. We use the EM algorithm to fit a mixture of $K$ Gaussians with standard deviation $1$ in each dimension to each class, and classify as above by comparing the two fitted densities. We use $5000$ training points and $1000$ test points. (This makes things a bit slow, so I use $500$ training points when initially loading this blog page.)
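Here is a condensed sketch of this second experiment (my own re-implementation with a hand-rolled EM that keeps the component variances fixed at $1$ and only fits means and mixture weights; initialization and iteration counts are my choices, not necessarily those of the in-page experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Feature dim: std 0.5, shifted by +-1 per class; noise dim: std 5."""
    X, y = [], []
    for cls, shift in [(0, -1.0), (1, 1.0)]:
        pts = np.stack([shift + 0.5 * rng.normal(size=n), 5.0 * rng.normal(size=n)], axis=1)
        X.append(pts); y.append(np.full(n, cls))
    return np.concatenate(X), np.concatenate(y)

def fit_mixture(X, K, n_iter=50):
    """EM for a K-component isotropic Gaussian mixture with variance fixed to 1."""
    means = X[rng.choice(len(X), K, replace=False)].copy()
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities under N(mean_k, I)
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        log_r = np.log(weights) - 0.5 * sq
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
        # M-step: update means and weights (the variance stays fixed at 1)
        nk = r.sum(axis=0) + 1e-12
        means = (r.T @ X) / nk[:, None]
        weights = nk / nk.sum()
    return means, weights

def mixture_density(X, means, weights):
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return (weights * np.exp(-0.5 * sq) / (2 * np.pi)).sum(axis=1)

X_train, y_train = make_data(2500)       # 5000 training points in total
X_test, y_test = make_data(500)          # 1000 test points in total

for K in [1, 2, 5, 20, 100]:
    params = [fit_mixture(X_train[y_train == c], K) for c in (0, 1)]
    d0, d1 = (mixture_density(X_test, m, w) for m, w in params)
    err = ((d1 > d0).astype(int) != y_test).mean()
    print(f"K = {K:>3}: test error {err:.3f}")
```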
If we run this 200 times for a variety of $K$ and record the accuracy, we can see the bump from the double descent:
One thing about these experiments: the error distribution is skewed. Much of the time, we get a test error of around 2%-3%, but there is a tail with errors going up to around 10%. The mass in this tail of bad fits varies with the number of components $K$ and appears to be the main reason for the bump in average error at intermediate $K$.
Conclusion
So what have we learned?
- I think about deep learning with very large models as similar to non-parametric learning, with the training phase (which is foreign to the purely non-parametric setup) memorizing relevant aspects of the data in addition to the model selection (finding the "hyper"-parameters of non-parametric models).
- To my mind, we see the non-parametric nature of deep learning models in that the noise needs to be outcrowded by the features at inference time rather than at training time. This is very close to the KDE experiments above.
- Even if we have great labels (you have them, don't you?), the high feature dimension in deep learning leads to noise in the features that behaves similarly to noise in the labels.
I hope you enjoyed this blog post. I appreciate your comments at tv@lernapparat.de.
If you have tricky applied mathematical problems in your AI that you want a consultant to help you with, you can hire me through my company MathInf GmbH.