
[DIVE INTO DEEP LEARNING] 5.5. Generalization in Deep Learning

by ram_ 2023. 12. 29.

Deep learning theory has opened up promising lines of research and produced a stream of interesting results, but it still appears far from a comprehensive explanation of both (i) why we are able to optimize neural networks at all and (ii) how models learned by gradient descent manage to generalize so well, even on high-dimensional tasks. In practice, however, (i) is rarely a problem, so understanding generalization is by far the bigger question. On the other hand, even without the comfort of a coherent scientific theory, practitioners have developed a wide range of techniques that do, in practice, help produce models that generalize well. While it is impossible to summarize the vast topic of generalization in deep learning concisely, and even surveying the overall state of research is no easy task, this section aims to give a rough overview of the current state of research and practice.


 

5.5.1. Revisiting Overfitting and Regularization

Strangely, for many deep learning tasks (e.g., image recognition and text classification) we are typically choosing among model architectures, all of which can achieve arbitrarily low training loss (and zero training error). Because all models under consideration achieve zero training error, the only avenue for further gains is to reduce overfitting. Even stranger, it is often the case that despite fitting the training data perfectly, we can actually reduce the generalization error further by making the model even more expressive, e.g., adding layers, nodes, or training for a larger number of epochs. Stranger yet, the pattern relating the generalization gap to the complexity of the model (as captured, for example, in the depth or width of the networks) can be non-monotonic, with greater complexity hurting at first but subsequently helping in a so-called “double-descent” pattern (Nakkiran et al., 2021). Thus the deep learning practitioner possesses a bag of tricks, some of which seemingly restrict the model in some fashion and others that seemingly make it even more expressive, and all of which, in some sense, are applied to mitigate overfitting.
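The kind of experiment behind these observations can be sketched as a capacity sweep: fix a noisy dataset, vary the width of a one-hidden-layer network, and record training and test error as the model grows. The snippet below is a minimal illustration of that protocol, not a reproduction of any particular study; the library (scikit-learn), the synthetic dataset, and the chosen widths are all assumptions made for the sake of the example.

```python
# Sketch: sweep model width and record train/test error on a small noisy
# classification task. Library choice and hyperparameters are illustrative
# assumptions, not taken from the original text.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for width in [2, 8, 32, 128, 512]:
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print(f"width={width:4d}  train err={1 - clf.score(X_tr, y_tr):.3f}  "
          f"test err={1 - clf.score(X_te, y_te):.3f}")
```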

 

Deep neural networks are capable of fitting arbitrary labels even for large datasets, and despite the use of familiar methods such as ℓ2 regularization, traditional complexity-based generalization bounds, e.g., those based on the VC dimension or Rademacher complexity of a hypothesis class, cannot explain why neural networks generalize.
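One way to see the "fitting arbitrary labels" claim concretely is the randomization test popularized by Zhang et al. (2021): assign labels completely at random and check that training accuracy still approaches 100%. The sketch below is an illustrative toy version in PyTorch; the architecture, data sizes, and optimizer settings are assumptions chosen for the example, not taken from the text.

```python
# Sketch of the randomization test: fit randomly assigned labels and check
# that training accuracy still approaches 100%. All sizes and hyperparameters
# are illustrative assumptions.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(512, 32)                       # synthetic inputs
y = torch.randint(0, 2, (512,))                # labels assigned at random

net = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()

train_acc = (net(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy on random labels: {train_acc:.3f}")  # close to 1.0
```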

 

5.5.2. Inspiration from Nonparametrics

Approaching deep learning for the first time, it is tempting to think of neural networks as parametric models. After all, the models do have millions of parameters. When we update the models, we update their parameters. When we save the models, we write their parameters to disk. However, mathematics and computer science are riddled with counterintuitive changes of perspective, and surprising isomorphisms between seemingly different problems. While neural networks clearly have parameters, in some ways it can be more fruitful to think of them as behaving like nonparametric models. So what precisely makes a model nonparametric? While the name covers a diverse set of approaches, one common theme is that nonparametric methods tend to have a level of complexity that grows as the amount of available data grows.

 

Perhaps the simplest example of a nonparametric model is the 𝒌-nearest neighbor algorithm (we will cover more nonparametric models later, for example in Section 11.2). Here, at training time, the learner simply memorizes the dataset. Then, at prediction time, when confronted with a new point 𝒙, the learner looks up the 𝒌 nearest neighbors (the 𝒌 points 𝒙′ that minimize some distance 𝑑(𝒙,𝒙′)). When 𝒌=1, this algorithm is called 1-nearest neighbors, and the algorithm will always achieve a training error of zero. That, however, does not mean that the algorithm will not generalize. In fact, it turns out that under some mild conditions, the 1-nearest neighbor algorithm is consistent (eventually converging to the optimal predictor).
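To make the "memorize, then look up neighbors" description concrete, here is a minimal NumPy sketch of 𝒌-nearest-neighbor prediction; the Euclidean distance and the toy data are assumptions chosen purely for illustration.

```python
# Minimal k-nearest-neighbor sketch: "training" just memorizes the data;
# prediction returns the majority label among the k closest memorized points.
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    # Pairwise Euclidean distances between queries and memorized points.
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]          # indices of k neighbors
    # Majority vote over the k neighbor labels (k=1 reduces to a lookup).
    return np.array([np.bincount(y_train[idx]).argmax() for idx in nearest])

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.2, 0.1], [1.8, 2.1]])))  # [0 1]
```

With 𝒌=1, every training point is its own nearest neighbor, which is why the training error is always zero (barring duplicate points with conflicting labels).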

 

In a sense, because neural networks are over-parametrized, possessing many more parameters than are needed to fit the training data, they tend to interpolate the training data (fitting it perfectly) and thus behave, in some ways, more like nonparametric models. More recent theoretical research has established deep connections between large neural networks and nonparametric methods, notably kernel methods. In particular, Jacot et al. (2018) demonstrated that in the limit, as multilayer perceptrons with randomly initialized weights grow infinitely wide, they become equivalent to (nonparametric) kernel methods for a specific choice of the kernel function (essentially, a distance function), which they call the neural tangent kernel. While current neural tangent kernel models may not fully explain the behavior of modern deep networks, their success as an analytical tool underscores the usefulness of nonparametric modeling for understanding the behavior of over-parametrized deep networks.
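The neural tangent kernel is defined between two inputs as the inner product of the network's parameter gradients evaluated at those inputs; the infinite-width result above says that a randomly initialized network behaves like a kernel method with this kernel. The following is a rough finite-width sketch of computing the empirical kernel for a small PyTorch model; it illustrates the definition only, not the infinite-width construction, and the architecture is an illustrative assumption.

```python
# Sketch: empirical neural tangent kernel K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
# for a small scalar-output MLP (a finite-width approximation of the kernel
# discussed above; the architecture is an illustrative assumption).
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
X = torch.randn(5, 8)

def param_grad(x):
    # Gradient of the scalar output f(x) with respect to all parameters,
    # flattened into a single vector.
    grads = torch.autograd.grad(net(x.unsqueeze(0)).squeeze(),
                                tuple(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

G = torch.stack([param_grad(x) for x in X])   # one gradient row per input
K = G @ G.T                                   # empirical NTK Gram matrix
print(K.shape)                                # torch.Size([5, 5])
```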

 

5.5.3. Early Stopping

A new line of work (Rolnick et al., 2017) has revealed that in the setting of label noise, neural networks tend to fit cleanly labeled data first and only subsequently to interpolate the mislabeled data. Moreover, it has been established that this phenomenon translates directly into a guarantee on generalization: whenever a model has fitted the cleanly labeled data but not randomly labeled examples included in the training set, it has in fact generalized (Garg et al., 2021).
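A hedged sketch of how one might observe this ordering in practice: corrupt a fraction of the labels on a simple synthetic task, then track accuracy separately on the clean and mislabeled subsets over training; the clean subset is typically fitted first. The data-generating rule, noise rate, and model below are all illustrative assumptions.

```python
# Sketch: flip a fraction of the training labels and watch the model fit the
# clean subset before the mislabeled one. All settings are illustrative.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(1000, 20)
y_clean = (X[:, 0] > 0).long()                  # a simple "true" labeling rule
noisy = torch.rand(1000) < 0.2                  # corrupt 20% of the labels
y = torch.where(noisy, 1 - y_clean, y_clean)

net = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(201):
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()
    if epoch % 50 == 0:
        with torch.no_grad():
            pred = net(X).argmax(dim=1)
        clean_acc = (pred[~noisy] == y[~noisy]).float().mean().item()
        noisy_acc = (pred[noisy] == y[noisy]).float().mean().item()
        print(f"epoch {epoch:3d}  clean acc {clean_acc:.2f}  "
              f"mislabeled acc {noisy_acc:.2f}")
```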

 

Together these findings help to motivate early stopping, a classic technique for regularizing deep neural networks. Here, rather than directly constraining the values of the weights, one constrains the number of epochs of training. The most common way to determine the stopping criterion is to monitor validation error throughout training (typically by checking once after each epoch) and to cut off training when the validation error has not decreased by more than some small amount ϵ for some number of epochs. This is sometimes called a patience criterion. Besides the potential to lead to better generalization in the setting of noisy labels, another benefit of early stopping is the time saved.
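A minimal sketch of the patience criterion described above: stop once the validation error has failed to improve by more than ϵ for a fixed number of consecutive epochs. The helpers train_one_epoch and validation_error are hypothetical placeholders for whatever training and evaluation routines are in use.

```python
# Sketch of a patience-based early stopping loop. `train_one_epoch` and
# `validation_error` are hypothetical placeholders; epsilon and patience
# are tunable.
def fit_with_early_stopping(model, train_one_epoch, validation_error,
                            max_epochs=100, epsilon=1e-4, patience=5):
    best_err = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)            # checked once per epoch
        if err < best_err - epsilon:             # improved by more than epsilon
            best_err = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                # patience exhausted: stop early
    return model
```

In practice one usually also keeps a copy of the parameters from the best validation epoch and restores them after stopping; that detail is omitted here for brevity.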

 

Notably, when there is no label noise and datasets are realizable (the classes are truly separable, e.g., distinguishing cats from dogs), early stopping tends not to lead to significant improvements in generalization. On the other hand, when there is label noise, or intrinsic variability in the label (e.g., predicting mortality among patients), early stopping is crucial. Training models until they interpolate noisy data is typically a bad idea.

 

5.5.4. Classical Regularization Methods for Deep Networks

In deep learning implementations, weight decay remains a popular tool. However, researchers have noted that typical strengths of ℓ2 regularization are insufficient to prevent the networks from interpolating the data (Zhang et al., 2021), and thus the benefits, if interpreted as regularization, might only make sense in combination with the early stopping criterion. Absent early stopping, it is possible that, just like the number of layers or number of nodes (in deep learning) or the distance metric (in 1-nearest neighbor), these methods lead to better generalization not because they meaningfully constrain the power of the neural network but rather because they somehow encode inductive biases that are more compatible with the patterns found in datasets of interest. Thus, classical regularizers remain popular in deep learning implementations, even if the theoretical rationale for their efficacy may be radically different.
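For reference, here is a brief sketch of the two usual ways ℓ2 regularization (weight decay) enters a deep learning implementation: as an optimizer option, or as an explicit penalty added to the loss. The model, data, and coefficient below are illustrative assumptions, and the two options are alternatives, not meant to be combined.

```python
# Sketch: two common ways to apply L2 regularization (weight decay) in PyTorch.
# They are alternatives; use one or the other. Model, data, and the 1e-4
# coefficient are illustrative assumptions.
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(10, 2)
X, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

# Option 1: let the optimizer apply the decay at every update step.
opt = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)

# Option 2: add an explicit L2 penalty to the loss instead (the factor of 1/2
# makes the gradient match SGD's weight_decay=1e-4 convention).
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(net(X), y)
loss = loss + 0.5 * 1e-4 * sum((p ** 2).sum() for p in net.parameters())
loss.backward()
opt.step()
```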

 

 

5.5.5. Summary

Unlike classical linear models, which tend to have fewer parameters than examples, deep networks tend to be over-parametrized, and for most tasks are capable of perfectly fitting the training set. This interpolation regime challenges many firmly held intuitions. Functionally, neural networks look like parametric models. But thinking of them as nonparametric models can sometimes be a more reliable source of intuition. Because it is often the case that all deep networks under consideration are capable of fitting all of the training labels, nearly all gains must come from mitigating overfitting (closing the generalization gap). Paradoxically, the interventions that reduce the generalization gap sometimes appear to increase model complexity and at other times appear to decrease complexity. However, these methods seldom decrease complexity sufficiently for classical theory to explain the generalization of deep networks, and why certain choices lead to improved generalization remains for the most part a massive open question despite the concerted efforts of many brilliant researchers.