Scaling Laws
Core/Basics
History
Researchers like Leslie Valiant and Michael Kearns explored the relationship between model size and performance through the framework of probably approximately correct (PAC) learning and the VC (Vapnik-Chervonenkis) dimension (Valiant, 1984) of hypothesis classes. Their work showed that the sample complexity of a learning algorithm—that is, the number of examples required to generalize well—scales with the VC dimension of the hypothesis class, which quantifies the model’s capacity or complexity.
Motivation
Hyperparameter tuning is expensive, especially for large models. Scaling laws provide guidance for performing the tuning on smaller models and transferring the insights to larger ones.
Basics of Power Laws
Zipf’s law in NLP: the frequency of a word in a corpus is inversely proportional to its rank in the frequency table. In other words, the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
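As a quick illustration, the sketch below tallies word frequencies in an arbitrary plain-text corpus and checks that rank × frequency stays roughly constant, which is what a 1/rank power law predicts. The file `corpus.txt` is a placeholder path, not a dataset referenced in this article.

```python
from collections import Counter

# Minimal Zipf's-law check on any plain-text corpus ("corpus.txt" is a placeholder path).
words = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(words).most_common()

# If frequency ∝ 1/rank, then rank * frequency should stay roughly constant.
for rank, (word, freq) in enumerate(counts[:10], start=1):
    print(f"rank={rank:2d}  word={word!r:<12}  freq={freq:7d}  rank*freq={rank * freq}")
```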
Seminal Works
OpenAI Scaling Law
OpenAI researchers (Kaplan et al., 2020) demonstrated that, for large deep learning models, increasing model size, dataset size, and compute resources consistently reduces model loss in a power-law relationship. This work laid the groundwork for quantifying the performance gains achievable through scaling, showing that, under specific conditions, larger models trained with more data and compute can achieve significantly higher accuracy.
\[L(N, D) \propto N^{-\alpha} + D^{-\beta}\]
- L is the model loss
- N is the number of model parameters
- D is the dataset size (or # of tokens in the training set)
- α and β are scaling exponents that determine the impact of model size and dataset size on model loss
Fig. Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two. (Kaplan et al., 2020)
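A practical use of this relationship, in line with the motivation above, is to fit the exponent on a sweep of small models and extrapolate to larger ones. The sketch below fits the single-factor form \(L \approx a N^{-\alpha}\) (data and compute assumed not to be the bottleneck) by linear regression in log-log space; the (N, loss) pairs are made-up placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical (model size, validation loss) pairs from a sweep of small models;
# these numbers are illustrative placeholders, not measurements from Kaplan et al.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([4.20, 3.85, 3.52, 3.24, 2.98])

# Fit log L = log a - alpha * log N  (i.e., L ≈ a * N**-alpha) by least squares.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate the fitted power law to a model 10x larger than anything in the sweep.
N_big = 1e9
print(f"alpha ≈ {alpha:.3f}, predicted loss at N={N_big:.0e}: {a * N_big ** -alpha:.2f}")
```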
Chinchilla Scaling Law
The Chinchilla findings (Hoffmann et al., 2022) revealed that, particularly in compute-constrained settings, increasing data often yields greater benefits than increasing model size alone, leading to improved performance and cost-efficiency. In experiments, a smaller model (Chinchilla with 70 billion parameters) trained on a larger dataset outperformed a much larger model with 175 billion parameters trained on less data.
\[L(N, D) = 406.4N^{-0.34} + 410.7D^{-0.28} + 1.69\]
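A common way to use this fitted form is to ask, for a fixed FLOP budget \(C \approx 6ND\), which split between parameters and tokens minimizes the predicted loss. The sketch below does this with a simple numerical sweep over candidate model sizes; the search range and the example budget are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

# Chinchilla parametric fit: L(N, D) = E + A * N**-alpha + B * D**-beta
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    return E + A * N ** -alpha + B * D ** -beta

def compute_optimal(C_flops, num=10_000):
    """Minimize the predicted loss over (N, D) subject to the budget C ≈ 6 * N * D."""
    N = np.logspace(7, 13, num)      # candidate parameter counts (assumed search range)
    D = C_flops / (6.0 * N)          # tokens implied by the FLOP budget
    i = np.argmin(loss(N, D))
    return N[i], D[i]

# Illustrative budget of 1e24 FLOPs (an assumption, not a figure from the paper).
N_opt, D_opt = compute_optimal(1e24)
print(f"N* ≈ {N_opt:.2e} params, D* ≈ {D_opt:.2e} tokens")
```

With these exponents, the optimal N and D both grow roughly like \(\sqrt{C}\), which matches the paper's recommendation to scale parameters and tokens in approximately equal proportion.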
Key Points/FAQ(s)
- Scaling laws hold for many different kinds of phenomena, and across dataset size, compute, and number of parameters.
  - Loss vs. dataset size is linear on a log-log plot.
- They hold in many domains, including machine translation, speech recognition, language modeling, object recognition, etc.
- Conceptual Foundations
  - Why is the relationship a power law, i.e., linear on a log-log plot?
    - Toy task: when estimating the mean of Gaussian-distributed RVs from N samples, the expected squared error is σ²/N, so log(error) = 2 log(σ) - log(N), which is a scaling law (see the sketch after this list).
    - All classical regression models have 1/N scaling.
  - What do the exponents mean? They reflect the intrinsic (hidden) dimensionality d of the data:
    - log(error) = -(1/d) log(N)
- Double Descent Phenomenon
- Performance improves, then degrades, then improves again as model size, data, and compute increase.
- Scaling Hypothesis
- The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly, we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.
- AI critics often say that the long tail of scenarios for tasks like self-driving cars or natural language can only be solved by true generalization & reasoning; it follows, then, that if models solve the long tail, they must have learned to generalize & reason.
- The Bitter Lesson
- Throw scale, compute, and data at the hardest problems.
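As referenced in the Conceptual Foundations bullet above, here is a minimal sketch of the Gaussian-mean toy task: the squared error of the sample mean falls as σ²/N, so error versus N is a straight line on a log-log plot. The sample sizes and trial count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
trials = 1_000  # independent repetitions used to estimate the error at each N

# Toy task: estimate the mean of N(0, sigma^2) from N samples.
# E[(sample_mean)^2] = sigma^2 / N, so log(error) = 2*log(sigma) - log(N):
# a straight line with slope -1 on a log-log plot.
for N in [10, 100, 1_000, 10_000]:
    sample_means = rng.normal(0.0, sigma, size=(trials, N)).mean(axis=1)
    mse = np.mean(sample_means ** 2)
    print(f"N={N:>6}  empirical MSE={mse:.5f}  sigma^2/N={sigma**2 / N:.5f}")
```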
Why do big models work?
Big models work because they encode a dizzyingly vast number of sub-models in an extremely high-dimensional abstract space (Orseau et al., 2020), each interpolating over the data, one of which is likely to solve the problem well and so ensures the problem is soluble by the overall model. They function as an ensemble: even though there are countless overfit sub-models inside the single big model, they average out, leading to a preference for simple solutions. This Occam’s razor biases the model towards simple solutions that are flexible enough to gradually expand in complexity to match the data.
Practical Considerations
- Diminishing Returns: As model size and dataset size increase, the improvements in performance tend to slow down. This phenomenon, known as diminishing returns, suggests that doubling the model size may not always double the performance gain.
- Data Quality: Scaling is not purely about size; data quality plays a crucial role, particularly at larger scales. Datasets need to be diverse and high-quality to maximize scaling benefits, as noisy or low-quality data can lead to suboptimal outcomes.
Case Study: GPT-3 Scaling
Fig. GPT-3 did not use all that much compute: 3,640 petaflop/s-days, only about 2× the estimated 1,860 petaflop/s-days for AlphaGo Zero. Credits: (Gwern, 2022)
GPT-3 represents ~10³ on this chart, leaving plenty of room for further loss decreases, especially given the uncertainty in extrapolation:
Fig. Far beyond the model sizes we study empirically, we find a contradiction between our equations for \(L(C_{min})\) and \(L(D)\) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits. Credits: (Kaplan et al., 2020)
GPT-3 continues to scale as predicted. (Brown et al., 2020)
Fig. Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior observed in [KMH+20] continues for an additional two orders of magnitude with only small deviations from the predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts. Credits: (Brown et al., 2020)
References
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
- Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models. https://arxiv.org/abs/2203.15556
- Gwern. (2022). The Scaling Hypothesis [Blog]. https://gwern.net/scaling-hypothesis
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
- Mandliya, R. (2024). Scaling Laws in Large Language Models [Blog]. https://hackernoon.com/scaling-laws-in-large-language-models