Parameter Efficient Tuning (PET) Methods
LoRA: Low Rank Adaptation
(Hu et al., 2021) proposed an efficient training method to finetune or customize pre-trained LLMs without an excessive memory footprint. LoRA decomposes the weight update matrix \(\Delta W\) into two low-rank matrices \(A\) and \(B\) with inner dimension \(r\). An additional hyperparameter \(\alpha\) is used for scaling, so the adapted weight is \(W' = W + \frac{\alpha}{r} (A \times B)\).
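As a concrete illustration, here is a minimal sketch of a LoRA linear layer in PyTorch that follows the formula above. The class name, dimensions, and initialization scale are illustrative, not a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (randomly initialized here for the sketch).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Trainable low-rank factors; B starts at zero so training begins at W' = W.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # the alpha/r scaling from the formula above

    def forward(self, x):
        # W' x = W x + (alpha / r) * (B A) x, where B A is the low-rank
        # update (the A x B product in the text, written in matrix-shape order).
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=8, alpha=16)
y = layer(torch.randn(4, 768))  # only A and B receive gradients
```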
Fig. LoRA weight updates.
The hypothesis behind LoRA is that the pretrained weight matrix \(W\) needs to be large and of (near) full rank to capture all the knowledge in the pretraining dataset, but the update required for finetuning does not: we don't need to modify all the weights, and the core information for the adaptation can be captured in a much smaller number of parameters, i.e., a low-rank update.
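To make the savings concrete: for a \(d \times k\) weight matrix, LoRA trains only \(r(d + k)\) parameters instead of \(d \times k\). For example, with \(d = k = 4096\) and \(r = 8\), that is \(8 \times 8192 = 65{,}536\) trainable parameters per matrix instead of roughly \(16.8\) million.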
Key Takeaways and Important Points (Ongoing Research)
- Consistency: Benchmark results are surprisingly consistent across different runs, despite the inherent randomness of LLM training and of training models on GPUs.
- QLoRA (Dettmers et al., 2023), quantized LoRA, quantizes the pretrained weights to 4 bits to further reduce the memory footprint. These experiments found that QLoRA saves GPU memory, but at the cost of increased training runtime (caused by the additional quantization and dequantization of the pretrained model weights). Model performance with QLoRA was similar to LoRA.
- Adding a cosine annealing scheduler to LoRA finetuning noticeably improved SGD performance; however, it has almost no effect with the Adam and AdamW optimizers (see the scheduler sketch after this list).
- SGD vs. Adam optimizers: Adam maintains two moving averages for each model parameter: the first moment (mean) and the second moment (uncentered variance) of the gradients. In other words, Adam stores two additional values per model parameter in memory; for a 7B-parameter model, that's an extra 14B values to track during training. SGD doesn't need to track any additional state, so a natural question is what swapping Adam for SGD does to the peak memory requirements when training LLMs. The swap may not be worthwhile when LoRA's \(r\) is small, since the trainable adapter parameters are then only a tiny fraction of the total, but it can pay off as \(r\) grows (see the memory sketch after this list).
- Multi-epoch training might not benefit instruction finetuning and can even degrade the results. This performance decline is likely due to increased overfitting, which warrants additional investigation.
- Enabling LoRA for more layers, beyond just the query and value matrices, shows improved performance.
- A common rule of thumb for choosing \(\alpha\) is \(\alpha = 2r\).
- Data quality is very important for finetuning. According to the LIMA (Zhou et al., 2023) paper, a 65B Llama model finetuned on LIMA noticeably outperforms a 65B Llama model finetuned on Alpaca.
- Choosing the best rank: an \(r\) that is too large could result in more overfitting, while a small \(r\) may not be able to capture diverse tasks in a dataset. In other words, the more diverse the tasks in the dataset, the larger \(r\) should be.
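A minimal sketch of the cosine annealing setup mentioned above, using PyTorch's built-in scheduler; the optimizer, learning rate, and step count are illustrative, not the settings from the original experiments.

```python
import torch

# Dummy parameter standing in for LoRA's trainable A and B matrices.
params = [torch.nn.Parameter(torch.randn(10, 10))]

optimizer = torch.optim.SGD(params, lr=0.1)
# Decays the learning rate along a half-cosine from 0.1 toward ~0 over T_max steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
```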
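And a back-of-the-envelope sketch of the Adam-vs-SGD memory argument: Adam keeps two extra values per trainable parameter (assumed fp32 here), while plain SGD without momentum keeps none. The LoRA parameter counts below are hypothetical, chosen only to contrast a small and a large \(r\).

```python
def optimizer_state_gb(n_trainable, states_per_param, bytes_per_value=4):
    """Extra optimizer-state memory in GB, assuming fp32 state tensors."""
    return n_trainable * states_per_param * bytes_per_value / 1e9

configs = {
    "full finetuning (7B)": 7e9,
    "LoRA, small r (hypothetical)": 4e6,
    "LoRA, large r (hypothetical)": 250e6,
}

for name, n in configs.items():
    adam = optimizer_state_gb(n, states_per_param=2)  # first + second moments
    sgd = optimizer_state_gb(n, states_per_param=0)   # plain SGD: no extra state
    print(f"{name}: Adam ~{adam:.2f} GB extra, SGD ~{sgd:.2f} GB extra")
```

With a small \(r\), Adam's extra state is negligible next to the frozen model weights, which is why swapping in SGD buys little there.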
Apple Intelligence Foundation Language Models (Gunter et al., 2024) use LoRA for on-device task specialization of LLMs.
(Biderman et al., 2024) experimentally demonstrate that LoRA learns less and forgets less compared to full finetuning. Full finetuning is better at absorbing new knowledge from more distant domains but leads to more forgetting of previously learned tasks; LoRA, by changing fewer parameters, learns less new information but retains more of the original capabilities. More details in (Raschka, 2023).
QLoRA: Quantized LoRA
In (Dettmers et al., 2023), the pretrained base-model weights (not the low-rank adapter matrices) are quantized, meaning their numerical precision is reduced to 4 bits. This is done by mapping the continuous range of values in these matrices to a limited set of discrete levels (the NF4 data type). This reduces the model's memory footprint, as lower-precision numbers are less memory-intensive to store, while the LoRA adapters themselves are trained in higher precision.
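A simplified sketch of the quantize/dequantize round trip in plain PyTorch. Note that this uses uniform block-wise absmax quantization for clarity; QLoRA itself uses the NF4 data type with double quantization, so treat this only as an illustration of the idea.

```python
import torch

def quantize_4bit(w: torch.Tensor, block_size: int = 64):
    """Block-wise absmax quantization to 16 signed levels (4 bits)."""
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values  # one scale per block
    q = torch.clamp(torch.round(blocks / absmax * 7), -8, 7)
    return q.to(torch.int8), absmax

def dequantize_4bit(q: torch.Tensor, absmax: torch.Tensor):
    """Map the discrete levels back to approximate float values."""
    return q.float() / 7 * absmax

w = torch.randn(4096 * 64)           # stand-in for a flattened pretrained weight
q, absmax = quantize_4bit(w)
w_hat = dequantize_4bit(q, absmax).reshape(w.shape)
print((w - w_hat).abs().mean())      # small per-weight reconstruction error
```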
DoRA: Weight-Decomposed Low-Rank Adaptation
(Liu et al., 2024) extended LoRA by first decomposing a pretrained weight matrix into two parts: a magnitude vector \(m\) and a directional matrix \(V\). This decomposition is rooted in the idea that any vector can be represented by its length (magnitude) and direction (orientation), and here we apply it to each column vector of a weight matrix. Once we have \(m\) and \(V\), DoRA applies LoRA-style low-rank updates only to the directional matrix \(V\), while allowing the magnitude vector \(m\) to be trained separately.
Fig. Annotated Illustration of DoRA from (Raschka, 2023)
This two-step approach gives DoRA more flexibility than standard LoRA. Rather than scaling magnitude and direction together, as LoRA updates tend to do, DoRA can make subtle directional adjustments without necessarily increasing the magnitude. The result is improved performance and robustness: DoRA can outperform LoRA even when using fewer parameters, and it is less sensitive to the choice of rank.
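A minimal sketch of a DoRA-style linear layer following the decomposition above (see also Raschka's from-scratch blog in the references). The class and variable names are ours, and the pretrained weight is randomly initialized for the example.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        W = torch.randn(out_features, in_features)  # stands in for pretrained weights
        self.V = nn.Parameter(W, requires_grad=False)       # frozen directional matrix
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True))  # trainable column magnitudes
        # LoRA-style low-rank factors applied to the directional component only.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Low-rank update of the direction; at init B = 0, so W' equals W.
        V_adapted = self.V + self.scaling * (self.B @ self.A)
        # Normalize each column to unit length, then rescale by the learned magnitudes.
        direction = V_adapted / V_adapted.norm(dim=0, keepdim=True)
        return x @ (self.m * direction).T

layer = DoRALinear(768, 768, r=8, alpha=16)
y = layer(torch.randn(4, 768))  # m, A, and B receive gradients; V stays frozen
```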
References
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. https://arxiv.org/abs/2305.14314
- Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., & Levy, O. (2023). LIMA: Less Is More for Alignment. https://arxiv.org/abs/2305.11206
- Gunter, T., Wang, Z., Wang, C., Pang, R., Narayanan, A., Zhang, A., Zhang, B., Chen, C., Chiu, C.-C., Qiu, D., Gopinath, D., Yap, D. A., Yin, D., Nan, F., Weers, F., Yin, G., Huang, H., Wang, J., Lu, J., … Ren, Z. (2024). Apple Intelligence Foundation Language Models. https://arxiv.org/abs/2407.21075
- Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., & Cunningham, J. P. (2024). LoRA Learns Less and Forgets Less. https://arxiv.org/abs/2405.09673
- Raschka, S. (2023). Noteworthy AI Research Papers of 2024 (Part One) [Blog]. https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-1
- Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. https://arxiv.org/abs/2402.09353
- Raschka, S. (2023). Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch [Blog]. https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch