
The Last Rank We Need? QDyLoRA’s Vision for the Future of LLM Tuning

Abstract and 1. Introduction

  2. Proposed Method: Quantized DyLoRA
  3. Experiments and Evaluation
  4. On the semi-sorted behavior of QDyLoRA
  5. Conclusion, Limitations, and References

A. Supplementary Material

A.1. Hyperparameters

A.2. Generated Text Quality

5 Conclusion

QDyLoRA offers an efficient and effective technique for LoRA-based fine-tuning of LLMs on downstream tasks. Its two main advantages are eliminating the need to fine-tune multiple models to find the optimal LoRA rank, and offering the possibility of fine-tuning larger LLMs on the same memory budget. The experimental results demonstrated that the optimal rank for QDyLoRA can be surprisingly low, yet it consistently outperforms QLoRA. QDyLoRA provides greater flexibility for deploying LLMs in various contexts and represents a promising step towards making fine-tuning of large language models more accessible and efficient.
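
QDyLoRA inherits its rank flexibility from DyLoRA: at each training step a rank is sampled and only the corresponding slice of the LoRA factors is used, so a single run covers a whole range of ranks. The snippet below is a minimal sketch of that mechanism, not the authors' implementation; the class name, initialization scale, and alpha/rank scaling rule are assumptions, and in the actual method the frozen base weight would be stored in 4-bit quantized form.

```python
# Minimal sketch of the dynamic-rank idea behind QDyLoRA (illustrative only).
import torch
import torch.nn as nn


class DynamicLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, max_rank=64, alpha=16.0):
        super().__init__()
        # Frozen base weight; in QDyLoRA this would be the 4-bit quantized weight.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, max_rank))
        self.max_rank = max_rank
        self.alpha = alpha

    def forward(self, x, rank=None):
        # Sample a working rank b for this step, or pin one at inference time.
        b = rank if rank is not None else int(torch.randint(1, self.max_rank + 1, (1,)))
        A = self.lora_A[:b, :]   # first b rows of A
        B = self.lora_B[:, :b]   # first b columns of B
        # Scale by alpha / b so updates stay comparable across sampled ranks.
        return self.base(x) + (self.alpha / b) * (x @ A.T @ B.T)
```

At deployment, the same module is simply called with a fixed `rank=b`, which is what makes it possible to pick whichever rank fits the target context without retraining.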

Limitations

While 4-bit QDyLoRA exhibits notable performance, it falls short of the performance of full-precision fine-tuning. One possible solution could be dynamic quantized DyLoRA (DyQDyLoRA), in which the quantization level also varies during fine-tuning; in particular, the fine-tuning strategy could dynamically switch between different quantization levels based on a predefined learning feedback. Additionally, further research is required to investigate the impact of LoRA's scalar and of the range of underlying ranks in QDyLoRA.
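
DyQDyLoRA is only sketched here as future work, so any code can only be speculative. The function below illustrates one way the "predefined learning feedback" could be realized, using recent validation losses to decide when to step up from 4-bit to a higher-precision level; the function name, the level set, and the plateau heuristic are all assumptions, not part of the paper.

```python
# Speculative sketch of a DyQDyLoRA-style schedule: pick a quantization
# bit-width from recent validation losses (hypothetical helper, not a real API).

def dyqdylora_schedule(val_losses, levels=(4, 8, 16), patience=2, tol=1e-3):
    """Start at the lowest precision; if the loss has plateaued for
    `patience` consecutive evaluations, step up to the next level."""
    level_idx = 0
    stall = 0
    for prev, curr in zip(val_losses, val_losses[1:]):
        if prev - curr < tol:          # no meaningful improvement
            stall += 1
        else:
            stall = 0
        if stall >= patience and level_idx < len(levels) - 1:
            level_idx += 1             # e.g. move from 4-bit to 8-bit
            stall = 0
    return levels[level_idx]


print(dyqdylora_schedule([3.0, 2.9999, 2.9998, 2.9997]))  # -> 8 (plateau detected)
```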


A.1 Hyperparameters

Table 4 provides an overview of the hyperparameters and experimental configurations employed in this study. Key parameters shared across experiments include the choice of optimizer, the Adam-Beta2 value, the maximum gradient norm, and the warmup ratio, which together govern how the model adjusts its weights during training. LoRA-specific parameters, namely the LoRA dropout probability, the maximum LoRA rank, and the alpha value, control the behavior of the LoRA layers. Double quantization and the quantization type determine the precision of the numerical representations within the model and are kept identical to the baselines. The learning-rate schedule and weight decay help prevent overfitting and stabilize training, random seeds ensure reproducibility, and the specified GPU determines the hardware used for training. Each configuration, whether for WebGLM, GSM8k, or the experiment reported in Table 1, is tailored to the characteristics of the dataset and the computational resources available.

Table 4: The list of hyperparameters employed across the various experiments in our study. The common settings apply to every experiment. If a parameter is not explicitly mentioned, we use the default values of HuggingFace Transformers v4.28.0.dev0 (source: https://huggingface.co/docs/transformers).
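
Table 4 itself is not reproduced here. As an illustration of how such a configuration is expressed in the HuggingFace ecosystem the paper builds on, the snippet below wires the named knobs (optimizer, Adam-Beta2, maximum gradient norm, warmup ratio, LoRA dropout, maximum rank, alpha, double quantization, quantization type, learning-rate schedule, weight decay, seed) into standard config objects. The concrete values are common QLoRA-style defaults, not the paper's Table 4 entries, and current transformers/peft APIs are used rather than the exact library version above.

```python
# Illustrative configuration; values are QLoRA-style defaults, NOT Table 4 entries.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # double quantization, as in the baselines
    bnb_4bit_quant_type="nf4",             # quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,                  # maximum LoRA rank; QDyLoRA trains all ranks up to r
    lora_alpha=16,         # LoRA alpha (the scalar discussed in the Limitations)
    lora_dropout=0.05,     # LoRA dropout probability
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qdylora-out",
    optim="paged_adamw_32bit",     # optimizer
    adam_beta2=0.999,              # Adam-Beta2
    max_grad_norm=0.3,             # maximum gradient norm
    warmup_ratio=0.03,             # warmup ratio
    lr_scheduler_type="constant",  # learning-rate schedule
    weight_decay=0.0,              # weight decay
    seed=42,                       # random seed for reproducibility
)
```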

A.2 Generated Text Quality

To illustrate the quality of text generated by QDyLoRA, a sample query taken from GSM8k is fed to Falcon-40b fine-tuned with QDyLoRA. Below, we report the generated answers for different LoRA ranks.

Query: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?

Rank=64: Janet has 16 – 3 – 4 = «16-3-4=7»7 eggs left. She sells 7 – 16 = «7-16=9»9 eggs at the farmers’ market. She makes 9 * 2 = «9*2=18»18 at the farmers’ market every day.

Rank=32: She eats 3 eggs for breakfast and bakes 4 muffins per day, so she uses 3 + 4 = «3+4=7»7 eggs per day. She has 16 – 7 = «16-7=9»9 eggs left over each day. She sells these eggs at the farmers’ market for $2 per egg, so she makes 2 * 9 = «2*9=18»18 per day at the market.

Rank=28: She eats 3 eggs for breakfast and bakes 4 muffins per day, so she uses 3+4 = «3+4=7»7 eggs per day. She sells 16-7 = «16-7=9»9 eggs per day for $2 per egg. She makes 9*2 = «9*2=18»18 per day at the market.

Rank=16: Janet’s ducks lay 16 eggs per day and she eats 3 for breakfast every morning so she has 16-3=13 eggs left. Janet bakes 4 muffins every day and she uses 4 eggs so she has 13-4=9 eggs left. Janet sells the remaining eggs at the farmers’ market for $2 each so she makes 9*2=«9*2=18»18 dollars a day. Janet makes 18 dollars a day at the farmers’ market.
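
All four answers above come from a single QDyLoRA checkpoint: switching the rank at inference time only truncates the stored LoRA factors before the adapter is applied. A self-contained sketch of that truncation follows; the tensor names, dimensions, and alpha/rank scaling are illustrative assumptions rather than the authors' code.

```python
# Sketch of rank selection at inference: one set of trained LoRA factors is
# queried at several ranks by slicing, mirroring the per-rank samples above.
import torch


def lora_delta(x, lora_A, lora_B, rank, alpha=16.0):
    """LoRA contribution at a chosen rank <= max_rank."""
    A = lora_A[:rank, :]          # keep the first `rank` rows of A
    B = lora_B[:, :rank]          # and the first `rank` columns of B
    return (alpha / rank) * (x @ A.T @ B.T)


max_rank, d_in, d_out = 64, 32, 32
lora_A = torch.randn(max_rank, d_in) * 0.01   # stand-ins for trained weights
lora_B = torch.randn(d_out, max_rank) * 0.01
x = torch.randn(1, d_in)

# The same adapter answers at ranks 64, 32, 28, and 16, as in the samples above.
for b in (64, 32, 28, 16):
    print(b, lora_delta(x, lora_A, lora_B, b).norm().item())
```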


This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.
