Running Quantized Code Models on a Laptop Without a GPU
Table of Links
-
Abstract and Introduction
-
Related Works
2.1 Code LLMs
2.2 Quantization
2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
2.5 Low- and high-resource languages
-
Methodology
3.1 Run-time environment
3.2 Choice of LLMs
3.3 Choice of benchmarks
3.4 Evaluation procedure
3.5 Model parameters and 3.6 Source code and data
-
Evaluation
4.1 Pass@1 rates
4.2 Errors
4.3 Inference time
4.4 Lines of code and 4.5 Comparison with FP16 models
-
Discussion
-
Conclusions and References

3 Methodology
3.1 Run-time environment
All models were run in a Python environment on a Windows 11 machine, using Python 3.12.4 with Miniconda 24.4. The llama-cpp-python[5] package was used to load and run the quantized models within the Python environment. llama-cpp-python provides a high-level Python interface to the llama.cpp library written in C/C++. llama.cpp was specifically designed for quantizing LLMs and for working with quantized models in the GGUF format. Compared to other solutions, such as the Transformers[6] API from HuggingFace, llama-cpp-python is more efficient and has the least impact on performance when working with quantized models. The library supports both GPU-based and CPU-only inference.
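The snippet below is a minimal sketch of how a quantized GGUF model can be loaded and queried through llama-cpp-python for CPU-only inference. The model file name, context size, thread count, and generation parameters are illustrative assumptions, not the exact settings used in the study.

```python
from llama_cpp import Llama

# Load a quantized GGUF model for CPU-only inference.
# n_gpu_layers=0 keeps every layer on the CPU; n_threads is matched to the CPU core count.
llm = Llama(
    model_path="models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf",  # illustrative file name
    n_ctx=4096,        # context window size (assumed)
    n_threads=12,      # CPU threads to use (assumed)
    n_gpu_layers=0,    # no layers offloaded to a GPU
)

# Generate a completion for a simple code-generation prompt.
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    max_tokens=256,
    temperature=0.0,
)

print(output["choices"][0]["message"]["content"])
```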
Concerning hardware, we used a consumer Dell Latitude 5440 laptop with an Intel Core i5-1335U 1.30GHz processor with 12 CPU cores, 16GB of DDR4 RAM, a BG6 KIOXIA NVMe SSD, and no dedicated GPU. Therefore, all inference was done purely on the CPU. This device represents a generic work laptop typically used by different consumer segments such as businesses, academia, and students.
3.2 Choice of LLMs
This section addresses research question RQ1. The code LLMs were chosen based on licensing, comparative performance, and computational demands that fit the limitations of the hardware specified in the preceding section.
Table 1 summarizes the evaluated models. The following LLMs trained for code generation were selected for this study: DeepSeek Coder 6.7B Instruct [19], CodeQwen 1.5 7B Chat [18], CodeLlama 7B Instruct [17], StarCoder2 7B [21], and CodeGemma 7B [20]. As of August 14th, 2024, these models were ranked among the top in the Multilingual Code Models Evaluation leaderboard. This leaderboard ranks multilingual code-generation models based on their performance on the HumanEval [28] and MultiPL-E [33] benchmarks.
To maximize the diversity of models, only original models were considered, and fine-tuned offshoots of these models were ignored. For example, Artigenz Coder DS 6.7B, while ranked high on the leaderboard, is a fine-tuned version of DeepSeek Coder 6.7B and was not included in this study.
Only small models with 7 billion parameters or fewer were considered to ensure that the quantized models can be run reasonably well on consumer devices. Lastly, all these models employ a free-to-use license, albeit with certain restrictions (e.g., output from CodeLlama cannot be used to train other models).
For each of the five models, we tested 2-, 4-, and 8-bit integer weights-only quantized versions in the GPT-Generated Unified Format (GGUF)[7]. All models were downloaded from HuggingFace’s model repository. If multiple similar versions of the same model were available, the version with the highest download count was used. 2- and 8-bit quantizations are the most common quantizations at the lower and higher precision ends, while 4-bit quantization is often recommended as a well-balanced trade-off between quality and size [16]. We test whether this observation also applies to code LLMs. This setup allows us to establish a correlation between quantization precision and performance, thereby addressing research questions RQ2 and RQ3.
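As a sketch of how the GGUF files can be fetched from HuggingFace's model repository, the example below uses the huggingface_hub library to download 2-, 4-, and 8-bit quantizations of one model. The repository id and file names are assumptions for illustration; the exact repositories and quantization variants used in the study may differ.

```python
from huggingface_hub import hf_hub_download

# Illustrative (assumed) GGUF file names for 2-, 4-, and 8-bit quantizations.
QUANTIZED_FILES = {
    "Q2_K": "deepseek-coder-6.7b-instruct.Q2_K.gguf",
    "Q4_K_M": "deepseek-coder-6.7b-instruct.Q4_K_M.gguf",
    "Q8_0": "deepseek-coder-6.7b-instruct.Q8_0.gguf",
}

for quant, filename in QUANTIZED_FILES.items():
    # Download each GGUF file from an assumed HuggingFace repository.
    path = hf_hub_download(
        repo_id="TheBloke/deepseek-coder-6.7B-instruct-GGUF",  # assumed repo id
        filename=filename,
    )
    print(f"{quant}: downloaded to {path}")
```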
Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT, University College Cork, Cork, Ireland, T12 XF62 ([email protected]).
[5] https://llama-cpp-python.readthedocs.io/en/latest/

[6] https://huggingface.co/docs/transformers