The impact of Large Language Models (LLMs), and how generally incredible they are, hardly needs explaining. However, not everyone has the economic freedom to host them at scale for their business, in large part because LLMs carry hefty minimum GPU requirements. Quantization methods (algorithms that convert a number that initially requires 32 bits of memory into one that uses only 4 bits, i.e. sacrificing mathematical precision for memory efficiency) such as AWQ [1] and GPTQ [2] have lowered the barrier to entry for state-of-the-art LLMs and enabled end users to operate them more affordably, but there remains a cost-versus-performance tradeoff.
Building robust LLM systems that are performant and deployable in a range of form factors is key to making this technology accessible. We explored the cost-versus-performance tradeoff introduced by quantization techniques and identified methods that mitigate the performance loss incurred post-quantization, effectively delivering high-quality LLM responses at a tenth of the compute cost; we demonstrated this by running a relatively massive 70-billion-parameter language model on ~$4,000 worth of GPUs as opposed to $24,000 worth. In this article we provide a gentle introduction to quantization in LLMs and its promise for reducing operational costs, and then delve into explorations undertaken at Yurts to identify processes that minimize the performance degradation observed in RAG-based tasks.
LLMs perform billions of basic math operations to generate their responses, and they use "parameters", or individually tuned numbers, to do that math. Even though each mathematical operation is simple, together they add up to a very complicated algorithm - and cost scales with that complexity. At the moment, the commonly used LLM form factors are 7B, 13B, and 70B, where the number represents the count of parameters in the LLM (the "B" stands for billion). These parameters are arranged in a structure known as an architecture; this is what names like "Mistral" and "Llama" refer to: an arrangement of parameters on which we apply operations. When a user inputs text, that text interacts with these parameters and ultimately generates some output - in this case, also text.
Generally speaking, the more bits used to represent each of these numbers (i.e. the higher the floating-point precision), the "better" the model's responses can be - but more precision means each parameter takes up more memory and each computation consumes more resources. Traditionally, these models use 32 bits, or 4 bytes [3] (floating point 32; FP32), to represent a single parameter. As a result, the minimum amount of memory required to load a model for the common form factors is 28 GB (7B * 4 bytes = 28 billion bytes, or 28 GB), 52 GB, and 280 GB respectively. These values denote the minimum GPU memory required just to load the model, and do not include the additional memory needed to run a text generation application on top of it.
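As a back-of-the-envelope check, the sketch below reproduces this arithmetic for the weights alone; activations, the KV cache, and other runtime overhead are extra.

```python
# Minimal sketch: minimum memory to hold model weights at FP32.
BYTES_PER_PARAM_FP32 = 4

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    gb = params * BYTES_PER_PARAM_FP32 / 1e9   # using 1 GB = 1e9 bytes, as in the text
    print(f"{name}: {gb:.0f} GB just to load the weights")
# 7B: 28 GB, 13B: 52 GB, 70B: 280 GB
```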
While many factors contribute to the overall cost of running LLMs for enterprises, the GPU is generally the most significant hardware limitation today, both in cost and in time to procurement. The cloud makes time to procurement a non-issue, but cost remains significant. To put some numbers behind these statements, let's dive a little deeper into GPU cost. Starting with the bare minimum requirement, in today's market these costs would roughly be:
Unfortunately, GPU cost doesn’t scale in a linear fashion with memory requirements. GPUs that have more memory are technologically more advanced, and thus each GB of GPU memory costs more on those devices. For example an A2 with 16 GB of memory costs around $84 per GB, an A10 with 24 GB costs around $130 per GB, and an A40 with 48 GB costs around $280 per GB. This means that running a 70B parameter model is more than 10 times more expensive than running a 7B parameter model - it’s closer to 30 times more expensive if one uses optimal GPU hardware for both. See Figure 1 for the GPU procurement cost per GB of memory for current data center GPUs.
One can use multiple GPUs to meet the memory requirement. However, this often introduces performance bottlenecks, as data now has to travel inefficiently throughout the system. It is highly preferable to run a model on a single GPU.
While companies with seemingly infinite budgets and compute infrastructure race to develop the best models, practitioners and researchers need efficient ways to experiment with and deploy models without the enormous compute and budget requirements. One common way to accomplish this is to reduce the precision of the parameters, formally known as quantization. In published studies and in our own experiments, reducing precision from 4 bytes per parameter (float32) to 2 bytes (float16) led to relatively small performance degradation compared to the amount of memory saved. This reduction in precision means the model uses half as much memory, effectively cutting operational requirements by 50% for the same model.
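A quick way to see the saving is to cast a tensor from float32 to float16 and compare its storage size, as in the toy sketch below (a stand-in for real model weights).

```python
import torch

# Toy tensor standing in for model weights; halving precision halves the memory.
params_fp32 = torch.randn(1_000_000, dtype=torch.float32)
params_fp16 = params_fp32.to(torch.float16)

bytes_fp32 = params_fp32.element_size() * params_fp32.nelement()   # 4 bytes per parameter
bytes_fp16 = params_fp16.element_size() * params_fp16.nelement()   # 2 bytes per parameter
print(bytes_fp32 / 1e6, "MB vs", bytes_fp16 / 1e6, "MB")            # 4.0 MB vs 2.0 MB
```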
Is this the limit? At what point does quantization start making LLMs worse? Practitioners and researchers have found that storing every parameter as a 2-byte float performs on par with models at higher precision (4 or 8 bytes per parameter). However, reducing each parameter to 1 byte or less degrades performance substantially, causing models to generate text that is grammatically incorrect and at times nonsensical.
This trend of fewer bytes per parameter hurting text generation held widely until Lin et al. at MIT, SJTU, and Tsinghua University made an interesting discovery about LLMs and subsequently developed a technique that enables quantization down to 4 bits (0.5 bytes) per parameter without incurring a large drop in performance. In their 2023 paper, AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, Lin et al. [1] observed that only about 1% of parameters contribute significantly to LLM outputs. What this effectively means is that nearly 99% of the LLM can be quantized down to a lower precision, as long as the specific ~1% of the model that matters most retains a higher precision (such as float16).
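To build intuition for that observation, here is an illustrative sketch (not the paper's actual algorithm, which ultimately scales weights rather than mixing precisions) that ranks a layer's input channels by average activation magnitude and keeps the most salient ~1% at higher precision. The tensor shapes are made up.

```python
import torch

def split_salient_channels(weight: torch.Tensor, activations: torch.Tensor, keep_frac: float = 0.01):
    """weight: (out_features, in_features); activations: (num_tokens, in_features)."""
    # Average absolute activation per input channel -- a rough proxy for saliency.
    channel_scale = activations.abs().mean(dim=0)              # (in_features,)
    k = max(1, int(keep_frac * weight.shape[1]))
    salient_idx = torch.topk(channel_scale, k).indices         # ~1% most activated channels
    mask = torch.zeros(weight.shape[1], dtype=torch.bool)
    mask[salient_idx] = True
    salient_cols = weight[:, mask].half()    # keep these columns in float16
    other_cols = weight[:, ~mask]            # candidates for 4-bit quantization
    return salient_cols, other_cols, mask

# Toy tensors standing in for one linear layer and its recorded activations.
w = torch.randn(4096, 4096)
acts = torch.randn(512, 4096)
salient, rest, mask = split_salient_channels(w, acts)
print(int(mask.sum()), "of", w.shape[1], "input channels kept at higher precision")
```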
Since most of the model's parameters are quantized and stored at the lower precision, we can use that precision (int4) to estimate the new memory requirements. With parameters stored at 0.5 bytes each, the minimum GPU memory for the 7B, 13B, and 70B LLMs becomes 3.5 GB (7B * 0.5 bytes = 3.5 billion bytes), 6.5 GB, and 35 GB respectively (down from the earlier requirements of 28, 52, and 280 GB). The same GPU memory that was needed to run a 7B model at its highest precision can now host the largest 70B model. This means we can run the largest models on GPUs that cost ~$4,000 instead of $24,000.
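Continuing the earlier back-of-the-envelope arithmetic, the sketch below checks which of the example GPUs could hold the weights at roughly 0.5 bytes per parameter; runtime overhead such as the KV cache is ignored.

```python
# Minimal sketch: which single GPUs (from the earlier examples) can hold
# the weights of a 4-bit (~0.5 bytes/param) quantized model.
BYTES_PER_PARAM_INT4 = 0.5
gpu_memory_gb = {"A2": 16, "A10": 24, "A40": 48}

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    need_gb = params * BYTES_PER_PARAM_INT4 / 1e9
    fits_on = [g for g, mem in gpu_memory_gb.items() if mem >= need_gb]
    print(f"{name}: {need_gb:.1f} GB -> fits on {fits_on or 'multi-GPU only'}")
# 7B: 3.5 GB, 13B: 6.5 GB, 70B: 35 GB (now within a single 48 GB card)
```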
“Sounds too good to be true, there must be a catch!” There's always a catch, right? There absolutely is. While this technique attempts to minimize the error introduced by quantization, the quantized model still does not perform quite as well as the original. Problems such as strange character artifacts and nonsensical responses appear more often in the quantized model. However, AWQ greatly improves performance over the naive approach used in early quantization techniques of simply rounding each weight to the nearest representable value, with a negligible increase in memory overhead. It's also important to note that larger models remain more capable even after quantization brings their overall size down; a larger quantized model may perform better than a smaller non-quantized model with the same memory footprint.
A crucial part of the AWQ method is the provision of a calibration dataset. This dataset is used during quantization to evaluate how much groups of parameters affect the overall output. Parameters with a larger impact are adjusted (a step known as scaling) to minimize the difference between the original model and the quantized one. You can think of it as feeding some data through both the original and the quantized model and adjusting the quantized values to best match the original model's outputs. In the original paper, it is stated that:
“Our method is less sensitive to the [data] distribution since we only measure the average activation scale from the [data], which is more generalizable across different dataset distributions.”
What this means is that the authors were able to generate high-quality AWQ-quantized models largely independent of the dataset used. Although the authors make this claim, at Yurts we discovered during our explorations that the data used for quantization does have a significant impact on the resulting model, especially for a RAG-based system. Depending on the calibration data used, we saw anywhere from a 1% to 10% difference in similarity to the expected non-quantized response. In addition to the memory benefits of AWQ, we have also seen a roughly 2x increase in token generation speed with AWQ in general, which ultimately means that chat is twice as fast; this is due to kernel-level optimizations.
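To make the calibration step concrete, below is a minimal sketch of quantizing a model against a custom calibration set with the AutoAWQ library [6]. The model path, quantization settings, and calibration texts are illustrative, and the exact `calib_data` argument may vary between AutoAWQ versions, so treat this as a sketch rather than a recipe.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-instruct-awq-custom"          # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Instead of the default Pile Eval calibration split, pass in-domain text
# (e.g., instruction/RAG-style prompts and answers) as the calibration data.
calibration_texts = [
    "Summarize the following passage about GPU memory requirements...",
    "Answer the question using only the provided context...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_texts)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```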
To demonstrate the importance of the calibration dataset used during AWQ quantization, Table 1 displays a few prompts from the Natural Instructions dataset [4] along with responses from both a commercial off-the-shelf AWQ model and Yurts' AWQ model. Here we compare two AWQ models based on Mistral Instruct 7B: one is the highly downloaded AWQ model provided by TheBloke1 [5] on Hugging Face, and the other is calibrated on an internal Yurts instructional dataset. All the data used for this blog can be found here; it includes 100 randomly sampled prompts and responses with their similarity scores.
1. These models are publicly available and commonly downloaded on Hugging Face. They were quantized with casper-hansen's AutoAWQ library [6], which defaults to using the Pile Eval [7] dataset for calibration/quantization.
To quantify the differences, we compared each model's responses with the "expected responses" generated by the non-quantized model. This was done by embedding[2] each model response and then measuring how similar it is to the expected response via cosine similarity. Scores range from 0 to 1.0, with 1.0 being identical and 0 being completely dissimilar. Figure 2 shows the results across 100 responses.
2. Embeddings were generated using Sentence Transformers' [8] sBERT model with all-mpnet-base-v2 weights.
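As a rough illustration of this scoring step, here is a minimal sketch using Sentence Transformers [8] with the all-mpnet-base-v2 weights; the two responses are placeholders, not actual model outputs.

```python
from sentence_transformers import SentenceTransformer, util

# Embed a quantized model's response and the non-quantized model's "expected"
# response, then score them with cosine similarity (1.0 = identical).
embedder = SentenceTransformer("all-mpnet-base-v2")

expected_response = "Quantization reduces memory by storing weights at lower precision."
quantized_response = "By storing weights in fewer bits, quantization lowers memory use."

embeddings = embedder.encode([expected_response, quantized_response], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")
```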
From this evaluation, it's clear that the calibration dataset does indeed have a significant impact on the quantization process, both qualitatively and quantitatively. Using our internal instructional dataset, the responses are visibly much better aligned with natural human responses. We find that over half of the responses from Yurts' model are nearly identical to the original model's, compared to under 10% of the responses from the publicly available AWQ-quantized model on Hugging Face.
While we continue to explore optimizations to this quantization procedure, Yurts already provides the ability to quantize models against specific datasets (and, in the future, collections) and to seamlessly deploy them for application use. Quantization takes approximately 30 minutes against our internal RAG dataset. We believe that models quantized on targeted datasets using AWQ serve as a cost-effective alternative to the ever-increasing demands of the latest state-of-the-art models. We use these models internally, and we hope to bring them to your Yurt in the near future.
[1] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
[2] Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[3] Wikipedia contributors. "Single-precision floating-point format." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 26 Apr. 2024. Web. 15 May. 2024.
[4] Mishra, Swaroop, et al. "Cross-Task Generalization via Natural Language Crowdsourcing Instructions." Proceedings of ACL (2022). https://instructions.apps.allenai.org
[5] TheBloke, https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-AWQ
[6] casper-hansen, https://github.com/casper-hansen/AutoAWQ
[7] Gao, Leo, et al. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv preprint arXiv:2101.00027 (2020).
[8] Reimers, Nils, et al. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).