Discover the recent breakthrough in LLM weight compression in this week’s Five-Minute Friday episode. Jon walks listeners through the SpQR approach, a new near-lossless LLM weight compression technique that leverages quantization.
Wouldn’t it be nice if you could compress larger Large Language Models and fit them on a single consumer GPU? In a recent breakthrough, a group of international collaborators unveiled SpQR, the Sparse-Quantized Representation technique, which brings near-lossless LLM weight compression to the forefront. SpQR gave the authors a 15% inference speedup with no reduction in model accuracy, and it’s all thanks to the widely-used approach of quantization.
Model quantization reduces memory usage and speeds up computations by representing model parameters and model activations with lower-precision values. SpQR is the first quantization method to reach the compression ratios of other quantization methods while being near-lossless. In most cases, fewer than 1% of the model’s weights, the outliers, account for over 75% of the overall error introduced by quantization. Since these weights lead to high, irreducible error, SpQR just keeps them intact. Because these exceptional cases usually make up less than 1% of the parameters, keeping them intact has a negligible effect on compression while avoiding any noticeable decrease in the model’s accuracy. Tune in to hear Jon walk you through the four steps behind this new SpQR algorithm in this week’s episode.
ITEMS MENTIONED IN THIS PODCAST:
- Anthropic’s Claude
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- SpQR GitHub repo
- SDS 674: Parameter-Efficient Fine-Tuning of LLMs using LoRA (Low-Rank Adaptation)
- SDS 672: Open-source “ChatGPT”: Alpaca, Vicuña, GPT4All-J, and Dolly 2.0
- QLoRA GitHub repo
- SDS 670: LLaMA: GPT-3 performance, 10x smaller
- Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM
- Shaan Khosla
- Nebula
- Let’s Talk Text newsletter
DID YOU ENJOY THE PODCAST?
- Will SpQR be useful in your model deployment process? How will SpQR affect your use of LLMs?
Podcast Transcript
(00:05):
This is Five-Minute Friday on “Lossless LLM Weight Compression”.
(00:27):
Many recent episodes have focused on open-source Large Language Models, LLMs, that you can download and fine-tune to particular use cases depending on your needs or your users’ needs. I’ve particularly been highlighting LLMs with seven billion up to 13 billion model parameters because this size of model can typically be run on a single consumer GPU so it’s relatively manageable and affordable both to train and to have in production.
(00:54):
However, if you’d like to approach the capabilities of the top commercial LLMs like OpenAI’s GPT-4 or Anthropic’s Claude, and you want to be able to do this on a broad range of tasks, you may need much bigger models. So, wouldn’t it be nice if you could compress these much larger open-source LLMs to fit them on a single consumer GPU? Such compression would enable you to decrease training costs, to decrease model size for storage, to increase inference speed, and, in some cases, compression can even act as a regularizer so it improves the model’s generalizability to data it hasn’t encountered before.
(01:32):
All right, that all sounds great, right? But the problem is that, historically, compressing a model leads to lower accuracy. That changed earlier this month with a paper published by international collaborators from both academia and industry in which they revealed their SpQR approach. SpQR stands for “Sparse-Quantized Representation” and it allows for near-lossless LLM weight compression. This is a huge deal. The authors demonstrate being able to run a 33B-parameter LLM on a single 24 GB GPU while simultaneously allowing a 15% speedup of inference and, critically, no reduction in model accuracy. To do this, they leveraged a widely-used approach called quantization.
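To see why fitting a 33B-parameter model on a 24 GB GPU depends on compression, a back-of-the-envelope calculation helps. The sketch below is only an illustration: it assumes a 16-bit baseline and roughly 4 bits per weight after compression, counts a gigabyte as 10^9 bytes, and ignores activations and other runtime overhead. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 33e9  # the 33B-parameter model mentioned in the episode

fp16_gb = weight_memory_gb(params, 16)      # assumed 16-bit baseline
four_bit_gb = weight_memory_gb(params, 4)   # assumed ~4 bits per weight

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {four_bit_gb:.1f} GB")
print(f"Fits on a 24 GB GPU? 16-bit: {fp16_gb < 24}, 4-bit: {four_bit_gb < 24}")
```

Under these assumptions the weights shrink from about 66 GB to about 16.5 GB, which is what brings the model within reach of a 24 GB card.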
(02:20):
So, model quantization, I talked about this on the podcast before, is a process that reduces the memory and computational requirements of a deep learning model by representing model parameters and model activations with lower precision values. So this could be something like using integers or fixed-point numbers in the place of higher-precision floating-point numbers. This quantization both reduces memory usage and speeds up computations.
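As a rough illustration of representing floating-point values with low-precision integers, here is a minimal sketch of uniform min-max quantization in plain Python. It is one simple scheme among many, not the specific one used in the paper; the example weights and the function names `quantize` and `dequantize` are made up for illustration.

```python
def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using a min-max scale."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


weights = [-0.42, -0.10, 0.03, 0.27, 0.51]  # made-up example weights
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max error:", max_error, "half step:", scale / 2)
```

Each weight is now stored as a 4-bit code plus a shared scale and offset, at the cost of a small rounding error per weight.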
(02:43):
SpQR, this new method, is the first quantization method that can reach the compression ratios of other quantization methods, so this is like a 4x compression, while being near-lossless. Getting a model down to a quarter of its size gives you a much, much smaller model, which means that really big LLMs can fit on a single GPU, and near-lossless means that nearly all of the model’s accuracy is retained. To allow this to happen, there are four steps to this new SpQR algorithm. In the first step, they iterate through the layers of your deep learning model and quantize the weights by converting them to a lower-bit representation. So that’s just normal quantization. What’s new is that in step 2, for each layer, they measure the inputs and outputs of the quantized model and compare these outputs with those of the uncompressed model. In step 3, they identify the weights whose quantization results in an outsized impact on layer output behavior, and these particular weights are considered to be outliers. In the fourth and final step, most of the weights, typically greater than 99% of them, are converted to a low-bitwidth representation. The outliers that were identified in the previous step, which are the only weights with an outsized impact, are extracted separately and left in their higher-precision representation.
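The four steps can be sketched on a toy single linear layer. This is a simplified illustration, not the authors’ implementation: the sensitivity score (squared quantization error of a weight times the energy of the input it multiplies) is a stand-in proxy chosen for this example, the weights and calibration inputs are made up, and all function names are hypothetical.

```python
def quantize_matrix(weights, bits):
    """Step 1: min-max quantize every weight, then map the codes back to floats."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [[round((w - lo) / scale) * scale + lo for w in row] for row in weights]


def layer_output(weights, x):
    """A bias-free linear layer: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_error(reference, candidate, inputs):
    """Step 2: total squared difference between the two layers' outputs."""
    total = 0.0
    for x in inputs:
        y_ref = layer_output(reference, x)
        y_cand = layer_output(candidate, x)
        total += sum((a - b) ** 2 for a, b in zip(y_ref, y_cand))
    return total


def find_outliers(weights, quantized, inputs, num_outliers):
    """Step 3: rank weights by a simple sensitivity proxy and keep the top few."""
    n_cols = len(weights[0])
    input_energy = [sum(x[j] ** 2 for x in inputs) for j in range(n_cols)]
    scores = {
        (i, j): (weights[i][j] - quantized[i][j]) ** 2 * input_energy[j]
        for i in range(len(weights))
        for j in range(n_cols)
    }
    top = sorted(scores, key=scores.get, reverse=True)[:num_outliers]
    return {pos: weights[pos[0]][pos[1]] for pos in top}


def apply_outliers(quantized, outliers):
    """Step 4: low-bit weights everywhere, original values at outlier positions."""
    hybrid = [row[:] for row in quantized]
    for (i, j), value in outliers.items():
        hybrid[i][j] = value
    return hybrid


weights = [[0.0, 0.4, 1.0, 3.0], [2.0, 1.1, 0.1, 2.0]]             # made-up layer
calibration_inputs = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # made-up data

quantized = quantize_matrix(weights, bits=2)
outliers = find_outliers(weights, quantized, calibration_inputs, num_outliers=1)
hybrid = apply_outliers(quantized, outliers)

quant_error = output_error(weights, quantized, calibration_inputs)
hybrid_error = output_error(weights, hybrid, calibration_inputs)
print("outliers kept in full precision:", outliers)
print("error with plain quantization:", quant_error)
print("error with outliers restored:", hybrid_error)
```

In this toy case, restoring a single weight removes most of the output error. With only eight weights that one outlier is a far larger share than the under-1% figure described for real models, so the example shows the mechanism rather than the proportions.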
(04:19):
The rationale behind this four-step process is that, in most cases, fewer than 1% of the weights, the outliers, result in over 75% of the overall error that is introduced by quantization. Since these weights lead to high, irreducible error, SpQR just keeps them intact. Since these outliers typically account for fewer than 1% of the parameters in the model, retaining them has a negligible impact on compression while simultaneously avoiding any noticeable reduction in the model’s accuracy.
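A quick calculation shows why keeping about 1% of the weights in high precision costs so little. The numbers below are assumptions for illustration only: 3 bits for ordinary weights, 16 bits for each outlier value, and a hypothetical 16 bits of bookkeeping per outlier to record its position. The real storage format may differ.

```python
def average_bits(low_bits, outlier_fraction, outlier_value_bits=16, outlier_index_bits=16):
    """Average storage per weight when a small fraction stays in high precision."""
    dense = (1 - outlier_fraction) * low_bits
    sparse = outlier_fraction * (outlier_value_bits + outlier_index_bits)
    return dense + sparse


bits = average_bits(low_bits=3, outlier_fraction=0.01)
ratio = 16 / bits  # compression relative to an assumed 16-bit baseline

print(f"average bits per weight: {bits:.2f}")
print(f"compression vs 16-bit:   {ratio:.2f}x")
```

Under these assumptions the average comes to about 3.29 bits per weight, compared with 3 bits if no outliers were kept, so the compression ratio barely moves.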
(04:51):
Really cool. It’s like logical, it’s one of these things, it’s like why didn’t I think of that? And so, well, I didn’t think of it but these authors did and we’ve got the paper for you as well as the associated GitHub repo in the show notes so you can apply SpQR to your own LLM today.
(05:10):
Finally, if you’re not just interested in compressing your model for deploying it to production, but you’re also interested in fine-tuning a big open-source LLM (say, a 33B or larger model), you’ll also want to check out QLoRA. QLoRA builds on the parameter-efficient low-rank adaptation, LoRA, that I introduced back in Episode #674, but now also incorporates the quantization that we’ve been talking about in today’s episode, so that you can fine-tune open-source 33B- or even 65B-parameter models on a single GPU. Although, admittedly, a pretty darn big GPU: it’s a 48 GB GPU. But nevertheless, a single GPU, so, you know, relatively inexpensive, relatively straightforward MLOps for training.
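The idea of combining a frozen, quantized base model with small trainable low-rank matrices can be sketched in plain Python. This is a toy illustration under stated assumptions, not the QLoRA implementation: the base weights are simply rounded to a coarse grid as a stand-in for low-bit storage, the layer sizes and `alpha` value are arbitrary, and the function names are hypothetical. The actual method uses its own 4-bit data type and further memory-saving techniques that are omitted here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(base_weights, lora_a, lora_b, x, alpha, rank):
    """y = W_frozen x + (alpha / rank) * B (A x); only A and B would be trained."""
    base_out = matvec(base_weights, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base_out, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


random.seed(0)
d_out, d_in, rank = 4, 6, 2

# Frozen base weights snapped to a 0.25 grid, standing in for low-bit storage.
base = [[round(random.uniform(-1, 1) * 4) / 4 for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: the adapter starts as a no-op

x = [1.0, -2.0, 0.5, 0.0, 3.0, 1.5]
y = lora_forward(base, lora_a, lora_b, x, alpha=16, rank=rank)
print("adapter is a no-op at initialization:", y == matvec(base, x))

# Parameter counts for one hypothetical 4096 x 4096 layer with rank 16.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, rank=16)
print(f"full layer: {full:,} weights, adapter: {adapter:,} trainable weights")
```

The memory saving comes from two places: the large base matrix is stored in low precision and never updated, while gradients and optimizer state are only needed for the much smaller adapter matrices.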
(05:56):
So, the QLoRA authors made a big splash a few weeks ago when they claimed this enabled their Guanaco family of models to reach 99.3% of ChatGPT’s performance on the Vicuña benchmarks that I covered back in Episode #672. The QLoRA approach is already integrated with Hugging Face’s PEFT and Transformers libraries; we’ve got a link in the show notes to the GitHub repo for all the information, including access to their new Guanaco model family, which comes in four sizes: 7B, 13B, 33B and 65B parameter versions. Note, however, that this Guanaco family of models was fine-tuned starting with Meta’s LLaMA models so, as detailed back in Episode #670, they can’t be used for commercial purposes, but you can now apply QLoRA to a commercial-use model like Dolly 2.0 and fine-tune it to whatever your desired use case is.
(06:54):
All right, really cool. I hope you can build some amazing, powerful LLMs and have them in production with your users in no time using approaches like QLoRA and SpQR that I covered in today’s episode. That’s it for this week. Thanks to Shaan Khosla on the data science team at my machine learning company Nebula for much of today’s SpQR content, which I got from a recent edition of his Let’s Talk Text newsletter.
(07:22):
Until next time, my friend, keep on rockin’ it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.



