(00:05):
This is Five-Minute Friday on SparseGPT.
(00:27):
“Large language models,” or LLMs for short, are super powerful because they’ve been trained on tons of data and — because they have billions of model parameters — they’re capable of performing remarkably well on a remarkably broad range of natural-language tasks. The best-known LLM, for example — GPT-3 — has 175 billion model parameters. If you’d like to learn all of the key info about GPT-3 and its capabilities, check out episode #559 with Melanie Subbiah — one of the first authors of the GPT-3 paper.
(00:59):
Today’s episode isn’t specifically about GPT-3, however. It’s about how massive these large language models are and how we can prune them to compress them. Pruning gives us a number of advantages: it increases real-time inference speed in production, decreases the model’s size in memory and storage, decreases compute costs, and, through a common machine-learning concept called regularization, it could potentially even improve the model’s generalization to real-world data that are structurally different from the data the model was trained on. So, in terms of speed, memory footprint, cost, and maybe even generalization, pruning a model is a good thing.
(01:39):
While there are even bigger models today, and while the forthcoming GPT-4 is rumored to be several orders of magnitude larger, GPT-3 — as the most popular large language model today, with its 175 billion model parameters — serves as a solid exemplar of LLMs in general. For context, 175 billion model parameters take up about 320GB of memory, so making production inferences with just one copy of GPT-3 requires about five state-of-the-art Nvidia A100 GPUs, which have 80GB of memory each. Since each of these GPUs costs about $15k, running just one copy of GPT-3 in production requires roughly $75k worth of GPUs alone.
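If you want to sanity-check that arithmetic yourself, here’s a minimal back-of-the-envelope sketch in Python. It assumes 16-bit (2-byte) weights, which is roughly how you get from 175 billion parameters to the ~320GB figure, and it uses the 80GB-per-A100 and roughly $15k-per-GPU numbers quoted above; it ignores activation memory and other serving overhead.

```python
import math

# Rough back-of-the-envelope estimate of GPT-3's serving footprint,
# assuming 2 bytes per parameter (16-bit weights) and ignoring
# activation memory and framework overhead.
n_params = 175e9           # GPT-3 parameter count
bytes_per_param = 2        # fp16 / bf16 weights (an assumption)
gpu_memory_gb = 80         # one NVIDIA A100
gpu_price_usd = 15_000     # rough per-GPU figure quoted in this episode

weights_gb = n_params * bytes_per_param / 1e9          # ~350 GB (~320 GiB)
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)    # 5 A100s
total_gpu_cost = gpus_needed * gpu_price_usd           # ~$75,000

print(f"~{weights_gb:.0f} GB of weights -> {gpus_needed} A100s -> ${total_gpu_cost:,}")
```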
(02:34):
Clearly, GPT-3’s remarkable capabilities come with a chunky price tag. Thankfully, an exciting new paper from researchers at IST Austria on a parameter-pruning technique called SparseGPT indicates that 100 billion parameters — more than half of GPT-3’s full 175-billion-parameter complement — can be removed without adversely impacting GPT-3’s accuracy. Wow! This is a massive improvement over comparable methodologies for pruning LLMs on the scale of GPT-3. Specifically, the previous top GPT pruning approach, called magnitude pruning, can only prune about 10% of GPT-3’s parameters before accuracy begins to take a hit. So the previous best approach manages about 10%; this new SparseGPT approach manages more than 50%. That’s incredible; that’s a big reduction in model size.
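For a concrete sense of what that baseline actually does, here’s a minimal magnitude-pruning sketch in Python with NumPy. This is just the generic idea of zeroing out the smallest-magnitude weights in a layer, not the authors’ code or the exact procedure benchmarked in the paper.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Example: prune 50% of a random weight matrix
w = np.random.randn(8, 8)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"{(w_sparse == 0).mean():.0%} of weights are now zero")
```

Magnitude pruning looks at each weight in isolation, which is part of why it can only reach about 10% sparsity on a model like GPT-3 before accuracy suffers.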
(03:32):
Countless different pruning methodologies exist. Some of these techniques are applied before model training, some are applied post-training, and others — historically the best-performing — are iterative, applied throughout model training. SparseGPT is noteworthy not only because it can remove more than half of GPT-3’s model parameters without impacting accuracy, but also because it’s easier to apply than those historically best-performing iterative approaches. Specifically, SparseGPT is applied just once, post-training, which is why its creators refer to it as a convenient “one-shot” pruning approach.
(04:13):
The details of how SparseGPT works are fairly mathematically complex and are laid out in the paper if you’re interested in reading more, but the general idea is that pruning is carried out layer by layer. Deep learning models like large language models contain many layers of artificial neurons, and with this kind of layer-by-layer pruning, each layer in the deep learning architecture is pruned separately; the final model is then stitched back together by recomposing the compressed layers. The complexity of this approach comes from the mathematics of stitching the layers back together in such a way that the outputs produced by the full-size model are preserved despite the internal structure of the network being changed so drastically.
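To make that layer-by-layer idea a little more tangible, here’s a heavily simplified sketch in Python with NumPy. It is not the actual SparseGPT algorithm, which uses approximate second-order information to do this far more efficiently at scale; it just illustrates the general recipe of masking weights in one layer and then refitting the surviving weights so that the layer’s outputs on some calibration inputs stay close to the dense layer’s outputs.

```python
import numpy as np

def prune_layer_with_reconstruction(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Toy layer-wise pruning: mask small weights, then refit the surviving
    weights (row by row, via least squares) so that the pruned layer's outputs
    stay close to the dense outputs W @ X on calibration inputs X.
    This mimics the spirit of layer-wise reconstruction, not SparseGPT itself."""
    Y = W @ X                                    # dense layer's outputs to preserve
    W_pruned = np.zeros_like(W)
    k = int((1 - sparsity) * W.shape[1])         # weights kept per output row
    for i, row in enumerate(W):
        keep = np.argsort(np.abs(row))[-k:]      # indices of the largest-magnitude weights
        # Refit only the kept weights so this row's output is reconstructed
        coeffs, *_ = np.linalg.lstsq(X[keep].T, Y[i], rcond=None)
        W_pruned[i, keep] = coeffs
    return W_pruned

# Toy example: one 64x128 layer, 50% sparsity, 256 calibration samples
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
X = rng.normal(size=(128, 256))
W_hat = prune_layer_with_reconstruction(W, X, sparsity=0.5)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"Relative output error after 50% pruning: {err:.3f}")
```

Do that for each layer independently, stack the compressed layers back together, and you have the pruned model; the hard part, as described above, is doing this at GPT-3 scale in a way that keeps the overall outputs intact.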
(04:56):
Now, being able to halve the size of large language models while retaining accuracy is clearly exciting news with positive commercial and environmental implications, given how widely these models are used today to power myriad natural-language processing applications. Perhaps the most exciting news of all, then, is that the SparseGPT authors reckon that, combined with fine-tuning mechanisms and iterative pruning during training, their one-shot post-training SparseGPT approach could reduce model size by up to 90% without adversely impacting accuracy. So hopefully they can bring that to fruition soon. In dollar terms, that would mean using about $7,500 worth of GPUs to run GPT-3 in production instead of $75,000 worth — a tenth as much. All right, so super cool: roughly half of the parameters in big models like GPT-3 can be removed today without impacting accuracy, and hopefully we’ll be able to remove 90% without adversely impacting accuracy in the near future.
(06:06):
Thanks to Shaan Khosla, a data scientist on my team at my machine learning company Nebula, for inspiring today’s Five-Minute Friday episode on SparseGPT by providing a summary of the SparseGPT paper via his Let’s Talk Text Substack newsletter. He uses the newsletter to provide a weekly, easy-to-read summary of a recent key natural language processing paper, and you can subscribe if that’s something you’re interested in — we’ve provided a link to Shaan’s Substack in the show notes.
(06:33):
And finally, I’ve been mentioning this on-air a lot lately because I’m really excited about it, so hopefully this hasn’t been too annoying for our regular listeners, but if you’d like to learn more about large language models like the GPT series of models: coming up on March 1st, I’ll be hosting a virtual conference on natural language processing with large language models. It’ll be interactive and practical, and it’ll feature some of the most influential scientists and instructors in the large language model space as speakers, including Melanie Subbiah, one of the first authors of the GPT-3 paper, whom I mentioned at the outset of this episode. This half-day virtual conference will be held live on the O’Reilly platform, which many employers and universities provide free access to; otherwise, you can grab a free 30-day trial of O’Reilly using our special code SDSPOD23. We’ve got a link to that code ready for you in the show notes as well.
(07:27):
All right, I hope you enjoyed this informative episode on SparseGPT. Until next time, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.