Merged LLMs are the future, and we’re exploring how with Mark McQuade and Charles Goddard from Arcee AI on this episode with Jon Krohn. Learn how to combine multiple LLMs without adding bulk, train more efficiently, and dive into different expert approaches. Discover how smaller models can outperform larger ones and leverage open-source projects for big enterprise wins. This episode is packed with must-know insights for data scientists and ML engineers. Don’t miss out!
Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.
About Mark McQuade
Mark is driven by his passion for LLMs. In 2023, he had a vision to empower enterprises with industry-specific AI solutions. This idea emerged from his time at Hugging Face where he had helped spearhead the Monetization team, collaborating with high-profile enterprises. This frontline experience exposed him to critical industry pain points: the reluctance to rely on closed-source APIs and the challenges of training open-source models without compromising data security. These insights led to the inception of Arcee. Earlier in his career, Mark held engineering, consulting, business, and leadership roles in the telecom and VoIP space before jumping into cloud in 2015. In 2018 he joined Onica, which was later acquired by Rackspace, where he pivoted to focus fully on AI/ML. He then joined Hugging Face as an early hire on their Monetization team, and later moved to Roboflow to lead their Field Engineering and Partnership efforts.
About Charles Goddard
Charles is a software engineer with a distinguished track record in the AR/VR and aerospace industries, including a long tenure at NASA’s Jet Propulsion Laboratory. He has a long history of driving technical initiatives leading to breakthroughs in AR/VR visualization and terrain reconstruction, and has been recognized for innovative contributions with prestigious awards from NASA. Charles’ work as the founder of the popular model merging repository MergeKit led to MergeKit joining forces with Arcee, solidifying Arcee’s role as the industry leader in model merging. Both Arcee and Charles hold a strong commitment to the open-source community and have pledged to keep MergeKit open source, helping push the boundaries of technology and human knowledge.
Overview
This week, Mark McQuade and Charles Goddard from Arcee AI take us into the fascinating world of “model merging” and its game-changing impact on AI. They reveal their open-source model-merging approach, combining the capabilities of multiple large language models (LLMs) without adding extra bulk. It’s a method that boosts efficiency and power in AI models. They also introduce their Spectrum project, which makes LLM training more affordable by targeting specific network modules while freezing others, cutting training costs and making advanced AI more accessible.
The discussion moves to comparing Mixture-of-Experts versus Mixture-of-Agents approaches. Mark and Charles break down how the Mixture-of-Agents method lets each submodel generate distinct responses, while the Mixture-of-Experts approach blends them for optimal results. They show how smaller models, like their 7B-parameter Arcee Spark model, can outperform much larger foundation models like GPT-4, Gemini, and Claude in certain benchmarks, proving that specialized small models can excel in specific tasks.
They also get into the technical and practical sides of their work, explaining how Sparse Upcycling allows efficient copying, pasting, and training of an LLM’s weights into a Mixture of Experts setup. This method, though computationally heavy, leads to significant performance gains. They further discuss how Arcee Cloud helps companies transition to specialized small LLMs, offering a cheaper alternative to relying on large, costly foundation models.
Finally, Mark and Charles share their insights on leveraging open-source projects to land big enterprise contracts and attract substantial venture capital. They offer practical advice based on their experiences and strategies, providing valuable guidance for data scientists, ML engineers, and AI enthusiasts aiming to succeed through open-source initiatives.
In this episode you will learn:
- Explanation of Charles’ job title: Chief of Frontier Research [03:31]
- Model Merging Technology combining multiple LLMs without increasing size [04:43]
- Using MergeKit for model merging [14:49]
- Evolutionary Model Merging using evolutionary algorithms [22:55]
- Commercial applications and success stories [28:10]
- Comparison of Mixture of Experts (MoE) vs. Mixture of Agents [37:57]
- Spectrum Project for efficient training by targeting specific modules [54:28]
- Future of Small Language Models (SLMs) and their advantages [01:01:22]
Items mentioned in this podcast:
- Arcee AI
- MergeKit
- Mixture of Agents
- Mixtral
- Sparse Upcycling
- Optimizing LLM Training with Arcee’s Spectrum
- Arcee Spark
- Arcee Cloud
- Hugging Face
- roboflow.com
- Rackspace
- NASA Jet Propulsion Laboratory (JPL)
- Emergence Capital
- Mosaic
- Thomson Reuters
- Collision Conference
- Web Summit
- Gödel, Escher, Bach by Douglas Hofstadter
- Hyperion Cantos by Dan Simmons
- Harlan Coben
- SuperDataScience
- Intro to Probability Theory
- Probability Level II
- SDS special code for a free 30-day trial on O’Reilly: SDSPOD23
- The Super Data Science Podcast Team
Follow Mark:
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 801 with Mark McQuade and Charles Goddard of Arcee AI.
00:00:11
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today, and now let’s make the complex simple.
00:00:42
Welcome back to the Super Data Science Podcast. Today, we’ve got a seriously mind-expanding episode with two frontier-pushing members of the team at Arcee AI. This is a startup that’s behind the pioneering concept of “model merging” that, this very day, announced a $24 million Series A round of venture capital investment. Those two people, again, are Mark and Charles. Mark McQuade is cofounder and CEO of Arcee. Previously, he held client-facing roles at Hugging Face and Roboflow, as well as leading the data science and engineering practice of a Rackspace company. He studied electronic engineering at Fleming College in Canada. Our second guest is Charles Goddard. He’s chief of frontier research at Arcee. Previously, he was a software engineer at Apple and NASA’s famed Jet Propulsion Laboratory. He studied engineering at Olin College in Massachusetts.
00:01:33
Today’s episode is relatively technical, so it will likely appeal most to hands-on practitioners like data scientists and ML engineers. In today’s episode, Charles and Mark detail how their impressive open-source model merging approach combines the capabilities of multiple LLMs without increasing the model size. They provide a separate open-source approach for training LLMs efficiently by targeting specific modules of the network to train while freezing others. They talk about the pros and cons of Mixture of Experts approaches versus Mixture of Agents approaches. They talk about how to enable small language models to outcompete the big foundation LLMs, like GPT-4, Gemini and Claude, and they talk about how to leverage open-source projects to land big enterprise contracts and attract big chunks of venture capital investment. You ready for this astounding episode? Let’s go.
00:02:25
Mark and Charles, welcome to the Super Data Science Podcast. It’s awesome to have you on and I’m excited for this episode. I was blown away doing research for this episode. You guys are onto some really cool things. We’re going to dig into it. Let’s start to get people used to your voices. Let’s start with Mark. Mark, tell us where you are in the world.
Mark McQuade: 00:02:43
I am in Florida. I’m just north of Miami area on the Gulf coast.
Jon Krohn: 00:02:48
Nice, and Charles, where are you?
Charles Goddard: 00:02:51
I’m over here in Los Angeles.
Jon Krohn: 00:02:53
Nice. I’m over here. That would be a great response for people to have on the show more often. I’m just over here. Nice.
Charles Goddard: 00:03:02
I believe I’m off to the right a little.
Jon Krohn: 00:03:07
So Charles, you brought your model merging technology to Arcee. So this idea of model merging, that’s what has really blown my mind. And before we get into model merging, I want to talk about your title for a second, which is chief of frontier research. So Mark has a title of CEO. Everyone knows the CEO title. There’s a huge variability in what CEOs do, but I feel like a lot of people have some sense of that. Your title, Charles, is Chief of Frontier Research. What does that mean? And did you invent that title? Where did it come from?
Charles Goddard: 00:03:38
So Mark, our CTO Jacob, and I workshopped that title together. It was important to us that my title accurately reflects the fact that I wear a cowboy hat while I do my work. But actually-
Jon Krohn: 00:03:54
Nicely done.
Charles Goddard: 00:03:55
… it’s meant to reflect that I’m very much into the speculative, cutting-edge, probably won’t pan out type of research. And at the time that I got into model merging, that was very much the case. So people hadn’t really used this technique in any productionalized way. But essentially it’s to reflect that I like to work on problems and technical solutions that are just barely emerging as possible and seeing if they could be brought more into the productionalized [inaudible 00:04:25].
Jon Krohn: 00:04:25
Yeah, and this is certainly there. So explain to us how model merging represents the next frontier in AI, as well as in transfer learning, and what potential you see in this technology for transforming various industries. So I guess that’s a two-parter. So give us some insight into what model merging is. Explain it because you could do a better job than I can. You’re the expert on that. And then explain to us the potential you see in this model merging, and then I’ve got lots of follow-up questions related to that.
Charles Goddard: 00:04:54
Absolutely. So model merging, it’s a family of techniques that let you take the pre-trained weights of neural networks and combine them into a single network that captures some of the strengths or abilities of all the networks that you threw into that melting pot. And of course, there’s a big variation in the techniques of what performance characteristics they have and what scenarios they excel in, but essentially it’s a natural advancement of the philosophy of transfer learning. So we discovered that if instead of training language models from scratch every time on however many trillion tokens and then throwing in our specific task, we just do the foundation model training once and then fine-tune many different versions of that on our various different tasks, then we get better performance and it’s infinitely cheaper than training from scratch every time.
00:05:46
Model merging is very similar in that it’s an even further finer subdivision of the work involved. So whereas with typical language model training setups today, you’ll take some foundation model like Llama 3, the 8 billion parameter untrained base model, and then you’ll curate some extensive data set that reflects what you want it to be capable of doing, which includes just general instruction following capabilities. You need a very diverse data set to get a robust instruction following capability in a text model.
00:06:22
So you get all of that data and curating all of that data is quite difficult. The open-source community has some good ones, but it’s still not to the level of what companies, like Meta or OpenAI for example, have curated. Then on top of that general stuff, you also need to curate a specific data set for the task that you actually care about. So let’s say you’re a finance company and you want a language model to summarize financial statements, you need to curate that data set and then you train the model on top of that entire aggregate pile. And the end result is fantastic, but there’s really no reason to be doing 90% of that work because we have incredible instruction following models available, open weight, just on the internet. If you create just the data set for your specific tasks, then model merging is a tool that lets you take those pre-existing artifacts and incorporate their strengths, essentially for free.
Jon Krohn: 00:07:12
Yeah, so let me try to explain in my own words and you can tell me where I get this wrong. When you have, out there in the world, there’s lots of different kinds of specialized models that are great at different tasks. And right now, if I was without model merging, if I was going to try to make use of that, I might have some kind of separate model that is predicting where to triage requests to and then I’d have to grab… So let’s say there’s five different tasks that I want my LLM to be able to perform really well at. And so I train, probably myself, this model for triaging natural language requests that come in. And I then, based on that I get some probability that okay, this is a great situation for model two out of five, and then I separately call just that one model and send the original natural language request to that, and let it follow through.
00:08:14
That is going to require me to have a whole bunch of different models in production, which could be all very large, requiring a bunch of GPUs maybe to be in production running, at great expense to me. With model merge, I can have one model running, that is probably smaller than my five in aggregate, and so that means I can reduce my compute costs and probably also deliver results more rapidly in real time to my users.
Charles Goddard: 00:08:45
Absolutely, yeah. So that’s one very important use case of model merging and it is good to highlight, as you did, the size of the models involved. So for most of these model merging techniques, you put in some number of models that are all the same size and you get out a model that’s that same size, just one model. So to give a concrete example there because that wording was not great. Say that you’ve trained seven or eight models based on Llama 3, 8 billion parameters, and you put all of those into a merge, the model that you get out is still just 8 billion parameters. So versus a classic [inaudible 00:09:19], for example, where you’d inference each of these seven models and then combine the results at the end. You’re just inferencing a single model. [inaudible 00:09:29] the same size as each of these individual ones.
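To make that concrete, here is a minimal sketch of what such a merge can look like with MergeKit. The config schema follows MergeKit’s public README; the model names are placeholders for same-size fine-tunes and the weights are arbitrary, so treat this as illustrative rather than an Arcee recipe.

```python
# Illustrative sketch: average two same-size Llama-3-8B fine-tunes into a single
# 8B-parameter model with MergeKit's linear merge method. Model IDs are placeholders.
import subprocess
import textwrap

config = textwrap.dedent("""\
    merge_method: linear          # weighted average of the input checkpoints
    dtype: float16
    models:
      - model: your-org/llama-3-8b-finance-tune   # placeholder fine-tune 1
        parameters:
          weight: 0.5
      - model: your-org/llama-3-8b-support-tune   # placeholder fine-tune 2
        parameters:
          weight: 0.5
""")

with open("merge-config.yml", "w") as f:
    f.write(config)

# The output in ./merged-model is still a single 8B-parameter model.
subprocess.run(["mergekit-yaml", "merge-config.yml", "./merged-model"], check=True)
```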
Jon Krohn: 00:09:31
Yeah, and so let me make sure that I’m understanding this. So you could actually, so if I was starting with five 7 billion parameter models, a common size that we see these days. So let’s say I have five different 7 billion parameter models. With model merge, when I merge them, I end up with one 7 billion parameter model.
Charles Goddard: 00:09:48
Exactly, yes.
Jon Krohn: 00:09:49
That is pretty wild. That is the frontier. That is the frontier. That sounds too good to be true, but we’ll dig into that in a bit more. I’m going to give Mark a chance here. So Mark is a cofounder of Arcee. Mark, tell us about that name, A-R-C-E-E, what does that mean? Where does that come from?
Mark McQuade: 00:10:06
Arcee is the most famous female Transformer, and I have two-
Jon Krohn: 00:10:11
Oh.
Mark McQuade: 00:10:11
I have two daughters. So there’s nothing else to it other than that. Just coined it that way.
Jon Krohn: 00:10:18
Nice. Well it’s perfect for SEO, being Arcee.ai, it gives you good SEO in the tech space. I have a company named Nebula, which I don’t think is great because there’s a hundred other tech companies named Nebula. So it’s great to have those short, pithy names. They don’t always necessarily have to mean something, so it’s nice that yours does. So as a cofounder of Arcee, you’ve been instrumental in integrating model merging, that Charles just described, into Arcee’s company offerings. So tell us how the market has responded to this technology, maybe feedback you’ve heard from early adopters. And also I understand if, this isn’t too many questions all at once, that you have a unique strategy that is working out well already in terms of bridging both offering things open source, so our listeners can go and access model merging as an open-source technique, but you’re also landing enterprise clients at the same time.
Mark McQuade: 00:11:20
Correct. Yeah, so model merging as a technique blew me away when I first came across it, right after Charles released the repo. And the concept of, you had mentioned it at the top there, transfer learning, right? It’s essentially the next wave of transfer learning in my opinion, right? There’s millions of open-source checkpoints that exist somewhere. You can now bring them all into one unified model through model merging and MergeKit, right? So it was one of those things where, as a CEO of a company, you recognize something that’s about to explode, potentially, and you want to be driving that bus, right?
00:12:03
So I’m ex-Hugging Face. I used to work at Hugging Face, which I’m sure most listeners know. I still have a lot of people from within Hugging Face that are close friends and it just kept coming up. I’d have conversations with them, and model merging, and then this crazy library called MergeKit. So we brought it in, we adopted it internally and then from there we brought Charles in as well to steer the ship on that, right? So from what the market has been saying, it’s a knowledge curve in that sense, right? Model merging is so novel and so new, people don’t really understand… Even now I see people saying, “Well, why would anyone want to merge?” Right? And it’s like, well, I’m so close to it, I say, “Why would you not merge?” Right? But through that education, through teaching the community and the market education around model merging, we’ve had a ton of interest from obviously the developer community, right?
00:13:07
The amount of models that have used MergeKit on the Hugging Face OpenLLM Leaderboard is massive, but it’s starting to get felt from within the enterprise, where we get outreach from companies like Fidelity or Bayer, the pharmaceutical company, that say, “Oh, we’ve played with merging. It seems fantastic.” So it’s trickling its way in, as every great open-source project eventually does, right? It trickles its way up. So yeah, it’s been great. It’s just a matter of continuing to showcase the power of model merging. And I think it speaks for itself when it comes to what it can do. It’s just a matter of getting more people to understand it and know it’s there and it exists. And I keep saying, over and over and over, soon there’ll be a day where it’s not, “Why would I merge?” It’s “Everybody does a merge.” Everybody does, as a post-training technique, just as Google just did when they released the new Gemma, right? When Google starts using it, you know something good is happening.
Jon Krohn: 00:14:18
Yeah, that’s-
Mark McQuade: 00:14:18
And I think we had one more question at the end of that and I forgot it.
Jon Krohn: 00:14:21
It is, and well, this is a perfect point for me to actually re-articulate it. So we understand your excitement about model merge, and I guess I should highlight that the tool for doing that is called MergeKit, right? So model merging is done through MergeKit. One word, I’ll provide a link to the GitHub repo for MergeKit. So, Mark, you became aware of MergeKit early, you rightly, in my view, identified this as something that is transformative, that is at the frontier and places like Google, picking it up and using it makes a lot of sense to me. Where do you see the commercial opportunity or where have you already succeeded in getting… Because you’re not a charity, you’re not there trying to just create open-source tools to… It does make the world a better place and that must be a nice part of having a business like Arcee’s, but simultaneously you’ve got to pay the bills and you’ve got to make investors happy. And so, yeah, how do you monetize this?
Mark McQuade: 00:15:28
Yeah, so overall, I think that actually makes us very unique because we have monetized successfully, early on, right? We are just about to hit 2 million in revenue and we just started selling the-
Jon Krohn: 00:15:40
Wow.
Mark McQuade: 00:15:40
… product in November.
Jon Krohn: 00:15:41
Wow.
Mark McQuade: 00:15:42
So yeah, so we are not the flashy GenAI startup that is going to figure out how to make money later. We actually figured it out from day one and it’s because we’re providing something that’s extremely valuable to these organizations, right? And you mentioned MergeKit as a library, as an open-source platform and repo, that’s open source, but we actually have a platform, with MergeKit and other things wrapped into it, that allows organizations to customize their language models, right? I say it all the time, I think every organization will want to own and customize their own LLM at some point soon, right? Closed-source models are good, closed-source APIs are good, they do their job, it doesn’t matter, right? It doesn’t matter.
00:16:35
Plenty of people will use GPT, they’ll use Claude and they’ll have their own models, right? And the reason they’re going to want to have their own models is they’re going to want to customize them on their own data that they do not want to share with closed-source APIs, right? So what we’ve done, early on, is we’ve recognized that and we’ve capitalized on the fact of organizations wanting to keep their data private and have a secure platform. And so we have built our software stack, which is not only merging, it’s training and merging, right? So understanding that the true power of merging comes with training your own model as well. And it’s completely deployable inside your virtual private cloud where your data never leaves and it’s completely secure, right? So that messaging has a ton of pull from organizations of all size, not just the enterprise.
Jon Krohn: 00:17:29
Do you ever feel isolated, surrounded by people who don’t share your enthusiasm for data science and technology? Do you wish to connect with more like-minded individuals? Well, look no further. SuperDataScience Community is the perfect place to connect, interact, and exchange ideas with over 600 professionals in data science, machine learning and AI. In addition to networking, you can get direct support for your career through the mentoring program where experienced members help beginners navigate. Whether you’re looking to learn, collaborate, or advance your career, our community is here to help you succeed. Join Kirill, Hadelin and myself, and hundreds of other members, who connect daily. Start your free 14-day trial today at www.superdatascience.com and become a part of the community.
00:18:14
Fantastic. Yeah, so the short answer there is that wild amount, $2 million of revenue since November launching. That is super impressive, and that comes through providing a secure platform for training and deploying these models. And so, yes, that makes a huge amount of sense. You can also see how your Hugging Face background could come in handy with thinking about these kinds of monetization-
Mark McQuade: 00:18:38
Came up with that at Hugging Face. So had to just do it on my own.
Jon Krohn: 00:18:42
Yeah, yeah, yeah. That makes a lot of sense, and so we actually, a big announcement that is public today is that through this tremendous early success, you’re doing a big A round announcement. Yeah, which you might even, if people are listening to this podcast exactly when it comes out Tuesday morning, they could potentially be hearing this announcement before it’s actually even supposed to be publicly released.
Mark McQuade: 00:19:08
Correct, absolutely. Yeah. So we closed our A round, I guess, it was six weeks ago now, led by Emergence Capital, which is a great Silicon Valley VC. Love them, shout out to Emergence. So $24 million A, where we can take things to new heights, just do everything bigger, badder, and faster, right? So we had a lot of, as I mentioned before, with the revenue, and the traction we’ve had, and we’ve been able to capitalize in this market, and we did so pretty early on with… We got our seed round in September, just over 5 million, but we now have the capital to take things to new levels based on obviously the revenue and also the new investment.
Jon Krohn: 00:20:03
Yeah, those seed round investors must be pretty darn happy.
Mark McQuade: 00:20:07
[inaudible 00:20:08] September, that’s not a bad timeline, right? So [inaudible 00:20:11]-
Jon Krohn: 00:20:10
No, $2 million in revenue, come July after a September seed round investment of $5 million. You rarely come across stories like that. You guys are living the dream of early-stage startup life, that everyone, and when they have their startup idea, they leave their safe company and they’re like, “Yeah, I’m going to do what Arcee is doing.” And it’s 1% probably of people who do that, probably much less than 1% get to pull that off. So congrats to you guys.
Mark McQuade: 00:20:44
Yeah, we struggled early on to raise too, because our founding team is unique in the sense that it’s myself who, I played engineer, I’ve been technical and then business oriented, revenue oriented, split in between, my entire career. And then we have the ex-head of sales at Hugging Face. His name is Brian Benedict. He is part of the founding team. And then Jacob is also part of the founding team, which is our CTO, and he’s pure technical, but we weren’t two researchers from Meta, right? I’ll say that. Or we weren’t two researchers from DeepMind. So getting funding initially was tough, but when you start making real money, and traction, and logos, it makes it a little easier. But I’ll always have a chip on my shoulder over the struggles that I had early on, I’ll say that [inaudible 00:21:37]-
Jon Krohn: 00:21:37
Yeah. Yeah, you show them. Have you watched HBO’s Silicon Valley series?
Mark McQuade: 00:21:42
Yeah.
Jon Krohn: 00:21:43
Yeah. So it’s pretty funny when they, I think I can say this on air, when they’re talking about getting their balls on the table, where they have this period of the first several seasons where it’s a complete struggle, they can’t get any VC’s attention and then all of a sudden they have it and the pendulum swings. And of course, as always, in that fantastic satire series, they overdo it and they get their balls on the table too much.
Mark McQuade: 00:22:17
Exactly.
Jon Krohn: 00:22:17
[inaudible 00:22:18] get into trouble and it swings back the other way against them. Spoiler alert, slightly, though it wouldn’t surprise you at all as you watch it. And that’s generally, I think for people doing the startup thing, that series is fantastic for having a laugh while also seriously learning a lot about startup culture. They did a great job [inaudible 00:22:35].
Mark McQuade: 00:22:34
Yeah, [inaudible 00:22:35]. Yeah.
Charles Goddard: 00:22:35
It’s exaggerated but only a little bit.
Jon Krohn: 00:22:37
Exactly, just ever so slightly. So back to you, Charles, for one here. We talked about model merging in general already and how that is something special that you offer through MergeKit and that Arcee then offers to enterprise customers, but specifically the technique that you use that’s different from, I guess, “traditional model merging” is something called Evolutionary Model Merging. So what does that mean and how is it different from the traditional approach?
Charles Goddard: 00:23:09
Yeah, so this was an approach that was introduced in a fantastic paper from Sakana AI. They’ve been doing some very interesting work on self-organizing collective intelligence, and evolutionary algorithms, and stuff. But essentially the approach is to, when you’re merging models, there are these parameters that you have to set that determine how much of what model is emphasized and where. And basically they propose an approach where you take an evolutionary algorithm, CMA-ES specifically is the one that they used, to determine what values to use for those parameters, in order to maximize some quantifiable metric. So they used a couple of neat examples, like a language model that can answer grade school math questions in Japanese, which is a hard thing to come up with a differentiable objective for, but since this is a purely zeroth-order optimization, you can literally just check, okay, are the characters in this Japanese? And they [inaudible 00:24:16] some great results with that.
00:24:17
So we’ve taken that approach, and run with it, and constructed… Well, so there’s a script in the MergeKit library now. It’s mergekit-evolve, where you can do this run on whatever hardware you have: you select the input models that you want to merge, describe what task or set of tasks you want the model to excel at, and then just let it rip, and you’ll get out the other end a recipe for an optimized merge. And one of the things that’s great about that is that model merging is super flexible and super powerful, but we’re very much still in the alchemy stage of understanding how to work with it. I have a couple of heuristics in my head that I’ve built up just through experience of mutating these parameters for how to get what I want out of a merge, but not everybody has the time to make a couple hundred merges and build up that intuition. So this automated approach lets you just describe what you want and then you’ll get it out the other end.
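The sketch below is a conceptual illustration of the idea Charles describes, not mergekit-evolve’s actual code: treat the per-model merge weights as a parameter vector and let CMA-ES search for the values that maximize a task score. The merge_with_weights and score_on_task functions are hypothetical stand-ins for “run a merge” and “run a benchmark”.

```python
# Conceptual sketch only: zeroth-order search over merge weights with CMA-ES.
import cma  # pip install cma

def merge_with_weights(weights):
    """Hypothetical stand-in: build a merged model using these per-model weights."""
    ...

def score_on_task(model):
    """Hypothetical stand-in: a zeroth-order objective, e.g. accuracy on held-out questions."""
    ...

n_models = 3
es = cma.CMAEvolutionStrategy([1.0 / n_models] * n_models, 0.2)  # x0, initial sigma
while not es.stop():
    candidates = es.ask()                      # sample candidate weight vectors
    losses = []
    for w in candidates:
        model = merge_with_weights(w)
        losses.append(-score_on_task(model))   # CMA-ES minimizes, so negate the score
    es.tell(candidates, losses)                # update the search distribution

best_weights = es.result.xbest                 # the "recipe" for the optimized merge
```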
Jon Krohn: 00:25:17
Nice. So basically the Evolutionary Model Merging, so my understanding of evolutionary algorithms is that you have multiple parents, so multiple starting point algorithms, and then, just like evolution randomly mixes genes, you randomly mix parts of models together, you see which children of those random mixes do best, and then you breed another generation with the top performers and you can keep repeating that. So it’s that kind of approach here, right?
Charles Goddard: 00:25:54
Yeah, yeah, absolutely. CMA-ES is a lot more principled and mathematically justified than that, but essentially the approach is just, okay, take a bunch of things, put them in an arena, smash them with a big hammer, see what comes out the other end. Those are your new combatants.
Jon Krohn: 00:26:12
Then, so that allows us to take… Earlier I was talking about, say a scenario where I have five different LLMs that I’d like to be able to have all of their capabilities in a single model, and then I learned from you that they could be the same size. So I take five 7 billion parameter models, and then you use this evolutionary approach to get that to a single 7 billion parameter LLM for performing well at the objective that you’d like it to. And then with this model… Sorry, with mergekit-evolve capability that people can get through the MergeKit package, it takes advantage of the breadth of experience that you have, having done these kinds of model merges evolutionarily, you said a hundred times. And so that intuition is now coded into the way that this mergekit-evolve works. And so any of our listeners can go and now merge models, leverage your expertise and have that work pretty automatically, yeah?
Charles Goddard: 00:27:13
Yep. Really the only downside to the evolutionary approach, versus manually writing these configs, is that it does require GPUs, which when you’re not using the evolutionary approach, you can do this on the CPU of your 10-year-old laptop.
Jon Krohn: 00:27:27
[inaudible 00:27:29].
Charles Goddard: 00:27:29
But aside from that, absolutely, it’s a great way to get started with model merging.
Jon Krohn: 00:27:32
Well, it seems to me today that so much that we do with LLMs requires GPUs that I think probably having to use them for something like this isn’t a huge hurdle for people doing that kind of MLOps. Very cool, Charles. Well, back to you, Mark. Given the enthusiasm around Evolutionary Model Merging and clearly how effective it is, do you have any initial results or case studies maybe that you can talk about, maybe even with already existing paying clients? I know sometimes that’s proprietary and so you can’t go into it, but are there any case studies you can go into, that illustrate, that bring… We’ve been talking about things hypothetically, my made up example about five LLMs with 7 billion parameters. Maybe you have some more concrete example where it’s real models, leading to some real results for some real client, something like that.
Mark McQuade: 00:28:17
Yeah, Thomson Reuters is a client of ours. So it’s public news, I’m allowed to say that. So Thomson Reuters is a customer of ours. We initially trained a 7 billion parameter model for them. And so they were using a closed-source model provider for a line of their business, and we trained a 7 billion parameter model with a bunch of their data, and then we did an Evolutionary Model Merge as well. And I can confidently say, after seeing the results there, that model is performing much better than the closed-source models of today, that I’m sure we all know what those closed-source models are. And that’s a 7 billion parameter model, right? So you can imagine the price savings, 90% plus in price savings, obviously depending on the scale of what they were going to hit it at. They own it. It’s their own model. It runs inside their own environment, easy to handle, less complex on infra.
00:29:23
So many benefits to it, and very few drawbacks in the sense of what we were able to accomplish there. And the training itself, when I say training, obviously there’s a million ways people can think of what training is, right? So we did continual pre-training, right? So we did actual deep pre-training of the model with their data, and then we did a supervised fine-tune, and then we did the Evolutionary Model Merge. So adding in the Evolutionary Model Merge really just took the model itself to amazing heights, compared to the just regular trained model, because model merging in itself removes catastrophic forgetting. It brings in the greatness of the other models that you’re bringing in, great chat capabilities if you’re merging with an instruct version, things like that.
00:30:12
So yeah, when comparing it to their existing queries and what they were getting for responses from closed-source models versus what they can now get out of a 7 billion parameter model that is completely domain injected, is unbelievable to be honest. So yeah, I don’t want to get too crazy on the details of the use case, but I gave enough, yeah.
Jon Krohn: 00:30:38
Since April I’ve been offering my machine learning foundations curriculum live online via a series of 14 training sessions within the O’Reilly platform. My curriculum provides all the foundational knowledge you need to understand modern ML applications, including deep learning, LLMs and AI in general. The linear algebra and calculus classes are in the rear-view mirror, but probability, statistics and computer science classes are all still to come. Registration for both of the probability classes is open now. We’ve got links in the show notes to those, and these will cover all of the essential probability theory you need for statistics applications, as well as machine learning. Intro to Probability Theory will be on July 23rd. Probability Level II will be on August 7th. If you don’t already have access to O’Reilly, you can get a free 30-day trial via our special code also in the show notes.
00:31:28
Yeah, you did give enough, and that allows me to give my own case study that will bolster some of what you’ve said there. Where at my company, Nebula, despite its unambiguous name, we have done some interesting things with LLMs involving things like, for example, taking a 7 billion or 13 billion parameter open-source model, like Llama 3 is probably the model that I would start with today if I was training that, and fine-tuning that model to some specific task, and giving it capabilities that are superior to what you get from closed-source models. You didn’t want to mention them, but I’ll mention some. So this is, we’re talking about the OpenAI API, we’re talking about Claude from Anthropic, we’re talking about Cohere. So these are some of the big frontier model players with their closed-source models, and those models are absolutely mind-blowing. They continue to be, for me on a daily basis, the breadth of tasks that I can be blown away with them on. But you don’t need all of that breadth for most enterprise use cases.
00:32:34
So someone like Thomson Reuters as a client, they have some specific use case in their platform. And so same kind of thing with us in Nebula, there are parts of the platform where we don’t need this huge breadth of capability. We need a specific task and we can fine-tune that 7 billion parameter model, it still has a huge amount of flexibility thanks to those 7 billion parameters. That’s still a lot in historical terms, in machine learning. You get a huge amount of natural language flexibility, and in some specific task or some small set of tasks, we’re able to get performance that outperforms any of those top players like your GPT-4o, your Claude 3.5 Sonnet, your Google Gemini.
00:33:15
We’re able to outperform those and because we’re running it on our own infrastructure, you made some points there, especially if this model is going to be called a lot, having your own GPU running this much smaller model, 7 billion parameters, you could save a ton of money. You gave a 90% figure, that sounds right to me, roughly speaking, relative to calling these closed-source models, and so that’s fantastic. And so what’s great about what you guys are offering is that to do what I just described, that required a lot of in-house expertise and a lot of time. And so now with your tech, with MergeKit, our listeners can, in orders of magnitude less time, for orders of magnitude less human resource cost, get these small models running in production, having the capabilities that you’d like and potentially, in cases like you saw with Thomson Reuters, outperforming the big proprietary closed-source players.
Mark McQuade: 00:34:20
Exactly. Yeah, and I think that everything has its place and the closed-source players, absolutely great models, right? But do you care if GPT can tell you a movie fact from 20 years ago? Not necessarily. So in the enterprise world, they really just want a model that can answer the questions they need answered from their business, right? So I use this example, that your IT guy from nine months ago was named Carl. You know what I mean? That’s something that a closed-source model is probably not able to answer for you, right?
00:34:59
Whereas you train your own 7 billion parameter model on your own data, there’s so much power in that small model if all your data is injected into it. And maybe it doesn’t do as great on the general aspects of things, but that’s where merging comes in, where we try to get it back to being great from a general aspect thing. And speaking specifically to the price, I’ll give you, let’s look at AWS as an example because I’m an AWS guy, right? So an A10, a G5 in AWS, a g5.2xl, that’s an A10, that is going to cost you a buck an hour, okay? You run a 7 billion parameter model on that, hit it infinitely and you’re probably looking at about 15K a year, right? And that’s on-demand pricing, right? So 15K a year for hosting that model. If you start hitting a closed-source model like Claude or GPT at scale, it could get into the millions, right? So absolute orders of magnitude difference when it comes to price, and you own it.
Jon Krohn: 00:36:00
Nicely said.
Mark McQuade: 00:36:02
And you own it. [inaudible 00:36:02].
Jon Krohn: 00:36:02
Yeah, and you own it. That’s IP, that’s helpful for your investors. See that as a moat, it’s a differentiator, and that’s also the kind of thing that over time, so having that proprietary model trained on your own data gives you a bit of a moat. Then if that allows you to scale more, have more users, collect more data, build a wider and wider moat, deeper and deeper moat, let’s have it wider and deeper. No horses can gallop over it.
Mark McQuade: 00:36:32
But I always see, you used to hear all the time, that data was the new gold, right? That was like a thing that everyone said, “Data’s the new gold. Everyone has data, let’s keep our data. We’ll build these massive data warehouses, we’ll keep the data, it’s everything.” And then with the LLM world, and when that started happening, it seemed like everyone was like, “Oh, who cares? I’ll just call this public API. And then…” No, that data’s still gold and what’s an LLM? It’s just an extension of your data, that’s all it is. So it should be kept in-house because you would never share your data warehouse with an external company, right? You would never share your data lake like that. So why share your LLM?
Jon Krohn: 00:37:10
Mm-hmm. Makes perfect sense. Yeah, really exciting, and I don’t have to tell you, you guys are really onto something here. You know that now. That’s nice to have that, a year later, it must be really vindicating. It’s so weird to me. We’re going to have to find out, we’re going to have to do some research to confirm that I can say on this family podcast, “Putting your balls on the table.” But there we go, it’s the third time I said it.
Mark McQuade: 00:37:34
That’d be a record.
Jon Krohn: 00:37:40
Nice. So a related concept that I want to get into here. I also, I really appreciate in your most recent response there, Mark, how you talked about specific technologies, like specific GPUs, and the cost of that. That kind of detail is really great in a podcast like this, so thank you for that. Another question here that might give you that opportunity, and either of you could answer this, is related to this concept of what you guys are doing, is this idea of a Mixture of Experts, MoE. And so we have never been told publicly, but there’s a lot of evidence, a lot of rumor that models, like the GPT-4 architectures that OpenAI has, are what’s called a Mixture of Experts. And that is something akin to what I was talking about near the beginning of this episode, where you have a model that triages requests to an expert, and the rumor with GPT-4, at least the original GPT-4, was that it was a mixture of eight experts.
00:38:43
So each of those eight experts, you have, to make things simple, it wouldn’t be this simple in reality. But let’s say one does math, one does coding, one does translations between human languages. So you have these different experts that specialize in different kinds of tasks, and then when a question comes in that’s related to math, it can be routed by some metamodel that triages them. It gets sent to the math expert, and that way you’re not using all of your model weights, you’re saving some cost by having those seven other experts not need to be used. And they might occupy multiple GPUs each, if we’re talking about models the size of GPT-4. And if people want to use these kinds of models themselves, the best-known example is Mixtral from Mistral, the French startup.
00:39:35
So it seems to me, conceptually, that this Mixture of Experts idea is related in some ways to what you guys are doing. So it would be great to distinguish and something… I’m not sure if this is the right time or not. It might not even be. And so I can bring it up again if now isn’t, but you mentioned, Mark, catastrophic forgetting and how important that is for when we’re fine-tuning these kinds of models. So I’m not sure if that relates to the Mixture of Experts thing. So maybe I’ll just leave you guys to do that first and then we can come back to the catastrophic forgetting.
Mark McQuade: 00:40:10
You want to go for the MoE stuff, because I could say, first on the MoE stuff, is yeah, MoEs are fantastic, right? I’m shocked that Llama 3 wasn’t an MoE, but I guess they did a very, very good job with a dense model. But yeah, Mixture of Experts is great, but the MoE concept itself, it’s not a true eight experts, this one’s great at this domain. It’s just the way the pattern works as far as I believe, Charles, you can keep me honest here, that it’s not like, okay, you have eight experts and one’s really, really good at finance and one’s really, really good at coding, that’s not how the structure is.
00:40:50
So training MoEs is difficult in that sense. It’s simpler if you think of it like that though, each expert is really good at one thing, which is more the Mixture of Agents stuff, if you saw that get released recently. We’re actually cooking something in that space, more on the Mixture of Agents side, not the Mixture of Experts side. But yeah, I’ll let Charles talk about it. I probably just didn’t say… I said a bunch of nothing there, so-
Jon Krohn: 00:41:15
Yeah, that was great.
Charles Goddard: 00:41:19
Yeah.
Mark McQuade: 00:41:19
[inaudible 00:41:19].
Charles Goddard: 00:41:19
So a Mixtral, for example, it’s labeled as 8x7B, and people often take that to mean, okay, there are eight experts that are indivisible objects within this, that you can reason about as subject experts on something, and that’s very much not the case. It’s actually that each layer of Mixtral, the MLP section, is divided into eight experts and essentially it’s more of a matrix factorization than an actual explicit separation of knowledge. The architecture though did give rise to a cool merging hack, which some people are quite fond of. And I’ll be very clear up front here, this is a cool hack, this is not something you should really ever use in production. It doesn’t give substantial benefits over standard merging. It’s just neat that it works at all. But essentially when the Mixtral model was released, I saw the architecture and thought, “Oh, cool, you could shove eight Mistral models into a trench coat and say, ‘It’s a Mixtral model,’ and it would actually work.”
00:42:33
So in that case, you’re taking basically an entire Mistral model, and taking the MLP section from that, and shoving that into each layer of a Mixtral model. And then instead of the actual trained routing that Mixtral gets, which is… I’ll not get into that yet. Basically it’s just an approximation based on the latent states from the model when you feed in specific prompts to it, to redirect actual subject-specific queries, like, okay, math questions go to this one, literature questions go to this one. And this is incredibly inefficient and quite hacky, and it’s neat that it works at all, but it is not really a technique that I’d recommend actually using.
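For readers who want to see what that hack looks like in practice, here is a sketch of a mergekit-moe config in the spirit of what Charles describes, based on MergeKit’s public documentation. The “hidden” gate mode derives each expert’s router vectors from hidden states on the positive_prompts; the model names and prompts are placeholders, and, as Charles says, this is a neat hack rather than a production technique.

```python
# Illustrative sketch of the "eight Mistrals in a trench coat" hack via mergekit-moe.
import subprocess
import textwrap

config = textwrap.dedent("""\
    base_model: mistralai/Mistral-7B-v0.1        # donor for attention and embeddings
    gate_mode: hidden                            # route by hidden-state similarity to prompts
    dtype: bfloat16
    experts:
      - source_model: your-org/mistral-7b-math-tune    # placeholder expert
        positive_prompts:
          - "Solve the following math problem"
      - source_model: your-org/mistral-7b-code-tune    # placeholder expert
        positive_prompts:
          - "Write a Python function that"
""")

with open("moe-config.yml", "w") as f:
    f.write(config)

subprocess.run(["mergekit-moe", "moe-config.yml", "./frankenmoe"], check=True)
```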
Jon Krohn: 00:43:24
Nice. So I guess the idea is, yeah, so my oversimplification of MoE, Mixture of Experts, conceptually, maybe, yeah, the clarifications were really helpful that you both just provided, but how does this Mixture of Experts concept relate to model merging and what you do with MergeKit?
Charles Goddard: 00:43:48
Yeah, so a very specific example is that, well, we know the way that a couple of these prominent Mixture of Experts models that have been released open source were trained is through a method called sparse upcycling, where essentially they take a smaller model, like for Mixtral for example, they actually did take eight copies of Mistral, just identical, and then combine them into this architecture, and then did a full pre-train on top of that for however many trillion tokens. We’ll never know. But the technique for that is, you can call it, a merging technique and it is implemented in MergeKit. So if you want to do a sparse upcycling of a model that you have into a larger Mixture of Experts model, there’s a script in MergeKit that you can do that with.
Jon Krohn: 00:44:30
Sparse subcycling.
Charles Goddard: 00:44:30
[inaudible 00:44:32].
Jon Krohn: 00:44:31
So if I can try to break that down… Sorry, say it again?
Charles Goddard: 00:44:36
Upcycling.
Jon Krohn: 00:44:37
Upcycling, sparse upcycling.
Charles Goddard: 00:44:38
Exactly.
Jon Krohn: 00:44:39
And so let me just recap back to you because this is actually, that is a really fascinating technique there, and I hadn’t heard of it before. So the idea there is some key terminology here, and this was, I think Mark said the word dense a little while ago. And so this dense versus sparse is worth mentioning here. So when you have a, yeah, Mark was talking about it in the context of Llama 3. So Llama 3, Llama 2, up until about a year ago with GPT-4, prior to that, all of the LLMs we heard about were dense, meaning that when you make a query of one of these LLMs, of a dense LLM, every single parameter, so if it has 7 billion parameters or 70 billion parameters, whatever, in a dense LLM, every single one of those parameters has information flowing through it in order to generate the response. And so that requires a huge amount of compute.
00:45:39
So a sparse model is this Mixture of Experts approach where instead of needing to activate all of the parameters, you end up only needing to activate a smaller portion. Yeah, and it isn’t as simple as, and you guys both, I think, know a lot more about this than I do, but we talked earlier about if you have eight experts when you do a call, it isn’t like you only call one eighth of the parameters, it can end up being half or a quarter or things like that. And yeah, if you want to go into some technical detail on that, our listeners might love to hear about it, but-
Charles Goddard: 00:46:20
[inaudible 00:46:20]. Quick answer there actually.
Jon Krohn: 00:46:21
Yeah.
Charles Goddard: 00:46:21
So for Mixtral specifically, they use a top two selection. So you’ll end up activating two out of eight experts for each token.
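A minimal sketch of that top-2 routing, for illustration only (the real Mixtral implementation differs, for example its experts are gated SwiGLU blocks, and the layer sizes below are assumptions that roughly match Mixtral’s): a router scores all eight experts per token, only the two highest-scoring experts run, and their outputs are combined using the softmaxed router scores.

```python
# Illustrative top-2 Mixture of Experts layer (not Mixtral's actual code).
import torch
import torch.nn as nn

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # only 2 of the 8 experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# Example usage: layer = Top2MoELayer(); y = layer(torch.randn(16, 4096))
```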
Jon Krohn: 00:46:29
Gotcha. So a quarter there, and then plus whatever parameters need to be activated for the triaging model up front. So a little bit more than a quarter. Perfect. Yeah, so those other six experts don’t get activated. You’re roughly getting 25%, maybe 30% of the activations, meaning you’re saving about 75% on cost, about 70% on inference time. And so you’re getting high quality results, faster, cheaper to your users, which is obviously something that you love to do, especially when you’re working with large language models. So that’s the idea of a sparse, sparse meaning that, in this case, not all of your parameter weights are being used or being activated on a given call to the model. So then this idea of sparse upcycling, did I say that right this time?
Charles Goddard: 00:47:23
Yep.
Jon Krohn: 00:47:24
Is that you, and this is something that you can do with MergeKit, it allows you to take, say, eight 7 billion parameter starting point models. It could be a Mistral one, it could be a Llama 3 one, and you just take copies of those with the starting model weights all identical, but then you fine-tune that model on some task and through this sparse upcycling, you end up fine-tuning each of those experts to be expert at different things. And you get those Mixture of Expert advantages once you’re in production. Did I get that?
Charles Goddard: 00:48:01
Yeah. That’s pretty much accurate. I will say that this technique pretty much is reserved for the extremely GPU-rich to truly get the advantages of training a sparse model. You need to be thinking in terms of trillions of tokens, not millions, which is out of the reach of most people. But for smaller scale models and for larger scale datasets, yeah, it’s a very, very useful technique.
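As a conceptual sketch of the upcycling step itself (illustrative; not the MergeKit script and not Mixtral’s actual code): every expert in the new MoE layer starts as a copy of the dense model’s MLP, a freshly initialized router is added, and the whole thing is then further pre-trained so the experts differentiate.

```python
# Conceptual sketch of sparse upcycling: duplicate a dense MLP into N expert slots
# and attach a new router. Continued pre-training is what makes the copies diverge.
import copy
import torch.nn as nn

def upcycle_dense_mlp(dense_mlp: nn.Module, d_model: int, n_experts: int = 8) -> nn.ModuleDict:
    experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(n_experts))
    router = nn.Linear(d_model, n_experts, bias=False)  # trained from scratch afterwards
    return nn.ModuleDict({"experts": experts, "router": router})
```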
Jon Krohn: 00:48:24
Nice. Okay, and then, so I interrupted to dig into the sparse upcycling a bit and just sparse models in general. So I think I interrupted you while you were going into some explanation of, okay, so if we know that the sparse upcycling is reserved for people who can train with trillions of tokens, huge amounts of compute, that’s not going to be the case for most people. Maybe you might have something that you wanted to get back to, but if not, then I guess an interesting thing here to do would be to say, with other MergeKit capabilities other than sparse upcycling, what kinds of number of training samples or amount of compute… We already have some sense from you that some of these things can be done CPU-only, for example, if we don’t want to use Evolutionary Model Merging, we can do CPU-only, which is pretty darn impressive. It has never occurred to me that I could be fine-tuning LLMs using CPUs only. So I guess maybe talk a bit about how you can be using MergeKit to be getting more efficient training.
Charles Goddard: 00:49:31
Sure. So I’ll briefly touch on that first part. So the actual sparse upcycling and creating true Mixture of Experts models from scratch is very much reserved for the people with more money than I can count. But I think there’s a lot of promise in the Mixture of Agents approach, which is a more coarse-grained approach to a similar thing. So I think there’s going to be some exciting stuff coming out in that space in the near future. We’ll see. Yeah, so as far as [inaudible 00:50:03]-
Jon Krohn: 00:50:02
Nice, and so that idea with Mixture of Agents, you’re less concerned with blending. And so they are, with a Mixture of Agents, it is more like that discrete thing that I was describing where you have a coding model, a math model, a natural language model.
Charles Goddard: 00:50:20
Exactly. Yeah. You actually have separate models and each is generating their own separate answer, and that’s getting combined down into one better overall answer.
Jon Krohn: 00:50:30
Nice. Okay. Yeah, and I interrupted you again.
Charles Goddard: 00:50:34
No worries. Yeah, so as for the efficiency angle, that’s honestly a big part of why I initially created MergeKit, was that I wanted to play with large language models. And when you don’t have access to serious GPU hardware, it’s hard to get into. So these algorithms for model merging don’t fundamentally need to be run on a full cluster of A100s or H100s. It’s just easiest to write the code that assumes that everything’s in memory all at once.
00:51:05
So one of the big advantages of MergeKit is that it’s written entirely as an out-of-core operation. That is to say, as long as you can fit one part of the model in memory at a time, well, okay, one part per input model at a time, then you can merge a model. So I’ve run MergeKit on everything ranging from, I used an example earlier of a 10-year-old laptop with only a CPU. So I’ve actually done that when I was initially developing MergeKit. I’ve merged 70 billion parameter models on a laptop with eight gigabytes of RAM. And at the same time, it scales all the way up to, if you do want to be running it on your cluster of eight H100s for whatever reason, it’ll take advantage of that too. So it’s entirely elastic to what hardware you have available.
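Here is an illustrative sketch of the out-of-core idea (not MergeKit’s actual implementation): stream one tensor at a time from each input checkpoint, combine, and write the result, so peak working memory stays close to one tensor per input model. For brevity this version keeps the merged result in memory before saving; a fully out-of-core version would also write the output in shards.

```python
# Illustrative only: naive tensor-by-tensor averaging of safetensors checkpoints.
from safetensors import safe_open
from safetensors.torch import save_file

def average_checkpoints(paths, out_path="merged.safetensors"):
    handles = [safe_open(p, framework="pt") for p in paths]
    merged = {}
    for name in handles[0].keys():
        tensors = [h.get_tensor(name) for h in handles]  # one tensor per input model
        merged[name] = sum(tensors) / len(tensors)
    save_file(merged, out_path)

# Example with placeholder paths:
# average_checkpoints(["model_a.safetensors", "model_b.safetensors"])
```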
Jon Krohn: 00:51:52
Very cool. That is pretty wild. I was not aware that that was possible. So yeah, huge models like a 70 billion parameter model, which is the biggest open-source model size that I would ever think about trying to fine-tune, being able to train that on CPU only, that does sound cost-effective. And then based on my experience, I can imagine that there would be scenarios where even just a few hundred training examples would allow you to effectively fine-tune. Certainly, we have done that, where we’ve taken 7 billion or 13 billion parameter models and then using just hundreds or, in some cases, thousands of examples. And those examples could actually be, if you don’t have those hundreds or thousands of examples from your own data pool, an interesting thing is you can be using larger generative AI models, that could be open source, especially if you want to be careful not to be stepping on the terms of use of a closed-source model.
00:52:56
You can be using an open-source model, like a 70 billion Llama 3 to generate training data, to generate hundreds of examples, thousands of examples, to fine-tune your model. And then you could be using something like MergeKit to be efficiently and cheaply fine-tuning to just those hundreds or thousands of examples, in potentially minutes.
Charles Goddard: 00:53:18
Yeah. So I do want to point out just a distinction there in between the merging and the fine-tuning.
Jon Krohn: 00:53:24
Yeah, [inaudible 00:53:26].
Charles Goddard: 00:53:26
So the actual fine-tuning operation does still need a GPU. So MergeKit doesn’t do any of that [inaudible 00:53:31].
Jon Krohn: 00:53:30
[inaudible 00:53:31].
Charles Goddard: 00:53:31
But the key advantage is that when you have very little data, and hundreds is just an extreme case, that’s difficult, but you can make it work. But yeah, if you’ve got a somewhat small data set, you can train your model just on that, and then using MergeKit, you can combine it with some other model that’s seen millions to billions of tokens, the general instruction-following or other domain models that are available online, and Voltron them together into one.
Jon Krohn: 00:53:58
Wow, that is unreal, Charles. Yeah, and thank you for clarifying. That really concrete example makes it easy to understand what’s going on there. So before we started recording, I was tipped off to another product, I guess, other than MergeKit. And so we can clarify, maybe I’ll hand this one to you, Mark, since we’ve been focusing on Charles for a while here with these technical ones, though you’ve also demonstrated, Mark, that you can handle these technical questions very well yourself. But this one is about Spectrum. I don’t know anything about it. It sounds like it’s something new that you guys are releasing and we need to talk about on air.
Mark McQuade: 00:54:35
Yeah, I mentioned earlier that the power of merging, I think, is pairing it with training, right? So doing some training or customizing of a language model and then merging it with other models, okay? Merging two models that exist is pretty cool, but the true, let’s say, business power is training a model and then merging it with another model that exists, right? So on the training side of things, we utilize something called Spectrum, right? Spectrum is essentially an efficient training mechanism that allows you to train about 40 to 50% faster, and therefore cheaper, than the traditional methods of today, right?
00:55:25
So it does so by targeting specific layer modules based on their signal-to-noise ratio. I won’t get too crazy on that, I didn’t write the paper. A couple of other people in our company, a couple of engineers, wrote the paper. So they target these specific modules and train those, and they freeze the remaining modules within the network. Because you’re freezing modules, you’re not training over every single module and parameter, so you have the ability to do it much more efficiently. And there has been no degradation compared to full training, right? Full continual pre-training versus Spectrum continual pre-training, right? So we utilize Spectrum to deliver customers a much more efficient training process for models, where they continue to save money, right? We’re all about efficiency. We think we’re inventing the next wave of efficiency when it comes to models with transfer learning and MergeKit, and now we’re doing the same thing with our training routines.
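To give a feel for the general mechanism Mark is describing, here is a minimal sketch of selective freezing: every module is frozen except a targeted subset, so far fewer parameters receive gradient updates. Spectrum chooses which modules to target using their signal-to-noise ratio, which is detailed in the paper; in this sketch the model name and target list are just illustrative placeholders.

```python
# Sketch of Spectrum-style selective training: freeze everything except a
# targeted subset of modules. The SNR-based target selection is described in
# the Spectrum paper; the model name and target list here are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/base-model")  # placeholder

# Substrings identifying the modules to keep trainable (illustrative only).
TARGET_MODULES = ["model.layers.10.self_attn", "model.layers.11.mlp"]

for name, param in model.named_parameters():
    # Only parameters whose names match a target stay trainable.
    param.requires_grad = any(target in name for target in TARGET_MODULES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")
# The model can now go into a standard training loop or Trainer; the frozen
# modules receive no gradient updates, which is where the savings come from.
```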
Jon Krohn: 00:56:31
Very cool, and so is this something that is also available open source?
Mark McQuade: 00:56:37
Yes, it is available open source. We’ve adapted it a little bit from our perspective within our product, but yes, the core of Spectrum is open source and we have a paper, an arXiv paper, that you can share with the audience after. It obviously goes into much more detail on the technical side of things, on how we do the training, but it is open source.
Jon Krohn: 00:57:02
Perfect. Yeah, I’ll be sure to include both the GitHub repo and the arXiv paper in the show notes for people who are interested.
Mark McQuade: 00:57:11
We also have a MergeKit paper, if you want to share that as well.
Jon Krohn: 00:57:15
Nice. Yeah, will do. So we’ll include all those things. Lots to dig into there. So it sounds like, and you don’t need to answer this question if it’s getting into stuff that I’m not supposed to be asking, but it’s starting to sound to me like Arcee is a company that allows people, either via an open-source route or an enterprise route, to create great LLMs through a range of different approaches, and, presumably, to keep adding more approaches over time. So Spectrum, MergeKit, more and more of these kinds of approaches to allow people to create more and more efficient LLMs, running in production, securely, at a lower and lower cost.
Mark McQuade: 00:58:09
Yeah, exactly. As I said, we’re all about efficiency. It wasn’t that long ago when Mosaic came out and said, “Oh, train GPT-3.5…” Was it 3.5? Maybe it was three, I don’t remember, but, “Train a GPT-3 or 3.5 level model for half a million dollars.” Right, and everyone was like, “Oh my God, only half a million.” We actually say, “Train a model for 10K.” Literally, we’re taking it that much deeper, right? We’re going so much further than that. So we think we’re the next frontier of efficiency from within model training, where you don’t have to be the large enterprise of the world with a ton of money. You don’t have to focus on one high ROI use case anymore with us. We’re saying, “Focus on 10, 20, 50 use cases. We can make it all happen.”
00:59:02
Yeah, and I look at what we do as two-pronged, okay? So if you think of it, the product that we have, which has MergeKit and Spectrum incorporated into it, and we’ve wrapped all these great routines under APIs. We deliver it as a product with an SDK and a UI. What that is, is a set of tools to enable companies to train their own models, right? It’s a set of tools to enable companies to build great models, okay? And then on the second side, we are training our own models and releasing them to the world, right? So we’re going after both pieces, like that saying of give a man a fish, he’ll eat for a day; teach a man how to fish, he’ll eat for a lifetime, right? So that’s the tool suite, right?
00:59:47
We’re giving them the tool suite and we’re saying, “Go train your own models. Do it efficiently, effectively, and fast.” And then alternatively, we are also going to train models and we’re going to pump them out to the world, right? And that will give us a great angle on both sides of the coin, if you look at it. We just released Arcee Spark last week, which is a 7-billion-parameter model that outperformed GPT-3.5 on certain evals. We then released Arcee Agent, which is a 7-billion-parameter function-calling model that, again, beat closed-source models on specific evals at 7 billion parameters. We’re going to keep doing that, right? We’re going to keep releasing new models. That’s going to put us in a category that I think no one else is in, right? It’s both providing the tools to enable people to train models and being in the model-building category ourselves.
Jon Krohn: 01:00:42
Nice. So yeah, so my kind of definition of Arcee being this company that provides a set of tools to allow companies to train models efficiently and effectively, that is part of what you guys do. The second part, which we hadn’t talked about yet in this episode at all, and so I’m glad you mentioned it, I was going to make sure we got to it, is things like this Arcee Spark model, your compact 7 billion parameter model that outperforms larger models like GPT-3.5 from OpenAI on some benchmarks.
01:01:08
Super impressive, and I think that also is a nice segue into this general idea of small language models, SLMs. So you have the Arcee Cloud and small language models. Can you elaborate on why you think smaller, more specialized SLMs are the future, and how they compare to the utility and scalability of foundation LLMs? So your GPT-4, your Gemini, your Claude, those are foundation LLMs that are useful in a broad range of circumstances. Why do you think smaller, specialized SLMs, small language models, are the future instead? And then maybe you can work Arcee Cloud into the explanation of that transition.
Mark McQuade: 01:01:59
Yeah, well, I’ll start with Arcee Cloud. Arcee Cloud is being launched on Tuesday. So [inaudible 01:02:06].
Jon Krohn: 01:02:05
Today.
Mark McQuade: 01:02:06
Air day, yes. [inaudible 01:02:09]-
Jon Krohn: 01:02:08
[inaudible 01:02:09].
Mark McQuade: 01:02:08
We’re live. So, yeah, Arcee Cloud is just our offering. I had mentioned before the offering where we go into customers’ environments and drop our software stack into their environment. Arcee Cloud is just a true SaaS. So we expose a true SaaS and allow people to come into our software and train and merge language models, right? The reason we wanted to launch Arcee Cloud is because launching inside a company’s VPC is not always needed, right? Some companies don’t need that level of security and they just want a quick, easy, fast way to go in and rip models, right? So in that case, we’re launching our SaaS, and a bunch of other companies that are close to us don’t have in-VPC offerings, and they’re very successful with a cloud offering. We figured we should get in the game of the cloud offering as well, and it can be extremely lucrative in that sense, right?
01:02:59
So now, why SLMs, right? It’s what we spoke about already. SLMs are more compact, more efficient, cheaper to train, and can be equally or more powerful on your specific task and use case if trained right, right? So why have your 200, 300, 500-billion-parameter model for your HR use case, as an example, okay? Or for your customer support LLM? It’s just not needed, right? Because you’re going narrow with the use case and you don’t need a GPT-like model that can do a million things for you, you have the ability to shrink the model size, and shrinking the model size brings a bunch of different benefits with it, right? And I think that’s just what we’ve seen, right? We’ve felt the pull for smaller and smaller, to the point where now it’s on device, on edge, right? People are like, “Oh, okay, well, how can you run an LLM on the edge? How can you run it here or there?”
01:04:06
So I think it’s just going to keep getting smaller and smaller. You remember when, yeah, maybe you don’t remember, but in the ’80s, how big was the PC? How big was the server back then, right? And now you have a smartphone in your pocket that is how many thousand times more powerful than those computers back then? It’s the same evolution, right? It’s the same evolution, because those gigantic models just aren’t sustainable. So, yeah, we’re all about it. Small is a subjective term; we actually think around 70 billion parameters and under is considered small, to me, to us. So you could flex up to Llama 3 70B or Qwen 72B. They are extremely powerful. And then flex all the way down to your Phi. What is Phi now? Is it 2 billion, 3 billion?
Jon Krohn: 01:04:59
There are Phi’s that small. Yeah, for sure.
Mark McQuade: 01:05:04
Yeah. Yeah. So, yeah, and in that case, imagine an organization that has a hundred SLMs just humming, working on different tasks and different use cases, and all of them together cost the same as hitting Claude at scale, right? You know what I mean? So yeah, it just makes more sense.
Jon Krohn: 01:05:21
Mm-hmm. Yeah, all of them together, not each of them.
Mark McQuade: 01:05:23
No, no. All of them together. All of them [inaudible 01:05:25].
Jon Krohn: 01:05:23
Yeah, exactly.
Mark McQuade: 01:05:23
Yeah.
Jon Krohn: 01:05:27
Nice. Very cool. So we’re running out of time a little bit. I guess, to give you guys kind of a tricky, big, open-ended question to finish with, before I get into my final questions that I always have at the end of every episode: you guys have worked on some really cool tech. So, Mark, at Hugging Face, that’s one of the most transformative platforms for LLMs of our time. Charles, you’ve done stuff like working at the NASA Jet Propulsion Lab as well as Apple, and you’ve been at the forefront of pioneering technologies like augmented reality and 3D visualization. So do you guys have a vision, maybe some interesting idea, some dream of how LLMs, how AI, could transform the world in our lifetime? So this is like you’re trying to imagine 30, 40 years into the future: how transformative could this AI revolution, which I believe we’re just at the start of right now, be across our life and our work? I don’t know, we can start with you, Charles.
Charles Goddard: 01:06:36
That’s a tough question. Talk about chaotic systems. Things are moving so fast at this point in time, and I think any individual person’s prediction on where these technologies are going to be, even a year from now, is going to be just guesswork at best. I do think we’ll probably see some significant societal changes with the commoditization of intelligence. That’s not a thing we’ve had access to ever before in the history of the human race. Even if the models don’t ever improve from where they are, which they will, having even the intelligence of a grade schooler on tap at the press of a button, there’s so much that’s going to change once the full ramifications of that sink into the world at large.
Jon Krohn: 01:07:28
Yeah. That’s a great answer. I like that. The commoditization of intelligence. Absolutely, and you’re so right. With just the capabilities we have today, which are probably nothing compared to what we’ll have in a year or five years, with just the capabilities we have today, getting those small, getting those cheap to train, cheap to call, for example, in edge devices like Mark mentioned in his most recent response, that will transform all industries and all ways of life, but we are going to have even more intelligent machines. So that’s crazy. Mark, I’ll give you a crack at this impossible question.
Mark McQuade: 01:08:04
Yeah, it’s extremely hard, right? I’m a good Catholic boy growing up, right? So I’ll never forget this, when I was growing up in Catholic school, one of my teachers said to me, “You can never imagine heaven because it’s beyond your imagination.” So your imagination stops and then that’s where heaven begins, right? And that always stuck with me, and I would say the same thing here. You can’t really imagine. Could you have imagined the internet prior to the internet? No, right? No, I don’t think anyone could have even imagined what that would’ve been, right? I think this will be bigger than the internet. And internet, cloud, when that blew up in, whenever that was, the early 2000s, mid 2000s, I think this is bigger than both of them. So you just can’t really imagine where it’ll be and then one day you’ll be looking back going, “Geez, look what it became.” I guess, this is how it is.
Jon Krohn: 01:09:07
Yeah. Yeah, you’re right. And I guess that’s how we get terms like the singularity, Ray Kurzweil’s term, and that’s kind of what you’re describing there, the thing beyond what you can imagine. Yeah, and I hope that it will be more heaven than hell.
Mark McQuade: 01:09:26
Yes, [inaudible 01:09:27].
Jon Krohn: 01:09:27
And I think there’s a good chance it will be. So let’s keep pushing on that. What you guys are doing there at Arcee, as well as what many of our listeners are doing, will hopefully nudge us in the direction of utopia instead of dystopia. Nice.
Charles Goddard: 01:09:43
[inaudible 01:09:43] for the better.
Jon Krohn: 01:09:44
So with that, I ask all of our podcast guests for a book recommendation. What have you got for us? Charles, do you want to go first?
Charles Goddard: 01:09:53
Sure. So I’m a big sci-fi fan. The Hyperion Cantos is a great read if you’re into that sort of thing, or if you just feel like having a weird time, I always recommend reading Gödel, Escher, Bach.
Jon Krohn: 01:10:04
Yeah, Gödel, Escher, Bach. Yeah, people have recommended that on air before. That is one that I’m dying to get into. It seems like a great book. And Mark?
Mark McQuade: 01:10:14
Yeah, I’m a big mystery, thriller, horror guy, for my books and my shows as well. Movies.
Jon Krohn: 01:10:26
And your weekend escapades.
Mark McQuade: 01:10:27
And my weekend escapades. I don’t read books as much as I used to because if I do read a book, it tends to be a tech book and it’s just so… Yeah.
Jon Krohn: 01:10:36
[inaudible 01:10:37].
Mark McQuade: 01:10:37
Yeah, but I love Harlan Coben, so I’ll give an author, I won’t give a book. I’ll give you an author, Harlan Coben. Anything he does is gold. He just doesn’t miss, you know what I mean? He just doesn’t miss.
Jon Krohn: 01:10:45
Mm-hmm.
Mark McQuade: 01:10:45
[inaudible 01:10:48].
Jon Krohn: 01:10:47
Nice. Great recommendations. All right, and then I will obviously have links to arcee.ai, the company, as well as the various GitHub repos associated with your company, and papers associated with things like Spectrum and MergeKit. Other than those kinds of resources, where should people be following either the company, Arcee, or you as individuals, Charles and Mark? Charles, I guess I’ll give that to you first, if there’s a social media account people should follow you on, anything like that.
Charles Goddard: 01:11:22
Sure. So I am on Twitter. I basically never post, but I’m there and I have a blog where I post every four months.
Jon Krohn: 01:11:32
Nice. We’ll find links to those, and Mark?
Mark McQuade: 01:11:35
LinkedIn. I’m much more active on LinkedIn than Twitter. I do have Twitter. I don’t even remember my Twitter name. I used to have a podcast myself, and it’s named after the podcast I used to have.
Jon Krohn: 01:11:47
[inaudible 01:11:47].
Mark McQuade: 01:11:47
Yeah. Yeah, I worked at Rackspace. I had a podcast called AI & U. So [inaudible 01:11:52]-
Jon Krohn: 01:11:50
AI & U.
Mark McQuade: 01:11:53
AI & U. Yeah, [inaudible 01:11:55]. Yeah, yeah. Yes, yeah, yeah.
Jon Krohn: 01:11:55
Yeah, yeah, like the letter, U. Yeah, that’s [inaudible 01:11:58], nice.
Mark McQuade: 01:11:58
Yeah, you’ve heard of it, you’re a subscriber? Oh, okay.
Jon Krohn: 01:12:00
It’s my favorite show.
Mark McQuade: 01:12:01
So I am on Twitter. I am much more active on LinkedIn and that’s pretty much it.
Jon Krohn: 01:12:10
Nice. Yeah, that is the-
Mark McQuade: 01:12:12
[inaudible 01:12:12] Arcee as well. Arcee’s obviously on LinkedIn, on Twitter, and you can catch us anytime, anywhere.
Jon Krohn: 01:12:18
LinkedIn is where we all seem to have converged now, in a post-Twitter world.
Charles Goddard: 01:12:24
Yeah.
Jon Krohn: 01:12:24
It’s the same for me. I have a Twitter account, and for people who don’t use LinkedIn, the stuff that I post on LinkedIn, my social media manager takes that and converts it into something nice for Twitter, but it’d be pretty rare that you’d see something unique from me on Twitter that isn’t already on LinkedIn. And it’s just because I get 10x the engagement or more on LinkedIn, and when you see that, you’re like, “Well, this is where I’m going to spend my time.”
01:12:47
It’s pretty funny. I did, this is a bit off-piste and we are out of time here so I’ll make it really quick, but I was recently hosting an afternoon of sessions at Collision, which is run by the same people as Web Summit and next year it will actually be called Web Summit Vancouver, but for the past decade or two, it’s been something called Collision. It used to be in New Orleans, now it’s been in Toronto for the last decade, and it’s one of the biggest tech conferences in North America. 40,000 people show up and I was hosting an afternoon of sessions on content creation and how AI and other emerging technologies are transforming content creation.
01:13:31
And there was a YouTuber, Hafu Go, who’s adding a million YouTube subscribers a month right now, which is insane. And so people like that were asking to add me on TikTok and Instagram, and I was like, “It is so interesting that content creators in general, and I think people in general, are on completely different platforms than the people who are guests on my show. And for my guests, 90% of the time, conservatively, their social medium of choice is LinkedIn.” But that was really foreign to all these content creators. They were like, “What?”
Mark McQuade: 01:14:15
That’s good.
Jon Krohn: 01:14:19
Anyway, all right, so I’ll leave the episode with that. Thank you so much Charles and Mark for an amazing, interesting episode. You guys are doing great work, transforming the world, making paradise on earth, hopefully, and thanks so much. Hopefully we’ll catch up with you guys again in the future and see how the Arcee journey is coming along.
Mark McQuade: 01:14:39
Thanks, Jon.
Charles Goddard: 01:14:40
Thanks for having me on.
Mark McQuade: 01:14:41
Yeah, thanks for having us on. It’s been great.
Jon Krohn: 01:14:49
What an interesting and inspiring episode with Mark and Charles. In today’s episode, they filled us in on how model merging, through, say, Arcee’s open-source MergeKit, combines the strengths of multiple LLMs without increasing parameter count. How their Spectrum project allows for training an LLM for $10,000 that might otherwise cost 50 times more, by strategically freezing specific modules during training. How a Mixture of Agents approach allows each submodel to generate a separate, discrete response, while a Mixture of Experts approach blends them. How sparse upcycling requires lots of compute running over trillions of tokens, but allows the copy-pasting and then training of a single LLM’s weights into a broader Mixture of Experts. They talked about how smaller language models like their 7-billion-parameter Arcee Spark model can outperform foundation models that are orders of magnitude larger on some benchmarks. And they talked about how Arcee Cloud enables companies to transition to specialized small language models instead of relying on the big compute and big cost of huge foundation LLMs.
01:15:51
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Charles, Mark and Arcee’s social media profiles, as well as my own, at www.superdatascience.com/801. Thanks to everyone on the Super Data Science Podcast team, our podcast manager, Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masis, writers Dr. Zara Karschay and Silvia Ogweng, and founder Kirill Eremenko, for producing another astounding episode for us today.
01:16:21
For enabling that super team to create this free podcast for you, we’re so grateful to our sponsors. You can support the show, and please do so, by checking out our sponsors’ links, which you can find in the show notes. And if you’d ever like to sponsor the show yourself, you can get the details on how to do that by heading to jonkrohn.com/podcast. Otherwise, share, review, subscribe, and all that good stuff, but most importantly, just keep on tuning in. I’m so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.