Jon Krohn: 00:00:00
This is episode number 867 with Dr. Andriy Burkov, machine learning lead at TalentNeuron. Today’s episode is brought to you by the Dell AI Factory with NVIDIA.
00:00:15
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas, exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I’m your host, Jon Krohn. Thanks for joining me today. Now, let’s make the complex simple.
00:00:49
Welcome back to the SuperDataScience Podcast. Today’s episode is one not to miss, with the super famous machine learning author Dr. Andriy Burkov, who very rarely does interviews. Andriy wrote the indispensable Hundred-page Machine Learning Book that seems to be on every data scientist and ML engineer’s bookshelf. His artificial intelligence newsletter is subscribed to by nearly 900,000 people on LinkedIn. That’s insane. He’s the machine learning lead at TalentNeuron, a global labor market analytics provider.
00:01:19
He runs his own book publishing company called True Positive; he previously held data science and machine learning roles at Gartner, Fujitsu and more; and he holds a PhD in computer science with an AI specialization from Université Laval in Quebec, where his doctoral dissertation focused on multi-agent decision-making 15 years ago. Andriy’s latest book, The Hundred-page Language Models Book, was released a few weeks ago and has already been met with rave reviews online. I will personally ship five physical copies of The Hundred-page Language Models Book to people who comment on or reshare the LinkedIn post that I published about Andriy’s episode from my personal LinkedIn account today. Simply mention in your comment or reshare that you’d like the book. I’ll hold a draw to select the five book winners next week, so you have until Sunday, March 9th to get involved with this book contest.
00:02:09
Despite Dr. Burkov being such a technical individual, much of today’s episode should appeal to anyone interested in AI, although some parts here and there will be particularly appealing to hands-on machine learning practitioners. In today’s episode, Andriy details why he believes AI agents are destined to fail, how he managed to create a chatbot that never hallucinates by deliberately avoiding LLMs, why DeepSeek crushed Bay Area AI leaders like OpenAI and Anthropic, and what makes human intelligence unique among all other animals, and why AI researchers need to crack this in order to attain human-level intelligence in machines. All right, are you ready for this tremendous episode? Let’s go.
00:02:47
Andriy, welcome to the SuperDataScience Podcast. I’ve been trying for years to get you on, so it’s a great delight for me to finally have you on the show. Andriy, where are you calling in from today?
Andriy Burkov: 00:03:04
Hi Jon, and thanks for having me. I’m calling you from Quebec City, Canada.
Jon Krohn: 00:03:10
Very nice. I grew up in Toronto, and when we were in high school, as people started to turn 18 (the legal drinking age in Quebec is 18, but in Ontario it’s 19), everyone would take advantage of March break, which is a week off in the middle of March in Ontario, and organize a dozen buses from the high school to drive from Ontario to Quebec City so that we could take advantage of the 18-year-old drinking age there.
Andriy Burkov: 00:03:47
They accepted your Ontario ID cards as proof?
Jon Krohn: 00:03:57
Yeah, of course.
Andriy Burkov: 00:03:58
My daughters, I have two daughters, 18 and 17. The oldest, she counted the days until she could enter the alcohol store and just order whatever she wants. Quebec folks, I know, are very proud that here you can consume alcohol starting at 18, but in the US it starts at, I think, 21.
Jon Krohn: 00:04:25
21, that’s right, exactly. Yeah. You get people from Vermont driving up to Quebec because they have a three-year gap there, so it’s an even bigger deal, university students taking advantage of that. Yeah, but I don’t think I’ve been back since. That was some time ago, 20 years ago now, that I was taking advantage of that, but I haven’t been back, which is too bad because Quebec City is beautiful. I think it’s the only city in North America that has that European kind of vibe, because it still has the original walls from when you had to have walls to protect yourself from the Americans.
Andriy Burkov: 00:05:08
From the Americans, or whoever wanted to conquer the territory that you conquered first.
Jon Krohn: 00:05:12
Right, exactly, that’s right. Or take back the land that’s rightfully theirs, perhaps. Maybe that wall is going to come in handy again because of the rhetoric coming out of the US recently.
Andriy Burkov: 00:05:25
Yeah, it’s nice. In Montreal, there is a small part of old Montreal where it also feels quite old, but yeah, this wall around the downtown makes it special. There are a lot of restaurants, people walk on the streets, cars are very rare, so it’s nice. Especially now during winter, everything is illuminated. They put up installations, like sculptures made of ice illuminated with different colors, so it’s really, really like a postcard. You take pictures and you can send them wherever you want, like postcards.
Jon Krohn: 00:06:08
Very Instagrammable for all our listeners looking for that perfect Instagrammable place in winter, although I might recommend visiting in the summer when it’s nice and warm.
Andriy Burkov: 00:06:19
Yeah.
Jon Krohn: 00:06:21
Nice. You have either option. Listeners, this message has not been sponsored or supported by the Quebec Tourism Board, but we do highly recommend checking out Quebec City for a unique city in North America. All right, Andriy, on to topics that may more directly interest our audience. You have over 15 years of hands-on experience in automated data analysis, machine learning, and natural language processing, and you’re currently the machine learning lead at a company called TalentNeuron. That’s a data platform for global labor market intelligence, and so it helps businesses make workforce decisions. However, you are best known for your best-selling trilogy of concise machine learning books. There was The Hundred-page Machine Learning Book, which I see constantly all over the world on the bookshelves of data scientists and machine learning engineers.
00:07:16
I see your first book, The Hundred-page Machine Learning Book, on their bookshelves, and now your latest installment, The Hundred-page Language Models Book, is out; it’s just a month old. In the preface of that new book, you describe how your interest in text developed and how the complexity of extracting meaning from text fueled your determination to crack it. Can you walk us through your interest in language modeling and how close we have come to cracking it?
Andriy Burkov: 00:07:49
Yeah. Well, I should just correct one thing. It’s not a trilogy… Is it trilogy or triology? Sorry.
Jon Krohn: 00:07:55
Yeah, trilogy.
Andriy Burkov: 00:07:56
Trilogy. It’s like a duology with a spin-off, because I started with The Hundred-page Machine Learning Book and I didn’t expect to write anything else, because I thought that it would be kind of a trick if I wrote The Hundred-page Machine Learning Book and then continued to write about machine learning; it would be like, okay, yeah, it’s not really a hundred pages, it’s actually much more than that. But then Covid happened and I just looked for some project to do, because we all stayed home with nothing to do, so I decided to write a kind of machine learning engineering book. It’s not about machine learning itself, but more about how to apply it to solving real business problems. But then large language models happened. Large language models, for me, are a totally different story. Yes, it’s machine learning, but it’s so important on its own, and there have been so many new developments, both scientific and engineering, during the two years since ChatGPT was released, that it wouldn’t be a trick to write a book just about language models.
00:09:22
What I also wanted to avoid is writing just another book on LLMs, because if you go to Amazon today, you will see maybe dozens, maybe even hundreds of titles with LLMs, so I wanted to show the progression of the field, like how language models evolved, because probably 99% of people heard about language models just two years ago because of ChatGPT, but the science around language models has existed since the sixties of the last century. People, I mean scientists, always tried to create algorithms that would allow a machine to communicate like a human. The most successful approaches in the past were what we call count-based language models. These are basically just statistics. You take a large text collection, you take what we call n-grams, sequences of words, and you count how many times a word, like horse, is preceded by a given n-gram.
00:10:37
Take she rides a: you say, okay, she rides a is followed by horse a hundred times in this collection and followed by car 70 times. If you want the machine to generate text, it will just find the word with the highest count, and it will generate she rides a horse. This approach was quite successful, but the problem was that it scaled very poorly, because you needed to calculate all possible statistics for all possible n-grams, and if you want your model to be accurate, your n-grams should be long. For example, you should be able to, like today, put a thousand words as input and have the model generate the next one. If you want to count all possible n-grams of a thousand words, the number of combinations grows very fast.
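To make the counting idea concrete, here is a minimal sketch of a count-based n-gram model in Python; the toy corpus and all names are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus; a real count-based model is built from a large text collection.
corpus = "she rides a horse . she rides a horse . she rides a car".split()

n = 3  # context length: predict the next word from the previous 3 words
counts = defaultdict(Counter)

# Count how often each word follows each n-gram of length n.
for i in range(len(corpus) - n):
    context = tuple(corpus[i:i + n])
    counts[context][corpus[i + n]] += 1

def predict(context_words):
    """Return the most frequent continuation of the given n-word context."""
    continuations = counts[tuple(context_words)]
    return continuations.most_common(1)[0][0] if continuations else None

print(predict(["she", "rides", "a"]))  # -> 'horse' (seen twice, vs 'car' once)
```

With a vocabulary of size V and contexts of n words, there are up to V^n possible contexts to count, which is exactly the scaling problem described above.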
00:11:41
These count-based models, by the way, are still used in our smartphones. For example, when you type something to your friend, and let’s say you often type, okay, what do we do this night? It will remember that night quite frequently follows what do we do this, and it will suggest it as the first option. For this, neural networks aren’t used; this would be overkill, and it would be really slow because you would have to retrain the neural network every time, but count-based models on very short contexts work very well. My book is like a history of where it started and then where we are today. My personal fascination with the topic started because I started to work with the internet in 1998. I was 18 years old. The internet was really kind of a new thing, very few people actually used it, and even in my city, Sevastopol, I had to find a phone line that didn’t have noise.
00:12:53
Back then, you had to dial in with a modem, so if there was noise on the line, the connection would drop. Just for me, the company that operated landline phones… Well, not just for me, but for some group of people who needed this kind of stable connection, they created a special landline so that we could connect. Yeah, and when I connected to the internet, for me, the obsession was, there is so much information, but you really have to extract it manually. You go to one website, you read it or you copy something, you save it.
00:13:32
I thought that if we could automate this process, it would create a kind of interconnected, automated information exchange. I started to create some kind of scrapers, like a robot that can go to some website, detect that something new happened, some new information appeared, extract it and send me an email. For example, I was interested in games, and there was a website that published new articles about how to solve this game or the history of this specific game series. I was interested to know how it worked, but they didn’t have any alerts, like, okay, you are subscribed, we’ll send you emails. I really had to scrape the additions to their website, and this is how I started.
Jon Krohn: 00:14:23
This episode of SuperDataScience is brought to you by the Dell AI Factory with NVIDIA, two trusted technology leaders united to deliver a comprehensive and secure AI solution customizable for any business. With a portfolio of products, solutions, and services tailored for AI workloads—from desktop to data center to cloud—the Dell AI Factory with NVIDIA paves the way for AI to work seamlessly for you. Integrated Dell and NVIDIA® capabilities accelerate your AI-powered use cases, integrate your data and workflows and enable you to design your own AI journey for repeatable, scalable outcomes. Visit www.Dell.com/superdatascience to learn more. That’s Dell.com/superdatascience.
00:15:09
Oh, that’s a lot of history. And doing it today, you have scaled this up tremendously. You’re not having to worry about noise on a landline anymore. At TalentNeuron, you’re collecting over 700 million job posting data points daily, and then you’re using language modeling methods to deliver insights, which you’ve said previously amount to 95 million daily predictions based on those 700 million job posting data points, and obviously each of those job posts has a huge amount of information on it. Can you tell us a bit more about that work? Obviously you can’t go into proprietary details, but kind of linking the history of what you’ve been doing, what you cover in the book, to what you’re doing today?
Andriy Burkov: 00:15:57
By the way, I’m working at TalentNeuron, but I didn’t join TalentNeuron. I joined a company whose name was WANTED Analytics, a local Quebec City company that was created by a local, [inaudible], who is an entrepreneur here from Quebec City. When I joined, we were probably about 30 people, and two or three years later, the company was acquired by a multinational American consultancy called CEB, and we were under CEB for probably another year or two, and then CEB was acquired by Gartner, so then we were under Gartner for another two or three years. Two years ago, Gartner sold our business as a separate entity, and this is how I ended up working at TalentNeuron, so basically I changed four companies without changing my chair. This was quite funny.
00:17:03
Yeah, the goal of the product that we have is to provide actionable insights to people responsible for recruiting or workforce planning in enterprises. How we do it is that we have robots that go to different corporate websites, job boards, some aggregators, applicant tracking systems, those sources that contain jobs. We call them jobs, but they’re job postings: when you look for a job, you open one and it says the title is this, and this is the description, the conditions, and so on. We download these postings daily. We have today, I think, about 35,000 to 40,000 such robots. These robots, they’re not intelligent, they’re fixed, so basically they know how the specific website works, they know what to click to go to what page, they detect that something changed, and they download based on some predefined rules. For most of my history in the company, my team was responsible for post-download.
00:18:32
When the job is already in your database and you have to normalize it or extract something from it, we worked on all sorts of projects. For example, a typical job posting contains skills, so we created a system that detects different skills in the job posting, extracts them, normalizes them, because sometimes it’s written, for example, Java Script, sometimes it’s JS, sometimes it’s JavaScript in one word, so we need to detect that it’s all the same skill. Sometimes it’s funny, when they say you must excel in Word. In this case, Excel is not actually a skill, but Word is. There are plenty of interesting NLP, natural language processing, challenges here. Since the first language models were released, and probably you know about BERT and RoBERTa, the first transformers… Well, Google released BERT and then Facebook released RoBERTa; these were what we call encoder language models.
00:19:49
They cannot talk, they cannot be used to generate text, but they understood text very well, so we adapted these BERT transformers, probably somewhere around 2018, 2019, to, for example, predict the industry. The job posting talks about the company, and from the company description, the model can predict what industry it is. Those models are also good as classifiers, so if we want to distinguish Excel in one context versus Excel in a different context, we can train such a transformer to read not just the word Excel itself, but its surroundings in the text, and make a prediction. Then it all became multilingual. Previously, historically, machine learning was monolingual. We had a salary extractor, for example, and we supported about 25 different languages. For each language, in the past, we created a salary extractor for that specific language, so we really needed to label examples in Chinese, in Japanese, in Russian, and so on.
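As a rough illustration of the kind of contextual classification described here, the sketch below uses the Hugging Face transformers library with an off-the-shelf zero-shot model; TalentNeuron’s actual system is a fine-tuned encoder, and the labels here are invented:

```python
from transformers import pipeline

# A stand-in for the contextual classifier described above. We use a public
# zero-shot model purely for illustration; the real system is a fine-tuned
# BERT-style encoder, not this model.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["mentions Microsoft Excel as a skill", "uses 'excel' as a verb"]

# The model reads the whole snippet, so the same surface form 'excel'/'Excel'
# can resolve differently depending on its surrounding context.
print(classifier("You must excel in Word.", labels)["labels"][0])
print(classifier("Experience with Excel pivot tables is required.", labels)["labels"][0])
```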
00:21:12
But today it’s all so much simpler, and again, it’s a very recent trend that most models being released are now multilingual. For me, five years ago, probably I wouldn’t be surprised, but 10 years ago, if you had told me that the same model can accept whatever language it is and output whatever language you want, I would have said, no, it’s crazy, it cannot work like this, but today this is what we have. Now, for example, for extracting salaries for every new country that we add, we don’t label the data for the new language by hand anymore. We probably don’t label at all; we just reuse already labeled data from other languages to train the model for the new language, because a salary is a salary, so no matter what language you use to describe it, it’s still a salary.
00:22:08
The models are really strong now at generalizing across languages. Yeah, this was post-download, but now we also work a lot on what we call pre-download, for example, those robots that I mentioned that come to some website and find the jobs to download. Now, we work on a system that builds these robots automatically from scratch. Basically, you just say, here is the home URL of a corporate website; go find the career section, identify where the listing is, where all the links to the jobs are, create rules to extract different elements from this listing, for example, the title, the location, the posted date, and then click on those links, open the description and extract the description itself, and if it contains any additional attributes like industry or perks and so on, also create rules for extracting those. Now, we don’t have to create those scripts or robots manually, because historically, again, it was very difficult for our software engineers to automate every specific website, because all websites are different, programmed using different technologies by different people.
00:23:38
Some are programmed well, some less well, some are really ugly, so our developers really struggled to create those scripts, and especially to fix them. Something small changes on the website, and the robot is no longer capable of gathering information, so you have to open the script, look inside, figure out what happened, why it’s no longer working, and so on.
00:24:02
Now, with this automated script creation, we can automate at least half of the websites fully. For the other half, it’s still manual work, but those websites we consider challenging, and challenges are usually more interesting for humans to solve than something routine.
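As a toy illustration of the first step in such a pipeline, finding the career section from a home URL, here is a sketch using requests and BeautifulSoup; it is a simplification, and a production system handles far messier realities:

```python
import requests
from bs4 import BeautifulSoup

def find_career_links(home_url: str):
    """Toy first step of an auto-built scraper: from a corporate home page,
    collect links whose text or URL suggests a careers section."""
    html = requests.get(home_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    keywords = ("career", "careers", "jobs", "join us")
    links = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(" ", strip=True).lower()
        if any(k in text or k in a["href"].lower() for k in keywords):
            links.append(a["href"])
    return links

# Example usage against any corporate home page:
# print(find_career_links("https://example.com"))
```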
Jon Krohn: 00:24:25
Yeah, you’re giving a great example there of the kinds of capabilities that the large language model revolution is unlocking for us. There are so many things that can now be done in an automated way, like you’re saying there: half of the time you can automatically identify how a website is formatted and download from it, as opposed to having to hand-code it. I’m sure that that percentage is going to go up and up over the coming years.
Andriy Burkov: 00:24:54
Well, yeah, but about all this excitement, you definitely follow this media craze and hype around AI, around automated agents that will solve problems for us, as I mentioned. Well, we have extensive experience now in trying to use those LLMs out of the box to help us in organizing or extracting information. It’s not as beautiful in reality as it is in presentations. The problem is that LLMs are really good at the problems that we call in-distribution. In-distribution means that the problem you ask it to solve is similar to the data that the LLM saw when it was trained. The problem for businesses specifically, maybe for individuals it’s less of a problem, but for businesses, the problem is that we don’t know what is in-distribution and what is not, because to know it, you actually have to have access to the entire dataset that was used to train the model.
00:26:18
This dataset is hidden. Probably with a couple of exceptions, all LLMs, including those that we call open-weight, are not really open. You can use them, yes, you often don’t even have to pay for them, but you cannot really tell whether your specific business problem is in-distribution or not. You can test; for example, let’s say you develop a system based on LLMs and you provide some tests, some inputs, and it’s kind of cool. You give it a problem, it solves it, you see the solution, it makes sense. But when you put it in production, real situations start happening, not those that came from your head because you decided they would make a good test, but real data, and this real data may not be aligned with what you think the data in production will look like.
00:27:20
In this case, the LLM can become arbitrarily wrong, make wrong decisions, output wrong information, and we don’t really have any detection mechanism, and of course we don’t have any prevention mechanism. So when you put systems that are based on LLMs in production, this is where the nightmare starts, because while you are coding and testing in your controlled environment, you are happy, everything is good, but then you put it in production. What does production mean? It means in front of the users. This is where you start getting into trouble, and users get angry, and you say, okay, we’ll fix it, but you have no idea how to fix it because you are blind. You know that it doesn’t work, but you have no idea what to do to make it work, because you don’t know how close it is to the distribution. Maybe it’s just close enough and you can add a couple of examples to fine-tune it, and that’s it.
00:28:23
Or maybe the use case was so far from the distribution that no matter how many examples you give, there will always be cases around it that still won’t work. This is why I really recommend that anyone with decision power listening to this podcast think twice, maybe even three times, or maybe 30 times, before you actually decide to put something LLM-based in front of your users, because it might sound cool, like, oh look, we use LLMs, but lost reputation and angry customers are not something you will find cool.
Jon Krohn: 00:29:09
Yeah, digging into the point that you just made about some of the limitations around agents: you emailed me ahead of us recording this episode, and I was caught off guard; you said something like, agents won’t fly. At this time, when everybody is talking about agents... Just as some examples: Jensen Huang, the CEO of NVIDIA, said recently at the Consumer Electronics Show that 2025 is the year of AI agents. Salesforce CEO Marc Benioff is equally bullish, predicting AI agents will take over the labor force. Andrew Ng and Andrej Karpathy both say that agentic AI will revolutionize labor and is paving the way to AGI. When you and I were discussing potential topics over email, you said to me, AI agents won’t fly, and I was so surprised by that. I mean, I knew that you might mean they won’t work at all, but it’s so contrarian that I asked you explicitly.
00:30:12
I was like, do you mean they’re not going to be flying airplanes, or are you saying that they won’t work? You meant the latter, you meant that they won’t work, and then you elaborated. You said agents will never work in real life because a single agent is just a rebranded LLM, and then it has the kinds of in-distribution issues that you were just describing, while more than one agent working together, a multi-agent system, is un-debuggable. I don’t know, do you want to dig into this a bit more?
Andriy Burkov: 00:30:39
Well, I have a couple of comments. First of all, Karpathy, I respect him very much, and he is very cautious in his choice of words. He never says something like, okay, agents will replace humans. He posts his ideas about how it might be, but he never predicts that it will actually happen. Those others you mentioned, for example, the NVIDIA CEO, well, we should understand that these people, when they speak, they don’t speak from the heart; they speak as representatives of a huge company that is responsible to its shareholders. If saying something increases the share value and it’s legally permissible, they do it. He knows that saying agents will become huge this year means for investors that they need to buy more GPUs, because if everyone runs agents and you don’t have GPUs to run them, then you lose.
00:31:56
Saying something like this just works well for his specific company. Salesforce, the same thing. When they started, they said that traditional software is dead and now it’s software as a service, and they even had a logo with the word software crossed out, like, okay, it’s gone. Traditional software didn’t go anywhere. Yes, there is a lot of SaaS, but there is a lot of traditional software too. Now they say, okay, well, SaaS is dead, now it’s agents. Will it happen? I really doubt it. Yes, for some use cases, agents will probably be good.
00:32:46
Again, if we look at this in-distribution versus out-of-distribution, the best use case you can imagine for an agent is one that gathers information: it crawls the web, finds some interesting documents, some relevant documents for your business decision-making, extracts and aggregates them into some report, and sends it to some decision-maker. Why would it work? Because LLMs were trained on web data. For them, web data is the closest to in-distribution that you can theoretically get, so of course, if you say my agents crawl the web and extract pieces of relevant text, yeah, why not? It might work.
00:33:39
Is it a huge use case? Does everyone need agents that crawl the web and extract relevant information? Some might, some probably not. Some might say, I can just Google the information that I need. I, for example, have Google Alerts; every time someone mentions my book online, I receive an alert. Is it an agent? Someone might say it’s an agent, but it’s just a cron job that runs a search on the Google index. These people are interested in promoting their business, and this is what they say.
00:34:20
Talking about multi-agent systems, my PhD was in multi-agent systems, so if there is one thing I understand in AI, one might say it’s multi-agent systems. The biggest challenge with multi-agent systems, and any distributed systems, is to debug them. Debugging is hard because there are multiple actors. When you debug typical software, I don’t know if you have experience coding, for example, some function doesn’t work, or the code enters this function and then it crashes, so what do you do? You run the debugger: you put a breakpoint in your function and you run the code, the code runs until it reaches the breakpoint, then it stops, and then you have this next, next, next, like a step-by-step where you can execute each command or operation one by one. By doing this, you can also observe how the values of all variables in the environment change.
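For readers who haven’t used a debugger, here is what that breakpoint workflow looks like in Python; the buggy function is a made-up example:

```python
def normalize_salary(raw):
    # Made-up buggy function: crashes on inputs like "50k".
    return float(raw.replace("$", "").replace(",", ""))

def process(postings):
    for p in postings:
        breakpoint()  # execution pauses here (built into Python since 3.7)
        # In the debugger you can step line by line with 'n' (next),
        # inspect any variable by typing its name, and continue with 'c'.
        print(normalize_salary(p))

process(["$100,000", "50k"])
```

This works because a single program fully stops at the breakpoint; with 25 asynchronous agents there is no equivalent pause-the-world button, which is the point that follows.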
00:35:27
This is how, as a human being, you can detect that something is wrong, and this is how you will update your code. Now, imagine that you do this for one of your agents while 25 others are still doing something, and you cannot stop all of them at the same time because they are all independent pieces of software; they are still operating while you try to debug one of them. And this is when you actually control them. With LLM-based agents, you don’t control them at all; you cannot debug an LLM. An LLM is a neural network, it’s a black box. There is nothing to look inside to say, oh, why does the information flow through this specific part of the neural network? This is not how it’s done. It either works or it doesn’t. Imagine you have 20 or 50 such agents, especially when they interact with one another, because it’s one thing to debug 25 independent agents, but when they all collaborate together to produce some final result, it’s crazy.
00:36:44
Debugging a distributed system is difficult because of this. It’s asynchronous. Every process runs independently of every other; you really cannot stop the whole system and debug it. This is why I am very skeptical about agents. As I said, for some very specific use cases it will work, but imagine you have agents that should navigate your intranet, not the internet, but your intranet, with all that legacy software you have. You have software that contains your employees’ salaries and performance and so on. You have SharePoint with some outdated information. You have your Git with code, you have your documentation, everything, and you put agents in there, and they don’t know anything about any one of your internal systems. They see them for the first time, and you think that just by telling them, okay, you are a helpful, intelligent agent, you can walk through different applications in my intranet and find issues and fix them. Come on, let’s be realistic; they will break their teeth quite fast.
Jon Krohn: 00:38:12
All right, yeah, that’s a compelling argument. Our researcher, Serg Masis, had a question for you. Maybe this is a tricky question because maybe there isn’t an answer, maybe multi-agent systems just can’t be debugged, but given your PhD in this, maybe you have some insight into alternative architectures or design principles that we could use to create robust and maybe even interpretable AI systems for complex tasks.
Andriy Burkov: 00:38:44
Well, I think, if we are realistic, any multi-agent system must give you a hundred percent control over every actor in the system. Well, we can call them agents, okay? If you can control every agent, you can design a specific schedule and a specific communication interchange protocol that will allow you to detect bugs or misbehavior. For example, you can analyze how agents exchange information, what was in those packages, and detect that something was abnormal. With LLMs, as I said, these are black boxes, so you really have zero control over how they think, how they make decisions, and so on. I think that if we want to be realistic that in the future there will be some agents that will do jobs for us, and we can sleep at night without worrying that they will enter some nuclear codes and launch nuclear missiles, it should be something similar to what they call AGI, artificial general intelligence, where at least we can trust this AGI the way we trust a regular human being.
00:40:14
You know that in security, if, for example, you want to secure some important object or control access to some important briefcase, there is never just one person. We even see it in movies: just to open a door, two people must be at a significant distance from one another so that one person cannot use both keys, and the two people must turn their keys at the same time. Why is this done? This is done because we as humans are unreliable, so if we want something secure, something stable, something we can sell to our customers and say it’s good stuff, it should be at least as reliable as a human-based system.
00:41:04
But today, no one will argue that those agents we talk about are anywhere close to being as reliable as a human, so until this happens, building multi-agent systems with such agents is a recipe for disaster. Somewhere in the future we will have this AGI, and we’ll see if we can trust it and if we can create systems similar to human-based systems, with these additional levels of security by doubling or tripling people, but not today. As far as I know, today no one has a clear idea how to reach this future where AGI is the real thing and not something from science fiction.
Jon Krohn: 00:41:52
Eager to learn about large language models and generative AI but don’t know where to start? Check out my comprehensive two-hour training, which is available in its entirety on YouTube. Yep. That means not only is it totally free, but it’s ad-free as well. It’s a pure educational resource. In the training, we introduce deep learning transformer architectures and how these enable the extraordinary capabilities of state-of-the-art LLMs. And it isn’t just theory; my hands-on code demos, which feature the Hugging Face and PyTorch Lightning Python libraries, guide you through the entire life cycle of LLM development, from training to real-world deployment. Check out my Generative AI with Large Language Models hands-on training today on YouTube. We’ve got a link for you in the show notes.
00:42:36
Through your work, both as a real-world AI assistant developer and through the books you’ve written, and recently this huge amount of expertise you’ve developed in language models to write this book on language models, you probably have an interesting perspective on AGI and when it could be realized. You just mentioned there that we might have it in the future. Do you want to hazard any guesses in terms of timeline?
Andriy Burkov: 00:43:03
When I say that we may have it in the future, it’s like saying we may have teleportation in the future, and it might work. Yes, it can work, because if we humans are conscious, then something in nature changed compared to our predecessors; we somehow evolved into humans. Because what is the biggest difference between humans and the rest of the animals? Humans can plan over an infinite horizon. Some monkeys, like chimpanzees, the most developed ones, can use tools. Previously it was considered that only humans can use tools, but now, after decades of research, we know that even some birds can use tools. For example, I think it’s crows that take a nut and throw it from a height so that it falls and cracks, and even when they live in the city, they can wait for a car.
00:44:26
They wait for a car, then they drop the nut, the car rolls over the nut, and the nut is broken, so they use tools. But most animals use tools only in the specific moment; they will not keep their tools for tomorrow. Some monkeys actually will. For example, you give one monkey a stick, and only with this stick is she able to get a banana, so she will get a banana, and when she goes to sleep, she will put the stick under her belly, because she knows that tomorrow she will also need a banana. This means that some animals can plan one day into the future, two days, but if you remove the bananas for more than three or four days, the monkey will throw away the stick; it will not think that maybe in five days the bananas will be back. But a human will think, I will still keep this stick, because who knows?
00:45:32
We can even plan many years ahead, even hundreds of years, thousands of years. Today we think about saving the planet. We think about reducing the consumption of plastic, and we think about the global warming issue. Why do we do it? We will die maybe in the next 60, 70, 80 years, and the planet will still be fine. We do it for the next generation, for our kids, for their kids, and so on. This is what we managed to gain somehow through evolution. Now the question is, how can we get this AGI? Basically, the answer is: what inside us is different that makes us planners for infinity, versus every other living creature on this earth? If we can answer this question, I think this will probably be the biggest breakthrough, because this is something that our LLMs, or whatever neural network you talk about, don’t have.
00:46:48
They don’t have the ability to actually plan. They are reactive. You ask a question, it gives you an answer. Even if you call them agents, they don’t really have agency. They might act as agents because in the system prompt you said, you are an agent and your goal is to provide your users with the best information on a specific topic, but this agency didn’t come from the agent itself; it came from you. You instructed it to be an agent, because the LLM doesn’t really understand what it does, it just generates text. Sometimes this agency will be violated, so it will not do what you want it to do, and you cannot really explain why. It’s like a black box: it works or it doesn’t, and you don’t know why. If we answer this fundamental question, what makes us planners for infinity, I think this is where we will get one step closer to AGI.
Jon Krohn: 00:47:55
Yeah, I would suspect that some of the answer lies in our prefrontal cortex and the ratio of prefrontal cortex that humans have relative to other primates, which allows us to sustain a loop through our other sensory cortices over an extended period of time. Which brings me to a point that I’ve talked about on this show before, which is that it seems to me, and it sounds like it may be the case for you as well, that cracking AGI may require modeling the neuroanatomy of a brain, a human brain, perhaps in a more sophisticated way than just scaling up a single kind of architecture like a transformer. We might need to have different kinds of modules, so that we have something like a prefrontal cortex that can be doing this kind of infinite-horizon planning that you’re describing: different parts connected by large, pre-planned connections, as opposed to just allowing all of the model weights to be learned in a very fine way, in the same way across the entire neural network.
Andriy Burkov: 00:49:11
Yeah, and it’s not only that. Well, I simplified it a bit by saying that this is just the one thing that makes us different. Another thing that we have, and LLMs, for example, don’t, is that humans somehow have a feeling for what they know and what they don’t know. Okay? For example, I ask you about astronomy, or about the universe, stars, or galaxies, and if it’s not your domain, you will tell me, Andriy, I like to talk about these topics, but if for you it’s something critical, you should probably talk to a specialist, because I can only tell you that planets spin around stars. This is what I know. LLMs don’t have this mechanism to detect that what you ask about wasn’t part of their training data, or that it was, but not at a level of detail granular enough to have an opinion worth sharing.
00:50:41
It will still answer you. For example, I ran a test a couple of days ago with o3-mini from OpenAI. I wanted to see, because all LLMs have been trained on web data, and on the web there is a lot of information about my first book, but my third book just came out, so there is really little information, and I’m sure their cutoff was earlier than the book’s release, so they should not know anything at all about it. I asked o3-mini, is my Hundred-page Language Models Book good? What is interesting is that previously you couldn’t see this, but currently they show what they call a chain of thought, this internal discussion before they provide an answer. I read this chain of thought, and it’s funny; it starts by saying, okay, he asks about this book, but this book looks very different from the previous one, so it’s probably some new book.
00:51:49
Okay, what do I know about this new book? Not much. Okay, so what do I know about the previous book? Oh, the previous book is XYZ. It has this discussion, and then it starts producing the final answer, where it just says that, yeah, this new book is very good, it’s praised by experts and by readers, and it delivers content in a very good way. I’m like, where does this come from? It just made up the recommendation, and it’s based on its internal discussion, in which it says, yeah, I don’t have anything about this book, but given that Burkov has a great reputation, this is what I might say. It doesn’t tell you in the official answer that it’s pure speculation. It answers just as if it was the real deal. The LLM cannot really understand the difference between, I’m sure about this, I’m less sure about this, I could be totally wrong.
00:53:08
Again, if we can solve this, it will be an additional step toward AGI: a model that can reliably self-observe and self-criticize, saying, I would love to help you, but here I feel like I am in a domain where I cannot be reliable. By the way, they do try to fine-tune models to say this, but it doesn’t work that way. For example, some models, especially those released by Chinese companies, have been fine-tuned to say, I don’t know this person.
00:53:51
Previously, for example, since there is information about you online, you could ask a model, who is Jon Krohn? It might say, well, he’s a podcast host, a book writer, but it might also say that you are a Ukrainian footballer, like me. To avoid being ridiculed, because people Google themselves, people ask about themselves, and they know what information is online, but it comes out totally made up, they decided to fine-tune their models to say, I don’t know anything about this person. They fine-tuned it by giving the names of really famous people and saying, go ahead and answer, and then giving some random names, people who don’t exist online or have a very small footprint, and saying, answer, I don’t know. It’s funny, because I asked, who is Andriy Burkov? It says, first time I hear this name, I don’t know anything. And then I ask, who wrote The Hundred-page Machine Learning Book? Oh, it’s written by Andriy Burkov. You just told me that you don’t know. So they try to create some hacks around it, but it’s not really training a model to recognize where it can be wrong.
Jon Krohn: 00:55:10
I’ve noticed a related hack recently in Claude outputs, where you can tell it’s probably not directly a part of the core LLM, but again, something that they’ve tacked on top. I’m now frequently seeing in Claude responses things like, this is a relatively niche topic, I don’t have that much information, you might want to double-check this. I find that they’re being really conservative with that; I’m getting it frequently on questions about things that I don’t think are particularly niche. Maybe there’s some fine-tuning that they need to get right there. That kind of problem seems like something these big LLM trainers are working on, and they’re probably all taking different kinds of approaches. You wrote on LinkedIn that you developed an enterprise chatbot that doesn’t hallucinate, which seems related to this. Hallucination, having this kind of confidence about things the LLM doesn’t know anything about... it seems like you’ve achieved something here. How did you accomplish this?
Andriy Burkov: 00:56:18
Well, yeah, the only way to make a chatbot not hallucinate is to not use an LLM to generate the output. We all know that RAG, retrieval-augmented generation, decreases the level of hallucinations quite significantly. For example, if you ask about machine learning and you pull the data from the Wikipedia machine learning article, and you answer the user’s questions based on this Wikipedia article, then the chances that you will say something entirely wrong are quite small. There are still chances, but quite small. Compared to just answering out of the box without doing any retrieval, where maybe you hallucinate 20 or 30% of the time, with retrieval-augmented generation it’s maybe 1 or 2%. It’s still there, but not a lot. What we decided to do is exclude any possibility of hallucination.
00:57:32
Basically, our chatbot is not an open-domain chatbot. This is a very big advantage for any machine learning team, working with a closed domain versus an open domain. For example, OpenAI, Anthropic, Google’s Gemini, they all work in the open domain. There is zero chance that they can create fixed templates for every possible kind of answer, but if you work in a closed domain like ours, you can. Our SaaS can answer users’ questions, for example: what are the top skills for a Java developer in Chicago? Or how difficult is it to hire a registered nurse in San Francisco? All this information can be pulled directly from our internal APIs. For example, you provide the occupation, you provide the location ID, and you call the API about salary. It takes your occupation, it takes your location, it pulls the salary distribution from the index, and then you just show it to the user.
00:58:54
What we decided to do is create a set of predefined templates. For example, okay, you’re looking for the average salary for a nurse in San Francisco: the template says, the average salary for a nurse in San Francisco is, and then there is a placeholder for a number, and we pull this number from the API and we show it. The possibility of hallucination here is zero. There is a possibility of an error, an error in how we interpret the user input, but we always show our interpretation. For example, let’s say the user asks for someone with a JS skill. Before we show any number to the user, we need to normalize, we need to convert this JS into our internal skill taxonomy. We take this JS and we use our internal skill normalizer, and it says, okay, JS, it’s skill number one, two, three.
00:59:59
We show the user, okay, so you’re looking for someone with the JavaScript skill, so the user sees exactly how their input was understood by the machine, and then the user sees the output, and the output also comes directly from the database. Hallucination is when you see some number and you are not sure whether it represents what you asked for or something arbitrarily different from what you asked for. In our case, because it’s a closed domain, we say, okay, occupation code A, skill number one, two, three, location San Francisco, California, US. It’s all shown in what we call pills. They’re all normalized labels that you see, and then you see a number.
01:00:53
Yes, the number can be wrong, but not because we made it up; it’s because the distribution of jobs corresponding to your search doesn’t reflect reality, and you would get exactly the same wrong answer if you used the system directly through the traditional UI. There is a one-to-one correspondence between what you see in the chatbot and what you would see on the platform if you didn’t use the chatbot. This is what we call zero hallucination. Of course, errors can always be there; some errors we can control, but some just come from the data that we gather online, and this data is never a hundred percent perfect.
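Here is a minimal sketch of that template-plus-API pattern; every name, ID, and template below is hypothetical, invented for illustration rather than taken from TalentNeuron’s code:

```python
# Hypothetical sketch of a zero-hallucination, closed-domain answer path.
SKILL_TAXONOMY = {"js": ("SKILL-123", "JavaScript"),
                  "javascript": ("SKILL-123", "JavaScript")}

TEMPLATE = "The average salary for a {occupation} with {skill} in {location} is {salary}."

def normalize_skill(raw: str):
    """Map raw user text like 'JS' onto the internal skill taxonomy."""
    return SKILL_TAXONOMY[raw.strip().lower()]

def fetch_avg_salary(occupation: str, skill_id: str, location: str) -> str:
    """Stand-in for the internal API; in production this queries the job index."""
    return "$121,000"

def answer(occupation: str, raw_skill: str, location: str) -> str:
    skill_id, skill_label = normalize_skill(raw_skill)
    salary = fetch_avg_salary(occupation, skill_id, location)
    # The user sees the normalized 'pills' (occupation, skill, location) and a
    # number pulled straight from the database; no LLM generates this sentence,
    # so there is nothing to hallucinate.
    return TEMPLATE.format(occupation=occupation, skill=skill_label,
                           location=location, salary=salary)

print(answer("Java developer", "JS", "Chicago"))
```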
Jon Krohn: 01:01:33
Right, of course, yeah. That is an interesting approach. Avoiding the LLM in order to avoid hallucinations.
Andriy Burkov: 01:01:42
Yeah, sorry, we do use LLMs in the process. We use LLMs to understand the user input, because the user input is just a string, but we need to convert this string into some structured format, and then every piece in this structure we need to normalize. Yes, the LLM is used to understand, but not to inform.
Jon Krohn: 01:02:04
Gotcha. That sounds like what you were talking about at the top of the episode, using BERT and RoBERTa for encoding natural language into some other kind of representation, so it’s an interesting callback. Before we wrap up this episode, I feel like we’ve got to talk about DeepSeek.
Andriy Burkov: 01:02:22
Yeah.
Jon Krohn: 01:02:23
It’s what everyone is talking about these days. You wrote to me in an email that DeepSeek crushed OpenAI and Anthropic. What do you mean by that?
Andriy Burkov: 01:02:36
Well, I made a post on this last week, by the way, so I refer everyone there for more detail. I think that DeepSeek is probably the most important thing that has happened to language models since the release of ChatGPT. It’s not in the sense of, okay, this model beats that model. We’ve already seen multiple examples where some new model beats a previous one, and then the company that created the previous model releases something new, and now it’s state-of-the-art again. It’s more in the sense of what DeepSeek did... Okay, I will enumerate. The first thing they did: they trained a state-of-the-art model on a very tiny budget. What previously was considered to need maybe 20, 30, 50 million dollars, to train a new version of some model, now takes probably five to 10 million.
01:03:48
It’s about a fivefold decrease. This is one. Again, if they had kept this only to themselves, everyone would say, okay, well, they’re lucky they don’t spend a lot on their models, but so what? Others have money, so nothing changed. But what they did is actually show everyone how to do the same thing, as a recipe, step by step. Now, not only do you know that a very competitive model can be created with a small budget, you can create one yourself. They published a public technical report, and already you can find several independent implementations of the R1-Zero and R1 training online, so anyone can do it. This is two. The third is that not only are these models cheaper to train, they’re also much cheaper to run. If you look at OpenAI’s pricing for, let’s say, o1, it’s about $60 for a million output tokens. 60.
01:05:12
You will probably spend five minutes talking to it with a sufficiently long context, for example, talking about some book or some article, and you’ll pay $60 in five minutes. What DeepSeek showed is that for their model, they charge $2 for a million output tokens. It’s about a 30-times reduction in cost. Not only can anyone train it now, but anyone can run it, and it will be very efficient. If you take all of this together, they gave their state-of-the-art AI to anyone, to your brother, to your grandmother, so they can just take it and have it. This is what was considered OpenAI’s or Anthropic’s secret sauce. And the final thing. When you create language models, it’s not just about compute, and it’s not just about knowing how; it’s also about having the right data.
01:06:25
The data was always what we call the moat. Companies like OpenAI and Anthropic invested a lot in creating high-quality data for model fine-tuning, because when you just pretrain a model, it cannot talk; it’s just a next-word predictor. Then you must convert it into a chatbot, so it takes questions and outputs answers, and then you also must convert it into a problem solver. Not just asking questions and expecting answers; you also have to have some sort of multistep interaction where you solve a problem, for example a coding problem, jointly with the LLM. To make LLMs act like this, you need examples of these problem-solving conversations. Those examples should not be random stuff; they should really be to the point: okay, let’s solve this specific problem. To create such examples, you need subject matter experts.
01:07:25
Having such subject matter experts available to create hundreds, thousands, even hundreds of thousands of such examples is a huge investment. I think they invested billions just in getting those conversations. Now, DeepSeek has shown that you don’t need that. Basically, their approach to training R1 is based on automated validation of solutions. For example, let’s say you ask it to generate some code, and it generates this code. Instead of asking a subject matter expert to look at the code, like previously, and say, yeah, makes sense, or no, I don’t like it, it’s too long, they run the code, take the output once the code has executed, and compare it to the ground truth, what the output was supposed to be, or just check whether it compiled. It’s a signal for reinforcement learning: one, it compiled; zero, it crashed. The same for math.
01:08:30
They have a math problem, they know the solution, the solution is 42, and they ask the LLM that they’re training to generate a bunch of solutions. If one of them gives 42, they say one; for the rest, it’s zero. For logic, it’s the same thing. You can create a kind of logical derivation: you have this hypothesis, and then you try to derive whether this conclusion is true. Because you can verify the logic, you can create this task automatically. Again, your LLM tries to solve the logic problem, and the expected output is, the killer is the cook. You check whether the output is cook; if it is, the reward is one, and if it’s something else, the reward is zero. Creating such examples is very simple. For example, you take a GitHub repo, you hide just one class from it, and you ask the LLM to fill in this class, write it from scratch, and then you compile the full repo.
01:09:46
If it compiles, cool, the reward is one; if it doesn’t, the reward is zero. You don’t need subject matter experts anywhere in this pipeline. They created hundreds of thousands, close to a million, such examples fully automatically. Again, it removes this moat that was previously only available to companies with big budgets. Now you can just recreate this training set at home and create an R1 named after you.
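Here is a minimal sketch of that automated-reward idea in Python; it is a simplification of the recipe, not DeepSeek’s actual training code:

```python
import subprocess
import tempfile

def reward_for_code(generated_code: str, expected_output: str) -> int:
    """Binary reward: 1 if the generated program runs and prints the expected
    answer, 0 otherwise. No subject matter expert in the loop."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run(["python", path],
                                capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return 0
    if result.returncode != 0:  # crashed: reward 0
        return 0
    return 1 if result.stdout.strip() == expected_output else 0

def reward_for_math(llm_answer: str, ground_truth: str = "42") -> int:
    """Same idea for math: compare the model's final answer to the known solution."""
    return 1 if llm_answer.strip() == ground_truth else 0

print(reward_for_code("print(6 * 7)", "42"))  # -> 1
print(reward_for_math("41"))                  # -> 0
```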
01:10:21
They kind of removed any competitive advantage the biggest players had, and they had this advantage for two years. I remember even a year ago, the OpenAI folks, before everyone left, gave an interview, and some journalist asked them, okay, you see there is a huge movement of open source LLMs; are you worried that they can maybe even undermine your business model? They laughed it off, like, huh, no, someone working at home with a tiny budget, there is only so much they can do; they don’t have data, they don’t have compute. Now, what they mentioned is gone. Now we are kind of at square one. What’s next? This is the biggest revolution, I would say. The model itself is good, but it’s not about the model; it’s about the fundamental change in what an open LLM is. The notion of an open LLM changed, and so did what it can do now.
Jon Krohn: 01:11:39
Very well said. Earlier you mentioned open-weight LLMs. The Llama models from Meta, for example, are open-weight; they’re not open source because you can’t see the source code. Would you say that these models from DeepSeek are actually open source?
Andriy Burkov: 01:11:54
Well, my personal opinion on this is much stricter than some others’. For example, Yann LeCun calls Llama models open source. In my definition... What does open source mean, not in terms of a formal definition, but in terms of the practical aspect? Open source in software means that anyone can reproduce your software independently. You put the source code online; anyone can download it, tweak it, adapt it a little bit to their system, run it, and get the same software as yours. With LLMs, it’s not like this. You cannot just take the model itself, run it locally, and say that you reproduced it. No, the model itself is similar to a binary. For example, you download GIMP, the open source graphical editor. If you just download the binary, it’s not open source. Or you download Adobe Acrobat or you download Word; you can run it on your machine, but you don’t call it open source.
01:13:22
You can use it, you have access to its binaries, but you cannot tweak it, you cannot update it and make it different. This is what open source is. With open models, if you want to reproduce the model at home, you need not just the code that was used, but also the data that was used, because the model is nothing without the data. From this perspective, these open-weight models are open-weight, but they’re not open source. As I mentioned in the beginning, there are a bunch, maybe a couple, of models that come not just with open weights, but where you can also download the full training dataset that was used to train them.
01:14:10
Unfortunately, releasing the dataset hasn't become standard practice for new models, so these models with open datasets are no longer competitive today. If you want, you can take the dataset and train a new model on it, but by today's standards, a typical pre-training dataset is probably between 15 and 20 trillion tokens. If you download some openly available dataset, it will be maybe four or five trillion tokens, so you cannot really hope to reach state of the art with a quarter of the data.
Jon Krohn: 01:14:56
Nicely said. Before I let you go, we did have some audience questions. We actually got a very long one here. I posted a week before we recorded that you'd be coming on the show, like I do with some upcoming guests. Dmitry Petukhov, who is a fraud prevention data scientist in Moscow, said that he wasn't aware of you before, so he's grateful to the podcast for bringing another interesting personality, with new additions to his book queue. You can expect a few purchases there from Dmitry. He says a related question came to mind for him: these days there's a lot of talk about the disruptive and transformative effects of language models and generative AI on society, and on technology in particular. For me, Andriy, this seems to relate to what you were describing earlier about TalentNeuron, where previously you were only concerned with the post-download stage, but now you're able to apply LLMs to pre-download aspects as well.
01:16:01
Anyway, his question is: it would be interesting to hear Andriy's thoughts about what effects these transformations have already had, and will continue to have, on the traditional machine learning project lifecycle. He describes that traditional cycle as data gathering, quality checking, model developing, validating, deploying, monitoring, and then celebrating the results. How have LLMs and generative AI disrupted that traditional machine learning project lifecycle, and how might they continue to?
Andriy Burkov: 01:16:33
Well, I can tell… Dmitry was the name, right?
Jon Krohn: 01:16:38
Dmitry, yeah.
Andriy Burkov: 01:16:39
Yeah, Dmitry. Okay, I can tell Dmitry that for maybe this year, maybe part of next year, people will still pretend that LLMs work sufficiently well out of the box and that we don't need to follow a traditional machine learning process, where you gather the data, you select an architecture, you train, you tweak hyperparameters, you test, and you go back if you see that your initial approach wasn't good. For some more time, people will still follow the hype and say, okay, we don't need to train anymore. Again, my team, we are four people, all hands-on. My initial position was, okay, because LLMs can do so much out of the box, we should change the way we work on projects; we should transition from the traditional training-based approach to prompt engineering, and probably what they call few-shot learning, or few-shot prompting, where you add examples directly in the prompt, and this kind of tweaks the model's performance.
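For readers unfamiliar with the term, here is a minimal sketch of few-shot prompting in Python. The task and the labels are made up for illustration, and this is not TalentNeuron's actual prompt; the point is that the examples live directly in the prompt, and no training or gradient updates are involved.

few_shot_prompt = """Decide whether each string is a job title. Answer yes or no.

String: Senior Data Engineer
Answer: yes

String: Toronto, ON
Answer: no

String: Machine Learning Lead
Answer: yes

String: {candidate}
Answer:"""

# Fill in the string to classify and send the completed prompt to any
# instruction-tuned LLM; the in-prompt examples steer its behavior.
print(few_shot_prompt.format(candidate="Staff Software Engineer"))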
01:18:10
What we concluded is that this approach has its benefits only in the beginning. For example, you want to build a complex system, like the system I explained where some AI goes to a corporate website and must figure out how to create a robot for gathering the data from it. In this complex system, you need multiple places where you put machine learning. For example, you need to detect which link to click so that you reach the career section. You need some classifier that would say you are where you are supposed to be and not somewhere else. You need a model that can say, okay, I see a job title, I see a location, I see X, Y, Z. For all of these, you need models.
01:19:06
Imagine, in the past, for example five years ago, you start a project like this. For every single place where you need some kind of AI-based decision, you would have to gather the data and implement the full process of developing a model just to put it in that one place. Now you have, for example, five, 10, 15 places where you need to make such a decision. Before you can even conceive a prototype, never mind deliver one, you would have to solve like 15 machine learning problems from scratch. It's crazy; it might take years. Larger teams can scale horizontally by adding people, but we cannot clone more people to train all those models in parallel. For us, it would take years to develop. Now, thanks to LLMs, we can replace all those places where we need decisions with an LLM that we just instructed, like, you should predict whether this is a job title or not. This makes creating a minimum viable product, if you want, or some kind of production-like prototype, very fast.
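Sketched in Python, the prototyping pattern looks something like this: every decision point in the pipeline becomes a thin wrapper around one instructed LLM call. The helper llm_decide is hypothetical and stubbed here so the sketch stays self-contained.

def llm_decide(instruction: str, text: str) -> str:
    # Stand-in for a call to any instruction-tuned LLM API. In a real
    # prototype this would send the instruction plus the text to the
    # model and return its reply; it is stubbed so the sketch runs.
    return "stubbed reply"

def pick_career_link(links: list[str]) -> str:
    return llm_decide("Which of these links leads to the careers section?",
                      "\n".join(links))

def is_job_listings_page(html: str) -> str:
    return llm_decide("Is this a job listings page? Answer yes or no.", html)

def extract_job_title(html: str) -> str:
    return llm_decide("Extract the job title from this page.", html)

Each of these would have been its own supervised learning project five years ago; as a prompt, each takes minutes to write.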
01:20:29
When you really want to go to production, you will not tolerate 30% error in this place, 25% error here, 40% error here, because error has the property that it accumulates: if you make a 15% error in your prediction here, then 15% here, and then 15% here, the chance of reaching your destination tends to zero very fast. This is where LLMs are cool: for fast prototyping, you don't need to train your model, you can just instruct an LLM. But when you actually want to go to production, you will have to investigate, among all of your placeholder LLMs, which one is the weakest… how do they call it? Link in the chain. You will have to replace this weakest link with a real classifier that you actually control, where you can be sure that you can reach 95% accuracy, or 99% accuracy if needed.
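The compounding is easy to verify: if each of 15 chained stages is right 85% of the time, and we assume the errors are independent, the end-to-end success rate is 0.85 to the power of 15, which is under 9%.

# 15 chained decisions, each 85% accurate, assuming independent errors:
p_stage, n_stages = 0.85, 15
print(p_stage ** n_stages)  # ~0.087, i.e. less than 9% end-to-end success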
01:21:39
For this, you will either create a model from scratch, since you don't have to use LLMs all the time, or you can fine-tune an LLM, but in a real machine learning sense. You gather a dataset, you actually execute learning iterations, and you see how good the model becomes on a holdout set, like a validation set, and once you are satisfied, you say, okay, cool, this piece now works as intended. You can run your system in production already, but it will be kind of a working prototype, or an alpha or beta version, whatever you want to call it, and then in the future you will replace those critical pieces with actual machine learning models. LLMs are cool for fast prototyping, and they are cool for interactive problem solving, like, okay, what if I try this? What if I try that? But when you go to production, you really want to follow a rigorous machine learning process.
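As a minimal sketch of that rigorous loop, assuming a small scikit-learn text classifier stands in for the component you control, and with placeholder labeled data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder labeled data: strings and whether each is a job title.
texts = ["Senior Data Engineer", "Toronto, ON", "Machine Learning Lead",
         "Remote - Full Time", "Staff Software Engineer", "Posted 3 days ago"]
labels = [1, 0, 1, 0, 1, 0]

# Hold out a validation set, exactly as in the traditional process.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Swap the placeholder LLM for this component only once holdout
# accuracy meets the target you set for production.
print(accuracy_score(y_val, model.predict(X_val)))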
Jon Krohn: 01:22:45
Nicely said, very well summarized there, on a critical topic for a lot of our listeners who are themselves developing machine learning pipelines. Like you say, having those 15 models that a team of four would have to develop was previously intractable, and now you can come up with the right prompt and, poof, you have some level of accuracy that in some cases is acceptable out of the box.
Andriy Burkov: 01:23:10
It’s cheap. You spend nothing and you have something. This is already better than zero, right?
Jon Krohn: 01:23:16
Yeah, exactly. All right, Andriy, this has been an amazing episode. I've really enjoyed learning from you today; you're really brilliant, and it's been awesome. Before I let you go, I always ask my guests for a book recommendation.
Andriy Burkov: 01:23:29
Well, I should say that I am more a writer than a reader. I was a really fanatical reader when I was a teenager; my dad had a huge collection of science fiction and historical books, so I read a lot. But since I moved, I don't have a library of my own, except perhaps the one that stores counterfeit copies of my books; I keep all of them. I think the biggest impact on me was from The Little Prince, and I even put a quote from The Little Prince into my new book, where the fox says to the prince that language is a source of misunderstandings.
01:24:26
I found it really to the point for the book, because yes, you build those language models, but they can create more problems than solutions. Not just because of this, but I think The Little Prince, for me, is a reminder that an adult can still remain a child in their heart. For me specifically, it resonates because I still feel as if I were 20, 22, 25 years old, despite the fact that I'm already 43. My kids have grown up, but when I read The Little Prince, not sometimes but every time, it makes me want to cry, because I really feel this atmosphere of a child stuck in an adult's life.
Jon Krohn: 01:25:32
The book is also a hugely influential one for me. Recently, I've been trying to use it more and more in guiding my professional decisions, to have more of a sense of play in my life and to ask things more like, what's my favorite color, rather than, how much revenue will this bring in? In addition to The Little Prince, an amazing suggestion there, you also, of course… I'll just recap my mistake of describing your books as a trilogy. So we have a duology; I didn't know what the equivalent term was for two.
Andriy Burkov: 01:26:08
A duology with a spin-off. It's always confusing when I talk to my kids and they ask, what about this book? I say, the second one. No, that's the third one. Well, I say the third one if you count books, but the second one in this series. So it's a duology and a spin-off.
Jon Krohn: 01:26:24
Yeah, in your Hundred-page Book series, you have the original, The Hundred-page Machine Learning Book, which is iconic, and now your brand new Hundred-page Language Models Book as well. If people are interested, the spin-off that is not part of the duology, but is your third book, is Machine Learning Engineering, which people can dig into to learn how to use machine learning to solve business problems at scale. Amazing books that you've provided. Thank you also for the amazing perspectives you've provided on a broad range of timely topics on today's podcast, Andriy, and hopefully when your next book comes out-
Andriy Burkov: 01:27:10
I'm already thinking about it, but I should take a pause, because it's very exhausting to write books, especially when you challenge yourself that the book should be small but not superficial. It's exhausting. I spent nine months writing this one, so I think I will probably take a break for a couple of months before I start-
Jon Krohn: 01:27:30
Yeah, definitely take a break. If in a few years you have another one done, we’d be delighted to have you on the show again to discuss that.
Andriy Burkov: 01:27:36
Thanks for the invitation.
Jon Krohn: 01:27:37
You have an open invite. Andriy, what’s the best way for people to follow you after this episode?
Andriy Burkov: 01:27:43
Oh, it's not hard to find me. You can Google my name, Andriy Burkov, and you'll find links to my LinkedIn profile and my Twitter profile. On LinkedIn, I am more professional, so I try to filter what I post. On Twitter, I'm more like myself because, well, Twitter is mostly anonymous, so I can share some stuff without it being linked to my employer. It was especially hard with Gartner, because Gartner had a strict online presence policy, so I had to limit very much what I posted. Now, because we are not Gartner anymore, I am much more open even on LinkedIn, but if you really want to read my unfiltered stream of consciousness, join me on Twitter.
Jon Krohn: 01:28:36
Yeah, Andriy, thanks so much and hopefully we catch up with you again in the future.
Andriy Burkov: 01:28:40
Thank you, Jon. It was a pleasure to be with you, and thank you for the questions.
Jon Krohn: 01:28:49
What an excellent episode with Dr. Andriy Burkov. In it, he covered how AI agents face fundamental challenges: they can't be effectively debugged when working together, they lack true agency, and they struggle with out-of-distribution tasks that weren't part of their training data. He also talked about how LLMs are excellent for rapid prototyping of ML systems, but production-grade applications still require traditional ML development processes for critical components. He filled us in on how he achieved zero hallucinations in a TalentNeuron chatbot by using LLMs only for understanding user input while relying on structured data and predefined templates for responses. He talked about how DeepSeek revolutionized the field by cutting model training costs roughly fivefold, making their methods public, cutting inference costs by 30X, and eliminating the need for human experts in training-data creation. He also talked about how the key distinction between humans and AI is our ability to plan infinitely into the future and accurately assess what we do and don't know.
01:29:45
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Andriy's social media profiles, as well as my own, at www.superdatascience.com/867. I've been saying this for a few weeks now, but it's coming up: in just two weeks, I will be speaking at the RVA Tech Data plus AI Summit in Richmond, Virginia, and it would be awesome to connect with you there in real life. There's a ton of great speakers, so this would be a great conference to check out, especially if you're in the Richmond area. Thanks of course to everyone on the SuperDataScience Podcast team: our podcast manager, Sonja Brajovic, media editor, Mario Pombo, partnerships manager, Natalie Ziajski, researcher, Serg Masis, writer Dr. Zara Karschay, and I can't forget our founder, Kirill Eremenko.
01:30:36
Thanks to all of them for producing another tremendous episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. You can support this show by checking out our sponsors' links, which you can find in the show notes. If you yourself are interested in sponsoring an episode, you can get the details on how to do that by heading to jonkrohn.com/podcast. Otherwise, share, review, subscribe, edit our videos into shorts if you'd like to, but most importantly, just keep on tuning in. I'm so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.