Jon Krohn: 00:06
This is episode number 868, our “In Case You Missed It” February episode.
00:19
Welcome back to the SuperDataScience Podcast. I’m your host, Jon Krohn. This is an “In Case You Missed It” episode that highlights the best parts of conversations we had on the show over the past month.
00:31
For our first highlight today, I spoke to the fascinating pro-athlete-turned-data-engineer Colleen Fotsch about the tools that she finds essential to her work. Specifically, in episode 861, I asked Colleen for more detail on the data-transformation tool dbt and how it simplifies data modeling and documentation for her:
Jon Krohn: 00:50
So, one thing that you mentioned in that transition from data analyst, where you focused more on visualization tools like Tableau, to moving a little bit up the stack to analytics engineering, is that you were getting more into dbt, right?
Colleen Fotsch: 01:05
Yes. Yes.
Jon Krohn: 01:06
So tell us about dbt: why a company would use it, how you interact with it, what it does.
Colleen Fotsch: 01:13
Yes. So dbt is a data building tool, and essentially, what we were doing-
Jon Krohn: 01:20
Oh, I didn’t know that.
Colleen Fotsch: 01:20
Yeah, so I actually looked this up a few months ago, because I wanted to double-check, because I went to the dbt conference this year, which was a ton of fun. Someone had asked me, and I was like, “Well, I believe it stands for data building tool, but I want to double-check.” I guess it used to, but now they have transitioned to just dbt.
Jon Krohn: 01:42
Yeah, yeah. That happens all the time.
Colleen Fotsch: 01:44
But it’s still “data building tool.” And so essentially, if we already had the data in our database but it was in a raw format and we needed to expose it or surface it in a way our stakeholders could use in a visualization tool, we would go in and take that raw source. And essentially, you’re writing SQL, more or less. dbt also has Python-type coding within it that lets you do kind of SQL on steroids a little bit. So you get to do some different things with it, especially if there’s something redundant in your code; they have different functions that allow you to consolidate it and speed things up, which is really, really nice. And so essentially that’s what I was doing day-to-day.
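As a concrete illustration of the Python side Colleen mentions, here is a minimal sketch of a dbt Python model, assuming dbt 1.3+ on a warehouse with Python-model support (such as Snowflake via Snowpark); the upstream model stg_orders, its columns, and the “active” filter are hypothetical:

```python
# models/fct_active_orders.py -- a minimal dbt Python model sketch.
# Assumes dbt >= 1.3 on a platform that supports Python models
# (e.g., Snowflake via Snowpark); "stg_orders" and its columns are
# hypothetical names used purely for illustration.

def model(dbt, session):
    # Materialize the result as a table in the warehouse
    dbt.config(materialized="table")

    # dbt.ref() resolves the upstream model, just like {{ ref('stg_orders') }}
    # does in a SQL model, so lineage is tracked automatically
    orders = dbt.ref("stg_orders")

    # Centralize the business logic here once, instead of repeating
    # the same filter across many downstream queries
    active_orders = orders.filter(orders["status"] == "active")

    # The returned DataFrame becomes the model's output table
    return active_orders
```

In day-to-day work, though, most dbt models are plain SQL with Jinja templating, and Jinja macros are the usual mechanism for consolidating the redundant logic she describes.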
02:48
I was going in, and initially I wasn’t building out models from scratch, but I would be adding to models that were already in there: adding different fields, adding a lot of business logic, because that was really important too. And what I really loved about dbt was that you can embed your business logic, the business logic you’ve agreed upon with your stakeholders, and then have it documented. That was a really, really big thing that I love. So dbt has a documentation part of the tool, dbt docs, that allows you to have definitions. And actually they have an automated version of this now, which is fantastic, because I’m not going to lie: when you’re building out a model, especially from scratch, and let’s say you have 30 or 40 fields in it, the last thing you want to do is build out a document and write out definitions for all of those things, especially when you feel like, for example, if it says created date, you’re like, well, it’s created date, do I really need to? But a lot of times you probably do need that definition, because created date might not mean what you think it means all the time. And so-
Jon Krohn: 03:56
It sounds like that’s a generative AI assist coming in there.
Colleen Fotsch: 04:02
Yes.
Jon Krohn: 04:02
And it sounds like actually a great use of it. There are all kinds of places where generative AI is showing up in my life today. For example, I don’t know, have you ever used, or do you use, a WHOOP for tracking your sleep?
Colleen Fotsch: 04:13
So I have an Oura ring. But similar. Yeah. Yeah.
Jon Krohn: 04:17
Well, WHOOP, in the gen AI mania of 2024, incorporated this daily gen AI guidance thing. I looked at it one time and I was like, oh, please never show me that again. You don’t need that in your life. You’re getting all the information you need. It’s just giving generalized advice about fitness.
Colleen Fotsch: 04:43
Oh, interesting.
Jon Krohn: 04:43
Which I guess maybe somebody out there is finding it useful, but it’s like, make sure you get lots of water in your day. And just…
Colleen Fotsch: 04:50
Yeah, it’s interesting. I mean, I’m sure we could go off on a tangent on that, but it is interesting how AI is so powerful, and it’s very cool, and yet it also doesn’t need to be everywhere for certain things. But especially for… Oh, sorry, you were going to say.
Jon Krohn: 05:10
But yeah, back to dbt automatically creating documentation for all the fields that you have in a data file.
Colleen Fotsch: 05:16
Yes.
Jon Krohn: 05:17
That sounds like a case where, at the very least, you should be reading over it and making sure that it’s accurate.
Colleen Fotsch: 05:22
Oh, absolutely.
Jon Krohn: 05:24
But it saves you from the blank-page problem.
Colleen Fotsch: 05:27
Yes. Well, and also now they have a feature, I believe they’re called doc blocks, where you can define a field, let’s say in your initial staging-layer model, and then reference that definition throughout the lineage, wherever that column or field goes in your downstream models, which is so nice. Because then it’s just efficiency, and you can spend more time working on the actual model itself. But I love that component. And I also think it was really, really good because it forces the data team to work with your stakeholders and really come to terms with, “Hey, this is our logic and this is our definition of this thing.”
06:13
Because, especially if you’re wanting to change how a team speaks about what, let’s say, “active customer” means, if your data isn’t representing that, it’s going to be really hard to make that change through conversations alone. So having that documentation, not only for your stakeholders to reference back to, but also when you have new engineers coming in, that’s huge. When they’re trying to learn what the data means and what granularity your models are at and all of that stuff, it’s just infinitely helpful. And so being on a tech stack like that was so, so nice. And again, it can get tedious, but now that we have this generative process, it’s not taking long at all, and it was so nice to have that as a rule: no, this is just what we do. Any model that’s created has to have a definition for everything. So essentially that’s what we were doing. And then I started progressing to building out models from scratch when we were getting new data in, and then exposing that in tools like Tableau or Sigma and building out dashboards for our teams to use.
Jon Krohn: 07:30
What’s Sigma?
Colleen Fotsch: 07:31
So Sigma is another data visualization tool. Well, I guess you can run it a few ways, but essentially Sigma sits right on top of Snowflake and works really, really well with it. So that ingestion process was removed: anything we had in Snowflake, we could build off of in Sigma. I would say that initially, when I opened up Sigma, it definitely can look a little bit like Excel, which gave me pause; I was like, “Ooh, I don’t know.” Especially because, and I’m sure lots of people in data can relate to this, a lot of times you’re trying to get people out of, not out of SQL, out of Excel. Not all the time, sometimes it’s necessary, but for the most part, you want it consolidated; you want everyone to be looking at the same numbers so there’s consistency when we’re speaking about making business decisions.
08:37
But it was great because I think it has the capacity to support people in data analytics who are trying to do really high-level analytics and insights with it, and it was also great for PMs and other people who are really familiar with Excel to get in there. They’re like, “Oh, this is great. It’s just like Excel Plus, and it has a lot of other features.” So that was really great as far as adoption goes for both teams, which was really cool.
Jon Krohn: 09:09
It’s so important for data scientists to think about how teams that might not be particularly tech literate will adopt the tools they put forward. In episode 859, I spoke to the Y Combinator-backed entrepreneur Vaibhav Gupta about the kinds of tools that help prevent mistakes like interns bringing down entire production applications! Here, Vaibhav and I talk about BAML, the “Basic Ass Machine Learning” programming language he created that helps companies protect themselves from well-meaning mistakes.
Jon Krohn: 09:38
You’re the CEO and co-founder of Boundary, which is the creator of BAML, B-A-M-L, a programming language, and it’s an expressive language for text generation specifically. We probably have a ton of listeners out there who are calling LLMs, fine-tuning them for various purposes, and BAML is designed for them. So, tell us about BAML: what the acronym means, and why you decided to go with this acronym.
Vaibhav Gupta: 10:08
So the acronym first… BAML stands for Basic Ass Machine Learning, but if you tell your boss, you can say “basically a made-up language.” The premise of BAML really came from this theory around how web development started out. When we all started coding, at least for me, when I started coding websites, it was all a bunch of PHP and HTML hacked together to make websites work. Then I remember interning at Meta, and they were the ones that made React. I think part of the reason why they made React was because their code base was starting to get atrocious to maintain. Imagine having a bunch of strings concatenated into your HTML, and now an intern comes in, like myself, forgets a closing div, and now your newsfeed is busted.
10:59
It’s not really the way we want to write code, where multi-billion-dollar corporations rely on an intern closing strings correctly. It’s not really even the intern’s fault, because how could they really read a giant blob? I barely read essays. How could the intern do that? But a compiler like React could actually correct for those mistakes if you put HTML and JavaScript into the same syntax. By creating a new syntax, those ideas become much more easily expressed. Now, in two milliseconds, you get a red squiggly line saying you didn’t close this div tag. In that web development loop, it just reframed the way we all started thinking about web development.
11:42
Instead of assuming things are going to be broken, we could do state management because React handled it for us. We could do things like hot reloading a single component and having the state around it persist, because React did that for us. It was tastefully done, even though it required learning something new. So we asked what’s going to be true in this AI world that we’re all headed towards, and we think a few things are. One, every code base will have more prompts in each subsequent year than it had in the previous year. If that is true, we probably don’t want all these unclosed-div-tag types of mistakes sticking around forever.
Jon Krohn: 12:22
When you say prompt, you mean like an LLM prompt?
Vaibhav Gupta: 12:24
Yeah, calling an LLM of some kind. LLMs, I think, are one start, but I think all models in general are going to be used long-term. Models are only going to become easier to use for people that know nothing about machine learning in the future.
Jon Krohn: 12:41
So we’ve done episodes recently, for example, episode 853, where we talked about this generalization of LLMs to foundation models more broadly. Maybe a vision model, for example, where you don’t necessarily need to have a language input or output. But even with that model, even in a vision use case, it could be helpful: it could make things easier for people calling on that vision model if, instead of having to write code, they can use a natural language prompt. So, I 100% agree with you. More and more often, with the models that we’re calling, whether they’re big foundation models, including specifically LLMs, or smaller models, having natural language prompts in there lets you very easily get what you’re looking for, maybe even just out of a plot.
Vaibhav Gupta: 13:30
Yeah, exactly. I think the thing that we have to think about as this stuff becomes more and more prevalent is the developer tooling that has to come with it. Just like how React had to exist for Next.js, TypeScript, and all these other things to come out and make our lives a lot better in the web development world, we asked what has to exist in the world of LLMs, and AI models generally, for developers, not the people producing the models, because that’s a different world, but the people consuming the models.
14:04
No matter how good the models get, at some point, you have to write bits on a machine that flip, and that’s code. It has to plug into your code base in a way that makes sense. Just like JavaScript sucks and TypeScript is way better because of the type safety and static analysis errors that we get, we wanted to do a bunch of algorithmic work that reframes the problem for users when we made BAML.
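To make that concrete, here is a hypothetical sketch of what calling a BAML-defined function from Python can look like. BAML generates a typed client from your prompt definitions, but the function ExtractResume, the Resume type, and the project setup behind baml_client here are illustrative assumptions rather than a verbatim example from Boundary’s docs:

```python
# A hypothetical sketch of calling a BAML-defined LLM function from Python.
# Assumes a .baml file defines a Resume class and an ExtractResume function,
# and that BAML's code generation has produced the baml_client package;
# the names below are illustrative, not guaranteed API.
from baml_client import b              # generated, typed client (assumed)
from baml_client.types import Resume   # generated return type (assumed)

def main() -> None:
    resume_text = "Jane Doe. Skills: Python, SQL, dbt."
    # The call is statically typed: if the prompt's output schema changes,
    # the editor flags a type error here instead of the app failing at
    # runtime, much like React's red squiggly for an unclosed div.
    resume: Resume = b.ExtractResume(resume_text)
    print(resume.name, resume.skills)

if __name__ == "__main__":
    main()
```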
Jon Krohn: 14:30
We stay on the subject of the latest tools in tech for my next clip. In episode 863, I talk to Prof. Frank Hutter about the huge strides that TabPFN is making in science and medicine. For the uninitiated, TabPFN is a foundation model that models tabular data impressively well, a feat that deep learning models had struggled with until now. The phenomenal results of some notable TabPFN applications have already been reported in leading peer-reviewed journals like Nature and Science. I asked Frank which TabPFN applications excite him most and how listeners can get started with TabPFN for their own use cases.
Jon Krohn: 15:06
Yeah, so very exciting, all of these big updates from version one to version two. With version one, as you mentioned, the applicability of TabPFN was relatively limited, but nevertheless, there were still some great use cases that came out of it. One of them was a Science paper. So, in addition to Nature, the journal that you published in, there’s one other big, broad, general-science journal out there, and it’s called Science.
15:31
And so, there’s this paper. I’m not even going to try to get into the biology of what this means, but we’ll include the paper in the show notes. It’s called “Large-Scale Chemoproteomics Expedites Ligand Discovery and Predicts Ligand Behavior in Cells.” I can’t really explain what this is all about, it’s something to do with determining protein structure, but the key thing is that TabPFN was used as a part of the inferences that they made in that paper. And I’ll also have a link in the show notes to a GitHub repo called Awesome-TabPFN that lists about a dozen existing applications of TabPFN: health insurance, factory fault classification, financial applications, a wildfire propagation paper, and a number of biological papers.
16:28
So, yeah, clearly lots of different applications out there, even for v1. I don’t know if you want to talk about any of them in specific detail, Frank, but I know that you are, of course, looking for more people trying out TabPFN, especially now that version two can handle so many more kinds of data types, can handle missing data, can handle outliers, and can handle larger data sets. So, listeners, if you’ve got tabular data out there, you can head to the TabPFN GitHub repo, which we also have a link to in the show notes, and you can get started right away.
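For listeners who want to try it, a minimal getting-started sketch with TabPFN’s scikit-learn-style interface could look like the following, assuming pip install tabpfn; the small toy dataset is chosen purely for illustration:

```python
# A minimal TabPFN sketch using its scikit-learn-style interface.
# Assumes `pip install tabpfn`; the toy dataset is for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

# A small tabular classification problem (569 rows, 30 numeric features)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TabPFN is pre-trained, so fit() mostly just hands it the training data
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Predictions for the whole test set come from a single forward pass
preds = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, preds):.3f}")
```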
Frank Hutter: 17:03
Yeah, awesome. Thank you so much for mentioning the Awesome-TabPFN repo. I literally created it today, so I hope that by the time the show actually goes out, there are a lot more than a dozen applications there. Please, whenever you have an application or a use case, either send us a note, or, actually, this is one of those repos where you can just do a pull request with your own application and put your own paper in. We’ll basically advertise it. Also, if there are cool applications, we’d love to do blog posts or just retweet your content and so on. I think we really want to build this community of people who love TabPFN and build on top of it.
17:50
And the open-source community has already picked this up. Within a couple of days of the Nature paper, there’s this repo, shapiq, that’s all about interpretability, and they directly put TabPFN in there. So, yeah, it’s really amazing to see the speed at which the open-source community works, and I’m really looking forward to what else people will build with this.
18:17
One cool thing about the Science paper I wanted to mention is, yeah, I also know nothing about chemoproteomics, but that’s kind of the neat thing. I can still work on this because, well, we have this really generic method, and if there is chemoproteomics data out there, then we can fine-tune on that and get something that’s even better for this use case. And so, those are the types of things that I’m really excited about doing for all kinds of use cases. There’s also already something out there on predicting …
Jon Krohn: 18:49
Algal blooms.
Frank Hutter: 18:50
Algal blooms, yeah.
Jon Krohn: 18:50
Yeah, algae.
Frank Hutter: 18:51
So, yeah.
Jon Krohn: 18:51
Green-
Frank Hutter: 18:53
Algae, I know, and algal blooms are the sort of … but yeah, sort of-
Jon Krohn: 18:57
Yeah. I suspect-
Frank Hutter: 18:59
Things that are good for the environment and so on, I think I’m really excited about those types of applications. There’s lots and lots of applications in medicine. There’s not that many published papers on applications in finance and so on because, well, typically, people don’t publish-
Jon Krohn: 19:15
Finance companies, yeah, exactly.
Frank Hutter: 19:16
… these types of applications as much, but medical and so on, there’s a lot, and, yeah, really hoping for a lot of people to use it to do good things for the world.
Jon Krohn: 19:25
It’s incredible to see how TabPFN has gone from strength to strength in a relatively short space of time. So, how does anyone go about setting up a successful tech company like that? In episode 865, I talk to Cal Al-Dhubaib about how to start and scale a data science consultancy, using his wildly successful company Pandata as a case in point.
Jon Krohn: 19:43
So let’s talk about the kinds of things that made Pandata so successful. We already have this “make it boring” idea: making it boring for data scientists and easy for your clients to understand the data science that you’re delivering. What are the other keys to scaling a successful data science consultancy?
Cal Al-Dhubaib: 20:02
So something that I didn’t quite know at my first startup that really stuck with me is this notion of product-market fit. Anyone who’s in the space of entrepreneurship will hear this term bandied about. For those of you who haven’t been in the field of entrepreneurship, what it means is that you’ve found a pain point that someone is willing to spend something on solving, there are enough of those people at enough scale, you know how to reach them, and you can consistently deliver that thing that they’re willing to pay for. Clients vote with their money. I found early on, because I bootstrapped, meaning I didn’t raise any capital, that the only source of growth I had was when a customer was willing to pay for something. So, it’s one thing when somebody says, “Hey, that’s a great idea.”
20:49
It’s another thing when they’re willing to sign a big check for you to solve that problem, and then they come back to you to solve that same problem or similar problem again and again and again. So, product market fit and listening to what people were willing to spend on was a really big part of Pandata. My first year, all I had to do was say, “Hey, we can do data science things.” I was able to land a few contracts here or there, but it was a rotating window. I’d work with one enterprise and then they’d go away. Another enterprise would come. That’s a very common story for consulting companies. There were maybe one or two clients that stuck around or kept coming back to us. I remember having a conversation with my stakeholder there.
21:28
I finally worked up the guts and I said, “Not that I want you to question the situation at all, but why are you coming back?” I was really trying to do some market research and understand, and it turns out that they really liked that we were approachable. That was one of our core values: hold back the jargon, always speak plainly. Then there were a couple of formulaic things that we accidentally ended up doing. We have this process called discovery and design that is now a mandatory requirement. Anybody that hires us to do any work, I say, “You have to do this upfront or I won’t work with you.” With those clients, we accidentally did it.
22:08
That’s where we’d spend 30 days to six weeks diving into a problem, trying to figure out, “Where are the skeletons? Is this solvable? How can we approach this? What are the unknown unknowns?” That’s a really big part of solving problems that have not been solved before with, to simplify it, pattern-matching algorithms. So, I tried to recreate that magic. There were these attributes that we had that became our core values, five core values that I can talk about later. Then there were these processes, and one of them was discovery and design. Now, the funny thing is, I decided, all right, I’m no longer going to work with any client that doesn’t want to do this, and we’re going to charge an arbitrary amount of money. That engagement size is now $50,000.
22:57
At that time, it was a measly $12,000. I was really a first-time entrepreneur and nervous about throwing that about. But I’d say, “Hey, you know what? Unless you’re willing to spend this, I don’t even want to work with you.” It helped me weed out two things. One, clients that weren’t serious: if they weren’t willing to pay that, they definitely weren’t willing to pay for the rest of the engagement. Two, clients that didn’t philosophically agree with the importance of that step: then I knew they were likely to be consistently disappointed by the results, because they didn’t quite get the data science process.
23:28
So, I went from spending a lot of time talking to a lot of people who seemed interested in data science at first, to hearing no, no, no, no. My pipeline started to dry up, and this is one of three times that Pandata’s bank account reached less than a month’s worth of expenses. I was like, “This is the end. This was maybe the dumbest idea.” Within that same period of time, I landed three of the biggest clients I had ever engaged, two of which remained clients until Pandata’s exit, so over a period of about six years. And that process became a part of how we were able to scale so much larger than most small solopreneur consulting shops.
Jon Krohn: 24:16
So the key was having this 30-day discovery and design engagement at the start of consulting with somebody, and you’d say, “There’s going to be this $50,000 price point to do that initial 30-day engagement.” That initially seemed to put you in peril, where your pipeline dried up and everyone was saying no, but it ultimately led to discovering solid long-term clients that were with you for six-plus years. Cool.
Cal Al-Dhubaib: 24:44
Well, and so I would use this tactic, and I now use this tactic, to scare off non-serious people. It actually allows me to save them time, and it allows me to save time. Then I find the companies and the groups that say, “Heck, yeah, that sounds amazing. I love how you think about this.” There are a lot of fish in the sea. It’s all about this matchmaking process. One of the counterintuitive lessons I learned was the art of saying no, or ruling others out by saying no to them. It really allows you to spend more time on the bigger things, the higher-value things. This is a common tactic I see in a lot of my friends who are wildly successful.
Jon Krohn: 25:22
Right, right, right. That is tricky. It’s very hard to say no to smaller or more challenging projects, because you remember those times when you got down to only a month’s worth of expenses left in your bank account. You’re like, “Well, I guess I’d better say yes to everything,” but then that ultimately slows you down. You have the death by a thousand cuts of all of these low-value touch points.
Cal Al-Dhubaib: 25:49
Well, it’s funny: when we were going through due diligence on this acquisition, there were about three points on the balance sheet and the financials that they had virtually circled. They were like, “We want to talk about this, this, and this. We don’t like that.” I said, “I didn’t like those either.” Those were really bad moments for me too.
Jon Krohn: 26:06
All right, that’s it for today’s ICYMI episode. To make sure you don’t miss any of our exciting upcoming episodes, subscribe to this podcast if you haven’t already, but most importantly, I hope you’ll just keep on listening! Until next time, keep on rockin’ it out there, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.