Podcasts SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

57 minutes
Business, Data Science, Data Visualization

SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

Subscribe on Apple Podcasts, Spotify, Stitcher Radio or TuneIn

As the founder of Glean, a data exploration and visualization platform that recently raised $7 million in funding, we’re sitting down with Carlos to dig into the platform and extract some vital tips on starting and growing a much-loved data startup from scratch. From technical topics like his chosen software stack to sales-driven topics like customer churn and managerial tips, tune in for clever insights from Carlos and his life as the founder of a successful analytics startup.

Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.

About Carlos Aguilar

Carlos is now the Founder of Glean.io which aims to make data analysis and visualization more accessible. Previously, Carlos started and ran the Data Insights team at Flatiron Health for five years. Data insights is a team of engineers and data scientists that drive positive impact on oncology practices with a product-oriented approach to building data products in healthcare. He stared his career working at Amazon Robotics/ Kiva Systems and machine learning + robotics at Cornell’s Computational Synthesis Lab.

Overview

Organizations now have access to more data than ever before. But when insights are critical for making impactful decisions, your data is only as powerful as the business intelligence software you rely on daily. Is it time to ask yourself: is my chosen platform holding my team back?

That’s where Glean enters the picture. Pushing the limits of data visualization, democratizing data, and making business analytics as scalable, accessible and as quick as possible is what founder Carlos Aguilar and his team aim to do. “Our biggest focus over the last year is time to insights – we want to get people plugged in and finding insights within 10 minutes,” he says.

But how exactly do they pull it off? As Carlos explains, Glean relies mostly on live computations from the client’s data warehouse, while also performing data profiling, caching and acceleration to deliver a fast experience. From a technical perspective, Glean’s tech stack looks like this: React and D3 for the front end; Apache Arrow for data serialization and Redis Cache for re-distribution of data when needed.

When it comes to building a great founding team, Carlos looks for people he calls “owners.” These individuals love taking ownership and demonstrate empathy and care, toward their colleagues and the needs of their end users. His approach to diversity and inclusion also stems from empathy, telling Jon how important it is to voice your company values during your interview process to weed out individuals who likely won’t be a good fit for your team and customers.

Tune in to learn more about the technical and managerial details that went into building Glean.

In this episode you will learn:

How Glean extracts actionable insights from their client’s data warehouses [06:48]
What sets Glean apart from other platforms [12:43]
Glean’s software stack [14:43]
Glean’s recent fundraising journey [24:56]
The essential characteristics of a founding team [30:53]
How Carlos founded Glean [36:56]
Carlos’s former role at Flatiron Health [40:49]
How Carlos created a robotic painter [48:57]

Items mentioned in this podcast:

Glean.io
About Glean
D3.js
Snowflake
Google BigQuery
Localogy interview with Carlos
Redis
Apache Arrow
DuckDB
SDS 535: How to Found, Grow and Sell a Data Science Startup
SDS 511: Data Science for Private Investing
SDS 549: Engineering Natural Language Models
SDS Episode 523: Open Source Analytical Computing
Glean raises $7M to democratize data insights
How to interview your first data hire
The Lean Startup by Eric Ries
Flatiron Health
SDS 619: Tools for Deploying Data Science Models Into Production
Robotic Painter
Winter Recipes from the Collective by Louise Glück
Jon’s virtual conference on natural language processing with large language models
SDS special code for a free 30-day trial of O’Reilly: SDSPOD23

Follow Carlos:

Follow Jon:

Episode Transcript:

Download The Transcript

Podcast Transcript

Jon Krohn: 00:00

This is episode number 653 with Carlos Aguilar, founder and CEO of Glean.io.

00:10

Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.

00:41

Welcome back to the SuperDataScience podcast. Today I’m joined by the super sharp and super technical entrepreneur, Carlos Aguilar. Carlos is the founder and CEO of Glean, a New York-based startup that has raised 7 million in venture capital to democratize data analytics and data insights. Previously, he spent six years as the vice president of Data Insights Engineering at Flatiron Health, a cancer data platform that was acquired by the Swiss pharmaceutical giant Roche. And prior to that he spent five years engineering robots at Amazon. He holds a master’s in mechanical and aerospace engineering from Cornell.

01:18

Today’s episode includes some technical content near the beginning that will appeal most to data science and software engineering practitioners. But we do break down the main takeaways from those technical discussions so that any interested listener can partake. The remainder of the episode will appeal to anyone who’s keen to hear from an extremely intelligent technical founder on how to successfully launch a data-centric startup. In this episode, Carlos details the software stack Glean built their platform with to enable their customers to quickly obtain slick data visualizations and insights from their vast data warehouses. How to grow an entrepreneurial idea into an investible company. The essential characteristics to look for in your founding team, how to ensure your early clients are continuously delighted by your evolving software platform and how we used genetic algorithms to enable a robotic arm to paint beautiful creative art onto real world canvases. All right, you ready for this inspiring episode? Let’s go.

02:20

Carlos. Great to have you on the SuperDataScience podcast. Where are you calling in from?

Carlos Aguilar: 02:26

I am in New York, in our office in SoHo.

Jon Krohn: 02:30

Yeah, we, it’s entirely my fault. I know that you’re based in New York and I, in my head I wasn’t even thinking about how can…

Carlos Aguilar: 02:38

I can see your building.

Jon Krohn: 02:39

Yeah, exactly. From, yeah, from your… We can see each other’s where, where we are right now from where we’re recording separately and remotely. So I made a big mistake, but I’ll just have to make it up to you someday by offering you another slot and recording in person.

Carlos Aguilar: 02:56

Sounds great.

Jon Krohn: 02:58

Would’ve been super fun to meet in person. The only time that we have met previously was in person. There was a super fun AI company startup social thing that was run by, two cool New York startups Tribe and Arthur. And so, we actually have somebody coming up soon, Jaclyn Rice Arnold, who is one of the hosts of that cool event that we were at in late 2022. And yeah, this event was hosted at the Edge, which is this very popular tourist attraction as well as like Instagram photo and dating profile photo, attraction in New York. It’s this very tall platform that’s like this glass edge hanging off of a building and professional photographers weight up there to get you your perfect, like, cuz you’re just, it looks like you’re kind of just floating over Manhattan and yeah, it must be the most photographed location on the dating app Hinge.

Carlos Aguilar: 04:05

Yeah. Funny enough, I never actually made it up to the edge. I was at the bar the whole time, but…

Jon Krohn: 04:10

Yeah, well you didn’t really miss out because it, I mean, it was in December, so you were kind of hoping to get lucky with the weather anyway. And we did not get lucky with the weather at this event. It was, it was so cloudy, foggy that day, that you, you couldn’t see a single thing. You couldn’t see one building from the Edge. You were just like on this fog…

Carlos Aguilar: 04:33

You’re just on the edge, hello.

Jon Krohn: 04:37

Yeah, it was, yeah, I’m just shouting into the void. And yeah, but they had a champagne reception in this fog in the rain and crazy wind. There’s one really fun photo of like, people’s ties flying off as we tried to get one photo of anyone who made it.

Carlos Aguilar: 04:56

Yeah, it’s actually, I actually, I saw the picture and I didn’t even realize it was outside. It looked like it was on a white background, so… Jen thought it was just on like in a room in a white background.

Jon Krohn: 05:09

No, that’s like on a clear day it would’ve been Manhattan, all of Manhattan in the background right. Would’ve looked magnificent. Anyway. Yeah. so yeah, Jaclyn Rice Arnold, she was one of the hosts of the event. She will be on the show soon. Really looking forward to that episode. She’s been doing an amazing job and, yeah, you and I met because I was talking to her and Austin Ogilvie, who was in episode number 535 and Drew Conway, who is in episode number 511. And, yeah, so I guess you’ve known these folks for a long time.

Carlos Aguilar: 05:38

So yeah, I’m the only one missing out at this point. The missing puzzle piece.

Jon Krohn: 05:44

Yeah.

Carlos Aguilar: 05:44

Yeah. Jackie, it turns out, I know through a friend of mine, I don’t know if you have this, but like, is there a person you like coded with in high school and middle school? Like my good friend that I learned to code with this is like, you know, in 2002, so it was like Flash and ActionScript and JavaScript and web stuff and PHP, all that sort of stuff in high school. That’s my good friend and introduced me to Jackie at some point. And then, Austin and Drew, I just know from data folks in, in New York.

Jon Krohn: 06:18

Nice. Yeah, so they are all three of those people, including yourself, are prominent people in the startup community in New York. You specifically are the founder and CEO of Glean, which is a data exploration and visualization platform. And we actually, we’ve had a guest in the past, Lauren, who was in episode number 549. She works at a company called Glean, but it’s a different Glean.

Carlos Aguilar: 06:41

Yes.

Jon Krohn: 06:41

And so we’re not gonna even talk about that glean today. It doesn’t matter what they do, but you are a data exploration and visualization platform. can you expand on what you guys do?

Carlos Aguilar: 06:54

Yeah, so it’s Glean.io and yeah, I think my, my background is in startups always trying to like vend data to subject matter experts trying to get data out to folks, whether it’s in operations or healthcare, and was always just really frustrated with the state of how folks collaborated with data. So I started Glean as a way to make data really intuitive, really accessible to non-technical folks. And the way it works is that you as a technical person, you can set up and standardize a handful of metrics and right out of the gate we sort of automatically create some visual representations that are very clicky and clickable and explorable and you can send those out to the rest of your organization with relatively little configuration. So this assumes a world where you’re probably organizing data in something like a data warehouse, or maybe it’s just simple data and you want just like some very typical types of dashboards. We automate the creation of those dashboards for folks. And so, yeah, that’s the gist of how it works.

Jon Krohn: 08:00

So Glean is like a verb here where you get to glean information very quickly and easily from potentially a vast data warehouse.

Carlos Aguilar: 08:11

Right. Right. And like I think our big differentiation and our big focus over the last year is just time to insight. So we want to get people plugged in and finding insights within 10 minutes. That’s our, that’s our goal. And it turns out we’ve done that a lot, which is why we’ve sold the platform to a bunch of companies. But if you’re just getting started, it’s great because if you don’t know the dataset super well and don’t want to think about how to plug it in to Tableau or something like that, Glean is gonna make a bunch of assumptions for you to start with just to have smart defaults and then of course you can configure from there. But the whole idea is just like give you something visual from the get-go.

Jon Krohn: 08:51

Cool. Without divulging anything like proprietary about your clients or your platform or anything, are you able to share like one or two use cases of situations? Like you just said, you know, within 10 minutes people have been able to get insights. Are you able to give us a couple of those use cases? A couple of those case studies?

Carlos Aguilar: 09:08

Yeah, so I’ll go through maybe industries that are interesting. So industries where the data’s interesting and maybe there’s something proprietary you can do with the data and maybe there’s subject matter experts where you want them digging into the data on their own. Things like healthcare turns out to be a really good use case for us, where you might have, clinical experts that need to look at the data and need to find their own insights in the data. Financial technology, fintechs in general also have lots of domain experts that are really interested in the data and wanna get into it and explore it on their own, supply chain also, which is related to my background, lots of operations folks that really wanna dig into the data. And it turns out a really big use case for us is also subscription businesses. Like there’s lots of folks looking at who’s churning, who’s expanding revenue and, lots of people interested in why that’s happening and digging into it, looking at what happened last week. But it’s really, it’s a pretty generalized platform for exploratory data analysis. So, we’ve had lots of deviations from those two, but those are some of the main industries.

Jon Krohn: 10:13

Nice. Do you end up having to customize features within your product for these different kinds of verticals or specific clients? Or does it kind of, it works well out of the box across all these different verticals?

Carlos Aguilar: 10:26

So there’s basics that are interesting and work pretty well for every vertical you can imagine. Like doing counts and simple aggregations across different types of verticals. And then we’ll run into use cases for specific verticals that don’t quite work yet. You know, like you can think of like survival analysis in, or like healthcare outcomes where it turns out there’s a specific type of calculation that we don’t have and we have to build that in. Or like a specific type of aggregation that you can’t quite do or it’s awkward to do. We have to build in. So we are in shared slack channels with every single one of our customers, which is getting slightly unscalable, but it’s great for figuring out use cases and our customers love just like pinging us with ideas and ways that we can make the platform better. So we try not to do anything that cannot benefit the broader platform.

11:16

So there’s no like customized code that only fits for, you know, one specific customer. It’s, you know, we release every feature to every customer, and we try to get pretty hands on with some of our customers if they need help and they just need guidance on how to get into the data. It’s just like a good sort of learning and startup hack to like get really deep with customers if they give you the opportunity, just try to get in their shoes, help them analyze their data if they’re inviting you in to do that. So we’ve done a fair amount of that and some lightweight services too.

Jon Krohn: 11:48

It sounds like you’re doing that perfectly in terms of scalability, it would be a nightmare if you were building custom functions, capabilities for individual clients and then you’d have to like maintain a separate code base just for them. Like it would be, I would just be, it would very quickly become a nightmare of, right, yeah. Like trying to keep things compatible.

Carlos Aguilar: 12:07

And it turns out, so we rely on the underlying data warehouse in a lot of cases. So in a lot of cases, the power of the computation can come from the underlying data warehouse. So things like BigQuery and Snowflake have very rich, you know, computation libraries and lots of different functions you can use. And so a lot of times it’s just making sure that we’re compatible with all of those sort of advanced analytics that you can do in those data warehouses.

Jon Krohn: 12:31

Very cool. We’ll talk a bit more about your particular tech stack or tools that you’re using in a moment, but before we get to that, what makes Glean different from other kinds of business intelligence tools that are out there?

Carlos Aguilar: 12:49

Yeah, I think when I started Glean, there’s really two general workflows for setting up business intelligence. Either you had this like really fast time to value that was like really satisfying where you just write a bunch of SQL, configure some charts, and it’s like really immediate feedback. You just like write one query and you get one chart. The problem with that is it becomes pretty unscalable and like you have these standard metrics definitions and now they’re copied in 10 different SQL queries and that’s really, really challenging. So there’s lots of tools that work like that and they’re great if everybody knows SQL and you can sort of keep track of what all the metrics definitions are on your own. Those tools work great. And then at the end of the spectrum, you have these like self-service tools with these like really, really heavy weight semantic layers where you do tons of definition and they’re really, really powerful, but they take a long, long time to configure.

13:39

And when I was at Flatiron, it sort of struck me that I sort of didn’t want either of those options. I wanted to keep things dry, right? I didn’t wanna repeat myself. I wanted standard metrics definitions, but I didn’t want to do like a month of implementation with this, you know, crazy domain specific language. So Glean tries to have that sort of lightweight semantic layer so we have a semantic layer, but like the minimum viable semantic layer that allows you to do self-service. So that’s really where we’re trying to differentiate ourselves is like the setup is incredibly fast. There isn’t that much logic. We assume if you want complex logic, we’re very opinionated that you should push an upstream into data pipelines so that we can keep the implementation super fast, super lightweight and super portable if like, for whatever reason, you know, you wanna switch off of Glean or too or from Glean for any use case, we’re really, really portable.

Jon Krohn: 14:36

Cool. Yeah, those sound like a great suite of capabilities. So going back to a moment ago, to my previous transition, cuz you were starting to kind of get into how the, a part of how you make all the capabilities that you just described so powerful is by taking advantage of the full power of underlying tools that people have in their data warehouses. So they could be Snowflake, it could be Google BigQuery. What are other interesting parts of your tech stack as you’ve built your platform or the way that analytics work in your platform? You know, obviously again, not getting into like your secret sauce, but if there’s some, you know, some interesting aspects of the way that you’ve developed your platform, that you could tell the audience, it’d be awesome to hear about it.

Carlos Aguilar: 15:27

Yeah, I think the general approach is that you’re gonna have some input into the process, right? You’re defining how metrics are defined and you’re giving that as an input into Glean and you’re sort of saying which dimensions are important and how they’re related to each other in a pretty lightweight way. And then behind the scenes, we’re also gonna do some data profiling. And so we try, we’re not gonna extract any data, we’re gonna rely mostly on live computation for your data warehouse, but we are gonna do some profiling and some caching and acceleration. And we do that just to make the experience feel incredibly performant and very fast, even if there’s like a little bit of latency from your data warehouse. And so we’re just, you know, it’s a web tech stack, so it’s React and D3 for the front end and then we use, Apache Arrow for data serialization and we’re shipping data to the browser and doing some computation in the browser if it is faster.

16:29

And then we have a Redis cache that sort of like, holds that data source, that data, caches that data and then distributes it to the clients when needed. But still really trying to lean on the data warehouse as much as possible because that’s where people are doing their work. And that seems to be the center of where people hold their source of truth. And that’s what allows us really to defer really doing a lot of the transformation ourselves or having to have a really, really heavyweight sort of transformation library or heavyweight, you know, like DSL that we have to maintain ourselves. We’re just relying on that data warehouse.

Jon Krohn: 17:11

Super cool. So running through some of those tools and breaking them down for our audience a bit, D3 allows you to make really cool visualizations. D3 is kind of a funny one where I feel like there’s very few people out there who are really adept at creating D3 graphics from scratch. It seems like people are really good at… There’s like pages of really cool D3 visualizations that then show you the underlying code cuz there’s so much functionality that’s possible in there. There’s a lot of like…

Carlos Aguilar: 17:44

It’s very, it’s very low level. So D3, a lot of times it’s for library builders, right? And so that’s effectively how we’re using it. We have our own visualization library that is built on D3, which is this very, very low level sort of visualization library and JavaScript.

Jon Krohn: 18:00

Cool. Yeah, well defined there. And then another cool tool that you mentioned there, is Apache Arrow. So we had Wes McKinney on the show, in late 2021, episode number 523. So West McKinney is most famous for having created Pandas, which is ubiquitous in data science. Every hands-on data scientist, I mean probably 90% of hands-on data scientists around the world use the Pandas library for manipulating data. But Wes actually doesn’t develop Pandas anymore himself. He is totally focused on Apache Arrow, which resolves some of the issues around Pandas as data have become larger and larger. So Pandas, natively is ideal for working with data on a single machine, but Apache Arrow offers a similar kind of syntax and functionality as Pandas across, a very large number of machines that you can process large amounts of data rapidly. And so, I don’t know if I really have a question there.

Carlos Aguilar: 19:07

Yes, we use it, that benefit is useful for us because we want the same data format to be in the client and operable by the client and also on our backend and in the cash. And we want to inspect that data from our backend code and from various parts in our stack. And so it helps us be super efficient because we can do data processing in the front end when it’s convenient. And we don’t have to like serialize and de-serialize all the data and try to think about what the date format is and how we convert it between, you know, different libraries. It’s always, we always just keep it in the arrow format.

Jon Krohn: 19:43

Cool. And then another kind of data tool that you were talking about before we started recording that I’d be interested to hear more about, cuz I personally hadn’t heard much about this before at all, which is DuckDB.

Carlos Aguilar: 19:57

Yeah, so DuckDB is an in-process columnar database, which, allows you, for us, honestly, I think one of the biggest, it seems to be very hype right now, but for us it’s a huge benefit for tool builders because, all before DuckDB all of our infrastructure and all of our code depended on data warehouses. So I talked about how fancy these data warehouses are and they have all these, you know, fancy computations that they can do, but they’re also external dependencies. And so when we’re testing our infrastructure or we wanna do something lightweight or we wanna create a demo, we have these sort of external dependencies. And I’m not sure, actually, I haven’t looked into it too deeply, but I’m pretty sure there’s no way to like run any sort of emulated Snowflake locally and test against it. So building a DuckDB adapter, what it allows us to do since it’s in process, is it allows us to test our code completely in a completely isolated way.

20:58

And it’s also what enables like, just very isolated use cases for our customers. So if they just wanna upload a Parquet file or a CSV or a spreadsheet of some sort and wanna start hacking around with it and they don’t wanna think about these data warehouses, they’re, you know, data warehouses are wonderful, they’re these really amazing pieces of technology, but they’re also very heavy weight. So if you don’t want to deal with that, DuckDB is a great sort of stand-in that allows you to have a lot of the power of those data warehouses without having to have that external dependency.

Jon Krohn: 21:30

Very cool and well articulated for one that I just kind of that I just threw you. I’m like, DuckDB what is it? And you’re like, it’s in-process columnar database. It is perfect. And yeah, being able to describe…

Carlos Aguilar: 21:44

Actually, to tie it together, it operates natively on those Apache Arrow APIs. So it turned out to just be very convenient. We are already serializing data with Apache Arrow and it turns out that format is really, really fast for the types of computations that DuckDB is doing.

Jon Krohn: 22:04

Nice. super cool. I love being able to dive into your tech stack with you. You clearly know it really well. I love having technical founders on as guests for exactly this reason. So thank you for indulging me. So all of these cool tools that you’ve integrated together, in order to enable people to get insights and visualizations very quickly after they integrate with Glean, so you sit on top of their database very quickly getting insights. Is the end user, do you anticipate that they’re usually non-technical users?

Carlos Aguilar: 22:45

Yeah, so it’s interesting. I think the nature of business intelligence is that it’s really interdisciplinary. So I think our hope is that we’re gonna enable cultures that wanna be collaborative. So it means non-technical stakeholders have questions about the data. Maybe a clinical expert knows something about how specific chemotherapy drugs are being administered and wants to dig into that and then can flag some sort of data quality issue to the data team and the data team can pass that on to the data engineers. And Glean is really that point of collaboration. So I think one of the unique things about Glean is that we really try to create incredible tools for each of those types of stakeholders. So if you don’t know any SQL, we have this beautiful visual data explorer that allows you to click around and find insights. But we also have, you know, the SQL window for somebody who likes writing SQL.

23:36

And interestingly, we also invest a lot in our engineering tools. So because we wanna be this lightweight layer, we have to integrate with all tools below us. So to make that work seamlessly, having really great APIs and a great command line interface and great specifications that allow this to become programmable for large teams that use Glean, it allows them to check this in into continuous integration platforms and automate a lot of these processes for creating visualization. So it starts simple. If your use cases are simple and you just have a spreadsheet, it starts simple. But if you have engineers that really wanna get into the technical parts of the product, there’s, we try to have great, great tools for those technical stakeholders. I think that really is what’s unique about data is you have people with very different backgrounds, very different skillsets that are all trying to come together and work in some sort of unified space.

Jon Krohn: 24:33

Very cool. So you democratize data insights across users in the organization, whether they’re technical or not, but simultaneously you allow for programmable APIs and CLIs, command line interfaces, in order to allow more technical people to be automating and making the best use of glean.

Carlos Aguilar: 24:53

Exactly.

Jon Krohn: 24:55

Nice. And, it seems like investors are appreciative of what you’re building as well because you raised $7 million last year. Congratulations. Yeah,

Carlos Aguilar: 25:06

Thank you.

Jon Krohn: 25:07

So what was it like to be raising that capital? We, you know, you read that 2020, 2021, early 2022 were like bonanzas, you know, unprecedented valuations and capital flowing to early-stage startups. So yeah, what was your experience like? Was it, competitive in the data tooling market to get funding? Again without divulging anything proprietary or whatever that would be saying too much. I’d love to hear a bit about, you know, the fundraising story.

Carlos Aguilar: 25:47

Yeah. So I guess we raised at the beginning of last year, so the market was pretty good. And data I think is an area of interest. I think people have seen a lot of growth. They’ve seen the data warehouses, people are storing more data and the idea is like, there’s gonna be more applications that are built on top of this, this sort of storage layer. And we fit into that. And I think the fundraising story is, I try to network and, and stay in touch with data folks in the city and out west. And it turns out VCs are a pretty integral part of that data community. So venture capitalists are just in that community as well. And so I think, just like meeting the folks that you get along with, the Angel investors that you get along with, they introduce you to others and trying to find your people.

26:39

I feel like investors, you should treat like employees. They are, they can be like great value ads and you want them to be aligned with the values and the goals of your company, right? You don’t want somebody who wants like a quick flip when you’re trying to do something very long term. And so we met Ilia at Matrix Partners who had invested in Five Trans, so it was like intimately aware of just the transformation that is happening, no pun intended in the data space. And, he was great and got what we were trying to do and ended up leading the round and then got other angels on board after that.

Jon Krohn: 27:18

Nice. Cool. Yeah. Well with your background and with what you’re building, it’s not surprising to hear that it wasn’t, you know, enormously challenging to raise. And I’m sure there’s great things ahead with all the companies that you’re signing up. And it sounds like they love the platform, they’re providing you with lots of feedback and you guys are actioning on that feedback, building new capabilities for them. So seems like you’re doing a lot of great things over at Glean. And so there might be some listeners out there who are thinking, oh, I’d love to be working at a company like Glean, Carlos sounds like a really smart guy. Sounds like they’re doing really great things. So, I imagine you have like software engineering roles open.

Carlos Aguilar: 27:57

Yep. We’re hiring for software engineering. We have, our first data role, actually a similar role to what I had when I first started at, Flatiron Health. Like our founding sort of data person is an open role that we’re, hiring opportunistically somebody who wants to dog food the product, get deep with our customers, and like really figure out what’s the best way to use Glean, we’re also hiring for and go-to market team. We just hired our first, salesperson, but likely we’ll be hiring another salesperson and or marketing person before the end of the year. So if those are things you’re super interested in, yeah, we’d love to talk to you also. Awesome. And software engineering, of course. Always software engineering.

Jon Krohn: 28:41

Yeah.

Carlos Aguilar: 28:41

Front-end, lots of interesting front-end challenges, visualization and interaction and lots of data processing and on the backend as well.

Jon Krohn: 28:51

Yeah, no doubt. Lots of work there for software engineers. That is like, that’s even the way that I ask the question. You know, we’re a data science podcast, but , I’m like, yes, always software engineers. And so, like I say on air to our listeners, I think data science is the coolest space to be in. It’s, I’m super biased, but I picked this and I love this field and I wanna be doing it for the rest of my life. But simultaneously, if I was looking for work right now, I would be making sure that I was developing all the software engineering skills that I could because you can still be this lots of really cool things that you can be doing that blend data science with engineering, like machine learning engineering or data engineering, even just plain old backend engineering. You know, having data things work performantly and actually in, in a company like yours, really cool front-end engineering stuff you could be doing with D3 as a data scientist as well. So, you know, you can take the core data science skills, but then what are kind of like adjacent engineering-y things that also are interesting to you and learn about those because yeah, everyone is always…

Carlos Aguilar: 30:03

Yeah, I mean, software is automation, and automation it turns out can be layered into almost any role. There was, I was a paralegal as an intern in I think high school for a summer, and was like sort of a, not even a paralegal, whatever’s below a paralegal just like help. And it turns out software was like, really, you know, they wanted me to like go through and tag all these PDFs. It’s like, oh, well I can write a script and do it like. Really any job if you like, layer on a bit. It doesn’t have to be hardcore software engineering, like a little bit of scripting, a little bit of coding, can probably help a lot of people.

Jon Krohn: 30:39

Yeah, really good point there. All right. So yeah, so you’re doing all this hiring, you’ve got software engineers of course, and you’ve got your critical data person that’s gonna be coming in, which might be particularly interesting to our listeners to hear about. What are the kinds of characteristics that you look for in your hires, and are you particularly careful or do you look for special things when you’re looking for your founding team or the first hires in your company?

Carlos Aguilar: 31:09

Yeah, it’s interesting because I also hired like this at Flatiron throughout, even when we’re, you know, 20 people on the team and you know, hundreds of people in the company. I tend to look for owners, like people who wanna take ownership, who wanna drive things forward, who have like an entrepreneurial spirit and like really care about customers as an endpoint and sort of driving things towards those customers. And I guess maybe that comes down to like a level of empathy, like really caring and really putting themselves in the shoes of customers and wanting to build towards that. So I think that’s probably the most important skill set that I always hire for, but is especially important at a startup. I feel like maybe that’s not controversial, but it’s definitely a thing that we look for.

Jon Krohn: 31:57

Cool. I love what you’re looking for there. Makes a lot of sense to me. People who can work independently are critical In an early stage startup, you can’t have anybody in the company that needs to be babysat, and then…

Carlos Aguilar: 32:09

Turns out I’m also just a horrible micromanager, or I’m just bad at micromanaging, I should say, so, yeah. You just need people who can drive forward on their own.

Jon Krohn: 32:19

Mm-hmm. I certainly look for the same thing. I can’t be “what are you working on today?” “What’s the next step every day?” And I love your point about being empathetic about what the customer’s needs are. That is maybe something that has not been mentioned on air. I asked this question a lot and I can’t specifically think of somebody who said that. Everybody says communication.

Carlos Aguilar: 32:47

Yeah, also important.

Jon Krohn: 32:47

Yeah. Yeah. But, being empathetic about the end user. That’s a really cool one. So, when you’re hiring, how can you ensure that diversity and inclusion are foundational to the hiring practices and culture of your firm?

Carlos Aguilar: 33:03

Yeah, I think your values really have to come out in your interview process. Like that’s your one chance to explain and communicate to candidates what you’re about. So working diversity and inclusion and that basic empathy, it’s actually connected like having empathy for your customers, but also for your colleagues and just not being a jerk. So you can really work that into the interview process. And we do, we have part of our interview process that, tries to tease out sort of values fit from that perspective. And I think a superpower for founders in general, you know, as a founder, you know, you have, maybe you’re starting with a few employees, you can’t necessarily get the most senior engineer from Google every time. So like, if you’re just looking for the sort of resume, check the box, and it’s gonna be a relatively homogenous group, maybe, like you can’t always compete over those folks.

33:59

So you really have to consider a whole human and what they’re bringing and what their background is and what they’ve had to go through and take a really nuanced look at hiring. And I think that’s really where founders can have an edge. And that’s where diversity can come in to play too. Like having, an eclectic mix of folks with from different backgrounds can be a great strength. And trying to find folks, you know, who have a different perspective is super important. And I’ll say like, I don’t think we have like, “check the box” and like, “we’re done with diversity” and like, it’s an ongoing process. It’s something that we try to work into the process and it’s something that we’re always trying to have people with different perspectives.

Jon Krohn: 34:43

Cool. I love that answer. Yeah. So everything you’re saying there about diversity inclusion ties in nicely, or is a great addendum to a conversation that Erik Bernhardsson had in episode number 619. In that episode, he was talking about a famous blog post of his, where he was describing how, you know, when you are looking for a new apartment or a new house, you know, that you can’t have everything that you want in that new property. You know, you can’t have the location and the size and the patio and like, there’s, you’re gonna have,

Carlos Aguilar: 35:20

Especially in New York.

Jon Krohn: 35:21

Especially in New York. And so he was using that analogy to describe how, if you are clever about what you’re looking for in the people that you hire, you can find great value by say, you know, them going to a prominent institution that might not actually mean that they’re a better software developer or employee, but it is something that a lot of people look for and pay premium for. So you’re kind of, I don’t remember specifically talking about the diversity and inclusion element in Erik’s conversation, but what you’re saying kind of ties into that where by being open to different kinds of backgrounds and considering, like you described the whole person, you could be considering people, like you’re saying they didn’t come from this well-trodden path into early stage, fast-growing startup software engineering role. But you know, they’ve, they’ve managed to develop the right background despite not coming from that well-trodden path. And that means that they’re going to have a different perspective and maybe have a level of grit or something that somebody who just came down the usual path doesn’t have.

Carlos Aguilar: 36:36

Yeah. Yeah. It’s all about valuing what’s actually important to the company. And exactly like you said, if you check the five boxes of like, went to Stanford and has worked at Google and has worked at, you end up with a pretty tight and pretty similar looking group of folks, right? So just expanding your horizons a little and like thinking about what’s really important to your company.

Jon Krohn: 36:55

Nice. So as you were setting out to start this very inclusive company that’s solving a problem, that obviously was a problem that you were encountering yourself back at Flatiron Health, how did it come together that you were like, “Okay, now’s the time that I’m gonna go from knowing that I have this problem to making the jump and starting the company?” This is your first company that you founded, right?

Carlos Aguilar: 37:22

Yeah, I was on two early teams, but this is the first company that I founded. I was, yeah, first 20 employees at Flatiron Health. But yeah, first time founding, yeah, I think I was pretty reluctant. So I went, I saw Flatiron Health grow substantially, was the first there to hire and hired a bunch of folks onto that team. And Flatiron got acquired by Roche. So I left after the acquisition, was doing a little bit of the idea maze and trying to figure out what I was gonna do and consulting with companies and helping them establish their data teams and helping them set up their data stacks, setting up a data warehouse for them. And I’ve always had this idea in the back of my head that like, visualization tooling isn’t what I expected and wasn’t at the bar that like I would want as a user, but I was probably reluctant because it is a crowded space and it’s an ambitious problem to solve.

38:19

So I went in maybe with a little bit of skepticism and helped a bunch of companies set up their data warehouse. And probably after like my third time selling in Looker into an organization, just seeing the price tag and realizing it just felt too heavy. Probably it’s a great tool at scale, but it felt heavy for a lot of the companies that I was implementing it at. And it just felt like there was a gap in the market that really needed to be addressed. There was this like real pull and real need for something that was a little bit lighter weight, a little bit more opinionated and convinced myself and started just prototyping. Like, I just started writing JavaScript and D3 and hacking around with some prototypes and got some people excited about it. And that’s like the flywheel that’s hard to stop. Like you build a little thing and somebody else is excited about it and then it just like sort of gets pulled outta your hand and then the next thing you know, so a few years later and you have customers and raise money and all those things.

Jon Krohn: 39:15

That sounds awesome. I love that story. It ties in perfectly with this like lean startup idea that’s popularized by Eric Ries, or maybe coined by Eric Ries in the book, autonomously titled Lean Startup. And so it’s a, if you’re thinking about starting a company and you haven’t done it yet, it’s a really easy book to read to get some ideas around how to be doing it. It’s one of those books, which is actually like most non-fiction books, the vast majority are written like this where there’s like a few really important points, that are like strung out with all kinds of case studies and anyway, so really easy read, really fast read about…

Carlos Aguilar: 40:03

It’s like when I was getting started in startups, it was like the startup bible, right? It was on every, literally on like half of the desks was The Lean Startup.

Jon Krohn: 40:12

Yeah. And that’s, and what you’re describing there was like perfect lean startup where you are just on your own prototyping some stuff, getting some feedback from people and them being excited about it, and then you building a little more and yeah, like describing that flywheel and then oh, one of those people you’re talking to about it is interested in investing and yeah, things get outta hand.

Carlos Aguilar: 40:34

Great.

Jon Krohn: 40:36

Cool. All right. So we have talked about your transition into starting Glean. You’ve mentioned Flatiron Health a number of times. Let’s talk about that role there that you’re doing. Cause it sounds really cool. You know, you started, you, you know, one of the first employees at Flatiron Health, they were acquired by the pharmaceutical giant Roche, and so yeah, enormously successful experience. You grew a large team there. Let’s talk about it. So they are an oncology data platform, so oncology, cancer, data platform. So yeah, tell us a bit more about the company and what it was like to run the data insights team.

Carlos Aguilar: 41:16

Yeah, so Flatiron had the founding hypothesis was that there had just been widespread adoption of medical records, and that had happened over the course of maybe a decade. And, so now we had all this digitized, like structured data about patients and that that was gonna be valuable. So that’s the starting point, and that it was gonna be particularly valuable in oncology where there’s many different disease states, many different treatment types, and the therapies are expensive and hard to develop, and knowing which ones work is a really, really valuable thing. And so, this was a space where data could really help outcomes, it could help patients and it could help research and development, and that was the founding hypothesis, but like, how exactly that was all gonna come together was a bit of trial and error. And so when I joined, we had partnered with a couple cancer centers and were providing some basic tooling.

42:15

And so I joined and my role was really integrations, so getting data in from hospitals and trying to do that in a highly structured high quality way. And then I sort of took it upon myself to make part of my job also, like figuring out product market fit, like how to make this data valuable. And I think this is a lot of the core of what the role was and a little bit probably just of my personality that I couldn’t know whether it was high quality or not unless I knew exactly how it was gonna be used. So until we got it into the use case, okay, is this gonna be for clinical trials and finding patients for clinical trials, because then we need these data points. You really had to pull the whole use case all the way through from the sources to the users before you could really understand whether data quality meant anything and that that was the ethos.

43:09

So I started this theme, I think my first title was something like Integration Manager. We pivoted the title to be Data Insights Engineering, where it was really this product focused group of data engineers and data scientists and later on machine learning practitioners that was really focused on product, what do the end users need, and how do we tie the source data all the way through to the end and help push, this mission of helping cancer patients and push forward, research and development in Oncology forward. And so this is pretty product-oriented data science team, pretty interdisciplinary. There’s some folks with healthcare background, some folks with stats, backgrounds, a lot of analyst types.

Jon Krohn: 43:53

Nice. I love it. It sounds like an amazing experience. you know, you’re making a real big impact in the world, building a product that is, beneficial to people, in the cancer space. And I can only imagine working with really clever people with lots of different specializations and, yeah, gleaning insights all the way back then. But yeah. And it’s cool how that experience, led to what you’re doing now today at Glean, that you noticed. Yeah, there was an opportunity for there to be, tools out there that would’ve made your work back at Flatiron Health easier to do. Nice. And then prior to Flatiron Health, you studied mechanical and aerospace engineering. That sounds cool. Which is how you ended up working on Amazon’s warehouse robots before you started at Flatiron. So maybe give us a bit of a, like, rundown on that background, like how was that mechanical and aerospace engineering background useful to you as a technical founder still today? And yeah, without divulging anything prepared for…

Carlos Aguilar: 45:03

I’ll tell you all, I’ll tell you all the secrets. Yeah. So I was actually doing robotics research at Cornell, and this was in 2008, so there weren’t, I knew I probably wanted to do something with the robotics. There weren’t that, I mean, there’s large industrial automation companies, but things that I actually wanted to work at, there weren’t like a huge number of robotics startups at that point. It was just a couple. And so Kiva Systems, which was the startup that I joined, was one of them working on pretty cool technology, which was little Roomba-style robots that automated warehouses. So the idea is that instead of people walking out to shelves, the shelves would walk to the people. You’d have a fleet of robots that would bring the shelves to the people and they didn’t have to walk around warehouses and it would just speed up delivery time.

45:58

And, it was a tough time to start or be in startups in 2009. It was a great financial crisis. And so we went through layoffs and a lot of scary times as well, and customers were tough. So it was a pretty hard journey. And my role was making sure that, that the technology worked at a customer site. So it was a pretty customer facing role. And I was, my role was technically systems engineering, so make sure all the pieces of this really complex system that had many layers of control systems. So you can imagine the robots needed path planning, and then there was path optimization, then you had to allocate orders to humans to pick them. And so there’s a bunch of resource allocators, and these were all control systems that had to be carefully tuned. So I turned into sort of the expert in how to tune these systems.

46:49

It turned out data was an amazing tool for figuring out how to tune all these things. So instead of just like blindly trying to tune them, looking at a bunch of data to try to figure out how to tune them, when they were performing well, when they were performing poorly, was a huge benefit. So I learned this thing called SQL was hacking together, I guess what you could call data pipelines with spreadsheets, getting landed into servers was an on-premise solution. So got to learn a lot of sort of hard technical skills from sort of just coding and bash and scripting and SQL and figuring out how to instrument this really complex system. Then ultimately ended up integrating a lot of those learnings into the product. So built out data tools so that customers could understand how this really complex system was working.

47:37

And I think that’s how I really first got my start with, like, figuring out the data could be this really powerful thing for helping people understand systems, these complex systems and make products better and make humans’ lives better by being able to better interact with the systems. And so my bias is highly towards systems-oriented backgrounds, and this gets a little bit into the hiring, hiring discussion too. Like people who have different sort of perspectives and can bring different sort of things like mechanical engineering is a very interdisciplinary degree. So you do literally machining, you’re like in the machine shop machining things and figuring out tolerances. You’re also doing CAD, you’re also doing some programming, you’re doing electronics. And so figuring out how all those puzzle pieces together, I think that’s, that was critical for figuring out how this really complex system works. And I think I still use that and think about that way of thinking even at Glean, right? Really thinking about these interdisciplinary problems.

Jon Krohn: 48:39

Yeah. The interdisciplinary problem thread is cool. And also just that it sounds like from the very beginning of your career you’ve been identifying opportunities to be chaining and integrating data together and leading insights from it. Really cool. Yeah. And you also, during that time in your masters, you created a robotic painter, is that right?

Carlos Aguilar: 49:03

Yeah, yeah. So it was a generative painting robot, so it would figure out how to execute a painting with brush strokes and it would do it in simulation. So it was a machine learning algorithm that like figured out how to do this in simulation…

Jon Krohn: 49:21

Genetic algorithm, right?

Carlos Aguilar: 49:23

Exactly. A genetic algorithm. And then it executed those brush strokes onto a physical canvas and the lab that I was working in, so this was early days, I think relatively early days of like machine learning, I mean not early, early days, but before obviously a couple of waves of machine learning research. And the hypothesis of the lab was that machines and machine learning would be good at creative problem solving. And so we were looking at a bunch of different design spaces like, you know, bridges and structures and mechanical, design and seeing how machine learning and evolutionary algorithms could solve these sort of problems creatively in ways that humans couldn’t. And my particular project was figuring out whether machines could be creative from that perspective. And it was like, you know, there was not a lot of research in this space at the time. And I will say what we did was incredibly naive compared to like DALL·E and all these systems that have come out recently. But it’s cool to see that some of that turned out to be right at least that generative art and generative image creation turns out to be an incredible use case for machine learning.

Jon Krohn: 50:35

Mm-hmm. Yeah. So I will include in the show notes a link to your website that has videos of this arm painting using your genetic algorithm. Do you wanna fill us in just a little bit for the audience on what genetic algorithms are and how they work?

Carlos Aguilar: 50:50

Yeah, so genetic algorithms, the way they work is they take an encoding of a solution and they encode that into a genotype. And so you can imagine brush strokes represented as some sort of like bits, and then they execute that painting and you start with a population of random paintings. And those random paintings look nothing like your target, but you take the ones that look the most like your target, and then you sort of cross them over and you mix them together, sort of like, reproduction,

Jon Krohn: 51:25

You breed the algorithms together.

Carlos Aguilar: 51:26

Exactly. Well, you breed the solutions, the possible solutions together, and then you mutate them. So you, it’s inspired by evolution, by biological evolution obviously. And you get with each sequential generation, you get, things that look a little bit more like your target. And so you have an objective function and you’re grading your entire population. I think I was looking at a thousand paintings or something like that, and they were all competing to try to be the best painting. And over many generations, hundreds and hundreds of generations, you saw something that looked like your target image, emerge out of the solutions.

Jon Krohn: 52:08

So presumably all of that training, all of these generations of genetic algorithms being interbred and the mutations happening, that presumably all happens in simulation.

Carlos Aguilar: 52:19

Exactly. So, and then it’s impractical. There were some projects where we actually did live learning, like on real robots, but it’s sort of impractical with paintings where you’re like wasting canvases. So the idea is you create a simulation first, you evolve it in simulation and you try to match your simulation to reality as much as possible. And then you execute in… So you can imagine it, it’s sort of like imagination, right? Was one of the metaphors that we use that, you know, it, the robot has a self-representation, it can imagine these solutions and then executes them.

Jon Krohn: 52:52

Cool. Yeah, it’s, so it’s, it’s just like, you know, for a, a software application with machine learning running in it, you’re putting it into production, where production here is a physical robot arm that’s painting on a real canvas. That’s cool. Nice. All right, well this has been a super fun episode, Carlos. I’ve learned a ton about, new tools and loved hearing your stories and the way that you’ve gotten to where you are with Glean today. And sounds like exciting things are ahead. As we approach the end of the episode, I’ll ask you like I do all of our guests for a book recommendation.

Carlos Aguilar: 53:32

Yeah. So, my partner Margaret has gotten me a little bit into poetry recently. We’ve been reading poetry, so I’m just gonna recommend broadly Louise Glück, who is an incredible poet. And she had a book in 2021, Winter Recipes from the Collective. And this poem is not in that particular book, but my favorite poem by her is a poem called Elms, which is like sort of about makers and creators. So I love that particular poem too. So people should, should at least read the poem because it’s quick to read.

Jon Krohn: 54:10

Nice. That sounds great. Elms by Louise Glück, not from the book. Winter Recipes,

Carlos Aguilar: 54:18

From the Collective.

Jon Krohn: 54:19

Great, Winter Recipes from the Collective. But yeah, nevertheless recommending that collection. Very cool. I do believe, at least since I’ve been hosting this show for more than two years, I don’t think we’ve had a poetry collection recommended as the book, so thank you for that Carlos.

Carlos Aguilar: 54:36

Proud to be the first.

Jon Krohn: 54:37

So for people to get future poetry recommendations or maybe even tech related advice, how should people be following you or connecting with you going forward?

Carlos Aguilar: 54:48

Yeah, glean.io for Glean’s website, I’m @trucklos on Twitter, you can email me too, carlos@glean.io. If you hear this and have any thoughts or want to get in touch, yeah, feel free to just email me.

Jon Krohn: 55:02

Nice. All right Carlos, thank you so much for coming on the show. And yeah, we’ll have to do an episode again sometime in the future, so…

Carlos Aguilar: 55:09

In-person.

Jon Krohn: 55:09

so we can be in person. Exactly. Yeah, I’m looking forward to it.

Carlos Aguilar: 55:14

Yeah, thanks a lot. Have a good one.

Jon Krohn: 55:21

Carlos. Wow. What an inspiring and deeply technical entrepreneur. In today’s episode, he filled us in on how his platform glean.io leverages the full power of his client’s data warehouses to performantly extract actionable insights from the vast amount of available data. Talked about how the embedded columnar database DuckDB allows the testing and development of software independent of external dependencies like Snowflake or Google BigQuery, how D3.js enabled Glean to create a slick library of custom data visualizations. How it’s critical that early hires in a tech startup are highly empathetic to clients’ needs and how he used a genetic algorithm to program a robotic arm to paint creative real-life works of art. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Carlos’s social media profiles, as well as my own social media profiles at www.superdatascience.com/653. That’s www.superdatascience.com/653

56:16

Beyond social media, another way we can interact is coming up on March 1st when I’ll be hosting a virtual conference on natural language processing with large language models like BERT and ChatGPT. It’ll be interactive, practical, and it’ll feature some of the most influential scientists and instructors in the large natural language model space as speakers. It’ll be live in the O’Reilly platform, which many employers and universities provide free access to. Otherwise, you can grab a free 30-day trial of O’Reilly using our special code SDSPOD23. We’ve got a link to that code ready for you in the show notes.

56:52

Thanks to my colleagues at Nebula for supporting me while I create content like the SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another inspiring episode for us today. And thanks of course to you for listening. It’s literally why I’m here. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.

Podcasts SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

Podcast Transcript

Share on

Related Podcasts

November 18, 2025

November 14, 2025

November 11, 2025

Podcasts SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

Share

SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses

Podcast Transcript

Share on

Related Podcasts

November 18, 2025

SDS 941: Multi-Agent Human Societies, with Dr. Vijoy Pandey

November 14, 2025

SDS 940: In Case You Missed It in October 2025

November 11, 2025

SDS 939: Mixture-of-Experts and State-Space Models on Edge Devices, with Tyler Cox and Shirish Gupta