SDS 479: Knowledge Graphs

Podcast Guest: Maureen Teyssier

June 15, 2021

In today’s episode we talked about the theory and applications of knowledge graphs, the data science techniques Reonomy utilizes for their work, tools and approaches Maureen leverages, data scientists vs data analysts, and more!

About Maureen
Maureen is Chief Data Scientist at Reonomy, a property intelligence company which is transforming the world’s largest asset class: commercial real estate. Maureen has run simulations and transformed data for 20 years. She has a breadth of knowledge on varieties of data, including location data, click data, image data, streaming data, and public and simulated data, as well as experience working with data at scale, managing datasets ranging from kilobytes to terabytes. In previous roles, Maureen drove technological and process advancements which resulted in 500% year-over-year B2B contract growth at Enigma, a data-as-a-service company headquartered in New York City. She delivered models which anticipate human behavior and needs at Axon Vibe, created a smartwatch app recommender in the Insight Data Science Fellows Program, and researched how galactic shapes emerge from the interplay between dark matter and stellar evolution as a postdoctoral associate at Rutgers University. Maureen’s Ph.D. is in Computational Astrophysics from Columbia University: she studied the evolution of galaxies by running cosmological simulations on supercomputers.
Overview
Maureen is a longtime friend and colleague I’ve been hoping to get on the podcast. She was a prominent member of the deep learning study group I ran for several years, in which we studied the foundations of deep learning as a group. Currently, Maureen works for Reonomy, a commercial real estate company that provides property intelligence and the ability to run complex queries on the data available for a property. One of the key tools they use is knowledge graphs.
A knowledge graph, similar to a social graph, offers a connected, web-like structure whose nodes can be people, properties, or companies. Focused around those nodes, Reonomy can capture the complex relationships around a property. Because the nodes on this type of graph can be of different types, there is an extra layer of flexibility and power on the product side. One interesting use case for this kind of graph, revealing hidden ownership of a property, is famously illustrated by the land that became Walt Disney World. In the 1960s, Walt Disney purchased 30,000 acres of land in Florida. In order to prevent other companies from purchasing land adjacent to the future park, the company formed a number of small shell corporations, which bought the land. These represent exactly the kind of complex relationship between owner and property that a knowledge graph can help untangle; were this technology around then, the secret might have been known at the time of purchase. Beyond the graph itself, Reonomy uses technologies such as Spark Scala, plus several models, to help tackle the billions of data points that can be attributed to different properties.
In terms of who Maureen likes to hire, she looks for folks who are reflective about the information and data they have, rather than jumping straight to action. She also looks for folks who have an instinct for curiosity and asking questions, and an ability to collaborate, communicate, and stay organized so the team can work together. As you may know, I have had a lot to say previously about the differences between a good and a great data scientist, and a lot of what Maureen shared here falls right into that. We discussed different programming languages, the benefits of knowing them, and their use in Maureen’s work. The good news for anyone out there is that Maureen is hiring right now for a data scientist, a data analyst, and a data engineer. Reonomy is looking to improve the person resolution in their pipelines and their company tree structures, and the data scientist role would tackle this. The data analyst position is about pulling data from different portions of the pipeline to evaluate the best opportunities; this work drives the product but does not live as part of the product. For the data engineer role, Maureen defines it differently from an ML engineer: the engineer in this role will work with the information being ingested and help shape, transform, standardize, and clean it. From here, Maureen shared her tips for growing your own data science team.
We closed out with Maureen shedding some light on how to transition from academia to industry and on tools that help academics build real-world experience. She shared some of the very cool work she did as an astrophysicist and the data requirements there, and what parts of the academic world pushed her towards the industry sector.
In this episode you will learn:
  • Maureen’s work with Reonomy [5:40]
  • Knowledge graphs and use cases [7:35]
  • Other tools Reonomy uses [18:58]
  • What Maureen looks for in potential hires, soft skills and hard skills [26:28]
  • Hiring at Reonomy [41:40]
  • Maureen’s tips for growing a data science team [48:55]
  • Tools to transition from academia to industry [52:45] 

Podcast Transcript

Jon: 00:00:00

This is episode number 479, with Dr. Maureen Teyssier, Chief Data Scientist at Reonomy. 
Jon: 00:00:12
Welcome to the SuperDataScience podcast. My name is Jon Krohn, Chief Data Scientist and best-selling author on Deep Learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now, let’s make the complex simple. 
Jon: 00:00:42
Welcome back to the SuperDataScience podcast. I’m ever so happy that we’re joined today by the wildly intelligent and meticulously communicative Maureen Teyssier. Maureen is Chief Data Scientist at Reonomy, a very well-funded New York startup that has raised over $100 million and is transforming the world of commercial real estate with data and data science. Prior to working in industry, Maureen was an academic, working in the field of computational astrophysics. She obtained her PhD from Columbia University, and then carried out research at Rutgers University in New Jersey. In today’s episode, we covered the theory and applications of knowledge graphs, a cool and powerful data type at the heart of much of Maureen’s work at Reonomy. 
Jon: 00:01:26
We also covered the data science techniques that Reonomy uses to flow data through extremely high-volume pipelines, enabling them to efficiently apply models to their massive datasets. We also talked about what Maureen looks for in the data scientists that she hires and the tools and approaches she leverages in order to grow a highly effective data science team. We talked about the differences between data scientists and data analysts, as well as between data engineers and machine learning engineers. Finally, we talked about Maureen’s fascinating academic work in which she used gigantic supercomputers to simulate solar systems and galaxies. 
Jon: 00:02:03
Today’s episode should appeal to anyone interested in data science, no matter whether you have a lot of experience in the field or are just getting started, and no matter whether you’re hands-on or more commercially oriented. Maureen, welcome to the SuperDataScience show. I’m so excited to have you here. We’ve known each other for a long time, and I’ve been thinking about you as a perfect podcast guest for so long. The audience is going to love you. Please, tell us how you’re doing and where are you calling in from today? 
Maureen: 00:02:38
Jon, thank you so much for having me. It’s an absolute pleasure to be speaking to you again. I am currently just outside of the city, in northern New Jersey. 
Jon: 00:02:49
The city being New York City, obviously. 
Maureen: 00:02:51
Being New York City. 
Jon: 00:02:51
That would be the city. That sounds nice. Do you have some green space out there? 
Maureen: 00:02:58
We do. Yeah. 
Jon: 00:03:00
Lucky. Yeah. I can’t even remember what that was like. We’ve known each other for a long time. You were a very prominent member, I would say, of the Deep Learning Study Group that I ran for years, which, due to the pandemic, we haven’t been doing anything for, well, over a year now. There was a period, shortly before the pandemic, where I was spending a lot of time writing a book and we didn’t have many of these sessions. But we had something like 16 sessions, at times meeting more than once a month, studying the foundations of Deep Learning together, which was really fun. The whole group picked topics to study. We’d decide, “Okay, we’re going to watch this video lecture or read this chapter of a book.” 
Jon: 00:03:47
It meant that it was like going to a meetup, and those kinds of things are common, but everybody was at the same level. Or some people were a little bit ahead and could explain some things to people who were a little bit behind, and you could have this expectation that by class 10, or study group meeting 10, everybody would be on roughly the same page, and you could speak at a certain level instead of starting from the basics again. Anyway, I thought that was cool. I miss doing it. I can’t wait till we can do it again, post-pandemic. And yeah, you were there from the very beginning, I think. 
Maureen: 00:04:21
Yeah. I think that’s right. I’ve got to say it was maybe my all-time favorite meetup group. We met Wednesday evenings, and I would leave work and then go to this very modern office in midtown Manhattan and talk with a group of people that were very engaged. Everybody had looked at the materials, they all had thoughts and ideas to add. It was just a … Man, it was the best of what meetups like that could offer. I hope that we have the opportunity to do that again sometime soon. 
Jon: 00:04:57
I think so. Definitely. Post-pandemic, once we have an office again. All the offices we’re looking at will have that modern feel, as you mentioned, and be a big open space so that we can have tons of folding chairs set up and people coming into the study group again. Mentioning now, on air for the first time, is this idea I have to do live podcast recordings with a live studio audience that can ask questions, and clap, and react during the podcast. SuperDataScience listeners, something to look forward to when this pandemic is over. Which actually, in New York, I think we’re going to have an office in a few months. 
Maureen: 00:05:35
Fantastic. 
Jon: 00:05:37
Yeah, yep. Yep. Yep. That’ll be fun. Anyway, you were doing that. I think, back then, when we started the Deep Learning Study Group, you were working at Enigma Technologies, which is quite an enigmatic name. You can tell us a little bit about that. But now you’ve been working at Reonomy for coming on three years. You started as the director of data science and data engineering there and now you’re the chief data scientist. Tell us about what Reonomy does, and tell us what you do there. 
Maureen: 00:06:08
Yeah. Reonomy is a commercial real estate company. We provide property intelligence through a data layer that’s available via API, which provides high-volume information about properties, the people and the companies that are associated with those properties, as well as the history of the property. Then also through a website that allows you to do complex search queries and really explore and engage with the information that we provide. 
Jon: 00:06:44
Nice. You guys are pretty big. You’re not just another commercial real estate company, you’re a VC-backed, machine learning driven, commercial real estate company. Your most recent funding round was a series D and it’s starting to get pretty meaty funding with 60 million, six zero, in that most recent funding round. So big players in this space. In particular, you do a lot around knowledge graphs, which it sounds like you’ve been working on for a long time. It looks like even at Enigma technologies, you were involved in this kind of knowledge graph specialization. I don’t know if it extends all the way back to your PhD work? 
Maureen: 00:07:29
No. It does not. 
Jon: 00:07:32
Well, still many years of knowledge graph experience. Tell us what a knowledge graph is and how that’s useful, generally speaking, I guess, but then also, specifically at Reonomy? 
Maureen: 00:07:45
Yeah. A knowledge graph is very similar to a social graph. If you’ve looked at posts or any kind of media about, for example, Facebook’s social graph, and the connections between the people in this web structure, the difference between a normal graph database, the normal graph way of structuring information, and a knowledge graph is that the nodes in the knowledge graph can be of different types. For example, the nodes in the Reonomy knowledge graph: there are person nodes, there are property nodes at multiple levels, because there’s kind of a many-to-many relationship with the property structures, and then we have company nodes. 
Maureen: 00:08:32
Focusing around those three types of nodes, we’re able to capture the complex ownership structures for property within the commercial real estate space. To go into maybe a little bit more detail, let’s say I focus on a person node. The information that we have about a person includes first name, last name, maybe misspellings of your name that we have in the data; it includes contact information that we have available for you, email addresses, phone numbers, that type of information; we potentially have some demographic information linked to you, so we might know your age range. Then we also have the relationships between your node and the other nodes in the graph. If you are present in the commercial real estate space, maybe you’re working at a company that owns commercial real estate directly, or through its subsidiaries, or maybe you are the reported owner on the property. That gives you a taste of what the relationships are within the graph and what a node could look like within the graph. 
Jon: 00:09:46
Yeah, that makes perfect sense. I was actually not aware of the specific distinction of a knowledge graph relative to a regular graph. I’m going to quickly make sure that everyone knows what a node is. It’s obvious once you see what it is. In that simple explanation of a social graph, you could think about, “Okay, me and all my friends, and because Maureen and I were both in the Deep Learning Study Group, there’s a set of people that we both know.” In a graph representation, I could be a node, Maureen could be a node, and so could all the people that we know, all the friends that Maureen has and that I have. I’m connected to all my friends in the graph. The things that connect nodes are called edges. Maureen and I would have an edge connecting us in our social graph. Maureen would have an edge to each of her friends, who are also nodes, and for all the people that we both know from the Deep Learning Study Group, I’d have an edge connecting to each of those nodes and she’d have an edge connecting to them too. Not only can you have information stored in the nodes, but the connections, the edges, between them are themselves a piece of information. 
Maureen: 00:11:00
Yes. Yeah, that’s a beautiful explanation. To dive in a little bit more there, think about the way that information is constructed, right? The study group itself could also be represented as a node, or we could use that information to just create a direct connection between Jon and me. There are two different ways of representing that information within a graph, and there are ramifications of that information for the product. For example, if you wanted to understand who went to the Deep Learning Study Group, you would want to have that represented as a node in your graph. That wouldn’t be present, right, if we’re just connecting Jon and me using the information from the Deep Learning Study Group. The knowledge graph, because it has different types of nodes, adds this extra layer of flexibility and it adds power on the product side that you don’t necessarily have if you’re just trying to maintain one type of node across the graph. 
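To make that node-and-edge picture concrete, here is a minimal sketch of a heterogeneous knowledge graph in Python using the networkx library. The node identifiers, attributes, and relationships are invented for illustration; this is not Reonomy’s actual schema.

```python
# A minimal, hypothetical knowledge-graph sketch using networkx.
# Node identifiers, attributes, and relationships are invented for illustration.
import networkx as nx

g = nx.MultiDiGraph()  # directed graph that allows multiple edge types between nodes

# Nodes of different types: person, company, property
g.add_node("person:maureen", node_type="person", name="Maureen",
           emails=["maureen@example.com"])
g.add_node("company:acme_holdings", node_type="company", name="Acme Holdings LLC")
g.add_node("property:123_main_st", node_type="property",
           address="123 Main St", asset_type="retail")

# Edges carry the relationship type as data
g.add_edge("person:maureen", "company:acme_holdings", relation="officer_of")
g.add_edge("company:acme_holdings", "property:123_main_st", relation="reported_owner")

# A simple traversal: which properties does a person reach through companies?
for _, company, d in g.out_edges("person:maureen", data=True):
    if d["relation"] == "officer_of":
        for _, prop, d2 in g.out_edges(company, data=True):
            if d2["relation"] == "reported_owner":
                print(g.nodes[prop]["address"])  # -> 123 Main St
```

The traversal at the end is the kind of question a heterogeneous graph makes easy to ask: starting from a person node, follow company edges out to the properties they ultimately control.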
Jon: 00:12:10
That’s super cool. Yeah. I’ve never worked with knowledge graphs. They sound super interesting, super useful. How do you navigate a graph? Or, what kinds of operations can we perform on a graph to do something useful? 
Maureen: 00:12:27
Yeah, the way that Reonomy uses the graph is to create ownership portfolios for commercial real estate. The ownership structures, and I call them structures because companies in the commercial real estate space are used as financial vehicles. What I mean by a financial vehicle is that it’s essentially a container for moving money around and controlling how money is used. You will have holding companies that are formed as part of a partnership just to hold money that’s been invested from multiple enterprise companies and distributed in order to buy commercial real estate of a certain asset type. Meaning, they’re buying primarily retail, or they’re buying primarily commercial properties, or they’re buying primarily some particular property asset type. Another way that companies use the formation of other companies and create complex structures is actually to hide ownership. One famous example of this is the purchase of the land for Walt Disney World. 
Jon: 00:13:44
Oh, tell us about it. 
Maureen: 00:13:51
Back in the 1960s, Walt Disney purchased 30,000 acres of land, contiguous land, meaning that the individual parcels that made up the 30,000 acres are all connected together. He had several reasons for hiding the fact that he was purchasing the land. There was- 
Jon: 00:14:14
Walt Disney. 
Maureen: 00:14:15
Yeah, Walt Disney. Disneyland down in Southern California was older, and it was something that had done very well. Because of that, he wanted to expand and build another park and have it be even larger and have it be accessible to the rest of the country. They were looking for locations in the south where there wouldn’t be frost on the rails and so it would be easier to maintain the park. There was speculation over where it would be built. The Walt Disney Corporation was concerned that when they started buying land, other companies would buy adjacent land. 
Maureen: 00:15:00
Either to develop the land and build hotels in the middle of, and disrupting, the Walt Disney World, or just in order to buy the land and then negotiate with the Disney Corporation and [crosstalk 00:15:13] back to them at a higher price. What they did was they formed small shell corporations to hide the fact that it was Disney that was purchasing all of this land. There were wild rumors about a nuclear power plant and all kinds of things that were going on at the time. But they ended up purchasing the land and they did it successfully, because they were hiding who was doing it and what they were doing it for. 
Jon: 00:15:44
Well, that story was great. I thought it was going to be some crime story, though. I thought [crosstalk 00:15:51] to have Walt Disney hiding bodies or something. But, all right. 
Maureen: 00:15:57
No, no. Everything that is done is legal. But there is a great deal of … Corporations have a lot of power in the United States and have a lot of flexibility for how they organize. They make sure that they are doing things to their advantage, to their competitive advantage from a financial standpoint and then also in situations like I described with Disney World, yeah. 
Jon: 00:16:30
Nice. Well, very cool. Basically, a knowledge graph like Reonomy’s allows people to speculate or maybe even know. If somebody had Reonomy in the 1960s, they might have been able to figure out what Walt Disney was doing? 
Maureen: 00:16:49
Yeah, yeah. They would have. We’re very unique in that … For most properties across the United States, the true or highest level of ownership, the ultimate parent company, is not known. There are over 50 million parcels of land across the US that fall within the commercial real estate category, and ownership is largely unknown. That information has been inherited from other people who worked at your company there. The CRE space still has a lot of offices with file drawers and stacks and stacks of paper books. We’re really revolutionizing the industry by making this data available. 
Jon: 00:17:40
Super cool. You may already have heard of DataScienceGO, which is the conference run in California by SuperDataScience. You may also have heard of DataScienceGO Virtual, the online conference we run several times per year. In order to help the SuperDataScience community stay connected throughout the year from wherever you happen to be on this wacky giant rock called planet Earth, we’ve now started running these virtual events every single month. You can find them at datasciencego.com/connect. They’re absolutely free, you can sign up at any time, and then once a month, we run an event where you will get to hear from a speaker and engage in a panel discussion or an industry expert Q&A session. 
Jon: 00:18:27
Critically, there are also speed networking sessions, where you can meet like-minded data scientists from around the globe. This is a great way to stay up to date with industry trends, hear the latest from amazing speakers, meet peers, exchange details, and stay in touch with the community. Once again, these events run monthly. You can sign up at datasciencego.com/connect. I’d love to connect with you there. 
Jon: 00:18:58
Are there other kinds of tools or techniques beyond knowledge graphs that have become really your specialty? Are there other kinds of tools and techniques that Reonomy makes use of? 
Maureen: 00:19:07
One technology that we use is high-volume distributed pipelines, and we use Spark to do that. Our production pipelines are in Spark Scala. We do development in PySpark as well, though. Then the reason that we need high-volume pipelines is because we use machine learning and AI in order to create the edges in our graph and also to define the nodes in our graph. If I turn up on a property record as a reported owner or mortgage signatory, and I also show up on a company record, we use AI to decide that the Maureen that shows up on the company record and the Maureen that shows up on the property record are actually the same person. 
Jon: 00:20:00
Nice. That makes a lot of sense here, because you’ve got to … Yeah, it can’t be just from the name, there have got to be other circumstantial factors that would feed your model. I guess you’d have a set of feature weights that say, “Okay, based on the name being the same, or almost the same, and these other factors, location factors.” I don’t know. I’m not going to speculate on your secret sauce and I’m not going to [inaudible 00:20:33] exactly what the features are. But basically, through that set of information, you can say, “Okay, there is a high probability that Maureen A is the same as Maureen B. So therefore, that should be just one Maureen node. That also means we’re going to have to connect her to not only this set of companies over here, but also to this other set of companies. Cool.” That makes a lot of sense. I guess the reason why you need such high-volume pipelines is because you have billions of data points, and a graph by its nature … If you have millions of nodes, the number of connections between those could be insane. So you need these kinds of high-volume tools to be able to do that processing as quickly as possible. 
Maureen: 00:21:21
Right, yes. In one record, particularly on the property records, you could have person names show up. We could have more than one reported owner. It could be siblings, it could be husband and wife, it could be business partners. You could also have person names show up in the seller fields, or in the mortgage signatory fields. You could have six or eight person names show up per record. Then we also have … We ingest company data and we ingest data that’s more person-centric, that contains contact information and other supplementary information that we use. But we have more than a billion instances of people’s names coming in and we have to deduplicate all of that data, and establish all of the connections from those deduplicated pieces of information to all of the other nodes in the graph. It’s a lot. It’s a lot of data to process. 
Jon: 00:22:28
Yeah. I was just teaching a class yesterday on Big O notation, computational complexity. The task that you just described has polynomial time complexity, because for every new node that you add in, you might need to compare that node to every other node in the graph. It’s like if you have 50 million nodes and then you add in one more, it’s not just one more piece of information, because now you need to compare that one more piece of information against the existing 50 million pieces. Yeah. Wow. That is very cool. I don’t know. Anything else? Are there other kinds of models that you need? I don’t know. I mean, I’m stretching here now. I don’t know if there would be. I mean, that sounds like already a huge amount of work that could easily consume an entire data science and engineering team, but I don’t know. Just giving you the chance. Is there anything else [crosstalk 00:23:26] 
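As a rough illustration of how that pairwise blow-up is typically kept manageable, here is a hedged PySpark sketch of "blocking" for entity resolution: candidate pairs are limited to records that share a cheap blocking key before any expensive similarity scoring. The column names, blocking key, and matching rule are invented for illustration and are not necessarily what Reonomy does.

```python
# Hypothetical PySpark sketch of blocking for entity resolution.
# Column names and the blocking/similarity rules are invented for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [("r1", "Maureen", "Teyssier", "07030"),
     ("r2", "Maureen", "Teyssier", "07030"),
     ("r3", "Marine",  "Tessier",  "07030"),
     ("r4", "John",    "Smith",    "10001")],
    ["record_id", "first_name", "last_name", "zip"],
)

# Blocking key: phonetic code of the last name plus ZIP, so we only compare
# records inside the same block instead of all ~n^2 pairs.
blocked = people.withColumn(
    "block", F.concat_ws("|", F.soundex("last_name"), F.col("zip")))

a, b = blocked.alias("a"), blocked.alias("b")
pairs = (
    a.join(b, on="block")
     .where(F.col("a.record_id") < F.col("b.record_id"))  # count each pair once
     .withColumn("name_dist",
                 F.levenshtein(F.col("a.first_name"), F.col("b.first_name"))))

# A toy decision rule; a real system would score many features with a model.
matches = pairs.where(F.col("name_dist") <= 2)
matches.select("a.record_id", "b.record_id", "name_dist").show()
```

The point of the blocking key is exactly the complexity issue raised above: instead of comparing a new record against every existing record, you only compare it against records in its block, which keeps the candidate-pair count tractable at the scale of billions of name instances.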
Maureen: 00:23:27
Yeah. We have a few models that use the fact that we have this additional context that’s created by the graph, and also the fact that we ingest property information from multiple sources. We correct that, the property information. We could get a property type, and by property type, I mean, is this a hospital? Is this a mall? Is this an airport? We could get a property type that is defined incorrectly. Categorizing the properties correctly is important, because that’s part of how people find the properties on our website, on app.reonomy.com. 
Maureen: 00:24:12
If the property is undefined or not correctly defined, it won’t come up when you say, “Tell me all of the retail strips in Philadelphia.” It won’t show up. We have a model that makes the corrections there, and we have another model that … Ultimately, the property data comes from tax assessors. When they have a bad day, the accuracy of the data goes down, and there are some counties that do much better than other counties. A unit count is the number of offices in an office building, or the number of stores in a mall. Sometimes you’ll see unit counts that say 9999, or sometimes the unit count is just twice what it should be, or sometimes they’ll write down five when it’s 35. But we can look at the other information on the property, and the model can make an estimate for what the actual unit count should be, and then we use that as a baseline to make corrections. 
Maureen: 00:25:18
We do have models that correct data. Then lastly, we have a model that is predictive. It predicts properties that are likely to sell in the next year or two years. Commercial real estate generally moves pretty slowly, much more slowly than other objects that we categorize or make predictions on. Yeah. That functionality is just supposed to weed out properties that have recently been purchased, properties that are mid-mortgage, in order to allow the people who are looking to purchase properties, or that are looking to facilitate an exchange of a property, to more easily find the right ones among the haystack that we’ve created. Right? We’ve created a haystack. We want to help our users find those needles. 
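As a rough sketch of what a correction model along those lines could look like in Spark ML, the example below trains a regressor on records believed to be clean, then replaces reported unit counts that diverge wildly from the model's estimate. The feature columns, threshold, and model choice are invented for illustration; the actual models are surely richer.

```python
# Hypothetical PySpark ML sketch of a unit-count correction model.
# Feature columns, the threshold, and the model choice are illustrative only.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame of properties with a reported unit count plus other fields.
props = spark.createDataFrame(
    [(1, 12000.0, 3, "office", 24.0), (2, 800.0, 1, "retail", 2.0),
     (3, 15000.0, 4, "office", 9999.0), (4, 900.0, 1, "retail", 3.0)],
    ["parcel_id", "building_sqft", "floors", "asset_type", "reported_units"],
)

assembler = VectorAssembler(inputCols=["building_sqft", "floors"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="reported_units",
                           predictionCol="predicted_units")

# Train only on records believed to be clean (toy filter: no sentinel values).
clean = props.where(F.col("reported_units") < 5000)
model = Pipeline(stages=[assembler, rf]).fit(clean)

# Score everything and replace counts that disagree badly with the estimate.
scored = model.transform(props)
corrected = scored.withColumn(
    "unit_count",
    F.when(F.abs(F.col("reported_units") - F.col("predicted_units"))
           > 3 * F.col("predicted_units"),
           F.round("predicted_units"))
     .otherwise(F.col("reported_units")),
)
corrected.select("parcel_id", "reported_units", "unit_count").show()
```

The design idea is the baseline-and-correct pattern described above: the model's estimate is not trusted blindly, it is only used to overwrite values that are implausible given everything else known about the property.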
Jon: 00:26:21
Very cool. Those are some cool models. I’m glad I asked. What do you look for in people you hire? If somebody wanted to work with you, if they said to themselves, “Wow, this sounds awesome. Working on knowledge graphs, in a VC-backed company that is changing an industry, that is changing commercial real estate.” What would somebody have to do to impress you in an interview? 
Maureen: 00:26:49
We look for people that are keen to learn and keen to grow. We look for people that have … I do an interview where I create a scenario, and I’m looking for people who are reflective. They’re thinking about, “What is the data coming in? How do I evaluate the data that I’ve been given? Before I leap to what features am I going to create, what models would be appropriate? Given the problem that I’ve been given, what do I need to require from the training data?” Et cetera. We look for things like that. We look for people who can … 
Maureen: 00:27:38
The data is complex, it’s messy, so we look for people who have that Sherlock Holmes instinct, where they’re asking the simple questions and the complicated questions and they’re sleuthing. Then lastly, because the AI and ML is so deeply embedded within the pipelines, there has to be a very tight collaboration between the data scientists and the other parts of the engineering org, and specifically between the data scientists and the data engineers. So I look for people whose code is well organized and structured, and who can communicate what they’re doing and why they’re doing it to people that might not be as familiar with machine learning and AI. 
Jon: 00:28:22
Perfect. Two of those things, keen to learn, knowing how to learn, and communication, although I also do like the twist of writing clean code. That is something that I hadn’t thought of in that context before, but that is another form of communication. When we think about communication, I’m thinking about verbal communication, written communication, and I guess code is writing, but it requires a different kind of thinking. Not only the comments, and well-commented code is obviously very helpful, but even just literally the way you write the code itself can make a huge difference. A really great engineer, or data scientist, writes code in a way that you can look at it and maybe it doesn’t even need comments, because it’s so obvious why you broke things up into these functions here and there, and the functions are well named. Cool. 
Maureen: 00:29:16
Yeah, absolutely. 
Jon: 00:29:18
But those two things, knowing how to learn, being keen to learn, and communication, that’s what almost everyone says. I ask this question on most episodes. When people are senior data scientists like you who are doing hiring, I ask them that. It’s those two things that always show up. So much so that, actually, I’ve done a standalone episode on it. Episode 466 is on the difference between a good versus a great data scientist. I highlighted in that episode, which aired in late April, that, “Okay, these are the two things, the two most core things.” But I love the extra twist of clean, understandable code, and then also you have that nice other piece there, which isn’t something that people … I don’t know if I’ve had anyone say that before, but that idea of being a detective, so important. Being able to ask easy questions, hard questions, I love that. 
Maureen: 00:30:17
One of the things that we found that’s been really helpful is … We use Databricks. It allows us to do visualizations in Python and code in PySpark, as well as code in Spark Scala. You can have all three languages present in one notebook using that platform. The reason that I say that it helps with communication is that it means that the data scientists that are working on the pipeline, and the data engineers that are working on the pipeline, can look at the same thing. They can look at the link to the same notebook. They’re literally looking at the same data. That is instrumental. From the engineering side, the text editing tool that everybody, most people, seem to gravitate towards is IntelliJ. 
Maureen: 00:31:19
IntelliJ doesn’t allow you to really dig into the data the way that a notebook does. There can be kind of … There is a missing link there. There’s a missing space, right? The screen space on IntelliJ is taken up. You’re looking at the lines of code, you’re not looking at the data that’s being produced in the different steps. When people are looking at different things, the code versus the data, they can have miscommunications. When they’re looking at the same thing, you really get some tight knit teams. You also learn a great deal because the team is so tight knit. I hired two data scientists, about two years ago, and they came in only knowing Python, and then they learned PySpark, and then they started learning Spark Scala. Then they were deploying their own algorithms in a production pipeline, and their title changed to MLE, right? This platform helped them do that. 
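To illustrate that shared view of the data, here is a generic Databricks pattern (not Reonomy's actual notebooks): cells in different languages run against the same Spark session, so a Python cell can register a temporary view that a later %scala cell in the same notebook reads directly. The table and view names below are hypothetical.

```python
# Hypothetical Databricks notebook cell (Python). The table and view names are
# invented for illustration. Cells in different languages share the same Spark
# session, so temp views cross the language boundary within a notebook.
from pyspark.sql import functions as F

# `spark` is provided automatically inside a Databricks notebook.
owners = spark.table("property_owners")  # hypothetical source table

summary = (owners
           .groupBy("asset_type")
           .agg(F.countDistinct("owner_id").alias("distinct_owners")))

summary.createOrReplaceTempView("owner_summary")
display(summary)  # Databricks' built-in tabular / chart display

# A later cell could then be written in Scala against the very same data:
# %scala
# val summary = spark.table("owner_summary")
# summary.show()
```

This is the mechanism behind the point above: the data scientist's Python cell and the data engineer's Scala cell are looking at literally the same intermediate data, not two re-implementations of it.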
Jon: 00:32:24
Super cool. I’ve heard of Databricks a million times. They are big conference sponsors, for example, but I don’t think I’ve ever had it explained to me what it is. Let me try to explain back what I understood, and you can tell me where I’m wrong. It’s like a Jupyter notebook in that, I guess, you can share with your teammates, like a URL, and everybody can go to that URL and look at the same notebook. But a Jupyter notebook … Jupyter came out of the IPython project. It’s supposed to be this interactive Python idea, I guess, where you could have code and graphics output, but then it developed into Jupyter, which is Julia, Python, and R mixed together into one word. So those are kind of the three data science languages. 
Jon: 00:33:15
But I think even then … I don’t think you can mix and match in the same notebook. But, anyway, you can use those three languages in Jupyter. Now, what you’re saying with Databricks is it has at least some of that core functionality of being able to write code, being able to see tables of output, being able to see lots of output, but you can mix and match many different languages in the one notebook. Now, does that mean … If I don’t know … Okay, so PySpark maybe I might be able to figure out because I know Python, but Spark Scala to me would probably be a completely foreign language, is it? How would I understand someone’s Spark Scala code and [inaudible 00:34:04] 
Maureen: 00:34:04
Yeah. Spark Scala is particularly interesting. Scala is the native language, and then PySpark is actually layered on top. 
Jon: 00:34:18
Spark Scala. 
Maureen: 00:34:18
Sorry, PySpark. PySpark [crosstalk 00:34:19] 
Jon: 00:34:18
On top of Python? 
Maureen: 00:34:21
On top? No. 
Jon: 00:34:22
On top of Scala. Ooh! 
Maureen: 00:34:26
The syntax is different, but it’s similar. There are different ways to define the typing, and that’s where a lot of the syntax changes come in, and some of the ways that you write out the code. They do look very similar. What Scala gives you as an added benefit is that, because Scala is native, you don’t need to use the API layer for it. You can write code that looks like little Python functions within the jobs that you’re running. Either way you slice it, Scala is still a functional programming language. There’s still that functional structure as a hurdle to go from Python to Scala, but when you look at the code, and the micro components of the code, the smallest components of the code, they are more translatable than you would necessarily assume at first glance. 
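To give a feel for how close the two can look, here is the same toy aggregation written in PySpark, with the Spark Scala equivalent shown as comments. The column names are invented for illustration.

```python
# Hypothetical PySpark snippet (invented columns), with the Spark Scala
# equivalent shown as comments to illustrate how similar the two can look.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("retail", "Kings", 1200000.0), ("office", "Kings", 4500000.0),
     ("retail", "Queens", 950000.0)],
    ["asset_type", "county", "sale_price"],
)

result = (df.where(F.col("asset_type") == "retail")
            .groupBy("county")
            .agg(F.avg("sale_price").alias("avg_price")))
result.show()

# Spark Scala equivalent, line for line:
# import org.apache.spark.sql.functions._
# val result = df.where(col("asset_type") === "retail")
#   .groupBy("county")
#   .agg(avg("sale_price").alias("avg_price"))
# result.show()
```

Apart from `===` versus `==`, `val` declarations, and how functions are imported, the DataFrame calls read almost identically, which is the translatability Maureen describes.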
Jon: 00:35:35
Nice. I’m just going to try to repeat some of that back to you again. Key distinction, which some listeners will notice and some won’t: Python is an object-oriented programming language, which actually R is as well, which a lot of listeners would be familiar with. You’re focused on the nouns of software. You’re focused on the object, the thing that actually has data, and then you can add methods to it, so that you have the methods or the verbs that you apply to these noun objects. Now, Scala, as a functional programming language, you’re focused on the verbs as the primary thing; the functions that are doing things are the focus, and objects flow through them. Yeah. I guess that’s … I hadn’t thought about it that way, but there’s kind of this nice noun, verb difference between object-oriented and functional programming languages. Yeah. 
Maureen: 00:36:38
I like that perspective. That’s a fun way to put it. Yeah. 
Jon: 00:36:41
That’s a fresh one. It’s going to be tricky. For me, I’ve never … I haven’t worked much with functional programming languages. Looking at Scala would be tricky. But what you’re saying is, because PySpark is actually layered on top of Scala, there ends up being a lot of overlap in how things work. You can flow from one to the other, people can start to piece things together. Like you’re saying, you can have someone come in. I can’t remember now the direction that they flowed in, but it was like Python, then PySpark- 
Maureen: 00:37:19
Python, then PySpark, [crosstalk 00:37:19] then Spark Scala. Yeah. Yeah. 
Jon: 00:37:22
Cool. All right. Basically, what you’re telling me is, in this Databricks notebook, I would be looking at the code in Spark Scala, but if I’m already familiar with PySpark, I can probably piece together the large part of what’s happening, maybe helped by the well-commented code, by these great commenters that you’ve hired, and then the graphs and data outputs I’d be able to understand anyway. Then, I think the key piece here is that everyone is working with the same data on the same thing. 
Jon: 00:37:57
You’re all seeing the same version of things, as opposed to … I think something that happens a lot is you’ll do your prototyping in one space. We do this in my company, and I think a lot of companies do this, where we prototype with one stack of tools, and then we engineer with another stack of tools. It ends up being that every time we’re translating some machine learning model from the data science team that’s prototyping it to the engineering team, we have people who have to do it really carefully, to make sure that everything is translated faithfully into that new stack of software. 
Maureen: 00:38:39
Right. Yes. Because of the nature of PySpark and Spark Scala, you can write code in PySpark and you can train a model. Then you can import that trained model to a Spark Scala pipeline as well. Right? Of course, you’d have to rewrite features and make sure that the input going to the models is the same, but you don’t necessarily need to rewrite all the code that’s doing the training. 
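That handoff works because Spark ML persists trained models in a language-neutral format (metadata plus Parquet), so a model trained from Python can be loaded from Scala. Here is a minimal sketch, with invented paths, columns, and model choice, of saving a PySpark-trained pipeline that a Scala job could then load.

```python
# Hypothetical sketch: train a Spark ML pipeline in PySpark, persist it, and
# load it later from Scala. Paths, columns, and the model are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame(
    [(0.9, 1.2, 1.0), (0.1, 0.3, 0.0), (0.8, 1.0, 1.0), (0.2, 0.1, 0.0)],
    ["feature_a", "feature_b", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# Spark ML writes models as metadata plus Parquet, readable from any Spark language.
model.write().overwrite().save("/tmp/example_model")

# The Scala production pipeline could then do (shown as comments):
# import org.apache.spark.ml.PipelineModel
# val model = PipelineModel.load("/tmp/example_model")
# val scored = model.transform(inputDf)  // inputDf must provide the same input columns
```

This mirrors the point above: the feature-preparation code on the Scala side still has to produce the same input columns, but the training code itself does not need to be rewritten.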
Jon: 00:39:11
Nice. 
Maureen: 00:39:11
That’s some nice functionality. Then also, Databricks lives on top of clusters. That’s one other difference from a Jupyter notebook that we haven’t touched on. You can provision a cluster of whatever size you need, lots of different types. It could be GPU as opposed to CPU, it could be whatever you need. But that allows you to play with a cluster before you are trying to run something in production. You can look at scalability before you get to the point that you’re trying to put it in and run it across everything. The data scientists can run the models across everything and they can look at the outputs and create the distributions and all that. There is kind of that extra layer of independence there. 
Jon: 00:40:05
Yeah, it is possible with Jupyter notebooks to run multiple servers and distribute the compute, but it tends to be messy. It’s something that I rarely do myself. It’s the kind of task that I delegate, because I’m like, “Oh man, that’s going to be tough.” I seem to always do something wrong. I’m like, “Why is all my memory being used up on all these machines? I thought I ended it. I thought the job was over, but it isn’t somehow.” Anyway, I can imagine how Databricks has figured out how to do that in a way that is easier. 
Maureen: 00:40:43
Yes. Yeah. They’ve made it really easy. 
Jon: 00:40:46
Cool. Well, good to know. I’ve learned a bunch on this podcast already, but Databricks, that’s … Yeah, it’s something that I hear about a lot and I’m so glad to actually understand what it is now. All right. We’ve talked about, Maureen, what you look for in people that you hire, and that led us to that brilliant Databricks discussion. My understanding is that you are actually doing hiring right now. You’re hiring data scientists, data analysts, data engineers. Tell us about those roles. I guess it could even be useful for some people to understand what the distinction is between a data scientist, a data analyst, and a data engineer. Yeah. Let us know. Tell us about those jobs and where people can apply? 
Maureen: 00:41:31
Sure. Yeah. Reonomy has a lot of open roles right now. On our careers page, you can see that there are just a ton of open roles. The data scientists, they do the work that we’ve been describing, where we’re looking for people who can build models to support product functionality. We’re continuing to work on and improve the person resolution within our pipeline. We’re looking to improve the company tree structures that we’ve created. Because we’ve created these big company tree structures, we now have the problem of organizing the people that fall under the tree in order to provide who is the best person to talk to. 
Maureen: 00:42:19
Is it a property manager? Is it somebody from a local branch? You don’t want to recommend the CEO of Pepsi, just because down at the bottom of the corporate tree, there are commercial properties that are attached to that tree. We create problems for ourselves as we successfully build things. There is always more. We’ve got a lot going on right now. That’s what the data scientists do. If you can deploy your models in our production environment, we’re happy to call you an MLE as well. 
Jon: 00:42:55
Machine learning engineer. 
Maureen: 00:42:55
Yeah. It feels [crosstalk 00:42:57] 
Jon: 00:42:57
Cool. As well, so data scientist/machine learning engineer [crosstalk 00:43:01] 
Maureen: 00:43:01
Well, I meant one or the other but- 
Jon: 00:43:07
Pick one. Come on. 
Maureen: 00:43:07
Yeah. Pick one. If you want to go deeper into the algorithms, and manipulating the model, and manipulating the information that goes into the models, you can continue to progress up the data science career track. Or if you are interested more in production deployments, and optimizing how quickly things run, the cluster configuration, as well as doing development of models, that’s a different route to go up. The data analyst position is about pulling data from different parts of our pipeline in order to evaluate opportunities. We’re building so many things, so we have to be careful about the things that we choose. We’ll come up with ideas, but it actually takes a significant amount of work, and there is nuance, in understanding what the opportunity actually is and what the lift actually would be. For the data analyst role, that is one of the major components. It’s exciting because you’re, in a lot of ways, defining the direction of the product, which I don’t … That’s going to be unique. That’s [crosstalk 00:44:25] unique. 
Jon: 00:44:25
Yeah. That’s a full data analyst role. That’s huge, being able to actually define where things go. The data analyst would still be doing a lot of that detective work, looking for unusual things or opportunities across all the data that you have. Things that could potentially be in the product, things that a data scientist could model, and then maybe so … Like a rule of thumb, a way that I distinguish the idea of a data analyst from a data scientist is, they’re really similar jobs except a data scientist would more often be building a model and validating a model. I think often people grow from the data analyst role into a data science role, because of that kind of situation you’re describing. People are seeing the data, they see all kinds of opportunities. They can put it in a table, or a plot, and say, “I think there’s a relationship here.” It lends itself automatically. You start experimenting with a little regression model and all of a sudden, you’ve got a taste and you’re building models. 
Maureen: 00:45:26
Yeah, yeah. I’m open to different career progressions, right? If we hire an analyst that wants to continue to be an analyst, that’s great. 
Jon: 00:45:35
Sure. 
Maureen: 00:45:36
If we hire someone that wants to transition over to the DS career track, that’s great too. Another way that I think about it … I like the way that you put it, but another way that I think about it is that the analysts do work that does not become a permanent part of the product. It might drive the product forward, it might [crosstalk 00:45:57] reflect on how well we’ve done for the product, but it doesn’t live as part of the product, unlike the things that the data scientists are doing. The data scientists also do data munging. They’re not always doing models, although we have a lot for the size of team that we have. That’s another way that I think about the difference there. 
Jon: 00:46:19
Yeah. I hadn’t thought of it that way. But that makes perfect sense to me and I think that that definition is common across a lot of data analyst roles. Thank you for bringing that up. Earlier we talked about machine learning engineers, and now we’ve talked about three different kinds of roles: data scientists, data analysts, and also the fact that you’re hiring data engineers. In the way that you define it, is that something different from a machine learning engineer? 
Maureen: 00:46:44
Yeah. The MLEs are focused really on the models, those components that fit within the pipeline. But there’s a whole realm of work that needs to be done in order to support the information that we’re ingesting from our different sources, the property sources, the company sources. We use shaped data, and then we have additional person information sources that I mentioned before. Then there are a great deal of transformations that need to happen to that data. There’s source-specific information that we take out, there is data standardization, and some cleaning that’s done at the top of the pipeline, rather than just before the models. There is creation of … Essentially, getting the data out of the graph, and into the API, into our Elasticsearch cluster to support the search functionality in the app. All of that falls under the realm of data engineering. It’s pretty much everything that’s not the models. 
Jon: 00:47:57
Nice. That makes perfect sense. I hadn’t thought about it that way before, but basically, yes. A machine learning engineer is concerned with deploying the model. There are still tons of other pipelines in a lot of applications, and the data engineer is concerned with those pipelines that don’t necessarily have a model integrated in them. 
Maureen: 00:48:18
Yeah. Yes. We have a really excellent data engineering team. You can find our Director of Data Engineering, James, on LinkedIn as well. He and I are connected. If you want to look us both up after the podcast, please feel free to do so. You can connect with me and mention the SuperDataScience podcast so that I know where you’re coming from. 
Jon: 00:48:43
Awesome. Yeah. We’ve talked about growing a data science team. That’s something that you’ve actually developed some real expertise in. You’ve scaled a lot of data science teams, you’ve done talks on growing data science teams. Do you have a few takeaway tips for people when they’re doing it themselves? 
Maureen: 00:49:07
Sure. I think that it’s critical to really start with a good understanding of, not just what’s needed within the company, but also what the expectations are of the other players within the company. That could be a direct manager, it could be the C-suite, your executive team. That could be other parts of the company, that could … If product falls outside of your structure, which sometimes it does and sometimes it doesn’t, but understanding the expectations there can help you identify the communication patterns that are needed. Some organizations need really frequent communication for them to have confidence in this very large investment that they’ve placed in your care. 
Maureen: 00:50:01
It’s high reward, but it’s also very high risk for any company. Being proactive about addressing the expectations, and the needs, and building and scaling your team with that in mind … I mean, not based … not focused on that, but keeping that in the periphery, I think is really helpful. The communication processes, and the technical processes with the other parts of the company, are also very important. Usually, when you’re in a smaller company, there isn’t anybody that will tell you what’s needed, right? 
Maureen: 00:50:43
Because I’ve seen so many different architectures, and I’ve seen so many different products, at this point, I have a feel for, “Oh, well, we need some monitoring systems in place, we need to get Databricks, because of the languages that we have. We need to organize cross-functional teams where these people are working together.” If I wasn’t working at Reonomy, if I was working someplace else, and we had a big data lake and the models could operate off of the data lake, instead of within these high volume pipelines that are so tightly coupled with the work that the engineers are doing, I might have a completely different team structure. 
Maureen: 00:51:27
I might say, “Okay, our team is just going to be a data science team, or multiple data science teams focused on different products, and we’re going to operate on the database. We’ll put in place some infrastructure to do that, and keep us separate from the engineering pipelines that are ingesting information and populating the database.” I think that what I’m really arguing for is talking to all of the people that you work with, before you start, before you have any preconceived ideas about how to build your team, in order to get this holistic view of all of the different types of interactions that are needed in order to have a successful product. 
Jon: 00:52:11
Cool. 
Maureen: 00:52:12
Does that address what you were talking about, or is that too high level? 
Jon: 00:52:17
No, that’s great. I think that’s perfect. Especially because there’s at least one other thing that I want to talk to you about and I don’t want to have to take the entirety of your day. For the people at home listening, I am dragging on this interview well beyond the time that we’ve scheduled today. 
Maureen: 00:52:34
We had a lot of fun catching up. I don’t regret time spent at all, in any way. 
Jon: 00:52:41
Could you? But anyway … I have one other big topic before we start wrapping things down, which is, you transitioned from an academic career into industry. You did a PhD at Columbia, in computational astrophysics, which sounds awesome, and you’re welcome to tell us a little bit about that if you’d like to. But the piece that I want to focus on is, you continued on in academia. You did a postdoc at Rutgers University in New Jersey, and then you took the Insight data science program, which I think I know a little bit about, and you can tell me where I’m wrong about this, but … There are a lot of data science bootcamps out there. Most of them, especially if it’s a full-time, in-classroom bootcamp, will charge you to participate. But the way that this Insight program worked is they turned things the other way around. They said, “Well, what if we found people that are really exceptional? Really exceptional academics, for example.” I think for a while, they only took people with PhDs. 
Maureen: 00:53:47
That’s right. 
Jon: 00:53:49
They find amazing people, and they say, “We will train you to be industry-based data scientists for free, but you have to at least talk to these potential companies that would be interested in hiring you.” Then they take their fee from the hiring company. Did I get that right? 
Maureen: 00:54:09
Yeah, that is right. It’s like a recruiter fee that comes from the company that’s hiring you. Yeah. 
Jon: 00:54:19
Cool. I guess, tell us a bit about the transition from academia to industry, like why you chose to do that. Then maybe specifically, the kind of program that Insight can offer, or other kinds of programs that people might be able to consider, if you have some insights into that. 
Maureen: 00:54:40
But I’m [inaudible 00:54:41] 
Jon: 00:54:43
Oh God, that’s a bad one. So bad. 
Maureen: 00:54:45
Yeah. I very much enjoyed being in academia. I was very lucky with some of the choices that I made as an academic. I was working in high-volume data pipelines. I was writing code and running code on supercomputers, back when they were called supercomputers and there were still very few of them within the United States, and I would have simulations that would run for a month on a supercomputer. It was very large volumes of code. 
Jon: 00:55:19
How much data can there be in the universe? Come on. 
Maureen: 00:55:27
I know, all right. To give you an example of why we needed supercomputers: I was looking at the evolution of galaxies. The easiest way to do it is to take a map of the primordial fluctuations, the microwave background that you might be familiar with from some of the media. For a while, you could buy a beach ball and instead of the color stripes on it, it had the microwave background. That was a favorite around the office. You take the primordial fluctuations of the universe and then you evolve them forward. The colder areas become denser. Then they start to form galactic structures. They form groups of galaxies. That’s the easiest way to do it, because it’s very hard to create a balanced galaxy. 
Maureen: 00:56:21
If I wanted to create a simulation of the Milky Way, the way that it looks today, it would be very hard to get it to be stable. Usually, the galaxy starts shrinking on you a little bit. It starts [crosstalk 00:56:36] because they’re so delicately balanced. You evolve from primordial fluctuations, which means that you get these giant structures, and then you have to run an additional simulation, that’s like a zoomed-in version, if you only want to look at a single galaxy. You can get resolution differences of 16 orders of magnitude if you’re talking about going from these large groups, where there are thousands of galaxies in their group, down to the almost stellar-resolution objects that create the right-shaped galaxies. 
Jon: 00:57:12
Wow. 
Maureen: 00:57:17
That was just, man, an absolute ton of fun. It taught me things that I didn’t know that I was learning. Like how to be strategic with the way that I did my research, and goal-oriented with the way that I did research, and to be very tactical with the compute resources, and trying to run things not just so that they’re robust and there are no bugs, but so that they run more quickly, because I don’t want to wait 12 hours every time I have to do an analysis on the outputs from these large simulations. There were just, kind of innately, some things that I needed to learn as part of my PhD. 
Maureen: 00:58:06
Then I got into industry, and it was like, “Well, this is great. You’ve learned a lot of the things that we need you to know.” But as for the why, I think that there’s a trend in academia of people leaving. Part of the reason that they’re leaving is because you can do fun, amazing things outside of academia with a similar skill set. Now, companies are doing a lot of really fascinating research and building things that have never been built before, right? We’ve never had ownership structures for all commercial properties across the US before. But also, the United States went through a boom in the growth of academic institutions in the ’70s and ’80s, and then we leveled off, and then we started to shrink a little bit. 
Maureen: 00:58:57
There are fewer positions available. Which means that there’s a pretty high likelihood that you … It’s not you … You’ve entered into a small-numbers statistics game, and most of the people that go in will do many, many postdocs, and the tenured positions, well, the associate faculty positions, in a lot of cases aren’t necessarily what they used to be. You can be a lecturer at a top university and still not have healthcare. It’s pretty staggering what the lifestyle looks like now and how much it’s changed from 40 years ago. Yeah, a lot of people are leaving and I was part of that. I have been part of that. 
Jon: 00:59:43
That makes a lot of sense. I saw the same thing coming out of my PhD, where it looked like an absolutely astronomical amount of work, and I wasn’t even working in astrophysics. [inaudible 00:59:57] 
Maureen: 01:00:01
[crosstalk 01:00:01] Only two in the whole podcast. You’re dropping the ball. 
Jon: 01:00:11
I completely understand that. We are looking ahead at so much uncertainty, where you’re constantly, constantly, constantly applying for grants that often only last for a couple of years. Even the uncertainty around where you’ll live, because you get a grant at one place, you get a grant in the UK for two years, and then what? Okay, you got a grant in the US, and then … Well, I got a grant in Singapore. You can end up … You have to follow where the grants come in. Through those postdoc years, you’re jumping around from place to place. You’re doing, yeah, huge amounts of work toward the ultimate goal. 
Jon: 01:00:50
That goal, you’d say for a lot of people, is to get that associate professorship on a tenure track. Those opportunities are rare. There are very few positions, and people hold them for their lifetime. Yeah, I don’t know. I completely understand. Like you say, I think it’s happening more and more. One of the really exciting things about the data science field is that it’s expanding. It’s still expanding rapidly. If you want more on that, in Episode 471 with Kirill Eremenko, we talk a ton about how data science has been a growing industry for the last few years. Some people are like, “Has it peaked?” Absolutely not. It is still growing a lot. In a lot of companies, you can be doing super interesting work like you do. You can end up having tons of resources, being able to hire lots of great people, and feel like you have the job security, the healthcare, all that stuff. Completely understand. Anyway, I guess, would you recommend people do a bootcamp like you did, or was that valuable? 
Maureen: 01:01:58
What was valuable about Insight was that … I think one of the most valuable things, which I didn’t realize at the time, is that it created an instant network for me. The people I was going through the bootcamp with are all still in the industry. A lot of them are still in the area, and we rely on each other still. We ask each other questions: “How are you dealing with this? What do you do with this? Do you think that I should go for an MLE title? Do you think that I should take this opportunity or that opportunity? How did you grow your team? What things were difficult?” [crosstalk 01:02:41] 
Jon: 01:02:40
I’m so jealous. [crosstalk 01:02:44] It’s still obvious. I wish I had that. That’s awesome. Yeah. 
Maureen: 01:02:47
Yeah. Realizing that, and building those relationships, and staying in touch with those people has been just wonderful. It’s just been wonderful. You want to treat people as your friends and not as competitors. Even though we were all looking for jobs at the same time, they are not competitors; they’ve been good friends. Yeah, yeah. 
Jon: 01:03:12
You’re not fighting over a few faculty jobs. 
Maureen: 01:03:17
Yeah. There are so many things to do. There’s no reason to have anything but friendships. Then I think the other thing that was really helpful was that … I had no idea how to interview for a technical position the way that engineers interview. [crosstalk 01:03:39] 
Jon: 01:03:38
Have you seen how long my CV is? I’m assuming, for this role. 
Maureen: 01:03:43
Yeah. Right? Yeah. Academic interviews are … You know what they’re like. “Tell me about your thesis. Tell me about your last publication for one hour, or three hours, or four hours, or whatever.” That’s the interview. That’s it. It’s not coding questions, and talent and culture questions, and all of this different type of stuff. An introduction to this different world was also very valuable. 
Jon: 01:04:17
Cool. All right. Well, we’ve covered all my big topics. I’ve just got my little questions left now. The first one is: do you have a book recommendation for us, Maureen? 
Maureen: 01:04:29
I do. I actually have to pull two books out of my bookcase. One I’ve read recently, and it’s called Other Minds. 
Jon: 01:04:39
Yeah. That’s the second recommendation that we’ve had, Deblina Bhattacharjee, who was back on episode 439 that aired at the end of January. She also recommended Other Minds and I have been dying to read it. I’m so glad that you brought it up again. Tell us about it. 
Maureen: 01:05:00
I am just in love with this book. It talks about how the octopus’s environment and anatomy have driven different structures within its brain and a different way of perceiving the world. It increases your sense of wonder, and it makes you feel like, “Well, if you’re interested in aliens, they’re here. They’re here on our planet, and we can interact with them and try to understand them. They have a …” I can’t say enough good things about this book. I absolutely love it. It’s very thorough, and it has … I’m opening the back cover now. These are all of the notes. There are just pages and pages and pages of notes, of references to pieces of research. But it’s not dry and technical; it’s a little magical. 
Jon: 01:06:05
Sounds perfect. That sounds like a really talented author. I look forward not only to reading that, but to seeing what else they’ve done, because that sounds like a really special talent. 
Maureen: 01:06:14
Absolutely. Absolutely. 
Jon: 01:06:16
Cool. I believe you have one other book recommendation for us as if that wasn’t enough. 
Maureen: 01:06:21
Yeah. The other one is called Presentation Zen. 
Jon: 01:06:25
Cool. 
Maureen: 01:06:27
It is- 
Jon: 01:06:28
[crosstalk 01:06:28] About meditation? 
Maureen: 01:06:29
Well, yeah. Well, it’s about how to create visuals that go with the stories that you’re telling, whether you’re creating a PowerPoint, or creating some kind of architectural diagram, or you have some kind of internal collaboration on your team where you’re trying to figure things out and solve problems. The reason that it’s called Presentation Zen is because it emphasizes simplicity. It presents you with the idea that clarity comes through simplicity. It just does an excellent job of that. I picked this book up when I was still doing my PhD, and it helped me hugely. 
Jon: 01:07:21
Cool. I think that’s a really great, pragmatic recommendation. I love that you have these two book recommendations, and they’re so different. One is not obviously applied to data science, though I’m sure anything can inspire ideas, but Other Minds just sounds absolutely fascinating. Then Presentation Zen is this quite pragmatic suggestion for data scientists to become better communicators, particularly in presentations. This has been such an amazing episode, Maureen. I hope that we can have you on the show again sometime. I want to quickly mention, there was something that we talked about a lot in the beginning, which was the deep learning study group that you and I met at. 
Jon: 01:08:02
If people are interested in hearing about that (I probably should have mentioned it at the time, but didn’t think of it), you can check it out. You can head to deeplearningstudygroup.org to see the separate Jupyter notebooks I have for each one of the sessions that we held. There is lots of cool information there. If you’re getting started in deep learning, you can see the kind of path that we followed as we learned about deep learning. Although I did bring that all together into my book, Deep Learning Illustrated, so that might even be a better resource. But anyway, check it out. There are really cool photos, lots of amazing people doing incredible presentations. 
Jon: 01:08:37
I wanted to mention that. It’s something you could check out. I don’t think I’ve mentioned the Deep Learning Study Group on the show before, which is crazy. Then the other thing is, I guess, something you’ve already mentioned. I was going to ask how people should stay in touch with you. You already covered that, but I want to bring it up again now, since typically we do it at the end of the show. You already mentioned that LinkedIn is the place to find you, and we will, of course, include the specific URL in the show notes. Yeah, people should add you, but when they add you, they should mention that they were listening to the SuperDataScience podcast. I often give that recommendation in the conclusion to the episode, which I record separately from this and which will happen for listeners in a minute. But yeah, that’s what I say too. I’m like, “Please add me on LinkedIn.” But if you mention that it was because you listened to the SuperDataScience show, then I’ll know that you weren’t a random recruiter or salesperson. 
Maureen: 01:09:27
Right. Yeah. Well, this has been so much fun. I’ve really enjoyed myself. You’re just an absolutely fantastic person to do these kinds of things with. 
Jon: 01:09:42
Thanks. 
Maureen: 01:09:42
I’m so appreciative of the opportunity to do it. Thank you so much. 
Jon: 01:09:46
The feeling is mutual, Maureen. All right, we’ll catch you again soon. Thank you so much for being on the show. Yeah. 
Jon: 01:09:59
Wow, Maureen is equally brilliant whether we’re talking about deep technical topics or practical commercial considerations. In this episode, we learned about knowledge graphs, which are a graph data structure that allows for different types of data at the graph’s nodes, such as company nodes, people nodes, and property nodes. We talked about tools and techniques for high-volume data pipelines and data science teams that scale well, including Spark Scala, ML Pipelines, PySpark, IntelliJ, and Databricks. 
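To make that node-typing idea a bit more concrete, here is a minimal sketch of what such a graph could look like in code. This is purely illustrative, not Reonomy’s actual system: it assumes the open-source networkx library, and every entity name and relationship below is hypothetical.

```python
# Minimal, illustrative knowledge-graph sketch (not Reonomy's implementation).
# Assumes the open-source networkx library; all entities are hypothetical.
import networkx as nx

G = nx.MultiDiGraph()  # directed multigraph: multiple labeled edges are allowed

# Nodes carry a type attribute, so people, companies, and properties coexist in one graph.
G.add_node("person:jane_doe", node_type="person")
G.add_node("company:acme_holdings_llc", node_type="company")
G.add_node("company:acme_subsidiary_llc", node_type="company")
G.add_node("property:123_main_st", node_type="property")

# Edges capture relationships such as control or ownership.
G.add_edge("person:jane_doe", "company:acme_holdings_llc", relation="controls")
G.add_edge("company:acme_holdings_llc", "company:acme_subsidiary_llc", relation="owns")
G.add_edge("company:acme_subsidiary_llc", "property:123_main_st", relation="owns")

# Walking the graph surfaces indirect ownership of a property through layered entities.
for path in nx.all_simple_paths(G, "person:jane_doe", "property:123_main_st"):
    print(" -> ".join(path))
```

Traversing from a person node to a property node through intermediate company nodes is the kind of query that reveals layered ownership structures, which is exactly the flexibility that typed nodes add over a single-entity social graph.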
Jon: 01:10:32
We talked about what Maureen looks for in the people she hires: things like being keen to learn, having clear, understandable communication skills, and being a detective who asks both easy questions and hard questions. We talked about the differences between data analysts and data scientists, as well as the differences between machine learning engineers and data engineers. As always, you can get all the show notes, including the transcript for this episode, any materials mentioned on the show, and the URL for Maureen’s LinkedIn profile, as well as my own social media profiles, at www.superdatascience.com/479. That’s www.superdatascience.com/479. 
Jon: 01:11:13
If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app. To let me know your thoughts on the episode, please do feel welcome to add me on LinkedIn or on Twitter and tag me in a post. Your feedback is invaluable for figuring out what topics we should cover next. Since this is a free podcast, if you’re looking for a free way to help me out, I’d be very grateful if you left a review of my book, Deep Learning Illustrated, on Amazon or Goodreads. 
Jon: 01:11:44
Or if you gave some videos on my YouTube channel a thumbs up, or subscribed to my free, content-rich newsletter at jonkrohn.com. To support the SuperDataScience company, which kindly funds the management, editing, and production of this podcast without any annoying third-party ads, you could create a free login to their learning platform at www.superdatascience.com, or consider buying a usually pretty darn cheap Udemy course published by Ligency, a SuperDataScience affiliate, such as my Mathematical Foundations of Machine Learning course. All right. Thanks to Ivana, Jaime, Mario, and JP on the SuperDataScience team for managing and producing another amazing episode today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 