Jon Krohn: 00:00:00
This is episode number 485 with Doug Eisenstein, the founder of Advanti.
Jon Krohn: 00:00:12
Welcome to the SuperDataScience Podcast. My name is Jon Krohn, chief data scientist and bestselling author on deep learning. Each week we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today and now let’s make the complex, simple.
Jon Krohn: 00:00:42
Welcome back to the SuperDataScience Podcast. We are oh so fortunate to have Doug Eisenstein as today’s guest. Doug is an exceptionally clear and content rich communicator. In this episode, he uses his remarkable communication skills to give us the skinny on data engineering for financial markets. 20 years ago, Doug founded the consulting firm, Advanti and they have since become a critical provider of solutions to complex data engineering problems faced by some of the world’s largest banks and asset managers including Morgan Stanley, Bank of America, Citibank and State Street.
Jon Krohn: 00:01:18
Topics covered in the episode include a breakdown of the primary financial sectors and departments, why data source integration for financial decision making is so wildly complicated and specific data engineering approaches that resolve these issues including entity resolution, knowledge graph mapping and tritemporality. Today’s episode will appeal primarily to anyone who is interested in finance or in data engineering. It doesn’t matter whether you have a strong technical background in data engineering already or not, because we do a thorough job of explaining the technical bits. All right, be ready for an awesome episode. Let’s go.
Jon Krohn: 00:02:04
Doug, welcome to the program. I am so excited to have you on. We’ve known each other for a couple of years and you’re just such an ideal podcast guest where the audience is so lucky to have you here, they don’t even know it yet. So, welcome Doug. Where in the world are you today?
Doug Eisenstein: 00:02:21
Thank you. I am very happy to be here. I am in Natick, Massachusetts, which is like 25 minutes outside of Boston. Great place to be, very close to the highway and-
Jon Krohn: 00:02:34
That’s what you want. That’s really cool. I am, if the people are watching… If the listeners or viewers, I guess, are watching the YouTube version of this, for the first time since I started hosting the SuperDataScience Show on January 1st, I’m in a different room.
Jon Krohn: 00:02:53
So, I’m traveling for the first time since the pandemic started. Things are starting to open up across the US and so it was the right time for me to visit our company office in Austin, Texas, which I’d never been to, and I’m in my hotel room right now. So, Austin, Texas, a really nice place to visit. People are really friendly here. Have you been, Doug?
Doug Eisenstein: 00:03:17
I’ve never been to Austin, Texas. I had not been to Texas at all but it is one of the places that we’d like to visit if we one day take our RV trip from Massachusetts all the way out to Arizona, which we’ve talked about many times. That will happen one day and our stop… Yeah, that will be one of our stops.
Jon Krohn: 00:03:37
Yeah, it’s very cool. Lots of great food, people extremely relaxed and an enormous tech hub right across the street. I can see it out the window right now. I’m looking at Anaconda Headquarters.
Doug Eisenstein: 00:03:51
Wow.
Jon Krohn: 00:03:51
They’re the people who make like the Python package and associated libraries.
Doug Eisenstein: 00:03:57
Yeah.
Jon Krohn: 00:03:58
So there you go. A little data science nerdy [crosstalk 00:04:01].
Doug Eisenstein: 00:04:01
[crosstalk 00:04:01] right outside your window.
Jon Krohn: 00:04:03
Exactly.
Doug Eisenstein: 00:04:03
Some inspiration.
Jon Krohn: 00:04:07
Yep. So we’ve known each other for a few years now. I remember you visited an office that I was in like probably five years ago, kind of thing. You were a friend of a friend and we really connected on things like CrossFit. You were wearing a Rogue Fitness shirt which is a big brand in the CrossFit community and yeah, and we’ve stayed in touch, so that’s really cool. Something that I didn’t really specifically know about you, but when I was researching, for this show, I discovered that your consultancy, Advanti, that you founded, you founded it right out of school. So in 2002, you transitioned from a bachelor’s degree in computer engineering straight into your own company. Tell us a bit about that. Tell us a bit about what Advanti does today, but also how you had the guts to make that jump to create your own company right out of school?
Doug Eisenstein: 00:05:10
Well, I’ll say, I’m going to rewind back for a minute because I like to joke around that the reason why I am where I am today is because I had a passion for video games growing up, like no doubt, that like screen time and interest led to like tinkering, and that tinkering led to like building PCs and stuff. And then from there getting into college. And it just kind of happened that from when I was 16, I knew that my path… Like I see with my kids. You begin to know what your gifts are. Like you see that in yourself, “Oh, I’m really good at this. I’m a natural at this,” and I saw that in myself. So as I started to go through college, at that time… That was the late ’90s, early 2000s and it was like the dot-com boom at that time. I was working at like a dot-com place. Oh, I have such great stories to tell, one day that over drinks, myself sleeping inside of a data center for basically three days because we had so much traffic and I was trying to figure out how to scale all of our web services at that time… like great stories.
Doug Eisenstein: 00:06:40
But I kind of found that I liked working on different projects, working with people, understanding their businesses and understanding like no matter if you’re focused on one industry or multiple industries, I think the opportunity that you get when you do consulting is that you have the ability to really understand how their business runs, how it functions, the different kind of systems that are put together. And you do see the similar kinds of problems, whether it be industry-specific or agnostic. And I mean, I just kind of found myself going from Suffolk into working at these dot-com startups at that time, and I said to my… Actually, I was working as a consultant with another friend of mine. I think he had decided to work full time at another company. I ended up working as a consultant for one of his clients. And then from that point I’m like, “I kind of like this. Let me just create a company.” And it was just initially me.
Doug Eisenstein: 00:07:52
Where this really started to kick off though, was interesting, was the hardest part of doing all this, the hardest part, was that in the very beginning when you don’t know anybody, you have no track record, you have no one but you at your company, your company is you. Like how do you get customers? So going from zero to one was a very difficult challenge. I remember going to customers and telling them, “Hey, I can help provide all of these different data services.” And what ended up happening was they would say, “Okay. Well, what people do you have?” And I kind of like just think, “I don’t really have any people, just me.” You needed somebody to trust you and I had a couple of people that trusted me and that really helped open up that door. And then once I had one person that trusted me, I did the best job that I could and then from that point it just became all like word-of-mouth. And really it went from there.
Jon Krohn: 00:08:59
Nice. That’s a really important part of the story to highlight. And one of the things that you mentioned there was Suffolk and people who don’t know, that’s the university that you were studying at-
Doug Eisenstein: 00:09:12
Yes, yeah.
Jon Krohn: 00:09:12
-computer engineering as… That’s also in Massachusetts, right, where you live?
Doug Eisenstein: 00:09:16
It is, downtown Mass… actually they’re really well known for law. One of the things I look back on is, “Mm, maybe I should’ve got a law degree,” and except friends of mine that are like… understand patent law and that whole like legal text base is pretty neat.
Jon Krohn: 00:09:34
Yeah, different strokes for different folks. I could not imagine that being a lawyer, that does not sound like fun to me, but obviously for somebody it is. All right so, Advanti, you started it almost 20 years ago now.
Doug Eisenstein: 00:09:49
Yep.
Jon Krohn: 00:09:49
And nice to hear a little bit of the back story on getting going with that. Has it always had a focus on finance? So making data useful in finance, like you guys do today?
Doug Eisenstein: 00:10:05
Very early on, no actually. My journey in the very beginning, started, believe it or not in just pure infrastructure. So infrastructure meaning like, at that time it would be, hardware, server administration, those kinds of things, but I always had a passion for data. And I remember distinctly, there was a buddy of mine who was a database administrator at the time, started talking about like column stores and I’m like, “Wow, this is really cool.” So, I put all of my energy into learning about that and transitioned the company probably back in 2007 to more focus on data. And from focusing on like more databases, we kind of f… I don’t want to say fell into finance, but it just was the fact that we started working with some finance companies, they were hedge funds. CTO of one of the hedge funds goes to work at another company, calls us, asks for help, we go there and that word of mouth kind of spread from that point.
Doug Eisenstein: 00:11:13
So we went from actually infrastructure into more like, what I call data infrastructure, first, performance tuning systems, and then eventually that became into well, it’s… The performance problem really wasn’t necessarily in the hardware or optimizing like the data model itself, it was more of the kind of the transformations or the analytics that are being done and how everything was really connected. So, it kind of naturally progressed into that area and then it allowed us to see, going to work with so many different investment management companies, really allowed us to be able to see the types of problems that they had in acquiring their data, integrating it, analyzing it. And yeah, that led us to where we are now with Advanti Consulting and then also with Aristos as well too.
Jon Krohn: 00:12:09
Yeah. So I guess we could kind of summarize that what Advanti does today, is this focus on data engineering in finance. So all of those kinds of different activities around having data flow efficiently, having different data sources reconcile against each other. All that data engineering is really the focus, right?
Doug Eisenstein: 00:12:30
Yeah, and that’s actually still to this day, a huge problem, I mean, I know that there’s a lot of people that still say the data preparation is a problem. It is a problem and especially within finance. I mean, in all seriousness, like, it’s still such a huge problem because there’s so many hundreds of different data sources, even internal ones and you still have to do all that integration with. That’s still a major, major, major problem. It’s got simpler with the great technologies that are out there, open-source technologies. One thing that we try to focus on are using open source technologies as part of the solution that we provide to customers. But yeah, the tooling has been fantastic. Like, there’s so much, so many… It just seems like every day, every week we’ll say, “I talked to someone and there’s something new,” that I’m like, I got to check this out on GitHub. I star it, fork it and then kind of like start learning about it. There’s always something new that exists and figuring out ways to… Where does this technology fit in?
Doug Eisenstein: 00:13:37
Well, one thing for the audience to also take into consideration as part of like services in consulting is that, you have to focus on the problem and really understanding what the problem is because it’s… you have to understand what the array of modern tooling is and solutions but more so, what exactly is the problem? What are the pain points? What are the needs? What are the requirements, without even mentioning a single technology and once you can do that and you understand that really well, and you can articulate that, then you can start figuring out, “Okay, what tools do I use and how do I apply them?”
Jon Krohn: 00:14:17
Cool. Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criterion is the Data Science Insider. If you weren’t aware of it already, the Data Science Insider is a 100% free newsletter that the SuperDataScience team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning and artificial intelligence. The top five items are hand-picked, the items that we’re confident will be most relevant to your personal and professional growth. Each of the five articles is summarized into a standardized easy-to-read format and then packed gently into a single email. This means that you don’t have to go and read the whole article, you can read our summary and be up to speed on the latest and greatest data innovations in no time at all.
Jon Krohn: 00:15:28
That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do: I skim the Data Science Insider newsletter every week. For those items that are relevant to me, I read the summary in full and if that signals to me that I should be digging into the full original piece, for example to pore over figures, equations, code, or experimental methodology, I click through and dig deep. So, if you’d like to get the best signal-to-noise ratio out there in data science, machine learning and AI news, subscribe to the Data Science Insider, which is completely free, no strings attached, at www.superdatascience.com/dsi. That’s www.superdatascience.com/D-S-I. And now, let’s return to our amazing episode.
Jon Krohn: 00:16:18
Great summary points on how to consult effectively. Yeah it is super cool how there’s this growing and growing and growing open source community that is even now supported by the big tech companies, they have so many internal open source initiatives. It’s a wonderful time to be in this industry-
Doug Eisenstein: 00:16:40
Yeah.
Jon Krohn: 00:16:40
-to be in data science or anything related to data or software engineering. So you mentioned briefly there, a couple of thoughts ago, you mentioned Aristos, which is another company that you founded-
Doug Eisenstein: 00:16:52
Yeah.
Jon Krohn: 00:16:52
-and that was in 2020-
Doug Eisenstein: 00:16:54
Yeah.
Jon Krohn: 00:16:56
-very recently. So, just last year-
Doug Eisenstein: 00:16:58
Yep.
Jon Krohn: 00:16:58
-and Aristos makes products. So you have Advanti, which does consulting and I love this, it is my number one recommended way to grow a business is to start in consulting, make some revenue doing that, figure out where people’s pain points are, where the opportunities are and then through working with clients, eventually, you kind of have these ideas maybe in the shower, whatever you like, “Huh? Clients X, Y and Z, all have this same problem, I could make a product and sell it to clients X, Y and Z and then who knows how many other clients could be out there?” So creating these software products, scales a lot better than obviously consulting possibly can.
Doug Eisenstein: 00:17:50
Yep.
Jon Krohn: 00:17:51
So, that’s what you’ve done?
Doug Eisenstein: 00:17:53
Well you hit the nail on the head with all that. That is exact… You hear these stories about companies’ products that are started from consulting and I echo your comments. I mean, if you want to build a product, you really need to understand these pain points that customers within a given segment are experiencing. What I have found in building a company is that you could focus on the market of thousands, tens of thousands of companies, but you really narrow it down to one little segment, a couple of pain points. And if you don’t understand those problems deeply enough, you need the systems, you need the people, you need the network. You just need to have some way of starting, then consulting in one of these areas and starting to build an understanding of what the problem is. You work at one company, a second company, you piece everything together and actually funny enough, yes, it’s either in a shower or it’s walking or sometimes it’s even like in the middle of a workout where you get this like… Not always though, sometimes you’re just like, “I want to get through to the very end,” but you just get these eureka moments and then you jot them down and like, “Wow, like that’s what I got to do.”
Doug Eisenstein: 00:19:35
And then the fun starts because then you can start creating, you can start sharing and the other good thing about doing the consulting part in the very beginning is, as you start to build your product, like customers don’t expect, if you’re providing them a solution. Like the way I think about it is that there’s a problem and there’s a solution. And sometimes that solution, it doesn’t have to be 100% product, it could be product and a service, but as long as you’re providing some kind of solution. So if you can emphasize what the solution is, you go to a customer and you provide them with a framework and even if the framework provides out-of-the-box 20%, and instead of the entire 100% or 80%, it gets them further and if you can provide it for, you can work out some commercials that are good for the customer. Sometimes at the beginning, you might even just want to include it at no cost but then walk away with the intellectual property of adding to it. That’s a great way to be able to get started on building something incrementally.
Doug Eisenstein: 00:20:39
At some point, why do you build a product? You build a product because you want to be taking that and you want to be able to scale, just like you said, you want to go from a consulting business, which is classically, not necessarily easily scalable because it’s people bound. And it’s you have to monitor the quality of what is produced and people producing different quality solutions to something that’s more process-oriented to something that is an actual product.
Jon Krohn: 00:21:10
Yeah, that’s really well stated as to why consulting doesn’t scale as easily is that you have differences in quality with each person and it needs to be monitored. Cool. So you have now a number of products at Aristos that grew out of this kind of approach of realizing where pain points were from consulting. And so I think there’s four right?
Doug Eisenstein: 00:21:35
That’s right.
Jon Krohn: 00:21:35
So you have Dominus, that’s an entity recognition engine, tell us about that and the other three as well.
Doug Eisenstein: 00:21:40
Yeah and so we naturally we found these gaps basically. One product that we’ve created is called [Finflow 00:21:48] and Finflow, if you ever heard of a product called Stitch or even Fivetran, it’s very similar to those products except it’s specifically around finance and one of the problems within finance is that you have hundreds of different data sources and oftentimes these organizations are getting the same data building up the same process to pipe through all the data and get it into usable form. The purpose of these, of what we call these adapters and Finflow was to just start flowing data from point A to point B and point B could be any kind of modern data warehouse. So that’s the scope of Finflow, turnkey data adapters.
Doug Eisenstein: 00:22:31
Then comes, “All right, cool. You’ve got everything sitting inside of one place. Now, you need to start linking everything together.” And this becomes probably one of the… like the heart and soul of what we’ve created. It’s one of the biggest problems that I think is in finance but also across other industries, and it’s really managing what I see as entity data. Entity data meaning… Within finance, or specifically within the investment space, if you’re looking at equities and you want to be buying and selling a stock, well, you have to be modeling this correctly, and there are different kinds of identifiers and data from different sources. You get data about the price of a stock, you get data about the fundamentals, or in other words the cost of goods sold or the total revenue, which comes at the company level. These two, in just that simple example, are two different entity types that have metrics that are very important for making an investment decision, but come from different sources with different identifiers, and those identifiers change over time due to corporate actions.
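The point about identifiers changing over time can be made concrete with a small sketch. The Python below is only an illustration, not how Dominus works: the `IdentifierSpan` class, the `resolve` function, the internal keys such as `ENT-1`, and the ticker history are all made up. It shows the general idea of keeping a dated history for each identifier, so that a lookup is always made "as of" a date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IdentifierSpan:
    """One identifier that pointed at an internal entity for a date range."""
    identifier: str            # e.g. a ticker from a price feed (made up here)
    entity_id: str             # hypothetical stable internal key
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means "still in effect"


def resolve(spans, identifier, as_of):
    """Return the internal entity an identifier referred to on a given date."""
    for span in spans:
        if (span.identifier == identifier
                and span.valid_from <= as_of
                and (span.valid_to is None or as_of < span.valid_to)):
            return span.entity_id
    return None


# Made-up history: entity ENT-1 changes ticker, and the old ticker is
# later reused by a different entity, ENT-2.
spans = [
    IdentifierSpan("ABC", "ENT-1", date(2010, 1, 1), date(2018, 6, 1)),
    IdentifierSpan("ABCD", "ENT-1", date(2018, 6, 1), None),
    IdentifierSpan("ABC", "ENT-2", date(2019, 1, 1), None),
]

print(resolve(spans, "ABC", date(2015, 3, 1)))   # ENT-1
print(resolve(spans, "ABC", date(2020, 3, 1)))   # ENT-2
print(resolve(spans, "ABCD", date(2020, 3, 1)))  # ENT-1
```

The same ticker resolves to different entities depending on the date, which is why joining price data to company-level fundamentals on a raw identifier alone can silently mix up companies.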
Doug Eisenstein: 00:23:55
So connecting all this together is like absolutely crucial to making a good investment decision. And Dominus basically does the matching, the mastering, so combine the different attributes together in handling survivorship and then the traversing. So you can say something like where you can… where you can basically go through a network graph or you go through a knowledge graph and then get back the data that you need. So S&P 500 has members and then those members can have a company and therefore that company maybe it sells vacuum cleaners and you have rating information on vacuum cleaner. So you want to get all of that data together inside of a time series so you can perform analysis on it. The third is back playing basically it’s a modern way of handling your data infrastructure. So if you wanted to take your data adapter and run that on a Lambda or a Kubernetes, that handles basically decoupling your data infrastructure and the last one is called Fabric and Fabric is all about like…
Doug Eisenstein: 00:25:10
In the financial industry since there’s so much complexity in accessing the data, there are companies like Bloomberg that… which is like a financial data provider, probably one of the most well-known. They have a really cute language called BQL, Bloomberg Query Language and it is so powerful, and it does so much. So, we’ve built something that is similar to that, where you can create your own DSL, your own domain-specific language and you can basically define and say, “Hey, this is how I want to get the data out from my different warehouses,” and it pushes the aggregations, the analytics down into the warehouse and it allows someone who is like a portfolio manager or a trader not to have to know as much about how to get the data out. So those are the four different products that we’ve created.
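The traversal Doug describes for Dominus, going from an index to its members, to the issuing company, to a product with ratings, can be pictured with a toy graph. This is a minimal sketch under stated assumptions: the node names, relation names (`has_member`, `issued_by`, `sells`), and rating values are invented for illustration, and a plain dictionary stands in for whatever graph store a real system would use.

```python
from collections import defaultdict

# Edges stored as (source node, relation name) -> list of target nodes.
graph = defaultdict(list)


def add_edge(source, relation, target):
    graph[(source, relation)].append(target)


def traverse(start, relations):
    """Follow a path of relation names from a start node; return reached nodes."""
    frontier = [start]
    for relation in relations:
        next_frontier = []
        for node in frontier:
            # .get avoids creating empty entries in the defaultdict
            next_frontier.extend(graph.get((node, relation), []))
        frontier = next_frontier
    return frontier


# Made-up data: an index with two members, one of which sells vacuum cleaners.
add_edge("index:SP500", "has_member", "stock:VAC")
add_edge("index:SP500", "has_member", "stock:OTH")
add_edge("stock:VAC", "issued_by", "company:VacuumCo")
add_edge("stock:OTH", "issued_by", "company:OtherCo")
add_edge("company:VacuumCo", "sells", "product:vacuum")

# Hypothetical ratings time series keyed by product node.
ratings = {"product:vacuum": [("2021-01-01", 4.2), ("2021-02-01", 4.4)]}

products = traverse("index:SP500", ["has_member", "issued_by", "sells"])
series = {product: ratings.get(product, []) for product in products}
print(series)
```

Starting from the index and following three relations yields the rated product, and its ratings come back as a time series ready for analysis alongside price or fundamentals data.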
Jon Krohn: 00:26:12
Cool, thank you for that run down Doug and it does provide me and the audience with a great sense of where some of your expertise lies. So speaking kind of generally about finance and financial data, it might be helpful to break down the different kinds of financial industries.
Doug Eisenstein: 00:26:35
Yes.
Jon Krohn: 00:26:35
So, I know there’s capital markets, asset management, I’m sure they all have different kinds of data needs and so if you wouldn’t mind breaking down [crosstalk 00:26:44].
Doug Eisenstein: 00:26:44
Yeah, totally. The way that I think about this, it took me a while to kind of think of it this way, but it makes… Well, I’ll explain it. So, I think of the financial industry if… Financial industry has been broken down into like, basically like banks, capital markets, credit. And then if you go further in then within let’s say capital markets, then you’ll have investment management. And within investment management, what is investment management? Investment management are basically companies, I think there’s about 16,000 worldwide, that will manage a large amount of money. Millions, billions, usually billions of dollars on behalf of another institution. So that could be like the State of California has a retirement fund and they’re going to allocate a portion of their fund to a hedge fund or to some kind of money manager and that money manager will manage that money on behalf of them. So within an investment management company, there are what they call the front office, the middle office and then the back office.
Jon Krohn: 00:27:53
Right.
Doug Eisenstein: 00:27:54
And the front office is where, if you start to peel that onion back more, you have portfolio management and portfolio construction, you have research, you have trading. And that’s where a lot of the investment decision making is done and where I see a lot of emphasis has been put over the last five, seven years around kind of managing and improving all that data pipelining in those areas. That’s kind of how I would see it. The looking at the finan… Taking [inaudible 00:28:36]… No. Thousand foot overview of the financial industry and like kind of zooming in into this area.
Jon Krohn: 00:28:43
Nice. That was a very clear explanation. So your expertise seems to be particularly helpful to this front office side to the investment managers, the traders themselves. So these are people who, while they’re quite intelligent, while they might work long hours and while they have a great understanding of the financial industry, they may not be able to write any code. So they probably spend a lot of time in spreadsheets and there’s this opportunity to present them with huge amounts of data that allow them to make better investment decisions. You already mentioned [inaudible 00:29:30]. A given single company will probably have hundreds of inbound data sources. And you can’t expect the investment manager to be creating Python scripts, or SQL queries to grab data from all these different sources and then merge them together and be able to see them in one place. So that’s kind of where your sweet spot is, right? Building tools that allow these investment managers to quickly and easily make sense of huge amounts of information across many different sources.
Doug Eisenstein: 00:30:09
Yes, exactly. And I can just elaborate on the kinds. So in the front office, which happens to be the area that we do the most work in, the first component is that there’s a research function within these organizations. They need to take data from hundreds of different data sources, connect it together between that department and usually some kind of data engineering department, and then put it into this form, this time series form, where they can start to make predictions on that. Within the investment space, by the way, I didn’t make this distinction before, but there are systematic investors and non-systematic investors. So a systematic investor would be what a lot of people hear about as being quant. In between would be called the quantamentals and then on the other side would be fundamental.
Doug Eisenstein: 00:31:16
So I’ve seen much more budgeting, basically being allocated to, even these organizations that have traditionally been fundamental, kind of trying to move into more of a systematic based process. But some of the problems that they have is that you just want to be able to start, you want to get the data together. That’s like problem number one and I think a lot of these organizations, one of the pain points for them has been many of them have been around for decades and what happens is that you sort of start small, right? You start small with a small team, they grow their assets. What’s called their assets under management. So the money that they’re managing on behalf of their clients, that grows and grows, their team grows and grows, but then the technology that they use, at the core of it might be the same kind of technology that they had been using, five, 10 years ago. And without looking at creating a really solid foundation that can sometimes be a big lift and where we can come in to play is when a company has sort of matured and needs to take another look at their foundation and how they could sometimes modernize that or make the onboarding of new data sets a little bit faster or more accessible, those kinds of things.
Doug Eisenstein: 00:33:05
Having a second set of eyes that has seen other companies go through this transition and sort of a fast way from going, moving from what you currently have into something that can take you a little bit further forward in the future. I mean, it’s challenging for these companies to go and to migrate. So you need sometimes a migration path and you need to sort of think through with someone that understands your existing process, your unique investment process, how you can make that shift. And I find that that’s where we find ourselves, at least from a services perspective, that’s where we find ourselves often.
Jon Krohn: 00:33:48
Nicely described, completely understood and so I imagine that one of the particular issues that happens as a company moves from kind of systemic fun… Sorry, the non-systematic fundamental based trading where you’re trading based on reading news articles on, “Oh, electric cars are going to be big, so we should invest in Tesla.” So that kind of fundamental step, moving from that through to quantamental and then be totally systematic, quantitative, automated, data driven type of company. Probably one of the biggest issues that companies could encounter as they make that transition would be around entity extraction and resolution. So tell us a bit about that particular problem.
Doug Eisenstein: 00:34:43
Yeah, I mean that, so I think one of the problems is when you’re getting hundreds of different data sources, I mean, maybe even just thinking about this as like a… I wanted to be a little bit more concrete without going… Or let me be a little bit more concrete. So, when you’re making an investment decision, sometimes what you need is… Okay, let me just take a step back. So, there are different kinds of asset classes, right? So an investment manager sometimes is bound to a particular benchmark. So you could be bound to let’s say the S&P 500 and your goal is you have to beat the S&P 500. S&P 500 is composed of stocks, right? There are all different kinds of investment managers and all different kinds of financial instruments that can be traded. Financial instruments, asset classes. They largely mean the same thing, at least how I’m speaking about it.
Doug Eisenstein: 00:35:50
And some of the problems that happen are that, one of the hot areas now, past few years has been multi-asset class, right? So, multi-asset class means that you could be investing into stocks, into bonds, into futures, into swaps, all different kinds of unique instruments. So with that, you might have data that comes from the SEC. The SEC is a regulatory body in the United States, sec.gov, where companies that are public need to file what’s called a 10-K or 10-Q. So these are statements that are basically like the balance sheet, income statement and you need to get those figures out of those filings. Some companies do that directly, some companies buy this data from a market data provider. So you need to get that out. Now think about this, right? Like that is coming at the company level, right? That comes at the company.
Doug Eisenstein: 00:36:55
Companies change over time. There are mergers. There are acquisitions. There are spin-offs. So keep that in the back of your mind. The second is, you might be getting estimates data. Data that comes from brokers or other companies like [Estimize 00:37:13] that have more crowd funded or crowdsourced estimates. Then you have prices for the stocks. Then you have more alternative data sets now, of which there’s a proliferation, so many great data sources. You have email receipts, you can go to Glassdoor, you can get data from Thinknum, from Quandl, like there’s just hundreds of different data sources.
Doug Eisenstein: 00:37:43
So, the problem going back to the entities is that, when you start to look at the data that you’re being provided, you’re being provided with different kinds of identifiers, different types of characteristics and there’s really different entities that are baked into that data. So you might have an index. An index would be the S&P 500, but you might also have an ETF which could be like the [SPDR 00:38:13] which would be the S&P 500 ETF. Then you can also have a futures contract. So you have all these different kinds of entities, they’re providing you unique metrics, they need to be connected and what we found, and for the audience that’s listening to this that knows anybody in finance, knows anybody in the investment space, you should ask your buddies: is this problem of matching up identifiers and connecting all these different ideas together over time a problem? We’ve seen this as a huge problem and that’s one of the areas that we find is the most challenging for customers to really get right.
Doug Eisenstein: 00:38:59
So, we spend a lot of time working on that and I’ve actually created a tool that has helped us with that. Inside that tool we’re doing entity resolution which has been… Entity resolution, what is entity resolution? You bring together, you’re basically taking all these different datasets. You’re extracting entities. The way that we think about it is that there’s an entity. An entity has an identity, something that uniquely describes it, but it’s so critical that you capture the descriptive characteristics, these unique characteristics; name, country, currency, and it might be CUSIP, SEDOL, ISIN, which are different identifiers. And you have to capture that at a given point in time. Then once you can do that you want to link them together. And that’s where, as we think about it, there’s a resolver kind of function. That’s at least the way that we’ve put together our component, where you can instruct the system with a plug-in and say, “Hey, this is how you want to connect everything together.”
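To make the resolver idea concrete, here is a minimal Python sketch. It is not Advanti's tool: the `EntityRecord` fields follow the characteristics Doug lists (name, country, currency, CUSIP, SEDOL, ISIN, captured at a point in time), while the function names, the greedy clustering approach, and the placeholder identifier values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass(frozen=True)
class EntityRecord:
    """One source's description of an entity at a given point in time."""
    source: str
    as_of: date
    name: str
    country: str
    currency: str
    cusip: Optional[str] = None
    sedol: Optional[str] = None
    isin: Optional[str] = None


# A resolver is a pluggable rule: do these two records describe the same entity?
Resolver = Callable[[EntityRecord, EntityRecord], bool]


def match_on_shared_identifier(a: EntityRecord, b: EntityRecord) -> bool:
    """Hypothetical plug-in: link records that share any non-missing identifier."""
    pairs = [(a.cusip, b.cusip), (a.sedol, b.sedol), (a.isin, b.isin)]
    return any(x is not None and x == y for x, y in pairs)


def resolve(records, resolver: Resolver):
    """Greedily cluster records; a record joins (and merges) every cluster it matches."""
    clusters = []
    for rec in records:
        matches = [c for c in clusters if any(resolver(rec, other) for other in c)]
        merged = [rec]
        for c in matches:
            merged.extend(c)
        # Drop the clusters that were merged (compare by identity, not equality).
        clusters = [c for c in clusters if not any(c is m for m in matches)]
        clusters.append(merged)
    return clusters


# Placeholder data: the names and identifier values below are made up.
day = date(2021, 6, 18)
records = [
    EntityRecord("vendor_a", day, "Example Corp", "US", "USD", cusip="CUSIP-001", isin="ISIN-001"),
    EntityRecord("vendor_b", day, "EXAMPLE CORPORATION", "US", "USD", sedol="SEDOL-001", isin="ISIN-001"),
    EntityRecord("vendor_c", day, "Example Corp.", "US", "USD", sedol="SEDOL-001"),
    EntityRecord("vendor_a", day, "Other Inc", "US", "USD", cusip="CUSIP-002"),
]
clusters = resolve(records, match_on_shared_identifier)
```

Note that the third record shares no identifier with the first, yet they end up in the same cluster because the second record bridges them. Swapping in a different resolver function changes the linking rule without touching the clustering code, which is the point of making it a plug-in.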
Doug Eisenstein: 00:40:11
The whole purpose of entity extraction is really to do, to increase the coverage of your data, like making sure that you have the highest coverage possible across your entire what they call investible universe. So your investible universe could be 500 stocks, it could be 40,000 if we’re talking about equities, it could be options, thousands of different financial instruments. So you want to cover. Cover meaning if you have data from source A, B and C, you want to try to cover as much as you can across every one of those sources and sometimes it comes back to the linking of identifiers and characteristics together.
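Coverage in this sense can be measured very simply once identifiers have been resolved to a common key. The sketch below is an assumption about how one might compute it, not a description of any particular product; the security keys and source names are placeholders.

```python
def coverage(universe, source_ids):
    """Fraction of the investible universe that each source covers."""
    return {src: len(universe & ids) / len(universe) for src, ids in source_ids.items()}


def gaps(universe, source_ids):
    """For each security missing somewhere, list the sources that lack it."""
    missing = {}
    for sec in sorted(universe):
        lacking = sorted(src for src, ids in source_ids.items() if sec not in ids)
        if lacking:
            missing[sec] = lacking
    return missing


# Placeholder universe and sources (already mapped to one resolved key per security).
universe = {"SEC1", "SEC2", "SEC3", "SEC4"}
source_ids = {
    "A": {"SEC1", "SEC2", "SEC3", "SEC4"},
    "B": {"SEC1", "SEC2", "SEC3"},
    "C": {"SEC1", "SEC4", "SEC9"},  # SEC9 is outside the universe and is ignored
}
cov = coverage(universe, source_ids)
missing = gaps(universe, source_ids)
```

The `gaps` report is usually the more actionable of the two: a missing security often turns out to be present in the source under an identifier that has not been linked yet.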
Jon Krohn: 00:41:05
Nice. So I think, I understand all of that clearly. So this idea of entity extraction, generally means, pulling specific nouns-
Doug Eisenstein: 00:41:18
Yeah.
Jon Krohn: 00:41:18
-from natural language documents. So things like SEC filings. So you have this big unstructured blob of text and you scan over it. You identify words that are likely to be important investment entities, nouns in the investment world and then once you’ve extracted these entities, you need to link them together and so the example that you gave was, so the S&P 500, this index of 500, I guess, big stocks in the US and then there’s the associated exchange-traded fund, which allows somebody to very easily trade a single equity product that mimics that entire basket of 500 stocks and by the way I think that’s a great investment strategy and one that I use myself and so those ETFs they’ll have a separate name. So like you said, a SPDR for the S&P 500 is a stock that is abbreviated to SPY and so we need to extract things like S&P 500, SPY and treat those as separate entities-
Doug Eisenstein: 00:42:36
Yep.
Jon Krohn: 00:42:36
-one of them being, the basket, the other one being this single equity that we can use to trade that basket and you want to link them together and I guess when you link them together, you could use a structure, like the knowledge graph that you mentioned-
Doug Eisenstein: 00:42:51
Yes.
Jon Krohn: 00:42:51
-earlier in the episode. Cool. And if listeners want to learn a lot about knowledge graphs, you can check out a recent episode, episode 479 with Maureen Teyssier, which was focused on exactly that. Anyway, so yeah, very cool. It sounds like you might have something to say and I just keep speaking over you.
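The linking Jon describes can be sketched as a handful of (subject, relation, object) triples. This is only a toy in-memory structure; the class, the relation names, and the choice of triples are assumptions for illustration, and a production system would more likely sit on a graph database.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) triples."""

    def __init__(self):
        self._outgoing = defaultdict(list)
        self._incoming = defaultdict(list)

    def add(self, subject, relation, obj):
        self._outgoing[subject].append((relation, obj))
        self._incoming[obj].append((relation, subject))

    def outgoing(self, node):
        """Edges that start at this node, as (relation, target) pairs."""
        return list(self._outgoing.get(node, []))

    def incoming(self, node):
        """Edges that point at this node, as (relation, source) pairs."""
        return list(self._incoming.get(node, []))


graph = KnowledgeGraph()
graph.add("SPY", "tracks", "S&P 500")                      # the ETF and the index stay separate entities
graph.add("S&P 500 futures contract", "has_underlying", "S&P 500")
graph.add("S&P 500", "has_constituent", "Apple Inc.")

# Everything that references the index: the ETF and the futures contract.
linked_to_index = graph.incoming("S&P 500")
```

Keeping the index, the ETF and the futures contract as distinct nodes means each can carry its own identifiers and metrics, while the edges record how they relate.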
Doug Eisenstein: 00:43:12
No, no, no, no and you hit the nail on the head. I mean, that is the way to think about all this. And by the way, that is a great episode, I listened to that the other day and it’s very [inaudible 00:43:23] in listening to Maureen, it was like… I’m like, “Wow, there’s so much overlap to the way that the problems that I see within, not within real estate, but within finance,” like I see a lot of these same problems, but yeah man, I mean like the same thing like you got… It’s the way you described it, right. If you have to be… You’re searching through documents, through structured data, semi-structured, unstructured data. You’re picking out these different entities. Sometimes you don’t know what these things are but the more information that you get, then the more you can determine, “Oh, okay, we’re really talking about Apple the company, not necessarily the apple you’re going to be eating,” and there’s different kinds of apples [crosstalk 00:44:08].
Jon Krohn: 00:44:08
That famous apple harvesting farm that’s listed in the S&P 500. So cool. Cool, cool. There is something that you touched on earlier that I’d like to dig into a little more.
Doug Eisenstein: 00:44:22
Yeah.
Jon Krohn: 00:44:23
It sounds like it would be a particularly thorny issue. So when we’re doing this kind of entity extraction, there are especially weighty complexities around temporal resolution. So that must be something that you really need to focus on yourself and get right?
Doug Eisenstein: 00:44:44
Yeah, actually, temporality I would say such a huge… It’s such a huge problem. The reason why it’s so important. I’ve asked myself many times. Why is it so important in this space of investments? You know what it is? It’s because you want to look back in time and see like what if analysis, how did my portfolio perform, a simulation and sometimes people call it a backtest. That’s really the reason why.
Jon Krohn: 00:45:07
Yeah, yeah, yeah. Because if you throw those out, if you say, “Well, any company that’s had a merger or an acquisition, that’s too complicated, I’m not going to put it in my data set.” That’s a really big problem because those stocks in particular have some of the biggest price movements.
Doug Eisenstein: 00:45:25
Yeah, exactly. And actually with that one other connecting thought to this is what people have told me that’s different about finance is let’s say these 500 companies… So within S&P 500 you have 500, sometimes right around 500, but let’s say 500. You need to have like 99.9998% or 100% coverage across all of your data sources because there are times when you want to make a trading decision on one of those securities and it just so happens to be a security that underwent some sort of, what they call… Corporate action in finance just means like it’s a merger, acquisition, spin-off, something like that. But those are sometimes the ones you want to take advantage of as a trading opportunity and the data is going to prohibit you from that. So one of the issues with the temporality is that, when you’re getting this data, the data can change.
Doug Eisenstein: 00:46:40
So in a super simple case, there is a company called Refinitiv [inaudible 00:46:53].
Jon Krohn: 00:46:53
Right, yeah.
Doug Eisenstein: 00:46:56
They provide estimates data and what would happen if an estimate that gets provided to you today on June 18th, they find out that the analyst made an error, right? Then you made a trading decision today using that data. On Monday, then you get an update that says, “Oops, there was a mistake. We’ve corrected that,” right? And then you make another trading decision. Well, if you want to go back in time, you really need to look back and say, what did you know on that particular day? And this is where the temporality comes in. Where you have to understand like what has changed. So there is a concept called bitemporality, which is very important where you have like the dating around the date of the data itself and then you have the date of when the data on that day has changed and you track the different versions of that.
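The "what did you know on that particular day" question is an as-of lookup over versioned data. Here is a minimal Python sketch of bitemporal storage, assuming each version of a data point carries both the date it describes and the date it was received. The class and function names, the entity key, the values, and the year 2021 are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    entity: str
    effective_date: date   # the date the data point is about
    knowledge_date: date   # the date this version reached us
    value: float


def as_of(history, entity, effective_date, knowledge_date) -> Optional[Version]:
    """Latest version of a data point that was known on or before knowledge_date."""
    candidates = [
        v for v in history
        if v.entity == entity
        and v.effective_date == effective_date
        and v.knowledge_date <= knowledge_date
    ]
    return max(candidates, key=lambda v: v.knowledge_date, default=None)


# An estimate delivered on Friday June 18 and corrected the following Monday.
history = [
    Version("EXAMPLE", date(2021, 6, 18), date(2021, 6, 18), 1.50),  # original, later found wrong
    Version("EXAMPLE", date(2021, 6, 18), date(2021, 6, 21), 1.05),  # correction
]

seen_friday = as_of(history, "EXAMPLE", date(2021, 6, 18), date(2021, 6, 18))
seen_monday = as_of(history, "EXAMPLE", date(2021, 6, 18), date(2021, 6, 21))
seen_before = as_of(history, "EXAMPLE", date(2021, 6, 18), date(2021, 6, 17))
```

The key design choice is that corrections are appended as new versions rather than overwriting the old value, so a backtest for Friday still sees the erroneous 1.50 that the Friday trading decision was actually based on.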
Doug Eisenstein: 00:48:08
So I think that temporality is such a key area, but the problem with the temporality is that in order for you to… Your pipeline that you manage needs to be taking that into account from the very beginning. Like, I saw all different kinds of implementations, but you really have to know when you’re getting your data from the very beginning and keeping track of the changes that have occurred to individual entries inside of that data, whether it be an API call or something else, you have to be tracking that.
Doug Eisenstein: 00:48:58
The last thing I’m going to say is, I think of there being three different kinds of temporalities actually. One temporality is the date. What some people call the data date. So if you think of a price. A price is provided. Let’s say you have a close of day price for the ETF of the S&P 500 today, right? Well, the close price is for today, so we’ll say that’s our effective date. But then when is that really valid? It’s not… If you look back in time, it’s not valid as of today, it’s really valid as a close price, after market close or tomorrow, or basically on Monday. So that’s the valid dating and you have to take that into account, those two pieces.
Doug Eisenstein: 00:49:48
Then there’s the other dimension that I was mentioning, some people call it the system dating, or we call it the knowledge dating which is on Monday, when Friday’s data is valid, then you got this correction. And now the data has changed. One last thing related to this is macroeconomic data which is another… There’s another example of where temporality is absolutely crucial. If you were to look at FRED or you were to look at the ONS, which is the United Kingdom Office for National Statistics, I believe, and many other macroeconomic sources, they do keep track of usually what’s called a release date and then like the data date or the effective date. So they keep track of the two dimensions because they want to provide to you revisions which are different than corrections to the data. So temporality, in order for you to be able to rewind back time and see what did the world look like and have a precise answer to it. Critical, critical.
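Putting the three dates together, a record can be used for a decision only if it is both valid and known by the decision date. The sketch below is one possible reading of Doug's description, not his implementation: it assumes a close price becomes valid on the next weekday (holidays are ignored), and the ticker, prices, and year 2021 are placeholders.

```python
from dataclasses import dataclass
from datetime import date, timedelta


def next_weekday(d: date) -> date:
    """Next Monday-to-Friday date after d (assumption: exchange holidays are ignored)."""
    nxt = d + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


@dataclass(frozen=True)
class ClosePrice:
    ticker: str
    effective_date: date   # the trading day the close belongs to
    knowledge_date: date   # when this version was received
    price: float

    @property
    def valid_from(self) -> date:
        """A close is only actionable from the following trading day."""
        return next_weekday(self.effective_date)


def usable_on(records, ticker, decision_date):
    """Most recent close that was both valid and known on decision_date."""
    visible = [
        r for r in records
        if r.ticker == ticker
        and r.valid_from <= decision_date
        and r.knowledge_date <= decision_date
    ]
    return max(visible, key=lambda r: (r.effective_date, r.knowledge_date), default=None)


records = [
    ClosePrice("ETF", date(2021, 6, 18), date(2021, 6, 18), 100.0),  # Friday close
    ClosePrice("ETF", date(2021, 6, 18), date(2021, 6, 22), 101.0),  # correction received Tuesday
]

on_friday = usable_on(records, "ETF", date(2021, 6, 18))
on_monday = usable_on(records, "ETF", date(2021, 6, 21))
on_tuesday = usable_on(records, "ETF", date(2021, 6, 22))
```

Friday's close is not usable on Friday itself, becomes usable on Monday at the originally delivered value, and only reflects the correction from Tuesday onward.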
Jon Krohn: 00:51:02
I think I got it. So first of all, it’s crystal clear why this is such a thorny problem, the bitemporality piece is I guess something that a lot of other people talk about where I wasn’t familiar with it personally before, but it sounded like that’s kind of the way you phrased it. And that’s a problem where you have the data, that’s like one of the aspects that we’re considering temporally, but then it becomes bitemporality because we also need this kind of metadata around when there are changes to those data going back. And then tritemporality, which sounds like it might be your term. Or [crosstalk 00:51:44]-
Doug Eisenstein: 00:51:43
Yeah I don’t think it’s an industry term, I think it’s probably my term.
Jon Krohn: 00:51:48
So tritemporality has both of those facets. And in addition, it takes into account this everyday lag-
Doug Eisenstein: 00:51:59
Yes.
Jon Krohn: 00:52:00
-between… Yeah, the close price of a stock today, but you don’t actually know that officially until the next trading day, which may be-
Doug Eisenstein: 00:52:07
Exactly.
Jon Krohn: 00:52:07
-the Friday, [inaudible 00:52:09] Monday.
Doug Eisenstein: 00:52:09
Exactly.
Jon Krohn: 00:52:09
Cool.
Doug Eisenstein: 00:52:10
Now I could blow your mind and I can add a couple of other dimensions because there are other dimensions to this, but there are other types of, I won’t go into the details, but there are other types of time stamps that are useful but not as critical. I’d say the two most critical ones are this like, what people call the trade date or the price date or the effective date and then this system date or what we call the knowledge date. Like those two different dates are critical because that’s how you can track over time how these different data points have changed.
Doug Eisenstein: 00:52:47
And believe me, actually what’s kind of interesting is when you start looking at the changes in time, there’s some really interesting data quality observations you can make between the anomalies, conflicts, like sometimes data producers will change. One example, there is a data producer, I can’t name names, that changes every other Friday, the name of their company from 21st Century Fox to like 21st Century or something for a period of a couple of hours and they kind of chuckle at it because it must be some process that’s, behind the scenes, that changes it, distributes all that data, and there’s some other post correction that’s made to it, but that’s just an example. That’s an insignificant example because it’s not material since it’s just the name change, but if it changed to something that was significant, a price or something else, that would be a problem, right?
Jon Krohn: 00:53:50
Good example, clear example. And in case it isn’t obvious to the listener why this is so absolutely critically important. All these ideas around when we know information, when it’s official, when changes happen, a big reason why this is so absolutely important is that when you are a quantitative trader, a systematic trader, and I was one professionally for a couple of years. You design your trading strategies based on historical data. You design these algorithms that use some kind of modeling, some kind of understanding of the market to try to predict based on this information that I have is the price at this instant, of this particular stock or commodity, or whatever, is it likely to go up or is it likely to go down? How likely is that and depending on how likely it is, then maybe I’ll trade on it or not and especially when you’re trading at high frequencies.
Jon Krohn: 00:54:53
So a lot of our algorithms were sub-second where you enter into and out of a trade so you buy a product and sell it in less than a second and I’m talking like fractions of a second, hundredths or thousandths of a second you’re in and out. When you’re talking about that kind of resolution or maybe not even that fast even just minutes or hours. If you don’t know exactly what information you had at that instant, then you can design an algorithm that unbeknownst to you, when you’re training it on historical data, it can see into the future. So of course it’s an amazing algorithm because it has information that you didn’t really have until minutes later or a day later and so you could end up training your alg… You could say, “Wow, this algorithm when I deploy it into the market it’s going to crush better than any algorithm I’ve ever had. I’m going to be so rich,” but then you just get eaten alive.
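A toy backtest shows how an algorithm that "sees into the future" looks spectacular on historical data. The momentum rule and the return series below are made up purely for illustration; the only point is the difference between using information from the same period you trade in (`lag=0`) and using information that was actually available beforehand (`lag=1`).

```python
def backtest(returns, lag):
    """Toy strategy: go long if the signal return is positive, otherwise go short.

    lag=1 uses the previous period's return, which was known at decision time.
    lag=0 uses the return of the very period being traded, i.e. it peeks ahead.
    """
    pnl = 0.0
    for t in range(1, len(returns)):
        signal = returns[t - lag]
        position = 1 if signal > 0 else -1
        pnl += position * returns[t]
    return pnl


# Made-up period returns that alternate in sign.
returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01]

leaky_pnl = backtest(returns, lag=0)    # earns |return| every period: too good to be true
honest_pnl = backtest(returns, lag=1)   # same rule with only past information
```

On this series the leaky version gains about 7% while the honest version loses about 7%. Nothing about the strategy changed except which data it was allowed to see, which is why knowing exactly what was known at each instant matters so much.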
Doug Eisenstein: 00:56:03
It’s look-ahead bias, right? Yeah, I mean that’s one of the most critical problems and then there was something else that you said that I wanted to add onto which is you mentioned, high frequency and that’s another when we think about the segments of the financial markets that’s a whole other area like of all these different investors, where they’re making decisions sub-second, millisecond, microsecond. Actually one particular area that is very interesting for the audience is to read up on transaction cost analysis within the trading departments of companies, because that is an area where [inaudible 00:56:47] there’s high frequency companies and there are… I don’t know if they call themselves low frequency, but they don’t trade as often as basically high frequency. I don’t think they call themselves low frequency.
Jon Krohn: 00:56:58
It doesn’t sound very cool.
Doug Eisenstein: 00:57:00
It doesn’t sound very cool. They just don’t trade as often. And in those kinds of organizations sometimes you need to spread out a trade over hours, days sometimes even weeks and it’s important because the price can move that you need to have a way that you can look back and identify, could there have been another way to make this trade that would’ve been more profitable? So that area of transaction cost analysis has been critical over the last 10 years and it continues to be an area that I’ve seen organizations kind of tune in a little bit more to now that there’s better technologies and that’s where you can have, just to recap, you can have companies that are not trading as often but rely upon like this intraday data or you can have TCA being taken into account as part of the sub-second processing.
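One common transaction cost analysis measure compares the average price actually achieved across all the fills of a spread-out order with the price when the order was first decided (often called the arrival price). The sketch below assumes that measure; Doug does not say which metrics his clients use, and the fills and prices are invented.

```python
def average_fill_price(fills):
    """Quantity-weighted average price over (quantity, price) fills."""
    total_qty = sum(qty for qty, _ in fills)
    return sum(qty * price for qty, price in fills) / total_qty


def slippage_bps(fills, arrival_price, side):
    """Cost versus the arrival price in basis points.

    side is +1 for a buy and -1 for a sell; a positive result means the
    execution was worse than trading everything at the arrival price.
    """
    avg = average_fill_price(fills)
    return side * (avg - arrival_price) / arrival_price * 1e4


# A buy order worked in three pieces while the price drifts upward (made-up numbers).
fills = [(100, 50.00), (100, 50.10), (200, 50.20)]
avg_price = average_fill_price(fills)
cost = slippage_bps(fills, arrival_price=50.00, side=+1)
```

Here the average fill is 50.125 against an arrival price of 50.00, a cost of roughly 25 basis points. Comparing that figure across different ways of scheduling the same order is the "could there have been another way to make this trade" question.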
Jon Krohn: 00:58:05
TCA?
Doug Eisenstein: 00:58:05
Yeah, TCA. Transaction cost analysis.
Jon Krohn: 00:58:08
Oh, right.
Doug Eisenstein: 00:58:08
Yeah, sorry, yep.
Jon Krohn: 00:58:11
Cool. Well, so lots of interesting complex problems that you’ve touched on in this episode. If somebody wants to work with you, if there’s a listener out there that’s thinking, “Sounds pretty cool. I have a relevant background or I have a highly technical background and I’d love to be moving into this kind of financial data engineering,” you have four data engineer roles open right now. So there’s an opportunity right there for some listeners. How can people apply?
Doug Eisenstein: 00:58:44
Just getting in contact. I’m big on LinkedIn, so if they just contact me through LinkedIn, I’m actively communicating on there, but that would be terrific. And what’s important to us as a company is really passion. Like that’s one thing that I think is critical for whenever I’m looking at hiring is, I want to see someone’s passion about what they do and that could take the form of contributing to open source projects and just keeping abreast of the new technologies that exist in how to be able to apply them. Yeah and just sometimes also having a good attitude as well too.
Jon Krohn: 00:59:35
Nice. Yeah. Passion and attitude.
Doug Eisenstein: 00:59:39
Yep.
Jon Krohn: 00:59:41
So, all right, that gives us… So typically the last question I ask is, how should people stay in touch? But you just answered that, which is LinkedIn, and so then my final question, which would typically be my penultimate question, is, do you have a book recommendation for us?
Doug Eisenstein: 00:59:59
I do actually and I’m not kidding, like there are so many different books that I can recommend. There is one book in particular that I really have loved and it’s actually called Chasing Excellence by Ben Bergeron.
Jon Krohn: 01:00:15
Oh yeah. All right.
Doug Eisenstein: 01:00:18
Great book. Even if you are not someone… He’s a CrossFit coach that has gained sort of a lot of recognition in CrossFit because of the games and everything and like [crosstalk 01:00:37]-
Jon Krohn: 01:00:37
He coaches some of the biggest names. So, early on, his wife, Heather Bergeron was big in the games but in more recent years, some of the very best CrossFit athletes, Katrin Davidsdottir, Mat Fraser, trained with Ben Bergeron too, right?
Doug Eisenstein: 01:00:53
Yeah, yeah.
Jon Krohn: 01:00:54
So yeah, two of the most accomplished CrossFit athletes of all time. And yeah, so amazing and Doug works out in Ben’s gym, CrossFit New England and I discovered just before the show, lives just a few blocks away. So in the small chance that you’re a SuperDataScience listener that’s into CrossFit as much as me, this is really exciting and I actually didn’t even know about Ben’s book, Chasing Excellence. So, what’s it about? So it’s not even just for CrossFit people, I think was the point that you were about to [crosstalk 01:01:25].
Doug Eisenstein: 01:01:24
Well yeah. I mean, I think my point of view is that even if you are not someone that’s into CrossFit, understanding the mind of an athlete and the discipline that they have and just a way of thinking, a lot of it that I took away was approaching whenever you have a problem with, you have a problem, here’s a solution. If you’re not happy with, let’s say they gave an example of a time or a score that you got right. Let yourself be angry for five minutes, 10 minutes, 15 minutes. But then move on. Move past it at that point. Don’t think about it again. Give yourself several days, then reflect on it. Things like that, I feel like really help even if it’s applied in that scenario to an athlete, it helps in our day to day. So that’s a great-
Jon Krohn: 01:02:30
Nice. Really great recommendation. I love that, Doug. All right, thank you so much for being on the program. I’ve learned a ton. You are an outstandingly clear communicator of concepts, and you give such rich examples, so I’m sure our audience loved this episode as well. Hopefully we can have you on again sometime?
Doug Eisenstein: 01:02:50
Yeah it sounds great. I would love to do that and I’d love to talk more about my passions of data engineering and finance.
Jon Krohn: 01:02:58
[inaudible 01:02:58]. All right, Doug, catch you again, bye bye.
Doug Eisenstein: 01:03:00
Thank you.
Jon Krohn: 01:03:07
I told you Doug was a clear, content rich communicator. I learned so much in this episode and he made it so easy to understand thanks to his clarity and bountiful examples. The key points Doug covered were how to start a consultancy with no track record and how the best way to launch software products could be via consulting. An overview of the major financial industries, like capital markets, asset management and high frequency trading. An overview of the front, middle and back office financial firm structure. How data engineering for front office decision making is extremely complicated because it involves the integration of hundreds of data sources, each with retrospectively updated data points. And how entity resolution, knowledge graphs and tritemporality can address these issues.
Jon Krohn: 01:03:53
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show and the URL for Doug’s LinkedIn profile as well as my own social media profiles at www.superdatascience.com/485. That’s www.superdatascience.com/485. If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel where we have a video version of this episode. To let me know your thoughts on the episode, please do feel welcome to add me on LinkedIn or Twitter and then tag me in a post to let me know your thoughts on this episode. Your feedback is invaluable for figuring out what topics we should cover next.
Jon Krohn: 01:04:36
I’d like to give a special mention to those of you listeners who nominated my work for a data community content creator award. Thanks to you, at the award ceremony on June 22nd my YouTube channel was recognized as the favorite for learning about machine learning and artificial intelligence. If you haven’t checked it out yet the channel’s at youtube.com/c/jonkrohnlearns, we upload a new video every Monday and every Thursday.
Jon Krohn: 01:05:02
I am absolutely delighted that some of you think we’re on the right track with the ML/AI videos we publish. All right, thanks to Ivana, Jaime, Mario, and JP of the SuperDataScience team for managing and producing another amazing episode today, keep on rocking out there folks. And I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.