We went over the new and huge opportunities in natural language processing algorithms in the legal tech space, how to successfully bootstrap and scale up tech startups, innovative commercial applications of data science, and more!
About Horace Wu
Horace is the founder and managing director of a legal tech company, Syntheia. Syntheia aims to make it easy for lawyers to do their best work by putting useful knowledge at their fingertips. Horace is trained as a lawyer and has worked at top law firms in Australia and the US. At Syntheia, he works closely with his team of data scientists and developers to solve complex problems for the legal profession.
Overview
Horace and I have known each other for quite a while, going back to my time at Oxford. Currently, we both live in New York City. Horace first lived in New York over ten years ago working as a lawyer. His current work is with his company Syntheia, which works in legal data, turning impenetrable information into usable data for lawyers. For example, you can be in a Word document for a contract, and you want a clause on intellectual property. Syntheia allows you to source a similar clause through a keyword in a simple way.
As mentioned, Horace started out practicing law. During the latter half of his legal career, he got a lot of business questions from colleagues and friends. Rather than pursuing a partnership at his law firm, he took 12 months off to start an events recommendation app for consumers. His second company, Syntheia, grew out of a conversation with a friend about the tediousness of searching and referencing legal documents. Alongside Syntheia, he is assistant general counsel for Nearmap, an Australian aerial imagery company whose data is useful for property assessments and valuations.
We talked about the use cases of NLP with Syntheia. Historically, the legal field has not been particularly robust in document naming, categorizing, metadata creation, and searching. If you take a hundred-page legal document and classify it with one label, you miss tons of information that could help describe and categorize it. Syntheia breaks documents down into smaller information chunks and then pulls words out of these chunks with NLP to help categorize and describe the document. Because of the confidential nature of legal data, all deployment happens at the law firm level. But how do they make these tools pop up in Microsoft platforms? It’s actually pretty simple: the front end is built with React, a JavaScript library. Horace’s main role is writing out the logic for developers to code. Then they collectively test and ensure the business outcome. He sells himself short a bit, though. It’s impressive that someone with a legal background can have this much literacy on the dev side.
The team is remote, utilizing Slack and Jira to maintain sprints. The machine learning team is almost exclusively based out of Cairo, for example. The Java team began in Australia and has grown organically from there through a system of two-week trials to find the best candidates. The core skills necessary for this are critical thinking and problem comprehension. One unique skill one of their team members is learning is Cython, a Python-C hybrid.
We closed by looking back and forwards in Horace’s career. He says he has no regrets or desires to redo something so much that it would really change much about where he is now. He’s excited to see if Syntheia can become integral to legal problem solving for the better.
In this episode you will learn:
- Horace’s life and work in New York City [5:56]
- Syntheia and Horace’s role there [6:25]
- Horace’s background [12:07]
- Nearmap [16:35]
- Syntheia NLP use cases [21:46]
- Design, coding, and the team [34:19]
- What skills does one need for this field? [41:41]
- What would Horace do differently and what is he excited for? [46:15]
Episode Transcript
Jon Krohn: 00:00
This is episode number 455 with Horace Wu, the founder and director of Syntheia.
Jon Krohn: 00:12
Welcome to the SuperDataScience podcast. My name is Jon Krohn, chief data scientist and bestselling author on Deep Learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Jon Krohn: 00:42
Welcome to the SuperDataScience podcast, I’m your host, Jon Krohn, and we are lucky to be joined today by the fabulously interesting, Horace Wu. Horace is an Australian who trained as a lawyer in Sydney and spent more than a decade working as an attorney. After advising countless tech companies, he caught the entrepreneurship bug himself, and today lives in New York as founder and director of his second startup, a machine learning company called Syntheia. Syntheia cleverly combines machine vision and natural language processing models together, in order to provide real-time guidance and instant airtight automatically generated clauses to lawyers, as they read and draft contracts.
Jon Krohn: 01:29
During this episode, Horace fills us in on how natural language processing algorithms are finally reaching a point where huge opportunities are emerging for commercial machine learning applications, in the legal tech space. He also provides practical guidance on how to successfully Bootstrap, and then scale up a tech startup, without diluting your ownership to outside investors. This episode will be of interest to anyone who’s interested in learning about innovative commercial applications of data science, or launching an AI startup from scratch, even if you don’t have any formal scientific or technical background. We also do get into the weeds a little bit here and there for you hands-on practitioners, including discussing cutting edge software stacks for training and deploying machine vision and NLP algorithms, into production systems that require real-time model feedback.
Jon Krohn: 02:31
Horace, wonderful to have you on the SuperDataScience show, welcome.
Horace Wu: 02:36
Jon, it’s wonderful to be here, thank you for having me.
Jon Krohn: 02:40
So, you and I have known each other for a surprisingly long time. We met 14 years ago, I did the math just a second ago. We met 14 years ago in Oxford, in the United Kingdom. I was in the first days of my graduate studies at Oxford University, and you were visiting a friend, and we went to what they call a Bop. Do you remember what the Bop was all about that night?
Horace Wu: 03:06
It might’ve been a Christmas Bop. I do distinctly-
Jon Krohn: 03:09
A Christmas Bop, oh, yeah, [crosstalk 00:03:11].
Horace Wu: 03:11
I remember Los Terry was played, there was mulled wine…
Jon Krohn: 03:16
That’s right, that’s right. So for listeners who are from outside the UK, a Bop is just a party, but for some reason, if it’s held by the University, they call it a Bop, and it’s a surprising amount of fun. Something that’s interesting about, at least at Oxford, maybe in the UK in general, my English… I was in England for five years for my entire time as a graduate student at Oxford. So what I think is English, I could be completely wrong, but an interesting thing is in the U.S. and Canada, where I’ve spent most of my life, the only party in a year that anybody dresses up, in terms of a costume, is a Halloween party.
Horace Wu: 04:00
Right.
Jon Krohn: 04:01
But in Oxford at these bops, almost every single one has a theme.
Horace Wu: 04:06
Yep. It’s a great opportunity to really stretch-test your wardrobe. We don’t really have those in Australia either. It’s just…
Jon Krohn: 04:13
Really?
Horace Wu: 04:14
… Yeah, we don’t really even have Halloween dress ups you might do as a kid, and I think that is much to the detriment of the Australian party culture.
Jon Krohn: 04:25
Yeah. It’s really fun dressing up for a party. It’s something that we should try to get going. Which actually now, you and I could. We’ve touched on this. We met a long time ago, ancient history in England, you are Australian, and people can tell from your accent, and you spent most of your life in Australia, but in fact, now we both live on the Island of Manhattan-
Horace Wu: 04:47
Incidentally.
Jon Krohn: 04:48
… and we should be instituting costumed bops.
Horace Wu: 04:51
Well, as soon as COVID is over Jon, you can count me in as guest number one.
Jon Krohn: 04:56
Perfect. So you’ve been in New York for a while, tell us what you’re doing here.
Horace Wu: 05:05
Oh, gosh. Well, New York is an interesting story. This is actually my second stint in New York. I was in New York in 2007 to 2009, and I was-
Jon Krohn: 05:16
Oh.
Horace Wu: 05:16
… Yeah. [crosstalk 00:05:17]. That’s when we met and I was a lawyer back then, and this is my second stint, we moved here about two and a half years ago. And the reason we moved here was because my wife said, “Well, you’ve lived in New York, so now it’s my turn.” And that’s why we moved here.
Jon Krohn: 05:35
Australians always want to spend some chunk of their life abroad. I think half of them end up in New York, and the other half end up-
Horace Wu: 05:43
Whistler.
Jon Krohn: 05:44
… in Whistler, British Columbia.
Horace Wu: 05:49
Yes, yes we do. We do. I think it’s not so much because Australia is a terrible place to be, despite what the British might tell you. It’s a wonderful place, but it is far away from the rest of the world and Australians being adventurous people, they like to go around and see new things and try new things. So you end up with a large cohort living overseas at any one time.
Jon Krohn: 06:14
Yeah. So, while you’re here, you have a fascinating role, and I’m so excited to have you on the program and to dig into exactly what you’re doing. So you founded a company called Syntheia.
Horace Wu: 06:27
Yes.
Jon Krohn: 06:28
We’ve touched on how you used to be a lawyer, but now you are the founder and head, you call yourself the managing director, [crosstalk 00:06:38] I wanted to call you, the head honcho.
Horace Wu: 06:43
You’re going to inflate my ego too much if you called me that.
Jon Krohn: 06:48
… So the company is called Syntheia,-
Horace Wu: 06:49
Yes.
Jon Krohn: 06:50
… It’s spelled wonderfully. So it sounds like the woman’s name, Syntheia, but it’s spelled like synthetic.
Horace Wu: 06:57
Yes, that’s right.
Jon Krohn: 06:57
S-Y-N-T-H-E-I-A, Syntheia,-
Horace Wu: 07:00
Correct.
Jon Krohn: 07:00
… and Syntheia is a machine learning platform.
Horace Wu: 07:04
Correct. It’s specialized in legal data. Syntheia actually comes from the Latin word, because lawyers love Latin words, for together and knowledge. So, Syntheia. So what’d be [crosstalk 00:07:18]… Yeah. Yeah, yeah. There is a story behind [crosstalk 00:07:21].
Jon Krohn: 07:21
I didn’t know that.
Horace Wu: 07:22
I rarely get to tell that story. People don’t usually have the patience to listen to it. And so-
Jon Krohn: 07:28
That’s why we welcomed you onto the Latin roots podcast.
Horace Wu: 07:35
Let me get out my dictionary to tell you more of the [crosstalk 00:07:38]. We specialize in taking legal documents and legal behavior, and plotting them and mapping them to bits of data that can be reused by law firms and by lawyers. So, it’s a platform that is designed to turn what is quite impenetrable information into reusable data that can be mapped to different purposes.
Jon Krohn: 08:08
Nice. So as an example, I’ve seen lots of really smooth demos of the platform. The way that you described it was a bit in the abstract.
Horace Wu: 08:19
Yes.
Jon Krohn: 08:19
But to give an example of what I’ve seen. You could be in a Word document, which I imagine lawyers are in a lot of their workday, and you’re typing up a contract, and you’d like to have a clause on intellectual property, whatever.
Horace Wu: 08:35
Mm-hmm (affirmative), yep.
Jon Krohn: 08:36
And you’re like, “It would be such a pain for me to write this clause from scratch. But I bet somebody in my firm has written a clause, like what I need before.”
Horace Wu: 08:47
Yes.
Jon Krohn: 08:48
And then you could maybe write just the words, intellectual property, or write a sentence or something that’s roughly what you need to be covering in this clause, and then highlight that text and press a button in Word, and then it activates Syntheia?
Horace Wu: 09:05
Correct, correct. So Jon, you’ve asked a really good question, which is how does it actually work? What does it actually do for people? So, it’s based on the idea that lawyers, they need knowledge and the more knowledge they have, the better the work is going to be. So we try and build our software in such a way that it fits in the workflow of a lawyer, and most of that workflow is inside of Microsoft Word. So the example you just gave is, finding a prior precedent that addresses a concept, so that a lawyer would have a reference point when they’re trying to address a new case, or satisfy a new need.
Horace Wu: 09:48
And there are multiple use cases for this type of technology. The drafting one is one example, there are others such as if you were reviewing a document for the first time, and you want to find things that are red flags. You know from previous deals what those red flags are, and so the machine has already seen your behavior and mapped out where when you see words like this in a document like that, raise a red flag because that should not be there. And then you can do other things like- Yeah, yeah. Saving your time.
Jon Krohn: 10:21
I didn’t know about that, that’s cool. This episode is brought to you by SuperDataScience, our online membership platform for learning data science at any level. Yes, the platform is called SuperDataScience. It’s the namesake of this very podcast. In the platform, you’ll discover all of our 50 plus courses, which together provide over 300 hours of content, with new courses being added on average once per month. All of that and more, you get as part of your membership at SuperDataScience. So don’t hold off, sign up today at www.superdatascience.com. Secure your membership and take your data science skills to the next level.
Jon Krohn: 11:04
I should let the audience know that I am not a completely benign observer of Syntheia. I’ve known about Syntheia for many years, and I’ve been informally providing advice on machine learning stuff to Syntheia for a long time. So, I’m not a completely unbiased spectator, but actually, we haven’t caught up, pretty much since COVID hit over a year ago.
Horace Wu: 11:27
Yeah, yeah.
Jon Krohn: 11:29
Yeah, I’m way out of date on anything that Syntheia is doing. So, I always think of Syntheia as this ability to bring clauses up, to have related documents come up, but the red flag, that’s also a really cool use case.
Horace Wu: 11:44
Yeah, yeah. Yeah, and it comes back to the idea of if you have that data already mapped out, if you already have that database of information, reusing it is a matter of just repurposing the front end and repurposing your queries to that database.
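Editor's note: the clause lookup discussed above (highlight a phrase such as "intellectual property" and get back a similar clause from the firm's past work) can be illustrated with a minimal sketch. This is not Syntheia's method. It assumes a tiny hypothetical clause bank (CLAUSES) and uses plain bag-of-words cosine similarity in place of a trained language model; the function names are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical clause bank; a real system would index a firm's own documents.
CLAUSES = {
    "ip": "Each party retains ownership of its intellectual property rights.",
    "confidentiality": "Each party must keep the confidential information of the other party secret.",
    "termination": "Either party may terminate this agreement by giving thirty days written notice.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar_clause(query, clauses=CLAUSES):
    """Return the name of the stored clause most similar to the query text."""
    query_vec = Counter(tokenize(query))
    scores = {
        name: cosine(query_vec, Counter(tokenize(text)))
        for name, text in clauses.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(find_similar_clause("intellectual property"))
```

Swapping the front end, as Horace describes, would mean keeping the same stored clauses and changing only how the query is built and how the result is shown.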
Jon Krohn: 12:01
Nice. So, let’s talk a little bit about your background. So you were a lawyer for many years.
Horace Wu: 12:13
Yes.
Jon Krohn: 12:14
How did you first get this machine learning bug? When did you figure out that, “What I needed to be doing, was learning as much as I can about machine learning?” And we’re getting into the technical detail later in this podcast.
Horace Wu: 12:29
Sure.
Jon Krohn: 12:29
I’ve been blown away in conversation with you before, about how hands-on and knowledgeable Horace is about software and data science. So, we’re going to dig into all of that stuff, but tell us, how did you make that transition? What was it like?
Horace Wu: 12:45
Jon, first you flatter me too much, thank you very much. As to my background, I actually was a lawyer for more than a decade. And during the latter half of that, a lot of my friends were asking me for business-style advice, commercial-style advice, simply because of my role as a corporate M&A lawyer, and what I did… The bug was, “Well, hang on. If I’m giving all of this advice out and I’m helping all my friends, I want to try this for myself.” So at the end of 2000, you just kind of go, “Well, might as well.” So at the end of 2016, there was a divergence, or a fork in the path where I could stay at the law firm and try for partnership, the path that most lawyers would pursue.
Horace Wu: 13:32
And there was a separate path of, well I could go and do something else, whatever that happens to be. And I decided to take 12 months off, and I started a company which was called Viva.
Jon Krohn: 13:44
Viva?
Horace Wu: 13:45
Yeah. I don’t know if you remember this.
Jon Krohn: 13:47
Now I remember it. Had forgotten about it [inaudible 00:13:49]
Horace Wu: 13:48
Yes, yes. It was an events entertainment recommendation app for consumers. And what we did,-
Jon Krohn: 13:58
Nice.
Horace Wu: 13:59
… Very much like Netflix. It learns what your preferences are, and then it made recommendations to you, but for real-life events, because back then you could actually go and see people and do things. That was when I left the law to create that company, to run that. Sadly, that didn’t go anywhere because six months after we launched, Google and Facebook both launched their events equivalent, so we were like, “All right, let’s just shut up shop.” That was a little bit of a sad story, but also I learnt a lot from doing that. And Syntheia actually-
Jon Krohn: 14:39
Yeah, and it shows that you were right on the money with the business idea.
Horace Wu: 14:42
… Well, thank you. Yes, if I still have-
Jon Krohn: 14:43
You’re right on the money with the business idea.
Horace Wu: 14:45
Yeah, a little too early or too late, I’m not sure sometimes depending on what frame of reference you’re looking at. So, Syntheia was my second one. We started that almost by accident. A good friend of mine is a data scientist for a German company that does NLP, and we were meeting one day for a coffee in Sydney, and he wanted some help looking at U.S. leases, property leases because one of their customers at the time wanted to review these documents, and they had no idea what lease actually contained and what they do.
Horace Wu: 15:23
So, after a four-hour coffee, I sent him a five-page document that said, “If you have this technology, you should not be doing simple doc review and pulling out dates and figures. That’s a really low-level task. Do X, Y, and Z, and it’s much more sophisticated and much more valuable for lawyers and businesses.” Then I think within a week or two weeks, his CEO called me and said, “Do you want to build this together?” And that’s how we started Syntheia.
Jon Krohn: 15:52
Wow. That is a long coffee, four hours. [crosstalk 00:15:56]
Horace Wu: 15:56
It was a very interesting conversation.
Jon Krohn: 16:01
I bet. It’s been three years now that you’ve been doing Syntheia full-time.
Horace Wu: 16:07
Yeah, that’s true.
Jon Krohn: 16:10
Which I think is really amazing. And the way that you have been… I think Bootstrapping is the word to use. The way that you’ve been Bootstrapping Syntheia, I think that this is an interesting story for anybody who might like to found a business. If you’re listening to this program, it’s probably going to be some kind of data science or machine learning business. So, Syntheia is a machine learning business, but actually that part of it doesn’t necessarily matter. It’s just this idea of how can you Bootstrap effectively? And I love what you’ve been doing, which is that you still have another job to bring in some income. So that other job actually is pretty darn interesting too, if you don’t mind telling us a bit about [crosstalk 00:16:50].
Horace Wu: 16:50
Yeah, that’s right, that’s right. So my other job is, I’m the assistant general counsel for a company called Nearmap, which is an Australian aerial imagery company. And what they do is they fly planes around the world, predominantly the U.S., Canada, Australia and New Zealand, capture photos of the ground, and then use that data to extract more valuable information. Things like, is there a pool in the backyard? Is there X, Y or Z? And this data is really valuable for anyone who is in a geospatial space. So that’s my other role, I am the assistant GC. And that helps pay for our developers and all the work that needs to be done on Syntheia, while Syntheia is in the state of just piloting really small revenue contracts, nothing that would by itself sustain the cost of development.
Jon Krohn: 17:53
Yeah. So Nearmap, just to talk about them for one more minute,-
Horace Wu: 17:57
Yeah, yeah, sure.
Jon Krohn: 17:58
Horace isn’t emphasizing how big of a deal they are. They’re a huge… They might be Australia’s most successful tech startup right now?
Horace Wu: 18:09
Apart from Atlassian and Canva I would say it’s up there. The Australians have a few really big startups, but as to listed startups that are really focused on data science, I wouldn’t say Atlassian’s focused on data science nor would I say Canva, but for Australian tech companies that are focused on data and data science, I would say Nearmap is among the top three, and certainly no other Australian company is doing that sort of work with aerial imagery.
Jon Krohn: 18:46
Yeah. So one of the market leaders, I think in terms of aerial imagery, so a huge number of patents, a lot of intellectual property around being able to stitch together at an extremely high resolution, all of this aerial footage. And the data science piece of that, I think is something that’s happened secondarily, but is an obvious use case now of the machine vision that we have. So since 2012, having this AlexNet deep learning architecture and all of the other deep learning architectures that have arisen since, we can have extremely accurate machine vision models, if we have high-quality training data, and Nearmap has oodles and oodles of high-quality training data. I can only imagine what the cloud compute bills are like for now.
Horace Wu: 19:33
Well, I know the answer to that, but it’s not [crosstalk 00:19:37].
Jon Krohn: 19:38
Yeah. I’m not here to get that out of you, but I can imagine, with how much data there is, so not only is it expensive, but it does mean that you’re going to end up with a great high-quality machine learning model. So you can do these kinds of things. So you can sell models to, like you said, detect pools, and so this kind of information is useful for private and public sector clients alike, right?
Horace Wu: 20:04
Absolutely.
Jon Krohn: 20:04
To predict, to be able to price properties or assess things, I don’t know. There’s endless uses for being able to survey land with a machine-vision driven assessment.
Horace Wu: 20:17
Well, and it’s a lot of data science. Like if you were to get a human being to do the same work, it would take them days or weeks, whereas if you’ve got a machine to do it, it takes minutes. And the data quality is probably going to be as high, if not better than the human laborer.
Jon Krohn: 20:34
Yeah, because the human fatigues. For it to be someone’s job for a month to, every minute of the workday, be looking for swimming pools in footage, your mind’s going to drift, and you’re going to miss a bunch of pools and no one will ever know. So yeah, I wouldn’t be surprised if machines can do that better than a person. All right. Anyway, we don’t need to talk about Nearmap because that isn’t the star of the show today, the star of the show is Syntheia. Well, and you, but Syntheia and Horace together, dancing through the podcast, [inaudible 00:21:08].
Horace Wu: 21:07
You make it sound like a married couple, which [crosstalk 00:21:10].
Jon Krohn: 21:12
Your wife probably would agree to this assessment.
Horace Wu: 21:17
It’s her birthday today, so let’s not tell her that.
Jon Krohn: 21:21
Oh, that’s funny. Well, thank you very much for taking the time to do this. Actually I know that we do eventually have a hard stop though. We’re still quite a ways from it, quite a ways from your wife’s birthday dinner. But between now and then, we still have plenty of time to cover really cool things about Syntheia and Natural Language Processing and the technology that you’re using and even just how to commercialize companies like Syntheia. So, let’s talk about that. Let’s dig into these examples of Syntheia. So you’re using Natural Language Processing, tell us a bit about one or two of the big use cases of NLP that Syntheia allows.
Horace Wu: 22:01
Yeah. I think maybe, if it’s okay with you, I’ll start with the background on legal sector, and why the legal landscape.
Jon Krohn: 22:09
That’s a good idea.
Horace Wu: 22:10
Yeah, it does frame this conversation a lot. So the legal sector, as you may rightly guess, is rather antiquated when it comes to technology and when it comes to this type of technology. So, you have a lot of law firms, and a lot of companies out there where the legal department essentially just dumps all of their documents into a hard drive or a shared drive somewhere. So, if you want to find information that’s contained in these documents, you need to have either named them extraordinarily well, and/or put metadata on them, or you need to look through every single document manually, in order to find the information you need. It’s worse than a needle in a haystack. And over the last, maybe 10 years, NLP has played a role in this segment of the market, and NLP in the legal sector primarily works at what we call the macro and micro level.
Horace Wu: 23:10
The macro level, you might have people using NLP to look at a whole document, and try and classify or otherwise categorize, what that document is, or what that document does. And at the micro level, you have people doing data extraction. Things like party names, dates, dollars, that sort of information, which then become attributes of that whole document. And you find that these techniques can only take you so far, because when you’re looking at a whole document, which might be a few hundred pages, and then trying to classify that with one label, you’re missing a lot of information. So, we at Syntheia looked at this behavior in the market and we said, “Well, hang on. That’s not the best way of actually processing legal data. That’s not how lawyers think.” Lawyers don’t go, “Uh-huh, I’m going to hold in my mind a whole 200-page document, and use that to do and perform a task.”
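Editor's note: the "micro level" extraction Horace mentions (dates, dollars and similar attributes) can be sketched with two regular expressions. This is a simplified illustration, not any vendor's pipeline. The patterns are assumptions that only cover dates written like "1 July 2021" and dollar figures like "$12,500.00", and the function name is invented.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Assumed formats: "1 July 2021" for dates, "$12,500.00" or "$300" for amounts.
DATE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")
AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_attributes(text):
    """Return the dates and dollar amounts found in the text."""
    return {"dates": DATE.findall(text), "amounts": AMOUNT.findall(text)}


if __name__ == "__main__":
    lease = "This lease starts on 1 July 2021 and the rent is $12,500.00 per month."
    print(extract_attributes(lease))
```

The output is a handful of attributes for the whole document, which is exactly the limitation described here: the surrounding concepts in each paragraph are lost.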
Horace Wu: 24:18
Lawyers are very much like any other people, human beings, professionals who can retain only an X amount of information.
Jon Krohn: 24:25
Is that true?
Horace Wu: 24:26
Mostly true.
Jon Krohn: 24:27
Lawyers are like other people?
Horace Wu: 24:29
Lawyers are a little bit better. They’re a little bit better,-
Jon Krohn: 24:34
All right, all right.
Horace Wu: 24:37
… but we’re not superhuman by any stretch of the imagination. So we can only retain so much information in our heads, and that type of information would usually exist at the paragraph level, or at a sentence level. Lawyers write in super long sentences. Sometimes they’re synonymous. And for the technology that exists today, to process at either macro or micro, you’re missing all of this rich information that human beings actually think in. At the conceptual level, this paragraph does X. So, Syntheia really uses machine learning in two key ways. First is, we break large documents down into what we call meaningful chunks of information, your paragraphs, your sections, your sentences, and then we use NLP to then read the words in those chunks of information, to then give them actual useful meaning. And we can dive into what we do with the NLP a little bit later, if you would like.
Jon Krohn: 25:48
Nice, yeah, I would love to. So there’s a machine vision and an NLP element, and typically starting with that machine vision to figure out where the chunks are on the document visually, that’s clever. So in the Natural Language Processing space, a typical initial first step is to take a Word document, or a PDF or whatever, and just convert it into plain text-
Horace Wu: 26:14
Yes.
Jon Krohn: 26:14
… and then work from there. That is the standard. So to have this initial step of using machine vision, to visually recognize chunks first, done properly, I can imagine you get much better results in terms of being able to recognize where those chunks are.
Horace Wu: 26:31
And you get to segment ideas before they’re conflated into a single string. So, if you were to take one example, in a lot of legal contracts, you’d have maybe 30, 40 different clauses, each of which deal with a different set of circumstances. It’s only triggered if certain things become true. If you were to process that document without first segmenting it, you end up with this really fuzzy document. Whereas, if you segmented first, you’d end up with 30 very distinct ideas, that are much easier and much more meaningful to deal with.
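Editor's note: the idea of segmenting a contract into distinct clauses before any language processing can be sketched as follows. The episode describes doing this with machine vision on the page layout; the sketch below swaps in a much simpler plain-text approach and assumes every clause begins on its own line with a number and a full stop ("1. ", "2. "). The function name and sample text are invented.

```python
import re

# Assumption: each clause starts at the beginning of a line as "N. ".
CLAUSE_START = re.compile(r"^(\d+)\.\s+", re.MULTILINE)


def split_into_clauses(contract_text):
    """Return {clause_number: clause_text} for numbered clauses in the text."""
    matches = list(CLAUSE_START.finditer(contract_text))
    clauses = {}
    for i, match in enumerate(matches):
        # A clause runs until the next clause heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(contract_text)
        clauses[int(match.group(1))] = contract_text[match.end():end].strip()
    return clauses


SAMPLE = """1. Term. This agreement starts on the commencement date.
2. Fees. The customer must pay the fees within 30 days.
3. Termination. Either party may terminate on written notice."""

if __name__ == "__main__":
    for number, text in split_into_clauses(SAMPLE).items():
        print(number, "->", text)
```

Each value can then be classified or embedded on its own, rather than being conflated into one long string.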
Jon Krohn: 27:17
That makes a lot of sense. All right. So, you’ve given us that frame of reference, but we’ve also already talked about case studies. So you’ve talked about being able to provide red flags. So, you can train a model or fine-tune a model to… There’s a number of levels actually, now that I think about it. So, we have in the Natural Language Processing space, it’s common to use a language model that’s trained on all of the English on the internet, or all the English in Wikipedia.
Horace Wu: 27:52
Yes.
Jon Krohn: 27:53
And in your case, I imagine you want to fine-tune the language, the weights in this big natural language model, based on legal language in particular, and then perhaps one step further, you could even fine-tune the model, based on a particular law firm’s proprietary database of documents. Law firms can be gigantic, and I imagine in your case, many of these kinds of tools that you’re suggesting, these machine learning tools, you’re probably not expecting your early adopters to be a law firm with one or two employees, you’re expecting probably the big law firms. I don’t know, they must have tens of thousands of employees at the big ones?
Horace Wu: 28:38
The largest law firms in the world may have 10,000. The legal sector is highly fragmented, and you raised a really good point, Jon, and that is, what we use the language models for, and how we train the language models. We deploy the language models privately with a law firm. We do not have shared language models between law firms, because of the confidential nature of the data. And that’s something quite unique to the legal sector.
Horace Wu: 29:10
So, when we deploy stuff, we do fine-tune, if the law firm lets us, on their documents and their data. And a large law firm may have somewhere north of 100 million documents in their database, which is a very rich source of information, if you want to then refer [crosstalk 00:29:31]-
Jon Krohn: 29:30
Agreed. Especially if they’re big documents, like you were talking about, hundred-page documents, in some cases.
Horace Wu: 29:36
That’s right.
Jon Krohn: 29:37
That’s a huge amount of natural language data.
Horace Wu: 29:39
That’s right. And so once you fine-tune,-
Jon Krohn: 29:41
[inaudible 00:29:41]
Horace Wu: 29:43
… Not every law firm’s willing to go that far. And in fact, I would say at the moment, given the state of the technology, very few are, but we see the trend is people will be more willing, and we think over the next decade, this would almost become the norm.
Jon Krohn: 29:58
Nice, okay. So, we’ve talked about machine vision aspects, natural language aspects, fine-tuning these big language models, do you feel there’s any other specific case studies other than being able to identify red flags in language that you might not want an individual lawyer to have? So having that pop up automatically. So, I guess that’s similar to the way that any consumer has with a Microsoft product or a Google Office product, you have something that pops up when you make a spelling or grammar mistake. This is similar, but it’s like, “This is a legal mistake.”
Horace Wu: 30:33
Jon, you’re right. This is conceptual. You should not be giving away X in this scenario. So, in terms of use cases, this might be a really good segue to talk about how we design applications for lawyers, and how we go about building our software, and it’s not driven by, sad to say, it’s not driven by how awesome the science is, and how we can use this NLP model to achieve things that no one has done before. Because we are a business, we do have to design first from the perspective of, “Well, what use case are we addressing, and how do we fit within the way lawyers work?” And when you monitor and look at how lawyers work today, most of their work is done inside of Microsoft Word. They exist inside of two key pieces of software, Microsoft Word and Microsoft Outlook. That’s kind of your ecosystem for lawyers.
Horace Wu: 31:33
So when we design stuff, we have to look at, “Well, what did they do today, and what do they want to achieve as the outcome?” Therefore, map out their journey and go, “Here’s where we’ll insert ourselves in this longer journey.” And the two use cases I’ve mentioned so far, which are helping lawyers revise and draft better clauses, are towards the end of the journey where they already have a client, they’re negotiating, and they want to make sure that the words they put in the document, reflect the outcome, and will cause the outcome that they want.
Horace Wu: 32:07
Whereas the red flags are towards the start of the journey, where they might be seeing a document for the first time, or where they get a document back from the opposing counsel, and they have to find, “Hmm, where has this person done something sneaky?” And so that red flags report is a shorthand way of taking advantage of the firm’s knowledge, not just the individual lawyer’s knowledge, to go, “Hmm, this is bad. We should not be doing that, and here are the reasons why, and here’s how we reacted in the past.”
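To make the red flags report concrete, here is a minimal sketch of the idea described above: each flag carries the reason it is risky and how the firm reacted in the past. Everything in it is an assumption for illustration. The `RedFlag` class, the `FIRM_RED_FLAGS` entries, and the plain substring matching are invented here; the conversation does not describe how Syntheia detects red flags, and a real system would use NLP models rather than keyword lookup.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RedFlag:
    phrase: str         # wording the firm treats with caution (hypothetical)
    reason: str         # why the firm considers it risky
    past_response: str  # how the firm reacted in the past


# Hypothetical firm knowledge, invented for this example.
FIRM_RED_FLAGS = [
    RedFlag(
        phrase="unlimited liability",
        reason="Exposes the client to uncapped damages.",
        past_response="Proposed a liability cap tied to fees paid.",
    ),
    RedFlag(
        phrase="assigns all intellectual property",
        reason="Gives away ownership the client may need to keep.",
        past_response="Offered a licence instead of an assignment.",
    ),
]


def red_flag_report(
    clauses: List[str], flags: List[RedFlag] = FIRM_RED_FLAGS
) -> List[Tuple[int, RedFlag]]:
    """Return (clause number, flag) pairs for every clause containing a flagged phrase."""
    report = []
    for number, clause in enumerate(clauses, start=1):
        lowered = clause.lower()
        for flag in flags:
            if flag.phrase in lowered:
                report.append((number, flag))
    return report


clauses = [
    "The Supplier shall deliver the goods within 30 days.",
    "The Customer accepts unlimited liability for any breach.",
]
report = red_flag_report(clauses)
for number, flag in report:
    print(f"Clause {number}: {flag.reason} Previously: {flag.past_response}")
```

The point of the structure is that the report draws on shared firm knowledge (the flag list) rather than on what one lawyer happens to remember.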
Jon Krohn: 32:39
Cool, I love that. That does sound like a superpower. That sounds like something that could be hugely useful. All right. So to make this happen, to have these kinds of tools show up in Outlook or Word, how do you do that? I have no idea. I’m used to web apps, how do you get a button to appear in Microsoft Word?
Horace Wu: 33:05
Yeah. The Microsoft guys are actually quite good at this. They’ve made their add-ins ReactJS-based, so it feels like, and you program it just like, a web app. The infrastructure-
Jon Krohn: 33:18
Yeah.
Horace Wu: 33:18
Yeah, yeah. It’s probably a change over the last decade, that it didn’t use to be like this, but now if you’re a JavaScript developer, you can develop for the Microsoft suite. So in our stack, when you mention web apps, it’s very similar. We start off with, “Well, this is a front end, which is designed to sit inside of the Microsoft Office suite, and that plugs into some software on the backend server side, where in our case we use Node.js, and that plugs into a microservices ecosystem, where you have specific modules that service different parts of your NLP pipeline, and it all sits on top of a database that contains your legal knowledge.”
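The layering Horace describes (specific modules that each service one part of the NLP pipeline, sitting on top of a database of legal knowledge) can be sketched roughly as below. This is an illustration under stated assumptions, not Syntheia's code: the stage names, the `KNOWLEDGE_BASE` dictionary standing in for the database, and the keyword lookup standing in for an NLP model are all hypothetical. In a real microservices deployment each stage would be a separate service rather than a local function.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
Stage = Callable[[Payload], Payload]

# Hypothetical stand-in for the database of legal knowledge.
KNOWLEDGE_BASE = {
    "confidentiality": "Firm precedent: mutual confidentiality clause.",
    "indemnity": "Firm precedent: capped indemnity clause.",
}


def split_clauses(payload: Payload) -> Payload:
    """Stage 1: break the document text into one clause per non-empty line."""
    clauses = [line.strip() for line in payload["text"].splitlines() if line.strip()]
    return {**payload, "clauses": clauses}


def tag_topics(payload: Payload) -> Payload:
    """Stage 2: label each clause with topics (keyword lookup stands in for NLP)."""
    topics = []
    for clause in payload["clauses"]:
        lowered = clause.lower()
        topics.append([topic for topic in KNOWLEDGE_BASE if topic in lowered])
    return {**payload, "topics": topics}


def attach_precedents(payload: Payload) -> Payload:
    """Stage 3: look up stored knowledge for every topic found in each clause."""
    precedents = [
        [KNOWLEDGE_BASE[topic] for topic in clause_topics]
        for clause_topics in payload["topics"]
    ]
    return {**payload, "precedents": precedents}


def run_pipeline(text: str, stages: List[Stage]) -> Payload:
    """Pass a payload through each stage in order, as a back end might call services."""
    payload: Payload = {"text": text}
    for stage in stages:
        payload = stage(payload)
    return payload


document = (
    "Each party shall keep the terms under strict confidentiality.\n"
    "The Supplier gives an indemnity for third-party claims."
)
result = run_pipeline(document, [split_clauses, tag_topics, attach_precedents])
print(result["topics"])
```

Because every stage takes and returns the same payload shape, one stage can be rewritten or replaced without touching the others, which is the property that lets specialized team members work on separate parts.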
Jon Krohn: 34:05
That’s super cool. I had no idea, and that sounds very much like the stack that we use at my company for building web apps. Yeah, Node in the back end, React on the front end, very cool. So, do you ever roll up your sleeves and get down and dirty in Python code yourself when you’re building [crosstalk 00:34:26]? We talked about the production deployments, but when you’re designing these models or fine-tuning these models, we’re working in Python, right?
Horace Wu: 34:34
We are working in Python if we’re talking about NLP. And I’m very happy to say my team does not have to suffer my terrible coding. Generally, my role is limited to writing out the logic, and the pseudo code that says, “Hey, this is what we need to do in order to achieve this outcome.” And the developers who are much better than I am at writing code, then would actually make it work, and then we’ll collectively test, we’ll collectively make sure to achieve the business outcome. But I sit more on the design end of the spectrum, and I try to keep as far away from the code as possible, unless I’m doing a QA, and I’ll read something that is obviously wrong, in which case I will write a comment, but the quality of my coding is a little subpar.
Jon Krohn: 35:32
You have done some… I know you can write some code. I know it’s something you’ve studied and being up to date on the latest and greatest in Natural Language Processing, for example, is something that we’ve talked about before, for somebody who comes from a law background, I’m constantly blown away at how much you can really get into the weeds on the technology that’s being used on the backend and up and down the stack actually. So, you are a tech founder despite [crosstalk 00:36:02].
Horace Wu: 36:05
I think the legal background actually does help a little bit. There are a lot of parallels between how lawyers think and how programmers think. Yeah, and it’s not obvious unless you’ve actually done both. In contract drafting, for example, you use definitions, which are very similar to functions inside programs. You write a function and then you just call it once. Well, you call it multiple times, but you only need to write it once.
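The parallel between contract definitions and functions can be shown with a small sketch. The definition, the clause names, and the data fields below are invented for illustration; they are not taken from any real contract or from Syntheia's software. The defined term is written once as a function, and each clause that relies on the term simply calls it.

```python
def is_confidential_information(item: dict) -> bool:
    """Definition: 'Confidential Information' means information that is marked
    confidential and is not already publicly available."""
    return item["marked_confidential"] and not item["publicly_available"]


def may_disclose(item: dict) -> bool:
    """Clause 5 (hypothetical): a party must not disclose Confidential Information."""
    return not is_confidential_information(item)


def must_return_on_termination(item: dict) -> bool:
    """Clause 9 (hypothetical): Confidential Information is returned on termination."""
    return is_confidential_information(item)


pricing_sheet = {"marked_confidential": True, "publicly_available": False}
press_release = {"marked_confidential": True, "publicly_available": True}

print(may_disclose(pricing_sheet))               # the definition applies
print(must_return_on_termination(press_release))  # the definition does not apply
```

As with a defined term in a contract, changing the definition in one place changes the meaning of every clause that uses it.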
Jon Krohn: 36:32
Right, right, right, right, right. I had never thought of that.
Horace Wu: 36:35
Yeah.
Jon Krohn: 36:35
That’s brilliant. All right. So, if you’re not writing the code yourself, you have hired a team, so tell us a bit about that. Tell us about as much as you want about the team that you have, and things like, I know that they are entirely… So prior to COVID, this was a bit less common, but your team is entirely distributed around the world. So maybe talk about how that works, how you operationalize a completely distributed company, and then what kinds of skillsets you have to make this application work from the NLP, all the way up to the ReactJS, executing in Microsoft Word.
Horace Wu: 37:21
Yeah. I think we were lucky to start our company at the time where microservices became almost commonplace, and it really facilitated us building a team that does very specialized work, but then each part can stand on its own. Our team has around 10 people, five, six are full-time, the others are part-time, and all of them are distributed around the world. So for us, the most important thing is communication. We are on Slack and JIRA almost all the time, we keep each other posted on progress, we keep each other posted on forecast, where we are in the sprint, and what we try to do is we try to make it very clear at the start of every sprint, what everyone’s responsibilities are.
Horace Wu: 38:12
So for our Python team, they work almost exclusively on components that use NLP and machine vision or computer vision. And for our JavaScript team, they work exclusively on front end and backend code, and they also do the database work because that’s kind of, historically, where it made more sense for our team. And when you look at where our team is based, our machine learning team is almost exclusively based out of Cairo. And that’s not because of our choice. Yeah, yeah, and I get that reaction a lot when I say Cairo, people go, “Hang on. Not Eastern Europe?” And we got lucky. We found a PhD who is a data scientist, and he is Egyptian by background, and he lives in Calgary in Canada. So, we effectively hired people that used to be in his class when he was teaching in Cairo. So that’s our NLP and Python team… Our JavaScript team started in Australia, we had a really amazing 15-year full-stack developer. And from him, we started then hiring additional developers one at a time. And we went through this process where every new developer he would bring on, we gave him a very distinct, discrete project for two weeks.
Horace Wu: 39:53
We don’t go through coding interviews, we don’t go through, what you might see in the big companies because we find that’s not necessarily the best way to hire for us. For us the personality really matters, and how they collaborate really matters, which doesn’t come out in an interview. So we put them on a two-week paid trial, we pay them for their work, they develop it, and then we decide after that, do they fit? Is the code good enough? And if so, they become a part of our team. And that’s how we’ve organically grown who we have on this wonderful team that we have.
Jon Krohn: 40:28
Brilliant. So when you’re recruiting, it’s been largely through word-of-mouth, I guess, through people’s connections, and then you don’t do much of a code test or much of an interview. It’s kind of like, “Okay, you’ve been recommended, let’s try you for two weeks.”
Horace Wu: 40:47
It’s a little bit more structured than that. I wish it were that easy. For the NLP side, we’re lucky because the kind of pivotal person we have on our team, he used to teach. So, he knew having known the people he recommended for years, whether they were good or not good. On the JavaScript side, we do actually ask for a code sample, and we review that, and usually from every 10 applicants, we reduce down to two, and then from those two, we would test the one we liked better for two weeks, and if that’s good enough, we don’t go to the second person, we just hire the first person, and if not, we go to the second person.
Jon Krohn: 41:33
Cool. All right. So tell us if you are looking to hire, well, I guess you told me what you’re looking for, so for our listeners, if they’re interested in building super cool state-of-the-art machine vision, and Natural Language Processing applications at a tech startup like yours, a machine learning startup like yours, what should they be doing? What skills should they learn, or what skills should they be learning over the next few years?
Horace Wu: 42:02
Yeah, that’s a really interesting question, because it changes a lot, and I think the core skills that a developer and anyone in the data science space should have are understanding the problem and critical thinking. I know it’s kind of buzzy to say critical thinking, but it’s being able to interpret what someone actually wants when they tell you, “I have a problem and it is X.” Sometimes X is not the problem. Sometimes you have to dig deeper and scratch, and really try and understand the underlying causes.
Horace Wu: 42:40
So, once you’re able to do that, then the development and the problem-solving side is a little bit easier. The NLP space moves so quickly that there are new libraries all the time, there are new methodologies and different ways of approaching things all the time. So learning or rote learning something that exists today is probably not going to help you very much in five years’ time. But developing the skills to really understand the problem, and developing the skills to try and think through and learn things rapidly, are going to serve you way better in five years.
Jon Krohn: 43:24
Yeah. That’s a very general answer, but it is a good and invaluable answer. It makes perfect sense: “What skills should people be learning in years to come?” All of them, any of them; it’s specifically learning how to learn, and critical thinking. That makes sense.
Horace Wu: 43:43
I can tell you, “Yes, you absolutely need to learn Python.” But that’s the baseline. And for us, when you try and productionize things, you move from Python to Cython, and you move away from these skills that are more commonly talked about. But the ability to go, “Okay, I’ve identified a problem, that this particular component is slower than it should be, and the way to overcome that is Cython, I’ve never dealt with that before, but I can read the documentation and figure it out.” That is what is really going to set you apart as an applicant, as a developer, as someone who works in this space.
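The first step Horace mentions, identifying that one particular component is slower than it should be, can be sketched with the standard library alone. The stage names and the artificial `time.sleep` delay below are assumptions made for illustration; they do not reflect Syntheia's actual pipeline, and the sketch only shows how to find a slow stage, not how to rewrite it in Cython.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_stages(
    stages: List[Tuple[str, Callable]], data
) -> Tuple[object, Dict[str, float]]:
    """Run each stage in order and record how many seconds each one took."""
    timings: Dict[str, float] = {}
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def clean_pages(pages: List[str]) -> List[str]:
    """A cheap stage: strip surrounding whitespace from each page."""
    return [page.strip() for page in pages]


def analyse_layout(pages: List[str]) -> List[str]:
    """A deliberately slow stage; the sleep stands in for heavy computation."""
    time.sleep(0.05)
    return [page.upper() for page in pages]


output, timings = time_stages(
    [("clean", clean_pages), ("layout", analyse_layout)],
    [" page one ", " page two "],
)
slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]:.3f}s)")
```

Measuring first keeps the optimization effort focused: only the stage that dominates the runtime is a candidate for rewriting in something faster.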
Jon Krohn: 44:25
Well, let’s touch on Cython for a second.
Horace Wu: 44:28
Oh gosh, okay.
Jon Krohn: 44:28
So, Cy comes from the C at the beginning, which is related to the C Programming Language, all right?
Horace Wu: 44:37
That’s right.
Jon Krohn: 44:38
So how does Cython… I’ve never coded in Cython. I don’t think I’ve ever seen Cython code. Does it look a lot like Python, or does it vary?
Horace Wu: 44:52
Okay. I can talk about this, but I’ll be a terrible person to talk about this. I have seen Cython code only in passing. What I know is-
Jon Krohn: 45:01
Yeah, it’s still better than what I know.
Horace Wu: 45:04
Oh, gosh. I can barely read it. I would say I’m conversant in Python, but in Cython, not at all.
Jon Krohn: 45:13
Don’t they [inaudible 00:45:14] in Python?
Horace Wu: 45:18
Google Translate. So, one of our team at the moment is now learning Cython from scratch, in order to solve this problem. In one of our computer vision pipelines, it takes, for some documents, about five to eight seconds to OCR and understand the page layout. And we know that we can reduce that to less than half a second. So that’s why we’re refactoring the code and building a new piece.
Jon Krohn: 45:54
Beautiful. That is a great answer. It’s a lot more than I knew before we started talking about Cython a few minutes ago, so beautiful. Something to look into, if you’re looking to make your production Python code even faster, you just need to translate it into Cython.
Horace Wu: 46:09
So some lower level machine language.
Jon Krohn: 46:13
Right, right, right. Cool. Well, so with your journey, the way that you have come from a legal background, and in a lot of ways, the legal is still a big part of your life, you’re still working in Nearmap a bit, and obviously having a legal tech startup, legal is a big part of your life, but is there anything that you would do differently if you could, looking back over your career?
Horace Wu: 46:44
Over my career, no. I think I got really lucky because I like being a lawyer, I like problem-solving using the law, and what I might change is different things I’ve done over the last couple of years, especially in the startup space. The startup space is very much experimental. You do something, you see if it works, and then you find out it doesn’t, you learn a lesson, you pivot, you do something else. And you just keep iterating and learning more.
Horace Wu: 47:18
Some of the things that I did, I would look back and go, “Ah, I wish I didn’t do that. I wish I didn’t spend X amount of dollars on Facebook marketing.” Or, “Well, ah, I wish we had…” Just little things like that. But I wouldn’t say I would change anything so much that I have any regrets.
Jon Krohn: 47:48
Cool. That’s a great answer. And then looking ahead in your career,-
Horace Wu: 47:54
Yes.
Jon Krohn: 47:55
… are there any particular aspirations or inspiring people that push you to better yourself? When you retire, what are you hoping to be able to look back on?
Horace Wu: 48:06
I’ve never been asked that question before. And I think it’s because in the startup space, you don’t really get asked the question of, “Think about your retirement.” So for me, I would say probably three things, because I think there were three aspects to that question. First is what I’d like to see as the outcome of Syntheia. What I’d really like is if Syntheia becomes almost an integral piece of how lawyers think about problem-solving in the legal space. Right now legal data is highly cost. If we can refine that, if we can make that information more visible and accessible, people can make better decisions. And it’s not just lawyers, it’s anyone that’s impacted by contracts, by laws and so on. So having that as an outcome would be amazing.
Horace Wu: 48:59
And in terms of what I’d like as a legacy, what I’d like when I’m retired, I would like people to have achieved great things from having worked with Syntheia. So my immediate team, I know they’re going to leave eventually, they’re going to go do greater and bigger things, and I want them to go and achieve greater and bigger things. And then one day, get together and have a beer. That would be amazing for me. When I’m retired, I’m going to need drinking buddies, and that’s who I would like to see. So yeah, I hope it touches people in a positive way.
Jon Krohn: 49:35
It makes a lot of sense, and it certainly has the potential. This is a huge tool, it could be a game changer. I think that’s the same kind of revolution that we’ve seen in a lot of other industries in the last decade or so. In the financial industry 10 years ago, 90, 95% of trades were executed by people manually on a trading floor. And today only a few percent of trades are executed by people shouting in a pit, and everything else is executed by machines, directed by machines. Maybe a human in the loop somewhere in some cases, but certainly machine executed. So with the legal space, it’s a little bit trickier of a problem because natural language data are a lot more complex than a ticker tape: the price of something moving around, and correlations between different commodities or whatever.
Jon Krohn: 50:38
But there’s a huge potential here when you’re talking about any given law firm, having hundreds of millions of documents, the huge resource that could be drawn upon to augment the capacity of an individual lawyer, and it could very well be the case. You could imagine in legal companies in a few years, Syntheia could be a verb we use like the way you would say, to Google something, they use Syntheia, [inaudible 00:51:09].
Horace Wu: 51:08
Why didn’t you Syntheia that document?
Jon Krohn: 51:15
Yeah.
Horace Wu: 51:15
It’s very much an under-serviced sector. And part of the problem in the past has been that Natural Language Processing wasn’t good enough to understand, and to weave through, the complexity of these documents. But it is now at least starting to be good enough that you can take a lot of meaningful actions based on what the text of the document is telling you to do. So we hope we are riding the start of that wave, and if we were leading that wave, that would be amazing as well, but we think there are a lot of people starting to get into this space now, and it’s going to be a very exciting and a very active space in the next decade.
Jon Krohn: 52:00
Nice. So, a question that I always ask near the end of these guest podcasts is what are you reading right now? Do you have any book recommendations?
Horace Wu: 52:11
I don’t have a specific book recommendation. My reading list at the moment comprises a lot of sales and sales technique books that I’ve downloaded onto my Kindle. So, there’s this bad boy here.
Jon Krohn: 52:27
[inaudible 00:52:27]
Horace Wu: 52:27
He found a life. And they’re not necessarily great books, they are books that tell you pretty much the same things in different ways. But what I discovered about myself is, I’m not the strongest sales person in the world. And in fact, the books would tell me that my sales technique is the exact inverse of what I should be doing. I don’t put any pressure on, and I need to put a little bit of pressure.
Horace Wu: 52:59
[inaudible 00:52:59] sense of scarcity, for example, and putting in deadlines and those sorts of things. So a lot of those books right now, and trying to distill them down to a few principles that I can apply to what I do as a business.
Jon Krohn: 53:16
Nice, yeah. I think that at the stage that your business has evolved to, I can see why those are the key skills that you need. So I suspect a couple of years ago, you were digging into a lot of Python and Natural Language Processing books, and now that the technology has matured a little bit, you’ve grown the team a bit and they’re handling a lot of that. You’re like, “All right. We’ve got to get the sales machine rolling.”
Horace Wu: 53:36
“We’ve got to actually get the customers to pay, and don’t just put them on pilots.”
Jon Krohn: 53:38
That’s right.
Horace Wu: 53:41
So yes, that’s where most of my mental energy goes these days.
Jon Krohn: 53:46
Beautiful. All right. I’ve learned a lot today, I think it’s been obvious, I’m sure many of our listeners have as well. How can they contact you or follow you?
Horace Wu: 53:55
Oh, well, you can absolutely contact me and reach out to me on LinkedIn, Horace Wu, you can find our website at syntheia.io, you can also find us on Twitter, but I’m not very active on Twitter. We don’t have a social media manager yet, but the plan is one day, it’ll be much more [inaudible 00:54:15].
Jon Krohn: 54:17
You may not be very active on Twitter, Horace, but I love your description of yourself on Twitter, which is, “some guy on the internet.” Very credible. Very good.
Horace Wu: 54:30
I try to tell the truth, Jon, that’s [inaudible 00:54:32].
Jon Krohn: 54:34
Nice. All right, well, thank you so much for being on the program Horace, this has been a special episode where I’ve learned a lot, you being able to transcend the entirety of the startup experience for listeners from building the technology stack, the back end, the front end, how it interacts with clients, and even a little bit about sales, hugely valuable. Thank you so much.
Horace Wu: 54:59
Thank you for having me, Jon, and thank you for listening everyone.
Jon Krohn: 55:07
Oh yeah. What a cool journey Horace has had from lawyer to machine learning startup entrepreneur. And boy, does he know his stuff. Inspiring, how he’s able to get into the weeds on his company’s models and software stack, despite having no formal scientific or technical training. It goes to show that success in data science is ultimately all about combining together thousands upon thousands of Google searches.
Jon Krohn: 55:32
To summarize today’s topics, we covered what the legal tech space is; the huge commercial opportunities emerging for applying data science in the legal industry, such as flagging contentious language in contracts for extra scrutiny, or fully automatically drafting an airtight, exactly appropriate clause, like Horace’s company, Syntheia, can do; how to bootstrap a startup without outside funding by continuing to work in another career; the cool parallel between definitions in legal contracts and functions in software; the Cython- and JavaScript-heavy software stack for building real-time model feedback as a convenient Microsoft Word add-on; and how critical thinking may be the most important skill to succeed as a data scientist or engineer at an early-stage AI startup.
Jon Krohn: 56:24
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Horace’s LinkedIn and Twitter, at www.superdatascience.com/455. That’s www.superdatascience.com/455. If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on YouTube, where we have a high-fidelity, smiley-face-filled video version of this episode. I also encourage you to follow or tag me in a post on LinkedIn or Twitter, where my Twitter handle is @Jonkrohnlearns, to let me know your thoughts on this episode. I’d love to respond to your comments or questions in public and get a conversation going.
Jon Krohn: 57:06
You’re also welcome to add me on LinkedIn, but it might be a good idea to mention you were listening to the SuperDataScience podcast so that I know you’re not a random salesperson. Finally, since this podcast is free, if you’re looking for a free way to support my work, you could leave a review of my book, Deep Learning Illustrated, on Amazon or on Goodreads, you could give some videos on my YouTube channel a thumbs up, or if you happen to have an O’Reilly subscription, you can give my books or videos a star rating in there. To support the SuperDataScience company that kindly funds the management, editing, and production of this podcast, without any annoying third-party ads, you could create a free login to their learning platform at www.superdatascience.com. Or you could consider buying a usually pretty darn cheap Udemy course published by SuperDataScience, such as my own Machine Learning & Data Science Foundations Masterclass. All right, it’s been another great episode. Keep on rocking it out there, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.