SDS 243: Geospatial Analytics: Where Data Science & Actuarial Science Meet

Podcast Guest: Dominic Roe

March 13, 2019

A great discussion with Dominic Roe on his actuarial work and on how he helped pioneer and implement a risk assessment method for insurance companies that’s now widespread across Australia.

About Dominic Roe
Dominic Roe is a Senior Underwriter and Fellow of the Institute of Actuaries of Australia. He’s worked in risk assessment and helped pioneer a method for assessing homeowner risk, based on random forests, that is now widespread across Australia. His specialty is novel statistical approaches to solving longstanding general insurance problems.
Overview
Actuaries got their start in life insurance, coming up with risk models for taking on people of various ages and conditions. Since then, they’ve moved into broader data analysis and predictive modeling. There are some key similarities but also stark differences between actuaries and data scientists. While both rely on an understanding of data, data scientists dig deeper and understand the data on a more intimate level to create the most predictively powerful algorithms. Actuaries have the added considerations of legislation, a professional code of conduct, and what conclusions to draw from the data provided.
Dominic had an early love of problem solving. He believes any actuary can benefit from stepping into data science. My personal exposure to actuarial work has been to old-school, reliable approaches that have been around for a very long time. Dominic’s work feels innovative. It is important to note, though, that different countries, and even different states, have different regulations on what actuaries can and can’t do as part of their practice, including tools, pricing, and anti-discrimination laws based on ethnicity, gender, and other factors.
Dominic’s case study is a fascinating one, looking at the use of random forest, a method that builds many decision trees. He uses this to create risk models among households across the country in 50 different categories for risk level. He then applies a formula to shrink the slope of the data by way of a function of how many predictors you have. By this he means utilizing methods to reduce the discrepancy between results on training data and validation data. It’s extremely involved and he explains it much better than I could here. The good news is, this model got rolled out in Australia for homeowners. Dominic and his team pioneered this work, creating the first real application of this method, which is now utilized by other insurance companies across the country. The fascinating thing is, at its core, the project feels like a data science project, outside of the regulatory parts of the actuarial practice. I’ve dabbled in geospatial analytics at Deloitte but the work has exploded and evolved since I worked in that space.
Dominic’s advice to those who want to jump in or jump back in is to look in your own backyard. Look at your neighbors, how they live, where they go for holidays, how their property is affected by weather, and more. Geospatial analytics is a way to tie yourself to your environment and to the people around you. As Dominic notes, we’re all on the planet together and geospatial analytics and the overall science in this field is going to continue to become a very important discipline for data scientists. There are plenty of applications for the future of the practice across various risk assessments including overlaps with telematics, vehicle routing, and other places to maximize safety, cost, and resources. Geospatial work in action can save companies millions each year.
He closes out noting that the future is bright for geospatial analytics. Though Australia is a bit behind the US and Europe in this work, the forward progress is obvious.
Additional notes from Dominic:
Dominic’s role as Senior Underwriter at Cover-More Group involves improving profitability for the company. To do this, he must assess, compare and propose changes to referral rules – rule-sets that tell an insurance provider what risks they can accept and which they should decline.
The first step, notes Dominic, is to understand the metrics relevant to the problem. These are:
  1. The expected profitability of a risk,
  2. The volatility of a risk, i.e. the capacity of the claimant to make large requests, and
  3. The reliability of a risk’s profit potential, i.e. the likelihood that the profitability will deviate from the estimate.
Failing to account for one or more of these measures risks us overlooking an essential element of the business problem. Dominic says:
– Inexperienced data scientists will focus only on the expected profitability, as it is the simplest metric. However, the nature of insurance is such that only a relatively small proportion of your risks – typically between 1% and 15% – actually make a claim in any one year. Further, 80% of the claims costs in any one year are often caused by less than 20% of the claimants.
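The three metrics above can be made concrete with a small simulation. The sketch below is illustrative only: the premium, claim probability and claim-size distribution are made-up assumptions, not figures from Dominic, and mapping "reliability" to the standard error of the estimated mean is one possible reading of his definition.

```python
import random
import statistics

def simulate_annual_profit(premium, claim_prob, mean_claim, rng):
    """Profit for one policy-year: the premium, minus a claim that occurs with probability claim_prob."""
    if rng.random() < claim_prob:
        # Exponential claim sizes are an illustrative assumption, not from the episode.
        return premium - rng.expovariate(1.0 / mean_claim)
    return premium

def risk_metrics(profits):
    """Return (expected profit, volatility, standard error of the expected profit)."""
    n = len(profits)
    expected = statistics.fmean(profits)    # 1. expected profitability
    volatility = statistics.stdev(profits)  # 2. volatility of a single risk
    std_error = volatility / n ** 0.5       # 3. how far the estimate itself may be off
    return expected, volatility, std_error

rng = random.Random(0)
# Hypothetical segment: $500 premium, 5% claim chance, $8,000 average claim.
profits = [simulate_annual_profit(500.0, 0.05, 8000.0, rng) for _ in range(20_000)]
expected, volatility, std_error = risk_metrics(profits)
print(f"expected={expected:.2f} volatility={volatility:.2f} std_error={std_error:.2f}")
```

Because only a small share of policies claim, the volatility of a single risk comes out far larger than its expected profit, which is why looking at expected profitability alone can mislead.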
In this episode you will learn:
  • Actuaries vs. Data Scientists [5:48]
  • Dominic’s professional experience as an actuary [12:00]
  • Regulations on tools in actuarial practice [15:21]
  • Dominic’s case studies in geospatial analytics and random forest [20:43]
  • Advice for those interested in geospatial analytics and applications [44:15]
Items mentioned in this podcast:
Follow Dominic
Episode Transcript

Podcast Transcript

Kirill Eremenko: This is episode number 243 with Senior Underwriter, Dominic Roe.

Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. Each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience Podcast ladies and gentlemen. Super excited to have you on the show here today, and joining us from Brisbane, Australia, we have Dominic Roe. We had a fantastic chat today, Dominic is a senior underwriter in his role. He’s professionally an actuary and he is a fellow of the Actuaries Institute of Australia, which is not an easy feat, you have to pass 12 exams involving statistics and other complex topics to be a member of this institute.
Kirill Eremenko: Today, what you will find out in this podcast super, super exciting stuff. First of all, of course you’ll find out, or we will refreshen only because we have had actuaries on this podcast before, but we’ll refresh on what the difference between an actuary and a data scientist is and you’ll definitely cement that in and be very comfortable in understanding the differences. And then in terms of the similarities between the two professions, we will have some amazing things for you today.
Kirill Eremenko: Dominic shared one of his epic case studies of his work and we went into lots of details. If you’re looking for a technical podcast, if you want to up your skills on modeling and how to think about models and especially in the B2C space, this is the podcast for you. We talked about how Dominic built a model, it involves decision trees, random forest, geodemographic segmentation, stochastic sampling. We go into lots of little nitty gritty details. I think this is a podcast like no other we’ve had before, so highly, highly recommend to tune in and follow along with us on how he built this model. It’s actually two separate models in one. Crazy story, I think you’ll love it.
Kirill Eremenko: Then after that, towards the end of the podcast we talked a bit more about specifically geodemographic segmentation and you’ll find some interesting use cases, specifically three use cases of geodemographic segmentation in data science, and so if you’ve never encountered geodemographic segmentation or you’ve briefly worked with it before, this will be a good place to enrich your knowledge. Super pumped about this episode. Let’s dive straight into it and start data sciencing, and without further ado, I bring to you Dominic Roe, Senior Underwriter and Actuary.
Kirill Eremenko: Welcome back to the SuperDataScience Podcast ladies and gentlemen, super excited to have you on the show because today I have Dominic Roe calling in from Brisbane, Australia. Dominic, welcome. How are you going today?
Dominic Roe: Hey Kirill, I’m great. How are you going?
Kirill Eremenko: Going great. It’s actually funny how I asked you just before the podcast I was like, because I normally ask that question to do the audio check, you know, how’s the weather? And then I was asking you, how’s the weather in Brisbane when I’m in Gold Coast, which is literally a hundred kilometers away and we have the same weather.
Dominic Roe: Yeah, very rarely is it different.
Kirill Eremenko: Yeah, you just have to look out the window. It’s been pretty odd, and as you mentioned, it’s very strong surf. I went to the beach today in the morning and very strong, powerful waves, you know like knee deep, they like wipe you off your feet.
Dominic Roe: Yeah.
Kirill Eremenko: Do you think it might be to do with the full moon?
Dominic Roe: No I think it’s actually to do with the cyclone. There’s a cyclone off the coast of Queensland near Fiji, and yeah, this is whipping up all of the waves, the swell, et cetera, so yeah, I think that’s the main cause. But of course yeah, I know it’s a full moon at the moment as well, which makes it even more extreme.
Kirill Eremenko: Yeah, the tides go up higher or lower I think.
Dominic Roe: Yeah, exactly. Yeah, that’s right. Yeah.
Kirill Eremenko: Yeah. Well it was really cool to catch up, so for our listeners, Dominic and I met through a common friend of ours, Bill Konstantinidis. Bill, huge shout out, huge thank you for listening to this. Bill’s an executive in the insurance and data science analytics in very, like a very influential person in Australia, and was very cool to get in touch with you Dominic and we had an amazing chat, remember, like last week during the coffee break. I really enjoyed it.
Dominic Roe: Yeah, he was great. Yeah, I mean, I really liked how you went over some of the most cutting edge techniques, and you were able to explain some of the latest trends in data science, which I’m not exactly up to scratch on, the very latest trends, so that’s great.
Kirill Eremenko: Thank you. On my side, I really, I felt engaged and immersed into what you were saying when you were sharing some of your recent wins and case studies from your work. I thought they were so cool and I just had to invite you to the podcast, because I want you to share that with the world. It’s going to be quite an exciting session today, really looking forward to it. But, before we get started, give us a quick rundown, so you are, from what I know, you’re a senior underwriter, you work for Cover-More Insurance, your profession is you’re an actuary, so we’ll go into that.
Kirill Eremenko: What’s the difference between an actuary and data science and how they are, you know, what they’re sharing more and more of time. You’re a fellow of the Institute of Actuaries, you had to pass a dozen or, yeah, a dozen exams to become a fellow that, so it’s all quite an involved career that you have. Tell us a bit more about what exactly is it that you do?
Dominic Roe: Yeah, sure. I suppose, what an actuary is for your listeners, a bit of background information I suppose, because not all your listeners might have heard of an actuary before. An actuary is essentially, originally it comes from the life insurance industry, where the actuary would essentially create all of the life tables, which tells the insurance company how much they need to charge to insure people at different ages and different genders, et cetera. Then around the 1970s, actuaries moved from life insurance, they also started working in non-life insurance, so what people commonly call property casualty insurance, so that includes auto insurance as well as home and contents insurance, as well as travel insurance, different types of insurance.
Dominic Roe: Basically, an actuary will, at the very core skillset would be probability and statistics as well as a strong financial and accounting background. But combined with that, these days in particular, actuaries are very much competent in coding, in data analysis of very large datasets and there’s an increasing overlap in the skillset between data scientists and actuaries. So building predictive models is absolutely the bread and butter of what actuaries do. These days, actuaries are very much in the life insurance as well as the non-life insurance, including things like health insurance, as well as banking and superannuation or pensions as most of the world calls it. Yeah.
Kirill Eremenko: Yeah. Very interesting how that all stemmed from actuarial sciences and indeed now is sometimes it’s hard to distinguish, hard to tell the difference between what’s the role of an actuary and what’s the role of a data scientist. What would your definition be like? What’s the difference between a data scientist and an actuary at this point?
Dominic Roe: Yeah, yeah. That’s an interesting one. I suppose an actuary, I suppose they used to have the skills of a data scientist, to me a data scientist is a person who understands the structure of the data, understands how to clean the data, understands many different approaches in terms of algorithms and different ways of modeling, both in a supervised and in a non-supervised setting in order to understand the underlying structure of the data and build the most predictively, the most powerful models possible.
Dominic Roe: An actuary needs to know that, probably not to the same depth as a data scientist, but the overlying considerations for an actuary are around understanding the legislation, understanding the industry that you’re in, and understanding things like professional code of conduct, which says things like you need to be impartial in terms of when you’re writing statutory reports, and really I suppose putting all those things together to act in the best financial interests of the company.
Dominic Roe: With an actuary and that profession, there are a lot of, it’s actually written into the law in a lot of countries that an actuary has to sign off on sets of rates or premium rates that are being charged by an insurance company, and the actuary has to make a formal signed declaration that those premium rates are correct in terms of accurately reflecting the cost of claims that are expected to be borne out of the insurance policy that the company is writing. There’s that additional layer I suppose of professional standards and legislative accountability with an actuary that a data scientist wouldn’t have.
Kirill Eremenko: Okay. Very interesting. I think that’s a very accurate description. I would assume it would be quite easy to go from an actuary to a data scientist and from a data scientist to an actuary, would you agree or would you say it’s a bit more involved in [inaudible 00:10:34].
Dominic Roe: Yes. Yeah, I mean, I know a few actuaries that have gone full, very, very deep into the data science world, and they’ve absolutely loved it, and they love being completely free of all the statutory obligations and you know, the boring ticking the box, which comes with some components of actuary work, so they feel freer in terms of understanding the data and building models that are free of the various restrictions imposed by them. But I would say that if a data scientist tries to move into the actuarial world, unless you like, I suppose reading the law and understanding the various supervisory, regulatory bodies and all the rules and regulations they have. You may find it a little bit I suppose frustrating or cumbersome that you have to learn all these rigid rules, and not simply concentrate on the size of building the most predictive and the most powerful algorithm or model that you can possibly build. I think that, a lot of data scientists, if they did try to move into the actuary world, they may be frustrated by that additional path, yeah.
Kirill Eremenko: Yeah. Okay. I can see how this is [inaudible 00:11:59] what happened. Tell us about you. What’s your story? You started as an actuary and you’re a very experienced actuary and a senior underwriter at the moment, and a member of the Institute of Actuaries. Where does your interest for data science come from?
Dominic Roe: Yes, I suppose I’ve always been interested in building models to predict the future and this is what I really enjoyed, and I suppose that’s why it made sense for me to go down this path. I’ve always I suppose had an abstract mind and what’s fascinated me, you know, as a boy when I used to play say chess on the computer or whatever was, how the computer went about its thoughts or its rules or whatever it did in order to make the next move, that kind of abstract thinking has always fascinated me. And so whilst I’ve always liked mathematical and statistical analysis and I actually did a Master’s degree after I did my Bachelor’s degree in statistics, that component of abstract thinking and going through a thought process to work out what the best thing to do is for a given objective that hit that kind of problem has always fascinated me. Yeah.
Kirill Eremenko: That’s why you decided to expand even further beyond your actuarial knowledge and experience into data science and see what else is available.
Dominic Roe: Yeah, yeah. That’s right, and I mean, I think that all good actuaries who enjoy learning new things should step into the data science area because you can’t do the best job that you can possibly do in the world of insurance pricing, which is where my career has predominantly concentrated on without understanding the more advanced types of models that are out there today. The models that are being used by insurance companies are improving, but they are improving very, very slowly because there’s a lot of inertia in the insurance industry about using and implementing more advanced models that can say predict conversion rates, can predict claims costs, can predict the likelihood of renewal, et cetera. All these things that companies can use to maximize their profit over time. Actuaries have been somewhat slower than data scientists in terms of embracing new techniques, new models. Yeah, I think that all good actuaries should definitely step into data science to learn more about it.
Kirill Eremenko: That’s very interesting because I was actually thinking that when we met for the coffee catch up last week, some of the examples that you gave me from your work were very unusual, because I’ve spoken to actuaries before, and usually it’s mostly like logistic regression, linear regression, you know, some kind of very developed, very old school, not to say bad, but kind of like very reliable approaches that have been around for a very long time. When we caught up, you were talking about using a random forest, XGBoost and you know, starting to think about deep learning, lots of different more innovative approaches. What I was thinking is, is there any kind of regulation in the actuarial sciences on what can and can’t be used? Like in data science you can use whatever you want as long as it gets the job done and it produces great results, great predictions. In actuarial science, with all this legislation and regulatory framework, are there any restrictions on what you are allowed to apply? What kind of models and what you cannot apply?
Dominic Roe: Yeah, I think that’s a great question and the answer is, because I know you have an international audience, the answer really just depends, it varies. Within the US it varies state by state I believe. Within Australia, the Australian rules are very much relaxed. I mean if you’re talking about a non statutory class, so that’s ordinary home and contents insurance, or motor vehicle insurance, which doesn’t include the bodily injury component, just the property damage, you can actually charge whatever premium you like, you are unrestricted. The only rule essentially is that you cannot collude in your pricing. In other words, you can’t go, if you’re insurer A, you can’t go to insurer B and say to that company, just to let you know, I’m going to be charging $350 for these types of cars, 450 for those types of cars, and you know, you can’t share information in that way about pricing because then it’s considered to be a cartel by the Australian regulators and that’s considered to be an extremely serious offense.
Dominic Roe: Anything else in terms of, you know, writing a particular risk by exactly where it’s located, the exact longitude, latitude, elevation, and then incorporating additional factors such as what’s the crime rate around that particular building that you’re insuring? How close is that building to say the nearest police station, or the nearest fire station, or the nearest waterway? What’s the relative elevation, you know, if you take say, the points around that point that you’re insuring, is the water, say if the flood risk, is the water likely to inundate that particular building? All those things can be taken into account.
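Features such as "how close is the building to the nearest fire station" come down to a distance calculation on latitude and longitude. The sketch below uses the haversine great-circle formula, which is a common choice but not necessarily what Dominic's team used; the coordinates and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_distance_km(lat, lon, sites):
    """Distance from (lat, lon) to the closest of a list of (lat, lon) sites."""
    return min(haversine_km(lat, lon, s_lat, s_lon) for s_lat, s_lon in sites)

# Hypothetical coordinates, for illustration only.
home = (-27.47, 153.03)
fire_stations = [(-27.45, 153.04), (-27.60, 153.10)]
print(round(nearest_distance_km(*home, fire_stations), 2), "km to nearest station")
```

The same helper could be reused for police stations, waterways or railway stations by swapping the list of sites, giving one predictor column per feature type.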
Dominic Roe: The only other point that might be a little bit dubious from the point of view of the law is around things like ethnic factors that are based on an individual, so if you base it on individual’s say last name, that usually gives you a clue as to what their ethnicity is, obviously that would be inappropriate from an anti-discrimination perspective, in the US I believe, and I’ve never worked in the US so hopefully this is half right at least. I have been told that there’s rate filings that are required, so in other words, the insurer must give to the regulator and say these are exactly our best estimate of the correct rates, which, you know, what the cost of claims are going to be, with a margin for expenses and profit, and then so you have perfect equity between policy holders. If a 20 year old driver is predicted to have twice as many claims as a 40 year old driver, then their premium should be twice as much, it shouldn’t be effective if it’s any different, I believe that’s, that’s how it’s commonly done there, which is very, very different to Australia.
Kirill Eremenko: What about gender? Because when I get car insurance for instance, one of the questions is, are you male or female? I’m not an actuary, so I’m not sure if this is true or not, but I’ve heard that how you answer to that question affects your insurance because, you know, in some forms of vehicles, women or men are seen as more risky in driving. Is that not discrimination?
Dominic Roe: Yes. I think that’s a really good point and I actually don’t know the answer because I believe the answer in Australia is going to be changing fairly soon, that’s just what I’ve heard. I believe that in the EU, they’ve officially outlawed that, so in other words, if you’re male/female, it’s forced that the premium is the same, even though, you know, just on the average, the females are the safer drivers. In Australia, I believe that you can charge a different premium, but I may be wrong there, I’m not 100% sure here.
Kirill Eremenko: Okay, got you. That’s good that it’s, you know, things are always changing, adjusting and yeah, we’re moving forward as a society, as a whole. What you mentioned there was very interesting about the coordinates and the heights of elevation, relative elevation, police station, fire station nearby. That’s really cool, and that’s like new areas that insurers can and should be exploring. That’s where I see data science actually comes in, where you are not just doing your standard, you know, you have certain features and you make your standard model, you look beyond, you look at what else can I get? How can I combine things? That’s one of the things that I really thought about our chat last week, was the whole notion of geospatial. You have quite a bit of experience in geospatial analytics, can you tell us a bit more about that? First of all, what is geospatial and how did you get into it?
Dominic Roe: Yeah, so really, because geospatial statistics is anything at all that’s associated with the geographical area, so it’s either it can be an area or an exact latitude, longitude and elevation, x, y, z coordinate on the earth. I suppose how I got into it, I was working, this is a fair time ago, but I was working with a person who, he still works in the insurance industry, but who is very, very innovative and very progressive in terms of pushing his people to, you know, to do additional research and find out what the best ways of solving any particular problem are, and so I won’t mention the company or the person.
Dominic Roe: Basically the problem that we were trying to solve was, how do we have all this external third party data, which you know, incorporates both areas, geospatial data to do with areas, so some are small areas, some are large areas, as well as exact x, y, z location geospatial data. How do we incorporate that additional information, additional external information into insurance pricing so that the prices that we charge ultimately are closer to what the claims costs are going to be. The claims cost in the future is obviously completely random, it’s fortuitous, but it should be, you know, relatively predictable in terms of, you know, if you understand the underlying, all of the underlying policy and risk characteristics, you should be able to get it fairly close once you aggregate amongst the policy holders.
Dominic Roe: We actually tried a few different techniques in order to get the result that we wanted, so Random Forest was a technique that we use. Now Random Forest is essentially building recursive decision trees, and what it’s used for is, if you believe you have high level interactions in your data. If you think about, say a neural network, or say like a linear model, if you have say no interactions whatsoever between your variables, so in other words, the outcome only depends on essentially a linear, just an ordinary linear function of your predictors, then something like the linear model or a neural network with say no hidden layer will be perfectly fine.
Dominic Roe: What Random Forest is really good at is picking up higher level interactions. If variable A, together with B, in combination with C and D would isolate a particular segment of policy holders, that are say extremely high risk and you know, you’re able to prove that through bagging, which is sampling with replacement and you’re able to prove that through say cross validation, then you may be able to isolate segments that can’t be easily isolated through many other techniques. We used Random Forest because we had a relatively large number of predictors, so predictors all to do … Probably 500 to 600 predictors. And just to give your listeners, I suppose a little bit more of an idea of the types of predictors we had, we had things around unemployment, we had things around wealth and income, so all the demographic factors as well as things like immigration factors, distances to the nearest natural features and nearest railway station, et cetera, et cetera, nearest airport. And so you get, using all these different predictors in combination, using Random Forest, you know, what you can do is you can build up just the geographical component of the risk.
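Bagging, as mentioned above, means each tree is trained on a sample drawn with replacement from the training rows. A minimal sketch of that one step follows; it is not a full Random Forest (which also considers a random subset of predictors at each split), and the row count is arbitrary.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one 'bag' for one tree."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
rows = list(range(10_000))          # stand-in for training record ids
bag = bootstrap_sample(rows, rng)

in_bag = set(bag)
out_of_bag = set(rows) - in_bag     # rows this tree never saw
unique_fraction = len(in_bag) / len(rows)
print(f"unique rows in bag: {unique_fraction:.3f}, out-of-bag rows: {len(out_of_bag)}")
```

Each bag contains roughly 63% of the distinct rows, so the remaining out-of-bag rows can act as a built-in validation set for that tree.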
Dominic Roe: I suppose the other thing, this is what we’ve been talking about is just the geographical component, so when a policy holder comes to you, what do you know about him, let’s say it’s home insurance. You know what type of building you’re insuring, so just all of the basic, you know, what you have to tell your insurer, so the type of building, is it two stories? Is it three stories? How much do you want to insure the building for? Do you have locks on all of your external doors and windows? Do you have a security alarm? Is your home fully fenced? All those basic things, these are the predictors that sit over the top of that. And so the trickiness was, how do we take into account all of those basic policy characteristics that I just mentioned, but also at the same time, build up geographical zones in which to determine our insurance prices from, from all these hundreds of external factors.
Dominic Roe: The way that we did that was, we had what is called a GLM, so that’s a generalized linear model where the target of what we’re trying to predict there is the frequency of claim, so the claim frequency is being predicted by all the ordinary policy characteristics they have, just the ordinary risk characteristics. Then what you can do is, you can get the residual effect. When you have your zonal effect within that model, then you take all of your residuals and once you add that to the zone effects, after you fitted all your parameter estimates, that zone effect is essentially what you want to try to predict using all of your external factors.
Kirill Eremenko: Sorry, I missed it, what is a residual again?
Dominic Roe: The residual is essentially the error from the GLM, so you build a GLM, which is predicting the frequency of claim against the policy that you’re insuring, against the home, and then the residual effect is going to be essentially the error, so if you predict say a claim frequency of let’s say 3%, but what you actually see in your past training data is actually 5%, then you just have a residual there of 0.2. Does it make sense?
Kirill Eremenko: 0.2?
Dominic Roe: Yeah, so it’s just the difference between the actual and predicted-
Kirill Eremenko: So 2%.
Dominic Roe: Yeah, so it’s just the error. Yeah, that’s correct, just the error.
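The residual in the example above is simply actual minus predicted claim frequency: 5% minus 3% is 0.02, or two percentage points. The sketch below aggregates that idea by zone. It is a simplification of what Dominic describes (it does not fit a GLM or add the residuals back to fitted zone effects), and the record layout and zone names are hypothetical.

```python
from collections import defaultdict

def zone_residuals(records):
    """records: iterable of (zone, exposure_years, predicted_frequency, claim_count).

    Returns, per zone, actual claim frequency minus predicted claim frequency.
    """
    exposure = defaultdict(float)
    expected_claims = defaultdict(float)
    actual_claims = defaultdict(float)
    for zone, years, predicted_frequency, claims in records:
        exposure[zone] += years
        expected_claims[zone] += predicted_frequency * years
        actual_claims[zone] += claims
    return {
        zone: (actual_claims[zone] - expected_claims[zone]) / exposure[zone]
        for zone in exposure
    }

# Hypothetical data: the model predicts a 3% frequency in both zones.
records = [
    ("zone_a", 100.0, 0.03, 5),   # 5 claims seen in 100 policy-years (5%)
    ("zone_b", 200.0, 0.03, 6),   # 6 claims seen in 200 policy-years (3%)
]
for zone, residual in zone_residuals(records).items():
    print(zone, round(residual, 4))
```

A zone with a consistently positive residual is one where the policy-level model under-predicts claims, which is the signal the external geographic predictors are then asked to explain.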
Kirill Eremenko: What do you do with this residual?
Dominic Roe: Then what we have to do is, we have to essentially sample all of our exposure and our clients, so the exposure is essentially saying, hey, we got one policy and it’s on risk for one year, and the claims are, if any, you know, what actual claims are being made for damage to the property, and then what we want to do is we want to create a balanced data set between the target variable of one yes the policy holder had a claim, and zero, the policy holder did not have a claim. For tree based algorithms, when you’re building a classification tree, so the outcome is either one or zero, if you have in your training data, a very imbalanced data set in terms of the number of rows is not approximately equal in terms of the ones and the zeros, then often you’ll get a result which is highly unreliable-
Kirill Eremenko: Interesting. I remember this is what we talked about last week. This was very interesting, guys if you’re interested in regression random forest and decision trees this is a valuable tip. So balanced data sets, right? Sorry to interrupt.
Dominic Roe: Yeah, that’s right, yes. Yeah, no worries. When we’re building a classification tree for the Random Forest, yes we want to create a training data set, which is approximately equal for the target variable ones and zeros. In insurance, we usually have a very imbalanced data set because you have say 2 million policies that you’re writing for a year and you only have, let’s say 50,000 claims or so, just for say a moderate sized insurer. You have the ratio there of claims, has claims to having no claims is something like 500 to one. It’s very, very imbalanced. You have to balance that up if you choose to use a tree based algorithm. Otherwise, yeah [inaudible 00:29:55].
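As a quick check on the figures quoted above: 2 million policies with 50,000 claims works out to roughly 39 policies without a claim for every policy with one (assuming at most one claim per policy), so the data is heavily imbalanced even though the exact ratio differs from the one mentioned. The sketch below shows the simplest way to balance such a set, keeping every claim row and uniformly downsampling the no-claim rows; this is a generic technique, not the residual-weighted sampling Dominic goes on to describe, and the toy rows are made up.

```python
import random

def balance_by_downsampling(rows, rng):
    """rows: list of (features, label), label 1 = claim, 0 = no claim.

    Keeps every claim row and an equal number of randomly chosen no-claim rows.
    """
    ones = [row for row in rows if row[1] == 1]
    zeros = [row for row in rows if row[1] == 0]
    kept_zeros = rng.sample(zeros, k=min(len(ones), len(zeros)))
    balanced = ones + kept_zeros
    rng.shuffle(balanced)
    return balanced

# Ratio implied by the figures in the transcript.
policies, claims = 2_000_000, 50_000
no_claim_to_claim = (policies - claims) / claims
print(f"no-claim to claim ratio: {no_claim_to_claim:.0f} to 1")

# Toy data set: 50 claims among 2,000 policies.
rng = random.Random(1)
rows = [({"policy_id": i}, 1 if i < 50 else 0) for i in range(2_000)]
balanced = balance_by_downsampling(rows, rng)
print(len(balanced), "rows after balancing")
```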
Kirill Eremenko: Do you know [inaudible 00:29:55] why are tree algorithms picky like that?
Dominic Roe: I believe it’s because when the tree is built, if you don’t have approximately, so when each split is determined, if you don’t have approximately the same observations, you know, on each split, then basically you very, very quickly you’re going to run out of data in order to segment where you’re going wrong. I think it’s part of the problem in classification where, for example, if you’re trying to predict in image, if we think of image recognition and we try to predict which image is cancerous, is a malignant tumor versus non-malignant, the ratio there is about 99 to one, and so if you want to get 99% accuracy, you know, you just say, well everything is non-malignant.
Kirill Eremenko: Ah, everything is [inaudible 00:31:01], yeah I got it.
Dominic Roe: Yeah, so it just depends exactly how you measure that, so there’s accuracy, but then there’s other measures that you can use in terms of how good at it, but I believe the answer, the question is when the tree splits, if you have a very imbalanced tree, then it’s going to perform very, very poorly on your validation and holdout set. Yeah.
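The 99-to-one point can be made concrete with a small sketch. The labels below are made up for illustration: a "model" that always predicts the majority class scores 99% accuracy while finding none of the positive cases, which is why a measure such as recall is worth checking alongside accuracy.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives that the model found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical data: 1 malignant case among 100 images.
y_true = [1] + [0] * 99
# A trivial classifier that calls everything nonmalignant.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```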
Kirill Eremenko: How do you make sure that the ones that you’re sampling, so I’m assuming you take like all the ones that you have, you take them all and then you sample an equal number of zeros, how do you make sure that, when you’re sampling, you know, you’re not biased in them?
Dominic Roe: Yeah, so the way that you actually sample it is, you sample it stochastically. In other words, the chance of you picking out that one particular record out of your 5 million exposure records, the chance of doing that needs to be proportional to the residual. In other words, if you’ve got say a very, very big residual, a big error, then you want to put more weight onto those observations with a very big error, because they’re the ones essentially that are going to provide the predictive power for your geospatial component of your model.
Kirill Eremenko: Interesting, so not even random sampling, you sample a proportion-
Dominic Roe: No, so it’s not random sampling, so I mean, if it was purely random, you know, there’d be an equal chance, and so you just use one in five million, you got five million records and the chance of picking one out is say 20,000 divided by 5 million, and then after you do that, you should end up with approximately 20,000 records after you’ve gone through all the 5 million records.
Kirill Eremenko: Yeah.
Dominic Roe: What you want to do is, you want to have the probability of picking out a record with a large residual being higher. In other words, it’s stochastically sampled, with a different probability and that’s quite easy to-
Kirill Eremenko: Why is it important to have those records with higher residuals, more of them in your model?
Dominic Roe: Yes, because those are the ones that need to have a greater component of the geospatial component in terms of describing why the model is fitting poorly. If you’ve got something where the residual is already very close to zero, it should mean that it’s been adequately explained by all of those ordinary policy characteristics already. In other words, there’s nothing left to explain in terms of the geospatial component. The geospatial component there is approximately, you know, there’s nothing significant about it, if that makes sense.
Kirill Eremenko: In those particular records with the low residual?
Dominic Roe: Yeah, that’s correct, yes.
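One way to read the sampling scheme described above is as a single pass over the records in which each record is kept with a probability proportional to the size of its residual. The sketch below works under that assumption. Using the absolute residual, the `residual` field name and the simulated data are all choices made for illustration, not details confirmed in the interview.

```python
import random

def sample_by_residual(records, target_size, seed=0):
    """Keep each record with probability proportional to |residual|.

    Probabilities are scaled so the expected sample size is roughly
    target_size (exactly so when no probability is capped at 1).
    """
    rng = random.Random(seed)
    total = sum(abs(r["residual"]) for r in records)
    if total == 0:
        return []
    sample = []
    for r in records:
        p = min(1.0, target_size * abs(r["residual"]) / total)
        if rng.random() < p:
            sample.append(r)
    return sample

# Hypothetical residuals from a first-stage model.
data_rng = random.Random(1)
records = [{"id": i, "residual": data_rng.gauss(0.0, 1.0)} for i in range(50_000)]
sample = sample_by_residual(records, target_size=2_000)

mean_abs_all = sum(abs(r["residual"]) for r in records) / len(records)
mean_abs_sample = sum(abs(r["residual"]) for r in sample) / len(sample)
print(len(sample), round(mean_abs_all, 2), round(mean_abs_sample, 2))
```

The sample comes out close to the target size, and its average absolute residual is noticeably larger than that of the full data set, which is the intended effect: poorly explained records are over-represented.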
Kirill Eremenko: Interesting. Basically, you’re breaking up your modeling process into two stages. First you have the ordinary model, and then you’re looking, all right, how can I make this even more powerful? What else do I have? Oh, I have this geospatial information, I know their address, I know where they live, I can model certain things there and add a ton more new features. How do I do this wisely? Which records actually need this modeling? What kind of records have the higher residuals, meaning that they’re predicted poorly at the moment? How about we augment the model with this geospatial component? Is that about right?
Dominic Roe: Yes, that’s exactly right. The advantage of doing that as opposed to just going straight into say Random Forest and throwing everything into Random Forest is that, from an implementation perspective, it’s quite difficult for legacy systems to be able to say things like, you know, I have 500, 600 predictors here, and each predictor is potentially different for each individual address in the country, and so in Australia we have say approximately 10 million addresses. That’s just a lot of data, and you’re building random forest models on a lot of data with a lot of predictors. With legacy systems, it’s difficult to implement that. Often you can only implement at an area level, but you still want to incorporate all those x, y, z coordinates, geospatial data into your pricing. That allows you to do that as opposed to having something which is completely free, has too many free parameters and is a very complex algorithm to actually implement.
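The two-stage structure discussed here can be illustrated with a deliberately simplified sketch. Plain group averages stand in for both models: the first stage for the ordinary policy-characteristics model, the second for the geospatial model fitted to the residuals and implemented at an area level. The field names and the eight toy policies are hypothetical, and the real project used a random forest for the geospatial stage rather than averages.

```python
from collections import defaultdict

def fit_group_means(rows, key, value):
    """Average of `value` within each level of `key`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[key]] += r[value]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Hypothetical policies: one ordinary rating factor and one area label.
rows = [
    {"building": "brick", "area": "south", "claims": 0},
    {"building": "brick", "area": "south", "claims": 0},
    {"building": "brick", "area": "north", "claims": 1},
    {"building": "brick", "area": "north", "claims": 0},
    {"building": "timber", "area": "south", "claims": 0},
    {"building": "timber", "area": "south", "claims": 1},
    {"building": "timber", "area": "north", "claims": 1},
    {"building": "timber", "area": "north", "claims": 1},
]

# Stage 1: ordinary model on policy characteristics only.
base = fit_group_means(rows, "building", "claims")

# Residual: what the ordinary model fails to explain for each policy.
for r in rows:
    r["residual"] = r["claims"] - base[r["building"]]

# Stage 2: explain the residuals with geography, at an area level.
area_adjustment = fit_group_means(rows, "area", "residual")

def predict(building, area):
    """Base prediction plus the area-level geospatial adjustment."""
    return base[building] + area_adjustment[area]

print(base)             # {'brick': 0.25, 'timber': 0.75}
print(area_adjustment)  # {'south': -0.25, 'north': 0.25}
print(predict("brick", "north"))  # 0.5
```

Because the second stage only outputs one adjustment per area, it fits the kind of lookup table a legacy rating system can hold, even if the model behind it used address-level data.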
Kirill Eremenko: Interesting. What do you mean by legacy system?
Dominic Roe: By legacy system I mean, often, practically, I suppose in the real world, even though we may choose a model that is very, very highly predictive and has many thousands of parameters, is quite a complex algorithm and is very, very powerful predictively, we might not be able to implement that because the systems that we have to work with, if you work at, say, an insurance company, the systems they’re using are quite old, and so, you know, they may be on average say 20, 25 years old, you have to, I suppose, restrict what the model output is saying in order to fit what you can put into your system.
Kirill Eremenko: You mean like hardware?
Dominic Roe: No, just the tables that are in the system, so the tables in the system for most insurers may only go to say down to the post code or zip code level or the, you know, the suburb level, just depending on exactly what type of insurance that you’re talking about, yeah, each insurer is different. Yes.
Kirill Eremenko: Okay, got you. Yeah, very interesting. Breaking the modeling process into two sets. What kind of results did you see when you implemented that approach?
Dominic Roe: Yeah, so it was actually really, really interesting. I mean the predictive power of the random forest is very, very strong, but one thing that I did recognize though was, it actually requires quite a degree of regularization, and so if you look at the [inaudible 00:37:31] on the training set compared to the validation holdout sets, there’s actually a very significant difference, and so whilst it was dramatically more powerful than the previous model had been, which was already about five or six years old, the predictive power was not as good as on the training set, and there was actually quite a bit of difference. The question for me is, you know, how much regularization do you need to do in terms of fitting and hyperparameter tuning it on your cross validation set.
Dominic Roe: We spend probably 50% of the time going on the hyper parameter tuning to try to make sure that we’re getting the right mix in terms of the depths of the trees, and in terms of how many predictors we were sampling, re-sampling at each node. Random forest has quite a few different hyper parameters and it took quite a bit of experimentation in order to try to get to an optimal outcome. What we were going from, I mean, it’s difficult to compare how good it was because what we were going from was actually a lot more of a simple approach, just pricing at the post code or zip code level. We were incorporating a lot of new data at that point in time.
Kirill Eremenko: To confirm we’re on the same page, regularization is the process of making sure that your model is not over fitting to the training data, is that right?
Dominic Roe: Yeah. Yes, that’s correct, yeah. When you fit the model to the training data and you analyze the results, obviously it’s almost always very, very good, very highly predictive, and then when you move on to a test or hold out sample, which is completely separate, completely independent of your training data, you then see, usually you see a relatively large drop in your predictive power, and that’s definitely the case in insurance, I think that’s the case in probably most fields of data science, yeah.
Kirill Eremenko: Totally, totally, agree with that, and I’ve seen that plenty, especially like in a financial predictions, you know, you can create a really cool model for a time series, for stocks or for currency rates and so on. This looks perfect on your training data and then bam in real life, completely different story. Tell us a bit more, how did you regularize this, in this particular scenario?
Dominic Roe: Yeah, we were using the random forest, we actually split each and every household in the country, we split into 50 different, basically 50 different categories from the highest risk to the lowest risk, and so what we did is we analyzed what the relative increase was in the claim frequency for each of those bins, one to 50. Each of the bins, one to 50, was essentially the score or the prediction that came out of the random forest model, but it was then put into these different categories. What we looked at, there’s actually a formula, I think it may be from one of Leo Breiman’s papers, I forget exactly what the paper is called, but I looked in the paper and there’s actually a shrinkage factor, I think he refers to it as a shrinkage factor in terms of reducing the gradient of that slope that comes out of the random forest output, and I believe it’s a function of how many predictors that you have, maybe something like one on square root, but there’s a particular factor that he uses in order to shrink the effects of the random forest predictions.
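The binning step can be sketched as below: rank the model scores, cut them into 50 equal-sized categories, then look at the observed claim frequency in each one. The scores and claims here are simulated purely for illustration, and equal-sized bins are an assumption, since the interview does not say how the category boundaries were chosen.

```python
import random

def assign_bins(scores, n_bins=50):
    """Rank scores and split them into n_bins equal-sized groups (1 = lowest risk)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores) + 1
    return bins

def frequency_by_bin(bins, claims, n_bins=50):
    """Observed claim frequency within each bin."""
    totals = [0] * (n_bins + 1)
    counts = [0] * (n_bins + 1)
    for b, c in zip(bins, claims):
        totals[b] += c
        counts[b] += 1
    return {b: totals[b] / counts[b] for b in range(1, n_bins + 1) if counts[b]}

# Simulated model scores and claim indicators (hypothetical data).
rng = random.Random(42)
scores = [rng.random() for _ in range(5000)]
claims = [1 if rng.random() < 0.3 * s else 0 for s in scores]

bins = assign_bins(scores)
freq = frequency_by_bin(bins, claims)
print(freq[1], freq[25], freq[50])
```

Plotting `freq` against the bin number gives the slope Dominic refers to; comparing that slope on training and holdout data shows how much it needs to be shrunk.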
Kirill Eremenko: What do you mean by shrinking the effect of random forest?
Dominic Roe: When you categorize all of the random forest output from highest risk to lowest risk, each increase in the categories may result in, say for example, a two and a half, or say a 3% increase in risk from an insurance perspective. That two and a half or 3% is based on the training data; in the test or the validation data, it’s going to be significantly less than that. The question is, using what formula, or using what method would you use in terms of trying to reduce that effect from say 2.5% down to say 1.7% or whatever it might be, so then on the cross validation and the test set, you get a consistent result.
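To make the 2.5% to 1.7% example concrete, here is a rough sketch of shrinking a relativity curve towards 1.0. It is not the formula from the paper Dominic mentions, which he does not recall exactly. The shrinkage factor is simply backed out from his two example figures, and applying it on the log scale is an assumption made for this illustration.

```python
import math

def shrink_relativities(relativities, shrinkage):
    """Pull each relativity towards 1.0 by scaling its log by `shrinkage` (0 to 1)."""
    return [math.exp(shrinkage * math.log(r)) for r in relativities]

# Example figures from the discussion: 2.5% per bin on training data,
# about 1.7% per bin on validation data.
train_step, holdout_step = 1.025, 1.017

# Hypothetical calibration: choose the factor that maps one step onto the other.
shrinkage = math.log(holdout_step) / math.log(train_step)

# Relativity of each of the 50 bins versus the lowest-risk bin.
train_curve = [train_step ** k for k in range(50)]
shrunk_curve = shrink_relativities(train_curve, shrinkage)

print(round(shrinkage, 3))
print(round(train_curve[-1], 2), round(shrunk_curve[-1], 2))
```

A small difference per step compounds across 50 bins, so the gap between the highest and lowest risk categories narrows considerably after shrinkage.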
Kirill Eremenko: Okay. Very interesting. Okay, cool. Sounds like a very involved project, and yeah. Did the model eventually go live, get rolled out?
Dominic Roe: Yes. Yes it did. Yep. Yeah, that’s right. I think that in Australia, we do actually have, I think most insurers now, in home insurance this is, actually price by the exact latitude, longitude, elevation of the property. Yeah, so when you actually get a quote, get a price, you jump on the web and you get a quote, they will actually verify your precise address against the postal address file, and so, yeah, most of them have that capability now apart from a couple of the smaller ones, yes.
Kirill Eremenko: Nice. Were you like the first pioneer in this space or were many insurers developing similar models at the same time?
Dominic Roe: I believe the company that I worked for at the time was the very first in Australia, I’m quite certain of that, it’s definitely not my credit. I was mentored by an actuary who was very senior and he was a fantastic mentor, so no, I don’t deserve any of the credit, but yes, I still have a good relationship with him, so I’m very thankful for him, yes.
Kirill Eremenko: Thank you. That’s awesome. Well, thank you very much for that case study, I think it was very involved and I’m sure our listeners are enjoying going through this very, very cool actuarial project. In fact it actually feels like a data science project, you know, because we didn’t go into the regulatory components and things like that, but the nature of the work itself definitely feels a lot like data science stuff. What would you say to those listening who have never done geospatial before? Because I’ve played around with geospatial, I’ve done I think two or three projects when I was in Deloitte on geospatial analytics, very interesting stuff.
Kirill Eremenko: There’s really cool tools like Pathfinder, Esri, some of the ones like a Pitney Bowes, you can use that for I think it’s like dealing with postcodes and stuff like that, and then the Australian Bureau of Statistics for people in Australia is really, really cool in that sense. Yeah, and in fact, in SuperDataScience, we don’t yet have a course on geospatial and I’ve been itching to create one because I find that topic very interesting. What would your comments be to those who are in data science or thinking of getting into data science but have never considered a geospatial analytics as a career path per se? What would you say to them?
Dominic Roe: What I would say is I think that the prevalence of data these days in terms of geospatial is just exploding all the time, and if you think about where people live and where businesses are located, there’s a very high degree of correlation between, you know, yourself and your neighbor and the business that is five doors up from you, and you all share things in common because you’ve got that spatial correlation. You can infer a lot of different things about the people who live there, the types of products they like to buy, how risk averse they are, where they like to go on a holiday, and that opens up all sorts of opportunities in terms of predictive marketing, predictive advertising, and you know, there’s just a lot of different applications for geospatial.
Dominic Roe: I mean things like natural disasters, because Kirill, you know where we live in Queensland, Queensland has had a very large share of Australia’s natural disasters over the last 12 months. I mean, even things like helping inform people of when a cyclone or a bushfire or flood is about to occur, there’s just so many possibilities with geospatial, and so I think that, yeah, because we’re all going to be living on the earth and, you know, we all share so many things in common, I think that definitely it’s going to be a very significant piece of, I suppose, discipline of data science for the foreseeable future. Those would be my thoughts, yeah.
Kirill Eremenko: That’s really good advice, and I think yeah you’re right, geospatial can be used for some very noble causes. I was actually thinking about this, funny enough, even before we met, I was thinking about geospatial. My dad and I went on a bicycle ride from Gold Coast to Brisbane and then from Brisbane to Gold Coast, and like I was tracking my ride with an app and it got me thinking about like what applications of geospatial can I think of in data science? I thought of three and like, let me know if you agree or maybe you can add some to that.
Kirill Eremenko: The first one would be what you described today, where you can, and like probably one of the most valuable applications of geospatials, when you can use that data, whatever it is, address, post code, mesh block, whatever it is, SA1, SLA level, and use that to add geodemographic segmentation to your data set, so basically extract things like affluence of the population in that area, you know, general average ages, average income, and any other kind of information like you said. Maybe you might be able to, you have a proprietary data set on where people like to go on holidays based on where they live. A lot of that data can be purchased, a lot of it is proprietary and you can find very interesting things. There was a company we were talking about, what’s it called that that sell proprietary data around geospatial? [inaudible 00:48:34] or something, not [inaudible 00:48:35] it’s completely out of there, not the same.
Dominic Roe: In the previous work place we used Tactician One. Tactician One does all the GIS and draws all the polygons for the area, et cetera, has all the maps there. Tactician One, I think it was Pathfinders was the-
Kirill Eremenko: No, it was like a big, if I remember it, I’ll say it on the podcast, there’s a big company that actually, once you’ve already done all of that modeling and you know, the x, y, z’s or you know the post codes, they tell you, all right, people in this area, they have this type of affluence, these are the gender split, this is the age split, you know what I’m talking about? Like a very big company. They purchased an Australian company [inaudible 00:49:27]. Anyway, if any one of us remembers it, we’ll mention it. Basically augmenting your data set with geodemographic data to add new features that you can include in your modeling, extremely, extremely powerful application of geospatial.
Kirill Eremenko: Application number two that I can think of is drive time analysis. This is very important, especially for businesses like local businesses where you might think you are 20 kilometers away from your customer, but really that’s just, you know, you calculating the straight-line distance over the surface of the earth, how far away you are, but really customers get to you by different means, whether it’s walking, public transport, driving. With drive time analysis, you might think you’re 20 kilometers away, but how long did it take your customer to get to you? Based on the roads, it might take them, I don’t know, it might take them 20 minutes, might take them 40 minutes, it might take them 10 minutes to get to you, and so it’s very important.
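The straight-line "earth distance" mentioned here is usually computed with the haversine formula. A minimal sketch follows. The coordinates are rough approximations for central Brisbane and the Gold Coast used only as an example, and the earth is treated as a sphere of radius 6,371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates, for illustration only.
brisbane = (-27.47, 153.03)
gold_coast = (-28.02, 153.40)

distance = haversine_km(*brisbane, *gold_coast)
print(round(distance, 1), "km as the crow flies")
```

This number says nothing about how long the trip takes, which is the gap that drive time analysis fills.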
Kirill Eremenko: When a business does like a capture segment of its customers, like all right, what’s the capture area for our customers? I have seen situations where a business says, all right, we’re here in the middle and we’re going to draw a 20 kilometer radius around us, well it’s not actually a radius. You got to do drive time analysis and see, you know, it’s going to be like a shape which is stretched out along highways and then is contracted where there’s less roads because people can get to you, it’s faster for them to get to through a highway, so therefore they can travel further.
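A drive time catchment of this kind can be sketched with a shortest-path search over a road network. The example below runs Dijkstra's algorithm on a tiny, made-up graph whose edge weights are travel times in minutes. Real analyses would use an actual road data set and a routing engine, neither of which is named in the interview.

```python
import heapq

def drive_times(graph, origin):
    """Dijkstra's algorithm: shortest travel time in minutes from origin to every node."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > best.get(node, float("inf")):
            continue  # stale queue entry, a faster route was already found
        for neighbour, minutes in graph.get(node, []):
            candidate = t + minutes
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best

# Hypothetical road network: each edge is (neighbour, minutes).
roads = {
    "store": [("highway_exit", 5), ("suburb_a", 12)],
    "highway_exit": [("store", 5), ("town_b", 10)],
    "suburb_a": [("store", 12), ("hills_c", 25)],
    "town_b": [("highway_exit", 10)],
    "hills_c": [("suburb_a", 25)],
}

times = drive_times(roads, "store")
catchment = sorted(place for place, t in times.items() if t <= 20)
print(catchment)  # ['highway_exit', 'store', 'suburb_a', 'town_b']
```

In this toy network `town_b` sits inside the 20 minute catchment because it is reached via the highway, while `hills_c` falls outside it because it sits on slow roads, so the catchment stretches along the fast route instead of forming a circle.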
Dominic Roe: Yeah, that’s a really good point, so contours or kind of a heat map would be a couple of ways to, I suppose, illustrate that, and the shape of the contours of the heat map would change, you know, for peak hour versus, you know, for weekends, et cetera, so it could change over time.
Kirill Eremenko: Yeah. Imagine how important that is if you’re like a business and you’re placing a second store, like say you’re Bunnings in Australia, it’s a hardware store, and you’re placing a second store, you know, like where are you going to place it? On one hand you service your customers and they don’t go to competitors. On the other hand, you’re not cannibalizing your own demand for the previous store, right. You don’t want to service one area with two stores.
Dominic Roe: Yeah, that’s a good point.
Kirill Eremenko: Extremely powerful, and you can only do that through geospatial. Geospatial analytics application number three that I thought of was mapping of routes. Like for instance, my dad and I were riding bicycles, it’d be really cool to map our, what’s the route we take, how do we go, you know, which turns did we take and things like that. Yeah, so it’s less obvious when that can be useful, but it can be useful in logistics, things like how are you delivering stuff to your customers?
Dominic Roe: Yeah that’s right. I think it’d be also useful for things like traffic planning. When the authorities make a decision to say widen a road from one lane each side, to say two lanes each side, and you know, they’re going to be spending $50 million to widen this road, they should be able to say, okay, this is how many people are going to use it, and these are the alternative routes, so people who don’t use this road is there an easy detour, and how worth it is that.
Dominic Roe: I think that also there’s a little bit of an overlap with say telematics here as well because what you’re talking about there with say optimizing logistics, or optimizing say fleets of cars or trucks, say, telematics can be very, very helpful for that because telematics can tell you, if you own a fleet of say 300 cars, or say delivery vans, you know where each one is at all times, and you know exactly the route they’re taking, and exactly the average speed over each hour in each day, and so you can then plan your fleet and all of your timetables and your routing, et cetera, and your scheduling, driver scheduling in order to minimize fatigue, in order to minimize overall travel times, which you know, helps with safety, it helps with insurance premiums because you’re going to have fewer accidents if people are less fatigued, all that kind of stuff. I think that yeah, it has a bit of an overlap there with telematics as well.
Kirill Eremenko: And to your point, there’s a fantastic example that my brother actually mentioned to me, so I just found it online, I will read it out. Like UPS, UPS trucks actually … Well UPS have designed their vehicle routing system in a way to eliminate left hand turns as much as possible. If you look at UPS trucks, the only … that’s United Parcel Service, so they only turn right. I think only 10% of their turns are left, and in fact they are not allowed to turn left, they will always turn right on a [inaudible 00:54:24] and in the US the driver [inaudible 00:54:26] road, so you can turn right on a traffic light, even without waiting for a green signal, you just turn right, even when it’s red, you can turn right.
Kirill Eremenko: As a result, the company claims it uses 10 million gallons less fuel and emits 20,000 tons less carbon dioxide, and delivers 350,000 more packages every year. How crazy is that? They were able to cut the number of trucks they use by 1,100, bringing the company’s total distance travelled down by 28.5 million miles. How crazy is that? Even though they’re all [inaudible 00:55:00].
Dominic Roe: That’s just incredible, isn’t it?
Kirill Eremenko: Yeah, there you go, geospatial in action.
Dominic Roe: Yeah.
Kirill Eremenko: That is extremely, extremely cool. Yeah, so that’s kind of like three main applications of geospatial that I can kind of think of. Would you agree? Would you think there’s anything else like from your experience that maybe can be added to that list?
Dominic Roe: Yeah, I can’t think of anything at the moment, but I think that because there is so much data that’s related to where a person lives and where they drive, where they work, et cetera, I think you are throwing away a lot of potential knowledge and potential power if you don’t incorporate that into your business models and into whatever initiative that you have in either government or in business. If you’re ignoring that then there’s a lot of lost opportunity there. I can’t think of any additional applications at the moment, but I do know that telematics in particular is starting to take off in somewhat of a bigger way in this country, and I think Australia is a little bit behind in terms of using things like telematics, not just for logistics and trucks, but also for say light vehicles as well. Australia is usually a little bit of a late adopter when it comes to those types of technologies, but I think it’s already reaping great rewards in Europe and the US.
Kirill Eremenko: Yeah. Well we’ll get there, as with everything. We’re finally starting to get good internet in Australia, which [inaudible 00:56:49]. Yeah, that’s cool. Well Dominic, this kind of like actually brings us to the end of the podcast. There were so many other things I wanted to talk about as well, like you know, probability and statistics, you are an expert in that stuff. Maybe we’ll have another session later on, but on that note I really wanted to thank you for coming on the show and sharing all your insights with us, it’s been a huge pleasure, you know, a lot of fun to have you on the show.
Dominic Roe: Thanks Kirill, thanks very much. I really appreciate the invitation, and yeah, happy to have a chat whenever you like, so thanks-
Kirill Eremenko: Awesome. Awesome. Thanks. Before you go, what’s the best way for our listeners to get in touch and maybe connect with you? Like maybe they have some follow up questions or just follow your career?
Dominic Roe: Yeah, so I think that if they just send me a message through LinkedIn, that would probably be the best way, the easiest way to contact me. Yeah, so just shoot me a message there, or grab my email from LinkedIn as well, that’s fine as well.
Kirill Eremenko: Awesome. Okay, cool. We’ll share that in the show notes for everybody listening, and one more question I have for you, what’s a book that you can recommend to our listeners to help them with their career?
Dominic Roe: Yes, so that’s an interesting question and I suppose my view on books and data science is that, I think data science is a discipline where the more you get your hands dirty, the more you’ll learn, and you have to use a lot of trial and error in terms of trying different approaches, trying to fit different models, seeing what works and seeing what doesn’t work. I suppose I haven’t come across the book that I can say, you know, I think that is, you know, that one in particular is fantastic. I actually prefer to read the shorter articles and the shorter research papers rather than a textbook on data science. And so, one that I’ve used and applied, a research paper that I’ve used and applied heavily in the past has been on Convex Optimization.
Dominic Roe: The title of that one, that research paper was Convex Optimization, A New Approach to Common Challenges, and so that was written in 2010, and that was written for the Institute of Actuaries conference, which was being held on the Gold Coast, and so that’s Dimitri Semenovich, Yang Cai and Ian Heppell are the three authors of that one. Yeah, so not trying to plug fellow actuaries at all, but that’s one in particular that I thought was good because convex optimization is, essentially what it is, is it’s a method to solve a particular objective function or a particular loss function subject to constraints, where your objective function is convex. And so you can have quite a broad array of different functions within and still ensure that the function is convex, and essentially, so that’s what it is. It’s a solver.
Dominic Roe: One of the applications I actually used it in was when an alternative to the random forest, or competitor to the random forest model that I built, was to say each and every single house gets its own spatial parameter in the generalized linear model. And so what you would then have is a massively over-parameterized model, which you can still fit, you can still force it to converge as long as you have the correct penalization and regularization terms within your objective function. That’s probably getting a little bit deep mathematically, so for the people who like the maths component of data science, they probably would have gotten that, but for some people who aren’t all that familiar with the mathematical concept of optimization, that might’ve been a bit too much. But basically you can force the model to converge and for the convex optimizer to find a minimum of the loss function subject to the constraints that you have, as long as you have the correct penalty terms in there.
Dominic Roe: The penalty terms are, you’ve got ordinary regularization, which you know, same as you’d have in any model, same as you have in the neural network. You’ve also got an additional term in there which performs spatial smoothing, so in other words, you’re going to have an additional penalty for how different the risk is, relative to your neighbor. What it looks at is, it looks at the pairwise distances of each of the properties and says if you are very close, I’m going to penalize you a lot in terms of having a different parameter, and if you’re far away then you know, there is going to be no penalty term there. It’s essentially inversely proportional to the square of the distance between each pair of properties that you’ve got. It’s quite a large, yes, it’s quite a large matrix if you think about it, but you can actually solve the problem through that alternative technique as well. It’s a very, very general solver, which has applications in engineering as well as insurance, has very broad applications.
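The penalised objective described above can be sketched as a fit term plus an ordinary ridge penalty plus a pairwise spatial smoothing penalty weighted by inverse squared distance. In this illustration a squared-error fit term stands in for the generalized linear model likelihood, the coordinates are on an arbitrary flat grid, and the weights `ridge` and `smooth` are hypothetical tuning parameters. Every term is a non-negative-weighted sum of squares, so the whole objective stays convex in the parameters.

```python
def penalised_loss(params, targets, coords, ridge=0.1, smooth=1.0):
    """Squared error + ridge penalty + inverse-square-distance smoothing penalty.

    Assumes all coordinates are distinct (no zero distances).
    """
    fit = sum((p - y) ** 2 for p, y in zip(params, targets))
    ridge_term = ridge * sum(p ** 2 for p in params)
    smooth_term = 0.0
    for i in range(len(params)):
        for j in range(i + 1, len(params)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            weight = 1.0 / (dx * dx + dy * dy)  # close neighbours weigh more
            smooth_term += weight * (params[i] - params[j]) ** 2
    return fit + ridge_term + smooth * smooth_term

# Three hypothetical houses: two next door to each other, one far away.
coords = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0)]

# Case A: a house differs from its next-door neighbour.
near_params = [1.0, 0.0, 0.0]
loss_near = penalised_loss(near_params, near_params, coords)

# Case B: only the distant house differs from the other two.
far_params = [0.0, 0.0, 1.0]
loss_far = penalised_loss(far_params, far_params, coords)

print(round(loss_near, 2), round(loss_far, 2))
```

With the fit term set to zero in both cases, the only difference is the smoothing penalty: disagreeing with a next-door neighbour costs far more than disagreeing with a distant property. The double loop is also why Dominic calls the pairwise matrix large, since it grows with the square of the number of properties.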
Kirill Eremenko: Wow, very, very interesting, thank you. I’m sure you’re going to get follow up questions from that. I wanted to say that I completely see your point. When we were creating courses, I go through a ton of research papers and it’s so inspiring to be on the cutting edge of what’s happening in the world in that technology that you’re using. It’s also important to know, you know, what is, for instance in AI, what is the A3C algorithm, or what is the new development in the augmented random search, or whatever else, evolution strategies, genetic algorithm, things that you might, even if you don’t use exactly the way they are described per se, you just see how people think and it’s something new, something fresh. It’s always there. Thank you very much for that, and could you repeat the name of the paper again please?
Dominic Roe: Yes. It’s called Convex Optimization, A New Approach to Common Challenges. The authors are Dimitri Semenovich, Yang Cai, Ian Heppell, and that was from 2010, and it’s published by the Institute of Actuaries of Australia.
Kirill Eremenko: Awesome. All right, perfect. Well on that note once again, thank you so much Dominic. Great pleasure. Great fun chatting.
Dominic Roe: Yeah, you too Kirill. Thanks very much for the opportunity, really appreciate it.
Kirill Eremenko: There you have it ladies and gentlemen. That was Dominic Roe, Senior Underwriter at Cover-More and also a professional actuary. I hope you enjoyed this chat as much as I did and, as promised at the start, this was a podcast like no other where we dove deep into the different models that Dominic was using, or the way he built this model, which combined two separate models in one. My favorite part was probably the way Dominic explained a balanced data set and why it’s important to perform that stochastic sampling that he was talking about. That was something very insightful for me, and I personally learned some new things on this podcast, and I hope you did too.
Kirill Eremenko: On that note, you can get all of the show notes for this episode as usual at the SuperDataScience website, and this episode is number 243 so the URL would be www.superdatascience.com/243, that’s www.superdatascience.com/243, and there you can get all the show notes, all the materials that were mentioned in the show plus a URL to Dominic’s LinkedIn, so make sure to connect with him, ask him any questions that you might have about actuarial sciences or any of his modeling that he shared, his case study that he shared here. In general, Dominic is a great connection to have in your network.
Kirill Eremenko: Once again, I hope you enjoyed this podcast. If you did, then please go to iTunes or wherever you’re listening to these episodes and leave us a review, leave us a rating and write what you feel about this podcast, I would highly appreciate it. I read them all, I go on there and I read all those reviews because I really value your feedback and it inspires me when you are inspired, inspires us all at SuperDataScience when you’re inspired. Make sure to let us know and let the world know how you feel about the SuperDataScience podcast. On that note, thank you so much for being here today, I look forward to seeing you back here next time, and until then, happy analyzing.