Welcome to episode #051 of the Super Data Science Podcast. Here we go!
Today's guest is Global Analytics Consultant Randal Scott King.
Hadoop, HDFS, MapReduce, Hive, Kudu, Impala, Spark, YARN, Seahorse… Do you often see and hear these terms and want to know more about what technologies they represent and why they are making waves in the field of data science?
Scott King brings years of consultancy experience to this episode and goes into detail about his applications of these technologies alongside an overview of what they all do.
He also shares his tips on how to captivate varied audiences in public speaking, as well as his own personal story of getting in front of an audience.
Prepare to widen your knowledge base!
In this episode you will learn:
- Use Cases for Hadoop (09:04)
- HDFS and MapReduce in a Nutshell (14:11)
- What Hive Does (20:33)
- The Problem Kudu Solves (22:33)
- How Spark Interacts with YARN (25:02)
- Speeding Things up with Seahorse GUI (27:42)
- How is a Data Lake Useful? (30:44)
- Advice for Getting Into Public Speaking (38:21)
Items mentioned in this podcast:
- Data Wrangling with Python: Tips and Tools to Make Your Life Easier by Jacqueline Kazil and Katharine Jarmul
- Hadoop MapReduce v2 Cookbook by Thilina Gunarathne
Kirill: This is episode number 51 with Global Analytics Consultant Scott King.
(background music plays)
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.
(background music plays)
Welcome to the SuperDataScience podcast everybody. Hope your week is going great, and today we've got a very interesting guest. Scott King is an analytics consultant, also the founder of Brilliant Data, and also a renowned analytics speaker. Scott comes into organisations to help them with their data capabilities. He helps executives with their strategy, helps them understand what are the best tools for their organisation, and which is the best way to proceed going forward into the future.
In this podcast, we predominantly talked about Hadoop. So if you've always wanted to find out a bit more about Hadoop, or really what's going on at the cutting edge of technology in this space, then this is the podcast for you. In today's episode, you will learn a lot. And I mean a lot. Get your pens and papers out, because today, you will learn about HDFS, MapReduce, Hive, HAWQ, Pig, Kudu, Impala, Spark, Seahorse, and what a data lake is, and much, much more.
And my favourite part is that Scott has a natural ability to explain these things in a very, very simple manner so you don't have to be highly technical, you don't have to already have a lot of knowledge about Hadoop to understand all of these things. It'll be very easy for you to get up to speed. So this is your crash course into Hadoop. By the end of this hour you will know how to operate these terms, or at least be up to date with what's going on in this space.
In addition, we will also talk about Scott's public speaking and he'll give you some tips and tricks on how to prepare for data science presentations, how to present to different types of audiences, and the things that he does that help him out when he's doing public speaking.
So all in all, very interesting, exciting podcast. Can't wait for you to check it out. And without further ado, I bring to you Scott King, Global Analytics Consultant and Founder of Brilliant Data.
(background music plays)
Hello everybody and welcome to the SuperDataScience podcast. Today we've got Randal Scott King calling in from Atlanta. Randal Scott is a Global Analytics Consultant. How are you going today, Scott?
Scott: I'm doing very well, how about yourself, Kirill?
Kirill: I'm well as well. And just for the benefit of our listeners, can you explain the whole legacy of names in your family? Because your name is Randal Scott, but everybody calls you Scott. How does that come by?
Scott: I think that's just a thing in my family. We give people first names and then don't use them. My father is John Richard. He went by Richard, I think probably because his father was also John Richard.
Kirill: Ok, so John Richard the junior, right?
Scott: Yeah, exactly, yeah.
Kirill: And you said you have 5 children, are you going to also give them double names as well? That's 10 names you gotta think of!
Scott: Well, yeah. So as a matter of fact, yeah. As a matter of fact, my oldest daughter goes by both of her names.
Kirill: Ok. Alright. Gotcha. Ok, well thanks a lot for coming on the podcast and so for those of you who haven't heard what Scott does, Scott is a Global Analytics Consultant and he consults companies like the Fortune 500 companies in analytics, in spaces like Big Data, and Hadoop, just to name a few. And so this is going to be a pretty exciting chat. So Scott, tell us how you got into this whole space of analytics consulting.
Scott: I started as a user. I was actually the Director of Business Development at an IT reseller. And I was responsible for making sure the company sold $100 million worth of Cisco product.
Scott: And I realised one day, I was like, I found on the company's internal website that somebody would publish the entire dump from the Oracle ERP that had every transaction year to date on it. And I realised pretty quickly just how powerful that was. I spun up an instance of Pentaho. They have an open source version, but it's a BI package. And I would go out and get that database dump every day and put it on my laptop and I started looking at the data and realized “Oh, my gosh. I can find out everything about the business. I can find out who’s selling what, where, and to whom.” It was really useful in my job as business development to know all of that about the business. We were able to post some really strong gains year over year for the three years that I was there by just knowing what was going on.
Kirill: Wow! That’s powerful.
Scott: So, BI is great. It tells you what’s going on with the company right now and in the past, but it was really when I discovered machine learning and I was like, “Oh, my God. I can predict things now.”
Kirill: Yeah, totally. And then you slowly transitioned into the space of Hadoop or you do both at the moment?
Scott: So once we started doing that, we started going back further in time. We had something like 8 years’ worth of data. Obviously, that outgrew my laptop really quickly. (Laughs) So the light bulb just went off one day and I was like, “You know what, if we’re getting this much value out of this kind of stuff, we could be doing this for clients.” So yeah, I left and went out and started Brilliant Data and started doing that sort of thing for clients.
Kirill: Okay. That’s pretty cool. Are you consulting on your own or do you have a team of people that you’re working with?
Scott: I’ve got a small team. We’re a small boutique firm but we’re getting to be pretty well-known for getting results.
Kirill: Okay. That’s great. I’m actually looking at your LinkedIn now and there’s so many people leaving comments about how you presented within companies and on Hadoop and really shaped other companies’ ways of thinking, so CEOs and directors are leaving great comments. Can you tell us a bit more about that? Apart from implementation, you’re obviously also pushing the envelope in terms of an analytics agenda and building this analytics culture in different organizations. How do you go about doing that?
Scott: Well, you know, I’ve always done public speaking in some form or fashion throughout my career. I’ve been in IT consulting for about 20 years and have always either taught classes or given sales pitches as a sales engineer, any number of things that have to do with speaking. And I realized, when I started Brilliant, I was like, “You know, this is going to be a valuable tool for bringing business in, is getting out there and speaking and kind of starting the conversations around things.”
These days I see it more as kind of educating people on the state of analytics and the state of big data because it really is changing so quickly all the time. In the past, you really wouldn’t want to ever use Hadoop as a data warehouse because there were some significant limitations there, but there’s been some things that have come along just recently that have kind of changed that now and it’s at least possible, if not advisable, to do now. A lot of people don’t know that. You know, you kind of get out there and let people know.
Kirill: Okay, gotcha. So basically, two areas that I want to dig into further is, of course, your public speaking and your experience there and also Hadoop and what you can tell us a bit more about that. Maybe let’s start with Hadoop. In five sentences or less, what is Hadoop for a person listening to this podcast who hasn’t encountered Hadoop or just heard it as a buzz term?
Scott: Well, there’s really two reasons why you would want to use Hadoop. You’ve just got more data than a traditional relational database can work with, or you’re wanting to work with types of data that a relational database can’t work with.
Kirill: Interesting. And how much data are we talking that a relational database can’t handle? A gigabyte, 10,000 gigabytes?
Scott: No, certainly not in gigs. You’d have to get into the upper terabytes or even petabytes before you’d really max out most of the really good relational databases like your Oracles and whatnot.
Kirill: Gotcha. Does only the amount of gigabytes matter or also the — you know, people probably have heard of the three Vs of Hadoop – velocity, variety and veracity, I think, or variability, maybe 4 Vs. Does it have to be just the volume of data, or can it also be the different types of data that you have in your dataset?
Scott: Well, yeah. It’s really those two things. And you’re right, there are the three Vs, but the two that I think of when I’m talking to clients and figuring out if they actually need Hadoop – because a lot of people think they need Hadoop and they don’t. I told somebody recently — they had SAP, you know, they were wanting to build dashboards and I’m like, “You do realize you can do that with what you have, right?” But the two things that I think of when a client says, “We need Hadoop,” is “Okay. Do you have the volume of data necessary for Hadoop?” “Well, no.” “Okay. Do you have a lot of different kinds of data that you want to work with that you would find difficult to put into a database?” And they’re all like, “Yes. You see, we’ve got these customer service transcripts that we want to go through and do text analysis on them, etc.” You know, you’re not going to do that in a traditional relational database.
Kirill: Yeah, gotcha. Understood. All right, so Hadoop if you have a lot of data, if you have lots of different types of data. So, how does this happen? Somebody calls you up and says, “We want Hadoop.” You ask them the questions and then you say, “Either you need Hadoop or you don’t.” But if they do need Hadoop, what happens from there?
Scott: Well, it’s a matter of determining what they have and what they want to bring into Hadoop and what is the ultimate business benefit that they want to achieve. I think that’s really where a lot of Hadoop—there’s all kinds of press these days about the demise of Hadoop. I think it was Mark Twain who wrote to his local newspaper, they had published an obituary for him, and he said “The news of my death has been greatly exaggerated.” I think the news of Hadoop’s death has been greatly exaggerated. I think that goes back to the Gartner hype cycle. You know, we’re in the trough of disillusionment right now and I’m really looking forward to the — what is it, the plateau of productivity?
Kirill: Something like that, yeah. So, basically you go in and you implement the system. How long does it take? What are the main challenges that you face when you’re implementing Hadoop at an organization?
Scott: Well, to actually stand up a Hadoop cluster doesn’t take very long at all. To bring in the client’s data and to do the groundwork necessary for that to figure out—to sit down with them and say, “Okay, here’s the problem or problems that you’re trying to solve” or “Here’s the additional capabilities that you want to develop,” that takes a while. Bringing the data in takes a while. You know, people think of Hadoop as that whole data lake concept of “Oh, just throw it all in the data lake and we’ll figure it out later.”
Well, you can do it that way, sure, but I wouldn’t advise it. But sitting down with the client and figuring out what it is that they’re trying to accomplish, figuring out what data they’re going to need for that and how to structure that inside either HDFS or Hive or now, if it’s a Cloudera install, Kudu, which is a great little tool, we can talk about it in more detail later if you want. But figuring out what exactly it is that they’re wanting to accomplish and how to set that up in Hadoop is what takes the longest time. Usually, once the data is in and it’s in the format that you want, doing analysis doesn’t really take all that long.
Kirill: Yeah, gotcha. And there are tools like Pivotal, for instance. It’s a pretty good tool to use on top of Hadoop to make things easy. They have PivotalR even, to allow for that. Okay, so you’ve mentioned a couple of words – HDFS, Hive, Kudu. Could you tell us a bit more? Let’s start getting into the technical side of things.
Scott: Sure. The central idea behind Hadoop is that it’s a distributed system, right? So you take a big problem like analysing terabytes or petabytes of data and you cut it up into smaller problems and you spread that load out over the individual servers that make up the cluster and then they do their part of it and send it back to you and you reassemble the whole thing on the other side. Hadoop is all about distributed computing. And HDFS, which stands for Hadoop Distributed File System, is exactly what it sounds like. It’s a distributed file system that spans the whole Hadoop cluster and sits on top of the file system on each node.
Kirill: Okay. So that’s basically the system that connects everything together?
Scott: That’s one way of thinking of it, yeah. I mean, it’s a file system just like any other, but it spans the whole cluster and as a result it can be massive.
Kirill: Okay. And it’s one of the main advantages of Hadoop, is that it’s scalable, right?
Scott: Yes. You want to make it bigger, you just add machines to the cluster.
Kirill: Okay. And how does that compare to using a super computer? Why would one use Hadoop over using a super computer?
Scott: Super computers tend to be very, very costly whereas Hadoop uses commodity hardware. I tell people Hadoop is the exact opposite of virtualization. With virtualization you’re taking one big physical server and you’re running a bunch of virtual servers on it. Hadoop does the exact opposite. It takes a bunch of physical servers and makes one big virtual server out of it.
Kirill: Okay, gotcha. So what happens when one of those individual servers or commodity hardware breaks down?
Scott: Nothing. That’s the beauty of it. So HDFS replicates data a minimum of three times. There will be three nodes that your data is on. So if one of those nodes takes a dive, let’s say the power supply sparks and you have to pull it out and replace the power supply or just replace the entire machine. Well, when you put it back in, HDFS will actually rebuild the file system that was on it.
Kirill: Okay. That’s pretty cool. So basically you have three copies of your data on separate nodes, but are they all being processed simultaneously, or data is still processed on each individual node but that is just kept as a backup?
Scott: Well, what HDFS does is let’s say that you have a terabyte worth of data, it’s some kind of big table or something. What HDFS is going to do is it’s going to take that and divide it up into blocks. The size of the block is configurable, but I think the default these days is 128 MB. Sometimes you might want that to be 64 MB, sometimes you might want it to be 256. Like I said, it’s a configurable parameter. HDFS is going to take that big terabyte file and it’s going to chop it up into 128 MB blocks and it’s going to replicate each block three times.
So if you have a 20-node cluster, for example, that first block might be on servers 1, 3 and 5. The second block might be on servers 2, 15 and 10, etc. So the whole data gets spread out over the cluster and then the name node, which is the part of the cluster that decides where things go and decides who does what, it will pick one of the nodes that that first block of data is on and say, “Okay, your job is to calculate this block of data for this job.” Does that make sense?
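To make the block arithmetic in this answer concrete, here is a minimal Python sketch. It is not HDFS code: the function names are hypothetical, and the random placement policy is an assumption made for illustration (real HDFS placement also takes racks into account). It only uses the figures from the conversation: a 1 TB file, 128 MB blocks, three replicas and a 20-node cluster.

```python
import math
import random

BLOCK_SIZE_MB = 128   # default block size mentioned in the conversation
REPLICATION = 3       # replication factor mentioned in the conversation


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, num_nodes, replication=REPLICATION, seed=0):
    """Assign every block to `replication` distinct nodes (toy random policy)."""
    rng = random.Random(seed)
    nodes = list(range(1, num_nodes + 1))
    return {block: sorted(rng.sample(nodes, replication))
            for block in range(num_blocks)}


def surviving_replicas(placement, failed_node):
    """Count the replicas of each block that remain if one node goes down."""
    return {block: sum(1 for node in holders if node != failed_node)
            for block, holders in placement.items()}


one_terabyte_mb = 1024 * 1024   # 1 TB expressed in MB (binary units)
num_blocks = split_into_blocks(one_terabyte_mb)
placement = place_replicas(num_blocks, num_nodes=20)
after_failure = surviving_replicas(placement, failed_node=5)
```

Under these assumptions the file becomes 8,192 blocks, and because each block sits on three different nodes, losing any single node still leaves at least two copies of every block.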
Kirill: Yeah, gotcha. And what about MapReduce? Can you tell us a bit more about MapReduce?
Scott: Well, I kind of just did. (Laughs)
Kirill: Okay. So that is MapReduce?
Scott: MapReduce is the computation engine, or the original computation engine, for Hadoop. MapReduce is the part that does the actual calculating anything on the data in Hadoop. Since Hadoop v2 and with the introduction of something called YARN, Yet Another Resource Negotiator, there are additional computation engines like Tez and like Spark that can run on a Hadoop cluster now, you’re not just limited to MapReduce. And that’s a good thing because MapReduce has some significant limitations.
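As a rough illustration of the map and reduce steps, here is a word count in plain Python. This is a sketch of the programming model only, not Hadoop's Java API: the three strings stand in for blocks of a file, and the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each string plays the role of one block processed by a different node.
blocks = ["hadoop stores data", "spark reads data", "hadoop runs jobs"]
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
```

On a real cluster the map calls would run in parallel on the nodes that hold each block, and only the grouped results would travel across the network to the reducers.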
Kirill: Okay, gotcha. All right. Moving on to this whole animal kingdom of Hive, Kudu, Pig, etc.—
Scott: Oh, my gosh. It grows every day, man.
Kirill: (Laughs) What can you tell us about that? First of all, why are the names all animals?
Scott: That’s a really good question. I’m not sure, but I do know that Hadoop itself, that name came from Doug Cutting’s young son at the time, and this was 10 years ago so that kid could probably be an adult by now. But at the time, he had this little toy stuffed elephant, this yellow stuffed elephant that was named Hadoop and Doug decided to name that project after that little toy.
Kirill: It’s like kids – inventors or developers on these things and their kids. Same thing with Pivotal. I think they have a product called Plum, or it’s a name of the company—
Kirill: Greenplum! So, to his child at the time, he was like creating and said, “Oh, I have a cool idea. What should I call it?” And the child picks up an apple from the basket of food that they have at the table and the guy is like, “Well, I can’t call it Apple. Apple is already taken. What’s the closest thing to a green apple? A green plum.” So that’s why they call it Greenplum. It’s just funny. Anyway, Hive – what is Hive? Tell us about that.
Scott: Let’s say you bring in a bunch of CSV files into HDFS and you want to be able to run SQL queries on those files. Well, that’s what Hive does. It creates tables on top of these files to make them searchable by SQL queries.
Kirill: Okay, gotcha. So, it kind of makes unstructured data workable through this traditional structured approach.
Scott: It’s kind of an interpreter, really. It’s an abstraction that sits on top of MapReduce and it translates SQL code into MapReduce Java and sends that to MapReduce and says “Hey, execute!”
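The idea of SQL as an abstraction over a map-and-reduce job can be sketched in Python. This is an analogy, not Hive: SQLite stands in for the SQL layer, the CSV text and the `sales` table are invented sample data, and the hand-written function shows the kind of grouping work a `GROUP BY` query would be translated into.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical sample data standing in for a CSV file stored in HDFS.
CSV_TEXT = """region,product,amount
east,router,1200
west,switch,800
east,switch,300
"""


def read_rows(text):
    """Parse CSV text into (region, product, amount) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["region"], r["product"], int(r["amount"])) for r in reader]


def totals_with_sql(rows):
    """Answer the question declaratively with a SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(result)


def totals_with_map_reduce(rows):
    """Answer the same question by hand: map to (region, amount), then sum."""
    grouped = defaultdict(int)
    for region, _product, amount in rows:
        grouped[region] += amount
    return dict(grouped)


rows = read_rows(CSV_TEXT)
sql_totals = totals_with_sql(rows)
manual_totals = totals_with_map_reduce(rows)
```

Both functions return the same totals per region; the SQL version simply lets the user state what they want and leaves the grouping logic to the engine.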
Kirill: Yeah. Pivotal have a similar thing to that. You can do SQL on Hadoop through a Pivotal product as well, I just don’t remember what it’s called.
Scott: I want to say that’s HAWQ.
Kirill: Exactly, that’s right. Another bird, another animal. (Laughs)
Scott: Another animal. (Laughs)
Kirill: I’m not surprised. All right, cool. That’s good. So basically even somebody not knowing how to work with unstructured data and do Java code and all those other things, they can still work with Hadoop through things like HAWQ or Hive.
Scott: Yes. And there’s actually more abstractions on top of MapReduce. Like Pig you mentioned earlier, it’s a scripting language and it’s much easier to learn than Java. Again, basically all it does it takes that scripting language, turns it into Java and sends it to MapReduce and says “Execute.”
Kirill: Okay, good. That’s good. So we’ve covered HDFS, MapReduce – this is value – Hive, HAWQ, PivotalR, Pig. Okay, next one on the line – Kudu. You said that’s Cloudera’s creation. Tell us a bit more about Kudu.
Scott: Yeah, a lot of people have tried to implement Hadoop in a way that you would normally only want to do with a relational database. I mean, Hadoop and relational databases were never meant to do the same things. But a lot of people have tried, for example, to deploy Hadoop as a data warehouse. This is something that I’ve advised people against in the past because the things that you have to do to make that work are just [indecipherable 23:11], for a lack of a better term. The problem is that data in HDFS is immutable. You don’t change data in HDFS. You either delete it and overwrite it or you append.
Obviously, that creates some problems as far as trying to use it as a data warehouse because data warehouses update information all the time. So what Kudu really does – and I don’t understand why Cloudera is not marketing it this way, but they’re not – but what Kudu does is it basically eliminates the immutability issue. Data in Kudu is absolutely updatable. You can do updates, you can do inserts, you can do deletes just like you would on a relational database.
Kirill: Okay. That makes sense. So how does Kudu do it in terms of — if HDFS restricts the updating of data, how does Kudu go around that?
Scott: Well basically it does it by not using HDFS. So, Kudu is actually a storage engine, you know, they’re very hesitant to call it a database because really it’s not, but it is a storage engine that is updatable whereas HDFS isn’t. So you can have HDFS and Kudu running side by side on a Hadoop cluster and use them for different things, but Kudu does not rely on HDFS at all.
Scott: Hive relies on HDFS. Kudu just has its own thing for that.
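The difference between append-only storage and updatable storage can be sketched with two toy Python classes. Neither class is a real HDFS or Kudu client; the class names, methods and customer records are all hypothetical and exist only to contrast the two behaviours described above.

```python
class AppendOnlyFile:
    """Toy stand-in for an HDFS file: append or replace wholesale, no in-place edits."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def overwrite(self, records):
        self._records = list(records)

    def read(self):
        return list(self._records)


class UpdatableTable:
    """Toy stand-in for a Kudu-style table addressed by primary key."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        if key in self._rows:
            raise KeyError(f"row {key!r} already exists")
        self._rows[key] = row

    def update(self, key, row):
        if key not in self._rows:
            raise KeyError(f"row {key!r} does not exist")
        self._rows[key] = row

    def delete(self, key):
        del self._rows[key]

    def read(self):
        return dict(self._rows)


# Changing one customer's city in the append-only file means rewriting everything.
hdfs_like = AppendOnlyFile()
hdfs_like.append((1, "Atlanta"))
hdfs_like.append((2, "Austin"))
rewritten = [(cid, "Denver" if cid == 2 else city) for cid, city in hdfs_like.read()]
hdfs_like.overwrite(rewritten)

# The same change in the updatable table touches a single row.
kudu_like = UpdatableTable()
kudu_like.insert(1, "Atlanta")
kudu_like.insert(2, "Austin")
kudu_like.update(2, "Denver")
```

With two records the rewrite is trivial, but the same pattern applied to a warehouse table that changes constantly is what makes the append-only approach painful.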
Kirill: Okay. How about next in our line-up of animals and creatures – Spark and Seahorse? What can you say about Spark and Seahorse?
Scott: Well, going back to what I said earlier about MapReduce, you know, it was the original computation engine. But in Hadoop v2 they introduced YARN. I like to call YARN the mom of the cluster. You know, when you were a kid and you wanted something or you wanted to do something you went to mom, right? Because if you went to dad, he would just say, “Well, what does your mother say?”
Kirill: (Laughs) Totally, yeah.
Scott: But, yeah, I mean, if you and your brother or you and your sister wanted a cookie and there was only one cookie, mom was going to be the one who decided who got the cookie, right? So YARN does the same thing in a Hadoop cluster. YARN is administering the resources of the cluster and deciding, if you’ve got a bunch of different jobs running on a Hadoop cluster, YARN is going to say, “Okay, I’m giving this job this much processing power, etc.” That used to be in MapReduce. So, in v1, MapReduce did all that. But they took that out of MapReduce and put it into YARN so now MapReduce just does computation. And what that did is it opened up the window to other computation engines and Spark is one of them.
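The "who gets the cookie" role can be sketched as a toy allocator in Python. This is not how YARN's schedulers are implemented: the first-come, first-served policy, the `Request` type, the job names and the 100-core cluster are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    job: str      # hypothetical job name
    engine: str   # computation engine the job runs on
    cores: int    # cores the job asks for


def allocate(total_cores, requests):
    """Grant cores in arrival order until the cluster has none left."""
    remaining = total_cores
    grants = {}
    for request in requests:
        granted = min(request.cores, remaining)
        grants[request.job] = granted
        remaining -= granted
    return grants


requests = [
    Request("nightly-etl", "mapreduce", 40),
    Request("model-training", "spark", 50),
    Request("ad-hoc-report", "tez", 30),
]
grants = allocate(100, requests)
```

The allocator never looks at the `engine` field, which mirrors the point made above: once resource management is separate from computation, jobs on different engines can share the same cluster.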
Kirill: So Spark goes instead of MapReduce?
Scott: Yeah. And the great thing about it is, because of the way YARN works, you don’t have to choose either/or. So, for example, if you’ve got a job that could really benefit from MapReduce’s batch-oriented nature, which a lot of ETL processes would qualify, then you can have that job running in MapReduce while over here you’ve got a different job running in Spark and taking advantage of Spark’s greater speed.
Kirill: Gotcha. So advantage of Spark over MapReduce is speed?
Scott: The advantage of Spark is really it’s orders of magnitude faster than MapReduce. The reason for that is MapReduce is very disk-oriented and very batch-oriented. It’s just slow. When MapReduce was invented, they weren’t really all that concerned with speed. They were just concerned with being able to do the kinds of things that MapReduce does at all. So, whereas MapReduce is batch-oriented and disk-oriented – I mean, it does everything on disk – Spark will actually load data into memory. And again, it’s just orders of magnitude faster because of that.
Kirill: Gotcha. All right. And what can you say about Seahorse?
Scott: You know, Seahorse is this brand-new thing that I just stumbled across maybe a month and a half ago. What Seahorse does is it actually gives you a graphical front end to Spark. So Spark has these great machine learning libraries – ML and MLlib – and Seahorse gives you a graphical interface to using those. I’m noticing it really speeds up prototyping for me. I loaded it on my laptop and interfaced with the cluster here at the office and I’ve noticed that rather than hand coding everything in Python or Java or whatever, having that graphical front end enables me to prototype things so much quicker.
Kirill: Really, they are just punching these out, like one and a half months ago, and you’re obviously in the centre of everything that’s going on with Hadoop, how quickly do they release these things?
Scott: You were saying something about the ecosystem around Hadoop earlier and how they’re all named after different animals and all this and I was like, “Yeah, there’s a new one every day.” You know, again, you could spend all kinds of time just keeping up with the latest on either Hadoop or big data or machine learning or data science. I mean, there’s always something new every week.
Kirill: And how do you do it? How do you keep up with everything?
Scott: Who says I do? (Laughs)
Kirill: Well, you seem to be pretty up-to-date with everything we’ve discussed so far. You definitely know what’s going on.
Scott: I think keeping current on anything in tech is really a matter of community. I introduced a friend of mine who’s a data scientist at a benefits company. I sent him an e-mail and said, “Hey, have you seen this thing?” I was talking about Seahorse. “Seahorse is basically a GUI for Spark.” He’s like, “Whoa, I love this!” And there’s always somebody who says, “Hey, Scott, have you seen this thing over here?” Sometimes it’s, “Yeah, I have. I think it’s great.” And sometimes it’s, “Oh, my God, no! What is that?”
Kirill: It’s a seahorse. (Laughs) Okay. That’s pretty cool. Summarizing all of these things up together, can you give us a quick description, what do people mean when they say data lake? Obviously these tools somehow altogether assist or facilitate data lakes or are used in data lakes. What do people mean by the term data lake?
Scott: I think different people mean different things by it. I think a lot of people use the term and don’t even really know what they mean by it.
Kirill: All right. Let’s make it clear for everybody so that when people use it in the future, they know what they’re talking about.
Scott: Yeah. So, when you think about a lake — a lake has layers, right? You’ve got the sediment and you’ve got kind of the murky depths and then you’ve got the top. And really, what it’s all about in terms of making that useful for an organization is you’ve got different kinds of data that you can put in there. You know, earlier we said that thing about some people talk about how Hadoop is where you can just throw things and deal with it later. You really can. You can use Hadoop to archive data and just get it in there and figure out what to do with it a couple of years down the road when there’s some use for it. Or you can have it in a Hive table or a Kudu table for ready access.
I think really the whole data lake concept came about from having a place where you could put everything, which was kind of the goal of the data warehouse, but data warehouses have very strict schema and whatever data you’re going to put in there has to conform to that schema. With Hadoop you don’t really have that restriction. You can put anything in there. And I think that’s really the whole point of the data lake concept. Yeah, there’s all kinds of stuff that you can put in there and you can either start using it immediately or figure out what to do with it later.
Kirill: All right. But how do I imagine it because it’s so different to what people are used to, whether it’s SQL or Excel or just folders and files. Can I imagine a data lake as just a huge folder where I can just create new folders and just dump all of my videos or all of my texts, scanned documents, whatever I want? Does that analogy make sense? Do you think of it as folders containing certain information? Or does it look like something else to you?
Scott: No, that’s a very apt description of it. I mean, it’s a file system much like any other, HDFS is. And you can put almost anything onto it. Like, you mentioned video. You can put images, you can put audio, you can put texts. Really, whatever it is that you want to stick on there, you can stick on there.
Kirill: Okay, gotcha. Thanks for the confirmation. It makes things a bit more clear now. And with your public speaking, so you have all this vast knowledge on in-depth topics and you know obviously how to set these things up and communicate them well. When you go into a company and you do the speaking there to the employees and even executives, what is your main goal? What is your main goal that you’re trying to communicate to them and what do you want as an outcome for them at the end of your conversation?
Scott: Well, I mean, it’s really different. If I’m doing public speaking as a public speaking thing, if they’re bringing me in to do a 30 minute, 60 minute, 90 minute keynote, then it’s a matter of I’m trying to find out what it is that they want to accomplish. For example, there was this benefits company here in the States. They have like 300 Java developers and they had done a proof of concept with Hadoop and they were going to move forward with using Hadoop and they wanted me to come in and address these 300 Java developers at their annual conference that they have internally within the company.
And I said, “Okay, what is it really you’re trying to accomplish with this keynote?” And they said, “Well, we want to educate them on Hadoop, give them the 30,000 foot view of how it works and what the pieces, parts are, but mostly we want them to not be afraid of Hadoop and understand that their jobs are changing a bit, but they won’t be changing that much and they won’t go away.” And so that’s exactly how I tailored the speech, was “Hey, guys. Here’s what Hadoop is. Here’s why the company is going to Hadoop. Here’s why your future as Java developers is very secure with Hadoop: a) MapReduce is Java.” If you’re programming in MapReduce, you’re not using Python or Scala or whatever. That’s what they were trying to accomplish. They wanted to kind of reassure these guys and educate them at the same time. That’s what I tailored the speech to do.
Kirill: Gotcha. Okay, it makes sense. Can you give us another example of a different speech where you had to tailor to a different type of audience?
Scott: Actually, yes. An IT organization asked me to come in and address the executives in their own company and explain to them why the IT organization wanted to use Hadoop to augment the company’s existing data warehouse. You know, I was talking to the senior executive who wanted me to do this and I’m like, “Um, couldn’t you do this?” (Laughs)
Kirill: What was his answer to that?
Scott: He said, “Well, yes, but a) I’m not good at that sort of thing, and b) I think it will carry more weight if it’s coming from a third party.” (Laughs)
Scott: So, I was hired by the IT organization to convince executives to go in a direction that the IT group wanted to go.
Kirill: Okay, definitely very different. I see how that’s working. So what’s your biggest challenge that you face when speaking to whether it’s executives or large audiences like that? Is it hard to get the message across about Hadoop?
Scott: Not too hard, no. I’ve always had kind of a knack for taking something that’s really complicated and breaking it down to its essence and explaining that in a way that most people can understand. To me, the most important thing is to make sure that I understand what it is they’re trying to accomplish with the speech – that’s really the biggest thing – and to work with them and make sure that “Okay, what does success look like at the end of this speech? What’s going to happen? Who’s going to do what? What information is going to get conveyed? Is there a skill that you want them to develop? Is there an idea that you want to plant in their head? What does success look like? How is life different after this speech?”
Kirill: Okay, gotcha. Just on that, I wanted to ask you as well, for people out there who are listening and who want to get into speaking, or public speaking, on data-related topics, whether it’s data science, machine learning or even Hadoop like yourself, what would your biggest one piece of advice be for them? Because obviously data science is very much the kind of topic where you don’t interact with people that much, where you’re very technical, and going into speaking can be a big shift for somebody. What would your best advice for a person like that be?
Scott: I would probably say speaking is a skill just like any other skill, like riding a bike or like coding in Python. You learn it by doing it. If you haven’t done much speaking before, probably the best thing to do is find opportunities to speak. I hate to use the old cliché about “Go join Toastmasters,” but people recommend that because it works.
Kirill: Okay, gotcha. And yourself, how did you get started into speaking?
Scott: Oh, gosh. We had a piano in our house when I was growing up. I grew up on a peanut farm in southern Alabama, so there wasn’t a whole lot to do. I mean, there was plenty to do, but there wasn’t a whole lot to do that was very entertaining. At 6 years old, I started planting myself in front of the piano and tinkering around on that and eventually taught myself how to play piano. I want to say I was 12 when I started playing the organ at the little church around the corner that had 70 people in it. And so from a very young age I was used to being in front of people and I don’t think that I ever had the opportunity to develop stage fright.
Kirill: Okay, so it’s pretty lucky circumstances that you put yourself into.
Scott: Right. And speaking in front of people isn’t much different from playing a musical instrument in front of them, really.
Kirill: Okay, gotcha. When you’re speaking to lots of people – I’m just curious about this for myself – who do you focus your attention on? Do you look at one person, or do you move your eyes around the room to make everybody feel included?
Scott: Oh, you’ve got to move around the room, sure. I mean, if you look at one person the whole time you’re speaking, you’re going to accomplish one of two things. You’re going to make everybody else feel kind of disconnected from you or you’re going to really freak that person out.
Kirill: (Laughs) Yeah, they’d be like, “Why is he staring at me all the time? I’m tired of his gaze.”
Scott: Yeah. So I kind of let my eyes wander the room as I’m talking and kind of gauge different people’s reactions to what I’m saying. Of course, the thing that you never want to see is that person with half-shut eyes looking like they’re about to nod off. (Laughs)
Kirill: Yeah, gotcha. That’s kind of like the next thing I wanted to ask you. How do you structure your presentations in such a way that people don’t fall asleep? Because a very common flaw of technical presentations is that they’re either overpopulated with formulas or overpopulated with text or just the way the presentation is flowing just makes people nod off and they can’t pay attention for more than 8 or 12 minutes. What’s your trick to that?
Scott: Well you have to know the audience, right? There’s nothing wrong with a speech that’s full of technical information if you’re speaking to 300 Java developers. There is everything wrong with a speech that’s full of technical information if you’re talking to a room full of business executives.
Kirill: Okay, gotcha. You want to keep it nice and sweet and short for the business executives.
Scott: Well, keep it high level for sure. You know, don’t get into the weeds with those kinds of guys. And be sure that whatever technical subject you’re talking about relates back to some sort of business problem or business result or something that they can relate to.
Kirill: Okay, gotcha. These speaking events, how do they link up with the work you do at Brilliant Data? Do they detract from your time that you could be spending implementing Hadoop? Or do they facilitate that and actually help you bring on more work into the company or better deliver the projects that you're delivering by then following up with some speaking events?
Scott: Well, I’m a consultant who speaks, not a speaker who consults. Whenever I go and give a speech somewhere, it’s always coming out of an experience that I had with a client somewhere. And sometimes those experiences are kind of humorous. That’s always great really because humour is a necessary ingredient for public speaking. We were talking earlier about keeping people’s attention and humour is great for that. It also makes you seem a lot more human when you’re on stage if you relate a funny story.
Kirill: Yeah, and approachable. Can you give an example of a funny story you’ve used recently?
Scott: So, I started my career really as a network engineer, not even really doing data. I was working for Cisco at the time and they sent me out to this switch site for a major service provider. They were having all kinds of errors. And when I got there, the facility manager seemed very amused that I was there. And so I said, “Yeah, I’m here to troubleshoot the problems with this equipment.” He said, “Oh, yeah. I know where that is. We’ll go up to the top floor.” So we went up there and he showed me the routers that were having the trouble and I said, “Jimmy, what is this bright film on top of the equipment?” He goes, “Oh, that’s where the roof leaks.” I said, “Leaks?! As in this is still going on?” He says, “Oh, yeah. I’ve been sending e-mails trying to get permission to get the roof fixed for like 6 months now.”
Kirill: (Laughs) Well, there is your problem, right?
Scott: Well, yeah. And so, you know, they spent whatever they spent. I’m sure it was not cheap to have Cisco send me out there to troubleshoot this when they really could have figured it out just by calling the facilities.
Kirill: Yeah, okay. Definitely. I see how that can make people a bit more happy in your presentations and lighten the mood up. That’s a pretty cool story. Moving on to some other questions that I have for you, what is a recent win that you’ve had in your career that you can share with us, something that you’re most proud of?
Scott: You know, I would have to say it’s been about a year and a half ago now. Packt Publishing reached out and said, “Hey, you know, we really want to do a training course on Hadoop and we think you’re the right guy to do that.” I said, “Okay, yeah. That sounds like a lot of fun.” They give you this compressed six week time period to get this thing done. And, you know, I know what I want to say, I know how to do this thing that I’m about to demonstrate, but getting it recorded and edited and put together into something that looks good and sounds good is not easy and it’s very time-consuming, whereas I thought, “Oh, yeah, six weeks. This will be done in three.” No, it wasn’t. I think it was done in six weeks and three days. (Laughs) But I think the end result was something that I’m pretty proud of.
Kirill: That’s nice. So you have a course which people can take. So our listeners, if they’re interested they can find your course. Where can they find your course?
Scott: Well, they can go to packtpub.com or it’s also on the O’Reilly site, oreilly.com. It’s called “Learning Hadoop 2”. There is a book by that title and there is a video course by that title and I did the video course.
Kirill: Okay, interesting. And how did you find creating videos? Was it very different to public speaking and was it very different to other forms of education that you’ve done before?
Scott: Well, it’s actually quite different from public speaking because if you make a mistake you get to go back and fix it.
Kirill: (Laughs) So it’s better?
Scott: Oh, yeah. You know, you kind of think about what it is you want to demonstrate and what you want to say about it ahead of time. You know, if you’re going along and you kind of trip over your words or you say the wrong thing or you stub your toe on the leg of the table and you accidentally curse, like I did one time, you just go back and you take that out. Nobody has to know.
Kirill: Totally. Okay, next question is what is your one biggest challenge that you’ve ever faced in your career?
Scott: Well, I’ll tell you. Leaving where I was, which was a very comfortable position, I had a great boss, I had a great team that I was working with. But leaving that and going out and starting your own thing is very difficult. They say the three hardest addictions to overcome are heroin, carbohydrates and a monthly salary. Now, I don’t have any experience with the first one, but I have lost 20 pounds before and I have left a monthly salary and a very comfortable job to start a company before. And those two are very hard things to go without.
Kirill: Okay, gotcha. And was it tough the first few months or even years when you left?
Scott: Oh, gosh, yeah. The first year was—let’s just say it was very educational. (Laughs)
Kirill: Yeah, totally. I find it’s like nothing else. You learn so much while you’re building a company, building a business, building a team. How’d you go about building a team, by the way? How long did it take you to hire your first employee?
Scott: Well, it took a while before I needed an employee. As far as like starting the company, there’s this guy, Alan Weiss, that’s done all these books on consulting and running a consulting practice and all that. Something that he says is, “When you’re 80% ready, go ahead and pull the trigger and figure out the other 20% as you go.” I thought I was 80% ready and then I realized after a year of looking back on that year, I thought, “Oh, you know what? I was really only about 50%.”
Kirill: (Laughs) Yeah. But you pulled it off. That’s good.
Scott: Yeah, it took a while. My first employee was actually a virtual assistant, really is what she was. She’s here in town, but we met in a Starbucks around the corner from my house here. You know, she started up a conversation and we’re talking, and she asked what I was doing and I said, “You know, I’m looking through these websites and I’m going to hire a virtual assistant.” And she said, “Actually, I’ve been out of work for about two years now. I had a baby and stayed home and all that. I’m looking to get back into working again and I’d be very interested.” So that’s kind of how that whole thing came about. I told her, “Okay. Here’s what the job would involve and here’s what the hours would be and all that,” and it just kind of went from there.
Kirill: And how big is your team now?
Scott: I’ve got two salespeople, there’s myself, there’s a team of subcontractors that I’ll use for various things, for example, for user interface design and anything graphical. There’s a guy that I’ve known for 10 years and he’s probably one of the top 10 in the country at that sort of thing.
So when I’ve got a project that’s big enough size that I need to bring people in, I know who to call for that. And we’re about to bring on a couple of interns, actually. I’m extremely fortunate to be headquartered in Atlanta. Georgia Tech is here and they have an extraordinary program for data science and for machine learning. Some of the things they do there are just incredible. We’re bringing on a couple of interns who know data science and want to learn Hadoop.
Kirill: Okay, good. It’s like a win-win.
Scott: Absolutely. You know, if you’re going to do an internship, it’s got to be win-win. There’s got to be something in it for them.
Kirill: Okay, cool. Next question is what is your one most favourite thing about being in the space of data?
Scott: I’d have to say it’s just the diversity of things that you can do, right? I mean, every company has data. It really doesn’t matter what their business is, it generates data. I’ve been fortunate enough to work with Fortune 500 companies and mid-market companies and right now one of my clients is a manufacturing company that makes plastic and metal containers. Every engagement you do as a consultant in this industry, you learn something, you know. Sometimes I learn as much as the client I’m working with. Of course, I don’t tell them that because then they’ll try to—
Kirill: (Laughs) That’s the best. Oh, my God, yeah. I totally understand. I can completely relate to that because the space is growing so quickly and the amount of new technology and methodology that’s coming out all the time is immense. It’s just impossible to know everything. Inevitably, you’re going to be learning. Like, when I create a course, I know some things. But some things I learn as I’m doing the research for the content I’m creating. I’m sure you were the same when you were creating your course. You definitely had to do some research and come across some things that were just coming out at the time.
Scott: And that’s one of the best ways to learn, I think, to be honest with you. If you approach a topic that you know something about and you’re like, “Okay, I need to really get in-depth on this topic and really understand it,” I think one of the best ways to really accomplish that is to think in terms of, “Okay, if I was going to make a course about this, how would I structure this and what would I say?” That was really how I learned how to do machine learning in Spark, is because I’ve got a request to put together a course about it and I said, “Sure, yeah. Okay.”
Kirill: “I’ll learn it.”
Scott: And I was like, “Oh, crap. Now I have to learn Spark.” You know, that wasn’t too much of a challenge because I already knew Python. The thing is, if you know Python and Pandas and scikit, you will know how to use Spark with Python. And that’s one of the great things about Spark that I forgot to mention earlier, is that if you know either Scala or Java or Python or even R, then you can use those in Spark.
Kirill: That’s pretty cool, yeah. I’ve personally – I haven’t done it through Spark but I’ve used R on Hadoop through PivotalR. That was pretty good. And there’s another one – MADlib. Have you heard of that one? I think it’s a Greenplum or a Pivotal development just for Hadoop, a mathematical package that you can apply as well.
Scott: Yes, I’ve heard of it. I know what it is, but I can’t say that I’ve ever worked with it.
Kirill: Yeah, there’s just so many tools. You can’t work with everything for sure.
Scott: Gosh, yeah. I mean, if you know how to code in R, then 90% of that will translate over into using R inside Spark to do machine learning.
Kirill: Yeah, gotcha. By the way, Tableau on top of Hadoop – how do you guys do that?
Scott: It’s not really a whole lot different from using Tableau with any other data source. It depends on what your engine is, and I hate to use that term. For example, in a Cloudera cluster, Impala would be the SQL engine that you would want to use because Impala is just orders of magnitude faster than Hive. But you would just point Tableau to the IP address of one of your data nodes and point it to port 21050, which is the port that Impala listens on. And then from there it’s just like you do with any other ODBC/JDBC connection that you set up for Tableau.
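[Editor's sketch] The connection details Scott describes (a node's IP address plus port 21050) can be written down as a small Python helper. This is only a minimal sketch under assumptions: the host `10.0.0.12` is made up, the `default` schema is assumed, and the `jdbc:impala://host:port/schema` form is the one commonly documented for Cloudera's Impala JDBC driver, so check your own driver's documentation before relying on it. The reachability check uses only the Python standard library and does not log in to Impala.

```python
import socket

IMPALA_PORT = 21050  # port mentioned in the interview for Impala client connections


def impala_jdbc_url(host, port=IMPALA_PORT, schema="default"):
    """Build a JDBC-style URL for an Impala daemon (URL form is an assumption)."""
    return f"jdbc:impala://{host}:{port}/{schema}"


def port_is_open(host, port=IMPALA_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical node address; replace with one of your own cluster nodes.
    print(impala_jdbc_url("10.0.0.12"))
```

Checking `port_is_open` against the node before configuring Tableau is a quick way to separate network or firewall problems from driver problems.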
Kirill: Okay, gotcha. So it’s possible, basically? Short answer is it’s possible—
Scott: Yes. Short answer is, “Oh, yeah. It’s absolutely possible.”
Kirill: Okay, gotcha.
Scott: You have to understand, if you’re working with immense datasets, that those aren’t going to run as fast as smaller ones and you don’t want to package them into a workbook. (Laughs)
Kirill: Gotcha, yeah. Yeah, for sure. (Laughs)
Scott: What we did for a client recently is, you know, we set up Tableau Server to have a live connection into Impala because it was a Cloudera cluster. And then Desktop just connects into Server to get the data and that ran pretty quickly. Didn’t have a whole lot of problems with it because the datasets weren’t really that immense. But even if you’re dealing with just tremendously huge datasets, there’s solutions for that like AtScale, for example. AtScale is a nifty little tool. What it does is it lets you create dimensional cubes on top of data in Hadoop, but then it’s also got this adaptive cache, so it learns what data you access the most and it will actually cache that into memory. So it speeds things up tremendously in terms of Tableau.
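[Editor's sketch] The general idea of a cache that "learns what data you access the most" can be illustrated with a toy example. This is not AtScale's implementation, which the interview does not describe; the class name `FrequencyCache`, the `fetch` callback, and the eviction rule are all hypothetical and chosen only to show the concept of keeping the most-requested results in memory.

```python
from collections import Counter


class FrequencyCache:
    """Toy cache that keeps the most frequently requested results in memory."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch        # function that runs the slow query for a key
        self.capacity = capacity  # maximum number of results held in memory
        self.hits = Counter()     # how many times each key has been requested
        self.store = {}           # key -> cached result

    def get(self, key):
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]  # served from memory, no slow query
        result = self.fetch(key)
        self._maybe_cache(key, result)
        return result

    def _maybe_cache(self, key, result):
        if len(self.store) < self.capacity:
            self.store[key] = result
            return
        # Replace the least-requested cached key only if the new key
        # has been requested strictly more often.
        coldest = min(self.store, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            del self.store[coldest]
            self.store[key] = result
```

With a capacity of one, a query that is requested more often than the currently cached one eventually takes its place, which is the "adaptive" behaviour in miniature.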
Kirill: Nice. That sounds like pretty solid foundation for Tableau. Wrapping up this podcast, I’ve got another question for you, like a visionary type of question or a philosophical one. Obviously you’re deep into the space of Hadoop and data and data science, machine learning. From where you’re sitting and what you’re seeing, where do you think this whole field of data science is going? And what should our listeners look into to prepare for the future?
Scott: Let me narrow that down just a bit and pontificate on the future of Hadoop, if you don’t mind.
Kirill: Sounds good.
Scott: I’m sure you’re familiar with the Gartner hype cycle.
Scott: You know, we’ve gone through that peak where everybody was just all about Hadoop, you know, “We’ve got to have this and we’ve got to put it to use.” Now we’re kind of in that trough of disillusionment where it seems like every time I go online, I’m reading an article somewhere about the death of Hadoop. I don’t think Hadoop is dead at all. I mean, I think that whole idea is actually kind of silly. But, yeah, we’re definitely in the trough of disillusionment, and then afterwards comes the plateau of productivity, when everybody has calmed down and realized that it’s not the cure for everything, but at the same time, the sky is not falling either. And you know, we settle in and we get some work done.
I think that’s where the future of Hadoop is, is that companies are going to start realizing what Hadoop is actually useful for. There is going to be enough people with the skills to do things in Hadoop and we’re going to actually start getting things—it’s not just going to be the Fortune 500s who are able to extract value out of Hadoop. Pretty much anybody will be able to.
Kirill: Gotcha. That’s some good advice. People listening to this, don’t get afraid that Hadoop is dying. It’s not dying. Everything will be okay. Just plough along and it’s going to be an exciting space to be in. In any case, all these skills that you learn, they’re so transferrable to whatever. Even if something does replace Hadoop, it’s not going to be that much different. You will be able to transfer all these skills anyway.
Scott: And that’s another thing, too. It seems like every time I turn around, somebody is comparing Hadoop to Spark, which I think is kind of funny because every time I’ve ever seen Spark, it was running on top of Hadoop. While it is possible to set up a cluster that runs just Spark, the reality is that hardly anybody does it that way.
Kirill: Okay. Thanks for the insights. Thank you very much, Scott, for coming on the show. It was very exciting, sharing all that knowledge. From the surveys that we run and from what I know about our audience, we actually have quite a few executives listening to this podcast and quite a few managers and business owners even. If any of them ever want to get in touch with you to invite you to do a public speech maybe at their organization or help install Hadoop or do some consulting work, where is the best place they can find you?
Scott: Probably e-mail, I would say. It’s [email protected]
Kirill: Gotcha. All right, we will definitely include that in the show notes as well. People listening out there, if you need to contact Scott, you have his e-mail now. And one last question I have for you is, what is one favourite book that you can recommend to our listeners to help them become better at what they do?
Scott: I'm going to up that and say two.
Kirill: Sounds good.
Scott: You know, probably 80% of any job that you do with data is getting the data into a usable format, something that you can work with. O'Reilly has a great book for that called “Data Wrangling with Python.” That book probably has 80% or 90% of the stuff that you’re going to do on a day-to-day basis with data. It’s all about cleaning data, scraping data off the web. I mean, if there is some technique or method for preparing data, it’s probably in that book.
And the second one is Hadoop-specific and that’s “Hadoop MapReduce v2 Cookbook” which is a really unnecessarily long title for a book. That one is by Packt Publishing. I was fortunate enough to be the technical reviewer on that book. Again, it’s just full of day-to-day stuff that you would do with Hadoop from spinning up a cluster in the cloud to bringing in data and setting up a table with Hive. And it even goes into using Mahout, which is another component of Hadoop, to do machine learning. So if there is something that you want to do with Hadoop, it’s probably in there.
Kirill: Sounds good. Thanks a lot. So there we go: “Data Wrangling with Python” and “Hadoop MapReduce v2 Cookbook”. Once again, thank you very much, Scott. I really appreciate you taking the time to come on this podcast and share all of this wisdom and knowledge with us.
Scott: Well, I really appreciate you inviting me to.
Kirill: So there you have it. I hope you enjoyed today’s podcast. Lots and lots of valuable information. As you could see, Scott has very in-depth knowledge on the topic and is definitely in the most advanced frontiers of Hadoop and everything to do with data in organizations, data lakes, and big data and all of these things. He was kind enough to come on the show, spend an hour of his time and share all of these things with us.
So I hope you took this opportunity to pick up some additional knowledge and skills. Personally, I definitely learned a lot as well. My favourite part, of course, was the breakdown of all of these different aspects of Hadoop, all of these elements such as Hive, Kudu, Spark, Seahorse. Some of them I knew about, some of them I’ve worked with, but some of them I haven’t even heard of and I was very excited to learn. I can definitely look back and say now I know some more things about Hadoop, and it looks like this space is constantly evolving. On that note, if you would ever like to get in touch with Scott, if, for example, you are an executive or a manager or you own your own business and you’d like to get Scott to come in and help you with your analytics strategy, make sure to reach out to him. As you could tell from this podcast, this is a person with a huge wealth of knowledge who’s excited to share it, so this is your go-to guy for anything to do with Hadoop and data strategy.
And with that, don’t forget that you can find all the resources mentioned in this episode at www.superdatascience.com/51. There you will be able to find the transcript for this episode as well as a link to Scott’s LinkedIn, his company’s page, and his e-mail. Thank you so much for being here today. I hope you enjoyed the episode. I can’t wait to see you next time. Until then, happy analyzing.