Kirill Eremenko: This is episode number 211 with Javier Luraschi, software engineer at RStudio.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and lifestyle entrepreneur, and each week we bring you inspiring people and ideas to help you build your successful career in Data Science. Thanks for being here today and now let’s make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience podcast. Ladies and gentlemen, I’m very excited to have you on the show today, and today I’ve got a very interesting guest, Javier Luraschi, on the show with us. Javier is a Software Engineer at RStudio. If you use RStudio, you might know a couple of packages that he’s worked on and/or co-authored, packages such as sparklyr, mlflow, cloudml, and many more. And in fact, even if you’re just a beginner in R, you probably have already encountered the dark theme in RStudio, and that’s something that Javier has also contributed towards. So we’ve got a very exciting podcast coming up for you just now with Javier. We talked a lot about Big Data and Big Compute, and most of this podcast is dedicated to Apache Spark.
Kirill Eremenko: So in this podcast you will find out the whole history of Big Data: where Hadoop came from, why Apache Spark was created, how they compare, what Apache Spark is used for, how it’s developed over time, and how it’s developing now. You will also learn a lot about package development at RStudio and some of the exciting things that are happening in this space. And in addition to all of that, you will feel a lot of passion. Javier has a ton of passion for this space, for RStudio, for Apache Spark, for developing packages, for Big Data and Big Compute. So this podcast is full of that. I was personally sitting on the edge of my seat, just enthralled by all these amazing stories that Javier is telling about this whole space of Big Data.
Kirill Eremenko: So all in all, an incredibly exciting, captivating podcast. Can’t wait for you to check it out. So without further ado, I bring to you Javier Luraschi, who is a Software Engineer at RStudio.
Kirill Eremenko: Welcome ladies and gentlemen to this SuperDataScience podcast. Today I have a very special guest calling in all the way from Seattle, Javier Luraschi from RStudio. Javier, how are you going today?
Javier Luraschi: I’m doing good Kirill. How about you?
Kirill Eremenko: I’m doing well as well, and I’m in Australia right now, on the Gold Coast. The weather is quite warm here, unlike Seattle. You said it’s getting a bit cold there?
Javier Luraschi: Yeah, I’ll probably have to go and visit you one of these days ’cause definitely we can use some of your sun.
Kirill Eremenko: Yeah, for sure. And you mentioned winters are quite harsh in Seattle, like how cold does it get?
Javier Luraschi: It’s not that it gets that cold, it gets to about minus five Celsius, which is not terrible, but we have very, very long winters. So definitely expect a visit from me somewhere around March or May. I’m assuming Australia has pretty nice weather most of the year.
Kirill Eremenko: Yeah. Most of the year it’s quite good. The only thing is, the first time I got here I was not really expecting that during winter it’s warm during the day, but then at night the temperature drops to maybe plus eight degrees Celsius, which is fine. But the thing is they don’t have central heating in the buildings. So the building is actually cold and you have to get blankets. So that was a bit unexpected. But other than that it’s really cool.
Javier Luraschi: I would say that at least winter is a time where you can be very productive because, you know, it’s dark and everything and it’s cold. So it kind of puts you in the mood to just get things done, which is good. I’ve kind of been in warm countries, and sometimes getting work done in warm countries gets trickier.
Kirill Eremenko: I know. You work faster when it’s colder.
Javier Luraschi: I believe that’s true for me, but I don’t know if that’s the same for everyone.
Kirill Eremenko: Same for me. All right. Well, Javier, RStudio. You’re on the podcast. It’s really cool to have you. As I mentioned before the podcast, I’ve already spoken to Nathan Stephens and Nisha Iyer, and I’ve recently just been chatting on LinkedIn to Jonathan Regenstein, all from RStudio. You have such a fantastic team there. This is incredible. Every person I’ve met from RStudio is like some sort of very interesting genius in their own field, and I learn so much. I’m really looking forward to this podcast. I’ve got really high expectations for today.
Javier Luraschi: Well, we appreciate that, but it’s definitely fun. I don’t know about all of us being geniuses, but definitely it’s fun. Especially since we’re a distributed company. I don’t even know, honestly, where Jonathan is these days, but whoever you talk with at RStudio, it’s likely that they won’t be in the same city ’cause we’re just all over the US and also Europe.
Kirill Eremenko: It’s crazy. How is it to work in a distributed company like that?
Javier Luraschi: I really love it, honestly. I do feel like it’s different. I feel like any other new job or any change in your life usually has a honeymoon period, right? Where you … Even Data Science, right? I bet you start with Data Science and you’re like, oh my God, this is great. This is the best thing. For the most part, I think working in a remote team has been that way. I do think, and I would give this advice, or unsolicited advice, to your listeners, right? There’s definitely a period where you have to tweak your personal life to also maximize all of the things that are not work related, if you know what I mean, right? So when you’re in the same office, it’s very easy to have those personal interactions and keep up with people and what not, and when you’re in a distributed team, you kind of lose the personal face to face connection unless you proactively say, hey, I want to catch up with these colleagues or friends that I haven’t seen, right?
Javier Luraschi: But net net, I think it’s fantastic. I don’t even know what a commute looks like anymore, and it’s nice to be close to your family and what not, so I definitely highly recommend positions in remote teams.
Kirill Eremenko: That’s so cool. That’s so cool. And I totally agree with the whole concept: losing the personal side of things is something you need to take care of. Like for instance, at SuperDataScience, we have a remote team and we all catch up, we make an effort at least once a year to all get into the same place and spend a week together. And the other thing is, every week we have a new buddy. So everybody in the team randomly gets assigned somebody else, so it could be a director, it could be an analyst together, and for the week, no, not for the week, for the month you guys are buddies. And so you plan your weeks together, you catch up once a week, and that really puts people from different parts of the company closer to each other. So I think you’re right about the personal side of things.
Javier Luraschi: Yeah. So I’m actually curious like what does that mean? ‘Cause you’re still remote, right? Does that mean that you get to talk with this person in Slack or whatever you’re using or?
Kirill Eremenko: So Slack, you can talk to anybody anytime, but that means that at least once a week you need to catch up and talk to that other person in person. So we use Zoom, so not like in person as in-
Javier Luraschi: Yeah, we Zoom as well.
Kirill Eremenko: But like on video, you need to catch up with that person at least once a week, at the start of the week when you’re planning your week, and you spend an hour together, and so you end up chatting about the weekend, what are you doing in your personal life? And feedback’s been really good. People love to find out more about each other.
Javier Luraschi: Yeah, connecting with other colleagues and what not. That makes a lot of sense.
Kirill Eremenko: That’s cool. So tell us, how did you get into RStudio in the first place? Before we continue, I just wanted to let everybody know. So Javier is a software engineer at RStudio. Maybe let’s start there. Before we go into the story of how you got into RStudio, tell us a bit about what exactly do you do at RStudio?
Javier Luraschi: Sure. Well, I mean, as you mentioned, I’m a software engineer, so I write software, right? That’s kind of obvious. But I mostly have been focusing on R packages. For those of you that might not be super familiar with what an R package is, it’s basically R code, R being a programming language, right? And you package this code into modules that you can share with other people, right? So this is actually a concept that you can find in other programming languages, but I think what the R community has going is pretty special, ’cause there’s a very nice relationship between an R package and an actual person that lives and breathes in real life, right? So a lot of the functionality that you use from R is going to come from packages, and those packages have maintainers and authors. And at the end it’s just R code, but it’s packaged in such a way that it makes it very easy for you to reuse and for us maintainers to also keep up with new releases and new features and what not.
Javier Luraschi: So yeah, I do R packages, and specifically I do packages that are mostly related to making R run faster or at scale, or helping you share models with other people. So R packages can basically do anything you want, and the ones that I focus on are mostly about helping you work with Big Data and Big Compute, and specifically with a package called sparklyr.
Kirill Eremenko: Interesting. That’s a very interesting description. I thought that R packages mostly come from CRAN and that they are created by the community. I didn’t know that RStudio purposely creates R packages. Can you tell us the difference between those two, please?
Javier Luraschi: Well, they’re exactly the same. They’re just two different parts of the story, right? So basically, an R package, you could literally see it as a box with code inside it, and CRAN is the store, all the packages are free, right, but it’s where you go and get the actual packages. So there’s people that build these packages in the R community, they don’t need to work at RStudio. And honestly, most of the packages that are on CRAN are not from RStudio at all. So anyone can package a useful piece of code into an R package and then make it available on CRAN. And then anyone from the community can go to CRAN, search for packages, and then install them, right? So think about CRAN as the app store, right? Whenever you go to the app store, you look through apps and then you install them. CRAN is kind of like the app store of the R world. And some of the packages, very few, are developed by RStudio. And many, many, many others are developed by other people in the community.
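For listeners who want to see what that app-store cycle looks like in practice, here is a minimal sketch in R, using sparklyr as the example package:

    # Install a package from CRAN, the "app store" of the R world
    install.packages("sparklyr")

    # Load the installed package into your session to use its functions
    library(sparklyr)

Any package on CRAN follows the same two steps; only the package name changes.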
Kirill Eremenko: Okay. Gotcha. And in that case, how does RStudio decide what the community is going to develop and what is going to be developed within RStudio, like for instance, within your team and within the company? Is there some kind of strategy behind that?
Javier Luraschi: Yeah. Honestly, I don’t think there’s an explicit strategy. This is my own personal point of view, so I wouldn’t speak completely on behalf of RStudio. But the way that I see it is that if we see someone that is already working on a package and it’s a great package in the community, we simply don’t work on it, right? We’re very pragmatic in the way we try to approach problems. So we try to look for the things that are the most painful and the most impactful that the community might need. And we look at a problem, right? Like in my case, hey, Big Data, right? Is someone in the community already providing that? And if the answer is no and there’s a big enough impact that we can help the community, we basically help.
Javier Luraschi: That’s kind of the very high level, like, hey, this is how we see it. ’Cause yeah, I think it could be possible that if the R community would take over the entire set of packages, there wouldn’t be a need for RStudio perhaps to develop packages, right? But something that perhaps, I don’t know if you already talked about this on the series, is the opportunity that exists in Data Science in general, and in my particular case, Big Data. It is so huge that we need more people, right? So it’s not that we’re fighting for packages, which I think is a great place to be at. It’s mostly like, oh my God, Data Science is growing so much and there’s so much to be done, let’s just make sure that we don’t overlap, because that’s just inefficient. But there’s so much opportunity everywhere that we haven’t had that problem of having to do a lot of coordination or what not. It’s mostly pretty obvious when there’s a gap or an opportunity that someone can help with.
Kirill Eremenko: Gotcha. Okay. Very interesting role and lots of packages. So maybe let’s talk a little bit about some of the recent developments. So you mentioned Big Data. Can you tell us a bit about Big Data, maybe even give a quick overview for new listeners on the podcast who are not familiar with the concept of Big Data. What is Big Data and Big Compute? What’s the story that’s happening there?
Javier Luraschi: No, for sure. Honestly, I really love those two concepts, Big Data and Big Compute, and mostly I really like them ’cause there’s people that have very strong feelings one way or the other, and you probably have seen this around. So the way that I like to introduce them is just from a historical point of view. Maybe that’s the most boring point of view, but I think it’s also pretty exciting, right? To me, one of the most exciting things about Data Science and Big Data in general is, if you ask a historian, right, someone specializing in history, hey, how do you divide human development, the development of human civilization? It’s pretty likely that they will mention things like the Stone Age, the pre-industrial era, the Industrial Revolution and what not.
Javier Luraschi: And a lot of them will mention also the information age, right? I’m pretty sure that you might have already heard this term, right? The information age. And this concept basically means that there’s a lot of information that we’re creating that is digital, right? And you can look at one report from the World Bank, which basically creates reports about how the world is changing every year. One report that they had in 2016 kind of tried to analyze how much physical data we have, that’s called analog data in some ways, like books and paper and anything that leaves a paper trail and what not, and also digital information, right? And you can see in the report it’s just growing at exponential growth rates. We don’t have to explain this too much, but just look at how many cat videos we create per day, right? Or Instagram photos or Facebook photos or just data in general.
Javier Luraschi: We’re creating a lot of information. So back in the year 2003, around that time, there were companies like Google that were just starting, right? It’s crazy, but Google is a pretty new company, right? It wasn’t here, what is that, 30 years ago, right? And what Google was trying to do is say, hey, we have the worldwide web and there’s a lot of information out there. Can we make it searchable? And there were obviously companies before them, like Yahoo and Excite and AltaVista for those of you that remember, but Google was one of those companies that said, how can we make search better?
Javier Luraschi: And the first problem that you hit when you’re talking about the web is that you can’t really put the whole web in one computer, right? It’s just too much data, and if you try to put it in one computer, it just doesn’t make any sense. So the way that they solved the problem back then was by using multiple computers, right? You have multiple computers, each with their own hard disk, and across many computers you can load the entire web and make searches. And that’s basically what gave birth to what’s called HDFS, which I believe has maybe been mentioned already in your podcast, which is basically a way of splitting data into multiple machines, right? So that’s kind of the beginning of Big Data: hey, if you have something that doesn’t fit in one machine, you have Big Data, and it’s a pretty clear definition. There’s obvious problems that fall into this category, right? Like, hey, I have my computer, I’m doing analyses, I can’t have all this data in my machine, I need multiple computers. Well, now we’re talking about Big Data.
Kirill Eremenko: And HDFS is the Hadoop Distributed File System?
Javier Luraschi: Yeah, it’s basically the Hadoop Distributed File System. Basically, the way that this started is Google had a research paper where they explained to the world, hey, we have the Google File System, right? We have a bunch of files, and this is how we distribute them. And then some engineers in India said, well, this sounds like a great idea, can we make it an open source project? And that’s where Hadoop came from. It’s this open source implementation of the internal file system that Google described in a research publication, right? And let me know how this-
Kirill Eremenko: Sounds great [crosstalk 00:18:44].
Javier Luraschi: Yes, it’s great okay. I want to make sure that I give enough information but not too much.
Kirill Eremenko: No. I love it. You’re totally right. The historic context makes it even more exciting. Makes it like a story, like a journey. So yeah, I’m just listening. I’m really immersed in your explanation, please continue.
Javier Luraschi: Well, yeah, for sure. So far so good, right? Google was doing their own thing, and we have open source projects like Hadoop that were doing their own thing, and as I mentioned, they were mostly based on disk drives, right? Disk drives even today can store the most information for the lowest cost, so it makes sense to put information there. But then a different project came out around, I want to make sure I get this right, but I believe it was around 2009, there was a project at Berkeley to kind of improve over Hadoop. They said, well, Hadoop is great, we’re doing all these things based on a file system based on disk drives, but can we do it better? Can we do it faster? And sure enough, another open source project started at Berkeley called Apache Spark. And the premise at that point in time, or one of the things that changed from their precursor, was that there was a trend in which RAM, computer memory, was getting cheaper.
Javier Luraschi: Not as cheap as disk drives, but it was getting a bit cheaper, right? And it also happens to be the case that your computer memory makes your computer pretty fast, right? So whenever you go and buy a new computer, the amount of memory is one of the things that you want to go and check: oh, it has four gigabytes or eight gigabytes or 32 or whatever. It makes a big difference in speed ’cause it means that a computer can handle more things in a faster and easier-to-access way, right? Anyways, so basically the Apache Spark project started by figuring out how to create something like Hadoop and improve over it based on memory.
Javier Luraschi: And what they found out is that, sure enough, you can get significant speedups by using memory instead of disks. And one of the things that we like to do as software engineers is sort data. It sounds like such an easy task, just to order data. If you had a list of names of people and you just want to list them in alphabetical order, it happens to be a pretty compute-intensive task, especially when you have a lot of data. Anyway, so what they found out back then is what it used to take to sort 100 terabytes, which is a lot of data: it used to take Hadoop 72 minutes and 2100 computers. So that was with Hadoop. And then with this new framework based on memory, and with a much richer vocabulary of operations that you could do, you could do the same in 23 minutes, but using only around 200 computers.
Javier Luraschi: So that’s crazy, right? It’s an improvement of 10X performance: you need 10 times fewer computers and you can actually sort the information faster. So it was a really, really big deal back then. And in some ways it’s still part of the main trend of trying to figure out, as a society and as humanity, how do we order, how do we make sense of all this information, right? How do we arrange it? How do we store it? How do we answer harder and more interesting questions over and over? So that’s pretty much what Spark is. It’s an in-memory engine that allows you to process any information. One of the things that is different perhaps from Hadoop is that with Hadoop there was only support for one type of operation, called MapReduce. And maybe this is also something that people have heard around when they’re getting immersed into Data Science and Big Data, right? It’s like, MapReduce.
Javier Luraschi: Well, back when Hadoop was designed, there was another paper from Google where they proposed this model. Whenever you want to transform information … I guess I should say that so far we’ve talked about storing information and retrieving it. We haven’t talked enough about, okay, well, information is across multiple computers, how do you actually make it do something meaningful if it’s distributed across multiple computers, right? So the first attempt to solve that, which also came around a couple of years after Hadoop came to be, was another paper from Google called MapReduce, which basically explained to the world: hey, if you have a distributed system, which basically means you have a lot of computers, you can reduce all the operations to mapping, meaning transforming some information on a machine into some other information on the same machine, and then combining the information between machines.
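To make that map-and-combine idea concrete, here is a toy sketch in R on a single machine; the list of chunks stands in for data split across machines, and this is plain base R illustrating the concept, not an actual MapReduce framework:

    # Data split into chunks, standing in for pieces stored on different machines
    chunks <- list(c(1, 5, 3), c(8, 2), c(7, 4, 6))

    # Map: transform each chunk independently (here, a per-chunk sum)
    mapped <- lapply(chunks, sum)

    # Reduce: combine the per-chunk results between "machines"
    total <- Reduce(`+`, mapped)
    total  # the overall sum, computed without ever gathering the raw chunks

A real MapReduce system does the same two steps, but with each chunk living on a different computer and the framework handling the shipping of intermediate results.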
Javier Luraschi: That’s kind of what it is at a high level. And it was pretty good at that time. But it was also constraining in the sense that it didn’t provide a lot of verbs or grammatical constructs to make coding easier, right? It was pretty bare bones at the time. And one of the other big improvements from Spark is that it enabled a big vocabulary of operations to make Big Data easier to analyze, right? So instead of just saying, hey, you need to tell me how to map data on each machine and then aggregate it, it introduced things like: hey, just tell me what you want filtered, what you want the average to be over. More things that now feel closer to data analysis. Like, hey, I just want to count how many earthquakes are available in this data set. That’s a pretty simple question that used to be actually pretty tricky to ask back when we only had MapReduce.
Javier Luraschi: So with Spark you can say things like, hey, from this whole data set, give me all the earthquakes that happened in Australia in this period and give me the count. So that’s a much richer way of expressing and analyzing data when it’s distributed across multiple machines. With previous technologies it wasn’t as simple: you had to provide code for how to process each disk drive on each machine and then figure out how to aggregate and all that. So yeah, it was a big deal and it’s still a big deal. I mean, it has been less than 10 years. So if you look at the Apache Spark project, I mean, we can talk more about the things that have happened and what not. It’s still changing a lot. So it’s by no means what I would consider done. And it’s growing every day.
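As a rough illustration of how that earthquake question might read with sparklyr and dplyr, here is a minimal sketch; the file name and the country and date columns are hypothetical, invented for the example:

    library(sparklyr)
    library(dplyr)

    # Connect to a local Spark instance (a real cluster would use a different master)
    sc <- spark_connect(master = "local")

    # Load a hypothetical earthquakes file into Spark
    earthquakes <- spark_read_csv(sc, name = "earthquakes", path = "earthquakes.csv")

    # "Give me all the earthquakes in Australia in this period, and the count"
    earthquakes %>%
      filter(country == "Australia",
             date >= "2018-01-01", date <= "2018-12-31") %>%
      count()

The same filter-then-count pipeline would run unchanged whether the table holds a thousand rows or billions spread across a cluster.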
Kirill Eremenko: Gotcha. So if I understand correctly, Apache Spark not only is faster than Hadoop in the sense that it works in memory, uses RAM instead of the disk, but it also is simpler to use and operate. Is that correct?
Javier Luraschi: Yeah. That is completely correct. And the thing to say here: it sounds really easy, and it is really easy compared to dealing with MapReduce, but you’re still talking about hundreds of machines, or at least tens of machines. So it’s a pretty hard problem, and it’s surprising the amount of progress that the open source community has made in the past 20 years. Now really anyone, if we want to get into the specifics of, hey, how do I get into this, really anyone can download the tools that they need in less than 15 minutes and get up and running and start doing data processing. That was really state of the art in companies like Google or Yahoo like 15 years ago. So that is kind of the technical explanation of why Data Science and Big Data are such a big deal.
Javier Luraschi: It is ’cause these tools are so much easier today than they were before that we honestly haven’t figured out yet how to fully apply them, and we can maybe talk a little bit about this. But just talking about potential, it’s like, well, Google and Yahoo and other big companies were using these tools and they were great, and now they just became so easy to use. What can we do with them? And I’m sure a lot of your listeners will have more particular ideas, but it’s definitely fascinating.
Kirill Eremenko: Yeah, for sure. And do you have certain packages in R which allow people to work with Apache Spark?
Javier Luraschi: Yeah, that’s a great question. And the answer is yes. And the package that I recommend using is called sparklyr. It’s kind of a corny name out of Spark: you have Spark, and this is sparklyr, or “sparkly R”. Yeah, so we try to name our packages in fun ways ’cause it doesn’t make sense to give them boring names. So yeah, it’s called sparklyr and it’s basically the way that you use Spark from the R programming language, right? So R is obviously a pretty well known programming language, or computing language I would want to say-
Kirill Eremenko: Very well known. My brother is studying at uni now and he’s actually using R. It’s really going well.
Javier Luraschi: There you go. So if you already have an urge for learning R, it’s a natural progression to say, well, what about things like Big Data? And we haven’t even talked about Big Compute, so we need to get back to that. But yeah, definitely if you want to get involved in doing things in cluster computing with Big Data and what not, this R package called sparklyr is a very nice way of getting started. And again, in this particular case, it’s a package that RStudio developed. Myself and other people at RStudio are authors of this package. But it’s actually an Apache-licensed open source project. So it’s completely free. It’s available on CRAN, you can download it, it’s easy to install. And same for Spark. I didn’t mention that, but Spark also happens to be an open source project, it’s Apache licensed. So it’s pretty much there for anyone to use that has that need or interest.
Kirill Eremenko: That’s very cool. That’s very cool. How long does it take you guys to do all the packages? Just out of curiosity?
Javier Luraschi: Well, in all honesty, I feel like we’re still working on it and we’re going to be working on it for a while. So the original package, we worked for a few months on it back in 2016. We actually announced it at useR! 2016. It had the basic functionality, which included things like being able to do data analysis with dplyr, which is one of the most used packages in the R community and allows you to basically describe data manipulation operations through a grammar. That makes it very easy to analyze anything from very simple things like, hey, give me the average of this column, we all used to do that in Excel or what not, or accounting records, all the way to, well, I want to join these two data sets and I want to run a specific computation on them and what not. So it definitely has that breadth of functionality, from very basic data analysis to very complex data analysis.
Javier Luraschi: So that was one of the first features that we decided to include in sparklyr, to allow the community to easily move their already existing analyses, basically analyses that they can run on their own machine. Like if you have a CSV file, just a text file or what not, or an Excel file … By the way, you can use Excel with R if that’s what you like doing. So what that allows you to do is to say, hey, I’m going to use an Excel spreadsheet to do analysis, so you can import it with readr and then you can do data analysis with dplyr. But then now, with sparklyr and the support for dplyr in sparklyr, you can say, well, instead of running this same analysis on this Excel spreadsheet file, now I want to run it on like 10 terabytes of data or what not.
Javier Luraschi: It’s the same thing. Which is kind of crazy. You don’t need to rewrite it. You don’t need to worry about a lot of those things. You do need to pay for the computers, right? We haven’t talked about this, but still, computing is not free, right? So someone needs to pay for those computers and what not. But at least as a user, if you work in a company or you aspire to work in an organization that is going to do data analysis at small scale and also at big scale, these tools allow you to really easily jump from, hey, I just want to do something quick and dirty while I’m on the bus on my way to work, and it works, it’s great. And then you can just literally copy-paste the same code and run it against a ginormous data set.
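Here is a minimal sketch of that copy-paste idea, using R’s built-in mtcars data frame as a stand-in for whatever data you have; the local pipeline and the Spark pipeline are word-for-word the same dplyr code:

    library(sparklyr)
    library(dplyr)

    # Local version: an ordinary data frame on your laptop
    mtcars %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg))

    # Spark version: copy the data to Spark (here a local Spark instance)
    # and run the identical pipeline against the Spark table
    sc <- spark_connect(master = "local")
    mtcars_tbl <- copy_to(sc, mtcars)

    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg))

In practice, instead of copy_to you would point Spark at data that already lives on the cluster, but the analysis code itself would not change.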
Javier Luraschi: Yeah, so that’s kind of like how it started back in 2016. But since then, what we’ve seen is that the Spark community keeps growing. They keep adding new features-
Kirill Eremenko: They have to keep up.
Javier Luraschi: It keeps getting more and more interesting. Like for instance, one of their conferences got renamed last year, I believe, from Spark Summit to Spark + AI Summit. So now Spark is getting intertwined with deep learning as well, and it’s enabling a bunch of other really cool, interesting things to do. So I honestly don’t think that we’re going to be able to call our work done, at least not in the next few months, right? I don’t know if it’s one year, two years or another 10 years, who knows? But it’s definitely a moving target. And I think that’s exciting, right?
Javier Luraschi: ’Cause that means that you’re not jumping into something that started like 10 years ago and no one uses anymore. It’s the opposite, right? There’s things that are very stable that you can be very reliant on and use at scale, and we are very confident that they’re going to give you the productivity results that you’re looking for. And there are other things that are very, very new. They’re exciting and they’re probably going to get there, but they’re still a moving target. So there’s definitely interesting bits and pieces for the newcomers and for the experts to be excited about, specifically about Spark, just ’cause we’re talking about Apache Spark, but also I think in general in Data Science.
Kirill Eremenko: Gotcha. And while we’re on the topic of Spark, I know we have so many other things we can discuss, but I’m really curious: Apache Spark 2.0 was released, I think, at the beginning of last year. What are your thoughts on that jump from Spark to Spark 2.0? What new stuff was added?
Javier Luraschi: Yeah, for sure. Well, first of all, I would say that that was a big jump. Just to give some context, Spark 1.5 was like the first wave where people were starting to get familiar with it, and I feel like it really hit mainstream. 1.6 is still one of the biggest released versions that is still in use today, and the jump from 1.6 to 2.0 introduced many improvements and interesting new features. For instance, one of the ones that I was excited about was Spark Structured Streaming. I know it’s a pretty long name, but it basically means real-time data processing. And I feel like that’s a really good segue to talk a little bit about Big Compute, right? ’Cause we’ve only talked about this Big Data part of the story, which is important and relevant. But then a lot of companies don’t have huge data sets, right?
Javier Luraschi: So what’s the point of Big Data? Why do I need Spark? Is this just hype and whatever? And you know, some of that is true, right? Not everyone needs Big Data, but there’s this whole other side of the coin called Big Compute. And Big Compute, the way that I like to explain it, is mostly just making things faster, right? So when you have simple questions like, hey, count how many records I have, well, that’s usually pretty fast and you don’t have to worry about that, right? But when you start asking questions like, hey, could you please sort, arrange this whole data set, well, that’s a little bit harder, right? And then those people that are on the track of doing Data Science, and again, I’m not an expert on Data Science, I’m a software engineer, but they will get more and more familiar with more complex models, right?
Javier Luraschi: So, some people might already be familiar with linear regressions, which are a type of model that is pretty common and pretty efficient, and a pretty good first step towards modeling. But then you can incrementally get harder and harder models. You want to really fit the data correctly and what not. So a lot of times what happens is, well, maybe I only have 100 megabyte data sets. Well, that’s obviously not Big Data, right? But then you’re running these models, and a lot of times what we see is data scientists just waiting an hour, right? Or it’s like, well, I’m going to go for a coffee ’cause this thing is just running for the next two hours. And it’s like, well, you know, that might be a good case for Big Compute, right? Which really only means saying, hey, let’s divide these tasks onto multiple computers, ’cause even though you don’t have Big Data, wouldn’t it be nice if instead of waiting two hours we gave you the answer in like 20 seconds, right? And it’s like, well-
Kirill Eremenko: Then there is no time for coffee.
Javier Luraschi: Okay, don’t tell your boss, but you can do it faster and still go for coffee, but anyways. So definitely there’s the other side of the coin where you say, well, I don’t care about Big Data, or I don’t have a need for Big Data, but I want to make things faster. And when you get to that point, it’s like, well, how fast is fast enough? I mean, you know, for you and me, maybe we don’t have time-sensitive data sets and we’re like, well, if you give me the answer in like 10 seconds, that’s good enough. Who cares? That’s fine. But there’s a lot of industries out there, like I’m just thinking of stock trading, right? I mean, if I tell my boss, hey, you know, I’m going to give you the answer in 10 seconds, he’s like, what are you doing?
Javier Luraschi: That’s really not going to help me. So there’s a lot of use cases where you want to have instant feedback, right? And one of the ways that we describe this is with concepts like real time. We say, oh, we want to process data in real time, right? Meaning I don’t want to wait for whatever is being processed to finish processing. I need the data right now. And there is definitely a niche there of who really needs real time versus who doesn’t. But Spark Structured Streaming enables those types of really fast execution models that are very useful in some cases, where they make a lot of sense. And the way that they are tackling it, that we’re tackling it with sparklyr, and specifically streaming, is a very profound, interesting way. And the credit goes to the Spark project. But the way that they define a stream-
Javier Luraschi: So we’re talking about streams and we haven’t even really defined, hey, what is a stream, right? Well, the way that we define what a stream is in Spark is like a table, like an Excel spreadsheet. But when you open an Excel spreadsheet, you have a limited amount of rows, right? You open it and you have, hey, I have like 20 rows, or whatever, two million. But you have a set number of rows. The difference with streams is that we consider them as data sets that have an infinite amount of rows, right? It’s not true that they have an infinite amount, but if you’re looking at the stock market and you were seeing what’s the price of the Nasdaq every second, and if you try to see that as a table, well, it’s a table, but, I mean, assuming that the Nasdaq doesn’t crash and disappear from planet earth, that looks like an infinite data set, right? Of data that just keeps coming.
Javier Luraschi: And it also has the quality that you want to process it really, really fast. You don’t want to wait, like, yeah, I’ll tell you tomorrow what a good prediction for the stock market is, right? It’s like, well, I need it instantly. So when you start looking at those data sets where data is coming really fast and it never stops coming, Structured Streaming is the feature that you kind of want to consider. So that’s one of the newer features that I’m excited about. It doesn’t mean by any means that you need to get started with Data Science and Big Data with streaming, right? There’s a million other ways to get started, but it’s definitely one of those features that is pretty exciting. I could talk here for hours, so just tell me where you want to steer the conversation.
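As a rough sketch of how a stream can be treated as an infinite table from R, here is what that might look like with sparklyr’s streaming functions; the prices folder, the price column, and the output table name are all hypothetical, invented for the example:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # Treat a folder that keeps receiving new CSV files as one infinite table
    prices <- stream_read_csv(sc, path = "prices/")

    # Ordinary dplyr verbs run continuously as new rows arrive
    avg_price <- prices %>%
      summarise(avg = mean(price))

    # Keep a continuously updated result in an in-memory table named "avg";
    # "complete" mode rewrites the aggregate on each update
    stream_write_memory(avg_price, name = "avg", mode = "complete")

The pipeline never really finishes: Spark keeps the aggregate up to date for as long as new files land in the folder.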
Kirill Eremenko: I’m glad we touched on Big Compute and that it’s a part of Spark 2.0. So Big Compute, as I understand it in this sense, is the … Or the predecessor for Big Compute is just parallel computing. That’s something I studied back at uni-
Javier Luraschi: Yeah it is-
Kirill Eremenko: About parallel processes.
Javier Luraschi: Yeah it is.
Kirill Eremenko: Okay. Gotcha, gotcha. And Spark takes full advantage in the sense that it’s running in memory, right? Big Compute happens in memory and not on disk drives.
Javier Luraschi: And we were talking about these verbs that exist in Spark that make it really easy to do operations, right? Like, hey, I want to filter, I want to get the average, I want to join these two data sets, right? Since you are already familiar with parallel computing, you probably also remember how hard it is, right? Or used to be, right? It’s not trivial at all to say, well, I have three computers with a small data set, now I want to do a calculation over these three things at the same time. It’s actually pretty tricky.
Javier Luraschi: I wouldn’t necessarily say that it’s fully solved in Spark, ’cause there’s a lot of, you know, if you’re doing genomic analyses or what not, right, you will probably have to do custom things. But just talking specifically about data analysis, that problem is well solved in Spark. You don’t have to worry about, hey, how do I make these things run faster? You just explain it in terms similar to what we were talking about: in R, we have the dplyr package. You can arrange a pretty big data set, like this hundred-terabyte data set that we were talking about. You can still sort that data set in 23 minutes just by saying sort: sort, open parenthesis, close parenthesis, with a pipe before it and what not.
Javier Luraschi: So those things get … Yeah, and you’re right, Big Compute is not a new term. It’s just something that has been getting simpler and simpler. And hopefully with time it will get even simpler. I don’t even know what that looks like ’cause I already think it’s pretty simple. Honestly, we could also talk about that, ’cause I feel like some of the challenges today in distributed computing are still about troubleshooting when things are running at scale. In a lot of cases things will run smoothly and you can run your computation and be done with it. The reality today is that with the harder computations you’re doing, even though it’s easier to express what to do, there is still a lot of, hey, I need to troubleshoot. Why is this machine failing? Does it have enough memory? My hope is that in the future those things will get more and more automated.
Javier Luraschi: I’m just making stuff up at this point, this is almost science fiction, but it would be cool if, before you execute some data analysis and you say, hey, I want to sort this data set, the tool, in this case Spark or sparklyr or what not, would tell you: hey, if you want to run it on one computer, it’s going to take you three hours; if you get 10 computers with eight gigabytes of RAM each or whatever, it’s going to take this much.
Javier Luraschi: That part doesn’t exist yet today. You need to figure out how big or small the cluster needs to be, or get some advice from your system administrator if you work in a big company or a big organization. But hopefully one cool thing that we could do is make it easier to say, hey, we’re going to help you optimize what you need and we’re going to tell you, and then you can execute, or what not. But yeah, there’s definitely a lot of really interesting work there. If a lot of people listening to your podcast feel more on the software engineering track, I would also encourage them to explore those areas.
Javier Luraschi: A lot of the questions that we get, or that I personally get asked today, are: I’m a software engineer, how do I become a data scientist? And that’s totally fine for those people that want to become data scientists. But there’s also a lot of disciplines around data scientists where people can apply their skills-
Kirill Eremenko: Like you. You’re a software engineer working on Data Science stuff all the time.
Javier Luraschi: Right. And I love software engineering and I wouldn’t change it, but it’s surprising that I still find a lot of very meaningful problems and interesting challenges in Data Science without being a data scientist per se, right?
Kirill Eremenko: Yeah. Gotcha. That’s definitely a very interesting-
Javier Luraschi: Point of view I guess.
Kirill Eremenko: Yeah and career path that you’ve decided for yourself.
Javier Luraschi: I mean, honestly, I haven’t thought about other career paths, but I could see how someone that is doing marketing or what not could also take that focus. I don’t know exactly how that would look, but you could say, you know what, I like marketing but I want to complement it with Data Science, and how do I really apply marketing to Big Data and Data Science and what not. So for those people that are curiously looking at Data Science or not, I feel like there are also strong career paths if you stay on whatever you’re doing but put a strong focus on Data Science or Big Data or Big Compute, why not? You stay with what you’re doing and that’s probably just as valid.
Kirill Eremenko: Very, very true. Before we move away from Spark, I just wanted to ask you quickly, from your experience and from what you’ve seen in this space, how difficult is it to learn Spark? You mentioned that it’s quite simple to use, as in it parallelizes a lot of stuff for you. You don’t even have to think much about it. But in general, what would you say, how long is the learning curve from knowing some R programming and how to do Data Science at a basic level in R, to actually knowing how to use Spark and querying big data sets with Spark and with R? How long do you think that would take?
Javier Luraschi: Well, so I would split this into two questions, right? So if you’re starting on your own and you don’t even have a Spark cluster, if you’re literally on your own, I feel like that could be challenging, ’cause it’s like, well, where do you even get the computers? There’s a lot of questions to ask, and how do you-
Kirill Eremenko: Amazon?
Javier Luraschi: Yeah, Amazon. So there’s a lot of services out there. There’s Amazon, Amazon has a service called EMR. There’s a service called Azure HDInsight, there’s Cloudera, there’s Databricks. There’s like 10 different ones, and I apologize for the ones that I didn’t mention. So there’s definitely ways of getting started if you’re on your own. But I think that’s usually not the case. Usually what happens is, well … We can break it down again into two. One is if you’re learning, if you’re like, hey, I want to learn about R and I want to learn about Spark. Well, I have very good news for you, ’cause it’s actually super, super easy. In fact, one of the goals that I have, and that brings me the most joy, is to make it absolutely, insanely easy for you to get started with Spark and R. So you can get started very easily. You download the sparklyr package by installing it as any other R package, and then you literally run spark_install, open parenthesis, close parenthesis. It will install Spark for you, and then you run spark_connect.
Javier Luraschi: Don’t try to do this from the actual podcast. But if you want more information, go to spark.rstudio.com and we’ll help you get there. So it’s really easy. So that’s one case. If you’re a student, definitely you can do it, and the barrier or the cost to enter is super, super low. So give it a shot. The other way where you might end up working with Apache Spark, which is also really easy, is if you happen to end up working in an organization that already has an Apache Spark cluster. It is often the case that there’s a cluster administrator, someone whose job it is to maintain that cluster, or someone that is already paying the bill with Databricks or Amazon or Google or Azure, or what not, right?
Javier Luraschi: So if those clusters are already up and running, and a lot of times the data is already there, it also happens to be very, very simple, ’cause you don’t have the burden of setting that up, right? All you need to ask is, hey, where’s the cluster? What is the connection name? If you listen to … I’m almost sure that Nathan probably talked about connection strings and what not, right? All you need is a character string that tells you where the cluster is, and you basically pass that to spark_connect, and that will get you up and running. And cluster administrators are pretty good at helping other employees or members of that community get up and running. So I think in both cases it’s pretty easy to get started.
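Putting those two getting-started paths together, here is a minimal sketch; the sparklyr functions are the ones he mentions, while the cluster URL is a made-up placeholder for whatever connection string your administrator gives you:

    # Case 1: learning on your own machine
    install.packages("sparklyr")   # install the package from CRAN
    library(sparklyr)
    spark_install()                # download and install a local copy of Spark
    sc <- spark_connect(master = "local")

    # Case 2: an organization with an existing cluster; ask your cluster
    # administrator for the connection string (this one is hypothetical)
    sc <- spark_connect(master = "spark://cluster.example.com:7077")

Once the connection sc exists, the analysis code is the same in both cases.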
Javier Luraschi: So that is, I think, just great news. There’s almost no excuse to not try it out. I would like to say that it’s a linear learning path, but it is not true that doing everything on Spark is as easy, if you know what I mean. As we were mentioning, there’s things that are very easy, and newer things, like maybe Spark Structured Streaming or other topics, that I work practically every day to make easier. But you kind of want to start with small steps, which will get you very, very far regardless. And then as you feel confident and proficient, you just need to keep walking, right? It’s just like a slope with a [inaudible 00:51:38] climb, kind of like if you’re mountain climbing or what not, that type of deal where, well, you start going, it’s not that hard, and definitely at some point you’re going to be hitting harder problems.
Javier Luraschi: But the very nice thing is that to get started it takes very little. And I think that’s important, ’cause if it took a really long time to even do simple things, or what not, no one has time for that. But it gets you up and running and you can do meaningful analysis very easily, which is where we are today. It’s very easy to get started. It’s very easy to learn. And then as your problems get harder and harder, and you are answering more interesting questions, questions that are interesting to you or that bring value to whoever you’re working for or with, you just need to keep working at it. And, I didn’t mention it, but there’s a great community in R. My guess is that some of the other R speakers here mentioned it, but there’s a very nice, warm community around R, and specifically also around sparklyr and what not.
Javier Luraschi: So you can go to resources like community.rstudio.com and just ask for help. You can also go to our GitHub page and look at the issues, and if something feels like you really need help with it, you can open an issue. So you’re also not alone in this, and there’s a lot of resources to get you up and running. So I definitely, at the very least, would encourage everyone, if they’re curious about Big Data and Big Compute and cluster computing, to give it a shot, ’cause you’re going to be surprised. It’s something that feels doable, which would not have been the case a few years ago.
Kirill Eremenko: Fantastic. Thanks so much, Javier, for that description. Hopefully that will encourage more people to jump into learning Spark, especially through R. It sounds like you guys really make it easy, and for those who are already using sparklyr, Javier is making things easier for you in this space. Well, there are so many more things that I wanted to ask you, about Big Data, about your journey into this career and so on. Unfortunately we’re running out of time, and we probably only have time for one more thing. And out of all the things that I have written down, I thought the most important one would be your book.
Kirill Eremenko: So you mentioned before the podcast that you’re working on a book, and seeing how much value you’re giving on this podcast and how passionate you are about the space, I think it would be a shame for listeners not to hear that you and some of your colleagues and friends are working on a book that is going to be published next year. So tell us a bit about that. What is this book going to be about, so people who are interested in this space can look forward to it?
Javier Luraschi: Right. So the name of the book is going to be something along the lines of The R in Spark. It’s kind of a funny name: the R, as in, what is the piece that we’re going to highlight from Spark? R, the R programming language. So we already have a website, and we’re going to put more information there when the book is published, but the website is therinspark.com. So it’s pretty straightforward. For now it’s a bit of a placeholder, but if you’re interested, at least it’s good to keep in your bookmarks. But my goal with that book, well, both with sparklyr and with the book, is to make it the absolute easiest way to get started with Apache Spark. Anything from, hey, what is this thing, which your listeners are lucky enough, having you, to already know, kind of what Spark is, right, and they got a very nice introduction.
Javier Luraschi: So the goal of this book is to make it very, very easy for anyone that opens the book to say, wait, [inaudible 00:55:37] is Big Data? Oh, that’s what it is. What is Spark? Oh, now I get it. Okay, now that I understand it, how do I get started? It’s going to be a very gentle introduction. But being gentle is not going to remove the fact that if you go through the whole book, we want to take you from being a very new user to being close to an expert. And like with any book, you need to do the exercises and practice and what not. But we’re definitely hoping that we can bring a lot of people to the Apache Spark community as well and just help them out. So yeah, definitely, if you want to keep in touch, therinspark.com might be the place.
Kirill Eremenko: Awesome. I’m actually looking at it right now, and for our listeners, if you’re interested in what Javier was talking about today and you found it exciting to listen to, The R in Spark, if you go to the website therinspark.com, there’s a way you can get early access to the book. You just have to send an email to Javier and you’ll get early access. I think that would be pretty exciting, to get early access to the content. So yeah, if you guys are interested, jump on top of that. Javier, we’re out of time. Thank you so much for coming on this show. It’s been an amazing journey just listening, totally, totally captivated. You have so much passion for this space. Before I let you go, I have to ask, what are some of the best ways for our listeners to get in touch with you and follow your work? You mentioned The R in Spark. What other ways can our listeners get in touch with you?
Javier Luraschi: Yeah. I would say Twitter, but I’m honestly so bad at Twitter. I need to listen to 10 tips on how to be a better Twitter user. So Twitter, definitely you can find me there, Javier Luraschi, and that’s just first name, last name. I’ll do my best to answer there. But definitely also the GitHub page, and we also have a Gitter channel. So it’s going to be pretty easy: if you start looking at sparklyr, if you search for sparklyr, one way or another you’re probably going to end up reaching me or other colleagues at RStudio. So I would just say, don’t be afraid, whichever form of contacting us you find. There’s a Gitter channel where we can chat. There’s Twitter, if hopefully I get better at using it. There’s GitHub. There’s the book, which will take us a few months to get to you. But just in general, whichever way you can find us, feel free to keep in touch, and I’m looking forward to that.
Kirill Eremenko: Awesome. And Linkedin, is it alright for listeners to connect on Linkedin?
Javier Luraschi: Yeah, for sure. It’s first name, last name, all together. I don’t think they have a nice search in there, but it’s my first name, Javier, last name, Luraschi. Twitter as well, you can find me there. I don’t know if I missed anything else, but I can give you my address if you want to. Just kidding.
Kirill Eremenko: And your social security number and where the money is.
Javier Luraschi: That’s for sure. Yeah, we’ll put it out there.
Kirill Eremenko: Okay. No, I think that’s all. Well, once again, thanks so much Javier. Good luck with the book. Looking forward to it coming out and I’m sure a lot of people are going to get from this podcast. Thanks so much.
Javier Luraschi: Well, Kirill, thank you so much for having me, and great work with this podcast. I’m really happy I had the chance to be with you today, here.
Kirill Eremenko: So there you have it. That was Javier Luraschi from RStudio. Hope you enjoyed this session as much as I did personally. My biggest takeaway by far was the whole fact that we dove into Apache Spark so deeply and got to know this space so well firsthand, from a person who actually works on developing a package for working with Apache Spark. So Javier is up to date with all of the changes in Apache Spark and he knows exactly everything that’s happening in this space. So it was really great to hear this information, these insights, directly from him. And I’m sure you could also feel the immense passion that Javier has for the space. In fact, it’s even contagious, so I’m sure if you had never heard of Apache Spark before, now you can feel that this is one of those really powerful tools that maybe one day you’ll add to your Data Science toolkit.
Kirill Eremenko: On that note, you can get the show notes for this episode at www.superdatascience.com/211. There you will also find all the things that we mentioned in the episode, all the materials we mentioned in the episode, a URL to connect with Javier and follow him and his career on social media. You’ll also find a link to the website where you can register to get some contents of Javier’s upcoming book, which is going to be pretty awesome based on what we heard today. And of course, if you know anybody interested in the space of Big Data and Apache Spark who wants to learn more, and who would be as excited about this episode as you hopefully were and as I was on today’s show, then please forward them this link. Help spread the word, help other people learn these topics. Apache Spark is a really cool tool that is helping data scientists work with Big Data. So let’s help each other out. Send this episode to anybody who you think might benefit from it, whether it’s a friend, colleague, family member or somebody that you just know this will help out.
Kirill Eremenko: On that note, thanks so much for being here today and sharing this hour with us. Can’t wait to see you back here next time, and until then, happy analyzing.