Welcome to episode #047 of the SDS Podcast. Here we go!
Today's guest is Deep Learning Expert Hadelin de Ponteves.
If you have always wanted to know more about Deep Learning, today’s episode will give you the overview you have been looking for.
Hadelin de Ponteves is back by popular demand to share his knowledge of six types of deep learning models, three supervised and three unsupervised, including an extensive discussion of their applications.
You will also hear us discuss the content of our new course on deep learning, as well as the latest research and the current issues in this exciting field.
Are you ready to take the plunge?
In this episode you will learn:
- Artificial Neural Networks (ANNs) (06:45)
- Why Deep Learning Instead of Machine Learning? (09:10)
- Convolutional Neural Networks (CNNs) and their Applications (18:55)
- Recurrent Neural Networks (RNNs) and their Applications (23:48)
- Long Short-Term Memory in RNNs (LSTMs) (27:57)
- Differences Between Supervised and Unsupervised Deep Learning (29:47)
- Self-Organizing Maps (SOMs) and How They Work (31:11)
- All About Boltzmann Machines (40:23)
- Autoencoders and their Applications (46:27)
- TensorFlow or PyTorch? (51:45)
Items mentioned in this podcast:
- Deep Learning by Ian Goodfellow and Yoshua Bengio and Aaron Courville
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman
- Deep Learning A-Z Course
Kirill: This is episode number 47 with Deep Learning Expert Hadelin de Ponteves.
(background music plays)
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.
(background music plays)
Welcome everybody back to the SuperDataScience podcast. Super excited to have you on board, and today by popular demand, we have a returning guest, Hadelin de Ponteves. Hadelin was on the show for the first time about 6 months ago, that was episode 2, and that was when we just released the Machine Learning A-Z course. And now he's back, and this time we've just released the Deep Learning A-Z course. And so what is deep learning all about? Well deep learning is an advanced branch of machine learning where we use algorithms called neural networks to mimic the human brain in order to be able to solve very complex problems. And our goal with Hadelin in creating Deep Learning A-Z was to create a course that has a robust structure and really covers topics in a simple manner that is accessible to anybody. So you don't have to be an expert in mathematics or in programming or in anything else for that matter. You just need a basic background in high school maths to understand this course and to follow along.
And that's exactly what we created. It's released now and it was actually also featured on Kickstarter, where it had immense support from backers and we're very excited about that. We're very excited to bring this course to the world and today we're going to run through all the six different models that we discuss in the course and give you a quick breakdown.
And finally, for those of you who don't know Hadelin, I wanted to mention that Hadelin has experience in deep learning from Canal+, which is a competitor of Netflix, and as well as that, Google. So this is definitely a person who knows his way around deep learning and we're going to learn quite a lot in today's podcast. So I'm very excited for you to hear this episode and get your glimpse into the world of deep learning and the different models that exist there. And without further ado, I bring to you Hadelin de Ponteves.
(background music plays)
Welcome everybody and welcome, welcome, welcome Hadelin back to the show. How are you going, my friend?
Hadelin: I'm doing very well, thank you. I'm very happy to be back.
Kirill: Awesome. Well you definitely should be because 12,000 views as of today, your podcast, your first episode has 12,000 views. How do you feel about that?
Hadelin: Wow, that's amazing! I didn't expect that many views, so I am very happy about that and I hope this has helped some people and that people could be inspired from it.
Kirill: Yeah, I definitely hope so too. And I'm actually sure of it, and it's raised some very interesting debates, especially the podcast was about machine learning. But what raised debates was your health, since you mentioned you were sleeping 3 hours a day for the past 3 years.
Hadelin: I know, right? Some people were concerned! I got messages on LinkedIn telling me that I should be careful! Thank you very much to all of them.
Kirill: Yeah, that was fantastic. Everybody worried about Hadelin out there, he's still alive. He's still fine, and as always, very energetic. So, mate, what have you been up to? Or what have we been up to over the past couple of months? What projects have we been working on?
Hadelin: Well we made the deep learning course! The big brother of the machine learning course. So yeah, this one is I think more powerful because we dive deep into some more advanced techniques and we code some more advanced stuff, like we use classes and objects to implement some deep learning models from scratch. So I think it is quite new and people will still improve their skills a lot even after doing the machine learning course. So I think it's very complementary and it's basically taking things to the next level.
Kirill: Exactly. Exactly. And also I wanted to point out here that our goal was to create the most disruptive course on deep learning and really collate a lot of information from pretty much everywhere and put in our knowledge and experience into this course and our view on things and put together, most importantly, not just one, not just two, not just three, but six different models on deep learning. And I think we finally can say that we have completed this course and it's there and people are going through it and we're very, very excited about it.
Hadelin: Yes, and actually I finished implementing the last, the very last model of this course yesterday. It was the Boltzmann Machine, which took 14 tutorials. And I was really happy to finish it because it was quite a challenge. Because this is one of the most advanced models in this course because it's a probabilistic graphical model so it handles a lot of probabilities and we had to dive into the MCMC, Markov Chain Monte Carlo techniques, with Gibbs sampling, with the random walks, so a lot of very cool mathematical concepts, but we did it yesterday and I was very happy to finish on that note.
Kirill: Fantastic, and congrats on that. It was definitely a big course to tackle. And for everybody listening today, we're going to talk about deep learning. This podcast is dedicated to deep learning and summarising and running through all of the things that we have covered in the course. And just to give you, even if you're not taking the course, just to give you a feel and taste for what deep learning is about, what type of models exist out there, what type of approaches, techniques, what are the use cases and applications. And so we'll be running through six different models of deep learning, giving you our comments on that and this is going to be quite an exciting podcast. Really looking forward to this session. How about you, mate? You excited about this?
Hadelin: Yeah, very excited.
Kirill: Alright. Ok. So let's kick things off with the very first, very basic model, the artificial neural networks. It goes into the foundation of all of the deep learning concepts and principles and outlines everything there. And probably we'll start off by saying that in artificial neural networks -- oh, by the way, if you're listening to this podcast and you haven't taken the course yet, then probably you should know that in this course, just like in the machine learning course, I was doing the intuition tutorials and Hadelin was doing the practical application tutorials. So you might hear us comment on the deep learning methodologies from both sides exactly in that manner.
And in terms of artificial neural networks, we're trying to model the human brain. So we're creating this structure which is full of neurons which are interconnected. And the fascinating thing, just by creating this course, I personally learned a lot. Especially about the field of neuroscience. I found out that in the brain we have 100 billion different neurons and each one of them is connected to as many as a thousand other neighbours. So I just wanted to get your comment on that, Hadelin. What are modern deep learning neural networks like? Are they anywhere close to that size?
Hadelin: Well, at least that’s our goal. We are trying to build some models that mimic the processes in the human brain. But, of course, we’re not there today. As you said, there are billions of neurons in the human brain and there are lots and lots of connections between the neurons. So far, when we talk about artificial deep learning, the artificial neural networks we make when we solve some problems, they contain several, maybe dozens of layers at most. So we’re very far from what’s happening in the brain, but we’re trying to mimic what’s happening in the brain, we’re trying to reproduce the structure and the connections by adding some mathematics into it, you know, to make some relevant models that can solve complex problems with lots of non-linear relationships. Only models that are close to how the brain works can solve that kind of problem.
Kirill: Okay. And why is that? That’s actually a good segue into what’s the whole point of deep learning? Why can’t we just stick to machine learning and solve all of our problems with machine learning?
Hadelin: Because basically there are limitations in machine learning. In machine learning we can solve a lot of problems, but problems are defined by their relationships: whether they are linear or non-linear, and how complex the non-linear relationships are. When we reach a high level of complexity, deep learning lets us extend the complexity of the model by adding layers and neurons, so we can solve more and more complex problems. With machine learning, well, you have some fixed models and you cannot really extend them, although you could for XGBoost, for example. XGBoost is actually another great model that is used to solve very complex problems. That is because you can add trees in XGBoost, because XGBoost is like a Random Forest but in a much more advanced construction. So that’s the thing – you can add complexity by adding trees or layers to the models, and that’s how deep learning can solve complex problems as opposed to machine learning.
Kirill: Okay, gotcha. So, basically, at some point you reach a level of complexity. For example, recognizing objects in an image. That’s pretty complex, right? And not just recognizing objects from an algorithm which tells you “Look for this type of pixel,” but automatically learning how to recognize objects from many, many thousands of images. That’s a complex problem.
Hadelin: That’s right. That’s a very complex problem. Not only do you have to understand some patterns in thousands and thousands of images, but also you have to understand some patterns in the pixels. And there are thousands and thousands of pixels so that makes the problem really, really complex and, of course, this is not solvable with machine learning, classic machine learning.
Kirill: Okay, gotcha. So, going back to ANNs, we introduced a couple of concepts there, and I know it’s going to be extremely—like, the course is how long, 20 hours or something so far?
Hadelin: Yeah, little more than 20 hours. We’re going to add some more tutorials.
Kirill: Yeah, it’s going to be extremely hard to convey some of these topics in the podcast. But just quickly, we introduced a few concepts and probably the key one is the activation function. What happens in the activation function? Can you tell us a bit about that?
Hadelin: Okay. First of all, there are different kinds of activation functions. We have the rectifier activation function that is used to break the linearity because that’s the whole point of it. We are trying to solve non-linear problems. In order to be able to solve these non-linear problems, we have to break the linearity between one hidden layer to another hidden layer. That’s what the rectifier function is for and actually you explain it very well in one of your intuition lectures.
And then we have the sigmoid activation function, which is another kind of activation function, that is used to output the predictions in terms of probabilities. Instead of returning an exact outcome, for example a binary outcome 0 or 1, you will use the activation function to model the probability of an outcome. So, for example, that will return 0.8, meaning that the outcome will have an 80% chance of being the right prediction.
And then you have some other activation functions, but basically the idea of an activation function is to activate the neuron. It’s used to activate certain neurons in the neural network with a certain weight, and the higher the weight, the more relevant the neuron.
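(Illustrative sketch, not part of the recording.) The two activation functions named above can be written in a few lines of Python. The tiny `xor_net` network and its hand-picked weights are assumptions made for illustration only: they show how a rectifier between layers lets a network compute XOR, a relationship that no purely linear model can represent.

```python
import math

def rectifier(x):
    """Rectifier (ReLU): passes positive inputs through, outputs 0 otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real number into (0, 1), readable as a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def xor_net(x1, x2):
    """Hand-weighted network: two rectifier hidden neurons, one linear output."""
    h1 = rectifier(x1 + x2)        # active when at least one input is on
    h2 = rectifier(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2           # weighted sum of the hidden neurons

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))

# A raw score of about 1.386 (the natural log of 4) maps to roughly 0.8.
print(round(sigmoid(1.386), 2))
```

Removing the two `rectifier` calls turns `xor_net` into a plain weighted sum of its inputs, which is the linearity the rectifier is there to break.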
Kirill: Yeah, gotcha. That is very similar to what’s going on in the human brain, in our brains. As we’re just talking now, what’s going on is neurons are firing up, they’re sending an electrical impulse to the following neuron, then that neuron is getting electrical impulses from many different neurons around it, so up to a thousand different neurons. It’s combining all of that, it’s making a decision whether it needs to fire up or not, and then it’s passing on (or not) an electrical signal of varying intensity to the next neuron and so on.
And when you combine all of that, you have this whole huge army of neurons sending around electrical signals and that is what thought is, that is what all of these ideas that we have, all of these concepts and our interaction with the world, all of our senses, they all translate into that. It’s fascinating when you think about it. All the thoughts that we have are basically just electrical signals running around in our heads.
Hadelin: That’s right, yes. That’s fascinating. And it’s fascinating that we’re managing to reproduce that with the models that we make ourselves.
Kirill: Yeah, totally. That’s very important, that we mention that. So we have neurons in the artificial neural networks and we have activation functions which kind of connect the neurons and facilitate the interaction between the neurons. So in the artificial neural network, we have three types of layers. Tell us about that. What layers do we have in the ANN?
Hadelin: Okay. Of course, let’s start with the input layer. The input layer is the layer that receives the observations. For example, let’s say we’re trying to predict if some customers are going to leave or stay in a company or leave or stay in a bank. Well, the input layer would get the information of the customers that will go through the network to be able to then make some predictions. That’s the input layer. It just receives the observations.
And then we have the hidden layers, and that’s where everything happens. That’s where the learning happens. That’s where the model is trying to learn how to make some correlations between the information of the input layer and the output, which is the final prediction. And this final prediction is going to be compared to the real outcome and that’s how the model is going to learn, because it’s going to compare the prediction to the real outcome and, according to the mistake it might make, it will correct what happened before. That’s backpropagation. And backpropagation will then correct the weights so that it can learn some better correlations next time. So after the hidden layers we have the output layer that gets the final prediction.
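(Illustrative sketch, not part of the recording.) The input, hidden and output layers and the backpropagation correction described above can be sketched with one hidden layer and a single training example. The customer features, the layer sizes, the learning rate and the squared-error loss are all assumptions chosen to keep the example small; this is not the model built in the course.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Input layer -> hidden layer -> output layer (a single prediction)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """Compare the prediction to the real outcome, then push the error
    backwards to correct the weights in place. Returns the squared-error loss."""
    hidden, output = forward(x, w_hidden, w_out)
    error = output - target
    delta_out = error * output * (1.0 - output)
    # Hidden-layer errors are computed with the output weights before they change.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(w_out)):
        w_out[j] -= lr * delta_out * hidden[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * error ** 2

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]

customer = [0.6, 0.2, 1.0]  # hypothetical scaled customer features
left_the_bank = 1.0         # the real outcome for this customer

losses = [train_step(customer, left_the_bank, w_hidden, w_out) for _ in range(200)]
print("first loss:", losses[0], "last loss:", losses[-1])
```

Each call to `train_step` is one round of predict, compare, and correct; the loss shrinking over the 200 rounds is the "learning" being discussed.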
Kirill: And the more hidden layers we add, the more complex becomes our neural network. It’s harder to train, but it might be able to solve more complex problems. Is that right?
Hadelin: Yes, that’s correct.
Kirill: Okay. And you mentioned a couple of interesting terms. First of all, if I’m not familiar with neural networks I might ask the question, “What is the purpose of building a neural network if on the output you’re comparing the results to the real outcome so that means you already have the real outcome? What’s the whole point of modelling an outcome?” Can you comment on that?
Hadelin: Sure. That’s because for any machine learning model, or any deep learning model, there is a training phase and a testing phase. In order for the model to learn something, it needs the real outcomes to learn the correlations. Because if it didn’t have the real outcomes, it couldn’t learn anything. It’s like when a student is practicing for an exam and he’s learning a lesson or a topic: he needs the real answers when he’s training so that he can evaluate how well he understood the course.
Well, that’s the same for a deep learning model. It needs a training phase with real outcomes so that it can make some predictions itself, but then it has to compare these predictions to the real outcomes so that it can correct itself. And then we have the test set. And on the test set we have some totally new observations for which we don’t have the real outcomes and this is what pure predictions are about. This is really pure prediction, where I don’t have anything to compare the predictions to.
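(Illustrative sketch, not part of the recording.) The training and testing phases can be sketched as a simple split of the data. The helper names `train_test_split` and `accuracy`, the 20% test fraction and the seed are assumptions for illustration. The sketch also assumes the common practice of withholding the real outcomes of the test rows from the model during training and using them only afterwards to score its predictions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, real_outcomes):
    """Share of predictions that match the withheld real outcomes."""
    correct = sum(p == r for p, r in zip(predictions, real_outcomes))
    return correct / len(real_outcomes)

rows = list(range(10000))  # stand-ins for 10,000 observations
train, test = train_test_split(rows)
print(len(train), len(test))

# Scoring four hypothetical predictions against their real outcomes.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

The model only ever sees `train` while learning; `test` plays the role of the totally new observations.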
Kirill: Okay, gotcha. And in this case we actually compare to the real outcomes. So some models train with the real outcomes and some models still train, but without the real outcomes. The two different ones are called supervised models and unsupervised models. So artificial neural networks are a type of supervised model and in total we looked at three different supervised models and we looked at three different unsupervised models.
And probably the last comment is on backpropagation. You’ll be hearing the word ‘backpropagation’ quite a lot. It is relevant to supervised deep learning models and basically in summary it’s exactly what Hadelin described. You compare it to the real output, you find the error and then you backpropagate – hence the name backpropagation – you backpropagate that error through the network to update the network in very simplistic terms, and that’s how the models train. So that was artificial neural networks. Let’s move on to number two: convolutional neural networks. CNNs – what are they used for?
Hadelin: CNNs are mainly used for image detection, but they have various other applications like text recognition. How does image detection work? Well, the CNNs are going to try to learn something like some patterns in the pixels and that’s how they will detect some specific features in images to be able in the end to recognize what is in the image. For example, in our course we’re training an algorithm, a CNN actually, that is learning to predict whether there is a cat or a dog in images. To do this, it will try to understand some patterns in the pixels of the images to detect some specific features of dogs and the specific features of cats. And that’s how in the end it can manage to predict if there is a cat or a dog in the image.
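(Illustrative sketch, not part of the recording.) "Patterns in the pixels" are picked up by sliding a small filter over the image. The sketch below applies one hand-picked vertical-edge filter to a 4x4 toy image. Both the image and the filter values are assumptions for illustration: in a real CNN the filter values are learned during training rather than chosen by hand, and, as in most deep learning libraries, the operation shown is strictly a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 2, 0]: the filter fires only on the edge
```

The middle column of the feature map lights up exactly where the dark and bright halves meet, which is the kind of feature a CNN stacks into "ear", "snout" and eventually "dog".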
Kirill: Yeah. And the most fascinating thing for me is that it can learn that – and we got a pretty good accuracy rate in the course – it can learn whether it’s a cat or a dog just by looking at lots and lots of images. How many images did we have in the course?
Hadelin: 10,000. We’re training the CNN on 8,000 images and we’re testing it on 2,000 images. And indeed, on the 2,000 test images we get pretty good accuracy, meaning a pretty good share of correct predictions.
Kirill: About what? What is it like?
Hadelin: I think we reached 82%.
Kirill: 82% accuracy?
Kirill: Which is great. So, it means that the CNN had a look at 8,000 images of dogs and cats which are labelled so it knows that this folder has cats and this folder has dogs. It just looked through them. Without anything else, no tricks or hacks, we specified nothing. We just said, “Look at these images. These are cats, these are dogs, and now decide for yourself what is important for you in an image, what features are you going to be looking for when you’re looking at new images to decide whether it’s a dog or a cat.” And then after all the training we gave it 2,000 images of dogs and cats and out of those 2,000 it got 82% correct. So it identified 82% of the images correctly, that these are dogs/cats. Without us having to tell it how to do it, it learned the thing itself.
Hadelin: Yes, that’s right. And these are our new observations, new images. And besides, this is without any parameter tuning, because in the course we insist on parameter tuning and that’s one of the homeworks we gave to the students. We let the students work on the model to improve them, tune them. Actually, one of the challenges is to get 90% accuracy, which I know is possible so that’s why I gave the challenge. The students who manage to reach that 90% accuracy, they will have the gold medal.
Kirill: (Laughs) Has anybody gotten 90% accuracy yet?
Hadelin: Somebody got the gold medal on the other challenge, which is for ANN. Actually, that was today, so I congratulated that student. He got 87% accuracy, which is very good on the other problem.
Kirill: I think I saw that message. His accuracy was even better than in the training set. It was higher on the test set than on the training set.
Hadelin: Absolutely. That’s the one, yes.
Kirill: Okay, gotcha. All right, very cool. So, that was CNNs – a very brief intro into convolutional networks. And it’s very important to understand the concept behind CNN because that is not in its raw form what goes into self-driving cars, but that is the direction in which things like self-driving cars are aiming. They need to recognize pedestrians on the streets, they need to recognize stop signs, they need to recognize the colour of the traffic light to work completely autonomously. That is your first step in that direction. Would you say that is a fair summary?
Hadelin: Yeah, that’s absolutely a fair summary. Of course, CNNs are used in self-driving cars to detect objects on the street, which is absolutely compulsory.
Kirill: Yeah. And there was actually a challenge to see how well computers can do to recognize different types of road signs and right now they’re already doing better than humans. It’s really interesting. All right, moving on to the next one: recurrent neural networks.
Hadelin: Oh, that was quite a challenge.
Kirill: (Laughs) Yeah. For us, we spent so much time on recurrent neural networks simply because of the challenge that we set ourselves. We wanted to predict stock prices, but in the end we found out that it’s too chaotic to predict with recurrent neural networks to the extent that we attempted that challenge. So probably if you spend more time you might be able to find ways to do it better, but nevertheless a very interesting type of neural network. This is a neural network that has short-term memory. All neural networks have long-term memory: when you train them up, they remember the structure and the configuration of the network, they remember the training, and then they apply that knowledge in testing, and hence they can recognize dogs or cats or solve complex problems. But recurrent neural networks actually have short-term memory, so if it’s going through a dataset in row 51, it will remember what the outcomes were for row 50 or row 49 or row 48 and so on. What kind of applications does that open up for the world of deep learning?
Hadelin: Well, there are many applications. For example, there are applications for natural language processing. You can use recurrent neural networks for natural language processing if you want to predict what’s going to be the next step in a sequence of text. So, for example, you can predict what’s going to be the last word in a sentence and more, so you can increase the complexity of the problem, and it can also be used for video classification. For example, if you want to predict what’s going to happen next in a video, then in that case you will need to combine convolutional neural networks with RNNs because RNNs are basically used to predict a time series, like what’s going to be the next step in a series of events. So that can be used for text, as I said, or for videos as well.
Kirill: Okay, gotcha. So, RNNs are very often used in combination with other algorithms like CNNs. And you can train up an RNN to basically just go through like a huge amount of text and learn from it how sentences are structured, how words follow each other. That way you train up a deep learning model to create sentences, in essence.
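(Illustrative sketch, not part of the recording.) The short-term memory of an RNN comes from feeding the previous hidden state back in at every step. The sketch below uses a single recurrent neuron with hand-picked weights (`w_x`, `w_h` and `b` are assumptions, not trained values) and runs it over two sequences that differ only in their first element.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: the new state mixes the current input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Feed a sequence through the neuron and return the hidden state after each step."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

with_signal = run_rnn([1.0, 0.0, 0.0])
without_signal = run_rnn([0.0, 0.0, 0.0])

# The last two inputs are identical, yet the final states differ:
# the network still "remembers" what it saw two steps earlier.
print(with_signal[-1], without_signal[-1])
```

The remembered value fades a little with every step, which is the limitation that LSTMs were designed to address.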
In the course we actually mention a small video, a 9-minute film, called “Sunspring” which was entirely written by an RNN, by specifically an LSTM, so long short-term memory type of RNN which was trained up on thousands of sci-fi films and then it wrote its own sci-fi film and people actually acted it out, professional actors acted it out, and it participated in the – I think it was the Sci-Fi London Film Festival and it was rated – I think it was in the top 10. So that’s pretty exciting. That’s where the world is going.
Hadelin: Yes. That’s a pretty exciting application.
Kirill: Yeah. And you can think of lots of other applications. Basically, RNNs are there to facilitate the short-term memory which we humans have. It’s so powerful. We don’t just have long-term memory. If you just had long-term memory then you wouldn’t remember the start of this podcast and would be very sad. You’d be sitting there thinking “What are we talking about? What is this whole topic right now?” Short-term memory is a very important tool that we have as humans and therefore, why would we deprive deep learning models of that concept? And that’s why RNNs exist. And probably the last thing on RNNs I wanted to mention here is LSTMs. Can you tell us a bit about LSTMs? This is actually a huge breakthrough for the world of RNNs. What is it all about?
Hadelin: Yes, this is actually the disruption in RNNs because, as you said, the classic RNNs have short memory, but the LSTM is the first RNN to be able to learn long-term relationships. Basically, it’s the first RNN to have long memory. That’s why it’s called LSTM – long short-term memory. Basically, that’s the most powerful RNN model and that’s the one we implement in our course.
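(Illustrative sketch, not part of the recording.) The long memory of an LSTM comes from a separate cell state guarded by gates. The sketch below is a scalar LSTM cell using the standard gate equations; the dictionary of weights is hand-picked for illustration (an assumption, not trained values) so that the forget gate stays close to 1 and the input gate opens only when the input is large.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell."""
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # how much old memory to keep
    write = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # how much new content to store
    read = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # how much memory to expose
    candidate = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = forget * c_prev + write * candidate
    h = read * math.tanh(c)
    return h, c

# Hypothetical hand-picked weights; a trained LSTM would learn these.
params = {
    "wf": 0.0, "uf": 0.0, "bf": 5.0,    # forget gate stays close to 1
    "wi": 10.0, "ui": 0.0, "bi": -5.0,  # input gate opens only for large inputs
    "wo": 0.0, "uo": 0.0, "bo": 5.0,    # output gate stays close to 1
    "wg": 2.0, "ug": 0.0, "bg": 0.0,
}

def run_lstm(sequence, p):
    """Feed a sequence through the cell and return the final (h, c)."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    return h, c

# One signal at the start, followed by twenty steps of silence.
h_final, c_final = run_lstm([1.0] + [0.0] * 20, params)
print(h_final, c_final)
```

With these weights most of the initial signal is still in the cell state twenty steps later, because the forget gate multiplies the memory by a value close to 1 at each step instead of squashing it through a fresh activation.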
Kirill: Yeah. I think it was created in Germany in the 90s.
Hadelin: Yes, 1997 or something.
Kirill: Yeah, it’s pretty cool. And actually it’s very interesting, throughout the course we mentioned the creators of these models and people who came out with them. We’ve got Geoffrey Hinton, we’ve got Yann LeCun, Yoshua Bengio. All of these scientists, I think they’re all from Canada, if I’m not mistaken.
Hadelin: Yann LeCun is French.
Kirill: Yann LeCun is French? Okay. That’s right. But now they live in Canada/America and so they have their own little circle which I think Yann LeCun calls the conspiracy of deep learning. It’s very interesting. And then all of a sudden you have somebody from Germany creating the LSTM. When I found out about them, where the LSTMs came from, it was a pleasant surprise that the whole world is actually contributing to this movement. It’s great.
Okay, next we are moving into the world of unsupervised deep learning. First of all, can you tell us a bit about unsupervised? What does that mean, ‘unsupervised’?
Hadelin: Well, unsupervised means that basically you don’t have an outcome to compare your prediction to, so basically what you have to do is identify some structure in the data yourself that will become your future predictions. So, basically, when we’re starting with unsupervised learning, we don’t have a dependent variable. We don’t have something that we want to explain. But by identifying some segments or some clusters or some structures in the data, we will eventually end up with such a dependent variable. And actually, in the course we highlight this transition from unsupervised to supervised because once we complete our unsupervised deep learning model, we end up with a dependent variable that can lead our model to become a supervised deep learning model.
Kirill: Yeah, that’s definitely a very powerful thing. Unsupervised models, I think they’re more complex generally because each one of the models that we discuss has its own very unique approach to learning, kind of a trick or a hack or a whole new concept that it introduces in order to bypass the fact that it doesn’t have this output to compare to.
We start off by talking about SOMs – self-organizing maps. These neural networks were first created in I think 1982 by Teuvo Kohonen, a Finnish professor, and they are very interesting. By far, they are the simplest out of the whole course just because they are so elegant and the idea behind them is so straightforward, even the mathematics. We don’t talk about mathematics in the course, we don’t go deep into the mathematics so we don’t get bogged down, but the mathematics driving the other models in the course, they’re pretty complex. But for self-organizing maps, they are very straightforward, they’re very easy to code even from scratch. We talk about self-organizing maps and we find out how exactly they work. So what can you tell us about SOMs, Hadelin?
Hadelin: Okay. First, what is the purpose of SOMs? Well, it’s to detect some features in very complex data. You have a high-dimensional dataset, again full of non-linear relationships. It will detect some features which we have absolutely no idea what they are, but it will detect some features inside this data. And how will it do that? Well, it will do that by reducing the dimensionality of the dataset. That’s why at the beginning we start with a lot of the features that are columns in the dataset, and a lot of observations, and eventually we end up having this two-dimensional map. And on this two-dimensional map we can see some neurons that we call the ‘winning nodes’. Basically each cluster, or each winning node, is detecting a certain feature. That’s pretty powerful because even by having a very complex dataset at the beginning, we end up with this cool two-dimensional map that is very visual and that we can use directly to detect some specific features or see some specific clusters, segments in the dataset. And actually we implement a self-organizing map to detect fraud and that’s because fraud is a specific feature detected by the SOM.
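(Illustrative sketch, not part of the recording.) A self-organizing map can be coded from scratch in a few dozen lines. The sketch below trains a 3x3 map on four toy observations with four columns each. The grid size, learning rate, Gaussian neighbourhood, linear decay and the data are all assumptions for illustration; this is not the implementation or the dataset used in the course.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position (row, col) of the winning node for observation x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(data, rows=3, cols=3, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a small map: each observation pulls its winning node, and to a
    lesser extent that node's grid neighbours, towards itself."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # learning slows down over time
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                    step = lr * decay * influence
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += step * (x[k] - w[k])
    return weights

# Two obvious groups of observations, each with four columns scaled to [0, 1].
data = [
    [0.1, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.1, 0.2],
    [0.9, 0.8, 0.9, 0.9],
    [0.8, 0.9, 0.8, 0.9],
]

som = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(som, x))
```

Four columns have been reduced to a position on a two-dimensional grid, and no labels were used at any point; the two groups end up on different winning nodes purely because of the structure in the data.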
Kirill: Yeah. It’s a very visual type of algorithm.
Hadelin: Yeah, it’s very visual.
Kirill: So, you might have a huge dataset with lots and lots of columns which there is no way you would be able to visualize in a concise way, but then the SOM allows you to reduce that, put it on a map, and then you can see all of the connections.
In the intuition side of things we walk through an application of SOMs to the – I think the U.S. Senate or something, part of the United States government, how they vote, and you can see how they’re all plotted in a self-organizing map and then you can see how to read that map and understand what it’s saying about how they’re voting, Republicans versus Democrats and so on. In our practical tutorials, we actually have a very interesting application. Tell us a bit more. What did you prepare for us in the practical side of things to kind of showcase how SOMs can be used?
Hadelin: Okay. So, the dataset contains some credit card applications. Some customers – well some not-yet customers – some people applying to have an advanced credit card in their bank and basically to apply for this credit card they need to fill out a paper and provide a lot of information like their credit score or other types of financial information like their estimated salary.
And basically at the end of the application there is a yes or no whether they got approved for the credit card. Of course, like in any application, there are some people that can cheat, and the goal is to find the potential cheaters, to detect the potential frauds in the applications. So we have no idea how to visualize that at the beginning because we have a lot of information, all the information that was filled in by the customers, but the SOM will manage to detect the frauds by detecting some specific features in the self-organizing map.
I’m not going to say right now what are going to be the features in the SOM because it’s actually something that the students have to guess at some point, but these features, these frauds, are pretty specific, are pretty visual in the self-organizing map so we can really identify them well.
Kirill: Yeah. I found that that was a very cool application of self-organizing maps. When you came up with that challenge for the course I was very excited because I’ve worked with fraud analytics before, back when I was at Deloitte, but I’ve never actually seen self-organizing maps applied to solve the problem and I think that this approach is very interesting because it does allow you to go through lots and lots of data without having to supervise the model, without having to come up with these inputs at the start and therefore it can really find it. It’s like a robot or a computer looking for people who have committed fraud. Like, what are a human’s chances against a machine? Zero, right? Pretty much zero.
Hadelin: Zero. Let’s hope this doesn’t go too far.
Kirill: Yeah. Let’s hope it doesn’t turn into World War III or something like that.
Kirill: Yeah. And in terms of SOMs, what I found when I was creating the intuition tutorials, I found that they’re really very different to all the other five models that we discussed and I would never have thought of classifying them as a deep learning model. I always thought SOMs are just a type of dimensionality reduction model. I think the lack of backpropagation or the lack of more interconnectivity between neurons makes me kind of think that in some way they’re just a bit too simple to be considered as a deep learning model. What are your thoughts on that? Is it maybe just because they’re so elegant that they give you this impression?
Hadelin: Well, first of all, I agree with you. I had sort of the same feeling when I started studying about SOMs. But with no hesitation I wanted to include them in the course because they actually involve neurons. The points in the grids are actually neurons and neurons are attracting other neurons around them according to how they are similar to these neurons. Of course these are very different from the other neural networks that we implement in the course, but this is still a neural network in two dimensions having several neurons. It’s considered as a neural network, and that’s why it’s considered as deep learning. But you’re definitely right that this might be the most simple deep learning model in all these neural networks.
Kirill: Gotcha. But even saying that it’s simple, the applications are massive. Maybe the simplicity facilitates more applications. We briefly mentioned a paper in the course about how SOMs are used to analyse the probability density function of photometric redshifts, so basically an application in astronomy. Then we look at an example of how it’s applied to World Bank data to look at different countries and their prosperities, or poverty and then plot that on a map – and plot that on an actual world map. So applications are immense, limitless for self-organizing maps, and maybe that has something to do with the simplicity.
Hadelin: That’s right.
Kirill: But moving on, next one is our favourite or probably your favourite algorithm – the Boltzmann machine. By far, hands down, the most complex algorithm that exists on this planet. It was so much fun preparing tutorials, for me anyway. I know that you did 14 tutorials about Boltzmann machine and spent like over a week on that just recording.
Hadelin: Yeah, I was very excited recording that.
Kirill: How did that go? Tell us about Boltzmann machines. What are they all about?
Hadelin: Well, first of all, they are very broad. The most important thing is that this is a rupture between what we had before because what we had before were sequential models with a sequence of layers. We started with the input layer and then we had some sequence of hidden layers and then eventually the output layer.
And here, that’s totally different. We now have some neurons and all the neurons are connected to each other and there’s no longer input neurons and an output layer. Actually, what happens is that we have some visible nodes and some hidden nodes and basically the graph – because that becomes a computational graph – the graph is updating itself and the input nodes are getting updated so that in the end they become the output nodes, but there is no output layer.
It’s like a graph full of probabilities, because basically Boltzmann machines are probabilistic graphical models, and this is a graph that is updating itself and in the end it’s maximizing what we call the likelihood, which allows us to make the nodes all relevant to each other with some relevant outputs, which in the end are predictions. Because of all these complexities, because that involves a large number of computations, we have what we call the ‘restricted Boltzmann machine’, where we filter the connections between the nodes. And in the restricted Boltzmann machine, all of the nodes are no longer connected to each other. Only the visible nodes are connected to the hidden nodes and vice versa.
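To make the restricted structure concrete, here is a minimal NumPy sketch of a restricted Boltzmann machine. The only weights are between visible and hidden nodes, so a single matrix `W` holds every connection. The training step uses one-step contrastive divergence (CD-1), a common approximation to likelihood maximization that is not named in the conversation, so treat it as an assumption of this example. The class name, learning rate and toy binary data are also hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class RBM:
    """Restricted Boltzmann machine with binary visible and hidden nodes."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        # One weight per (visible, hidden) pair; no visible-visible or
        # hidden-hidden connections exist in the restricted model.
        self.W = self.rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hidden)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_visible)

    def sample(self, probs):
        return (self.rng.random(probs.shape) < probs).astype(float)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update on a batch; returns the mean reconstruction error."""
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        pv1 = self.visible_probs(h0)  # the visible nodes get updated
        v1 = self.sample(pv1)
        ph1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b_visible += lr * (v0 - v1).mean(axis=0)
        self.b_hidden += lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))


# Hypothetical binary data: two repeated patterns over six visible nodes.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
batch = np.repeat(patterns, 10, axis=0)

rbm = RBM(n_visible=6, n_hidden=3)
errors = [rbm.cd1_step(batch) for _ in range(1000)]
print("first error:", errors[0], "recent mean error:", np.mean(errors[-20:]))
```

Note that there is no separate output layer: the reconstruction is read from the same visible nodes that received the input, as described above.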
Kirill: Yeah. And it’s just a fascinating type of model. I’ll give you an example of how it’s so different to everything else that we’ve seen before. This is an example that we use in the intuition tutorials in the course. Imagine a nuclear power plant which generates electricity. That in itself is a system, is a huge system which has lots of parameters. You have the speed of a wind turbine, you have the temperature inside the main core of the power plant, you have the pressure in certain water pumps. You have lots of parameters that govern how this facility is functioning. But there’s also lots of parameters that are out of your control, parameters that you can’t measure, which might be, for instance, the moisture of the soil in a certain location. There’s so many things, so many moving parts in this whole system, you can’t measure them all at once.
What a Boltzmann machine does, and that’s why it’s a probabilistic model, it generates all of these different states of this nuclear power plant just randomly and then based on your inputs, you’re able to tweak the Boltzmann machine for it to be a better representation of your specific nuclear power plant. Not just any nuclear power plant in the world, possible or impossible, you kind of restrict it — and this is not in any way connected with the term ‘restricted Boltzmann machine’, those are different. Anyway, you restrict this whole Boltzmann machine to being a representation of your nuclear power plant and that allows you to model very interesting things. And why is a nuclear power plant a good example? Because you cannot model all of the scenarios in a nuclear power plant through supervised deep learning because in supervised deep learning you need a training set. And you just don’t have, and it’s impossible to have, lots of training data on nuclear power plant meltdowns.
So if you want to model all the possible situations in which a nuclear power plant explodes or disasters happen on it, in order to be able to prevent them, you cannot do that through supervised learning just because you don’t have the training data. And that is where unsupervised models, for example, Boltzmann machines, come in and they can really help out with this situation because they are generating these scenarios on their own at random and that allows you to go venture into the scenarios that haven’t even happened in real life.
Hadelin: Right. And besides, by giving this example you explain the Boltzmann machines from the energy-based point of view because actually a Boltzmann machine can be seen from two different points of view. The first one is an energy-based model, so exactly as you just explained, and the second point of view is that it’s also a probabilistic graphical model, and that’s what we focus on more during the practical applications. It’s good that students get to see both points of view.
Kirill: And probably if you want to challenge yourself to do something extremely interesting and complex at the same time, in terms of just grasping and getting your head around it, Boltzmann machines are the way to go. There are a number of things that you’ll probably find challenging to get your head around in the space of deep learning that are very significantly reduced because if you understand Boltzmann machines, then anything else is going to be a piece of cake. That’s a very challenging topic but it’s also worth attempting that challenge.
All right, moving on to the final model in the course: the autoencoders. Tell us a bit more about autoencoders. Where does the name come from?
Hadelin: Well, autoencoders are my personal favourite because — well, I don’t know if they’re my personal favourite because I really like Boltzmann machines too, but I like autoencoders because basically they’re quite simple, especially after studying Boltzmann machines, and at the same time they’re capable of solving extremely complex problems and that’s because they’re stacked autoencoders.
Basically, we implemented a recommender system with Boltzmann machines that can predict binary ratings and with the autoencoders we would take it to the next level by predicting some ratings that are from 1 to 5, which is a more complex problem. And yet, the autoencoder is a simpler model because basically the simple autoencoder is composed of three layers: the input layer that gets the observations, the hidden layer that is a layer with a small number of nodes compared to the input layer, and the output layer. By putting the observations into the [indecipherable 47:39] we’re trying to reconstruct the input observations in the output layer.
That’s how it works and that’s why it’s called autoencoders, because basically what happens is a two-step process. The first step is the encoding step, when we try to encode the observations into this hidden layer composed of fewer neurons, and then there is a decoding step, when we decode the hidden layer to reconstruct the input layer. So, we’re trying to replicate the input layer by encoding it and then decoding it.
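The two-step process can be sketched in a few lines of NumPy: encode into a smaller hidden layer, decode back to the input size, and train so that the output matches the input. This is a minimal sketch, not the course's PyTorch recommender system. The function names, the sigmoid hidden layer with a linear output, the plain gradient-descent loop and the synthetic data are assumptions chosen to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(X, params):
    """Encoding step: compress the observations into the hidden layer."""
    return sigmoid(X @ params["W_enc"] + params["b_enc"])


def decode(hidden, params):
    """Decoding step: reconstruct the input layer from the hidden layer."""
    return hidden @ params["W_dec"] + params["b_dec"]


def train_autoencoder(X, n_hidden=2, n_epochs=500, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    params = {
        "W_enc": rng.normal(0.0, 0.1, (n_inputs, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, 0.1, (n_hidden, n_inputs)),
        "b_dec": np.zeros(n_inputs),
    }
    losses = []
    for _ in range(n_epochs):
        hidden = encode(X, params)
        X_hat = decode(hidden, params)
        error = X_hat - X  # the target is the input itself
        losses.append(float(np.mean(error**2)))
        # Backpropagate the mean squared reconstruction error.
        grad_out = 2.0 * error / error.size
        grad_hidden = grad_out @ params["W_dec"].T
        grad_pre = grad_hidden * hidden * (1.0 - hidden)
        params["W_dec"] -= lr * hidden.T @ grad_out
        params["b_dec"] -= lr * grad_out.sum(axis=0)
        params["W_enc"] -= lr * X.T @ grad_pre
        params["b_enc"] -= lr * grad_pre.sum(axis=0)
    return params, losses


# Hypothetical data: 50 observations with 6 columns driven by 2 hidden factors.
data_rng = np.random.default_rng(1)
latent = data_rng.random((50, 2))
X = latent @ data_rng.random((2, 6)) * 0.5

params, losses = train_autoencoder(X)
print("reconstruction error before and after:", losses[0], losses[-1])
```

Because the hidden layer has fewer nodes than the input layer, the network cannot simply copy its input; it has to learn a compressed representation that is good enough to rebuild it.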
Kirill: That’s a great summary. That’s what I meant when I said that all of these 3 unsupervised deep learning models, they have their own ways to get around the fact that they don’t have the data that they need to look at the real outcome. And the way that autoencoders get around that is they make the input be the outcome that they’re aiming toward. In a way, they’re not purely unsupervised, sometimes they’re called ‘self-supervised deep learning models’ because they are in essence supervising themselves through the inputs that they’re aiming to recreate as outputs. It’s their way of cheating the system a little bit, but nevertheless they are extremely powerful. And I remember during the course when we were creating it – or just before we created it – you were super excited that there was some breakthrough in stacked autoencoders and you were like, “We have to include this in the course. We definitely have to. This is all so brand new.”
Hadelin: Yes. And we actually implemented stacked autoencoders because basically stacked autoencoders is what I just explained but with several hidden layers. So that means that there are several encodings and several decodings and that’s exactly what we implement. I think we have two or three hidden layers, but then the challenge, of course, is to change the architecture of the model and that’s very fun. Students will learn how to change the architecture of the model to add more layers and to add more nodes to tune the number of nodes and other parameters. So, they will be some sort of artists trying to create some other structures of stacked autoencoders. And that’s a pretty good challenge.
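The architecture challenge mentioned here, adding layers and tuning the number of nodes, is easier to picture when the layer sizes are a constructor argument of a class. The sketch below is one hypothetical way to lay that out in NumPy. It only builds the layers and runs an untrained forward pass, and the mirrored decoder, the class name and the example sizes are assumptions rather than the course's model.

```python
import numpy as np


class StackedAutoencoder:
    """Autoencoder with several hidden layers; the decoder mirrors the encoder."""

    def __init__(self, n_inputs, hidden_sizes, seed=0):
        rng = np.random.default_rng(seed)
        hidden_sizes = list(hidden_sizes)
        self.n_encoder_layers = len(hidden_sizes)
        # Example: n_inputs=8, hidden_sizes=[20, 10] gives 8-20-10-20-8.
        mirrored = list(reversed(hidden_sizes[:-1]))
        self.layer_sizes = [n_inputs] + hidden_sizes + mirrored + [n_inputs]
        pairs = list(zip(self.layer_sizes[:-1], self.layer_sizes[1:]))
        self.weights = [rng.normal(0.0, 0.1, pair) for pair in pairs]
        self.biases = [np.zeros(n_out) for _, n_out in pairs]

    def _run(self, x, start, stop):
        last = len(self.weights) - 1
        for i in range(start, stop):
            x = x @ self.weights[i] + self.biases[i]
            if i != last:  # sigmoid everywhere except the linear output layer
                x = 1.0 / (1.0 + np.exp(-x))
        return x

    def encode(self, x):
        return self._run(x, 0, self.n_encoder_layers)

    def decode(self, code):
        return self._run(code, self.n_encoder_layers, len(self.weights))

    def forward(self, x):
        return self.decode(self.encode(x))


# Changing the architecture is a one-line change to hidden_sizes.
model = StackedAutoencoder(n_inputs=8, hidden_sizes=[20, 10])
batch = np.ones((3, 8))
print(model.layer_sizes, model.encode(batch).shape, model.forward(batch).shape)
```

Trying `hidden_sizes=[32, 16, 8]` or `hidden_sizes=[5]` gives a deeper or shallower network without touching the rest of the class.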
Kirill: That really ties in with what you said on the first podcast, which was six months ago — can you believe it has been six months since then?
Hadelin: Time has flown.
Kirill: Yeah. And I think what you mentioned there was the artist versus engineer. That’s two categories of data scientists. You’re going to have the artists, somebody who definitely can’t be replaced with machines in the near future, and the engineer who is building these things and who’s at more risk of being replaced by machines. And I actually found it very interesting that you were pointing it out in the practical tutorial as areas where students or any deep learning expert or practitioner has to apply their creativity to come up with a new architecture or a structure for a certain model.
Hadelin: That’s right. The deep learning scientist is definitely not only an engineer, it’s also an artist because there is no rule of thumb in making the perfect architecture to solve a specific problem, so the deep learning scientist has to make some sort of art to find the best model.
Kirill: Yeah. Fantastic! That summarizes our six models. And while we have a bit of time still left, I wanted to get your comments on the main two tools that we used. So we definitely covered off different things in the course, we talk about scikit-learn, we use all of the standard libraries, NumPy and so on because it’s a Python-based course, but the two main tools that we use are PyTorch and TensorFlow. TensorFlow is a Google-developed library and PyTorch is I think a Facebook and Yann LeCun-developed library. Can you give us a few thoughts on that? What are the advantages, pros and cons? Why did we use both in the course? Where do you think the world is going and what should students or anybody getting into the world of deep learning be focusing on at this stage?
Hadelin: What I think is that first at the start when we have to start learning deep learning, I think the best option is to start with Keras, which is a wrapper of TensorFlow. The big advantage of TensorFlow is that it has Keras. That allows you to build some deep learning models with only a few lines of code. So basically you don’t have to implement the deep learning models from scratch. That’s the big advantage of this. Then we have PyTorch and PyTorch doesn’t yet have a wrapper like Keras to implement deep learning models in a few lines of code, but I think it’s actually more powerful than TensorFlow because it can handle even more complex deep learning models. These more complex deep learning models are the dynamic graphs because if we go deeper into the theory of deep learning, we will bump into the dynamic graphs which are basically making a deep learning model that is dynamic and no longer static.
The dynamic graph is a new specificity that PyTorch can handle, unlike TensorFlow. I really encourage the students to handle both libraries because maybe they will be able to solve some problems with PyTorch that they couldn’t solve with TensorFlow. Actually, PyTorch is very recent so there’s still a lot of debate about that. I think it’s not only able to solve some very complex problems, like dynamic graphs, but also I think it’s very practical because students will see that when we implement the models from scratch with PyTorch, well it’s actually very practical and intuitive. Okay, it takes some more lines of code than Keras requires, but we can see in the end when we take a step back at what we implemented that it’s actually very intuitive and then very easy to change the architecture of the model. I highly recommend PyTorch for two reasons: it is able to solve some complex problems, and the very practical side. But for beginners, I recommend more TensorFlow and Keras.
Kirill: Gotcha. Very, very good. I’m very excited that we covered off both of these in the course. In the autoencoders and Boltzmann machine side of things, you even go into things like how to develop your own class and how to use that in deep learning. That’s very powerful, I think.
Hadelin: That’s a big plus of the course. Students learn the important tricks of Python and learn the most important techniques, like building classes, because basically I’m sure that when they start some projects of deep learning, they will have to implement their own model. It is actually 100% sure that they will have to make a class. Classes and objects are very important to understand and handle in Python and I made sure to explain all this and what’s the use of all this and how we can use them to build some deep learning models from scratch.
Kirill: Okay. Thank you very much for sharing all of that and for coming on the show again. This brings us to the end of today’s episode. One thing I would like to ask you to finalize this show is what is the book that you can recommend to people who are interested in learning more about deep learning?
Hadelin: Okay. I highly recommend the “Deep Learning Book” by Ian Goodfellow and Yoshua Bengio and Aaron Courville. It’s actually a book that they can get online at www.deeplearningbook.org. Basically, this is a really, really good book even if they want to dive deep into the math and the theory. It covers pretty much what we discuss in the course, but more on the theory side. It’s a very, very good book.
Kirill: It’s free, right?
Hadelin: It’s free, yes. And then I also recommend some other books. Actually, on Facebook I’m part of a deep learning group where we discuss the latest technologies in deep learning or the latest theory that appears in the latest papers. And then regularly we have some surveys or some questions that are asked in this group, and one of the questions was “What is the deep learning Bible? What is the best book people would recommend?” Of course, there was the “Deep Learning Book” that I just mentioned. Then there was also the book “The Elements of Statistical Learning”. That’s an incredible book because it will give you everything you need, all the basics you need to really have a deep understanding of deep learning.
Kirill: Gotcha. All right, thank you very much for sharing that. I think this was quite an exciting session venturing into the world of deep learning and hopefully quite a few interesting things can be picked out from what we shared today. Once again, thank you so much for coming on the show. I really appreciate your time.
Hadelin: Thank you very much. I was very happy to come back again.
Kirill: Great. All right, see you soon. Take care.
Hadelin: Yeah, see you soon. Bye.
Kirill: So there you have it. That was deep learning for you, a quick summary of all the six models. Of course, there’s many more models in deep learning, but these are the six main ones which we identified and covered off in Deep Learning A-Z and hopefully in this podcast you got a quick glimpse of what the world of deep learning is like and how these models are structured and what they can be used for.
Personally, my favourite out of all the six is the Boltzmann machine just because it’s so cool. Just because it has that — as Hadelin mentioned, you can look at it in two different ways. I look at it through the lens of an energy-based model, so it uses a Boltzmann distribution. Actually, when we were creating the course we modified the Wikipedia article for Boltzmann distribution and that was so cool. There’s actually a tutorial in the course where we go into the Wikipedia article for Boltzmann distribution and add some text, just one line about the Boltzmann machine.
It’s very close and dear to my heart because of what I used to do in my bachelor’s degree with physics. It is a very interesting way to model systems. The Boltzmann distribution is actually used to model energy-based systems that include things like entropy and the gas in the room where you’re sitting, it’s distributed according to a Boltzmann distribution. It’s taking the state that has a minimal energy cost. So that was very interesting for me, to learn about and to kind of include tutorials about it. So that’s my favourite for sure. And I’d be very curious to find out what your favourite is. If you haven’t ventured into the world of deep learning a lot yet, probably this is your first intro to that world, and it’s going to be hard to decide at this stage what your favourite is. But maybe if you do venture into the world of deep learning one day, I’ll be really interested to find out which one of those six models you would prefer the most yourself.
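For readers who want to see the Boltzmann distribution in numbers, the small sketch below turns a list of state energies into probabilities proportional to exp(-E / T). Lower-energy states come out more probable, which is the sense in which a system favours minimal energy. The function name and the example energies are made up for illustration, and the temperature is in arbitrary units with the Boltzmann constant folded in.

```python
import math


def boltzmann_probabilities(energies, temperature=1.0):
    """Probability of each state under a Boltzmann distribution."""
    # Subtracting the minimum energy leaves the probabilities unchanged
    # and avoids overflow in exp for large energies.
    lowest = min(energies)
    factors = [math.exp(-(e - lowest) / temperature) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


# Hypothetical energies for three states of a system.
cold = boltzmann_probabilities([0.0, 1.0, 2.0], temperature=1.0)
hot = boltzmann_probabilities([0.0, 1.0, 2.0], temperature=100.0)
print("T = 1:  ", cold)
print("T = 100:", hot)
```

At a high temperature the three probabilities become almost equal, while at a low temperature the lowest-energy state dominates.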
So that’s that. Deep learning is definitely a field to be at least aware of. That’s where the world is going. As we mentioned in the course intro video, what we’re seeing is a shift from machine learning to deep learning. We’re seeing that deep learning methods are becoming so advanced and sophisticated that just through their sheer complexity, not in the sense of understanding them, but in the sense of how complex the problems are that they can tackle, just through that, they are edging out machine learning methods.
I think Ben Taylor on one of the previous episodes in the podcast actually mentioned that you can beat the accuracy of a machine learning algorithm with deep learning methods very easily. For instance, on the MNIST dataset, which is a dataset of handwritten digits, you have to be very, very proficient in machine learning to get an accuracy anywhere above 95%. On average, you’ll probably get like 90%-92% accuracy. You could just be starting out into deep learning and achieve an accuracy of 98% just because it’s that powerful and you just have to code a few lines of code if you use Keras, as Hadelin mentioned.
So deep learning is definitely something to be aware of. That’s where all the technology is going and in my view, it’s the future of data science. So hopefully this podcast gave you a good overview of what to expect or the different types of buzzwords and buzz terms you might be hearing in the near future.
So there we go. I hope you enjoyed this episode and make sure to leave a review on iTunes if you’re listening on iTunes. It would really help us out and help spread the word about this podcast. And on that note, I look forward to seeing you next time. Until then, happy analyzing.