49 minutes
Business, Database

SDS 129: Database Challenges for Data Science and How to Deal With Them

Welcome to episode #129 of the Super Data Science Podcast. Here we go!

Today’s guest is the Managing Partner at SQL Database Modeller, Ajay Singh.

What do you do when you encounter a problem that bugs you year after year? If you are Ajay Singh, you build a solution that you share with your community.

Today is the day to fill in any gaps in your knowledge of databases as Ajay shares details of his project with us. We speak in depth about normalisation, schemas, relational vs non-relational databases, NoSQL databases, cloud computing, Big Data, and how to organise and visualise your database relationships.

Let’s get started!

In this episode you will learn:

Key Differences Between Databases and Spreadsheets (05:50)
Different Types of Databases (08:03)
All About Database Normalizations, Primary Keys, Foreign Keys, and Schemas (11:11)
Getting an Idea Out Into the World (24:22)
Common Database Design Issues (28:09)
NoSQL Databases, Non-Relational Databases, and Data Integrity (30:26)
Data Corruption and its Causes (36:38)
Databases on the Cloud versus In-House (40:58)
Three Trends to Look Out for in the World of Databases (43:34)

Items mentioned in this podcast:

Designing Products People Love: How Great Designers Create Successful Products by Scott Hurff

Podcast Transcript

Kirill: This is episode number 129 with Managing Partner at SQL Database Modeller Ajay Singh.

(background music plays)

Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.

(background music plays)

Welcome everybody to the SuperDataScience podcast. Today on the show I have Ajay Singh, who is one of the co-creators of the project SQLDBM, or Database Modeller. This is an online tool where you can design your databases from scratch to make sure that you design them in a way without flaws or, which is something that is probably more relevant to data scientists, you can reverse engineer your databases. So you can upload an XML file of your database (as long as, of course, you receive the authorisation from your management) and understand how the databases are interconnected. So you can access the data faster.

So that’s a really, really cool insight and we’ll talk more about that on the podcast. Some of the topics that we will go into include normalisation, schemas, relational vs non-relational databases, NoSQL databases, cloud computing, Big Data, and a couple more. So the way to approach this podcast is as a quick glossary or crash course in databases if you are not very comfortable with all the terms and all of the developments that are happening there. Databases, I find, are very important because they lay the foundation of our work as data scientists. That’s why, in this podcast, we have an overview of lots of different terms. You will find that we touch on lots of different topics and everybody will find something that is interesting for them. And on that note, let’s get started. Without further ado, I bring to you Ajay Singh, who is the Managing Partner at SQLDBM.

(background music plays)

Welcome ladies and gentlemen to the SuperDataScience podcast. Today I’ve got a very exciting guest on the show, Ajay Singh, who is one of the Managing Partners at SQLDBM. Welcome Ajay, how are you going today?

Ajay: I am good Kirill, thank you so much for having me on this podcast.

Kirill: Fantastic, fantastic. And you are calling in from San Diego, is that correct?

Ajay: Yes sir, I’m in San Diego.

Kirill: Awesome. Well first of all, happy new year. We’re recording this on what, the 4th of January. So it’s super exciting. Or is it 4th of January for me, 3rd for you. So happy new year, it’s really exciting to be in 2018.

Ajay: Yes, thank you and same to you.

Kirill: Awesome. Did you have a fun celebration?

Ajay: Yes sir. We had a great 2017 celebration.

Kirill: Goodbye 2017 and hello 2018, yeah?

Ajay: Yes, yes, exactly.

Kirill: Fantastic. Awesome. Well so today is going to be an interesting podcast. It’s going to be a bit different to what we normally talk about, because we’re not going to be talking about analytics models and different applications today. We’re going to talk about something that lies in the foundation, something that we often, as data scientists, overlook, but something that we need in order to do all our analysis. And that something is the database underneath what we do. So we’re going to talk about database architectures, structures, potential errors, how to deal with databases, and you, out of all people, are one of the most well-positioned to give some advice to us, to me and our listeners, on that. So I’m very excited about today’s session. How about you, you excited about today?

Ajay: Yes Kirill, I’m very excited about today’s session.

Kirill: Awesome. Let’s start maybe with the basis. Let’s answer the question why so that our dear friends who are listening to this show can understand why it’s so important for us to know more about databases. What would your answer be? Why is it important for a data scientist? And I know through your project you deal with a lot of software developers, but on the other hand, for a data scientist, why is it so important to better understand database architectures, structures, types, how they work and what goes into a database?

Ajay: I think it’s very important, even for a data scientist, to know how a database is designed and where the data is actually stored. Knowing your database, or your data, is very important for analysing the data. Where the data is actually stored, which database it is in, which schema it is stored in, how the tables relate to each other: all of that really helps data scientists to analyse the data. That’s where a database design tool that helps them visualize their database really helps with understanding and analysing the data.

Kirill: Gotcha. Because we have lots of different levels of listeners on the show, for those who are just starting out in the space of data science, analytics and just data in general, can you give us a quick two or three sentence answer to the question what is the difference between an Excel spreadsheet and a database? What are the core differences there?

Ajay: In an Excel spreadsheet the data is pretty much flat. In a database, especially in a relational database, the data is all tied together, and that really helps to maintain the integrity of the data. You cannot have disconnected data which does not belong to the other tables, or errors of that kind in your data. I’m just trying to think of a good example. For example, you cannot have an order without a customer, but in an Excel sheet you can have an order without a customer.

Kirill: Gotcha. So it’s more structured in that way?

Ajay: It’s more structured, it’s more tight, it helps to have the data integrity, it helps to keep all the basic concepts of the database.

Kirill: Gotcha. And the other thing I would probably add to that is Excel is great if you’re just starting out, but I find it has one flaw. Tell me what you think about this. In Excel you can put formulas and numbers and text and all sorts of things. You can mix them up. You can mix data and logic inside, and you have to. You know, to put a formula into an Excel spreadsheet, you have to put it into a cell right next to a number. But in a database, you have the data and the logic separately. You have the data in the database and the logic goes into the scripts that help you interact with the database. What are your thoughts on that?

Ajay: Yeah, I think you got it right. Excel is more of visualizing your data, but your database is structuring your data, how you want to store the data, but in Excel it’s more like a reporting format.

Kirill: Yeah, and also I feel it’s very important for data scientists to be aware of how databases work and stuff, because where do most organizations store their data? Not in Excel spreadsheets. They store them in databases, right?

Ajay: Right, exactly.

Kirill: Okay. So tell us in overview, what are the main types of databases that people can encounter out there? What are the main brands of different databases? Because there’s a couple of different ones. Which ones would you outline?

Ajay: So, there is Microsoft SQL Server, there is MySQL, there is PostgreSQL, there is Oracle; there are tons of different databases out there. These days there are NoSQL databases as well. Currently what we are supporting through SQLDBM is mainly SQL Server databases. Soon we are going to add MySQL as well, because there is a huge community which works in MySQL. So there are many different types of databases out there in the market.

Kirill: Awesome. I guess it’s a good time to talk a little bit about your project. What is SQLDBM all about?

Ajay: SQLDBM is mainly an online database design tool. It helps with database modelling or database design: it helps you to visualize your database, to see what the different tables in the database are, what the different schemas are, and how they’re connected. It also helps users find any loopholes in their database, what they can fix, and how they can improve their database design.

Kirill: Okay, gotcha. We spoke a little bit before the podcast, and for our listeners I want to outline the different kinds of value I feel we can convey in this episode. I kind of feel there are three groups that we can give most value to. First of all, it’s anybody doing data science. As we discussed, anybody doing data science needs to be aware of databases and have a certain understanding of schemas, relational versus non-relational, NoSQL databases and other things that are out there.

Then there’s a second group, which is data scientists who need to design databases, not complex databases, but simple databases in order to feed data into their analytics projects. For instance, you might be creating a Business Intelligence dashboard, or you might be creating some sort of model like a logistic regression model to classify your customers or predict churn and you need to roll that out. So you need to design the database that will work with this model so that it can be repeatable in the future.

And then the third group I was thinking is data scientists who are on the verge of software development, because SQLDBM is a great place for software developers, but in terms of data scientists that are on the verge of software development, that is kind of more into the space of AI, machine learning, deep learning, where you need those databases because you’re almost developing an end product.

So, I guess let’s start with group number one. I would love to pick your brain about the most important terms that people throw around there in terms of databases. So, where do you think we should start? Should we start talking about some schemas or maybe tables and procedures and normalization, other things? What would you say is the most important thing?

Ajay: I think the most important thing is the normalizations, looking at your tables and seeing how they’re connected with each other. Are they properly connected with each other, do you have proper foreign keys, do you have proper database constraints in your database. That really helps to have a nice database output.

Kirill: Okay, let’s start there. Maybe let’s start with an overall overview. We have a database which has several tables; those tables are linked, every table has a primary key and a foreign key. What is a primary key and what’s a foreign key?

Ajay: A primary key mainly tells you what is the uniqueness on that table, so what is the unique ID, or what is the unique column that defines this as the row of that table which doesn’t repeat at all.

Kirill: Okay. So, for instance, customer ID in a list of customers table.

Ajay: Yes, exactly.

Kirill: Gotcha. And then foreign key in that same list could be what product they purchased, right? So, for instance, customer number 001 could have purchased product A, and customer number 002 could have also purchased product A. So it doesn’t have to be unique, but the foreign key links to the other table, right?

Ajay: Yeah. Foreign key guarantees that this ID also exists as a primary key in the parent table.

Kirill: In this case the product table.

Ajay: Exactly.
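
The primary key and foreign key rules described above can be sketched with Python’s built-in sqlite3 module. This is only an illustration: the table and column names (customers, products, customer_id, product_id) are assumptions taken from the example in the conversation, not anything from SQLDBM, and SQLite stands in for whichever database you actually use.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory database with one primary key per table
    and a foreign key from customers.product_id to products.product_id."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE products (product_id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        """
        CREATE TABLE customers (
            customer_id TEXT PRIMARY KEY,
            product_id  TEXT NOT NULL REFERENCES products(product_id)
        )
        """
    )
    conn.execute("INSERT INTO products VALUES ('A', 'Product A')")
    # Two customers may point at the same product: a foreign key need not be unique.
    conn.execute("INSERT INTO customers VALUES ('001', 'A')")
    conn.execute("INSERT INTO customers VALUES ('002', 'A')")
    return conn


def try_insert(conn, customer_id, product_id):
    """Attempt an insert and report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO customers VALUES (?, ?)", (customer_id, product_id)
        )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


conn = build_demo()
print(try_insert(conn, "003", "A"))  # new customer, existing product
print(try_insert(conn, "001", "A"))  # repeats a primary key
print(try_insert(conn, "004", "Z"))  # product 'Z' is not in the parent table
```

The second and third inserts are refused by the database itself, which is the guarantee Ajay describes: the ID must be unique in its own table, and a foreign key value must exist in the parent table.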

Kirill: Gotcha. Okay, got that out of the way. So what is normalization then?

Ajay: There are different levels of normalizations out there. For example, it helps you to make sure your data is not duplicated so there is no redundancy in your database. For example, you have a customer name and the customer name should not exist in another table also because then there can be a conflict on that.

Kirill: Okay, gotcha. When we were recording the database course last year, I learned about three types of normal forms – 1NF, 2NF, 3NF. We won’t go into too much detail, but just as a summary, what you’re saying is that the customer name should only be in the customer table, it shouldn’t be repeated. What’s the point of repeating the customer name inside the product table when we know what they purchased or something like that, is that right?

Ajay: Right, exactly.
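
As a small sketch of the point about not repeating the customer name, the example below keeps the name only in a customers table and looks it up with a join. The table names, column names and sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The customer name lives in exactly one place.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- Purchases refer to the customer by ID and never copy the name.
    CREATE TABLE purchases (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        product     TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO purchases VALUES (10, 1, 'Product A'), (11, 1, 'Product B');
    """
)


def purchases_with_names(conn):
    """Join purchases back to customers to recover the name when needed."""
    return conn.execute(
        """
        SELECT p.purchase_id, c.name, p.product
        FROM purchases AS p
        JOIN customers AS c ON c.customer_id = p.customer_id
        ORDER BY p.purchase_id
        """
    ).fetchall()


# Renaming the customer is a single-row update; every purchase sees the new name.
conn.execute("UPDATE customers SET name = 'Acme Inc' WHERE customer_id = 1")
rows = purchases_with_names(conn)
for row in rows:
    print(row)
```

Had the name been copied into every purchase row, the rename would have needed to touch each copy, and a missed row would leave two conflicting names for the same customer.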

Kirill: Okay, gotcha. So that’s normalization. What about schemas?

Ajay: About schemas, they’re basically to group your tables under different categories. For example, you can have a sales schema which has all the sales-related tables. You can have a users schema which can have all the users’ roles/permissions tables. So schema is basically to categorize your tables.
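
To give a feel for grouping tables under schemas, here is a rough sketch in Python with sqlite3. SQLite has no CREATE SCHEMA statement, so attached in-memory databases are used as a stand-in for schemas such as the ones in SQL Server; that substitution, and the schema and table names, are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each attached database acts as a namespace, similar in spirit to a schema.
for schema in ("sales", "users"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")

conn.execute("CREATE TABLE sales.orders (order_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE sales.invoices (invoice_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE users.roles (role_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE users.permissions (permission_id INTEGER PRIMARY KEY)")


def tables_by_schema(conn, schemas):
    """Return {schema name: sorted list of table names in that schema}."""
    result = {}
    for schema in schemas:
        rows = conn.execute(
            f"SELECT name FROM {schema}.sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
        result[schema] = [name for (name,) in rows]
    return result


grouped = tables_by_schema(conn, ("sales", "users"))
for schema, tables in grouped.items():
    print(schema, tables)
```

Queries then address a table by its qualified name, for example sales.orders, which is how the category stays visible wherever the table is used.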

Kirill: Okay. So when somebody is dealing with a database—let’s say a data scientist goes into a company and opens up their database. What are some of the first things they need to look at and maybe some of the challenges that they will face when trying to understand how to access the data that they want?

Ajay: One of the challenges they will face is users don’t have a good visualization of their database so they don’t know which tables they need to join to bring which data. So if they have a good visualization tool or some kind of a diagram which can help them to quickly grasp where their data is stored and which tables are connected, that would really help them to write nice queries and get the data out of it.

Kirill: Okay, so how would you say to overcome that challenge?

Ajay: How would they work on that challenge?

Kirill: Yeah. How would they overcome it?

Ajay: Well, they can overcome by having a good tool which can reverse engineer their database into a good diagram which they can quickly grasp and see how tables are connected. And that’s how they would overcome.
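
Even without a diagramming tool, the relationships between tables can often be read from the database’s own metadata. The sketch below does this for SQLite using its foreign_key_list pragma; it is not how SQLDBM works internally, and the demo tables are assumptions for illustration. Other database systems expose similar metadata through their own catalog views.

```python
import sqlite3


def list_relationships(conn):
    """Return (child_table, child_column, parent_table, parent_column)
    for every declared foreign key in the database."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    edges = []
    for table in tables:
        # Row layout: id, seq, parent table, child column, parent column, ...
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            edges.append((table, row[3], row[2], row[4]))
    return edges


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    CREATE TABLE order_items (
        item_id  INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders(order_id)
    );
    """
)

edges = list_relationships(conn)
for child, child_col, parent, parent_col in edges:
    print(f"{child}.{child_col} -> {parent}.{parent_col}")
```

Each printed line is one edge of the diagram, and it also tells you which columns to use when joining the two tables.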

Kirill: Interesting. So are those tools common in organizations? Like, if you just go up to the database manager, you just say, “Hey, can I get the diagram of the database?” What would the steps be there?

Ajay: Actually, there are not many tools out there that give a good visual aid for databases. That’s where we saw a huge gap in the industry for many, many years, which motivated us to build this SQLDBM tool. It really helps users to visually see their database, understand it, and find what they are missing: whether they are applying proper normal forms, whether it is properly connected, whether they have the proper foreign keys, whether they have all the constraints that they need. That’s where SQLDBM plays a role also.

Kirill: Interesting. So SQLDBM doesn’t just help you design the database from the start, it helps you analyse an existing database. Is that right?

Ajay: Exactly. It can reverse engineer your database and you can pull your database into SQLDBM and then you can adjust it as per your needs, you can enhance it, you can fix it, you can create nice different diagrams, subject areas, so that you can have a proper categorization of your database. Instead of having 100 or 200 tables in a single folder, you can split them up into different diagrams and see them clearly.

Kirill: I thought SQLDBM is an online tool. So you have to upload your whole database online or am I getting something wrong?

Ajay: It is an online tool, but it helps you to reverse engineer your database. The beauty of SQLDBM is it does not require any credentials of your database. So you don’t need to share any credentials, you just run a query in your database, it pulls the XML and then you upload it and then it works from there.

Kirill: Oh, I get it now. An XML is just a file that contains the structure of the database, but not the data.

Ajay: Exactly.

Kirill: So you’re not uploading 10GB of data. You’re just uploading—

Ajay: No, no. We’re just uploading the structure of your database.

Kirill: That’s so cool. That eliminates the whole privacy issue, right? You don’t have access to the data so there is no privacy concern.

Ajay: Yes, exactly. Because we understand it’s a huge concern with a lot of companies, a lot of users, and it should not be there. I mean, the user wants to look at the structure only, so why share those very important credentials of the company?

Kirill: Right. The only thing that will be in that XML file is the names of the tables and the names of the columns, right?

Ajay: Exactly.
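
To illustrate the idea of a file that carries structure but no data, here is a minimal sketch that writes table and column names from a SQLite database into XML. The XML layout (database, table and column elements) is a made-up format for this example and should not be taken as the format SQLDBM expects; the tool supplies its own export query.

```python
import sqlite3
import xml.etree.ElementTree as ET


def schema_to_xml(conn):
    """Describe tables and columns as XML without reading any rows."""
    root = ET.Element("database")
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        table_el = ET.SubElement(root, "table", name=table)
        # table_info rows: cid, name, declared type, notnull, default, pk
        for _cid, col_name, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            ET.SubElement(table_el, "column", name=col_name, type=col_type)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Alice Example');
    """
)

xml_text = schema_to_xml(conn)
print(xml_text)
```

The output mentions customers, customer_id and name, but the row for Alice Example never appears, because the function only reads metadata.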

Kirill: That’s the only thing. So if a person is comfortable sharing that, if there’s no intellectual property associated with the way they design the database in the company, a manager should be totally comfortable allowing a data scientist to do this investigation and understand, “How does this thing work? How is this database structured so I can get my data much quicker?”

Ajay: Exactly. You got it right.

Kirill: Okay. That’s really cool. How long ago did you guys launch SQLDBM?

Ajay: I think it’s been around six months now.

Kirill: Okay. By the way, for our listeners, it’s sqldbm.com. Is this right, it stands for SQL Server Database Modeler? Right?

Ajay: Yes, it’s SQLDBM, which is SQL Database Modeler.

Kirill: Okay, gotcha. What kind of uptake have you seen in the past six months? How frequently are people using your tool?

Ajay: It surprises us every day. We have seen such enormous growth in the number of users the last few months. And all the feedback that we get is phenomenal. People really love this. Me personally, I have been in the software industry for more than 15 years and there is not much out there, any good tool which does such a great job. I’m not being biased because it’s my tool, but it actually is a beautiful product.

Kirill: Actually, you guys have previews on the front page, I’m looking at it. You have ‘Design your own database online,’ ‘Forward engineering/Reverse engineering,’ and I’m watching this .gif for the reverse engineering and it shows exactly that. Somebody is copying an XML file, uploading it to your tool and then – bam! It pops out with a map of the database and you can drag it around and you can see how the tables are connected. That is so cool.

Ajay: Yeah, and it’s very easy and very intuitive to understand how to use it. You don’t need any user training for that. It’s very intuitive actually.

Kirill: Awesome. And if you don’t mind me asking, does it cost anything? What’s the price to use a tool like this?

Ajay: Currently it’s absolutely free for all our users.

Kirill: Oh, fantastic.

Ajay: Yeah, but down the line there might be some price attached to it, but it’s going to be very reasonable compared to anything out there.

Kirill: Okay, very cool. You’re obviously putting time into this project and effort. Why are you giving it away absolutely free to anybody who needs this tool?

Ajay: Currently our driving force is that there is not much out there and we want to help the software development community, especially the database development community, to have a nice tool which can really enhance their productivity. It’s kind of a way to give back to the community. We really enjoy all the feedback that we get and that’s our motivation for now.

Kirill: For sure. That’s awesome. I’ll add to that. It’s definitely very useful for software developers and database administrators, but for data scientists—if you’re listening and you’re a data scientist working with databases, I’ve been there and I’m sure lots of us have been there, you’re running around the organization trying to understand where to get the data, how does it work. This looks like a really cool tool that can really help you out. So if you’re scratching your head trying to understand how to best structure your next project, maybe check it out. I think it could be very, very useful. Okay, that’s the project you’re working on now. Interesting. How did you even start into this space? Tell us a bit about your background, what you studied and what path brought you to creating SQLDBM.

Ajay: Well, I did my engineering degree and my Masters, both in computer science, and I’ve been in software development, writing programs and designing and working with databases, for more than 15 years. What drove us to SQLDBM is that on all the projects I worked on, I was looking for a good tool which could really help us visualize and analyse the data that we have, and there was not much out there, even from Microsoft, who have had SQL Server for many years. So that’s what motivated us to design something which can be really useful for finding all the design issues that you have in the database.

Kirill: Okay, that’s good. You have a few cofounders, is that correct?

Ajay: Yes. We are a team of three people. All three of us are cofounders. We’ve been in the database community for many, many years and we’re all working day and night on it and getting it out there.

Kirill: Okay, cool. So how did you guys meet?

Ajay: One of them is my childhood friend, and the other one went to school with him, so that’s how we met and that’s how we are rolling. (Laughs)

Kirill: (Laughs) Gotcha. The reason I’m asking these questions is because data science is a really powerful space. Anything to do with data is powerful. I’m just trying to understand how does somebody who has an idea, how do you take that idea from just being a thought in your head to being something that is benefitting the community and looks like a really inspiring project? I can imagine you wanted some tool that can help you analyse these things, you had this idea, but then what’s the next step? Did you go and just write it down on the paper, start programming something? How did you get from there to where you are now?

Ajay: Yes, we started writing it down on paper. We started thinking about how it should look, how it can help, how quickly users can understand how to use it, how it can be visually easier for everybody. Even for someone who is not from a database background, or who is a complete beginner with databases, how can they easily understand and visualize their database? That’s how we designed all the different panels and different backgrounds. You won’t believe it, Kirill, but even for the tiniest colours on the tool we have put great thought into it: it should not distract you from your tables, it should be subtle. So there is a lot going on in our lives with it.

Kirill: Yeah. And I have to say it looks really cool. I like the choice of dark black and dark grey. It looks very stylish and at the same time—

Ajay: It’s very easy on your eyes also.

Kirill: Yeah. That’s really cool. What are your plans? This has been live six months and you’ve been getting some great feedback. You said you want to implement some other SQL languages, some other tools in there?

Ajay: Yeah, we have a lot of plans for SQLDBM. We are about to launch MySQL support in the next couple of weeks I would say. And then we have another feature coming up, collaborative feature, because we realise designing your database or knowing your database is always a team work. You need to have different team members working on the same project or looking at the same project at the same time.

Kirill: That’s pretty cool.

Ajay: Yeah, so that’s coming up. And you can have different colour schemes for different tables. That’s coming up. And then after MySQL we have plans to add PostgreSQL as well as Oracle. I will tell you one thing, Kirill. The main driving factor for new features is our customers. They tell us what they want. It’s not us who are the final authority. Yes, we help to steer this ship in different directions, but all our customers are a part of this whole team.

Kirill: That’s really inspiring. I’m always actually impressed by people in the space of technology who go out there and just create something from scratch like this. And the reason is—are you not concerned that a bigger company like Microsoft might just come along and see that, “Oh, wow, this is a really cool thing that the world needs,” and they could create something similar? And because of their resources you could just end up way behind. What are your thoughts on that?

Ajay: (Laughs) Well, that’s a very good question. I don’t know what would happen if some big company like Microsoft would come and create a replica of that. They could do that, but they could do that many, many years ago also. We reinvented something, but we accomplished a lot of things. We got different ideas from different applications, and that’s how we brought it together. So, to be very honest, we are not scared about that, but let’s see where the future takes us.

Kirill: That’s really cool. I love that attitude. You know, at the end of the day you’re helping people, changing the world one little step at a time, and that’s already good enough. We’ll see what happens in the future. Okay, awesome. All right, so we’ve got this tool. What else did I want to talk about? You mentioned there are design issues. You know, when designing a database, your tool can help prevent design issues. Can you give us a few examples of design issues data scientists should look out for that can be a hindrance in their work and their analytics? What are some of the most common database design issues that exist out there?

Ajay: I think one of the common database design issues is not having the proper data integrity. Sometimes database developers, other developers, do not connect their databases and do not follow the proper database concepts of normalizations. Initially their application looks good, but down the line when the data grows, then they get performance hits and they get all different kinds of database conflicts. Once they are down in that area, then it’s very tough and it’s very expensive for the whole company to fix those issues. That’s why the database designing or any good tool really helps to visualize it and take a look at how the data is connected, what should be connected and what is not connected. That’s where it helps.

Kirill: Okay. It’s a problem that happens at scale most of the time, right?

Ajay: Yeah.

Kirill: So if you don’t have smart database design—for the data scientist that means the speed of access to the data will be slow if you have a large database.

Ajay: It would be affected, yes.

Kirill: Okay, gotcha. Anything else that pops to mind? Any other major design issues that could be a flaw?

Ajay: Yes. I have noticed many times there are many hanging tables which are not connected and which are not in use at all. So that eats up a lot of space and not only space, but you have to back up your databases and do lots of stuff on that. SQLDBM really helps you to clean all that stuff that you really don’t need.
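
A rough way to spot such hanging tables yourself is to list every table that has no foreign key pointing in or out. The sketch below does that for SQLite; the table names are invented for the example. A table with no declared links is only a candidate for clean-up, since it may still be in use, so treat the result as a list to review rather than a list to drop.

```python
import sqlite3


def find_unconnected_tables(conn):
    """Return tables that neither declare nor receive a foreign key."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    connected = set()
    for table in tables:
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            connected.add(table)   # the child side of the relationship
            connected.add(row[2])  # the parent table it points to
    return [table for table in tables if table not in connected]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    -- Hypothetical leftover table with no links to anything.
    CREATE TABLE tmp_import_2016 (raw_line TEXT);
    """
)

hanging = find_unconnected_tables(conn)
print(hanging)
```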

Kirill: Okay, that makes sense. That’s very cool. Thanks for that. You mentioned before NoSQL databases. What kind of databases are those?

Ajay: NoSQL databases are basically flattened out databases. They do not follow the relational structure. They are not connected with each other and they are not transactional actually. They are mainly used for searching the databases, for big data, and having the data availability all across the globe. That really helps. But they are not for systems like banks where you want to make sure that the transaction actually happened, the money went out from this account and it went into another account at the same minute. It’s not for transactions, but it’s really good when you want to have really big data and distributed data.
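
The flattened shape Ajay describes can be pictured with plain Python dictionaries standing in for documents. This is not the API of MongoDB or any other NoSQL product, just an assumed in-memory model, and the order and customer values are invented. Each order document carries its own copy of the customer details, so nothing in the store itself prevents two copies from disagreeing.

```python
import json

# Each document is self-contained: the customer is embedded, not referenced.
orders = {
    "order-1": {
        "customer": {"id": "001", "name": "Acme Ltd"},
        "items": ["Product A"],
    },
    "order-2": {
        "customer": {"id": "001", "name": "Acme Inc"},
        "items": ["Product B"],
    },
}


def names_per_customer(documents):
    """Collect every name that appears for each customer ID."""
    names = {}
    for doc in documents.values():
        customer = doc["customer"]
        names.setdefault(customer["id"], set()).add(customer["name"])
    return names


# Customers whose embedded copies disagree with each other.
conflicts = {
    customer_id: sorted(found)
    for customer_id, found in names_per_customer(orders).items()
    if len(found) > 1
}
print(json.dumps(conflicts))
```

Reading one order needs no join, which suits search and distribution, but keeping the copies consistent becomes the application’s job rather than the database’s.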

Kirill: What about MongoDB? Is that a NoSQL type of database?

Ajay: Yes, MongoDB is a NoSQL database.

Kirill: One of the most popular ones, right? After the ones you mentioned, SQL Server, MySQL, PostgreSQL and Oracle, I think MongoDB is in fifth place worldwide.

Ajay: It is. We haven’t played with it, but yes, it is one of the popular NoSQL databases.

Kirill: Okay. I’ve always been fascinated by the way they work, these NoSQL databases. A very interesting world. Okay, relational versus non-relational – what are your comments there?

Ajay: Well, I’ve noticed a lot of our users have actually asked for a NoSQL database support, which is non-relational database support.

Kirill: Oh, it’s the same thing? Non-relational is NoSQL database?

Ajay: Exactly. Because there is no data integrity guarantee there. There is no data integration. So I was wondering, why are they asking for a non-relational database support in SQLDBM? And we figured out they just want to have the conceptual design in front of them even though it is not physically connected like primary key-foreign key, but they want to have a visual representation of their data.

Kirill: Okay. So that’s possible and better in NoSQL databases?

Ajay: Yes.

Kirill: Okay. Could you explain a little bit better what is integrity in a database? What do you mean by that?

Ajay: Data integrity mainly goes back to the normal forms, which is making sure your data is not duplicated in different tables and your data is not conflicting with other data values. Let’s go back to the same example. Say you have a customer table and an orders table, with customer ID as the primary key of the customer table and order ID as the primary key of the orders table. If customer ID in the orders table is a foreign key, then it makes sure the owner of this value exists in the customer table. So you cannot have an order with a customer ID which does not exist as a customer.

Kirill: Okay, got it. And if that rule can be broken, then you have data integrity issues.

Ajay: Yes.
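
When the rule has not been enforced, broken references can still be found after the fact. The sketch below, using sqlite3 with assumed table names and sample rows, declares no foreign key at all and then looks for orders whose customer ID has no match in the customers table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- No foreign key is declared here, so any customer_id is accepted.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO orders VALUES (100, 1), (101, 99);
    """
)


def find_orphan_orders(conn):
    """Return (order_id, customer_id) for orders with no matching customer."""
    return conn.execute(
        """
        SELECT o.order_id, o.customer_id
        FROM orders AS o
        LEFT JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
        ORDER BY o.order_id
        """
    ).fetchall()


orphans = find_orphan_orders(conn)
print(orphans)
```

A check like this is a reasonable first step in data preparation: any row it returns is an order that a plain inner join to customers would silently drop from your analysis.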

Kirill: Okay, interesting. I wanted to ask you a question about the future. We are seeing so much more data around the world. It is coming from all over the place and it’s growing exponentially. What are you seeing for databases? How are databases going to be and will they change in the next 3-5 years? What are your thoughts on that? Which ones will be the most popular ones? Will organizations shift away from relational towards non-relational or will everybody be using big databases and Hadoop and Spark type of systems? What are your thoughts on that?

Ajay: I think non-relational databases are getting very popular, but that does not mean they will eat up all the space even for the relational databases. They are totally different markets. For example, relational database is always going to be needed wherever you need transactions to happen, wherever you want to make sure the data goes in multiple tables at the same time. But non-relational database is going to be there where data integrity is not an issue, but the issue is the data must be distributed, the data must be available across the globe even though it doesn’t happen at the same second. For example, Google or Facebook, all those categories fall into non-relational databases. But all the banking, all the stocks, they fall into relational databases.

Kirill: Okay. And is it the case that for unstructured data you have to use a non-relational database?

Ajay: It’s not necessary, but it’s an overhead for the database to keep structure when they actually don’t need structure.

Kirill: Okay, gotcha. Interesting. So, Facebook and Google and the like, they’re used mostly—like, the search stuff, they’re mostly non-relational databases?

Ajay: Yes.

Kirill: Okay. Another thing I wanted to ask you, sometimes we encounter—you know, databases can have issues, they can have integrity problems and therefore the links might not be correct between tables or somebody might have inputted the wrong data into a wrong table and stuff like that. A huge part of the data science process is to prepare the data at the start. It takes about 70% of the time to prepare your data for analysis, and a lot of those issues that we encounter are problems in the data itself and some of them can come from data integrity issues or incorrect inputs, but some of them actually come from corrupt data? So when can databases get corrupt? What are some circumstances when you might have data which is in a completely good database which has integrity and the data is correct, but somehow it gets corrupt? What can happen?

Ajay: If your data is corrupt, even in a good database design, I would say probably it has not put around the proper transactions. For example, in the banking system, if there is a table of where the money goes out and another table where the money goes in, the money goes out, but the money going in had an issue. It was not put around the transaction, but over time you will notice both accounts have no money at all because one of the queries failed but another one passed. That could lead to a corrupt data issue even in a good database design.

Kirill: So that failure is because of—

Ajay: Because of the transaction.

Kirill: So maybe there’s a connection issue or—why would that failure occur?

Ajay: Because there was no wrapper transaction put around all the queries. If there are different insert statements, but the transaction was not put around the whole block of inserts.
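[Editor's sketch: the banking example can be illustrated with a transaction wrapped around both statements, so that either both take effect or neither does. This is a minimal illustration, assuming SQLite through Python's built-in sqlite3 module; the accounts table, the transfer function, and the amounts are hypothetical and do not come from the episode.]

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    with conn:  # one transaction: commit on success, roll back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        ).rowcount
        if credited != 1:
            # The money has already "gone out"; raising here undoes that debit.
            raise ValueError("destination account not found")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

transfer(conn, 1, 2, 30)  # both statements succeed

try:
    transfer(conn, 1, 99, 30)  # the credit fails, so the debit is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

[Without the wrapping transaction, the failed second transfer would have left account 1 debited with no matching credit, which is the kind of corruption described above.]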

Kirill: Okay, interesting. Does that mean there should be some sort of supervising script to make sure everything is working okay?

Ajay: Yes. There should be checks at the end.

Kirill: Like QA, that type of thing?

Ajay: Exactly.

Kirill: Okay, whether it’s manual or automatic. And I’ve always wondered this question: Can databases get corrupt if you leave your server in the sun or you have a hard drive and you’re carrying it on a bicycle and there’s lots of vibration or something like that?

Ajay: That’s definitely a hardware issue. Yes, it can get corrupted. It would be tough to recover. I’m not the right person to answer that, but some networking guy would be the right person to answer on the physical conditions of the hard disk.

Kirill: Okay, gotcha. Let’s go through a couple of quick questions that I have. First one is, what’s the biggest challenge that you’ve had? You weren’t always running this project, but I’m assuming you were working with databases yourself, so what’s the biggest challenge you’ve had throughout your career?

Ajay: I guess the biggest challenge is an ongoing challenge. It’s telling the community how important it is to design their databases properly. This for me is the core or the centre point or starting point of any project out there, even for the data analysts, to design their database properly. Once it goes off-hand, then it becomes very, very expensive for the whole team to recover that. So the biggest challenge is to make them aware of this issue.

Kirill: Okay. Well, hopefully we’ve done that with this podcast, made everybody aware of the issue of designing. When you say that it’s very costly, that led me to a thought that a lot of organizations that are older, that have been around for some time, they have their databases in-house and indeed it’s very costly to change things, they have to update them and it costs millions of dollars. I mean the hardware. What are your thoughts on in-house versus cloud? There are so many cloud solutions right now for databases. One of the biggest ones is Amazon, with their different types of servers that are on demand or other types of subscriptions. What are your thoughts for organizations that do in-house databases versus databases in the cloud?

Ajay: I personally like databases on the cloud because it gives a lot of flexibility. You can bring up many instances at the same time. You have database availability in all the places. It’s not expensive for the company. A lot of networking issues are taken care of, so you don’t have to deal with those. But the bottom line is it depends on every company’s policy. Every company has to think differently about its data: whether it is medical data, personal data, or bank data, in comparison to a company such as a blog website, where the data doesn’t need to be very secure or is not very personal. The bottom line is the privacy policy.

Kirill: Yeah, gotcha. So depending on that, that’s how they’ll go about it. And do you see more and more organizations moving to the cloud in the future?

Ajay: I definitely see that. I think this is a great initiative by a lot of different companies like Amazon, Azure to move in cloud.

Kirill: Gotcha. What is a recent win that you can share with us, like some major breakthrough that you’ve had in your career?

Ajay: My recent win was not too long ago, and it’s still going on. We are getting really great reviews and ratings on our product. The users really love it, I can tell. Every day I get so much feedback telling us this is the product they’ve been waiting for and that we have done a tremendous job there.

Kirill: Fantastic! Awesome. I know we talked about the future a little bit, but what are your thoughts on—obviously you’ve encountered data scientists using databases or maybe even your product. What are your thoughts—what future do you see coming in terms of databases and what should our listeners prepare for? Which areas should they look into and study in order to be prepared for the types of databases and database-related concepts that are coming in the next couple of years?

Ajay: In the next couple of years I see huge movement toward NoSQL database. I see huge movement towards online cloud databases. I see huge movement towards big data because data is growing at a tremendous speed. So how to analyse that huge data? I see products like Tableau or any other products which analyse big data that are going to be really in demand and that’s where users need to train themselves how to work with huge databases and do analysis on that.

Kirill: Gotcha. So, three main trends: move to NoSQL databases, then we’re going to move to the cloud and more big data. Those are three things to look out for. Okay, I think we talked about all three except for maybe big data. Can you mention a couple of words on big data? We talked about NoSQL already, cloud… What are your comments on big data, some takeaways for our listeners?

Ajay: About big data, I see it more on the analysis part of that data. For example, any company which received a huge number of data—let’s say collecting all the marketing data and analysing how the trend is going, how it is shifting from making free users to paid users, at which point the users are going out. So wherever the data amount is in terabytes and how to analyse and how to make anything constructive out of that, finding patterns out of that, I think that’s where the big data goes in.

Kirill: Okay, awesome. I agree with that. It’s got its own applications. Especially in the space of analytics, big data is very helpful. You know, when you’re working in a bank you really need that structure to make sure everything is operating properly, but when you have a lot of data and you want to identify patterns and so on, they even have this concept of the data lake when they’re talking about big data – it’s like a huge lake of all the data that you have, you throw it in there and then you fish it out as you need it. Yeah, that’s really cool. Well, I think that’s a good note to wrap this up on. Thank you so much, Ajay, for coming on. And where can our listeners find you and learn more about your projects and follow your career?

Ajay: I think they can visit sqldbm.com and we have a ‘Contact Us’ page there. We also have a media presence. They can find us on Facebook, Twitter, LinkedIn, and we are very responsive to all our users. So, we are eager to answer any questions.

Kirill: Awesome. Okay, sounds really cool. We’ll definitely include those links. And one more question for you today: Any book that you can recommend to our listeners to help them in their careers?

Ajay: Well, recently I read a book “Designing Products” from Scott Hurff. That I really liked. It’s about knowing your product and how your users can love it, how you can enhance your products, how you can listen to your customers and bring those all-important features into your product.

Kirill: Okay, there you go. So it’s a book called “Designing Products”?

Ajay: Yes, sir. “Designing Products People Love,” that’s the full name.

Kirill: “Designing Products People Love” by Scott Hurff. Okay, thank you very much, Ajay, once again, for coming on the show and sharing your insights into the world of databases. I really appreciate your time today.

Ajay: Thank you, Kirill.

Kirill: So there you have it. That was Ajay Singh from SQLDBM. I hope you enjoyed today’s podcast. Personally, my biggest takeaway was this whole notion of reverse engineering your database in order to understand the structure. As data scientists, when we work with databases, it is a challenge sometimes to find where the right data sits, how to connect it to other sources in your database and what relations exist between them. So this is a great solution in case you find yourself in a bit of a pickle in that sense.

So make sure to check out the website, it’s SQLDBM.com. Check out the tool, they’ve got some great previews there. Also connect with Ajay on LinkedIn. You can find the URL for his LinkedIn in the show notes at www.superdatascience.com/129. There you can also find the show notes for this episode. And on that note, we’re going to wrap up. Thank you very much for your attention today. I can’t wait to see you back here next time. Until then, happy analysing.
