My relation with a Relational Database goes way back to 2007 when I started my career with Oracle, the behemoth of Enterprise data storage, processing and management.
Being a fresher out of college and someone scared to death of Java, C or C++ programming, SQL came as a welcome breath of relief. Well, there couldn't be a simpler way to code than writing some English like statements and this Oracle 11i console spitting out some records on the fly!
Come 2014, after 7 years of successful SQL career, I joined a startup where I got to hear these dreaded statements from several of my new age colleagues :
Relational Database? Seriously? You're still working in the same?
That's old school man! The world has moved to NoSQL. Who will read your tables in the world of JSON and YAML?
Stop using Relational Database. They are so expensive!
In the world of Big Data, your RDBMS can't handle scale. Period.
It felt like a big FOMO. Was I missing out on this new trend? Was it pack up time for my favorite language SQL?
Am not giving up so easily.
Agreed, Relational DB came into existence in 70s and 80s.
Agreed, They weren't designed to handle semi-structured/hierarchical data.
Agreed, it can't handle scale the way Hadoop does
But, throwing my beloved technology out of the window without letting me defend my case would be plain injustice right?
Let me put my case with few scenarios to discuss why I still consider Relational DBs a great choice even after I've started appreciating value NoSQL brings. So consider my scenarios be biased towards RDBMS :
Think of an overly simplified order management system. All you have is Customers and Orders. Suppose, you don't have any fancy tech to run your business. What's the easiest way to capture and manage Customer and Orders information?
My Solution : Design an order capture form in Google Form and use the same to capture order details. Let's capture few orders. The outcome looks like below in Excel :
Great! Looks like business started well.
But, let's think about some possible issues :
- What if the customer places another order the same day having same quantity and sales amount?
- What if another customer with same name places an order with same quantity and sales amount?
So, in the above snapshot, what's happening?
Did the same Denise placed another order on the same day of same units? Possible.
Was it another lady sharing the same name placing order for the same product? Possible.
Or, was it that while data entry, the sales executive by mistake tried entering the same order details again. Possible.
But, this isn't acceptable in real world. A record in the system storing data should be a unique occurrence.
That's why we need some attribute which can identify the uniqueness of a record. Sounds familiar? Yes, this is the vital Primary Key Constraint which relational databases provide out of the box. They ensure the primary key is not blank and is always unique.
With the Primary Key Constraint, RDBs check off one important point in Data Quality checklist which is not allowing duplicate records to enter the system.
Learning 01 : Relational Databases offer primary key constraint on the fly. This means it is internally coded for you. So, you can stop worrying about possibility of duplicate records corrupting the system. Cool, But wait there's more.
Look at the below scenario :
Looks like Ryan Foster is one happy customer! However, just look at the first 5 columns. Customer information is repeating for each order.
Consider another case. What if for some reason, the order entries for Mr. Ryan Foster were wrong and you were asked to delete all orders. You'll happy do it.
But, wait! After you happily delete all records, you find that now there is no customer information for Ryan Foster. With all orders, the customer information is gone too.
Consider yet another case. What if the address details of Mr. Ryan Foster was wrong and you were asked to change them. What's the pain point? Yes, Each and every record need to be corrected here. That'll be close to impossible task when the order count is in thousands or millions!
With Normalization, Relational Databases prevent all scenarios mentioned above from taking place. This is done by separating tables by entity. The fields which make sense at customer level should stay at customer level. Orders should be a separate table.
So, after normalization, the data looks like below :
Customers separate :
Orders separate :
- Normalization saves space (Single record for customer Ryan Foster in customer table)
- Normalization prevents the Insertion, Update and Deletion Anomaly. (With orders getting deleted, the customer information still remains intact. Also, updating address means changing the single record in the customer table)
**Learning 02 **: Relational DBs offer normalization of tables for efficient data storage.
Wait, there's still more.
With tables separated due to Normalization, there is one important aspect still not covered.
With separation of data, the relation between customer and order is missed out. How do I delete Ryan Foster's orders from the Orders table? I don't know which orders were placed by him.
This is where RDBs actually justify their name : RELATIONAL. The relation between entities is what RDBs are all about. All right, let's quickly fix the datasets and add Customer IDs in the Orders table
Now, just like we do VLOOKUP in Excel, we'll do inner join to fetch relevant data from the Customers table. Problem solved!
Or is it?
Consider this case : What if Mr. Ryan Foster no longer wishes to buy from this business and requests to remove his record. You go and delete his entry from the Customers table.
What happens? You got it right. There will still be references to Ryan Foster's record in the Orders table. Look at all order entries for Customer ID 100005. These are ghost records. It's like someone has placed orders but we don't know who.
RDBs prevent this situation too by introducing Referential Integrity Constraints! Which means as long as the parent records exist, you can't delete child records. The deletion has an order, a cascade effect. First orders should go, then only customer can be removed. Makes sense? Well, it takes some internal coding effort to ensure this constraint.
Learning 03 - RDBs provide the Referential Integrity Constraints feature to prevent ghost records from appearing hence ensuring highest levels of integrity.
Talking about constraints, they are the security guards of RDBs. They ensure highest degree of data quality out of the box. A software developer doesn't have to worry about quality issues as RDBs have very strict rules to be obeyed before allowing data into the system. The data types also play an important role to ensure clean data.
Few other constraints which help ensure good quality data like : Default constraint, Check constraint. As a Data Analyst, RDBs made my life very simple by taking over the heavy lifting of ensuring efficient data storage, no data duplication and high degree of integrity. This is something industries sensitive towards data quality have always loved about.
It should be clear why Banks and Insurance firms still in love with RDBs.
Great! It felt good to discuss the above points to support an old but steady technology.
However, just like a good technology professional, I must acknowledge that I spent a lot of time understanding why so much noise about NoSQL DBs. It's important to question your bias always and understand why an alternative technology exists. Nobody builds tech just to show off right?
The problem has been that NoSQL DBs have been posed as a replacement of Relational DBs. Here's where the whole problem started. In fact, NoSQL DBs solve challenges posed by Relational DBs but at the same time fail to demonstrate some of the capabilities of RDBs.
Whether you like it or not, Technology does not have a silver bullet to solve all problems. It's all about choosing what works for your problem.
Finally, I can peacefully rest my case.
What NoSQL offers which Relational doesn't? That's for another article.