Sept 8, 2013
Although Mainframe Hierarchical Databases are very much alive today, the Relational Databases (RDBMS) (SQL) have dominated the Database market, and they have done a lot of good. The reason the money we deposit doesn’t go to someone else’s account, our airline reservation ensures that we have a seat on the plane, or we are not blamed for something we didn’t do, etc… RDBMS’ data integrity is due to its adherence to ACID (atomicity, consistency, isolation, and durability) principles. RDBMS technology dates back to the 70’s.
So what changed? Web technology started the revolution. Today, many people shop on Amazon. RDBMS was not designed to handle the number of transactions that take place on Amazon every second. The primary constraining factor was RDBMS’ schema.
NoSql Databases offered an alternative by eliminating schemas at the expense of relaxing ACID principles. Some NoSql vendors have made great strides towards resolving the issue; the solution is called eventual consistency. As for NewSql, why not create a new RDBMS minus RDBMS’ shortcomings utilizing modern programming languages and technology. That is how some of the NewSql vendors came to life. Other NewSql companies created augmented solutions for MySql.
Hadoop is a different animal altogether. It’s a file system and not a database. Hadoop’s roots are in internet search engines. Although Hadoop and associates (HBase, Mapreduce, Hive, Pig, Zookeeper) have turned it into a mighty database, Hadoop is an inexpensive, scalable, distributed filesystem with fault tolerance. Hadoop’s specialty at this point in time is in batch processing, hence suitable for Data Analytics.
Now let’s start with our example: My imaginary video game company recently put our most popular game online after ten years of being in business, shipping our games to retailers around the globe. Our customer information is currently stored in a Sql Server Database and we have been happy with it. However, since the players started playing the game online, the database is not able to keep up and the users are experiencing delays. As our user base grows rapidly, we spend money buying more and more Hardware/Software but to no avail. Losing customers is our primary concern. Where do we go from here?
We decide to run our online game application in NoSql and NewSql simultaneously by segmenting our online user base. Our objective is to find the optimal solution. The IT department selects NoSql Couchbase (document oriented like MongoDB) and NewSql VoltDB.
Couchbase is open source, has an integrated caching mechanism, and it can automatically spread data across multiple nodes. VoltDB is an ACID compliant RDBMS, fault tolerant, scales horizontally, and possesses a shared-nothing & in-memory architecture. At the end, both systems are able to deliver. I won’t go into the intricacies of each solution because this is an example and comparing these technologies in the real-world will require testing, benchmarking, and in-depth analyses.
Now that the online operations are running smoothly, we want to analyze our data to find out where we should expand our territory. Which are the most suitable countries for marketing our products? In doing so, we need to merge the Sql Server customer Data Warehouse with the data from the online gaming database and run analytical reports. That’s where Hadoop comes in. We configure a Hadoop system and merge the data from the two data sources. Next, we use Hadoop’s Mapreduce in conjunction with the open source R programming language to generate the analytics reports.