May I say at the outset that I know the phrase “Data Suction Appliance” sounds awkward at best and downright awful at worst. Honestly, I don’t feel that bad! These are some of the words used in Big Data product or company names: story, genie, curve, disco, rhythm, deep, gravity, yard, rain, hero, opera, karma… I won’t be surprised if I come across a start-up named WeddingDB next week.
Although there is so much hype surrounding social media data, the real goldmine lies in existing RDBMS Databases and, to a lesser degree, in Mainframes. The reason is obvious: generally speaking, data capture has been driven by business requirements, not by random tweets about where to meet for dinner. In short, the Database vendors are sitting on top of the most valuable data.
Oracle, IBM, and Microsoft “own” most of the data in the world. By that I mean if you run a query in any part of the world, it’s very likely that you are reading the data from a Database owned by them. The larger the volume of data, the greater the degree of ownership; just ask anyone who has attempted to migrate 20 TB of data from Oracle to DB2. In short, they own the data because the customers are locked in. Moreover, the real value of the data is much greater than the revenue generated from Database licenses. In all likelihood the customer will buy other software and applications from the same vendor, since it’s the safe choice. From the Database vendors’ standpoint, the Database is a gift that keeps on giving. Although they have competed with one another for new customers, the absence of external threats (non-RDBMS technology) has let them all enjoy a growing market. Teradata, MySql (non-Oracle flavors), Postgres, and Sybase hold only a small share of the overall Database market.
The birth of Hadoop and NoSql technology represented a seismic shift that shook the RDBMS market, not in terms of revenue loss or gain, but by offering businesses an alternative. The Database vendors moved quickly to jockey for position, and contrary to what some believe, I don’t think they were afraid of a meltdown. After all, who was going to take their data? They responded to the market lest they be deprived of the Big Data windfall.
IBM spent $16 billion on its Big Data portfolio and launched PureData for Hadoop, a hardware/software system built from the IBM Big Data stack. It introduced SmartCloud and recently backed Pivotal’s Cloud Foundry. Cloud Foundry is “like an operating system for the cloud,” says Andy Piper, developer advocate for Cloud Foundry at Pivotal.
Microsoft HDInsight products integrate with Sql Server 2012, System Center, and other Microsoft products; the Azure cloud-based version integrates with Azure cloud storage and Azure Database.
Oracle introduced its Big Data Appliance, a bundle comprising Oracle NoSql Database, Oracle Linux, Cloudera Hadoop, and the HotSpot Java Virtual Machine. It also offers Oracle Cloud Computing.
What is a Data Suction Appliance? There is a huge market for a high-performance data migration tool that can copy the data stored in RDBMS Databases to Hadoop. Currently there are no fast ways of transferring data to Hadoop; performance is sluggish. What I envision is data transfer at the storage layer, not the Database layer. Storage vendors such as EMC and NetApp have an advantage in finding a solution while working with Data Integration vendors like Informatica. Informatica recently partnered with VelociData, a provider of hyper-scale/hyper-speed engineered solutions.

Is it possible? I would think so. I know that I am simplifying the process, but this is a high-level view of what I see as a possible solution. Database objects are stored at specific disk addresses. It starts with the address of an instance, within which the information about the root Tablespace or Dbspace is kept. Once the root Tablespace is identified, the information about the rest of the objects (non-root Tablespaces, tables, indexes, …) is available in Data Dictionary tables and views. This information includes the addresses of the data files. Data file headers store the addresses of free and used extents, and we continue down that path until the data blocks containing the target rows are identified.

Next, the Data Suction Appliance bypasses the Database and bulk copies the data blocks from storage to Hadoop. Some transformations may be needed during the transfer in order to deliver the data in a form that NoSql Databases can understand, but that can be achieved through an interface that lets Administrators specify the data transfer options. The future will tell whether I am dreaming or, as cousin Vinny said, “The argument holds water.”
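To make the metadata walk above a little more concrete, here is a minimal Python sketch of the chain I am describing: instance, Data Dictionary, Tablespace, data files, extents, and finally the data blocks that get bulk-copied to Hadoop. This is purely illustrative; every class and function name is hypothetical, and no real RDBMS or storage vendor exposes this exact API.

```python
# Hypothetical sketch of the storage-layer walk. None of these types
# correspond to a real RDBMS or appliance API; they just model the chain:
# instance -> data dictionary -> tablespace -> data files -> extents -> blocks.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class DataFile:
    path: str
    # Extent number -> addresses of the used data blocks in that extent
    # (in a real system these come from the data file header).
    used_extents: Dict[int, List[int]]

@dataclass
class Table:
    name: str
    data_files: List[DataFile]

@dataclass
class Tablespace:
    name: str
    tables: List[Table]

@dataclass
class Instance:
    root_tablespace: Tablespace
    # The data dictionary maps object names to the tablespace holding them.
    data_dictionary: Dict[str, Tablespace]

def locate_blocks(instance: Instance, table_name: str) -> List[Tuple[str, int]]:
    """Walk the metadata chain and return (data file path, block address)
    pairs for every block that holds rows of the target table."""
    tablespace = instance.data_dictionary[table_name]
    addresses: List[Tuple[str, int]] = []
    for table in tablespace.tables:
        if table.name != table_name:
            continue
        for data_file in table.data_files:
            for _extent, blocks in sorted(data_file.used_extents.items()):
                for block in blocks:
                    addresses.append((data_file.path, block))
    return addresses

def bulk_copy(addresses: List[Tuple[str, int]],
              read_block: Callable[[str, int], str],
              write_to_hadoop: Callable[[str, int, str], None],
              transform: Optional[Callable[[str], str]] = None) -> None:
    """Bypass the Database engine: read blocks straight from storage,
    apply an optional Administrator-specified transformation, and ship
    them to the Hadoop sink."""
    for path, block in addresses:
        payload = read_block(path, block)
        if transform is not None:
            payload = transform(payload)
        write_to_hadoop(path, block, payload)
```

The point of the sketch is the last function: once the block addresses are known from the metadata walk, the copy reads storage directly and never opens a Database connection, which is where the speed would come from.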