Big Data’s Missing Component: Data Suction Appliance

Hadoop Data

Fari Payandeh

Aug 12, 2013

May I say at the outset that I know the phrase “Data Suction Appliance” sounds awkward at best and downright awful at worst. Honestly, I don’t feel that bad! These are some of the words used in Big Data product or company names: story, genie, curve, disco, rhythm, deep, gravity, yard, rain, hero, opera, karma… I won’t be surprised if I come across a start-up named WeddingDB next week.

Although there is so much hype surrounding social media data, the real goldmine is in existing RDBMS Databases and, to a lesser degree, in Mainframes. The reason is obvious: generally speaking, data capture has been driven by business requirements, not by random tweets about where to meet for dinner. In short, the Database vendors are sitting on top of the most valuable data.

Oracle, IBM, and Microsoft “own” most of the data in the world. By that I mean that if you run a query in any part of the world, it’s very likely you are reading the data from a Database owned by one of them. The larger the volume of data, the greater the degree of ownership; just ask anyone who has attempted to migrate 20 TB of data from Oracle to DB2. In short, they own the data because the customers are locked in. Moreover, the real value of the data is much greater than the revenue generated from Database licenses: in all likelihood the customer will buy other software and applications from the same vendor, since it’s a safe choice. From the Database vendors’ standpoint, the Database is a gift that keeps on giving. Although they have competed for new customers, in the absence of external threats (non-RDBMS technology) they have enjoyed a growing market that has kept them happy. Teradata, MySQL (non-Oracle flavors), Postgres, and Sybase have a small share of the overall Database market.

The birth of Hadoop and NoSQL technology represented a seismic shift that shook the RDBMS market, not in terms of revenue loss or gain, but in offering businesses an alternative. The Database vendors moved quickly to jockey for position, and contrary to what some believe, I don’t think they were afraid of a meltdown. After all, who was going to take their data? They responded to the market lest they be deprived of the Big Data windfall.

IBM spent $16 billion on its Big Data portfolio and launched PureData for Hadoop, a hardware/software system built from the IBM Big Data stack. It introduced SmartCloud and recently backed Pivotal’s Cloud Foundry. Cloud Foundry is “like an operating system for the cloud,” according to Andy Piper, developer advocate for Cloud Foundry at Pivotal.

Microsoft’s HDInsight products integrate with SQL Server 2012, System Center, and other Microsoft products; the Azure cloud-based version integrates with Azure cloud storage and Azure Database.

Oracle introduced its Big Data Appliance, a bundle comprising Oracle NoSQL Database, Oracle Linux, Cloudera Hadoop, and the HotSpot Java Virtual Machine. It also offers Oracle Cloud computing.

What is a Data Suction Appliance? There is a huge market for a high-performance data migration tool that can copy data stored in RDBMS Databases to Hadoop. Currently there is no fast way of transferring data to Hadoop; performance is sluggish. What I envision is data transfer at the storage layer, not the Database layer. Storage vendors such as EMC and NetApp have an advantage in finding a solution while working with Data Integration vendors like Informatica. Informatica recently partnered with VelociData, a provider of hyper-scale/hyper-speed engineered solutions.

Is it possible? I would think so. I know that I am simplifying the process, but this is a high-level view of what I see as a possible solution. Database objects are stored at specific disk addresses. It starts with the address of an instance, within which the information about the root Tablespace or Dbspace is kept. Once the root Tablespace is identified, the information about the rest of the objects (non-root Tablespaces, tables, indexes, …) is available in Data Dictionary tables and views. This information includes the addresses of the data files. Data file headers store the addresses of free/used extents, and we continue on that path until the data blocks containing the target rows are identified. Next, the Data Suction Appliance bypasses the Database and bulk copies the data blocks from storage to Hadoop. Some transformations may be needed during the transfer in order to bring in the data in a form that NoSQL Databases can understand, but that can be achieved through an interface that allows Administrators to specify the data transfer options. The future will tell if I am dreaming or, as cousin Vinny said, “the argument holds water.”
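
To make this more concrete, here is a rough, purely illustrative sketch of the metadata walk and block copy in Python. Every name, structure, and block size below is invented for illustration; no vendor’s actual on-disk format is implied.

```python
# Illustrative sketch of the "metadata walk" described above: resolve the
# dictionary chain (instance -> root Tablespace -> data files -> extents ->
# blocks), then bulk-copy raw blocks to Hadoop, bypassing the database engine.
# All names and structures here are hypothetical.

from dataclasses import dataclass
from typing import Callable, Iterator, List

BLOCK_SIZE = 8192  # a common RDBMS block size; real values vary by product/config


@dataclass
class Extent:
    device: str       # raw device or LUN path on the storage array
    offset: int       # byte offset of the extent's first block
    block_count: int  # number of contiguous blocks in the extent


@dataclass
class TableLayout:
    table_name: str
    extents: List[Extent]


def read_data_dictionary(instance_address: str) -> List[TableLayout]:
    """Walk the dictionary: instance header -> root Tablespace -> data file
    headers -> used extents per table. A real appliance would parse the
    vendor's on-disk dictionary structures; here it is only a placeholder."""
    raise NotImplementedError("vendor- and format-specific")


def read_blocks(extent: Extent) -> Iterator[bytes]:
    """Bulk-read the raw blocks of one extent directly from storage."""
    with open(extent.device, "rb") as dev:
        dev.seek(extent.offset)
        for _ in range(extent.block_count):
            yield dev.read(BLOCK_SIZE)


def suction_copy(instance_address: str,
                 transform: Callable[[bytes], bytes],
                 hdfs_writer: Callable[[str, bytes], None]) -> None:
    """Bypass the Database: copy every table's blocks straight to Hadoop,
    applying whatever transformation the Administrator configured so the
    NoSQL target can understand the data."""
    for layout in read_data_dictionary(instance_address):
        for extent in layout.extents:
            for block in read_blocks(extent):
                hdfs_writer(layout.table_name, transform(block))
```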

 

7 thoughts on “Big Data’s Missing Component: Data Suction Appliance”

  1. Hello Fari,

    Love the article.

    At VelociData, we are in strong agreement with the major points of your article. Nearly every Fortune 1000 company still holds its most valuable data in Oracle, IBM, and Microsoft databases. Interestingly, many of these large enterprises still have a significant share of their most valuable data on IBM Mainframes inside DB2, VSAM, IMS, and various independent software vendors’ DBMSs. These data sources will forever be a critical element of an effective analytics platform for large enterprises. As you point out, the conundrum is how to effectively and efficiently transfer and transform massive volumes of data from numerous DBMS sources into Hadoop. Making this challenge even more difficult is that data volumes are growing very rapidly while time-bound service levels for data availability are not negotiable (e.g., an insurance claims data mart needs to be refreshed by 5 AM every day to allow customer service responses to inbound customer inquiries. No new data -> No can answer -> Not so happy customers). Conventional approaches to these extreme challenges have reached a point of diminishing returns at a very high cost. This creates the need for what you describe as a “Data Suction Appliance.”

    We’ve never thought of VelociData as a “Hoover Wind Tunnel :-)”, but I can understand the analogy. In just the way you describe in your article, we accelerate standard ETL functions by as much as 1000X over conventional tools at vastly better price/performance. Think of VelociData as Data Transformation and Data Quality functions purpose-built in firmware that leverage an array of massively parallel compute resources to generate extreme acceleration against Big Data volumes. As an illustration of the performance, one of our insurance customers used a VelociData Engineered System (aka the Hoover Wind Tunnel) to reduce a 16-hour ETL process down to 45 seconds.

    We don’t think you are dreaming and we welcome the opportunity to prove the case to your cousin Vinny.

    All the best,

    Chris O’Malley

    1. Excellent! What you folks do is very impressive. I hope you find a way to build a data suction appliance. It will make you super rich and you’ll be doing the industry a huge favor.

      Good luck and best wishes!

  2. Ken Bell

    I recall our team using a tool called Sqoop that promised somewhat similar capabilities to those brought up by the author. I wish I could provide more detail; however, a simple search may uncover whether or not this is a useful tool for these applications.

    1. The problem with the existing methods, including Sqoop, is that they connect to the Database. My imaginary machine sucks the data right out of the storage and sends it to Hadoop. No DB involvement other than accessing it to read the data dictionary. A minimal sketch of the contrast follows.
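
To make the contrast concrete, here is a minimal, hypothetical sketch. The names are invented, and the first function only mirrors the general shape of a connection-based extract; it is not Sqoop’s actual code path.

```python
# Hypothetical contrast: the conventional path (what Sqoop-style tools do)
# pulls every row through the database engine over a connection, while the
# "suction" path touches the database only to learn block addresses.

from typing import Callable, Iterable


def conventional_pull(cursor, table: str, sink: Callable[[tuple], None]) -> None:
    """Sqoop-style extract: rows are fetched through the database engine,
    which is exactly the bottleneck at terabyte scale."""
    cursor.execute(f"SELECT * FROM {table}")
    for row in cursor:
        sink(row)


def suction_pull(blocks: Iterable[bytes], sink: Callable[[bytes], None]) -> None:
    """Appliance-style extract: raw blocks, already located via the data
    dictionary, are copied straight from storage to Hadoop."""
    for block in blocks:
        sink(block)
```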

  3. Hi Fari,

    Thanks a lot for your blog.

    Like Chris says, we still see a lot of value in the data held in classical data architectures for some time to come.

    Your idea as I understand it is to be able to copy database data blocks as fast as possible from the source system to Hadoop. That does roughly fit with some of our most basic capabilities. And it certainly aligns with our core value proposition as an ETL acceleration solution which can move data around at wire speeds. We would tend to engineer the solution such that the most performance-intensive algorithms would be accelerated with firmware.

    I am curious about a few things:

    0) It seems to me pretty reasonable that a storage vendor might want to look into this proposal. They seem to have the underlying position in the architecture to take best advantage of it. VelociData might be interested in partnering/engaging with them, if so.

    1) Would it really be possible to collect all the various database data blocks to get all the data? What happens with SANs, RAIDs and other distributed storage? Also, I am guessing this strategy is best at accommodating inserts only and not updates and deletes. Would this inevitably require integration at a database level?

    2) Given that I think the answer to the previous question might not be as perfect as we would like, is there any other way to try and collect data? Perhaps grabbing data at the network (or disk connection) level? This seems like it also might be a bit of a challenge.

    3) So I am thinking a viable alternative might be to capture data at a slightly different level, such as via replication, change-data-capture, log mining, etc. For example, I believe Oracle gives APIs that can look at in-memory transaction logs. But that seems to be what you are suggesting ought to be avoided.

    Once we settle on how to capture the source data, I think VelociData might be able to help with some of the effort of bringing it into the target Big Data system. We have technology, which we call Instant Indexing, that can build indexes at wire speed. Also, the conversion into the layout/format required by the target would be a good use of our appliance, I think.

    Admittedly, we cannot do everything today. But it is a very interesting idea that we would be open to exploring with you in more detail.

    –Dave

  4. Hi Dave,

    Identifying the data blocks won’t be a problem. After all, Oracle’s dbdump utility can bulk read/write data blocks. The problem is that it’s slow when dealing with Terabytes of data.

    Suppose the underlying storage is NetApp. My thinking is that the NetApp engine first finds the addresses of the blocks the same way the dbdump utility does, and stores the metadata in a repository. Next, it copies those blocks to a new NetApp aggregate with lightning-fast speed while reformatting them in a way that, let’s say, MongoDB understands. It’s a complex operation. What I’d like to see is for the appliance to take a snapshot of 100 TB of data in 30 minutes. The snapshot records the last database checkpoint timestamp. Once the snapshot is complete, the rest of the data can be brought over using a number of methods. At the end of the process both MongoDB and Oracle will be running in parallel, processing real-time transactions in their own separate worlds. A rough sketch of that workflow follows.
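
Here is a rough, purely illustrative sketch of that snapshot-then-catch-up workflow in Python. All names are invented; a real appliance would sit on the storage array’s own snapshot and replication machinery.

```python
# Hypothetical sketch of the snapshot-then-catch-up workflow described above.
# Nothing here reflects an actual NetApp or MongoDB API.

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


@dataclass
class Snapshot:
    aggregate: str           # destination aggregate on the storage array
    checkpoint_ts: datetime  # last database checkpoint recorded at snapshot time


def take_block_snapshot(source_volume: str, dest_aggregate: str,
                        checkpoint_ts: datetime,
                        reformat: Callable[[bytes], bytes]) -> Snapshot:
    """Copy source blocks to a new aggregate, reformatting each block into a
    layout the NoSQL target (say, MongoDB) understands, and record the last
    checkpoint so the catch-up phase knows where to start."""
    # ...block-level copy happens here, on the array, not through the DBMS...
    return Snapshot(aggregate=dest_aggregate, checkpoint_ts=checkpoint_ts)


def catch_up(snapshot: Snapshot,
             changes_since: Callable[[datetime], Iterable[dict]],
             apply_change: Callable[[dict], None]) -> None:
    """Bring over everything committed after the snapshot's checkpoint
    (via replication, CDC, log mining, etc.); afterwards both systems can
    run in parallel on live transactions."""
    for change in changes_since(snapshot.checkpoint_ts):
        apply_change(change)
```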

  5. Pingback: Big Data’s Missing Component: Data Suction Appliance | Big Enterprise Data
