How To Avoid The Big Data Quicksand

IT vs Business

By Fari Payandeh

July 31, 2013

These are not ordinary times for CIOs. News of Netflix’s Big Data success and of Big Data’s role in predicting the last election results may have created a buzz among IT workers, but it was the news of the NSA’s surveillance that established a beachhead for Big Data in the public arena. Business managers may become nervous about what the competition is doing with Data Analytics and turn to CIOs for help. As more and more success stories bubble up to the surface, it will become harder and harder for CIOs to sit on the fence. Although this is not the first time that a groundbreaking technology has put pressure on IT to deliver, it is probably the biggest rock to lift to date. The reasons include the dizzying array of choices available, the complexity of cost-benefit analysis, fear of data security problems, a shortage of skilled workers, the perplexities of on-premise implementations, reservations about the maturity of the technology, expensive services from BI Megavendors, and slow file transfer to the Cloud.

CIOs should not simply give in to the pressure, lest they find themselves caught in Big Data quicksand. On the other hand, waiting too long could have far-reaching negative ramifications for the company. The IT industry went through a similar experience when implementing Data Warehousing. According to Gartner, approximately sixty percent of Data Warehousing projects have failed. The decisions made by the CIO will have a direct impact on IT. IT managers who maintain large volumes of data are already grappling with day-to-day problems caused by the sheer size of the data they store. The last thing they need is to dive into a Big Data project unprepared. Therefore, the most important phase of a Big Data implementation is the study phase. What do we currently have? Where do we want to go? Do we adopt a Cloud solution, Data Analytics as a service, or an on-premise implementation? What about a hybrid SQL/NoSQL environment? Maybe an Open Source strategy? Which Hadoop distribution? Are the BI Megavendors affordable? Can we benefit from integrating social media data? Why not employ a NewDB?

Here is a quick breakdown of potential solutions to help you decide.

Is it worth it? I’d think that most companies that plan to adopt a Big Data solution want to take advantage of Predictive Analytics. If your requirements fall under Diagnostic Analytics (discovery), you may not need a Big Data solution. You can look for a BI vendor that can work with your existing data: Alteryx, Tibco Spotfire, Actuate Quiterian, QlikTech, Birst, Tableau Software, SAP Visual Intelligence, MicroStrategy Visual Insight, Google BigQuery, Microsoft Power View, IBM Cognos Insight, Action, and Oracle Endeca.
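To make the distinction between Diagnostic and Predictive Analytics concrete, here is a minimal Python sketch. The sales figures and function names are hypothetical, invented purely for illustration, and a straight-line trend stands in for the far richer models that real Predictive Analytics tools use.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data only).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def diagnostic_summary(values):
    """Diagnostic view: describe what has already happened."""
    changes = [later - earlier for earlier, later in zip(values, values[1:])]
    return {"average": mean(values), "average_change": mean(changes)}


def fit_trend(values):
    """Fit a least-squares line through the points (0, v0), (1, v1), ..."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_next(values):
    """Predictive view: extrapolate the fitted trend one period ahead."""
    slope, intercept = fit_trend(values)
    return slope * len(values) + intercept


if __name__ == "__main__":
    print(diagnostic_summary(monthly_sales))
    print(predict_next(monthly_sales))
```

The first function only summarizes existing data, which an ordinary BI tool can do. The second estimates a value that has not been observed yet, which is the kind of question that tends to push companies toward a Big Data solution.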

If you want to move towards Predictive Analytics, you need to consider Cloud services. The Cloud has a few advantages, not the least of which is scalability. The CIA signed a $600 million contract with Amazon a few months ago. Although most BI Megavendors offer Cloud services, Amazon and Google are your best choices, since both have been in the Cloud business longer than the others and both offer Analytics. Amazon hosts some of the BI Megavendors’ applications as well. However, bear in mind that Cloud data transfer is slow. Another consideration is Cloud security. The startup Attunity is an Israeli company that has made significant progress in speeding up Cloud data transfer. Another startup, Seculert, is a strong player in the Cloud security space.
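A back-of-the-envelope calculation shows why transfer speed matters. The Python sketch below estimates how long an initial upload would take; the function name, the data size, the link speed, and the 80% efficiency factor are all assumptions chosen for illustration, not measurements from any provider.

```python
def transfer_hours(data_terabytes, link_megabits_per_second, efficiency=0.8):
    """Estimate the hours needed to move a data set over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable after protocol overhead and contention.
    """
    if data_terabytes <= 0 or link_megabits_per_second <= 0:
        raise ValueError("data size and link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    total_bits = data_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    hours = transfer_hours(data_terabytes=10, link_megabits_per_second=100)
    print(f"{hours:.1f} hours, or about {hours / 24:.1f} days")
```

Under these assumptions, moving 10 TB over a 100 Mbps link takes roughly eleven and a half days, which is why the transfer step deserves a place in the study phase.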

You might want to consider a private Cloud solution; many companies are already building their applications in off-premise data centers. The startup Engine Yard provides a Cloud application development platform.

Open source is probably the least expensive option but the most difficult to implement. Working with Hadoop is not easy, and seasoned Hadoop workers are in short supply. Nevertheless, once your commodity servers are installed and Hadoop is running and stable, the hard part is over and you can focus on installing your Analytics software. Hortonworks Hadoop is 100% open source. Jaspersoft and Pentaho offer Open Source Data Analytics software. You can save a ton of money if you succeed!

To ease the pain and speed up your implementation, you can go with a Hadoop distribution that has built-in APIs for working with other software that you might need. The MapR and Cloudera Hadoop distributions fall into that category. IBM and Intel also have their own Hadoop distributions. Pivotal HD (Hadoop) is an alternative; the startup is the child of EMC and VMware. The startup Xplenty takes a different approach and claims to have a cost-effective solution that eliminates the barriers to entry, making Hadoop accessible to everyone.

As for selecting a Predictive Analytics tool, there are several choices: Tibco Spotfire, KXEN, StatSoft, SAS, Action, Amazon Analytics, Google BigQuery, Oracle, SAP, IBM, Microsoft, MicroStrategy, and Teradata. There are also the following startups: Tableau, Splunk, Datameer, LucidWorks, SpaceCurve, ParAccel, DataGravity, Alteryx, SiSense, and ClearStoryData.

Finally, there are some startups that are doing interesting work.

Hadapt, Continuuity, Cloudera (Impala), and Concurrent (Lingual) provide tools for Hadoop developers

Apache Sqoop connects Hadoop to traditional RDBMS

Skytree, Ayasdi, Splunk, Automated Insights, and Oxdata have machine learning tools

Import Io extracts data from Websites

VoltDB and ParStream: distributed, massively parallel processing columnar databases based on a shared-nothing architecture running on commodity servers

RainStor: a NewDB on Hadoop that uses SQL

Ayata offers Prescriptive Analytics, one of the few companies that do

FoundationDB is working on a NoSQL product that is ACID compliant, a best-of-both-worlds (SQL and NoSQL) effort

FairCom’s c-TreeACE Database allows for processing both relational and NoSQL-type records

Cloudant offers a distributed database as a service (DBaaS)

ScaleBase dynamically scales out relational databases (MySQL).

DeepDB provides simultaneous transactions and Analytics on the same data set, in real time.

Platfora reads data directly from Hadoop without a middle layer and generates analytics reports.

DataStax delivers consulting, support and training for Cassandra.

MySQL refugees unite under SkySQL

*Note: Although Hadoop adoption is on the rise, its performance with real-time Analytics is not there yet.

4 replies »

  1. Thanks, interesting post and great summary. You’re exploring options by these categories: Big Data, cloud, private grids. I have one here that follows similar veins but distinguishes solutions based on “Distributed Computing” vs “Distributed Performance”: http://tinyurl.com/qakyafq. It’d be great to have a discussion of how different industry sectors approach this subject, and why they have Big Data to start with. Social Media obviously, but what about other industry sectors? For example, coming from a background in investment banking, we don’t have much Big Data, and our data is definitely best stored in a relational database.
