How To Avoid The Big Data Quicksand

Big Data Quicksand
IT vs Business

By Fari Payandeh

July 31, 2013

These are not ordinary times for CIOs. The news about Netflix’s Big Data success and Big Data’s role in predicting the last election results might have created a buzz among IT workers, but it was the news of the NSA’s surveillance that established a beachhead for Big Data in the public arena. Business managers might become nervous about what the competition is doing with Data Analytics, and they may turn to CIOs for help. As more and more success stories bubble up to the surface, it will become harder and harder for CIOs to sit on the fence. Although this is not the first time that a groundbreaking technology has put pressure on IT to deliver, it is probably the biggest rock to lift to date. The reasons are the dizzying array of choices available, the complexity of cost-benefit analysis, fear of data security problems, a shortage of skilled workers, the perplexities of on-premise implementations, reservations about the maturity of the technology, expensive services from BI Megavendors, and the Cloud’s slow file transfer.

CIOs should not give in to pressure, lest they find themselves caught in the Big Data Quicksand. On the other hand, waiting too long could have far-reaching negative ramifications for the company. The IT industry went through a similar experience when implementing Data Warehousing. According to Gartner, approximately sixty percent of Data Warehousing projects have failed. The decisions made by the CIO will have a direct impact on IT. IT managers who maintain large volumes of data are already grappling with day-to-day problems caused by the sheer size of the data they store. The last thing they need is to dive into a Big Data project unprepared. Therefore, the most important phase of a Big Data implementation is the study phase. What do we currently have? Where do we want to go? Do we adopt a Cloud solution, Data Analytics as a service, or an on-premise implementation? What about a hybrid SQL/NoSQL environment? Maybe an Open Source strategy? Which Hadoop distribution? Are the BI Megavendors affordable? Can we benefit from integrating social media data? Why not employ a NewDB?

Here is a quick breakdown of potential solutions to help you decide.

Is it worth it? I’d think that most companies that plan to adopt a Big Data solution want to take advantage of Predictive Analytics. If your requirements fall under Diagnostic Analytics (discovery), you may not need a Big Data solution. You can look for a BI vendor that can work with your existing data: Alteryx, Tibco Spotfire, Actuate Quiterian, QlikTech, Birst, Tableau Software, SAP Visual Intelligence, MicroStrategy Visual Insight, Google BigQuery, Microsoft Power View, IBM Cognos Insight, Actian, and Oracle Endeca.
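To make the difference between the two kinds of analytics concrete, here is a minimal Python sketch. The monthly sales figures are made up for illustration, and the straight-line fit is only a stand-in for whatever model a real Predictive Analytics tool would use.

```python
# Hypothetical monthly sales figures (illustrative data, not from any real company).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def diagnose(values):
    """Diagnostic view: summarize what has already happened."""
    return {
        "total": sum(values),
        "average": sum(values) / len(values),
        "best_month": values.index(max(values)) + 1,  # 1-based month number
    }


def predict_next(values):
    """Predictive view: fit a least-squares line and extrapolate one month ahead."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n + 1)


summary = diagnose(monthly_sales)
forecast = predict_next(monthly_sales)
print(summary)
print(round(forecast, 2))
```

The diagnostic function only needs the data you already hold, which is why an ordinary BI tool can be enough. The predictive function is where data volume and model complexity tend to grow.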

If you want to go towards Predictive Analytics, you need to consider Cloud services. The Cloud has a few advantages, not the least of which is scalability. The CIA signed a $600 million contract with Amazon a few months ago. Although most BI Megavendors offer Cloud services, Amazon and Google are your best choices since they both have been in the Cloud business longer than the others and they offer Analytics. Amazon hosts some of the BI Megavendors’ applications as well. However, bear in mind that Cloud data transfer is slow. Another consideration is Cloud security. The startup Attunity is an Israeli company that has made significant progress in speeding up Cloud data transfer. Another startup, Seculert, is a strong player in the Cloud security space.
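A back-of-the-envelope calculation shows why transfer speed matters. The sketch below is in Python; the data size, link speed, and efficiency factor are hypothetical numbers chosen for illustration, not measurements from any provider.

```python
def transfer_hours(data_terabytes, link_megabits_per_second, efficiency=0.8):
    """Estimate the hours needed to upload a data set over a network link.

    efficiency is an assumed fraction of the nominal bandwidth actually achieved.
    """
    data_bits = data_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    return data_bits / usable_bits_per_second / 3600


# Hypothetical scenario: 10 TB over a 100 Mbit/s line at 80% efficiency.
hours = transfer_hours(10, 100)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days")
```

Under these assumptions the initial upload alone takes more than eleven days, which is why it pays to measure your own link before committing to a Cloud solution.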

You might also want to consider a private Cloud solution; many companies are already building their applications in off-premise data centers. The startup Engine Yard provides a Cloud application development platform.

Open source is probably the least expensive option but the most difficult to implement. Working with Hadoop is not easy, and there is a shortage of seasoned Hadoop workers. Nevertheless, once your commodity servers are installed and Hadoop is running and stable, the hard part is over and you can focus on installing your Analytics software. Hortonworks Hadoop is 100% open source. Jaspersoft and Pentaho offer Open Source Data Analytics software. You can save a ton of money if you succeed!

To ease the pain and speed up your implementation, you can go with a Hadoop distribution that has built-in APIs to work with other software that you might need. The MapR and Cloudera Hadoop distributions fall into that category. IBM and Intel also have their own Hadoop distributions. Pivotal HD (Hadoop) is an alternative; Pivotal is a startup that is the child of EMC and VMware. The startup Xplenty takes a different approach and claims that it has a cost-effective solution that eliminates the barriers to entry, making Hadoop accessible to everyone.

As for selecting a Predictive Analytics tool, there are several choices: Tibco Spotfire, KXEN, StatSoft, SAS, Actian, Amazon Analytics, Google BigQuery, Oracle, SAP, IBM, Microsoft, MicroStrategy, and Teradata. There are also the following startups: Tableau, Splunk, Datameer, LucidWorks, SpaceCurve, ParAccel, DataGravity, Alteryx, SiSense, and ClearStoryData.

Finally, there are some startups that are doing interesting work.

Hadapt, Continuuity, Cloudera (Impala), and Concurrent (Lingual) provide tools for Hadoop developers

Apache Sqoop connects Hadoop to traditional RDBMS

Skytree, Ayasdi, Splunk, Automated Insights, and 0xdata have machine learning tools

Import.io extracts data from websites

VoltDB and ParStream: distributed, massively parallel processing columnar databases based on a shared-nothing architecture running on commodity servers

RainStor: NewDB on Hadoop utilizing SQL

Ayata offers Prescriptive Analytics, one of the few companies to do so

FoundationDB is working on a NoSQL product that is ACID compliant, a best-of-both-worlds (SQL, NoSQL) type of effort

FairCom’s c-TreeACE Database allows for processing both Relational and NoSQL-type records

Cloudant offers a distributed database as a service (DBaaS)

ScaleBase dynamically scales out relational databases (MySQL).

DeepDB provides simultaneous transactions and Analytics on the same data set, in real time.

Platfora reads data directly from Hadoop without a middle layer and generates analytics reports.

DataStax delivers consulting, support and training for Cassandra.

MySQL refugees unite under SkySQL

*Note: Although Hadoop adoption is on the rise, its performance with real-time Analytics is not there yet.

Simple description of Big Data

Big Data: Let’s say that a few years ago we wanted to process the data that could be captured by satellites. It was impossible to do so, because traditional Databases were incapable of processing the volume, velocity, and variety (text, images, etc.) of the data. Today we can, and this significant leap forward has been dubbed “Big Data”. Therefore, Big Data is not just about the size of the data. It has more to do with how the data is processed.

Hadoop: Hadoop has become the de facto platform for Big Data. Hadoop is the “operating system” that lays the foundation for capturing data. Among the companies that contributed to lifting “modern Big Data” off the ground (Google, Yahoo, Amazon, and Facebook), Google is the only one that doesn’t use Hadoop. Although Google’s fingerprints are all over Hadoop, it currently uses its own Analytics platform called BigQuery, which is a cloud-based service.

Hadoop Distributions: Apache, Hortonworks, MapR, Cloudera, Pivotal HD, Amazon, IBM, and Intel. Hortonworks is a 100% Open Source Apache Hadoop distribution.

NoSQL Databases: Many are used alongside Hadoop, and some, such as HBase, sit directly on top of it. They store the data without fixed schemas. Why? Because schemas have contributed to the sluggish performance of traditional Databases when processing large volumes of data. The frontrunners are MongoDB, Cassandra, Google Cloud Datastore, Amazon DynamoDB, Redis, CouchDB, HBase, Neo4j, MarkLogic, Riak, and Couchbase.
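To show what storing data “without schemas” means in practice, here is a small Python sketch. It uses the standard sqlite3 module for the relational side and plain dictionaries as a stand-in for documents in a NoSQL store; the table, field names, and records are all hypothetical. It illustrates flexibility only and makes no claim about performance.

```python
import sqlite3

# Relational side: the table's columns are fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")

try:
    # A record with an unplanned field is rejected until the schema is altered.
    conn.execute("INSERT INTO customers (id, name, twitter) VALUES (2, 'Bob', '@bob')")
    relational_accepted = True
except sqlite3.OperationalError:
    relational_accepted = False

# Schemaless side: each record is a free-form document (a plain dict here,
# standing in for a document in a NoSQL store).
documents = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob", "twitter": "@bob"},
    {"id": 3, "name": "Cy", "tags": ["vip"], "last_login": "2013-07-01"},
]

# Queries must tolerate missing fields, since no schema guarantees them.
with_twitter = [doc["name"] for doc in documents if "twitter" in doc]

print(relational_accepted)
print(with_twitter)
```

The trade-off is visible in the last query: the schemaless store accepts anything, so the burden of checking which fields exist moves from the database to the application code.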

Data Analytics: Now that we can process the data gushing forth, we can look for patterns, correlations, events, anomalies, and invisible nuggets of intelligence to help us make better decisions. The following companies are strong players in that space: IBM, Oracle, Google, Microsoft, SAP, Amazon, Teradata, SAS, Tibco, StatSoft, KXEN, and Angoss Software. Pentaho and Jaspersoft are open source. Alteryx has recently made its software free. Tableau has a free public edition.