The Best Of Open Source For Big Data

By Fari Payandeh

Sep 1, 2013

It was not easy to select a few out of the many Open Source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, and Oracle releasing its NoSQL database as Open Source.

“If you can’t beat them, join them.” History has vindicated the Open Source visionaries and advocates.

Hadoop Distributions

Hortonworks

Cloud Operating System

Cloud Foundry — By VMware

OpenStack — Worldwide participation and well-known companies

Storage

Fusion-io — Not open source, but very supportive of Open Source projects; Flash-aware applications.

Development Platforms and Tools

REEF — Microsoft’s Hadoop development platform

Lingual — By Concurrent

Pattern — By Concurrent

Python — Awesome programming language

Mahout — Machine learning library

Impala — SQL query engine for Hadoop, by Cloudera

R — MVP among statistical tools

Storm — Stream processing by Twitter

LucidWorks — Search, based on Apache Solr

Giraph — Graph processing by Facebook

NoSQL Databases

MongoDB, Cassandra, HBase

SQL Databases

MySQL — Belongs to Oracle

MariaDB — Partnered with SkySQL

PostgreSQL — Object Relational Database

TokuDB — Improves RDBMS performance

Server Operating Systems

Red Hat — The de facto OS for Hadoop Servers

BI, Data Integration, and Analytics

Talend

Pentaho

Jaspersoft

How To Avoid The Big Data Quicksand

By Fari Payandeh

July 31, 2013

These are not ordinary times for CIOs. The news about Netflix’s Big Data success and Big Data’s role in predicting the last election results might have created a buzz among IT workers, but it was the news of the NSA’s surveillance that established a beachhead for Big Data in the public arena. Business managers might become nervous about what the competition is doing with Data Analytics, and they may turn to CIOs for help. As more and more success stories bubble up to the surface, it will become harder and harder for CIOs to sit on the fence. Although this is not the first time that a groundbreaking technology has put pressure on IT to deliver, it is probably the biggest rock to lift to date. The reasons include the dizzying array of choices available, the complexity of cost-benefit analysis, fear of data security problems, a shortage of skilled workers, the perplexities of on-premise implementations, reservations about the maturity of the technology, expensive services from BI Megavendors, and the Cloud’s slow file transfer.

CIOs should not give in to the pressure, lest they find themselves caught in a Big Data Quicksand. On the other hand, waiting too long could have far-reaching negative ramifications for the company. The IT industry went through a similar experience while implementing Data Warehousing. According to Gartner, approximately sixty percent of Data Warehousing projects have failed. The decisions made by the CIO will have a direct impact on IT. IT managers who maintain large volumes of data are already grappling with day-to-day problems caused by the sheer size of the data they store. The last thing they need is to dive into a Big Data project unprepared. Therefore, the most important phase of a Big Data implementation is the study phase. What do we currently have? Where do we want to go? Do we adopt a Cloud solution, Data Analytics as a service, or an on-premise implementation? What about a hybrid SQL/NoSQL environment? Maybe an Open Source strategy? Which Hadoop distribution? Are the BI Megavendors affordable? Can we benefit from integrating social media data? Why not employ a NewDB?

Here is a quick breakdown of potential solutions to help you decide.

Is it worth it? I’d think that most companies that plan to adopt a Big Data solution want to take advantage of Predictive Analytics. If your requirements fall under Diagnostic Analytics (discovery), you may not need a Big Data solution. You can look for a BI vendor that can work with your existing data: Alteryx, Tibco Spotfire, Actuate Quiterian, QlikTech, Birst, Tableau Software, SAP Visual Intelligence, MicroStrategy Visual Insight, Google BigQuery, Microsoft Power View, IBM Cognos Insight, Actian, and Oracle Endeca.

If you want to move toward Predictive Analytics, you need to consider Cloud services. The Cloud has a few advantages, not the least of which is scalability. The CIA signed a $600 million contract with Amazon a few months ago. Although most BI Megavendors offer Cloud services, Amazon and Google are your best choices, since both have been in the Cloud business longer than the others and both offer Analytics. Amazon hosts some of the BI Megavendors’ applications as well. However, bear in mind that Cloud data transfer is slow. Another consideration is Cloud security. The startup Attunity is an Israeli company that has made significant progress in speeding up Cloud data transfer. Another startup, Seculert, is a strong player in the Cloud security space.
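To get a feel for how slow Cloud data transfer can be, here is a rough back-of-the-envelope sketch in Python. The function name, the data sizes, the 100 Mbps link, and the 80% efficiency factor are all illustrative assumptions of mine, not measurements from any provider; plug in your own numbers.

```python
def transfer_hours(data_terabytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move a data set over a network link.

    data_terabytes: data volume in decimal terabytes (1 TB = 10**12 bytes)
    link_mbps:      nominal link speed in megabits per second
    efficiency:     assumed fraction of the nominal speed actually achieved (0 to 1]
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency must be in (0, 1]")
    total_bits = data_terabytes * 1e12 * 8          # bytes -> bits
    effective_bps = link_mbps * 1e6 * efficiency    # megabits/s -> bits/s
    return total_bits / effective_bps / 3600        # seconds -> hours


if __name__ == "__main__":
    # Hypothetical scenario: a 100 Mbps uplink that sustains 80% of its nominal speed.
    for terabytes in (1, 10, 100):
        hours = transfer_hours(terabytes, link_mbps=100, efficiency=0.8)
        print(f"{terabytes:>4} TB: about {hours:,.0f} hours ({hours / 24:.1f} days)")
```

Even under these generous assumptions, the upload time grows linearly with the data volume, which is why transfer speed belongs in the study phase alongside cost and security.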

You might want to consider a private Cloud solution; many companies are already building their applications in off-premise data centers. The startup Engine Yard provides a Cloud application development platform.

Open Source is probably the least expensive option, but also the most difficult to implement. Working with Hadoop is not easy, and there is a shortage of seasoned Hadoop professionals. Nevertheless, once you have installed your commodity servers and Hadoop is running and stable, the hard part is over and you can focus on installing your Analytics software. Hortonworks Hadoop is 100% open source. Jaspersoft and Pentaho offer Open Source Data Analytics software. You can save a ton of money if you succeed!

To ease the pain and speed up your implementation, you can go with a Hadoop distribution that has built-in APIs for working with other software that you might need. The MapR and Cloudera Hadoop distributions fall into that category. IBM and Intel also have their own Hadoop distributions. Pivotal HD (Hadoop) is an alternative; the startup is the child of EMC and VMware. The startup Xplenty takes a different approach and claims to have a cost-effective solution that eliminates the barriers to entry, making Hadoop accessible to everyone.

As for selecting a Predictive Analytics tool, there are several choices: Tibco Spotfire, KXEN, StatSoft, SAS, Actian, Amazon Analytics, Google BigQuery, Oracle, SAP, IBM, Microsoft, MicroStrategy, and Teradata. There are also the following startups: Tableau, Splunk, Datameer, LucidWorks, SpaceCurve, ParAccel, DataGravity, Alteryx, SiSense, and ClearStoryData.

Finally, there are some startups that are doing interesting work.

Hadapt, Continuuity, Cloudera (Impala), and Concurrent (Lingual) provide tools for Hadoop developers

Apache Sqoop connects Hadoop to traditional RDBMS

Skytree, Ayasdi, Splunk, Automated Insights, and 0xdata have machine learning tools

Import.io extracts data from websites

VoltDB and ParStream: Distributed, massively parallel processing columnar databases based on a shared-nothing architecture running on commodity servers

RainStor offers a NewDB on Hadoop that uses SQL

Ayata offers Prescriptive Analytics, one of the few vendors to do so

FoundationDB is working on a NoSQL product that is ACID compliant, a best-of-both-worlds (SQL, NoSQL) type of effort

FairCom’s c-treeACE Database allows for processing both Relational and NoSQL type records

Cloudant offers a distributed database as a service (DBaaS)

ScaleBase dynamically scales out relational databases (MySQL).

DeepDB provides simultaneous transactions and Analytics on the same data set, in real time.

Platfora reads data directly from Hadoop without a middle layer and generates analytics reports.

DataStax delivers consulting, support and training for Cassandra.

MySQL refugees unite under SkySQL

*Note: Although Hadoop adoption is on the rise, its performance with real-time Analytics is not there yet.