It was not easy to select a few out of the many Open Source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, and Oracle releasing its NoSQL database as Open Source.
“If you can’t beat them, join them.” History has vindicated the Open Source visionaries and advocates.
Cloud Operating System
Cloud Foundry: By VMware
OpenStack: Worldwide participation and well-known companies
Fusion-io: Not open source, but very supportive of Open Source projects; Flash-aware applications.
These are not ordinary times for CIOs. The news about Netflix’s Big Data success and Big Data’s role in predicting the last election results might have created a buzz among IT workers, but it was the news of the NSA’s surveillance that established a beachhead for Big Data in the public arena. Business managers might become nervous about what the competition is doing with Data Analytics, and they may turn to CIOs for help. As more and more success stories bubble up to the surface, it will become harder and harder for CIOs to sit on the fence. Although this is not the first time that a groundbreaking technology has put pressure on IT to deliver, it is probably the biggest rock to lift to date. The reasons include the dizzying array of choices available, the complexity of cost-benefit analysis, fear of data security problems, a shortage of skilled workers, the perplexities of on-premise implementations, reservations about the maturity of the technology, expensive services from BI Megavendors, and the Cloud’s slow file transfer.
CIOs should not give in to pressure, lest they find themselves caught in Big Data quicksand. On the other hand, waiting too long could have far-reaching negative ramifications for the company. The IT industry went through a similar experience while implementing Data Warehousing. According to Gartner, approximately sixty percent of Data Warehousing projects have failed. The decisions made by the CIO will have a direct impact on IT. IT managers who maintain large volumes of data are already grappling with day-to-day problems caused by the sheer size of the data they store. The last thing they need is to dive into a Big Data project unprepared. Therefore, the most important phase of a Big Data implementation is the study phase. What do we currently have? Where do we want to go? Do we adopt a Cloud solution, Data Analytics as a service, or an on-premise implementation? What about a hybrid SQL/NoSQL environment? Maybe an Open Source strategy? Which Hadoop distribution? Are the BI Megavendors affordable? Can we benefit from integrating social media data? Why not employ a NewDB?
Here is a quick breakdown of potential solutions to help you decide.
Is it worth it? I would think that most companies that plan to adopt a Big Data solution want to take advantage of Predictive Analytics. If your requirements fall under Diagnostic Analytics (discovery), you may not need a Big Data solution. You can look for a BI vendor that can work with your existing data: Alteryx, Tibco Spotfire, Actuate Quiterian, QlikTech, Birst, Tableau Software, SAP Visual Intelligence, MicroStrategy Visual Insight, Google BigQuery, Microsoft Power View, IBM Cognos Insight, Actian, and Oracle Endeca.
If you want to move towards Predictive Analytics, you need to consider Cloud services. The Cloud has a few advantages, not the least of which is scalability. The CIA signed a $600 million contract with Amazon a few months ago. Although most BI Megavendors offer Cloud services, Amazon and Google are your best choices, since both have been in the Cloud business longer than the others and both offer Analytics. Amazon hosts some of the BI Megavendors’ applications as well. However, bear in mind that Cloud data transfer is slow. Another consideration is Cloud security. Attunity, an Israeli company, has made significant progress in speeding up Cloud data transfer. The startup Seculert is a strong player in the Cloud security space.
You might want to consider a private Cloud solution; many companies are already building their applications in off-premise data centers. The startup Engine Yard provides a Cloud application development platform.
Open Source is probably the least expensive option but the most difficult to implement. Working with Hadoop is not easy, and seasoned Hadoop workers are in short supply. Nevertheless, once you have installed your commodity servers and Hadoop is running and stable, the hard part is over and you can focus on installing your Analytics software. Hortonworks Hadoop is 100% open source. Jaspersoft and Pentaho offer Open Source Data Analytics software. You can save a ton of money if you succeed!
To ease the pain and speed up your implementation, you can go with a Hadoop distribution that has built-in APIs for working with other software that you might need. The MapR and Cloudera Hadoop distributions fall into that category. IBM and Intel also have their own Hadoop distributions. Pivotal HD (Hadoop) is an alternative; the startup is the child of EMC and VMware. The startup Xplenty takes a different approach and claims to have a cost-effective solution that eliminates the barriers to entry, making Hadoop accessible to everyone.
As for selecting a Predictive Analytics tool, there are several choices: Tibco Spotfire, KXEN, StatSoft, SAS, Actian, Amazon Analytics, Google BigQuery, Oracle, SAP, IBM, Microsoft, MicroStrategy, and Teradata. There are also the following startups: Tableau, Splunk, Datameer, LucidWorks, SpaceCurve, ParAccel, DataGravity, Alteryx, SiSense, and ClearStoryData.
Finally, there are some startups that are doing interesting work.
Hadapt, Continuuity, Cloudera (Impala), and Concurrent (Lingual) provide tools for Hadoop developers
Apache Sqoop connects Hadoop to traditional RDBMS
Skytree, Ayasdi, Splunk, Automated Insights, and 0xdata have machine learning tools
import.io extracts data from websites
VoltDB and ParStream: distributed, massively parallel processing columnar databases based on a shared-nothing architecture running on commodity servers
RainStor offers a NewDB on Hadoop that uses SQL
Ayata offers Prescriptive Analytics, one of the few vendors to do so
FoundationDB is working on a NoSQL product that is ACID compliant, a best-of-both-worlds (SQL and NoSQL) effort
FairCom’s c-treeACE Database allows for processing both relational and NoSQL-type records
Cloudant offers a distributed database as a service (DBaaS)
ScaleBase dynamically scales out relational databases (MySQL).
DeepDB provides simultaneous transactions and Analytics on the same data set, in real time.
Platfora reads data directly from Hadoop without a middle layer and generates analytics reports.
DataStax delivers consulting, support and training for Cassandra.
MySQL refugees unite under SkySQL
*Note: Although Hadoop adoption is on the rise, its performance with real-time Analytics is not there yet.