Sep 1, 2013
By Fari Payandeh
It was not easy to select a few out of the many Open Source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, and Oracle releasing its NoSQL database as Open Source.
“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.
Hadoop Distributions
Hortonworks
Cloud Operating System
Cloud Foundry — By VMware
OpenStack — Worldwide participation and well-known companies
Storage
Fusion-io — Not open source, but very supportive of Open Source projects; Flash-aware applications.
Development Platforms and Tools
REEF — Microsoft’s Hadoop development platform
Lingual — By Concurrent
Pattern — By Concurrent
Python — Awesome programming language
Mahout — Machine learning library
Impala — Cloudera
R — MVP among statistical tools
Storm — Stream processing by Twitter
LucidWorks — Search, based on Apache Solr
Giraph — Graph processing by Facebook
NoSQL Databases
MongoDB, Cassandra, HBase
SQL Databases
MySQL — Belongs to Oracle
MariaDB — Partnered with SkySql
PostgreSQL — Object Relational Database
TokuDB — Improves RDBMS performance
Server Operating Systems
Red Hat — The de facto OS for Hadoop Servers
BI, Data Integration, and Analytics
Talend
Pentaho
Jaspersoft
Even though all of these can easily be found on Google, it may be better to add links to all recommendations.
Thank you Mirmirik.
That’s a good idea!
I’ll add the links when I get some free time.
I appreciate your input.
REEF is not even released to the public. It’s just a research project…
Hi Someone.
I know, but I selected it due to its historic significance. When I read the news that, of all companies, Microsoft was going to provide an open source development platform, I was overwhelmed with joy. To me it symbolized the triumph of altruistic motivations over monetary rewards. We should never underestimate the power of passion. It does amazing things. 🙂
You may want to define what “Open Source” really means in the context of distributions such as Talend, Jaspersoft, etc. These “Open Source” distributions are severely, and I mean very severely, limited in use.
On the other hand, PostgreSQL comes without any limitations. I am inclined to say that these distributions are commercial software and NOT open source.
Hi Mark,
I understand your point. I rely on Gartner Group in deference to their authority. Based on my personal experience over the years, Gartner Group represents the most reliable and unbiased source of information. Gartner lists Talend, Jaspersoft, and Pentaho as open source.
http://www.gartner.com/technology/reprints.do?id=1-1DZLPEP&ct=130207&st=sb
No Jubatus? http://jubat.us/en/
How about Julia (http://julialang.org/) and Julia Studio (http://forio.com/julia/).
Looks good… it reminds me of Pascal. Is Julia object oriented?
The answer to that question is “Kind of”. This link will tell you why – https://groups.google.com/d/msg/julia-users/3tTljDSQ6cs/ouPARNq7qCcJ.
Rob,
Please don’t point to a site. If you know the answer, just say it, because it’s a simple question: Is the programming language object oriented?
Fari, It’s worth mentioning HPCC Systems, an open source, mature platform for Big Data processing and analysis. Its declarative language, ECL, also integrates well with other open source languages and tools like R, Python, Eclipse, and Pentaho. See hpccsystems.com
I know that HPCC is a great product. I think the AGPL license might be limiting its success. Hadoop’s open source community is very strong. HPCC needs better marketing. I used to work for Informix, and it had the best database at the time. It was well ahead of Oracle, but Oracle did a much better job marketing its products. Although Informix was eventually acquired by IBM, it never reached its potential.
Fari, just to clarify, HPCC Systems CE is now under Apache License 2.0.
I didn’t know that… thank you for letting me know. That makes HPCC a real competitor for Hadoop.
Great list. I have been looking for a list like this. Thanks. But it misses some very vital ones. Did you have a personal use case in mind when you selected this list? It would help to understand the reasoning behind it. Anyway, here are some more that I think are amazing technologies that you can
neo4j – ACID compliant graph database
redis – In memory database (Big data solution may need an in-memory cache)
HPCC (as pointed out by Data 101) – For complex petabyte-scale analytics
ROXIE (which is part of HPCC) – parallelized data delivery engine
sciDB – (big data for the scientist)
On a lighter note: “If you can’t beat them, join them”.. is good for incremental innovation.
Wouldn’t you agree that this ecosystem also needs disruptive innovators? Someone who asks, say, what if Hadoop is not my only answer? Because the trend right now (which I am sure you will agree) is that Hadoop is the answer irrespective of the question 🙂
Cheers
Srini
Srini,
It wasn’t easy to put this list together. No matter what I chose, some would have been left out. I even removed one of my personal favorites, neo4j. The criterion I picked was: “What is it that Big Data needs most at this point in time?” For instance, TokuDB doesn’t make the list on its own, but as an augmentative solution for MySQL it is valuable.
I am thinking about making a more comprehensive list that will include at least one tool representing a unique area. I appreciate your input. I agree with what you say.
The unique tool list should stay with syslog-ng.
I like what you guys are up to. Such clever work and exposure!
Keep up the good work, guys. I’ve added you to my personal blogroll.
Thank you multi,
Best of luck to you!
Hello there. I discovered your blog using MSN.
That is a very well written article. I’ll be sure to bookmark it and come back to learn more of your useful info.
Thank you for the post. I will certainly come back.