The Best Of Open Source For Big Data

Open Source Tools for Big Data








Fari Payandeh




Sep 1, 2013


It was not easy to select a few out of the many Open Source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, and Oracle releasing its NoSQL database as Open Source.

“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.

Hadoop Distributions


Cloud Operating System

Cloud Foundry — By VMware

OpenStack — Worldwide participation and well-known companies


Fusion-io — Not open source, but very supportive of Open Source projects; flash-aware applications.

Development Platforms and Tools

REEF — Microsoft’s Hadoop development platform

Lingual — By Concurrent

Pattern — By Concurrent

Python — Awesome programming language

Mahout — Machine learning library

Impala — Cloudera

R — MVP among statistical tools

Storm — Stream processing by Twitter

LucidWorks — Search, based on Apache Solr

Giraph — Graph processing by Facebook

NoSQL Databases

MongoDB, Cassandra, HBase

SQL Databases

MySQL — Owned by Oracle

MariaDB — Partnered with SkySQL

PostgreSQL — Object-relational database

TokuDB — Improves RDBMS performance

Server Operating Systems

Red Hat — The de facto OS for Hadoop servers

BI, Data Integration, and Analytics




22 thoughts on “The Best Of Open Source For Big Data”

    1. Hi Someone.

      I know, but I selected it due to its historic significance. When I read the news that, of all companies, Microsoft was going to provide an open source development platform, I was overwhelmed with joy. To me it symbolized the triumph of altruistic motivations over monetary rewards. We should never underestimate the power of passion. It does amazing things. 🙂

  1. You may want to define what “Open Source” really means in the context of distributions such as Talend, Jaspersoft, etc. These “Open Source” distributions are severely, and I mean very severely, limited in use.

    On the other hand, PostgreSQL comes without any limitations. I am inclined to say that distributions are commercial software and NOT open source.

  2. Fari, It’s worth mentioning HPCC Systems, an open source, mature platform for Big Data processing and analysis. Its declarative language, ECL, also integrates well with other open source languages and tools like R, Python, Eclipse, and Pentaho. See

    1. I know that HPCC is a great product. I think the AGPL license might be limiting its success. Hadoop’s open source community is very strong. HPCC needs better marketing. I used to work for Informix, and it had the best database at the time. It was well ahead of Oracle, but Oracle did a much better job of marketing its products. Although Informix was eventually acquired by IBM, it never reached its potential.

  3. Great list. I have been looking for a list like this. Thanks. But it misses some very vital ones. Did you have a personal use case in mind when you selected this list? It would help to understand the reasoning behind it. Anyway, here are some more that I think are amazing technologies that you can
    Neo4j – ACID-compliant graph database
    Redis – In-memory database (a Big Data solution may need an in-memory cache)
    HPCC (as pointed out by Data 101) – For complex petabyte-scale analytics
    ROXIE (which is part of HPCC) – Parallelized data delivery engine
    SciDB – Big Data for the scientist

    On a lighter note: “If you can’t beat them, join them” is good for incremental innovation.
    Wouldn’t you agree that this ecosystem also needs disruptive innovators? Someone who asks, say, what if Hadoop is not my only answer? Because the trend right now (which I am sure you will agree) is that Hadoop is the answer irrespective of the question 🙂

    1. Srinii,

      It wasn’t easy to put this list together. No matter what I chose, some would have been left out. I even removed one of my personal favorites, Neo4j. The criterion I picked was: “What is it that Big Data needs most at this point in time?” For instance, TokuDB doesn’t make the list on its own, but as an augmentative solution for MySQL it is valuable.
      I am thinking about making a more comprehensive list that will include at least one tool representing each unique area. I appreciate your input. I agree with what you say.

