
BI vs Data Analytics vs Big Data
By Fari Payandeh

Let’s say I work for the Center for Disease Control and my job is to analyze the data gathered from around the country to improve our response time during flu season. Suppose we want to know about the geographical spread of flu for the last winter (2016). We run some BI reports, and they tell us that the state of New York had the most outbreaks. Knowing that, we might want to better prepare the state for the next winter. These types of queries examine past events, are the most widely used, and fall under the Descriptive Analytics category.
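To make the distinction concrete, here is a minimal sketch of a descriptive query in Python with pandas; the table of reported cases, its column names, and the numbers are all hypothetical, not CDC data:

```python
# A descriptive-analytics query: "what happened last winter?"
# The DataFrame stands in for a hypothetical table of reported flu cases.
import pandas as pd

cases = pd.DataFrame({
    "state": ["NY", "NJ", "CA", "NY", "TX"],
    "cases": [1200, 300, 450, 900, 500],
})

# Total reported cases per state for the season, highest first --
# exactly the kind of question a BI report answers.
outbreaks_by_state = (
    cases.groupby("state")["cases"].sum().sort_values(ascending=False)
)
print(outbreaks_by_state.head(1))  # the state with the most outbreaks
```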

Now, we just purchased an interactive visualization tool, and I am looking at a map of the United States depicting the concentration of flu in different states for the last winter. I click a button to display the vaccine distribution. There it is: I visually detect a direct correlation between the intensity of the flu outbreak and the late shipment of vaccines. I notice that the shipments of vaccine for the state of New York were delayed last year. This gives me a clue to investigate further and determine whether the correlation is causal. This type of analysis falls under Diagnostic Analytics (discovery).
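The diagnostic step amounts to quantifying that visual hunch. A small sketch, again with invented per-state numbers, might compute the correlation like this:

```python
# Diagnostic analytics: is outbreak intensity related to shipment delay?
# Both columns are hypothetical stand-ins for the data behind the map.
import pandas as pd

df = pd.DataFrame({
    "state": ["NY", "NJ", "CA", "TX", "FL"],
    "outbreak_intensity": [95, 40, 35, 50, 45],
    "shipment_delay_days": [21, 5, 3, 8, 6],
})

# Pearson correlation between outbreak intensity and vaccine delay.
# A strong positive value is a clue to investigate, not proof of causation.
r = df["outbreak_intensity"].corr(df["shipment_delay_days"])
print(f"correlation: {r:.2f}")
```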

We go to the next phase, which is Predictive Analytics. PA is what most people in the industry refer to as Data Analytics. It gives us the probability of different outcomes, and it is future-oriented. US banks have been using it for things like fraud detection. The process of distilling intelligence is more complex and requires techniques like statistical modeling. Back to our example: I hire a Data Scientist to help me create a model and apply the data to it in order to identify causal relationships and correlations as they relate to the spread of flu in the winter of 2018. Note that we are now talking about the future. I can use my visualization tool to play around with variables such as demand, vaccine production rate, quantity… to weigh the pluses and minuses of different decisions about how to prepare for and tackle potential problems in the coming months.
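As a rough illustration of the modeling step, here is a toy regression with scikit-learn. The features, target, and figures are all invented for the example; a real epidemiological model would need far more data and care:

```python
# Predictive analytics: fit a model on (hypothetical) past winters,
# then ask "what if?" about the coming one.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [vaccine_production_rate, shipment_delay_days, demand_index]
X = np.array([
    [100,  2, 50],
    [ 80, 10, 70],
    [ 60, 21, 90],
    [120,  1, 40],
    [ 70, 15, 85],
])
y = np.array([20, 55, 95, 10, 75])  # observed outbreak severity

model = LinearRegression().fit(X, y)

# Play with the variables, as with the visualization tool, to compare
# scenarios before deciding how to prepare.
next_winter = np.array([[90, 12, 80]])
print(model.predict(next_winter))
```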

The last phase is Prescriptive Analytics: integrating our tried-and-true predictive models into our repeatable processes to yield desired outcomes. An automated risk-reduction system based on real-time data received from sensors in a factory is a good example of a use case.
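A prescriptive system is, at heart, a repeatable rule that turns a model's output into an action. A toy sketch, with hypothetical sensor readings and thresholds:

```python
# Prescriptive analytics: map real-time sensor data plus a model's
# risk score to a concrete action. Names and thresholds are invented.
def prescribe(sensor_temp_c: float, predicted_failure_risk: float) -> str:
    if predicted_failure_risk > 0.8 or sensor_temp_c > 95.0:
        return "shut down line and dispatch maintenance"
    if predicted_failure_risk > 0.5:
        return "throttle line speed and schedule inspection"
    return "continue normal operation"

print(prescribe(sensor_temp_c=97.2, predicted_failure_risk=0.4))
```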

Finally, here is an example of Big Data. Suppose it’s December 2018, and it happens to be a bad year for the flu epidemic. A new strain of the virus is wreaking havoc, and a drug company has produced a vaccine that is effective in combating it. The problem is that the company can’t produce doses fast enough to meet demand, so the Government has to prioritize its shipments. Currently, the Government has to wait a considerable amount of time to gather the data from around the country, analyze it, and take action. The process is slow and inefficient. Three factors contribute: computer systems that cannot gather and store the data fast enough (velocity), systems that cannot accommodate the volume of data pouring in from all of the medical centers in the country (volume), and systems that cannot process non-tabular data such as images, e.g., X-rays (variety).

Big Data technology changed all of that. It solved the velocity-volume-variety problem; we now have computer systems that can handle “Big Data”. The Center for Disease Control can receive the data from hospitals and doctors’ offices in real time, and Data Analytics software sitting on top of the Big Data system can generate actionable items that give the Government the agility it needs in times of crisis.
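As a rough sketch of that real-time idea, the snippet below simulates a stream of incoming case reports with a plain Python generator and flags states crossing a made-up threshold; a production system would use dedicated streaming infrastructure rather than a loop like this:

```python
# Simulated real-time intake: maintain running totals as reports arrive
# and emit an actionable item when a (hypothetical) threshold is crossed.
from collections import Counter

def case_stream():
    # Stand-in for reports arriving from hospitals in real time.
    yield from [("NY", 3), ("CA", 1), ("NY", 2), ("TX", 4), ("NY", 1)]

running_totals = Counter()
flagged = set()
for state, new_cases in case_stream():
    running_totals[state] += new_cases
    if running_totals[state] >= 5 and state not in flagged:
        flagged.add(state)
        print(f"prioritize vaccine shipments to {state}")
```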

Hadoop vs. NoSql vs. Sql vs. NewSql By Example

Fari Payandeh

Sept 8, 2013

Although Mainframe hierarchical databases are very much alive today, relational databases (RDBMS/SQL) have dominated the database market, and they have done a lot of good. They are the reason the money we deposit doesn’t go to someone else’s account, our airline reservation ensures that we have a seat on the plane, and we are not blamed for something we didn’t do. RDBMS’ data integrity is due to its adherence to the ACID principles (atomicity, consistency, isolation, and durability). RDBMS technology dates back to the 1970s.
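Atomicity is the easiest of the four to see in code. The sketch below uses Python's built-in sqlite3 module with hypothetical account data: both legs of a transfer live in one transaction, so either both happen or neither does:

```python
# ACID atomicity in miniature: a transfer either fully commits or
# fully rolls back. Accounts and balances are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

# The `with` block is one transaction: if either UPDATE fails, both roll
# back, and the money doesn't go to someone else's account.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
```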

So what changed? Web technology started the revolution. Today, many people shop on Amazon. RDBMS was not designed to handle the number of transactions that take place on Amazon every second. The primary constraining factor was RDBMS’ schema.

NoSql databases offered an alternative by eliminating schemas at the expense of relaxing the ACID principles. Some NoSql vendors have made great strides towards resolving the issue; the solution is called eventual consistency. As for NewSql: why not create a new RDBMS minus RDBMS’ shortcomings, using modern programming languages and technology? That is how some of the NewSql vendors came to life. Other NewSql companies created augmented solutions for MySql.
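The schema difference is easy to picture: a document store accepts records of different shapes as-is, where an RDBMS would require a schema migration. A tiny sketch with hypothetical player documents:

```python
# Schemaless storage in miniature: two records with different fields
# coexist without any ALTER TABLE. The documents are hypothetical.
import json

players = [
    {"id": 1, "name": "Ana", "achievements": ["first_win"]},
    {"id": 2, "name": "Bo", "guild": "Night Owls"},  # different shape, no migration
]
for doc in players:
    print(json.dumps(doc))
```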

Hadoop is a different animal altogether. It’s a file system, not a database. Hadoop’s roots are in internet search engines. Although Hadoop and its associates (HBase, MapReduce, Hive, Pig, ZooKeeper) have turned it into a mighty database, at its core Hadoop is an inexpensive, scalable, distributed filesystem with fault tolerance. Hadoop’s specialty at this point in time is batch processing, which makes it suitable for Data Analytics.

Now let’s start with our example: my imaginary video game company recently put our most popular game online after ten years in business shipping our games to retailers around the globe. Our customer information is currently stored in a SQL Server database, and we have been happy with it. However, since players started playing the game online, the database has not been able to keep up, and users are experiencing delays. As our user base grows rapidly, we spend money buying more and more hardware and software, but to no avail. Losing customers is our primary concern. Where do we go from here?

We decide to run our online game application on NoSql and NewSql simultaneously by segmenting our online user base; our objective is to find the optimal solution. The IT department selects Couchbase (a document-oriented NoSql database, like MongoDB) and VoltDB (NewSql).

Couchbase is open source, has an integrated caching mechanism, and can automatically spread data across multiple nodes. VoltDB is an ACID-compliant RDBMS that is fault tolerant, scales horizontally, and has a shared-nothing, in-memory architecture. In the end, both systems are able to deliver. I won’t go into the intricacies of each solution because this is an example; comparing these technologies in the real world would require testing, benchmarking, and in-depth analysis.

Now that the online operations are running smoothly, we want to analyze our data to find out where we should expand our territory. Which countries are the most suitable for marketing our products? To answer that, we need to merge the SQL Server customer Data Warehouse with the data from the online gaming database and run analytical reports. That’s where Hadoop comes in. We configure a Hadoop system and merge the data from the two sources. Next, we use Hadoop’s MapReduce in conjunction with the open source R programming language to generate the analytics reports.
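To show the shape of such a job, here is a toy MapReduce pass written in plain Python rather than on Hadoop itself: map each (hypothetical) game-session record to a (country, revenue) pair, shuffle by key, and reduce to per-country totals. On a real cluster, the same pattern runs in parallel across many nodes:

```python
# The MapReduce pattern in miniature. Session records and field
# names are invented for the example.
from itertools import groupby
from operator import itemgetter

sessions = [
    {"country": "DE", "revenue": 4.99},
    {"country": "BR", "revenue": 1.99},
    {"country": "DE", "revenue": 9.99},
    {"country": "JP", "revenue": 4.99},
]

# Map phase: emit (key, value) pairs.
mapped = [(s["country"], s["revenue"]) for s in sessions]

# Shuffle/sort phase: bring equal keys together.
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate values per key.
for country, pairs in groupby(mapped, key=itemgetter(0)):
    print(country, round(sum(v for _, v in pairs), 2))
```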