Google F1 Database: One Step Closer To Discovering The DB Holy Grail


Fari Payandeh





Sept 15, 2013


Google recently replaced the MySQL database behind AdWords with F1, a database it built in-house. AdWords serves thousands of users, “which all share a database over 100TB serving up hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day,” Google said.

After reading Google’s paper on F1 (which is not open source), I started thinking about its ramifications for databases in general and Big Data in particular. The paper might trigger new initiatives that eventually materialize the phantom described in the next paragraph. It mentions a few challenges with F1 that still need to be addressed, and I came away with two lingering issues. First, there is no mention of security. Second, it states, “Hide RPC latency, Buffer writes in client, send as one RPC”. What happens if the network connection between the client and the database goes down? Will the buffered data be lost? That would be a serious problem for operations that need to commit as fast as possible; airline reservations are one example. I may have misunderstood.
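To make the buffering question concrete, here is a minimal Python sketch of one possible reading of that remark. FakeServer and BufferingClient are hypothetical names invented for illustration; this is not F1's client API, and the paper may handle failures differently. The assumption is that writes are held in client memory and sent as a single all-or-nothing commit call.

```python
class FakeServer:
    """Hypothetical stand-in for the database side (not F1's real API)."""

    def __init__(self):
        self.rows = {}
        self.rpc_count = 0
        self.reachable = True

    def commit(self, writes):
        # One call applies the whole batch, or nothing at all.
        if not self.reachable:
            raise ConnectionError("network down")
        self.rpc_count += 1
        self.rows.update(writes)


class BufferingClient:
    """Hypothetical client that buffers writes until commit."""

    def __init__(self, server):
        self.server = server
        self.pending = {}

    def write(self, key, value):
        # No network traffic here: the write only lives in client memory.
        self.pending[key] = value

    def commit(self):
        # Send every buffered write in a single round trip.
        self.server.commit(dict(self.pending))
        # Only reached if the call above succeeded.
        self.pending.clear()


server = FakeServer()
client = BufferingClient(server)

client.write("booking/1", "seat 12A")
client.write("booking/2", "seat 12B")
client.commit()  # two writes, one round trip

server.reachable = False  # simulate a dropped connection
client.write("booking/3", "seat 12C")
try:
    client.commit()
    outcome = "committed"
except ConnectionError:
    outcome = "not committed"  # the caller finds out and can retry
```

Under these assumptions, a dropped connection means the third booking was never committed rather than silently lost: the client still holds it in its buffer and knows the commit failed. Whether the application retries or gives up is then a design decision.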

The system resembles a hybrid between relational and hierarchical (think mainframe) databases. What is the Holy Grail in the database world? Relational databases (RDBMS) are like high-rises comprising many apartments. What if there are no vacancies and people have lined up to rent from us? The RDBMS has handled that demand by adding more floors on top of the high-rise, which is expensive and slows down day-to-day operations. A new technology (NoSQL) emerged a few years ago and solved the space allocation problem. Instead of building new floors, we place the tenants in inexpensive houses, and once we run out of vacant houses we simply add new ones. The downside? It makes managing the place more difficult, and we might unwittingly reserve the same house for two different individuals. There are ways to prevent that, but it’s a perplexing task and it places a lot of pressure on the engineers who design the housing complex. The Holy Grail is to discover a method that combines the best of both worlds and removes the negatives.
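One common way to avoid renting the same house twice is a conditional write: check for vacancy and record the tenant in a single atomic step. The Python sketch below is only an illustration of that idea. HouseRegistry and reserve_if_vacant are hypothetical names, and a single in-process lock stands in for whatever mechanism a real distributed store would use.

```python
import threading


class HouseRegistry:
    """Hypothetical registry; the lock stands in for an atomic conditional write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tenant_by_house = {}

    def reserve_if_vacant(self, house, tenant):
        # The vacancy check and the write happen as one atomic step,
        # so two callers can never both see the house as vacant.
        with self._lock:
            if house in self._tenant_by_house:
                return False
            self._tenant_by_house[house] = tenant
            return True

    def tenant_of(self, house):
        with self._lock:
            return self._tenant_by_house.get(house)


registry = HouseRegistry()
first = registry.reserve_if_vacant("house-7", "alice")   # succeeds
second = registry.reserve_if_vacant("house-7", "bob")    # rejected
```

The hard part the analogy points to is doing this across many machines, where no single lock exists; that is where the pressure on the engineers comes from.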

Following Google’s invaluable tips in the paper, no doubt some engineers are working hard to figure out how to build an F1++ database. What if they succeed? What will happen to NoSQL and NewSQL if they produce an open source database system? Several forces are currently shaping open source, Big Data, mobile, and cloud technologies, and their confluence might in time make NoSQL and the existing NewSQL irrelevant: flash-aware applications, shared-nothing architecture, MapReduce methods, software-defined storage, in-memory computing, shared virtual storage array networks, new compression algorithms, atomic writes, horizontal scalability, software-defined networking, columnar technology, progress in fault tolerance, database sharding, and solid state drives.

There is one very powerful force that in my view will keep NoSQL alive and well for years to come, and that is the power of developers. The genie is out of the bottle and all the nuclear fusion in the world combined cannot put it back in. Speaking from personal experience as a developer/DBA, I know that developers hate roadblocks. Once they start on something, they like to continue working. Pulling them away from what they are deeply involved in is like taking a pacifier from a baby. For the first time in history, they can get on their generally free and open source bikes and ride without the hassle of calling the DBAs to open the gates for them every 40 miles. NoSQL pushed the database inside the developers’ world and they love it! Is it good for the industry? Perhaps not, but it might just create millions of programming jobs. After all, somebody has to untangle the convoluted code (through no fault of the developers) left behind. Separation of database and code, as painful as it might be for developers, is a necessity. It establishes checks and balances. According to Google’s paper, they have taken those factors into account. Google F1 is a developer-friendly database. Hopefully the trend will continue.

From Google:

F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
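The “hierarchical schema model” is easier to picture with a small example. The Python sketch below assumes, as an illustration, that a child row's primary key starts with its parent's key, so that sorting by key stores each parent next to its descendants. The table names and ID values are made up, and a sorted list stands in for the storage layer; it is not F1's actual storage code.

```python
# Each row is (primary_key, table_name). A child key is the parent's key
# plus one more component, which is the assumption this sketch relies on.
rows = [
    ((2, 5), "Campaign"),
    ((1,), "Customer"),
    ((1, 3, 6), "AdGroup"),
    ((2,), "Customer"),
    ((1, 4), "Campaign"),
    ((1, 3), "Campaign"),
    ((2, 5, 9), "AdGroup"),
]

# Tuples compare element by element, and a prefix sorts before any longer
# key that extends it, so every parent lands right before its children.
clustered = sorted(rows, key=lambda row: row[0])

for key, table in clustered:
    indent = "  " * (len(key) - 1)
    print(f"{indent}{table}{key}")

# Everything belonging to customer 1 is one contiguous slice of the data.
customer_1 = [row for row in clustered if row[0][0] == 1]
```

With this layout, reading or updating one customer and everything under it touches a single contiguous range, which is one plausible reason such a schema helps offset the higher commit latency the quote mentions.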

Simple description of Big Data

Big Data: Let’s say that a few years ago we wanted to process the data captured by satellites. It was impossible, because traditional databases were incapable of processing the volume, velocity, and variety (text, images, etc.) of the data. Today we can, and this significant leap forward has been dubbed “Big Data”. Big Data is therefore not just about the size of the data; it has more to do with how the data is processed.

Hadoop: Hadoop has become the de facto platform for Big Data. Hadoop is the “operating system” that lays the foundation for capturing data. Among the companies that helped lift “modern Big Data” off the ground (Google, Yahoo, Amazon, Facebook), Google is the only one that doesn’t use Hadoop. Although Google’s fingerprints are all over Hadoop, it currently uses its own analytics platform called BigQuery, which is a cloud-based service.

Hadoop Distributions: Apache, Hortonworks, MapR, Cloudera, Pivotal HD, Amazon, IBM, and Intel. Hortonworks is a 100% open source Apache Hadoop distribution.

NoSQL Databases: They store data without fixed schemas, and some of them (HBase, for example) sit on top of Hadoop. Why no schemas? Because rigid schemas have contributed to the sluggish performance of traditional databases when processing large volumes of data. The frontrunners are MongoDB, Cassandra, Google Cloud Datastore, Amazon DynamoDB, Redis, CouchDB, HBase, Neo4j, MarkLogic, Riak, and Couchbase.

Data Analytics: Now that we can process the data gushing forth, we can look for patterns, correlations, events, anomalies, and invisible nuggets of intelligence to help us make better decisions. The following companies are strong players in that space: IBM, Oracle, Google, Microsoft, SAP, Amazon, Teradata, SAS, TIBCO, StatSoft, KXEN, and Angoss Software. Pentaho and Jaspersoft are open source. Alteryx has recently made its software free. Tableau has a free public edition.