Sunday, September 14, 2014

My Nerdie Bookshelf - "Big Data" by Nathan Marz and James Warren

Big Data is undoubtedly the hype of 2014. On the one hand, the availability of cheap cloud resources, together with the huge increase in data sources, has paved the way for the incredible rise of interest in this subject; on the other, there is also a lot of marketing behind it.

Real use cases abound, but Big Data has become a buzzword now used for everything related to data analytics. Interesting viewpoints on this can be found in articles such as "'Big Data' Is One Of The Biggest Buzzwords In Tech That No One Has Figured Out Yet" and "If Big Data Is Anything at All, This Is It". Moreover, the term "Big" is often misused: 100 GB of overall data, which can be processed in RAM on affordable machines (see for example this offer from Hetzner), definitely can't be called Big Data...

For me, the strongest point of the Big Data book I'm reviewing here is its ability to present a clear definition of the problem. Given the hype, I wanted to learn more about Big Data and chose this book thanks to the very good reputation of its author, Nathan Marz (well, co-author, to tell the truth): Marz was a technical architect at Twitter and started several exciting open source projects (above all Storm, a sort of ultra-scalable, high-performance service bus).

Here Marz presents his vision and recipes for dealing with the Big Data problem, namely the Lambda Architecture. It is definitely a high-end solution, to be used when large amounts of data must be processed with the lowest possible latency. For this purpose the Lambda Architecture is built around two layers, the batch layer and the speed layer; the latter, as the name implies, processes the most recent data.

Although the book is very practical and goes straight to the technical solutions and the open source technologies the Lambda Architecture is based on, it also clearly expresses the characteristics of Big Data processing. At the heart of it all is the idea that analytics is just a "function" of all the available data and that the master dataset is immutable: new information should only be accumulated and never deleted. This way it is possible to reprocess the whole universe of information when needed. This brings several advantages, like the ability to easily correct errors introduced by previous processing and the ability to compute new indicators, or refine existing ones, over the whole set of information. These data must in fact be processed to produce (or to enrich, if processing is incremental) the indexes used for queries (e.g. for business intelligence): in the author’s words, “to make queries on precomputed views rather than directly on the master dataset”.
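To make the idea of analytics as a "function" of all the data a bit more concrete, here is a minimal sketch in plain Python. It is not code from the book: the PageView record, the MasterDataset class and the compute_batch_view function are names I made up for illustration, and a real master dataset would live on a distributed file system rather than in a list.

```python
from collections import Counter
from typing import NamedTuple


class PageView(NamedTuple):
    """One immutable record in the master dataset (hypothetical schema)."""
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def all_records(self):
        # Return a tuple so callers cannot modify the stored data.
        return tuple(self._records)


def compute_batch_view(records):
    """The precomputed view is a pure function of all the data: views per URL."""
    return dict(Counter(record.url for record in records))


master = MasterDataset()
for url, ts in [("/home", 1), ("/about", 2), ("/home", 3)]:
    master.append(PageView(url, ts))

# Queries hit the precomputed view, not the master dataset itself.
batch_view = compute_batch_view(master.all_records())
print(batch_view["/home"])
```

Since the view is derived entirely from the raw records, fixing a bug in compute_batch_view (or adding a new indicator) only requires running it again over the whole dataset.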

The key technology here is of course Hadoop, both for the distributed management of big data and for processing it with MapReduce algorithms. The chapter on modeling is quite interesting: it presents a data schema that revolves around atomic facts, logically linked to form a graph of structures. In the end this is not too dissimilar from the star/snowflake schemas found in traditional OLAP data warehouses.
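As a rough illustration of what a schema built on atomic facts can look like, here is a small Python sketch. The book relies on Thrift for its schema; the Fact class, its field names and the current_value helper below are my own hypothetical simplification, not the book's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """An atomic, timestamped, immutable fact about one entity (hypothetical)."""
    entity_id: str
    prop: str
    value: str
    timestamp: int


# Facts are only ever appended: a change of location is a new fact,
# not an update of the old one. The "friend" fact acts as an edge
# linking two entities, which is what turns the facts into a graph.
facts = [
    Fact("user:1", "location", "Rome", 100),
    Fact("user:1", "friend", "user:2", 110),
    Fact("user:1", "location", "Milan", 200),
    Fact("user:2", "location", "Turin", 150),
]


def current_value(facts, entity_id, prop):
    """Derive the current value of a property from the most recent fact."""
    matching = [f for f in facts if f.entity_id == entity_id and f.prop == prop]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value


print(current_value(facts, "user:1", "location"))
```

The older "Rome" fact is still there, so the full history remains available for any later reprocessing.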

For the real-time layer Marz proposes the use of the aforementioned Storm, in this case to update short-lived indexes that are destined to be removed after a few hours, once the batch layer has been able to ingest the recent data they were based on.
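The interplay between the two layers can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the RealtimeView class, the batch_view and query functions and the integer timestamps are hypothetical, and in the book this role is played by Storm and a proper database, not by in-memory lists.

```python
from collections import Counter


def batch_view(events, up_to):
    """Batch layer: count events per key up to (and including) timestamp up_to."""
    return Counter(key for key, ts in events if ts <= up_to)


class RealtimeView:
    """Speed layer: keeps only the events the batch layer has not absorbed yet."""

    def __init__(self):
        self._events = []

    def update(self, key, ts):
        self._events.append((key, ts))

    def expire(self, batch_horizon):
        # Drop everything the batch layer now covers.
        self._events = [(k, ts) for k, ts in self._events if ts > batch_horizon]

    def count(self, key):
        return sum(1 for k, _ in self._events if k == key)


def query(key, batch, realtime):
    """Answer a query by merging the batch view with the real-time view."""
    return batch[key] + realtime.count(key)


events = [("/home", 1), ("/home", 2), ("/about", 3)]
batch = batch_view(events, up_to=3)

# Two new events arrive after the last batch run.
realtime = RealtimeView()
realtime.update("/home", 4)
realtime.update("/home", 5)
total_before = query("/home", batch, realtime)

# The next batch run ingests them, so the short-lived entries can go.
events += [("/home", 4), ("/home", 5)]
batch = batch_view(events, up_to=5)
realtime.expire(batch_horizon=5)
total_after = query("/home", batch, realtime)

print(total_before, total_after)
```

The answer to the query is the same before and after the batch run; what changes is which layer is responsible for the most recent events.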

While there is a hint of theory here and there, Big Data is very much a practical book, highly focused on presenting the Lambda Architecture (a much more suitable title would have been A reference architecture for Big Data processing). I wouldn't recommend the book to developers totally new to the Big Data ecosystem: it presents several specific technical solutions (Thrift, Pail, JCascalog, Cassandra, ElephantDB, Trident,…) that should help with and simplify the problems described, but that can probably be fully appreciated only by those already skilled in the subject. In my view it would be beneficial to first get your hands dirty with the main technologies behind it (Hadoop to start with, and Storm if needed) before jumping straight into the final Lambda Architecture as it is presented here.

Overall, I found the book interesting and stimulating. The fact that MEAP books can easily be bought at big discounts (just follow the Manning Twitter account to get promotional coupons) is also a plus worth mentioning here.

More information on Big Data can be found on the Manning web site.