Tuesday, July 1, 2014

On the imminent decline of MapReduce

Google recently announced at Google I/O 2014 that they are retiring MapReduce (MR) in favor of a new system called Cloud Dataflow. Well, the article's author perhaps dramatized it when quoting Urs Hölzle's words:
We don’t really use MapReduce anymore. 
You can watch the keynote here for better context. My guess is that no one at Google is writing new MapReduce jobs anymore, but Google will keep running legacy MR jobs for years until they are all replaced or obsolete.

Regardless of what has happened at Google, I'd like to point out that MR should have been ditched long ago.

Someone at Cloudera (the company that used to make money on the hype of Hadoop MapReduce) already partially explained why in this blog post: The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google. Some quotes to remember are:
  • Indeed, it’s a bit of a surprise to me that it lasted this long.
  • and the real contribution from Google in this area was arguably GFS, not Map-Reduce.
Every real distributed machine learning (ML) researcher/engineer knows that MR is bad [*]. ML algorithms are iterative, and MR is not suited to iterative algorithms: each iteration runs as a separate job that reads its input from, and writes its output to, the distributed file system, so you pay for unnecessarily frequent I/O and scheduling, plus other factors (see the illustration below). For more details on the weaknesses of MR, one can read any intro slides about Spark [**].
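To make the I/O point concrete, here is a minimal Python sketch. It is only a toy on a single machine, not Hadoop or Spark code: `SimulatedStorage`, `train_job_per_iteration`, and `train_cached` are hypothetical names made up for illustration, and the "storage" is just a list with a scan counter. It runs the same gradient descent twice, once re-reading the input on every iteration (roughly how a chain of MR jobs behaves) and once loading the data a single time and keeping it in memory (roughly the idea behind caching in Spark).

```python
class SimulatedStorage:
    """Hypothetical stand-in for a distributed file system; counts full scans."""

    def __init__(self, points):
        self._points = list(points)
        self.scans = 0

    def scan(self):
        # Every call models one full read of the dataset from storage.
        self.scans += 1
        return list(self._points)


def gradient(points, w):
    # Gradient of the mean squared error for the one-parameter model y = w * x.
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)


def train_job_per_iteration(storage, iterations, lr=0.01):
    # MR-like pattern: each iteration is a separate job that re-reads its input.
    w = 0.0
    for _ in range(iterations):
        points = storage.scan()
        w -= lr * gradient(points, w)
    return w


def train_cached(storage, iterations, lr=0.01):
    # Cached pattern: read once, keep the working set in memory across iterations.
    points = storage.scan()
    w = 0.0
    for _ in range(iterations):
        w -= lr * gradient(points, w)
    return w


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 11)]  # points on the line y = 3x

    per_job = SimulatedStorage(data)
    w_per_job = train_job_per_iteration(per_job, iterations=20)

    cached = SimulatedStorage(data)
    w_cached = train_cached(cached, iterations=20)

    print("job per iteration:", w_per_job, "scans:", per_job.scans)
    print("cached in memory: ", w_cached, "scans:", cached.scans)
```

Both variants do exactly the same arithmetic; the only difference is how often the input is scanned, once per iteration versus once in total. On a real cluster each of those scans also comes with job scheduling and a write of the intermediate result, which this toy does not model.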


Also note that Mahout, the ML library for Hadoop, recently said goodbye to MapReduce.

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. 

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

Notes
[*] Unfortunately, lots of companies, including my employer, are still chasing the Hadoop game. Less than a year ago, Microsoft announced HDInsight, a.k.a. Hadoop on Azure.
[**] For virtually everything that MR can do, Spark can do equally well and in most cases better. Also note that while Spark is generally fantastic, it is not necessarily the right distributed framework for every ML problem.