Saturday, November 10, 2012

Besides MapReduce-style clusters, what other popular paradigms/architectures can handle large scale computational problems?


This is a question on Quora which I answered a while back.


Short answer. Yes, there are other programming paradigms for large-scale computing, such as MPI and GPGPU. Both are very popular models, alongside MapReduce.

Longer answer. We have to be more specific about what makes a computational problem "large scale." Is it large in computational complexity or large in data?

Basically, MPI is suited to task parallelism while MapReduce is designed for data parallelism. I'll explain this point with two examples.

1. MPI for high-performance computing. Most scientific computing problems (e.g. weather forecasting or rendering an animated movie) fall into this category. In fact, the top supercomputers in the world are dedicated to running MPI applications, not MapReduce applications.

Consider the sample problem of multiplying two large matrices, each of size 10000x10000. The computational complexity of matrix multiplication is O(N^3), so running the algorithm on one computer is not practical. The data, however, is not that big: each 10000x10000 matrix of double-precision numbers is only 800 MB, so the whole problem is a few gigabytes and can easily be stored on a hard drive.

MPI is the natural framework for this kind of parallelism. The matrices are decomposed into blocks and stored on the storage node(s). The compute nodes load their corresponding blocks into memory and run the computation. With MPI, you control what each node does, which data it loads, and how the nodes communicate with each other.
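To make this concrete, here is a minimal sketch using mpi4py and NumPy. The script name matmul_mpi.py, the 1-D row decomposition, and the broadcast of the full B matrix are my own simplifications (production codes use 2-D block algorithms such as Cannon's or SUMMA). It scatters row blocks of A, broadcasts B, has each rank compute its slice of C, and gathers the result:

```python
#!/usr/bin/env python
# matmul_mpi.py -- block matrix multiplication with MPI (mpi4py + NumPy).
# 1-D row decomposition: row blocks of A are scattered, B is broadcast,
# each rank computes its slice of C = A * B, and the root gathers the result.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1024            # use 10000 for the example in the text; N must divide evenly by `size` here
rows = N // size

if rank == 0:
    # In a real run the root (or a parallel I/O layer) would load A and B from disk.
    A = np.random.rand(N, N)
    B = np.random.rand(N, N)
else:
    A = None
    B = np.empty((N, N))

# Explicit data distribution: each rank gets `rows` rows of A and a full copy of B.
A_block = np.empty((rows, N))
comm.Scatter(A, A_block, root=0)
comm.Bcast(B, root=0)

# Each rank performs its share of the O(N^3) work.
C_block = A_block.dot(B)

# Collect the row blocks of C back on the root.
C = np.empty((N, N)) if rank == 0 else None
comm.Gather(C_block, C, root=0)

if rank == 0:
    print("C[0, 0] =", C[0, 0])
```

You would launch it with something like `mpirun -np 8 python matmul_mpi.py` (Open MPI syntax). The point is that the programmer explicitly decides how the data is partitioned, where it lives, and when the nodes communicate.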

MapReduce is an overly complicated model for this problem because:
  1. You have to frame your algorithm in terms of keys and values, which is not natural here. In this example, you would have to invent artificial keys just to get parallelism.
  2. It's not even clear how you would distribute the input matrices across mappers.
  3. The framework adds phases you don't need, such as scheduling tasks onto machines, shuffling, merging, and sorting.
  4. Data locality and fault tolerance, the main strengths of MapReduce, matter far less here.

Using MapReduce for this type of application means extra complexity in both programming time and run time.

2. MapReduce for big data processing. Now consider the problem of finding the top 5 most repeated words of length >= 10 in all books published since 1900.

It's quite straightforward to do with Hadoop/MapReduce; a minimal sketch is shown below.
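For concreteness, here is a Hadoop Streaming sketch in Python. The script name topwords.py, the regex tokenizer, and the single-file map/reduce switch are my own choices, not anything prescribed by Hadoop. The mapper emits (word, 1) for every word of length >= 10, and the reducer sums the counts and keeps a running top-5 heap:

```python
#!/usr/bin/env python
# topwords.py -- Hadoop Streaming sketch for the "top repeated long words" example.
# Run as `topwords.py map` in the map phase and `topwords.py reduce` in the reduce phase.
import heapq
import re
import sys

MIN_LEN = 10        # only count words of length >= 10
TOP_K = 5

def mapper():
    # Emit (word, 1) for every sufficiently long word read from stdin.
    word_re = re.compile(r"[a-z]+")
    for line in sys.stdin:
        for word in word_re.findall(line.lower()):
            if len(word) >= MIN_LEN:
                sys.stdout.write("%s\t1\n" % word)

def reducer():
    # Hadoop sorts map output by key, so identical words arrive consecutively.
    top = []                          # min-heap of (count, word)
    current, count = None, 0

    def push(word, n):
        heapq.heappush(top, (n, word))
        if len(top) > TOP_K:
            heapq.heappop(top)

    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                push(current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        push(current, count)

    for n, word in sorted(top, reverse=True):
        sys.stdout.write("%s\t%d\n" % (word, n))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Submitted through the hadoop-streaming jar with a single reducer (or followed by a tiny second job to merge per-reducer results), this yields the global top 5. Now let's see how you would do the same thing using MPI.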

First, the amount of data is so large that it can only be stored across multiple machines, spanning racks or even data centers. A distributed file system with high fault tolerance and good data locality, such as HDFS or GFS, is required. To exploit data locality, the compute nodes and the storage nodes are the same machines.

In the current MPI paradigm, you have to specify explicitly which machines to run on (a machine list or host file). Even if you run on all of them, each node has no idea which data files it should load. Some of the input files are stored on the same node or on nearby nodes, but which ones? You don't know.