Tuesday, August 5, 2014

The importance of cache efficiency in SGD optimization



Recently, some members in our team have experimented with variants of SGD (Stochastic Gradient Descent) and SDCA (Stochastic Dual Coordinate Ascent) on large and sparse datasets (such as the KDD Cup 2010 dataset [*]). Note that we've focused only on linear models (e.g. logistic regression and SVM) which lead to convex optimization.

What we have found during the process is very interesting from the engineering standpoint, yet not covered in any academic papers. That is: in SGD and SDCA, each weights update is typically so fast [**] that cache efficiency can become the limiting factor. Put it another way: we experienced that if the data is randomly shuffled in memory, the algorithm suddenly becomes ~3x slower (see the table below).


Experiment Throughput
(number of weights updates per second)
Don’t shuffle data 3.6M
Shuffle data
(only shuffle the references to the data rows)
1.2M
(3x slower)
Shuffle, then clone the shuffled data in memory 2.5M
Throughput measurements for different data iteration strategies

However, as many papers have shown, shuffling data (for each pass) is important to get good convergence. So here is a trick to make SDCA cache-friendly when data needs to be shuffled: use another thread to shuffle the data, then create another deep copy of it. Note that a full deep copy of the whole dataset is not necessary. Since the cloned examples are consumed only once, we can use a buffer to store batches of the shuffled data. Once an example is pulled, it is removed from the buffer.

Think of this scheme above as the producer-consumer pattern, where the producer is the examples cloner and the consumer is the learning algorithm.

Conclusions: when implementing SGD or SDCA, be obsessed with cache efficiency.

Side notes
[*] This dataset is about ~5GB when loaded into memory, stored using sparse format and single-precision float type.
[**] We use an optimized BLAS-like library for fast matrix/vector operations