Friday, May 1, 2015

Scaling Up Stochastic Dual Coordinate Ascent

That's our new paper, to appear at KDD 2015. Here's the abstract.
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is mostly implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
There are two main ideas in this paper:
  1. A semi-asynchronous parallel SDCA algorithm that guarantees strong (linear) convergence and scales almost linearly with the number of cores on large, sparse datasets (a minimal sketch of the underlying dual coordinate update follows this list).
  2. A binary data loader that can serve random examples out-of-memory, off a compressed data file on disk. This allows us to train on very large datasets, with minimal memory usage, while still achieving a fast convergence rate (thanks to the pseudo-shuffling). For smaller datasets, we even showed that this *out-of-memory* training approach can be more efficient than standard in-memory training [*].
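
To make the first idea concrete, here is a minimal, single-threaded sketch of the classic SDCA coordinate update for an L2-regularized SVM (hinge loss). It only illustrates what "random coordinate updates on the dual" means; the paper's actual contribution, the asynchronous parallel variant and its primal-dual synchronization, is not reproduced here, and the function and parameter names are my own.

```python
import numpy as np

def sdca_svm(X, y, lam=1e-3, epochs=10, seed=0):
    """Single-threaded SDCA sketch for (lam/2)||w||^2 + (1/n) sum(hinge loss).

    X: (n, d) dense array, y: labels in {-1, +1}, lam: L2 regularization.
    """
    n, d = X.shape
    alpha = np.zeros(n)   # dual variables, one per training example
    w = np.zeros(d)       # primal weights, kept in sync with alpha
    rng = np.random.default_rng(seed)

    for _ in range(epochs):
        for i in rng.permutation(n):       # random coordinate order each epoch
            xi, yi = X[i], y[i]
            norm_sq = xi @ xi
            if norm_sq == 0.0:
                continue
            # Closed-form maximizer of the dual objective in coordinate i
            # (hinge loss keeps alpha_i in [0, 1]).
            margin = 1.0 - yi * (w @ xi)
            new_alpha = np.clip(alpha[i] + margin * lam * n / norm_sq, 0.0, 1.0)
            delta = new_alpha - alpha[i]
            alpha[i] = new_alpha
            # Maintain w = (1 / (lam * n)) * sum_j alpha_j * y_j * x_j.
            w += (delta * yi / (lam * n)) * xi
    return w, alpha
```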
Note that the second idea is not restricted to SDCA, or even to linear learning. In fact, we originally implemented this binary data loader for training large neural networks. However, it couples nicely with SDCA, whose real strength is on very large, sparse datasets, which is exactly where the need for out-of-memory training arises.
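
For intuition, here is a minimal sketch of the block-level pseudo-shuffling idea: examples live in compressed blocks on disk, each epoch shuffles the order of blocks and the order of examples within a block, and a background thread decompresses the next block while the current one is consumed. The gzip/pickle block format and all names here are illustrative assumptions, not the paper's actual binary loader.

```python
import gzip
import pickle
import random
from concurrent.futures import ThreadPoolExecutor

def _load_block(path):
    """Decompress and deserialize one block of examples (a list of (x, y) pairs)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def pseudo_shuffled_examples(block_paths, seed=0):
    """Yield examples in pseudo-random order without holding the dataset in memory."""
    rng = random.Random(seed)
    order = list(block_paths)
    rng.shuffle(order)                        # shuffle block order for this epoch
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(_load_block, order[0])
        for k in range(len(order)):
            block = future.result()
            if k + 1 < len(order):            # decompress the next block in the background
                future = pool.submit(_load_block, order[k + 1])
            rng.shuffle(block)                # shuffle examples within the block
            yield from block
```

The combination of block-order shuffling and within-block shuffling is what gives "sufficient pseudo-randomness" while keeping disk access sequential and memory usage bounded by a single block.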

See the full paper for more details :).

Side notes
[*] Cache efficiency is the key, as I mentioned in a previous blog post.