Friday, May 1, 2015

Scaling Up Stochastic Dual Coordinate Ascent

That's our new paper, to appear at KDD 2015. Here's the abstract.
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is mostly implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
There are two main ideas in this paper:
  1. A semi-asynchronous parallel SDCA algorithm that guarantees strong (linear) convergence and scales almost linearly with respect to the number of cores on large and sparse datasets.
  2. A binary data loader that can serve random examples out of memory, off a compressed data file on disk. This lets us train on very large datasets with minimal memory usage while still achieving a fast convergence rate (thanks to the pseudo-shuffling). For smaller datasets, we even showed that this *out-of-memory* training approach can be more efficient than standard in-memory training approaches [*].
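To make the first idea concrete, here is a toy single-machine sketch of asynchronous dual coordinate updates for an L2-regularized linear SVM (hinge loss). The data, variable names, and thread count are mine for illustration only; the paper's actual algorithm adds the primal-dual synchronization step on top of this kind of lock-free update.

```python
import random
import threading

# Toy sketch of asynchronous SDCA for an L2-regularized linear SVM
# (hinge loss). Each worker thread performs coordinate updates on a
# shared primal vector w without locking.

random.seed(0)
n, d, lam = 200, 5, 0.1
true_w = [random.gauss(0, 1) for _ in range(d)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [1.0 if sum(w * x for w, x in zip(true_w, xi)) >= 0 else -1.0 for xi in X]

alpha = [0.0] * n            # dual variables, one per example
w = [0.0] * d                # primal vector, maintained incrementally

def sdca_worker(seed, epochs):
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)                  # random coordinate order
        for i in order:
            xi, yi = X[i], y[i]
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            norm_sq = sum(xj * xj for xj in xi)
            # closed-form coordinate step for the hinge-loss dual,
            # clipped so alpha[i] stays in [0, 1]
            step = (1.0 - margin) / (norm_sq / (lam * n))
            delta = min(1.0, max(0.0, alpha[i] + step)) - alpha[i]
            alpha[i] += delta
            for j in range(d):              # racy but convergent in practice
                w[j] += delta * yi * xi[j] / (lam * n)

threads = [threading.Thread(target=sdca_worker, args=(s, 5)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

correct = sum(
    1 for xi, yi in zip(X, y)
    if (sum(wj * xj for wj, xj in zip(w, xi)) >= 0) == (yi > 0)
)
print(f"training accuracy: {correct / n:.2f}")
```

The updates to w are intentionally unsynchronized; the point of the paper is that with periodic primal-dual synchronization this stays convergent while scaling across cores.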
Note that the second idea is not restricted to SDCA, or even to linear learning. In fact, we originally implemented this binary data loader for training large neural networks. However, it couples nicely with SDCA because the real strength of SDCA lies in very large, sparse datasets, which is exactly the setting where the need for out-of-memory training arises.
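The second idea can be illustrated with a toy block-compressed format. Examples are pickled and compressed in fixed-size blocks, and the reader shuffles at two levels: the order of blocks, and the order of examples within each decompressed block. The file format, block size, and single-threaded reader below are my own simplifications; the paper uses multi-threaded deserialization.

```python
import pickle
import random
import zlib

def write_blocks(examples, path, block_size=4):
    """Write examples to disk as length-prefixed, zlib-compressed blocks."""
    with open(path, "wb") as f:
        for start in range(0, len(examples), block_size):
            blob = zlib.compress(pickle.dumps(examples[start:start + block_size]))
            f.write(len(blob).to_bytes(4, "little"))
            f.write(blob)

def read_pseudo_shuffled(path, seed=0):
    """Yield examples in pseudo-random order with one block in memory at a time."""
    rng = random.Random(seed)
    offsets = []
    with open(path, "rb") as f:
        # index pass: record (offset, length) of every block without decompressing
        while header := f.read(4):
            length = int.from_bytes(header, "little")
            offsets.append((f.tell(), length))
            f.seek(length, 1)
        rng.shuffle(offsets)                # level 1: shuffle block order
        for off, length in offsets:
            f.seek(off)
            block = pickle.loads(zlib.decompress(f.read(length)))
            rng.shuffle(block)              # level 2: shuffle within the block
            yield from block

write_blocks(list(range(20)), "train.bin")
order = list(read_pseudo_shuffled("train.bin"))
print(order)  # all 20 examples, in a pseudo-random order
```

Memory stays bounded by one decompressed block, yet the resulting example order is random enough at both the block and example level to behave like a true shuffle for training purposes.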

See the full paper for more details :).

Side notes
[*] Cache efficiency is the key, as I mentioned in a previous blog post.

Tuesday, March 10, 2015

Metrics revisited

Machine learning researchers and practitioners often evaluate with one metric on the test set while optimizing a different metric on the train set. Consider the traditional binary classification problem, for instance. We typically use AUC on the test set to measure the quality of an algorithm, while using another loss function, e.g. logistic loss or hinge loss, on the train set.

Why is that? The common explanation is that AUC is not easily trainable. Computing AUC requires batch processing, since there is no such thing as per-example AUC. And even in batch training, we just don't use it as a loss function [1].
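To see why AUC resists per-example training, here is a minimal batch computation of it (ties ignored for brevity). AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so it needs the ranks of all scores at once:

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs scored correctly."""
    # sort example indices by score, ascending
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    # sum the 1-based ranks of the positive examples
    pos_rank_sum = sum(rank + 1 for rank, i in enumerate(ranked) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Mann-Whitney U statistic, normalized to [0, 1]
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75: one of four pos/neg pairs misordered
```

A single example contributes nothing on its own; its effect on AUC depends entirely on where it ranks among all the others, which is exactly why AUC doesn't decompose into per-example losses.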

I want to ask a deeper question. Why is AUC a good metric in the first place? It's not the metric that business people care about. Why don't we use the true business loss, which can be factored into loss due to false positives and loss due to false negatives, for testing a machine learning algorithm, or even for training it?
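For concreteness, a direct "business loss" of this kind might look like the sketch below: each false positive costs c_fp and each false negative costs c_fn. The specific cost values and scores here are placeholders of mine; in practice the costs come from the business.

```python
def business_loss(scores, labels, threshold, c_fp=1.0, c_fn=5.0):
    """Total cost of thresholded predictions: c_fp per FP, c_fn per FN."""
    loss = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            loss += c_fp   # false positive
        elif pred == 0 and y == 1:
            loss += c_fn   # false negative
    return loss

scores = [0.9, 0.6, 0.4, 0.1]
labels = [1, 0, 1, 0]
print(business_loss(scores, labels, threshold=0.5))  # one FP + one FN -> 1.0 + 5.0 = 6.0
```

Unlike AUC, this loss is tied to a specific threshold, which is exactly the issue discussed next.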

The major reason that AUC is favored as a proxy for business loss is that it is independent of the classification threshold. Why are we scared of the threshold? Why do we need to set a threshold in order to use a classifier model? Isn't it anti machine-learning that humans have to manually set the threshold?

So I'd like to propose that we stop treating the threshold as a parameter to tune, and instead make it another parameter to learn. Here are a few challenges in doing so:
  • If we are using a linear model, adding a threshold parameter makes the overall model nonlinear; strictly speaking, we no longer have a linear model.
  • The threshold parameter needs to stay between 0 and 1. We can eliminate this constraint by reparameterizing: apply a logistic function to an unconstrained threshold variable.
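As an illustration of the proposal (my own sketch, not a published recipe), here is a tiny model that learns the threshold jointly with the linear weights. The threshold is sigmoid(theta), so it stays in (0, 1) while theta itself is unconstrained, and a sharp sigmoid serves as a smooth surrogate for the hard thresholded decision:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=300):
    """Jointly learn linear weights (w, b) and a threshold sigmoid(theta)."""
    w, b, theta = 0.1, 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)        # model score in (0, 1)
            t = sigmoid(theta)            # learned threshold in (0, 1)
            q = sigmoid(10.0 * (p - t))   # smooth surrogate of [p >= t]
            g = q - y                     # d(cross-entropy)/d(pre-sigmoid of q)
            # gradient flows into w, b, and theta alike
            w -= lr * g * 10.0 * p * (1 - p) * x
            b -= lr * g * 10.0 * p * (1 - p)
            theta -= lr * g * (-10.0) * t * (1 - t)
    return w, b, theta

xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w, b, theta = train(xs, ys)
threshold = sigmoid(theta)
preds = [1 if sigmoid(w * x + b) >= threshold else 0 for x in xs]
print(preds, round(threshold, 2))
```

The constant 10.0 controls how sharply the surrogate approximates the hard threshold; it is an arbitrary choice for this toy example, and making it too large flattens the gradient, which echoes the training difficulty discussed above.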
ML research has always been challenging, so adding another layer of complexity shouldn't be an issue. To me, not modeling the business problem directly is the bigger issue.

Side notes
[1] Computing AUC is costly, and worse, AUC is piecewise constant in the model scores, so its gradient is zero almost everywhere, which makes it unusable for gradient-based training.