Saturday, March 21, 2015

Machine Learning at Uber

I've just joined Uber, and we are hiring engineers and scientists for both full-time and internship positions. See the roles and their brief qualifications below, and feel free to contact me (tran@uber.com) for more information.

Machine Learning Software Engineer

  • Strong coding and debugging skills in C++
  • Fast algorithms
  • High performance computing
A machine learning background is a plus but not required!

Machine Learning Scientist

Doing active research in areas of large-scale optimization, deep learning, fast algorithms, auto-ML, etc.

Data Scientist

  • Math and Stats modeling
  • Broad understanding of machine learning or numerical algorithms
  • Scientific programming in any language

Tuesday, March 10, 2015

Metrics revisited


Machine learning practitioners often evaluate with one metric on the test set while optimizing a different one during training. Consider the traditional binary classification problem, for instance. We typically use AUC to measure how good an algorithm is on the test set while training with another loss function, e.g. logistic loss or hinge loss.

Why is that? The common explanation is that AUC is not easily trainable. Computing AUC requires a batch of examples, as there is no such thing as per-example AUC. And even in batch training, we simply don't use it as a loss function [1].

I want to ask a deeper question. Why is AUC a good metric in the first place? It's not the metric that business people care about. Why don't we use the true business loss, which can be factored into loss due to false positives and loss due to false negatives, for testing a machine learning algorithm, or even for training it?

There are two major reasons that AUC is popular as a proxy for business loss.
  1. AUC is agnostic of the classification threshold.
  2. Most academic datasets that ML researchers use for benchmarking don't have a business cost associated with each false positive or negative.

Due to (2), it makes sense to use ROC or PR AUC as a comparison metric in academic papers. In practice, however, it shouldn't be, because practitioners should know the cost associated with each false positive/negative prediction.

Regarding (1), why are we scared of the threshold? Why do we need to set a threshold in order to use a classifier? Isn't it contrary to the spirit of machine learning that humans have to set the threshold by hand?
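To make this concrete, here's a rough Python sketch of what evaluating against the business cost, and letting the data pick the threshold, might look like. The per-error costs, data layout, and names below are made up for illustration; in a real project the costs come from the business.

import numpy as np

# Hypothetical per-error costs: say a false positive costs $1 and a false
# negative costs $5.
COST_FP, COST_FN = 1.0, 5.0

def business_cost(y_true, y_score, threshold):
    # y_true: boolean labels; y_score: predicted scores (higher = more likely positive).
    # Total cost of the thresholded predictions under the costs above.
    y_pred = y_score >= threshold
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return COST_FP * fp + COST_FN * fn

def best_threshold(y_true, y_score):
    # Pick the threshold that minimizes the business cost on held-out data,
    # instead of asking a human to set it by hand.
    candidates = np.unique(y_score)
    costs = [business_cost(y_true, y_score, t) for t in candidates]
    return candidates[int(np.argmin(costs))]

Once the costs are known, the "scary" threshold stops being a knob someone has to eyeball: it falls out of the same held-out data we already use for validation.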

Side notes
[1] Computing AUC is costly and computing the AUC function gradient is even costlier.

Friday, February 27, 2015

Scaling Up Stochastic Dual Coordinate Ascent

That's the title of our recently submitted paper. Here's the abstract:
Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is mostly implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.
There are two main ideas in this paper:
  1. A semi-asynchronous parallel SDCA algorithm that guarantees strong (linear) convergence and scales almost linearly with the number of cores on large, sparse datasets.
  2. A binary data loader that can serve random examples out of memory, off a compressed data file on disk. This allows us to train on very large datasets with minimal memory usage while achieving a fast convergence rate (thanks to the pseudo-shuffling). For smaller datasets, we showed that this *out-of-memory* training approach can even be more efficient than standard in-memory training [*].
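For readers who haven't seen SDCA before, here is a minimal, single-threaded Python sketch of the coordinate update that the asynchronous version in the paper parallelizes. The hinge-loss choice and all the names are mine, purely for illustration; see the paper for the actual algorithm and its analysis.

import numpy as np

def sdca_hinge(X, y, lam=1e-3, epochs=10, seed=0):
    # Minimal sequential SDCA for an L2-regularized linear SVM (hinge loss).
    # Dual variables alpha_i live in [0, 1]; the primal weights are kept in
    # sync as w = (1 / (lam * n)) * sum_i alpha_i * y_i * x_i.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):          # random coordinate order
            xi, yi = X[i], y[i]
            sq_norm = xi @ xi
            if sq_norm == 0.0:
                continue
            # Closed-form maximization of the dual objective in coordinate i.
            residual = 1.0 - yi * (w @ xi)
            delta = np.clip(alpha[i] + residual * lam * n / sq_norm, 0.0, 1.0) - alpha[i]
            alpha[i] += delta
            w += (delta * yi / (lam * n)) * xi   # keep the primal in sync
    return w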
Note that the second idea is not restricted to SDCA or even to linear learning. In fact, we originally implemented this binary data loader for training large neural networks. However, it couples nicely with SDCA, since SDCA's real strength is on very large, sparse datasets, which is exactly where the need for out-of-memory training arises.
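Here is also a toy sketch of the pseudo-shuffling idea behind the loader: visit fixed-size blocks of the file in random order and shuffle within each block, so only one block needs to be resident at a time. This is my own simplification; the actual loader achieves its pseudo-randomness via multi-threaded deserialization of block-compressed data.

import numpy as np

def pseudo_shuffled_indices(n_examples, block_size, seed=0):
    # Yield example indices in a pseudo-random order: blocks are visited in
    # random order, and examples are shuffled within each block. Memory usage
    # is bounded by one block rather than the whole dataset.
    rng = np.random.default_rng(seed)
    n_blocks = (n_examples + block_size - 1) // block_size
    for b in rng.permutation(n_blocks):
        start = b * block_size
        block = np.arange(start, min(start + block_size, n_examples))
        rng.shuffle(block)
        yield from block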

See the full paper for more details :).

Side notes
[*] Cache efficiency is the key, as I mentioned in a previous blog post.

Saturday, December 20, 2014

Is it really due to Neural Networks?

Deep Learning is hot. It has been achieving state-of-the-art results on many hard machine learning problems. So it's natural that many people study and scrutinize it. There have been a couple of papers in the vein of "Intriguing properties of neural networks". Here's a recent one, "Deep Neural Networks are Easily Fooled", which was actively discussed on Hacker News. Isn't it interesting?

The common theme among these papers is that DNNs are unstable, in the sense that
  • Changing an image (e.g. of a lion) in a way imperceptible to humans can cause the DNN to label it as something else entirely (e.g. mislabeling a lion as a library),
  • Or the DNN gives high-confidence (99%) predictions for images that are unrecognizable to humans.
I have experienced this phenomenon myself, even before these papers came out. What surprised me, however, is that many people are too quick to claim that these properties are inherent to neural networks. Training a neural network, or any learning system, requires tuning many hyper-parameters. Just because you don't use the right set of hyper-parameters doesn't mean the method fails.

Anyway, my experience suggests that the aforementioned instability is largely due to the choice of output function, in particular Softmax. When training a multi-class neural network, the most common output function is Softmax [1]. A less common approach is to use a Sigmoid for each output node [2].

In practice, they perform about the same in terms of test error. Softmax may perform better in some cases and vice versa, but I haven't seen much difference. To illustrate, I trained a neural net on MNIST using both Sigmoid and Softmax output functions. MNIST is a 10-class problem; the network architecture and other hyper-parameters are identical in both runs. As you can see below, using Softmax doesn't really give any advantage.
[Figure: MNIST results with Sigmoid vs. Softmax output functions]
However, a predictor trained with Softmax tends to be overconfident, because the raw outputs are normalized exponentially. Evaluating on the test set doesn't expose this problem, since every test example clearly belongs to one of the trained classes. In practice, though, that may not be the case: there are many examples where the predictor should be able to say "I'm not confident about my prediction" [3]. Here's an illustration: I applied the MNIST models trained above to a weird example that doesn't look like any digit. The model trained with Sigmoid outputs indicates that it doesn't belong to any of the trained digit classes; the model trained with Softmax is very confident that it's a 9.
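To see the normalization effect in isolation, here is a tiny numerical illustration (the scores below are made up, not taken from the MNIST models above). Every class gets a low raw score, with one class slightly less low than the rest.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs of a 10-class digit model on a weird, non-digit
# input: every class scores low, but class "9" scores a bit less low.
scores = np.full(10, -8.0)
scores[9] = -1.0

print(np.round(sigmoid(scores), 3))  # per-class Sigmoid: all outputs are small;
                                     # class 9 only reaches ~0.27 -> "none of the above"
print(np.round(softmax(scores), 3))  # Softmax: normalization inflates class 9 to ~0.99

The per-class Sigmoid is free to say "none of the above", while Softmax has to hand the full probability mass to someone.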



Side notes
[1] It's commonly said that Softmax is a generalization of the logistic function (i.e. Sigmoid). This is not true. For example, 2-class Softmax Regression isn't the same as Logistic Regression. In general, they behave very differently. The only properties they share are that they are smooth and that their class outputs sum up to 1.
[2] Some people say that, for multi-class classification, Softmax gives a probabilistic interpretation while per-class Sigmoid does not. I strongly disagree. Both have probabilistic interpretations, depending on how you view them. (And you can always normalize the Sigmoid outputs proportionally if you want them to sum up to 1 and represent a probability distribution.)
[3] Predicting with uncertainty is very important in many applications. This is in fact the theme of Michael Jordan's keynote at ICML 2014.

Monday, December 1, 2014

Challenges in Machine Learning Practice - Part 1

In this series, I'd like to highlight a few challenges in developing and using a machine learning (ML) system in practice. This first post will focus on model serialization and how to consume a trained model.

The disconnect between training models and putting them into production


What’s the best practice to train and deploy a model?

Most of the major machine learning toolsets and platforms (e.g. SAS, R, Python libraries, Weka, RapidMiner, Hadoop, etc.) focus on providing algorithms and tools to help you analyze and visualize data, extract features, and train models. Once the data scientists are done playing with the data and have trained a good model, how to operationalize that trained model or workflow [1] remains a tricky question for most companies and groups.

What usually happens is one of the following.
  1. The whole data mining [2] pipeline is offline. 
    1. In some cases, companies just want to get insights from their landfill of data to help them make more informed decisions. That's it. They don't need to, or perhaps don't know how to, train and apply a predictive model.
    2. Online prediction is not too critical, and some data scientists are willing to run predictions offline on new batches of data. Examples include customer retention analysis, risk analysis, and fraud detection (many firms already do fraud detection online, but plenty still run it offline).
    In both cases, machine learning is only for internal use. I'm guessing that the machine learning software market is dominated by these needs, but it won't be for long. The future will rely more on predictive analytics.

  2. Or, after training and validating a model offline, the data scientists ask other software engineers to implement the model decoder, typically in a more efficient language (C++/C#/Java), which can then be deployed as a live service.
(2) is what I want to talk about. There are many problems with this approach.
  1. There is an educational gap between the data scientists (or statisticians) and the production engineers: the data scientists need to explain the model to the engineers. Depending on the complexity of the model and other factors, this process can be long and error-prone.
  2. Retraining and updating the model is also a pain. After a few months, the data scientists will likely come up with better models, either by retraining with more data or through more fundamental changes. Now the engineers need to update the model. If the change is fundamental, such as switching from an SVM to a neural network, the model decoder changes significantly.

  3. The process of improving the models is, and should be, endless, if only because of periodic retraining.
If you multiply the cost factors above by the number of projects that use machine learning, the waste of resources is far-reaching. It typically takes months and multiple engineers to test and deploy an already-trained machine learning model.

Undoubtedly, there are a few solutions out there, but they are far from ideal.
  1. Large IT companies (Microsoft, Google, Facebook, etc.) may have the resources to build their own infrastructure for serializing and consuming models automatically. However, their solutions are (i) internal, (ii) fragmented even within those companies, and (iii) customized for their own data sources/formats and their current learning algorithms.
  2. Predictive Model Markup Language (PMML) is the most established and common format for serializing machine learning models. Many popular packages (such as KNIME, RapidMiner, R, SAS, etc.) [3] support producing models in this format. However, PMML has a few disadvantages:
    1. As an old standard, it evolves very slowly. For example, it doesn't support modern methods such as convolutional neural networks.
    2. Software that can consume PMML models (e.g. SAS or IBM SPSS Modeler) tends to be expensive and not optimized for online usage.
  3. Train a model in a data-friendly environment (let's say Python) and consume the model through that same environment [3]. This approach, while quick to get something working, also incurs major technical debt that you have to pay later, chiefly in speed and scalability (in terms of both computation and software maintenance).
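As a concrete, deliberately minimal example of approach 3, here is what the train-then-serve pattern often looks like with scikit-learn and joblib. The dataset, file name, and "serving" side below are placeholders; the point is that both sides are welded to the same Python environment.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
import joblib

# Training side: fit a model and persist it to disk.
X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "digits_lr.joblib")

# Serving side (a separate process or service): reload and predict.
served = joblib.load("digits_lr.joblib")
print(served.predict(X[:5]))

It is quick to get working, but the serialized artifact is only meaningful to a compatible Python/scikit-learn stack, which is exactly where the speed and maintenance debt comes from.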
In summary, producing and consuming machine learning models is still painful for a large number of machine learning practitioners. The current market is very fragmented and still at an early stage. Software that pays attention to producing and consuming models tends to lack other important components needed for wide adoption. I'll highlight a few of those components in the next posts.

Notes
[1] It should be noted that almost every predictive model is a workflow, not just an ML-trained sub-model.
[2] Sometimes I use “data mining” to emphasize the practice of mining data just for the sake of understanding but not for making predictions on new examples.
[3] This site provides a comprehensive list of software that can produce and/or consume PMML models.

Tuesday, August 5, 2014

The importance of cache efficiency in SGD optimization



Recently, some members of our team have experimented with variants of SGD (Stochastic Gradient Descent) and SDCA (Stochastic Dual Coordinate Ascent) on large, sparse datasets (such as the KDD Cup 2010 dataset [*]). Note that we've focused only on linear models (e.g. logistic regression and SVM), which lead to convex optimization problems.

What we found during the process is very interesting from an engineering standpoint, yet not covered in any academic papers: in SGD and SDCA, each weight update is typically so fast [**] that cache efficiency can become the limiting factor. Put another way, we observed that if the data is randomly shuffled in memory, the algorithm suddenly becomes ~3x slower (see the table below).
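Here is a small, self-contained sketch (not our actual benchmark) of the access-pattern effect: the same toy SGD sweep over the same rows, once in storage order and once in a random order. In pure Python the interpreter overhead hides part of the gap, and the slowdown is much more dramatic in a compiled learner, but the trend is the same.

import numpy as np
import time

n, d = 1_000_000, 50
X = np.random.rand(n, d).astype(np.float32)
y = np.sign(np.random.rand(n).astype(np.float32) - 0.5)

def sgd_sweep(order, lr=1e-3):
    # One pass of a toy linear SGD (least-mean-squares) update,
    # visiting rows in the given order.
    w = np.zeros(d, dtype=np.float32)
    for i in order:
        xi = X[i]
        w += lr * (y[i] - xi @ w) * xi
    return w

for name, order in [("sequential", np.arange(n)),
                    ("shuffled", np.random.permutation(n))]:
    t0 = time.perf_counter()
    sgd_sweep(order)
    print(f"{name:10s} {time.perf_counter() - t0:.2f}s")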