Tuesday, August 5, 2014

The importance of cache efficiency in SGD optimization



Recently, some people on our team have experimented with variants of SGD (Stochastic Gradient Descent) and SDCA (Stochastic Dual Coordinate Ascent) on large and very sparse datasets (such as the KDD Cup 2010 dataset [*]). Note that we've focused only on linear models, Logistic Regression and SVM, which lead to convex optimization problems.

What we found in the process is very interesting from an engineering standpoint, yet not covered in any academic paper we know of. That is: in SGD and SDCA, each weight update is typically so fast [**] that cache efficiency can become the dominant factor to optimize for. Put another way: we observed that if the data are randomly shuffled (even just once before training), training suddenly becomes ~3x slower (see the table below).
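To see why each update is so cheap, here is a minimal sketch in Python/NumPy (purely illustrative; our actual implementation is different) of an SGD step for logistic regression on a sparse example. The arithmetic per example is just a handful of multiply-adds over the non-zero features, so how the examples and weights are laid out and traversed in memory can matter more than the math itself.

```python
import numpy as np

def sgd_step(w, idx, val, y, lr=0.1):
    """One SGD update for logistic regression on a single sparse example.
    idx/val are the non-zero feature ids and values of the example; y is +1/-1.
    The work is O(nnz): one sparse dot product and one scaled sparse add."""
    margin = y * val.dot(w[idx])        # sparse dot product
    g = -y / (1.0 + np.exp(margin))     # d(logistic loss)/d(margin)
    w[idx] -= lr * g * val              # touch only the non-zero coordinates
    return w

def train_epoch(w, X, y, lr=0.1):
    """One pass over a scipy.sparse CSR matrix X. Visiting rows in storage
    order reads X.indices / X.data sequentially; visiting them in a shuffled
    order does not, which is where the cache penalty shows up."""
    for i in range(X.shape[0]):
        s, e = X.indptr[i], X.indptr[i + 1]
        sgd_step(w, X.indices[s:e], X.data[s:e], y[i], lr)
    return w
```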

Monday, July 14, 2014

ICML 2014 Highlights 2: On Deep Learning and Language Modeling


Previously: Highlights #1 - On ML Fundamentals

Deep Learning and Language Modeling

Image classification seems to be a thing of the past. The current wave of DL research is all about language modeling. Here are some interesting works on this front.

Friday, July 11, 2014

ICML 2014 Highlights 1: On Machine Learning Fundamentals


Abstract

At a high level, Deep Learning (DL) is still hot and DL keeps eating Machine Learning. The conference's attendance distribution was like: half was there for Deep Learning and the other half was there for *Shallow* Learning :). Interestingly, the conference took place in Beijing for the first time, and more than 50% of the attendees either study or work there (and most of that local population are students). So the attendance distribution could be biased.

In the following, I'll highlight what I've learned and observed from the conference. Here's the outline:

Tuesday, July 1, 2014

On the imminent death of MapReduce

Google recently announced at Google I/O 2014 that they are retiring MapReduce (MR) in favor of a new system called Cloud Dataflow. Well, the article's author perhaps dramatized it when quoting Urs Hölzle's words:
We don’t really use MapReduce anymore. 
You can watch the keynote here for better context. My guess is that no one at Google is writing new MapReduce jobs anymore, but Google will keep running legacy MR jobs for years until they are all replaced or obsolete.

Regardless of what has happened at Google, I'd like to point out that MR should have been ditched long ago.

Someone at Cloudera (the company that used to make money on the hype of Hadoop MapReduce) already partially explained why in this blog post: The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google. Some quotes to remember are:
  • Indeed, it’s a bit of a surprise to me that it lasted this long.
  • and the real contribution from Google in this area was arguably GFS, not Map-Reduce.
Every serious distributed machine learning (ML) researcher/engineer knows that MR is bad for ML [*]. ML algorithms are iterative, and MR is not suited for iterative algorithms, mainly due to unnecessarily frequent I/O and job-scheduling overhead, among other factors (see the illustration below). For more details on the weaknesses of MR, one can read any set of intro slides about Spark [**].
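To make that per-iteration overhead concrete, here is a deliberately simplified sketch in plain Python (a local file stands in for HDFS, the map and reduce phases are collapsed into one function, and all names are hypothetical) of what batch gradient descent looks like when every iteration has to be a fresh MapReduce-style job over on-disk data:

```python
import json

def one_mr_style_iteration(data_path, weights):
    """One 'job': re-read the entire training set from disk, map each example
    to a partial gradient, reduce by summing. On a real cluster this would
    also pay job-scheduling and shuffle costs on every single iteration."""
    gradient = [0.0] * len(weights)
    with open(data_path) as f:                 # map: re-read everything
        for line in f:
            x, y = json.loads(line)            # x: feature list, y: target
            pred = sum(wi * xi for wi, xi in zip(weights, x))
            for j, xj in enumerate(x):         # reduce: accumulate gradients
                gradient[j] += (pred - y) * xj
    return gradient

def gradient_descent_over_mr(data_path, dim, iterations=10, lr=0.1):
    weights = [0.0] * dim
    for _ in range(iterations):
        # The dataset is re-read from disk on every pass, and in a real MR
        # setup the updated model would also round-trip through HDFS.
        g = one_mr_style_iteration(data_path, weights)
        weights = [w - lr * gj for w, gj in zip(weights, g)]
    return weights
```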


Also note that Mahout, the ML library for Hadoop, recently said goodbye to MapReduce.

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. 

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
Notes
[*] Unfortunately, lots of companies, including my employer, are still chasing the Hadoop game. Less than a year ago, Microsoft announced HDInsight, a.k.a. Hadoop on Azure.
[**] For virtually everything that MR can do, Spark can do equally well and in most cases better. Also note that while Spark is generally fantastic, it is not necessarily the right distributed framework for every ML problem.

Monday, June 30, 2014

ICML 2014 Best Paper Awards


It's strange that the best paper awards are not posted on the ICML website. So I'll post them here:

Disclaimer: I have read none of the above papers. I have a different set of interesting ones but those above are the official best papers. 

I'll share what I find interesting in another Highlights of ICML 2014 post.

Wednesday, March 26, 2014

What kind of coding skills are required to work on machine learning?

Answered on Quora


(Image src: Inside BigData)

In our small team of 13 people, who all work on ML, the required coding skills range from
  • None (or simple git pull and build). Such a person only needs to run experiments and write technical docs. (Revised: perhaps very little coding, to demonstrate how to use the API.)
  • to decent numerical computing in MATLAB/Python/R. Such a person runs and tweaks experiments on real problems for customers. Knowing at least one of those scripting languages is required so that they can do custom feature engineering or visualization tasks that are not supported by the main tool that we build.
  • to good C# or F# + great software design + various levels of numerical computing. Such a person contributes to the main code base.
  • to hardcore low-level programming. Such a person is obsessed with latency/throughput, BLAS, SSE/AVX, GPUs, and distributed systems.

Wednesday, December 11, 2013

Highlights of NIPS 2013


Overall

  1. Deep Learning (also Deep Neural Networks, or DNN) is again the most trendy topic of the conference. Its workshop session was perhaps twice (or more) as big as last year's, and it was packed for most of the day. Interestingly, Mark Zuckerberg of Facebook stopped by for a Q&A and then a panel discussion. His visit was mostly to announce Facebook's new AI Lab. For technical highlights, see below.
  2. Distributed machine learning is another topic of huge interest.
  3. Growing markets and interests in predictive analytics on sensor data (e.g. activity detection on mobile phones or wearable devices), in ML for Health Care, and in ML for Education.
  4. And there is certainly badass research in other areas that I have missed. Among the topics of my interest, however, Optimization (particularly non-convex optimization) hasn't made much progress.

Deep Neural Nets

  • Natural Language Processing. Application of DNN in NLP was the theme of deep learning research this year. This is natural because NLP is the holy grail of machine learning research, and DNN has already convincingly demonstrated its power in Computer Vision and Speech Recognition. There was some cool research using DNN in NLP, such as Compositional Natural Language Parsing with Compositional Vector Grammars by a team at Stanford (led by Richard Socher) and the Word2Vec project at Google (led by Tomas Mikolov). I think this is just the beginning, though.
  • Computer Vision. A new benchmark on ImageNet was established by Matt Zeiler et al., although it's not a big improvement over the previous record set by Alex Krizhevsky et al. (see their famous paper). Although Matt's work received a lot of attention from the community (come on, he set a new record), I was slightly disappointed. The spirit of his paper is about understanding convolutional neural networks, but he did not explain why his network (which is a customized version of Alex's network) yields better results. He also wasn't able to rigorously explain certain mysteries in training neural networks (such as why rectified linear units work so well).
  • Non-convex Optimization. This is the topic I care about the most in DNN, because solving a DNN is a non-convex optimization problem and the current techniques only find a local minimum, at best. Here's an experiment that my team did for activity detection on mobile devices using sensor data. After extracting features with PCA and feeding them into a neural network, we got very good results. We then simulated the PCA feature extraction by introducing an extra layer into the neural network. We expected the optimized weights of that first layer to end up equivalent to the PCA weights, if not better. However, we got worse results. This indicates that the optimizer converged to a not-so-good local optimum.
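As a rough illustration of the two setups being compared (a PyTorch-style sketch with made-up dimensions; not the tooling or data we actually used):

```python
import torch.nn as nn

D, K, H, C = 500, 50, 100, 6   # hypothetical: input dim, PCA dim, hidden units, classes

# Setup A: PCA (fit separately, e.g. with scikit-learn) projects D -> K,
# and a small network is trained on the K-dimensional features.
net_on_pca_features = nn.Sequential(
    nn.Linear(K, H), nn.ReLU(),
    nn.Linear(H, C),
)

# Setup B: drop the PCA step and let the network learn its own D -> K linear
# projection as an extra first layer (no activation, playing the role of PCA).
end_to_end_net = nn.Sequential(
    nn.Linear(D, K, bias=False),
    nn.Linear(K, H), nn.ReLU(),
    nn.Linear(H, C),
)

# In principle, training Setup B end-to-end could recover a projection at
# least as good as PCA; in our runs it did worse, suggesting the optimizer
# settled into a poor local optimum rather than finding that projection.
```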

  • From a scientific standpoint, the sexy part about DNN is that it can model very complicated machine learning tasks without much feature engineering. However, because the underlying optimization problem is non-convex, training DNNs still requires a significant amount of engineering. Until there's a breakthrough in global optimization techniques for non-convex problems, or some convex reformulation of DNN, DNN will be just another periodic trend.

Distributed Machine Learning

There were many interesting posters/talks on the topic of distributed and large-scale machine learning at NIPS this year (such as this, this, and this workshop). However, what really excited me is not the work presented at NIPS but Spark, an Apache project built on top of HDFS for running iterative computations in a distributed fashion. (It's a shame that I hadn't been aware of this project earlier.)

The cool thing about Spark is that it inherits the good parts of Hadoop (i.e. HDFS) and re-engineers the rest. In particular:
  1. Iterative algorithms, which are the norm in ML, can keep their working data in memory for their whole lifetime. No more reading from / writing to disk on every iteration (see the sketch following this list).
  2. Support for more operators, beyond mapping and reducing, which can be chained and lazily evaluated. See more here. In a sense, Spark is similar to Parallel LINQ, but over HDFS data.
(These are the main reasons why serious machine learning practitioners don't use Hadoop beyond HDFS. Yes, stay away from the Hadoop hype. HDFS is cool. The rest of Hadoop, not so much.)
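For a flavor of what this means in practice, here is a minimal PySpark sketch of logistic-regression-style gradient descent (the HDFS path, data format, and sizes are made up): the dataset is read once, cached in cluster memory, and every subsequent iteration is just a map + reduce over the cached partitions.

```python
import numpy as np
from pyspark import SparkContext

NUM_FEATURES, ITERATIONS = 100, 20     # hypothetical sizes
sc = SparkContext(appName="iterative-gd-sketch")

def parse(line):
    # assumed format: label (+1/-1) followed by feature values, space-separated
    v = np.array(line.split(), dtype=float)
    return v[0], v[1:]

# Read from HDFS once and cache; later iterations reuse the in-memory
# partitions instead of re-reading the files on every pass.
points = sc.textFile("hdfs:///data/train.txt").map(parse).cache()

w = np.zeros(NUM_FEATURES)
for _ in range(ITERATIONS):
    # Each iteration: map every cached point to its logistic-loss gradient,
    # then reduce by summing the partial gradients across the cluster.
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * p[1].dot(w))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad
```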

The not-so-cool thing is that the whole Hadoop technology stack is in Java. Anyone interested in building Spark .NET? If there are enough interested developers, I would join or initiate such a project.

Machine Learning on Sensor Data

To be (never) updated ...