Saturday, December 20, 2014

Is it really due to Neural Networks?

Deep Learning is hot. It has been achieving state-of-the-art results on many hard machine learning problems. So it's natural that many study and scrutinize it. There have been a couple of papers in the series of intriguing properties of neural networks. Here's a recent one, so-called Deep Neural Networks are easily fooled, that was discussed actively on Hacker News. Isn't it interesting?

The common theme among these papers is that DNN is unstable, in the sense that
  • Changing an image (e.g. of a lion) in a way imperceptible to humans can cause DNN to label the image as something else entirely (e.g. mislabeling a lion as library)
  • Or DNN gives high confidence (99%) predictions for images that are unrecognizable by humans
I myself have experienced this phenomenon too, even long before these papers came out. However, what surprised me is many people are too quick to claim that these properties are due to neural networks. Training a neural network, or any system, requires tuning many hyper parameters. Just because you don't use the right set of hyper-parameters doesn't mean that the method fails.

Anyway, my experience shows that the aforementioned instability is likely due to the use of output function, in particular: Softmax. In training a multi-class neural network, the most common output function is Softmax [1]. The lesser common approach is just using Sigmoid for each output node [2].

In practice, they perform about the same in terms of testing error. Softmax may perform better in some cases and vice versa but I haven't seen much difference. To illustrate, I trained a neural net on MNIST data using both Sigmoid and Softmax output functions. MNIST is a 10-class problem. The network architecture and other hyper-parameters are the same. As you can see, using Softmax doesn't really give any advantage.
Sigmoid Softmax
However, if you train a predictor using Softmax, it tends to be too confident because the raw outputs are normalized in an exponential way. Testing on test data doesn't expose this problem because in test data, every example clearly belongs to one of the trained classes. However, in practice, it may not be the case: there are many examples where the predictor should be able to say I'm not confident about my predictions [3]. Here's an illustration: I applied the MNIST models trained above on a weird example, which doesn't look like any digit. The model trained using Sigmoid outputs that it doesn't belong to any of the trained digit classes. The model trained using Softmax is very confident that it's a 9. 

Side notes
[1] It's commonly said that Softmax is a generalization of logistic function (i.e. Sigmoid). This is not true. For example, 2-class Softmax Regression isn't the same as Logistic Regression. In general, they behave very differently. The only common property is that they are smooth and the class-values using these functions sum up to 1.
[2] Some people say that for multiclass classification, Softmax gives a probabilistic interpretation while Sigmoid (per-class) does not. I strongly disagree. Both have probabilistic interpretations, depending on how you view them. (And you can always proportionally normalize the Sigmoid outputs if you care about them summing up to 1 to represent probabilistic distribution.)
[3] Predicting with uncertainty is very important in many applications. This is in fact the theme of Michael Jordan's keynote at ICML 2014.

Wednesday, December 17, 2014

Friday, August 1, 2014

The difference between L1 and L2 regularization

The difference between \(L_1\) and \(L_2\) regularization is well studied in ML (and you can find lots of reference on the internet). However, the most common explanation is perhaps via this figure.
While it's true (and seemingly intuitive at first), this figure may not be as easy to digest. The figure corresponds to an equivalent constrained optimization problem, but why we need to deal with constrained optimization problem is not clear [1]. In fact, someone asked this question on Quora: What's a good way to provide intuition as to why the lasso (L1 regularization) results in sparse weight vectors?, which motivated me to provide a simpler explanation [2].

Monday, July 14, 2014

ICML 2014 Highlights 2: On Deep Learning and Language Modeling

Previously: Highlights #1 - On ML Fundamentals

Deep Learning and Language Modeling

Images classification seems to be the past. The current wave of DL research is all about language modeling. Here’re some interesting works on this front.

Friday, July 11, 2014

ICML 2014 Highlights 1: On Machine Learning Fundamentals


At a high level, Deep Learning (DL) is still hot and DL keeps eating Machine Learning. The conference's attendance distribution was like: half was there for Deep Learning and the other half was there for *Shallow* Learning :). Interestingly, the conference took place in Beijing, for the first time, and more than 50% of the attendants either study or work there (and most of that local population are students). So the attendance distribution could be biased.

In the following, I'll highlight what I've learned and observed from the conference. Here's the outline:

Tuesday, July 1, 2014

On the imminent decline of MapReduce

Google recently announced at Google IO 2014 that they are retiring MapReduce (MR) in favor of a new system called Cloud Dataflow. Well, the article author perhaps dramatized it when quoting Urs Hölzle's words
We don’t really use MapReduce anymore. 
You can watch the keynote here for a better context. My guess is that no one is writing new MapReduce jobs anymore, but Google would keep running legacy MR jobs for years until they are all replaced or obsolete.

Regardless of what has happened at Google, I'd like the point out that MR should have been ditched long ago.

Someone at Cloudera (the company that used to make money on the hype of Hadoop MapReduce) already partially explained why in this blog post: The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google. Some quotes to remember are:
  • Indeed, it’s a bit of a surprise to me that it lasted this long.
  • and the real contribution from Google in this area was arguably GFS, not Map-Reduce.
Every real distributed machine learning (ML) researcher/engineer knows that MR is bad [*]. ML algorithms are iterative and MR is not suited for iterative algorithms, which is due to unnecessary frequent I/O and scheduling plus other factors (see the illustration below). For more details on the weaknesses of MR, one can read any intro slides about Spark [**].

Also note that Mahout, the ML library for Hadoop, recently said goodbye to MapReduce.

25 April 2014 - Goodbye MapReduce

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. 

We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
[*] Unfortunately, lots of companies, including my employer, are still chasing the Hadoop game. Microsoft just less than a year ago announced HDInsight, aka. Hadoop on Azure.
[**] For virtually everything that MR can do, Spark can do equally well and in most cases better. Also note that while Spark is generally fantastic, it is not necessarily the right distributed framework for every ML problem.

Wednesday, March 26, 2014

What kind of coding skills are required to work on machine learning?

(Image src: Inside BigData)

In our small team of 13 people, who all work on ML, the required coding skills range from
  • None (or simple git pull and build). Such person only needs to run experiments and write technical docs. (Revised: perhaps very little to demonstrate how to use the API.)
  • to decent numerical computing in MATLAB/Python/R. Such person runs and tweaks experiments on real problems for customers. Knowing at least one of those scripty languages is required so that they can do custom features engineering or visualization tasks that are not supported by the main tool that we build.
  • to good C# or F# + great software design + various level of numerical computing. Such person contributes to the main code base.
  • to hardcore low level programming. Such person is obsessed with latency/throughput, BLAS, SSE/AVX, GPU, and distributed systems.