Monday, December 10, 2012

Highlights of NIPS 2012

NIPS 2012 is the first conference in Machine Learning [1] that I've attended, and it was well worth the time and money invested. Here are my not-so-brief highlights of the academic content of the conference and workshops [3, 4].

  1. The workshop sessions were amazing, with many interesting topics and talks. They offered both breadth (the variety of the workshops) and depth, with plenty of time for interactive discussion.
  2. Deep Neural Networks (DNNs), not surprisingly, received a lot of buzz and attention [5].
  3. Besides DNNs, here are some major topics at NIPS this year.
    • Latent Variable Models. The truth is: this is not a topic that I'm yet familiar with. The good thing is: after this conference, I know that there is quite a lot of interest in this topic. Another good thing: there seems to be a lot of numerical mathematics in this research that I am actually familiar with. Anyway, this is a topic that I'll definitely investigate soon.
    • Kernel methods: this and this workshop
    • Graphical Models and Probabilistic Programming
    • ... and perhaps some other cool topics which I have missed.
Deep Neural Nets
(aka. Deep Learning, to a bunch)
  • Mathematical Foundation. There are plenty of reasons for the hype in deep networks, which I won't describe here. One of the major academic interests in DNNs at the conference is their theoretical foundation. Deep networks work so well against many benchmarks, but no one can give a solid explanation why. This lack of a mathematical/statistical foundation for deep networks (or neural networks in general) was questioned many times during the conference and the workshops. Fortunately for the field, as Yann LeCun mentioned in the panel discussion, it has recently attracted the attention of some well-known applied mathematicians (such as Stan Osher and Stephane Mallat).
  • Why it works well. To me, the fact that neural networks with many layers perform well can simply be explained by their ability to use many parameters. Basically, deep networks allow you to fit a training set, in a nonlinear way, with a bunch of parameters. Loosely speaking, the more parameters you use, the better you can fit the data.
  • Overfitting issue. Of course, when you use so many parameters, overfitting will be a concern, and it is true that deep networks are prone to overfitting. That's why there was a talk by G Hinton, the master of the art, about the dropout technique in neural networks, which targets the overfitting problem.
  • Fine tuning. It makes sense from the above that there is a lot of engineering (e.g. fine tuning) in designing a deep network. If we connect all the dots, there seems to be nothing particularly new or interesting about DNNs. The basic idea is: use a lot of parameters, if you have the computational resources to do so, and engineer the network so that overfitting can be reduced.
  • Distributed DNNs. Another hype about DNNs is the distributed work by Google. Along with the mathematical foundation research, distributed DNNs will be a major research direction.
    • Here is my fun hypothesis: if your method performs (slightly) better than the benchmarks, its hype will ironically be an increasing function of its complexity and the required computational resources (e.g. 16000 cores and a new parallelism architecture will generate a great deal of hype).
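The "more parameters, better fit" intuition and the overfitting concern above can be seen on a toy problem. What follows is only a minimal sketch, not anything presented at the conference: polynomial least squares stands in for a neural network, and the target function, noise level, sample sizes, and degrees are all assumptions chosen for illustration.

```python
import numpy as np


def truth(x):
    # Hypothetical ground-truth function used only for this illustration.
    return np.sin(np.pi * x)


def fit_errors(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train MSE, held-out MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)  # degree + 1 free parameters
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return float(train_mse), float(test_mse)


rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 10)    # only 10 noisy training points
x_test = np.linspace(-0.95, 0.95, 50)   # held-out points from the same range
y_train = truth(x_train) + 0.2 * rng.standard_normal(x_train.size)
y_test = truth(x_test) + 0.2 * rng.standard_normal(x_test.size)

# More parameters never increase the training error of a least-squares fit;
# with 10 parameters and 10 points the fit passes through every training point.
results = {d: fit_errors(d, x_train, y_train, x_test, y_test) for d in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, held-out MSE {test_mse:.4f}")
```

The training error can only go down as the degree grows, since each model contains the previous one. The held-out error is the number to watch: when it rises while the training error keeps falling, the extra parameters are fitting noise, which is the problem that techniques such as dropout are meant to reduce.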
Kernel Methods
Despite the hype about DNNs, kernel methods are still a relevant and powerful class of methods (with a strong mathematical foundation). The major focus of kernel methods research is now on low-rank approximation of the kernel matrix, which I'm also pursuing.
  • The fact that Leslie Greengard, a computational physicist well known for his co-invention of the Fast Multipole Method and who hasn't worked at all in machine learning, was invited to give a talk at NIPS 2012 is an indication of its importance.
    • Unfortunately, he had to cancel at the last minute due to injury, and his talk was replaced by ... well, a deep learning talk (by G Hinton).
  • Even in the absence of Leslie Greengard, there was still more than one talk (one by Francis Bach and one by Alex Smola) about low-rank approximation methods.
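To make "low-rank approximation of the kernel matrix" concrete, here is a minimal sketch of one well-known approach, the Nyström method with uniformly sampled landmark points. It is not necessarily what any of the talks mentioned above presented; the RBF kernel, the `gamma` value, the number of landmarks, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def nystrom_factors(X, m, gamma=0.5, seed=0):
    """Return (C, W_pinv) such that K is approximated by C @ W_pinv @ C.T."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)  # uniformly sampled landmarks
    C = rbf_kernel(X, X[idx], gamma)                     # n x m: all points vs landmarks
    W = C[idx]                                           # m x m: landmarks vs landmarks
    # Truncated pseudo-inverse; the cutoff guards against tiny eigenvalues of W.
    return C, np.linalg.pinv(W, rcond=1e-8)


rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))        # synthetic data, for illustration only
K = rbf_kernel(X, X)                     # full 200 x 200 kernel matrix
C, W_pinv = nystrom_factors(X, m=40)
K_approx = C @ W_pinv @ C.T              # rank at most 40

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"relative Frobenius error with 40 landmarks: {rel_err:.2e}")
```

The point of the factorization is that the full n x n matrix never has to be formed or stored: only the n x m block `C` and the m x m block are needed, so the cost scales with the number of landmarks rather than with n squared. The full matrix `K` is built here only to measure the error.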
Graphical Models and Probabilistic Programming
Disclaimer: I've only recently learned about graphical models, and here are my takes on this approach.
  • Strength: it is the most natural and most accurate way to infer the distribution of an unknown from the observed data, if you can construct the graph that represents the probabilistic relationship of the variables.
  • Weakness: this condition is the main weakness of the graphical model. You cannot just throw in features and data and hope to get a model back. You have to construct a graph of the features. In many problems where there are a lot of features (or variables), it's hard or even impossible to construct such a graph. In those problems, we don't even know how the variables correlate.
That said, graphical models are a powerful way of modeling if the number of variables is not too large and if we want to inject domain knowledge about the data. I think that there are even more problems of this type than very large scale problems.
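As a small illustration of what "constructing the graph" buys you, here is a minimal sketch of exact inference by enumeration in a three-variable Bayesian network. The network is the textbook rain/sprinkler/wet-grass example, not something from the conference, and every probability in it is a made-up value for illustration.

```python
from itertools import product

# Hypothetical network: Rain -> Sprinkler, and (Sprinkler, Rain) -> WetGrass.
# All numbers below are illustrative assumptions.
P_RAIN = {True: 0.2, False: 0.8}
# P(sprinkler | rain), keyed first by rain, then by sprinkler.
P_SPRINKLER = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(wet = True | sprinkler, rain), keyed by (sprinkler, rain).
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}


def joint(rain, sprinkler, wet):
    """Joint probability, factorized along the edges of the graph."""
    p_wet = P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * (p_wet if wet else 1.0 - p_wet)


def posterior_rain_given_wet():
    """P(rain | grass is wet), summing the joint over the unobserved sprinkler."""
    numer = sum(joint(True, s, True) for s in (True, False))
    denom = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numer / denom


print(f"P(rain | wet grass) = {posterior_rain_given_wet():.4f}")
```

The domain knowledge lives entirely in the graph structure and the conditional tables. Enumeration like this is exponential in the number of variables, which is one reason the approach fits small, well-understood problems better than problems with thousands of features.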

Unsupervised vs. Supervised Learning
  • Unsupervised feature learning. In his talk, Andrew Ng said that Deep Learning is sort of a misnomer. What people should care about is unsupervised feature learning, not deep this or deep that. I agree with him on this. I further think that unsupervised feature learning should be decoupled from deep networks (aka deep learning). 
    • It may well be that deep networks are currently the best technique for unsupervised feature learning. However, they are not the only technique or even the first one. For example, people have long used PCA (or Kernel PCA for nonlinear analysis).
    • Unsupervised feature learning is not the only application of deep networks. Some recent successes in deep networks are just straight supervised learning from the data (without the pre-training, i.e. unsupervised feature learning).
  • Unsupervised vs. Supervised Learning. This is a topic of frequent debate at the conference.
    • Arguments for supervised learning: you can get much more information from labeled data
    • Arguments for unsupervised feature learning: you can learn more from more data and the amount of unlabeled data is huge compared to labeled data
  • Semi-supervised learning. Both sides are valid and here's my take: we should do both unsupervised and supervised feature learning simultaneously, the so-called semi-supervised learning.
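Since PCA is mentioned above as an older unsupervised feature learning technique, here is a minimal sketch of it using an SVD of the centered data. No labels are involved at any point. The function name `pca_features` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def pca_features(X, k):
    """Project X onto its top-k principal directions.

    Returns (features, components, explained), where `explained` is the
    fraction of total variance captured by the k components.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
    return X_centered @ components.T, components, explained


# Synthetic 3-D data lying close to a single line, for illustration only.
rng = np.random.default_rng(0)
t = rng.standard_normal(300)
X = np.outer(t, [3.0, 2.0, 1.0]) + 0.1 * rng.standard_normal((300, 3))

features, components, explained = pca_features(X, k=1)
print(f"learned a {features.shape[1]}-D feature explaining {explained:.1%} of the variance")
```

The learned `features` could then be fed to any supervised method, which is the decoupling argued for above: the feature learning step and the model that consumes the features do not have to be the same deep network.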

[1] For those who haven't heard of NIPS, it's one of the main annual conferences (if not the main conference) in Machine Learning.
[3] Disclaimer: this analysis is through the lens of a computational mathematician turned convert, who has recently become very interested in machine learning.
[4] Not to mention that the whole conference was very well organized. Kudos to the organizers.
[5] Just in case you wonder why it is not surprising, see the media hype: here, here, or here.