Monday, December 10, 2012

Highlights of NIPS 2012

NIPS 2012 is the first Machine Learning conference [1] that I've attended, and it was well worth the time and money invested. Here are my not-so-brief highlights of the academic content of the conference and workshops [3, 4].

  1. The workshop sessions were amazing, with many interesting topics and talks. They offer both breadth (the variety of workshops) and depth, with plenty of time for interactive discussion.
  2. Deep Neural Networks (DNNs), not surprisingly, received a lot of buzz and attention [5].
  3. Besides DNNs, here are some other major topics at NIPS this year:
    • Latent Variable Models. The truth is: this is not a topic I'm familiar with yet. The good thing is: after this conference, I know there is quite a lot of interest in it. Another good thing: there seems to be a lot of numerical mathematics in this research that I actually am familiar with. Anyway, this is a topic I'll definitely investigate soon.
    • Kernel methods: there were two dedicated workshops
    • Graphical Models and Probabilistic Programming
    • ... and perhaps some other cool topics which I have missed.
Deep Neural Nets
(a.k.a. Deep Learning, to many)
  • Mathematical Foundation. There are plenty of reasons for the hype around deep networks, which I won't describe here. One of the major academic interests in DNNs at the conference was their theoretical foundation. Deep networks work very well on many benchmarks, but no one can give a solid explanation of why. This lack of a mathematical/statistical foundation for deep networks (or neural networks in general) was questioned many times during the conference and the workshops. Fortunately for the field, as Yann LeCun mentioned in the panel discussion, it has recently attracted the attention of some well-known applied mathematicians (such as Stan Osher and Stéphane Mallat).
  • Why they work well. To me, the fact that neural networks with many layers perform well can simply be explained by their ability to use many parameters. Basically, deep networks let you fit a training set, in a nonlinear way, with a very large number of parameters. Loosely speaking, the more parameters you use, the better you can fit the data.
  • Overfitting issue. Of course, when you use so many parameters, overfitting becomes a concern, and it is true that deep networks are prone to overfitting. That's why there was a talk by G. Hinton, the master of the art, about the dropout technique in neural networks, which targets the overfitting problem.
  • Fine tuning. It follows from the above that there is a lot of engineering (e.g. fine tuning) in designing a deep network. If we connect the dots, there seems to be nothing particularly new or surprising about DNNs. The basic idea is: use a lot of parameters, if you have the computational resources to do so, and engineer the network so that overfitting is reduced.
  • Distributed DNNs. Another source of hype around DNNs is the distributed training work by Google. Along with the mathematical-foundation research, distributed DNNs will be a major research direction.
    • Here is my fun hypothesis: if your method performs (slightly) better than the benchmarks, its hype will ironically be an increasing function of its complexity and required computational resources (e.g. 16,000 cores and a new parallel architecture will generate a great deal of hype).
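The dropout technique mentioned above is, at its core, tiny. Here is a minimal NumPy sketch of the "inverted dropout" formulation, where the rescaling happens at training time so nothing needs to change at prediction time (all names and parameter values are mine, not from any particular implementation):

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None):
    # Zero each unit with probability p_drop and rescale the survivors
    # by 1/(1 - p_drop), so the expected activation is unchanged.
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10000)
dropped = dropout(a, p_drop=0.5)
# Roughly half the entries become zero; the rest become 2.0,
# so the mean stays near 1.
```

Because each training pass sees a different random sub-network, no single unit can rely on its neighbors, which is exactly the co-adaptation that dropout tries to break.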
Kernel Methods
Despite the hype about DNNs, kernel methods are still a relevant and powerful class of methods, with a strong mathematical foundation. The major focus of kernel methods research is now on low-rank approximation of the kernel matrix, which I'm also pursuing.
  • The fact that Leslie Greengard, a computational physicist well known for co-inventing the Fast Multipole Method and who hasn't worked in machine learning at all, was invited to give a talk at NIPS 2012 is an indication of the topic's importance.
    • Unfortunately, he had to cancel at the last minute due to an injury, and his talk was replaced by ... well, a deep learning talk (by G. Hinton).
  • Even in Leslie Greengard's absence, there were still two talks about low-rank approximation methods (one by Francis Bach and one by Alex Smola).
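For the curious, the classic Nyström construction that underlies much of this low-rank work fits in a few lines. This is a sketch with a Gaussian kernel and made-up sizes, not any speaker's exact method:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Gaussian kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, m, gamma, rng=None):
    # Sample m landmark points. With C = K[:, idx] (n x m) and
    # W = K[idx, idx] (m x m), the rank-m approximation is
    # K ~= C @ pinv(W) @ C.T -- O(n*m) kernel evaluations instead of O(n^2).
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)
    W = C[idx]
    # rcond truncates tiny eigenvalues of W for numerical stability
    return C @ np.linalg.pinv(W, rcond=1e-8) @ C.T

X = np.random.default_rng(1).normal(size=(200, 3))
K = rbf_kernel(X, X, gamma=0.1)
K_approx = nystrom(X, m=50, gamma=0.1)
```

The point of the whole research area is that smooth kernels have rapidly decaying spectra, so a modest number of landmarks captures almost all of the kernel matrix.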
Graphical Models and Probabilistic Programming
Disclaimer: I've only recently learned about graphical models, and here is my take on this approach.
  • Strength: it is the most natural and most accurate way to infer the distribution of an unknown from the observed data, provided you can construct the graph that represents the probabilistic relationships among the variables.
  • Weakness: that very condition is the main weakness of graphical models. You cannot just throw in features and data and hope to get a model back. You have to construct a graph over the features. In many problems with a lot of features (or variables), it's hard or even impossible to construct such a graph. In those problems, we don't even know how the variables correlate.
That said, graphical models are a powerful way of modeling when the number of variables is not too large and when we want to inject domain knowledge about the data. I think there are even more problems of this type than very-large-scale problems.
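To make the strength concrete, here is the smallest possible graphical model: a two-node disease → test graph, where inference over the hidden node is just Bayes' rule. All the numbers are made up for illustration:

```python
# A two-node graph: disease -> test. Observe a positive test,
# infer the posterior probability of disease.
p_d = 0.01                  # prior: P(disease)
p_pos_given_d = 0.90        # sensitivity: P(test+ | disease)
p_pos_given_not_d = 0.05    # false-positive rate: P(test+ | no disease)

# Marginal likelihood of a positive test: sum over the hidden node
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' rule gives the posterior on the unobserved variable
p_d_given_pos = p_pos_given_d * p_d / p_pos   # about 0.154
```

Even this toy case shows why you must know the graph: the computation is only possible because we asserted that the test depends on the disease and on nothing else.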

Unsupervised vs. Supervised Learning
  • Unsupervised feature learning. In his talk, Andrew Ng said that Deep Learning is sort of a misnomer: what people should care about is unsupervised feature learning, not deep this or deep that. I agree with him on this. I further think that unsupervised feature learning should be decoupled from deep networks (aka deep learning).
    • It may be that deep networks are currently the best technique for unsupervised feature learning. However, they are not the only technique, or even the first. For example, people have long used PCA (or Kernel PCA for nonlinear analysis).
    • Unsupervised feature learning is not the only application of deep networks. Some recent successes with deep networks are just straight supervised learning from the data (without pre-training, i.e. unsupervised feature learning).
  • Unsupervised vs. Supervised Learning. This is a topic of frequent debate at the conference.
    • Arguments for supervised learning: you can get much more information from labeled data
    • Arguments for unsupervised feature learning: you can learn more from more data and the amount of unlabeled data is huge compared to labeled data
  • Semi-supervised learning. Both sides are valid, and here's my take: we should do unsupervised and supervised learning simultaneously, i.e. so-called semi-supervised learning.

[1] For those who haven't heard of NIPS, it's one of the main annual conferences (if not the main conference) in Machine Learning.
[3] Disclaimer: this analysis is through the lenses of a computational-mathematician-convert, who has recently been very interested in machine learning.
[4] Not to mention that the whole conference was very well organized. Kudos to the organizers.
[5] Just in case you wonder why it is not surprising, see the extensive media coverage.

Thursday, November 22, 2012

Dimension reduction in machine learning

An application of dimension reduction in face recognition: instead of using thousands of pixels to capture a face image, we can use ~100 eigenface images whose linear combinations can reasonably approximate any face image.

Here's the standard definition of dimension reduction, which I copy from Wikipedia:
In machine learning (or data analysis in general), dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Why is dimension reduction important in data analysis? In layman's terms, dimension reduction is the process of filtering out noisy signals and keeping only high-quality signals, with or without transforming the original signals. (Dimension reduction without transforming the input signals is typically called variable or feature selection, while dimension reduction with transformation is often called feature extraction.) Dimension reduction thus has two main benefits:
  1. Better accuracy. By learning from high-quality signals, you can predict on new data more accurately, because overfitting is reduced.
  2. Performance, performance. You can analyze data more quickly in a lower dimension. This may be crucial when you have to deal with a lot of data.
This is just an introduction of my series on dimension reduction. In future posts, I'll write about:
  1. Feature selection methods
    • Filter methods
    • Wrapper methods
    • Embedded methods
  2. Feature extraction methods
    • Principal component analysis (PCA)
    • Fisher linear discriminant analysis (LDA)
    • Other nonlinear methods
Feature extraction methods tend to be more powerful in general settings, but they are also harder to understand and implement. Familiarity with linear algebra and matrix decompositions (e.g. eigenvalue or singular value decomposition) is extremely useful for understanding PCA, LDA, or machine learning in general.
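Since PCA will open the feature-extraction posts, here is a minimal SVD-based sketch, run on synthetic data whose true dimension is known (the variable names and sizes are my own, chosen for illustration):

```python
import numpy as np

def pca(X, k):
    # Center the data, then take the top-k right singular vectors of
    # the centered matrix as the principal directions.
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    components = Vt[:k]                   # k x d, orthonormal rows
    return (X - mu) @ components.T, components, mu

rng = np.random.default_rng(0)
# 100 samples of a 50-dim signal that really lives in a 3-dim subspace
Z = rng.normal(size=(100, 3))
X = Z @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(100, 50))

Y, comps, mu = pca(X, k=3)    # Y: the 3-dim representation
X_rec = Y @ comps + mu        # reconstruction from 3 numbers per sample
```

This is the eigenfaces idea from the opening example in miniature: 50 "pixels" compressed to 3 coefficients with almost no loss, because the signal was low-dimensional to begin with.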


Note that none of what I'm writing in this series is new. Instead, I just present it in a way that hopefully has some value for certain readers. Of course, instead of writing everything in my own words, I'll reuse as much public material as possible, where I see fit.

Wednesday, November 14, 2012

Why is Linear Algebra a prerequisite behind modern scientific/computational research?

This is yet another question on Quora which I answered a while back.

Because Linear Algebra is the language about linear systems.

So why linear systems?

1. Linear systems are well defined and well understood. Nonlinear systems, on the other hand, are not concretely defined: they can be quadratic, cubic, polynomial, exponential, harmonic, and so on.

Simplicity is not the only reason though. Here's the more important one.

2. Any smooth function can be approximated locally by a linear function. In layman terms, most functions, when zoomed in, look like a linear function.

Linear approximation is the essence of Newton's method (for nonlinear optimization), finite differences (for approximating the derivative of a function), etc.
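As a concrete instance of "replace the function by its local linear model", here is Newton's method for root finding, sketched in a few lines:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    # Each step models f by its tangent line at x_n,
    # f(x) ~= f(x_n) + f'(x_n) * (x - x_n), and solves that for zero.
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# sqrt(2) as the positive root of x^2 - 2, starting from x0 = 1
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

The nonlinear problem is never solved directly; a handful of trivially solvable linear problems stand in for it, which is the whole point of the argument above.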


Sunday, November 11, 2012

My ideal desktop keyboards

A good keyboard should promote both the user's health (i.e. ergonomics) and productivity. By these standards, here are my favorite keyboards on the market.

1. Thinkpad keyboard
I hate switching back and forth between the keyboard and mouse and moving my hand around when navigating with the mouse. The TrackPoint on the Thinkpad keyboard solves that problem.
ThinkPad USB Keyboard with TrackPoint
Bonus: with this keyboard, you don't need a mouse.

2. Microsoft ergonomic keyboard
The Microsoft keyboard, on the other hand, is superior in terms of ergonomics.

I wonder why they can't just put a track point in that blank space in the middle.

Conclusion. My ideal keyboard would thus be a combination of the two, i.e. the Microsoft keyboard's shape with a TrackPoint. Unfortunately, such a product doesn't exist!

Saturday, November 10, 2012

Besides MapReduce-style clusters, what other popular paradigms/architectures can handle large scale computational problems?

This is a question on Quora which I answered a while back.

Short answer. Sure, there are other programming paradigms for large scale computing, such as MPI or GPGPU. They are both very popular models for large scale computing, in addition to MapReduce.

Longer answer. We have to be more specific when talking about large scale computational problems. Is it large in computational complexity or large in data? 

Basically, MPI is suited for task parallelism while MapReduce is for data parallelism. I'll explain this point via two examples.

1. MPI for high performance computing. Most scientific computing problems (e.g. weather forecasting or rendering an animated movie) fall into this category. In fact, the top supercomputers in the world are dedicated to running MPI applications, not MapReduce applications.

Consider the sample problem of multiplying two large matrices, each of size 10000x10000. The computational complexity of matrix multiplication is O(N^3), so running this algorithm on one computer is impractical. However, the data is not that big, O(10GB), and can easily be stored on a hard drive.

MPI is the natural framework for parallelism here. The matrices are decomposed into blocks and stored on the storage node(s). The compute nodes load their corresponding data chunks into memory and run the computation. Using MPI, you control what each node does, what data it gets, etc., as well as the communication among the nodes.
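To make the decomposition concrete, here is the block structure each MPI rank would own, simulated serially in NumPy. This is a sketch of the data layout, not actual MPI code:

```python
import numpy as np

def blocked_matmul(A, B, nb):
    # Split the n x n matrices into an nb x nb grid of blocks. In an
    # MPI code, rank (i, j) would own output block (i, j) and fetch
    # the row blocks of A and column blocks of B it needs; here the
    # "ranks" simply run one after another in a loop.
    n = A.shape[0]
    bs = n // nb
    C = np.zeros((n, n))
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                C[i*bs:(i+1)*bs, j*bs:(j+1)*bs] += (
                    A[i*bs:(i+1)*bs, k*bs:(k+1)*bs]
                    @ B[k*bs:(k+1)*bs, j*bs:(j+1)*bs])
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
C = blocked_matmul(A, B, nb=4)
```

Notice that the decomposition is over tasks (which output block a rank computes), not over records of input data, which is why MPI fits it so naturally.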

MapReduce is an overly complicated model for this problem because:
  1. You have to frame your algorithm in terms of keys and values, which is not natural for this problem. In the example, you need to create artificial keys for parallelism.
  2. It's not even clear how you would distribute your input matrices.
  3. There are many unnecessary intermediate phases, including determining which machines to run on, shuffling, merging, sorting, etc.
  4. Data locality and fault tolerance are also not so relevant.

Using MapReduce for this type of application means complexity in both programming time and run time.

2. MapReduce for big data processing. Now consider the problem of finding the top 5 repeated words of length >= 10 in all books published since 1900.

It's quite straightforward to do this using Hadoop/MapReduce, which I won't explain here. Let's see how you would do it using MPI.

First, the amount of data is so large that it can only be stored on multiple machines, across racks or even across data centers. A distributed file system with high fault tolerance and good data locality is required (such as HDFS or GFS). Because of the data locality, the compute and storage nodes are the same.

In the current MPI paradigm, it's unclear which machines you want to run on (you have to specify a machine list). Even if you run on all machines, each node doesn't know which data files it needs to load. Some input files are stored on the same or nearby nodes, but which ones? You don't know.
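For contrast, the per-record logic that MapReduce distributes is trivial. A single-process caricature follows; the hard part, which HDFS and the shuffle handle for you, is doing exactly this across thousands of machines:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word of length >= 10
    for doc in documents:
        for word in doc.lower().split():
            if len(word) >= 10:
                yield word, 1

def shuffle_and_reduce(pairs):
    # Shuffle groups the pairs by key; reduce sums the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts

docs = ["Considerations of considerations", "some considerations matter"]
counts = shuffle_and_reduce(map_phase(docs))
top5 = sorted(counts.items(), key=lambda kv: -kv[1])[:5]
```

The framework never asks the programmer where the data lives, which is precisely the question MPI cannot avoid in this setting.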

Friday, November 2, 2012

Is online (machine) learning necessary?

This is a question on Quora which I answered a while back.
I used to ask this question myself, but in retrospect, I clearly didn't understand machine learning at that time. 

An ML system consists of two parts: a (re)trainer and a predictor.

In most applications, the predictor needs to be real-time. For example, once an email arrives, a spam detector needs to immediately classify whether the email is spam or not. The same responsiveness applies to web search ranking, pattern recognition, etc.

Making the predictor real-time is easy. It just applies the model function, with the coefficients trained by the learner, to the new input (e.g. a new email or a new search query).
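Concretely, for a logistic-regression spam model the predictor is one dot product plus a sigmoid. The coefficients and features below are made up purely for illustration:

```python
import math

def predict_spam(weights, bias, features):
    # Apply the trained model function: cheap enough to run on every
    # incoming email, using coefficients produced offline by the trainer.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three features
# (e.g. counts of suspicious tokens in the email)
score = predict_spam([1.5, 2.0, -0.5], bias=-1.0, features=[1, 1, 0])
# score is P(spam); here about 0.92
```

Nothing in this step depends on when the coefficients were trained, which is why prediction can be real-time even when training is not.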

However, the learner typically doesn't need to be real-time. The output of the learner, loosely speaking, captures the behavior and/or tastes of users and/or other factors. These factors do change over time, but slowly. So usually it's fine to retrain an ML model every day or even every month.

The learning part can be time-critical when your model is dumb due to small data and you need to make it smarter as soon as possible.


Here's another way to look at it. Think of a human learner. I digest so many new things (seeing/hearing/reading/etc.) every day. Yet I rarely convert what I digest into experience (and act based on this updated experience) even on a weekly basis, let alone in real time.

The fact that I can't immediately convert what I see into experience is totally fine. In fact, if I could retrain myself, in a non-heuristic way, on a weekly basis, I'd be much more intelligent than I am today.


Online learning is an area in ML that focuses on applications that need real-time (or incremental) training.

I am very interested in the technical challenges posed in this field, and I am even working on a new approach to online logistic regression (*). However, I think the impact of online learning in the near future is small. (This opinion is controversial, I know.)

Wednesday, October 31, 2012

Welcome to my renewed blog!

My site was hacked two years ago and since then, I haven't bothered restoring the site.

Now, I'll start writing again. What's the theme this time? Still pretty technical, but perhaps I'll place more emphasis on machine learning and related stuff, which is what I do now for a living. At the moment, I am working on a few projects, including:
  1. A new efficient online logistic regression method, which I call dual regression. I've recently found a flaw in my approach which makes this project likely to fail. Anyway, I still believe that stochastic gradient descent, the current standard way to do online learning, can be significantly improved, both empirically and theoretically.
  2. Improving the Kernel SVM method by using a fast kernel summation method (also called the fast multipole method, or FMM, in Computational Physics).
  3. Scaling up and scaling out certain ML algorithms.
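For reference, the stochastic-gradient-descent baseline mentioned in project 1 fits in a dozen lines. This is a toy sketch on synthetic data, not my dual-regression method:

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1):
    # One gradient step per example: w <- w - lr * (sigmoid(w.x) - y) * x.
    # Each example is touched once, which is what makes this "online".
    w = [0.0] * dim
    for x, y in stream:
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        g = p - y
        for i in range(dim):
            w[i] -= lr * g * x[i]
    return w

# Toy stream: the label is 1 iff the first feature exceeds the second
rng = random.Random(0)
stream = []
for _ in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]  # last entry = bias
    stream.append((x, 1 if x[0] > x[1] else 0))
w = sgd_logistic(stream, dim=3)
```

The learning rate and the single fixed pass are exactly the knobs that make plain SGD feel crude, which is the improvement opportunity the post alludes to.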