Thursday, November 22, 2012

Dimension reduction in machine learning

An application of dimension reduction in face recognition. Instead of using thousands of pixels to capture a face image, we can use ~100 eigenface images that, when combined linearly, can reasonably approximate any face image.

Here's the standard definition of dimension reduction, which I copy from Wikipedia:
In machine learning (or data analysis in general), dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Why is dimension reduction important in data analysis? In layman's terms, dimension reduction is the process of filtering out noisy signals and keeping only high quality signals, with or without transforming the original signals. (Dimension reduction without transforming the input signals is typically called variable or feature selection, while dimension reduction with transformation is often called feature extraction.) Dimension reduction thus has two main benefits:
  1. Better accuracy. By learning from high quality signals, the model overfits less and predicts more accurately on new data.
  2. Performance, performance. You can analyze data more quickly in lower dimension. This may be crucial when you have to deal with lots of data.
This is just an introduction of my series on dimension reduction. In future posts, I'll write about:
  1. Feature selection methods
    • Filter methods
    • Wrapper methods
    • Embedded methods
  2. Feature extraction methods
    • Principal component analysis (PCA)
    • Fisher linear discriminant analysis (LDA)
    • Other nonlinear methods
Feature extraction methods tend to be more powerful in general settings, but they are also harder to understand and implement. Familiarity with linear algebra and matrix decompositions (e.g. eigenvalue or singular value decomposition) is extremely useful for understanding PCA, LDA, or machine learning in general.
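Since PCA and the SVD come up here, a minimal sketch may help. This is my own toy illustration (random data standing in for face images; the sizes and variable names are made up), showing how a handful of principal components plays the role of the eigenfaces above:

```python
import numpy as np

# Toy "faces": 50 samples, 20 features. Real face images would have
# thousands of pixels; the idea is identical.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))

# Center the data, then take the top-k right singular vectors as the
# principal components (the "eigenfaces" in the face-recognition setting).
k = 5
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]        # (k, 20): the new, lower-dimensional basis
Z = Xc @ components.T      # (50, k): each sample reduced to k numbers

# A linear combination of the k components approximates each sample.
X_approx = Z @ components + X.mean(axis=0)
```

Each row of `X` is now summarized by just `k` coefficients in `Z`, which is exactly the "~100 numbers instead of thousands of pixels" trade-off.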


Note that none of what I'm writing in this series is new. I just present it in a way that hopefully has some value to certain readers. Of course, instead of writing everything in my own words, I'll reuse as much public material as possible, where I see fit.

Wednesday, November 14, 2012

Why is Linear Algebra a prerequisite behind modern scientific/computational research?

This is yet another question on Quora which I answered a while back.

Because Linear Algebra is the language of linear systems.

So why linear systems?

1. Linear systems are well defined and well understood. Nonlinear systems, on the other hand, are not concretely defined: they can be quadratic, cubic, polynomial, exponential, harmonic, blah blah...

Simplicity is not the only reason though. Here's the more important one.

2. Any smooth function can be approximated locally by a linear function. In layman's terms, most functions, when zoomed in, look like a linear function.

Linear approximation is the essence of Newton's method (for nonlinear optimization), finite difference (approximating the derivative of a function), etc. 
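Both ideas fit in a few lines. Here's a toy sketch of my own (not from the Quora answer): a central finite difference to estimate a derivative, and a Newton iteration that repeatedly solves the local linear approximation:

```python
import math

def fd_derivative(f, x, h=1e-6):
    # Central finite difference: approximate f near x by the line
    # through (x - h, f(x - h)) and (x + h, f(x + h)); return its slope.
    return (f(x + h) - f(x - h)) / (2 * h)

def newton_step(f, x):
    # One Newton step: replace f by its local linear approximation
    # at x and solve that linear equation for its root.
    return x - f(x) / fd_derivative(f, x)

# The derivative of sin at 0 is cos(0) = 1 ...
slope = fd_derivative(math.sin, 0.0)

# ... and a few Newton steps on f(x) = x^2 - 2 converge to sqrt(2).
root = 2.0
for _ in range(6):
    root = newton_step(lambda x: x * x - 2.0, root)
```

In both cases the only thing we ever actually compute with is a straight line; the nonlinearity is handled by re-linearizing at each new point.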

Do you get the picture above?

Sunday, November 11, 2012

My ideal desktop keyboards

A good keyboard should promote both the user's health (i.e. ergonomics) and productivity. Based on these standards, here are my favorite keyboards on the market.

1. Thinkpad keyboard
I hate switching back and forth between the keyboard and mouse and moving my hand around when navigating with the mouse. The TrackPoint on the Thinkpad keyboard solves that problem.
ThinkPad USB Keyboard with TrackPoint
Bonus: with this keyboard, you don't need a mouse.

2. Microsoft ergonomic keyboard
The Microsoft keyboard, on the other hand, is superior in terms of ergonomics.

I wonder why they can't just put a track point in that blank space in the middle.

Conclusion. My ideal keyboard would thus be a combination of the two, i.e. the Microsoft keyboard's shape with a TrackPoint. Unfortunately, such a product doesn't exist!

Saturday, November 10, 2012

Besides MapReduce-style clusters, what other popular paradigms/architectures can handle large scale computational problems?

This is a question on Quora which I answered a while back.

Short answer. Sure, there are other programming paradigms for large scale computing, such as MPI or GPGPU. Both are very popular models for large scale computing, in addition to MapReduce.

Longer answer. We have to be more specific when talking about large scale computational problems. Is it large in computational complexity or large in data? 

Basically, MPI is suited for task parallelism while MapReduce is for data parallelism. I'll explain this point via two examples.

1. MPI for high performance computing. Most scientific computing problems (e.g. weather forecasting or rendering an animated movie) fall into this category. In fact, the top supercomputers in the world are dedicated to running MPI applications, not MapReduce applications.

Consider a sample problem of multiplying two large matrices, each of size 10000x10000. The computational complexity of matrix multiplication is O(N^3), so running this algorithm on one computer is not practical. However, the data is not that big (about 800 MB per matrix in double precision) and can easily be stored on a hard drive.

MPI is the natural framework for this kind of parallelism. The matrices are decomposed into blocks and stored on the storage node(s). The compute nodes load their corresponding data chunks into memory and run the computation. Using MPI, you control what each node does, what data it fetches, etc., as well as the communication among the nodes.
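To illustrate the block decomposition, here is a single-process sketch of my own (no actual MPI; in real code, e.g. with mpi4py, each block of the result below would be computed by a separate rank, with explicit communication of the blocks it needs):

```python
import numpy as np

# Tiny sizes for illustration; the decomposition is the same at 10000x10000.
n, b = 8, 4                      # matrix size and block size
rng = np.random.default_rng(1)
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

C = np.zeros((n, n))
for i in range(0, n, b):
    for j in range(0, n, b):
        # Conceptually, "rank (i, j)" computes one b-by-b block of C
        # from a block-row of A and a block-column of B.
        C[i:i+b, j:j+b] = A[i:i+b, :] @ B[:, j:j+b]
```

Each block of `C` depends only on a slice of `A` and a slice of `B`, which is why the programmer can assign blocks to nodes directly, with no keys, no shuffle, and no sort.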

MapReduce is an overly complicated model for this problem because:
  1. You have to frame your algorithm in terms of keys and values, which is not natural for this problem. In the example, you need to create artificial keys for parallelism.
  2. It's not even clear how you would distribute your input matrices.
  3. There are many unnecessary intermediate phases, including determining which machines to run on, shuffling, merging, sorting, etc.
  4. Data locality and fault tolerance are also not so relevant.

Using MapReduce for this type of application means complexity in both programming time and run time.

2. MapReduce for big data processing. Now consider the problem of finding the top 5 repeated words of length >= 10 in all books published since 1900.

It's quite straightforward to do this using Hadoop/MapReduce, which I won't explain here. Let's see how you would do it using MPI.
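For contrast, here is what the map and reduce phases look like, simulated on a single machine with made-up toy "books" (Hadoop would run the same two phases distributed over files in HDFS):

```python
from collections import Counter
from itertools import chain

# Toy stand-ins for the book corpus.
books = [
    "incomprehensible arrangements notwithstanding",
    "notwithstanding the arrangements were incomprehensible",
]

# Map phase: each mapper emits (word, 1) for words of length >= 10.
def mapper(text):
    return [(w, 1) for w in text.split() if len(w) >= 10]

# Shuffle + reduce phase: group by key, sum counts, take the top 5.
counts = Counter()
for word, one in chain.from_iterable(mapper(b) for b in books):
    counts[word] += one
top5 = counts.most_common(5)
```

The problem maps naturally onto keys (words) and values (counts), and the framework handles data placement, which is exactly what the MPI version below struggles with.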

First, the amount of data is so large that it can only be stored on multiple machines, across racks or even across data centers. A distributed file system with high fault tolerance and good data locality is required (such as HDFS or GFS). Because of the data locality, the compute and storage nodes are the same.

In the current MPI paradigm, it's unclear which machines you want to run on (you have to specify a machine list). Even if you run on all machines, each node doesn't know which data files it needs to load. Some input files are stored on the same or nearby nodes, but which ones? You don't know.

Friday, November 2, 2012

Is online (machine) learning necessary?

This is a question on Quora which I answered a while back.
I used to ask this question myself, but in retrospect, I clearly didn't understand machine learning at that time. 

An ML system consists of two parts: a (re)trainer and a predictor.

In most applications, the predictor needs to be real-time. For example, once an email arrives, a spam detector needs to immediately classify whether the email is spam or not. The same responsiveness applies to web search ranking, pattern recognition, etc.

Making the predictor real-time is easy. It just applies the model function, with the coefficients trained by the learner, to the new input (e.g. new email or new search query).

However, the learner typically doesn't need to be real-time. The output of the learner, loosely speaking, captures the behavior and/or tastes of users and/or other factors. These factors do change over time, but slowly. So usually, it's fine to retrain an ML model every day or even every month.

The learning part can be time-critical when your model is dumb due to small data and you need to make it smarter as soon as possible.


Here's another way to look at it. Think of a human learner. I digest so many new things (see/hear/read/etc.) every day. Yet, I rarely convert what I digest into experience (and act based on this updated experience) even on a weekly basis, let alone in real time.

The fact that I can't immediately convert what I see into experience is totally fine. In fact, if I could retrain myself, in a non-heuristic way, on a weekly basis, I'd be much more intelligent than I am today.


Online learning is an area in ML that focuses on applications that need real-time (or incremental) training.
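To show what incremental training looks like, here is the textbook stochastic-gradient update for logistic regression, one example at a time. (This is the standard online recipe, not the new approach mentioned in this post; the data is made up.)

```python
import math

def sgd_update(w, x, y, lr=0.1):
    # One online update on a single example: x is the feature vector,
    # y in {0, 1} is the label. Gradient of the log loss is (p - y) * x.
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
    return [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]

# Stream examples one at a time; the model improves incrementally,
# with no need to revisit past data.
w = [0.0, 0.0]
stream = [([1.0, 2.0], 1), ([1.0, -1.0], 0), ([1.0, 2.5], 1)]
for x, y in stream:
    w = sgd_update(w, x, y)
```

Note that the predictor is just the `p = sigmoid(w . x)` line: cheap and real-time regardless of how often the weights are refreshed, which is the separation argued for above.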

I am very interested in the technical challenges posed in this field, and I'm even working on a new approach to online logistic regression (*). However, I think the impact of online learning in the near future is small. (This opinion is controversial, I know.)