Thursday, November 22, 2012

Dimension reduction in machine learning

An application of dimension reduction in face recognition: instead of using thousands of pixels to capture a face image, we can use ~100 eigenface images whose linear combinations reasonably approximate any face image.
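To make the eigenface idea concrete, here is a minimal sketch using scikit-learn's PCA on the Olivetti faces dataset; the dataset and the choice of 100 components are illustrative assumptions on my part, not requirements of the method.

    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.decomposition import PCA

    faces = fetch_olivetti_faces()        # 400 images, 64x64 = 4096 pixels each
    X = faces.data                        # shape (400, 4096)

    pca = PCA(n_components=100)           # keep ~100 "eigenfaces"
    codes = pca.fit_transform(X)          # each face -> 100 coefficients

    # A face is then approximated as a linear combination of the eigenfaces.
    approx = pca.inverse_transform(codes[0])   # back to 4096 pixel values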


Here's the standard definition of dimension reduction, which I copy from Wikipedia:
In machine learning (or data analysis in general), dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Why is dimension reduction important in data analysis? In layman's terms, dimension reduction is the process of filtering out noisy signals and keeping only high-quality signals, with or without transforming the original signals. (Dimension reduction without transforming the input signals is typically called variable or feature selection, while dimension reduction with transformation is often called feature extraction; the sketch after the list below makes this distinction concrete.) Dimension reduction thus has two main benefits:
  1. Better accuracy. By learning from high-quality signals, you reduce overfitting and can predict more accurately on new data.
  2. Performance, performance. You can analyze data more quickly in a lower dimension. This may be crucial when you have to deal with lots of data.
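Returning to the selection/extraction distinction above, here is a hedged sketch on scikit-learn's digits dataset; SelectKBest and PCA are my illustrative stand-ins for the two families, not the only choices.

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_digits(return_X_y=True)   # 64 pixel features per digit image

    # Feature selection: keep 10 of the original pixels, untransformed.
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)

    # Feature extraction: build 10 new features as linear combinations of pixels.
    X_ext = PCA(n_components=10).fit_transform(X)

    print(X_sel.shape, X_ext.shape)       # both (1797, 10)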
This is just the introduction to my series on dimension reduction. In future posts, I'll write about:
  1. Feature selection methods
    • Filter methods
    • Wrapper methods
    • Embedded methods
  2. Feature extraction methods
    • Principal component analysis (PCA)
    • Fisher linear discriminant analysis (LDA)
    • Other nonlinear methods
Feature extraction methods tend to be more powerful in general settings, but they are also harder to understand and implement. Familiarity with linear algebra and matrix decompositions (e.g. eigenvalue or singular value decomposition) is extremely useful for understanding PCA, LDA, or machine learning in general.
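As a taste of that connection, here is a small NumPy-only sketch (my own illustration, ahead of the PCA post) of how PCA reduces to an eigendecomposition of the data's covariance matrix; a production implementation would typically use the SVD instead for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))          # 200 samples, 5 features

    Xc = X - X.mean(axis=0)                # center the data
    cov = Xc.T @ Xc / (len(Xc) - 1)        # 5x5 sample covariance matrix

    eigvals, eigvecs = np.linalg.eigh(cov) # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance

    W = eigvecs[:, order[:2]]              # top-2 principal directions
    X_reduced = Xc @ W                     # project the data: shape (200, 2)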

-------------------------

Note that none of what I'm writing in this series is new. Instead, I just present it in a way that hopefully has some value to certain readers. Of course, instead of writing everything in my own words, I'll reuse as much public material as possible, where I see fit.