Thursday, April 17, 2014

Write to Learn

In a recent talk titled "Thinking for Programmers", Leslie Lamport (a recent Turing Award winner, among other things) said: "To think, you have to write. If you're thinking without writing, you only think you're thinking."

Critical writing is difficult. Writing correct and valuable statements can be very costly. However, the rewards from writing usually outweigh the costs of doing it. Here's my recent example.

I recently spent days writing a post for this blog about one of the pain points in machine learning practice, namely the disconnect between training models and putting them into production. The process of writing required me to validate some of the claims that I had taken for granted and that may not be obvious to readers. In doing so, I realized that one of my claims was actually not quite right.

I won't go into details, but the cost of this validation is not small. Besides tossing away my writing effort, it means that what I have been building is perhaps not as valuable as I had imagined. That really crushed a bit of my soul.

However, the benefit from doing this kind of research and writing is considerably more significant. Obviously, I have learned a few new things, and not just the kind of little knowledge that everyone gains every day. I may need to take a few steps back and investigate more, rather than pursuing something of little value.

By the way, this is not the only example. I have several similar posts waiting in the drafts box because, while writing them, I figured out that they contain unverified assumptions.

Wednesday, March 26, 2014

What kind of coding skills are required to work on machine learning?

Answered on Quora


(Image src: Inside BigData)

In our small team of 13 people, who all work on ML, the required coding skills range from:
  • None (or a simple git pull and build). Such a person only needs to run experiments and write technical docs. (Revised: perhaps very little coding, to demonstrate how to use the API.)
  • to decent numerical computing in MATLAB/Python/R. Such a person runs and tweaks experiments on real problems for customers. Knowing at least one of those scripting languages is required so that they can do custom feature engineering or visualization tasks that are not supported by the main tool that we build.
  • to good C# or F# + great software design + various levels of numerical computing. Such a person contributes to the main code base.
  • to hardcore low-level programming. Such a person is obsessed with latency/throughput, BLAS, SSE/AVX, GPUs, and distributed systems.

Wednesday, December 11, 2013

Highlights of NIPS 2013


Overall

  1. Deep Learning (or Deep Neural Networks, DNNs) is again the most trendy topic of the conference. Its workshop session was perhaps twice as big (or more) as last year's, and it was packed for most of the day. Interestingly, Mark Zuckerberg of Facebook stopped by for a Q&A and then a panel discussion session. His visit was mostly to announce Facebook's new AI Lab. For technical highlights, see below.
  2. Distributed machine learning is another topic of huge interest.
  3. Growing markets and interests in predictive analytics on sensor data (e.g. activity detection on mobile phones or wearable devices), in ML for Health Care, and in ML for Education.
  4. And there is certainly bad-ass research in other areas that I have missed. Among the topics of my interest, however, Optimization (particularly non-convex optimization) hasn't made much progress.

Deep Neural Nets

  • Natural Language Processing. The application of DNNs in NLP was the theme of deep learning research this year. This is natural: NLP is the holy grail of machine learning research, and DNNs have already convincingly demonstrated their power in Computer Vision and Speech Recognition. There was some cool research using DNNs in NLP, such as Compositional Natural Language Parsing with Compositional Vector Grammars by a team at Stanford (led by Richard Socher) and the Word2Vec project at Google (led by Tomas Mikolov). I think this is just the beginning though.
  • Computer Vision. A new benchmark on ImageNet was established by Matt Zeiler et al., although it's not a big improvement over the previous record set by Alex Krizhevsky et al. (see their famous paper). Although Matt's work received a lot of attention from the community (come on, he set a new record), I was slightly disappointed. The spirit of his paper is about understanding convolutional neural networks, but he did not explain why his network (a customized version of Alex's network) yields better results. He also wasn't able to rigorously explain certain mysteries of training neural networks (such as why rectified linear units work so well).
  • Non-convex Optimization. This is the topic I care about most in DNNs, because solving a DNN is a non-convex optimization problem. The current techniques only try to find a local minimum, at best. Here's an experiment my team did for activity detection on mobile devices using sensor data. After feature extraction using PCA and feeding the extracted features into a neural network, we got very good results. We then simulated the PCA feature extraction by introducing another layer into the neural network. We expected that the optimized weights of the first layer would be identical to the PCA weights, if not better. However, we got worse results. This indicates that the optimizer converged to a not-so-good local optimum. (A rough sketch of this kind of comparison appears after this list.)

  • From the scientific standpoint, the sexy part about DNNs is that they can model very complicated machine learning tasks without too much feature engineering. However, until there's a breakthrough in global optimization techniques for non-convex problems, or some convex reformulation of DNNs, training DNNs will still require a significant amount of engineering, and DNNs may end up being just another periodic trend.
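To make the non-convexity point concrete, here is a minimal Python sketch (using scikit-learn; not the original experiment or code) of the kind of comparison described above: explicit PCA features fed into a small net, versus an extra first hidden layer that we would hope learns an equally good projection. The dataset, layer sizes, and the use of MLPClassifier are illustrative assumptions; note also that scikit-learn applies the same nonlinearity to every hidden layer, so the extra layer only approximates a PCA-like linear layer.

```python
# Sketch: PCA features + net vs. letting the net learn the projection itself.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) Explicit PCA feature extraction, then a one-hidden-layer net.
pca_then_net = make_pipeline(
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
)
pca_then_net.fit(X_tr, y_tr)

# (b) No PCA; an extra first hidden layer of the same width is expected to
# learn a projection at least as good as PCA, but often does not because the
# optimizer converges to a poor local optimum.
net_with_extra_layer = MLPClassifier(
    hidden_layer_sizes=(20, 50), max_iter=1000, random_state=0)
net_with_extra_layer.fit(X_tr, y_tr)

print("PCA + net  :", pca_then_net.score(X_te, y_te))
print("extra layer:", net_with_extra_layer.score(X_te, y_te))
```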

Distributed Machine Learning

There are many interesting posters/talks on the topic of distributed and large-scale machine learning at NIPS this year (such as this, this, and this workshop). However, what really excited me is not the work presented at NIPS but Spark, an Apache project built on top of HDFS for running iterative operations in a distributed fashion. (It's a shame that I hadn't been aware of this project earlier.)

The cool thing about Spark is that it inherits the goods of Hadoop (i.e. HDFS) and re-engineers the rest. In particular:
  1. Iterative algorithms, which are the norm in ML, can keep their data in memory for their whole lifetime. No more reading from / writing to disk on each iteration (see the sketch below).
  2. Support for more operators beyond Map and Reduce, which can be chained and lazily evaluated. See more here. In a sense, Spark is similar to Parallel LINQ, but on HDFS data.
(These are the main reasons why serious machine learning practitioners don't use Hadoop beyond HDFS. Yes, stay away from the Hadoop hype. HDFS is cool. The rest of Hadoop is not.)
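To illustrate both points, here is a hedged PySpark (Python) sketch of an iterative algorithm, gradient descent for least squares, where the data set is cached in memory once and the per-iteration work is expressed as chained, lazily evaluated operators. The toy data, step size, and iteration count are made up; this is a sketch of the programming model, not production code.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-gd-sketch")

# Toy data: (features, label) pairs for a linear model y ~ w . x
rng = np.random.RandomState(0)
true_w = np.array([2.0, -3.0])
points = [(x, float(x.dot(true_w))) for x in rng.randn(1000, 2)]

data = sc.parallelize(points).cache()   # kept in memory across iterations

w = np.zeros(2)
lr = 0.1
n = data.count()
for _ in range(50):
    # Gradient of the squared error, computed with chained, lazy map/reduce.
    grad = data.map(lambda p: (p[0].dot(w) - p[1]) * p[0]) \
               .reduce(lambda a, b: a + b) / n
    w -= lr * grad

print("estimated weights:", w)
sc.stop()
```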

The not-so-cool thing is that the whole Hadoop technology stack is in Java. Anyone interested in building Spark .NET? If there are enough interested developers, I would join or initiate such a project.

Machine Learning on Sensor Data

To be (never) updated ...

Friday, April 12, 2013

Neural Network Best Practices

I have been consumed by work. That's why this blog doesn't get updated as often as I'd like.

Recently, I have been focusing on implementing a state-of-the-art (in both accuracy and performance) and generic neural net, as a predictor for general ML software. I have learned a ton while doing it and would like to share some best practices for implementing a deep neural net [1].


1. Neural Nets Basics

The basics of Neural Nets (NN) are well covered on Wikipedia, and there's no point repeating what the Wiki has already covered.
However, there are some subtle points that are worth mentioning.
  1. Misconception: backprop is not a training method for neural networks. It is simply a method, the standard method, to compute the gradient using the chain rule. I have explained it in slightly more detail here.
  2. Tip: implementing backprop correctly is key to having a correct neural net implementation. It's easy to introduce bugs when you add more complex connection types (e.g. convolutional, pooling, etc.) or other tricks (e.g. dropout). Since backprop is just one way to compute the gradient, always use another method to assert (when running in debug mode) that your gradient implementation is correct. The standard gradient-checking method is to use finite differences to calculate the gradient and compare that with the results obtained from backprop.
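As an illustration of the gradient-checking tip, here is a hedged numpy sketch for a single fully connected layer with a sigmoid activation and squared error: the backprop gradient is asserted against a central finite difference. The shapes, loss, and tolerance are arbitrary choices, not code from our tool.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, x, t):
    y = sigmoid(W.T @ x)            # forward pass
    return 0.5 * np.sum((y - t) ** 2)

def backprop_grad(W, x, t):
    y = sigmoid(W.T @ x)
    delta = (y - t) * y * (1 - y)   # chain rule through loss and sigmoid
    return np.outer(x, delta)       # dL/dW

def finite_diff_grad(W, x, t, eps=1e-5):
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (loss(Wp, x, t) - loss(Wm, x, t)) / (2 * eps)
    return g

rng = np.random.RandomState(0)
W = rng.randn(4, 3)                 # 4 inputs -> 3 outputs
x = rng.randn(4)
t = rng.rand(3)

assert np.allclose(backprop_grad(W, x, t), finite_diff_grad(W, x, t), atol=1e-6)
print("gradient check passed")
```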

2. Major techniques to improve prediction accuracy

2.1. Convolutional Neural Networks (CNN)
A few years ago, convolutional networks were considered a trick. Nowadays, they have become the norm for obtaining good results in areas such as computer vision and speech recognition, where the data has some local structure.

You can read more about CNN here or check out my slides.

2.2. Regularization
Neural nets are notorious for being prone to overfitting, due to their massive number of parameters (a.k.a. large capacity). Here are some effective regularization techniques.
  • \(L_1\) and \(L_2\) are the usual suspects for regularization and they still work well. In practice, \(L_1\) is sufficient [2]. The two are practically indistinguishable in terms of accuracy, but \(L_1\) gives rise to sparse weights (see the explanation here). Think of it as both a regularizer and a feature selector.
  • Dropout is a dead-simple but very clever use of model ensembling in a large neural net. You can read more about it in Hinton's paper. Implementing dropout is quite simple (a small sketch follows below), and we have seen great gains from using it.
It's always good to combine \(L_1\) and dropout in training neural nets.
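For reference, here is a hedged numpy sketch of (inverted) dropout applied to one hidden-layer activation; the keep probability and shapes are illustrative and this is not the exact formulation from Hinton's paper.

```python
import numpy as np

rng = np.random.RandomState(0)

def dropout_forward(h, p_keep=0.5, train=True):
    """Randomly zero units at training time; scale by 1/p_keep so the
    expected activation matches test time (no rescaling needed at prediction)."""
    if not train:
        return h, None
    mask = (rng.rand(*h.shape) < p_keep) / p_keep
    return h * mask, mask

def dropout_backward(grad_out, mask):
    # Only the units that were kept propagate gradient.
    return grad_out * mask

h = rng.randn(8, 16)                     # batch of 8, 16 hidden units
h_drop, mask = dropout_forward(h, p_keep=0.5, train=True)
grad_h = dropout_backward(np.ones_like(h), mask)
```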

2.3. Choice of activation function
Sigmoid and Tanh are the two most commonly used activation functions. However, they are not necessarily good choices for the hidden units. The reasons are:
  • They saturate quickly when the pre-activation value falls outside a narrow region around 0, i.e. the derivative almost vanishes. This means that the network easily gets stuck, or converges slowly, when the pre-activation values fall outside that region.
  • There's no reason that the hidden units should be restricted to (0,1) as in the Sigmoid function or (-1,1) as in the Tanh function.
The rectifier (ReLU) is becoming popular as an activation function. However, I find its theory dubious and my experiments have not shown that it is always better. That said, I'm experimenting with new activation functions. (Little trivia: I'm borrowing many ideas from my graduate work in computational wave propagation.)
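The saturation argument is easy to see numerically. Below is a small Python sketch (illustrative values only) that prints the derivatives of Sigmoid, Tanh, and the rectifier on a grid: away from 0 the first two almost vanish, while the rectifier's derivative stays at 1 for positive inputs.

```python
import numpy as np

z = np.linspace(-6, 6, 7)

sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)
relu = np.maximum(z, 0)

d_sigmoid = sigmoid * (1 - sigmoid)   # ~0 for large |z|: saturation
d_tanh = 1 - tanh ** 2                # same problem
d_relu = (z > 0).astype(float)        # constant 1 for z > 0

for name, d in [("sigmoid'", d_sigmoid), ("tanh'", d_tanh), ("relu'", d_relu)]:
    print(name, np.round(d, 4))
```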

3. Speed things up

This is what I want to talk about the most, because academic papers tend to focus on accuracy and ignore the efficiency aspects.

3.1. Use a highly optimized BLAS library
The computations at each layer in a neural network can be abstracted as matrix multiplications:
  • Forward propagation: \(y=W^{T}\times x\)
  • Back propagation: \(\nabla x=W\times\nabla y\)
  • Weights update: \(W+=\alpha\times x\times\nabla y^{T}\)
where
  • \(W\) of size \(m\times n\) is the weights matrix,
  • \(x\) of length \(m\) represents the source layer's nodes,
  • \(y\) of length \(n\) represents the destination layer's nodes,
  • and \(\alpha\) is the learning rate (assuming that we are using stochastic gradient descent).
Note that in order to use this one-matrix-multiplication abstraction, the source layer must contain the bias node [3].
For these matrix operations, it's recommended to use a highly optimized BLAS library such as Intel MKL [4] for running on CPUs, or NVIDIA's cuBLAS for running on GPUs.

The speed gain from abstracting the operations as matrix multiplies and using optimized BLAS libraries is very significant (tens to hundreds fold faster).
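Here is a hedged numpy sketch of the three operations above for one fully connected layer (numpy hands the multiplies to whatever BLAS it is linked against, which may or may not be MKL). The bias is folded in as an extra always-one node of the source layer, per note [3]; the sizes, learning rate, and sign convention are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
m, n = 5, 3                      # source size (incl. bias node), destination size
W = 0.1 * rng.randn(m, n)        # weights matrix, m x n
alpha = 0.01                     # learning rate

x = np.append(rng.randn(m - 1), 1.0)   # source layer's nodes, bias node = 1

# Forward propagation: y = W^T x
y = W.T @ x

# Back propagation: grad_x = W grad_y  (grad_y comes from the next layer)
grad_y = rng.randn(n)
grad_x = W @ grad_y

# Weights update: W += alpha * x * grad_y^T
# (the sign convention of the error term determines + or -)
W += alpha * np.outer(x, grad_y)
```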

3.2. Convolution Unrolling
Here, I assume that you are already familiar with the convolutional network mentioned above and have implemented a basic one. If so, you will have noticed that the matrix-multiply abstraction above doesn't seem to apply. Furthermore, the convolution operations are also slow due to index indirection.

Actually, with the convolution unrolling technique, proposed in this paper by Patrice Simard and John Platt, you can abstract the operations in training a convolutional network as matrix-matrix multiplies.

The layer-wise operations can be described as follows.
$$\begin{array}{cccc}
\mbox{Forward propagation:}\hspace{1em} & Y & = & W^{T}\times X\\
\mbox{Back propagation:}\hspace{1em} & \nabla X & = & W\times\nabla Y\\
\mbox{Weights update:}\hspace{1em} & W & += & \alpha\times X\times\nabla Y^{T}
\end{array}$$
where
  • \(W\) is the weights matrix of size \(k\times m\). Each column of \(W\) corresponds to a kernel and with this layout, the last bias row can be omitted conveniently in the backprop.
  • \(X\) is the unrolled input of size \(k\times n\). Each column of \(X\) corresponds to a window in the original input.
  • \(Y\) is the output matrix of size \(m\times n\). Each row of \(Y\) corresponds to the output features for a particular feature map.
  • \(\nabla X\) is the unrolled (i.e. pre-accumulated) errors of the input layer.
  • \(\nabla Y\) is the errors of the output layer.
Here,
  • \(m\) is the number of feature maps (map count);
  • \(n\) is the number of output features per map;
  • \(k\) equals the kernel size plus one (the bias term).
Rolling and unrolling
In order to represent the computation as above, we need two extra steps.
  1. Unroll the input vector/matrix in each forward propagation. This process, which involves duplicating many elements, stores \(X\) consecutively in memory.
  2. Roll (accumulate) the back-propagated errors matrix \(\nabla X\) into a flat errors vector (of the size of the source layer), which can then be passed to the earlier layers.
The rolling and unrolling can be done efficiently by pre-computing the index map which maps from the original space to the unrolled space and vice versa.
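Here is a hedged numpy sketch of the unrolling/rolling described above for a single-channel input with stride 1 and no padding: a precomputed index map gathers each window into a column of \(X\), a bias row of ones is appended, the forward pass becomes one matrix multiply, and the back-propagated errors are accumulated back onto the flat input through the same index map. Sizes are arbitrary and this is not the code from the Simard–Platt paper.

```python
import numpy as np

def build_index_map(in_h, in_w, k_h, k_w):
    """Indices (into the flattened input) of every window, one window per column."""
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            patch = [(i + di) * in_w + (j + dj)
                     for di in range(k_h) for dj in range(k_w)]
            cols.append(patch)
    return np.array(cols).T          # shape (k_h*k_w, out_h*out_w)

rng = np.random.RandomState(0)
in_h = in_w = 6
k_h = k_w = 3
m = 4                                # number of feature maps

idx = build_index_map(in_h, in_w, k_h, k_w)
image = rng.randn(in_h, in_w)

# Unroll: gather windows, then append the bias row of ones -> X is k x n.
X = image.ravel()[idx]
X = np.vstack([X, np.ones(X.shape[1])])

W = 0.1 * rng.randn(k_h * k_w + 1, m)    # one kernel (plus bias) per column

Y = W.T @ X                              # forward pass as one matrix multiply

# Roll: accumulate the unrolled errors grad_X (without the bias row) back
# onto the flat input-error vector using the same index map.
grad_Y = rng.randn(*Y.shape)
grad_X = W @ grad_Y
flat_err = np.zeros(in_h * in_w)
np.add.at(flat_err, idx.ravel(), grad_X[:-1].ravel())
```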

3.3. More Parallelism
MKL (or CUDA for GPU computing) already exploits a lot of parallelism. However, there are still parallelism opportunities left on the table.

Due to the high cost of a high-performance CPU cluster and the inefficiency of inter-machine communication, we'll focus our parallelism effort on a single machine with multiple GPU cards.

To be continued...

Notes
[1] There are many tricks in NN. There are even conferences (such as ICLR) dedicated to new little techniques. Here, I'll only highlight some major ones that work in almost all situations and sometimes yield significant gains in accuracy and performance. Also, this post focuses on implementation tips, not methods.
[2] Optimizers, such as Stochastic Gradient Descent or L-BFGS, should be implemented independently from the learning methods (e.g. Neural Network or Logistic Regression). That way, if you work with multiple learners, all of them can reuse the optimization code, and any learner can expose different optimizers for users to choose from if they want.
[3] A traditional approach is to treat biases as another set of parameters.

Thursday, January 17, 2013

How to understand back propagation algorithm in Neural Networks



I'm not pleased with how back propagation is explained on the internet (e.g. this one). Those explanations mistakenly present backprop as gradient descent, or an optimization algorithm, or a training algorithm for neural networks. It is not.

Backprop is simply a method to compute the partial derivatives (or gradient) of a function that has the form of a function composition (as in Neural Nets). When you solve an optimization problem using a gradient-based method (gradient descent is just one of them), you want to compute the gradient of the objective function at each iteration.

For a neural network, the objective function has the form of a composition. How do you compute its gradient? There are two common ways to do it.

  1. Analytic differentiation. You know the form of the function, so just compute the derivatives using the chain rule (basic calculus). The long mathematical formulas in tutorials or papers may look scary (because multivariate calculus usually looks ugly). However, if you write down the differentiation steps yourself, you should find that it is very straightforward.
  2. Numerical differentiation using finite differences. This method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters. That is expensive compared to analytic differentiation. Finite differences are, however, commonly used to validate a backprop implementation, for debugging purposes. (A tiny sketch of both approaches appears below.)

To summarize: backprop is the most straightforward way to compute the partial derivatives of a function using the chain rule. It is really first-year Calculus plus proof by induction plus a lot of patience :).
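Here is a tiny hedged Python sketch of both approaches for a scalar composition \(f(x)=\sin(e^{x^{2}})\): the analytic gradient is obtained by applying the chain rule factor by factor (exactly what backprop does, just without the bookkeeping for many parameters), and a central finite difference is used only as a check. The particular functions are arbitrary.

```python
import math

def f(x):
    return math.sin(math.exp(x ** 2))

def analytic_grad(x):
    # Chain rule, one factor per layer of the composition:
    u = x ** 2          # innermost layer
    v = math.exp(u)     # middle layer
    dv = math.cos(v)    # d sin(v)/dv
    du = dv * v         # * d exp(u)/du
    return du * 2 * x   # * d(x^2)/dx

def finite_diff_grad(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.7
print(analytic_grad(x), finite_diff_grad(x))   # should agree to ~1e-8
```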

Monday, December 10, 2012

Highlights of NIPS 2012


NIPS 2012 is the first Machine Learning conference [1] that I've attended, and it was well worth the time and money invested. Here are my brief-but-not-so-brief highlights of the academic content of the conference and workshops [3, 4].

Overall
  1. The workshop sessions were amazing, with many interesting topics and talks. They offered both breadth (the variety of the workshops) and depth, with plenty of time for interactive discussion.
  2. Deep Neural Networks (DNNs), not surprisingly, received a lot of buzz and attention [5].
  3. Besides DNNs, here are some major topics at NIPS this year.
    • Latent Variable Models. The truth is, this is not a topic that I'm familiar with yet. The good thing is, after this conference, I know that there is quite a lot of interest in it. Another good thing: there seems to be a lot of numerical mathematics in this research, which I am actually familiar with. Anyway, this is a topic that I'll definitely investigate soon.
    • Kernel methods: this and this workshop
    • Graphical Models and Probabilistic Programming
    • ... and perhaps some other cool topics which I have missed.
Deep Neural Nets
(aka. Deep Learning, to a bunch)
  • Mathematical Foundation. There are plenty of reasons for the hype in deep networks, which I won't describe here. One of the major academic interests in DNNs at the conference is about its theoretical foundation. Deep networks work so well against many benchmarks but no one can give a solid explanation why. This lack of a mathematical/statistical foundation for deep networks (or neural networks in general) was questioned many times during the conference and the workshop. Fortunately for the field, as Yann Lecun mentioned in the panel discussion, it has recently attracted the attention of some well-known applied mathematicians (such as Stan Osher and Stephane Mallat).
  • Why it works well. To me, the fact that neural networks with many layers perform well can simply be explained by their ability to use many parameters. Basically, deep networks allow you to fit a training set, in a nonlinear way, with a huge number of parameters. Loosely speaking, the more parameters you use, the better you can fit the data.
  • Overfitting issue. Of course, when you use so many parameters, overfitting will be a concern and it is true that deep networks are prone to overfitting. That's why there was a talk by G Hinton, the master of the art, about dropout technique in neural networks, which targets the overfitting problem.
  • Fine tuning. It follows from the above that there is a lot of engineering (e.g. fine tuning) in designing a deep network. If we connect all the dots, there seems to be nothing particularly new or interesting about DNNs. The basic idea is: use a lot of parameters, if you have the computational resources to do so, and engineer the network so that overfitting can be reduced.
  • Distributed DNNs. Another source of hype around DNNs is the distributed training work by Google. Along with the mathematical foundation research, distributed DNNs will be a major research direction.
    • Here is my fun hypothesis: if your method performs (slightly) better than the benchmarks, its hype will ironically be an increasing function with respect to its complexity and the required computational resource (e.g. 16000 cores and a new parallelism architecture will generate a great deal of hype).
Kernel Methods
Despite the hype around DNNs, kernel methods are still a relevant and powerful class of methods (and with a strong mathematical foundation). The major focus of kernel methods research is now on low-rank approximation of the kernel matrix, which I'm also pursuing.
  • The fact that Leslie Greengard, a computational physicist well known for co-inventing the Fast Multipole Method and who hasn't worked in machine learning at all, was invited to give a talk at NIPS 2012 is an indication of the topic's importance.
    • Unfortunately, he had to cancel at the last minute due to an injury, and his talk was replaced by ... well, a deep learning talk (by G. Hinton).
  • Even in Leslie Greengard's absence, there was still more than one talk (one by Francis Bach and one by Alex Smola) about low-rank approximation methods.
Graphical Models and Probabilistic Programming
Disclaimer: I've only recently learned about graphical models, and here are my takes on this approach.
  • Strength: it is the most natural and most accurate way to infer the distribution of an unknown from the observed data, provided you can construct the graph that represents the probabilistic relationships among the variables (see the tiny example after this list).
  • Weakness: that condition is also the main weakness of graphical models. You cannot just throw in features and data and hope to get a model back. You have to construct a graph over the features. In many problems with a lot of features (or variables), it's hard or even impossible to construct such a graph; we don't even know how the variables correlate.
That said, graphical models are a powerful way of modeling if the number of variables is not too large and if we want to inject domain knowledge about the data. I think that there are even more problems of this type than very large-scale problems.
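As a toy illustration of the strength above, here is a tiny Python sketch of exact inference on a two-node graph Disease → Test: once the graph and its conditional probabilities are specified, the posterior of the unknown follows mechanically from Bayes' rule. All numbers are made up.

```python
p_disease = 0.01                     # prior P(D = 1)
p_pos_given_disease = 0.95           # P(T = + | D = 1)
p_pos_given_healthy = 0.05           # P(T = + | D = 0)

# Marginal probability of a positive test, then Bayes' rule.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print("P(disease | positive test) =", round(p_disease_given_pos, 3))
# ~0.161: inference is exact and easy *because* the graph (here trivial) and
# its conditional probabilities were specified up front, which is exactly the
# requirement flagged as the weakness above.
```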

Unsupervised vs. Supervised Learning
  • Unsupervised feature learning. In his talk, Andrew Ng said that Deep Learning is sort of a misnomer. What people should care about is unsupervised feature learning, not deep this or deep that. I agree with him on this. I further think that unsupervised feature learning should be decoupled from deep networks (aka deep learning). 
    • It may be that deep networks are currently the best technique for unsupervised feature learning. However, they are not the only technique, or even the first one. For example, people have long used PCA (or Kernel PCA for nonlinear analysis).
    • Unsupervised feature learning is not the only application of deep networks. Some recent successes of deep networks come from straight supervised learning on the data (without the pre-training, i.e. unsupervised feature learning).
  • Unsupervised vs. Supervised Learning. This is a topic of frequent debate at the conference.
    • Arguments for supervised learning: you can get much more information from labeled data
    • Arguments for unsupervised feature learning: you can learn more from more data and the amount of unlabeled data is huge compared to labeled data
  • Semi-supervised learning. Both sides are valid and here's my take: we should do both unsupervised and supervised feature learning simultaneously, or so-called semi-supervised learning.
----------------------

[1] For those who haven't heard of NIPS, it's one of the main annual conferences (if not the main conference) in Machine Learning.
[3] Disclaimer: this analysis is through the lens of a computational-mathematician convert who has recently become very interested in machine learning.
[4] Not to mention that the whole conference was very well organized. Kudos to the organizers.
[5] Just in case you wonder why it is not surprising, see the media hype: here, here, or here.

Thursday, November 22, 2012

Dimension reduction in machine learning

 
An application of dimension reduction in face recognition. Instead of using thousands of pixels to capture a face image, we can use ~100 eigenface images that, when combined linearly, can reasonably approximate any face image.


Here's the standard definition of dimension reduction, which I copy from Wikipedia:
In machine learning (or data analysis in general), dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Why is dimension reduction important in data analysis? In layman's terms, dimension reduction is the process of filtering out noisy signals and keeping only the high-quality ones, either without transforming the original signals (feature selection) or by transforming them (feature extraction). Dimension reduction thus has two main benefits:
  1. Better accuracy. By learning from high quality signals, you can predict on new data more accurately by reducing overfitting
  2. Performance, performance. You can analyze data more quickly in lower dimension. This may be crucial when you have to deal with lots of data.
This is just the introduction to my series on dimension reduction. In future posts, I'll write about:
  1. Feature selection methods
    • Filter methods
    • Wrapper methods
    • Embedded methods
  2. Feature extraction methods
    • Principal component analysis (PCA)
    • Fisher linear discriminant analysis (LDA)
    • Other nonlinear methods
Feature extraction methods tend to be more powerful in general settings, but they are also harder to understand and implement. Familiarity with linear algebra and matrix decompositions (e.g. eigenvalue or singular value decomposition) is extremely useful for understanding PCA, LDA, or machine learning in general.
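As a small preview of the feature extraction posts, here is a hedged scikit-learn sketch in the spirit of the eigenface example in the figure caption above: PCA projects face images from thousands of pixels down to ~100 components, whose principal directions are the eigenfaces. The dataset choice and the number of components are illustrative.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()            # 400 images of 64x64 = 4096 pixels
X = faces.data                            # shape (400, 4096)

pca = PCA(n_components=100, whiten=True)
X_reduced = pca.fit_transform(X)          # shape (400, 100)

eigenfaces = pca.components_.reshape(-1, 64, 64)   # the "eigenface" images
print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```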

-------------------------

Note that none of what I'm writing in this series is new. Instead, I just present it in a way that hopefully has some value to certain readers. Of course, instead of writing everything in my own words, I'll reuse as much public material as possible, where I see fit.