Wednesday, December 11, 2013

Highlights of NIPS 2013


  1. Deep Learning (or Deep Neural Network or DNN) is again the most trendy topic of the conference. Its workshop session is perhaps twice (or more than that) as big as the one last year and it was packed for most of the day. Interestingly, Mark Zuckerberg of Facebook stopped by for a Q&A and then a panel discussion session. His visit was mostly to announce the new AI Lab of Facebook. For technical highlights, see below.
  2. Distributed machine learning is another topic of great interest.
  3. Growing markets and interests in predictive analytics on sensor data (e.g. activity detection on mobile phones or wearable devices), in ML for Health Care, and in ML for Education.
  4. And there are certainly bad-ass research in other areas which I have missed. Among the topics of my interest, Optimization (particularly non-convex optimization) however hasn't made much progress.  

Deep Neural Nets

  • Natural Language Processing. Application of DNN in NLP is the theme of the deep learning research this year. This is natural because NLP is the holy-grail of machine learning research. DNN has already convincingly demonstrated its power in Computer Vision and Speech Recognition. There were some cool research using DNN in NLP such as Compositional Natural Language Parsing with Compositional Vector Grammars by a team at Stanford (led by Richard Socher) and Word2Vec project at Google (led by Tomas Mikolov). I think that this is just the beginning though.
  • Computer Vision. New benchmark for ImageNet has been established by Matt Zeiler et al. although it's not a big improvement from the previous record set by Alex Krizhevsky et al. (see their famous paper). Although Matt's work received a lot of attention from the community (come on, he set a new record), I was slightly disappointed. The spirit of his paper is about understanding convolutional neural networks but he did not explain why his network (which is a customized version of Alex's network) yields better results. He also wasn't able to rigorously explain certain mysteries (such as why rectified linear units work so well) in training neural networks.
  • Non-convex Optimization. This is the topic that I care the most in DNN because solving a DNN is a non-convex optimization problem. The current techniques only try to find a local minimum, at best. Here's an experiment that my team did for activity detection on mobile devices using sensory data. After features extraction using PCA and feeding the extracted model into a neural network, we got very good results. We then simulated the PCA feature extraction by introducing another layer to the neural network. We expected that the the optimized weights of the first layer should be identical to the PCA weights, if not better. However, we got worse results. This indicates that the optimizer converged to a non-so-good local optimum.

Distributed Machine Learning

There're many interesting posters/talks on the topic of distributed and large scale machine learning at NIPS this year (such as this, this, and this workshop). However, what really excited me is not the work presented at NIPS but Spark, an Apache project built on top of HDFS for running distributed iterative operations distributedly. (It's a shame that I hadn't been aware of this project earlier.)

The cool thing about Spark is that it inherits the goods of Hadoop (i.e. HDFS) and re-engineers the rest. In particular:
  1. Iterative algorithms, which is the norm in ML, can be run in memory during their lifetime. No more reading from / writing to the disk for each iteration.
  2. Support of more operators, beyond Mapping and Reducing, which can be piped and lazily evaluated. See more here. In a sense, Spark is similar to Parallel LINQ but on HDFS data.
These are the main reasons why serious machine learning practitioners don't use Hadoop beyond HDFS. Yes, stay away from the Hadoop hype. HDFS is cool. The rest of Hadoop is not so.

Machine Learning on Sensor Data

To be (never) updated ...

Saturday, September 21, 2013

[C#] Class attributes vs. interface's properties

In C#, attributes are meta-information attached to a code entity (fields/properties, types, or methods). Attributes are generally very useful. However, are they useful on classes? One could argue that instead of using attributes, we can just have that class implement some interface's properties. See the code below for example. Eric Lippert has also advocated for using properties.

Here, I'd like to make a stronger case for using class attributes.
  1. Readability: an attribute indicates that the meta-information is static, i.e. it is bound to the class, not instances of that class.
  2. Interface doesn't support default values for properties. So any type that implements that interface has to implement all the properties. We can make the properties default in (abstract) classes but not in interfaces but C# (reasonably) doesn't support multiple class inheritance.
  3. Attribute may contain logic (see the last example below) while an interface cannot. Furthermore, the same attribute (and any embedded logic) may be used for both type and fields/properties.
Of course, if the meta information is frequently accessed by other code, then we absolutely should use properties because accessing properties is less taxing and also faster (see the example below). 

  1 /* Using attribute */ 
  2     class FriendlyNamesAttribute : Attribute
  3     { 
  4         public string ShortName;
  6         public string LongName;
  8         public string HelpText;
  9     } 
 11     [FriendlyNames(ShortName = "miner") 
 12     class DataMiner  
 13     {  
 14         ... // Class logic here 
 15     } 
 17     // Accessing the meta-information 
 18     var attr = obj.GetType().GetCustomAttribute<FriendlyNamesAttribute>(); 
 19     if (attr != null) 
 20     { 
 21         var shortName = attr.ShortName;
 22         ... 
 23     } 
 25 /* Using Properties */ 
 26     interface IHasFriendlyNames  
 27     { 
 28         public string ShortName { get; }
 29         public string LongName { get; }
 30         public string HelpText { get; }
 31     } 
 33     class DataMiner : IHasFriendlyNames
 34     { 
 35         ... // Class logic here 
 37         public string ShortName { get { return "Miner"; } } 
 39         public string LongName { get { return null; } }
 41         public string HelpText { get { return null; } }
 42     } 
 44     var friendlyObj = obj as IHasFriendlyName;
 45     if (friendlyObj != null)
 46     { 
 47         var shortName = friendlyObj.ShortName;
 48         ... 
 49     } 
 51 /* A slightly more complex FriendlyName attribute */
 52     public abstract class NameAttribute : Attribute
 53     { 
 56         public string ShortName; 
 57         public string LongName;
 60         private static Regex _shortRegex = new Regex("[a-z]"); 
 61         private static Regex _longRegex = new Regex("(?<=[a-z])([A-Z])");
 63         /// <summary> 
 64         /// Given a field/class name, automatically shortify it
 65         /// </summary> 
 66         public string GetShortName(string rawName)
 67         { 
 70             ShortName = ShortName ?? _shortRegex.Replace(rawName, "");
 73             return ShortName; 
 74         } 
 76         /// <summary> 
 77         /// Given a field/class name, automatically prettify it
 78         /// </summary> 
 79         public string GetLongName(string rawName)
 80         { 
 83             LongName = LongName ?? _longRegex.Replace(rawName, " $1");
 86             return LongName; 
 87         } 
 88     }

Friday, April 12, 2013

Neural Network Best Practices

I have been ecstatic by work. That's why this blog doesn't get updated as often as I want.

Recently, I have been focusing on the implementation of a state-of-the-art (in both accuracy and speed) and generic neural net, as a predictor for a general ML software. I have learned a ton while doing it and would like to share some of the best practices when implementing a deep neural net [1].

1. Neural Nets Basics

The basics of Neural Nets (NN) is well written on Wikipedia. There's no point to repeat what Wiki has already covered. 
However, there are some subtle points that are worth mentioned.
  1. Misconception: backprop is not the training algorithm for neural network. It is just a method to compute the weights gradient using chain rules. I have explained it in slightly more details here.
  2. Tips: Implementing backprop correctly is key to have a correct neural net implementation. It's easy to have bugs in the implementation when you introduce more complex connection types (e.g. convolutional, pooling, etc.) or other tricks (e.g. dropout). As backprop is one way to compute the gradient, always use another method to assert (when running in debug mode) that your gradient computation is correct. The standard gradient checking method is to use finite difference to calculate the gradient and compare that with the results obtained from backprop.

2. Major techniques to improve prediction accuracy

2.1. Convolutional Neural Networks (CNN)
Just a couple of years ago, convolutional network was considered a trick. Nowadays, it has become the norm to obtain good results in areas such as computer vision and speech recognition, where data has some spatial or temporal structure.

You can read more about CNN here or check out my slides.

2.2. Regularization
Neural net is notorious for being prone to overfitting, due to its massive number of parameters (a.k.a. large capacity). A simple but effective way to prevent overfitting is using dropout. Dropout is a clever use of model ensembling in a large neural net. You can read more about it in Hinton's paper. Implementing dropout is quite simple and we have seen great gains from using it.

2.3. Choice of activation function Sigmoid and Tanh are the two most commonly used activation functions. However, they are not necessarily good choices as activation functions for the hidden units. The reasons are:
  • They saturate quickly when the output value is not a narrow region (around 0), i.e. the derivative almost vanishes. This means that the network easily gets stuck or converges slowly when the pre-activated values fall outside of the region. 
  • There's no reason that the hidden units should be restricted to (0,1) as in the Sigmoid function or (-1,1) as in the Tanh function.
Rectifier is becoming popular as an activation function. However, I find its theory dubious and my experiments have not shown that it is always better. That said, I'm experimenting with new activation functions. (Little trivia: I'm borrowing many ideas from my graduate work in computational wave propagation.)

3. Speed things up

This is what I want to talk about more because academic papers tend to focus on accuracy and ignore the efficiency aspects.

3.1. Use a highly optimized BLAS Library
The computations at each layer in a neural network can be abstracted as matrix multiplications
  • Forward propagation: \(y=W^{T}\times x\)
  • Back propagation: \(\nabla x=W\times\nabla y\)
  • Weights update: \(W \leftarrow W+\alpha\times x\times \nabla y^{T}\)
  • \(W\) of size \(m\times n\) is the weights matrix,
  • \(x\)  of length \(m\) represents the source layer's nodes,
  • \(y\) of length \(n\) represents the destination layer's nodes,
  • and \(\alpha\) is the learning rate (assuming that we are using stochastic gradient descent).
Note this abstraction assumes that the source layer contains the bias node, which is 1 [2]. For these matrix operations, it's recommended use a highly-optimized BLAS library such as Intel MKL [4] for running on CPUs or NVIDIA's CuBLAS for running on GPUs. The speed gain from abstracting the operations as matrix multiplies and using optimized BLAS libraries is very significant (tens to hundreds fold faster).

3.2. Convolution Unrolling
Here, I assume that you are already familiar with convolutional network mentioned above and have implemented a basic one. If so, you would notice that the matrix multiply abstraction above doesn't seem to apply. Furthermore, the convolution operations are also slow due to index indirection.

Actually, with the convolution unrolling technique, proposed in this paper by Patrice Simard and John Platt, you can abstract the operations in training a convolutional network as matrix-matrix multiplies.

The layer-wise operations can be described as followed.
\mbox{Forward propagation:}\hspace{1em} & Y & = & W^{T}\times X\\
\mbox{Back propagation:}\hspace{1em} & \nabla X & = & W\times\nabla Y\\
\mbox{Weights update:}\hspace{1em} & W & \leftarrow & W + \alpha\times X\times\nabla Y^{T}
  • \(W\) is the weights matrix of size \(k\times m\). Each column of \(W\) corresponds to a kernel and with this layout, the last bias row can be omitted conveniently in the backprop.
  • \(X\) is the unrolled input of size \(k\times n\). Each row of \(X\) corresponds to a window in the original input.
  • \(Y\) is the output matrix of size \(m\times n\). Each row of \(Y\) corresponds to the output features for a particular feature map.
  • \(\nabla X\) is the unrolled (i.e. pre-accumulated) errors of the input layer.
  • \(\nabla Y\) is the errors of the output layer.
  • \(m\) is the number of feature maps (map count);
  • \(n\) is the number of output features per map;
  • \(k\) equals the kernel size plus one (the bias term).
Rolling and unrolling
In order to represent as above, we need two extra steps.
  1. Unroll the input vector/matrix in each forward propagation. This process, which involves duplication of many elements, is to store \(X\) contiguously in memory.
  2. Roll (accumulate) the back-propagated errors matrix \(\nabla X\) to a flat errors vector (of size equal the source destination), which can then be passed to the earlier layers.
The rolling and unrolling can be done efficiently by pre-computing the index map which maps from the original space to the unrolled space and vice versa.

3.3. More Parallelism
MKL (or CUDA for GPU computing) already does a lot of parallelism. However, there are still parallelism opportunities left on the table.

Due to the high cost of a high performance CPU cluster and due to the inter-machine communication inefficient, we'll focus our parallelism effort on a single machine with multiple GPU cards.

To be continued...

[1] There're many tricks in NN. There're even conferences (such as ICLR) dedicated to new little techniques. Here, I'll only highlight some major ones that work in almost all situations and sometimes yield significant gains in accuracy and performance. Also, this post focuses on implementation tips, not the theory.
[2] A traditional approach is to treat biases as another set of parameters.

Thursday, January 17, 2013

How to understand back propagation algorithm in Neural Networks

I'm not pleased with how back propagation is explained on the internet (e.g. this one). They mistakenly view backprop as gradient descent, or an optimization algorithm, or a training algorithm for neural networks. It is not.

Backprop is simply a method to compute the partial derivatives (or gradient) of a function, which has the form as a function composition (as in Neural Nets). When you solve an optimization problem using a gradient-based method (gradient descent is just one of them), you want to compute the function gradient at each iteration.

For a neural network, the objective function has the form of a composition. How do you compute the gradient? Two common ways to do it.

  1. Analytic differentiation. You know the form of the function. So just compute the derivatives using the chain rule (basic calculus). The long mathematical formulas in tutorials or in papers may look scary (because multivariate calculus usually looks ugly). However, if you write down the differentiation steps yourself, you should find that it is very straight forward.
  2. Numerical differentiation using finite difference. This method is computationally expensive because the number of function evaluation is O(N), where N is the number of parameters. This is expensive, compared to analytic differentiation. Finite difference, however, is commonly used to validate a backprop implementation, for debugging purpose.

To summarize: backprop is the straightest way to compute the partial derivatives of a function, using the chain rule. It is really first year Calculus plus proof-by-induction plus a lot of patience :).