## Friday, April 12, 2013

### Neural Network Best Practices

I have been ecstatic by work. That's why this blog doesn't get updated as often as I want.

Recently, I have been focusing on the implementation of a state-of-the-art (in both accuracy and speed) and generic neural net, as a predictor for a general ML software. I have learned a ton while doing it and would like to share some of the best practices when implementing a deep neural net [1].

### 1. Neural Nets Basics

The basics of Neural Nets (NN) is well written on Wikipedia. There's no point to repeat what Wiki has already covered.
However, there are some subtle points that are worth mentioned.
1. Misconception: backprop is not the training algorithm for neural network. It is just a method to compute the weights gradient using chain rules. I have explained it in slightly more details here.
2. Tips: Implementing backprop correctly is key to have a correct neural net implementation. It's easy to have bugs in the implementation when you introduce more complex connection types (e.g. convolutional, pooling, etc.) or other tricks (e.g. dropout). As backprop is one way to compute the gradient, always use another method to assert (when running in debug mode) that your gradient computation is correct. The standard gradient checking method is to use finite difference to calculate the gradient and compare that with the results obtained from backprop.

### 2. Major techniques to improve prediction accuracy

2.1. Convolutional Neural Networks (CNN)
Just a couple of years ago, convolutional network was considered a trick. Nowadays, it has become the norm to obtain good results in areas such as computer vision and speech recognition, where data has some spatial or temporal structure.

You can read more about CNN here or check out my slides.

2.2. Regularization
Neural net is notorious for being prone to overfitting, due to its massive number of parameters (a.k.a. large capacity). A simple but effective way to prevent overfitting is using dropout. Dropout is a clever use of model ensembling in a large neural net. You can read more about it in Hinton's paper. Implementing dropout is quite simple and we have seen great gains from using it.

2.3. Choice of activation function Sigmoid and Tanh are the two most commonly used activation functions. However, they are not necessarily good choices as activation functions for the hidden units. The reasons are:
• They saturate quickly when the output value is not a narrow region (around 0), i.e. the derivative almost vanishes. This means that the network easily gets stuck or converges slowly when the pre-activated values fall outside of the region.
• There's no reason that the hidden units should be restricted to (0,1) as in the Sigmoid function or (-1,1) as in the Tanh function.
Rectifier is becoming popular as an activation function. However, I find its theory dubious and my experiments have not shown that it is always better. That said, I'm experimenting with new activation functions. (Little trivia: I'm borrowing many ideas from my graduate work in computational wave propagation.)

### 3. Speed things up

This is what I want to talk about more because academic papers tend to focus on accuracy and ignore the efficiency aspects.

3.1. Use a highly optimized BLAS Library
The computations at each layer in a neural network can be abstracted as matrix multiplications
• Forward propagation: $$y=W^{T}\times x$$
• Back propagation: $$\nabla x=W\times\nabla y$$
• Weights update: $$W \leftarrow W+\alpha\times x\times \nabla y^{T}$$
where
• $$W$$ of size $$m\times n$$ is the weights matrix,
• $$x$$  of length $$m$$ represents the source layer's nodes,
• $$y$$ of length $$n$$ represents the destination layer's nodes,
• and $$\alpha$$ is the learning rate (assuming that we are using stochastic gradient descent).
Note this abstraction assumes that the source layer contains the bias node, which is 1 [2]. For these matrix operations, it's recommended use a highly-optimized BLAS library such as Intel MKL [4] for running on CPUs or NVIDIA's CuBLAS for running on GPUs. The speed gain from abstracting the operations as matrix multiplies and using optimized BLAS libraries is very significant (tens to hundreds fold faster).

3.2. Convolution Unrolling
Here, I assume that you are already familiar with convolutional network mentioned above and have implemented a basic one. If so, you would notice that the matrix multiply abstraction above doesn't seem to apply. Furthermore, the convolution operations are also slow due to index indirection.

Actually, with the convolution unrolling technique, proposed in this paper by Patrice Simard and John Platt, you can abstract the operations in training a convolutional network as matrix-matrix multiplies.

The layer-wise operations can be described as followed.
$$\begin{array}{cccc} \mbox{Forward propagation:}\hspace{1em} & Y & = & W^{T}\times X\\ \mbox{Back propagation:}\hspace{1em} & \nabla X & = & W\times\nabla Y\\ \mbox{Weights update:}\hspace{1em} & W & \leftarrow & W + \alpha\times X\times\nabla Y^{T} \end{array}$$
where
• $$W$$ is the weights matrix of size $$k\times m$$. Each column of $$W$$ corresponds to a kernel and with this layout, the last bias row can be omitted conveniently in the backprop.
• $$X$$ is the unrolled input of size $$k\times n$$. Each row of $$X$$ corresponds to a window in the original input.
• $$Y$$ is the output matrix of size $$m\times n$$. Each row of $$Y$$ corresponds to the output features for a particular feature map.
• $$\nabla X$$ is the unrolled (i.e. pre-accumulated) errors of the input layer.
• $$\nabla Y$$ is the errors of the output layer.
Here,
• $$m$$ is the number of feature maps (map count);
• $$n$$ is the number of output features per map;
• $$k$$ equals the kernel size plus one (the bias term).
Rolling and unrolling
In order to represent as above, we need two extra steps.
1. Unroll the input vector/matrix in each forward propagation. This process, which involves duplication of many elements, is to store $$X$$ contiguously in memory.
2. Roll (accumulate) the back-propagated errors matrix $$\nabla X$$ to a flat errors vector (of size equal the source destination), which can then be passed to the earlier layers.
The rolling and unrolling can be done efficiently by pre-computing the index map which maps from the original space to the unrolled space and vice versa.

3.3. More Parallelism
MKL (or CUDA for GPU computing) already does a lot of parallelism. However, there are still parallelism opportunities left on the table.

Due to the high cost of a high performance CPU cluster and due to the inter-machine communication inefficient, we'll focus our parallelism effort on a single machine with multiple GPU cards.

To be continued...

Notes
[1] There're many tricks in NN. There're even conferences (such as ICLR) dedicated to new little techniques. Here, I'll only highlight some major ones that work in almost all situations and sometimes yield significant gains in accuracy and performance. Also, this post focuses on implementation tips, not the theory.
[2] A traditional approach is to treat biases as another set of parameters.