I have been swamped with work. That's why this blog doesn't get updated as often as I would like.

Recently, I have been focusing on the implementation of a state-of-the-art (in both the accuracy and performance dimensions) and generic neural net, as a predictor for general ML software. I have learned a ton while doing it and would like to share some best practices for implementing a deep neural net [1].

### 1. Neural Nets Basics

The basics of Neural Nets (NN) are well covered on Wikipedia. There's no point in repeating what the wiki has already covered.

However, there are some subtle points that are worth mentioning.

- Misconception: backprop is not a training method for neural networks. It is simply a method, the standard one, to compute the gradient using the chain rule. I have explained it in slightly more detail here.
- Tips: Implementing backprop correctly is key to a correct neural net implementation. It's easy to introduce bugs when you add more complex connection types (e.g. convolutional, pooling, etc.) or other tricks (e.g. dropout). Since backprop is only one way to compute the gradient, always use another method to assert (when running in debug mode) that your gradient implementation is correct. The standard gradient checking method is to compute the gradient via finite differences and compare it with the results obtained from backprop.
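A minimal sketch of such a gradient check in numpy, for a toy one-layer sigmoid net with squared-error loss (the layer sizes and loss are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # weights: 3 inputs -> 2 outputs
x = rng.standard_normal(3)        # one input sample
t = rng.standard_normal(2)        # target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W):
    y = sigmoid(W.T @ x)
    return 0.5 * np.sum((y - t) ** 2)

# Analytic gradient via backprop (chain rule): dL/dW = x * delta^T,
# where delta is the error signal at the output.
y = sigmoid(W.T @ x)
delta = (y - t) * y * (1.0 - y)
grad_backprop = np.outer(x, delta)

# Numeric gradient via central finite differences, one weight at a time.
eps = 1e-5
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

# In debug builds, assert the two gradients agree to high precision.
assert np.allclose(grad_backprop, grad_fd, atol=1e-7)
```

The finite-difference pass is O(number of weights) forward evaluations, so run it only on small nets or small weight subsets in debug mode, never in production training.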

### 2. Major techniques to improve prediction accuracy

**2.1. Convolutional Neural Networks (CNN)**
A few years ago, convolutional networks were considered a trick. Nowadays, they have become the **norm** for obtaining good results in areas such as computer vision and speech recognition, where the data has some local structure.

You can read more about CNN here or check out my slides.

**2.2. Regularization**
Neural nets are notorious for being prone to overfitting, due to their massive number of parameters (a.k.a. large capacity). Here are some effective regularization techniques.

- \(L_1\) and \(L_2\) are the usual suspects for regularization, and they still work well. In practice, \(L_1\) is sufficient [2]. The two perform similarly, but the former gives rise to a sparse coding of the weights (see the explanation here). Think of \(L_1\) as both a regularizer and a feature selector.
- Dropout is a dead simple but very clever use of model ensembling in a large neural net. You can read more about it in Hinton's paper. Implementing dropout is quite simple and we have seen great gains from using it.

It's always good to combine \(L_1\) and dropout in training neural nets.
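A minimal sketch of one SGD step combining the two. Here dropout is applied to the layer input (it is more commonly applied to hidden units); the function name, hyperparameter values, and `grad_loss` callback are all illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(W, x, grad_loss, lr=0.01, l1=1e-4, p_drop=0.5):
    """One SGD step with L1 regularization and (inverted) dropout.

    `grad_loss(W, x)` returns dL/dW for the dropped-out input.
    """
    # Inverted dropout: zero each unit with probability p_drop and
    # rescale the survivors, so nothing changes at prediction time.
    mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
    x_dropped = x * mask

    # L1 adds l1 * sign(W) to the gradient (a subgradient at 0).
    grad = grad_loss(W, x_dropped) + l1 * np.sign(W)
    return W - lr * grad
```

The "inverted" scaling by \(1/(1-p)\) at training time is one common convention; the alternative is to scale the weights by \(1-p\) at prediction time, as in Hinton's paper.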

**2.3. Choice of activation function**
Sigmoid and Tanh are the two most commonly used activation functions. However, they are not necessarily good choices for the hidden units. The reasons are:

- They saturate quickly once the pre-activation value leaves a narrow region (around 0), i.e. the derivative almost vanishes. This means the network easily gets stuck or converges slowly when the pre-activation values fall outside that region.
- There's no reason that the hidden units should be restricted to (0,1) as in the Sigmoid function or (-1,1) as in the Tanh function.

The rectifier is becoming popular as an activation function. However, I find its theory dubious, and my experiments have not shown that it is always better. That said, I'm experimenting with new activation functions. (Little trivia: I'm borrowing many ideas from my graduate work in computational wave propagation.)
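The saturation point above is easy to check numerically; the sample pre-activation values below are arbitrary:

```python
import numpy as np

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])   # pre-activation values

sigmoid = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # peaks at 0.25, vanishes fast
d_tanh = 1.0 - np.tanh(z) ** 2          # peaks at 1.0, vanishes fast
d_relu = (z > 0).astype(float)          # constant 1 for all positive z

# At |z| = 10 the sigmoid/tanh derivatives are essentially zero, so
# gradient updates flowing through such a unit stall; the rectifier's
# derivative stays at 1 for the whole positive half-line.
```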

### 3. Speed things up

This is what I want to talk about most, because academic papers tend to focus on accuracy and ignore the efficiency aspects.

**3.1. Use a highly optimized BLAS Library**
The computations at each layer in a neural network can be abstracted as matrix multiplications:

- Forward propagation: \(y=W^{T}\times x\)
- Back propagation: \(\nabla x=W\times\nabla y\)
- Weights update: \(W+=\alpha\times x\times\nabla y^{T}\)

where

- \(W\) of size \(m\times n\) is the weights matrix,
- \(x\) of length \(m\) represents the source layer's nodes,
- \(y\) of length \(n\) represents the destination layer's nodes,
- and \(\alpha\) is the learning rate (assuming that we are using stochastic gradient descent).

Note that in order to use this one-matrix-multiplication abstraction, the source layer must contain the bias node [3].

For these matrix operations, it's recommended to use a highly optimized BLAS library such as Intel MKL [4] for running on CPUs or NVIDIA's cuBLAS for running on GPUs.

The speed gain from abstracting the operations as matrix multiplies and using optimized BLAS libraries is very significant (tens to hundreds fold faster).
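The three operations above can be sketched in numpy, which dispatches `@` and `outer` to whatever BLAS it was built against. The sizes are arbitrary, and the update keeps the document's `+=` sign convention (the sign depends on how the errors \(\nabla y\) are defined):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                     # source size (incl. bias), destination size
W = rng.standard_normal((m, n))
x = np.append(rng.standard_normal(m - 1), 1.0)   # source layer + bias node
alpha = 0.01

# Forward propagation: y = W^T x
y = W.T @ x

# Back propagation: push the destination-layer errors grad_y
# back to the source layer.
grad_y = rng.standard_normal(n)
grad_x = W @ grad_y

# Weights update (SGD): W += alpha * x * grad_y^T
W += alpha * np.outer(x, grad_y)
```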

**3.2. Convolution Unrolling**
Here, I assume that you are already familiar with the convolutional networks mentioned above and have implemented a basic one. If so, you will have noticed that the matrix-multiply abstraction above doesn't seem to apply. Furthermore, the convolution operations themselves are slow due to index indirection.

Actually, with the convolution unrolling technique, proposed in this paper by Patrice Simard and John Platt, you can abstract the operations in training a convolutional network as matrix-matrix multiplies.

The layer-wise operations can be described as follows.

$$\begin{array}{cccc}
\mbox{Forward propagation:}\hspace{1em} & Y & = & W^{T}\times X\\
\mbox{Back propagation:}\hspace{1em} & \nabla X & = & W\times\nabla Y\\
\mbox{Weights update:}\hspace{1em} & W & += & \alpha\times X\times\nabla Y^{T}
\end{array}$$

where

- \(W\) is the weights matrix of size \(k\times m\). Each column of \(W\) corresponds to a kernel and with this layout, the last bias row can be omitted conveniently in the backprop.
- \(X\) is the unrolled input of size \(k\times n\). Each column of \(X\) corresponds to a window in the original input.
- \(Y\) is the output matrix of size \(m\times n\). Each row of \(Y\) corresponds to the output features for a particular feature map.
- \(\nabla X\) is the unrolled (i.e. pre-accumulated) errors of the input layer.
- \(\nabla Y\) is the errors of the output layer.

Here,

- \(m\) is the number of feature maps (map count);
- \(n\) is the number of output features per map;
- \(k\) equals the kernel size plus one (the bias term).

__Rolling and unrolling__
In order to represent the computation as above, we need two extra steps.

- Unroll the input vector/matrix in each forward propagation. This process, which involves duplicating many elements, stores \(X\) consecutively in memory.
- Roll (accumulate) the back-propagated errors matrix \(\nabla X\) into a flat errors vector (of size equal to the source layer), which can then be passed to the earlier layers.

The rolling and unrolling can be done efficiently by pre-computing the index map which maps from the original space to the unrolled space and vice versa.
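A minimal 1-D sketch of this with a precomputed index map; all sizes and variable names are illustrative, and the bias row is left out to keep the index bookkeeping short:

```python
import numpy as np

src_len, kernel = 8, 3
n = src_len - kernel + 1    # number of output features per map

# Precompute once: idx[i, j] is the source-layer index of element i
# of window j, mapping the original space to the unrolled space.
idx = np.arange(kernel)[:, None] + np.arange(n)[None, :]   # shape (k, n)

x = np.arange(src_len, dtype=float)

# Unroll: duplicate elements so each window is a contiguous column.
X = x[idx]                  # shape (kernel, n)

# The forward pass for m feature maps is now one matrix multiply.
m = 2
W = np.ones((kernel, m))
Y = W.T @ X                 # shape (m, n)

# Roll: accumulate the unrolled errors back into a flat vector of the
# source layer's size; elements covered by several windows sum up.
grad_X = np.ones((kernel, n))   # pretend these came from backprop
grad_x = np.zeros(src_len)
np.add.at(grad_x, idx, grad_X)
```

`np.add.at` is the unbuffered accumulate needed here, since plain fancy-index assignment would silently drop the contributions of overlapping windows.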

**3.3. More Parallelism**
MKL (or CUDA for GPU computing) already exploits a lot of parallelism. However, there are still parallelism opportunities left on the table.

Due to the high cost of a high-performance CPU cluster and the inefficiency of inter-machine communication, we'll focus our parallelism effort on a single machine with multiple GPU cards.

To be continued...

**Notes**
[1] There are many tricks in NN. There are even conferences (such as ICLR) dedicated to new little techniques. Here, I'll only highlight some major ones that work in almost all situations and sometimes yield significant gains in accuracy and performance. Also, this post focuses on implementation tips, not methods.

[2] Optimizers, such as Stochastic Gradient Descent or L-BFGS, should be implemented independently of the learning methods (e.g. Neural Network or Logistic Regression). That way, if you work with multiple learners, all learners can reuse the optimization code, and any learner can expose different optimizers for users to choose from if they want.
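The decoupling suggested in [2] can be sketched as follows; the class and function names are my own invention for illustration, not from any particular library:

```python
import numpy as np

# The optimizer only sees parameters and gradients; it never knows
# which learner produced them.
class SGD:
    def __init__(self, lr=0.1):
        self.lr = lr

    def step(self, params, grads):
        return params - self.lr * grads

# Any learner (neural net, logistic regression, ...) just exposes a
# gradient function, so one generic loop trains them all.
def train(grad_fn, params, optimizer, steps=100):
    for _ in range(steps):
        params = optimizer.step(params, grad_fn(params))
    return params

# Example: minimize (p - 3)^2 with the generic loop.
p = train(lambda p: 2 * (p - 3.0), np.array([0.0]), SGD(lr=0.1))
```

Swapping in L-BFGS or another optimizer then only requires a new class with the same `step` interface, with no change to any learner.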

[3] A traditional approach is to treat biases as another set of parameters.