Thursday, January 17, 2013

How to understand back propagation algorithm in Neural Networks

I'm not pleased with how back propagation is explained on the internet (e.g. this one). They mistakenly view backprop as gradient descent, or an optimization algorithm, or a training algorithm for neural networks. It is not.

Backprop is simply a method to compute the partial derivatives (or gradient) of a function, which has the form as a function composition (as in Neural Nets). When you solve an optimization problem using a gradient-based method (gradient descent is just one of them), you want to compute the function gradient at each iteration.

For a neural network, the objective function has the form of a composition. How do you compute the gradient? Two common ways to do it.

1. Analytic differentiation. You know the form of the function. So just compute the derivatives using the chain rule (basic calculus). The long mathematical formulas in tutorials or in papers may look scary (because multivariate calculus usually looks ugly). However, if you write down the differentiation steps yourself, you should find that it is very straight forward.
2. Numerical differentiation using finite difference. This method is computationally expensive because the number of function evaluation is O(N), where N is the number of parameters. This is expensive, compared to analytic differentiation. Finite difference, however, is commonly used to validate a backprop implementation, for debugging purpose.

To summarize: backprop is the straightest way to compute the partial derivatives of a function, using the chain rule. It is really first year Calculus plus proof-by-induction plus a lot of patience :).