Deep Learning is hot. It has been achieving state-of-the-art results on many hard machine learning problems. So it's natural that many study and scrutinize it. There have been a couple of papers in the series of intriguing properties of neural networks. Here's a recent one, so-called Deep Neural Networks are easily fooled, that was discussed actively on Hacker News. Isn't it interesting?
The common theme among these papers is that DNN is unstable, in the sense that
- Changing an image (e.g. of a lion) in a way imperceptible to humans can cause DNN to label the image as something else entirely (e.g. mislabeling a lion as library)
- Or DNN gives high confidence (99%) predictions for images that are unrecognizable by humans
I myself have experienced this phenomenon too, even before these papers came out. However, what surprised me is many people are too quick to claim that these properties are due to neural networks. Training a neural network, or any system, requires tuning many hyper parameters. Just because you don't use the right set of hyper-parameters doesn't mean that the method fails.
Anyway, my experience shows that the aforementioned instability is likely due to the use of output function, in particular: Softmax. In training a multi-class neural network, the most common output function is Softmax . The lesser common approach is just using Sigmoid for each output node .
In practice, they perform about the same in terms of testing error. Softmax may perform better in some cases and vice versa but I haven't seen much difference. To illustrate, I trained a neural net on MNIST data using both Sigmoid and Softmax output functions. MNIST is a 10-class problem. The network architecture and other hyper-parameters are the same. As you can see, using Softmax doesn't really give any advantage.
However, if you train a predictor using Softmax, it tends to be too confident because the raw outputs are normalized in an exponential way. Testing on test data doesn't expose this problem because in test data, every example clearly belongs to one of the trained classes. However, in practice, it may not be the case: there are many examples where the predictor should be able to say I'm not confident about my predictions . Here's an illustration: I applied the MNIST models trained above on a weird example, which doesn't look like any digit. The model trained using Sigmoid outputs that it doesn't belong to any of the trained digit classes. The model trained using Softmax is very confident that it's a 9.
 It's commonly said that Softmax is a generalization of logistic function (i.e. Sigmoid). This is not true. For example, 2-class Softmax Regression isn't the same as Logistic Regression. In general, they behave very differently. The only common property is that they are smooth and the class-values using these functions sum up to 1.
 Some people say that for multiclass classification, Softmax gives a probabilistic interpretation while Sigmoid (per-class) does not. I strongly disagree. Both have probabilistic interpretations, depending on how you view them. (And you can always proportionally normalize the Sigmoid outputs if you care about them summing up to 1 to represent probabilistic distribution.)
 Predicting with uncertainty is very important in many applications. This is in fact the theme of Michael Jordan's keynote at ICML 2014.