Friday, August 1, 2014

The difference between L1 and L2 regularization


The difference between \(L_1\) and \(L_2\) regularization is well studied in ML (and you can find lots of references on the internet). However, the most common explanation is probably the one based on this figure.
While the figure is correct (and seemingly intuitive at first), it may not be that easy to digest. It corresponds to an equivalent constrained optimization problem, but why we need to bring in a constrained optimization problem at all is not clear [1]. In fact, someone asked this question on Quora: What's a good way to provide intuition as to why the lasso (L1 regularization) results in sparse weight vectors?, which motivated me to write up a simpler explanation [2].

For those who don't already know: the main difference between \(L_1\) and \(L_2\) regularization is that \(L_1\) can yield sparse models while \(L_2\) doesn't. Sparsity is a great property to have when dealing with high-dimensional data, for at least two reasons.
  • Model compression: increasingly important given the growth of mobile devices.
  • Feature selection: it helps to know which features are important and which are irrelevant or redundant.
But why does \(L_1\) yield sparse models?

For simplicity, let's just consider the 1-dimensional case.

The \(L_2\)-regularized loss function \(F(x)=f(x)+\lambda\|x\|_2^2\) is smooth. This means that its optimum is a stationary point (a point where the derivative is 0). The minimizer of \(F\) gets smaller and smaller as you increase \(\lambda\), but it still won't be exactly 0 unless \(f'(0)=0\).
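For a concrete (purely illustrative) example, take \(f(x)=(x-a)^2\) for some constant \(a\neq 0\), so that \(f'(0)=-2a\neq 0\). Then
\[
F(x)=(x-a)^2+\lambda x^2,\qquad F'(x)=2(x-a)+2\lambda x=0 \;\Longrightarrow\; x^{*}=\frac{a}{1+\lambda}.
\]
The minimizer shrinks toward 0 as \(\lambda\) grows, but it never reaches 0 for any finite \(\lambda\).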

\(L_1\)-regularized loss function \(F(x)=f(x)+\lambda\|x\|_1\) is non-smooth. It's not differentiable at 0. Optimization theory says that the optimum of a function is either the point with 0-derivative or one of the irregularities (corners, kinks, etc.). So, it's possible that the optimal point of \(F\) is 0 even if 0 isn't the stationary point of f. In fact, it would be 0 if \(\lambda\) is large enough (stronger regularization effect). Below is a graphical illustration.
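With the same illustrative \(f(x)=(x-a)^2\), the \(L_1\)-regularized objective becomes \(F(x)=(x-a)^2+\lambda|x|\). Handling the two sides of the kink separately gives the familiar soft-thresholding solution
\[
x^{*}=\operatorname{sign}(a)\,\max\!\left(|a|-\tfrac{\lambda}{2},\,0\right),
\]
which is exactly 0 as soon as \(\lambda\ge 2|a|\): a large enough penalty parks the minimizer right at the kink.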

The same story carries over to the multi-dimensional setting, coordinate by coordinate: if a feature is not important, the loss reduction it contributes is small, so the (non-differentiable) \(L_1\) penalty wins and turns that feature off, i.e., sets its weight to exactly 0.
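As a rough sketch of this effect (the data and the regularization strength alpha=0.5 are arbitrary choices for illustration), the snippet below fits scikit-learn's Lasso (\(L_1\)) and Ridge (\(L_2\)) on synthetic data in which only 3 of 20 features carry signal:

```python
# Illustrative comparison: L1 (Lasso) vs L2 (Ridge) regularized linear regression
# on synthetic data where only the first 3 of 20 features are informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
n_samples, n_features = 100, 20

X = rng.randn(n_samples, n_features)
true_w = np.zeros(n_features)
true_w[:3] = [5.0, -3.0, 2.0]          # only 3 informative features
y = X @ true_w + 0.1 * rng.randn(n_samples)

lasso = Lasso(alpha=0.5).fit(X, y)     # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)     # L2 penalty

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))
```

On data like this, the Lasso fit typically zeros out most of the 17 noise features, whereas none of the Ridge coefficients are exactly 0 (they are merely small).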

Notes
[1] The regularized optimization problem can be recast as a constrained optimization problem in which the norm is bounded by a new constant \(C\). The black solid regions in the figure are the feasible regions. Why/how the two problems are equivalent is actually non-trivial. Here's a proof if you're curious: Why are additional constraint and penalty term equivalent in ridge regression?
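In symbols (written here for the \(L_1\) case; the \(L_2\) case with the squared norm is analogous), the two formulations are
\[
\min_{x}\; f(x)+\lambda\|x\|_1 \qquad\text{and}\qquad \min_{x}\; f(x)\;\;\text{subject to}\;\;\|x\|_1\le C,
\]
and for convex problems each choice of \(\lambda\) corresponds to some constraint level \(C\) under which the two problems share a solution; the linked proof spells out the correspondence.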
[2] Like many others, this post comes from my own answer on Quora: What is the difference between L1 and L2 regularization?