Tuesday, August 5, 2014

The importance of cache efficiency in SGD optimization

Recently, some members in our team have experimented with variants of SGD (Stochastic Gradient Descent) and SDCA (Stochastic Dual Coordinate Ascent) on large and sparse datasets (such as the KDD Cup 2010 dataset [*]). Note that we've focused only on linear models (e.g. logistic regression and SVM) which lead to convex optimization.

What we have found during the process is very interesting from the engineering standpoint, yet not covered in any academic papers. That is: in SGD and SDCA, each weights update is typically so fast [**] that cache efficiency can become the limiting factor. Put it another way: we experienced that if the data is randomly shuffled in memory, the algorithm suddenly becomes ~3x slower (see the table below).

Friday, August 1, 2014

The difference between L1 and L2 regularization

The difference between \(L_1\) and \(L_2\) regularization is well studied in ML (and you can find lots of reference on the internet). However, the most common explanation is perhaps via this figure.
While it's true (and seemingly intuitive at first), this figure may not be as easy to digest. The figure corresponds to an equivalent constrained optimization problem, but why we need to deal with constrained optimization problem is not clear [1]. In fact, someone asked this question on Quora: What's a good way to provide intuition as to why the lasso (L1 regularization) results in sparse weight vectors?, which motivated me to provide a simpler explanation [2].