## Abstract

At a high level, Deep Learning (DL) is still hot and DL keeps eating Machine Learning. The attendance split was roughly half there for Deep Learning and the other half there for *Shallow* Learning :). Interestingly, the conference took place in Beijing for the first time, and more than 50% of the attendees either study or work there (and most of that local population are students). So the attendance distribution could be biased.

In the following, I'll highlight what I learned and observed at the conference. Here's the outline:

- ML Fundamentals (this post, see below)
- Deep Learning and Language Modeling
- Optimization, Distributed Optimization, and Distributed ML
- Kernel Methods
- Auto ML
- Other Topics

## ML Fundamentals

Two of the three keynotes were delivered by Michael Jordan and Eric Horvitz, who talked about the fundamental challenges in ML. In general, their talks were more philosophical than technical, but they raised some important points.

**Predict with uncertainty**

In his keynote, Michael emphasized the importance of predicting with uncertainty.

Too often, in industry as well as in academia, people train models that only output predictions, without any confidence information. Such models can be dangerous to use in practice. [*] It's important to estimate the risk and not just the loss.

An old and simple technique for training models that predict with uncertainty is bagging, i.e. bootstrap and then aggregate. Bootstrapping means creating multiple samples from the training set, each the same size as the training set, by sampling with replacement. A base learner (e.g. a decision tree) is then trained on each sample to produce a candidate predictor. Bagging thus generates several candidate predictors, which can be aggregated to produce not only the prediction (their mean) but also an uncertainty estimate (their variance).
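To make the bagging recipe concrete, here is a minimal sketch in Python with NumPy. The function name `bagged_predict`, its parameters, and the toy data are my own illustration, not something from the keynote, and I use a least-squares line as the base learner instead of a decision tree just to keep the example dependency-free.

```python
import numpy as np

def bagged_predict(x_train, y_train, x_new, n_models=100, degree=1, seed=0):
    """Bagging sketch: returns (mean, variance) of the candidate predictions.

    The base learner is a least-squares polynomial fit (an assumption made
    for illustration; any learner could be plugged in here).
    """
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.empty((n_models, len(x_new)))
    for m in range(n_models):
        # Bootstrap: draw n indices with replacement from the training set.
        idx = rng.integers(0, n, size=n)
        # Train one candidate predictor on the bootstrap sample.
        coeffs = np.polyfit(x_train[idx], y_train[idx], deg=degree)
        preds[m] = np.polyval(coeffs, x_new)
    # Aggregate: the mean is the prediction, the variance is the uncertainty.
    return preds.mean(axis=0), preds.var(axis=0)

# Toy data (hypothetical): y = 2x + 1 plus Gaussian noise, x in [0, 10].
data_rng = np.random.default_rng(1)
x = data_rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + data_rng.normal(0, 1.0, size=200)

# Query one point inside the training range and one far outside it.
mean, var = bagged_predict(x, y, np.array([5.0, 50.0]))
```

The variance at `x = 50`, far outside the training range, should come out much larger than at `x = 5`, which is exactly the kind of signal a bare point prediction hides.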

An even better approach to estimating the quality of estimators, according to Michael, is to use the bag of little bootstraps (BLB).
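The talk summary above doesn't go into how BLB works, so the following is only my rough sketch of the idea as I understand it: draw a few small subsets of size `b = n ** gamma` without replacement, resample each subset up to the full size `n` via multinomial counts, assess the estimator within each subset, and average the assessments. The function name, the default parameter values, and the choice of the sample mean as the estimator are all assumptions for illustration.

```python
import numpy as np

def blb_standard_error(x, n_subsets=10, gamma=0.7, n_resamples=50, seed=0):
    """BLB sketch: estimate the standard error of the sample mean of x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    b = int(n ** gamma)  # size of each "little" subset
    errors = []
    for _ in range(n_subsets):
        # Little subset: b distinct points drawn without replacement.
        subset = rng.choice(x, size=b, replace=False)
        estimates = []
        for _ in range(n_resamples):
            # Resample up to size n, stored as per-point counts, so each
            # resample only ever touches b distinct values.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(np.average(subset, weights=counts))
        # Quality assessment within this subset: spread of the estimates.
        errors.append(np.std(estimates, ddof=1))
    # Average the per-subset assessments.
    return float(np.mean(errors))

# Toy check (hypothetical data): for 10,000 standard normal draws the
# standard error of the mean should be close to 1 / sqrt(10000) = 0.01.
sample = np.random.default_rng(2).normal(0.0, 1.0, size=10_000)
se = blb_standard_error(sample)
```

The appeal is that each inner computation only handles `b` distinct points, which is what makes the approach attractive on large datasets.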

**Privacy aware learning**

ML with privacy constraints is getting more popular and will be hot once ML becomes more mature. One third of Michael's keynote was on this topic [1, 2], which is one of his main interests nowadays. Eric Horvitz also emphasized the importance of this problem in the context of applying ML in the health care industry.

**Causality modeling**

In many applications, prediction is not the end goal. It's important to explain the causality and make decisions. A simple example is customer churn analysis. Predicting customer churn rates doesn't do any good if the model cannot explain why customers leave or suggest actions to take.

Enough BS. In the next posts, I will cover more technical content.

**Side notes**

[*] On multiclass neural networks: given the importance of predicting with uncertainty, I think that Sigmoid is a better output function than Softmax (which is contrary to popular advice). Softmax tends to make the predictor too confident. I have empirically observed the irregular behavior of NN models when using softmax on MNIST (similar to the phenomenon observed in the recent paper Intriguing Properties of Neural Networks).
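A tiny numeric illustration of the overconfidence point, as a sketch only: the logits below are made up, and this is not the MNIST experiment mentioned above. When every class gets a strongly negative score, softmax still has to hand out a total probability of 1, while per-class sigmoids can all stay low.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for an input that matches none of the three classes.
logits = np.array([-3.0, -4.0, -5.0])

p_softmax = softmax(logits)  # forced to sum to 1 across the classes
p_sigmoid = sigmoid(logits)  # each class is scored independently
```

Here softmax gives the first class about 0.67 even though its raw score is very negative, whereas the sigmoid outputs all stay below 0.05, i.e. "none of the above" remains expressible.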