Tuesday, March 10, 2015

Metrics revisited


Machine learning researchers and practitioners often evaluate with one metric on the test set while optimizing a different one on the train set. Consider the traditional binary classification problem, for instance. We typically measure the quality of an algorithm with AUC on the test set while training with a different loss function, e.g. logistic loss or hinge loss, on the train set.
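
To make the setup concrete, here is a minimal sketch of that practice using scikit-learn; the dataset is synthetic and every hyperparameter is a placeholder:

# Minimal sketch: optimize logistic loss on the train set, report AUC on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()              # trained by minimizing logistic loss
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]  # predicted probabilities on the test set
print("test AUC:", roc_auc_score(y_te, scores))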

Why is that? The common explanation is that AUC is not easy to train against. Computing AUC requires a batch of examples: there is no such thing as the AUC of a single example. And even in batch training, we simply don't use it as a loss function [1].
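
To see why there is no per-example AUC, recall that AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A short sketch of that pairwise view (on the test split above, it agrees with roc_auc_score):

import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Every term involves a *pair* of examples, which is
    why the metric only makes sense over a batch, not per example."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))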

I want to ask a deeper question. Why is AUC a good metric in the first place? It's not the metric that business people care about. Why don't we use the true business loss, which can be factored into the loss due to false positives and the loss due to false negatives, to test a machine learning algorithm, or even to train it?
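
As a concrete, hypothetical stand-in for such a business loss, suppose the business attaches a fixed cost to each false positive and another to each false negative; the cost values below are placeholders a real business would supply:

import numpy as np

def business_loss(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Hypothetical business loss: a per-false-positive cost plus a
    (typically different) per-false-negative cost."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return cost_fp * fp + cost_fn * fn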

The major reason AUC is favored as a proxy for the business loss is that it is independent of the classification threshold. But why are we scared of the threshold? Why do we need to set a threshold in order to use a classifier at all? Isn't it against the spirit of machine learning that humans have to set the threshold by hand?
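
Continuing the sketches above (reusing scores, y_te, and the hypothetical business_loss), the contrast is easy to see: AUC does not move when the threshold moves, but the loss we actually pay does, and we still have to pick a threshold before the classifier can act.

print("AUC (threshold-free):", roc_auc_score(y_te, scores))
for t in (0.3, 0.5, 0.7):
    y_pred = (scores >= t).astype(int)   # a threshold must be chosen to act
    print("threshold", t, "-> business loss", business_loss(y_te, y_pred))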

So I'd like to propose that we stop treating the threshold as a parameter to tune by hand and instead make it another parameter to learn. Here are a few challenges in doing so:
  • If we are using a linear model, adding the threshold as a parameter makes the model nonlinear; strictly speaking, we no longer have a linear model.
  • The threshold parameter needs to lie between 0 and 1. We can handle this constraint by applying a logistic function to an unconstrained threshold variable, as in the sketch after this list.
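
A minimal sketch of the idea, with the caveats above: the threshold is parameterized through a logistic function so the learned variable is unconstrained, and the hard decision is smoothed with a sigmoid so the hypothetical business loss becomes differentiable. For brevity only the threshold is learned here, on fixed model scores; in the proposal it would be learned jointly with the model weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothed_business_loss(theta, p, y, cost_fp=1.0, cost_fn=5.0, sharpness=20.0):
    """Differentiable surrogate for the hypothetical business loss.
    sigmoid(theta) is the threshold, so theta itself is unconstrained;
    the hard decision 1[p >= t] is replaced by sigmoid(sharpness * (p - t))."""
    t = sigmoid(theta)
    soft_pred = sigmoid(sharpness * (p - t))
    fp = np.mean(soft_pred * (1 - y))        # soft false-positive rate
    fn = np.mean((1 - soft_pred) * y)        # soft false-negative rate
    return cost_fp * fp + cost_fn * fn

def learn_threshold(p, y, lr=0.05, steps=1000, eps=1e-5):
    """Learn the threshold by gradient descent on the surrogate; finite
    differences keep the sketch dependency-free. p holds predicted
    probabilities and y the 0/1 labels (e.g. scores and y_te from above)."""
    theta = 0.0
    for _ in range(steps):
        grad = (smoothed_business_loss(theta + eps, p, y)
                - smoothed_business_loss(theta - eps, p, y)) / (2 * eps)
        theta -= lr * grad
    return sigmoid(theta)    # the learned threshold, guaranteed to lie in (0, 1)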
ML research has always been challenging; adding another layer of complexity shouldn't be a deal-breaker. Not modeling the business problem directly is the bigger issue to me.

Side notes
[1] Computing AUC is costly, since it depends on the ranking of all scores rather than on individual examples. Worse, AUC is piecewise constant in the model scores, so its gradient is zero almost everywhere; gradient-based training would need a smoothed surrogate.