Monday, December 1, 2014

Challenges in Machine Learning Practice - Part 1

In this series, I'd like to highlight a few challenges in developing and using a machine learning (ML) system in practice. This first post will focus on model serialization and how to consume a trained model.

The disconnect between training models and putting them into production

What’s the best practice to train and deploy a model?

Most of the major machine learning toolsets and platforms (e.g. SAS, R, Python libraries, Weka, RapidMiner, Hadoop) focus on providing algorithms and tools to help you analyze and visualize data, extract features, and train models. But once the data scientists have finished exploring the data and trained a good model, how to operationalize that model or workflow [1] remains a tricky question for most companies and teams.

What usually happens is one of the following.
  1. The whole data mining [2] pipeline is offline. 
    1. In some cases, companies just want to get insights from their landfill of data, to help them make more informed decisions. That’s it. They don’t need to or perhaps don’t know how to train/apply a predictive model.
    2. Online prediction is not too critical. Some data scientists are willing to run prediction offline on new batches of data. Examples include customer retention analysis, risk analysis, and fraud detection (many firms already run fraud detection online, but many still run it offline).
    In both cases, machine learning is only for internal use. I’m guessing that these needs currently dominate the machine learning software market, but they won’t for long; the future will rely more on predictive analytics.

  2. After training and validating a model offline, the data scientists ask other software engineers to implement the model decoder, typically in a more efficient language (C++/C#/Java), so that it can be deployed as a live service.
(2) is what I want to talk about. There are many problems with this approach.
  1. The knowledge gap between data scientists (or statisticians) and production engineers. The data scientists have to explain the model to the engineers, and depending on the model's complexity and other factors, this process can be long and error-prone.
  2. Retraining and updating the model is also a pain. After a few months, the data scientists will likely come up with better models, either by retraining on more data or through more fundamental changes. Now the engineers need to update the model, and if the change is fundamental, such as switching from an SVM to a neural network, the model decoder changes significantly.

  3. The process of improving the models is, and should be, endless, if only because of periodic retraining.
If you multiply the cost factors above by the number of projects that use machine learning, the waste of resources is far-reaching. It typically takes months and multiple engineers to test and deploy an already-trained machine learning model.
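To make the handoff concrete, here is a minimal sketch (all feature names and weights are hypothetical) of what the engineers end up hand-coding: the data scientists export the coefficients of, say, a logistic regression, and the production "decoder" reapplies them. Every retrain means re-exporting the numbers by hand, and switching model families means rewriting this function entirely.

```python
import math

# Hypothetical coefficients exported by the data scientists from an
# offline-trained logistic regression (feature name -> weight).
WEIGHTS = {"age": 0.042, "balance": 0.0013, "num_purchases": 0.31}
INTERCEPT = -2.7

def score(features):
    """Hand-coded 'decoder': reapply the exported logistic regression."""
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # probability of the positive class

p = score({"age": 35, "balance": 1200.0, "num_purchases": 4})
```

The same reimplementation would typically be done in C++/C#/Java for a live service, which only widens the gap between the trained model and the deployed one.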

There are a few solutions out there, but they are far from ideal.
  1. Large IT companies (Microsoft, Google, Facebook, etc.) may have the resources to build their own infrastructure for serializing and consuming models automatically. However, their solutions are (i) internal, (ii) fragmented even within a single company, and (iii) customized for their own data sources/formats and their current learning algorithms.
  2. Predictive Model Markup Language (PMML) is the most established and common format for serializing machine learning models. Many popular tools (such as KNIME, RapidMiner, R, SAS, etc.) [3] support exporting models in this format. However, PMML has a few disadvantages:
    1. It is an old standard that evolves slowly. For example, it doesn't support modern methods such as convolutional neural networks.
    2. Software that can consume PMML models (e.g. SAS or IBM SPSS Modeler) tends to be expensive and not optimized for online usage.
  3. Train a model in a data-friendly environment (let's say Python) and consume it through that same environment. This approach, while quick to get something working, incurs major technical debt that you have to pay down later, chiefly in speed and scalability (both computational and in terms of software maintenance).
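In Python, approach (3) can be as simple as pickling the trained model and unpickling it in the serving process. A minimal sketch using only the standard library (the dict-based "model" here is a stand-in for a real fitted estimator):

```python
import pickle

# Stand-in for a trained model; in practice this would be, e.g.,
# a fitted scikit-learn estimator object.
model = {"intercept": -1.0, "weights": [0.5, 2.0]}

# Training side: serialize the model to bytes (or a file on disk).
blob = pickle.dumps(model)

# Serving side: deserialize and score a new example.
loaded = pickle.loads(blob)
x = [1.0, 0.25]
score = loaded["intercept"] + sum(w * v for w, v in zip(loaded["weights"], x))
print(score)  # -1.0 + 0.5*1.0 + 2.0*0.25 = 0.0
```

The convenience is real, but so is the debt: the pickled bytes are tied to the Python runtime and to the exact class definitions available at load time, which is part of the maintenance and scalability cost described above.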
In summary, producing and consuming machine learning models is still painful for a large number of machine learning practitioners. The market is fragmented and still at an early stage, and software that pays attention to producing and consuming models tends to lack other important components needed for wide adoption. I'll highlight a few of those components in the next posts.

[1] It should be noted that almost every predictive model is a workflow, not just an ML-trained sub-model.
[2] Sometimes I use “data mining” to emphasize the practice of mining data just for the sake of understanding but not for making predictions on new examples.
[3] This site provides a comprehensive list of software that can produce and/or consume PMML models.