Model Deployment
This is article 3 of a 5-part series on data science operations.
Most data science operations produce at least a handful of models for deployment every year. If the operation is large, with multiple data science teams, or sits in an industry segment going through rapid innovation, hundreds or thousands of new models may be deployed in a year. At this scale, data science operations need a systematic way of certifying new models, managing the upgrade and downgrade cycle, and recording statistics on the usage and performance of deployed models.
Certification and Live-Testing
New candidate models usually go through a certification cycle before deployment. It is helpful to think of the certification phase as one that encompasses all aspects of a candidate model’s performance, rather than limiting attention to prediction performance.
For most Data Scientists, prediction performance is the primary focus. They constantly tinker with feature engineering and model parameters to eke out incremental gains in prediction performance. Often, the latency and throughput characteristics of the model get less attention. However, when a model is deployed in a production environment, latency and throughput matter as much as prediction performance. Depending on the context in which the model is used, latency and throughput can significantly affect the application's end-user experience. They also have a direct cost impact: more or better infrastructure may be needed to get acceptable latency and throughput from a model.
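To make "latency and throughput characteristics" concrete, here is a minimal sketch of a single-threaded probe; `model.predict` and `sample_inputs` are hypothetical stand-ins for whatever serving interface and payloads a given team actually uses.

```python
import statistics
import time


def measure_latency_throughput(model, sample_inputs, runs=1000):
    """Rough single-threaded latency/throughput probe for a candidate model.

    Assumes `model` exposes a `predict(x)` method and `sample_inputs` is a list
    of representative payloads; both are placeholders for the real serving interface.
    """
    latencies = []
    start = time.perf_counter()
    for i in range(runs):
        x = sample_inputs[i % len(sample_inputs)]
        t0 = time.perf_counter()
        model.predict(x)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[int(0.95 * len(latencies)) - 1],
        "throughput_rps": runs / elapsed,
    }
```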
At scale, the process of running candidate models through a battery of tests that define the prediction, latency, and throughput characteristics of the model should be automated. Models that do not have good performance across all three criteria should be blocked from deployment and sent back to the drawing board. Further, models accepted for deployment should go through a live-testing phase where a fraction of the load serviced by a predecessor model is passed through the candidate model (i.e., the candidate is A/B tested). The candidate model should replace its predecessor only if it has better aggregate performance characteristics in the live-test.
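The battery itself can be as simple as a gate over the three measured characteristics. Below is a minimal sketch with made-up threshold values; real thresholds would come from the application's SLAs and the business problem, not from this example.

```python
# Illustrative thresholds only; substitute values appropriate to the application.
THRESHOLDS = {
    "min_prediction_metric": 0.80,   # e.g., AUC on the certification holdout set
    "max_p95_latency_s": 0.050,      # 50 ms
    "min_throughput_rps": 200.0,
}


def certify(prediction_metric, p95_latency_s, throughput_rps):
    """Return (passed, failures) for the three measured characteristics."""
    failures = []
    if prediction_metric < THRESHOLDS["min_prediction_metric"]:
        failures.append("prediction")
    if p95_latency_s > THRESHOLDS["max_p95_latency_s"]:
        failures.append("latency")
    if throughput_rps < THRESHOLDS["min_throughput_rps"]:
        failures.append("throughput")
    return len(failures) == 0, failures
```

A candidate that returns any failures is blocked from deployment; a candidate that passes moves on to live-testing.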
Some data science teams will want to live-test more than one candidate model simultaneously. Such testing is helpful when multiple approaches to a prediction problem are being attempted in parallel or when the final model is expected to be an ensemble.
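One lightweight way to run such a live-test is weighted routing of requests between the incumbent and one or more candidates. The sketch below uses hypothetical model names and traffic fractions, and hashes the request ID so that routing is sticky per caller.

```python
import hashlib

# Hypothetical live-test configuration: the incumbent keeps most of the traffic
# and each candidate receives a small fixed fraction.
TRAFFIC_SPLIT = {
    "incumbent_v12": 0.90,
    "candidate_a": 0.05,
    "candidate_b": 0.05,
}


def route_request(request_id: str) -> str:
    """Deterministically pick which deployed model serves a given request."""
    # Hash the request (or user) ID to a stable value in [0, 1] so the same
    # caller is always routed to the same model for the duration of the test.
    digest = hashlib.md5(request_id.encode()).hexdigest()
    r = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for model_name, fraction in TRAFFIC_SPLIT.items():
        cumulative += fraction
        if r < cumulative:
            return model_name
    return next(iter(TRAFFIC_SPLIT))  # rounding edge case: fall back to the incumbent
```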
Upgrades and Downgrades
For professional data science operations at scale, not only is it important to automate the certification of new candidates, but it is also essential to use a version management tool to promote candidate models to live-tests and then to full deployment. Conversely, if model performance degrades, the version management tool should downgrade to an older version of the model with stable performance.
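To make the promote-and-rollback cycle concrete, here is a toy sketch of the bookkeeping a version management tool performs; the stage names and in-memory structure are illustrative, not any real tool's API.

```python
from dataclasses import dataclass, field

STAGES = ("certified", "live_test", "production")


@dataclass
class ModelRegistry:
    """Toy in-memory registry: tracks which stage each model version is in."""
    versions: dict = field(default_factory=dict)   # version -> stage
    history: list = field(default_factory=list)    # audit log of transitions

    def register(self, version):
        self.versions[version] = STAGES[0]
        self.history.append(("register", version))

    def promote(self, version):
        """Move a version one stage forward (certified -> live_test -> production)."""
        next_stage = STAGES[STAGES.index(self.versions[version]) + 1]
        self.versions[version] = next_stage
        self.history.append(("promote", version, next_stage))

    def rollback(self, degraded_version, stable_version):
        """Demote a degraded version and restore a known-stable one to production."""
        self.versions[degraded_version] = STAGES[0]
        self.versions[stable_version] = "production"
        self.history.append(("rollback", degraded_version, stable_version))
```

In practice this state lives in a model registry or deployment system with a proper audit trail; the point is that promotions and rollbacks are explicit, recorded transitions rather than ad hoc redeployments.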
Model Statistics
As models move from certification to live-tests to full production deployment, it is helpful to systematically record each model's performance characteristics. Keeping score matters not only for deciding which models to promote and which to send back for re-work, but also for model discovery and reuse. Data Scientists working on new projects ought to investigate previous models that have been deployed with good results, and such investigation is difficult without useful statistics on past usage.
Finally, recording performance characteristics enables Data Scientists to revisit and redefine their goals. For example, if the record shows that new models consistently perform well during the certification phase but degrade significantly during live-tests, it may be worthwhile for the data science team to revisit its technology choices and training data.
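As a concrete illustration of what "keeping score" might look like, here is a minimal sketch of a per-stage statistics record; the field names and example values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelStatsRecord:
    """One row of the model scorecard: a version's measured characteristics
    at a given lifecycle stage. Field names are illustrative."""
    model_name: str
    version: str
    stage: str                 # "certification", "live_test", or "production"
    prediction_metric: float   # e.g., AUC or RMSE on the relevant data
    p95_latency_s: float
    throughput_rps: float
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        return json.dumps(asdict(self))


# Appending one record per model, stage, and evaluation run to a durable store
# (a database table, a file, an experiment tracker) is what later enables the
# discovery, reuse, and trend analysis described above.
print(ModelStatsRecord("churn_model", "v13", "live_test", 0.83, 0.042, 240.0).to_json())
```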