This is article 4 of a 5-part series on data science operations.

It takes significant effort and coordination of resources to get a model from “research prototype” to “production deployment.” However, the Data Scientist is hardly done once the model is deployed. Production environments can expose significant differences between the Data Scientist’s assumptions and real life. Professional data science operations have to monitor the model’s prediction performance, latency, and throughput, and respond quickly and appropriately to the anomalies encountered. Typically, such monitoring can be accomplished with software. Of course, the response to anomalies cannot always be automated.

Prediction Anomalies and Response

Prediction performance can differ from the Data Scientist’s expectations if the real-world data sent to the model is statistically different from the data on which the model was trained. Prediction performance can also differ if the relationship modeled has drifted over time or changed abruptly due to recent events.

To guard against changes in post-deployment prediction performance, Data Science teams have to continuously test for statistical differences between production input and training input. Further, they have to confirm the correctness of the model’s output. Ascertaining correctness is particularly difficult as there may not be a reliable way to verify the model’s output without manual intervention. However, well-maintained data science infrastructure and tools can approximate correctness verification by automatically checking the predictions from the deployed model against those from previous stable versions on the same data.
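As a minimal sketch of these checks, the snippet below flags input drift with a two-sample Kolmogorov–Smirnov test and approximates output verification by comparing the deployed model’s predictions against a previous stable version on the same data. It assumes NumPy and SciPy; the function names, thresholds, test choice, and synthetic data are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy import stats

def detect_input_drift(training_col, production_col, alpha=0.01):
    """Flag drift in one numeric feature using a two-sample
    Kolmogorov-Smirnov test (PSI or a chi-squared test for
    categorical features are equally valid choices)."""
    statistic, p_value = stats.ks_2samp(training_col, production_col)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}

def compare_model_outputs(current_preds, stable_preds, tolerance=0.05):
    """Approximate correctness check: compare the deployed model's
    predictions against a previous stable version on the same inputs."""
    disagreement = float(np.mean(np.abs(current_preds - stable_preds)))
    return {"mean_abs_diff": disagreement, "suspect": disagreement > tolerance}

# Synthetic data standing in for real feature values: the production
# feature has a shifted mean, so the drift check should fire.
rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
production_feature = rng.normal(loc=0.3, scale=1.0, size=2_000)

print(detect_input_drift(training_feature, production_feature))
```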

In cases where either the input or the output differs significantly from its benchmark (the training input and a previous stable model’s output, respectively), monitoring software can trigger an alert to the data science team. The alert should prompt the team to investigate the root cause of the difference and prepare a response.
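A hedged sketch of such an alerting hook is shown below. It consumes check results shaped like those from the previous sketch; the `evaluate_and_alert` function and the `notify` callback are hypothetical stand-ins for whatever paging or messaging system the team actually uses.

```python
import json
import logging

logger = logging.getLogger("model_monitoring")

def evaluate_and_alert(input_check, output_check, notify):
    """Raise an alert when production input drifts from its training
    benchmark or when the deployed model's output diverges from a
    previous stable version's output."""
    reasons = []
    if input_check.get("drifted"):
        reasons.append("production input drifted from training benchmark")
    if output_check.get("suspect"):
        reasons.append("output diverged from previous stable model")
    if reasons:
        payload = {"severity": "warning", "reasons": reasons,
                   "details": {"input": input_check, "output": output_check}}
        notify(json.dumps(payload))  # e.g. a pager, chat webhook, or email hook
        return True
    return False

# A trivial notifier; a real deployment would page the data science team.
evaluate_and_alert({"drifted": True, "p_value": 0.001},
                   {"suspect": False, "mean_abs_diff": 0.01},
                   notify=logger.warning)
```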

In some cases, the eventual response may require building a new version of the model; in other instances, downgrading to an older version may suffice. To the extent that the data science team has well-maintained, comprehensive deployment tools and processes at its disposal, the response can be efficient, limiting the impact of inferior predictions.
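The downgrade path can be illustrated only in the abstract; in the sketch below, `ModelRegistry` and its methods are hypothetical placeholders for the team’s actual deployment tooling (an internal registry, MLflow, a Kubernetes-based serving layer, or similar).

```python
class ModelRegistry:
    """Hypothetical stand-in for real deployment tooling."""
    def __init__(self):
        self._versions = []          # ordered list of deployed version ids

    def deploy(self, version_id):
        self._versions.append(version_id)
        print(f"serving {version_id}")

    def rollback(self):
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()         # retire the current (suspect) version
        print(f"rolled back to {self._versions[-1]}")

registry = ModelRegistry()
registry.deploy("churn-model-v12")
registry.deploy("churn-model-v13")   # version whose predictions look anomalous
registry.rollback()                  # downgrade while a new version is built
```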

Latency and Throughput Anomalies and Response

Anomalies in latency and throughput are somewhat easier to handle than anomalies in prediction performance. Latency and throughput are more widely understood as performance indicators, and responses to their anomalies are usually easier to automate.

In most cases, latency and throughput benchmarks for a model can be derived from the performance observed during pre-deployment model certification and live testing. Benchmarks can also be derived from previous versions of the model or from similar models serving other applications.
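One plausible way to turn such observations into alerting thresholds is sketched below, assuming NumPy; the percentile choices and the 1.2x headroom factor are illustrative assumptions rather than standard values.

```python
import numpy as np

def derive_latency_benchmark(latencies_ms, slack=1.2):
    """Derive alerting thresholds from latencies observed during
    pre-deployment certification or live testing; `slack` adds headroom
    so normal variation does not page anyone."""
    p99 = float(np.percentile(latencies_ms, 99))
    return {
        "p50_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
        "p99_ms": p99,
        "alert_above_ms": p99 * slack,
    }

# Synthetic latencies (in milliseconds) standing in for a certification run.
observed = np.random.default_rng(1).gamma(shape=2.0, scale=15.0, size=5_000)
print(derive_latency_benchmark(observed))
```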

At a minimum, software monitoring latency and throughput can respond to anomalies by triggering alerts. Monitoring software can also execute additional responses, such as expanding the capacity (storage, compute, network bandwidth) available to the model. In practice, such automated action is accompanied by additional alerts to Data Scientists and DevOps. Ultimately, human intervention is needed if the pre-configured capacity-expansion steps do not resolve the anomaly.
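A simplified sketch of such a monitor appears below. The `alert` and `scale_up` callbacks are hypothetical hooks for the team’s paging and autoscaling systems, and the sliding-window logic is one possible design among many, not a prescribed implementation.

```python
import time
from collections import deque

class LatencyThroughputMonitor:
    """Sliding-window monitor: alert on every anomaly and attempt a
    pre-configured capacity expansion before escalating to humans."""
    def __init__(self, benchmark_ms, min_rps, alert, scale_up, window=300):
        self.benchmark_ms = benchmark_ms
        self.min_rps = min_rps
        self.alert = alert
        self.scale_up = scale_up
        self.samples = deque()       # (timestamp, latency_ms) pairs
        self.window = window         # window length in seconds

    def record(self, latency_ms, now=None):
        now = now or time.time()
        self.samples.append((now, latency_ms))
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        self._check()

    def _check(self):
        latencies = sorted(l for _, l in self.samples)
        throughput = len(latencies) / self.window
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        if p95 > self.benchmark_ms or throughput < self.min_rps:
            self.alert(f"p95={p95:.1f}ms, throughput={throughput:.2f} rps")
            self.scale_up()          # automated response; humans are also paged

# Usage: a single slow request breaches the latency benchmark.
monitor = LatencyThroughputMonitor(
    benchmark_ms=45.0, min_rps=0.0,  # min_rps=0 disables the throughput check here
    alert=print, scale_up=lambda: print("requesting additional replicas"))
monitor.record(60.0)
```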