Data Science Operations at Scale
This is article 5 of a 5-part series on data science operations.
Data science at small scale, say on a single data scientist's laptop, differs significantly from data science at scale. At scale, many data scientists contribute to a project, a large number of models are produced, or both. Firms therefore need to consider the foundational building-blocks necessary to support data science at scale. Current best practice suggests four major building-blocks for data science operations.
First, firms create an abstraction for computing, storage, and networking so that the data science operation is not tied to any particular private or public cloud, yet meets compliance and security requirements across whatever cloud infrastructure is used. Layered on this abstraction are management functions such as logging, analytics, and dashboards. Finally, multiple data science programming languages and related libraries are supported to give data scientists flexibility in model development. Together, these layers remove the need for snowflake data science environments: the officially supported environment meets the data scientists' needs. Much of this abstraction can be built using Docker containers, which install on practically any computer. While Docker containers are sometimes thought of as lightweight virtual machines, they also work well for software packaging and delivery [1, 2].
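As a concrete illustration, the sketch below uses the Docker SDK for Python to build and launch a containerized model-serving environment. The image name, directory layout, and port are hypothetical placeholders, not part of any stack described in this series.

```python
# A minimal sketch of packaging and launching a model-serving image with the
# Docker SDK for Python. Image name, build directory, and port are hypothetical.
import docker

client = docker.from_env()

# Build an image from a directory whose Dockerfile installs the team's
# officially supported data science languages and libraries.
image, _build_logs = client.images.build(path="./model-serving", tag="acme/model-server:1.0")

# Run the container, exposing a (hypothetical) scoring endpoint on port 8080.
container = client.containers.run(
    "acme/model-server:1.0",
    detach=True,
    ports={"8080/tcp": 8080},
)
print(container.id)
```

Because the same image runs unchanged on a laptop, a private data center, or a public cloud, it is one way to realize the compute abstraction described above.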
Second, firms develop policies that enforce uniformity in the development of prototype and production models. These policies address the infrastructure, model versioning, model release notes, model descriptions, usage and performance statistics, and model interfaces to multiple data science languages. Following these policies enables data science teams to maintain their models better over the long term. Many of these policies can be implemented by keeping the models and model artifacts, such as release notes and usage statistics, in a source control system [3, 4, 5].
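One way such a policy might be automated is sketched below: a small Python helper that stores a serialized model together with its release notes and performance statistics, then commits both to source control. The repository layout, file names, and metadata fields are assumptions for illustration.

```python
# A minimal sketch of versioning a model and its artifacts in source control.
# Paths, metadata fields, and repository layout are hypothetical.
import json
import subprocess
from pathlib import Path

def register_model(model_path: str, version: str, release_notes: str, metrics: dict) -> None:
    artifact_dir = Path("models") / version
    artifact_dir.mkdir(parents=True, exist_ok=True)

    # Store the serialized model next to its metadata.
    (artifact_dir / "model.bin").write_bytes(Path(model_path).read_bytes())

    # Record release notes, description, and performance statistics together.
    metadata = {"version": version, "release_notes": release_notes, "metrics": metrics}
    (artifact_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Commit model and artifacts so every release is reproducible and auditable.
    subprocess.run(["git", "add", str(artifact_dir)], check=True)
    subprocess.run(["git", "commit", "-m", f"Release model {version}"], check=True)

register_model("trained/model.bin", "1.4.0",
               "Retrained on Q3 data", {"auc": 0.91, "p99_latency_ms": 42})
```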
Third, firms define a model certification and live-testing process that every candidate model must go through before deployment. This process characterizes not just the prediction performance of candidate models but also their latency and throughput. Further, documented upgrade and downgrade procedures are backed by tools that can seamlessly replace one model version with another. Together, the certification, live-testing, and upgrade processes ensure that only thoroughly vetted, high-quality models are deployed into production. They also ensure that if a model fails to perform adequately in production, a downgrade procedure is ready to take it out of service. Kubernetes, an orchestration platform compatible with Docker, can automate much of the model deployment process when coupled with Docker containers. If individual models and their dependencies are containerized, Kubernetes (with some additional scripting logic) can orchestrate pulling the correct model from the source control system and driving the ensuing upgrade/downgrade cycle [6, 7, 8].
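A minimal sketch of one such upgrade/downgrade step is shown below, using the official Kubernetes Python client to swap the image of a model-serving Deployment. The deployment name, namespace, container name, and image tags are hypothetical.

```python
# A minimal sketch of an upgrade/downgrade step with the Kubernetes Python
# client: patching the image of a model-serving Deployment. Names are hypothetical.
from kubernetes import client, config

def set_model_image(image: str, deployment: str = "model-server",
                    namespace: str = "data-science") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {
        "containers": [{"name": "model-server", "image": image}]}}}}
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

# Upgrade to a certified candidate, then roll back if live testing fails.
set_model_image("acme/model-server:1.5.0-rc1")
# ... run live tests against the candidate ...
set_model_image("acme/model-server:1.4.0")
```

The same patch call serves both the upgrade and the downgrade path, which is what makes the documented rollback procedure cheap to exercise.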
Fourth, firms create a system to monitor anomalies in performance, covering both prediction quality and latency/throughput. Whenever an anomaly is noticed, the data science team is alerted immediately. Simultaneously, for some predefined classes of anomalies, such as latency and throughput violations, the system attempts automatic remediation. If remediation is unsuccessful, additional alerts are issued and operator intervention is requested. A previous stable version of the model is used as a canary to detect prediction under-performance; manual intervention is almost always required when the canary's results differ significantly from the production model's [9, 10]. In some cases, such as when live data differs from the training data, the model in question may have to be redesigned altogether. Regardless of the anomaly type, a comprehensive monitoring and response system limits the time over which anomalies go unnoticed, and thereby the economic damage they cause.
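The sketch below illustrates one possible shape of such a monitor: it compares the stable canary's predictions with the production model's on the same live traffic and checks a latency objective. The thresholds, the alert hook, and the synthetic traffic are assumptions for illustration.

```python
# A minimal sketch of canary-based monitoring: flag divergence between the
# production model and a stable canary, and check a latency objective.
# Thresholds and the alert hook are hypothetical.
import numpy as np

DIVERGENCE_THRESHOLD = 0.05   # assumed acceptable mean absolute score difference
LATENCY_SLO_MS = 100.0        # assumed p99 latency objective

def predictions_agree(prod_scores: np.ndarray, canary_scores: np.ndarray) -> bool:
    divergence = float(np.mean(np.abs(prod_scores - canary_scores)))
    return divergence <= DIVERGENCE_THRESHOLD

def latency_ok(latencies_ms: np.ndarray) -> bool:
    return float(np.percentile(latencies_ms, 99)) <= LATENCY_SLO_MS

def alert(message: str) -> None:
    # Placeholder for paging / dashboard integration.
    print(f"ALERT: {message}")

# Example monitoring pass over one window of (synthetic) live traffic.
prod = np.random.rand(1000)
canary = prod + np.random.normal(0, 0.01, size=1000)
latencies = np.random.gamma(shape=2.0, scale=20.0, size=1000)

if not predictions_agree(prod, canary):
    alert("Canary and production predictions diverge; manual review needed")
if not latency_ok(latencies):
    alert("p99 latency above SLO; attempting automatic remediation")
```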
Together, these four building-blocks provide “belts-and-suspenders” for every stage of a model’s life cycle. If any step in the model development or deployment process fails, the failure is visible, and a predetermined next step, whether automated or manual, can be taken. Further, the building-blocks provide a foundation on which models can be revised quickly and deployed efficiently. Feedback from prior deployments and new data science techniques can be incorporated into models without much fear of performance regression. Finally, if implemented correctly, these building-blocks operate at scale, with large data science teams and thousands of models and model versions passing through the system every year. The ultimate effect is higher data science velocity, which gives firms a vector of differentiation over their competitors.
References
- Model as a service: for real-time decisioning, Niraj Tank et al., 2019
- Scaling Model Training: from flexible training APIs to resource management with Kubernetes, Kelley Rivoire, 2019
- AI pipelines powered by Jupyter notebooks, Luciano Resende, 2019
- Kubernetes for Machine Learning: productivity over primitives, Sophie Watson et al., 2019
- How to Monitor Machine Learning Models in real-time, Ted Dunning, 2019