Model Development and Maintenance
This is article 2 of a 5-part series on data science operations.
As the use of data science has grown, Data Scientists have found themselves rushing models to production to respond to business needs. In many cases, the models serve the immediate business need but prove difficult to maintain. As firms mature in their use of data science, they need to develop systems to manage the life cycle of the models they deploy.
Initial creation
Most models start in a prototype or research phase, where Data Scientists experiment with the data, the model choices, and the model parameters. Data Scientists often carry out these experiments on laptops or in other non-standard computing environments, inadvertently creating technical debt that must be paid down before the models can deliver business value in production. Where Data Scientists have access to well-maintained Data Science infrastructure, using it for the initial experimentation can reduce or eliminate this debt at low cost. The infrastructure enforces standardized computing primitives, programming language versions, and library versions, which lowers the barrier to eventual deployment.
Similarly, a model is often needed to power a software application. Pressed for time, Data Scientists tightly couple their model to the application in question. Now that applications are often hosted in the cloud or have access to cloud resources at runtime, it is advantageous to separate the model from the application so that the model can have a life cycle of its own. Decoupling allows the model to be improved independently of the application, and it also promotes future reuse of the model.
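One way to picture this decoupling is an application that depends only on a narrow prediction interface rather than on a specific model. The sketch below is illustrative, not a prescribed design: the names Predictor, LocalLinearModel, and should_alert are hypothetical, and the in-process linear model stands in for whatever would sit behind the interface in practice, such as a client for a hosted model service.

```python
from typing import Protocol, Sequence


class Predictor(Protocol):
    """Hypothetical contract: the only thing the application knows about a model."""

    def predict(self, features: Sequence[float]) -> float:
        ...


class LocalLinearModel:
    """Stand-in model for illustration; a remote-service client could replace it."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def predict(self, features: Sequence[float]) -> float:
        # Weighted sum of the features plus a bias term.
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def should_alert(model: Predictor, features: Sequence[float], threshold: float = 0.5) -> bool:
    """Application logic: uses the Predictor contract, never the model internals."""
    return model.predict(features) >= threshold
```

Because should_alert accepts anything that satisfies Predictor, a retrained or entirely different model can be substituted without changing the application code.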
Subsequent updates
In many ways, data science models are like software libraries: they evolve as Data Scientists incorporate feedback from users and respond to changes in the data or in the computing infrastructure (such as the availability of new modules). As a model evolves, it is logical to treat each new version of the model as one would treat a new version of a library.
It is beneficial to version (or label) each release of a model and to associate release notes with that version. Doing so makes the contents of any particular version explicit. It also gives Data Scientists, software engineers, and DevOps teams a clean way to reference the appropriate model version for each use, such as testing or deployment to production.
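As a rough illustration of versioned releases with attached notes, the sketch below models a release as a small immutable record and looks up the newest release for a given use. Everything here is an assumption made for the example: the ModelRelease fields, the three-part version number, the stage names, and the latest helper are hypothetical rather than part of any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ModelRelease:
    """Hypothetical record tying a model version to its release notes."""

    name: str
    version: Tuple[int, int, int]  # assumed (major, minor, patch) scheme
    notes: str
    stage: str  # assumed labels, e.g. "testing" or "production"

    @property
    def label(self) -> str:
        # Human-readable reference such as "churn-v1.10.0".
        return f"{self.name}-v{'.'.join(map(str, self.version))}"


def latest(releases: Iterable[ModelRelease], name: str, stage: str) -> ModelRelease:
    """Return the highest-versioned release of `name` in the given stage."""
    candidates = [r for r in releases if r.name == name and r.stage == stage]
    if not candidates:
        raise LookupError(f"no {stage} release found for model {name!r}")
    # Tuples compare element by element, so (1, 10, 0) ranks above (1, 2, 0).
    return max(candidates, key=lambda r: r.version)
```

Storing the version as a tuple of integers rather than a string avoids the ordering problem in which "1.10.0" would sort before "1.2.0".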
Reuse
Model development and maintenance are expensive, given the typical cost of human resources, data sources, and computing infrastructure. It therefore makes sense to extend the useful life of models that have already been developed by promoting their reuse.
A key aspect of model reuse is model discoverability. Model versioning and release notes help discoverability, but most firms with large or growing data science teams need to go further by creating a single, visible repository of models. It also helps to attach additional metadata to each model version, indicating the amount of production use and the runtime performance, so that other Data Scientists can make informed decisions about reusing existing models.
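A single repository with usage and performance metadata could look roughly like the in-memory sketch below. It is a toy under stated assumptions: the class names, the tag-based search, and the two metadata fields (production_calls_30d and median_latency_ms) are hypothetical choices made for illustration, and a real repository would need persistent storage and access control.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RegistryEntry:
    """Hypothetical catalog record for one version of a model."""

    name: str
    version: str
    description: str
    production_calls_30d: int = 0   # assumed measure of production use
    median_latency_ms: float = 0.0  # assumed measure of runtime performance
    tags: List[str] = field(default_factory=list)


class ModelRegistry:
    """Minimal in-memory catalog keyed by (name, version)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        key = (entry.name, entry.version)
        if key in self._entries:
            raise ValueError(f"{entry.name} {entry.version} is already registered")
        self._entries[key] = entry

    def search(self, tag: str) -> List[RegistryEntry]:
        """Return entries carrying `tag`, most heavily used first."""
        matches = [e for e in self._entries.values() if tag in e.tags]
        return sorted(matches, key=lambda e: e.production_calls_30d, reverse=True)
```

Sorting search results by production use is one possible way to surface models that have already proven themselves; latency or recency would be equally reasonable orderings.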
Another aspect of model reuse is the availability of appropriate interfaces to existing models in multiple programming languages. Providing an interface to each existing model, with corresponding stub code in various languages, lowers the barrier to experimenting with that model. Such stub code also encourages Data Scientists to reuse work produced in languages other than those they are familiar with.
In aggregate, adapting techniques developed for software engineering and applying them to data science models can make data science operations more efficient by lowering the hurdle to deploying new models and extending the useful life of existing ones.