Infrastructure for Data Science
This is article 1 of a 5-part series on data science operations. The series was originally written in September 2019 but is being posted in November 2020.
Today, the underlying infrastructure supporting data science operations is often ad-hoc compared to traditional software development and deployment. Solutions to some of the data science infrastructure challenges, those that are also encountered in conventional software development, are well understood. However, other infrastructure challenges are unique to data science and need additional attention.
Solutions from software engineering
Data science operations share three broad classes of challenges with software engineering organizations: infrastructure abstraction, infrastructure management, and enterprise readiness 1, 2, 3, 4.
Many firms have heterogeneous compute, storage, and networking environments. This heterogeneity is not only due to the use of multiple clouds, whether public or private, but also due to various computing models such as virtual machines, containers, and serverless-computing. Dealing with heterogeneity is made easier by inserting an abstraction layer on top of the compute infrastructure that allows the data scientist (or equivalently, software engineer) to carry out computations largely independently from the underlying infrastructure.
Many firms also implement a management layer on top of their compute-infrastructure to manage and maintain this compute-infrastructure. Such a layer includes logging, analytics, and dashboards. Ideally, the dashboard capability allows for the management of the infrastructure and the creation of additional specialized dashboards for the users of the infrastructure.
Enterprise-class firms typically solve for an additional level of common problems on top of their compute-infrastructure. This includes security of the infrastructure, compliance with required external standards such as HIPAA and PCI-DSS, and internal policies and readiness for future audits. If the firm is large enough, it may also need to provision for charging back usage fees to sub-organizations using the infrastructure, in proportion to the sub-organizations’ use of the infrastructure.
The commonality of problems between data science and software engineering suggests that Data Scientists should not solve infrastructure abstraction, enterprise readiness, and infrastructure management problems independently. Since well studied and well-documented solutions to these problems exist in traditional software engineering, Data Scientists should partner with Infrastructure-IT and Dev-ops teams to implement (previously) well-known solutions for data science operations.
Problems unique to Data Science Of course, data science operations face some unique infrastructure problems that do not directly parallel software engineering.
The first of these problems is support for the toolchains needed for data science. There are many different and fast-growing data science tools today, each with their universe of libraries, modules, and connectors. Each of these, in turn, have multiple, sometimes incompatible versions forcing additional maintenance requirements. While traditional software engineering has similar problems, data science differs from software engineering today with respect to the magnitude of the problem: there are several parallel toolchains (for Python, R, SAS, SPSS, Matlab, to name a few) needed simultaneously in the same firm to accommodate Data Scientists’ preferences, with each toolchain evolving rapidly and independently 1.
The second problem has to do with GPU support. Many of the data science tools have specialized libraries for faster execution on GPUs. Data Scientists’ preference for this specialized support further compounds the challenge of maintaining multiple rapidly evolving tools 1.
Docker containers, Kubernetes, and Kubeflow
Data Scientists can manage the infrastructure complexity by using open-source tools such as Docker, Kubernetes, and Kubeflow 5, 6. Docker containers are helpful in packaging data science models with their associated toolchains, thereby reducing dependencies. Once packaged inside a container, the model can be deployed over Kubernetes. Kubernetes abstracts away the underlying cloud (or computer hardware) details, making the Docker container even more portable. Kubernetes and Docker reduce the variability in the model’s environment, thereby making the model (more) production-ready. Finally, Data Scientists can use Kubeflow, rather than their own scripts, to orchestrate, deploy and run data science workloads over Kubernetes and Docker.
Docker, Kubernetes, and Kubeflow make for a nice toolchain in their own right. Docker interoperates well with Kubernetes, and the two tools are often used together (including outside of Data Science contexts). Kubeflow is more recent, but it was explicitly developed to manage data science pipelines that use Kubernetes. The three tools together provide data science models with a stable and reproducible platform to run on.
Pay-back
Implementing the various abstractions alluded to above requires some upfront investment in setting up the data science infrastructure. However, dedicating resources towards maintaining a comprehensive set of data science tools pays back over time as research models do not need to be rewritten using a different toolchain before deployment, reducing errors, and unexpected behavior. Further, to the extent that the same data science infrastructure is used for research and production, less customization is needed to promote a research model to production, resulting in a shorter time to market. Finally, the use of a well-maintained data science infrastructure across a firm leads to economies of scale - there are fewer sub-scale snowflake data science environments chipping away at the firm’s budget.