MLOps: Scaling Machine Learning Lifecycle in an Industrial Setting

Machine learning has evolved from an area of academic research into an applied, real-world field. This change brings challenges: gaps and differences exist between common practices in academic environments and those in production environments. Following continuous integration, development, and delivery practices in software engineering, similar trends have emerged in machine learning (ML) systems under the name MLOps. In this paper, we propose a framework that streamlines the ML lifecycle in an industrial setting and introduces best practices that facilitate it. The framework can serve as a template that is customized to implement various machine learning experiments. It is modular and can be recomposed to suit different use cases (e.g., data versioning, remote training in the cloud). The framework inherits practices from DevOps and introduces others that are unique to machine learning systems (e.g., data versioning). Our MLOps practices automate the entire machine learning lifecycle and bridge the gap between development and operations.




