What is MLOps?
MLOps, or Machine Learning Operations, sits at the intersection of data science, data engineering, IT infrastructure and ML engineering. It optimises the deployment and management of machine learning models, ensuring their scalability, reliability, and efficiency in real-world business applications.
Over the past five years, data created and captured worldwide has grown at roughly 29% compounded annually. This surge reflects a concerted effort in data collection, enabling enterprises to become truly data driven and to apply ML in every department (planning, sales, marketing, operations, HR, etc.) for every viable use case. As the number of use cases grows, so does the number of models and ML systems an ML team has to build, deploy and monitor. By automating and standardising processes such as model development, deployment, monitoring, and optimisation, MLOps empowers organisations to derive maximum value from their AI investments.
Enterprises apply MLOps principles across the following parts of a data science project (a minimal pipeline sketch covering several of these steps follows the list):
- Data preparation
- Feature engineering
- Model training and hyperparameter optimisation
- Model evaluation and model retraining
- Model monitoring and drift detection
- Model versioning
- Experiment tracking
- Inference endpoint creation and maintenance
- CI/CD
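To make these steps concrete, below is a minimal sketch of a SageMaker pipeline that chains data preparation, model training and model versioning. It assumes a placeholder preprocessing script (preprocess.py), an IAM role ARN, an S3 bucket and a model package group that would need to exist in your account; it is an illustration of the pattern rather than a production-ready pipeline.

```python
# A minimal SageMaker Pipelines sketch: data preparation -> training -> model
# versioning in the Model Registry. All names, paths and the IAM role are
# hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# 1. Data preparation / feature engineering step
processor = SKLearnProcessor(
    framework_version="1.2-1", instance_type="ml.m5.xlarge", instance_count=1, role=role
)
prep_step = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="preprocess.py",  # placeholder script
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# 2. Model training step using the built-in XGBoost container
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    output_path="s3://my-bucket/models/",  # placeholder bucket
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb,
    inputs={"train": TrainingInput(
        prep_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        content_type="text/csv")},
)

# 3. Version the trained model in the SageMaker Model Registry
register_step = RegisterModel(
    name="RegisterModel",
    estimator=xgb,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="demand-forecast-models",  # placeholder group
)

pipeline = Pipeline(name="example-mlops-pipeline", steps=[prep_step, train_step, register_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # run one execution
```

Experiment tracking, monitoring, drift detection and CI/CD would be layered on top of a skeleton like this, as discussed in the sections below.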
When should an enterprise think about their MLOps strategy?
An enterprise's ML journey typically begins with a single use case, triggered by excessive human effort, potential accuracy gains from ML, or the need for AI/ML to address scalability challenges. At this juncture, data scientists develop a proof of concept (POC) in a Jupyter notebook, either locally or on a cloud-based platform. Once the data science and business teams have gauged the impact of the ML solution and decided to deploy it, the enterprise should pivot its focus towards formulating an MLOps strategy.
Timely implementation of MLOps ensures several benefits:
- Maximisation of data scientists' bandwidth for enhancing existing solutions or creating new ones, rather than providing manual support.
- Ensuring scalability, reliability, and efficiency of models through automated training, fine-tuning, and monitoring.
- Efficient management of multiple models, mitigating the risk of human error.
While this process may seem daunting, enterprises can adopt our best practices guide (discussed below) to initiate their MLOps journey seamlessly and in a phased manner.
What are some best practices for MLOps?
- Phased implementation: Instead of rushing to implement every step at once, MLOps principles can be adopted in phases depending on the use case. For example, if feature engineering is simple but you need to train and maintain multiple models, you can perform feature engineering with plain Python scripts and use SageMaker training jobs to train the models (see the first sketch after this list).
- Resource selection and optimisation:
- Consider CPUs over GPUs for cost-sensitive scenarios, unless specific model computations clearly benefit from GPU acceleration.
- Improperly configured autoscaling can lead to resource wastage, with the number of instances remaining high even under low demand. Prefer step scaling over target tracking scaling to ensure efficient upscaling and downscaling with demand, optimising resource usage and costs (see the autoscaling sketch after this list).
- Model choice: Use common ML frameworks (e.g., PyTorch, XGBoost) for model packaging to avoid compatibility issues, and refrain from custom formats that are not widely supported. Avoid custom or lesser-known libraries built on top of common ML frameworks.
- Enhance logging and monitoring: Create dashboards and metric tables for each step in the MLOps pipeline for real-time monitoring and alerting, and implement custom logging to capture runtime errors, so that multiple models can be scaled effectively in production.
- Version everything: Leverage Git for code version control, a model registry for managing and versioning model artifacts, and endpoint versioning to ensure smooth code management, efficient model rollbacks and updates, and streamlined deployment.
- Modularisation and reusability: Package the scripts used in processing steps as standalone microservices and avoid duplicating elements in SageMaker pipelines; this makes updates and scaling easier while keeping the setup efficient and simple. You can use Amazon SageMaker Feature Store to ensure data consistency and reduce data preparation overhead.
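As a sketch of the phased approach above, the snippet below keeps feature engineering in plain Python and launches one SageMaker training job per model. The S3 paths, the train.py entry point, the IAM role and the segment names are hypothetical placeholders.

```python
# Phased approach sketch: feature engineering stays in plain Python, while each
# model is trained with its own SageMaker training job. All paths, the entry
# script and the hyperparameters are illustrative placeholders.
import pandas as pd
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Plain-Python feature engineering, run wherever is convenient
df = pd.read_csv("s3://my-bucket/raw/sales.csv")  # placeholder path (requires s3fs)
df["revenue_per_unit"] = df["revenue"] / df["units"]
df.to_csv("s3://my-bucket/features/sales_features.csv", index=False)

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# One training job per model / market segment
for segment in ["north", "south", "east"]:
    estimator = SKLearn(
        entry_point="train.py",            # placeholder training script
        framework_version="1.2-1",
        instance_type="ml.m5.xlarge",
        role=role,
        hyperparameters={"segment": segment},
    )
    estimator.fit(
        {"train": "s3://my-bucket/features/sales_features.csv"},
        job_name=f"sales-model-{segment}",
        wait=False,                        # launch the jobs in parallel
    )
```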
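And as a sketch of the autoscaling practice above, the following configures step scaling for a SageMaker endpoint variant via Application Auto Scaling. The endpoint name, capacity limits and step thresholds are illustrative and would need tuning for real traffic patterns.

```python
# Step scaling sketch for a SageMaker real-time endpoint using Application Auto
# Scaling; the endpoint/variant names and thresholds are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder endpoint

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=6,
)

# Step scaling policy: add more capacity the further the metric breaches the alarm
autoscaling.put_scaling_policy(
    PolicyName="step-scale-on-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 300,
        "MetricAggregationType": "Average",
        "StepAdjustments": [
            # bounds are relative to the CloudWatch alarm threshold
            {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 200.0, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 200.0, "ScalingAdjustment": 3},
        ],
    },
)
```

The step policy itself is triggered by a CloudWatch alarm (for example, on the endpoint's invocation count), which gives you explicit control over how aggressively capacity is added or removed at each breach level.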
What are the options available for MLOps implementation?
Depending on your existing data platform, you can leverage your current technology provider, such as AWS or Databricks, to apply MLOps principles. You can also use open source tools such as MLflow, Kubeflow and Seldon Core.
A brief summary of what the MLOps stack looks like on different data platforms:
- AWS: AWS provides a wide range of tools and services through Amazon SageMaker. You can use SageMaker models, fine-tuning, batch processing, real-time inference, Feature Store, Pipelines, etc. to manage an ML project's life cycle, and easily integrate other AWS services such as Lambda, Athena, S3 and CloudWatch for additional features and finer control.
- Azure: Azure provides multiple services and tools on the Azure ML platform to implement MLOps. Azure Machine Learning designer can be used to create machine learning pipelines, and Azure Machine Learning environments can be used to manage a project's software dependencies. Azure also offers model versioning through the Azure Machine Learning model registry. Models can be deployed to the cloud or locally using real-time inference endpoints, and batch predictions can be generated using batch inference endpoints. Azure Event Grid can be used to trigger alerts and act automatically on events such as model retraining, model deployment and data drift.
- Google Cloud Platform: MLOps can be implemented on the GCP Vertex AI platform in much the same way as on AWS SageMaker and Azure ML. Vertex AI Feature Store can be used to standardise features, Vertex AI Experiments to track and analyse different models and hyperparameters, Vertex AI Pipelines to automate, govern and monitor ML workflows, the Vertex AI Model Registry to manage models effectively, and Vertex AI Model Monitoring to watch production models for data drift.
- Databricks: Databricks frames MLOps as the combination of DevOps, DataOps and ModelOps. It leverages open source tools such as Git for DevOps, Delta Lake for DataOps and MLflow for ModelOps. Databricks also recommends separate development, staging and production environments: Git stores the code and pipelines for each environment, raw data and feature tables live in Delta tables, and model development, parameters, metrics and code snapshots are tracked with MLflow.
- Open source: Various open source tools can be used standalone or in combination to implement the different steps of MLOps. For example, Kubeflow can be used to orchestrate workflows, deploy on any infrastructure, scale on demand and manage multiple microservices; MLflow can be used for experiment tracking; and tools such as Apache Airflow or Flyte can be used to create and schedule batch workflows (a minimal tracking example follows this list).
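As a minimal example of open source experiment tracking, the sketch below logs parameters, a metric and a model artifact with MLflow; the model, experiment name and metric are placeholders chosen purely for illustration.

```python
# Minimal MLflow experiment-tracking sketch; the model, parameters and metric
# are placeholders chosen for illustration.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-baseline")  # placeholder experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)                                                 # hyperparameters
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))  # evaluation
    mlflow.sklearn.log_model(model, "model")                                  # versioned artifact
```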
Implementation examples
For a telco customer who needed real-time personalised recommendations for a large user base, we leveraged Amazon SageMaker to manage MLOps end to end. We supported multiple ML models for different business objectives with continuous A/B testing.
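A common way to run such continuous A/B tests on SageMaker is to deploy two model versions as weighted production variants behind a single endpoint. The sketch below is illustrative; the model names, endpoint name and traffic weights are placeholders, not the customer's actual configuration.

```python
# Sketch of continuous A/B testing on one SageMaker endpoint using two weighted
# production variants; all names and weights are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="recsys-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "champion",
            "ModelName": "recsys-model-v1",   # placeholder model
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.8,      # ~80% of traffic
        },
        {
            "VariantName": "challenger",
            "ModelName": "recsys-model-v2",   # placeholder model
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,      # ~20% of traffic
        },
    ],
)

sm.create_endpoint(
    EndpointName="recsys-ab-test",
    EndpointConfigName="recsys-ab-test-config",
)
```

Traffic weights can later be shifted (for example, promoting the challenger) with the SageMaker update_endpoint_weights_and_capacities API, without redeploying the models.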
For a retail customer who needed a promotions-planning tool for their business teams, we deployed deep learning models using Amazon SageMaker so that business users can either generate results in bulk or run what-if analyses in real time.
A simplified version of the architecture diagram is as follows:
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2Fb1ff5001-854f-4368-9657-2bd84a190902%2FUntitled.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=8f605cda-5ff5-4911-9808-721ace9d9d56&cache=v2)
Data Integration and Preprocessing:
- AWS Glue: Automated the extraction, transformation, and loading (ETL) processes, enabling seamless integration of data from various sources into a centralized data lake stored in Amazon S3.
- Amazon Athena: Provided a serverless querying service, allowing Polaris to run ad-hoc queries against the consolidated data in Amazon S3, significantly speeding up data analysis.
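For illustration, an ad-hoc query against the S3 data lake might look like the sketch below; the database, table and result location are hypothetical placeholders.

```python
# Sketch of an ad-hoc Athena query against the S3 data lake; the database,
# table and output location are placeholders.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM transactions GROUP BY customer_id LIMIT 100",
    QueryExecutionContext={"Database": "datalake"},                          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```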
Model training and deployment:
- Feature Store: We leveraged SageMaker Feature Store to maintain a repository of curated features needed for model predictions, ensuring data consistency across the training and prediction phases (a retrieval sketch follows this list).
- We used Amazon SageMaker for the entire machine learning lifecycle, from model training with historical data to deploying these models for real-time inference.
- We leveraged SageMaker Model Registry to manage model versions, facilitating easier rollback and model version comparison.
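To illustrate how the Feature Store keeps training and inference consistent, the sketch below fetches a curated feature record at prediction time; the feature group, feature names and record identifier are hypothetical placeholders.

```python
# Sketch of fetching curated features from SageMaker Feature Store at prediction
# time, so training and inference share the same feature definitions. The feature
# group, feature names and record identifier are placeholders.
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

record = featurestore.get_record(
    FeatureGroupName="customer-features",           # placeholder feature group
    RecordIdentifierValueAsString="customer-1234",  # placeholder customer id
    FeatureNames=["tenure_months", "avg_monthly_spend", "num_active_lines"],
)

# Convert the returned feature list into a plain dict for the model payload
features = {item["FeatureName"]: item["ValueAsString"] for item in record["Record"]}
```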
Post-deployment:
- Model Monitoring and Autoscaling: We utilised Amazon CloudWatch to monitor model performance and set up auto-scaling for the SageMaker endpoints, ensuring that model inference remains responsive under varying loads (an alarm sketch follows this list).
- Automated Testing and Monitoring: Integrated testing and monitoring into the CI/CD pipelines, ensuring model accuracy and infrastructure reliability through automated checks.
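As one concrete example of the monitoring above, the sketch below raises a CloudWatch alarm when endpoint latency stays high; the endpoint, threshold and SNS topic are hypothetical placeholders.

```python
# Sketch of a CloudWatch alarm on SageMaker endpoint latency; names, thresholds
# and the SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="recsys-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "recsys-ab-test"},  # placeholder endpoint
        {"Name": "VariantName", "Value": "champion"},
    ],
    Statistic="Average",
    Period=60,                       # evaluate over 1-minute windows
    EvaluationPeriods=5,             # alarm after 5 consecutive breaches
    Threshold=200_000,               # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:mlops-alerts"],  # placeholder topic
)
```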
In addition, we implemented Continuous Integration and Continuous Deployment (CI/CD) pipelines using AWS CodePipeline and AWS CodeBuild, automating the deployment of machine learning models and infrastructure updates, and used AWS CodeCommit for version control of our code base.