There is a good summary of existing production ML tools at https://github.com/EthicalML/awesome-production-machine-learning

How to choose

  • MLFlow if you care more about tracking experiments, or about tracking and deploying models using MLFlow’s predefined patterns, than about finding a tool that can adapt to your existing custom workflows.
    • MLeap: the mlflow.mleap module provides an API for saving Spark MLlib models using the MLeap persistence mechanism.
  • Kubeflow if you want to use Kubernetes but still define your tasks with Python instead of YAML. It ties you more to the Google ecosystem.
  • SageMaker Pipelines: a continuous integration and continuous delivery (CI/CD) service for ML. It ties you to the existing AWS SageMaker stack (and the AWS ecosystem): notebooks, jobs, endpoint serving.

ZenML – An extensible open-source MLOps framework to create reproducible pipelines. https://zenml.io/

Dagster.io – Data orchestrator for machine learning, analytics, and ETL.

MLRun – Generic mechanism for data scientists to build, run, and monitor ML tasks and pipelines. https://github.com/mlrun/mlrun

Kubeflow vs. MLFlow

Kubeflow and MLFlow are both smaller, more specialized tools than general task orchestration platforms such as Airflow or Luigi. Kubeflow relies on Kubernetes, while MLFlow is a Python library that helps you add experiment tracking to your existing machine learning code. Kubeflow lets you build a full DAG where each step is a Kubernetes pod, while MLFlow has built-in functionality to deploy your scikit-learn models to Amazon SageMaker or Azure ML.

Use Kubeflow if you want to track your machine learning experiments and deploy your solutions in a more customized way, backed by Kubernetes. Use MLFlow if you want a simpler approach to experiment tracking and want to deploy to managed platforms such as Amazon SageMaker.
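
To give a feel for how lightweight MLFlow’s tracking API is, here is a minimal sketch; the dataset, model, and hyperparameters are just illustrative:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)

    model = RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Log the model artifact; this is what can later be pushed to a managed
    # platform such as SageMaker or Azure ML
    mlflow.sklearn.log_model(model, "model")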

DeltaLake

This should have been in the previous data-processing-tools post, but it’s also tightly coupled with ML concepts.

Also, ML data management could expand into a giant topic of its own…

Conditional update without overwrite

Delta Lake provides programmatic APIs to conditionally update, delete, and merge (upsert) data into tables. Here are a few examples.

  • @Xin: this is important for real-life daily operations. As new data is refreshed and streamed in for further processing/training, continuous integration for data is also important (not only CI/CD for code deployment).
import io.delta.tables._
import org.apache.spark.sql.functions.col

// newData: a DataFrame holding the refreshed records, keyed on `id`
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.as("oldData")
  .merge(
    newData.as("newData"),
    "oldData.id = newData.id")
  .whenMatched
  .update(Map("id" -> col("newData.id")))   // matching ids get updated
  .whenNotMatched
  .insert(Map("id" -> col("newData.id")))   // new ids get inserted
  .execute()

Read older versions of data using time travel

You can query previous snapshots of your Delta table using time travel. If you want to access data that you have since overwritten, query a snapshot of the table from before the overwrite using the versionAsOf option.

  • @Xin: also important for reproducing bugs and debugging data issues.
# Load the table as it was at version 0; option("timestampAsOf", ...) works similarly
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
df.show()

AWS-specific orchestration

Glue + SageMaker

Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

  • Glue triggers (see the boto3 sketch after this list):
    • Scheduled: a time-based trigger based on a cron expression.
    • Conditional: a trigger that fires when a previous job or crawler, or multiple jobs or crawlers, satisfy a list of conditions.
    • On-demand: a trigger that fires when you activate it.
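
For reference, a minimal boto3 sketch of the scheduled and conditional flavors; the trigger and job names are hypothetical:

import boto3

glue = boto3.client("glue")

# Scheduled: run a (hypothetical) data-prep job every day at 03:00 UTC
glue.create_trigger(
    Name="nightly-prep",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "prepare-training-data"}],
    StartOnCreation=True,
)

# Conditional: start model training only after the prep job succeeds
glue.create_trigger(
    Name="train-after-prep",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "prepare-training-data",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "train-model"}],
    StartOnCreation=True,
)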

[2021-Release] Amazon SageMaker Pipelines

Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines

A purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration.

  • @Xin: feels like a mirror of MLFlow, even down to the component names; quote: “Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects.”
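
A minimal one-step pipeline sketch using the SageMaker Python SDK; the role ARN, image URI, and S3 paths are placeholders:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

estimator = Estimator(
    image_uri="<training-image-uri>",             # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/model-artifacts",  # placeholder
)

train_step = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://<bucket>/train")},
)

pipeline = Pipeline(name="demo-pipeline", steps=[train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off an execution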

Automated: MLOps templates and integrations

SageMaker projects introduce MLOps templates that automatically provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle.

MLOps

Uber ML infrastructure - Michelangelo

As I mentioned in a previous post: Uber Machine Learning Platform - Michelangelo

Unfortunately, this tool is not open sourced; only the high-level architecture has been shared publicly.

It has a Python ML model development module, PyML: https://eng.uber.com/michelangelo-pyml/

  • @Todo

Netflix Orchestration - Metaflow

[Netflix] Genie

https://github.com/Netflix/genie

Genie is a federated Big Data orchestration and execution engine developed by Netflix.

[Netflix] Metaflow

https://metaflow.org/

Metaflow provides a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production.

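A minimal flow sketch: each @step runs as a task, and self.* attributes are persisted as artifacts between steps (the training logic here is a stub):

from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.01  # artifact, automatically passed downstream
        self.next(self.train)

    @step
    def train(self):
        # stand-in for real training code
        self.score = 0.9
        self.next(self.end)

    @step
    def end(self):
        print(f"finished, score={self.score}")

if __name__ == "__main__":
    TrainFlow()

Run it locally with python train_flow.py run; the same file can be scaled out to the cloud (e.g. --with batch) without code changes.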

[Netflix] Metaflow Metadata Service

https://github.com/Netflix/metaflow-service

Facebook - dagger

https://github.com/facebookresearch/dagger

dagger: A Python Framework for Reproducible Machine Learning Experiment Orchestration

Seldon

Seldon makes MLOps software for enterprise deployment of machine learning models.

Seldon Core

An open source platform to deploy your machine learning models on Kubernetes at massive scale.

https://github.com/SeldonIO/seldon-core
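
The core Python-wrapper convention is a class exposing predict(); Seldon’s build images wrap it in a REST/gRPC microservice, and a SeldonDeployment resource then schedules it on Kubernetes. A sketch (the model file name is a placeholder):

import joblib

class MyModel:
    """Instantiated once per container; predict() serves each request."""

    def __init__(self):
        self.model = joblib.load("model.joblib")  # placeholder artifact

    def predict(self, X, features_names=None):
        # X arrives as a numpy array decoded from the request payload
        return self.model.predict(X)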

Example design for a startup

https://towardsdatascience.com/designing-ml-orchestration-systems-for-startups-202e527d7897


References