Airflow: https://airflow.apache.org/

  • Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
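  • For a sense of the programming model, here is a minimal sketch of an Airflow DAG (Airflow 2.x is assumed; the DAG id, schedule, and task callables are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder: pull data from a source system
        return [1, 2, 3]

    def load():
        # Placeholder: write data to a destination
        print("loaded")

    # Tasks and their dependencies are declared in ordinary Python code
    with DAG(
        dag_id="example_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task  # load runs only after extract succeeds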

Argo Workflow: https://argoproj.github.io/projects/argo/

  • Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
  • Argo is the engine teams often turn to when they are already using Kubernetes.

Celery: https://docs.celeryproject.org/en/stable/

  • Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
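  • As a rough sketch, a Celery task is just a decorated Python function that workers consume from a message broker (the Redis broker URL below is an assumption; RabbitMQ, SQS, and others work too):

    from celery import Celery

    # Broker URL is illustrative; any supported broker can be used
    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def add(x, y):
        return x + y

    if __name__ == "__main__":
        # .delay() enqueues the task; a separate worker process executes it,
        # e.g. one started with: celery -A tasks worker --loglevel=INFO
        add.delay(2, 3)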

Dask: https://dask.org/

  • Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.
  • Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.
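  • As an illustration of the model, Dask mirrors the pandas API but builds a lazy task graph that is then executed in parallel (the file pattern below is hypothetical):

    import dask.dataframe as dd

    # Lazily reads many CSV files as one logical dataframe
    df = dd.read_csv("data/events-*.csv")

    # Operations only build a task graph; compute() triggers parallel execution
    daily_counts = df.groupby("day").size().compute()
    print(daily_counts)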

Luigi: https://github.com/spotify/luigi

  • Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
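  • In Luigi's model, each task declares its dependencies via requires() and its outputs via output(), and only tasks whose targets are missing get run; a minimal sketch (class names and file paths are illustrative):

    import datetime
    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"data/raw-{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            # Luigi resolves and runs this dependency first
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"data/clean-{self.date}.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())

    if __name__ == "__main__":
        luigi.build([Transform(date=datetime.date.today())], local_scheduler=True)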

Koalja: https://github.com/AljabrIO/koalja-operator

  • Koalja is a pipeline construction platform designed to make building data processing pipelines simple and easy.

Prefect: https://www.prefect.io/

  • Prefect is the new standard in dataflow automation, trusted to build, run, and monitor millions of data workflows and pipelines.
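  • A minimal sketch using the Prefect 1.x functional API (newer Prefect releases replace Flow with a @flow decorator); the task bodies are placeholders:

    from prefect import task, Flow

    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(records):
        print(f"loaded {len(records)} records")

    # Building the flow looks like ordinary Python; Prefect records the dependencies
    with Flow("etl") as flow:
        records = extract()
        load(records)

    if __name__ == "__main__":
        flow.run()  # local run; Prefect Cloud/Server add scheduling and monitoring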

Reflow: https://github.com/grailbio/reflow

  • Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.

Netflix Conductor: https://netflix.github.io/conductor/

  • Conductor is a Workflow Orchestration engine that runs in the cloud.

Hudi (Hadoop Upserts Deletes and Incrementals): https://github.com/apache/hudi

  • Apache Hudi ingests and manages storage of large analytical datasets over DFS (HDFS or cloud stores), acting as a data-lake storage layer.
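  • Hudi is typically driven from a Spark job; a minimal PySpark sketch of an upsert write (the table path, field names, and option subset are illustrative, and the hudi-spark bundle must be on the Spark classpath):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hudi-upsert-example")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [("e1", "2021-01-01 00:00:00", "click"), ("e2", "2021-01-01 00:01:00", "view")],
        ["event_id", "ts", "event_type"],
    )

    hudi_options = {
        "hoodie.table.name": "events",
        "hoodie.datasource.write.recordkey.field": "event_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
    }

    # Rows with an existing event_id are updated in place; new ones are inserted
    df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/events")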

How to choose

  • Apache Airflow if you want the most full-featured, mature tool and you can dedicate time to learning how it works, setting it up, and maintaining it.
  • Luigi if you need something with an easier learning curve than Airflow. It has fewer features, but it’s easier to get off the ground.
  • Prefect if you want something that’s very familiar to Python programmers and stays out of your way as much as possible.
  • Argo if you’re already deeply invested in the Kubernetes ecosystem and want to manage all of your tasks as pods, defining them in YAML instead of Python.

Data-driven vs task-driven

            | Batch                                  | Batch and stream
Task-driven | Airflow, Argo Workflow, Luigi, Prefect | /
Data-driven | Dagster                                | Koalja, Reflow

Feature Check

Frameworks    | Distributed tasks          | Parallel tasks             | Integrations         | Maturity  | Pipeline definition        | Kubernetes orchestration
Airflow       | Using Celery or Kubernetes | Using Celery or Kubernetes | Dask                 | Mature    | Pipeline-as-code* (Python) | yes
Dagster       | Using Celery or Dask       | Using Celery or Dask       | Spark, Airflow, Dask | New       | Pipeline-as-code (Python)  | yes
Prefect       | Using Dask                 | Using Dask                 | /                    | (unclear) | Pipeline-as-code (Python)  | yes
Luigi         | no                         | yes                        | Spark                | Mature    | Pipeline-as-code (Python)  | /
Reflow        | yes                        | yes                        | /                    | (unclear) | Pipeline-as-code (Python)  | /
Argo Workflow | Using Kubernetes           | Using Kubernetes           | Airflow              | (unclear) | Typed declarative schema   | Natively using CRD
Koalja        | Using Kubernetes           | Using Kubernetes           | /                    | New       | Typed declarative schema   | Natively using CRD
All of the above frameworks are evolving at high speed. I'll try to revisit this post in the future.