Airflow: https://airflow.apache.org/
- Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
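- A minimal sketch of an Airflow DAG, assuming the Airflow 2.x API (DAG name, schedule, and task bodies here are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data")

def load():
    print("writing data")

# Tasks and dependencies are declared in ordinary Python; the scheduler
# runs the DAG on the given schedule and tracks each task's state.
with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load              # load runs only after extract succeeds
```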
Argo Workflows: https://argoproj.github.io/projects/argo/
- Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
- Argo is the engine teams often turn to when they are already running on Kubernetes.
Celery: https://docs.celeryproject.org/en/stable/
- Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
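- A minimal sketch of a Celery task (the broker URL is an assumption; any supported broker such as RabbitMQ or Redis works):

```python
from celery import Celery

# Point the app at a message broker; workers pull tasks from it.
app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed local Redis

@app.task
def add(x, y):
    return x + y
```

A producer then queues work with `add.delay(2, 3)`, and a worker process started with `celery -A tasks worker` picks it up and executes it.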
Dask: https://dask.org/
- Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.
- Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.
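- A minimal sketch of Dask's pandas-like dataframe API (the file pattern and column names are hypothetical):

```python
import dask.dataframe as dd

# Lazily reads many CSVs as one partitioned, pandas-like dataframe.
df = dd.read_csv("events-*.csv")  # hypothetical file pattern

# Nothing executes until .compute(); Dask builds a task graph and runs
# its pieces in parallel across threads, processes, or a cluster.
daily_mean = df.groupby("day")["value"].mean().compute()
```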
Luigi: https://github.com/spotify/luigi
- Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
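- A minimal sketch of Luigi's target/dependency model (file names are illustrative):

```python
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data")

class Transform(luigi.Task):
    # Luigi resolves dependencies from requires(): Extract runs first,
    # and is skipped on re-runs if its output target already exists.
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.txt")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(fin.read().upper())

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```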
Koalja: https://github.com/AljabrIO/koalja-operator
- Koalja is a platform for constructing data-processing pipelines, designed to keep pipeline definitions simple.
Prefect: https://www.prefect.io/
- Prefect is the new standard in dataflow automation, trusted to build, run, and monitor millions of data workflows and pipelines.
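- A minimal sketch, assuming the Prefect 1.x (Core) API, in which flows are built from decorated Python functions:

```python
from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

# Calling tasks inside the Flow context records the dependency graph
# rather than executing immediately.
with Flow("etl") as flow:
    numbers = extract()
    transform(numbers)

flow.run()  # runs locally; Prefect Cloud/Server add scheduling and monitoring
```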
Reflow: https://github.com/grailbio/reflow
- Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
Netflix Conductor: https://netflix.github.io/conductor/
- Conductor is a workflow orchestration engine that runs in the cloud.
Hudi (Hadoop Upserts Deletes and Incrementals): https://github.com/apache/hudi
- Apache Hudi ingests and manages storage of large analytical datasets over DFS (HDFS or cloud stores), serving as a data lake storage layer.
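- A minimal PySpark sketch of writing a Hudi table (the table name, fields, and path are assumptions; requires the hudi-spark bundle on Spark's classpath):

```python
from pyspark.sql import SparkSession

# Kryo serialization is recommended by the Hudi docs.
spark = (SparkSession.builder
         .appName("hudi-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame([(1, "a", 1000), (2, "b", 1001)], ["id", "value", "ts"])

# The record key identifies rows for upserts; the precombine field picks
# the latest version when two writes carry the same key.
(df.write.format("hudi")
   .option("hoodie.table.name", "events")                      # assumed table name
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .mode("overwrite")
   .save("/tmp/hudi/events"))                                  # assumed path
```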
How to choose
- Apache Airflow if you want the most full-featured, mature tool and you can dedicate time to learning how it works, setting it up, and maintaining it.
- Luigi if you need something with an easier learning curve than Airflow. It has fewer features, but it’s easier to get off the ground.
- Prefect if you want something that’s very familiar to Python programmers and stays out of your way as much as possible.
- Argo if you’re already deeply invested in the Kubernetes ecosystem and want to manage all of your tasks as pods, defining them in YAML instead of Python.
Data-driven vs task-driven
|  | Batch | Batch and stream |
| --- | --- | --- |
| Task-driven | Airflow, Argo Workflows, Luigi, Prefect |  |
| Data-driven | Dagster | Koalja, Reflow |
Feature Check
| Frameworks | Distributed tasks | Parallel tasks | Integrations | Maturity | Pipeline definition | Kubernetes orchestration |
| --- | --- | --- | --- | --- | --- | --- |
| Airflow | Using Celery or Kubernetes | Using Celery or Kubernetes | Dask | Mature | Pipeline-as-code (Python) | Yes |
| Dagster | Using Celery or Dask | Using Celery or Dask | Spark, Airflow, Dask | New | Pipeline-as-code (Python) | Yes |
| Prefect | Using Dask | Using Dask | / | (unclear) | Pipeline-as-code (Python) | Yes |
| Luigi | No | Yes | Spark | Mature | Pipeline-as-code (Python) | / |
| Reflow | Yes | Yes | / | (unclear) | Pipeline-as-code (Reflow DSL) | / |
| Argo Workflows | Using Kubernetes | Using Kubernetes | Airflow | (unclear) | Typed declarative schema (YAML) | Natively, using CRDs |
| Koalja | Using Kubernetes | Using Kubernetes | / | New | Typed declarative schema (YAML) | Natively, using CRDs |