Airflow: https://airflow.apache.org/
- Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.
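- A minimal sketch of an Airflow DAG, assuming the Airflow 2.x API (DAG name, schedule, and task bodies here are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data")

def load():
    print("writing data")

# Tasks and dependencies are declared in ordinary Python; the scheduler
# runs the DAG on the given schedule and tracks each task's state.
with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load              # load runs only after extract succeeds
```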
Argo Workflows: https://argoproj.github.io/projects/argo/
- Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
- Argo is the engine teams often turn to when they are already running on Kubernetes.
Celery: https://docs.celeryproject.org/en/stable/
- Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
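- A minimal sketch of a Celery task (the broker URL is an assumption; any supported broker such as RabbitMQ or Redis works):

```python
from celery import Celery

# Point the app at a message broker; workers pull tasks from it.
app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed local Redis

@app.task
def add(x, y):
    return x + y
```

A producer then queues work with `add.delay(2, 3)`, and a worker process started with `celery -A tasks worker` picks it up and executes it.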
Dask: https://dask.org/
- Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.
- Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.
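- A minimal sketch of Dask's pandas-like dataframe API (the file pattern and column names are hypothetical):

```python
import dask.dataframe as dd

# Lazily reads many CSVs as one partitioned, pandas-like dataframe.
df = dd.read_csv("events-*.csv")  # hypothetical file pattern

# Nothing executes until .compute(); Dask builds a task graph and runs
# its pieces in parallel across threads, processes, or a cluster.
daily_mean = df.groupby("day")["value"].mean().compute()
```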
Luigi: https://github.com/spotify/luigi
- Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
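- A minimal sketch of Luigi's target/dependency model (file names are illustrative):

```python
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data")

class Transform(luigi.Task):
    # Luigi resolves dependencies from requires(): Extract runs first,
    # and is skipped on re-runs if its output target already exists.
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.txt")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(fin.read().upper())

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```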
Koalja: https://github.com/AljabrIO/koalja-operator
- Koalja is a platform for constructing data-processing pipelines, designed to keep pipeline definitions simple.
Prefect: https://www.prefect.io/
- Prefect is the new standard in dataflow automation, trusted to build, run, and monitor millions of data workflows and pipelines.
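- A minimal sketch, assuming the Prefect 1.x (Core) API, in which flows are built from decorated Python functions:

```python
from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

# Calling tasks inside the Flow context records the dependency graph
# rather than executing immediately.
with Flow("etl") as flow:
    numbers = extract()
    transform(numbers)

flow.run()  # runs locally; Prefect Cloud/Server add scheduling and monitoring
```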
Reflow: https://github.com/grailbio/reflow
- Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
Netflix Conductor: https://netflix.github.io/conductor/
- Conductor is a workflow orchestration engine that runs in the cloud.
Hudi (Hadoop Upserts Deletes and Incrementals): https://github.com/apache/hudi
- Apache Hudi ingests and manages storage of large analytical datasets over DFS (HDFS or cloud stores), serving as a data lake storage layer.
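- A minimal PySpark sketch of writing a Hudi table (the table name, fields, and path are assumptions; requires the hudi-spark bundle on Spark's classpath):

```python
from pyspark.sql import SparkSession

# Kryo serialization is recommended by the Hudi docs.
spark = (SparkSession.builder
         .appName("hudi-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame([(1, "a", 1000), (2, "b", 1001)], ["id", "value", "ts"])

# The record key identifies rows for upserts; the precombine field picks
# the latest version when two writes carry the same key.
(df.write.format("hudi")
   .option("hoodie.table.name", "events")                      # assumed table name
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .mode("overwrite")
   .save("/tmp/hudi/events"))                                  # assumed path
```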
How to choose
- Apache Airflow if you want the most full-featured, mature tool and you can dedicate time to learning how it works, setting it up, and maintaining it.
- Luigi if you need something with an easier learning curve than Airflow. It has fewer features, but it’s easier to get off the ground.
- Prefect if you want something that’s very familiar to Python programmers and stays out of your way as much as possible.
- Argo if you’re already deeply invested in the Kubernetes ecosystem and want to manage all of your tasks as pods, defining them in YAML instead of Python.
Data-driven vs task-driven
|  | Batch | Batch and stream |
| --- | --- | --- |
| Task-driven | Airflow, Argo Workflows, Luigi, Prefect |  |
| Data-driven | Dagster | Koalja, Reflow |
Feature Check
| Frameworks | Distributed tasks | Parallel tasks | Integrations | Maturity | Pipeline definition | Kubernetes orchestration |
| --- | --- | --- | --- | --- | --- | --- |
| Airflow | Using Celery or Kubernetes | Using Celery or Kubernetes | Dask | Mature | Pipeline-as-code (Python) | Yes |
| Dagster | Using Celery or Dask | Using Celery or Dask | Spark, Airflow, Dask | New | Pipeline-as-code (Python) | Yes |
| Prefect | Using Dask | Using Dask | / | (unclear) | Pipeline-as-code (Python) | Yes |
| Luigi | No | Yes | Spark | Mature | Pipeline-as-code (Python) | / |
| Reflow | Yes | Yes | / | (unclear) | Pipeline-as-code (Reflow DSL) | / |
| Argo Workflows | Using Kubernetes | Using Kubernetes | Airflow | (unclear) | Typed declarative schema (YAML) | Natively, using CRDs |
| Koalja | Using Kubernetes | Using Kubernetes | / | New | Typed declarative schema (YAML) | Natively, using CRDs |