
Airflow vs. Argo Workflows

Every business collects data. The smart ones also have ways to process it. The next step is to add task orchestration, automation, and MLOps because processing data via manual procedures takes too much time and effort. So, how can you start building your MLOps pipelines? Should you add Airflow or Argo Workflows? Which one is the best task orchestration tool?

Let's look at two of the most powerful and popular tools for automating data pipelines and workflows: Apache Airflow and Argo Workflows. Both of these workflow engines have robust features for building and scaling serial and parallel jobs. We'll explore them side-by-side so you can make the best decision for your company.

Before we get started, let's touch on a few key concepts related to workflows and task orchestrators.

What is a Workflow Engine?

A workflow engine is a platform for starting, stopping, and organizing a set of related tasks. You use it to define a sequence of steps, run them, and monitor their progress.

Workflow engines are useful for a variety of applications, such as data collection, normalization, and processing. You can, and probably have, done these tasks manually. You can automate them with tools like crontab or Rundeck, but a workflow engine takes automation to a new level. It can rerun failed tasks, run steps in the correct order, and execute them in parallel where possible. It can also take advantage of cloud architectures like Kubernetes.


Airflow and Argo are two of the most popular engines for workflows and pipelines. One of the reasons they're both so successful is their ability to manage pipelines with a Directed Acyclic Graph (DAG).

DAG

A DAG models a pipeline's tasks and the dependencies between them. Vertices represent the tasks, and directed edges show the order in which the workflow performs them.

A diagram representing DAGs

This is a directed graph because each edge follows one and only one direction. It's acyclic because it contains no cycles: a workflow can never loop back to an earlier task.

This graph has four tasks. The workflow can only perform tasks #2 and #3 after completing task #1, but it can execute them in parallel. Then, once tasks #2 and #3 are both finished, it can perform task #4.
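That scheduling logic can be sketched in a few lines of plain Python using the standard library's topological sorter (the task names here are just illustrative):

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# Task 1 feeds tasks 2 and 3, which both feed task 4.
deps = {
    "task2": {"task1"},
    "task3": {"task1"},
    "task4": {"task2", "task3"},
}

ts = TopologicalSorter(deps)
ts.prepare()

batches = []  # each batch contains tasks that can run in parallel
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies are done
    batches.append(ready)
    ts.done(*ready)

print(batches)  # [['task1'], ['task2', 'task3'], ['task4']]
```

This is essentially what a workflow engine's scheduler does, minus the hard parts: retries, distribution, and monitoring.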

Both Argo and Airflow support this model for organizing and prioritizing tasks, but in slightly different ways. We'll look at those differences below.

Airflow vs. Argo

While Airflow and Argo have many of the same capabilities, there are significant differences. Let's take a look at these two workflow tools side-by-side.

Deployment and Ease of Use

How you install Airflow and Argo is where some of the most significant differences crop up. Airflow is Kubernetes-friendly, while Argo is Kubernetes-based.

Argo bills itself as "an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes." It runs on any Kubernetes implementation, from the preconfigured development cluster that ships with Docker Desktop to the managed offerings from GCP and AWS. You can try it out by installing a quick-start manifest on your Kubernetes cluster, and it's ready to run workflows without any further modifications. You'd want to create your own Kubernetes implementation for a production system, but everything required to run Argo runs on a single pod.

Airflow will run on Kubernetes and can take advantage of its scaling and stability. The Airflow project even provides a Helm chart to get you started. But Airflow is a Python-native project that requires more configuration than Argo. As with Argo, a simple standalone instance is easy to set up, but a production system requires a SQL database server, a multinode cluster (or Kubernetes), and other infrastructure.

For some teams, the extra work required to run Airflow means more flexibility and control. For others, Argo's simplicity means more time to focus on the tasks at hand.

Workflow Definitions

You configure Argo workflows with YAML. The "Hello World" example in Argo's documentation looks like this:
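This is the classic whalesay manifest; the exact image tag and field values may differ slightly between Argo releases:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["hello world"]
```

Everything Argo needs is here: a template that names a container, and an entrypoint that tells Argo where to start.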

Meanwhile, you write Airflow workflows in Python. A "Hello World" workflow might look like this:
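One minimal sketch, assuming Airflow 2.x with the `PythonOperator` (the DAG and task names here are illustrative, and this is one of several ways to write it):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello, World!")


# Every Airflow workflow is a DAG; this one holds a single task.
with DAG(
    dag_id="hello_world",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # run only when triggered manually
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```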

Of course, this is Python code, and your version might look completely different. But you will need to import the Airflow modules, define a DAG, and add your code to it.

Airflow workflows always have a DAG, regardless of the ordering and dependencies between your tasks. Argo supports DAGs but can also run with simplified or no dependencies.

Airflow is a Python-based system and requires writing code, while Argo does not. Here again, Airflow requires more work, but some teams may consider the additional flexibility a net plus. Others prefer Argo's simplicity and keep their code inside their workflows' tasks.

Native APIs

As we just saw, Airflow's primary interface is its Python API. It has a UI, but you create your workflows as DAGs in Python code. If you're already working with data sets and machine learning training in Python, Airflow is another API you add to your toolbox.

Argo doesn't require coding. You can use it by setting up your workflow's steps as containers and arranging them in Argo's YAML. Argo runs the containers in the way your configuration specifies. Since Argo will run any Docker container you supply, your pipeline can run with your preferred tooling.

But Argo has Golang, Java, and Python APIs, too.

The native Python API can load workflows from YAML:
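For example, with the Hera SDK (one of the Argo Python clients; `Workflow.from_file` is available in recent Hera releases, but verify it against your version — the file path here is purely illustrative):

```python
from hera.workflows import Workflow

# Parse an existing manifest from disk into a Workflow object.
# (Workflow.from_yaml accepts a YAML string instead of a path.)
wf = Workflow.from_file("hello-world.yaml")
print(wf.to_yaml())
```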

Or, you can create a Workflow in code, similar to Airflow:
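Again assuming Hera, here's a sketch of the same whalesay workflow defined entirely in code (names are illustrative):

```python
from hera.workflows import Container, Step, Steps, Workflow

with Workflow(generate_name="hello-world-", entrypoint="main") as w:
    # Define a reusable container template.
    whalesay = Container(
        name="whalesay",
        image="docker/whalesay",
        command=["cowsay"],
        args=["hello world"],
    )
    # Invoke the template as a single step.
    with Steps(name="main"):
        Step(name="hello", template=whalesay)
```

With a configured cluster connection, calling `w.create()` would submit the workflow to Argo.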

Argo developers can also use Couler, another full-featured Python API. Couler, like Argo, is part of the CNCF Cloud Native Interactive Landscape. It offers a simplified API for running Argo Workflows.

Here's "Hello, World" in Couler:
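This sketch follows the example in Couler's README; the submitter assumes Argo is installed in the cluster's `argo` namespace:

```python
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter

# Define a single-container step; image and command mirror
# the whalesay example above.
couler.run_container(
    image="docker/whalesay", command=["cowsay"], args=["hello world"]
)

# Generate the workflow manifest and submit it to the cluster.
submitter = ArgoSubmitter(namespace="argo")
couler.run(submitter=submitter)
```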

In the API category, Argo offers more functionality than Airflow. You can develop your workflows in three different languages, and there are two different Python APIs to choose from.


Fault Tolerance

Airflow supports running multiple schedulers in a high-availability configuration. Properly configured, the scheduler sees zero downtime, even in the event of a node failure. This functionality requires extra setup and a recent version of PostgreSQL or MySQL; other databases need additional configuration, and MariaDB is not officially supported.

Argo relies on Kubernetes for fault tolerance. If its workflow controller crashes, Kubernetes starts a new one. While you can't run two schedulers as a fault-tolerant pair, you can configure Argo to retry failed tasks. The retry capability includes a backoff timer and the ability to limit the number of attempts.
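A hypothetical template snippet showing those retry settings (the field names follow Argo's `retryStrategy` spec; the step itself is just an illustration):

```yaml
templates:
  - name: flaky-step
    retryStrategy:
      limit: "3"          # give up after three attempts
      backoff:
        duration: "10s"   # wait before the first retry
        factor: "2"       # double the delay on each retry
        maxDuration: "5m" # cap the total retry window
    container:
      image: alpine:3.18
      command: [sh, -c, "exit 1"]
```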

Airflow's ability to run redundant schedulers makes it more fault-tolerant, but it comes at a cost in terms of complexity. Argo's ability to take advantage of Kubernetes and retry tasks may be sufficient for many use cases.

Airflow vs. Argo: You Decide

In this post, we examined Airflow and Argo side-by-side. We looked at what it takes to get each platform running and how they define workloads. Then, we compared their APIs and how they manage fault tolerance.

While Airflow and Argo are close to each other in terms of features, they have different approaches to running your pipelines. Airflow offers more options than Argo in several areas, but it requires more configuration and customization to get started. It's easier to get started with Argo, especially if you're looking for a cloud-native option.

Which one will work better for you? There's only one way to find out. Pick one and give it a try!
