
Airflow vs. Argo Workflows

Every business collects data. The smart ones also have ways to process it. The next step is to add task orchestration, automation, and MLOps because processing data via manual procedures takes too much time and effort. So, how can you start building your MLOps pipelines? Should you add Airflow or Argo Workflows? Which one is the best task orchestration tool?

Let's look at two of the most powerful and popular tools for automating data pipelines and workflows: Apache's Airflow and Argo Workflows. Both of these workflow engines have robust features for building and scaling serial and parallel jobs. We'll explore them side-by-side so you can make the best decision for your company.

Before we get started, let's touch on a few key concepts related to workflows and task orchestrators.

What is a Workflow Engine?

A workflow engine is a platform for starting, stopping, and organizing a set of related tasks. You use it to define a sequence of steps, run them, and monitor their progress.

Workflow engines are useful for a variety of applications, such as data collection, normalization, and processing. You can, and probably have, done these tasks manually. You can automate them with tools like crontab or Rundeck, but a workflow engine takes automation to a new level. It can rerun failed tasks, run steps in the correct order, and execute them in parallel where possible. It can also take advantage of cloud architectures like Kubernetes.

Airflow and Argo are two of the most popular engines for workflows and pipelines. One of the reasons they're both so successful is their ability to manage pipelines with a Directed Acyclic Graph (DAG).

DAG

A DAG models the tasks in a pipeline and the dependencies between them. Its vertices represent the tasks, and its directed edges show the order in which the workflow performs them.

[Diagram: a DAG with four tasks]

This is a directed graph because each edge follows one and only one direction. It's acyclic because there are no cycles: DAGs don't have loops.

This graph has four tasks. The workflow can only perform tasks #2 and #3 after completing task #1, but it can execute them in parallel. Then, after tasks #2 and #3 are both finished, it can perform task #4.
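For a concrete sense of what this looks like, here's a minimal sketch of the diagram's dependencies written as an Argo DAG template (the do-work template name is a placeholder for whatever each task actually runs; Airflow expresses the same graph in Python):

    - name: four-task-example
      dag:
        tasks:
        - name: task-1
          template: do-work
        - name: task-2
          template: do-work
          dependencies: [task-1]           # waits for task-1
        - name: task-3
          template: do-work
          dependencies: [task-1]           # waits for task-1, runs in parallel with task-2
        - name: task-4
          template: do-work
          dependencies: [task-2, task-3]   # waits for both to finish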

Both Argo and Airflow support this model for organizing and prioritizing tasks, but in slightly different ways. We'll look at those differences below.


Airflow vs. Argo

While Airflow and Argo have many of the same capabilities, there are significant differences. Let's take a look at these two workflow tools side-by-side.

Deployment and Ease of Use

How you install Airflow and Argo is where some of the most significant differences crop up. Airflow is Kubernetes-friendly, while Argo is Kubernetes-based.

Argo bills itself as "an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes." It runs on any Kubernetes implementation, from a preconfigured development cluster in Docker Desktop to the managed services offered by GCP and AWS. You can try it out by applying a quick-start manifest to your Kubernetes cluster, and it's ready to run workflows without any further modifications. You'd want to build a proper Kubernetes environment for a production system, but the core of Argo, its workflow controller, runs in a single pod.
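As a rough sketch, the quick-start installation boils down to two kubectl commands (the manifest name and version vary by release, so check the Argo Workflows quick-start docs for the current URL):

    kubectl create namespace argo
    kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/<version>/quick-start-minimal.yaml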

Airflow will run on Kubernetes and can take advantage of its scaling and stability. The Airflow project even provides a Helm chart to get you started. But Airflow is a Python-native project that requires more configuration than Argo. As with Argo, a simple standalone instance is easy to stand up, but a production deployment requires a SQL database server, a multinode cluster (or Kubernetes), and other infrastructure.
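For instance, the Helm route looks roughly like this; the chart's values (executor, database backend, and so on) are where the extra configuration comes in:

    helm repo add apache-airflow https://airflow.apache.org
    helm repo update
    helm install airflow apache-airflow/airflow --namespace airflow --create-namespace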

For some teams, the extra work required to run Airflow means more flexibility and control. For others, Argo's simplicity means more time to focus on the tasks at hand.

Workflow Definitions

You configure Argo workflows with YAML. The "Hello World" example in Argo's documentation looks like this:
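    # Argo's classic hello-world quick-start example (newer docs may use a different image).
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-    # Argo generates the Workflow name from this prefix
    spec:
      entrypoint: whalesay          # the template to run first
      templates:
      - name: whalesay
        container:
          image: docker/whalesay:latest
          command: [cowsay]
          args: ["hello world"]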

Meanwhile, you write Airflow workflows in Python. A "Hello World" workflow might look like this:
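    # A minimal sketch, assuming Airflow 2.x.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def say_hello():
        print("hello world")


    # Every Airflow task lives inside a DAG, even a single-task workflow like this one.
    with DAG(
        dag_id="hello_world",
        start_date=datetime(2023, 1, 1),
        schedule=None,  # only runs when triggered manually
        catchup=False,
    ) as dag:
        hello = PythonOperator(task_id="say_hello", python_callable=say_hello)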

Of course, this is Python code, and your version might look completely different. But you will need to import the Airflow modules, define a DAG, and add your code to it.

Airflow workflows always have a DAG, regardless of the ordering and dependencies between your tasks. Argo supports DAGs but can also run with simplified or no dependencies.
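For example, Argo's steps syntax runs tasks as a simple sequence without declaring a DAG at all. A minimal sketch (the do-work template is a placeholder defined elsewhere in the workflow):

    - name: main
      steps:
      - - name: step-a            # the first step
          template: do-work
      - - name: step-b            # starts only after step-a finishes
          template: do-work
        - name: step-c            # runs in parallel with step-b
          template: do-work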

Airflow is a Python-based system and requires coding, while Argo does not. Here again, Airflow requires more work, but some teams may consider the additional flexibility a net plus. Others prefer Argo's simplicity and keep their code inside their workflow's tasks.

Native APIs

As we just saw, Airflow's primary interface is its Python API. It has a UI, but you create your workflows by defining DAGs in Python code. If you're already working with data sets and machine learning training in Python, Airflow is just another API to add to your toolbox.

Argo doesn't require coding. You can use it by setting up your workflow's steps as containers and arranging them in Argo's YAML. Argo runs the containers in the way your configuration specifies. Since Argo will run any Docker container you supply, your pipeline can run with your preferred tooling.

But Argo has Golang, Java, and Python APIs, too.

The native Python API can load workflows from YAML.
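Here's a minimal sketch using the auto-generated argo-workflows client; it assumes an Argo Server reachable at https://localhost:2746 and the hello-world.yaml manifest from above, and module paths vary slightly between SDK versions.

    import yaml

    import argo_workflows
    from argo_workflows.api import workflow_service_api
    from argo_workflows.model.io_argoproj_workflow_v1alpha1_workflow_create_request import (
        IoArgoprojWorkflowV1alpha1WorkflowCreateRequest,
    )

    # Point the client at an Argo Server (the default port-forward target is shown).
    configuration = argo_workflows.Configuration(host="https://localhost:2746")
    configuration.verify_ssl = False

    # Load the manifest we wrote by hand earlier.
    with open("hello-world.yaml") as f:
        manifest = yaml.safe_load(f)

    api_client = argo_workflows.ApiClient(configuration)
    api = workflow_service_api.WorkflowServiceApi(api_client)
    api.create_workflow(
        namespace="argo",
        body=IoArgoprojWorkflowV1alpha1WorkflowCreateRequest(workflow=manifest, _check_type=False),
        _check_return_type=False,
    )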

Or, you can create a Workflow in code, similar to Airflow.
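One way to do that with the same generated client is to build the manifest as a plain dictionary rather than loading it from a file. (The SDK also ships typed model classes for every field, but their names are long and version-dependent, so this sketch sticks to a dict and reuses the api object from the previous snippet.)

    # Build the hello-world manifest programmatically instead of loading YAML.
    manifest = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "hello-world-"},
        "spec": {
            "entrypoint": "whalesay",
            "templates": [
                {
                    "name": "whalesay",
                    "container": {
                        "image": "docker/whalesay:latest",
                        "command": ["cowsay"],
                        "args": ["hello world"],
                    },
                }
            ],
        },
    }

    # Submit it exactly as before.
    api.create_workflow(
        namespace="argo",
        body=IoArgoprojWorkflowV1alpha1WorkflowCreateRequest(workflow=manifest, _check_type=False),
        _check_return_type=False,
    )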

Next, we can run the same example using Hera instead of Argo's YAML DSL.
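Here's a minimal sketch with Hera v5; it assumes an Argo Server reachable at https://localhost:2746 and an argo namespace:

    from hera.workflows import Container, Workflow, WorkflowsService

    # Point Hera at your Argo Server; adjust the host and namespace for your cluster.
    ws = WorkflowsService(host="https://localhost:2746")

    with Workflow(
        generate_name="hello-world-",
        entrypoint="whalesay",
        namespace="argo",
        workflows_service=ws,
    ) as w:
        Container(
            name="whalesay",
            image="docker/whalesay:latest",
            command=["cowsay"],
            args=["hello world"],
        )

    w.create()  # submits the Workflow through the Argo Server's REST API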

Hera makes it easy to specify the image, the command, and the arguments in Python instead of YAML. Hera also has full parity with Argo Workflows features as of version 5.

When it comes to APIs, Argo has more to offer than Airflow: you can develop your workflows in three different languages, and there are two different Python APIs to choose from.



Fault Tolerance

Airflow supports running multiple schedulers in a high-availability configuration. Properly configured, the scheduler will see zero downtime, even in the event of a node failure. This functionality requires extra setup and either a current version of PostgreSQL or MySQL or additional database configuration. (MariaDB is not officially supported.)
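A sketch of the relevant pieces, assuming Airflow 2.x with PostgreSQL (the connection string and host names are placeholders):

    # airflow.cfg
    [database]
    # All schedulers must point at the same PostgreSQL 12+ (or MySQL 8+) database.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db.example.internal/airflow

    [scheduler]
    # Enabled by default; required so multiple schedulers can safely share the database.
    use_row_level_locking = True

With that in place, you run airflow scheduler on each node, and the schedulers coordinate through the database's row-level locks.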

Argo relies on Kubernetes for fault tolerance. If its workflow controller crashes, Kubernetes starts a new one. While you can't run two schedulers as a fault-tolerant pair, you can configure Argo to retry failed tasks. The retry capability includes a backoff timer and the ability to limit the number of attempts.
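For example, a template-level retry policy with exponential backoff looks roughly like this (the limits shown are arbitrary):

    - name: flaky-step
      retryStrategy:
        limit: "3"           # give up after three retries
        backoff:
          duration: "10s"    # wait 10s before the first retry
          factor: "2"        # double the wait after each failure
          maxDuration: "5m"  # cap the total time spent retrying
      container:
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello world"]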

Airflow's ability to run redundant schedulers makes it more fault-tolerant, but it comes at a cost in terms of complexity. Argo's ability to take advantage of Kubernetes and retry tasks may be sufficient for many use cases.

Airflow vs. Argo: You Decide

In this post, we examined Airflow and Argo side-by-side. We looked at what it takes to get each platform running and how they define workloads. Then, we compared their APIs and how they manage fault tolerance.

While Airflow and Argo are close to each other in terms of features, they have different approaches to running your pipelines. Airflow offers more options than Argo in several areas, but it requires more configuration and customization to get started. It's easier to get started with Argo, especially if you're looking for a cloud-native option.

Which one will work better for you? There's only one way to find out. Pick one and give it a try!

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure, bringing simplicity and efficiency to your data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
