Metaflow vs. Argo Workflows

As data sets grow, the time, effort, and system resources required to retrieve, normalize, and process them grow, too. You need a workflow management platform to help you run your data pipelines and take advantage of cloud technologies like Docker and Kubernetes. That's where Metaflow and Argo Workflows come in. They can help you get the most out of your infrastructure while streamlining your MLOps. Which one is best for you?

This article will look at Metaflow vs. Argo Workflows. They're both tools for orchestrating workflows. They’re great at managing tasks, enforcing dependencies, and harnessing cloud technologies to increase throughput and scale for large datasets. But they're very different tools.

What Is a Workflow Orchestrator?

The primary role of a workflow orchestrator is to start and stop processes based on a workflow description. Each flow, or pipeline, has steps, and each step has zero or more dependencies. For example, a data retrieval step only needs to know where to get the data from, but the steps that process the data can't proceed until the data is retrieved.

An orchestrator that takes advantage of your infrastructure can examine your pipeline and run some steps in parallel. If the previous example had two data retrieval steps, the orchestrator could run them simultaneously, since neither step has any dependencies.
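To make the idea concrete, here is a minimal sketch of how an orchestrator might group steps into waves that can run in parallel. It relies only on Python's standard-library `graphlib` (Python 3.9+). The step names and the `parallel_batches` helper are hypothetical, and real orchestrators do considerably more (retries, resource requests, scheduling onto nodes).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# The two retrieval steps have no dependencies, so they can run together.
pipeline = {
    "fetch_orders": set(),
    "fetch_customers": set(),
    "join_tables": {"fetch_orders", "fetch_customers"},
    "report": {"join_tables"},
}


def parallel_batches(graph):
    """Group steps into waves; every step in a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies loop
    batches = []
    while sorter.is_active():
        # Steps whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(pipeline), start=1):
        print(f"wave {number}: {batch}")
```

Both retrieval steps land in the first wave because nothing blocks them; the join step has to wait until both have finished.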

Scheduling workflows and managing parallel tasks are table stakes. The more important question is how the platform integrates into your system. What's the orchestrator's interface? How hard or easy is it to use? Does it complement how you work or does it get in your way? 

And, of course, there's the question your leadership will inevitably ask: How much is this going to cost?

Metaflow vs. Argo Workflows

Argo Workflows

You can run Argo Workflows on any Kubernetes (K8s) system. It installs as a custom resource definition, so it works on any cluster using standard K8s mechanisms. If you prefer to install with Helm, the Argo project supplies charts, too. For local development, it runs on Docker Desktop.

Argo is a Cloud Native Computing Foundation (CNCF) hosted project under active development. It's released under the Apache 2.0 license.

Each step in your Argo workflow is a Docker container. Argo runs the container with an optional command and optional arguments for the command. You can roll your own images or run public images from Docker Hub. Argo makes it easy to pass Python code to a container as a script, but you can run any command available in the image.

Argo's steps are containers, so you can use them to run anything. Data pipelines are one common application, but so is continuous integration/continuous deployment (CI/CD). With the encapsulation you get from Docker and the power of K8s, the sky's the limit.

You can define your workflows with YAML or with Hera, a Python API. Both interfaces have access to all workflow features, including dependencies between steps, setting container resources, and adding container volumes.

Metaflow

Metaflow is a Python library for building data pipelines and workflows. It started as an internal project at Netflix, where data scientists used it for statistics and deep learning applications. Today, Metaflow is open source under the Apache 2.0 license.

You can run your Metaflow packages locally or via AWS Batch. Run locally, Metaflow's performance and capabilities are limited. When run via AWS, Metaflow will scale your jobs for you based on your code.

Like Argo Workflows, Metaflow works with more than just data pipelines. As you'll see in the examples below, it can manage any Python code.

Let's take a look at a basic workflow for both platforms.

Hello, World

Demonstrating Argo Workflows and Metaflow with anything other than a "Hello, World!" just wouldn't seem right.

Argo Workflows' introductory documentation has a great example. It's written in Argo's YAML markup and illustrates a few essential concepts.

It prints "hello world" to the container logs using the cowsay command.

The workflow starts with a standard header that defines the API version, the document type, and a name. Argo will use this name to generate a unique id each time the workflow runs.

The {% c-line %}spec{% c-line-end %} field starts the definition of the workflow. An Argo workflow consists of {% c-line %}templates{% c-line-end %}, which are reusable artifacts, like functions. In this case, we're only using it once, but Argo's GitHub repo has several examples that demonstrate how you can reuse templates.

The single step, named {% c-line %}whalesay{% c-line-end %}, loads Docker's whalesay image and runs {% c-line %}cowsay{% c-line-end %} with "hello world." With only three lines of code, you can pull an image and run it inside an Argo Workflow!
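As a rough reconstruction of the pieces just described, the same manifest can be sketched as a Python dictionary and printed as JSON, since Kubernetes manifests can be expressed as JSON as well as YAML. The field values (`generateName: hello-world-`, the `docker/whalesay` image) are assumptions based on the description above and the upstream hello-world example, and `hello_world_manifest` is a hypothetical helper name.

```python
import json


def hello_world_manifest() -> dict:
    """Build a hello-world Workflow manifest as plain Python data."""
    return {
        # Standard header: API version, document type, and a name.
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName is a prefix; a unique suffix is added per run.
        "metadata": {"generateName": "hello-world-"},
        "spec": {
            # The template to run first.
            "entrypoint": "whalesay",
            "templates": [
                {
                    "name": "whalesay",
                    # Assumed image, command, and argument for the step.
                    "container": {
                        "image": "docker/whalesay",
                        "command": ["cowsay"],
                        "args": ["hello world"],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(hello_world_manifest(), indent=2))
```

The three lines under `container` are the whole step: the image to pull, the command to run, and its argument.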

Next, we can run the same example using Hera instead of Argo's YAML DSL.

Hera makes it easy to specify the image, the command, and the arguments in Python instead of YAML. Hera also has full parity with Argo Workflows features as of version 5.

Metaflow's tutorial has a "Hello, World" example, too, because it's the right thing to do.

Metaflow uses a combination of inheritance and decorators to build a basic workflow. The workflow is a class that inherits from {% c-line %}FlowSpec{% c-line-end %}. Each step in the flow is a method decorated with {% c-line %}@step{% c-line-end %}. Each step controls the flow by calling {% c-line %}self.next(){% c-line-end %} with the step that should run after it.
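To illustrate the pattern without requiring Metaflow to be installed, here is a toy, standard-library-only sketch of a class whose decorated methods hand off to one another. `ToyFlow` and `toy_step` are hypothetical stand-ins written for this article; they are not Metaflow's API or implementation, and they only mimic the shape of a linear flow.

```python
def toy_step(func):
    """Mark a method as a step (a stand-in for a real step decorator)."""
    func.is_step = True
    return func


class ToyFlow:
    """Minimal runner: start at `start`, follow whatever each step hands to next()."""

    def __init__(self):
        self._next = None
        self.history = []

    def next(self, step_method):
        # Record which step should run after the current one.
        self._next = step_method

    def run(self):
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not marked as a step")
            self._next = None
            self.history.append(current.__name__)
            current()  # the step may call self.next(...) to continue
            current = self._next
        return self.history


class HelloToyFlow(ToyFlow):
    @toy_step
    def start(self):
        print("Flow is starting.")
        self.next(self.hello)

    @toy_step
    def hello(self):
        print("Hello, World!")
        self.next(self.end)

    @toy_step
    def end(self):
        print("Flow is done.")


if __name__ == "__main__":
    HelloToyFlow().run()
```

The workflow logic (which step follows which) and the work itself sit in the same class, which is convenient when everything is Python.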

Advantages and Tradeoffs

Argo is a workflow orchestrator. Its YAML interface and the Hera API are tools for managing workflows. You perform your data processing inside Docker containers that you manage. You use Argo's tools to manage the steps in the workflow, the data passed between them (unless you elect to use different mechanisms), and the relationships between each step.

Metaflow exposes similar tools via Python code. You mark a method of your flow class as a {% c-line %}@step{% c-line-end %} and establish relationships between steps with {% c-line %}self.next(){% c-line-end %}: passing it several steps fans the work out in parallel, and a later join step merges the results. So you only have to work in Python, but the API couples your workflow and data processing code together. Unless you are very careful with code structure, moving to another platform will be complicated.

Probably the most significant tradeoff with Metaflow, though, is its tie to AWS. While Argo Workflows works with any K8s implementation, including on-premises, Metaflow's scaling and concurrency features only work with AWS Batch. Metaflow ties you to a single vendor and their cost structure.

Metaflow vs. Argo Workflows: Which One?

We put Argo Workflows and Metaflow in a head-to-head comparison of workflow orchestration tools. Both platforms can orchestrate nearly any pipeline operation, but they use very different approaches. Argo Workflows uses Kubernetes to manage tasks defined as containers. You tell Argo how to run the jobs, and you manage the containers and what they do. Metaflow orchestrates Python code on AWS Batch infrastructure. Without AWS, Metaflow is less powerful and doesn't scale.

Now that you've seen Metaflow vs. Argo Workflows side-by-side, you can make the right choice. Start setting up your pipelines today!

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure to offer simplicity and efficiency when it comes to data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
