Metaflow vs. Argo Workflows
December 7, 2022
In this post, we'll compare Metaflow vs. Argo Workflows including what they are, their advantages and tradeoffs, and which to choose for your needs.
As data sets grow, the time, effort, and system resources required to retrieve, normalize, and process them grow too. You need a workflow management platform to help you run your data pipelines and take advantage of cloud technologies like Docker and Kubernetes. That's where Metaflow and Argo Workflows come in. They can help you get the most out of your infrastructure while streamlining your MLOps. Which one is best for you?
This article will look at Metaflow vs. Argo Workflows. They're both tools for orchestrating workflows. They’re great at managing tasks, enforcing dependencies, and harnessing cloud technologies to increase throughput and scale for large datasets. But they're very different tools.
What Is a Workflow Orchestrator?
The primary role of a workflow orchestrator is to start and stop processes based on a workflow description. Each flow, or pipeline, has steps, and each step has zero or more dependencies. For example, a data retrieval step only needs to know where to get the data from, but the steps that process the data can't proceed until the data is retrieved.
An orchestrator that takes advantage of your infrastructure can examine your pipeline and run some steps in parallel. If the previous example had two data retrieval steps, the orchestrator would run them simultaneously, since neither step has any dependencies.
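To make the idea concrete, here is a minimal sketch in plain Python that uses the standard library's graphlib module to group steps into batches that could run at the same time. The step names are hypothetical, and this illustrates only the scheduling concept, not how Argo Workflows or Metaflow are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "fetch_orders": set(),
    "fetch_customers": set(),
    "join_tables": {"fetch_orders", "fetch_customers"},
    "build_report": {"join_tables"},
}


def parallel_batches(graph):
    """Group steps into batches whose members can run at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies form a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unblocking later steps
    return batches


batches = parallel_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

The two retrieval steps land in the first batch because neither depends on anything, while the join and report steps each wait for the batch before them.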
Scheduling workflows and managing parallel tasks are table stakes. The more important question is how the platform integrates into your system. What's the orchestrator's interface? How hard or easy is it to use? Does it complement how you work or does it get in your way?
And, of course, there's the question your leadership will inevitably ask: How much is this going to cost?
Metaflow vs. Argo Workflows
You can run Argo Workflows on any Kubernetes (K8s) system. It installs as a custom resource definition, so it works on any cluster using standard K8s mechanisms. If you want to build your own cluster, the Argo project supplies Helm charts, too. For local development, it runs on Docker Desktop.
Argo is a Cloud Native Computing Foundation (CNCF) hosted project under active development, and it is released under the Apache 2.0 license.
Each step in your Argo workflow is a Docker container. Argo runs the container with an optional command and optional arguments for the command. You can roll your own images or run public images from Docker Hub. Argo makes it easy to pass Python code to a container as a script, but you can run any command available in the image.
Argo's steps are containers, so you can use them to run anything. Data pipelines are one common application, but so is continuous integration/continuous deployment (CI/CD). With the encapsulation you get from Docker and the power of K8s, the sky's the limit.
You can define your workflows with YAML or the Couler Python API. Both interfaces have access to all workflow features, including dependencies between steps, setting container resources, and adding container volumes.
Metaflow is a Python library for building data pipelines and workflows. It started as an internal project at Netflix, where data scientists used it for statistics and deep learning applications. Metaflow is now open source under the Apache 2.0 license.
You can run your Metaflow packages locally or via AWS Batch. Run locally, Metaflow's performance and capabilities are limited. When run via AWS, Metaflow will scale your jobs for you based on your code.
Like Argo Workflows, Metaflow works with more than just data pipelines. As you'll see in the examples below, it can manage any Python code.
Let's take a look at a basic workflow for both platforms.
Demonstrating Argo Workflows and Metaflow with anything other than a "Hello, World!" just wouldn't seem right.
Argo Workflows' introductory documentation has a great example. It's written in Argo's YAML markup and illustrates a few essential concepts.
It prints "hello world" to the Docker logs using the cowsay command.
The workflow starts with a standard header that defines the API version, the document type, and a name. Argo will use this name to generate a unique id each time the workflow runs.
The spec field starts the definition of the workflow. An Argo workflow consists of templates, which are reusable artifacts, like functions. In this case, we're only using it once, but Argo's GitHub repo has several examples that demonstrate how you can reuse templates.
The single step, named whalesay, loads Docker's whalesay image and runs cowsay with "hello world." With only three lines of code, you can pull an image and run it inside an Argo Workflow!
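The structure described above can be sketched as a plain Python dictionary and printed as JSON, which is also valid YAML. The field values follow Argo's published hello-world example as best we recall it, so treat them as assumptions. The helper name hello_world_workflow is ours and is not part of any Argo API.

```python
import json


def hello_world_workflow(message="hello world"):
    """Return a single-step Argo Workflow manifest as a plain dict."""
    return {
        # Standard header: API version, document type, and a name prefix.
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # Argo appends a unique suffix to generateName on each submission.
        "metadata": {"generateName": "hello-world-"},
        "spec": {
            # The entrypoint names the template that runs first.
            "entrypoint": "whalesay",
            "templates": [
                {
                    "name": "whalesay",
                    # The three lines that pull an image and run a command.
                    "container": {
                        "image": "docker/whalesay",
                        "command": ["cowsay"],
                        "args": [message],
                    },
                }
            ],
        },
    }


manifest = hello_world_workflow()
print(json.dumps(manifest, indent=2))
```

Because JSON is a subset of YAML, the printed document can be saved to a file and submitted the same way as a hand-written YAML manifest.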
Next, let's look at the same workflow defined with Couler instead of Argo's YAML DSL.
This example uses two objects from the Couler API: couler and ArgoSubmitter. The couler run_container method does exactly what its name indicates: it pulls and runs a container using syntax similar to Argo's YAML. Then ArgoSubmitter submits the workflow to the Argo cluster for execution. So, if you don't want to manage YAML documents, you can use Couler to create your workflows.
Metaflow's tutorial has a "Hello, World" example because it's the right thing to do.
Here's the code.
Metaflow uses a combination of inheritance and decorators to build a basic workflow. The workflow is a class that inherits from FlowSpec. Each step in the flow is a method decorated with @step. Each step controls the flow by calling self.next() with the step that should run after it.
Advantages and Tradeoffs
Argo is a workflow orchestrator. Its YAML interface and the Couler API are tools for managing workflows. You perform your data processing inside Docker containers that you manage. You use Argo's tools to manage the steps in the workflow, the data passed between them (unless you elect to use different mechanisms), and the relationships between each step.
Metaflow exposes similar tools via Python code. You can mark any method of a flow class as a @step and establish relationships between steps with self.next(), including parallel branches that a later join step merges. So you only have to work in Python, but the API couples your workflow and data processing code together. Unless you are very careful with code structure, moving to another platform will be complicated. Couler provides you with a logical separation since you're still manipulating Docker images.
Probably the most significant tradeoff with Metaflow, though, is its tie to AWS. While Argo Workflows works with any K8s implementation, including on-premises, Metaflow's scaling and concurrency features only work with AWS Batch. Metaflow ties you to a single vendor and their cost structure.
Metaflow vs. Argo Workflows: Which One?
We put Argo Workflows and Metaflow in a head-to-head comparison of workflow orchestration tools. Both platforms can orchestrate nearly any pipeline operation, but they use very different approaches. Argo Workflows uses Kubernetes to manage tasks defined as containers. You tell Argo how to manage the jobs, while you manage the containers and what they do. Metaflow orchestrates Python code on AWS Batch infrastructure. Without AWS, Metaflow is less powerful and doesn't scale.
Now that you've seen Metaflow vs. Argo Workflows side-by-side, you can make the right choice. Start setting up your pipelines today!