
Kubeflow vs. Argo Workflows

Data and data engineering have transformed the software landscape. Data volumes are growing, and timely processing is more important than ever. That means moving your data processing into pipelines and adopting MLOps practices to manage them. Container orchestration technologies like Kubernetes (K8s) can help you run these pipelines efficiently.

This post will compare two powerful tools for running MLOps workflows on Kubernetes: Kubeflow and Argo Workflows. What factors should you consider when deciding between them? Which tool is best for you?

Why Kubernetes?

A workflow orchestrator coordinates a set of steps, or tasks, on your behalf. Based on code that describes how a flow starts, proceeds, and stops, it schedules tasks in the correct order, shepherds data between them, and harvests the results.

Part of this process is using your available resources to their fullest. You want your infrastructure to run tasks in parallel when possible, and to run constrained steps sequentially when they depend on the steps before them. In every case, you want your workflows to use your compute resources as effectively as possible, with an orchestrator that scales resources as required.

Ideally, the orchestrator runs on Kubernetes. K8s gives you scalability, efficiency, and containerization. It also runs anywhere: on a desktop, on your premises, in a colo, or in the cloud.


Kubeflow vs. Argo Workflows

Argo Workflows

Rather than simply running as an application on K8s, Argo Workflows installs as a Kubernetes custom resource definition (CRD) with its own controller. It's a genuinely container-native platform designed to run on Kubernetes.

Argo Workflows supports all K8s systems and ships with binaries for macOS and Linux, as well as a K8s manifest. It runs on Docker Desktop for local development, and you can build your own cluster with the Argo project's Helm charts.

In Argo Workflows, steps execute as Docker containers. Each container runs in K8s with optional commands and arguments you specify in your workflow. The images can be custom-built or public releases; you can specify any image available to your cluster.

Since Argo's steps are containers, you can literally run anything Docker supports in your workflows. Your data pipelines can mix operating systems, languages, and versions. Argo is a popular platform for continuous integration/continuous deployment (CI/CD) because of its broad platform support and ability to run complex pipelines.

You define Argo workflows in YAML, using its native API, or in Python, using Hera. Both approaches have access to all workflow features, including dependencies between steps, setting container resources, and adding container volumes.

Argo is a Cloud Native Computing Foundation (CNCF) hosted project, so it's under active development and will be around for a long time to come. It's available under the Apache 2.0 license.

{% cta-1 %}

Kubeflow

Kubeflow started as an internal Google project for running TensorFlow jobs on K8s. Now it's an open-source project available under the Apache 2.0 license. Like Argo, it's a cloud-native platform designed explicitly to run on Kubernetes. Kubeflow is available as a packaged distribution for most major K8s implementations or as a manifest.

This workflow platform is for building and experimenting with machine learning (ML) pipelines. Unlike Argo Workflows, Kubeflow is purpose-built for running ML applications. It includes services for running Jupyter notebooks, building pipelines for multi- and parallel-step workflows, a dashboard UI, and several other components.

While Kubeflow's authors originally built it for TensorFlow, it supports PyTorch, MXNet, MPI, XGBoost, and several other ML frameworks.

Hello, World

Let's look at how you define a workflow on these two platforms.

Argo's YAML

Argo Workflows opens its tutorials with a simple Hello, World example.

Here's their YAML markup. We can use it to review some basic Argo concepts.
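The listing below is a close approximation of the hello-world definition from Argo's documentation, annotated with the concepts we'll walk through:

```yaml
apiVersion: argoproj.io/v1alpha1   # standard header: the Argo API version
kind: Workflow                     # the document type: a Workflow resource
metadata:
  generateName: hello-world-       # Argo appends a unique suffix to this prefix
spec:
  entrypoint: whalesay             # the template the workflow starts with
  templates:
  - name: whalesay
    container:
      image: docker/whalesay       # public image that ships the cowsay command
      command: [cowsay]
      args: ["hello world"]
```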

This workflow echoes "hello world" to standard output using the cowsay command, which ships in the docker/whalesay image. Cowsay prints the message as ASCII art.

The definition file's first few lines have a standard header, a workflow type, and a workflow name. The header specifies the API version, while the document type identifies it as a workflow. The name is required, so Argo has a unique string to prefix workflow instances.

Next, the specification starts with the {% c-line %}spec{% c-line-end %} field. Argo workflow steps use {% c-line %}templates{% c-line-end %}. They're reusable objects, similar to functions, that you can use and reuse for repetitive steps in a workflow. We're only using the template once in this workflow, but Argo's GitHub repo has several examples that demonstrate how you can reuse templates by passing parameters and returning results.
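As an illustrative sketch of that pattern (not copied verbatim from the repo), a template can declare an input parameter and be invoked by several steps with different arguments:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: reuse-template-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: greet-world          # first step calls the template
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello world"
    - - name: greet-argo           # second step reuses the same template
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello argo"
  - name: whalesay
    inputs:
      parameters:
      - name: message              # parameter the template expects
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
```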

Like definitions, steps have names. This one is {% c-line %}whalesay{% c-line-end %}. It loads the Docker image and passes the command and arguments to run. This step gets to the heart of the example: you can run a container in an ordered workflow with three lines of code.

Hera Python

Next, we can define the same example using Hera instead of Argo's YAML DSL.
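Here's a minimal sketch using Hera's v5 API (assuming the hera package is installed; actually submitting it to a cluster would also need a configured connection to your Argo server):

```python
from hera.workflows import Container, Workflow

# The same hello-world workflow as the YAML version, expressed in Python.
with Workflow(generate_name="hello-world-", entrypoint="whalesay") as w:
    Container(
        name="whalesay",
        image="docker/whalesay",
        command=["cowsay"],
        args=["hello world"],
    )

# Render the equivalent Argo YAML; w.create() would submit it to a cluster
# once a workflows service/server connection is configured.
print(w.to_yaml())
```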

Hera makes it easy to specify the image, the command, and the arguments in Python instead of YAML. Hera also has full parity with Argo Workflows features as of version 5.

Kubeflow Python

Finally, let's look at Kubeflow's Hello, World example.

With Kubeflow, your task is a Python function that you convert into a workflow step with a decorator. Then you pass the component into a pipeline that you create with Kubeflow's DSL. The kfp compiler compiles the code into an Argo Workflow definition. Kubeflow runs its pipelines with Argo Workflows.
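Here's a rough sketch using the kfp SDK's v2-style decorators (the say_hello and hello_pipeline names are just illustrative):

```python
from kfp import compiler, dsl

@dsl.component                     # turns a plain Python function into a pipeline step
def say_hello(name: str) -> str:
    message = f"Hello, {name}!"
    print(message)
    return message

@dsl.pipeline(name="hello-world")  # wires components together with Kubeflow's DSL
def hello_pipeline(recipient: str = "World"):
    say_hello(name=recipient)

# Compile the pipeline into a definition the Kubeflow Pipelines backend can run.
compiler.Compiler().compile(hello_pipeline, package_path="hello_pipeline.yaml")
```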

{% related-articles %}

Advantages and Tradeoffs

Argo Workflows places its focus firmly on orchestrating workflows on Kubernetes. Your code runs in containers, while Argo manages how tasks run based on your workflow definitions. You can control task ordering and build directed acyclic graphs (DAGs) with Argo's YAML or the Hera API. You can also manage your workflows directly via the argo CLI and kubectl.
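For example, assuming the argo CLI is installed and pointed at the namespace where Argo Workflows runs:

```bash
# Submit the hello-world definition and watch it run to completion.
argo submit --watch hello-world.yaml

# Workflows are ordinary Kubernetes custom resources, so both CLIs can list them.
argo list
kubectl get workflows
```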


Kubeflow describes itself as "The Machine Learning Toolkit for Kubernetes," and that's precisely what it is. It's a suite of tools for managing ML development and testing on Kubernetes. One of its capabilities is defining and running pipelines, and it runs those pipelines using Argo Workflows.

Kubeflow comes with features for managing ML development and testing, and you define your workflows via Kubeflow's decorators and DSL. While this makes Kubeflow a compelling choice for ML development, it pushes workflow management into the background: instead of managing Argo directly, you do it through an abstraction in your Python code. Unless you're very deliberate, you'll end up mixing your workflow definitions with your model code, and if you want to move to a different toolset, you'll have to refactor a lot of it.

With Argo, you have a distinct choice: keep your application code separate from your workflow YAML or integrate carefully using Hera.

Kubeflow vs. Argo Workflows: Which One?

In this article, we compared Kubeflow and Argo Workflows. Both are workflow management tools for data pipelines. Both are designed for Kubernetes, but they are very different platforms. Argo's primary focus is on workflow management, while Kubeflow is a platform for ML development that uses Argo to create its workflows on Kubernetes. We looked at how to create workflows on the two platforms and discussed the advantages and tradeoffs of the two systems.

Which system is best for you depends on your specific requirements and what you need from your workflow orchestrator. Now that you know the differences, pick one and get started!

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure to offer simplicity and efficiency when it comes to data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
