Why teams are implementing CI/CD for data pipelines using Argo Workflows

While at ArgoCon 2022, I delivered a talk on an emerging trend in data engineering that’s helping teams ensure their data pipelines don’t break in production. During the talk, CI/CD for Data Pipelines with Argo Workflows, I introduced this paradigm shift and highlighted five major learnings:

  • Using Argo Workflows to implement CI/CD for data pipelines
  • Storing Workflows and WorkflowTemplates in git
  • Validating WorkflowTemplates on pull requests
  • Syncing WorkflowTemplates to clusters
  • Testing WorkflowTemplates (demo)

In this post, we’ll take a look at some of the key lessons and takeaways from the talk.

Why you should use CI/CD for data pipelines

When we consider how many teams use data pipelines, and how often they run them, it's surprising how many are pushing to staging and production and simply hoping for the best. Instead, we want to validate our data pipelines and data transforms when pull requests are opened or changes are pushed, so that issues are caught before they reach later stages of the development lifecycle.

Why does this matter? Rollbacks, for one. When a bug gets introduced, it’s hard to move from the current version of staging or production back to a previous version. This task grows even more challenging on larger teams, where many people might be pushing changes at once. What we’ve observed is that most data teams have no concept of versioning for their pipeline components.

This is one of the reasons teams are beginning to apply CI/CD concepts and processes seen in traditional software engineering to data pipelines. Forward thinkers are:

  • Factoring out transforms and other critical pieces into components
  • Running tests on these components during pull requests and other change events
  • Versioning these components using semantic versioning

This new way of approaching data pipelines solves two big problems:

  1. Wasted cloud spend (money) — Data pipelines can significantly increase already high budgets for AWS, GCP, Azure, etc. Having to re-run data-intensive pipelines is a waste of compute resources, so ensuring pipelines pass tests before being run can save teams thousands of dollars per month.
  2. Data engineering time (time) — Re-running a broken data pipeline often means a data engineer or data scientist needs to spend time debugging it and manually re-running it. Smart teams want to learn earlier in the dev cycle whether bugs are being introduced, ideally before changes are actually pushed to staging or production. Using CI/CD to catch common pitfalls earlier in the dev lifecycle saves the cycle time otherwise spent debugging broken data pipelines.

What is Argo Workflows?

Argo Workflows is an open source workflow engine that allows teams to orchestrate jobs on Kubernetes. It is implemented as a custom resource definition (CRD), so it's container native, and you can run it on EKS, GKE, or any other K8s implementation.

With Argo Workflows, you define workloads (workflows) in which each step runs in its own container. Depending on how you define the steps and their dependencies, steps can run sequentially or in parallel. Workflows and their dependencies are written as a YAML spec.
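As a rough illustration, here’s a minimal sketch of such a spec (the task names, images, and messages are placeholders, not from the talk): the two extract tasks have no dependencies, so they run in parallel, and the transform task runs only after both finish.

```yaml
# Minimal Workflow sketch: two parallel tasks feeding a third.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: extract-a
            template: echo
            arguments:
              parameters: [{name: message, value: "extract source A"}]
          - name: extract-b
            template: echo
            arguments:
              parameters: [{name: message, value: "extract source B"}]
          - name: transform
            dependencies: [extract-a, extract-b]   # waits for both extracts
            template: echo
            arguments:
              parameters: [{name: message, value: "transform merged data"}]
    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo {{inputs.parameters.message}}"]
```

Submitting this spec (for example with `argo submit`) creates a Workflow resource, and the controller schedules each task as its own pod on the cluster.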

One of the primary reasons teams are flocking to Argo Workflows is because it allows them to run and scale pipelines for nearly any purpose. Companies commonly use it for machine learning, data processing, continuous integration/continuous deployment (CI/CD), and infrastructure automation.

Understanding, testing, and versioning WorkflowTemplates

What are Argo WorkflowTemplates?

WorkflowTemplates and ClusterWorkflowTemplates are Argo Workflows’ native reusable components. Say you have part of a workflow that you want to factor out into a piece you can reuse over and over again. Saving time by not repeating yourself is one of the main use cases for WorkflowTemplates.

Good candidates for turning into reusable components include:

  • Doing data transforms
  • Setting up/tearing down Kubernetes resources like Dask or Spark deployments
  • Running utilities like cloning git repositories
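To make this concrete, here’s a hedged sketch of a WorkflowTemplate for the last item, a git-clone utility (the resource name, image, and parameters are illustrative, not from the talk):

```yaml
# A reusable git-clone step packaged as a WorkflowTemplate.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: git-clone
spec:
  templates:
    - name: clone
      inputs:
        parameters:
          - name: repo            # e.g. https://github.com/my-org/my-repo.git
          - name: revision
            value: main           # default branch if none is supplied
      container:
        image: alpine/git
        command: [sh, -c]
        args:
          - git clone --branch {{inputs.parameters.revision}} {{inputs.parameters.repo}} /workdir/repo
```

Any Workflow in the same namespace can then call this step via a `templateRef` (e.g. `templateRef: {name: git-clone, template: clone}`) instead of copying it.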

How to test Argo WorkflowTemplates

WorkflowTemplates are essentially functions. Each one can take inputs and generate outputs, which is exactly what we need in order to test them. When testing, we want to make sure that these WorkflowTemplates behave as pure functions (i.e., a given set of inputs always produces the same set of outputs). Anything random or non-deterministic will cause problems.
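One way to exercise this from CI is to submit a small test Workflow on each pull request that calls the template under test with fixed inputs and fails if the output doesn’t match expectations. The sketch below assumes a hypothetical normalize-dates WorkflowTemplate exposing an output parameter named result; none of these names come from the talk.

```yaml
# Test Workflow: call the template with known inputs, then assert on the output.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-normalize-dates-
spec:
  entrypoint: test
  templates:
    - name: test
      steps:
        - - name: run-template
            templateRef:
              name: normalize-dates      # hypothetical WorkflowTemplate under test
              template: normalize
            arguments:
              parameters:
                - name: input
                  value: "01/31/2024"
        - - name: assert
            template: assert-equals
            arguments:
              parameters:
                - name: actual
                  value: "{{steps.run-template.outputs.parameters.result}}"
                - name: expected
                  value: "2024-01-31"
    - name: assert-equals
      inputs:
        parameters:
          - name: actual
          - name: expected
      container:
        image: alpine:3.19
        command: [sh, -c]
        # The step (and therefore the workflow) fails when the values differ.
        args: ['[ "{{inputs.parameters.actual}}" = "{{inputs.parameters.expected}}" ]']
```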

Versioning Argo WorkflowTemplates

Semantic versioning allows for structured promotion of components and easy rollbacks. While great to have, it is not currently built into vanilla Argo Workflows.

There are two ways of implementing semantic versioning for WorkflowTemplates in Argo Workflows that we’ve seen:

  • Appending the version to the name with dashes; e.g., template-12-3-9
  • Adding a label or annotation denoting the version
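Both conventions are just metadata on the WorkflowTemplate; the sketches below show what each might look like (the label key is an illustrative choice, not an Argo-defined one):

```yaml
# Option 1: encode the version in the resource name. Workflows pin a version
# by referencing it explicitly, e.g. templateRef: {name: git-clone-12-3-9, template: clone}.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: git-clone-12-3-9
spec:
  # ...templates as before
---
# Option 2: keep a stable name and record the version as a label or annotation.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: git-clone
  labels:
    workflows.example.com/semver: "12.3.9"
spec:
  # ...templates as before
```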

Watch the full talk and demo

Before wrapping up my talk, I shared a brief demo. In it, I create an Argo Events EventSource and Sensor that listen for GitHub pull requests and run Argo Workflows. After that, I test changes made to a WorkflowTemplate in that same pull request.
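The Sensor side of that setup looks roughly like the sketch below, which submits a test Workflow whenever the GitHub EventSource emits a pull-request event (the names and the referenced WorkflowTemplate are placeholders; the actual manifests are in the demo repo linked below):

```yaml
# Sensor: when the "github" EventSource emits a pull_request event,
# submit a Workflow that runs the WorkflowTemplate tests.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: github-pr
spec:
  dependencies:
    - name: pr
      eventSourceName: github          # matches the EventSource resource name
      eventName: example               # matches the event key inside that EventSource
  triggers:
    - template:
        name: run-template-tests
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: pr-template-tests-
              spec:
                workflowTemplateRef:
                  name: template-test-suite   # hypothetical WorkflowTemplate holding the tests
```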

Visit our repo to run the demo yourself, and watch my full talk from ArgoCon here. If you’re interested in reviewing the slides alongside the talk, you can access them here.

Want to use Argo Workflows with your team? Consider Pipekit. It’s a control plane for Argo Workflows that enables you to develop and run large, complex workflows. With Pipekit, you’ll be able to trigger workflows, collect logs, and manage secrets. It allows you to maintain pipelines across multiple environments and multiple clusters.

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure to offer simplicity and efficiency when it comes to data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
