Why teams are implementing CI/CD for data pipelines using Argo Workflows
November 21, 2022
J.P. discusses and demonstrates how to ensure data pipelines don’t break in production using CI/CD and Argo Workflows. See what a development setup looks like, learn to test workflow templates, and discover strategies to help you version those workflow templates.
At ArgoCon 2022, I delivered a talk on an emerging trend in data engineering that's helping teams ensure their data pipelines don't break in production. In that talk, CI/CD for Data Pipelines with Argo Workflows, I introduced this paradigm shift and highlighted five major learnings:
- Using Argo Workflows to implement CI/CD for data pipelines
- Storing Workflows and WorkflowTemplates in git
- Validating WorkflowTemplates on pull requests
- Syncing WorkflowTemplates to clusters
- Testing WorkflowTemplates (demo)
In this post we’ll take a look at some of the key lessons and takeaways from the event.
Why you should use CI/CD for data pipelines
When we consider how many teams use data pipelines, and how often they use them, it's surprising how many are pushing to staging and production and simply hoping for the best. Instead, we want to validate our data pipelines and data transforms when making pull requests or pushing changes, so that we catch issues before they surface later in the development lifecycle.
Why does this matter? Rollbacks, for one. When a bug gets introduced, it's hard to go from the current version of staging or production back to a previous version. This task grows even more challenging for larger teams, where many people might be pushing changes at once. What we've observed is that most data teams have no concept of versioning for their pipeline components.
This is one of the reasons teams are beginning to apply CI/CD concepts and processes seen in traditional software engineering to data pipelines. Forward thinkers are:
- Factoring out transforms and other critical pieces into components
- Running tests on these components during pull requests and other change events
- Versioning these components using semantic versioning
This new way of approaching data pipelines solves two big problems:
- Wasted cloud spend (money): Data pipelines can significantly increase already high budgets for AWS, GCP, Azure, etc. Having to re-run data-intensive pipelines is a waste of compute resources, so ensuring pipelines pass tests before being run can save teams thousands of dollars per month.
- Data engineering time (time): A broken data pipeline often means a data engineer or data scientist needs to spend time debugging the pipeline and manually re-running it. Smart teams want to learn earlier in the dev cycle if bugs are being introduced, ideally before pushing changes to staging or production. Using CI/CD to catch common pitfalls earlier in the dev lifecycle saves them cycle time on debugging broken data pipelines.
What is Argo Workflows?
Argo Workflows is an open source workflow engine that allows teams to orchestrate jobs on Kubernetes. It is implemented as a custom resource definition (CRD), so it's container native, and you can run it on EKS, GKE, or any other K8s implementation.
With Argo Workflows, you can define workloads (workflows) in which each step runs in its own container. Depending on how you define the steps and their dependencies, they can run sequentially or in parallel. You define workflows and their dependencies in a YAML spec.
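To make the idea of steps and dependencies concrete, here is a minimal sketch in Python that builds a Workflow manifest as a plain dictionary and prints it as JSON, which Kubernetes tooling accepts in place of YAML. The task names, the `echo` template, and the container image are assumptions made up for illustration; they are not taken from the talk.

```python
import json


def build_workflow() -> dict:
    """Return a minimal Argo Workflow manifest with a four-task DAG."""

    def task(name, dependencies=()):
        # Each DAG task calls the shared "echo" template with its own name.
        t = {
            "name": name,
            "template": "echo",
            "arguments": {"parameters": [{"name": "message", "value": name}]},
        }
        if dependencies:
            t["dependencies"] = list(dependencies)
        return t

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "example-pipeline-"},  # illustrative name
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "dag": {
                        "tasks": [
                            task("extract"),
                            # These two depend only on "extract",
                            # so they are free to run in parallel.
                            task("transform-a", ["extract"]),
                            task("transform-b", ["extract"]),
                            # "load" waits for both transforms.
                            task("load", ["transform-a", "transform-b"]),
                        ]
                    },
                },
                {
                    "name": "echo",
                    "inputs": {"parameters": [{"name": "message"}]},
                    "container": {
                        "image": "alpine:3.18",  # assumed image
                        "command": ["echo"],
                        "args": ["{{inputs.parameters.message}}"],
                    },
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_workflow(), indent=2))
```

Tasks without a shared dependency chain can run in parallel, while a task that lists dependencies waits for all of them to finish.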
One of the primary reasons teams are flocking to Argo Workflows is that it allows them to run and scale pipelines for nearly any purpose. Companies commonly use it for machine learning, data processing, continuous integration/continuous deployment (CI/CD), and infrastructure automation.
Understanding, testing, and versioning WorkflowTemplates
What are Argo WorkflowTemplates?
WorkflowTemplates and ClusterWorkflowTemplates are native reusable components of Argo Workflows. Say you have part of a workflow and you want to refactor it into a piece that you can reuse over and over again. Saving time by not repeating tasks is one use case for WorkflowTemplates.
Good candidates to make components in workflows include:
- Doing data transforms
- Setting up/tearing down Kubernetes resources like Dask or Spark deployments
- Running utilities like cloning git repositories
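As a rough sketch of what factoring out a component looks like, the Python below builds two manifests as dictionaries: a WorkflowTemplate that wraps a git clone, and a Workflow that reuses it through `templateRef`. The resource names, the `clone` template, the image, and the repository URL are hypothetical values chosen for illustration.

```python
import json


def make_workflow_template(name: str) -> dict:
    """A reusable component: clone a git repository passed in as a parameter."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "WorkflowTemplate",
        "metadata": {"name": name},
        "spec": {
            "templates": [
                {
                    "name": "clone",
                    "inputs": {"parameters": [{"name": "repo"}]},
                    "container": {
                        "image": "alpine/git",  # assumed image that ships git
                        "command": ["git"],
                        "args": ["clone", "{{inputs.parameters.repo}}", "/src"],
                    },
                }
            ]
        },
    }


def make_caller(template_name: str, repo: str) -> dict:
    """A Workflow that calls the component instead of redefining it."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "uses-clone-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "steps": [
                        [
                            {
                                "name": "clone-repo",
                                # templateRef points at a template inside
                                # a WorkflowTemplate resource by name.
                                "templateRef": {
                                    "name": template_name,
                                    "template": "clone",
                                },
                                "arguments": {
                                    "parameters": [{"name": "repo", "value": repo}]
                                },
                            }
                        ]
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    component = make_workflow_template("git-clone")
    caller = make_caller("git-clone", "https://example.com/org/repo.git")
    print(json.dumps([component, caller], indent=2))
```

Any number of workflows can reference the same component this way, so a fix to the clone logic only has to be made in one place.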
How to test Argo WorkflowTemplates
WorkflowTemplates behave like functions. Each one takes inputs and generates outputs, and those inputs and outputs are what we test against. When testing, we want to make sure that the WorkflowTemplates for these components are pure functions (i.e., a given set of inputs always produces the same set of outputs). Anything random or non-deterministic will cause problems.
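The pure-function idea can be sketched in a few lines of Python. Here a plain function stands in for a WorkflowTemplate: it takes a dictionary of input parameters and returns a dictionary of output parameters. In a real setup the `run` callable would submit the template to a cluster and read back its outputs; that part, along with the function names and sample data, is an assumption for illustration and not the setup from the demo.

```python
import uuid
from typing import Callable, Dict

Params = Dict[str, str]


def dedupe_and_sort(params: Params) -> Params:
    """Stand-in for a deterministic transform template."""
    rows = params["rows"].split(",")
    return {"rows": ",".join(sorted(set(rows)))}


def flaky_transform(params: Params) -> Params:
    """Stand-in for a template that leaks randomness into its outputs."""
    out = dedupe_and_sort(params)
    out["run_id"] = str(uuid.uuid4())  # differs on every call
    return out


def is_deterministic(run: Callable[[Params], Params], inputs: Params,
                     repeats: int = 3) -> bool:
    """Run the same inputs several times and compare the outputs."""
    outputs = [run(dict(inputs)) for _ in range(repeats)]
    return all(o == outputs[0] for o in outputs)


def matches_expected(run: Callable[[Params], Params], inputs: Params,
                     expected: Params) -> bool:
    """Check the outputs for known inputs against a fixed expectation."""
    return run(dict(inputs)) == expected


if __name__ == "__main__":
    inputs = {"rows": "b,a,b"}
    print(matches_expected(dedupe_and_sort, inputs, {"rows": "a,b"}))
    print(is_deterministic(dedupe_and_sort, inputs))
    print(is_deterministic(flaky_transform, inputs))
```

The second stand-in shows why non-determinism is a problem: its outputs can never be compared against a fixed expectation, so a test against it cannot pass reliably.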
Versioning Argo WorkflowTemplates
Semantic versioning allows for structured promotion of components and easy rollbacks. While great to have, it is not currently available in vanilla Argo.
There are two ways of implementing semantic versioning for WorkflowTemplates in Argo Workflows that we’ve seen:
- Appending the version to the name with dashes; e.g., template-12-3-9 for version 12.3.9
- Adding a label or annotation denoting the version
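Both approaches can be sketched with a few Python helpers. The label key below is a hypothetical placeholder, and the helper names are ours; the example shows the two conventions side by side, though a team would likely settle on one. It also shows how a rollback target could be picked once versions are parseable.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical label key used for illustration; choose one that fits your org.
VERSION_LABEL = "example.com/template-version"

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")
_VERSIONED_NAME = re.compile(r"^(.+)-(\d+)-(\d+)-(\d+)$")


def versioned_name(base: str, version: str) -> str:
    """Append a MAJOR.MINOR.PATCH version to a name using dashes."""
    if not _SEMVER.match(version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return f"{base}-{version.replace('.', '-')}"


def parse_versioned_name(name: str) -> Tuple[str, Tuple[int, int, int]]:
    """Split a dashed name back into its base and numeric version."""
    m = _VERSIONED_NAME.match(name)
    if m is None:
        raise ValueError(f"no version suffix in {name!r}")
    base, major, minor, patch = m.groups()
    return base, (int(major), int(minor), int(patch))


def versioned_metadata(base: str, version: str) -> Dict[str, object]:
    """Metadata carrying the version both in the name and in a label."""
    return {
        "name": versioned_name(base, version),
        "labels": {VERSION_LABEL: version},
    }


def previous_version(names: List[str], current: str) -> Optional[str]:
    """Pick the highest version below `current` as a rollback target."""
    base, cur = parse_versioned_name(current)
    older = [
        n for n in names
        if parse_versioned_name(n)[0] == base and parse_versioned_name(n)[1] < cur
    ]
    # Compare numerically so that 12.3.8 ranks above 2.0.0.
    return max(older, key=lambda n: parse_versioned_name(n)[1], default=None)


if __name__ == "__main__":
    print(versioned_metadata("template", "12.3.9"))
    deployed = ["template-2-0-0", "template-12-3-8", "template-12-3-9"]
    print(previous_version(deployed, "template-12-3-9"))
```

Comparing versions as integer tuples and not as strings matters here, since a plain string sort would place 2.0.0 after 12.3.8.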
Watch the full talk and demo
Before wrapping up my talk, I shared a brief demo. In it I create an Argo Event source and sensor that reads GitHub pull requests and runs Argo Workflows. After that, I test changes made to a WorkflowTemplate in that same pull request.
Visit our repo to run the demo yourself, and watch my full talk from ArgoCon here. If you’re interested in reviewing the slides alongside the talk, you can access them here.
Want to use Argo Workflows with your team? Consider Pipekit. It’s a control plane for Argo Workflows that enables you to develop and run large, complex workflows. With Pipekit, you’ll be able to trigger workflows, collect logs, and manage secrets. It allows you to maintain pipelines across multiple environments and multiple clusters.