Artifacts Management at Scale for CI and Data Processing Workflows

A team’s ability to process, analyze, and make sense of large amounts of data is a must, yet the growing complexity of data pipelines makes doing so accurately and efficiently a genuine challenge. That challenge was top of mind as I prepared for my talk at ArgoCon EU.

While there, Julie Vogelman, Staff Software Engineer at Intuit, and I delivered the talk “Managing Artifacts at Scale for CI and Data Processing.” We shared the best practices we learned from using Argo Workflows at Intuit and Pipekit that helped our teams address some of the most common challenges that come with artifact management.

This post highlights those challenges, but first it summarizes the types of artifacts and data you’re most likely to use with Argo Workflows and explains why it’s important to manage your artifacts effectively.

Types of artifacts used in data processing

Artifacts are one kind of byproduct created during software development. Some artifacts describe the function, architecture, and design of software, while others detail the development process itself. The role an artifact plays depends largely on its use case.

In data processing, artifacts take on a few different forms, including tabular data, image and video files, geospatial data, and vector data. We often see tabular data used in data processing and analysis, whereas image and video files tend to be used for object detection, facial recognition, and image processing tasks. Geospatial and vector data, meanwhile, are most prevalent in mapping and geographic information systems (GIS) applications.

Artifacts used in CI (continuous integration) look very different. Here, we’re looking at Dockerfiles, Git repositories, built binaries, binary caches, and other software components. For example, Dockerfiles specify the environment in which software runs, while Git repositories store the source code, allowing developers to track changes as they collaborate. Built binaries and binary caches provide the final executable code that gets deployed to production, making them essential in CI.

With machine learning, we’re working with artifacts such as training datasets, trained ML models, and feature stores. Training datasets are a key artifact in machine learning: they are used to train models and to extract the relevant features that serve as model inputs, and they might include tabular data, image and video files, and text or audio data, among other things. Trained ML models make predictions or classifications based on ML algorithms applied to data; once created, these models get saved as artifacts to be used later in the data processing pipeline. Feature stores collect features extracted from raw data and serve them as inputs for ML models, providing a reliable way to share feature data and ensure reproducibility and consistency throughout the ML workflow.

The persistence of data in Argo workflows

One thing to be aware of when working with these artifacts is that they involve several different types of data. During the talk, I chose to zoom in on how persistent each type of data is, because persistence significantly shapes how that data is managed throughout the Argo workflow.

Here are those different types of data:

  • Transient data — Passed between workflow steps; data not needed beyond life of the workflow
  • Semi-persistent data — Relevant across multiple workflow runs, yet easy to reinstate if lost, such as module, binary, or Dockerfile caches
  • Persistent data — Data we want to keep, often the final output from a running workflow

Transient data is data we don’t care about beyond the step in which it’s used; we rarely need it once the workflow ends. As data becomes more persistent, we grow more concerned with how to archive it and find it later. With semi-persistent data, losing data tends to be less of a concern because it can be reinstated fairly easily.
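To make the distinction concrete, here’s a minimal sketch of a workflow that treats one output artifact as transient and another as persistent. It assumes a default S3-compatible artifact repository is already configured and that artifact garbage collection (available in Argo Workflows 3.4 and later) is enabled; the image, paths, and key are illustrative only.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: persistence-example-
spec:
  entrypoint: produce
  templates:
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo scratch > /tmp/scratch.txt && echo report > /tmp/report.txt"]
      outputs:
        artifacts:
          # Transient: only needed while the workflow exists, so let Argo
          # delete it from the bucket once the workflow completes.
          - name: scratch
            path: /tmp/scratch.txt
            artifactGC:
              strategy: OnWorkflowCompletion
          # Persistent: a final output we want to keep under a predictable key.
          - name: report
            path: /tmp/report.txt
            s3:
              key: "reports/{{workflow.name}}/report.txt"
```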

How to choose the right storage type for your artifacts

Storage type is an important consideration when building an Argo workflow, and there are three main options to explore: blob storage, block storage, and network file systems (NFS). When selecting a storage type for your workflow, it's important to weigh the pros and cons of each option and choose the best fit for your specific use case.

Blob storage is the most popular storage option and the one most of us are familiar with. Examples include S3, GCS, and MinIO. Blob storage is easy to set up, easily queryable, offers useful data archiving settings, and makes artifacts visible in the Argo Workflows UI, all of which make it a convenient option. However, it does have some downsides: it’s slower because data is tarred between workflow steps, and it lacks Read-Write-Many compatibility.
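As a rough illustration, here’s what writing an output artifact to S3-compatible blob storage can look like; the endpoint, bucket, and Secret names are placeholders for your own environment.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: blob-storage-example-
spec:
  entrypoint: build-dataset
  templates:
    - name: build-dataset
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["mkdir -p /out && echo data > /out/part-0.csv"]
      outputs:
        artifacts:
          - name: dataset
            path: /out                      # directories are tarred before upload
            s3:
              endpoint: s3.amazonaws.com
              bucket: my-artifact-bucket    # placeholder bucket name
              key: "datasets/{{workflow.name}}/dataset.tgz"
              accessKeySecret:              # Kubernetes Secret holding credentials
                name: my-s3-credentials
                key: accessKey
              secretKeySecret:
                name: my-s3-credentials
                key: secretKey
```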

Block storage is a less commonly used alternative that offers slight performance advantages over blob storage. Like blob storage, it is easy to set up, and it has lower latency and offers easy data backup and restoration. However, it is intended for per-instance storage use cases and, like blob storage, doesn't have Read-Write-Many compatibility.
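Here’s a minimal sketch of using block storage through a workflow-scoped volume claim; the storageClassName is a placeholder for whatever block-storage class your cluster provides, and because block volumes are typically ReadWriteOnce, steps that share the volume need to run on the same node.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: block-storage-example-
spec:
  entrypoint: write
  # A PVC created for this workflow and deleted along with it.
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3          # placeholder block-storage class
        resources:
          requests:
            storage: 10Gi
  templates:
    - name: write
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo intermediate > /work/data.txt"]
        volumeMounts:
          - name: workdir
            mountPath: /work
```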

Network file systems offer Read-Write-Many compatibility and auto-scale well for high-throughput needs. There’s no tarring required here, which means faster transfers between steps. Drawbacks of NFS to consider: it’s only usable for cloud setups, it can’t be dynamically provisioned with a PVC, and it can be slow with parallel reads and writes at scale.
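For comparison, here’s a sketch of two parallel steps writing to the same pre-provisioned, NFS-backed ReadWriteMany volume; the PVC name is a placeholder, and the share is assumed to exist before the workflow runs.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nfs-example-
spec:
  entrypoint: fan-out
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: nfs-shared-pvc      # pre-created ReadWriteMany PVC (placeholder)
  templates:
    - name: fan-out
      steps:
        # Both steps run in parallel and mount the same share.
        - - name: write-a
            template: writer
            arguments:
              parameters:
                - name: file
                  value: a.txt
          - name: write-b
            template: writer
            arguments:
              parameters:
                - name: file
                  value: b.txt
    - name: writer
      inputs:
        parameters:
          - name: file
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo hello > /shared/{{inputs.parameters.file}}"]
        volumeMounts:
          - name: shared
            mountPath: /shared
```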

How to pass artifacts between workflow steps

The core problem at hand with artifact management is finding a way to pass data between workflow steps. But before starting to pass artifacts between steps, you’ll need to configure your artifact management properly (a configuration sketch follows the list below). That will involve:

  • Centralizing the artifact repository configuration for scale and security
  • Implementing smart artifact naming patterns
  • Managing small versus large artifacts
  • Utilizing artifact garbage collection
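As an illustration of centralized configuration, naming patterns, and garbage collection together, here’s a sketch of an artifact repository defined once in a ConfigMap and referenced from a workflow; the bucket, Secret, and repository key names are placeholders.

```yaml
# Centralized repository definition, shared by every workflow that references it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories            # name Argo Workflows looks up by default
  annotations:
    workflows.argoproj.io/default-artifact-repository: team-s3-repo
data:
  team-s3-repo: |
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-artifact-bucket
      # Smart naming: date- and workflow-scoped keys keep artifacts findable.
      keyFormat: "artifacts/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.name}}/{{pod.name}}"
      accessKeySecret:
        name: my-s3-credentials
        key: accessKey
      secretKeySecret:
        name: my-s3-credentials
        key: secretKey
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: centralized-repo-example-
spec:
  entrypoint: main
  # Clean up this run's artifacts from the bucket when the workflow is deleted.
  artifactGC:
    strategy: OnWorkflowDeletion
  # Point at the shared repository instead of repeating credentials per workflow.
  artifactRepositoryRef:
    configMap: artifact-repositories
    key: team-s3-repo
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo result > /tmp/result.txt"]
      outputs:
        artifacts:
          - name: result
            path: /tmp/result.txt
```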

Once that has been managed, you will be ready to address the core problem. To do this you can use parameters, blob storage, or network file systems.

  • Parameters — Parameters are the most basic way to pass information, including strings and commands, between steps in a workflow. They also allow for passing small script outputs and small, stringified .json and .txt files. The biggest limitation is that the workflow object itself has a size limit, so parameters only work for small payloads (see the sketch after this list).
  • Blob storage — Blob storage suits more persistent artifacts, such as data tables and trained models. To pass data between workflow steps, outputs must be packaged up (tarred and uploaded) between steps.
  • Network file system — NFS doesn’t require you to package up outputs between steps, which saves runtime, and it’s the option to use if you want Read-Write-Many functionality. For some use cases, though, NFS comes at a higher cost as a tradeoff for the faster runtime.
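Here’s a minimal sketch of the parameter approach: one step writes a small value to a file, and the next step receives that value as an input parameter. The template and parameter names are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parameter-passing-example-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: generate
            template: generate
        - - name: consume
            template: consume
            arguments:
              parameters:
                - name: message
                  value: "{{steps.generate.outputs.parameters.message}}"
    - name: generate
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo hello-from-step-one > /tmp/message.txt"]
      outputs:
        parameters:
          - name: message
            valueFrom:
              path: /tmp/message.txt     # file contents become the parameter value
    - name: consume
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.19
        command: [echo, "{{inputs.parameters.message}}"]
```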

Watch the full talk

Effective artifact management is critical to workflow management in Argo Workflows. By centralizing the storage of artifacts, implementing smart artifact naming patterns, managing small and large artifacts, and using garbage collection, you can ensure scalable, secure, and efficient artifact management.

Watch our complete talk from ArgoCon EU, including a few demos, here. And if you are interested in reviewing the slides alongside the talk, visit our repo. Finally, for insight on another approach to managing your artifacts, watch this talk — Configuring Volumes for Parallel Workflow Reads and Writes — from our team member Tim Collins and Lukonde Mwila from AWS.

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure to offer simplicity and efficiency when it comes to data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
