
Building a Backtesting Pipeline with Python and Argo Workflows

Python and Argo Workflows help you manage and scale your backtesting activities quickly and efficiently. Argo Workflows is a powerful tool for creating and managing complex workflows for your applications, and Python provides traders with a robust set of libraries and tools for working with data, creating models, and testing them.

With Python, you can develop and test your financial and machine learning models, automate data extraction and preprocessing tasks, and quickly set up backtesting activities. Adding Argo Workflows brings task orchestration to those backtesting activities: the workflows are highly configurable and can be modified and managed quickly.

Let's put together a simple backtesting solution with Python and Argo Workflows so you can see it in action.


Backtesting with Python and Argo Workflows

Prerequisites

To follow this tutorial, you'll need a few things:

  • Familiarity with running a workflow on Argo Workflows.
  • A cluster with Argo Workflows installed. You can follow the quickstart with any system capable of running Docker Desktop, a Docker install with minikube, or a Kubernetes cluster.
  • An Artifact Repository for passing data between steps in your workflow. Learn more about that here.
  • A Docker registry that your cluster can pull images from. For this tutorial, I'll be pushing containers to my personal Docker Hub account.

We'll be keeping things simple, so you don't need more than one worker node. The source for this tutorial is available in two GitHub repos. The market data downloader is here, and the backtester is here.

Backtesting with Python

To scale backtesting with Argo Workflows, we can break it down into two major steps: downloading pricing data and processing it with our trading models. Once you understand how to write a simple workflow, you can add more steps if they make sense for your process.

We're going to write two scripts. One will download end-of-day equities data, and the other will apply a simple moving average strategy. Once you understand how the parts go together, you can, of course, plug in your preferred data source and your own algorithms.

For downloading data, we're going to use Quandl.

Here's the script:
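The exact source isn't reproduced here, but a minimal sketch might look like the following. The free WIKI end-of-day dataset, the QUANDL_API_KEY environment variable, and the file-naming scheme are assumptions, not the original code:

```python
import os
import sys

import quandl

# The workflow passes three positional arguments: ticker, start date, end date.
# (A production script would validate these and handle errors.)
ticker, start_date, end_date = sys.argv[1], sys.argv[2], sys.argv[3]

# Assumes the API key is supplied via the environment.
quandl.ApiConfig.api_key = os.environ["QUANDL_API_KEY"]

# Download end-of-day pricing data for the requested ticker and date range.
data = quandl.get(f"WIKI/{ticker}", start_date=start_date, end_date=end_date)

# Write the results to a file named for the ticker and date range,
# e.g. AAPL-2017-01-01-2017-12-31.csv.
data.to_csv(f"{ticker}-{start_date}-{end_date}.csv")
```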

This script looks for a ticker, start date, and end date on the command line. It downloads the data and places the results in a file named for the ticker and date range. As you'll see below, the command line works well for passing arguments to a task in Argo Workflows. In a production environment, you'd want to add error checking to ensure the arguments are valid; for now, we're keeping the code simple.


For backtesting, we'll use Backtesting.py with code right out of one of their samples.
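Again, the original file isn't shown here; a sketch based on Backtesting.py's moving-average crossover sample might look like this, with the input and output file names assumed to match the downloader:

```python
import sys

import pandas as pd
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA


class SmaCross(Strategy):
    """Simple moving-average crossover strategy from the Backtesting.py samples."""

    def init(self):
        price = self.data.Close
        self.ma1 = self.I(SMA, price, 10)
        self.ma2 = self.I(SMA, price, 20)

    def next(self):
        if crossover(self.ma1, self.ma2):
            self.buy()
        elif crossover(self.ma2, self.ma1):
            self.sell()


# The workflow passes the same three arguments as the downloader.
ticker, start_date, end_date = sys.argv[1], sys.argv[2], sys.argv[3]

# Read the CSV the download step produced (file name assumed).
data = pd.read_csv(f"{ticker}-{start_date}-{end_date}.csv",
                   index_col=0, parse_dates=True)

bt = Backtest(data, SmaCross, commission=.002, exclusive_orders=True)
stats = bt.run()

# Write the results so Argo can pick them up as an output artifact
# (file name assumed).
with open(f"{ticker}-{start_date}-{end_date}-results.txt", "w") as f:
    f.write(str(stats))
```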

Building Backtesting Containers

Now it's time to put these scripts into Docker containers so Argo can run them in Kubernetes. Here's the Dockerfile for the downloader:
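The file isn't reproduced here, but a sketch along these lines would do the job; the Python image tag and the script name download.py are assumptions:

```dockerfile
# "slim" rather than Alpine, for the OS-level libraries the Python packages need.
FROM python:3.9-slim

WORKDIR /app

# Upgrade pip, then install the required libraries (including Quandl).
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

# Copy over the Python source file.
COPY download.py .

ENTRYPOINT ["python", "download.py"]
```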

The requirements file adds Quandl. Build the container with a tag for your Docker registry.
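For example (the image name here is just a placeholder; substitute your own registry and tag):

```shell
docker build -t <your-registry>/market-data-downloader:latest .
```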

Then push the container.
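Using the same placeholder tag:

```shell
docker push <your-registry>/market-data-downloader:latest
```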

Here's the Dockerfile for the backtester:
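A sketch, again with the image tag and the script name backtest.py assumed:

```dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

COPY backtest.py .

ENTRYPOINT ["python", "backtest.py"]
```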

It's nearly identical to the previous file.

Build and push it to your container registry. In both cases, the Dockerfile uses the "slim" image as the base because it provides the OS-level support that the underlying Python libraries need and that Alpine lacks.

Then it upgrades pip before installing the required libraries and copying over the Python source files.

Writing a Python Backtest Workflow

Now that we have our two containers, we need a workflow to execute the backtest. Let's go over it section by section. First, we need to define the workflow and give it a name.
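Something like this; the entrypoint name and the default parameter values are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: equity-backtest-
spec:
  entrypoint: backtest-workflow
  arguments:
    parameters:
      - name: ticker
        value: AAPL
      - name: start_date
        value: "2017-01-01"
      - name: end_date
        value: "2017-12-31"
```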

This sets the document type as a workflow, specifies that the workflows Argo creates should be named {% c-line %}equity-backtest-XX{% c-line-end %}, and names three workflow parameters: {% c-line %}ticker{% c-line-end %}, {% c-line %}start_date{% c-line-end %}, and {% c-line %}end_date{% c-line-end %}. These are the arguments our Python scripts are looking for.

Code reuse is one of our primary objectives here, and the parameters make this workflow reusable. You can pass them via the command line or copy the workflow file, edit the three parameters in one place, and check them into source control. You could even use Python code to generate new workflow files on demand.

Next, we need the templates that run the containers. Here's the downloader:
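A sketch of the template; the image path, artifact name, and file paths are assumptions carried over from the Dockerfile above:

```yaml
  - name: download
    inputs:
      parameters:
        - name: ticker
        - name: start_date
        - name: end_date
    outputs:
      artifacts:
        - name: price-data
          path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}.csv
          archive:
            none: {}
    container:
      image: <your-registry>/market-data-downloader:latest
      args:
        - "{{inputs.parameters.ticker}}"
        - "{{inputs.parameters.start_date}}"
        - "{{inputs.parameters.end_date}}"
```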

The {% c-line %}container{% c-line-end %} section specifies the image with the full path from the Docker registry, and its {% c-line %}args{% c-line-end %} list passes the three input parameters on the command line. Near the top, the template identifies the output file as a named artifact. Here's that output artifact code again:
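(The artifact name and path are the assumed values from the sketch above.)

```yaml
outputs:
  artifacts:
    - name: price-data
      path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}.csv
      archive:
        none: {}
```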

The file name is built from the input parameters, matching the way the Python script creates it. We also tell Argo not to create a zipped archive with the none: {} setting. The backtester's template looks for a file with the same name:
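A matching sketch for the backtester, with the same assumed names:

```yaml
  - name: backtest
    inputs:
      parameters:
        - name: ticker
        - name: start_date
        - name: end_date
      artifacts:
        - name: price-data
          path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}.csv
    outputs:
      artifacts:
        - name: results
          path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}-results.txt
          archive:
            none: {}
    container:
      image: <your-registry>/backtester:latest
      args:
        - "{{inputs.parameters.ticker}}"
        - "{{inputs.parameters.start_date}}"
        - "{{inputs.parameters.end_date}}"
```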

This template has four inputs: the ticker, both dates, and an input artifact, a file with the same name as the one the previous template produces. The file name matches, but we haven't tied the templates together yet. The backtest also produces an output file of its own, defined in the template's outputs section. Finally, we need the workflow steps to get these templates working together:
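A sketch of the steps template, with the step and artifact names assumed:

```yaml
  - name: backtest-workflow
    steps:
      - - name: download-step
          template: download
          arguments:
            parameters:
              - name: ticker
                value: "{{workflow.parameters.ticker}}"
              - name: start_date
                value: "{{workflow.parameters.start_date}}"
              - name: end_date
                value: "{{workflow.parameters.end_date}}"
      - - name: backtest-step
          template: backtest
          arguments:
            parameters:
              - name: ticker
                value: "{{workflow.parameters.ticker}}"
              - name: start_date
                value: "{{workflow.parameters.start_date}}"
              - name: end_date
                value: "{{workflow.parameters.end_date}}"
            artifacts:
              - name: price-data
                from: "{{steps.download-step.outputs.artifacts.price-data}}"
```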

This runs the containers one after the other. The parameter values in both steps refer back to the workflow parameters, so whatever values you set in the workflow's arguments (or pass on the command line) are what the tasks see. The artifacts entry in the backtest step tells it to use the file produced by the download step. Let's run this workflow.


Running a Python Backtest Job

Here's the entire workflow in one file:
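Assembled from the sketches above (all of the same assumed names, paths, and default values apply):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: equity-backtest-
spec:
  entrypoint: backtest-workflow
  arguments:
    parameters:
      - name: ticker
        value: AAPL
      - name: start_date
        value: "2017-01-01"
      - name: end_date
        value: "2017-12-31"
  templates:
    - name: download
      inputs:
        parameters:
          - name: ticker
          - name: start_date
          - name: end_date
      outputs:
        artifacts:
          - name: price-data
            path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}.csv
            archive:
              none: {}
      container:
        image: <your-registry>/market-data-downloader:latest
        args:
          - "{{inputs.parameters.ticker}}"
          - "{{inputs.parameters.start_date}}"
          - "{{inputs.parameters.end_date}}"
    - name: backtest
      inputs:
        parameters:
          - name: ticker
          - name: start_date
          - name: end_date
        artifacts:
          - name: price-data
            path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}.csv
      outputs:
        artifacts:
          - name: results
            path: /app/{{inputs.parameters.ticker}}-{{inputs.parameters.start_date}}-{{inputs.parameters.end_date}}-results.txt
            archive:
              none: {}
      container:
        image: <your-registry>/backtester:latest
        args:
          - "{{inputs.parameters.ticker}}"
          - "{{inputs.parameters.start_date}}"
          - "{{inputs.parameters.end_date}}"
    - name: backtest-workflow
      steps:
        - - name: download-step
            template: download
            arguments:
              parameters:
                - name: ticker
                  value: "{{workflow.parameters.ticker}}"
                - name: start_date
                  value: "{{workflow.parameters.start_date}}"
                - name: end_date
                  value: "{{workflow.parameters.end_date}}"
        - - name: backtest-step
            template: backtest
            arguments:
              parameters:
                - name: ticker
                  value: "{{workflow.parameters.ticker}}"
                - name: start_date
                  value: "{{workflow.parameters.start_date}}"
                - name: end_date
                  value: "{{workflow.parameters.end_date}}"
              artifacts:
                - name: price-data
                  from: "{{steps.download-step.outputs.artifacts.price-data}}"
```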

Submit it to your cluster from the command line.
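A typical submission, assuming the file is saved as backtest-workflow.yaml and Argo is installed in the argo namespace, might look like this:

```shell
argo submit -n argo --watch backtest-workflow.yaml \
  -p ticker=AAPL \
  -p start_date=2017-01-01 \
  -p end_date=2017-12-31
```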

Here's what I see in the web GUI after running the job on my cluster:

[Screenshot: the completed workflow in the Argo Workflows web GUI]

The GUI shows the output artifacts in the tree structure, using the names we gave them in the workflow. Click on the results. Argo doesn't have a widget for displaying CSV files, but click view anyway to see the raw output.

[Screenshot: the raw CSV output displayed in the GUI]

Concurrent Backtest Jobs with Argo Workflows

So, we've put together a backtesting workflow with two steps. It uses parameters to select the ticker and dates, so you could easily use Argo to run multiple concurrent jobs and, with a cluster, scale your resources up and down depending on the load. With a few small additions, you can set this workflow up as a cron workflow and run your backtest jobs daily, weekly, or monthly.
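For example, wrapping the same spec in a CronWorkflow would run it on a schedule; the name and schedule here are placeholders, and the spec from the Workflow above slots in unchanged:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: equity-backtest-nightly
spec:
  schedule: "0 1 * * *"        # every night at 01:00
  workflowSpec:
    # The entire spec from the Workflow above goes here, unchanged:
    entrypoint: backtest-workflow
    # arguments, templates, ...
```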

But what if your tests are more complex? For example, you may have a processor that needs to wait for two or more download jobs to complete. Or, you may want to have two jobs triggered by a single download. For this, you can use a DAG. Let's rewrite the workflow as a DAG:
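Only the entrypoint template changes; here's a sketch of the DAG version, with the same assumed names:

```yaml
    - name: backtest-workflow
      dag:
        tasks:
          - name: download-task
            template: download
            arguments:
              parameters:
                - name: ticker
                  value: "{{workflow.parameters.ticker}}"
                - name: start_date
                  value: "{{workflow.parameters.start_date}}"
                - name: end_date
                  value: "{{workflow.parameters.end_date}}"
          - name: backtest-task
            template: backtest
            dependencies: [download-task]
            arguments:
              parameters:
                - name: ticker
                  value: "{{workflow.parameters.ticker}}"
                - name: start_date
                  value: "{{workflow.parameters.start_date}}"
                - name: end_date
                  value: "{{workflow.parameters.end_date}}"
              artifacts:
                - name: price-data
                  from: "{{tasks.download-task.outputs.artifacts.price-data}}"
```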

The change is in the final template: instead of steps, we have tasks, and the download task is now a dependency of the backtest task. This lets us add further tasks that run in parallel with tasks they don't depend on, or wait for the output of another task they rely on. If you run this, you'll see the same result as the previous workflow, but now you have the foundation for more complex workflows.


Scalable Python Backtesting with Argo Workflows

In this post, we built two Python containers and ran them inside two different Argo workflows. We saw how easy it is to pass the output of one container to another and take advantage of Argo's ability to orchestrate tasks on a Kubernetes cluster. Argo Workflows makes creating scalable and robust workflows simple. Put your k8s cluster to work on your backtesting tasks today!

Are your data pipelines scalable and reliable?

Operating data pipelines at scale doesn't have to be unreliable and costly. Put an end to the stress of unreliable data pipelines and data engineering backlogs and turn data into revenue-boosting insights. Pipekit can help.

Pipekit is a self-serve data platform that configures Argo Workflows on your infrastructure to offer simplicity and efficiency when it comes to data workflows. Achieve higher scalability for your data pipelines while significantly reducing your cloud spend. Our platform is designed to align your data infrastructure seamlessly with your full-stack infrastructure, all on Kubernetes.

Try out Pipekit for free today - pipekit.io/signup
