MLflow vs. Argo Workflows
April 3, 2023
Argo Workflows lets you define tasks as Kubernetes Pods and run them as DAGs. By contrast, MLflow focuses on machine learning use cases and doesn’t use any DAGs.
The world is moving toward automation. Various tools and technologies now handle most of the tasks that used to be the responsibility of technical teams. Container orchestration, for example, is done using tools like Kubernetes, Apache Mesos, and Docker Swarm.
Data orchestration is another process that's becoming more common. It covers cleaning data, training machine learning models, deploying models to production, and maintaining them. Usually, when the tasks are small, the internal engineering team is able to manage them. But as the solutions scale, handling them manually becomes a burden for the team.
When developers design a pipeline, its tasks can't be run in isolation; instead, each task depends on others to produce the end result. As the solution grows, a pipeline may contain different branches describing different sets of complex and interrelated tasks. Managing these interrelated tasks is another part of data orchestration. Typically, this network of tasks is modeled as a directed acyclic graph (DAG). In a DAG, nodes represent the tasks, and the edges connecting the nodes represent the dependencies between tasks.
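To make the DAG idea concrete, here is a minimal sketch using Python's standard-library `graphlib` module (Python 3.9+). The task names and the `execution_order` helper are hypothetical and chosen only for illustration; they don't come from any particular orchestrator.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# "validate_data" and "build_features" form two branches after "clean_data".
pipeline = {
    "clean_data": set(),
    "validate_data": {"clean_data"},
    "build_features": {"clean_data"},
    "train_model": {"validate_data", "build_features"},
    "deploy_model": {"train_model"},
}


def execution_order(dag):
    """Return one valid order in which the tasks can run."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is why the graph has to be acyclic.
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Several orderings can be valid for the same graph (the two branches may run in either order), so the sketch only guarantees that every task appears after the tasks it depends on.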
Data Orchestration Tools
Some data orchestration technologies let you create DAGs by defining all of your jobs and their interdependencies. These tools then run the tasks in the correct order and retry any task that fails before moving on to the next. They also keep track of progress and alert your team if there are any setbacks.
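The run-in-order-and-retry behavior can be sketched in a few lines of plain Python. This is a toy illustration, not how any of the tools below are implemented; `run_pipeline`, `flaky_train`, and the task names are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying each task that fails.

    Returns a dict mapping each task name to the number of attempts it took.
    """
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move on to the next one
            except Exception as exc:
                if attempt > max_retries:
                    # Out of retries: surface the failure to whoever is watching.
                    raise RuntimeError(
                        f"{name} failed after {attempt} attempts"
                    ) from exc
    return attempts


# Hypothetical task that fails once before succeeding.
calls = {"count": 0}


def flaky_train():
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("transient failure")


tasks = {"clean": lambda: None, "train": flaky_train}
dependencies = {"clean": set(), "train": {"clean"}}

result = run_pipeline(tasks, dependencies)
print(result)  # {'clean': 1, 'train': 2}
```

A real orchestrator adds much more on top of this, such as backoff between retries, persistence of task state, and notifications, but the core loop is the same idea.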
There are several tools available for data orchestration. Some of the major ones include the following:
- Apache Airflow: This Python-native tool offers a broad feature set for data orchestration, but that breadth means it takes some time to learn.
- Argo Workflows: This is a generic orchestrator built for Kubernetes. It’s perfect for organizations that are already using Kubernetes, and is also a great fit for data teams who want to leverage the power of distributed computing for large workloads.
- MLflow: This platform is primarily used for machine learning model training, testing, and deployment. However, it has basic data orchestration capabilities to allow data scientists to package their pipelines so others can reproduce them.
- Kubeflow: This one is a great option if you want Kubernetes as your base and still want to work with the Python language.
- Luigi: Luigi is a Python-specific orchestrator that is incredibly simple to use.
In this article, you'll see a detailed comparison between MLflow and Argo Workflows. Both tools are commonly deployed for data orchestration, and which one fits depends on an organization's use cases and tech stack. For example, Argo Workflows lets you define tasks as Kubernetes Pods (collections of containers) and run them as DAGs. By contrast, MLflow focuses on machine learning use cases and doesn't use DAGs to define arbitrary tasks and the dependencies between them.
Data scientists generally work on a specific problem where they try different models and even different sets of parameters. Keeping a record of every experiment (i.e., taking notes on each one) is challenging and often not a priority. This can lead to issues like revisiting the same ideas multiple times, a lack of explainability, or not being able to reproduce data pipelines or analyses. What could be the solution for this? Documentation? Yes, that could work well.
However, while documenting a small-scale application by hand is manageable, the approach doesn't scale to large applications because writing docs is time consuming. Here's where MLflow comes into play. There are several reasons why MLflow is preferred for managing machine learning applications.
The MLflow platform doesn't use DAGs, which is a major difference from other task orchestration tools. Generally speaking, MLflow isn't aimed at task orchestration. Instead, it focuses on managing the ML lifecycle, from experimentation to model deployment.
Features of MLflow
- Tracking experiments: MLflow keeps track of the parameters, metrics, and models a data scientist uses throughout the lifetime of a project. These results are recorded as runs, which can be organized into experiments. Each run gets a unique ID and records the user who launched it, so a team can tell who did which experimentation.
- Packaging ML code: MLflow packages code in a reproducible way so that other data scientists can easily reuse it. It also provides a CLI and API that support basic workflows by chaining MLflow projects together. Because the code is already packaged, productionizing it becomes much easier. Moreover, MLflow is library-agnostic and works with most Python ML libraries, which makes it an easy choice for Python developers.
- Easy model deployment: Once the model is trained and data scientists are satisfied with its accuracy, the next step is to deploy the model for end users, which is a pretty complex task. You can deploy a machine learning model with any of the many frameworks and tools available, but managing those tools is a tough task in itself. MLflow simplifies this by packaging models in a standard format and serving them behind a REST API.
- Model management: MLflow provides a central model store (the Model Registry) to manage the full lifecycle of your models. It provides features like model versioning, annotations (descriptions and notes attached to a model or model version), and stage transitions.
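As a rough illustration of the experiment-tracking feature, the sketch below logs one run with MLflow's Python tracking API. It assumes the `mlflow` package is installed and a tracking location is configured or left at its default. The `accuracy` and `log_experiment` helpers, the experiment name, and the parameter values are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def log_experiment(params, predictions, labels, experiment="demo-experiment"):
    """Record one run's parameters and accuracy with MLflow tracking."""
    # Imported here so accuracy() stays usable without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)  # created if it doesn't exist yet
    with mlflow.start_run() as run:
        mlflow.log_params(params)  # e.g. hyperparameters
        mlflow.log_metric("accuracy", accuracy(predictions, labels))
        return run.info.run_id  # unique ID for this run


# Hypothetical predictions: three of four match the labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Calling something like `log_experiment({"max_depth": 3}, predictions, labels)` would then create a run that can be compared against other runs in the same experiment.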
Argo Workflows is an open-source task orchestration engine that you can use for orchestrating parallel tasks/jobs on Kubernetes. It runs tasks as DAGs that you define in YAML, the same format used for other Kubernetes resources. Argo Workflows is implemented as a Kubernetes custom resource definition (CRD), which adds custom API objects to the cluster. Since Argo runs on top of Kubernetes, users can interact with it using the command line interface (CLI), the user interface (UI), or kubectl. Let's now discuss the features of Argo Workflows.
Features of Argo Workflows
- Highly parallel: Argo is designed to run a large number of tasks concurrently. Data tasks are typically represented as DAGs, and tasks that don't depend on each other run in parallel in separate containers. Because of this, it can process as many tasks as your Kubernetes cluster can accommodate, and you can scale that cluster horizontally or vertically.
- Easy constraints and artifacts mechanism: Using Argo, you can easily apply constraints (rules) among multiple steps. You can also pass artifacts between steps so that the output from one step works as the input for the next. This brings reliability, efficiency, and simplicity, as developers spend less effort managing dependencies and the various inputs and outputs between tasks.
- Instant scheduling: Argo hands each task's instructions to Kubernetes for scheduling and reacts to events rather than waiting on a fixed polling interval, so it responds to new workflows and state changes right away.
- Continuous integration made easy: When you use an Argo Workflows pipeline for an application, you define the pipeline as code, just like any other Kubernetes infrastructure managed as infrastructure as code (IaC).
- Portability and version control: Since Argo is developed on Kubernetes, you can define and manage every workflow as a YAML file. This brings portability (a workflow that works on one Argo installation can run on another without any workflow-specific changes) and makes version control easy.
- Python SDKs: Argo Workflows supports defining workflows in Python via the Hera Python SDK and Couler SDK.
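To give a sense of what a DAG workflow definition contains, the sketch below builds an Argo `Workflow` manifest as a plain Python dictionary and prints it as JSON. It deliberately avoids the Hera and Couler SDKs so that it needs only the standard library. The task names, the `echo` template, the container image, and the `dag_task` and `build_workflow` helpers are hypothetical.

```python
import json


def dag_task(name, template, dependencies=()):
    """Build one DAG task entry; 'dependencies' lists the tasks it waits on."""
    task = {"name": name, "template": template}
    if dependencies:
        task["dependencies"] = list(dependencies)
    return task


def build_workflow():
    """Return a diamond-shaped DAG: prepare -> (train-a, train-b) -> compare."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "train-pipeline-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "dag": {
                        "tasks": [
                            dag_task("prepare", "echo"),
                            # No dependency on each other, so they can run in parallel.
                            dag_task("train-a", "echo", ["prepare"]),
                            dag_task("train-b", "echo", ["prepare"]),
                            dag_task("compare", "echo", ["train-a", "train-b"]),
                        ]
                    },
                },
                {
                    # Each task runs this container in its own Pod.
                    "name": "echo",
                    "container": {
                        "image": "alpine:3.19",
                        "command": ["echo", "step finished"],
                    },
                },
            ],
        },
    }


manifest = build_workflow()
print(json.dumps(manifest, indent=2))
```

Kubernetes tooling generally accepts JSON as well as YAML, so the printed manifest could in principle be saved to a file and kept under version control like any hand-written workflow. In practice, a Python SDK such as Hera provides a more convenient way to express the same structure.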
Now that you have seen both these important task orchestration tools, it's time to do a comparison of the two.
MLflow vs. Argo Workflows
MLflow is a library that you can easily import into existing Python code, where you define the parameters and artifacts to track for model experiments. Because it is more focused on productionizing ML models, it doesn't use DAGs to define or run tasks. Argo, on the other hand, defines tasks as Kubernetes Pods and runs them as DAGs.
Argo Workflows is the better choice for task parallelization and scaling the infrastructure because it is cloud native and builds on Kubernetes. In addition, Argo Workflows provides the Hera Python SDK for Python-native development. MLflow's main strength is its easy deployment to cloud platforms like Amazon SageMaker and Azure ML. MLflow also works from within Jupyter notebooks and has a command line interface for running training code for any kind of machine learning model.
As mentioned, Argo Workflows is based on Kubernetes and uses containers to define tasks. Because each task can declare the resources it actually needs, the cluster can adjust the type and number of nodes it provisions (for example, through a cluster autoscaler). This helps with both cost savings and reliability. Argo Workflows is the better option when you need to run tasks on Kubernetes.
Alternatively, MLflow is the better choice when you want to manage a machine learning project. If you need some workflow functionality similar to what Argo Workflows offers, MLflow covers the basics: you can use MLflow Projects to package data science code so that it becomes reproducible and reusable, and you can chain projects together into multistep workflows.
After reading this article, you now know what data orchestration tools are. You've seen two different tools: Argo Workflows, which manages tasks in the form of DAGs, and MLflow, which manages the machine learning lifecycle without them. You've also seen the features of both tools and a detailed comparison between them.
Before data orchestration solutions were available, developers had to manage data pipelines on their own. When applications are small, managing them manually is still an option. But as applications grow and many tasks come to depend on one another, managing them manually is hard. Data orchestration tools help developers manage these tasks and define the conditions and artifacts needed to run them in order.
To summarize, the right task orchestration technology depends on your needs. When your team is working on Kubernetes, Argo Workflows is a popular choice for managing tasks. For managing machine learning applications, MLflow is the better option of the two.
Special thanks to Eric Goebelbecker and Caelan Urquhart for help reviewing this post.