(updated April 9, 2020) We recently open-sourced multicluster-scheduler, a system of Kubernetes controllers that intelligently schedules workloads across clusters. In this blog post, we will use it with Argo to run multicluster workflows (pipelines, DAGs, ETLs) that better utilize resources and/or combine data from different regions or clouds.
UPDATE (2020-04-09) - Multicluster-scheduler has changed quite a bit since this blog post was originally published. Multicluster-scheduler is now a virtual-kubelet provider; observations and decisions were replaced by more direct control loops. The integration with Argo Workflows discussed here still works. Multicluster-scheduler also works with Argo CD, as shown in this ITNEXT blog post published today by Gokul Chandra.
Most enterprises that use Kubernetes manage multiple clusters. For various reasons, you may have one or several clusters per team, region, environment, or combination thereof. Your clusters may be hosted by different cloud providers—in a multicloud infrastructure—and/or on premises—in a hybrid infrastructure. The benefits of isolation, however, come at the expense of, among other things, reduced bin-packing efficiency and data fragmentation. Let's explore two scenarios.
- Scenario A: You need to run a large parallel workflow, e.g., a machine learning training pipeline, which requires more resources than available in your team's cluster. You could scale out to go fast, or limit parallelism to save money. In the meantime, available resources in other teams' clusters stay idle. Multicluster-scheduler allows you to elect pods to be delegated to other clusters, where resources are available, from the comfort of your own cluster, with a single pod or pod template annotation.
- Scenario B: You need to run a workflow that combines data from multiple clouds or regions. It is either more efficient or required to run some steps closer to their data sources. To optimize throughput or save on data egress charges, you may want to compress or aggregate the data, before loading the results closer to you. Or to respect privacy regulations and minimize your attack surface, you may want to anonymize the data as upstream as possible. You could deploy remote services or functions and call them from your workflow, but that would be complicated. Multicluster-scheduler allows you to simply specify which cluster a pod should run in, again, with a single pod or pod template annotation.
A Multicluster Pod's Journey
Here's a quick summary of a multicluster pod's journey.
- When a pod is created with the
multicluster.admiralty.io/elect=""annotation, the multicluster-scheduler agent's mutating pod admission webhook replaces the pod's containers by a dummy busybox that just waits to be killed. The original spec is saved for later as another annotation.
- We call the resulting pod a proxy pod. The agent then sends an observation of the proxy pod to the scheduler's cluster, which can be the same cluster. The agent also watches other pods, nodes, and node pools and sends observations of them to the scheduler's cluster to guide its decisions.
- The scheduler creates a delegate pod decision in its own cluster. If the original pod was annotated with
multicluster.admiralty.io/clustername=foo, the delegate pod decision is targeted at cluster "foo". Otherwise, the scheduler targets the cluster that could accommodate the most replicas of our pod, based on current observations. More advanced scheduling options are in the works.
- The agent in the target cluster sees the decision and creates a delegate pod. The delegate pod has the same spec as the original pod.
- An observation of the delegate pod is sent back to the scheduler's cluster. When the delegate pod is annotated, the same annotation is fed back to the proxy pod (e.g., so Argo can read step outputs), and when it succeeds or dies, a signal is sent to the proxy pod's container for it to succeed or die too.
For more details, check out the README.
Let's see that in action. Create two clusters, e.g., with Minikube or your favorite cloud provider. In this blog post, we'll assume their associated contexts in your kubeconfig are "cluster1" and "cluster2", but we'll use variables so you can use your own:
Now, following the installation guide, install the scheduler in cluster1 and the agent in both clusters.
We also need Argo in either cluster. We'll use cluster1 in this guide, but feel free to change the variables:
Install the Argo controller and UI (step 2 of the Argo getting gtarted guide):
We're not big fans of giving Argo pods admin privileges, as recommended in step 3 of the Argo getting started guide, so we'll use a minimal service account instead. Because pods will run in the two clusters, we need this service account in both:
Let's also install the Argo CLI locally—although it's optional—to nicely submit and track workflows:
(UPDATE, 2019-03-17, multicluster-scheduler v0.3) Finally, let's enable multicluster-scheduler in the
You can now turn any Argo workflow into a multicluster workflow by adding
multicluster.admiralty.io annotations to its pod templates. Also, don't forget to specify resource requests if you want the scheduler to decide where to run your pods.
Scenario A: Optimizing a Large Parallel Workflow
A default GKE cluster has three nodes, with 1 vCPU and 3.75GB of memory each, out of which 940m vCPU and 2.58GiB of memory are allocatable. The system pods, along with multicluster-scheduler and Argo already request 1840m vCPU in cluster1 and 1740m vCPU in cluster2. Therefore, cluster1 has 980m vCPU available and cluster2 has 1080m. We don't need to spend extra money for this experiment: we will model a "large parallel workflow" by 10 parallel steps requiring 200m vCPU each (including 100m for the Argo sidecar).
First, let's run the following single-cluster workflow (also available in the multicluster-scheduler samples directory):
Here's the final state:
It took the workflow 1 minute 16 seconds to run on cluster1 alone. We can see that cluster1 could only run three steps concurrently, in four waves, which is less than ideal, but expected, because the 940m vCPU available are in three "bins".
Let's annotate our workflow's pod template with
multicluster.admiralty.io/elect="" to make it run on two clusters:
Here's the final state:
It took the workflow only 31 seconds to run across cluster1 and cluster2. Seven steps were able to run concurrently at first, followed by the three remaining steps. Notice that some of the steps were run in cluster1 and the others in cluster2:
The five pods whose names are prefixed with "cluster1-default-" are delegate pods. The prefix indicates their origin. The other pods are the proxy pods.
In cluster2, there are only delegate pods:
Scenario B: Multicluster ETL
We will model this scenario with a simple DAG workflow, where steps A and C can run in cluster1, but step B must run in cluster2; step C depends on steps A and B. Note the use of the
multicluster.admiralty.io/clustername pod template annotation to enforce a placement:
NON_ARGO_CLUSTER is not equal to "cluster2" in your case, modify the workflow before submitting it.
Here's the final state:
Note that step B was delegated to cluster2:
In a real case scenario, you would pipe data between steps using artifact repositories and/or step input/outputs.
Discussion: Pod-Level Federation
As we've demonstrated, multicluster-scheduler integrates nicely with Argo. We didn't have to modify the Argo source code, and simple annotations to the workflows' manifests were enough to make them run across clusters. This would not have been possible with a project like Federation v2, which requires clients to use new, federated APIs, e.g., federated deployment templates, placements, and overrides. The main advantage of multicluster-scheduler is that it federates clusters at the pod level, "the smallest and simplest unit in the Kubernetes object model." The entire Kubernetes ecosystem revolves around pods. By choosing pods as multicluster-scheduler's unit, we're enabling a series of "for free", loosely coupled integrations. Multicluster Horizontal Pod Autoscaler with custom and external metrics should work out-of-the-box (we'll prove that soon), while integrations with Istio and Knative are in the works.
Multicluster-scheduler can run Argo workflows across Kubernetes clusters, delegating pods to where resources are available, or as specified by the user. It can make parallel workflows run faster without scaling out clusters, and it simplifies multi-region and multicloud ETL processes. This integration was made possible by multicluster-scheduler's architecture centered around pods, "the smallest and simplest unit in the Kubernetes object model." The way forward is exciting and includes, among other things, more integrations with the cloud native ecosystem, and advanced scheduling. We're curious to hear the thoughts and feedback of the community, and we welcome contributions!