The Multicluster Toolbox

(updated April 3, 2020) Founder and CEO Adrien Trouillaud spoke at KubeCon + CloudNativeCon Europe 2019. Watch the recording.


The transcript below is mostly for search engines. You should really watch the video if you can.

Thank you, everyone, for being here.

There's been a switch in the room. If you're looking for the security session, it's in G2. Here in G1, it's The Multicluster Toolbox.

So, multicluster. Some of you may wonder why anyone would need multiple clusters when dealing with one is hard enough.

But, if you're in this room you probably have an idea and this question has been answered multiple times.

Today I won't be focusing so much on the why, but rather on the how: how we solve multicluster problems.

Just picture multiple teams working with multiple clouds, multiple data centers, and multiple regions, problems are bound to happen.

The tools that I will introduce in this talk are lessons learned from months of research at Admiralty, a company I founded to research this problem space and offer solutions. I won't talk about those products in this talk.

Why not just use kubefed? Kubefed, for those of you who don't know, is the new name of the Federation v2 project, they renamed the repository last week. In some situations, it's a very valid response. It's a yes, if your use case matches.

What does Federation v2 solve? It solves a problem where you have a central cluster and you want to do top-down resource declaration.

You want to define, declaratively, your resources across your Federation from 1 cluster and you can do that by decomposing your problem into templates.

That is you want to deploy, for example, a deployment here, into multiple clusters, with this basic template. Then, you define a placement, which is a list of cluster names or cluster selectors because clusters are custom resources in Federation. And overrides, if you want some variations.

In this example, there are just more replicas in cluster 2 than in cluster 1.
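
The template, placement, and overrides decomposition described above can be sketched in a few lines of Python. This is only an illustration with plain dictionaries, not kubefed's API or implementation; the function name `render_for_clusters`, the dotted-path override format, and the cluster names are assumptions made up for this sketch.

```python
import copy


def render_for_clusters(template, placement, overrides):
    """Return {cluster_name: object} for every cluster listed in the placement.

    template:  the base object (a dict) shared by all clusters
    placement: list of cluster names (a simplified, hypothetical placement)
    overrides: {cluster_name: {"dotted.path": value}} per-cluster variations
    """
    rendered = {}
    for cluster in placement:
        obj = copy.deepcopy(template)  # never mutate the shared template
        for path, value in overrides.get(cluster, {}).items():
            node = obj
            *parents, leaf = path.split(".")
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        rendered[cluster] = obj
    return rendered


template = {"kind": "Deployment", "metadata": {"name": "nginx"}, "spec": {"replicas": 1}}
placement = ["cluster1", "cluster2"]
overrides = {"cluster2": {"spec.replicas": 5}}  # more replicas in cluster 2

objects = render_for_clusters(template, placement, overrides)
```

Each cluster gets its own copy of the template, and only the clusters named in the overrides diverge from it.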

Not all multicluster problems can be solved with Federation, there are plenty of application examples of multicluster.

We don't have time to go through this list and discuss it, but feel free to give it a look, read them, and I'm sure some of these will resonate with some of you.

There are no off-the-shelf solutions to solve these problems, the tools are missing. But, what kinds of tools? What are we trying to build? Just like, to build a shelf, you need some tools and you know you want to build a shelf. But, do you know what you want to build when you think about multicluster?

What I think we need is, just like controllers are solutions to many problems in Kubernetes, that’s how Kubernetes works, we need multicluster controllers.

What is missing to create multicluster controllers is a way to watch and reconcile resources in different clusters.

Say you have a high-level API in one cluster and some either Kubernetes or custom APIs in another and you want to reconcile them.

If I delete an object in one cluster, I want the dependent resources to be deleted in the other cluster and we don't have garbage collection across clusters.

Finally, if you've tried a few multicluster solutions out there, it's usually pretty difficult to set up.

You need to generate kubeconfigs and store them in secrets, load them, or there is a custom CLI that you have to use and run an imperative command.

It would be nice if there was a declarative bootstrapping for those things.

[4:37] The generalized solutions to these problems are packaged in some of Admiralty's open-source projects, like multicluster-controller and multicluster-service-account.

Multicluster-controller is, in a way, sort of a multicluster version of the controller-runtime project, which is an official Kubernetes project, that drives kubebuilder or the operator SDK.

Multicluster-service-account is a system of controllers that imports and auto-mounts service accounts from other clusters into pods of a local cluster.

But we won't see a lot of multicluster-controller in this talk. I thought it would be more instructive if we built a multicluster controller from scratch.

We are going to start with the sample-controller project.

Some of you are familiar with it, it's an example project that you can use, you can fork it, vendor it, and create your controller from it.

What does it do? It's an example: there's a high-level API in this project called foo. When you create a foo object, an NGINX server deployment is created with some parameters.

Some people like to have higher-level APIs for their users, so that they don't have to figure out all the details of how to deploy this thing.

[06:14] So, let's try it.

There will be some live coding, live demonstration, and I don't want to waste your time so I'm going to do a lot of copy-pasting. I'm also a shameless slow typer.

Before you can create foo objects, you have to declare a custom resource; for that, there is a custom resource definition.

Is this big enough for people in the back? So so, yes? I'll try it out like this.

This is a very simple custom resource definition that has nothing of the advanced features that you should use, like validation and other things. We don't need that for the purpose of this talk.

We're going to create it in cluster 1. For your information, I just tested them before this demo. I've created 3 clusters in GKE that I'm going to use for the purpose of this talk. They are associated with cluster 0, cluster 1, and cluster 2 contexts in my kubeconfig.

I also extracted individual kubeconfig files in a specific folder as a convenience, you'll see why.

I'm in cluster 1, I create the CRD, and then I check out the code. I made just a few tweaks; it's really very similar to sample-controller on master in origin. I just added a plug-in for GCP login, things like that.

This controller is running, in another terminal I can create a foo object, which is this one.

It's just an example, the NGINX deployment will be called example-foo and we'll have one replica.
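
As a rough sketch of what "a foo object produces an NGINX deployment" means, here is a small Python function that builds the desired deployment from a foo object. It is an assumption-laden illustration, not the sample-controller code: the field names `deploymentName` and `replicas`, the labels, the image tag, and the UID value are all chosen for this sketch.

```python
def desired_deployment(foo):
    """Build the Deployment that a Foo asks for (simplified; field names are assumptions)."""
    labels = {"app": "nginx", "controller": foo["metadata"]["name"]}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": foo["spec"]["deploymentName"],
            "namespace": foo["metadata"]["namespace"],
            # The owner reference ties the Deployment back to the Foo that owns it.
            "ownerReferences": [{
                "kind": "Foo",
                "name": foo["metadata"]["name"],
                "uid": foo["metadata"]["uid"],
            }],
        },
        "spec": {
            "replicas": foo["spec"]["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "nginx", "image": "nginx:latest"}]},
            },
        },
    }


example_foo = {
    "metadata": {"name": "example-foo", "namespace": "default", "uid": "hypothetical-uid"},
    "spec": {"deploymentName": "example-foo", "replicas": 1},
}
deployment = desired_deployment(example_foo)
```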

Let's see if the deployment was created, yes.

[08:33] What does it look like? It's an NGINX deployment and we can update it with 2 replicas.

You can see that the deployment has been updated with 2 replicas. Then there are the usual control loop things: when you delete the deployment, you want it to be recreated because the foo object still exists. That means, as a user, you want there to always be a deployment with that name.

If something deletes that deployment, it should be recreated.

It's nice also to, in a control loop, update the status of the owning resource and so here in the foo object we see that there are indeed two available replicas.

It's also nice for the user to broadcast events, and so in the sample-controller project events are broadcast when there are errors or when the sync loop was successful.

Finally, if you delete the owning object, which is the foo object, you want the deployment to disappear, to be deleted. That is, using garbage collection.

Alright, so we want to transform this sample-controller project and make it multicluster.

I'm going to start by changing a few things and watching foo in one cluster and deployments in another cluster at the same time, I'm kind of combining a few steps here.

I also, in this example, want all those NGINX deployments to be deployed in the same namespace in that different cluster, I don't want the namespaces to match.

It makes sense in a single cluster to want the owner-dependent relationship to only exist within a single namespace, but across clusters, not necessarily.

So that I can deploy that foo object, which could be you know like, "oh, I need a database and I'm in AKS, but my software needs DynamoDB".

I declare foo and in the other cluster (cluster 0) we deploy all those common resources, central resources.

At the end of this talk, there will be a second step where we can define and declare those foo resources from any cluster.

They're all connected to cluster 0 and cluster 0 runs all those dependencies in the same namespace.

But if we want to do that, we have to understand how a controller works.

This is a diagram from the sample-controller project documentation, it was originally published in Cloud Ark's blog. It was nice from Devdatta to contribute to the project, because it's a great way to explain controllers.

How does a controller work in just a few minutes?

Usually, you create a set of informers, indexers, and reflectors from a factory. There's one set created per resource. They can be shared if you have multiple controllers running in the same process; that's a way to optimize API calls.

The reflector just lists and watches objects from the Kubernetes API; list and watch are Kubernetes verbs.

The informer listens to those changes and caches the objects using the indexer, so that all the gets and lists that you need in your code are optimized. They don't call the API every time; they use the cache.

Your controller can subscribe to changes, and that's what the controller does. The controller sees some change (a deployment was created, a foo object was updated or deleted, anything) and queues a key in a work queue. That key is what we call the reconcile key.
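
To make the informer, cache, and work queue roles concrete, here is a toy in-memory model in Python. It is a sketch under simplifying assumptions, not client-go: the class names `Informer` and `WorkQueue`, the event type strings, and the flat object shape are invented for illustration, and there is no real API server behind it.

```python
from collections import deque


class WorkQueue:
    """FIFO queue that holds each key at most once while it is waiting."""

    def __init__(self):
        self._items = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:  # several changes to one object collapse into one key
            self._pending.add(key)
            self._items.append(key)

    def get(self):
        key = self._items.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._items)


class Informer:
    """Caches objects by namespace/name and notifies subscribed handlers of changes."""

    def __init__(self):
        self.cache = {}
        self._handlers = []

    def add_handler(self, handler):
        self._handlers.append(handler)

    def on_event(self, event_type, obj):
        key = f'{obj["namespace"]}/{obj["name"]}'
        if event_type == "DELETED":
            self.cache.pop(key, None)
        else:
            self.cache[key] = obj
        for handler in self._handlers:
            handler(event_type, obj)

    def get(self, key):
        # Reads are served from the cache, not from the API.
        return self.cache.get(key)


queue = WorkQueue()
foo_informer = Informer()
foo_informer.add_handler(lambda _event, obj: queue.add(f'{obj["namespace"]}/{obj["name"]}'))

foo_informer.on_event("ADDED", {"namespace": "default", "name": "example-foo", "replicas": 1})
foo_informer.on_event("MODIFIED", {"namespace": "default", "name": "example-foo", "replicas": 2})
```

Two changes to the same object leave a single key in the queue, and the handler that later processes that key reads the latest state from the cache.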

It's important to have the same key for the objects that you need to reconcile.

Usually, in a single cluster what we do is we define the key as namespace plus name.

That is a unique way to identify an object in Kubernetes.

Say we want to reconcile a foo object with a dependent deployment object. With foo we just say, "okay, let's take your namespace and your name, that's your key." With the deployment, we need to know what the owner's name and namespace are, and for that we can rely on owner references.

An owner reference is the identification of the owner, here a foo object called example-foo, along with its unique identifier.

Note that there is no namespace here, because in Kubernetes we assume that owner-dependent relationships only exist within namespaces.
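
The single-cluster reconcile key can be sketched as two small Python functions, one per watched resource, that both resolve to the owning foo's namespace/name. The function names and dictionary shapes are assumptions for this sketch; the owner reference deliberately carries no namespace, so the dependent's own namespace is used.

```python
def key_for_foo(foo):
    """A Foo's reconcile key is simply its own namespace/name."""
    return f'{foo["metadata"]["namespace"]}/{foo["metadata"]["name"]}'


def key_for_deployment(deployment):
    """Return the key of the owning Foo, or None if no Foo owns this Deployment."""
    meta = deployment["metadata"]
    for ref in meta.get("ownerReferences", []):
        if ref.get("kind") == "Foo":
            # No namespace in the owner reference: the owner is assumed to
            # live in the same namespace as the dependent.
            return f'{meta["namespace"]}/{ref["name"]}'
    return None


foo = {"metadata": {"namespace": "default", "name": "example-foo"}}
owned = {"metadata": {
    "namespace": "default",
    "name": "example-foo",
    "ownerReferences": [{"kind": "Foo", "name": "example-foo", "uid": "hypothetical-uid"}],
}}
unowned = {"metadata": {"namespace": "default", "name": "something-else"}}
```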

Finally, we process the items. Here’s where you put your business logic, it often looks the same.

It's the usual control loop where you have a key for that namespace-name, you get the owning object in this case.

That's a very common pattern for higher-level APIs.

[14:57] You get the foo object. It's possible that it doesn't exist anymore because in the meantime it was deleted. Between the time the change was noticed and the time the controller processed the key in the work queue, it's possible it disappeared. Well, garbage collection will take care of the deployment, we're good.

If we find the foo object, we need to make sure that it has a deployment. If we don't find the deployment, that means it needs to be created.

It could mean that the deployment has been deleted, it could mean that it's never been created yet, we don't really care because this is a level-based algorithm. We care about the current state of the world and what we want it to be. If we find it, it has to be up to date, if not we update it.

In any case, it's a good practice to update the status of the owning object and broadcast some events.
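
The level-based loop described above can be sketched as a single function over in-memory dictionaries. This is a minimal illustration, not the sample-controller's sync handler: `sync`, the flat object shapes, and the event strings are assumptions, and the status is copied from the deployment spec for brevity, whereas a real controller would read the deployment's observed status.

```python
def sync(key, foos, deployments, events):
    """One pass of a level-based control loop for a single reconcile key.

    foos and deployments are dicts keyed by "namespace/name" that stand in for
    caches and API calls; events is a list collecting broadcast messages.
    """
    foo = foos.get(key)
    if foo is None:
        # The Foo was deleted in the meantime; garbage collection handles the Deployment.
        return
    namespace = key.split("/", 1)[0]
    dep_key = f'{namespace}/{foo["deploymentName"]}'
    deployment = deployments.get(dep_key)
    if deployment is None:
        # Never created or deleted since: either way, create it.
        deployment = {"replicas": foo["replicas"], "owner": key}
        deployments[dep_key] = deployment
    elif deployment["replicas"] != foo["replicas"]:
        deployment["replicas"] = foo["replicas"]  # bring it up to date
    # Simplification: report the desired replica count as the available count.
    foo["status"] = {"availableReplicas": deployment["replicas"]}
    events.append(f"Synced {key}")


foos = {"default/example-foo": {"deploymentName": "example-foo", "replicas": 1}}
deployments, events = {}, []
sync("default/example-foo", foos, deployments, events)
```

Calling `sync` again after deleting the deployment from `deployments` recreates it, because the function only compares the current state with the desired state.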

Let's try to change sample-controller to do this, like you want deployments created in a different cluster in the foo namespace all the time.

[16:18] I'm going to check out a code change.

I will push this branch to a repository after the talk, so that you all can redo the same things on your own if you want.

Oh, I already built it sorry.

I built the manager to the current state, let's see what I did here in the code. I won't go into too many details, there aren't too many changes actually.

In this first change, I was a little naive, I may have broken things we'll see.

This manager just has a kubeconfig flag to be run out-of-cluster, so I added a second one, which I use to create a different Kubernetes configuration in code. I use that configuration to create a client and an informer factory.

In the controller, I just changed every place where we create or update a deployment. We don't delete, because we rely on garbage collection. Anytime we need to do things with deployments, instead of using the foo object's namespace, I use a constant, the foo namespace.

That's it, that's all I did. Let's see if it works.

[18:43] I said I wouldn't waste your time. What just happened? Alright, so let's see if the next step works. It's a problem with my kubeconfig. Alright, never mind. I'm going to say my messages first and we'll try to go back to it a little later.

Even if this had run, it wouldn't have worked properly. Here I'm just not able to run it, but I should be able to run it.

So, in the example that hopefully, you can do later, or we can do later, you create a foo object and there is a deployment in the foo namespace in the other cluster so you're happy.

A few things don't work. For example, events are not broadcast; when you delete the deployment in the foo namespace, it is not recreated; if you delete the foo object, the deployment is not deleted, so garbage collection doesn't work; and the way to figure out what the owner of a deployment is doesn't work either.

To fix that, we need to enhance the owner reference. The owner reference only provides information about the name of an object, but in our multicluster case it should also say what namespace the owner is in, and what cluster. In this case, we still have just one cluster that contains the foo objects; the next step will need another enhancement. And garbage collection still doesn't work.
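
One way to picture an enhanced owner reference is a small record that adds cluster and namespace to the usual name and UID, stored on the dependent as an annotation. This is a sketch of the idea only, not the multicluster-controller implementation: the class name, the annotation key `example.com/multicluster-owner`, and the JSON encoding are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass

OWNER_ANNOTATION = "example.com/multicluster-owner"  # hypothetical annotation key


@dataclass(frozen=True)
class MulticlusterOwnerReference:
    cluster: str
    namespace: str
    kind: str
    name: str
    uid: str

    def key(self):
        """Reconcile key that stays unique across clusters and namespaces."""
        return f"{self.cluster}/{self.namespace}/{self.name}"


def set_owner(obj, ref):
    """Record the owner on the dependent object as a JSON annotation."""
    annotations = obj.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[OWNER_ANNOTATION] = json.dumps(asdict(ref))


def get_owner(obj):
    """Read the owner back, or return None if the object has no such annotation."""
    raw = obj.get("metadata", {}).get("annotations", {}).get(OWNER_ANNOTATION)
    return MulticlusterOwnerReference(**json.loads(raw)) if raw else None


ref = MulticlusterOwnerReference("cluster1", "default", "Foo", "example-foo", "hypothetical-uid")
deployment = {"metadata": {"namespace": "foo", "name": "example-foo"}}
set_owner(deployment, ref)
```

With this, a change to the deployment in the `foo` namespace of one cluster can be mapped back to the key of its owner in another cluster and namespace.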

So, why does garbage collection not work? Let's step back a little bit. There are three modes of deletion in Kubernetes. You can orphan the dependents, so you just leave them be: say, if I delete the owning object foo, I want to keep the deployment in place but just remove the owner reference, orphan it. Or you want cascading deletion, which is: if I delete the foo object, my intention is to also delete the deployment.

This can be done either in the background or in the foreground, and that is done by the garbage collection controller. The garbage collection controller is a generic Kubernetes controller that maintains a graph of owner-dependent relationships (it listens to basically all resources). When, say, a foo object is deleted, it checks the dependency graph and sees that, oh, that deployment needs to be deleted, I'll delete it. That can be done in the background: when you delete foo it disappears right away, and the deployment is deleted a few seconds later or immediately, depending on how fast your cluster runs.

Or in the foreground. So, how does that work? It uses finalizers. When you want to delete in the foreground that means you don't want the owning object, the foo object in our case, to disappear until all its dependencies, here your deployment, have been deleted as well.

The garbage collection controller does that by adding a finalizer to an owning resource when it's deleted. This example is a deployment, if you deleted a deployment you would want the replica set to be deleted as well and cascading to the pods.

When you delete the deployment, a deletion timestamp is added but also a foreground deletion finalizer and when the garbage collection controller is done deleting all the dependents it removes the finalizer and finally deletes the owning object.
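
Foreground deletion can be mimicked with a toy garbage collector in Python. This is a heavily simplified sketch, assuming a flat dictionary of objects with a single `owner` field each; the function names are invented, and details such as multi-level cascading and owner-blocking rules are left out. The finalizer string matches the name Kubernetes uses for foreground deletion.

```python
FOREGROUND = "foregroundDeletion"


def request_foreground_delete(objects, name, now):
    """Mark the owner for deletion instead of removing it right away."""
    obj = objects[name]
    obj["deletionTimestamp"] = now
    obj.setdefault("finalizers", []).append(FOREGROUND)


def gc_pass(objects):
    """One pass of a toy garbage collector over objects {name: object}."""
    for name, obj in list(objects.items()):
        if name not in objects:
            continue  # already removed earlier in this pass
        if FOREGROUND not in obj.get("finalizers", []):
            continue
        dependents = [n for n, o in objects.items() if o.get("owner") == name]
        if dependents:
            for dependent in dependents:
                del objects[dependent]  # dependents go first
        else:
            obj["finalizers"].remove(FOREGROUND)
            if not obj["finalizers"]:
                del objects[name]  # the owner disappears only once no finalizer is left


objects = {"deployment": {}, "replicaset": {"owner": "deployment"}}
request_foreground_delete(objects, "deployment", now="2019-05-21T00:00:00Z")
gc_pass(objects)  # first pass removes the dependent; the owner is still there
```

A second `gc_pass(objects)` finds no dependents left, removes the finalizer, and the owner goes away.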

The problem, in our case, is that we have 2 clusters, each has its own garbage collection controller, they don't agree.

They each have their own representation of owner-dependent relationships, and only within their own cluster. They don't communicate, so they're unable to clean up across clusters. They don't know what a multicluster owner reference is.

We could, I'm actually working on one, create a multicluster garbage collection controller that is generic, and we could have a multicluster owner reference that is something that exists in the multicluster-controller.

But, the solution in this example is what I call poor-man's in-controller garbage collection using finalizers.

Basically, in our control loop we're going to add a finalizer right when the object is created and before any of its dependents are created, to avoid any race condition.

If a foo exists, it cannot have dependents unless it has a finalizer. That way, when we delete the foo object, we can be sure that our control loop (automatically, but with custom code) deletes the dependents before we remove the finalizer.

This is in our sync handler in the controller. We implement a decision tree, based on observations. So, is the foo object terminating? Does it have a deletion timestamp? If it doesn't, that means it's still alive, and first things first, make sure it has a finalizer.

Then we're back to our controller from the very beginning. Is the deployment found? No, create it. Is it up to date? No, update it, things like that.

For clarity, I didn't loop back with update status, broadcast events, but you still need that.

If it's terminating, that's typically the end of the lifecycle. Remember that the lifecycle is not the same thing as your logic in your sync handler, because we're edge-based, sorry, we're level-based, not edge-based.

If the foo object is terminating and the deployment is found, we need to delete it. If it's not found, well, that means we're ready to remove the finalizer. We do that, and then we can just rely on our single-cluster garbage collection controller to remove the foo object.
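
The decision tree can be sketched as one Python function that takes a single step per call, the way a sync handler does. It is an illustration over in-memory dictionaries, not the code from the demo: the finalizer name, the constant `TARGET_NAMESPACE`, the object shapes, and the returned action strings are assumptions made for this sketch.

```python
FINALIZER = "example.com/multicluster-gc"  # hypothetical finalizer name
TARGET_NAMESPACE = "foo"                   # constant namespace in the other cluster


def sync_with_finalizer(key, foos, remote_deployments):
    """Take one step of the decision tree and return a label for the action taken.

    foos lives in the source cluster and remote_deployments in the target
    cluster; both are dicts standing in for API calls.
    """
    foo = foos.get(key)
    if foo is None:
        return "nothing"
    dep_key = f'{TARGET_NAMESPACE}/{foo["deploymentName"]}'
    deployment = remote_deployments.get(dep_key)
    finalizers = foo.setdefault("finalizers", [])

    if foo.get("deletionTimestamp") is None:  # still alive
        if FINALIZER not in finalizers:
            finalizers.append(FINALIZER)  # before any dependent is created
            return "added finalizer"
        if deployment is None:
            remote_deployments[dep_key] = {"replicas": foo["replicas"]}
            return "created deployment"
        if deployment["replicas"] != foo["replicas"]:
            deployment["replicas"] = foo["replicas"]
            return "updated deployment"
        return "up to date"

    # Terminating: clean up the dependent first, then release the owner.
    if deployment is not None:
        del remote_deployments[dep_key]
        return "deleted deployment"
    if FINALIZER in finalizers:
        finalizers.remove(FINALIZER)
        return "removed finalizer"
    return "nothing"


foos = {"default/example-foo": {"deploymentName": "example-foo", "replicas": 1}}
remote_deployments = {}
first_action = sync_with_finalizer("default/example-foo", foos, remote_deployments)
```

Because the function only looks at the current state, it can be called repeatedly for the same key until it reports nothing left to do.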

In this second example, I added a cluster to the mix, like I said earlier.

This example should work better because I'm running it in-cluster.

The first example was running out-of-cluster from my laptop using kubeconfig files, one of them has been tampered with.

In this example I will deploy a pod in cluster 0; this is where we want to run our controller. We want to run it, in this example, in the central cluster 0, and we want to watch all the clusters 1, 2, however many you want. When they create a foo object, we make sure that we create the deployments that they need.

First, how does a pod call the Kubernetes API of its own cluster? Well, it has a service account. When you create a service account, the service account controller creates a secret in the same namespace with a token to call the API. The token is auto-mounted in the pod when it is admitted, and Kubernetes clients, like the client-go library, know how to load the token file that has been mounted as a volume, so all is good.

Now you can extract the token from that secret and create an equivalent kubeconfig file.

That is what people usually do when they do multicluster things.

You take that token, you put it here in the user, then you take the address of that cluster's Kubernetes API and add some names to know where to find things, then store that in a secret.
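
The manual procedure just described can be sketched as a function that turns a service account token secret into a kubeconfig structure. This is a minimal illustration, assuming the secret carries base64-encoded `token` and `ca.crt` entries; the function name, the sample secret contents, and the server address are made up for the sketch.

```python
import base64


def kubeconfig_from_secret(secret, server, name):
    """Build a kubeconfig dict from a service account token secret.

    secret: dict shaped like a Secret, with base64-encoded data["token"] and data["ca.crt"]
    server: URL of that cluster's Kubernetes API
    name:   label reused for the cluster, user, and context entries
    """
    token = base64.b64decode(secret["data"]["token"]).decode()
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{"name": name, "cluster": {
            "server": server,
            # The CA bundle stays base64-encoded, as it already is in the secret.
            "certificate-authority-data": secret["data"]["ca.crt"],
        }}],
        "users": [{"name": name, "user": {"token": token}}],
        "contexts": [{"name": name, "context": {"cluster": name, "user": name}}],
        "current-context": name,
    }


# Fake secret contents, for illustration only.
secret = {"data": {
    "token": base64.b64encode(b"not-a-real-token").decode(),
    "ca.crt": base64.b64encode(b"not-a-real-certificate").decode(),
}}
config = kubeconfig_from_secret(secret, "https://203.0.113.10", "cluster1")
```

The resulting structure would then be serialized and stored in a secret in the cluster that runs the controller, which is the step the bash scripts usually automate.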

It's a bit of a pain, but it works.

Here is an example of an installation procedure for Istio multicluster with VPN connectivity.

Run all these bash scripts and extract that, extract this, create that, it works.

It’s a pain point I believe.

What I suggest is a declarative way to do this, to import a service account from another cluster. Multicluster-service-account comes with a CRD called ServiceAccountImport. A ServiceAccountImport is simply a cluster name, namespace, and name of a service account.

The multicluster-service-account controller takes care of finding that secret in the other cluster and importing it. Then, if your namespace enables it, pods that are annotated with those ServiceAccountImport names will, at admission, see those tokens as kubeconfig files mounted in a known folder.
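
The declarative flow can be pictured with a short Python sketch: an import names a cluster, a namespace, and a service account, and an annotated pod ends up with one kubeconfig path per import. Everything concrete here is a placeholder rather than the project's actual API: the annotation key, the mount root, the file name, and the field names are assumptions chosen for illustration.

```python
IMPORT_ANNOTATION = "example.com/service-account-imports"      # placeholder annotation key
MOUNT_ROOT = "/var/run/secrets/example/serviceaccountimports"  # placeholder folder


def mounted_kubeconfigs(pod, imports):
    """Return {import_name: kubeconfig_path} for the imports a pod is annotated with.

    imports: {import_name: {"clusterName": ..., "namespace": ..., "name": ...}}
    Unknown import names in the annotation are ignored.
    """
    annotation = pod.get("metadata", {}).get("annotations", {}).get(IMPORT_ANNOTATION, "")
    paths = {}
    for name in filter(None, (part.strip() for part in annotation.split(","))):
        if name in imports:
            paths[name] = f"{MOUNT_ROOT}/{name}/config"
    return paths


imports = {
    "cluster1-foo-controller": {"clusterName": "cluster1", "namespace": "foo", "name": "controller"},
    "cluster2-foo-controller": {"clusterName": "cluster2", "namespace": "foo", "name": "controller"},
}
pod = {"metadata": {"annotations": {
    IMPORT_ANNOTATION: "cluster1-foo-controller, cluster2-foo-controller",
}}}
paths = mounted_kubeconfigs(pod, imports)
```

The controller process then only needs the list of paths, which it can load like any other kubeconfig file.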

Then your Kubernetes client library, like client-go, or I think the Python library does that, although maybe I shouldn't say that because I haven't used it. Most Kubernetes client libraries know how to use kubeconfig files, because that's what you use in development when you're just dealing with the clusters that you created.

You can just use your code as usual.

But wait, some of you may think, how does multicluster-service-account import those tokens from the other clusters?

Well, it's the chicken and egg problem, you got to start somewhere.

I believe that if you install multicluster-service-account you do that bootstrapping process once with the custom CLI, but then all of your other multicluster controllers can use the declarative API of ServiceAccountImport, so you don't have to do that ever again.

It probably wasn't the rooster or the unfertilized eggs, by the way.

Let's write a demo of that.

I'm going to check out my code at a different tag, because it just adds an example, and then I'm going to deploy those stacks. I have a cluster N stack and a cluster 0 stack.

What's in cluster 1 and cluster 2? We have a custom resource definition for foo objects. We do create a foo namespace, even though it will not contain any deployments, it's just a way to host a service account that is allowed to list, watch, get, update the foo objects and create events.

There are all of the RBAC rules for that.

In cluster 0, which is our central cluster in this example, we have a namespace that will host all the NGINX deployments.

There is the service account that is allowed to list, watch, get, and create those deployments, and some binding for that.

Here are our two ServiceAccountImports, to import that other service account from each of the dependent clusters (the federated clusters, as a general term), and finally a very simple deployment for the controller that is annotated so it can import those service accounts.

In a way, it is called with the same arguments that I used in the previous demo.

The kubeconfigs come from this folder, where Admiralty's multicluster-service-account usually puts tokens.

[33:46] Okay, yeah, it's almost done.

It's a good thing that the first demo didn't work because we wouldn't have had time for this.

I applied the cluster N stack to cluster 1 and cluster 2. Then I'm installing multicluster-service-account in cluster 0 and I'm bootstrapping cluster 1 and cluster 2 with a custom CLI.

These two steps, you should only do them for your first multicluster setup. If you create more multicluster controllers, you don't have to do this. Your clusters are already well known to each other and they can import tokens from each other.

Once it's bootstrapped, I can finally install the multicluster controller in cluster 0. I create example-foo in cluster 1 and a different example called bar in cluster 2, and in cluster 0 you can see that our deployments have been created.

That's it for the demo.

So, in short, we created this custom multicluster controller system using cross-cluster control loops, cross-cluster garbage collection, and declarative bootstrapping which are the common ingredients at the core of these projects.

There are many examples and I'd be curious to hear about yours, because there are just so many things we need to do with multiple clusters.

Here is how you can connect with me if you want, and some links to the different projects I've talked about: those from Admiralty (multicluster-controller, multicluster-service-account, and multicluster-scheduler, which I barely mentioned; it solves 2 of the things in that list of examples), and then some of the official projects that I mentioned too.

If we have time for questions, I'll be happy to answer a few.