
The Surprisingly Low Limits to etcd/Kubernetes API Scale

Oct 10, 2024

3 min read



There is a little-known scale limitation to Kubernetes operators that could dramatically affect your implementation. Depending on your (or your cloud provider's) Kubernetes control plane implementation and your state storage/retrieval design, it is possible to overwhelm your cluster with a relatively small amount of data.


Background


When you add a Kubernetes operator to your cluster, you also add a CustomResourceDefinition (CRD), which adds an endpoint to the Kubernetes API, along with a controller: a control plane component that watches and reconciles resources of the new type. When a custom resource (CR) is added to the cluster, it must conform to the schema defined in the CRD. Each CR is stored in etcd, a key-value store that prioritizes consistency via the Raft consensus algorithm. etcd persists its key-value pairs on the filesystem, and fast querying of data is not the top priority in its design.
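
As a rough illustration, here is a minimal sketch of the Go type an operator project might use behind a CRD, in the style of a kubebuilder-scaffolded project. The Widget kind, its group, and its fields are hypothetical placeholders, not from the scenario above.

// Package v1alpha1 sketches a hypothetical "Widget" custom resource type.
// In a kubebuilder-style project, controller-gen turns these structs into the
// CRD's OpenAPI schema and generates the deepcopy code the API machinery needs.
package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// WidgetSpec is the desired state stored in every custom resource instance.
type WidgetSpec struct {
    // Replicas is an illustrative field; the API server validates each CR
    // against the schema generated from this struct before writing it to etcd.
    Replicas int32 `json:"replicas,omitempty"`
}

// WidgetStatus is the observed state the controller writes back.
type WidgetStatus struct {
    Ready bool `json:"ready,omitempty"`
}

// +kubebuilder:object:root=true

// Widget is the custom resource; the API server persists each instance as a
// single key-value entry in etcd.
type Widget struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   WidgetSpec   `json:"spec,omitempty"`
    Status WidgetStatus `json:"status,omitempty"`
}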


Scenario


Now you’ve built an operator and are using it to store state for hundreds of thousands of resources, and you are starting to notice significant lag. You test a call with kubectl get <cr-name> and note that it takes over 30 seconds to return the list. You assume that limiting the data returned will speed up the call and instead run kubectl get <cr-name> -l <label-key>=<label-value>, but find that, despite much less data being returned, it still takes over 30 seconds. This is because etcd is simply a key-value store: the label filtering happens in the Kubernetes API server after what amounts to a full table scan of all etcd data for that resource type. The same is true for many other CR queries.
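
For illustration, here is a minimal sketch of that label-selector list made from Go with client-go's dynamic client. The example.com/v1alpha1 widgets resource and the app=my-app label are hypothetical; the point is that the selector is evaluated by the API server, not by etcd, so latency tracks the total number of stored objects rather than the number returned.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the local kubeconfig (enough for a sketch; real operators use
    // in-cluster config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // Hypothetical CRD resource registered by the operator.
    gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1alpha1", Resource: "widgets"}

    // The label selector is applied in the API server after it has read every
    // widget of this resource type from etcd, so a small result set does not
    // make the call cheap.
    list, err := client.Resource(gvr).Namespace("default").List(context.TODO(),
        metav1.ListOptions{LabelSelector: "app=my-app"})
    if err != nil {
        panic(err)
    }
    fmt.Printf("matched %d widgets\n", len(list.Items))
}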

It gets worse: now you have over a million CRs and your operator is crash looping. It cannot start up because it cannot retrieve all CRs within the server-side Kubernetes API timeout. Additionally, the Kubernetes API has become unresponsive, so you can’t even access the cluster to make emergency changes. Adding an informer to your controller to maintain a local cache has only reduced the frequency of the issue. Your cloud provider says the Kubernetes control plane servers are using only a small percentage of their CPU/memory, but the disk is overwhelmed, and they’ve already autoscaled the disk performance to the maximum available.
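
Here is a hedged sketch of that informer mitigation using client-go's dynamic shared informer factory (the GVR and resync period are placeholders). The informer keeps a local cache fed by a watch, so steady-state reads avoid the API server, but its initial sync still performs the same expensive full list, which is why it only reduces how often the problem appears.

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1alpha1", Resource: "widgets"}

    // Build an informer that mirrors all widgets into a local cache and keeps
    // it up to date via a watch.
    factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
        client, 10*time.Minute, metav1.NamespaceAll, nil)
    informer := factory.ForResource(gvr).Informer()
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) { /* enqueue the widget for reconciliation */ },
    })

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)

    // The initial cache sync is still a full list of every widget, which is
    // exactly the call that overwhelms etcd at this scale.
    if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
        panic("cache failed to sync before the stop channel closed")
    }
    fmt.Println("local widget cache is ready")
}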

At this point, an obvious solution is to query only one CR at a time: kubectl get <cr-name> <resource-name>. That may not be possible, and it doesn’t solve the issue during operator start-up, when the operator needs to reconcile the state of all the CRs it manages. That requires the dreaded list call, which initiates what amounts to a full table scan in etcd. When the etcd leader starts to struggle, it can even trigger a new leader election, which causes additional delay in data retrieval.
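
For completeness, here is a minimal sketch of that single-object read in Go, with the same hypothetical dynamic client, resource, and a made-up object name. A get by name maps to a single-key etcd read and stays fast, which is exactly why it does not help the start-up list.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1alpha1", Resource: "widgets"}

    // Fetching one named CR is a single-key etcd read, so it stays fast even
    // with millions of CRs, but it only helps when you already know which
    // resource you need.
    obj, err := client.Resource(gvr).Namespace("default").Get(
        context.TODO(), "my-widget", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("resourceVersion:", obj.GetResourceVersion())
}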

A well-designed Kubernetes control plane and etcd configuration and infrastructure can raise the ceiling at which this failure point is hit, but the major cloud providers don’t publicize their designs or the number of etcd objects that can be stored and retrieved safely. We’ve seen this issue on at least two cloud providers and expect it is a concern for all of them.


Conclusion


Kubernetes operators and custom resources were not built for scaled data storage and retrieval. It makes one wish there were a warning about this on or near this page. If you have not started building your operator yet, consider a design backed by a traditional database instead of CRs. There is nothing wrong with etcd, Raft, Kubernetes, cloud providers, or operators. As with any tool, they fit certain use cases well and perform poorly for others. If your system might require more than a few hundred thousand individual CRs, we suggest you look elsewhere for a solution to your Kubernetes-based state/resource management problem.
