Drone EKS autoscaling - controller needs CPU requests?

I’m running Drone (kubernetes-native) on AWS EKS, with an autoscaler running in the cluster. The hope is that when CPU utilization rises, the autoscaler will trigger new nodes to be added, and jobs will run on the new nodes. At first glance, adding resources with CPU requests would get me what I need. However, two interrelated things seem to be thwarting me:

  • pipelines are created with node affinity. Drone ‘sticks’ the pipeline steps to the same node as their services, which I read as the same node as their ‘drone-job-*’ pod
  • job controllers are created without resources. The scheduler has no CPU requests to go on up front, so it places these pods seemingly anywhere, even on CPU-starved nodes.
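
To make the second point concrete, here is a minimal Python sketch of request-based placement. It is not Drone or Kubernetes code, and the `Node` class and `fits` function are hypothetical names. It only assumes what the Kubernetes scheduler documents: a pod fits when its CPU request is no larger than the node's allocatable CPU minus the requests already placed there, and live utilization is not part of that check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable_millicpu: int
    # CPU requests (in millicores) of pods already placed on this node.
    requested_millicpu: list = field(default_factory=list)

    def free_millicpu(self) -> int:
        return self.allocatable_millicpu - sum(self.requested_millicpu)


def fits(node: Node, pod_request_millicpu: int) -> bool:
    """Request-based fit check: live CPU usage is never consulted."""
    return pod_request_millicpu <= node.free_millicpu()


# A node whose pods declare no requests looks empty, however busy it really is.
busy = Node("busy-node", allocatable_millicpu=2000, requested_millicpu=[0, 0, 0])
print(fits(busy, 0))     # True: a pod with no request always fits
print(fits(busy, 1500))  # True: nothing else has reserved CPU on paper

# With requests declared, the same capacity is correctly seen as nearly full.
accounted = Node("accounted-node", allocatable_millicpu=2000, requested_millicpu=[1000, 800])
print(fits(accounted, 1500))  # False: only 200m is left
```

This is why pods without requests can pile onto a node that is already starved: from the scheduler's point of view, nothing on it has claimed any CPU.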

So I have autoscaling set up to recognize that a step requires more CPU than is available, but it can’t scale up because the step is stuck on the same node due to node affinity (I think):

Scale-up predicate failed: GeneralPredicates predicate mismatch, cannot put [...] on [...], reason: node(s) didn't match node selector
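
One way to read that message, sketched in Python under stated assumptions: before adding a node, the autoscaler checks whether the pending pod could be scheduled on a fresh node from the node group. If the pod's selector names one existing node's hostname (as suspected above), no new node can ever match. The function and the `node-a` / `node-b` values are hypothetical illustrations, not the autoscaler's actual code; `kubernetes.io/hostname` is the standard node label.

```python
def matches_selector(node_labels: dict, node_selector: dict) -> bool:
    """True when every selector key/value pair is present in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical step pod pinned to the node where its job pod already runs.
pinned_selector = {"kubernetes.io/hostname": "node-a"}

existing_node = {"kubernetes.io/hostname": "node-a"}
# A node added by a scale-up would get its own, different hostname.
new_node = {"kubernetes.io/hostname": "node-b"}

print(matches_selector(existing_node, pinned_selector))  # True
print(matches_selector(new_node, pinned_selector))       # False
```

Under this reading, adding CPU requests to the steps alone would not help: the pod would stay pending on the full node, and a scale-up would be rejected because the new node cannot satisfy the selector.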

Any guidance here is welcome. I’m going to continue to experiment but I’m running out of ideas.

Essentially, right now it feels like I want to get CPU requests on these drone-job-* entries:


Hey I’m having the same problem. Would love some input from other Drone on K8s users.

I can confirm that this is exactly what is happening. I have a massively parallel build (with depends_on statements) and all the steps are started on the same node, hammering its CPU. Please check the NODE column in the listing below:

$ kubectl -n xvcp0k60xwr1r25eyvfnsu58j1ybf39p get pods -o wide 
NAME                               READY   STATUS      RESTARTS   AGE     IP             NODE                                                NOMINATED NODE
8k8tjypwid4adter866km8nzv7pqgd4q   1/1     Running     0          17s     100.96.32.14   ip-10-0-96-165.eu-west-1.compute.internal   <none>
a28rar2fhj6t8qfbhzsm1qvuq6lltkp5   1/1     Running     0          17s     100.96.32.15   ip-10-0-96-165.eu-west-1.compute.internal   <none>
ggls4fz6bes2i3wpz4hmqkwlan6j2u36   1/1     Running     0          17s     100.96.32.16   ip-10-0-96-165.eu-west-1.compute.internal   <none>
k1yz80sz57dvbz5ra4cepevfgql9opzm   1/1     Running     0          16s     100.96.32.17   ip-10-0-96-165.eu-west-1.compute.internal   <none>
kpax184m5dxrfd76w1glkrbu9ocrxqmx   1/1     Running     0          17s     100.96.32.9    ip-10-0-96-165.eu-west-1.compute.internal   <none>
lhe297eg809tdi13rg7p1hdaq5x6urv6   1/1     Running     0          17s     100.96.32.10   ip-10-0-96-165.eu-west-1.compute.internal   <none>
m9kkvmff629cgt201zg6qzcz9b5gca9f   0/1     Completed   0          5m37s   100.96.32.8    ip-10-0-96-165.eu-west-1.compute.internal   <none>
mc2slr1hiqs482pqvcpeve9zc9wy7aj3   1/1     Running     0          17s     100.96.32.12   ip-10-0-96-165.eu-west-1.compute.internal   <none>
oinn10m4w0s8af7f0z0vu1mormc38kzy   1/1     Running     0          17s     100.96.32.11   ip-10-0-96-165.eu-west-1.compute.internal   <none>
qmmzt55iq5iuh30moa29wf3uwtj1q6vz   0/1     Completed   0          5m37s   100.96.32.7    ip-10-0-96-165.eu-west-1.compute.internal   <none>
rl1rctgy3qrycus43dvcrh7muhm8vmwg   1/1     Running     0          17s     100.96.32.13   ip-10-0-96-165.eu-west-1.compute.internal   <none>
xadcb3v14xjpdg3t0uc2w6zon8139ye1   0/1     Completed   0          5m46s   100.96.32.6    ip-10-0-96-165.eu-west-1.compute.internal   <none>
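
For longer listings, the per-node tally can be done mechanically. The following is a small Python sketch, not a Drone or kubectl feature; `pods_per_node` is a hypothetical helper, and it assumes the column layout shown above, where NODE is the seventh whitespace-separated field and RESTARTS is a single token.

```python
from collections import Counter


def pods_per_node(listing: str) -> Counter:
    """Count pods per node from `kubectl get pods -o wide` text output."""
    counts = Counter()
    # Skip the header row; each remaining row is one pod.
    for line in listing.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1  # NODE column, assuming the layout above
    return counts


# Two rows copied from the listing above.
sample = """\
NAME                               READY   STATUS      RESTARTS   AGE     IP             NODE                                                NOMINATED NODE
8k8tjypwid4adter866km8nzv7pqgd4q   1/1     Running     0          17s     100.96.32.14   ip-10-0-96-165.eu-west-1.compute.internal   <none>
m9kkvmff629cgt201zg6qzcz9b5gca9f   0/1     Completed   0          5m37s   100.96.32.8    ip-10-0-96-165.eu-west-1.compute.internal   <none>
"""

for node, count in pods_per_node(sample).items():
    print(node, count)
```

A single node name in the output means every pod in the namespace landed on the same machine.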

Just a reminder that native Kubernetes runtime is still experimental and is not recommended for production use. It may be deprecated and replaced by Tekton in the future, so just be careful if relying on this for a production deployment. With that being said, we will accept patches that fix bugs with the current implementation.

Fair enough, and there are labels in multiple spots specifying that it’s experimental. It could be just us confused early adopters. The only other ask I would have from this thread would be a general update since the “drone goes k8s” blog post 6 months ago:

The Kubernetes Runtime is still considered experiment, however, initial testing has been very positive. There are some known issues and areas of improvement, however, I expect rapid progress over the coming weeks.

^ Dec 7th, 2018 - Drone CI/CD Goes Kubernetes-Native

Has testing remained very positive? Do we still expect rapid progress? Would an update somewhere help, to show a change in priorities? An update like this could save people from wasting their experimental time.

Has testing remained very positive? Do we still expect rapid progress?

I do not believe so. The documentation has been updated to recommend against production use while we re-assess. We are tracking various issues related to Kubernetes where we have summarized our concerns, although no final decisions have been made. Some further reading:

Our current focus is on enabling custom stage definitions [1] and providing a runner framework (conceptually similar to the Kubernetes operator framework). This will enable the creation of custom runners and will decouple runners from Drone core. I expect this will lead to a community-driven Kubernetes runtime that supersedes what we have today. I also expect the current Kubernetes runtime to remain active as a community-driven runtime, assuming there is interest in maintaining it despite its faults.

[1] [PROPOSAL] Drone Custom Stage Definitions · Issue #2680 · harness/gitness · GitHub
