Hi,
I inherited this environment, and I believe I understand all the pieces, but apologies if I missed something obvious. It was using the drone/autoscaler:1.7.5 image (during my troubleshooting I updated that to the 1 image tag, which appears to be 1.8).
We have the drone autoscaler container running in GKE (as a StatefulSet). I tried to increase the number of nodes by updating DRONE_POOL_MIN from 1 to 2, but I didn’t see a new VM start up. I didn’t see anything out of the ordinary in the autoscaler logs at all.
Eventually I checked the sqlite DB file and found this row in the servers table:
agent-8SplDryq|||error||||||2|6RgLh17VoMY1PXKW|Quota 'N2_CPUS' exceeded. Limit: 64.0 in region us-central1.|||||1639586686|1639586698|0|0
(There is one row above that one, which contains the details for an actual running VM that is working fine.)
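For anyone who wants to repeat this check without reading the raw dump, here is a minimal sketch that lists rows that are not in the running state. The column names (server_name, server_state, server_capacity, server_error, server_created) are an assumption inferred from the field order of the row above, the table is a cut-down in-memory stand-in rather than the real schema, and the "agent-example-running" row is hypothetical. Running .schema servers in the sqlite3 shell would confirm the real names.

```python
import sqlite3

# Assumed subset of the servers table; the real table has more columns.
SCHEMA = """
CREATE TABLE servers (
    server_name     TEXT PRIMARY KEY,
    server_state    TEXT,
    server_capacity INTEGER,
    server_error    TEXT,
    server_created  INTEGER
)
"""


def non_running_servers(conn):
    """Return (name, state, capacity, error) for rows whose state is not 'running'."""
    cur = conn.execute(
        "SELECT server_name, server_state, server_capacity, server_error "
        "FROM servers WHERE server_state != 'running' ORDER BY server_created"
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory stand-in for the autoscaler database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO servers VALUES (?, ?, ?, ?, ?)",
        [
            # Hypothetical placeholder for the healthy VM's row.
            ("agent-example-running", "running", 2, "", 1639580000),
            # Values copied from the row quoted in the post.
            ("agent-8SplDryq", "error", 2,
             "Quota 'N2_CPUS' exceeded. Limit: 64.0 in region us-central1.",
             1639586686),
        ],
    )
    return conn


rows = non_running_servers(demo_connection())
for row in rows:
    print(row)
```

Against the real file, the same query could be run by passing sqlite3.connect() the path of the autoscaler's database instead of the in-memory demo.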
Here are some recent autoscaler logs:
{"id":"5W6OnerM3sL5zGpr","level":"debug","max-pool":2,"min-pool":1,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":4,"server-count":2,"time":"2023-03-21T22:33:57Z"}
{"id":"5W6OnerM3sL5zGpr","level":"debug","msg":"no capacity changes required","time":"2023-03-21T22:33:57Z"}
{"id":"5W6OnerM3sL5zGpr","level":"debug","msg":"check capacity complete","time":"2023-03-21T22:33:57Z"}
{"id":"vZRyAuyYovbdeT9H","level":"debug","msg":"calculate unfinished jobs","time":"2023-03-21T22:34:57Z"}
{"id":"vZRyAuyYovbdeT9H","level":"debug","msg":"calculate server capacity","time":"2023-03-21T22:34:57Z"}
So it would appear that the autoscaler thinks there are two VMs running even though only one is. Also, I don’t understand why it says "no capacity changes required" even though I haven’t had a job run in over an hour.
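Reading the "check capacity" line field by field may help narrow this down. The sketch below only parses values that are already in the log line quoted above. The comparisons are a simplified guess at what matters for scaling, not the autoscaler's actual algorithm, and the summarize helper is a hypothetical name.

```python
import json

# First log line from the post, copied verbatim.
LOG_LINE = (
    '{"id":"5W6OnerM3sL5zGpr","level":"debug","max-pool":2,"min-pool":1,'
    '"msg":"check capacity","pending-builds":0,"running-builds":0,'
    '"server-buffer":0,"server-capacity":4,"server-count":2,'
    '"time":"2023-03-21T22:33:57Z"}'
)


def summarize(line):
    """Pull out the pool-related fields and a few simple comparisons."""
    entry = json.loads(line)
    count = entry["server-count"]
    return {
        "min_pool": entry["min-pool"],
        "max_pool": entry["max-pool"],
        "server_count": count,
        # Fewer servers than the configured minimum?
        "below_min": count < entry["min-pool"],
        # Already at (or above) the configured maximum?
        "at_max": count >= entry["max-pool"],
        # No builds pending or running at all?
        "idle": entry["pending-builds"] == 0 and entry["running-builds"] == 0,
        # Average capacity per counted server.
        "capacity_per_server": entry["server-capacity"] / count if count else None,
    }


summary = summarize(LOG_LINE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Two things stand out from the parsed values: the line reports min-pool as 1 rather than 2, and server-count (2) already equals max-pool (2). If the errored row is being counted as a server, which is only an assumption here, that would account for both the count of 2 and the capacity of 4 at 2 per server.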
A couple other observations:
- I set DRONE_AGENT_CONCURRENCY=8, but looking at the logs it says ‘server-capacity’ is 4 (which I assume is 2 per server). My ultimate goal in figuring this out is to get the drone-runner-docker that is on the VM to run more than 2 concurrent jobs.
- It looks like the default for DRONE_INTERVAL is actually 5m, even though the docs say it’s 1m (just something I noticed).
- I tried changing the DRONE_GOOGLE_MACHINE_TYPE from n2-highcpu-16 to n2-standard-8, but the autoscaler doesn’t seem to be doing anything with that - I would have thought it would have recreated the VM.
(Maybe its brains are slightly scrambled due to that row in the servers table…?)
I’m thinking I’ll delete that funky row from the sqlite DB tomorrow morning and see what happens. Anything else I should consider for cleaning this up and achieving my ultimate goal of having a VM that can run 8 concurrent drone jobs?
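A cautious way to do that deletion might look like the sketch below: copy the DB file first, then delete the row only if it is still in the error state. The column names server_name and server_state are assumptions inferred from the row's field order, the delete_errored_server helper and the .bak suffix are hypothetical, and it is probably safest to run this while the autoscaler is stopped so the file is not being written to.

```python
import shutil
import sqlite3


def delete_errored_server(db_path, server_name):
    """Back up the DB file, then delete one row only if its state is 'error'.

    Returns the number of rows deleted (0 if the row is missing or not errored).
    """
    # Keep a copy next to the original so the change can be undone.
    shutil.copy2(db_path, db_path + ".bak")

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "DELETE FROM servers "
                "WHERE server_name = ? AND server_state = 'error'",
                (server_name,),
            )
        return cur.rowcount
    finally:
        conn.close()


# Example call (the path is a placeholder, not the real location):
# delete_errored_server("/path/to/database.sqlite", "agent-8SplDryq")
```

Restricting the DELETE to rows in the error state means a typo in the name cannot remove the row for the healthy VM.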
Thanks!