Autoscaler doesn't deal with quota error when creating VM (GCP)

Hi,

I inherited this environment, and I believe I understand all the pieces, but apologies if I missed something obvious. It was using the drone/autoscaler:1.7.5 image (during my troubleshooting I updated that to the 1 image tag, which appears to resolve to 1.8).

We have the drone autoscaler container running in GKE (as a StatefulSet). I tried to increase the number of nodes by updating DRONE_POOL_MIN from 1 to 2, but I didn't see a new VM start up, and I didn't see anything out of the ordinary in the autoscaler logs at all.
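For reference, this is roughly how I changed the variable; just a sketch, and the namespace and StatefulSet name are placeholders for my actual ones (updating the pod template this way triggers a rolling restart of the pod):

kubectl -n drone set env statefulset/drone-autoscaler DRONE_POOL_MIN=2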

Eventually I checked the sqlite DB file and found this row in the servers table:

agent-8SplDryq|||error||||||2|6RgLh17VoMY1PXKW|Quota 'N2_CPUS' exceeded.  Limit: 64.0 in region us-central1.|||||1639586686|1639586698|0|0

(There is one row above that one containing the details of an actual running VM that is working fine.)
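In case it's useful, here's roughly how I pulled that row out; the pod name and DB path are from my setup and may differ, and this assumes sqlite3 is available in the container (if not, kubectl cp the file out first and query it locally):

kubectl -n drone exec -it drone-autoscaler-0 -- sqlite3 /data/database.sqlite 'SELECT * FROM servers;'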

Here are some recent autoscaler logs:

{"id":"5W6OnerM3sL5zGpr","level":"debug","max-pool":2,"min-pool":1,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":4,"server-count":2,"time":"2023-03-21T22:33:57Z"}
{"id":"5W6OnerM3sL5zGpr","level":"debug","msg":"no capacity changes required","time":"2023-03-21T22:33:57Z"}
{"id":"5W6OnerM3sL5zGpr","level":"debug","msg":"check capacity complete","time":"2023-03-21T22:33:57Z"}
{"id":"vZRyAuyYovbdeT9H","level":"debug","msg":"calculate unfinished jobs","time":"2023-03-21T22:34:57Z"}
{"id":"vZRyAuyYovbdeT9H","level":"debug","msg":"calculate server capacity","time":"2023-03-21T22:34:57Z"}

So it would appear the autoscaler is counting that errored row as a live server, i.e. it thinks there are two VMs running even though only one is. I also don't understand why it says no capacity changes are required even though I haven't had a job run in over an hour.

A couple other observations:

  • I set DRONE_AGENT_CONCURRENCY=8, but the logs say 'server-capacity' is 4 (which I assume is 2 per server; see the query sketch after this list). My ultimate goal in figuring this out is to get the drone-runner-docker on the VM to run more than 2 concurrent jobs.
  • It looks like the default for DRONE_INTERVAL is actually 5m, even though the docs say it’s 1m (just something I noticed).
  • I tried changing DRONE_GOOGLE_MACHINE_TYPE from n2-highcpu-16 to n2-standard-8, but the autoscaler doesn't seem to be doing anything with that; I would have thought it would recreate the VM.
    (Maybe its brains are slightly scrambled due to that row in the servers table…?)
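On the capacity point: if I'm reading this right, 'server-capacity' is the sum of each server's capacity as recorded in the servers table at creation time (here 2 servers × 2 = 4), so a changed DRONE_AGENT_CONCURRENCY would only apply to servers created after the change. A quick sanity check, assuming the column is named server_capacity (the "2" in my errored row above appears to sit in that column):

sqlite3 /data/database.sqlite 'SELECT server_name, server_capacity FROM servers;'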

I’m thinking I’ll delete that funky row from the sqlite DB tomorrow morning and see what happens. Anything else I should consider for cleaning this up and achieving my ultimate goal of having a VM that can run 8 concurrent drone jobs?
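If I do hand-edit the DB, the plan is roughly this, after backing up the file first (server_name is my guess at the column name, and deleting via the CLI would be preferable if I can reach the autoscaler):

cp /data/database.sqlite /data/database.sqlite.bak
sqlite3 /data/database.sqlite "DELETE FROM servers WHERE server_name = 'agent-8SplDryq';"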

Thanks!

I have an update: Getting rid of that bad server seemed to fix everything!

I was able to get rid of the bogus entry. A firewall had been blocking my access to the autoscaler via the CLI, so after I resolved that I ran drone server destroy agent-8SplDryq. (It returned client error 404: {"message":"sql: no rows in result set"}, but it definitely removed the row from the servers table.)
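For anyone else who needs to do this: the drone CLI has to be pointed at the autoscaler endpoint rather than the Drone server. Mine looked roughly like this, with a placeholder URL and token:

export DRONE_AUTOSCALER=https://autoscaler.example.com
export DRONE_TOKEN=<admin-token>
drone server ls
drone server destroy agent-8SplDryq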

Logs then showed there was one server (server-count of 1):

{"id":"wUGKzQ231qrR6Wy9", "level":"debug", "max-pool":2, "min-pool":1, "msg":"check capacity", "pending-builds":0, "running-builds":0, "server-buffer":0, "server-capacity":2, "server-count":1}

That also seemed to allow the autoscaler to recreate my instance with the new machine type and set the proper capacity (server-capacity of 8)! I now see this in the logs:

{"id":"fdyXRA5FNnQjAscR", "level":"debug", "max-pool":2, "min-pool":1, "msg":"check capacity", "pending-builds":0, "running-builds":0, "server-buffer":0, "server-capacity":8, "server-count":1}