Drone autoscaler not scaling instances down as expected

Hello,

We’re having issues with the autoscaler not scaling down instances as aggressively as I would expect.

The documentation suggests that the default minimum age for an agent is 1h and the minimum pool size is 2. I currently have 7 running instances, but I can see several hour-long gaps in the CPU utilisation timeline where I would expect some of the agents to be killed, as there are no running builds.

It’s pretty obvious that the agents are spending a lot of time doing nothing, and this has a significant cost, so I’m keen to sort this rather than giving Bezos more money.

From my understanding, I would expect all bar two agents to be terminated between 14:30 and 16:30 in this graph. Further, at most 4 agents actually have anything running on them in this 12-hour stretch.

The agents are being spun up and down (I think). Two of the agents have only been running for 22 hours, and two of them have been running for 16 days, which matches the minimum pool.

I have confirmed the pool min age is not overridden in the autoscaler.

The autoscaler logs have a repeated message about the autoscaler considering terminating, then aborting the termination.

{"id":"z35a5k3geK9z74fR","level":"debug","msg":"calculate unfinished jobs","time":"2021-07-09T00:53:35Z"}
{"id":"z35a5k3geK9z74fR","level":"debug","msg":"calculate server capacity","time":"2021-07-09T00:53:35Z"}
{"id":"z35a5k3geK9z74fR","level":"debug","max-pool":30,"min-pool":2,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":7,"server-count":7,"time":"2021-07-09T00:53:35Z"}
{"id":"z35a5k3geK9z74fR","level":"debug","msg":"terminate 5 servers","time":"2021-07-09T00:53:35Z"}
{"id":"z35a5k3geK9z74fR","level":"debug","min-pool":2,"msg":"abort terminating %!d(MISSING) instances to ensure minimum capacity met","servers-running":4,"servers-to-terminate":5,"time":"2021-07-09T00:53:35Z"}
{"id":"z35a5k3geK9z74fR","level":"debug","msg":"check capacity complete","time":"2021-07-09T00:53:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","msg":"calculate unfinished jobs","time":"2021-07-09T00:58:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","msg":"calculate server capacity","time":"2021-07-09T00:58:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","max-pool":30,"min-pool":2,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":7,"server-count":7,"time":"2021-07-09T00:58:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","msg":"terminate 5 servers","time":"2021-07-09T00:58:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","min-pool":2,"msg":"abort terminating %!d(MISSING) instances to ensure minimum capacity met","servers-running":4,"servers-to-terminate":5,"time":"2021-07-09T00:58:35Z"}
{"id":"Btdk7g26mri1np3L","level":"debug","msg":"check capacity complete","time":"2021-07-09T00:58:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","msg":"calculate unfinished jobs","time":"2021-07-09T01:03:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","msg":"calculate server capacity","time":"2021-07-09T01:03:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","max-pool":30,"min-pool":2,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":7,"server-count":7,"time":"2021-07-09T01:03:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","msg":"terminate 5 servers","time":"2021-07-09T01:03:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","min-pool":2,"msg":"abort terminating %!d(MISSING) instances to ensure minimum capacity met","servers-running":4,"servers-to-terminate":5,"time":"2021-07-09T01:03:35Z"}
{"id":"RUOF0rnuBhvPm6S0","level":"debug","msg":"check capacity complete","time":"2021-07-09T01:03:35Z"}

Is it possible your instances are in an error state? Drone does not terminate instances in an error state by default. Alternatively, Drone may have tried to terminate your instances and been unable to. You can use the command line tools to dump the state of all your instances.

It would also be great to add a unit test (see the links below) that simulates the current state of your infrastructure, so you can see the result.

https://github.com/drone/autoscaler/blob/master/engine/planner.go
https://github.com/drone/autoscaler/blob/master/engine/planner_test.go
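
For example, a test case along these lines would capture the scenario. This is a minimal, self-contained sketch rather than real planner code; the server type and the surplus helper are invented stand-ins that mirror the fields in your check capacity log line, and a real test would construct the actual planner from planner_test.go instead.

package capacity

import "testing"

// server is a stand-in for an autoscaler instance record.
type server struct {
	state string // e.g. "running", "error"
}

// surplus mirrors the arithmetic implied by the check-capacity log line:
// with no pending or running builds, every server above min-pool is a
// candidate for termination, regardless of its state.
func surplus(servers []server, minPool, pendingBuilds, runningBuilds int) int {
	if pendingBuilds > 0 || runningBuilds > 0 {
		return 0
	}
	if extra := len(servers) - minPool; extra > 0 {
		return extra
	}
	return 0
}

func TestScaleDownWhenIdle(t *testing.T) {
	// Seven idle servers, a min-pool of 2, nothing building.
	servers := make([]server, 7)
	for i := range servers {
		servers[i] = server{state: "running"}
	}
	// Matches the "terminate 5 servers" log line above.
	if got := surplus(servers, 2, 0, 0); got != 5 {
		t.Fatalf("expected 5 surplus servers, got %d", got)
	}
}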


EDIT: it looks like you may be running an older version of the autoscaler. I noticed %!d(MISSING) in your output and checked the code; this formatting error has since been fixed, along with some of our calculations related to this section of the code.

https://github.com/drone/autoscaler/commit/4d2abe071c1dc0135b8afd3000fbfa2e7ee302f0#diff-6bdd1d696f11832b1a16759caf45d7b71ecec74c272bdacee8e68d96b01438ff

You’re right, some of the agents are in an error state.
3 are in an error state, 4 are running.

The error shown from drone server info is:

Error: error during connect: Post https:// DRONE_AGENT_IP :2376/v1.40/images/create?fromImage=drone%2Fdrone-runner-docker&tag=1: EOF

If I hop onto one of the errored agents, I can see that there are no containers running under Docker, and it doesn’t have the Drone image. I’m guessing that it was unable to pull the image correctly for some reason.

I think there are several issues here.

First of all, it seems like if any agent is in an error state, the autoscaler will refuse to kill any agents at all.
I’m assuming this because there’s a two-hour period where no builds were being run, and the default pool size and age are set. That means I’d expect it to scale down to the number of errored agents plus two healthy agents, in my case 5.

We can see from the logs that it attempts to do this:
{"id":"7ntxtRXWWgnE5GpB","level":"debug","min-pool":2,"msg":"abort terminating %!d(MISSING) instances to ensure minimum capacity met","servers-running":4,"servers-to-terminate":5,"time":"2021-07-09T01:33:35Z"}

Servers-running must be the number of servers in the running state, while servers-to-terminate appears to be calculated from all servers, including those in an error state.
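
To make that concrete, this is the logic I think is in play (a minimal sketch with made-up names, not the actual planner source); it reproduces the numbers in the abort line above:

package main

import "fmt"

func main() {
	const minPool = 2

	running := 4 // healthy agents
	errored := 3 // agents stuck in the error state
	total := running + errored

	// The surplus appears to be derived from the total pool: 7 - 2 = 5.
	toTerminate := total - minPool

	// The min-pool guard then only counts running servers: 4 - 5 < 2,
	// so the whole scale-down is aborted instead of being trimmed to
	// the healthy surplus (running - minPool = 2).
	if running-toTerminate < minPool {
		fmt.Printf("abort terminating %d instances to ensure minimum capacity met\n", toTerminate)
		return
	}
	fmt.Printf("terminate %d servers\n", toTerminate)
}

If that reading is right, trimming the termination list down to the healthy surplus, rather than aborting it outright, would avoid getting stuck.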

Secondly, wouldn’t it make sense for an autoscaler to terminate instances that are in an error state? Keeping them alive for debugging purposes makes sense, but terminating them after they age out certainly seems like a good idea.

Thirdly, there doesn’t seem to be a transparent way to know if I’ve got a bunch of broken agents without running a series of CLI commands and/or monitoring my AWS bill. I’m familiar with TeamCity (which, of course, has its own set of problems), but it’s transparent about the agents and what they’re up to. Adding something to the web UI for agent status would be nice, but a fair amount of work.

We had an issue with the autoscaler last month failing to scale down instances and nobody noticed for a couple of weeks. The bill was several times our usual usage, so this behaviour has a financial impact.
I thought it was because something went haywire when I updated to drone 2 and rolled back to drone 1, but I think it was the same issue with an errored agent. The autoscaler logs had the same abort terminating line.
Part of that’s definitely on us for not being more proactive in our monitoring, but it seems like there’s a couple of things that can be sorted to improve this in the future.

I don’t think updating the autoscaler version would help in this case (I’ve since killed those errored servers, so I can’t check). That fix looks like it stops pending servers from counting towards capacity, but errored servers are still taken into account.

I’ve opened a PR here that I think addresses points 1 and 2 (although my Go is pretty shoddy).

See the reasoning at https://github.com/drone/autoscaler/pull/51#issuecomment-531580254. If you are comfortable automatically tearing down instances in an errored state, you can enable the following environment variable:

DRONE_ENABLE_REAPER=true

In addition, you can set the below variable to proactively ping and tear down unhealthy instances:

DRONE_ENABLE_PINGER=true

I think we’re both in agreement that nobody wants to be charged for servers spinning up when they don’t need to be, and not being spun down when they should be.

My point is that not spinning down servers in the error state seems like it falls under ‘not being spun down when they should be’. There’s a fair amount of cost in hanging onto hidden dead servers.

Surfacing that information via the UI seems like a good step, if they’re not going to be automatically dealt with, but my Go is nowhere near good enough to help with that.

I updated to the latest version of drone autoscaler (1.7.5), and can confirm that it is still failing to scale down.

Weirdly, prior to updating the autoscaler I had 5 agents, two in error. Now I have 14.

{"id":"BlSvVrbwGKQXjGM5","level":"debug","msg":"calculate unfinished jobs","time":"2021-07-14T23:27:01Z"}
{"id":"BlSvVrbwGKQXjGM5","level":"debug","msg":"calculate server capacity","time":"2021-07-14T23:27:01Z"}
{"id":"BlSvVrbwGKQXjGM5","level":"debug","max-pool":16,"min-pool":2,"msg":"check capacity","pending-builds":0,"running-builds":0,"server-buffer":0,"server-capacity":14,"server-count":14,"time":"2021-07-14T23:27:01Z"}
{"id":"BlSvVrbwGKQXjGM5","level":"debug","msg":"terminate 12 servers","time":"2021-07-14T23:27:01Z"}
{"id":"BlSvVrbwGKQXjGM5","level":"debug","min-pool":2,"msg":"abort terminating instances to ensure minimum capacity met","servers-running":2,"servers-to-terminate":12,"time":"2021-07-14T23:27:01Z"}

I had a go at writing a unit test that reproduces this failure, but I can’t seem to get it to work. Part of the problem is that I don’t quite understand the planner tests; they don’t seem to verify the state after the plan step.
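
For what it’s worth, this is roughly the assertion I was trying to express, written against a toy model rather than the real planner API (all the names are invented):

package capacity

import "testing"

func TestAbortSwallowsHealthySurplus(t *testing.T) {
	const minPool = 2

	// Current state from the logs: 14 servers, only 2 running, 12
	// flagged for termination.
	running, toTerminate := 2, 12

	// The guard aborts the entire scale-down...
	aborted := running-toTerminate < minPool
	if !aborted {
		t.Fatal("expected the scale-down to be aborted")
	}

	// ...even though what I would want the real planner to do is
	// terminate the healthy surplus (zero here, since only two agents
	// are running) and deal with the non-running servers separately.
}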

If you want the autoscaler to automatically tear down instances in an error state you can set the below parameter, which will periodically clean up errored instances.

DRONE_ENABLE_REAPER=true

It is abnormal to see instances in an error state once the autoscaler is properly configured. If you are frequently seeing instances in an error state, you may want to research the root cause, which should be found in the logs, so that you can tweak your configuration accordingly.

Yeah, I got that.

What I’m trying to point out is that if there is a single server in the error state, it refuses to scale down any servers at all, even the healthy ones.