Can we improve the autoscaler boot resilience? (retry creating a server after failure)

Hi,

I’m trying out the drone ecosystem.
After trying it on Kubernetes and failing (it’s not supported), I figured I would use the autoscaler.
After some setup, I managed to get the autoscaler working properly. However, I can see that it is sometimes a little too eager to install the agent, and when that fails it doesn’t realize it or take down the server.

What happens is that a random error occurs (Docker not ready, or something else) while the autoscaler is installing the agent image, so the installation just fails with an error.

The problem is that the autoscaler ignores the error and thinks that the server works properly.

IMHO, what the autoscaler should do is check if the server is OK and retry a couple of times, then do a teardown if nothing works after n attempts.

Logs below:

11:08PM DBG instance create image=ami-f90a4880 name=agent-qCPx68cr region=eu-west-1 size=t3a.large
11:08PM INF instance create success image=ami-f90a4880 name=agent-qCPx68cr region=eu-west-1 size=t3a.large
11:08PM DBG check instance network image=ami-f90a4880 name=agent-qCPx68cr region=eu-west-1 size=t3a.large
11:09PM DBG check capacity id=2Oty6ZiGgHSTtIHZ max-pool=20 min-pool=0 pending-builds=1 running-builds=0 server-buffer=0 server-capacity=4 server-count=1
11:09PM DBG no capacity changes required id=2Oty6ZiGgHSTtIHZ
11:09PM DBG check capacity complete id=2Oty6ZiGgHSTtIHZ
11:09PM DBG check capacity id=QY0PuiWG5oNX8EoN max-pool=20 min-pool=0 pending-builds=1 running-builds=0 server-buffer=0 server-capacity=4 server-count=1
11:09PM DBG no capacity changes required id=QY0PuiWG5oNX8EoN
11:09PM DBG check capacity complete id=QY0PuiWG5oNX8EoN
11:09PM DBG check instance network image=ami-f90a4880 name=agent-qCPx68cr region=eu-west-1 size=t3a.large
11:09PM DBG instance network ready image=ami-f90a4880 ip=34.245.72.157 name=agent-qCPx68cr region=eu-west-1 size=t3a.large
11:09PM DBG provisioned server server=agent-qCPx68cr
11:10PM DBG check docker connectivity ip=34.245.72.157 name=agent-qCPx68cr
11:10PM DBG connecting to docker ip=34.245.72.157 name=agent-qCPx68cr
11:10PM DBG cannot connect, retry in 1m0s error="Cannot connect to the Docker daemon at https://34.245.72.157:2376. Is the docker daemon running?" ip=34.245.72.157 name=agent-qCPx68cr
11:10PM DBG check capacity id=v0cNTQFs46iFee6R max-pool=20 min-pool=0 pending-builds=1 running-builds=0 server-buffer=0 server-capacity=4 server-count=1
11:10PM DBG no capacity changes required id=v0cNTQFs46iFee6R
11:10PM DBG check capacity complete id=v0cNTQFs46iFee6R

I couldn’t file an issue on GitHub for the autoscaler, so I figured I should post here.

What are your thoughts on this issue?
Maybe it’s not a big hassle?

Thanks!

What happens is that a random error occurs (Docker not ready, or something else) while the autoscaler is installing the agent image, so the installation just fails with an error. The problem is that the autoscaler ignores the error and thinks that the server works properly.

This is not quite accurate. If the autoscaler encounters an error it sets the server state to error.

IMHO, what the autoscaler should do is check if the server is OK and retry a couple of times

The autoscaler will retry multiple times. If you look at the code [1] you will see that it pings the Docker daemon on the instance multiple times before it finally marks the server as errored.
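
For illustration, that retry loop is roughly of the following shape (a simplified sketch, not the actual install.go code; the dockerClient interface, retry count, and demo interval are assumptions made here):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dockerClient is a stand-in for the Docker API client the autoscaler
// uses; only Ping matters for this sketch.
type dockerClient interface {
	Ping(ctx context.Context) error
}

// awaitDocker pings the Docker daemon on the new instance until it
// responds, retrying a fixed number of times before giving up. Only
// after the final failed attempt would the server be marked as errored.
func awaitDocker(ctx context.Context, client dockerClient, retries int, interval time.Duration) error {
	for i := 0; i < retries; i++ {
		err := client.Ping(ctx)
		if err == nil {
			return nil // docker is up; agent installation can proceed
		}
		fmt.Printf("cannot connect, retry in %s: %v\n", interval, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return errors.New("docker daemon never became reachable, marking server as errored")
}

// fakeClient simulates a daemon that only comes up on the third attempt.
type fakeClient struct{ calls int }

func (f *fakeClient) Ping(ctx context.Context) error {
	f.calls++
	if f.calls < 3 {
		return errors.New("connection refused")
	}
	return nil
}

func main() {
	// The real retry interval is around a minute (see the 1m0s in the
	// logs above); a short interval keeps this demo quick.
	err := awaitDocker(context.Background(), &fakeClient{}, 5, 100*time.Millisecond)
	fmt.Println("result:", err) // prints: result: <nil>
}
```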

then do a teardown if nothing works after n attempts.

This is not the default behavior because it can result in an infinite loop of creating and tearing down servers. In most cases an errored instance indicates a configuration problem that needs to be resolved by the operator.

It is possible, however, to configure automatic teardown of errored instances by setting DRONE_ENABLE_REAPER=true. The system runs a cleanup routine every hour that checks for and removes errored instances.
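
Conceptually, the reaper routine behaves something like this (a rough sketch only; the Store and Provider interfaces and all names below are invented for illustration and are not the autoscaler’s actual types):

```go
// Package sketch illustrates, very roughly, what the hourly reaper does.
package sketch

import (
	"context"
	"log"
	"time"
)

// Server is a minimal stand-in for the autoscaler's server record.
type Server struct {
	Name  string
	State string // e.g. "running", "error"
}

// Store and Provider are hypothetical interfaces for this sketch; the
// real autoscaler has its own store and cloud-provider abstractions.
type Store interface {
	ListErrored(ctx context.Context) ([]*Server, error)
	Delete(ctx context.Context, s *Server) error
}

type Provider interface {
	Destroy(ctx context.Context, s *Server) error
}

// Reap periodically looks for servers stuck in the error state and tears
// them down, which is roughly what DRONE_ENABLE_REAPER=true enables.
func Reap(ctx context.Context, db Store, cloud Provider, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			servers, err := db.ListErrored(ctx)
			if err != nil {
				log.Println("reaper: cannot list errored servers:", err)
				continue
			}
			for _, s := range servers {
				if err := cloud.Destroy(ctx, s); err != nil {
					log.Println("reaper: cannot destroy", s.Name, err)
					continue
				}
				if err := db.Delete(ctx, s); err != nil {
					log.Println("reaper: cannot delete record for", s.Name, err)
				}
			}
		}
	}
}
```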

11:10PM DBG cannot connect, retry in 1m0s error="Cannot connect to the Docker daemon at 34.245.72.157:2376. Is the docker daemon running?"

I do not see anything abnormal in the logs that were posted. The autoscaler pings the instance but it is not ready, so it will retry in 1 minute [1].

11:10PM DBG check capacity id=v0cNTQFs46iFee6R max-pool=20 min-pool=0 pending-builds=1 running-builds=0 server-buffer=0 server-capacity=4 server-count=1
11:10PM DBG no capacity changes required id=v0cNTQFs46iFee6R

This also looks fine to me. The autoscaler provisioned 1 server, which is currently pending a successful Docker ping but still counts toward the pool, so server-count is 1 and server-capacity is 4, which is enough for the single pending build, and no capacity changes are required.
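
To make the arithmetic concrete, here is a rough sketch of the capacity check using the numbers from those log lines (the real planner is more involved; the function and parameter names are made up):

```go
package main

import "fmt"

// plan mirrors the arithmetic behind the capacity log lines above. The
// real planner in the autoscaler is more involved; this is only a sketch.
// concurrency is the configured per-server build capacity.
func plan(pendingBuilds, runningBuilds, serverBuffer, serverCount, concurrency, maxPool int) int {
	serverCapacity := serverCount * concurrency // "server-capacity" in the logs
	load := pendingBuilds + runningBuilds + serverBuffer
	if load <= serverCapacity {
		return 0 // "no capacity changes required"
	}
	needed := (load - serverCapacity + concurrency - 1) / concurrency // round up
	if serverCount+needed > maxPool {
		needed = maxPool - serverCount
	}
	return needed
}

func main() {
	// Numbers from the posted logs: 1 pending build, 0 running, buffer 0,
	// 1 server with capacity 4, max pool 20 => no new servers needed.
	fmt.Println(plan(1, 0, 0, 1, 4, 20)) // prints 0
}
```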

[1] https://github.com/drone/autoscaler/blob/master/engine/install.go#L131

this change introduces the behaviour you’re asking for.

we’ve been running this in production for about 2-3 months

The error can have other causes and is not always an operator error.

When spawning an instance, there are two external dependencies:

  • the Docker PPA
  • the Docker Hub agent image

Both can be down or flaky at any given time, and if they are, the provisioned node will go into the error state and claim capacity, leading to starvation.

This then requires human intervention to resolve the starvation unless the reaper is enabled, which is opt-in (IMHO it could become opt-out).

This becomes apparent when keeping nodes at a minimum (minimum scale set to 0) and going through hundreds of node provisions and cleanups a day.

The reaper with a lower interval (https://github.com/drone/autoscaler/pull/49) is a viable alternative IMHO. It does result in the same loop, though, if an operator error happens (or if the external repos are down).

The loop stays the same, except the reaper is slower at resolving the issue, while at worst the two are equally expensive: paying for a node that does nothing for a longer period vs. booting and retrying endlessly. With per-second billing (Google/AWS) the reaper approach takes longer to resolve the issue but costs the same; at DigitalOcean it does not, because the first hour is always billed.

Either way, the reaper should probably become opt-out for a better new-user experience, I guess :slight_smile:

This should probably be the default way, not the reaper method… I understand that spot instances are unusable because of this at the moment? If a spot instance is killed, the autoscaler won’t know about it and will think there’s a server when in reality it’s not there.

This should probably be the default way, not the reaper method

We observed real-world situations where the current logic prevented the system from creating servers in an unbounded, infinite loop and thus racking up significant server bills. We therefore take a conservative approach by design and provide the reaper flag for teams to enable once they are more comfortable with the autoscaler and its configuration.

I understand that spot instances are unusable because of this at the moment?

I am not sure they are related. This particular thread suggests we should automatically destroy instances if there is an anomaly creating or configuring the instance, which is unrelated to instances being created and then destroyed without the autoscaler being aware. The other issue with spot instances is that if they are destroyed while a build is running, it can leave the build in a stuck state.

just to add to my previous comment …

The other issue with spot instances is that if they are destroyed while a build is running, it can leave the build in a stuck state.

related to spot instances, you can set DRONE_ENABLE_PINGER=true, which pings the server to verify it still exists and tears it down if the ping fails. This can help detect spot instances that were terminated, but it does not handle the situation where builds were running on the instance and are now in a “zombie” state.
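
Roughly, the idea behind the pinger is something like the following (a conceptual sketch; the Pinger interface and teardown callback are assumptions, not the autoscaler’s actual API):

```go
// Package sketch illustrates the idea behind DRONE_ENABLE_PINGER.
package sketch

import (
	"context"
	"log"
	"time"
)

// Pinger is a stand-in for whatever is used to reach the server (for
// example the Docker endpoint on the instance); the teardown callback
// removes the server. Both are assumptions made for this sketch.
type Pinger interface {
	Ping(ctx context.Context) error
}

// Watch pings the instance on an interval and tears it down once the
// ping fails, e.g. after a spot termination. It cannot recover builds
// that were already running on the instance.
func Watch(ctx context.Context, p Pinger, teardown func(context.Context) error, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := p.Ping(ctx); err != nil {
				log.Println("ping failed, tearing down instance:", err)
				if err := teardown(ctx); err != nil {
					log.Println("teardown failed:", err)
				}
				return
			}
		}
	}
}
```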