I’m trying out the Drone ecosystem.
After trying it on Kubernetes and failing (it’s not supported), I figured I would use the autoscaler.
After some setup I managed to get the autoscaler working properly. However, I can see that it is sometimes a little too eager to install the agent, and when that fails it doesn’t realize it or take down the server.
What happens is that there is a random error (Docker not ready, or something else) when the autoscaler tries to push the agent image, and it just fails.
The problem is that the autoscaler ignores the error and assumes the server is working properly.
IMHO, what the autoscaler should do is check whether the server is OK, retry a couple of times, and then tear it down if nothing works after n attempts.
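Roughly what I have in mind, as a sketch in Go (not actual autoscaler code; provisionWithRetry, healthcheck and teardown are made-up names):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// provisionWithRetry checks a freshly created server a few times and tears it
// down if it never becomes healthy. healthcheck and teardown are hypothetical
// stand-ins, not actual autoscaler functions.
func provisionWithRetry(attempts int, interval time.Duration, healthcheck func() error, teardown func()) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = healthcheck(); err == nil {
			return nil // server is OK, keep it
		}
		time.Sleep(interval) // wait before the next check
	}
	teardown() // nothing worked after n attempts, destroy the instance
	return fmt.Errorf("server never became healthy, instance torn down: %w", err)
}

func main() {
	err := provisionWithRetry(3, time.Second,
		func() error { return errors.New("docker not ready") },
		func() { fmt.Println("tearing down instance") },
	)
	fmt.Println(err)
}
```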
What happens is that there is a random error (Docker not ready, or something else) when the autoscaler tries to push the agent image, and it just fails. The problem is that the autoscaler ignores the error and assumes the server is working properly.
This is not quite accurate. If the autoscaler encounters an error it sets the server state to error.
IMHO, what the autoscaler should do is check whether the server is OK, retry a couple of times
The autoscaler does retry multiple times. If you look at the code [1] you will see that it pings the Docker daemon multiple times before it finally marks the server as errored.
and then tear it down if nothing works after n attempts.
This is not the default behavior because it can result in an infinite loop of creating and tearing down servers. In most cases an errored instance indicates a configuration problem that needs to be resolved by the operator.
It is possible, however, to configure automatic teardown of errored instances by setting DRONE_ENABLE_REAPER=true. The system runs a cleanup routine every hour that checks for and removes errored instances.
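Conceptually the behavior looks something like the sketch below. This is not the actual autoscaler source; server, listErrored and destroy are hypothetical stand-ins for the autoscaler’s own store and cloud-provider calls.

```go
package main

import (
	"fmt"
	"time"
)

type server struct{ Name string }

// reap periodically looks for servers in the error state and destroys them.
// listErrored and destroy are hypothetical stand-ins for the autoscaler's
// own store and cloud-provider calls.
func reap(interval time.Duration, listErrored func() []server, destroy func(server) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		for _, s := range listErrored() {
			if err := destroy(s); err != nil {
				fmt.Printf("cannot destroy %s: %v\n", s.Name, err)
			}
		}
	}
}

func main() {
	// In-memory stubs for illustration; the real cleanup routine runs roughly
	// hourly and is enabled with DRONE_ENABLE_REAPER=true.
	reap(time.Hour,
		func() []server { return []server{{Name: "agent-x1"}} },
		func(s server) error { fmt.Println("destroying", s.Name); return nil },
	)
}
```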
11:10PM DBG cannot connect, retry in 1m0s error="Cannot connect to the Docker daemon at 34.245.72.157:2376. Is the docker daemon running?"
I do not see anything abnormal in the logs that were posted. The autoscaler pings the instance but it is not ready, so it will retry in 1 minute [1].
The error can have other causes and is not always an operator error.
When spawning an instance there are two external dependencies:
- the Docker PPA
- the Docker Hub agent image
Both can be down or flaky at any given time, and if they are, the provisioned node goes into the error state while still claiming capacity, leading to starvation.
This then requires human intervention to resolve the starvation unless the reaper is enabled, which is opt-in (IMHO it could become opt-out).
This becomes apparent when keeping nodes at a minimum, with the minimum scale set to 0, and going through hundreds of node provisions and cleanups a day.
A reaper with a lower interval (https://github.com/drone/autoscaler/pull/49) is a viable alternative IMHO. It does end up in the same loop, though, if an operator error happens (or if the external repositories are down).
The loop stays the same, except the reaper is slower at resolving the issue, while at worst the two approaches are equally expensive (paying for a node that does nothing for a longer period vs. booting and retrying endlessly). That is the case at Google/AWS, where per-second billing means the reaper takes longer to resolve the issue but costs the same, but not at DigitalOcean, where the first hour is always billed.
Either way, the reaper should become opt-out for a better new-user experience, I guess.
This should probably be the default behavior, not the reaper method… Do I understand correctly that spot instances are unusable because of this at the moment? If a spot instance is killed, the autoscaler won’t know about it and will think there is a server when in reality it isn’t there.
This should probably be the default behavior, not the reaper method
We observed real-world situations where the current logic prevented the system from creating an unbounded number of servers in an infinite loop and thus racking up significant server bills. We therefore take a conservative approach by design and provide the reaper flag for teams once they are more comfortable with the autoscaler and its configuration.
Do I understand correctly that spot instances are unusable because of this at the moment?
I am not sure they are related. This particular thread suggests we should automatically destroy instances if there is an anomaly when creating or configuring the instance, which is unrelated to instances being created and then destroyed without the autoscaler being aware. The other issue with spot instances is that if they are destroyed while a build is running, it can leave the build in a stuck state.
The other issue with spot instances is that if they are destroyed while a build is running, it can leave the build in a stuck state.
Related to spot instances, you can set DRONE_ENABLE_PINGER=true, which pings the server to verify it still exists and tears it down if the ping fails. This can help detect spot instances that were terminated, but it does not handle the situation where builds were running on the instance and are now in a “zombie” state.
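Conceptually the pinger does something like the sketch below; again, this is not the actual autoscaler source, and ping and teardown are hypothetical stand-ins. It also does nothing about recovering the zombie builds.

```go
package main

import (
	"fmt"
	"time"
)

// pinger periodically verifies that each known server still exists and tears
// it down if the check fails. ping and teardown are hypothetical stand-ins,
// not actual autoscaler functions.
func pinger(interval time.Duration, servers []string, ping func(string) error, teardown func(string)) {
	for range time.Tick(interval) {
		for _, s := range servers {
			if err := ping(s); err != nil {
				fmt.Printf("ping failed for %s, tearing down: %v\n", s, err)
				teardown(s)
			}
		}
	}
}

func main() {
	pinger(time.Minute,
		[]string{"spot-agent-1"},
		func(string) error { return fmt.Errorf("instance not found") },
		func(s string) { fmt.Println("teardown", s) },
	)
}
```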