Improvements to autoscaler to automatically re-attempt failed pull [pending patch]

We’re trying to replace kube-runner with Drone Autoscaler (using AWS EC2), but are running into problems. It seems to me like there’s very little retry logic in the autoscaler code, which is what causes the problems.
Here’s an example of a problem:

When triggering a build (which triggers a scale-up), I can see this in the debug log:

{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:54:37Z"}
{"error":"Cannot connect to the Docker daemon at https://10.11.30.109:2376. Is the docker daemon running?","ip":"10.11.30.109","level":"debug","msg":"cannot connect, retry in 1m0s","name":"agent-VxggIJ3f","time":"2020-10-19T06:54:37Z"}

A little later it seems to be able to contact the spun-up server, and issues:

{"image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"debug","msg":"pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}

Then I can see this in the log:

{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:55:37Z"}
{"error":"error during connect: Post \"https://10.11.30.109:2376/v1.24/images/create?fromImage=drone%2Fdrone-runner-docker\u0026tag=1\": EOF","image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"error","msg":"cannot pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}

At this point, the process seems broken: the autoscaler doesn’t seem to make any attempt to retry the operation, even though there’s still a pending build.

It would seem to me that any piece of code that deals with spinning up VMs, waiting for them to become reachable, etc. needs very robust retrying, because so many things can happen that are outside the control of the system.

Alas, this means that even if we switch from kube-runner to the autoscaler, it still seems we have to implement custom “janitors” to ensure that problems are dealt with.

Troubleshooting further: it doesn’t seem like drone-autoscaler syncs the “actual” state from AWS at any point, so if I delete a failed node outside of drone-autoscaler, it will never correct its own “server-count” metric. Is this correct? It would seem to me that it would be a good idea to periodically reconcile the assumed state against the actual state retrieved from AWS.

Just for anyone else arriving here:
We got around the instabilities by creating a custom cloud-init file (we’re using AWS EC2) that uses firewall rules to ensure that Drone can’t connect until the docker daemon is completely ready:

      runcmd:
      # Block the Docker TCP port until the daemon is fully ready, so the
      # autoscaler cannot connect to a half-initialised engine.
      - ufw default allow outgoing
      - ufw default allow incoming
      - ufw deny 2376
      - ufw enable
      # Reload systemd units, restart docker, and warm the runner image
      # while the port is still blocked.
      - [ systemctl, daemon-reload ]
      - [ systemctl, restart, docker ]
      - docker pull drone/drone-runner-docker
      # Only now expose the Docker API to the autoscaler.
      - ufw allow 2376

This seems to have helped.

The system does retry connecting to the daemon before running a docker pull. You can analyze the retry logic here: https://github.com/drone/autoscaler/blob/master/engine/install.go#L117:L141

Hi,
Yes, the first “connect” is retried. However, if any of the subsequent commands (such as the docker pull) fail, the agent will simply sit there and manual intervention seems to be required.

At least in our environment, when using the default cloud-init against Ubuntu 20.x VMs, the system gets into this unrecoverable state quite often.

It might be worth considering wrapping all commands against the runner’s docker engine in a retry/backoff loop, not just the initial connect.
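
For illustration, here is the kind of wrapper I have in mind; the helper name, attempt count and delay are hypothetical, not taken from the autoscaler code:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // retry runs fn until it succeeds, the attempt limit is reached, or the
    // context is cancelled. A fixed delay is used here; an exponential or
    // jittered backoff would slot into the same place.
    func retry(ctx context.Context, attempts int, delay time.Duration, fn func(context.Context) error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = fn(ctx); err == nil {
                return nil
            }
            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return err
    }

    func main() {
        // Stand-in for any call against the runner's docker engine
        // (connect, ping, pull, create, start, ...).
        flaky := func(ctx context.Context) error {
            return errors.New("docker daemon not ready yet")
        }
        err := retry(context.Background(), 5, 2*time.Second, flaky)
        fmt.Println("final result:", err)
    }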

we have not had any issues with docker pull reliability (we use the autoscaler at cloud.drone.io), but if your system cannot reliably pull docker images, then certainly it would cause problems. We would definitely accept a pull request that performs the docker pull using a backoff.

Okay. If you look at the log in the OP (3rd log entry), it’s not a pull error; the error relates to connecting to the daemon as part of issuing the pull request.

However, if we’re the only ones with the problem I’m happy to just leave it - we wrote a cloud-init that works for us, so all good.

Have you looked at the “reaper” options? Those may assist you with the remediation of instances you’ve manually removed. Alternatively, when you remove an instance manually, you could remove it from the database as well.

This is my response to comments made in “Can I use volumes in Drone Cloud in order to map /var/run/docker.sock?”

please consider sending a pull request that retries the docker pull, on failure, using a limited backoff.

drone does have an option to ping nodes and then clean up unhealthy ones. You need to enable the following feature flags to get this behavior:

DRONE_ENABLE_PINGER=true
DRONE_ENABLE_REAPER=true

this enables the pinger, which pings each instance every N minutes to check whether or not it is healthy. If an instance cannot be pinged, it is placed in an errored state. The reaper then periodically cleans up instances marked as errored.

Thanks! Is that a ping as in “send an ICMP packet”, or a ping as in “try to exercise docker by making some docker API call”?

If by “ping” you mean an actual network ping (ICMP): since the autoscaler already “knows about” the underlying cloud environment, why not use the cloud provider’s API to determine health? A “network ping” doesn’t really say anything about the actual health of an instance.

btw sorry for the duplicate post, and thanks for cleaning up.

we’re seeing now that while our custom userdata script works, it’s a complexity we’d very much like to avoid.

we run a docker ping, which tells us whether or not the instance is in a healthy state (network available, etc.) and whether or not the docker daemon is healthy and able to process requests. You can learn more by inspecting the pinger code.
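
To illustrate the difference from an ICMP ping, here is a minimal sketch of a docker-level ping using the Docker Go SDK; the host address is just an example, TLS options are omitted, and this is not the actual pinger code:

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/docker/docker/client"
    )

    // pingDaemon hits the Docker Engine API's /_ping endpoint on the remote
    // instance. Unlike an ICMP ping, it only succeeds if the daemon itself is
    // up and able to answer API requests.
    func pingDaemon(ctx context.Context, host string) error {
        cli, err := client.NewClientWithOpts(
            client.WithHost(host), // e.g. "tcp://10.11.30.109:2376"
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            return err
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
        defer cancel()

        _, err = cli.Ping(ctx)
        return err
    }

    func main() {
        if err := pingDaemon(context.Background(), "tcp://10.11.30.109:2376"); err != nil {
            fmt.Println("instance unhealthy:", err)
            return
        }
        fmt.Println("docker daemon is healthy")
    }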

we would accept a pull request that retries the docker pull, on failure, using a limited backoff [1]. we have no immediate plans to add this capability due to our large backlog, so sending a pull request will be the fastest way to expedite this feature request.

[1] https://github.com/drone/autoscaler/blob/master/engine/install.go#L150
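
For anyone picking this up, here is roughly the shape such a change could take; this is a sketch against the Docker Go SDK, the attempt count and delay are placeholders, and it is not a drop-in patch for the linked code:

    package main

    import (
        "context"
        "fmt"
        "io"
        "time"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"
    )

    // pullWithBackoff retries the image pull a limited number of times,
    // sleeping between attempts, instead of giving up on the first error.
    func pullWithBackoff(ctx context.Context, cli *client.Client, image string) error {
        const attempts = 5
        var err error
        for i := 0; i < attempts; i++ {
            var rc io.ReadCloser
            rc, err = cli.ImagePull(ctx, image, types.ImagePullOptions{})
            if err == nil {
                // The pull only completes once the response stream is drained.
                _, err = io.Copy(io.Discard, rc)
                rc.Close()
                if err == nil {
                    return nil
                }
            }
            if i == attempts-1 {
                break
            }
            // Fixed one-minute pause, mirroring the existing connect retry;
            // an exponential backoff would also be reasonable.
            select {
            case <-time.After(time.Minute):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return fmt.Errorf("cannot pull %s after %d attempts: %w", image, attempts, err)
    }

    func main() {
        // Example remote daemon address; TLS options omitted for brevity.
        cli, err := client.NewClientWithOpts(
            client.WithHost("tcp://10.11.30.109:2376"),
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        if err := pullWithBackoff(context.Background(), cli, "drone/drone-runner-docker:1"); err != nil {
            fmt.Println(err)
        }
    }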

I think more robustness is needed than just retrying docker pulls. I would propose logic along the lines of the following (a rough sketch in Go follows the list):

  • provision instance
  • wait until docker responds (this can be a simple port-test)
  • retry until 3 successful docker pings can be made with a 2-second pause between them (this would handle situations where the docker daemon needs a restart right after it’s installed); limit this retry loop to 20 iterations (configurable?), and fail the provisioning if the loop limit is reached
  • pull an image to ensure everything works (again, with a retry using a configurable iteration limit), and fail the provisioning if the loop limit is reached
  • at this point, the instance is healthy and can accept jobs
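
Purely to illustrate the health-check part of the above, a rough sketch using the Docker Go SDK; the limits, address and function names are made up, TLS setup is omitted, and the image-pull step with its own retry limit would follow the same pattern:

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"

        "github.com/docker/docker/client"
    )

    // healthCheck waits for the Docker port to open, then requires three
    // consecutive successful daemon pings (2 seconds apart) within a bounded
    // number of iterations before declaring the instance healthy.
    func healthCheck(ctx context.Context, addr string) error {
        const (
            maxIterations = 20
            pause         = 2 * time.Second
            required      = 3
        )

        // Step 1: simple port test, so we don't ping a box that is still booting.
        portOpen := false
        for i := 0; i < maxIterations; i++ {
            conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
            if err == nil {
                conn.Close()
                portOpen = true
                break
            }
            time.Sleep(pause)
        }
        if !portOpen {
            return fmt.Errorf("port on %s never opened", addr)
        }

        cli, err := client.NewClientWithOpts(
            client.WithHost("tcp://"+addr), // TLS options omitted for brevity
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            return err
        }
        defer cli.Close()

        // Step 2: demand several consecutive successful pings, which covers the
        // case where the daemon restarts shortly after installation.
        consecutive := 0
        for i := 0; i < maxIterations; i++ {
            if _, err := cli.Ping(ctx); err != nil {
                consecutive = 0
            } else {
                consecutive++
                if consecutive >= required {
                    return nil
                }
            }
            time.Sleep(pause)
        }
        return fmt.Errorf("docker on %s never became stable", addr)
    }

    func main() {
        if err := healthCheck(context.Background(), "10.11.30.109:2376"); err != nil {
            fmt.Println("provisioning failed:", err)
            return
        }
        fmt.Println("instance is healthy and can accept jobs")
        // The image pull, with its own bounded retry, would run here.
    }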

This is pretty much how our logic works today (provision instance, wait until docker responds, pull an image, mark the instance healthy and ready to accept jobs). The only thing we are missing is image pull retries on failure.

Can you please elaborate on this? I have never seen a situation where docker is installed, started, and then immediately re-started on a Debian or Ubuntu OS.

EDIT: taking a step back, I think that a simple patch to auto-retry pulling the image may be sufficient and is a good next step. Let’s pause the discussion and resume once you have had a chance to send us a patch and test it in production. If we find there are still failure modes that are not addressed, we can discuss and address them in follow-up patches.