Improvements to autoscaler to automatically re-attempt failed pull [pending patch]

We’re trying to replace kube-runner with Drone Autoscaler (using AWS EC2), but are running into problems. It seems to me like there’s very little retry logic in the autoscaler code, which is what causes the problems.
Here’s an example of a problem:

When triggering a build (which triggers a scale-up), I can see this in the debug log:

{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:54:37Z"}
{"error":"Cannot connect to the Docker daemon at https://10.11.30.109:2376. Is the docker daemon running?","ip":"10.11.30.109","level":"debug","msg":"cannot connect, retry in 1m0s","name":"agent-VxggIJ3f","time":"2020-10-19T06:54:37Z"}

A little later it seems to be able to contact the spun-up server, and issues:

{"image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"debug","msg":"pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}

Then I can see this in the log:

{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:55:37Z"}
{"error":"error during connect: Post \"https://10.11.30.109:2376/v1.24/images/create?fromImage=drone%2Fdrone-runner-docker\u0026tag=1\": EOF","image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"error","msg":"cannot pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}

At this point, the process seems broken: the autoscaler doesn’t seem to make any attempt to retry the operation, even though there’s still a pending build.

It would seem to me that any piece of code that deals with spinning up VMs, waiting for them to become reachable, etc. needs very robust retrying, because so many things can happen that are outside the control of the system.

Alas, this means that even if we switch from kube-runner to the autoscaler, it still seems we have to implement custom “janitors” to ensure that problems are dealt with.

Troubleshooting further: it doesn’t seem like drone-autoscaler syncs the “actual” state from AWS at any point, so if I delete a failed node outside of drone-autoscaler, it will never correct its own “server-count” metric. Is this correct? It would seem to me that it would be a good idea to periodically reconcile the assumed state against the actual state retrieved from AWS.

Just for anyone else arriving here:
We got around the instabilities by creating a custom cloud-init file (we’re using AWS EC2) that uses firewall rules to ensure that Drone can’t connect until the docker daemon is completely ready:

      runcmd:
      # Block the Docker TCP port until the daemon is fully ready, so the
      # autoscaler cannot connect to a half-initialised engine.
      - ufw default allow outgoing
      - ufw default allow incoming
      - ufw deny 2376
      - ufw enable
      # Reload systemd units, restart docker, and warm the runner image
      # while the port is still blocked.
      - [ systemctl, daemon-reload ]
      - [ systemctl, restart, docker ]
      - docker pull drone/drone-runner-docker
      # Only now expose the Docker API to the autoscaler.
      - ufw allow 2376

This seems to have helped.

The system does retry connecting to the daemon before running a docker pull. You can analyze the retry logic here: https://github.com/drone/autoscaler/blob/master/engine/install.go#L117:L141

Hi,
Yes, the first “connect” is retried. However, if any of the subsequent commands (such as the docker pull) fail, the agent will simply sit there and manual intervention seems to be required.

At least in our environment, when using the default cloud-init against Ubuntu 20.x VMs, the system gets into this unrecoverable state quite often.

It might be worth considering wrapping all commands against the runner’s docker engine in a retry/backoff loop, not just the initial connect.
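
For illustration, here is the kind of wrapper I have in mind; the helper name, attempt count and delay are hypothetical, not taken from the autoscaler code:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // retry runs fn until it succeeds, the attempt limit is reached, or the
    // context is cancelled. A fixed delay is used here; an exponential or
    // jittered backoff would slot into the same place.
    func retry(ctx context.Context, attempts int, delay time.Duration, fn func(context.Context) error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = fn(ctx); err == nil {
                return nil
            }
            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return err
    }

    func main() {
        // Stand-in for any call against the runner's docker engine
        // (connect, ping, pull, create, start, ...).
        flaky := func(ctx context.Context) error {
            return errors.New("docker daemon not ready yet")
        }
        err := retry(context.Background(), 5, 2*time.Second, flaky)
        fmt.Println("final result:", err)
    }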

we have not had any issues with docker pull reliability (we use the autoscaler at cloud.drone.io), but if your system cannot reliably pull docker images, then certainly it would cause problems. We would definitely accept a pull request that performs the docker pull using a backoff.

Okay. If you look at the log in the OP (3rd log entry), it’s not a pull error; the error relates to connecting to the daemon as part of issuing the pull request.

However, if we’re the only ones with the problem I’m happy to just leave it - we wrote a cloud-init that works for us, so all good.

Have you looked at the “reaper” options? Those may assist you with the remediation of instances you’ve manually removed. Alternatively, when you remove an instance manually, you could remove it from the database as well.

This is my response to comments made in “Can I use volumes in Drone Cloud in order to map /var/run/docker.sock?”

please consider sending a pull request that retries the docker pull, on failure, using a limited backoff.

drone does have an option to ping nodes and then clean up unhealthy ones. You need to enable the following feature flags to get this behavior:

DRONE_ENABLE_PINGER=true
DRONE_ENABLE_REAPER=true

this enables the pinger, which pings each instance every N minutes to check whether or not it is healthy. If an instance cannot be pinged, it is placed in an errored state. The reaper then periodically cleans up instances marked as errored.

Thanks! Is that a ping as in “send an ICMP packet”, or a ping as in “try to exercise docker by making some docker API call”?

If by “ping” you mean an actual network ping (ICMP): since the autoscaler already “knows about” the underlying cloud environment, why not use the cloud provider’s API to determine health? A “network ping” doesn’t really say anything about the actual health of an instance.

btw sorry for the duplicate post, and thanks for cleaning up.

we’re seeing now that while our custom userdata script works, it’s a complexity we’d very much like to avoid.

we run a docker ping, which tells us whether or not the instance is in a healthy state (network available, etc.) and whether or not the docker daemon is healthy and able to process requests. You can learn more by inspecting the pinger code.
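
To illustrate the difference from an ICMP ping, here is a minimal sketch of a docker-level ping using the Docker Go SDK; the host address is just an example, TLS options are omitted, and this is not the actual pinger code:

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/docker/docker/client"
    )

    // pingDaemon hits the Docker Engine API's /_ping endpoint on the remote
    // instance. Unlike an ICMP ping, it only succeeds if the daemon itself is
    // up and able to answer API requests.
    func pingDaemon(ctx context.Context, host string) error {
        cli, err := client.NewClientWithOpts(
            client.WithHost(host), // e.g. "tcp://10.11.30.109:2376"
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            return err
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
        defer cancel()

        _, err = cli.Ping(ctx)
        return err
    }

    func main() {
        if err := pingDaemon(context.Background(), "tcp://10.11.30.109:2376"); err != nil {
            fmt.Println("instance unhealthy:", err)
            return
        }
        fmt.Println("docker daemon is healthy")
    }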

we would accept a pull request that retries the docker pull, on failure, using a limited backoff [1]. we have no immediate plans to add this capability due to our large backlog, so sending a pull request will be the fastest way to expedite this feature request.

[1] https://github.com/drone/autoscaler/blob/master/engine/install.go#L150
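
For anyone picking this up, here is roughly the shape such a change could take; this is a sketch against the Docker Go SDK, the attempt count and delay are placeholders, and it is not a drop-in patch for the linked code:

    package main

    import (
        "context"
        "fmt"
        "io"
        "time"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/client"
    )

    // pullWithBackoff retries the image pull a limited number of times,
    // sleeping between attempts, instead of giving up on the first error.
    func pullWithBackoff(ctx context.Context, cli *client.Client, image string) error {
        const attempts = 5
        var err error
        for i := 0; i < attempts; i++ {
            var rc io.ReadCloser
            rc, err = cli.ImagePull(ctx, image, types.ImagePullOptions{})
            if err == nil {
                // The pull only completes once the response stream is drained.
                _, err = io.Copy(io.Discard, rc)
                rc.Close()
                if err == nil {
                    return nil
                }
            }
            if i == attempts-1 {
                break
            }
            // Fixed one-minute pause, mirroring the existing connect retry;
            // an exponential backoff would also be reasonable.
            select {
            case <-time.After(time.Minute):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return fmt.Errorf("cannot pull %s after %d attempts: %w", image, attempts, err)
    }

    func main() {
        // Example remote daemon address; TLS options omitted for brevity.
        cli, err := client.NewClientWithOpts(
            client.WithHost("tcp://10.11.30.109:2376"),
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        if err := pullWithBackoff(context.Background(), cli, "drone/drone-runner-docker:1"); err != nil {
            fmt.Println(err)
        }
    }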

I think more robustness is needed than just retrying docker pulls. I would propose logic along the lines of the following (a rough sketch in Go follows the list):

  • provision instance
  • wait until docker responds (this can be a simple port-test)
  • retry until 3 successful docker pings can be made with a 2-second pause between them (this would handle situations where the docker daemon needs a restart right after it’s installed); limit this retry loop to 20 iterations (configurable?), and fail the provisioning if the loop limit is reached
  • pull an image to ensure everything works (again, with a retry using a configurable iteration limit), and fail the provisioning if the loop limit is reached
  • at this point, the instance is healthy and can accept jobs
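
Purely to illustrate the health-check part of the above, a rough sketch using the Docker Go SDK; the limits, address and function names are made up, TLS setup is omitted, and the image-pull step with its own retry limit would follow the same pattern:

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"

        "github.com/docker/docker/client"
    )

    // healthCheck waits for the Docker port to open, then requires three
    // consecutive successful daemon pings (2 seconds apart) within a bounded
    // number of iterations before declaring the instance healthy.
    func healthCheck(ctx context.Context, addr string) error {
        const (
            maxIterations = 20
            pause         = 2 * time.Second
            required      = 3
        )

        // Step 1: simple port test, so we don't ping a box that is still booting.
        portOpen := false
        for i := 0; i < maxIterations; i++ {
            conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
            if err == nil {
                conn.Close()
                portOpen = true
                break
            }
            time.Sleep(pause)
        }
        if !portOpen {
            return fmt.Errorf("port on %s never opened", addr)
        }

        cli, err := client.NewClientWithOpts(
            client.WithHost("tcp://"+addr), // TLS options omitted for brevity
            client.WithAPIVersionNegotiation(),
        )
        if err != nil {
            return err
        }
        defer cli.Close()

        // Step 2: demand several consecutive successful pings, which covers the
        // case where the daemon restarts shortly after installation.
        consecutive := 0
        for i := 0; i < maxIterations; i++ {
            if _, err := cli.Ping(ctx); err != nil {
                consecutive = 0
            } else {
                consecutive++
                if consecutive >= required {
                    return nil
                }
            }
            time.Sleep(pause)
        }
        return fmt.Errorf("docker on %s never became stable", addr)
    }

    func main() {
        if err := healthCheck(context.Background(), "10.11.30.109:2376"); err != nil {
            fmt.Println("provisioning failed:", err)
            return
        }
        fmt.Println("instance is healthy and can accept jobs")
        // The image pull, with its own bounded retry, would run here.
    }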

This is pretty much how our logic works today (provision instance, wait until docker responds, pull an image, mark the instance healthy and ready to accept jobs). The only thing we are missing is image pull retries on failure.

Can you please elaborate on this? I have never seen a situation where docker is installed, started, and then immediately re-started on a Debian or Ubuntu OS.

EDIT: taking a step back, I think that a simple patch to auto-retry pulling the image may be sufficient and is a good next step. Let’s pause the discussion and resume once you have had a chance to send us a patch and test it in production. If we find there are still failure modes that are not addressed, we can discuss and address them in follow-up patches.