[duplicate] Make autoscaler more robust

From time to time, I’m getting this:


This happens when drone-autoscaler has just scaled up because of additional build jobs.

I can work around this with a custom userData block:

    # use firewall to disable access to docker until it has restarted and has been able to pull an image
    runcmd:
      - ufw default allow outgoing
      - ufw default allow incoming
      - ufw deny 2376
      - echo activating firewall
      - ufw enable
      - apt-get install -o Dpkg::Options::="--force-confold" --force-yes -y docker-ce # custom docker config (daemon.json) is already in place; these options make sure installing docker-ce doesn't overwrite it
      - docker pull drone/drone-runner-docker
      - echo sleeping for 30 secs
      - sleep 30
      - echo opening firewall
      - ufw allow 2376

We inject this using the DRONE_AMAZON_USERDATA_FILE environment variable.
This allows Docker to be installed without overwriting our config (daemon.json), start, and perform a docker pull before the instance becomes available to Drone.
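
For context, we wire this up roughly like the following. This is illustrative, not our exact setup: the mount path is arbitrary and the remaining required autoscaler settings (provider credentials, server secret, etc.) are omitted.

    # illustrative only – other required autoscaler settings omitted
    docker run -d \
      --name drone-autoscaler \
      --volume /etc/drone/userdata.yml:/config/userdata.yml:ro \
      --env DRONE_AMAZON_USERDATA_FILE=/config/userdata.yml \
      drone/autoscaler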

It would be good if more robustness were built into drone-autoscaler itself, so that we didn’t have to do this. For example, it could perform a docker pull (with appropriate retry logic) and only mark the runner as “ready for service” once that succeeds, along the lines of the sketch below.
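
To sketch the behaviour we mean (expressed as shell purely for illustration; the image name matches what we pull in the userdata above, and the attempt count and sleep interval are arbitrary):

    # sketch of the retry-then-ready behaviour we'd like built in
    ready=false
    for attempt in 1 2 3 4 5; do
      if docker pull drone/drone-runner-docker; then
        ready=true
        break
      fi
      echo "docker pull failed (attempt ${attempt}), retrying in 10s"
      sleep 10
    done
    if [ "${ready}" = "true" ]; then
      echo "runner image present, safe to mark instance as ready"
    else
      echo "giving up, instance should not be marked healthy" >&2
      exit 1
    fi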

I think we discussed this in another thread and the maintainer said they’d never experienced issues with drone runners coming online, but we’re seeing it quite frequently; it seems to me that drone-autoscaler simply marks the node as healthy too soon. We’d very much like to not have to maintain a custom userdata config.

Just to note: we’ve also seen other situations where an EC2 instance becomes unhealthy, but Drone has no way of checking for and discarding unhealthy nodes. We therefore run a “drone-autoscaler-janitor” in a separate process that uses Drone’s API to compare AWS EC2 instance status with what’s found in Drone. This has been necessary in order to discard provisioning failures and the like. It would be awesome if drone-autoscaler had this “unhappy path” logic built in.
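
Conceptually the janitor is just the following. This is a minimal sketch, not our actual code: the autoscaler URL and token are placeholders, and the GET /api/servers and DELETE /api/servers/&lt;name&gt; endpoints, the bearer-token auth, and the “name”/“id” JSON fields are assumptions to verify against your autoscaler version.

    #!/bin/sh
    # Minimal janitor sketch. ASSUMPTIONS: the autoscaler exposes GET /api/servers
    # and DELETE /api/servers/<name> behind bearer-token auth, and each server JSON
    # object carries "name" plus the EC2 instance id in "id" -- verify before use.
    AUTOSCALER_URL="https://autoscaler.example.com"   # placeholder
    TOKEN="${DRONE_AUTOSCALER_TOKEN}"                 # placeholder

    curl -sf -H "Authorization: Bearer ${TOKEN}" "${AUTOSCALER_URL}/api/servers" \
      | jq -r '.[] | "\(.name) \(.id)"' \
      | while read -r name instance_id; do
          state=$(aws ec2 describe-instances --instance-ids "${instance_id}" \
            --query 'Reservations[0].Instances[0].State.Name' --output text 2>/dev/null)
          case "${state}" in
            running|pending) ;;   # looks healthy, leave it alone
            *)
              echo "server ${name} (${instance_id}) is '${state:-gone}', removing it"
              curl -sf -X DELETE -H "Authorization: Bearer ${TOKEN}" \
                "${AUTOSCALER_URL}/api/servers/${name}"
              ;;
          esac
        done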

It looks like there is an existing thread for this topic: “Autoscaler too brittle?”. Let’s move the discussion to that existing thread; I will reply to your messages above there.