We have an issue with agents spinning up and reporting an error.
Via the autoscaler CLI commands I can see
‘error during connect: Post “https://{IPADDRESS}:2376/v1.40/images/create?fromImage=drone%2Fdrone-runner-docker&tag=1”: EOF’
I can ssh into the agents, and can’t really see anything wrong with them, although I’m not sure what I’m meant to look for. Docker does seem to be running on them.
It’s been a while since I touched the setup, but we host our agents in AWS, and they have a custom cloud-init that loads in Docker Hub credentials so they don’t get killed by rate limiting. Pretty sure it’s the template from the docs with a config file jammed in.
It’d be nice to fix this because those broken agents sit around doing nothing, counting against our maximum pool size, until the autoscaler kills them (we’ve enabled the autoscaler reaper feature).
One root cause we observed in the past is the following …
You choose a base AMI where the Linux distro / package manager is configured to auto-upgrade packages on startup. As a result, when the autoscaler provisions a VM and it boots, the Docker daemon starts and then the package manager immediately stops Docker for the upgrade, making the daemon unreachable for a period of time … when you manually log in you see the Docker daemon running, because the upgrade has since completed.
I believe Amazon Linux is the biggest offender, but other distros may be impacted as well. This can be solved by disabling auto-upgrades in the base AMI. Here is a link to a past discussion, for reference: https://github.com/drone/autoscaler/issues/108.
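For Ubuntu-based images, disabling the periodic unattended upgrade can look something like the sketch below (stock apt file layout assumed). Ideally this is baked into the base AMI rather than applied at boot, since cloud-init itself can race the upgrade service.

#cloud-config
# Sketch: disable apt's periodic unattended upgrades so the Docker daemon
# is not stopped for a package upgrade right after the instance boots.
# More reliable if this file is already present in the base AMI.
write_files:
  - path: /etc/apt/apt.conf.d/20auto-upgrades
    permissions: '0644'
    content: |
      APT::Periodic::Update-Package-Lists "0";
      APT::Periodic::Unattended-Upgrade "0";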
We were using the default AMI, which I thought would have been fine, but maybe because we’re using a custom cloud-init to inject the Docker Hub credentials, that’s changed something.
After we updated to Ubuntu 22.04 we’re getting the same error as in the post you linked, Brad, but much less frequently.
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Several months later, I’ve finally had the opportunity to look into this some more.
The Docker service is indeed restarting when an agent is created.
We’re using a custom cloud_init.yaml, basically identical to this one, to pass the Docker Hub credentials in, so that we can pull images without going over the Docker Hub rate limits.
At the bottom of that, there are a couple of runcmd commands that reload the Docker systemd unit files and restart Docker, which is probably what’s causing my issues.
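For anyone hitting this later, the relevant tail of that template looks something like this (a sketch, not verbatim; the auth value is a placeholder):

write_files:
  # Docker Hub credentials so image pulls aren't rate limited
  - path: /root/.docker/config.json
    permissions: '0600'
    content: |
      { "auths": { "https://index.docker.io/v1/": { "auth": "<base64 user:pass>" } } }
runcmd:
  - [ systemctl, daemon-reload ]
  - [ systemctl, restart, docker ]   # <- this restart races the agent bootstrap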
I encountered the same issue for many months. I tried with Ubuntu 20.04 and 22.04 AMIs.
After reading the logs on an autoscaled instance, I saw that Docker was restarted twice.
I guess the problem looks random because sometimes the server first reaches the agent after the second Docker restart (no problem in that case), and sometimes it reaches the agent before the second restart, and the software never tries to open a new connection to the agent afterwards.
So to fix this issue, I used the following tutorial to create a custom init configuration:
I did roughly the same, Eowin, except I found that I didn’t seem to need to run a start/restart command at all, and it still loads the Docker Hub credentials fine.
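Concretely, the only Docker-related part left in my cloud-init is along these lines (a sketch; the auth value is a placeholder):

#cloud-config
write_files:
  # the daemon reads this lazily on pull, so no restart is needed for it
  - path: /root/.docker/config.json
    permissions: '0600'
    content: |
      { "auths": { "https://index.docker.io/v1/": { "auth": "<base64 user:pass>" } } }
# note: no runcmd entry that restarts docker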
I’ve seen agents come up broken a couple of times since, but it happens much less often than it used to.
I just changed the restart to start to make sure Docker is started. That is maybe the reason you still have the occasional broken agent (to confirm); after ~90 jobs I haven’t encountered the issue.
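In other words, the only Docker-related change in my runcmd is roughly:

runcmd:
  - [ systemctl, daemon-reload ]
  # start instead of restart: a no-op if docker is already running,
  # so the agent bootstrap is never interrupted
  - [ systemctl, start, docker ]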
For the Docker Hub credentials, I created an organization secret called “dockerconfigjson” containing the JSON that you find in your ~/.docker/config.json.
In my .drone.yml I reference it so the runner can log in to the Docker registry:
image_pull_secrets:
- dockerconfigjson
I think this solution is more secure than having cleartext credentials in the configuration file.
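For reference, the pipeline layout looks roughly like this (step name and image are placeholders; image_pull_secrets is the documented Drone keyword):

kind: pipeline
type: docker
name: default

# pull private / rate-limited images using the org secret above
image_pull_secrets:
  - dockerconfigjson

steps:
  - name: build
    image: my-org/private-image:latest
    commands:
      - make build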
Thanks! This also fixed the issues I experienced with Hetzner.
Seems like a race condition: Docker is already started on the host (the default), the autoscaler pulls the agent image and tries to start the container, but then the cloud-init restart of the Docker daemon interrupts the container startup and the autoscaler never recovers.
So yes, it should be systemctl start docker by default if we can ensure that the override.conf of the Docker service is already in place. Or maybe force Docker not to autostart after install; then the agent would simply retry until systemctl restart docker has been invoked.
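A sketch of how that ordering could look, relying on cloud-init’s write_files module running before runcmd so the override is already in place when the daemon is started (unit content and cert paths are placeholders along the lines of the docs template):

#cloud-config
write_files:
  # systemd drop-in so dockerd also listens on tcp/2376 with TLS;
  # written by write_files, which runs before runcmd
  - path: /etc/systemd/system/docker.service.d/override.conf
    permissions: '0644'
    content: |
      [Service]
      ExecStart=
      ExecStart=/usr/bin/dockerd -H unix:// -H tcp://0.0.0.0:2376 --tlsverify --tlscacert=/etc/docker/ca.pem --tlscert=/etc/docker/server-cert.pem --tlskey=/etc/docker/server-key.pem
runcmd:
  - [ systemctl, daemon-reload ]
  - [ systemctl, start, docker ]   # start, not restart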
In general it would be nice if the autoscaler were able to recover from failed image pulls/starts.