We’re trying to replace kube-runner with Drone Autoscaler (on AWS EC2), but are running into problems. It seems to me that there’s very little retry logic in the autoscaler code, and that a single transient failure is enough to stall it.
Here’s an example of a problem:
When triggering a build (which triggers a scale-up), I can see this in the debug log:
{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:54:37Z"}
{"error":"Cannot connect to the Docker daemon at https://10.11.30.109:2376. Is the docker daemon running?","ip":"10.11.30.109","level":"debug","msg":"cannot connect, retry in 1m0s","name":"agent-VxggIJ3f","time":"2020-10-19T06:54:37Z"}
A minute later it seems to be able to contact the spun-up server, and starts pulling the runner image:
{"image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"debug","msg":"pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}
Then I can see this in the log:
{"level":"debug","module":"api","msg":"FIXME: Got an status-code for which error does not match any expected type!!!: -1","status_code":"-1","time":"2020-10-19T06:55:37Z"}
{"error":"error during connect: Post \"https://10.11.30.109:2376/v1.24/images/create?fromImage=drone%2Fdrone-runner-docker\u0026tag=1\": EOF","image":"drone/drone-runner-docker:1","ip":"10.11.30.109","level":"error","msg":"cannot pull docker image","name":"agent-VxggIJ3f","time":"2020-10-19T06:55:37Z"}
At this point, the process seems broken - the autoscaler doesn’t seem to make any attempt to retry the operation, even though there’s still a pending build.
It seems to me that any piece of code that deals with spinning up VMs, waiting for them to become connectable, etc. needs very robust retrying, because so many things can happen that are outside the control of the system.
Alas, this means that even if we switch from kube-runner to the autoscaler, it seems we would still have to implement custom “janitors” to ensure that problems like this are dealt with.