Autoscaler Agents remain in Creating state

Hi,

We are seeing an issue with agents that remain in the Creating state when there is a burst in the number of agents created by the autoscaler.

If agents are created gradually (1-4 at a time) there is generally no issue, but if we go from our minimum to requesting 10+ agents within a parallelised build, we see that some agents remain indefinitely in the Creating state.

Logs on both the autoscaler and agent don't seem to show anything wrong: the VM in GCP gets created, but we can see there are no Docker processes running on it yet.

Any hint on where to look would be great. Our current theory is that the autoscaler is waiting for the agent to report some sort of status, or that there is a concurrency issue when multiple agents are requested at once.
We've been looking into this to try to understand what the expected response would be for the status to update and mark the agent as ready to start taking builds:

Please post the autoscaler logs with TRACE logging enabled. The logs will help us determine where in the process the autoscaler is waiting so that we can suggest possible root causes. If the logs are insufficient, we can add more.

This is not quite how it works. First, the autoscaler provisions the instance and then makes an API call to describe the instance and check the instance network. Once the instance is successfully provisioned, the status is changed from creating to staging.
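
For illustration, here is a rough sketch of what that describe-and-check step looks like against the GCP API. This is not the autoscaler's actual code; it assumes the google.golang.org/api/compute/v1 client, and the project, zone and instance names are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Uses application default credentials.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder project/zone/instance names.
	inst, err := svc.Instances.Get("my-project", "us-central1-a", "drone-agent-1").
		Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}

	// The autoscaler needs a reachable address before it can talk to Docker.
	for _, iface := range inst.NetworkInterfaces {
		fmt.Println("internal IP:", iface.NetworkIP)
		for _, ac := range iface.AccessConfigs {
			fmt.Println("external IP:", ac.NatIP)
		}
	}
	fmt.Println("instance status:", inst.Status) // e.g. PROVISIONING, STAGING, RUNNING
}
```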

Next, the autoscaler tries to ping the Docker daemon on the instance to verify it is initialized (using a backoff). Finally, once it is able to ping the instance, it installs and starts the agent (using docker create and docker start). Once the agent is successfully installed, the status is changed from staging to running.
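
A minimal sketch of that ping-with-backoff and install step, assuming the Docker Go SDK (github.com/docker/docker/client). The daemon address, agent image and retry policy below are placeholders, not the autoscaler's actual values:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// Placeholder daemon address on the newly created instance.
	cli, err := client.NewClientWithOpts(
		client.WithHost("tcp://10.0.0.5:2376"),
		client.WithAPIVersionNegotiation(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Ping the daemon with a simple backoff until it responds or we give up.
	var ready bool
	for attempt := 0; attempt < 10; attempt++ {
		if _, err := cli.Ping(ctx); err == nil {
			ready = true
			break
		}
		time.Sleep(time.Duration(attempt+1) * 5 * time.Second)
	}
	if !ready {
		log.Fatal("docker daemon never became reachable")
	}

	// Once the daemon answers, create and start the agent container
	// (roughly the "docker create" + "docker start" step described above).
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "drone/agent"}, // placeholder image
		nil, nil, nil, "agent")
	if err != nil {
		log.Fatal(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("agent container started:", resp.ID)
}
```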

Since you see the agent is stuck in the creating status, we can narrow this down to a problem with instance creation. It sounds like it might be stuck in the waitZoneOperation backoff. The backoff is subject to a 1 hour timeout, which ultimately propagates to the waitZoneOperation call through its context. The autoscaler performs this backoff until GCP indicates the operation has reached a status of DONE or until the API returns an error (which includes a timeout error).
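
To make the waiting behaviour concrete, here is a simplified sketch of polling a zone operation under a one hour context timeout. Again this assumes the google.golang.org/api/compute/v1 client; the fixed 10 second poll interval stands in for the autoscaler's real backoff, and the project, zone and operation names are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	// Bound the whole wait by one hour, mirroring the timeout described above.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder names.
	project, zone, opName := "my-project", "us-central1-a", "operation-123"

	for {
		op, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
		if err != nil {
			// An expired context surfaces here as an error, which is how
			// the one hour timeout ultimately ends the wait.
			log.Fatal(err)
		}
		if op.Status == "DONE" {
			if op.Error != nil {
				log.Fatalf("operation finished with errors: %+v", op.Error.Errors)
			}
			fmt.Println("operation done")
			return
		}
		time.Sleep(10 * time.Second) // crude fixed interval instead of a real backoff
	}
}
```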

The agent is ready once the autoscaler has successfully connected to the Docker daemon on the machine, executed a docker ping, and installed the agent using docker create and docker start. But as mentioned above, it sounds like you are not getting past the instance creation and verification step.

Thanks for the explanation - I have a quick question on this:

Please post the autoscaler logs with TRACE logging enabled. The logs will help us determine where in the process the autoscaler is waiting so that we can suggest possible root causes. If the logs are insufficient, we can add more.

Is trace enabled the same way as for the server, using DRONE_LOGS_TRACE=true? I see in the docs a reference to DRONE_LOGS_DEBUG but nothing specific for trace.