Troubleshooting the Autoscaler

This section will help triage common autoscaler issues.

In our experience, autoscaler issues are usually related to configuration. The autoscaler is also temperamental, so arriving at a working configuration can take a good amount of trial and error. To triage an autoscaler issue we need to see your configuration details and logs. Please provide the following:

  1. Provide your full server configuration.
  2. Provide your full autoscaler configuration.
  3. Provide your full autoscaler logs, with debug logging enabled.
  4. Provide your full runner logs, with trace logging enabled, if you are having build issues.
  5. Provide your full YAML, if you are having build issues.
  6. Provide screenshots, if you see build errors in the user interface.
  7. Do not redact domain names or hostnames.

Note that to provide runner logs you will need to SSH into the runner server and run the docker logs command. Depending on your firewall rules, you may instead be able to connect to the remote machine from your laptop and retrieve the docker logs using the Drone command line tools. See https://autoscale.drone.io/cli/commands/drone-server-env/
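The SSH route described above can be sketched as a small shell helper. This is only an illustration: the function name collect_runner_logs, the example address, and the container name agent are placeholders rather than values from this guide, so list the containers on the runner first to find the right one.

```shell
# Sketch: copy a runner container's logs to a local file over SSH.
# Arguments (all hypothetical placeholders):
#   $1  SSH target, e.g. root@203.0.113.10
#   $2  runner container name or ID (find it with: ssh "$1" docker ps)
#   $3  local output file
collect_runner_logs() {
  # docker logs replays the container's stdout and stderr, so capture both.
  ssh "$1" docker logs "$2" > "$3" 2>&1
}

# Usage (commented out so that sourcing this file has no side effects):
# collect_runner_logs root@203.0.113.10 agent runner.log
```

The resulting file can then be attached as-is, without redacting hostnames.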

Enabling Trace Runner Logs

To enable trace-level logging for the runner, pass the following parameter to your autoscaler. Note that this only enables trace logging for runners created afterwards, not for existing runners. Disable trace logging once troubleshooting is complete.

DRONE_AGENT_ENVIRON=DRONE_TRACE=true
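
As a sketch of where this parameter might live, the fragment below is a hypothetical env file (the name autoscaler.env is an assumption) of the kind that can be handed to the autoscaler container with docker's --env-file flag. DRONE_LOGS_DEBUG is included on the assumption that it is the autoscaler's own debug-logging switch; please confirm it against the autoscaler configuration reference before relying on it.

```shell
# autoscaler.env (hypothetical file name)
# Assumed switch for the autoscaler's own debug logs; verify before use.
DRONE_LOGS_DEBUG=true
# Forwarded to newly created runners to enable trace logging on them.
# Everything after the first "=" is the value, i.e. DRONE_TRACE=true.
DRONE_AGENT_ENVIRON=DRONE_TRACE=true
```

Remove the second line again once troubleshooting is complete, since it affects every runner created while it is set.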

Cleanup Errored Instances

Instances in an error state are not automatically terminated and require manual attention. This is intentional: we want to prevent infinite loops of creating and destroying instances, and situations where the system keeps creating instances without properly destroying them. Either could result in large server bills, especially with providers that bill by the hour.

You can override this default behavior and instruct the autoscaler to automatically terminate instances in an error state with the following setting. The cleanup routine runs every hour.

DRONE_ENABLE_REAPER=true

Detect Unreachable Instances

An instance may become unreachable due to problems with the underlying infrastructure, or because someone manually terminated it from the AWS, GCP, or Azure console without the autoscaler being aware.

The autoscaler can be configured to ping instances and check their health using the following setting. If the autoscaler cannot ping an instance, or the instance is unhealthy, the instance is placed in an error state.

DRONE_ENABLE_PINGER=true
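
Reading this section together with the previous one, the two settings appear to complement each other: the pinger moves unreachable instances into an error state, and the reaper terminates errored instances on its hourly run. A minimal sketch of enabling both, written as env-file style assignments:

```shell
# Mark instances that fail the ping or health check as errored.
DRONE_ENABLE_PINGER=true
# Terminate errored instances automatically (cleanup runs every hour).
DRONE_ENABLE_REAPER=true
```

Whether to enable the reaper alongside the pinger is a judgment call, given the billing concerns that led to manual cleanup being the default.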