Hello, we’re experiencing intermittent issues with the Drone autoscaler (v1.7.5). There are pending jobs in Drone that should trigger a scale-up, but the scale-up does not occur, and the autoscaler stops reporting “check capacity complete” logs for a period of time. After this “blackout” period, there is a spike of what looks like a mass of stream cancellation/close/error messages, and the capacity-check logs resume. If there are still pending jobs once the logs begin appearing again, the autoscale proceeds as normal.
Apologies for the busy graph. The yellow line here is a canary that is searching for the “check capacity complete” log from the autoscaler. You can see that it stopped reporting slightly after 08:55, and the pending job count (green line) began to grow in this gap. Once it started reporting again after 09:12, the autoscaler began bringing up instances.
During this window, the “check capacity complete” message did not appear at all. At the end of the window, there was a noticeably higher-than-normal spike of log messages (500 or so, up from ~25 every 15 seconds).
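For completeness, the canary is essentially doing the following (a minimal sketch in Python, assuming the JSON logs have been exported to a local file we’re calling drone-logs.json here; the file name and the two-minute gap threshold are ours, not anything built into Drone):

import json
from datetime import datetime, timedelta

# Anything much longer than the autoscaler's normal polling interval is
# treated as a "blackout" in the capacity checks.
GAP_THRESHOLD = timedelta(minutes=2)

def capacity_check_times(path):
    """Yield timestamps of "check capacity complete" entries from a JSON log export."""
    with open(path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            if entry.get("msg") == "check capacity complete" and entry.get("time"):
                yield datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")

def find_gaps(times, threshold=GAP_THRESHOLD):
    """Return (start, end, duration) for every gap between capacity checks longer than the threshold."""
    gaps, prev = [], None
    for t in times:
        if prev is not None and t - prev > threshold:
            gaps.append((prev, t, t - prev))
        prev = t
    return gaps

if __name__ == "__main__":
    for start, end, duration in find_gaps(capacity_check_times("drone-logs.json")):
        print(f"blackout from {start} to {end} ({duration})")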
The log messages during this time fall into one of four types. Examples:
2021-11-08 09:12:51
{"fields.time":"2021-11-08T14:12:51Z","latency":24227812260498,"level":"debug","method":"GET","msg":"","remote":"10.122.103.135:64064","request":"/api/stream","request-id":"20cvA8ulD0KVvohQYQqtNm5qovl","time":"2021-11-08T14:12:51Z"}
2021-11-08 09:12:51
{"level":"debug","msg":"events: stream error","request-id":"20clN8Nogg8EV7HxUf4ZlUjmlNs","time":"2021-11-08T14:12:51Z","user.login":"xxxxx"}
2021-11-08 09:12:51
{"level":"debug","msg":"events: stream closed","request-id":"20cqmLYdbs3dMD2xXb7LPQyjmaZ","time":"2021-11-08T14:12:51Z","user.login":"xxxxx"}
2021-11-08 09:12:51
{"level":"debug","msg":"events: stream closed","request-id":"20clN8Nogg8EV7HxUf4ZlUjmlNs","time":"2021-11-08T14:12:51Z","user.login":"xxxxx"}
2021-11-08 09:12:51
{"level":"debug","msg":"events: stream cancelled","request-id":"20cvA8ulD0KVvohQYQqtNm5qovl","time":"2021-11-08T14:12:51Z","user.login":"xxxxx"}
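For context on the makeup of that burst, this is roughly how we tallied it (again a Python sketch over the same hypothetical drone-logs.json export; the window boundaries are just the ones from the graph above, in UTC to match the “time” field):

import json
from collections import Counter
from datetime import datetime

# The end of the blackout window from the graph above, in UTC to match "time".
WINDOW_START = datetime(2021, 11, 8, 14, 12, 0)
WINDOW_END = datetime(2021, 11, 8, 14, 14, 0)

counts = Counter()
with open("drone-logs.json") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        ts_raw = entry.get("time")
        if not ts_raw:
            continue
        ts = datetime.strptime(ts_raw, "%Y-%m-%dT%H:%M:%SZ")
        if WINDOW_START <= ts <= WINDOW_END:
            # The /api/stream requests log with an empty msg, so fall back to the request path.
            counts[entry.get("msg") or entry.get("request", "unknown")] += 1

for msg, n in counts.most_common():
    print(f"{n:5d}  {msg}")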
None of the messages in this spike indicate any kind of service restart or other problem with the service itself. Unfortunately, this issue is intermittent enough that we cannot reproduce it predictably. Sometimes we can go weeks without seeing it; other times we see it multiple times in a day. When it does occur, we are frequently paged by our users, as their pipelines can take 20 minutes or more to begin because of the pending jobs backing up.
Any help would be appreciated, and we’re happy to continue troubleshooting on our side if you have any thoughts or ideas.
Thanks!