Drone builds take 15 minutes to start

We’ve been encountering this issue for the past month or so.

Drone version: 0.8

Our Drone server is hosted on an on-prem Kubernetes cluster, and our agents run on a Docker Swarm across 3 VMs. We currently have about 25 agents running on the swarm and very rarely have more than 15 builds running concurrently.

While the builds are pending, we see the grey clock ticking back and forth. Our latest run-in with this issue happened earlier today, with only a handful of other builds running. I tried restarting the agents on the Docker Swarm, but that did not alleviate the issue.

Do you know what could be causing this issue?

15 minutes just happens to be the default TCP connection timeout in the Go standard library. Given the technologies involved, my gut tells me that something is breaking TCP connections between the agent and server without sending a close notification to one side. This would cause the connection to hang for 15 minutes until it eventually times out, at which point the agent automatically reconnects and begins processing builds again.

Perhaps you can provide more details regarding your network topology?