As the title says … the agent simply lost its connection overnight. All pipelines were pending startup. docker restart <container> fixed it. This has now happened twice, and I don’t believe it’s related to network issues. Any tips for debugging? Note that this never happened on version 0.7 with WebSockets.
I’ve run into this too. I have a dev environment that doesn’t run many jobs. It seems like when there is no activity the agent disconnects. Restarting the agents brings things back up again … until there’s no activity and it disconnects again.
Is there any heartbeat between the server and the agent? The reason I ask is that I think in my environment the issue is a firewall that kills idle connections after a while. I’m still investigating that.
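One quick way to check, if you can get a shell on an agent host, is to see whether the kernel even has a keepalive timer armed on the agent’s connection to the server (the address and port here are the ones that appear later in this thread, so adjust for your setup):

# -t tcp, -n numeric, -o show timer state, e.g. timer:(keepalive,59min,0)
ss -tno dst 10.0.0.10:9000

If no keepalive timer shows up, nothing is ever sent on the idle connection, and a firewall that drops idle flows will kill it silently.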
I’m running drone/drone:0.8.1 currently in dev. I have seen this also on drone/drone:0.8.0-rc.3.
I do not see the issue in Prod. Prod is more active and in an environment that does not have the firewall.
So I tried the following keepalive settings, which have helped other people running Elasticsearch in the same environment who saw connectivity issues due to the firewall killing idle connections:
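Something along these lines (the values below are illustrative, not exactly what we settled on):

# start probing after 2 minutes of idle instead of the 2-hour default,
# then probe every 30 seconds
sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8

Note these only affect sockets that enable SO_KEEPALIVE; if the process never turns keepalive on, the firewall still sees the connection as idle.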
I’d like to continue this discussion, please. I’ve tried to work around these apparent connectivity issues between the agent and server that continually cause missing (Pending) builds. One issue I have with most discussion points so far is that I’m not routing agent communications through a load balancer or reverse proxy, but through Docker’s overlay networking. @bradrydzewski tells me (likely rightly so) that there are some known reliability issues with overlay networking. The config looks like this => https://gist.github.com/d596784d3138e9ba22483501589aa600
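For readers who can’t open the gist, the host-network variant I mention below is roughly this shape (image tag, addresses, and secret are placeholders, not lifted from the gist):

# run the agent with host networking instead of attaching it to the overlay
docker run -d --restart=always --network=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e DRONE_SERVER=10.0.0.10:9000 \
  -e DRONE_SECRET=<shared-secret> \
  drone/agent:0.8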
The biggest pain point right now is that there is no way to get the resulting Pending builds going again. Restarting the servers and/or the agents doesn’t help. Restarting the build just displays an error in the UI.
Interestingly, with my updated config that connects the agent(s) directly over the host network (side-stepping any overlay issues), I see errors like this in the agent logs:
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:42816->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:49262->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:33444->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
What I don’t see is the agent(s) trying to reconnect at all. Is this expected?
It looks to me like the agent is trying to reconnect, but is getting connection refused. Why is your network refusing connections from the agent to the server?
Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused
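Note that refused is different from a timeout: the dial reached something that actively rejected it, either because nothing is listening on that port or because a firewall sent a reset. A quick check from an agent host (addresses as in your logs):

# is the server’s gRPC port reachable and listening?
nc -vz 10.0.0.10 9000

If that intermittently fails, it could also just mean the server container itself was restarting when the agents tried to reconnect.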
By default the server will reject keepalive pings sent more frequently than every 5m, which results in the client being blocked by the server. This is a built-in gRPC security feature. If you want to set duration values under 5m, you need this patch: https://github.com/drone/drone/pull/2295
Also note that 1s and 5s are pretty low values. Load balancers and proxies usually have 60-second default timeouts, so 30s and 15s are probably better values, respectively.
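For context on where those two numbers plug in, here is a plain grpc-go sketch of the mechanics; this is illustrative, not Drone’s actual wiring:

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client side: ping after 30s of inactivity and fail the connection
	// if no ack arrives within 15s (the 30s/15s values discussed above).
	conn, err := grpc.Dial("10.0.0.10:9000",
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second,
			Timeout:             15 * time.Second,
			PermitWithoutStream: true, // ping even when no RPC is in flight
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Server side: unless MinTime is lowered (default 5m), pings arriving
	// more often than MinTime are answered with GOAWAY "too_many_pings"
	// and the client is cut off -- the blocking behavior described above.
	_ = grpc.NewServer(grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             30 * time.Second,
		PermitWithoutStream: true, // allow keepalive pings on idle connections
	}))
}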