[0.8.1] Agent loses connection overnight

As in the title … the agent simply lost its connection overnight. All pipelines were stuck pending startup. docker restart <container> fixed it. This is the second time it has happened, and I don't believe it's related to network issues. Any tips for debugging? Note that this never happened on version 0.7 with web sockets.

This is my docker-compose file.

version: '2'

services:
  drone-agent:
    image: drone/agent:latest
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DRONE_SERVER=my.ci.io:9000
      - DRONE_SECRET=${DRONE_SECRET}
      - DRONE_MAX_PROCS=16
      - DRONE_DEBUG=true
      - DRONE_DEBUG_PRETTY=true

I've run into this too. I have a dev environment that doesn't run many jobs. It seems like when there is no activity the agent disconnects. Restarting the agents brings things back up again … until there's no activity and it disconnects again.

Is there any heartbeat between server and agent? The reason I ask is that I think in my environment the issue is a firewall that kills idle connections after a while. I'm still investigating that.

I’m running drone/drone:0.8.1 currently in dev. I have seen this also on drone/drone:0.8.0-rc.3.

I do not see the issue in Prod. Prod is more active and in an environment that does not have the firewall.

Going to test resolving this by tuning Linux keepalive:

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

So I tried the following keepalive settings, which have helped other people running Elasticsearch in the same environment who saw connectivity issues due to the firewall killing idle connections:

    "tcp_keepalive_time": 1800
    "tcp_keepalive_probes": 3
    "tcp_keepalive_intvl": 60

And it didn't seem to help. I still lost the agent connection overnight.

@jmccann I hope we can find a solution, but my gut feeling is that the problem lies in the software.

@bradrydzewski Do you have any thoughts on this? Should it be considered an issue? If so, I can open one for it. Thanks!

I think we would need to add a healthcheck as described here: https://github.com/grpc/grpc/blob/master/doc/health-checking.md

I am told that sending a healthcheck every 30 seconds will keep something like haproxy from killing idle connections.
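
For reference, here is a minimal sketch of that approach using the stock grpc-go health service: the server registers the standard health implementation, and the client calls Check on a 30-second ticker so the connection never looks idle to a proxy or firewall. This is illustrative only, not the actual drone code; the address and names are placeholders.

    package main

    import (
        "context"
        "log"
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/health"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
    )

    func main() {
        // Server side: register the stock grpc health service on the server.
        lis, err := net.Listen("tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        srv := grpc.NewServer()
        healthpb.RegisterHealthServer(srv, health.NewServer())
        go srv.Serve(lis)

        // Client side: call Check periodically so the connection is never
        // idle long enough for a proxy or firewall to drop it.
        conn, err := grpc.Dial("localhost:9000", grpc.WithInsecure())
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        client := healthpb.NewHealthClient(conn)
        for range time.Tick(30 * time.Second) {
            resp, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{})
            if err != nil {
                log.Println("health check failed:", err)
                continue
            }
            log.Println("health status:", resp.Status)
        }
    }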


I have the same issue running drone in Kubernetes.

I created https://github.com/drone/drone/issues/2246 to track this.

I created a new PR to add a grpc health check: https://github.com/drone/drone/pull/2262


I'd like to continue this discussion, please. I've tried to work around these apparent connectivity issues between the agent and server that continually cause missing builds (Pending). One issue I have with most of the discussion points so far is that I'm not going through a LB or reverse proxy for agent communications, but through Docker's overlay networking. @bradrydzewski tells me (likely rightly so) that there are some known reliability issues with overlay networking. That config looks like this: https://gist.github.com/d596784d3138e9ba22483501589aa600

The problem is that even working around this and going direct (e.g. with this Docker stack file: https://gist.github.com/prologic/ddd39faba96618ec802d079517359104) I still have issues.

The biggest pain point right now is that there is no way to get the resulting Pending builds going again. Restarting the servers and/or the agents doesn’t help. Restarting the build just displays an error in the UI.

Interestingly, with my updated config that connects the agent(s) directly over the host network (side-stepping any overlay issues), I see errors like this in the agent logs:

ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:42816->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:49262->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.3.xzgpo31ppdk8@dm4.mydomain.com    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:33444->10.0.0.10:9000: read: connection reset by peer.
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.1.l0k1ocf4kw3n@dm3.mydomain.com    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
ci_drone-agent.2.l7uwojy8af34@dm5.mydomain.com    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}

What I don’t see is the agent(s) trying to reconnect at all. Is this expected?

The agent uses the built-in reconnect logic provided by grpc [1]:

	cli := &Client{
		endpoint: endpoint,
		retry:    defaultRetryClount,
		backoff:  defaultBackoff,
		headers:  map[string][]string{},
	}

It looks to me like the agent is trying to reconnect, but is getting connection refused. Why is your network refusing connections from the agent to the server?

 Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused

[1] see pipeline/rpc/client.go
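
For illustration only, the retry and backoff fields above imply an RPC-level retry loop layered on top of grpc's own connection-level reconnect. A minimal sketch of that pattern (not the actual drone client code, which lives in pipeline/rpc/client.go) looks like this:

    package retry

    import "time"

    // callWithRetry re-attempts an RPC until it succeeds or the retry budget
    // is exhausted. grpc keeps reconnecting the underlying connection on its
    // own; this loop only retries the call itself.
    func callWithRetry(retries int, backoff time.Duration, call func() error) error {
        var err error
        for i := 0; i < retries; i++ {
            if err = call(); err == nil {
                return nil
            }
            time.Sleep(backoff)
        }
        return err
    }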

Status update: still not working on drone/agent:latest.

The following are the error messages from drone/agent:

2018/01/14 07:19:06 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:07 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:08 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:09 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:10 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1

Have you tried setting the keepalive values? https://github.com/drone/drone/pull/2294

Hi @bradrydzewski. Yes, I have tried it, but it doesn't work.

See the settings:


agent.keepalive.time: "1s"
agent.keepalive.timeout: "5s"

Are these settings right?

The server will reject keepalive pings with a frequency under 5m by default, which will result in the client being blocked by the server. This is a built-in grpc security feature. If you want to set duration values under 5m, you need this patch: https://github.com/drone/drone/pull/2295

Also note that 1s and 5s are pretty low values. Usually load balancers and proxies have 60 second default timeouts, so 30s and 15s are probably better values, respectively.
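
As a rough sketch, those 30s/15s values map to grpc-go's client keepalive parameters like this (an illustration of the grpc-go API, not the actual drone patch; the endpoint is the one from the compose file earlier in the thread):

    package main

    import (
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        conn, err := grpc.Dial("my.ci.io:9000",
            grpc.WithInsecure(),
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:                30 * time.Second, // send a ping after 30s without activity
                Timeout:             15 * time.Second, // wait up to 15s for the ping ack
                PermitWithoutStream: true,             // keep pinging even when no build is running
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }

Keep in mind the client Time must not be lower than the server's enforcement MinTime (5m by default), which is what the patch above relaxes.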

I will try the 5m value for keepalive.time without the https://github.com/drone/drone/pull/2295 patch:

  agent.keepalive.time: "5m"
  agent.keepalive.timeout: "30s"

The drone agent always outputs the following when the container starts in Kubernetes:

2018-01-15T01:46:00Z |DEBU| request next execution
2018/01/15 01:46:00 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/15 01:46:01 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1

The message appears once per second.

Maybe this thread can help you: transport: check error returned by WriteField to catch illegal characters in metadata (grpc/grpc-go issue #1134 on GitHub).

I also tried the patch https://github.com/drone/drone/pull/2295 and updated PermitWithoutStream to true:

            grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
                MinTime: c.Duration("keepalive-min-time"),
                PermitWithoutStream: true,
            }),

Still not working with that.