[0.8.1] Agent loses connection overnight

I recommend the grpc support channel. They might be able to help you debug your environment and recommend a solution. https://groups.google.com/forum/#!forum/grpc-io

OK, I will post my question on grpc support channel. Thanks @bradrydzewski

I’ve faced the same issue with an AWS Network Load Balancer. After a while the agent would lose its connection to the server and just sit in memory doing nothing while pending jobs piled up on the server.

First, I configured the agent to use keepalive (ping):

- DRONE_KEEPALIVE_TIME=20s
- DRONE_KEEPALIVE_TIMEOUT=20s

But it didn’t seem to help. What seemed to happen is that pings start at a normal rate (every 40s) and then slow down to one every few hours, which makes them ineffective at keeping the connection “active” in the AWS load balancer.

After looking into some code and figuring out how it all works :smiley: I updated the server configuration with:

- DRONE_KEEPALIVE_MIN_TIME=5s

This means the gRPC server will not slow down the ping rate unless pings are sent multiple times within 5s. By default this value seems to be 2 hours :confused:

These three environment variables solved the problem for me.
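
For anyone who wants to sanity-check their own values, here is a minimal sketch of the relationship between the three settings above. It is not part of Drone or gRPC: `parse_duration` and `check_keepalive` are hypothetical helpers, the parser only handles simple values such as `20s`, `5m` or `2h`, and the `350s` load balancer idle timeout is a placeholder assumption that should be replaced with the documented value for your load balancer.

```python
import re

# Seconds per unit for the simple duration forms used in this thread.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a simple duration such as '20s', '5m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_keepalive(agent_time, server_min_time, lb_idle_timeout="350s"):
    """Return a list of warnings for a keepalive configuration.

    agent_time      -- value of DRONE_KEEPALIVE_TIME on the agent
    server_min_time -- value of DRONE_KEEPALIVE_MIN_TIME on the server
    lb_idle_timeout -- assumed idle timeout of the load balancer (placeholder)
    """
    ping = parse_duration(agent_time)
    min_time = parse_duration(server_min_time)
    idle = parse_duration(lb_idle_timeout)

    warnings = []
    if ping < min_time:
        # The server treats pings arriving faster than its minimum as excessive.
        warnings.append("agent pings more often than the server permits")
    if ping >= idle:
        # Pings this rare cannot stop the load balancer from dropping the connection.
        warnings.append("pings are too infrequent to keep the load balancer connection open")
    return warnings


print(check_keepalive("20s", "5s"))  # the settings from this post
print(check_keepalive("20s", "2h"))  # server minimum left at the 2 hours mentioned above
```

With the values from this post the first call reports no warnings, while the second flags that a 20s ping is faster than a 2 hour server minimum.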

I’ve still got this issue and haven’t been able to resolve it yet. I’ve set up drone-agent and drone-server on a Kubernetes cluster but just get repeated log messages from the agent:

2018/07/24 15:59:12 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1

My understanding was that this happens when RPC messages from the agent are not delivered because the server is running behind a load balancer (as a LoadBalancer service), although mine is configured as a NodePort service without public ingress. See my service definition here.

I then set up an ingress controller to dispatch requests to my public IP (via a DNS A record) on port 80.

Finally, my configmap contains:

agent.keepalive.time: "20s"
agent.keepalive.timeout: "20s"
agent.keepalive.min.time: "5s"

…which in turn are used by my server deployment.

Is anybody able to share their Kubernetes config for Drone? I haven’t been able to resolve this issue yet!

Thanks! :slight_smile:

I believe DRONE_KEEPALIVE_MIN_TIME is a server variable; however, it looks like it is being passed to the agent as agent.keepalive.min.time. I’m not sure this will make a difference, but it is perhaps worth trying.
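
To illustrate the kind of mismatch I mean, here is a rough sketch that flags variables set on the wrong component. The mapping of variables to components is taken from this thread rather than verified against Drone’s source, and `misplaced` and the `deployments` layout are hypothetical names made up for the example.

```python
# Which component reads each variable, as described in this thread
# (an assumption, not verified against Drone's source).
COMPONENT_FOR = {
    "DRONE_KEEPALIVE_TIME": "agent",
    "DRONE_KEEPALIVE_TIMEOUT": "agent",
    "DRONE_KEEPALIVE_MIN_TIME": "server",
}


def misplaced(env_by_component):
    """Return (variable, found_on, expected_on) for each misplaced variable."""
    problems = []
    for component, env in env_by_component.items():
        for name in env:
            expected = COMPONENT_FOR.get(name)
            if expected is not None and expected != component:
                problems.append((name, component, expected))
    return problems


# Hypothetical layout: all three values end up on the agent only.
deployments = {
    "server": {},
    "agent": {
        "DRONE_KEEPALIVE_TIME": "20s",
        "DRONE_KEEPALIVE_TIMEOUT": "20s",
        "DRONE_KEEPALIVE_MIN_TIME": "5s",
    },
}

for name, found_on, expected_on in misplaced(deployments):
    print(f"{name} is set on the {found_on} but is expected on the {expected_on}")
```

Running the same check against the environment of your actual server and agent deployments should show quickly whether the minimum time is reaching the server at all.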

Also note that there is a stable Helm chart. I wasn’t involved in its creation, but perhaps the chart might provide some good configuration hints. https://github.com/helm/charts/tree/master/stable/drone