Drone agent errors: context deadline exceeded

I have several drone agents running on Kubernetes, but some of them keep spawning these errors:

2018/07/30 03:54:41 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:42 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:43 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:44 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:45 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:46 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:47 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:48 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:49 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/07/30 03:54:50 grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded

When that happens, the only thing that stops the errors is restarting the agents.
Agents in this state don't seem to run any pipeline steps.
How can I avoid these errors?

I remember two other individuals had a similar issue. The docker logs command was freezing (a problem with the docker daemon, not Drone), which resulted in builds hanging and exceeding their deadlines. In one case the individual restarted the machine and/or the docker daemon and the issue was solved. In the other case, the individual upgraded docker and it was solved.
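A quick way to test for that symptom (the container ID is a placeholder for one of your running build containers):

docker logs --tail 5 <container-id>

This should return almost immediately; if it hangs, the daemon's log stream is stuck, which is the docker-side condition described above.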

So is it normal that when this error happens it never stops, and no new builds are assigned to these drone agents?

It is not normal, but in this case it would be a docker bug, so there is nothing we can do about it in Drone. There were (are?) multiple open issues in the moby issue tracker related to logs freezing. As I mentioned, another team had a similar issue and resolved it by either upgrading or downgrading docker (I am not sure which).

I upgraded Docker to the current latest stable version (18.06.0-ce) on Ubuntu, but the problem still exists.
Btw, I don't see how this is related to logs freezing: I don't use that command, so it shouldn't be the cause of my problem.

@minhdanh can you share your setup? If you are using docker swarm, make sure to set the endpoint mode to dnsrr.
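In a compose/stack file that is the deploy setting, along these lines (v3 syntax; service name illustrative):

services:
  drone-server:
    deploy:
      endpoint_mode: dnsrr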

Sure. I have a drone server running as a docker container, started with docker-compose on an Ubuntu server:

version: '2'   # compose file format version (assumed v2)

services:
  drone-server:
    container_name: drone-server
    image: drone/drone:0.8.5
    restart: unless-stopped
    ports:
      - "9000:9000"   # gRPC port the agents connect to
On another Kubernetes cluster I have several drone agents (version 0.8.5, installed with this Helm chart: https://github.com/helm/charts/tree/master/stable/drone) configured to connect to that drone server using a URL like https://drone.example.com:9000.

Docker version on the K8s cluster is 18.06.0-ce.

When the agents start on K8s they work fine for a while; then some of them misbehave and keep spawning the error I described. The CI jobs started by those agents appear stuck in the running state and never finish, and the only fix is to terminate the affected agent pods so that they are recreated.
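Since the chart runs the agents as a Deployment, deleting the pod is enough to get a fresh one (pod name is a placeholder):

kubectl delete pod <drone-agent-pod-name>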

Hi all,

is there any update or fix for this issue? I am having the same issue too.

I tried docker versions from 17.03 through 18.06, and all of them have the same problem. I also tried moving the OS from Ubuntu 18.04 to Ubuntu 16.04 and to CentOS 7, but the problem stays the same.

I used docker-compose to start the drone server & agent in a single VM.
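My compose file is essentially the standard two-service layout from the 0.8 docs, along these lines (illustrative; host, secret, and git provider settings trimmed):

version: '2'

services:
  drone-server:
    image: drone/drone:0.8
    restart: always
    ports:
      - "80:8000"
      - "9000:9000"
    volumes:
      - /var/lib/drone:/var/lib/drone
    environment:
      - DRONE_HOST=${DRONE_HOST}
      - DRONE_SECRET=${DRONE_SECRET}

  drone-agent:
    image: drone/agent:0.8
    restart: always
    depends_on:
      - drone-server
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DRONE_SERVER=drone-server:9000   # gRPC endpoint of the server
      - DRONE_SECRET=${DRONE_SECRET}     # must match the server's secret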

Thanks!

Hi all,

I've found that the problem was caused by the pipeline execution time exceeding the default timeout setting (60 minutes). Once that happens, the rpc error message starts spawning once per second and never stops until the agent container is restarted.

I solved this problem by extending the timeout setting as an admin user. @minhdanh, if you still have the same issue, you can try whether this helps.
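In 0.8 the timeout is a per-repository setting that only admins can change, either in the repository settings UI or via the CLI. From memory the CLI call is shaped roughly like this (flag name and duration format unverified, so check drone repo update --help; the repository name is a placeholder):

drone repo update --timeout=90m your-org/your-repo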