In that case the only thing that will stop these errors is to restart the agents.
The agents with these errors don’t seem to run any pipeline steps when this happens.
How can I avoid these errors?
I remember two other individuals had a similar issue. The docker logs command was freezing (problem with docker daemon, not drone) and was resulting in builds hanging and exceeding deadlines. In one case, the individual restarted the machine and/or docker daemon and the issue was solved. In the other case, the individual upgraded docker and it was solved.
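To check whether the Docker daemon on an agent host shows this symptom, one option is to run a command under a deadline and see whether it returns. The snippet below is only a rough sketch: the helper name `command_hangs` and the 10-second deadline are my own assumptions, not anything Drone provides.

```python
import subprocess


def command_hangs(argv, deadline_seconds=10.0):
    """Return True if the command does not exit within deadline_seconds.

    Hypothetical helper for illustration; it is not part of Drone or Docker.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the deadline passes.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


# Example (replace my_container with a real container name or ID):
# command_hangs(["docker", "logs", "--tail", "1", "my_container"])
```

If the call returns True for a container that should have logs, the daemon is likely the one freezing, and restarting or upgrading Docker as described above would be the next thing to try.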
It is not normal, but in this case it would be a Docker bug, so there is nothing we can do about it in Drone. There were (are?) multiple open issues in the moby issue tracker related to logs freezing. As I mentioned, one other team had a similar issue and resolved it by either upgrading or downgrading Docker (I am not sure which).
I upgraded Docker to the current latest stable version (18.06.0-ce) on Ubuntu but the problem still exists.
Btw, I don’t see how this is related to logs freezing: I don’t use that command, so it can’t be what causes my problem.
On another Kubernetes cluster I have several drone agents (version 0.8.5, using this helm chart: https://github.com/helm/charts/tree/master/stable/drone) configured to connect to that drone server using a url like https://drone.example.com:9000
Docker version on the K8s cluster is 18.06.0-ce.
When the agents start on K8s they work fine for a while. Then some of them misbehave and keep spawning the error that I described. When this happens I need to terminate those agents so that they’re created again. The CI jobs started by those agents seem stuck in the running state and never finish.
Is there any update or fix for this issue? I am having the same issue too.
I tried Docker versions from 17.03 to 18.06, and all of them have the same problem. I also tried moving the OS from Ubuntu 18.04 to Ubuntu 16.04 and CentOS 7, but the problem is still the same.
I used docker-compose to start drone server & agent in a single VM.
I’ve found that the problem was caused by the pipeline execution time exceeding the default timeout setting (60 minutes). Once that happens, the rpc error message starts to spawn once per second and never stops until the agent container is restarted.
I solved this problem by extending the timeout setting with an admin user. @minhdanh if you still have the same issue, you can try this and see if it helps.
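For anyone who wants to script this, here is a rough sketch that builds the CLI call for raising a repository timeout. It assumes the 0.8 `drone` CLI accepts `drone repo update --timeout <duration> <repo>`; please confirm that with `drone repo update --help` on your version. The function name and the repository `octocat/hello-world` are placeholders of mine.

```python
def timeout_update_command(repo, minutes):
    """Build the argv for raising a repository's timeout.

    Assumption: the drone CLI supports `repo update --timeout <duration>`.
    The helper itself is hypothetical and only assembles the arguments.
    """
    if minutes <= 0:
        raise ValueError("timeout must be a positive number of minutes")
    # Go-style duration string, e.g. "120m" for two hours.
    duration = f"{int(minutes)}m"
    return ["drone", "repo", "update", "--timeout", duration, repo]


# Example: print the command for a 2-hour timeout on a placeholder repo.
print(" ".join(timeout_update_command("octocat/hello-world", 120)))
```

The command has to be run with an admin user's token, since changing the timeout is what required admin rights in my case.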