0.8 Build permanently pending

Builds are stuck in pending status.
After rebooting the drone agent, the agent log shows:
{"time":"2017-09-25T11:09:04Z","level":"error","error":"rpc error: code = Canceled desc = context canceled","message":"pipeline done with error"}
The server log shows:
INFO: 2017/09/25 11:09:04 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"
or
INFO: 2017/09/25 12:50:16 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"


By the way, the agents run on a remote host, separate from the server.
My guess is that the agents lose their connection to the server after some idle time.

And every time it is the same situation:
On push, drone creates two builds, one for the pull_request event and one for the push event. Both stay in pending status.
After manually restarting the agent:
The build for the pull_request event starts and works fine.
The build for the push event stays in pending status forever.

Pushes after that work well, until another long idle period.

Any suggestions?

I am experiencing something similar with 0.8. I just stood up a cluster on mesos yesterday, so I’m not sure if I’ve done something wrong.

The first few builds will work, but after the machines sit idle for a while, builds will no longer be picked up by any agents. Sometimes restarting the agents and/or the server will kick off the pending builds, though.

I have seen errors in the server log when this happens, similar to “http2Server.HandleStreams failed to read frame” and “connection reset by peer”.

Sounds like you are having networking issues.

“http2Server.HandleStreams failed to read frame” and “connection reset by peer”.

Pending builds indicate they are not being picked up by the agents. These error messages tell me that something is breaking the network connection between your server and agent. If the agent cannot connect with the server, it cannot fetch builds from the queue.

In general drone can recover from server disconnects; however, I cannot personally reproduce and test every possible network failure and error code, so there could certainly be edge cases. I recommend looking at the code that handles reconnects and retry logic, and sending a patch if you think it can be improved:

Note that I will not be providing a patch as I am unable to reproduce. So anyone that wants to see this resolved will need to dig into the code and send a patch.

I will try to take a look at this when I get a moment, but I'm only a few months into my golang experience. :) I did notice that under 0.7 the connection seems to be much more stable. Apparently the version of docker under our mesos stack is 1.13, so I wonder if this might somehow be related to this issue: Builds stuck in running 0.8-rc.3.

The connection is the first issue.
Some builds never start, even after the agent restarts.

I am not able to repeat this issue and there is not enough information here to reproduce it or provide further advice. I would therefore recommend looking at the code and sending a patch if you are able to consistently reproduce an error.

I can reproduce it on my working server. How can I collect additional information for you?

Well, I have tried moving the drone server back to the same host as the agent, but it did not resolve this bug.

server logs:
INFO: 2017/10/10 09:32:07 transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.0.7:9000->10.0.0.6:47362: read: connection reset by peer
INFO: 2017/10/10 09:32:07 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"

agent logs:
2017-10-10T09:39:01Z |ERRO| pipeline done with error error="rpc error: code = Canceled desc = context canceled"
2017-10-10T09:39:01Z |ERRO| pipeline done with error error="rpc error: code = Canceled desc = context canceled"
2017-10-10T09:39:01Z |ERRO| pipeline done with error error="rpc error: code = Canceled desc = context canceled"

After restarting the agent, one of the two builds stays in pending status.

This situation repeats after every ~20 minutes of idle time without builds.

This is my config:

docker service create \
  --name drone \
  --network drone \
  --publish 9000:9000 \
  --env GIN_MODE=debug \
  --env DRONE_DEBUG=true \
  --env DRONE_DEBUG_PRETTY=true \
  --env DRONE_HOST="https://drone.(hidden).net" \
  --env DRONE_GITHUB=true \
  --env DRONE_GITHUB_CLIENT=(hidden) \
  --env DRONE_GITHUB_SECRET=(hidden) \
  --env DRONE_SECRET=(hidden) \
  --env DRONE_OPEN=false \
  --env DRONE_ADMIN=(hidden) \
  --env DRONE_DATABASE_DRIVER=mysql \
  --env DRONE_DATABASE_DATASOURCE="(hidden)" \
  drone/drone:0.8.1

docker service create \
  --name drone_agent \
  --network drone \
  --env DRONE_DEBUG=true \
  --env DRONE_DEBUG_PRETTY=true \
  --env DRONE_SERVER="drone:9000" \
  --env DRONE_MAX_PROCS=3 \
  --env DRONE_SECRET=(hidden) \
  --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
  drone/agent:0.8.1 agent

Try setting the endpoint_mode to dnsrr and skip the publish on the server, as that is not allowed with dnsrr. Since I added that to my docker-compose file for the stack in the swarm, I am no longer getting pending builds.
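If it helps, here is a rough sketch of the same change with the docker CLI (untested, just adapting the service definition posted above): the server gets --endpoint-mode dnsrr and drops --publish 9000:9000, since publishing through the ingress routing mesh is not allowed in dnsrr mode; the agent keeps DRONE_SERVER="drone:9000" and reaches the server directly over the overlay network.

# sketch only: server service switched to dnsrr, published port removed
docker service create \
  --name drone \
  --network drone \
  --endpoint-mode dnsrr \
  --env DRONE_HOST="https://drone.(hidden).net" \
  --env DRONE_GITHUB=true \
  --env DRONE_GITHUB_CLIENT=(hidden) \
  --env DRONE_GITHUB_SECRET=(hidden) \
  --env DRONE_SECRET=(hidden) \
  --env DRONE_DATABASE_DRIVER=mysql \
  --env DRONE_DATABASE_DATASOURCE="(hidden)" \
  drone/drone:0.8.1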


Looks like this solves my problem. Thank you.

So I guess the problem lies somewhere between docker swarm's port forwarding and the drone agent keeping its connection open.

Based on my understanding, the default endpoint mode is vip, which assigns a virtual IP address in the ingress network so that ports can be published across the entire swarm. Using dnsrr resolves the DNS name to the IP of the container directly, skipping one layer. It must be that layer that breaks HTTP/2, since the same setup worked fine with drone 0.7 in the default / vip mode.
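A quick way to see the difference, assuming the drone overlay network was created with --attachable (otherwise exec into a container that is already attached to it):

# with the default vip mode this returns the service's single virtual IP;
# with dnsrr it returns the IP of each running task directly
docker run --rm --network drone alpine nslookup drone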

Had the same problem and endpoint_mode: dnsrr fixed it.
fruuf, thank you!

I encountered the same issue (reported here [0.8.1] Agent loses connection overnight). Some server logs:

INFO: 2017/10/16 16:52:09 transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.8.128:9000->10.142.0.3:53800: read: connection timed out
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"
INFO: 2017/10/16 16:52:09 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"

My Solution:

  1. run drone-server in docker swarm in network A
  2. get the drone-server's IP address in network A
  3. then run the drone agent in network A with DRONE_SERVER=ip_value

This will skip the docker vip, but it is less convenient… (a rough sketch of these steps is below).
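For reference, a rough sketch of those steps with the docker CLI; the network and service names are taken from the configs above and the IP is only an example, so adjust to your setup:

# steps 1 and 2: on the node running the drone server task, read its IP on the "drone" network
docker inspect \
  --format '{{ (index .NetworkSettings.Networks "drone").IPAddress }}' \
  $(docker ps -q --filter "name=drone.1")

# step 3: start the agent on the same network, pointing DRONE_SERVER at that IP
docker service create \
  --name drone_agent \
  --network drone \
  --env DRONE_SERVER=10.0.0.7:9000 \
  --env DRONE_SECRET=(hidden) \
  --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
  drone/agent:0.8.1 agent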