Drone 0.5 builds stuck in pending

Was about to open an issue but followed directions to post here first. :wink:

I’ve been running into an issue with Drone 0.5 where new builds sit in a pending state while none of my 4 agents have any jobs running on them.

The last time this happened I saw the following log right before the issue started occurring:

November 21st 2016, 10:22:50.000	drone5-server	time="2016-11-21T16:22:50Z" level=error msg="Unable to read logs from broker. Timeout. %!s(<nil>)"
November 21st 2016, 10:22:50.000	drone5-server	time="2016-11-21T16:22:50Z" level=error msg="Unable to read logs from broker. Timeout. %!s(<nil>)"

Beyond that I’m not seeing much else useful from a logging perspective.

What I end up having to do to “fix” it is run the following on the drone server:

docker stop drone && docker start drone

That seems to get the agents consuming jobs from new builds again.

I am running an NGINX reverse proxy for SSL termination, as documented in drone/docs, and the agents connect using the wss:// protocol.

Let me know if any other information would be useful.

Thanks!

Have you tried using DRONE_BROKER_DEBUG?

The issue you are describing seems isolated to people placing nginx (or load balancers) in front of drone. Is this something you can either confirm or disprove?

I have DRONE_BROKER_DEBUG enabled. After the error occurred I no longer received logs from stomp. The web UI continued to function and produce logs, however.

Of my 4 agents, one IS NOT going through the reverse proxy (I forgot to mention that earlier). That agent also was not processing builds from the queue.

I’ll try having all agents bypass the reverse proxy and see if the issue occurs again.

Dumping what I am seeing so far. It looks like the /queue/pending queue is stuck:

$ /go/bin/mq --server=wss://drone-github.drone.svc.cluster.local:8000/ws/broker --username=x-token --password=FOOBAR bench pubsub --topic /queue/drone
Performing Publish/Subscribe performance test

clients: 1
elapsed: 1.799981346s
msg/sec: 55556.13
latency: n/a

$ /go/bin/mq --server=wss://drone-github.drone.svc.cluster.local:8000/ws/broker --username=x-token --password=FOOBAR bench pubsub --topic /queue/pending
Performing Publish/Subscribe performance test
< never responds, just hangs >

For comparison, I run the same thing and see this:

$ mq bench pubsub
Performing Publish/Subscribe performance test

clients: 1
elapsed: 502.020823ms
msg/sec: 199194.92
latency: n/a

$ mq bench pubsub
Performing Publish/Subscribe performance test

clients: 1
elapsed: 501.338405ms
msg/sec: 199466.07
latency: n/a

My guess would be that nginx is not closing the connection (on one end) and, as a result, it is putting the queue in an inconsistent state. There is a dangling connection that is blocking … @jmccann this could still impact your instance, where 3 of 4 agents are using nginx.

I am not using nginx in front of the drone server. Also, if you look at my previous message: when I bench /queue/drone I see no issues, but when I bench /queue/pending the mq client just hangs. Both tests were made against the same server, which is in a broken state.

Just as a point of comparison, I can run the same benchmark against a fresh drone instance and am not able to reproduce the issue.

$ mq --username="x-token" --password="foo" --server=ws://my.drone/ws/broker bench pubsub --topic="/queue/pending" --message-count=50000
Performing Publish/Subscribe performance test

clients: 1
elapsed: 302.575999ms
msg/sec: 165247.74
latency: n/a

$ mq --username="x-token" --password="foo" --server=ws://my.drone/ws/broker bench pubsub --topic="/queue/pending" --message-count=50000
Performing Publish/Subscribe performance test

clients: 1
elapsed: 301.070428ms
msg/sec: 166074.10
latency: n/a

The reason I point this out is that, as far as the broker is concerned, there is no difference between /queue/drone and /queue/pending. I’m not saying that because I can’t reproduce the issue, no issue exists. I’m just saying that the path itself, as far as the internals of the queue are concerned, does not matter.

Until I can find a way to reproduce this, I would really like to either prove or disprove the theory that something is keeping a connection open / not closing the ws connection. You should be able to audit the server logs and see connection and disconnect events.

If the agent disconnects and you do not see a disconnect event on the server, it would confirm this theory.

logger.Verbosef("stomp: successfully opened socket connection.")
logger.Verbosef("stomp: successfully closed socket connection.")

How do I know when the agent disconnects? I can verify that I do not see the closed-socket message before the issue begins, but I do not know how to confirm that the agent has disconnected. The agents were still processing jobs they had received before the issue occurred. Would the agent produce a log if it disconnected?

the agent logs various messages for connecting and disconnecting. See https://github.com/drone/drone/blob/master/drone/agent/agent.go#L207

specifically I would look for this in the agent logs:

logger.Warningf("connection interrupted, attempting to reconnect.")
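
For context, the loop around that warning is roughly the following shape (a paraphrased sketch with hypothetical helper names, not the actual code; see the linked agent.go for the real thing):

package main

import (
    "log"
    "time"
)

// dialAndServe is a hypothetical stand-in for the agent's real logic:
// connect to the broker, consume jobs, and only return once the
// connection breaks.
func dialAndServe(server string) error {
    // ... dial ws/broker, subscribe to the queue, run builds ...
    return nil
}

// The agent serves until the connection is interrupted, warns, pauses
// briefly, and reconnects. The 15 second pause is illustrative.
func main() {
    for {
        if err := dialAndServe("ws://my.drone/ws/broker"); err != nil {
            log.Println("connection interrupted, attempting to reconnect.")
        }
        time.Sleep(15 * time.Second)
    }
}

So if the agent noticed the broken connection, you would see that warning; silence means the agent still thinks the connection is healthy.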

I did not see such a message until I stop/start my drone server container.

My timeline so far:

10:22:20 last stomp message (stomp: unsubscribe 244: successful: destination /topic/cancel)
10:22:50 read log timeout from server (Unable to read logs from broker. Timeout. %!s(<nil>))
10:24:42 no more jobs running on agents (probably jobs started before 10:22 just finishing up)
11:07:xx I restart the drone server (at this point I see the agents log “connection interrupted, attempting to reconnect”, but at no time before)

@jmccann your error tells me that when a build completes, it attempts to read the full logs from the broker, but this takes > 30 seconds or never completes. See this code block.

I feel it is important to mention that the logs are actually multiple messages in the queue (one per line), and drone knows when it reaches the end of the logs because there will be a special message with the header eof: true that drone looks for here
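
To make that concrete, the read loop amounts to roughly this (a sketch with a hypothetical Message type, not the actual drone code):

package broker

import (
    "bytes"
    "errors"
    "time"
)

// Message is a hypothetical stand-in for a broker message: a header
// map plus one log line in the body.
type Message struct {
    Header map[string]string
    Body   []byte
}

// readFullLogs drains messages until the special eof: true message
// arrives, or gives up after 30 seconds. The timeout branch is what
// produces the "Unable to read logs from broker. Timeout." error.
func readFullLogs(msgs <-chan Message) ([]byte, error) {
    var buf bytes.Buffer
    timeout := time.After(30 * time.Second)
    for {
        select {
        case m := <-msgs:
            if m.Header["eof"] == "true" {
                return buf.Bytes(), nil // end-of-log marker reached
            }
            buf.Write(m.Body)
        case <-timeout:
            return nil, errors.New("timeout waiting for eof message")
        }
    }
}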

So the question is what would cause fetching the full logs to timeout? Here are some thoughts that come to mind off the top of my head …

  1. the logs exceed 5mb and therefore the eof: true message doesn’t get sent. Reading the logs times out because eof is never read
  2. the eof message somehow isn’t sent, or fails to send
  3. there is some internal deadlock in the server <> broker connection, which is basically just go channels piping to one another. This could explain everything just blocking and halting progress

The first two could probably be tested by dumping the contents of the logs when a timeout is received. Are they over 5mb? Are the full logs present but just missing an eof?

The third item could be a bit more difficult to track down. It would be somewhere in this file.
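
To illustrate the kind of deadlock item 3 describes, here is a toy example (not drone code) of how channel piping stalls when a consumer stops reading:

package main

import "fmt"

func main() {
    pending := make(chan string) // unbuffered: sender and receiver must meet

    // The consumer reads exactly one message and then goes away --
    // much like an agent whose ws connection has silently died.
    go func() {
        fmt.Println("got:", <-pending)
    }()

    pending <- "job-1" // delivered
    pending <- "job-2" // blocks forever: no reader is left
    fmt.Println("never reached")
}

In a small program like this the runtime panics with “all goroutines are asleep - deadlock!”, but in a real server other goroutines keep the process alive, so the pipeline just hangs silently, which matches the symptoms here.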

I was able to reproduce a zombie TCP connection by letting my laptop sleep while the TCP connection was open. The connection broke, but neither side received a close event, which means the broker was still trying to write jobs to the closed connection.

I have created a branch in the message broker to implement heartbeat and re-connect. I am first testing a patch that implements a simple 30-second heartbeat that should kill dead connections. This should prevent hanging connections from receiving queue items, although it will still require a bit more testing.

Patch at https://github.com/drone/mq/tree/feature/reconnect
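
For anyone curious, the general shape of the heartbeat is something like the following (a sketch using gorilla/websocket for illustration; the real implementation is in the branch above, and the interval values here are illustrative):

package main

import (
    "time"

    "github.com/gorilla/websocket"
)

// keepAlive pings the peer every 30 seconds and tears the connection
// down when it stops answering, so the broker stops routing work to a
// zombie connection.
func keepAlive(conn *websocket.Conn) {
    const interval = 30 * time.Second

    // Each pong pushes the read deadline forward; a silent peer lets
    // the deadline expire, which fails the next read and closes things.
    conn.SetReadDeadline(time.Now().Add(2 * interval))
    conn.SetPongHandler(func(string) error {
        return conn.SetReadDeadline(time.Now().Add(2 * interval))
    })

    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        deadline := time.Now().Add(10 * time.Second)
        if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
            conn.Close() // write failed: the connection is already dead
            return
        }
    }
}

Note that the read deadline only takes effect because the connection is being read continuously; a heartbeat needs traffic in both directions to detect a zombie on either end.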

Is the session data persistently stored in the database? I was seeing this on a load-balanced setup (multiple instances of the server pointing to the same database, all behind a load balancer). I wonder if it is related.

drone does not support load balanced servers at this time because the queue is central.

That is great news. How can I build a drone server/agent from the mentioned branch? I will happily help test, since I have also experienced this issue in the past few days.

For me it seems that some network component outside my servers/load balancers closes the connection on OpenStack, and thus connections are left in a hanging state. (This does not occur when I run it on my home network stack.)

FYI the latest build includes the heartbeat code that I added to the message broker, so hopefully this helps keep the connection open, as well as remove failed connections more quickly.

I think there are still some edge cases that will need to be worked out, as well as some improvements to the code that re-connects the agent if / when it fails while builds are running. But hopefully this helps get us closer to where we want to be.

Feel free to test and let me know how it goes.


I’ve been running the latest build with the heartbeat mq and haven’t experienced any further issues so far.

Setup here is:
Drone Agent -> nginx (ssl+external ip) -> drone server

@patrickjahns thanks for the update! Keep an eye on things and let me know if you see any changes in behavior.

@jmccann and @vaijab let me know if you are able to get the latest server and agent and monitor for a few days and see if the issue has been resolved.

Thanks!

Drone agents have been solid since the heartbeat was added. Thanks!