Drone autoscaler agent cannot connect to grpc port

This is likely something simple that I am just missing. I am running drone in kubernetes along with the drone autoscaler. I can use the drone CLI to create an agent server (I’m using DigitalOcean), which seems to get set up correctly. However, when I log into that agent server and look at the agent container’s logs, I see a ton of:

INFO: 2018/04/17 04:29:34 transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.

With a few of these every now and then:

2018/04/17 04:29:31 grpc error: done(): code: Unavailable: rpc error: code = Unavailable desc = transport is closing

This is my current setup, including everything I have checked… so hopefully this is just set up wrong and is a simple fix.

The builder-drone pod exposes ports 8000 (http) and 9000 (grpc). This was set up using the helm chart for drone.

I have an ingress that maps from drone.my.domain.com:443 (I have a cert set up) to the builder-drone pod on port 8000 for the web UI and basic api.

I have an ingress that maps from agent.my.domain.com:80 to the builder-drone pod on port 9000.

The drone-autoscaler pod is set up to talk to the builder-drone pod on port 8000 using local kubernetes dns. It is also configured to hand the agent.my.domain.com:80 address to the DO agents that it starts up.

The kubernetes ingress/services that are set up to do this mapping are based on nginx.

Here are my thoughts/questions on what could be wrong:

  1. Could nginx be messing up the grpc connection if it's treating it like an http connection?
  2. Is the basic premise for this setup correct? That is, if everything above is configured correctly, should it generally work?
  3. Is this setup just doomed to fail?

Could nginx be messing up the grpc connection if it's treating it like an http connection?

Absolutely. Up until a few weeks ago, nginx was not compatible with grpc. Even if you have the absolute latest version of nginx, you would need special configuration to handle proxying grpc, which runs over http2 and not http/1.x.

I recommend the following:

  1. Configure the agents to connect directly to your server, without routing through nginx.
  2. Post a question to kubernetes support about how grpc is supported. This is not my area of expertise, and it is something they will have more insight into.
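
To check the first recommendation from an agent host, a plain TCP probe of the server's grpc port can rule out basic reachability problems before grpc itself is involved. This is only a minimal sketch: `can_connect` is a hypothetical helper name, and the address in the usage comment is a placeholder, not a value from this thread. A successful connect only shows that the port is reachable, not that grpc works end to end.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Usage (placeholder address; substitute the server's real address):
# print(can_connect("203.0.113.10", 9000))
```

If this returns False from the agent host, the problem is at the network or service level rather than in the grpc proxying.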

I just tried to set up nginx with grpc, since the image that supports it was put up today… just to see.

I did some digging, and on the off chance you have an idea of what's going on, I wanted to follow up and ask you.

When I hit the endpoint set up with http2 using curl with the --http2 flag, I get the following log in nginx:

[18/Apr/2018:01:57:15 +0000] "\x16\x03\x01\x02\x00\x01\x00\x01\xFC\x03\x03\x96j'\x8E\x0C\xE8\x99\xE9g\x9A\xC68\xD1m\x916\xCCTQ@'\xFCU{=\xF3\xBAP\x03~\xB1B\x00\x00\x86\xCC\x14\xCC\x13\xCC\x15\xC00\xC0,\xC0(\xC0$\xC0\x14\xC0" 400 174 "-" "-" 0 0.048 [] - - - -
2018/04/18 01:57:20 [error] 686#686: *205353 connect() failed (111: Connection refused) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: agent.my.domain.com, request: "GET / HTTP/2.0", upstream: "grpc://10.2.2.23:9000", host: "agent.my.domain.com"
[18/Apr/2018:01:57:20 +0000] "GET / HTTP/2.0" 200 0 "-" "curl/7.54.0" 42 0.005 [default-builder-drone-9000] 10.2.2.23:9000, 10.2.1.17:9000 0, 0 0.001, 0.004 502, 200

At which time the drone server pops out a log message:

INFO: 2018/04/18 01:57:36 transport: http2Server.HandleStreams failed to read frame: read tcp 10.2.1.17:9000->10.2.2.0:52608: read: connection reset by peer

Which makes it look like things are working. Curl gets back:

HTTP/2 200 
server: nginx/1.13.12
date: Wed, 18 Apr 2018 01:48:45 GMT
content-type: application/grpc
content-length: 0
grpc-status: 8
grpc-message: malformed method name: "/"
strict-transport-security: max-age=15724800; includeSubDomains

Which I think looks good… but I’m including it since I’m not 100% sure this is what drone would return when hitting ‘/’ on grpc.
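
For reading responses like the one above, here is a small sketch that pulls `grpc-status` and `grpc-message` out of a header dump and maps the numeric code to its name in the standard grpc status code list. The function name `describe_grpc_status` is made up for illustration. Code 8 is RESOURCE_EXHAUSTED in that list; why the server picks that code for a malformed method name is not something this thread explains. The presence of `content-type: application/grpc` and a `grpc-status` header does at least suggest the request reached a grpc server.

```python
# Standard grpc status codes (0 through 16).
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def describe_grpc_status(raw_headers: str) -> str:
    """Summarize grpc-status / grpc-message from a raw header dump."""
    headers = {}
    for line in raw_headers.splitlines():
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    if "grpc-status" not in headers:
        return "no grpc-status header found"
    code = int(headers["grpc-status"])
    name = GRPC_STATUS_NAMES.get(code, "UNRECOGNIZED")
    return f"{code} {name}: {headers.get('grpc-message', '')}"


# Usage with the headers curl printed above:
# print(describe_grpc_status(open("headers.txt").read()))
```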

However, when I spin up the agent… I get a TON of these in my nginx log:

[18/Apr/2018:01:53:48 +0000] "PRI * HTTP/2.0" 400 174 "-" "-" 0 0.007 [] - - - -

Which makes me think it's not routing it properly, but only when the agent is hitting it…
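
One way to read the two 400 lines in the nginx logs above: "PRI * HTTP/2.0" is the start of the http2 client connection preface, and "\x16\x03\x01" is how a TLS handshake record begins. Both being logged as a request line with a 400 would be consistent with the listener parsing those bytes as http/1.x, though that is an interpretation and not something confirmed in this thread. The sketch below classifies the first bytes of a connection along those lines; `classify_first_bytes` and its labels are made up for illustration.

```python
import re

# The http2 client connection preface (RFC 7540, section 3.5).
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# A rough match for an http/1.x request line such as "GET / HTTP/1.1".
HTTP1_REQUEST_LINE = re.compile(rb"[A-Z]+ \S+ HTTP/1\.[01]\r?\n")


def classify_first_bytes(data: bytes) -> str:
    """Guess which protocol a client is speaking from its first bytes."""
    if data[:2] == b"\x16\x03":
        # 0x16 = TLS handshake record, 0x03 = major protocol version.
        return "tls"
    if data.startswith(b"PRI * HTTP/2.0"):
        # Cleartext http2 sent without an http/1.1 upgrade.
        return "http2-cleartext"
    if HTTP1_REQUEST_LINE.match(data):
        return "http1"
    return "unknown"


# The two request strings nginx logged with a 400:
print(classify_first_bytes(b"\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03"))
print(classify_first_bytes(HTTP2_PREFACE))
```

If the agent really is sending cleartext http2 to port 80, the listener on that port would need to accept http2 without TLS for the connection to get as far as the grpc upstream.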

Thanks for the initial info too! Definitely helped me move forward a bit.