Agent failing to pick up builds (1.0.0-rc.5)

Running drone/drone:1.0.0-rc.5

I’m able to bring up the server and configure repositories; however, when starting the agent the following error keeps showing up in the logs:

2019/02/15 18:53:11 [DEBUG] POST https://drone.example.com/rpc/v1/request (status: 524): retrying in 1s (30 left)
2019/02/15 18:53:41 [ERR] POST https://drone.example.com/rpc/v1/request request failed: Post https://drone.example.com/rpc/v1/request: context deadline exceeded

Apparently the error is expected due to the long-polling connection used by the agents. However, the agent is not picking up any builds.
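
For context, a typical agent configuration at this version looks roughly like the following. This is a minimal docker-compose style sketch with placeholder values; the environment variable names assume the standard drone/agent 1.0.0-rc.5 image, and the server address and secret correspond to the rpc settings in the trace below.

version: '2'
services:
  drone-agent:
    image: drone/agent:1.0.0-rc.5
    restart: always
    volumes:
      # the agent needs the host docker socket to launch build containers
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      # server address the agent long-polls for builds
      - DRONE_RPC_SERVER=https://drone.example.com
      # must match the server's rpc secret
      - DRONE_RPC_SECRET=secret
      # number of builds the agent may run in parallel
      - DRONE_RUNNER_CAPACITY=2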

I do see the following being printed in the server’s logs whenever a new agent is started:

DEBU[1575] manager: request queue item                   arch=amd64 kernel= os=linux variant=

Here is the trace from the server with the runtime config:

license: ""
authn:
  endpoint: ""
  token: ""
  skipverify: false
cron:
  disabled: false
  interval: 30m0s
cloning:
  alwaysauth: false
  username: ""
  password: ""
  image: ""
  pull: IfNotExists
database:
  driver: postgres
  datasource: postgres://db-connection-string
  secret: ""
docker:
  config: ""
http:
  allowedhosts: []
  hostsproxyheaders: []
  sslredirect: false
  ssltemporaryredirect: false
  sslhost: ""
  sslproxyheaders: {}
  stsseconds: 0
  stsincludesubdomains: false
  stspreload: false
  forcestsheader: false
  browserxssfilter: true
  framedeny: true
  contenttypenosniff: false
  contentsecuritypolicy: ""
  referrerpolicy: ""
logging:
  debug: false
  trace: true
  color: true
  pretty: true
  text: true
proxy:
  addr: https://drone.example.com
  host: drone.example.com
  proto: https
registration:
  closed: false
registries:
  endpoint: ""
  password: ""
  skipverify: false
repository:
  filter: []
runner:
  local: false
  image: drone/controller:1.0.0-rc.5
  platform: linux/amd64
  os: linux
  arch: amd64
  kernel: ""
  variant: ""
  machine: drone-server-9784c9db7-dk6s6
  capacity: 0
  labels: {}
  volumes: []
  networks: []
  devices: []
  privileged: []
  environ: {}
  limits:
    memswaplimit: 0
    memlimit: 0
    shmsize: 0
    cpuquota: 0
    cpushares: 0
    cpuset: ""
nomad:
  enabled: false
  datacenters:
  - dc1
  namespace: ""
  region: ""
  prefix: drone-job-
  image: drone/controller:1.0.0-rc.5
  imagepull: false
  memory: 1024
  cpu: 20
kube:
  enabled: false
  namespace: ""
  path: ""
  url: ""
  ttl: 300
  serviceaccountname: ""
  pullpolicy: Always
  image: drone/controller:1.0.0-rc.5
rpc:
  server: https://drone.example.com
  secret: secret
  debug: true
  host: drone.example.com
  proto: https
s3:
  bucket: ""
  prefix: ""
secrets:
  endpoint: ""
  password: ""
  skipverify: false
server:
  addr: https://drone.example.com
  host: drone.example.com
  port: :443
  proto: https
  acme: false
  cert: /opt/tls/example.io.crt
  key: /opt/tls/example.io.key
session:
  timeout: 720h0m0s
  secret: lpcL39slcwdQk45ExbeG4l91TnrZtAcg
status:
  desc: ""
  label: ""
users:
  create:
    username: user
    machine: false
    admin: true
    token: ""
webhook:
  endpoint: []
  secret: ""
  skipverify: false
yaml:
  endpoint: ""
  secret: ""
  skipverify: false
bitbucket:
  clientid: ""
  clientsecret: ""
  skipverify: false
gitea:
  server: ""
  skipverify: false
github:
  server: https://github.com
  apiserver: https://api.github.com
  clientid: clientid
  clientsecret: clientsecret
  skipverify: false
  scope:
  - repo
  - repo:status
  - user:email
  - read:org
  ratelimit: 0
gitlab:
  server: https://gitlab.com
  clientid: ""
  clientsecret: ""
  skipverify: false
gogs:
  server: ""
  skipverify: false
stash:
  server: ""
  consumerkey: ""
  consumersecret: ""
  privatekey: ""
  skipverify: false
2019/02/15 18:53:11 [DEBUG] POST https://drone.example.com/rpc/v1/request (status: 524): retrying in 1s (30 left)
2019/02/15 18:53:41 [ERR] POST https://drone.example.com/rpc/v1/request request failed: Post https://drone.example.com/rpc/v1/request: context deadline exceeded

For others that see these in their logs and are wondering what they mean, I wanted to provide a brief explanation. It is normal to see context deadline exceeded when rpc debugging is enabled. Drone uses long polling with a 30 second timeout. When the timeout is reached an error is written to the logs, and then Drone reconnects and resumes polling. This repeats until the agent pulls a build from the queue. So it is normal to see this in the logs, and it is not indicative of a problem.

The rpc debugger can still surface valuable information. For example, if there are connectivity or authentication issues between your agent and server, the rpc debugger can help you track them down. You would see something like this in the logs:

2019/02/15 11:49:00 [DEBUG] POST http://drone.company.com/rpc/v1/request: retrying in 2s (29 left)
2019/02/15 11:49:00 [ERR] POST http://drone.company.com/rpc/v1/request request failed: Post http://drone.company.com/rpc/v1/request: dial tcp: lookup host.docker.internal: no such host

or like this:

2019/02/15 11:49:54 [ERR] POST http://localhost/rpc/v1/request request failed: Post http://localhost/rpc/v1/request: dial tcp [::1]:80: connect: connection refused
2019/02/15 11:49:54 [DEBUG] POST http://localhost/rpc/v1/request: retrying in 1s (30 left)

It is of course possible that some man-in-the-middle service (load balancer, proxy, etc.) is not correctly passing the request to the server, so this is something you should test for. You should also make sure the load balancer has a timeout greater than 35 seconds (less than 60 seconds is fine); a sample ingress configuration is sketched after the log excerpt below. You can verify agents are connecting to the server by looking at your server logs and checking for the following debug entry:

manager: request queue item
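
If the server sits behind a proxy or ingress controller, raising the proxy's read timeout above the 30 second poll window is worth trying. The following is only an illustrative sketch assuming the Kubernetes ingress-nginx controller; the resource name, host, and backend service are placeholders:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: drone
  annotations:
    # keep the long-poll request open longer than the 30s poll window
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
spec:
  rules:
  - host: drone.example.com
    http:
      paths:
      - backend:
          serviceName: drone
          servicePort: 80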

If the agents are connecting to the server, the most common root cause for not pulling a build is incorrect node filtering (defined in the yaml) or agent labels with no corresponding node filters in the yaml. Another common root cause is agents running on arm architectures or windows servers while the yaml is missing the platform section. Lastly, a proxy or load balancer timing out connections too soon can also cause problems.
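
As a hedged illustration of the label and platform point (the label key, value, and architecture below are placeholders, not taken from this thread): an agent started with runner labels only receives pipelines whose yaml carries a matching node section, and an agent on a non linux/amd64 machine only receives pipelines whose yaml declares that platform. Roughly:

# Agent side (environment): labels the agent advertises
#   DRONE_RUNNER_LABELS=instance:builder-1

# Pipeline side (.drone.yml): node filter and platform must match the agent
kind: pipeline
name: default

platform:
  os: linux
  arch: arm64             # only needed when the agent is not linux/amd64

node:
  instance: builder-1     # must match the agent's DRONE_RUNNER_LABELS

steps:
- name: build
  image: alpine
  commands:
  - echo hello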

Thanks, Brad.
The issue was due to mismatched labels set on the build agents.