[answered] Pipeline keeps running even though Runner is offline

I have a Docker runner that is not online 24/7. The server does not notice when the runner shuts down and does not stop the pipeline, even though the 4h timeout has been exceeded (by a lot).

Steps to reproduce:

  • Start Build
  • Shut down Runner

Expected behaviour:
Drone stops the pipeline after the timeout is exceeded.

Actual Behaviour:
The Pipeline is forever displayed as running.

Version:

  • Drone Server: 1.9
  • Drone Runner: 1.6

Runners must be gracefully terminated; they must not be force-terminated while pipelines are running, otherwise those pipelines are stuck in a running state.

The server does not keep track of runner connectivity for a number of reasons. For example, connections are not persistent: the runners use long polling and frequently connect and disconnect to avoid TCP timeouts, which are common in many corporate networks. If you stop or restart the server while builds are running, or the runner loses connectivity with the server, the runner is able to keep running pipelines and upload the results using a backoff once it re-establishes a connection. This decentralized design makes the system more resilient to outages and flaky networks, but the tradeoff is that you must not shut down a runner while it is running a pipeline.
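To illustrate the "upload the results using a backoff" part, here is a minimal Python sketch of retrying with exponential backoff. It is not Drone's actual code (Drone is written in Go); the function name `upload_with_backoff`, its parameters, and the delays are made up for illustration.

```python
import time
from typing import Callable


def upload_with_backoff(
    send: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used.

    Hypothetical helper: send() stands in for "upload the pipeline results"
    and is expected to raise ConnectionError while the server is unreachable.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    raise ValueError("max_attempts must be at least 1")
```

The point of the sketch is that the runner side owns the retry loop, so the server never needs to know whether a runner is currently connected.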

The server does scan for stuck jobs every 24 hours and terminates them. If you want to reduce the interval and scan more frequently, you can adjust the cleanup intervals and deadlines by passing the following environment variables to your Drone server:

DRONE_CLEANUP_INTERVAL=1h
DRONE_CLEANUP_DEADLINE_RUNNING=1h
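Conceptually, each scan picks out the builds that have been in the running state for longer than the running deadline. A rough Python sketch of that check follows; the `Build` type and `find_stuck` function are hypothetical and only model the idea, not the server's real data model or code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Build:
    number: int
    status: str        # e.g. "pending", "running", "success"
    started: datetime  # when the build entered the running state


def find_stuck(builds: List[Build], now: datetime, deadline: timedelta) -> List[Build]:
    """Return builds that have been running for longer than the deadline."""
    return [
        b for b in builds
        if b.status == "running" and now - b.started > deadline
    ]
```

With a one-hour deadline, a build that started five hours ago and is still marked as running would be selected, while one that started ten minutes ago would be left alone. Note that under this model a stuck build is only caught at the next scan, so the worst-case delay is roughly the deadline plus the scan interval.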

Just for clarification: if a (Docker) runner gets a signal to stop gracefully, it 1. stops accepting (read: polling for) new jobs to execute and 2. blocks until its currently executing jobs have finished before exiting itself?

Setting the runner’s container grace period longer than the maximum build timeout (default 60min) should do the job, as long as the OS itself does not reap the processes at some point.
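To make the "stop accepting, then wait for running jobs" behaviour and the role of the grace period concrete, here is a small Python sketch. It is an assumption about how such a drain could work, not the runner's actual implementation; `DrainingRunner`, `accept` and `shutdown` are hypothetical names.

```python
import threading
import time
from typing import Callable, List


class DrainingRunner:
    """Hypothetical runner that drains running jobs on shutdown."""

    def __init__(self) -> None:
        self._stopping = threading.Event()
        self._jobs: List[threading.Thread] = []

    def accept(self, job: Callable[[], None]) -> bool:
        """Start a job unless shutdown has begun; report whether it was accepted."""
        if self._stopping.is_set():
            return False
        thread = threading.Thread(target=job)
        self._jobs.append(thread)
        thread.start()
        return True

    def shutdown(self, grace_seconds: float) -> bool:
        """Stop accepting jobs, then wait up to grace_seconds for running ones.

        Returns True if every job finished within the grace period.
        """
        self._stopping.set()
        deadline = time.monotonic() + grace_seconds
        for thread in self._jobs:
            # Wait only for whatever is left of the shared grace period.
            thread.join(max(0.0, deadline - time.monotonic()))
        return not any(thread.is_alive() for thread in self._jobs)
```

In a real process `shutdown` would be triggered by the stop signal. If it returns False, the grace period was too short, which corresponds to the container being killed while a pipeline is still running. That is the case a grace period longer than the build timeout is meant to avoid.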

Would it be hard to implement a different, configurable strategy for runners that automatically cancels executing jobs (and notifies the server) before exiting?