Zombie Builds long past timeout

We’ve had a number of problems with hanging builds, that is, builds that are still in a “running” status (in some parts of the UI at least) long after they should have timed out.

I’m looking at one now that started over 29 hours ago, and the build page shows it as running. It is not in the “recent builds” panel, but it is in /api/queue.

I suspect this happens if the runner process exits unexpectedly. I’ve found old containers on nodes from long-abandoned builds (usually service containers, I think).

My best way to find these is to look at /api/queue and filter for old builds. I then have to scan repos and builds for the correct ID, since the queue doesn’t give me the numbers the rest of the API endpoints need. Then I can look at them and press the cancel button. It’s a bit of an involved script, and I’d rather not need to depend on it.
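For anyone wanting to do the same, the filtering step could look roughly like the sketch below. It only covers picking old entries out of an already-decoded /api/queue response. The field name "created" and the Unix-seconds format are assumptions, not something confirmed here, so check them against what your server actually returns. The sample data is made up.

```python
import time


def find_stale(queue_items, max_age_seconds, now=None, ts_field="created"):
    """Return queue entries whose timestamp is older than max_age_seconds.

    queue_items: list of dicts decoded from the /api/queue JSON response.
    ts_field:    name of a Unix-seconds timestamp field. This is an
                 assumption; verify it against your server's response.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    # Entries without the timestamp field are skipped rather than guessed at.
    return [
        item for item in queue_items
        if ts_field in item and item[ts_field] < cutoff
    ]


# Hypothetical sample standing in for a decoded /api/queue response.
sample = [
    {"id": 1, "created": 1_000},
    {"id": 2, "created": 100_000},
]
stale = find_stale(sample, max_age_seconds=24 * 3600, now=110_000)
print([item["id"] for item in stale])  # [1]
```

Mapping each stale entry back to a repo and build number is still the awkward part, and is not covered here.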

Is this kind of zombie reaping something the Drone server should be doing?

Drone has an internal reaper [1] that was added back in June. However, the default settings are conservative, and it could take up to 48 hours to reap a zombie build. You can override the defaults to run the reaper more often and adjust the deadlines using these parameters:

DRONE_CLEANUP_INTERVAL=24h
DRONE_CLEANUP_DEADLINE_RUNNING=24h
DRONE_CLEANUP_DEADLINE_PENDING=24h
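To make the "up to 48 hours" figure concrete, here is a small sketch of the arithmetic. It assumes the reaper wakes once per interval and cancels any build that is older than the deadline at that moment; that is a simplified model, not a description of the actual implementation. The 30m and 2h values are purely illustrative, not recommendations.

```python
from datetime import timedelta


def worst_case_reap_delay(interval: timedelta, deadline: timedelta) -> timedelta:
    """Upper bound on the time from build start until the build is reaped.

    Assumed model: the reaper runs once per `interval` and cancels builds
    older than `deadline`. A build that crosses the deadline just after a
    pass has to wait almost one full interval for the next pass.
    """
    return deadline + interval


# Values shown above: 24h interval, 24h running deadline.
print(worst_case_reap_delay(timedelta(hours=24), timedelta(hours=24)))
# 2 days, 0:00:00

# Illustrative tighter settings (hypothetical values).
print(worst_case_reap_delay(timedelta(minutes=30), timedelta(hours=2)))
# 2:30:00
```

Under this model, the running deadline should stay above your longest legitimate build timeout, otherwise the reaper would presumably cancel builds that are still alive.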

“I suspect this happens if the runner process exits unexpectedly. I’ve found old containers on nodes from long abandoned builds”

Yes, this is the typical root cause for a zombie build. The timeout is enforced by the runner, which means if the runner exits unexpectedly, the build gets stuck in a running state. The reaper is not meant to enforce the timeout – that responsibility remains with the runner. The reaper just automates cleanup of zombie builds, which previously had to be done manually.

[1] https://github.com/drone/drone/blob/master/service/canceler/reaper/reaper.go