Containers that run longer than 1 minute are killed and exit cleanly

I’ve been running into this issue lately with Docker builds. We have a couple of build stages that run for longer than 1 minute and have been exiting before they’re actually finished. This is present in both 0.7.3 and 0.8.1. I was able to replicate this behavior with a simple script.

#!/bin/sh
period=60
echo "sleeping $period"
sleep $period
exit 1

This script should fail the stage, but since it runs for a full minute, the container is cleaned up before it exits. If the period is reduced to 59 seconds, it works as expected. I’ve been chasing this issue down for the last couple of days, so any help would be appreciated! If anyone has seen something like this, please let me know!

I was unable to reproduce this with the script provided above.

We have a large number of active installs and I have not received any similar reports, which leads me to believe this is an isolated issue.

The behavior you are describing almost sounds like you are running the script as a service container (perhaps unknowingly). Can you provide a simple yaml file and script, confirm it reproduces the problem in your environment, and then paste a copy here?

This is the repo I used to replicate the issue.

One thing to note: We run all of our agents on Kubernetes and override DOCKER_HOST. This allows all agents to run containers against the same Docker daemon. I suspected that it was a network issue, but we run the same setup with Jenkins and have not experienced the same problems. Also, the consistent 60s kind of rules that out.
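
Concretely, the override is just the standard Docker client environment variable pointed at the shared daemon; a rough sketch (the hostname here is made up):

# hypothetical example: every agent talks to one shared daemon instead of its local socket
export DOCKER_HOST=tcp://shared-docker.internal.example.com:2375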

Thanks for posting the sample. Unfortunately, I was still unable to reproduce.

Well, thanks for trying to reproduce. I’m going to try bumping the Docker version on the Docker host to eliminate that possibility.

Also note that many people running drone on kubernetes run the agent and a docker:dind container in the same pod. The agent interacts with the dind daemon instead of the host machine daemon. This approach seems to be preferred among the kubernetes crowd, and may eliminate the need for you to update the host machine docker daemon.
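
As a rough sketch of the idea (the image tag, names, and plain-TCP port are assumptions; older docker:dind images listen on 2375 without TLS, and the agent’s usual server/secret settings are omitted), it looks something like this, with the two containers living in the same pod on Kubernetes:

# illustrative only: run a dind daemon next to the agent and point the agent at it
docker run -d --privileged --name dind docker:dind
docker run -d --link dind:docker -e DOCKER_HOST=tcp://docker:2375 drone/agent:0.8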

Thanks for the advice!

One of the reasons we run everything against a single Docker host is to take advantage of a shared cache. Regardless, I was actually able to find some interesting info based on that.

I spun everything up locally (via Docker Compose) and pointed the agent at the remote host. I experienced the same behavior. Then I pulled the dind image (at the same Docker version as the remote host) and pointed the agent at that container. Everything worked as expected, and it successfully moved past the 60s mark. So… something tells me the ELB that’s fronting the remote host is causing the connection to drop after 60s.

Yep, it ended up being the ELB. It had an Idle Connection Timeout of 60s. Bumping it to 61 lets a job finish, but if the stage runs past the connection timeout, it still exits cleanly when it probably shouldn’t. I’m not sure if this is a bug or not.

I believe ELB defaults to a 1 minute idle timeout.

If I understand correctly, this would mean that even if you have 10 drone agents, they would all connect to a central docker daemon and launch all of your builds on that single central server. This probably works better with Jenkins, but given that drone runs all build steps in containers, I might not recommend this approach.

Yep :frowning: I guess I discovered that the hard way.

Yep, it works well for us. We didn’t see this problem on Jenkins but I imagine Drone is doing something slightly different than the docker client or docker-compose.

Yes, in this case it is a bit apples to oranges. Jenkins runs your builds and build scripts on the host machine (by default). Drone runs your builds and build scripts in containers.

Let’s say you have the following yaml:

pipeline:
  build:
    image: alpine
    commands: [ sleep 60 ]

The drone agent is essentially doing something like this:

docker run alpine /bin/sh -c "sleep 60"

If the docker host points to a central remote docker daemon, this means all of your build steps will be executed on that remote machine, which will undermine any agent clustering you have in place.

Oh yeah, I understand that. I guess what I was referring to is that we did image builds against that machine, through the ELB, that ran for more than 1 minute, and they were fine. The Drone agent is essentially doing the same thing, but it seems like once the connection gets dropped it assumes “success”.

ah ok, makes sense.

drone uses the container exit code to determine the build status, and doesn’t make any assumptions. So this would tell me the container step is returning a zero exit code. But if you are able to consistently show otherwise, I definitely encourage sending a PR to help improve :slight_smile:
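
For reference, you can check what the daemon itself reports for a step like yours; a quick sketch with the plain docker CLI (the container name is arbitrary):

# run the failing script in a throwaway container, then ask the daemon for its exit code
docker run --name sleeptest alpine /bin/sh -c "sleep 60; exit 1"
docker wait sleeptest                                    # blocks until exit, prints the code (1)
docker inspect --format '{{.State.ExitCode}}' sleeptest  # same value via inspect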

Before I dig in and submit a PR, I have 2 questions:

  1. Does the agent leave a connection open for the duration of the stage execution?
  2. When that connection is severed, does the agent verify the status of the container? At this point, I can’t tell if the connection is getting severed and Drone is removing the container, or if Docker is removing the container because the connection was severed, which could potentially result in the status being misreported.

The script I’ve been testing with does end with exit 1, which makes me think the container is getting killed, for one reason or another, before it actually completes.

I think the configuration is still unclear to me. Is the agent connecting to a remote, central docker daemon? If yes, I think this requires a change in approach vs a patch. Such a configuration might make sense for Jenkins, but there are alternative approaches that are better suited for Drone.

Yes

Yes, we use the docker wait endpoint, which blocks and waits for the container result. If the request is broken, it returns an error, which causes the build to error.
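
You can see where a 60s idle timeout would bite by hitting that endpoint by hand; a sketch against a local daemon socket (the timings are illustrative):

# block on the Engine API wait endpoint; nothing crosses the wire until the container
# exits, so a proxy/ELB idle timeout in the middle can cut the request off mid-wait
CID=$(docker run -d alpine /bin/sh -c "sleep 90; exit 1")
curl -s -X POST --unix-socket /var/run/docker.sock "http://localhost/containers/$CID/wait"
# once the container stops: {"StatusCode":1}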

At the risk of beating a dead horse, I think it makes sense for Drone as well. If you have, let’s say, 5 Drone agents pointing at the same Docker host, you only need to pull a stage image once. Each subsequent build that uses that image will start more quickly. If you were using 5 individual Docker hosts (1 per agent), at worst you’d have to pull the image 5 times, provided you hit a different agent each time. Also, when building Docker images, having access to a shared cache helps keep build times down. If a build fails at a particular step, you can reuse the layers that were previously built instead of rebuilding them. If you built against a different agent each time, you run the risk of having to redo everything you just did.

As far as this issue is concerned, that’s neither here nor there.

From what I’m seeing, this isn’t the case. Once the ELB times out the connection, the stage reports success, the container exits prematurely and is removed. Like I said, I can’t tell if Docker is removing the container or if Drone is. I’ll try to figure out how to get set up for development and see what’s going on.

If your agents are connecting to a single docker daemon, there is no reason to have multiple agents because all of your pipelines will effectively run on that single machine. There would be no horizontal scaling possible when using a single daemon. So in this case you would be better off creating a single agent, connecting it to a single docker daemon, and configuring the agent to execute up to 5 concurrent pipelines using DRONE_MAX_PROCS=5.
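
A minimal sketch of that single-agent setup (the server address, secret, and shared-daemon hostname are placeholders):

# hypothetical: one agent pointed at the shared daemon, running up to five concurrent pipelines
docker run -d \
  -e DOCKER_HOST=tcp://shared-docker.internal.example.com:2375 \
  -e DRONE_SERVER=wss://drone.example.com/ws/broker \
  -e DRONE_SECRET=placeholder \
  -e DRONE_MAX_PROCS=5 \
  drone/agent:0.8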

I do apologize if I am missing something; I am not trying to be difficult. I just want to make sure you have a supported, dependable drone installation that will grow with your team.

Nope, that totally makes sense! I appreciate the insight. I didn’t realize that adjusting DRONE_MAX_PROCS would increase the number of concurrent pipelines per agent. I guess it’s a trade-off. If we get to the point where we have scaling issues because of that, we’ll rethink our strategy. At this point, we haven’t seen any performance hits from it, just this ELB issue.

The whole reason we did that was that our CI code was so bad about cleaning up after failures that we needed to get it off of our Kubernetes cluster. Since Drone is really good about cleanup, we may be able to get away with it.

IMO, there’s still a bug here, though, since the connection to the daemon was severed and the stage reported success. A workaround is to increase the Idle Connection Timeout, since there isn’t any keep-alive when calling the wait endpoint.
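
For anyone else who runs into this, raising the idle timeout on a classic ELB is a one-liner; a sketch with the AWS CLI (the load balancer name and value are examples):

# raise the idle timeout so long-running wait calls aren't dropped mid-request
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-docker-elb \
  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":600}}"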