One root cause of this error is the Docker socket not being properly mounted into the agent. Here are some things to check:
check that the runner is started with --volume=/var/run/docker.sock:/var/run/docker.sock
make sure the mapping is correct
make sure the socket exists on the host machine. Some Linux distributions, such as CentOS and CoreOS, may place the socket at a different location.
make sure SELinux is not interfering
make sure Drone does not start before the Docker socket is initialized (no longer an issue in newer versions of the Drone runner, since it will not start until Docker is confirmed to be available)
is Docker restarting for some reason? What do you see in the Docker daemon logs? Does Docker hang when you run commands from the terminal? In the past there have been Docker regressions that caused the daemon to lock up or panic. Upgrading to newer versions of Docker or containerd has proven successful in resolving these issues.
is Docker being automatically upgraded? If so, this should be disabled, because automatic upgrades can cause unexpected Docker downtime in the middle of a running build.
please make sure you are running the very latest Docker version (and Drone, for that matter). This helps avoid troubleshooting issues that have already been resolved. A shell sketch of these checks follows below.
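Here is a rough shell version of the checklist, assuming a systemd-based host and an agent container named drone-agent (adjust the names and paths for your setup):

```bash
# 1. confirm the socket exists on the host and note its location
ls -l /var/run/docker.sock

# 2. confirm the agent container actually has the socket mounted
docker inspect --format '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ "\n" }}{{ end }}' drone-agent

# 3. check whether SELinux is enforcing (CentOS / RHEL / Fedora)
getenforce

# 4. check whether the Docker daemon is healthy and look for restarts or panics
docker info
sudo journalctl -u docker --since "1 hour ago" | tail -n 50

# 5. check the versions you are running
docker version
```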
Thanks to all contributors for Drone.
I have drone/drone:1.0.0 and drone/agent:1.0.0 running via docker-compose on a Google Cloud n1-standard-1 (1 vCPU, 3.75 GB memory) Instance, 20GB free disk.
I’m looking for suggestions on how to diagnose the following:
The instance has been semi-usable for me for several days. On each GitHub repo push, a task is correctly created, but about 6 out of 8 times the task stops immediately with the “default: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?” message.
I click restart up to 10 times, until I see that the cloning begins. It seems I’m always able to make the [clone, postgres service, and the build steps] run as expected, provided I’m patient enough with restarting.
Facts:
volume is mounted
socket is present:
srw-rw---- 1 root docker 0 Mar 22 11:25 /var/run/docker.sock
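For completeness, one way to verify the daemon is actually reachable through that socket from inside a container (the same path the agent uses) is to mount it into a throwaway container that has the Docker CLI. This is only a diagnostic sketch, and the docker:stable image tag is just an example:

```bash
# mount the host socket into a short-lived container and query the daemon through it;
# if this fails, the problem is not specific to Drone
docker run --rm \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  docker:stable docker version
```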
This error comes directly from the Docker Client source code. The client will throw the error if any of the following happens:
there is a request timeout to the Docker daemon
there is a connection refused error when trying to connect to the daemon (I think this error only occurs when using TCP)
there is a “dial unix: no such file or directory” error
You can trace the error in the Docker client code here and here.
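As a sketch, you can also hit the daemon API over the socket directly with curl (the /_ping endpoint is part of the Docker Engine API) to see which of these cases applies on your host:

```bash
# if the daemon is healthy this prints "OK";
# a missing socket file produces the "no such file or directory" case,
# a stopped daemon typically produces "connection refused",
# and a hung daemon causes the request to time out
curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping
```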
If we look at the Drone source code we see there is very little surface area. For example, the code used to connect to Docker is just 4 lines [1], so there is very little opportunity for error within the Drone codebase.
I do recall one individual solved this problem by upgrading Docker to the very latest version. There are documented problems in the moby issue tracker with people getting this error [2]. And there are documented issues in the GitLab issue tracker where they have seen the same error with the GitLab runner [3].
So given that a) I cannot reproduce this problem locally or at cloud.drone.io and b) users of other systems (e.g. GitLab) are experiencing the problem, I am operating under the assumption that this is most likely an issue with Docker, or perhaps a host machine configuration issue. I am ready to help if there are actionable improvements we can make to Drone; however, unless we identify an issue with Drone, there is unfortunately little I can do on my end.
@topiaruss I also updated the original post to include this second common root cause. You might want to check whether it applies to your installation.
Bingo!
I did NOT have that flag set.
Adding it seems to have made an improvement. I’ll confirm later, after some more experience.
I needed to run docker-compose down, then up, to make sure the new environment variables were picked up. After that it seemed fine.
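For anyone else hitting this: Compose only applies changed environment variables when the containers are recreated, so a sequence like the following (run from the directory containing your docker-compose.yml) is needed:

```bash
# recreate the containers so the new environment variables take effect
docker-compose down
docker-compose up -d
```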
Thanks for coming back to this!
–r.
I’ve switched back to the CentOS default Docker package instead of docker-ce and it works so far. There is an “issue” with SELinux and passing the Docker socket to a container, because SELinux blocks access to /var/run/docker.sock. This is also expected behavior:
The aggravating thing is, this is exactly what we want SELinux to prevent. If a container process got to the point of talking to the /var/run/docker.sock, you know this is a serious security issue. Giving a container access to the Docker socket, means you are giving it full root on your system.
To allow connections anyway, you have to set up a custom SELinux policy or run the agent container (if you are not using a single-server setup) in privileged mode, for example as sketched below.
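As a rough sketch (the image tag comes from earlier in this thread, the Drone configuration env vars are omitted, and --security-opt label=disable is a narrower alternative to writing a full custom policy): first confirm SELinux is really the blocker, then either relax labeling for just that container or run it privileged:

```bash
# look for SELinux denials on the Docker socket (requires the audit tools)
sudo ausearch -m avc -ts recent | grep docker.sock

# option 1: disable SELinux label separation for just this container
docker run -d \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --security-opt label=disable \
  drone/agent:1.0.0   # Drone configuration env vars omitted for brevity

# option 2: run the agent in privileged mode (broader than strictly necessary)
docker run -d \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --privileged \
  drone/agent:1.0.0
```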
We’re experiencing this on a regular basis. Updating our Docker version is non-trivial because we’re on Google Kubernetes Engine, and it would mean downtime for our servers, so as a result we’re effectively restarting all our agents by hand multiple times per day. I wouldn’t be surprised if Docker crashing and restarting was causing this.
Short of debugging that, would it be reasonable to restart the agents or otherwise help the agents reconnect to docker in situations like these? Has any work been done in this direction already?
Following up a week later: updating GKE (and therefore Docker) and updating to the latest Drone (1.3) solved this for us. We’re no longer seeing the issue. Thanks much!