One root cause of this error is the Docker socket not being properly mounted into the agent. Here are some things to check:
check that the runner is started with --volume=/var/run/docker.sock:/var/run/docker.sock
make sure the mapping is correct
make sure the socket exists on the host machine. Some Linux distributions, such as CentOS and CoreOS, may place the socket at a different location.
make sure SELinux is not interfering
make sure Drone does not start before the Docker socket is initialized (no longer an issue in newer versions of the Drone runner, since it will not start until Docker is confirmed to be available)
is Docker restarting for some reason? What do you see in the Docker daemon logs? Does Docker hang when you run commands from the terminal? In the past there have been Docker regressions that caused the daemon to lock up or panic. Upgrading to newer versions of Docker or containerd has proven successful in resolving these issues.
is Docker being automatically upgraded? If so, this should be disabled, because automatic upgrades can cause unexpected Docker downtime in the middle of a running build.
please make sure you are running the very latest Docker version (and Drone, for that matter). This helps avoid troubleshooting issues that have already been resolved. A shell sketch of these checks follows below.
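Here is a rough shell version of the checklist, assuming a systemd-based host and an agent container named drone-agent (adjust the names and paths for your setup):

```bash
# 1. confirm the socket exists on the host and note its location
ls -l /var/run/docker.sock

# 2. confirm the agent container actually has the socket mounted
docker inspect --format '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ "\n" }}{{ end }}' drone-agent

# 3. check whether SELinux is enforcing (CentOS / RHEL / Fedora)
getenforce

# 4. check whether the Docker daemon is healthy and look for restarts or panics
docker info
sudo journalctl -u docker --since "1 hour ago" | tail -n 50

# 5. check the versions you are running
docker version
```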
Thanks to all contributors for Drone.
I have drone/drone:1.0.0 and drone/agent:1.0.0 running via docker-compose on a Google Cloud n1-standard-1 (1 vCPU, 3.75 GB memory) Instance, 20GB free disk.
I’m looking for suggestions on how to diagnose the following:
The instance has been semi-usable for me for several days. On each GitHub repo push, a task is correctly created, but about 6 out of 8 times the task stops immediately with the “default: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?” message.
I click restart up to 10 times, until I see that the cloning begins. It seems I’m always able to make the [clone, postgres service, and the build steps] run as expected, provided I’m patient enough with restarting.
Facts:
volume is mounted
socket is present:
srw-rw---- 1 root docker 0 Mar 22 11:25 /var/run/docker.sock
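For completeness, one way to verify the daemon is actually reachable through that socket from inside a container (the same path the agent uses) is to mount it into a throwaway container that has the Docker CLI. This is only a diagnostic sketch, and the docker:stable image tag is just an example:

```bash
# mount the host socket into a short-lived container and query the daemon through it;
# if this fails, the problem is not specific to Drone
docker run --rm \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  docker:stable docker version
```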
This error comes directly from the Docker Client source code. The client will throw the error if any of the following happens:
there is a request timeout to the Docker daemon
there is a connection refused error when trying to connect to the daemon (I think this error only occurs when using TCP)
there is a “dial unix: no such file or directory” error
You can trace the error in the Docker client code here and here.
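As a sketch, you can also hit the daemon API over the socket directly with curl (the /_ping endpoint is part of the Docker Engine API) to see which of these cases applies on your host:

```bash
# if the daemon is healthy this prints "OK";
# a missing socket file produces the "no such file or directory" case,
# a stopped daemon typically produces "connection refused",
# and a hung daemon causes the request to time out
curl --max-time 5 --unix-socket /var/run/docker.sock http://localhost/_ping
```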
If we look at the Drone source code we see there is very little surface area. For example, the code used to connect to Docker is just 4 lines [1], so there is very little opportunity for error within the Drone codebase.
I do recall one individual solved this problem by upgrading Docker to the very latest version. There are documented problems in the moby issue tracker with people getting this error [2]. And there are documented issues in the GitLab issue tracker where they have seen the same error with the GitLab runner [3].
So given that a) I cannot reproduce this problem locally or at cloud.drone.io and b) users of other systems (e.g. GitLab) are experiencing the problem, I am operating under the assumption that this is most likely an issue with Docker, or perhaps a host machine configuration issue. I am ready to help if there are actionable improvements we can make to Drone; however, unless we identify an issue with Drone, there is unfortunately little I can do on my end.
@topiaruss I also updated the original post to include this second common root cause. You might want to check whether it applies to your installation.
Bingo!
I did NOT have that flag set.
Adding it seems to have made an improvement. I’ll confirm later, after some more experience.
I needed to run docker-compose down, then up, to make sure the new environment variables were picked up. After that it seemed fine.
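For anyone else hitting this: Compose only applies changed environment variables when the containers are recreated, so a sequence like the following (run from the directory containing your docker-compose.yml) is needed:

```bash
# recreate the containers so the new environment variables take effect
docker-compose down
docker-compose up -d
```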
Thanks for coming back to this!
–r.
I’ve switched back to the CentOS default Docker package instead of docker-ce and it works so far. There is an “issue” with SELinux and passing the Docker socket to a container, because SELinux blocks access to /var/run/docker.sock. This is also expected behavior:
The aggravating thing is, this is exactly what we want SELinux to prevent. If a container process got to the point of talking to the /var/run/docker.sock, you know this is a serious security issue. Giving a container access to the Docker socket, means you are giving it full root on your system.
To allow connections anyway, you have to set up a custom SELinux policy or run the agent container (if you are not using a single-server setup) in privileged mode, for example as sketched below.
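As a rough sketch (the image tag comes from earlier in this thread, the Drone configuration env vars are omitted, and --security-opt label=disable is a narrower alternative to writing a full custom policy): first confirm SELinux is really the blocker, then either relax labeling for just that container or run it privileged:

```bash
# look for SELinux denials on the Docker socket (requires the audit tools)
sudo ausearch -m avc -ts recent | grep docker.sock

# option 1: disable SELinux label separation for just this container
docker run -d \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --security-opt label=disable \
  drone/agent:1.0.0   # Drone configuration env vars omitted for brevity

# option 2: run the agent in privileged mode (broader than strictly necessary)
docker run -d \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --privileged \
  drone/agent:1.0.0
```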
We’re experiencing this on a regular basis. Updating our Docker version is non-trivial because we’re on Google Kubernetes Engine, and it would mean downtime for our servers, so as a result we’re effectively restarting all our agents by hand multiple times per day. I wouldn’t be surprised if Docker crashing and restarting was causing this.
Short of debugging that, would it be reasonable to restart the agents or otherwise help the agents reconnect to docker in situations like these? Has any work been done in this direction already?
Following up a week later: updating GKE (and therefore Docker) and updating to the latest Drone (1.3) solved this for us. We’re no longer seeing the issue. Thanks much!