Drone leaving directories behind in Kubernetes

We’re running Drone in Kubernetes, and we recently had a node report disk pressure, so I decided to investigate. Over 80% of the disk usage was from the /tmp/drone directory. It looks like that is where Drone mounts local volumes? That’s fine, except Drone doesn’t appear to clean up after itself. Our most active repository is ~4GB in size. This is going to cause significant node churn if every single build uses (without reclaiming) 4GB of disk. Is there a configuration option or something that I missed to tell Drone to delete those directories after it’s done?

Drone has code in place to clean up after itself, which has been working in my test instance. The relevant code can be found here: https://github.com/drone/drone-runtime/blob/master/engine/kube/kube.go#L263
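
At a high level, that cleanup boils down to removing the per-build directory from the host’s filesystem once the pipeline completes. The following is only a rough sketch of the idea, not the actual drone-runtime code; the path layout and namespace value are illustrative:

package main

import (
	"log"
	"os"
	"path/filepath"
)

// cleanupWorkspace removes the per-build directory that was mounted into
// the pipeline containers as a hostPath volume under /tmp/drone.
func cleanupWorkspace(namespace string) error {
	dir := filepath.Join("/tmp/drone", namespace)
	log.Printf("removing build workspace %s", dir)
	return os.RemoveAll(dir)
}

func main() {
	// the namespace name here is illustrative; Drone generates a random one per build
	if err := cleanupWorkspace("example-build-namespace"); err != nil {
		log.Fatalln(err)
	}
}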

I should warn that the Kubernetes runtime is still considered experimental and is not recommended for production use. You are of course welcome to run it in production, but you may need to get hands-on with the code and issue patches when needed. We have a guide for debugging and contributing to the Kubernetes runtime at Contributing to Drone for Kubernetes.

Interesting. How large is the test cluster you are running on? Our cluster is small, but we have 6-8 worker nodes. One thing I noticed is that the node where the Drone server is running has a clean /tmp/drone directory, but all the other nodes I checked have directories left over in /tmp/drone. I’m not familiar with the intricacies of the Drone application, but I’m wondering if the Drone server is what attempts to perform that Destroy call you linked, so the delete would only succeed if the Drone build job is placed on the same node the Drone server is running on. Is that possible?

We have seen variation across providers (DigitalOcean vs. GKE vs. EKS, etc.). Perhaps some security policy is preventing deletes? I tested with a 4-node DigitalOcean cluster. The code to remove the directory is not run on the server; it is run by the Kubernetes job (drone/controller image). The job uses node affinity to ensure all pipeline containers are scheduled on the same machine as the job [1].

[1] https://github.com/drone/drone-runtime/blob/master/engine/kube/kube.go#L142
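
To illustrate what that means in practice, pinning a pod to the same node comes down to a required node affinity on the kubernetes.io/hostname label. This is just a sketch using client-go API types, not the exact drone-runtime code:

package kube

import v1 "k8s.io/api/core/v1"

// nodeAffinityFor builds an affinity that forces a pod onto the node with
// the given hostname label, which is how the pipeline pods end up on the
// same machine as the controller job.
func nodeAffinityFor(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}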

Hey, I’m trying to poke around and investigate this, but as I mentioned before, I’m very unfamiliar with Drone internals. A couple of things I’ve noticed:

In the kube engine function you linked, I noticed you also delete the namespace right after you attempt to delete the directory (https://github.com/drone/drone-runtime/blob/master/engine/kube/kube.go#L271-L275). We’re seeing our namespaces deleted, so this function is clearly being executed.
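
For reference, the namespace removal in those lines is essentially a client-go namespace delete, roughly like this (illustration only; clientset construction is omitted and the signature shown is for recent client-go versions):

package kube

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteBuildNamespace removes the temporary per-build namespace, which is
// the part of the cleanup we can confirm is working in our cluster.
func deleteBuildNamespace(client kubernetes.Interface, namespace string) error {
	return client.CoreV1().Namespaces().Delete(context.TODO(), namespace, metav1.DeleteOptions{})
}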

So, I triggered a Drone build and ran kubectl describe pod on both the job (controller) and one of the individual pipeline step containers. Here are the relevant excerpts:

Job (controller):

$ kc describe pod drone-job-1597-grjnwpa9ijur6cm6g-mcfzb
Name:               drone-job-1597-grjnwpa9ijur6cm6g-mcfzb
Namespace:          drone
Node:               ip-172-20-53-7.us-west-2.compute.internal/172.20.53.7
...
Containers:
  drone-controller:
    Container ID:   docker://e71c0a495a385919faec073c59fe535bf142f34b34719e35f22bf9542775661f
    Image:          drone/controller:1.0.0
    Image ID:       docker-pullable://drone/controller@sha256:d2b5d070d53f7465d45af819f97d0a341ccf02478ec44ed000e7e82a3f39db08
...
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from drone-server-token-p6q2v (ro)
...
Volumes:
  drone-server-token-p6q2v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  drone-server-token-p6q2v
    Optional:    false
...

Pipeline step container:

$ kc describe pod 4aj6gcrburv3psmo38vutcvqo4f8m7tm -n fpqhs3xavjzvm6nbijzv37mfrk1yrqwu
Name:               4aj6gcrburv3psmo38vutcvqo4f8m7tm
Namespace:          fpqhs3xavjzvm6nbijzv37mfrk1yrqwu
Node:               ip-172-20-53-7.us-west-2.compute.internal/172.20.53.7
...
Containers:
  4aj6gcrburv3psmo38vutcvqo4f8m7tm:
    Container ID:   docker://246dbc7ec57b854d93a0786de7f42c72676c07fb43afa079215e35aab4008f2e
    Image:          docker.io/plugins/ecr:latest
    Image ID:       docker-pullable://plugins/ecr@sha256:fdfd91b6e486898e9730b3bd495aa44d7350306c57f40af420a6772de4aae7cb
...
    Mounts:
      /drone/src from q4u5ttvy6baqvhfyj8rngihb2wrm8wox (rw)
...
Volumes:
  q4u5ttvy6baqvhfyj8rngihb2wrm8wox:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp/drone/fpqhs3xavjzvm6nbijzv37mfrk1yrqwu/q4u5ttvy6baqvhfyj8rngihb2wrm8wox
    HostPathType:  DirectoryOrCreate
...

So from that I was able to confirm that both the job controller and the pipeline step containers are running on the same node (ip-172-20-53-7.us-west-2.compute.internal/172.20.53.7), which is good. But I did notice that the controller doesn’t have the local directory mounted. Notice how the pipeline step has the HostPath volume declared and mounted? The only volume and mount I see on the controller is for secrets. And if the controller is the one trying to delete the directories (I think that’s what you said, and it would make sense), then I would assume the delete would always fail since the controller can’t see the directories.

Perhaps I’m missing something here, but I was hoping you could weigh in on this.

@bradrydzewski I would appreciate any insight you have on my previous comment ^. I dug through the drone-server code and from what I can tell this is where the job is created: https://github.com/drone/drone/blob/master/scheduler/kube/kube.go#L107-L138

And I don’t see any volume declarations or mounting there (granted, I don’t see the secret being mounted there either, so I could be off). Regardless, the lack of the local volume mount in my job manifest from kubectl describe makes me wonder how the controller is capable of clearing out that directory.
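
For what it’s worth, the kind of change I have in mind would add the /tmp/drone hostPath volume and a matching volumeMount to the job’s pod spec, so the controller can see the same directories the pipeline steps write to. This is purely a hypothetical sketch using the Kubernetes API types, not the actual Drone scheduler code, and the volume name is made up:

package kube

import (
	batchv1 "k8s.io/api/batch/v1"
	v1 "k8s.io/api/core/v1"
)

// addDroneTmpVolume mounts the host's /tmp/drone directory into the
// controller container so its cleanup code can actually reach the
// per-build directories on the node.
func addDroneTmpVolume(job *batchv1.Job) {
	if len(job.Spec.Template.Spec.Containers) == 0 {
		return
	}
	hostPathType := v1.HostPathDirectoryOrCreate
	job.Spec.Template.Spec.Volumes = append(job.Spec.Template.Spec.Volumes, v1.Volume{
		Name: "drone-tmp", // illustrative name
		VolumeSource: v1.VolumeSource{
			HostPath: &v1.HostPathVolumeSource{
				Path: "/tmp/drone",
				Type: &hostPathType,
			},
		},
	})
	// assumes the first container is the drone/controller container
	c := &job.Spec.Template.Spec.Containers[0]
	c.VolumeMounts = append(c.VolumeMounts, v1.VolumeMount{
		Name:      "drone-tmp",
		MountPath: "/tmp/drone",
	})
}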

@bradrydzewski Could really use any assistance you can provide here. If you look at my earlier messages, I think I found the source of the problem: I don’t see the drone-job pod getting the hostPath volume mounted, which would prevent any cleanup from happening.

Thanks for taking the time to research this. I will review tomorrow and let you know what I see.

Sounds good, thank you!

BTW, I was able to figure out how to build Drone late last night using Taskfile.yml. I hacked a volume and volumeMount into the job declaration and got it to build, but when I ran the image for our Drone server I kept getting this error:
{"error":"Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub","level":"fatal","msg":"main: cannot initialize server","time":"2019-05-01T17:35:56Z"}

I noticed that you set that flag here: https://github.com/drone/drone/blob/master/Taskfile.yml#L55, so I manually set it to 1 and rebuilt, but that image gave an error as well:
standard_init_linux.go:178: exec user process caused "no such file or directory"

I’m not sure if you’ve run into this as well or have any guidance. I didn’t see any other questions about it, so I thought I’d ask for myself and as a reference for others.

Drone uses an embedded SQLite3 database, so if you disable cgo you can only use MySQL or Postgres. If you compile the source on x64 Linux you can enable cgo.
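
To illustrate: the driver behind that error is presumably mattn/go-sqlite3, which is a cgo binding; built with CGO_ENABLED=0 it compiles as a stub and fails the first time the database is used. This is a minimal standalone example, not Drone code, and the file name is made up:

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // cgo-based sqlite driver
)

func main() {
	// sql.Open is lazy, so the stub error only surfaces on first use (Ping)
	db, err := sql.Open("sqlite3", "drone.sqlite") // file name is illustrative
	if err != nil {
		log.Fatalln(err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil {
		log.Fatalln(err) // with CGO_ENABLED=0 this prints the "requires cgo" stub error
	}
}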

Okay, so I finally got all of this working. I’ll post the details for others in case they run into issues building Drone.

So regarding the standard_init_linux.go:178: exec user process caused "no such file or directory" error I was getting with CGO_ENABLED=1, I found this very useful Stack Overflow question: https://stackoverflow.com/questions/49079981/golang-app-in-docker-exec-user-process-caused-no-such-file-or-directory, which led me to believe that I couldn’t compile the binary on my local system. I then noticed that in your .drone.yaml file here you build your binaries in the golang:1.11 image. So I downloaded that image, mounted the code into it with a volume, and tried building there. But I also needed the build command, which I was able to find here.

From there I was able to get the image to build properly and run without the runtime error. I added my mount/volume fix and have confirmed that the directories are now being cleaned up. I’ll submit a PR for the change tomorrow.

We are having the same issue here. Is there any solution yet? I found 13 GB in /tmp/drone with Drone 1.0.1 on EKS.

It is resolved in newer versions. The latest stable version is 1.2.