We are evaluating a move from our older Drone1 dind (Docker-in-Docker) implementation to Drone2 with the Kubernetes runner. We are on GCP, have ~60 pipelines, and want to introduce as few changes as possible. We also want to be able to scale our GKE worker nodes based on CPU and memory usage. Currently, we have two blockers that we are unsure how to solve:
- The first is Docker image caching. If a pipeline runs on the same worker, everything works fine because the image already exists locally. When new workers are introduced, though, everything has to be rebuilt from scratch. What is the suggested way to move forward with this? I understand that image caching is a difficult problem to solve in the Kubernetes ecosystem, but the runner becomes much less attractive if there is no way around it.
- The second is image building. Our drone.yaml file includes many different pipelines. One of them builds the Docker image locally, and the rest use that image for all kinds of things (testing, building, etc.). With multiple worker nodes, a pipeline can be picked up by a worker that doesn't have the image locally (because another worker built it), and the pipeline fails. I guess we could push the image to GCR when we build it so that each pipeline pulls it from there, but this sounds like an “expensive” workaround, especially without proper caching in place.
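To make the GCR workaround concrete, here is a rough sketch of what it might look like. It is untested and rests on several assumptions: the project and image name (`my-project/my-app`), the secret name (`gcr_json_key`), the pipeline names, and the `make test` command are all hypothetical placeholders, and it assumes the GKE nodes are already allowed to pull from GCR in the same project. It uses the `plugins/gcr` image with the `cache_from` setting so that a fresh worker can reuse layers from the last pushed image instead of rebuilding everything.

```yaml
kind: pipeline
type: kubernetes
name: build-image

steps:
  - name: build-and-push
    image: plugins/gcr
    settings:
      registry: gcr.io
      repo: my-project/my-app              # hypothetical project/image
      tags:
        - ${DRONE_COMMIT_SHA}              # unique tag for this build
        - latest                           # moving tag used as the cache source
      cache_from:
        - gcr.io/my-project/my-app:latest  # pull previous layers before building
      json_key:
        from_secret: gcr_json_key          # hypothetical secret name

---
kind: pipeline
type: kubernetes
name: test

# Run only after the image has been pushed, so any worker can pull it.
depends_on:
  - build-image

steps:
  - name: test
    image: gcr.io/my-project/my-app:${DRONE_COMMIT_SHA}
    commands:
      - make test                          # hypothetical test command
```

With something like this, no pipeline depends on which worker built the image, but every new worker still pays for pulling the cache layers, which is the cost concern raised above.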
Is anyone using the Kubernetes runner at large production scale who could share some ideas?