Steps repeatedly failing with clone:skipped

We recently deployed a second runner in our CI cluster, and things haven't been going great since.

At first, jobs would randomly fail with the clone step marked skipped, and restarting them would sometimes fix it. Now, some projects keep failing to build with no useful logs at all.

Even with the simplest possible pipeline configuration, the error doesn't change: the pipeline environment gets destroyed as soon as it's started.

.drone.jsonnet

[
  {
    kind: 'pipeline',
    type: 'docker',
    name: 'Base',
    steps: [
      {
        commands: ['echo EHLO'],
        image: 'node:14-alpine',
        name: 'echo',
      },
    ],
    trigger: { event: ['push'] },
  },
]

Converted to .drone.yml

---
{
   "kind": "pipeline",
   "name": "Base",
   "steps": [
      {
         "commands": [
            "echo EHLO"
         ],
         "image": "node:14-alpine",
         "name": "echo"
      }
   ],
   "trigger": {
      "event": [
         "push"
      ]
   },
   "type": "docker"
}
---
kind: signature
hmac: 88611f11ae869c614ab4a045a94f0292d11a1ff99a5b74dbd997896657db855d

...
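
(For context, the YAML above is generated and signed with the drone CLI, roughly like this; the repo slug is the one that appears in the runner logs below:)

# convert .drone.jsonnet into the .drone.yml document stream shown above
drone jsonnet --stream
# append the kind: signature / hmac document
drone sign TS/modbus-manager --save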

Runner Logs

time="2021-11-30T15:31:06Z" level=debug msg="stage received" stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:06Z" level=debug msg="stage accepted" stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:06Z" level=debug msg="stage details fetched" build.id=18272 build.number=41 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:06Z" level=debug msg="updated stage to running" build.id=18272 build.number=41 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:09Z" level=debug msg="destroying the pipeline environment" build.id=18272 build.number=41 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:10Z" level=debug msg="successfully destroyed the pipeline environment" build.id=18272 build.number=41 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:10Z" level=debug msg="updated stage to complete" build.id=18272 build.number=41 duration=2 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2
time="2021-11-30T15:31:10Z" level=debug msg="poller: request stage from remote server" thread=2
time="2021-11-30T15:31:10Z" level=trace msg="http: context canceled"
time="2021-11-30T15:31:10Z" level=debug msg="done listening for cancellations" build.id=18272 build.number=41 repo.id=196 repo.name=modbus-manager repo.namespace=TS stage.id=25718 stage.name=Base stage.number=1 thread=2

Some other topics hinted at missing secrets failing the pipeline, but this pipeline fails even though it references no secrets anywhere.

Other threads suggest restarting or updating Drone to fix the issue. Admittedly, that worked sometimes, but this morning I rebuilt the whole build cluster on drone/drone:2.6 and the same skipping issues still occur.

I'm at a loss now as to what else to try to fix this.
Any pointers?

Thanks

Support was quick to point out something new to me.

In the Drone UI, if you inspect the build's response JSON, it may expose an error that never makes it into the step logs.
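
(The same JSON is also available from the REST API; a sketch, assuming a server at https://drone.example.com and a personal token in $DRONE_TOKEN — the repo slug and build number are the ones from this post:)

curl -s -H "Authorization: Bearer $DRONE_TOKEN" \
  https://drone.example.com/api/repos/TS/modbus-manager/builds/41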

Mine was:

{"id":18272,"repo_id":196,"trigger":"@hook","number":41,"status":"error","event":"push","action":"","link":"","timestamp":0,"message":"ci: can it echo?!?","before":"0f3ae15e3ae2a7630177e4f4747044cf80dd45be","after":"6abae17094472737810e30050991d86bf79fda18","ref":"refs/heads/core","source_repo":"","source":"core","target":"core","started":1638286266,"finished":1638286269,"created":1638286266,"updated":1638286266,"version":3,"stages":[{"id":25718,"repo_id":196,"build_id":18272,"number":1,"name":"Base","kind":"pipeline","type":"docker","status":"error","error":"Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network","errignore":false,"exit_code":255,"machine":"ace-bld-1-light","os":"linux","arch":"amd64","started":1638286266,"stopped":1638286268,"created":1638286266,"updated":1638286269,"version":4,"on_success":true,"on_failure":false,"steps":[{"id":232969,"step_id":25718,"number":1,"name":"clone","status":"skipped","exit_code":0,"started":1638286268,"stopped":1638286268,"version":2,"image":"drone/git:latest"},{"id":232970,"step_id":25718,"number":2,"name":"echo","status":"skipped","exit_code":0,"started":1638286268,"stopped":1638286268,"version":2,"depends_on":["clone"],"image":"docker.io/library/node:14-alpine"}]}]}

The stage's error field is the culprit: "Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network". Each docker pipeline gets its own bridge network, and the daemon on the runner had exhausted its default address pools. So it was a matter of giving the Docker daemon more address pools to allocate pipeline networks from.
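
(A minimal sketch of the daemon-side fix, assuming the default /etc/docker/daemon.json location, no existing settings to merge, and that 10.10.0.0/16 is free on your network:)

# "size": 24 lets Docker carve 256 /24 networks out of the /16,
# far more than the built-in default pools allow
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
   "default-address-pools": [
      { "base": "10.10.0.0/16", "size": 24 }
   ]
}
EOF
sudo systemctl restart docker

If old pipeline networks were never cleaned up, pruning them (docker network prune) also frees pools without touching the daemon config.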

Hope this helps someone later on.