Kubernetes runner intermittently fails steps

I’m not sure if this is a problem with the runner or something larger in my cluster, but over the last month or two these failures have become more frequent. Using the drone-kube-runner, a step will sometimes fail during a pipeline run with no output at all, or with the only output being something like:

unable to retrieve container logs for docker://226be60fecae14fa720572ccae3062f530f1d71c9a1dbf67d9b10029b9458c0d

A restart will usually pass, but lately it has taken several attempts, because on a retry a different step will sometimes fail in the same way. I have logging turned up on both the main Drone process and the drone-kube-runner but don’t see any notable errors in either. I was using drone/drone-runner-kube:latest until today, when I switched to drone/drone-runner-kube:1.0.0-beta.3, but I still experienced the issue with that. The main Drone version is 1.7.0.

I’m not seeing anything helpful in the kube logs (kube-apiserver, kube-scheduler, kube-controller-manager, or even just docker), but it’s a bit like trying to find a needle in the proverbial haystack. I’m not really sure where to go from here. This problem only started occurring after switching to the drone-kube-runner from the legacy Kubernetes runner. It’s running on the same Kubernetes cluster.

I guess I should have posted this under Support, but the number of transient build failures this is causing is quite high. Any help at all on what logging I should turn up would be useful.

Hi Matt,
I did some research on this particular error and came across an interesting thread [1] from Knative (which shares some similarities with Drone), where they found the issue was caused by the container being deleted before the Kubernetes client had enough time to fetch the logs. I wonder if this could be a possible root cause? Does the step run very quickly?

[1] https://github.com/knative/docs/issues/300

Thanks for looking into this.

Yes, most of these steps have executions that take only a handful of seconds. Drone usually reports 4 or so seconds of total run time for a particular step when it’s successful. I believe that time also includes the image pull, so the actual time the container is in a running state is probably less than a second.

I think that covers most of the intermittent failures, but sometimes the clone step also fails with no output, and that step takes roughly 40 seconds when successful. But I think you’re on to something…

I think there are a couple things I would like to attempt:

  1. Retry getting the logs on error with a backoff (see the sketch below)
  2. Delay Pod deletion to give enough time to fetch logs
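
To illustrate idea 1, here is a rough sketch of a log fetch with retry and backoff using client-go. The wiring and names here are purely illustrative, not the runner’s actual implementation:

    // Illustrative only: retry opening the container log stream with a simple
    // exponential backoff before giving up, instead of failing the step on the
    // first "unable to retrieve container logs" error.
    package example

    import (
        "context"
        "fmt"
        "io"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/kubernetes"
    )

    func streamLogsWithRetry(ctx context.Context, client kubernetes.Interface,
        namespace, pod, container string, out io.Writer) error {
        var lastErr error
        backoff := time.Second
        for attempt := 0; attempt < 5; attempt++ {
            req := client.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{
                Container: container,
                Follow:    true,
            })
            stream, err := req.Stream(ctx)
            if err == nil {
                defer stream.Close()
                _, err = io.Copy(out, stream)
                return err
            }
            lastErr = err
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
                backoff *= 2 // wait a little longer before each retry
            }
        }
        return fmt.Errorf("giving up streaming logs: %w", lastErr)
    }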

If I enable both of the above with feature flags, would you be open to testing since you are able to reproduce with some consistency?

Yes, absolutely. I would love it if the deletion time were configurable. I would find a lot of value in being able to inspect the pods during general troubleshooting. I’d probably set it to about 24 hours or so.

Hi Matt,

We added some feature flags to drone/drone-runner-kube:latest that I hope will help:

DRONE_FEATURE_FLAG_RETRY_LOGS=true
configures the system to retry streaming the logs on error. I recommend enabling this flag first and monitoring its behavior. If this flag solves the problem, we can implement a more permanent fix where we automatically retry.

DRONE_FEATURE_FLAG_DELAYED_DELETE=true
delays removing the Pod by 30 seconds. If the underlying root cause is the Pod being deleted before logs are streamed, setting this flag to true should work around the issue by giving the system enough time to stream the logs.

DRONE_FEATURE_FLAG_DISABLE_DELETE=true
disables removing the Pod. If enabled, this requires manually removing the resources associated with the pipeline (Pod, ConfigMaps, Secrets, etc.). If streaming the logs fails consistently, we should be able to disable Pod deletion and try streaming the logs using kubectl to see if the error is reproducible.
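
For reference, here is a minimal sketch of what the delayed-delete idea amounts to. The function and names below are hypothetical, not the runner’s actual code:

    // Illustrative only: wait a grace period before deleting the step's Pod so
    // the kubelet and API server can finish serving its logs.
    package example

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func deletePodDelayed(ctx context.Context, client kubernetes.Interface,
        namespace, name string, delay time.Duration) error {
        // e.g. delay = 30 * time.Second, matching the DELAYED_DELETE flag above
        time.Sleep(delay)
        return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
    }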

Thanks very much. I’ll throw the first flag in now and give it a few days.

Update: I set the DRONE_FEATURE_FLAG_RETRY_LOGS flag last week, and while it seemed to help with the explicit “unable to retrieve” errors, I still had a number of no-output failures (like the clone step), as well as cases where the step was actually successful but was marked as failed.

So this morning I added the DRONE_FEATURE_FLAG_DELAYED_DELETE flag. So far so good. Lots of builds and no abnormal failures! I’ll report back if I notice any in the next few days.

Well, I guess I jinxed it. I’m still getting all the failures noted above even with those two flags set.

Any chance someone tried to cancel this particular build?

No, they show up as failed rather than cancelled.

@bradrydzewski Do we really need this?

	// END: FEATURE FLAG
	if err != nil {
		return nil, err
	}

We can just let it go to the next step

return k.waitForTerminated(ctx, spec, step)

and it will error there if there really is an issue.

@sethpollack if we remove the error and allow the pipeline to continue to the next step, does that mean the logs would be empty? I do think this is an improvement since we don’t want the pipeline needlessly failing (so I agree with you) but I think the lack of logs would still be an open issue.

@bradrydzewski correct, we would still need a solution for getting the logs.

@bradrydzewski I have been digging into this a bit more and noticed a few things:

  1. When we change the image from the placeholder image to the step image, the container is terminated with exit code 2.
  2. Whenever I run into these flakes, the exit code is 2 as well.

I think we may just be getting the placeholder container’s logs, rather than there being an issue with tailing the logs.

I think the fix may just be changing this code https://github.com/drone-runners/drone-runner-kube/blob/master/engine/engine_impl.go#L250

from:

  if (!image.Match(cs.Image, step.Placeholder) && cs.State.Running != nil) || (cs.State.Terminated != nil) {
    return true, nil
  }

to:

  if !image.Match(cs.Image, step.Placeholder) && (cs.State.Running != nil || cs.State.Terminated != nil) {
    return true, nil
  }

I patched my runner with this fix and I’ll open a PR if this solves the issue.

Thanks Seth, let me know if that helps.

I think there may be a few different root causes that manifest with the same error. We saw an example where this error was caused by the context being cancelled. Kubernetes wraps and obfuscates the underlying context.Canceled error, which makes this hard to diagnose. Aside from a more user-friendly message, one solution is to increase the timeout for builds that exceed 60 minutes and hit this error.

Also, in some cases we are seeing builds being cancelled early or unexpectedly because the build timeout, which defaults to 60 minutes, starts when the Pod is scheduled. If it takes 10 minutes to schedule the Pod, it will only have 50 minutes remaining to execute (for example). This was also hard to identify because Kubernetes was wrapping the context.Canceled error. Similar to the above, the short-term fix is to increase the build timeout.
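
To make that failure mode easier to spot, here is a rough sketch of how the error from waiting on the Pod could be unwrapped so a cancelled or timed-out build context is reported clearly. The names are illustrative, not the runner’s actual code:

    // Illustrative only: surface a cancelled or expired build context instead
    // of the generic Kubernetes error message.
    package example

    import (
        "context"
        "errors"
        "fmt"
    )

    func describeWaitError(ctx context.Context, err error) error {
        if err == nil {
            return nil
        }
        // Kubernetes may wrap or replace the context error, so also check the
        // build context directly.
        if errors.Is(err, context.Canceled) || errors.Is(ctx.Err(), context.Canceled) {
            return fmt.Errorf("build was cancelled or hit the build timeout: %w", err)
        }
        if errors.Is(err, context.DeadlineExceeded) || errors.Is(ctx.Err(), context.DeadlineExceeded) {
            return fmt.Errorf("build exceeded its deadline: %w", err)
        }
        return err
    }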

I think we are dealing with a few different issues here, so hopefully we can chip away at them.

This seems to be helping with the flakes that have no logs, but I’m still getting runs that are successful yet are reported with exit code 2 anyway.

I think it’s possible that waitForTerminated is getting older events from when we update the placeholder image.

I’m going to add this logic there as well:

if !image.Match(cs.Image, step.Placeholder) && cs.State.Terminated != nil {
    state.ExitCode = int(cs.State.Terminated.ExitCode)
    return true, nil
}

I put @sethpollack’s fixes up as a PR (https://github.com/drone-runners/drone-runner-kube/pull/36) and it seems to be working great for us so far.

I’m also seeing a lot of these errors:

timed out waiting for the condition

It seems like sometimes, even though the container is updated with the new image, it doesn’t actually restart, so it just hangs until it times out.