Kubernetes runner intermittently fails steps

I’m not sure if this is a problem with the runner or something larger in my cluster, but over the last month or two these failures have become more frequent. Using the drone-kube-runner, a step will sometimes fail during a pipeline run with no output at all, or with the only output being something like:

unable to retrieve container logs for docker://226be60fecae14fa720572ccae3062f530f1d71c9a1dbf67d9b10029b9458c0d

A restart will usually pass, but lately it has taken several attempts, because on a retry a different step will sometimes fail in the same way. I have logging turned up on both the main Drone process and the drone-kube-runner but don’t see any notable errors in either. I was using drone/drone-runner-kube:latest until today, when I switched to drone/drone-runner-kube:1.0.0-beta.3, but I still experienced the issue with that. The main Drone version is 1.7.0.

I’m not seeing anything helpful in the kube logs (kube-apiserver, kube-scheduler, kube-controller-manager, or even just docker), but it’s a bit like trying to find a needle in the proverbial haystack. I’m not really sure where to go from here. This problem only started occurring after switching to the drone-kube-runner from the legacy Kubernetes runner. It’s running on the same Kubernetes cluster.

I guess I should have posted this under Support, but the number of transient build failures this is causing is quite high. Any help at all on what logging I should turn up would be useful.

Hi Matt,
I did some research on this particular error and came across an interesting thread [1] from Knative (which shares some similarities with Drone), where they found the issue was caused by the container being deleted before the Kubernetes client had enough time to fetch the logs. I wonder if this could be a possible root cause? Does the step run very quickly?

[1] https://github.com/knative/docs/issues/300

Thanks for looking into this.

Yes, most of these steps have executions that take only a handful of seconds. Drone usually reports 4 or so seconds of total run time for a particular step when it’s successful. I believe that time also includes the image pull, so the actual time the container is in a running state is probably less than a second.

I think that covers most of the intermittent failures, but sometimes the clone step also fails with no output, and that step takes roughly 40 seconds when successful. But I think you’re on to something…

I think there are a couple things I would like to attempt:

  1. Retry getting the logs on error with a backoff (see the sketch below)
  2. Delay Pod deletion to give enough time to fetch logs
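
To illustrate idea 1, here is a rough sketch of a log fetch with retry and backoff using client-go. The wiring and names here are purely illustrative, not the runner’s actual implementation:

    // Illustrative only: retry opening the container log stream with a simple
    // exponential backoff before giving up, instead of failing the step on the
    // first "unable to retrieve container logs" error.
    package example

    import (
        "context"
        "fmt"
        "io"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/kubernetes"
    )

    func streamLogsWithRetry(ctx context.Context, client kubernetes.Interface,
        namespace, pod, container string, out io.Writer) error {
        var lastErr error
        backoff := time.Second
        for attempt := 0; attempt < 5; attempt++ {
            req := client.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{
                Container: container,
                Follow:    true,
            })
            stream, err := req.Stream(ctx)
            if err == nil {
                defer stream.Close()
                _, err = io.Copy(out, stream)
                return err
            }
            lastErr = err
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
                backoff *= 2 // wait a little longer before each retry
            }
        }
        return fmt.Errorf("giving up streaming logs: %w", lastErr)
    }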

If I enable both of the above with feature flags, would you be open to testing since you are able to reproduce with some consistency?

Yes, absolutely. I would love it if the deletion time were configurable. I would find a lot of value in being able to inspect the pods during general troubleshooting. I’d probably set it to about 24 hours or so.

Hi Matt,

We added some feature flags to drone/drone-runner-kube:latest that I hope will help:

DRONE_FEATURE_FLAG_RETRY_LOGS=true
configures the system to retry streaming the logs on error. I recommend enabling this flag first and monitoring its behavior. If this flag solves the problem, we can implement a more permanent fix where we automatically retry.

DRONE_FEATURE_FLAG_DELAYED_DELETE=true
delays removing the Pod by 30 seconds. If the underlying root cause is the Pod being deleted before logs are streamed, setting this flag to true should work around the issue by giving the system enough time to stream the logs.

DRONE_FEATURE_FLAG_DISABLE_DELETE=true
disables removing the Pod. If enabled, this requires manually removing the resources associated with the pipeline (Pod, ConfigMaps, Secrets, etc.). If streaming the logs fails consistently, we should be able to disable Pod deletion and try streaming the logs using kubectl to see if the error is reproducible.
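
For reference, here is a minimal sketch of what the delayed-delete idea amounts to. The function and names below are hypothetical, not the runner’s actual code:

    // Illustrative only: wait a grace period before deleting the step's Pod so
    // the kubelet and API server can finish serving its logs.
    package example

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func deletePodDelayed(ctx context.Context, client kubernetes.Interface,
        namespace, name string, delay time.Duration) error {
        // e.g. delay = 30 * time.Second, matching the DELAYED_DELETE flag above
        time.Sleep(delay)
        return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
    }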

Thanks very much. I’ll throw the first flag in now and give it a few days.

Update: I set the DRONE_FEATURE_FLAG_RETRY_LOGS flag last week, and while it seemed to help with the explicit “unable to retrieve” errors, I still had a number of no-output failures (like the clone step), as well as cases where the step was actually successful but was marked as failed.

So this morning I added the DRONE_FEATURE_FLAG_DELAYED_DELETE flag. So far so good. Lots of builds and no abnormal failures! I’ll report back if I notice any in the next few days.

Well, I guess I jinxed it. I’m still getting all the failures noted above even with those two flags set.

Any chance someone tried to cancel this particular build?

No, they show up as failed rather than cancelled.

@bradrydzewski Do we really need this?

	// END: FEATURE FLAG
	if err != nil {
		return nil, err
	}

We can just let it go to the next step

return k.waitForTerminated(ctx, spec, step)

and it will error there if there really is an issue.

@sethpollack if we remove the error and allow the pipeline to continue to the next step, does that mean the logs would be empty? I do think this is an improvement since we don’t want the pipeline needlessly failing (so I agree with you) but I think the lack of logs would still be an open issue.

@bradrydzewski correct, we would still need a solution for getting the logs.

@bradrydzewski I have been digging into this a bit more and noticed a few things:

  1. When we change the image from the placeholder image to the step image, the container is terminated with exit code 2.
  2. Whenever I run into these flakes, the exit code is 2 as well.

I think we may just be getting the placeholder container’s logs, rather than there being an issue with tailing the logs.

I think the fix may just be changing this code https://github.com/drone-runners/drone-runner-kube/blob/master/engine/engine_impl.go#L250

from:

  if (!image.Match(cs.Image, step.Placeholder) && cs.State.Running != nil) || (cs.State.Terminated != nil) {
    return true, nil
  }

to:

  if !image.Match(cs.Image, step.Placeholder) && (cs.State.Running != nil || cs.State.Terminated != nil) {
    return true, nil
  }

I patched my runner with this fix and I’ll open a PR if this solves the issue.

Thanks Seth, let me know if that helps.

I think there may be a few different root causes that manifest with the same error. We saw an example where this error was caused by the context being cancelled. Kubernetes wraps and obfuscates the underlying context.Canceled error, which makes this hard to diagnose. Aside from a more user-friendly message, one solution is to increase the timeout for builds that exceed 60 minutes and hit this error.

Also, in some cases we are seeing builds being cancelled early or unexpectedly because the build timeout, which defaults to 60 minutes, starts when the Pod is scheduled. If it takes 10 minutes to schedule the Pod, it will only have 50 minutes remaining to execute (for example). This was also hard to identify because Kubernetes was wrapping the context.Canceled error. Similar to the above, the short-term fix is to increase the build timeout.
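
To make that failure mode easier to spot, here is a rough sketch of how the error from waiting on the Pod could be unwrapped so a cancelled or timed-out build context is reported clearly. The names are illustrative, not the runner’s actual code:

    // Illustrative only: surface a cancelled or expired build context instead
    // of the generic Kubernetes error message.
    package example

    import (
        "context"
        "errors"
        "fmt"
    )

    func describeWaitError(ctx context.Context, err error) error {
        if err == nil {
            return nil
        }
        // Kubernetes may wrap or replace the context error, so also check the
        // build context directly.
        if errors.Is(err, context.Canceled) || errors.Is(ctx.Err(), context.Canceled) {
            return fmt.Errorf("build was cancelled or hit the build timeout: %w", err)
        }
        if errors.Is(err, context.DeadlineExceeded) || errors.Is(ctx.Err(), context.DeadlineExceeded) {
            return fmt.Errorf("build exceeded its deadline: %w", err)
        }
        return err
    }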

I think we are dealing with a few different issues here, so hopefully we can chip away at them.

This seems to be helping with the flakes that have no logs, but I’m still getting runs that are successful yet are reported with exit code 2 anyway.

I think it’s possible that waitForTerminated is getting older events from when we update the placeholder image.

I’m going to add this logic there as well:

if !image.Match(cs.Image, step.Placeholder) && cs.State.Terminated != nil {
    state.ExitCode = int(cs.State.Terminated.ExitCode)
    return true, nil
}

I put @sethpollack’s fixes up as a PR (https://github.com/drone-runners/drone-runner-kube/pull/36) and it seems to be working great for us so far.

I’m also seeing a lot of these errors:

timed out waiting for the condition

It seems like sometimes, even though the container is updated with the new image, it doesn’t actually restart, so it just hangs until it times out.