Drone-runner-kube 1.0.0-beta.12 - Failed step blocks pipeline

After upgrading drone-runner-kube to 1.0.0-beta.12, the pipeline gets stuck at a step that fails, even though no other steps depend on it. The step exits with Error 1, but it keeps showing as running.

Expected behavior with Drone-runner-kube 1.0.0-beta.9 was:

On top of that, after manually stopping the build, the pod stays created and doesn’t get terminated.

Please enable trace-level logging and then post the runner logs for the Drone developers to analyze (you can post to gist/pastebin/etc. and provide a link in this thread), as well as a yaml file that can be used to reproduce the issue.
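For reference, trace logging on the Kubernetes runner is controlled by environment variables on the runner deployment; a minimal sketch of the relevant container env section (the deployment layout and image tag here are illustrative):

```yaml
# Fragment of the runner Deployment spec; only the env entries matter here.
containers:
  - name: drone-runner-kube
    image: drone/drone-runner-kube:1.0.0-beta.12
    env:
      - name: DRONE_DEBUG   # verbose debug logging
        value: "true"
      - name: DRONE_TRACE   # trace-level logging
        value: "true"
```

Restart the runner after changing the deployment so the new log level takes effect.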

cc @marko-gacesa

Does this happen every time or has it happened only once? If you are able to reproduce this, the trace logs will be of the greatest value.

Yeah, this was happening on every run. I’ll reproduce it with trace logs on Friday or Monday.

Here are the logs; hopefully they tell you something.


Here’s what happened after I manually canceled the build. This time the pod was terminated, although it wasn’t before.
Yet the build still got stuck at the same step.

Here’s the pod name: drone-l1ibolk9ohcpu7bkfqf2

Hi, @mknapcok. Thanks for sharing this.

In the runner’s log, I found that the step “check-dependencies” (container=drone-3e4r5l615rtj4qstaa6x) was never finished. It was launched at “2021-09-15T21:03:05Z”, successfully started two seconds later, but a Kubernetes event for the step termination never arrived. From the screenshot I can see the step was still running at “2021-09-15T21:14:25Z” when the pod was deleted (after you canceled the build).

What was the step doing? Is it possible that the step itself got stuck, rather than the runner? The log looks normal. Keep in mind there’s no timeout for step execution.

Also, it’s odd that the runner failed to delete the pod the first time. What version of Kubernetes are you using?

Sorry, now I see I blacked out all the lines. The step finished with Error 1. This step is meant to fail; it just displays indirect dependencies for our devs. The expected behaviour is for it to fail and for the build to continue with the next step.

With the beta.12 it stays stuck indefinitely in this state.
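As an aside, Drone pipelines let you mark a step that is expected to fail with `failure: ignore`, so the build continues regardless of its exit code. A minimal sketch (the step name matches the one discussed above, but the command is illustrative):

```yaml
kind: pipeline
type: kubernetes
name: checks

steps:
  - name: check-dependencies
    image: 698591829921.dkr.ecr.eu-central-1.amazonaws.com/base-php:1.6
    failure: ignore          # build continues even if this step exits non-zero
    commands:
      - composer show --tree # illustrative command; shows indirect dependencies
```

This doesn’t excuse the runner hanging on the failed step, but it makes the intent of a deliberately failing step explicit.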

As I was reproducing this issue now, the logs don’t reveal anything obvious. Here’s a snippet from the logs for container=drone-c86no4qtxiixqe22hep0:

time="2021-09-19T20:06:15Z" level=debug msg="Engine: Starting step" build.id=4359 build.number=2175 container=drone-c86no4qtxiixqe22hep0 image="698591829921.dkr.ecr.eu-central-1.amazonaws.com/base-php:1.6" namespace=drone placeholder="drone/placeholder:1" pod=drone-4dgyh7gom0u4zfr8yyf7 repo.id=6 repo.name=smartlook-application repo.namespace=smartlook stage.id=45862 stage.name=checks stage.number=1 step=check-dependencies step.name=check-dependencies thread=11
time="2021-09-19T20:06:16Z" level=debug msg="Launched containers. Duration=0.10s" count=1 failed=0 success=1
time="2021-09-19T20:06:16Z" level=debug msg="PodWatcher: Waiting..." container=drone-c86no4qtxiixqe22hep0 pod=drone-4dgyh7gom0u4zfr8yyf7 stepState=running
time="2021-09-19T20:06:16Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-4dgyh7gom0u4zfr8yyf7
time="2021-09-19T20:06:16Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-4dgyh7gom0u4zfr8yyf7
time="2021-09-19T20:06:17Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-4dgyh7gom0u4zfr8yyf7
time="2021-09-19T20:06:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-c86no4qtxiixqe22hep0 image="698591829921.dkr.ecr.eu-central-1.amazonaws.com/base-php:1.6" pod=drone-4dgyh7gom0u4zfr8yyf7 restartCount=1 state=RUNNING stepState=running
time="2021-09-19T20:06:17Z" level=debug msg="PodWatcher: Wait finished. Duration=0.98s" container=drone-c86no4qtxiixqe22hep0 error="<nil>" pod=drone-4dgyh7gom0u4zfr8yyf7 stepState=running
time="2021-09-19T20:06:22Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-4dgyh7gom0u4zfr8yyf7
time="2021-09-19T20:06:22Z" level=debug msg="PodWatcher: Waiting..." container=drone-c86no4qtxiixqe22hep0 pod=drone-4dgyh7gom0u4zfr8yyf7 stepState=finished
time="2021-09-19T20:06:23Z" level=trace msg="PodWatcher: Periodic container state check" namespace=drone pod=drone-cqc2oqqunfu0gv4gxcsp
time="2021-09-19T20:06:23Z" level=trace msg="PodWatcher: Periodic container state check" namespace=drone pod=drone-itxh4eebvuucmkp3nsh7
time="2021-09-19T20:06:23Z" level=trace msg="PodWatcher: Periodic container state check" namespace=drone pod=drone-yb4foy9uqoo3kg9ieg60
time="2021-09-19T20:06:25Z" level=trace msg="PodWatcher: Periodic container state check" namespace=drone pod=drone-4dgyh7gom0u4zfr8yyf7

After manually cancelling the build at 20:06, the pod kept running. Logs with the periodic state checks were still coming in.

If you want to see this bug personally I’m happy to get on a call with you and show you.

Ok, before we get on a call, can you please provide these things:

  1. Your Kubernetes version
  2. The output of “kubectl describe pod” for the problematic pod

Thanks!
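For reference, the describe output for one of the pods mentioned above can be collected like this (assuming the `drone` namespace from the logs; substitute the pod name from your failing build):

```shell
kubectl -n drone describe pod drone-4dgyh7gom0u4zfr8yyf7
```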

Yeah, sure. The Kubernetes version is 1.21, and the describe pod output will be in the next post.
Here’s another case of a step that failed but stays in the running state:

Here’s the describe pod output for the failed step “check-dependencies”:

Here’s the one from the pod with the failed “test” step that keeps running:

@mknapcok, please try out the latest image. It should solve your problem.

Yeah, it seems it helped with the build that was getting stuck.

I’ll let it run tomorrow and see whether more builds complete like this.

Okay, it looks like the steps that finish with an error now fail successfully.
However, we’ve now discovered a different bug: for a step that fails, the Drone UI displays this error message once the run is finished:

During the run, the logs are displayed as usual, but once the run finishes, this message is shown instead.
I was monitoring the logs in the container on k8s during the build, and they finish with these lines:

We’d expect these lines to be displayed in the Drone server UI, but unfortunately we just get the error message shown in the picture above.

Note: the pod IDs don’t match, because the screenshot with the logs was taken on a pod triggered with kube-runner beta.12 while I was testing the previous behavior. The logs were the same on the pod that produced the error in the first picture, though. This behavior is happening with drone-runner-kube:latest.

This was run with drone-runner-kube:1.0.0-beta.9, and I’d say this is the expected behavior:

Yeah, and the same happens with the build I was showing in the beginning: the build fails but doesn’t display logs.

Failing to display the logs could be an issue with the user interface as opposed to an issue with the runner. Can you see the logs if you manually try to access the log endpoint in your browser? The log endpoint follows this pattern:

/api/repos/{organization}/{repo}/builds/{buildNumber}/logs/{stageNumber}/{stepNumber}
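For example, with hypothetical values filled in (server URL, repo, and build/stage/step numbers here are placeholders; substitute your own), the endpoint URL can be built and fetched from a terminal:

```shell
# Placeholders; substitute your own server and the numbers from the affected build.
server="https://drone.example.com"
org="smartlook"; repo="smartlook-application"
build=2175; stage=1; step=1
url="$server/api/repos/$org/$repo/builds/$build/logs/$stage/$step"
echo "$url"
# Fetch with a personal API token (from your Drone account settings):
#   curl -sH "Authorization: Bearer $DRONE_TOKEN" "$url"
```

If the logs come back from this endpoint but not in the UI, the problem is on the interface side rather than in the runner.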

cc @d1wilko

Yes, I am able to access them this way: