Yeah, will enable trace. I thought I had, but I guess not yet. Sry, didn’t realize this was a private build.
Ok, we found another one on a public build: https://drone.grafana.net/grafana/grafana/3949
Logs from around that time (including trace):
{"log":"time=\"2020-10-06T12:02:23Z\" level=debug msg=\"stage received\" stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:02:23.414526983Z"}
{"log":"time=\"2020-10-06T12:02:23Z\" level=debug msg=\"stage accepted\" stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:02:23.42286068Z"}
{"log":"time=\"2020-10-06T12:02:23Z\" level=debug msg=\"stage details fetched\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:02:23.453335187Z"}
{"log":"time=\"2020-10-06T12:02:23Z\" level=debug msg=\"updated stage to running\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:02:23.982418764Z"}
{"log":"time=\"2020-10-06T12:02:30Z\" level=trace msg=\"http: no content returned: re-connect and re-try\"\n","stream":"stderr","time":"2020-10-06T12:02:30.530893674Z"}
{"log":"time=\"2020-10-06T12:02:37Z\" level=debug msg=\"received exit code 0\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 step.name=clone thread=1\n","stream":"stderr","time":"2020-10-06T12:02:37.284585639Z"}
{"log":"time=\"2020-10-06T12:02:39Z\" level=debug msg=\"received exit code 0\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 step.name=identify-runner thread=1\n","stream":"stderr","time":"2020-10-06T12:02:39.623123504Z"}
{"log":"time=\"2020-10-06T12:02:53Z\" level=trace msg=\"http: no content returned: re-connect and re-try\"\n","stream":"stderr","time":"2020-10-06T12:02:53.455636347Z"}
{"log":"time=\"2020-10-06T12:03:50Z\" level=trace msg=\"http: no content returned: re-connect and re-try\"\n","stream":"stderr","time":"2020-10-06T12:03:50.536582503Z"}
{"log":"time=\"2020-10-06T12:04:03Z\" level=debug msg=\"received exit code 0\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 step.name=initialize thread=1\n","stream":"stderr","time":"2020-10-06T12:04:03.841237002Z"}
{"log":"time=\"2020-10-06T12:04:03Z\" level=debug msg=\"destroying the pipeline environment\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:04:03.969474647Z"}
{"log":"time=\"2020-10-06T12:04:03Z\" level=debug msg=\"FIXME: Got an status-code for which error does not match any expected type!!!: -1\" module=api status_code=-1\n","stream":"stderr","time":"2020-10-06T12:04:03.969486996Z"}
{"log":"time=\"2020-10-06T12:04:08Z\" level=debug msg=\"successfully destroyed the pipeline environment\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:04:08.52212417Z"}
{"log":"time=\"2020-10-06T12:04:08Z\" level=debug msg=\"stage failed\" build.id=13964 build.number=3949 duration=100 error=\"1 error occurred:\\n\\t* \u003chtml\u003e\\r\\n\u003chead\u003e\u003ctitle\u003e400 Bad Request\u003c/title\u003e\u003c/head\u003e\\r\\n\u003cbody\u003e\\r\\n\u003ccenter\u003e\u003ch1\u003e400 Bad Request\u003c/h1\u003e\u003c/center\u003e\\r\\n\u003chr\u003e\u003ccenter\u003enginx/1.15.10\u003c/center\u003e\\r\\n\u003c/body\u003e\\r\\n\u003c/html\u003e\\r\\n\\n\\n\" repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:04:08.522144127Z"}
{"log":"time=\"2020-10-06T12:04:08Z\" level=debug msg=\"poller: request stage from remote server\" thread=1\n","stream":"stderr","time":"2020-10-06T12:04:08.522158546Z"}
{"log":"time=\"2020-10-06T12:04:08Z\" level=trace msg=\"http: context canceled\"\n","stream":"stderr","time":"2020-10-06T12:04:08.522287763Z"}
{"log":"time=\"2020-10-06T12:04:08Z\" level=debug msg=\"done listening for cancellations\" build.id=13964 build.number=3949 repo.id=48 repo.name=grafana repo.namespace=grafana stage.id=31998 stage.name=test-pr stage.number=1 thread=1\n","stream":"stderr","time":"2020-10-06T12:04:08.522301236Z"}
{"log":"time=\"2020-10-06T12:04:30Z\" level=trace msg=\"http: no content returned: re-connect and re-try\"\n","stream":"stderr","time":"2020-10-06T12:04:30.539099804Z"}
I am a bit confused since the “initialize” step returns a 0 exit code, and the next line is “destroying the pipeline”. Also it is not clear to me why the stage was failed after all of the success. Looks like there was an http 400 error somewhere. Perhaps a problem reporting step status back to the server?
One thing that jumps out is the log entry that indicates the stage failed:
time=\"2020-10-06T12:04:08Z\"
level=debug msg=\"stage failed\"
build.id=13964
build.number=3949
duration=100
error=\"1 error occurred:\\n\\t* \u003chtml\u003e\\r\\n\u003chead\u003e\u003ctitle\u003e400 Bad Request\u003c/title\u003e\u003c/head\u003e\\r\\n\u003cbody\u003e\\r\\n\u003ccenter\u003e\u003ch1\u003e400 Bad Request\u003c/h1\u003e\u003c/center\u003e\\r\\n\u003chr\u003e\u003ccenter\u003enginx/1.15.10\u003c/center\u003e\\r\\n\u003c/body\u003e\\r\\n\u003c/html\u003e\\r\\n\\n\\n\
This is interesting for a few reasons. First is that Drone does not return a 400 bad request from any runner API calls. Second, it is that the error returns an HTML page. It almost looks like some sort of reverse proxy intercepting the request and providing an unexpected response code and body.
Regardless and error should result in an error so let me see what else I can dig up.
Thanks for working through this and providing the additional details. The additional logging really helped. Specifically the error message with the HTML / 400 Bad Request suggested the API calls to update the Step’s status were failing in an unexpected way. Normally, the runner would retry using a backoff until the server could process the request, however, the unexpected HTTP status code was resulting in an unhandled / unexpected exception which should have been failing the pipeline but was not. I have a patch, and will tag a new release once I’ve finished some more testing https://github.com/drone/runner-go/commit/179a2f65e8b2161a1ad91eace9bf356d1f51e475
I think there is also a second piece to this problem that should be resolved for the system to fully function as expected. I recommend checking your reverse proxy or load balancer to prevent it from modifying or overriding the http response on error. The runner has logic built around different response codes and http errors, including retries with backoff depending on the failure type, which can make the runner a bit more resilient to minor networking issues, and will prevent a single failed http request from failing your pipeline.
Thanks, Brad!
Deployed 1.5.3, now we see if things improve. Regarding the proxy, our drone instance is fronted by an nginx ingress controller. It doesn’t make sense to me that it would be causing the 400 error, but possibly some network issue could cause that. I will try and dig there.
Happy to report this scenario now fails the build: https://drone.grafana.net/grafana/grafana/4312/1/1
Current hypothesis is that the request is over some limit defined by the nginx-ingress we are using, and header or body size limits need adjusting.