We’re experiencing an issue where our pipelines are not displaying logs for certain steps.
What happens:
We have multiple steps that run from an Alpine image with Python.
Each step has commands like test, echo "test", pip install…
While the pipeline is running, we can see all log output in the UI.
After the step finishes, I click away from the step and then back onto it, and no logs are shown.
I only get the status of the build step.
In the browser console I see a 404 response to the request for that step’s logs, in both Firefox and Chrome.
In the drone logs we see:
{
  "error": "stream: not found",
  "level": "warning",
  "msg": "manager: cannot teardown log stream",
  "step.id": someid,
  "step.name": "step name",
  "step.status": "skipped",
  "time": "2020-04-29T22:30:27Z"
}
Any ideas what is happening here? We haven’t updated Drone or these images in days, but this started happening recently. We bounced all our runner instances. Only some of our pipelines show this behavior. Both failed and successful steps and pipelines are affected.
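The 404 seen in the browser console can also be checked outside the UI, which helps rule out a front-end problem. Below is a minimal Python sketch that requests one step's persisted logs and reports the HTTP status. The URL layout (`/api/repos/{owner}/{repo}/builds/{build}/logs/{stage}/{step}`) and the bearer-token header are assumptions based on the Drone REST API and should be verified against your server version; the host name and identifiers in the usage note are hypothetical.

```python
import urllib.error
import urllib.request


def logs_url(server, owner, repo, build, stage, step):
    """Build the URL for one step's persisted logs (path layout is an assumption)."""
    base = server.rstrip("/")
    return f"{base}/api/repos/{owner}/{repo}/builds/{build}/logs/{stage}/{step}"


def logs_status(url, token):
    """Return the HTTP status code of the logs request.

    A 404 here would match what the browser console shows: the server has
    no persisted logs for that step.
    """
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

For example, `logs_status(logs_url("https://drone.example.com", "acme", "api", 42, 1, 2), token)` could be run for an affected step and for an unaffected one to compare the results.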
The logs indicate the step was skipped, which means no logs would exist, so this log entry is expected. It should probably be a debug log entry instead of a warning, to avoid false positives.
Do you have any server or runner logs for steps that are not being skipped where you observed this behavior? Perhaps with the runner trace logs enabled?
Also, is it possible you have a reverse proxy or load balancer that is preventing the final logs from uploading to the server due to request size limits? When the pipeline completes, the full logs are uploaded and persisted (the realtime log stream is volatile and is not persisted). The payload with the full logs can grow quite large. The behavior you are describing is consistent with the final logs failing to upload, for which the most common root cause is load balancer / proxy request size limits.
There are no logs from the runner at all for the steps that do not retain logs. No logs in the server either that indicate an error. There are no reverse proxies, and these steps haven’t changed in 8 days when they last completed. The successful steps are roughly 6 seconds in length and I’m positive they aren’t over a size limit since we have some steps that I know, from history, run much longer with more logs :).
That said, your response did jog my memory that we recently added a WAF to the load balancer, and I’m now wondering if that is the cause of the errors. Is there a way we can avoid sending logs from the runner through the main Drone ALB (on AWS)? Off the top of my head, I’m thinking we may want, or need, to set up a private, internal ALB for this traffic.
Based on the behavior you described, it definitely sounds like something is blocking uploads to the server. If uploads reached the server and failed there, we would see corresponding error entries in the server logs. As a next step, we should be able to capture the HTTP requests and responses between the runner and the server using these runner parameters:
Ok, reporting back this morning. It is the WAF we added to the ALB some time ago. Only certain steps are getting blocked because the rules don’t like what is in the logs.
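To narrow down which steps are likely to be blocked, it can help to scan step output for content that resembles typical WAF rule categories. The Python sketch below does this with a few deliberately simplified, hypothetical patterns. They are not the actual AWS WAF rules, so a match only suggests where to look; the WAF's own logs remain the authoritative source for which rule fired.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Real WAF managed rule sets are far more extensive.
SUSPECT_PATTERNS = {
    "sql-like": re.compile(
        r"\b(select|union|insert|drop)\b.+\b(from|into|table)\b", re.IGNORECASE
    ),
    "script-tag": re.compile(r"<\s*script\b", re.IGNORECASE),
    "path-traversal": re.compile(r"\.\./"),
}


def flag_lines(log_text):
    """Return (line_number, pattern_name) pairs for log lines that match a pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


sample = "pip install requests\nSELECT id FROM users\ncat ../../etc/passwd"
print(flag_lines(sample))
```

Running this against the saved output of a step that loses its logs, and against one that does not, gives a quick first comparison before digging into the WAF logs.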
To me, the best approach seems to be to keep the server/runner traffic private so we don’t need a WAF on that connection. Is there any support for having the runners and server communicate over an internal load balancer while the Drone server is also hosted behind a different load balancer? Or is anyone using that setup, with references we could start from?
Looking at the Docker runners, it seems like we can add another load balancer to the Drone server and set the RPC host to use that load balancer to communicate. We might give that a try. Let me know if you have any different recommendations. Thanks again!
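As a rough illustration of that idea, the Python sketch below renders the runner settings that would point RPC traffic at an internal load balancer. `DRONE_RPC_PROTO`, `DRONE_RPC_HOST` and `DRONE_RPC_SECRET` are the runner's RPC settings as I understand them; the host name and secret are hypothetical placeholders, and whether the internal load balancer terminates TLS (and so which protocol to use) is an assumption to adjust for your environment.

```python
def runner_env(internal_host, secret, proto="https"):
    """Return runner RPC settings that target the internal load balancer."""
    return {
        "DRONE_RPC_PROTO": proto,
        "DRONE_RPC_HOST": internal_host,  # hypothetical internal LB host name
        "DRONE_RPC_SECRET": secret,       # must match the server's shared secret
    }


def to_env_file(env):
    """Render settings as KEY=value lines, e.g. for `docker run --env-file`."""
    return "\n".join(f"{key}={value}" for key, value in env.items()) + "\n"


print(to_env_file(runner_env("drone-internal.example.com", "change-me")))
```

The public load balancer (with the WAF) would then only carry UI and webhook traffic, while the runners talk to the server through the internal one.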