We have a build (should be publicly accessible here) that shows success on GitHub, even though most of the steps were actually skipped. This has happened at least twice now.
I can’t find anything in logs to indicate why it chose to skip those steps. Restarting the build has resulted in a successful run, so I don’t think it is anything wrong with the config.
The only reason I can guess for things being skipped is that a step dependency is not being met, but in this case all of the steps that ran were successful. The “identify-runner” step is a little off here, because it shows success even though the UI shows no logs. That seems a little unusual.
This is pretty bad, since we usually trust the checkmark on a PR to mean all of the tests have passed and so forth, but in this case we got the checkmark without running any of the tests.
Are there any steps I can take to debug why this is happening and figure out how to fix it? We are running Drone 1.9.
I ran the config through drone lint and I can see the problem is caused by an invalid dependency that is defined in your dependency graph. It looks like defining an invalid dependency causes the pipeline to short circuit and exit.
$ drone lint
linter: invalid or unknown step dependency
Which version of the runner are you using? I think we may have patched this recently, but if not, we can work on a patch.
@bradrydzewski How can you tell which dependency is invalid? The linter doesn’t output it. The strange thing, though, is that this issue happens only rarely. Shouldn’t Drone always skip steps due to the missing dependency?
@bradrydzewski Can you see which dependency is missing? I’ve read through .drone.yml now and can’t see any missing dependencies at all. I mean, there might be one, but I can’t spot it (disclaimer: I’m the author of the Grafana Drone config).
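Since the linter message quoted above does not name the offending step, a small script can list every depends_on entry that does not match a step name. This is only a rough sketch, not how drone lint itself works: it assumes the pipeline has already been parsed into a Python dict with a "steps" list, and the example pipeline (including the misspelled "buld") is hypothetical.

```python
def find_unknown_dependencies(pipeline):
    """Return (step, dependency) pairs where the dependency names no step in the pipeline."""
    steps = pipeline.get("steps", [])
    known = {step["name"] for step in steps}
    problems = []
    for step in steps:
        for dep in step.get("depends_on", []):
            if dep not in known:
                problems.append((step["name"], dep))
    return problems


# Hypothetical pipeline, shaped like one parsed document of a .drone.yml.
example = {
    "kind": "pipeline",
    "name": "default",
    "steps": [
        {"name": "identify-runner"},
        {"name": "build", "depends_on": ["identify-runner"]},
        {"name": "test", "depends_on": ["buld"]},  # deliberate typo for illustration
    ],
}

for step, dep in find_unknown_dependencies(example):
    print(f"step {step!r} depends on unknown step {dep!r}")
```

A config with several pipelines would need this run once per pipeline document, since step names are only compared within a single pipeline here.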
We have updated our runners to all use the latest docker runner (1.5). We have also triple-checked our dependency graphs and made sure they all pass lint.
We are continuing to see unexplained skipped steps, primarily on private repos. Restarting the build usually fixes it, but it is happening with some frequency.
The common symptom seems to be skipped steps, and at least one “successful” step that has no logs at all.
I can confirm @captncraig’s account: several builds of a private repo today skipped (most) steps and were still marked as successful by Drone. I’ve run drone lint on the .drone.yml of the repo in question, and it finds no issues.
Could there please be a Drone configuration parameter to make it fail hard (with an explanation) when unable to resolve dependencies, instead of skipping and marking the build successful? I think this behaviour is really bad, as it’s never something I want to happen and I’m also super confused as to why it’s happening.
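Until something like that exists, one possible stopgap is to check a finished build for skipped steps before trusting the checkmark. The sketch below is an assumption-heavy illustration, not a Drone feature: it assumes the build has been fetched as JSON and loaded into a dict with "status", "stages", and per-stage "steps" fields carrying a "status" such as "skipped", which should be verified against what your Drone server actually returns. The sample build is made up.

```python
def skipped_steps(build):
    """Return (stage, step) name pairs for every step whose status is 'skipped'."""
    skipped = []
    for stage in build.get("stages", []):
        for step in stage.get("steps", []):
            if step.get("status") == "skipped":
                skipped.append((stage.get("name"), step.get("name")))
    return skipped


def is_suspicious(build):
    """True when a build reports success although at least one step was skipped."""
    return build.get("status") == "success" and bool(skipped_steps(build))


# Hypothetical build payload; field names are assumptions, see the note above.
sample_build = {
    "status": "success",
    "stages": [
        {
            "name": "default",
            "steps": [
                {"name": "identify-runner", "status": "success"},
                {"name": "build", "status": "skipped"},
                {"name": "test", "status": "skipped"},
            ],
        }
    ],
}

if is_suspicious(sample_build):
    for stage, step in skipped_steps(sample_build):
        print(f"skipped: {stage}/{step}")
```

Steps that are skipped on purpose (for example because of a branch or event condition) would also be flagged, so a real check would need an allow-list for those.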
I published a patch yesterday that is available in drone/drone-runner-docker:latest; however, if you upgraded to version 1.5 you would not have the fix yet. I recommend using the latest image to see if this solves the issue. I am not aware of any other root cause for the behavior described. If the issue persists, we would kindly ask that you enable debug logging on the runner and provide the logs, as well as a yaml that we can use to try and reproduce (if the repository is open source, a link to the build is also helpful).
edit: just noticed you mentioned this was for a private repository. once you upgrade, feel free to email me the requested info if you don’t want to share publicly
Thanks for making that change @bradrydzewski! It’ll be a great help if Drone catches and reports these errors, instead of just skipping steps. I will see with @captncraig if we can try the latest image revision.
I’ll email you the Drone config in question (to the standard drone.io address).
I deployed runners from latest. The linter error does stop one edge case that looks a little like ours, and I much prefer the new behaviour, so thank you.
Time will tell if that gives any improvement or insight to the problematic builds.
@bradrydzewski This problem is happening again for us now (although randomly, as before). See, for example, this build. @captncraig We’re sure we’re on a version with the fix?
We made numerous patches to both prevent this issue and provide additional trace logging so that we could help diagnose the issue if it happens again. In order to take advantage of all patches and improved logging you need version 1.5.2 of the runner or higher. If this happens again, enable trace logging on the runner and provide a log dump and we can analyze it.
I think we might have been on an older version of the runners after all; from what I hear, we’ve rolled out v1.5.2 now. Fingers crossed it smooths out the bumps!
I am unable to access the link. Also can you enable trace logging on the runner? We added a bunch of new trace log entries to help debug this issue.
Specifically the FIXME lines are a little concerning.
The FIXME lines come from the Docker client and should be ignored. The error is both annoying and misleading, but unfortunately there is nothing we can do about it until there is an upstream Docker fix.