@genisd thx for the pointer. Yes, it looks like the exact issue I experienced.
@bradrydzewski As discussed in https://github.com/drone/drone/pull/2653 this may burn a lot of as autoscaler thinks there are pending stages but they are never dispatched. Monkey patch is not a desirable way as the codebase rapidly evolves. Please consider to fix this. I think the integrity is more important than the performance.
We have not experienced this problem at cloud.drone.io and we have pretty significant build volume. If we were able to experience the issue it would be easier for us to troubleshoot and patch, however, this has not been the case.
Regarding the pull request, the author wrote the following in the description:
Another approach would be to properly kill those pending stages, but my Golang skills are currently too limited to accomplish this. Also further debugging would be needed to determine why it didn’t worked in the first place.
As the author notes, the pull request does not fix the root cause of the problem and they did not have time to research and identify the root cause. While I do appreciate the pull request (most people just complain and never send a pull request) it does not address the root cause. It is a quick hack. I have no desire to merge hacks and thus introduce technical debt. Send a pull request that addresses the root cause, and I will merge it. Or identify and document the root cause and I will be happy to patch.
@bradrydzewski Thank you for the clarification. I’ll try to create a drone.yml to reproduce the error, but in the meanwhile, I’m going to run the SQL script periodically to fix the inconsistency in Postgres.
update stages set stage_status='killed'
where stage_id in
(select stage_id from stages s
join builds b on s.stage_build_id = b.build_id
where b.build_status not in ('running') and s.stage_status='pending');
I understand you need to reproduce the situation to debug, but I’d appreciate if you know I need to update other tables. I get no result from drone queue ls after running the script so I presume it is sufficient.
where b.build_status not in ('running') and s.stage_status='pending'
It catches the all status other than running, including pending. Do I miss your point?
Sorry for the confusion. But if build_status is pending, wouldn’t stage_status be pending as well? So we shouldn’t be setting those stages to killed, it is just a queued build.
update stages set stage_status='killed'
where stage_id in
(select stage_id from stages s
join builds b on s.stage_build_id = b.build_id
where b.build_status not in ('running', 'pending') and s.stage_status='pending');