Hello! I see from the docs (https://docs.drone.io/administration/server/metrics/) that Drone exposes some metrics to Prometheus. However, I can’t find much, if any, information on which metrics are exposed. I found some docs for Drone 0.8.x, but we’re on 1.x.x now.
I’m looking to capture MTTR and MTBF as well as trying to infer some cycle / lead time. I’d need to be able to track the start and end time of entire pipelines as well as their finish state. I am wondering if these metrics are possible to capture from Drone before I go setting up Prometheus and the metrics exporter?
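To make the goal concrete, here is a rough sketch of how MTTR and MTBF could be derived once finish times and finish states are available for each pipeline. The definitions are assumptions, since teams define these differently: a failure incident opens at the first failing build and closes at the next successful one, MTTR is the mean incident length, and MTBF is the mean time between the starts of consecutive incidents. The function name and the status strings "success" and "failure" are illustrative.

```python
from statistics import mean

def mttr_mtbf(builds):
    """builds: iterable of (finished_unix_seconds, status) for one branch.

    Returns (mttr_seconds, mtbf_seconds); either may be None if there is
    not enough data. Statuses other than "success"/"failure" are ignored.
    """
    incident_starts = []   # finish time of the first failing build per incident
    repair_times = []      # seconds from incident start to the next success
    open_since = None
    for finished, status in sorted(builds):
        if status == "failure" and open_since is None:
            open_since = finished
            incident_starts.append(finished)
        elif status == "success" and open_since is not None:
            repair_times.append(finished - open_since)
            open_since = None
    gaps = [b - a for a, b in zip(incident_starts, incident_starts[1:])]
    mttr = mean(repair_times) if repair_times else None
    mtbf = mean(gaps) if gaps else None
    return mttr, mtbf

history = [
    (100, "success"), (200, "failure"), (260, "failure"),
    (500, "success"), (900, "failure"), (1000, "success"),
]
print(mttr_mtbf(history))
```

Two incidents appear in this sample history (starting at 200 and 900), so the repair times are 300 and 100 seconds and the single gap between incident starts is 700 seconds.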
You can visit the Prometheus metrics endpoint at /metrics to see a list of supported Drone metrics (prefixed with drone_) with descriptions of what those metrics represent.
# HELP drone_build_count Total number of builds.
# TYPE drone_build_count gauge
drone_build_count 108777
# HELP drone_pending_builds Total number of pending builds.
# TYPE drone_pending_builds gauge
drone_pending_builds 0
# HELP drone_pending_jobs Total number of pending jobs.
# TYPE drone_pending_jobs gauge
drone_pending_jobs 0
# HELP drone_repo_count Total number of registered repositories.
# TYPE drone_repo_count gauge
drone_repo_count 3905
# HELP drone_running_builds Total number of running builds.
# TYPE drone_running_builds gauge
drone_running_builds 2
# HELP drone_running_jobs Total number of running jobs.
# TYPE drone_running_jobs gauge
drone_running_jobs 2
# HELP drone_user_count Total number of active users.
# TYPE drone_user_count gauge
drone_user_count 4584
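If you want to inspect that output programmatically before setting up Prometheus, a minimal sketch like the following could pull out the drone_ samples. It only handles unlabeled samples like the ones shown above, and the function name is made up for illustration. Fetching the text from /metrics (and any authentication it needs) is left out.

```python
def parse_drone_metrics(text: str) -> dict[str, float]:
    """Return {metric_name: value} for unlabeled drone_ samples."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("drone_"):
            metrics[parts[0]] = float(parts[1])
    return metrics

sample = """\
# HELP drone_pending_builds Total number of pending builds.
# TYPE drone_pending_builds gauge
drone_pending_builds 0
# HELP drone_running_builds Total number of running builds.
# TYPE drone_running_builds gauge
drone_running_builds 2
"""

print(parse_drone_metrics(sample))
```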
Unfortunately these metrics don’t have the data I would need to capture what I want. Can you advise on any way I could possibly track start/end time and status of pipelines in Drone?
I found some discussion on an API potentially for metrics and other discussions around extensions for metrics. It doesn’t seem like any of that happened though and I’m unclear on how a Drone plugin or extension could provide any functionality like that. Am I just looking in the wrong places?
You would need to create a small metrics collector (a standalone program) that queries the Drone database with SQL to extract the data you need and provides it to Prometheus. This is how other teams are collecting custom metrics today, although unfortunately none of them (to my knowledge) has published their code.
We have considered sponsoring some sort of project (or project template) to help teams create custom metric collectors, however, our primary focus right now is on our roadmap.
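As a rough illustration of the collector idea, here is a minimal sketch that runs one SQL query and renders the result in the Prometheus text format. It is not anyone's published collector. The table and column names (builds, build_status, build_started, build_finished) are assumptions about the schema, so check them against your own database. The ci_ metric names are invented, and an in-memory SQLite database stands in for the real one so the example runs on its own.

```python
import sqlite3

# Assumed schema: one row per build, with Unix timestamps (0 = not finished).
QUERY = """
SELECT build_status, COUNT(*), AVG(build_finished - build_started)
FROM builds
WHERE build_finished > 0
GROUP BY build_status
ORDER BY build_status
"""

def render_metrics(conn) -> str:
    """Render finished-build counts and average durations as exposition text."""
    rows = conn.execute(QUERY).fetchall()
    lines = [
        "# HELP ci_builds_finished Finished builds by status.",
        "# TYPE ci_builds_finished gauge",
    ]
    for status, count, _ in rows:
        lines.append(f'ci_builds_finished{{status="{status}"}} {count}')
    lines += [
        "# HELP ci_build_duration_seconds_avg Average build duration by status.",
        "# TYPE ci_build_duration_seconds_avg gauge",
    ]
    for status, _, avg in rows:
        lines.append(f'ci_build_duration_seconds_avg{{status="{status}"}} {avg:.1f}')
    return "\n".join(lines) + "\n"

# Stand-in database with a few fake rows for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (build_id INTEGER PRIMARY KEY, build_status TEXT, "
    "build_started INTEGER, build_finished INTEGER)"
)
conn.executemany(
    "INSERT INTO builds (build_status, build_started, build_finished) VALUES (?, ?, ?)",
    [("success", 100, 160), ("success", 200, 300), ("failure", 400, 430), ("running", 500, 0)],
)
output = render_metrics(conn)
print(output)
```

In a real collector the rendered text would be served over HTTP for Prometheus to scrape, and the connection would point at a read-only replica or a read-only user rather than the live database.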
Thanks again for your quick responses. I was really hoping for some event system over straight DB access. That said, I understand you can’t be everything to everyone all the time :). I’ll look in the DB and see what I can get. Maybe we can get approval to open source some solution…
Drone supports system-wide webhooks which could be used to feed data to Drone and aggregate metrics. See "How to use Global Webhooks". Direct database access, however, would be compatible with whatever solution we release in the future.
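For the webhook route, the receiving side could be sketched roughly like this. The payload shape used here (an "event" field, a "build" object with "status", "started" and "finished" as Unix timestamps, and a "repo" object with "slug") is an assumption, so compare it against a payload captured from your own server. The function name is illustrative, and verifying the request signature is deliberately left out.

```python
import json

# Statuses treated as "the build is over" (assumed values).
TERMINAL = {"success", "failure", "error", "killed"}

def summarize_build_event(body: bytes):
    """Return a small dict for finished builds, or None for anything else."""
    payload = json.loads(body)
    if payload.get("event") != "build":
        return None
    build = payload.get("build", {})
    status = build.get("status")
    if status not in TERMINAL:
        return None  # still pending or running, nothing to record yet
    started = build.get("started", 0)
    finished = build.get("finished", 0)
    return {
        "repo": payload.get("repo", {}).get("slug"),
        "number": build.get("number"),
        "status": status,
        "duration_seconds": finished - started if started and finished else None,
    }

example = json.dumps({
    "event": "build",
    "action": "updated",
    "repo": {"slug": "octocat/hello-world"},
    "build": {"number": 42, "status": "failure", "started": 1000, "finished": 1090},
}).encode()

print(summarize_build_event(example))
```

The returned dict is the kind of record that could then be written to whatever store you aggregate from (CloudWatch, a database, a Prometheus pushgateway, and so on).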
@bradrydzewski I’ve had some time now to get webhooks set up and start looking at the payload. I’m noticing something I didn’t expect with webhooks, though.
I’m running on AWS. I set up Drone to post webhooks back to its own load balancer. That load balancer forwards requests to a Lambda where I can process them and store them in CloudWatch metrics.
When I trigger a new build I get a flood of requests to the load balancer. Hundreds of requests.
Any idea why this is? What could Drone possibly be emitting hundreds of events for? I’m continuing to dig through what is getting sent and will have more information on that later. But it’s very clear that triggering a build correlates with these huge request spikes.
After debugging the requests I now see that these requests are not the result of the webhooks. They are drone RPC calls. Looking back at request count over the last 2 weeks it was always this way. I just didn’t notice until after I enabled webhooks.
Yep, the docker runner doesn’t buffer log uploads (we were planning to buffer before 1.0, but we realized it actually had minimal performance implications and we ran out of time). The newer runners (which use the runner-go library) buffer uploads, which can reduce traffic for builds that generate significant logs. The docker runner will be migrated to use runner-go in the coming months.
@mneil I wanted to circle back and let you know that the new agent is available [1]. It buffers the log stream and should create a bit less noise in your monitoring system.
I’ve been using the /metrics endpoint for a while now. I’m sharing the prometheus definition I use to fetch the data and send it to our dashboards.
However, we have seen some weird behavior with one metric: drone_running_builds
Over time, some values get stuck. Right now it reports 8, even though there are absolutely no running builds at all. Is there any way to clean this up or reset it?
Our prometheus.yml definition to publish this data to Datadog if it’s any help to anyone: