Add monitoring for the queue depth and jobs being executed at a time #86
Description
As part of the investigation into Omicron CI performance, I had to gather stats on how many jobs were sitting in the queue at a given point in time and how long jobs were spending in the queue. For the initial analysis I ran some database queries and did some ad-hoc processing on the data. While that worked, it's a manual process and doesn't allow for continuous monitoring and alerting.
We should change Buildomat to export relevant metrics in the OpenMetrics format, so that we'll be able to ingest them into whatever centralized monitoring solution operations decides to spin up in the future. The metrics will be exposed at a /metrics endpoint of the Buildomat server, protected with a bearer token configured in the server's config.toml.
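To make the bearer-token protection concrete, here is a minimal sketch of the check the /metrics handler could run before serving anything. It uses only the standard library and is not tied to Buildomat's actual HTTP framework or configuration structs. The function names `is_authorized` and `constant_time_eq` are hypothetical, and it assumes the token read from config.toml is available as a plain string.

```rust
/// Compare two byte strings without exiting early on the first mismatch.
/// This is an illustrative helper, not a vetted constant-time primitive.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical check for the /metrics endpoint.
///
/// `authorization` is the raw value of the `Authorization` header, if any.
/// `expected_token` is assumed to be the token loaded from config.toml.
pub fn is_authorized(authorization: Option<&str>, expected_token: &str) -> bool {
    let Some(value) = authorization else {
        return false;
    };
    let Some((scheme, token)) = value.split_once(' ') else {
        return false;
    };
    // The authentication scheme name is case-insensitive.
    if !scheme.eq_ignore_ascii_case("Bearer") {
        return false;
    }
    let token = token.trim();
    // Refuse everything when no token is configured, so that an empty
    // configuration value never leaves the endpoint open.
    !expected_token.is_empty()
        && constant_time_eq(token.as_bytes(), expected_token.as_bytes())
}
```

A handler would call `is_authorized` with the request's `Authorization` header and reject the request (for example with a 401 response) when it returns false.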
The metrics I am thinking of adding initially are:
buildomat_jobs_queued{target="TARGET"} (gauge): number of jobs currently in the queue for any given target.
buildomat_jobs_running{target="TARGET"} (gauge): number of jobs currently running for any given target.
buildomat_jobs_time_in_queue{target="TARGET"} (gauge): number of seconds the oldest job for a given target has been queued.
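As a rough illustration of what the endpoint's response body could look like, the sketch below renders the three proposed gauges in the OpenMetrics text format using only the standard library. The `TargetStats` struct, its field names, and the `render_metrics` function are assumptions made for this example; how the per-target numbers are actually obtained from Buildomat's database is left out.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical per-target snapshot; the field names are illustrative only.
#[derive(Debug, Clone, Copy, Default)]
pub struct TargetStats {
    pub queued: u64,
    pub running: u64,
    pub oldest_queued_secs: u64,
}

/// Escape a label value: backslash, double quote, and newline.
fn escape_label(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Write one gauge family: its metadata lines, then one sample per target.
fn write_family<F>(
    out: &mut String,
    name: &str,
    help: &str,
    stats: &BTreeMap<String, TargetStats>,
    value: F,
) where
    F: Fn(&TargetStats) -> u64,
{
    writeln!(out, "# TYPE {name} gauge").expect("writing to a String cannot fail");
    writeln!(out, "# HELP {name} {help}").expect("writing to a String cannot fail");
    for (target, s) in stats {
        writeln!(out, "{name}{{target=\"{}\"}} {}", escape_label(target), value(s))
            .expect("writing to a String cannot fail");
    }
}

/// Render all three proposed gauges, ending with the `# EOF` marker that
/// the OpenMetrics text format requires.
pub fn render_metrics(stats: &BTreeMap<String, TargetStats>) -> String {
    let mut out = String::new();
    write_family(&mut out, "buildomat_jobs_queued",
        "Number of jobs currently in the queue for a target.", stats, |s| s.queued);
    write_family(&mut out, "buildomat_jobs_running",
        "Number of jobs currently running for a target.", stats, |s| s.running);
    write_family(&mut out, "buildomat_jobs_time_in_queue",
        "Seconds the oldest queued job for a target has been waiting.", stats,
        |s| s.oldest_queued_secs);
    out.push_str("# EOF\n");
    out
}
```

Using a `BTreeMap` keeps the targets in a stable order between scrapes, which makes the output easier to diff when debugging. A response built this way would typically be served with the content type `application/openmetrics-text; version=1.0.0; charset=utf-8`.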