Add monitoring for the queue depth and jobs being executed at a time #86

@emilyalbini

As part of the investigation into Omicron CI performance, I had to gather statistics on how many jobs were sitting in the queue at any given time and how long jobs were spending there. For the initial analysis I ran some database queries and did some ad-hoc processing on the data. While that worked, it's a manual process and doesn't allow for continuous monitoring or alerting.

We should change Buildomat to export relevant metrics in the OpenMetrics format, so that we can ingest them into whatever centralized monitoring solution operations decides to spin up in the future. The metrics will be exposed at a /metrics endpoint on the Buildomat server, protected with a bearer token configured in the server's config.toml.
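To make the bearer token protection concrete, here is a minimal sketch of the check the /metrics handler could run. It is not tied to Buildomat's actual HTTP stack or config layout: `is_authorized` and `constant_time_eq` are hypothetical names, the raw `Authorization` header value is assumed to be passed in by whatever framework serves the request, and the expected token is assumed to have already been read from config.toml.

```rust
/// Compare two byte slices without exiting early on the first mismatch,
/// so response timing does not reveal how many leading bytes matched.
/// The length check does return early, so the token length can leak.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical helper: returns true when the `Authorization` header value
/// has the form `Bearer <token>` and the token matches the configured one.
/// An empty configured token never authorizes anything.
pub fn is_authorized(header: Option<&str>, expected_token: &str) -> bool {
    if expected_token.is_empty() {
        return false;
    }
    match header.and_then(|value| value.split_once(' ')) {
        Some((scheme, token)) => {
            // The auth scheme name is case-insensitive.
            scheme.eq_ignore_ascii_case("Bearer")
                && constant_time_eq(
                    token.trim().as_bytes(),
                    expected_token.as_bytes(),
                )
        }
        None => false,
    }
}
```

A handler would call `is_authorized(header_value, &configured_token)` and respond with 401 when it returns false.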

The metrics I am thinking of adding initially are:

  • buildomat_jobs_queued{target="TARGET"} (gauge): number of jobs currently in the queue for any given target.
  • buildomat_jobs_running{target="TARGET"} (gauge): number of jobs currently running for any given target.
  • buildomat_jobs_time_in_queue{target="TARGET"} (gauge): number of seconds the oldest job for a given target has been queued.
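
As a rough illustration of what the endpoint could return, here is a sketch that renders the three gauges above as OpenMetrics text. `TargetStats`, `render_metrics`, and the helper functions are hypothetical names, not existing Buildomat code, and the per-target numbers are assumed to come from a database query that is not shown.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical per-target snapshot; not an existing Buildomat type.
#[derive(Debug, Default, Clone, Copy)]
pub struct TargetStats {
    pub queued: u64,
    pub running: u64,
    /// Seconds the oldest queued job has been waiting (0 if none).
    pub oldest_queued_secs: u64,
}

/// Escape a label value: backslash, double quote, and newline.
fn escape_label(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Write one gauge family: metadata lines, then one sample per target.
fn write_gauge(
    out: &mut String,
    name: &str,
    help: &str,
    stats: &BTreeMap<String, TargetStats>,
    value: impl Fn(&TargetStats) -> u64,
) {
    // Writing to a String cannot fail, so unwrap is safe here.
    writeln!(out, "# TYPE {name} gauge").unwrap();
    writeln!(out, "# HELP {name} {help}").unwrap();
    for (target, s) in stats {
        writeln!(
            out,
            "{name}{{target=\"{}\"}} {}",
            escape_label(target),
            value(s)
        )
        .unwrap();
    }
}

/// Render all three proposed gauges, ending with the `# EOF` marker
/// that the OpenMetrics text format requires.
pub fn render_metrics(stats: &BTreeMap<String, TargetStats>) -> String {
    let mut out = String::new();
    write_gauge(&mut out, "buildomat_jobs_queued",
        "Jobs currently in the queue.", stats, |s| s.queued);
    write_gauge(&mut out, "buildomat_jobs_running",
        "Jobs currently running.", stats, |s| s.running);
    write_gauge(&mut out, "buildomat_jobs_time_in_queue",
        "Seconds the oldest job has been queued.", stats,
        |s| s.oldest_queued_secs);
    out.push_str("# EOF\n");
    out
}
```

With a single made-up target named `helios` that has 3 queued jobs, the output would contain a line like `buildomat_jobs_queued{target="helios"} 3`. A `BTreeMap` keeps the targets in a stable order between scrapes, which makes the output easier to diff by eye.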
