internet-latency-collector: add wheresitup job backlog observability #3203
Open
Add metrics and logging to make wheresitup service slowdowns easier to detect and diagnose. During the 2026-03-09 incident, ~60-90 jobs per cycle were completing too slowly, causing backlog accumulation that was only visible by cross-referencing total_jobs with processed_count.

- Add pending_jobs and in_progress_count to export summary logs
- Add WheresitupPendingJobs prometheus gauge for alerting on backlog
- Add WheresitupAPIResponseDuration histogram for API call timing
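
To illustrate the kind of export summary these changes describe, here is a minimal sketch using only the Go standard library. It is not the code from this PR: the `ExportSummary` struct, the `PendingJobs` method, and the rule that pending jobs equal total jobs minus processed jobs are all assumptions made for illustration. Only the log field names (`total_jobs`, `processed_count`, `in_progress_count`, `pending_jobs`) and the figure of 435 jobs per cycle come from the description; the processed count of 360 is an illustrative value.

```go
package main

import (
	"log/slog"
	"os"
)

// ExportSummary is a hypothetical per-cycle summary. The field names mirror
// the log fields named in the PR description; the struct itself is assumed.
type ExportSummary struct {
	TotalJobs       int
	ProcessedCount  int
	InProgressCount int
}

// PendingJobs assumes that every job not yet processed counts as pending.
// The real collector may define the backlog differently.
func (s ExportSummary) PendingJobs() int {
	pending := s.TotalJobs - s.ProcessedCount
	if pending < 0 {
		return 0 // guard against inconsistent counters
	}
	return pending
}

// LogSummary emits one structured log line with the backlog fields, so the
// backlog can be read directly instead of being derived after the fact.
func LogSummary(logger *slog.Logger, s ExportSummary) {
	logger.Info("wheresitup export summary",
		"total_jobs", s.TotalJobs,
		"processed_count", s.ProcessedCount,
		"in_progress_count", s.InProgressCount,
		"pending_jobs", s.PendingJobs(),
	)
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
	// Illustrative cycle: 435 jobs submitted, 360 processed, 75 still running.
	LogSummary(logger, ExportSummary{TotalJobs: 435, ProcessedCount: 360, InProgressCount: 75})
}
```

In a collector that exposes Prometheus metrics, the same `pending_jobs` value would presumably also be written to the gauge on each cycle, which is what makes alerting on a growing backlog possible.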
Summary
- Add `pending_jobs` Prometheus gauge and `in_progress_count`/`pending_jobs` log fields to the wheresitup export summary, making job backlog accumulation visible for alerting and log analysis
- Add `WheresitupAPIResponseDuration` histogram for tracking API call latency

Context
During the 2026-03-09 wheresitup slowdown (~09:01–10:50 UTC), ~60-90 of 435 jobs per cycle were completing too slowly, causing backlog growth from 435 to 1322 pending jobs. The only way to detect this was cross-referencing `total_jobs` with `processed_count` in logs after the fact. These changes make the backlog directly observable in both logs and Prometheus.

Testing Verification