
Add breadcrumb when rate limiting fails to read the clock#7281

Open
bendk wants to merge 1 commit into mozilla:main from bendk:push-qpkrqorszswl

bendk (Contributor) commented Mar 19, 2026

This rate-limit check doesn't seem to be working in some cases. For example, for https://bugzilla.mozilla.org/show_bug.cgi?id=2004954 I saw many reports per minute from the same user. I'm wondering if the cause is this check failing to read the clock.
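To make the discussion concrete, here is a minimal sketch of a fixed-window rate limiter that reads the wall clock, showing where reading the clock can fail. `RateLimiter`, `should_report`, and the window/limit parameters are hypothetical names for illustration, not the actual application-services code. The only clock failure modeled is `SystemTime::duration_since` returning an error when the current time reads earlier than the window start (for example, after the system clock is set backwards).

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical fixed-window rate limiter, for illustration only: allows at
/// most `max_per_window` reports (assumed >= 1) in each `window`.
pub struct RateLimiter {
    max_per_window: u32,
    window: Duration,
    window_start: Option<SystemTime>,
    count: u32,
}

impl RateLimiter {
    pub fn new(max_per_window: u32, window: Duration) -> Self {
        Self { max_per_window, window, window_start: None, count: 0 }
    }

    /// Begin a new window at `now`, counting the current report.
    fn start_window(&mut self, now: SystemTime) -> bool {
        self.window_start = Some(now);
        self.count = 1;
        true
    }

    /// Returns true if a report observed at `now` should be sent.
    pub fn should_report(&mut self, now: SystemTime) -> bool {
        let Some(start) = self.window_start else {
            return self.start_window(now);
        };
        match now.duration_since(start) {
            // Inside the current window: allow reports up to the limit.
            Ok(elapsed) if elapsed < self.window => {
                if self.count < self.max_per_window {
                    self.count += 1;
                    true
                } else {
                    false
                }
            }
            // The window has expired: begin a new one.
            Ok(_) => self.start_window(now),
            // The clock reads earlier than the window start, so the elapsed
            // time is unknown. This sketch lets the report through and
            // restarts the window, so normal limiting resumes right away.
            Err(_) => self.start_window(now),
        }
    }
}
```

In this sketch, a backward clock jump costs at most one fresh window's worth of reports before limiting resumes; a different recovery strategy in the real code could behave differently.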

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

bendk requested review from a team and skhamis on March 19, 2026 16:22
 // through seems okay in this case. We should get back into a good state soon
 // after.
-_ => (),
+_ => {
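The PR changes the fallback arm of a match like the one above. A minimal sketch of the idea, assuming the clock is read with `SystemTime::duration_since`: instead of silently ignoring the failure, record a breadcrumb and let the caller fall through. `Breadcrumbs` and `elapsed_since` are hypothetical stand-ins; the real code would use the project's own breadcrumb mechanism.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for a breadcrumb log.
#[derive(Default)]
pub struct Breadcrumbs(pub Vec<String>);

impl Breadcrumbs {
    pub fn add(&mut self, msg: String) {
        self.0.push(msg);
    }
}

/// Returns the time since `last_report`, or `None` if it can't be computed
/// because the clock now reads earlier than `last_report`. In that case a
/// breadcrumb is recorded before falling through.
pub fn elapsed_since(
    last_report: SystemTime,
    now: SystemTime,
    crumbs: &mut Breadcrumbs,
) -> Option<Duration> {
    match now.duration_since(last_report) {
        Ok(elapsed) => Some(elapsed),
        Err(e) => {
            // `e.duration()` is how far `now` is behind `last_report`.
            crumbs.add(format!("rate limit: clock went backwards by {:?}", e.duration()));
            None
        }
    }
}
```

Recording how far the clock moved backwards, rather than just that it did, would help distinguish a one-off clock adjustment from a clock that keeps jumping around.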
Member

I'm a little skeptical of this. I can see how a single extra event might get through each time the clock changes, but not how that could cause many reports per minute. What's your theory about how that actually happens in practice?

Contributor Author

It does seem really weird.

Here's what I'm looking at in Grafana: one client is generating multiple error pings per second. (I still need to add a client filter, but for now I've added enough other filters that I'm pretty sure I'm only capturing one person.)

I don't really understand what's going on. This PR tests one wild theory: maybe something is very wrong at the system level. My other theory was that Firefox was restarting and clearing out the rate-limiting data, but it seems impossible for Firefox to restart that often in practice. My only other theory was that this is part of some automation. This PR is really just a shot in the dark. Do you have any ideas about what could be happening?
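The restart theory can be sanity-checked with a little arithmetic. Assuming the limiter's state lives only in memory and is reset on every start, each process lifetime can send at most the per-minute limit in a given minute, so `observed` reports in one minute need at least ceil(observed / limit) lifetimes, which is one more than the number of restarts. The function name and the numbers below are hypothetical, not the real configuration.

```rust
/// Hypothetical helper: the fewest process restarts per minute needed to
/// emit `observed_per_minute` reports, assuming an in-memory limiter that
/// allows `limit_per_minute` reports and is cleared whenever the process starts.
pub fn min_restarts_per_minute(observed_per_minute: u32, limit_per_minute: u32) -> u32 {
    assert!(limit_per_minute > 0, "limit must be positive");
    // Each lifetime contributes at most `limit_per_minute` reports, so we
    // need ceil(observed / limit) lifetimes within the minute.
    let lifetimes = (observed_per_minute + limit_per_minute - 1) / limit_per_minute;
    // The first lifetime is already running; every additional one is a restart.
    lifetimes.saturating_sub(1)
}
```

For example, at an illustrative two pings per second (120 per minute) and a hypothetical limit of one report per minute, `min_restarts_per_minute(120, 1)` is 119, i.e. roughly two restarts every second.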

(BTW, the unit in that graph is "errors / day", which is wrong for unique clients. I think it's just one unique client, but since the interval is 2 hours, the graph multiplies by 12 to get 12 unique clients / day. I'm going to try to fix that today.)
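The scaling described above is easy to reproduce: converting a per-bucket count to a per-day figure multiplies it by the number of buckets in a day. A minimal sketch of that conversion (the function name is hypothetical, and this models the arithmetic only, not Grafana's internals):

```rust
/// Scale a count observed in one bucket of `bucket_hours` to a per-day
/// figure by multiplying by the number of buckets in a day.
pub fn per_day(count_in_bucket: f64, bucket_hours: f64) -> f64 {
    // A day contains 24 / bucket_hours buckets.
    count_in_bucket * (24.0 / bucket_hours)
}
```

With 2-hour buckets, `per_day(1.0, 2.0)` gives 12.0, matching the 12 "unique clients / day" that a single client shows up as.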

