Use floats for tracking cpu time#564

Open
noncrab wants to merge 1 commit into tikv:master from noncrab:issue-560-fractional-cpu-seconds
@noncrab noncrab commented Mar 28, 2026

Fixes #560.

The one thing I'm not sure about is trading the well-defined wrap-around semantics for the risk of running out of bits in the mantissa. But then again, this uses an AtomicF64 under the hood, whose mantissa is 53 bits wide (it can exactly hold integers up to ~9e+15).

Practically, by my (hand-wavy) reckoning, that means we won't be able to accurately represent times at ~1µs of precision after ~285 CPU years (2^53 µs ≈ 9e+9 s); at ~1ms of precision the limit is ~285 CPU millennia. I'm not sure how much of an issue that might be, or how best to record that concern.
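The precision estimate can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it makes the simplifying assumption that the counter behaves like an integer count of resolution-sized steps, which is exact up to 2^53 in an f64.

```rust
/// Every integer in 0..=2^53 is exactly representable in an f64 (53-bit significand).
const MAX_EXACT_INT: f64 = 9_007_199_254_740_992.0; // 2^53

/// Seconds in a Julian year (365.25 days).
const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 60.0 * 60.0;

/// Approximate years of accumulated CPU time before a counter can no longer
/// resolve steps of `resolution_secs` seconds.
/// Assumption: the counter is treated as an integer count of such steps.
fn years_until_precision_loss(resolution_secs: f64) -> f64 {
    MAX_EXACT_INT * resolution_secs / SECONDS_PER_YEAR
}
```

Under these assumptions, `years_until_precision_loss(1e-6)` comes to roughly 285 years and `years_until_precision_loss(1e-3)` to roughly 285 millennia, so the limit is far away for any realistic process lifetime at either resolution.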

I'd also be tempted to just use a gauge here, as that would avoid the need to calculate the delta since the last update (and the ensuing race condition), but that's bound to be a regression for someone.

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of CPU metrics through higher-precision calculations and enhanced handling of edge cases in metric computation.

Signed-off-by: Noncrab <git@noncrab.net>
coderabbitai bot commented Mar 28, 2026

Walkthrough

ProcessCollector's cpu_total metric type is changed from IntCounter to Counter, with corresponding updates to the constructor and CPU time computation logic. The metric now uses floating-point division instead of integer division to properly track CPU seconds.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CPU Metric Type Fix: src/process_collector.rs | Changed cpu_total field from IntCounter to Counter and updated construction method from IntCounter::with_opts to Counter::with_opts. Updated CPU time computation from integer division to floating-point division, replacing saturating_sub(past) with (total - past).max(0.0) to properly handle underflow in float context. |
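To make the difference concrete, here is a minimal standalone sketch of the two computations described above. It does not use the prometheus crate, and the function names and the tick-based inputs are assumptions for illustration rather than the actual code in src/process_collector.rs.

```rust
/// Integer version (sketch): whole seconds only, the fractional part is truncated.
fn cpu_seconds_int(ticks: u64, ticks_per_sec: u64) -> u64 {
    ticks / ticks_per_sec
}

/// Float version (sketch): fractional CPU seconds are preserved.
fn cpu_seconds_float(ticks: u64, ticks_per_sec: u64) -> f64 {
    ticks as f64 / ticks_per_sec as f64
}

/// Integer delta: saturating_sub clamps at zero instead of wrapping around.
fn delta_int(total: u64, past: u64) -> u64 {
    total.saturating_sub(past)
}

/// Float delta: floats have no saturating_sub, so a negative difference
/// is clamped to zero with max(0.0).
fn delta_float(total: f64, past: f64) -> f64 {
    (total - past).max(0.0)
}
```

With 250 ticks at 100 ticks per second, the integer version reports 2 seconds while the float version reports 2.5. In both delta functions, a `past` value larger than `total` yields an increment of zero rather than a negative or wrapped value.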

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A counter's tale, once told in whole,
Now flows as fragments, soft and small,
From IntCounter's rigid role,
To Counter's grace—the fix for all!
✨ Prometheus smiles at metrics true,
When floats dance freely, clear and new.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly captures the main change: switching CPU time tracking from integer to floating-point values. |
| Linked Issues check | ✅ Passed | The PR successfully converts process_cpu_seconds_total from IntCounter to Counter, addressing issue #560's requirement for float-based CPU time reporting. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to converting CPU time tracking from integer to floating-point in ProcessCollector, with no unrelated modifications. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
src/process_collector.rs (1)

167-172: Race condition handling is reasonable, but double-counting remains possible.

The .max(0.0) correctly prevents negative increments when another thread has already updated the counter past the current total. However, the TOCTOU gap between get() (line 167) and inc_by() (line 171) means concurrent collectors could both read the same past value and each add the full delta, causing double-counting.

Given the PR author's analysis that gauges would be a regression for some users, this trade-off seems acceptable. The window for this race is small, and the metric will self-correct on subsequent collections.

Consider clarifying the comment to note both failure modes (underflow prevention and potential double-counting):

📝 Suggested comment clarification
             let past = self.cpu_total.get();
-            // If two threads are collecting metrics at the same time,
-            // the cpu_total counter may have already been updated,
-            // and the subtraction may underflow.
+            // If two threads collect concurrently, one may update cpu_total
+            // before the other calls inc_by. Use max(0.0) to avoid incrementing
+            // by a negative value if `past` was already advanced beyond `total`.
+            // Note: concurrent collection may still double-count briefly.
             self.cpu_total.inc_by((total - past).max(0.0));
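Both failure modes in this comment can be replayed deterministically on a single thread. The sketch below is a standard-library stand-in, not the prometheus crate's Counter: `FloatCounter`, `replay_double_count`, `replay_clamped`, and `advance_to` are hypothetical names. `advance_to` is an assumed alternative (a compare-and-swap "raise to at least" operation) shown only for contrast; it is not something the PR proposes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal stand-in for a float counter: an f64 stored as raw bits in an AtomicU64.
struct FloatCounter {
    bits: AtomicU64,
}

impl FloatCounter {
    fn new() -> Self {
        Self { bits: AtomicU64::new(0f64.to_bits()) }
    }

    fn get(&self) -> f64 {
        f64::from_bits(self.bits.load(Ordering::Relaxed))
    }

    /// Atomically add `delta` using a compare-and-swap loop.
    fn inc_by(&self, delta: f64) {
        let mut old = self.bits.load(Ordering::Relaxed);
        loop {
            let new = (f64::from_bits(old) + delta).to_bits();
            match self.bits.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return,
                Err(current) => old = current,
            }
        }
    }

    /// Hypothetical alternative: raise the counter to `total` only if that is an increase.
    fn advance_to(&self, total: f64) {
        let mut old = self.bits.load(Ordering::Relaxed);
        while f64::from_bits(old) < total {
            match self.bits.compare_exchange_weak(old, total.to_bits(), Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return,
                Err(current) => old = current,
            }
        }
    }
}

/// Replays the interleaving "A reads, B reads, A adds, B adds":
/// both collectors see the same `past`, so the delta is added twice.
fn replay_double_count(counter: &FloatCounter, total: f64) {
    let past_a = counter.get();
    let past_b = counter.get();
    counter.inc_by((total - past_a).max(0.0));
    counter.inc_by((total - past_b).max(0.0));
}

/// Replays a collector holding an older `total` running after the counter
/// has already advanced: the negative delta is clamped to zero.
fn replay_clamped(counter: &FloatCounter, newer_total: f64, older_total: f64) {
    counter.inc_by((newer_total - counter.get()).max(0.0));
    counter.inc_by((older_total - counter.get()).max(0.0));
}
```

Starting from zero, `replay_double_count(&c, 1.5)` leaves the counter at 3.0, and later collections add nothing until the real total passes 3.0, which is the self-correction the review mentions. `replay_clamped(&c, 2.0, 1.5)` leaves it at 2.0. An operation like `advance_to` would avoid the double count, but it assumes access to the underlying atomic.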
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/process_collector.rs` around lines 167 - 172, Update the inline comment
around the cpu_total read/update (the call sites cpu_total.get(),
cpu_total.inc_by((total - past).max(0.0)), and cpu_total.collect()) to
explicitly state that the current defensive .max(0.0) prevents underflow if
another thread has advanced the counter, but that the TOCTOU window between
reading past and calling inc_by can still allow two collectors to read the same
past and both add the delta (causing transient double-counting), and that this
is a small, self-correcting race accepted over switching to gauges; keep the
wording concise and reference the variables past and total so future readers
understand the two failure modes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dfce4370-bc87-4d83-acd9-4603562dfc95

📥 Commits

Reviewing files that changed from the base of the PR and between 8151418 and 58a78e9.

📒 Files selected for processing (1)
  • src/process_collector.rs

Development

Successfully merging this pull request may close these issues.

process_cpu_seconds_total is incorrectly registered as an IntCounter