
Fix low recall when limit_val_batches is set #1298

Open
vickysharma-prog wants to merge 1 commit into weecology:main from vickysharma-prog:fix-limit-batches-recall-1232

Conversation

@vickysharma-prog (Contributor)

Description

When limit_val_batches is set (e.g., 0.1 for 10%), evaluation loads the full ground truth CSV but predictions only cover the limited images. This makes recall look very low because of "missing" predictions for images that were never processed.

Added a check in __evaluate__ that trims ground_df based on the limit_val_batches value, keeping ceil(limit_val_batches * n_images) images as suggested in the issue.

Also added a test case to verify the fix.
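The trimming described above can be sketched roughly as follows. This is a minimal illustration, not the PR diff: it assumes ground truth lives in a DataFrame with an image_path column (as in DeepForest annotation CSVs) and that the limited run sees the first ceil(limit_val_batches * n_images) images in sorted order.

```python
# Illustrative sketch of the ground-truth trimming; names are assumptions,
# not the actual DeepForest implementation.
import math

import pandas as pd


def trim_ground_df(ground_df: pd.DataFrame, limit_val_batches: float) -> pd.DataFrame:
    """Keep only ground truth rows for images the limited run will see."""
    images = sorted(ground_df["image_path"].unique())
    n_kept = math.ceil(limit_val_batches * len(images))
    kept = set(images[:n_kept])
    return ground_df[ground_df["image_path"].isin(kept)]


ground_df = pd.DataFrame(
    {
        "image_path": ["a.png", "a.png", "b.png", "c.png", "d.png"],
        "label": ["Tree"] * 5,
    }
)
# ceil(0.5 * 4 images) = 2 images kept ("a.png" and "b.png"), 3 rows total.
trimmed = trim_ground_df(ground_df, 0.5)
```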

Related Issue(s)

Fixes #1232

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Used for initial research and understanding the codebase structure

@jveitchmichaelis (Collaborator) left a comment

Thanks for your contribution, here are some comments:

  • Please scope the PR to only the issue (remove .gitignore changes, we could include that in another submission).
  • Please test with non-deprecated eval calls. Use m.create_trainer() with limit batches set as an argument and then call m.trainer.validate(m) or m.trainer.fit(m). You may need to set the validation interval to 1. This would better reflect a training scenario.
  • As above, this code should work with .validate() - __evaluate__ is not called during training.
  • The code for main is a little defensive. The trainer is created on init so it's almost impossible to run this function without self.trainer existing.
  • The test case does not adequately check the behavior. For example you have asserted non-negative recall, but this does not prove that recall accurately reflects the limited dataframe.
  • Please remove the LLM-inserted "fixes" and issue number from the comment in main (L1023), this is unnecessary.
  • Is this code correct in multi-GPU environments? I don't think my suggestion in the issue is correct in this case.

It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on which is more reliable in multi-GPU.

@vickysharma-prog

Thanks for the detailed feedback @jveitchmichaelis! Really appreciate the thorough review.
Quick acknowledgments:

  1. ✅ Will remove .gitignore changes
  2. ✅ Will remove the "fixes Evaluation reports spuriously low recall if limit_batches is set #1232" comment from code
  3. ✅ Will make the code less defensive

Regarding the architectural suggestion:
It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on which is more reliable in multi-GPU.
This makes sense - filtering at the metric level based on actual predicted image_paths would be more reliable than calculating based on limit_val_batches. Could you point me to where the RecallPrecision metric is defined so I can refactor the fix there?
For the test - I'll update it to use create_trainer() with limit_val_batches and call trainer.validate() instead of the deprecated __evaluate__ method.
Let me know if I'm understanding correctly!

@vickysharma-prog

Pushed the updates:

  • Removed .gitignore changes
  • Updated comment in main.py
  • Updated test to use create_trainer() + trainer.validate()

Note: the ReadTheDocs build seems to be failing on dependency install (uv sync --extra docs); this appears unrelated to my changes. Let me know if I need to do anything on my end.
Still working on understanding the RecallPrecision metric location for the architectural refactor you suggested.

@vickysharma-prog

Just pushed another commit - removed the defensive checks (hasattr and getattr) since trainer always exists.

Current changes:

  • ✅ Removed .gitignore changes
  • ✅ Removed issue reference from code comment
  • ✅ Made code less defensive
  • ✅ Updated test to use create_trainer() + trainer.validate()

Regarding the RecallPrecision metric refactor - I searched the codebase and found iou_metric and mAP_metric (from torchmetrics), but couldn't find a custom RecallPrecision metric, so I'm still working out where the architectural refactor you suggested should live.

(The ReadTheDocs failure seems to be a dependency issue unrelated to my changes)

@jveitchmichaelis commented Feb 5, 2026

Still working on understanding the RecallPrecision metric location for the architectural refactor you suggested.

Please update your main branch + rebase this one

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 3d832f8 to a4d9fe9 Compare February 5, 2026 14:08
@vickysharma-prog

Rebased on latest main; all checks passing now!
Let me know if there's anything else to address.

@vickysharma-prog

I found the RecallPrecision logic in metrics.py.
From what I can see, the filtering could live in the metric itself by restricting ground_df to the image_paths actually seen by the metric before calling __evaluate_wrapper__. I was thinking this would happen in compute(), but I wanted to double-check that this aligns with the intended flow (vs doing this earlier in update()).
Does this sound like the right place to apply the fix?

@jveitchmichaelis

  • Please review your test case. Hint: can you demonstrate this would fail before your fix?
  • You cannot handle this in update() as the underlying issue happens when __evaluate_wrapper__ is called on the full ground truth dataframe.

@vickysharma-prog

pre-commit.ci autofix

@vickysharma-prog

Moved the fix to RecallPrecision.compute() in metrics.py as suggested - now filtering ground_df to only include images that were actually predicted before calling __evaluate_wrapper__.
Removed the old fix from main.py and updated the test.
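The compute-time filtering can be sketched roughly like this. All names here are illustrative stand-ins; the real RecallPrecision metric in DeepForest's metrics.py differs in detail.

```python
# Sketch of filtering ground truth to the images the metric actually saw,
# so a limited validation run is not penalized for unprocessed images.
import pandas as pd


class RecallPrecisionSketch:
    """Accumulates image paths in update(); filters ground truth at compute time."""

    def __init__(self, ground_df: pd.DataFrame):
        self.ground_df = ground_df
        self.image_paths: list[str] = []

    def update(self, image_path: str) -> None:
        # Called once per evaluated image during validation.
        self.image_paths.append(image_path)

    def seen_ground_truth(self) -> pd.DataFrame:
        # Restrict ground truth to images this metric instance processed,
        # before handing it to the evaluation wrapper.
        return self.ground_df[self.ground_df["image_path"].isin(self.image_paths)]


ground_df = pd.DataFrame(
    {"image_path": ["img1.png", "img2.png", "img3.png", "img4.png"]}
)
metric = RecallPrecisionSketch(ground_df)
metric.update("img1.png")
metric.update("img2.png")
seen = metric.seen_ground_truth()  # only img1 and img2 remain
```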

@vickysharma-prog

Hi @jveitchmichaelis,
Just a gentle follow-up on this PR.
I’ve addressed all the requested changes and rebased on the latest main, and CI is passing.
Please let me know if there’s anything else I should update.
Thanks again for the review!

@jveitchmichaelis

@vickysharma-prog it'd be good to have a test case that would fail on the existing main branch and passes here to confirm your code works as intended (checking that recall is non-zero is not sufficient).

@vickysharma-prog

@jveitchmichaelis Updated the test to demonstrate the fix!
Test verification:

  • Main branch: FAILED (box_recall=0.50, expected > 0.7)
  • This PR: PASSED (box_recall=1.00)

The test creates 4 images in ground truth but makes predictions for only 2 (simulating limit_val_batches).

  • Before fix: ground_df has 4 images → recall = 2/4 = 0.50 → test fails
  • After fix: ground_df filtered to 2 images → recall = 2/2 = 1.00 → test passes

Ready for review!
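The before/after numbers reduce to simple arithmetic: 4 ground-truth images, perfect predictions for the 2 that were processed.

```python
# Recall arithmetic for the limited-validation scenario above.
matched_images = 2       # images with correct predictions
ground_truth_images = 4  # images in the full ground truth CSV

# Before the fix, unprocessed images count as misses.
recall_before_fix = matched_images / ground_truth_images  # 0.5

# After the fix, ground truth is filtered to the processed images first.
recall_after_fix = matched_images / matched_images  # 1.0
```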

@jveitchmichaelis left a comment

Thanks for the update. Some further changes below, and please squash your PR to a single commit.

Inline review comment (Collaborator):

Remove edits to gitignore in this PR

m.trainer.fit(m)
def test_recall_not_lowered_by_unprocessed_images(tmp_path):
"""Regression test for #1232."""
import pandas as pd
Inline review comment (Collaborator):

No inline imports please

assert version_dir.join("hparams.yaml").exists(), "hparams.yaml not found"
# Without fix: recall = 0.5 (2/4 images)
# With fix: recall = 1.0 (2/2 filtered images)
assert results['box_recall'] > 0.7, (
@jveitchmichaelis commented Feb 16, 2026:

This seems like a decent approach. I would suggest one more assertion that the image_paths attribute in the metric has len = 2.

I also think we can assert recall 1 here (use isclose for a safe comparison with float)? We know the analytical value.

However I would also suggest making the bounding boxes for the unused images different, as an additional check that we're not comparing the wrong boxes for some reason (e.g. set img3 and img4 to have different pixel values). Unlikely, but picking the same value can sometimes hide weird bugs like this.
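The suggested assertions might look like the following sketch; results and metric_image_paths are hypothetical stand-ins for the real evaluation output and the metric's accumulated state.

```python
# Hedged sketch of the stronger test assertions suggested above.
import math

results = {"box_recall": 1.0}                  # stand-in for evaluation output
metric_image_paths = ["img1.png", "img2.png"]  # stand-in for the metric's image_paths

# Exactly the two predicted images should have reached the metric.
assert len(metric_image_paths) == 2

# Recall has a known analytical value here; use isclose for float safety
# rather than a loose threshold like > 0.7.
assert math.isclose(results["box_recall"], 1.0)
```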

m.create_trainer(limit_train_batches=1, limit_val_batches=1, max_epochs=1)
m.trainer.fit(m)
def test_recall_not_lowered_by_unprocessed_images(tmp_path):
"""Regression test for #1232."""
Inline review comment (Collaborator):

Remove references to the issue number throughout.

"This test checks that recall is only computed for images that were passed to the metric and ignores unprocessed images in the ground truth dataframe."

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 551db9b to 0d5d846 Compare February 17, 2026 11:01
@vickysharma-prog

@jveitchmichaelis Thanks for the detailed guidance throughout this PR!

Done! Changes made:

  • Removed .gitignore edits
  • Moved imports to top of file
  • Different bounding boxes for img3/img4 to catch edge cases
  • Added len(metric.image_indices) == 2 assertion
  • Using math.isclose() for recall = 1.0
  • Updated docstring per your suggestion
  • Squashed to single commit

Ready for review whenever you get a chance.

@codecov

codecov bot commented Feb 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.87%. Comparing base (884502e) to head (408e150).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1298      +/-   ##
==========================================
- Coverage   87.35%   86.87%   -0.48%     
==========================================
  Files          24       24              
  Lines        2981     3064      +83     
==========================================
+ Hits         2604     2662      +58     
- Misses        377      402      +25     
Flag: unittests | Coverage: 86.87% <ø> (-0.48%) ⬇️


@vickysharma-prog

@jveitchmichaelis Just a gentle ping - all changes addressed and checks passing. Let me know if anything else needs updating!

@vickysharma-prog

pre-commit.ci autofix

@vickysharma-prog

@jveitchmichaelis All changes addressed, conflicts resolved, and CI passing now. Ready for re-review when you get a chance!

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch 2 times, most recently from f8514e8 to 6cd356d Compare March 3, 2026 11:05
@vickysharma-prog

@jveitchmichaelis All tests are passing. I’ve rebased and squashed the branch into a single commit with all prior feedback addressed.
Summary of changes:

  • metrics.py: +4 lines — filter ground_df to predicted images
  • test_main.py: +43 lines — added regression test

Please let me know if any further adjustments are needed.

@jveitchmichaelis commented Mar 3, 2026

Thanks @vickysharma-prog. We are currently in the process of refactoring the update method in the metric which may supersede this PR. Ideally the metric shouldn't need to know about paths at all. Will update on this shortly. However the test case here is probably still useful, as it's a valid edge case.

@vickysharma-prog commented Mar 4, 2026

Hi @jveitchmichaelis,
I see you've opened #1343 with the refactored metric — looks great!
Since you mentioned the test case here might still be useful, should I extract it into a separate PR once #1343 is merged? Happy to adapt it to the new implementation.
Let me know!

@jveitchmichaelis commented Mar 9, 2026

Can you remove the changes to the metric here, leave the test. Then after we merge #1343, we can merge this to add the coverage for limit_val_batches.

Though maybe wait because you'll have to rebase and change the metric signature.

@vickysharma-prog

Got it! @jveitchmichaelis I'll wait for #1343 to merge, then rebase and update accordingly. Thanks for the heads up!

@jveitchmichaelis

@vickysharma-prog please could you rebase against main and drop changes to metrics.py. The test suite should still pass I think, as the metric is only accumulated for samples that it sees.

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 6cd356d to 408e150 Compare March 20, 2026 15:34
@vickysharma-prog

@jveitchmichaelis Rebased and dropped metrics.py changes.
Had to tweak the test slightly since #1343 changed the API: RecallPrecision no longer takes csv_file, and update() now needs a targets param. Test is updated accordingly and passing!

@vickysharma-prog

@jveitchmichaelis Rebased and dropped metrics.py changes as requested. Let me know if any further changes are needed!


Successfully merging this pull request may close these issues.

Evaluation reports spuriously low recall if limit_batches is set
