
Fix low recall when limit_val_batches is set #1298

Open
vickysharma-prog wants to merge 1 commit into weecology:main from vickysharma-prog:fix-limit-batches-recall-1232

Conversation

@vickysharma-prog (Contributor)

Description

When limit_val_batches is set (e.g., 0.1 for 10%), evaluation loads the full ground truth CSV but predictions only cover the limited images. This makes recall look very low because of "missing" predictions for images that were never processed.

Added a check in __evaluate__ that trims ground_df based on the limit_val_batches value, keeping ceil(limit_val_batches * n_images) images as suggested in the issue.

Also added a test case to verify the fix.
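The trimming described above can be sketched roughly as follows. This is a minimal illustration, not the PR diff: it assumes ground truth lives in a DataFrame with an image_path column (as in DeepForest annotation CSVs) and that the limited run sees the first ceil(limit_val_batches * n_images) images in sorted order.

```python
# Illustrative sketch of the ground-truth trimming; names are assumptions,
# not the actual DeepForest implementation.
import math

import pandas as pd


def trim_ground_df(ground_df: pd.DataFrame, limit_val_batches: float) -> pd.DataFrame:
    """Keep only ground truth rows for images the limited run will see."""
    images = sorted(ground_df["image_path"].unique())
    n_kept = math.ceil(limit_val_batches * len(images))
    kept = set(images[:n_kept])
    return ground_df[ground_df["image_path"].isin(kept)]


ground_df = pd.DataFrame(
    {
        "image_path": ["a.png", "a.png", "b.png", "c.png", "d.png"],
        "label": ["Tree"] * 5,
    }
)
# ceil(0.5 * 4 images) = 2 images kept ("a.png" and "b.png"), 3 rows total.
trimmed = trim_ground_df(ground_df, 0.5)
```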

Related Issue(s)

Fixes #1232

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Used for initial research and understanding the codebase structure

@jveitchmichaelis (Collaborator) left a comment

Thanks for your contribution, here are some comments:

  • Please scope the PR to only the issue (remove .gitignore changes, we could include that in another submission).
  • Please test with non-deprecated eval calls. Use m.create_trainer() with limit batches set as an argument and then call m.trainer.validate(m) or m.trainer.fit(m). You may need to set the validation interval to 1. This would better reflect a training scenario.
  • As above, this code should work with .validate() - __evaluate__ is not called during training.
  • The code for main is a little defensive. The trainer is created on init so it's almost impossible to run this function without self.trainer existing.
  • The test case does not adequately check the behavior. For example you have asserted non-negative recall, but this does not prove that recall accurately reflects the limited dataframe.
  • Please remove the LLM-inserted "fixes" and issue number from the comment in main (L1023), this is unnecessary.
  • Is this code correct in multi-GPU environments? I don't think my suggestion in the issue is correct in this case.

It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on which is more reliable in multi-GPU.

@vickysharma-prog

Thanks for the detailed feedback @jveitchmichaelis! Really appreciate the thorough review.
Quick acknowledgments:

  1. ✅ Will remove .gitignore changes
  2. ✅ Will remove the "fixes Evaluation reports spuriously low recall if limit_batches is set #1232" comment from code
  3. ✅ Will make the code less defensive

Regarding the architectural suggestion:
It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on which is more reliable in multi-GPU.
This makes sense - filtering at the metric level based on actual predicted image_paths would be more reliable than calculating based on limit_val_batches. Could you point me to where the RecallPrecision metric is defined so I can refactor the fix there?
For the test - I'll update it to use create_trainer() with limit_val_batches and call trainer.validate() instead of the deprecated __evaluate__ method.
Let me know if I'm understanding correctly!

@vickysharma-prog

Pushed the updates:

  • Removed .gitignore changes
  • Updated comment in main.py
  • Updated test to use create_trainer() + trainer.validate()

Note: the ReadTheDocs build seems to be failing on dependency install (uv sync --extra docs); this appears unrelated to my changes. Let me know if I need to do anything on my end.
Still working on understanding the RecallPrecision metric location for the architectural refactor you suggested.

@vickysharma-prog

Just pushed another commit - removed the defensive checks (hasattr and getattr) since trainer always exists.

Current changes:

  • ✅ Removed .gitignore changes
  • ✅ Removed issue reference from code comment
  • ✅ Made code less defensive
  • ✅ Updated test to use create_trainer() + trainer.validate()

Regarding the RecallPrecision metric refactor - I searched the codebase and found iou_metric and mAP_metric (from torchmetrics), but couldn't find a custom RecallPrecision metric, so I'm still working out where the architectural refactor you suggested should live.

(The ReadTheDocs failure seems to be a dependency issue unrelated to my changes)

@jveitchmichaelis commented Feb 5, 2026

Still working on understanding the RecallPrecision metric location for the architectural refactor you suggested.

Please update your main branch + rebase this one

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 3d832f8 to a4d9fe9 Compare February 5, 2026 14:08
@vickysharma-prog

Rebased on latest main; all checks passing now!
Let me know if there's anything else to address.

@vickysharma-prog

I found the RecallPrecision logic in metrics.py.
From what I can see, the filtering could live in the metric itself by restricting ground_df to the image_paths actually seen by the metric before calling __evaluate_wrapper__. I was thinking this would happen in compute(), but I wanted to double-check that this aligns with the intended flow (vs doing this earlier in update()).
Does this sound like the right place to apply the fix?

@jveitchmichaelis

  • Please review your test case. Hint: can you demonstrate this would fail before your fix?
  • You cannot handle this in update() as the underlying issue happens when __evaluate_wrapper__ is called on the full ground truth dataframe.

@vickysharma-prog

pre-commit.ci autofix

@vickysharma-prog

Moved the fix to RecallPrecision.compute() in metrics.py as suggested - now filtering ground_df to only include images that were actually predicted before calling __evaluate_wrapper__.
Removed the old fix from main.py and updated the test.
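The compute-time filtering can be sketched roughly like this. All names here are illustrative stand-ins; the real RecallPrecision metric in DeepForest's metrics.py differs in detail.

```python
# Sketch of filtering ground truth to the images the metric actually saw,
# so a limited validation run is not penalized for unprocessed images.
import pandas as pd


class RecallPrecisionSketch:
    """Accumulates image paths in update(); filters ground truth at compute time."""

    def __init__(self, ground_df: pd.DataFrame):
        self.ground_df = ground_df
        self.image_paths: list[str] = []

    def update(self, image_path: str) -> None:
        # Called once per evaluated image during validation.
        self.image_paths.append(image_path)

    def seen_ground_truth(self) -> pd.DataFrame:
        # Restrict ground truth to images this metric instance processed,
        # before handing it to the evaluation wrapper.
        return self.ground_df[self.ground_df["image_path"].isin(self.image_paths)]


ground_df = pd.DataFrame(
    {"image_path": ["img1.png", "img2.png", "img3.png", "img4.png"]}
)
metric = RecallPrecisionSketch(ground_df)
metric.update("img1.png")
metric.update("img2.png")
seen = metric.seen_ground_truth()  # only img1 and img2 remain
```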

@vickysharma-prog

Hi @jveitchmichaelis,
Just a gentle follow-up on this PR.
I’ve addressed all the requested changes and rebased on the latest main, and CI is passing.
Please let me know if there’s anything else I should update.
Thanks again for the review!

@jveitchmichaelis

@vickysharma-prog it'd be good to have a test case that would fail on the existing main branch and passes here to confirm your code works as intended (checking that recall is non-zero is not sufficient).

@vickysharma-prog

@jveitchmichaelis Updated the test to demonstrate the fix!
Test verification:

  • Main branch: FAILED (box_recall=0.50, expected > 0.7)
  • This PR: PASSED (box_recall=1.00)

The test creates 4 images in ground truth but makes predictions for only 2 (simulating limit_val_batches).

  • Before fix: ground_df has 4 images → recall = 2/4 = 0.50 → test fails
  • After fix: ground_df filtered to 2 images → recall = 2/2 = 1.00 → test passes

Ready for review!
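The before/after numbers reduce to simple arithmetic: 4 ground-truth images, perfect predictions for the 2 that were processed.

```python
# Recall arithmetic for the limited-validation scenario above.
matched_images = 2       # images with correct predictions
ground_truth_images = 4  # images in the full ground truth CSV

# Before the fix, unprocessed images count as misses.
recall_before_fix = matched_images / ground_truth_images  # 0.5

# After the fix, ground truth is filtered to the processed images first.
recall_after_fix = matched_images / matched_images  # 1.0
```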

@jveitchmichaelis left a comment

Thanks for the update. Some further changes below, and please squash your PR to a single commit.

Inline review comment (Collaborator):

Remove edits to gitignore in this PR

m.trainer.fit(m)
def test_recall_not_lowered_by_unprocessed_images(tmp_path):
"""Regression test for #1232."""
import pandas as pd
Inline review comment (Collaborator):

No inline imports please

assert version_dir.join("hparams.yaml").exists(), "hparams.yaml not found"
# Without fix: recall = 0.5 (2/4 images)
# With fix: recall = 1.0 (2/2 filtered images)
assert results['box_recall'] > 0.7, (
@jveitchmichaelis commented Feb 16, 2026:

This seems like a decent approach. I would suggest one more assertion that the image_paths attribute in the metric has len = 2.

I also think we can assert recall 1 here (use isclose for a safe comparison with float)? We know the analytical value.

However I would also suggest making the bounding boxes for the unused images different, as an additional check that we're not comparing the wrong boxes for some reason (e.g. set img3 and img4 to have different pixel values). Unlikely, but picking the same value can sometimes hide weird bugs like this.
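The suggested assertions might look like the following sketch; results and metric_image_paths are hypothetical stand-ins for the real evaluation output and the metric's accumulated state.

```python
# Hedged sketch of the stronger test assertions suggested above.
import math

results = {"box_recall": 1.0}                  # stand-in for evaluation output
metric_image_paths = ["img1.png", "img2.png"]  # stand-in for the metric's image_paths

# Exactly the two predicted images should have reached the metric.
assert len(metric_image_paths) == 2

# Recall has a known analytical value here; use isclose for float safety
# rather than a loose threshold like > 0.7.
assert math.isclose(results["box_recall"], 1.0)
```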

m.create_trainer(limit_train_batches=1, limit_val_batches=1, max_epochs=1)
m.trainer.fit(m)
def test_recall_not_lowered_by_unprocessed_images(tmp_path):
"""Regression test for #1232."""
Inline review comment (Collaborator):

Remove references to the issue number throughout.

"This test checks that recall is only computed for images that were passed to the metric and ignores unprocessed images in the ground truth dataframe."

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 551db9b to 0d5d846 Compare February 17, 2026 11:01
@vickysharma-prog

@jveitchmichaelis Thanks for the detailed guidance throughout this PR!

Done! Changes made:

  • Removed .gitignore edits
  • Moved imports to top of file
  • Different bounding boxes for img3/img4 to catch edge cases
  • Added len(metric.image_indices) == 2 assertion
  • Using math.isclose() for recall = 1.0
  • Updated docstring per your suggestion
  • Squashed to single commit

Ready for review whenever you get a chance.

@codecov

codecov bot commented Feb 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.87%. Comparing base (884502e) to head (408e150).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1298      +/-   ##
==========================================
- Coverage   87.35%   86.87%   -0.48%     
==========================================
  Files          24       24              
  Lines        2981     3064      +83     
==========================================
+ Hits         2604     2662      +58     
- Misses        377      402      +25     
Flag: unittests | Coverage: 86.87% <ø> (-0.48%) ⬇️


@vickysharma-prog

@jveitchmichaelis Just a gentle ping - all changes addressed and checks passing. Let me know if anything else needs updating!

@vickysharma-prog

pre-commit.ci autofix

@vickysharma-prog

@jveitchmichaelis All changes addressed, conflicts resolved, and CI passing now. Ready for re-review when you get a chance!

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch 2 times, most recently from f8514e8 to 6cd356d Compare March 3, 2026 11:05
@vickysharma-prog

@jveitchmichaelis All tests are passing. I’ve rebased and squashed the branch into a single commit with all prior feedback addressed.
Summary of changes:

  • metrics.py: +4 lines — filter ground_df to predicted images
  • test_main.py: +43 lines — added regression test

Please let me know if any further adjustments are needed.

@jveitchmichaelis commented Mar 3, 2026

Thanks @vickysharma-prog. We are currently in the process of refactoring the update method in the metric which may supersede this PR. Ideally the metric shouldn't need to know about paths at all. Will update on this shortly. However the test case here is probably still useful, as it's a valid edge case.

@vickysharma-prog commented Mar 4, 2026

Hi @jveitchmichaelis,
I see you've opened #1343 with the refactored metric — looks great!
Since you mentioned the test case here might still be useful, should I extract it into a separate PR once #1343 is merged? Happy to adapt it to the new implementation.
Let me know!

@jveitchmichaelis commented Mar 9, 2026

Can you remove the changes to the metric here, leave the test. Then after we merge #1343, we can merge this to add the coverage for limit_val_batches.

Though maybe wait because you'll have to rebase and change the metric signature.

@vickysharma-prog

Got it! @jveitchmichaelis I'll wait for #1343 to merge, then rebase and update accordingly. Thanks for the heads up!

@jveitchmichaelis

@vickysharma-prog please could you rebase against main and drop changes to metrics.py. The test suite should still pass I think, as the metric is only accumulated for samples that it sees.

@vickysharma-prog vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 6cd356d to 408e150 Compare March 20, 2026 15:34
@vickysharma-prog

@jveitchmichaelis Rebased and dropped metrics.py changes.
Had to tweak the test slightly since #1343 changed the API: RecallPrecision no longer takes csv_file, and update() now needs a targets param. Test is updated accordingly and passing!

@vickysharma-prog

@jveitchmichaelis Rebased and dropped metrics.py changes as requested. Let me know if any further changes are needed!


Successfully merging this pull request may close these issues.

Evaluation reports spuriously low recall if limit_batches is set
