Fix low recall when limit_val_batches is set #1298
vickysharma-prog wants to merge 1 commit into weecology:main
Conversation
Thanks for your contribution, here are some comments:
- Please scope the PR to only the issue (remove the `.gitignore` changes; we could include those in another submission).
- Please test with non-deprecated eval calls. Use `m.create_trainer()` with limit batches set as an argument and then call `m.trainer.validate(m)` or `m.trainer.fit(m)`. You may need to set the validation interval to 1. This would better reflect a training scenario.
- As above, this code should work with `.validate()`; `__evaluate__` is not called during training.
- The code for `main` is a little defensive. The trainer is created on `init`, so it's almost impossible to run this function without `self.trainer` existing.
- The test case does not adequately check the behavior. For example, you have asserted non-negative recall, but this does not prove that recall accurately reflects the limited dataframe.
- Please remove the LLM-inserted "fixes" and issue number from the comment in main (L1023); this is unnecessary.
- Is this code correct in multi-GPU environments? I don't think my suggestion in the issue is correct in this case. It's probably best for this logic to go in the RecallPrecision metric. You can filter `ground_df` by the `image_paths` that the metric was called on, which is more reliable in multi-GPU.
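A minimal sketch of that filtering approach, assuming the ground truth uses an `image_path` column; the helper name and the way `image_paths` is collected are hypothetical, not the actual metric code:

```python
import pandas as pd

def filter_ground_truth(ground_df: pd.DataFrame, image_paths) -> pd.DataFrame:
    """Keep only ground-truth rows for images the metric actually saw.

    image_paths would be accumulated inside the metric's update step, so
    each rank filters by its own processed images. This is why filtering
    here is more reliable in multi-GPU runs than counting batches.
    """
    return ground_df[ground_df["image_path"].isin(set(image_paths))]

# Four images in the ground truth, but limit_val_batches only processed two.
ground_df = pd.DataFrame({
    "image_path": ["img1.png", "img1.png", "img2.png", "img3.png", "img4.png"],
    "label": ["Tree"] * 5,
})
seen = ["img1.png", "img2.png"]
filtered = filter_ground_truth(ground_df, seen)
print(len(filtered))  # 3 rows remain; img3/img4 no longer count as misses
```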
Thanks for the detailed feedback @jveitchmichaelis! Really appreciate the thorough review.
Regarding the architectural suggestion:
Pushed the updates:

Just pushed another commit: removed the defensive checks. Current changes:
Regarding the RecallPrecision metric refactor: I searched the codebase and found the relevant logic. (The ReadTheDocs failure seems to be a dependency issue unrelated to my changes.)
Please update your branch.
Force-pushed 3d832f8 to a4d9fe9
Rebased on latest main; all checks passing now!
I found the RecallPrecision logic in the codebase.
pre-commit.ci autofix
Moved the fix to the metric.
Hi @jveitchmichaelis,
@vickysharma-prog it'd be good to have a test case that fails on the existing main branch and passes here, to confirm your code works as intended (checking that recall is non-zero is not sufficient).
@jveitchmichaelis Updated the test to demonstrate the fix!
jveitchmichaelis left a comment
Thanks for the update. Some further changes below; please also squash your PR into a single commit.
Remove the edits to `.gitignore` in this PR.
tests/test_main.py (Outdated)

    m.trainer.fit(m)

    def test_recall_not_lowered_by_unprocessed_images(tmp_path):
        """Regression test for #1232."""
        import pandas as pd
No inline imports, please.
tests/test_main.py (Outdated)

    assert version_dir.join("hparams.yaml").exists(), "hparams.yaml not found"
    # Without fix: recall = 0.5 (2/4 images)
    # With fix: recall = 1.0 (2/2 filtered images)
    assert results['box_recall'] > 0.7, (
This seems like a decent approach. I would suggest one more assertion that the `image_paths` attribute in the metric has len == 2.
I also think we can assert recall == 1 here (use `isclose` for a safe float comparison), since we know the analytical value.
However, I would also suggest making the bounding boxes for the unused images different, as an additional check that we're not comparing the wrong boxes for some reason (e.g. give img3 and img4 different pixel values). Unlikely, but picking the same value can sometimes hide weird bugs like this.
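The suggested assertions might look like this; `results` and `metric_image_paths` are hypothetical stand-ins for whatever `m.trainer.validate(m)` and the model's RecallPrecision metric actually expose in this codebase:

```python
import math

# Hypothetical stand-ins for the real evaluation outputs.
results = {"box_recall": 1.0}
metric_image_paths = ["img1.png", "img2.png"]  # images the metric was updated with

# We know the analytical value, so assert recall == 1.0, using isclose
# for a safe float comparison rather than exact equality.
assert math.isclose(results["box_recall"], 1.0, abs_tol=1e-6)

# Only the two images inside the limited batches should have reached the metric.
assert len(metric_image_paths) == 2
```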
tests/test_main.py (Outdated)

    m.create_trainer(limit_train_batches=1, limit_val_batches=1, max_epochs=1)
    m.trainer.fit(m)

    def test_recall_not_lowered_by_unprocessed_images(tmp_path):
        """Regression test for #1232."""
Remove references to the issue number throughout.
"This test checks that recall is only computed for images that were passed to the metric and ignores unprocessed images in the ground truth dataframe."
Force-pushed 551db9b to 0d5d846
@jveitchmichaelis Thanks for the detailed guidance throughout this PR! Done. Changes made:
Codecov Report

✅ All modified and coverable lines are covered by tests.

    @@ Coverage Diff @@
    ##             main    #1298      +/-   ##
    ==========================================
    - Coverage   87.35%   86.87%   -0.48%
    ==========================================
      Files          24       24
      Lines        2981     3064      +83
    ==========================================
    + Hits         2604     2662      +58
    - Misses        377      402      +25

Flags with carried forward coverage won't be shown.
@jveitchmichaelis Just a gentle ping: all changes addressed and checks passing. Let me know if anything else needs updating!
pre-commit.ci autofix
@jveitchmichaelis All changes addressed, conflicts resolved, and CI passing now. Ready for re-review when you get a chance!
Force-pushed f8514e8 to 6cd356d
@jveitchmichaelis All tests are passing. I've rebased and squashed the branch into a single commit with all prior feedback addressed. Please let me know if any further adjustments are needed.
Thanks @vickysharma-prog. We are currently in the process of refactoring the metrics code.
Hi @jveitchmichaelis,
Can you remove the changes to the metric here and leave the test? Then, after we merge #1343, we can merge this to add coverage for limit_val_batches. Though maybe wait, because you'll have to rebase and change the metric signature.
Got it, @jveitchmichaelis! I'll wait for #1343 to merge, then rebase and update accordingly. Thanks for the heads up!
@vickysharma-prog please could you rebase against main?
Force-pushed 6cd356d to 408e150
@jveitchmichaelis Rebased and dropped the metrics.py changes as requested. Let me know if any further changes are needed!
Description

When `limit_val_batches` is set (e.g., 0.1 for 10%), evaluation loads the full ground truth CSV, but predictions only cover the limited subset of images. This makes recall look very low because of "missing" predictions for images that were never processed.

Added a check in `__evaluate__` that trims `ground_df` based on the `limit_val_batches` value, using `ceil(limit_val_batches * n_images)` as suggested in the issue. Also added a test case to verify the fix.
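The trimming described above can be sketched as follows. The helper name is hypothetical, and the int-limit branch assumes batch_size=1 (Lightning treats a float `limit_val_batches` as a fraction and an int as a batch count):

```python
import math

def limited_image_count(n_images: int, limit_val_batches) -> int:
    """How many images are actually validated under limit_val_batches.

    A float limit is a fraction of the dataset, rounded up with ceil as
    suggested in the issue; an int limit is an absolute batch count
    (treated as an image count here by assuming batch_size=1).
    """
    if isinstance(limit_val_batches, float):
        return math.ceil(limit_val_batches * n_images)
    return min(limit_val_batches, n_images)

print(limited_image_count(4, 0.5))    # 2 of 4 images
print(limited_image_count(10, 0.25))  # ceil(2.5) -> 3
```

Note that this distinction matters because `limit_val_batches=1.0` means the whole dataset, while `limit_val_batches=1` means a single batch.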
Related Issue(s)
Fixes #1232
AI-Assisted Development
AI tools used (if applicable):
Used for initial research and understanding the codebase structure