fix: filter same-ID and same-playlist duplicates from find-duplicate-names #36

Merged
jmlrt merged 3 commits into main from fix/duplicate-names-filtering on Mar 21, 2026

@jmlrt jmlrt commented Mar 16, 2026

Summary

Filter out non-actionable duplicates from find-duplicate-names:

  • Tracks with the same Spotify ID
  • Tracks appearing in the exact same set of playlists

These pairs cannot be meaningfully resolved since there's no "keep one, remove other" decision possible.

Changes

Code Changes (spotfm/spotify/dupes.py)

  • Pass 1: Added playlist-identity filter (lines 428-432)
    • Skips pairs where both tracks have identical playlist sets
  • Pass 2: Added playlist-identity filter (lines 541-545)
    • Applies same logic to cross-prefix comparisons
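
The filter described in the list above can be sketched roughly as follows. This is not the code from spotfm/spotify/dupes.py: the helper name `is_actionable_pair`, the `id` key, and the assumption that `playlists` holds a plain list of playlist IDs are all hypothetical and chosen for illustration only.

```python
def is_actionable_pair(track_a, track_b):
    """Return True when a pair still leaves a keep/remove decision to make."""
    # Same Spotify ID: both entries are the same track.
    if track_a["id"] == track_b["id"]:
        return False
    # Identical playlist membership (order-insensitive): nothing to choose between.
    if frozenset(track_a["playlists"]) == frozenset(track_b["playlists"]):
        return False
    return True


# Hypothetical candidates; the IDs and playlist names are made up.
original = {"id": "a1", "playlists": ["rock", "60s"]}
same_id = {"id": "a1", "playlists": ["rock"]}
same_playlists = {"id": "b2", "playlists": ["60s", "rock"]}
other = {"id": "c3", "playlists": ["covers"]}

print(is_actionable_pair(original, same_id))         # False
print(is_actionable_pair(original, same_playlists))  # False
print(is_actionable_pair(original, other))           # True
```

Comparing frozensets rather than lists makes the check independent of playlist order, which is what "exact same set of playlists" implies.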

Test Coverage (tests/test_dupes.py)

  • test_same_id_tracks_excluded(): Verifies same-ID pairs are filtered
    • Mocks candidates with identical track ID
    • Confirms output is empty
  • test_same_playlist_tracks_excluded(): Verifies same-playlist pairs are filtered
    • Mocks candidates in identical playlists
    • Confirms output is empty
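
The shape of these two tests might look something like the sketch below. It is self-contained, so it uses a stand-in `dupes` namespace and a simplified `find_duplicate_names` rather than the real module. The candidate keys `id` and `playlist_id_set`, and the body of the stand-in function, are assumptions made for the example.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for the real module; only the mocking pattern matters here.
dupes = SimpleNamespace()


def _get_fuzzy_match_candidates():
    raise RuntimeError("the real function would query the database")


def _find_duplicate_names():
    tracks = dupes.get_fuzzy_match_candidates()
    pairs = []
    for i, a in enumerate(tracks):
        for b in tracks[i + 1:]:
            if a["id"] == b["id"]:
                continue  # same-ID pair
            if a["playlist_id_set"] == b["playlist_id_set"]:
                continue  # same-playlist pair
            pairs.append((a["id"], b["id"]))
    return pairs


dupes.get_fuzzy_match_candidates = _get_fuzzy_match_candidates
dupes.find_duplicate_names = _find_duplicate_names


def test_same_id_tracks_excluded():
    mocked = [
        {"id": "x", "playlist_id_set": frozenset({"p1"})},
        {"id": "x", "playlist_id_set": frozenset({"p2"})},
    ]
    with patch.object(dupes, "get_fuzzy_match_candidates", return_value=mocked):
        assert dupes.find_duplicate_names() == []


def test_same_playlist_tracks_excluded():
    mocked = [
        {"id": "x", "playlist_id_set": frozenset({"p1", "p2"})},
        {"id": "y", "playlist_id_set": frozenset({"p2", "p1"})},
    ]
    with patch.object(dupes, "get_fuzzy_match_candidates", return_value=mocked):
        assert dupes.find_duplicate_names() == []
```

Patching the candidate source keeps the tests away from the database and lets each case feed exactly the pair it wants to see filtered.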

Documentation (TODO.md)

  • ✅ Removed completed HIGH PRIORITY task: "Fix find-duplicate-names to Filter Same-ID Dupes"
  • Enhanced future roadmap with duration-based categorization and parameterized SQL migration

Testing

  • ✅ All 321 tests pass
  • ✅ Linting passed (ruff)
  • ✅ New test cases added and passing
  • ✅ Code coverage maintained at 79.46%

Related Issues

Implements and closes the HIGH PRIORITY task from TODO.md: "Fix find-duplicate-names to Filter Same-ID Dupes"

jmlrt and others added 2 commits March 16, 2026 11:12
…names

Exclude non-actionable duplicate pairs where:
- Both tracks have the same Spotify ID, or
- Both tracks appear in the exact same set of playlists

These pairs cannot be meaningfully resolved (no "keep one, remove other"
decision possible), and filtering them out makes the output actionable.

Changes:
- Add playlist-identity check in Pass 1 and Pass 2 of find_duplicate_names()
- Add test_same_id_tracks_excluded() to verify same-ID filtering
- Add test_same_playlist_tracks_excluded() to verify same-playlist filtering
- Update TODO.md to mark task as completed with enhanced future roadmap

All 321 tests pass; linting checks pass.

Co-Authored-By: Claude Code <noreply@anthropic.com>

Remove HIGH PRIORITY task 'Fix find-duplicate-names to Filter Same-ID Dupes'
as it has been implemented in PR #36.

Co-Authored-By: Claude Code <noreply@anthropic.com>
@jmlrt jmlrt marked this pull request as ready for review March 21, 2026 08:34
Copilot AI review requested due to automatic review settings March 21, 2026 08:34

Copilot AI left a comment

Pull request overview

This PR updates find-duplicate-names to drop duplicate pairs that are not actionable: pairs with the same Spotify track ID and pairs where both tracks appear in the exact same set of playlists. It also adds regression tests to ensure these cases are excluded.

Changes:

  • Add a “same playlist-set” filter in both pass 1 (same-prefix) and pass 2 (cross-prefix shared-artist) comparisons.
  • Add unit tests covering same-ID exclusion and same-playlist-set exclusion.
  • Update TODO.md roadmap by removing a completed item and adding future enhancement tasks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| spotfm/spotify/dupes.py | Adds filtering to exclude pairs that share identical playlist membership in both matching passes. |
| tests/test_dupes.py | Adds tests asserting same-ID and same-playlist-set pairs are not returned by find_duplicate_names. |
| TODO.md | Updates planning notes (remove completed task; add duration/categorization and SQL parameterization roadmap items). |

Address Copilot review feedback:
- Precompute playlist_id_set for each candidate in get_fuzzy_match_candidates()
- Use cached sets instead of computing frozenset() on-the-fly in Pass 1 & 2
- Move pair_key check before expensive playlist comparisons (Pass 1)
- Eliminates O(n) set construction per pair comparison

Improvements:
- Reduces hot-path overhead in match-processing loops
- Avoids repeated frozenset() calls for identical playlist comparisons
- All 321 tests pass; linting passes

Co-Authored-By: Claude Code <noreply@anthropic.com>
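
The optimization in this commit message can be illustrated with a minimal sketch. It assumes candidates are dicts with hypothetical `id` and `playlists` keys and that `pair_key` is a sorted tuple of the two IDs. The function names `add_playlist_id_sets` and `compare_all` are invented for the example and do not come from the repository.

```python
def add_playlist_id_sets(candidates):
    """Build each candidate's frozenset of playlist IDs once, up front."""
    for cand in candidates:
        cand["playlist_id_set"] = frozenset(cand["playlists"])
    return candidates


def compare_all(candidates):
    """Return pair keys that survive the same-ID and same-playlist filters."""
    seen = set()
    kept = []
    for a in candidates:
        for b in candidates:
            if a["id"] == b["id"]:
                continue
            pair_key = tuple(sorted((a["id"], b["id"])))
            # Cheap membership test first: already-seen pairs skip the set comparison.
            if pair_key in seen:
                continue
            seen.add(pair_key)
            # Compare the cached sets instead of rebuilding frozensets per pair.
            if a["playlist_id_set"] == b["playlist_id_set"]:
                continue
            kept.append(pair_key)
    return kept


tracks = add_playlist_id_sets([
    {"id": "a", "playlists": ["p1", "p2"]},
    {"id": "b", "playlists": ["p2", "p1"]},
    {"id": "c", "playlists": ["p3"]},
])
print(compare_all(tracks))  # [('a', 'c'), ('b', 'c')]
```

With the sets cached on the candidates, the per-pair cost in the inner loop drops to a set equality check, and the construction cost is paid once per candidate rather than once per comparison.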

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Comment on lines 205 to 219
@@ -213,6 +215,7 @@ def get_fuzzy_match_candidates(excluded_playlist_ids=None, min_name_length=3):
 "full_name": full_name,
 "name_prefix": name_prefix,
 "name_length": name_length,
+"playlist_id_set": playlist_id_set,
 }
Copilot AI Mar 21, 2026

get_fuzzy_match_candidates() now returns an additional playlist_id_set field, but the function docstring/return description still only mentions prefix/length metadata. Please update the docstring (and any documented return shape) to include playlist_id_set so callers/tests know it’s part of the candidate schema.
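
One way to document the candidate shape is a typed dictionary, sketched below. Only the field names visible in this review (`playlists`, `full_name`, `name_prefix`, `name_length`, `playlist_id_set`) are taken from the PR. The field types, the `Candidate` and `make_candidate` names, and the way `name_prefix` is derived are assumptions, and any other fields the real function returns are omitted.

```python
from typing import FrozenSet, List, TypedDict


class Candidate(TypedDict):
    """Assumed shape of one candidate; the field types are guesses."""

    playlists: List[str]
    full_name: str
    name_prefix: str
    name_length: int
    playlist_id_set: FrozenSet[str]


def make_candidate(artist: str, name: str, playlists: List[str]) -> Candidate:
    """Build a candidate the way a test fixture might (hypothetical helper)."""
    return {
        "playlists": playlists,
        "full_name": f"{artist} - {name}",
        # Assumption: the prefix is the first three characters, lowercased.
        "name_prefix": name[:3].lower(),
        "name_length": len(name),
        "playlist_id_set": frozenset(playlists),
    }


example = make_candidate("The Beatles", "Come Together", ["p1", "p2"])
print(example["name_prefix"], example["name_length"])  # com 13
```

Deriving `name_length` and `playlist_id_set` inside a single helper keeps mocked candidates consistent with each other when the schema grows.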

"playlists": shared_playlists,
"full_name": "The Beatles - Come Together",
"name_prefix": "com",
"name_length": 14,
Copilot AI Mar 21, 2026

The mocked candidate for "Come Together" sets name_length to 14, but in production this field comes from LENGTH(t.name) (i.e., len(name), which is 13 for "Come Together"). To keep the mock aligned with the real candidate schema and avoid brittleness in future refactors, compute this with len(...) (or set the correct literal).

Suggested change
-"name_length": 14,
+"name_length": len("Come Together"),

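
The length in question can be checked directly. The sketch below compares Python's `len()` with SQLite's `LENGTH()` on the same string using an in-memory database; it does not touch the project's schema, and the variable names are illustrative.

```python
import sqlite3

name = "Come Together"

# Python's view of the length.
python_length = len(name)

# SQLite's LENGTH() on the same text, via a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
sql_length = conn.execute("SELECT LENGTH(?)", (name,)).fetchone()[0]
conn.close()

print(python_length, sql_length)  # prints "13 13"
```

Both functions count characters for text values, so a mock that uses `len(name)` stays in step with what the query would return.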
@jmlrt jmlrt merged commit aef4c31 into main Mar 21, 2026
8 checks passed
@jmlrt jmlrt deleted the fix/duplicate-names-filtering branch March 21, 2026 09:19