Podcast firehose and guest extraction pipeline by magent-cryptograss · Pull Request #38 · cryptograss/pickipedia

magent-cryptograss · 2026-01-31T12:10:17Z

Summary

13 bluegrass podcast RSS feeds verified and catalogued in podcast-feeds.json
Combined feed aggregator (podcast-firehose.py) merges all feeds into one RSS feed sorted by date
Guest name extraction (podcast-guest-patterns.json) uses per-podcast regex patterns to pull guest names from episode titles; it currently yields 636 guest extractions (510 unique names) across 9 podcasts
Episode page generator (podcast-episodes.py) produces {{PodcastEpisode}} wikitext with semantic [[guest::Name]] properties
55 tests (test_podcast_guests.py) covering every podcast format, skip patterns, false positive prevention, and edge cases
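
For reviewers who want a mental model of the aggregation step, here is a minimal sketch of merging several RSS documents into one feed ordered by publication date. It is not the code in podcast-firehose.py; the function names `merge_items` and `build_combined_feed`, the newest-first ordering, and the treatment of undated items are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime


def merge_items(feed_xml_docs):
    """Collect <item> elements from several RSS documents, newest first."""
    dated_items = []
    for xml_text in feed_xml_docs:
        channel = ET.fromstring(xml_text).find("channel")
        if channel is None:
            continue
        for item in channel.findall("item"):
            pub_date = item.findtext("pubDate")
            if not pub_date:
                continue  # assumption: undated items are dropped in this sketch
            when = parsedate_to_datetime(pub_date)
            if when.tzinfo is None:
                # Treat naive dates as UTC so every value is comparable.
                when = when.replace(tzinfo=timezone.utc)
            dated_items.append((when, item))
    dated_items.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in dated_items]


def build_combined_feed(feed_xml_docs, title="Combined feed"):
    """Wrap the merged items in a single RSS 2.0 channel and return it as text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    channel.extend(merge_items(feed_xml_docs))
    return ET.tostring(rss, encoding="unicode")
```

Sorting on parsed datetimes rather than on the raw pubDate strings matters because RFC 822 dates do not sort correctly as text.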

Podcasts covered

Podcast	Guests extracted	Status
Bluegrass Unlimited	100	✅ 100%
Picky Fingers Banjo	179	✅ 100%
What's The Reason	82	✅ 100%
Walls of Time	40	✅ 100%
Toy Heart	20	✅ 100%
Bluegrass Ambassadors	3	✅ 100%
Bluegrass Jam Along	128	🟡 93% (9 edge cases)
Bluegrass BKLYN	52	🟡 85%
Grass Talk Radio	32	🟡 Interview-only (most eps are monologues)
4 music/radio shows	—	No guest extraction (not interview formats)
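
To make the per-podcast approach concrete, here is a rough sketch of title-based extraction with skip patterns. The schema of podcast-guest-patterns.json is not shown in this PR, so the `PATTERNS` table, the show name "Example Interview Show", the regexes, and `extract_guests` are hypothetical stand-ins rather than the real patterns.

```python
import re

# Hypothetical pattern table, keyed by podcast name. The real
# podcast-guest-patterns.json may be organised quite differently.
PATTERNS = {
    "Example Interview Show": {
        # Matches titles such as "Ep. 12: Jane Doe Interview".
        "guest": [r"^(?:Ep\.?\s*\d+\s*[:\-]\s*)?(?P<guest>.+?)\s+Interview$"],
        # Titles matching any of these are treated as having no guest.
        "skip": [r"(?i)\bbest of\b", r"(?i)\btrailer\b"],
    },
}


def extract_guests(podcast, title, patterns=PATTERNS):
    """Return guest names found in an episode title, or [] when none apply."""
    rules = patterns.get(podcast)
    if rules is None:
        return []  # no patterns for this show, e.g. a music/radio format
    if any(re.search(skip, title) for skip in rules.get("skip", [])):
        return []
    for pattern in rules["guest"]:
        match = re.search(pattern, title)
        if match:
            # Split "A and B" or "A & B" into separate names.
            parts = re.split(r"\s+(?:and|&)\s+", match.group("guest"))
            return [name.strip() for name in parts if name.strip()]
    return []
```

Splitting on "and" is the kind of rule that can produce false positives (a duo billed under one name would be split in two), which is presumably why the test suite includes false positive prevention cases.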

Not yet included in this PR

Template:PodcastEpisode (already created on the wiki, rev 1405)
Actual episode page creation (needs review of naming convention and batch strategy)
Jenkins job for periodic feed regeneration
Deploy path for the combined feed XML

Test plan

python3 -m pytest tools/test_podcast_guests.py -v — 55 tests pass
python3 tools/podcast-episodes.py --preview 10 — review sample wikitext output
python3 tools/podcast-firehose.py /tmp/test.xml — generates combined feed
Review regex patterns in podcast-guest-patterns.json for any obvious misses
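
When reviewing the preview output, it may help to know roughly what shape to expect. The sketch below renders a {{PodcastEpisode}} call followed by [[guest::Name]] annotations, which are the two pieces named in the summary. The template parameter names (`podcast`, `title`, `date`) and the function `episode_wikitext` are assumptions; the real template on the wiki may use different fields.

```python
def episode_wikitext(podcast, title, date, guests):
    """Render one episode as template wikitext plus semantic guest properties."""

    def clean(value):
        # A literal pipe inside a template argument would start a new
        # parameter, so replace it with MediaWiki's {{!}} magic word.
        return value.replace("|", "{{!}}")

    lines = [
        "{{PodcastEpisode",
        f"|podcast={clean(podcast)}",  # hypothetical parameter name
        f"|title={clean(title)}",      # hypothetical parameter name
        f"|date={date}",               # hypothetical parameter name
        "}}",
    ]
    # One semantic annotation per extracted guest.
    lines += [f"[[guest::{name}]]" for name in guests]
    return "\n".join(lines)
```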

Song IDs correspond to track numbers on Tony Rice's 1979 Manzanita album:

- Track 5: Nine Pound Hammer (Pushups)
- Track 7: Blue Railroad Train (Squats)
- Track 8: Ginseng Sullivan (Army Crawls)

Updated example config and documentation to use correct mappings.
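
Expressed as a lookup table, the mapping above might look like the sketch below. The three entries come straight from this commit message; the names `SONGS` and `describe_song` and the output format are illustrative only, and the actual config format is not shown here.

```python
# Song ID -> (song title, exercise), using the track numbers listed above.
SONGS = {
    5: ("Nine Pound Hammer", "Pushups"),
    7: ("Blue Railroad Train", "Squats"),
    8: ("Ginseng Sullivan", "Army Crawls"),
}


def describe_song(song_id):
    """Return a readable label for a song ID, or None if it is unmapped."""
    entry = SONGS.get(song_id)
    if entry is None:
        return None
    title, exercise = entry
    return f"{title} ({exercise})"
```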

Two bugs fixed:

1. Leaderboards were being generated inside the per-source loop, so with V1+V2 sources, each run would overwrite leaderboards twice (once with V1 data, then with V2 data), causing alternating edits. Now aggregates all tokens from all sources before generating leaderboards once.
2. Bot saved pages even when content was identical. Added check to skip saving if the existing page content matches the generated content exactly.
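
Both fixes can be sketched in a few lines. This is not the bot's actual code: the `owner` field, the leaderboard layout, and the `read_page` / `save_page` callables are assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def build_leaderboard(tokens):
    """Render a wikitext numbered list of owners, most tokens first."""
    counts = Counter(token["owner"] for token in tokens)  # "owner" is hypothetical
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"# {owner}: {count}" for owner, count in ranked)


def update_leaderboard(sources, read_page, save_page, title="Leaderboard"):
    """Aggregate every source first, then save once, and only on change."""
    # Fix 1: flatten all sources before generating anything, so V1 and V2
    # data never overwrite each other within a run.
    all_tokens = [token for source in sources for token in source]
    content = build_leaderboard(all_tokens)
    # Fix 2: skip the save when the page already holds identical content.
    if read_page(title) == content:
        return False
    save_page(title, content)
    return True
```

Running it twice against the same data should produce exactly one save, which is the behaviour the second fix is after.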

Replaces the PHP maintenance script with a well-tested Python implementation.

Key improvements:

- Aggregates all tokens from all sources BEFORE generating leaderboards (fixes the bug where V1 and V2 sources would overwrite each other)
- Compares content before saving - won't create edits if unchanged
- 68 pytest tests including regression tests for the aggregation bug
- Clean separation: models, chain_data, config_parser, leaderboard, wiki_client
- DryRunClient for easy testing and safe debugging
- CLI with --dry-run, --verbose options

Usage:

    cd tools/blue-railroad-import
    pip install -e ".[dev]"
    pytest  # run tests
    blue-railroad-import --chain-data /path/to/chainData.json --dry-run -v

Tools for aggregating bluegrass podcast RSS feeds and extracting guest names from episode titles for semantic wiki tagging.

- podcast-feeds.json: 13 verified bluegrass podcast RSS feeds
- podcast-firehose.py: Merges feeds into a single combined RSS feed
- podcast-guest-patterns.json: Per-podcast regex patterns for guest name extraction (636 guest extractions, 510 unique names, across 9 podcasts)
- podcast-episodes.py: Generates {{PodcastEpisode}} wikitext from RSS data + extracted guests
- test_podcast_guests.py: 55 tests covering all podcast formats, skip patterns, edge cases, and false positive prevention

DryRunClient.would_save was ambiguous — it sounded hypothetical but actually records every save_page() call. Renamed to saved_pages. Also replaced should/would in test comments with direct descriptions of what the test asserts.
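
A stripped-down illustration of the renamed attribute follows. Only the class name, `save_page()`, and `saved_pages` come from this commit message; the constructor, `get_page`, and the argument list are assumptions, not the real wiki_client interface.

```python
class DryRunClient:
    """Stand-in wiki client that records saves instead of performing them."""

    def __init__(self, existing_pages=None):
        # Hypothetical: optional title -> content mapping to read from.
        self._pages = dict(existing_pages or {})
        # Every save_page() call is appended here, in call order.
        self.saved_pages = []

    def get_page(self, title):
        """Return the stored content for a title, or None if absent."""
        return self._pages.get(title)

    def save_page(self, title, content, summary=""):
        """Record the call; nothing is sent to a wiki."""
        self.saved_pages.append(
            {"title": title, "content": content, "summary": summary}
        )
        return True
```

A test can then assert directly on `client.saved_pages`, for example that it is empty when the generated content is unchanged.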

jMyles and others added 10 commits January 27, 2026 20:59

Support Blue Railroad V2 token format in import script

ed9d105

- Detect V2 vs V1 by presence of blockheight field
- V2 uses blockheight instead of date
- V2 uses videoHash instead of uri
- Add contract_version field to token template
- Add separate category for V2 tokens
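
The detection rule is small enough to show in full. In this sketch the field names `blockheight`, `date`, `videoHash`, `uri`, and `contract_version` come from the commit message, while the function names and the normalized keys `when` and `video` are made up for illustration.

```python
def contract_version(token):
    """Classify a token record as V2 if it has a blockheight field, else V1."""
    return "V2" if "blockheight" in token else "V1"


def normalize_token(token):
    """Map either format onto one shape for the token template."""
    version = contract_version(token)
    if version == "V2":
        return {
            "contract_version": version,
            "when": token["blockheight"],  # V2 records a block height, not a date
            "video": token["videoHash"],   # V2 records a hash, not a URI
        }
    return {
        "contract_version": version,
        "when": token["date"],
        "video": token["uri"],
    }
```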

Add date-to-blockheight datepicker as ResourceLoader module

05dc3e4

Add ResourceFileModulePaths to fix datepicker module loading

0853f50

MediaWiki ResourceLoader needs explicit paths to find module files.

favicon

9aa39a3

Move favicon to assets/ and set $wgFavicon

9857ab0

favicon.ico at the repo root wasn't being copied into the MediaWiki build - only the assets/ directory gets copied by the Jenkinsfile. Moved it there and added $wgFavicon to LocalSettings.php.

magent-cryptograss wants to merge 10 commits into cryptograss:main from magent-cryptograss:podcast-firehose