Podcast firehose and guest extraction pipeline #38

Open
magent-cryptograss wants to merge 10 commits into cryptograss:main from magent-cryptograss:podcast-firehose
@magent-cryptograss
Contributor

Summary

  • 13 bluegrass podcast RSS feeds verified and catalogued in podcast-feeds.json
  • Combined feed aggregator (podcast-firehose.py) merges all feeds into one RSS feed sorted by date
  • Guest name extraction (podcast-guest-patterns.json) uses per-podcast regex patterns to pull guest names from episode titles — currently extracting 636 episodes with 510 unique guest names across 9 podcasts
  • Episode page generator (podcast-episodes.py) produces {{PodcastEpisode}} wikitext with semantic [[guest::Name]] properties
  • 55 tests (test_podcast_guests.py) covering every podcast format, skip patterns, false positive prevention, and edge cases
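As a rough illustration of the merge-and-sort step described for podcast-firehose.py, here is a minimal standard-library sketch. It is not the script in this PR: the function names `item_date` and `merge_feeds`, the sample feeds, and the choice of newest-first ordering are all assumptions made for the example.

```python
import copy
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feeds; the real feed list lives in podcast-feeds.json.
FEED_A = """<rss version="2.0"><channel><title>A</title>
<item><title>A1</title><pubDate>Mon, 05 Jan 2026 10:00:00 +0000</pubDate></item>
<item><title>A2</title><pubDate>Mon, 19 Jan 2026 10:00:00 +0000</pubDate></item>
</channel></rss>"""
FEED_B = """<rss version="2.0"><channel><title>B</title>
<item><title>B1</title><pubDate>Mon, 12 Jan 2026 09:00:00 -0500</pubDate></item>
</channel></rss>"""

OLDEST = datetime.min.replace(tzinfo=timezone.utc)


def item_date(item: ET.Element) -> datetime:
    """Parse an item's RFC 822 pubDate; undated or unparseable items sort last."""
    text = item.findtext("pubDate")
    if not text:
        return OLDEST
    try:
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return OLDEST
    # Treat naive dates as UTC so every value is comparable.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def merge_feeds(feed_documents, title="Combined feed"):
    """Collect every <item> from each RSS document and emit one feed, newest first."""
    items = []
    for document in feed_documents:
        items.extend(ET.fromstring(document).iter("item"))
    items.sort(key=item_date, reverse=True)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for item in items:
        channel.append(copy.deepcopy(item))
    return ET.tostring(rss, encoding="unicode")
```

Calling `merge_feeds([FEED_A, FEED_B])` returns a single RSS string whose items are interleaved by publication date regardless of which source feed they came from.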

Podcasts covered

| Podcast | Guests extracted | Status |
|---|---|---|
| Bluegrass Unlimited | 100 | ✅ 100% |
| Picky Fingers Banjo | 179 | ✅ 100% |
| What's The Reason | 82 | ✅ 100% |
| Walls of Time | 40 | ✅ 100% |
| Toy Heart | 20 | ✅ 100% |
| Bluegrass Ambassadors | 3 | ✅ 100% |
| Bluegrass Jam Along | 128 | 🟡 93% (9 edge cases) |
| Bluegrass BKLYN | 52 | 🟡 85% |
| Grass Talk Radio | 32 | 🟡 Interview-only (most eps are monologues) |
| 4 music/radio shows | | No guest extraction (not interview formats) |
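The per-podcast coverage above comes from matching episode titles against patterns specific to each show. The sketch below shows one way such a lookup could work. The schema (`skip` and `guest` keys), the show name, the patterns, and the placeholder guest names are all hypothetical and are not taken from podcast-guest-patterns.json.

```python
import re

# Hypothetical pattern table; the real schema in podcast-guest-patterns.json may differ.
PATTERNS = {
    "Example Interview Show": {
        # Titles matching any skip pattern never yield guests.
        "skip": [r"(?i)\bbest of\b", r"(?i)\btrailer\b"],
        # Tried in order; the first match wins.
        "guest": [
            r"^Ep(?:isode)?\.? \d+[:\-] (?P<guest>.+?) on ",
            r"^(?P<guest>.+?) Interview$",
        ],
    },
}

# Split "A and B", "A & B", or "A, B" into separate names.
NAME_SEPARATOR = re.compile(r"\s*,\s*|\s*&\s*|\s+and\s+")


def extract_guests(podcast, title, patterns=PATTERNS):
    """Return the guest names found in an episode title, or an empty list."""
    config = patterns.get(podcast)
    if config is None:
        return []  # No patterns for this show, e.g. a music/radio format.
    if any(re.search(skip, title) for skip in config["skip"]):
        return []
    for pattern in config["guest"]:
        match = re.search(pattern, title)
        if match:
            names = NAME_SEPARATOR.split(match.group("guest"))
            return [name.strip() for name in names if name.strip()]
    return []
```

Checking skip patterns before guest patterns is one way to keep compilation or trailer episodes from producing false positives; whether the PR orders them the same way is not shown here.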

Not yet included in this PR

  • Template:PodcastEpisode (already created on the wiki, rev 1405)
  • Actual episode page creation (needs review of naming convention and batch strategy)
  • Jenkins job for periodic feed regeneration
  • Deploy path for the combined feed XML

Test plan

  • python3 -m pytest tools/test_podcast_guests.py -v — 55 tests pass
  • python3 tools/podcast-episodes.py --preview 10 — review sample wikitext output
  • python3 tools/podcast-firehose.py /tmp/test.xml — generates combined feed
  • Review regex patterns in podcast-guest-patterns.json for any obvious misses
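When reviewing the `--preview` output, it may help to have a picture of the general shape of the wikitext. The following is a sketch only: the template parameter names (`title`, `podcast`, `date`, `guests`) and the function `render_episode` are assumptions. The `{{PodcastEpisode}}` template name and the `[[guest::Name]]` annotation are the only parts taken from this PR.

```python
def render_episode(title, podcast, date, guests):
    """Build {{PodcastEpisode}} wikitext with one semantic guest property per name."""
    # A raw "|" would end the template argument early; {{!}} is MediaWiki's pipe escape.
    safe_title = title.replace("|", "{{!}}")
    guest_links = ", ".join(f"[[guest::{name}]]" for name in guests)
    lines = [
        "{{PodcastEpisode",
        f"|title={safe_title}",      # hypothetical parameter name
        f"|podcast={podcast}",       # hypothetical parameter name
        f"|date={date}",             # hypothetical parameter name
        f"|guests={guest_links}",    # hypothetical parameter name
        "}}",
    ]
    return "\n".join(lines)
```

For example, `render_episode("Banjo talk", "Example Show", "2026-01-12", ["Jane Doe"])` produces a six-line template call whose guests line reads `|guests=[[guest::Jane Doe]]`.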

jMyles and others added 10 commits January 27, 2026 20:59
Song IDs correspond to track numbers on Tony Rice's 1979 Manzanita album:
- Track 5: Nine Pound Hammer (Pushups)
- Track 7: Blue Railroad Train (Squats)
- Track 8: Ginseng Sullivan (Army Crawls)

Updated example config and documentation to use correct mappings.

- Detect V2 vs V1 by presence of blockheight field
- V2 uses blockheight instead of date
- V2 uses videoHash instead of uri
- Add contract_version field to token template
- Add separate category for V2 tokens
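The V1/V2 distinction in this commit can be illustrated with a small sketch. The field names `blockheight`, `date`, `videoHash`, `uri`, and `contract_version` come from the commit message. The function name `describe_token`, the output key `video`, and both category strings are hypothetical.

```python
def describe_token(token):
    """Map a raw token record to template fields, branching on contract version."""
    if "blockheight" in token:
        # V2 records carry a block height and a video hash.
        return {
            "contract_version": "V2",
            "blockheight": token["blockheight"],
            "video": token.get("videoHash"),
            "category": "Example V2 Tokens",  # hypothetical category name
        }
    # V1 records carry a date and a URI instead.
    return {
        "contract_version": "V1",
        "date": token.get("date"),
        "video": token.get("uri"),
        "category": "Example V1 Tokens",  # hypothetical category name
    }
```

Keying the check on the presence of `blockheight` rather than on its value means a V2 token with a falsy block height is still classified as V2.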
MediaWiki ResourceLoader needs explicit paths to find module files.

Two bugs fixed:
1. Leaderboards were being generated inside the per-source loop,
   so with V1+V2 sources, each run would overwrite leaderboards twice
   (once with V1 data, then with V2 data), causing alternating edits.
   Now aggregates all tokens from all sources before generating
   leaderboards once.

2. Bot saved pages even when content was identical. Added check to
   skip saving if the existing page content matches the generated
   content exactly.

Replaces the PHP maintenance script with a well-tested Python implementation.

Key improvements:
- Aggregates all tokens from all sources BEFORE generating leaderboards
  (fixes the bug where V1 and V2 sources would overwrite each other)
- Compares content before saving - won't create edits if unchanged
- 68 pytest tests including regression tests for the aggregation bug
- Clean separation: models, chain_data, config_parser, leaderboard, wiki_client
- DryRunClient for easy testing and safe debugging
- CLI with --dry-run, --verbose options
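The first two improvements can be sketched in a few lines. This is an illustration under assumptions, not the code in tools/blue-railroad-import: a plain dict stands in for the wiki, tokens are dicts with a hypothetical `owner` key, and the leaderboard format is invented.

```python
from collections import Counter


def build_leaderboard(tokens):
    """Render a leaderboard ranked by token count, ties broken by owner name."""
    counts = Counter(token["owner"] for token in tokens)
    rows = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return "\n".join(f"# {owner}: {count}" for owner, count in rows)


def update_leaderboard(sources, wiki, title="Leaderboard"):
    """Aggregate all sources first, then save once, and only if content changed."""
    # Flatten every source (e.g. V1 and V2) BEFORE generating anything,
    # so no source can overwrite another's leaderboard.
    all_tokens = [token for source in sources for token in source]
    content = build_leaderboard(all_tokens)
    if wiki.get(title) == content:
        return False  # Identical content: skip the save, no edit is created.
    wiki[title] = content
    return True
```

Running `update_leaderboard` twice with the same sources saves on the first call and returns `False` on the second, which is the behaviour the regression tests are described as guarding.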

Usage:
  cd tools/blue-railroad-import
  pip install -e ".[dev]"
  pytest  # run tests
  blue-railroad-import --chain-data /path/to/chainData.json --dry-run -v

favicon.ico at the repo root wasn't being copied into the MediaWiki
build - only the assets/ directory gets copied by the Jenkinsfile.
Moved it there and added $wgFavicon to LocalSettings.php.

Tools for aggregating bluegrass podcast RSS feeds and extracting
guest names from episode titles for semantic wiki tagging.

- podcast-feeds.json: 13 verified bluegrass podcast RSS feeds
- podcast-firehose.py: Merges feeds into a single combined RSS feed
- podcast-guest-patterns.json: Per-podcast regex patterns for guest
  name extraction (guests extracted from 636 episodes, 510 unique names, across 9 podcasts)
- podcast-episodes.py: Generates {{PodcastEpisode}} wikitext from
  RSS data + extracted guests
- test_podcast_guests.py: 55 tests covering all podcast formats,
  skip patterns, edge cases, and false positive prevention

DryRunClient.would_save was ambiguous; it sounded hypothetical but
actually records every save_page() call. Renamed to saved_pages.

Also replaced should/would in test comments with direct descriptions
of what the test asserts.
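To make the rename concrete, here is a minimal sketch of a dry-run client whose `saved_pages` attribute records every `save_page()` call. Only the names `DryRunClient`, `save_page`, and `saved_pages` come from the commit message. The constructor argument, the `get_page` method, and the tuple layout are assumptions.

```python
class DryRunClient:
    """Stand-in wiki client that records saves instead of writing to a wiki."""

    def __init__(self, existing=None):
        # Hypothetical: pre-seeded page contents, keyed by title.
        self.existing = dict(existing or {})
        # Every save_page() call is appended here, in call order.
        self.saved_pages = []

    def get_page(self, title):
        """Return the seeded content for a title, or None if absent (hypothetical)."""
        return self.existing.get(title)

    def save_page(self, title, content, summary=""):
        """Record the save; nothing is sent anywhere."""
        self.saved_pages.append((title, content, summary))
```

A test can then assert directly on `client.saved_pages`, for example that it is empty when the generated content already matches the existing page.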