Podcast firehose and guest extraction pipeline#38
Open
magent-cryptograss wants to merge 10 commits into cryptograss:main from
Song IDs correspond to track numbers on Tony Rice's 1979 Manzanita album:
- Track 5: Nine Pound Hammer (Pushups)
- Track 7: Blue Railroad Train (Squats)
- Track 8: Ginseng Sullivan (Army Crawls)

Updated example config and documentation to use correct mappings.
- Detect V2 vs V1 by presence of blockheight field
- V2 uses blockheight instead of date
- V2 uses videoHash instead of uri
- Add contract_version field to token template
- Add separate category for V2 tokens
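The version check described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `normalize_token` and the output keys `when` and `media` are hypothetical, and only the field names `blockheight`, `date`, `videoHash`, `uri`, and `contract_version` come from the commit message.

```python
from typing import Any


def normalize_token(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a raw token record onto one shape and tag its contract version.

    Hypothetical helper: the output keys "when" and "media" are assumptions.
    """
    if "blockheight" in raw:
        # V2 records carry a block height and a video hash.
        return {
            "contract_version": "V2",
            "when": raw["blockheight"],
            "media": raw.get("videoHash"),
        }
    # V1 records carry a date and a URI instead.
    return {
        "contract_version": "V1",
        "when": raw.get("date"),
        "media": raw.get("uri"),
    }


v1 = normalize_token({"date": "2024-01-01", "uri": "ipfs://example"})
v2 = normalize_token({"blockheight": 123, "videoHash": "0xabc"})
```

Keying the check on the presence of `blockheight` rather than on a version number means V1 records need no changes to be recognised.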
MediaWiki ResourceLoader needs explicit paths to find module files.
Two bugs fixed:
1. Leaderboards were being generated inside the per-source loop, so with V1+V2 sources, each run would overwrite leaderboards twice (once with V1 data, then with V2 data), causing alternating edits. Now aggregates all tokens from all sources before generating leaderboards once.
2. Bot saved pages even when content was identical. Added check to skip saving if the existing page content matches the generated content exactly.
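Both fixes can be illustrated with a short sketch. Everything here is an assumption made for illustration: the wiki is modelled as a plain dict, tokens carry a hypothetical `owner` field, and the leaderboard format is invented. Only the two ideas, aggregate first and skip identical saves, come from the commit message.

```python
from collections import Counter


def build_leaderboard(tokens):
    """Count tokens per owner and render one line per owner, highest first."""
    counts = Counter(t["owner"] for t in tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"# {owner}: {n}" for owner, n in ranked)


def save_if_changed(wiki, title, content):
    """Write the page only when the text differs; return True if written."""
    if wiki.get(title) == content:
        return False
    wiki[title] = content
    return True


def run(sources, wiki):
    # Aggregate tokens across every source first...
    all_tokens = [t for tokens in sources.values() for t in tokens]
    # ...then generate and save the leaderboard exactly once per run.
    return save_if_changed(wiki, "Leaderboard", build_leaderboard(all_tokens))


sources = {
    "v1": [{"owner": "alice"}],
    "v2": [{"owner": "alice"}, {"owner": "bob"}],
}
wiki = {}
first_run_saved = run(sources, wiki)   # page is new, so it is written
second_run_saved = run(sources, wiki)  # content is identical, so it is skipped
```

Generating inside the per-source loop would instead have written a V1-only page and then a V2-only page on every run, which is the alternating-edit pattern the first fix removes.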
Replaces the PHP maintenance script with a well-tested Python implementation.

Key improvements:
- Aggregates all tokens from all sources BEFORE generating leaderboards (fixes the bug where V1 and V2 sources would overwrite each other)
- Compares content before saving; won't create edits if unchanged
- 68 pytest tests including regression tests for the aggregation bug
- Clean separation: models, chain_data, config_parser, leaderboard, wiki_client
- DryRunClient for easy testing and safe debugging
- CLI with --dry-run, --verbose options

Usage:
    cd tools/blue-railroad-import
    pip install -e ".[dev]"
    pytest  # run tests
    blue-railroad-import --chain-data /path/to/chainData.json --dry-run -v
favicon.ico at the repo root wasn't being copied into the MediaWiki build - only the assets/ directory gets copied by the Jenkinsfile. Moved it there and added $wgFavicon to LocalSettings.php.
Tools for aggregating bluegrass podcast RSS feeds and extracting
guest names from episode titles for semantic wiki tagging.
- podcast-feeds.json: 13 verified bluegrass podcast RSS feeds
- podcast-firehose.py: Merges feeds into a single combined RSS feed
- podcast-guest-patterns.json: Per-podcast regex patterns for guest
  name extraction (636 guests from 510 unique names across 9 podcasts)
- podcast-episodes.py: Generates {{PodcastEpisode}} wikitext from
  RSS data + extracted guests
- test_podcast_guests.py: 55 tests covering all podcast formats,
  skip patterns, edge cases, and false positive prevention
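To make the guest-extraction idea concrete, here is a rough sketch of per-podcast patterns with a skip rule. The podcast key `example-show`, both regexes, and the function names are hypothetical; the real patterns live in podcast-guest-patterns.json and are not reproduced here. Only the `[[guest::Name]]` property syntax is taken from the PR.

```python
import re

# Hypothetical patterns for one made-up podcast.
PATTERNS = {
    "example-show": {
        # Captures everything after "Episode N:" as the guest list.
        "guest": re.compile(r"^Episode \d+:\s*(?P<guests>.+)$"),
        # Titles matching this are never treated as guest episodes.
        "skip": re.compile(r"\b(best of|rerun|trailer)\b", re.IGNORECASE),
    },
}


def extract_guests(podcast, title):
    """Return the guest names found in an episode title, or an empty list."""
    rules = PATTERNS.get(podcast)
    if rules is None or rules["skip"].search(title):
        return []
    match = rules["guest"].match(title)
    if match is None:
        return []
    # Split "A and B", "A & B", and "A, B" into separate names.
    parts = re.split(r"\s*(?:,|&|\band\b)\s*", match.group("guests"))
    return [p.strip() for p in parts if p.strip()]


def render_guest_properties(guests):
    """Render guests as semantic wiki properties."""
    return ", ".join(f"[[guest::{name}]]" for name in guests)


guests = extract_guests("example-show", "Episode 12: Jane Doe and John Roe")
wikitext = render_guest_properties(guests)
```

Splitting on "and" would also break up a band name that contains the word, which is the kind of false positive the test suite is described as guarding against.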
DryRunClient.would_save was ambiguous — it sounded hypothetical but actually records every save_page() call. Renamed to saved_pages. Also replaced should/would in test comments with direct descriptions of what the test asserts.
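The renamed attribute might look roughly like this. It is a sketch of the recording behaviour only: the constructor argument, `get_page`, and the `save_page` signature are assumptions, and the class in the PR may differ.

```python
class DryRunClient:
    """Stand-in wiki client that records saves instead of performing them."""

    def __init__(self, existing=None):
        # Hypothetical: page contents that "already exist" on the wiki.
        self.existing = dict(existing or {})
        # Every save_page() call is appended here, in call order.
        self.saved_pages = []

    def get_page(self, title):
        return self.existing.get(title)

    def save_page(self, title, content, summary=""):
        # Nothing is written anywhere; the call is only recorded.
        self.saved_pages.append((title, content, summary))


client = DryRunClient(existing={"Leaderboard": "old text"})
client.save_page("Leaderboard", "new text", summary="update")
```

A test can then assert directly on `client.saved_pages`, which reads as a record of what happened, not as a prediction.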
Summary
- podcast-feeds.json lists 13 verified bluegrass podcast RSS feeds
- podcast-firehose.py merges all feeds into one RSS feed sorted by date
- podcast-guest-patterns.json holds per-podcast regex patterns that pull guest names from episode titles, currently extracting 636 episodes with 510 unique guest names across 9 podcasts
- podcast-episodes.py produces {{PodcastEpisode}} wikitext with semantic [[guest::Name]] properties
- test_podcast_guests.py: 55 tests covering every podcast format, skip patterns, false positive prevention, and edge cases

Podcasts covered

Not yet included in this PR
- Template:PodcastEpisode (already created on the wiki, rev 1405)

Test plan
- python3 -m pytest tools/test_podcast_guests.py -v: 55 tests pass
- python3 tools/podcast-episodes.py --preview 10: review sample wikitext output
- python3 tools/podcast-firehose.py /tmp/test.xml: generates combined feed
- Review podcast-guest-patterns.json for any obvious misses
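For reviewers unfamiliar with the firehose step, merging several RSS feeds into one date-sorted feed can be sketched with the standard library alone. This is not the code in podcast-firehose.py: the function name, the inline sample feeds, and the newest-first ordering are assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny made-up feeds standing in for real podcast RSS.
FEED_A = """<rss version="2.0"><channel><title>Show A</title>
<item><title>A1</title><pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate></item>
<item><title>A2</title><pubDate>Wed, 03 Jan 2024 10:00:00 +0000</pubDate></item>
</channel></rss>"""
FEED_B = """<rss version="2.0"><channel><title>Show B</title>
<item><title>B1</title><pubDate>Tue, 02 Jan 2024 10:00:00 +0000</pubDate></item>
</channel></rss>"""


def merge_feeds(feed_xml_texts, title="Combined feed"):
    """Merge the items of several RSS documents into one, newest first."""
    items = []
    for text in feed_xml_texts:
        channel = ET.fromstring(text).find("channel")
        for item in channel.findall("item"):
            # RSS pubDate uses the RFC 822 date format.
            published = parsedate_to_datetime(item.findtext("pubDate"))
            items.append((published, item))
    # Assumption: newest episodes come first in the combined feed.
    items.sort(key=lambda pair: pair[0], reverse=True)

    rss = ET.Element("rss", version="2.0")
    out = ET.SubElement(rss, "channel")
    ET.SubElement(out, "title").text = title
    for _, item in items:
        out.append(item)
    return rss


merged = merge_feeds([FEED_A, FEED_B])
titles = [item.findtext("title") for item in merged.iter("item")]
```

A real feed may omit `pubDate` or mix timezone-aware and naive dates, so production code would need to handle both cases before sorting.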