Why these changes are being introduced:
It was determined that we were not crawling LibGuides sub-pages in browsertrix. Once they started rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little work to handle them.
How this addresses that need:
* Update the LibGuides API URL to include `?expand=pages`
  * this adds a `.pages` node to the main/parent guides API data
* Interleave these sub-pages with the main guides in the API data, allowing the transform to find and utilize them as well
* Because of increased crawl scope, filter out additional directory guides that have `g=176063` in the URL
Side effects of this change:
* Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased TIMDEX record count for the `libguides` source
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-449
Pull request overview
Adds support for LibGuides sub-pages by expanding guide “pages” from the LibGuides API into first-class rows so the existing transform pipeline can match and process crawled sub-page records.
Changes:
- Update the LibGuides API URL default to request `?expand=pages`, then expand sub-pages into rows in `LibGuidesAPIClient.fetch_guides`.
- Improve exclusion behavior by filtering out non-`libguides.mit.edu` records and adding additional staff-directory URL exclusions.
- Add/adjust tests and update dependencies in `Pipfile.lock`.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `transmogrifier/sources/json/libguides.py` | Expands API sub-pages into DataFrame rows; adds hostname-based exclusion; updates URL matching behavior. |
| `transmogrifier/config.py` | Defaults LibGuides guides endpoint to `?expand=pages`. |
| `tests/sources/json/test_libguides.py` | Adds a unit test covering sub-page expansion behavior. |
| `tests/conftest.py` | Sets LibGuides-related env vars for test runs. |
| `Pipfile.lock` | Updates locked dependency set (including a new dev dependency). |
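For illustration only, here is a minimal sketch of what adding `expand=pages` to a guides endpoint URL could look like. The helper name `with_expand_pages` and the `example.com` URLs are hypothetical; per the summary above, the PR itself just changes the configured default URL rather than building it at runtime.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_expand_pages(url: str) -> str:
    """Return `url` with `expand=pages` set, keeping any existing query params.

    Hypothetical helper for illustration; not part of Transmogrifier.
    """
    parsed = urlparse(url)
    params = dict(parse_qsl(parsed.query))
    params["expand"] = "pages"
    return urlunparse(parsed._replace(query=urlencode(params)))


# Placeholder URLs, not the real LibGuides API endpoint.
plain = with_expand_pages("https://example.com/api/guides")
with_existing = with_expand_pages("https://example.com/api/guides?site_id=1")
```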
```python
all_rows: list[dict] = []
for guide in guides:
    pages = guide.get("pages", [])
    all_rows.append(guide)
    for page in pages:
        # inherit parent columns, then overlay page-specific columns
        page_row = {**guide, **page}
        all_rows.append(page_row)
```
This is the meaningful change in this PR. Again, this detailed Jira comment has more context.
We just need to interleave the sub-pages into the main dataframe, so they are accessible at the same level as "normal" guides during metadata work.
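To make the interleaving concrete, here is a small self-contained sketch that runs the same loop over made-up data. The function name `interleave_pages` and the field names (`id`, `name`, `url`, `description`) are assumptions for illustration; the real LibGuides API payload may differ.

```python
def interleave_pages(guides: list[dict]) -> list[dict]:
    """Flatten guides and their sub-pages into one list of rows.

    Each sub-page row inherits the parent guide's fields, with the
    page's own fields taking precedence on key collisions.
    """
    all_rows: list[dict] = []
    for guide in guides:
        all_rows.append(guide)
        for page in guide.get("pages", []):
            all_rows.append({**guide, **page})
    return all_rows


# Made-up sample data; field names are assumptions.
sample_guides = [
    {
        "id": 1,
        "name": "Guide A",
        "url": "https://libguides.mit.edu/a",
        "description": "Parent description",
        "pages": [
            {"id": 11, "name": "Page A1", "url": "https://libguides.mit.edu/a/one"},
        ],
    },
    {"id": 2, "name": "Guide B", "url": "https://libguides.mit.edu/b"},
]

rows = interleave_pages(sample_guides)
```

In this sketch the sub-page row lands directly after its parent and keeps the parent's `description`, while `id`, `name`, and `url` come from the page. Note that the inherited row also carries the parent's `pages` key, since the merge copies every parent field.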
ehanson8 left a comment
Also works as expected, great update!
Purpose and background context
Why these changes are being introduced:
It was determined that we were not crawling LibGuides sub-pages in browsertrix. Once they started
rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little
work to handle them.
See this detailed findings comment in the Jira ticket, specifically the section "2- Update Transmogrifier to retrieve sub-pages in API call": https://mitlibraries.atlassian.net/browse/USE-449?focusedCommentId=182143.
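How this addresses that need:
* Update the LibGuides API URL to include `?expand=pages`
  * this adds a `.pages` node to the main/parent guides API data
* Interleave these sub-pages with the main guides in the API data, allowing the transform to find and utilize them as well
* Because of increased crawl scope, filter out additional directory guides that have `g=176063` in the URL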
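As a rough sketch of the exclusion rules mentioned in this PR (dropping records that are not on `libguides.mit.edu`, and dropping directory guides with `g=176063` in the URL), something like the following could work. The function name `is_excluded`, the constants, and the query-parameter parsing are assumptions; the actual implementation in `transmogrifier/sources/json/libguides.py` may match URLs differently, for example with a simple substring check.

```python
from urllib.parse import parse_qs, urlparse

# Assumed constants for illustration only.
ALLOWED_HOST = "libguides.mit.edu"
EXCLUDED_GUIDE_IDS = {"176063"}


def is_excluded(url: str) -> bool:
    """Return True if a crawled URL should be skipped."""
    parsed = urlparse(url)
    # Drop anything not hosted on libguides.mit.edu.
    if parsed.hostname != ALLOWED_HOST:
        return True
    # Drop directory guides identified by the `g` query parameter.
    guide_ids = parse_qs(parsed.query).get("g", [])
    return any(guide_id in EXCLUDED_GUIDE_IDS for guide_id in guide_ids)
```

Parsing the query string rather than searching for the raw substring avoids accidentally matching a longer id such as `g=1760631`, though either approach may be adequate for the URLs actually crawled.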
How can a reviewer manually see the effects of these changes?
1- Set dev1 credentials
2- Ensure that `.env` file has LibGuides API credentials set (shared in slack)
3- Run transformation:
Note that quite a few records are skipped; this is expected. Additionally, a couple of errors are logged for guides whose URL we can't find in the LibGuides API data; this is also expected.
My assumption is that we may continue to have a bit of churn on the `libguides` source over the next few weeks, so aiming to make meaningful and impactful changes, but maybe not account for every conceivable edge case at this time.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased TIMDEX record count for the `libguides` source
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-449
Code review