Feat/xgboost rework #50

Merged
SeanClay10 merged 32 commits into main from feat/xgboost-rework on Mar 15, 2026

Conversation

@raymondcen
Collaborator

The biggest change is that PDFs can now be processed in parallel, which cuts wait time significantly on batch runs. There's also a solid round of improvements to how text gets cleaned and how the AI extracts information.

Processing multiple PDFs at once

  • Previously the pipeline handled one PDF at a time. Now you can pass --workers N to run N files simultaneously. It defaults to 1 so nothing breaks if you don't touch it.
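A minimal sketch of how a `--workers N` flag could fan files out to a pool. Only the flag name and its default of 1 come from the description above; `parse_args`, `process_pdf`, `run_batch`, and the use of `concurrent.futures` are assumptions for illustration, not the code in this PR.

```python
import argparse
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Batch PDF pipeline (sketch)")
    parser.add_argument("pdfs", nargs="+", help="PDF files to process")
    # Default of 1 keeps the old one-file-at-a-time behaviour.
    parser.add_argument("--workers", type=int, default=1,
                        help="number of PDFs to process simultaneously")
    return parser.parse_args(argv)


def process_pdf(path: str) -> str:
    # Hypothetical stand-in for the real per-file pipeline.
    return f"processed {path}"


def run_batch(paths: Iterable[str],
              workers: int = 1,
              worker: Callable[[str], str] = process_pdf,
              executor_cls=ProcessPoolExecutor) -> List[str]:
    paths = list(paths)
    if workers <= 1:
        # Sequential path: no pool is created at all.
        return [worker(p) for p in paths]
    with executor_cls(max_workers=workers) as pool:
        # map() returns results in input order, even if files finish out of order.
        return list(pool.map(worker, paths))
```

Under these assumptions, a script would call `run_batch(args.pdfs, workers=args.workers)` after `args = parse_args()`. With a process pool, the worker has to be a picklable top-level function.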

Cleaner text going into the model
Spent time improving what actually gets fed to the AI. Noisy stuff like DOIs, figure captions, and numbered references gets stripped out. Weak or irrelevant paragraphs get dropped entirely using a scoring system, and section headers are now preserved correctly (they were getting removed before).
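The cleaning steps described above could look roughly like this. It's a sketch only: the regexes, the header list, the scoring formula, and the `min_score` threshold are all assumptions made up for illustration, not the rules used in this PR.

```python
import re

# All patterns below are illustrative assumptions.
DOI_RE = re.compile(
    r"\b(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?10\.\d{4,9}/\S+", re.IGNORECASE)
CAPTION_RE = re.compile(r"^(?:fig(?:ure)?|table)\.?\s*\d+[.:]", re.IGNORECASE)
NUMBERED_REF_RE = re.compile(r"^\s*\[\d+\]\s+")
HEADER_RE = re.compile(
    r"^(?:abstract|introduction|methods?|results|discussion|conclusions?)\s*$",
    re.IGNORECASE)


def score_paragraph(paragraph: str) -> float:
    """Toy score in [0, 1]: longer paragraphs made of real words score higher."""
    words = paragraph.split()
    if not words:
        return 0.0
    alpha = sum(w.strip(".,;:()").isalpha() for w in words)
    return min(len(words), 40) / 40 * (alpha / len(words))


def clean_text(text: str, min_score: float = 0.3) -> str:
    kept = []
    for raw in re.split(r"\n\s*\n", text):
        paragraph = raw.strip()
        if not paragraph:
            continue
        # Check headers first: a one-word header would fail the score below.
        if HEADER_RE.match(paragraph):
            kept.append(paragraph)
            continue
        if CAPTION_RE.match(paragraph) or NUMBERED_REF_RE.match(paragraph):
            continue
        # Strip DOIs, then collapse the whitespace they leave behind.
        paragraph = " ".join(DOI_RE.sub("", paragraph).split())
        if score_paragraph(paragraph) >= min_score:
            kept.append(paragraph)
    return "\n\n".join(kept)
```

The header check runs before scoring on purpose, since any score that rewards length will otherwise drop short header lines along with the noise.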

For extraction, switched the model to qwen2.5:7b, added retry logic for null responses, and bumped up the context limit so longer papers don't get cut off.
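Retrying on null responses can be sketched as a small wrapper around the model call. Everything here is hypothetical: `extract_with_retry`, the `call_model` callable, the attempt count, and the treatment of empty or literal "null" strings as null are assumptions, not the logic in this PR.

```python
import time
from typing import Callable, Optional


def extract_with_retry(call_model: Callable[[str], Optional[str]],
                       prompt: str,
                       max_attempts: int = 3,
                       backoff_seconds: float = 0.0) -> Optional[str]:
    """Call the model until it returns something non-null, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        result = call_model(prompt)
        # Assumption: None, empty strings, and the literal "null" all count as null.
        if result is not None and result.strip().lower() not in {"", "null"}:
            return result
        if attempt < max_attempts:
            # Linear backoff between attempts; 0 disables the wait.
            time.sleep(backoff_seconds * attempt)
    return None
```

Passing the model call in as a function keeps the wrapper independent of whichever client actually talks to qwen2.5:7b.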

Misc fixes and optimizations

  • Spell checker loads once at startup instead of on every call
  • XGBoost automatically uses GPU if one is available
  • Added an OCR bypass option for parallel runs, since OCR was causing workers to freeze
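Two of these fixes can be sketched in a few lines. The helper names, the placeholder word list, and the `nvidia-smi` check are assumptions for illustration. The `device` parameter is the XGBoost 2.0+ way to select a GPU (older releases used `tree_method="gpu_hist"`), and how this PR actually detects a GPU isn't stated.

```python
import shutil
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=1)
def get_spell_checker() -> frozenset:
    # Hypothetical stand-in: the real project would build its spell checker here.
    # lru_cache makes this body run once; later calls reuse the same object.
    return frozenset({"pdf", "model", "text"})


def xgboost_params(gpu_available: Optional[bool] = None) -> dict:
    if gpu_available is None:
        # Rough heuristic (an assumption): treat a visible nvidia-smi as "GPU present".
        gpu_available = shutil.which("nvidia-smi") is not None
    return {"tree_method": "hist", "device": "cuda" if gpu_available else "cpu"}
```

The returned dict is meant to be merged into whatever parameters the training code already passes to XGBoost.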

@raymondcen requested a review from SeanClay10 on March 15, 2026 22:24
Collaborator

@SeanClay10 SeanClay10 left a comment

LGTM!

@SeanClay10 SeanClay10 merged commit 05477fa into main Mar 15, 2026
2 checks passed

2 participants