Feat/xgboost rework by raymondcen · Pull Request #50 · NovakLabOSU/FracFeedExtractor

raymondcen · 2026-03-15T22:24:29Z

The biggest change is that PDFs can now be processed in parallel, which cuts down wait time significantly when running batches. There's also a solid round of improvements to how text gets cleaned and how the AI extracts information.

Processing multiple PDFs at once

Previously the pipeline handled one PDF at a time. Now you can pass --workers N to run N files simultaneously. It defaults to 1 so nothing breaks if you don't touch it.
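For illustration, here is a minimal sketch of how a `--workers` flag like this could be wired up. It is not the code from this PR: `process_pdf`, `build_parser`, and `run_batch` are hypothetical names, and the per-PDF step is a placeholder.

```python
import argparse
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def process_pdf(pdf_path: str) -> str:
    """Hypothetical stand-in for the real per-PDF pipeline step."""
    return f"processed {Path(pdf_path).name}"


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a --workers option that defaults to 1."""
    parser = argparse.ArgumentParser(description="Batch PDF processing sketch")
    parser.add_argument("pdfs", nargs="*", help="PDF files to process")
    parser.add_argument(
        "--workers",
        type=int,
        default=1,  # sequential by default, so existing invocations behave the same
        help="number of PDFs to process at the same time",
    )
    return parser


def run_batch(pdf_paths, workers=1):
    """Process PDFs one at a time, or in a process pool when workers > 1."""
    if workers <= 1:
        return [process_pdf(path) for path in pdf_paths]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths
        return list(pool.map(process_pdf, pdf_paths))
```

A caller would parse the arguments with `build_parser().parse_args()` and pass `args.pdfs` and `args.workers` to `run_batch`. Keeping the `workers <= 1` path free of any pool makes the default behave exactly like a plain loop.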

Cleaner text going into the model
Spent time improving what actually gets fed to the AI. Noisy stuff like DOIs, figure captions, and numbered references gets stripped out. Weak or irrelevant paragraphs get dropped entirely using a scoring system, and section headers are now preserved correctly (they were getting removed before).
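A rough sketch of what this kind of cleaning and scoring can look like is below. Everything in it is an assumption for illustration: the regular expressions, the keyword lists, the `min_score` threshold, and the header heuristic are hypothetical, not the rules used in this PR.

```python
import re

# Hypothetical patterns; the real pipeline's rules may differ.
DOI_RE = re.compile(
    r"\b(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?10\.\d{4,9}/\S+", re.IGNORECASE
)
FIGURE_CAPTION_RE = re.compile(r"^\s*(?:fig(?:ure)?\.?|table)\s*\d+", re.IGNORECASE)
NUMBERED_REF_RE = re.compile(r"^\s*\[\d+\]\s+\S")

# Hypothetical signal terms for paragraph scoring.
POSITIVE_TERMS = ("stomach", "diet", "prey", "gut content")
NEGATIVE_TERMS = ("acknowledg", "funding", "copyright")


def clean_text(text: str) -> str:
    """Drop caption and reference lines, strip DOIs, collapse blank-line runs."""
    kept = []
    for line in text.splitlines():
        if FIGURE_CAPTION_RE.match(line) or NUMBERED_REF_RE.match(line):
            continue
        kept.append(DOI_RE.sub("", line).rstrip())
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


def score_paragraph(paragraph: str) -> int:
    """Count positive-term hits minus negative-term hits."""
    lower = paragraph.lower()
    positive = sum(lower.count(term) for term in POSITIVE_TERMS)
    negative = sum(lower.count(term) for term in NEGATIVE_TERMS)
    return positive - negative


def is_header(paragraph: str) -> bool:
    """Treat a short single line without a trailing period as a section header."""
    stripped = paragraph.strip()
    return "\n" not in stripped and len(stripped) <= 60 and not stripped.endswith(".")


def filter_paragraphs(text: str, min_score: int = 1) -> str:
    """Keep section headers plus paragraphs scoring at least min_score."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    kept = [p for p in paragraphs if is_header(p) or score_paragraph(p) >= min_score]
    return "\n\n".join(kept)
```

Checking `is_header` before scoring is what keeps section headers from being dropped along with low-scoring paragraphs.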

Switched to qwen2.5:7b, added retry logic for null responses, and bumped up the context limit so longer papers don't get cut off.
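The retry behavior can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `extract_with_retry` is a hypothetical helper, and `call_model` stands for whatever function sends the prompt to the model and returns parsed output, or `None` when the response is null.

```python
import time
from typing import Callable, Optional


def extract_with_retry(
    call_model: Callable[[str], Optional[dict]],
    prompt: str,
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[dict]:
    """Call the model until it returns a non-null result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = call_model(prompt)
        if result is not None:
            return result
        # Optionally pause before the next attempt, but not after the last one.
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    return None
```

Passing the model call in as a function keeps the retry loop independent of any particular client library, which also makes it easy to exercise with a fake model.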

Misc fixes and optimizations

Spell checker loads once at startup instead of on every call
XGBoost automatically uses the GPU if one is available
Added an option to bypass OCR during parallel runs, since OCR was causing workers to freeze
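As an illustration of the load-once idea for the spell checker, one common pattern is a cached accessor. The sketch below uses a hypothetical stand-in class so it runs without extra dependencies; `StandInSpellChecker`, `get_spell_checker`, and `find_misspellings` are assumed names, and the PR may build its checker differently.

```python
from functools import lru_cache


class StandInSpellChecker:
    """Hypothetical stand-in for a checker that is expensive to construct."""

    instances_created = 0

    def __init__(self):
        StandInSpellChecker.instances_created += 1
        self.known_words = {"stomach", "contents", "prey"}

    def unknown(self, words):
        """Return the words that are not in the known vocabulary."""
        return {word for word in words if word.lower() not in self.known_words}


@lru_cache(maxsize=1)
def get_spell_checker() -> StandInSpellChecker:
    """Build the checker on first use and reuse that instance afterwards."""
    return StandInSpellChecker()


def find_misspellings(text: str) -> set:
    """Check text using the shared checker instead of constructing a new one."""
    return get_spell_checker().unknown(text.split())
```

Because `lru_cache` is per process, each worker in a parallel run would hold its own cached instance.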

SeanClay10

LGTM!

raymondcen added 29 commits February 27, 2026 19:32

feat: cleans noisy features (doi, numbered references, figure captions, etc)

9042c78

fix: text cleaning left blank lines

fa61459

feat: temp pipeline for testing

688deaa

fix: section headers being removed

af2836f

use section priority rankings instead

0687e62

save cleaned text to folder

317387d

use paragraph scoring

17b7b55

update data for comparison later

f738892

added another filter to drop entire paragraphs that contain irrelevant info

ff34c74

update instructions on files

b3b4abf

added retry logic to retry null returns

af63973

increased char and num-ctx limit on ollama

af6aa3a

drops paragraphs with no neg or pos signal

cde64a1

added long doc test

00bcf41

switched to qwen2.5:7b

d6efaf3

rewrote the system prompt to handle diverse study methods beyond stomach dissection

b3dcf46

reformat

ad49f3c

improved truncation

0b3a3d8

added multi cpu processing option

f66d48d

added --workers arg

208029d

added sequential pdf processing

4a01b9d

loads SpellChecker() only once

33adda5

xgboost training set to gpu if available

823fd7f

added bypass OCR because it froze workers

72388c1

--labels option to process all useful papers

2263116

fixed name scanning in labels.json

3320e37

reformat

251d86c

Merge branch 'feat/improve-logging' into feat/xgboost-rework

c961734

Merge remote-tracking branch 'origin/main' into feat/xgboost-rework

610ba72

raymondcen requested a review from SeanClay10 March 15, 2026 22:24

raymondcen and others added 3 commits March 15, 2026 15:29

Delete data/cleaned-text/llm_text/Adams_1989_20260312_115204.txt

6920c95

Delete data/cleaned-text/section_filter/Adams_1989_20260312_115204.txt

41c49a4

deleted .txt files

46dabc1

SeanClay10 approved these changes Mar 15, 2026

SeanClay10 merged commit 05477fa into main Mar 15, 2026
2 checks passed

SeanClay10 merged 32 commits into main from feat/xgboost-rework
