feat(#131): checkpointed pipeline + fix import path issues #398
Draft
redbbean wants to merge 2 commits into fireform-core:main from
implemented checkpointing/resuming pipeline:
- hash transcript + field names to avoid collisions from different transcripts
- atomic checkpoint writes to .tmp, then replacing previous checkpoint file
- handling for Ctrl+C using SIGINT
- retry loop for Ollama timeouts
- JSONL error logging for Ollama -1 responses
added docs required by issue fireform-core#131/fireform-core#133
have NOT added testing for the checkpointing yet
added import error resolution from issue fireform-core#116/fireform-core#117 and fireform-core#118/fireform-core#119
added updated Makefile test path from issue fireform-core#380/fireform-core#381
haven't completely closed fireform-core#131/fireform-core#133 yet
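For reviewers, here is a minimal sketch of the first two items in the commit message above: keying the checkpoint on a hash of the transcript plus field names, and writing it through a .tmp file that then replaces the previous checkpoint. It is not the code in this PR. The names checkpoint_key, write_checkpoint, and demo_checkpoint, the file naming scheme, and the choice to sort field names before hashing are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def checkpoint_key(transcript: str, field_names: list[str]) -> str:
    """Derive a stable key from transcript + field names (hypothetical helper)."""
    # Assumption: field order does not matter, so names are sorted before hashing.
    payload = json.dumps(
        {"transcript": transcript, "fields": sorted(field_names)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def write_checkpoint(path: Path, state: dict) -> None:
    """Write to <path>.tmp first, then swap it over the previous checkpoint."""
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    # os.replace overwrites the destination; on POSIX the rename is atomic
    # when both paths live on the same filesystem.
    os.replace(tmp_path, path)


def demo_checkpoint() -> dict:
    """Write two successive checkpoints and report what is left on disk."""
    with tempfile.TemporaryDirectory() as d:
        key = checkpoint_key("unit 12 responding", ["name", "date"])
        path = Path(d) / f"checkpoint_{key}.json"
        write_checkpoint(path, {"done": {"name": "Unit 12"}})
        write_checkpoint(path, {"done": {"name": "Unit 12", "date": "n/a"}})
        leftovers = [p.name for p in Path(d).iterdir()]
        return {"key": key, "state": json.loads(path.read_text()), "files": leftovers}
```

Because a reader only ever sees the old file or the fully written new one, a crash in the middle of a write should leave at worst a stray .tmp file rather than a truncated checkpoint.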
Description
Closes #131
Implements a checkpointed LLM extraction pipeline so that an interrupted run (container crash, Ollama timeout, Ctrl+C) resumes exactly where it left off without re-processing already-extracted fields.
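The resume behaviour described above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's main_loop(): run_with_resume, extract, and save are hypothetical names, the checkpoint is modelled as a plain dict of already-extracted fields, and the handler has to be installed from the main thread, as Python's signal module requires.

```python
import signal
from typing import Callable


def run_with_resume(
    fields: list[str],
    done: dict[str, str],
    extract: Callable[[str], str],
    save: Callable[[dict[str, str]], None],
) -> dict[str, str]:
    """Extract only the fields missing from `done`, saving after each one."""
    interrupted = False

    def on_sigint(signum, frame):
        # Only set a flag; the loop finishes the current field, saves, then stops.
        nonlocal interrupted
        interrupted = True

    previous = signal.signal(signal.SIGINT, on_sigint)
    try:
        for name in fields:
            if name in done:
                continue  # already extracted in an earlier run
            if interrupted:
                break
            done[name] = extract(name)
            save(done)
    finally:
        save(done)  # final flush, also reached after Ctrl+C
        signal.signal(signal.SIGINT, previous)  # restore the old handler
    return done
```

Passing the loaded checkpoint as done means a second run skips every field that is already present, which is what avoids re-processing after a crash or Ctrl+C.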
Also fixes a ModuleNotFoundError: No module named 'src' that prevented make exec from running, caused by an incomplete PYTHONPATH. Fixes #116 and #118.
src/llm.py
- main_loop()
- _get_field_names() handles _target_fields as either a dict (from pypdf) or a list
- atomic checkpoint writes via .tmp + os.replace()
- SIGINT handler so Ctrl+C flushes the checkpoint before exit
- JSONL error logging for Ollama -1 responses to help with debugging
- renamed json=None param to json_data=None
src/main.py
- from typing import Union (pre-existing bug from issue #)
docker-compose.yml + Dockerfile
- PYTHONPATH=/app/src changed to PYTHONPATH=/app:/app/src
docs/session_resumption.md (new)
Type of change
How Has This Been Tested?
PYTHONPATH fix: make exec runs successfully
Checkpointing logic: has not been tested
Test A: make exec completes without import errors
Test Configuration: Docker Desktop, python:3.11-slim, Ollama + Mistral
Checklist
Coming across this issue was interesting because I've implemented a similar checkpointing and resuming process in LLM benchmarking, so I was able to reuse some of the error log, checkpoint write, and retry loop logic. Please let me know if there's anything I can improve!
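For anyone reviewing the retry loop and error log mentioned above, a rough sketch of the idea follows. It is not the code in src/llm.py. call_with_retry and ask are hypothetical names, the Ollama call is replaced by an injected callable, timeouts are assumed to surface as TimeoutError, and "-1" is assumed to be the literal string the model returns when it cannot find a value.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def call_with_retry(
    ask: Callable[[str], str],
    field: str,
    error_log: Path,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Optional[str]:
    """Call ask(field), retrying on TimeoutError and logging "-1" answers as JSONL."""
    for attempt in range(1, max_attempts + 1):
        try:
            answer = ask(field)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller checkpoint and stop
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
            continue
        if answer.strip() == "-1":
            # One JSON object per line keeps the log appendable and easy to grep.
            record = {"field": field, "attempt": attempt, "response": answer}
            with open(error_log, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            return None
        return answer
    return None
```

Appending one JSON object per line means the log never has to be re-read or rewritten during a run, so an interrupted run cannot corrupt earlier entries.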