fix: align image filenames between parse_html and extract_images when Blank-Page divs present #85

Open
voidborne-d wants to merge 1 commit into datalab-to:master from voidborne-d:fix/image-name-mismatch-blank-page

@voidborne-d
Problem

parse_html and extract_images generate different filenames for the same image when the document contains Blank-Page divs before Image/Figure divs.

parse_html counts all top-level divs in its div_idx counter (including Blank-Page), while extract_images iterates over chunks from parse_layout/parse_chunks, which filters out Blank-Page divs. This causes the two counters to diverge:

HTML: [Blank-Page(1), Text(2), Blank-Page(3), Image(4)]
                                               div_idx=4 -> hash_4_img.webp

Chunks: [Text, Image]  (Blank-Pages filtered)
               div_idx=2 -> hash_2_img.webp  <-- MISMATCH

The HTML/Markdown output references hash_4_img.webp, but the extracted image is saved as hash_2_img.webp, so the image link in the output is broken.

This affects any multi-page PDF where the OCR model emits Blank-Page labels for empty or near-empty pages.
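
The divergence can be reproduced in isolation with a minimal sketch. This is not the project's code: `html_side_index` and `chunk_side_index` are hypothetical stand-ins for the counting done in parse_html and extract_images, and the `hash` prefix is a placeholder for the real filename prefix.

```python
# Hypothetical labels standing in for the top-level divs of one document.
labels = ["Blank-Page", "Text", "Blank-Page", "Image"]


def html_side_index(labels):
    """Count every top-level div, Blank-Page included (the parse_html side)."""
    for div_idx, label in enumerate(labels, start=1):
        if label == "Image":
            return div_idx
    return None


def chunk_side_index(labels):
    """Count only divs that survive Blank-Page filtering (the extract_images side)."""
    chunks = [label for label in labels if label != "Blank-Page"]
    for idx, label in enumerate(chunks):
        if label == "Image":
            return idx + 1
    return None


# "hash" is a placeholder prefix, not the real hashing scheme.
referenced = f"hash_{html_side_index(labels)}_img.webp"
saved = f"hash_{chunk_side_index(labels)}_img.webp"
print(referenced, saved)
```

With two Blank-Page divs ahead of the image, the two counters differ by two, which matches the hash_4 versus hash_2 example above.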

Fix

  1. LayoutBlock: Add div_idx field to store the original 1-based position among all top-level divs.
  2. parse_layout: Use enumerate(top_level_divs, start=1) and store the position in each LayoutBlock.div_idx, preserving the original count even when Blank-Page divs are skipped.
  3. extract_images: Read chunk["div_idx"] (carried through parse_chunks -> asdict) instead of maintaining a separate counter.

The change is backward-compatible: div_idx defaults to 0 in the dataclass, and extract_images falls back to idx + 1 if the field is missing.
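
A minimal sketch of the three steps, under stated assumptions: `Block`, `layout_blocks`, and `image_names` are simplified, hypothetical stand-ins for LayoutBlock, parse_layout, and extract_images, and the `hash` prefix is a placeholder. Treating a stored value of 0 the same as a missing field is also an assumption of this sketch, not something the description confirms.

```python
from dataclasses import asdict, dataclass


@dataclass
class Block:
    """Simplified stand-in for LayoutBlock."""
    label: str
    div_idx: int = 0  # 0 means the original position was never recorded


def layout_blocks(labels):
    """Skip Blank-Page divs but keep each block's 1-based original position."""
    blocks = []
    for div_idx, label in enumerate(labels, start=1):
        if label == "Blank-Page":
            continue
        blocks.append(Block(label=label, div_idx=div_idx))
    return blocks


def image_names(chunks, prefix="hash"):
    """Build image filenames from the stored position, else from idx + 1."""
    names = []
    for idx, chunk in enumerate(chunks):
        if chunk["label"] != "Image":
            continue
        # Assumption: a missing field and the default 0 both trigger the fallback.
        div_idx = chunk.get("div_idx") or idx + 1
        names.append(f"{prefix}_{div_idx}_img.webp")
    return names


labels = ["Blank-Page", "Text", "Blank-Page", "Image"]
chunks = [asdict(block) for block in layout_blocks(labels)]
new_names = image_names(chunks)

# Chunks produced before the change carry no div_idx at all.
legacy_names = image_names([{"label": "Text"}, {"label": "Image"}])
print(new_names, legacy_names)
```

Because the position is recorded once, at the point where Blank-Page divs are filtered, the filename no longer depends on how many divs were dropped before the image.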

Tests

17 new tests in tests/test_image_name_consistency.py:

  • 5 regression tests -- verify image filenames match between parse_html and extract_images with Blank-Page divs (all 5 fail on the old code, proving the bug)
  • 3 baseline tests -- no Blank-Page divs, normal operation still works
  • 4 div_idx tracking tests -- verify parse_layout stores correct original positions
  • 1 markdown test -- parse_markdown also references the correct filename
  • 4 edge cases -- all blanks, missing img tag, empty HTML, no image blocks

… Blank-Page divs are present

parse_html counts all divs (including Blank-Page) in its div_idx counter,
but extract_images iterates over chunks from parse_layout which filters
out Blank-Page divs. When a document has Blank-Page divs before Image/Figure
divs, the two functions generate different filenames for the same image:
- HTML/Markdown references e.g. 'hash_3_img.webp'
- extract_images saves as e.g. 'hash_1_img.webp'
This causes broken image references in the output.

Fix: store the original div position (counting all divs) in LayoutBlock.div_idx
during parse_layout, carry it through to chunks, and use it in extract_images
to generate filenames consistent with parse_html.

Includes 17 tests covering:
- 5 regression tests for the Blank-Page mismatch bug (all fail on old code)
- 3 baseline tests (no Blank-Page, normal operation)
- 4 parse_layout div_idx tracking tests
- 1 markdown output consistency test
- 4 edge cases (all blanks, missing img tag, empty HTML, no images)