fix: align image filenames between parse_html and extract_images when Blank-Page divs present by voidborne-d · Pull Request #85 · datalab-to/chandra

voidborne-d · 2026-04-04T12:57:15Z

Problem

parse_html and extract_images generate different filenames for the same image when the document contains Blank-Page divs before Image/Figure divs.

parse_html counts all top-level divs in its div_idx counter (including Blank-Page), while extract_images iterates over chunks from parse_layout/parse_chunks, which filters out Blank-Page divs. This causes the two counters to diverge:

HTML: [Blank-Page(1), Text(2), Blank-Page(3), Image(4)]
                                               div_idx=4 -> hash_4_img.webp

Chunks: [Text, Image]  (Blank-Pages filtered)
               div_idx=2 -> hash_2_img.webp  <-- MISMATCH

The HTML/Markdown output references hash_4_img.webp, but the extracted image is saved as hash_2_img.webp, so the image link in the output is broken.

This affects any multi-page PDF where the OCR model emits Blank-Page labels for empty or near-empty pages.
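The divergence can be reproduced without the library. The following is a minimal sketch, not chandra's code: the label list, the `hash` prefix, and both helper functions are hypothetical stand-ins that only mimic the two counting styles described above.

```python
# Hypothetical labels mirroring the example above; not chandra's real data model.
divs = ["Blank-Page", "Text", "Blank-Page", "Image"]
IMAGE_LABELS = {"Image", "Figure"}


def names_counting_all_divs(divs, prefix="hash"):
    """Counting style described for parse_html: every top-level div advances the index."""
    names = []
    for div_idx, label in enumerate(divs, start=1):
        if label in IMAGE_LABELS:
            names.append(f"{prefix}_{div_idx}_img.webp")
    return names


def names_counting_chunks(divs, prefix="hash"):
    """Counting style described for the old extract_images: Blank-Page divs are dropped first."""
    chunks = [label for label in divs if label != "Blank-Page"]
    names = []
    for idx, label in enumerate(chunks):
        if label in IMAGE_LABELS:
            names.append(f"{prefix}_{idx + 1}_img.webp")
    return names


referenced = names_counting_all_divs(divs)
saved = names_counting_chunks(divs)
print(referenced)  # ['hash_4_img.webp']
print(saved)       # ['hash_2_img.webp']
```

Each Blank-Page div that precedes an image shifts the referenced index by one relative to the saved index, which is why documents without Blank-Page divs are unaffected.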

Fix

LayoutBlock: Add div_idx field to store the original 1-based position among all top-level divs.
parse_layout: Use enumerate(top_level_divs, start=1) and store the position in each LayoutBlock.div_idx, preserving the original count even when Blank-Page divs are skipped.
extract_images: Read chunk["div_idx"] (carried through parse_chunks -> asdict) instead of maintaining a separate counter.

The change is backward-compatible: div_idx defaults to 0 in the dataclass, and extract_images falls back to idx + 1 if the field is missing.
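The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the PR's diff: the `label` and `content` fields, the tuple input to `parse_layout`, and the `image_names` helper are hypothetical; only the `div_idx` field, the `enumerate(..., start=1)` counting, the `asdict` hand-off, and the `idx + 1` fallback come from the description above.

```python
from dataclasses import dataclass, asdict


@dataclass
class LayoutBlock:
    label: str          # hypothetical field
    content: str = ""   # hypothetical field
    div_idx: int = 0    # original 1-based position among all top-level divs


def parse_layout(top_level_divs):
    """top_level_divs: list of (label, content) tuples, a stand-in for parsed HTML."""
    blocks = []
    for div_idx, (label, content) in enumerate(top_level_divs, start=1):
        if label == "Blank-Page":
            continue  # the block is skipped, but div_idx keeps counting
        blocks.append(LayoutBlock(label=label, content=content, div_idx=div_idx))
    return blocks


def image_names(chunks, prefix="hash"):
    """Stand-in for the filename logic in extract_images."""
    names = []
    for idx, chunk in enumerate(chunks):
        if chunk["label"] not in ("Image", "Figure"):
            continue
        # Use the stored position; fall back to idx + 1 if the field is missing.
        div_idx = chunk.get("div_idx", idx + 1)
        names.append(f"{prefix}_{div_idx}_img.webp")
    return names


divs = [("Blank-Page", ""), ("Text", "a"), ("Blank-Page", ""), ("Image", "<img>")]
chunks = [asdict(block) for block in parse_layout(divs)]
print(image_names(chunks))  # ['hash_4_img.webp']

# Chunks produced without div_idx still get the old idx + 1 numbering.
legacy_chunks = [{"label": "Text"}, {"label": "Image"}]
print(image_names(legacy_chunks))  # ['hash_2_img.webp']
```

Storing the position on the block keeps a single source of truth for the index, so the filename no longer depends on which divs were filtered out along the way.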

Tests

17 new tests in tests/test_image_name_consistency.py:

5 regression tests -- verify image filenames match between parse_html and extract_images with Blank-Page divs (all 5 fail on the old code, proving the bug)
3 baseline tests -- no Blank-Page divs, normal operation still works
4 div_idx tracking tests -- verify parse_layout stores correct original positions
1 markdown test -- parse_markdown also references the correct filename
4 edge cases -- all blanks, missing img tag, empty HTML, no image blocks
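The contents of tests/test_image_name_consistency.py are not shown here. As a rough illustration of what a filename-consistency check can look like, the sketch below collects every `img` `src` from rendered HTML with the standard-library `html.parser` and reports those that have no saved counterpart. `ImgSrcCollector`, `missing_images`, and the sample HTML are hypothetical and not taken from the PR.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(rendered_html, saved_names):
    """Return referenced image filenames that are absent from saved_names."""
    collector = ImgSrcCollector()
    collector.feed(rendered_html)
    return [src for src in collector.sources if src not in saved_names]


html_out = '<p>intro</p><img src="hash_4_img.webp" alt="">'
print(missing_images(html_out, {"hash_4_img.webp"}))  # []
print(missing_images(html_out, {"hash_2_img.webp"}))  # ['hash_4_img.webp']
```

A regression test in this style would assert that the returned list is empty for documents that contain Blank-Page divs.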

… Blank-Page divs are present

parse_html counts all divs (including Blank-Page) in its div_idx counter, but extract_images iterates over chunks from parse_layout which filters out Blank-Page divs.

When a document has Blank-Page divs before Image/Figure divs, the two functions generate different filenames for the same image:
- HTML/Markdown references e.g. 'hash_3_img.webp'
- extract_images saves as e.g. 'hash_1_img.webp'

This causes broken image references in the output.

Fix: store the original div position (counting all divs) in LayoutBlock.div_idx during parse_layout, carry it through to chunks, and use it in extract_images to generate filenames consistent with parse_html.

Includes 17 tests covering:
- 5 regression tests for the Blank-Page mismatch bug (all fail on old code)
- 3 baseline tests (no Blank-Page, normal operation)
- 4 parse_layout div_idx tracking tests
- 1 markdown output consistency test
- 4 edge cases (all blanks, missing img tag, empty HTML, no images)

voidborne-d wants to merge 1 commit into datalab-to:master from voidborne-d:fix/image-name-mismatch-blank-page
