Update to anchorite 0.2.0 and delegate core functionality to library #6

Merged
folded merged 13 commits into main from anchorite-0.2.0-compat
Mar 16, 2026

Conversation

@folded
Owner

@folded folded commented Mar 16, 2026

Summary

  • Delegates bbox_alignment, document chunking, range_ops, and settings to anchorite 0.2.0, removing the internal copies
  • Updates Anchor construction from box=BBox(...) (singular) to boxes=(BBox(...),) (tuple) to match the new API
  • Updates tests from fitz/anchorite.document.fitz mocks to pypdfium2 mocks, since anchorite.document now uses pypdfium2
  • Fixes pre-existing syntax error: missing class keyword in _LayoutProcessor
  • Converts hubble_docai_bboxes fixture from pickle (referenced anchorite.types which no longer exists) to JSON
  • Regenerates golden files for the new multi-box span format in annotated Markdown
  • Bumps version to 0.5.0
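The change from `box=` to `boxes=` can be sketched roughly as follows. The `BBox` and `Anchor` classes here are hypothetical stand-ins written for illustration; anchorite's real field names and types are not shown in this PR and may differ.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for anchorite's BBox and Anchor. The real classes
# may use different field names; only the boxes= tuple shape comes from the PR.
@dataclass(frozen=True)
class BBox:
    page: int
    top: int
    left: int
    bottom: int
    right: int


@dataclass(frozen=True)
class Anchor:
    text: str
    boxes: tuple[BBox, ...]  # one anchor may cover several boxes


def single_box_anchor(text: str, box: BBox) -> Anchor:
    """Wrap a single box in a one-element tuple, i.e. boxes=(BBox(...),)."""
    return Anchor(text=text, boxes=(box,))


anchor = single_box_anchor(
    "Hubble", BBox(page=0, top=10, left=20, bottom=30, right=120)
)
```

Call sites that previously passed one box keep their meaning under this shape, while a span that wraps across lines could carry more than one box.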

Test plan

  • All 18 tests pass
  • Lint clean (ruff check + ruff format)
  • Golden files regenerated and verified

🤖 Generated with Claude Code

folded and others added 4 commits January 8, 2026 21:17
- Propagate mime_type to document chunks

- Add simplified magic number inference for images

- Add tests for mime type inference and propagation
- Delegate bbox alignment, document chunking, range ops, and settings to
  anchorite 0.2.0, removing the internal copies (bbox_alignment.py,
  document.py, range_ops.py, settings.py)
- Update Anchor construction: box= (singular) → boxes= (tuple of BBox)
- Update fitz → pypdfium2 mocks in tests (anchorite.document now uses pypdfium2)
- Fix missing `class` keyword in _LayoutProcessor (syntax error)
- Convert hubble_docai_bboxes fixture from pickle (anchorite.types) to JSON
- Regenerate golden files for new multi-box span format
- Bump version to 0.5.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
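
The "simplified magic number inference for images" mentioned in the first commit might look something like the sketch below. The function name and the set of formats are assumptions; the PR does not show which signatures the actual code checks. The byte signatures themselves are the standard ones for each format.

```python
from typing import Optional


def infer_image_mime_type(data: bytes) -> Optional[str]:
    """Guess an image MIME type from leading bytes (hypothetical helper)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    # TIFF has little-endian and big-endian variants.
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "image/tiff"
    return None  # unknown: the caller decides on a fallback
```

Returning `None` for unrecognized input leaves the fallback decision (for example, trusting a file extension) to the caller.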
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the gemini-ocr project by integrating the anchorite library to manage core document processing and alignment functionalities. This change streamlines the codebase, standardizes API interactions for bounding boxes and document chunks, and enhances flexibility in configuring OCR and markdown generation providers. The update also includes minor bug fixes and improvements to the command-line interface.

Highlights

  • Core Functionality Delegation: The project now delegates bbox_alignment, document chunking, range operations, and settings management to the anchorite library (version 0.2.0), removing internal implementations of these features.
  • API Modernization: The Anchor construction has been updated from a singular box=BBox(...) to a tuple boxes=(BBox(...),) to align with the new anchorite API.
  • Testing Infrastructure Update: Tests have been updated to use pypdfium2 mocks instead of fitz/anchorite.document.fitz mocks, reflecting the anchorite.document module's new reliance on pypdfium2.
  • New Gemini Prompt Argument: A new --gemini-prompt command-line argument has been introduced in run_ocr.py to allow appending additional instructions to the default Gemini prompt.
  • Configuration Refactor: The Settings class has been removed, and configuration is now handled through a from_env function that constructs appropriate provider classes (e.g., GeminiMarkdownProvider, DocAIMarkdownProvider, DocAIAnchorProvider) based on environment variables.
  • Golden File Regeneration: Golden files for annotated Markdown and bounding box fixtures have been regenerated to accommodate the new multi-box span format and the transition from pickle to JSON for hubble_docai_bboxes.
  • Version Bump: The project version has been updated from 0.4.0 to 0.5.0.
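
The configuration refactor described above, where a `from_env` function picks provider classes from environment variables, could be sketched as below. The provider classes are minimal stand-ins, and every environment variable name, default value, and constructor parameter is hypothetical; the real `from_env` signature and variables are not shown in this PR.

```python
import enum
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional, Union


class OcrMode(enum.Enum):  # stand-in for the PR's _OcrMode enum
    GEMINI = "gemini"
    DOCAI = "docai"


# Minimal stand-ins for the provider classes named in the PR.
@dataclass
class GeminiMarkdownProvider:
    model: str


@dataclass
class DocAIMarkdownProvider:
    processor_id: str


@dataclass
class DocAIAnchorProvider:
    processor_id: str


MarkdownProvider = Union[GeminiMarkdownProvider, DocAIMarkdownProvider]


def from_env(
    env: Optional[Mapping[str, str]] = None,
) -> tuple[MarkdownProvider, DocAIAnchorProvider]:
    """Build providers from environment variables (all names hypothetical)."""
    env = os.environ if env is None else env
    mode = OcrMode(env.get("EXAMPLE_OCR_MODE", "gemini"))
    anchors = DocAIAnchorProvider(processor_id=env["EXAMPLE_DOCAI_OCR_PROCESSOR"])
    if mode is OcrMode.DOCAI:
        markdown: MarkdownProvider = DocAIMarkdownProvider(
            processor_id=env["EXAMPLE_DOCAI_LAYOUT_PROCESSOR"]
        )
    else:
        markdown = GeminiMarkdownProvider(
            model=env.get("EXAMPLE_GEMINI_MODEL", "example-model")
        )
    return markdown, anchors
```

Accepting an optional mapping instead of reading `os.environ` directly is a design choice of this sketch; it lets tests pass a plain dict rather than patching the process environment.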

Changelog
  • pyproject.toml
    • Bumped project version to 0.5.0.
    • Added anchorite==0.2.0 as a project dependency.
    • Removed seq-smith from direct dependencies.
    • Added [tool.uv.sources] section for editable anchorite path.
  • run_ocr.py
    • Added --gemini-prompt command-line argument.
    • Passed gemini_prompt argument to the process_document function.
  • src/gemini_ocr/__init__.py
    • Removed Settings import and export.
    • Added from_env import and export from gemini_ocr.gemini_ocr.
  • src/gemini_ocr/bbox_alignment.py
    • Removed file, as functionality is now handled by anchorite.
  • src/gemini_ocr/docai.py
    • Refactored _call_docai and _generate_cache_path to accept individual parameters instead of a settings.Settings object.
    • Imported DocumentChunk from anchorite.document.
    • Removed imports for gemini_ocr.document and gemini_ocr.settings.
  • src/gemini_ocr/docai_layout.py
    • Renamed TableCell to _TableCell and LayoutProcessor to _LayoutProcessor to indicate internal usage.
    • Replaced generate_markdown function with DocAIMarkdownProvider class inheriting from anchorite.providers.MarkdownProvider.
    • Updated imports to use anchorite.document.DocumentChunk and anchorite.providers.MarkdownProvider.
    • Removed imports for gemini_ocr.document and gemini_ocr.settings.
  • src/gemini_ocr/docai_ocr.py
    • Replaced _run_document_ai and generate_bounding_boxes functions with DocAIAnchorProvider class inheriting from anchorite.providers.AnchorProvider.
    • Updated bounding box creation to use anchorite.Anchor and anchorite.BBox with a tuple of boxes.
    • Updated imports to use anchorite types and DocumentChunk from anchorite.document.
    • Removed imports for gemini_ocr.document and gemini_ocr.settings.
  • src/gemini_ocr/docling.py
    • Replaced generate_markdown function with DoclingMarkdownProvider class inheriting from anchorite.providers.MarkdownProvider.
    • Updated imports to use anchorite.document.DocumentChunk and anchorite.providers.MarkdownProvider.
    • Removed imports for gemini_ocr.document and gemini_ocr.settings.
  • src/gemini_ocr/document.py
    • Removed file, as functionality is now handled by anchorite.
  • src/gemini_ocr/gemini.py
    • Replaced _call_gemini, _generate_cache_path, and generate_markdown functions with GeminiMarkdownProvider class inheriting from anchorite.providers.MarkdownProvider.
    • Updated generate_markdown to accept an optional prompt argument.
    • Updated imports to use anchorite.document.DocumentChunk and anchorite.providers.MarkdownProvider.
    • Removed imports for gemini_ocr.document and gemini_ocr.settings.
  • src/gemini_ocr/gemini_ocr.py
    • Removed RawOcrData, OcrResult, _generate_markdown_for_chunk, _batched_gather, and extract_raw_data.
    • Introduced _OcrMode enum for processing modes.
    • Added FixedMarkdownProvider class.
    • Added from_env function to construct providers from environment variables.
    • Updated process_document to accept MarkdownProvider and AnchorProvider objects and utilize anchorite.process_document.
    • Removed imports of bbox_alignment, document, settings, collections, itertools, re, and typing.
    • Added imports for anchorite, anchorite.document, anchorite.providers, and specific providers from gemini_ocr.
  • src/gemini_ocr/range_ops.py
    • Removed file, as functionality is now handled by anchorite.
  • src/gemini_ocr/settings.py
    • Removed file, as settings are now managed by provider classes and the from_env function.
  • tests/fixtures/hubble_docai_bboxes.json
    • Added new JSON fixture for Document AI bounding boxes, replacing the previous pickle format.
  • tests/fixtures/hubble_docai_golden.md
    • Modified golden file to reflect updated span format and minor text adjustments.
  • tests/fixtures/hubble_gemini_responses.json
    • Modified golden file to reflect updated span format and minor text adjustments.
  • tests/fixtures/hubble_golden.md
    • Modified golden file to reflect updated span format and minor text adjustments.
  • tests/test_bbox_alignment.py
    • Removed file, as tests for bbox_alignment are no longer relevant.
  • tests/test_coverage.py
    • Removed file, as tests for coverage calculation are no longer relevant in the new architecture.
  • tests/test_docai_layout.py
    • Updated references to LayoutProcessor to _LayoutProcessor.
    • Removed gemini_ocr.document import.
  • tests/test_docai_mode.py
    • Updated imports to use anchorite.document and specific provider classes.
    • Removed fitz and settings imports.
    • Updated test_process_document_docai_mode to use the new provider-based API and anchorite.Anchor types.
  • tests/test_gcs_support.py
    • Updated document.chunks calls to anchorite.document.chunks.
    • Removed gemini_ocr.document import.
  • tests/test_mime_type_inference.py
    • Added new test file for mime type inference using anchorite.document.chunks.
  • tests/test_missed_matches.py
    • Removed file, as tests for missed matches are no longer relevant.
  • tests/test_model_config.py
    • Updated settings.Settings to GeminiMarkdownProvider.
    • Updated document.DocumentChunk to anchorite.document.DocumentChunk.
    • Updated gemini.generate_markdown call to provider.generate_markdown.
  • tests/test_ocr_annotation.py
    • Removed file, as tests for OCR annotation are no longer relevant.
  • tests/test_ocr_nesting.py
    • Removed file, as tests for OCR nesting are no longer relevant.
  • tests/test_range_ops.py
    • Removed file, as tests for range operations are no longer relevant.
  • tests/test_regression.py
    • Updated hubble_docai_bboxes fixture loading from pickle to JSON and converting to anchorite.Anchor objects.
    • Updated mock functions to align with the new provider-based API.
    • Updated gemini_ocr.process_document calls to explicitly pass markdown and anchor providers.
    • Removed settings and pickle imports.
  • tests/test_renumbering.py
    • Removed file, as renumbering functionality is now handled by anchorite.
  • tests/test_settings.py
    • Refactored tests to validate the new from_env function and provider instantiation instead of the removed Settings class.
  • uv.lock
    • Updated gemini-ocr version to 0.5.0.
    • Added anchorite and pypdfium2 as dependencies.
    • Removed seq-smith from gemini-ocr dependencies.
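
The pickle-to-JSON fixture conversion listed above for tests/test_regression.py can be illustrated with a small round trip. A pickle records the import path of each class it contains, so a fixture that referenced anchorite.types stops loading once that module is gone; JSON stores only plain data and leaves object construction to the test. The `BBox` and `Anchor` classes and the JSON schema below are hypothetical stand-ins, not the fixture's actual layout.

```python
import dataclasses
import json


# Hypothetical stand-ins for anchorite.BBox and anchorite.Anchor.
@dataclasses.dataclass(frozen=True)
class BBox:
    page: int
    top: int
    left: int
    bottom: int
    right: int


@dataclasses.dataclass(frozen=True)
class Anchor:
    text: str
    boxes: tuple[BBox, ...]


def anchors_to_json(anchors: list[Anchor]) -> str:
    """Serialize anchors to plain JSON; tuples become JSON arrays."""
    return json.dumps([dataclasses.asdict(a) for a in anchors])


def anchors_from_json(payload: str) -> list[Anchor]:
    """Rebuild Anchor objects from the JSON produced above."""
    return [
        Anchor(text=r["text"], boxes=tuple(BBox(**b) for b in r["boxes"]))
        for r in json.loads(payload)
    ]


original = [Anchor("Hubble", (BBox(0, 10, 20, 30, 120),))]
restored = anchors_from_json(anchors_to_json(original))
```

Because the loader names the classes explicitly, a later rename or module move only requires updating the loader, not regenerating the fixture.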
Activity
  • No human activity has been recorded on this pull request yet.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a major refactor that migrates the core document processing logic to the anchorite library. It removes the internal document, bbox_alignment, range_ops, and settings modules and replaces them with anchorite's DocumentChunk, AnchorProvider, and MarkdownProvider interfaces. The pyproject.toml was updated to add the anchorite dependency and remove seq-smith. The run_ocr.py script now accepts an optional --gemini-prompt argument. The Document AI and Gemini integration modules (docai.py, docai_layout.py, docai_ocr.py, gemini.py) were refactored into provider classes (DocAIMarkdownProvider, DocAIAnchorProvider, GeminiMarkdownProvider) that implement anchorite's provider interfaces, centralizing configuration and logic. The main gemini_ocr.py module now uses these providers together with anchorite.process_document, which simplifies the overall workflow, and providers can be built from environment variables via the new from_env function.
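
As a rough illustration of how the --gemini-prompt flag and a provider class might fit together, consider the sketch below. The flag name comes from the PR; the default prompt text, the `GeminiMarkdownProviderSketch` class, and its `build_prompt` method are hypothetical and do not reflect the actual GeminiMarkdownProvider.

```python
import argparse
from typing import Optional, Sequence

# Placeholder text; the project's real default prompt is not shown in the PR.
DEFAULT_PROMPT = "Transcribe this document as Markdown."


class GeminiMarkdownProviderSketch:
    """Hypothetical provider holding its own configuration."""

    def __init__(self, extra_prompt: Optional[str] = None) -> None:
        self._extra_prompt = extra_prompt

    def build_prompt(self) -> str:
        # Append the extra instructions to the default prompt, if given.
        if not self._extra_prompt:
            return DEFAULT_PROMPT
        return f"{DEFAULT_PROMPT}\n\n{self._extra_prompt}"


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="OCR runner (sketch)")
    parser.add_argument(
        "--gemini-prompt",
        default=None,
        help="Additional instructions appended to the default Gemini prompt.",
    )
    return parser.parse_args(argv)


args = parse_args(["--gemini-prompt", "Preserve footnotes."])
provider = GeminiMarkdownProviderSketch(extra_prompt=args.gemini_prompt)
```

Keeping the prompt on the provider instance means the command-line script only has to construct the provider; nothing downstream needs to thread a prompt argument through.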

folded added 9 commits March 16, 2026 19:07
Replace all `from x import ClassName` with module-level imports and
qualified names throughout src/ and tests/ to match Google Python Style
Guide requirements.

Replace all `import anchorite.document` / `import anchorite.providers`
with a single `import anchorite`, relying on anchorite's __init__.py
re-exporting its submodules.

Remove process_document wrapper; callers now use anchorite.process_document
directly. gemini_ocr's public API is now the provider classes and from_env().
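
The import style adopted in these commits can be shown with standard-library modules standing in for anchorite. The `Span` class and `count_by_start` function are hypothetical examples; the second half relies on the fact that `import os` also makes `os.path` available, which is the same kind of re-export the commits assume of anchorite's __init__.py.

```python
# Module-level imports with qualified names, instead of
# `from collections import Counter` or `from dataclasses import dataclass`.
import collections
import dataclasses
import os


@dataclasses.dataclass(frozen=True)
class Span:  # hypothetical example type
    start: int
    end: int


def count_by_start(spans: list[Span]) -> dict[int, int]:
    """Count spans per start offset using a qualified collections.Counter."""
    return dict(collections.Counter(span.start for span in spans))


# A single top-level import is enough when the package exposes the
# submodule itself: `os` binds `os.path` on import.
joined = os.path.join("tests", "fixtures", "hubble_golden.md")
```

Qualified names make it clear at each call site which module a class comes from, at the cost of slightly longer lines.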
@folded folded merged commit 2c4a38a into main Mar 16, 2026
2 checks passed
@folded folded deleted the anchorite-0.2.0-compat branch March 16, 2026 10:14