fix: restore Arrow nullable flags lost in Polars round-trips (ENG-375)#130

Merged
eywalker merged 8 commits into dev from eng-375/fix-polars-nullability
Apr 7, 2026

Conversation


@kurodo3 kurodo3 bot commented Apr 7, 2026

Summary

  • Polars sets all Arrow fields to nullable=True on output regardless of original schema, corrupting schema intent for int (non-optional) vs Optional[int] columns
  • Adds arrow_utils.restore_schema_nullability() — looks up each field by name in reference schemas captured before the Polars operation, and reinstates the original nullable flag. This is schema-intent-based, not data-based.
  • Replaces the previous infer_schema_nullable workaround in Join.static_process (which was data-based and would incorrectly mark Optional[T] columns as non-nullable when they happen to contain no nulls)
  • Applies the fix at all 6 Polars round-trip sites in FunctionNode and the join loop in Join

Why not infer_schema_nullable?

infer_schema_nullable derives nullable from whether the data actually contains nulls. An Optional[int] column with no current null values would be marked nullable=False — silently wrong. restore_schema_nullability uses the Arrow schemas from before the Polars conversion, which encode the user's type annotations.

Test plan

  • Unit tests for restore_schema_nullability (6 cases), including the Optional[T]-with-no-nulls case that infer_schema_nullable fails
  • Integration tests: FunctionNode.get_all_records, FunctionNode._iter_all_from_database, Join.op_forward with both non-nullable and nullable columns
  • New test test_join_preserves_nullable_optional_column_with_no_nulls specifically catches the infer_schema_nullable regression
  • All 14 new tests pass; existing test suite unaffected

Closes ENG-375

🤖 Generated with Claude Code

Polars converts all Arrow fields to nullable=True when producing its
Arrow output, corrupting schema intent for non-optional columns.

Add `arrow_utils.restore_schema_nullability()` which reinstates the
original nullable flag for each field by name using the reference
schemas captured before any Polars operation. This is schema-intent-based
(unlike `infer_schema_nullable` which is data-based and would incorrectly
mark Optional columns as non-nullable when they happen to have no nulls).

Apply the fix at all Polars round-trip sites:
- FunctionNode: all 5 join sites and the as_table sort
- Join.static_process: the per-iteration inner join loop
  (removes the previous infer_schema_nullable workaround)

Tests added (red-green-refactor):
- Unit tests for restore_schema_nullability including the Optional[T]
  correctness case that infer_schema_nullable fails
- Integration tests for FunctionNode.get_all_records,
  FunctionNode._iter_all_from_database, and Join.op_forward

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 97.67442% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/orcapod/utils/arrow_utils.py 95.00% 1 Missing ⚠️


kurodo3 bot and others added 2 commits April 7, 2026 03:24
Add TestJoinTagColumnNullability with three cases:

1. Shared tag columns with mixed nullable flags — both "id" (nullable=False)
   and "group" (nullable=True, Optional, no actual nulls) are preserved through
   the Polars inner join on those shared keys.

2. Non-shared tag columns from each side of a cartesian-product join — "id"
   (non-nullable) from source1 and "category" (nullable) from source2 both
   retain their flags in the result.

3. Three-way join (two Polars iterations) — verifies that restore_schema_
   nullability is applied correctly at each iteration, not just the last.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…creation

add_system_tag_columns and add_source_info were appending columns with
table.append_column(string_name, array), which Arrow defaults to
nullable=True. Those columns are always computed and never null, so
nullable=False is the correct schema intent.

This caused a regression in TestJoinOutputSchemaSystemTags after
restore_schema_nullability replaced infer_schema_nullable in Join:
restore_schema_nullability faithfully preserves nullable=True for these
columns (from the reference schema captured before the Polars join),
while the operator's output_schema prediction expected nullable=False.

Fix: pass an explicit pa.field(..., nullable=False) to append_column in
both functions, so the schema is correct from the moment of creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a schema-integrity regression where Polars round-trips (Arrow -> Polars -> Arrow) drop Arrow field nullability intent by outputting all fields as nullable=True. It introduces a schema-intent-based restoration step that re-applies the original nullable flags (and field types/metadata) using reference schemas captured before Polars operations.

Changes:

  • Added arrow_utils.restore_schema_nullability() to restore Arrow field nullability (and field type/metadata) from one or more reference schemas.
  • Updated FunctionNode and Join.static_process to capture pre-Polars schemas and restore nullability after Polars joins/sorts (replacing the prior data-based infer_schema_nullable workaround in Join).
  • Added unit + integration tests covering round-trip nullability preservation, including the “Optional[T] with no nulls” case.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/orcapod/utils/arrow_utils.py Adds restore_schema_nullability() and explicitly marks computed/system/source columns as non-nullable.
src/orcapod/core/operators/join.py Restores schema nullability after Polars join; removes infer_schema_nullable cast.
src/orcapod/core/nodes/function_node.py Restores schema nullability after Polars joins/sorts in multiple code paths.
tests/test_data/test_polars_nullability/test_restore_schema_nullability.py Unit tests for restore_schema_nullability, plus demonstrations of the Polars behavior.
tests/test_data/test_polars_nullability/test_function_node_nullability.py Integration tests ensuring FunctionNode + Join preserve nullable intent across joins.
tests/test_data/test_polars_nullability/__init__.py Package marker for the new test module.



import pyarrow as pa
import polars as pl
import pytest

Copilot AI Apr 7, 2026


pytest is imported but not used in this test module. Please remove it to avoid unused-import noise (and potential lint failures if ruff/pyflakes is enabled in CI).

Suggested change
import pytest

assert joined.schema.field("score").nullable is True


class TestRestoreSchemaHullability:

Copilot AI Apr 7, 2026


Typo in class name: TestRestoreSchemaHullability should be TestRestoreSchemaNullability (or similar) to match the helper being tested.

Suggested change
class TestRestoreSchemaHullability:
class TestRestoreSchemaNullability:

Comment on lines +9 to +14
import pyarrow as pa
import pytest

import orcapod as op
from orcapod.core.nodes.function_node import FunctionNode
from orcapod.databases import InMemoryArrowDatabase

Copilot AI Apr 7, 2026


pytest is imported but not used in this module. Please remove it to avoid unused-import noise (and potential lint failures if linting is enabled).

Comment on lines +1169 to +1173
@@ -1093,7 +1170,11 @@ def add_source_info(
[f"{source_val}::{col}" for source_val in source_column],
type=pa.large_string(),
)
table = table.append_column(f"{constants.SOURCE_PREFIX}{col}", source_column)
# Source info columns are always computed strings, never null.

Copilot AI Apr 7, 2026


source_column is overwritten inside the per-column loop, so the source-info token accumulates previous column names (e.g., src::col1::col2) rather than staying src::col2 for the col2 source column. Preserve the original per-row source_info values (e.g., keep a separate base list/array) and build each _source_{col} column from that base.

Comment on lines 1370 to +1377
if column_config.sort_by_tags:
output_table_schema = output_table.schema
output_table = (
pl.DataFrame(output_table)
.sort(by=self.keys()[0], descending=False)
.to_arrow()
)
output_table = arrow_utils.restore_schema_nullability(output_table, output_table_schema)

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR restores nullability after Polars .sort(...).to_arrow() in FunctionNode.as_table, but there is another Polars sort round-trip in FunctionPodStream.as_table (src/orcapod/core/function_pod.py, around the column_config.sort_by_tags block) that still drops nullable flags. Consider applying restore_schema_nullability there too for consistent behavior across stream/table materialization paths.

- Remove unused `import pytest` from both test modules
- Rename TestRestoreSchemaHullability → TestRestoreSchemaNullability (typo fix)
- Fix source_column accumulation bug in add_source_info: each _source_<col>
  column was being built from the array produced for the previous column,
  causing tokens like "src::col1::col2" instead of "src::col2". Introduce
  base_source captured once before the loop so every column is independently
  derived from the original per-row source tokens.
- Apply restore_schema_nullability after the Polars sort in
  FunctionPodStream.as_table (function_pod.py) to match the same fix already
  applied in FunctionNode.as_table, covering the stream materialization path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author

@kurodo3 kurodo3 bot left a comment


Addressed all five review comments in commit 3218c4c:

1 & 3 — Unused import pytest
Removed from both test_restore_schema_nullability.py and test_function_node_nullability.py.

2 — Typo in class name
TestRestoreSchemaHullability → TestRestoreSchemaNullability.

4 — source_column accumulation bug in add_source_info
Real bug — thanks for catching it. The loop was reassigning source_column to the pa.array built for each column, so the next iteration started from already-appended values (e.g. "src::col1::col2" instead of "src::col2"). Fixed by introducing base_source (captured once before the loop) and building each per-column array from that base independently.
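The shape of the bug and the fix can be sketched in isolation (function and variable names here are illustrative stand-ins, not the actual add_source_info code):

```python
# Buggy shape: the per-row token list is reassigned each iteration, so
# iteration N builds on the tokens already produced for iteration N-1.
def tag_columns_buggy(base, columns):
    out = {}
    source_column = base
    for col in columns:
        source_column = [f"{v}::{col}" for v in source_column]  # accumulates!
        out[f"_source_{col}"] = source_column
    return out


# Fixed shape: capture the base once, derive every column from it independently.
def tag_columns_fixed(base, columns):
    out = {}
    base_source = list(base)
    for col in columns:
        out[f"_source_{col}"] = [f"{v}::{col}" for v in base_source]
    return out
```

With `base = ["src"]` and `columns = ["x", "y"]`, the buggy version produces `_source_y == ["src::x::y"]`, while the fixed version produces `["src::y"]`.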

5 — Missing restore_schema_nullability in FunctionPodStream.as_table
Applied the same pattern (capture schema before, restore after) to the sort_by_tags Polars sort in function_pod.py, consistent with the equivalent fix in FunctionNode.as_table.

kurodo3 bot and others added 2 commits April 7, 2026 04:32
…375)

TestAddSourceInfo in test_arrow_utils.py covers four cases:

1. Single column — baseline: _source_x = "src::x".
2. Multi-column — the regression case: each _source_<col> must equal
   "base::<col>", not "base::prev_col::<col>". With the bug, _source_y
   would be "base::x::y" because source_column was reused across iterations.
3. Per-row source tokens — same accumulation bug with a list input:
   _source_b row 0 must be "src0::b", not "src0::a::b".
4. Column count — one _source_<col> per data column, no more.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…G-375)

Remove assertion messages and docstring prose that described the buggy
output pattern. Tests now simply assert the exact contract — each
_source_<col> value is '<source_token>::<col>' — which is sufficient to
catch the regression without mentioning what the wrong value would be.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.



Comment on lines 1155 to 1158
if source_info is None or isinstance(source_info, str):
source_column = [source_info] * table.num_rows
base_source = [source_info] * table.num_rows
elif isinstance(source_info, Collection):
if len(source_info) != table.num_rows:

Copilot AI Apr 7, 2026


base_source is only assigned for source_info is None/str or when source_info is a Collection; any other runtime type will fall through and later raise an UnboundLocalError when building col_source. Add an explicit else: raise TypeError(...) (or normalize other iterables) so callers get a clear error and the function can’t crash with an unbound local.

Author


Fixed in d4bcd72 — added else: raise TypeError(...) after the elif branch so any unsized iterable (e.g. a generator) gets a clear error instead of an UnboundLocalError.
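The guard can be sketched as follows (a simplified stand-in for the real branch, with an invented helper name; note that the str check must precede the Collection check, since a str is itself a Collection):

```python
from collections.abc import Collection


def normalize_source_info(source_info, num_rows):
    """Return one source token per row, rejecting unsized iterables early."""
    if source_info is None or isinstance(source_info, str):
        return [source_info] * num_rows
    elif isinstance(source_info, Collection):
        if len(source_info) != num_rows:
            raise ValueError(
                f"source_info has {len(source_info)} entries for {num_rows} rows"
            )
        return list(source_info)
    else:
        # A generator has no __len__, so it is not a Collection: fail loudly
        # here instead of leaving the base list unbound downstream.
        raise TypeError(
            "source_info must be None, a str, or a sized Collection, "
            f"got {type(source_info).__name__}"
        )
```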

Comment on lines 176 to 184
# field names, preserving nullable flags — must be done BEFORE the
# Polars rename call below, which loses all nullability information.
next_ref_schema = pa.schema([
pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
for f in next_table.schema
])

if rename_map:
next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When rename_map is non-empty, this does an Arrow→Polars→Arrow round-trip just to rename, and then immediately converts next_table to Polars again for the join. This is an avoidable extra conversion (and can be expensive for large tables). Consider renaming in Arrow (compute new name list and call next_table.rename_columns(...)) or keep a pl.DataFrame(next_table).rename(rename_map) and pass that directly into the join without converting back to Arrow in between.

Suggested change
# field names, preserving nullable flags — must be done BEFORE the
# Polars rename call below, which loses all nullability information.
next_ref_schema = pa.schema([
pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
for f in next_table.schema
])
if rename_map:
next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()
# field names, preserving nullable flags before the Polars join,
# which loses original nullability information.
next_ref_schema = pa.schema([
pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
for f in next_table.schema
])
if rename_map:
next_table = next_table.rename_columns([
rename_map.get(name, name) for name in next_table.column_names
])

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d4bcd72 — replaced pl.DataFrame(next_table).rename(rename_map).to_arrow() with Arrow-native next_table.rename_columns([rename_map.get(name, name) for name in next_table.column_names]). Eliminates the round-trip entirely.

"""
Unit tests for the restore_schema_nullability helper in arrow_utils.

RED phase: all tests in this file must fail before the helper exists.

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring mentions a “RED phase” expectation that tests must fail before the helper exists, which will be misleading once merged (and isn’t actionable for future maintainers). Consider removing/updating this to describe the behavioral contract being tested rather than the development phase.

Suggested change
RED phase: all tests in this file must fail before the helper exists.
These tests document that Polars round-trips and joins can widen Arrow field
nullability, and verify that restore_schema_nullability reapplies the original
schema's nullable=False flags where appropriate.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d4bcd72 — removed the RED phase phrasing; docstring now describes the behavioral contract being validated.

Comment on lines +2 to +5
Integration tests: FunctionNode and Join preserve non-nullable column constraints
after the Arrow → Polars → Arrow round-trip that occurs during joins.

RED phase: tests should fail before the fix is applied.

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring states these tests are “RED phase” and should fail before the fix is applied; after merge this becomes stale and potentially confusing. Consider removing that phrasing and focusing the docstring on the nullability guarantees being validated.

Suggested change
Integration tests: FunctionNode and Join preserve non-nullable column constraints
after the Arrow → Polars → Arrow round-trip that occurs during joins.
RED phase: tests should fail before the fix is applied.
Integration tests validating that FunctionNode and Join preserve non-nullable
column constraints across the Arrow → Polars → Arrow round-trip that occurs
during joins and record retrieval.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d4bcd72 — same update: RED phase line removed, docstring now describes the nullability guarantees being tested.

- join.py: replace Polars rename round-trip with Arrow-native
  rename_columns(), eliminating an unnecessary Arrow→Polars→Arrow
  conversion before the join.

- arrow_utils.py: add explicit else/TypeError in add_source_info so
  callers passing an unsized iterable (e.g. a generator) get a clear
  error instead of an UnboundLocalError from base_source being unbound.

- test_function_node_nullability.py: remove stale "RED phase" phrasing
  from module docstring; replace with description of the behavioral
  contract being validated.

- test_restore_schema_nullability.py: same stale docstring fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3
Author

kurodo3 bot commented Apr 7, 2026

Second review round addressed in commit d4bcd72:

Polars rename round-trip (join.py): Replaced pl.DataFrame(next_table).rename(rename_map).to_arrow() with Arrow-native next_table.rename_columns(...). Eliminates an unnecessary Arrow→Polars→Arrow conversion that happened before the join.

base_source UnboundLocalError (arrow_utils.py): Added else: raise TypeError(...) after the elif isinstance(source_info, Collection) branch. Any unsized iterable (e.g. a generator, which doesn't satisfy Collection due to missing __len__) now gets a clear error message rather than an UnboundLocalError when the loop tries to use base_source.

Stale "RED phase" docstrings (both test files): Removed the development-phase phrasing from both module docstrings. Both now describe the behavioral contract being validated rather than the TDD workflow used to write them.

Resolve conflicts in function_node.py between our restore_schema_nullability
additions and dev's _filter_by_content_hash calls. In each conflict site,
apply the content-hash filter first (dev), then capture the schemas for
restore_schema_nullability (ours). The first conflict also absorbs dev's
rename of the join result variable from `joined` to `filtered` in
_get_cached_results_for_entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eywalker eywalker merged commit 54bb187 into dev Apr 7, 2026
9 checks passed
@eywalker eywalker deleted the eng-375/fix-polars-nullability branch April 7, 2026 06:22