to_arrow_batch_reader returns a different schema than to_arrow #2250

@enkidulan

Description

Apache Iceberg version

main (development)

Please describe the bug 🐞

On the development version, the to_arrow_batch_reader method casts all string types to large_string, whereas the to_arrow method returns the schema as defined in the Parquet file. At first glance this looks like a bug, likely a regression from #1669.

Here is a script you can use to reproduce the issue:

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from uuid import uuid4
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType

catalog = load_catalog("default")

df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)

tbl = catalog.create_table(f"default.cities-{uuid4()}", schema=schema)

tbl.overwrite(df)


schema_to_arrow = tbl.scan().to_arrow().schema

schema_to_arrow_batch_reader = tbl.scan().to_arrow_batch_reader().schema

print("schema_to_arrow == schema_to_arrow_batch_reader", schema_to_arrow == schema_to_arrow_batch_reader)
print("\nschema_to_arrow")
print(schema_to_arrow)
print("\nschema_to_arrow_batch_reader")
print(schema_to_arrow_batch_reader)

Output:

schema_to_arrow == schema_to_arrow_batch_reader False

schema_to_arrow:
city: string
lat: double
long: double

schema_to_arrow_batch_reader:
city: large_string
  -- field metadata --
  PARQUET:field_id: '1'
lat: double
  -- field metadata --
  PARQUET:field_id: '2'
long: double
  -- field metadata --
  PARQUET:field_id: '3'

Notice that the to_arrow schema reports city: string, while the to_arrow_batch_reader schema reports city: large_string.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
