Decimal unscale fails with empty column #2263

@berg2043

Description

Apache Iceberg version

0.9.1 (latest release)

Please describe the bug 🐞

After applying the fix from #1983 for decimal conversion, "conversion from NoneType to Decimal is not supported" is raised when a decimal column contains only nulls. Here's a snippet of code to replicate it:

from decimal import Decimal

import pyarrow as pa
from pyiceberg.io.pyarrow import pyarrow_to_schema
from pyiceberg.catalog import load_catalog
from pyiceberg.table.name_mapping import MappedField, NameMapping


warehouse_path = '/tmp'

catalog = load_catalog(
    "default",
    type = "sql",
    uri = f"sqlite:///{warehouse_path}/test",
    warehouse = f'file://{warehouse_path}',
)

catalog.create_namespace_if_not_exists(
  'test',
  {'location': f'file://{warehouse_path}'}
)

decimal8 = pa.array([Decimal("123.45"), Decimal("678.91")], pa.decimal128(8, 2))
decimal16 = pa.array([Decimal("12345679.123456"), Decimal("67891234.678912")], pa.decimal128(16, 6))
decimal19 = pa.array([Decimal("1234567890123.123456"), Decimal("9876543210703.654321")], pa.decimal128(19, 6))
empty_decimal8 = pa.array([None, None], pa.decimal128(8, 2))
empty_decimal16 = pa.array([None, None], pa.decimal128(16, 6))
empty_decimal19 = pa.array([None, None], pa.decimal128(19, 6))

table = pa.Table.from_pydict(
    {
        "decimal8": decimal8,
        "decimal16": decimal16,
        "decimal19": decimal19,
        "empty_decimal8": empty_decimal8,
        "empty_decimal16": empty_decimal16,
        "empty_decimal19": empty_decimal19,
    },
)

pa_schema = table.schema

name_mapping = NameMapping([
  MappedField(**{'field-id': i+1, 'names': [name]})
  for i, name
  in enumerate(pa_schema.names)
])

schema = pyarrow_to_schema(
  pa_schema,
  name_mapping
)

pyiceberg_table = catalog.create_table(
  'test.decimals',
  schema=table.schema,
)

pyiceberg_table.append(table)

My current fix to data_file_statistics_from_parquet_metadata is as follows, but I'm unsure what unintended consequences it might have:

if isinstance(stats_col.iceberg_type, DecimalType) and statistics.physical_type != "FIXED_LEN_BYTE_ARRAY":
    scale = stats_col.iceberg_type.scale
    # Only unscale when the statistics actually carry a min/max;
    # an all-null column has neither
    if statistics.min_raw is not None:
        col_aggs[field_id].update_min(unscaled_to_decimal(statistics.min_raw, scale))
    if statistics.max_raw is not None:
        col_aggs[field_id].update_max(unscaled_to_decimal(statistics.max_raw, scale))
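For context, a pure-Python stand-in for what the unscale step does (this is a hypothetical helper for illustration, not PyIceberg's unscaled_to_decimal): the raw statistics bytes hold a big-endian two's-complement unscaled integer, which is scaled down by 10**scale. Passing None here is exactly what triggers the error.

```python
from decimal import Decimal


def unscaled_bytes_to_decimal(raw: bytes, scale: int) -> Decimal:
    """Convert big-endian two's-complement unscaled bytes to a Decimal."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits
    return Decimal(unscaled).scaleb(-scale)


raw = (12345).to_bytes(4, byteorder="big", signed=True)
print(unscaled_bytes_to_decimal(raw, 2))  # 123.45
```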

I could not get the nightly build to install, so I'm unsure whether this still exists on main. I tested with 0.9.0 and did not run into this issue.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
