
PyArrowFile class is not compatible with ABFS uri syntax #2698

@NikitaMatskevich

Description

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Starting from version 20, PyArrow has support for Azure filesystems.

ABFS URIs have this format: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<file_name>

But the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<file_name>.

As you can see, the "@<account_name>.<dfs|blob>.core.windows.net" part prevents users from using the PyArrow file IO in an Azure environment. This issue can be fixed in PyIceberg by removing the account_name part.
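To make the mismatch concrete, here is a small sketch (plain string handling, not PyIceberg code) that converts a full ABFS URI into the form PyArrow expects:

```python
from urllib.parse import urlparse

# Location as assigned by the catalog (full ABFS URI, including the account host)
uri = "abfss://lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable"

parsed = urlparse(uri)
# netloc holds both the container and the account host, joined by "@"
container = parsed.netloc.split("@", 1)[0]

# The form PyArrow expects: abfs[s]://<file_system>/<path>
pyarrow_uri = f"{parsed.scheme}://{container}{parsed.path}"
print(pyarrow_uri)  # abfss://lakehouse-azure-bucket/testns/testtable
```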

The proposed fix is only meant to start a conversation around the issue; I am not 100% sure how or where this should be fixed.

We know that similar issues do not occur with the fsspec file IO.
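For comparison, this is roughly how a catalog configured with the fsspec IO would look (a sketch: the string keys stand in for pyiceberg's PY_IO_IMPL and ADLS_ACCOUNT_NAME constants, and the endpoint is a placeholder):

```python
# Same catalog setup, but selecting the fsspec IO implementation,
# which handles the "@<account_name>" authority correctly.
catalog_config = {
    "uri": "https://lakehouse.example/catalog",        # placeholder endpoint
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",  # PY_IO_IMPL constant
    "adls.account-name": "lakehouseaccount",           # ADLS_ACCOUNT_NAME constant
}
```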

Examples

We have a very basic setup with RestCatalog:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL

def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"

    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }

    return RestCatalog("lakehouse", **catalog_config)

When we create a table "testns.testtable", it is assigned the following location: abfss://lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable

Then, when we try to append data to the table:

import random

import pyarrow as pa

data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # ensure 'id' is int32 to match the Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)

it throws the following exception:

OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.

This is because the exists() method is called:

File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366     if not overwrite and self.exists() is True:

And it expects the URI without "@lakehouseaccount.dfs.core.windows.net". When we monkey-patch PyArrowFile.__init__, everything works fine:

from pyarrow.fs import FileSystem
from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile

PyArrowFile.old_init = PyArrowFile.__init__

def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method
    self.old_init(location, path, fs, buffer_size)
    # Strip the "@<account_name>..." segment so the path matches what PyArrow expects
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")

PyArrowFile.__init__ = patched_init
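The remove_section_between_at_and_slash helper is not shown in the snippet above; one possible implementation (an assumption, not necessarily the exact code we used) is:

```python
import re

def remove_section_between_at_and_slash(path: str) -> str:
    # Drop everything from the first "@" up to (not including) the next "/",
    # i.e. the "@<account_name>.<dfs|blob>.core.windows.net" segment.
    return re.sub(r"@[^/]*", "", path, count=1)

print(remove_section_between_at_and_slash(
    "lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable"
))
# lakehouse-azure-bucket/testns/testtable
```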

It does not matter how or with which engine the table was created and written before: none of the PyArrow methods work, including those on the read path, so it is impossible to scan a non-empty table as well. We tested this by creating a table with the fsspec file IO and reading it with the PyArrow file IO.

It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the "@<account_name>" part.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
