
PyArrowFile class is not compatible with ABFS uri syntax #2698

@NikitaMatskevich

Description

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Starting from version 20, PyArrow has support for Azure filesystems.

ABFS URIs have this format: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<file_name>

But the PyArrow library expects the following path format for Azure: abfs[s]://<file_system>/<file_name>.

As you can see, the "@<account_name>.<dfs|blob>.core.windows.net" part prevents users from using the PyArrow file IO in an Azure environment. This issue can be fixed in PyIceberg by removing the account_name part.
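To make the mismatch concrete, here is a small sketch (plain string handling, not PyIceberg code) that converts a full ABFS URI into the form PyArrow expects:

```python
from urllib.parse import urlparse

# Location as assigned by the catalog (full ABFS URI, including the account host)
uri = "abfss://lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable"

parsed = urlparse(uri)
# netloc holds both the container and the account host, joined by "@"
container = parsed.netloc.split("@", 1)[0]

# The form PyArrow expects: abfs[s]://<file_system>/<path>
pyarrow_uri = f"{parsed.scheme}://{container}{parsed.path}"
print(pyarrow_uri)  # abfss://lakehouse-azure-bucket/testns/testtable
```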

The proposed fix is only meant to start a conversation around the issue; I am not 100% sure how or where this should be fixed.

We know that similar issues do not occur with the fsspec file IO.
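For comparison, this is roughly how a catalog configured with the fsspec IO would look (a sketch: the string keys stand in for pyiceberg's PY_IO_IMPL and ADLS_ACCOUNT_NAME constants, and the endpoint is a placeholder):

```python
# Same catalog setup, but selecting the fsspec IO implementation,
# which handles the "@<account_name>" authority correctly.
catalog_config = {
    "uri": "https://lakehouse.example/catalog",        # placeholder endpoint
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",  # PY_IO_IMPL constant
    "adls.account-name": "lakehouseaccount",           # ADLS_ACCOUNT_NAME constant
}
```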

Examples

We have a very basic setup with RestCatalog:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.io import ADLS_ACCOUNT_NAME, PY_IO_IMPL

def create_iceberg_catalog():
    CATALOG_URI = "https://lakehouse.../catalog"

    catalog_config = {
        "uri": CATALOG_URI,
        PY_IO_IMPL: "pyiceberg.io.pyarrow.PyArrowFileIO",
        ADLS_ACCOUNT_NAME: "lakehouseaccount",
    }

    return RestCatalog("lakehouse", **catalog_config)

When we create a table "testns.testtable", it is assigned the following location: abfss://lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable

Then, when we try to append data to the table:

import random

import pyarrow as pa

data = pa.table(
    {
        "id": pa.array(range(5), type=pa.int32()),  # ensure 'id' is int32 to match the Iceberg schema
        "value": [random.choice(["Heads", "Tails"]) for _ in range(5)],
    }
)
table.append(data)

it throws the following exception:

OSError: ListBlobsByHierarchy failed for prefix='aip_test/test_table-xxx/metadata/snap-xxx.avro'. GetFileInfo is unable to determine whether the path exists. Azure Error: [InvalidResourceName] 400 The specified resource name contains invalid characters.

This is because the exists() method is called:

File ~/.official-venvs/amd64.ipykernel-default.master/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:368, in PyArrowFile.create(self, overwrite)
    366     if not overwrite and self.exists() is True:

And it expects the URI without "@lakehouseaccount.dfs.core.windows.net". When we monkey-patch PyArrowFile.__init__, everything works fine:

from pyarrow.fs import FileSystem
from pyiceberg.io.pyarrow import ONE_MEGABYTE, PyArrowFile

PyArrowFile.old_init = PyArrowFile.__init__

def patched_init(self, location: str, path: str, fs: FileSystem, buffer_size: int = ONE_MEGABYTE):
    # Call the original __init__ method
    self.old_init(location, path, fs, buffer_size)
    # Strip the "@<account_name>..." segment so the path matches what PyArrow expects
    self._path = remove_section_between_at_and_slash(path)
    print("Logging: PyArrowFile initialized")

PyArrowFile.__init__ = patched_init
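The remove_section_between_at_and_slash helper is not shown in the snippet above; one possible implementation (an assumption, not necessarily the exact code we used) is:

```python
import re

def remove_section_between_at_and_slash(path: str) -> str:
    # Drop everything from the first "@" up to (not including) the next "/",
    # i.e. the "@<account_name>.<dfs|blob>.core.windows.net" segment.
    return re.sub(r"@[^/]*", "", path, count=1)

print(remove_section_between_at_and_slash(
    "lakehouse-azure-bucket@lakehouseaccount.dfs.core.windows.net/testns/testtable"
))
# lakehouse-azure-bucket/testns/testtable
```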

It does not matter how or with which engine the table was created and written before: none of the PyArrow methods work, including those on the read path, so it is impossible to scan a non-empty table as well. We tested this by creating a table with the fsspec file IO and reading it with the PyArrow file IO.

It is hard to test this behavior with Azurite, because Azurite URIs are different and do not contain the "@<account_name>" part.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
