feat!(storage): Allow StorageFactory::build to take table metadata#2247

Open
CTTY wants to merge 4 commits into apache:main from CTTY:ctty/storage-metadata

Conversation

@CTTY (Collaborator) commented Mar 17, 2026

Which issue does this PR close?

What changes are included in this PR?

  • Introduced a new API with_metadata to StorageFactory
  • Updated RestCatalog::load_file_io accordingly

Are these changes tested?

Relying on the existing tests

    "Unable to load file io, neither warehouse nor metadata location is set!",
    ));
    }

Collaborator Author (@CTTY):

I don't think we need these checks anymore, since we now rely on a configured StorageFactory

@CTTY added the breaking label Mar 17, 2026
    /// # Arguments
    ///
    /// * `metadata` - The table metadata to incorporate
    fn with_metadata(&self, metadata: &TableMetadata) -> Result<Arc<dyn StorageFactory>>;
Contributor:

I think this should be part of the build method?

Collaborator Author (@CTTY):

My concern about having build_with_metadata is that we would need to thread the metadata through FileIO / FileIOBuilder: something like FileIOBuilder::with_metadata(metadata) that stores it, with FileIO::get_storage then calling build_with_metadata when metadata is present and build when it's not. I think metadata can be heavy to pass around.

I think just having with_metadata is much lighter, and custom implementations can extract only what they need upfront rather than iceberg holding the metadata until build
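A minimal sketch of the "extract only what you need upfront" idea. All types here are simplified stand-ins, not the actual iceberg-rust definitions; the point is that with_metadata returns a new factory that keeps only the derived scheme, not the whole TableMetadata:

```rust
use std::sync::Arc;

// Simplified stand-ins for the real iceberg-rust types; the actual
// TableMetadata, Storage, and StorageFactory definitions are richer.
struct TableMetadata {
    location: String,
}

trait Storage {
    fn scheme(&self) -> String;
}

struct ResolvedStorage {
    scheme: String,
}

impl Storage for ResolvedStorage {
    fn scheme(&self) -> String {
        self.scheme.clone()
    }
}

trait StorageFactory {
    // The proposed API: return a new factory specialized for one table.
    fn with_metadata(&self, metadata: &TableMetadata) -> Arc<dyn StorageFactory>;
    fn build(&self) -> Arc<dyn Storage>;
}

struct MyFactory {
    // Only the piece of metadata this factory cares about is retained,
    // instead of holding the whole TableMetadata until build time.
    scheme: Option<String>,
}

impl StorageFactory for MyFactory {
    fn with_metadata(&self, metadata: &TableMetadata) -> Arc<dyn StorageFactory> {
        // Extract just the scheme from the table location upfront.
        let scheme = metadata.location.split("://").next().map(|s| s.to_string());
        Arc::new(MyFactory { scheme })
    }

    fn build(&self) -> Arc<dyn Storage> {
        Arc::new(ResolvedStorage {
            scheme: self.scheme.clone().unwrap_or_else(|| "file".to_string()),
        })
    }
}
```

The trade-off discussed below is that this adds a second entry point to the trait alongside build.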

Contributor:

I'm not convinced. For a pub trait, providing two incompatible entry methods makes things difficult for both the caller and the callee. As a caller, when should I call which method? What contract should we follow? As a callee, why do I need to implement two incompatible methods?

As for storing the table metadata, it's really a cost of the optimization of building storage lazily. We have to sacrifice one or the other, e.g. pay for storing the table metadata, or build storage eagerly.

Collaborator Author (@CTTY):

I think removing lazy storage makes sense to me; right now I don't see it being useful. When users are using FileIO, they will want to initialize Storage anyway

Collaborator Author (@CTTY):

I've updated the API to be:

    fn build(
        &self,
        config: &StorageConfig,
        metadata: Option<&TableMetadata>,
    ) -> Result<Arc<dyn Storage>>;

In some places you will have to construct a FileIO before you have a metadata object, and I think we should at least allow users to pass None in that case. An alternative is to pass in a "default" TableMetadata, but that seems hacky
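A sketch of how an implementation of the updated signature might branch on the optional metadata. Everything except the shape of build is an assumption for illustration (the real StorageConfig, TableMetadata, and Storage types differ):

```rust
// Simplified stand-ins for the real iceberg-rust types; names other than
// the shape of `build` itself are assumptions for illustration.
struct StorageConfig;

struct TableMetadata {
    location: String,
}

#[derive(Debug, PartialEq)]
enum Storage {
    // The scheme was resolved once, from the table location, at build time.
    Fixed(String),
    // No metadata was available: fall back to resolving the scheme from
    // each path at IO time (an implementation could also choose to fail).
    ResolvePerIo,
}

struct MyFactory;

impl MyFactory {
    // Mirrors the updated signature: metadata is optional because a FileIO
    // may be constructed before any TableMetadata exists.
    fn build(&self, _config: &StorageConfig, metadata: Option<&TableMetadata>) -> Storage {
        match metadata {
            Some(meta) => {
                // Resolve the scheme once from the table location.
                let scheme = match meta.location.find("://") {
                    Some(idx) => &meta.location[..idx],
                    None => "file",
                };
                Storage::Fixed(scheme.to_string())
            }
            None => Storage::ResolvePerIo,
        }
    }
}
```

With this shape, a caller that has metadata gets a pre-resolved storage, while a caller that constructs FileIO early simply passes None and accepts the fallback behavior.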

@CTTY changed the title from "feat!(storage): Add with_metadata to storage factory, update load_file_io" to "feat!(storage): Allow StorageFactory::build to take table metadata" Mar 23, 2026
    fn build(
        &self,
        config: &StorageConfig,
        metadata: Option<&TableMetadata>,
Contributor:

After second thought, I'm still not sure this is the right direction to go. IIUC, the motivation for adding this param is to improve the performance of resolving storage by avoiding checking the URL every time, but it adds complexity to the implementation. It's essentially useless for non-resolving storage. For resolving storage, what should we do if it's None?
Or should we ask whether resolving every time is a true problem?

Collaborator Author (@CTTY):

> the motivation for adding this param is to improve the performance of resolving storage by avoiding checking the URL every time

Yes

> For resolving storage, what should we do if it's None?

It's mostly up to the custom implementations. It should either fail or return a resolving storage that resolves schemes before every IO

> Or should we ask whether resolving every time is a true problem?

I'm also not sure about this when I think about it again; the perf improvement can actually be trivial (resolving schemes vs IO operations is usually microseconds vs milliseconds). cc @vustef, do you have more context maybe?
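For scale: "resolving" here is essentially a prefix parse of the path, which is cheap relative to the network round trip of the IO itself. A minimal sketch of what a per-IO resolution step amounts to (the fallback to "file" for scheme-less paths is an assumption of this sketch, not iceberg-rust behavior):

```rust
// What per-IO scheme resolution amounts to: a prefix parse of the path.
// This is cheap (well under a microsecond) relative to an object-store
// request (milliseconds), which is why the saving may be trivial.
fn resolve_scheme(path: &str) -> &str {
    match path.find("://") {
        Some(idx) => &path[..idx],
        // Assumption for this sketch: no scheme means local filesystem.
        None => "file",
    }
}
```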

Reply:

Maybe it's far-fetched, but I do have some use cases in mind for adding table information to the storage. For example, if we want to implement a refreshable storage that refreshes vended storage credentials when they expire, it has to talk to the catalog to vend credentials again. But by default it has no information about which table to vend credentials for, so passing table information (the name, in this case) is needed. Similarly, passing table_location can be useful for optimizing storage resolution (s3 vs azure, etc.).

I don't know how big the perf difference is, but I feel like we could do thousands of IOPS, and resolving storage on every IO could be an anti-pattern when we could do it just once. That said, if the perf impact is not big, it may not be required just for that. I'd let you decide whether the above pattern, in combination with this one, is enough to warrant the ability to attach some additional table information to the storage.
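The refreshable-credentials pattern described above could look roughly like this. Every name here is hypothetical (none of it is an iceberg-rust API); the point is that the storage captures the table identity from table metadata at build time so it can re-vend credentials on expiry:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of a refreshable storage; not iceberg-rust APIs.
struct Credentials {
    token: String,
    expires_at: Instant,
}

// Stand-in for a catalog client able to re-vend credentials for a table.
trait CredentialVendor {
    fn vend(&self, table_name: &str) -> Credentials;
}

struct RefreshableStorage<V: CredentialVendor> {
    // Captured from table metadata at build time; without it the storage
    // would not know which table to vend credentials for.
    table_name: String,
    vendor: V,
    creds: Credentials,
}

impl<V: CredentialVendor> RefreshableStorage<V> {
    // Called before each IO: transparently refreshes expired credentials.
    fn current_token(&mut self) -> &str {
        if Instant::now() >= self.creds.expires_at {
            self.creds = self.vendor.vend(&self.table_name);
        }
        &self.creds.token
    }
}
```

This is the design argument in favor of passing table information to the factory: the refresh path needs a table identity that the storage layer otherwise never sees.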


Development

Successfully merging this pull request may close these issues.

Add StorageFactory::with_metadata

3 participants