feat: delete orphaned files #1958

Closed
jayceslesar wants to merge 36 commits into apache:main from jayceslesar:feat/orphan-files

Conversation

@jayceslesar
Contributor

Closes #1200

Rationale for this change

Ability to do more table maintenance from pyiceberg (iceberg-python?)

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

Contributor

@Fokko left a comment

Thanks for working on this @jayceslesar, sorry for the late review.

I think this is a great start, I left some comments, let me know what you think!

Contributor

@smaheshwar-pltr left a comment

Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄

Contributor

@kevinjqliu left a comment

Thanks for the PR! I added a few comments. ptal :)

@kevinjqliu
Contributor

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how .inspect is used?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

@jayceslesar
Contributor Author

jayceslesar commented May 4, 2025

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how .inspect is used?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

I think that makes sense -- would #1880 end up there too?

Also ideally there is a CLI that exposes all the maintenance actions too right?

I think moving things to a new OptimizeTable class in a new namespace optimize.py makes a lot of sense, can be modeled very similar to the InspectTable and generally makes things cleaner -- I think it still makes sense to have the all_known_files inside of inspect though, and can still use that in the new OptimizeTable
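A minimal sketch of the namespace idea in this reply, assuming a hypothetical OptimizeTable that hangs off the table the same way InspectTable does. The classes below are stand-ins written for illustration, not PyIceberg's real classes, and all_known_files returns a hard-coded mapping with made-up paths.

```python
from functools import reduce


class InspectTable:
    """Stand-in for the inspect namespace; not PyIceberg's real class."""

    def __init__(self, tbl: "Table") -> None:
        self.tbl = tbl

    def all_known_files(self) -> dict[str, set[str]]:
        # Hypothetical result: file paths grouped by kind.
        return {
            "data": {"s3://bucket/tbl/data/a.parquet"},
            "manifests": {"s3://bucket/tbl/metadata/m1.avro"},
        }


class OptimizeTable:
    """Hypothetical namespace that groups maintenance actions."""

    def __init__(self, tbl: "Table") -> None:
        self.tbl = tbl

    def orphaned_files(self, listed: set[str]) -> set[str]:
        # Reuse the inspect namespace instead of duplicating its logic.
        known = reduce(set.union, self.tbl.inspect.all_known_files().values(), set())
        return listed - known


class Table:
    """Stand-in table exposing both namespaces as properties."""

    @property
    def inspect(self) -> InspectTable:
        return InspectTable(self)

    @property
    def optimize(self) -> OptimizeTable:
        return OptimizeTable(self)


listed = {"s3://bucket/tbl/data/a.parquet", "s3://bucket/tbl/data/stale.parquet"}
orphans = Table().optimize.orphaned_files(listed)
```

In this layout all_known_files stays in the inspect namespace and the maintenance namespace only consumes it, which is the split proposed above.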

@Fokko
Contributor

Fokko commented May 13, 2025

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

That's a good point. However, I think we should also be able to run them separately. For example, deleting orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Deleting orphan files can also be pretty costly because of the list operation; ideally you would delegate this to a catalog that uses, for example, S3 Inventory.

@jayceslesar
Contributor Author

@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files?

@Fokko
Contributor

Fokko commented Jun 24, 2025

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

@jayceslesar
Contributor Author

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

Sounds good, I will add the partition statistics files when that is merged!

@aammar5

aammar5 commented Jul 10, 2025

One issue I've found with this PR is that the catalog properties need to propagate to PyArrowFileIO(properties=...); otherwise endpoint, authentication, and similar settings for things like S3 simply fail ...

flat_known_files: set[str] = reduce(set.union, all_known_files.values(), set())

scheme, _, _ = PyArrowFileIO.parse_location(location)
pyarrow_io = PyArrowFileIO()

Suggested change
- pyarrow_io = PyArrowFileIO()
+ pyarrow_io = PyArrowFileIO(properties=self.tbl.catalog.properties)

Contributor Author

I'd like to see if I can achieve this without PyArrow, and will attempt to do so after working on #2146.

if older_than is None:
    older_than = timedelta(0)
as_of = datetime.now(timezone.utc) - older_than
all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]

Suggested change
- all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]
+ all_files = [f"{scheme}://{f.path}" for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]
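To illustrate what this suggestion appears to address, here is a sketch that assumes the table's known files are full URIs while the filesystem listing returns paths without a scheme. The helper name find_orphans and the sample paths are hypothetical, and a plain list of (path, mtime) pairs stands in for fs.get_file_info(selector).

```python
from datetime import datetime, timedelta, timezone


def find_orphans(listing, known_files, scheme, older_than=None, now=None):
    """Return listed files that the table does not know and that predate the cutoff.

    `listing` is an iterable of (path_without_scheme, mtime) pairs, a
    hypothetical stand-in for the filesystem listing used in the PR.
    """
    if older_than is None:
        older_than = timedelta(0)
    if now is None:
        now = datetime.now(timezone.utc)
    as_of = now - older_than
    # Re-attach the scheme so listed paths compare equal to the table's URIs.
    candidates = {f"{scheme}://{path}" for path, mtime in listing if mtime < as_of}
    return candidates - known_files


now = datetime(2025, 7, 10, tzinfo=timezone.utc)
listing = [
    ("bucket/tbl/data/a.parquet", now - timedelta(days=10)),      # referenced by the table
    ("bucket/tbl/data/stale.parquet", now - timedelta(days=10)),  # unreferenced and old
    ("bucket/tbl/data/fresh.parquet", now - timedelta(hours=1)),  # unreferenced but too new
]
known = {"s3://bucket/tbl/data/a.parquet"}

orphans = find_orphans(listing, known, "s3", older_than=timedelta(days=3), now=now)
```

Under these assumptions, leaving out the scheme prefix would make the set difference compare "bucket/tbl/data/a.parquet" against "s3://bucket/tbl/data/a.parquet", so a file the table still references would be reported as an orphan.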

Comment on lines +68 to +73
try:
    import pyarrow as pa  # noqa: F401
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        "For deleting orphaned files with a PyArrowFileIO, PyArrow needs to be installed"
    ) from e
Contributor

Will this error ever happen? If the table's io is a PyArrowFileIO, I think we've already verified that PyArrow is installed.

Contributor Author

We don't ask whether it's a PyArrowFileIO; we ask whether it isn't an FsspecFileIO.
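The distinction in this reply can be sketched with stand-in classes (not the real PyIceberg ones): because the branch is chosen by "is not an FsspecFileIO", any other FileIO implementation falls through to the PyArrow path, so the import guard can still fire. The function name listing_backend and its module parameter are hypothetical and exist only to make the behaviour easy to demonstrate.

```python
import importlib


class FileIO: ...
class FsspecFileIO(FileIO): ...
class CustomFileIO(FileIO): ...  # hypothetical third implementation


def listing_backend(io: FileIO, module: str = "pyarrow") -> str:
    """Pick a listing backend the way the reply describes (stand-in logic)."""
    if isinstance(io, FsspecFileIO):
        return "fsspec"
    # Everything else lands here, not only a PyArrowFileIO.
    try:
        importlib.import_module(module)
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "For deleting orphaned files with a PyArrowFileIO, PyArrow needs to be installed"
        ) from e
    return "pyarrow"
```

Calling listing_backend(CustomFileIO()) in an environment without PyArrow would raise the error above, even though the io is not a PyArrowFileIO.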

@jayceslesar
Contributor Author

Going to get around to adding tests for both types of FileIO... @Fokko @kevinjqliu anything else you think we need here?

@ForeverAngry
Contributor

@jayceslesar how's this coming? Let me know if I can help with anything. I'd like to use this in prod as well!

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions bot added the stale label Mar 17, 2026
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions bot closed this Mar 25, 2026
Development

Successfully merging this pull request may close these issues.

Delete orphan files

7 participants