fix(table): fast-append must inherit all parent manifests unconditionally#3
Closed
cassio-paesleme wants to merge 3 commits intodocker:mainfrom
Closed
Conversation
Add write.parquet.root-repetition property (required/optional/repeated, default: required) to control the Parquet root schema element's repetition type. arrow-go defaults to Repeated, which Snowflake interprets as one-level list encoding and rejects files with list columns. Defaulting to Required aligns with the Parquet spec and matches arrow-rs, pyarrow, and parquet-java behavior.
iter.Pull(args.counter) was called unconditionally, but in the partitioned path newWriterFactory creates its own iter.Pull and the original stopCount was never called, leaking one goroutine per write. Move iter.Pull into the unpartitioned branch where it is actually used. Add a regression test confirming goroutine count stays stable.
…ally fastAppendFiles.existingManifests() filtered parent manifests using HasAddedFiles() || HasExistingFiles(). Both methods return false when the manifest list entry has added_files_count=0 and existing_files_count=0, which is the standard Iceberg v2 representation for inherited manifests written by external writers such as Athena, Spark, and Trino. As a result, any data written by an external writer was silently dropped from the snapshot on the next iceberg-go fast-append. Queries against the table after the append returned only the iceberg-go-written rows; all previously existing data became invisible. A fast-append never removes or overwrites data files, so the correct behaviour is to inherit all manifests from the parent snapshot unconditionally. Remove the filter and return previous.Manifests() directly. Fixes: data loss when appending to an Iceberg table that was previously written by Athena or other external writers. Tested: new TestFastAppendInheritsZeroCountManifests reproduces the bug (FAIL before patch, PASS after) and the full ./table/... suite passes with no regressions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
fastAppendFiles.existingManifests()filters parent manifests using:HasAddedFiles()returnsAddedFilesCount != 0andHasExistingFiles()returnsExistingFilesCount != 0. Both returnfalsewhen a manifest list entry hasadded_files_count=0andexisting_files_count=0— the standard Iceberg v2 representation for inherited manifests written by external writers (Athena, Spark, Trino, etc.).As a result, any data written by an external writer is silently dropped from the snapshot on the next iceberg-go fast-append. Queries against the table after the append return only the iceberg-go-written rows; all previously existing data becomes invisible.
Root Cause Confirmed
Reproduced with Athena + iceberg-go on a staging Iceberg table. After inserting 2 rows with Athena and appending 1 row with iceberg-go, the parent snapshot's manifest list entries have
added_files_count=0, existing_files_count=0. Both manifests fail the filter and are dropped. Athena sees only the 1 iceberg-go row.Diagnostic output (before fix):
Full investigation: docker/data-platform#406
Upstream Apache PR: apache/iceberg-go#869
Fix
A fast-append never removes or overwrites data files, so all parent manifests must be inherited unconditionally. Remove the filter and return
previous.Manifests()directly.Testing
TestFastAppendInheritsZeroCountManifests: creates two zero-count manifests simulating an Athena-written snapshot, fast-appends one new data file, asserts all 3 manifests are present. FAIL before patch, PASS after../table/...suite passes with no regressions.