fix(table): fast-append must inherit all parent manifests unconditionally #4
Open
cassio-paesleme wants to merge 3 commits into docker:main from
Add write.parquet.root-repetition property (required/optional/repeated, default: required) to control the Parquet root schema element's repetition type. arrow-go defaults to Repeated, which Snowflake interprets as one-level list encoding and rejects files with list columns. Defaulting to Required aligns with the Parquet spec and matches arrow-rs, pyarrow, and parquet-java behavior.
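As a rough illustration of how such a property might be resolved, the sketch below reads `write.parquet.root-repetition` from a plain properties map and falls back to `required`. The function name `rootRepetition` and the map-based lookup are assumptions made for this example; they are not taken from the patch.

```go
package main

import "fmt"

// Property name as given in the commit message.
const rootRepetitionKey = "write.parquet.root-repetition"

// rootRepetition is a hypothetical helper: it returns the configured
// repetition type, defaulting to "required" when the property is unset,
// and rejects anything outside the three documented values.
func rootRepetition(props map[string]string) (string, error) {
	v, ok := props[rootRepetitionKey]
	if !ok || v == "" {
		return "required", nil
	}
	switch v {
	case "required", "optional", "repeated":
		return v, nil
	default:
		return "", fmt.Errorf("invalid %s value %q", rootRepetitionKey, v)
	}
}

func main() {
	def, _ := rootRepetition(map[string]string{})
	set, _ := rootRepetition(map[string]string{rootRepetitionKey: "repeated"})
	_, err := rootRepetition(map[string]string{rootRepetitionKey: "many"})
	fmt.Println(def, set, err != nil)
}
```

Rejecting unknown values up front, rather than silently falling back, is a design choice of this sketch.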
iter.Pull(args.counter) was called unconditionally, but in the partitioned path newWriterFactory creates its own iter.Pull and the original stopCount was never called, leaking one goroutine per write. Move iter.Pull into the unpartitioned branch where it is actually used. Add a regression test confirming goroutine count stays stable.
3b42d06 to b33dcb2
brefsdal reviewed Apr 10, 2026
	}
	previous, err := fa.base.txn.meta.SnapshotByID(fa.base.parentSnapshotID)
	if err != nil {
		return nil, fmt.Errorf("could not find parent snapshot %d: %w", fa.base.parentSnapshotID, err)
…ally
fastAppendFiles.existingManifests() filtered parent manifests using HasAddedFiles() || HasExistingFiles(). Both methods return false when the manifest list entry has added_files_count=0 and existing_files_count=0, which is the standard Iceberg v2 representation for inherited manifests written by external writers such as Athena, Spark, and Trino.
As a result, any data written by an external writer was silently dropped from the snapshot on the next iceberg-go fast-append. Queries against the table after the append returned only the iceberg-go-written rows; all previously existing data became invisible.
A fast-append never removes or overwrites data files, so the correct behaviour is to inherit all manifests from the parent snapshot unconditionally. Remove the filter and return previous.Manifests() directly.
Fixes: data loss when appending to an Iceberg table that was previously written by Athena or other external writers.
Tested: new TestFastAppendInheritsZeroCountManifests reproduces the bug (FAIL before patch, PASS after) and the full ./table/... suite passes with no regressions.
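The difference between the old filter and unconditional inheritance can be sketched with a simplified model. The `manifest` type and the helper functions below are hypothetical and are not iceberg-go's real types; they only mirror the behaviour described in the commit message.

```go
package main

import "fmt"

// manifest is a hypothetical, simplified manifest list entry.
type manifest struct {
	path          string
	addedFiles    int32
	existingFiles int32
}

func (m manifest) hasAddedFiles() bool    { return m.addedFiles > 0 }
func (m manifest) hasExistingFiles() bool { return m.existingFiles > 0 }

// filteredManifests mirrors the behaviour described before the patch:
// entries with zero added and zero existing counts are dropped.
func filteredManifests(parent []manifest) []manifest {
	var kept []manifest
	for _, m := range parent {
		if m.hasAddedFiles() || m.hasExistingFiles() {
			kept = append(kept, m)
		}
	}
	return kept
}

// inheritedManifests mirrors the patched behaviour: every parent
// manifest is carried over, whatever its counts say.
func inheritedManifests(parent []manifest) []manifest {
	return append([]manifest(nil), parent...)
}

func main() {
	parent := []manifest{
		{path: "external-writer.avro"}, // zero counts
		{path: "iceberg-go.avro", addedFiles: 2},
	}
	// prints "1 2": the filter loses the zero-count manifest
	fmt.Println(len(filteredManifests(parent)), len(inheritedManifests(parent)))
}
```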
b33dcb2 to 21052fe
Problem
Data written by Athena (or Spark/Trino) disappears after any iceberg-go fast-append. The new snapshot only contains rows written by iceberg-go; all previously existing data becomes invisible to queries.
Root cause
Two bugs in fastAppendFiles.existingManifests():
Bug 1: Field name mismatch
The Iceberg spec names the manifest list count fields added_files_count (field 504), existing_files_count (505), deleted_files_count (506). Athena uses added_data_files_count, existing_data_files_count, deleted_data_files_count for the same logical fields. When iceberg-go read an Athena-written manifest list, it found nothing at the spec name, read 0 for all counts, and HasAddedFiles() returned false for every Athena manifest.
Fix: Both naming conventions are handled on read: the manifestFile struct carries both avro tags and AddedDataFiles() coalesces whichever is non-zero. Both names are written on every manifest list write (same field-id, different names) so Athena readers find their expected name and spec-compliant readers find theirs.
Bug 2: Manifest inheritance filter
existingManifests() kept only manifests where HasAddedFiles() || HasExistingFiles(). Because of Bug 1, both returned false for all Athena manifests, so they were silently dropped. The new snapshot only referenced iceberg-go's newly written files.
Fix: A fast-append never removes data. Remove the filter and return previous.Manifests() directly, so that all parent manifests are inherited unconditionally.
Tests
TestFastAppendInheritsZeroCountManifests: reproduces Bug 2. Constructs a parent snapshot with Athena-style manifests (zero counts), fast-appends a new data file on top, and asserts all parent manifests are present in the resulting snapshot.
TestReadManifestListAthenaFieldNames: reproduces Bug 1. Encodes an OCF manifest list using Athena field names, reads it back via ReadManifestList, and asserts the counts are decoded correctly.
Note on field-ids
added_files_count and added_data_files_count are the same logical Iceberg field (id 504); Athena diverged from the spec name. Both names carry field-id 504 in the writer schema because they represent the same field; this is what field renaming in schema evolution looks like. Spark, Trino, and other spec-compliant readers ignore the unknown added_data_files_count field; Athena ignores added_files_count.
🤖 Generated with Claude Code
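The read-side coalescing described under Bug 1 might look roughly like the following. This is a simplified sketch: `manifestCounts` and `addedDataFiles` are hypothetical names, the avro struct tags are omitted, and preferring the spec name when both are set is an assumption of this example rather than something the description states.

```go
package main

import "fmt"

// manifestCounts is a hypothetical simplification: one logical field
// (id 504) that may arrive under either of two names.
type manifestCounts struct {
	AddedFilesCount     int32 // spec name: added_files_count
	AddedDataFilesCount int32 // Athena name: added_data_files_count
}

// addedDataFiles returns whichever count is non-zero, checking the
// spec name first (an assumption of this sketch).
func (m manifestCounts) addedDataFiles() int32 {
	if m.AddedFilesCount != 0 {
		return m.AddedFilesCount
	}
	return m.AddedDataFilesCount
}

func main() {
	specWritten := manifestCounts{AddedFilesCount: 4}
	athenaWritten := manifestCounts{AddedDataFilesCount: 7}
	// prints "4 7": both writers' counts are visible after coalescing
	fmt.Println(specWritten.addedDataFiles(), athenaWritten.addedDataFiles())
}
```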