laskoviymishka
left a comment
The read path structure is solid and the Java alignment is largely correct — field IDs, doc strings, manifest list writer semantics, and the Arrow synthesis pipeline all check out.
Three issues need to land before this merges.
First row ID inheritance diverges from Java spec (manifest.go ReadEntry). Java's idAssigner unconditionally executes nextRowId += file.recordCount() for every file — null or explicit. The Go implementation only advances nextFirstRowID when FirstRowIDField == nil, so a file with an explicit first_row_id silently resets the baseline for all subsequent null files in the same manifest, producing overlapping row ID ranges. The fix and the *int64 cleanup land together: initialize nextFirstRowID eagerly in NewManifestReader, then unconditionally advance after the conditional assign.
Wrong sequence number for DataSequenceNumber (scanner.go PlanFiles). e.SequenceNum() is the manifest entry's metadata sequence number; _last_updated_sequence_number per spec requires the data sequence number — entry.dataSequenceNumber() in Java, e.FileSequenceNum() in Go. These are identical for freshly ADDED entries but diverge for EXISTING entries carried across compacted manifests, where the bug silently inflates the reported sequence number.
ManifestFile.FirstRowId() must be FirstRowID() before this public interface is merged. The PR already correctly renames the struct field to FirstRowID; the exported method should follow the same Go acronym convention. Fixing a public interface post-merge requires a breaking change.
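To make the first issue concrete, here is a minimal sketch of the assignment loop as this review describes it: null entries inherit the running counter, and the counter advances for every entry. The `entry` type and `assignFirstRowIDs` are hypothetical stand-ins for illustration, not the iceberg-go or Java types.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a manifest entry's data file.
// A nil firstRowID means "inherit from the manifest".
type entry struct {
	firstRowID  *int64
	recordCount int64
}

// assignFirstRowIDs returns the effective first_row_id of each entry.
// The counter is initialized eagerly from the manifest's first_row_id and
// advanced after every entry, whether its ID was inherited or explicit.
func assignFirstRowIDs(manifestFirstRowID int64, entries []entry) []int64 {
	next := manifestFirstRowID
	out := make([]int64, len(entries))
	for i, e := range entries {
		if e.firstRowID != nil {
			out[i] = *e.firstRowID // explicit value is kept as-is
		} else {
			out[i] = next // null value inherits the running counter
		}
		next += e.recordCount // unconditional advance
	}
	return out
}

func main() {
	explicit := int64(500)
	ids := assignFirstRowIDs(100, []entry{
		{firstRowID: nil, recordCount: 10},
		{firstRowID: &explicit, recordCount: 20},
		{firstRowID: nil, recordCount: 5},
	})
	fmt.Println(ids) // [100 500 130]
}
```

With a conditional advance, the third entry would receive 110 instead of 130, which is the divergence this comment is about.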
laskoviymishka
left a comment
One more thing: there is a memory leak. Aside from that, all good.
Same root cause as #762 — NewArray() starts at refcount 1, NewRecordBatch retains to refcount 2, local refs are never dropped so memory is never freed. Two places: the production release loop in synthesizeRowLineageColumns and the test setup in TestSynthesizeRowLineageColumns. The test fix is as important as the production fix — NewCheckedAllocator would have caught this immediately and prevents regressions of the same class.
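The refcount arithmetic can be illustrated with a toy model. This is only a sketch: `refCounted` and `batch` are hypothetical types that imitate retain/release semantics and are not the Arrow API.

```go
package main

import "fmt"

// refCounted is a toy stand-in for a reference-counted array.
type refCounted struct {
	refs  int
	freed bool
}

// newRefCounted starts at refcount 1, owned by the caller.
func newRefCounted() *refCounted { return &refCounted{refs: 1} }

func (r *refCounted) Retain() { r.refs++ }

// Release drops one reference and frees the value when none remain.
func (r *refCounted) Release() {
	r.refs--
	if r.refs == 0 {
		r.freed = true
	}
}

// batch retains each column on construction and releases them on Release.
type batch struct{ cols []*refCounted }

func newBatch(cols []*refCounted) *batch {
	for _, c := range cols {
		c.Retain()
	}
	return &batch{cols: cols}
}

func (b *batch) Release() {
	for _, c := range b.cols {
		c.Release()
	}
}

// buildLeaky never drops the local reference, so the column stays at 1.
func buildLeaky() *refCounted {
	col := newRefCounted()
	b := newBatch([]*refCounted{col})
	b.Release()
	return col
}

// buildFixed releases the local reference once the batch holds its own.
func buildFixed() *refCounted {
	col := newRefCounted()
	b := newBatch([]*refCounted{col})
	col.Release()
	b.Release()
	return col
}

func main() {
	fmt.Println("leaky freed:", buildLeaky().freed) // false
	fmt.Println("fixed freed:", buildFixed().freed) // true
}
```

In the leaky path the count goes 1, 2, 1 and never reaches zero; in the fixed path it goes 1, 2, 1, 0.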
@zeroshade could you help to review this PR as well?
@dttung2905 sorry for the delay, I'll give this a review tomorrow or Monday
@dttung2905 is this ready for a new review?
Yes it is @zeroshade |
zeroshade
left a comment
Looking good so far, though there's still the outstanding question at https://github.com/apache/iceberg-go/pull/735/changes#r2943078618
}

// IsMetadataColumn returns true if the field ID is a reserved metadata column (e.g. row lineage).
func IsMetadataColumn(fieldID int) bool {
nit: Now that IsMetadataColumn exists, it is worth adding a guard in NewSchema (or update_schema.go:AddColumn) that rejects user-defined fields with reserved IDs. This could be a follow-up PR.
I will try to follow this up in a separate PR. I think this one is getting too big to review now.
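For reference, a rough sketch of what such a guard could look like. Everything here is hypothetical: `field`, `isReservedFieldID`, and `validateUserFields` are illustration names, and the threshold assumes that field IDs above `math.MaxInt32 - 200` are reserved for metadata columns, which should be checked against the spec and against the actual IsMetadataColumn in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// reservedIDFloor is an assumption for this sketch: IDs strictly above
// this value are treated as reserved for metadata columns.
const reservedIDFloor = math.MaxInt32 - 200

// isReservedFieldID is a hypothetical stand-in for IsMetadataColumn.
func isReservedFieldID(id int) bool { return id > reservedIDFloor }

// field is a hypothetical, minimal schema field.
type field struct {
	ID   int
	Name string
}

// validateUserFields returns an error for the first user-defined field
// whose ID falls in the reserved range, and nil otherwise.
func validateUserFields(fields []field) error {
	for _, f := range fields {
		if isReservedFieldID(f.ID) {
			return fmt.Errorf("field %q uses reserved field ID %d", f.Name, f.ID)
		}
	}
	return nil
}

func main() {
	ok := []field{{ID: 1, Name: "id"}, {ID: 2, Name: "name"}}
	bad := []field{{ID: 1, Name: "id"}, {ID: math.MaxInt32 - 1, Name: "oops"}}
	fmt.Println(validateUserFields(ok))
	fmt.Println(validateUserFields(bad))
}
```

A constructor such as NewSchema could call a check like this before accepting the fields, so the rejection happens at schema creation rather than at read time.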
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
This should fully support the read path and partially support the write path.
Unsupported write path:
_row_id and _last_updated_sequence_number are not copied into the new files. Row lineage is preserved for appends and for metadata/manifest list; it is not yet preserved when rewriting data files.
_row_id / _last_updated_sequence_number as null columns (they are omitted); that is allowed by the spec and is not planned in this PR.