Conversation
laskoviymishka
left a comment
There was a problem hiding this comment.
nice work!
this is a solid foundation for DV read support: envelope parsing is correct, puffin ReadAt integration is clean, and the Java golden test data is a great choice for cross-impl validation.
main concerns:
- OOM from sparse keys — the gap-filling loop in DeserializeRoaringPositionBitmap can allocate billions of empty bitmaps with adversarial input (keys [0, 0xFFFFFFFF]). needs a cap or a map-based representation
- Serialize writes empty gap bitmaps — Java only writes non-empty keys, so Go-serialized output won't match. reader-only for now but it's public — either fix or unexport
- ReadDV cardinality trap — always passes dvFile.Count() which is 0 when unset, rejecting valid DVs
- dvMockDataFile/strPtr missing — tests won't compile, looks like a helper file was left out of the PR
- 4 golden files unused — especially all-container-types-position-index.bin which exercises run containers, the main cross-library bug source
left some comments.
Also general observation - we should add for serialize / write puffin x-engine compatibility e2e tests, this probably beyond this PR scope, but need to keep this in mind once we start to enable DV support.
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| package table |
There was a problem hiding this comment.
i'm sart to think that moving this to separate package may be a good idea, in general table package is quite large.
wdyt?
| "github.com/stretchr/testify/require" | ||
| ) | ||
|
|
||
| func readDVTestData(t *testing.T, name string) []byte { |
There was a problem hiding this comment.
am i reading this right:
- 64map32bitvals.bin,
- all-container-types-position-index.bin
- empty-position-index.bin
- small-and-large-values-position-index.bin
never actually reads here, why then add them into testdata?
i think it make sense to include coverage for them
| } | ||
|
|
||
| if format != AvroFile && format != OrcFile && format != ParquetFile { | ||
| if format != AvroFile && format != OrcFile && format != ParquetFile && format != PuffinFile { |
There was a problem hiding this comment.
PuffinFile should probably only be valid when content == EntryContentPosDeletes — right now a caller could create a data file (content=DATA) with PUFFIN format and this check wouldn't catch it
| } | ||
|
|
||
| // fill gaps with empty bitmaps | ||
| for len(b.bitmaps) < int(key) { |
There was a problem hiding this comment.
can OOM with cracked input: keys [0, 0xFFFFFFFF] with count=2 allocates ~4 billion empty bitmaps.
maybe use a map[uint32]*roaring.Bitmap or cap the max key
| } | ||
|
|
||
| // Set marks a position in the bitmap. | ||
| func (b *RoaringPositionBitmap) Set(pos int64) { |
There was a problem hiding this comment.
need a if pos < 0 { guard here
negative pos → negative key via arithmetic shift → grow is a no-op → bitmaps[key] panics with index-out-of-range.
| } | ||
|
|
||
| // Contains checks if a position is set. | ||
| func (b *RoaringPositionBitmap) Contains(pos int64) bool { |
| b := &RoaringPositionBitmap{} | ||
| lastKey := -1 | ||
|
|
||
| for i := int64(0); i < count; i++ { |
There was a problem hiding this comment.
count is validated for < 0 but not capped:
count = math.MaxInt64 will spin the loop trying to allocate.
i think we need here a reasonable max (something like puffin DefaultMaxBlobSize / minimum-bitmap-size)
| return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, err) | ||
| } | ||
|
|
||
| return DeserializeDV(blobData, dvFile.Count()) |
There was a problem hiding this comment.
we always passes dvFile.Count() as expectedCardinality. Count() returns 0 when not set in manifest.
DeserializeDV treats >= 0 as "validate" → will reject any real DV when count defaults to 0.
Need either pass -1 always and let caller validate, or use *int64 for the parameter
|
|
||
| offset := *dvFile.ContentOffset() | ||
| size := *dvFile.ContentSizeInBytes() | ||
| blobData := make([]byte, size) |
There was a problem hiding this comment.
need a guard: if size < 0 || size > DefaultMaxBlobSize { return error }
this would panic on negative or huge ContentSizeInBytes
| "github.com/stretchr/testify/require" | ||
| ) | ||
|
|
||
| func readBitmapTestData(t *testing.T, name string) []byte { |
There was a problem hiding this comment.
is this a duplicate function?
can we re-use readDVTestData from other test here?
No description provided.