Skip to content

feat: Deletion vector reader#866

Open
Shreyas220 wants to merge 4 commits intoapache:mainfrom
Shreyas220:deletion_vector-reader
Open

feat: Deletion vector reader#866
Shreyas220 wants to merge 4 commits intoapache:mainfrom
Shreyas220:deletion_vector-reader

Conversation

@Shreyas220
Copy link
Copy Markdown
Contributor

No description provided.

@Shreyas220 Shreyas220 requested a review from zeroshade as a code owner April 9, 2026 11:28
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!

this is a solid foundation for DV read support: envelope parsing is correct, puffin ReadAt integration is clean, and the Java golden test data is a great choice for cross-impl validation.

main concerns:

  1. OOM from sparse keys — the gap-filling loop in DeserializeRoaringPositionBitmap can allocate billions of empty bitmaps with adversarial input (keys [0, 0xFFFFFFFF]). needs a cap or a map-based representation
  2. Serialize writes empty gap bitmaps — Java only writes non-empty keys, so Go-serialized output won't match. reader-only for now but it's public — either fix or unexport
  3. ReadDV cardinality trap — always passes dvFile.Count() which is 0 when unset, rejecting valid DVs
  4. dvMockDataFile/strPtr missing — tests won't compile, looks like a helper file was left out of the PR
  5. 4 golden files unused — especially all-container-types-position-index.bin which exercises run containers, the main cross-library bug source

left some comments.

Also general observation - we should add for serialize / write puffin x-engine compatibility e2e tests, this probably beyond this PR scope, but need to keep this in mind once we start to enable DV support.

// specific language governing permissions and limitations
// under the License.

package table
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm sart to think that moving this to separate package may be a good idea, in general table package is quite large.

wdyt?

"github.com/stretchr/testify/require"
)

func readDVTestData(t *testing.T, name string) []byte {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am i reading this right:

  • 64map32bitvals.bin,
  • all-container-types-position-index.bin
  • empty-position-index.bin
  • small-and-large-values-position-index.bin

never actually reads here, why then add them into testdata?
i think it make sense to include coverage for them

}

if format != AvroFile && format != OrcFile && format != ParquetFile {
if format != AvroFile && format != OrcFile && format != ParquetFile && format != PuffinFile {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PuffinFile should probably only be valid when content == EntryContentPosDeletes — right now a caller could create a data file (content=DATA) with PUFFIN format and this check wouldn't catch it

}

// fill gaps with empty bitmaps
for len(b.bitmaps) < int(key) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can OOM with cracked input: keys [0, 0xFFFFFFFF] with count=2 allocates ~4 billion empty bitmaps.

maybe use a map[uint32]*roaring.Bitmap or cap the max key

}

// Set marks a position in the bitmap.
func (b *RoaringPositionBitmap) Set(pos int64) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a if pos < 0 { guard here

negative pos → negative key via arithmetic shift → grow is a no-op → bitmaps[key] panics with index-out-of-range.

}

// Contains checks if a position is set.
func (b *RoaringPositionBitmap) Contains(pos int64) bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, need >0 guard

b := &RoaringPositionBitmap{}
lastKey := -1

for i := int64(0); i < count; i++ {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count is validated for < 0 but not capped:

count = math.MaxInt64 will spin the loop trying to allocate.

i think we need here a reasonable max (something like puffin DefaultMaxBlobSize / minimum-bitmap-size)

return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, err)
}

return DeserializeDV(blobData, dvFile.Count())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always passes dvFile.Count() as expectedCardinality. Count() returns 0 when not set in manifest.

DeserializeDV treats >= 0 as "validate" → will reject any real DV when count defaults to 0.

Need either pass -1 always and let caller validate, or use *int64 for the parameter


offset := *dvFile.ContentOffset()
size := *dvFile.ContentSizeInBytes()
blobData := make([]byte, size)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a guard: if size < 0 || size > DefaultMaxBlobSize { return error }

this would panic on negative or huge ContentSizeInBytes

"github.com/stretchr/testify/require"
)

func readBitmapTestData(t *testing.T, name string) []byte {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a duplicate function?
can we re-use readDVTestData from other test here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants