
Core, Parquet: Allow for Writing Parquet/Avro Manifests in V4#15634

Open
RussellSpitzer wants to merge 2 commits into apache:main from RussellSpitzer:ParquetManifests

Conversation

@RussellSpitzer
Member

Extends the V4 manifest writer to allow it to write manifests in either Parquet or Avro based on the file extension. A default is also added so that the SDK writes Parquet manifests when the version is 4. This could be parameterized later, but that will require parameterizing the test suites, so I decided on a single format (Parquet) for now.
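A minimal sketch of the extension-based dispatch described above. The PR uses Iceberg's `FileFormat.fromFileName` on the manifest location; this hypothetical standalone helper just mirrors that behavior for illustration, it is not the PR's actual code.

```java
// Hypothetical helper mirroring extension-based manifest format dispatch.
public class ManifestFormatDemo {
    public static String formatFromFileName(String location) {
        // The real code uses FileFormat.fromFileName; this sketch only
        // distinguishes the two formats the PR cares about.
        if (location.endsWith(".parquet")) {
            return "PARQUET";
        } else if (location.endsWith(".avro")) {
            return "AVRO";
        }
        throw new IllegalArgumentException("Cannot determine manifest format: " + location);
    }

    public static void main(String[] args) {
        System.out.println(formatFromFileName("snap-123-m0.parquet")); // PARQUET
        System.out.println(formatFromFileName("snap-123-m0.avro"));    // AVRO
    }
}
```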

There are a few other required changes here outside of testing:

  1. Handling of splitOffsets in Parquet needs to be changed, since BaseFile returns an immutable view which the Parquet writer was attempting to re-use by clearing it.

  2. Unpartitioned tables need special care, since Parquet cannot store empty structs in the schema. Reading from Parquet manifests therefore means skipping the partition field and adjusting read offsets when the partition is not defined. The read code is currently shared between all versions, so this change affects the older Avro readers as well.

  3. Some of the test code for TestReplacePartitions assumed that you could validate against a slightly different version of the table. This is a problem if the table you make is partitioned and the validation table is unpartitioned. It used to work accidentally, I think, because we would commit unpartitioned operations to a partitioned table.
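A sketch of the splitOffsets problem in item 1, under the assumption (stated in the PR, not shown in its diff) that BaseFile hands out an immutable view: calling `clear()` on that view throws, so the fix is to copy into a fresh container before mutating. The helper name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: re-using an immutable splitOffsets view by
// clearing it fails; copying it into a mutable ArrayList does not.
public class SplitOffsetsDemo {
    public static List<Long> mutableCopy(List<Long> splitOffsets) {
        return splitOffsets == null ? null : new ArrayList<>(splitOffsets);
    }

    public static void main(String[] args) {
        List<Long> immutableView = Collections.unmodifiableList(new ArrayList<>(List.of(4L, 1024L)));
        List<Long> copy = mutableCopy(immutableView);
        copy.clear(); // safe on the copy; would throw UnsupportedOperationException on the view
        System.out.println(copy.size()); // 0
    }
}
```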
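Item 2 can be sketched as a position adjustment: since Parquet cannot store an empty partition struct, columns after the partition slot shift left by one when the partition is not defined. This is a simplified, hypothetical model of the offset change, not the PR's reader code.

```java
// Hypothetical model of the read-offset adjustment for unpartitioned tables.
public class ManifestReadPosDemo {
    // expectedPos: the column position in the shared manifest schema;
    // partitionPos: where the partition struct would sit in that schema.
    public static int readPos(int expectedPos, boolean hasPartition, int partitionPos) {
        if (hasPartition || expectedPos < partitionPos) {
            return expectedPos;
        }
        return expectedPos - 1; // partition column is absent from the Parquet file
    }

    public static void main(String[] args) {
        System.out.println(readPos(5, true, 3));  // 5: partition present, no shift
        System.out.println(readPos(5, false, 3)); // 4: shifted left past missing partition
        System.out.println(readPos(1, false, 3)); // 1: before the partition slot, no shift
    }
}
```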

--- Some Benchmarks
Note this is all done with full reads; while we expect writes to be slower, reads should be faster once we actually do column-specific projection. Since in this code the Avro and Parquet read paths are both doing full scans, we don't expect them to be materially different.

[image: benchmark results]

I also deleted the old manifest benchmarks, which were specific to V1 and V1/V2 respectively, and replaced them with a new benchmark that can be used on any version.

@RussellSpitzer
Member Author

@anoopj I hear your PR #15049 will make this a lot cleaner, so I will review that on Monday :)

@RussellSpitzer
Member Author

[image: updated benchmark results] Updated with the new benchmark code.

@RussellSpitzer
Member Author

I had missed some deprecations and test rewrites; I'm going to split those off to another branch to resolve first, so as not to confuse this PR.

Contributor

@anoopj left a comment


Code looks great to me. A couple of minor comments.

    if (reuse != null) {
-     this.lastList = reuse;
+     // reuse containers may come from a different reader (e.g. Avro) with incompatible types
+     this.lastList = reuse instanceof ArrayList ? reuse : null;
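To make the reviewer's question concrete: a hedged illustration of why the diff checks `reuse instanceof ArrayList`. Under the assumption described in the diff's comment, a reuse container handed in by a different reader (e.g. Avro) may be some other List implementation, so the reader only reuses it when the type matches and otherwise discards it. The helper below is hypothetical, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the type-guarded container reuse pattern.
public class ReuseCheckDemo {
    @SuppressWarnings("unchecked")
    public static <T> List<T> reusable(Object reuse) {
        // Only reuse containers of the expected concrete type; returning null
        // signals that a fresh container should be allocated instead.
        return reuse instanceof ArrayList ? (List<T>) reuse : null;
    }
}
```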
Contributor


Could you add a brief comment to explain why this looks for ArrayList? (was not super obvious to me). Same for the LinkedHashMap on line 980

    FileFormat manifestFormat = FileFormat.fromFileName(inputFile.location());
    Preconditions.checkArgument(
        manifestFormat == FileFormat.AVRO,
        "Reading manifest metadata is only supported for Avro manifests: %s",
Contributor


I assume read support will come in a separate PR? Not sure if it would be useful, but #14577 is a quick-and-dirty prototype I did last year.

Member Author


I'll look for the old discussions, but we basically came to consensus that we just weren't going to read metadata out of manifests in this code path. All the deprecation work I've been doing is to remove that dependency, so we won't ever have a read-metadata pathway for internal data readers that aren't Avro.

@sfc-gh-rspitzer

Could you please check out #15656? It has about 1k more net changes to handle deprecations I missed. After that I'll come back to this one and finish it up. Thanks for checking on this PR as well!

