
Data: Add TCK tests for Metadata Columns in BaseFormatModelTests #15675

Open
Guosmilesmile wants to merge 4 commits into apache:main from Guosmilesmile:tck_metadata

Conversation

@Guosmilesmile
Contributor

This PR adds TCK tests for metadata column reading in BaseFormatModelTests.

Metadata Columns:

  • FILE_PATH
  • SPEC_ID
  • ROW_POSITION
  • IS_DELETED
  • Lineage
    • ROW_ID
    • LAST_UPDATED_SEQUENCE_NUMBER
  • PARTITION_COLUMN
    • Transformations
    • Partition evolution (adding and removing columns)

Part of #15415
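To make the mechanism under test concrete, here is a minimal, hypothetical sketch of how a reader can materialize metadata columns: per-file values (such as the file path and spec id) are injected as constants keyed by field id, similar in spirit to the idToConstant map discussed below, while row-dependent values (such as row position) are derived per row. The class, field ids, and map-based row shape are all invented for illustration and are not the actual Iceberg or BaseFormatModelTests API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: per-file metadata values arrive as constants
// keyed by field id, while ROW_POSITION is derived from the row's ordinal.
// The ids below are made up; Iceberg reserves its own ids for these columns.
public class MetadataConstantsSketch {
  static final int FILE_PATH_ID = 1001;
  static final int SPEC_ID_ID = 1002;
  static final int ROW_POSITION_ID = 1003;

  static List<Map<Integer, Object>> readWithMetadata(
      List<String> dataRows, Map<Integer, Object> idToConstant) {
    List<Map<Integer, Object>> out = new ArrayList<>();
    for (int pos = 0; pos < dataRows.size(); pos++) {
      // start each row from the per-file constants, then add derived values
      Map<Integer, Object> row = new LinkedHashMap<>(idToConstant);
      row.put(ROW_POSITION_ID, (long) pos); // derived per row, not a constant
      row.put(0, dataRows.get(pos));        // the actual data column
      out.add(row);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, Object> constants = new LinkedHashMap<>();
    constants.put(FILE_PATH_ID, "s3://bucket/data/file-0.parquet");
    constants.put(SPEC_ID_ID, 0);
    List<Map<Integer, Object>> rows =
        readWithMetadata(List.of("a", "b"), constants);
    System.out.println(rows.get(1).get(ROW_POSITION_ID)); // 1
    System.out.println(rows.get(0).get(FILE_PATH_ID));
  }
}
```

The point of the sketch is the split between constant-valued and derived metadata columns, which is what the tests in this PR exercise from the read side.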

@github-actions github-actions bot added the data label Mar 18, 2026
@Guosmilesmile Guosmilesmile force-pushed the tck_metadata branch 4 times, most recently from 12a56cf to eca8c4d Compare March 20, 2026 16:46
@Guosmilesmile Guosmilesmile marked this pull request as draft March 21, 2026 15:21
@Guosmilesmile Guosmilesmile marked this pull request as ready for review March 22, 2026 12:28
Comment on lines +216 to +217
List<T> list = convertToEngineRecords(genericRecords, schema);
assertEquals(schema, list, readRecords);
Contributor

Why this change?

Contributor Author

Sorry about that, it was for testing purposes. I'll revert it back.

Contributor

You promised to revert this, but it seems like it slipped through somehow.

new String[] {FEATURE_FILTER, FEATURE_CASE_SENSITIVE, FEATURE_SPLIT},
FileFormat.ORC,
new String[] {FEATURE_REUSE_CONTAINERS});
new String[] {FEATURE_REUSE_CONTAINERS, FEATURE_META_ROW_LINEAGE});
Contributor

How hard would it be to implement this?

Contributor Author

I think it should work. I'll give it a try in the next PR.

DataGenerator dataGenerator = new DataGenerators.DefaultSchema();
Schema schema = dataGenerator.schema();
List<Record> genericRecords = dataGenerator.generateRecords();
writeGenericRecords(fileFormat, schema, genericRecords);
Contributor

Could we create rows where ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER are set?
It is a valid scenario that some rows have a row_id while for other rows these are unset.
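The mixed scenario the reviewer describes follows from row-lineage inheritance: as I understand the spec, a row that carries an explicit row id keeps it, while a row where it is unset derives one from the file's first_row_id plus the row's position. The sketch below is my illustrative reading of that rule, not the code under review; all names are stand-ins.

```java
// Illustrative sketch of row-lineage inheritance: an explicit _row_id wins;
// an unset (null) one is derived as first_row_id + position. The rule is my
// reading of the row-lineage design, hedged, not the PR's implementation.
public class RowLineageSketch {
  static long effectiveRowId(Long explicitRowId, long firstRowId, long position) {
    return explicitRowId != null ? explicitRowId : firstRowId + position;
  }

  public static void main(String[] args) {
    // File assigned first_row_id = 100; only the second row has an explicit id.
    Long[] explicit = {null, 42L, null};
    for (int pos = 0; pos < explicit.length; pos++) {
      System.out.println(effectiveRowId(explicit[pos], 100L, pos));
    }
    // prints 100, 42, 102
  }
}
```

A test covering this would want both branches in one file: rows with explicit ids and rows that inherit.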

Contributor Author

Yes, I will add a UT to cover it.

Comment on lines +805 to +806
PartitionData partitionData = new PartitionData(partitionType);
partitionData.set(0, "test_col_a");
Contributor

Do we need this part, or is the partition data read only from the idToConstant?

Contributor Author

I think it is necessary. The partition data information is needed for both writing and reading.

Contributor Author

Hmm, after thinking about it, if we are testing the read, we don't actually need to inject partition information here, because it is injected through idToConstant. I'll change it to non-partitioned for testing.

partitionData.set(0, "test_col_a");

DataWriter<Record> writer =
FormatModelRegistry.dataWriteBuilder(fileFormat, Record.class, encryptedFile)
Contributor

Does the writer remove the partition columns? If so, then we need these tests, but this is more like a writer test.


protected abstract void assertEquals(Schema schema, List<T> expected, List<T> actual);

protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);
Contributor

Could we just create a Record and Schema and use convertToEngine instead of this new method?

Contributor Author

I have tried it. When I add partitionData to idToConstant, the Flink side requires it to be a Record. The RowDataConverter.convert used in convertToEngine will force the STRUCT to be converted to a Record and throw an error. For metadata processing on the Flink side, RowDataUtil.convertConstant is used instead.

case STRUCT:
return convert(type.asStructType(), (Record) object);

Contributor

Could this help?

  private static RowData convert(Types.StructType struct, StructLike record) {
    GenericRowData rowData = new GenericRowData(struct.fields().size());
    List<Types.NestedField> fields = struct.fields();
    for (int i = 0; i < fields.size(); i += 1) {
      Types.NestedField field = fields.get(i);

      Type fieldType = field.type();
      rowData.setField(i, convert(fieldType, record.get(i, Object.class)));
    }
    return rowData;
  }

Notice the StructLike record parameter and the record.get(i, Object.class) accessor.
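The reviewer's suggestion above boils down to widening the converter's parameter from the concrete Record to a StructLike-style interface. A minimal, self-contained sketch of why that helps, with heavily simplified stand-in types (the real StructLike, GenericRecord, and PartitionData are richer than this):

```java
// Simplified stand-ins: both record types expose positional access through a
// common interface. A converter written against the interface handles either;
// one that casts to the concrete Record would fail for PartitionData with a
// ClassCastException. All classes here are illustrative, not Iceberg's.
public class StructLikeSketch {
  interface StructLike {
    Object get(int pos);
  }

  static class GenericRecord implements StructLike {
    private final Object[] values;
    GenericRecord(Object... values) { this.values = values; }
    public Object get(int pos) { return values[pos]; }
  }

  static class PartitionData implements StructLike {
    private final Object[] values;
    PartitionData(Object... values) { this.values = values; }
    public Object get(int pos) { return values[pos]; }
  }

  // Accepting the interface is what makes both callers work.
  static String convert(StructLike struct) {
    return String.valueOf(struct.get(0));
  }

  public static void main(String[] args) {
    System.out.println(convert(new GenericRecord("data")));
    System.out.println(convert(new PartitionData("test_col_a")));
  }
}
```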

Contributor Author

I converted StructLike to Record, and then used convertToEngineRecords + assertEquals for comparison, which allowed me to remove the convertConstantToEngine method.


protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);

protected abstract <D> List<D> convertToPartitionIdentity(
Contributor

Could we use an existing method to achieve this?

Contributor Author

Along with the aforementioned changes, this part has also been optimized away.

};
}

private Map<Integer, Object> convertConstantsToEngine(
Contributor

Could we just have the Constants in a GenericRecord and convert it to the engine type?

Contributor Author

Sorry, I'm not quite sure what direction of adjustment is needed here. I have made the changes mentioned above, and I'm not sure whether further adjustments are needed in this part. I would appreciate more suggestions.

Contributor

This is too convoluted, and we still need a getFieldFromEngineRow, so we are not much better off.

If we still need the extra method, we might be better off having a method like:

public static Object convertConstantsToEngine(Type type, Object value);

For Spark it could simply call SparkUtil.internalToSpark
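A minimal sketch of the shape the reviewer proposes: one static entry point that converts a constant to the engine representation by dispatching on type. The TypeTag enum and the byte-array "engine string" below are invented stand-ins to keep the example self-contained; a real Spark implementation would delegate to SparkUtil.internalToSpark as the comment says.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a single static conversion entry point. TypeTag and
// the byte-array string representation are illustrative; real engines have
// their own type system and value representations.
public class ConstantConversionSketch {
  enum TypeTag { STRING, INT, LONG }

  static Object convertConstantToEngine(TypeTag type, Object value) {
    if (value == null) {
      return null;
    }
    switch (type) {
      case STRING:
        // engines often use their own string representation; bytes stand in
        return ((String) value).getBytes(StandardCharsets.UTF_8);
      case INT:
      case LONG:
        return value; // primitives typically pass through unchanged
      default:
        throw new UnsupportedOperationException("unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    byte[] engineString =
        (byte[]) convertConstantToEngine(TypeTag.STRING, "s3://bucket/file");
    System.out.println(new String(engineString, StandardCharsets.UTF_8));
    System.out.println(convertConstantToEngine(TypeTag.LONG, 7L));
  }
}
```

The benefit is that each engine module implements one conversion per type instead of a per-field accessor like getFieldFromEngineRow.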


@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testReadMetadataColumnPartitionBucketTransform(FileFormat fileFormat) throws IOException {
Contributor

Could you help me with highlighting the differences between this test and testReadMetadataColumnPartitionIdentity?


@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testReadMetadataColumnPartitionEvolutionAddColumn(FileFormat fileFormat) throws IOException {
Contributor

Could we have a test with addColumnWithDefaultReadValue?
