Data: Add TCK tests for Metadata Columns in BaseFormatModelTests #15675
Guosmilesmile wants to merge 4 commits into apache:main from
Conversation
Force-pushed 12a56cf to eca8c4d
Force-pushed eca8c4d to 486652f
```java
List<T> list = convertToEngineRecords(genericRecords, schema);
assertEquals(schema, list, readRecords);
```

Sorry about that, it was for testing purposes. I'll revert it back.

You promised to revert this, but it seems it slipped through somehow.
```diff
   new String[] {FEATURE_FILTER, FEATURE_CASE_SENSITIVE, FEATURE_SPLIT},
   FileFormat.ORC,
-  new String[] {FEATURE_REUSE_CONTAINERS});
+  new String[] {FEATURE_REUSE_CONTAINERS, FEATURE_META_ROW_LINEAGE});
```

How hard would it be to implement this?

I think it should work. I'll give it a try in the next PR.
```java
DataGenerator dataGenerator = new DataGenerators.DefaultSchema();
Schema schema = dataGenerator.schema();
List<Record> genericRecords = dataGenerator.generateRecords();
writeGenericRecords(fileFormat, schema, genericRecords);
```

Could we create rows where the ROW_ID and the LAST_UPDATED_SEQUENCE_NUMBER are set? It is a valid scenario that some of the rows have a row_id while for the other rows these are unset.

Yes, I will add a UT to cover it.
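The mixed-lineage scenario above can be sketched roughly as follows. This is a hedged, plain-Java illustration: `LineageRow` and `mixedLineageRows` are hypothetical stand-ins for the test's GenericRecords, whose `_row_id` and `_last_updated_sequence_number` fields may be null ("unset").

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a test row carrying Iceberg's row-lineage
// metadata; null means the field is unset.
class LineageRow {
    final long id;
    final Long rowId;
    final Long lastUpdatedSequenceNumber;

    LineageRow(long id, Long rowId, Long lastUpdatedSequenceNumber) {
        this.id = id;
        this.rowId = rowId;
        this.lastUpdatedSequenceNumber = lastUpdatedSequenceNumber;
    }

    // Even-indexed rows get lineage metadata, odd-indexed rows leave it
    // unset, covering both halves of the scenario in one generated batch.
    static List<LineageRow> mixedLineageRows(int count) {
        List<LineageRow> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            boolean hasLineage = i % 2 == 0;
            rows.add(new LineageRow(
                i,
                hasLineage ? (long) (1000 + i) : null,
                hasLineage ? 7L : null));
        }
        return rows;
    }
}
```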
```java
PartitionData partitionData = new PartitionData(partitionType);
partitionData.set(0, "test_col_a");
```

Do we need this part, or is the partition data read only from the idToConstant?

I think it is necessary. The partition data information is needed for both writing and reading.

Hmm, after thinking about it: if we are testing the read, we don't actually need to inject partition information here, because it is injected through idToConstant. I'll change it to non-partitioned for testing.
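The idToConstant mechanism being discussed can be sketched like this. This is a simplified illustration, not Iceberg's actual reader code: field ids and the join logic are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of constant-column injection during read: columns that
// are not stored in the data file (e.g. identity partition values or
// metadata columns) are filled in from an id -> constant map rather than
// read from the file.
class ConstantInjectingReader {
    static List<Object> readRow(
            List<Integer> projectedFieldIds,     // field ids the query asks for
            Map<Integer, Object> fileRow,        // values present in the file
            Map<Integer, Object> idToConstant) { // reader-injected constants
        List<Object> out = new ArrayList<>();
        for (int id : projectedFieldIds) {
            if (idToConstant.containsKey(id)) {
                out.add(idToConstant.get(id)); // constant wins; file not consulted
            } else {
                out.add(fileRow.get(id));
            }
        }
        return out;
    }
}
```

This is why the test can be non-partitioned: the partition value reaches the reader through the map, not through the file.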
```java
partitionData.set(0, "test_col_a");

DataWriter<Record> writer =
    FormatModelRegistry.dataWriteBuilder(fileFormat, Record.class, encryptedFile)
```

Does the writer remove the partition columns? If so, then we need these tests, but this is more like a writer test.
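The behavior the question is about can be sketched as follows. This is purely illustrative, not Iceberg's writer code: a writer may drop identity-partitioned source columns from the data file, since their values are recoverable from the partition tuple at read time via idToConstant.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy sketch: remove identity-partition source columns from a row before it
// is written, because the read path re-injects them as constants.
class PartitionColumnPruning {
    static Map<String, Object> prune(
            Map<String, Object> row, Set<String> identityPartitionCols) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : row.entrySet()) {
            if (!identityPartitionCols.contains(field.getKey())) {
                out.put(field.getKey(), field.getValue());
            }
        }
        return out;
    }
}
```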
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java (outdated; resolved)
```java
protected abstract void assertEquals(Schema schema, List<T> expected, List<T> actual);

protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);
```

Could we just create a Record and Schema and use convertToEngine instead of this new method?

I have tried it. When I add partitionData to idToConstant, the Flink side requires it to be a Record. The RowDataConverter.convert used in convertToEngine forces the STRUCT to be converted to a Record and throws an error. For metadata processing on the Flink side, RowDataUtil.convertConstant is used instead.
Could this help?

```java
private static RowData convert(Types.StructType struct, StructLike record) {
  GenericRowData rowData = new GenericRowData(struct.fields().size());
  List<Types.NestedField> fields = struct.fields();
  for (int i = 0; i < fields.size(); i += 1) {
    Types.NestedField field = fields.get(i);
    Type fieldType = field.type();
    rowData.setField(i, convert(fieldType, record.get(i, Object.class)));
  }
  return rowData;
}
```

Notice the `StructLike record` parameter and the `record.get(i, Object.class)` call.
I converted StructLike to Record, and then used convertToEngineRecords + assertEquals for comparison, which allowed me to remove the convertConstantToEngine method.
```java
protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);

protected abstract <D> List<D> convertToPartitionIdentity(
```

Could we use an existing method to achieve this?

Along with the aforementioned changes, this part has also been optimized away.
```java
  };
}

private Map<Integer, Object> convertConstantsToEngine(
```

Could we just have the constants in a GenericRecord and convert it to the engine type?

Sorry, I'm not quite able to get the direction of the adjustment needed here. However, I have made the changes mentioned above, and I'm not sure if further adjustments are needed in this part. I would appreciate more suggestions.

This is too convoluted, and we would still need a getFieldFromEngineRow, so we are not much better off. If we still need the extra method, we might be better off having a method like:

```java
public static Object convertConstantsToEngine(Type type, Object value);
```

For Spark it could simply call SparkUtil.internalToSpark.
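The suggested single-entry-point shape could look roughly like this. This is a hedged sketch: `ToyType` stands in for Iceberg's Type, and the string tagging stands in for an engine-internal representation (Spark would delegate to SparkUtil.internalToSpark here).

```java
// Toy stand-in for Iceberg's type system; real code would switch on
// Type.typeId().
enum ToyType { STRING, LONG, STRUCT }

class EngineConversions {
    // One static method that converts an Iceberg constant value into the
    // engine's internal representation, dispatching on the field type.
    static Object convertConstantToEngine(ToyType type, Object value) {
        if (value == null) {
            return null; // unset constants stay unset
        }
        switch (type) {
            case STRING:
                // engines often use an internal string type (e.g. Spark's
                // UTF8String); the tag here is purely illustrative
                return "engine:" + value;
            case LONG:
                return value; // primitive-like values often pass through
            default:
                throw new UnsupportedOperationException(
                    "structs would be converted field by field");
        }
    }
}
```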
```java
@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testReadMetadataColumnPartitionBucketTransform(FileFormat fileFormat) throws IOException {
```

Could you help me with highlighting the differences between this test and testReadMetadataColumnPartitionIdentity?
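For context on that question: with identity partitioning the injected partition constant equals the source column value, while a bucket transform injects a hash-derived bucket number instead. A toy sketch of the idea follows; Iceberg's actual bucket transform hashes with Murmur3, so `hashCode()` here is only a simplification used to show the shape.

```java
// Toy bucket transform: maps a value to a stable bucket number in
// [0, numBuckets). The real transform uses a Murmur3 hash of the value's
// serialized form; the point is that the injected partition constant is the
// bucket number, not the original column value.
class ToyBucket {
    static int bucket(Object value, int numBuckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
}
```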
```java
@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testReadMetadataColumnPartitionEvolutionAddColumn(FileFormat fileFormat) throws IOException {
```

Could we have a test with addColumnWithDefaultReadValue?
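The suggested scenario can be sketched as: a column added after a data file was written is absent from that file, so the reader returns the column's declared default instead of null. This is a simplified illustration, not Iceberg's schema API.

```java
import java.util.Map;

// Simplified default-on-read sketch: if a projected field id is missing from
// the data file (the column was added after the file was written), fall back
// to the field's declared default value.
class DefaultReadValues {
    static Object readField(
            int fieldId,
            Map<Integer, Object> fileRow,    // values stored in the old file
            Map<Integer, Object> defaults) { // default read values by field id
        if (fileRow.containsKey(fieldId)) {
            return fileRow.get(fieldId); // newer files have the column
        }
        return defaults.get(fieldId); // older files fall back to the default
    }
}
```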
This PR adds TCK tests for metadata column reading in BaseFormatModelTests.
Metadata Columns:
Part of #15415