API, Core: Introduce classes for content stats#13933
Conversation
RussellSpitzer
left a comment
There was a problem hiding this comment.
I think this is all pretty great, I have some minor comments but I do think before we merge we should finalize the vote on the design.
3f2c780 to
73116cf
Compare
core/src/main/java/org/apache/iceberg/stats/BaseContentStats.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/stats/BaseContentStats.java
Outdated
Show resolved
Hide resolved
|
What do we think about the error handling? I see that in many places we just return -1. WDYT? |
6f7ff3f to
3667707
Compare
core/src/main/java/org/apache/iceberg/stats/BaseFieldStats.java
Outdated
Show resolved
Hide resolved
3667707 to
8e07811
Compare
core/src/main/java/org/apache/iceberg/stats/BaseFieldStats.java
Outdated
Show resolved
Hide resolved
929292b to
ac0a867
Compare
| NULL_VALUE_COUNT.fieldName(), | ||
| Types.LongType.get(), | ||
| "Total null value count"), | ||
| optional( |
There was a problem hiding this comment.
Can't we exclude NaN if the type isn't floating point?
There was a problem hiding this comment.
we can, but that implies that we have set the respective field statsStruct on each FieldStats instance so that we can properly fetch the right value whenever get(int pos, Class<X> javaClass) is called. This makes the impl more complicated and would technically require to set that statsStruct everywhere where a FieldStats instance is created.
This means that
BaseFieldStats<?> fieldStatsOne = BaseFieldStats.builder().fieldId(1).build();
BaseFieldStats<?> fieldStatsTwo = BaseFieldStats.builder().fieldId(2).build();
BaseContentStats stats =
BaseContentStats.builder()
.withTableSchema(...)
.withFieldStats(fieldStatsOne)
.withFieldStats(fieldStatsTwo)
.build();
would now become
BaseFieldStats<?> fieldStatsOne = BaseFieldStats.builder().fieldId(1).statsStruct(...).build();
BaseFieldStats<?> fieldStatsTwo = BaseFieldStats.builder().fieldId(2).statsStruct(...).build();
BaseContentStats stats =
BaseContentStats.builder()
.withTableSchema(...)
.withFieldStats(fieldStatsOne)
.withFieldStats(fieldStatsTwo)
.build();
This makes building a new FieldStats instance quite complicated. Another alternative would be to pass the statsStruct whenever the ContentStats instance is read. I did a quick POC and that would look something like this: b676e1c
432d308 to
f8a5743
Compare
| public class StatsUtil { | ||
| private static final Logger LOG = LoggerFactory.getLogger(StatsUtil.class); | ||
| static final int NUM_STATS_PER_COLUMN = 200; | ||
| static final int RESERVED_FIELD_IDS = 200; |
There was a problem hiding this comment.
We should add comments since this is confusing and related to the metadata file field reserved space.
There was a problem hiding this comment.
I've slightly renamed the constants and also added comments everywhere
|
thanks for the reviews @pvary, @stevenzwu, @danielcweeks |
This introduces just the base classes that are needed for content stats and is extracted from #13694.