API, Core: Introduce classes for content stats #13933

Merged
nastra merged 16 commits into apache:main from nastra:content-stats-apis
Nov 12, 2025
@nastra nastra commented Aug 27, 2025

This introduces just the base classes that are needed for content stats and is extracted from #13694.

@RussellSpitzer RussellSpitzer left a comment

I think this is all pretty great. I have some minor comments, but I do think we should finalize the vote on the design before we merge.

@nastra nastra force-pushed the content-stats-apis branch 2 times, most recently from 3f2c780 to 73116cf Compare September 19, 2025 08:38
pvary commented Sep 22, 2025

What do we think about the error handling? I see that in many places we just return -1.
I think we should highlight the cases where we get unexpected results. We might decide to keep writing out data when the stats encounter unexpected input, but at a minimum I think we need to log warning messages.

WDYT?
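
For illustration, here is a minimal sketch of what logging a warning alongside the -1 return could look like. It uses only `java.util.logging`; the class `StatsLookupSketch`, the method `positionOf`, and the stat names are hypothetical and not taken from this PR.

```java
import java.util.List;
import java.util.logging.Logger;

final class StatsLookupSketch {
  private static final Logger LOG = Logger.getLogger(StatsLookupSketch.class.getName());

  // Hypothetical sentinel mirroring the "-1" return value mentioned above.
  static final int UNKNOWN = -1;

  // Hypothetical stat names, used only to make the sketch self-contained.
  private static final List<String> KNOWN_STATS =
      List.of("value_count", "null_value_count", "nan_value_count");

  private StatsLookupSketch() {}

  /** Returns the position of the stat, or UNKNOWN after logging a warning. */
  static int positionOf(String statName) {
    // List.of(...) rejects null lookups, so guard against null explicitly.
    int pos = statName == null ? UNKNOWN : KNOWN_STATS.indexOf(statName);
    if (pos < 0) {
      // Surface the unexpected input instead of failing silently.
      LOG.warning(() -> "Unexpected stat name '" + statName + "', returning " + UNKNOWN);
      return UNKNOWN;
    }
    return pos;
  }
}
```

Callers still receive the sentinel and can keep writing data, but the unexpected input now leaves a trace in the logs.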

@nastra nastra requested a review from pvary September 25, 2025 10:40
@nastra nastra force-pushed the content-stats-apis branch 2 times, most recently from 929292b to ac0a867 Compare October 27, 2025 10:51
NULL_VALUE_COUNT.fieldName(),
Types.LongType.get(),
"Total null value count"),
optional(
Contributor

Can't we exclude NaN if the type isn't floating point?

Contributor Author

We can, but that implies setting the respective statsStruct on each FieldStats instance so that we can fetch the right value whenever get(int pos, Class<X> javaClass) is called. This makes the implementation more complicated and would technically require setting that statsStruct everywhere a FieldStats instance is created.

This means that

BaseFieldStats<?> fieldStatsOne = BaseFieldStats.builder().fieldId(1).build();
BaseFieldStats<?> fieldStatsTwo = BaseFieldStats.builder().fieldId(2).build();
BaseContentStats stats =
        BaseContentStats.builder()
            .withTableSchema(...)
            .withFieldStats(fieldStatsOne)
            .withFieldStats(fieldStatsTwo)
            .build();

would now become

BaseFieldStats<?> fieldStatsOne = BaseFieldStats.builder().fieldId(1).statsStruct(...).build();
BaseFieldStats<?> fieldStatsTwo = BaseFieldStats.builder().fieldId(2).statsStruct(...).build();
BaseContentStats stats =
        BaseContentStats.builder()
            .withTableSchema(...)
            .withFieldStats(fieldStatsOne)
            .withFieldStats(fieldStatsTwo)
            .build();

This makes building a new FieldStats instance quite complicated. An alternative would be to pass the statsStruct whenever the ContentStats instance is read. I did a quick POC and that would look something like this: b676e1c
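
To make the positional-access concern concrete, here is a small self-contained sketch. It assumes a simplified layout where the NaN count is only present for floating-point columns. The class `PositionalStatsSketch`, the enum `ColumnType`, and the stat names are hypothetical stand-ins, not the classes introduced in this PR.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionalStatsSketch {
  // Hypothetical stand-in for a column type; not an Iceberg class.
  enum ColumnType { LONG, STRING, FLOAT, DOUBLE }

  private PositionalStatsSketch() {}

  /** Builds a per-field stat layout, leaving out the NaN count for non-floating-point types. */
  static List<String> statsStructFor(ColumnType type) {
    List<String> fields = new ArrayList<>(List.of("value_count", "null_value_count"));
    if (type == ColumnType.FLOAT || type == ColumnType.DOUBLE) {
      fields.add("nan_value_count");
    }
    fields.add("lower_bound");
    fields.add("upper_bound");
    return List.copyOf(fields);
  }

  /** Resolves a stat name to its position in the given layout, or -1 if the layout omits it. */
  static int positionOf(List<String> statsStruct, String statName) {
    return statsStruct.indexOf(statName);
  }
}
```

With this layout, `lower_bound` sits at position 2 for a LONG column but at position 3 for a DOUBLE column, so a positional getter can only return the right value if it knows which struct it is reading from.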

@nastra nastra force-pushed the content-stats-apis branch from 432d308 to f8a5743 Compare October 30, 2025 15:42
public class StatsUtil {
  private static final Logger LOG = LoggerFactory.getLogger(StatsUtil.class);
  static final int NUM_STATS_PER_COLUMN = 200;
  static final int RESERVED_FIELD_IDS = 200;
Contributor

We should add comments here, since this is confusing and is related to the reserved field space in the metadata file.

Contributor Author

I've slightly renamed the constants and also added comments everywhere

@nastra nastra requested a review from danielcweeks November 11, 2025 15:24
@danielcweeks danielcweeks left a comment

Thanks @nastra!

nastra commented Nov 12, 2025

thanks for the reviews @pvary, @stevenzwu, @danielcweeks

@nastra nastra merged commit 2899b5a into apache:main Nov 12, 2025
43 checks passed
@nastra nastra deleted the content-stats-apis branch November 12, 2025 06:47
thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026