fix(aggs): top_hits panics with unknown fields #2803

Open
Totodore wants to merge 5 commits into quickwit-oss:main from Totodore:fix-tophits-panics

@Totodore Totodore commented Jan 14, 2026

Related to this comment:
quickwit-oss/quickwit#6088 (comment)

  • The current top_hits implementation panics when incorrect fields are specified in docvalue_fields. This leads to non-ergonomic errors in quickwit, for example when we query non-existent docvalue_fields.
  • Fixes a bug where non-glob fast fields were incorrectly matched because the dot notation of columns was incorrectly escaped.
  • Makes the allowed fields of TopHitsAggregationReq pub so that it is possible to instantiate and modify a TopHitsAggregationReq.
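
A minimal sketch of the first point, assuming nothing about tantivy's actual types: `UnknownFieldError` and `resolve_exact_field` are hypothetical names used only for illustration, and the column list is a plain slice of strings rather than a real fast field reader. It shows the intended behaviour, where an unknown field yields an error value instead of a panic.

```rust
use std::fmt;

/// Hypothetical error type standing in for a schema error.
#[derive(Debug, PartialEq)]
struct UnknownFieldError(String);

impl fmt::Display for UnknownFieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown docvalue field `{}`", self.0)
    }
}

/// Looks up an exact (non-glob) field among the known column names.
/// A missing field is reported as an `Err` rather than a panic.
fn resolve_exact_field(columns: &[&str], field: &str) -> Result<String, UnknownFieldError> {
    columns
        .iter()
        .find(|name| **name == field)
        .map(|name| (*name).to_owned())
        .ok_or_else(|| UnknownFieldError(field.to_owned()))
}
```

With this shape, a caller such as quickwit could turn the `Err` into a regular query error for the user.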

@Totodore Totodore marked this pull request as ready for review January 14, 2026 12:33
.any(|(name, _)| name.as_str() == field)
{
return Ok(vec![field.to_owned()]);
.find(|(name, _)| &name.replace(JSON_PATH_SEGMENT_SEP_STR, ".") == field)
Collaborator

That's not the way to do things. Can you look into the code and see how it is done elsewhere?

Author

I used FastFieldReaders::resolve_field, which required making it pub(crate): I don't know the type of the column when checking whether the field exists, so I can't use column_opt. Is that OK, or should I use something else?

Collaborator

There are a bunch of aggregations relying on columns.

Author

Could you elaborate?

Collaborator

There are many aggregations relying on columns, including aggregations that do not know beforehand the type of the targeted columns. The terms aggregation, for instance. Before increasing the visibility of a method, could you see if you can do what is done there?

The thing I was missing is that this aggregation type implementation is terrible and misleading.

@Totodore Totodore requested a review from fulmicoton January 14, 2026 15:03
unsupported_err("version")?;
}

self.doc_value_fields = self
Collaborator

@fulmicoton fulmicoton Jan 14, 2026

this is pretty terrible.

Can you answer the question what is "self.doc_value_fields"?

Can you investigate whether we can fix this subaggregation in a deeper way?

Author

You mean the full reallocation? In that case I agree and I can refactor that.

The only purpose of this seems to be mapping glob patterns to real fields; non-glob patterns should not trigger clones though.

Collaborator

No, the reallocation is not the problem.

We need to be quantitative in this kind of discussion: this allocation will happen once per segment. This is negligible. Same thing for FxHashMap vs HashMap in quickwit and your concern about calling a method that is public but does "more".

On the other hand, the following code (again this is not your doing) is a massive footgun.

If you try to explain to someone what doc_value_fields is, it would look something like this:

"Well, before we run validation, it is a human-readable string that describes a user input, but after validation it is a string that tantivy uses internally to address a regular field or a JSON field."

It is even exposed to the external world, because aggregations make it possible to get the list of columns used in an aggregation, as it is used for IO in quickwit.
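
One possible way to remove the ambiguity described above, sketched under assumptions: the types `RequestedField`, `ResolvedColumn` and `TopHitsFields` are hypothetical and do not exist in tantivy, and only exact names are handled (no globs, no JSON path handling). The idea is simply that the user input and the resolved column names live in separate fields with separate types, so validation never overwrites what the user wrote.

```rust
/// What the user typed in `docvalue_fields` (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestedField(pub String);

/// The name of a concrete column after validation (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ResolvedColumn(pub String);

#[derive(Debug)]
pub struct TopHitsFields {
    requested: Vec<RequestedField>,
    resolved: Vec<ResolvedColumn>,
}

impl TopHitsFields {
    pub fn new(requested: Vec<RequestedField>) -> Self {
        TopHitsFields { requested, resolved: Vec::new() }
    }

    /// Fills `resolved` from `requested`; the user input is left untouched.
    /// On error, `resolved` keeps its previous value.
    pub fn validate(&mut self, known_columns: &[&str]) -> Result<(), String> {
        let mut resolved = Vec::new();
        for field in &self.requested {
            if !known_columns.contains(&field.0.as_str()) {
                return Err(format!("unknown field `{}`", field.0));
            }
            resolved.push(ResolvedColumn(field.0.clone()));
        }
        self.resolved = resolved;
        Ok(())
    }

    pub fn requested(&self) -> &[RequestedField] {
        &self.requested
    }

    pub fn resolved(&self) -> &[ResolvedColumn] {
        &self.resolved
    }
}
```

A function that reports columns for IO could then return `resolved()` only, which makes the expected format part of the type rather than a convention.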

.any(|(name, _)| name.as_str() == field)
{
if !field.contains('*') {
reader.resolve_field(field)?.ok_or_else(|| {
Collaborator

@fulmicoton fulmicoton Jan 14, 2026

I suspect dynamic_column_handles is better (as it is public). I am not sure though

Author

dynamic_column_handles does a bit more work and internally calls resolve_field:
https://docs.rs/tantivy/latest/src/tantivy/fastfield/readers.rs.html#238-251

Collaborator

@fulmicoton fulmicoton Jan 14, 2026

OK, but it is not public :-/

Changing to pub(crate) is not so bad but unnecessary here.

let pattern = globbed_string_to_regex(field)?;
let fields = reader
.columnar()
.iter_columns()?
Collaborator

I think you mean to use it in Quickwit? I suspect it might not work if you have a large number of columns.

);

if fields.is_empty() {
return Err(TantivyError::SchemaError(format!(
Collaborator

thx

.iter_columns()?
.map(|(name, _)| {
// normalize path from internal fast field repr
name.replace(JSON_PATH_SEGMENT_SEP_STR, ".")
Collaborator

This seems wrong too

@fulmicoton
Collaborator

@Totodore the whole sub-aggregation is terrible. We should never have merged this PR originally. Can you approach it with fresh eyes and clean it up?

@Totodore
Author

Thanks for your feedback @fulmicoton. I'll try to provide a full refactor based on what I can learn from other aggregation implementations (terms, for example) rather than a small patch.

@fulmicoton
Collaborator

fulmicoton commented Jan 15, 2026

@Totodore That would be awesome! Thank you!

For the most obscure part:
Quickwit needs to pre-fetch all columns from S3 before running a search. For this reason, aggregations need to explicitly declare the list of columns they rely on. We need to make sure that the function returns the columns in the expected format.
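
To make the constraint concrete, here is a rough sketch of the contract described above. Everything in it is an assumption for illustration: the trait `DeclaresColumns`, the structs `TermsAgg` and `TopHitsAgg`, and the function `columns_to_prefetch` are hypothetical and are not tantivy or quickwit APIs.

```rust
use std::collections::BTreeSet;

/// Hypothetical trait: each aggregation reports the columns it will read.
trait DeclaresColumns {
    fn column_names(&self) -> Vec<String>;
}

/// Hypothetical stand-in for a terms aggregation on a single field.
struct TermsAgg {
    field: String,
}

/// Hypothetical stand-in for a top_hits aggregation.
struct TopHitsAgg {
    doc_value_fields: Vec<String>,
}

impl DeclaresColumns for TermsAgg {
    fn column_names(&self) -> Vec<String> {
        vec![self.field.clone()]
    }
}

impl DeclaresColumns for TopHitsAgg {
    fn column_names(&self) -> Vec<String> {
        self.doc_value_fields.clone()
    }
}

/// Deduplicated union of every declared column: the set a caller would
/// have to download before the search can run.
fn columns_to_prefetch(aggs: &[&dyn DeclaresColumns]) -> BTreeSet<String> {
    aggs.iter().flat_map(|agg| agg.column_names()).collect()
}
```

In this model the declared names must already be concrete column names; a glob such as `attributes.*` in `doc_value_fields` would be passed through unresolved, which is the difficulty raised in this thread.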

@Totodore
Author

Totodore commented Jan 25, 2026

@fulmicoton I dug a bit and I don't really see how we could provide the dynamic field pattern-matching functionality available in docvalue_fields while pre-fetching columns in quickwit.

I think there are three solutions:

  • Remove the pattern matching on column names feature entirely.
  • Keep it but make it unusable from quickwit as we cannot pre-fetch the fast fields if there are wildcards.
  • Update get_fast_field_names to take a handle to the columns list to resolve the regexes dynamically when quickwit is getting all the fastfields on the aggregation.
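
The third option could look roughly like the sketch below. It is only an illustration under assumptions: `glob_matches` and `resolve_fast_field_names` are hypothetical helpers, the column list is a plain slice of names rather than a real column handle, the matcher supports `*` only, and the internal JSON path separator is ignored.

```rust
use std::collections::BTreeSet;

/// Minimal `*`-only glob matcher, written for illustration.
fn glob_matches(pattern: &str, name: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        // No wildcard: exact comparison.
        return pattern == name;
    }
    let first = parts[0];
    let last = parts[parts.len() - 1];
    if name.len() < first.len() + last.len()
        || !name.starts_with(first)
        || !name.ends_with(last)
    {
        return false;
    }
    // Literals between wildcards must appear in order in the middle part.
    let mut rest = &name[first.len()..name.len() - last.len()];
    for part in &parts[1..parts.len() - 1] {
        match rest.find(*part) {
            Some(idx) => rest = &rest[idx + part.len()..],
            None => return false,
        }
    }
    true
}

/// Expands every requested pattern against the known column names.
/// A pattern that matches nothing is reported as an error.
fn resolve_fast_field_names(
    patterns: &[&str],
    columns: &[&str],
) -> Result<BTreeSet<String>, String> {
    let mut resolved = BTreeSet::new();
    for pattern in patterns {
        let matches: Vec<&str> = columns
            .iter()
            .copied()
            .filter(|name| glob_matches(pattern, name))
            .collect();
        if matches.is_empty() {
            return Err(format!("no column matches `{pattern}`"));
        }
        resolved.extend(matches.into_iter().map(str::to_owned));
    }
    Ok(resolved)
}
```

The trade-off is that the caller needs the list of column names before it knows which columns to fetch, and the expansion scans every column for each pattern, which may be slow on splits with a large number of columns.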
