fix(aggs): top_hits panics with unknown fields#2803
fix(aggs): top_hits panics with unknown fields#2803Totodore wants to merge 5 commits intoquickwit-oss:mainfrom
Conversation
src/aggregation/metric/top_hits.rs
Outdated
| .any(|(name, _)| name.as_str() == field) | ||
| { | ||
| return Ok(vec![field.to_owned()]); | ||
| .find(|(name, _)| &name.replace(JSON_PATH_SEGMENT_SEP_STR, ".") == field) |
There was a problem hiding this comment.
that's not the way to do things. Can you look into the code and see how it is done elsewhere?
There was a problem hiding this comment.
I used FastFieldReaders::resolve_field which required to make it pub(crate), as I don't know the type of the column when checking if the field exists so I cant use column_opt. Is it Ok? or should I use something else?
There was a problem hiding this comment.
There are a bunch of aggregations relying on columns.
There was a problem hiding this comment.
There are a many aggregations relying on columns, including aggregation that do not know beforehands the type of the targeted columns. Term aggregation for instance. Before increasing the visibility of a method, could you see if you can do what is done there?
The thing I was missing is that this aggregation type implementation is terrible and misleading.
| unsupported_err("version")?; | ||
| } | ||
|
|
||
| self.doc_value_fields = self |
There was a problem hiding this comment.
this is pretty terrible.
Can you answer the question what is "self.doc_value_fields"?
Can you investigate whether we can fix this subaggregation in a deeper way?
There was a problem hiding this comment.
You mean the full reallocation? In that case I agree and I can refactor that.
The only purpose of this seems to map globbed patterns to real fields, non glob-patterns should not trigger clones though.
There was a problem hiding this comment.
No the reallocation is not the problem.
We need to be quantitative in this kind of discussion: this allocation will happen once per segment. This is negligible. Same thing for FxHashMap vs HashMap in quickwit and your concern about calling a method that is public but does "more".
On the other hand, the following code (again this is not your doing) is a massive footgun.
If you try to explain to someone what doc_value_fields is it would look something like this.
"well before we run validation, it is a human readable string that describes a user input, but after validation it is a string that tantivy uses to address the regular field or a json field that it uses internally"
It is even exposed to the external world because aggregations make it possible to get the list of columns used in an aggregation as it used for IO in quickwit.
| .any(|(name, _)| name.as_str() == field) | ||
| { | ||
| if !field.contains('*') { | ||
| reader.resolve_field(field)?.ok_or_else(|| { |
There was a problem hiding this comment.
I suspect dynamic_column_handles is better (as it is public). I am not sure though
There was a problem hiding this comment.
dynamic_column_handles does a bit more work and internally calls resolve_field:
https://docs.rs/tantivy/latest/src/tantivy/fastfield/readers.rs.html#238-251
There was a problem hiding this comment.
ok, but can it is not public :-/
Changing to pub(crate) is not so bad but unnecessary here.
| let pattern = globbed_string_to_regex(field)?; | ||
| let fields = reader | ||
| .columnar() | ||
| .iter_columns()? |
There was a problem hiding this comment.
I think you mean to use it in Quickwit? I suspect it might not work if you have a large number of columns.
| ); | ||
|
|
||
| if fields.is_empty() { | ||
| return Err(TantivyError::SchemaError(format!( |
| .iter_columns()? | ||
| .map(|(name, _)| { | ||
| // normalize path from internal fast field repr | ||
| name.replace(JSON_PATH_SEGMENT_SEP_STR, ".") |
|
@Totodore the whole sub aggregation is terrible. We should have never merged this PR originally. Can you approach it with clean eyes and clean it up? |
|
Thanks for your feedbacks @fulmicoton I'll try to provide a full refacto based on what I can learn from other aggregation implementations (terms for example,) rather than a small patch. |
|
@Totodore That would be awesome! Thank you! For the most obscure part: |
|
@fulmicoton I dig a bit and I don't really see how we could provide the dynamic field with pattern matching functionality available in I think there are three solutions:
|
Related to this comment:
quickwit-oss/quickwit#6088 (comment)
docvalue_fields. This leads to non-ergonomic errors in quickwit for example when we query non-existentdocvalue_fields.TopHitsAggregationReqpub so it is possible to instantiate and modify aTopHitsAggregationReq.