Problem Statement
As one of the breaking changes for 3.0, we [removed the DSL format](#3367). I think that it's worth revisiting this and potentially adding a new simplified JSON format.
Current State
Why did we remove it?
Originally we directly copied the underlying OpenSearch DSL format from the internal search request. This includes the hits, as well as lots of extra metadata:
```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "fruits",
        "_id": "d88d69445faebeaaa9c0683d",
        "_score": 1.0,
        "_source": {
          "name": "strawberry",
          "color": "red",
          "approx_size_in": 2,
          "has_pit": false
        }
      },
      {
        "_index": "fruits",
        "_id": "8bec12157951f7e9003a13b4",
        "_score": 1.0,
        "_source": {
          "name": "apple",
          "color": "red",
          "approx_size_in": 4,
          "has_pit": false
        }
      }
    ]
  }
}
```

The problem is that this doesn't work for more sophisticated queries. For anything that requires some level of postprocessing (e.g. spath extraction), it's not obvious what the result should look like, and the metadata might not make sense anymore (e.g. whether the time taken should include the spath extraction time). It also isn't clear how to express this format for join queries, which span multiple indices and thus invalidate the `_shards` metrics. Even for the queries that do work, directly returning DSL without going through the normal result serialization system punches a hole in our query execution that's hard to work around.
Why should we bring it back?
The primary JSON interface we ship today is JDBC, which orients the records around a list of columns.
```json
{
  "schema": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "approx_size_in",
      "type": "bigint"
    },
    {
      "name": "has_pit",
      "type": "boolean"
    },
    {
      "name": "color",
      "type": "string"
    }
  ],
  "datarows": [
    [
      "strawberry",
      2,
      false,
      "red"
    ],
    [
      "apple",
      4,
      false,
      "red"
    ]
  ],
  "total": 2,
  "size": 2
}
```

The problem is that this format isn't very convenient for standard JSON parsing flows: you need one pass to load the columns, and another to correlate them with the data rows. Without custom logic it's not easy to do any analysis of the results, and when you have lots of columns it's hard to tell which one is which. That custom-logic requirement breaks tools like jq. This is inconvenient enough that I use a custom script to fix it:
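To make the two-pass requirement concrete, here is a minimal Python sketch (variable names are illustrative) of the correlation step every client of the JDBC format has to perform:

```python
# Minimal sketch of the client-side correlation the JDBC format forces:
# one pass over "schema" to recover column names, then a second over
# "datarows" to zip the names back into row objects.
jdbc_response = {
    "schema": [
        {"name": "name", "type": "string"},
        {"name": "approx_size_in", "type": "bigint"},
    ],
    "datarows": [
        ["strawberry", 2],
        ["apple", 4],
    ],
}

columns = [col["name"] for col in jdbc_response["schema"]]
rows = [dict(zip(columns, row)) for row in jdbc_response["datarows"]]
# rows is now a plain list of objects, e.g. {"name": "strawberry", "approx_size_in": 2}
```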
```sh
$ echo '{"query":"source=fruits | head 2"}' | xh post localhost:9200/_plugins/_ppl | unjdbc
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
```

Only then can I natively pass query output to these tools.
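For reference, the shape of such a script can be approximated in a few lines of Python. This is a hedged sketch of what a tool like `unjdbc` might do, not the actual script (stdin handling from the pipeline above is left out for brevity):

```python
import json

def unjdbc(jdbc_text: str) -> str:
    """Convert a JDBC-format response string into NDJSON, one object per row.

    Illustrative sketch only: recover column names from "schema", then
    emit each row of "datarows" as a standalone JSON object.
    """
    resp = json.loads(jdbc_text)
    cols = [c["name"] for c in resp["schema"]]
    return "\n".join(
        json.dumps(dict(zip(cols, row)), sort_keys=True)
        for row in resp["datarows"]
    )
```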
Long-term Goals
- Simplify JSON result parsing for users.
- Enable easier integration with other tools that process OpenSearch DSL records.
- Provide an easier migration path for 2.x customers reliant on this format.
Proposal
I propose re-introducing at least one JSON format. I think there are three formats worth considering:

- `json_array`: an ordinary JSON array of objects, compatible with standard parsers.

```json
[{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"},
 {"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}]
```

- `ndjson`: an NDJSON stream, which works better with streaming parsers like jq. It's also theoretically easier to stream over transport if we'd like to optimize latency here in the future (but if we need that, we should invest in shipping a binary format).

```
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
```

- `dsl`: a DSL-style response that has `hits` and `_source` nesting and grants compatibility to 2.x clients that don't want to migrate formats. We don't need to reimplement all of the metadata in one batch; I think it's safe to assume that "most" users are just pulling out the `_source` records.

```jsonc
{
  "took": 1, // <-- (Optional) Implemented by timing the query if we detect this format.
  "hits": {
    "hits": [
      {
        "_index": "fruits", // <-- Metadata optional; we can include only `_source` if it's complex.
        "_id": "d88d69445faebeaaa9c0683d",
        "_source": {
          "name": "strawberry",
          "color": "red",
          "approx_size_in": 2,
          "has_pit": false
        }
      },
      {
        "_index": "fruits", // <-- Leave it off for queries that don't have a clear index value (e.g. joins, aggregations).
        "_id": "8bec12157951f7e9003a13b4",
        "_source": {
          "name": "apple",
          "color": "red",
          "approx_size_in": 4,
          "has_pit": false
        }
      }
    ]
  }
}
```

Approach
The actual formats can be implemented one by one; I don't think there's any cross-dependency (maybe share object serialization if we need to do column traversal, but that's an implementation nit).
To give maximum flexibility to users and enable seamless migration, we can define a `json` format that can default to any of the others via a config like `plugins.query.default_json_format`, which takes any of the above values. This aliases `json` to be equivalent to `?format=ndjson`, `?format=dsl`, and so on. Users who are migrating 2.x clusters can set this config to `dsl`:
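A hedged sketch of how that alias could resolve (the setting name is from this proposal; the function, format set, and default are illustrative, not existing plugin code):

```python
# Illustrative sketch only: resolve a requested ?format= value through the
# proposed plugins.query.default_json_format cluster setting.
FORMATS = {"json_array", "ndjson", "dsl", "jdbc"}

def resolve_format(requested: str, default_json_format: str = "json_array") -> str:
    # "json" is an alias whose target is controlled by the cluster setting.
    if requested == "json":
        requested = default_json_format
    if requested not in FORMATS:
        raise ValueError(f"unknown format: {requested}")
    return requested
```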
```
PUT _cluster/settings
{
  "persistent": { "plugins.query.default_json_format": "dsl" }
}
```

I think the nicest default for most end users will be `json_array`.
Alternatives
- Do nothing: require users to continue implementing their own field mappers.
- We can freely implement or not implement any of these formats. `ndjson` is probably the best candidate for exclusion; I selfishly would like it, as I use it regularly as part of my development, and if we have the infrastructure to implement the other two then it seems easy to add on.
- For the config, this introduces the scenario of users breaking themselves by changing it, so it may be safer to just alias `json` with a documentation note like "This is an alias to `dsl` provided for backward compatibility."
  - The downside of this is that I think a reasonable user would try `?format=json` and expect good results; there's a world where it becomes sort-of tribal knowledge to say "Oh, you don't want to use `json`, you want to use `json_array`."
- We could also pick a better `json` default and have `dsl` be separate, but this requires user code to seamlessly handle using `dsl` on 3.x and `json` on 2.x.
Implementation Discussion
Not particularly interesting: we can register a new formatter that takes a `QueryResponse` and does arbitrary processing. The existing JDBC formatters already give a pretty good template for this.
Metadata can be added to `dsl` incrementally as needed; an initial implementation might only have `hits` and `_source`.
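As a sketch of that minimal initial implementation (function and parameter names are hypothetical, not the plugin's actual API):

```python
# Illustrative sketch of a minimal dsl formatter: wrap plain row objects
# in a hits/_source envelope, emitting per-hit metadata only when we
# actually have it (e.g. omit _index for joins and aggregations).
def to_dsl(rows, index=None, ids=None):
    hits = []
    for i, row in enumerate(rows):
        hit = {}
        if index is not None:
            hit["_index"] = index
        if ids is not None:
            hit["_id"] = ids[i]
        hit["_source"] = row
        hits.append(hit)
    return {"hits": {"hits": hits}}
```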
`took` would be a nice piece of metadata to also put on the existing JDBC format; it could be implemented by putting a simple timer around the query execution. This could help users debug backend vs. frontend latency at a glance.
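The timer could be as simple as a monotonic clock around execution; a hedged sketch (the wrapper name is mine, and the real formatter would thread this through the existing response-building path rather than wrap the call site):

```python
import time

def execute_with_took(execute_query):
    """Run a query callable and report elapsed wall time as a "took" field.

    Sketch only: uses a monotonic clock so the measurement is immune to
    system clock adjustments, and reports milliseconds like the DSL field.
    """
    start = time.monotonic()
    result = execute_query()
    took_ms = int((time.monotonic() - start) * 1000)
    return {"took": took_ms, "result": result}
```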