
[RFC] Reintroduce JSON formats for OpenSearch SQL & PPL #5245

@Swiddis

Problem Statement

As one of the breaking changes for 3.0, we [removed the DSL format](#3367). I think that it's worth revisiting this and potentially adding a new simplified JSON format.

Current State

Why did we remove it?

Originally we directly copied the underlying OpenSearch DSL format from the internal search request. This includes the hits, as well as lots of extra metadata:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 12,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "fruits",
                "_id": "d88d69445faebeaaa9c0683d",
                "_score": 1.0,
                "_source": {
                    "name": "strawberry",
                    "color": "red",
                    "approx_size_in": 2,
                    "has_pit": false
                }
            },
            {
                "_index": "fruits",
                "_id": "8bec12157951f7e9003a13b4",
                "_score": 1.0,
                "_source": {
                    "name": "apple",
                    "color": "red",
                    "approx_size_in": 4,
                    "has_pit": false
                }
            }
        ]
    }
}

The problem is that this doesn't work for more sophisticated queries. For anything that requires some level of postprocessing (e.g. spath extraction), it's not obvious what the result should look like, and the metadata might not make sense anymore (e.g. the time taken should include the spath extraction time). It also isn't clear how to express this format for join queries, which span multiple indices and thus invalidate the _shards metrics. Even for the queries that work, directly returning DSL without going through the normal result serialization system punches a hole in our query execution that's hard to work around.

Why should we bring it back?

The primary JSON interface we ship today is JDBC, which orients the records around a list of columns.

{
    "schema": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "approx_size_in",
            "type": "bigint"
        },
        {
            "name": "has_pit",
            "type": "boolean"
        },
        {
            "name": "color",
            "type": "string"
        }
    ],
    "datarows": [
        [
            "strawberry",
            2,
            false,
            "red"
        ],
        [
            "apple",
            4,
            false,
            "red"
        ]
    ],
    "total": 2,
    "size": 2
}

The problem is that this format isn't convenient for standard JSON parsing flows: you need one pass to load the columns and another to correlate them with the data rows. Without custom logic it's hard to do any analysis of the results, and when there are many columns it's hard to tell which value belongs to which column. That custom-logic requirement breaks tools like jq. This is inconvenient enough that I use a custom script to fix it:

$ echo '{"query":"source=fruits | head 2"}' | xh post localhost:9200/_plugins/_ppl | unjdbc
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}

Only then can I natively pass query output to these tools.
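The core of that transformation is small. Here is a hypothetical Python sketch of what such a script does (the real `unjdbc` may differ): re-key each data row by the column names from the schema and emit one JSON object per row.

```python
import json

def unjdbc(jdbc_response: dict) -> list[str]:
    """Re-key JDBC-format rows by column name, one JSON object per row.

    A hypothetical reimplementation of the script's core logic.
    """
    columns = [col["name"] for col in jdbc_response["schema"]]
    return [
        json.dumps(dict(zip(columns, row)), sort_keys=True)
        for row in jdbc_response["datarows"]
    ]

# The JDBC response from the example above:
response = {
    "schema": [
        {"name": "name", "type": "string"},
        {"name": "approx_size_in", "type": "bigint"},
        {"name": "has_pit", "type": "boolean"},
        {"name": "color", "type": "string"},
    ],
    "datarows": [
        ["strawberry", 2, False, "red"],
        ["apple", 4, False, "red"],
    ],
    "total": 2,
    "size": 2,
}

for line in unjdbc(response):
    print(line)
# {"approx_size_in": 2, "color": "red", "has_pit": false, "name": "strawberry"}
# {"approx_size_in": 4, "color": "red", "has_pit": false, "name": "apple"}
```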

Long-term Goals

  • Simplify JSON result parsing for users.
  • Enable easier integration with other tools that process OpenSearch DSL records.
  • Provide an easier migration path for 2.x customers reliant on this format.

Proposal

I propose re-introducing at least one JSON format. I think there are three formats worth considering:

  • json_array: an ordinary JSON array of objects, compatible with standard parsers.
[{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"},
 {"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}]
  • ndjson: a newline-delimited JSON stream, which works better with streaming parsers like jq. It's also theoretically easier to stream over transport if we'd like to optimize latency in the future (though if we need that, we should invest in shipping a binary format).
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
  • dsl: a DSL-style response with hits and _source nesting, providing compatibility for 2.x clients that don't want to migrate formats. We don't need to reimplement all of the metadata in one batch; it's safe to assume that "most" users are just pulling out the _source records.
{
    "took": 1, // <-- (Optional) Implemented by timing the query if we detect this format.
    "hits": {
        "hits": [
            {
                "_index": "fruits", // <-- Metadata optional; for complex queries we can include only `_source`.
                "_id": "d88d69445faebeaaa9c0683d",
                "_source": {
                    "name": "strawberry",
                    "color": "red",
                    "approx_size_in": 2,
                    "has_pit": false
                }
            },
            {
                "_index": "fruits", // <-- Leave it off for queries that don't have a clear index value (e.g. joins, aggregations)
                "_id": "8bec12157951f7e9003a13b4",
                "_source": {
                    "name": "apple",
                    "color": "red",
                    "approx_size_in": 4,
                    "has_pit": false
                }
            }
        ]
    }
}
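To make the shared serialization story concrete, here is a rough sketch (in Python with illustrative names; the plugin itself is Java) of how all three proposed formats could be rendered from the same column-oriented result:

```python
import json

def rows_as_objects(schema: list, datarows: list) -> list:
    """Pair each row's values with the column names from the schema."""
    columns = [col["name"] for col in schema]
    return [dict(zip(columns, row)) for row in datarows]

def format_json_array(schema, datarows) -> str:
    # One JSON array of objects, friendly to standard parsers.
    return json.dumps(rows_as_objects(schema, datarows))

def format_ndjson(schema, datarows) -> str:
    # One JSON object per line, friendly to streaming tools like jq.
    return "\n".join(json.dumps(obj) for obj in rows_as_objects(schema, datarows))

def format_dsl(schema, datarows) -> str:
    # Minimal DSL shape: hits/_source only. Metadata like _index/_id
    # would be added only when the query makes it unambiguous.
    return json.dumps({
        "hits": {"hits": [{"_source": obj} for obj in rows_as_objects(schema, datarows)]}
    })

schema = [{"name": "name", "type": "string"}, {"name": "color", "type": "string"}]
rows = [["strawberry", "red"], ["apple", "red"]]

print(format_ndjson(schema, rows))
```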

Approach

The actual formats can be implemented one by one; I don't think there's any cross-dependency (they might share object serialization if we need column traversal, but that's an implementation detail).

To give users maximum flexibility and enable seamless migration, we can define a json format that defaults to any of the others via a config like plugins.query.default_json_format, which takes any of the above values. This aliases json to be equivalent to ?format=ndjson, ?format=dsl, and so on. Users migrating 2.x clusters can set this config to dsl:

PUT _cluster/settings
{
  "persistent": { "plugins.query.default_json_format": "dsl" }
}

I think the nicest default for most end users will be json_array.
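The alias resolution itself is a small lookup. A hypothetical sketch, assuming the proposed setting name (everything else here is illustrative, not the plugin's actual API):

```python
# Formats the request handler would know about; `jdbc` is today's default.
KNOWN_FORMATS = {"json_array", "ndjson", "dsl", "jdbc"}

def resolve_format(requested: str, settings: dict) -> str:
    """Map ?format=json to whichever concrete format the cluster default names."""
    if requested == "json":
        requested = settings.get("plugins.query.default_json_format", "json_array")
    if requested not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {requested}")
    return requested

# A migrating 2.x cluster sets the default to `dsl`:
print(resolve_format("json", {"plugins.query.default_json_format": "dsl"}))  # dsl
print(resolve_format("json", {}))  # json_array
```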

Alternatives

  • Do nothing: Require users to continue implementing their own field mappers.
  • We can freely implement or not-implement any of these formats.
    • ndjson is probably the best candidate for exclusion. I selfishly want it since I use it regularly during development, and if we have the infrastructure to implement the other two, it seems easy to add on.
  • The config introduces the possibility of users breaking their own clients by changing it, so it may be safer to just alias json with a documentation note like "This is an alias to dsl provided for backward compatibility."
    • The downside is that a reasonable user would try ?format=json and expect good results; there's a world where it becomes tribal knowledge to say "Oh, you don't want json, you want json_array."
    • We could also pick a better json default and keep dsl separate, but this requires user code to seamlessly handle dsl on 3.x and json on 2.x.

Implementation Discussion

Not particularly interesting: we can register a new formatter that takes a QueryResponse and does arbitrary processing. The existing JDBC formatters already give a pretty good template for this.

Metadata can be added to dsl incrementally as needed; an initial implementation might only have hits and _source.

"took" would be a nice piece of metadata to also add to the existing JDBC format; it could be implemented by putting a simple timer around query execution. This would help users distinguish backend from frontend latency at a glance.
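A minimal sketch of that timer, assuming a hypothetical execution entry point (the real plugin would wrap its Java query executor the same way):

```python
import time

def execute_with_took(execute_query, request):
    """Wrap a query executor with a monotonic timer so formatters can
    report a `took` field in milliseconds. `execute_query` is a stand-in
    for the real execution entry point.
    """
    start = time.monotonic()
    result = execute_query(request)
    took_ms = int((time.monotonic() - start) * 1000)
    return {"took": took_ms, **result}

response = execute_with_took(lambda q: {"datarows": []}, "source=fruits")
```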
