
[RFC] Reintroduce JSON formats for OpenSearch SQL & PPL #5245

@Swiddis

Problem Statement

As one of the breaking changes for 3.0, we [removed the DSL format](#3367). I think that it's worth revisiting this and potentially adding a new simplified JSON format.

Current State

Why did we remove it?

Originally we directly copied the underlying OpenSearch DSL format from the internal search request. This includes the hits, as well as lots of extra metadata:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 12,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "fruits",
                "_id": "d88d69445faebeaaa9c0683d",
                "_score": 1.0,
                "_source": {
                    "name": "strawberry",
                    "color": "red",
                    "approx_size_in": 2,
                    "has_pit": false
                }
            },
            {
                "_index": "fruits",
                "_id": "8bec12157951f7e9003a13b4",
                "_score": 1.0,
                "_source": {
                    "name": "apple",
                    "color": "red",
                    "approx_size_in": 4,
                    "has_pit": false
                }
            }
        ]
    }
}

The problem is that this doesn't work for more sophisticated queries. For anything that requires some level of postprocessing (e.g. spath extraction), it's not obvious what the result should look like, and the metadata might not make sense anymore (e.g. the time taken should include the spath extraction time). It also isn't clear how to express this format for join queries, which span multiple indices and thus invalidate the _shards metrics. Even for the queries that work, directly returning DSL without going through the normal result serialization system punches a hole in our query execution that's hard to work around.

Why should we bring it back?

The primary JSON interface we ship today is JDBC, which orients the records around a list of columns.

{
    "schema": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "approx_size_in",
            "type": "bigint"
        },
        {
            "name": "has_pit",
            "type": "boolean"
        },
        {
            "name": "color",
            "type": "string"
        }
    ],
    "datarows": [
        [
            "strawberry",
            2,
            false,
            "red"
        ],
        [
            "apple",
            4,
            false,
            "red"
        ]
    ],
    "total": 2,
    "size": 2
}

The problem is that this format isn't convenient for standard JSON parsing flows: you need one pass to load the columns and another to correlate them with the data rows. Without custom logic it's hard to do any analysis of the results, and when there are many columns it's hard to tell which value belongs to which column. That custom-logic requirement breaks tools like jq. This is inconvenient enough that I use a custom script to fix it:

$ echo '{"query":"source=fruits | head 2"}' | xh post localhost:9200/_plugins/_ppl | unjdbc
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}

Only then can I natively pass query output to these tools.
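The core of that transformation is small. Here is a hypothetical Python sketch of what such a script does (the real `unjdbc` may differ): re-key each data row by the column names from the schema and emit one JSON object per row.

```python
import json

def unjdbc(jdbc_response: dict) -> list[str]:
    """Re-key JDBC-format rows by column name, one JSON object per row.

    A hypothetical reimplementation of the script's core logic.
    """
    columns = [col["name"] for col in jdbc_response["schema"]]
    return [
        json.dumps(dict(zip(columns, row)), sort_keys=True)
        for row in jdbc_response["datarows"]
    ]

# The JDBC response from the example above:
response = {
    "schema": [
        {"name": "name", "type": "string"},
        {"name": "approx_size_in", "type": "bigint"},
        {"name": "has_pit", "type": "boolean"},
        {"name": "color", "type": "string"},
    ],
    "datarows": [
        ["strawberry", 2, False, "red"],
        ["apple", 4, False, "red"],
    ],
    "total": 2,
    "size": 2,
}

for line in unjdbc(response):
    print(line)
# {"approx_size_in": 2, "color": "red", "has_pit": false, "name": "strawberry"}
# {"approx_size_in": 4, "color": "red", "has_pit": false, "name": "apple"}
```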

Long-term Goals

  • Simplify JSON result parsing for users.
  • Enable easier integration with other tools that process OpenSearch DSL records.
  • Provide an easier migration path for 2.x customers reliant on this format.

Proposal

I propose re-introducing at least one JSON format. I think there are three formats worth considering:

  • json_array: an ordinary JSON array of objects, compatible with standard parsers.
[{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"},
 {"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}]
  • ndjson: a newline-delimited JSON stream, which works better with streaming parsers like jq. It's also theoretically easier to stream over transport if we'd like to optimize latency in the future (though if we need that, we should invest in shipping a binary format).
{"approx_size_in":2,"color":"red","has_pit":false,"name":"strawberry"}
{"approx_size_in":4,"color":"red","has_pit":false,"name":"apple"}
  • dsl: a DSL-style response with hits and _source nesting, providing compatibility for 2.x clients that don't want to migrate formats. We don't need to reimplement all of the metadata in one batch; it's safe to assume that "most" users are just pulling out the _source records.
{
    "took": 1, // <-- (Optional) Implemented by timing the query if we detect this format.
    "hits": {
        "hits": [
            {
                "_index": "fruits", // <-- Metadata optional; for complex queries we can include only `_source`.
                "_id": "d88d69445faebeaaa9c0683d",
                "_source": {
                    "name": "strawberry",
                    "color": "red",
                    "approx_size_in": 2,
                    "has_pit": false
                }
            },
            {
                "_index": "fruits", // <-- Leave it off for queries that don't have a clear index value (e.g. joins, aggregations)
                "_id": "8bec12157951f7e9003a13b4",
                "_source": {
                    "name": "apple",
                    "color": "red",
                    "approx_size_in": 4,
                    "has_pit": false
                }
            }
        ]
    }
}
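To make the shared serialization story concrete, here is a rough sketch (in Python with illustrative names; the plugin itself is Java) of how all three proposed formats could be rendered from the same column-oriented result:

```python
import json

def rows_as_objects(schema: list, datarows: list) -> list:
    """Pair each row's values with the column names from the schema."""
    columns = [col["name"] for col in schema]
    return [dict(zip(columns, row)) for row in datarows]

def format_json_array(schema, datarows) -> str:
    # One JSON array of objects, friendly to standard parsers.
    return json.dumps(rows_as_objects(schema, datarows))

def format_ndjson(schema, datarows) -> str:
    # One JSON object per line, friendly to streaming tools like jq.
    return "\n".join(json.dumps(obj) for obj in rows_as_objects(schema, datarows))

def format_dsl(schema, datarows) -> str:
    # Minimal DSL shape: hits/_source only. Metadata like _index/_id
    # would be added only when the query makes it unambiguous.
    return json.dumps({
        "hits": {"hits": [{"_source": obj} for obj in rows_as_objects(schema, datarows)]}
    })

schema = [{"name": "name", "type": "string"}, {"name": "color", "type": "string"}]
rows = [["strawberry", "red"], ["apple", "red"]]

print(format_ndjson(schema, rows))
```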

Approach

The actual formats can be implemented one by one; I don't think there's any cross-dependency (they might share object serialization if we need column traversal, but that's an implementation detail).

To give users maximum flexibility and enable seamless migration, we can define a json format that defaults to any of the others via a config like plugins.query.default_json_format, which takes any of the above values. This aliases json to be equivalent to ?format=ndjson, ?format=dsl, and so on. Users migrating 2.x clusters can set this config to dsl:

PUT _cluster/settings
{
  "persistent": { "plugins.query.default_json_format": "dsl" }
}

I think the nicest default for most end users will be json_array.
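The alias resolution itself is a small lookup. A hypothetical sketch, assuming the proposed setting name (everything else here is illustrative, not the plugin's actual API):

```python
# Formats the request handler would know about; `jdbc` is today's default.
KNOWN_FORMATS = {"json_array", "ndjson", "dsl", "jdbc"}

def resolve_format(requested: str, settings: dict) -> str:
    """Map ?format=json to whichever concrete format the cluster default names."""
    if requested == "json":
        requested = settings.get("plugins.query.default_json_format", "json_array")
    if requested not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {requested}")
    return requested

# A migrating 2.x cluster sets the default to `dsl`:
print(resolve_format("json", {"plugins.query.default_json_format": "dsl"}))  # dsl
print(resolve_format("json", {}))  # json_array
```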

Alternatives

  • Do nothing: Require users to continue implementing their own field mappers.
  • We can freely implement or not-implement any of these formats.
    • ndjson is probably the best candidate for exclusion. I selfishly want it since I use it regularly during development, and if we have the infrastructure to implement the other two, it seems easy to add on.
  • The config introduces the possibility of users breaking their own clients by changing it, so it may be safer to just alias json with a documentation note like "This is an alias to dsl provided for backward compatibility."
    • The downside is that a reasonable user would try ?format=json and expect good results; there's a world where it becomes tribal knowledge to say "Oh, you don't want json, you want json_array."
    • We could also pick a better json default and keep dsl separate, but this requires user code to seamlessly handle dsl on 3.x and json on 2.x.

Implementation Discussion

Not particularly interesting: we can register a new formatter that takes a QueryResponse and does arbitrary processing. The existing JDBC formatters already give a pretty good template for this.

Metadata can be added to dsl incrementally as needed; an initial implementation might only have hits and _source.

"took" would be a nice piece of metadata to also add to the existing JDBC format; it could be implemented by putting a simple timer around query execution. This would help users distinguish backend from frontend latency at a glance.
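A minimal sketch of that timer, assuming a hypothetical execution entry point (the real plugin would wrap its Java query executor the same way):

```python
import time

def execute_with_took(execute_query, request):
    """Wrap a query executor with a monotonic timer so formatters can
    report a `took` field in milliseconds. `execute_query` is a stand-in
    for the real execution entry point.
    """
    start = time.monotonic()
    result = execute_query(request)
    took_ms = int((time.monotonic() - start) * 1000)
    return {"took": took_ms, **result}

response = execute_with_took(lambda q: {"datarows": []}, "source=fruits")
```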
