Add proposal for per-tenant TSDB status API #7335

Open

CharlieTLe wants to merge 1 commit into cortexproject:master from CharlieTLe:proposal/per-tenant-tsdb-status-api

Member

CharlieTLe commented Mar 7, 2026

Design proposal for the /api/v1/status/tsdb endpoint that exposes per-tenant TSDB head cardinality statistics. Covers architecture (Distributor fan-out to Ingesters), gRPC definitions, aggregation logic, Prometheus compatibility trade-offs, extensibility to long-term storage, and Distributor vs Querier routing alternatives.


          Add proposal for per-tenant TSDB status API

bd518d8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>

pull-request-size bot added the size/L label

dosubot bot added the type/feature label

CharlieTLe mentioned this pull request

Add per-tenant TSDB cardinality status API endpoint #7332

Open

4 tasks

yeya24 reviewed


docs/proposals/per-tenant-tsdb-status-api.md

    
              Currently, Cortex tenants lack visibility into which metrics, labels, and label-value pairs contribute the most series in ingesters. Without this information, debugging high-cardinality issues requires operators to inspect TSDB internals directly on ingester instances, which is impractical in a multi-tenant, distributed environment.

              Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardinality statistics from the TSDB head. This proposal brings equivalent functionality to Cortex as a multi-tenant, distributed API.

Contributor

yeya24 Mar 9, 2026

I am not a fan of the TSDB status API name. The Prometheus API might change and add more fields. Would a dedicated `api/v1/cardinality` endpoint be better?

docs/proposals/per-tenant-tsdb-status-api.md

    
              ## Out of Scope

              - **Long-term storage cardinality analysis**: This endpoint only covers in-memory TSDB head data in ingesters. Analyzing cardinality across compacted blocks in object storage is a separate concern. A future long-term cardinality API could reuse portable fields (see [Extensibility](#extensibility-to-long-term-storage)) or introduce a separate endpoint.

Contributor

yeya24 Mar 9, 2026

Do we plan to have a different API for long-term storage cardinality? We should aim for the same API endpoint, even though we don't have to design for it now.

docs/proposals/per-tenant-tsdb-status-api.md

    
              Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should:

              1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format.

Contributor

yeya24 Mar 9, 2026

I am not sure this needs to be part of the goal. Does it need to be compatible?
I think our API response format is already incompatible today.

docs/proposals/per-tenant-tsdb-status-api.md

    
              ```

              - **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication).

              - **Query Parameter**: `limit` (optional, default 10) - controls the number of top items returned per category.
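A minimal client-side sketch of the request described in the excerpt above: a GET to `/api/v1/status/tsdb` carrying the `X-Scope-OrgID` header and the optional `limit` parameter. The function name `newTSDBStatusRequest`, the base URL, and the tenant ID are hypothetical and chosen only for illustration; the proposal does not define a client.

```go
package main

import (
	"net/http"
	"net/url"
	"strconv"
)

// newTSDBStatusRequest builds a GET request for the proposed endpoint.
// The helper name and its arguments are illustrative assumptions, not
// part of the proposal.
func newTSDBStatusRequest(baseURL, tenantID string, limit int) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	// Append the endpoint path to whatever prefix the deployment uses.
	u = u.JoinPath("api", "v1", "status", "tsdb")

	// `limit` is optional; when it is omitted the server-side default
	// (10, per the proposal) applies.
	q := u.Query()
	if limit > 0 {
		q.Set("limit", strconv.Itoa(limit))
	}
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	// Standard Cortex tenant header.
	req.Header.Set("X-Scope-OrgID", tenantID)
	return req, nil
}
```

Passing the result to `http.DefaultClient.Do` would send the request; the shape of the response body is still under discussion in this review, so it is not decoded here.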

Contributor

yeya24 Mar 9, 2026

How about `start` and `end` parameters?

docs/proposals/per-tenant-tsdb-status-api.md

    
              message TSDBStatusResponse {

                uint64 num_series = 1;

                int64 min_time = 2;

                int64 max_time = 3;

Contributor

yeya24 Mar 9, 2026

Do we need `min_time` and `max_time`? How do we aggregate them in the final response: `min(min_t)` and `max(max_t)`?
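If the fields are kept, the aggregation suggested in this comment could look roughly like the sketch below: take the smallest `min_time` and the largest `max_time` across the per-ingester responses. The type `headTimeRange` and the function `mergeTimeRanges` are hypothetical names, not taken from the proposal, and the sketch assumes every ingester reports a valid range.

```go
package main

// headTimeRange mirrors only the min_time and max_time fields of the
// proposed TSDBStatusResponse. The type name is an illustrative assumption.
type headTimeRange struct {
	MinTime int64
	MaxTime int64
}

// mergeTimeRanges returns min(min_t) and max(max_t) over all ingester
// responses. The boolean is false when there is nothing to merge.
func mergeTimeRanges(ranges []headTimeRange) (headTimeRange, bool) {
	if len(ranges) == 0 {
		return headTimeRange{}, false
	}
	merged := ranges[0]
	for _, r := range ranges[1:] {
		if r.MinTime < merged.MinTime {
			merged.MinTime = r.MinTime
		}
		if r.MaxTime > merged.MaxTime {
			merged.MaxTime = r.MaxTime
		}
	}
	return merged, true
}
```

Unlike series counts, this merge needs no correction for replication: the minimum and maximum are unchanged when the same samples are reported by several replicas.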

docs/proposals/per-tenant-tsdb-status-api.md

    
              2. **`chunkCount` omitted**: Prometheus includes a `chunkCount` field (from `prometheus_tsdb_head_chunks`). In a distributed system with replication, chunk counts across ingesters cannot be meaningfully aggregated — chunks are an ingester-local storage detail, and summing/dividing by the replication factor does not produce a useful number.

              **Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries.

Contributor

yeya24 Mar 9, 2026

Does any Prometheus tool consume this today? Why is compatibility a concern?

docs/proposals/per-tenant-tsdb-status-api.md

    
              | `labelValueCountByLabelName` | No | Portable to block storage |
              | `seriesCountByLabelValuePair` | No | Portable to block storage |
              | `memoryInBytesByLabelName` | **Yes** | In-memory byte usage has no analogue in object storage |
              | `minTime` / `maxTime` | **Yes** | Reflects head time range, not total storage |

Contributor

yeya24 Mar 9, 2026

Do we need to add those head-specific fields?

