
feat: add local telemetry playground and monitoring docs #286

Open
udsmicrosoft wants to merge 18 commits into documentdb:main from udsmicrosoft:users/urismiley/telemetry-playground


@udsmicrosoft
Collaborator

@udsmicrosoft udsmicrosoft commented Mar 5, 2026

Summary

Adds a complete, self-contained local telemetry playground for Kind clusters, monitoring documentation, and sidecar injector OTel support.

Local Telemetry Playground (documentdb-playground/telemetry/local/)

One-command deployment: ./scripts/deploy.sh sets up the Kind cluster, cert-manager, CNPG, the DocumentDB operator (from GHCR), the observability stack, the DocumentDB HA cluster, and the traffic generators.

  • 3-instance DocumentDB HA cluster (1 primary + 2 streaming replicas) on Kind
  • Full observability stack: OTel Collector (central + per-node DaemonSet), Prometheus, Tempo, Loki, Grafana
  • Gateway metrics, traces, and logs via OTLP push
  • PostgreSQL metrics via CNPG built-in exporter and OTel postgresql receiver
  • System resource metrics via kubeletstats receiver (DaemonSet for full-cluster coverage)
  • Pre-built Grafana dashboards:
    • Gateway — ops/sec, average latency, error rates, active connections, pool utilization, document throughput by collection, request/response sizes, gateway logs (Loki)
    • Internals — PG backends, replication lag, commits/rollbacks, row operations/sec, WAL size, database size, container CPU/memory/network/filesystem
  • Traffic generator Jobs (RW to primary, RO to replicas) with error injection (~10% of iterations)
  • Sample Prometheus alerting rules (GatewayHighErrorRate, PostgresReplicationLagHigh, PostgresConnectionSaturation, GatewayDown)
  • Tempo→Loki and Tempo→Prometheus datasource correlations in Grafana
  • validate.sh health check script verifying pods, targets, and data flow
  • Support for custom gateway images via local Kind registry (localhost:5001)

Sidecar Injector Changes

  • Add OTEL_TRACING_ENABLED env var for trace collection
  • Add POD_NAME env var (from K8s downward API) and OTEL_RESOURCE_ATTRIBUTES with service.instance.id=$(POD_NAME) for per-instance metric attribution in multi-replica deployments
  • Add imagePullPolicy to sidecar injector deployment

Monitoring Documentation (docs/.../monitoring/)

  • Architecture overview — OTLP push flow from gateway, collector deployment modes, ServiceMonitor examples, CNPG metrics availability caveat
  • Metrics reference — gateway metrics (db_client_operations_total, gateway_client_connections_active, pool/document metrics), container resources, controller-runtime (with correct controller names), CNPG/PostgreSQL metrics, OTel naming conventions
  • Updated mkdocs.yml navigation

Related

  • Depends on gateway telemetry instrumentation from documentdb/documentdb:
    • Base instrumentation (operations, latency, request/response size): documentdb#443
    • Full metric coverage (connections, pool, document throughput): branch users/urismiley/metrics-expansion

@xgerman
Collaborator

xgerman commented Mar 11, 2026

Review — PR #286: Local Telemetry Playground + Monitoring Docs

This PR has two parts: (1) a code change to the sidecar injector adding OTEL resource attributes, and (2) documentation + playground files for monitoring. Reviewing both.


Code Change: lifecycle.go — OTEL Resource Attributes

The code adds POD_NAME (via downward API) and OTEL_RESOURCE_ATTRIBUTES=service.instance.id=$(POD_NAME) to every gateway sidecar container.

| Check | Result |
| --- | --- |
| Env var ordering (POD_NAME before OTEL_RESOURCE_ATTRIBUTES) | ✅ Correct: K8s resolves $(VAR) in declaration order |
| Downward API metadata.name field path | ✅ Valid |
| OTEL semantic convention service.instance.id | ✅ Matches OTel spec |
| Insertion point (after OTEL_EXPORTER_OTLP_ENDPOINT, before credentials) | ✅ Clean |
| Existing tests cover this code path | ⚠️ No lifecycle_test.go found; see Major #3 |

The code change is correct and well-placed.
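
To illustrate why the ordering check matters, here is a simplified simulation of dependent env var expansion. It is only a sketch: the EnvVar type and expandInOrder function are hypothetical stand-ins, the pod name is an example value standing in for what the downward API would supply, and the `$$(VAR)` escape that Kubernetes also supports is ignored.

```go
package main

import (
	"fmt"
	"regexp"
)

// EnvVar is a hypothetical stand-in for the name/value part of a container env entry.
type EnvVar struct {
	Name  string
	Value string
}

// refPattern matches $(NAME) references (simplified name syntax).
var refPattern = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expandInOrder resolves $(NAME) references using only variables declared
// earlier in the slice. Unresolved references are left as literal text.
func expandInOrder(env []EnvVar) map[string]string {
	resolved := make(map[string]string, len(env))
	for _, e := range env {
		value := refPattern.ReplaceAllStringFunc(e.Value, func(ref string) string {
			name := ref[2 : len(ref)-1] // strip "$(" and ")"
			if v, ok := resolved[name]; ok {
				return v
			}
			return ref
		})
		resolved[e.Name] = value
	}
	return resolved
}

func main() {
	// POD_NAME declared first: the reference resolves.
	good := expandInOrder([]EnvVar{
		{Name: "POD_NAME", Value: "documentdb-preview-1"},
		{Name: "OTEL_RESOURCE_ATTRIBUTES", Value: "service.instance.id=$(POD_NAME)"},
	})
	fmt.Println(good["OTEL_RESOURCE_ATTRIBUTES"]) // service.instance.id=documentdb-preview-1

	// Reversed order: the reference stays literal.
	bad := expandInOrder([]EnvVar{
		{Name: "OTEL_RESOURCE_ATTRIBUTES", Value: "service.instance.id=$(POD_NAME)"},
		{Name: "POD_NAME", Value: "documentdb-preview-1"},
	})
	fmt.Println(bad["OTEL_RESOURCE_ATTRIBUTES"]) // service.instance.id=$(POD_NAME)
}
```

With the order reversed, every replica would report the same literal `$(POD_NAME)` string as its instance ID, which defeats per-instance attribution.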


🔴 Critical (1)

1. PostgreSQL receiver will fail to authenticate — password is "unused"

The OTel collector config contains:

postgresql:
  endpoint: documentdb-preview-rw.documentdb-preview-ns.svc.cluster.local:5432
  username: postgres
  password: "unused"

CNPG generates and manages the postgres superuser password in a Kubernetes Secret (typically <cluster-name>-superuser). The hardcoded "unused" password will cause the postgresql receiver to fail SQL authentication, meaning all PostgreSQL-level metrics (replication lag, backends, DB size, commits, rollbacks) documented in metrics.md will not actually be collected.

Fix options:

  1. Mount the CNPG superuser secret into the OTel collector pod and reference it via ${env:PG_MONITOR_PASSWORD}
  2. Create a dedicated monitoring role with limited permissions and a known password
  3. Add a setup step to the README that creates the monitoring credentials

This also applies to PG_MONITOR_USER: postgres / PG_MONITOR_PASSWORD: unused in the collector deployment env vars.
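
For fix option 1, note that a pod can only reference Secrets in its own namespace, so if the collector runs in a different namespace than the CNPG cluster, the superuser Secret has to be copied first. The following is a rough sketch of that copy step, assuming the Secret is available as JSON (for example from `kubectl get secret <name> -o json`). The function name retargetSecret, the Secret name example-superuser, and the sample values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// retargetSecret parses a Secret manifest and points it at another namespace,
// dropping server-populated metadata so the copy can be created cleanly.
func retargetSecret(manifest []byte, namespace string) (map[string]any, error) {
	var obj map[string]any
	if err := json.Unmarshal(manifest, &obj); err != nil {
		return nil, fmt.Errorf("parse secret: %w", err)
	}
	meta, ok := obj["metadata"].(map[string]any)
	if !ok {
		return nil, fmt.Errorf("secret has no metadata object")
	}
	meta["namespace"] = namespace
	for _, key := range []string{"resourceVersion", "uid", "creationTimestamp", "ownerReferences", "managedFields"} {
		delete(meta, key)
	}
	return obj, nil
}

func main() {
	// Hypothetical input; the password is base64 for "example".
	src := []byte(`{
	  "apiVersion": "v1",
	  "kind": "Secret",
	  "metadata": {
	    "name": "example-superuser",
	    "namespace": "documentdb-preview-ns",
	    "uid": "1234",
	    "resourceVersion": "42"
	  },
	  "data": {"password": "ZXhhbXBsZQ=="}
	}`)

	copied, err := retargetSecret(src, "observability")
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(copied, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // pipe into `kubectl apply -f -`
}
```

Copying the superuser credentials is reasonable for a playground; fix option 2 (a dedicated monitoring role) avoids spreading superuser access across namespaces.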


🟠 Major (3)

2. Monitoring docs missing front matter (title, description, tags)

Both overview.md and metrics.md lack YAML front matter. Per the documentation standards, every page should include:

---
title: Monitoring Overview
description: How to monitor DocumentDB clusters using OpenTelemetry, Prometheus, and Grafana.
tags:
  - monitoring
  - observability
  - metrics
---

This affects search ranking, AI discoverability, and consistency with other documentation pages.
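
A check like this could be automated. The sketch below is one possible approach, not an existing tool in this repository: missingFrontMatterKeys is a hypothetical helper that scans the leading front matter block for top-level keys using plain string handling rather than a real YAML parser.

```go
package main

import (
	"fmt"
	"strings"
)

// missingFrontMatterKeys reports which required top-level keys are absent from
// a page's leading front matter block. A page without a complete block
// (opening and closing "---") is treated as missing every key.
func missingFrontMatterKeys(page string, required []string) []string {
	lines := strings.Split(strings.ReplaceAll(page, "\r\n", "\n"), "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return required
	}
	found := make(map[string]bool)
	closed := false
	for _, line := range lines[1:] {
		if strings.TrimSpace(line) == "---" {
			closed = true
			break
		}
		// Indented lines, list items, and comments are not top-level keys.
		if line == "" || line[0] == ' ' || line[0] == '\t' || line[0] == '-' || line[0] == '#' {
			continue
		}
		if key, _, ok := strings.Cut(line, ":"); ok {
			found[strings.TrimSpace(key)] = true
		}
	}
	if !closed {
		return required
	}
	var missing []string
	for _, key := range required {
		if !found[key] {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	required := []string{"title", "description", "tags"}

	withFrontMatter := "---\ntitle: Monitoring Overview\ndescription: How to monitor DocumentDB clusters.\ntags:\n  - monitoring\n---\n# Monitoring Overview\n"
	fmt.Println(missingFrontMatterKeys(withFrontMatter, required)) // []

	withoutFrontMatter := "# Monitoring Overview\n"
	fmt.Println(missingFrontMatterKeys(withoutFrontMatter, required)) // [title description tags]
}
```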

3. No tests for the lifecycle.go code change

The sidecar injector's lifecycle.go has no test file. The new POD_NAME and OTEL_RESOURCE_ATTRIBUTES env vars are injected into every DocumentDB gateway container — a functional change that affects all deployments. While the change is straightforward, a unit test would prevent regression (e.g., if someone reorders the env vars, breaking $(POD_NAME) resolution).

Suggestion: Add a test in operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle_test.go that verifies:

  • POD_NAME env var is present with FieldRef to metadata.name
  • OTEL_RESOURCE_ATTRIBUTES is present and contains service.instance.id=$(POD_NAME)
  • POD_NAME appears before OTEL_RESOURCE_ATTRIBUTES in the env slice
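
The three checks above might look roughly like the following. This is a self-contained sketch under stated assumptions: the EnvVar, EnvVarSource, and ObjectFieldSelector types here are local stand-ins that mirror the field names of the Kubernetes core/v1 types, and checkOTelEnv is a hypothetical helper. A real lifecycle_test.go would import the actual API types and call the injector's own function.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Local stand-ins mirroring the relevant core/v1 fields.
type ObjectFieldSelector struct{ FieldPath string }
type EnvVarSource struct{ FieldRef *ObjectFieldSelector }
type EnvVar struct {
	Name      string
	Value     string
	ValueFrom *EnvVarSource
}

// checkOTelEnv verifies presence, source, content, and ordering of the two vars.
func checkOTelEnv(env []EnvVar) error {
	podIdx, attrIdx := -1, -1
	for i, e := range env {
		switch e.Name {
		case "POD_NAME":
			podIdx = i
		case "OTEL_RESOURCE_ATTRIBUTES":
			attrIdx = i
		}
	}
	if podIdx < 0 {
		return errors.New("POD_NAME is missing")
	}
	pod := env[podIdx]
	if pod.ValueFrom == nil || pod.ValueFrom.FieldRef == nil ||
		pod.ValueFrom.FieldRef.FieldPath != "metadata.name" {
		return errors.New("POD_NAME must use a fieldRef to metadata.name")
	}
	if attrIdx < 0 {
		return errors.New("OTEL_RESOURCE_ATTRIBUTES is missing")
	}
	if !strings.Contains(env[attrIdx].Value, "service.instance.id=$(POD_NAME)") {
		return errors.New("OTEL_RESOURCE_ATTRIBUTES must contain service.instance.id=$(POD_NAME)")
	}
	if podIdx > attrIdx {
		return errors.New("POD_NAME must be declared before OTEL_RESOURCE_ATTRIBUTES")
	}
	return nil
}

func main() {
	env := []EnvVar{
		{Name: "POD_NAME", ValueFrom: &EnvVarSource{FieldRef: &ObjectFieldSelector{FieldPath: "metadata.name"}}},
		{Name: "OTEL_RESOURCE_ATTRIBUTES", Value: "service.instance.id=$(POD_NAME)"},
	}
	fmt.Println(checkOTelEnv(env)) // <nil>

	// Swapping the two entries trips the ordering check.
	env[0], env[1] = env[1], env[0]
	fmt.Println(checkOTelEnv(env))
}
```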

4. Traffic generator and observability stack contain hardcoded passwords

Multiple files use plaintext passwords:

  • documentdb-ha.yaml: password: DemoPassword100
  • traffic-generator.yaml: -p DemoPassword100
  • observability-stack.yaml: GF_SECURITY_ADMIN_PASSWORD: admin, GF_AUTH_ANONYMOUS_ORG_ROLE: Admin

For a playground/demo this is acceptable, but:

  • The Grafana anonymous admin access (GF_AUTH_ANONYMOUS_ENABLED: true + GF_AUTH_ANONYMOUS_ORG_ROLE: Admin) should have a comment warning this is for demo use only
  • Consider adding a ⚠️ DO NOT USE IN PRODUCTION header comment in observability-stack.yaml

🟡 Minor (5)

5. Gateway metric names cannot be verified against source code

The docs reference metrics like db_client_operations_total, gateway_client_connections_active, db_client_connection_active, etc. These are emitted by the gateway binary (separate repo), not the operator Go code. I cannot verify these metric names against this repository.

Recommendation: Add a note in metrics.md indicating these metric names are from the DocumentDB Gateway and may change between versions.

6. Collector deployed as Deployment but kubeletstats only covers one node

The kubeletstats receiver with K8S_NODE_NAME env var only scrapes the node where the collector pod runs. In the 4-node Kind cluster (1 control-plane + 3 workers), you'll miss kubelet metrics from 3 of 4 nodes.

Recommendation: Add a note in the observability-stack.yaml or overview.md that kubeletstats coverage is limited in Deployment mode (vs DaemonSet).

7. Kind config uses containerd v1 registry path

The kind-config.yaml uses [plugins."io.containerd.grpc.v1.cri".registry]. With kindest/node:v1.35.0 and containerd 2.x, this path may have changed to io.containerd.cri.v1.images. Verify against the target Kind version.

8. K8S_VERSION defaults to v1.35.0 in setup-kind.sh

This pins to a very specific Kind image. If the image isn't available, the script fails. Consider defaulting to a known-good version or adding a version check.

9. The documentdb-preview-rw endpoint is hardcoded in collector config

The OTel collector config hardcodes documentdb-preview-rw.documentdb-preview-ns.svc.cluster.local:5432. This only works for a cluster named documentdb-preview in namespace documentdb-preview-ns. Document this coupling or make it configurable via env vars.
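
One way to express the coupling is to derive the endpoint from the cluster name and namespace instead of writing it out. The sketch below assumes two hypothetical env vars, DOCUMENTDB_CLUSTER_NAME and DOCUMENTDB_NAMESPACE, and falls back to the values currently hardcoded in the collector config.

```go
package main

import (
	"fmt"
	"os"
)

// rwEndpoint builds the in-cluster DNS address of the read-write service
// for a cluster with the given name and namespace.
func rwEndpoint(clusterName, namespace string, port int) string {
	return fmt.Sprintf("%s-rw.%s.svc.cluster.local:%d", clusterName, namespace, port)
}

// envOr returns the value of key, or fallback when the variable is unset or empty.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// Hypothetical variable names; defaults match the playground's current values.
	name := envOr("DOCUMENTDB_CLUSTER_NAME", "documentdb-preview")
	namespace := envOr("DOCUMENTDB_NAMESPACE", "documentdb-preview-ns")
	fmt.Println(rwEndpoint(name, namespace, 5432))
}
```

The same idea could instead be applied directly in the collector config through its `${env:...}` substitution, which the review already relies on for PG_MONITOR_PASSWORD.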


✅ What's Good

  • Clean code change: The POD_NAME / OTEL_RESOURCE_ATTRIBUTES injection is well-implemented with correct K8s env var ordering
  • Complete observability stack: Tempo (traces) + Loki (logs) + Prometheus (metrics) + Grafana (dashboards) + OTel Collector — full three-pillar observability
  • ExternalName service bridge: Clever cross-namespace OTLP routing from DocumentDB namespace to observability namespace
  • Comprehensive metrics reference: metrics.md covers container, gateway, operator, and PostgreSQL metrics with useful PromQL examples
  • Architecture diagram: Clear ASCII art showing data flow from pods → collector → backends → Grafana
  • OTel naming table: OpenTelemetry ↔ Prometheus metric name mapping table is very helpful
  • Traffic generator: Well-designed mongosh script with varied CRUD operations, periodic cleanup, and error handling
  • mkdocs.yml change is clean: Only adds nav entries, no extension conflicts with other PRs
  • Grafana dashboard JSON: Pre-built 1528-line dashboard — users get instant value
  • RBAC for collector: Proper ClusterRole with nodes/stats, pods, services, metrics access

Summary

| Severity | Count | Items |
| --- | --- | --- |
| 🔴 Critical | 1 | PostgreSQL receiver auth will fail (password: "unused") |
| 🟠 Major | 3 | Missing front matter, no lifecycle tests, hardcoded passwords need warnings |
| 🟡 Minor | 5 | Unverifiable gateway metrics, kubeletstats single-node, containerd v2, K8s version pin, hardcoded endpoint |
| | | Code change correct, complete stack, clean mkdocs, good architecture |

Verdict: The code change to lifecycle.go is correct and adds genuine value (per-instance metric attribution). The observability stack is comprehensive and well-designed. Fix the PostgreSQL receiver authentication (Critical #1) — without it, half the documented metrics won't actually be collected. Add front matter to the docs and ideally a unit test for the env var injection.

- Restructured telemetry playground into organized k8s/, dashboards/, scripts/ layout
- Split monolithic observability-stack.yaml into per-component files
- Added two Grafana dashboards: Gateway (ops, latency, errors, connections) and
  Internals (PostgreSQL, indexes, infrastructure)
- Added deploy.sh and teardown.sh scripts for one-command setup/teardown
- Added Grafana dashboard provisioning via ConfigMap
- Added traffic generators with read/write split (primary vs replicas)
- Added OTEL_TRACING_ENABLED=true to sidecar injector for trace collection
- Added monitoring docs (overview.md, metrics.md) with architecture diagrams
- Added README with Mermaid architecture diagram

Signed-off-by: urismiley <urismiley@microsoft.com>
@udsmicrosoft udsmicrosoft force-pushed the users/urismiley/telemetry-playground branch from ae99e4f to 04ef816 Compare March 16, 2026 16:20
- Fix Critical documentdb#1: OTel collector postgresql receiver now uses CNPG
  superuser secret (copied cross-namespace by deploy.sh) instead of
  hardcoded 'unused' password
- Fix Major documentdb#2: Add YAML front matter (title, description, tags) to
  monitoring overview.md and metrics.md
- Fix Major documentdb#4: Add DO NOT USE IN PRODUCTION warnings to grafana.yaml,
  cluster.yaml, traffic-generator.yaml, and otel-collector.yaml
- Fix Minor documentdb#5: Add note that gateway metric names are versioned
  independently and may change between releases
- Fix Minor documentdb#6: Document kubeletstats single-node coverage limitation
  in otel-collector.yaml
- Fix Minor documentdb#9: Document hardcoded PG endpoint coupling to CR name
  and namespace in otel-collector.yaml

Signed-off-by: urismiley <urismiley@microsoft.com>
Convert the monitoring architecture diagram from ASCII art to a Mermaid
graph for better rendering in mkdocs. Enable the pymdownx.superfences
mermaid custom fence in mkdocs.yml.

Signed-off-by: urismiley <urismiley@microsoft.com>
@udsmicrosoft udsmicrosoft marked this pull request as ready for review March 16, 2026 16:40
Copilot AI review requested due to automatic review settings March 16, 2026 16:40
Contributor

Copilot AI left a comment

Pull request overview

Adds a local “telemetry playground” for running a DocumentDB HA demo on Kind with a full observability stack, and introduces monitoring documentation pages (with MkDocs nav updates) describing the telemetry/metrics architecture and references. It also updates the CNPG sidecar injector to set per-pod OpenTelemetry resource attributes for better per-instance attribution.

Changes:

  • Add documentdb-playground/telemetry/local/ demo (Kind setup/deploy/teardown scripts, k8s manifests, traffic generator, Grafana dashboards).
  • Add Monitoring docs (overview.md, metrics.md) and update mkdocs.yml nav + Mermaid fenced blocks support.
  • Update sidecar injector to inject POD_NAME and OTEL_RESOURCE_ATTRIBUTES (and enable tracing via env var).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| operator/cnpg-plugins/sidecar-injector/internal/lifecycle/lifecycle.go | Injects additional OTEL-related env vars into the gateway sidecar for per-instance attribution. |
| mkdocs.yml | Adds Monitoring section to nav and enables Mermaid code fences via pymdownx.superfences. |
| documentdb-playground/telemetry/local/scripts/teardown.sh | Adds Kind cluster teardown + proxy container cleanup. |
| documentdb-playground/telemetry/local/scripts/setup-kind.sh | Adds Kind cluster + local registry bootstrap script. |
| documentdb-playground/telemetry/local/scripts/deploy.sh | Adds one-command deploy flow for observability stack + DocumentDB + traffic. |
| documentdb-playground/telemetry/local/k8s/traffic/traffic-generator.yaml | Adds RW/RO traffic generator Jobs and gateway Services for demo load. |
| documentdb-playground/telemetry/local/k8s/observability/tempo.yaml | Adds Tempo deployment/service for traces backend. |
| documentdb-playground/telemetry/local/k8s/observability/prometheus.yaml | Adds Prometheus deployment/service and scrape config for the collector. |
| documentdb-playground/telemetry/local/k8s/observability/otel-collector.yaml | Adds OTel Collector deployment, RBAC, and pipelines for metrics/traces/logs. |
| documentdb-playground/telemetry/local/k8s/observability/namespace.yaml | Adds observability namespace manifest. |
| documentdb-playground/telemetry/local/k8s/observability/loki.yaml | Adds Loki deployment/service for logs backend. |
| documentdb-playground/telemetry/local/k8s/observability/grafana.yaml | Adds Grafana deployment/service plus datasource & dashboard provisioning. |
| documentdb-playground/telemetry/local/k8s/documentdb/collector-bridge.yaml | Adds ExternalName bridge Service so gateway OTLP endpoint resolves across namespaces. |
| documentdb-playground/telemetry/local/k8s/documentdb/cluster.yaml | Adds demo DocumentDB namespace, credentials, and DocumentDB CR for 3-instance HA. |
| documentdb-playground/telemetry/local/dashboards/internals.json | Adds “Internals” Grafana dashboard JSON. |
| documentdb-playground/telemetry/local/dashboards/gateway.json | Adds “Gateway” Grafana dashboard JSON. |
| documentdb-playground/telemetry/local/README.md | Documents local playground usage and architecture (diagram + steps). |
| docs/operator-public-documentation/preview/monitoring/overview.md | Adds monitoring architecture + setup/verification guidance. |
| docs/operator-public-documentation/preview/monitoring/metrics.md | Adds metrics reference and PromQL examples across signal sources. |


Document that the gateway image must include OpenTelemetry
instrumentation from documentdb/documentdb#443 for the telemetry
playground and monitoring features to work.

Signed-off-by: urismiley <urismiley@microsoft.com>
- Fix Mermaid diagram: 'remote write' → 'scrape :8889' (Prometheus
  scrapes the collector, not the other way around)
- Derive CONTEXT from CLUSTER_NAME in deploy.sh for consistency with
  setup-kind.sh
- Add 3-minute timeout to CNPG superuser secret wait loop
- Pin mongodb-community-server image to 7.0.30-ubuntu2204
- Fix '3-node' → '3-instance' in overview.md (1 node, 3 instances)

Signed-off-by: urismiley <urismiley@microsoft.com>
Signed-off-by: urismiley <urismiley@microsoft.com>
Signed-off-by: urismiley <urismiley@microsoft.com>
…el, validation script, kubeletstats DaemonSet

Signed-off-by: urismiley <urismiley@microsoft.com>
…fy OTel naming, trim verbose examples

Signed-off-by: urismiley <urismiley@microsoft.com>
@udsmicrosoft udsmicrosoft force-pushed the users/urismiley/telemetry-playground branch 3 times, most recently from b3cdc5f to dae61fe Compare March 25, 2026 06:36
…quisite instructions

Signed-off-by: urismiley <urismiley@microsoft.com>
… to cluster CR

Signed-off-by: urismiley <urismiley@microsoft.com>
…otal

Signed-off-by: urismiley <urismiley@microsoft.com>
…ted gateway metrics

Signed-off-by: urismiley <urismiley@microsoft.com>
@udsmicrosoft udsmicrosoft force-pushed the users/urismiley/telemetry-playground branch from 9b0e6b5 to dfeb8fb Compare March 25, 2026 18:15
…c/sec by collection

Signed-off-by: urismiley <urismiley@microsoft.com>
…l metrics-expansion branch

Signed-off-by: urismiley <urismiley@microsoft.com>
… published images, add custom image docs

Signed-off-by: urismiley <urismiley@microsoft.com>
Signed-off-by: urismiley <urismiley@microsoft.com>
…alidate.sh, fix doc metric name

Signed-off-by: urismiley <urismiley@microsoft.com>
- **Helm 3** — [install](https://helm.sh/docs/intro/install/)
- **jq** — for credential copying

!!! important "Gateway OTEL support required"
Collaborator

These docs aren't built with MkDocs, so the !!! important admonition doesn't render here.

./scripts/validate.sh

# Tear down
./scripts/teardown.sh
Collaborator

Maybe this could have a separate heading for cleanup instead of being in quickstart

Collaborator

I see you do have a section for it, so I think it can be removed here.

Collaborator

@WentingWu666666 WentingWu666666 left a comment

docs/operator-public-documentation/preview/monitoring/overview.md (lines 23-30)

Instead of asking users to build their own gateway image from a specific branch, it would be better to publish a pre-built gateway image (e.g., to ghcr.io/documentdb/...) that includes the OTEL instrumentation. Requiring users to build the image themselves adds friction and is error-prone; most users won't have the build toolchain set up for the gateway. Could we provide a ready-to-use image with telemetry support, or at least track an issue to do so before GA?

