Add databricks doctor command#4730

Open
simonfaltum wants to merge 9 commits into main from simonfaltum/doctor-command

@simonfaltum
Member

@simonfaltum simonfaltum commented Mar 12, 2026

Why

Users debugging CLI setup issues (auth failures, config problems, network issues) have no single command to diagnose their environment. They must manually run separate commands to check auth, config, and connectivity.

Changes

Before: Users had to manually run separate commands to check auth, config, and connectivity.
Now: A new databricks doctor command runs all diagnostic checks and reports results as a checklist:

  • CLI version (info)
  • Config file readability and profile count (pass/fail)
  • Active profile (info)
  • Authentication validity and auth type (pass/fail)
  • User identity via CurrentUser.Me (pass/fail)
  • Network connectivity to workspace host (pass/fail)

Text output writes a colored status label ([ok], [FAIL], etc.) for each check to stdout. JSON output (--output json) returns a structured array of check results. Auth failures are reported as check results, not command errors.
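
For illustration, the checklist and its two renderings could be modeled roughly as below. This is a minimal sketch, not the code in this PR: the `CheckResult` type, its JSON field names, the `[info]` label, and the `renderText`/`renderJSON` helpers are all assumptions, and color handling is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Status values follow the PR description (pass/fail/info); the exact
// strings used by the real command are an assumption.
type Status string

const (
	StatusPass Status = "pass"
	StatusFail Status = "fail"
	StatusInfo Status = "info"
)

// CheckResult is a hypothetical shape for one row of the checklist.
type CheckResult struct {
	Name    string `json:"name"`
	Status  Status `json:"status"`
	Message string `json:"message"`
}

// renderText writes one line per check, prefixed with a bracketed label.
func renderText(results []CheckResult) string {
	var b strings.Builder
	for _, r := range results {
		label := "[info]" // assumed label for informational checks
		switch r.Status {
		case StatusPass:
			label = "[ok]"
		case StatusFail:
			label = "[FAIL]"
		}
		fmt.Fprintf(&b, "%-6s %s: %s\n", label, r.Name, r.Message)
	}
	return b.String()
}

// renderJSON returns the results as a compact JSON array.
func renderJSON(results []CheckResult) (string, error) {
	out, err := json.Marshal(results)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	results := []CheckResult{
		{Name: "CLI Version", Status: StatusInfo, Message: "0.0.0-dev"},
		{Name: "Authentication", Status: StatusFail, Message: "token is invalid"},
	}
	fmt.Print(renderText(results))

	js, err := renderJSON(results)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(js)
}
```

Note that the failed authentication check is just another row in the output; `main` still returns normally, which matches the "check results, not command errors" behavior described above.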

Open item

  • Top-level command deny list: Like the global flags, the doctor command name should be added to a deny list for new API names in the universe API linters, so future auto-generated API commands don't collide with it. Tracked separately.

Test plan

  • Unit tests for each check function
  • Unit tests for both text and JSON rendering
  • Tests for graceful error handling (auth failure, missing config)
  • make lintfull passes
  • make checks passes

Adds a top-level `databricks doctor` command that validates CLI setup by
running sequential diagnostic checks: CLI version, config file readability,
active profile, authentication, user identity, and network connectivity.

Auth failures are reported as check results, not command errors. Supports
both text output (colored status icons) and JSON output (`--output json`).

Co-authored-by: Isaac
@eng-dev-ecosystem-bot
Collaborator

eng-dev-ecosystem-bot commented Mar 12, 2026

Commit: f406908

Run: 23059775876

| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 💚 aws linux | | 8 | 7 | 268 | 787 | 6:58 |
| 💚 aws windows | | 8 | 7 | 270 | 785 | 5:12 |
| 🔄 aws-ucws linux | 2 | 7 | 7 | 364 | 702 | 8:42 |
| 🔄 aws-ucws windows | 2 | 7 | 7 | 366 | 700 | 6:55 |
| 💚 azure linux | | 2 | 9 | 271 | 785 | 5:34 |
| 💚 azure windows | | 2 | 9 | 273 | 783 | 4:37 |
| 🔄 azure-ucws linux | 2 | 1 | 9 | 369 | 698 | 8:55 |
| 🔄 azure-ucws windows | 2 | 1 | 9 | 371 | 696 | 7:09 |
| 💚 gcp linux | | 2 | 9 | 267 | 788 | 6:26 |
| 💚 gcp windows | | 2 | 9 | 269 | 786 | 5:52 |

16 interesting tests: 7 SKIP, 7 RECOVERED, 2 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🔄 TestAccept | 💚 R | 💚 R | 🔄 f | 🔄 f | 💚 R | 💚 R | 🔄 f | 🔄 f | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestAccept/ssh/connect-serverless-gpu | 🙈 s | 🙈 s | 🔄 f | 🔄 f | 🙈 s | 🙈 s | 🔄 f | 🔄 f | 🙈 s | 🙈 s |
| 💚 TestAccept/ssh/connection | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |

Top 20 slowest tests (at least 2 minutes):
duration env testname
4:20 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:39 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:37 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:18 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:14 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:11 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:09 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:07 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:52 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:49 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:49 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:44 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:42 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:39 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:37 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:17 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:11 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:09 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform

The test used env.Set(ctx, ...) to set DATABRICKS_HOST and
DATABRICKS_TOKEN, but checkAuth creates a bare config.Config{}
that reads from real environment variables via os.Getenv, not
the context-based env layer. Use t.Setenv instead so the SDK
can see the values.

Co-authored-by: Isaac
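
The distinction in this commit message can be sketched without the CLI's packages. The overlay below is a simplified, hypothetical stand-in for a context-based env layer (`withEnv`, `lookupCtx`, `lookupProcess`, and the variable name are all invented for the example). It shows why a value set only on the context stays invisible to code that calls `os.Getenv` directly. In a test, `t.Setenv` sets the real process variable and restores it on cleanup, which is why it fixes the problem.

```go
package main

import (
	"context"
	"fmt"
	"os"
)

// envKey is the context key for the overlay map.
type envKey struct{}

// withEnv returns a context carrying one more key/value override.
func withEnv(ctx context.Context, key, value string) context.Context {
	merged := map[string]string{}
	if prev, ok := ctx.Value(envKey{}).(map[string]string); ok {
		for k, v := range prev {
			merged[k] = v
		}
	}
	merged[key] = value
	return context.WithValue(ctx, envKey{}, merged)
}

// lookupCtx checks the context overlay first, then the process environment.
func lookupCtx(ctx context.Context, key string) string {
	if m, ok := ctx.Value(envKey{}).(map[string]string); ok {
		if v, found := m[key]; found {
			return v
		}
	}
	return os.Getenv(key)
}

// lookupProcess models code that only consults the real environment,
// as the commit message describes for the bare config.
func lookupProcess(key string) string {
	return os.Getenv(key)
}

// demo returns what each reader sees before and after the real
// environment variable is set.
func demo() (viaCtx, before, after string) {
	const key = "DOCTOR_EXAMPLE_HOST" // hypothetical variable name
	os.Unsetenv(key)
	defer os.Unsetenv(key)

	ctx := withEnv(context.Background(), key, "https://example.test")
	viaCtx = lookupCtx(ctx, key)
	before = lookupProcess(key)

	// Setting the process variable is what t.Setenv does inside a test.
	os.Setenv(key, "https://example.test")
	after = lookupProcess(key)
	return viaCtx, before, after
}

func main() {
	viaCtx, before, after := demo()
	fmt.Printf("context-aware reader: %q\n", viaCtx)
	fmt.Printf("process-only reader:  %q\n", before)
	fmt.Printf("after os.Setenv:      %q\n", after)
}
```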
@simonfaltum simonfaltum marked this pull request as ready for review March 13, 2026 11:34
Contributor

@shreyas-goenka shreyas-goenka left a comment

Note: This review was posted by Claude (AI assistant). Shreyas will do a separate, more thorough review pass.

Priority: HIGH — Config resolution diverges from real CLI auth path

MAJOR: resolveConfig diverges from real CLI auth

The resolveConfig function in databricks doctor constructs its own config resolution path instead of going through the standard SDK/CLI authentication flow. This means the doctor command could report "config is fine" while the real CLI fails (or vice versa). If the goal is to diagnose auth issues, it should use the same code path the CLI uses.

MEDIUM: Network check bypasses SDK HTTP client

The connectivity check uses http.DefaultClient directly instead of going through the SDK's configured HTTP client. In enterprise environments with proxies or custom TLS, this will give misleading results — the check might fail even though the SDK would succeed (or vice versa).

Other Observations

  • Good idea for a diagnostic command overall
  • The step-by-step output format is user-friendly
  • Missing test coverage for the core diagnostic logic

When the workspace client is unavailable but config is resolved,
the network check was falling back to http.DefaultClient. This
ignores proxy and custom TLS settings from the SDK config, giving
misleading results in enterprise environments. Use
configuredNetworkHTTPClient(cfg) instead, which respects
HTTPTransport and InsecureSkipVerify from the config.
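
As a rough illustration of the idea (not the actual `configuredNetworkHTTPClient` from this PR, whose body is not shown here), a helper that honors those two settings might look like the sketch below. The `netConfig` struct is a hypothetical stand-in for the SDK config, and the 10-second client timeout is an assumption.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// netConfig is a hypothetical stand-in for the two config fields the
// commit message mentions.
type netConfig struct {
	HTTPTransport      http.RoundTripper
	InsecureSkipVerify bool
}

// newNetworkClient builds an HTTP client that respects the config instead
// of falling back to http.DefaultClient.
func newNetworkClient(cfg netConfig) *http.Client {
	client := &http.Client{Timeout: 10 * time.Second} // assumed timeout

	// A caller-supplied transport is used as-is.
	if cfg.HTTPTransport != nil {
		client.Transport = cfg.HTTPTransport
		return client
	}

	// Otherwise start from a copy of the default transport, which keeps
	// proxy settings taken from the environment.
	var transport *http.Transport
	if def, ok := http.DefaultTransport.(*http.Transport); ok {
		transport = def.Clone()
	} else {
		transport = &http.Transport{Proxy: http.ProxyFromEnvironment}
	}
	if cfg.InsecureSkipVerify {
		if transport.TLSClientConfig == nil {
			transport.TLSClientConfig = &tls.Config{}
		}
		transport.TLSClientConfig.InsecureSkipVerify = true
	}
	client.Transport = transport
	return client
}

// skipsVerify reports whether the client's transport disables TLS
// certificate verification.
func skipsVerify(c *http.Client) bool {
	t, ok := c.Transport.(*http.Transport)
	return ok && t.TLSClientConfig != nil && t.TLSClientConfig.InsecureSkipVerify
}

func main() {
	strict := newNetworkClient(netConfig{})
	relaxed := newNetworkClient(netConfig{InsecureSkipVerify: true})
	fmt.Println("default config skips verification:", skipsVerify(strict))
	fmt.Println("insecure config skips verification:", skipsVerify(relaxed))
}
```

Cloning the default transport rather than mutating it keeps the change local to the doctor check and leaves other users of `http.DefaultTransport` unaffected.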
…allback, skip status

- Detect account-level configs (AccountID + account host) and use
  NewAccountClient instead of always using NewWorkspaceClient
- Add 15s per-check deadline for auth and identity checks to prevent
  hangs on unresponsive IdP
- Network check now tries even when config resolution fails, as long
  as a host URL is available from partial config resolution
- Identity marked as 'skip' (not 'fail') when auth failed or when
  using account-level profile, avoiding double failures from one root cause
- Add skip status rendering in text output
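
The per-check deadline and the skip-on-upstream-failure rule from the list above could be combined roughly as follows. This is a sketch under assumptions: the `result` type, `runChecks`, and the simulated `slowIdP` are invented for the example, and the demo uses a 50ms deadline in place of the 15s from the commit so that it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical row in the doctor output.
type result struct {
	Name    string
	Status  string // "pass", "fail", or "skip"
	Message string
}

// runWithDeadline runs fn with a context that expires after d.
// fn is expected to return promptly once the context is done.
func runWithDeadline(parent context.Context, d time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()
	return fn(ctx)
}

// slowIdP simulates an identity provider that never answers in time.
func slowIdP(ctx context.Context) error {
	select {
	case <-time.After(time.Hour):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runChecks runs auth, then identity. If auth fails, identity is reported
// as skipped rather than failed, so one root cause yields one failure.
func runChecks(ctx context.Context, deadline time.Duration, auth, identity func(context.Context) error) []result {
	if err := runWithDeadline(ctx, deadline, auth); err != nil {
		msg := err.Error()
		if errors.Is(err, context.DeadlineExceeded) {
			msg = "timed out after " + deadline.String()
		}
		return []result{
			{Name: "Authentication", Status: "fail", Message: msg},
			{Name: "Identity", Status: "skip", Message: "authentication failed"},
		}
	}
	out := []result{{Name: "Authentication", Status: "pass", Message: "ok"}}
	if err := runWithDeadline(ctx, deadline, identity); err != nil {
		return append(out, result{Name: "Identity", Status: "fail", Message: err.Error()})
	}
	return append(out, result{Name: "Identity", Status: "pass", Message: "ok"})
}

// demoResults runs the checks against the unresponsive IdP.
func demoResults() []result {
	identityOK := func(context.Context) error { return nil }
	return runChecks(context.Background(), 50*time.Millisecond, slowIdP, identityOK)
}

func main() {
	for _, r := range demoResults() {
		fmt.Printf("[%s] %s: %s\n", r.Status, r.Name, r.Message)
	}
}
```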