-
Notifications
You must be signed in to change notification settings - Fork 1
Support bundle context for use with AI assistants #29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
brathina-spectro
wants to merge
2
commits into
main
Choose a base branch
from
context
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+281
−0
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,281 @@ | ||
| # Troubleshooting Guide — SpectroCloud Support Bundles | ||
|
|
||
| This guide gives the context needed to effectively troubleshoot SpectroCloud support bundles. | ||
| Reference this document at the start of any support bundle analysis session. | ||
|
|
||
| --- | ||
|
|
||
| ## Bundle Types | ||
|
|
||
| There are two distinct bundle types with different scopes: | ||
|
|
||
| | Aspect | Edge Cluster Bundle | Infrastructure (Infra) Bundle | | ||
| |---|---|---| | ||
| | **Script** | `support-bundle-edge.sh` | `support-bundle-infra.sh` | | ||
| | **Naming** | `<hostname>-YYYY-MM-DD_HH_MM_SS/` | `<cluster-name>-YYYY-MM-DD_HH_MM_SS/` | | ||
| | **Scope** | Host-level OS + Kubernetes cluster | Kubernetes cluster only | | ||
| | **OS/journald logs** | Yes | No | | ||
| | **Network config** | Yes | No | | ||
| | **Edge-specific files** | Yes (`/oem`, `/run/stylus`, cloud-config) | No | | ||
| | **Container runtime logs** | Yes (crictl) | No | | ||
| | **K8s resources** | Yes | Yes | | ||
| | **Pod logs** | Yes | Yes | | ||
| | **Distro detection** | Yes (k3s, rke2, kubeadm, Canonical) | No (cluster-agnostic) | | ||
|
|
||
| **How to tell them apart:** Edge bundles have `journald/`, `systeminfo/`, `networking/`, `oem/` directories at the root. Infra bundles have only `k8s/` (plus `console.log` and `.support-bundle`). | ||
|
|
||
| --- | ||
|
|
||
| ## Edge Bundle — Directory Structure | ||
|
|
||
| ``` | ||
| <hostname>-YYYY-MM-DD_HH_MM_SS/ | ||
| ├── .support-bundle # Bundle version (format: YYYYMMDD+githash) | ||
| ├── console.log # Script execution log — check for collection errors | ||
| │ | ||
| ├── systeminfo/ # Node OS information | ||
| │ ├── hostname # Node hostname | ||
| │ ├── os-release # OS name and version | ||
| │ ├── cmdline # Kernel boot parameters | ||
| │ └── resolv.conf # DNS configuration | ||
| │ | ||
| ├── journald/ # Service logs (most important for edge debugging) | ||
| │ ├── dmesg # Kernel messages | ||
| │ ├── journal-boot # Full boot journal | ||
| │ ├── stylus-operator.log # Palette edge operator — primary log for pack/lifecycle issues | ||
| │ ├── stylus-agent.log # Palette edge agent — registration, communication | ||
| │ ├── k3s.log # K3s runtime (if k3s distro) | ||
| │ ├── kubelet.log # Kubelet | ||
| │ ├── containerd.log # Container runtime | ||
| │ └── <50+ other services> # systemd-timesyncd, cos-setup-boot, etc. | ||
| │ | ||
| ├── var/log/ # OS log files (mirrors journald, sometimes more history) | ||
| │ ├── stylus-operator.log | ||
| │ ├── stylus-agent.log | ||
| │ └── store.log # Stylus store/state service | ||
| │ | ||
| ├── networking/ # Network configuration | ||
| │ ├── iptables # Firewall rules | ||
| │ ├── ip-route # Routing table | ||
| │ └── cni/ # CNI plugin configs | ||
| │ | ||
| ├── oem/ # OEM/Kairos configuration files | ||
| ├── run/stylus/ # Stylus runtime state files | ||
| ├── run/immucore/ # Immucore (immutable OS) state | ||
| ├── usr/local/cloud-config/ # Kairos cluster config (cluster.kairos.yaml — key file!) | ||
| ├── opt/spectrocloud/ # Binary checksums | ||
| │ | ||
| ├── k8s/ # Kubernetes cluster data | ||
| │ ├── cluster-info/ # cluster-info dump, API server resources, pod logs | ||
| │ ├── cluster-resources/ # kubectl get outputs (YAML) for all resource types | ||
| │ │ ├── nodes.yaml # Node status, conditions, capacity | ||
| │ │ ├── namespaces.yaml # All namespaces | ||
| │ │ ├── pods/ # Per-namespace pod status | ||
| │ │ ├── deployments/ # Per-namespace deployments | ||
| │ │ ├── events/ # Kubernetes events (warnings/errors) | ||
| │ │ └── <all other resource types> | ||
| │ ├── pod-logs/ # Current pod logs (from /var/log/pods) | ||
| │ ├── previous-pod-logs/ # Previous container logs (post-crash/restart) | ||
| │ └── metrics/ # kubectl top nodes/pods output | ||
| │ | ||
| ├── k3s/ # K3s-specific (if k3s distro) | ||
| │ ├── certs/ # TLS certificates | ||
| │ └── crictl/logs/ # crictl container logs | ||
| │ | ||
| ├── etcd/ # etcd status (if kubeadm distro) | ||
| └── helm/ # Helm releases and repo info | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Infrastructure Bundle — Directory Structure | ||
|
|
||
| ``` | ||
| <cluster-name>-YYYY-MM-DD_HH_MM_SS/ | ||
| ├── .support-bundle # Bundle version | ||
| ├── console.log # Script execution log | ||
| └── k8s/ | ||
| ├── cluster-info/ # cluster-info dump, API resources, pod logs | ||
| ├── cluster-resources/ # kubectl get outputs (YAML) | ||
| │ ├── nodes.yaml | ||
| │ ├── namespaces.yaml | ||
| │ ├── pods/ | ||
| │ ├── deployments/ | ||
| │ ├── events/ | ||
| │ └── <all other resource types> | ||
| ├── previous-pod-logs/ # Previous container logs | ||
| └── metrics/ # kubectl top output | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## SpectroCloud Platform Overview | ||
|
|
||
| **Palette** is the SpectroCloud management platform. An edge deployment has: | ||
|
|
||
| - **Stylus Operator** — runs on the edge node, manages pack lifecycle (install/upgrade/delete) | ||
| - **Stylus Agent** — registers the node with the management plane, handles communication | ||
| - **Packs** — Helm chart-based applications deployed via cluster profiles | ||
| - **Cluster Profile** — defines what packs (OS, CNI, CSI, add-ons) are deployed on a cluster | ||
| - **Management Plane** — remote Palette SaaS that communicates with the agent | ||
|
|
||
| **Cluster types (by namespace presence):** | ||
| - Enterprise cluster: has `hubble-system` namespace | ||
| - PCG (Palette Cloud Gateway): has `jet-system` namespace with `spectro-cloud-driver` deployment | ||
| - Edge cluster: typically has `palette-system`, `spectro-system`, `spectro-task-*` namespaces | ||
|
|
||
| **Key system namespaces on edge:** | ||
| | Namespace | Purpose | | ||
| |---|---| | ||
| | `palette-system` | Palette webhooks and controllers | | ||
| | `spectro-system` | Stylus operator and core agents | | ||
| | `spectro-task-<id>` | Task execution for cluster operations | | ||
| | `cluster-<id>` | Cluster management agent | | ||
| | `cert-manager` | Certificate management | | ||
| | `kube-system` | Core Kubernetes components | | ||
|
|
||
| --- | ||
|
|
||
| ## Supported Kubernetes Distributions (Edge) | ||
|
|
||
| | Distribution | Indicator | Default KUBECONFIG | | ||
| |---|---|---| | ||
| | k3s | `k3s/` directory in bundle | `/run/kubeconfig` | | ||
| | rke2 | `rke2/` directory in bundle | `/etc/rancher/rke2/rke2.yaml` | | ||
| | kubeadm | `kubeadm/`, `etcd/` directories | `/etc/kubernetes/admin.conf` | | ||
| | Canonical k8s | `canonical-k8s/`, `dqlite/` directories | snap-managed path | | ||
|
|
||
| --- | ||
|
|
||
| ## Supported OS Platforms (Edge) | ||
|
|
||
| - **Kairos-based**: SLE Micro (SUSE), Ubuntu, openSUSE — identified by `kairos` in `/run/stylus` files and `cos-setup-boot` in journald | ||
| - **Non-Kairos**: Standard Linux installs using kubeadm/rke2 | ||
|
|
||
| Kairos provides an **immutable OS** — config lives in `/oem/` and `/usr/local/cloud-config/`. The `cluster.kairos.yaml` is the primary cluster configuration source. | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting Methodology | ||
|
|
||
| ### Step 1 — Orient Yourself | ||
|
|
||
| 1. Read `.support-bundle` → confirm bundle version | ||
| 2. Read `systeminfo/hostname`, `systeminfo/os-release` → OS and node identity | ||
| 3. Check `usr/local/cloud-config/cluster.kairos.yaml` (edge) → K8s distro, network config, cluster ID | ||
| 4. Read `console.log` → any collection errors (e.g., commands that failed to run) | ||
| 5. Check bundle timestamp in directory name → when was data collected? | ||
|
|
||
| ### Step 2 — Check Node and Cluster Health | ||
|
|
||
| 1. `k8s/cluster-resources/nodes.yaml` → Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) | ||
| 2. `k8s/cluster-resources/pods/<namespace>.yaml` → Look for non-Running/non-Completed pods | ||
| 3. `k8s/metrics/` → Resource pressure (high CPU/memory usage) | ||
| 4. `k8s/cluster-resources/events/` → Kubernetes warnings and errors | ||
|
|
||
| **Pod states to investigate:** `CrashLoopBackOff`, `Error`, `Pending`, `OOMKilled`, `ImagePullBackOff`, `ErrImagePull` | ||
|
|
||
| ### Step 3 — Investigate Stylus/Palette Issues | ||
|
|
||
| For pack deployment failures, registration issues, or upgrade problems: | ||
|
|
||
| 1. **Primary:** `journald/stylus-operator.log` — pack pull, install, reconciliation errors | ||
| 2. **Secondary:** `journald/stylus-agent.log` — registration, management plane comms | ||
| 3. **Supporting:** `var/log/store.log` — state store operations | ||
|
|
||
| **Common patterns to grep for:** | ||
| ``` | ||
| "failed to" | ||
| "error" | ||
| "basic credential not found" # Registry auth failure | ||
| "apiserver not ready" # Startup race condition (usually transient) | ||
| "pack .* not found" # Missing pack in cluster profile | ||
| "failed to get hostId" # Node annotation issue | ||
| "failed to find.*token" # Auth token missing | ||
| ``` | ||
|
|
||
| ### Step 4 — Investigate Pod/Workload Issues | ||
|
|
||
| 1. `k8s/cluster-resources/pods/<namespace>.yaml` → pod status, restartCount, lastState | ||
| 2. `k8s/pod-logs/<namespace>/<pod>/` → current logs | ||
| 3. `k8s/previous-pod-logs/<namespace>/<pod>/` → logs from before last crash/restart | ||
| 4. `k8s/cluster-resources/events/<namespace>.yaml` → event history | ||
|
|
||
| ### Step 5 — Investigate Node-Level Issues (Edge Only) | ||
|
|
||
| 1. `journald/dmesg` → kernel/hardware errors, OOM kills | ||
| 2. `journald/containerd.log` → container runtime errors | ||
| 3. `journald/kubelet.log` → kubelet issues | ||
| 4. `journald/k3s.log` (or rke2) → distribution-specific issues | ||
| 5. `networking/iptables`, `networking/ip-route` → network policy/routing problems | ||
| 6. `oem/` → OEM configuration issues | ||
|
|
||
| --- | ||
|
|
||
| ## Common Issues and Where to Find Them | ||
|
|
||
| | Symptom | Where to Look | What to Search For | | ||
| |---|---|---| | ||
| | Pack not installing | `journald/stylus-operator.log` | `"pack"`, `"failed to install"`, `"helm"` | | ||
| | Registry/image pull failure | `journald/stylus-operator.log` | `"credential"`, `"ImagePull"`, `"unauthorized"` | | ||
| | Node not registering | `journald/stylus-agent.log` | `"register"`, `"failed to"`, `"management plane"` | | ||
| | Pod crashlooping | `k8s/previous-pod-logs/` + events | Pod `lastState.terminated.reason` | | ||
| | OOM kill | `journald/dmesg` + pod status | `"oom"`, `"OOMKilled"` | | ||
| | Storage issues | `k8s/cluster-resources/pvc/` + pod events | `"FailedMount"`, `"bound"`, `"pending"` | | ||
| | Certificate errors | `journald/k3s.log`, cert-manager logs | `"certificate"`, `"tls"`, `"x509"` | | ||
| | DNS resolution | `systeminfo/resolv.conf` + coredns logs | DNS server config, coredns pod health | | ||
| | Upgrade failure | `journald/stylus-operator.log` + pack CSVs | `"upgrade"`, `"failed"` + check pack timeline | | ||
| | Startup/boot failure | `journald/journal-boot`, `journald/dmesg` | `"failed"`, `"error"` early in timeline | | ||
| | etcd issues | `etcd/` directory + k3s/rke2 logs | `"etcd"`, `"snapshot"`, `"leader"` | | ||
|
|
||
| --- | ||
|
|
||
| ## Key Configuration Files (Edge) | ||
|
|
||
| | File | Contents | Why It Matters | | ||
| |---|---|---| | ||
| | `usr/local/cloud-config/cluster.kairos.yaml` | K3s/cluster config, CNI, CIDR, OIDC, node IP | Primary cluster config source | | ||
| | `oem/*.yaml` | OEM/hardware profile | Device-specific settings | | ||
| | `run/stylus/*` | Stylus runtime state | Current operational state | | ||
| | `etc/containerd/config.toml` | Container runtime config | Registry mirrors, snapshotter settings | | ||
|
|
||
| --- | ||
|
|
||
| ## Cluster Configuration Quick Reference (from `cluster.kairos.yaml`) | ||
|
|
||
| Key fields to check: | ||
| - `k3s_args`: K3s startup flags (cluster-cidr, service-cidr, disable options) | ||
| - `k3s_args.flannel-backend`: CNI backend (e.g., `wireguard-native`) | ||
| - `k3s_args.node-ip`: Node's primary IP | ||
| - `k3s_args.tls-san`: Virtual/load-balancer IP for HA | ||
| - `k3s_args.oidc-issuer-url`: Management plane OIDC endpoint | ||
| - `k3s_args.token`: Cluster join token (treat as sensitive) | ||
|
|
||
| --- | ||
|
|
||
| ## Helm Releases | ||
|
|
||
| Located in `helm/` directory. Contains: | ||
| - Release names, namespaces, chart versions, status (deployed/failed/pending) | ||
| - Useful for confirming what version of a chart is actually installed vs. what's expected | ||
|
|
||
| --- | ||
|
|
||
| ## Notes on Log File Sizes | ||
|
|
||
| Large log files indicate high activity or issues: | ||
| - `stylus-operator.log` > 10 MB → likely repeated retry loops (registry failures, reconciliation errors) | ||
| - `stylus-agent.log` > 5 MB → communication issues with management plane | ||
| - `journalctl` > 100 MB → extended runtime or very verbose logging; use targeted greps | ||
|
|
||
| When log files are very large, grep for specific error patterns rather than reading linearly. Start from the **end** of the file for the most recent state. | ||
|
|
||
| --- | ||
|
|
||
| ## Infra Bundle — Additional Notes | ||
|
|
||
| Infra bundles lack OS-level data but follow the same `k8s/` structure as edge. When troubleshooting infra bundles: | ||
| - Focus entirely on pod health, events, and resource status | ||
| - Check `previous-pod-logs/` for any controller/operator crashes | ||
| - `k8s/cluster-resources/` contains the same resource types as edge | ||
| - No `journald/` or `systeminfo/` — OS-level issues cannot be diagnosed |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be renamed to AGENTS.md if it's intended to be consumed by LLM agents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://agents.md/