---
sidebar_position: 3
edition: ce
---

# High Availability Installation

Use the [ocboot](https://github.com/yunionio/ocboot) deployment tool for high-availability installation of Cloudpods services, which is more suitable for production environment deployments.

## Prerequisites

import HAEnv from '../../shared/getting-started/_parts/_ha-env.mdx';

<HAEnv />

## Start Installation

import OcbootReleaseDownload from '../../shared/getting-started/_parts/_quickstart-ocboot-release-download.mdx';

<OcbootReleaseDownload />

### Write Deployment Configuration

import OcbootConfigHA from '@site/src/components/OcbootConfigHA';

<OcbootConfigHA productVersion='AI' />

### Start Deployment

```bash
$ ./ocboot.sh install ./config-ha.yml
```

After deployment completes, you can open https://10.127.190.10 (VIP) in a browser, enter the username `admin` and password `admin@123`, and access the web console.
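Because the VIP can drift between control nodes, you can check which node currently holds it by running the following on each control node (the address is the example VIP from this guide):

```bash
# Prints the interface entry holding the VIP; no output means this node does not hold it
ip addr show | grep -w 10.127.190.10
```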

After deployment, you can add more nodes to the existing cluster. Refer to: [Adding Compute Nodes](./host).

Note: When adding nodes, please use the actual IP of the first control node (do not use the VIP), because the VIP may drift. Typically only the first control node has SSH passwordless login permissions configured for other nodes; otherwise, SSH login may fail.
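For example, assuming the first control node's actual IP is 10.127.190.11 (a hypothetical address for illustration) and the new node is 10.127.190.20, the add-node invocation would look like this sketch (see [Adding Compute Nodes](./host) for the full procedure):

```bash
# Use the first control node's real IP, not the VIP 10.127.190.10
./ocboot.sh add-node 10.127.190.11 10.127.190.20
```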
---
sidebar_position: 4
edition: ce
---

# Adding Compute Nodes

To run AI applications in AI Cloud, you first need to add the corresponding compute nodes (hosts) and make sure they have the necessary container environment and, if applicable, GPU environment. This section describes how to deploy the base components on compute nodes and add the hosts to the cluster.

Compute nodes are mainly responsible for container, network, storage, and GPU management.

## Environment

import OcbootEnv from '../../shared/getting-started/_parts/_quickstart-ocboot-k3s-env.mdx';

<OcbootEnv />

## Use ocboot to Add Corresponding Nodes

The following operations are performed on the control node. Use the `ocboot.sh add-node` command on the control node to add the corresponding compute node to the cluster.

Assume you want to add compute node 10.168.222.140 to control node 10.168.26.216. You first need passwordless root SSH login from the control node to both the compute node and the control node itself.

::::tip Note
If it is a high-availability deployment environment, do not use the VIP for the control node IP when adding nodes. Only use the actual IP of the first control node, because the VIP may drift to other nodes, and usually only the first node has SSH passwordless login permissions configured for other nodes. Using other control nodes may cause SSH login to fail.
::::

```bash
# Set the control node itself to passwordless login
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.168.26.216

# Try passwordless login to the control node to verify
$ ssh root@10.168.26.216 "hostname"

# Copy the generated ~/.ssh/id_rsa.pub public key to the machine to be deployed
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.168.222.140

# Try passwordless login to the machine to be deployed. You should be able to get the hostname without entering a password
$ ssh root@10.168.222.140 "hostname"
```
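Note that `ssh-copy-id` requires an existing key pair on the control node; if one has not been generated yet, create it first. A minimal guard (the `KEY` path is just the conventional default; override it if your setup differs):

```bash
# Generate an SSH key pair if one does not exist yet; ssh-copy-id needs the .pub file
KEY="${KEY:-$HOME/.ssh/id_rsa}"
mkdir -p "$(dirname "$KEY")"
if [ ! -f "$KEY" ]; then
    ssh-keygen -t rsa -b 4096 -N '' -f "$KEY" -q
fi
```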

### Add Nodes

The following commands are all run on the previously deployed control node. The control node should have the [ocboot](https://github.com/yunionio/ocboot) deployment tool installed in advance.

::::tip If you plan to run GPU-based AI applications
If you plan to run GPU-dependent AI applications (such as Ollama) on the new compute node, please complete [Setting up NVIDIA and CUDA Environment](./setup-nvidia-cuda) on the target compute node before running `ocboot.sh add-node` to add the node.
::::

```bash
# Use ocboot to add nodes
$ ./ocboot.sh add-node --enable-ai-env 10.168.26.216 10.168.222.140
```

```bash
# Other options, use '--help' for help
$ ./ocboot.sh add-node --help
```

### Enable Compute Nodes (Hosts)

After the compute nodes are added, you need to enable the newly reported compute nodes (hosts). Only enabled hosts can run virtual machines and related workloads.

```bash
# Use climc to view the registered host list
$ climc host-list

# Enable the specified host
$ climc host-enable <host_name>
```

## Common Troubleshooting

For common troubleshooting on compute nodes, please refer to: [Host Service Troubleshooting](../../onpremise/guides/host/troubleshooting).
---
sidebar_position: 1
title: Quick Start
description: AI Cloud Quick Start Guide
edition: ce
---

# Quick Installation of AI Cloud with ocboot

Use the [ocboot](https://github.com/yunionio/ocboot) deployment tool to quickly deploy a Cloudpods AI Cloud environment.

:::tip Note
This chapter walks you through quickly setting up an AI Cloud environment using the deployment tool. For deploying a high-availability cluster in production, please refer to: [High Availability Installation](./ha-ce).
:::

## Prerequisites

:::tip Notes

- About GPU
- If the target machine has an Nvidia GPU, you can choose to use the ocboot tool to automatically deploy the driver and CUDA. See details later in this document.
- If there is no Nvidia GPU, the deployed environment will not be able to run inference container instances like Ollama, but it can run application container instances like OpenClaw and Dify that do not depend on GPU.
- The operating system must be a clean installation, because the deployment tool will build a k3s cluster of a specified version from scratch. Make sure the system does not have Kubernetes, Docker, or other container management tools installed, otherwise conflicts will cause installation failures.
- Minimum requirements: 8 CPU cores, 8 GiB memory, 200 GiB storage.
- Virtual machines and services store data under the **/opt** directory, so it is recommended to give **/opt** a dedicated mount point.
- For example, create an ext4 partition on /dev/sdb1 and mount it to /opt via /etc/fstab.
- On Debian-family operating systems (e.g., Debian and Ubuntu), during the first ocboot deployment, GRUB boot options will be detected and updated so that k3s can run properly. As a result, the operating system will reboot during deployment. After the reboot, simply re-run the ocboot deployment.
:::
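The dedicated **/opt** mount described above can be set up as follows, assuming `/dev/sdb1` is an empty spare partition (adjust the device name to your environment; `mkfs` destroys any existing data on it):

```bash
# WARNING: this formats /dev/sdb1 — double-check the device with `lsblk` first
mkfs.ext4 /dev/sdb1
mkdir -p /opt
echo '/dev/sdb1 /opt ext4 defaults 0 2' >> /etc/fstab
mount /opt
df -h /opt
```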

Supported distributions vary by CPU architecture. Currently supported distributions are as follows:

Note: 4.0 refers to Release/4.0.

| OS and Architecture | 4.0 |
| -------------------------------------- | ---- |
| OpenEuler 22.03 LTS SP3 (x86_64 / aarch64) | ✅ |

## Install Cloudpods AI Cloud

import OcbootReleaseDownload from '../../shared/getting-started/_parts/_quickstart-ocboot-release-download.mdx';

<OcbootReleaseDownload />

### Run the Deployment Tool

Next, execute `./ocboot.sh` (a wrapper around `run.py`) to deploy the services. The **host_ip** parameter is the IP address of the deployment node and is optional; if it is not specified, the NIC used by the default route is selected for service deployment. If your node has multiple NICs, specify **host_ip** to choose which NIC the services listen on.

import OcbootRun from '@site/src/components/OcbootK3sRun';

<OcbootRun productVersion='ai' />

The `./ocboot.sh` script calls Ansible to deploy the services. If the script exits due to issues during deployment, you can simply re-run it to retry.

### Configure NVIDIA Driver and CUDA (Optional)

To install or configure NVIDIA drivers and CUDA on a node to run GPU-dependent AI container applications such as Ollama, please refer to: [Setting up NVIDIA and CUDA Environment](./setup-nvidia-cuda).

## Start Using Cloudpods AI Cloud

```bash
....
# After deployment completes, the following output indicates success
# Open https://10.168.26.216 in a browser, where the IP is the <host_ip> set earlier
# Log in with admin/admin@123 to access the web interface
Initialized successfully!
Web page: https://10.168.26.216
User: admin
Password: admin@123
```

After deployment, open the Web address output by ocboot (e.g., `https://<host_ip>`) in a browser, and log in with the provided credentials to access the Cloudpods console.
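You can also check service health from the command line on the deployment node. This sketch assumes the Cloudpods default namespace `onecloud` and the kubectl bundled with k3s:

```bash
# All Cloudpods service pods should eventually reach Running or Completed status
kubectl get pods -n onecloud
```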

### Enable Hosts

The newly created environment will be added to the platform as a host node, which is not enabled by default. Go to **Compute -> Infrastructure -> Hosts** to view the host list and enable the corresponding host.
![](../../shared/getting-started/images/host.png)

### Quickly Create AI Instances

Go to the **"Artificial Intelligence"** menu to quickly create AI applications. Please refer to the corresponding documentation based on your needs.

:::tip
- Before creating applications that **depend on GPU**, please first complete: [Setting up NVIDIA and CUDA Environment](./setup-nvidia-cuda).
:::

| Application | Type | Description | GPU Required | Quick Start |
|---|---|---|---|---|
| OpenClaw | AI App | Open-source self-hosted personal agent assistant | No | [OpenClaw Quick Start](../guides/llm-app/openclaw#quickstart) |
| Dify | AI App | LLM application development and workflow orchestration platform (can connect to inference services) | No | [Dify Quick Start](../guides/llm-app/dify#quickstart) |
| ComfyUI | AI App | Image generation and node-based workflow application | Yes | [ComfyUI Quick Start](../guides/llm-app/comfyui#quickstart) |
| Ollama | AI Inference | Lightweight local inference service | Yes | [Ollama Quick Start](../guides/llm-inference/ollama#quickstart) |

## FAQ

### 1. How to add more AI nodes?

To add new nodes (especially GPU nodes) to an existing cluster, if you need to run GPU-dependent AI applications, it is recommended to first configure NVIDIA/CUDA. Refer to: [Setting up NVIDIA and CUDA Environment](./setup-nvidia-cuda). Then complete the host addition and enablement. Refer to: [Adding Compute Nodes](./host).

### 2. How to upgrade?

Please refer to [Upgrading via ocboot](/docs/onpremise/operations/upgrading/ocboot-upgrade) (if AI Cloud shares the ocboot upgrade process with private cloud).

### 3. Other questions?

Feel free to submit issues on Cloudpods GitHub Issues: [https://github.com/yunionio/cloudpods/issues](https://github.com/yunionio/cloudpods/issues). We will respond as soon as possible.
---
sidebar_position: 2
title: Setting up NVIDIA and CUDA Environment
description: Install NVIDIA drivers and CUDA on AI Cloud nodes to run GPU-based AI applications such as Ollama
edition: ce
---

# Setting up NVIDIA and CUDA Environment

This document describes how to use ocboot's `setup-ai-env` command to configure NVIDIA drivers, CUDA, containerd, and other AI runtime environments on an existing cluster or standalone machine, without deploying or modifying Cloudpods cloud management services. This is applicable for scenarios that require running GPU-dependent AI container applications such as Ollama.

:::tip Notes
- If the target environment does not have an Nvidia GPU, or you do not need to run GPU-dependent container AI applications like Ollama, you can skip this document.
- Without the Nvidia runtime environment configured, you can still run AI container applications that do not depend on GPU, such as OpenClaw and Dify.
:::

## Command Format

Please first download and prepare the NVIDIA driver and CUDA installation packages (e.g., CUDA 12.8). Note that you should download the **.run** installation packages:
- NVIDIA driver: Available from [NVIDIA Driver Downloads](https://www.nvidia.com/en-us/drivers/unix/).
- CUDA installation package: Available from [NVIDIA CUDA Toolkit Downloads](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64).

After downloading, transfer the installation packages to the target machine or a path accessible from the machine running ocboot.

```bash
./ocboot.sh setup-ai-env <target_host1> [target_host2 ...] \
--nvidia-driver-installer-path <full_path_to_driver> \
--cuda-installer-path <full_path_to_cuda> \
[--gpu-device-virtual-number 2] [--user USER] [--key-file KEY] [--port PORT]
```

## Examples

```bash
# Basic usage
./ocboot.sh setup-ai-env 10.127.222.247 \
--nvidia-driver-installer-path /root/nvidia/NVIDIA-Linux-x86_64-570.133.07.run \
--cuda-installer-path /root/nvidia/cuda_12.8.1_570.124.06_linux.run

# Specify GPU virtual number and SSH options
./ocboot.sh setup-ai-env 10.127.222.247 \
--nvidia-driver-installer-path /root/nvidia/NVIDIA-Linux-x86_64-570.133.07.run \
--cuda-installer-path /root/nvidia/cuda_12.8.1_570.124.06_linux.run \
--gpu-device-virtual-number 2 \
--user admin \
--port 2222
```

## Parameter Reference

| Parameter | Required | Default | Description |
|------|------|--------|------|
| `--nvidia-driver-installer-path` | Yes | - | Full path to the NVIDIA driver installation package. |
| `--cuda-installer-path` | Yes | - | Full path to the CUDA installation package. |
| `--gpu-device-virtual-number` | No | 2 | Virtual number for NVIDIA GPU shared devices. |
| `--user` / `-u` | No | root | SSH username. |
| `--key-file` / `-k` | No | - | SSH private key file path. |
| `--port` / `-p` | No | 22 | SSH port. |

## Installation Steps and Workflow

ocboot will perform the following steps (among others) on the target host:

1. Check OS support and local installation files (if paths are specified)
2. Install kernel headers and development packages
3. Clean up vfio-related configurations (if present)
4. Install NVIDIA driver (if `--nvidia-driver-installer-path` is provided)
5. Install CUDA environment (if `--cuda-installer-path` is provided)
6. Configure GRUB (add nvidia-drm.modeset=1)
7. Install NVIDIA Container Toolkit
8. Configure containerd runtime
9. Configure host device mappings (auto-discover the NVIDIA-backed `/dev/dri/renderD*` devices and generate the configuration)
10. Verify the installation
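Steps 7–8 above correspond roughly to the standard NVIDIA Container Toolkit setup. A minimal sketch (the toolkit package itself is installed in a distribution-specific way, and the exact invocation may vary by toolkit version):

```bash
# Wire the NVIDIA runtime into containerd, then restart containerd to pick it up
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
```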

## Important Notes

- During installation, the target host may **automatically reboot** after installing kernel packages or updating GRUB. Please be aware of this in advance.
- Ensure the target host has sufficient disk space and network connectivity to download dependencies such as the NVIDIA Container Toolkit.
- The installation packages must be prepared on the **machine running ocboot**, and ensure they have been **transferred to the target machine before running the playbook**. For transfer methods, see the FAQ below: "How to transfer installation packages to the target machine?".
- Ensure passwordless SSH login between the machine running ocboot and the target host (or use parameters such as `--key-file`).

## FAQ

### No GPU listed or GPU not detected in the host list after deployment?

Confirm that the NVIDIA driver and CUDA are installed on the target machine, and that the driver version matches the CUDA version. If you used `run.py ai` without passing `--nvidia-driver-installer-path` / `--cuda-installer-path`, you need to manually install the driver and CUDA on the target machine beforehand. After installation, run `nvidia-smi` to verify.
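A quick sanity check on the target machine:

```bash
# Lists the GPUs plus the driver and CUDA versions; an error here means the driver is not loaded
nvidia-smi
```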

### Is it normal for the target host to automatically reboot during installation?

Yes. After installing kernel-related packages or updating GRUB, ocboot may trigger a reboot to load new drivers or configurations. Do not interrupt the process. If the deployment is not complete after the reboot, re-run the original deployment command to continue.

### How to transfer installation packages to the target machine?

Before running `setup-ai-env` or `ai` with installation package parameters, you can first transfer the packages to the target machine, for example:

```bash
# Using rsync (recommended)
rsync -avP /path/to/nvidia/NVIDIA-Linux-x86_64-570.133.07.run target_host:/root/nvidia/
rsync -avP /path/to/cuda/cuda_12.8.1_570.124.06_linux.run target_host:/root/nvidia/

# Using scp
scp /path/to/nvidia/NVIDIA-Linux-x86_64-570.133.07.run target_host:/root/nvidia/
scp /path/to/cuda/cuda_12.8.1_570.124.06_linux.run target_host:/root/nvidia/
```
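After the transfer, you can optionally compare checksums to rule out corruption (paths as in the example above):

```bash
# Compute the digest locally and on the target host; the two must match
sha256sum /path/to/nvidia/NVIDIA-Linux-x86_64-570.133.07.run
ssh target_host "sha256sum /root/nvidia/NVIDIA-Linux-x86_64-570.133.07.run"
```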

### How to check if the platform has GPUs available?

If the configuration is successful, go to **Compute -> Infrastructure -> Passthrough Devices** to see the detected and reported GPUs.
![](./images/gpu-list.png)
---
sidebar_position: 2
hide_table_of_contents: true
edition: ce
---

# User Guide

Concepts and usage instructions for AI applications, inference, templates, model libraries, and images.

## Reading Guide

- For deployment, see [Quick Start](../getting-started/quickstart).
- For GPU deployment, see [NVIDIA/CUDA](../getting-started/setup-nvidia-cuda).
- Read in this order: [AI Applications](./llm-app/) (including [Application Templates](./llm-app/template)) → [AI Inference](./llm-inference/) (including [Inference Templates](./llm-inference/template) / [AI Model Library](./llm-inference/model-library)) → [AI Images](./llm-image).

## Console Menu and Feature Overview

After entering **Artificial Intelligence**, the menu is divided into **Applications**, **Inference**, and **Images**:

- **Applications**: Application instances (Dify/OpenClaw/ComfyUI), application templates.
- **Inference**: Inference instances (Ollama), inference templates, inference model library.
- **Images**: Managed under "Artificial Intelligence" → "Images"; selected when creating templates.

## Terminology

- **Template**: Resource configuration such as CPU/memory/GPU required to run an instance.
- **Model Library**: Reusable model resources for use by inference applications.
- **Image**: Container image for the instance runtime environment, determining functionality and version.

## Typical Relationship

Instance = Image + Spec + (optional) Model.

import IndexDocCardList from '@site/src/components/IndexDocCardList';

<IndexDocCardList />