
UPSTREAM PR #1184: Feat: Select backend devices via arg #40

Open
loci-dev wants to merge 23 commits into main from loci/pr-1184-select-backend

Conversation

@loci-dev

@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1184

The main goal of this PR is to improve the user experience in multi-GPU setups by allowing the user to choose which model part gets sent to which device.

CLI changes:

  • Add the --main-backend-device [device_name] argument to set the default backend device
  • Remove the --clip-on-cpu, --vae-on-cpu, and --control-net-cpu arguments
  • Replace them respectively with the new --clip_backend_device [device_name], --vae-backend-device [device_name], and --control-net-backend-device [device_name] arguments
  • Add --diffusion_backend_device (controls the device used for the diffusion/flow models) and --tae-backend-device
  • Add --upscaler-backend-device, --photomaker-backend-device, and --vision-backend-device
  • Add the --list-devices argument to print the list of available ggml devices and exit
  • Add the --rpc argument to connect to a compatible GGML RPC server

C API changes (stable-diffusion.h):

  • Change the content of the sd_ctx_params_t struct.
  • void list_backends_to_buffer(char* buffer, size_t buffer_size) to write the details of the available devices to a null-terminated char array. Devices are separated by newline characters (\n); the name and description of each device are separated by a \t character.
  • size_t backend_list_size() to get the size of the buffer needed by list_backends_to_buffer
  • void add_rpc_device(const char* address); connects to a ggml RPC backend (from llama.cpp)

The default device selection should now consistently prioritize discrete GPUs over iGPUs.

For example, if you want to run the text encoders on the CPU, you now need to use --clip_backend_device CPU instead of --clip-on-cpu.

TODO:

  • Fix a bug with --lora-apply-mode immediately when the CLIP and diffusion models are running on different (non-CPU) backends.
  • Clean up logs

Important: to use RPC, you need to add -DGGML_RPC=ON to the build. Additionally, it requires either that sd.cpp be built with the -DSD_USE_SYSTEM_GGML flag (I haven't tested that one), or that the RPC server be built with -DCMAKE_C_FLAGS="-DGGML_MAX_NAME=128" -DCMAKE_CXX_FLAGS="-DGGML_MAX_NAME=128" (the default is 64).

Fixes #1116

@loci-dev loci-dev force-pushed the main branch 19 times, most recently from 052ebb0 to 76ede2c on February 3, 2026 10:20
@loci-dev loci-dev force-pushed the loci/pr-1184-select-backend branch from 29e8399 to 2d43513 on February 3, 2026 10:46
@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 3, 2026 10:46 with GitHub Actions
@loci-review

loci-review bot commented Feb 3, 2026

Overview

Analysis of stable-diffusion.cpp across 18 commits reveals minimal performance impact from multi-backend device management refactoring. Of 48,425 total functions, 124 were modified (0.26%), 331 added, and 109 removed. Power consumption increased negligibly: build.bin.sd-cli (+0.388%, 479,167→481,028 nJ) and build.bin.sd-server (+0.239%, 512,977→514,202 nJ).

Function Analysis

SDContextParams Constructor (both binaries): Response time increased ~40% (+2,816-2,840ns) due to initializing 9 new std::string device placement fields replacing 3 boolean flags. Enables per-component GPU/CPU device selection for heterogeneous computing.

SDContextParams Destructor (both binaries): Response time increased ~42% (+2,497-2,505ns) from destroying 9 additional string members. One-time cleanup cost outside inference paths.

~StableDiffusionGGML (both binaries): Throughput time increased ~95% (+192ns absolute) managing 7 backend types versus 3, including loop-based cleanup for multiple CLIP backends. Response time impact minimal (+5.2%, ~720ns).

ggml_e8m0_to_fp32_half (sd-cli): Response time improved 24% (-36ns), benefiting quantization operations called millions of times during inference.

Standard library functions (std::_Rb_tree::begin, std::vector::_S_max_size, std::swap): Showed 76-289% throughput increases due to template instantiation complexity, but absolute changes remain under 220ns in non-critical initialization paths.

Additional Findings

All performance regressions occur in initialization and cleanup phases, not inference hot paths. The architectural changes enable multi-GPU workload distribution, per-component device placement (diffusion, CLIP, VAE on separate devices), and runtime backend flexibility. Quantization improvements and multi-GPU capabilities provide net performance gains during actual inference, far exceeding the microsecond-level initialization overhead. Changes are well-justified architectural improvements with negligible real-world impact.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 5bbc590 to 68f62a5 on February 8, 2026 04:51
@loci-review

loci-review bot commented Mar 6, 2026

Overview

Analysis of stable-diffusion.cpp compared 50,147 functions across two binaries: build.bin.sd-cli (power consumption: 491,453 nJ → 494,103 nJ, +0.539%) and build.bin.sd-server (527,149 nJ → 528,268 nJ, +0.212%). Of these, 139 functions were modified, 406 added, 63 removed, and 49,539 unchanged. The target version implements a multi-backend architecture enabling heterogeneous computing (GPU/CPU/RPC) with per-component device allocation, introducing measured performance trade-offs in initialization paths while maintaining minimal overall energy impact.

Function Analysis

apply_loras_immediately (both binaries): Response time increased from ~10.6ms to ~31.2ms (+191-193%, +20.5ms absolute). Throughput time doubled from ~685ns to ~1,370ns (+100%). The function was refactored to load LoRA models three times with backend-specific filtering (diffusion, CLIP, VAE/TAE) instead of once, enabling component-specific device placement and preventing tensor contamination. This architectural change is justified for multi-GPU support and correctness.

neon_compute_fp16_to_fp32 (sd-cli): Response time increased from 85ns to 305ns (+258%, +220ns). Throughput time increased from 77ns to 298ns (+286%). This NEON-optimized FP16→FP32 conversion regression originates from the external GGML library, not application code. The 3.6x slowdown could impact ARM platforms if called frequently during tensor operations.

~StableDiffusionGGML (both binaries): Throughput time increased from 201ns to 393ns (+95%, +192ns). Response time increased from ~13.9ms to ~14.6ms (+5%, +720ns). The destructor now iterates through multiple CLIP backends and includes conditional checks for diffusion_backend and tae_backend to prevent double-free errors, supporting the multi-backend architecture.

~SDContextParams (both binaries): Response time increased from ~5.9ms to ~8.4ms (+42%, +2.5μs). Throughput time increased from 115ns to 166ns (+44%). The compiler-generated destructor now cleans up 9 std::string members for backend device configuration instead of 3 boolean flags, an expected trade-off for flexible device specification.

ggml_compute_forward_map_custom2 (sd-server): Response time increased from 110ns to 191ns (+75%, +82ns). Throughput time increased from 95ns to 177ns (+86%). This GGML custom operation likely added backend compatibility checks or precision handling logic for runtime backend management.

Standard library functions showed improvements: vector::begin (-68% response time), basic_string::_M_set_length (-42% response time), and vector::_S_max_size (-57% response time), offsetting some architectural overhead through compiler optimizations.

Additional Findings

The multi-backend refactoring enables critical ML deployment capabilities: parallel CLIP encoding across multiple GPUs, heterogeneous execution (CPU text encoding while GPU handles diffusion), and distributed inference via RPC. Performance regressions are concentrated in initialization and configuration paths rather than inference hot paths, with <1% power consumption increase indicating efficient implementation. The NEON regression warrants investigation for ARM-based edge deployments where FP16 quantized models are common.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the loci/pr-1184-select-backend branch from 79f3aa5 to 4fb8901 on March 20, 2026 04:16
@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on March 20, 2026 04:16 with GitHub Actions
@loci-review

loci-review bot commented Mar 20, 2026

Overview

Analysis of 50,067 functions across 23 commits implementing multi-backend device selection for distributed inference. Modified: 142 functions (0.28%), New: 442, Removed: 63.

Binaries analyzed:

  • build.bin.sd-cli: +0.547% power consumption (+2,689 nJ)
  • build.bin.sd-server: +0.246% power consumption (+1,298 nJ)

Impact: Performance changes are justified architectural improvements enabling multi-GPU support, distributed inference via RPC, and flexible hardware resource allocation for SDXL/SD3/FLUX models.

Function Analysis

apply_loras_immediately (both binaries):

  • Response time: 10.4ms → 31.2ms (+20.8ms, +200%)
  • Throughput time: 663-679ns → 1,279-1,293ns (+600-631ns, +88-95%)
  • Justification: Refactored to apply LoRAs to three model components (diffusion, text encoders, VAE) instead of one. Calls load_lora_model_from_file three times, directly explaining 3x increase. Enables per-component backend selection critical for multi-GPU setups.

~StableDiffusionGGML (both binaries):

  • Response time: 13.9ms → 14.6ms (+720ns, +5.2%)
  • Throughput time: 201ns → 393ns (+192ns, +96%)
  • Justification: Enhanced for multi-backend cleanup with vector iteration for multiple CLIP backends. Adds stack canary protection (+42ns). One-time shutdown cost.

~SDContextParams (sd-server):

  • Response time: 5.9μs → 8.5μs (+2.5μs, +42%)
  • Justification: Replaced 3 boolean CPU flags with 9 string-based device configuration members enabling granular per-component device assignment ("cuda:0", "cuda:1", "cpu", "rpc://server:port").

Standard library regressions:

  • std::_Rb_tree::end(): +227% (+183ns) - compiler code layout regression
  • std::back_inserter: +204% (+185ns) - compiler optimization issue
  • std::allocator::allocate: +57-68% (+85ns) - object size increase from multi-backend support

Other analyzed functions (constructors, lambda operators) showed minimal overhead (<100ns absolute) justified by architectural enhancements.

Flame Graph Comparison

Function: apply_loras_immediately (build.bin.sd-server)

Base version:
Flame Graph: build.bin.sd-server::_ZN19StableDiffusionGGML23apply_loras_immediatelyERKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEfSt4hashIS6_ESt8equal_toIS6_ESaISt4pairIKS6_fEEE

Target version:
Flame Graph: build.bin.sd-server::_ZN19StableDiffusionGGML23apply_loras_immediatelyERKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEfSt4hashIS6_ESt8equal_toIS6_ESaISt4pairIKS6_fEEE

Target version shows three-pass architecture with load_lora_model_from_file (32.6% of execution) and memory allocation (31.9%) dominating. Base version had single-pass loading.

Additional Findings

ML/GPU Operations: Changes enable parallel text encoder execution on separate GPUs (critical for SDXL CLIP-L + CLIP-G, SD3/FLUX multi-encoder architectures), distributed inference via RPC backends, and flexible VAE offloading to reduce GPU memory pressure. Initialization overhead (~21ms) is negligible compared to typical inference time (5-30 seconds per image). Diffusion sampling loops remain unaffected.

Standard library regressions (std::_Rb_tree::end(), std::back_inserter) stem from compiler code generation differences, not source changes. Consider reviewing compiler flags between builds.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
