Flux schnell#2082

Open
kali wants to merge 8 commits into main from flux-schnell
Conversation

@kali
Collaborator

@kali kali commented Mar 30, 2026

No description provided.

kali added 8 commits March 30, 2026 08:07
- Template cnn.cu over f32/f16 types (accumulate in f32 for precision)
- Route f16 Conv ops to CUDA in transform.rs
- Dispatch to conv{N}d_f16_generic in ConvGeneric
- Add conv_f16 test suite (ConvProblemF16 wrapper)
- Register conv_f16 tests in test-cuda suite
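The accumulation strategy from the first bullet can be sketched in Rust (the actual kernel lives in cnn.cu as a CUDA template; `Elem` and `conv1d_valid` are illustrative names, not tract APIs): elements are read as the templated type, but the dot product accumulates in f32 so long f16 reductions do not lose precision.

```rust
// Illustrative sketch: elements are generic over the storage type T,
// but the reduction always accumulates in f32, mirroring the
// f32-accumulator choice in the templated cnn.cu kernel.
trait Elem: Copy {
    fn to_f32(self) -> f32;
    fn from_f32(v: f32) -> Self;
}

impl Elem for f32 {
    fn to_f32(self) -> f32 { self }
    fn from_f32(v: f32) -> Self { v }
}

// A "valid"-padding 1D convolution; an f16 impl of Elem (e.g. via the
// half crate) would plug in the same way without changing the math.
fn conv1d_valid<T: Elem>(input: &[T], kernel: &[T]) -> Vec<T> {
    let out_len = input.len() + 1 - kernel.len();
    (0..out_len)
        .map(|i| {
            let mut acc = 0f32; // accumulate in f32 regardless of T
            for (k, w) in kernel.iter().enumerate() {
                acc += input[i + k].to_f32() * w.to_f32();
            }
            T::from_f32(acc)
        })
        .collect()
}
```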
Grouped convolutions work with the generic CUDA conv kernel for both f32 and f16.
cuDNN returns CUDNN_STATUS_NOT_SUPPORTED for f16 3D convolutions,
so we gate cuDNN f16 to hw_rank <= 2 and let higher ranks use the
generic CUDA kernel.
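The gating rule above amounts to a small backend-selection function; a sketch with hypothetical names (the real dispatch happens in tract's op selection, not a free function like this):

```rust
// Sketch of the backend-selection rule: cuDNN handles f16 only up to
// 2 spatial dimensions, so f16 convs with hw_rank > 2 fall back to the
// generic CUDA kernel. Names are illustrative, not tract's real types.
#[derive(Debug, PartialEq)]
enum ConvBackend { Cudnn, GenericCuda }

#[derive(Clone, Copy)]
enum Datum { F32, F16 }

fn pick_backend(datum: Datum, hw_rank: usize) -> ConvBackend {
    match datum {
        // cuDNN returns CUDNN_STATUS_NOT_SUPPORTED for f16 3D convs.
        Datum::F16 if hw_rank > 2 => ConvBackend::GenericCuda,
        _ => ConvBackend::Cudnn,
    }
}
```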
Cast scale and bias back to the input datum type before the final
mul/add so the output matches the expected f16 type instead of being
promoted to f32 by mixed-type arithmetic.
Cast alpha and beta scalar constants to match the input datum type
so they don't promote the output from f16 to f32.
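Both of these fixes follow the same pattern, which can be modeled at the datum-type level (types and function names here are illustrative, not tract's API): mixed-type arithmetic promotes to the wider type, so scalars must be cast down to the input's type first.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Datum { F16, F32 }

// Mixed-type arithmetic promotes the result to the wider datum type.
fn result_datum(a: Datum, b: Datum) -> Datum {
    if a == Datum::F32 || b == Datum::F32 { Datum::F32 } else { Datum::F16 }
}

// Casting is modeled as simply adopting the target datum type.
fn cast(_value: Datum, to: Datum) -> Datum { to }

// The fix: cast the scalar constant (scale/bias, alpha/beta) to the
// input's datum type before the mul/add, so an f16 input yields f16 out.
fn affine_output_datum(input: Datum, scalar: Datum) -> Datum {
    let scalar = cast(scalar, input);
    result_datum(input, scalar)
}
```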
Flow-matching transformer (12B params), CLIP-L + T5-XXL text encoders,
no classifier-free guidance (distilled), 4 steps. Packing/unpacking and
RoPE position IDs are wrapped in the ONNX export for a clean Rust interface.
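Since the model is distilled, the denoise loop is just a few Euler steps along the learned flow, with no guidance branch. A minimal sketch (the `velocity` closure stands in for the 12B transformer, and the linear 1 → 0 sigma ramp is a placeholder, not Flux's actual shifted schedule):

```rust
// One Euler step of a flow-matching sampler:
//   x_next = x + (sigma_next - sigma) * v(x, sigma)
// Four steps walk the latent from noise (sigma = 1) to the sample
// (sigma = 0) with a single model call per step (no CFG).
fn denoise(mut x: Vec<f32>, velocity: impl Fn(&[f32], f32) -> Vec<f32>) -> Vec<f32> {
    let sigmas = [1.0f32, 0.75, 0.5, 0.25, 0.0]; // 4 steps
    for w in sigmas.windows(2) {
        let (s, s_next) = (w[0], w[1]);
        let v = velocity(&x, s);
        for (xi, vi) in x.iter_mut().zip(&v) {
            *xi += (s_next - s) * vi;
        }
    }
    x
}
```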
…ipeline

- export.py: load one component at a time in f16 (VAE in f32) to avoid OOM
- reference.py: generate reference I/O bundles for tract validation
- main.rs: full pipeline (tokenize, text encode, denoise in 4 steps, VAE decode)
  Models loaded/unloaded sequentially to fit in 32GB VRAM.
  Transformer + text encoders in f16, VAE in f32 (instance norm overflows f16).
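The sequential load/unload pattern can be sketched as follows; `Model` is a stand-in for a loaded tract model, not the real API, and the point is simply that each component is dropped before the next one loads, so only one large model is resident at a time:

```rust
// Sketch of the load -> run -> drop sequencing used to fit in 32 GB VRAM.
struct Model { name: &'static str }

impl Model {
    fn load(name: &'static str, log: &mut Vec<String>) -> Model {
        log.push(format!("load {name}"));
        Model { name }
    }
    fn run(&self, log: &mut Vec<String>) {
        log.push(format!("run {}", self.name));
    }
}

fn pipeline(log: &mut Vec<String>) {
    for stage in ["clip-l", "t5-xxl", "transformer", "vae"] {
        let m = Model::load(stage, log);
        m.run(log);
        log.push(format!("drop {}", m.name));
        // `m` goes out of scope here, freeing its weights before the
        // next component loads.
    }
}
```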
