Flux schnell#2082

Open
kali wants to merge 8 commits into main from flux-schnell
Conversation

@kali
Collaborator

@kali kali commented Mar 30, 2026

No description provided.

kali added 8 commits March 30, 2026 08:07
- Template cnn.cu over f32/f16 types (accumulate in f32 for precision)
- Route f16 Conv ops to CUDA in transform.rs
- Dispatch to conv{N}d_f16_generic in ConvGeneric
- Add conv_f16 test suite (ConvProblemF16 wrapper)
- Register conv_f16 tests in test-cuda suite
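The accumulation strategy from the first bullet can be sketched in Rust (the actual kernel lives in cnn.cu as a CUDA template; `Elem` and `conv1d_valid` are illustrative names, not tract APIs): elements are read as the templated type, but the dot product accumulates in f32 so long f16 reductions do not lose precision.

```rust
// Illustrative sketch: elements are generic over the storage type T,
// but the reduction always accumulates in f32, mirroring the
// f32-accumulator choice in the templated cnn.cu kernel.
trait Elem: Copy {
    fn to_f32(self) -> f32;
    fn from_f32(v: f32) -> Self;
}

impl Elem for f32 {
    fn to_f32(self) -> f32 { self }
    fn from_f32(v: f32) -> Self { v }
}

// A "valid"-padding 1D convolution; an f16 impl of Elem (e.g. via the
// half crate) would plug in the same way without changing the math.
fn conv1d_valid<T: Elem>(input: &[T], kernel: &[T]) -> Vec<T> {
    let out_len = input.len() + 1 - kernel.len();
    (0..out_len)
        .map(|i| {
            let mut acc = 0f32; // accumulate in f32 regardless of T
            for (k, w) in kernel.iter().enumerate() {
                acc += input[i + k].to_f32() * w.to_f32();
            }
            T::from_f32(acc)
        })
        .collect()
}
```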
Grouped convolutions work with the generic CUDA conv kernel for both f32 and f16.
cuDNN returns CUDNN_STATUS_NOT_SUPPORTED for f16 3D convolutions,
so we gate cuDNN f16 to hw_rank <= 2 and let higher ranks use the
generic CUDA kernel.
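The gating rule above amounts to a small backend-selection function; a sketch with hypothetical names (the real dispatch happens in tract's op selection, not a free function like this):

```rust
// Sketch of the backend-selection rule: cuDNN handles f16 only up to
// 2 spatial dimensions, so f16 convs with hw_rank > 2 fall back to the
// generic CUDA kernel. Names are illustrative, not tract's real types.
#[derive(Debug, PartialEq)]
enum ConvBackend { Cudnn, GenericCuda }

#[derive(Clone, Copy)]
enum Datum { F32, F16 }

fn pick_backend(datum: Datum, hw_rank: usize) -> ConvBackend {
    match datum {
        // cuDNN returns CUDNN_STATUS_NOT_SUPPORTED for f16 3D convs.
        Datum::F16 if hw_rank > 2 => ConvBackend::GenericCuda,
        _ => ConvBackend::Cudnn,
    }
}
```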
Cast scale and bias back to the input datum type before the final
mul/add so the output matches the expected f16 type instead of being
promoted to f32 by mixed-type arithmetic.
Cast alpha and beta scalar constants to match the input datum type
so they don't promote the output from f16 to f32.
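Both of these fixes follow the same pattern, which can be modeled at the datum-type level (types and function names here are illustrative, not tract's API): mixed-type arithmetic promotes to the wider type, so scalars must be cast down to the input's type first.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Datum { F16, F32 }

// Mixed-type arithmetic promotes the result to the wider datum type.
fn result_datum(a: Datum, b: Datum) -> Datum {
    if a == Datum::F32 || b == Datum::F32 { Datum::F32 } else { Datum::F16 }
}

// Casting is modeled as simply adopting the target datum type.
fn cast(_value: Datum, to: Datum) -> Datum { to }

// The fix: cast the scalar constant (scale/bias, alpha/beta) to the
// input's datum type before the mul/add, so an f16 input yields f16 out.
fn affine_output_datum(input: Datum, scalar: Datum) -> Datum {
    let scalar = cast(scalar, input);
    result_datum(input, scalar)
}
```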
Flow-matching transformer (12B params), CLIP-L + T5-XXL text encoders,
no classifier-free guidance (distilled), 4 steps. Packing/unpacking and
RoPE position IDs are wrapped in the ONNX export for a clean Rust interface.
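Since the model is distilled, the denoise loop is just a few Euler steps along the learned flow, with no guidance branch. A minimal sketch (the `velocity` closure stands in for the 12B transformer, and the linear 1 → 0 sigma ramp is a placeholder, not Flux's actual shifted schedule):

```rust
// One Euler step of a flow-matching sampler:
//   x_next = x + (sigma_next - sigma) * v(x, sigma)
// Four steps walk the latent from noise (sigma = 1) to the sample
// (sigma = 0) with a single model call per step (no CFG).
fn denoise(mut x: Vec<f32>, velocity: impl Fn(&[f32], f32) -> Vec<f32>) -> Vec<f32> {
    let sigmas = [1.0f32, 0.75, 0.5, 0.25, 0.0]; // 4 steps
    for w in sigmas.windows(2) {
        let (s, s_next) = (w[0], w[1]);
        let v = velocity(&x, s);
        for (xi, vi) in x.iter_mut().zip(&v) {
            *xi += (s_next - s) * vi;
        }
    }
    x
}
```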
…ipeline

- export.py: load one component at a time in f16 (VAE in f32) to avoid OOM
- reference.py: generate reference I/O bundles for tract validation
- main.rs: full pipeline (tokenize, text encode, denoise in 4 steps, VAE decode)
  Models loaded/unloaded sequentially to fit in 32GB VRAM.
  Transformer + text encoders in f16, VAE in f32 (instance norm overflows f16).
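The sequential load/unload pattern can be sketched as follows; `Model` is a stand-in for a loaded tract model, not the real API, and the point is simply that each component is dropped before the next one loads, so only one large model is resident at a time:

```rust
// Sketch of the load -> run -> drop sequencing used to fit in 32 GB VRAM.
struct Model { name: &'static str }

impl Model {
    fn load(name: &'static str, log: &mut Vec<String>) -> Model {
        log.push(format!("load {name}"));
        Model { name }
    }
    fn run(&self, log: &mut Vec<String>) {
        log.push(format!("run {}", self.name));
    }
}

fn pipeline(log: &mut Vec<String>) {
    for stage in ["clip-l", "t5-xxl", "transformer", "vae"] {
        let m = Model::load(stage, log);
        m.run(log);
        log.push(format!("drop {}", m.name));
        // `m` goes out of scope here, freeing its weights before the
        // next component loads.
    }
}
```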
