Problem Description
1. Prior work performs the Hadamard transform in float64, whereas our approach uses bfloat16.
2. The weight transform can be done in place, eliminating the need to reapply it at every iteration of AR tuning.
3. Supports shared layers, such as MoE and fused QKV.
4. Uses a truly random matrix for each layer.
5. Fused with block-wise AR tuning to significantly reduce RAM usage (otherwise the memory overhead is high).
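The transform described in items 1, 2, and 4 can be sketched roughly as below. This is an illustrative sketch, not the actual implementation: the function names are made up, NumPy float32 stands in for bfloat16 (NumPy has no bfloat16 dtype), and the per-layer random sign diagonal is one common way to realize a randomized Hadamard transform.

```python
import numpy as np

def fwht_inplace(x):
    """Fast Walsh-Hadamard transform of a 1-D array, done in place.
    Length must be a power of two. Orthonormal scaling makes it self-inverse."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x /= np.sqrt(n)
    return x

def transform_weight(w, signs):
    """Apply a randomized Hadamard transform (per-layer random +/-1 signs,
    then FWHT) to each row of a weight matrix, in place. In the setting
    above this would run once per layer, in bfloat16, before AR tuning."""
    w *= signs
    for row in w:
        fwht_inplace(row)
    return w

rng = np.random.default_rng(0)
n = 8
signs = rng.choice([-1.0, 1.0], size=n).astype(np.float32)
w = rng.standard_normal((4, n)).astype(np.float32)
w_ref = w.copy()

transform_weight(w, signs)  # done once; AR tuning then operates on w directly

# Invert: FWHT again (self-inverse), then undo the sign flips.
for row in w:
    fwht_inplace(row)
w *= signs
assert np.allclose(w, w_ref, atol=1e-5)
```

Because the transform is applied in place once up front, the AR tuning loop never has to re-materialize the rotated weight, which is where the memory saving in item 5 comes from.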
Reproduction Steps
~
Environment Information
~
Error Logs
~
Additional Context
No response