Note: this instruction proposal is migrated from WebAssembly/simd#79
- What are the instructions being proposed?
- relaxed f32x4.fma
- relaxed f32x4.fms
- relaxed f64x2.fma
- relaxed f64x2.fms
- What are the semantics of these instructions?
All the instructions take 3 operands, a, b, and c, and compute (a * b) + c or -(a * b) + c:
relaxed f32x4.fma(a, b, c) = (a * b) + c
relaxed f32x4.fms(a, b, c) = -(a * b) + c
relaxed f64x2.fma(a, b, c) = (a * b) + c
relaxed f64x2.fms(a, b, c) = -(a * b) + c
where:
- the intermediate a * b is rounded first, and the final result is rounded again (for a total of 2 roundings), or
- the entire expression is evaluated with higher precision and then rounded only once.
- How will these instructions be implemented? Give examples for at least
x86-64 and ARM64. Also provide reference implementation in terms of 128-bit
Wasm SIMD.
Detailed implementation guidance is available at WebAssembly/simd#79; below is an overview.
x86/x86-64 with FMA3
relaxed f32x4.fma = VFMADD213PS
relaxed f32x4.fms = VFNMADD213PS
relaxed f64x2.fma = VFMADD213PD
relaxed f64x2.fms = VFNMADD213PD
ARM64
relaxed f32x4.fma = FMLA
relaxed f32x4.fms = FMLS
relaxed f64x2.fma = FMLA
relaxed f64x2.fms = FMLS
ARMv7 with FMA (Neon v2)
relaxed f32x4.fma = VFMA
relaxed f32x4.fms = VFMS
relaxed f64x2.fma = VFMA
relaxed f64x2.fms = VFMS
ARMv7 without FMA (2 rounding)
relaxed f32x4.fma = VMLA
relaxed f32x4.fms = VMLS
relaxed f64x2.fma = VMLA
relaxed f64x2.fms = VMLS
Note: Armv8-M will require MVE-F (floating point extension)
RISC-V V
relaxed f32x4.fma = vfmacc.vv
relaxed f32x4.fms = vfnmsac.vv
relaxed f64x2.fma = vfmadd.vv
relaxed f64x2.fms = vfnmsac.vv
simd128
relaxed f32x4.fma = f32x4.add(f32x4.mul(a, b), c)
relaxed f32x4.fms = f32x4.sub(c, f32x4.mul(a, b))
relaxed f64x2.fma = f64x2.add(f64x2.mul(a, b), c)
relaxed f64x2.fms = f64x2.sub(c, f64x2.mul(a, b))
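As a cross-check on the simd128 fallback, the lowering can be modeled lane-by-lane in scalar C (function names hypothetical); each lane goes through the two roundings described under the semantics question:

```c
#include <stddef.h>

/* Scalar model of f32x4.add(f32x4.mul(a, b), c):
   the mul rounds each lane, then the add rounds again. */
static void f32x4_relaxed_fma_ref(const float a[4], const float b[4],
                                  const float c[4], float out[4]) {
    for (size_t i = 0; i < 4; i++)
        out[i] = (a[i] * b[i]) + c[i];
}

/* fms negates the product: -(a * b) + c, i.e. f32x4.sub(c, f32x4.mul(a, b)). */
static void f32x4_relaxed_fms_ref(const float a[4], const float b[4],
                                  const float c[4], float out[4]) {
    for (size_t i = 0; i < 4; i++)
        out[i] = c[i] - (a[i] * b[i]);
}
```

The f64x2 variants are the same loop over 2 double lanes.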
- How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
The difference depends on whether the hardware supports FMA, and the dividing line falls between newer and older hardware: newer parts (Intel Haswell from 2013 onwards, AMD Zen from 2017, Cortex-A5 since 2011) tend to come with hardware FMA support, so we will probably see less and less hardware without FMA.
- What use cases are there?
Many, especially machine learning (neural nets). Fused multiply-add improves accuracy in numerical algorithms, improves floating-point throughput, and reduces register pressure in some cases. An early prototype and evaluation also showed significant speedup on multiple neural-network models.