PoC: Reduce large struct allocations to <= 13/17/21 KiB for ML-DSA-44/65/87 #1005
mkannwischer wants to merge 22 commits into main
CBMC Results (ML-DSA-87): Full Results (179 proofs)
CBMC Results (ML-DSA-44): Full Results (179 proofs)
Introduce mld_s1vec, following the same pattern as mld_polymat for reduced RAM usage. In normal mode, it stores the full NTT'd polyvecl. In REDUCE_RAM mode, it stores a pointer to the packed s1 data in the secret key and unpacks + NTTs individual polynomials on demand.

This reduces signing memory in REDUCE_RAM mode:
- ML-DSA-44: 32,448 -> 28,384 (-4,064 bytes)
- ML-DSA-65: 44,768 -> 39,680 (-5,088 bytes)
- ML-DSA-87: 59,104 -> 51,968 (-7,136 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
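The on-demand pattern described in this commit message can be illustrated with a minimal sketch. All names here (`toy_lazy_vec`, `toy_lazy_vec_get`, the `TOY_*` constants) are hypothetical and not taken from the PR; the sketch assumes a 4-bit-per-coefficient packing with eta = 4 and leaves out the NTT step entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N 256                     /* coefficients per polynomial */
#define TOY_L 4                       /* polynomials in the vector (assumed) */
#define TOY_ETA 4                     /* coefficient bound (assumed) */
#define TOY_PACKED_BYTES (TOY_N / 2)  /* 4 bits per coefficient */

typedef struct { int32_t coeffs[TOY_N]; } toy_poly;

/* Reduced-RAM handle: only a pointer into the packed secret key. */
typedef struct { const uint8_t *packed; } toy_lazy_vec;

static void toy_lazy_vec_init(toy_lazy_vec *v, const uint8_t *packed)
{
  v->packed = packed;
}

/* Unpack polynomial i into caller-provided scratch space.
 * A real implementation would also apply the NTT here. */
static void toy_lazy_vec_get(toy_poly *out, const toy_lazy_vec *v, size_t i)
{
  const uint8_t *a;
  size_t j;
  assert(i < TOY_L);
  a = v->packed + i * TOY_PACKED_BYTES;
  for (j = 0; j < TOY_N / 2; j++)
  {
    out->coeffs[2 * j + 0] = TOY_ETA - (int32_t)(a[j] & 0x0F);
    out->coeffs[2 * j + 1] = TOY_ETA - (int32_t)(a[j] >> 4);
  }
}
```

The trade-off is recomputation: each access pays for an unpack (and, in the real code, an NTT) in exchange for holding one scratch polynomial instead of the whole vector.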
Same pattern as mld_s1vec: in normal mode stores the full NTT'd polyveck, in REDUCE_RAM mode stores a pointer and unpacks + NTTs on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 28,384 -> 24,320 (-4,064 bytes)
- ML-DSA-65: 39,680 -> 33,568 (-6,112 bytes)
- ML-DSA-87: 51,968 -> 43,808 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Same pattern as mld_s1vec and mld_s2vec: in normal mode stores the full NTT'd polyveck, in REDUCE_RAM mode stores a pointer and unpacks + NTTs on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 24,320 -> 20,256 (-4,064 bytes)
- ML-DSA-65: 33,568 -> 27,456 (-6,112 bytes)
- ML-DSA-87: 43,808 -> 35,648 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Instead of allocating a full polyveck for h in attempt_signature_generation, compute cs2, ct0, and hints one polynomial at a time using scratch polys. This eliminates the polyveck h from the yh_u union, replacing mld_pack_sig_c_h with incremental packing via mld_pack_sig_c, mld_pack_sig_h_init, and mld_pack_sig_h_poly.

Sign allocation savings (normal / REDUCE_RAM):
- ML-DSA-44: -4096 / 0 bytes
- ML-DSA-65: -6144 / -1024 bytes
- ML-DSA-87: -8192 / -1024 bytes

Note: CBMC proofs are not updated yet.

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
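To make the incremental hint packing concrete, here is a minimal sketch of an init/append pair. The names `toy_hint_packer`, `toy_pack_h_init`, and `toy_pack_h_poly` are hypothetical, and the signatures of the PR's mld_pack_sig_h_init and mld_pack_sig_h_poly may well differ. The layout assumed is the usual ML-DSA hint encoding: OMEGA index bytes followed by K running-count bytes, shown here with the ML-DSA-44 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_N 256
#define TOY_K 4       /* rows in the hint vector (assumed) */
#define TOY_OMEGA 80  /* maximum number of set hint bits (assumed) */

/* Packing state: output buffer plus number of indices written so far. */
typedef struct { uint8_t *buf; size_t count; } toy_hint_packer;

static void toy_pack_h_init(toy_hint_packer *p, uint8_t buf[TOY_OMEGA + TOY_K])
{
  memset(buf, 0, TOY_OMEGA + TOY_K);
  p->buf = buf;
  p->count = 0;
}

/* Append the hints of one polynomial (row); rows must arrive in order.
 * Returns 0 on success, -1 if more than OMEGA hints were seen in total. */
static int toy_pack_h_poly(toy_hint_packer *p, const int32_t h[TOY_N],
                           size_t row)
{
  size_t j;
  assert(row < TOY_K);
  for (j = 0; j < TOY_N; j++)
  {
    if (h[j] != 0)
    {
      if (p->count == TOY_OMEGA) return -1;
      p->buf[p->count++] = (uint8_t)j;
    }
  }
  /* Running total of indices after this row. */
  p->buf[TOY_OMEGA + row] = (uint8_t)p->count;
  return 0;
}
```

Because each row is consumed as soon as it is computed, only one hint polynomial has to exist at a time, which is the source of the polyveck saving the commit describes.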
In REDUCE_RAM mode, shrink mld_polymat from rho + row_buffer (L polys) to rho + poly_buffer (1 poly). Replace mld_polymat_get_row with mld_polymat_get_element that samples a single A[k][l] on demand. Rewrite mld_polyvec_matrix_pointwise_montgomery in REDUCE_RAM mode to use per-element access, accumulating A[k][l] * v[l] one element at a time. Normal mode is unchanged (full matrix, row-based access).

REDUCE_RAM allocation savings:
- ML-DSA-44: keypair -3072, sign -3072, verify -3072, pk_from_sk -3072
- ML-DSA-65: keypair -4096, sign -4096, verify -4096, pk_from_sk -4096
- ML-DSA-87: keypair -6144, sign -6144, verify -6144, pk_from_sk -6144

Note: CBMC proofs are not updated yet.

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
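The per-element access pattern might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `toy_*` names are hypothetical, a linear congruential generator stands in for the real seed expansion of A[k][l], and a plain `%` reduction replaces Montgomery multiplication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N 256
#define TOY_K 4
#define TOY_L 4
#define TOY_Q 8380417

typedef struct { int32_t coeffs[TOY_N]; } toy_poly;

/* Reduced-RAM matrix handle: a seed plus a buffer for ONE element. */
typedef struct { uint32_t seed; toy_poly buf; } toy_mat;

/* Regenerate A[k][l] into the single buffer (toy PRNG, not SHAKE). */
static const toy_poly *toy_mat_get_element(toy_mat *m, size_t k, size_t l)
{
  uint32_t x;
  size_t j;
  assert(k < TOY_K && l < TOY_L);
  x = m->seed ^ (uint32_t)((k << 8) | l);
  for (j = 0; j < TOY_N; j++)
  {
    x = x * 1664525u + 1013904223u;
    m->buf.coeffs[j] = (int32_t)(x % TOY_Q);
  }
  return &m->buf;
}

/* w[k] = sum over l of A[k][l] * v[l], coefficient-wise modulo Q.
 * Expects coefficients of v in [0, Q). */
static void toy_matvec(toy_poly w[TOY_K], toy_mat *m, const toy_poly v[TOY_L])
{
  size_t k, l, j;
  for (k = 0; k < TOY_K; k++)
  {
    for (j = 0; j < TOY_N; j++) w[k].coeffs[j] = 0;
    for (l = 0; l < TOY_L; l++)
    {
      const toy_poly *a = toy_mat_get_element(m, k, l);
      for (j = 0; j < TOY_N; j++)
      {
        int64_t acc = (int64_t)a->coeffs[j] * v[l].coeffs[j] + w[k].coeffs[j];
        w[k].coeffs[j] = (int32_t)(acc % TOY_Q);
      }
    }
  }
}
```

With this shape the matrix handle costs one polynomial regardless of L, which matches the (L - 1) * 1024-byte savings listed above for 1024-byte polynomials.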
Introduce mld_yvec type following the same pattern as mld_s1vec/s2vec/t0vec: in normal mode it holds the full polyvecl, in REDUCE_RAM mode it stores only the seed (rhoprime) and nonce for on-demand regeneration.

Add mld_polyvec_matrix_pointwise_montgomery_yvec which computes w = invNTT(A * NTT(y)). In REDUCE_RAM mode it fuses y sampling with column-by-column matrix multiplication, avoiding storage of y entirely. In normal mode it delegates to the existing bulk path.

Also enable mld_poly_uniform_gamma1 for REDUCE_RAM builds so the per-poly y regeneration works for all parameter sets.

REDUCE_RAM sign allocation savings:
- ML-DSA-44: 17184 -> 13120 (-4064 bytes)
- ML-DSA-65: 22336 -> 17248 (-5088 bytes)
- ML-DSA-87: 28480 -> 21344 (-7136 bytes)

Normal mode is unchanged.

Note: CBMC proofs are not updated yet.

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
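A rough sketch of the fused, column-by-column variant follows. The `toy_*` names, the toy PRNG, and the nonce-to-index mapping are all assumptions for illustration, the polynomial length is shrunk to 8, and the NTT and inverse NTT steps are omitted. The point is only the loop order: each y[l] is regenerated once per column and consumed by every row before being discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N 8  /* shortened for readability; ML-DSA uses 256 */
#define TOY_K 2
#define TOY_L 3
#define TOY_Q 8380417

typedef struct { int32_t c[TOY_N]; } toy_poly;

/* Reduced-RAM handle for y: just a seed and a nonce. */
typedef struct { uint32_t seed; uint16_t nonce; } toy_yvec;

/* Toy deterministic expansion standing in for the real XOF sampling. */
static void toy_expand(toy_poly *p, uint32_t seed, uint32_t idx)
{
  uint32_t x = seed ^ (idx * 2654435761u);
  size_t j;
  for (j = 0; j < TOY_N; j++)
  {
    x = x * 1664525u + 1013904223u;
    p->c[j] = (int32_t)(x % TOY_Q);
  }
}

static void toy_sample_y(toy_poly *y, const toy_yvec *h, size_t l)
{
  assert(l < TOY_L);
  toy_expand(y, h->seed, (uint32_t)(TOY_L * h->nonce + l));
}

static void toy_sample_a(toy_poly *a, uint32_t rho, size_t k, size_t l)
{
  toy_expand(a, rho, 0x10000u + (uint32_t)((k << 8) | l));
}

/* w = A * y, where neither A nor y is ever stored in full. */
static void toy_matvec_fused(toy_poly w[TOY_K], uint32_t rho,
                             const toy_yvec *h)
{
  toy_poly y, a;
  size_t k, l, j;
  for (k = 0; k < TOY_K; k++)
    for (j = 0; j < TOY_N; j++) w[k].c[j] = 0;
  for (l = 0; l < TOY_L; l++)
  {
    toy_sample_y(&y, h, l); /* one column of y at a time */
    for (k = 0; k < TOY_K; k++)
    {
      toy_sample_a(&a, rho, k, l);
      for (j = 0; j < TOY_N; j++)
        w[k].c[j] = (int32_t)(((int64_t)a.c[j] * y.c[j] + w[k].c[j]) % TOY_Q);
    }
  }
}
```

Iterating over columns in the outer loop means y[l] is sampled L times in total rather than K * L times, at the cost of keeping all K accumulators of w live.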
Replace mld_compute_t0_t1_tr_from_sk_components with per-row mld_compute_t0k_t1k. Both keygen and pk_from_sk now process one row at a time, packing t1[k] into pk and t0[k] into sk immediately. This eliminates full polyveck allocations for t0, t1, and the matrix from both code paths.

Note: CBMC proofs are not updated yet.

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
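The row-at-a-time idea can be sketched as below. This is not the PR's mld_compute_t0k_t1k: the `toy_*` names are hypothetical, `toy_demo_row` is a stand-in for computing row k of t = A*s1 + s2, and the outputs use simple `uint16_t`/`int16_t` arrays rather than the real 10-bit and 13-bit packed encodings. The split itself follows the standard Power2Round with d = 13.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N 256
#define TOY_K 4
#define TOY_D 13
#define TOY_Q 8380417

/* Split a in [0, Q) into a1 * 2^D + a0 with -2^(D-1) < a0 <= 2^(D-1). */
static int32_t toy_power2round(int32_t *a0, int32_t a)
{
  int32_t a1;
  assert(a >= 0 && a < TOY_Q);
  a1 = (a + (1 << (TOY_D - 1)) - 1) >> TOY_D;
  *a0 = a - (a1 << TOY_D);
  return a1;
}

/* Stand-in for "compute row k of t"; purely illustrative values. */
static void toy_demo_row(int32_t t[TOY_N], size_t k)
{
  size_t j;
  for (j = 0; j < TOY_N; j++)
    t[j] = (int32_t)((k * 4097u + j) % TOY_Q);
}

/* Process one row at a time: a single scratch row is live, and each
 * row's t1/t0 parts are written out before the next row is computed. */
static void toy_compute_rows(uint16_t t1_out[TOY_K][TOY_N],
                             int16_t t0_out[TOY_K][TOY_N])
{
  int32_t t[TOY_N]; /* the only full-width polynomial in flight */
  size_t k, j;
  for (k = 0; k < TOY_K; k++)
  {
    toy_demo_row(t, k);
    for (j = 0; j < TOY_N; j++)
    {
      int32_t a0;
      t1_out[k][j] = (uint16_t)toy_power2round(&a0, t[j]);
      t0_out[k][j] = (int16_t)a0;
    }
  }
}
```

In the real code the outputs would go straight into the packed public and secret key buffers, so the unpacked t0 and t1 vectors never need to exist as a whole.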
To silence linting errors. Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
CBMC Results (ML-DSA-65): Full Results (179 proofs)
c06caac to b56df8e
oqs-bot left a comment
Intel Xeon 4th gen (c7i)
| Benchmark suite | Current: b56df8e | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34222 cycles | 34508 cycles | 0.99 |
| ML-DSA-44 sign | 120003 cycles | 119762 cycles | 1.00 |
| ML-DSA-44 verify | 38274 cycles | 38106 cycles | 1.00 |
| ML-DSA-65 keypair | 58946 cycles | 61327 cycles | 0.96 |
| ML-DSA-65 sign | 198396 cycles | 202109 cycles | 0.98 |
| ML-DSA-65 verify | 63064 cycles | 62771 cycles | 1.00 |
| ML-DSA-87 keypair | 92076 cycles | 94593 cycles | 0.97 |
| ML-DSA-87 sign | 242172 cycles | 240827 cycles | 1.01 |
| ML-DSA-87 verify | 96085 cycles | 96019 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot left a comment
Intel Xeon 4th gen (c7i) (no-opt)
| Benchmark suite | Current: b56df8e | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 95415 cycles | 93753 cycles | 1.02 |
| ML-DSA-44 sign | 331798 cycles | 333304 cycles | 1.00 |
| ML-DSA-44 verify | 99519 cycles | 99738 cycles | 1.00 |
| ML-DSA-65 keypair | 161336 cycles | 159678 cycles | 1.01 |
| ML-DSA-65 sign | 539375 cycles | 544024 cycles | 0.99 |
| ML-DSA-65 verify | 162895 cycles | 160787 cycles | 1.01 |
| ML-DSA-87 keypair | 268110 cycles | 267177 cycles | 1.00 |
| ML-DSA-87 sign | 705391 cycles | 705890 cycles | 1.00 |
| ML-DSA-87 verify | 268591 cycles | 270246 cycles | 0.99 |
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 112362 cycles | 113139 cycles | 0.99 |
| ML-DSA-44 sign | 356672 cycles | 355421 cycles | 1.00 |
| ML-DSA-44 verify | 117719 cycles | 117817 cycles | 1.00 |
| ML-DSA-65 keypair | 194918 cycles | 196421 cycles | 0.99 |
| ML-DSA-65 sign | 586360 cycles | 588818 cycles | 1.00 |
| ML-DSA-65 verify | 194819 cycles | 194511 cycles | 1.00 |
| ML-DSA-87 keypair | 319997 cycles | 322254 cycles | 0.99 |
| ML-DSA-87 sign | 751619 cycles | 752975 cycles | 1.00 |
| ML-DSA-87 verify | 319902 cycles | 320113 cycles | 1.00 |
oqs-bot left a comment
AMD EPYC 3rd gen (c6a)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 68388 cycles | 68974 cycles | 0.99 |
| ML-DSA-44 sign | 189267 cycles | 187318 cycles | 1.01 |
| ML-DSA-44 verify | 69139 cycles | 69050 cycles | 1.00 |
| ML-DSA-65 keypair | 118689 cycles | 119428 cycles | 0.99 |
| ML-DSA-65 sign | 301252 cycles | 300617 cycles | 1.00 |
| ML-DSA-65 verify | 115747 cycles | 115643 cycles | 1.00 |
| ML-DSA-87 keypair | 201697 cycles | 203571 cycles | 0.99 |
| ML-DSA-87 sign | 393413 cycles | 394649 cycles | 1.00 |
| ML-DSA-87 verify | 194483 cycles | 195659 cycles | 0.99 |
oqs-bot left a comment
Intel Xeon 3rd gen (c6i)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 55881 cycles | 56817 cycles | 0.98 |
| ML-DSA-44 sign | 180718 cycles | 182410 cycles | 0.99 |
| ML-DSA-44 verify | 61359 cycles | 61615 cycles | 1.00 |
| ML-DSA-65 keypair | 97413 cycles | 98729 cycles | 0.99 |
| ML-DSA-65 sign | 296886 cycles | 298290 cycles | 1.00 |
| ML-DSA-65 verify | 101441 cycles | 100286 cycles | 1.01 |
| ML-DSA-87 keypair | 150700 cycles | 152586 cycles | 0.99 |
| ML-DSA-87 sign | 354653 cycles | 355720 cycles | 1.00 |
| ML-DSA-87 verify | 154075 cycles | 153499 cycles | 1.00 |
oqs-bot left a comment
AMD EPYC 3rd gen (c6a) (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 134214 cycles | 134983 cycles | 0.99 |
| ML-DSA-44 sign | 524720 cycles | 524482 cycles | 1.00 |
| ML-DSA-44 verify | 147384 cycles | 147385 cycles | 1.00 |
| ML-DSA-65 keypair | 226870 cycles | 228309 cycles | 0.99 |
| ML-DSA-65 sign | 854441 cycles | 864340 cycles | 0.99 |
| ML-DSA-65 verify | 236415 cycles | 236413 cycles | 1.00 |
| ML-DSA-87 keypair | 368665 cycles | 370688 cycles | 0.99 |
| ML-DSA-87 sign | 1068488 cycles | 1079564 cycles | 0.99 |
| ML-DSA-87 verify | 382091 cycles | 383220 cycles | 1.00 |
oqs-bot left a comment
AMD EPYC 4th gen (c7a)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 39871 cycles | 42279 cycles | 0.94 |
| ML-DSA-44 sign | 136504 cycles | 132300 cycles | 1.03 |
| ML-DSA-44 verify | 44253 cycles | 43971 cycles | 1.01 |
| ML-DSA-65 keypair | 71924 cycles | 76769 cycles | 0.94 |
| ML-DSA-65 sign | 213770 cycles | 217452 cycles | 0.98 |
| ML-DSA-65 verify | 72509 cycles | 73895 cycles | 0.98 |
| ML-DSA-87 keypair | 108439 cycles | 108025 cycles | 1.00 |
| ML-DSA-87 sign | 251417 cycles | 252354 cycles | 1.00 |
| ML-DSA-87 verify | 109165 cycles | 109188 cycles | 1.00 |
oqs-bot left a comment
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 sign | 136504 cycles | 132300 cycles | 1.03 |
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 211753 cycles | 212555 cycles | 1.00 |
| ML-DSA-44 sign | 758883 cycles | 759099 cycles | 1.00 |
| ML-DSA-44 verify | 229118 cycles | 228906 cycles | 1.00 |
| ML-DSA-65 keypair | 377189 cycles | 380502 cycles | 0.99 |
| ML-DSA-65 sign | 1247155 cycles | 1251648 cycles | 1.00 |
| ML-DSA-65 verify | 371410 cycles | 372262 cycles | 1.00 |
| ML-DSA-87 keypair | 603210 cycles | 604945 cycles | 1.00 |
| ML-DSA-87 sign | 1585138 cycles | 1590686 cycles | 1.00 |
| ML-DSA-87 verify | 618819 cycles | 616948 cycles | 1.00 |
oqs-bot left a comment
Intel Xeon 3rd gen (c6i) (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 156948 cycles | 157614 cycles | 1.00 |
| ML-DSA-44 sign | 548292 cycles | 551534 cycles | 0.99 |
| ML-DSA-44 verify | 169377 cycles | 169123 cycles | 1.00 |
| ML-DSA-65 keypair | 266042 cycles | 267907 cycles | 0.99 |
| ML-DSA-65 sign | 891894 cycles | 904333 cycles | 0.99 |
| ML-DSA-65 verify | 274396 cycles | 275011 cycles | 1.00 |
| ML-DSA-87 keypair | 447024 cycles | 448619 cycles | 1.00 |
| ML-DSA-87 sign | 1153646 cycles | 1157905 cycles | 1.00 |
| ML-DSA-87 verify | 459676 cycles | 458683 cycles | 1.00 |
oqs-bot left a comment
Graviton4
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 67619 cycles | 68090 cycles | 0.99 |
| ML-DSA-44 sign | 202698 cycles | 202380 cycles | 1.00 |
| ML-DSA-44 verify | 70891 cycles | 70623 cycles | 1.00 |
| ML-DSA-65 keypair | 119598 cycles | 121010 cycles | 0.99 |
| ML-DSA-65 sign | 330515 cycles | 332267 cycles | 0.99 |
| ML-DSA-65 verify | 117848 cycles | 117974 cycles | 1.00 |
| ML-DSA-87 keypair | 196903 cycles | 198259 cycles | 0.99 |
| ML-DSA-87 sign | 427461 cycles | 428218 cycles | 1.00 |
| ML-DSA-87 verify | 194811 cycles | 194635 cycles | 1.00 |
oqs-bot left a comment
Graviton3
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 71568 cycles | 72253 cycles | 0.99 |
| ML-DSA-44 sign | 212744 cycles | 212376 cycles | 1.00 |
| ML-DSA-44 verify | 75553 cycles | 75747 cycles | 1.00 |
| ML-DSA-65 keypair | 126328 cycles | 127630 cycles | 0.99 |
| ML-DSA-65 sign | 349346 cycles | 350882 cycles | 1.00 |
| ML-DSA-65 verify | 125556 cycles | 125712 cycles | 1.00 |
| ML-DSA-87 keypair | 205745 cycles | 208495 cycles | 0.99 |
| ML-DSA-87 sign | 444140 cycles | 450030 cycles | 0.99 |
| ML-DSA-87 verify | 205734 cycles | 205745 cycles | 1.00 |
oqs-bot left a comment
AMD EPYC 4th gen (c7a) (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120051 cycles | 120340 cycles | 1.00 |
| ML-DSA-44 sign | 444378 cycles | 447581 cycles | 0.99 |
| ML-DSA-44 verify | 130075 cycles | 130373 cycles | 1.00 |
| ML-DSA-65 keypair | 203529 cycles | 204354 cycles | 1.00 |
| ML-DSA-65 sign | 719394 cycles | 728319 cycles | 0.99 |
| ML-DSA-65 verify | 209932 cycles | 209199 cycles | 1.00 |
| ML-DSA-87 keypair | 338921 cycles | 338993 cycles | 1.00 |
| ML-DSA-87 sign | 918581 cycles | 921541 cycles | 1.00 |
| ML-DSA-87 verify | 346483 cycles | 348601 cycles | 0.99 |
oqs-bot left a comment
Graviton4 (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128669 cycles | 128240 cycles | 1.00 |
| ML-DSA-44 sign | 445672 cycles | 447597 cycles | 1.00 |
| ML-DSA-44 verify | 136986 cycles | 144662 cycles | 0.95 |
| ML-DSA-65 keypair | 219848 cycles | 220500 cycles | 1.00 |
| ML-DSA-65 sign | 720286 cycles | 727093 cycles | 0.99 |
| ML-DSA-65 verify | 221049 cycles | 223077 cycles | 0.99 |
| ML-DSA-87 keypair | 365316 cycles | 365045 cycles | 1.00 |
| ML-DSA-87 sign | 919622 cycles | 925847 cycles | 0.99 |
| ML-DSA-87 verify | 370439 cycles | 372789 cycles | 0.99 |
oqs-bot left a comment
Graviton3 (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 137571 cycles | 138463 cycles | 0.99 |
| ML-DSA-44 sign | 482669 cycles | 483929 cycles | 1.00 |
| ML-DSA-44 verify | 148479 cycles | 162291 cycles | 0.91 |
| ML-DSA-65 keypair | 240785 cycles | 241435 cycles | 1.00 |
| ML-DSA-65 sign | 784950 cycles | 792312 cycles | 0.99 |
| ML-DSA-65 verify | 240892 cycles | 241250 cycles | 1.00 |
| ML-DSA-87 keypair | 394576 cycles | 396566 cycles | 0.99 |
| ML-DSA-87 sign | 1006235 cycles | 1012538 cycles | 0.99 |
| ML-DSA-87 verify | 403026 cycles | 402623 cycles | 1.00 |
oqs-bot left a comment
Graviton2
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 112662 cycles | 113410 cycles | 0.99 |
| ML-DSA-44 sign | 356702 cycles | 355818 cycles | 1.00 |
| ML-DSA-44 verify | 118075 cycles | 118279 cycles | 1.00 |
| ML-DSA-65 keypair | 195068 cycles | 196486 cycles | 0.99 |
| ML-DSA-65 sign | 587010 cycles | 588672 cycles | 1.00 |
| ML-DSA-65 verify | 195142 cycles | 194830 cycles | 1.00 |
| ML-DSA-87 keypair | 321107 cycles | 323043 cycles | 0.99 |
| ML-DSA-87 sign | 752936 cycles | 753644 cycles | 1.00 |
| ML-DSA-87 verify | 319982 cycles | 320341 cycles | 1.00 |
oqs-bot left a comment
Graviton2 (no-opt)
| Benchmark suite | Current: b56df8e | Previous: db65535 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 213721 cycles | 213406 cycles | 1.00 |
| ML-DSA-44 sign | 759277 cycles | 762744 cycles | 1.00 |
| ML-DSA-44 verify | 229673 cycles | 235007 cycles | 0.98 |
| ML-DSA-65 keypair | 378733 cycles | 380391 cycles | 1.00 |
| ML-DSA-65 sign | 1246651 cycles | 1253555 cycles | 0.99 |
| ML-DSA-65 verify | 372918 cycles | 371798 cycles | 1.00 |
| ML-DSA-87 keypair | 603099 cycles | 604988 cycles | 1.00 |
| ML-DSA-87 sign | 1583611 cycles | 1596422 cycles | 0.99 |
| ML-DSA-87 verify | 618046 cycles | 619153 cycles | 1.00 |
… mode Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
This is a proof of concept of how we could get the stack consumption of {keygen,sign,verify} to <= 13/17/21 KiB of allocations via MLD_ALLOC in REDUCE_RAM mode. Additionally, as before, we need a little bit of stack memory. I have measured 2.5 KiB on my machine, so the overall memory consumption should be comfortably below 16/20/24 KiB.

Warning: This is a quick-and-dirty PoC and I do not recommend relying on it. The CBMC proofs are work in progress. This PR will definitely not be merged in one piece, but instead in smaller steps.
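As a quick sanity check of the budget arithmetic, the sketch below adds the REDUCE_RAM signing allocations quoted in the mld_yvec commit message (13120 / 17248 / 21344 bytes) to the measured 2.5 KiB of stack and compares the sum against 16 / 20 / 24 KiB. The helper `toy_fits_budget` is hypothetical, and the allocation figures are taken from that one commit message, so the final numbers for the whole PR may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_KIB 1024u

/* Returns 1 if allocations plus stack stay strictly below the budget. */
static int toy_fits_budget(size_t alloc_bytes, size_t stack_bytes,
                           size_t budget_kib)
{
  assert(budget_kib <= SIZE_MAX / TOY_KIB); /* guard against overflow */
  return alloc_bytes + stack_bytes < budget_kib * TOY_KIB;
}
```

For ML-DSA-44 this gives 13120 + 2560 = 15680 bytes against a 16384-byte budget; the other two parameter sets leave a similar margin of roughly 700 bytes.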