We currently generate twiddles once and cache them. This helps performance at small sizes but may hurt at large sizes: there, multi-threading gives us plenty of compute, but we end up bottlenecked by memory bandwidth.
It is worth exploring whether generating twiddles on the fly instead of loading cached data helps alleviate memory bandwidth bottlenecks. A sketch of what on-the-fly generation could look like can be found here.
It would also reduce memory usage, so we might make it available as a separate low-memory mode even if it doesn't improve performance.
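
To make the trade-off concrete, here is a minimal Python sketch of the two strategies. It is not the linked sketch and not the library's implementation. The function names (`cached_twiddles`, `twiddles_on_the_fly`), the forward-transform sign convention, and the `block` size of 64 are all assumptions chosen for illustration. The on-the-fly variant evaluates one exact twiddle per block and derives the rest by repeated multiplication, re-seeding at each block boundary to limit rounding drift.

```python
import cmath
import math


def cached_twiddles(n):
    """Precompute the n/2 twiddles w_k = exp(-2*pi*i*k/n) and keep them in memory."""
    return [cmath.exp(-2j * math.pi * k / n) for k in range(n // 2)]


def twiddles_on_the_fly(n, block=64):
    """Yield the same n/2 twiddles without storing a table.

    Hypothetical scheme: one exact exp() per block of `block` twiddles,
    then a recurrence (multiply by a fixed step) inside the block.
    """
    half = n // 2
    step = cmath.exp(-2j * math.pi / n)  # w_{k+1} = w_k * step
    for start in range(0, half, block):
        w = cmath.exp(-2j * math.pi * start / n)  # exact re-seed per block
        for _ in range(start, min(start + block, half)):
            yield w
            w *= step


if __name__ == "__main__":
    n = 1024
    table = cached_twiddles(n)
    worst = max(abs(a - b) for a, b in zip(table, twiddles_on_the_fly(n)))
    print(f"max deviation from cached table: {worst:.3e}")
```

The cached version costs one memory load per twiddle; the on-the-fly version costs one complex multiply per twiddle plus an occasional `exp`, and needs no table at all. Whether that trade wins at large sizes is exactly what would need to be benchmarked.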