Replace SSE intrinsics with Armadillo in SDPR for ARM compilation #458

Merged
gaow merged 3 commits into main from fix-sdpr-arm-compilation on Apr 7, 2026

gaow commented Apr 7, 2026

Summary

  • SDPR's sample_assignment() used x86 SSE intrinsics (_mm_max_ps, _mm_hadd_ps) together with log_ps and exp_ps from sse_mathfun.h to compute log-sum-exp over M=1000 cluster probabilities; this prevented compilation on ARM/Apple Silicon
  • Replace with Armadillo vectorized ops (arma::log, arma::exp, arma::max, arma::accu), which are portable C++ and leave SIMD code generation (NEON on ARM, SSE/AVX on x86) to compiler auto-vectorization
  • Remove src/sse_mathfun.h (719 lines) and src/simde/ directory (~789K lines of x86→ARM translation headers that weren't working anyway)
  • Remove SIMDE_ENABLE_NATIVE_ALIASES flag from Makevars.in
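
To make the role of log-sum-exp concrete, here is a minimal sketch of how a cluster index can be drawn from unnormalized log-probabilities. It uses only the standard library so it compiles anywhere; the Armadillo one-liner appears in a comment. The names `log_sum_exp` and `sample_cluster` are hypothetical and are not taken from sdpr_mcmc.cpp, and the inverse-CDF draw is an assumption about how the normalized probabilities are used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Max-shifted log-sum-exp: log(sum_k exp(lp[k])) without overflow or
// total underflow. Requires a non-empty input.
// With an arma::vec lp, the equivalent would be roughly:
//   double m = arma::max(lp);
//   return m + std::log(arma::accu(arma::exp(lp - m)));
double log_sum_exp(const std::vector<double>& lp) {
    const double m = *std::max_element(lp.begin(), lp.end());
    double s = 0.0;
    for (double v : lp) s += std::exp(v - m);  // plain loop over contiguous storage
    return m + std::log(s);
}

// Hypothetical helper: draw index k with probability proportional to exp(lp[k]).
std::size_t sample_cluster(const std::vector<double>& lp, std::mt19937& rng) {
    const double lse = log_sum_exp(lp);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const double u = unif(rng);
    double cdf = 0.0;
    for (std::size_t k = 0; k < lp.size(); ++k) {
        cdf += std::exp(lp[k] - lse);  // normalized probability of cluster k
        if (u < cdf) return k;
    }
    return lp.size() - 1;  // guard: rounding can leave cdf slightly below 1
}
```

Subtracting the maximum before exponentiating is what keeps the largest term at exp(0) = 1, so the sum cannot vanish even when every log-probability is very negative.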

Performance note

The original SSE code processed 4 floats at a time for exp/log. Armadillo's element-wise arma::exp/arma::log operate on contiguous arma::vec storage (doubles) as plain loops that the compiler can auto-vectorize for the target architecture, so throughput is expected to be similar without architecture-specific intrinsics.

Test plan

  • C++ compiles on ARM/Apple Silicon (verified via Rcpp::sourceCpp)
  • Armadillo log-sum-exp matches scalar reference to machine epsilon (~8.9e-16)
  • Full devtools::test(filter="regularized_regression") with SDPR tests
  • R CMD check passes
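
The "matches scalar reference" item can be illustrated with a small standalone check. This is a sketch, not the test that was actually run: the naive reference, the function names, and the tolerance helper are all assumptions. For scale, the reported ~8.9e-16 is about four times double-precision machine epsilon (2.2e-16).

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Naive scalar reference: fine for moderate inputs, but returns
// log(0) = -inf once every exp() term underflows to zero.
double lse_naive(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += std::exp(v);
    return std::log(s);
}

// Max-shifted version, the form the Armadillo ops express.
// Requires a non-empty input.
double lse_stable(const std::vector<double>& x) {
    const double m = *std::max_element(x.begin(), x.end());
    double s = 0.0;
    for (double v : x) s += std::exp(v - m);
    return m + std::log(s);
}

// Hypothetical tolerance check: agreement within a small multiple of
// machine epsilon, scaled by the magnitude of the values compared.
bool close_to_eps(double a, double b, double multiple = 8.0) {
    const double eps = std::numeric_limits<double>::epsilon();
    const double scale = std::max(1.0, std::max(std::fabs(a), std::fabs(b)));
    return std::fabs(a - b) <= multiple * eps * scale;
}
```

On moderate inputs the two functions should agree to a few epsilon; on inputs such as {-1000, -1000} only the shifted version stays finite, which is why the reference comparison is meaningful only where the naive sum does not underflow.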

🤖 Generated with Claude Code

gaow and others added 3 commits April 7, 2026 07:22
The original SDPR (Zhou et al.) used x86 SSE intrinsics (_mm_max_ps,
_mm_hadd_ps) together with log_ps and exp_ps from sse_mathfun.h to
compute log-sum-exp over M cluster probabilities in
sample_assignment(). This prevented compilation on ARM/Apple Silicon.

Replace with Armadillo vectorized operations (arma::log, arma::exp,
arma::max, arma::accu), which are portable C++ and leave SIMD code
generation (NEON on ARM, SSE/AVX on x86) to compiler
auto-vectorization, giving portable performance without
architecture-specific intrinsics.

- Rewrite sample_assignment() using arma::vec for cluster probabilities
- Remove src/sse_mathfun.h (719 lines of x86 SSE math functions)
- Remove src/simde/ directory (~789K lines of x86→ARM translation headers)
- Remove SIMDE_ENABLE_NATIVE_ALIASES flag from Makevars.in

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rewrite sdpr_mcmc.cpp as a faithful line-by-line translation of the
original SDPR (Zhou et al., github.com/eldronzhou/SDPR) from GSL to
Armadillo, with clear comments referencing each original function.

Key fixes vs the previous port:
- Fix cls_assgn initialization: use all-zero (null cluster) instead of
  random assignment across M clusters, which caused "Mat::init() too
  large" crashes when sample_beta() tried to allocate enormous dense
  matrices for nearly-all-causal SNP lists on iteration 1
- Restore missing N* factor in sample_beta() A_vec computation
  (currently N=1.0 so no numerical difference, but correct for future
  use)
- Clean up the Armadillo vectorized log/exp in sample_assignment()
- Add signal recovery unit tests with realistic LD (binomial genotypes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
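
The cls_assgn fix can be illustrated with a back-of-the-envelope sketch. With M=1000 clusters and a uniform random start, about 99.9% of SNPs land in a non-null cluster, so anything sized by the number of non-null SNPs becomes very large on iteration 1. The helper names below are hypothetical, and treating the allocation as a dense k x k matrix of doubles over the k non-null SNPs is an assumption made for illustration, not a description of what sample_beta() allocates.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Cluster 0 is the null (zero-effect) cluster; any other label marks
// the SNP as causal.
std::size_t count_non_null(const std::vector<int>& cls_assgn) {
    std::size_t n = 0;
    for (int c : cls_assgn) n += (c != 0);
    return n;
}

// Hypothetical: uniform random start over M clusters (the pattern the
// commit message describes as the cause of the crash).
std::vector<int> init_random(std::size_t n_snp, int M, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, M - 1);
    std::vector<int> a(n_snp);
    for (int& c : a) c = pick(rng);
    return a;
}

// All-zero start: every SNP begins in the null cluster.
std::vector<int> init_null(std::size_t n_snp) {
    return std::vector<int>(n_snp, 0);
}

// Bytes needed for a dense k x k matrix of doubles (assumed layout).
double dense_bytes(std::size_t k) {
    return 8.0 * static_cast<double>(k) * static_cast<double>(k);
}
```

Under these assumptions, 100,000 SNPs with a random start would imply a dense matrix on the order of 80 GB, while the all-zero start has no non-null SNPs to allocate for.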

gaow commented Apr 7, 2026

Cross-method benchmark: SDPR vs PRS-CS vs lassosum

Simulated data: n=1000, p=50, binomial genotypes (MAF=0.3), 4 causal SNPs with effects (0.3, -0.25, 0.2, -0.15).
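
A minimal sketch of a simulation with these dimensions follows. Several details are assumptions because the comment does not state them: genotypes are drawn as independent Binomial(2, MAF) dosages (so no LD between SNPs), the causal SNP positions are arbitrary, the noise is standard normal, and the phenotype is a plain linear model. The benchmark itself was presumably run in R; this is only an illustration.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct SimData {
    std::vector<std::vector<double>> X;  // n x p genotype dosages in {0, 1, 2}
    std::vector<double> beta;            // true effects, length p
    std::vector<double> y;               // phenotype, length n
};

SimData simulate(std::size_t n, std::size_t p, double maf, unsigned seed) {
    std::mt19937 rng(seed);
    std::binomial_distribution<int> geno(2, maf);      // two alleles per SNP
    std::normal_distribution<double> noise(0.0, 1.0);  // assumed noise sd = 1

    SimData d;
    d.beta.assign(p, 0.0);
    const double effects[4] = {0.3, -0.25, 0.2, -0.15};  // from the benchmark text
    const std::size_t causal_idx[4] = {4, 14, 24, 34};   // hypothetical positions
    for (int j = 0; j < 4; ++j)
        if (causal_idx[j] < p) d.beta[causal_idx[j]] = effects[j];

    d.X.assign(n, std::vector<double>(p, 0.0));
    d.y.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            d.X[i][j] = static_cast<double>(geno(rng));
            d.y[i] += d.X[i][j] * d.beta[j];  // linear predictor
        }
        d.y[i] += noise(rng);
    }
    return d;
}
```

Calling `simulate(1000, 50, 0.3, 42)` reproduces the stated dimensions, though not the benchmark's actual data, since the random number generator and the unstated settings differ.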

Method comparison (seed=42)

| Method | cor(est, truth) | cor(est, sdpr) | cor(est, prscs) | cor(est, lasso) | nnz |
|---|---|---|---|---|---|
| SDPR | 0.9650 | -- | 0.8841 | 0.8778 | 50 |
| PRS-CS | 0.8204 | 0.8841 | -- | 0.9968 | 50 |
| lassosum (s=0.9) | 0.8140 | 0.8778 | 0.9968 | -- | 49 |
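
For readers who want the two metrics spelled out, here is a small sketch of how they could be computed. It assumes cor(...) is the Pearson correlation between effect-size vectors and that nnz counts exactly nonzero estimates; the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equal-length vectors.
// Assumes both have nonzero variance and at least two elements.
double pearson_cor(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= static_cast<double>(n);
    mb /= static_cast<double>(n);

    double sab = 0.0, saa = 0.0, sbb = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double da = a[i] - ma;
        const double db = b[i] - mb;
        sab += da * db;
        saa += da * da;
        sbb += db * db;
    }
    return sab / std::sqrt(saa * sbb);
}

// Number of exactly nonzero entries in an estimate vector.
std::size_t nnz(const std::vector<double>& v) {
    std::size_t k = 0;
    for (double x : v) k += (x != 0.0);
    return k;
}
```

Under this reading, nnz = 50 for SDPR and PRS-CS means every SNP received a nonzero estimate, while lassosum set one of the 50 to exactly zero.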

SDPR stability (5 MCMC runs, same data)

| Run | cor(est, truth) |
|---|---|
| 1 | 0.9639 |
| 2 | 0.9649 |
| 3 | 0.9601 |
| 4 | 0.9637 |
| 5 | 0.9653 |
| mean (sd) | 0.9636 (0.0020) |
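
The summary row can be recomputed from the five values as a sanity check. The sketch below assumes the sample standard deviation with an n-1 denominator, which is what R's sd() uses; the struct and function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct MeanSd {
    double mean;
    double sd;  // sample standard deviation (n - 1 denominator)
};

// Requires at least two values.
MeanSd mean_sd(const std::vector<double>& x) {
    const std::size_t n = x.size();
    double m = 0.0;
    for (double v : x) m += v;
    m /= static_cast<double>(n);

    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return {m, std::sqrt(ss / static_cast<double>(n - 1))};
}
```

Applied to the five rounded correlations above, this gives a mean of about 0.9636 and a standard deviation of about 0.002, in line with the reported row up to rounding of the inputs.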

Cross-seed comparison (cor(est, truth))

| Seed | SDPR | PRS-CS | lassosum |
|---|---|---|---|
| 42 | 0.9634 | 0.8204 | 0.8140 |
| 123 | 0.6513 | 0.8390 | 0.8393 |
| 2024 | 0.6279 | 0.8199 | 0.8237 |

All three methods recover the true causal signal. On seed=42, SDPR is the most accurate (cor=0.96) with very low variance across MCMC runs on the same data (sd=0.002), but its correlation with the truth drops to 0.65 and 0.63 on seeds 123 and 2024. PRS-CS and lassosum are highly correlated with each other (cor=0.997 on seed=42) and are more consistent across seeds (0.81 to 0.84).

gaow merged commit 7084e27 into main Apr 7, 2026
4 of 5 checks passed