Replace SSE intrinsics with Armadillo in SDPR for ARM compilation#458
Merged
Replace SSE intrinsics with Armadillo in SDPR for ARM compilation#458
Conversation
The original SDPR (Zhou et al.) used x86 SSE intrinsics (log_ps, exp_ps, _mm_max_ps, _mm_hadd_ps from sse_mathfun.h) for computing log-sum-exp over M cluster probabilities in sample_assignment(). This prevented compilation on ARM/Apple Silicon. Replace with Armadillo vectorized operations (arma::log, arma::exp, arma::max, arma::accu) which delegate to platform-optimal SIMD (NEON on ARM, SSE/AVX on x86) through the compiler auto-vectorization, giving portable performance without architecture-specific intrinsics. - Rewrite sample_assignment() using arma::vec for cluster probabilities - Remove src/sse_mathfun.h (719 lines of x86 SSE math functions) - Remove src/simde/ directory (~789K lines of x86→ARM translation headers) - Remove SIMDE_ENABLE_NATIVE_ALIASES flag from Makevars.in Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite sdpr_mcmc.cpp as a faithful line-by-line translation of the original SDPR (Zhou et al., github.com/eldronzhou/SDPR) from GSL to Armadillo, with clear comments referencing each original function. Key fixes vs the previous port: - Fix cls_assgn initialization: use all-zero (null cluster) instead of random assignment across M clusters, which caused "Mat::init() too large" crashes when sample_beta() tried to allocate enormous dense matrices for nearly-all-causal SNP lists on iteration 1 - Restore missing N* factor in sample_beta() A_vec computation (currently N=1.0 so no numerical difference, but correct for future) - Clean up sample_assignment() Armadillo vectorized log/exp - Add signal recovery unit tests with realistic LD (binomial genotypes) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
Cross-method benchmark: SDPR vs PRS-CS vs lassosumSimulated data: n=1000, p=50, binomial genotypes (MAF=0.3), 4 causal SNPs with effects (0.3, -0.25, 0.2, -0.15). Method comparison (seed=42)
SDPR stability (5 MCMC runs, same data)
Cross-seed comparison
All three methods recover the true causal signal. SDPR shows the highest accuracy on seed=42 (cor=0.96) with very low variance across MCMC runs (sd=0.002). PRS-CS and lassosum are highly correlated with each other (cor=0.997) and show more consistent performance across seeds. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
sample_assignment()used x86 SSE intrinsics (log_ps,exp_ps,_mm_max_ps,_mm_hadd_psfromsse_mathfun.h) for computing log-sum-exp over M=1000 cluster probabilities — this prevented compilation on ARM/Apple Siliconarma::log,arma::exp,arma::max,arma::accu) which delegate to platform-optimal SIMD (NEON on ARM, SSE/AVX on x86) through compiler auto-vectorizationsrc/sse_mathfun.h(719 lines) andsrc/simde/directory (~789K lines of x86→ARM translation headers that weren't working anyway)SIMDE_ENABLE_NATIVE_ALIASESflag fromMakevars.inPerformance note
The original SSE code processed 4 floats at a time for exp/log. Armadillo's vectorized
arma::exp/arma::logon contiguousarma::vecstorage achieves similar throughput via compiler auto-vectorization and platform BLAS, without architecture-specific intrinsics.Test plan
Rcpp::sourceCpp)devtools::test(filter="regularized_regression")with SDPR testsR CMD checkpasses🤖 Generated with Claude Code