A high-performance C++/CUDA library for solving block-structured linear systems, specifically block tridiagonal systems. It provides functionality equivalent to the Julia package BlockStructuredSolvers.jl.
- Multiple Backends: CPU (using BLAS/LAPACK) and CUDA (using cuBLAS/cuSOLVER) support
- Precision Options: Both single (float) and double precision
- Optimized Algorithms: Block Cholesky factorization for symmetric positive definite systems
- Modern C++: Uses C++17 features and modern design patterns
- High Performance: Optimized implementations for each backend
The library implements a block Cholesky factorization for block tridiagonal matrices of the form:
```
[ A₁   B₁   0    0   ⋯ ]
[ B₁ᵀ  A₂   B₂   0   ⋯ ]
[ 0    B₂ᵀ  A₃   B₃  ⋯ ]
[ ⋮    ⋮    ⋱    ⋱   ⋱ ]
```
Where:
- Aᵢ are n×n diagonal blocks (must be symmetric positive definite)
- Bᵢ are n×n off-diagonal blocks
- The system is solved by forward and backward substitution after factorization
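The factorization and the two substitution sweeps can be sketched self-contained for the scalar case n = 1 (each "block" is a single number). This is an illustration of the recurrence only, not the library's implementation; the function name is hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Block-Cholesky recurrence for an SPD tridiagonal system, scalar case n = 1:
//   L₁ = chol(A₁);  Cᵢ = Bᵢ · Lᵢ⁻ᵀ;  Lᵢ₊₁ = chol(Aᵢ₊₁ − Cᵢ Cᵢᵀ)
// followed by forward substitution (L y = b) and backward substitution (Lᵀ x = y).
std::vector<double> solve_tridiag_spd(const std::vector<double>& A,  // diagonal entries
                                      const std::vector<double>& B,  // off-diagonal entries
                                      std::vector<double> b) {       // right-hand side
    const std::size_t N = A.size();  // assumes N >= 1
    std::vector<double> L(N), Loff(N - 1);
    L[0] = std::sqrt(A[0]);                        // factor the first diagonal block
    for (std::size_t i = 0; i + 1 < N; ++i) {
        Loff[i] = B[i] / L[i];                     // subdiagonal factor Bᵢ · Lᵢ⁻ᵀ
        L[i + 1] = std::sqrt(A[i + 1] - Loff[i] * Loff[i]);  // Schur update, then chol
    }
    // Forward substitution: L y = b (overwrites b with y)
    b[0] /= L[0];
    for (std::size_t i = 1; i < N; ++i)
        b[i] = (b[i] - Loff[i - 1] * b[i - 1]) / L[i];
    // Backward substitution: Lᵀ x = y (overwrites y with x)
    b[N - 1] /= L[N - 1];
    for (std::size_t i = N - 1; i-- > 0;)
        b[i] = (b[i] - Loff[i] * b[i + 1]) / L[i];
    return b;
}
```

In the library each scalar operation generalizes to a block operation: the square root becomes a dense Cholesky (POTRF), the division a triangular solve (TRSM), and the multiply-subtract a matrix product (GEMM), so the whole solve remains O(N) in the number of blocks.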
- C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake 3.18+
- BLAS library (OpenBLAS, Intel MKL, or system BLAS)
- LAPACK library
For the CUDA backend:
- NVIDIA CUDA Toolkit 11.0+
- cuBLAS and cuSOLVER libraries
- CUDA-capable GPU with compute capability 6.0+
- Clone the repository:

```bash
git clone <repository-url>
cd BlockStructuredSolvers_CPP
```

- Create a build directory:

```bash
mkdir build && cd build
```

- Configure with CMake:

```bash
# For CPU-only build
cmake .. -DCMAKE_BUILD_TYPE=Release

# For CUDA build
cmake .. -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON
```

- Build:

```bash
make -j$(nproc)
```

- Run tests:

```bash
ctest
```

Basic CPU usage:

```cpp
#include <vector>

#include "block_structured_solvers.hpp"

using namespace block_structured_solvers;

int main() {
    const int N = 10;  // Number of blocks
    const int n = 4;   // Size of each block

    // Create CPU solver
    auto solver = create_solver_double(Backend::CPU, N, n);

    // Set up diagonal blocks (must be symmetric positive definite)
    for (int i = 0; i < N; ++i) {
        auto block = create_random_positive_definite_matrix<double>(n, 42 + i);
        solver->set_diagonal_block(i, block);
    }

    // Set up off-diagonal blocks
    for (int i = 0; i < N - 1; ++i) {
        std::vector<std::vector<double>> block(n, std::vector<double>(n, 0.1));
        solver->set_offdiagonal_block(i, block);
    }

    // Create right-hand side
    std::vector<std::vector<double>> rhs(N, std::vector<double>(n, 1.0));

    // Factorize and solve
    solver->factorize();
    std::vector<std::vector<double>> solution;
    solver->solve(rhs, solution);

    return 0;
}
```

Basic CUDA usage:

```cpp
#include <iostream>
#include <vector>

#include "block_structured_solvers.hpp"

using namespace block_structured_solvers;

int main() {
    const int N = 20;  // Number of blocks
    const int n = 8;   // Size of each block

    try {
        // Create CUDA solver
        auto solver = create_solver_double(Backend::CUDA, N, n);

        // Set up system (same as CPU example)
        // ... setup code ...

        // Factorize and solve on GPU
        std::vector<std::vector<double>> rhs(N, std::vector<double>(n, 1.0));
        solver->factorize();
        std::vector<std::vector<double>> solution;
        solver->solve(rhs, solution);
    } catch (const std::exception& e) {
        std::cerr << "CUDA error: " << e.what() << std::endl;
    }
    return 0;
}
```

Solver creation:

```cpp
// Create solver with specific type
template<typename T>
std::unique_ptr<BlockTriDiagData<T>> create_solver(Backend backend, int N, int n);

// Convenience functions
std::unique_ptr<BlockTriDiagData<float>> create_solver_float(Backend backend, int N, int n);
std::unique_ptr<BlockTriDiagData<double>> create_solver_double(Backend backend, int N, int n);
```

Solver interface:

```cpp
template<typename T>
class BlockTriDiagData {
public:
    // Set diagonal block i to the given matrix
    virtual void set_diagonal_block(int i, const std::vector<std::vector<T>>& block) = 0;

    // Set off-diagonal block i to the given matrix
    virtual void set_offdiagonal_block(int i, const std::vector<std::vector<T>>& block) = 0;

    // Perform block Cholesky factorization
    virtual void factorize() = 0;

    // Solve the system Ax = b
    virtual void solve(const std::vector<std::vector<T>>& rhs,
                       std::vector<std::vector<T>>& solution) = 0;
};
```

Matrix utilities:

```cpp
// Create identity matrix
template<typename T>
std::vector<std::vector<T>> create_identity_matrix(int n);

// Create random positive definite matrix
template<typename T>
std::vector<std::vector<T>> create_random_positive_definite_matrix(int n, unsigned int seed = 42);

// Print matrix to console
template<typename T>
void print_matrix(const std::vector<std::vector<T>>& matrix);
```

The library is optimized for performance on both CPU and GPU:
- CPU Implementation: Uses optimized BLAS/LAPACK routines (GEMM, TRSM, POTRF)
- CUDA Implementation: Uses cuBLAS and cuSOLVER for maximum GPU performance
- Memory Layout: Optimized data layouts for each backend
- Error Handling: Comprehensive error checking with meaningful messages
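The routines named above (POTRF, TRSM, GEMM) map onto the block factorization one step at a time. The following is a hedged outline using standard LAPACK/BLAS routine names; the library's actual call sequence may differ:

```text
for i = 1 .. N-1:
    POTRF(A_i)                 # A_i := L_i, Cholesky factor of diagonal block i
    TRSM(L_i, B_i)             # C_i := B_iᵀ L_i⁻ᵀ, subdiagonal factor
    GEMM/SYRK(A_{i+1}, C_i)    # A_{i+1} := A_{i+1} − C_i C_iᵀ, Schur-complement update
POTRF(A_N)                     # factor the last diagonal block
```

On the CUDA backend the same sequence uses the cuSOLVER/cuBLAS counterparts of these routines.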
- Problem Size: GPUs typically perform better with larger block sizes (n ≥ 16)
- Precision: Single precision is ~2x faster than double precision on most GPUs
- Memory: Pre-allocate and reuse solver objects when solving multiple systems
- Batch Size: For multiple systems, consider using batched operations
The repository includes several examples and benchmarks:
- examples/example_cpu.cpp: Basic CPU usage
- examples/example_cuda.cpp: Basic CUDA usage
- benchmarks/benchmark.cpp: Performance comparison between CPU and CUDA
- tests/test_solver.cpp: Unit tests for correctness
To run the examples:
```bash
./example_cpu
./example_cuda
./benchmark
```

This C++/CUDA implementation provides equivalent functionality to the Julia BlockStructuredSolvers.jl package:
| Feature | Julia Version | C++/CUDA Version |
|---|---|---|
| CPU Backend | ✅ BLAS/LAPACK | ✅ BLAS/LAPACK |
| CUDA Backend | ✅ CUDA.jl | ✅ cuBLAS/cuSOLVER |
| Precision | Float32/Float64 | float/double |
| Algorithm | Block Cholesky | Block Cholesky |
| Performance | High | Comparable/Better |
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this library in your research, please cite:
```bibtex
@software{BlockStructuredSolvers_CPP,
  author = {David Jin and Alexis Montoison and Sungho Shin},
  title  = {BlockStructuredSolvers_CPP: A High-Performance C++/CUDA Library for Block-Structured Linear Systems},
  year   = {2024},
  url    = {https://github.com/username/BlockStructuredSolvers_CPP}
}
```