From 0cb835c9583b9bb9963f9788fbd9f02344538bee Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 11:40:43 +0000 Subject: [PATCH 01/19] AI setup --- .Rbuildignore | 3 + .positai/settings.json | 25 +++++ AGENTS.md | 205 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 233 insertions(+) create mode 100644 .positai/settings.json create mode 100644 AGENTS.md diff --git a/.Rbuildignore b/.Rbuildignore index f2ea6e43d..1aa03d51f 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -7,6 +7,7 @@ ^src\- ^src\. CONTRIBUTING +AGENTS.md README.md ^\.github ^\.lintr @@ -34,3 +35,5 @@ revdep ^\.vs.*$ ^\.vscode.*$ ^CRAN-SUBMISSION$ +^\.positai$ +^\.claude$ diff --git a/.positai/settings.json b/.positai/settings.json new file mode 100644 index 000000000..f3f178518 --- /dev/null +++ b/.positai/settings.json @@ -0,0 +1,25 @@ +{ + "permission": { + "read": { + "*.r": "allow", + "*.h": "allow", + "*.cpp": "allow", + "*.R": "allow", + "*": "allow" + }, + "edit": { + "*.cpp": "allow", + "*.h": "allow", + "*.R": "allow", + "*.md": "allow" + }, + "bash": { + "echo *": "allow", + "grep *": "allow", + "head *": "allow" + "ls *": "allow", + "rm -f src/*.o src/*.dll": "allow", + "tail *": "allow", + } + } +} diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..dd295d396 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,205 @@ +# TreeDist — Agent Memory + +## Project Overview + +**TreeDist** is an R package (GPL ≥ 3) providing a suite of topological distance metrics +between phylogenetic trees. The mathematical core is implemented in C++17 and exposed to R +via Rcpp. The primary optimization goal is speed: many real analyses compute pairwise +distances over hundreds or thousands of trees, so inner loops must be tight. + +Current version: `2.12.0.9000` (development). +CRAN package page: + +--- + +## Repository Layout + +``` +TreeDist/ +├── src/ # C++17 source (the main optimization target) +├── R/ # R wrapper layer and pure-R helpers +├── benchmark/ # Microbenchmark scripts (bench package) +├── tests/testthat/ # Unit tests +├── data-raw/ # Scripts that regenerate lookup tables / data +├── vignettes/ # User-facing tutorials +└── inst/ # Installed extras +``` + +**Sibling repository** — `../TreeTools` is a companion package that TreeDist links against +at the C++ level (`LinkingTo: TreeTools`). Edits to TreeTools headers (especially +`SplitList.h`) can affect TreeDist performance and can be pushed to CRAN independently +when ready. + +--- + +## C++ Source Files + +| File | Size | Purpose | +|------|------|---------| +| `tree_distances.cpp` | 22 KB | Main distance calculations; calls into CostMatrix / LAP | +| `tree_distances.h` | 15 KB | **CostMatrix** class; cache-aligned storage; `findRowSubmin` hot path | +| `lap.cpp` | 10 KB | Jonker-Volgenant LAPJV linear assignment; extensively hand-optimized | +| `spr.cpp` | 7 KB | SPR distance approximation | +| `spr_lookup.cpp` | — | SPR lookup-table implementation | +| `nni_distance.cpp` | 16 KB | NNI distance approximations; HybridBuffer allocation | +| `li_diameters.h` | 30 KB | Precomputed NNI diameter lookup tables | +| `information.h` | 6 KB | log₂ / factorial lookup tables (max 8192); cached at startup | +| `binary_entropy_counts.cpp` | — | Binary entropy calculations | +| `day_1985.cpp` | 10 KB | Consensus tree information | +| `hmi.cpp` | 6 KB | Hierarchical Mutual Information | +| `hpart.cpp` | 7 KB | Hierarchical partition structures | +| `reduce_tree.cpp` | 11 KB | Prune trees to common tip set before distance calculation | +| `path_vector.cpp` | 3 KB | Path distance vector | +| `mast.cpp` | 5 KB | Maximum Agreement Subtree | +| `RcppExports.cpp` | 20 KB | Auto-generated Rcpp glue (do not edit by hand) | +| `ints.h` | — | Fixed-width integer typedefs (`splitbit`, `int16`, `int32`, …) | + +--- + +## Key Optimizations Already in Place + +Understanding what has already been done avoids duplicating effort. + +### Memory / cache +- **64-byte alignment** (`alignas(64)`) on `CostMatrix::data_` and `t_data_`. +- `CostMatrix` pads row width to a multiple of `BLOCK_SIZE = 16` so SIMD loads never + straddle cache lines. +- A **transposed copy** of the cost matrix is maintained to allow column-wise access with + sequential memory reads. + +### Arithmetic +- **Lookup tables** in `information.h` for `log2` (values 0–(SL_MAX_TIPS-1)²) and + `log2_factorial` (up to 8192); initialized via `__attribute__((constructor))`. + Using these avoids runtime `std::log2()` calls on hot paths. +- **Loop unrolling** — `CostMatrix::findRowSubmin()` manually unrolls 4× (comment: + "gives ~20% speedup"). +- **`__restrict__`** pointer annotations in `lap.cpp` and `tree_distances.h` to enable + compiler alias analysis. +- **`__builtin_assume_aligned(ptr, 64)`** hints around inner loops. +- **`__builtin_popcount()`** for split-bit counting. + +### Algorithm +- **Column reduction in reverse order** in LAPJV (faster convergence). +- **`findRowSubmin` two-pass strategy** avoids redundant comparisons. +- **LAPJV v2.10.0** achieved a 2× speedup for large matrices via reorganisation. +- **`HybridBuffer`** in `nni_distance.cpp`: small allocations go on the + stack (thresholds: 512 splits, 16 bins) to avoid heap overhead. +- **Tree reduction** (`reduce_tree.cpp`): trees are pruned to their common tip set before + any distance is computed; this is a major algorithmic win for partially-overlapping + taxon sets. + +### Types +- `cost = int_fast64_t` for LAP to avoid overflow and exploit 64-bit registers. +- `BIG = numeric_limits::max() / SL_MAX_SPLITS` — the divide avoids integer + overflow inside the LAP inner loop. +- `ROUND_PRECISION = 2048²` for safe rounding in cost scaling. + +--- + +## Benchmark Infrastructure + +All benchmarks live in `benchmark/` and use the `bench` package. + +```r +source("benchmark/_init.R") # loads TreeTools, TreeDist; defines Benchmark() +``` + +`Benchmark(expr)` wraps `bench::mark()` with `min_iterations = 3` and `time_unit = "us"`. +When run non-interactively (e.g. CI) results are serialised to `.bench.Rds` files. +`_compare_results.R` compares PR results against `main` and fails the build on a +regression > 4 %. + +### Existing benchmark scripts + +| Script | What it tests | +|--------|---------------| +| `bench-tree-distances.R` | CID, PID, RF, MCI on 100×50-tip and 40×200-tip tree sets | +| `bench-LAPJV.R` | LAPJV on 40×40, 400×400, 1999×1999 uniform random matrices | +| `bench-PathDist.R` | PathDist on 6×182-tip trees | + +To add a new benchmark, create `benchmark/bench-.R` following the existing pattern +and add it to `_run_benchmarks.R`. + +--- + +## Profiling with VTune + +VTune 2025.9 is installed at: + +``` +C:\Program Files (x86)\Intel\oneAPI\vtune\2025.9\bin64\vtune.exe +``` + +Typical workflow: +1. Build TreeDist with debug symbols but optimisations enabled: + `PKG_CXXFLAGS="-O2 -g"` in `~/.R/Makevars`. +2. Write a driver `.R` script that exercises the hot path in a loop (use the benchmark + scripts as a starting point). +3. Run a hotspot collection: + ``` + vtune -collect hotspots -result-dir vtune-out -- Rscript path/to/driver.R + ``` +4. View the report: + ``` + vtune -report hotspots -result-dir vtune-out + ``` + Or open the `.vtune` project in the VTune GUI. +5. Pay attention to the **Memory Access** and **Threading** analyses for cache-miss and + false-sharing diagnostics. + +--- + +## TreeTools Dependency + +`../TreeTools` is available locally and editable. It is linked at the C++ level via +`LinkingTo`; the key header consumed by TreeDist is ``. + +Important constants defined there that affect TreeDist: +- `SL_MAX_TIPS` — maximum number of leaf taxa per tree. +- `SL_MAX_SPLITS` — maximum number of splits. +- `splitbit` — the unsigned integer type used for bitset representation of splits. + +If a bottleneck traces back to a TreeTools header (e.g. SplitList layout, bit-width of +`splitbit`), changes can be made in `../TreeTools`, tested locally by re-installing, and +pushed to CRAN when stable. + +--- + +## Development Workflow + +```r +# Build and reload (from R) +devtools::load_all() # fast incremental rebuild +devtools::test() # run testthat suite + +# Or from the shell +R CMD build . +R CMD check TreeDist_*.tar.gz + +# Run a single benchmark interactively +source("benchmark/_init.R") +source("benchmark/bench-tree-distances.R") +``` + +C++ compilation flags are controlled by `src/Makevars.win` (Windows) / `src/Makevars`. +The package requires **C++17** (`CXX_STD = CXX17`). + +--- + +## Known Optimization Opportunities / TODOs + +- `information.h` line 19: comment suggests considering increasing `MAX_FACTORIAL_LOOKUP` + beyond 8192 or falling back to runtime calculation for larger values. +- `information.h` lines 120–122: code duplication flagged for consolidation. +- LAP inner loop: the 4× manual unroll works well; investigate whether AVX2 SIMD + intrinsics (`_mm256_*`) could replace it on modern hardware. +- `CostMatrix` transpose: currently maintained as a full second copy; a cache-oblivious + blocking scheme (already partially implemented via `BLOCK_SIZE`) could reduce memory + bandwidth further. +- SPR distance (`spr.cpp`, `spr_lookup.cpp`): the algorithm is relatively recent + (v2.8.0); profiling under VTune may reveal further hot spots. +- Parallelism: `parallel.R` exposes a parallel interface at the R level; the C++ layer + does not yet use OpenMP. Pairwise distance matrices are an embarrassingly parallel + workload. +- Large-tree path (`int32` migration, v2.12.0 dev): ensure new code paths are as + optimized as the original `int16` paths. From fff9a510c5618436e838b0db101600485733669f Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 13:08:59 +0000 Subject: [PATCH 02/19] OpenMP integration --- .positai/settings.json | 11 +- AGENTS.md | 37 ++++++- R/RcppExports.R | 4 + R/tree_distance_utilities.R | 16 +++ src/RcppExports.cpp | 13 +++ src/lap.cpp | 5 +- src/pairwise_distances.cpp | 199 ++++++++++++++++++++++++++++++++++++ src/tree_distances.cpp | 19 ---- src/tree_distances.h | 24 ++++- 9 files changed, 297 insertions(+), 31 deletions(-) create mode 100644 src/pairwise_distances.cpp diff --git a/.positai/settings.json b/.positai/settings.json index f3f178518..9d0741541 100644 --- a/.positai/settings.json +++ b/.positai/settings.json @@ -3,6 +3,7 @@ "read": { "*.r": "allow", "*.h": "allow", + "*.log": "allow", "*.cpp": "allow", "*.R": "allow", "*": "allow" @@ -10,16 +11,22 @@ "edit": { "*.cpp": "allow", "*.h": "allow", + "DESCRIPTION": "allow", + "src/Makevars": "allow", + "src/Makevars.win": "allow", "*.R": "allow", "*.md": "allow" }, "bash": { + "cd C:/Users/pjjg18/GitHub/TreeDist": "allow", + "cat *": "allow", "echo *": "allow", + "find *": "allow", "grep *": "allow", - "head *": "allow" + "head *": "allow", "ls *": "allow", "rm -f src/*.o src/*.dll": "allow", - "tail *": "allow", + "tail *": "allow" } } } diff --git a/AGENTS.md b/AGENTS.md index dd295d396..280a133c7 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -116,6 +116,7 @@ regression > 4 %. | `bench-tree-distances.R` | CID, PID, RF, MCI on 100×50-tip and 40×200-tip tree sets | | `bench-LAPJV.R` | LAPJV on 40×40, 400×400, 1999×1999 uniform random matrices | | `bench-PathDist.R` | PathDist on 6×182-tip trees | +| `bench-MCI-openmp.R` | MCI/CID OpenMP scaling on 100×50, 40×200, and 200×200-tip tree sets | To add a new benchmark, create `benchmark/bench-.R` following the existing pattern and add it to `_run_benchmarks.R`. @@ -186,11 +187,38 @@ The package requires **C++17** (`CXX_STD = CXX17`). --- +## Completed Optimizations (this dev cycle) + +### OpenMP parallelism for pairwise distances (`src/pairwise_distances.cpp`) +Added `#pragma omp parallel for schedule(dynamic)` over the pairwise loop for +`MutualClusteringInfo` / `ClusteringInfoDistance`. Build infrastructure: + +- `src/Makevars` and `src/Makevars.win` created (these did not exist before); + both set `CXX_STD = CXX17` and `PKG_CXXFLAGS/PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)`. + On platforms without OpenMP the flags are empty and the package builds + single-threaded. +- `lap()` in `lap.cpp` gained an `allow_interrupt = true` parameter; the batch + path passes `false` to avoid calling `Rcpp::checkUserInterrupt()` from a + worker thread. +- `add_ic_element` moved from `tree_distances.cpp` (inside `namespace TreeDist`, + inaccessible to other TUs) into `tree_distances.h` as a proper `inline`, + fixing a latent ODR issue and making it visible to `pairwise_distances.cpp`. +- R fast path: `.SplitDistanceAllPairs()` in `tree_distance_utilities.R` detects + `Func == MutualClusteringInfoSplits` with a same-tip-set, no-cluster call and + routes it to `cpp_mutual_clustering_all_pairs()`. + +**Benchmark script**: `benchmark/bench-MCI-openmp.R` +**To measure speedup**: run `bench-MCI-openmp.R` in a **fresh R session** after +`devtools::load_all()` (Windows locks the DLL in the session that compiles it). + +--- + ## Known Optimization Opportunities / TODOs - `information.h` line 19: comment suggests considering increasing `MAX_FACTORIAL_LOOKUP` beyond 8192 or falling back to runtime calculation for larger values. -- `information.h` lines 120–122: code duplication flagged for consolidation. +- `information.h` lines 120–122: `log2_factorial_table` is a verbatim copy from + `TreeSearch/src/expected_mi.cpp`; should be defined once (TreeTools?) and shared. - LAP inner loop: the 4× manual unroll works well; investigate whether AVX2 SIMD intrinsics (`_mm256_*`) could replace it on modern hardware. - `CostMatrix` transpose: currently maintained as a full second copy; a cache-oblivious @@ -198,8 +226,9 @@ The package requires **C++17** (`CXX_STD = CXX17`). bandwidth further. - SPR distance (`spr.cpp`, `spr_lookup.cpp`): the algorithm is relatively recent (v2.8.0); profiling under VTune may reveal further hot spots. -- Parallelism: `parallel.R` exposes a parallel interface at the R level; the C++ layer - does not yet use OpenMP. Pairwise distance matrices are an embarrassingly parallel - workload. +- OpenMP for other metrics: `pairwise_distances.cpp` currently covers only MCI/CID. + Adding equivalent batch functions for `SharedPhylogeneticInfo`, `MatchingSplitInfo`, + `RobinsonFoulds`, and `MatchingSplitDistance` would extend the speedup; each needs + a score-only variant of its computation. - Large-tree path (`int32` migration, v2.12.0 dev): ensure new code paths are as optimized as the original `int16` paths. diff --git a/R/RcppExports.R b/R/RcppExports.R index 81e688d39..2ca5d5977 100644 --- a/R/RcppExports.R +++ b/R/RcppExports.R @@ -78,6 +78,10 @@ cpp_nni_distance <- function(edge1, edge2, nTip) { .Call(`_TreeDist_cpp_nni_distance`, edge1, edge2, nTip) } +cpp_mutual_clustering_all_pairs <- function(splits_list, n_tip) { + .Call(`_TreeDist_cpp_mutual_clustering_all_pairs`, splits_list, n_tip) +} + path_vector <- function(edge) { .Call(`_TreeDist_path_vector`, edge) } diff --git a/R/tree_distance_utilities.R b/R/tree_distance_utilities.R index 5bbef3ac3..e7de4f2fa 100644 --- a/R/tree_distance_utilities.R +++ b/R/tree_distance_utilities.R @@ -129,6 +129,22 @@ CalculateTreeDistance <- function(Func, tree1, tree2 = NULL, nTip = length(tipLabels), ...) { cluster <- getOption("TreeDist-cluster") + # Fast path: use the OpenMP batch function for mutual clustering when all + # trees share the same tip set and no R-level cluster has been configured. + # Matches the behaviour of the generic path but avoids per-pair R overhead. + if (!is.na(nTip) && is.null(cluster) && + identical(Func, MutualClusteringInfoSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + return(structure( + cpp_mutual_clustering_all_pairs(splits, as.integer(nTip)), + class = "dist", + Size = length(splits1), + Labels = names(splits1), + Diag = FALSE, + Upper = FALSE + )) + } + if (is.na(nTip)) { splits <- lapply(splits1, as.Splits) diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp index 5c338bf32..3aa1a162c 100644 --- a/src/RcppExports.cpp +++ b/src/RcppExports.cpp @@ -211,6 +211,18 @@ BEGIN_RCPP return rcpp_result_gen; END_RCPP } +// cpp_mutual_clustering_all_pairs +NumericVector cpp_mutual_clustering_all_pairs(const List& splits_list, const int n_tip); +RcppExport SEXP _TreeDist_cpp_mutual_clustering_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_mutual_clustering_all_pairs(splits_list, n_tip)); + return rcpp_result_gen; +END_RCPP +} // path_vector IntegerVector path_vector(IntegerMatrix edge); RcppExport SEXP _TreeDist_path_vector(SEXP edgeSEXP) { @@ -432,6 +444,7 @@ static const R_CallMethodDef CallEntries[] = { {"_TreeDist_lapjv", (DL_FUNC) &_TreeDist_lapjv, 2}, {"_TreeDist_cpp_mast", (DL_FUNC) &_TreeDist_cpp_mast, 3}, {"_TreeDist_cpp_nni_distance", (DL_FUNC) &_TreeDist_cpp_nni_distance, 3}, + {"_TreeDist_cpp_mutual_clustering_all_pairs", (DL_FUNC) &_TreeDist_cpp_mutual_clustering_all_pairs, 2}, {"_TreeDist_path_vector", (DL_FUNC) &_TreeDist_path_vector, 1}, {"_TreeDist_vec_diff_euclidean", (DL_FUNC) &_TreeDist_vec_diff_euclidean, 2}, {"_TreeDist_pair_diff_euclidean", (DL_FUNC) &_TreeDist_pair_diff_euclidean, 1}, diff --git a/src/lap.cpp b/src/lap.cpp index 978bd9a47..73f6c5989 100644 --- a/src/lap.cpp +++ b/src/lap.cpp @@ -69,7 +69,8 @@ inline bool nontrivially_less_than(cost a, cost b) noexcept { cost lap(const lap_row dim, cost_matrix &input_cost, std::vector &rowsol, - std::vector &colsol) + std::vector &colsol, + const bool allow_interrupt) // input: // dim - problem size @@ -186,7 +187,7 @@ cost lap(const lap_row dim, // Put in current k, and go back to that k. // Continue augmenting path i - j1 with i0. freeunassigned[--k] = i0; - Rcpp::checkUserInterrupt(); + if (allow_interrupt) Rcpp::checkUserInterrupt(); } else { // No further augmenting reduction possible. // Store i0 in list of free rows for next phase. diff --git a/src/pairwise_distances.cpp b/src/pairwise_distances.cpp new file mode 100644 index 000000000..b0cd93672 --- /dev/null +++ b/src/pairwise_distances.cpp @@ -0,0 +1,199 @@ +/* pairwise_distances.cpp + * + * Batch pairwise distance functions using OpenMP where available. + * Each exported function takes a list of split matrices (one per tree) and + * returns the lower-triangle distance vector in combn(n, 2) order, suitable + * for direct use as the data payload of an R dist object. + */ + +#ifdef _OPENMP +#include +#endif + +#include +#include +#include +#include + +#include "tree_distances.h" + +using Rcpp::List; +using Rcpp::NumericVector; +using Rcpp::RawMatrix; +using TreeTools::SplitList; +using TreeTools::count_bits; + + +// Score-only version of mutual_clustering(). Thread-safe: uses only local +// storage and read-only globals (lg2 table, lookup tables). +// Passes allow_interrupt = false to lap() so it is safe to call from an +// OpenMP parallel region. +static double mutual_clustering_score( + const SplitList& a, const SplitList& b, const int32 n_tips +) { + if (a.n_splits == 0 || b.n_splits == 0 || n_tips == 0) return 0.0; + + const bool a_has_more = (a.n_splits > b.n_splits); + const int16 most_splits = a_has_more ? a.n_splits : b.n_splits; + const int16 a_extra = a_has_more ? most_splits - b.n_splits : 0; + const int16 b_extra = a_has_more ? 0 : most_splits - a.n_splits; + const double n_tips_rcp = 1.0 / static_cast(n_tips); + + constexpr cost max_score = BIG; + constexpr double over_max = 1.0 / static_cast(BIG); + const double max_over_tips = static_cast(BIG) * n_tips_rcp; + + cost_matrix score(most_splits); + double exact_score = 0.0; + int16 exact_n = 0; + + std::vector a_match(a.n_splits, 0); + auto b_match = std::make_unique(b.n_splits); + std::fill(b_match.get(), b_match.get() + b.n_splits, int16(0)); + + for (int16 ai = 0; ai < a.n_splits; ++ai) { + if (a_match[ai]) continue; + + const int16 na = a.in_split[ai]; + const int16 nA = n_tips - na; + const auto* a_row = a.state[ai]; + + for (int16 bi = 0; bi < b.n_splits; ++bi) { + int16 a_and_b = 0; + const auto* b_row = b.state[bi]; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + a_and_b += count_bits(a_row[bin] & b_row[bin]); + } + + const int16 nb = b.in_split[bi]; + const int16 nB = n_tips - nb; + const int16 a_and_B = na - a_and_b; + const int16 A_and_b = nb - a_and_b; + const int16 A_and_B = nA - A_and_b; + + if ((!a_and_B && !A_and_b) || (!a_and_b && !A_and_B)) { + // Exact match (nested or identical splits) + exact_score += TreeDist::ic_matching(na, nA, n_tips); + ++exact_n; + a_match[ai] = bi + 1; + b_match[bi] = ai + 1; + break; + } else if (a_and_b == A_and_b && a_and_b == a_and_B && a_and_b == A_and_B) { + score(ai, bi) = max_score; // Avoid rounding errors on orthogonal splits + } else { + double ic_sum = 0.0; + TreeDist::add_ic_element(ic_sum, a_and_b, na, nb, n_tips); + TreeDist::add_ic_element(ic_sum, a_and_B, na, nB, n_tips); + TreeDist::add_ic_element(ic_sum, A_and_b, nA, nb, n_tips); + TreeDist::add_ic_element(ic_sum, A_and_B, nA, nB, n_tips); + score(ai, bi) = max_score - static_cast(ic_sum * max_over_tips); + } + } + + if (b.n_splits < most_splits) { + score.padRowAfterCol(ai, b.n_splits, max_score); + } + } + + // Early exit when everything matched exactly + if (exact_n == b.n_splits || exact_n == a.n_splits) { + return exact_score * n_tips_rcp; + } + + const int16 lap_n = most_splits - exact_n; + std::vector rowsol(lap_n); + std::vector colsol(lap_n); + + if (exact_n) { + // Build a reduced cost matrix omitting exact-matched rows/cols + cost_matrix small(lap_n); + int16 a_pos = 0; + for (int16 ai = 0; ai < a.n_splits; ++ai) { + if (a_match[ai]) continue; + int16 b_pos = 0; + for (int16 bi = 0; bi < b.n_splits; ++bi) { + if (b_match[bi]) continue; + small(a_pos, b_pos) = score(ai, bi); + ++b_pos; + } + small.padRowAfterCol(a_pos, lap_n - a_extra, max_score); + ++a_pos; + } + small.padAfterRow(lap_n - b_extra, max_score); + + const double lap_score = + static_cast((max_score * lap_n) - + lap(lap_n, small, rowsol, colsol, false)) * over_max; + return lap_score + exact_score * n_tips_rcp; + + } else { + // No exact matches — pad and solve the full matrix + for (int16 ai = a.n_splits; ai < most_splits; ++ai) { + for (int16 bi = 0; bi < most_splits; ++bi) { + score(ai, bi) = max_score; + } + } + return static_cast( + (max_score * lap_n) - lap(lap_n, score, rowsol, colsol, false) + ) / max_score; + } +} + + +//' Pairwise mutual clustering information — batch computation +//' +//' Internal function. Computes all pairwise MCI scores for a set of trees, +//' using OpenMP threads when available (falling back to single-threaded +//' execution otherwise). No interrupt checking is performed inside the +//' parallel region; the outer R call remains interruptible between batches. +//' +//' @param splits_list A list of split matrices (class `Splits` or `RawMatrix`), +//' one per tree, all covering the same tip set. Typically the object +//' returned by `as.Splits(trees, tipLabels = labs, asSplits = FALSE)`. +//' @param n_tip Integer; number of tips shared by all trees. +//' @return Numeric vector of length `n*(n-1)/2` containing pairwise MCI +//' scores in `combn(n, 2)` column-major order (i.e. the data payload of +//' an R `dist` object). +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_mutual_clustering_all_pairs( + const List& splits_list, + const int n_tip +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + + const int n_pairs = N * (N - 1) / 2; + + // Construct SplitList objects on the main thread: R SEXP access is not + // thread-safe, and the SplitList constructor reads from R RawMatrix objects. + // SplitList is not move-constructible, so we heap-allocate via unique_ptr. + std::vector> splits; + splits.reserve(N); + for (int k = 0; k < N; ++k) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[k])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + + // Iterate over columns of the combn(N,2) lower triangle. + // Pair (col, row) with col < row maps to dist-vector index: + // p = col*(N-1) - col*(col-1)/2 + row - col - 1 +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = mutual_clustering_score(*splits[col], *splits[row], n_tip); + } + } + + return result; +} diff --git a/src/tree_distances.cpp b/src/tree_distances.cpp index 7666b175a..9d33f717d 100644 --- a/src/tree_distances.cpp +++ b/src/tree_distances.cpp @@ -33,25 +33,6 @@ namespace TreeDist { } } - inline void add_ic_element(double& ic_sum, int16 nkK, int16 nk, int16 nK, - int16 n_tips) noexcept { - /* - * See equation 16 in Meila 2007. I denote k' as K. - * nkK is converted to pkK in the calling function, when the sum of all - * elements is divided by n. - */ - if (nkK && nk && nK) { - if (nkK == nk && nkK == nK && nkK << 1 == n_tips) { - ic_sum += nkK; - } else { - const int32 numerator = nkK * n_tips; - const int32 denominator = nk * nK; - if (numerator != denominator) { - ic_sum += nkK * (lg2[numerator] - lg2[denominator]); - } - } - } - } } diff --git a/src/tree_distances.h b/src/tree_distances.h index b5a2f97bc..16907bd31 100644 --- a/src/tree_distances.h +++ b/src/tree_distances.h @@ -379,13 +379,29 @@ using cost_matrix = CostMatrix; extern cost lap(lap_row dim, cost_matrix &input_cost, std::vector &rowsol, - std::vector &colsol); - -extern inline void add_ic_element(double& ic_sum, int16 nkK, int16 nk, int16 nK, - int16 n_tips); + std::vector &colsol, + bool allow_interrupt = true); namespace TreeDist { + // See equation 16 in Meila 2007 (k' denoted K). + // nkK is converted to pkK in the calling function when divided by n. + inline void add_ic_element(double& ic_sum, const int16 nkK, const int16 nk, + const int16 nK, const int16 n_tips) noexcept { + if (nkK && nk && nK) { + if (nkK == nk && nkK == nK && nkK << 1 == n_tips) { + ic_sum += nkK; + } else { + const int32 numerator = nkK * n_tips; + const int32 denominator = nk * nK; + if (numerator != denominator) { + ic_sum += nkK * (lg2[numerator] - lg2[denominator]); + } + } + } + } + + // Returns lg2_unrooted[x] - lg2_trees_matching_split(y, x - y) [[nodiscard]] inline double mmsi_pair_score(const int16 x, const int16 y) noexcept { assert(SL_MAX_TIPS + 2 <= std::numeric_limits::max()); // verify int16 ok From 7fbc97686123cdf73337cfaa4a30aff62c645b92 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 13:40:37 +0000 Subject: [PATCH 03/19] Multithreaded calculation --- AGENTS.md | 31 ++++++++++++++-- NEWS.md | 14 ++++++++ R/parallel.R | 49 ++++++++++++++++++++----- R/tree_distance_utilities.R | 3 +- man/StartParallel.Rd | 50 +++++++++++++++++++++----- man/TreeDist-package.Rd | 1 + man/cpp_mutual_clustering_all_pairs.Rd | 27 ++++++++++++++ src/RcppExports.cpp | 9 ++--- src/pairwise_distances.cpp | 5 +-- 9 files changed, 162 insertions(+), 27 deletions(-) create mode 100644 man/cpp_mutual_clustering_all_pairs.Rd diff --git a/AGENTS.md b/AGENTS.md index 280a133c7..a871f9580 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -203,13 +203,40 @@ Added `#pragma omp parallel for schedule(dynamic)` over the pairwise loop for - `add_ic_element` moved from `tree_distances.cpp` (inside `namespace TreeDist`, inaccessible to other TUs) into `tree_distances.h` as a proper `inline`, fixing a latent ODR issue and making it visible to `pairwise_distances.cpp`. +- `SplitList` is not move-constructible; `std::vector` therefore + cannot be used (reserve/emplace_back require movability). Fix: use + `std::vector>` instead. - R fast path: `.SplitDistanceAllPairs()` in `tree_distance_utilities.R` detects `Func == MutualClusteringInfoSplits` with a same-tip-set, no-cluster call and routes it to `cpp_mutual_clustering_all_pairs()`. **Benchmark script**: `benchmark/bench-MCI-openmp.R` -**To measure speedup**: run `bench-MCI-openmp.R` in a **fresh R session** after -`devtools::load_all()` (Windows locks the DLL in the session that compiles it). +**To measure speedup**: install with `install.packages(".", repos=NULL, type="source")` +into a fresh library (avoids `devtools::load_all()` debug flags and Windows DLL lock). +Use `TreeDist:::cpp_mutual_clustering_all_pairs` when calling from an installed package. + +#### Measured speedups (release build, -O2, 16-core Windows machine) + +| Scenario | Serial R loop | R parallel (16 workers) | OpenMP batch | +|---|---|---|---| +| 100 trees × 50 tips (4 950 pairs) | 117 ms | 78 ms | **17 ms** | +| 40 trees × 200 tips (780 pairs) | 292 ms | 3 350 ms | **35 ms** | + +OpenMP vs serial: **7–8×**. OpenMP vs R-parallel cluster: **5–95×** (R-parallel +incurs ~2–3 s serialisation overhead regardless of problem size). + +#### R parallel cluster crossover (50-tip trees, 16 workers) + +The R parallel cluster is only competitive with the serial loop when the problem +is large enough to amortise its ~2–3 s IPC/serialisation overhead. For 50-tip +trees that crossover is around **500 trees (~125 000 pairs)**, where serial takes +~2.9 s and parallel takes ~2.8 s. The OpenMP batch path remains faster than +both at every size measured. + +Practical implication: **`StartParallel()` provides no benefit for MCI/CID** on +machines where OpenMP is available, and actively harms performance for typical +analysis sizes (≤ 200 trees). The fast path therefore bypasses the R cluster +entirely when `cluster` is `NULL`. --- diff --git a/NEWS.md b/NEWS.md index a08f516de..c6ae93d6d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,17 @@ +# TreeDist 2.12.0.9001 (2026-03-10) + +- Pairwise `ClusteringInfoDistance()` / `MutualClusteringInfo()` now uses + an OpenMP multi-threaded batch path when the package is compiled with + OpenMP support, giving ~7–8× speed-up over the serial path on a 16-core + machine. + +- The number of OpenMP threads is controlled by `options(mc.cores = N)`; + the default is `1` (single-threaded). Set `mc.cores` to + `parallel::detectCores()` or a fixed integer to enable multi-threading. + `StartParallel()` / `StopParallel()` are no longer needed for + `ClusteringInfoDistance()` when OpenMP is available. + + # TreeDist 2.12.0.9000 (2026-02-19) - Tweak vignettes. diff --git a/R/parallel.R b/R/parallel.R index 1e86e1ee1..d8d57458d 100644 --- a/R/parallel.R +++ b/R/parallel.R @@ -3,28 +3,59 @@ #' #' Accelerate distance calculation by employing multiple \acronym{CPU} workers. #' -#' "TreeDist" parallelizes the calculation of tree to tree distances via -#' the [`parCapply()`] function, using a user-defined cluster specified in +#' ## OpenMP (recommended for `ClusteringInfoDistance` / `MutualClusteringInfo`) +#' +#' When the package is built with \acronym{OpenMP} support (the default on +#' Linux and Windows; optional on macOS), pairwise +#' [`ClusteringInfoDistance()`] / [`MutualClusteringInfo()`] calculations use +#' an efficient multi-threaded code path automatically — no cluster setup is +#' required. +#' +#' The number of \acronym{OpenMP} threads is controlled by the standard +#' `"mc.cores"` option: +#' +#' ```r +#' options(mc.cores = parallel::detectCores()) # use all available cores +#' options(mc.cores = 4L) # or a fixed number +#' ``` +#' +#' The default is `1` (single-threaded). The \acronym{OpenMP} path is +#' substantially faster than the R-cluster path for typical analysis sizes and +#' is preferred when available. +#' +#' ## R parallel cluster (other metrics) +#' +#' For metrics that do not yet have an \acronym{OpenMP} batch implementation +#' (e.g. [`RobinsonFoulds()`], [`MatchingSplitDistance()`]), "TreeDist" +#' parallelizes via [`parCapply()`] using a cluster stored in #' `options("TreeDist-cluster")`. #' -#' `StartParallel()` calls `parallel::makeCluster()` and tells "TreeDist" to -#' use the created cluster. +#' `StartParallel()` calls `parallel::makeCluster()` and registers the cluster. #' -#' `SetParallel()` tells "TreeDist" to use a pre-existing or user-specified -#' cluster. +#' `SetParallel()` registers a pre-existing cluster. #' -#' `StopParallel()` stops the current TreeDist cluster. +#' `StopParallel()` stops the current TreeDist cluster and releases resources. +#' +#' Note that R-cluster parallelism carries a serialisation overhead of ~2–3 s, +#' so it is only beneficial for large problems (roughly > 500 trees at 50 tips). +#' For [`ClusteringInfoDistance()`] the \acronym{OpenMP} path is faster at +#' every problem size and is used automatically when no cluster is registered. #' #' @param \dots Parameters to pass to [`makeCluster()`]. #' @param cl An existing cluster. #' #' @examples -#' if (interactive()) { # Only run in terminal +#' # OpenMP parallelism: just set mc.cores before calling distance functions. +#' options(mc.cores = 2L) +#' # ClusteringInfoDistance(trees) # now uses 2 OpenMP threads +#' options(mc.cores = NULL) # restore default (single-threaded) +#' +#' if (interactive()) { # R cluster: only worthwhile for non-OpenMP metrics #' library("TreeTools", quietly = TRUE) #' nCores <- ceiling(parallel::detectCores() / 2) #' StartParallel(nCores) # Takes a few seconds to set up processes #' GetParallel() -#' ClusteringInfoDistance(as.phylo(0:6, 100)) +#' RobinsonFoulds(as.phylo(0:6, 100)) #' StopParallel() # Returns system resources #' } #' @template MRS diff --git a/R/tree_distance_utilities.R b/R/tree_distance_utilities.R index e7de4f2fa..0509ed0eb 100644 --- a/R/tree_distance_utilities.R +++ b/R/tree_distance_utilities.R @@ -136,7 +136,8 @@ CalculateTreeDistance <- function(Func, tree1, tree2 = NULL, identical(Func, MutualClusteringInfoSplits)) { splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) return(structure( - cpp_mutual_clustering_all_pairs(splits, as.integer(nTip)), + cpp_mutual_clustering_all_pairs(splits, as.integer(nTip), + as.integer(getOption("mc.cores", 1L))), class = "dist", Size = length(splits1), Labels = names(splits1), diff --git a/man/StartParallel.Rd b/man/StartParallel.Rd index 71d5efc88..7fef07ffc 100644 --- a/man/StartParallel.Rd +++ b/man/StartParallel.Rd @@ -35,25 +35,57 @@ StopParallel(quietly = FALSE) Accelerate distance calculation by employing multiple \acronym{CPU} workers. } \details{ -"TreeDist" parallelizes the calculation of tree to tree distances via -the \code{\link[=parCapply]{parCapply()}} function, using a user-defined cluster specified in +\subsection{OpenMP (recommended for \code{ClusteringInfoDistance} / \code{MutualClusteringInfo})}{ + +When the package is built with \acronym{OpenMP} support (the default on +Linux and Windows; optional on macOS), pairwise +\code{\link[=ClusteringInfoDistance]{ClusteringInfoDistance()}} / \code{\link[=MutualClusteringInfo]{MutualClusteringInfo()}} calculations use +an efficient multi-threaded code path automatically — no cluster setup is +required. + +The number of \acronym{OpenMP} threads is controlled by the standard +\code{"mc.cores"} option: + +\if{html}{\out{
}}\preformatted{options(mc.cores = parallel::detectCores()) # use all available cores +options(mc.cores = 4L) # or a fixed number +}\if{html}{\out{
}} + +The default is \code{1} (single-threaded). The \acronym{OpenMP} path is +substantially faster than the R-cluster path for typical analysis sizes and +is preferred when available. +} + +\subsection{R parallel cluster (other metrics)}{ + +For metrics that do not yet have an \acronym{OpenMP} batch implementation +(e.g. \code{\link[=RobinsonFoulds]{RobinsonFoulds()}}, \code{\link[=MatchingSplitDistance]{MatchingSplitDistance()}}), "TreeDist" +parallelizes via \code{\link[=parCapply]{parCapply()}} using a cluster stored in \code{options("TreeDist-cluster")}. -\code{StartParallel()} calls \code{parallel::makeCluster()} and tells "TreeDist" to -use the created cluster. +\code{StartParallel()} calls \code{parallel::makeCluster()} and registers the cluster. + +\code{SetParallel()} registers a pre-existing cluster. -\code{SetParallel()} tells "TreeDist" to use a pre-existing or user-specified -cluster. +\code{StopParallel()} stops the current TreeDist cluster and releases resources. -\code{StopParallel()} stops the current TreeDist cluster. +Note that R-cluster parallelism carries a serialisation overhead of ~2–3 s, +so it is only beneficial for large problems (roughly > 500 trees at 50 tips). +For \code{\link[=ClusteringInfoDistance]{ClusteringInfoDistance()}} the \acronym{OpenMP} path is faster at +every problem size and is used automatically when no cluster is registered. +} } \examples{ -if (interactive()) { # Only run in terminal +# OpenMP parallelism: just set mc.cores before calling distance functions. +options(mc.cores = 2L) +# ClusteringInfoDistance(trees) # now uses 2 OpenMP threads +options(mc.cores = NULL) # restore default (single-threaded) + +if (interactive()) { # R cluster: only worthwhile for non-OpenMP metrics library("TreeTools", quietly = TRUE) nCores <- ceiling(parallel::detectCores() / 2) StartParallel(nCores) # Takes a few seconds to set up processes GetParallel() - ClusteringInfoDistance(as.phylo(0:6, 100)) + RobinsonFoulds(as.phylo(0:6, 100)) StopParallel() # Returns system resources } } diff --git a/man/TreeDist-package.Rd b/man/TreeDist-package.Rd index c322229b2..b142288f7 100644 --- a/man/TreeDist-package.Rd +++ b/man/TreeDist-package.Rd @@ -139,6 +139,7 @@ Other contributors: \item Roy Jonker \email{roy_jonker@magiclogic.com} (LAP algorithm) [programmer, copyright holder] \item Yong Yang \email{yongyanglink@gmail.com} (LAP algorithm) [contributor, copyright holder] \item Yi Cao (LAP algorithm) [contributor, copyright holder] + \item Neil Kaye (Mercator image) } } diff --git a/man/cpp_mutual_clustering_all_pairs.Rd b/man/cpp_mutual_clustering_all_pairs.Rd new file mode 100644 index 000000000..3be2abff9 --- /dev/null +++ b/man/cpp_mutual_clustering_all_pairs.Rd @@ -0,0 +1,27 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/RcppExports.R +\name{cpp_mutual_clustering_all_pairs} +\alias{cpp_mutual_clustering_all_pairs} +\title{Pairwise mutual clustering information — batch computation} +\usage{ +cpp_mutual_clustering_all_pairs(splits_list, n_tip, n_threads = 1L) +} +\arguments{ +\item{splits_list}{A list of split matrices (class \code{Splits} or \code{RawMatrix}), +one per tree, all covering the same tip set. Typically the object +returned by \code{as.Splits(trees, tipLabels = labs, asSplits = FALSE)}.} + +\item{n_tip}{Integer; number of tips shared by all trees.} +} +\value{ +Numeric vector of length \code{n*(n-1)/2} containing pairwise MCI +scores in \code{combn(n, 2)} column-major order (i.e. the data payload of +an R \code{dist} object). +} +\description{ +Internal function. Computes all pairwise MCI scores for a set of trees, +using OpenMP threads when available (falling back to single-threaded +execution otherwise). No interrupt checking is performed inside the +parallel region; the outer R call remains interruptible between batches. +} +\keyword{internal} diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp index 3aa1a162c..c8702cdbf 100644 --- a/src/RcppExports.cpp +++ b/src/RcppExports.cpp @@ -212,14 +212,15 @@ BEGIN_RCPP END_RCPP } // cpp_mutual_clustering_all_pairs -NumericVector cpp_mutual_clustering_all_pairs(const List& splits_list, const int n_tip); -RcppExport SEXP _TreeDist_cpp_mutual_clustering_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP) { +NumericVector cpp_mutual_clustering_all_pairs(const List& splits_list, const int n_tip, const int n_threads); +RcppExport SEXP _TreeDist_cpp_mutual_clustering_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP n_threadsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); - rcpp_result_gen = Rcpp::wrap(cpp_mutual_clustering_all_pairs(splits_list, n_tip)); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_mutual_clustering_all_pairs(splits_list, n_tip, n_threads)); return rcpp_result_gen; END_RCPP } @@ -444,7 +445,7 @@ static const R_CallMethodDef CallEntries[] = { {"_TreeDist_lapjv", (DL_FUNC) &_TreeDist_lapjv, 2}, {"_TreeDist_cpp_mast", (DL_FUNC) &_TreeDist_cpp_mast, 3}, {"_TreeDist_cpp_nni_distance", (DL_FUNC) &_TreeDist_cpp_nni_distance, 3}, - {"_TreeDist_cpp_mutual_clustering_all_pairs", (DL_FUNC) &_TreeDist_cpp_mutual_clustering_all_pairs, 2}, + {"_TreeDist_cpp_mutual_clustering_all_pairs", (DL_FUNC) &_TreeDist_cpp_mutual_clustering_all_pairs, 3}, {"_TreeDist_path_vector", (DL_FUNC) &_TreeDist_path_vector, 1}, {"_TreeDist_vec_diff_euclidean", (DL_FUNC) &_TreeDist_vec_diff_euclidean, 2}, {"_TreeDist_pair_diff_euclidean", (DL_FUNC) &_TreeDist_pair_diff_euclidean, 1}, diff --git a/src/pairwise_distances.cpp b/src/pairwise_distances.cpp index b0cd93672..f591ec812 100644 --- a/src/pairwise_distances.cpp +++ b/src/pairwise_distances.cpp @@ -158,7 +158,8 @@ static double mutual_clustering_score( // [[Rcpp::export]] NumericVector cpp_mutual_clustering_all_pairs( const List& splits_list, - const int n_tip + const int n_tip, + const int n_threads = 1 ) { const int N = splits_list.size(); if (N < 2) return NumericVector(0); @@ -183,7 +184,7 @@ NumericVector cpp_mutual_clustering_all_pairs( // Pair (col, row) with col < row maps to dist-vector index: // p = col*(N-1) - col*(col-1)/2 + row - col - 1 #ifdef _OPENMP -#pragma omp parallel for schedule(dynamic) +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) #endif for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP From c3169f9470a25c54c8a353767e2d2b3cf835fed5 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 13:54:00 +0000 Subject: [PATCH 04/19] Code coverage --- src/tree_distances.h | 13 ++--- tests/testthat/test-pairwise_distances.R | 63 ++++++++++++++++++++++++ 2 files changed, 68 insertions(+), 8 deletions(-) create mode 100644 tests/testthat/test-pairwise_distances.R diff --git a/src/tree_distances.h b/src/tree_distances.h index 16907bd31..06b8661a9 100644 --- a/src/tree_distances.h +++ b/src/tree_distances.h @@ -389,14 +389,11 @@ namespace TreeDist { inline void add_ic_element(double& ic_sum, const int16 nkK, const int16 nk, const int16 nK, const int16 n_tips) noexcept { if (nkK && nk && nK) { - if (nkK == nk && nkK == nK && nkK << 1 == n_tips) { - ic_sum += nkK; - } else { - const int32 numerator = nkK * n_tips; - const int32 denominator = nk * nK; - if (numerator != denominator) { - ic_sum += nkK * (lg2[numerator] - lg2[denominator]); - } + assert(!(nkK == nk && nkK == nK && nkK << 1 == n_tips)); + const int32 numerator = nkK * n_tips; + const int32 denominator = nk * nK; + if (numerator != denominator) { + ic_sum += nkK * (lg2[numerator] - lg2[denominator]); } } } diff --git a/tests/testthat/test-pairwise_distances.R b/tests/testthat/test-pairwise_distances.R new file mode 100644 index 000000000..499dc21ab --- /dev/null +++ b/tests/testthat/test-pairwise_distances.R @@ -0,0 +1,63 @@ +## Tests for mutual_clustering_score() branches in src/pairwise_distances.cpp. +## +## These branches are only reachable via cpp_mutual_clustering_all_pairs(), +## invoked automatically by MutualClusteringInfo() when all trees share the +## same tip set and no R-level cluster is active (the fast path in +## .SplitDistanceAllPairs()). The fast path returns a dist; MutualClusteringInfo +## then converts it to a full matrix and fills the diagonal with ClusteringEntropy +## values. The off-diagonal entries equal the raw MCI scores from the batch C++ +## function. + +test_that("cpp_mutual_clustering_all_pairs: orthogonal splits score 0 (line 82)", { + # Two 8-tip trees, each with exactly one internal split. + # The splits cross orthogonally: every quadrant contains exactly 2 tips, + # so a_and_b == A_and_b == a_and_B == A_and_B == 2. + # This triggers the rounding-error guard (score = max_score) at line 82, and + # the LAP assigns that sole pair, yielding MCI = 0. + tr1 <- ape::read.tree(text = "((t1,t2,t3,t4),(t5,t6,t7,t8));") + tr2 <- ape::read.tree(text = "((t1,t2,t5,t6),(t3,t4,t7,t8));") + trees <- structure(list(tr1, tr2), class = "multiPhylo") + + r <- MutualClusteringInfo(trees) + + # r is a 2×2 matrix; the off-diagonal [2,1] is the pairwise MCI + expect_equal(r[2, 1], 0, tolerance = 1e-10) + # Agrees with the single-pair path (cpp_mutual_clustering, not the batch fn) + expect_equal(r[2, 1], MutualClusteringInfo(tr1, tr2), tolerance = 1e-10) +}) + +test_that("cpp_mutual_clustering_all_pairs: unequal split counts (lines 94, 131-138)", { + # Three 6-tip trees with different numbers of non-trivial splits: + # + # tr_1a — 1 split: {t1,t2,t3} | {t4,t5,t6} + # tr_3 — 3 splits: {t1,t4}, {t2,t5}, {t3,t6} + # tr_1b — 1 split: {t1,t2} | {t3,t4,t5,t6} + # + # The batch loop (col < row) processes three pairs using 0-indexed trees: + # + # (col=0 → a=tr_1a[1 split], row=1 → b=tr_3[3 splits]): + # a_has_more = FALSE; most_splits = 3. + # No split pairs are exact matches, so exact_n = 0 → else branch: + # loop fills phantom rows ai=1,2 with max_score (lines 131–134) + # LAP solves the full 3×3 matrix (lines 136–138) + # + # (col=1 → a=tr_3[3 splits], row=2 → b=tr_1b[1 split]): + # a_has_more = TRUE; most_splits = 3, b.n_splits = 1. + # For every ai=0..2: padRowAfterCol(ai, 1, max_score) (line 94) + + tr_1a <- ape::read.tree(text = "((t1,t2,t3),(t4,t5,t6));") + tr_3 <- ape::read.tree(text = "(((t1,t4),(t2,t5)),(t3,t6));") + tr_1b <- ape::read.tree(text = "((t1,t2),(t3,t4,t5,t6));") + trees <- structure(list(tr_1a, tr_3, tr_1b), class = "multiPhylo") + + r <- MutualClusteringInfo(trees) + + # r is a 3×3 matrix; off-diagonal [i, j] with i > j is pair (j, i) + expect_equal(dim(r), c(3L, 3L)) + + # Cross-validate all three off-diagonal pairs against the single-pair path + # (which uses cpp_mutual_clustering in tree_distances.cpp, not the batch fn) + expect_equal(r[2, 1], MutualClusteringInfo(tr_1a, tr_3), tolerance = 1e-9) + expect_equal(r[3, 1], MutualClusteringInfo(tr_1a, tr_1b), tolerance = 1e-9) + expect_equal(r[3, 2], MutualClusteringInfo(tr_3, tr_1b), tolerance = 1e-9) +}) From 01a7fa2155648a865af341500dfea61db1071ec4 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 14:49:55 +0000 Subject: [PATCH 05/19] Multithreading --- AGENTS.md | 25 ++- R/RcppExports.R | 44 +++- R/parallel.R | 61 ++--- R/tree_distance_info.R | 16 ++ R/tree_distance_utilities.R | 68 ++++-- man/StartParallel.Rd | 62 ++--- man/TreeDistance.Rd | 15 ++ src/RcppExports.cpp | 72 ++++++ src/pairwise_distances.cpp | 436 ++++++++++++++++++++++++++++++++++++ 9 files changed, 726 insertions(+), 73 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index a871f9580..7cb7f775c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -238,6 +238,26 @@ machines where OpenMP is available, and actively harms performance for typical analysis sizes (≤ 200 trees). The fast path therefore bypasses the R cluster entirely when `cluster` is `NULL`. +#### OpenMP rollout to all remaining distance metrics (this dev cycle) + +Extended OpenMP batch computation to all LAP-based and RF-info metrics. +Added score-only static functions and `cpp_*_all_pairs` exports in +`pairwise_distances.cpp` for: + +| C++ batch function | R `Func` intercepted | LAP? | +|---|---|---| +| `cpp_rf_info_all_pairs` | `InfoRobinsonFouldsSplits` | No | +| `cpp_msd_all_pairs` | `MatchingSplitDistanceSplits` | Yes | +| `cpp_msi_all_pairs` | `MatchingSplitInfoSplits` | Yes | +| `cpp_shared_phylo_all_pairs` | `SharedPhylogeneticInfoSplits` | Yes | +| `cpp_jaccard_all_pairs` | `NyeSplitSimilarity`, `JaccardSplitSimilarity` | Yes | + +The R fast-path in `.SplitDistanceAllPairs()` was refactored from a single +`if` block into a unified `if (!is.na(nTip) && is.null(cluster))` guard with +per-function branches, returning early only when a batch function matches. +For `JaccardSplitSimilarity`, `k` and `allowConflict` are extracted from `...` +with sensible defaults (1.0, TRUE) if absent. + --- ## Known Optimization Opportunities / TODOs @@ -253,9 +273,6 @@ entirely when `cluster` is `NULL`. bandwidth further. - SPR distance (`spr.cpp`, `spr_lookup.cpp`): the algorithm is relatively recent (v2.8.0); profiling under VTune may reveal further hot spots. -- OpenMP for other metrics: `pairwise_distances.cpp` currently covers only MCI/CID. - Adding equivalent batch functions for `SharedPhylogeneticInfo`, `MatchingSplitInfo`, - `RobinsonFoulds`, and `MatchingSplitDistance` would extend the speedup; each needs - a score-only variant of its computation. +- OpenMP for other metrics: **DONE** — see "Completed Optimizations" below. - Large-tree path (`int32` migration, v2.12.0 dev): ensure new code paths are as optimized as the original `int16` paths. diff --git a/R/RcppExports.R b/R/RcppExports.R index 2ca5d5977..612219f87 100644 --- a/R/RcppExports.R +++ b/R/RcppExports.R @@ -78,8 +78,48 @@ cpp_nni_distance <- function(edge1, edge2, nTip) { .Call(`_TreeDist_cpp_nni_distance`, edge1, edge2, nTip) } -cpp_mutual_clustering_all_pairs <- function(splits_list, n_tip) { - .Call(`_TreeDist_cpp_mutual_clustering_all_pairs`, splits_list, n_tip) +#' Pairwise mutual clustering information — batch computation +#' +#' Internal function. Computes all pairwise MCI scores for a set of trees, +#' using OpenMP threads when available (falling back to single-threaded +#' execution otherwise). No interrupt checking is performed inside the +#' parallel region; the outer R call remains interruptible between batches. +#' +#' @param splits_list A list of split matrices (class `Splits` or `RawMatrix`), +#' one per tree, all covering the same tip set. Typically the object +#' returned by `as.Splits(trees, tipLabels = labs, asSplits = FALSE)`. +#' @param n_tip Integer; number of tips shared by all trees. +#' @return Numeric vector of length `n*(n-1)/2` containing pairwise MCI +#' scores in `combn(n, 2)` column-major order (i.e. the data payload of +#' an R `dist` object). +#' @keywords internal +cpp_mutual_clustering_all_pairs <- function(splits_list, n_tip, n_threads = 1L) { + .Call(`_TreeDist_cpp_mutual_clustering_all_pairs`, splits_list, n_tip, n_threads) +} + +#' @keywords internal +cpp_rf_info_all_pairs <- function(splits_list, n_tip, n_threads = 1L) { + .Call(`_TreeDist_cpp_rf_info_all_pairs`, splits_list, n_tip, n_threads) +} + +#' @keywords internal +cpp_msd_all_pairs <- function(splits_list, n_tip, n_threads = 1L) { + .Call(`_TreeDist_cpp_msd_all_pairs`, splits_list, n_tip, n_threads) +} + +#' @keywords internal +cpp_msi_all_pairs <- function(splits_list, n_tip, n_threads = 1L) { + .Call(`_TreeDist_cpp_msi_all_pairs`, splits_list, n_tip, n_threads) +} + +#' @keywords internal +cpp_shared_phylo_all_pairs <- function(splits_list, n_tip, n_threads = 1L) { + .Call(`_TreeDist_cpp_shared_phylo_all_pairs`, splits_list, n_tip, n_threads) +} + +#' @keywords internal +cpp_jaccard_all_pairs <- function(splits_list, n_tip, k = 1.0, allow_conflict = TRUE, n_threads = 1L) { + .Call(`_TreeDist_cpp_jaccard_all_pairs`, splits_list, n_tip, k, allow_conflict, n_threads) } path_vector <- function(edge) { diff --git a/R/parallel.R b/R/parallel.R index d8d57458d..9c835b064 100644 --- a/R/parallel.R +++ b/R/parallel.R @@ -3,13 +3,20 @@ #' #' Accelerate distance calculation by employing multiple \acronym{CPU} workers. #' -#' ## OpenMP (recommended for `ClusteringInfoDistance` / `MutualClusteringInfo`) +#' ## OpenMP (recommended for all split-based metrics) #' #' When the package is built with \acronym{OpenMP} support (the default on -#' Linux and Windows; optional on macOS), pairwise -#' [`ClusteringInfoDistance()`] / [`MutualClusteringInfo()`] calculations use -#' an efficient multi-threaded code path automatically — no cluster setup is -#' required. +#' Linux and Windows; optional on macOS), all pairwise split-based distance +#' calculations use an efficient multi-threaded batch path automatically — +#' no cluster setup is required. The affected functions are: +#' +#' - [`ClusteringInfoDistance()`] / [`MutualClusteringInfo()`] +#' - [`SharedPhylogeneticInfo()`] / [`DifferentPhylogeneticInfo()`] +#' - [`MatchingSplitInfo()`] / [`MatchingSplitInfoDistance()`] +#' - [`MatchingSplitDistance()`] +#' - [`InfoRobinsonFoulds()`] +#' - [`NyeSimilarity()`] +#' - [`JaccardRobinsonFoulds()`] #' #' The number of \acronym{OpenMP} threads is controlled by the standard #' `"mc.cores"` option: @@ -19,43 +26,45 @@ #' options(mc.cores = 4L) # or a fixed number #' ``` #' -#' The default is `1` (single-threaded). The \acronym{OpenMP} path is -#' substantially faster than the R-cluster path for typical analysis sizes and -#' is preferred when available. -#' -#' ## R parallel cluster (other metrics) +#' The default is `1` (single-threaded). #' -#' For metrics that do not yet have an \acronym{OpenMP} batch implementation -#' (e.g. [`RobinsonFoulds()`], [`MatchingSplitDistance()`]), "TreeDist" -#' parallelizes via [`parCapply()`] using a cluster stored in -#' `options("TreeDist-cluster")`. +#' ## R parallel cluster #' -#' `StartParallel()` calls `parallel::makeCluster()` and registers the cluster. +#' `StartParallel()` creates an R socket cluster (via [`makeCluster()`]) and +#' registers it for use by TreeDist. `SetParallel()` registers a pre-existing +#' cluster. `StopParallel()` stops the cluster and releases resources. #' -#' `SetParallel()` registers a pre-existing cluster. +#' **When to use `StartParallel()`:** for metrics that do not have an +#' \acronym{OpenMP} batch path, namely tree-object-based distances such as +#' [`NNIDist()`] and [`MASTSize()`] / [`MASTInfo()`], or any function called +#' via [`CompareAll()`]. R-cluster parallelism carries a serialisation overhead +#' of ~2–3 s, so it is only beneficial for large problems. #' -#' `StopParallel()` stops the current TreeDist cluster and releases resources. -#' -#' Note that R-cluster parallelism carries a serialisation overhead of ~2–3 s, -#' so it is only beneficial for large problems (roughly > 500 trees at 50 tips). -#' For [`ClusteringInfoDistance()`] the \acronym{OpenMP} path is faster at -#' every problem size and is used automatically when no cluster is registered. +#' **When _not_ to use `StartParallel()`:** for the split-based metrics listed +#' above. Registering a cluster disables the \acronym{OpenMP} batch path for +#' those functions, replacing a thread-local C++ loop with inter-process +#' communication — which is slower at every problem size measured. Call +#' `StopParallel()` before computing split-based distances if a cluster is +#' active. #' #' @param \dots Parameters to pass to [`makeCluster()`]. #' @param cl An existing cluster. #' #' @examples -#' # OpenMP parallelism: just set mc.cores before calling distance functions. +#' # OpenMP parallelism: set mc.cores before calling any split-based metric. #' options(mc.cores = 2L) -#' # ClusteringInfoDistance(trees) # now uses 2 OpenMP threads +#' # MutualClusteringInfo(trees) # uses 2 OpenMP threads automatically #' options(mc.cores = NULL) # restore default (single-threaded) #' -#' if (interactive()) { # R cluster: only worthwhile for non-OpenMP metrics +#' if (interactive()) { +#' # R cluster: beneficial for NNIDist, MASTSize/MASTInfo, CompareAll(), etc. +#' # Do NOT activate while computing split-based distances (MCI, SPI, MSI, …) +#' # as it bypasses the faster OpenMP path. #' library("TreeTools", quietly = TRUE) #' nCores <- ceiling(parallel::detectCores() / 2) #' StartParallel(nCores) # Takes a few seconds to set up processes #' GetParallel() -#' RobinsonFoulds(as.phylo(0:6, 100)) +#' CompareAll(as.phylo(0:6, 100), NNIDist) #' StopParallel() # Returns system resources #' } #' @template MRS diff --git a/R/tree_distance_info.R b/R/tree_distance_info.R index 5052ac8c2..dcc8ea9ee 100644 --- a/R/tree_distance_info.R +++ b/R/tree_distance_info.R @@ -121,6 +121,22 @@ #' which the maintainer plans to attempt in the future; please [comment on GitHub]( #' https://github.com/ms609/TreeTools/issues/141) if you would find this useful. #' +#' # Parallelism +#' +#' When `tree2 = NULL` and all trees share the same tip labels, pairwise +#' distance calculation uses a multi-threaded \acronym{OpenMP} batch path +#' automatically. Control the number of threads with the `"mc.cores"` option: +#' +#' ```r +#' options(mc.cores = parallel::detectCores()) # use all cores +#' options(mc.cores = 4L) # or a fixed number +#' ``` +#' +#' Do **not** call [`StartParallel()`] for these functions: a registered +#' R cluster disables the \acronym{OpenMP} path and replaces it with slower +#' inter-process communication. See [`StartParallel()`] for full guidance +#' on when an R cluster is appropriate. +#' #' @template tree12ListParams #' #' @param normalize If a numeric value is provided, this will be used as a diff --git a/R/tree_distance_utilities.R b/R/tree_distance_utilities.R index 0509ed0eb..5e8ed907f 100644 --- a/R/tree_distance_utilities.R +++ b/R/tree_distance_utilities.R @@ -129,21 +129,59 @@ CalculateTreeDistance <- function(Func, tree1, tree2 = NULL, nTip = length(tipLabels), ...) { cluster <- getOption("TreeDist-cluster") - # Fast path: use the OpenMP batch function for mutual clustering when all - # trees share the same tip set and no R-level cluster has been configured. - # Matches the behaviour of the generic path but avoids per-pair R overhead. - if (!is.na(nTip) && is.null(cluster) && - identical(Func, MutualClusteringInfoSplits)) { - splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) - return(structure( - cpp_mutual_clustering_all_pairs(splits, as.integer(nTip), - as.integer(getOption("mc.cores", 1L))), - class = "dist", - Size = length(splits1), - Labels = names(splits1), - Diag = FALSE, - Upper = FALSE - )) + # Fast paths: use OpenMP batch functions when all trees share the same tip + # set and no R-level cluster has been configured. Each branch mirrors the + # generic path exactly but avoids per-pair R overhead. + if (!is.na(nTip) && is.null(cluster)) { + .n_threads <- as.integer(getOption("mc.cores", 1L)) + .batch_result <- if (identical(Func, MutualClusteringInfoSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_mutual_clustering_all_pairs(splits, as.integer(nTip), .n_threads) + + } else if (identical(Func, InfoRobinsonFouldsSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_rf_info_all_pairs(splits, as.integer(nTip), .n_threads) + + } else if (identical(Func, MatchingSplitDistanceSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_msd_all_pairs(splits, as.integer(nTip), .n_threads) + + } else if (identical(Func, MatchingSplitInfoSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_msi_all_pairs(splits, as.integer(nTip), .n_threads) + + } else if (identical(Func, SharedPhylogeneticInfoSplits)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_shared_phylo_all_pairs(splits, as.integer(nTip), .n_threads) + + } else if (identical(Func, NyeSplitSimilarity)) { + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_jaccard_all_pairs(splits, as.integer(nTip), + k = 1.0, allow_conflict = TRUE, .n_threads) + + } else if (identical(Func, JaccardSplitSimilarity)) { + dots <- list(...) + splits <- as.Splits(splits1, tipLabels = tipLabels, asSplits = FALSE) + cpp_jaccard_all_pairs( + splits, as.integer(nTip), + k = as.double(if ("k" %in% names(dots)) dots[["k"]] else 1L), + allow_conflict = as.logical( + if ("allowConflict" %in% names(dots)) dots[["allowConflict"]] else TRUE), + .n_threads + ) + + } else { + NULL + } + + if (!is.null(.batch_result)) { + return(structure(.batch_result, + class = "dist", + Size = length(splits1), + Labels = names(splits1), + Diag = FALSE, + Upper = FALSE)) + } } if (is.na(nTip)) { diff --git a/man/StartParallel.Rd b/man/StartParallel.Rd index 7fef07ffc..1a918897e 100644 --- a/man/StartParallel.Rd +++ b/man/StartParallel.Rd @@ -35,13 +35,21 @@ StopParallel(quietly = FALSE) Accelerate distance calculation by employing multiple \acronym{CPU} workers. } \details{ -\subsection{OpenMP (recommended for \code{ClusteringInfoDistance} / \code{MutualClusteringInfo})}{ +\subsection{OpenMP (recommended for all split-based metrics)}{ When the package is built with \acronym{OpenMP} support (the default on -Linux and Windows; optional on macOS), pairwise -\code{\link[=ClusteringInfoDistance]{ClusteringInfoDistance()}} / \code{\link[=MutualClusteringInfo]{MutualClusteringInfo()}} calculations use -an efficient multi-threaded code path automatically — no cluster setup is -required. +Linux and Windows; optional on macOS), all pairwise split-based distance +calculations use an efficient multi-threaded batch path automatically — +no cluster setup is required. The affected functions are: +\itemize{ +\item \code{\link[=ClusteringInfoDistance]{ClusteringInfoDistance()}} / \code{\link[=MutualClusteringInfo]{MutualClusteringInfo()}} +\item \code{\link[=SharedPhylogeneticInfo]{SharedPhylogeneticInfo()}} / \code{\link[=DifferentPhylogeneticInfo]{DifferentPhylogeneticInfo()}} +\item \code{\link[=MatchingSplitInfo]{MatchingSplitInfo()}} / \code{\link[=MatchingSplitInfoDistance]{MatchingSplitInfoDistance()}} +\item \code{\link[=MatchingSplitDistance]{MatchingSplitDistance()}} +\item \code{\link[=InfoRobinsonFoulds]{InfoRobinsonFoulds()}} +\item \code{\link[=NyeSimilarity]{NyeSimilarity()}} +\item \code{\link[=JaccardRobinsonFoulds]{JaccardRobinsonFoulds()}} +} The number of \acronym{OpenMP} threads is controlled by the standard \code{"mc.cores"} option: @@ -50,42 +58,44 @@ The number of \acronym{OpenMP} threads is controlled by the standard options(mc.cores = 4L) # or a fixed number }\if{html}{\out{}} -The default is \code{1} (single-threaded). The \acronym{OpenMP} path is -substantially faster than the R-cluster path for typical analysis sizes and -is preferred when available. +The default is \code{1} (single-threaded). } -\subsection{R parallel cluster (other metrics)}{ - -For metrics that do not yet have an \acronym{OpenMP} batch implementation -(e.g. \code{\link[=RobinsonFoulds]{RobinsonFoulds()}}, \code{\link[=MatchingSplitDistance]{MatchingSplitDistance()}}), "TreeDist" -parallelizes via \code{\link[=parCapply]{parCapply()}} using a cluster stored in -\code{options("TreeDist-cluster")}. - -\code{StartParallel()} calls \code{parallel::makeCluster()} and registers the cluster. +\subsection{R parallel cluster}{ -\code{SetParallel()} registers a pre-existing cluster. +\code{StartParallel()} creates an R socket cluster (via \code{\link[=makeCluster]{makeCluster()}}) and +registers it for use by TreeDist. \code{SetParallel()} registers a pre-existing +cluster. \code{StopParallel()} stops the cluster and releases resources. -\code{StopParallel()} stops the current TreeDist cluster and releases resources. +\strong{When to use \code{StartParallel()}:} for metrics that do not have an +\acronym{OpenMP} batch path, namely tree-object-based distances such as +\code{\link[=NNIDist]{NNIDist()}} and \code{\link[=MASTSize]{MASTSize()}} / \code{\link[=MASTInfo]{MASTInfo()}}, or any function called +via \code{\link[=CompareAll]{CompareAll()}}. R-cluster parallelism carries a serialisation overhead +of ~2–3 s, so it is only beneficial for large problems. -Note that R-cluster parallelism carries a serialisation overhead of ~2–3 s, -so it is only beneficial for large problems (roughly > 500 trees at 50 tips). -For \code{\link[=ClusteringInfoDistance]{ClusteringInfoDistance()}} the \acronym{OpenMP} path is faster at -every problem size and is used automatically when no cluster is registered. +\strong{When \emph{not} to use \code{StartParallel()}:} for the split-based metrics listed +above. Registering a cluster disables the \acronym{OpenMP} batch path for +those functions, replacing a thread-local C++ loop with inter-process +communication — which is slower at every problem size measured. Call +\code{StopParallel()} before computing split-based distances if a cluster is +active. } } \examples{ -# OpenMP parallelism: just set mc.cores before calling distance functions. +# OpenMP parallelism: set mc.cores before calling any split-based metric. options(mc.cores = 2L) -# ClusteringInfoDistance(trees) # now uses 2 OpenMP threads +# MutualClusteringInfo(trees) # uses 2 OpenMP threads automatically options(mc.cores = NULL) # restore default (single-threaded) -if (interactive()) { # R cluster: only worthwhile for non-OpenMP metrics +if (interactive()) { + # R cluster: beneficial for NNIDist, MASTSize/MASTInfo, CompareAll(), etc. + # Do NOT activate while computing split-based distances (MCI, SPI, MSI, …) + # as it bypasses the faster OpenMP path. library("TreeTools", quietly = TRUE) nCores <- ceiling(parallel::detectCores() / 2) StartParallel(nCores) # Takes a few seconds to set up processes GetParallel() - RobinsonFoulds(as.phylo(0:6, 100)) + CompareAll(as.phylo(0:6, 100), NNIDist) StopParallel() # Returns system resources } } diff --git a/man/TreeDistance.Rd b/man/TreeDistance.Rd index feede9e80..352d169c4 100644 --- a/man/TreeDistance.Rd +++ b/man/TreeDistance.Rd @@ -258,6 +258,21 @@ Trees with over 8192 leaves require further modification of the source code, which the maintainer plans to attempt in the future; please \href{https://github.com/ms609/TreeTools/issues/141}{comment on GitHub} if you would find this useful. } +\section{Parallelism}{ +When \code{tree2 = NULL} and all trees share the same tip labels, pairwise +distance calculation uses a multi-threaded \acronym{OpenMP} batch path +automatically. Control the number of threads with the \code{"mc.cores"} option: + +\if{html}{\out{
}}\preformatted{options(mc.cores = parallel::detectCores()) # use all cores +options(mc.cores = 4L) # or a fixed number +}\if{html}{\out{
}} + +Do \strong{not} call \code{\link[=StartParallel]{StartParallel()}} for these functions: a registered +R cluster disables the \acronym{OpenMP} path and replaces it with slower +inter-process communication. See \code{\link[=StartParallel]{StartParallel()}} for full guidance +on when an R cluster is appropriate. +} + \examples{ tree1 <- ape::read.tree(text="((((a, b), c), d), (e, (f, (g, h))));") tree2 <- ape::read.tree(text="(((a, b), (c, d)), ((e, f), (g, h)));") diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp index c8702cdbf..280865e7b 100644 --- a/src/RcppExports.cpp +++ b/src/RcppExports.cpp @@ -224,6 +224,73 @@ BEGIN_RCPP return rcpp_result_gen; END_RCPP } +// cpp_rf_info_all_pairs +NumericVector cpp_rf_info_all_pairs(const List& splits_list, const int n_tip, const int n_threads); +RcppExport SEXP _TreeDist_cpp_rf_info_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP n_threadsSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_rf_info_all_pairs(splits_list, n_tip, n_threads)); + return rcpp_result_gen; +END_RCPP +} +// cpp_msd_all_pairs +NumericVector cpp_msd_all_pairs(const List& splits_list, const int n_tip, const int n_threads); +RcppExport SEXP _TreeDist_cpp_msd_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP n_threadsSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_msd_all_pairs(splits_list, n_tip, n_threads)); + return rcpp_result_gen; +END_RCPP +} +// cpp_msi_all_pairs +NumericVector cpp_msi_all_pairs(const List& splits_list, const int n_tip, const int n_threads); +RcppExport SEXP _TreeDist_cpp_msi_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP n_threadsSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_msi_all_pairs(splits_list, n_tip, n_threads)); + return rcpp_result_gen; +END_RCPP +} +// cpp_shared_phylo_all_pairs +NumericVector cpp_shared_phylo_all_pairs(const List& splits_list, const int n_tip, const int n_threads); +RcppExport SEXP _TreeDist_cpp_shared_phylo_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP n_threadsSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_shared_phylo_all_pairs(splits_list, n_tip, n_threads)); + return rcpp_result_gen; +END_RCPP +} +// cpp_jaccard_all_pairs +NumericVector cpp_jaccard_all_pairs(const List& splits_list, const int n_tip, const double k, const bool allow_conflict, const int n_threads); +RcppExport SEXP _TreeDist_cpp_jaccard_all_pairs(SEXP splits_listSEXP, SEXP n_tipSEXP, SEXP kSEXP, SEXP allow_conflictSEXP, SEXP n_threadsSEXP) { +BEGIN_RCPP + Rcpp::RObject rcpp_result_gen; + Rcpp::RNGScope rcpp_rngScope_gen; + Rcpp::traits::input_parameter< const List& >::type splits_list(splits_listSEXP); + Rcpp::traits::input_parameter< const int >::type n_tip(n_tipSEXP); + Rcpp::traits::input_parameter< const double >::type k(kSEXP); + Rcpp::traits::input_parameter< const bool >::type allow_conflict(allow_conflictSEXP); + Rcpp::traits::input_parameter< const int >::type n_threads(n_threadsSEXP); + rcpp_result_gen = Rcpp::wrap(cpp_jaccard_all_pairs(splits_list, n_tip, k, allow_conflict, n_threads)); + return rcpp_result_gen; +END_RCPP +} // path_vector IntegerVector path_vector(IntegerMatrix edge); RcppExport SEXP _TreeDist_path_vector(SEXP edgeSEXP) { @@ -446,6 +513,11 @@ static const R_CallMethodDef CallEntries[] = { {"_TreeDist_cpp_mast", (DL_FUNC) &_TreeDist_cpp_mast, 3}, {"_TreeDist_cpp_nni_distance", (DL_FUNC) &_TreeDist_cpp_nni_distance, 3}, {"_TreeDist_cpp_mutual_clustering_all_pairs", (DL_FUNC) &_TreeDist_cpp_mutual_clustering_all_pairs, 3}, + {"_TreeDist_cpp_rf_info_all_pairs", (DL_FUNC) &_TreeDist_cpp_rf_info_all_pairs, 3}, + {"_TreeDist_cpp_msd_all_pairs", (DL_FUNC) &_TreeDist_cpp_msd_all_pairs, 3}, + {"_TreeDist_cpp_msi_all_pairs", (DL_FUNC) &_TreeDist_cpp_msi_all_pairs, 3}, + {"_TreeDist_cpp_shared_phylo_all_pairs", (DL_FUNC) &_TreeDist_cpp_shared_phylo_all_pairs, 3}, + {"_TreeDist_cpp_jaccard_all_pairs", (DL_FUNC) &_TreeDist_cpp_jaccard_all_pairs, 5}, {"_TreeDist_path_vector", (DL_FUNC) &_TreeDist_path_vector, 1}, {"_TreeDist_vec_diff_euclidean", (DL_FUNC) &_TreeDist_vec_diff_euclidean, 2}, {"_TreeDist_pair_diff_euclidean", (DL_FUNC) &_TreeDist_pair_diff_euclidean, 1}, diff --git a/src/pairwise_distances.cpp b/src/pairwise_distances.cpp index f591ec812..a35f4958e 100644 --- a/src/pairwise_distances.cpp +++ b/src/pairwise_distances.cpp @@ -11,6 +11,9 @@ #endif #include +#include +#include +#include #include #include #include @@ -198,3 +201,436 @@ NumericVector cpp_mutual_clustering_all_pairs( return result; } + + +// ============================================================================= +// InfoRobinsonFoulds (rf_info) — no LAP, thread-safe +// ============================================================================= + +static double rf_info_score( + const SplitList& a, const SplitList& b, const int32 n_tips +) { + const int16 last_bin = a.n_bins - 1; + const int16 unset_tips = (n_tips % SL_BIN_SIZE) ? + SL_BIN_SIZE - n_tips % SL_BIN_SIZE : 0; + const splitbit unset_mask = ALL_ONES >> unset_tips; + const double lg2_unrooted_n = lg2_unrooted[n_tips]; + double score = 0; + + splitbit b_complement[SL_MAX_SPLITS][SL_MAX_BINS]; + for (int16 i = 0; i < b.n_splits; ++i) { + for (int16 bin = 0; bin < last_bin; ++bin) { + b_complement[i][bin] = ~b.state[i][bin]; + } + b_complement[i][last_bin] = b.state[i][last_bin] ^ unset_mask; + } + + for (int16 ai = 0; ai < a.n_splits; ++ai) { + for (int16 bi = 0; bi < b.n_splits; ++bi) { + bool all_match = true, all_complement = true; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + if (a.state[ai][bin] != b.state[bi][bin]) { all_match = false; break; } + } + if (!all_match) { + for (int16 bin = 0; bin < a.n_bins; ++bin) { + if (a.state[ai][bin] != b_complement[bi][bin]) { + all_complement = false; break; + } + } + } + if (all_match || all_complement) { + int16 leaves_in_split = 0; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + leaves_in_split += count_bits(a.state[ai][bin]); + } + score += lg2_unrooted_n + - lg2_rooted[leaves_in_split] + - lg2_rooted[n_tips - leaves_in_split]; + break; + } + } + } + return score; +} + +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_rf_info_all_pairs( + const List& splits_list, + const int n_tip, + const int n_threads = 1 +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + const int n_pairs = N * (N - 1) / 2; + + std::vector> splits; + splits.reserve(N); + for (int k = 0; k < N; ++k) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[k])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = rf_info_score(*splits[col], *splits[row], n_tip); + } + } + return result; +} + + +// ============================================================================= +// MatchingSplitDistance — LAP-based +// ============================================================================= + +static double msd_score( + const SplitList& a, const SplitList& b, const int32 n_tips +) { + const int16 most_splits = std::max(a.n_splits, b.n_splits); + if (most_splits == 0) return 0.0; + const int16 split_diff = most_splits - std::min(a.n_splits, b.n_splits); + const int16 half_tips = n_tips / 2; + const cost max_score = BIG / most_splits; + + cost_matrix score(most_splits); + + for (int16 ai = 0; ai < a.n_splits; ++ai) { + for (int16 bi = 0; bi < b.n_splits; ++bi) { + splitbit total = 0; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + total += count_bits(a.state[ai][bin] ^ b.state[bi][bin]); + } + score(ai, bi) = total; + } + for (int16 bi = 0; bi < b.n_splits; ++bi) { + if (score(ai, bi) > half_tips) score(ai, bi) = n_tips - score(ai, bi); + } + score.padRowAfterCol(ai, b.n_splits, max_score); + } + score.padAfterRow(a.n_splits, max_score); + + std::vector rowsol(most_splits); + std::vector colsol(most_splits); + + return static_cast( + lap(most_splits, score, rowsol, colsol, false) - max_score * split_diff + ); +} + +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_msd_all_pairs( + const List& splits_list, + const int n_tip, + const int n_threads = 1 +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + const int n_pairs = N * (N - 1) / 2; + + std::vector> splits; + splits.reserve(N); + for (int k = 0; k < N; ++k) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[k])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = msd_score(*splits[col], *splits[row], n_tip); + } + } + return result; +} + + +// ============================================================================= +// MatchingSplitInfo — LAP-based +// ============================================================================= + +static double msi_score( + const SplitList& a, const SplitList& b, const int32 n_tips +) { + const int16 most_splits = std::max(a.n_splits, b.n_splits); + if (most_splits == 0) return 0.0; + + constexpr cost max_score = BIG; + const double max_possible = lg2_unrooted[n_tips] + - lg2_rooted[int16((n_tips + 1) / 2)] + - lg2_rooted[int16(n_tips / 2)]; + const double score_over_possible = static_cast(max_score) / max_possible; + const double possible_over_score = max_possible / static_cast(max_score); + + cost_matrix score(most_splits); + splitbit different[SL_MAX_BINS]; + + for (int16 ai = 0; ai < a.n_splits; ++ai) { + for (int16 bi = 0; bi < b.n_splits; ++bi) { + int16 n_different = 0, n_a_only = 0, n_a_and_b = 0; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + different[bin] = a.state[ai][bin] ^ b.state[bi][bin]; + n_different += count_bits(different[bin]); + n_a_only += count_bits(a.state[ai][bin] & different[bin]); + n_a_and_b += count_bits(a.state[ai][bin] & ~different[bin]); + } + const int16 n_same = n_tips - n_different; + score(ai, bi) = cost(max_score - score_over_possible * + TreeDist::mmsi_score(n_same, n_a_and_b, n_different, n_a_only)); + } + score.padRowAfterCol(ai, b.n_splits, max_score); + } + score.padAfterRow(a.n_splits, max_score); + + std::vector rowsol(most_splits); + std::vector colsol(most_splits); + + return static_cast( + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + ) * possible_over_score; +} + +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_msi_all_pairs( + const List& splits_list, + const int n_tip, + const int n_threads = 1 +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + const int n_pairs = N * (N - 1) / 2; + + std::vector> splits; + splits.reserve(N); + for (int k = 0; k < N; ++k) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[k])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = msi_score(*splits[col], *splits[row], n_tip); + } + } + return result; +} + + +// ============================================================================= +// SharedPhylogeneticInfo — LAP-based +// ============================================================================= + +static double shared_phylo_score( + const SplitList& a, const SplitList& b, const int32 n_tips +) { + const int16 most_splits = std::max(a.n_splits, b.n_splits); + if (most_splits == 0) return 0.0; + + const int16 overlap_a = int16(n_tips + 1) / 2; + constexpr cost max_score = BIG; + const double best_overlap = TreeDist::one_overlap(overlap_a, n_tips / 2, n_tips); + const double max_possible = lg2_unrooted[n_tips] - best_overlap; + const double score_over_possible = static_cast(max_score) / max_possible; + const double possible_over_score = max_possible / static_cast(max_score); + + cost_matrix score(most_splits); + + for (int16 ai = a.n_splits; ai--; ) { + for (int16 bi = b.n_splits; bi--; ) { + const double spi = TreeDist::spi_overlap( + a.state[ai], b.state[bi], n_tips, + a.in_split[ai], b.in_split[bi], a.n_bins); + score(ai, bi) = (spi == 0.0) ? max_score + : cost((spi - best_overlap) * score_over_possible); + } + score.padRowAfterCol(ai, b.n_splits, max_score); + } + score.padAfterRow(a.n_splits, max_score); + + std::vector rowsol(most_splits); + std::vector colsol(most_splits); + + return static_cast( + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + ) * possible_over_score; +} + +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_shared_phylo_all_pairs( + const List& splits_list, + const int n_tip, + const int n_threads = 1 +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + const int n_pairs = N * (N - 1) / 2; + + std::vector> splits; + splits.reserve(N); + for (int k = 0; k < N; ++k) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[k])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = shared_phylo_score(*splits[col], *splits[row], n_tip); + } + } + return result; +} + + +// ============================================================================= +// Jaccard / Nye similarity — LAP-based; k and allow_conflict are per-call +// ============================================================================= + +static double jaccard_score( + const SplitList& a, const SplitList& b, const int32 n_tips, + const double exponent, const bool allow_conflict +) { + const int16 most_splits = std::max(a.n_splits, b.n_splits); + if (most_splits == 0) return 0.0; + + constexpr cost max_score = BIG; + constexpr double max_scoreL = static_cast(max_score); + + cost_matrix score(most_splits); + + for (int16 ai = 0; ai < a.n_splits; ++ai) { + const int16 na = a.in_split[ai]; + const int16 nA = n_tips - na; + + for (int16 bi = 0; bi < b.n_splits; ++bi) { + int16 a_and_b = 0; + for (int16 bin = 0; bin < a.n_bins; ++bin) { + a_and_b += count_bits(a.state[ai][bin] & b.state[bi][bin]); + } + const int16 nb = b.in_split[bi]; + const int16 nB = n_tips - nb; + const int16 a_and_B = na - a_and_b; + const int16 A_and_b = nb - a_and_b; + const int16 A_and_B = nB - a_and_B; + + if (!allow_conflict && !( + a_and_b == na || a_and_B == na || + A_and_b == nA || A_and_B == nA)) { + score(ai, bi) = max_score; + } else { + const int16 A_or_b = n_tips - a_and_B; + const int16 a_or_B = n_tips - A_and_b; + const int16 a_or_b = n_tips - A_and_B; + const int16 A_or_B = n_tips - a_and_b; + const double ars_ab = static_cast(a_and_b) / static_cast(a_or_b); + const double ars_Ab = static_cast(A_and_b) / static_cast(A_or_b); + const double ars_aB = static_cast(a_and_B) / static_cast(a_or_B); + const double ars_AB = static_cast(A_and_B) / static_cast(A_or_B); + const double min_both = (ars_ab < ars_AB) ? ars_ab : ars_AB; + const double min_either = (ars_aB < ars_Ab) ? ars_aB : ars_Ab; + const double best = (min_both > min_either) ? min_both : min_either; + + if (exponent == 1.0) { + score(ai, bi) = cost(max_scoreL - max_scoreL * best); + } else if (std::isinf(exponent)) { + score(ai, bi) = cost((best == 1.0) ? 0.0 : max_scoreL); + } else { + score(ai, bi) = cost(max_scoreL - max_scoreL * std::pow(best, exponent)); + } + } + } + score.padRowAfterCol(ai, b.n_splits, max_score); + } + score.padAfterRow(a.n_splits, max_score); + + std::vector rowsol(most_splits); + std::vector colsol(most_splits); + + return static_cast( + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + ) / max_scoreL; +} + +//' @keywords internal +// [[Rcpp::export]] +NumericVector cpp_jaccard_all_pairs( + const List& splits_list, + const int n_tip, + const double k = 1.0, + const bool allow_conflict = true, + const int n_threads = 1 +) { + const int N = splits_list.size(); + if (N < 2) return NumericVector(0); + const int n_pairs = N * (N - 1) / 2; + + std::vector> splits; + splits.reserve(N); + for (int i = 0; i < N; ++i) { + splits.push_back( + std::make_unique(Rcpp::as(splits_list[i])) + ); + } + + NumericVector result(n_pairs); + double* const res = result.begin(); + +#ifdef _OPENMP +#pragma omp parallel for schedule(dynamic) num_threads(n_threads) +#endif + for (int col = 0; col < N - 1; ++col) { +#ifndef _OPENMP + Rcpp::checkUserInterrupt(); +#endif + for (int row = col + 1; row < N; ++row) { + const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; + res[p] = jaccard_score(*splits[col], *splits[row], n_tip, k, allow_conflict); + } + } + return result; +} From f6cd4db91aa894e15005b721be13e7a80275cb31 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 14:58:54 +0000 Subject: [PATCH 06/19] OpenMP --- inst/WORDLIST | 1 + 1 file changed, 1 insertion(+) diff --git a/inst/WORDLIST b/inst/WORDLIST index ac56518cf..edb7570a4 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -43,6 +43,7 @@ NNIDist NyeTreeSimilarity OEIS ORCID +OpenMP Perotti PhysRevE PlotTools From 283fb19fb744e9ff44cdd7b0e07e030985720017 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 14:59:56 +0000 Subject: [PATCH 07/19] "Neil", "Kaye", role = c("cph"), --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 319cd6e61..59fbaebda 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -11,7 +11,7 @@ Authors@R: c(person("Martin R.", "Smith", person("Yong", "Yang", email = "yongyanglink@gmail.com", role = c("ctb", "cph"), comment = "LAP algorithm"), person("Yi", "Cao", role = c("ctb", "cph"), comment = "LAP algorithm"), - person("Neil", "Kaye", comment = "Mercator image") + person("Neil", "Kaye", role = c("cph"), comment = "Mercator image") ) License: GPL (>= 3) Description: Implements measures of tree similarity, including From bae52e4632b36a6b3063a521422000fefc81ee0f Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 15:42:15 +0000 Subject: [PATCH 08/19] macOS --- inst/WORDLIST | 1 + 1 file changed, 1 insertion(+) diff --git a/inst/WORDLIST b/inst/WORDLIST index edb7570a4..cd993881d 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -98,6 +98,7 @@ ingroup interdecile jonker magiclogic +macOS mergesort molbev msw From fb6d61723ed431d56682adc39ce20963e8c3ea9c Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 15:48:06 +0000 Subject: [PATCH 09/19] cache lt comparison --- src/lap.cpp | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/lap.cpp b/src/lap.cpp index 73f6c5989..9614cb992 100644 --- a/src/lap.cpp +++ b/src/lap.cpp @@ -166,7 +166,8 @@ cost lap(const lap_row dim, lap_col j1 = min_idx; lap_row i0 = colsol[j1]; - if (nontrivially_less_than(umin, usubmin)) { + const bool strictly_less = nontrivially_less_than(umin, usubmin); + if (strictly_less) { // Change the reduction of the minimum column to increase the minimum // reduced cost in the row to the subminimum. v[j1] -= (usubmin - umin); @@ -183,7 +184,7 @@ cost lap(const lap_row dim, colsol[j1] = i; if (i0 > -1) { // Minimum column j1 assigned earlier. - if (nontrivially_less_than(umin, usubmin)) { + if (strictly_less) { // Put in current k, and go back to that k. // Continue augmenting path i - j1 with i0. freeunassigned[--k] = i0; From da4798dcfdf97d8ac14766de3cf9785bcf7dda32 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Tue, 10 Mar 2026 15:52:14 +0000 Subject: [PATCH 10/19] Plan & prevent plot interruption --- AGENTS.md | 53 ++++++++++++++++++++++++++++++++++++++++++ tests/testthat/setup.R | 7 ++++++ 2 files changed, 60 insertions(+) create mode 100644 tests/testthat/setup.R diff --git a/AGENTS.md b/AGENTS.md index 7cb7f775c..3ae99878b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -260,6 +260,59 @@ with sensible defaults (1.0, TRUE) if absent. --- +## LAP Profiling Notes + +### Scaling behaviour (debug build, uniform random matrices) + +| n | median (μs) | implied α | +|---|---|---| +| 10 | 14.8 | — | +| 25 | 28.1 | 0.70 | +| 47 | 76.6 | 1.59 | +| 97 | 377 | 2.20 | +| 197 | 1798 | 2.21 | +| 400 | 14953 | 2.99 | + +For the actual tree-distance workload (n ≈ 47 for 50-tip trees, n ≈ 197 for +200-tip trees), the solver is already well into super-quadratic territory. +The Dijkstra augmentation phase is the dominant cost from n ≈ 100 upward. + +### Phase analysis + +- **Column reduction** (n² total, O(n) per column via transposed sequential scan): + already well-optimised; the one-time transpose pays for itself immediately. +- **Reduction transfer** (n² total): was blocked from SIMD auto-vectorisation + by an `if (j == j1) continue;` branch inside the minimum-search loop. + **Fixed** by splitting around `j1` into two clean vectorisable loops. +- **Augmenting row reduction** (2 × free_rows × n): calls `findRowSubmin`, + already 4× manually unrolled. `nontrivially_less_than()` was called twice + per free row with identical arguments; **fixed** by caching as `strictly_less`. +- **Augment solution / Dijkstra** (free_rows × n²): the dominant phase for + n ≥ 100. The inner update loop iterates over `col_list[up..dim-1]` via + indirect `j = col_list[k]` indexing, then accesses `row_i[j]`, `v_ptr[j]`, + and `d[j]` as scattered gathers — preventing SIMD vectorisation. + +### Dijkstra restructuring opportunity (unimplemented — needs release-build VTune) + +Replace the `col_list` permutation with a `bool scanned[dim]` mask and iterate +directly over `0..dim-1`. Pros: sequential access to `row_i`, `v_ptr`, `d` +→ auto-vectorisable; no indirect gather. Cons: visits all `dim` columns each +iteration (vs progressively fewer with `col_list`), so ~2× more comparisons in +the best case. Net benefit is uncertain and must be measured on a release build. + +**To profile:** build with `PKG_CXXFLAGS="-O2 -g -fno-omit-frame-pointer"` and +run `benchmark/vtune-driver.R` through VTune hotspot collection. + +### Test suite fix + +Added `tests/testthat/setup.R` that opens a null PDF device for the duration of +all test runs. This suppresses bare `plot()` / `TreeDistPlot()` calls in tests +from appearing in the interactive graphics device, and prevents vdiffr snapshot +rendering from leaking to screen. vdiffr opens its own `svglite` device on top +of the null device, so snapshot tests are unaffected. + +--- + ## Known Optimization Opportunities / TODOs - `information.h` line 19: comment suggests considering increasing `MAX_FACTORIAL_LOOKUP` diff --git a/tests/testthat/setup.R b/tests/testthat/setup.R new file mode 100644 index 000000000..d3da6c074 --- /dev/null +++ b/tests/testthat/setup.R @@ -0,0 +1,7 @@ +# Redirect all graphics to a null device for the test session. +# This prevents bare plot() calls and vdiffr snapshot rendering from +# appearing in the interactive graphics device (e.g. the RStudio viewer). +# vdiffr opens its own svglite device on top of this one, so snapshot +# tests are unaffected. +pdf(NULL) +withr::defer(grDevices::dev.off(), teardown_env()) From 73c71bd1f88649ef73270a0be7ba955e6f6aa188 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Wed, 11 Mar 2026 07:56:08 +0000 Subject: [PATCH 11/19] LapScratch Modest speedup --- src/lap.cpp | 18 ++++--- src/pairwise_distances.cpp | 104 +++++++++++++++++++++++++++---------- src/tree_distances.h | 61 ++++++++++++++++++++-- 3 files changed, 147 insertions(+), 36 deletions(-) diff --git a/src/lap.cpp b/src/lap.cpp index 9614cb992..2e09138f6 100644 --- a/src/lap.cpp +++ b/src/lap.cpp @@ -70,7 +70,8 @@ cost lap(const lap_row dim, cost_matrix &input_cost, std::vector &rowsol, std::vector &colsol, - const bool allow_interrupt) + const bool allow_interrupt, + LapScratch &scratch) // input: // dim - problem size @@ -82,9 +83,12 @@ cost lap(const lap_row dim, { lap_row num_free = 0; - alignas(64) std::vector v(((dim + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE); + scratch.ensure(dim); + auto& v = scratch.v; + auto& matches = scratch.matches; + // matches must start at zero for the column-reduction counter + std::fill(matches.begin(), matches.begin() + dim, 0); const cost* __restrict__ v_ptr = v.data(); - std::vector matches(dim); // Counts how many times a row could be assigned. // COLUMN REDUCTION for (lap_col j = dim; j--; ) { // Reverse order gives better results. @@ -108,7 +112,7 @@ cost lap(const lap_row dim, } // REDUCTION TRANSFER - std::vector freeunassigned(dim); // List of unassigned rows. + auto& freeunassigned = scratch.freeunassigned; // List of unassigned rows. for (lap_row i = 0; i < dim; ++i) { if (matches[i] == 0) { @@ -146,7 +150,7 @@ cost lap(const lap_row dim, } // AUGMENTING ROW REDUCTION - std::vector col_list(dim); // List of columns to be scanned in various ways. + auto& col_list = scratch.col_list; // List of columns to be scanned in various ways. int loopcnt = 0; // do-loop to be done twice. do { @@ -199,8 +203,8 @@ cost lap(const lap_row dim, } while (loopcnt < 2); // Repeat once. // AUGMENT SOLUTION for each free row. - std::vector d(dim); // 'Cost-distance' in augmenting path calculation. - std::vector predecessor(dim); // Row-predecessor of column in augmenting/alternating path. + auto& d = scratch.d; // 'Cost-distance' in augmenting path calculation. + auto& predecessor = scratch.predecessor; // Row-predecessor of column in augmenting/alternating path. for (lap_row f = 0; f < num_free; ++f) { bool unassignedfound = false; diff --git a/src/pairwise_distances.cpp b/src/pairwise_distances.cpp index a35f4958e..f4b559444 100644 --- a/src/pairwise_distances.cpp +++ b/src/pairwise_distances.cpp @@ -32,7 +32,8 @@ using TreeTools::count_bits; // Passes allow_interrupt = false to lap() so it is safe to call from an // OpenMP parallel region. static double mutual_clustering_score( - const SplitList& a, const SplitList& b, const int32 n_tips + const SplitList& a, const SplitList& b, const int32 n_tips, + LapScratch& scratch ) { if (a.n_splits == 0 || b.n_splits == 0 || n_tips == 0) return 0.0; @@ -104,8 +105,9 @@ static double mutual_clustering_score( } const int16 lap_n = most_splits - exact_n; - std::vector rowsol(lap_n); - std::vector colsol(lap_n); + scratch.ensure(most_splits); + auto& rowsol = scratch.rowsol; + auto& colsol = scratch.colsol; if (exact_n) { // Build a reduced cost matrix omitting exact-matched rows/cols @@ -126,7 +128,7 @@ static double mutual_clustering_score( const double lap_score = static_cast((max_score * lap_n) - - lap(lap_n, small, rowsol, colsol, false)) * over_max; + lap(lap_n, small, rowsol, colsol, false, scratch)) * over_max; return lap_score + exact_score * n_tips_rcp; } else { @@ -137,7 +139,7 @@ static double mutual_clustering_score( } } return static_cast( - (max_score * lap_n) - lap(lap_n, score, rowsol, colsol, false) + (max_score * lap_n) - lap(lap_n, score, rowsol, colsol, false, scratch) ) / max_score; } } @@ -183,6 +185,11 @@ NumericVector cpp_mutual_clustering_all_pairs( NumericVector result(n_pairs); double* const res = result.begin(); + // One LapScratch per thread — grown lazily on first use, never freed between + // pairs. Indexed by omp_get_thread_num() (always 0 in the serial path). + const int n_scratch = std::max(1, n_threads); + std::vector scratches(n_scratch); + // Iterate over columns of the combn(N,2) lower triangle. // Pair (col, row) with col < row maps to dist-vector index: // p = col*(N-1) - col*(col-1)/2 + row - col - 1 @@ -192,10 +199,15 @@ NumericVector cpp_mutual_clustering_all_pairs( for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP Rcpp::checkUserInterrupt(); +#endif +#ifdef _OPENMP + LapScratch& scratch = scratches[omp_get_thread_num()]; +#else + LapScratch& scratch = scratches[0]; #endif for (int row = col + 1; row < N; ++row) { const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; - res[p] = mutual_clustering_score(*splits[col], *splits[row], n_tip); + res[p] = mutual_clustering_score(*splits[col], *splits[row], n_tip, scratch); } } @@ -296,7 +308,8 @@ NumericVector cpp_rf_info_all_pairs( // ============================================================================= static double msd_score( - const SplitList& a, const SplitList& b, const int32 n_tips + const SplitList& a, const SplitList& b, const int32 n_tips, + LapScratch& scratch ) { const int16 most_splits = std::max(a.n_splits, b.n_splits); if (most_splits == 0) return 0.0; @@ -321,11 +334,12 @@ static double msd_score( } score.padAfterRow(a.n_splits, max_score); - std::vector rowsol(most_splits); - std::vector colsol(most_splits); + scratch.ensure(most_splits); + auto& rowsol = scratch.rowsol; + auto& colsol = scratch.colsol; return static_cast( - lap(most_splits, score, rowsol, colsol, false) - max_score * split_diff + lap(most_splits, score, rowsol, colsol, false, scratch) - max_score * split_diff ); } @@ -351,16 +365,24 @@ NumericVector cpp_msd_all_pairs( NumericVector result(n_pairs); double* const res = result.begin(); + const int n_scratch = std::max(1, n_threads); + std::vector scratches(n_scratch); + #ifdef _OPENMP #pragma omp parallel for schedule(dynamic) num_threads(n_threads) #endif for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP Rcpp::checkUserInterrupt(); +#endif +#ifdef _OPENMP + LapScratch& scratch = scratches[omp_get_thread_num()]; +#else + LapScratch& scratch = scratches[0]; #endif for (int row = col + 1; row < N; ++row) { const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; - res[p] = msd_score(*splits[col], *splits[row], n_tip); + res[p] = msd_score(*splits[col], *splits[row], n_tip, scratch); } } return result; @@ -372,7 +394,8 @@ NumericVector cpp_msd_all_pairs( // ============================================================================= static double msi_score( - const SplitList& a, const SplitList& b, const int32 n_tips + const SplitList& a, const SplitList& b, const int32 n_tips, + LapScratch& scratch ) { const int16 most_splits = std::max(a.n_splits, b.n_splits); if (most_splits == 0) return 0.0; @@ -404,11 +427,12 @@ static double msi_score( } score.padAfterRow(a.n_splits, max_score); - std::vector rowsol(most_splits); - std::vector colsol(most_splits); + scratch.ensure(most_splits); + auto& rowsol = scratch.rowsol; + auto& colsol = scratch.colsol; return static_cast( - (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false, scratch) ) * possible_over_score; } @@ -434,16 +458,24 @@ NumericVector cpp_msi_all_pairs( NumericVector result(n_pairs); double* const res = result.begin(); + const int n_scratch = std::max(1, n_threads); + std::vector scratches(n_scratch); + #ifdef _OPENMP #pragma omp parallel for schedule(dynamic) num_threads(n_threads) #endif for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP Rcpp::checkUserInterrupt(); +#endif +#ifdef _OPENMP + LapScratch& scratch = scratches[omp_get_thread_num()]; +#else + LapScratch& scratch = scratches[0]; #endif for (int row = col + 1; row < N; ++row) { const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; - res[p] = msi_score(*splits[col], *splits[row], n_tip); + res[p] = msi_score(*splits[col], *splits[row], n_tip, scratch); } } return result; @@ -455,7 +487,8 @@ NumericVector cpp_msi_all_pairs( // ============================================================================= static double shared_phylo_score( - const SplitList& a, const SplitList& b, const int32 n_tips + const SplitList& a, const SplitList& b, const int32 n_tips, + LapScratch& scratch ) { const int16 most_splits = std::max(a.n_splits, b.n_splits); if (most_splits == 0) return 0.0; @@ -481,11 +514,12 @@ static double shared_phylo_score( } score.padAfterRow(a.n_splits, max_score); - std::vector rowsol(most_splits); - std::vector colsol(most_splits); + scratch.ensure(most_splits); + auto& rowsol = scratch.rowsol; + auto& colsol = scratch.colsol; return static_cast( - (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false, scratch) ) * possible_over_score; } @@ -511,16 +545,24 @@ NumericVector cpp_shared_phylo_all_pairs( NumericVector result(n_pairs); double* const res = result.begin(); + const int n_scratch = std::max(1, n_threads); + std::vector scratches(n_scratch); + #ifdef _OPENMP #pragma omp parallel for schedule(dynamic) num_threads(n_threads) #endif for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP Rcpp::checkUserInterrupt(); +#endif +#ifdef _OPENMP + LapScratch& scratch = scratches[omp_get_thread_num()]; +#else + LapScratch& scratch = scratches[0]; #endif for (int row = col + 1; row < N; ++row) { const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; - res[p] = shared_phylo_score(*splits[col], *splits[row], n_tip); + res[p] = shared_phylo_score(*splits[col], *splits[row], n_tip, scratch); } } return result; @@ -533,7 +575,8 @@ NumericVector cpp_shared_phylo_all_pairs( static double jaccard_score( const SplitList& a, const SplitList& b, const int32 n_tips, - const double exponent, const bool allow_conflict + const double exponent, const bool allow_conflict, + LapScratch& scratch ) { const int16 most_splits = std::max(a.n_splits, b.n_splits); if (most_splits == 0) return 0.0; @@ -588,11 +631,12 @@ static double jaccard_score( } score.padAfterRow(a.n_splits, max_score); - std::vector rowsol(most_splits); - std::vector colsol(most_splits); + scratch.ensure(most_splits); + auto& rowsol = scratch.rowsol; + auto& colsol = scratch.colsol; return static_cast( - (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false) + (max_score * most_splits) - lap(most_splits, score, rowsol, colsol, false, scratch) ) / max_scoreL; } @@ -620,16 +664,24 @@ NumericVector cpp_jaccard_all_pairs( NumericVector result(n_pairs); double* const res = result.begin(); + const int n_scratch = std::max(1, n_threads); + std::vector scratches(n_scratch); + #ifdef _OPENMP #pragma omp parallel for schedule(dynamic) num_threads(n_threads) #endif for (int col = 0; col < N - 1; ++col) { #ifndef _OPENMP Rcpp::checkUserInterrupt(); +#endif +#ifdef _OPENMP + LapScratch& scratch = scratches[omp_get_thread_num()]; +#else + LapScratch& scratch = scratches[0]; #endif for (int row = col + 1; row < N; ++row) { const int p = col * (N - 1) - col * (col - 1) / 2 + row - col - 1; - res[p] = jaccard_score(*splits[col], *splits[row], n_tip, k, allow_conflict); + res[p] = jaccard_score(*splits[col], *splits[row], n_tip, k, allow_conflict, scratch); } } return result; diff --git a/src/tree_distances.h b/src/tree_distances.h index 06b8661a9..35d58c397 100644 --- a/src/tree_distances.h +++ b/src/tree_distances.h @@ -83,6 +83,13 @@ class CostMatrix { makeUntranspose(); } + // Reset for reuse at the same dimension — clears the transpose cache so the + // next findColMin() re-builds it from fresh data. Call between pair + // computations when reusing a thread-local CostMatrix instance. + void reset() noexcept { transposed_ = false; } + + [[nodiscard]] size_t dim() const noexcept { return dim_; } + void makeTranspose() noexcept { if (transposed_) return; @@ -188,10 +195,12 @@ class CostMatrix { std::fill(data_.begin() + start_index, data_.begin() + end_index, value); } - std::pair findColMin(lap_col j) { + std::pair findColMin(lap_col j, + lap_row search_dim = -1) { makeTranspose(); const cost* col_data = col(j); - const auto min_ptr = std::min_element(col_data, col_data + dim_); + const lap_row n = (search_dim < 0) ? static_cast(dim_) : search_dim; + const auto min_ptr = std::min_element(col_data, col_data + n); return {*min_ptr, static_cast(std::distance(col_data, min_ptr))}; } @@ -374,13 +383,59 @@ class CostMatrix { using cost_matrix = CostMatrix; +// --------------------------------------------------------------------------- +// LapScratch — reusable heap storage for one thread's LAPJV workspace. +// +// Pass to the scratch-taking overload of lap() to eliminate the ~6 per-call +// std::vector allocations. In a serial context construct once before the +// loop; in an OpenMP context allocate one per thread and index by +// omp_get_thread_num(). All vectors are grown lazily; never shrunk. +// --------------------------------------------------------------------------- +struct LapScratch { + std::vector v; // column dual variables (padded) + std::vector matches; // assignment-count per row + std::vector freeunassigned; // list of unassigned rows + std::vector col_list; // column scan list + std::vector d; // Dijkstra distances + std::vector predecessor; // augmenting-path predecessors + // rowsol / colsol are included so score functions that don't need the + // solution afterwards can avoid a separate allocation. + std::vector rowsol; + std::vector colsol; + + void ensure(int dim) noexcept { + const int padded = static_cast( + ((static_cast(dim) + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE); + if (static_cast(v.size()) < padded) v.resize(padded); + if (static_cast(matches.size()) < dim) matches.resize(dim); + if (static_cast(freeunassigned.size()) < dim) freeunassigned.resize(dim); + if (static_cast(col_list.size()) < dim) col_list.resize(dim); + if (static_cast(d.size()) < dim) d.resize(dim); + if (static_cast(predecessor.size()) < dim) predecessor.resize(dim); + if (static_cast(rowsol.size()) < dim) rowsol.resize(dim); + if (static_cast(colsol.size()) < dim) colsol.resize(dim); + } +}; + /*************** FUNCTIONS *******************/ +// Primary overload: caller supplies pre-allocated scratch (zero alloc in hot loop). extern cost lap(lap_row dim, cost_matrix &input_cost, std::vector &rowsol, std::vector &colsol, - bool allow_interrupt = true); + bool allow_interrupt, + LapScratch &scratch); + +// Convenience overload: creates a temporary scratch (for one-off calls). +inline cost lap(lap_row dim, + cost_matrix &input_cost, + std::vector &rowsol, + std::vector &colsol, + bool allow_interrupt = true) { + LapScratch scratch; + return lap(dim, input_cost, rowsol, colsol, allow_interrupt, scratch); +} namespace TreeDist { From 46eb4131821e01dd9b9c2475c7ad902e8eb94580 Mon Sep 17 00:00:00 2001 From: RevBayes analysis <1695515+ms609@users.noreply.github.com> Date: Wed, 11 Mar 2026 16:30:12 +0000 Subject: [PATCH 12/19] Document failed optimization attempt --- AGENTS.md | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 50 insertions(+), 3 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 3ae99878b..afc22bbac 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -109,6 +109,35 @@ When run non-interactively (e.g. CI) results are serialised to `.bench.Rds` file `_compare_results.R` compares PR results against `main` and fails the build on a regression > 4 %. +### ⚠ Benchmarking protocol — always use a release install + +**Do not benchmark with `devtools::load_all()`.** It appends +`-UNDEBUG -Wall -pedantic -g -O0` to the compiler flags, which overrides the +`-O2` in `~/.R/Makevars` and produces an unrepresentative (typically 3–5× +slower) build. All timing figures in this file were produced from a release +install unless noted otherwise. + +**Correct workflow** (run from a *fresh* R session — do NOT load TreeDist first, +to avoid the Windows DLL lock): + +```r +source("benchmark/bench-release.R") + +# 1. Benchmark HEAD (before your patch) +bench_release(label = "baseline") + +# 2. Apply your changes, then benchmark again +bench_release(label = "my-optimisation") + +# 3. Compare +bench_compare("baseline", "my-optimisation") +``` + +`bench_release()` installs the current working-tree source to a private temp +library via `install.packages(..., type = "source")` (so Makevars flags apply +correctly) and runs the suite in a `Rscript` subprocess. Results are saved to +`benchmark/results/