Release gtars v0.8.0#240

Open
nsheff wants to merge 47 commits into master from dev

Conversation

@nsheff
Member

@nsheff nsheff commented Mar 6, 2026

No description provided.

sanghoonio and others added 30 commits February 18, 2026 19:24
Implements trim, promoters, reduce, setdiff, and pintersect as a trait
on RegionSet in gtars-genomicdist. Uses natural chromosome sort order,
preserves zero-width intervals, and saturates at 0 for promoters.
Includes 26 unit tests, demo binary, and benchmark example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
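Two of the operations named in this commit can be sketched in a few lines. The block below is a minimal illustration only: the `Iv` struct, the function names, the decision to merge touching intervals, and the unstranded promoter window are all assumptions made here, not the gtars-genomicdist API.

```rust
/// Hypothetical minimal interval type (not the gtars `Region` struct).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Iv {
    pub chr: String,
    pub start: u32,
    pub end: u32,
}

/// Merge overlapping or touching intervals on the same chromosome.
/// Zero-width intervals are kept: they are merged only if they touch a neighbor.
pub fn reduce(mut ivs: Vec<Iv>) -> Vec<Iv> {
    // Sort by chromosome name, then by coordinates.
    ivs.sort_by(|a, b| {
        a.chr
            .cmp(&b.chr)
            .then(a.start.cmp(&b.start))
            .then(a.end.cmp(&b.end))
    });
    let mut out: Vec<Iv> = Vec::with_capacity(ivs.len());
    for iv in ivs {
        if let Some(prev) = out.last_mut() {
            if prev.chr == iv.chr && iv.start <= prev.end {
                // No gap between the previous interval and this one: extend it.
                prev.end = prev.end.max(iv.end);
                continue;
            }
        }
        out.push(iv);
    }
    out
}

/// Build a window around each interval start (assumed unstranded here).
/// `saturating_sub` keeps the window start at 0 instead of underflowing.
pub fn promoters(ivs: &[Iv], upstream: u32, downstream: u32) -> Vec<Iv> {
    ivs.iter()
        .map(|iv| Iv {
            chr: iv.chr.clone(),
            start: iv.start.saturating_sub(upstream),
            end: iv.start.saturating_add(downstream),
        })
        .collect()
}
```

With `upstream = 2000` and an interval starting at 100, the promoter window starts at 0 rather than wrapping around, which is the saturation behavior the commit describes.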
Ports GenomicDistributions calc functions to Rust:

Partition system:
- genome_partition_list, calc_partitions, calc_expected_partitions
- GTF/BED gene model loading with GENCODE UTR classification
- Chi-square expected partition analysis with regularized incomplete gamma

Statistics:
- calc_nearest_neighbors (min upstream/downstream per region)
- calc_widths (region end - start)
- calc_feature_distances (signed distance to nearest feature, matching R convention)

Performance:
- is_sorted flag on RegionSet: reduce() checks this flag and skips the
  clone+sort when input is known-sorted (e.g. after BED loading or sort()).
  Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on
  already-sorted intermediate RegionSets.

PR review feedback addressed:
- Lexicographic chromosome sort in reduce() (matches BED convention)
- setdiff/pintersect docstring examples
- Document rest field not preserved, zero-width region behavior
- pintersect truncation behavior documented
- &str spacing fixes

All 57 tests pass. Cross-validated against R on 4 ENCODE BED files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the is_sorted field from RegionSet in gtars-core to avoid a
breaking change to the shared struct. Instead, adds SortedRegionSet
newtype in gtars-genomicdist that takes ownership and sorts in place
(move, not clone). reduce() uses this internally.

This keeps the optimization local to the crate that needs it without
modifying the core type that all other crates depend on.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
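The newtype approach described in this commit can be illustrated roughly as follows. The struct fields and method names are hypothetical stand-ins; only the idea (take ownership, sort in place, and let the wrapper type carry the "is sorted" guarantee) comes from the commit message.

```rust
/// Hypothetical stand-ins for the core types; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub chr: String,
    pub start: u32,
    pub end: u32,
}

pub struct RegionSet {
    pub regions: Vec<Region>,
}

/// Wrapper whose only constructor sorts, so holding a `SortedRegionSet`
/// is proof that the regions are ordered. No flag on `RegionSet` is needed.
pub struct SortedRegionSet(RegionSet);

impl SortedRegionSet {
    /// Takes the set by value and sorts in place: a move, not a clone.
    pub fn new(mut rs: RegionSet) -> Self {
        rs.regions.sort_by(|a, b| {
            a.chr
                .cmp(&b.chr)
                .then(a.start.cmp(&b.start))
                .then(a.end.cmp(&b.end))
        });
        SortedRegionSet(rs)
    }

    /// Read-only view, so callers cannot break the ordering.
    pub fn regions(&self) -> &[Region] {
        &self.0.regions
    }

    /// Give the underlying set back to the caller.
    pub fn into_inner(self) -> RegionSet {
        self.0
    }
}
```

Because the inner field is private and only exposed immutably, code that needs sorted input can accept `&SortedRegionSet` and skip any re-sorting without touching the shared core struct.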
benchmark_interval_ranges.rs is now gitignored along with other
benchmark files. Kept locally for development use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port R's calcSummarySignal from GenomicDistributions. Overlaps query
regions with a signal matrix (TSV of region × condition values) using
per-chromosome AIList indexes with row indices in the val field, takes
MAX signal per condition across overlapping rows, and computes Tukey
boxplot statistics matching R's fivenum/boxplot.stats.

New module: signal.rs with SignalMatrix, calc_summary_signal,
ConditionStats, and 8 unit tests covering TSV parsing, malformed row
skipping, boxplot stats (odd/even/outlier cases), end-to-end overlap
aggregation, and no-overlap edge case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
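For readers unfamiliar with the R functions mentioned in this commit, here is a small sketch of Tukey's five-number summary and the whisker adjustment that `boxplot.stats` applies with its default coefficient of 1.5. The function names and return shape are assumptions for illustration and are not the `ConditionStats` type from signal.rs.

```rust
/// Tukey five-number summary (min, lower hinge, median, upper hinge, max),
/// following the index arithmetic of R's `fivenum`. NaN values are dropped.
pub fn fivenum(values: &[f64]) -> Option<[f64; 5]> {
    let mut x: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();
    if x.is_empty() {
        return None;
    }
    x.sort_by(|a, b| a.total_cmp(b));
    let n = x.len() as f64;
    let n4 = ((n + 3.0) / 2.0).floor() / 2.0;
    // 1-based, possibly fractional positions; a ".5" position averages two points.
    let d = [1.0, n4, (n + 1.0) / 2.0, n + 1.0 - n4, n];
    let mut out = [0.0; 5];
    for (slot, &pos) in out.iter_mut().zip(d.iter()) {
        let lo = x[pos.floor() as usize - 1];
        let hi = x[pos.ceil() as usize - 1];
        *slot = 0.5 * (lo + hi);
    }
    Some(out)
}

/// Boxplot statistics: hinges and median from `fivenum`, whiskers pulled in
/// to the most extreme data points within 1.5 * IQR of the hinges.
pub fn boxplot_stats(values: &[f64]) -> Option<[f64; 5]> {
    let mut s = fivenum(values)?;
    let iqr = s[3] - s[1];
    let (lo_fence, hi_fence) = (s[1] - 1.5 * iqr, s[3] + 1.5 * iqr);
    let (mut lo, mut hi) = (f64::INFINITY, f64::NEG_INFINITY);
    for &v in values {
        // NaN fails both comparisons and is therefore skipped.
        if v >= lo_fence && v <= hi_fence {
            lo = lo.min(v);
            hi = hi.max(v);
        }
    }
    s[0] = lo;
    s[4] = hi;
    Some(s)
}
```

For the input `[1, 2, 3, 4, 100]` the hinges are 2 and 4, so the upper fence is 7 and the upper whisker falls back to 4, with 100 treated as an outlier.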
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…vals, distances

Complete extendr-based R binding layer for gtars-genomicdist with
drop-in compatible API matching GenomicDistributions. Includes:
- Load/convert between RegionSet pointers and GRanges/data.frame/BED
- Statistics (widths, neighbor distances, nearest neighbors, chrom stats, region distribution)
- GC content and dinucleotide frequency via GenomeAssembly pointer
- Interval ranges (trim, promoters, reduce, setdiff, pintersect)
- Partition system with strand-aware and GTF-based gene model builders
- Summary signal matrix overlap with boxplot statistics
- TSS/feature distances with proper NA sentinel handling

Also updates gitignore to exclude R test/benchmark files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose statistics (widths, neighbor distances, nearest neighbors),
interval ranges (trim, promoters, reduce, setdiff, pintersect),
partitions (calcPartitions, calcExpectedPartitions with GeneModel),
signal (calcSummarySignal with SignalMatrix), and TSS/feature
distance calculations through gtars-wasm for use in bedbase-ui.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement four set-theoretic operations for comparing and combining
genomic interval sets, enabling replicate concordance analysis and
multi-BED summarization in BEDbase.

Core Rust (gtars-genomicdist):
- Add concat, union, jaccard to IntervalRanges trait and RegionSet impl.
  Jaccard uses inclusion-exclusion on reduced sets (no new intersection
  algorithm needed). Union delegates to concat + reduce.
- New consensus module: given N region sets, computes the union of all
  regions and annotates each with the count of input sets overlapping it.
  Uses MultiChromOverlapper (AIList) per input set for O(N*M*log n) queries.
- Tests for all new functions including edge cases (identical, disjoint,
  empty sets).

WASM bindings (gtars-wasm):
- concat, union, jaccard as methods on JsRegionSet.
- ConsensusBuilder class with add()/compute() pattern to work around
  wasm_bindgen limitations on passing arrays of user-defined types.

R bindings (gtars-r):
- gtars_concat, gtars_union, gtars_jaccard, gtars_consensus R wrappers
  with auto-conversion from GRanges/paths/data.frames via .ensure_regionset().
- Rust extendr functions: r_concat, r_union, r_jaccard, r_consensus.
- Generated man pages via rextendr::document().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
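The inclusion-exclusion idea behind the Jaccard computation can be sketched without any intersection routine: after merging overlaps, the intersection size is |A| + |B| - |A ∪ B| in base pairs. The tuple representation, the function names, and the choice to return 0.0 when both sets are empty are assumptions made for this example.

```rust
/// Base pairs covered by a set of (chromosome, start, end) intervals
/// after merging overlaps, i.e. the size of the reduced set.
fn covered_bp(ivs: &[(&str, u64, u64)]) -> u64 {
    let mut v = ivs.to_vec();
    v.sort_unstable();
    let mut total = 0u64;
    let mut cur: Option<(&str, u64, u64)> = None;
    for (chr, s, e) in v {
        cur = match cur {
            // Same chromosome and no gap: extend the running interval.
            Some((c, cs, ce)) if c == chr && s <= ce => Some((c, cs, ce.max(e))),
            // Otherwise close the running interval and start a new one.
            Some((_, cs, ce)) => {
                total += ce - cs;
                Some((chr, s, e))
            }
            None => Some((chr, s, e)),
        };
    }
    if let Some((_, cs, ce)) = cur {
        total += ce - cs;
    }
    total
}

/// Jaccard index in base pairs, using inclusion-exclusion on reduced sets.
pub fn jaccard(a: &[(&str, u64, u64)], b: &[(&str, u64, u64)]) -> f64 {
    let bp_a = covered_bp(a);
    let bp_b = covered_bp(b);
    // Union = concat + reduce, as in the commit description.
    let both: Vec<(&str, u64, u64)> = a.iter().chain(b.iter()).copied().collect();
    let bp_union = covered_bp(&both);
    if bp_union == 0 {
        return 0.0; // assumption: empty-vs-empty is reported as 0
    }
    let bp_inter = bp_a + bp_b - bp_union;
    bp_inter as f64 / bp_union as f64
}
```

For `chr1:0-10` against `chr1:5-15`, the union covers 15 bp and the intersection 10 + 10 - 15 = 5 bp, giving a Jaccard index of 1/3.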
Covers edge cases: empty sets, disjoint/adjacent/overlapping regions,
multi-chromosome inputs, symmetry, containment, and duplicate handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose gtars-genomicdist library functions as CLI commands behind
the `genomicdist` feature flag:

- `gtars genomicdist` — compute genomic distribution statistics
  (widths, partitions, TSS distances, etc.) and output JSON
- `gtars ranges` — interval set algebra (reduce, trim, promoters,
  setdiff, pintersect, concat, union, jaccard) with BED output
- `gtars consensus` — consensus peak calling across multiple BED
  files with min-count filtering

Also adds serde Serialize/Deserialize derives to library types
(ChromosomeStatistics, RegionBin, PartitionResult, etc.) so the
CLI can serialize them directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in genomicdist, set operations, partitions, and signal
functionality for CLI, WASM, and R bindings. Resolved conflicts
by keeping dev's newer crate versions while adding new genomicdist
dependencies and R wrapper exports. Bumps gtars-wasm to 0.7.1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Node 20.x OIDC publishing broke (npm/cli#8730). Add Node 24.x,
NPM_CONFIG_PROVENANCE env var, and publishConfig in package.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
actions/setup-node with registry-url creates .npmrc with a
${NODE_AUTH_TOKEN} placeholder that prevents npm from falling
through to OIDC trusted publishing. Remove it and add debug
logging to inspect npm config on failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
registry-url is needed so npm knows the registry endpoint for OIDC
token exchange, but the _authToken placeholder it creates blocks
OIDC fallback. Strip the token line from .npmrc before publishing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node doesn't create ~/.npmrc on current runners. Write it
manually with just the registry URL (no _authToken placeholder)
so npm knows the endpoint for OIDC exchange without a stale token
blocking it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without registry-url npm gives ENEEDAUTH (doesn't try auth at all).
With it, npm at least enters the auth path. Adding debug for OIDC
env vars and NODE_AUTH_TOKEN to understand why the token is rejected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node injects a short-lived NODE_AUTH_TOKEN at step time that
expires during the ~5min WASM build. npm uses this stale token
instead of doing a fresh OIDC exchange. Fix by unsetting the token
and stripping _authToken from .npmrc in the same shell as publish.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node obtains a short-lived OIDC token that expires during the
~5min WASM compilation. Move setup-node to after the build so the
token is seconds old at publish time. wasm-pack only needs Rust,
not Node.js. Also removed debug logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	gtars-r/src/rust/Cargo.toml
CLI additions:
- Add --signal-matrix flag to `gtars genomicdist` with automatic
  format detection (.bin = packed binary, .gz/.txt = TSV)
- Add `gtars prep` subcommand to pre-serialize GTF gene models and
  signal matrices into binary cache files for fast repeated loading
- Add serde derives to Region, RegionSet, Strand, StrandedRegionSet
  to support binary serialization of gene models and signal matrices

Packed binary format for SignalMatrix:
- Flatten values from Vec<Vec<f64>> to row-major Vec<f64>, eliminating
  2.6M individual Vec heap allocations during deserialization (one per
  row in the signal matrix). The flat layout enables a single memcpy
  of the entire 1.5GB f64 array instead of 2.6M separate allocations.
- Use a string intern table (~25 entries) for chromosome names, read
  back as u16 IDs and resolved to Strings, replacing 2.6M individual
  String deserializations with 2.6M cheap clone-from-intern-table ops.
- Column-oriented region storage (chr_ids[], starts[], ends[]) for
  sequential memory access during deserialization.
- Magic number validation (0x5349474D "SIGM") rejects old-format files
  with a clear "regenerate with gtars prep" error message.

Packed binary format for GeneModel:
- Same intern table + column-oriented pattern for each StrandedRegionSet
  component (genes, exons, three_utr, five_utr).
- Strand encoded as single byte (0=Plus, 1=Minus, 2=Unstranded).
- Flags field tracks presence of optional UTR components.
- Magic number 0x474D444C ("GMDL") for format validation.
- File size reduced from 9.7MB to 4.2MB (57% smaller).

Performance (signal matrix deserialization):
- Before (bincode): 2.6M Vec<f64> allocs + 2.6M String allocs = 1.08s
- After (packed): 1 Vec<f64> alloc (memcpy) + intern table = 0.66s

Full pipeline wall time (encode_303, 5751 regions): 1.87s -> 1.42s
Full pipeline wall time (encode_4, 105K regions): 2.50s -> 1.81s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
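To make the packed layout more concrete, here is a toy encoder and decoder that follows the ideas in this commit: a magic number, a chromosome intern table addressed by `u16` IDs, column-oriented region arrays, and a single row-major value buffer. The exact field order, the little-endian encoding, the length prefixes, and all names other than the magic constant are assumptions; the real gtars format may differ, and this sketch decodes element by element rather than with a single memcpy.

```rust
use std::convert::TryInto;

/// Magic number quoted in the commit message: ASCII "SIGM".
const MAGIC: u32 = 0x5349_474D;

/// Hypothetical in-memory layout for the packed signal matrix.
pub struct PackedMatrix {
    pub chroms: Vec<String>, // intern table
    pub chr_ids: Vec<u16>,   // one intern-table index per row
    pub starts: Vec<u32>,
    pub ends: Vec<u32>,
    pub n_cols: usize,
    pub values: Vec<f64>, // row-major, len == n_rows * n_cols
}

/// Split `n` bytes off the front of `buf`, or fail on truncated input.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], String> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err("truncated file".to_string());
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

fn read_u32(buf: &mut &[u8]) -> Result<u32, String> {
    Ok(u32::from_le_bytes(take(buf, 4)?.try_into().unwrap()))
}

impl PackedMatrix {
    /// Flat indexing replaces the nested `Vec<Vec<f64>>` lookup.
    pub fn get(&self, row: usize, col: usize) -> f64 {
        self.values[row * self.n_cols + col]
    }

    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = MAGIC.to_le_bytes().to_vec();
        out.extend_from_slice(&(self.chroms.len() as u32).to_le_bytes());
        for c in &self.chroms {
            out.extend_from_slice(&(c.len() as u32).to_le_bytes());
            out.extend_from_slice(c.as_bytes());
        }
        out.extend_from_slice(&(self.chr_ids.len() as u32).to_le_bytes());
        out.extend_from_slice(&(self.n_cols as u32).to_le_bytes());
        out.extend(self.chr_ids.iter().flat_map(|v| v.to_le_bytes()));
        out.extend(self.starts.iter().flat_map(|v| v.to_le_bytes()));
        out.extend(self.ends.iter().flat_map(|v| v.to_le_bytes()));
        out.extend(self.values.iter().flat_map(|v| v.to_le_bytes()));
        out
    }

    pub fn from_bytes(bytes: &[u8]) -> Result<PackedMatrix, String> {
        let mut buf = bytes;
        if read_u32(&mut buf)? != MAGIC {
            return Err("bad magic number: regenerate with gtars prep".to_string());
        }
        let n_chroms = read_u32(&mut buf)? as usize;
        let mut chroms = Vec::new();
        for _ in 0..n_chroms {
            let len = read_u32(&mut buf)? as usize;
            let name = std::str::from_utf8(take(&mut buf, len)?).map_err(|e| e.to_string())?;
            chroms.push(name.to_string());
        }
        let n_rows = read_u32(&mut buf)? as usize;
        let n_cols = read_u32(&mut buf)? as usize;
        let chr_ids: Vec<u16> = take(&mut buf, n_rows * 2)?
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        if chr_ids.iter().any(|&id| id as usize >= chroms.len()) {
            return Err("chromosome id outside intern table".to_string());
        }
        let starts: Vec<u32> = take(&mut buf, n_rows * 4)?
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
            .collect();
        let ends: Vec<u32> = take(&mut buf, n_rows * 4)?
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
            .collect();
        let values: Vec<f64> = take(&mut buf, n_rows * n_cols * 8)?
            .chunks_exact(8)
            .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        Ok(PackedMatrix { chroms, chr_ids, starts, ends, n_cols, values })
    }
}
```

The design point is that each row costs a 2-byte ID plus fixed-width numbers instead of a heap-allocated `String` and `Vec<f64>`, and a file without the expected magic number is rejected before any further parsing.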
Pretty-print remains the default for interactive use. Pipelines like
bedboss can pass --compact to halve intermediate file size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use calc_feature_distances (signed i64) instead of calc_tss_distances
  (unsigned u32) for proper upstream/downstream TSS distance reporting
- Extract actual TSS positions from gene model using strand info
  (Plus → gene start, Minus → gene end) instead of gene body midpoints
- Add --promoter-upstream (default 200) and --promoter-downstream
  (default 2000) CLI params for partition definitions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
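The strand rule in this commit (Plus uses the gene start, Minus uses the gene end) is small enough to show directly. The enum, the function names, the treatment of unstranded genes, and the sign convention of the distance are assumptions for this sketch, not the gtars definitions.

```rust
/// Hypothetical strand enum mirroring the three cases named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strand {
    Plus,
    Minus,
    Unstranded,
}

/// Transcription start site of a gene body given its strand.
/// Treating `Unstranded` like `Plus` is an assumption of this sketch.
pub fn tss_position(start: u32, end: u32, strand: Strand) -> u32 {
    match strand {
        Strand::Plus | Strand::Unstranded => start,
        Strand::Minus => end,
    }
}

/// Signed distance from a query midpoint to a TSS. A signed `i64` can tell
/// the two sides of the TSS apart, which an unsigned `u32` cannot.
/// The sign convention (feature minus query) is assumed here.
pub fn signed_distance(query_mid: u32, tss: u32) -> i64 {
    i64::from(tss) - i64::from(query_mid)
}
```

For a minus-strand gene at 100-500 the TSS is 500, so a query centered at 300 is 200 bp away on one side, while the same query is 200 bp away on the other side of a plus-strand TSS at 100.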
nsheff and others added 7 commits February 26, 2026 15:31
Extract annotation data into a standalone GDA (GenomicDist Annotation)
binary format with its own asset module. This separates reference data
serialization from partition logic, simplifies the CLI prep command
(removes --fai, uses from_gtf directly), and adds WASM bindings for
GDA assets. Fixes SignalMatrix struct mismatch in WASM signal module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
add streaming uniwig alongside current batch parallel implementation
@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 95.13234% with 160 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.87%. Comparing base (211765f) to head (84d5875).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| gtars-genomicdist/src/partitions.rs | 88.44% | 58 Missing ⚠️ |
| gtars-uniwig/src/stream.rs | 94.28% | 38 Missing ⚠️ |
| gtars-genomicdist/src/signal.rs | 91.33% | 28 Missing ⚠️ |
| gtars-genomicdist/src/asset.rs | 94.66% | 15 Missing ⚠️ |
| gtars-genomicdist/src/models.rs | 98.31% | 4 Missing ⚠️ |
| gtars-genomicdist/src/statistics.rs | 95.91% | 4 Missing ⚠️ |
| gtars-refget/src/seqcol.rs | 99.05% | 4 Missing ⚠️ |
| gtars-refget/src/fasta.rs | 97.08% | 3 Missing ⚠️ |
| gtars-refget/src/digest/types.rs | 98.30% | 2 Missing ⚠️ |
| gtars-refget/src/lib.rs | 33.33% | 2 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #240       +/-   ##
===========================================
+ Coverage   58.80%   79.87%   +21.06%     
===========================================
  Files          94       61       -33     
  Lines       15812    16550      +738     
===========================================
+ Hits         9298    13219     +3921     
+ Misses       6514     3331     -3183     


@nsheff nsheff marked this pull request as ready for review March 6, 2026 13:36
@nsheff nsheff requested review from Copilot and nleroy917 March 6, 2026 13:36
Contributor

Copilot AI left a comment

Pull request overview

This PR prepares the gtars v0.8.0 release by expanding genomic distribution functionality across Rust core crates, CLI, WASM, R, and Python bindings, and by updating/adding test fixtures and tooling to support the new behaviors.

Changes:

  • Adds/exports new genomicdist capabilities (consensus regions, interval range algebra, partitions, signal summary, GDA binary asset format) and surfaces them via CLI + WASM + R.
  • Refactors/refines refget FASTA ingestion and comparison (streaming FASTA reader, alias extraction, ancillary digest persistence, paginated collection listing) and updates bindings/tests accordingly.
  • Adds uniwig streaming mode and gzip reader adjustments, plus additional test data/assets and repo tooling updates.

Reviewed changes

Copilot reviewed 148 out of 152 changed files in this pull request and generated 6 comments.

File Description
tests/data/regionset/test_three_utr.bed New regionset fixture for 3' UTR tests
tests/data/regionset/test_query_promoter_enriched.bed New promoter-enriched query fixture
tests/data/regionset/test_genes.bed New gene boundary fixture
tests/data/regionset/test_gene_model_ensembl.gtf Synthetic Ensembl-style GTF fixture
tests/data/regionset/test_gene_model.gtf Synthetic GTF fixture for gene model parsing
tests/data/regionset/test_five_utr.bed New regionset fixture for 5' UTR tests
tests/data/regionset/test_exons.bed New exons fixture
tests/data/regionset/dummy_b.bed Additional dummy BED fixture for overlap tests
tests/data/regionset/ce_ref_three_utr_pc.bed C. elegans reference fixture (protein-coding)
tests/data/regionset/ce_ref_three_utr_all.bed C. elegans reference fixture (all genes)
tests/data/regionset/ce_ref_genes_pc.bed C. elegans genes fixture (protein-coding)
tests/data/regionset/ce_ref_genes_all.bed C. elegans genes fixture (all genes)
tests/data/regionset/ce_ref_five_utr_pc.bed C. elegans 5' UTR fixture (protein-coding)
tests/data/regionset/ce_ref_five_utr_all.bed C. elegans 5' UTR fixture (all genes)
tests/data/regionset/ce_ref_exons_pc.bed C. elegans exons fixture (protein-coding)
tests/data/regionset/ce_ref_exons_all.bed C. elegans exons fixture (all genes)
tests/data/regionset/C_elegans_cropped_example.gtf.gz Compressed GTF fixture for tests/examples
tests/data/out/_end.wig Updates expected output fixture
gtars-wasm/src/tss.rs Adds WASM binding for TssIndex and distance APIs
gtars-wasm/src/signal.rs Adds WASM binding for signal matrix + summary signal
gtars-wasm/src/regionset.rs Exposes RegionSet operations/statistics + consensus builder to WASM
gtars-wasm/src/partitions.rs Adds WASM bindings for gene model + partitions APIs
gtars-wasm/src/lib.rs Wires new WASM modules into crate
gtars-wasm/src/asset.rs Adds WASM binding for loading GDA assets and derived indexes/lists
gtars-wasm/Cargo.toml Bumps gtars-js version and ensures deps/features for new bindings
gtars-uniwig/src/reading.rs Switches to MultiGzDecoder for gzip reading
gtars-uniwig/src/lib.rs Updates uniwig count-type behavior + adds tests for single-type batch output
gtars-refget/tests/test_decode.rs Updates tests for new FASTA import options + paginated listing
gtars-refget/src/lib.rs Re-exports pagination/service types and updates internal tests
gtars-refget/src/fasta.rs Introduces streaming FastaReader to reduce peak memory usage
gtars-refget/src/digest/types.rs Refactors comparison logic, adds ancillary builders, makes digest b optional
gtars-refget/src/digest/fasta.rs Adds namespace alias extraction + uses MultiGzDecoder for gz bytes
gtars-refget/src/collection.rs Persists ancillary digests in RGSI read/write + adds round-trip test
gtars-refget/examples/bench_fasta.rs Adds example benchmark for FASTA ingest peak RSS
gtars-refget/Cargo.toml Removes seq_io, adds optional crossbeam-channel for filesystem pipelines
gtars-r/tests/test_refget.R Updates R tests for metadata-by-alias API changes
gtars-r/test-r.sh Adds helper script to run R install + tests via bulker
gtars-r/src/rust/src/refget.rs Updates R bindings for new RefgetStore APIs (import options, pagination, alias metadata)
gtars-r/src/rust/src/lib.rs Adds genomicdist module wiring
gtars-r/src/rust/Cargo.toml Adds deps (vendored openssl) + links gtars-genomicdist
gtars-r/man/regionset_to_vectors.Rd New generated docs for RegionSet helpers
gtars-r/man/regionset_to_df.Rd New generated docs for RegionSet→data.frame
gtars-r/man/regionset_length.Rd New generated docs (but contains duplicated usage/args)
gtars-r/man/regionset_from_vectors.Rd New generated docs for RegionSet construction
gtars-r/man/regionDistribution.Rd New genomicdist distribution docs
gtars-r/man/r_union.Rd New wrapper docs for union
gtars-r/man/r_trim.Rd New wrapper docs for trim
gtars-r/man/r_setdiff.Rd New wrapper docs for setdiff
gtars-r/man/r_region_distribution.Rd New wrapper docs for region distribution
gtars-r/man/r_reduce.Rd New wrapper docs for reduce
gtars-r/man/r_promoters.Rd New wrapper docs for promoters
gtars-r/man/r_pintersect.Rd New wrapper docs for pairwise intersect
gtars-r/man/r_partition_list_from_regions_stranded.Rd New wrapper docs for stranded partitions builder
gtars-r/man/r_partition_list_from_regions.Rd New wrapper docs for partitions builder
gtars-r/man/r_partition_list_from_gtf.Rd New wrapper docs for partitions from GTF
gtars-r/man/r_jaccard.Rd New wrapper docs for jaccard
gtars-r/man/r_consensus.Rd New wrapper docs for consensus
gtars-r/man/r_concat.Rd New wrapper docs for concat
gtars-r/man/r_chromosome_statistics.Rd New wrapper docs for chromosome stats
gtars-r/man/r_calc_widths.Rd New wrapper docs for widths
gtars-r/man/r_calc_tss_distances.Rd New wrapper docs for TSS distances
gtars-r/man/r_calc_summary_signal.Rd New wrapper docs for summary signal
gtars-r/man/r_calc_partitions.Rd New wrapper docs for partitions
gtars-r/man/r_calc_neighbor_distances.Rd New wrapper docs for neighbor distances
gtars-r/man/r_calc_nearest_neighbors.Rd New wrapper docs for nearest neighbors
gtars-r/man/r_calc_gc_content.Rd New wrapper docs for GC content
gtars-r/man/r_calc_feature_distances.Rd New wrapper docs for signed feature distances
gtars-r/man/r_calc_expected_partitions.Rd New wrapper docs for expected partitions
gtars-r/man/r_calc_dinucl_freq.Rd New wrapper docs for dinucleotide frequencies
gtars-r/man/partitionListFromGTF.Rd New high-level genomicdist docs
gtars-r/man/load_regionset.Rd New docs for loading RegionSets
gtars-r/man/load_genome_assembly.Rd New docs for loading genome assembly
gtars-r/man/loadGenomeAssembly.Rd New high-level docs for genome loading
gtars-r/man/gtars_union.Rd New high-level docs for union
gtars-r/man/gtars_trim.Rd New high-level docs for trim
gtars-r/man/gtars_setdiff.Rd New high-level docs for setdiff
gtars-r/man/gtars_reduce.Rd New high-level docs for reduce
gtars-r/man/gtars_promoters.Rd New high-level docs for promoters
gtars-r/man/gtars_pintersect.Rd New high-level docs for pintersect
gtars-r/man/gtars_jaccard.Rd New high-level docs for jaccard
gtars-r/man/gtars_consensus.Rd New high-level docs for consensus
gtars-r/man/gtars_concat.Rd New high-level docs for concat
gtars-r/man/get_sequence_metadata_by_alias_store.Rd Fixes generated docs for new alias metadata getter
gtars-r/man/get_collection_metadata_by_alias_store.Rd Fixes generated docs for new collection alias metadata getter
gtars-r/man/genomePartitionList.Rd New high-level docs for partition list construction
gtars-r/man/chromosomeStatistics.Rd New high-level docs for chromosome stats
gtars-r/man/calcWidth.Rd New high-level docs for width calc
gtars-r/man/calcTSSDist.Rd New high-level docs for TSS dist
gtars-r/man/calcSummarySignal.Rd New high-level docs for summary signal
gtars-r/man/calcPartitions.Rd New high-level docs for partitions
gtars-r/man/calcNeighborDist.Rd New high-level docs for neighbor distance
gtars-r/man/calcNearestNeighbors.Rd New high-level docs for nearest neighbors
gtars-r/man/calcGCContent.Rd New high-level docs for GC content
gtars-r/man/calcFeatureDist.Rd New high-level docs for feature distances
gtars-r/man/calcExpectedPartitions.Rd New high-level docs for expected partitions
gtars-r/man/calcDinuclFreq.Rd New high-level docs for dinucleotide frequencies
gtars-r/man/as_regionset.Rd New coercion docs
gtars-r/man/as_granges.Rd New coercion docs
gtars-r/R/refget.R Renames/refactors alias getters to return metadata
gtars-r/R/extendr-wrappers.R Adds many new .Call wrappers for genomicdist functionality
gtars-r/NAMESPACE Exports newly added R APIs
gtars-r/DESCRIPTION Adds data.table import
gtars-python/tests/test_refget.py Updates Python tests for pagination + metadata alias APIs + namespace alias extraction
gtars-python/tests/test_collection_api.py Updates Python collection API tests for paginated list_collections
gtars-python/src/utils/mod.rs Gates utils functions behind utils feature
gtars-python/src/lib.rs Adds jemalloc allocator on Linux + gates submodules behind feature flags
gtars-python/Cargo.toml Adds feature flags + makes deps optional + adds jemalloc dependency
gtars-genomicdist/src/utils.rs Returns RegionSet via From<Vec<Region>> + adds karyotypic chrom sort key
gtars-genomicdist/src/statistics.rs Adds widths + nearest-neighbors + changes neighbor-distance semantics
gtars-genomicdist/src/models.rs Adds stranded/typed models, extends TssIndex with signed distances and missing-feature sentinels
gtars-genomicdist/src/lib.rs Exposes new modules and re-exports public API
gtars-genomicdist/src/errors.rs Improves error formatting + adds signal matrix error variant
gtars-genomicdist/src/consensus.rs Implements consensus region computation across multiple RegionSets
gtars-genomicdist/src/bed_classifier.rs Gates classifier behind feature flag and test gating
gtars-genomicdist/src/asset.rs Adds GDA binary asset format for gene models
gtars-genomicdist/examples/interval_ranges_demo.rs Adds example for interval range operations
gtars-genomicdist/examples/bench_load.rs Adds benchmark for GDA loading vs JSON/GTF
gtars-genomicdist/Cargo.toml Adds serde + flate2 and other deps needed for new features
gtars-core/src/utils.rs Adjusts gzip decoder imports under feature flags
gtars-core/src/models/region_set.rs Adds optional serde derives; skips serializing path
gtars-core/src/models/region.rs Adds optional serde derives + clarifies midpoint semantics discrepancy
gtars-core/Cargo.toml Adds optional serde feature
gtars-cli/src/uniwig/handlers.rs Adds --streaming path and streaming execution
gtars-cli/src/uniwig/cli.rs Updates CLI args for batch vs streaming defaults
gtars-cli/src/ranges/mod.rs Adds ranges subcommand module
gtars-cli/src/ranges/handlers.rs Implements ranges handlers (reduce/trim/promoters/set ops/jaccard)
gtars-cli/src/ranges/cli.rs Adds ranges subcommand CLI
gtars-cli/src/prep/mod.rs Adds prep subcommand module
gtars-cli/src/prep/handlers.rs Implements prep (serialize GTF/signal matrix to binary)
gtars-cli/src/prep/cli.rs Adds prep subcommand CLI
gtars-cli/src/main.rs Wires new genomicdist-related subcommands behind feature flag
gtars-cli/src/genomicdist/mod.rs Adds genomicdist subcommand module
gtars-cli/src/genomicdist/handlers.rs Implements genomicdist JSON output computation pipeline
gtars-cli/src/genomicdist/cli.rs Adds genomicdist subcommand CLI
gtars-cli/src/consensus/mod.rs Adds consensus subcommand module
gtars-cli/src/consensus/handlers.rs Implements consensus CLI handler
gtars-cli/src/consensus/cli.rs Adds consensus CLI
gtars-cli/Cargo.toml Adds genomicdist dependency and JSON/bincode serialization deps
README.md Documents new gtars prep + gtars genomicdist workflow
Makefile Adds test-r target
Cargo.toml Adds bincode workspace dependency
.gitignore Expands ignored benchmark/large data patterns


Comment on lines +6 to +15
\usage{
regionset_length(rs)

regionset_length(rs)
}
\arguments{
\item{rs}{An external pointer to a RegionSet}

\item{rs_ptr}{External pointer to a RegionSet}
}

Copilot AI Mar 6, 2026

This Rd entry has duplicated \usage{} lines and inconsistent argument names (rs and rs_ptr), which will produce confusing help output for users. Regenerate or fix the roxygen source so the usage and arguments list appear exactly once and match the actual function signature.

Member

Fixed in 8e11d8f — removed the duplicate wrapper from R/genomicdist.R and regenerated docs via rextendr::document().

Comment on lines +175 to +183
fn calc_nearest_neighbors(&self) -> Result<Vec<u32>, GtarsGenomicDistError> {
    let mut nearest: Vec<u32> = vec![];

    for chr in self.iter_chroms() {
        let chr_regions: Vec<&Region> = self.iter_chr_regions(chr).collect();

        if chr_regions.len() < 2 {
            continue;
        }

Copilot AI Mar 6, 2026

calc_nearest_neighbors() currently skips chromosomes with fewer than 2 regions (continue), which makes the returned vector shorter than the input RegionSet and breaks the documented “for each region” contract. Consider returning one value per input region (e.g., a sentinel like u32::MAX for chromosomes with no neighbor) so callers can align results back to regions reliably.

Member

Fixed in 8e11d8f — lone-region chromosomes now push u32::MAX sentinel instead of skipping, keeping the vector aligned with input. Added regression test.
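The alignment fix discussed in this thread can be illustrated with a standalone sketch that always returns one value per input region and uses `u32::MAX` when a chromosome holds a single region. The tuple input, the function name, the gap definition (next start minus previous end, 0 for overlaps), and the input-order output are assumptions, not the gtars implementation.

```rust
use std::collections::HashMap;

/// Distance from each region to its closest neighbor on the same chromosome.
/// Output has the same length and order as `regions`; `u32::MAX` marks
/// regions with no neighbor.
pub fn nearest_neighbors(regions: &[(&str, u32, u32)]) -> Vec<u32> {
    // Group input indices by chromosome so results can be written back in place.
    let mut by_chr: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, (chr, _, _)) in regions.iter().enumerate() {
        by_chr.entry(*chr).or_default().push(i);
    }

    // Start from the sentinel; lone regions are never overwritten.
    let mut out = vec![u32::MAX; regions.len()];
    for idxs in by_chr.values_mut() {
        idxs.sort_by_key(|&i| (regions[i].1, regions[i].2));
        // Compare each region with its immediate neighbors in sorted order.
        for w in idxs.windows(2) {
            let (a, b) = (w[0], w[1]);
            let gap = regions[b].1.saturating_sub(regions[a].2);
            out[a] = out[a].min(gap);
            out[b] = out[b].min(gap);
        }
    }
    out
}
```

Because the output vector is pre-filled and indexed by input position, callers can zip it with the original regions without checking for skipped chromosomes.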

Comment on lines +172 to +183
// Median of absolute distances (for the scalar summary)
let median_tss_dist = tss_distances.as_ref().map(|dists| {
    let mut sorted: Vec<f64> = dists.iter().map(|&d| (d as f64).abs()).collect();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    if n == 0 {
        0.0
    } else if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    } else {
        sorted[n / 2]
    }

Copilot AI Mar 6, 2026

tss_distances can include the i64::MAX sentinel (used when a chromosome has no features). The current median calculation treats that as a real distance, which will skew median_tss_dist and also emits huge values into JSON. Consider filtering out the sentinel for summary stats and serializing missing distances as null (or omitting them) to avoid incorrect results and downstream JSON/JS precision issues.

Suggested change
// Median of absolute distances (for the scalar summary)
let median_tss_dist = tss_distances.as_ref().map(|dists| {
    let mut sorted: Vec<f64> = dists.iter().map(|&d| (d as f64).abs()).collect();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    if n == 0 {
        0.0
    } else if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    } else {
        sorted[n / 2]
    }

// Median of absolute distances (for the scalar summary), ignoring sentinel values.
let median_tss_dist = tss_distances.as_ref().and_then(|dists| {
    // Filter out sentinel values (i64::MAX) that indicate missing distances.
    let mut sorted: Vec<f64> = dists
        .iter()
        .filter(|&&d| d != i64::MAX)
        .map(|&d| (d as f64).abs())
        .collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    let median = if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    } else {
        sorted[n / 2]
    };
    Some(median)

Member

Fixed in 8e11d8f — extracted a median_abs_distance() utility that filters i64::MAX sentinels before computing the median. Returns None when all values are sentinels.
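A standalone version of such a helper might look like the following. The name `median_abs_distance` comes from the reply above, but the signature and body here are a guess at its shape rather than the code in commit 8e11d8f.

```rust
/// Median of absolute distances, ignoring `i64::MAX` sentinels.
/// Returns `None` when the input is empty or contains only sentinels.
pub fn median_abs_distance(dists: &[i64]) -> Option<f64> {
    let mut v: Vec<f64> = dists
        .iter()
        .filter(|&&d| d != i64::MAX) // drop "no feature on this chromosome" markers
        .map(|&d| (d as f64).abs())
        .collect();
    if v.is_empty() {
        return None;
    }
    // `total_cmp` gives a total order on f64, so no `unwrap` is needed.
    v.sort_by(|a, b| a.total_cmp(b));
    let n = v.len();
    Some(if n % 2 == 0 {
        (v[n / 2 - 1] + v[n / 2]) / 2.0
    } else {
        v[n / 2]
    })
}
```

Returning `Option<f64>` lets a JSON serializer emit `null` for the all-sentinel case instead of a huge number.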

nsheff and others added 6 commits March 6, 2026 08:58
- Use correct file extension (.bedgraph vs .wig) based on output format
- Update --dense help text to reflect actual default value of 100
Bug fixes:
- calc_nearest_neighbors: push u32::MAX sentinel for lone-region
  chromosomes instead of skipping (fixes short vector misalignment)
- handlers.rs: filter i64::MAX sentinels before computing median TSS
  distance (extract median_abs_distance utility)
- partition_genome_into_bins: clamp bin_size to at least 1 to prevent
  infinite loop / OOM when n_bins > chrom_max_length
- region_distribution_with_bins: clamp region_length to at least 1 to
  prevent divide-by-zero on 1bp bins
- Remove unused region_distribution() convenience method (hardcoded 250)
- Fix duplicate roxygen for regionset_length (remove from genomicdist.R)
- Add Debug derive to Dinucleotide enum
- Regenerate R binding docs via rextendr::document()

Test coverage (33 new tests):
- models.rs: Strand, Dinucleotide (all variants, case insensitive,
  invalid, round-trip), SortedRegionSet, StrandedRegionSet,
  GenomeAssembly (load, seq, errors, TryFrom variants), TssIndex
  sentinel behavior
- statistics.rs: calc_gc_content, calc_dinucl_freq,
  calc_dinucl_freq_per_region (using tests/data/fasta/base.fa),
  nearest_neighbors regression test
- utils.rs: partition_genome_into_bins (including OOM regression),
  chrom_karyotype_key, median_abs_distance (sentinel filtering,
  edge cases)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
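The bin-size clamp listed among the bug fixes is easy to motivate with a toy version of the binning loop. The single-chromosome signature and the function name are simplifications assumed here; the real `partition_genome_into_bins` works on a whole genome.

```rust
/// Split `[0, chrom_len)` into consecutive bins of roughly `chrom_len / n_bins`.
pub fn partition_into_bins(chrom_len: u32, n_bins: u32) -> Vec<(u32, u32)> {
    if chrom_len == 0 || n_bins == 0 {
        return Vec::new();
    }
    // Integer division yields 0 when n_bins > chrom_len. Without `.max(1)`
    // the loop below would never advance and would push bins until memory runs out.
    let bin_size = (chrom_len / n_bins).max(1);
    let mut bins = Vec::new();
    let mut start = 0u32;
    while start < chrom_len {
        let end = start.saturating_add(bin_size).min(chrom_len);
        bins.push((start, end));
        start = end;
    }
    bins
}
```

Asking for 100 bins on a 10 bp chromosome now yields ten 1 bp bins instead of looping forever.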
…date README

- Refactor SignalMatrix::load_bin to delegate to load_bin_from_bytes for WASM/in-memory use
- Add SignalMatrix.fromBin(bytes) to WASM bindings
- Add GDA and SignalMatrix binary loading functions to R bindings (load_gda_bin,
  gda_gene_model, gda_partition_list, gda_tss_index, load_signal_matrix_bin,
  load_signal_matrix_tsv, calc_summary_signal_from_matrix)
- Rewrite region_distribution_with_bins to use midpoint bin assignment (matches R GenomicDistributions)
- Fix empty RegionSet panics in region_distribution_with_bins and partition_genome_into_bins
- Fix GC content NaN on zero-length regions
- Fix division by zero when n_bins=0
- Add 15 new tests (edge cases, shift-invariance, total count conservation)
- Update README with complete CLI reference for all subcommands and flags
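
Midpoint bin assignment, as described for region_distribution_with_bins, could look roughly like this. `midpoint_bin` and `count_by_midpoint` are hypothetical names, and clamping out-of-range midpoints into the last bin is an assumption of this sketch rather than documented gtars behavior.

```rust
/// Hypothetical: index of the bin containing the midpoint of `[start, end)`.
fn midpoint_bin(start: u32, end: u32, bin_size: u32) -> u32 {
    // Guard against a zero bin size (divide-by-zero).
    let bin_size = bin_size.max(1);
    // Overflow-safe midpoint; a zero-width region maps to its start.
    let mid = start + end.saturating_sub(start) / 2;
    mid / bin_size
}

/// Hypothetical: count regions per bin, each region counted exactly once.
fn count_by_midpoint(regions: &[(u32, u32)], bin_size: u32, n_bins: usize) -> Vec<u32> {
    let mut counts = vec![0u32; n_bins];
    if n_bins == 0 {
        return counts; // nothing to fill when no bins are requested
    }
    for &(start, end) in regions {
        // Assumption: midpoints past the last bin are clamped into it.
        let idx = (midpoint_bin(start, end, bin_size) as usize).min(n_bins - 1);
        counts[idx] += 1;
    }
    counts
}
```

Because every region lands in exactly one bin, the counts always sum to the number of input regions, which is the property the total count conservation tests check.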

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use owned io::stdout() instead of io::stdout().lock() to avoid
storing a borrowed StdoutLock in Box<dyn Write>. Move the separator
write after output creation to use the same writer.
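
The writer pattern described here can be sketched as follows. `open_output` and `write_section` are hypothetical names, not the actual CLI handlers; the point is only that `io::stdout()` returns an owned handle that implements `Write`, so it can be boxed next to a `File` without holding a lock guard.

```rust
use std::fs::File;
use std::io::{self, Write};

/// Hypothetical helper: choose a file or stdout as the output sink.
fn open_output(path: Option<&str>) -> io::Result<Box<dyn Write>> {
    let writer: Box<dyn Write> = match path {
        Some(p) => Box::new(File::create(p)?),
        // Owned `Stdout` handle; it locks internally on each write,
        // so no `StdoutLock` guard has to be stored in the box.
        None => Box::new(io::stdout()),
    };
    Ok(writer)
}

/// Hypothetical helper: the separator goes through the same writer as the
/// body, so both end up in the same destination and in order.
fn write_section(out: &mut dyn Write, body: &str) -> io::Result<()> {
    writeln!(out, "---")?;
    writeln!(out, "{body}")?;
    out.flush()
}
```

A caller would create the writer once with `open_output(path)?` and then pass `&mut *out` to each `write_section` call.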

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests:
- signal.rs: truncated binary rejection, empty TSV error
- asset.rs: truncated GDA binary rejection
- models.rs: TssIndex::try_from(&Path) and invalid path error

Codecov: exclude gtars-wasm, gtars-r, and gtars-cli from coverage
report. These are thin bindings/CLI handlers that can't be exercised
by cargo test and were dragging down patch coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio
Member

Coverage fix (22ada89): The codecov patch check was failing at 58.50% against a 58.80% target. The main drag was ~1800 lines of WASM/R/CLI binding code that can't be exercised by cargo test but was included in the lcov report.

Fix: added --ignore-filename-regex to the cargo llvm-cov step in the codecov workflow to exclude gtars-wasm/src, gtars-r/src/rust, and gtars-cli/src. These are thin bindings and CLI handlers; the library crates where the logic lives still get full coverage tracking.

Also added a few substantive tests for error handling paths (truncated binary files, empty TSV, invalid Path for TssIndex).

sanghoonio and others added 3 commits March 6, 2026 13:46
Expose genomicdist functionality to Python consumers (bedboss, etc.)
so they can call the library directly instead of shelling out to the CLI.

New classes: GenomicDistAnnotation, GeneModel, PartitionList, SignalMatrix
New RegionSet methods: widths, neighbor_distances, nearest_neighbors,
  distribution, trim, promoters, reduce, setdiff, pintersect, concat,
  union, jaccard
New TssIndex methods: feature_distances, from_regionset
New functions: calc_partitions, calc_expected_partitions,
  calc_summary_signal, median_abs_distance, consensus

Includes type stubs and 59 pytest tests covering all new bindings.
Also excludes gtars-python from codecov (PyO3 bindings can't be
exercised via cargo test).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace raw externalptr + gtars_* prefixed functions with a proper S4
RegionSet class. Key changes:

- RegionSet S4 class with ptr and strand slots, constructors from
  GRanges, data.frame, file path, and externalptr
- S4 methods: widths, neighborDistances, nearestNeighbors, reduce,
  union, setdiff, trim, promoters, pintersect, concat, jaccard, etc.
- Strand preserved from GRanges/data.frame input, round-trips correctly
- Import reduce/promoters/trim from IRanges for proper dispatch
- Unified genomePartitionList to use strand from RegionSet when available
- calc* wrappers retained for GenomicDistributions drop-in compatibility
- Tests (56 passing) and tutorial Rmds for S4 class and drop-in workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add strands field (Vec<String>) to PyRegionSet with the same semantics
as the R S4 class: preserved through positional ops (promoters, concat,
pintersect), reset to "*" through merge ops (reduce, union, setdiff,
trim). New from_vectors constructor accepts optional strands parameter.
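
The strand semantics can be illustrated with a small sketch. `StrandedRegions` is a hypothetical stand-in for PyRegionSet, and the `promoters` here is deliberately simplified (it ignores strand direction and only shows that the strands vector is carried through), while `reduce` shows the reset to "*". Merging of touching intervals is also an assumption.

```rust
/// Hypothetical stand-in for a region set with a parallel strands vector.
#[derive(Debug, Clone, PartialEq)]
struct StrandedRegions {
    regions: Vec<(String, u32, u32)>, // (chrom, start, end)
    strands: Vec<String>,             // "+", "-", or "*", one per region
}

impl StrandedRegions {
    /// Positional op: one output region per input, so strands are preserved.
    /// Simplified: windows are taken around `start` regardless of strand.
    fn promoters(&self, upstream: u32, downstream: u32) -> Self {
        let regions = self
            .regions
            .iter()
            .map(|(chr, start, _end)| {
                // Saturate at 0 rather than underflowing.
                let s = start.saturating_sub(upstream);
                (chr.clone(), s, start.saturating_add(downstream))
            })
            .collect();
        Self { regions, strands: self.strands.clone() }
    }

    /// Merge op: inputs collapse into fewer regions, so per-region strand
    /// is no longer meaningful and every output strand becomes "*".
    fn reduce(&self) -> Self {
        let mut sorted = self.regions.clone();
        sorted.sort();
        let mut merged: Vec<(String, u32, u32)> = Vec::new();
        for (chr, start, end) in sorted {
            let mut absorbed = false;
            if let Some(last) = merged.last_mut() {
                // Overlapping or touching intervals on the same chromosome.
                if last.0 == chr && start <= last.2 {
                    last.2 = last.2.max(end);
                    absorbed = true;
                }
            }
            if !absorbed {
                merged.push((chr, start, end));
            }
        }
        let strands = vec!["*".to_string(); merged.len()];
        Self { regions: merged, strands }
    }
}
```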

Also comment out flaky HTTP-based test_mean_region_width.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio
Member

@nleroy917 Python bindings for gtars-genomicdist added, but not yet tested manually.

@nleroy917
Member

@sanghoonio nice. Would be cool to get these into [celltype](https://github.com/celltype/cli) at some point, if for no other reason than visibility. We could add some of the basic functions as tools in their toolkit.

@nsheff nsheff mentioned this pull request Mar 6, 2026
1 task

4 participants