Add extra exploratory figures, add Argelaguet, join simulations to the workflow, fix yamet infs #70
…lation via scaling/regressing out
…laguet's download
…ker if flag present
Pull request overview
This PR integrates additional analyses into the main Snakemake workflow by adding the Argelaguet (2019) gastrulation dataset, migrating/connecting the simulations pipeline under workflow/, and introducing new exploratory embedding/figure outputs with coordinate pre-validation steps.
Changes:
- Add Argelaguet gastrulation download → harmonization → yamet runs → HTML report, and wire it into rule all.
- Move/merge the simulations pipeline into workflow/ (rules, conda envs, parameters, docs) and add simulation HTML targets to the main workflow.
- Add coordinate pre-validation rules/scripts and new exploratory embedding/figure panels (CRC embeddings; methylation-bin stratified plots; shared embedding utilities).
Reviewed changes
Copilot reviewed 35 out of 53 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| workflow/Snakefile | Includes new Argelaguet/simulations/validate rules and adds new all targets. |
| workflow/simulations/within_cell.md | Adds narrative documentation for within-cell simulation design. |
| workflow/simulations/README.md | Updates simulations README to point to workflow/ and local docs. |
| workflow/simulations/input/parameters.tsv | Adds parameter grid for coverage/NA simulations. |
| workflow/simulations/feat_diff.md | Adds documentation for feature differentiation simulation. |
| workflow/simulations/between_cell.md | Adds documentation for between-cell simulation design. |
| workflow/rules/validate.smk | New coordinate validation rules producing coords_validated.flag. |
| workflow/rules/src/validate_yamet_coords.sh | New shell script to check harmonized coords vs reference. |
| workflow/rules/src/simulations_within_cell_getSamples.R | New within-cell simulation sample generator. |
| workflow/rules/src/simulations_within_cell_createTemplate.R | New within-cell template/reference generator. |
| workflow/rules/src/simulations_figure2.Rmd | Parameterizes paths for simulation Figure 2 rendering. |
| workflow/rules/src/simulations_feat_diff_getSamples.R | New feature-diff simulation sample generator. |
| workflow/rules/src/simulations_feat_diff_createTemplate.R | New feature-diff template/reference generator. |
| workflow/rules/src/simulations_between_cell_getSamples.R | New between-cell simulation sample generator. |
| workflow/rules/src/simulations_between_cell_createTemplate.R | New between-cell template/reference generator. |
| workflow/rules/src/simulations_08_combined_figure_adj.Rmd | Updates paths/logging and adds methylation-bin panels. |
| workflow/rules/src/simulations_07_combined_figure.Rmd | Updates paths/logging for combined simulation figure. |
| workflow/rules/src/simulations_06_between_cell.Rmd | Updates paths/logging and output column selection. |
| workflow/rules/src/simulations_05_within_cell.Rmd | Updates paths/logging and output column selection. |
| workflow/rules/src/simulations_04_feat_diff.Rmd | Updates sourcing/paths and adds log capture. |
| workflow/rules/src/simulations_03_yamet_plots.Rmd | Parameterizes yamet input dir and figure path. |
| workflow/rules/src/simulations_02_coverage_plots.Rmd | Parameterizes sim input/output dirs and figure path. |
| workflow/rules/src/simulations_01_sim_data.Rmd | Parameterizes inputs/outputs (lowReal dir, out dir). |
| workflow/rules/src/simPattern.R | Adds simulation helper functions for coverage-pattern generation. |
| workflow/rules/src/scmet.R | Updates column detection for per-cell outputs (now \\.tsv$). |
| workflow/rules/src/mi_table.R | Adds shared NMI table helper for simulations plots. |
| workflow/rules/src/make_grid_plots.R | Updates sourcing and output-column detection logic. |
| workflow/rules/src/embedding_utils.R | Adds shared PCA/UMAP/HVF/regression utilities for embeddings. |
| workflow/rules/src/ecker.Rmd | Adds early logging, git SHA stamping, uses shared embedding utils, adds regressed-S embedding. |
| workflow/rules/src/crc.Rmd | Adds early logging and new adjS-by-methylation-bin exploratory panels. |
| workflow/rules/src/crc_windows.Rmd | Refactors NA handling, renames variables, and clarifies SCE export section. |
| workflow/rules/src/crc_windows_sce.Rmd | Refactors to windows_sce naming and corrects labels/saves corrected SCE. |
| workflow/rules/src/crc_embeddings.Rmd | New report for CRC cell embeddings from entropy/methylation feature spaces. |
| workflow/rules/src/build_chr_cpg_ref.sh | Redirects script stdout/stderr to Snakemake log. |
| workflow/rules/src/argelaguet.Rmd | New Argelaguet gastrulation entropy report with embeddings and trajectory analyses. |
| workflow/rules/src/argelaguet_to_yamet.sh | New converter from Argelaguet methylation tables to yamet format. |
| workflow/rules/simulations.smk | New integrated simulations ruleset under the main workflow. |
| workflow/rules/mm10.smk | Adjusts mm10 reference generation and adds chr10-only aggregated reference. |
| workflow/rules/ecker.smk | Wires chr10 ref selection and coordinate validation dependency into ecker runs. |
| workflow/rules/crc.smk | Adds coordinate validation dependency and introduces CRC embeddings report rule. |
| workflow/rules/argelaguet.smk | New ruleset to download/harmonize/run/report Argelaguet dataset. |
| workflow/rules/archive.smk | Adds snapshotting for Argelaguet and simulations outputs. |
| workflow/envs/sim_yamet.yaml | Adds a conda environment for simulations yamet steps. |
| workflow/envs/sim_r_sim.yaml | Adds conda env for simulation generation scripts. |
| workflow/envs/sim_r_plot.yaml | Adds conda env for simulation plotting/scMET steps. |
| workflow/envs/sim_r_emanuel.yaml | Adds conda env for Emanuel’s coverage simulations. |
| workflow/envs/r.yml | Extends R env with embedding-related dependencies. |
| simulations/src/ggtheme.R | Removes old simulations-local ggtheme (now sourced from workflow rules). |
| simulations/Snakefile | Removes standalone simulations Snakefile (now integrated into workflow). |
| README.md | Updates repo docs to reflect simulations now live under workflow/. |
Pull request overview
Copilot reviewed 37 out of 55 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
workflow/simulations/README.md:4
- Grammar: “parameterization of one simulated datasets” should be singular (e.g., “one simulated dataset”).
workflow/simulations/README.md:19
- Spelling: “intermediatley” is misspelled (should be “intermediately”) in the transition matrix descriptions.
overlap=$(comm -12 \
    <(zcat "$cell" | head -100 | awk '{print $1"\t"$2}' | sort) \
    <(zcat "$ref" | awk '{print $1"\t"$2}' | sort) \
    | wc -l)
The validation computes overlap by sorting the entire reference (zcat "$ref" | ... | sort) every time. For whole-genome CpG references this can be very slow and memory-intensive. Consider streaming the reference once while matching against a small in-memory set of the first ~100 cell coordinates (e.g., load those 100 into an awk associative array) to avoid full-file sorting.
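The streaming approach suggested above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function name `count_overlap` is hypothetical, `gzip -dc` stands in for `zcat`, and the count is the number of distinct cell coordinates found in the reference, which can differ from the `comm -12` result when coordinates are duplicated.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the PR): count how many of the first
# 100 coordinates of a cell file occur in the reference, streaming the
# reference once instead of sorting it.
count_overlap() {
    local cell="$1" ref="$2"
    awk '
        { k = $1 "\t" $2 }                        # chr + position key
        FILENAME == ARGV[1] { seen[k] = 0; next } # first input: remember cell coords
        (k in seen) && seen[k]++ == 0 { n++ }     # reference: count each match once
        END { print n + 0 }
    ' <(gzip -dc "$cell" | head -100) <(gzip -dc "$ref")
}
```

Called as `overlap=$(count_overlap "$cell" "$ref")`, this keeps at most 100 keys in memory regardless of reference size. Comparing `FILENAME` with `ARGV[1]` rather than using the common `NR == FNR` idiom avoids treating the reference as the cell file when the cell file is empty.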
harmonized_dir="$1"
ref="$2"
The script doesn’t enable strict bash settings (set -euo pipefail). Adding them near the top will make failures in zcat/awk/sort/comm pipelines fail fast and avoid misleading “no overlap” errors when inputs are missing/corrupt.
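As an illustration of this suggestion, a strict prologue plus an explicit input check might be sketched as below. The `require_inputs` function and its messages are hypothetical and only show one way to separate "inputs are missing or corrupt" from "no overlap"; the variable names follow the snippet above.

```shell
#!/usr/bin/env bash
# Fail on errors (-e), on unset variables (-u), and when any command
# in a pipeline fails (pipefail), not only the last one.
set -euo pipefail

# Hypothetical guard (not part of the PR): verify inputs before any
# overlap is computed, so a missing or corrupt file is reported as such.
require_inputs() {
    local harmonized_dir="$1" ref="$2"
    [ -d "$harmonized_dir" ] || { echo "ERROR: not a directory: $harmonized_dir" >&2; return 1; }
    [ -s "$ref" ] || { echo "ERROR: missing or empty reference: $ref" >&2; return 1; }
    gzip -t "$ref" 2>/dev/null || { echo "ERROR: reference is not valid gzip: $ref" >&2; return 1; }
}
```

One caveat under these assumptions: `pipefail` does not cover commands inside `<( ... )` process substitutions, so a failing `zcat` there would still go unnoticed, which is one reason to check inputs up front.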