Add extra exploratory figures, add Argelaguet, join simulations to the workflow, fix yamet infs #70

Merged
imallona merged 41 commits into master from dev on Apr 8, 2026

Conversation

@imallona
Owner

No description provided.

@imallona changed the title from "Add extra exploratory figures" to "Add extra exploratory figures, add Argelaguet, join simulations to the workflow" on Mar 28, 2026

Copilot AI left a comment

Pull request overview

This PR integrates additional analyses into the main Snakemake workflow by adding the Argelaguet (2019) gastrulation dataset, migrating/connecting the simulations pipeline under workflow/, and introducing new exploratory embedding/figure outputs with coordinate pre-validation steps.

Changes:

  • Add Argelaguet gastrulation download → harmonization → yamet runs → HTML report, and wire it into rule all.
  • Move/merge the simulations pipeline into workflow/ (rules, conda envs, parameters, docs) and add simulation HTML targets to the main workflow.
  • Add coordinate pre-validation rules/scripts and new exploratory embedding/figure panels (CRC embeddings; methylation-bin stratified plots; shared embedding utilities).

Reviewed changes

Copilot reviewed 35 out of 53 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| workflow/Snakefile | Includes new Argelaguet/simulations/validate rules and adds new all targets. |
| workflow/simulations/within_cell.md | Adds narrative documentation for within-cell simulation design. |
| workflow/simulations/README.md | Updates simulations README to point to workflow/ and local docs. |
| workflow/simulations/input/parameters.tsv | Adds parameter grid for coverage/NA simulations. |
| workflow/simulations/feat_diff.md | Adds documentation for feature differentiation simulation. |
| workflow/simulations/between_cell.md | Adds documentation for between-cell simulation design. |
| workflow/rules/validate.smk | New coordinate validation rules producing coords_validated.flag. |
| workflow/rules/src/validate_yamet_coords.sh | New shell script to check harmonized coords vs reference. |
| workflow/rules/src/simulations_within_cell_getSamples.R | New within-cell simulation sample generator. |
| workflow/rules/src/simulations_within_cell_createTemplate.R | New within-cell template/reference generator. |
| workflow/rules/src/simulations_figure2.Rmd | Parameterizes paths for simulation Figure 2 rendering. |
| workflow/rules/src/simulations_feat_diff_getSamples.R | New feature-diff simulation sample generator. |
| workflow/rules/src/simulations_feat_diff_createTemplate.R | New feature-diff template/reference generator. |
| workflow/rules/src/simulations_between_cell_getSamples.R | New between-cell simulation sample generator. |
| workflow/rules/src/simulations_between_cell_createTemplate.R | New between-cell template/reference generator. |
| workflow/rules/src/simulations_08_combined_figure_adj.Rmd | Updates paths/logging and adds methylation-bin panels. |
| workflow/rules/src/simulations_07_combined_figure.Rmd | Updates paths/logging for combined simulation figure. |
| workflow/rules/src/simulations_06_between_cell.Rmd | Updates paths/logging and output column selection. |
| workflow/rules/src/simulations_05_within_cell.Rmd | Updates paths/logging and output column selection. |
| workflow/rules/src/simulations_04_feat_diff.Rmd | Updates sourcing/paths and adds log capture. |
| workflow/rules/src/simulations_03_yamet_plots.Rmd | Parameterizes yamet input dir and figure path. |
| workflow/rules/src/simulations_02_coverage_plots.Rmd | Parameterizes sim input/output dirs and figure path. |
| workflow/rules/src/simulations_01_sim_data.Rmd | Parameterizes inputs/outputs (lowReal dir, out dir). |
| workflow/rules/src/simPattern.R | Adds simulation helper functions for coverage-pattern generation. |
| workflow/rules/src/scmet.R | Updates column detection for per-cell outputs (now \\.tsv$). |
| workflow/rules/src/mi_table.R | Adds shared NMI table helper for simulations plots. |
| workflow/rules/src/make_grid_plots.R | Updates sourcing and output-column detection logic. |
| workflow/rules/src/embedding_utils.R | Adds shared PCA/UMAP/HVF/regression utilities for embeddings. |
| workflow/rules/src/ecker.Rmd | Adds early logging, git SHA stamping, uses shared embedding utils, adds regressed-S embedding. |
| workflow/rules/src/crc.Rmd | Adds early logging and new adjS-by-methylation-bin exploratory panels. |
| workflow/rules/src/crc_windows.Rmd | Refactors NA handling, renames variables, and clarifies SCE export section. |
| workflow/rules/src/crc_windows_sce.Rmd | Refactors to windows_sce naming and corrects labels/saves corrected SCE. |
| workflow/rules/src/crc_embeddings.Rmd | New report for CRC cell embeddings from entropy/methylation feature spaces. |
| workflow/rules/src/build_chr_cpg_ref.sh | Redirects script stdout/stderr to Snakemake log. |
| workflow/rules/src/argelaguet.Rmd | New Argelaguet gastrulation entropy report with embeddings and trajectory analyses. |
| workflow/rules/src/argelaguet_to_yamet.sh | New converter from Argelaguet methylation tables to yamet format. |
| workflow/rules/simulations.smk | New integrated simulations ruleset under the main workflow. |
| workflow/rules/mm10.smk | Adjusts mm10 reference generation and adds chr10-only aggregated reference. |
| workflow/rules/ecker.smk | Wires chr10 ref selection and coordinate validation dependency into ecker runs. |
| workflow/rules/crc.smk | Adds coordinate validation dependency and introduces CRC embeddings report rule. |
| workflow/rules/argelaguet.smk | New ruleset to download/harmonize/run/report Argelaguet dataset. |
| workflow/rules/archive.smk | Adds snapshotting for Argelaguet and simulations outputs. |
| workflow/envs/sim_yamet.yaml | Adds a conda environment for simulations yamet steps. |
| workflow/envs/sim_r_sim.yaml | Adds conda env for simulation generation scripts. |
| workflow/envs/sim_r_plot.yaml | Adds conda env for simulation plotting/scMET steps. |
| workflow/envs/sim_r_emanuel.yaml | Adds conda env for Emanuel’s coverage simulations. |
| workflow/envs/r.yml | Extends R env with embedding-related dependencies. |
| simulations/src/ggtheme.R | Removes old simulations-local ggtheme (now sourced from workflow rules). |
| simulations/Snakefile | Removes standalone simulations Snakefile (now integrated into workflow). |
| README.md | Updates repo docs to reflect simulations now live under workflow/. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 37 out of 55 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

workflow/simulations/README.md:4

  • Grammar: “parameterization of one simulated datasets” should be singular (e.g., “one simulated dataset”).

workflow/simulations/README.md:19

  • Spelling: “intermediatley” is misspelled (should be “intermediately”) in the transition matrix descriptions.

Comment on lines +35 to +38
overlap=$(comm -12 \
<(zcat "$cell" | head -100 | awk '{print $1"\t"$2}' | sort) \
<(zcat "$ref" | awk '{print $1"\t"$2}' | sort) \
| wc -l)

Copilot AI Apr 4, 2026

The validation computes overlap by sorting the entire reference (zcat "$ref" | ... | sort) every time. For whole-genome CpG references this can be very slow and memory-intensive. Consider streaming the reference once while matching against a small in-memory set of the first ~100 cell coordinates (e.g., load those 100 into an awk associative array) to avoid full-file sorting.
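
A minimal sketch of the streaming approach suggested here, assuming GNU `zcat`, bash process substitution, and a POSIX `awk`. The helper name `count_overlap` and its optional third argument are hypothetical and not part of this PR; the column layout (chromosome in column 1, position in column 2) is taken from the quoted lines. It loads the first N cell coordinates into an awk associative array and then scans the reference once, so nothing is sorted.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the PR): count how many of the first N
# coordinates of a gzipped cell file also occur in a gzipped reference.
# Usage: count_overlap CELL_GZ REF_GZ [N]   (N defaults to 100)
count_overlap() {
    local cell="$1" ref="$2" n="${3:-100}"
    zcat "$ref" | awk -v keys=<(zcat "$cell" | head -n "$n") '
        BEGIN {
            # Load the first N cell coordinates as "chrom<TAB>pos" keys.
            while ((getline line < keys) > 0) {
                split(line, f)
                want[f[1] "\t" f[2]] = 1
            }
        }
        # Stream the reference; count each wanted key the first time it is seen.
        ($1 "\t" $2) in want { hits++; delete want[$1 "\t" $2] }
        END { print hits + 0 }
    '
}
```

Called as `overlap=$(count_overlap "$cell" "$ref")`, this could stand in for the `comm -12` pipeline. Memory use is bounded by N keys rather than by the size of the reference, and for coordinate lists without duplicate rows the count should agree with the sorted `comm` result.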

Comment on lines +13 to +15
harmonized_dir="$1"
ref="$2"

Copilot AI Apr 4, 2026

The script doesn’t enable strict bash settings (set -euo pipefail). Adding them near the top will make failures in zcat/awk/sort/comm pipelines fail fast and avoid misleading “no overlap” errors when inputs are missing/corrupt.
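
A small illustration of why `pipefail` matters for this kind of pipeline, assuming bash and GNU `zcat`. The function `count_coords` is a hypothetical stand-in, not code from `validate_yamet_coords.sh`. Without `pipefail`, a failing `zcat` is masked by the exit status of the last stage and the pipeline reports 0 rows as if it had succeeded; with it, the failure propagates to the caller.

```shell
#!/usr/bin/env bash
# Strict mode: -e exits on an unhandled error, -u rejects unset variables,
# and pipefail makes a pipeline fail if any stage fails, not only the last.
set -euo pipefail

# Hypothetical helper (not from the PR): count distinct coordinates
# (columns 1 and 2) in a gzipped table. Under pipefail, a missing or
# corrupt input makes this return non-zero instead of quietly printing 0.
count_coords() {
    local file="$1"
    zcat "$file" | awk '{print $1"\t"$2}' | sort -u | wc -l
}
```

A caller can then distinguish "no coordinates" from "input could not be read", for example with `if ! n=$(count_coords "$ref"); then echo "cannot read $ref" >&2; exit 1; fi`.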

@imallona changed the title from "Add extra exploratory figures, add Argelaguet, join simulations to the workflow" to "Add extra exploratory figures, add Argelaguet, join simulations to the workflow, fix yamet infs" on Apr 4, 2026
@imallona merged commit 210feca into master on Apr 8, 2026
4 checks passed
2 participants