First, after cloning this repo, create a conda environment that includes Snakemake (and, optionally, the SLURM executor for job submission on a computing cluster).
You can use the `SETUP.yaml` file provided in this repo to make that environment: `conda env create -f SETUP.yaml`
Second, make `SETUP-pull.sh` executable with something like `chmod +x SETUP-pull.sh`, then run it with `./SETUP-pull.sh`
to download the larger attachments and Singularity images required to run the pipeline.
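Taken together, the setup steps above look like this when run from the repo root (the environment name is whatever `SETUP.yaml` defines):

```shell
# Create the conda environment containing Snakemake
conda env create -f SETUP.yaml

# Make the pull script executable, then fetch the large attachments
# and Singularity images the pipeline needs
chmod +x SETUP-pull.sh
./SETUP-pull.sh
```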
This directory has the following subdirectories:
- `envs` contains specs for all conda environments required by rules inside the Snakefile.
- `scripts` contains all bash/python/R scripts called in the Snakefile.
- `slurm_general` houses a config file that makes the Snakefile use SLURM-specific memory allocations.
- Optionally, the directory containing .bam Revio reads (e.g. `hifi_reads/`) can be located in this directory or somewhere else; the path must be included in the `config-ref.yaml` file (or whatever config file is sourced).
Runtime-generated directories:
- the `logs` directory contains rule-specific logs, which are created at runtime.
- `analysis` may not exist yet, but is where all draft genomes and stats will be written.
- a directory named `busco_downloads` will similarly appear after the pipeline is run.
The following files must also be present at runtime:
- `config-ref.yaml` provides user-adjustable parameters for running the workflow, e.g. the path to .bam read files. The name of this file can be changed, but it would then also need to be changed in the Snakefile header.
- the `Snakefile` contains all the rule chains for running the workflow.
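For orientation, a config file might look like the sketch below. The key name and value here are hypothetical placeholders, not the pipeline's actual schema; check the Snakefile header for the keys it actually reads.

```yaml
# Hypothetical example only -- key names must match what the Snakefile expects
reads_dir: "hifi_reads/"   # path to the directory of .bam Revio reads
```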
Optional files and folders:
- At runtime, if the SLURM profile is used, a top-level log of rule order and outcomes is written to a file named with syntax like `assembly-test.jobnumber.out`.
- I keep a file at the top of the directory with the original .bam file names and the samples they correspond to.
- The script `ref-slurm-wrap.sh` can be modified for use as a job submission wrapper for SLURM.
You will first need to modify the config file you are using (e.g. `config-ref.yaml`) to contain paths to your reads, etc.
Even when launching interactively, I strongly recommend running Snakemake with SLURM for memory allocation: `snakemake --workflow-profile slurm_general`
If you do not use SLURM, make sure to request conda env usage (`snakemake --use-conda`) and add whatever other parameters you would like.
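A typical launch sequence might look like the following sketch; `-n` (dry run) and `--cores` are standard Snakemake flags, and the core count is only illustrative:

```shell
# Preview the jobs Snakemake would run without changing anything on disk
snakemake -n --workflow-profile slurm_general

# Launch for real with the SLURM profile
snakemake --workflow-profile slurm_general

# Or, without SLURM: run locally with conda envs and an explicit core count
snakemake --use-conda --cores 8
```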
If you are interested in sex chromosome identification in Lepidoptera: I used nucmer to identify potential Z/W contigs against a custom Z/W chromosome database I built from publicly deposited Z/W chromosomes. This is not part of the Snakemake pipeline, but the materials are in the `sex_chrom` folder.
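A minimal nucmer run for this kind of screen might look like the sketch below. The filenames `zw_database.fasta` and `assembly.fasta` are hypothetical placeholders; `--prefix` and `show-coords -rcl` are standard MUMmer options.

```shell
# Align the draft assembly against the custom Z/W chromosome database
nucmer --prefix zw_hits zw_database.fasta assembly.fasta

# Tabulate alignments; contigs with long, high-identity hits to Z/W
# sequences are candidate sex chromosome contigs
show-coords -rcl zw_hits.delta > zw_hits.coords
```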
If you are using a computing cluster, make sure that the python being used comes from your active conda env (or the env of the rule in question), rather than from your cluster's system-wide Anaconda install. This fixed many of my problems.
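A quick way to check which interpreter is being picked up (the printed path should point into your conda env, not the cluster's Anaconda module):

```shell
# Show which python3 is first on PATH
command -v python3

# Show the prefix of the interpreter actually running
python3 -c "import sys; print(sys.prefix)"
```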
Feel free to modify the slurm_general config to work for your genomes (e.g. runtime, restart-times, etc).
The run order and tool selection are largely inspired by Nicholas Kron's R code for FOpsBet2.1, adapted for Snakemake. Tools involved in this pipeline are the following:
Raw read QC: GenomeScope2, Jellyfish
Assembly: hifiasm
Assembly QC: gfastats, seqstat, meryl/merqury, BUSCO
- https://github.com/vgl-hub/gfastats
- https://genometools.org/tools.html (seqstat)
- https://github.com/marbl/meryl
- https://github.com/marbl/merqury
- https://gitlab.com/ezlab/busco
Haplotig purging: purge_dups
Error correction: inspector
Contig linking with raw read overlap: ntLink
Mitochondrial contig identification: mitoFinder
- https://github.com/RemiAllio/MitoFinder?tab=readme-ov-file
- Note that I used MitoFinder instead of MitoHiFi because I was having runtime issues with MitoHiFi.
Eukaryotic contaminant ID: Kraken2
Sequencing adapter contaminant ID: ncbi-adaptor
More QC: QUAST
Comparison to reference: mummer4/nucmer
Other tools associated with this publication:
- BlobToolkit2 for snail plots: https://blobtoolkit.genomehubs.org/blobtools2/
- lep_busco_painter from Charlotte Wright: https://github.com/charlottewright/lep_busco_painter/tree/main
- ntSynt and ntSynt-viz: https://github.com/BirolLab/ntSynt/blob/main/README.md
This dataset also includes unpublished RepeatModeler/RepeatMasker-annotated genomes and a Progressive Cactus alignment, which are not part of the publication or pipeline but are available upon request.