First, after cloning this repo, create a conda environment that includes Snakemake (and, optionally, the SLURM executor for job submission on a computing cluster).
You can use the `SETUP.yaml` file provided in this repo to make that environment: `conda env create -f SETUP.yaml`
Second, make `SETUP-pull.sh` executable with something like `chmod +x SETUP-pull.sh`, then run it with `./SETUP-pull.sh`
to download the larger attachments and Singularity images required to run the pipeline.
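Taken together, the setup steps above look like this when run from the repo root (the environment name is whatever `SETUP.yaml` defines):

```shell
# Create the conda environment containing Snakemake
conda env create -f SETUP.yaml

# Make the pull script executable, then fetch the large attachments
# and Singularity images the pipeline needs
chmod +x SETUP-pull.sh
./SETUP-pull.sh
```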
This directory has the following subdirectories:
- `envs` contains specs for all conda environments required by rules inside the Snakefile.
- `scripts` contains all bash/python/R scripts called in the Snakefile.
- `slurm_general` houses a config file that makes the Snakefile use SLURM-specific memory allocations.
- Optionally, the directory containing .bam Revio reads (e.g. `hifi_reads/`) can be located in this directory or somewhere else; the path must be included in the `config-ref.yaml` file (or whatever config file is sourced).
Runtime-generated directories:
- the `logs` directory contains rule-specific logs, which are created at runtime.
- `analysis` may not exist yet, but is where all draft genomes and stats will be written.
- a directory named `busco_downloads` will similarly appear after the pipeline is run.
The following files must also be present at runtime:
- `config-ref.yaml` provides user-adjustable parameters for running the workflow, e.g. the path to .bam read files. The name of this file can be changed, but it would then also need to be changed in the Snakefile header.
- the `Snakefile` contains all the rule chains for running the workflow.
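For orientation, a config file might look like the sketch below. The key name and value here are hypothetical placeholders, not the pipeline's actual schema; check the Snakefile header for the keys it actually reads.

```yaml
# Hypothetical example only -- key names must match what the Snakefile expects
reads_dir: "hifi_reads/"   # path to the directory of .bam Revio reads
```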
Optional files and folders:
- At runtime, if the SLURM profile is used, a top-level log of rule order and outcomes is written to a file named with syntax like `assembly-test.jobnumber.out`.
- I keep a file at the top of the directory with the original .bam file names and the samples they correspond to.
- The script `ref-slurm-wrap.sh` can be modified for use as a job submission wrapper for SLURM.
You will first need to modify the config file you are using (e.g. `config-ref.yaml`) to contain paths to your reads, etc.
Even when launching interactively, I strongly recommend running Snakemake with SLURM for memory allocation: `snakemake --workflow-profile slurm_general`
If you do not use SLURM, make sure to request conda env usage (`snakemake --use-conda`) and add whatever other parameters you would like.
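A typical launch sequence might look like the following sketch; `-n` (dry run) and `--cores` are standard Snakemake flags, and the core count is only illustrative:

```shell
# Preview the jobs Snakemake would run without changing anything on disk
snakemake -n --workflow-profile slurm_general

# Launch for real with the SLURM profile
snakemake --workflow-profile slurm_general

# Or, without SLURM: run locally with conda envs and an explicit core count
snakemake --use-conda --cores 8
```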
If you are interested in sex chromosome identification in Lepidoptera: I used nucmer to identify potential Z/W contigs against a custom Z/W chromosome database I built from publicly deposited Z/W chromosomes. This is not part of the Snakemake pipeline, but the materials are in the `sex_chrom` folder.
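A minimal nucmer run for this kind of screen might look like the sketch below. The filenames `zw_database.fasta` and `assembly.fasta` are hypothetical placeholders; `--prefix` and `show-coords -rcl` are standard MUMmer options.

```shell
# Align the draft assembly against the custom Z/W chromosome database
nucmer --prefix zw_hits zw_database.fasta assembly.fasta

# Tabulate alignments; contigs with long, high-identity hits to Z/W
# sequences are candidate sex chromosome contigs
show-coords -rcl zw_hits.delta > zw_hits.coords
```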
If you are using a computing cluster, make sure that the python being used comes from your active conda env (or the env of the rule in question), rather than from your cluster's system-wide Anaconda install. This fixed many of my problems.
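A quick way to check which interpreter is being picked up (the printed path should point into your conda env, not the cluster's Anaconda module):

```shell
# Show which python3 is first on PATH
command -v python3

# Show the prefix of the interpreter actually running
python3 -c "import sys; print(sys.prefix)"
```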
Feel free to modify the slurm_general config to work for your genomes (e.g. runtime, restart-times, etc).
The run order and tool selection are largely inspired by Nicholas Kron's R code for FOpsBet2.1, adapted for Snakemake. Tools involved in this pipeline are the following:
Raw read QC: GenomeScope2, Jellyfish
Assembly: hifiasm
Assembly QC: gfastats, seqstat, meryl/merqury, BUSCO
- https://github.com/vgl-hub/gfastats
- https://genometools.org/tools.html (seqstat)
- https://github.com/marbl/meryl
- https://github.com/marbl/merqury
- https://gitlab.com/ezlab/busco
Haplotig purging: purge_dups
Error correction: inspector
Contig linking with raw read overlap: ntLink
Mitochondrial contig identification: mitoFinder
- https://github.com/RemiAllio/MitoFinder?tab=readme-ov-file
- Note that I used MitoFinder instead of MitoHiFi because I was having runtime issues with MitoHiFi.
Eukaryotic contaminant ID: Kraken2
Sequencing adapter contaminant ID: ncbi-adaptor
More QC: QUAST
Comparison to reference: mummer4/nucmer
Other tools associated with this publication:
- BlobToolkit2 for snail plots: https://blobtoolkit.genomehubs.org/blobtools2/
- lep_busco_painter from Charlotte Wright: https://github.com/charlottewright/lep_busco_painter/tree/main
- ntSynt and ntSynt-viz: https://github.com/BirolLab/ntSynt/blob/main/README.md
This dataset also includes unpublished RepeatModeler/RepeatMasker-annotated genomes and a Progressive Cactus alignment, which are not part of the publication or pipeline but are available upon request.