amackays/assembly-pri
Setting up this Pipeline

First, after cloning this repo, you will need a conda environment that provides Snakemake (and, optionally, SLURM integration for job submission on a computing cluster). You can use the .yaml file provided in this repo to create that environment: conda env create -f SETUP.yaml

Second, make the SETUP-pull.sh file executable with chmod +x SETUP-pull.sh, then run it with ./SETUP-pull.sh to download the larger attachments and Singularity images required to run the pipeline.

Structure of this Repo

This directory has the following subdirectories:

  • the directory envs contains specs for all conda environments required by rules inside the Snakefile.
  • the directory scripts contains all bash/python/R scripts called in the Snakefile.
  • the directory slurm_general houses a config file to make the Snakefile use SLURM-specific memory allocations.
  • optionally, the directory containing .bam Revio reads, e.g. hifi_reads/, can be located in this directory or somewhere else (the path must be included in the config-ref.yaml file, or whatever config file is sourced).

Runtime-generated directories:

  • the logs directory contains rule-specific logs which are created at runtime.
  • the analysis directory may not exist yet, but it is where all draft genomes and stats will be written.
  • a directory named busco_downloads will similarly appear after the pipeline is run.

The following files are also required to be present at runtime:

  • config-ref.yaml provides user-adjustable parameters for running the workflow, e.g. the path to .bam read files. The name of this file can be changed, but the change must also be made in the Snakefile header.
  • the Snakefile contains all the rule chains for running the workflow.
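To illustrate the shape of that config file, a minimal sketch is shown below. The key names here are hypothetical, not the pipeline's actual parameter names; check the Snakefile header for the keys it actually reads.

```yaml
# Hypothetical sketch of config-ref.yaml -- key names are illustrative only;
# the real parameter names are defined by the Snakefile.
reads_dir: "hifi_reads/"          # path to the directory of .bam Revio reads
samples:
  - sample1
  - sample2
busco_lineage: "lepidoptera_odb10"  # example BUSCO lineage dataset name
```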

Optional files and folders:

  • at runtime, if the SLURM profile is used, a top-level log of rule order and outcomes is written to a file named with a pattern like assembly-test.jobnumber.out.
  • I keep a file at the top of the directory with the original .bam file names and the samples they correspond to.
  • the script ref-slurm-wrap.sh can be modified for use as a job submission wrapper for SLURM.

To Run the Pipeline

You will first need to modify the config file you are using (e.g. config-ref.yaml) to contain paths to your reads, etc.

If launching interactively, I still strongly recommend running Snakemake through SLURM for memory allocation: snakemake --workflow-profile slurm_general

If you do not use SLURM, make sure to enable conda environments with snakemake --use-conda, adding whatever other parameters you would like.

Optional

If you are interested in sex chromosome identification in Lepidoptera: I used nucmer to flag potential Z/W contigs against a custom Z/W chromosome database I built from publicly deposited Z/W chromosomes. This step is not part of the Snakemake pipeline, but the materials in the sex_chrom folder can be used for it.
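As a hedged sketch of that approach (the file names below are placeholders, not the repo's actual paths), a nucmer comparison against a Z/W database might look like:

```shell
# Align assembly contigs against a Z/W chromosome database (placeholder paths).
nucmer --prefix=zw_hits zw_database.fasta assembly.fasta

# Keep one-to-one alignments above 90% identity, then tabulate coordinates.
delta-filter -1 -i 90 zw_hits.delta > zw_hits.filtered.delta
show-coords -rcl zw_hits.filtered.delta > zw_hits.coords
```

Contigs with long, high-identity hits to known Z or W sequences in the coords table are candidate sex-chromosome contigs.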

Troubleshooting

If you are using a computing cluster, make sure that the python being invoked lives in your active conda env (or in the env of the rule in question) rather than in your cluster's system-wide Anaconda install. This fixed many of my problems.

Feel free to modify the slurm_general config to work for your genomes (e.g. runtime, restart-times, etc).
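For orientation, a Snakemake workflow profile is a config.yaml inside the profile directory. A hedged sketch of what slurm_general/config.yaml might contain follows; the values are examples only, and the executor key assumes Snakemake 8+ with the SLURM executor plugin — check the file shipped in this repo for the real settings.

```yaml
# Hypothetical sketch of slurm_general/config.yaml -- example values only.
executor: slurm
jobs: 20                 # max concurrent SLURM jobs
restart-times: 2         # re-submit failed jobs up to twice
default-resources:
  - runtime=720          # minutes
  - mem_mb=16000
```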

Citations

The run order and tool selection are largely inspired by Nicholas Kron's R code for FOpsBet2.1, adapted here for Snakemake. The tools involved in this pipeline are the following:

  • Raw read QC: GenomeScope2, Jellyfish
  • Assembly: hifiasm
  • QC: seqstat, meryl/Merqury, BUSCO
  • Haplotig purging: purge_dups
  • Error correction: Inspector
  • Contig linking with raw read overlap: ntLink
  • Mitochondrial contig identification: MitoFinder
  • Eukaryotic contaminant ID: Kraken2
  • Sequencing adapter contaminant ID: ncbi-adaptor
  • More QC: QUAST
  • Comparison to reference: MUMmer4/nucmer

Other tools associated with this publication:

This dataset also has unpublished RepeatModeler/RepeatMasker-processed genomes and a Progressive Cactus alignment; these are not included in the publication or pipeline but are available upon request.

Earth BioGenome Project Standards


About

code for assembling primary reference Heliconius genomes with PacBio long read data only
