GEP2

Genome Evaluation Pipeline (v2)

This repository contains a significantly updated version of GEP that builds upon lessons learned from ERGA and GAME.

Data is entered via a simple table, and configuration is managed through a tidy control panel. GEP2 uses a modern Snakemake version with containers and can run on a server/cluster (SLURM) or a local computer.

Please cite: Genome Evaluation Pipeline (GEP): A fully-automated quality control tool for parallel evaluation of genome assemblies. https://doi.org/10.1093/bioadv/vbaf147

Requirements

Conda (for Snakemake and NomNom)
Apptainer

GEP2 can:

• download assemblies & reads (or use the ones in your local storage)
• trim/filter/qc reads (paired-end, 10x, HiFi, ONT)
• classic contiguity metrics and other assembly stats (e.g., N50...)
• completeness metrics based on single-copy orthologs
• kmer-based analyses
• long-reads-based analysis
• Hi-C analysis and curation files production
• contamination screening
• aggregate results and show EBP-metrics guidance

How to Get and Set Up GEP2

1) Get the latest version

GEP2 is adding features rapidly, so please download the latest release (or clone the repo for getting hot fixes faster!)

2) Create the GEP2 Conda Environment

The environment contains Snakemake packages and NomNom. Enter the GEP folder and:

conda env create -f install.yml

3) Enter Your Data in a Table

You can use Google Drive, Excel, LibreOffice, Numbers, CSV, TSV, etc.

The table should contain these columns:

sp_name	asm_id	skip	asm_files	read_type	read_files

Please see the example table. The easiest is to make a copy of that Google table (File->Make a copy) and replace the fields with your data. Remember to change permissions (Share-> change General access to "Anyone with the link" viewer)

Column Descriptions:

sp_name: Species name in binomial nomenclature (e.g., Vultur gryphus)
asm_id: Assembly identifier (e.g., hifiasm_l2, yahs_test, or ASM2260516v1)
skip: Flag assemblies for selective analysis skipping. Leave empty (or - or off) to run all analyses. Set to on to flag this assembly, then control which analyses to skip in control_panel.yaml using SKIP_KMER, SKIP_INSP, SKIP_HIC, etc. Useful for running quick QC on draft assemblies while running full analysis on final assemblies.
asm_files: Path to assembly file, URL, or accession number (e.g., GCA_022605165.1). If it's a link or accession, the pipeline will download the data automatically. If Pri/Alt or (Hap1/Hap2) assemblies available, add as comma-separated, like: GCA_963854735.1, GCA_963694935.1
read_type: Can be illumina, 10x, hifi, or ont (variations like PacBio, paired-end, linked-read, arima, promethion and others should also work fine)
read_files: Comma-separated list of paths to read files. Can also be accession numbers (e.g., ERR12205285,ERR12205286). For paired-end reads, list as: forward1,reverse1,forward2,reverse2. Also can use pattern expansion in paths, like /readsA/*.fq.gz, /readsB/*.fq.gz

4) Configure the Control Panel

Add the table path/address and select different options in:

config/control_panel.yaml

5) Configure Cluster or Computer Parameters

GEP2/execution/
├── local/
│   └── config.yaml
└── slurm/
    └── config.yaml

IMPORTANT: You can tweak per-tool resources boundaries in GEP2/config/resources.yaml

6) Run!

First run takes longer as containers need to be built.

Load the conda environment like conda activate GEP2_env and in the GEP2 folder run:

On HPC/Server/Cluster using Slurm:

nohup snakemake --profile execution/slurm &

On Local Computer:

nohup snakemake --profile execution/local &

About the Command:

nohup runs Snakemake in a way that won't be interrupted if you lose connection to the server/cluster
The trailing & runs the command in the background, allowing you to continue using the terminal

Dry Run (Recommended):

Before running the full pipeline, perform a dry run to check what will execute and catch any errors:

snakemake --profile execution/slurm --dry-run

You can also inspect:

GEP2_results/data_config.yaml
GEP2_results/download_manifest.json

Results Structure

Open the report with a markdown renderer (VS Code works well):

GEP2_results/{sp_name}/{asm_id}/{asm_id}_report.md

Directory Structure:

GEP2_results/
├── data/
│   └── {sp_name}/
│       └── reads/
│           ├── {read_type}/
│           │   ├── {read_symlink}
│           │   ├── kmer_db_k{k-mer_length}/
│           │   │   └── {read_name}.meryl
│           │   ├── logs/
│           │   └── processed/
│           │       ├── {read_type}_Path{number}_{read_name}_{process}.fq.gz
│           │       └── reports/
│           │           └── multiqc_report.html
│           └── ...
├── data_config.yaml
├── data_table_{hash}.csv
├── downloaded_data/
│   └── {sp_name}/
│       ├── assemblies/
│       │   └── {asm_file}
│       └── reads/
│           └── {read_type}/
│               └── {read_file}
├── download_manifest.json
└── {sp_name}/
    └── {asm_id}/
        ├── {asm_id}_report.md
        ├── compleasm/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}_results.tar.gz
        │       └── {asm_file_name}_summary.txt
        ├── decontamination/
        │   ├── fcs-gx/
        │   │   └── {asm_file_name}/
        │   │       ├── {asm_file_name}.fcs_gx_report.txt
        │   │       └── {asm_file_name}.taxonomy.rpt
        │   └── blobtools/
        │       └── {asm_file_name}/
        │           ├── Blobdir/
        │           ├── ..
        │           ├── ..blob.circle.png
        │           ├── ..cumulative_plot.png
        │           └── ..snail_plot.png
        ├── gfastats/
        │   └── {asm_file_name}_stats.txt
        ├── hic/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}.cool
        │       ├── {asm_file_name}.mcool
        │       ├── {asm_file_name}.pairs.gz
        │       ├── {asm_file_name}.pairtools_stats.txt
        │       ├── {asm_file_name}.pretext
        │       ├── {asm_file_name}_tracks.pretext
        │       ├── {asm_file_name}_snapshots
        │       │   └── {asm_file_name}_FullMap.png
        │       └── tracks
        │           └── ...bedgraph
        ├── inspector/
        │   └── {asm_file_name}/
        │       ├── ..
        │       └── summary_statistics
        ├── k{k-mer_length}/
        │   ├── {asm_id}.hist
        │   ├── {asm_id}.meryl
        │   └── genomescope2/
        │       └── {asm_id}_linear_plot.png
        ├── logs/
        └── merqury/
            ├── ..
            ├── {asm_file_name}.completeness.stats
            ├── {asm_file_name}.qv
            └── ...png

Main tools:

tool	doi	version	container
bedtools	10.1093/bioinformatics/btq033	2.31.1	docker://diegomics/hic_analysis:0.2
blobtools	-	4.5.1	docker://genomehubs/blobtoolkit:4.5.1
busco	10.1093/molbev/msab199	6.0.0	docker://diegomics/gep2_base:0.2
bwa-mem2	10.1109/IPDPS.2019.00041	2.3	docker://diegomics/hic_analysis:0.2
chromap	10.1038/s41467-021-26865-w	0.3.2	docker://diegomics/hic_analysis:0.2
cooler	10.1093/bioinformatics/btz540	0.10.4	docker://diegomics/hic_analysis:0.2
compleasm	10.1093/bioinformatics/btad595	0.2.7	docker://quay.io/biocontainers/compleasm:0.2.7--pyh7e72e81_0
diamond	10.1038/s41592-021-01101-x	2.1.24	docker://diegomics/hic_analysis:0.2
enabrowsertools	-	1.7.2	docker://diegomics/gep2_base:0.2
fastp	10.1093/bioinformatics/bty560	1.1.0	docker://diegomics/gep2_base:0.2
fastqc	-	0.12.1	docker://diegomics/gep2_base:0.2
fcs-gx	10.1186/s13059-024-03198-7	0.5.5	databases/fcs-gx.sif
genomescope2	10.1038/s41467-020-14998-3	2.0.1	docker://diegomics/gep2_base:0.2
gfastats	10.1093/bioinformatics/btac460	1.3.11	docker://diegomics/gep2_base:0.2
hicexplorer	10.1093/gigascience/giac061	3.7.6	docker://diegomics/hic_analysis:0.2
hifiadapterfilt	10.1186/s12864-022-08375-1	3.0.0	docker://diegomics/gep2_base:0.2
hifasm	10.1038/s41592-024-02269-8	0.25.0	docker://diegomics/gep2_base:0.2
inspector	10.1186/s13059-021-02527-4	1.3.1	docker://diegomics/inspector:1.3.1
longdust	-	1.4	docker://diegomics/hic_analysis:0.2
merqury	10.1186/s13059-020-02134-9	1.3	docker://diegomics/gep2_base:0.2
merquryfk	-	1.2	docker://diegomics/gep2_base:0.2
minimap2	10.1093/bioinformatics/bty191	2.30	docker://diegomics/hic_analysis:0.2
multiqc	10.1093/bioinformatics/btw354	1.33	docker://diegomics/gep2_base:0.2
nanoplot	10.1093/bioinformatics/btad311	1.46.2	docker://diegomics/gep2_base:0.2
pairtools	10.1101/2023.02.13.528389	1.1.3	docker://diegomics/hic_analysis:0.2
pretextgraph	-	0.0.9	docker://diegomics/hic_analysis:0.2
pretextmap	-	0.2.4	docker://diegomics/hic_analysis:0.2
pretextsnapshot	-	0.0.6	docker://diegomics/hic_analysis:0.2
sambamba	10.1093/bioinformatics/btv098	1.0.1	docker://diegomics/hic_analysis:0.2
samtools	10.1093/gigascience/giab008	1.22.1	docker://diegomics/hic_analysis:0.2
sdust	-	0.1	docker://diegomics/hic_analysis:0.2
seqkit	10.1002/imt2.191	2.13.0	docker://diegomics/gep2_base:0.2
seqtk	-	1.5	docker://diegomics/gep2_base:0.2
tidk	10.1093/bioinformatics/btaf049	0.2.65	docker://diegomics/hic_analysis:0.2

Outside the pipeline, you can run all of these programs just using the containers!

Once pulled, Snakemake saves the container images in a hidden folder using its md5 name. You can get the image name and save it in a variable, like this:

HIC_VAR=$(echo -n "docker://diegomics/hic_analysis:0.2" | md5sum | awk '{print $1}')

Then, you can get the GEP2 installation folder in another variable, like:

GEP2_FOLDER="/srv/public/users/ddepanis/Software/GEP2"

Next, you can combine both to get the full image path:

HIC_CONTAINER=${GEP2_FOLDER}/.snakemake/singularity/${HIC_VAR}.simg

Finally, you can run a given program included in the container like:

apptainer exec -B {path/to/data}:{path/to/data} $HIC_CONTAINER tidk -h

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
config		config
execution		execution
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
install.yml		install.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GEP2

Requirements

GEP2 can:

How to Get and Set Up GEP2

1) Get the latest version

2) Create the GEP2 Conda Environment

3) Enter Your Data in a Table

Column Descriptions:

4) Configure the Control Panel

5) Configure Cluster or Computer Parameters

6) Run!

On HPC/Server/Cluster using Slurm:

On Local Computer:

About the Command:

Dry Run (Recommended):

Results Structure

Directory Structure:

Main tools:

Outside the pipeline, you can run all of these programs just using the containers!

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GEP2

Requirements

GEP2 can:

How to Get and Set Up GEP2

1) Get the latest version

2) Create the GEP2 Conda Environment

3) Enter Your Data in a Table

Column Descriptions:

4) Configure the Control Panel

5) Configure Cluster or Computer Parameters

6) Run!

On HPC/Server/Cluster using Slurm:

On Local Computer:

About the Command:

Dry Run (Recommended):

Results Structure

Directory Structure:

Main tools:

Outside the pipeline, you can run all of these programs just using the containers!

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages