Skip to content

corceslab/Demuxlet_Pipeline

Repository files navigation

Demuxlet Pipeline

This pipeline is used to genotype bulk ATAC-seq data and create an input VCF file for use with Demuxlet to identify genotype-based multiplets in single-cell data.

This pipeline contains a controller script [SLURM_Demuxlet_Pipeline.sh] and a series of auxilary bash and R scripts that are called at run time and queued as jobs with SLURM dependencies.

PIPELINE COMPONENTS:

SLURM_Demuxlet_Pipeline.sh
	STEP1: Call peaks in data. This pipeline was designed for ATAC-seq. Calling peaks lets you quickly identify regions of high read count which could be used for genotyping purposes. This limits the search space for genotyping and increases speed.
		SLURM_Demuxlet_CallPeaks.sh
	STEP2: Merge peaks called in all samples. Merging the peak regions creates a union set of regions to be genotyped across all samples.
		SLURM_Demuxlet_MergePeaks.sh
	STEP3: Genotyping within the merged peak regions in every sample using Varscan.
		SLURM_Demuxlet_varscanGenotype.sh
	STEP4: Merge all resultant VCF files from Varscan. This doesnt use a script, just a multipart bash one-liner
	STEP5: Use bedtools to merge (without bookending) all of the unique single-base positions that should be re-genotyped across all samples. This represents the union set of putative positions containing SNPs.
	STEP6: Use samtools mpileup to obtain the allelic depth at these specific positions across all samples.
		SLURM_Demuxlet_mpileup.sh
	STEP7: Convert the allelic depth at each of these positions into a "birdseed"-style genotyping call using an R script. Bash script serves as a wrapper.
		SLURM_Demuxlet_makeBirdseed.sh
		SLURM_Demuxlet_makeBirdseed.R
	STEP8: Create the demuxlet input VCF file. This step creates a VCF file with inferred genotype probabilities based on the birdseed-style genotyping calls for each sample.
		SLURM_Demuxlet_makeDemuxletVCF.R

REQUIREMENTS:

The auxilary scripts utilize my GENOMES directory / directory structure which is shared on Sherlock. This relies on an environment variable called GENOMES that points to a directory that looks like this and has the following files:

${GENOMES}/hg38/hg38.fa
${GENOMES}/hg38/hg38_Common_SNPs.txt
${GENOMES}/hg38/hg38.blacklist.bed
${GENOMES}/hg38/hg38_NumtS_Regions.bed

DEPENDENCIES:

This code has been verified to work with the following software versions which must be loadable with the given commands:

bedtools v2.26.0	- module load bedtools/S2/2.26.0
samtools v1.5		- module load samtools/S2/1.5
macs2 2.1.1.20160309	- Not loaded. Instead, macs2 needs to runable when called as "macs2".
R 3.5.1			- module load R/3.5.1
libpng v1.2.57		- module load libpng/S2/1.2.57 (required for certain tools on Sherlock2)
UCSC Tools v 6.21.17	- module load ucscTools/6.21.17

This code requires the following R packages to be installed:

optparse, foreach, doParallel, ggplot2, seqinr, matrixStats

NOTES:

  1. You may brush up against the per-user job number submission limit depending on how many samples you are trying to run. I'm not sure what the actual limit is but this certainly happens when running more than 750 samples. In this case, the pipeline wont finish properly.
  2. This script expects all of the auxilary scripts to be present in the same directory
  3. Only hg38 is currently supported

USAGE:

bash /path/to/SLURM_Demuxlet_Pipeline.sh -m <Input_Manifest> ... <other options>
where Input_Manifest is a tab-delimited file with each line representing a single BAM file in the format <Sample Name> \t <File_path_BAM> \t <condition> \t <bioRepID>
where condition is the sample type (for example, cancer type from TCGA, or brain region from AD, or brain cohort).
where bioRepID is an identifier that is shared across all samples from the same biological donor.

Other options include:
-p 	Full directory path to a directory containing MACS2 peak calls (.narrowPeak format) for each sample in format <Sample Name>_peaks.narrowPeak
	This option, when used, will skip the peak calling step.
-o 	Full directory path to the desired output directory
-x	Skip to this numbered step (must be an integer)

Note --- I dont think that the condition and bioRepID in the manifest are used in the demuxlet pipeline.

TEST USAGE:

THIS NEEDS TO BE TESTED AND UPDATED.

To download and test the pipeline, follow these steps:

  1. Make sure all of the required R libraries are installed and you have satisfied the requirements and dependencies.
  2. Clone the github repository using: git clone https://github.com/rcorces/Demuxlet_Pipeline.git
  3. Make a folder to perform the analysis.
  4. Move into this folder and download and unzip the test data using wget https://s3.amazonaws.com/changseq/RyanCorces/Shared/DIAN-CAUD_test_data.zip and unzip DIAN-CAUD_test_data.zip. You should end up with 2 folders and 1 test manifest text file. DIAN-CAUD_test_input should contain 8 bam files and 8 bam index files. DIAN-CAUD_test_output should contain ???.
  5. Edit the second column of the test manifest file to provide the correct full path to each bam file.
  6. Run the test data through the pipeline using bash /full/path/to/SLURM_Demuxlet_Pipeline.sh -m /full/path/to/DIAN-CAUD_test_manifest.txt -o /full/path/to/output/directory/.
  7. Compare your output files to those in the DIAN-CAUD_test_output folder.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors