Usage
All parameter descriptions with data typing can be found in the GUI: from the dashboard, scroll down to "Schemas" section; boxes can be expanded to show details for data schema for each endpoint. N.b. this is a schema, NOT an API argument, so don't try copy-pasting it into the GUI dialogue boxes.
- "ExpDir": Folder containing your input data. Castanet expects this folder to contain only 2 files, i.e. your paired reads, in .fastq/.fastq.gz format.
- "ExpName": Run designation; your output folder will have this name.
- "RefStem": Path to file containing mapping references, in .fasta format
- "DoTrimming": If true, run trimmomatic to remove adapters and low quality reads.
- "TrimMinLen": Reads shorter than this length will be discarded by Trimmomatic, if "DoTrimming": true.
- "ConsensusMinD": Minimum read depth required for Castanet to attempt to generate a consensus sequence, across a coverage breadth defined in "ConsensusCoverage".
- "ConsensusCoverage": Minimum coverage breadth required for Castanet to attempt to generate a consensus sequence, at a read depth defined in "ConsensusMinD".
- "ConsensusMapQ": Minimum map quality (Phred) required for a read to be included in a Castanet consensus sequence.
- "ConsensusCleanFiles": If true, Castanet will delete the BAM file created for an experiment once it's finished generating consensus sequences. Enable this to save storage space.
- "AdaptP": Location of your Trimmomatic adapter sequences - may be in your Trimmomatic path, but a backup is included in the data dir.
- "NThreads": Specify the number of threads for multi-core processing. Options: integer == this many threads; 'auto' == let Castanet choose number of threads; 'hpc' == select when running on a compute cluster (hard codes to 1).
- "DoKrakenPrefilter": If true, run an initial filter step using Kraken2 for removing reads assigned to specific species.
- "LineageFile": File path to NCBI lineage file, which is used in combination with "DoKrakenPreFilter" to label and filter reads associated with specific species. This should be downloaded automatically as part of the Castanet installation.
- "ExcludeIds": (Optional) Used in combination with "DoKrakenPreFilter", NCBI TaxIDs to exclude.
- "RetainIds": (Optional) Used in combination with "DoKrakenPreFilter", NCBI TaxIDs to retain.
- "RetainNames": (Optional) Used in combination with "DoKrakenPreFilter", species names to retain.
- "ExcludeNames": (Optional) Used in combination with "DoKrakenPreFilter", species names to exclude.
- "KrakenDbDir": Directory path to the Kraken2 database, used in combination with "DoKrakenPreFilter". A small database for labelling human reads should be automatically downloaded by the Castanet installation script.
- "KeepDups": If true, Castanet will not filter duplicated reads. Users are strongly recommended to keep this enabled.
- "Clin": (OPTIONAL) Path to CSV file containing clinical data (must have at least following fields: pt, clin_int; the field "sampleid" if present will be ignored). Other fields will be ignored.
- "DepthInf": (OPTIONAL, For regenerating full CSV with new clinical info): Path to previously generated CSV file of read depth per position for each probe, for all samples in this batch.
- "SamplesFile": (OPTIONAL) If specified, read raw read numbers from this CSV (needs cols 'sampleid', 'pt', 'rawreadnum'). If not specified, CASTANET will read the raw read numbers from the input bam file, i.e. it will assume you haven't pre-filtered the file.
- "PostFilt": If true, post hoc filter BAM file to remove reads marked as contamination.
- "SingleEndedReads": Set to true for non-paired reads, including NANOPORE; Castanet will only look for a single read file.
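The "ExpDir" requirement above (exactly two read files, or one when "SingleEndedReads" is true) can be checked before starting a run. The following is a minimal sketch: `check_exp_dir` is a hypothetical helper, not part of Castanet, and it assumes read files are recognised purely by their .fastq/.fastq.gz suffix.

```python
import tempfile
from pathlib import Path

FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def check_exp_dir(exp_dir, single_ended=False):
    """Return the read files in exp_dir, or raise if the folder looks wrong.

    Hypothetical helper: it mirrors the "ExpDir" and "SingleEndedReads"
    descriptions above and is not Castanet's own validation code.
    """
    files = sorted(p for p in Path(exp_dir).iterdir() if p.is_file())
    reads = [p for p in files if p.name.endswith(FASTQ_SUFFIXES)]
    others = [p for p in files if p not in reads]
    expected = 1 if single_ended else 2
    if others:
        raise ValueError(f"unexpected non-FASTQ files: {[p.name for p in others]}")
    if len(reads) != expected:
        raise ValueError(f"expected {expected} read file(s), found {len(reads)}")
    return reads


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_1.fastq.gz", "sample_2.fastq.gz"):
        (Path(tmp) / name).touch()
    demo_reads = [p.name for p in check_exp_dir(tmp)]
    print(demo_reads)
```

Running a check like this first gives a clearer error than a failure part-way through the pipeline.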
Castanet works by aggregating reads on target at the organism level, which is achieved through filtering probe names. It's critical that your probe nomenclature is compatible with Castanet, otherwise aggregation to specific organisms will fail during the "Analysis" script.
Ideally probe names will be in the format:
>Genus_species_....
E.g.:
>hbv_10407_cluster0
ATCG..
>treponema_pallidum_bact000001
ATCG..
>haemophilus_influenzae_bact000016_haemophilus-19_influenzae-16-parainfluenzae
ATCG..
Castanet can also process several standard formats, where rMLST gene names (i.e. bactXX) are optional (LOWER CASE TRANSFORMED):
bact[0-9]+_([A-Za-z]+)-[0-9]+[|_]([A-Za-z]+)
bact[0-9]+_[0-9]+_([A-Za-z]+_[A-Za-z_]+)
bact[0-9]+_([a-z]+_[a-z_]+)
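To see which of the standard formats a given probe name falls under, the patterns above can be tried in turn with Python's `re` module. This is a sketch only: `match_probe_name` is a hypothetical helper, the second and third example names are made up, and the use of `re.search` on the lower-cased name is an assumption rather than a description of Castanet's internals.

```python
import re

# The three patterns listed above, tried in order on the lower-cased name.
PATTERNS = [
    re.compile(r"bact[0-9]+_([A-Za-z]+)-[0-9]+[|_]([A-Za-z]+)"),
    re.compile(r"bact[0-9]+_[0-9]+_([A-Za-z]+_[A-Za-z_]+)"),
    re.compile(r"bact[0-9]+_([a-z]+_[a-z_]+)"),
]


def match_probe_name(header):
    """Return (pattern index, captured groups) or None if nothing matches."""
    name = header.lstrip(">").strip().lower()
    for idx, pattern in enumerate(PATTERNS):
        found = pattern.search(name)
        if found:
            return idx, found.groups()
    return None


examples = [
    ">haemophilus_influenzae_bact000016_haemophilus-19_influenzae-16-parainfluenzae",
    "bact000016_42_haemophilus_influenzae",  # made-up name
    "BACT000002_Escherichia_coli",           # made-up name
    ">hbv_10407_cluster0",
]
for header in examples:
    print(header, "->", match_probe_name(header))
```

A name that returns None here, such as hbv_10407_cluster0, contains no rMLST-style bactXX prefix in any of these three layouts; the patterns cover only the alternative formats, not the preferred Genus_species naming.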
Your probe panel will be unique to your experiments, hence we have not provided one. We strongly recommend running Castanet's /convert_mapping_reference/ endpoint to aid in creating a Castanet-compatible file:
- Source your multi-fasta mapping reference.
- Start the Castanet server and visit the GUI (steps 1--3 in "GUI" section, above).
- Find the /convert_mapping_reference/ box and expand it. Click "try it out" to enable the dialogue box.
- Fill in the parameters according to the schema below (n.b. maintain inverted commas and commas).
- Hit "Execute". Check the terminal for notifications if there are any errors in your probe names that Castanet can't correct.
{
"InputFolder": Directory containing your input fasta(s).
"OutFolder": Directory path to save your output mapping reference.
"OutFileName": Name for output file.
}
Try to avoid non-alphanumeric characters in your probe names, although the /convert_probes/ endpoint filters most types of contaminating character (underscores, hyphens, pipes etc.).
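Because the dialogue box expects valid JSON (quoted keys and values, separated by commas), one way to avoid syntax slips is to generate the text from a dictionary. The sketch below uses the three keys from the /convert_mapping_reference/ schema above; the paths and the output name are hypothetical placeholders, and whether "OutFileName" should carry a file extension is not stated here.

```python
import json

# Hypothetical paths and file name; substitute your own.
payload = {
    "InputFolder": "./data/my_references/",
    "OutFolder": "./data/",
    "OutFileName": "my_probes",
}

# json.dumps takes care of the inverted commas and commas.
body = json.dumps(payload, indent=2)
print(body)
```

The printed text can be pasted into the dialogue box as-is.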
Expert users may wish to interact with the Castanet API programmatically. The following will reproduce the "quick-start" guide workflow.
- Complete steps 1--2 in the "GUI" section, above.
- Test that all of the dependencies Castanet needs are installed and functioning as expected by cURL'ing the check_dependencies endpoint (example script included:
$ bash dev/check_dependencies.sh). Check the output in either the terminal or the API window (you might need to scroll down) to ensure it completes successfully.
- Try an end-to-end run analysing a synthetic dataset that's included in this repository, by cURL'ing the end_to_end endpoint (example script included:
$ bash dev/end_to_end.sh). Output is saved to ./experiments/CastanetTest/.
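The same kind of call can be made from Python instead of cURL. The sketch below only builds the request; sending is left to a separate function so nothing happens until the server is running. Several details are assumptions: the server address and port, the POST method, and the subset of parameters shown. The bundled scripts in dev/ and the "Schemas" section of the GUI remain the reference for what your Castanet version expects.

```python
import json
import urllib.request

# Assumed server address; use whatever host and port your server reports.
BASE_URL = "http://127.0.0.1:8001"


def build_request(endpoint, payload):
    """Build (but do not send) a JSON request for a Castanet endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint.strip('/')}/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


def send(request):
    """Send the request and return the decoded response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Illustrative subset of the parameters described above; values are made up.
payload = {
    "ExpDir": "./data/my_experiment/",
    "ExpName": "MyExperiment",
    "RefStem": "./data/my_probes.fasta",
    "DoTrimming": True,
    "NThreads": "auto",
}
request = build_request("end_to_end", payload)
print(request.get_method(), request.full_url)
# print(send(request))  # uncomment once the Castanet server is running
```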
This endpoint will trigger a full Castanet analysis and is likely to be the most useful function for the majority of users. Descriptions for each function within this call are included in "Individual pipeline functions", below.
Run Castanet analysis and consensus generation on a ready-mapped BAM file.
(EXPERIMENTAL) Run an analytical pipeline specifically designed for amplicon NGS data. Details to follow.
The included install_deps.sh script will attempt to install the following dependencies automatically. Depending on your specific system set-up, manual installation of some or all components may still be required. Details are included below.
We have included a lineage file in the repo for convenience. Users may generate up-to-date files using the repository below.
https://github.com/zyxue/ncbitax2lin
Mapping is an essential process to Castanet, which involves comparing our experiment reads with a number of pre-defined reference sequences. We opt for bwa-mem2 for doing Burrows Wheeler alignment.
https://github.com/bwa-mem2/bwa-mem2
Castanet is not tested with original bwa, bowtie2 etc., but may be compatible.
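For orientation, a stand-alone bwa-mem2 run consists of an indexing step followed by a mapping step. The sketch below assembles both command lines without executing them; the file names are hypothetical, and the exact options Castanet passes to bwa-mem2 are not documented here, so treat this as an illustration of the tool rather than of the pipeline.

```python
import shlex


def bwa_mem2_commands(reference, reads, threads=1):
    """Return the (index, mem) argument lists for a basic bwa-mem2 run."""
    index_cmd = ["bwa-mem2", "index", reference]
    map_cmd = ["bwa-mem2", "mem", "-t", str(threads), reference, *reads]
    return index_cmd, map_cmd


# Hypothetical file names.
index_cmd, map_cmd = bwa_mem2_commands(
    "my_probes.fasta", ["sample_1.fastq.gz", "sample_2.fastq.gz"], threads=4
)
print(shlex.join(index_cmd))
# bwa-mem2 mem writes SAM to standard output, hence the redirect.
print(shlex.join(map_cmd) + " > sample.sam")
```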
Samtools is a collection of software libraries that provides a range of functions for interrogating NGS data, specifically in Sequence Alignment Map (SAM) format and the compressed Binary- (BAM) format. These functions include reading, writing and viewing the contents of these files.
http://www.htslib.org/
Trimming is an essential quality control process for removing sequence fragments that would contaminate our analyses. Specifically, we use Trimmomatic here to remove both low quality reads and our Illumina adapters (via MINLEN and ILLUMINACLIP functions).
http://www.usadellab.org/cms/?page=trimmomatic
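As an illustration of the two Trimmomatic steps named above, the following sketch assembles a paired-end command line with ILLUMINACLIP and MINLEN, without running it. The assumptions are: a `trimmomatic` wrapper is on your PATH (some installs need `java -jar trimmomatic.jar` instead), the ILLUMINACLIP settings 2:30:10 are taken from Trimmomatic's own usage example rather than from Castanet, and the file names are made up.

```python
import shlex


def trimmomatic_pe_command(read1, read2, adapters, min_len, out_prefix):
    """Return an argument list for Trimmomatic in paired-end (PE) mode."""
    # PE mode expects four outputs: paired/unpaired for read 1, then read 2.
    outputs = [
        f"{out_prefix}_{n}{kind}.fastq.gz" for n in (1, 2) for kind in ("P", "U")
    ]
    return [
        "trimmomatic", "PE", read1, read2, *outputs,
        f"ILLUMINACLIP:{adapters}:2:30:10",  # adapter removal
        f"MINLEN:{min_len}",                 # drop reads shorter than min_len
    ]


# Hypothetical inputs; adapters corresponds to "AdaptP", min_len to "TrimMinLen".
trim_cmd = trimmomatic_pe_command(
    "sample_1.fastq.gz", "sample_2.fastq.gz", "adapters.fa", 36, "trimmed"
)
print(shlex.join(trim_cmd))
```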
Generation of consensus sequences requires multiple sequence alignment. Castanet is developed for use with MAFFT as we find it to be faster, more scalable and more accurate than other programs. It is not possible to substitute MAFFT for another MSA program without making code changes, as we use functionality for adding unaligned fragmentary sequences that, to our knowledge, is unique to MAFFT.
https://mafft.cbrc.jp/alignment/software/
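The fragment-adding functionality mentioned above is exposed by MAFFT's `--addfragments` option, which aligns unaligned fragments against an existing alignment. A minimal sketch of such a command line follows; the file names are hypothetical and the options Castanet itself passes may differ.

```python
import shlex


def mafft_addfragments_command(fragments, existing_alignment, threads=1):
    """Return an argument list that adds fragments to an existing alignment."""
    return [
        "mafft", "--thread", str(threads),
        "--addfragments", fragments, existing_alignment,
    ]


# Hypothetical file names.
mafft_cmd = mafft_addfragments_command("fragments.fasta", "reference_aln.fasta", 4)
# MAFFT writes the resulting alignment to standard output, hence the redirect.
print(shlex.join(mafft_cmd) + " > combined_aln.fasta")
```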
We use several algorithms to construct consensus sequences, one of which is Moshiri's "ViralConsensus", a fast and memory-efficient tool for calling whole genome sequences directly from read alignment data. See Ref. https://doi.org/10.1093/bioinformatics/btad317 for more information.
https://github.com/niemasd/ViralConsensus