🚀 TurboSRA (v1.5)

A high-performance, multi-threaded SRA downloader that outsmarts NCBI's API to guarantee pristine, full-fidelity Phred scores.

🧬 Stop letting NCBI silently destroy your quality scores.

TurboSRA is a production-ready bash utility engineered to fix the two biggest bottlenecks in bioinformatics data acquisition: slow, single-threaded SRA downloads and unpredictable data degradation. By intelligently routing through the European Nucleotide Archive (ENA), TurboSRA guarantees that you receive the original, high-fidelity datasets required for sensitive variant calling, at more than double the speed of standard tools.


✨ Key Features

  • Smart API Routing: Automatically prioritizes ENA servers to bypass NCBI's .sralite formats, ensuring base quality scores are never stripped or averaged.
  • Parallel Downloading: Utilizes aria2c to open 16 simultaneous HTTP connections per accession, maxing out your bandwidth.
  • Multi-Threaded Compression: Leverages pigz to compress output FASTQ files across all available CPU cores.
  • Cross-Platform Auto-Scaling: Natively detects and utilizes maximum CPU cores on Linux, WSL, and macOS (Apple Silicon).
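
The script's own core-detection logic is not reproduced in this README. As a rough sketch of how cross-platform auto-scaling is commonly done, the hypothetical helper below (the name `detect_threads` is an assumption, not part of TurboSRA) tries `nproc` first, then falls back to `sysctl -n hw.ncpu` on macOS:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from turbo_srav1.5.sh): print the number of
# CPU cores available on Linux, WSL, or macOS.
detect_threads() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                                   # GNU coreutils: Linux and WSL
  elif command -v sysctl >/dev/null 2>&1; then
    sysctl -n hw.ncpu                       # macOS (Intel and Apple Silicon)
  else
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1   # last-resort fallback
  fi
}

THREADS="$(detect_threads)"
echo "Using ${THREADS} threads"

# The detected value could then be handed to pigz, for example:
#   pigz -p "$THREADS" reads.fastq
```
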

📊 The Benchmark: Why use TurboSRA?

The standard NCBI sra-tools workflow (prefetch followed by fasterq-dump) downloads over a single connection and runs its steps one after another. Worse, during heavy server loads, NCBI may silently serve the .sralite format, in which the original Phred quality scores have been replaced with simplified values to save bandwidth.
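
How TurboSRA itself queries ENA is not shown in this README. The sketch below illustrates one plausible way to route a download through ENA, assuming the ENA Portal API `filereport` endpoint and its `fastq_ftp` field. The function names are hypothetical, and the `https://` prefix on the returned paths is an assumption. `fetch_run` is only defined here, not called, so nothing is downloaded until you invoke it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not TurboSRA's actual code.

# Build the ENA Portal API filereport URL for one run accession.
ena_filereport_url() {
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=fastq_ftp&format=tsv\n' "$1"
}

# Read the TSV reply on stdin and print one URL per FASTQ file.
# The fastq_ftp column holds semicolon-separated host/path entries
# without a scheme, so a scheme is prepended (assumed: https).
ena_fastq_urls() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i; next }
    col && $col != "" {
      n = split($col, f, ";")
      for (i = 1; i <= n; i++) print "https://" f[i]
    }'
}

# Look up one accession and download each file with 16 connections.
# -x sets the maximum connections per server, -s the number of splits.
fetch_run() {
  curl -fsSL "$(ena_filereport_url "$1")" | ena_fastq_urls |
    while IFS= read -r url; do
      aria2c -x 16 -s 16 "$url"
    done
}
```

Calling `fetch_run SRR1274307` would then fetch whatever FASTQ files ENA lists for that run.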

Test parameters: 10-core Apple Silicon (M-Series), 1 Gbps connection, single accession (SRR1274307).

⏱️ The Speed Showdown

| Metric | Standard prefetch | TurboSRA v1.5 |
| --- | --- | --- |
| Total Time | 9.88 seconds | 4.63 seconds (2.1x faster) |
| Download Engine | Single HTTP connection | aria2c (16 parallel connections) |
| Extraction & Compression | Sequential & single-threaded | Piped fasterq-dump & multi-threaded pigz |
| Data Integrity | ⚠️ Unpredictable (.sralite risk) | 🛡️ Guaranteed full Phred scores |

🔍 The Data Integrity Proof

If you request an accession using standard tools, you risk receiving simplified quality scores (a solid wall of ?), making the data useless for downstream variant callers like GATK or Snippy.

Output from standard prefetch (SRA Lite Fallback):

@SRR1274307.1 1 length=25
ATGGCTCACTGCAGCCTTGACTTTC
+SRR1274307.1 1 length=25
?????????????????????????  <-- Quality scores destroyed

Output from TurboSRA (Strict ENA Mode):

@SRR1274307.1 1 length=25
ATGGCTCACTGCAGCCTTGACTTTC
+SRR1274307.1 1 length=25
AAA?ABBBDEEDDDDEGGGFGGIIH  <-- Original Phred scores preserved
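
To check a file you already have, a quick test is to count records whose quality line is a single repeated character, as in the SRA Lite output above. The helper below is a minimal sketch (the name `count_flat_quality` is hypothetical, and it assumes plain four-line FASTQ records with no wrapped sequences):

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTQ records whose quality line consists of one
# repeated character (for example "?????"), a sign of simplified scores.
# Reads the named files, or stdin when no file is given.
count_flat_quality() {
  awk '
    NR % 4 == 0 {                      # every 4th line is a quality string
      total++
      c = substr($0, 1, 1)
      flat = 1
      for (i = 2; i <= length($0); i++)
        if (substr($0, i, 1) != c) { flat = 0; break }
      if (flat) nflat++
    }
    END { printf "%d of %d records have flat quality strings\n", nflat, total }
  ' "$@"
}
```

For compressed output, pipe the file in: `gzip -dc SRR1274307.fastq.gz | count_flat_quality`. A result where every record is flat suggests the data came from an SRA Lite source.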

⚙️ Installation & Dependencies

TurboSRA requires standard POSIX utilities alongside a few core bioinformatics tools.

1. Install Dependencies

For macOS (Apple Silicon / Intel):

# Install SRA-Tools via Conda/Mamba
mamba install -c conda-forge -c bioconda sra-tools -y
# Install system utilities via Homebrew
brew install aria2 pigz curl

For Linux / WSL (Ubuntu/Debian):

sudo apt update
sudo apt install sra-toolkit aria2 pigz curl -y

2. Download TurboSRA

git clone https://github.com/hemant-goyal/TurboSRA.git
cd TurboSRA
chmod +x turbo_srav1.5.sh

🚀 Usage

Provide a plain-text file containing exactly one SRA accession per line.

./turbo_srav1.5.sh -i accessions.txt [OPTIONS]

Flags [OPTIONS]

| Flag | Description | Default |
| --- | --- | --- |
| -i | (Required) Input file containing list of SRR accessions | None |
| -o | Output directory | ./ (Current directory) |
| -c | Enable multi-threaded FASTQ compression (.fastq.gz) | False |
| -l | Allow "LITE" mode. Skips ENA strict routing and allows NCBI's .sralite format for maximum speed (Warning: destroys quality scores) | False (Strict FULL mode) |
| -k | Keep the intermediate .sra cache files | False (Auto-cleans) |
| -t | Manually specify the number of CPU threads | Auto-detected |
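
Because the input file must hold only one accession per line, it can help to sanitize a hand-edited list before passing it to `-i`. The following is a minimal sketch, not part of TurboSRA: the helper name `clean_accessions` and the accession pattern (`SRR`, `ERR`, or `DRR` followed by digits) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: strip Windows line endings and surrounding
# whitespace, drop anything that is not a run accession, and remove
# duplicates while keeping the original order.
clean_accessions() {
  tr -d '\r' < "$1" |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -E '^[SED]RR[0-9]+$' |
    awk '!seen[$0]++'
}

# Example workflow, using only the flags from the table above:
#   clean_accessions raw_list.txt > accessions.txt
#   ./turbo_srav1.5.sh -i accessions.txt -o ./fastq -c -t 8
```
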
