PHIFlow-Bash is a modular Bash-based pipeline for host-microbe
transcriptomic profiling from RNA-seq data.
It integrates host alignment, gene quantification, microbial
classification and detection into a unified workflow.
- Read QC (fastp)
- Read alignment to the human host genome (STAR)
- Infer library strandedness (RSeQC)
- Quantify host gene expression (subread:featureCounts)
- Taxonomic profiling of host unmapped reads (Kraken2)
- Calculate confidence scores from Kraken2 taxonomic profiling (Conifer)
- Present summary of all previous processing steps (MultiQC)

PHIFlow requires:
- Human genome FASTA (e.g., GRCh38)
- Human genome annotation files (GTF and BED)
- STAR genome index path
- Kraken2 database path
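
The requirements above can be verified before a run is started. The following is a minimal sketch, not part of PHIFlow-Bash itself: `check_resources` is a hypothetical helper that takes the five paths as positional arguments and reports anything missing.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not shipped with PHIFlow-Bash).
# Usage: check_resources <genome_fasta> <gtf> <bed> <star_index_dir> <kraken2_db_dir>
check_resources() {
    local fasta="$1" gtf="$2" bed="$3" star_index="$4" kraken_db="$5"
    local missing=0 f d

    # Regular files: genome FASTA and both annotation files must exist and be non-empty.
    for f in "$fasta" "$gtf" "$bed"; do
        if [[ ! -s "$f" ]]; then
            echo "missing or empty file: $f" >&2
            missing=$((missing + 1))
        fi
    done

    # Directories: STAR genome index and Kraken2 database.
    for d in "$star_index" "$kraken_db"; do
        if [[ ! -d "$d" ]]; then
            echo "missing directory: $d" >&2
            missing=$((missing + 1))
        fi
    done

    # Succeed only when nothing is missing.
    [[ "$missing" -eq 0 ]]
}
```

The function only checks that the paths exist; it does not validate the contents of the index or database.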
Before running the pipeline, the required reference resources must be prepared.
Reference genome and GTF files can be obtained from:
Generate the genome index using STAR:
STAR \
--runThreadN 16 \
--runMode genomeGenerate \
--genomeDir /path/to/star_index/ \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile Homo_sapiens.GRCh38.gtf \
--sjdbOverhang 100

Provide the paths in the configuration file:

host_genome_index_path: "/path/to/star_index/"
host_genome_annotation_path_gtf: "/path/to/Homo_sapiens.GRCh38.gtf"

Ensembl does not provide genome annotation files directly in BED format. However, a BED file can be generated from the corresponding GTF file.
After downloading the GTF file (e.g., Homo_sapiens.GRCh38.gtf), convert it using the UCSC utilities gtfToGenePred and genePredToBed:

gtfToGenePred Homo_sapiens.GRCh38.gtf stdout | \
genePredToBed stdin Homo_sapiens.GRCh38.bed

After generating the BED file, update your configuration file:

host_genome_annotation_path_bed: "/path/to/Homo_sapiens.GRCh38.bed"

Microbial classification requires a pre-built database for Kraken2.
Official databases are available at:
We recommend using the PlusPF database
(Standard + Protozoa and Fungi), which provides broad microbial
coverage suitable for host-microbe transcriptomic studies.
After downloading the database, define its path in the configuration file:
kraken2_db_path: "/path/to/kraken2_db"

- STAR alignment (GRCh38 index): ~30-35 GB RAM
- Kraken2 loads the entire database into RAM

RAM usage for Kraken2 is proportional to database size.
For the PlusPF database:
- ~100-150 GB RAM required
- 128 GB RAM for full pipeline execution with PlusPF
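
Because Kraken2 holds the whole database in memory, it can help to compare the database size against the machine's RAM before launching. The sketch below makes several assumptions: the database directory contains the usual `*.k2d` files, the host is Linux (memory is read from `/proc/meminfo`), and a 10% headroom is an arbitrary illustrative margin. All three function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of PHIFlow-Bash).

# Total size in bytes of the Kraken2 database files (*.k2d) in a directory.
k2_db_size_bytes() {
    local db_dir="$1" total=0 size f
    for f in "$db_dir"/*.k2d; do
        [[ -f "$f" ]] || continue       # also skips an unmatched glob
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Total physical memory in bytes, read from a Linux-style meminfo file.
total_ram_bytes() {
    local meminfo="${1:-/proc/meminfo}" kb
    kb=$(awk '/^MemTotal:/ { print $2 }' "$meminfo")
    echo $((kb * 1024))
}

# Succeeds when the database plus headroom (default 10%) fits in RAM.
db_fits_in_ram() {
    local db_bytes="$1" ram_bytes="$2" headroom_pct="${3:-10}"
    (( db_bytes * (100 + headroom_pct) / 100 <= ram_bytes ))
}
```

A typical call would be `db_fits_in_ram "$(k2_db_size_bytes /path/to/kraken2_db)" "$(total_ram_bytes)"`, which returns a non-zero status when the database is unlikely to fit.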

| Resource | Approximate Size |
|---|---|
| STAR index (GRCh38) | ~30-35 GB |
| Kraken2 PlusPF | ~100-150 GB |
| BAM files | 5-15 GB per sample |

Minimum recommended free space:
- 200 GB (small projects)
- 500+ GB (medium/large cohorts)
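
As a rough worked example of the storage figures above, the sketch below adds the upper bounds quoted for the STAR index, the PlusPF database and one BAM file per sample. It is an approximation only: it ignores input FASTQ files, intermediate files and reports, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of PHIFlow-Bash).

# Rough upper-bound disk estimate in GB for a given number of samples.
estimate_disk_gb() {
    local n_samples="$1"
    local star_index_gb=35       # upper bound quoted for the GRCh38 STAR index
    local kraken_db_gb=150       # upper bound quoted for Kraken2 PlusPF
    local bam_gb_per_sample=15   # upper bound quoted per BAM file
    echo $(( star_index_gb + kraken_db_gb + n_samples * bam_gb_per_sample ))
}

# Free space in whole GB on the filesystem holding a directory (POSIX df output).
free_gb() {
    local dir="$1"
    df -Pk "$dir" | awk 'NR == 2 { print int($4 / 1048576) }'
}
```

For a hypothetical 20-sample cohort, `estimate_disk_gb 20` prints `485`, which can be compared with `free_gb /path/to/workdir`.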
Create the Conda environment:
mamba env create -f environment.yml
conda activate phiflow-bash

Execution requires a single command:

bash phiflow.sh config_profile.yml

The pipeline is controlled by a YAML configuration file.
Example:
module_execution_type: "both"
sample_ids_file: "/scratch/icaro/phiflow_pipeline/samples.txt"
dataset_name: "GSE154918"
kraken2_db_path: "/scratch/kraken2db/k2_pluspf"
host_genome_index_path: "/scratch/icaro/genome_indexes/index/100/"
host_genome_annotation_path_gtf: "/scratch/icaro/genome_indexes/gene_annotation/Homo_sapiens.GRCh38.104.gtf"
host_genome_annotation_path_bed: "/scratch/icaro/genome_indexes/gene_annotation/Homo_sapiens.GRCh38.104.bed"
paired_end: "yes"
strandness: "auto"
workdir: "/scratch/icaro/phiflow_pipeline"
samplesdir: "/home/CSBL/icaro/sandbox/metatranscriptomics/studies/GSE154918"

| Parameter | Description |
|---|---|
| module_execution_type | "host", "microbe", or "both" (default) |
| sample_ids_file | File containing sample IDs (one per line) |
| dataset_name | Dataset name used as output folder |
| kraken2_db_path | Path to Kraken2 database |
| host_genome_index_path | STAR genome index directory |
| host_genome_annotation_path_gtf | GTF annotation filepath |
| host_genome_annotation_path_bed | BED annotation filepath |
| paired_end | "yes" or "no" |
| strandness | "auto", "yes", or "no" |
| workdir | Output directory |
| samplesdir | Directory containing FASTQ files |
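
To illustrate how flat `key: "value"` settings like these can be read and checked from Bash, here is a minimal sketch. It is not how `phiflow.sh` necessarily parses its configuration: `get_config_value` and `is_one_of` are hypothetical helpers, and the parser only handles single-line top-level keys, not nested YAML or inline comments.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of PHIFlow-Bash).

# Print the value of a top-level "key: value" entry; fail if the key is absent.
get_config_value() {
    local key="$1" file="$2" line value
    while IFS= read -r line || [[ -n "$line" ]]; do
        if [[ "$line" == "$key":* ]]; then
            value="${line#"$key":}"
            value="${value#"${value%%[![:space:]]*}"}"   # trim leading whitespace
            value="${value%"${value##*[![:space:]]}"}"   # trim trailing whitespace
            value="${value#\"}"                          # strip surrounding quotes
            value="${value%\"}"
            printf '%s\n' "$value"
            return 0
        fi
    done < "$file"
    return 1
}

# Succeed when the first argument equals one of the remaining arguments.
is_one_of() {
    local value="$1" choice
    shift
    for choice in "$@"; do
        [[ "$value" == "$choice" ]] && return 0
    done
    return 1
}
```

For example, `is_one_of "$(get_config_value paired_end config_profile.yml)" yes no` fails when `paired_end` is missing or holds an unexpected value.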
FASTQ files must be located in the directory specified by samplesdir and named after the sample ID:
- Paired-end: sample1_R1.fastq.gz and sample1_R2.fastq.gz
- Single-end: sample1.fastq.gz
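
The naming convention can be checked against the sample ID list before a run. The sketch below assumes that the `_R1`/`_R2` names are the paired-end mates and the unsuffixed name is the single-end file; `check_sample_fastqs` is a hypothetical helper, not part of the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash).
# Usage: check_sample_fastqs <sample_ids_file> <samplesdir> <paired_end: yes|no>
check_sample_fastqs() {
    local ids_file="$1" samplesdir="$2" paired_end="$3"
    local missing=0 id f
    local -a expected

    while IFS= read -r id || [[ -n "$id" ]]; do
        [[ -z "$id" ]] && continue      # skip blank lines

        # Assumed convention: <id>_R1/_R2.fastq.gz for paired-end, <id>.fastq.gz otherwise.
        if [[ "$paired_end" == "yes" ]]; then
            expected=("${id}_R1.fastq.gz" "${id}_R2.fastq.gz")
        else
            expected=("${id}.fastq.gz")
        fi

        for f in "${expected[@]}"; do
            if [[ ! -f "$samplesdir/$f" ]]; then
                echo "missing: $samplesdir/$f" >&2
                missing=$((missing + 1))
            fi
        done
    done < "$ids_file"

    # Succeed only when every expected FASTQ file was found.
    [[ "$missing" -eq 0 ]]
}
```

Missing files are listed on standard error, and the exit status can gate the actual pipeline call.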
All results are generated inside the directory specified in:
workdir/
Outputs include:
- BAM alignment files
- Gene count matrices
- Kraken2 classification reports
- MultiQC summary report
