PHIFlow-Bash

PHIFlow-Bash is a modular Bash-based pipeline for host-microbe transcriptomic profiling from RNA-seq data.
It integrates host alignment, gene quantification, microbial classification and detection into a unified workflow.

Workflow scheme

Read QC (fastp)
Read alignment to the human host genome (STAR)
Infer library strandedness (RSeQC)
Quantify host gene expression (subread:featureCounts)
Taxonomic profiling of host unmapped reads (Kraken2)
Calculate confidence scores from Kraken2 taxonomic profiling (Conifer)
Present summary of all previous processing steps (MultiQC)

⚠️ Prerequisites

PHIFlow requires:

Human genome FASTA (e.g., GRCh38)
Human genome annotation files (GTF and BED)
STAR genome index path
Kraken2 database path

Before running the pipeline, the required reference resources must be prepared.

1️⃣ Human Reference Genome and STAR Index

Reference genome and GTF files can be obtained from:

https://www.ensembl.org

Generate the genome index using STAR:

STAR \
  --runThreadN 16 \
  --runMode genomeGenerate \
  --genomeDir /path/to/star_index/ \
  --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile Homo_sapiens.GRCh38.gtf \
  --sjdbOverhang 100
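The STAR manual describes the ideal --sjdbOverhang as the maximum read length minus 1, and notes that the value 100 works well in most cases. If you want to check this for your own data, the sketch below estimates the value from the first reads of a gzipped FASTQ file. The function name suggest_sjdb_overhang and the sampling of 10,000 reads are illustrative assumptions, not part of PHIFlow-Bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash): print (longest read - 1)
# among the first N reads of a gzipped FASTQ file.
suggest_sjdb_overhang() {
  local fastq_gz="$1"
  local n_reads="${2:-10000}"   # number of reads to inspect (assumed default)
  gzip -dc "$fastq_gz" \
    | head -n "$((n_reads * 4))" \
    | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
           END { if (max > 0) print max - 1 }'
}

# Example call (file name is a placeholder):
# suggest_sjdb_overhang sample1_R1.fastq.gz
```

Reads are sampled before trimming here; if fastp shortens them substantially, the default of 100 is likely still a reasonable choice.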

Provide the paths in the configuration file:

host_genome_index_path: "/path/to/star_index/"
host_genome_annotation_path_gtf: "/path/to/Homo_sapiens.GRCh38.gtf"
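Before setting host_genome_index_path, it can be worth confirming that index generation finished. The sketch below assumes that a complete STAR index directory contains non-empty Genome, SA, SAindex and genomeParameters.txt files. The function name check_star_index is hypothetical and is not provided by PHIFlow-Bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash): verify that a STAR index
# directory contains the core files written by --runMode genomeGenerate.
check_star_index() {
  local index_dir="${1%/}"
  local missing=0
  local f
  for f in Genome SA SAindex genomeParameters.txt; do
    if [ ! -s "${index_dir}/${f}" ]; then
      echo "missing or empty: ${index_dir}/${f}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (path is a placeholder):
# check_star_index /path/to/star_index/ && echo "STAR index looks complete"
```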

📄 Generating a BED File from Ensembl GTF

Ensembl does not provide genome annotation files directly in BED format. However, a BED file can be generated from the corresponding GTF file using UCSC utilities.

After downloading the GTF file (e.g., Homo_sapiens.GRCh38.gtf), convert it with gtfToGenePred and genePredToBed:

gtfToGenePred Homo_sapiens.GRCh38.gtf stdout | \
genePredToBed stdin Homo_sapiens.GRCh38.bed

After generating the BED file, update your configuration file:

host_genome_annotation_path_bed: "/path/to/Homo_sapiens.GRCh38.bed"
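A quick sanity check on the converted file can catch a truncated or failed conversion. The sketch below assumes genePredToBed wrote a 12-column, tab-separated BED file (BED12). The function name check_bed12 is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash): succeed only if the BED file
# exists, is non-empty, and every line has exactly 12 tab-separated columns.
check_bed12() {
  local bed="$1"
  if [ ! -s "$bed" ]; then
    echo "BED file missing or empty: $bed" >&2
    return 1
  fi
  # Report the first offending line on stderr and stop.
  awk -F'\t' 'NF != 12 { printf "line %d has %d columns\n", NR, NF; bad = 1; exit }
              END { exit bad }' "$bed" >&2
}

# Example call (path is a placeholder):
# check_bed12 /path/to/Homo_sapiens.GRCh38.bed && echo "BED12 looks fine"
```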

2️⃣ Kraken2 Database

Microbial classification requires a pre-built database for Kraken2.

Official pre-built databases are available for download.

✅ Recommended: PlusPF

We recommend the PlusPF database (Standard plus protozoa and fungi), which provides broad microbial coverage suitable for host-microbe transcriptomic studies.

After downloading the database, define its path in the configuration file:

kraken2_db_path: "/path/to/kraken2_db"
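To confirm that the download and extraction completed, you can check for the three files a Kraken2 database directory is expected to contain: hash.k2d, opts.k2d and taxo.k2d. The function name check_kraken2_db below is hypothetical and is not part of PHIFlow-Bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash): verify that a directory
# holds the three core Kraken2 database files.
check_kraken2_db() {
  local db_dir="${1%/}"
  local missing=0
  local f
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "${db_dir}/${f}" ]; then
      echo "missing or empty: ${db_dir}/${f}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (path is a placeholder):
# check_kraken2_db /path/to/kraken2_db && echo "Kraken2 database looks complete"
```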

💻 Hardware Requirements

🧠 Memory (RAM)

STAR alignment (GRCh38 index): ~30-35 GB RAM
Kraken2: loads the entire database into RAM

RAM usage for Kraken2 is proportional to database size.

For the PlusPF database:

~100-150 GB RAM required

Recommended minimum:

128 GB RAM for full pipeline execution with PlusPF
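Because Kraken2 memory use follows database size, a rough pre-flight check is to compare the size of the database hash table (hash.k2d) with the total memory of the machine. The sketch below is Linux-only, since it reads MemTotal from /proc/meminfo. The function name kraken2_db_fits_in_ram and its optional second argument (memory in kB, useful on other systems) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PHIFlow-Bash): succeed if hash.k2d is
# smaller than total system memory.
# Usage: kraken2_db_fits_in_ram DB_DIR [MEM_KB]
kraken2_db_fits_in_ram() {
  local hash_file="${1%/}/hash.k2d"
  local mem_kb="${2:-}"
  local hash_bytes
  if [ ! -f "$hash_file" ]; then
    echo "not found: $hash_file" >&2
    return 2
  fi
  if [ -z "$mem_kb" ]; then
    # MemTotal is reported in kB on Linux.
    mem_kb="$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
  fi
  hash_bytes="$(wc -c < "$hash_file")"
  echo "hash.k2d: $((hash_bytes / 1024 / 1024)) MiB; RAM: $((mem_kb / 1024)) MiB"
  [ "$((hash_bytes))" -lt "$((mem_kb * 1024))" ]
}

# Example call (path is a placeholder):
# kraken2_db_fits_in_ram /path/to/kraken2_db || echo "database may not fit in RAM"
```

This compares against total memory only; memory already used by STAR or other jobs on the same node is not taken into account.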

💾 Disk Space

Resource	Approximate size
STAR index (GRCh38)	~30-35 GB
Kraken2 PlusPF	~100-150 GB
BAM files	5-15 GB per sample

Minimum recommended free space:

200 GB (small projects)
500+ GB (medium/large cohorts)

🧬 Software Environment

Create the Conda environment:

mamba env create -f environment.yml
conda activate phiflow-bash

🚀 Running the Pipeline

Execution requires a single command:

bash phiflow.sh config_profile.yml

⚙️ Configuration File

The pipeline is controlled by a YAML configuration file.

Example:

module_execution_type: "both"
sample_ids_file: "/scratch/icaro/phiflow_pipeline/samples.txt"
dataset_name: "GSE154918"
kraken2_db_path: "/scratch/kraken2db/k2_pluspf"
host_genome_index_path: "/scratch/icaro/genome_indexes/index/100/"
host_genome_annotation_path_gtf: "/scratch/icaro/genome_indexes/gene_annotation/Homo_sapiens.GRCh38.104.gtf"
host_genome_annotation_path_bed: "/scratch/icaro/genome_indexes/gene_annotation/Homo_sapiens.GRCh38.104.bed"
paired_end: "yes"
strandness: "auto"
workdir: "/scratch/icaro/phiflow_pipeline"
samplesdir: "/home/CSBL/icaro/sandbox/metatranscriptomics/studies/GSE154918"

🔎 Parameter Description

Parameter	Description
module_execution_type	"host", "microbe", or "both" (default)
sample_ids_file	File containing sample IDs (one per line)
dataset_name	Dataset name used as output folder
kraken2_db_path	Path to Kraken2 database
host_genome_index_path	STAR genome index directory
host_genome_annotation_path_gtf	GTF annotation filepath
host_genome_annotation_path_bed	BED annotation filepath
paired_end	"yes" or "no"
strandness	"auto", "yes", or "no"
workdir	Output directory
samplesdir	Directory containing FASTQ files
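Since the configuration is a flat list of key: value pairs, the allowed values listed above can be checked before launching a long run. The sketch below is one possible way to do that with sed. The helper names get_cfg and validate_cfg are hypothetical, and this is not necessarily how phiflow.sh itself reads the file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of PHIFlow-Bash).

# Print the value of a top-level key from a flat YAML file, without quotes.
get_cfg() {
  local key="$1" file="$2"
  sed -n "s/^${key}:[[:space:]]*//p" "$file" \
    | head -n 1 \
    | sed -e 's/[[:space:]]*$//' -e 's/^"\(.*\)"$/\1/'
}

# Check the three enumerated parameters against their documented values.
validate_cfg() {
  local file="$1"
  local mode
  mode="$(get_cfg module_execution_type "$file")"
  mode="${mode:-both}"   # "both" is the documented default
  case "$mode" in
    host|microbe|both) ;;
    *) echo "invalid module_execution_type: $mode" >&2; return 1 ;;
  esac
  case "$(get_cfg paired_end "$file")" in
    yes|no) ;;
    *) echo "paired_end must be \"yes\" or \"no\"" >&2; return 1 ;;
  esac
  case "$(get_cfg strandness "$file")" in
    auto|yes|no) ;;
    *) echo "strandness must be \"auto\", \"yes\", or \"no\"" >&2; return 1 ;;
  esac
}

# Example call (file name is a placeholder):
# validate_cfg config_profile.yml && echo "configuration values look valid"
```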

📂 Expected FASTQ Input

FASTQ files must be located in the directory specified by:

samplesdir

Paired-end format:

sample1_R1.fastq.gz
sample1_R2.fastq.gz

Single-end format:

sample1.fastq.gz
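The naming scheme above can be verified for every sample before starting a run. The sketch below assumes that each line of the sample IDs file is the exact prefix of the FASTQ file names (for example, sample1 for sample1_R1.fastq.gz). The function names require_file and check_fastqs are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of PHIFlow-Bash).

# Report a missing file on stderr.
require_file() {
  if [ ! -f "$1" ]; then
    echo "missing FASTQ: $1" >&2
    return 1
  fi
}

# Usage: check_fastqs SAMPLE_IDS_FILE SAMPLES_DIR PAIRED_END("yes"|"no")
check_fastqs() {
  local samples_file="$1"
  local samples_dir="${2%/}"
  local paired_end="$3"
  local sample
  local missing=0
  while IFS= read -r sample || [ -n "$sample" ]; do
    [ -n "$sample" ] || continue   # skip blank lines
    if [ "$paired_end" = "yes" ]; then
      require_file "${samples_dir}/${sample}_R1.fastq.gz" || missing=1
      require_file "${samples_dir}/${sample}_R2.fastq.gz" || missing=1
    else
      require_file "${samples_dir}/${sample}.fastq.gz" || missing=1
    fi
  done < "$samples_file"
  return "$missing"
}

# Example call (paths are placeholders):
# check_fastqs samples.txt /path/to/fastq_dir yes && echo "all FASTQ files found"
```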

📊 Output

All results are generated inside the directory specified in:

workdir/

Outputs include:

BAM alignment files
Gene count matrices
Kraken2 classification reports
MultiQC summary report
