A pixi-based bioinformatics workflow bundle for isolate genome taxonomy, annotation, and comparative genomics.
- Ubuntu 24.04 LTS (Noble Numbat); other Linux distros may work but are untested
- 16 GB RAM minimum (32 GB+ recommended for GTDB-Tk and large assemblies)
- 50 GB free on home partition (for pixi environments)
- 1.2 TB+ external storage for databases (set as EXTERNAL_VAULT)
Note: Ruby and Java (JRE) do not need to be installed system-wide; they are fully managed by pixi within the project environments.
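The RAM and disk requirements above can be checked before installing anything. The following is a minimal sketch, assuming a Linux system with /proc/meminfo; the function names (total_ram_gb, free_gb, check_min) are hypothetical helpers and not part of the bundle. Reported RAM is usually slightly below the installed amount, so the value is rounded to the nearest GB.

```shell
# Total RAM rounded to the nearest GB, read from /proc/meminfo (Linux only).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in whole GB on the filesystem that holds the given path.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Compare a measured value against a minimum and print "ok" or "low".
check_min() {
    local label=$1 have=$2 need=$3
    if [ "$have" -ge "$need" ]; then
        printf '%s: ok (%s GB, need %s GB)\n' "$label" "$have" "$need"
    else
        printf '%s: low (%s GB, need %s GB)\n' "$label" "$have" "$need"
    fi
}

# Usage:
# check_min "RAM" "$(total_ram_gb)" 16
# check_min "Home partition" "$(free_gb "$HOME")" 50
```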
sudo apt update && sudo apt install -y \
git \
curl \
wget \
tmux \
aria2 \
rsync \
unzip \
build-essential

| Package | Purpose |
|---|---|
| git | Clone this repository and version control |
| curl | Install pixi and download tools |
| wget | Download databases and tools |
| tmux | Keep large downloads running after terminal disconnect |
| aria2 | Fast multi-connection downloads (used by download-kraken2) |
| rsync | Vault sync between drives |
| unzip / build-essential | General build dependencies |
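To confirm the packages in the table are present, one option is to look for the commands they provide. This is a sketch only: missing_tools is a hypothetical helper, aria2 is checked through its aria2c binary, and build-essential is approximated by gcc and make, which it normally installs.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: an empty result means everything was found.
# missing_tools git curl wget tmux aria2c rsync unzip gcc make
```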
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc
pixi --version

git clone git@github.com:bharat1912/taxonomy_bundle.git
cd taxonomy_bundle

Databases total ~1.2 TB and must be stored on an external or high-capacity drive.
Set EXTERNAL_VAULT in your ~/.bashrc to point to that drive before running
any download tasks.
# Add to ~/.bashrc; replace with your actual drive path
echo 'export EXTERNAL_VAULT="/media/your_drive/databases"' >> ~/.bashrc
source ~/.bashrc

Fallback: If EXTERNAL_VAULT is not set, databases will install to taxonomy_bundle/db_local/, which is only suitable for testing as this will likely exhaust space on your primary drive with full databases.
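The fallback rule described above can be illustrated with a small function. This is not the bundle's own code, only a sketch of the documented behaviour; resolve_db_root and its project-root argument are assumptions made for the example.

```shell
# Print the directory databases would go to: EXTERNAL_VAULT when it is set
# and non-empty, otherwise db_local/ inside the project directory.
# Assumption: the project root is passed as $1 (default: current directory).
resolve_db_root() {
    local project_root=${1:-$PWD}
    if [ -n "${EXTERNAL_VAULT:-}" ]; then
        printf '%s\n' "$EXTERNAL_VAULT"
    else
        printf 'warning: EXTERNAL_VAULT is not set, falling back to db_local/\n' >&2
        printf '%s\n' "$project_root/db_local"
    fi
}
```

Running `resolve_db_root` from the project directory shows which location would be used, and the warning on stderr makes an unset variable hard to miss.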
Verify it is set correctly:
echo $EXTERNAL_VAULT # Should print your drive path

pixi install --all

This installs all pixi environments (default, env-a, env-b, env-checkm2, env-busco, etc.). It takes 10 to 30 minutes on first run.

Note: pixi install alone only installs the default environment. Use pixi install --all to install every environment defined in pixi.toml.
pixi run setup-vault

This creates the db_link/ symlink tree inside the project, pointing all tools
to the correct database locations under EXTERNAL_VAULT.
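Because the tools reach their databases through symlinks, a link whose target is missing (for example when the external drive is not mounted) will cause failures later. A minimal sketch for spotting such links, assuming GNU find as shipped with Ubuntu; broken_links is a hypothetical helper, not a bundle task.

```shell
# List symlinks under a directory whose targets do not exist.
# Relies on GNU find's -xtype test; defaults to db_link/ in the current directory.
broken_links() {
    find "${1:-db_link}" -xtype l
}

# Usage (from the project root): no output means every link resolves.
# broken_links db_link
```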
pixi run install-miga-gem

MiGA (Microbial Genomes Atlas) requires a specific Ruby gem (miga-base 1.4.1.6)
installed into the pixi Ruby environment. This step handles that; no system-wide
Ruby or gem install is needed.
Download databases relevant to your workflow. Each task is independently runnable:
# Core taxonomy databases
pixi run download-gtdbtk # GTDB-Tk r226 (~66 GB)
pixi run download-checkm2 # CheckM2 (~3 GB)
pixi run download-kraken2 # Kraken2 standard (~100 GB)
pixi run download-bakta-db # Bakta annotation DB (~70 GB)
# MiGA reference databases
pixi run download-miga-typemat # TypeMat_Lite: type strain genomes (~50 GB)
pixi run download-miga-phyla # Phyla_Lite (~5 GB)
# DFAST_QC reference
pixi run download-dfast-qc # DFAST_QC compact reference (<2 GB)
pixi run download-dfast-qc-gtdb-genomes # Full GTDB genomes for offline search (~127 GB extracted)

Tip: Run all large downloads inside a tmux session so they continue after terminal disconnect:

tmux new -s downloads
# run download commands here
# Detach: Ctrl+B then D
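The tmux tip can also be scripted so that a download starts in a detached session straight away. The sketch below is one possible approach; start_download_session is a hypothetical helper, and the session name and task are whatever you pass in.

```shell
# Start a detached tmux session that runs one pixi download task,
# unless a session with that name already exists.
start_download_session() {
    local session=$1 task=$2
    if tmux has-session -t "$session" 2>/dev/null; then
        printf 'session %s already exists\n' "$session" >&2
        return 1
    fi
    tmux new-session -d -s "$session" "pixi run $task"
}

# Usage:
# start_download_session downloads download-gtdbtk
# tmux attach -t downloads    # watch progress; detach with Ctrl+B then D
```

With tmux's default settings the session ends when the command finishes, so a missing session usually means the task has exited, whether it succeeded or failed.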
# Taxonomy classify a single isolate genome
pixi run miga classify_wf \
--db-path $MIGA_HOME/TypeMat_Lite \
--type genome \
-o results/miga_classify \
genome.fasta
# GTDB taxonomy + quality check for an isolate
pixi run -e env-checkm2 dfast-qc-isolate \
-i genome.fasta \
-o results/dfast_qc/sample_name \
-n 8

| Environment | Key tools |
|---|---|
| default | MiGA, GTDB-Tk, Bakta, PHANTASM, BIT, GToTree, PhyKit |
| env-a | Bakta, RDP Classifier, OGRI, env-specific tools |
| env-b | Additional annotation and comparative tools |
| env-checkm2 | DFAST_QC, CheckM2 |
| env-busco | BUSCO 6 |
| env-pan | PIRATE, Panaroo pangenome tools |

The full suite totals ~1.2 TB. You do not need all databases to get started. Each database is downloaded independently, so you can add them one at a time as your work expands.

Note: Adding a database later requires just one command; no reinstallation is needed:

pixi run download-kraken2 # add Kraken2 any time after initial setup
pixi run download-miga-typemat # add MiGA type strain DB when needed

| Database | Tool | Size | Download command | Purpose |
|---|---|---|---|---|
| GTDB-Tk r226 | GTDB-Tk | ~66 GB | pixi run download-gtdbtk | Species-level taxonomy |
| CheckM2 | CheckM2 | ~3 GB | pixi run download-checkm2 | Genome completeness / contamination |
| Bakta DB | Bakta | ~70 GB | pixi run download-bakta-db | Full genome annotation |
| DFAST_QC compact | DFAST_QC | ~2 GB | pixi run download-dfast-qc | Quick taxonomy + QC check |

A 500 GB external SSD is sufficient for Tier 1. This is the recommended starting point.
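The four starter downloads can be run back to back, stopping at the first failure. This is a sketch under assumptions: download_starter_set is a hypothetical wrapper, and the PIXI override (for example PIXI=echo for a dry run) is a convenience of this example rather than a feature of the bundle.

```shell
# Run the starter download tasks one after another; stop at the first failure.
# Set PIXI=echo to preview the commands without downloading anything.
download_starter_set() {
    local pixi_cmd=${PIXI:-pixi}
    local task
    for task in download-gtdbtk download-checkm2 download-bakta-db download-dfast-qc; do
        "$pixi_cmd" run "$task" || {
            printf 'task failed: %s\n' "$task" >&2
            return 1
        }
    done
}

# Usage:
# PIXI=echo download_starter_set   # dry run
# download_starter_set             # real downloads (~141 GB in total)
```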

| Database | Tool | Size | Download command | Purpose |
|---|---|---|---|---|
| TypeMat_Lite | MiGA | ~50 GB | pixi run download-miga-typemat | Type strain taxonomy (28,548 genomes) |
| Phyla_Lite | MiGA | ~5 GB | pixi run download-miga-phyla | Phylum-level classification |
| GTDB genomes (reps) | MiGA / DFAST_QC | ~127 GB | pixi run download-dfast-qc-gtdb-genomes | Offline GTDB search |
| Kraken2 standard | Kraken2 | ~100 GB | pixi run download-kraken2 | Read-level taxonomic profiling |
| MyTaxa | MiGA | ~50 GB | pixi run download-mytaxa | Gene-based taxonomy screening |


| Database | Tool | Size | Download command | Purpose |
|---|---|---|---|---|
| DRAM2 | DRAM2 | ~600 GB | pixi run download-dram2 | Full metabolic annotation |
| BUSCO lineages | BUSCO | ~10 GB | pixi run download-busco-prok | Lineage-specific completeness |
| eggNOG | eggNOG-mapper | ~30 GB | pixi run download-eggnog | Functional annotation + COG |

| What you download | Storage needed |
|---|---|
| Tier 1 only | ~150 GB |
| Tier 1 + Tier 2 | ~550 GB |
| Tier 1 + Tier 2 + Tier 3 | ~1.2 TB |
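
For planning, the per-database sizes from the three tables can be added up directly. The sketch below assumes the tables correspond to Tier 1, Tier 2, and Tier 3 in that order. The raw sums come out below the rounded figures in the storage table, which presumably leave headroom for extraction and temporary files.

```shell
# Sum the approximate per-database sizes (GB) listed in the tier tables.
tier1=$((66 + 3 + 70 + 2))            # GTDB-Tk, CheckM2, Bakta, DFAST_QC compact
tier2=$((50 + 5 + 127 + 100 + 50))    # TypeMat_Lite, Phyla_Lite, GTDB genomes, Kraken2, MyTaxa
tier3=$((600 + 10 + 30))              # DRAM2, BUSCO lineages, eggNOG

printf 'Tier 1:         %d GB\n' "$tier1"
printf 'Tier 1 + 2:     %d GB\n' "$((tier1 + tier2))"
printf 'Tier 1 + 2 + 3: %d GB\n' "$((tier1 + tier2 + tier3))"
```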
By default WSL2 uses only 50% of your system RAM. For bioinformatics workloads,
create a .wslconfig file on the Windows side to raise that limit:
# File location: C:\Users\YourWindowsUsername\.wslconfig
[wsl2]
memory=24GB
processors=8
Adjust memory and processors to match your hardware. Restart WSL2 after saving:
# In PowerShell (Windows side)
wsl --shutdown
# Then reopen Ubuntu

Your Windows drives are mounted automatically inside WSL2:
- C: drive → /mnt/c/
- D: drive → /mnt/d/
- External SSD → /mnt/e/ (or similar)
Set EXTERNAL_VAULT to point to your external drive:
echo 'export EXTERNAL_VAULT="/mnt/d/databases"' >> ~/.bashrc
source ~/.bashrc

Performance note: Large database operations (GTDB-Tk, DRAM2) run faster when databases are stored on the native WSL2 filesystem (~/ or /home/) rather than on a Windows NTFS drive (/mnt/d/). For databases you access frequently, consider copying them to the WSL2 filesystem if space allows.
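
A quick way to tell whether a path sits on a Windows drive mount is to test its prefix. This is a sketch only: on_windows_mount is a hypothetical helper that checks the /mnt/<letter>/ path pattern and does not inspect the actual filesystem type.

```shell
# Return success if the path looks like a WSL2 Windows drive mount
# (/mnt/<single letter> or anything below it).
on_windows_mount() {
    case "$1" in
        /mnt/[a-z] | /mnt/[a-z]/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# on_windows_mount "$EXTERNAL_VAULT" && echo "vault is on a Windows drive; expect slower I/O"
```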
The full database suite is ~1.2 TB. You do not need all of it to get started. Databases can be added at any time with a single command; no reinstallation is needed.

| Database | Tool | Size | Pixi task | Good for |
|---|---|---|---|---|
| GTDB-Tk r226 | GTDB-Tk | ~66 GB | download-gtdbtk | Species-level taxonomy |
| CheckM2 | CheckM2 | ~3 GB | download-checkm2 | Genome completeness & contamination |
| Bakta v6 | Bakta | ~70 GB | download-bakta-db | Gene annotation |
| DFAST_QC compact | DFAST_QC | ~2 GB | download-dfast-qc | Quick isolate taxonomy check |
| Starter total | | ~141 GB | | |

pixi run download-kraken2 # Kraken2 k2_pluspf (~100 GB)
pixi run download-miga-typemat # MiGA TypeMat_Lite (~50 GB)
pixi run download-miga-phyla # MiGA Phyla_Lite (~5 GB)
pixi run download-dfast-qc-gtdb-genomes # DFAST_QC full GTDB (~127 GB extracted)
pixi run download-busco-prok # BUSCO prokaryotic lineages (~10 GB)

DRAM2 note: The DRAM2 metabolic annotation database (~600 GB) is the largest single database. It is only needed for deep metabolic pathway analysis. Most isolate genomics workflows do not require it to get started.
- Isolate taxonomy: Use dfast-qc-isolate (GTDB search); fast, suitable for well-characterised species with GTDB representatives
- MAG taxonomy (environmental bins): Use MiGA (classify_wf or daemon) + GTDB-Tk; handles novel/underrepresented taxa not in the GTDB representative set
- Completeness/contamination: Use CheckM2 directly or via Snakemake workflows
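taxonomy_bundle runs identically on Windows using WSL2 (Windows Subsystem for Linux). WSL2 provides a real Ubuntu Linux environment built into Windows 10/11. It runs a genuine Linux kernel in a lightweight, automatically managed virtual machine, so there is no separate VM to install or configure.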
Open PowerShell as Administrator:
wsl --install

Restart your computer. Ubuntu opens automatically. Set a username and password.
Institutional computers: If your machine is managed by a university or hospital IT department, you may need admin rights or IT assistance to enable WSL2 and virtualisation in BIOS.
WSL2 defaults to 50% of system RAM. For GTDB-Tk and large assemblies, increase this:
Create the file C:\Users\YourName\.wslconfig on Windows with:
[wsl2]
memory=24GB
processors=8

Restart WSL2: run wsl --shutdown in PowerShell, then reopen Ubuntu.
Windows drives are accessible inside WSL2 as /mnt/c/, /mnt/d/ etc.
Set your vault to an external drive:
echo 'export EXTERNAL_VAULT="/mnt/d/databases"' >> ~/.bashrc
source ~/.bashrc

Performance note: For large jobs, databases stored on Windows NTFS drives (/mnt/d/) are slower than the native WSL2 filesystem. If possible, store databases on a dedicated Linux-formatted drive or within the WSL2 filesystem (~/).
- All pixi environments and tools ✓
- All Snakemake workflows ✓
- All database download tasks ✓
- Ruby/gem (MiGA) managed by pixi, no system install needed ✓
- Java managed by pixi, no system install needed ✓
If you use this bundle, please cite the individual tools used in your analysis. Key citations include MiGA, GTDB-Tk, Bakta, DFAST_QC, CheckM2, PHANTASM, and BIT as appropriate.