This Colab-based tool automates the process of identifying the phylogenetic distribution of one or more genes across a user-defined set of genomes. It generates a presence/absence matrix and visualizes it alongside a taxonomic tree as a heatmap.
- 🧬 Accepts a set of protein sequences
- 🌍 Accepts a list of target organisms (or falls back to a short default list)
- ⬇️ Automatically retrieves proteomes (.faa files) from NCBI RefSeq
- 🧪 Runs BLASTP searches to detect gene presence
- 📊 Builds a presence/absence matrix
- 🌳 Constructs a taxonomic tree from NCBI
- 🎨 Plots a heatmap aligned to the tree
All of this is performed entirely in Google Colab — no local installation required.
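As a rough illustration of the BLASTP-to-matrix step, the sketch below turns tabular BLAST output into 0/1 presence calls. It assumes the standard 12-column tabular format (`-outfmt 6`) and an E-value cutoff of `1e-5`. Both are assumptions made for this example, as are the function names; the notebook's actual parameters may differ.

```python
def parse_blast_tabular(text, max_evalue=1e-5):
    """Return the query IDs that have at least one hit at or below max_evalue.

    Assumes BLAST tabular output (-outfmt 6), where column 1 is the query ID
    and column 11 is the E-value.
    """
    hits = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def build_presence_matrix(genes, blast_by_organism, max_evalue=1e-5):
    """Map each organism to a list of 0/1 calls, one per gene, in gene order."""
    matrix = {}
    for organism, text in blast_by_organism.items():
        found = parse_blast_tabular(text, max_evalue)
        matrix[organism] = [1 if gene in found else 0 for gene in genes]
    return matrix


# Made-up BLAST results for two organisms (illustrative values only).
example = {
    "Escherichia coli": (
        "geneA\tWP_1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-120\t400\n"
    ),
    "Homo sapiens": (
        "geneA\tNP_1\t25.0\t50\t30\t2\t1\t50\t1\t50\t0.5\t30\n"
        "geneB\tNP_2\t60.0\t180\t70\t1\t1\t180\t1\t180\t2e-40\t150\n"
    ),
}
matrix = build_presence_matrix(["geneA", "geneB"], example)
```

Here a weak hit (E-value 0.5) does not count as presence, so the cutoff directly controls how permissive the matrix is.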
You need to upload two files to the Colab environment before running the notebook:
The query gene file, `query_genes.txt`, can be provided in either of the following formats:
A standard FASTA file containing one or more amino acid sequences. Each entry should have:
- A header line beginning with `>gene_name`
- The sequence on one or more lines
A plain text file with one UniProt ID per line. The notebook will automatically download the corresponding protein sequences.
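One way to tell the two query formats apart is to look at the first non-empty line: a FASTA file starts with `>`, while an ID list does not. The sketch below shows that check along with a minimal FASTA parser. The function names are hypothetical, and the use of the public UniProt REST endpoint (`rest.uniprot.org`) is an assumption about how sequences could be fetched, not a description of what the notebook does.

```python
def detect_query_format(text):
    """Return 'fasta' if the first non-empty line starts with '>', else 'uniprot_ids'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("query file is empty")
    return "fasta" if lines[0].startswith(">") else "uniprot_ids"


def parse_fasta(text):
    """Parse FASTA text into {gene_name: sequence}, joining multi-line sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header has no name")
            name = parts[0]  # keep only the first word of the header
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name] += line
    return records


def uniprot_fasta_url(accession):
    """Build the UniProt REST URL for one accession (assumed download route)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
```

For example, `detect_query_format("P69905\nP68871\n")` classifies the input as an ID list, and each ID can then be turned into a download URL with `uniprot_fasta_url`.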
The organism list, `organism_list.txt`, is a plain text file with one organism name per line. Each name should be a species name recognized by NCBI.
📦 Don't have a list?
If no file is uploaded, the notebook will fall back to a default list of representative organisms, which includes a few common model species across different branches of life.
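The fallback behavior can be sketched as follows. The default species listed here are placeholders chosen for illustration, not the notebook's actual default list, and the function names are hypothetical.

```python
from pathlib import Path

# Placeholder defaults for illustration; the notebook's real list is not shown here.
DEFAULT_ORGANISMS = ["Escherichia coli", "Saccharomyces cerevisiae", "Homo sapiens"]


def parse_organism_list(text):
    """Return organism names, one per line, skipping blanks and duplicates."""
    names = []
    seen = set()
    for line in text.splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def load_organisms(path="organism_list.txt", defaults=DEFAULT_ORGANISMS):
    """Read the uploaded list, or fall back to defaults if it is missing or empty."""
    file = Path(path)
    if not file.is_file():
        return list(defaults)
    names = parse_organism_list(file.read_text(encoding="utf-8"))
    return names or list(defaults)
```

Returning a copy of the defaults keeps later edits to the returned list from changing the module-level constant.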
👉 Open the notebook in Colab
Steps:
- Upload `query_genes.txt` and `organism_list.txt`
- Run the notebook cells
- Collect outputs:
  - CSV matrix: `profile.csv`
  - Heatmap figure: `tree_and_heatmap.png`
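Once the matrix is downloaded, it can be inspected with a few lines of Python. The sketch below assumes a layout with organisms as rows, genes as columns, and `1`/`0` values. That layout is an assumption, so check the header of your own `profile.csv` first. The sample data is made up.

```python
import csv
import io


def read_profile(handle):
    """Read a presence/absence CSV into {organism: {gene: bool}}.

    Assumes the first column holds organism names and the remaining
    columns hold one gene each, with '1' meaning present.
    """
    reader = csv.reader(handle)
    header = next(reader)
    genes = header[1:]
    profile = {}
    for row in reader:
        if not row:
            continue  # skip empty rows
        organism, values = row[0], row[1:]
        profile[organism] = {g: v.strip() == "1" for g, v in zip(genes, values)}
    return profile


def organisms_with_gene(profile, gene):
    """List the organisms in which the given gene was called present."""
    return sorted(org for org, calls in profile.items() if calls.get(gene))


# In-memory sample standing in for profile.csv (illustrative values only).
sample = io.StringIO(
    "organism,geneA,geneB\n"
    "Escherichia coli,1,0\n"
    "Homo sapiens,0,1\n"
)
profile = read_profile(sample)
```

To read the real output, replace the `io.StringIO` sample with `open("profile.csv", newline="")`.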