Batch calculation of protein physicochemical properties from FASTA files using Biopython.
This script reproduces many of the metrics provided by the ExPASy ProtParam web tool but allows high-throughput analysis of hundreds to hundreds of thousands of sequences locally without using a web browser.
Inspired by the ExPASy ProtParam tool:
https://web.expasy.org/protparam/
The output is written as CSV files that open directly in Excel, R, or Python.
For each protein sequence the script calculates:
- Amino acid counts
- Amino acid percentages
- Molecular weight
- Aromaticity
- Theoretical isoelectric point (pI)
- Secondary structure fraction
  - helix
  - turn
  - sheet
- GRAVY (hydrophobicity score)
- Instability index
- Flexibility statistics
  - mean
  - minimum
  - maximum
  - standard deviation
Additional metadata columns include:
- source FASTA file
- warnings (e.g., dropped residues)
- error handling status
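The per-sequence metrics listed above map onto Biopython's `ProteinAnalysis` class. The following is a minimal sketch, not the script's actual code: the helper name `protparam_row`, the 0-100 percentage scale, and the use of the sample standard deviation are assumptions. The dictionary keys follow the output column names used in this README.

```python
import statistics


def protparam_row(seq_id, sequence):
    """Return a dict of ProtParam-style metrics for one protein sequence.

    Hypothetical helper for illustration; expects a non-empty sequence
    made of the 20 standard amino acids.
    """
    # Imported here so the sketch can be loaded even where Biopython is absent.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    analysis = ProteinAnalysis(sequence)
    helix, turn, sheet = analysis.secondary_structure_fraction()
    flex = analysis.flexibility()  # sliding-window scores; empty for short sequences

    row = {"seq_id": seq_id, "length_aa": len(sequence)}
    for residue, count in sorted(analysis.count_amino_acids().items()):
        row[f"count_{residue}"] = count
        # Assumption: percentages are reported on a 0-100 scale.
        row[f"pct_{residue}"] = 100.0 * count / len(sequence)

    row.update(
        {
            "molecular_weight": analysis.molecular_weight(),
            "aromaticity": analysis.aromaticity(),
            "theoretical_pi": analysis.isoelectric_point(),
            "ss_helix": helix,
            "ss_turn": turn,
            "ss_sheet": sheet,
            "gravy": analysis.gravy(),
            "instability_index": analysis.instability_index(),
            "flex_mean": statistics.mean(flex) if flex else None,
            "flex_min": min(flex) if flex else None,
            "flex_max": max(flex) if flex else None,
            # Assumption: sample standard deviation, which needs two or more values.
            "flex_stdev": statistics.stdev(flex) if len(flex) >= 2 else None,
        }
    )
    return row
```

Calling `protparam_row("demo", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")` returns one row that could be written with `csv.DictWriter`.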
Requires Python 3.8+
Install dependency:
pip install biopython
Or install from requirements:
pip install -r requirements.txt
Example requirements.txt:
biopython
Example project layout:
project_folder/
│
├── batchProtParam.py
├── fastas/
│ ├── proteins1.fasta
│ └── proteins2.fasta
Run from the project directory:
python batchProtParam.py --in_dir ./fastas --out_dir ./results
This produces:
results/
proteins1.protparam.csv
proteins2.protparam.csv
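The naming pattern above (input `proteins1.fasta`, output `proteins1.protparam.csv`) can be sketched with `pathlib`. The function name is hypothetical and the script's own implementation may differ.

```python
from pathlib import Path


def per_fasta_output_path(fasta_path, out_dir):
    """Map a FASTA path to the CSV path that follows the naming shown above."""
    # Path.stem drops the final extension: proteins1.fasta -> proteins1
    return Path(out_dir) / (Path(fasta_path).stem + ".protparam.csv")


print(per_fasta_output_path("fastas/proteins1.fasta", "results").as_posix())
# results/proteins1.protparam.csv
```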
To write one CSV per FASTA file:
python batchProtParam.py \
--in_dir ./fastas \
--out_dir ./results \
--output_mode per_fasta
To combine the results from all FASTA files into a single CSV:
python batchProtParam.py \
--in_dir ./fastas \
--out_dir ./results \
--output_mode all_fastas
Optional custom filename:
python batchProtParam.py \
--in_dir ./fastas \
--out_dir ./results \
--output_mode all_fastas \
--all_fastas_name my_results.csv
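For readers who want to wrap or extend the command line, here is one way the options shown above could be declared with `argparse`. This is a sketch under assumptions: the flag names come from the examples in this README, but the defaults (`per_fasta`, and the placeholder combined filename) are guesses and not confirmed by the script.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="batchProtParam.py")
    parser.add_argument("--in_dir", required=True, help="folder containing FASTA files")
    parser.add_argument("--out_dir", required=True, help="folder for CSV output")
    parser.add_argument(
        "--output_mode",
        choices=["per_fasta", "all_fastas"],
        default="per_fasta",  # assumed default
        help="one CSV per FASTA file, or one combined CSV",
    )
    parser.add_argument(
        "--all_fastas_name",
        default="all_fastas.protparam.csv",  # placeholder default, not from the script
        help="filename of the combined CSV (used with --output_mode all_fastas)",
    )
    return parser


args = build_parser().parse_args(
    ["--in_dir", "./fastas", "--out_dir", "./results",
     "--output_mode", "all_fastas", "--all_fastas_name", "my_results.csv"]
)
print(args.output_mode, args.all_fastas_name)  # all_fastas my_results.csv
```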
Sequences sometimes contain non-standard residues.
Residue  Meaning
X        unknown
B        D or N
Z        E or Q
U        selenocysteine
O        pyrrolysine
You can control how these are handled.
--ambiguous drop
Removes non-standard residues before calculation.
--ambiguous fail
Skips sequences containing ambiguous residues.
--ambiguous keep
Keeps residues unchanged (may cause calculation errors).
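The three modes can be illustrated with a small pure-Python function. This is a sketch of the described behavior, not the script's code: the function name, the warning text, and the choice to signal `fail` with a `ValueError` (so the caller can skip the sequence) are assumptions.

```python
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def handle_ambiguous(sequence, mode="drop"):
    """Apply an --ambiguous mode; return (sequence, list_of_warnings)."""
    if mode not in ("drop", "fail", "keep"):
        raise ValueError(f"unknown mode: {mode}")

    seq = sequence.upper()
    nonstandard = sorted({r for r in seq if r not in STANDARD_RESIDUES})

    if not nonstandard or mode == "keep":
        # keep: pass the sequence through; later calculations may fail on it
        return seq, []
    if mode == "fail":
        # the caller is expected to catch this and skip the sequence
        raise ValueError(f"non-standard residues present: {''.join(nonstandard)}")

    # drop: remove every residue outside the 20 standard amino acids
    cleaned = "".join(r for r in seq if r in STANDARD_RESIDUES)
    dropped = len(seq) - len(cleaned)
    return cleaned, [f"dropped {dropped} non-standard residue(s): {''.join(nonstandard)}"]


print(handle_ambiguous("MKXB", "drop"))
# ('MK', ['dropped 2 non-standard residue(s): BX'])
```

Note that dropping residues shortens the sequence, so length-dependent values such as molecular weight are computed on the cleaned sequence.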
Output columns:
seq_id
length_aa
count_A
pct_A
molecular_weight
aromaticity
theoretical_pi
ss_helix
ss_turn
ss_sheet
gravy
instability_index
flex_mean
flex_min
flex_max
flex_stdev
source_fasta
warnings
status
error_type
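Because the output is plain CSV, the columns listed above can be post-processed with the standard library. The sketch below flags sequences whose `instability_index` exceeds 40, the threshold above which ProtParam predicts that a protein may be unstable. The function name and the sample rows are made up for illustration, and rows with an empty value are assumed to be failed calculations and are skipped.

```python
import csv
import io


def unstable_ids(handle, threshold=40.0):
    """Return seq_id values whose instability_index exceeds the threshold."""
    ids = []
    for row in csv.DictReader(handle):
        value = row.get("instability_index", "")
        # Skip rows where the value is missing or empty.
        if value and float(value) > threshold:
            ids.append(row["seq_id"])
    return ids


# Made-up rows for illustration only; real files contain more columns.
sample = io.StringIO(
    "seq_id,gravy,instability_index\n"
    "protA,-0.35,32.1\n"
    "protB,0.12,47.8\n"
)
print(unstable_ids(sample))  # ['protB']
```

For a real file, pass an open handle instead, for example `open("results/proteins1.protparam.csv", newline="")`.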
The official ProtParam web server is useful for analyzing individual proteins but becomes impractical for large datasets.
This script enables:
- High-throughput proteome analysis
- Automated pipelines
- Reproducible workflows
- Integration with Python, R, or spreadsheet analysis
This tool relies on Biopython:
Cock et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics.
MIT License