CMDePompa/Batch-ProtParam

Batch ProtParam

Batch calculation of protein physicochemical properties from FASTA files using Biopython.

This script reproduces many of the metrics provided by the ExPASy ProtParam web tool but allows high-throughput analysis of hundreds to hundreds of thousands of sequences locally without using a web browser.

Inspired by the ExPASy ProtParam tool:
https://web.expasy.org/protparam/

The output is written as CSV files that open directly in Excel, R, or Python.


Features

For each protein sequence the script calculates:

  • Amino acid counts
  • Amino acid percentages
  • Molecular weight
  • Aromaticity
  • Theoretical isoelectric point (pI)
  • Secondary structure fraction
    • helix
    • turn
    • sheet
  • GRAVY (hydrophobicity score)
  • Instability index
  • Flexibility statistics
    • mean
    • minimum
    • maximum
    • standard deviation

Additional metadata columns include:

  • source FASTA file
  • warnings (e.g., dropped residues)
  • error handling status
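
The metrics listed above map closely onto Biopython's ProteinAnalysis class. The following is a minimal sketch of how one output row might be assembled, not the script's actual code. The function name protparam_row is hypothetical, and both the 0-100 percentage scale and the use of a population standard deviation for the flexibility values are assumptions. The sequence is expected to be non-empty and to contain only standard residues.

```python
from statistics import mean, pstdev


def protparam_row(seq_id, sequence):
    """Return a dict of ProtParam-style metrics for one protein sequence."""
    # Imported here so the sketch can be loaded even where Biopython is absent.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    analysis = ProteinAnalysis(sequence)
    counts = analysis.count_amino_acids()  # residue letter -> count
    helix, turn, sheet = analysis.secondary_structure_fraction()
    flex = analysis.flexibility()  # sliding-window scores; empty for very short sequences

    row = {"seq_id": seq_id, "length_aa": len(sequence)}
    for aa in sorted(counts):
        row[f"count_{aa}"] = counts[aa]
        # Assumption: percentages are reported on a 0-100 scale.
        row[f"pct_{aa}"] = 100.0 * counts[aa] / len(sequence)

    row.update({
        "molecular_weight": analysis.molecular_weight(),
        "aromaticity": analysis.aromaticity(),
        "theoretical_pi": analysis.isoelectric_point(),
        "ss_helix": helix,
        "ss_turn": turn,
        "ss_sheet": sheet,
        "gravy": analysis.gravy(),
        "instability_index": analysis.instability_index(),
        "flex_mean": mean(flex) if flex else None,
        "flex_min": min(flex) if flex else None,
        "flex_max": max(flex) if flex else None,
        # Assumption: population standard deviation; the script may use the sample form.
        "flex_stdev": pstdev(flex) if flex else None,
    })
    return row
```

Calling protparam_row("demo", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ") returns a dictionary whose keys follow the column names used in this README.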

Installation

Requires Python 3.8+

Install dependency:

pip install biopython

Or install from requirements:

pip install -r requirements.txt

Example requirements.txt:

biopython

Example Folder Structure

project_folder/
│
├── batchProtParam.py
├── fastas/
│   ├── proteins1.fasta
│   └── proteins2.fasta

Basic Usage

Run from the project directory:

python batchProtParam.py --in_dir ./fastas --out_dir ./results

This produces:

results/
    proteins1.protparam.csv
    proteins2.protparam.csv

Output Modes

One CSV per FASTA (default)

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode per_fasta

One combined CSV for all FASTAs

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode all_fastas

Optional custom filename:

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode all_fastas \
  --all_fastas_name my_results.csv
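
For automated pipelines, the same invocation can be driven from Python. This is a minimal sketch that only uses the flags documented above. The helper names build_command and run_batch are hypothetical, and the sketch assumes batchProtParam.py sits in the working directory and exits with a non-zero code on failure.

```python
import subprocess
import sys


def build_command(in_dir, out_dir, output_mode="per_fasta",
                  all_fastas_name=None, script="batchProtParam.py"):
    """Assemble the argument list for one batchProtParam.py run."""
    cmd = [
        sys.executable, script,
        "--in_dir", str(in_dir),
        "--out_dir", str(out_dir),
        "--output_mode", output_mode,
    ]
    if all_fastas_name is not None:
        cmd += ["--all_fastas_name", all_fastas_name]
    return cmd


def run_batch(*args, **kwargs):
    """Run the script; raises CalledProcessError if it exits non-zero."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

For example, run_batch("./fastas", "./results", "all_fastas", "my_results.csv") corresponds to the last command shown above.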

Handling Ambiguous Amino Acids

Sequences sometimes contain non-standard residues.

Residue   Meaning
X         unknown
B         D or N
Z         E or Q
U         selenocysteine
O         pyrrolysine

You can control how these are handled.

Default (recommended)

--ambiguous drop

Removes non-standard residues before calculation and notes the dropped residues in the warnings column.

Strict mode

--ambiguous fail

Skips sequences containing ambiguous residues.

Advanced mode

--ambiguous keep

Keeps residues unchanged (may cause calculation errors).
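
The three modes can be pictured with the following sketch. It is an illustration of the documented behaviour, not the script's actual code: the helper name apply_ambiguous_policy, the warning text, and the choice to raise ValueError in fail mode (so that a caller can skip the sequence) are all assumptions.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MODES = ("drop", "fail", "keep")


def apply_ambiguous_policy(sequence, mode="drop"):
    """Return (sequence, warnings) after applying one ambiguity mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    sequence = sequence.upper()
    # Every distinct character outside the 20 standard residues.
    unusual = "".join(sorted({aa for aa in sequence if aa not in STANDARD_AA}))

    if not unusual or mode == "keep":
        return sequence, []  # nothing to do, or leave the sequence untouched
    if mode == "fail":
        # The caller is expected to catch this and skip the sequence.
        raise ValueError(f"non-standard residues present: {unusual}")

    # mode == "drop": strip the residues and report what was removed.
    cleaned = "".join(aa for aa in sequence if aa in STANDARD_AA)
    dropped = len(sequence) - len(cleaned)
    return cleaned, [f"dropped {dropped} residue(s): {unusual}"]
```

With drop, the input "MKXBA" becomes "MKA" and the warning lists B and X as removed. With keep, the sequence is returned unchanged and any downstream calculation error is left to the caller.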


Example Output Columns

seq_id
length_aa
count_A
pct_A
molecular_weight
aromaticity
theoretical_pi
ss_helix
ss_turn
ss_sheet
gravy
instability_index
flex_mean
flex_min
flex_max
flex_stdev
source_fasta
warnings
status
error_type
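
As an example of downstream use, the sketch below reads one output CSV with the standard library and returns the IDs of sequences predicted to be stable. It relies only on the seq_id and instability_index columns listed above. The function name load_stable_ids is hypothetical, and the cutoff of 40 follows the convention ProtParam uses for interpreting the instability index. Rows without a usable numeric value are skipped, on the assumption that failed sequences leave that field empty.

```python
import csv


def load_stable_ids(csv_path, max_instability=40.0):
    """Return seq_id values whose instability_index is below the cutoff."""
    stable = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                value = float(row["instability_index"])
            except (KeyError, TypeError, ValueError):
                continue  # missing column or non-numeric value: skip the row
            if value < max_instability:
                stable.append(row["seq_id"])
    return stable
```

For example, load_stable_ids("results/proteins1.protparam.csv") would scan the first file from the Basic Usage section.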

Why Use This Script?

The official ProtParam web server is useful for analyzing individual proteins but becomes impractical for large datasets.

This script enables:

  • High-throughput proteome analysis
  • Automated pipelines
  • Reproducible workflows
  • Integration with Python, R, or spreadsheet analysis

Citation

This tool relies on Biopython:

Cock et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.


License

MIT License
