Protein Kingdom Classification from Codon Usage Patterns

Project Overview

This project develops machine learning models that predict a species' biological kingdom from its codon usage frequencies. The dataset comprises over 13,000 samples, each described by the usage frequencies of all 64 codons, which supports multi-class classification across biological kingdoms.
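To make the feature representation concrete, here is a minimal sketch of how a 64-dimensional codon usage vector could be derived from a single coding sequence. It assumes each frequency is a codon's count divided by the total number of codons; the dataset's own preprocessing may differ, and the function name `codon_frequencies` and the example sequence are illustrative only.

```python
from collections import Counter
from itertools import product

# All 64 possible codons over the RNA alphabet (ordering here is an assumption).
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(sequence: str) -> dict[str, float]:
    """Return the relative frequency of each of the 64 codons in a coding sequence."""
    seq = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    # Split into non-overlapping triplets, dropping any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Ignore triplets containing ambiguous bases such as N.
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CODONS}
    return {c: counts[c] / total for c in CODONS}


# Hypothetical toy sequence: AUG GCC GCC UAA
freqs = codon_frequencies("ATGGCCGCCTAA")
print(freqs["GCC"], freqs["AUG"])
```

Each sample then becomes a fixed-length vector of 64 values that sum to 1, which is the kind of input the classifiers below can consume directly.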

Dataset

Source: Codon Usage Dataset on Kaggle

Models Implemented

  • Logistic Regression (with multinomial classification)
  • Support Vector Machines (SVM)
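As a rough illustration of how these two model families might be trained and compared, the sketch below uses scikit-learn on synthetic data. The synthetic features stand in for the real codon frequency table, and the scaling step, train/test split, and hyperparameters are assumptions rather than the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_per_class, n_codons = 3, 100, 64

# Synthetic stand-in for the dataset: each class draws 64 frequencies
# (summing to 1) from a Dirichlet distribution with its own parameters.
X = np.vstack([
    rng.dirichlet(rng.uniform(0.5, 5.0, n_codons), n_per_class)
    for _ in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)

# Stratify so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # With the default lbfgs solver, multi-class problems are fit as a
    # single multinomial (softmax) model.
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data

print(scores)
```

Wrapping each estimator in a pipeline keeps the scaler fitted on training data only, which avoids leaking test-set statistics into the model.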

Environment Setup

Using Conda

conda env create -f env.yml
conda activate codon-classification

Key Findings

  • Best Performing Model: SVM with RBF kernel
  • Well-classified Kingdoms: Bacteria, Viruses, Vertebrates, Plants (F1-scores ~0.96)
  • Challenging Categories: low-sample classes (Archaea, Phage)
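Per-class F1 is what exposes the weaker low-sample classes, since overall accuracy can stay high while a rare class is mostly missed. The sketch below computes it from scratch; the labels are made-up toy values, not the project's results, and `per_class_f1` is an illustrative helper name.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Compute the F1 score of each class from true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical imbalanced example: 8 majority samples, 2 minority samples.
y_true = ["bacteria"] * 8 + ["archaea"] * 2
y_pred = ["bacteria"] * 8 + ["bacteria", "archaea"]
f1 = per_class_f1(y_true, y_pred)
print(f1)
```

In this toy case 9 of 10 predictions are correct, yet the minority class scores a much lower F1 than the majority class, which mirrors the pattern described in the findings above.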