featsel

Feature selection pipeline for high-dimensional data with a focus on genomics and bioinformatics.

The Problem

Modern datasets often contain thousands of features but relatively few samples. This is especially common in:

Genomics: Gene expression profiles with 20,000+ genes per patient
Text analysis: Document classification with large vocabularies
Sensor data: IoT and industrial monitoring systems

Training models directly on such data tends to cause overfitting, long computation times, and poor interpretability. Feature selection addresses this by keeping the most predictive variables and discarding noise.

What This Project Does

This project implements a feature selection pipeline that:

Applies multiple feature selection methods (filter, wrapper, and embedded approaches)
Compares their effectiveness on classification tasks
Scales efficiently through parallelization
Produces interpretable results for domain experts

The primary use case is predicting breast cancer molecular subtypes from gene expression data, but the pipeline generalizes to other high-dimensional classification problems.
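
To make the filter family listed above concrete, here is a minimal sketch that ranks features by a signal-to-noise score (gap between class means divided by the summed class standard deviations), a score often used for two-class gene expression data. It is an illustration only, not the code in the featsel package: the function names `filter_scores` and `top_k` and the toy data are hypothetical, and it uses only the Python standard library.

```python
from statistics import mean, pstdev


def filter_scores(rows, labels):
    """Score every feature for a two-class problem.

    rows: list of samples, each a list of numeric feature values.
    labels: list of 0/1 class labels, one per sample.
    Returns one score per feature; higher means better class separation.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == 0]
        gap = abs(mean(pos) - mean(neg))
        spread = pstdev(pos) + pstdev(neg)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # No within-class variation: either perfect separation or a constant feature.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy data (hypothetical): feature 0 separates the classes,
    # feature 1 is noise, feature 2 is constant.
    rows = [
        [5.1, 0.3, 1.0],
        [4.9, 0.9, 1.0],
        [5.3, 0.1, 1.0],
        [1.2, 0.4, 1.0],
        [0.8, 0.8, 1.0],
        [1.0, 0.2, 1.0],
    ]
    labels = [1, 1, 1, 0, 0, 0]
    print(top_k(filter_scores(rows, labels), 2))
```

A filter like this scores each feature independently of any model, which is why it is cheap and easy to parallelize. Wrapper and embedded methods instead evaluate features through a trained classifier.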

Project Structure

├── configs/            # Dataset configuration files (YAML)
├── docs/               # Report chapters (Markdown)
├── notebooks/          # Jupyter notebooks for analysis
├── featsel/            # Main Python package
├── datasets/           # Input data (one subfolder per dataset)
├── figures/            # Generated plots
└── references/         # Project proposal and papers

Installation

# Clone the repository
git clone https://github.com/drormeir/featsel.git
cd featsel

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Once the package is published on PyPI (not yet available), installation will be:

pip install featsel

Usage

Run the pipeline with a configuration file:

python -m featsel.run --config configs/scanb.yaml

To use your own dataset, create a config file in configs/ (use configs/scanb.yaml as a template) and add a data folder for it under datasets/.

Datasets

The pipeline is dataset-agnostic. Each dataset needs:

A subfolder in datasets/ with features.csv and metadata.csv
A YAML config file in configs/
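
The layout requirements above can be checked before a run with a small helper such as the sketch below. It is not part of the featsel package: the name `missing_dataset_files` is hypothetical, and it assumes the config file is named after the dataset folder (as with configs/scanb.yaml and datasets/scanb/), which the project may not require.

```python
from pathlib import Path

# File names taken from the dataset requirements listed above.
REQUIRED_FILES = ("features.csv", "metadata.csv")


def missing_dataset_files(project_root, dataset_name):
    """Return the expected paths for a dataset that do not exist yet."""
    root = Path(project_root)
    expected = [root / "datasets" / dataset_name / name for name in REQUIRED_FILES]
    # Assumption: the config shares its name with the dataset folder.
    expected.append(root / "configs" / f"{dataset_name}.yaml")
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    for path in missing_dataset_files(".", "scanb"):
        print(f"missing: {path}")
```

Run from the repository root, this prints one line per missing file and nothing when the dataset is complete.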

SCAN-B Breast Cancer (included config)

Gene expression measurements (thousands of features)
PAM50 molecular subtype labels (Basal, LumA, LumB, HER2, Normal)
Clinical metadata (ER status, survival data)

Note: Data files are not included due to size. Download from [TBD] and place in datasets/scanb/.

Status

This project is part of an M.Sc. thesis at Reichman University, supervised by Dr. Ben Galili.
