tejfaster/Bloom-Filter-Text-Classification

Bloom Filter with Information Theory (Mutual Information + Conditional Entropy)

This project implements a Bloom Filter for fast membership testing with a target False Positive Rate (FPR). It extends the classical Bloom Filter design with Mutual Information (MI)-based feature selection and Conditional Entropy to quantify the residual uncertainty after a Bloom Filter match.


✨ Features

✅ Standard Bloom Filter (bit array + k hash functions)
✅ Double Hashing using mmh3 (efficient hashing)
✅ Target FPR design using optimal formulas for m and k
✅ Feature Extraction (tokenization for text inputs)
✅ Mutual Information for feature selection
✅ Bayesian Posterior Membership Probability
✅ Conditional Entropy for uncertainty quantification
✅ Multi-query evaluation + plots
✅ Results and plots saved into results/ folder
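As a minimal sketch of the core design listed above: the bit-array size m and hash count k follow the standard optimal formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 for a target FPR p, and the k indexes come from double hashing h_i(x) = (h1(x) + i·h2(x)) mod m. This illustration uses Python's built-in hashlib instead of the project's mmh3 so it runs with no dependencies; the class name and parameters are illustrative, not the repository's actual API.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, target_fpr):
        # Optimal bit-array size and hash count for the target FPR
        self.m = math.ceil(-n_items * math.log(target_fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = [0] * self.m

    def _indexes(self, item):
        # Double hashing: derive two base hashes from one digest,
        # then combine as h_i(x) = (h1 + i*h2) mod m
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(n_items=1000, target_fpr=0.01)
bf.add("apple")
print("apple" in bf)  # True (no false negatives)
```

For n = 1000 and p = 0.01 this yields m = 9586 bits and k = 7 hash functions, matching the usual "about 9.6 bits per item at 1% FPR" rule of thumb.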


📌 Bloom Filter Properties

  • No false negatives: If Bloom Filter returns False, the item is definitely not present.
  • Possible false positives: If Bloom Filter returns True, the item may or may not be present.

📂 Project Structure

Bloom_Filter_Project/
│
├── bloom_filter.py        # Bloom Filter core implementation
├── features.py            # Feature extraction + Mutual Information computation
├── inference.py           # Posterior probability + Conditional entropy
├── evaluation.py          # Multiple query evaluation functions
├── plots.py               # Plotting functions + saving plots
├── main.py                # Runs the full pipeline
│
├── results/               # Auto-generated results
│   ├── query_results.csv
│   ├── summary.txt
│   └── plots/
│       ├── posterior_per_query.png
│       ├── entropy_per_query.png
│       └── trustworthiness_curve.png
│
├── requirements.txt       # Project dependencies
└── README.md
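As an illustration of the Mutual Information feature selection that features.py is described as performing, here is a small self-contained sketch that scores each token by I(token presence; label) in bits; the function, variable names, and toy data are hypothetical, not taken from the repository:

```python
import math
from collections import Counter

def token_mutual_information(docs, labels):
    """MI (in bits) between each token's presence and the binary label.
    docs: list of token lists; labels: parallel list of 0/1 labels."""
    n = len(docs)
    label_counts = Counter(labels)
    vocab = {tok for doc in docs for tok in set(doc)}
    mi = {}
    for tok in vocab:
        # Joint counts over (token present?, label) pairs
        joint = Counter((tok in set(doc), y) for doc, y in zip(docs, labels))
        score = 0.0
        for (present, y), c in joint.items():
            p_xy = c / n
            p_x = sum(v for (pr, _), v in joint.items() if pr == present) / n
            p_y = label_counts[y] / n
            score += p_xy * math.log2(p_xy / (p_x * p_y))
        mi[tok] = score
    return mi

docs = [["spam", "offer"], ["spam", "win"], ["meeting", "notes"], ["notes", "agenda"]]
labels = [1, 1, 0, 0]
mi = token_mutual_information(docs, labels)
print(round(mi["spam"], 3))  # 1.0 (perfectly predictive of the label)
```

Tokens whose presence perfectly separates the classes score 1 bit; tokens that appear in only some documents of one class score lower, which is the basis for keeping only the most informative features.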
