A complete, end-to-end machine learning workflow on the UCI Heart Disease dataset, including preprocessing, feature engineering, model training, evaluation, and an interactive Streamlit app for single and batch inference.
Heart_Disease_Project/
├─ data/
│ ├─ heart_disease.csv # Raw dataset
│ ├─ cleaned_X.csv # Cleaned, model-ready features (defines input schema)
│ └─ clean_y.csv # Target labels used during training
├─ models/
│ └─ final_model.pkl # Trained scikit-learn model artifact
├─ notebooks/ # EDA, preprocessing, modeling, tuning
│ ├─ 01_data_preprocessing.ipynb
│ ├─ 02_pca_analysis.ipynb
│ ├─ 03_feature_selection.ipynb
│ ├─ 04_supervised_learning.ipynb
│ ├─ 05_unsupervised_learning.ipynb
│ └─ 06_hyperparameter_tuning.ipynb
├─ results/
│ └─ evaluation_metrics.txt # Saved evaluation metrics
├─ requirements.txt
└─ app.py # Streamlit application entry point
The app loads the trained model from models/final_model.pkl and expects input features that match the columns and order of data/cleaned_X.csv.
- Single Prediction tab: Fill in patient details; the app builds the one-hot feature vector behind the scenes and predicts risk (and probability if available).
- Batch Prediction tab: Upload a CSV with the exact same header as
data/cleaned_X.csvto get predictions for many rows at once. A header template download is provided in the UI. - Documentation tab: Project overview, metrics (from
results/evaluation_metrics.txtif present), and the exact feature column list.
- Recommended Python: 3.11 (Windows supported)
- Create/activate a virtual environment (optional but recommended)
Windows PowerShell example:
# From the project root
python -m venv .venv
. .venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txtFrom the project root:
streamlit run app.pyStreamlit will open a local URL in your browser. If it doesn’t, copy the printed URL into your browser.
The app derives the exact expected input columns from data/cleaned_X.csv. These include:
- Continuous:
age,trestbps,chol,thalach,oldpeak,ca - One-hot encoded groups (column names start with):
sex_,cp_,restecg_,slope_,fbs_,exang_,thal_
For batch predictions, the uploaded CSV must have the identical header (names and order). Any extra columns will be ignored; any missing required columns will stop the run.
-
UnpicklingError / invalid load key:
-
The model file may have been saved with
joblib. The loader inapp.pytriesjoblib.loadfirst, then falls back topickle.load. -
If it still fails, ensure your environment’s
scikit-learnversion is compatible with the artifact that producedfinal_model.pkl. As a last resort, re-save the model in the current environment:import joblib joblib.dump(model, 'models/final_model.pkl')
-
-
Model or data file not found:
- Verify
models/final_model.pklanddata/cleaned_X.csvexist and paths match the project layout.
- Verify
-
Feature mismatch for batch upload:
- Use the header template download in the Batch tab and populate rows under that header.
- Keep
final_model.pklandcleaned_X.csvin sync with the training pipeline used in the notebooks. - If you change preprocessing or feature engineering, regenerate both the features and the model.
This project is provided for educational and research purposes. Add a license file if you plan to distribute.