pH Compliance Is Not Random — Random Forest

"pH in a pharmaceutical syrup is not a single measurement — it is the outcome of a system. Citric acid pushes it down. The citrate buffer resists the change. Water quality sets the baseline. Mixing conditions determine whether the equilibrium is actually reached. A model that ignores any of these dimensions will miss the story."

🎯 Business Problem

In oral liquid pharmaceutical manufacturing, pH is a release criterion — not a preference. A syrup lot with final pH outside the 4.5–5.5 specification cannot be shipped. It goes to reprocess if the batch is salvageable, or to disposal if it isn't. Either outcome triggers a deviation investigation, regulatory documentation, and a delay in the batch release schedule that ripples through production planning.

The challenge is that pH deviation is rarely caused by a single variable. It is the result of a specific combination of formulation choices and process conditions converging unfavorably. Traditional process control monitors each variable individually — citric acid within range, buffer within range, agitation at target — but misses the interaction effects that determine the final pH outcome.

This project builds a pre-batch risk classifier: given the planned recipe and process conditions, what is the probability of pH compliance? The model is consulted before synthesis begins — enabling formulation corrections before any material is consumed.

📊 Dataset

1,500 pharmaceutical batch records from an oral liquid manufacturing line
Target: lot_approved (binary) — 1 = final pH within 4.5–5.5 specification
Class balance: 40.1% approved (moderately imbalanced — 59.9% failed pH spec)
Final pH range: 1.96–6.68 (specification window: 4.5–5.5)

Layer	Feature	Description
Formulation	`citric_acid_pct`	Primary acidulant — strongest pH driver
Formulation	`citrate_buffer_pct`	Counter-lever — resists pH change
Formulation	`api_pct`, `sweetener_pct`, `preservative_pct`	Excipient contributions
Water	`water_ph`	Purified water quality at batch start
Carryover	`prev_lot_ph`	Previous lot pH — line conditioning effect
Process	`mixing_temp_c`, `mixing_time_min`, `agitation_rpm`	Physical mixing conditions
Sequence	`addition_order`	1=Buffer→Acid→API / 2=Acid→Buffer / 3=API→Buffer

Key EDA findings:

Addition order 2 (Acid→Buffer→API) achieves only 32.6% approval vs 44.3% for order 1 — sequence matters chemically, not just operationally
Previous lot pH at the lower boundary (≤4.6) measurably increases next-batch deviation risk
Final pH distribution shows a cluster of lots near 4.0 — slightly below spec — indicating systematic under-buffering

🤖 Model

Algorithm: Random Forest (300 trees) — sklearn.ensemble.RandomForestClassifier

pH deviation is not a linear problem. High citric acid combined with high buffer produces a different outcome than high acid with low buffer. These interaction effects are the mechanism — and they are exactly what tree-based ensembles are designed to capture.

A single Decision Tree trained on the same data achieves AUC 0.776. The Random Forest (300 trees, each trained on a random data and feature subset) reaches AUC 0.908. The 13-point gap is not luck — it is the variance reduction effect of ensemble averaging. Individual trees overfit the formulation complexity; the ensemble generalizes.

Why no scaling: Random Forest uses threshold splits, not distances. A concentration in % and a temperature in °C compete on equal footing — no StandardScaler needed.

Dual importance measures: Gini (built-in, fast) and Permutation (test-set, rigorous) both computed. When they agree — as they do here — the signal is robust.

📈 Key Results

Metric	Random Forest	Single Decision Tree
Test Accuracy	82.7%	78.0%
ROC-AUC	0.908	0.776
F1 (Approved)	79.3%	—
Recall (Approved)	82.3%	—

Confusion matrix (450 test batches):

	Pred: Fail	Pred: Pass
Actual: Fail	223 (TN) ✅	46 (FP)
Actual: Pass	32 (FN)	149 (TP) ✅

32 false negatives — batches predicted to pass that would actually fail. In a regulatory context, each FN is a potential deviation event that the model didn't flag. Lowering the classification threshold from 0.5 reduces FN at the cost of more FP (unnecessary interventions).

🔍 Feature Importance

Both Gini and Permutation importance tell the same story:

Feature	Gini	Permutation
`citric_acid_pct`	26.1%	Δ −23.6%
`citrate_buffer_pct`	21.0%	Δ −17.5%
`prev_lot_ph`	7.5%	Δ −3.2%
`agitation_rpm`	8.1%	Δ −1.9%
`mixing_time_min`	7.2%	Δ −1.2%
`water_ph` / `temp` / rest	combined 30%	Δ <1.5% each

The formulation insight: citric acid and citrate buffer together account for 47% of Gini importance and dominate permutation importance by a wide margin. The ratio between these two — not their individual values — determines the pH outcome. This is exactly what physical chemistry predicts.

The process insight: agitation and mixing time matter, but far less than the recipe. You cannot fix a bad formulation with better process conditions (see Scenario C).

🗂️ Repository Structure

PH_Adjustment_Syrup/
├── 06_RF_PH_Adjustment.ipynb  # Notebook (no outputs)
├── rf_raw_data.csv            # Sample dataset (250 rows)
├── README.md
└── requirements.txt

📦 Full Project Pack — complete dataset (1,500 batches), notebook with full outputs including Gini vs Permutation comparison charts, presentation deck (PPTX + PDF), and app.py pre-batch risk simulator available on Gumroad.

🚀 How to Run

Option 1 — Google Colab: Click the badge above.

Option 2 — Local:

pip install -r requirements.txt
jupyter notebook 06_RF_PH_Adjustment.ipynb

💡 Key Learnings

Ensemble beats single tree by design, not by chance — the jump from AUC 0.776 (DT) to 0.908 (RF) demonstrates the mechanics of bagging: many imperfect trees, each seeing different data, produce a collective vote that generalizes where any individual tree would overfit.
Two importance measures that agree give high confidence — Gini and Permutation use different computational approaches and answer slightly different questions. When both name the same top features, the signal is real, not an artifact of either method's bias.
Process corrections cannot fix a formulation root cause — Scenario C shows this clearly: increasing buffer and agitation on a 1.7% citric acid formula lowers approval probability from 11.7% to 1.0%. The acid concentration must be addressed at the recipe level.
The previous lot effect is underappreciated — prev_lot_ph carries 7.5% Gini importance. Line conditioning effects are real in liquid manufacturing, and they're rarely captured in traditional process control. This model surfaces them.
Addition order is chemistry, not just procedure — the 11.7 percentage-point approval rate difference between order 1 and order 2 isn't an operational preference — it reflects real differences in acid-buffer equilibration kinetics. The model learned what the chemistry textbook says.

👤 Author

Luis Lozano | Operational Excellence Manager · Master Black Belt · Machine Learning
GitHub: LozanoLsa · Gumroad: lozanolsa.gumroad.com

Turning Operations into Predictive Systems — Clone it. Fork it. Improve it.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
06_RF_PH_Adjustment.ipynb		06_RF_PH_Adjustment.ipynb
LICENSE		LICENSE
PH_Approval_RandomForest.pdf		PH_Approval_RandomForest.pdf
README.md		README.md
cover_06.png		cover_06.png
data_sources_and_features.txt		data_sources_and_features.txt
requirements.txt		requirements.txt
rf_raw_data.csv		rf_raw_data.csv
thumb_06.png		thumb_06.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pH Compliance Is Not Random — Random Forest

🎯 Business Problem

📊 Dataset

🤖 Model

📈 Key Results

🔍 Feature Importance

🗂️ Repository Structure

🚀 How to Run

💡 Key Learnings

👤 Author

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pH Compliance Is Not Random — Random Forest

🎯 Business Problem

📊 Dataset

🤖 Model

📈 Key Results

🔍 Feature Importance

🗂️ Repository Structure

🚀 How to Run

💡 Key Learnings

👤 Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages