"pH in a pharmaceutical syrup is not a single measurement — it is the outcome of a system. Citric acid pushes it down. The citrate buffer resists the change. Water quality sets the baseline. Mixing conditions determine whether the equilibrium is actually reached. A model that ignores any of these dimensions will miss the story."
In oral liquid pharmaceutical manufacturing, pH is a release criterion — not a preference. A syrup lot with final pH outside the 4.5–5.5 specification cannot be shipped. It goes to reprocess if the batch is salvageable, or to disposal if it isn't. Either outcome triggers a deviation investigation, regulatory documentation, and a delay in the batch release schedule that ripples through production planning.
The challenge is that pH deviation is rarely caused by a single variable. It is the result of a specific combination of formulation choices and process conditions converging unfavorably. Traditional process control monitors each variable individually — citric acid within range, buffer within range, agitation at target — but misses the interaction effects that determine the final pH outcome.
This project builds a pre-batch risk classifier: given the planned recipe and process conditions, what is the probability of pH compliance? The model is consulted before synthesis begins — enabling formulation corrections before any material is consumed.
- 1,500 pharmaceutical batch records from an oral liquid manufacturing line
- Target: `lot_approved` (binary), where 1 = final pH within the 4.5–5.5 specification
- Class balance: 40.1% approved (moderately imbalanced; 59.9% failed pH spec)
- Final pH range: 1.96–6.68 (specification window: 4.5–5.5)
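
The labeling rule behind the target can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the column name `final_ph` is an assumption, the values are toy numbers, and treating both specification limits as inclusive is also an assumption.

```python
import pandas as pd

# Toy values for illustration only; `final_ph` is an assumed column name.
batches = pd.DataFrame({"final_ph": [1.96, 4.50, 5.00, 5.50, 6.68]})

SPEC_LOW, SPEC_HIGH = 4.5, 5.5  # specification window stated in this README

# Series.between is inclusive on both ends by default (assumed to match the spec).
batches["lot_approved"] = batches["final_ph"].between(SPEC_LOW, SPEC_HIGH).astype(int)

approval_rate = batches["lot_approved"].mean()
print(batches)
print(f"Approval rate: {approval_rate:.1%}")
```
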
| Layer | Feature | Description |
|---|---|---|
| Formulation | `citric_acid_pct` | Primary acidulant; strongest pH driver |
| Formulation | `citrate_buffer_pct` | Counter-lever; resists pH change |
| Formulation | `api_pct`, `sweetener_pct`, `preservative_pct` | Excipient contributions |
| Water | `water_ph` | Purified water quality at batch start |
| Carryover | `prev_lot_ph` | Previous lot pH; line conditioning effect |
| Process | `mixing_temp_c`, `mixing_time_min`, `agitation_rpm` | Physical mixing conditions |
| Sequence | `addition_order` | 1 = Buffer→Acid→API / 2 = Acid→Buffer / 3 = API→Buffer |

Key EDA findings:
- Addition order 2 (Acid→Buffer→API) achieves only 32.6% approval vs 44.3% for order 1 — sequence matters chemically, not just operationally
- Previous lot pH at the lower boundary (≤4.6) measurably increases next-batch deviation risk
- Final pH distribution shows a cluster of lots near 4.0, about 0.5 pH units below the 4.5 lower limit, indicating systematic under-buffering
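
A comparison like the addition-order finding above comes down to a grouped approval rate. The sketch below uses toy records, not the project's data; only the column names `addition_order` and `lot_approved` are taken from this README.

```python
import pandas as pd

# Toy records for illustration; these are not the project's batches.
toy = pd.DataFrame({
    "addition_order": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "lot_approved":   [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})

# Approval rate and batch count per addition order.
summary = toy.groupby("addition_order")["lot_approved"].agg(
    approval_rate="mean", n_batches="size"
)

# Gap between order 1 and order 2, in percentage points.
gap_pp = 100 * (summary.loc[1, "approval_rate"] - summary.loc[2, "approval_rate"])
print(summary)
print(f"Order 1 vs order 2 gap: {gap_pp:.1f} pp")
```

Reporting the batch count next to each rate helps judge whether a gap rests on enough lots to be meaningful.
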
Algorithm: Random Forest (300 trees) — sklearn.ensemble.RandomForestClassifier
pH deviation is not a linear problem. High citric acid combined with high buffer produces a different outcome than high acid with low buffer. These interaction effects are the mechanism — and they are exactly what tree-based ensembles are designed to capture.
A single Decision Tree trained on the same data achieves AUC 0.776. The Random Forest (300 trees, each grown on a bootstrap sample of the batches, with a random subset of features considered at each split) reaches AUC 0.908. The 13-point gap is not luck: it is the variance reduction that comes from averaging many decorrelated trees. Individual trees overfit the formulation complexity; the ensemble generalizes.
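
The tree-versus-forest comparison can be reproduced in outline as follows. This is a sketch on synthetic data from `make_classification`, standing in for the batch records, so its AUC values will not match the ones reported here. The 30% test split is inferred from the 450 test batches out of 1,500, and all other settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 1,500 rows, 11 features, roughly 40% positives.
X, y = make_classification(
    n_samples=1500, n_features=11, n_informative=6, n_redundant=0,
    weights=[0.6, 0.4], flip_y=0.05, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# One fully grown tree versus a 300-tree forest on the same split.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

auc_tree = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
auc_forest = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
print(f"Decision Tree AUC: {auc_tree:.3f}")
print(f"Random Forest AUC: {auc_forest:.3f}")
```
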
Why no scaling: Random Forest uses threshold splits, not distances. A concentration in % and a temperature in °C compete on equal footing — no StandardScaler needed.
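
A quick way to check the scaling claim is to fit the same forest on raw and standardized features and compare predictions. The data below is synthetic and the labeling rule is invented for the demonstration; only the feature names come from this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_features(n):
    # Two features on very different scales (ranges are illustrative).
    citric_acid_pct = rng.uniform(0.5, 2.0, n)
    agitation_rpm = rng.uniform(100, 600, n)
    return np.column_stack([citric_acid_pct, agitation_rpm])

X_train, X_new = make_features(600), make_features(300)
# Invented toy rule, used only to give the forest something to learn.
y_train = ((X_train[:, 0] < 1.2) & (X_train[:, 1] > 250)).astype(int)

scaler = StandardScaler().fit(X_train)
raw_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
scaled_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_train), y_train
)

# Fraction of unseen points on which both models give the same class.
agreement = np.mean(
    raw_model.predict(X_new) == scaled_model.predict(scaler.transform(X_new))
)
print(f"Prediction agreement, raw vs scaled: {agreement:.3f}")
```

Standardization is a monotonic rescaling of each feature, so the ordering of values, and therefore the set of possible splits, is unchanged. Agreement should be essentially complete, apart from rare floating-point edge cases.
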
Dual importance measures: Gini (built-in, fast) and Permutation (test-set, rigorous) both computed. When they agree — as they do here — the signal is robust.
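
Computing both measures side by side might look like the sketch below. It runs on synthetic data with an invented rule in which approval depends only on the buffer-to-acid ratio, so the numbers are illustrative and are not the project's results. Note that scikit-learn reports permutation importance as a positive mean drop in score, whereas this README writes it as a negative Δ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1500
features = ["citric_acid_pct", "citrate_buffer_pct", "water_ph",
            "mixing_temp_c", "agitation_rpm"]
X = np.column_stack([
    rng.uniform(0.5, 2.0, n),   # citric_acid_pct
    rng.uniform(0.5, 2.0, n),   # citrate_buffer_pct
    rng.uniform(5.5, 7.0, n),   # water_ph (pure noise in this toy rule)
    rng.uniform(20, 40, n),     # mixing_temp_c (pure noise)
    rng.uniform(100, 600, n),   # agitation_rpm (pure noise)
])
# Invented rule: pass when buffer and acid are roughly balanced.
log_ratio = np.log(X[:, 1] / X[:, 0])
y = (np.abs(log_ratio) + rng.normal(0, 0.05, n) < 0.35).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

# Gini importance: computed from the training data during fitting.
gini = model.feature_importances_
# Permutation importance: mean accuracy drop on the held-out test set.
perm = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=42
).importances_mean

gini_top2 = {features[i] for i in np.argsort(gini)[-2:]}
perm_top2 = {features[i] for i in np.argsort(perm)[-2:]}
for name, g, p in zip(features, gini, perm):
    print(f"{name:20s} gini={g:.3f}  permutation={p:.3f}")
```
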
| Metric | Random Forest | Single Decision Tree |
|---|---|---|
| Test Accuracy | 82.7% | 78.0% |
| ROC-AUC | 0.908 | 0.776 |
| F1 (Approved) | 79.3% | — |
| Recall (Approved) | 82.3% | — |
Confusion matrix (450 test batches):
| | Pred: Fail | Pred: Pass |
|---|---|---|
| Actual: Fail | 223 (TN) ✅ | 46 (FP) |
| Actual: Pass | 32 (FN) | 149 (TP) ✅ |

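
The headline metrics follow directly from these four counts, with "Approved" as the positive class. The arithmetic below reproduces the reported accuracy, recall, and F1. Precision is not listed in the metrics table, so it is derived here.

```python
# Counts taken from the confusion matrix above (positive class = Approved).
tn, fp, fn, tp = 223, 46, 32, 149

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # share of all batches classified correctly
recall = tp / (tp + fn)                 # share of truly approved lots predicted to pass
precision = tp / (tp + fp)              # share of predicted passes that truly passed
f1 = 2 * precision * recall / (precision + recall)

print(f"Test batches: {total}")          # 450
print(f"Accuracy:  {accuracy:.1%}")      # 82.7%
print(f"Recall:    {recall:.1%}")        # 82.3%
print(f"Precision: {precision:.1%}")     # 76.4%
print(f"F1:        {f1:.1%}")            # 79.3%
```
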
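With "Approved" as the positive class, the 32 false negatives are batches the model predicted to fail that actually passed, so each one is an unnecessary intervention. The 46 false positives are the regulatory exposure: batches predicted to pass that actually failed, each a potential deviation event the model didn't flag. Raising the classification threshold above 0.5 reduces FP at the cost of more FN (more unnecessary interventions).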
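
The threshold trade-off can be explored with a small sweep over the predicted probability of approval. The sketch below uses synthetic data from `make_classification` and arbitrary thresholds, so the counts are illustrative only. Class 1 plays the role of "Approved".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the batch records.
X, y = make_classification(
    n_samples=1500, n_features=11, n_informative=6, n_redundant=0,
    weights=[0.6, 0.4], flip_y=0.05, random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=7
)
model = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_train, y_train)
proba_pass = model.predict_proba(X_test)[:, 1]  # probability of class 1 (approved)

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict "pass" only when the probability reaches the threshold.
    pred = (proba_pass >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    results[threshold] = {"fp": int(fp), "fn": int(fn)}
    print(f"threshold={threshold:.1f}  FP={fp:3d}  FN={fn:3d}")
```

A higher threshold makes the model stricter about predicting a pass, so false positives can only fall and false negatives can only rise. Where to set it depends on the relative cost of a missed deviation versus an unnecessary intervention.
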
Both Gini and Permutation importance tell the same story:
| Feature | Gini | Permutation |
|---|---|---|
| `citric_acid_pct` | 26.1% | Δ −23.6% |
| `citrate_buffer_pct` | 21.0% | Δ −17.5% |
| `prev_lot_ph` | 7.5% | Δ −3.2% |
| `agitation_rpm` | 8.1% | Δ −1.9% |
| `mixing_time_min` | 7.2% | Δ −1.2% |
| `water_ph` / temp / rest | combined 30% | Δ <1.5% each |

The formulation insight: citric acid and citrate buffer together account for 47% of Gini importance and dominate permutation importance by a wide margin. The ratio between these two — not their individual values — determines the pH outcome. This is exactly what physical chemistry predicts.
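
The physical chemistry behind the ratio argument is the Henderson-Hasselbalch relation, pH = pKa + log10([base]/[acid]). The sketch below is an idealized single-equilibrium approximation and is not part of the project's model. Citric acid is triprotic (pKa values of roughly 3.13, 4.76, and 6.40 at 25 °C), the function works in molar concentrations rather than the mass percentages used in the dataset, and the function name is hypothetical.

```python
import math

# Second dissociation constant of citric acid at 25 °C (literature value).
PKA2_CITRIC = 4.76

def buffer_ph(base_conc: float, acid_conc: float, pka: float = PKA2_CITRIC) -> float:
    """Henderson-Hasselbalch estimate: pH = pKa + log10([base] / [acid])."""
    return pka + math.log10(base_conc / acid_conc)

# Same ratio at different absolute concentrations gives the same estimate.
print(f"{buffer_ph(0.2, 0.2):.2f}")  # 4.76
print(f"{buffer_ph(2.0, 2.0):.2f}")  # 4.76
# Halving the base-to-acid ratio drops the estimate below the 4.5 lower limit.
print(f"{buffer_ph(1.0, 2.0):.2f}")  # 4.46
```

In this approximation only the ratio enters the formula, which is consistent with the two formulation features dominating both importance measures.
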
The process insight: agitation and mixing time matter, but far less than the recipe. You cannot fix a bad formulation with better process conditions (see Scenario C).
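
Pre-batch scoring of a planned recipe comes down to a `predict_proba` call on a one-row feature frame. The sketch below trains on synthetic data with an invented rule, in which the buffer-to-acid ratio dominates and agitation only widens the acceptance window slightly. It illustrates the workflow and does not reproduce the project's scenarios. The recipe labels and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1500
train = pd.DataFrame({
    "citric_acid_pct": rng.uniform(0.5, 2.0, n),
    "citrate_buffer_pct": rng.uniform(0.5, 2.0, n),
    "agitation_rpm": rng.uniform(100, 600, n),
})
# Invented toy rule (NOT the project's chemistry): a batch passes when buffer
# and acid are roughly balanced; more agitation widens the window a little.
log_ratio = np.log(train["citrate_buffer_pct"] / train["citric_acid_pct"])
window = 0.30 + 0.10 * (train["agitation_rpm"] - 100) / 500
approved = (log_ratio.abs() + rng.normal(0, 0.05, n) < window).astype(int)

model = RandomForestClassifier(n_estimators=300, random_state=1).fit(train, approved)

# Hypothetical planned recipes to score before any material is consumed.
planned = pd.DataFrame(
    [
        {"citric_acid_pct": 1.0, "citrate_buffer_pct": 1.0, "agitation_rpm": 350},
        {"citric_acid_pct": 1.7, "citrate_buffer_pct": 0.6, "agitation_rpm": 350},
        {"citric_acid_pct": 1.7, "citrate_buffer_pct": 0.6, "agitation_rpm": 600},
    ],
    index=["balanced", "high_acid", "high_acid_more_agitation"],
)
planned["p_approved"] = model.predict_proba(planned[train.columns])[:, 1]
print(planned)
```

Under this toy rule, raising agitation on the high-acid recipe leaves the approval probability low, which mirrors the point that process settings cannot compensate for an unbalanced formulation.
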
PH_Adjustment_Syrup/
├── 06_RF_PH_Adjustment.ipynb # Notebook (no outputs)
├── rf_raw_data.csv # Sample dataset (250 rows)
├── README.md
└── requirements.txt
📦 Full Project Pack: complete dataset (1,500 batches), notebook with full outputs including Gini vs Permutation comparison charts, presentation deck (PPTX + PDF), and `app.py` pre-batch risk simulator, available on Gumroad.
Option 1 — Google Colab: Click the badge above.
Option 2 — Local:

    pip install -r requirements.txt
    jupyter notebook 06_RF_PH_Adjustment.ipynb

- Ensemble beats single tree by design, not by chance: the jump from AUC 0.776 (DT) to 0.908 (RF) demonstrates the mechanics of bagging. Many imperfect trees, each seeing different data, produce a collective vote that generalizes where any individual tree would overfit.
- Two importance measures that agree give high confidence — Gini and Permutation use different computational approaches and answer slightly different questions. When both name the same top features, the signal is real, not an artifact of either method's bias.
- Process corrections cannot fix a formulation root cause — Scenario C shows this clearly: increasing buffer and agitation on a 1.7% citric acid formula lowers approval probability from 11.7% to 1.0%. The acid concentration must be addressed at the recipe level.
- The previous lot effect is underappreciated: `prev_lot_ph` carries 7.5% Gini importance. Line conditioning effects are real in liquid manufacturing, and they're rarely captured in traditional process control. This model surfaces them.
- Addition order is chemistry, not just procedure: the 11.7 percentage-point approval rate difference between order 1 and order 2 isn't an operational preference; it reflects real differences in acid-buffer equilibration kinetics. The model learned what the chemistry textbook says.

Luis Lozano | Operational Excellence Manager · Master Black Belt · Machine Learning
GitHub: LozanoLsa · Gumroad: lozanolsa.gumroad.com
Turning Operations into Predictive Systems — Clone it. Fork it. Improve it.