This repository implements a pipeline to predict pull request (PR) merge time (hours) from GitHub metadata and code-change signals. The pipeline is split into extraction, analysis/feature engineering, training, and inference components so each stage can be iterated independently.
The system is organized into modular components that handle data ingestion, preprocessing, training, and serving.
Collects raw Pull Request (PR) data from GitHub and produces structured data.
See: github_etl
Coordinates large-scale PR extraction in controlled batches.
Manages pagination, retry logic, and incremental data collection.
Retrieves raw PR data from the GitHub API.
Extracts and enriches the data with additional features required for modeling.
- Commits Enricher → commit count, unique contributors
- Files Enricher → additions, deletions, code churn
- Metadata Enricher → title length, label count, etc.
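As an illustration, a files enricher along these lines can derive the size signals above. This is a minimal sketch: the field names follow the GitHub REST API, and the actual enricher classes in github_etl may be structured differently.

```python
# Illustrative "Files Enricher": derives file count, additions/deletions,
# and code churn from a PR's file list. Field names follow the GitHub
# REST API; the real enrichers in github_etl may differ.
def enrich_files(pr: dict, files: list[dict]) -> dict:
    additions = sum(f.get("additions", 0) for f in files)
    deletions = sum(f.get("deletions", 0) for f in files)
    return {
        **pr,
        "file_count": len(files),
        "line_additions": additions,
        "line_deletions": deletions,
        "code_churn": additions + deletions,  # total lines touched
    }
```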
Transforms raw GitHub JSON responses into a structured tabular format and derives additional modeling features through feature engineering on existing fields.
- Adds derived features such as hour of day, weekday, is_weekend, etc.
- Flattens nested objects
- Normalizes timestamps
- Standardizes column types
- Handles missing values
Persists transformed and enriched PR batches to Cloud Storage.
Provides logging, execution tracing, and pipeline metrics for monitoring reliability and performance.
Prepares datasets for model training by cleaning data, engineering labels, and generating train/validation/test splits.
See: data_analysis
Computes:
- Feature correlations
- Mutual information
- Distribution statistics
Used for feature evaluation and exploratory analysis.
Identifies and removes extreme PR cases using:
- Feature-based grouping
- Median-based filtering
- Train-only cleaning strategy
This prevents skewed learning while preserving validation integrity.
Performs chronological data splitting:
- 70% Train
- 15% Validation
- 15% Test
Ensures no temporal leakage between datasets.
Stores cleaned and split datasets back to Cloud Storage.
Performs hyperparameter search and model selection using XGBoost.
Training entrypoint: training/train.py
Key capabilities:
- Hyperparameter tuning
- Early stopping
- Validation monitoring
- Model selection based on MAE and P90 error
The best-performing model artifact is saved to Cloud Storage.
FastAPI-based service for real-time predictions.
Inference entrypoint: inference/main.py
Responsibilities:
- Loads trained model from Cloud Storage
- Applies consistent feature processing
- Returns predicted PR merge time
Designed for low-latency, production-ready inference.
Shared helpers for cloud I/O and common functionality.
See: utils
Acts as the central storage backbone for the system.
Stores:
- Processed datasets
- Train/Validation/Test splits
- Feature configurations
- Trained model artifacts
- Hyperparameter metadata
This design ensures reproducibility, version control, and complete separation between training and inference.
A high-level diagram of the implemented solution is available at architecture/architecture.png and embedded below.
- commit_count: Number of commits in the PR. Rationale: more commits usually indicate more work/complexity, often increasing time-to-merge.
- file_count / files_modified / files_added / files_deleted / files_removed: PR surface area. Rationale: larger PRs touching more files are harder to review and more likely to take longer.
- line_additions / line_deletions / code_churn: Magnitude of code change. Rationale: higher churn tends to correlate with more review time and integration effort.
- unique_file_types: Diversity of file types touched (e.g., .js, .md). Rationale: cross-language or mixed changes often require more reviewers and more context switching.
- unique_commit_authors / assignee_count: Number of authors/assignees. Rationale: many contributors or reviewers can slow decisions down or speed them up depending on coordination; a useful pattern for the model to learn.
- avg_time_since_last_mod_days: Recency of touched files. Rationale: recently modified files are more familiar to reviewers; older files might require more context and take longer.
- title_length / body_length / label_count: Textual metadata signals. Rationale: more descriptive PRs (longer body) often correlate with smoother reviews; labels encode intent/priority.
- hour / weekday / is_weekend / is_us_holiday: Temporal signals. Rationale: submission time affects reviewer availability and response latency.
Notes: exact feature lists used in experiments are enumerated in training/train.py.
- Schema validation & type casting: Load JSON data with a strict Polars schema, ensuring numeric and boolean columns are cast correctly (e.g., `is_weekend`, `is_us_holiday` to `Int8`). Value: prevents type mismatches and ensures consistent data representation.
- Chronological train/val/test split: Sort by the `created_at` timestamp, then split 70% train, 15% val, 15% test chronologically. Value: prevents temporal leakage where future PRs leak information into past model training.
- Feature importance ranking: Compute Pearson correlation and mutual information (MI) between each feature and the target. Combine scores (60% correlation + 40% MI) to identify the top 4 features most aligned with merge time. Value: focuses outlier detection on the features that matter most for prediction.
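The combined ranking can be sketched as follows. The function and variable names are assumptions for illustration, not the project's actual API.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Score each feature as 0.6 * |Pearson correlation| + 0.4 * max-normalized
# mutual information, and keep the top k.
def rank_features(X: np.ndarray, y: np.ndarray, names: list[str], k: int = 4) -> list[str]:
    corr = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
    mi = mutual_info_regression(X, y, random_state=0)
    if mi.max() > 0:
        mi = mi / mi.max()  # scale MI into [0, 1] so the weights are comparable
    score = 0.6 * corr + 0.4 * mi
    return [names[i] for i in np.argsort(score)[::-1][:k]]
```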
- Outlier detection via Freedman-Diaconis rule: Suggest a bin width for each top feature using the formula `bin_width = (2 * IQR) / n^(1/3)`, clamping to 5–50 bins. Value: adapts bin size to the data distribution; prevents over-fragmentation or under-binning.
- Group-based outlier removal: Bin each top feature, group rows by bin combination, and compute the median target per group. Remove rows where the target exceeds 3× the group median (unless the group is small, <5 rows). Value: removes extreme outliers while preserving typical PR dynamics within each feature region; respects domain variance.
- Null-safe filtering: Drop null values before computing statistics or applying filters. Value: avoids NaN propagation in correlations and mutual information calculations.
- Train/val/test preservation: Keep validation and test sets unfiltered; apply outlier removal only to training set. Value: training sees a cleaned distribution; evaluation remains representative of real-world data.
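For a single feature, these cleaning steps can be sketched as below (an illustration of the rules stated above; the real implementation bins combinations of the top features):

```python
import numpy as np

# Freedman-Diaconis bin count, clamped to 5-50 bins as described above.
def fd_bin_count(x: np.ndarray) -> int:
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2 * iqr / len(x) ** (1 / 3)
    if width <= 0:
        return 5
    bins = int(np.ceil((x.max() - x.min()) / width))
    return int(np.clip(bins, 5, 50))

# Keep mask: drop rows whose target exceeds 3x the median of their bin,
# leaving groups with fewer than 5 rows untouched.
def outlier_mask(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    edges = np.histogram_bin_edges(x, bins=fd_bin_count(x))
    groups = np.digitize(x, edges)
    keep = np.ones(len(y), dtype=bool)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        if len(idx) < 5:
            continue  # small group: keep all rows
        median = np.median(y[idx])
        keep[idx[y[idx] > 3 * median]] = False
    return keep
```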
- Model choice — XGBoost (gradient-boosted trees): The training code uses `xgboost.XGBRegressor` (training/train.py). Rationale: tree ensembles handle heterogeneous numeric and categorical features without heavy normalization, capture non-linear interactions common in software metrics, scale well with tabular data, and are fast for grid search and inference.
- Objective & eval metric: Training uses an absolute-error-oriented objective and validation MAE (mean absolute error) for model selection. Rationale: MAE directly corresponds to average hours of error, which is easy to interpret (e.g., predicted vs. actual hours). MAE is robust to outliers compared with MSE/RMSE when the domain contains heavy tails.
- Additional robustness metrics: The pipeline reports median absolute error, RMSE, and P90 error. Rationale: median AE reduces the influence of outliers; P90 captures large-error behavior, which is important for SLA-like guarantees.
- Hyperparameter search: Parameter grid search over learning rate, n_estimators, max_depth, subsample, and colsample_bytree. Rationale: these control bias/variance and effective model capacity for tabular data.
- Service: A FastAPI app exposes `/predict` and `/health` endpoints in inference/main.py. It loads a trained XGBoost model during startup and uses a `predictor` abstraction attached to `app.state`.
- Feature extraction: Given a PR link, the service fetches PR metadata and computes the same derived features used in training (feature parity is essential). Value: ensures the model's input distribution matches training.
- Preprocessing at inference: The service applies the same log1p transforms to skewed columns and the same feature ordering as training. Value: consistent scaling prevents unexpected prediction shifts.
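This parity requirement can be sketched as below. `SKEWED` and `FEATURE_ORDER` are illustrative placeholders; the real lists come from the training pipeline's feature configuration.

```python
import numpy as np

# log1p the same skewed columns as training and emit values in the
# training column order, so the model sees a matching input layout.
SKEWED = ["code_churn", "line_additions", "line_deletions"]
FEATURE_ORDER = ["commit_count", "code_churn", "line_additions", "line_deletions"]

def prepare_row(raw: dict) -> np.ndarray:
    feats = {k: (np.log1p(v) if k in SKEWED else v) for k, v in raw.items()}
    return np.array([[feats[name] for name in FEATURE_ORDER]], dtype=np.float32)
```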
The /predict endpoint accepts a POST request with a JSON body containing:
Request Body:

```json
{
  "pr_link": "https://github.com/owner/repo/pull/123"
}
```

Parameters:

- `pr_link` (string, required): A full GitHub pull request URL. The service parses this URL to extract the repository owner, name, and PR number, then fetches metadata from GitHub's API to compute features for the model. Example: `https://github.com/excalidraw/excalidraw/pull/5678`
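The URL parsing step can be sketched with a small regex helper (an illustration; the service's actual parsing code may differ):

```python
import re

# Match https://github.com/<owner>/<repo>/pull/<number> and extract the parts
# needed for the GitHub API call.
PR_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")

def parse_pr_link(url: str) -> tuple[str, str, int]:
    match = PR_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a GitHub pull request URL: {url!r}")
    owner, repo, number = match.groups()
    return owner, repo, int(number)
```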
Response: The endpoint returns a JSON object with the predicted merge time, SHAP feature contributions, and model explanation:
```json
{
  "predicted_merge_time_hours": 8.63,
  "log_prediction_value": 2.26,
  "top_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    {
      "feature": "commit_count",
      "feature_value": 0.69,
      "shap_contribution": -0.46
    }
  ],
  "all_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    ...
  ]
}
```

Response Fields:
- `predicted_merge_time_hours`: Predicted merge time in hours (converted from log space).
- `log_prediction_value`: Raw model prediction in log space (before exponentiation).
- `top_feature_impacts`: Array of the 4 most impactful features. Each entry contains:
  - `feature`: Feature name.
  - `feature_value`: The actual value of this feature for the PR (log-transformed if applicable).
  - `shap_contribution`: SHAP value indicating how much this feature pushed the prediction up (positive) or down (negative) from the baseline.
- `all_feature_impacts`: Complete list of all 19 features and their SHAP contributions, sorted by magnitude of impact.
- SHAP overview: SHAP (SHapley Additive exPlanations) assigns each feature an importance value for an individual prediction using principles from cooperative game theory. The model output is decomposed into a baseline value plus the contribution of each feature.
- Interpretation: A positive SHAP value indicates that the feature increases the predicted merge time relative to the baseline. A negative value indicates that the feature decreases it. The absolute magnitude represents the strength of the feature’s contribution.
- Global vs. Local Explanations:
- Global explanations are obtained by aggregating SHAP values across the dataset to determine overall feature importance.
- Local explanations analyze SHAP values for a single PR to understand the factors influencing that specific prediction.
- Practical Application: SHAP is used to validate feature behavior, identify potential model biases or unintended dependencies, and provide transparent reasoning behind predictions.
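The additive decomposition can be made concrete with a small helper mirroring the response fields above. This is a sketch: it assumes a log1p target transform (consistent with the example values shown), and the function name is illustrative.

```python
import numpy as np

# The model's log-space output equals the SHAP baseline plus the sum of all
# per-feature contributions; top impacts are simply the largest
# contributions by absolute magnitude.
def summarize_impacts(baseline: float, names: list[str], contributions: np.ndarray, k: int = 4):
    log_prediction = baseline + float(contributions.sum())
    order = np.argsort(np.abs(contributions))[::-1][:k]
    top = [
        {"feature": names[i], "shap_contribution": float(contributions[i])}
        for i in order
    ]
    hours = float(np.expm1(log_prediction))  # invert the assumed log1p transform
    return hours, top
```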
Simple run steps
- Install requirements:

```shell
python3 -m pip install -r requirements.txt
```

- Add your GitHub PAT token and GCP credentials path
  - Put all values into a `.env` file. Use `example.env` as a reference.
  - Example entries you should set in `.env` (replace values):

```shell
GITHUB_PAT=your_github_pat_here
GOOGLE_APPLICATION_CREDENTIALS=/full/path/to/your-gcp-creds.json
GCS_BUCKET_NAME=predict-pr-merge-datasets
```

- First, run the ETL pipeline:

```shell
python -m github_etl.run_etl
```

  - Note down the ingestion job id printed by the ETL run.
- Analysis
  - Modify `data_analysis/config.py` with the required values. Set the analysis job id to the same ingestion job id you noted.

```shell
python -m data_analysis.analysis
```

- Training
  - Edit `training/config.py` as needed.

```shell
python -m training.train
```

  - Note down the training job id printed by the training run.
- Inference
  - Put the training job id in `inference/config.py` (set `TRAINING_JOB_ID`) and run:

```shell
python -m inference.main
```

- Single-repo design: No cross-repo transfer learning; cannot generalize to new repos
- No feature store: Features recomputed per pipeline; no versioning or reuse
- Hardcoded training job ID: Manual deployment to switch models; no version control
- No distributed compute: ThreadPoolExecutor (GIL-bound); single-machine constraint
- No data validation: Missing schema checks; corrupted data undetected
- Limited null handling: No explicit imputation strategy; inference fails with missing data
- Outlier removal train-only: Test set not cleaned; train/test distribution mismatch
- No feature drift monitoring: Silent drift; model degrades unnoticed
- Manual feature engineering: Hardcoded features; no automated relevance check
- Grid search only: No Bayesian optimization; expensive tuning
- Single model: No ensemble; no fallback if model fails
- No uncertainty quantification: Point predictions only; no confidence intervals
- Limited metrics: MAE-focused; missing P50/P90 percentile tracking
- No automated retraining: Manual trigger; models stale after drift
- SHAP latency: TreeExplainer on-the-fly adds 100-500ms (2-5x slower)
- No request validation: URL not sanitized; accepts malformed input
- No caching: Redundant GitHub API calls; wastes quota
- No rate limiting/auth: Vulnerable to DoS; unauthenticated access
- No batch endpoint: Single PR only; inefficient for bulk scoring
- Single model instance: No fallback; service dies if model loads fail
- No feature validation: NaN/infinite values pass silently
- Log-space predictions: Tail behavior underrepresented; error opacity
- Missing features: No reviewer signals, repo maturity, code review velocity, sentiment
- No quantile regression: Cannot generate prediction intervals
- No ensemble: Single model type; limited robustness
- Feature Store (Feast): Version features; decouple from training/inference
- Distributed ETL: Kafka + Spark for incremental CDC; process only new PRs
- Model Registry (MLflow): Version control, rollback, lineage
- Multi-repo Support: Transfer learning or meta-features for generalization
- Config Externalization: `.env` + Secret Manager; remove hardcoded IDs
- Data Validation (Great Expectations): Schema, null rates, value ranges
- Feature Drift Monitoring: Monthly KS test; alert if p < 0.05
- Null Imputation Strategy: Explicit policy (median/mode); document for inference
- Outlier Detection: Apply to all splits (train/val/test) for consistency
- Automated Feature Ranking: Remove low-signal features pre-training
- Bayesian Optimization (Optuna): Replace grid search; faster convergence
- Quantile Regression: Predict 25th/50th/90th percentiles for intervals
- Ensemble Methods: XGBoost + LightGBM + Ridge for robustness
- Auto-Retraining: Weekly trigger or on drift; auto-deploy if validation improves
- Cross-Validation: Stratified splits; prevent temporal leakage
- SHAP Optimization: Pre-compute baseline + LIME; reduce to <50ms latency
- Request Validation: Pydantic schema with regex URL check; fail fast
- Redis Caching: Cache PR metadata (1hr TTL); reduce API calls 50%+
- Rate Limiting + Auth: JWT tokens, RBAC with `slowapi`
- Batch Endpoint: `/predict_batch` with vectorized inference; handles thousands of PRs
- Feature Validation: Schema checks with fallback imputation
- Health Checks: Extended endpoint (model, config, GCS, GitHub API)
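The drift check proposed in the list above could be sketched as follows (an illustration of the suggested monthly KS test; the function name is an assumption):

```python
import numpy as np
from scipy.stats import ks_2samp

# Two-sample Kolmogorov-Smirnov test between training-time and recent
# feature values, flagging drift at p < 0.05 as suggested above.
def feature_drifted(train_values: np.ndarray, recent_values: np.ndarray, alpha: float = 0.05) -> bool:
    _statistic, p_value = ks_2samp(train_values, recent_values)
    return bool(p_value < alpha)
```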
The model was trained and evaluated on two repositories: VSCode and Excalidraw.
All metrics are computed on the held-out test sets.
- Train: 13,633 samples
- Validation: 3,849 samples
- Test: 3,851 samples
| Metric | Value | Description |
|---|---|---|
| MAE | 22.59 hours | Mean absolute error |
| Median Absolute Error | 0.60 hours | 50% of predictions are within ~36 minutes |
| RMSE | 104.19 hours | Penalizes large deviations more heavily |
| P90 Error | 32.19 hours | 90% of predictions are within ~32 hours |
| Error Range | Count |
|---|---|
| ≤ 1 hour | 2215 |
| 1–2 hours | 330 |
| 2–6 hours | 399 |
| 6–12 hours | 200 |
| 12–24 hours | 266 |
| > 24 hours | 441 |
- The very low median error (0.60 hours) indicates strong performance on typical PRs.
- Most predictions fall within 1 hour of the true merge time.
- The higher MAE and RMSE compared to the median suggest a small number of larger deviations.
- Overall performance is stable, with errors primarily concentrated in longer-running PRs.
- Train: 2,096 samples
- Validation: 572 samples
- Test: 574 samples
| Metric | Value | Description |
|---|---|---|
| MAE | 178.81 hours | Mean absolute error |
| Median Absolute Error | 11.58 hours | 50% of predictions are within ~12 hours |
| RMSE | 707.73 hours | Strongly influenced by large deviations |
| P90 Error | 382.60 hours | 90% of predictions are within ~383 hours |
| Error Range | Count |
|---|---|
| ≤ 1 hour | 38 |
| 1–2 hours | 67 |
| 2–6 hours | 122 |
| 6–12 hours | 64 |
| 12–24 hours | 52 |
| > 24 hours | 231 |
- Median error remains moderate, but overall error metrics are significantly higher.
- A substantial portion of predictions deviate by more than 24 hours.
- The large gap between median error and MAE/RMSE indicates a heavy-tailed error distribution.
- Model stability is lower compared to VSCode, likely due to dataset size and higher variance in merge patterns.
- The model performs significantly better on VSCode, likely due to larger training data and more consistent PR behavior.
- Excalidraw shows higher variance and heavier tail errors.
- Performance scales positively with dataset size and pattern stability.
