PredictPRMerge

Project Overview

This repository implements a pipeline to predict pull request (PR) merge time (hours) from GitHub metadata and code-change signals. The pipeline is split into extraction, analysis/feature engineering, training, and inference components so each stage can be iterated independently.

System Architecture

The system is organized into modular components that handle data ingestion, preprocessing, training, and serving.


ETL Layer

Collects raw Pull Request (PR) data from GitHub and produces structured data.
See: github_etl

Batch ETL Pipeline

Coordinates large-scale PR extraction in controlled batches.
Manages pagination, retry logic, and incremental data collection.

Extractor

Retrieves raw PR data from the GitHub API.

Enricher

Enriches extracted PR data with additional features required for modeling.

  • Commits Enricher → commit count, unique contributors
  • Files Enricher → additions, deletions, code churn
  • Metadata Enricher → title length, label count, etc.

Transformer

Transforms raw GitHub JSON responses into a structured tabular format and derives additional modeling features through feature engineering on existing fields.

  • Adds derived features such as hour of day, weekday, is_weekend, etc.
  • Flattens nested objects
  • Normalizes timestamps
  • Standardizes column types
  • Handles missing values
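
As a hedged sketch (the actual implementation lives in the transformer module; the function name and field names here are illustrative), deriving the temporal features from a PR's created_at timestamp might look like:

```python
from datetime import datetime

def derive_temporal_features(created_at: str) -> dict:
    """Derive hour/weekday/is_weekend from an ISO-8601 PR timestamp."""
    ts = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return {
        "hour": ts.hour,                       # 0-23, submission hour
        "weekday": ts.weekday(),               # 0=Monday .. 6=Sunday
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }

features = derive_temporal_features("2024-03-16T22:05:00Z")
# A Saturday evening submission: hour=22, weekday=5, is_weekend=1
```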

Loader

Persists each transformed and enriched PR batch to Cloud Storage.

Observability (Tracer + Stats)

Provides logging, execution tracing, and pipeline metrics for monitoring reliability and performance.


Data Analysis / Preprocessing Layer

Prepares datasets for model training by cleaning data, engineering labels, and generating train/validation/test splits.
See: data_analysis

Statistical Summary Generator

Computes:

  • Feature correlations
  • Mutual information
  • Distribution statistics

Used for feature evaluation and exploratory analysis.

Outlier Detection & Removal

Identifies and removes extreme PR cases using:

  • Feature-based grouping
  • Median-based filtering
  • Train-only cleaning strategy

This prevents skewed learning while preserving validation integrity.

Dataset Splitter

Performs chronological data splitting:

  • 70% Train
  • 15% Validation
  • 15% Test

Ensures no temporal leakage between datasets.
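
A minimal sketch of the chronological split (assuming records are already parsed into dictionaries; the real pipeline operates on Polars dataframes):

```python
def chronological_split(rows, time_key="created_at"):
    """Sort by creation time, then cut 70/15/15 without shuffling,
    so every validation/test PR is newer than every training PR."""
    rows = sorted(rows, key=lambda r: r[time_key])
    n = len(rows)
    train_end = int(n * 0.70)
    val_end = train_end + int(n * 0.15)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]

data = [{"created_at": f"2024-01-{d:02d}"} for d in range(1, 21)]
train, val, test = chronological_split(data)
# 20 rows -> 14 train, 3 validation, 3 test, in time order
```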

Loader

Stores cleaned and split datasets back to Cloud Storage.


Training Layer

Performs hyperparameter search and model selection using XGBoost.
Training entrypoint: training/train.py

Key capabilities:

  • Hyperparameter tuning
  • Early stopping
  • Validation monitoring
  • Model selection based on MAE and P90 error

The best-performing model artifact is saved to Cloud Storage.
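
The search-and-select loop can be sketched as follows. A stand-in `train_and_score` replaces the real XGBoost fit so the selection logic is runnable; the parameter names mirror common XGBoost knobs, but the actual grid lives in training/train.py:

```python
from itertools import product

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6],
}

def train_and_score(params):
    """Stand-in for fitting XGBoost with early stopping and
    returning validation MAE (a toy score here)."""
    return params["learning_rate"] * 10 + params["max_depth"]

best_params, best_mae = None, float("inf")
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    mae = train_and_score(params)
    if mae < best_mae:  # keep the config with the lowest validation MAE
        best_params, best_mae = params, mae
```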


Inference / Serving Layer

FastAPI-based service for real-time predictions.
Inference entrypoint: inference/main.py

Responsibilities:

  • Loads trained model from Cloud Storage
  • Applies consistent feature processing
  • Returns predicted PR merge time

Designed for low-latency, production-ready inference.


Utilities Layer

Shared helpers for cloud I/O and common functionality.
See: utils


Cloud Storage

Acts as the central storage backbone for the system.

Stores:

  • Processed datasets
  • Train/Validation/Test splits
  • Feature configurations
  • Trained model artifacts
  • Hyperparameter metadata

This design ensures reproducibility, version control, and complete separation between training and inference. A high-level diagram of the implemented solution is available at architecture/architecture.png and embedded below.

Architecture Diagram

Feature Engineering — Why each feature was added

  • commit_count: Number of commits in the PR. Rationale: more commits usually indicate more work/complexity, often increasing time-to-merge.
  • file_count / files_modified / files_added / files_deleted / files_removed: PR surface area. Rationale: larger PRs touching more files are harder to review and more likely to take longer.
  • line_additions / line_deletions / code_churn: Magnitude of code change. Rationale: higher churn tends to correlate with more review time and integration effort.
  • unique_file_types: Diversity of file types touched (e.g., .js, .md). Rationale: cross-language or mixed changes often require more reviewers and context switching.
  • unique_commit_authors / assignee_count: Number of authors/assignees. Rationale: many contributors or reviewers can slow decisions or speed them up depending on coordination; useful patterns for the model to learn.
  • avg_time_since_last_mod_days: Recency of touched files. Rationale: recently modified files are more familiar to reviewers; older files might require more context and take longer.
  • title_length / body_length / label_count: Textual metadata signals. Rationale: more descriptive PRs (longer body) often correlate with smoother reviews; labels encode intent/priority.
  • hour / weekday / is_weekend / is_us_holiday: Temporal signals. Rationale: submission time affects reviewer availability and response latency.
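
For illustration (field names assumed from typical GitHub API per-file entries; the real enrichers may differ), the size and diversity features above could be derived like:

```python
import os

def file_change_features(files: list[dict]) -> dict:
    """Aggregate per-file GitHub diff stats into PR-level features."""
    additions = sum(f["additions"] for f in files)
    deletions = sum(f["deletions"] for f in files)
    extensions = {os.path.splitext(f["filename"])[1] for f in files}
    return {
        "file_count": len(files),
        "line_additions": additions,
        "line_deletions": deletions,
        "code_churn": additions + deletions,   # total lines touched
        "unique_file_types": len(extensions),  # e.g., {".js", ".md"}
    }

files = [
    {"filename": "src/app.js", "additions": 40, "deletions": 5},
    {"filename": "README.md", "additions": 3, "deletions": 0},
]
stats = file_change_features(files)
# file_count=2, code_churn=48, unique_file_types=2
```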

Notes: exact feature lists used in experiments are enumerated in training/train.py.

Data Analysis — What each cleaning step provides

  • Schema validation & type casting: Load JSON data with strict Polars schema, ensuring numeric and boolean columns are cast correctly (e.g., is_weekend, is_us_holiday to Int8). Value: prevents type mismatches and ensures consistent data representation.
  • Chronological train/val/test split: Sort by created_at timestamp, then split 70% train, 15% val, 15% test chronologically. Value: prevents temporal leakage where future PRs leak information into past model training.
  • Feature importance ranking: Compute Pearson correlation and mutual information (MI) between each feature and target. Combine scores (60% correlation + 40% MI) to identify top 4 features most aligned with merge time. Value: focuses outlier detection on the features that matter most for prediction.
  • Outlier detection via Freedman-Diaconis rule: Suggest bin width for each top feature using the formula bin_width = (2 * IQR) / n^(1/3), clamping to 5–50 bins. Value: adapts bin size to data distribution; prevents over-fragmentation or under-binning.
  • Group-based outlier removal: Bin each top feature, group rows by bin combination, compute median target per group. Remove rows where target > 3× group median (unless group is small, <5 rows). Value: removes extreme outliers while preserving typical PR dynamics within each feature region; respects domain variance.
  • Null-safe filtering: Drop null values before computing statistics or applying filters. Value: avoids NaN propagation in correlations and mutual information calculations.
  • Train/val/test preservation: Keep validation and test sets unfiltered; apply outlier removal only to training set. Value: training sees a cleaned distribution; evaluation remains representative of real-world data.
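
The binning and group-based filtering steps can be sketched in pure Python (a simplified, single-feature version of the pipeline's logic; the real code bins combinations of the top features):

```python
import statistics

def fd_bin_count(values, min_bins=5, max_bins=50):
    """Freedman-Diaconis: bin_width = 2*IQR / n^(1/3), clamped to 5-50 bins."""
    q = statistics.quantiles(values, n=4)
    iqr = q[2] - q[0]
    width = 2 * iqr / (len(values) ** (1 / 3))
    if width <= 0:
        return min_bins
    bins = round((max(values) - min(values)) / width)
    return max(min_bins, min(max_bins, bins))

def remove_group_outliers(rows, key, target, factor=3.0, min_group=5):
    """Drop rows whose target exceeds factor x their group's median target;
    small groups (< min_group rows) are kept untouched."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    kept = []
    for members in groups.values():
        if len(members) < min_group:
            kept.extend(members)   # too small to estimate a reliable median
            continue
        med = statistics.median(m[target] for m in members)
        kept.extend(m for m in members if m[target] <= factor * med)
    return kept

rows = [{"bin": 0, "merge_hours": h} for h in [2, 3, 4, 5, 100]]
rows += [{"bin": 1, "merge_hours": 50}]
cleaned = remove_group_outliers(rows, key="bin", target="merge_hours")
# Group 0 (median 4) drops the 100-hour outlier; the singleton group is kept.
```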

Training — Why this model and metric

  • Model choice — XGBoost (gradient-boosted trees): The training code uses xgboost.XGBRegressor (training/train.py). Rationale: tree ensembles handle heterogeneous numeric and categorical features without heavy normalization, capture non-linear interactions common in software metrics, scale well with tabular data, and are fast for grid search and inference.
  • Objective & eval metric: Training uses absolute-error oriented objective and validation MAE (mean absolute error) for model selection. Rationale: MAE directly corresponds to average hours error which is easy to interpret (e.g., predicted vs actual hours). MAE is robust to outliers compared with MSE/RMSE when the domain contains heavy tails.
  • Additional robustness metrics: The pipeline reports median absolute error, RMSE, and P90 error. Rationale: median AE reduces influence of outliers; P90 indicates large-error behavior which is important for SLA-like guarantees.
  • Hyperparameter search: Parameter grid search over learning rate, n_estimators, max_depth, subsample, and colsample_bytree. Rationale: these control bias/variance and effective model capacity for tabular data.
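
The reported metrics can be computed from held-out predictions as follows (a self-contained sketch; the pipeline's exact implementation may differ):

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """MAE, median absolute error, RMSE, and P90 absolute error (in hours)."""
    errs = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    n = len(errs)
    return {
        "mae": sum(errs) / n,
        "median_ae": statistics.median(errs),
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
        "p90": errs[min(n - 1, math.ceil(0.9 * n) - 1)],  # 90th-percentile error
    }

m = regression_metrics([1, 2, 3, 4, 100], [1, 2, 3, 4, 10])
# One large miss dominates MAE and RMSE while the median stays at 0,
# the same heavy-tailed pattern the evaluation sections describe.
```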

Inference — What happens at prediction time

  • Service: A FastAPI app exposes /predict and /health endpoints in inference/main.py. It loads a trained XGBoost model during startup and uses a predictor abstraction attached to app.state.
  • Feature extraction: Given a PR link, the service fetches PR metadata and computes the same derived features used in training (feature parity is essential). Value: ensures model input distribution matches training.
  • Preprocessing at inference: The service applies the same log1p transforms and feature ordering as training for skewed columns. Value: consistent scaling prevents unexpected prediction shifts.
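
As a hedged sketch (function names are illustrative; the real transforms live in the training and inference code), the log-space round trip pairs log1p at training time with expm1 at serving time:

```python
import math

def to_log_space(hours: float) -> float:
    """Training-side transform for the skewed merge-time target."""
    return math.log1p(hours)

def from_log_space(log_pred: float) -> float:
    """Inference-side inverse: convert a log-space prediction back to hours."""
    return math.expm1(log_pred)

log_pred = to_log_space(8.63)     # ~2.265 in log space
hours = from_log_space(log_pred)  # round-trips back to 8.63 hours
```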

/predict Endpoint — Input specification

The /predict endpoint accepts a POST request with a JSON body containing:

Request Body:

{
  "pr_link": "https://github.com/owner/repo/pull/123"
}

Parameters:

  • pr_link (string, required): A full GitHub pull request URL. The service parses this URL to extract the repository owner, name, and PR number, then fetches metadata from GitHub's API to compute features for the model. Example: https://github.com/excalidraw/excalidraw/pull/5678
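
As an illustrative sketch (the pattern and function name are assumptions, not the service's actual code), the URL parsing could look like:

```python
import re

PR_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)/?$"
)

def parse_pr_link(pr_link: str) -> tuple[str, str, int]:
    """Extract owner, repo, and PR number; raise on malformed input."""
    m = PR_URL.match(pr_link)
    if m is None:
        raise ValueError(f"not a GitHub PR URL: {pr_link!r}")
    return m["owner"], m["repo"], int(m["number"])

owner, repo, number = parse_pr_link(
    "https://github.com/excalidraw/excalidraw/pull/5678"
)
# -> ("excalidraw", "excalidraw", 5678)
```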

Response: The endpoint returns a JSON object with the predicted merge time, SHAP feature contributions, and model explanation:

{
  "predicted_merge_time_hours": 8.63,
  "log_prediction_value": 2.26,
  "top_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    {
      "feature": "commit_count",
      "feature_value": 0.69,
      "shap_contribution": -0.46
    }
  ],
  "all_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    ...
  ]
}

Response Fields:

  • predicted_merge_time_hours: Predicted merge time in hours (converted from log space).
  • log_prediction_value: Raw model prediction in log space (before exponentiation).
  • top_feature_impacts: Array of the 4 most impactful features. Each entry contains:
    • feature: Feature name.
    • feature_value: The actual value of this feature for the PR (log-transformed if applicable).
    • shap_contribution: SHAP value indicating how much this feature pushed the prediction up (positive) or down (negative) from the baseline.
  • all_feature_impacts: Complete list of all 19 features and their SHAP contributions, sorted by magnitude of impact.

Model Explainability (SHAP)

  • SHAP overview: SHAP (SHapley Additive exPlanations) assigns each feature an importance value for an individual prediction using principles from cooperative game theory. The model output is decomposed into a baseline value plus the contribution of each feature.
  • Interpretation: A positive SHAP value indicates that the feature increases the predicted merge time relative to the baseline. A negative value indicates that the feature decreases it. The absolute magnitude represents the strength of the feature’s contribution.
  • Global vs. Local Explanations:
    • Global explanations are obtained by aggregating SHAP values across the dataset to determine overall feature importance.
    • Local explanations analyze SHAP values for a single PR to understand the factors influencing that specific prediction.
  • Practical Application: SHAP is used to validate feature behavior, identify potential model biases or unintended dependencies, and provide transparent reasoning behind predictions.
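
The additive decomposition can be checked directly from a response: the baseline plus all SHAP contributions should reconstruct the raw log-space prediction. The values below are illustrative, not taken from a real response:

```python
def reconstruct_prediction(baseline: float, impacts: list[dict]) -> float:
    """SHAP additivity: model output = baseline + sum of per-feature contributions."""
    return baseline + sum(item["shap_contribution"] for item in impacts)

impacts = [  # illustrative contributions in log space
    {"feature": "avg_time_since_last_mod_days", "shap_contribution": 0.61},
    {"feature": "commit_count", "shap_contribution": -0.46},
    {"feature": "code_churn", "shap_contribution": 0.11},
]
log_pred = reconstruct_prediction(2.0, impacts)  # 2.0 + 0.26 = 2.26
```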

Quick Start

Simple run steps

  1. Install requirements
python3 -m pip install -r requirements.txt
  2. Add your GitHub PAT token and GCP credentials path
  • Put all values into a .env file. Use example.env as a reference.
  • Example entries you should set in .env (replace the values):
GITHUB_PAT=your_github_pat_here
GOOGLE_APPLICATION_CREDENTIALS=/full/path/to/your-gcp-creds.json
GCS_BUCKET_NAME=predict-pr-merge-datasets
  3. Run the ETL pipeline
python -m github_etl.run_etl
  • Note down the ingestion job id printed by the ETL run.
  4. Analysis
  • Modify data_analysis/config.py with the required values, setting the analysis job id to the ingestion job id you noted.
python -m data_analysis.analysis
  5. Training
  • Edit training/config.py as needed.
python -m training.train
  • Note down the training job id printed by the training run.
  6. Inference
  • Put the training job id in inference/config.py (set TRAINING_JOB_ID) and run:
python -m inference.main

Known Limitations

Architecture Layer

  • Single-repo design: No cross-repo transfer learning; cannot generalize to new repos
  • No feature store: Features recomputed per pipeline; no versioning or reuse
  • Hardcoded training job ID: Manual deployment to switch models; no version control
  • No distributed compute: ThreadPoolExecutor (GIL-bound); single-machine constraint

Analysis Pipeline

  • No data validation: Missing schema checks; corrupted data undetected
  • Limited null handling: No explicit imputation strategy; inference fails with missing data
  • Outlier removal train-only: Test set not cleaned; train/test distribution mismatch
  • No feature drift monitoring: Silent drift; model degrades unnoticed
  • Manual feature engineering: Hardcoded features; no automated relevance check

Training Pipeline

  • Grid search only: No Bayesian optimization; expensive tuning
  • Single model: No ensemble; no fallback if model fails
  • No uncertainty quantification: Point predictions only; no confidence intervals
  • Limited metrics: MAE-focused; missing P50/P90 percentile tracking
  • No automated retraining: Manual trigger; models stale after drift

Inference Pipeline

  • SHAP latency: TreeExplainer on-the-fly adds 100-500ms (2-5x slower)
  • No request validation: URL not sanitized; accepts malformed input
  • No caching: Redundant GitHub API calls; wastes quota
  • No rate limiting/auth: Vulnerable to DoS; unauthenticated access
  • No batch endpoint: Single PR only; inefficient for bulk scoring
  • Single model instance: No fallback; service dies if model loads fail
  • No feature validation: NaN/infinite values pass silently

Model Improvements Needed

  • Log-space predictions: Tail behavior underrepresented; error opacity
  • Missing features: No reviewer signals, repo maturity, code review velocity, sentiment
  • No quantile regression: Cannot generate prediction intervals
  • No ensemble: Single model type; limited robustness

Proposed Improvements

Architecture

  1. Feature Store (Feast): Version features; decouple from training/inference
  2. Distributed ETL: Kafka + Spark for incremental CDC; process only new PRs
  3. Model Registry (MLflow): Version control, rollback, lineage
  4. Multi-repo Support: Transfer learning or meta-features for generalization
  5. Config Externalization: .env + Secret Manager; remove hardcoded IDs

Analysis Pipeline

  1. Data Validation (Great Expectations): Schema, null rates, value ranges
  2. Feature Drift Monitoring: Monthly KS test; alert if p < 0.05
  3. Null Imputation Strategy: Explicit policy (median/mode); document for inference
  4. Outlier Detection: Apply to all splits (train/val/test) for consistency
  5. Automated Feature Ranking: Remove low-signal features pre-training

Training Pipeline

  1. Bayesian Optimization (Optuna): Replace grid search; faster convergence
  2. Quantile Regression: Predict 25th/50th/90th percentiles for intervals
  3. Ensemble Methods: XGBoost + LightGBM + Ridge for robustness
  4. Auto-Retraining: Weekly trigger or on drift; auto-deploy if validation improves
  5. Cross-Validation: Stratified splits; prevent temporal leakage

Inference Pipeline

  1. SHAP Optimization: Pre-compute baseline + LIME; reduce to <50ms latency
  2. Request Validation: Pydantic schema with regex URL check; fail fast
  3. Redis Caching: Cache PR metadata (1hr TTL); reduce API calls 50%+
  4. Rate Limiting + Auth: JWT tokens, RBAC with slowapi
  5. Batch Endpoint: /predict_batch vectorized inference; handle 1000s PRs
  6. Feature Validation: Schema checks with fallback imputation
  7. Health Checks: Extended endpoint (model, config, GCS, GitHub API)

Evaluation Results

The model was trained and evaluated on two repositories: VSCode and Excalidraw.
All metrics are computed on the held-out test sets.


VSCode Repository

Dataset Split

  • Train: 13,633 samples
  • Validation: 3,849 samples
  • Test: 3,851 samples

Test Metrics

Metric                  Value          Description
MAE                     22.59 hours    Mean absolute error
Median Absolute Error   0.60 hours     50% of predictions are within ~36 minutes
RMSE                    104.19 hours   Penalizes large deviations more heavily
P90 Error               32.19 hours    90% of predictions are within ~32 hours

Error Distribution (Absolute Error)

Error Range    Count
≤ 1 hour        2215
1–2 hours        330
2–6 hours        399
6–12 hours       200
12–24 hours      266
> 24 hours       441

Buckets are disjoint; counts sum to the 3,851-sample test set.

Summary (VSCode)

  • The very low median error (0.60 hours) indicates strong performance on typical PRs.
  • Most predictions fall within 1 hour of the true merge time.
  • The higher MAE and RMSE compared to the median suggest a small number of larger deviations.
  • Overall performance is stable, with errors primarily concentrated in longer-running PRs.

Excalidraw Repository

Dataset Split

  • Train: 2,096 samples
  • Validation: 572 samples
  • Test: 574 samples

Test Metrics

Metric                  Value          Description
MAE                     178.81 hours   Mean absolute error
Median Absolute Error   11.58 hours    50% of predictions are within ~12 hours
RMSE                    707.73 hours   Strongly influenced by large deviations
P90 Error               382.60 hours   90% of predictions are within ~383 hours

Error Distribution (Absolute Error)

Error Range    Count
≤ 1 hour          38
1–2 hours         67
2–6 hours        122
6–12 hours        64
12–24 hours       52
> 24 hours       231

Buckets are disjoint; counts sum to the 574-sample test set.

Summary (Excalidraw)

  • Median error remains moderate, but overall error metrics are significantly higher.
  • A substantial portion of predictions deviate by more than 24 hours.
  • The large gap between median error and MAE/RMSE indicates a heavy-tailed error distribution.
  • Model stability is lower compared to VSCode, likely due to dataset size and higher variance in merge patterns.

Comparative Observations

  • The model performs significantly better on VSCode, likely due to larger training data and more consistent PR behavior.
  • Excalidraw shows higher variance and heavier tail errors.
  • Performance scales positively with dataset size and pattern stability.

About

A system that estimates how long a PR is likely to take before being merged can help teams improve planning, review workflows, and developer productivity.
