PredictPRMerge

Project Overview

This repository implements a pipeline to predict pull request (PR) merge time (hours) from GitHub metadata and code-change signals. The pipeline is split into extraction, analysis/feature engineering, training, and inference components so each stage can be iterated independently.

System Architecture

The system is organized into modular components that handle data ingestion, preprocessing, training, and serving.


ETL Layer

Collects raw Pull Request (PR) data from GitHub and produces structured data.
See: github_etl

Batch ETL Pipeline

Coordinates large-scale PR extraction in controlled batches.
Manages pagination, retry logic, and incremental data collection.

Extractor

Retrieves raw PR data from the GitHub API.

Enricher

Enriches extracted PR data with additional features required for modeling.

  • Commits Enricher → commit count, unique contributors
  • Files Enricher → additions, deletions, code churn
  • Metadata Enricher → title length, label count, etc.

Transformer

Transforms raw GitHub JSON responses into a structured tabular format and derives additional modeling features through feature engineering on existing fields.

  • Adds derived features such as hour of day, weekday, is_weekend, etc.
  • Flattens nested objects
  • Normalizes timestamps
  • Standardizes column types
  • Handles missing values
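
As a hedged sketch (the actual implementation lives in the transformer module; the function name and field names here are illustrative), deriving the temporal features from a PR's created_at timestamp might look like:

```python
from datetime import datetime

def derive_temporal_features(created_at: str) -> dict:
    """Derive hour/weekday/is_weekend from an ISO-8601 PR timestamp."""
    ts = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return {
        "hour": ts.hour,                       # 0-23, submission hour
        "weekday": ts.weekday(),               # 0=Monday .. 6=Sunday
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }

features = derive_temporal_features("2024-03-16T22:05:00Z")
# A Saturday evening submission: hour=22, weekday=5, is_weekend=1
```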

Loader

Persists each transformed and enriched PR batch to Cloud Storage.

Observability (Tracer + Stats)

Provides logging, execution tracing, and pipeline metrics for monitoring reliability and performance.


Data Analysis / Preprocessing Layer

Prepares datasets for model training by cleaning data, engineering labels, and generating train/validation/test splits.
See: data_analysis

Statistical Summary Generator

Computes:

  • Feature correlations
  • Mutual information
  • Distribution statistics

Used for feature evaluation and exploratory analysis.

Outlier Detection & Removal

Identifies and removes extreme PR cases using:

  • Feature-based grouping
  • Median-based filtering
  • Train-only cleaning strategy

This prevents skewed learning while preserving validation integrity.

Dataset Splitter

Performs chronological data splitting:

  • 70% Train
  • 15% Validation
  • 15% Test

Ensures no temporal leakage between datasets.
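
A minimal sketch of the chronological split (assuming records are already parsed into dictionaries; the real pipeline operates on Polars dataframes):

```python
def chronological_split(rows, time_key="created_at"):
    """Sort by creation time, then cut 70/15/15 without shuffling,
    so every validation/test PR is newer than every training PR."""
    rows = sorted(rows, key=lambda r: r[time_key])
    n = len(rows)
    train_end = int(n * 0.70)
    val_end = train_end + int(n * 0.15)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]

data = [{"created_at": f"2024-01-{d:02d}"} for d in range(1, 21)]
train, val, test = chronological_split(data)
# 20 rows -> 14 train, 3 validation, 3 test, in time order
```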

Loader

Stores cleaned and split datasets back to Cloud Storage.


Training Layer

Performs hyperparameter search and model selection using XGBoost.
Training entrypoint: training/train.py

Key capabilities:

  • Hyperparameter tuning
  • Early stopping
  • Validation monitoring
  • Model selection based on MAE and P90 error

The best-performing model artifact is saved to Cloud Storage.
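
The search-and-select loop can be sketched as follows. A stand-in `train_and_score` replaces the real XGBoost fit so the selection logic is runnable; the parameter names mirror common XGBoost knobs, but the actual grid lives in training/train.py:

```python
from itertools import product

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6],
}

def train_and_score(params):
    """Stand-in for fitting XGBoost with early stopping and
    returning validation MAE (a toy score here)."""
    return params["learning_rate"] * 10 + params["max_depth"]

best_params, best_mae = None, float("inf")
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    mae = train_and_score(params)
    if mae < best_mae:  # keep the config with the lowest validation MAE
        best_params, best_mae = params, mae
```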


Inference / Serving Layer

FastAPI-based service for real-time predictions.
Inference entrypoint: inference/main.py

Responsibilities:

  • Loads trained model from Cloud Storage
  • Applies consistent feature processing
  • Returns predicted PR merge time

Designed for low-latency, production-ready inference.


Utilities Layer

Shared helpers for cloud I/O and common functionality.
See: utils


Cloud Storage

Acts as the central storage backbone for the system.

Stores:

  • Processed datasets
  • Train/Validation/Test splits
  • Feature configurations
  • Trained model artifacts
  • Hyperparameter metadata

This design ensures reproducibility, version control, and complete separation between training and inference. A high-level diagram of the implemented solution is available at architecture/architecture.png and embedded below.

Architecture Diagram

Feature Engineering — Why each feature was added

  • commit_count: Number of commits in the PR. Rationale: more commits usually indicate more work/complexity, often increasing time-to-merge.
  • file_count / files_modified / files_added / files_deleted / files_removed: PR surface area. Rationale: larger PRs touching more files are harder to review and more likely to take longer.
  • line_additions / line_deletions / code_churn: Magnitude of code change. Rationale: higher churn tends to correlate with more review time and integration effort.
  • unique_file_types: Diversity of file types touched (e.g., .js, .md). Rationale: cross-language or mixed changes often require more reviewers and context switching.
  • unique_commit_authors / assignee_count: Number of authors/assignees. Rationale: many contributors or reviewers can slow decisions or speed them up depending on coordination; useful patterns for the model to learn.
  • avg_time_since_last_mod_days: Recency of touched files. Rationale: recently modified files are more familiar to reviewers; older files might require more context and take longer.
  • title_length / body_length / label_count: Textual metadata signals. Rationale: more descriptive PRs (longer body) often correlate with smoother reviews; labels encode intent/priority.
  • hour / weekday / is_weekend / is_us_holiday: Temporal signals. Rationale: submission time affects reviewer availability and response latency.
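
For illustration (field names assumed from typical GitHub API per-file entries; the real enrichers may differ), the size and diversity features above could be derived like:

```python
import os

def file_change_features(files: list[dict]) -> dict:
    """Aggregate per-file GitHub diff stats into PR-level features."""
    additions = sum(f["additions"] for f in files)
    deletions = sum(f["deletions"] for f in files)
    extensions = {os.path.splitext(f["filename"])[1] for f in files}
    return {
        "file_count": len(files),
        "line_additions": additions,
        "line_deletions": deletions,
        "code_churn": additions + deletions,   # total lines touched
        "unique_file_types": len(extensions),  # e.g., {".js", ".md"}
    }

files = [
    {"filename": "src/app.js", "additions": 40, "deletions": 5},
    {"filename": "README.md", "additions": 3, "deletions": 0},
]
stats = file_change_features(files)
# file_count=2, code_churn=48, unique_file_types=2
```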

Notes: exact feature lists used in experiments are enumerated in training/train.py.

Data Analysis — What each cleaning step provides

  • Schema validation & type casting: Load JSON data with strict Polars schema, ensuring numeric and boolean columns are cast correctly (e.g., is_weekend, is_us_holiday to Int8). Value: prevents type mismatches and ensures consistent data representation.
  • Chronological train/val/test split: Sort by created_at timestamp, then split 70% train, 15% val, 15% test chronologically. Value: prevents temporal leakage where future PRs leak information into past model training.
  • Feature importance ranking: Compute Pearson correlation and mutual information (MI) between each feature and target. Combine scores (60% correlation + 40% MI) to identify top 4 features most aligned with merge time. Value: focuses outlier detection on the features that matter most for prediction.
  • Outlier detection via Freedman-Diaconis rule: Suggest bin width for each top feature using the formula bin_width = (2 * IQR) / n^(1/3), clamping to 5–50 bins. Value: adapts bin size to data distribution; prevents over-fragmentation or under-binning.
  • Group-based outlier removal: Bin each top feature, group rows by bin combination, compute median target per group. Remove rows where target > 3× group median (unless group is small, <5 rows). Value: removes extreme outliers while preserving typical PR dynamics within each feature region; respects domain variance.
  • Null-safe filtering: Drop null values before computing statistics or applying filters. Value: avoids NaN propagation in correlations and mutual information calculations.
  • Train/val/test preservation: Keep validation and test sets unfiltered; apply outlier removal only to training set. Value: training sees a cleaned distribution; evaluation remains representative of real-world data.
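
The binning and group-based filtering steps can be sketched in pure Python (a simplified, single-feature version of the pipeline's logic; the real code bins combinations of the top features):

```python
import statistics

def fd_bin_count(values, min_bins=5, max_bins=50):
    """Freedman-Diaconis: bin_width = 2*IQR / n^(1/3), clamped to 5-50 bins."""
    q = statistics.quantiles(values, n=4)
    iqr = q[2] - q[0]
    width = 2 * iqr / (len(values) ** (1 / 3))
    if width <= 0:
        return min_bins
    bins = round((max(values) - min(values)) / width)
    return max(min_bins, min(max_bins, bins))

def remove_group_outliers(rows, key, target, factor=3.0, min_group=5):
    """Drop rows whose target exceeds factor x their group's median target;
    small groups (< min_group rows) are kept untouched."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    kept = []
    for members in groups.values():
        if len(members) < min_group:
            kept.extend(members)   # too small to estimate a reliable median
            continue
        med = statistics.median(m[target] for m in members)
        kept.extend(m for m in members if m[target] <= factor * med)
    return kept

rows = [{"bin": 0, "merge_hours": h} for h in [2, 3, 4, 5, 100]]
rows += [{"bin": 1, "merge_hours": 50}]
cleaned = remove_group_outliers(rows, key="bin", target="merge_hours")
# Group 0 (median 4) drops the 100-hour outlier; the singleton group is kept.
```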

Training — Why this model and metric

  • Model choice — XGBoost (gradient-boosted trees): The training code uses xgboost.XGBRegressor (training/train.py). Rationale: tree ensembles handle heterogeneous numeric and categorical features without heavy normalization, capture non-linear interactions common in software metrics, scale well with tabular data, and are fast for grid search and inference.
  • Objective & eval metric: Training uses absolute-error oriented objective and validation MAE (mean absolute error) for model selection. Rationale: MAE directly corresponds to average hours error which is easy to interpret (e.g., predicted vs actual hours). MAE is robust to outliers compared with MSE/RMSE when the domain contains heavy tails.
  • Additional robustness metrics: The pipeline reports median absolute error, RMSE, and P90 error. Rationale: median AE reduces influence of outliers; P90 indicates large-error behavior which is important for SLA-like guarantees.
  • Hyperparameter search: Parameter grid search over learning rate, n_estimators, max_depth, subsample, and colsample_bytree. Rationale: these control bias/variance and effective model capacity for tabular data.
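
The reported metrics can be computed from held-out predictions as follows (a self-contained sketch; the pipeline's exact implementation may differ):

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """MAE, median absolute error, RMSE, and P90 absolute error (in hours)."""
    errs = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    n = len(errs)
    return {
        "mae": sum(errs) / n,
        "median_ae": statistics.median(errs),
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
        "p90": errs[min(n - 1, math.ceil(0.9 * n) - 1)],  # 90th-percentile error
    }

m = regression_metrics([1, 2, 3, 4, 100], [1, 2, 3, 4, 10])
# One large miss dominates MAE and RMSE while the median stays at 0,
# the same heavy-tailed pattern the evaluation sections describe.
```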

Inference — What happens at prediction time

  • Service: A FastAPI app exposes /predict and /health endpoints in inference/main.py. It loads a trained XGBoost model during startup and uses a predictor abstraction attached to app.state.
  • Feature extraction: Given a PR link, the service fetches PR metadata and computes the same derived features used in training (feature parity is essential). Value: ensures model input distribution matches training.
  • Preprocessing at inference: The service applies the same log1p transforms and feature ordering as training for skewed columns. Value: consistent scaling prevents unexpected prediction shifts.
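
As a hedged sketch (function names are illustrative; the real transforms live in the training and inference code), the log-space round trip pairs log1p at training time with expm1 at serving time:

```python
import math

def to_log_space(hours: float) -> float:
    """Training-side transform for the skewed merge-time target."""
    return math.log1p(hours)

def from_log_space(log_pred: float) -> float:
    """Inference-side inverse: convert a log-space prediction back to hours."""
    return math.expm1(log_pred)

log_pred = to_log_space(8.63)     # ~2.265 in log space
hours = from_log_space(log_pred)  # round-trips back to 8.63 hours
```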

/predict Endpoint — Input specification

The /predict endpoint accepts a POST request with a JSON body containing:

Request Body:

{
  "pr_link": "https://github.com/owner/repo/pull/123"
}

Parameters:

  • pr_link (string, required): A full GitHub pull request URL. The service parses this URL to extract the repository owner, name, and PR number, then fetches metadata from GitHub's API to compute features for the model. Example: https://github.com/excalidraw/excalidraw/pull/5678
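
As an illustrative sketch (the pattern and function name are assumptions, not the service's actual code), the URL parsing could look like:

```python
import re

PR_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)/?$"
)

def parse_pr_link(pr_link: str) -> tuple[str, str, int]:
    """Extract owner, repo, and PR number; raise on malformed input."""
    m = PR_URL.match(pr_link)
    if m is None:
        raise ValueError(f"not a GitHub PR URL: {pr_link!r}")
    return m["owner"], m["repo"], int(m["number"])

owner, repo, number = parse_pr_link(
    "https://github.com/excalidraw/excalidraw/pull/5678"
)
# -> ("excalidraw", "excalidraw", 5678)
```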

Response: The endpoint returns a JSON object with the predicted merge time, SHAP feature contributions, and model explanation:

{
  "predicted_merge_time_hours": 8.63,
  "log_prediction_value": 2.26,
  "top_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    {
      "feature": "commit_count",
      "feature_value": 0.69,
      "shap_contribution": -0.46
    }
  ],
  "all_feature_impacts": [
    {
      "feature": "avg_time_since_last_mod_days",
      "feature_value": 4.1,
      "shap_contribution": 0.61
    },
    ...
  ]
}

Response Fields:

  • predicted_merge_time_hours: Predicted merge time in hours (converted from log space).
  • log_prediction_value: Raw model prediction in log space (before exponentiation).
  • top_feature_impacts: Array of the 4 most impactful features. Each entry contains:
    • feature: Feature name.
    • feature_value: The actual value of this feature for the PR (log-transformed if applicable).
    • shap_contribution: SHAP value indicating how much this feature pushed the prediction up (positive) or down (negative) from the baseline.
  • all_feature_impacts: Complete list of all 19 features and their SHAP contributions, sorted by magnitude of impact.

Model Explainability (SHAP)

  • SHAP overview: SHAP (SHapley Additive exPlanations) assigns each feature an importance value for an individual prediction using principles from cooperative game theory. The model output is decomposed into a baseline value plus the contribution of each feature.
  • Interpretation: A positive SHAP value indicates that the feature increases the predicted merge time relative to the baseline. A negative value indicates that the feature decreases it. The absolute magnitude represents the strength of the feature’s contribution.
  • Global vs. Local Explanations:
    • Global explanations are obtained by aggregating SHAP values across the dataset to determine overall feature importance.
    • Local explanations analyze SHAP values for a single PR to understand the factors influencing that specific prediction.
  • Practical Application: SHAP is used to validate feature behavior, identify potential model biases or unintended dependencies, and provide transparent reasoning behind predictions.
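
The additive decomposition can be checked directly from a response: the baseline plus all SHAP contributions should reconstruct the raw log-space prediction. The values below are illustrative, not taken from a real response:

```python
def reconstruct_prediction(baseline: float, impacts: list[dict]) -> float:
    """SHAP additivity: model output = baseline + sum of per-feature contributions."""
    return baseline + sum(item["shap_contribution"] for item in impacts)

impacts = [  # illustrative contributions in log space
    {"feature": "avg_time_since_last_mod_days", "shap_contribution": 0.61},
    {"feature": "commit_count", "shap_contribution": -0.46},
    {"feature": "code_churn", "shap_contribution": 0.11},
]
log_pred = reconstruct_prediction(2.0, impacts)  # 2.0 + 0.26 = 2.26
```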

Quick Start

Simple run steps

  1. Install requirements
python3 -m pip install -r requirements.txt
  2. Add your GitHub PAT token and GCP credentials path
  • Put all values into a .env file. Use example.env as a reference.
  • Example entries you should set in .env (replace the values):
GITHUB_PAT=your_github_pat_here
GOOGLE_APPLICATION_CREDENTIALS=/full/path/to/your-gcp-creds.json
GCS_BUCKET_NAME=predict-pr-merge-datasets
  3. Run the ETL pipeline
python -m github_etl.run_etl
  • Note down the ingestion job id printed by the ETL run.
  4. Analysis
  • Modify data_analysis/config.py with the required values, setting the analysis job id to the ingestion job id you noted.
python -m data_analysis.analysis
  5. Training
  • Edit training/config.py as needed.
python -m training.train
  • Note down the training job id printed by the training run.
  6. Inference
  • Put the training job id in inference/config.py (set TRAINING_JOB_ID) and run:
python -m inference.main

Known Limitations

Architecture Layer

  • Single-repo design: No cross-repo transfer learning; cannot generalize to new repos
  • No feature store: Features recomputed per pipeline; no versioning or reuse
  • Hardcoded training job ID: Manual deployment to switch models; no version control
  • No distributed compute: ThreadPoolExecutor (GIL-bound); single-machine constraint

Analysis Pipeline

  • No data validation: Missing schema checks; corrupted data undetected
  • Limited null handling: No explicit imputation strategy; inference fails with missing data
  • Outlier removal train-only: Test set not cleaned; train/test distribution mismatch
  • No feature drift monitoring: Silent drift; model degrades unnoticed
  • Manual feature engineering: Hardcoded features; no automated relevance check

Training Pipeline

  • Grid search only: No Bayesian optimization; expensive tuning
  • Single model: No ensemble; no fallback if model fails
  • No uncertainty quantification: Point predictions only; no confidence intervals
  • Limited metrics: MAE-focused; missing P50/P90 percentile tracking
  • No automated retraining: Manual trigger; models stale after drift

Inference Pipeline

  • SHAP latency: TreeExplainer on-the-fly adds 100-500ms (2-5x slower)
  • No request validation: URL not sanitized; accepts malformed input
  • No caching: Redundant GitHub API calls; wastes quota
  • No rate limiting/auth: Vulnerable to DoS; unauthenticated access
  • No batch endpoint: Single PR only; inefficient for bulk scoring
  • Single model instance: No fallback; service dies if model loads fail
  • No feature validation: NaN/infinite values pass silently

Model Improvements Needed

  • Log-space predictions: Tail behavior underrepresented; error opacity
  • Missing features: No reviewer signals, repo maturity, code review velocity, sentiment
  • No quantile regression: Cannot generate prediction intervals
  • No ensemble: Single model type; limited robustness

Proposed Improvements

Architecture

  1. Feature Store (Feast): Version features; decouple from training/inference
  2. Distributed ETL: Kafka + Spark for incremental CDC; process only new PRs
  3. Model Registry (MLflow): Version control, rollback, lineage
  4. Multi-repo Support: Transfer learning or meta-features for generalization
  5. Config Externalization: .env + Secret Manager; remove hardcoded IDs

Analysis Pipeline

  1. Data Validation (Great Expectations): Schema, null rates, value ranges
  2. Feature Drift Monitoring: Monthly KS test; alert if p < 0.05
  3. Null Imputation Strategy: Explicit policy (median/mode); document for inference
  4. Outlier Detection: Apply to all splits (train/val/test) for consistency
  5. Automated Feature Ranking: Remove low-signal features pre-training

Training Pipeline

  1. Bayesian Optimization (Optuna): Replace grid search; faster convergence
  2. Quantile Regression: Predict 25th/50th/90th percentiles for intervals
  3. Ensemble Methods: XGBoost + LightGBM + Ridge for robustness
  4. Auto-Retraining: Weekly trigger or on drift; auto-deploy if validation improves
  5. Cross-Validation: Stratified splits; prevent temporal leakage

Inference Pipeline

  1. SHAP Optimization: Pre-compute baseline + LIME; reduce to <50ms latency
  2. Request Validation: Pydantic schema with regex URL check; fail fast
  3. Redis Caching: Cache PR metadata (1hr TTL); reduce API calls 50%+
  4. Rate Limiting + Auth: JWT tokens, RBAC with slowapi
  5. Batch Endpoint: /predict_batch vectorized inference; handle 1000s PRs
  6. Feature Validation: Schema checks with fallback imputation
  7. Health Checks: Extended endpoint (model, config, GCS, GitHub API)

Evaluation Results

The model was trained and evaluated on two repositories: VSCode and Excalidraw.
All metrics are computed on the held-out test sets.


VSCode Repository

Dataset Split

  • Train: 13,633 samples
  • Validation: 3,849 samples
  • Test: 3,851 samples

Test Metrics

Metric                  Value          Description
MAE                     22.59 hours    Mean absolute error
Median Absolute Error   0.60 hours     50% of predictions are within ~36 minutes
RMSE                    104.19 hours   Penalizes large deviations more heavily
P90 Error               32.19 hours    90% of predictions are within ~32 hours

Error Distribution (Absolute Error)

Error Range    Count
≤ 1 hour        2215
1–2 hours        330
2–6 hours        399
6–12 hours       200
12–24 hours      266
> 24 hours       441

Buckets are disjoint; counts sum to the 3,851-sample test set.

Summary (VSCode)

  • The very low median error (0.60 hours) indicates strong performance on typical PRs.
  • Most predictions fall within 1 hour of the true merge time.
  • The higher MAE and RMSE compared to the median suggest a small number of larger deviations.
  • Overall performance is stable, with errors primarily concentrated in longer-running PRs.

Excalidraw Repository

Dataset Split

  • Train: 2,096 samples
  • Validation: 572 samples
  • Test: 574 samples

Test Metrics

Metric                  Value          Description
MAE                     178.81 hours   Mean absolute error
Median Absolute Error   11.58 hours    50% of predictions are within ~12 hours
RMSE                    707.73 hours   Strongly influenced by large deviations
P90 Error               382.60 hours   90% of predictions are within ~383 hours

Error Distribution (Absolute Error)

Error Range    Count
≤ 1 hour          38
1–2 hours         67
2–6 hours        122
6–12 hours        64
12–24 hours       52
> 24 hours       231

Buckets are disjoint; counts sum to the 574-sample test set.

Summary (Excalidraw)

  • Median error remains moderate, but overall error metrics are significantly higher.
  • A substantial portion of predictions deviate by more than 24 hours.
  • The large gap between median error and MAE/RMSE indicates a heavy-tailed error distribution.
  • Model stability is lower compared to VSCode, likely due to dataset size and higher variance in merge patterns.

Comparative Observations

  • The model performs significantly better on VSCode, likely due to larger training data and more consistent PR behavior.
  • Excalidraw shows higher variance and heavier tail errors.
  • Performance scales positively with dataset size and pattern stability.

About

A system that estimates how long a PR is likely to take before being merged can help teams improve planning, review workflows, and developer productivity.
