Python analytics pipeline for evaluating MLB player skill using Statcast leaderboard data and percentile-based scoring models.
SkillEngine is a tightly scoped baseball analytics project that evaluates core player skill only.
It produces percentile-based skill scores for qualified hitters and pitchers using Baseball Savant leaderboard data and exports ranked lists including Top 50 outputs.
The system is implemented as a reproducible Python pipeline capable of processing multiple seasons of data.
- Build a resume-ready analytics engineering project with clear scope boundaries, repeatable outputs, and versioned changes.
- Provide a reusable core skill model that can serve as the baseline layer for future matchup/context modeling.
- Establish a reproducible pipeline that can process historical seasons and future seasons without code changes.
- Python
- Pandas
- CSV data processing
- Baseball Savant leaderboard exports
- Deterministic ranking algorithms
SkillEngine intentionally does not include:
- Home/away splits
- Weather
- Pitch-type matchups
- Ballpark factors
- DFS logic
- Betting odds
Those features will be implemented later in a separate layer called MatchupEngine.
SkillEngine evaluates baseline player skill independent of matchup context.
SkillEngine uses Baseball Savant Custom Leaderboard exports.
Each season requires two raw files:
{season}_batting.csv
{season}_pitching.csv
Example:
2024_batting.csv
2024_pitching.csv
Raw files must be placed in:
01_data/raw/
The exact export process is documented in:
04_docs/SAVANT_EXPORT_GUIDE.md
Following that guide keeps the raw exports consistent and reproducible from season to season.
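A quick pre-flight check can confirm that both raw files are in place before the pipeline runs. The sketch below is illustrative only: `raw_paths` and `missing_raw_files` are hypothetical helper names, not functions from the SkillEngine source. Only the folder and file-name pattern come from this README.

```python
from pathlib import Path

RAW_DIR = Path("01_data/raw")  # raw-data folder named in this README


def raw_paths(season: int, raw_dir: Path = RAW_DIR) -> dict[str, Path]:
    """Return the two expected raw file paths for a season."""
    return {
        "batting": raw_dir / f"{season}_batting.csv",
        "pitching": raw_dir / f"{season}_pitching.csv",
    }


def missing_raw_files(season: int, raw_dir: Path = RAW_DIR) -> list[str]:
    """List the names of expected raw files that are not on disk."""
    return [p.name for p in raw_paths(season, raw_dir).values() if not p.exists()]


if __name__ == "__main__":
    missing = missing_raw_files(2024)
    if missing:
        print("Missing raw files:", ", ".join(missing))
```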
The full pipeline runs using:
python 02_src/run_skillengine.py {season}
Example:
python 02_src/run_skillengine.py 2024
Pipeline stages:
- build_master_dataset.py
  - Column mapping
  - Data validation
  - Player name normalization
  - Innings conversion
  - WHIP calculation
  - Qualification filtering
- score_hitters_v1.py
  - Percentile calculations
  - SkillScore generation for hitters
- score_pitchers_v1.py
  - Percentile calculations
  - SkillScore generation for pitchers
- finalize_rankings_v1.py
  - Deterministic ranking
  - Top 50 export generation
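The innings conversion and WHIP steps in the first stage could look roughly like the sketch below. It assumes the raw export records innings in conventional baseball notation, where the digit after the decimal point counts outs (123.1 means 123 1/3 innings). That format, and the function names, are assumptions for illustration. The actual logic lives in build_master_dataset.py.

```python
def innings_to_decimal(ip: float) -> float:
    """Convert baseball notation (e.g. 123.1 = 123 1/3 innings) to true innings."""
    whole = int(ip)
    outs = round((ip - whole) * 10)  # fractional digit counts outs: 0, 1, or 2
    if outs not in (0, 1, 2):
        raise ValueError(f"unexpected innings value: {ip}")
    return whole + outs / 3


def whip(walks: int, hits: int, innings: float) -> float:
    """WHIP = (walks + hits) / innings pitched, with innings in true decimal form."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    return (walks + hits) / innings


if __name__ == "__main__":
    innings = innings_to_decimal(180.2)  # 180 innings plus 2 outs
    print(round(whip(45, 150, innings), 3))
```

Converting before dividing matters: treating 180.2 as a plain decimal would slightly overstate WHIP, because the pitcher actually threw 180 2/3 innings.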
To focus on meaningful sample sizes, the master datasets apply qualification filters.
- Hitters: PA ≥ 200 (plate appearances)
- Pitchers: IP ≥ 80 (innings pitched)
These thresholds ensure the SkillScore is calculated using players with sufficient playing time.
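As a minimal pandas sketch of the filter, assuming the plate-appearance threshold applies to hitters and the innings threshold to pitchers: the column names `pa` and `ip_decimal` are hypothetical and may not match the master datasets.

```python
import pandas as pd

MIN_PA = 200  # plate-appearance threshold from this README
MIN_IP = 80   # innings-pitched threshold from this README


def qualified_hitters(df: pd.DataFrame) -> pd.DataFrame:
    """Keep hitters with at least MIN_PA plate appearances."""
    return df[df["pa"] >= MIN_PA].reset_index(drop=True)


def qualified_pitchers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep pitchers with at least MIN_IP innings (true decimal innings)."""
    return df[df["ip_decimal"] >= MIN_IP].reset_index(drop=True)
```

Filtering before scoring matters because percentiles are computed within the qualified pool, so the pool definition directly affects every score.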
All component metrics are first converted to percentiles within the qualified player pool, then combined using weighted formulas.
Hitters:
SkillScore = 0.45 × OBP percentile + 0.35 × SLG percentile - 0.20 × K-rate percentile

Pitchers:
SkillScore = 0.35 × K-rate percentile - 0.20 × BB-rate percentile - 0.25 × WHIP percentile - 0.20 × ERA percentile
Interpretation:
Higher SkillScore indicates a stronger underlying skill profile relative to the qualified player pool.
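The weighted-percentile idea can be sketched in pandas as follows. The weights are the ones listed in this README. Everything else is an assumption made for illustration: the column names, the 0 to 100 scale, and the use of `Series.rank(pct=True)` as the percentile method (the scoring scripts may compute percentiles differently). The sketch also assumes a higher raw value always maps to a higher percentile, so the negative weights are what penalize metrics where lower is better.

```python
import pandas as pd

# Weights as listed in this README; dictionary keys are hypothetical column names.
HITTER_WEIGHTS = {"obp": 0.45, "slg": 0.35, "k_rate": -0.20}
PITCHER_WEIGHTS = {"k_rate": 0.35, "bb_rate": -0.20, "whip": -0.25, "era": -0.20}


def skill_score(df: pd.DataFrame, weights: dict[str, float]) -> pd.Series:
    """Weighted sum of within-pool percentiles (0-100, higher raw value = higher percentile)."""
    score = pd.Series(0.0, index=df.index)
    for column, weight in weights.items():
        percentile = df[column].rank(pct=True) * 100  # assumed percentile method
        score += weight * percentile
    return score


if __name__ == "__main__":
    hitters = pd.DataFrame(
        {
            "player": ["A", "B"],
            "obp": [0.400, 0.300],
            "slg": [0.500, 0.400],
            "k_rate": [0.15, 0.25],
        }
    )
    hitters["skill_score"] = skill_score(hitters, HITTER_WEIGHTS)
    print(hitters[["player", "skill_score"]])
```

Because the weights do not sum to 1, the resulting score is a relative index rather than a percentile itself. Only comparisons within the same season and pool are meaningful under this sketch.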
For each processed season, the pipeline generates the following outputs.
{season}_hitters_master.csv
{season}_pitchers_master.csv
{season}_hitters_scored_v1.csv
{season}_pitchers_scored_v1.csv
{season}_hitters_ranked_v1.csv
{season}_pitchers_ranked_v1.csv
{season}_hitters_top50_v1.csv
{season}_pitchers_top50_v1.csv
To ensure rankings are reproducible, ties are resolved using deterministic tie-breakers.
Hitters (applied in order):
- Higher SkillScore
- Higher OBP
- Higher SLG
- Higher PA

Pitchers (applied in order):
- Higher SkillScore
- Higher K-rate
- Lower WHIP
- Higher IP
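One way to express these tie-breakers is a multi-column sort, sketched below. It assumes the first four criteria are the hitter order and the last four the pitcher order, as the metrics suggest. The column names and the `rank_players` and `top_n` helpers are hypothetical, and finalize_rankings_v1.py may implement ranking differently.

```python
import pandas as pd

# (column, ascending) pairs in priority order; column names are hypothetical.
HITTER_TIEBREAK = [("skill_score", False), ("obp", False), ("slg", False), ("pa", False)]
PITCHER_TIEBREAK = [("skill_score", False), ("k_rate", False), ("whip", True), ("ip_decimal", False)]


def rank_players(df: pd.DataFrame, order: list[tuple[str, bool]]) -> pd.DataFrame:
    """Sort by the tie-break columns in priority order and add a 1-based rank."""
    columns = [column for column, _ in order]
    ascending = [asc for _, asc in order]
    ranked = df.sort_values(columns, ascending=ascending).reset_index(drop=True)
    ranked["rank"] = ranked.index + 1
    return ranked


def top_n(ranked: pd.DataFrame, n: int = 50) -> pd.DataFrame:
    """Return the first n rows of an already ranked frame."""
    return ranked.head(n)
```

Players who tie on every listed column would keep their input order in this sketch, so a final unique key (for example a player ID) would be needed to make the result fully independent of input order.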
- Use ranked files to evaluate the full qualified player pool.
- Use top50 files as a quick shortlist for draft research or high-level analysis.
- Treat SkillEngine as the baseline skill evaluation layer.
Future systems (MatchupEngine) will adjust these baseline ratings using matchup context.
SkillEngine follows semantic-style versioning.
Major: structural or conceptual model changes.
Examples:
- Changing percentile methodology
- Multi-year weighting
- New core architecture

Minor: metric additions, weight adjustments, validation improvements, or ranking refinements.

Patch: bug fixes that do not alter scoring logic.
Weights may only change if:
- Evaluation metrics show persistent mis-weighting.
- Changes are documented in CHANGELOG.md.
- Post-change validation is recorded.
No silent weight changes.
See CHANGELOG.md for versioned model changes, validation notes, and architectural updates.