This project shows how to automate an MLflow experiment with Jenkins, persist the Iris data set in SQLite, and version both data and trained models with DVC so everything can be pushed to GitHub.
- `prepare_data.py`: materialises the Iris data to `data/iris.parquet` and loads the same rows into `data/iris.db` (SQLite).
- `train.py`: trains a logistic regression, logs runs/artifacts to MLflow, and serialises the model to `artifacts/model.pkl`.
- `data_utils.py`: helper functions for reading/writing the Iris data.
- `dvc.yaml` / `dvc.lock` / `params.yaml`: define the DVC pipeline (`prepare_data` -> `train_model`) and the hyperparameters Jenkins modifies.
- `Jenkinsfile` + `Jenkinsfile.windows`: CI pipelines. By default they install dependencies, tweak `params.yaml`, then run `dvc repro`. Set the `USE_MLFLOW_PROJECT` parameter to switch to `mlflow run .`.
- `MLproject` + `python_env.yaml`: optional MLflow Projects entry point for ad-hoc runs.
- `prepare_data.py` exports the Iris frame to `data/iris.parquet`.
- The same script writes the rows into the `iris_samples` table inside `data/iris.db`. Example query: `sqlite3 data/iris.db "SELECT * FROM iris_samples LIMIT 5;"`.
- `train.py` reuses the serialized data; if it is missing, the script regenerates the file and refreshes the DB.
- Create a pipeline job pointing to this repo.
- Parameters available in both Jenkinsfiles:
  - `MLFLOW_TRACKING_URI`: MLflow server URL (falls back to agent env vars if empty).
  - `MLFLOW_EXPERIMENT_NAME`: overrides the experiment in `params.yaml`.
  - `MAX_ITER`: forwarded into `params.yaml` before `dvc repro`.
  - `USE_MLFLOW_PROJECT`: when `true`, skip DVC and call `mlflow run .`.
  - `RUN_SECURITY_SCANS`: when `true`, installs `requirements-security.txt` and runs `pip-audit` (dependency CVEs) and `bandit` (Python static analysis) before training.
  - `RUN_MS_SECURITY` (Windows agents): if `msdo` (Microsoft Security DevOps) is installed, runs CredScan/DevSkim/Bandit and emits `msdo.sarif`.
  - `RUN_GARAK` + `GARAK_COMMAND`: run Garak LLM red-team tests; put the full Garak CLI args (model, n-probes, report path, etc.) into `GARAK_COMMAND`.
  - `RUN_FAIRLEARN`: run a Fairlearn bias snapshot on the trained model and dataset.
  - `RUN_GISKARD`: run a Giskard scan of the trained model and dataset.
  - `RUN_CREDO_AI`: capture Credo AI metadata (version + basic dataset info).
  - `RUN_CYCLONEDX`: generate a CycloneDX SBOM from `requirements.txt`.
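As an example of how one of these parameters could take effect, `train.py` might resolve the experiment name by letting the Jenkins-exported env var win over `params.yaml`. This is a hedged sketch: the `params.yaml` layout (`mlflow: experiment_name:`) and the `resolve_experiment` helper are assumptions, not verified project code.

```python
# Sketch only: the params.yaml structure shown here is an assumption; the
# README just says MLFLOW_EXPERIMENT_NAME overrides the value in params.yaml.
import os

import yaml


def resolve_experiment(params_path="params.yaml"):
    with open(params_path) as f:
        params = yaml.safe_load(f) or {}
    file_value = params.get("mlflow", {}).get("experiment_name")
    # An explicit Jenkins parameter (exported as an env var) wins.
    return os.environ.get("MLFLOW_EXPERIMENT_NAME") or file_value or "Default"
```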
- Linux agents use `Jenkinsfile` (shell); Windows agents use `Jenkinsfile.windows` (PowerShell).
- Jenkins archives `mlruns_local/**`, security outputs, Garak reports, fairness/scanner outputs, and the SBOM so they can be downloaded even without MLflow UI access.
Data (`data/iris.parquet`, `data/iris.db`) and the model (`artifacts/model.pkl`) are tracked as DVC outputs with `cache: false`, so the actual files stay in Git while DVC captures lineage in `dvc.lock`.
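Concretely, a `dvc.yaml` with `cache: false` outputs might look roughly like this. The stage names and output paths come from the README; the exact `deps` and `params` entries are illustrative assumptions.

```yaml
# Illustrative shape only -- the real dvc.yaml may declare different
# deps/params; stage names and outputs are taken from the README.
stages:
  prepare_data:
    cmd: python prepare_data.py
    outs:
      - data/iris.parquet:
          cache: false
      - data/iris.db:
          cache: false
  train_model:
    cmd: python train.py
    deps:
      - data/iris.parquet
    params:
      - max_iter
    outs:
      - artifacts/model.pkl:
          cache: false
```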
```bash
python -m venv .venv
. .venv/bin/activate   # Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
dvc repro
dvc exp show
# optional: configure remote storage
# dvc remote add -d s3 s3://my-bucket/path
```

```bash
git add .
git commit -m "Add Jenkins + MLflow + DVC pipeline"
git branch -M main
git remote add origin https://github.com/Ayoub-Samir/SE.git
git push -u origin main
```

```bash
python -m venv .venv
. .venv/bin/activate   # Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
export MLFLOW_TRACKING_URI=http://localhost:5000   # optional
dvc repro   # or: python prepare_data.py && python train.py
mlflow ui --backend-store-uri ./mlruns --port 5000
```

Open http://127.0.0.1:5000 to inspect the latest runs.
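After a local run has produced `artifacts/model.pkl`, the model can be loaded back with pickle. A minimal sketch, assuming only what the README states (that `train.py` pickles the model to that path); `load_model` is an illustrative helper, and unpickling should only ever be done on files you trust.

```python
# Minimal sketch: train.py is said to serialise the model to
# artifacts/model.pkl. Only unpickle files you trust -- pickle can
# execute arbitrary code during deserialisation.
import pickle


def load_model(path="artifacts/model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```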
- Point `params.yaml` to a different data source or table and re-run `dvc repro`.
- Configure `dvc remote add` to S3/Azure/GDrive when data/models grow larger.
- Add a Jenkins post step that prints a link to your hosted MLflow UI using the run ID from the logs.
- For stronger MLSecOps/OWASP coverage, add secrets scanning (e.g., gitleaks/detect-secrets) and artifact signing; hashes are already logged to MLflow via `security_manifest.json` for integrity checks.
- To stay within the Microsoft ecosystem: install `msdo` on Windows agents and enable `RUN_MS_SECURITY` to get CredScan/DevSkim/Bandit SARIF output.
- LLM red-teaming: enable `RUN_GARAK` and pass something like `--model openai:gpt-4o-mini --n-probes 10 --report garak_report.json` into `GARAK_COMMAND` (supply your own model/API credentials).
- Fairness & governance: toggle `RUN_FAIRLEARN`, `RUN_GISKARD`, `RUN_CREDO_AI`, and/or `RUN_CYCLONEDX` to emit bias, QA/governance metadata, and SBOM artifacts under `artifacts/`.