- DNNGP: Deep neural network for genomic prediction.
- Python 3.9
- pip
Install packages:
- Create a Python environment.

```bash
conda create -n GxEtoolkit python=3.9
conda activate GxEtoolkit
```

- Clone this repository and cd into it.

```bash
git clone https://github.com/AIBreeding/GxEtoolkit.git
cd ./GxEtoolkit
pip install -r requirements.txt
```

AutoGS & EXGEP can be used from the command line or as a Python package. Choose the appropriate framework based on your needs. To run with your own data, make sure it follows the structure of the example datasets.
Show the command-line help:

```bash
python main.py autogs -h
python main.py exgep -h
```

Run EXGEP from the command line:

```bash
python main.py exgep \
  --geno ./dataset/exgep_data/genotype.csv \
  --phen ./dataset/exgep_data/pheno.csv \
  --soil ./dataset/exgep_data/soil.csv \
  --weather ./dataset/exgep_data/weather.csv \
  --n_traits 1 \
  --target_trait Yield \
  --models_optimize LightGBM \
  --models_assess LightGBM \
  --metric_assess mae,mse,rmse,pcc,r2 \
  --metric_optimise r2 \
  --n_trial 2 \
  --n_splits 2 \
  --write_folder ./exgep_results/ \
  --reload_study \
  --reload_trial \
  --shap_n_train_points 200 \
  --shap_n_test_points 200 \
  --shap_cluster True \
  --shap_model_name LightGBM
```

Run AutoGS with standard K-fold cross-validation:

```bash
python main.py autogs \
  --cv_type KFold \
  --phen_file ./dataset/autogs_data/trainset/Pheno/ \
  --env_file ./dataset/autogs_data/trainset/Env/ \
  --geno_file ./dataset/autogs_data/trainset/Geno/YI_All.vcf \
  --ref_file ./dataset/docs/maizeRef_ALL.csv \
  --file_names ./dataset/env.txt \
  --models_optimize LightGBM,XGBoost \
  --models_assess LightGBM,XGBoost \
  --metric_assess mae,mse,rmse,pcc,r2 \
  --metric_optimise r2 \
  --n_trial 2 \
  --n_traits 9 \
  --target_trait Yield_Mg_ha \
  --shap_n_train_points 200 \
  --shap_n_test_points 200 \
  --shap_cluster True \
  --shap_model_name LightGBM
```

Run AutoGS with an environment-aware cross-validation scheme (`--cv_type` can be LOECV, LOESTCV, or STECV):

```bash
python main.py autogs \
  --cv_type STECV \
  --phen_file ./dataset/autogs_data/trainset/Pheno/ \
  --env_file ./dataset/autogs_data/trainset/Env/ \
  --geno_file ./dataset/autogs_data/trainset/Geno/YI_All.vcf \
  --ref_file ./dataset/docs/maizeRef_ALL.csv \
  --file_names ./dataset/env.txt \
  --n_traits 9 \
  --target_trait Yield_Mg_ha \
  --base_models KNN,XGBoost,LightGBM \
  --meta_model ridge \
  --is_ensemble \
  --do_optimize \
  --n_trial 2 \
  --metric_assess pcc,rmse \
  --metric_optimise r2 \
  --optimization_objective maximize \
  --write_folder ./autogs_LOECV/
```

5. Usage as a Python package: You can use our framework to perform prediction tasks using only genotypic features or any other single type of data.
```python
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler
from autogs.model import RegAutoGS
from autogs.data.tools.reg_metrics import (mae_score as mae,
                                           mse_score as mse,
                                           rmse_score as rmse,
                                           r2_score as r2,
                                           rmsle_score as rmsle,
                                           mape_score as mape,
                                           medae_score as medae,
                                           pcc_score as pcc)

your_phen_file_path = "/dataset/trainset/Pheno/"
your_geno_file_path = "/dataset/trainset/Geno/YI_All.vcf"
phen = pd.read_csv(your_phen_file_path)
geno = pd.read_csv(your_geno_file_path)

# Standardize the genotype features.
scaler = StandardScaler()
scaled_geno = scaler.fit_transform(geno)
X = pd.DataFrame(scaled_geno, columns=geno.columns)
y = pd.Series(phen.squeeze())  # assumes a single phenotype column

# Train an AutoGS model for regression prediction.
reg = RegAutoGS(
    y=y,
    X=X,
    test_size=0.2,
    n_splits=2,
    n_trial=2,
    reload_study=True,
    reload_trial=True,
    write_folder=os.getcwd() + '/autogs_results/',
    metric_optimise=r2,
    metric_assess=[mae, mse, rmse, pcc, r2, rmsle, mape, medae],
    optimization_objective='maximize',
    models_optimize=['LightGBM', 'XGBoost'],
    models_assess=['LightGBM', 'XGBoost'],
    early_stopping_rounds=3,
    random_state=2024
)
reg.train()  # train the model

# You can interpret either the final AutoGS ensemble model or a specific base model of interest.
reg.CalSHAP(n_train_points=200, n_test_points=200, cluster=False, model_name='XGBoost')  # compute SHAP values
```

Supported evaluation metrics:
- mae: Mean Absolute Error (MAE)
- mse: Mean Squared Error (MSE)
- rmse: Root Mean Squared Error (RMSE)
- r2: R² Score (Coefficient of Determination)
- rmsle: Root Mean Squared Logarithmic Error (RMSLE)
- mape: Mean Absolute Percentage Error (MAPE)
- medae: Median Absolute Error (MEDAE)
- pcc: Pearson Correlation Coefficient (PCC)
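These scores follow their standard definitions. As a quick reference, here is a minimal NumPy/SciPy sketch of the underlying formulas; it is an illustration only, not the package's internal `reg_metrics` implementation:

```python
import numpy as np
from scipy.stats import pearsonr

def metrics_sketch(y_true, y_pred):
    """Standard regression metrics, computed from their textbook formulas."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                 # Mean Absolute Error
    mse = np.mean(err ** 2)                    # Mean Squared Error
    rmse = np.sqrt(mse)                        # Root Mean Squared Error
    medae = np.median(np.abs(err))             # Median Absolute Error
    mape = np.mean(np.abs(err / y_true))       # Mean Absolute Percentage Error
    rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))  # RMSLE (non-negative targets)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # coefficient of determination
    pcc = pearsonr(y_true, y_pred)[0]          # Pearson correlation
    return {"mae": mae, "mse": mse, "rmse": rmse, "medae": medae,
            "mape": mape, "rmsle": rmsle, "r2": r2, "pcc": pcc}

scores = metrics_sketch([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print({k: round(v, 3) for k, v in scores.items()})
```

Note that PCC measures only linear association between predicted and observed values, while R² additionally penalizes scale and offset errors, which is why genomic prediction studies often report both.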
The framework currently supports 28 base regression models, including ensemble models, linear models, and other widely used regressors. Each model can be optionally selected via the selected_regressors list.
Tree-based and ensemble models:
- DTR: Decision Tree Regressor
- ETR: Extra Trees Regressor
- LightGBM: Light Gradient Boosting Machine
- XGBoost: Extreme Gradient Boosting
- CatBoost: Categorical Boosting
- AdaBoost: Adaptive Boosting
- GBDT: Gradient Boosting Decision Tree
- Bagging: Bagging Regressor
- RF: Random Forest Regressor
- HistGradientBoosting: Histogram-based Gradient Boosting

Linear models:
- BayesianRidge: Bayesian Ridge Regression
- LassoLARS: Lasso Least Angle Regression
- ElasticNet: Elastic Net Regression
- SGD: Stochastic Gradient Descent Regressor
- Linear: Ordinary Least Squares Linear Regression
- Lasso: Lasso Regression
- Ridge: Ridge Regression
- OMP: Orthogonal Matching Pursuit
- ARD: Automatic Relevance Determination Regression
- PAR: Passive Aggressive Regressor
- TheilSen: Theil-Sen Estimator
- Huber: Huber Regressor
- KernelRidge: Kernel Ridge Regression
- RANSAC: RANSAC (RANdom SAmple Consensus) Regressor

Other regressors:
- KNN: k-Nearest Neighbors Regressor
- SVR: Support Vector Regressor
- Dummy: Dummy Regressor (Baseline)
- MLP: Multi-Layer Perceptron (Deep Neural Network)
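The `--base_models` and `--meta_model` options describe a stacking ensemble: base learners produce out-of-fold predictions that a simple meta-model (e.g. ridge) combines. As a conceptual sketch of that idea, using only scikit-learn stand-ins and synthetic data rather than the package's actual learners:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for standardized genotype features and a quantitative trait.
X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=2024)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2024)

# Base learners feed out-of-fold predictions to a ridge meta-model,
# mirroring e.g. --base_models KNN,... --meta_model ridge.
stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=2024)),
        ("gbdt", GradientBoostingRegressor(random_state=2024)),
    ],
    final_estimator=Ridge(),
    cv=5,  # out-of-fold predictions for the meta-model
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))  # R² on held-out data
```

The meta-model learns how much to trust each base learner, so the ensemble is typically at least as accurate as its best individual member on the held-out folds.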
You can read our papers explaining AutoGS and EXGEP:
He K, Yu T, Gao S, et al. Leveraging Automated Machine Learning for Environmental Data-Driven Genetic Analysis and Genomic Prediction in Maize Hybrids. Adv Sci (Weinh), e2412423, 2025. https://doi.org/10.1002/advs.202412423
Yu T, Zhang H, Chen S, et al. EXGEP: a framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models. Brief Bioinform, 2025. https://doi.org/10.1093/bib/bbaf414
This project is free to use for non-commercial purposes; see the LICENSE file for details.
For more information, please contact Huihui Li (lihuihui@caas.cn).
