The analysis of mixed-effects additive Bayesian networks (mABNs) is a complex task that requires careful simulation studies to understand the behavior of different estimation methods.
This package, mabnsimstud, is designed to facilitate the simulation of data from mABNs and to analyze the performance of various estimation techniques in a controlled environment.
Simulated data are stored in the `data` directory.
Plots and results from the simulation study are stored in the `results` directory (a non-standard package directory).
You can install the development version of mabnsimstud from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("furrer-lab/mabnsimstud")
```

The mabnsimstud package implements a comprehensive simulation study designed to evaluate the performance of mixed-effects additive Bayesian networks (mABNs) under various data characteristics and pooling strategies. The study is organized into two main experimental parts:
Part A: Mixed-Effects ABN Performance Analysis
- Evaluates how data characteristics affect mABN performance
- Two sub-experiments (A1 and A2) with different parameter combinations
Part B: Pooling Strategy Comparison
- Compares different pooling approaches for handling group effects
- Two sub-experiments (B1 and B2) with three pooling strategies each
The study compares three approaches to handling group effects:
- PP (Partial Pooling): Mixed-effects model with random intercepts - the full mABN approach
- CP (Complete Pooling): Ignores group structure entirely - standard ABN on pooled data
- NP (No Pooling): Separate models per group - independent ABNs for each group
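As a conceptual illustration (not the package's internals), the three strategies can be sketched for a single Gaussian node `y` with one parent `x` and grouping factor `g`; the example assumes the lme4 package is installed:

``` r
library(lme4)

# Toy data: 4 groups, group-specific intercepts
set.seed(1)
df <- data.frame(g = factor(rep(1:4, each = 25)), x = rnorm(100))
df$y <- as.numeric(df$g) + 0.5 * df$x + rnorm(100)

fit_cp <- lm(y ~ x, data = df)                     # CP: group structure ignored
fit_np <- lapply(split(df, df$g),                  # NP: one independent model per group
                 function(d) lm(y ~ x, data = d))
fit_pp <- lmer(y ~ x + (1 | g), data = df)         # PP: shared slope, random intercept per group
```

In the study these choices are applied per node across the whole network rather than to a single regression.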
The mabnsimstud package was developed in the context of a simulation study for the paper "All for One or One for All? A comparative study of group effects in additive Bayesian networks".
The results of the simulation study can be reproduced using the provided scripts and functions in this package.
The terms "experiment" and "simulation" are used interchangeably; both refer to the process of simulating data and analyzing it.
The simulation study can be reproduced using the provided vignettes in the following order:
- Simulates the data for the experiments
- Combines all parameters and data into one `.rda` file per experiment
File Naming Convention:
`n_<sample_size>_k_<number_of_groups>_groups_<group_distribution_type>_p_<number_of_nodes>_nodedists_<node_distribution_type>_s_<sparsity_without_decimal>_graph<number_of_DAG>.rda`
Parameter Explanation:
- `<sample_size>` is the total number of observations across all groups
- `<number_of_groups>` is the number of groups (k)
- `<group_distribution_type>` is the distribution pattern of observations across groups:
  - `Even`: all groups have equal size
  - `1S`: one small group, all others large
  - `1L`: one large group, all others small
  - `HSHL`: half small groups, half large groups
- `<number_of_nodes>` is the number of variables/nodes in the network (p)
- `<node_distribution_type>` is the distribution type of the nodes, one of:
  - `Balanced`: all distributions appear equally often.
  - `G`: one node with each distribution and the rest Gaussian. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Binomial, 1 node is Multinomial, and the rest (2 nodes) are Gaussian.
  - `B`: one node with each distribution and the rest Binomial. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Gaussian, 1 node is Multinomial, and the rest (2 nodes) are Binomial.
  - `M`: one node with each distribution and the rest Multinomial. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Gaussian, 1 node is Binomial, and the rest (2 nodes) are Multinomial.
  - `P` is not used: Poisson should not be overrepresented, due to convergence issues.
- `<sparsity_without_decimal>` is the sparsity of the graph without the decimal point. E.g., if the sparsity is 0.1, then `<sparsity_without_decimal>` is `01`.
- `<number_of_DAG>` indicates which DAG it is in a series of experiments, starting from 1 up to 5 (usually).
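The convention above can be assembled with a small helper. This is a hypothetical function for illustration, not part of the package:

``` r
# Hypothetical helper (not part of mabnsimstud) that builds a file name
# following the naming convention described above.
make_sim_filename <- function(n, k, groups, p, nodedists, sparsity, graph) {
  s <- sub("\\.", "", format(sparsity, nsmall = 1))  # 0.4 -> "04"
  sprintf("n_%d_k_%d_groups_%s_p_%d_nodedists_%s_s_%s_graph%d.rda",
          n, k, groups, p, nodedists, s, graph)
}

make_sim_filename(100, 2, "Even", 6, "Balanced", 0.4, 2)
#> "n_100_k_2_groups_Even_p_6_nodedists_Balanced_s_04_graph2.rda"
```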
No input is required to run the data generation step from vignettes/generateSimData.Rmd, as it uses predefined parameters for the experiments.
For each experiment, an `.rda` file is created and stored in the `data` directory of the package.
Example:
``` r
# Load a dataset with 100 observations, 2 groups, 6 nodes, balanced distributions, 40% sparsity
load("data/n_100_k_2_groups_Even_p_6_nodedists_Balanced_s_04_graph2.rda")
```

It contains the following elements:
- `data`: `data.frame` of the simulated data from the mABN.
- `DAG`: `igraph` object of the directed acyclic graph (DAG) representing the structure of the mABN.
- `adj.matrix`: adjacency matrix of the DAG.
- `dists`: the distribution type of the variables in `data`.
- `groups_samples`: a factor vector indicating the group membership of each observation. Each level of the factor corresponds to a group, and the number of levels is equal to the number of groups in the mABN.
- `mycoefs`: a list of coefficients for each node in the mABN, where each element corresponds to a node and contains the coefficients for the linear predictor of that node.
- `max.parents`: the maximum number of parents for each node in the mABN.
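A minimal sketch of inspecting one of these files in a fresh session, using the example file name from above:

``` r
# Load one simulated data set into a separate environment and list its contents.
e <- new.env()
load("data/n_100_k_2_groups_Even_p_6_nodedists_Balanced_s_04_graph2.rda", envir = e)
ls(e)  # expected: adj.matrix, DAG, data, dists, groups_samples, max.parents, mycoefs
```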
Learn the structure of the mABN from the simulated data. Calls:
- `abnwithmaxparents()`: learns the structure of the mABN with a maximum number of parents for each node.
- `abn_ensemble()`: reduces structural uncertainty by combining multiple learned structures into a single, more robust structure.
- `abn_postprocessing()`: post-processes the learned structure to ensure it meets the requirements of a valid mABN.
- `abn_consensusdag()`: computes the consensus structure from the ensemble of learned structures, resulting in a directed acyclic graph (DAG) that represents the most likely structure of the mABN.
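A hedged sketch of how these steps chain together; the argument and object names here are illustrative assumptions, not the functions' actual signatures (see the package documentation):

``` r
# Illustrative pipeline only -- argument names are assumptions.
sl_obj  <- abnwithmaxparents(data, dists, max.parents)  # per-node structure search
ens_obj <- abn_ensemble(sl_obj)                         # combine learned structures
pp_obj  <- abn_postprocessing(ens_obj)                  # enforce validity constraints
dag     <- abn_consensusdag(pp_obj)                     # final consensus DAG
```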
The input for the structure learning step is the `.rda` file generated in the data generation step, which contains the simulated data and the parameters of the mABN.
Each intermediate step of the structure learning process produces an output that is stored in the data directory of the package.
These intermediate outputs follow the same naming convention as the `.rda` files, but with a prefix indicating the step that generated them, e.g., `abnslobj_*`, `abn_ensembleobj_*`, `abn_postprocessingobj_*`, `abn_consensusdagobj_*`, etc.
All plots, log files, etc. from the intermediate steps are stored in the results directory of the package following the same naming convention.
Evaluate the learned mABN models against the true generating models.
Main Functions:
- `ComputePerfResults()`: computes structural accuracy (SHD), parametric accuracy (KL divergence), and prediction performance
- `cv.predict.ABN()`: performs k-fold cross-validation for predictive performance assessment
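A hedged sketch of the evaluation step; all argument and object names are illustrative assumptions, not the functions' actual signatures:

``` r
# Illustrative only -- argument names are assumptions, not the real interface.
perf <- ComputePerfResults(dag, true_dag = DAG, true_coefs = mycoefs)  # SHD, KL divergence
cv   <- cv.predict.ABN(dag, data = data, k = 5)                        # k-fold CV prediction error
```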
Input: learned consensus DAG objects and the corresponding true data objects.
Output: performance metrics including SHD, KL divergence, prediction errors, and runtimes.
Generate comprehensive plots and summary statistics from the simulation results.
Main Functions:
- `merge_simulation_results()`: combines results across all experimental conditions
- `plot_accuracy_by_*()`: various plotting functions for different aspects of the results
- `extract_results()`: extracts and formats results for analysis
To reproduce the complete simulation study, execute the vignettes in this order:
generateSimData.Rmd- Generate all simulated datasetsstructurelearning.Rmd- Learn network structures for all datasets and strategiescomputeperformance.Rmd- Evaluate performance metricsabnplotresults.Rmd- Generate publication-ready plots and analyses
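One way to run the vignettes in order non-interactively (assuming the working directory is the package root; knitting them one by one in RStudio works equally well):

``` r
# Render each vignette in sequence with rmarkdown.
vignettes <- c("generateSimData.Rmd", "structurelearning.Rmd",
               "computeperformance.Rmd", "abnplotresults.Rmd")
for (v in vignettes) {
  rmarkdown::render(file.path("vignettes", v))
}
```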
- Time: Full study requires several days of computation
- Memory: Recommend 16+ GB RAM for larger datasets
- Cores: Parallelization recommended (8+ cores optimal)
- Storage: ~50 GB for all intermediate files and results
💡 Quick Start:
For testing, set `DEBUG <- TRUE` in the vignettes to run with reduced parameter sets.
If you use this package in your research, please cite the following paper: Champion, M., Delucchi, M., Furrer, R. (2025). All for One or One for All? A comparative study of group effects in additive Bayesian networks. Under review.