
Mixed-Effects Additive Bayesian Network Simulation Study

The analysis of mixed-effects additive Bayesian networks (mABNs) is a complex task that requires careful simulation studies to understand the behavior of different estimation methods. This package, mabnsimstud, is designed to facilitate the simulation of data from mABNs and to analyze the performance of various estimation techniques in a controlled environment.

The data is stored in the data directory. Plots and results from the simulation study are stored in the results directory (a non-standard package directory).

Installation

You can install the development version of mabnsimstud from GitHub with:

# install.packages("devtools")
devtools::install_github("furrer-lab/mabnsimstud")

Experimental Design

The mabnsimstud package implements a comprehensive simulation study designed to evaluate the performance of mixed-effects additive Bayesian networks (mABNs) under various data characteristics and pooling strategies. The study is organized into two main experimental parts:

Experiment Overview

Part A: Mixed-Effects ABN Performance Analysis

  • Evaluates how data characteristics affect mABN performance
  • Two sub-experiments (A1 and A2) with different parameter combinations

Part B: Pooling Strategy Comparison

  • Compares different pooling approaches for handling group effects
  • Two sub-experiments (B1 and B2) with three pooling strategies each

Pooling Strategies

The study compares three approaches to handling group effects:

  • PP (Partial Pooling): Mixed-effects model with random intercepts - the full mABN approach
  • CP (Complete Pooling): Ignores group structure entirely - standard ABN on pooled data
  • NP (No Pooling): Separate models per group - independent ABNs for each group
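A toy illustration of the three strategies, estimating group means in base R. The shrinkage weights use a simple method-of-moments calculation; this is an intuition-building sketch, not the package's mixed-effects machinery:

```r
set.seed(1)
g <- rep(1:3, each = 20)                       # three groups of 20
y <- rnorm(60, mean = c(0, 1, 2)[g], sd = 1)   # group-specific means

grand_mean  <- mean(y)            # CP: one mean, group structure ignored
group_means <- tapply(y, g, mean) # NP: independent mean per group
n_k <- tabulate(g)

# PP: shrink each group mean toward the grand mean; the weight grows
# with group size and with the between-group variance (method of moments)
sigma2 <- mean(tapply(y, g, var))                        # within-group
tau2   <- max(var(group_means) - sigma2 / mean(n_k), 0)  # between-group
w <- tau2 / (tau2 + sigma2 / n_k)
pp_means <- w * group_means + (1 - w) * grand_mean
```

Partial pooling always lands between the no-pooling estimate and the complete-pooling estimate, which is one reason it tends to be robust for small groups.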

Usage

The mabnsimstud package was developed for the simulation study accompanying the paper "All for One or One for All? A comparative study of group effects in additive Bayesian networks". The results of the study can be reproduced with the scripts and functions provided in this package. The terms "experiment" and "simulation" are used interchangeably throughout; both refer to the same process of simulating data and analyzing it. To reproduce the study, work through the provided vignettes in the following order:

1. Data Generation

  • Simulates the data for the experiments
  • Combines all parameters and data in one .rda file for each experiment

File Naming Convention: n_<sample_size>_k_<number_of_groups>_groups_<group_distribution_type>_p_<number_of_nodes>_nodedists_<node_distribution_type>_s_<sparsity_without_decimal>_graph<number_of_DAG>.rda

Parameter Explanation:

  • <sample_size> is the total number of observations across all groups
  • <number_of_groups> is the number of groups (k)
  • <group_distribution_type> is the distribution pattern of observations across groups:
    • Even: all groups have equal size
    • 1S: one small group, all others large
    • 1L: one large group, all others small
    • HSHL: half small groups, half large groups
  • <number_of_nodes> is the number of variables/nodes in the network (p)
  • <node_distribution_type> is the type of distribution for the nodes, which is one of:
    • (Balanced): all distributions appear equally often.
    • (G): One node with each distribution and the rest Gaussian. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Binomial, 1 node is Multinomial, and the rest (2 nodes) are Gaussian.
    • (B): One node with each distribution and the rest Binomial. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Gaussian, 1 node is Multinomial, and the rest (2 nodes) are Binomial.
    • (M): One node with each distribution and the rest Multinomial. E.g., if there are 5 nodes, then 1 node is Poisson, 1 node is Gaussian, 1 node is Binomial, and the rest (2 nodes) are Multinomial.
    • (P): Poisson should not be overrepresented, due to convergence issues. Therefore, this node distribution type is not used.
  • <sparsity_without_decimal> is the sparsity of the graph without the decimal point. E.g., if the sparsity is 0.1, then <sparsity_without_decimal> is 01.
  • <number_of_DAG> indicates which DAG it is in a series of experiments, starting from 1 up to 5 (usually).
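The naming convention above can be assembled programmatically. A minimal base-R sketch (the helper name make_sim_filename is illustrative, not part of the package API):

```r
# Build a dataset filename from the experiment parameters; the helper
# name is illustrative and not part of the package API.
make_sim_filename <- function(n, k, groups, p, nodedists, sparsity, graph) {
  # drop the decimal point from the sparsity, e.g. 0.4 -> "04"
  s <- sub(".", "", formatC(sparsity, format = "f", digits = 1), fixed = TRUE)
  sprintf("n_%d_k_%d_groups_%s_p_%d_nodedists_%s_s_%s_graph%d.rda",
          n, k, groups, p, nodedists, s, graph)
}

make_sim_filename(100, 2, "Even", 6, "Balanced", 0.4, 2)
# "n_100_k_2_groups_Even_p_6_nodedists_Balanced_s_04_graph2.rda"
```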

Input

No input is required to run the data generation step from vignettes/generateSimData.Rmd, as it uses predefined parameters for the experiments.

Output

For each experiment, an .rda file is created and stored in the data directory of the package. Example:

# Load a dataset with 100 observations, 2 groups, 6 nodes, balanced distributions, 40% sparsity
load("data/n_100_k_2_groups_Even_p_6_nodedists_Balanced_s_04_graph2.rda")

It contains the following elements:

  • data: data.frame of the simulated data from the mABN.
  • DAG: igraph object of the directed acyclic graph (DAG) representing the structure of the mABN.
  • adj.matrix: adjacency matrix of the DAG.
  • dists: The distribution type of the variables in data.
  • groups_samples: A factor vector indicating the group membership of each observation. Each level of the factor corresponds to a group, and the number of levels is equal to the number of groups in the mABN.
  • mycoefs: A list of coefficients for each node in the mABN, where each element corresponds to a node and contains the coefficients for the linear predictor of that node.
  • max.parents: The maximum number of parents for each node in the mABN.

2. Structure Learning

Learn the structure of the mABN from the simulated data. This step calls:

  • abnwithmaxparents(): This function learns the structure of the mABN with a maximum number of parents for each node.
  • abn_ensemble(): This function reduces the structural uncertainty by combining multiple learned structures into a single, more robust structure.
  • abn_postprocessing(): This function post-processes the learned structure to ensure it meets the requirements of a valid mABN.
  • abn_consensusdag(): This function computes the consensus structure from the ensemble of learned structures, resulting in a directed acyclic graph (DAG) that represents the most likely structure of the mABN.

Input

The input for the structure learning step is the .rda file generated in the data generation step, which contains the simulated data and the parameters of the mABN.

Output

Each intermediate step of the structure learning process produces an output that is stored in the data directory of the package. These intermediate outputs follow the same naming convention as the .rda files, but with a prefix indicating the step that generated them, e.g., abnslobj_*, abn_ensembleobj_*, abn_postprocessingobj_*, abn_consensusdagobj_*.

All plots, log files, etc. from the intermediate steps are stored in the results directory of the package following the same naming convention.

3. Performance Evaluation

Evaluate the learned mABN models against the true generating models.

Main Functions:

  • ComputePerfResults(): Computes structural accuracy (structural Hamming distance, SHD), parametric accuracy (Kullback–Leibler divergence), and prediction performance
  • cv.predict.ABN(): Performs k-fold cross-validation for predictive performance assessment
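The k-fold scheme behind cv.predict.ABN() can be sketched generically: observations are partitioned into k folds, each fold serving once as the held-out test set. The example below uses a plain linear model as a stand-in for the fitted network:

```r
set.seed(42)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50)

k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(df)))  # random fold labels

mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = df[fold != i, ])       # train on k-1 folds
  pred <- predict(fit, newdata = df[fold == i, ])
  mean((df$y[fold == i] - pred)^2)                # error on held-out fold
})
mean(mse)   # cross-validated prediction error
```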

Input

Learned consensus DAG objects and corresponding true data objects.

Output

Performance metrics including SHD, KL divergence, prediction errors, and runtimes.
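The structural Hamming distance (SHD) counts the edge insertions, deletions, and reversals needed to turn the learned DAG into the true one, with a reversed edge costing 1 rather than 2. A minimal base-R sketch on adjacency matrices (an illustration, not the package's implementation):

```r
# SHD between two DAG adjacency matrices: each added or deleted edge
# costs 1, and a reversed edge costs 1 (not 2).
shd <- function(true_adj, est_adj) {
  d <- abs(true_adj - est_adj)    # entries where the graphs disagree
  reversals <- sum(d * t(d)) / 2  # a reversal flips two entries; count once
  sum(d) - reversals
}

A <- matrix(0, 3, 3); A[1, 2] <- 1; A[2, 3] <- 1            # true: 1->2, 2->3
B <- matrix(0, 3, 3); B[2, 1] <- 1; B[2, 3] <- 1; B[1, 3] <- 1
shd(A, B)   # 2: edge 1->2 reversed, edge 1->3 added
```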

4. Results Analysis and Visualization

Generate comprehensive plots and summary statistics from the simulation results.

Main Functions:

  • merge_simulation_results(): Combines results across all experimental conditions
  • plot_accuracy_by_*(): Various plotting functions for different aspects of the results
  • extract_results(): Extracts and formats results for analysis

Workflow Execution Order

To reproduce the complete simulation study, execute the vignettes in this order:

  1. generateSimData.Rmd - Generate all simulated datasets
  2. structurelearning.Rmd - Learn network structures for all datasets and strategies
  3. computeperformance.Rmd - Evaluate performance metrics
  4. abnplotresults.Rmd - Generate publication-ready plots and analyses

⚠️ Computational Requirements:

  • Time: Full study requires several days of computation
  • Memory: Recommend 16+ GB RAM for larger datasets
  • Cores: Parallelization recommended (8+ cores optimal)
  • Storage: ~50 GB for all intermediate files and results

💡 Quick Start: For testing, set DEBUG <- TRUE in the vignettes to run with reduced parameter sets.

Citation

If you use this package in your research, please cite the following paper: Champion, M., Delucchi, M., Furrer, R. (2025). All for One or One for All? A comparative study of group effects in additive Bayesian networks. Under review.
