Welcome to PGEcore, a central repository for scripts that integrate and wrap commonly used bioinformatics tools and bespoke code for common functions. This repository is being developed collaboratively as part of a hackathon (WASABI25) organized by the PlasmoGenEpi (PGE) group. The scripts here will serve as foundational components for building robust and reusable bioinformatics workflows.
This repository aims to:
- Provide a centralized collection of R scripts that wrap around external bioinformatics tools.
- Include bespoke utility functions to simplify and standardize workflow development.
- Serve as a shared resource for bioinformatics pipelines and modular workflows.
PGEcore/
├── README.md # Overview and guidelines
├── scripts/ # Scripts wrapping external tools and bespoke code
├── utils/ # Utility functions used across scripts
├── data/ # Example datasets
├── docs/ # Additional documentation or references
└── .gitignore # Ignore unnecessary files
We use Gitflow for development.
- Clone the repository and check out the `develop` branch:
git clone https://github.com/PlasmoGenEpi/PGEcore.git
cd PGEcore
git checkout develop
- Create a branch to develop on:
- Follow the branch naming convention (e.g. `feature/your_tool_name`).
git checkout -b <branch-name>
- Add your script and documentation:
- See the script guidelines below for more information on this.
- Commit Your Changes:
- Write clear and concise commit messages.
git add .
git commit -m "Added wrapper for ToolX"
git push origin <branch-name>
- Submit a Pull Request (PR):
- Ensure you have included all of the features outlined in the script guidelines.
- Once you have made your changes, create a PR into the `develop` branch. Someone will review your PR and provide feedback or approve it for merging. Never make any changes to the `main` branch, and please always PR into `develop`.
For an example please see scripts/FreqEstimationModel_wrapper.
- Create a directory under the `scripts/` folder. Make sure to give it an appropriate name.
  - If the script is to wrap an existing tool then include `_wrapper` at the end of the name (e.g. `moire_wrapper`).
  - If it makes the tool name clearer then it can be camel cased (e.g. `FreqEstimationModel_wrapper`).
  - If it is a piece of bespoke code then keep to lower case separated by `_` (e.g. `estimate_coi_naive`).

  mkdir scripts/<my_dir>
- Add the script for running the tool under the directory you just created. Keep the name of the script consistent with the name of the directory you created above. Please follow the guidelines below when writing your code.
- Copy the template from `docs/README.md` into this directory. Fill in the sections of the README for your tool.

  cp docs/template_README.md scripts/<my_dir>/README.md
Each software tool should be wrapped by one module (i.e., R script).
On a case-by-case basis, multiple modules may be created for one software tool, but only if the functionality used and the inputs have little overlap.
Within the module, code should be separated into functions according to whether it serves to format input, run the software tool, or format output. Multiple functions may be written for each purpose at the discretion of the module author. For example, if a module takes a common set of input but generates three different outputs, a logical set of functions might be read_input, run_tool, write_output1, write_output2, and write_output3.
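As a rough illustration of this separation, the sketch below splits a module into one function per purpose. It is only a skeleton under stated assumptions: the `specimen_id` column and the row-counting "tool" are placeholders, not part of any real PGEcore module, and base R is used here only to keep the example self-contained.

```r
# Hypothetical module skeleton. Function names follow the example in the text;
# the column name `specimen_id` and the counting logic are placeholders.

# Format input: read a tab-separated table into a data frame.
read_input <- function(path) {
  read.delim(path, stringsAsFactors = FALSE)
}

# Run the tool: a stand-in that counts rows per specimen.
run_tool <- function(input) {
  counts <- table(input$specimen_id)
  data.frame(
    specimen_id = names(counts),
    n_rows = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Format output: write the result as a tab-separated file.
write_output <- function(result, path) {
  write.table(result, path, sep = "\t", quote = FALSE, row.names = FALSE)
}
```

A module that produces several outputs would add further writer functions next to `write_output`, each taking the same result object.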
Within the input function(s), contents of tabular inputs should be validated with the validate package (see here for examples). Presence of expected columns, assumptions about column types, and assumptions about missingness should all be validated. Other input validation and general assertions should be performed with the checkmate package.
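One possible shape for such an input function is sketched below. The column names (`specimen_id`, `read_count`) and the rule names are assumptions made for illustration. The sketch relies on `summary()` of the confrontation object reporting an error for any rule that refers to a missing column, so column presence is caught alongside type and missingness checks.

```r
library(checkmate)
library(validate)

# Hypothetical input function; the expected columns are illustrative only.
read_input <- function(path) {
  # General assertion on the argument itself, via checkmate.
  assert_file_exists(path, access = "r")

  input <- read.delim(path, stringsAsFactors = FALSE)

  # Rules on the table contents, via validate.
  rules <- validator(
    specimen_id_is_character = is.character(specimen_id),
    read_count_is_numeric = is.numeric(read_count),
    specimen_id_not_missing = !is.na(specimen_id),
    read_count_not_missing = !is.na(read_count),
    read_count_non_negative = read_count >= 0
  )
  report <- summary(confront(input, rules))

  # A rule is treated as failed if any item fails, evaluates to NA,
  # or the rule could not be evaluated (e.g. the column is absent).
  failed <- report$fails > 0 | report$nNA > 0 | report$error
  if (any(failed)) {
    stop(
      "Input validation failed for rules: ",
      paste(report$name[failed], collapse = ", ")
    )
  }
  input
}
```

Naming each rule keeps the error message readable when several checks fail at once.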
The main body of the script (i.e., outside of function definitions) should contain very little code aside from parsing the shell arguments and calling the appropriate sequence of functions depending on the outputs requested.
The script should expose the following as shell arguments:
- Paths for all required and optional input files
- Paths for all required and optional output files
- In the case of software tools that are being used to generate multiple outputs, the module should have flags that control what outputs will be generated
- The number of threads, in the case of tools that support multi-threading
- The random number seed, in the case of tools that use random number generation
- Any other tool parameters the module author deems important to adjust when running the tool on different datasets. If a parameter seems useful to tune during benchmarking and stress testing of the tool, it should probably be exposed as an optional argument.
Arguments should be parsed with the optparse package. Specify types and include help messages.
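A minimal sketch of such an argument parser is shown below. The flag names (`--input`, `--output`, `--threads`, `--seed`, `--write_summary`) and the default values are assumptions chosen for illustration, not a convention prescribed by this repository.

```r
library(optparse)

# Hypothetical option list; flag names and defaults are illustrative only.
option_list <- list(
  make_option("--input", type = "character", default = NULL,
              help = "Path to the input table (required)"),
  make_option("--output", type = "character", default = NULL,
              help = "Path to the main output file (required)"),
  make_option("--threads", type = "integer", default = 1L,
              help = "Number of threads [default %default]"),
  make_option("--seed", type = "integer", default = 1L,
              help = "Random number seed [default %default]"),
  make_option("--write_summary", action = "store_true", default = FALSE,
              help = "Also write the optional summary output")
)

# Wrapping the parser in a function keeps the main body of the script short
# and lets the parsing be exercised with an explicit argument vector.
parse_module_args <- function(argv = commandArgs(trailingOnly = TRUE)) {
  parse_args(OptionParser(option_list = option_list), args = argv)
}

# Example with an explicit argument vector:
# opts <- parse_module_args(c("--input", "in.tsv", "--threads", "4"))
# opts$threads is then the integer 4
```

Declaring `type` for each option means optparse converts the value before the module sees it, and the `help` strings are what users get from `--help`.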
We encourage using the tidyverse packages for all tabular data manipulation. Other than that and the packages described above for argument parsing and input validation, endeavor to minimize dependencies.
We will use roxygen2 to create online documentation for each module. Please write docstrings using the roxygen2 format for each function in your script.
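For reference, a roxygen2 docstring is a block of `#'` comments placed directly above the function it documents. The function below, `filter_min_reads`, is a hypothetical helper invented for this illustration; only the comment format is the point.

```r
#' Drop rows below a read-count threshold
#'
#' Hypothetical helper used only to illustrate the docstring layout.
#'
#' @param input A data frame with a numeric `read_count` column.
#' @param min_reads Minimum read count a row must have to be kept.
#'
#' @return `input` restricted to rows with `read_count >= min_reads`.
filter_min_reads <- function(input, min_reads = 1) {
  input[input$read_count >= min_reads, , drop = FALSE]
}
```

The first line becomes the title, `@param` documents each argument, and `@return` describes the value.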
Otherwise, attempt to make your code as readable as possible and use in-line comments to clarify complicated segments.