Author: Vaidyanathan M. Sethuraman. Email: vm5@ornl.gov
ver_1.2: Nov-05-2021
If you find this software useful, please cite it in your work as Sethuraman et al., Atomistic Simulations of Polydisperse Lignin Melts Using Simple Polydisperse Residue Input Generator, Biomacromolecules (2023) Citation.
A Python/FORTRAN-90 based code to generate random initial structures for polydisperse monomers/residues according to Schulz-Zimm distribution or from an experimental molecular weight distribution.
In conjunction with LigninBuilder, users can generate
generate starting .psf, .pdb and .top files for
running with NAMD or
GROMACS software.
Although this code was primarily designed to generate input files for various types of biomass (Lignin/Carbohydrates), it can be used with any protein or polymer complex with the forcefield files.
This code works on a combination of Python and FORTRAN90
platforms. Python3.0+ and ifort/gfortran compilers are required.
Dowloading and unzipping the directory can be done using:
git clone https://github.com/vaidyanathanms/SPRInG_PolydispersePolymerBuilder.git
The above commands should generate a folder of the name:
- generic_builder
Navigate to generic_builder for generating a structure from scratch
using
cd generic_builder
. Inside generic_builder folder, you should see the following python
and FORTRAN files:
- make_genpsf.py
- genconf.py
- pdi_gen.f90
- pdi_dist_params.f90
If the files are present, you are set to generate a new structure.
There are two ways to generate initial structure. You can either copy
the four files above to a new folder or run from the directory
generic_builder.
inputsforpsfgen.inp is a sample input file containing all the input
keywords to SPRIG. We will look at the keywords in detail in the SPRIG
Keywords section. For running SPRIG, use:
python genconf.py <filename>where <filename> is the name of the input file to genconf.py.
If this generates a folder with casenum_ID and sub-folder
all_tclfiles (within casenum_ID) which contain a number
of tcl files, you are all set. Here, ID refers to an integer value
given as input (see SPRIG Keywords).
Several things can go wrong including compiler compatibilities and incompatible input constraints (see SPRIG Keywords). If you find an error, please report to Vaidyanathan M. Sethuraman.
Outputs from SPRIG can be directly fed into LigninBuilder to
generate the input structure for GROMACS using the following
three steps. Make sure to follow the order.
-
Step 1: If SPRIG ran correctly, users should see a folder
casenum_ID, whereIDis an integer value given as input to SPRIG (see SPRIG Keywords). Navigate to this directory usingcd <casenum_ID>Inside the folder users should see several files viz.,
- step1.tcl
- step2.tcl
- patchlist_ID.tcl, reslist_ID.tcl
- log_ID.txt
- step3.tcl (Optional)
where
IDcorresponds to thecasenum_ID. The directory will also contain the user specified input files for the residue/patch probabilities. If all of these files are present, Step 1 is complete. -
Step 2: Execute the following from command terminal:
vmd -dispdev text -e step1.tclMake sure the path to
vmdis added to$BINor is given correctly.Alternatively, users can open
VMDand openTk Consoleand issuesource step1.tclIf either of the commands run smoothly, this should generate
psffiles for each chain structure. This should also generatepdbfiles corresponding to thepsffiles within the directory. This requiresLigninBuilderto be added in~\.vmdrc(seeLigninBuilderon how to do this).Following the generation of
psf`pdb` files, from command terminal, issuevmd -dispdev text -e step2.tclOR
from
Tk ConsoleinVMDissuesource step2.tclThis should provide output files of the form
inpfile_nch_N.psf,inpfile_nch_N.pdbwhere inpfile is the name of the input system (see SPRIG Keywords) and N corresponds to the number of chains in the system. If LigninBuilder flag is ON, it should also generate an outputtopfile of the forminpfile_nch_N.top. Make sure theprmfile is in the folder (casenum_ID) for generatingtopfiles.If the
pdbfile(s) is (are) not generated, please see the input constraints. Most likely a particular residue (or patch) is incompatible. Please make sure thatLigninBuilderis added to~\.vmdrcbefore executing this command. -
Step 3: If the intent of the user is to generate an initial structure which do not have overlapping atoms, then the
minimizestructuremodule inLigninBuilderis necessary. To this end, issue from command line,vmd -dispdev text -e step3.tclOR
from
Tk ConsoleinVMDissuesource step3.tclFor some cases, the codes would require to use
findmissingterms.pyin LigninBuilder (seeLigninBuildermodule for more details). If the run is succesful, this will generate apdbandpsffile which are compatible withGROMACS/NAMD
NOTE: It is the user's responsibility to check whether the
parameters match the atomnames (atomtypes) in the psf/pdb
files.
In this section, we look at the different keywords that are needed to generate a polydisperse input structure.
- Some keywords are optional and are prefixed with (Optional) while introducing the keyword.
- A space/tab should be present between the keywords and arguments or between arguments.
- A new line can start with an optional
#. These lines will be ignored. However,#cannot be used in the middle of a line. - The input file name (and file names used as arguments) cannot be
specified as
Noneornone. These are reserved keywords within the program.
-
case_num (Optional)
All input files can start with an optional
case_numkeyword. If this is used as a keyword, it should be the first keyword in the input file. Usage:case_num caseIDcaseIDshould be a positive integer. This will create a folder of the namecasenum_caseIDwhere all the output files will be present. Default value forcaseIDis 1.Example:
case_num 1 -
biomass_type
This is a mandatory keyword and corresponds to the prefix for output file. Usage:
biomass_type argnameThe final tcl files generated will be of the form argname_1_nch.tcl
, wherench` corresponds to number of chains in the system.Example:
biomass_type switchgrass -
num_resids
This is a mandatory keyword and corresponds to the average number of residues per chain (segment). Usage:
num_resids nreswhere
nrescorresponds to the average number of residues per chain (segments or monomers per chain). Should be an integer value.Example:
num_resids 20 -
num_chains
Mandatory keyword corresponding to the number of chains in the system. Usage:
num_chains nchwhere
nchcorresponds to the number of chains in the system (integer value).Example:
num_chains 10 -
disperse (Optional)
This keyword dictates the polydispersity of the system. If this option is not provided, chains are assumed as monodisperse by default. If the option is provided, chains will be drawn from a Schulz-Zimm distribution. There are two options (and suboptions) for this case. Usage:
disperse maketype optkeywords optargsmaketypecan be eitherSZTHEORY,EXPTDATAorREADDATA.SZTHEORYgenerates a set of polydisperse chains using theoretical Schulz-Zimm distribution (see below for options).EXPTDATAgenerates a set of polydisperse chains according to the experimental data (curves) for the molecular weight distribution of chains (see below).READDATAreads a file containing the molecular weights (degree of polymerization) of all the chains from a file (see below for format). Arguments for each option are elaborated below.Examples:
disperse SZTHEORY 1.50 polydisp.inp 10000 disperse sztheory 1.50 polydisp.inp 1000 pditol 8.0 mwrange 20 disperse readdata molwtdata.dat disperse exptdata exptdata.dat mwmonomer 180 ntrials 20 disperse EXPTDATA WTdata.dat mwmonomer 200 ntrials 1000 pditol 3-
SZTHEORYFor this case a new file will be generated according to the polydispersity value and the number of chains/number of residues per chain using a Schulz-Zimm distribution. Usage for this option is as follows:disperse sztheory PDIval Outputfile ntrials pditol tolerance mwrange rangevalPDIvalandOutputfilecorresponds to the target polydispersity value and the output file containing the molecular weights (degree of polymerization) of each chain.PDIvalis the target dispersity index and is defined as the ratio between the weight average molecular weight and number average molecular weight. This number *should be greater than 1.0.Outputfileis the name of the file that is generated where the degree of polymerization of each chain is written out.ntrialscorrespond to the number of trials the program attempts to generate the polymer chains within the PDI tolerance limit. Values between 1000-10000 should be enough for most cases.With the pditol keyword, user can also specify an optional tolerance value (0-100). This corresponds to the maximum relative error (in %) between the target PDI value and the simulated PDI. Different combinations will be tried to obtain either the target PDI value of the system. Default value is 5. For all practical purposes values between 5 and 15 yield good output distribution if the number of chains in the system is less than 20. This keyword and the corresponding argument is optional.
With the
mwrangekeyword, user can specify the maximum range of molecular weights that will be used to create the Schulz-Zimmm distribution. Plot the theoretical curve to see where the distribution tapers to zero. Most likely the default value of 5 would suffice. This keyword and the corresponding argument is optional.If this option is used, after running the program, a file with the name 'geninp_pdistruct.txt' will be generated and it will contain the details of the inputs.
-
EXPTDATAWith this option, users can input the molecular weight distribution obtained from experiments. The program then uses this molecular weight distribution data to generate chains for simulations. Usage:disperse EXPTDATA inpfilename mwmonomer 200 ntrials 1000 pditol 3where
inpfilenamecorresponds to the molecular weight distribution data file. Theinpfilenameshould contain only two columns of data. Further, one of the columns should have a headermolwtwhich corresponds to the molecular weights of the sample distribution.For the second column, the code accepts one of the following three options:
wlogmw,wmw, orpmw.wlogmwcorresponds to the distribution data (w(logM) = dm/dlog(M)); where m is the total mass and M is the molecular weight of the chains.wmwcorresponds to the distribution data (w(M) = dm/dM = dm/(M*dlog(M)) = w(logM)/M).wmwis the weight averaged probability distribution.pmwcorresponds to the number averaged probability distribution and is related to the other quantities through, p(M) = w(M)/M = w(logM)/$M^2$Users should input only one of the above three options for the distribution. Most likely, experimental data are reported in w(logM), wheras other options are more common in theory/computation literature.
Keywords
mwmonomer,ntrialsandpditolare optional.mwmonomercorresponds to the average molecular weight of one monomer in g/mol. Default value is 200 g/mol.ntrialscorresponds to the number of attempts, random samples are drawn from the experimental distribution before both the average number molecular mass and the PDI converges to a tolerance ofpditolof the experimental distribution.pditolis the relative tolerance in % (between 0 and 100%). Default values forntrialsandpditolare 100000 and 5%, respectively. -
READDATAUsers can also specify a file where the degree of polymerization of each chain is specified. In this case, the program will directly read this file and create the segments. Usage:disperse READDATA inpfilenamewhere
inpfilenameis the name of the file. Theinpfilenameshould have the following structure. First line *should have the following structure:num_chains nchainswhere nchains correspond to the number of chains in the system. This should be consistent with the
num_chainsin the input file used to rungenconf.py. The nextnlines should correspond to the degree of polymerization of thendifferent chains.
-
-
top_ipfile
Mandatory keyword and the argument corresponds to the path to the topology file. It is the user's responsibility to check whether the residues generated have their monomer structure in the topology file. User should also provide the full path to the topology file. Default assumption is that the file is present in the path from which
genconf.pyis called. Usage:top_ipfile filenameExample:
top_ipfile top_lignin.top -
resid_inp
Mandatory keyword and the argument corresponds to the average probability of each residue in the system. It should be provided in a file with each line corresponding to the residue name and the average probability. Users should make sure that the residue name matches with the residue name in the topology file. Usage:
resid_inp filenameExample for formatting filename:
SYR 0.4 TRCN 0.05 GUAI 0.3 PCA 0.15 FERUT 0.1NOTE: The sum of the probabilities need not be one. Code internally makes the sum to be one. However, a warning will be issued if the sum is not one. The inputs should contain the details for the branch (graft) monomers or else the code will not recongnize any branch monomer.
-
patch_inp
Mandatory keyword and the argument corresponds to the average probability of each patch in the system. It should be provided in a file with each line corresponding to the residue name and the average probability. Users should make sure that the patch name matches with the patch name in the topology file. Usage:
resid_inp filenameExample for formatting filename:
BO4 0.8 B5 0.1 BB 0.05 AO4 0.05NOTES:
-
The sum of the probabilities need not be one. Code internally makes the sum to be one. However, a warning will be issued if the sum is not one.
-
The inputs should NOT contain the details for the branch (graft) patches. DO NOT provide patch details for branch monomers here. This is different from inputting residues where the name of the branched residue should be present.
-
If you are using SPRIG with LigninBuilder, please be aware that residues for which there exists equal probability for the tacticities (e.g. BO4R and BO4L for BO4), DO NOT give separate probabilities for each stereoisomer. Provide the overall probability. LigninBuilder will make sure that all the stereoisomers have equal probability.
-
-
seg_name
Mandatory keyword which corresponds to the name of the segment. Except for one case (see NOTE below), this will serve as the prefix for segment names for different chains in the
psffile output. Usage:seg_name argnameThe output
psfname will have segment names of the formargname_chainIDwherechainIDis an integer varying from 1 to number of chains in the system.NOTE: In case, an input PDB file is given to generate the PDB file using
genconf.pyand NOT LigninBuilder, users must make sure that the segment name matches the segment name in the input PDB file that is used to generate the initial guesses for the initial coordinates (ICs). -
op_style (Optional)
Keyword dictating the output style. There are two argument options --
singleandmulti. Usage:op_style single op_style multi 4For the argument
single, a single ouputtclfile will be generated per chain. On running this withVMDorLigninBuilder, you can produce the final structure. However, the final structure may have unphysical bonds. It is the user's responsibility to check this. If the user is combining this code withLigninBuilderplease usesingleoption sinceLigninBuilderhas capabilities to untangle unphysical bonds.For the argument
multi, the program breaks down thetclfiles into smallertclfiles. This requires an extra integer argument. Let us say that we are creating a polymer with degree of polymerization 20 and the extra argument is 4, thetclfile corresponding to this chain will have 5 different builds. First, the first four segments of the chain are built. Then NAMD is called to minimize the structure. The minimized structure is then used as an input to generate the next 4 residues -- so on and so forth. This requiresNAMDpath to be added correctly or else runningtclfile may encounter errors.NOTE: Use the option
singlewithLigninBuilder. Default issingle. -
branching (Optional)
To define branching of main chain. Branches are single monomer long in the current mode. The program can manage multiple types of branches. Usage:
branching 1 branch1 patch1 branch2 patch2 ...The keyword
branchingshould be followed by an integer 1 or 0. 1 corresponds to turning on the branch and 0 corresponds to no branch. This gives the user to toggle between branched and non-branched system easily.The branch1/patch1 pair corresponds to the name of the branch residue and the patch connecting the branch with the backbone. The inputs should always be given as pairs. Users must make sure that the residue names are already present in the input list for residues.
Example:
branching 1 PCA GOG FERUT GOGNOTE: Since, by construction, the number of patches equal to the number of residues in the system, the final probabilities for the patch values may not reflect the input values.
-
nattempts (Optional)
Number of attempts to achieve a random configuration that corresponds to the input probabilities for residues and patches. Each time a better target configuration (smaller residual error) is found, the program saves that configuration. In case, the target residual error is not met within
nattempts, the best configuration along with the residual error will be generated as output. Usage:nattempts intvaluewhere intvalue is the number of attempts. A value between 50 and 200 for an average degree of polymerization of 30 works generates a target onfiguration in a few minutes.
Example:
nattempts 60Default value for
nattemptsis 50. -
tol (Optional)
Relative tolerance between the input probabilities for residues/patches and averaged output values for residues/patches. L2norm is used to calculate the relative error. Program runs until the error is less than tolerance value or the number of attempts exceeds
nattempts. Usage:tol tolvalwhere
tolvalis a number between 0 and 1. Nominal values are between 0.05 and 0.15. Default value is 0.1. -
terminator (Optional)
Use this option to make sure that certain fraction of the chains end with this type of residue. This residue should be present in the input residue list. The final fraction of this residue will have an average probability closer to the value input to the program. Usage:
terminator resnamewhere
resnameis the name of the residue.Example
terminator TRCN -
LigninBuilder (Optional)
Use this option to generate output files that can be used in conjunction with
LigninBuildersoftware to generate*.topfiles forGROMACSsoftware. Usage:LigninBuilder filenamewhere
filenamecorresponds to the file that contains the details of the potentials. Usually this will have a.prmextension. Please make sure to provide the full path to the fil unless it is in same directory asgenconf.py.Example:
LigninBuilder par_lignin.prm -
clean_directories (Optional)
Use this option to clean the existing output directory (
casenum_ID) and replace with new files. Usage:clean_directories YArguments can be
Y(yes) orN(no). Default isN.WARNING: All files will be deleted before the new output files are written into the directory.
-
patch_patch_constraint (Optional)
Optional argument to specify the constraints between adjacent patches. This is useful to let the program know that patches (linkers) cannot be next to each other. For instance, a patch (linker) of type
$\beta$ -5 cannot be followed by a 55 patch (linker) since the 5th position is already occupied. The constraints need to be specified in a separate file. Usage:patch_patch_constraint <filename>where
<filename>contains the patch and the constraints. A sample example of this file will be as follows:55 55 5B BB BB GOG B5 BO4 B5 55 GOG BB BO4 O4B 4O5The first entry of each row should be the first of the two consecutive patches. Next 'n' entries of the row should contain all the patches that are incompatible with the first entry.
Rules of making a new row is as follows: first element in every row corresponds to the patch between residues
iandi+1whereas the other elements in the rows correspond to the EXCLUDED patch values between residuei+1and residuei+2. For instance, let us say that aBO4patch between residues 1 and 2, this DOES NOT preclude having the patch BO4 between 2 and 3. Therefore, for the row ofBO4, there should not be BO4 in the EXCLUDED values. In other words, the number of occurences ofBO4for the row ofBO4should be exactly 1. Now, for a different case, let us say that a patchBBis present between residues 1 and 2. Therefore, the$\beta$ position of residue 2 is filled and we CANNOT have a patch which starts with "\beta" such as "BB" or "B5" or "BO4". In this case BB should be repeated twice (including the first column of that row).Finally, if there are branches (grafts) present, one needs to take extra care to include the exclusions.
NOTES
-
Each row is independent. In other words, if
pat_2is incompatible withpat_1does not mean thatpat_1is incompatible withpat_2. This incompatibility (if necessary) should be specified in separate rows. -
The patch names should exactly match the names of the patches in the input patch list given using
patch_inp. If the patch name in the input does not match what is given in the list, it will be ignored. However, it is OK for this file to have the name of the patches that is not present in the input given usingpatch_inp. -
Although this is an optional argument, almost always there will be restrictions for real sytems. It is the user's responsiblity to make sure that all the constraints are given to the system.
-
-
patch_res_constraint (Optional)
Optional argument to specify the constraints between a patch (linker) and a residue. This is useful to let the program know that certain patches cannot succeed (or precede) a certain residue. For instance, a
55patch can neither precede nor succeed a syringyl residue. However,405can precede a syringyl residue whereas it cannot succeed a syringyl residue. Similar to thepatch_patch_constraintinput, the incompatibility data should be input as a file. Usagepatch_res_constraint <filename>where
<filename>corresponds to the name of the file comprising the incompatibilities. There are certain keywords that should be present in this file. Keywords are case-sensitive.Example:
patch restrict_before restrict_after 55 SYR SYR B5 None SYR 55 TRCN TRCN B5 TRCN TRCNNOTES
-
The first line of this file should have the following keywords:
patch,restrict_beforeandrestrict_after. -
Each row should contain only 3 entries. The first entry is the name of the patch. The entry under
restrict_beforeshould correspond to the restriction that the patch cannot precede a residue and the entry belowrestrict_aftershould correspond to the restriction that the patch cannot succeed a residue. In the above example55can neither precede nor succeed aSYRresidue. SimilarlyB5can neither precede nor succeed aTRCNresidue. However,B5can come beforeSYRwhere as it cannot come afterSYR. -
Use
Nonekeyword (case-sensitive), if a patch can precede (but not succeed) or viceversa a given residue. By default, there are no restrictions. In other words, for all the patches mentioned in the input will haveNoneas default argument for incompatibility. -
If a patch cannot come before (or after) two residues, it should be specified in two separate lines. For instance, in the example above, patch
55should be repeated twice for the program to know that it has incompatibility with bothSYRandTRCN. -
Although this is an optional argument, almost always there will be restrictions for real sytems. It is the user's responsiblity to make sure that all the constraints are given to the system.
-
-
gen_packmol (to be added)
-
namd_inp (to be added)
-
pdb_ipfile (to be added)