We use uv to manage dependencies and the project environment.
Clone the GitHub repository:
git clone https://github.com/MDverse/mdverse_entity_norm.git
cd mdverse_entity_norm
Sync dependencies:
uv sync

This project implements the normalisation step for molecular dynamics entities. Currently, normalisation is implemented for temperatures and grounding is implemented for molecules. Both processes are performed by the scripts located in the src/mdverse_entity_norm/scripts directory. Each script handles a specific type of entity and can be executed independently. Results are saved in the results directory, which is created if it does not already exist. The output files are in TSV format and contain the original entities and their corresponding normalized or grounded values, along with any relevant metadata such as confidence scores or error codes.
To normalize temperature entities, run:
uv run src/mdverse_entity_norm/scripts/normalize_temperature.py
This command generates a file named normalized_temperature.tsv in the results directory, containing the normalized temperature entities. The file has two columns: original_value, the original temperature entity, and normalized_value, the same temperature expressed in degrees Celsius.
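
As a quick way to inspect this output, the sketch below reads the two columns with Python's standard csv module. It assumes the TSV starts with a header row naming original_value and normalized_value; the helper name read_normalized_temperatures and the sample row are hypothetical and are not output of the script.

```python
import csv
import io


def read_normalized_temperatures(handle):
    """Return (original_value, normalized_value) pairs from a TSV file handle.

    Assumes the first row is a header with these two column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["original_value"], row["normalized_value"]) for row in reader]


# Hypothetical sample content, for illustration only (not real script output).
sample = "original_value\tnormalized_value\n300 K\t26.85\n"

pairs = read_normalized_temperatures(io.StringIO(sample))
for original, normalized in pairs:
    print(f"{original} -> {normalized}")
```

To read the real file, pass a handle opened with open("results/normalized_temperature.tsv", newline="", encoding="utf-8") instead of the in-memory sample.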
The logic behind the grounding of molecule entities is described in the image below:

To ground molecule entities, run:
uv run src/mdverse_entity_norm/scripts/ground_molecule.py --mol_filepath data/MOL.txt --grounded_mol_filepath results/grounded_molecules.tsv --non_grounded_mol_filepath results/non_grounded_molecules.tsv
This command generates two files in the results directory: grounded_molecules.tsv and non_grounded_molecules.tsv. The grounded_molecules.tsv file contains the grounded molecule entities with their corresponding identifiers, while the non_grounded_molecules.tsv file contains the molecule entities that could not be grounded.
The grounded_molecules.tsv file has six columns:
Entity_name: the original molecule name,
Database: the name of the database in which the molecule was found,
ID: the molecule ID in that database,
Score: the confidence score,
Name: the full name of the molecule,
nb_res: the number of results found.
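
The six columns above can be used to keep only the more confident groundings. The following is a minimal sketch, assuming the file has a header row with these column names, that Score is numeric, and that higher scores mean higher confidence. The function name filter_by_score, the threshold, and the sample rows (database name, IDs, scores) are hypothetical.

```python
import csv
import io


def filter_by_score(handle, min_score):
    """Return the rows whose Score column is at least min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            score = float(row["Score"])
        except (TypeError, ValueError):
            continue  # skip rows whose Score is missing or not numeric
        if score >= min_score:
            kept.append(row)
    return kept


# Hypothetical sample content, for illustration only (not real script output).
sample = (
    "Entity_name\tDatabase\tID\tScore\tName\tnb_res\n"
    "POPC\tExampleDB\tX0001\t0.95\tExample name A\t3\n"
    "TIP3P\tExampleDB\tX0002\t0.40\tExample name B\t1\n"
)

confident = filter_by_score(io.StringIO(sample), min_score=0.5)
for row in confident:
    print(row["Entity_name"], row["Database"], row["ID"], row["Score"])
```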
The non_grounded_molecules.tsv file has two columns:
Entity_name: the original molecule name that could not be grounded,
error: the error code obtained during the grounding process.
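
To see which failures dominate, the error column can be tallied. This is a small sketch under the assumption that non_grounded_molecules.tsv has a header row with Entity_name and error. The helper count_errors and the sample error codes are hypothetical; the actual codes produced by the script are not documented here.

```python
import csv
import io
from collections import Counter


def count_errors(handle):
    """Count how many non-grounded entities share each error code."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["error"] for row in reader)


# Hypothetical sample content, for illustration only (not real script output).
sample = "Entity_name\terror\nfoo\t404\nbar\t404\nbaz\t500\n"

counts = count_errors(io.StringIO(sample))
for code, n in counts.most_common():
    print(f"{code}\t{n}")
```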