Babel

Introduction

The Biomedical Data Translator integrates data across many data sources. One source of difficulty is that different data sources use different vocabularies. One source may represent water as MESH:D014867, while another may use the identifier DRUGBANK:DB09145. When integrating, we need to recognize that both of these identifiers are identifying the same concept.

Babel integrates the specific naming systems used in the Translator, creating equivalent sets across multiple semantic types following the conventions established by the Biolink Model. Each semantic type (such as biolink:SmallMolecule) requires specialized processing, but in each case, a JSON-formatted compendium is written to disk. This compendium can be used directly, but it can also be served by the Node Normalization service or another frontend.

In certain contexts, differentiating between some related concepts doesn't make sense: for example, you might not want to differentiate between a gene and the protein that is the product of that gene. Babel provides different conflations that group cliques on the basis of various criteria: for example, the GeneProtein conflation combines a gene with the protein that that gene encodes.

While generating these cliques, Babel also collects all the synonyms for every clique, which can then be used by tools like Name Resolver (NameRes) to provide name-based lookup of concepts.

Using Babel outputs

What do Babel outputs look like?

Three Babel data formats are available:

Compendium files contain concepts (sets or "cliques" of equivalent identifiers), which include a preferred identifier, Biolink type, list of equivalent identifiers as well as other information about the concept (such as the descriptions, information content valuen and so on).
Synonym files, which don't include the equivalent identifiers for each concept, but do include every known synonym for each concept. These files can be directly loaded into an Apache Solr database for querying. The Name Resolver contains scripts for loading these files and provides a frontend that can be used to search for concepts by label or synonym, or to provide an autocomplete service for Babel concepts.
Conflation files contain the lists of concepts that should be conflated when that conflation is turned on.

How can I download Babel outputs?

You can find out about downloading Babel outputs. You can find a list of Babel releases in the Releases list.

How can I deploy Babel outputs?

Information on deploying Babel outputs is available.

How can I access Babel cliques?

There are several ways of accessing Babel cliques:

You can run the Babel pipeline to generate the cliques yourself. Note that Babel currently has very high memory requirements -- it requires around 500G of memory in order to generate the Protein clique. Information on running Babel is available.
The NCATS Translator project provides the Node Normalization frontend to "normalize" identifiers -- any member of a particular clique will be normalized to the same preferred identifier, and the API will return all the secondary identifiers, Biolink type, description and other useful information. You can find out more about this frontend on its GitHub repository.
The NCATS Translator project also provides the Name Lookup (Name Resolution) frontends for searching for concepts by labels or synonyms. You can find out more about this frontend at its GitHub repository.
Play around with the Babel Downloads (in a custom format), which are currently available in JSONL, Apache Parquet or KGX formats.

What is the Node Normalization service (NodeNorm)?

The Node Normalization service, Node Normalizer or NodeNorm is an NCATS Translator web service to normalize identifiers by returning a single preferred identifier for any identifier provided.

In addition to returning the preferred identifier and all the secondary identifiers for a clique, NodeNorm will also return its Biolink type and "information content" score, and optionally any descriptions we have for these identifiers.

It also includes some endpoints for normalizing an entire TRAPI message and other APIs intended primarily for Translator users.

You can find out more about NodeNorm at its Swagger interface or in this Jupyter Notebook.

What is the Name Resolver (NameRes)?

The Name Resolver, Name Lookup or NameRes is an NCATS Translator web service for looking up preferred identifiers by search text. Although it is primarily designed to be used to power NCATS Translator's autocomplete text fields, it has also been used for named-entity linkage.

You can find out more about NameRes at its Swagger interface or in this Jupyter Notebook.

Understanding Babel outputs

For a detailed explanation of how Babel constructs cliques, chooses preferred identifiers and labels, sources descriptions, and calculates information content values — as well as guidance on reporting incorrect cliques — see docs/Understanding.md.

Running Babel

How can I run Babel?

Babel requires significant memory — around 500 GB to build the largest compendia (Protein and DrugChemical conflated), though smaller compendia need far less. It uses uv for Python dependency management and Snakemake for build orchestration. See docs/RunningBabel.md for detailed instructions, configuration, and Slurm job setup.

Contributing to Babel

If you want to contribute to Babel, start with the Contributing to Babel documentation. This will provide guidance on how the source code is organized, what contributions are most useful, and how to run the tests. For a deeper look at the development workflow and ideas for improving it, see Developing Babel.

Contact information

You can find out more about Babel by opening an issue on this repository, contacting one of the Translator DOGSLED PIs or contacting the NCATS Translator team.

Name		Name	Last commit message	Last commit date
Latest commit History 2,449 Commits
.github		.github
docs		docs
input_data		input_data
kubernetes		kubernetes
releases		releases
scripts		scripts
slurm		slurm
src		src
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.travis.yml		.travis.yml
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Babel

Introduction

Using Babel outputs

What do Babel outputs look like?

How can I download Babel outputs?

How can I deploy Babel outputs?

How can I access Babel cliques?

What is the Node Normalization service (NodeNorm)?

What is the Name Resolver (NameRes)?

Understanding Babel outputs

Running Babel

How can I run Babel?

Contributing to Babel

Contact information

About

Uh oh!

Releases 24

Packages

Uh oh!

Uh oh!

Contributors 9

Languages

Folders and files

Latest commit

History

Repository files navigation

Babel

Introduction

Using Babel outputs

What do Babel outputs look like?

How can I download Babel outputs?

How can I deploy Babel outputs?

How can I access Babel cliques?

What is the Node Normalization service (NodeNorm)?

What is the Name Resolver (NameRes)?

Understanding Babel outputs

Running Babel

How can I run Babel?

Contributing to Babel

Contact information

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 24

Packages 0

Uh oh!

Uh oh!

Contributors 9

Languages

Packages