
fix: chain template alignments auth labelling (inference)#117

Merged
jandom merged 21 commits into main from
jandom/2026-02/fix/chain-template-alignments-auth-labelling
Mar 26, 2026

Conversation

@jandom
Collaborator

@jandom jandom commented Feb 7, 2026

Summary

Hopefully helps to solve #101

Changes

So far I have written a test that reproduces the failure, and then added a "fix".

Related Issues

Testing

Other Notes

- auth chain ids vs labelled chain ids
- add a test that confirms this
@jandom jandom requested a review from gnikolenyi February 7, 2026 15:44
@jandom jandom self-assigned this Feb 7, 2026
Comment on lines +62 to +64
label_to_author = get_label_to_author_chain_id_dict(cif_file)
author_to_label = {v: k for k, v in label_to_author.items()}
label_chain_id = author_to_label[template.chain_id]
Collaborator Author

This is the re-mapping from "auth" chains IDs to "label" chain IDs... very wishful in terms of inputs not being pathological

Collaborator

So if there are multiple label IDs mapping to the same author ID, this will just always map an author ID to the last label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.

A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
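
A minimal sketch of the order-independent inversion suggested above. `invert_label_to_author` is a hypothetical helper, not code from this PR, and keeping the label ID that sorts first (rather than last) is an arbitrary choice made here for illustration:

```python
from collections import defaultdict


def invert_label_to_author(label_to_author: dict[str, str]) -> dict[str, str]:
    """Invert label->author into author->label, deterministically.

    When several label IDs share one author ID, keep the label ID that
    sorts first, regardless of the insertion order of the input dict.
    """
    labels_by_author: dict[str, list[str]] = defaultdict(list)
    for label_id, author_id in label_to_author.items():
        labels_by_author[author_id].append(label_id)
    # min() makes the result independent of dict iteration order.
    return {author: min(labels) for author, labels in labels_by_author.items()}


# Hypothetical entry in which author chain "A" covers label chains "A" and "B".
mapping = {"B": "A", "A": "A", "C": "D"}
print(invert_label_to_author(mapping))  # {'A': 'A', 'D': 'C'}
```

Note that `min` compares strings lexicographically, so a two-letter label such as "AA" sorts before "B". If a different tie-break is wanted, only the `min(labels)` expression needs to change.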

Contributor

@jnwei jnwei left a comment

Overall this is great! This was a hard bug to pin down.

A few drive by comments regarding setting up the tests with colabfold web services.

template = templates[16]
assert template.chain_id == "A" and template.entry_id == "1rnb"

fetch(
Contributor

Can we mock this call instead of explicitly calling the RCSB database using fetch?

As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.
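
One way to take the network out of such a test is `unittest.mock`. The sketch below is illustrative only: `StructureClient` and `count_atom_records` are hypothetical stand-ins, not the fetch call or test used in this PR.

```python
from unittest import mock


class StructureClient:
    """Hypothetical thin wrapper around a remote structure download."""

    def fetch_cif(self, pdb_id: str) -> str:
        # A real implementation would perform an HTTP request here.
        raise RuntimeError("network access is not available in unit tests")


def count_atom_records(client: StructureClient, pdb_id: str) -> int:
    """Toy consumer: count ATOM lines in the text returned by the client."""
    cif_text = client.fetch_cif(pdb_id)
    return sum(line.startswith("ATOM") for line in cif_text.splitlines())


# Hand-written stand-in for a downloaded file (not real 1rnb content).
FAKE_CIF = "data_DEMO\nATOM 1 N\nATOM 2 CA\nHETATM 3 O\n"


def test_count_atom_records_without_network() -> None:
    # Replace the network-bound method for the duration of the test.
    with mock.patch.object(
        StructureClient, "fetch_cif", return_value=FAKE_CIF
    ) as fake_fetch:
        assert count_atom_records(StructureClient(), "1rnb") == 2
    fake_fetch.assert_called_once_with("1rnb")


test_count_atom_records_without_network()
```

Patching the method on the class keeps the test independent of where the module is imported from. A checked-in fixture file achieves the same isolation without any patching.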

Collaborator Author

Switched to just a cif file as fixture

Collaborator

@gnikolenyi gnikolenyi left a comment

Looks good so far. Added some comments on top of Jennifer's.

One thing I am missing is the actual mapping being done after the colabfold pipeline pulled the templates. I see you have the primitive but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unittests. Could you please also add the remapping to the colabfold pipeline itself?

@jandom jandom marked this pull request as draft February 10, 2026 14:38
@jandom
Collaborator Author

jandom commented Feb 10, 2026

@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time.

Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures but we don't have the tooling setup atm.

> Could you please also add the remapping to the colabfold pipeline itself?

This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping.

@jandom jandom requested review from gnikolenyi and jnwei February 11, 2026 18:00
@jandom jandom requested a review from ljarosch February 12, 2026 16:47
@jandom
Collaborator Author

jandom commented Feb 12, 2026

Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context

  • this only occurs at inference and only when using colabfold
  • at training and at manual inference, we provide the correctly formatted templates

Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs.
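
The hazard described above can be illustrated with a small sketch. `resolve_label_chain_id` and its `assume_author` flag are hypothetical and not part of this PR; the mapping in the example is invented for illustration:

```python
def resolve_label_chain_id(
    chain_id: str, label_to_author: dict[str, str], assume_author: bool = True
) -> str:
    """Return the label chain ID for `chain_id`, flagging the ambiguous case.

    `assume_author` mirrors the assumption discussed above: when an ID is
    valid both as an author ID and as a label ID, treat it as an author ID.
    """
    author_to_labels: dict[str, list[str]] = {}
    for label_id in sorted(label_to_author):
        author_to_labels.setdefault(label_to_author[label_id], []).append(label_id)

    is_label = chain_id in label_to_author
    is_author = chain_id in author_to_labels
    if not is_label and not is_author:
        raise KeyError(f"{chain_id!r} is neither a label nor an author chain ID")
    if is_author and not is_label:
        return author_to_labels[chain_id][0]  # unambiguous author ID
    if is_label and not is_author:
        return chain_id  # already a label ID, leave untouched
    if chain_id in author_to_labels[chain_id]:
        return chain_id  # both readings agree
    # Genuinely ambiguous: the two readings point at different chains.
    return author_to_labels[chain_id][0] if assume_author else chain_id


# Hypothetical entry: "D" is label chain D (author G) and also author chain D (label A).
mapping = {"A": "D", "B": "E", "C": "F", "D": "G"}
print(resolve_label_chain_id("F", mapping))                       # C
print(resolve_label_chain_id("D", mapping))                       # A
print(resolve_label_chain_id("D", mapping, assume_author=False))  # D
```

The last two calls show why a correctly supplied label ID can be "corrected" to the wrong chain: without knowing which convention the input uses, the ID "D" alone cannot be disambiguated.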

@gnikolenyi
Collaborator

@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly:

Submitting 5 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 750/750 [elapsed: 00:02 remaining: 00:00]
Downloading template CIFs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:04<00:00,  8.53it/s]
Remapped 4uyk: author A -> label A
Remapped 1914: author A -> label A
Remapped 1e8o: author A -> label A
Remapped 3jaj: author S1 -> label JC
Remapped 5aox: author D -> label D
Remapped 4ue5: author E -> label E
Remapped 1e8s: author A -> label A
Remapped 4uyk: author B -> label B
Remapped 7nfx: author t -> label WA
Remapped 5aox: author B -> label B
Remapped 1914: author A -> label A
Remapped 4uyj: author B -> label B
Remapped 5aox: author E -> label E
Remapped 1e8o: author D -> label D
Remapped 7obr: author t -> label WA
Remapped 1e8o: author B -> label B
Remapped 4ue5: author B -> label B
Remapped 2w9j: author B -> label B
Remapped 2w9j: author A -> label A
Remapped 4gnx: author A -> label A
Remapped 3kdf: author A -> label A
Remapped 7uy6: author F -> label E
Remapped 7lmb: author F -> label G
Remapped 6d6v: author F -> label C
Remapped 4gnx: author B -> label B
Remapped 2pqa: author C -> label C
Remapped 2pqa: author A -> label A
Remapped 1l1o: author E -> label E
Remapped 2z6k: author A -> label A
Remapped 1quq: author A -> label A
Remapped 2pi2: author C -> label C
Remapped 6i52: author B -> label B
Remapped 2z6k: author B -> label B
Remapped 1quq: author C -> label C
Remapped 3kdf: author D -> label B
Remapped 4joi: author B -> label B
Remapped 4joi: author A -> label A
Remapped 7u5c: author F -> label F
Remapped 6w6w: author C -> label D
Remapped 8d0k: author B -> label B
Remapped 8c5y: author B -> label B
Remapped 4gnx: author C -> label C
Remapped 1jmc: author A -> label B
Remapped 1fgu: author B -> label B
Remapped 6i52: author C -> label C
Remapped 1l1o: author C -> label C
Remapped 1l1o: author F -> label F
Remapped 1ynx: author A -> label A
Remapped 8aaj: author A -> label A
Remapped 8c5z: author A -> label A
Remapped 8oej: author A -> label A
Remapped 8oej: author D -> label D
Remapped 6d6v: author D -> label A
Remapped 8c5y: author J -> label J
Remapped 3u50: author C -> label A
Remapped 1o7i: author B -> label B
Remapped 7wcg: author A -> label A
Remapped 3dm3: author B -> label B
Remapped 2k5v: author A -> label A
Submitting 2 paired MSA queries to the Colabfold MSA server...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
Computing paired MSAs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.92s/it]

For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC

Note the following:

  • only up to the first 25 templates are parsed and remapped from auth to label asym ids to reduce the amount of cif files that need to be downloaded - we can expose this as an argument to the runner in a later PR
  • the cif files are deduplicated within an inference run and only the non-redundant set of cifs are downloaded, but across multiple runs, the cif files are re-downloaded
  • the cif files are not reused for the actual template processing, which would require a more significant refactor - both this and the re-download across inference runs are areas which we can optimize later
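
The truncation and deduplication described in these notes can be sketched in plain Python. `select_unique_pdb_ids` is a hypothetical helper; it assumes template IDs of the form `<pdb_id>_<chain_id>`, as in the m8 hits, and the default of 25 only mirrors the limit mentioned above:

```python
def select_unique_pdb_ids(
    hits_per_query: dict[str, list[str]], max_templates: int = 25
) -> set[str]:
    """Collect the non-redundant PDB IDs among the top hits of every query."""
    unique_pdb_ids: set[str] = set()
    for template_ids in hits_per_query.values():
        # Only the first `max_templates` hits of each query are considered.
        for template_id in template_ids[:max_templates]:
            pdb_id, _, _chain_id = template_id.partition("_")
            unique_pdb_ids.add(pdb_id)
    return unique_pdb_ids


# Template IDs taken from the log above; the grouping into queries is invented.
hits = {
    "query_1": ["4uyk_A", "1e8o_A", "4uyk_B"],
    "query_2": ["1e8o_D", "3jaj_S1"],
}
print(sorted(select_unique_pdb_ids(hits)))                   # ['1e8o', '3jaj', '4uyk']
print(sorted(select_unique_pdb_ids(hits, max_templates=1)))  # ['1e8o', '4uyk']
```

Each PDB ID in the returned set would then be downloaded once per run. Caching across runs, which the notes leave for later, is not covered by this sketch.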

Contributor

@jnwei jnwei left a comment

My understanding of the latest update from @gnikolenyi is that there are two main changes:

  • We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author labeled ids. These cifs are stored temporarily
  • The cifs for the templates are downloaded again in the template pipeline for template feature processing.

No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.

@jandom
Collaborator Author

jandom commented Mar 13, 2026

This is a huge step in the right direction, but without extensive tests I'm quite wary. Will review ASAP.

@jandom jandom requested a review from jnwei March 18, 2026 15:49
@jandom jandom marked this pull request as ready for review March 18, 2026 15:50
@jandom jandom added the safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. label Mar 18, 2026
@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Mar 18, 2026
@jandom
Collaborator Author

jandom commented Mar 18, 2026

@gnikolenyi I have changed this code somewhat; the biggest change is skipping the CIF download and using the RCSB API to get the mapping instead (phew!).

Outstanding items

  • remove the original helper methods for the re-mapping from the inference pipeline (everything is now in the colabfold)
  • fix some of the mypy annotations, just because this code is so messy
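
For orientation, a lookup of this kind against the RCSB Data API might look like the sketch below. It is not the code added in this PR. The endpoint URL and the GraphQL field names are written from memory of the public RCSB schema and should be checked against the RCSB documentation before use. The function names are hypothetical, and the sample payload is hand-written (only the 3jaj pair S1 -> JC comes from the log earlier in this thread).

```python
import json
import urllib.request

# Assumed public endpoint of the RCSB Data API.
RCSB_GRAPHQL_URL = "https://data.rcsb.org/graphql"

# Assumed field names; verify against the RCSB GraphQL schema.
CHAIN_ID_QUERY = """
query ($entry_id: String!) {
  entry(entry_id: $entry_id) {
    polymer_entities {
      polymer_entity_instances {
        rcsb_polymer_entity_instance_container_identifiers {
          asym_id
          auth_asym_id
        }
      }
    }
  }
}
"""


def parse_label_to_author(payload: dict) -> dict[str, str]:
    """Extract {label_asym_id: auth_asym_id} from a GraphQL response body."""
    mapping: dict[str, str] = {}
    entry = (payload.get("data") or {}).get("entry") or {}
    for entity in entry.get("polymer_entities") or []:
        for instance in entity.get("polymer_entity_instances") or []:
            ids = instance["rcsb_polymer_entity_instance_container_identifiers"]
            mapping[ids["asym_id"]] = ids["auth_asym_id"]
    return mapping


def fetch_label_to_author(pdb_id: str, timeout: float = 10.0) -> dict[str, str]:
    """Query the API for one entry (performs a real HTTP request)."""
    body = json.dumps(
        {"query": CHAIN_ID_QUERY, "variables": {"entry_id": pdb_id.upper()}}
    ).encode()
    request = urllib.request.Request(
        RCSB_GRAPHQL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_label_to_author(json.load(response))


# Hand-written payload in the assumed response shape; no network needed.
SAMPLE_PAYLOAD = {
    "data": {"entry": {"polymer_entities": [{"polymer_entity_instances": [
        {"rcsb_polymer_entity_instance_container_identifiers":
            {"asym_id": "JC", "auth_asym_id": "S1"}},
    ]}]}}
}
print(parse_label_to_author(SAMPLE_PAYLOAD))  # {'JC': 'S1'}
```

Separating the parsing from the HTTP call keeps the parsing testable offline, while `fetch_label_to_author` is the part that integration tests or recorded cassettes would exercise.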

@jandom
Collaborator Author

jandom commented Mar 23, 2026

well let's not get too crazy with the mocking horse

  • unit tests: everything is mocked; they don't ping external services but work in a "mocked-out world"
  • integration tests: ping the actual service and test everything in a more life-like setting
  • e2e tests... we'll get there one day!

So I think we want both @gnikolenyi, and this is the direction I'll take

@jandom jandom requested a review from gnikolenyi March 23, 2026 14:41
@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Mar 23, 2026


# TODO: Do this in preprocessing instead to avoid it going out-of-sync with the data?
def get_model_ranking_fit(pdb_id):
Collaborator Author

Moved to a new rscb.py module

Comment on lines +672 to +681
per_rep: dict[str, pd.DataFrame] = {}
unique_pdb_ids: set[str] = set()
for rep_id in rep_ids:
    m_i = rep_id_to_m[rep_id]
    if m_i not in m_with_templates:
        continue
    chain_alns = template_alignments[template_alignments[0] == m_i]
    top_n = chain_alns.copy()
    per_rep[rep_id] = top_n
    unique_pdb_ids.update(top_n[1].str.split("_").str[0])
Collaborator Author

I don't love this code – but it's just refactored into a function...

@@ -725,47 +808,33 @@ def query_format_main(self):
# Create empty DataFrame with expected column structure (at least column 0)
# to match the structure when file is read with header=None.
logger.warning(
Collaborator Author

Maybe we should error here

Contributor

We could raise an error here instead.

My concern is that a user running a large batch of predictions may prefer to be notified later about missing templates, rather than have the workflow interrupted for a few broken examples. Could we think of a better way to record this issue and bring attention to the missing template alignments?

Collaborator Author

@jandom jandom Mar 25, 2026

Hard to tell – either way it's out of scope in a way, because it's not related to the bug-fix per se. Should we handle this in another PR? This PR is already 20 files, we're ballooning
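
One possible middle ground between warning and raising, as discussed in this thread, is to collect the affected queries and report them once the batch has finished. The sketch below is purely illustrative; `MissingTemplateReport` is a hypothetical class and not part of this PR:

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class MissingTemplateReport:
    """Accumulates queries that ended up without template alignments."""

    missing: dict[str, str] = field(default_factory=dict)

    def record(self, query_id: str, reason: str) -> None:
        # Still warn immediately, but do not interrupt the batch.
        logger.warning("No template alignments for %s: %s", query_id, reason)
        self.missing[query_id] = reason

    def summary(self) -> str:
        if not self.missing:
            return "All queries have template alignments."
        lines = [f"{len(self.missing)} query(ies) without template alignments:"]
        lines += [f"  {qid}: {reason}" for qid, reason in sorted(self.missing.items())]
        return "\n".join(lines)

    def raise_if_any(self) -> None:
        # Optional strict mode: fail once, after the whole batch was processed.
        if self.missing:
            raise RuntimeError(self.summary())


# Invented batch: query IDs and hit counts are placeholders.
report = MissingTemplateReport()
for query_id, n_hits in {"query_1": 12, "query_2": 0}.items():
    if n_hits == 0:
        report.record(query_id, "no template hits returned")
print(report.summary())
```

A caller who prefers the fail-fast behaviour can invoke `raise_if_any()` at the end of the run, so both preferences are served by the same bookkeeping.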


import logging

import requests
Collaborator Author

This file is all brand new

Contributor

nice, I like this separation.

@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Mar 23, 2026
@ljarosch
Collaborator

ljarosch commented Mar 24, 2026

Hi @jandom and @gnikolenyi, I'll give this a better review later this week. One question already - if we download the CIF anyway (to get the actual template structure), is it actually a good idea to rely on another endpoint (the graphQL query interface)?

One more general issue I could see with our current code:

  • It's not entirely clear what template distribution the CF-server provides (it's local to the server and consistent with their PDB70 alignment file). The PDB occasionally retrospectively updates structure details on already released structures which can break our chain mapping (we've been bit by this multiple times)
  • We officially support custom CF-servers https://openfold-3.readthedocs.io/en/latest/inference.html#use-a-privately-hosted-colabfold-msa-server, where we have no control over the template structures the CF-server returns (they might be custom, older PDB versions, ...), so those wouldn't work with any RCSB queries

So a bit of a broader point, but I wonder if instead of relying on all these RCSB endpoints we should just have a self-contained logic that just takes the CIF that the CF-server returns and figures out the mapping, template coordinates, ... from that?
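
To make the self-contained alternative concrete, the label-to-author mapping can in principle be read from the `_atom_site` table of the CIF itself. The sketch below is a deliberately naive, hypothetical parser. It assumes a looped `_atom_site` category with whitespace-separated values and no quoted fields containing spaces; real code would use a proper mmCIF parser.

```python
def label_to_author_from_cif(cif_text: str) -> dict[str, str]:
    """Map label_asym_id -> auth_asym_id using the _atom_site loop."""
    columns: list[str] = []
    rows: list[list[str]] = []
    in_table = False
    for raw_line in cif_text.splitlines():
        line = raw_line.strip()
        if line.startswith("_atom_site."):
            # Header line such as "_atom_site.label_asym_id".
            in_table = True
            columns.append(line.split(".", 1)[1].split()[0])
        elif in_table:
            # A blank line, "#", or a new category/loop/data block ends the table.
            if not line or line[0] in "#_" or line.startswith(("loop_", "data_")):
                break
            rows.append(line.split())

    if "label_asym_id" not in columns or "auth_asym_id" not in columns:
        raise ValueError("no _atom_site loop with label/auth asym IDs found")
    label_idx = columns.index("label_asym_id")
    auth_idx = columns.index("auth_asym_id")

    mapping: dict[str, str] = {}
    for fields in rows:
        if len(fields) == len(columns):  # skip rows this naive split cannot handle
            mapping.setdefault(fields[label_idx], fields[auth_idx])
    return mapping


# Hand-written miniature file, not a real PDB entry.
DEMO_CIF = """data_DEMO
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_asym_id
_atom_site.auth_asym_id
ATOM 1 N A S1
ATOM 2 CA A S1
ATOM 3 N B t
#
"""
print(label_to_author_from_cif(DEMO_CIF))  # {'A': 'S1', 'B': 't'}
```

Because the mapping is read from the same file that supplies the template coordinates, it cannot drift out of sync with a later RCSB revision. The cost is that the file has to be available at the point where the remapping happens.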

Contributor

I think it's fine to have some tests that hit the real RCSB API, as you said, it is important to have some integration tests.

However, given that our group alone runs our CI tests ~10 times and possibly more in a day, and that other developers may also run our CI battery of tests, I recommend we add a label to these tests so that we can filter these tests and reduce the frequency so that it is not part of the battery of unit tests. Maybe something like the pytest.mark for the inference integration tests, here: https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_inference_full.py#L35

In the long run, we can set up a cronjob to run this and other integration tests to run once a day by default, and upon manual trigger by the developer.

Collaborator Author

Agreed – added pytest-recording to save the responses as "cassettes" (YAML files).

Contributor

Not reviewed. Will review this file in #145

_TEST_DATA_DIR = Path(openfold3.__file__).parent / "tests" / "test_data"


class TestTemplatePreprocessor:
Contributor

Would it be a big lift to also have a test case to ensure a chain that has a consistent template alignment / author chain id is left unadulterated?

Unfortunately, 1RNB seems to be a small monomeric protein, so it looks like we cannot simply use a different chain of this structure. Perhaps we can revisit this and add another test case once we have more test cases with template alignments.

Collaborator Author

I'm a little confused here – what test case are we looking for? Something where the author id = label id?

Collaborator Author

Looking at this test case, it's kind of ugly – it's basically a copy of the inference pipeline (where the fix was originally applied).

The two cases we're concerned about are

  • author-id not the same as label-id
  • author-id same as label-id

But these are now covered by these tests

  • test_remap_author_to_label
  • TestFetchLabelToAuthorChainIds

I'm going to remove this test; I think it adds nothing.

Contributor

Nice simple sanity check unit test, I like it.

Collaborator Author

We could also use VCR for this case – otherwise we're hitting colabfold. But here it's harder – colabfold doesn't produce a single JSON response (like the RCSB API) but instead launches a job, and the user then needs to download and unpack various files.

@jnwei
Contributor

jnwei commented Mar 25, 2026

> One question already - if we download the CIF anyway (to get the actual template structure), is it actually a good idea to rely on another endpoint (the graphQL query interface)? ... I wonder if instead of relying on all these RCSB endpoints we should just have a self-contained logic that just takes the CIF that the CF-server returns and figures out the mapping, template coordinates, ...

It's a good thought towards reducing complexity, and another way to approach the issue. My concern is that it can be tricky to handle the mapping of chains / template coordinates correctly. If RCSB is self-consistent, it might be best to leave the parsing of the chains to the RCSB experts, with a relatively cheap API call.

> where we have no control over the template structures the CF-server returns (they might be custom, older PDB versions, ...), so those wouldn't work with any RCSB queries

To me, the question of custom templates / custom servers is a different ball game altogether. For that, I would recommend we revisit the contributed PR #37 to think about how we could support alignments with custom template structures. @gnikolenyi had previously proposed adding support for custom alignments along with custom templates.

@jandom
Collaborator Author

jandom commented Mar 25, 2026

I'm not a fan of downloading all these cifs – Gergo did that initially in his implementation and had to put in a limit of max 25 templates (why not 50/100?). The API seems simpler.

@jandom jandom added safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. and removed safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. labels Mar 25, 2026
@jandom jandom requested a review from jnwei March 25, 2026 12:44
Contributor

@jnwei jnwei left a comment

Overall this is great. pytest-recording is perfect for our use case.

Could you also add a small README / comment to test_rscb.py that describes how to handle the recording, in case it needs to be regenerated?

From their docs looks like it should just be

pytest --record-mode=once test_rcsb.py

Or maybe rewrite should be used instead?



def _make_m8_dataframe(template_ids: list[str], m_index: int = 101) -> pd.DataFrame:
    """Build a minimal m8-format DataFrame for testing."""
Contributor

nit: add a small reference to the m8 format description? We have one in our docs https://openfold-3.readthedocs.io/en/latest/template_how_to.html#m8

Collaborator Author

On it

@jandom jandom merged commit 119b2cb into main Mar 26, 2026
2 checks passed
@jandom jandom deleted the jandom/2026-02/fix/chain-template-alignments-auth-labelling branch March 26, 2026 11:18

Labels

safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing.

4 participants