fix: chain template alignments auth labelling (inference) by jandom · Pull Request #117 · aqlaboratory/openfold-3

jandom · 2026-02-07T15:44:23Z

Summary

Hopefully helps to solve #101

Changes

So far wrote a test that reproduced the failure, and then added a "fix"

Related Issues

Testing

Other Notes

- auth chain ids vs labelled chain ids - add a test that confirms this

openfold3/tests/test_data/template_alignments/colabfold_template.m8

jandom · 2026-02-07T15:45:56Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        label_to_author = get_label_to_author_chain_id_dict(cif_file)
+        author_to_label = {v: k for k, v in label_to_author.items()}
+        label_chain_id = author_to_label[template.chain_id]


This is the remapping from "auth" chain IDs to "label" chain IDs. It optimistically assumes that the inputs are not pathological.

So if there are multiple label IDs mapping to the same author ID, this will always map that author ID to the last label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.

A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
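For illustration, the explicit-sorting suggestion above might look roughly like the sketch below. `invert_label_to_author` is a hypothetical helper, not code from this PR, and keeping the first label ID in sorted order is an arbitrary choice made here only so that the result no longer depends on dict insertion order.

```python
def invert_label_to_author(label_to_author: dict[str, str]) -> dict[str, str]:
    """Invert a label -> author chain ID mapping deterministically.

    Hypothetical helper: when several label IDs share one author ID,
    the first label ID in sorted order is kept.
    """
    author_to_label: dict[str, str] = {}
    for label_id in sorted(label_to_author):
        author_id = label_to_author[label_id]
        # setdefault keeps the first label ID seen for each author ID.
        author_to_label.setdefault(author_id, label_id)
    return author_to_label


# Labels "A" and "B" both carry author ID "A"; label "C" carries author ID "B".
mapping = invert_label_to_author({"B": "A", "A": "A", "C": "B"})
```

Whether "first in sorted order" is the right tie-break for homomeric chains is a separate question; the point of the sketch is only that the tie-break is explicit.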

jnwei

Overall this is great! This was a hard bug to pin down.

A few drive-by comments regarding setting up the tests with the colabfold web services.

jnwei · 2026-02-09T04:12:19Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        template = templates[16]
+        assert template.chain_id == "A" and template.entry_id == "1rnb"
+
+        fetch(


Can we mock this call instead of explicitly calling the RCSB database using fetch?

As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.

Switched to just a cif file as fixture
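As a rough sketch of the mocking approach requested above (an alternative to a fixture file), the network-backed function can be injected and replaced with a `unittest.mock.Mock` in the test. The names `count_atom_records` and `fetch_text` are hypothetical and do not exist in the repository.

```python
from unittest.mock import Mock


def count_atom_records(pdb_id: str, fetch_text) -> int:
    """Count ATOM lines in the text returned by `fetch_text(pdb_id)`.

    `fetch_text` is injected so that a test can pass a stand-in instead
    of a function that contacts a remote server. Hypothetical names.
    """
    text = fetch_text(pdb_id)
    return sum(1 for line in text.splitlines() if line.startswith("ATOM"))


# In a unit test, the network-backed fetcher is replaced with a Mock.
fake_fetch = Mock(return_value="ATOM 1 N\nATOM 2 CA\nHETATM 3 O\n")
n_atoms = count_atom_records("1rnb", fake_fetch)
fake_fetch.assert_called_once_with("1rnb")
```

The test then exercises only the local logic, so service latency or downtime cannot make it fail.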

gnikolenyi

Looks good so far. Added some comments on top of Jennifer's.

One thing I am missing is the actual mapping being done after the colabfold pipeline pulled the templates. I see you have the primitive but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unittests. Could you please also add the remapping to the colabfold pipeline itself?

gnikolenyi · 2026-02-09T22:56:23Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+        label_to_author = get_label_to_author_chain_id_dict(cif_file)
+        author_to_label = {v: k for k, v in label_to_author.items()}
+        label_chain_id = author_to_label[template.chain_id]


So if there are multiple label IDs mapping to the same author ID, this will always map that author ID to the last label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.

A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.

jandom · 2026-02-10T14:40:17Z

@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time.

Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures, but we don't have the tooling set up at the moment.

Could you please also add the remapping to the colabfold pipeline itself?

This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping.

jandom · 2026-02-12T16:49:38Z

Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context

- this only occurs at inference and only when using colabfold
- at training and at manual inference, we provide the correctly formatted templates

Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs.

gnikolenyi · 2026-03-13T04:32:44Z

@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly:

Submitting 5 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 750/750 [elapsed: 00:02 remaining: 00:00]
Downloading template CIFs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:04<00:00,  8.53it/s]
Remapped 4uyk: author A -> label A
Remapped 1914: author A -> label A
Remapped 1e8o: author A -> label A
Remapped 3jaj: author S1 -> label JC
Remapped 5aox: author D -> label D
Remapped 4ue5: author E -> label E
Remapped 1e8s: author A -> label A
Remapped 4uyk: author B -> label B
Remapped 7nfx: author t -> label WA
Remapped 5aox: author B -> label B
Remapped 1914: author A -> label A
Remapped 4uyj: author B -> label B
Remapped 5aox: author E -> label E
Remapped 1e8o: author D -> label D
Remapped 7obr: author t -> label WA
Remapped 1e8o: author B -> label B
Remapped 4ue5: author B -> label B
Remapped 2w9j: author B -> label B
Remapped 2w9j: author A -> label A
Remapped 4gnx: author A -> label A
Remapped 3kdf: author A -> label A
Remapped 7uy6: author F -> label E
Remapped 7lmb: author F -> label G
Remapped 6d6v: author F -> label C
Remapped 4gnx: author B -> label B
Remapped 2pqa: author C -> label C
Remapped 2pqa: author A -> label A
Remapped 1l1o: author E -> label E
Remapped 2z6k: author A -> label A
Remapped 1quq: author A -> label A
Remapped 2pi2: author C -> label C
Remapped 6i52: author B -> label B
Remapped 2z6k: author B -> label B
Remapped 1quq: author C -> label C
Remapped 3kdf: author D -> label B
Remapped 4joi: author B -> label B
Remapped 4joi: author A -> label A
Remapped 7u5c: author F -> label F
Remapped 6w6w: author C -> label D
Remapped 8d0k: author B -> label B
Remapped 8c5y: author B -> label B
Remapped 4gnx: author C -> label C
Remapped 1jmc: author A -> label B
Remapped 1fgu: author B -> label B
Remapped 6i52: author C -> label C
Remapped 1l1o: author C -> label C
Remapped 1l1o: author F -> label F
Remapped 1ynx: author A -> label A
Remapped 8aaj: author A -> label A
Remapped 8c5z: author A -> label A
Remapped 8oej: author A -> label A
Remapped 8oej: author D -> label D
Remapped 6d6v: author D -> label A
Remapped 8c5y: author J -> label J
Remapped 3u50: author C -> label A
Remapped 1o7i: author B -> label B
Remapped 7wcg: author A -> label A
Remapped 3dm3: author B -> label B
Remapped 2k5v: author A -> label A
Submitting 2 paired MSA queries to the Colabfold MSA server...
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
Computing paired MSAs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.92s/it]

For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC

Note the following:

- only up to the first 25 templates are parsed and remapped from auth to label asym ids, to reduce the number of cif files that need to be downloaded; we can expose this as an argument to the runner in a later PR
- the cif files are deduplicated within an inference run and only the non-redundant set of cifs is downloaded, but across multiple runs the cif files are re-downloaded
- the cif files are not reused for the actual template processing, which would require a more significant refactor; both this and the re-download across inference runs are areas we can optimize later
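The first two notes (a per-query cap plus deduplication) can be sketched as below. This is only an illustration: `unique_pdb_ids` is a hypothetical name, and the sketch assumes template IDs of the form `<pdb_id>_<chain>` (for example `1rnb_A`).

```python
def unique_pdb_ids(template_ids: list[str], max_templates: int = 25) -> list[str]:
    """Return deduplicated PDB IDs for the first `max_templates` hits.

    Assumes template IDs look like "<pdb_id>_<chain>", e.g. "1rnb_A".
    """
    seen: dict[str, None] = {}  # a dict preserves first-seen order
    for template_id in template_ids[:max_templates]:
        pdb_id = template_id.split("_")[0]
        seen.setdefault(pdb_id, None)
    return list(seen)


# Only the first three hits are considered; "4uyk" appears twice among them.
ids = unique_pdb_ids(["4uyk_A", "5aox_D", "4uyk_B", "1e8o_A"], max_templates=3)
```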

jnwei

My understanding of the latest update from @gnikolenyi is that there are two main changes:

- We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author-labeled ids. These cifs are stored temporarily.
- The cifs for the templates are downloaded again in the template pipeline for template feature processing.

No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.

jandom · 2026-03-13T08:35:58Z

This is a huge step in the right direction, but without extensive tests I'm quite wary. Will review ASAP.

jandom · 2026-03-18T15:51:46Z

@gnikolenyi I have somewhat changed this code; the biggest change is skipping the CIF download and using the RCSB API to get the mapping instead (phew!).

Outstanding items

- remove the original helper methods for the re-mapping from the inference pipeline (everything is now in the colabfold pipeline)
- fix some of the mypy annotations, just because this code is so messy
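To make the API-based approach concrete, here is a minimal sketch of turning a GraphQL response into an author-to-label mapping. It is not the code from this PR. The query and field names are an assumption about the public RCSB Data API schema (endpoint `https://data.rcsb.org/graphql`) and should be checked against the published schema; `author_to_label_from_response` is a hypothetical helper, and the payload is hand-written to mirror the 3jaj example from the log above.

```python
# Assumed GraphQL query for the RCSB Data API; verify field names before use.
QUERY_TEMPLATE = """
{
  entry(entry_id: "%s") {
    polymer_entities {
      polymer_entity_instances {
        rcsb_polymer_entity_instance_container_identifiers {
          asym_id
          auth_asym_id
        }
      }
    }
  }
}
"""


def author_to_label_from_response(payload: dict) -> dict[str, list[str]]:
    """Collect label (asym) IDs per author chain ID from a response payload."""
    mapping: dict[str, list[str]] = {}
    entry = (payload.get("data") or {}).get("entry") or {}
    for entity in entry.get("polymer_entities") or []:
        for instance in entity.get("polymer_entity_instances") or []:
            ids = instance["rcsb_polymer_entity_instance_container_identifiers"]
            # Keep every label ID so ambiguous author IDs stay visible.
            mapping.setdefault(ids["auth_asym_id"], []).append(ids["asym_id"])
    return mapping


# Hand-written payload in the assumed response shape (not a recorded response).
sample = {
    "data": {
        "entry": {
            "polymer_entities": [
                {
                    "polymer_entity_instances": [
                        {
                            "rcsb_polymer_entity_instance_container_identifiers": {
                                "asym_id": "JC",
                                "auth_asym_id": "S1",
                            }
                        }
                    ]
                }
            ]
        }
    }
}
author_to_label = author_to_label_from_response(sample)
```

Returning a list per author ID, rather than a single label ID, leaves the tie-breaking decision to the caller.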

jandom · 2026-03-23T13:23:42Z

well let's not get too crazy with the mocking horse

- unit tests: everything mocked, don't ping externals but work in a "mocked out world"
- integration tests: ping the actual service and test everything in a more life-like setting
- e2e tests... we'll get there one day!

So I think we want both @gnikolenyi, and this is the direction I'll take

jandom · 2026-03-23T14:42:19Z

openfold3/core/data/primitives/caches/filtering.py



-# TODO: Do this in preprocessing instead to avoid it going out-of-sync with the data?
-def get_model_ranking_fit(pdb_id):


Moved to a new rscb.py module

jandom · 2026-03-23T14:50:27Z

openfold3/core/data/tools/colabfold_msa_server.py

+    per_rep: dict[str, pd.DataFrame] = {}
+    unique_pdb_ids: set[str] = set()
+    for rep_id in rep_ids:
+        m_i = rep_id_to_m[rep_id]
+        if m_i not in m_with_templates:
+            continue
+        chain_alns = template_alignments[template_alignments[0] == m_i]
+        top_n = chain_alns.copy()
+        per_rep[rep_id] = top_n
+        unique_pdb_ids.update(top_n[1].str.split("_").str[0])


I don't love this code – but it's just refactored into a function...

jandom · 2026-03-23T14:51:17Z

openfold3/core/data/tools/colabfold_msa_server.py

@@ -725,47 +808,33 @@ def query_format_main(self):
            # Create empty DataFrame with expected column structure (at least column 0)
            # to match the structure when file is read with header=None.
            logger.warning(


Maybe we should error here

We could raise an error here instead.

My concern is that a user running a large batch of predictions may prefer to be notified about missing templates later, rather than have the workflow interrupted for a few broken examples. Could we think of a better way to record this issue and draw attention to the missing template alignments?

Hard to tell – either way it's out of scope in a way, because it's not related to the bug-fix per se. Should we handle this in another PR? This PR is already 20 files, we're ballooning
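One possible middle ground between warning and raising, sketched here only to illustrate the idea discussed above, is to collect the affected queries and report them once at the end of a batch. `MissingTemplateReport` and the query names are hypothetical and not part of this PR.

```python
import logging

logger = logging.getLogger(__name__)


class MissingTemplateReport:
    """Collect queries without template alignments and summarise them once.

    Hypothetical helper, not part of the PR.
    """

    def __init__(self) -> None:
        self.missing: list[str] = []

    def record(self, query_id: str) -> None:
        # Warn immediately, but keep the batch running.
        logger.warning("No template alignments found for %s", query_id)
        self.missing.append(query_id)

    def summary(self) -> str:
        if not self.missing:
            return "All queries had template alignments."
        joined = ", ".join(self.missing)
        return f"{len(self.missing)} queries without template alignments: {joined}"


report = MissingTemplateReport()
# Illustrative (query ID, number of template hits) pairs.
for query_id, n_hits in [("query_1", 12), ("query_2", 0), ("query_3", 0)]:
    if n_hits == 0:
        report.record(query_id)
summary = report.summary()
```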

jandom · 2026-03-23T14:51:37Z

openfold3/core/data/tools/rscb.py

+
+import logging
+
+import requests


This file is all brand new

nice, I like this separation.

ljarosch · 2026-03-24T22:41:12Z

Hi @jandom and @gnikolenyi, I'll give this a better review later this week. One question already - if we download the CIF anyway (to get the actual template structure), is it actually a good idea to rely on another endpoint (the graphQL query interface)?

One more general issue I could see with our current code:

- It's not entirely clear what template distribution the CF-server provides (it's local to the server and consistent with their PDB70 alignment file). The PDB occasionally updates structure details retrospectively on already released structures, which can break our chain mapping (we've been bitten by this multiple times).
- We officially support custom CF-servers (https://openfold-3.readthedocs.io/en/latest/inference.html#use-a-privately-hosted-colabfold-msa-server), where we have no control over the template structures the CF-server returns (they might be custom, older PDB versions, ...), so those wouldn't work with any RCSB queries.

So a bit of a broader point, but I wonder if instead of relying on all these RCSB endpoints we should just have a self-contained logic that just takes the CIF that the CF-server returns and figures out the mapping, template coordinates, ... from that?
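As a rough illustration of the self-contained alternative raised above, the label-to-author mapping can in principle be read from the `_atom_site` loop of an mmCIF file. The sketch below is deliberately simplistic: `label_to_author_from_cif` is a hypothetical name, the parser assumes whitespace-separated values with no quoted spaces, and the CIF snippet is invented. A real implementation would more likely rely on an established mmCIF parser.

```python
def label_to_author_from_cif(cif_text: str) -> dict[str, str]:
    """Read label_asym_id -> auth_asym_id pairs from the _atom_site loop.

    Simplified: assumes whitespace-separated values without quoted spaces.
    """
    columns: list[str] = []
    mapping: dict[str, str] = {}
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Column declarations of the loop, e.g. "_atom_site.label_asym_id".
            columns.append(line.split(".", 1)[1])
            continue
        if not columns:
            continue  # still before the _atom_site loop
        if not line or line.startswith(("#", "loop_", "_", "data_")):
            break  # end of the _atom_site rows
        row = dict(zip(columns, line.split()))
        mapping[row["label_asym_id"]] = row["auth_asym_id"]
    return mapping


# Invented minimal snippet, not taken from a real PDB entry.
cif_snippet = """data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_asym_id
_atom_site.auth_asym_id
ATOM 1 N A X
ATOM 2 CA A X
ATOM 3 N B Y
#
"""
label_to_author = label_to_author_from_cif(cif_snippet)
```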

jnwei · 2026-03-25T06:09:45Z

openfold3/tests/core/data/tools/test_rscb.py

I think it's fine to have some tests that hit the real RCSB API, as you said, it is important to have some integration tests.

However, given that our group alone runs our CI tests ~10 times a day, possibly more, and that other developers may also run our CI battery of tests, I recommend we add a label to these tests so that we can filter them out of the regular battery of unit tests and run them less frequently. Maybe something like the pytest.mark for the inference integration tests, here: https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_inference_full.py#L35

In the long run, we can set up a cron job that runs this and other integration tests once a day by default, and upon manual trigger by the developer.

Agreed – added pytest-recording to save the responses as "cassettes" (YAML files)
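For reference, the labelling idea from this thread could look like the sketch below. The marker name `integration` and the test name are hypothetical; the marker used in the repository may differ, and a custom marker should be registered in the pytest configuration to avoid unknown-marker warnings.

```python
import pytest


@pytest.mark.integration  # hypothetical marker name; register it in the pytest config
def test_live_service_roundtrip():
    """Placeholder for a test that would contact a real external service."""
    # A real test would call the service here; left empty in this sketch.
```

With such a marker in place, `pytest -m "not integration"` runs everything except the marked tests, and `pytest -m integration` runs only them.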

jnwei · 2026-03-25T06:10:47Z

openfold3/tests/core/data/tools/test_colabfold_msa_server.py

Not reviewed. Will review this file in #145

jnwei · 2026-03-25T06:18:34Z

openfold3/tests/core/data/pipelines/preprocessing/test_template.py

+_TEST_DATA_DIR = Path(openfold3.__file__).parent / "tests" / "test_data"
+
+
+class TestTemplatePreprocessor:


Would it be a big lift to also have a test case to ensure a chain that has a consistent template alignment / author chain id is left unadulterated?

Unfortunately, 1RNB seems to be a small monomeric protein, so it looks like we cannot simply use a different chain of this structure. Perhaps we can revisit this and add another test case if we add more template alignment test cases later.

I'm a little confused here – what test case are we looking for? Something where the author id = label id?

Looking at this test case, it's kind of ugly – it's basically a copy of the inference pipeline (where the fix was originally applied).

The two cases we're concerned about are:

- author-id not the same as label-id
- author-id same as label-id

But these are now covered by these tests:

- test_remap_author_to_label
- TestFetchLabelToAuthorChainIds

I'm going to remove this test; I think it adds nothing.

jnwei · 2026-03-25T06:18:53Z

openfold3/tests/core/data/primitives/structure/test_metadata.py

Nice simple sanity check unit test, I like it.

We could also use VCR for this case – otherwise we're hitting colabfold. But here it's harder – colabfold doesn't produce a single JSON response (like the RCSB API) but instead launches a job, and the user then needs to download and unpack various files.

jnwei · 2026-03-25T06:21:51Z

openfold3/core/data/tools/colabfold_msa_server.py

@@ -725,47 +808,33 @@ def query_format_main(self):
            # Create empty DataFrame with expected column structure (at least column 0)
            # to match the structure when file is read with header=None.
            logger.warning(


We could raise an error here instead.

My concern is that a user running a large batch of predictions may prefer to be notified about missing templates later, rather than have the workflow interrupted for a few broken examples. Could we think of a better way to record this issue and draw attention to the missing template alignments?

jnwei · 2026-03-25T06:23:51Z

openfold3/core/data/tools/rscb.py

+
+import logging
+
+import requests


nice, I like this separation.

jnwei · 2026-03-25T06:32:55Z

One question already - if we download the CIF anyway (to get the actual template structure), is it actually a good idea to rely on another endpoint (the graphQL query interface)? ... I wonder if instead of relying on all these RCSB endpoints we should just have a self-contained logic that just takes the CIF that the CF-server returns and figures out the mapping, template coordinates, ...

It's a good thought towards reducing complexity, and another way to approach the issue. My concern is that it can be tricky to handle the mapping of chains / template coordinates correctly. If RCSB is self-consistent, it might be best to leave the parsing of the chains to the RCSB experts, with a relatively cheap API call.

where we have no control over the template structures the CF-server returns (they might be custom, older PDB versions, ...), so those wouldn't work with any RCSB queries

To me, the question of custom templates / custom servers is a different ball game altogether. For that, I would recommend we revisit the contributed PR #37 to think about how we could support alignments with custom template structures. @gnikolenyi had previously proposed adding support for custom alignments along with custom templates.

jandom · 2026-03-25T12:18:05Z

I'm not a fan of downloading all these cifs – Gergo did that initially in his implementation and had to put in a limit of max 25 templates (why not 50/100?). The API seems simpler.

jnwei

Overall this is great. The pytest-recording is perfect for our use case.

Could you also add a small README / comment to test_rscb.py that describes how to handle the recording, in case it needs to be regenerated?

From their docs looks like it should just be

pytest --record-mode=once test_rscb.py

Or maybe rewrite should be used instead?

jnwei · 2026-03-26T08:51:53Z

openfold3/tests/core/data/tools/test_colabfold_msa_server.py

+
+
+def _make_m8_dataframe(template_ids: list[str], m_index: int = 101) -> pd.DataFrame:
+    """Build a minimal m8-format DataFrame for testing."""


nit: add a small reference to the m8 format description? We have one in our docs https://openfold-3.readthedocs.io/en/latest/template_how_to.html#m8
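As a quick illustration of what an m8 row contains, here is a minimal parsing sketch. It assumes the standard 12-column BLAST-style tabular layout and a `<pdb_id>_<chain>` target field; the linked docs remain the authoritative description. `M8Hit` and `parse_m8_line` are hypothetical names, and the example row is invented.

```python
from dataclasses import dataclass

# Assumed standard 12-column tabular layout.
M8_COLUMNS = (
    "query", "target", "pident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
)


@dataclass(frozen=True)
class M8Hit:
    query: str
    pdb_id: str
    chain_id: str
    evalue: float
    bits: float


def parse_m8_line(line: str) -> M8Hit:
    """Parse one tab-separated m8 row into a small record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(M8_COLUMNS):
        raise ValueError(f"expected {len(M8_COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(M8_COLUMNS, fields))
    # The target is assumed to look like "<pdb_id>_<chain>", e.g. "1rnb_A".
    pdb_id, _, chain_id = row["target"].partition("_")
    return M8Hit(row["query"], pdb_id, chain_id, float(row["evalue"]), float(row["bits"]))


# Invented example row; the values are illustrative, not real alignment output.
example = "101\t1rnb_A\t0.95\t109\t5\t0\t1\t109\t1\t109\t1e-60\t200"
hit = parse_m8_line(example)
```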

jandom added 2 commits February 7, 2026 15:21

fix #101: template chain alignment

561018c

- auth chain ids vs labelled chain ids - add a test that confims this

further tweak, and maybe working now

0787889

jandom requested a review from gnikolenyi February 7, 2026 15:44

jandom self-assigned this Feb 7, 2026

jnwei requested changes Feb 9, 2026

gnikolenyi requested changes Feb 9, 2026

jandom marked this pull request as draft February 10, 2026 14:38

jandom added 3 commits February 10, 2026 15:17

rename templates

c3c14a5

run a linter

080d0d0

review: comments and improvements

e57d303

jandom requested review from gnikolenyi and jnwei February 11, 2026 18:00

simpler code, happier

865ec86

jandom requested a review from ljarosch February 12, 2026 16:47

Add remapping logic to the colabfold pipeline and remove from templat…

b29f1f6

…e pipeline.

jnwei reviewed Mar 13, 2026

refactor the PR slightly

a78196f

jandom requested a review from jnwei March 18, 2026 15:49

jandom marked this pull request as ready for review March 18, 2026 15:50

jandom added the safe-to-test label (internal-only label used to indicate PRs that are ready for automated CI testing) Mar 18, 2026

Merge branch 'public-main' into jandom/2026-02/fix/chain-template-ali…

be1fc4e

…gnments-auth-labelling

jandom added and removed the safe-to-test label Mar 18, 2026

jandom added 3 commits March 23, 2026 14:33

mutualize RSCB API calls and add tests

11420df

use the new rscb.py module in colabfold_msa_server

65c292f

migrate all testst to test_colabfold_msa_server

fbc389a

jandom requested a review from gnikolenyi March 23, 2026 14:41

jandom added and removed the safe-to-test label Mar 23, 2026

jandom added 2 commits March 23, 2026 14:46

remove dead code

8450fa8

zip alignments and a3m_lines with strict=True

4a26bc6

move test files to test_data

78570ff

jandom added and removed the safe-to-test label Mar 23, 2026

jnwei reviewed Mar 25, 2026

jandom added 2 commits March 25, 2026 12:24

review: comments from Jennifer (use pytest-recording to store responses)

f16b160

remove the obsolete test

bf286be

jandom added and removed the safe-to-test label Mar 25, 2026

jandom requested a review from jnwei March 25, 2026 12:44

Merge branch 'main' into jandom/2026-02/fix/chain-template-alignments…

f5cb6b9

…-auth-labelling

jnwei approved these changes Mar 26, 2026

review: comments from Jennifer

1364f45

jandom merged commit 119b2cb into main Mar 26, 2026
2 checks passed

jandom deleted the jandom/2026-02/fix/chain-template-alignments-auth-labelling branch March 26, 2026 11:18



