fix: chain template alignments auth labelling (inference)#117
Conversation
- auth chain ids vs labelled chain ids - add a test that confims this
openfold3/tests/test_data/template_alignments/colabfold_template.m8
Outdated
Show resolved
Hide resolved
| label_to_author = get_label_to_author_chain_id_dict(cif_file) | ||
| author_to_label = {v: k for k, v in label_to_author.items()} | ||
| label_chain_id = author_to_label[template.chain_id] |
There was a problem hiding this comment.
This is the re-mapping from "auth" chains IDs to "label" chain IDs... very wishful in terms of inputs not being pathological
There was a problem hiding this comment.
So if there are multiple label IDs mapping to the same author ID, this will just always map an author ID to the l ast label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.
A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
jnwei
left a comment
There was a problem hiding this comment.
Overall this is great! This was a hard bug to pin down.
A few drive by comments regarding setting up the tests with colabfold web services.
| template = templates[16] | ||
| assert template.chain_id == "A" and template.entry_id == "1rnb" | ||
|
|
||
| fetch( |
There was a problem hiding this comment.
Can we mock this call instead of explicitly calling the RCSB database using fetch?
As this is a unit test, it would be good to remove dependencies on web servers so that we don't have latency issues / failures due to the availability of the service.
There was a problem hiding this comment.
Switched to just a cif file as fixture
openfold3/tests/test_data/template_alignments/colabfold_template.m8
Outdated
Show resolved
Hide resolved
gnikolenyi
left a comment
There was a problem hiding this comment.
Looks good so far. Added some comments on top of Jennifer's.
One thing I am missing is the actual mapping being done after the colabfold pipeline pulled the templates. I see you have the primitive but it is not yet being called in openfold3/core/data/tools/colabfold_msa_server.py or anywhere else outside of the unittests. Could you please also add the remapping to the colabfold pipeline itself?
openfold3/tests/test_data/template_alignments/colabfold_template.m8
Outdated
Show resolved
Hide resolved
| label_to_author = get_label_to_author_chain_id_dict(cif_file) | ||
| author_to_label = {v: k for k, v in label_to_author.items()} | ||
| label_chain_id = author_to_label[template.chain_id] |
There was a problem hiding this comment.
So if there are multiple label IDs mapping to the same author ID, this will just always map an author ID to the l ast label ID. @ljarosch can confirm, but I think this only happens with homomeric chains, so it should be fine. We should just document this behavior.
A way to make this more robust would be to explicitly sort the label_to_author dict when iterating over it, so maybe add that here so we are not relying on the dict ordering for this mapping.
|
@jnwei @gnikolenyi many thanks for the reviews – I'm not sure why this wasn't a draft, this is clearly not ready for prime time. Agreed that we need this to use some fixture files (in an integration test context), but I would also like to have an end-to-end test that does everything. Ideally, we would just have in-memory generated fixtures but we don't have the tooling setup atm.
This would be ideal but it's not possible unless we pull in the cif files, which have the info on the peptide chains to do the mapping. |
|
Wrapping @ljarosch into this PR, because it's quite hairy. Here is some updated context
Colabfold returns "author" chain IDs rather than "labelled" chain IDs, and this PR fixes how these are handled. However, the code now assumes that author chain IDs are provided and may erroneously correct properly provided chain IDs. |
|
@jnwei @jandom Added the template structure download and chain ID remapping logic to the colabfold pipeline and removed it from the template pipeline. See logs below for an example with the remapping printed explicitly: For example, checking 3jaj author S1 -> label JC: https://www.rcsb.org/sequence/3JAJ#JC Note the following:
|
jnwei
left a comment
There was a problem hiding this comment.
My understanding of the latest update from @gnikolenyi is that there are two main changes:
- We download the cifs of the templates in the colabfold alignment process. This is necessary to parse the author labeled ids. These cifs are stored temporarily
- The cifs for the templates are downloaded again in the template pipeline for template feature processing.
No changes / updates were made to the tests. Gergo separately ran an example to process the MSAs of several other examples.
|
This is a huge step in the right direction but without extensive tests, I'm quite weary. Will review ASP. |
…gnments-auth-labelling
|
@gnikolenyi I have somewhat changed this code, the biggest change is skipping the CIF download and using the RSCB API to get the mapping instead (phew!). Outstanding items
|
|
well let's not get too crazy with the mocking horse
So I think we want both @gnikolenyi, and this is the direction I'll take |
|
|
||
|
|
||
| # TODO: Do this in preprocessing instead to avoid it going out-of-sync with the data? | ||
| def get_model_ranking_fit(pdb_id): |
There was a problem hiding this comment.
Moved to a new rscb.py module
| per_rep: dict[str, pd.DataFrame] = {} | ||
| unique_pdb_ids: set[str] = set() | ||
| for rep_id in rep_ids: | ||
| m_i = rep_id_to_m[rep_id] | ||
| if m_i not in m_with_templates: | ||
| continue | ||
| chain_alns = template_alignments[template_alignments[0] == m_i] | ||
| top_n = chain_alns.copy() | ||
| per_rep[rep_id] = top_n | ||
| unique_pdb_ids.update(top_n[1].str.split("_").str[0]) |
There was a problem hiding this comment.
I don't love this code – but it's just refactored into a function...
| @@ -725,47 +808,33 @@ def query_format_main(self): | |||
| # Create empty DataFrame with expected column structure (at least column 0) | |||
| # to match the structure when file is read with header=None. | |||
| logger.warning( | |||
There was a problem hiding this comment.
Maybe we should error here
There was a problem hiding this comment.
We could raise an error here instead.
My concern is if a user is running a large batch of predictions, they may prefer to be notified later about the issue with missing templates, rather than have the workflow interrupted for a few broken examples. We could think of a better way to record this issue and bring attention to the missing template alginments?
There was a problem hiding this comment.
Hard to tell – either way it's out of scope in a way, because it's not related to the bug-fix per se. Should we handle this in another PR? This PR is already 20 files, we're ballooning
|
|
||
| import logging | ||
|
|
||
| import requests |
There was a problem hiding this comment.
This file is all brand new
There was a problem hiding this comment.
nice, I like this separation.
|
Hi @jandom and @gnikolenyi, I'll give this a better review later this week. One question already - if we download the CIF anyway (to get the actual template structure), is it actually a good idea to rely on another endpoint (the graphQL query interface)? One more general issue I could see with our current code:
So a bit of a broader point, but I wonder if instead of relying on all these RCSB endpoints we should just have a self-contained logic that just takes the CIF that the CF-server returns and figures out the mapping, template coordinates, ... from that? |
There was a problem hiding this comment.
I think it's fine to have some tests that hit the real RCSB API, as you said, it is important to have some integration tests.
However, given that our group alone runs our CI tests ~10 times and possibly more in a day, and that other developers may also run our CI battery of tests, I recommend we add a label to these tests so that we can filter these tests and reduce the frequency so that it is not part of the battery of unit tests. Maybe something like the pytest.mark for the inference integration tests, here: https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/tests/test_inference_full.py#L35
In the long run, we can set up a cronjob to run this and other integration tests to run once a day by default, and upon manual trigger by the developer.
There was a problem hiding this comment.
Agreed – added pytest-recordings to save the response as "cassettes" (YAML files) containing the response
| _TEST_DATA_DIR = Path(openfold3.__file__).parent / "tests" / "test_data" | ||
|
|
||
|
|
||
| class TestTemplatePreprocessor: |
There was a problem hiding this comment.
Would it be a big lift to also have a test case to ensure a chain that has a consistent template alignment / author chain id is left unadulterated?
Unfortunately, 1RNB seems to be a small monomer protein, so it looks like we cannot simply use a different chain of this structure. Perhaps we can revisit this and add another test case later if we add more test cases with template alignments later.
There was a problem hiding this comment.
I'm a little confused here – what test case are we looking for? Something where the author id = label id?
There was a problem hiding this comment.
Looking at this test case, it's kind of ugly – it's basically a copy of the inference pipeline (where the fix was originally applied).
The two cases we're concerned about are
- author-id not the same as label-id
- author-id same as label-id
But these are now covered by these tests
- test_remap_author_to_label
- TestFetchLabelToAuthorChainIds
I'm going to remove this test, i think it adds nothing
There was a problem hiding this comment.
Nice simple sanity check unit test, I like it.
There was a problem hiding this comment.
We could also VCR for this case – otherwise we're hitting colabfold. But here it's harder – colabfold doesn't produce a single JSON response (like RSCB API) but instead launches a job and the user needs to then download and unpack various files.
| @@ -725,47 +808,33 @@ def query_format_main(self): | |||
| # Create empty DataFrame with expected column structure (at least column 0) | |||
| # to match the structure when file is read with header=None. | |||
| logger.warning( | |||
There was a problem hiding this comment.
We could raise an error here instead.
My concern is if a user is running a large batch of predictions, they may prefer to be notified later about the issue with missing templates, rather than have the workflow interrupted for a few broken examples. We could think of a better way to record this issue and bring attention to the missing template alginments?
|
|
||
| import logging | ||
|
|
||
| import requests |
There was a problem hiding this comment.
nice, I like this separation.
It's a good thought towards reducing complexity, and another way to approach the issue. My concern is that ti can be tricky to handle the mapping of chains / template coordinates correctly. If RCSB is self-consistent, it might be best to leave the parsing of the chains to the RCSB experts, with a relatively cheap API call.
To me, the question of custom templates / custom servers is a different ball game all together. For that, I would recommend we revisit the contributed PR #37 to think about how we could support alignments with custom template structures. @gnikolenyi had previously proposed adding support for custom alignments along with custom templates. |
|
I'm not a fan of downloading all these cifs – Gergo did that initially in his implementation and had to put in a limit of max 25 templates (why not 50/100?). The API seems simpler. |
jnwei
left a comment
There was a problem hiding this comment.
Overall this is great. The pytest-recroding is perfect for our use case.
Could you also add a small README / comment to test_rscb.py that describes how to handle the recording, in case it needs to be regenerated?
From their docs looks like it should just be
pytest --record-mode=once test_rcsb.py
Or maybe rewrite should be used instead?
|
|
||
|
|
||
| def _make_m8_dataframe(template_ids: list[str], m_index: int = 101) -> pd.DataFrame: | ||
| """Build a minimal m8-format DataFrame for testing.""" |
There was a problem hiding this comment.
nit: add a small reference to the m8 format description? We have one in our docs https://openfold-3.readthedocs.io/en/latest/template_how_to.html#m8
Summary
Hopefully helps to solve #101
Changes
So far wrote a test that reproduced the failure, and then added a "fix"
Related Issues
Testing
Other Notes