Description
The Bug
_get_within_cds_effect() in malariagen_data/veff.py does not handle
in-frame complex variants — cases where multiple ref bases are replaced
by a different number of alt bases with a net change divisible by 3.
These variants fall through to an unimplemented else branch at lines 428-431
and return a literal "TODO in-frame complex variation (MNP + INDEL)"
string as the effect label, with impact="UNKNOWN".
Affected variant types
Variants that hit this branch satisfy all of the following:
| Condition | Meaning |
|---|---|
| len(ref) > 1 and len(alt) > 1 | Both alleles are multi-base |
| len(ref) != len(alt) | Not a pure MNP |
| (len(alt) - len(ref)) % 3 == 0 | In-frame (reading frame preserved) |
Concrete examples:
| ref | alt | net | description |
|---|---|---|---|
| CA | CAGTT | +3 | 2 bases → 5 bases |
| CAG | CAGTTC | +3 | 3 bases → 6 bases |
| CAGTT | CA | -3 | 5 bases → 2 bases |
| CAGTTC | CAG | -3 | 6 bases → 3 bases |
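The three conditions above can be written as a single predicate. This is a minimal sketch for illustration only: `is_complex_in_frame` is a hypothetical helper, not a function in malariagen_data, and it only inspects allele lengths.

```python
def is_complex_in_frame(ref: str, alt: str) -> bool:
    """Hypothetical helper: True when a variant meets all three conditions."""
    return (
        len(ref) > 1 and len(alt) > 1          # both alleles are multi-base
        and len(ref) != len(alt)               # not a pure MNP
        and (len(alt) - len(ref)) % 3 == 0     # net change keeps the reading frame
    )


# The four example variants from the table above.
examples = [("CA", "CAGTT"), ("CAG", "CAGTTC"), ("CAGTT", "CA"), ("CAGTTC", "CAG")]
for ref, alt in examples:
    print(ref, alt, len(alt) - len(ref), is_complex_in_frame(ref, alt))
```

Each of the four examples satisfies the predicate, while a simple insertion such as C → CAGT or a frameshift such as CA → CAG does not.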
Minimal reproduction
from malariagen_data.veff import null_effect, _get_within_cds_effect
from unittest.mock import MagicMock
REF_SEQ = "ATGCAGTTCAACGGTACCTGAATGATGATG"
ann = MagicMock()
ann.get_ref_seq.side_effect = lambda c, s, e: REF_SEQ[s-1:e].lower()
ann.get_ref_allele_coords.side_effect = lambda c, p, r: (p, p+len(r)-1)
cds = MagicMock()
cds.start, cds.end, cds.strand, cds.ID = 1, 30, "+", "CDS-mock"
base = null_effect._replace(
chrom="chr1", pos=4, ref="CAG", alt="CAGTTC",
vlen=3, ref_start=4, ref_stop=6, strand="+"
)
result = _get_within_cds_effect(ann, base, cds, [cds])
print(result.effect) # 'TODO in-frame complex variation (MNP + INDEL)'
print(result.impact) # 'UNKNOWN'
Root cause
The INDEL/MNP classification block in _get_within_cds_effect() handles
four cases but leaves a fifth unimplemented:
- ✅ Frameshift: (len(alt) - len(ref)) % 3 != 0
- ✅ Simple insertion: len(ref) == 1 and len(alt) > len(ref)
- ✅ Simple deletion: len(alt) == 1 and len(ref) > len(alt)
- ✅ Pure MNP: len(ref) == len(alt)
- ❌ Complex in-frame: multi-base ref, multi-base alt, different lengths, in-frame (not implemented)
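The five cases can be sketched as a standalone decision chain that shows which allele-length combinations reach the final branch. This is not a copy of the code in veff.py: the function name `classify_branch`, the order of the checks, and the returned strings are all assumptions made for illustration.

```python
def classify_branch(ref: str, alt: str) -> str:
    """Hypothetical sketch: name the branch a multi-base variant would reach."""
    net = len(alt) - len(ref)
    if net % 3 != 0:
        return "frameshift"
    if len(ref) == 1 and len(alt) > len(ref):
        return "simple insertion"
    if len(alt) == 1 and len(ref) > len(alt):
        return "simple deletion"
    if len(ref) == len(alt):
        # Assumes SNPs (both alleles one base) are handled before this point.
        return "pure MNP"
    # Everything left is multi-base ref, multi-base alt, different lengths, in-frame.
    return "complex in-frame"


for ref, alt in [("CA", "CAG"), ("C", "CAGT"), ("CAGT", "C"), ("CA", "TG"), ("CAG", "CAGTTC")]:
    print(f"{ref} -> {alt}: {classify_branch(ref, alt)}")
```

Only the last pair falls through every earlier check, which matches the reproduction case with ref="CAG" and alt="CAGTTC".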
Why this matters
Any downstream code consuming effect labels — resistance marker
identification, variant filtering, frequency analysis — silently receives
a garbage string for these variants with no error or warning. This is
particularly relevant for resistance gene analysis on datasets like Ag3,
where complex variants in genes such as Vgsc would be misclassified.
Proposed fix
Apply the same is_codon_changed logic already used for simple
insertions and deletions, and assign the appropriate existing label:
| net change | codon changed | effect label |
|---|---|---|
| > 0 | yes | CODON_CHANGE_PLUS_CODON_INSERTION |
| > 0 | no | CODON_INSERTION |
| < 0 | yes | CODON_CHANGE_PLUS_CODON_DELETION |
| < 0 | no | CODON_DELETION |
This reuses labels already present in the same function and is consistent
with SnpEff terminology. No new effect types are needed.
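The table maps directly onto a small lookup. The sketch below is illustrative and rests on two assumptions: `in_frame_complex_label` is a hypothetical helper name, and the codon-changed flag is passed in as a plain boolean because this issue does not show how veff.py computes is_codon_changed.

```python
def in_frame_complex_label(net_change: int, is_codon_changed: bool) -> str:
    """Hypothetical helper: pick an existing effect label for an in-frame complex variant."""
    if net_change == 0 or net_change % 3 != 0:
        raise ValueError("expected a non-zero net change divisible by 3")
    if net_change > 0:
        if is_codon_changed:
            return "CODON_CHANGE_PLUS_CODON_INSERTION"
        return "CODON_INSERTION"
    if is_codon_changed:
        return "CODON_CHANGE_PLUS_CODON_DELETION"
    return "CODON_DELETION"


print(in_frame_complex_label(3, False))
print(in_frame_complex_label(-3, True))
```

Raising on a net change that is zero or not divisible by 3 is a choice made for this sketch, so that a frameshift or MNP passed in by mistake cannot silently receive an in-frame label.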
Why this approach
The fix reuses the is_codon_changed logic already present in the same
function for simple insertions and deletions (lines 382–384). This is
intentional: the biology is identical. A complex in-frame variant
(multi-base ref → different-length multi-base alt, net change a non-zero
multiple of 3) has the same functional consequence as a simple in-frame
insertion or deletion: the reading frame is preserved, and the only
question is whether the boundary codon also changes.
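To make the frame-preservation argument concrete, the sketch below applies the reproduction variant (pos=4, CAG → CAGTTC) to the REF_SEQ from the reproduction and compares codons before and after. The helpers `apply_variant` and `codons` are hypothetical and written only for this example. They assume 1-based positions on the forward strand.

```python
REF_SEQ = "ATGCAGTTCAACGGTACCTGAATGATGATG"


def apply_variant(seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace ref with alt at a 1-based position (hypothetical helper)."""
    start = pos - 1
    assert seq[start:start + len(ref)] == ref, "ref allele does not match sequence"
    return seq[:start] + alt + seq[start + len(ref):]


def codons(seq: str) -> list:
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


ref_codons = codons(REF_SEQ)

# In-frame complex variant: one codon is added and downstream codons are unchanged.
alt_codons = codons(apply_variant(REF_SEQ, 4, "CAG", "CAGTTC"))
print(alt_codons[3:] == ref_codons[2:])

# Frameshift for contrast (CA -> CAG): downstream codons are shifted.
shifted_codons = codons(apply_variant(REF_SEQ, 4, "CA", "CAG"))
print(shifted_codons[2:5], ref_codons[2:5])
```

In this particular example the codon at the variant site is itself unchanged, so under the proposed mapping it would be a candidate for CODON_INSERTION rather than CODON_CHANGE_PLUS_CODON_INSERTION.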
Reusing the existing labels means:
- No new effect types are introduced into the codebase
- The output is consistent with SnpEff terminology already used elsewhere
- Downstream code that already handles CODON_CHANGE_PLUS_CODON_INSERTION
and CODON_DELETION will handle these variants correctly with no
changes needed
An alternative would be a new label like COMPLEX_CHANGE_IN_FRAME, but
this would break any downstream code filtering on existing effect labels
and adds a term not present in standard variant annotation vocabulary.
The conservative approach is to extend the existing classification logic
rather than introduce new terminology.
Happy to submit a PR if this approach looks right.