veff.py: in-frame complex variants return literal "TODO" string as effect label #1132

@sallykinyua

The Bug

_get_within_cds_effect() in malariagen_data/veff.py does not handle
in-frame complex variants: cases where multiple ref bases are replaced
by a different number of alt bases and the net length change is a
multiple of 3. These variants fall through to an unimplemented else
branch at lines 428-431 and return the literal string
"TODO in-frame complex variation (MNP + INDEL)" as the effect label,
with impact="UNKNOWN".

Affected variant types

Variants that hit this branch satisfy all of the following:

| Condition | Meaning |
|---|---|
| len(ref) > 1 and len(alt) > 1 | Both alleles are multi-base |
| len(ref) != len(alt) | Not a pure MNP |
| (len(alt) - len(ref)) % 3 == 0 | In-frame (reading frame preserved) |

Concrete examples:

| ref | alt | net change (bases) | description |
|---|---|---|---|
| CA | CAGTT | +3 | 2 bases → 5 bases |
| CAG | CAGTTC | +3 | 3 bases → 6 bases |
| CAGTT | CA | -3 | 5 bases → 2 bases |
| CAGTTC | CAG | -3 | 6 bases → 3 bases |
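
The three conditions can be checked mechanically. The following is a minimal sketch, not code from veff.py; the helper name is_in_frame_complex is hypothetical and only mirrors the table of conditions above.

```python
def is_in_frame_complex(ref: str, alt: str) -> bool:
    """Return True if (ref, alt) meets all three conditions listed above.

    Hypothetical helper for illustration; it is not part of malariagen_data.
    """
    return (
        len(ref) > 1                         # multi-base ref
        and len(alt) > 1                     # multi-base alt
        and len(ref) != len(alt)             # not a pure MNP
        and (len(alt) - len(ref)) % 3 == 0   # net change keeps the reading frame
    )


# The four concrete examples from the table above.
examples = [("CA", "CAGTT"), ("CAG", "CAGTTC"), ("CAGTT", "CA"), ("CAGTTC", "CAG")]
for ref, alt in examples:
    print(ref, alt, len(alt) - len(ref), is_in_frame_complex(ref, alt))
```

Simple insertions such as C → CAGT and pure MNPs such as CA → GT fail the check, because one allele is single-base or the lengths are equal.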

Minimal reproduction

from unittest.mock import MagicMock

from malariagen_data.veff import null_effect, _get_within_cds_effect

# Mock annotator backed by a 30-base reference; coordinates are 1-based, inclusive.
REF_SEQ = "ATGCAGTTCAACGGTACCTGAATGATGATG"
ann = MagicMock()
ann.get_ref_seq.side_effect = lambda c, s, e: REF_SEQ[s-1:e].lower()
ann.get_ref_allele_coords.side_effect = lambda c, p, r: (p, p+len(r)-1)

# Single forward-strand CDS spanning the whole mock reference.
cds = MagicMock()
cds.start, cds.end, cds.strand, cds.ID = 1, 30, "+", "CDS-mock"

# In-frame complex variant: 3 ref bases replaced by 6 alt bases (net +3).
base = null_effect._replace(
    chrom="chr1", pos=4, ref="CAG", alt="CAGTTC",
    vlen=3, ref_start=4, ref_stop=6, strand="+"
)
result = _get_within_cds_effect(ann, base, cds, [cds])
print(result.effect)  # 'TODO in-frame complex variation (MNP + INDEL)'
print(result.impact)  # 'UNKNOWN'

Root cause

The INDEL/MNP classification block in _get_within_cds_effect() handles
four cases but leaves a fifth unimplemented:

  • ✅ Frameshift: (len(alt) - len(ref)) % 3 != 0
  • ✅ Simple insertion: len(ref) == 1 and len(alt) > len(ref)
  • ✅ Simple deletion: len(alt) == 1 and len(ref) > len(alt)
  • ✅ Pure MNP: len(ref) == len(alt)
  • ❌ Complex in-frame: multi-base ref, multi-base alt, different lengths, in-frame — not implemented
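
The five cases can be sketched as a standalone classifier. This is an illustration of the branching described in the list above, under the assumption that the checks run in the listed order and that SNPs have already been handled earlier; the function name and the returned strings are hypothetical and are not the labels veff.py emits.

```python
def classify_indel_or_mnp(ref: str, alt: str) -> str:
    """Name the branch a (ref, alt) pair would reach, per the list above.

    Hypothetical sketch: assumes single-base substitutions (SNPs) were
    handled before this point, as equal-length alleles land in "pure MNP".
    """
    if (len(alt) - len(ref)) % 3 != 0:
        return "frameshift"
    if len(ref) == 1 and len(alt) > len(ref):
        return "simple insertion"
    if len(alt) == 1 and len(ref) > len(alt):
        return "simple deletion"
    if len(ref) == len(alt):
        return "pure MNP"
    # Multi-base ref, multi-base alt, different lengths, in-frame:
    # this is the branch that currently returns the "TODO ..." label.
    return "complex in-frame"


for ref, alt in [("CA", "CAG"), ("C", "CAGT"), ("CAGT", "C"), ("CA", "GT"), ("CAG", "CAGTTC")]:
    print(ref, alt, classify_indel_or_mnp(ref, alt))
```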

Why this matters

Any downstream code consuming effect labels (resistance marker
identification, variant filtering, frequency analysis) silently receives
a placeholder string instead of a valid effect label for these variants,
with no error or warning. This is particularly relevant for resistance
gene analysis on datasets like Ag3, where complex variants in genes such
as Vgsc would be misclassified.

Proposed fix

Apply the same is_codon_changed logic already used for simple
insertions and deletions, and assign the appropriate existing label:

| net change | codon changed | effect label |
|---|---|---|
| > 0 | yes | CODON_CHANGE_PLUS_CODON_INSERTION |
| > 0 | no | CODON_INSERTION |
| < 0 | yes | CODON_CHANGE_PLUS_CODON_DELETION |
| < 0 | no | CODON_DELETION |

This reuses labels already present in the same function and is consistent
with SnpEff terminology. No new effect types are needed.
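
The label selection in the table could be sketched as follows. This is not the proposed patch itself: the function name in_frame_effect_label is hypothetical, and codon_changed is taken as an input here because how veff.py derives is_codon_changed is not shown in this issue.

```python
def in_frame_effect_label(net_change: int, codon_changed: bool) -> str:
    """Pick one of the four existing labels for an in-frame length change.

    Hypothetical sketch. net_change is len(alt) - len(ref); codon_changed
    stands in for the is_codon_changed value computed inside veff.py.
    """
    if net_change == 0 or net_change % 3 != 0:
        raise ValueError("expected a nonzero net length change divisible by 3")
    if net_change > 0:
        return "CODON_CHANGE_PLUS_CODON_INSERTION" if codon_changed else "CODON_INSERTION"
    return "CODON_CHANGE_PLUS_CODON_DELETION" if codon_changed else "CODON_DELETION"


print(in_frame_effect_label(3, codon_changed=False))
print(in_frame_effect_label(-3, codon_changed=True))
```

Rejecting a net change of 0 keeps pure MNPs out of this path, since they already have their own branch.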

Why this approach

The fix reuses the is_codon_changed logic already present in the same
function for simple insertions and deletions (lines 382–384). This is
intentional, because the biology is identical. A complex in-frame variant
(multi-base ref → different-length multi-base alt, net change a nonzero
multiple of 3) has the same functional consequence as a simple in-frame
insertion or deletion: the reading frame is preserved, and the only
question is whether the boundary codon also changes.

Reusing the existing labels means:

  1. No new effect types are introduced into the codebase
  2. The output is consistent with SnpEff terminology already used elsewhere
  3. Downstream code that already handles CODON_CHANGE_PLUS_CODON_INSERTION
    and CODON_DELETION will handle these variants correctly with no
    changes needed

An alternative would be a new label like COMPLEX_CHANGE_IN_FRAME, but
this would break any downstream code filtering on existing effect labels
and adds a term not present in standard variant annotation vocabulary.
The conservative approach is to extend the existing classification logic
rather than introduce new terminology.

Happy to submit a PR if this approach looks right.
