Skip to content

nans in ld_matrix #3441

@petrelharp

Description

@petrelharp

I'm moving discussion of nans in ts.ld_matrix here from #3416.

Consider the almost-simplest application of branch-mode LD:

>>> ts = msprime.sim_ancestry(6)
>>> ts.ld_matrix([0,1, 2, 3], mode='branch')
array([[nan]])
>>> ts.ld_matrix([[0,1,2,3], [4,5]], mode='branch', indexes=(0,1))
array([[nan]])

This is because there are in both cases branches that aren't ancestral to some of the sample sets (of course).
In other words - unless I'm missing something - r2 will always be nan unless you're doing it on all the samples in a simplified ARG. This seems unhelpful.

The same thing happens in site mode if there's mutations on those branches, even under infinite sites:

>>> ts = msprime.sim_mutations(ts, rate=0.3, discrete_genome=False)
>>> ts.ld_matrix([0,1,2,3])
array([[       nan,        nan,        nan,        nan],
       [       nan,        nan,        nan,        nan],
       [       nan,        nan, 1.        , 0.11111111],
       [       nan,        nan, 0.11111111, 1.        ]])
>>> ts.ld_matrix([[0,1,2,3], [4,5]], indexes=(0,1))
array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])

I think this probably ought to be explained more prominantly, or maybe have the behavior changed, because this is probably the first thing that people will run into when trying some of this stuff out. So, question for @apragsdale:

Is this the desired behavior?

If it is, then I think the explanation of this (and more importantly, suggestion of what to do instead) should be put semi-prominantly in the docs.

I think this question boils down to "what is the natural thing to do with genomic data". In other words, if you compute LD from a subset of samples, what do you do with the sites that are monomorphic in your sample? I think you ignore them, which suggests that branch mode should just ignore those branches that are not polymorphic. This is what the other statistics do, mostly: see this not note (that could be better). However, I could also see a good answer here being "r2 is a ratio and so it's not a good thing to average across a bunch of sites; nan is the right answer here and you should just use a different, non-ratio statistic instead". But then I might suggest either not making r2 the default or throwing a warning for r2 in branch mode.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions