nans in `ld_matrix`

I'm moving discussion of `nan`s in `ts.ld_matrix` here from #3416.

Consider the almost-simplest application of branch-mode LD:
```
>>> ts = msprime.sim_ancestry(6)
>>> ts.ld_matrix([0,1, 2, 3], mode='branch')
array([[nan]])
>>> ts.ld_matrix([[0,1,2,3], [4,5]], mode='branch', indexes=(0,1))
array([[nan]])
```
This is because there are in both cases branches that aren't ancestral to some of the sample sets (of course). 
In other words - unless I'm missing something - `r2` will *always* be `nan` unless you're doing it on all the samples in a simplified ARG. This seems unhelpful.

The same thing happens in site mode if there's mutations on those branches, even under infinite sites:
```
>>> ts = msprime.sim_mutations(ts, rate=0.3, discrete_genome=False)
>>> ts.ld_matrix([0,1,2,3])
array([[       nan,        nan,        nan,        nan],
       [       nan,        nan,        nan,        nan],
       [       nan,        nan, 1.        , 0.11111111],
       [       nan,        nan, 0.11111111, 1.        ]])
>>> ts.ld_matrix([[0,1,2,3], [4,5]], indexes=(0,1))
array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])

```

I think this probably ought to be explained more prominantly, or maybe have the behavior changed, because this is probably the first thing that people will run into when trying some of this stuff out. So, question for @apragsdale: 

Is this the desired behavior?

If it is, then I think the explanation of this (and more importantly, suggestion of what to do instead) should be put semi-prominantly in the docs.

I think this question boils down to "what is the natural thing to do with genomic data". In other words, if you compute LD from a subset of samples, what do you do with the sites that are monomorphic in your sample? I think you ignore them, which suggests that branch mode should just ignore those branches that are not polymorphic. This is what the other statistics do, mostly: see [this not note](https://tskit.dev/tskit/docs/latest/stats.html#strictness-and-which-branches-count) (that could be better). However, I could also see a good answer here being "r2 is a ratio and so it's not a good thing to average across a bunch of sites; nan is the right answer here and you should just use a different, non-ratio statistic instead". But then I might suggest either not making `r2` the default or throwing a warning for r2 in branch mode.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nans in `ld_matrix` #3441

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

nans in ld_matrix #3441

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

nans in `ld_matrix` #3441