Skip to content

Two locus general matrix#17

Open
lkirk wants to merge 42 commits intoupstream-mainfrom
two-locus-general-matrix
Open

Two locus general matrix#17
lkirk wants to merge 42 commits intoupstream-mainfrom
two-locus-general-matrix

Conversation

@lkirk
Copy link
Owner

@lkirk lkirk commented Mar 9, 2026

user specified summary functions in python

benjeffery and others added 26 commits March 13, 2026 01:33
…'t have to add a dimension at the end for scalar operations
*Python tests*
Overhaul python testing of the general stat functions. Remove the
dependence on the example tree sequences, opting instead to simulate a
couple of examples directly. Use these simulated trees in test fixtures,
scoped at the module level. This streamlines the test parameterization a
lot.

Use the single stat site names from the summary function definitions.

*CPython tests*
Add a multiallelic tree sequence to test normalisation function
validation and errors. Remove one more occurrence of `np.expand_dims`.

*trees.c*
Remove the unnecessary branch in
tsk_treeseq_two_locus_count_general_stat, improving the code coverage.

*trees.py*
Default normalisation function can be None, applying default at runtime.
Simplifies calling code and is more in line with the rest of the API.
@lkirk lkirk force-pushed the two-locus-general-matrix branch from ed3f0fb to bd0a1a5 Compare March 17, 2026 02:57
lkirk added 3 commits March 17, 2026 13:17
Use the number of sites and trees reported by the tree sequence instead
of hard coded values. This has the benefit of being more readable,
communicating intent (review comment from Peter).

Split the multiallelic and biallelic test cases, they're getting messy.
Now I can explicitly assert that the norm_func is not run for biallelic
sites and for branch stats. Also gets rid of awkward assertions about
sample sets.
Clean up dimension handling around summary functions and normalisation.
There is a slight speed advantage (according to a microbenchmark) and a
huge readability advantage to simply returning [value]. I keep all
computations specifying `keepdims`, but remove list indexing (i.e.
`AB[[0]]`) in favor of returning a list with a single scalar. It turns
out that vectorised numpy functions are actually slower in some cases
because the data we're operating on is so small. Finally, fix the
default normalisation function so that it works both on one-way and
two-way statistics. Users will still need to specify `hap_norm` when
appropriate (and a special case of `hap_norm` for two-way stats).

Per Peter's comment, I investigated dimension dropping and indeed,
general stats don't drop dimensions so I removed the dimension dropping
code. However, we return a matrix of `(m, m, k)` and we want
`(k, m, m)`, so `np.moveaxis` is still needed.

Added tests:
* Multiallelic multi sample-set. This tests operations on two sample
  sets for multiallelic data (which excercises the norm function with
  multiple sample sets). This test highlighted the slight changes needed
  to the default normalisation function.
* Multi outputs. This test mimics a two-way stat called
  on multiple indexes. It shows and tests the ability to compute
  multiple statistics from the same haplotype counts matrix (which is
  especially useful with the explosion of possible summary functions in
  three-way, four-way stats).

In our biallelic test case, I also assert that the normalisation
function is never called and add a note about polarisation.

Finally, I add a draft docstring, but to complete this I think that the
two-locus docs are required. Also, I'd like to add some general
documentation.
lkirk added 13 commits March 17, 2026 18:07
Add a test for behavior on empty tree sequences (no samples, no edges,
no sites). Add a "no sites" fixture.

Include branch stat testing. Tune branch stat test runtime by reducing
the size of `ts_100_samp_with_sites_fixture`, now named
`ts_10_samp_with_sites_fixture`.

Add explicit testing for output dimensions and assert that the norm
func is not called on trees with only biallelic sites and in branch
mode.

Add a GeneralStatNormFuncs class to explicitly document possible
normalisation functions and in what situations they will be used.

Tune size of tree sequence in `test_general_multi_outputs` so that the
test runs in a reasonable amount of time in branch mode.
The transpose operation was creating intermediate data that was not
being garbage collected, resulting in a rather obvious memory leak. To
mitigate this, I opt to wrap the data in a numpy array that is already
transposed. The original data is natively laid out with shape (K,3), by
creating a numpy array with shape (3,K) and strides (8,8*K), we can
avoid an intermediate transpose operation altogether. After
leak-checking again, the memory leak is gone.

I also add a Py_XDECREF to remove the reference to `norm_func`, though
in my leak checking I don't actually see any difference in RSS or heap
size.

Finally, mark a few more arrays as read-only, since the C functions that
accept them as input annotate these arrays as `const`.
`compute_two_tree_branch_stat` did not check the error returned by the
summary function (which is the return value of
`compute_two_tree_branch_state_update`.).

I caught this because the python callback was setting an exception in
the summary function and the python runtime was complaining about an
exception being set, despite a successful return status. This also means
that failing summary functions would (eventually) be caught, but the
code would continue to run.

The C python tests did not catch this because we would eventually raise
the correct exception.
Provide more precise requirements for `f` and `norm_f` and give some
basic understanding of what these functions are and when normalisation
will be required.

Attempt to fix syntax errors by adding a newline in code blocks.

Clarify output dimensions in the code comments (though this might need
to change since I think we'll remove the `np.moveaxis` call at the end.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants