Two locus general matrix by lkirk · Pull Request #17 · lkirk/tskit

lkirk · 2026-03-09T21:34:02Z

user specified summary functions in python

…'t have to add a dimension at the end for scalar operations

… fix norm func for r2_ij

…lematic in practice

*Python tests* Overhaul python testing of the general stat functions. Remove the dependence on the example tree sequences, opting instead to simulate a couple of examples directly. Use these simulated trees in test fixtures, scoped at the module level. This streamlines the test parameterization a lot. Use the single stat site names from the summary function definitions. *CPython tests* Add a multiallelic tree sequence to test normalisation function validation and errors. Remove one more occurrence of `np.expand_dims`. *trees.c* Remove the unnecessary branch in tsk_treeseq_two_locus_count_general_stat, improving the code coverage. *trees.py* Default normalisation function can be None, applying default at runtime. Simplifies calling code and is more in line with the rest of the API.

Use the number of sites and trees reported by the tree sequence instead of hard coded values. This has the benefit of being more readable, communicating intent (review comment from Peter). Split the multiallelic and biallelic test cases, they're getting messy. Now I can explicitly assert that the norm_func is not run for biallelic sites and for branch stats. Also gets rid of awkward assertions about sample sets.

Clean up dimension handling around summary functions and normalisation. There is a slight speed advantage (according to a microbenchmark) and a huge readability advantage to simply returning [value]. I keep all computations specifying `keepdims`, but remove list indexing (i.e. `AB[[0]]`) in favor of returning a list with a single scalar. It turns out that vectorised numpy functions are actually slower in some cases because the data we're operating on is so small. Finally, fix the default normalisation function so that it works both on one-way and two-way statistics. Users will still need to specify `hap_norm` when appropriate (and a special case of `hap_norm` for two-way stats). Per Peter's comment, I investigated dimension dropping and indeed, general stats don't drop dimensions so I removed the dimension dropping code. However, we return a matrix of `(m, m, k)` and we want `(k, m, m)`, so `np.moveaxis` is still needed. Added tests: * Multiallelic multi sample-set. This tests operations on two sample sets for multiallelic data (which excercises the norm function with multiple sample sets). This test highlighted the slight changes needed to the default normalisation function. * Multi outputs. This test mimics a two-way stat called on multiple indexes. It shows and tests the ability to compute multiple statistics from the same haplotype counts matrix (which is especially useful with the explosion of possible summary functions in three-way, four-way stats). In our biallelic test case, I also assert that the normalisation function is never called and add a note about polarisation. Finally, I add a draft docstring, but to complete this I think that the two-locus docs are required. Also, I'd like to add some general documentation.

This reverts commit 6685399.

Add a test for behavior on empty tree sequences (no samples, no edges, no sites). Add a "no sites" fixture. Include branch stat testing. Tune branch stat test runtime by reducing the size of `ts_100_samp_with_sites_fixture`, now named `ts_10_samp_with_sites_fixture`. Add explicit testing for output dimensions and assert that the norm func is not called on trees with only biallelic sites and in branch mode. Add a GeneralStatNormFuncs class to explicitly document possible normalisation functions and in what situations they will be used. Tune size of tree sequence in `test_general_multi_outputs` so that the test runs in a reasonable amount of time in branch mode.

The transpose operation was creating intermediate data that was not being garbage collected, resulting in a rather obvious memory leak. To mitigate this, I opt to wrap the data in a numpy array that is already transposed. The original data is natively laid out with shape (K,3), by creating a numpy array with shape (3,K) and strides (8,8*K), we can avoid an intermediate transpose operation altogether. After leak-checking again, the memory leak is gone. I also add a Py_XDECREF to remove the reference to `norm_func`, though in my leak checking I don't actually see any difference in RSS or heap size. Finally, mark a few more arrays as read-only, since the C functions that accept them as input annotate these arrays as `const`.

`compute_two_tree_branch_stat` did not check the error returned by the summary function (which is the return value of `compute_two_tree_branch_state_update`.). I caught this because the python callback was setting an exception in the summary function and the python runtime was complaining about an exception being set, despite a successful return status. This also means that failing summary functions would (eventually) be caught, but the code would continue to run. The C python tests did not catch this because we would eventually raise the correct exception.

Provide more precise requirements for `f` and `norm_f` and give some basic understanding of what these functions are and when normalisation will be required. Attempt to fix syntax errors by adding a newline in code blocks. Clarify output dimensions in the code comments (though this might need to change since I think we'll remove the `np.moveaxis` call at the end.

…ng to ``result_dim`` fixes the docs.

benjeffery and others added 26 commits March 13, 2026 01:33

Trigger lint on merge queue

953b5e2

remove consts from tsk_json_struct_metadata_get_blob (closes tskit-de…

17bacda

…v#3425)

edits to dev docs

8d4ab2d

no detached head; closes tskit-dev#3431

2a02e7f

remove prompts from bash commands for copyability; closes tskit-dev#3430

924c2f0

further clarify uv subdir; closes tskit-dev#3429

1656dc4

installation on macos; closes tskit-dev#3432

01de523

note about debug build; closes tskit-dev#3433

0071e2d

uv run gdb in dev docs

afaf3b9

initial stab at a general matrix (no normalisation)

1508884

added dimension dropping, but I think transposing is better -- we don…

b144f8d

…'t have to add a dimension at the end for scalar operations

finalize and add tests for single and multipop

92b422a

turns out, the general norm function needs to know the state_dims

8018550

fix up a bit of naming in general test funcs, remove unneeded branch,…

0351141

… fix norm func for r2_ij

flake8 does not like assigning lambdas to variables

ad920d0

and black doesn't like that

8d22e8e

do not test equality, this was useful on my local machine but is prob…

e5fcd0e

…lematic in practice

lowlevel tests

6bb867f

relax diff requirements (macos failure)

26bebf5

relax diff requirements (macos failure) -- previous commit fixed one

667dc65

new formatting tools, fix lint

8ea0852

remove TODOs, old comment and tested elsewhere

2ddf6d0

make testing more clear

680f9c5

preserve native dimensions instead of expanding at the end

5e984be

msprime produces different trees on macos (same seed)

bd0a1a5

lkirk force-pushed the two-locus-general-matrix branch from ed3f0fb to bd0a1a5 Compare March 17, 2026 02:57

lkirk added 3 commits March 17, 2026 13:17

regain test coverage for default sample sets

6685399

lkirk added 13 commits March 17, 2026 18:07

Revert "regain test coverage for default sample sets"

70cd6b4

This reverts commit 6685399.

update comment about result dimension

be997cf

be more explicit about setting the default norm function

9481921

linting does not like assigning lambdas to variables

7387461

add an else statement to improve readability (review)

f5b1940

Add some minimal documentation about the purpose of the two entrypoints

f854aa1

Test explicitly that our internal data is read only

8c49863

Incorporate Peter's improvement to the test comments

7378e42

turns out the documentation build doesn't like result_dims changi…

824faaa

…ng to ``result_dim`` fixes the docs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two locus general matrix#17

Two locus general matrix#17
lkirk wants to merge 42 commits intoupstream-mainfrom
two-locus-general-matrix

lkirk commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lkirk commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants