Open
Conversation
…'t have to add a dimension at the end for scalar operations
… fix norm func for r2_ij
…lematic in practice
*Python tests* Overhaul python testing of the general stat functions. Remove the dependence on the example tree sequences, opting instead to simulate a couple of examples directly. Use these simulated trees in test fixtures, scoped at the module level. This streamlines the test parameterization a lot. Use the single stat site names from the summary function definitions. *CPython tests* Add a multiallelic tree sequence to test normalisation function validation and errors. Remove one more occurrence of `np.expand_dims`. *trees.c* Remove the unnecessary branch in tsk_treeseq_two_locus_count_general_stat, improving the code coverage. *trees.py* Default normalisation function can be None, applying default at runtime. Simplifies calling code and is more in line with the rest of the API.
ed3f0fb to
bd0a1a5
Compare
Use the number of sites and trees reported by the tree sequence instead of hard coded values. This has the benefit of being more readable, communicating intent (review comment from Peter). Split the multiallelic and biallelic test cases, they're getting messy. Now I can explicitly assert that the norm_func is not run for biallelic sites and for branch stats. Also gets rid of awkward assertions about sample sets.
Clean up dimension handling around summary functions and normalisation. There is a slight speed advantage (according to a microbenchmark) and a huge readability advantage to simply returning [value]. I keep all computations specifying `keepdims`, but remove list indexing (i.e. `AB[[0]]`) in favor of returning a list with a single scalar. It turns out that vectorised numpy functions are actually slower in some cases because the data we're operating on is so small. Finally, fix the default normalisation function so that it works both on one-way and two-way statistics. Users will still need to specify `hap_norm` when appropriate (and a special case of `hap_norm` for two-way stats). Per Peter's comment, I investigated dimension dropping and indeed, general stats don't drop dimensions so I removed the dimension dropping code. However, we return a matrix of `(m, m, k)` and we want `(k, m, m)`, so `np.moveaxis` is still needed. Added tests: * Multiallelic multi sample-set. This tests operations on two sample sets for multiallelic data (which excercises the norm function with multiple sample sets). This test highlighted the slight changes needed to the default normalisation function. * Multi outputs. This test mimics a two-way stat called on multiple indexes. It shows and tests the ability to compute multiple statistics from the same haplotype counts matrix (which is especially useful with the explosion of possible summary functions in three-way, four-way stats). In our biallelic test case, I also assert that the normalisation function is never called and add a note about polarisation. Finally, I add a draft docstring, but to complete this I think that the two-locus docs are required. Also, I'd like to add some general documentation.
This reverts commit 6685399.
Add a test for behavior on empty tree sequences (no samples, no edges, no sites). Add a "no sites" fixture. Include branch stat testing. Tune branch stat test runtime by reducing the size of `ts_100_samp_with_sites_fixture`, now named `ts_10_samp_with_sites_fixture`. Add explicit testing for output dimensions and assert that the norm func is not called on trees with only biallelic sites and in branch mode. Add a GeneralStatNormFuncs class to explicitly document possible normalisation functions and in what situations they will be used. Tune size of tree sequence in `test_general_multi_outputs` so that the test runs in a reasonable amount of time in branch mode.
The transpose operation was creating intermediate data that was not being garbage collected, resulting in a rather obvious memory leak. To mitigate this, I opt to wrap the data in a numpy array that is already transposed. The original data is natively laid out with shape (K,3), by creating a numpy array with shape (3,K) and strides (8,8*K), we can avoid an intermediate transpose operation altogether. After leak-checking again, the memory leak is gone. I also add a Py_XDECREF to remove the reference to `norm_func`, though in my leak checking I don't actually see any difference in RSS or heap size. Finally, mark a few more arrays as read-only, since the C functions that accept them as input annotate these arrays as `const`.
`compute_two_tree_branch_stat` did not check the error returned by the summary function (which is the return value of `compute_two_tree_branch_state_update`.). I caught this because the python callback was setting an exception in the summary function and the python runtime was complaining about an exception being set, despite a successful return status. This also means that failing summary functions would (eventually) be caught, but the code would continue to run. The C python tests did not catch this because we would eventually raise the correct exception.
Provide more precise requirements for `f` and `norm_f` and give some basic understanding of what these functions are and when normalisation will be required. Attempt to fix syntax errors by adding a newline in code blocks. Clarify output dimensions in the code comments (though this might need to change since I think we'll remove the `np.moveaxis` call at the end.
…ng to ``result_dim`` fixes the docs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
user specified summary functions in python