feat: add arc and chord diagrams for haplotype sharing (#457, #458) #989
|
Hi @31puneet Thank you for the suggestion! The plots do look nice, but I'm not convinced they add enough over the clustering dendrograms we already produce - cohort sharing is already visible there via the colour mapping. Given the maintenance overhead of adding a new plot type, I'd want to be sure it's offering something meaningfully new before committing to it. @jonbrenas @ahernank happy to hear your thoughts if you disagree. |
|
Thanks for the review @tristanpwdennis! These visualizations were originally requested by @alimanfoo in #457 and #458 specifically because dendrograms become hard to read above ~1000 haplotypes — the arc and chord diagrams provide a simpler cohort-level summary of pairwise sharing without needing to parse individual-level structure. Happy to hear everyone's thoughts on whether this still aligns with the project's direction. |
|
Hi, @31puneet is right that we asked for those, so I am in favour of adding them. I agree with @tristanpwdennis that a lot of this information can already be gleaned from haplotype clusterings ... but I think that is actually a good thing. One of the issues we have with the dendrograms is that they contain so much information that a lot of our users don't really know how to use and interpret them. |
|
Thanks @31puneet. Could you explain (here and in comments in the code itself) how the haplotype sharing is computed? |
|
Hi @jonbrenas,
I have also updated the files with the comments. |
|
Hi @jonbrenas, requesting a review for this PR. Happy to make any changes. |
|
Hi @31puneet, Sorry it took me so long to get back to this PR. It is step 5 that I am not sure I completely follow. Can you give more details and/or an example, maybe? |
|
Hi @jonbrenas, no worries at all. Here is a detailed explanation of step 5.
Why we use a set() here: we count unique shared lineages instead of absolute pairwise frequencies. Without a set, countries where we happened to sequence a very large number of mosquitoes (e.g., Cameroon) would dominate the thickness of the chords because of unequal sampling effort. Using a set collapses the counts, so the chord thickness represents the diversity of gene flow (how many distinct lineages crossed the border) rather than raw sample size. |
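To make the sampling-effort argument concrete, here is a minimal standalone sketch. It is not the PR's code. It assumes that "unique" means the number of distinct haplotypes present in both cohorts and that "absolute" means the number of cross-cohort pairs of identical haplotypes. The function name `sharing_between` and the toy sequences are hypothetical.

```python
from collections import Counter


def sharing_between(cohort_a, cohort_b):
    """Return (unique, absolute) sharing between two lists of haplotype strings.

    unique   -- number of distinct haplotypes found in both cohorts
    absolute -- number of cross-cohort pairs of identical haplotypes
                (an assumed definition, used here for illustration only)
    """
    counts_a = Counter(cohort_a)
    counts_b = Counter(cohort_b)
    # Haplotypes observed at least once in each cohort.
    shared = set(counts_a) & set(counts_b)
    unique = len(shared)
    absolute = sum(counts_a[h] * counts_b[h] for h in shared)
    return unique, absolute


# Toy data, invented for this example.
small = ["AAT", "AAT", "CGT"]
other = ["AAT", "GGG"]

print(sharing_between(small, other))       # (1, 2)
# Sampling the first cohort ten times more heavily inflates the absolute
# count but leaves the unique count unchanged.
print(sharing_between(small * 10, other))  # (1, 20)
```

Under these assumptions, only the absolute count responds to how many mosquitoes were sequenced, which is the effect described above.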
|
Thanks @31puneet.
Isn't that what |
|
Hi @jonbrenas, here is what I found during testing. I tested both methods side by side on the 2L:28545000-28550000 region and realized that my set() logic was computationally redundant. Because ht_seg.distinct() has already grouped identical sequences into distinct lineages before my loop starts, iterating through them and using set() on the cohorts doesn't change the output. |
|
Hi @jonbrenas |
|
Hi @31puneet. I think the user should be the one choosing if the width of the chords is the number of unique shared haplotypes, or the absolute number of shared haplotypes. This would also give us the freedom to add more statistics later, if need be. |
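A user-selectable metric could be sketched roughly as follows. This is an illustration and not the implementation in the PR: the function name `sharing_matrix`, the `metric` argument with its two values, the cohort labels, and the definition of "absolute" as the number of cross-cohort pairs of identical haplotypes are all assumptions.

```python
from collections import Counter
from itertools import combinations


def sharing_matrix(haplotypes, cohorts, metric="unique"):
    """Build pairwise sharing counts between cohorts.

    haplotypes -- list of hashable haplotypes (e.g. strings)
    cohorts    -- parallel list giving the cohort label of each haplotype
    metric     -- "unique" counts distinct shared haplotypes;
                  "absolute" counts cross-cohort pairs of identical haplotypes
    """
    if metric not in ("unique", "absolute"):
        raise ValueError(f"unknown metric: {metric!r}")

    # Group identical haplotypes and count their copies per cohort.
    groups = {}
    for hap, cohort in zip(haplotypes, cohorts):
        groups.setdefault(hap, Counter())[cohort] += 1

    labels = sorted(set(cohorts))
    matrix = {pair: 0 for pair in combinations(labels, 2)}
    for per_cohort in groups.values():
        # Each distinct haplotype contributes to every cohort pair it spans.
        for a, b in combinations(sorted(per_cohort), 2):
            if metric == "unique":
                matrix[(a, b)] += 1
            else:
                matrix[(a, b)] += per_cohort[a] * per_cohort[b]
    return matrix


# Toy data, invented for this example.
haps = ["H1", "H1", "H1", "H2", "H2", "H3"]
cohs = ["CM", "CM", "GH", "CM", "ML", "GH"]

print(sharing_matrix(haps, cohs, metric="unique"))
print(sharing_matrix(haps, cohs, metric="absolute"))
```

Keeping the statistic behind a single argument would leave room for further metrics later without changing the plotting functions.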
|
Hi @jonbrenas |
|
Hi @jonbrenas
I ran a test ( |
|
Hi @jonbrenas |
|
Hi @31puneet, I think the code is alright. Can you reflect on the differences between the results obtained with the 2 methods? |
|
Hi @jonbrenas, thanks for the review. For the unique metric:
For the absolute metric:
|
|
Hi @31puneet, but can you draw conclusions from the results? |
|
Yes @jonbrenas. Biologically, this points to a selective sweep (like a shared insecticide resistance mutation). A very small handful of mosquitoes gained a large survival advantage, and those specific DNA lineages quickly spread and multiplied across both countries. If these lineages were just migrating naturally over time, the absolute number of shared mosquitoes would still be fairly low. The fact that only 8 unique lineages account for thousands of actual matches is the footprint of strong positive selection happening across different regions. In short: the unique metric shows how many distinct lineages managed to cross the border, while the absolute metric shows how strongly those lineages multiplied and came to dominate the population once they got there. |


Overview
Closes #457, Closes #458.
This PR adds two new visualization methods for haplotype sharing between predefined cohorts:
- plot_haplotype_sharing_arc(): Arc diagram where cohorts are placed on a horizontal axis and arcs connect cohorts that share identical haplotypes. Arc thickness is proportional to the number of shared haplotypes.
- plot_haplotype_sharing_chord(): Chord diagram where cohorts are arranged in a circle and Bézier curves connect cohorts that share identical haplotypes. Chord thickness is proportional to sharing count.

Both methods reuse a shared private helper, _compute_haplotype_sharing(), that loads haplotypes, identifies distinct haplotype groups via ht.distinct(), and builds a pairwise sharing matrix between cohorts.

Changes
- hapclust.py: _compute_haplotype_sharing(), plot_haplotype_sharing_arc(), plot_haplotype_sharing_chord()
- hapclust_params.py: cohort_col parameter type
- test_hapclust.py

Usage
Chord and Arc diagram