feat: add arc and chord diagrams for haplotype sharing (#457, #458) #989

Merged
jonbrenas merged 14 commits into malariagen:master from 31puneet:feat/haplotype-sharing-arc-chord
Mar 19, 2026

@31puneet
Contributor

@31puneet 31puneet commented Feb 28, 2026

Overview

Closes #457, Closes #458.

This PR adds two new visualization methods for haplotype sharing between predefined cohorts:

  • plot_haplotype_sharing_arc() — Arc diagram where cohorts are placed on a horizontal axis and arcs connect cohorts that share identical haplotypes. Arc thickness is proportional to the number of shared haplotypes.
  • plot_haplotype_sharing_chord() — Chord diagram where cohorts are arranged in a circle and Bézier curves connect cohorts that share identical haplotypes. Chord thickness is proportional to sharing count.
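The chord geometry described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `chord_layout`, `bezier_chord` and `line_widths` are hypothetical, the control point is assumed to sit at the circle centre, and the linear width scaling is a guess at what "proportional" means here. The PR's actual plotting code may differ on all three points.

```python
import math


def chord_layout(labels):
    """Place each cohort at an evenly spaced angle on the unit circle."""
    n = len(labels)
    return {
        label: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, label in enumerate(labels)
    }


def bezier_chord(p0, p2, n_points=21):
    """Sample a quadratic Bezier curve from p0 to p2 (n_points >= 2).

    Assumption: the control point is the circle centre (0, 0), which
    pulls every chord towards the middle of the diagram.
    """
    cx, cy = 0.0, 0.0
    points = []
    for k in range(n_points):
        t = k / (n_points - 1)
        x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * cx + t ** 2 * p2[0]
        y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * cy + t ** 2 * p2[1]
        points.append((x, y))
    return points


def line_widths(counts, max_width=8.0):
    """Scale sharing counts linearly so the largest count gets max_width."""
    top = max(counts.values(), default=0)
    if top == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: max_width * count / top for pair, count in counts.items()}


layout = chord_layout(["A", "B", "C", "D"])
curve = bezier_chord(layout["A"], layout["B"])
widths = line_widths({("A", "B"): 2, ("A", "C"): 4})
```

An arc diagram would follow the same pattern, with cohorts placed on a horizontal line instead of a circle.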

Both methods reuse a shared private helper _compute_haplotype_sharing() that loads haplotypes, identifies distinct haplotype groups via ht.distinct(), and builds a pairwise sharing matrix between cohorts.

Changes

File                Change
hapclust.py         Added _compute_haplotype_sharing(), plot_haplotype_sharing_arc(), plot_haplotype_sharing_chord()
hapclust_params.py  Added cohort_col parameter type
test_hapclust.py    Added parametrized tests for both new methods (ag3_sim + af1_sim)

Usage

import malariagen_data
ag3 = malariagen_data.Ag3()

# Arc diagram
ag3.plot_haplotype_sharing_arc(
    region="2L:28500000-28510000",
    cohort_col="country",
)

# Chord diagram
ag3.plot_haplotype_sharing_chord(
    region="2L:28500000-28510000",
    cohort_col="country",
)

Chord and Arc diagram

[image: screenshot of the chord and arc diagrams]

Note: The screenshot was generated using simulator data (20 cohorts). With real Ag3 data, the plots will display 20+ countries with richer sharing patterns.

@tristanpwdennis tristanpwdennis self-requested a review March 6, 2026 00:45
@tristanpwdennis tristanpwdennis self-assigned this Mar 6, 2026
@tristanpwdennis
Collaborator

tristanpwdennis commented Mar 6, 2026

Hi @31puneet

Thank you for the suggestion! The plots do look nice, but I'm not convinced they add enough over the clustering dendrograms we already produce - cohort sharing is already visible there via the colour mapping. Given the maintenance overhead of adding a new plot type, I'd want to be sure it's offering something meaningfully new before committing to it.

@jonbrenas @ahernank happy to hear your thoughts if you disagree.

@31puneet
Contributor Author

31puneet commented Mar 6, 2026

Thanks for the review @tristanpwdennis! These visualizations were originally requested by @alimanfoo in #457 and #458 specifically because dendrograms become hard to read above ~1000 haplotypes — the arc and chord diagrams provide a simpler cohort-level summary of pairwise sharing without needing to parse individual-level structure. Happy to hear everyone's thoughts on whether this still aligns with the project's direction.

@jonbrenas
Collaborator

Hi,

@31puneet is right that we asked for those, so I am in favour of adding them. I agree with @tristanpwdennis that a lot of this information can already be gleaned from haplotype clusterings ... but I think that is actually a good thing. One of the issues we have with the dendrograms is that they contain so much information that a lot of our users don't really know how to use and interpret them.

@jonbrenas
Collaborator

Thanks @31puneet. Could you explain (here and in comments in the code itself) how the haplotype sharing is computed?

@31puneet 31puneet force-pushed the feat/haplotype-sharing-arc-chord branch from a1379f2 to 8983418 Compare March 9, 2026 15:45
@31puneet
Contributor Author

31puneet commented Mar 9, 2026

Hi @jonbrenas

  • _compute_haplotype_sharing: This is the main function that does the comparison. Both the arc and chord diagrams call it.
  • Step 1: Load the haplotypes for a region. Each sample has 2 haplotypes. We use self.haplotypes(), which returns an xarray Dataset with genotype data and sample IDs.
  • Step 2: Load the sample metadata using self.sample_metadata(). We need the metadata to group samples into cohorts based on cohort_col.
  • Step 3: Map haplotypes to cohort labels. The sample metadata is aligned to the order of samples in the haplotype data, and each mosquito's cohort label is then repeated twice, once per haplotype.
  • Step 4: Load and convert the haplotypes in memory. The paired genotype data for each mosquito is split into individual haplotypes and loaded into memory.
  • Step 5: Identify shared haplotypes and build the matrix. We find groups of exactly identical haplotypes, determine which cohorts the haplotypes in each group belong to, and count how many groups are shared between each pair of cohorts.

I have also updated the files with these comments.
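Steps 3 and 4 above can be pictured with a small pure-Python sketch. It assumes diploid samples and uses nested lists in place of the xarray Dataset returned by self.haplotypes(); the function names `expand_cohorts` and `to_haplotypes` are hypothetical and are not taken from hapclust.py.

```python
def expand_cohorts(sample_cohorts, ploidy=2):
    """Step 3 (sketch): repeat each sample's cohort label once per haplotype."""
    return [cohort for cohort in sample_cohorts for _ in range(ploidy)]


def to_haplotypes(genotypes, ploidy=2):
    """Step 4 (sketch): split paired genotype calls into single haplotypes.

    genotypes[v][s] holds the alleles of sample s at variant v. The result
    has one tuple of alleles per haplotype, ordered sample 0 haplotype 0,
    sample 0 haplotype 1, sample 1 haplotype 0, and so on, which matches
    the label order produced by expand_cohorts().
    """
    n_samples = len(genotypes[0]) if genotypes else 0
    haplotypes = []
    for s in range(n_samples):
        for h in range(ploidy):
            haplotypes.append(tuple(row[s][h] for row in genotypes))
    return haplotypes


# Two variants (rows) by two samples (columns), alleles coded 0/1.
toy_genotypes = [
    [(0, 1), (1, 1)],
    [(0, 0), (1, 0)],
]
toy_haplotypes = to_haplotypes(toy_genotypes)
toy_labels = expand_cohorts(["Cameroon", "Gabon"])
```

Keeping the two lists in the same order is what allows a haplotype index to be looked up directly in the label list in step 5.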

@31puneet 31puneet force-pushed the feat/haplotype-sharing-arc-chord branch from 42aa2be to 7cb10e8 Compare March 9, 2026 15:48
@31puneet
Contributor Author

Hi @jonbrenas, requesting a review for this PR. Happy to make any changes.

@jonbrenas
Collaborator

Hi @31puneet,

Sorry it took me so long to get back to this PR. It is step 5 that I am not sure I completely follow. Can you give more details and/or an example, maybe?

@31puneet
Contributor Author

Hi @jonbrenas

No worries at all. Here is a detailed explanation of step 5:

  1. We use ht_seg.distinct() to find all identical haplotypes and return them as groups of indices.
  2. For each identical group (e.g., Haplotypes {0, 15, 204}), we look up their cohorts {"Cameroon", "Gabon", "Cameroon"}.
  3. We convert these cohorts to a set() to get just the unique locations: {"Cameroon", "Gabon"}.
  4. We then add 1 to the Cameroon-Gabon entry of our symmetric N x N sharing_matrix.
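The four steps above can be sketched as follows. This is a stdlib-only illustration, not the PR's code: `distinct_groups` is a hypothetical stand-in for ht_seg.distinct(), and a dictionary keyed by cohort pair replaces the N x N sharing_matrix.

```python
from collections import defaultdict
from itertools import combinations


def distinct_groups(haplotypes):
    """Group the indices of identical haplotypes.

    Hypothetical stand-in for ht_seg.distinct(): returns one set of
    indices per distinct haplotype sequence.
    """
    groups = defaultdict(set)
    for index, hap in enumerate(haplotypes):
        groups[tuple(hap)].add(index)
    return list(groups.values())


def unique_sharing(haplotypes, hap_cohorts):
    """Count, for each cohort pair, the distinct haplotypes found in both."""
    counts = defaultdict(int)
    for group in distinct_groups(haplotypes):
        # set() collapses repeated cohorts, e.g. two Cameroon carriers.
        cohorts = {hap_cohorts[i] for i in group}
        for a, b in combinations(sorted(cohorts), 2):
            counts[(a, b)] += 1
    return dict(counts)


toy_haps = [(0, 1), (0, 1), (1, 1), (0, 1)]
toy_cohorts = ["Cameroon", "Gabon", "Gabon", "Cameroon"]
shared = unique_sharing(toy_haps, toy_cohorts)
```

In this toy input the haplotype (0, 1) is carried by two Cameroon haplotypes and one Gabon haplotype, yet it contributes a single count to the Cameroon-Gabon pair.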

Why we use a set() here:

We count unique shared lineages instead of absolute pairwise frequencies. If we didn't use a set, countries where we happened to sequence a massive number of mosquitoes (e.g., Cameroon) would completely dominate the thickness of the chords due to unequal sampling effort. By using a set, we collapse it, meaning the chord thickness purely represents the diversity of gene flow (how many distinct lineages crossed the border) rather than raw sample size.

@jonbrenas
Collaborator

Thanks @31puneet.

If we didn't use a set, countries where we happened to sequence a massive number of mosquitoes (e.g., Cameroon) would completely dominate the thickness of the chords due to unequal sampling effort.

Isn't that what cohort_size is for?

@31puneet
Contributor Author

Hi @jonbrenas
Thank you for catching that.
My original thought process: I originally used set() because I was worried that if a user left cohort_size=None, the arcs would be massively distorted merely by unequal sample sizes, and I wanted to prevent that without having to throw data away via downsampling.

What I found during testing: I just tested both methods side by side on the 2L:28545000-28550000 region. I realized that my set() logic was actually computationally redundant! Because ht_seg.distinct() has already grouped identical sequences into distinct lineages before my loop even starts, iterating through them and using set() on the cohorts doesn't actually change the output.

@31puneet
Contributor Author

Hi @jonbrenas
Since we found that both methods output the exact same numbers, and my set() code is mathematically identical but a few lines shorter to write, do you still want me to remove set() and swap it for the absolute counting loop? Happy to do so if you prefer that logic explicitly!

@jonbrenas
Collaborator

Hi @31puneet. I think the user should be the one choosing if the width of the chords is the number of unique shared haplotypes, or the absolute number of shared haplotypes. This would also give us the freedom to add more statistics later, if need be.

@31puneet
Contributor Author

Hi @jonbrenas
That's a good idea. It gives users the freedom to analyze unique lineages or absolute population flow.
I'll add a metric parameter ("unique" vs "absolute") to the plotting functions and implement the absolute frequency calculation. Sound good?

@31puneet
Contributor Author

Hi @jonbrenas
I added the new metric parameter:

  • metric="unique" (default): Uses the set() logic to count the number of distinct identical lineages shared (unbiased by sample size).
  • metric="absolute": Uses your Counter logic to multiply the absolute frequencies of those matching haplotypes.
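A toy comparison of the two metrics, written as a sketch under assumptions: the thread does not show the Counter code, so the per-group product below is one reading of "multiply the absolute frequencies", and the function name `sharing` and its dictionary output are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sharing(haplotypes, hap_cohorts, metric="unique"):
    """Pairwise haplotype sharing between cohorts under either metric."""
    if metric not in ("unique", "absolute"):
        raise ValueError(f"unknown metric: {metric!r}")

    # Collect the cohort label of every carrier of each distinct haplotype.
    carriers = defaultdict(list)
    for hap, cohort in zip(haplotypes, hap_cohorts):
        carriers[tuple(hap)].append(cohort)

    totals = defaultdict(int)
    for cohorts in carriers.values():
        per_cohort = Counter(cohorts)
        for a, b in combinations(sorted(per_cohort), 2):
            if metric == "unique":
                totals[(a, b)] += 1  # one count per shared haplotype
            else:
                # Assumed reading of "absolute": carriers in a times carriers in b.
                totals[(a, b)] += per_cohort[a] * per_cohort[b]
    return dict(totals)


# One haplotype carried 3 times in Cameroon and 2 times in Gabon,
# plus one haplotype seen only in Gabon.
toy_haps = [(0, 1)] * 5 + [(1, 1)]
toy_cohorts = ["Cameroon"] * 3 + ["Gabon"] * 3
unique_result = sharing(toy_haps, toy_cohorts, metric="unique")
absolute_result = sharing(toy_haps, toy_cohorts, metric="absolute")
```

On this input the unique metric gives 1 for Cameroon-Gabon and the absolute metric gives 3 x 2 = 6, so the two metrics diverge as soon as a shared haplotype has more than one carrier in a cohort.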

I ran a test (2L:28545000-28550000) showing the matrix for both metrics over the same data.
[images: two screenshots of the sharing matrix, one per metric]

@31puneet
Contributor Author

Hi @jonbrenas
Requesting a review for the changes I made.

@jonbrenas
Collaborator

Hi @31puneet, I think the code is alright. Can you reflect on the differences between the results obtained with the 2 methods?

@31puneet
Contributor Author

Hi @jonbrenas, thanks for the review.
The differences between the results of the two methods:

For the unique metric:

  • The result: Cameroon and Uganda share exactly 29 distinct haplotypes, which gives us a baseline look at the diversity of historical gene flow between them.
  • How we got it: This uses the set() logic to count each distinct shared lineage once. It removes the bias from how many individual mosquitoes actually carry those lineages in our dataset.

For the absolute metric:

  • The result: The Cameroon-Uganda pair rises to a much larger number, showing that the volume of sharing is large compared to the baseline diversity.
  • How we got it: This uses the Counter logic to multiply the absolute frequencies of those matching haplotypes together.

@jonbrenas
Collaborator

Hi @31puneet, but can you draw conclusions from the results?

@31puneet
Contributor Author

Yes @jonbrenas
The contrast between the two metrics allows us to visually confirm evolutionary gene flow. For example, the test data shows Cameroon and Gabon share only 8 distinct lineages, but their absolute sharing volume is 3567.

Biologically, this points to a selective sweep (like a shared insecticide resistance mutation). Basically, a very small handful of mosquitoes gained a massive survival advantage, and those specific DNA lineages quickly spread and multiplied across both countries. If these lineages were just migrating naturally over time, the absolute number of shared mosquitoes would still be pretty low. The fact that only 8 unique lineages exploded into thousands of actual matches is the exact footprint of strong positive selection happening across different regions.

In short: the unique metric shows us how many distinct lineages managed to cross the border, while the absolute metric shows us how aggressively those lineages multiplied and dominated the population once they got there.
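One way to put a number on that contrast is to divide the absolute value by the unique count, which gives the average contribution of each shared lineage to the absolute metric. This ratio is not computed anywhere in the PR, and the function name `mean_contribution_per_lineage` is hypothetical. It is shown here only as arithmetic on the two Cameroon-Gabon figures quoted above.

```python
def mean_contribution_per_lineage(unique_count, absolute_count):
    """Average absolute-metric contribution of one shared lineage.

    Returns 0.0 when the two cohorts share no lineages at all.
    """
    if unique_count == 0:
        return 0.0
    return absolute_count / unique_count


# Cameroon-Gabon figures quoted in the comment above.
cameroon_gabon = mean_contribution_per_lineage(8, 3567)
```

A large ratio means that a few shared haplotypes are each carried by many individuals in both cohorts, while a ratio near 1 means that each shared haplotype has roughly one carrier per cohort.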

@jonbrenas jonbrenas merged commit b1ab7fb into malariagen:master Mar 19, 2026
8 checks passed
@31puneet 31puneet deleted the feat/haplotype-sharing-arc-chord branch March 19, 2026 19:30

Development

Successfully merging this pull request may close these issues.

  • Chord diagram to visualise haplotype sharing
  • Arc diagram to visualise haplotype sharing between cohorts

3 participants