feat: add arc and chord diagrams for haplotype sharing (#457, #458) by 31puneet · Pull Request #989 · malariagen/malariagen-data-python

31puneet · 2026-02-28T18:50:02Z

Overview

Closes #457, Closes #458.

This PR adds two new visualization methods for haplotype sharing between predefined cohorts:

plot_haplotype_sharing_arc() — Arc diagram where cohorts are placed on a horizontal axis and arcs connect cohorts that share identical haplotypes. Arc thickness is proportional to the number of shared haplotypes.
plot_haplotype_sharing_chord() — Chord diagram where cohorts are arranged in a circle and Bézier curves connect cohorts that share identical haplotypes. Chord thickness is proportional to sharing count.
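
As a rough illustration of the geometry behind the two plot types (not the code in this PR), the sketch below computes points along a semicircular arc between two cohort positions on a horizontal axis, points along a quadratic Bézier chord between two cohorts on a unit circle, and a line width proportional to the sharing count. The function names, the choice of the circle centre as the Bézier control point, and the `max_width` default are all assumptions made for the example.

```python
import math


def arc_points(x0, x1, n=50):
    """Points on the semicircle above the axis joining (x0, 0) and (x1, 0)."""
    centre = (x0 + x1) / 2
    radius = abs(x1 - x0) / 2
    points = []
    for k in range(n):
        # Sweep the angle from pi (left end) down to 0 (right end).
        angle = math.pi * (1 - k / (n - 1))
        points.append((centre + radius * math.cos(angle), radius * math.sin(angle)))
    return points


def cohort_position(k, n_cohorts):
    """Place cohort k of n_cohorts evenly around the unit circle."""
    angle = 2 * math.pi * k / n_cohorts
    return (math.cos(angle), math.sin(angle))


def chord_points(p0, p1, n=50, control=(0.0, 0.0)):
    """Quadratic Bezier curve from p0 to p1, pulled towards `control`."""
    points = []
    for k in range(n):
        t = k / (n - 1)
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append(
            (
                a * p0[0] + b * control[0] + c * p1[0],
                a * p0[1] + b * control[1] + c * p1[1],
            )
        )
    return points


def line_width(count, max_count, max_width=10.0):
    """Scale a sharing count linearly so the largest count gets max_width."""
    return max_width * count / max_count if max_count else 0.0
```

Either list of points can be passed to a plotting library as a line trace, with `line_width()` supplying the stroke width for that pair of cohorts.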

Both methods reuse a shared private helper _compute_haplotype_sharing() that loads haplotypes, identifies distinct haplotype groups via ht.distinct(), and builds a pairwise sharing matrix between cohorts.

Changes

File	Change
`hapclust.py`	Added `_compute_haplotype_sharing()`, `plot_haplotype_sharing_arc()`, `plot_haplotype_sharing_chord()`
`hapclust_params.py`	Added `cohort_col` parameter type
`test_hapclust.py`	Added parametrized tests for both new methods (ag3_sim + af1_sim)

Usage

import malariagen_data
ag3 = malariagen_data.Ag3()

# Arc diagram
ag3.plot_haplotype_sharing_arc(
    region="2L:28500000-28510000",
    cohort_col="country",
)

# Chord diagram
ag3.plot_haplotype_sharing_chord(
    region="2L:28500000-28510000",
    cohort_col="country",
)

Chord and Arc diagram

Note: The screenshot was generated using simulated data (20 cohorts). With real Ag3 data, the plots will display 20+ countries with richer sharing patterns.

tristanpwdennis · 2026-03-06T00:51:05Z

Hi @31puneet

Thank you for the suggestion! The plots do look nice, but I'm not convinced they add enough over the clustering dendrograms we already produce - cohort sharing is already visible there via the colour mapping. Given the maintenance overhead of adding a new plot type, I'd want to be sure it's offering something meaningfully new before committing to it.

@jonbrenas @ahernank happy to hear your thoughts if you disagree.

31puneet · 2026-03-06T04:38:40Z

Thanks for the review @tristanpwdennis! These visualizations were originally requested by @alimanfoo in #457 and #458 specifically because dendrograms become hard to read above ~1000 haplotypes — the arc and chord diagrams provide a simpler cohort-level summary of pairwise sharing without needing to parse individual-level structure. Happy to hear everyone's thoughts on whether this still aligns with the project's direction.

jonbrenas · 2026-03-06T08:00:30Z

Hi,

@31puneet is right that we asked for those, so I am in favour of adding them. I agree with @tristanpwdennis that a lot of this information can already be gleaned from haplotype clusterings ... but I think that is actually a good thing. One of the issues we have with the dendrograms is that they contain so much information that a lot of our users don't really know how to use and interpret them.

jonbrenas · 2026-03-09T08:42:37Z

Thanks @31puneet. Could you explain (here and in comments in the code itself) how the haplotype sharing is computed?

31puneet · 2026-03-09T15:46:36Z

Hi @jonbrenas

_compute_haplotype_sharing: This is the main function that does the comparison. Both the arc and chord diagrams call it.
Step 1: Load the haplotypes for the region. Each sample has 2 haplotypes. We use self.haplotypes(), which returns an xarray Dataset with genotype data and sample IDs.
Step 2: Load the sample metadata using self.sample_metadata(). We need the metadata to group samples into cohorts based on cohort_col.
Step 3: Map haplotypes to cohort labels. The sample metadata is aligned to the sample order of the genomic data, and each sample's cohort label is then repeated twice, once per haplotype.
Step 4: Convert the haplotypes and load them into memory. The paired genotype data for each mosquito is split into individual haplotypes, and everything is loaded into memory.
Step 5: Identify shared haplotypes and build the matrix. We find groups of exactly identical haplotypes, determine which cohorts the haplotypes in each group belong to, and count how many identical groups are shared between each pair of cohorts.
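
Steps 3 and 4 can be pictured with plain Python lists standing in for the array objects used in the PR. This is only a sketch: `to_haplotypes()` and `expand_labels()` are hypothetical helper names, and the nested-list genotype layout (variants, then samples, then the two alleles) is an assumption made for the example.

```python
def to_haplotypes(genotypes):
    """Split diploid genotypes into individual haplotypes.

    genotypes[v][s] is the pair of alleles for variant v in sample s.
    The result holds one tuple of alleles per haplotype, ordered
    sample 0 hap 0, sample 0 hap 1, sample 1 hap 0, and so on.
    """
    n_samples = len(genotypes[0]) if genotypes else 0
    haplotypes = []
    for s in range(n_samples):
        for p in range(2):
            haplotypes.append(tuple(row[s][p] for row in genotypes))
    return haplotypes


def expand_labels(sample_cohorts):
    """Repeat each sample's cohort label twice, once per haplotype."""
    return [cohort for cohort in sample_cohorts for _ in range(2)]


# Two variants, two samples.
genotypes = [
    [(0, 1), (1, 1)],  # variant 0
    [(0, 0), (1, 0)],  # variant 1
]
haplotypes = to_haplotypes(genotypes)
labels = expand_labels(["Cameroon", "Gabon"])
```

Because both helpers use the same ordering, `labels[i]` is the cohort of `haplotypes[i]`.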

I have also updated the files with these comments.

31puneet · 2026-03-17T15:47:38Z

Hi @jonbrenas, requesting a review for this PR. Happy to make any changes.

jonbrenas · 2026-03-17T15:50:55Z

Hi @31puneet,

Sorry it took me so long to get back to this PR. It is step 5 that I am not sure I completely follow. Can you give more details and/or an example, maybe?

31puneet · 2026-03-17T16:24:37Z

Hi @jonbrenas

No worries at all. Here is a detailed explanation of Step 5:

We use ht_seg.distinct() to find all identical haplotypes and return them as groups of indices.
For each identical group (e.g., haplotypes {0, 15, 204}), we look up their cohorts: ["Cameroon", "Gabon", "Cameroon"].
We convert these cohorts to a set() to get just the unique locations: {"Cameroon", "Gabon"}.
We then add 1 to the Cameroon-Gabon entry of our symmetric N x N sharing_matrix.
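
The four points above can be sketched in plain Python. Here `distinct_groups()` is a stand-in for `ht_seg.distinct()` that groups haplotype indices by exact sequence identity, and nested lists replace the array types; both function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def distinct_groups(haplotypes):
    """Group haplotype indices by identical allele sequence."""
    groups = defaultdict(set)
    for idx, hap in enumerate(haplotypes):
        groups[tuple(hap)].add(idx)
    return list(groups.values())


def unique_sharing(haplotypes, hap_cohorts):
    """Count, per cohort pair, the distinct haplotypes found in both."""
    cohorts = sorted(set(hap_cohorts))
    pos = {cohort: i for i, cohort in enumerate(cohorts)}
    matrix = [[0] * len(cohorts) for _ in cohorts]
    for group in distinct_groups(haplotypes):
        # The set collapses repeated cohort labels within one group.
        present = sorted({hap_cohorts[i] for i in group})
        for a, b in combinations(present, 2):
            matrix[pos[a]][pos[b]] += 1
            matrix[pos[b]][pos[a]] += 1  # keep the matrix symmetric
    return cohorts, matrix


haplotypes = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (0, 0)]
hap_cohorts = ["Cameroon", "Gabon", "Cameroon", "Gabon", "Uganda", "Uganda"]
cohorts, matrix = unique_sharing(haplotypes, hap_cohorts)
```

In this toy input the haplotype `(0, 1)` occurs twice in Cameroon and once in Gabon, yet it adds only 1 to the Cameroon-Gabon entry.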

Why we use a set() here:

We count unique shared lineages instead of absolute pairwise frequencies. If we didn't use a set, countries where we happened to sequence a massive number of mosquitoes (e.g., Cameroon) would completely dominate the thickness of the chords due to unequal sampling effort. Using a set collapses repeated cohort labels within each group, so the chord thickness purely represents the diversity of gene flow (how many distinct lineages crossed the border) rather than raw sample size.

jonbrenas · 2026-03-17T17:15:00Z

Thanks @31puneet.

If we didn't use a set, countries where we happened to sequence a massive number of mosquitoes (e.g., Cameroon) would completely dominate the thickness of the chords due to unequal sampling effort.

Isn't that what cohort_size is for?
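
For context, equalising cohort sizes by random downsampling can be sketched as below. This is not the library's implementation of cohort_size: the helper name, the seed handling, and the choice to raise an error for cohorts that are too small are all assumptions made for the example.

```python
import random
from collections import defaultdict


def downsample_cohorts(sample_cohorts, cohort_size, seed=42):
    """Return sorted sample indices after drawing cohort_size per cohort."""
    by_cohort = defaultdict(list)
    for idx, cohort in enumerate(sample_cohorts):
        by_cohort[cohort].append(idx)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    kept = []
    for cohort, indices in by_cohort.items():
        if len(indices) < cohort_size:
            raise ValueError(f"cohort {cohort!r} has fewer than {cohort_size} samples")
        kept.extend(rng.sample(indices, cohort_size))
    return sorted(kept)


sample_cohorts = ["Cameroon"] * 5 + ["Gabon"] * 2
kept = downsample_cohorts(sample_cohorts, cohort_size=2)
```

After the draw, every cohort contributes the same number of samples, so differences in sharing counts no longer reflect sampling effort alone.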

31puneet · 2026-03-17T18:29:41Z

Hi @jonbrenas
Thank you for catching that.
My original thought process: I originally used set() because I was worried that if a user left cohort_size=None, the arcs would be heavily distorted by unequal sample sizes alone, and I wanted to prevent that without having to throw data away via downsampling.

What I found during testing: I tested both methods side by side on the 2L:28545000-28550000 region and realized that my set() logic was actually computationally redundant. Because ht_seg.distinct() has already grouped identical sequences into distinct lineages before my loop even starts, iterating through them and using set() on the cohorts doesn't change the output.

31puneet · 2026-03-17T18:36:03Z

Hi @jonbrenas
Since we found that both methods output the exact same numbers, and my set() code is mathematically identical but a few lines shorter to write, do you still want me to remove set() and swap it for the absolute counting loop? Happy to do so if you prefer that logic explicitly!

jonbrenas · 2026-03-17T19:14:04Z

Hi @31puneet. I think the user should be the one choosing if the width of the chords is the number of unique shared haplotypes, or the absolute number of shared haplotypes. This would also give us the freedom to add more statistics later, if need be.

31puneet · 2026-03-17T19:32:24Z

Hi @jonbrenas
That's a good idea. It gives users the freedom to analyze unique lineages or absolute population flow.
I'll add a metric parameter ("unique" vs "absolute") to the plotting functions and implement the absolute frequency calculation. Sound good?

31puneet · 2026-03-17T20:23:51Z

Hi @jonbrenas
I added the new metric toggle parameter:

metric="unique" (default): Uses the set() logic to count the number of distinct lineages shared between two cohorts, independent of how many haplotypes carry each lineage.
metric="absolute": Uses your Counter logic to multiply the absolute frequencies of those matching haplotypes.
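
A minimal sketch of how the two metrics differ, assuming that "absolute" adds, for each group of identical haplotypes, the product of the per-cohort counts (as described for the Counter logic), while "unique" adds 1 per group. The function name and the plain-list inputs are hypothetical; the PR's own code operates on array types.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sharing_matrix(haplotypes, hap_cohorts, metric="unique"):
    """Pairwise cohort sharing under the 'unique' or 'absolute' metric."""
    if metric not in ("unique", "absolute"):
        raise ValueError(f"unknown metric: {metric!r}")
    cohorts = sorted(set(hap_cohorts))
    pos = {cohort: i for i, cohort in enumerate(cohorts)}
    matrix = [[0] * len(cohorts) for _ in cohorts]

    # Group haplotype indices by identical allele sequence.
    groups = defaultdict(list)
    for idx, hap in enumerate(haplotypes):
        groups[tuple(hap)].append(idx)

    for members in groups.values():
        # How many copies of this haplotype each cohort carries.
        counts = Counter(hap_cohorts[i] for i in members)
        for a, b in combinations(sorted(counts), 2):
            weight = 1 if metric == "unique" else counts[a] * counts[b]
            matrix[pos[a]][pos[b]] += weight
            matrix[pos[b]][pos[a]] += weight
    return cohorts, matrix


haplotypes = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (0, 0)]
hap_cohorts = ["Cameroon", "Gabon", "Cameroon", "Gabon", "Uganda", "Uganda"]
_, unique = sharing_matrix(haplotypes, hap_cohorts, metric="unique")
_, absolute = sharing_matrix(haplotypes, hap_cohorts, metric="absolute")
```

The two matrices agree wherever each cohort carries a shared haplotype exactly once, and diverge as soon as a cohort carries several copies.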

I ran a test (2L:28545000-28550000) showing the matrix for both metrics over the same data.

31puneet · 2026-03-18T16:05:44Z

Hi @jonbrenas
Requesting a review for the changes I made.

jonbrenas · 2026-03-18T16:24:29Z

Hi @31puneet, I think the code is alright. Can you reflect on the differences between the results obtained with the 2 methods?

31puneet · 2026-03-18T16:40:52Z

Hi @jonbrenas, thanks for the review
The differences between the results of the two methods:

For unique metric:

The result: We see that Cameroon and Uganda share exactly 29 distinct haplotypes, which gives us a baseline look at the diversity of historical gene flow between them.
How we got it: This uses set() logic to just count the distinct shared lineages once. It completely removes the bias of how many individual mosquitoes actually carry those lineages in our dataset.

For the absolute metric:

The result: The Cameroon-Uganda pair yields a much larger number, showing that the volume of sharing is large compared to the baseline diversity.
How we got it: This uses the Counter logic to multiply the absolute frequencies of those matching haplotypes together.

jonbrenas · 2026-03-18T20:47:40Z

Hi @31puneet, but can you draw conclusions from the results?

31puneet · 2026-03-18T21:32:05Z

Yes @jonbrenas
The contrast between the two metrics allows us to visually confirm evolutionary gene flow. For example, the test data shows Cameroon and Gabon share only 8 distinct lineages, but their absolute sharing volume is 3567.

Biologically, this points to a selective sweep (like a shared insecticide resistance mutation). Basically, a very small handful of mosquitoes gained a massive survival advantage, and those specific DNA lineages quickly spread and multiplied across both countries. If these lineages were just migrating naturally over time, the absolute number of shared mosquitoes would still be pretty low. The fact that only 8 unique lineages exploded into thousands of actual matches is the exact footprint of strong positive selection happening across different regions.

In short: the unique metric shows us how many distinct lineages managed to cross the border, while the absolute metric shows us how aggressively those lineages multiplied and dominated the population once they got there.
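
The arithmetic relating the two figures quoted above can be made explicit. Assuming the absolute metric sums, over shared lineages, the product of the haplotype counts in the two cohorts (as described for metric="absolute"), dividing the absolute value by the unique value gives the average number of matching haplotype pairs per shared lineage. The helper names are hypothetical.

```python
def pair_contribution(count_a, count_b):
    """Matching haplotype pairs one shared lineage adds under 'absolute'."""
    return count_a * count_b


def mean_pairs_per_lineage(absolute, unique):
    """Average number of matching pairs per shared lineage."""
    return absolute / unique


# Cameroon-Gabon figures quoted in the comment above.
average = mean_pairs_per_lineage(absolute=3567, unique=8)
```

For the Cameroon-Gabon pair this average is 445.875 matching pairs per shared lineage.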

31puneet added 3 commits February 28, 2026 18:40

feat: add arc and chord diagrams for haplotype sharing (malariagen#457, malariagen#458)

bc2424c

Merge branch 'master' into feat/haplotype-sharing-arc-chord

cd86417

Merge branch 'master' into feat/haplotype-sharing-arc-chord

9fb0a7d

tristanpwdennis self-requested a review March 6, 2026 00:45

tristanpwdennis self-assigned this Mar 6, 2026

docs: add inline comments explaining haplotype sharing computation

8983418

31puneet force-pushed the feat/haplotype-sharing-arc-chord branch from a1379f2 to 8983418 Compare March 9, 2026 15:45

fix: Lint erros

7cb10e8

31puneet force-pushed the feat/haplotype-sharing-arc-chord branch from 42aa2be to 7cb10e8 Compare March 9, 2026 15:48

31puneet added 3 commits March 9, 2026 15:49

Merge branch 'master' into feat/haplotype-sharing-arc-chord

9178f48

Merge branch 'master' into feat/haplotype-sharing-arc-chord

f0d5d12

Merge branch 'master' into feat/haplotype-sharing-arc-chord

cb18a86

31puneet added 3 commits March 17, 2026 20:02

add width_metric parameter for unique vs absolute haplotype counting

e25fe0b

Merge branch 'master' into feat/haplotype-sharing-arc-chord

ead68bd

fix: document metric parameter and add unique/absolute test coverage

1d8e33c

Merge branch 'master' into feat/haplotype-sharing-arc-chord

d7ba865

Merge branch 'master' into feat/haplotype-sharing-arc-chord

c2b6fd7

jonbrenas approved these changes Mar 18, 2026

Merge branch 'master' into feat/haplotype-sharing-arc-chord

ed43275

jonbrenas merged commit b1ab7fb into malariagen:master Mar 19, 2026
8 checks passed

31puneet deleted the feat/haplotype-sharing-arc-chord branch March 19, 2026 19:30
