MAlign

MAlign is a Python library for multiple sequence alignment with asymmetric scoring matrices across different domains. Unlike standard alignment tools that assume symmetric substitution costs, MAlign supports directional scoring -- the cost of aligning symbol A with symbol B can differ from B with A.

While designed primarily for computational linguistics (e.g., historical phonology, cognate detection), MAlign works with any hashable Python objects and is suitable for general-purpose sequence alignment tasks.
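
The directional scoring described above can be pictured with a plain dictionary keyed by ordered symbol pairs. This is only an illustrative sketch, not MAlign's internal representation: the `pair_score` helper, the default value, and the toy scores are all assumptions made for the example.

```python
# Hypothetical toy scores keyed by (source, target); not MAlign's data model.
scores = {
    ("p", "p"): 2.0,
    ("b", "b"): 2.0,
    ("p", "b"): 1.5,   # p -> b (e.g. voicing) is treated as cheap
    ("b", "p"): -0.5,  # b -> p is treated as more costly
}


def pair_score(source, target, table, default=-1.0):
    """Look up a direction-dependent score; order of the pair matters."""
    return table.get((source, target), default)


forward = pair_score("p", "b", scores)
backward = pair_score("b", "p", scores)
print(forward, backward)
```

Because the key is an ordered pair, swapping the arguments can return a different value, which is the property a symmetric substitution matrix cannot express.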

Key Features

Asymmetric scoring: Direction-dependent alignment costs, with a from_substitution_counts() factory that builds log-odds matrices from observed sound change frequencies
True multi-alignment: N-dimensional alignment for up to 4 sequences (via YenKSP on N-dim graphs), with automatic UPGMA progressive fallback for larger sets
Multiple algorithms: Needleman-Wunsch (anw) and Yen's k-shortest paths (yenksp)
k-best alignments: Return the top-k optimal alignments, not just the best one
Matrix learning: Supervised (EM, gradient descent) and unsupervised (bootstrap_matrix) learning of scoring matrices from sequence pairs
Prior-guided learning: Blend phonological feature priors with data-driven scores via linearly-decaying regularization
Block detection: Detect and merge complementary-gap patterns (diphthongization, metathesis) into compound symbols
Feature-based scoring: Build matrices from phonological feature distances (via distfeat)
Matrix imputation: Fill sparse matrices using sklearn-based methods
Evaluation metrics: Accuracy, precision, recall, and F1 for alignment quality
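
The log-odds idea mentioned in the first feature can be sketched in plain Python. This is a generic illustration and makes no claim about how from_substitution_counts() is implemented: the function name `log_odds_from_counts`, the add-one smoothing, and the toy counts are assumptions.

```python
import math
from collections import Counter


def log_odds_from_counts(counts, pseudocount=1.0):
    """Return log(P(target | source) / P(target)) for every ordered pair."""
    sources = sorted({s for s, _ in counts})
    targets = sorted({t for _, t in counts})
    # Smooth so that unseen pairs do not produce log(0).
    smoothed = {
        (s, t): counts.get((s, t), 0) + pseudocount
        for s in sources
        for t in targets
    }
    total = sum(smoothed.values())
    source_totals, target_totals = Counter(), Counter()
    for (s, t), c in smoothed.items():
        source_totals[s] += c
        target_totals[t] += c
    return {
        (s, t): math.log((c / source_totals[s]) / (target_totals[t] / total))
        for (s, t), c in smoothed.items()
    }


# Hypothetical directional counts: t -> d is observed more often than d -> t.
counts = {("t", "t"): 40, ("t", "d"): 12, ("d", "d"): 30, ("d", "t"): 2}
scores = log_odds_from_counts(counts)
print(scores[("t", "d")], scores[("d", "t")])
```

Since the counts are directional, the resulting scores are too: the more frequent change t -> d ends up with a higher score than d -> t.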

Installation

pip install malign

For phonological feature-based scoring matrices:

pip install malign[features]

Quick Start

Basic Alignment

import malign

alms = malign.align(["ATTCGGAT", "TACGGATTT"], k=2)
print(malign.tabulate_alms(alms))

Custom Scoring Matrix

matrix = malign.ScoringMatrix.from_sequences(
    sequences=[["A", "C", "G", "T"], ["A", "C", "G", "T"]],
    match=2.0, mismatch=-1.0, gap_score=-1.5,
)
alms = malign.align(["ACGT", "AGT"], k=1, matrix=matrix)

Full Pipeline: Features to Evaluation

This example shows the complete workflow for linguistic alignment -- building a scoring matrix from phonological feature distances, aligning cognate pairs, and evaluating the results:

import malign

# Build a scoring matrix from phonological feature distances
matrix = malign.ScoringMatrix.from_distfeat(
    sequences=[["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    gap="-", gap_score=-1.0,
)

# Align cognate sequences
alms = malign.align(
    [["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    k=3, matrix=matrix, method="anw",
)
print(malign.tabulate_alms(alms[:2]))

# Evaluate against gold standard
gold = malign.Alignment(
    [("n", "o", "t", "e"), ("n", "o", "tʃ", "e")], score=0.0,
)
print(f"Accuracy: {malign.alignment_accuracy(alms[0], gold):.2%}")
print(f"F1: {malign.alignment_f1(alms[0], gold):.2%}")
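
For intuition about what precision, recall, and F1 can mean for alignments, here is a minimal sketch that compares alignments column by column. It is an assumption about one reasonable definition, not a description of how malign.alignment_f1 computes its value; `column_prf` is a hypothetical helper.

```python
from collections import Counter


def column_prf(predicted, gold):
    """Precision, recall, and F1 over alignment columns (as multisets)."""
    pred_cols = Counter(zip(*predicted))
    gold_cols = Counter(zip(*gold))
    overlap = sum((pred_cols & gold_cols).values())
    n_pred = sum(pred_cols.values())
    n_gold = sum(gold_cols.values())
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold_rows = [("n", "o", "t", "e"), ("n", "o", "tʃ", "e")]
# A hypothetical prediction that splits t / tʃ into two gapped columns.
pred_rows = [("n", "o", "t", "-", "e"), ("n", "o", "-", "tʃ", "e")]
precision, recall, f1 = column_prf(pred_rows, gold_rows)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Three of the five predicted columns match the four gold columns, so precision is lower than recall in this toy case.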

Matrix Learning from Cognates

cognate_sets = [
    [["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    [["f", "a", "t", "o"], ["h", "a", "d", "o"]],
]
matrix = malign.learn_matrix(cognate_sets, method="em", max_iter=10)

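# Optionally regularize with a phonological prior
# (from_distfeat needs the optional distfeat dependency)
prior = malign.ScoringMatrix.from_distfeat(
    sequences=[["n", "o", "t", "e", "f", "a"], ["n", "o", "tʃ", "e", "h", "a", "d"]],
)
matrix = malign.learn_matrix(
    cognate_sets, method="em", max_iter=10, prior_matrix=prior,
)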

Unsupervised Bootstrap Learning

# No clustering needed -- just pairs of related sequences
pairs = [
    (["p", "a", "t", "a"], ["b", "a", "d", "a"]),
    (["t", "a", "p", "a"], ["d", "a", "b", "a"]),
    (["k", "a", "t", "a"], ["g", "a", "d", "a"]),
]
matrix = malign.bootstrap_matrix(pairs, max_iter=20)

# Optionally blend with a phonological prior
prior = malign.ScoringMatrix.from_distfeat(
    sequences=[["p", "t", "k", "b", "d", "g"], ["p", "t", "k", "b", "d", "g"]],
)
matrix = malign.bootstrap_matrix(pairs, max_iter=20, prior_matrix=prior)
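
The feature list describes prior blending as linearly-decaying regularization. The sketch below shows one way such a schedule could look; it is an assumption for illustration only, and `prior_weight`, `blend`, the initial weight of 0.5, and the convex-combination formula are not taken from MAlign's code.

```python
def prior_weight(iteration, max_iter, initial_weight=0.5):
    """Weight of the prior: initial_weight at iteration 0, zero at the last one."""
    if max_iter <= 1:
        return initial_weight
    remaining = 1.0 - iteration / (max_iter - 1)
    return initial_weight * max(0.0, remaining)


def blend(data_scores, prior_scores, weight):
    """Convex combination of data-driven and prior scores per symbol pair."""
    return {
        pair: (1.0 - weight) * value
        + weight * prior_scores.get(pair, value)  # no prior entry: keep data value
        for pair, value in data_scores.items()
    }


# Hypothetical scores for a single ordered pair.
data_scores = {("p", "b"): 2.0}
prior_scores = {("p", "b"): 0.0}
for step in range(5):
    w = prior_weight(step, max_iter=5)
    print(step, w, blend(data_scores, prior_scores, w))
```

Early iterations lean on the prior while later ones rely almost entirely on the data, which is the general behavior a decaying regularizer is meant to provide.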

Block Detection (Diphthongization / Metathesis)

# Merge complementary-gap columns into compound symbols
alms = malign.align([["a"], ["j", "e"]], k=1, merge_blocks=True)
# Sequence 2 gets compound symbol ("j", "e") instead of separate columns

Algorithms

Method	Description	Best for
`anw` (default)	Asymmetric Needleman-Wunsch	Pairwise alignment, small k
`yenksp`	Yen's k-shortest paths on alignment graph	Large k, diverse alignments
`dumb`	Gap-padding baseline	Testing and comparison
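
To make the first row of the table concrete, here is a textbook Needleman-Wunsch global alignment whose score callback receives its arguments in order (symbol from the first sequence, then symbol from the second), so a direction-dependent table can be plugged in. It is a sketch of the general technique, not the code behind the `anw` method; `needleman_wunsch` and `simple_score` are hypothetical names, and the scores reuse the values from the Custom Scoring Matrix example.

```python
def needleman_wunsch(seq_a, seq_b, score, gap="-"):
    """Return (best score, aligned row for seq_a, aligned row for seq_b)."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + score(seq_a[i - 1], gap)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + score(gap, seq_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + score(seq_a[i - 1], gap),
                dp[i][j - 1] + score(gap, seq_b[j - 1]),
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + score(seq_a[i - 1], gap):
            row_a.append(seq_a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], row_a[::-1], row_b[::-1]


def simple_score(a, b, gap="-"):
    """Toy scorer: match=2.0, mismatch=-1.0, gap=-1.5."""
    if a == gap or b == gap:
        return -1.5
    return 2.0 if a == b else -1.0


best, row_a, row_b = needleman_wunsch("ACGT", "AGT", simple_score)
print(best, "".join(row_a), "".join(row_b))
```

Replacing `simple_score` with a lookup into a table keyed by ordered pairs turns this into an asymmetric variant without touching the dynamic-programming loop.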

Requirements

Python >= 3.12
numpy, scipy, scikit-learn, tabulate, PyYAML
Optional: distfeat for feature-based scoring

Documentation

Community

Contributions, bug reports, and feature requests are welcome via GitHub issues and pull requests.

Author and Citation

Developed by Tiago Tresoldi (tiago.tresoldi@lingfil.uu.se).

The author has received funding from the Riksbankens Jubileumsfond (grant agreement ID: MXM19-1087:1, Cultural Evolution of Texts).

During the first stages of development, the author received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 715618, Computer-Assisted Language Comparison).

If you use malign, please cite it as:

Tresoldi, Tiago (2026). MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5. Uppsala: Department of Linguistics and Philology, Uppsala University.

In BibTeX:

@misc{Tresoldi2026malign,
  author = {Tresoldi, Tiago},
  title = {MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5},
  howpublished = {\url{https://github.com/tresoldi/malign}},
  address = {Uppsala},
  publisher = {Department of Linguistics and Philology, Uppsala University},
  year = {2026},
}

License

MIT License. See LICENSE for details.
