confusable-vision

Empirical glyph similarity scoring using vector-outline raycasting. Compares Unicode confusable character pairs across 245 system fonts, measuring structural similarity directly from font outlines (no rasterization), and produces scored JSON artifacts with continuous per-font distance scores.

Key results from 52.6 million single-char and 190 million multi-char comparisons across 22,581 characters and 12 writing systems:

249,976 unique single-char confusable pairs across 245 fonts and 12 scripts (66 cross-script pairings). 764,395 total font-level discoveries.
2,524,275 unique multi-char (bigram) confusable pairs including rn/m across 95 fonts (33 below distance 0.40) and oy/Cyrillic uk across 16 fonts.
Per-font continuous distance scores, not binary lists. Each pair has a measured ray distance per font, giving font-aware confidence for downstream security tooling.
305% more discoveries than the SDF (signed distance field) scorer, and 29% faster. The enriched five-layer ray signature recovers every SDF finding that held up: the pairs found only by SDF did not replicate under manual review.
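
The per-font scores reduce to the kind of summary figures quoted in this README (font count, mean distance, and counts such as "33 below distance 0.40"). A minimal sketch of that reduction follows; the `PairDistances` shape, the `summarizePair` helper, and the sample numbers are illustrative assumptions, not the project's actual schema or data.

```typescript
// Hypothetical shape: font name -> measured ray distance for one confusable pair.
type PairDistances = Record<string, number>;

interface PairSummary {
  fonts: number; // fonts that contain both glyphs
  mean: number;  // mean distance across those fonts
  zeros: number; // fonts with distance exactly 0 (identical signatures)
  below: number; // fonts with distance under the cutoff
}

function summarizePair(distances: PairDistances, cutoff: number): PairSummary {
  const values = Object.values(distances);
  if (values.length === 0) return { fonts: 0, mean: NaN, zeros: 0, below: 0 };
  const sum = values.reduce((acc, d) => acc + d, 0);
  return {
    fonts: values.length,
    mean: sum / values.length,
    zeros: values.filter((d) => d === 0).length,
    below: values.filter((d) => d < cutoff).length,
  };
}

// Made-up distances, for illustration only.
const exampleSummary = summarizePair({ FontA: 0, FontB: 0, FontC: 0.75 }, 0.4);
console.log(exampleSummary);
```

Keeping the per-font values rather than only the mean lets downstream tooling apply a cutoff that matches the font its users actually see.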

The output feeds directly into namespace-guard for runtime confusable detection in package names, domain names, and identifiers.

How it works

RaySpace casts parallel rays through font outlines at 36 angles and captures five layers of information per glyph: crossing counts, crossing positions, crossing angles, ping distances (stroke width at each crossing), and ping max (counter width between crossings). This produces a compact signature per character per font. Two signatures are compared with a weighted L1 distance across all five layers.
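
The comparison step can be sketched as follows. This is a minimal illustration of a weighted L1 distance over five layers, not the project's implementation: the `RaySignature` layout (one flat numeric vector per layer), the per-layer averaging, and the weights in `LAYER_WEIGHTS` are all assumptions.

```typescript
// Hypothetical signature layout: one flat numeric vector per layer.
// Layer names follow the description above; everything else is assumed.
interface RaySignature {
  crossingCounts: number[];
  crossingPositions: number[];
  crossingAngles: number[];
  pingDistances: number[];
  pingMax: number[];
}

type Layer = keyof RaySignature;

// Illustrative weights only; the real weights are not stated in this README.
const LAYER_WEIGHTS: Record<Layer, number> = {
  crossingCounts: 1.0,
  crossingPositions: 1.0,
  crossingAngles: 0.5,
  pingDistances: 0.5,
  pingMax: 0.5,
};

// Mean absolute difference between two equal-length vectors.
function meanAbsDiff(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("layer length mismatch");
  if (a.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum / a.length;
}

// Weighted L1 distance: sum over layers of weight * mean absolute difference.
function rayDistance(a: RaySignature, b: RaySignature): number {
  let total = 0;
  for (const layer of Object.keys(LAYER_WEIGHTS) as Layer[]) {
    total += LAYER_WEIGHTS[layer] * meanAbsDiff(a[layer], b[layer]);
  }
  return total;
}
```

Under this sketch, identical signatures give a distance of exactly 0, and each layer's contribution can be tuned independently through its weight.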

A three-stage filter cascade makes exhaustive comparison tractable:

Advance width filter (15% tolerance) eliminates pairs whose advance widths differ too much to be confusable. Removes 63% of candidates.
Ray comparison (threshold 2.0, tightened to 1.0 for large script pairs). Removes another 33% of the original candidates.
Only 3.3% of candidates survive to become discoveries.
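
The cascade above can be sketched as a cheap check followed by an expensive one. The constants come from the text; the `GlyphEntry` shape, the function names, and the choice to measure the width gap relative to the wider glyph are assumptions made for illustration.

```typescript
// Hypothetical per-glyph record; field names are assumptions for this sketch.
interface GlyphEntry {
  codepoint: number;
  advanceWidth: number; // in font units
}

const WIDTH_TOLERANCE = 0.15;           // 15% advance-width tolerance
const RAY_THRESHOLD = 2.0;              // default ray-distance threshold
const RAY_THRESHOLD_LARGE_SCRIPT = 1.0; // tightened threshold for large script pairs

// Stage 1: cheap width check. Measuring the gap relative to the wider glyph
// is an assumption; the text only states the 15% figure.
function widthsCompatible(a: GlyphEntry, b: GlyphEntry): boolean {
  const wider = Math.max(a.advanceWidth, b.advanceWidth);
  if (wider === 0) return true;
  return Math.abs(a.advanceWidth - b.advanceWidth) / wider <= WIDTH_TOLERANCE;
}

// Stage 2: run the costlier ray comparison only for pairs that pass stage 1.
// Returns the distance for surviving pairs, or null for rejected ones.
function scoreCandidate(
  a: GlyphEntry,
  b: GlyphEntry,
  rayDistance: (a: GlyphEntry, b: GlyphEntry) => number,
  largeScriptPair: boolean,
): number | null {
  if (!widthsCompatible(a, b)) return null;
  const threshold = largeScriptPair ? RAY_THRESHOLD_LARGE_SCRIPT : RAY_THRESHOLD;
  const d = rayDistance(a, b);
  return d < threshold ? d : null;
}
```

Ordering the stages this way means the width check, which needs only two numbers per pair, discards most candidates before any signature arithmetic runs.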

The signature bank (294,646+ entries across 245 fonts; more with --include-uppercase) is precomputed once. Discovery then runs as single-threaded arithmetic on the bank, completing 52.6 million pair comparisons in 31 minutes with no worker threads or GPU.

Quick start

npm install

# 1. Build the ray signature bank (prerequisite, ~24 min)
npx tsx scripts/build-signature-bank.ts

# 1b. Include uppercase Latin A-Z (optional)
#     By default the bank only includes IDNA PVALID codepoints (lowercase,
#     digits, symbols). Use --include-uppercase to add uppercase A-Z, useful
#     for font identification and trademark visual comparison where uppercase
#     glyph shapes matter. The builder is resumable, so this only computes
#     the additional codepoints.
npx tsx scripts/build-signature-bank.ts --include-uppercase

# 1c. Include arbitrary codepoint ranges (optional)
npx tsx scripts/build-signature-bank.ts --extra-range=0041-005A

# 2. Single-char discovery (22,581 chars, 12 scripts, ~36 min)
npx tsx scripts/discover-singlechar-sdf.ts --scorer=ray

# 3. Multi-char (bigram) discovery (676 bigrams, ~63 min)
npx tsx scripts/discover-multichar-sdf.ts --scorer=ray

# 4. Score known TR39 multi-char confusables (~5 min)
npx tsx scripts/score-multichar-sdf.ts --scorer=ray

Legacy SSIM pipeline

The original SSIM-based pipeline scored 26.5 million comparisons across 230 fonts. It remains functional but is superseded by RaySpace for all discovery and scoring tasks.

# TR39 confusable pair scoring
npx tsx scripts/build-index.ts          # Render index (~160s, 11,370 PNGs)
npx tsx scripts/score-all-pairs.ts      # Score all pairs (~65s, 235K comparisons)

# Novel confusable discovery
npx tsx scripts/build-candidates.ts          # Candidate set (~23K chars)
npx tsx scripts/build-index.ts --candidates  # Render candidates (~40min, 89K PNGs)
npx tsx scripts/score-candidates.ts          # Score against Latin targets (~15min, 2.9M comparisons)

# Extract high-scoring discoveries from both pipelines
npx tsx scripts/extract-discoveries.ts

What it found

Top confusable pairs (single-char, mean distance < 0.10)

Source	Target	Scripts	Mean	Fonts	Zeros
w U+0077	ԝ U+051D	Latin-Cyrillic	0.000	19	19
j U+006A	ϳ U+03F3	Latin-Greek	0.013	21	18
i U+0069	і U+0456	Latin-Cyrillic	0.018	62	50
s U+0073	ѕ U+0455	Latin-Cyrillic	0.018	62	46
c U+0063	с U+0441	Latin-Cyrillic	0.019	61	45
o U+006F	о U+043E	Latin-Cyrillic	0.020	61	44
j U+006A	ј U+0458	Latin-Cyrillic	0.021	60	48
x U+0078	х U+0445	Latin-Cyrillic	0.023	59	50
p U+0070	р U+0440	Latin-Cyrillic	0.024	61	46
e U+0065	е U+0435	Latin-Cyrillic	0.032	61	44
a U+0061	а U+0430	Latin-Cyrillic	0.042	61	45

"Zeros" = fonts where the outlines produce bit-identical ray signatures (distance 0.000). Latin w/Cyrillic ԝ is identical in all 19 fonts that contain both glyphs.

Cross-script breakthroughs (single-char)

Source	Target	Scripts	Mean	Fonts
ο U+03BF	ჿ U+10FF	Greek-Georgian	0.057	2
ヘ U+30D8	へ U+3078	Katakana-Hiragana	0.122	11
丶 U+4E36	ヽ U+30FD	Han-Katakana	0.125	9
二 U+4E8C	ニ U+30CB	Han-Katakana	0.249	11
口 U+53E3	ロ U+30ED	Han-Katakana	0.268	11

The Georgian labial sign ჿ (U+10FF) forms a four-way confusable ring with Latin o (U+006F), Cyrillic о (U+043E), and Greek omicron ο (U+03BF), all below distance 0.08.

Multi-char headline results

Bigram	Target	Mean	Fonts	Notes
ll	॥ U+0965 (Devanagari double danda)	0.176	8	Cross-script
oy	ѹ U+0479 (Cyrillic uk)	0.322	16	Novel cross-script bigram confusable
rn	m U+006D	0.531	95	33 fonts below 0.40
bl	ы U+044B (Cyrillic yeru)	0.797	49	Cross-script

The oy/Cyrillic uk discovery is the standout novel finding: the Latin bigram "oy" is identical to the Cyrillic digraph letter uk (ѹ) in Helvetica (distance 0.000) and all but identical in Arial Unicode MS (distance 0.0005).

Threshold calibration

Mean threshold	Single-char unique pairs	Multi-char unique pairs
< 0.50	138	13
< 1.00	4,174	1,631
< 1.50	59,700	(noise)
< 2.00	249,976	2,524,275

Three recommended operating tiers:

Strict (< 0.50): 138 single-char pairs, near-zero false positives. Suitable for automated blocking (IDN registration, package name validation) where false positives have real cost.
Standard (< 1.00): 4,174 single-char pairs, good balance of coverage and precision. Suitable for flagging and manual review in security tooling.
Exploratory (< 2.00): Full discovery set. Contains noise at the upper end but useful for research, font auditing, and building training sets.
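
As a sketch of how a consumer might apply these tiers, the helper below maps a pair's mean distance to the tightest tier it satisfies. The cutoffs are the ones listed above; the `Tier` type and the `tierFor` name are hypothetical.

```typescript
type Tier = "strict" | "standard" | "exploratory" | "none";

// Returns the tightest tier whose cutoff the mean distance falls under.
// Tiers are cumulative: a "strict" pair also belongs to the looser tiers.
function tierFor(meanDistance: number): Tier {
  if (meanDistance < 0.5) return "strict";
  if (meanDistance < 1.0) return "standard";
  if (meanDistance < 2.0) return "exploratory";
  return "none";
}

// Means taken from the multi-char table in this README.
console.log(tierFor(0.322)); // oy / Cyrillic uk
console.log(tierFor(0.531)); // rn / m
```

Because the cutoff applies to the mean, a pair such as rn/m can fall outside the strict tier overall while still being much closer in individual fonts.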

Font querying

Query which confusable pairs exist for a specific font. Useful for font designers shipping a new typeface, browser vendors evaluating a system font change, or anyone choosing a display font for security-sensitive contexts like IDN domains.

npx tsx scripts/query-font.ts --list-fonts                    # 218 fonts in discovery data
npx tsx scripts/query-font.ts "Arial"                         # All pairs for Arial (SSIM >= 0.7)
npx tsx scripts/query-font.ts "Arial" --threshold 0.8         # High-confidence only
npx tsx scripts/query-font.ts "Arial" --compare "Georgia"     # Diff two fonts by SSIM delta
npx tsx scripts/query-font.ts "Arial" --json                  # JSON for downstream processing

Font name matching is case-insensitive substring, so "arial" matches Arial, Arial Black, and Arial Unicode MS. Compare mode sorts by the biggest SSIM differences first, surfacing exactly which pairs get better or worse when switching fonts.
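
The matching and compare behaviour described above can be sketched as follows. This illustrates the stated rules only and is not the code in `scripts/query-font.ts`; the function names, the `PairDelta` shape, and the use of a `Map` keyed by pair are assumptions.

```typescript
// Case-insensitive substring match over font names.
function matchFonts(query: string, fontNames: string[]): string[] {
  const needle = query.toLowerCase();
  return fontNames.filter((name) => name.toLowerCase().includes(needle));
}

// Hypothetical compare result: SSIM change for one pair when switching fonts.
interface PairDelta {
  pair: string;
  delta: number; // SSIM in font B minus SSIM in font A
}

// Inputs map a pair key to its SSIM in each font (assumed layout).
function compareFonts(a: Map<string, number>, b: Map<string, number>): PairDelta[] {
  const deltas: PairDelta[] = [];
  a.forEach((ssimA, pair) => {
    const ssimB = b.get(pair);
    if (ssimB === undefined) return; // keep only pairs scored in both fonts
    deltas.push({ pair, delta: ssimB - ssimA });
  });
  // Largest absolute change first.
  return deltas.sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}

console.log(matchFonts("arial", ["Arial", "Arial Black", "Arial Unicode MS", "Georgia"]));
```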

Requires the discovery files from the scoring pipeline (gitignored, regenerate locally).

Output

Committed (CC-BY-4.0)

File	Description
`data/output/confusable-discoveries.json`	110 TR39 pairs with high SSIM (>= 0.7) or pixel-identical
`data/output/candidate-discoveries.json`	793 novel pairs not in TR39, mean SSIM >= 0.7
`data/output/confusable-weights.json`	1,397 weighted edges for namespace-guard integration
`data/output/cross-script-discoveries.json`	563 cross-script confusable pairs
`data/output/cross-script-summary.json`	Cross-script summary by script pair
`data/output/multichar-discoveries.json`	Multi-char confusable discoveries

Generated (gitignored, run pipeline to regenerate)

File	Description
`data/output/render-index/`	Render PNGs + index (SSIM pipeline)
`data/output/singlechar-sdf-scores.jsonl`	Single-char RaySpace scores
`data/output/multichar-rayspace-scores.jsonl`	Multi-char RaySpace scores
`data/output/signature-bank/`	Ray signature bank (294,646 entries, 7.9GB compressed)

Progress

Licence

Code (src/, scripts/): MIT
Generated data (data/output/): CC-BY-4.0. Free to use, share, and adapt for any purpose including commercial, with attribution.
Attribution: Paul Wood FRSA (@paultendo), confusable-vision
