unicode custom loading

GregRos edited this page Mar 26, 2025 · 1 revision

THIS IS OUT OF SCOPE.

Loading Unicode data can be customized using members from parjs/unicode.

sync constructor

parjs/unicode also exports a synchronous constructor for the parser, unicodeSync. Use it only when asynchronous loading is not an option, because it blocks until the Unicode data has loaded.

import {unicodeSync} from "parjs/unicode"
// This will load Unicode data synchronously, potentially blocking
// for a long time.
const p = unicodeSync()

loading binary data

By default, the Unicode data parjs needs is encoded as base64 and embedded in the source code. This is the case even if the async constructor is used. This approach has two drawbacks:

  1. parjs loads a relatively large dataset, much of which you might not actually need.
  2. base64 is not space-efficient, and decoding it at load time is an extra step that binary data avoids.
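
To put a number on the second drawback, here is a small sketch of the size arithmetic. Padded base64 encodes every 3 bytes as 4 characters, so the encoded text is about a third larger than the raw data. The function names and the 300000-byte payload are illustrative assumptions, not the real size of the parjs dataset.

```typescript
// Length of the padded base64 text for a payload of `byteLength` bytes:
// every group of up to 3 bytes becomes exactly 4 characters.
function base64Length(byteLength: number): number {
    return 4 * Math.ceil(byteLength / 3);
}

// Size of the base64 text relative to the raw bytes.
function base64Overhead(byteLength: number): number {
    return base64Length(byteLength) / byteLength;
}

// Illustrative payload size only; NOT the actual size of the parjs dataset.
const rawBytes = 300000;
console.log(base64Length(rawBytes));              // 400000
console.log(base64Overhead(rawBytes).toFixed(2)); // 1.33
```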

It’s possible to construct a custom Unicode dataset and load it as binary data instead. To do this, build the dataset with the construct-unicode script during development, then tell parjs how to load it with the configure function.

import {configure} from "parjs/unicode"
configure({
    loader: "binary",
    async load() {
        return fetch("/parjs.unicode.bin")
    }
})

This modifies how Unicode data is loaded globally. You’ll also need to implement a loadSync function if you want unicodeSync to work.
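
As a rough sketch of what supplying both loaders might look like in Node.js, the options object below reads the dataset from disk. Only loader, load, loadSync, and configure are named on this page. The BinaryLoaderOptions interface, the Uint8Array return types, and the binaryLoaderFromFile helper are assumptions made for illustration and may not match the actual parjs API.

```typescript
import { readFileSync } from "node:fs";
import { readFile } from "node:fs/promises";

// Hypothetical shape of the options object. The return types are an
// assumption; this page does not say what load/loadSync must return.
interface BinaryLoaderOptions {
    loader: "binary";
    load(): Promise<Uint8Array>;
    loadSync(): Uint8Array;
}

// Hypothetical helper: builds loader options that read the dataset from a file.
function binaryLoaderFromFile(path: string): BinaryLoaderOptions {
    return {
        loader: "binary",
        // Asynchronous read, intended for the async constructor.
        async load() {
            return readFile(path);
        },
        // Blocking read, intended for unicodeSync.
        loadSync() {
            return readFileSync(path);
        }
    };
}

const options = binaryLoaderFromFile("./parjs.unicode.bin");
// The object would then be passed to configure from "parjs/unicode":
// configure(options);
```

Nothing is read from disk until load or loadSync is called, so a missing file only surfaces when the data is first requested.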

construct-unicode

This script lets you construct a Unicode dataset, storing it directly as binary data in a file. You can then tell parjs how to fetch this file to construct the 🉑unicode parser.

Usage: parjs construct-unicode [options] <file>

Constructs a custom Unicode dataset, encoding it as a compressed binary file. If no options are specified, the default dataset will be used.

Most Unicode properties are used to process @prop=value character classes. However, the minimal Script and General_Category properties are needed to parse named character classes.

To leave them out of the dataset, pass the --no-minimal option. Note that this will cause runtime errors if you use some of the named character classes.

Arguments:
  file     Where to store the dataset.

Options:
  --no-minimal          Prevents loading the minimal properties.
  --props <props-list>  Comma-separated list of Unicode properties to load.
