Matcher

A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.

It is helpful for:

  • Precision and Recall: Word matching is a retrieval process. LOGICAL matching improves precision, while TEXT VARIATIONS matching improves recall.
  • Content Filtering: Detecting and filtering out offensive or sensitive words.
  • Search Engines: Improving search results by identifying relevant keywords.
  • Text Analysis: Extracting specific information from large volumes of text.
  • Spam Detection: Identifying spam content in emails or messages.
  • ···

Features

For detailed implementation, see the Design Document.

  • Text Transformation:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
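    • PinYin: Convert Chinese characters to space-separated Pinyin for fuzzy matching. Example: 西安 -> xi an, which matches 洗按 -> xi an but not a single character read as xian, such as 先
    • PinYinChar: Convert Chinese characters to Pinyin without syllable boundaries. Example: 西安 -> xian, which matches both 洗按 and a single character read as xian, such as 先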
  • AND OR NOT Word Matching:
    • Takes the number of repetitions of each word into account.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello but not helloo or hhello
  • Efficient Handling of Large Word Lists: Optimized for performance.
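
The features listed above can be sketched in plain Python. This is an illustration only and not the Matcher API: the function names (delete, normalize, pinyin, pinyin_char, parse_rule, rule_matches), the tiny PINYIN table, the use of Unicode NFKC for normalization, and the substring-counting semantics are all assumptions chosen to reproduce the examples in the list. The real library may use different tables and rules.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical Pinyin table covering only the characters used in the examples.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def delete(text):
    """Delete: drop every character that is not a letter or digit (assumed rule)."""
    return "".join(ch for ch in text if ch.isalnum())


def normalize(text):
    """Normalize: fold compatibility forms with NFKC, then lowercase (assumed rule)."""
    return unicodedata.normalize("NFKC", text).lower()


def pinyin(text):
    """PinYin: pad each syllable with spaces so syllable boundaries survive."""
    return "".join(f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in text)


def pinyin_char(text):
    """PinYinChar: concatenate syllables, discarding syllable boundaries."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


def parse_rule(rule):
    """Split a rule into required words (with counts) and forbidden words."""
    required, forbidden = Counter(), set()
    op = "&"  # the first word is treated as required
    for part in re.split(r"([&~])", rule):
        if part in ("&", "~"):
            op = part  # remember which operator precedes the next word
        elif part and op == "&":
            required[part] += 1  # a repeated word raises its required count
        elif part:
            forbidden.add(part)
    return required, forbidden


def rule_matches(rule, text):
    """True if each required word occurs often enough and no forbidden word occurs."""
    required, forbidden = parse_rule(rule)
    if any(word in text for word in forbidden):
        return False
    return all(text.count(word) >= n for word, n in required.items())


if __name__ == "__main__":
    print(delete("*Fu&*iii&^%%*&kkkk"))              # Fuiiikkkk
    print(pinyin("先") in pinyin("西安"))              # False: boundaries differ
    print(pinyin_char("先") in pinyin_char("西安"))    # True: boundaries dropped
    print(rule_matches("无&法&无&天", "无无法天"))       # True
    print(rule_matches("无&法&无&天", "无法天"))         # False
```

In this sketch a rule is read left to right: a word following & (or the first word) is required, and a word following ~ is forbidden. OR matching is not sketched here, since the list above gives no syntax for it.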

Rust Users

See the Rust README.

Python Users

See the Python README.

C, Java and Other Users

We provide a dynamic library to link against. See the C README and Java README.

Build from source

git clone https://github.com/Lips7/Matcher.git
cd Matcher
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cargo build --release

The build produces libmatcher_c.so (Linux), libmatcher_c.dylib (macOS), or matcher_c.dll (Windows) in the target/release directory.
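
As a rough aid for locating the build output from a script, the helper below picks the expected file name for the current platform. It is a sketch under stated assumptions: library_path is a hypothetical helper that is not part of Matcher, and it assumes the default target/release layout under the cloned repository.

```python
import sys
from pathlib import Path


def library_path(repo_root, platform=sys.platform):
    """Return the expected dynamic library path for the given platform."""
    if platform.startswith("win"):
        name = "matcher_c.dll"        # Windows
    elif platform == "darwin":
        name = "libmatcher_c.dylib"   # macOS
    else:
        name = "libmatcher_c.so"      # Linux and other Unix-like systems
    return Path(repo_root) / "target" / "release" / name


if __name__ == "__main__":
    path = library_path("Matcher")
    print(path, "(found)" if path.exists() else "(not built yet)")
```

The resulting path can then be handed to whichever loader your language uses. See the C README and Java README for the actual bindings.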

Benchmarks

Please refer to benchmarks for details.

License

Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT).
