A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
It's helpful for:
- Precision and Recall: Word matching is a retrieval process. LOGICAL matching improves precision, while TEXT VARIATIONS matching improves recall.
- Content Filtering: Detecting and filtering out offensive or sensitive words.
- Search Engines: Improving search results by identifying relevant keywords.
- Text Analysis: Extracting specific information from large volumes of text.
- Spam Detection: Identifying spam content in emails or messages.
- ···
For implementation details, see the Design Document.
- Text Transformation:
  - Fanjian: Simplify traditional Chinese characters to simplified ones.
    Example: `蟲艸` -> `虫艹`
  - Delete: Remove specific characters.
    Example: `*Fu&*iii&^%%*&kkkk` -> `Fuiiikkkk`
  - Normalize: Normalize special characters to identifiable characters.
    Example: `𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟!` -> `hello world!`
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching.
    Example: `西安` -> `xi an`, matches `洗按` -> `xi an`, but not `先` -> `xian`
  - PinYinChar: Convert Chinese characters to Pinyin.
    Example: `西安` -> `xian`, matches `洗按` and `先` -> `xian`
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: `hello&world` matches `hello world` and `world,hello`
  - Example: `无&法&无&天` matches `无无法天` (because `无` is repeated twice), but not `无法天`
  - Example: `hello~helloo~hhello` matches `hello` but not `helloo` and `hhello`
- Efficient Handling of Large Word Lists: Optimized for performance.
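
The `&` and `~` examples above, combined with a Delete-style transformation, can be sketched in a few lines of standard-library Rust. This is only an illustration of the semantics described in this list, not the crate's implementation or API: the function names (`delete_noise`, `count_occurrences`, `rule_matches`, `matches_after_delete`) are hypothetical. The choice to delete every character that is neither alphanumeric nor whitespace is an assumption, as is counting repetitions as non-overlapping substring occurrences. OR logic is not modeled.

```rust
use std::collections::HashMap;

/// Toy "Delete" transformation (assumption: the noise set is every character
/// that is neither alphanumeric nor whitespace).
fn delete_noise(text: &str) -> String {
    text.chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect()
}

/// Count non-overlapping occurrences of `word` in `text`.
fn count_occurrences(text: &str, word: &str) -> usize {
    if word.is_empty() {
        return 0;
    }
    text.matches(word).count()
}

/// Evaluate a rule of the form `a&b&a~x~y`:
/// every `&` word must occur at least as often as it is listed,
/// and no word introduced by `~` may occur at all.
fn rule_matches(rule: &str, text: &str) -> bool {
    let mut segments = rule.split('~');
    let and_part = segments.next().unwrap_or("");

    // Required number of occurrences per AND word (a repeated word raises the bar).
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_part.split('&').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }
    if required.is_empty() {
        return false;
    }

    let and_ok = required
        .iter()
        .all(|(word, &n)| count_occurrences(text, word) >= n);
    let not_hit = segments
        .filter(|w| !w.is_empty())
        .any(|w| text.contains(w));

    and_ok && !not_hit
}

/// Apply the toy Delete transformation first, then evaluate the rule.
fn matches_after_delete(rule: &str, text: &str) -> bool {
    rule_matches(rule, &delete_noise(text))
}
```

Under these assumptions, `rule_matches("无&法&无&天", "无无法天")` is true while `rule_matches("无&法&无&天", "无法天")` is false, and `matches_after_delete("hello&world", "h*e*l*l*o, w.o.r.l.d")` is true because the noise characters are removed before the rule is evaluated. Transforming first recovers text variations (recall), and the rule then narrows the hits (precision).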
See the Rust README.
See the Python README.
We provide a dynamic library for linking. See the C README and Java README.
git clone https://github.com/Lips7/Matcher.git
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cargo build --release

Then you should find `libmatcher_c.so`/`libmatcher_c.dylib`/`matcher_c.dll` in the `target/release` directory.
Please refer to benchmarks for details.