content similarity detection / deduplication #915
geeknik wants to merge 13 commits into projectdiscovery:dev
"Dynamic Scope" integration
"Dynamic Scope" integration
Add dynamic scope engine.
minor nit
@geeknik very interesting approach, I'm going to review this soon and compare it with BM25.
ehsandeep left a comment
- Refactored the code to integrate with the existing engine as an optional feature instead of creating a new engine.
- Added documentation to assist with understanding and testing the feature:
- Renamed options to better reflect the feature and added support for customizing the similarity threshold.
Options:
-sdd, -similarity-deduplication Enable content similarity detection to avoid crawling similar pages
-st, -similarity-threshold Set similarity threshold for content deduplication (range: 0.0–1.0, default: 0.1)
When will this feature be available online?
"Dynamic Scope" integration cuts back on data usage while crawling by using a TF-IDF machine learning model to discard pages that are likely too similar to pages already crawled. 👍🏻
For example, when running
katana -d 1 -u https://www.ibm.com/ -j -o ibm.json
the resulting ibm.json is about 11MB. Running the same command with Dynamic Scope, which adds -uds to the command line, drops ibm.json to about 3.4MB.
Crawl Fast, Crawl Smart. 🚀
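
To make the idea concrete, here is a minimal Go sketch of TF-IDF based near-duplicate filtering. It is not the code from this PR: the `Deduper` type, its methods, the whitespace tokenizer, and the smoothed IDF formula are all hypothetical choices made for illustration. The threshold here is assumed to be a cosine-similarity cutoff (pages at or above it are skipped), which may not match how katana interprets its own threshold option.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Deduper is a hypothetical helper, not katana's implementation.
// It remembers accepted pages and rejects pages that look too similar.
type Deduper struct {
	threshold float64              // assumed: cosine similarity at or above this counts as a duplicate
	docs      []map[string]float64 // raw term counts of each accepted page
	docFreq   map[string]int       // number of accepted pages containing each term
}

func NewDeduper(threshold float64) *Deduper {
	return &Deduper{threshold: threshold, docFreq: make(map[string]int)}
}

// termCounts lowercases the text and counts whitespace-separated tokens.
func termCounts(text string) map[string]float64 {
	counts := make(map[string]float64)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		counts[tok]++
	}
	return counts
}

// weight turns term counts into TF-IDF weights using a smoothed IDF:
// idf(t) = ln((1+N)/(1+df(t))) + 1, where N is the number of accepted pages.
func (d *Deduper) weight(counts map[string]float64) map[string]float64 {
	n := float64(len(d.docs))
	out := make(map[string]float64, len(counts))
	for t, tf := range counts {
		idf := math.Log((1+n)/(1+float64(d.docFreq[t]))) + 1
		out[t] = tf * idf
	}
	return out
}

// cosine returns the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for t, wa := range a {
		dot += wa * b[t]
		na += wa * wa
	}
	for _, wb := range b {
		nb += wb * wb
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// IsDuplicate reports whether text is too similar to an accepted page.
// Pages that are not duplicates are recorded for future comparisons.
func (d *Deduper) IsDuplicate(text string) bool {
	counts := termCounts(text)
	cand := d.weight(counts)
	for _, doc := range d.docs {
		if cosine(cand, d.weight(doc)) >= d.threshold {
			return true
		}
	}
	d.docs = append(d.docs, counts)
	for t := range counts {
		d.docFreq[t]++
	}
	return false
}

func main() {
	d := NewDeduper(0.8)
	fmt.Println(d.IsDuplicate("Welcome user please sign in to continue"))
	fmt.Println(d.IsDuplicate("welcome USER please sign in to continue"))
	fmt.Println(d.IsDuplicate("pricing plans enterprise contact sales"))
}
```

In a crawler, a check like this would run on each fetched response body before its links are queued, so a page judged to be a near-duplicate contributes neither output nor further requests. Comparing against every accepted page is linear in the number of pages kept, so a real implementation would likely need some form of indexing or sampling.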