feat: implement doctrine-based fulltext search #2118
benjaminfrueh wants to merge 1 commit into main
Conversation
  "symfony/string": "^6.0",
  "symfony/translation-contracts": "^3.6",
- "teamtnt/tntsearch": "^5.0"
+ "wamania/php-stemmer": "^4.0"
We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code
Maybe worth considering scoping the dependency to avoid conflicts between apps that bundle different versions: https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper
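To make the scoping idea concrete: tools like php-scoper rewrite the library's namespace so that each app ships its own prefixed copy. The sketch below is a toy regex illustration of that effect (a real scoper works on the PHP AST and handles many more cases); the `OCA\Collectives\Vendor` prefix is an assumed example, not an actual namespace in this PR.

```python
import re

def prefix_namespace(php_source: str, vendor: str = "Wamania",
                     prefix: str = "OCA\\Collectives\\Vendor") -> str:
    # Prefix `namespace`/`use` statements that reference the vendor library,
    # so two apps can bundle different versions without class-name collisions.
    pattern = r'\b(namespace|use)\s+' + re.escape(vendor) + r'\\'
    return re.sub(pattern,
                  lambda m: f"{m.group(1)} {prefix}\\{vendor}\\",
                  php_source)
```

Statements referencing other namespaces (e.g. `OCP\`) are left untouched, which is the point: only the bundled dependency gets isolated.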
Force-pushed from 07cbae6 to d8a4b06
mejo-
left a comment
Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆
The general approach of your implementation looks really clean and promising to me.
I have quite a few comments and questions and am curious what you think about them.
 * @method int getHitCount()
 * @method void setHitCount(int $value)
 */
class SearchDoc extends Entity {
So far we tried to name the objects similar to the table names. So maybe either rename the tables to collectives_search_* or rename the object classes to FtsDoc and so on? 🤔
As discussed, we will prefix them with collectives_s_ for now, to be consistent with the other tables. Nextcloud 32 support currently limits the allowed table and index name length.
use OCP\AppFramework\Db\QBMapper;
use OCP\DB\QueryBuilder\IQueryBuilder;
use OCP\IDBConnection;
use OCP\Snowflake\IGenerator;
This will not be compatible with Nextcloud 32, which we still intend to support with the main Collectives release as long as it's officially supported. Maybe we should stick with normal autoincrement IDs for now and migrate all IDs to snowflake IDs later, after we have dropped Nextcloud 32 support. What do you think? We probably have to do this for other DB tables anyway.
I changed it to autoincrement IDs for now.
public function stem(string $word): string {
	if ($this->stemmer === null && $this->stemmingEnabled) {
		$language = $this->config->getSystemValue('default_language', 'en');
I'm not sure whether using the instance's default language is the best option here. It effectively means that stemming only happens for the instance's default language, right? I guess this would be much more powerful if the language of the indexed document was used instead.
Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? Nothing that necessarily needs to happen in this PR, but I'd still be curious about your thoughts.
Language detection per document seems like the only proper solution. It comes with a small performance cost for indexing, but that should be fine. I found that there are language detection libraries, like https://github.com/patrickschur/language-detection, which could be used, what do you think?
We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.
There are possible edge cases where a document's language changes over time or a document contains mixed languages, but we would just keep the first language we detect.
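For intuition on how cheap a rough per-document guess can be, here is a naive stopword-counting sketch. This is not the patrickschur/language-detection algorithm (which uses n-gram profiles); the tiny stopword lists and the `guess_language` name are illustrative assumptions only.

```python
# Toy per-document language guess: count hits against small stopword lists
# and fall back to a default when nothing matches.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "von", "zu"},
}

def guess_language(text: str, default: str = "en") -> str:
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

A real library is far more robust, but even this sketch shows why detection at index time is affordable: it is a single linear pass over text that the indexer already tokenizes anyway.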
namespace OCA\Collectives\Search\FileSearch\Stemmer;

use OCP\IConfig;
use Wamania\Snowball\NotFoundException;
I like that we use a third-party stemmer library. I compared the supported languages: TNTSearch supports the following languages that Wamania php-stemmer does not: Arabic, Croatian, Latvian, Polish and Ukrainian, plus a PorterStemmer whose purpose I'm not sure about.
Personally I think having support for Arabic would be nice, but it seems easy enough to add that to Wamania in a PR and we should consider it as a follow-up task for now.
The PorterStemmer should be the default English stemmer, named after Martin Porter.
I think the Wamania stemmer library is the commonly used one and is already used in other Nextcloud apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code
We can add a PR for an Arabic stemmer as a follow-up task.
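For readers unfamiliar with stemming: the idea is to strip inflectional suffixes so that "searching", "searched" and "search" all map to one index term. The sketch below is a drastically simplified, assumed illustration of suffix-stripping; the real Porter/Snowball algorithms used by Wamania php-stemmer apply many more rules and measure-based conditions.

```python
# Minimal suffix-stripping stemmer sketch (NOT the real Porter algorithm).
# Rules are tried in order; a rule only fires if enough stem remains.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

This already shows why stemming is language-specific: the suffix inventory above only makes sense for English, which is exactly why the per-document language question above matters.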
public function tokenize(string $text): array {
	$text = mb_strtolower($text);
	$text = preg_replace('/([^\p{L}\p{N}@])+/u', ' ', $text);
The tokenizer replace regex of TNTSearch also includes _ and - as characters that should not be replaced:

$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

Was it on purpose that you removed these two characters? Just asking out of curiosity.
I have seen both versions for word tokenizers, but you are correct: current TNTSearch does not split words at - and _, so it indexes them as single words. I will add them back to the pattern in favour of making the search results more predictable. It is a trade-off: for some words splitting helps to find them (e.g. well-known can then be found by searching for known), while for others splitting would be bad (e.g. e-mail, where the e would not even get indexed).
I also added the stopword option back to the tokenizer, just in case we want to remove common, meaningless words from the index.
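The trade-off discussed above is easy to demonstrate. The sketch below approximates the two regex variants in Python; note it uses an ASCII-oriented simplification of the Unicode classes `\p{L}\p{N}\p{Pc}\p{Pd}` (Python's `re` lacks `\p{...}`), so treat it as an illustration, not the PR's actual pattern.

```python
import re

def tokenize(text: str, keep_hyphen_underscore: bool = True) -> list[str]:
    text = text.lower()
    if keep_hyphen_underscore:
        # Rough stand-in for /[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u:
        # \w already covers letters, digits and underscore.
        pattern = r'[^\w@\-]+'
    else:
        # Rough stand-in for the variant without \p{Pc}\p{Pd}.
        pattern = r'[^a-z0-9@]+'
    return [t for t in re.split(pattern, text) if t]
```

With hyphens kept, "e-mail" and "well-known" survive as single index terms; without them, "e-mail" shatters into "e" and "mail", and the lone "e" would likely be dropped by any minimum-length filter.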
foreach ($terms as $term => $hitCount) {
	try {
		$term = substr((string)$term, 0, 50);
Maybe better use mb_substr() here as well to avoid truncating in the middle of a multi-byte character.
Thanks for the feedback, updated it.
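The failure mode behind this suggestion is worth spelling out: byte-based truncation can cut a multi-byte UTF-8 character in half, leaving an invalid byte sequence in the index. The Python sketch below mimics the difference between PHP's `substr()` (bytes) and `mb_substr()` (characters):

```python
def byte_truncate(s: str, n: int) -> str:
    # Mimics PHP substr() on a UTF-8 string: cuts at a byte boundary,
    # possibly splitting a multi-byte character (shown here as U+FFFD).
    return s.encode("utf-8")[:n].decode("utf-8", errors="replace")

def char_truncate(s: str, n: int) -> str:
    # Mimics PHP mb_substr(): cuts at a character boundary, always valid UTF-8.
    return s[:n]
```

For pure ASCII the two behave identically, which is why bugs like this tend to surface only once non-English content is indexed.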
$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
$qb->update($this->tableName)
	->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))
I wonder whether we want to safeguard against negative values here. Especially as the fields are unsigned integers, there's a risk of integer underflow (wraparound), right?
Thanks for the feedback, updated it.
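The fix being asked for is essentially a clamp at zero before the unsigned column is written. A minimal sketch of that invariant (the SQL phrasing in the comment is an assumption about one possible implementation, e.g. `GREATEST(num_hits - :delta, 0)` on databases that support it):

```python
def decrement_hits(num_hits: int, hit_count: int) -> int:
    # Clamp at zero so an unsigned DB column can never wrap around to a
    # huge positive value when more hits are removed than were recorded.
    return max(num_hits - hit_count, 0)
```

Clamping in the UPDATE expression itself (rather than in PHP) also keeps the operation safe under concurrent decrements.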
protected FileIndexer $indexer;
class FileSearcher {
	private const DEFAULT_LIMIT = 15;
	private const FUZZY_PREFIX_LENGTH = 2;
A prefix length of 2 seems pretty short and probably results in many candidates in many cases. How about increasing this to 3 or 4?
Thanks for the feedback, I increased it to 3 for now.
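To see why the prefix length matters: fuzzy matching typically first narrows the term dictionary to terms sharing the query's first N characters, and only runs the expensive similarity comparison on that candidate set. A hedged sketch of that pre-filter (the function name and candidate list are illustrative, not code from this PR):

```python
def fuzzy_candidates(term: str, index_terms: list[str],
                     prefix_len: int = 3) -> list[str]:
    # Pre-filter for fuzzy search: only terms sharing the first `prefix_len`
    # characters are kept for the expensive edit-distance step. A longer
    # prefix shrinks the candidate set, at the cost of missing typos that
    # occur within the first characters of the word.
    prefix = term[:prefix_len]
    return [t for t in index_terms if t.startswith(prefix)]
```

With a two-character prefix, every "se…" term in the dictionary becomes a candidate; at three characters the set collapses to the genuinely close terms, which is the effect the review comment is after.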
$tokens = [];
$lastWord = '';
foreach ($words as $word) {
	if (strlen($word) <= 3) {
-	if (strlen($word) <= 3) {
+	if (mb_strlen($word) <= 3) {
Thanks for the feedback, updated it.
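Why this one-character suggestion matters: PHP's `strlen()` counts bytes while `mb_strlen()` counts characters, and for non-ASCII words the two disagree, so a byte-based `<= 3` short-word check misclassifies words like "süß". A quick Python demonstration of the same distinction:

```python
# "süß" is 3 characters but 5 UTF-8 bytes (ü and ß are 2 bytes each),
# so a byte-based length check would treat it as a "long" word.
word = "süß"
byte_len = len(word.encode("utf-8"))  # analogous to PHP strlen():    5
char_len = len(word)                  # analogous to PHP mb_strlen(): 3
```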
$this->indexer->setLanguage($this->language ?? self::UNSUPPORTED_LANGUAGE);
$scored = [];
foreach ($files as $file) {
	$content = mb_strtolower($file->getContent());
This looks like a potential performance problem to me, as it will re-read the contents of all matched files. I don't know what to do about it, though. The ranking by bigrams seems important, as it makes sure that searches for several words rank results higher when these words appear next to each other. So dropping the sorting by bigrams is not an option in my opinion.
We could add a fourth DB table and compute bigram phrase counts in it at index time. But that would probably drastically increase the database size and might be a bit over-engineered.
I agree that this could be a performance issue, but as you said, there is no good solution that wouldn't bloat the DB considerably and also slow down indexing. The performance impact should be limited by the DEFAULT_LIMIT of 15 documents, which are first ranked by hits and then re-ranked by bigrams.
TNTSearch offered bigram and various n-gram tokenizers; none of them were configured or previously used for Collectives, and they would increase the database size drastically (they store the word pairs alongside the single words).
I think we should test the performance impact with a large dataset first and then decide on this.
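The re-ranking step under discussion can be sketched in a few lines: after the hit-count query returns the top DEFAULT_LIMIT documents, each one is scored by how often adjacent query-word pairs occur next to each other in its content. This is an assumed simplification of the PR's actual scoring, meant only to show why the document content must be re-read:

```python
def bigram_score(query_words: list[str], content: str) -> int:
    # Count occurrences of each adjacent query-word pair ("bigram") in the
    # lowercased document content; documents where the query words appear
    # side by side score higher than documents with scattered matches.
    content = content.lower()
    return sum(content.count(f"{a} {b}")
               for a, b in zip(query_words, query_words[1:]))
```

Because the score depends on word adjacency inside the text, it cannot be derived from the per-term hit counts alone, which is exactly the content-re-reading cost weighed against a precomputed bigram table above.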
Force-pushed from d8a4b06 to b26823e
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
Force-pushed from b26823e to 2ee83e2
📝 Summary
Replaces TNTSearch with a Nextcloud database full-text search using Doctrine.
Resolves #2050
🏁 Checklist
npm run lint / npm run stylelint / composer run cs:check