feat: implement doctrine-based fulltext search #2118
benjaminfrueh wants to merge 1 commit into main
Conversation
  "symfony/string": "^6.0",
  "symfony/translation-contracts": "^3.6",
- "teamtnt/tntsearch": "^5.0"
+ "wamania/php-stemmer": "^4.0"
We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code
Maybe worth considering scoping the dependency to avoid conflicts between apps that bundle different versions: https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper
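To make the scoping idea concrete: tools like php-scoper rewrite the library's namespace so that each app ships its own prefixed copy. The sketch below is a toy regex illustration of that effect (a real scoper works on the PHP AST and handles many more cases); the `OCA\Collectives\Vendor` prefix is an assumed example, not an actual namespace in this PR.

```python
import re

def prefix_namespace(php_source: str, vendor: str = "Wamania",
                     prefix: str = "OCA\\Collectives\\Vendor") -> str:
    # Prefix `namespace`/`use` statements that reference the vendor library,
    # so two apps can bundle different versions without class-name collisions.
    pattern = r'\b(namespace|use)\s+' + re.escape(vendor) + r'\\'
    return re.sub(pattern,
                  lambda m: f"{m.group(1)} {prefix}\\{vendor}\\",
                  php_source)
```

Statements referencing other namespaces (e.g. `OCP\`) are left untouched, which is the point: only the bundled dependency gets isolated.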
Force-pushed from 07cbae6 to d8a4b06
mejo-
left a comment
Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆
The general approach of your implementation looks really clean and promising to me.
I have quite a few comments and questions and am curious what you think about them.
 * @method int getHitCount()
 * @method void setHitCount(int $value)
 */
class SearchDoc extends Entity {
So far we tried to name the objects similar to the table names. So maybe either rename the tables to collectives_search_* or rename the object classes to FtsDoc and so on? 🤔
As discussed, we will prefix them with collectives_s_ for now, to be consistent with the other tables. Nextcloud 32 support currently limits the allowed table and index name length.
use OCP\AppFramework\Db\QBMapper;
use OCP\DB\QueryBuilder\IQueryBuilder;
use OCP\IDBConnection;
use OCP\Snowflake\IGenerator;
This will not be compatible with Nextcloud 32, which we still intend to support with the main Collectives release as long as it's officially supported. Maybe we should stick with normal autoincrement IDs for now and migrate all IDs to snowflake IDs later, after we have dropped Nextcloud 32 support. What do you think? We probably have to do this for other DB tables anyway.
I changed it to autoincrement IDs for now.
public function stem(string $word): string {
	if ($this->stemmer === null && $this->stemmingEnabled) {
		$language = $this->config->getSystemValue('default_language', 'en');
I'm not sure whether using the instance's default language is the best option here. It effectively means that stemming only happens for the instance's default language, right? I guess this would be much more powerful if the language of the indexed document was used instead.
Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? Nothing that necessarily needs to happen in this PR, but I'd still be curious about your thoughts.
Language detection per document seems like the only proper solution. It comes with a small performance cost for indexing, but that should be fine. I found that there are language detection libraries, like https://github.com/patrickschur/language-detection, which could be used, what do you think?
We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.
There are possible edge cases where a document's language changes over time or a document contains mixed languages, but we would just keep the first language we detect.
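For intuition on how cheap a rough per-document guess can be, here is a naive stopword-counting sketch. This is not the patrickschur/language-detection algorithm (which uses n-gram profiles); the tiny stopword lists and the `guess_language` name are illustrative assumptions only.

```python
# Toy per-document language guess: count hits against small stopword lists
# and fall back to a default when nothing matches.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "von", "zu"},
}

def guess_language(text: str, default: str = "en") -> str:
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

A real library is far more robust, but even this sketch shows why detection at index time is affordable: it is a single linear pass over text that the indexer already tokenizes anyway.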
namespace OCA\Collectives\Search\FileSearch\Stemmer;

use OCP\IConfig;
use Wamania\Snowball\NotFoundException;
I like that we use a third-party stemmer library. I compared the supported languages: TNTSearch supports the following languages that Wamania php-stemmer does not: Arabic, Croatian, Latvian, Polish and Ukrainian, plus a PorterStemmer whose purpose I'm not sure about.
Personally I think having support for Arabic would be nice, but it seems easy enough to add that to Wamania in a PR and we should consider it as a follow-up task for now.
The PorterStemmer should be the default English stemmer, named after Martin Porter.
I think the Wamania stemmer library is the commonly used one and is already used in other Nextcloud apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code
We can add a PR for an Arabic stemmer as a follow-up task.
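For readers unfamiliar with stemming: the idea is to strip inflectional suffixes so that "searching", "searched" and "search" all map to one index term. The sketch below is a drastically simplified, assumed illustration of suffix-stripping; the real Porter/Snowball algorithms used by Wamania php-stemmer apply many more rules and measure-based conditions.

```python
# Minimal suffix-stripping stemmer sketch (NOT the real Porter algorithm).
# Rules are tried in order; a rule only fires if enough stem remains.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

This already shows why stemming is language-specific: the suffix inventory above only makes sense for English, which is exactly why the per-document language question above matters.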
public function tokenize(string $text): array {
	$text = mb_strtolower($text);
	$text = preg_replace('/([^\p{L}\p{N}@])+/u', ' ', $text);
The tokenizer replace regex of TNTSearch also includes _ and - as characters that should not be replaced:

$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

Was it on purpose that you removed these two characters? Just asking out of curiosity.
I have seen both versions for word tokenizers, but you are correct: current TNTSearch does not split words at - and _, so it indexes them as single words. I will add them back to the pattern in favour of making the search results more predictable. It is a trade-off: for some words splitting helps to find them (e.g. well-known can then be found by searching for known), while for others splitting would be bad (e.g. e-mail, where the e would not even get indexed).
I also added the stopword option back to the tokenizer, just in case we want to remove common, meaningless words from the index.
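The trade-off discussed above is easy to demonstrate. The sketch below approximates the two regex variants in Python; note it uses an ASCII-oriented simplification of the Unicode classes `\p{L}\p{N}\p{Pc}\p{Pd}` (Python's `re` lacks `\p{...}`), so treat it as an illustration, not the PR's actual pattern.

```python
import re

def tokenize(text: str, keep_hyphen_underscore: bool = True) -> list[str]:
    text = text.lower()
    if keep_hyphen_underscore:
        # Rough stand-in for /[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u:
        # \w already covers letters, digits and underscore.
        pattern = r'[^\w@\-]+'
    else:
        # Rough stand-in for the variant without \p{Pc}\p{Pd}.
        pattern = r'[^a-z0-9@]+'
    return [t for t in re.split(pattern, text) if t]
```

With hyphens kept, "e-mail" and "well-known" survive as single index terms; without them, "e-mail" shatters into "e" and "mail", and the lone "e" would likely be dropped by any minimum-length filter.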
foreach ($terms as $term => $hitCount) {
	try {
		$term = substr((string)$term, 0, 50);
Maybe better use mb_substr() here as well to avoid truncating in the middle of a multi-byte character.
Thanks for the feedback, updated it.
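The failure mode behind this suggestion is worth spelling out: byte-based truncation can cut a multi-byte UTF-8 character in half, leaving an invalid byte sequence in the index. The Python sketch below mimics the difference between PHP's `substr()` (bytes) and `mb_substr()` (characters):

```python
def byte_truncate(s: str, n: int) -> str:
    # Mimics PHP substr() on a UTF-8 string: cuts at a byte boundary,
    # possibly splitting a multi-byte character (shown here as U+FFFD).
    return s.encode("utf-8")[:n].decode("utf-8", errors="replace")

def char_truncate(s: str, n: int) -> str:
    # Mimics PHP mb_substr(): cuts at a character boundary, always valid UTF-8.
    return s[:n]
```

For pure ASCII the two behave identically, which is why bugs like this tend to surface only once non-English content is indexed.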
$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
$qb->update($this->tableName)
	->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))
I wonder whether we want to safeguard against negative values here. Especially as the fields are unsigned integers, there's a risk of integer underflow (wraparound), right?
Thanks for the feedback, updated it.
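The fix being asked for is essentially a clamp at zero before the unsigned column is written. A minimal sketch of that invariant (the SQL phrasing in the comment is an assumption about one possible implementation, e.g. `GREATEST(num_hits - :delta, 0)` on databases that support it):

```python
def decrement_hits(num_hits: int, hit_count: int) -> int:
    # Clamp at zero so an unsigned DB column can never wrap around to a
    # huge positive value when more hits are removed than were recorded.
    return max(num_hits - hit_count, 0)
```

Clamping in the UPDATE expression itself (rather than in PHP) also keeps the operation safe under concurrent decrements.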
protected FileIndexer $indexer;
class FileSearcher {
	private const DEFAULT_LIMIT = 15;
	private const FUZZY_PREFIX_LENGTH = 2;
A prefix length of 2 seems pretty short and probably results in many candidates in many cases. How about increasing this to 3 or 4?
Thanks for the feedback, I increased it to 3 for now.
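To see why the prefix length matters: fuzzy matching typically first narrows the term dictionary to terms sharing the query's first N characters, and only runs the expensive similarity comparison on that candidate set. A hedged sketch of that pre-filter (the function name and candidate list are illustrative, not code from this PR):

```python
def fuzzy_candidates(term: str, index_terms: list[str],
                     prefix_len: int = 3) -> list[str]:
    # Pre-filter for fuzzy search: only terms sharing the first `prefix_len`
    # characters are kept for the expensive edit-distance step. A longer
    # prefix shrinks the candidate set, at the cost of missing typos that
    # occur within the first characters of the word.
    prefix = term[:prefix_len]
    return [t for t in index_terms if t.startswith(prefix)]
```

With a two-character prefix, every "se…" term in the dictionary becomes a candidate; at three characters the set collapses to the genuinely close terms, which is the effect the review comment is after.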
$tokens = [];
$lastWord = '';
foreach ($words as $word) {
	if (strlen($word) <= 3) {
-	if (strlen($word) <= 3) {
+	if (mb_strlen($word) <= 3) {
Thanks for the feedback, updated it.
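Why this one-character suggestion matters: PHP's `strlen()` counts bytes while `mb_strlen()` counts characters, and for non-ASCII words the two disagree, so a byte-based `<= 3` short-word check misclassifies words like "süß". A quick Python demonstration of the same distinction:

```python
# "süß" is 3 characters but 5 UTF-8 bytes (ü and ß are 2 bytes each),
# so a byte-based length check would treat it as a "long" word.
word = "süß"
byte_len = len(word.encode("utf-8"))  # analogous to PHP strlen():    5
char_len = len(word)                  # analogous to PHP mb_strlen(): 3
```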
$this->indexer->setLanguage($this->language ?? self::UNSUPPORTED_LANGUAGE);
$scored = [];
foreach ($files as $file) {
	$content = mb_strtolower($file->getContent());
This looks like a potential performance problem to me, as it will re-read the contents of all matched files. I don't know what to do about it, though. The ranking by bigrams seems important, as it makes sure that searches for several words rank results higher when these words appear next to each other. So dropping the sorting by bigrams is not an option in my opinion.
We could add a fourth DB table and compute bigram phrase counts in it at index time. But that would probably drastically increase the database size and might be a bit over-engineered.
I agree that this could be a performance issue, but as you said, there is no good solution that wouldn't bloat the DB considerably and also slow down indexing. The performance impact should be limited by the DEFAULT_LIMIT of 15 documents, which are first ranked by hits and then re-ranked by bigrams.
TNTSearch offered bigram and various n-gram tokenizers; none of them were configured or previously used for Collectives, and they would increase the database size drastically (they store the word pairs alongside the single words).
I think we should test the performance impact with a large dataset first and then decide on this.
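The re-ranking step under discussion can be sketched in a few lines: after the hit-count query returns the top DEFAULT_LIMIT documents, each one is scored by how often adjacent query-word pairs occur next to each other in its content. This is an assumed simplification of the PR's actual scoring, meant only to show why the document content must be re-read:

```python
def bigram_score(query_words: list[str], content: str) -> int:
    # Count occurrences of each adjacent query-word pair ("bigram") in the
    # lowercased document content; documents where the query words appear
    # side by side score higher than documents with scattered matches.
    content = content.lower()
    return sum(content.count(f"{a} {b}")
               for a, b in zip(query_words, query_words[1:]))
```

Because the score depends on word adjacency inside the text, it cannot be derived from the per-term hit counts alone, which is exactly the content-re-reading cost weighed against a precomputed bigram table above.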
Force-pushed from d8a4b06 to b26823e
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
Force-pushed from b26823e to 2ee83e2
📝 Summary
Replaces TNTSearch with a Nextcloud database full-text search using Doctrine.
Resolves #2050
🏁 Checklist
npm run lint / npm run stylelint / composer run cs:check