
feat: implement doctrine-based fulltext search #2118

Draft

benjaminfrueh wants to merge 1 commit into main from feat/doctrine-fulltext-search

Conversation


@benjaminfrueh commented Nov 11, 2025

📝 Summary

Replaces TNTSearch with a Nextcloud database full-text search built on Doctrine.

Resolves #2050

🏁 Checklist

  • Code is properly formatted (npm run lint / npm run stylelint / composer run cs:check)
  • Sign-off message is added to all commits
  • Tests (unit, integration and/or end-to-end) passing and the changes are covered with tests
  • Documentation (README or documentation) has been updated or is not required

"symfony/string": "^6.0",
"symfony/translation-contracts": "^3.6",
"teamtnt/tntsearch": "^5.0"
"wamania/php-stemmer": "^4.0"
Member

We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code

Maybe worth considering scoping the dependency to avoid conflicts between different versions across apps: https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper

@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 07cbae6 to d8a4b06 Compare November 11, 2025 22:28
@github-actions

Hello there,
Thank you so much for taking the time and effort to create a pull request to our Nextcloud project.

We hope that the review process is going smoothly and is helpful for you. We want to ensure your pull request is reviewed to your satisfaction. If you have a moment, our community management team would very much appreciate your feedback on your experience with this PR review process.

Your feedback is valuable to us as we continuously strive to improve our community developer experience. Please take a moment to complete our short survey by clicking on the following link: https://cloud.nextcloud.com/apps/forms/s/i9Ago4EQRZ7TWxjfmeEpPkf6

Thank you for contributing to Nextcloud and we hope to hear from you soon!

(If you believe you should not receive this message, you can add yourself to the blocklist.)

@github-project-automation github-project-automation bot moved this to 🧭 Planning evaluation (don't pick) in 📝 Productivity team Mar 5, 2026
@benjaminfrueh benjaminfrueh moved this from 🧭 Planning evaluation (don't pick) to 🏗️ In progress in 📝 Productivity team Mar 5, 2026
Member

@mejo- left a comment

Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆

The general approach of your implementation looks really clean and promising to me.

I have quite a few comments and questions and am curious what you think about them.

* @method int getHitCount()
* @method void setHitCount(int $value)
*/
class SearchDoc extends Entity {
Member

So far we tried to name the objects similar to the table names. So maybe either rename the tables to collectives_search_* or rename the object classes to FtsDoc and so on? 🤔

Contributor Author

As discussed, we will prefix them with collectives_s_ for now, to be consistent with the other tables. Support for stable32 currently limits the table and index name length.

use OCP\AppFramework\Db\QBMapper;
use OCP\DB\QueryBuilder\IQueryBuilder;
use OCP\IDBConnection;
use OCP\Snowflake\IGenerator;
Member

This will not be compatible with Nextcloud 32, which we still intend to support with the main Collectives release as long as it's officially supported. Maybe we should stick with normal autoincrement IDs for now and migrate all IDs to snowflake IDs at a later point after we dropped Nextcloud 32 support, what do you think? We probably have to do this in other DB tables as well anyway.

Contributor Author

I changed it to autoincrement IDs for now.


public function stem(string $word): string {
if ($this->stemmer === null && $this->stemmingEnabled) {
$language = $this->config->getSystemValue('default_language', 'en');
Member

I'm not sure whether it's the best option to use the instance's default language here. This effectively means that stemming only happens for the instance's default language, right? I guess this would be much more powerful if the language of the indexed document was used here.

Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? Nothing that necessarily needs to happen in this PR, but I'd still be curious about your thoughts.

Contributor Author

Language detection per document seems like the only proper solution. It comes with a small performance downside for indexing, but that should be fine. I found that there are language detection libraries, like https://github.com/patrickschur/language-detection, which could be used. What do you think?

We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.

There are possible edge cases where a document's language changes or a document mixes languages, but we would just have to save the first language we detect.
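As a rough sketch of how that could look: the indexer asks a detector for the document language once at index time and falls back to a default. The function and stub below are hypothetical, not code from this PR; a real implementation could plug in patrickschur/language-detection behind the callable.

```php
<?php
// Hypothetical sketch: detect a document's language once at index time.
// $detector is any callable returning [languageCode => score], e.g. a
// thin wrapper around patrickschur/language-detection.
function detectDocumentLanguage(callable $detector, string $content, string $fallback = 'en'): string {
    // Only sample a prefix so indexing stays fast on large documents.
    $scores = $detector(mb_substr($content, 0, 2000));
    if ($scores === []) {
        return $fallback;
    }
    arsort($scores); // highest score first, keys preserved
    return (string)array_key_first($scores);
}
```

The detected language could then be stored in the collectives_s_files table, as discussed above, so the matching stemmer is picked per document.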

namespace OCA\Collectives\Search\FileSearch\Stemmer;

use OCP\IConfig;
use Wamania\Snowball\NotFoundException;
Member

I like that we use a third-party stemmer library. I compared the supported languages: TNTSearch supports the following languages that Wamania php-stemmer does not: Arabic, Croatian, Latvian, Polish and Ukrainian, plus a PorterStemmer where I'm not sure about its purpose.

Personally I think having support for Arabic would be nice, but it seems easy enough to add that to Wamania in a PR and we should consider it as a follow-up task for now.

Contributor Author

The PorterStemmer should be the default English stemmer, named after Martin Porter.
I think the Wamania stemmer library is the commonly used one and is already used in other Nextcloud apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code

We can add a PR for an Arabic stemmer as a follow-up task.


public function tokenize(string $text): array {
$text = mb_strtolower($text);
$text = preg_replace('/([^\p{L}\p{N}@])+/u', ' ', $text);
Member

The tokenizer replace regex of TNTSearch also includes _ and - as characters that should not be replaced:

$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

Was it on purpose that you removed these two characters? Just asking out of curiosity.

Contributor Author

I have seen both variants of word tokenizers, but you are correct: current TNTSearch does not split words at - and _, so it indexes them as single words. I will add both back to the pattern in favour of making the search results more explicit. It depends on the word: there are words where splitting makes sense to find them better (e.g. well-known can then be found by searching for known) and words where splitting would be bad (e.g. e-mail, where the e would not even get indexed).

I also added the stopword option back to the tokenizer, just in case we want to remove common and meaningless words from the index.
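For illustration, a minimal sketch of such a tokenizer with the TNTSearch-style pattern that keeps underscores (\p{Pc}) and hyphens (\p{Pd}) inside tokens; the function name and stopword handling here are illustrative, not the exact implementation in this PR:

```php
<?php
// Illustrative tokenizer sketch: runs of characters that are neither
// letters, digits, connector/dash punctuation nor @ become a single
// space, then we split on spaces and drop empty tokens and stopwords.
function tokenize(string $text, array $stopwords = []): array {
    $text = mb_strtolower($text);
    $text = preg_replace('/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u', ' ', $text);
    $tokens = array_filter(
        explode(' ', trim($text)),
        fn (string $t): bool => $t !== '' && !in_array($t, $stopwords, true)
    );
    return array_values($tokens);
}
```

With this pattern, "e-mail" and "well_known" survive as single tokens, while punctuation like "." and "!" still splits the text.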


foreach ($terms as $term => $hitCount) {
try {
$term = substr((string)$term, 0, 50);
Member

Maybe better to use mb_substr() here as well, to avoid truncating in the middle of a multi-byte character.

Contributor Author

Thanks for the feedback, updated it.
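A quick illustration of why this matters (demo values only, not from the PR): substr() counts bytes while mb_substr() counts characters, so a byte-based cut can split a multi-byte character and store invalid UTF-8 in the terms table.

```php
<?php
// substr() counts bytes, mb_substr() counts characters. Cutting a term
// at a byte boundary can split a multi-byte character in half.
$term = 'müll';                    // 'ü' is 2 bytes in UTF-8
$byteCut = substr($term, 0, 2);    // "m" plus only the first byte of 'ü'
$charCut = mb_substr($term, 0, 2); // "mü", always valid UTF-8
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
```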


$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
$qb->update($this->tableName)
->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))
Member

I wonder whether we want to safeguard against negative values here. Especially as the fields are unsigned integers, which means there's a risk of integer underflow (wraparound), right?

Contributor Author

Thanks for the feedback, updated it.
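For reference, one portable way to clamp the counter at zero is a CASE expression, since GREATEST() exists on MySQL/PostgreSQL but not on SQLite. The table and parameter names below are illustrative only, not the actual schema:

```sql
-- Illustrative sketch: clamp the decrement so num_hits never goes below 0.
UPDATE oc_collectives_s_terms
   SET num_hits = CASE
                    WHEN num_hits >= :hit_count THEN num_hits - :hit_count
                    ELSE 0
                  END
 WHERE term = :term;
```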

protected FileIndexer $indexer;
class FileSearcher {
private const DEFAULT_LIMIT = 15;
private const FUZZY_PREFIX_LENGTH = 2;
Member

A prefix length of 2 seems pretty short and probably results in many candidates in many cases. How about increasing it to 3 or 4?

Contributor Author

Thanks for the feedback, I increased it to 3 for now.
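To illustrate why the prefix length matters: only terms sharing the first FUZZY_PREFIX_LENGTH characters with the query term reach the comparatively expensive edit-distance check, so a longer prefix means fewer candidates. The helper below is a hypothetical sketch, not code from the PR:

```php
<?php
// Hypothetical sketch of fuzzy candidate selection: a cheap prefix
// filter first, then an edit-distance filter on the survivors.
const FUZZY_PREFIX_LENGTH = 3;
const MAX_EDIT_DISTANCE = 2;

function fuzzyCandidates(string $query, array $indexedTerms): array {
    $prefix = mb_substr($query, 0, FUZZY_PREFIX_LENGTH);
    $candidates = [];
    foreach ($indexedTerms as $term) {
        if (mb_substr($term, 0, FUZZY_PREFIX_LENGTH) !== $prefix) {
            continue; // rejected without computing an edit distance
        }
        if (levenshtein($query, $term) <= MAX_EDIT_DISTANCE) {
            $candidates[] = $term;
        }
    }
    return $candidates;
}
```

Note the trade-off: with a prefix length of 3, a typo within the first three characters (e.g. "serach" for "search") is no longer found, which is the price for a much smaller candidate set.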

$tokens = [];
$lastWord = '';
foreach ($words as $word) {
if (strlen($word) <= 3) {
Member

Suggested change:
-if (strlen($word) <= 3) {
+if (mb_strlen($word) <= 3) {

Contributor Author

Thanks for the feedback, updated it.

$this->indexer->setLanguage($this->language ?? self::UNSUPPORTED_LANGUAGE);
$scored = [];
foreach ($files as $file) {
$content = mb_strtolower($file->getContent());
Member

This looks like a potential performance problem to me, as it will re-read the contents of all matched files. I don't know though what to do about it. The ranking by bigrams seems important, as it makes sure that searches for several words rank results higher where those words appear next to each other. So dropping sorting by bigrams is not an option in my opinion.

We could add a fourth DB table and compute bigram phrase counts in it at index time. But this would probably drastically increase the database size and might be a bit over-engineered.

Contributor

@benjaminfrueh Mar 19, 2026

I agree that this could be a performance issue, but as you said, there is no good solution without bloating the DB by a lot and also slowing down indexing. The performance impact should be limited by the DEFAULT_LIMIT of 15 documents, which are first ranked by hits and then re-ranked by bigrams.

TNTSearch offered bigram and various n-gram tokenizers, but none of them was configured or previously used for Collectives, and they would increase the database size drastically. They stored bigrams simply as word pairs alongside the single words.

I think we should test the performance impact with a large dataset first and then decide on this.
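For illustration, a hypothetical sketch of the bigram re-ranking idea over the already hit-ranked candidates (not the PR's actual code): count how many adjacent query-word pairs also appear adjacent in the document content.

```php
<?php
// Hypothetical sketch: score a document by how many adjacent pairs of
// query words also occur adjacently in the document content.
function bigramScore(array $queryWords, string $content): int {
    $docWords = preg_split('/\s+/u', mb_strtolower($content), -1, PREG_SPLIT_NO_EMPTY);
    $docBigrams = [];
    for ($i = 0; $i < count($docWords) - 1; $i++) {
        $docBigrams[$docWords[$i] . ' ' . $docWords[$i + 1]] = true;
    }
    $score = 0;
    for ($i = 0; $i < count($queryWords) - 1; $i++) {
        if (isset($docBigrams[$queryWords[$i] . ' ' . $queryWords[$i + 1]])) {
            $score++; // this pair of query words is adjacent in the document
        }
    }
    return $score;
}
```

Since this runs only over the top DEFAULT_LIMIT candidates, the cost of re-reading content stays bounded, which matches the reasoning above.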

@mejo- mejo- force-pushed the feat/doctrine-fulltext-search branch from d8a4b06 to b26823e Compare March 8, 2026 18:42
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from b26823e to 2ee83e2 Compare March 19, 2026 19:01

Projects

Status: 🏗️ In progress

Development

Successfully merging this pull request may close these issues.

feat: Use Nextcloud database for TNTSearch

3 participants