[FR] Add digest_jaccard method to MinHash structs or modify existing jaccard methods #49

@iron3oxide

It would be really useful to be able to store a document together with its generated MinHash somewhere (such as a database) and check it for similarity against other documents later. Right now this isn't really possible: a MinHash instance can't be reconstructed from its digest, and there's no way to compute the Jaccard similarity against a hash that exists only in digest form, so the hash object has to be regenerated from the original document every time.

I see an easy solution for both CMinHash and RMinHash: add a digest_jaccard method that takes a Vec<u32> digest and returns the similarity. Alternatively, the existing jaccard methods could be modified to accept a digest, since they only ever need the digest of other, so passing the full MinHash object seems unnecessary. That would of course be a breaking change, which might warrant a 0.3.0 release.
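
A minimal sketch of the digest-only comparison, assuming a digest is a fixed-length sequence of u32 minimum hash values and that similarity is estimated as the fraction of positions where two digests agree. The name digest_jaccard is the one proposed above; the free-function form, the Option return type, and the f64 result are assumptions made for illustration, not the crate's actual API.

```rust
/// Hypothetical helper (not part of the crate): estimates the Jaccard
/// similarity of two documents from their stored MinHash digests alone.
///
/// Returns `None` when the digests are empty or differ in length, since
/// digests built with different numbers of permutations are not comparable.
fn digest_jaccard(a: &[u32], b: &[u32]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    // Count the positions at which both digests hold the same minimum hash.
    let matches = a
        .iter()
        .zip(b.iter())
        .filter(|(x, y)| x == y)
        .count();
    // The MinHash estimate is the share of matching positions.
    Some(matches as f64 / a.len() as f64)
}
```

With something like this, both digests could be loaded straight from the database and compared without touching the original documents. The comparison is only meaningful if both digests were produced with the same number of permutations and the same seed.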

I'm happy to implement this. Please let me know whether this would be a welcome change and which option you'd prefer!
