thoughts on LID improvements like HNSW++? #7
https://arxiv.org/pdf/2501.13992
This paper showed that you can use local intrinsic dimensionality (LID) to improve the performance of HNSW search by a pretty decent margin.
My intuition is that in cluster centers, there's lower dimensionality, so neighbors are going to be more evenly spaced out. At boundaries and in other regions, there will be more irregularity in neighbor spacing.
The paper above precomputes this for all points in the dataset (which is computationally prohibitive in most cases), then uses it in a few ways, among which are:
- insertion order - high LID nodes are inserted first, to promote them forming longer links.
- insertion layer - high LID nodes are inserted into higher layers, again to promote them forming longer links.
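For reference, a standard way to get a per-point LID value from its k-NN distances is the Levina-Bickel maximum-likelihood estimator. Here's a minimal sketch (whether the paper uses exactly this estimator is an assumption on my part):

```python
import numpy as np

def lid_mle(knn_dists, eps=1e-12):
    """Levina-Bickel MLE estimate of local intrinsic dimensionality.

    knn_dists: distances from a point to its k nearest neighbors (any order).
    """
    d = np.sort(np.asarray(knn_dists, dtype=float))
    r_k = d[-1]  # distance to the k-th (farthest) neighbor
    # LID ~= (k - 1) / sum_i ln(r_k / r_i); clamp to avoid log-blowup
    # when duplicate points give zero distances.
    return (len(d) - 1) / np.sum(np.log(r_k / np.maximum(d[:-1], eps)))
```

The estimate is scale-invariant (multiplying all distances by a constant doesn't change it), and neighbor distances concentrated near r_k — the typical high-dimensional behavior — yield a higher LID.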
The paper reports some really impressive results - up to an 18% improvement in recall on NLP tasks.
I think it would be interesting to try to extend this to use an estimate of the LID based on the constructed graph so far, rather than using the full LID. For CoreNN / diskann, it could look like:
- instead of inserting nodes in random order during construction, insert via a priority queue keyed by LID estimates computed on the current graph. This could let high-LID nodes "jump the queue" and improve the insertion order.
- manipulating R and α based on LID. We want high-LID nodes to have longer links, so allowing them to form more links (higher R) might be beneficial. I think a higher α would also let these nodes take more "detours", which may also produce longer links.
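To make the two ideas above concrete, here's a rough sketch of what I have in mind. All of it is hypothetical: the seed-then-queue build order, the `lid_params` scaling rule, and the brute-force k-NN (standing in for a search on the partial graph) are my own made-up choices, not anything from the paper or from CoreNN/DiskANN:

```python
import heapq
import numpy as np

def lid_from_dists(dists, eps=1e-12):
    """Levina-Bickel MLE LID estimate from ascending k-NN distances."""
    r_k = dists[-1]
    return (len(dists) - 1) / np.sum(np.log(r_k / np.maximum(dists[:-1], eps)))

def lid_params(lid, r_base=32, alpha_base=1.2, lid_ref=10.0):
    """Hypothetical mapping: give high-LID nodes a larger degree bound R
    and a larger pruning parameter alpha, so they can keep more (and
    longer) links. The scaling rule is invented for illustration."""
    scale = max(1.0, lid / lid_ref)
    return int(r_base * scale), alpha_base * scale

def lid_insertion_order(points, k=10, seed_size=100, rng_seed=0):
    """Hypothetical build order: seed the index with a random sample,
    then insert the remaining points highest-estimated-LID first."""
    rng = np.random.default_rng(rng_seed)
    order = rng.permutation(len(points))
    inserted, pending = [int(i) for i in order[:seed_size]], order[seed_size:]
    index = np.asarray([points[i] for i in inserted])
    heap = []
    for p in pending:
        # Brute-force k-NN against the seed stands in for a search on
        # the partial graph (e.g. DiskANN-style greedy search).
        d = np.sort(np.linalg.norm(index - points[p], axis=1))[:k]
        heapq.heappush(heap, (-lid_from_dists(d), int(p)))  # max-heap
    result = list(inserted)
    while heap:
        neg_lid, p = heapq.heappop(heap)
        # In a real build, p would be inserted into the graph here using
        # R, alpha = lid_params(-neg_lid).
        result.append(p)
    return result
```

One obvious shortcoming of this sketch: it scores each pending point once against the seed set, whereas the interesting version would re-estimate LID as the graph grows (which is where the priority queue actually earns its keep).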
There's also stuff there about multi-branching...
Anyways, I'm really curious about it and might hack on this a bit. Curious if you've thought about it or have tinkered with it at all?