Adding implementations for vector stores #79
Conversation
b53d8a2 to
ccd9e48
Compare
d125eda to
23bafa5
Compare
…mentation Also note that we need to now include deleting previous collections and restarting new ones / error handling for the dimension-mismatch case
66f47e3 to
4bafdb7
Compare
| K = min(K, cur_min) | ||
|
|
||
| search_K = K | ||
| while True: |
There was a problem hiding this comment.
We should no longer be manually post filtering for the non-faiss vector stores -- instead we should perform a filtered vector search using the VS API (most of them let you filter on an exact match predicate)
| pd.DataFrame: The DataFrame with the index directory saved. | ||
| """ | ||
| if lotus.settings.rm is None: | ||
|
|
There was a problem hiding this comment.
it may be good to display a warning here, telling the user to not reset the df index, getting embedding ids requires the df index is invariant so that get_vectors_from_index returns correctly
| metadatas: list[Mapping[str, Union[str, int, float, bool]]] = [{"doc_id": int(i)} for i in range(len(docs_list))] | ||
|
|
||
| # Add documents in batches | ||
| batch_size = 100 |
There was a problem hiding this comment.
better to add a vs_upsert_batch_size setting that can be configured in our settings module, rather than hard-coding it
| @abstractmethod | ||
| def search(self, | ||
| queries: pd.Series | str | Image.Image | list | NDArray[np.float64], | ||
| def __call__(self, |
There was a problem hiding this comment.
it might make sense to add an ids parameter here, so that sem ops can make filtered search calls easily rather than re-implementing vs-specific logic each time
| }) | ||
|
|
||
| # Upsert in batches of 100 | ||
| batch_size = 100 |
There was a problem hiding this comment.
move batch-size to a settings configuration (same comment for all vs implementations)
| rm_output: RMOutput = rm(queries, K) | ||
| distances = rm_output.distances | ||
| indices = rm_output.indices | ||
| vs_output: RMOutput = vs(query_vectors, K) |
There was a problem hiding this comment.
we need to add a parameter for ids in vs() and set it to the index of the df we're searching over -- otherwise we won't catch the case when the df being searched over has been filtered by the user
| # VS Only Tests | ||
| ################################################################################ | ||
|
|
||
|
|
There was a problem hiding this comment.
an important test to add is doing filtered vector search -- ie the program starts with some df, embeds/indexes the column, does any filter op (can be a structured filter), then calls a sem op that uses search over the indexed column
Verified tests pass locally (for those that can be run locally at least)