Skip to content

Adding implementations for vector stores #79

Open
AmoghTantradi wants to merge 58 commits intomainfrom
at/vs_implementation
Open

Adding implementations for vector stores #79
AmoghTantradi wants to merge 58 commits intomainfrom
at/vs_implementation

Conversation

@AmoghTantradi
Copy link
Copy Markdown
Collaborator

@AmoghTantradi AmoghTantradi commented Jan 14, 2025

Verified tests pass locally (for those that can be run locally at least)

K = min(K, cur_min)

search_K = K
while True:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should no longer be manually post filtering for the non-faiss vector stores -- instead we should perform a filtered vector search using the VS API (most of them let you filter on an exact match predicate)

pd.DataFrame: The DataFrame with the index directory saved.
"""
if lotus.settings.rm is None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be good to display a warning here, telling the user to not reset the df index, getting embedding ids requires the df index is invariant so that get_vectors_from_index returns correctly

metadatas: list[Mapping[str, Union[str, int, float, bool]]] = [{"doc_id": int(i)} for i in range(len(docs_list))]

# Add documents in batches
batch_size = 100
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add a vs_upsert_batch_size setting that can be configured in our settings module, rather than hard-coding it

@abstractmethod
def search(self,
queries: pd.Series | str | Image.Image | list | NDArray[np.float64],
def __call__(self,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make sense to add an ids parameter here, so that sem ops can make filtered search calls easily rather than re-implementing vs-specific logic each time

})

# Upsert in batches of 100
batch_size = 100
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move batch-size to a settings configuration (same comment for all vs implementations)

rm_output: RMOutput = rm(queries, K)
distances = rm_output.distances
indices = rm_output.indices
vs_output: RMOutput = vs(query_vectors, K)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to add a parameter for ids in vs() and set it to the index of the df we're searching over -- otherwise we won't catch the case when the df being searched over has been filtered by the user

# VS Only Tests
################################################################################


Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an important test to add is doing filtered vector search -- ie the program starts with some df, embeds/indexes the column, does any filter op (can be a structured filter), then calls a sem op that uses search over the indexed column

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants