Incremental Index Updates #3

Open
magaton opened this issue Nov 8, 2024 · 1 comment
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments


magaton commented Nov 8, 2024

Hello, very nice work!
I am currently using ParadeDB for BM25 search in Postgres, and I found this repo while looking for alternative implementations.
You say that creating a BM25 index from a table/column is costly.
Would it be possible to add incremental index updates, as search engines usually do?

@jankovicsandras
Owner

Hi,

This is a very good question.

Updating the index is challenging because every word in the vocabulary (all unique tokens/words across all documents) has a parameter called inverse document frequency (idf), which must be updated even if just one document is added or removed. So almost everything must be recalculated on INSERT or DELETE. However, we could save some time by not re-tokenizing the unchanged documents, so this could be a good optimization.
UPDATE (when no document is added or removed and only existing documents change) should recalculate the idf and wsmap scores for all the words in the old and the new document contents.
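To make the idea concrete, here is a rough Python sketch of the bookkeeping described above. It is not the code from this repo or from BM25opt: the class `IncrementalBM25Stats`, its method names, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions for illustration, and it uses a common textbook BM25 formula. It caches each document's term counts, so an insert, delete, or update only tokenizes the one affected document, while idf is derived on demand from the stored document frequencies. The average document length also changes with every modification, so scores are computed at query time here rather than stored.

```python
import math
from collections import Counter


class IncrementalBM25Stats:
    """Hypothetical helper: corpus statistics that can be adjusted per document."""

    def __init__(self):
        self.doc_tokens = {}       # doc_id -> Counter of term frequencies (cached tokenization)
        self.doc_freq = Counter()  # term -> number of documents containing the term
        self.total_len = 0         # sum of all document lengths, used for the average length

    @staticmethod
    def tokenize(text):
        # Placeholder tokenizer (assumption): lowercase and split on whitespace.
        return text.lower().split()

    def add(self, doc_id, text):
        if doc_id in self.doc_tokens:
            raise KeyError(f"document {doc_id!r} already indexed")
        tf = Counter(self.tokenize(text))
        self.doc_tokens[doc_id] = tf
        self.doc_freq.update(tf.keys())  # each distinct term counts once per document
        self.total_len += sum(tf.values())

    def remove(self, doc_id):
        tf = self.doc_tokens.pop(doc_id)
        for term in tf:
            self.doc_freq[term] -= 1
            if self.doc_freq[term] == 0:
                del self.doc_freq[term]  # term left the vocabulary
        self.total_len -= sum(tf.values())

    def update(self, doc_id, text):
        # Only the old and new contents of this one document are touched.
        self.remove(doc_id)
        self.add(doc_id, text)

    def idf(self, term):
        # Computed on demand, so a changed document count needs no rewrite pass.
        n = len(self.doc_tokens)
        df = self.doc_freq.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, doc_id, k1=1.5, b=0.75):
        tf = self.doc_tokens[doc_id]
        dl = sum(tf.values())
        avgdl = self.total_len / len(self.doc_tokens)
        total = 0.0
        for term in self.tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf(term) * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        return total
```

The trade-off in this sketch is that nothing is precomputed per word and document, so queries do more work; an index that stores precomputed scores would still have to refresh them after each change, which is the costly part mentioned above.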

I'll think about this and try to implement something when I have some time, probably first in BM25opt and then port it here.

jankovicsandras added the enhancement and good first issue labels on Nov 11, 2024