Incremental Index Updates #3

magaton · 2024-11-08T15:01:49Z

Hello, very nice work!
I am currently using ParadeDB for BM25 search in Postgres, and I found this repo while looking for an alternative implementation.
You say that creating a BM25 index from a table/column is costly.
Would it be possible to add incremental index updates, as search engines usually do?

jankovicsandras · 2024-11-11T12:32:13Z

Hi,

This is a very good question.

Updating the index is challenging because every word in the vocabulary (all unique tokens/words across all documents) has a parameter called inverse document frequency (idf), which must be updated even if just one document is added or removed. So almost everything must be recalculated on INSERT or DELETE. However, we could save some time by not re-tokenizing the unchanged documents, so this could be a good optimization.
UPDATE (when no document is added or removed and only existing documents are modified) should recalculate the idf and wsmap scores for all the words in the old and the new document contents.
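
A minimal Python sketch of the optimization described above, not the repo's actual implementation: it caches each document's token counts so unchanged documents are never re-tokenized, and keeps only document frequencies up to date on insert, delete, and update. The class `IncrementalBM25`, its method names, the whitespace tokenizer, the `k1`/`b` defaults, and the idf variant `ln(1 + (N - n_t + 0.5) / (n_t + 0.5))` are all assumptions made for illustration. Unlike a design that stores precomputed per-word scores, this sketch derives idf at query time from the stored counts, which is what makes the updates cheap.

```python
import math
from collections import Counter


class IncrementalBM25:
    """Hypothetical sketch: BM25 statistics maintained one document at a time."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tf = {}      # doc_id -> Counter(token -> count); cached tokenization
        self.df = Counter()   # token -> number of documents containing it
        self.total_len = 0    # sum of all document lengths, used for the average

    @staticmethod
    def tokenize(text):
        # Assumption: naive lowercase whitespace tokenizer.
        return text.lower().split()

    def add(self, doc_id, text):
        # INSERT, or UPDATE when doc_id already exists (remove old, add new).
        if doc_id in self.doc_tf:
            self.remove(doc_id)
        tf = Counter(self.tokenize(text))
        self.doc_tf[doc_id] = tf
        self.df.update(tf.keys())          # each distinct token counts once per document
        self.total_len += sum(tf.values())

    def remove(self, doc_id):
        # DELETE: only the tokens of this one document are touched.
        tf = self.doc_tf.pop(doc_id)
        self.total_len -= sum(tf.values())
        for token in tf:
            self.df[token] -= 1
            if self.df[token] == 0:
                del self.df[token]

    def idf(self, token):
        # Computed on demand, so a changed document count needs no stored rewrite.
        n, n_t = len(self.doc_tf), self.df.get(token, 0)
        return math.log(1 + (n - n_t + 0.5) / (n_t + 0.5))

    def score(self, query, doc_id):
        tf = self.doc_tf[doc_id]
        dl = sum(tf.values())
        avgdl = self.total_len / len(self.doc_tf)
        norm = self.k1 * (1 - self.b + self.b * dl / avgdl)
        total = 0.0
        for token in self.tokenize(query):
            f = tf.get(token, 0)
            if f:
                total += self.idf(token) * f * (self.k1 + 1) / (f + norm)
        return total
```

Usage under the same assumptions: `idx = IncrementalBM25()`, then `idx.add(1, "the quick brown fox")`, and later `idx.add(1, "new text")` to update or `idx.remove(1)` to delete. The trade-off is that scoring does slightly more work per query; a variant that stores precomputed scores would instead have to refresh them for every word whenever the document count changes.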

I'll think about this and try to implement something when I have some time, probably first in BM25opt and then porting it here.

jankovicsandras added the enhancement and good first issue labels on Nov 11, 2024
