Do not index urls and numbers by default #553
Replies: 3 comments 1 reply
-
Indeed, it's an interesting proposal. We might want to ignore URLs everywhere in the document, not only in dedicated fields. I can't personally work on this before the release of v0.30.0, but there is no technical limit to implementing this feature.
-
We could extend the reasoning to fields whose values are mostly used for filtering and sorting, e.g. booleans (true/false) and dates. We could also consider not indexing numeric values by default. If we go that way, it MUST be intuitive for users to enable indexing of those values when they need them. That is my biggest concern right now.
-
Hello @qdequele, your issue is related to search time. Could you provide some examples of search queries that are slow?
-
Today we index all the data sent by the user, every field present in the document, even when it doesn't make sense to search on that type of field. For example, I would assume that, by default, no one searches through URLs or numbers.
Is it possible to consider not indexing numbers or URLs?
One option is to detect whether a field holds a URL or a number and skip indexing it. Another is to recognize numbers and URLs during tokenization and treat them as soft separators. The second option perhaps makes more sense and could be easier to explain, even if it is harder to implement.
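To make the two options concrete, here is a minimal Python sketch. It is not how the engine is implemented: the regular expressions, the function names (`field_is_indexable`, `indexable_tokens`), and the whitespace-based splitting are all assumptions made for illustration, and a real tokenizer would need more careful rules.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions, not the engine's rules).
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
NUMBER_RE = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def is_url(token: str) -> bool:
    return URL_RE.match(token) is not None


def is_number(token: str) -> bool:
    return NUMBER_RE.match(token) is not None


def field_is_indexable(value) -> bool:
    """Option 1: decide per field value whether it should be indexed."""
    # bool is a subclass of int in Python, so booleans are skipped here too.
    if isinstance(value, (int, float)):
        return False
    if isinstance(value, str):
        stripped = value.strip()
        return not (is_url(stripped) or is_number(stripped))
    return True


def indexable_tokens(text: str) -> list:
    """Option 2: tokenize, then drop tokens that are URLs or numbers."""
    tokens = []
    for raw in text.split():
        if is_url(raw):
            continue  # the URL acts as a separator and is not indexed
        token = raw.strip(".,;:!?()[]\"'")
        if not token or is_number(token):
            continue
        tokens.append(token.lower())
    return tokens


tokens = indexable_tokens("Visit https://example.com for 42 great deals, 3.14 pies")
print(tokens)
```

With the second option, a field that mixes text and URLs still contributes its words to the index, which is one reason it may be easier to explain to users.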
Context
I tried to build a demo on a Crunchbase dataset that looks like this:
I indexed the entire dataset, 2.1M documents. With the default settings, some queries took more than 1s. After restricting searchable attributes to non-URL fields, most requests took under 3ms.
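The workaround described above, keeping only non-URL fields searchable, can be approximated by scanning a sample of documents. The following Python sketch is an assumption-laden illustration: the helper names are hypothetical, and the sample documents and their field names are invented here, since the actual Crunchbase sample is not shown.

```python
import json
from urllib.parse import urlparse


def looks_like_url(value) -> bool:
    """Return True for strings that parse as http(s) URLs with a host."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value.strip())
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def non_url_attributes(documents):
    """List field names, in first-seen order, that never hold a URL in the sample."""
    seen = []
    url_fields = set()
    for doc in documents:
        for name, value in doc.items():
            if name not in seen:
                seen.append(name)
            if looks_like_url(value):
                url_fields.add(name)
    return [name for name in seen if name not in url_fields]


# Invented documents for illustration only; not the real dataset.
sample = [
    {"name": "Acme", "description": "Rocket skates",
     "homepage_url": "https://acme.example"},
    {"name": "Globex", "description": "Energy",
     "homepage_url": "http://globex.example",
     "logo_url": "https://cdn.example/g.png"},
]

searchable = non_url_attributes(sample)
print(json.dumps(searchable))
```

The resulting list is the kind of value one would then supply as the index's searchable attributes setting.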
It looks like a huge DX issue that could perhaps be managed very quickly.