Do not index urls and numbers by default #553

qdequele · 2022-10-13T18:11:44Z

qdequele
Oct 13, 2022
Maintainer

Today we index all the data sent by the user, that is, every field present in the document, even when searching on a given field type makes little sense. For example, I would assume that, by default, no one searches through URLs or numbers.

Is it possible to consider not indexing numbers or URLs?

One option is to detect whether a field contains a URL or a number and skip indexing it. Another is to recognize numbers and URLs during tokenization and treat them as soft separators. The second proposal perhaps makes more sense and could be easier to explain, even if it is harder to code.
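To make the first option concrete, here is a minimal Python sketch of field-level detection. It is only an illustration of the idea, not how Meilisearch works: the helper names (`is_number`, `is_full_url`, `default_searchable_fields`) are hypothetical, and the rules for what counts as a number or a URL are assumptions.

```python
import re
from urllib.parse import urlparse

# Assumption: a "number" is an optionally signed integer or decimal.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def is_number(value):
    """Return True for numeric values and for strings that look like numbers."""
    if isinstance(value, bool):  # bool is a subclass of int; exclude it here
        return False
    if isinstance(value, (int, float)):
        return True
    return isinstance(value, str) and NUMBER.fullmatch(value.strip()) is not None


def is_full_url(value):
    """Return True only for complete http(s) URLs, not for bare domains."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value.strip())
    except ValueError:  # malformed input is simply not treated as a URL
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def default_searchable_fields(document):
    """List the fields that would still be indexed under this heuristic."""
    return [
        name
        for name, value in document.items()
        if not is_number(value) and not is_full_url(value)
    ]


example = {
    "name": "MeiliSearch",
    "homepage_url": "https://www.meilisearch.com/",
    "domain": "meilisearch.com",
    "employees": 42,
}
print(default_searchable_fields(example))  # ['name', 'domain']
```

A real implementation would probably need to look at many documents before deciding on a field, since a single value says little about the field as a whole.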

Context

I tried to build a demo on a Crunchbase dataset whose documents look like this:

{
    "name": "MeiliSearch",
    "short_description": "Next generation search API",
    "uuid": "d08ad0c5-e822-4e14-bd9a-ae1700dabd72",
    "type": "organization",
    "primary_role": "company",
    "cb_url": "https://www.crunchbase.com/organization/meilisearch?utm_source=crunchbase&utm_medium=export&utm_campaign=odm_csv",
    "domain": "meilisearch.com",
    "homepage_url": "https://www.meilisearch.com/",
    "logo_url": "https://res.cloudinary.com/crunchbase-production/image/upload/bdasornefogsigi2f6gv",
    "facebook_url": "",
    "twitter_url": "https://twitter.com/meilisearch/",
    "linkedin_url": "https://www.linkedin.com/company/13016868/",
    "combined_stock_symbols": "",
    "city": "Paris",
    "region": "Ile-de-France",
    "country_code": "FRA"
}

I indexed the entire dataset, 2.1M documents. When I searched it with the default settings, some queries took more than 1s. After restricting the searchable attributes to non-URL fields, most requests took under 3ms.
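For readers who want to reproduce the workaround, the sketch below builds the settings request with Python's standard library. The exact attribute list used in the demo is not given above, so filtering on the `_url` suffix is an assumption, as are the index name `companies` and the local address `http://localhost:7700`. The route is the searchable-attributes settings endpoint as I understand it from the Meilisearch API reference; check the HTTP method against the version you run.

```python
import json
from urllib.request import Request


def non_url_attributes(document):
    """Keep every field whose name does not end with `_url` (assumed rule)."""
    return [name for name in document if not name.endswith("_url")]


def searchable_attributes_request(base_url, index_uid, attributes):
    """Build, without sending, the request that updates searchable attributes."""
    return Request(
        url=f"{base_url}/indexes/{index_uid}/settings/searchable-attributes",
        data=json.dumps(attributes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


sample = {
    "name": "MeiliSearch",
    "short_description": "Next generation search API",
    "cb_url": "https://www.crunchbase.com/organization/meilisearch",
    "domain": "meilisearch.com",
    "homepage_url": "https://www.meilisearch.com/",
    "city": "Paris",
}

attributes = non_url_attributes(sample)
request = searchable_attributes_request("http://localhost:7700", "companies", attributes)
print(attributes)  # ['name', 'short_description', 'domain', 'city']
# Sending is left out on purpose; urllib.request.urlopen(request) would do it.
# A protected instance would also need an Authorization header with an API key.
```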

It looks like a huge DX issue that could perhaps be managed very quickly.

ManyTheFish · 2022-10-17T15:56:20Z

ManyTheFish
Oct 17, 2022
Collaborator

Indeed, it's an interesting proposal. We might want to ignore URLs everywhere in the document, not only in dedicated fields.
However, we would only be able to ignore URLs that start with https:// or http://; in your example, the domain meilisearch.com would not be ignored. Otherwise, we would also ignore words that are not URLs, such as file names.
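As a rough illustration of that rule, the Python sketch below drops only substrings that start with `http://` or `https://` before splitting the text into words. It is a simplified stand-in, not the actual Meilisearch tokenizer; the function name and both regular expressions are assumptions made for the example.

```python
import re

# Assumption: a full URL runs from its scheme to the next whitespace.
FULL_URL = re.compile(r"https?://\S+")
WORD = re.compile(r"\w+")


def tokenize_ignoring_urls(text):
    """Replace full URLs with a separator, then split the rest into words."""
    without_urls = FULL_URL.sub(" ", text)
    return [word.lower() for word in WORD.findall(without_urls)]


tokens = tokenize_ignoring_urls(
    "See https://www.meilisearch.com/ or meilisearch.com and report.pdf"
)
print(tokens)  # ['see', 'or', 'meilisearch', 'com', 'and', 'report', 'pdf']
```

The bare domain and the file name are still tokenized, which matches the trade-off described above.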

I can't personally work on this before the release of v0.30.0, but there is no technical obstacle to implementing this feature.

1 reply

qdequele Oct 22, 2022
Maintainer Author

I believe that, at least for my use case, removing only complete URLs would be enough.

gmourier · 2022-11-08T15:36:26Z

gmourier
Nov 8, 2022
Maintainer

We could extend the reasoning to fields whose values are mostly used for filtering and sorting, e.g. booleans (true/false) and dates.

We could also consider not indexing numeric values by default.

If we go that way, it MUST be intuitive for users to activate the indexing of those default un-indexed values if they are needed. It's my biggest concern right now.

0 replies

ManyTheFish · 2022-11-23T10:08:54Z

ManyTheFish
Nov 23, 2022
Collaborator

Hello @qdequele, your issue is related to search time. Could you provide some examples of search queries that are slow?
Thanks!

0 replies
