Ability to select tokenizer/language #423
-
Hi @mooncake4132! 👋 We are thinking about opening up the configuration of tokenizers in upcoming versions of the tokenizer. If I understand correctly, you would like to define a general tokenizer able to tokenize Chinese/Japanese/Latin text according to the characters it encounters? Do your attributes mix multiple languages, or can they be separated by language within a document (i.e. having dedicated fields per language)? I'm taking the opportunity to ping @ManyTheFish, who recently worked on the redesign of the tokenizer. Thanks!
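As a rough illustration of that last question, a per-language field layout might look like the sketch below; the field names (`title_ja`, `title_en`) and the English title are hypothetical, not a Meilisearch convention.

```python
# Hypothetical document shape with dedicated fields per language, so each
# attribute could be handled by a language-appropriate tokenizer.
# Field names and the English title are illustrative only.
document = {
    "id": 423,
    "title_ja": "金曜日の妻たちへ",  # Japanese title from the question in this thread
    "title_en": "Friday's Wives",    # hypothetical English title
}
```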
-
Hey @mooncake4132 and @gmourier! Regarding romanization: if you do it, I recommend simply splitting your romanized words with spaces.
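As a minimal sketch of that suggestion, assuming the pypinyin library for Chinese romanization (any romanizer would work the same way), the idea is to join the romanized syllables with spaces before indexing:

```python
# Sketch: romanize Chinese text and separate syllables with spaces so a
# whitespace-aware tokenizer can split them into distinct words.
from pypinyin import lazy_pinyin

def romanize_for_indexing(text: str) -> str:
    # lazy_pinyin yields one pinyin syllable per Chinese character,
    # e.g. "星期五" -> ["xing", "qi", "wu"].
    return " ".join(lazy_pinyin(text))

print(romanize_for_indexing("星期五"))  # -> "xing qi wu"
```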
-
First of all, thank you for this great product. I currently run an in-house search engine, and Meilisearch is way better than mine in almost every respect. There is, however, a small limitation that is preventing me from adopting Meilisearch, and I'm wondering if this is something your team is interested in addressing.
I run my search engine over book and movie titles in Chinese and Japanese, and I need a way to force Meilisearch to tokenize these titles by individual characters rather than guessing the tokenization with libraries like jieba. I have a few reasons. For example, I can't find 金曜日の妻たちへ by searching for 金妻, probably because the title is tokenized as 金曜日/の/妻/たち/へ.

All these problems can be solved if I can force Meilisearch to tokenize by characters. While these issues may be specific to my use case, I feel this would be a very useful fallback when the CJK tokenizers don't produce good results. I'm happy to hear what you think!
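Until something like this is configurable server-side, a client-side sketch of the character-level fallback described above could look like the following: insert a space around every CJK character before indexing, and run queries through the same transform, so a query like 金妻 matches the tokens 金 and 妻 in 金曜日の妻たちへ. The regex ranges here are a rough approximation, not an exhaustive CJK definition.

```python
# Client-side workaround sketch (not a built-in Meilisearch feature):
# pre-split CJK strings into single characters for both documents and queries.
import re

# Rough CJK ranges: Hiragana, Katakana, CJK Extension A, CJK Unified Ideographs.
CJK = re.compile(r"([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff])")

def split_cjk_by_character(text: str) -> str:
    # Surround each CJK character with spaces, then collapse runs of whitespace.
    return re.sub(r"\s+", " ", CJK.sub(r" \1 ", text)).strip()

print(split_cjk_by_character("金曜日の妻たちへ"))  # -> "金 曜 日 の 妻 た ち へ"
print(split_cjk_by_character("金妻"))              # -> "金 妻"
```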