Persian Language Enhancement #545
ariaieboy
started this conversation in
Feedback & Feature Proposal
Replies: 1 comment
Hello @ariaieboy, your discussion seems to be closely related to #139. Thank you very much for your feedback; it helps me a lot in improving language support in Meilisearch!
Hello,
I am using Meilisearch on a new project in the Persian language. It works great in many cases, but there are some edge cases where Meilisearch needs some enhancement.
The first issue is the letter آ. Its Unicode code point is U+0622, and it usually appears as the first letter of a word. It is equivalent to the character ا, whose code point is U+0627. This letter is shared between Arabic and Persian, but it is not as complicated in Persian as it is in Arabic: Arabic has multiple alef letters, while Persian has only these two kinds of alef, which are equivalent to each other. For example, these words are equal:
آب => اب
آقا => اقا
آسفالت => اسفالت
A quick workaround would be a character replacement of U+0622 with U+0627.
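To make the idea concrete, here is a minimal sketch of that replacement in Python. The function name `normalize_alef` is hypothetical and only for illustration; it is not part of Meilisearch or its tokenizer.

```python
# Hypothetical helper for illustration; not a Meilisearch API.
ALEF_WITH_MADDA = "\u0622"  # آ
ALEF = "\u0627"             # ا


def normalize_alef(text: str) -> str:
    """Replace every U+0622 with U+0627 so both spellings compare equal."""
    return text.replace(ALEF_WITH_MADDA, ALEF)


# "آب" (U+0622 U+0628) and "اب" (U+0627 U+0628) normalize to the same string.
print(normalize_alef("\u0622\u0628") == normalize_alef("\u0627\u0628"))
```

The same function would have to run on both the indexed text and the search query for the two spellings to match.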
The second issue is what we call the half-space in Persian, which Unicode calls the zero-width non-joiner. Its code point is U+200C. For search purposes, this character is equivalent to a space in Persian. Examples:
می‌توان => می توان
کتاب‌ها => کتاب ها
In a perfect world this character is meaningful: it indicates that two parts of a word are related. For example, کتاب means "book" and ها acts like the "s" in "books", so instead of a space we should use the half-space (short space). But about 95% of the time, people typing on a computer use a regular space, and that breaks Meilisearch results. As with the previous issue, the quick workaround is to replace the half-space with a space, both in the index and in the user input.
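As a rough sketch of this workaround in Python (the name `zwnj_to_space` is made up for this example and is not an existing Meilisearch function):

```python
# Hypothetical helper for illustration; not a Meilisearch API.
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"


def zwnj_to_space(text: str) -> str:
    """Replace every U+200C with a regular space (U+0020)."""
    return text.replace(ZWNJ, " ")


# The half-space spelling and the regular-space spelling become identical.
with_half_space = "\u0645\u06cc\u200c\u062a\u0648\u0627\u0646"
with_space = "\u0645\u06cc \u062a\u0648\u0627\u0646"
print(zwnj_to_space(with_half_space) == with_space)
```

Applying it to documents before indexing and to each query before searching would make both spellings tokenize the same way.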
The third issue is the Tatweel character ـ (U+0640).
This character is rarely used, especially in Persian, and it is not strictly necessary to handle it in your tokenizer, but covering this edge case would help improve search results in general.
The Tatweel carries no meaning and can be removed from any string. Tatweel, or kashida, is a type of justification in Arabic script, and in Persian it is usually used to make a word look better visually.
Examples:
حــمید => حمید
رحــــــــــــــیــــــــــــــم => رحیم
In this case the quick workaround would be to replace the kashida with nothing, that is, to remove it from the string.
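A minimal Python sketch of the removal, assuming the Tatweel is the single code point U+0640 (the helper name `strip_tatweel` is hypothetical):

```python
# Hypothetical helper for illustration; not a Meilisearch API.
TATWEEL = "\u0640"  # ـ (kashida)


def strip_tatweel(text: str) -> str:
    """Delete every U+0640 from the string."""
    return text.replace(TATWEEL, "")


# A stretched spelling of "حمید" collapses to the plain spelling.
stretched = "\u062d\u0640\u0640\u0645\u06cc\u062f"
plain = "\u062d\u0645\u06cc\u062f"
print(strip_tatweel(stretched) == plain)
```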