[Charabia] Feature request: trim emoji during segmentation #796
Hey, @slatian, as far as I know, there are no tricks to fix this issue except preprocessing your documents and inserting a space before each emoji. I'm sorry, but it's definitely something we could improve on our side. Currently, most of the team is on holiday; you may have a better answer in a week or so.
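For illustration, the preprocessing workaround could look roughly like the sketch below. It uses only the Rust standard library and no Charabia API. The helper names `is_emoji_like` and `space_before_emoji` are hypothetical, and the code point ranges are a rough approximation of "emoji", not the full Unicode `Emoji` property.

```rust
/// Rough emoji test based on a few Unicode blocks.
/// Assumption: this is a heuristic, not the Unicode `Emoji` property.
fn is_emoji_like(c: char) -> bool {
    matches!(
        c as u32,
        0x1F300..=0x1FAFF   // pictographs, emoticons, transport, skin tones
        | 0x2600..=0x27BF   // miscellaneous symbols and dingbats
        | 0x1F1E6..=0x1F1FF // regional indicators (flags)
        | 0x200D            // zero-width joiner used inside emoji sequences
        | 0xFE0F            // variation selector-16
    )
}

/// Inserts one space before every emoji run that directly follows
/// a non-whitespace, non-emoji character.
fn space_before_emoji(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 8);
    let mut prev: Option<char> = None;
    for c in text.chars() {
        if is_emoji_like(c) {
            if let Some(p) = prev {
                // Only the first character of an emoji run gets a space.
                if !p.is_whitespace() && !is_emoji_like(p) {
                    out.push(' ');
                }
            }
        }
        out.push(c);
        prev = Some(c);
    }
    out
}
```

With this, `space_before_emoji("like this🤷")` returns `"like this 🤷"`, so a whitespace-based split then yields "like", "this", and "🤷". Note that it only adds a space before an emoji run, not after it, so "🤷like" stays a single word.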
Why am I requesting this:
I was tinkering with Charabia to get an understanding of it, and even with my "well-behaved" small dataset I noticed that people use emoji "like this🤷", which segments to "like" and "this🤷". I'd prefer an option to have it segmented to "like", "this", "🤷", which would allow multiple things:
So my concrete feature request is an emoji segmenter that splits prefixed and suffixed emoji clusters off into their own tokens.
Related question: is there a way I could implement this without having to fork Charabia? I haven't found a way to attach a custom segmenter to a tokenizer.
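To make the requested behaviour concrete, here is a minimal sketch of splitting leading and trailing emoji runs off a single token. It is plain standard-library Rust applied to already-split words, not a Charabia segmenter. The names `is_emoji_like` and `split_emoji_affixes` are hypothetical, and the emoji check is a rough code point heuristic rather than proper grapheme cluster handling.

```rust
/// Rough emoji test based on a few Unicode blocks.
/// Assumption: this is a heuristic, not the Unicode `Emoji` property.
fn is_emoji_like(c: char) -> bool {
    matches!(
        c as u32,
        0x1F300..=0x1FAFF   // pictographs, emoticons, transport, skin tones
        | 0x2600..=0x27BF   // miscellaneous symbols and dingbats
        | 0x1F1E6..=0x1F1FF // regional indicators (flags)
        | 0x200D            // zero-width joiner used inside emoji sequences
        | 0xFE0F            // variation selector-16
    )
}

/// Splits a token into [leading emoji run, core, trailing emoji run],
/// dropping any part that is empty. Emoji inside the core are left alone.
fn split_emoji_affixes(token: &str) -> Vec<&str> {
    // Byte offset where the leading emoji run ends.
    let start = token
        .char_indices()
        .find(|&(_, c)| !is_emoji_like(c))
        .map_or(token.len(), |(i, _)| i);
    // Byte offset where the trailing emoji run begins.
    // If the token is all emoji, the core and the suffix are both empty.
    let end = token
        .char_indices()
        .rev()
        .find(|&(_, c)| !is_emoji_like(c))
        .map_or(start, |(i, c)| i + c.len_utf8());

    let parts = [&token[..start], &token[start..end], &token[end..]];
    parts.iter().copied().filter(|s| !s.is_empty()).collect()
}
```

Applied to each word of a whitespace split, for example `"like this🤷".split_whitespace().flat_map(|t| split_emoji_affixes(t))`, this produces "like", "this", "🤷". Whether such logic can be attached to a Charabia tokenizer without forking is exactly the open question above; the sketch only shows the splitting rule itself.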