FarsiProcessor is a Ruby gem that normalizes and stems Persian/Farsi text.
Normalization is defined as:
- Normalization of Arabic kaf to Farsi keheh
- Normalization of Arabic yeh and Arabic alef maksura to Farsi yeh
- Normalization of alef with madda above, alef with hamza below and alef with hamza above to plain alef
- Removing TATWIL
- Removing DIACRITICS
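The rules above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the gem's actual implementation: the module name `NormalizeSketch`, its constants, and the diacritic range (U+064B to U+0652) are hypothetical choices made for this example.

```ruby
# Hypothetical standalone sketch; NOT the gem's real code.
module NormalizeSketch
  # Character-for-character replacements taken from the list above.
  CHAR_MAP = {
    "\u0643" => "\u06A9", # Arabic kaf             -> Farsi keheh
    "\u064A" => "\u06CC", # Arabic yeh             -> Farsi yeh
    "\u0649" => "\u06CC", # Arabic alef maksura    -> Farsi yeh
    "\u0622" => "\u0627", # alef with madda above  -> alef
    "\u0625" => "\u0627", # alef with hamza below  -> alef
    "\u0623" => "\u0627"  # alef with hamza above  -> alef
  }.freeze

  PATTERN    = Regexp.union(CHAR_MAP.keys)
  TATWEEL    = "\u0640"
  # Assumed diacritic range: fathatan through sukun.
  DIACRITICS = /[\u064B-\u0652]/

  def self.normalize(text)
    text.gsub(PATTERN, CHAR_MAP) # map Arabic letter variants to Farsi ones
        .delete(TATWEEL)         # drop elongation marks
        .gsub(DIACRITICS, "")    # drop short-vowel marks
  end
end
```

Using a single hash with `gsub` keeps each rule on one line, which makes it easy to see how an option such as `only` or `except` could filter the table before it is applied.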
Stemming is defined as removing these suffixes (plus the suffixes of plural forms)
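As a rough sketch of suffix stripping, the example below assumes only the two plural suffixes that appear in this README's usage examples ("ها" and "های"). The gem's real suffix list is longer, so the sketch will not reproduce every output the gem gives. The module name `StemSketch` and its `except:` handling are hypothetical.

```ruby
# Hypothetical standalone sketch; NOT the gem's real code.
module StemSketch
  # Assumed suffix list, longest first so the longer suffix wins.
  SUFFIXES = [
    "\u0647\u0627\u06CC", # "های"
    "\u0647\u0627"        # "ها"
  ].freeze

  def self.stem(word, except: [])
    active = SUFFIXES - except
    # Pick the first suffix that matches and leaves a non-empty stem.
    suffix = active.find { |s| word.end_with?(s) && word.length > s.length }
    return word unless suffix

    # Cut the suffix, then drop the space left by a detached suffix.
    word[0...-suffix.length].rstrip
  end
end
```

Ordering the suffixes longest first avoids leaving a partial suffix behind when one suffix is a prefix of another.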
Add this line to your application's Gemfile:
gem "farsi_processor"
And then execute:
$ bundle
Or install it yourself as:
$ gem install farsi_processor
require 'farsi_processor'
[1] pry(main)> FarsiProcessor.process("ك")
=> "ک"
[2] pry(main)> FarsiProcessor.process("کتاب ها")
=> "کتاب"
# process supports the only: and except: options
[3] pry(main)> FarsiProcessor.process("ك ي", only: ["ك"])
=> "ک ي"
[4] pry(main)> FarsiProcessor.process("ك ي", except: ["ك"])
=> "ك ی"
[5] pry(main)> FarsiProcessor.process('دخترهای', except: ['های'])
=> "دختره"
# you can also run just normalization or just stemming;
# both methods accept the only: and except: options as well
[6] pry(main)> FarsiProcessor.normalize("ك")
=> "ک"
[7] pry(main)> FarsiProcessor.stem("کتاب ها")
=> "کتاب"
If you run into a problem with farsi_processor that you cannot solve, please open an issue on GitHub, or fork the project and send a pull request.
The gem is available as open source under the terms of the MIT License.