mojtabasajjadi/FarSSiBERT

A Pre-trained Language Model for Semantic Similarity Measurement of Persian Informal Short Texts
FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts.

FarSSiBERT is a monolingual large language model based on Google’s BERT architecture. The model is pre-trained on a large corpus of informal Persian short texts in a variety of writing styles, including more than 104M tweets on diverse subjects. Paper presenting FarSSiBERT:

Features

It includes a Python library for measuring the semantic similarity of Persian short texts:

  • Text cleaning.
  • A dedicated tokenizer for informal Persian short texts.
  • Transformer-based word and sentence embeddings.
  • Semantic similarity measurement.
  • A pre-trained BERT model for downstream tasks, especially on informal texts.
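Transformer sentence embeddings are commonly obtained by pooling the token vectors of a sentence and then comparing the pooled vectors with cosine similarity. The sketch below illustrates that general mechanism with toy NumPy vectors; it is not the library's internal implementation, and the embedding values are made up for demonstration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

def cosine_similarity(a, b):
    """Cosine of the angle between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "token embeddings" for two short sentences (seq_len x hidden_dim).
emb1 = np.array([[0.2, 0.8, 0.1], [0.4, 0.6, 0.0], [0.0, 0.0, 0.0]])
emb2 = np.array([[0.3, 0.7, 0.2], [0.5, 0.5, 0.1], [0.0, 0.0, 0.0]])
mask1 = np.array([1, 1, 0])  # last position is padding
mask2 = np.array([1, 1, 0])

s1 = mean_pool(emb1, mask1)
s2 = mean_pool(emb2, mask2)
print(round(cosine_similarity(s1, s2), 3))
```

Mean pooling is only one of several common choices; using the `[CLS]` token vector is another.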

How to use:

  • Download and install the FarSSiBERT Python package.
  • Import and use it as shown below:

  from FarSSiBERT.SSMeasurement import SSMeasurement

  text1 = "متن اول"  # "the first text"
  text2 = "متن دوم"  # "the second text"

  new_instance = SSMeasurement(text1, text2)
  label = new_instance.get_similarity_label()
  similarity = new_instance.get_cosine_similarity()
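A method like get_similarity_label() typically maps the continuous cosine score to a discrete label. The sketch below shows one plausible mapping; the threshold value and label names are assumptions made for illustration, not the library's actual behavior.

```python
def similarity_to_label(score, threshold=0.8):
    """Illustrative only: map a cosine-similarity score in [-1, 1] to a
    binary label. The 0.8 threshold and the label strings are assumptions,
    not FarSSiBERT's actual mapping."""
    return "similar" if score >= threshold else "dissimilar"

print(similarity_to_label(0.91))  # similar
print(similarity_to_label(0.35))  # dissimilar
```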

Requirements:

  • python>=3.7
  • transformers==4.30.2
  • torch==1.13.0
  • scikit-learn==0.21.3
  • numpy~=1.21.6
