In this project I used the following pipeline to cluster Persian economic papers:
- Crawled 592 papers from Tarbiat Modares University.
- Extracted the title, abstract, and keywords of each paper.
- Cleaned and preprocessed the text using hazm: normalizing, lemmatizing, and removing stopwords and redundant words.
- Computed two types of word embeddings, using FaBERT and gensim's Word2Vec.
- Applied several clustering algorithms from the sklearn.cluster package.
- Evaluated the results using evaluation metrics that do not require ground-truth labels, and by visualizing the sorted similarity matrix (read here for more).
- Inspected the identified clusters by eye and assigned an appropriate name to each cluster.
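
The clustering and evaluation steps can be illustrated with a minimal sketch. It is not the project's actual code: the embeddings are synthetic stand-ins for the FaBERT/Word2Vec vectors, and the choice of KMeans, the number of clusters, and the two metrics (silhouette and Davies-Bouldin) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: synthetic, well-separated blobs stand in for real paper embeddings.
rng = np.random.default_rng(0)
centers = np.array([
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
])
embeddings = np.vstack([c + 0.3 * rng.standard_normal((20, 4)) for c in centers])

# Assumption: KMeans with k=3 is an illustrative choice, not the project's setting.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Metrics that need no ground-truth labels (internal validation).
sil = silhouette_score(embeddings, labels, metric="cosine")  # higher is better
dbi = davies_bouldin_score(embeddings, labels)               # lower is better

# Sorted similarity matrix: reorder rows and columns so that members of the
# same cluster are adjacent. Good clusters appear as bright diagonal blocks.
order = np.argsort(labels, kind="stable")
sorted_sim = cosine_similarity(embeddings)[np.ix_(order, order)]

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
```

The `sorted_sim` array can then be rendered as a heatmap (for example with matplotlib's `imshow`) to check visually whether the block structure along the diagonal matches the cluster assignments.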
Here you can view the presentation slides.