clustering_persian_economic_papers

In this project I used the following pipeline to cluster Persian economic papers:

Crawled 592 papers from Tarbiat Modares University.
Extracted the title, abstract and keywords of each paper.
Cleaned the text with hazm: normalizing, lemmatizing, and removing stopwords and redundant words.
Computed two types of word embeddings, using FaBERT and gensim's Word2Vec.
Ran several clustering algorithms from the sklearn.cluster package.
Evaluated the results with metrics that do not require ground-truth labels and by visualizing the sorted similarity matrix (read here for more).
Inspected the resulting clusters by eye and assigned an appropriate name to each one.
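
The clustering and evaluation steps can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the embeddings are synthetic stand-ins for the FaBERT or Word2Vec vectors, the number of clusters is an assumed value, and the three metrics shown are simply common label-free choices from sklearn.metrics (the README does not say which ones were actually used).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the paper embeddings: 3 synthetic groups of
# 20 vectors each in 16 dimensions (real vectors would come from FaBERT
# or Word2Vec).
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
embeddings = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

# Cluster with k-means++ initialisation; n_clusters=3 is an assumed value.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Internal metrics: none of them needs ground-truth labels.
scores = {
    # In [-1, 1]; higher means tighter, better separated clusters.
    "silhouette": silhouette_score(embeddings, labels),
    # Non-negative; lower is better.
    "davies_bouldin": davies_bouldin_score(embeddings, labels),
    # Higher is better.
    "calinski_harabasz": calinski_harabasz_score(embeddings, labels),
}

# Sorted similarity matrix: reorder rows and columns so that members of
# the same cluster sit next to each other. A good clustering then shows
# bright square blocks along the diagonal.
order = np.argsort(labels, kind="stable")
similarity = cosine_similarity(embeddings)
sorted_similarity = similarity[np.ix_(order, order)]

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

To inspect the matrix visually, `sorted_similarity` can be passed to a heatmap function such as `matplotlib.pyplot.imshow`. The same evaluation code applies to the labels produced by any other estimator in sklearn.cluster.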

Here you can view the presentation slides.

Repository contents:

cleaned
crawler
modares_papers
results
util
DBscan.py
FaBERT_embedding.py
LICENSE
README.md
affinity_propagation.py
birch.py
clean_data.py
gensim_embedding.py
hierarchical.py
kmeans++.py
spectral.py
utils.py
