First, we preprocess the Common Crawl dataset to remove numbers, short sentences, and special characters. This repo also includes code to remove stop words.
Then we train three types of word embedding models:
- Word2Vec
- GloVe
- FastText
Training data: Si-cc
The trained models (300 dimensions) can be downloaded as follows:
Word2Vec (15.8M; window=5, vector_size=300, min_count=3):

gdown --id 1lcnIgMYPmn5t0L36F_aqQwFKkNHV_Fl0

FastText, .bin version (15.8M; vector_size=300, window=5, min_count=3, negative=10):

gdown --id 1-5qOaEQF16z-3PN6KdJLqMJWw0AD57IG

GloVe:

gdown --id 1pjgCsZ21R-rt2wgYqE6wAz2BVwJIBLsP
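Once downloaded, the embeddings can be loaded with gensim. The sketch below is illustrative only: the file names are placeholders, and the load calls assume the Word2Vec model was saved with gensim's model.save(), the FastText model is a native Facebook .bin file, and the GloVe vectors are in plain text format (word followed by 300 floats per line); adjust to the files you actually downloaded.

```python
# Minimal loading sketch (file names are placeholders; load calls assume
# the save formats described above).
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.fasttext import load_facebook_model

# Word2Vec: assumed saved with gensim's model.save()
w2v = Word2Vec.load("word2vec_300.model")

# FastText: the .bin file is in Facebook's native format
ft = load_facebook_model("fasttext_300.bin")

# GloVe: assumed plain-text vectors without a header line
glove = KeyedVectors.load_word2vec_format("glove_300.txt", binary=False, no_header=True)

# Example query with a Sinhala word
print(w2v.wv.most_similar("ගම", topn=5))
```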
Also, this repo contains source code and datasets for two types of intrinsic evaluation:
- Analogy Evaluation
- Relatedness Evaluation
You can follow the steps below to do the evaluation.
Preprocessing Steps
- Run tokenizer.py. Give the source file containing the Sinhala corpus as input. The output is a file with one sentence per line.
- Process the output of the first step using remove_numeric_lines.py and remove_short_sentenes.py. If you want to train embedding models without stop words, use remove_stop_words.py to remove them from the corpus; it removes the predefined stop words listed in the "stop words.txt" file. You can add, remove, or modify those stop words as needed. (A rough sketch of these filtering steps follows this list.)
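The following is a rough sketch of the preprocessing filters described above. It is not the repo's exact scripts: the short-sentence threshold MIN_TOKENS, the regexes, and the input/output file names are assumptions made for illustration.

```python
# Illustrative preprocessing sketch (not the repo's exact scripts).
# Assumes the input file already has one sentence per line (tokenizer.py output).
import re

MIN_TOKENS = 3  # assumed minimum sentence length in tokens

def load_stop_words(path="stop words.txt"):
    """Read the predefined stop word list shipped with the repo."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def clean_corpus(in_path, out_path, remove_stops=False):
    stops = load_stop_words() if remove_stops else set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for sentence in src:                              # one sentence per line
            sentence = re.sub(r"[0-9]+", " ", sentence)   # remove numbers
            sentence = re.sub(r"[^\w\s]", " ", sentence)  # remove special characters
            tokens = [t for t in sentence.split() if t not in stops]
            if len(tokens) >= MIN_TOKENS:                 # drop short sentences
                dst.write(" ".join(tokens) + "\n")

# Example usage (placeholder file names):
# clean_corpus("sinhala_sentences.txt", "sinhala_clean.txt", remove_stops=True)
```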
Training Steps
- Train Word2Vec using wor2vec_train.py (change the 'sg' hyperparameter to select between CBOW and Skip-gram); see the gensim sketch after this list
- Train FastText using fasttext_train.py
- Train GloVe using glove_train.py
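For reference, here is a minimal gensim-based sketch of training Word2Vec and FastText with the hyperparameters listed for the released models (vector_size=300, window=5, min_count=3, negative=10 for FastText). It is not the repo's exact training scripts, and the file names are placeholders; GloVe uses a separate toolchain (glove_train.py), so it is not shown.

```python
# Minimal training sketch (illustrative; file names are placeholders,
# hyperparameters follow the values listed for the released models).
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence("sinhala_clean.txt")  # one preprocessed sentence per line

# Word2Vec: sg=1 -> Skip-gram, sg=0 -> CBOW
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=3, sg=1, workers=4)
w2v.save("word2vec_300.model")

# FastText with negative sampling, matching the released .bin model's settings
ft = FastText(sentences=corpus, vector_size=300, window=5, min_count=3, negative=10, workers=4)
ft.save("fasttext_300.model")
```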
Intrinsic Evaluation
- For Word2Vec and FastText models, use analogy_evaluation_word2vec_fasttext.py for analogy evaluation.
- For GloVe models, use analogy_evaluation_glove.py for analogy evaluation.
- For relatedness evaluation, use relatedness_measure_fasttext.py, relatedness_measure_glove.py, and relatedness_meausre_word2vec.py (a minimal sketch of both evaluations follows this list).
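As an illustration of what these evaluations compute, the sketch below runs a gensim analogy evaluation and a Spearman-correlation relatedness evaluation on a Word2Vec model. The dataset paths and file formats are assumptions (gensim's questions-words format for analogies; "word1 word2 human_score" lines for relatedness), not necessarily the exact formats used by this repo's scripts.

```python
# Illustrative intrinsic-evaluation sketch (not the repo's exact scripts;
# dataset paths and formats are assumptions).
from gensim.models import Word2Vec
from scipy.stats import spearmanr

wv = Word2Vec.load("word2vec_300.model").wv

# Analogy evaluation: file in gensim's questions-words format
# (": section" headers followed by lines "a b c d", meaning a:b :: c:d).
score, sections = wv.evaluate_word_analogies("sinhala_analogy.txt")
print(f"analogy accuracy: {score:.3f}")

# Relatedness evaluation: compare model similarity against human scores.
model_sims, human_sims = [], []
with open("sinhala_relatedness.txt", encoding="utf-8") as f:
    for line in f:
        w1, w2, gold = line.split()
        if w1 in wv and w2 in wv:
            model_sims.append(wv.similarity(w1, w2))
            human_sims.append(float(gold))
rho, _ = spearmanr(model_sims, human_sims)
print(f"relatedness Spearman correlation: {rho:.3f}")
```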
Extrinsic Evaluation
- Sentiment analysis for Sinhala can be conducted as described in https://github.com/theisuru/sentiment-tagger
- POS tagging can be performed based on https://github.com/wantinghuang/tensorflow-lstmcrf-postagger. This repo contains the preprocessed POS tag dataset that was used for the POS tagging evaluation task.