This repository has been archived by the owner on Jan 26, 2021. It is now read-only.

Fetch enwiki train and vocab files #9

Open
loretoparisi opened this issue Oct 21, 2016 · 4 comments

Comments

@loretoparisi

loretoparisi commented Oct 21, 2016

The "run.bat" script has the options text, read_vocab and train_file.

  • What is the input format expected for these files? For example, is the dump https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ok?
  • Is text a specific file in the train_file folder?

How can I generate or retrieve an enwiki vocabulary file?

Is it ok to process the dump and generate a vocabulary like the one made by the mkvocab.pl script?

THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795
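
For reference, a minimal sketch of how such a count listing could be produced from an already-extracted plain-text corpus. The file names are hypothetical, the conversion of the dump to plain text is assumed to have happened beforehand, and the uppercasing simply mirrors the sample above rather than anything mkvocab.pl is known to do:

# Minimal sketch: build a word/count listing like the mkvocab.pl sample above.
# Assumes the wiki dump has already been converted to plain text;
# "enwiki_plain.txt" and "enwiki_vocab.txt" are hypothetical file names.
from collections import Counter
import re

counts = Counter()
with open("enwiki_plain.txt", encoding="utf-8") as f:
    for line in f:
        # Uppercase and keep alphabetic tokens, mirroring THE/AND/WAS above.
        counts.update(re.findall(r"[A-Za-z]+", line.upper()))

with open("enwiki_vocab.txt", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word}\t{count}\n")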
@loretoparisi
Author

loretoparisi commented Oct 21, 2016

[UPDATE]

I have found another vocabulary dump, extracted directly from Google's GoogleNews-vectors-negative300.bin file and split into 30 files of 100K words each (Google's feature vectors cover 3 million words). The files are in the vocabulary folder of this repo:

https://github.com/loretoparisi/inspect_word2vec

The vocabulary structure looks like this:

$ head -n 10 vocabulary_01.txt 
Allanah_Munson
WINDS_WILL
nab_sexual_predators
By_Alexandra_Barham
Mayor_Noramie_Jasmin
Chief_Executive_Glenn_Tilton
Neil_Kinnock
Makoto_Tamada_JPN_Konica
abductor_muscle
visit_www.availability.sungard.com

so we have commonly paired words as well as stopwords.
Can I use this vocabulary as read_vocab input then?

@xuehui1991

Hey loretoparisi, here is an example:

set size=300
set text=test_version
set read_vocab=%text%_vocab_data.txt
set train_file=%text%_training_data.txt
set binary=1
set cbow=1
set alpha=0.01
set epoch=20
set window=5
set sample=0
set hs=0
set negative=5
set threads=16
set mincount=5
set sw_file=stopwords_simple.txt
set stopwords=0
set data_block_size=1000
set max_preload_data_size=2000
set use_adagrad=0
set is_pipeline=0
set output=%text%_%size%.bin

distributed_word_embedding.exe -max_preload_data_size %max_preload_data_size% -is_pipeline %is_pipeline% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad%

You can see that text is a parameter in the bat file, and train_file is the path where you put your training file.

As for the input format of train_file, it can be a raw English text file, or a file in another language after segmentation (it's better if you have removed some noise).

As for the input format of vocab_file, you can check Preprocess in the new repo.

Or, if you don't want to use that code, make sure that your vocab_file looks like this (separated by spaces):

word_name_1 word_frequency_1
word_name_2 word_frequency_2
...
word_name_n word_frequency_n
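
To illustrate the consumer side of that format, here is a minimal sketch of parsing such a whitespace-separated word/count file into a dict. The file name is taken from the batch example above, the min_count filter is an assumption echoing its mincount=5 setting, and this is not the tool's actual loader:

# Minimal sketch: parse a "word count" vocab file (one pair per line).
def read_vocab(path, min_count=5):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip malformed lines
            word, freq = parts[0], int(parts[1])
            if freq >= min_count:
                vocab[word] = freq
    return vocab

vocab = read_vocab("test_version_vocab_data.txt")
print(len(vocab), "words kept")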

Best wishes.

@loretoparisi
Author

loretoparisi commented Oct 24, 2016

@xuehui1991 thank you for your answer; one thing I would kindly ask you: is word_frequency_1 the term frequency expressed in tf-idf form (http://en.wikipedia.org/wiki/Tf%E2%80%93idf)?
I can see a new WordEmbedding project in the Multiverso framework - https://github.com/Microsoft/Multiverso/tree/master/Applications/WordEmbedding
Do we have to use that one as the DMTK word embedding now?

Thank you.

@xuehui1991

I think word_frequency_1 means the word count in the dataset; there is no need to use tf-idf.
As for the repo, I think both of them are ok.
