This repository has been archived by the owner on Jan 26, 2021. It is now read-only.

Fetch enwiki train and vocab files #9

Open
loretoparisi opened this issue Oct 21, 2016 · 4 comments

Comments

@loretoparisi

loretoparisi commented Oct 21, 2016

The "run.bat" script has the options text, read_vocab and train_file.

  • What is the input format expected for these files? For example, is the dump https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ok?
  • Is text a specific file in the train_file folder?

How can I generate or retrieve an enwiki vocabulary file?

Is it ok to process the dump and generate a vocabulary like the one made by the mkvocab.pl script?

THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795
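
For reference, a minimal sketch of how such a count listing could be produced from an already-extracted plain-text corpus. The file names are hypothetical, the conversion of the dump to plain text is assumed to have happened beforehand, and the uppercasing simply mirrors the sample above rather than anything mkvocab.pl is known to do:

# Minimal sketch: build a word/count listing like the mkvocab.pl sample above.
# Assumes the wiki dump has already been converted to plain text;
# "enwiki_plain.txt" and "enwiki_vocab.txt" are hypothetical file names.
from collections import Counter
import re

counts = Counter()
with open("enwiki_plain.txt", encoding="utf-8") as f:
    for line in f:
        # Uppercase and keep alphabetic tokens, mirroring THE/AND/WAS above.
        counts.update(re.findall(r"[A-Za-z]+", line.upper()))

with open("enwiki_vocab.txt", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word}\t{count}\n")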
@loretoparisi
Author

loretoparisi commented Oct 21, 2016

[UPDATE]

I have found another vocabulary dump, extracted directly from Google's GoogleNews-vectors-negative300.bin file and split into 30 files of 100K words each (Google's feature vectors cover 3 million words). The files are in the vocabulary folder of this repo:

https://github.com/loretoparisi/inspect_word2vec

The vocabulary structure looks like this:

$ head -n 10 vocabulary_01.txt 
Allanah_Munson
WINDS_WILL
nab_sexual_predators
By_Alexandra_Barham
Mayor_Noramie_Jasmin
Chief_Executive_Glenn_Tilton
Neil_Kinnock
Makoto_Tamada_JPN_Konica
abductor_muscle
visit_www.availability.sungard.com

so we have commonly paired words as well as stopwords.
Can I use this vocabulary as read_vocab input then?

@xuehui1991

Hey loretoparisi, here is an example:

set size=300
set text=test_version
set read_vocab=%text%_vocab_data.txt
set train_file=%text%_training_data.txt
set binary=1
set cbow=1
set alpha=0.01
set epoch=20
set window=5
set sample=0
set hs=0
set negative=5
set threads=16
set mincount=5
set sw_file=stopwords_simple.txt
set stopwords=0
set data_block_size=1000
set max_preload_data_size=2000
set use_adagrad=0
set is_pipeline=0
set output=%text%_%size%.bin

distributed_word_embedding.exe -max_preload_data_size %max_preload_data_size% -is_pipeline %is_pipeline% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad%

You can see that text is a parameter in the bat file, and train_file is the path where you put your training file.

As for the input format of train_file, it can be a raw English text file, or a file in another language after segmentation (it's better if you have removed some noise).

As for the input format of vocab_file, you can check Preprocess in the new repo.

Or, if you don't want to use that code, make sure that your vocab_file looks like this (separated by spaces):

word_name_1 word_frequency_1
word_name_2 word_frequency_2
...
word_name_n word_frequency_n
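
To illustrate the consumer side of that format, here is a minimal sketch of parsing such a whitespace-separated word/count file into a dict. The file name is taken from the batch example above, the min_count filter is an assumption echoing its mincount=5 setting, and this is not the tool's actual loader:

# Minimal sketch: parse a "word count" vocab file (one pair per line).
def read_vocab(path, min_count=5):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip malformed lines
            word, freq = parts[0], int(parts[1])
            if freq >= min_count:
                vocab[word] = freq
    return vocab

vocab = read_vocab("test_version_vocab_data.txt")
print(len(vocab), "words kept")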

Best wishes.

@loretoparisi
Author

loretoparisi commented Oct 24, 2016

@xuehui1991 thank you for your answer; one thing I would kindly ask you: is word_frequency_1 the term frequency expressed in tf-idf form (http://en.wikipedia.org/wiki/Tf%E2%80%93idf)?
I can see a new WordEmbedding project in the Multiverso framework - https://github.com/Microsoft/Multiverso/tree/master/Applications/WordEmbedding
Do we have to use that one as the DMTK word embedding now?

Thank you.

@xuehui1991

I think word_frequency_1 means the word count in the dataset; there is no need to use tf-idf.
As for the repo, I think both of them are ok.
