Fetch enwiki train and vocab files #9
Comments
[UPDATE] I have found another vocabulary dump, taken directly from Google's: https://github.com/loretoparisi/inspect_word2vec. The vocabulary structure is like
so we have
Hey loretoparisi, there is an example:
you can see As for input format of As for the input format of or you don't want to use this code , make sure that your
Best wishes.
@xuehui1991 thank you for your answer. There is one thing I would kindly ask you. Is the Thank you.
I think word_frequency_1 means the count of the word in the dataset, so there is no need to use tf-idf.
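A minimal sketch of building a count-based vocabulary of that kind, for illustration only. It assumes whitespace-tokenized plain text as input and a "word count" pair per line as output; both are assumptions and should be checked against the example vocabulary file. The names build_vocab and write_vocab are hypothetical and are not part of the repository's scripts or of mkvocab.pl.

```python
import io
from collections import Counter


def build_vocab(lines, min_count=1):
    """Count raw word occurrences (no tf-idf) over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumption: tokens are separated by whitespace.
        counts.update(line.split())
    # Most frequent words first; ties are broken alphabetically.
    return sorted(
        ((word, count) for word, count in counts.items() if count >= min_count),
        key=lambda wc: (-wc[1], wc[0]),
    )


def write_vocab(vocab, out):
    """Write one 'word count' pair per line (assumed format)."""
    for word, count in vocab:
        out.write(f"{word} {count}\n")


# Tiny in-memory example; a real run would iterate over an open text file.
sample = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(sample)
buf = io.StringIO()
write_vocab(vocab, buf)
```

For the Wikipedia dump, the XML would first have to be converted to plain text; an open file handle can then be passed to build_vocab in place of the sample list, since the function only iterates over lines.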
The "run.bat" script has the options "text", "read_vocab" and "train_file".
Is https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 OK to use?
Is "text" a specific file in the "train_file" folder?
How can I generate or retrieve an enwiki vocabulary file?
Is it OK to process the dump and generate a vocabulary like the one made by the script mkvocab.pl?