This fork of the original GloVe package includes some tweaks to the original algorithms and adds tools (compiled [CP]ython scripts) for remapping the preprocessed data and serialising them as (platform-agnostic) NumPy objects. Changes were made to `common.[ch]`, `cooccur.c`, and `vocab-count.c`, and the output of these programs is not entirely compatible with the original repository.
- The original `get_word` algorithm (defined in `common.c`) has been changed to something more sensible. (The `get_word` function found in the original repo is not even the one they used in their experiments: the paper states that they used the Stanford tokenizer.)
- `vocab-count.c` now generates a fixed-width (32-bit) integer encoding of the tokenized corpus and saves it in the working directory as `encoded`. The `HASHREC` struct defined in `common.h` and used in `vocab-count.c` has a new field which records this encoding. Newlines (which are considered "document boundaries" by `cooccur`) get the code `-1`. (A sketch of reading the `encoded` file back follows this list.)
- The original version of `cooccur.c` used the corpus and the vocabulary generated by `vocab-count.c` as input. Now it only uses the file `encoded`. (The functionality of checking tokens against the vocabulary and collapsing unknown tokens has been outsourced to `remap.pyx`.)
- Except for `glove.c` (which uses `pthread.h`), the repo will now compile on Windows (tested with MSVC 2019 and MinGW gcc; note that MinGW won't compile the Cython scripts). Previously there were some non-compliant format strings that caused trouble.
- I haven't made any changes to or tested `glove.c`: the purpose of this fork was to use the preprocessing code to generate input files for a Python reimplementation.
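For reference, the `encoded` file can be read back from Python roughly as below. This is a sketch only: it assumes native byte order (the file is machine-dependent) and that the first 32-bit value is the largest occurring code, as noted in the file descriptions further down.

```python
import numpy as np

# Hedged sketch: read the fixed-width 32-bit encoding produced by vocab-count.
# Assumes native byte order and that the first value is the largest code.
raw = np.fromfile("encoded", dtype=np.int32)
max_code, codes = int(raw[0]), raw[1:]

# Newlines ("document boundaries" for cooccur) are stored as -1.
n_documents = int(np.count_nonzero(codes == -1)) + 1
```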
Usage:
```
make          # compile and link C code in /src directory
make scripts  # compile Cython scripts in place in /scripts directory
```
The preprocessing generates the following files:

- `vocab` --- String of lemmata of the form `'{word} {code}\n'`.
- `encoded` --- Machine-dependent binary. An array of `int`s. The largest occurring value is written at the top.
- `cooccur.bin` --- Machine-dependent binary. An array of 16-byte `struct crec` -- check `common.h` for the definition. (A sketch of reading it follows this list.)
- `2-gram-xx.npz` --- A NumPy-generated zip archive with contents `idx.npy` (format `np.int32`) and `val.npy` (format `np.float64`).
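As an illustration, `cooccur.bin` can be loaded into NumPy with a structured dtype. This is a sketch under the assumption that `struct crec` is two 32-bit word indices followed by a 64-bit float with no padding (16 bytes total); confirm the layout against `common.h`.

```python
import numpy as np

# Assumed layout of struct crec: {int32 word1; int32 word2; float64 val;}
# (16 bytes, no padding) -- verify against the definition in common.h.
crec = np.dtype([("word1", np.int32), ("word2", np.int32), ("val", np.float64)])
records = np.fromfile("cooccur.bin", dtype=crec)
print(records["word1"][:5], records["val"][:5])
```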
I added some Cython scripts. If you want to compile these by hand, you must run `cython` with the `--embed` flag to generate C code with a `main` function.
- `remap.pyx` --- Takes a `vocab` and `encoded` as input, and collapses all codes not found in `vocab` to a single token. The codes are then remapped to the set `0, 1, ..., max_code`, where `max_code` is the code for an unknown token. Generates a new `vocab` file recording this mapping. (A sketch of the remapping logic follows this list.)
- `twogram.pyx` --- Repackages the output of `cooccur` as a set of `2-gram-%d.npz` NumPy archives and a plain text file `2-gram.manifest` listing them.
- `encoded.py` --- This simple script wraps the binary `encoded` output of `vocab-count` with a NumPy header and saves it with a `.npz` extension. It is not part of the main preprocessing workflow.
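For illustration, the remapping described above amounts to something like the following. This is a minimal Python sketch of the idea, not the actual `remap.pyx` implementation; the function and variable names are made up.

```python
import numpy as np

def collapse_and_remap(codes, vocab_codes):
    """Illustrative only: map codes that appear in the vocabulary onto
    0 .. max_code - 1, and everything else onto max_code (the unknown token)."""
    known = sorted(set(int(c) for c in vocab_codes))
    new_code = {old: new for new, old in enumerate(known)}
    max_code = len(known)  # single code shared by all unknown tokens
    remapped = np.array([new_code.get(int(c), max_code) for c in codes], dtype=np.int32)
    return remapped, max_code
```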
You can use the `Makefile` in `./working` to run the preprocessing (perhaps this is a somewhat idiosyncratic approach, but it seems to work well).
Usage:
```
make vocab CORPUS={path_to_corpus.txt}  # run vocab-count, yielding vocab mapping + fixed-width encoded corpus
make encoded
make remap                              # run the script for collapsing unknown tokens (NOT WORKING YET)
make 2-gram MEMORY=8 WINDOW=5           # run cooccur.c and shuffle.c, and wrap the output.
                                        # This might take a while and use a lot of disk space.
                                        # The generated .npz archives will be split into chunks of size MEMORY/2 GB.
make clean                              # delete intermediate files, leaving only the original corpus, vocab, and 2-gram .npz archives.
                                        # If you run make 2-gram again, it will not find up-to-date dependencies and will rebuild everything.
make clobber                            # like make clean, but also delete the vocab and 2-gram files
```
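Once `make 2-gram` has finished, the chunked output can be consumed from Python along these lines. This is a sketch that assumes `2-gram.manifest` lists one archive filename per line and that each archive stores the arrays described above under the names `idx` and `val` (i.e. `idx.npy` / `val.npy` inside the zip).

```python
import numpy as np

# Sketch of consuming the chunked co-occurrence data; the manifest format
# (one .npz filename per line) and the array key names are assumptions.
with open("2-gram.manifest") as manifest:
    chunk_names = [line.strip() for line in manifest if line.strip()]

for name in chunk_names:
    with np.load(name) as chunk:
        idx, val = chunk["idx"], chunk["val"]  # int32 indices, float64 counts
        # ... feed (idx, val) to the downstream GloVe reimplementation chunk by chunk
```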
Unless you are preprocessing the data inside the actual repo `working` directory, you are going to want to pass `REPO={path_to_repo}` as an argument to `make` or write this into the `Makefile`. You should make sure that you have on the order of `sizeof(corpus) * 3` disk space free (the exact number will depend on the arguments you pass to the preprocessing scripts).
All work contained in this package is licensed under the Apache License, Version 2.0. See the included LICENSE file.