
GloVe: Global Vectors for Word Representation

This fork of the original GloVe package includes some tweaks to the original algorithms and adds tools (compiled [CP]ython scripts) for remapping the preprocessed data and serialising them as (platform-agnostic) NumPy objects. Changes were made to common.[ch], cooccur.c, and vocab-count.c, and the output of these programs is not entirely compatible with the original repository.

Changes

  • The original get_word algorithm (defined in common.c) has been replaced with something more sensible. (The get_word function found in the original repo is not even the one they used in their experiments: the paper states that they used the Stanford tokenizer.)
  • vocab-count.c now generates a fixed-width (32-bit) integer encoding of the tokenized corpus and saves it in the working directory as encoded. The HASHREC struct defined in common.h and used in vocab-count.c has a new field which records this encoding. Newlines (which are considered "document boundaries" by cooccur) get the code -1; see the reading sketch after this list.
  • The original version of cooccur.c used the corpus and the vocabulary generated by vocab-count.c as input. Now it only uses the file encoded. (The functionality of checking tokens against the vocabulary and collapsing unknown tokens has been outsourced to remap.pyx.)
  • Except for glove.c (which uses pthread.h), the repo now compiles on Windows (tested with MSVC 2019 and MinGW gcc; note that MinGW won't compile the Cython scripts). Previously, some non-compliant format strings caused trouble.
  • I haven't made any changes to or tested glove.c: the purpose of this fork was to use the preprocessing code to generate input files to a Python reimplementation.
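
To make the new encoding concrete, here is a minimal NumPy sketch of reading encoded, assuming the layout described above and under File formats: a single int32 header holding the largest occurring code, followed by one int32 per token, with -1 marking document boundaries. The header layout is an assumption; check vocab-count.c if in doubt.

import numpy as np

raw = np.fromfile("encoded", dtype=np.int32)   # machine-dependent byte order
max_code, codes = raw[0], raw[1:]              # assumed header, then the token stream

# Split the stream into documents at the -1 sentinels.
boundaries = np.flatnonzero(codes == -1)
documents = [doc[doc != -1] for doc in np.split(codes, boundaries + 1)]
documents = [doc for doc in documents if doc.size]
print(max_code, len(documents), len(codes))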

Usage:

make # compile and link C code in /src directory
make scripts # compile Cython scripts in place in /scripts directory

File formats

  • vocab --- Plain text: one lemma per line in the form '{word} {code}\n'.
  • encoded --- Machine-dependent binary: an array of 32-bit ints. The largest occurring value is written at the top.
  • cooccur.bin --- Machine-dependent binary: an array of 16-byte struct crec records; check common.h for the definition.
  • 2-gram-xx.npz --- A NumPy-generated zip archive containing idx.npy (dtype np.int32) and val.npy (dtype np.float64). A sketch of reading these files follows.
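
As a rough guide to reading these files from Python (not part of the repo), the sketch below assumes the usual GloVe crec layout of two int32 word indices followed by a float64 count, which matches the 16-byte record size; confirm the field order against common.h. The chunk filename is illustrative.

import numpy as np

# Assumed crec layout: two word indices plus a count. Confirm against common.h.
crec = np.dtype([("word1", np.int32), ("word2", np.int32), ("val", np.float64)])
records = np.fromfile("cooccur.bin", dtype=crec)   # machine-dependent byte order
print(records[:5])

archive = np.load("2-gram-0.npz")                  # illustrative chunk name
idx, val = archive["idx"], archive["val"]          # np.int32 index pairs, np.float64 counts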

Scripts

I added some Cython scripts. If you want to compile these by hand, you must run cython with the --embed flag to generate C code with a main function.

  • remap.pyx --- takes vocab and encoded as input, and collapses all codes not found in vocab to a single token. The codes are then remapped to the set 0, 1, ..., max_code, where max_code is the code for an unknown token. Generates a new vocab file recording this mapping; a rough illustration follows this list.
  • twogram.pyx --- repackages the output of cooccur as a set of 2-gram-%d.npz NumPy archives and a plain text file 2-gram.manifest listing them.
  • encoded.py --- this simple script wraps the binary encoded output of vocab-count with a NumPy header and saves it with a .npz extension. It is not part of the main preprocessing workflow.
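
For orientation, here is a rough pure-Python illustration of the collapsing and renumbering that remap.pyx is described as performing. The real script is Cython and works on the binary encoded stream; the function below is hypothetical.

import numpy as np

def remap(codes, vocab_codes):
    # Known codes are renumbered 0, 1, ..., len(vocab_codes) - 1; everything
    # else collapses to the single unknown code max_code == len(vocab_codes).
    old_to_new = {old: new for new, old in enumerate(sorted(vocab_codes))}
    max_code = len(old_to_new)
    # Document-boundary sentinels (-1) are passed through unchanged here;
    # how remap.pyx treats them is an assumption.
    remapped = np.array([c if c == -1 else old_to_new.get(c, max_code) for c in codes],
                        dtype=np.int32)
    return remapped, old_to_new   # old_to_new is what the new vocab file would record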

Running the preprocessing

You can use the Makefile in ./working to run the preprocessing (perhaps this is a somewhat idiosyncratic approach, but it seems to work well).

Usage:

make vocab CORPUS={path_to_corpus.txt} # run vocab-count, yielding vocab mapping + fixed-width encoded corpus
make encoded

make remap # this runs the script for collapsing unknown tokens (NOT WORKING YET)

make 2-gram MEMORY=8 WINDOW=5 # run cooccur.c and shuffle.c, and wrap the output. 
                              # This might take a while and use a lot of disk space. 
                              # The generated .npz archives will be split into chunks of size MEMORY/2 GB.

make clean # delete intermediate files, leaving only the original corpus, vocab, and two.npz.
           # Since the intermediates are gone, running make 2-gram again will rebuild everything.

make clobber # like make clean, but also delete vocab and 2-gram files

Unless you are preprocessing the data inside the repo's own working directory, you will want to pass REPO={path_to_repo} as an argument to make or write it into the Makefile. Make sure you have on the order of 3 * sizeof(corpus) of free disk space (the exact figure depends on the arguments you pass to the preprocessing scripts).
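
Once make 2-gram has finished, the chunked output can be loaded from Python along these lines. This assumes 2-gram.manifest lists one .npz chunk filename per line and that the snippet is run from the working directory; both are assumptions based on the description above.

import numpy as np

with open("2-gram.manifest") as f:
    chunks = [line.strip() for line in f if line.strip()]

idx = np.concatenate([np.load(name)["idx"] for name in chunks])   # int32 cooccurrence index pairs
val = np.concatenate([np.load(name)["val"] for name in chunks])   # float64 cooccurrence values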

License

All work contained in this package is licensed under the Apache License, Version 2.0. See the included LICENSE file.
