first commit

surajiyer · Aug 5, 2020 · e39df46
commit e39df46
Showing 10 changed files with 1,004 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,105 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+.pytest_cache/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+.static_storage/
+.media/
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (C) 2017 Ines Montani
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,79 @@
+# spacycaKE: Keyphrase Extraction for spaCy
+[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component for adding Keyphrase Extraction metadata to `Doc` objects.
+
+## Installation
+`spacycaKE` requires `spacy` v2.0.0 or higher and `spacybert` v1.0.0 or higher.
+
+## Usage
+### Getting BERT embeddings for a single-language dataset
+```
+import spacy
+import spacycake
+nlp = spacy.load('en')
+```
+
+Then either use `BertInference` as part of a pipeline,
+```
+bert = BertInference(
+    from_pretrained='path/to/pretrained_bert_weights_dir',
+    set_extension=False)
+nlp.add_pipe(bert, last=True)
+```
+or use it on its own, outside the pipeline:
+```
+bert = BertInference(
+    from_pretrained='path/to/pretrained_bert_weights_dir',
+    set_extension=True)
+```
+The difference is that when `set_extension=True`, `bert_repr` is set as a property extension on the `Doc`, `Span` and `Token` spaCy objects, so the embedding is computed whenever `doc._.bert_repr` is accessed. When `set_extension=False`, `bert_repr` is set as an attribute extension with a default value of `None`, which is filled in when the component runs as part of the pipeline.
+
+Get the BERT representation (embedding):
+```
+doc = nlp("This is a test")
+print(doc._.bert_repr)  # <-- torch.Tensor
+```
+
+## Available attributes
+The extension sets attributes on the `Doc`, `Span` and `Token` objects. You can change the attribute name when initializing the extension.
+| name | type | description |
+|-|-|-|
+| `Doc._.cake` | `torch.Tensor` | Document BERT embedding |
+
+## Settings
+On initialization of `BertInference`, you can define the following:
+
+| name | type | default | description |
+|-|-|-|-|
+| `from_pretrained` | `str` | `None` | Path to a BERT model directory, or the name of pre-trained BERT weights from HuggingFace transformers, e.g., `bert-base-uncased` |
+| `attr_name` | `str` | `'bert_repr'` | Name of the extension attribute, accessed under `._`, that holds the BERT embedding |
+| `max_seq_len` | `int` | 512 | Maximum sequence length for input to BERT |
+| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Strategy for generating a single sentence embedding from multiple word embeddings. See the pooling strategies listed below. |
+| `set_extension` | `bool` | `True` | If `True`, then `'bert_repr'` is set as a property extension for the `Doc`, `Span` and `Token` spacy objects. If `False`, the `'bert_repr'` is set as an attribute extension with a default value (`None`) which gets filled correctly when called in a pipeline. Set it to `False` if you want to use this extension in a spacy pipeline. |
+| `force_extension` | `bool` | `True` | If `True`, overwrite an existing extension attribute of the same name when the extension is set again |
+
+On initialization of `MultiLangBertInference`, you can define the following:
+
+| name | type | default | description |
+|-|-|-|-|
+| `from_pretrained` | `Dict[LANG_ISO_639_1, str]` | `None` | Mapping from two-letter language codes to the path of a model directory or the name of pre-trained BERT weights from HuggingFace transformers |
+| `attr_name` | `str` | `'bert_repr'` | Same as in BertInference |
+| `max_seq_len` | `int` | 512 | Same as in BertInference |
+| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Same as in BertInference |
+| `set_extension` | `bool` | `True` | Same as in BertInference |
+| `force_extension` | `bool` | `True` | Same as in BertInference |
+
+## Pooling strategies
+| strategy | description |
+|-|-|
+| `REDUCE_MEAN` | Element-wise average of the word embeddings |
+| `REDUCE_MAX` | Element-wise maximum of the word embeddings |
+| `REDUCE_MEAN_MAX` | Apply both `'REDUCE_MEAN'` and `'REDUCE_MAX'` and concatenate the results. If each word embedding has shape `(768,)`, the output has shape `(1536,)` |
+| `CLS_TOKEN`, `FIRST_TOKEN` | Take the embedding of only the first `[CLS]` token |
+| `SEP_TOKEN`, `LAST_TOKEN` | Take the embedding of only the last `[SEP]` token |
+| `None` | No reduction is applied and a matrix of embeddings per word in the sentence is returned |
+
+## Roadmap
+This extension is still experimental. Possible future updates include:
+* Getting document representations from state-of-the-art NLP models other than Google's BERT.
+* Method for computing similarity between `Doc`, `Span` and `Token` objects using the `bert_repr` tensor.
+* Getting representation from multiple / other layers in the models.
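
The pooling strategies listed in the README's "Pooling strategies" table can be illustrated with a minimal, dependency-free sketch. This is not the code from this commit: the `pool` function and its signature are hypothetical, plain Python lists stand in for `torch.Tensor`, and it assumes the first row is the `[CLS]` embedding, the last row is the `[SEP]` embedding, and no padding rows are present.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def pool(token_embeddings: Sequence[Sequence[float]],
         strategy: Optional[str] = "REDUCE_MEAN"):
    """Reduce a (num_tokens, dim) list of embeddings according to `strategy`.

    Hypothetical helper for illustration only. Assumes row 0 is [CLS],
    the last row is [SEP], and there is no padding.
    """
    rows: List[Vector] = [list(row) for row in token_embeddings]
    if strategy is None:
        # No reduction: return the per-token embedding matrix unchanged.
        return rows
    if not rows:
        raise ValueError("need at least one token embedding")

    # Transpose so each entry holds one embedding dimension across all tokens.
    columns = list(zip(*rows))
    mean = [sum(col) / len(col) for col in columns]
    peak = [max(col) for col in columns]

    if strategy == "REDUCE_MEAN":
        return mean
    if strategy == "REDUCE_MAX":
        return peak
    if strategy == "REDUCE_MEAN_MAX":
        # Concatenation doubles the dimensionality, e.g. 768 -> 1536.
        return mean + peak
    if strategy in ("CLS_TOKEN", "FIRST_TOKEN"):
        return rows[0]
    if strategy in ("SEP_TOKEN", "LAST_TOKEN"):
        return rows[-1]
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


# Three tokens with two-dimensional embeddings.
embeddings = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
mean_max = pool(embeddings, "REDUCE_MEAN_MAX")
print(mean_max)
```

Whether the real component includes the `[CLS]` and `[SEP]` rows in the mean and max reductions is not stated in the README, so the sketch simply reduces over every row it is given.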