Implementation of the models described in the paper https://aclanthology.org/2022.acl-srw.29/.
If you find our work useful, please consider citing us:
@inproceedings{papaluca-etal-2022-pretrained,
title = "Pretrained Knowledge Base Embeddings for improved Sentential Relation Extraction",
author = "Papaluca, Andrea and
Krefl, Daniel and
Suominen, Hanna and
Lenskiy, Artem",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-srw.29",
doi = "10.18653/v1/2022.acl-srw.29",
pages = "373--382",
abstract = "In this work we put forward to combine pretrained knowledge base graph embeddings with transformer based language models to improve performance on the sentential Relation Extraction task in natural language processing. Our proposed model is based on a simple variation of existing models to incorporate off-task pretrained graph embeddings with an on-task finetuned BERT encoder. We perform a detailed statistical evaluation of the model on standard datasets. We provide evidence that the added graph embeddings improve the performance, making such a simple approach competitive with the state-of-the-art models that perform explicit on-task training of the graph embeddings. Furthermore, we observe for the underlying BERT model an interesting power-law scaling behavior between the variance of the F1 score obtained for a relation class and its support in terms of training examples.",
}
- model.py: Implementation of the various models as torch.nn.Module objects.
- trainer.py: Trainer object that implements the training loop for the models in model.py.
- dataset.py: Implementation of the two objects used for preprocessing and preparing the data to be fed to the models.
- run_experiment.py: Main script to train and/or evaluate a model on one of the three datasets.
- ner_schemes.py: Simple implementation of the schemes for Named Entity Recognition (at the moment just BIOES is implemented).
- plot_results.py: Script for generating the plots shown in the paper.
- utils.py: Various helper functions mainly used by plot_results.py.
- graph.py: Small implementation of a Knowledge Graph object to visualize the graph of entities and relations.
Directories containing all the files relevant to the corpora, such as the train and test files, saved results and/or saved models:
- Wikidata/: The Wikidata corpus, please refer to https://github.com/ansonb/RECON for the train and test files.
- NYT/: The New York Times corpus by Riedel, please refer to https://github.com/ansonb/RECON for the train and test files we used.
- CONLL04/: The CONLL04 corpus, please refer to https://github.com/lavis-nlp/spert for the train and test files we used.
For each corpus, the preprocess.py script that prepares the data in the format needed by run_experiment.py is located inside the corresponding directory (Wikidata/, NYT/ or CONLL04/). In addition to the train/test files, you will need the pretrained graph embedding file, which you can find here. Make sure to convert it into one or more pickle-serialized Python dictionaries (we split it into 4 due to RAM limitations) whose keys are the entity IDs and whose values are the corresponding embeddings. Once everything is set up, just run:
python preprocess.py PATH/_TO_THE/_FILE_TO_PROCESS.json PATH/_TO_THE/_PRETRAINED_EMBEDDING_FILE.pkl
by specifying which file to process (PATH/_TO_THE/_FILE_TO_PROCESS.json) and where the graph embedding file or files are located (PATH/_TO_THE/_PRETRAINED_EMBEDDING_FILE.pkl). The output is a train.pkl or test.pkl file ready to be fed to the run_experiment.py script.
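The embedding format described above (pickled dictionaries mapping entity IDs to embeddings, optionally split across several files) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `save_embedding_shards` and `load_embedding_shards`, the `embeddings_<i>.pkl` file naming, the Wikidata-style IDs and the use of plain lists as vectors are all hypothetical, and how preprocess.py locates multiple files is not specified here.

```python
import pickle
import tempfile
from pathlib import Path


def save_embedding_shards(embeddings, out_dir, n_shards=4, stem="embeddings"):
    """Split an {entity_id: vector} dict into n_shards pickle files.

    Hypothetical helper: the file naming scheme is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    items = list(embeddings.items())
    paths = []
    for i in range(n_shards):
        # Take every n_shards-th item so the shards have similar sizes.
        shard = dict(items[i::n_shards])
        path = out_dir / f"{stem}_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(shard, f)
        paths.append(path)
    return paths


def load_embedding_shards(paths):
    """Merge several pickled {entity_id: vector} dicts back into one dict."""
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))
    return merged


if __name__ == "__main__":
    # Toy data: made-up entity IDs with 2-dimensional vectors.
    toy = {f"Q{i}": [float(i), float(i) + 0.5] for i in range(10)}
    with tempfile.TemporaryDirectory() as tmp:
        shard_paths = save_embedding_shards(toy, tmp, n_shards=4)
        restored = load_embedding_shards(shard_paths)
    print(len(shard_paths), restored == toy)
```

Splitting by stride keeps each shard small enough to build and load independently, which is the point of sharding when the full embedding table does not fit in RAM.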
To train a model just run the run_experiment.py script:
python run_experiment.py PATH/_TO_THE/_TRAIN_FILE.pkl PATH/_TO_THE/_TEST_FILE.pkl
by specifying the location of the train and test files generated as explained above. You can specify various training options by editing the configuration file conf.json:
{
"n_epochs": number of epochs to train (int),
"n_exp": number of training runs (int),
"batchsize": dimension of each batch (int),
"lr": learning rate (float),
"device": id of the gpu to use (int),
"pretrained_lang_model": name of the pretrained language model (str),
"evaluate": evaluate the performance? (bool),
"negative_rel_class": label of the NA relation class (str)
}
and running:
python run_experiment.py train.pkl test.pkl --conf PATH/_TO_/conf.json
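As a concrete illustration, a conf.json with the keys listed above could be generated like this. All values are placeholders chosen for the example, not the settings used in the paper; in particular, "bert-base-cased" and "NA" are assumptions about what the script accepts, and the helper `write_conf` is hypothetical.

```python
import json

# Placeholder values only; the keys come from the configuration
# description above, the values are illustrative assumptions.
DEFAULT_CONF = {
    "n_epochs": 5,
    "n_exp": 1,
    "batchsize": 16,
    "lr": 2e-5,
    "device": 0,
    "pretrained_lang_model": "bert-base-cased",  # assumed model name
    "evaluate": True,
    "negative_rel_class": "NA",  # assumed label; depends on the corpus
}


def write_conf(path, **overrides):
    """Write a conf.json file, optionally overriding some default values."""
    unknown = set(overrides) - set(DEFAULT_CONF)
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    conf = {**DEFAULT_CONF, **overrides}
    with open(path, "w") as f:
        json.dump(conf, f, indent=4)
    return conf


if __name__ == "__main__":
    write_conf("conf.json", n_epochs=10, lr=1e-5)
```

Note that Python's `True` is serialized as JSON `true`, which matches the boolean expected for "evaluate".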
If evaluate is set to true, you can also specify the JSON file in which to save the results with:
python run_experiment.py train.pkl test.pkl --conf PATH/_TO_/conf.json --res_file results_file_name.json
This will store some information about the training settings and several performance metrics for each run, such as:
- Precision, Recall and F1 for each relation and on average
- The confusion matrix
- The micro Precision-Recall curve
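For readers unfamiliar with these metrics, the following minimal sketch shows how a confusion matrix and per-relation Precision, Recall and F1 can be derived from gold and predicted labels. It is not the code used in this repository: the function names, the relation labels and the choice of a macro (unweighted) average are assumptions made for the example, and the Precision-Recall curve is omitted because it requires prediction scores.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for gold, pred in zip(y_true, y_pred):
        matrix[index[gold]][index[pred]] += 1
    return matrix


def per_class_prf(matrix, labels):
    """Precision, recall and F1 for each label, from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as label, gold differs
        fn = sum(matrix[i]) - tp                 # gold is label, predicted differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores, metric):
    """Unweighted mean of one metric over all labels."""
    return sum(s[metric] for s in scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical relation labels and predictions.
    labels = ["NA", "born_in", "works_for"]
    y_true = ["NA", "born_in", "born_in", "works_for", "NA", "works_for"]
    y_pred = ["NA", "born_in", "NA", "works_for", "NA", "born_in"]
    cm = confusion_matrix(y_true, y_pred, labels)
    scores = per_class_prf(cm, labels)
    print(cm)
    print(scores)
    print(macro_average(scores, "f1"))
```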