Multilingual-USAS

Lexicons for the Multilingual UCREL Semantic Analysis System



Introduction

The UCREL semantic analysis system (USAS) is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects since 1990.

The USAS framework, initially developed for English, is being extended to other languages. This repository houses the lexicons and tagsets for the non-English versions of the USAS tagger.

For more details about the USAS tagger, see our website: http://ucrel.lancs.ac.uk/usas/. Others collaborating on multilingual lexicon development are listed on this site.

USAS Lexicon Resources Overview

This section details the USAS lexicon resources held in this repository for each language. In addition, a machine-readable JSON file, ./language_resources.json, contains all of the relevant meta data to make these lexicon resources easier to process.

USAS Lexicon Statistics

The table summarising the lexicon statistics can be found in the ./lexicon_statistics.md markdown file. Each row in the table reports the statistics for one lexicon file in this repository, and each language can have multiple lexicon files. A lexicon file is either a single word or a Multi Word Expression (MWE) lexicon file; the format of these lexicon files is explained in more detail in the Lexicon File Format section below.

The table within the ./lexicon_statistics.md markdown file is generated automatically by the following command within the CI GitHub Action:

python lexicon_statistics.py > ./lexicon_statistics.md

USAS Lexicon Meta Data

The ./language_resources.json file is a JSON file that describes, per language, what each lexicon resource file in this repository contains. The JSON file has the following structure:

{
  "Language one BCP 47 code": {
    "resources": [
      {
        "data type": "single",
        "file path": "FILE PATH"
      },
      {
        "data type": "pos",
        "file path": "FILE PATH"
      }
    ],
    "language data": {
      "description": "LANGUAGE NAME",
      "macrolanguage": "Macrolanguage code",
      "script": "ISO 15924 script code"
    }
  },
  "Language two BCP 47 code": {
    "resources": [
      {
        "data type": "mwe",
        "file path": "FILE PATH"
      }
    ],
    "language data": {
      "description": "LANGUAGE NAME",
      "macrolanguage": "Macrolanguage code",
      "script": "ISO 15924 script code"
    }
  },
  ...
}
  • The BCP 47 code of the language. The BCP47 language subtag lookup tool is useful for finding the BCP 47 code for a language.
    • resources - this is a list of resource files that are associated with the given language. There is no limit on the number of resource files associated with a language.
      • data type - can be one of three values:
        1. single - The file path value has to be of the single word lexicon file format as described in the Lexicon File Format section.
        2. mwe - The file path value has to be of the Multi Word Expression lexicon file format as described in the Lexicon File Format section.
        3. pos - The file path value has to be of the POS tagset file format as described in the POS Tagset File Format section.
      • file path - value is always relative to the root of this repository, and the file is in the format specified by the associated data type.
    • language data - this is data associated with the BCP 47 language code. To some degree it is redundant, as it can be looked up through the BCP 47 code, but we keep it in the meta data for easy lookup. All of this data can be found by looking up the BCP 47 language code in the BCP47 language subtag lookup tool.
      • description - The description of the language code.
      • macrolanguage - The macrolanguage tag. If this does not exist, give the primary language tag, which could be the same as the whole BCP 47 code. The macrolanguage tag could be useful in future for grouping languages.
      • script - The ISO 15924 script code of the language code. A BCP 47 code does not always include the script, because the default script for that language is assumed; this field makes the default explicit.

Below is an extract of ./language_resources.json, given as an example of this JSON structure:

{
    "arb": {
        "resources": [
            {
                "data type": "single",
                "file path": "./Arabic/semantic_lexicon_arabic.tsv"
            }
        ],
        "language data": {
            "description": "Standard Arabic",
            "macrolanguage": "ar",
            "script": "Arab"
        }
    },
    "cmn": {
        "resources":[
            {
                "data type": "single", 
                "file path": "./Chinese/semantic_lexicon_chi.tsv"
            }, 
            {
                "data type": "mwe", 
                "file path": "./Chinese/mwe-chi.tsv"
            }, 
            {
                "data type": "pos", 
                "file path": "./Chinese/simplified-pos-tagset-chi.txt"
            }
        ],
        "language data": {
            "description": "Mandarin Chinese",
            "macrolanguage": "zh",
            "script": "Hani"
        }
    },
    ...
}
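
As a rough illustration of how this meta data might be consumed, the sketch below groups resource file paths by data type. The function name resources_by_type is hypothetical and not part of this repository, and the inlined JSON is a cut-down copy of the extract above; in practice you would load ./language_resources.json from the repository root instead.

```python
import json
from typing import Dict, List


def resources_by_type(meta_data: Dict[str, dict], data_type: str) -> Dict[str, List[str]]:
    """Map each BCP 47 code to the file paths of its resources of one data type."""
    found: Dict[str, List[str]] = {}
    for language_code, entry in meta_data.items():
        paths = [resource["file path"]
                 for resource in entry["resources"]
                 if resource["data type"] == data_type]
        if paths:
            found[language_code] = paths
    return found


# Cut-down copy of the extract above, inlined so the sketch runs on its own.
EXAMPLE_META_DATA = json.loads("""
{
    "arb": {
        "resources": [
            {"data type": "single", "file path": "./Arabic/semantic_lexicon_arabic.tsv"}
        ],
        "language data": {"description": "Standard Arabic", "macrolanguage": "ar", "script": "Arab"}
    },
    "cmn": {
        "resources": [
            {"data type": "single", "file path": "./Chinese/semantic_lexicon_chi.tsv"},
            {"data type": "mwe", "file path": "./Chinese/mwe-chi.tsv"}
        ],
        "language data": {"description": "Mandarin Chinese", "macrolanguage": "zh", "script": "Hani"}
    }
}
""")

print(resources_by_type(EXAMPLE_META_DATA, "mwe"))
# {'cmn': ['./Chinese/mwe-chi.tsv']}
```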

Lexicon File Format

All lexicon files are in TSV format. There are two main file formats: the single word lexicon and the Multi Word Expression (MWE) lexicon. The two formats can be distinguished in two ways:

  1. MWE files always have the word mwe in the file name, e.g. the Welsh MWE lexicon file is called mwe-welsh.tsv and can be found at ./Welsh/mwe-welsh.tsv, whereas single word lexicon files never have the word mwe in their file name.
  2. MWE files, unlike single word files, typically contain only 2 headers:
    • mwe_template
    • semantic_tags

Single word lexicon file format

Each line of these lexicons contains, at minimum, a lemma and a list of semantic tags associated with that lemma in rank order, whereby the most likely semantic tag is the first tag in the whitespace-separated list. An example line can be seen below:

Austin	Z1 Z2

In the example above we can see that for the lemma Austin the most likely semantic tag is Z1.

A full list of valid TSV headers and their expected value:

| Header name | Required | Value | Example |
|---|---|---|---|
| lemma | ✔️ | The base/dictionary form of the token. See Manning, Raghavan, and Schütze IR book for more details on lemmatization. | car |
| semantic_tags | ✔️ | A list of semantic/USAS tags separated by whitespace, whereby the most likely semantic tag is the first tag in the list. | Z0 Z3 |
| pos | | Part Of Speech (POS) tag associated with the lemma and token. | Noun |
| token | | The full word/token form of the lemma. | cars |

Example single word lexicon file:

lemma	token	pos	semantic_tags
Austin	Austin	Noun	Z1 Z2
car	cars	Noun	Z0 Z3
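
A minimal sketch of reading this format with Python's standard csv module is shown below. The function name read_single_word_lexicon is hypothetical, and the lexicon is passed in as a string so the example is self-contained; it is not the parsing code used by the scripts in this repository.

```python
import csv
import io
from typing import Dict, List


def read_single_word_lexicon(tsv_text: str) -> List[Dict[str, object]]:
    """Parse single word lexicon text, splitting semantic_tags into a ranked list."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    entries: List[Dict[str, object]] = []
    for row in reader:
        entry: Dict[str, object] = dict(row)
        # Tags are whitespace separated; the first tag is the most likely one.
        entry["semantic_tags"] = row["semantic_tags"].split()
        entries.append(entry)
    return entries


EXAMPLE_LEXICON = (
    "lemma\ttoken\tpos\tsemantic_tags\n"
    "Austin\tAustin\tNoun\tZ1 Z2\n"
    "car\tcars\tNoun\tZ0 Z3\n"
)

entries = read_single_word_lexicon(EXAMPLE_LEXICON)
print(entries[0]["lemma"], entries[0]["semantic_tags"][0])
# Austin Z1
```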

Special cases

Implicit POS

If you are writing a lexicon that contains POS information but come across a token/lemma that can belong to any POS category, i.e. the POS information would not help in disambiguating it, then assign it the POS value *, which represents the wildcard.

Example of a single word lexicon with this implicit POS information:

lemma	token	pos	semantic_tags
Austin	Austin	Noun	Z1 Z2
car	cars	Noun	Z0 Z3
computer	computer	*	Z1

Multi Word Expression (MWE) lexicon file format

Each line of these lexicons contains only a value for the mwe_template and the semantic_tags. The semantic_tags contain the same information as the semantic_tags header of the single word lexicon data. The mwe_template, which is best described in The UCREL Semantic Analysis System paper (see Fig 3, called multiword templates in the paper), is a simplified pattern matching code, like a regular expression, that is used to capture MWEs that have a similar structure. For example, *_* Ocean_N*1 will capture Pacific Ocean, Atlantic Ocean, etc. The templates match not only continuous MWEs but also discontinuous ones, because MWE templates allow other words to be embedded within them. For example, the set phrase turn on may occur as turn it on, turn the light on, turn the TV on, etc. Using the template turn*_* {N*/P*/R*} on_RP we can identify this set phrase in various contexts.

As these examples show, an mwe_template follows the pattern matching structure {token}_{pos} {token}_{pos}, etc. Each token and/or POS can contain a wildcard (*), which matches zero or more additional characters.

A full list of valid TSV headers and their expected value:

| Header name | Required | Value | Example |
|---|---|---|---|
| mwe_template | ✔️ | See the description in the paragraphs above | `*_* Ocean_N*1` |
| semantic_tags | ✔️ | A list of semantic/USAS tags separated by whitespace, whereby the most likely semantic tag is the first tag in the list. | Z2 Z0 |

Example multi word expression lexicon file:

mwe_template	semantic_tags
turn*_* {N*/P*/R*} on_RP	A1 A1.6 W2
*_* Ocean_N*1	Z2 Z0
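
To make the wildcard behaviour described above more concrete, the sketch below turns a continuous mwe_template into a Python regular expression and matches it against text written as token_pos pairs. This is only an illustration under stated assumptions: template_to_regex is a hypothetical helper, the POS tags NP1 and NN1 in the test strings are chosen purely for illustration, the wildcard is assumed not to span whitespace, and curly brace (discontinuous) elements are deliberately not handled. It is not how the USAS tagger itself matches templates.

```python
import re


def template_to_regex(mwe_template: str) -> re.Pattern:
    """Convert a continuous MWE template into a compiled regular expression."""
    parts = []
    for element in mwe_template.split():
        if element.startswith("{"):
            # Discontinuous elements such as {N*/P*/R*} are out of scope here.
            raise ValueError("discontinuous templates are not handled in this sketch")
        # Escape the literal pieces, then let each * stand for zero or more
        # non-whitespace characters.
        parts.append(r"\S*".join(re.escape(piece) for piece in element.split("*")))
    return re.compile(" ".join(parts))


pattern = template_to_regex("*_* Ocean_N*1")
print(bool(pattern.fullmatch("Pacific_NP1 Ocean_NN1")))  # True
print(bool(pattern.fullmatch("Pacific_NP1 Sea_NN1")))    # False
```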

Scripts

Text to TSV

This is used within the CI GitHub action:

  1. Converts all lexicon files (single and MWE) from text file format to TSV. The lexicon files are found through the meta data file (language_resources.json).
  2. Checks that the TSV files are formatted correctly:
    1. The minimum header names exist,
    2. All fields/columns have a header name,
    3. All lines contain the minimum information e.g. no comment lines exist in the middle of the file.

Test All File Formats

The test_all_collections.py script uses the test_collection.py script, described in the Test file format section below, to test all of the single word and Multi Word Expression lexicon files within this repository and ensure that they conform to the file format specified in the Lexicon File Format section. The script takes no arguments, as it uses the ./language_resources.json file, which is explained in the USAS Lexicon Meta Data section.

python test_all_collections.py

Test file format

To test that a lexicon collection conforms to the file format specified in the Lexicon File Format section, you can use the test_collection.py Python script. The script takes two arguments:

  1. The path to the lexicon file you want to check.
  2. Whether the lexicon file is a single word lexicon (single) or a Multi Word Expression lexicon (mwe).

Example for a single word lexicon:

python test_collection.py Welsh/semantic_lexicon_cy.tsv single

Example for a Multi Word Expression lexicon:

python test_collection.py Welsh/mwe-welsh.tsv mwe

The script tests the following:

  1. The minimum header names exist.
  2. All fields/columns have a header name.
  3. All lines contain the minimum information e.g. no comment lines exist in the middle of the file.
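
The three checks listed above could be approximated as in the sketch below. This is not the code of test_collection.py: the function find_format_problems, the REQUIRED_HEADERS table, and the wording of the messages are all assumptions made for illustration, and the lexicon is passed as a string to keep the example self-contained.

```python
import csv
import io
from typing import List

# Required headers per lexicon type, taken from the Lexicon File Format section.
REQUIRED_HEADERS = {
    "single": {"lemma", "semantic_tags"},
    "mwe": {"mwe_template", "semantic_tags"},
}


def find_format_problems(tsv_text: str, lexicon_type: str) -> List[str]:
    """Return a list of human readable problems; an empty list means no problem found."""
    required = REQUIRED_HEADERS[lexicon_type]
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = next(reader, [])
    missing = required - set(headers)
    if missing:
        return [f"missing required headers: {sorted(missing)}"]
    problems: List[str] = []
    if "" in headers:
        problems.append("line 1: a column has an empty header name")
    for line_number, row in enumerate(reader, start=2):
        if len(row) > len(headers):
            problems.append(f"line {line_number}: field without a header name")
        record = dict(zip(headers, row))
        for header in sorted(required):
            if not record.get(header, "").strip():
                problems.append(f"line {line_number}: no value for {header}")
    return problems


print(find_format_problems("lemma\tsemantic_tags\nAustin\tZ1 Z2\n", "single"))
# []
print(find_format_problems("lemma\tsemantic_tags\n# a comment\n", "single"))
# ['line 2: no value for semantic_tags']
```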

Lexicon Statistics

This generates the lexicon statistics table shown in the USAS Lexicon Statistics section. It does not take any arguments, but uses the ./language_resources.json meta data file, which is explained in the USAS Lexicon Meta Data section.

Example:

python lexicon_statistics.py

Remove column

Given a lexicon file path, this script removes the column with the given header name and saves the rest of the lexicon data to a new lexicon file path. The script takes three arguments:

  1. The path to the existing lexicon file.
  2. The path to save the new lexicon file to. This lexicon will be the same as argument 1, but without the column with the given header name.
  3. The header name for the column that will be removed.

Example:

python remove_column.py Malay/semantic_lexicon_ms.tsv Malay/new.tsv pos

Test token is equal to lemma

For single word lexicon files, tests whether the token and lemma values on each row/line are equal. It outputs to stdout a JSON object for each line whose token and lemma values differ. An example of the JSON object is shown below:

{"token": "A.E", "lemma": "A.E.", "row index": 0}

Example:

python test_token_is_equal_to_lemma.py SINGLE_WORD_LEXICON_FILE_PATH
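
The comparison can be sketched in a few lines of Python. The generator mismatched_rows below is hypothetical rather than the script's own code, the example rows and their semantic tags are invented for illustration, and the row index is assumed to count data rows from 0 (excluding the header line).

```python
import csv
import io
import json
from typing import Dict, Iterator


def mismatched_rows(tsv_text: str) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per data row whose token differs from its lemma."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row_index, row in enumerate(reader):
        if row["token"] != row["lemma"]:
            yield {"token": row["token"], "lemma": row["lemma"], "row index": row_index}


# Invented rows, used only to exercise the function.
EXAMPLE_LEXICON = (
    "lemma\ttoken\tsemantic_tags\n"
    "A.E.\tA.E\tZ1\n"
    "car\tcar\tZ0 Z3\n"
)

for mismatch in mismatched_rows(EXAMPLE_LEXICON):
    print(json.dumps(mismatch))
# {"token": "A.E", "lemma": "A.E.", "row index": 0}
```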

Unique column values

Given a header name and a lexicon file path, it outputs to stdout all of the unique values in that header's column of the given lexicon file and how often each occurs.

Examples:

python column_unique_values.py French/semantic_lexicon_fr.tsv pos
# Output:
Unique values for the header pos
Value: prep Count: 60
Value: noun Count: 1633
Value: adv Count: 147
Value: verb Count: 448
Value: adj Count: 264
Value: det Count: 86
Value: pron Count: 56
Value: conj Count: 20
Value: null Count: 2
Value: intj Count: 8

python column_unique_values.py Welsh/mwe-welsh.tsv semantic_tags
# Output:
Unique values for the header semantic_tags
Value: G1.1 Count: 1
Value: Z1 Count: 1
Value: M3/Q1.2 Count: 3
Value: Q2.1 Count: 1
Value: I2.1/T2+ Count: 1
Value: P1/G1.1 G1.2 Count: 2
Value: A9- Count: 2
Value: Z2 Count: 1
Value: Y2 Count: 1
Value: X5.1+ Count: 1
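
Counting the values of one column is a natural fit for collections.Counter. The sketch below is an assumption about how such a count could be written, not the contents of column_unique_values.py; count_column_values is a hypothetical name and the input is the implicit POS example lexicon from earlier in this README.

```python
import csv
import io
from collections import Counter


def count_column_values(tsv_text: str, header_name: str) -> Counter:
    """Count how often each value occurs in the column named header_name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return Counter(row[header_name] for row in reader)


EXAMPLE_LEXICON = (
    "lemma\ttoken\tpos\tsemantic_tags\n"
    "Austin\tAustin\tNoun\tZ1 Z2\n"
    "car\tcars\tNoun\tZ0 Z3\n"
    "computer\tcomputer\t*\tZ1\n"
)

counts = count_column_values(EXAMPLE_LEXICON, "pos")
print("Unique values for the header pos")
for value, count in counts.items():
    print(f"Value: {value} Count: {count}")
```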

Compare header values between two files

Write to stdout the following:

  1. Number of unique values in the column with header_name_1 from lexicon_file_path_1
  2. Number of unique values in the column with header_name_2 from lexicon_file_path_2
  3. Number of unique values in common between the two files.

Example:

python compare_headers_between_lexicons.py Russian/semantic_lexicon_rus.tsv Russian/semantic_lexicon_rus_names.tsv lemma lemma
# Output
Number of unique values in lexicon file 1 17396
Number of unique values in lexicon file 2 7637
Number of unique values in common between the two files:3169
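
The three numbers amount to two set sizes and the size of their intersection. A minimal sketch, assuming hypothetical helpers column_values and compare_columns and two small invented lexicons passed as strings:

```python
import csv
import io
from typing import Set, Tuple


def column_values(tsv_text: str, header_name: str) -> Set[str]:
    """Collect the unique values of one column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return {row[header_name] for row in reader}


def compare_columns(tsv_text_1: str, header_name_1: str,
                    tsv_text_2: str, header_name_2: str) -> Tuple[int, int, int]:
    """Return (unique in file 1, unique in file 2, unique values in common)."""
    values_1 = column_values(tsv_text_1, header_name_1)
    values_2 = column_values(tsv_text_2, header_name_2)
    return len(values_1), len(values_2), len(values_1 & values_2)


# Two invented lexicons that share only the lemma "car".
LEXICON_1 = "lemma\tsemantic_tags\nAustin\tZ1 Z2\ncar\tZ0 Z3\n"
LEXICON_2 = "lemma\tsemantic_tags\ncar\tZ0 Z3\ncomputer\tZ1\n"

print(compare_columns(LEXICON_1, "lemma", LEXICON_2, "lemma"))
# (2, 2, 1)
```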

Mapping column values

Given a column/header name, mapping file, and a lexicon file path it will map the values within the column name in the lexicon file using the mapping file. The resulting lexicon file will then be saved to the given output file path.

Example:

In this example we map the values in the pos column of Finnish/semantic_lexicon_fin.tsv using the mapper specified as a dictionary within pos_mappers/Finnish_pos_mapper.json, and save the lexicon with these mapped values to a new file named Finnish/pos_mapped_semantic_lexicon_fin.tsv. All other columns and data remain the same.

python map_column_values.py Finnish/semantic_lexicon_fin.tsv pos pos_mappers/Finnish_pos_mapper.json Finnish/pos_mapped_semantic_lexicon_fin.tsv
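
The mapping step itself can be pictured as a dictionary lookup applied to one column. The sketch below is illustrative only: map_column is a hypothetical function, the Noun to NOUN mapping is invented, and leaving unmapped values unchanged is an assumption rather than the documented behaviour of map_column_values.py.

```python
import csv
import io
from typing import Dict


def map_column(tsv_text: str, header_name: str, mapper: Dict[str, str]) -> str:
    """Return the lexicon text with the values of one column replaced via mapper."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE, restval="")
    headers = list(reader.fieldnames or [])
    lines = ["\t".join(headers)]
    for row in reader:
        # Assumption: values missing from the mapper are kept as they are.
        row[header_name] = mapper.get(row[header_name], row[header_name])
        lines.append("\t".join(row[name] for name in headers))
    return "\n".join(lines) + "\n"


EXAMPLE_LEXICON = "lemma\tpos\tsemantic_tags\ncar\tNoun\tZ0 Z3\ncomputer\t*\tZ1\n"
mapped = map_column(EXAMPLE_LEXICON, "pos", {"Noun": "NOUN"})
print(mapped, end="")
```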

Convert all tabs to spaces after a given column header name

This script assumes that the semantic tags in the INPUT_FILE have been separated by tabs rather than spaces. It reverses this and writes the space-separated version to a new OUTPUT_FILE. Both files are expected to be in TSV format. This is useful when the USAS tags have been entered tab separated rather than space separated, e.g.:

lemma\tsemantic_tags
test\tZ1\tZ2

After running this script it will be converted to:

lemma\tsemantic_tags
test\tZ1 Z2

To run the script:

python tabs_to_spaces.py INPUT_FILE_NAME OUTPUT_FILE_NAME

Example:

python tabs_to_spaces.py Finnish/semantic_lexicon_fin.tsv Finnish/new_semantic_lexicon_fin.tsv
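
The transformation shown above can be sketched as follows. The function below is a hypothetical re-implementation for illustration, not the script's code; it assumes that semantic_tags is the last column and takes the header name as a parameter, which the command line above does not.

```python
def tabs_to_spaces(tsv_text: str, header_name: str = "semantic_tags") -> str:
    """Join every field from the header_name column onwards with single spaces."""
    lines = tsv_text.splitlines()
    headers = lines[0].split("\t")
    column_index = headers.index(header_name)
    converted = [lines[0]]
    for line in lines[1:]:
        fields = line.split("\t")
        leading, tags = fields[:column_index], fields[column_index:]
        # Everything at or after the tag column is treated as one tag list.
        converted.append("\t".join(leading + [" ".join(tags)]))
    return "\n".join(converted) + "\n"


result = tabs_to_spaces("lemma\tsemantic_tags\ntest\tZ1\tZ2\n")
print(repr(result))
# 'lemma\tsemantic_tags\ntest\tZ1 Z2\n'
```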

Spaces to tabs

This script converts a file whose fields/columns are separated only by spaces into a file that separates them by tabs. For MWE files the optional POS tagset argument is not used.

For single word lexicon files we expect a POS field. If you also provide a JSON formatted POS tagset file, in which the object keys are the valid POS tags of the tagset, the POS field values are checked against that tagset.

Example:

This converts an MWE file that contains only spaces (idioms_utf8.c7) to tab-separated format, writing the output to the file mwe-en.txt.

python spaces_to_tabs.py idioms_utf8.c7 mwe-en.txt mwe

This converts a single word lexicon file that contains only spaces (lexicon_utf8.c7) to tab-separated format, writing the output to the file semantic_lexicon_en.txt, while also ensuring that all POS tags in the POS field conform to the POS tagset defined by the keys of the dictionary object within the ./pos_mappers/c7_to_upos.json file.

python spaces_to_tabs.py lexicon_utf8.c7 semantic_lexicon_en.txt single --pos-tagset-file ./pos_mappers/c7_to_upos.json

Duplicate entries identification and statistics

This script finds duplicate entries within either a single word or an MWE lexicon file and displays how many duplicates there are.

Single word lexicon example:

python duplicate_entires.py English/semantic_lexicon_en.txt single

MWE lexicon example:

python duplicate_entires.py English/mwe-en.txt mwe

If you want to save the stdout data to a TSV file, you can provide the optional output-file argument. In this example we save it to the file duplicate_single_word_lexicon_english_upos.tsv:

python duplicate_entires.py English/semantic_lexicon_en.txt single --output-file duplicate_single_word_lexicon_english_upos.tsv

Extract Unique POS Values from MWE File

This script finds all of the unique POS values within an MWE lexicon file, including POS values that are part of a curly brace discontinuous MWE expression. By default the unique POS values are output to stdout.

If the optional output-file argument is passed, the unique POS values are also saved to the output-file in TSV format. If the other optional argument, pos-mapper-file, is given, the script tries to map the unique POS values using the JSON POS mapper file. It adds a new column called mapped containing the mapped POS values, and leaves this column blank for any value it cannot map.

Example:

python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt
# Output
II22
N*2
NNO*
MC1
IF
UH*
NN1*
CSN
DA1
BCL21
DDQV
...

This example shows how to use the optional output-file argument. In this case the output is printed to stdout as before, but is also saved to the output-file in TSV format:

python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt --output-file unique_pos_values_mwe_en.tsv 

This example shows how to use the optional pos-mapper-file argument:

python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt --output-file unique_pos_values_mwe_en.tsv --pos-mapper-file pos_mappers/c7_to_upos.json
# Output
Unique POS Values:
VBZ     VERB
VVD     VERB
NN132
RRT     ADV
NP*
RT      ADV
V*
...

POS Mapping for MWE files

This maps the POS values within an MWE lexicon file using a given POS mapper. The mapped MWE lexicon file is saved to the given output file.

python mwe_pos_mapping.py English/mwe-en.txt pos_mappers/mwe_c7_to_upos.json English/mapped-mwe-en.txt

Python Requirements

This has been tested with Python >= 3.7. To install the relevant Python requirements:

pip install -r requirements.txt

Citation

In order to reference this further development of the multilingual USAS tagger, please cite our paper at NAACL-HLT 2015, which described our bootstrapping approach:

APA Format:

Piao, S. S., Bianchi, F., Dayrell, C., D'Egidio, A., & Rayson, P. (2015). Development of the multilingual semantic annotation system. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1268-1274).

BibTeX Format:

@inproceedings{piao-etal-2015-development,
    title = "Development of the Multilingual Semantic Annotation System",
    author = "Piao, Scott  and
      Bianchi, Francesca  and
      Dayrell, Carmen  and
      D{'}Egidio, Angela  and
      Rayson, Paul",
    booktitle = "Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = may # "{--}" # jun,
    year = "2015",
    address = "Denver, Colorado",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N15-1137",
    doi = "10.3115/v1/N15-1137",
    pages = "1268--1274",
}

In 2015/16, we extended this initial approach to twelve languages and evaluated the coverage of these lexicons on multilingual corpora. Please cite our LREC-2016 paper:

APA Format:

Piao, S. S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R-M., Knight, D., Křen, M., Löfberg, L., Nawab, R. M. A., Shafi, J., Teh, P. L., & Mudraya, O. (2016). Lexical coverage evaluation of large-scale multilingual semantic lexicons for twelve languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 2614-2619).

BibTeX Format:

@inproceedings{piao-etal-2016-lexical,
    title = "Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages",
    author = {Piao, Scott  and
      Rayson, Paul  and
      Archer, Dawn  and
      Bianchi, Francesca  and
      Dayrell, Carmen  and
      El-Haj, Mahmoud  and
      Jim{\'e}nez, Ricardo-Mar{\'\i}a  and
      Knight, Dawn  and
      K{\v{r}}en, Michal  and
      L{\"o}fberg, Laura  and
      Nawab, Rao Muhammad Adeel  and
      Shafi, Jawad  and
      Teh, Phoey Lee  and
      Mudraya, Olga},
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1416",
    pages = "2614--2619",
}

Contact Information

If you are interested in getting involved in creating lexicons for new languages or updating the existing ones then please get in touch with: Paul Rayson (p.rayson@lancaster.ac.uk) and/or the UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.