strn/spacy-sr

Serbian language word corpus

This directory hosts several annotated corpus files in Serbian, transliterated into Serbian Cyrillic script. Key information about each source is given below.

Source files

List of source files.

SETimes.SRPlus

URL: https://github.com/reldi-data/SETimes.SRPlus/blob/master/set.sr.plus.conllu

Transliterated file:

Note: This resource is an extended and updated version of the original SETimes.SR annotated corpus. The original (base) SETimes.SR annotated corpus is here.

License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (see the bottom of the page at the URL above)

UD_Serbian-SET

URL: https://github.com/UniversalDependencies/UD_Serbian-SET

Transliterated files:

License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Annotated corpus of Serbian language-related news and comments: MetaLangNEWS-COMMENTS-Sr

URL: https://www.clarin.si/repository/xmlui/handle/11356/1372

Transliterated files:

License: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Transliteration

Transliteration was performed with a simple Python script, connlutrans.py, located in this directory. Only sentence text, word forms, and lemmas were transliterated.
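To illustrate what transliterating only sentences, word forms, and lemmas can look like on a CoNLL-U file, here is a minimal sketch. It is not the code of connlutrans.py: the names `to_cyrillic` and `transliterate_line` are hypothetical, and the sketch assumes the sentence text sits in `# text = ` comment lines while the word form and lemma are the second and third tab-separated columns of each token line, as in the CoNLL-U format.

```python
# Hypothetical sketch; the actual connlutrans.py may work differently.

# Latin digraphs that correspond to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Single-letter Latin to Cyrillic mapping (lowercase).
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def to_cyrillic(text):
    """Transliterate Serbian Latin text to Cyrillic, digraphs first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            # Case follows the first letter of the digraph (Nj, NJ -> Њ).
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, foreign letters stay as-is
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


def transliterate_line(line):
    """Transliterate one CoNLL-U line, touching only text, FORM and LEMMA."""
    prefix = "# text = "
    if line.startswith(prefix):
        return prefix + to_cyrillic(line[len(prefix):])
    if not line.strip() or line.startswith("#"):
        return line  # blank lines and other comments are left unchanged
    cols = line.split("\t")
    if len(cols) == 10:
        cols[1] = to_cyrillic(cols[1])  # FORM
        cols[2] = to_cyrillic(cols[2])  # LEMMA
    return "\t".join(cols)
```

A purely character-based mapping like this one merges every `nj`, `lj`, and `dž` sequence, so the rare words where those letters belong to separate sounds (for example across a prefix boundary) would need an exception list.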

Usage

Single file transliteration

connlutrans.py -i <input_file> -o <output_file>

Multiple file transliteration

connlutrans.py -i <input_directory>/* -o <output_directory>

Output files are placed in the output directory, with the infix -cyr inserted right before the extension.
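The naming rule for multiple-file output can be sketched as follows. The function name `output_path` is hypothetical, and the sketch assumes that "extension" means only the final suffix of the file name; the actual script may treat multi-dot names differently.

```python
from pathlib import Path


def output_path(input_file, output_dir):
    """Build the output path: same file name with -cyr before the extension."""
    src = Path(input_file)
    # src.stem is the name without its last suffix; src.suffix is e.g. ".conllu".
    return Path(output_dir) / f"{src.stem}-cyr{src.suffix}"
```

Under that assumption, an input named `set.sr.plus.conllu` written to a directory `out` would become `out/set.sr.plus-cyr.conllu`.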
