A natural language tokenizer.
This is a C library and command-line tool for segmenting written texts into grapheme clusters, tokens, or sentences. It has specific support for English, French, Italian and German. A generic tokenizer is also available.
The library is available in source form, as an amalgamation. Compile mascara.c together with your source code, and use the interface described in mascara.h.
You'll need a C11 compiler, which means either GCC or Clang on Unix.
A command-line tool mascara is included, plus a set of sentence boundary detection models. To install all of these:

$ make && sudo make install
Although the library itself is BSD-licensed and can thus be used for free in commercial software, sentence boundary detection models are derived from corpora covered by more restrictive licenses. Here are the corpora used for creating each model:
- en_amalg: Brown corpus, excerpts from the Wall Street Journal, BNC 1000 Gold Trees
- fr_sequoia: Sequoia corpus
- de_tiger: Tiger corpus
- it_tut: Turin University Treebank corpus
The examples directory contains concrete usage examples. Compile these files with make, and use them like so:
- Split a text into extended grapheme clusters:
  $ examples/graphemes हृदय
  हृ द य
- Split a sentence into tokens:
  $ examples/tokens "And now, Laertes, what's the news with you?"
  And now , Laertes , what 's the news with you ?
- Split a text into sentences:
  $ examples/sentences "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."
  Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
The library API is fully described in mascara.h.
Before allocating a tokenizer, you must choose whether you want to iterate over tokens or over sentences. Segmentation is slightly different depending on which mode you choose (a usage sketch in C follows the list below):
- When iterating over tokens, periods that immediately follow a word are always separated from it, even if the token is an abbreviation:
  Mr . and Mrs . Smith have two children .
- When iterating over sentences, all but the last period of the sentence are left attached to the token that precedes them, provided it is a word:
  Mr. and Mrs. Smith have two children .
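To make the choice of mode concrete, here is a minimal usage sketch in C. Treat every identifier in it as a placeholder: mr_alloc, mr_set_text, mr_next, mr_dealloc, MR_TOKEN, MR_SENTENCE, and struct mr_token are assumptions made for illustration, not necessarily what mascara.h declares; consult the header for the real interface.

    #include <stdio.h>
    #include <string.h>
    #include "mascara.h"

    /* Hypothetical sketch only: the names below are placeholders, not
     * necessarily the interface actually declared in mascara.h.
     */
    int main(void)
    {
        const char *text = "Mr. and Mrs. Smith have two children.";

        struct mascara *mr;
        mr_alloc(&mr, "en", MR_TOKEN);        /* MR_SENTENCE would keep "Mr." attached */
        mr_set_text(mr, text, strlen(text));

        struct mr_token token;
        while (mr_next(mr, &token))           /* one token per iteration */
            printf("%.*s\n", (int)token.len, token.str);

        mr_dealloc(mr);
        return 0;
    }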
During tokenization, each token is annotated with a type. This information is sometimes useful for its own sake, but it is intended to be used as a feature for later processing. Existing token types are:
- LATIN. A token principally made of Latin characters:
  Hamlet entr'ouvert willy-nilly AT&T tris(dimethylamino)bromophosphonium
- PREFIX. A token at the beginning of a text segment:
  y' => y'know
  d' => d'entrée de jeu
  qu' => qu'on se le dise
  dell' => dell'altro
- SUFFIX. A token at the end of a text segment:
  'll => he'll
  'd => he'd
  -t-il => pense-t-il
- SYM. A symbol. This doesn't include all Unicode symbols, only the most common ones that need to be recognized for the input text to be tokenized correctly:
  ? !!! + $
- NUM. A numeric token. This includes numbers in decimal, hexadecimal, and exponential notation, phone numbers, and a few other types. Examples:
  1,234,567 80's 12.34 0xdeadbeef 20 000 3e-27
- ABBR. A likely abbreviation, with internal periods:
  Ph.D. a.m. J.-C.
- EMAIL. An email address:
  foo@example.com john@café.be
- URI. A likely URI:
  http://www.example.com?q=fubar www.google.de
- PATH. A path in the file system:
  /usr/bin/fubar ~/home_sweet_home/foo.txt
- UNK. Anything but one of the above. This includes unknown symbols, as well as words not in the Latin script. The longest possible span of unknown characters is systematically selected:
  ☎ मन्त्र
You can check which type is assigned to which token with the command-line tool:
$ echo "And now, Laertes, what's the news with you?" | mascara -f "%s/%t "
And/LATIN now/LATIN ,/SYM Laertes/LATIN ,/SYM what/LATIN 's/SUFFIX the/LATIN
news/LATIN with/LATIN you/LATIN ?/SYM
There are two main approaches for implementing a tokenizer: a) using finite-state automata, and b) using a supervised sequence model. The second solution is much heavier than its alternative and doesn't seem to be worth the extra work for such a light task as tokenization, so I discarded it.
Each tokenizer uses two finite-state machines, written in Ragel. The first one matches the input text from left to right, in the usual way. The second one reads it from right to left, and is used to recognize contractions at the end of a word. Using two separate machines helps to disambiguate the role of single quotes.
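The machines themselves are generated by Ragel and are not reproduced here. As a rough illustration of what the right-to-left pass does, the plain-C fragment below checks a word against a handful of English contraction suffixes by anchoring the comparison at the end of the word; the suffix list is a made-up stand-in, far smaller than what the real machine recognizes.

    #include <stdio.h>
    #include <string.h>

    /* Illustration only: detect a few English contraction suffixes by
     * matching from the right end of a word. The list is a toy; mascara's
     * Ragel-generated machine covers far more cases.
     */
    static size_t suffix_len(const char *word, size_t len)
    {
        static const char *const suffixes[] = {"n't", "'ll", "'ve", "'re", "'s", "'d"};
        for (size_t i = 0; i < sizeof suffixes / sizeof *suffixes; i++) {
            size_t n = strlen(suffixes[i]);
            if (len > n && memcmp(word + len - n, suffixes[i], n) == 0)
                return n;   /* split: word[0 .. len-n) + word[len-n .. len) */
        }
        return 0;           /* no known suffix */
    }

    int main(void)
    {
        const char *word = "what's";
        size_t len = strlen(word);
        size_t n = suffix_len(word, len);
        if (n)
            printf("%.*s + %s\n", (int)(len - n), word, word + len - n);   /* what + 's */
        return 0;
    }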
In my first attempt at the task, I performed a preliminary segmentation of the text on whitespace characters, and then repeatedly attempted to trim tokens (punctuation, prefixes, suffixes) from the left and the right of the delimited text chunks. I abandoned that approach because it cannot deal with tokens that contain internal whitespace, such as numbers in French.
Two sentence boundary detection modules are implemented.
The first one is a simple finite-state machine. It uses a fixed list of rules and abbreviations. It cannot disambiguate the most ambiguous uses of periods (cardinal numbers, abbreviations at the end of a sentence, etc.). It is currently only used as a fallback, when no language-specific model is available.
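As a sketch of the rule-based idea (not the actual machine, whose rules and abbreviation list live in the Ragel source), the fragment below treats a period as a sentence boundary unless the preceding word is in a small abbreviation list; the list here is an invented stand-in.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustration of the fallback idea: a period ends a sentence unless the
     * preceding word is a known abbreviation. The list is a stand-in, not
     * the one shipped with mascara.
     */
    static bool is_known_abbreviation(const char *word)
    {
        static const char *const abbrs[] = {"Mr", "Mrs", "Dr", "Prof", "etc", "Nov"};
        for (size_t i = 0; i < sizeof abbrs / sizeof *abbrs; i++)
            if (strcmp(word, abbrs[i]) == 0)
                return true;
        return false;
    }

    int main(void)
    {
        /* "... director Nov. 29." -- the period after "Nov" is not a boundary */
        puts(is_known_abbreviation("Nov") ? "abbreviation" : "sentence boundary");
        return 0;
    }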
The second one is a finite-state machine that uses a Naive Bayes classifier to disambiguate the role of periods. Only periods that are seemingly not token-internal are examined by the classifier. The remaining ones, as well as question marks, etc., are deemed to be unambiguous, and are always classified as end-of-sentence markers. We use one feature set per language, obtained through semi-automatic optimization with the help of my feature selection tool. Features are extracted from a window of three tokens, centered on the period to classify.
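To show the shape of that decision (not mascara's trained model: the feature names, probabilities, and prior below are invented for illustration), here is a toy Naive Bayes scorer in C. Features extracted from the three-token window are combined in log space, and the period is labelled end-of-sentence if that class scores higher.

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy Naive Bayes decision for an ambiguous period. The feature names
     * and probabilities are invented; a real model learns them from a
     * corpus, using features drawn from the three-token window.
     */
    struct feature_prob {
        const char *name;
        double p_eos;       /* P(feature | end of sentence)     */
        double p_not_eos;   /* P(feature | not end of sentence) */
    };

    static const struct feature_prob table[] = {
        {"prev_token_is_abbreviation", 0.05, 0.60},
        {"next_token_is_capitalized",  0.85, 0.30},
        {"next_token_is_lowercase",    0.05, 0.55},
    };

    static bool period_is_eos(const char *const *features, size_t nr)
    {
        double score_eos = log(0.70);   /* invented prior P(end of sentence)     */
        double score_not = log(0.30);   /* invented prior P(not end of sentence) */

        for (size_t i = 0; i < nr; i++)
            for (size_t j = 0; j < sizeof table / sizeof *table; j++)
                if (strcmp(features[i], table[j].name) == 0) {
                    score_eos += log(table[j].p_eos);
                    score_not += log(table[j].p_not_eos);
                }
        return score_eos >= score_not;
    }

    int main(void)
    {
        /* Period after "etc." followed by a lowercase word, as in
         * "apples, pears, etc. and more": likely not a sentence boundary. */
        const char *features[] = {"prev_token_is_abbreviation", "next_token_is_lowercase"};
        puts(period_is_eos(features, 2) ? "end of sentence" : "not end of sentence");
        return 0;
    }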
Despite the simplicity of this approach, its performance is close to the state of the art. The following tables show the performance of our classifier on four corpora. We use five-fold cross-validation, and only classify ambiguous periods.
- Brown

             accuracy  precision  recall    F1
  bayes         98.83      99.61   99.11   99.35
  baseline      90.83      90.83  100.00   95.19

- Sequoia

             accuracy  precision  recall    F1
  bayes         98.89      98.94   99.92   99.43
  baseline      96.28      96.28  100.00   98.11

- Tiger

             accuracy  precision  recall    F1
  bayes         99.48      99.65   99.79   99.72
  baseline      93.34      93.34  100.00   96.58

- Turin University Treebank

             accuracy  precision  recall    F1
  bayes         98.22      99.53   98.38   98.95
  baseline      85.58      85.58  100.00   92.23
- Grefenstette and Tapanainen (1994), What is a word, What is a sentence? Problems of Tokenization.
- Gillick (2009), Sentence Boundary Detection and the Problem with the U.S.
- Maršík and Bojar (2012), TrTok: A Fast and Trainable Tokenizer for Natural Languages
- Evang et al. (2013), Elephant: Sequence Labeling for Word and Sentence Segmentation