Tokenizer
CSchott edited this page May 9, 2016 · 1 revision
The tokenizer creates a list of single words (tokens) from a given string. To do this, it first splits the string on white space. It then removes unwanted characters such as parentheses (see the full list below). The # and @ characters are handled differently: words marked with one of these are added to the list both with and without the marker character.
Currently the tokenizer uses the following separators (more to follow):
- punctuation: . , : ; ! ?
- parentheses: () [] {} < >
- operators: + / *
- quotation marks: ' "
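The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the names `SEPARATORS`, `MARKERS` and `tokenize` are hypothetical, and the sketch assumes that separator characters are deleted from each whitespace-delimited word rather than used as additional split points.

```python
# Hypothetical names; the separator set is copied from the list above.
SEPARATORS = ".,:;!?()[]{}<>+/*'\""
MARKERS = "#@"


def tokenize(text):
    """Split text on white space, drop separator characters, and add
    words marked with # or @ both with and without the marker."""
    tokens = []
    for chunk in text.split():
        # Assumption: separators are removed from the word, not split on.
        cleaned = "".join(ch for ch in chunk if ch not in SEPARATORS)
        if not cleaned:
            continue  # the chunk consisted only of separator characters
        tokens.append(cleaned)
        # Marked words are added a second time without the leading marker.
        bare = cleaned.lstrip(MARKERS)
        if bare and bare != cleaned:
            tokens.append(bare)
    return tokens


if __name__ == "__main__":
    print(tokenize("Hello, world! (see #python and @user)"))
```

With these assumptions, the example call prints `['Hello', 'world', 'see', '#python', 'python', 'and', '@user', 'user']`. If the real tokenizer splits on separators instead of deleting them, an input such as `a+b` would yield two tokens rather than the single token `ab` produced here.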