Word breaking has been studied for decades and is one of the most important topics in natural language processing. It is useful for mapping Internet domains to queries, detecting malicious domains, machine translation, speech recognition, correcting written errors, and more. One kind of noise that names suffer from is the removal of the spaces between their parts. A name without space is a name in which the spaces between its parts (first name, middle name, last name, and so on) have been removed. Our goal in this project is to break names without space into their constituent parts.
Breaking names without space into their parts is useful in a variety of fields, including information retrieval and entity resolution. Methods based on language models have been used for word breaking in the past, and this study likewise uses unigram and bigram language models. This research is the first to perform breaking on a name-specific basis.
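As an illustration of how unigram and bigram language models can drive breaking, here is a minimal Python sketch. It is not the project's actual implementation: the toy counts, the function names (`log_p_unigram`, `log_p_bigram`, `break_name`), the length-based penalty for unseen parts, and the simple back-off from bigram to unigram are all assumptions made for the example.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real model would be estimated from a large name dataset.
UNIGRAMS = {"john": 50, "paul": 30, "jones": 20}
BIGRAMS = {("john", "paul"): 10, ("paul", "jones"): 8}
TOTAL = sum(UNIGRAMS.values())


def log_p_unigram(word):
    """Log-probability of a single part; unseen parts are penalized by length
    (an assumed smoothing choice, not taken from the project)."""
    count = UNIGRAMS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def log_p_bigram(prev, word):
    """Log-probability of `word` given the previous part, backing off to the
    unigram estimate when the pair was never observed."""
    pair_count = BIGRAMS.get((prev, word))
    if pair_count is not None and prev in UNIGRAMS:
        return math.log(pair_count / UNIGRAMS[prev])
    return log_p_unigram(word)


def break_name(text, max_len=20):
    """Return the most probable split of `text` into parts."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start, prev):
        # Best (score, parts) for text[start:], given the previous part.
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            score, rest = best(end, word)
            candidates.append((log_p_bigram(prev, word) + score, (word,) + rest))
        return max(candidates)

    return list(best(0, "<s>")[1])


print(break_name("JohnPaulJones"))  # ['john', 'paul', 'jones']
```

The memoized recursion scores every possible split exactly once per (position, previous part) pair, so the search stays tractable for name-length strings.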
Because the approach is supervised, training and test datasets are required to evaluate the performance of the breaking algorithm. The data used in this research is extracted from names in numerous data collections that are freely available to researchers on the Internet. Overall, more than 120 million names from different languages have been collected. The unique tokens in the dataset and the frequency of each are computed to produce the unigrams and bigrams used in this project. Our breaking algorithm achieves high accuracy on the test dataset.
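The counting step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `build_counts`, the lowercasing, the whitespace tokenization, and the three sample names are hypothetical and not taken from the project's data or code.

```python
from collections import Counter


def build_counts(names):
    """Count unigrams (name parts) and bigrams (adjacent pairs of parts)."""
    unigrams = Counter()
    bigrams = Counter()
    for name in names:
        parts = name.lower().split()
        unigrams.update(parts)
        bigrams.update(zip(parts, parts[1:]))
    return unigrams, bigrams


# Toy sample for illustration only.
sample = ["John Smith", "Mary Ann Smith", "John Paul Jones"]
unigrams, bigrams = build_counts(sample)
print(unigrams["john"], bigrams[("john", "smith")])  # 2 1
```

Dividing each count by the appropriate total then gives the relative frequencies a unigram or bigram model needs.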
nikiibayat/NameBreaking
Breaking Parts of Name in Names without Space