nikiibayat/NameBreaking

Breaking Parts of Name in Names without Space

Name Breaking
Word breaking has been studied for decades and is one of the most important topics in natural language processing. It is useful for mapping Internet domains to queries, detecting malicious domains, machine translation, speech recognition, correcting written errors, and more.

One kind of noise that names suffer from is the removal of the spaces between their parts. A name without space is a name in which the spaces separating its parts (first name, middle name, last name, etc.) have been removed. Our goal in this project is to break names without space into their constituent parts. This is useful in a variety of fields, including information retrieval and entity resolution.

Earlier word breaking methods were based on language models, and this study also uses unigram and bigram language models. This research is the first to perform breaking specifically on names. Because the approach is supervised, training and test datasets are required to build and evaluate the breaking algorithm. The data used in this research was extracted from names in the numerous data collections that are freely available to researchers on the Internet. Overall, more than 120 million names from different languages were collected. The unique tokens in the dataset and their frequencies are computed to produce the unigrams and bigrams used in this project. Our breaking algorithm achieves high accuracy on the test dataset.
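As a rough illustration of the idea, the sketch below counts unigrams and bigrams from a handful of space-separated names and then picks the most probable split of a name without space using dynamic programming. It is a minimal sketch under stated assumptions, not the project's actual code: the training names, the function names (`build_counts`, `break_name`), the unknown-token penalty, and the simple back-off from bigram to unigram are all invented for this example.

```python
import math
from collections import Counter
from functools import lru_cache

# Tiny invented training sample (assumption); the real dataset is far larger.
TRAINING_NAMES = [
    "john smith", "john smith", "mary ann", "mary ann lee",
    "ann lee", "john lee", "mary smith",
]
MAX_TOKEN_LEN = 20  # assumed upper bound on the length of one name part


def build_counts(names):
    """Count unique tokens (unigrams) and adjacent token pairs (bigrams)."""
    unigrams, bigrams = Counter(), Counter()
    for full_name in names:
        tokens = full_name.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


UNIGRAMS, BIGRAMS = build_counts(TRAINING_NAMES)
TOTAL = sum(UNIGRAMS.values())


def unigram_logp(token):
    count = UNIGRAMS.get(token)
    if count is None:
        # Assumed penalty: unknown tokens get less likely as they get longer.
        return math.log(1.0 / (TOTAL * 10 ** len(token)))
    return math.log(count / TOTAL)


def bigram_logp(prev, token):
    pair_count = BIGRAMS.get((prev, token))
    if pair_count is not None:
        return math.log(pair_count / UNIGRAMS[prev])
    # Simplistic back-off to the unigram score when the pair was never seen.
    return unigram_logp(token)


def break_name(name):
    """Return the highest-scoring split of a name without space."""
    name = name.lower()

    @lru_cache(maxsize=None)
    def best(start, prev):
        if start == len(name):
            return 0.0, ()
        candidates = []
        last_end = min(len(name), start + MAX_TOKEN_LEN)
        for end in range(start + 1, last_end + 1):
            token = name[start:end]
            score = unigram_logp(token) if prev is None else bigram_logp(prev, token)
            rest_score, rest = best(end, token)
            candidates.append((score + rest_score, (token,) + rest))
        return max(candidates)

    return list(best(0, None)[1])


print(break_name("johnsmith"))
print(break_name("MaryAnnLee"))
```

With these toy counts, a name made of known parts is split at the boundaries seen in training, while a string containing no known part stays in one piece because every extra unknown token adds its own penalty.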
