Breaking Parts of Name in Names without Space

Word breaking has been studied for decades and is one of the most important topics in natural language processing. It is useful for mapping Internet domains to queries, detecting malicious domains, machine translation, speech recognition, correcting writing errors, and more. One kind of noise that names suffer from is the removal of the spaces between their parts. A name without space is a name whose parts (first name, middle name, last name, etc.) have been run together, for example "johnsmith" instead of "john smith". Our goal in this project is to break names without space into their constituent parts. This is useful in a variety of fields, including information retrieval and entity resolution.

Methods based on language models have been used for word breaking in the past, and this study also uses unigram and bigram language models. This research is the first to perform breaking on a name-specific basis. Because the approach is supervised, train and test datasets are required to evaluate the breaking algorithm. The data is extracted from names in numerous data collections that are freely available to researchers on the Internet. Overall, more than 120 million names from different languages were collected. The unique tokens in the dataset and their frequencies are counted to produce the unigrams and bigrams used in this project. The breaking algorithm achieves high accuracy on the test dataset.
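To make the language-model idea concrete, here is a minimal sketch of unigram-based breaking using dynamic programming. It is an illustration under stated assumptions, not the code in name_breaker.py: the function name `break_name`, the `TOY_UNIGRAMS` counts, and the unknown-token penalty are all hypothetical, and the real project also uses bigrams, which this sketch leaves out.

```python
import math

# Hypothetical toy counts for illustration only; the project's real
# frequencies come from its own dataset of collected names.
TOY_UNIGRAMS = {"john": 50, "smith": 40, "johns": 5, "mith": 1}


def break_name(name, counts, max_len=20):
    """Split `name` into the token sequence with the highest unigram score."""
    total = sum(counts.values())

    def log_prob(token):
        if token in counts:
            return math.log(counts[token] / total)
        # Assumed penalty for unseen tokens: it grows with token length so
        # that long unknown strings do not beat known shorter parts.
        return math.log(1.0 / (total * 10 ** len(token)))

    n = len(name)
    # best[i] = (best log score for name[:i], start index of the last token)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + log_prob(name[start:end])
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the back-pointers from the end to recover the parts.
    parts = []
    end = n
    while end > 0:
        start = best[end][1]
        parts.append(name[start:end])
        end = start
    return parts[::-1]


print(break_name("johnsmith", TOY_UNIGRAMS))
```

With these toy counts, the split into "john" and "smith" scores higher than "johns" plus "mith" because both of its parts are far more frequent. A bigram model would additionally score each part given the part before it, which is one plausible way to prefer common first-name and last-name orderings.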


nikiibayat/NameBreaking
