Text Normalization Challenge

This project tackles Google's Text Normalization Challenge (English Language) on Kaggle: https://www.kaggle.com/c/text-normalization-challenge-english-language

Introduction and Methodology

We examined how words in each category defined by Google (such as PLAIN and DATE) are translated into their spoken form, and found that each category follows clear translation patterns, though there are of course special cases.

To exploit these patterns, we settled on a two-step process for predicting the spoken form of a word:

We classify the word into a certain category.
We change the word according to some predefined rules.
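
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `classify` function is a toy pattern-based stand-in for the real classifier, and the DIGIT and LETTERS labels together with their rules are assumptions chosen only for the example.

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]


def classify(token: str) -> str:
    """Step 1 (toy stand-in): assign a category from surface patterns."""
    if re.fullmatch(r"[0-9]+", token):
        return "DIGIT"
    if re.fullmatch(r"[A-Z]{2,}", token):
        return "LETTERS"
    return "PLAIN"


def speak_digits(token: str) -> str:
    # Read each digit out individually: "2017" -> "two zero one seven".
    return " ".join(ONES[int(ch)] for ch in token)


def speak_letters(token: str) -> str:
    # Spell an all-caps token letter by letter: "USA" -> "u s a".
    return " ".join(token.lower())


# Step 2: one predefined rule per category (hypothetical rule table).
RULES = {
    "DIGIT": speak_digits,
    "LETTERS": speak_letters,
    "PLAIN": lambda token: token,  # plain words are left unchanged
}


def normalize(token: str) -> str:
    """Classify the token, then apply that category's rule."""
    return RULES[classify(token)](token)
```

Keeping the rules in a table keyed by category means a new category only needs a new entry, and a misclassified word can be traced to either the classifier or the rule.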

These steps became the "Classification" and "Prediction" parts of our algorithm. Classification uses Gradient Boosted Trees trained on a mixed set of manually generated and computer-discovered features. Prediction relies mainly on regex, high-frequency word databases, and pattern recognition.
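
As a rough illustration of two of these ingredients, the sketch below computes a few hand-written features for a word and builds a high-frequency lookup table from (written, spoken) training pairs. The feature names, the `min_count` threshold, and the helper names are all assumptions made for this example; the project's real feature set and databases live in the respective folders.

```python
from collections import Counter, defaultdict


def token_features(token: str) -> dict:
    """Hand-written surface features a tree-based classifier could consume."""
    return {
        "length": len(token),
        "n_digits": sum(ch.isdigit() for ch in token),
        "n_upper": sum(ch.isupper() for ch in token),
        "has_punct": any(not ch.isalnum() and not ch.isspace() for ch in token),
    }


def build_lookup(pairs, min_count=2):
    """Map each written form to its most frequent spoken form.

    Entries seen fewer than `min_count` times are dropped (assumed threshold).
    """
    counts = defaultdict(Counter)
    for before, after in pairs:
        counts[before][after] += 1
    lookup = {}
    for before, afters in counts.items():
        after, n = afters.most_common(1)[0]
        if n >= min_count:
            lookup[before] = after
    return lookup


def predict(token, lookup, fallback):
    """Use the lookup table when possible, otherwise fall back to a rule."""
    return lookup[token] if token in lookup else fallback(token)
```

For example, `build_lookup([("Dr.", "doctor"), ("Dr.", "doctor"), ("Dr.", "drive")])` keeps only the majority translation for "Dr.", and any word missing from the table is handed to the `fallback` rule.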

For more information, see the respective folders (Classification and Prediction).

Results

Our final translation accuracy was 99.31%, which placed us #40 (top 15%) in the competition and earned a Silver Medal from Kaggle.
