This repository contains an implementation of Neural Machine Translation (NMT) from Afrikaans to English.
The dataset contains English-Afrikaans sentence pairs. `Afr.txt` was obtained from the Tatoeba Project.
`Afrikaans_English.ipynb`: the data was cleaned and split into smaller training and testing datasets. Then `english-afrikaan-both.pkl`, `english-afrikaan-train.pkl`, and `english-afrikaan-test.pkl` were generated for training and testing purposes.
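As an illustration, the split-and-pickle step might look like the following sketch (the 90/10 split ratio and the helper names are assumptions, not the notebook's verbatim code):

```python
# Sketch (assumed): shuffle cleaned sentence pairs, split them into train and
# test sets, and pickle the full, train, and test datasets named above.
from pickle import dump
from random import shuffle

def save_clean_data(sentences, filename):
    # Persist a list of [english, afrikaans] pairs to disk.
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

def split_and_save(pairs, train_fraction=0.9):  # split ratio is an assumption
    shuffle(pairs)
    n_train = int(train_fraction * len(pairs))
    save_clean_data(pairs, 'english-afrikaan-both.pkl')
    save_clean_data(pairs[:n_train], 'english-afrikaan-train.pkl')
    save_clean_data(pairs[n_train:], 'english-afrikaan-test.pkl')
```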
The preprocessing of the data involves the following steps (a code sketch follows the list):
- Removing punctuation marks from the data.
- Removing all non-printable characters.
- Normalizing all Unicode characters to ASCII (e.g. accented Latin characters).
- Converting the text corpus to lower case.
- Shuffling the sentences, since they were previously sorted in increasing order of length.
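A minimal sketch of such a cleaning pipeline, assuming a `clean_pairs` helper over `[english, afrikaans]` pairs (the notebook's exact code may differ):

```python
import re
import string
import unicodedata
from random import shuffle

def clean_pairs(lines):
    # Clean a list of [english, afrikaans] sentence pairs (assumed helper).
    table = str.maketrans('', '', string.punctuation)             # punctuation
    re_print = re.compile('[^%s]' % re.escape(string.printable))  # non-printables
    cleaned = []
    for pair in lines:
        clean_pair = []
        for line in pair:
            # Normalize Unicode characters (e.g. accented Latin letters) to ASCII.
            line = unicodedata.normalize('NFD', line)
            line = line.encode('ascii', 'ignore').decode('ascii')
            tokens = line.split()
            tokens = [w.lower() for w in tokens]             # lower case
            tokens = [w.translate(table) for w in tokens]    # strip punctuation
            tokens = [re_print.sub('', w) for w in tokens]   # drop non-printables
            clean_pair.append(' '.join(tokens))
        cleaned.append(clean_pair)
    shuffle(cleaned)  # undo the length-sorted order of the source file
    return cleaned
```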
The Encoder-Decoder LSTM model is then trained on the preprocessed data. After training, the model is saved as `model.h5` in your directory.
This model uses an Encoder-Decoder LSTM for NMT. In this architecture, the input sequence is encoded by a front-end model called the encoder, then decoded by a back-end model called the decoder.
It is trained with the Adam optimizer, a variant of stochastic gradient descent, to minimize the categorical cross-entropy loss, as illustrated in the sketch below.
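As an illustration, a Keras definition of such a model might look like this (the layer sizes, vocabulary sizes, and sequence lengths are placeholders, not the exact configuration used here):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, RepeatVector, TimeDistributed

def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    # The encoder reads the Afrikaans sequence into a fixed-length vector;
    # the decoder unrolls that vector into the English output sequence.
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))                          # encoder
    model.add(RepeatVector(tar_timesteps))            # feed encoding to each decoder step
    model.add(LSTM(n_units, return_sequences=True))   # decoder
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    # Adam (a variant of stochastic gradient descent) minimizing categorical cross-entropy.
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# Placeholder dimensions for illustration only.
model = define_model(src_vocab=5000, tar_vocab=4000,
                     src_timesteps=10, tar_timesteps=10, n_units=256)
```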
Run `evaluate_model.py` to evaluate the accuracy of the model on both the train and test datasets (see the sketch below):

- It loads the best saved `model.h5` model.
- The model performs well on the training set and generalizes well to the test set.
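A sketch of what the evaluation flow might look like, assuming greedy decoding and a corpus BLEU score against the reference English sentences (the tokenizer and encoded sources come from the preprocessing step; `evaluate_model.py` may measure accuracy differently):

```python
from pickle import load
from numpy import argmax
from tensorflow.keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

def predict_sequence(model, tokenizer, source):
    # Greedy-decode one encoded source row back into a string (assumed helper).
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    id_to_word = {i: w for w, i in tokenizer.word_index.items()}
    return ' '.join(id_to_word[i] for i in integers if i in id_to_word)

def evaluate(model, tokenizer, sources, raw_dataset):
    # Compare model translations against the reference English sentences.
    actual, predicted = [], []
    for i, source in enumerate(sources):
        translation = predict_sequence(model, tokenizer, source.reshape(1, -1))
        actual.append([raw_dataset[i][0].split()])   # reference English tokens
        predicted.append(translation.split())
    print('BLEU-4: %f' % corpus_bleu(actual, predicted))

model = load_model('model.h5')  # the best saved model
# Hypothetical names for the encoded inputs produced during preprocessing:
# evaluate(model, eng_tokenizer, encoded_train_sources, train_pairs)
# evaluate(model, eng_tokenizer, encoded_test_sources, test_pairs)
```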
The report on this project can be found here.
This work builds extensively on the following works:
- G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation With Monolingual Data Only, 2018a. (https://arxiv.org/abs/1711.00043)
Thanks to Tatoeba Project for the dataset.
See the LICENSE file for more details.