Welcome to the LangGPT project! Here, I have fine-tuned a pre-trained English-Hindi translator model using a custom dataset. This endeavor provided insights into the fundamentals of tokenizers, Hugging Face transformers, Hugging Face Hub, and Streamlit for app building. The project has been deployed using Streamlit Cloud and can be accessed via the link below.
Link: LangGPT Streamlit
I utilized a Hugging Face transformer model pre-trained for English-Hindi translations. Below are the benchmarks of the pre-trained model:
| Test Set | BLEU | chr-F |
|---|---|---|
| newsdev2014.eng.hin | 6.9 | 0.296 |
| newstest2014-hien.eng.hin | 9.9 | 0.323 |
| Tatoeba-test.eng.hin | 16.1 | 0.447 |
For more details, visit the model page: Helsinki-NLP/opus-mt-en-hi
The IIT Bombay English-Hindi corpus contains parallel and monolingual Hindi corpora collected from various sources and developed at the Center for Indian Language Technology, IIT Bombay.
The dataset is structured as a list of dictionaries with English and Hindi translations labeled as `en` and `hi`.
Below is the data split provided by Hugging Face and the subset we used for fine-tuning our model.
| Type | Rows Present | Rows Used |
|---|---|---|
| Train | 1.66M | 100k |
| Validation | 520 | 520 |
| Test | 2.51k | 500 |
For more information about the dataset: cfilt/iitb-english-hindi
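As a quick orientation, the snippet below shows how the corpus can be loaded and subsampled with the `datasets` library. The `translation`/`en`/`hi` layout matches the description above; the sampling logic (a simple `select` over the first rows) is illustrative and may differ from the notebook.

```python
from datasets import load_dataset

# Load the IIT Bombay English-Hindi parallel corpus from Hugging Face.
dataset = load_dataset("cfilt/iitb-english-hindi")

# Each example is a dictionary with "en" and "hi" keys under "translation".
print(dataset["train"][0])
# {'translation': {'en': '...', 'hi': '...'}}

# Select the subsets used for fine-tuning (100k train / 500 test).
train_subset = dataset["train"].select(range(100_000))
test_subset = dataset["test"].select(range(500))
```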
- Installed the necessary libraries.
- Loaded the dataset from Hugging Face using the `load_dataset` function.
- Loaded the tokenizer from the model checkpoint.
- Preprocessed the dataset by converting the source and target text to tokens.
- Loaded the pre-trained model from Hugging Face.
- Defined the training parameters:

  ```python
  batch_size = 8
  num_samples = 100000  # Number of samples to select
  learning_rate = 2e-5
  weight_decay = 0.01
  num_epochs = 10
  ```

- Prepared data collators.
- Selected 100k rows from the training split due to resource constraints.
- Defined the optimizer (`AdamWeightDecay`) and compiled the model.
- Trained the model for `num_epochs` and saved it to a folder.
- Loaded the saved model and evaluated the BLEU Score.
- Pushed the tokenizer and model to Hugging Face and built a Streamlit app that uses the model from the Hub. A condensed sketch of this workflow follows the list.
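To make those steps concrete, here is a minimal sketch of the fine-tuning loop, assuming the `Helsinki-NLP/opus-mt-en-hi` checkpoint and the TensorFlow classes from 🤗 Transformers. The hyperparameters mirror the ones listed above, while the preprocessing details (`max_length`, variable names, etc.) are illustrative rather than copied from `model-training.ipynb`:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, TFAutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, AdamWeightDecay)

checkpoint = "Helsinki-NLP/opus-mt-en-hi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

raw = load_dataset("cfilt/iitb-english-hindi")
train_raw = raw["train"].select(range(100_000))  # 100k rows, as in the table above

def preprocess(batch):
    # Tokenize English source and Hindi target sentences.
    inputs = [ex["en"] for ex in batch["translation"]]
    targets = [ex["hi"] for ex in batch["translation"]]
    return tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

train_tok = train_raw.map(preprocess, batched=True,
                          remove_columns=train_raw.column_names)
val_tok = raw["validation"].map(preprocess, batched=True,
                                remove_columns=raw["validation"].column_names)

# Pad dynamically per batch and return TensorFlow tensors.
collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
train_ds = model.prepare_tf_dataset(train_tok, batch_size=8, shuffle=True,
                                    collate_fn=collator)
val_ds = model.prepare_tf_dataset(val_tok, batch_size=8, shuffle=False,
                                  collate_fn=collator)

# Compile with the AdamWeightDecay optimizer and train.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)
model.fit(train_ds, validation_data=val_ds, epochs=10)

model.save_pretrained("model/")
tokenizer.save_pretrained("tokenizer/")
```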
- 🤗 Transformers
- SentencePiece
- TensorFlow
- Datasets
- SacreBLEU
- Hugging Face
- Streamlit
- Streamlit Cloud
The model was evaluated using the BLEU Score, a standard metric for language translation tasks.
| Test Set | BLEU Score |
|---|---|
| cfilt/iitb-english-hindi/test | ~38 |
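For reference, the BLEU score can be computed with SacreBLEU roughly as follows, assuming the model and tokenizer were saved to `model/` and `tokenizer/` as described above; the generation settings here are illustrative, not the exact ones used in the notebook:

```python
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
model = TFAutoModelForSeq2SeqLM.from_pretrained("model/")
sacrebleu = evaluate.load("sacrebleu")

# 500-row test subset, as in the dataset table above.
test_subset = load_dataset("cfilt/iitb-english-hindi", split="test").select(range(500))

def translate(sentences, batch_size=8):
    # Generate Hindi translations for a list of English sentences.
    outputs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], return_tensors="tf",
                          padding=True, truncation=True)
        generated = model.generate(**batch, max_length=128)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs

predictions = translate([ex["translation"]["en"] for ex in test_subset])
references = [[ex["translation"]["hi"]] for ex in test_subset]
print(sacrebleu.compute(predictions=predictions, references=references)["score"])
```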
- Clone the repository:

  ```bash
  git clone https://github.com/Srikar-V675/langgpt-pretrained.git
  ```

- Change directory to the repository:

  ```bash
  cd langgpt-pretrained
  ```

- Create directories for the model and tokenizer:

  ```bash
  mkdir model
  mkdir tokenizer
  ```
- Run each cell from the `model-training.ipynb` Jupyter Notebook and update the location for saving the trained model:

  ```python
  # Old
  model.save_pretrained("langGPT/")
  # New
  model.save_pretrained("model/")
  ```
- [Optional] Log in to Hugging Face Hub:

  Note: Obtain an access token from Hugging Face before running the command.

  ```bash
  huggingface-cli login
  ```
- [Optional] Push the model to Hugging Face Hub by running `push_to_hub.py`:

  ```bash
  python push_to_hub.py
  ```

  Note: Update the `repo_name` variable to your Hugging Face repo name (a rough sketch of `push_to_hub.py` follows this list).

  ```python
  # Mine
  repo_name = "shinigami-srikar/langgpt-pretrained"
  # Yours
  repo_name = "your-username/your-repo-name"
  ```
- Run the Streamlit app:

  Note: Update the `repo_name` in `app.py` to your Hugging Face Hub repo name (a minimal sketch of the app also follows this list).

  ```bash
  streamlit run app.py
  ```
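For orientation, `push_to_hub.py` boils down to something like the sketch below, assuming the model and tokenizer were saved to `model/` and `tokenizer/` as described earlier; the actual script in the repository may differ:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Replace with your own Hugging Face repo name.
repo_name = "your-username/your-repo-name"

# Load the locally saved artifacts and push them to the Hub.
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
model = TFAutoModelForSeq2SeqLM.from_pretrained("model/")

tokenizer.push_to_hub(repo_name)
model.push_to_hub(repo_name)
```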
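Likewise, the Streamlit app essentially loads the fine-tuned checkpoint from the Hub and translates whatever the user types. The sketch below is illustrative rather than the exact contents of `app.py`; only `repo_name` corresponds to the note above:

```python
import streamlit as st
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

repo_name = "shinigami-srikar/langgpt-pretrained"  # replace with your repo name

@st.cache_resource  # cache the model so it is loaded only once per session
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(repo_name)
    model = TFAutoModelForSeq2SeqLM.from_pretrained(repo_name)
    return tokenizer, model

tokenizer, model = load_model()

st.title("LangGPT: English to Hindi Translation")
text = st.text_input("Enter an English sentence:")

if text:
    inputs = tokenizer(text, return_tensors="tf")
    outputs = model.generate(**inputs, max_length=128)
    st.write(tokenizer.decode(outputs[0], skip_special_tokens=True))
```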
Feel free to explore and enhance the project! 🚀