Optical Character Recognition for Captchas

OCR (Optical Character Recognition) is the technology that converts images containing printed or handwritten text into machine-readable text. Several machine-learning approaches have been proposed for OCR, including object detection models such as YOLOv5 and DBNet for text localization, and encoder-decoder architectures, often based on CNNs and RNNs, for text recognition. Recent advances integrate Transformers for improved accuracy, and pre-trained CV and NLP models can be leveraged to further enhance performance.

(Link to dataset)

Problem Statement

We were provided with the following problem statement: given an image of a captcha, train a model capable of correctly predicting the characters in that image. To solve this problem, the following data was made available to us (a sketch of loading the labels follows the list):

test_images_mlware.zip: 5,000 images on which we had to generate predictions using our OCR model.
train_images_mlware.zip: the 25,000 images used to train our OCR model.
train-labels_mlware.csv: a .csv file containing the ground-truth text for each image in the training set.
sample_submission_mlware.csv: a sample submission file.
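
A minimal sketch of reading the labels with pandas; the column names used below ("image" and "label") are assumptions, so check the CSV header for the actual names:

```python
import pandas as pd

# Load the training labels. The column names "image" and "label" are
# assumptions -- inspect the CSV header for the actual names.
labels_df = pd.read_csv("train-labels_mlware.csv")

print(labels_df.head())
print(f"{len(labels_df)} training labels loaded")  # expected: 25000
```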

Approach

After exploring several methods, we found TrOCR, an end-to-end text recognition approach that uses pre-trained image and text Transformer models, to be a suitable fit for the task at hand. TrOCR excels at printed, handwritten, and scene text recognition by using the Transformer architecture for both image understanding and text generation, favoring simplicity while achieving strong performance.
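
As a starting point, the pre-trained model can be loaded from HuggingFace as sketched below. The checkpoint name "microsoft/trocr-base-printed" is an assumption; the repository may have fine-tuned a different TrOCR variant.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a pre-trained TrOCR checkpoint (ViT encoder + text Transformer decoder).
# "microsoft/trocr-base-printed" is an assumed choice of checkpoint.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Token settings the decoder needs for training and generation.
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```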

It was important to first fine-tune TrOCR so that it could be leveraged for our downstream captcha OCR task. For this purpose, we used the pre-trained model available on HuggingFace and fine-tuned it with the Seq2Seq trainer. We used the Weights & Biases platform to monitor training progress.
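
A minimal fine-tuning sketch using the HuggingFace Seq2SeqTrainer; the hyperparameters are illustrative rather than the ones actually used, and train_dataset/eval_dataset are built by the Dataset sketch in the next paragraph:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    output_dir="./trocr-captcha",      # assumed output path
    per_device_train_batch_size=8,     # illustrative value
    num_train_epochs=6,
    evaluation_strategy="epoch",
    predict_with_generate=True,        # decode with generate() during eval
    fp16=True,
    report_to="wandb",                 # stream metrics to Weights & Biases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # defined in the Dataset sketch below
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```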

We used an 80-20 split between the training and validation sets, and defined a PyTorch Dataset class for easy processing. First, we fine-tuned the model for only 3 epochs (TrOCR_3), which yielded a CER of 0.02. We then fine-tuned the model for a total of 6 epochs (TrOCR_6), which took 585 minutes (~10 hours) to finish training. TrOCR ended up as the best performer out of all the approaches we implemented, achieving a best CER of 0.00088 on the test image dataset.
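
A minimal sketch of the 80-20 split and the PyTorch Dataset, following the common TrOCR fine-tuning pattern; the column names, image directory, max_target_length, and random seed are assumptions:

```python
import torch
from PIL import Image
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

class CaptchaDataset(Dataset):
    def __init__(self, root_dir, df, processor, max_target_length=8):
        self.root_dir = root_dir   # directory holding the captcha images
        self.df = df               # DataFrame with "image"/"label" columns (assumed names)
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        image = Image.open(f"{self.root_dir}/{self.df['image'][idx]}").convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values

        labels = self.processor.tokenizer(
            self.df["label"][idx],
            padding="max_length",
            max_length=self.max_target_length,
        ).input_ids
        # Replace pad tokens with -100 so the loss ignores them.
        labels = [t if t != self.processor.tokenizer.pad_token_id else -100 for t in labels]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}

# 80-20 train/validation split; random_state is an assumed seed.
train_df, eval_df = train_test_split(labels_df, test_size=0.2, random_state=42)
train_dataset = CaptchaDataset("train_images_mlware", train_df.reset_index(drop=True), processor)
eval_dataset = CaptchaDataset("train_images_mlware", eval_df.reset_index(drop=True), processor)
```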

Results

We achieved a CER of 0.00088 on the private (testing) data. (Kaggle Competition link)
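
For reference, CER (character error rate) is the character-level edit distance between the prediction and the ground truth, normalized by the length of the ground truth. A minimal sketch of computing it with the HuggingFace evaluate library (the repository may compute it differently):

```python
import evaluate

cer_metric = evaluate.load("cer")

# Hypothetical prediction vs. ground truth: one wrong character out of five.
predictions = ["abc1d"]
references = ["abcld"]
print(cer_metric.compute(predictions=predictions, references=references))  # 0.2
```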
