The objective of this project is to develop, train and test a CNN-RNN model that automatically generates a caption for a given image, as shown in the example image below.
The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset commonly used to train and benchmark object detection, segmentation, and captioning algorithms. This dataset of image-caption pairs (obtained using the COCO API) is used in this project to train the CNN-RNN model to automatically generate captions from images.
The Encoder uses the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector and passed through a Linear layer, which transforms the feature vector to the same size as the word embedding.
The Decoder consists of an embedding layer that maps caption words to embedding vectors, an LSTM layer that takes the image feature vector and the embedded caption as input, and a fully-connected output layer that produces a score for each word in the vocabulary at every step.
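A decoder with these three layers might look like the sketch below. The parameter names (`embed_size`, `hidden_size`, `vocab_size`, `num_layers`) are assumptions. So is the input layout: the image feature is fed as the first LSTM input and the last caption token is dropped, a common arrangement that the project may or may not follow exactly.

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    """Embedding -> LSTM -> fully-connected layer over the vocabulary."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # features: (batch, embed_size), captions: (batch, caption_len) of word indices.
        # Drop the last token so the output length matches the caption length
        # once the image feature is prepended.
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hidden, _ = self.lstm(inputs)   # (batch, caption_len, hidden_size)
        return self.fc(hidden)          # (batch, caption_len, vocab_size)
```

The output holds one score vector per caption position, so it can be compared with the target captions using a cross-entropy loss during training.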
The complete model combines the pre-trained ResNet-50 EncoderCNN and the LSTM DecoderRNN to automatically generate image captions.
The project is split into four Python notebooks:
Notebook 0 : Dataset - Explore the MS COCO dataset using the COCO API
Notebook 1 : Preliminaries - Explore the DataLoader, Obtain Batches, Experiment with the CNN Encoder and Implement the RNN Decoder
Notebook 2 : Training - Set Up Training Process, Define & Tune Hyperparameters, Save Trained Models
Notebook 3 : Inference - Get Data Loader for Test Dataset, Define Decoder Sampler, Use Trained Model to Generate Captions for Images in the Test Dataset
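The decoder sampler mentioned for Notebook 3 can be illustrated with a greedy decoding loop. This is a hedged sketch rather than the project's actual sampler: the function name `greedy_sample`, the `max_len` limit and the `end_idx` end-of-caption index are hypothetical, and the three layers are passed in explicitly to keep the example self-contained.

```python
import torch
import torch.nn as nn


def greedy_sample(embed, lstm, fc, features, max_len=20, end_idx=1):
    """Generate word indices for one image by picking the top-scoring word each step.

    features: tensor of shape (1, embed_size) produced by the encoder.
    Returns a list of integer word indices.
    """
    token_ids = []
    inputs = features.unsqueeze(1)   # (1, 1, embed_size): image feature is the first input
    states = None                    # the LSTM starts from zero hidden and cell states
    with torch.no_grad():
        for _ in range(max_len):
            hidden, states = lstm(inputs, states)   # (1, 1, hidden_size)
            scores = fc(hidden.squeeze(1))          # (1, vocab_size)
            predicted = scores.argmax(dim=1)        # (1,) index of the best word
            token_ids.append(predicted.item())
            if predicted.item() == end_idx:         # stop at the end-of-caption token
                break
            inputs = embed(predicted).unsqueeze(1)  # feed the prediction back in
    return token_ids
```

The returned indices would then be mapped back to words with the vocabulary built from the training captions. Greedy decoding is the simplest choice; beam search is a common alternative.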
The picture above shows sample images from the test dataset with their predicted captions, which are relatively accurate.
The picture above shows sample images from the test dataset with their predicted captions, which are relatively inaccurate.
The notebook documentation, images and starter code are part of the project files provided by Udacity for the Computer Vision Nanodegree.