Training a network in the standard way on a mixed dataset, i.e., a dataset containing both masked and non-masked images, causes the model to perform poorly on one of the two subsets, either the masked or the non-masked images. This problem can be tackled using contrastive learning. The contrastive learning paradigm consists of two steps. First, the distance between the masked and non-masked feature embeddings generated by the model is minimized. Second, expression recognition is learnt. Two versions of the same image are given as input to the network: the original image and a synthetically masked version of it.
A shared network produces two feature embeddings, one for each image. Mean squared error (MSE) loss is used to measure the distance between these feature embeddings. To train the network for FER, the standard cross-entropy loss is used.
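As a rough illustration of the two loss terms described above, the following minimal sketch computes an MSE distance between a pair of embeddings and a cross-entropy loss on classifier logits. It uses plain Python rather than a deep learning framework. The function names (`mse_loss`, `cross_entropy_loss`), the embedding and logit values, and the number of classes are all hypothetical and chosen only for illustration. How the two terms are weighted or scheduled during training is not specified here, so the sketch keeps them separate.

```python
import math


def mse_loss(emb_a, emb_b):
    """Mean squared error between two equal-length feature embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)) / len(emb_a)


def cross_entropy_loss(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical embeddings that a shared network might produce for the
# original image and for its synthetically masked version.
emb_original = [0.2, -1.0, 0.5, 0.3]
emb_masked = [0.1, -0.8, 0.7, 0.3]

# Step 1: distance between the masked and non-masked embeddings.
alignment_loss = mse_loss(emb_original, emb_masked)

# Step 2: expression recognition loss on hypothetical classifier logits
# (three classes are assumed here purely for the example).
logits = [2.0, 0.5, -1.0]
label = 0
fer_loss = cross_entropy_loss(logits, label)

print(f"alignment (MSE) loss: {alignment_loss:.4f}")
print(f"FER (cross-entropy) loss: {fer_loss:.4f}")
```

In a real training loop both embeddings would come from the same network with shared weights, so minimizing the MSE term pulls the masked and non-masked representations of an image toward each other.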
Results