Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder (2019)
Most RNN-based image captioning models receive supervision only on the output words, so that they mimic human captions. The hidden states therefore receive only noisy gradient signals through many steps of back-propagation through time, which leads to less accurate generated captions. We propose a novel framework, Hidden State Guidance (HSG), that matches the hidden states in the caption decoder to those in a teacher decoder trained on the easier task of autoencoding the captions conditioned on the image. During training with the REINFORCE algorithm, the conventional rewards are sentence-based evaluation metrics distributed equally to every generated word, regardless of each word's relevance. HSG instead provides a word-level reward that helps the model learn better hidden representations. Experimental results demonstrate that HSG clearly outperforms various state-of-the-art caption decoders using raw images, detected objects, or scene graph features as inputs.
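The contrast between sentence-level and word-level rewards can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the exact reward formula, so the squared Euclidean distance and the function names below (`sentence_level_rewards`, `word_level_rewards`) are assumptions made purely for illustration, and the hidden states are toy values.

```python
# Illustrative sketch only: the distance function and the reward shaping are
# assumptions, not the formulation used in the paper.

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sentence_level_rewards(sentence_score, num_words):
    """Conventional REINFORCE credit: every word gets the same sentence score."""
    return [sentence_score] * num_words

def word_level_rewards(student_states, teacher_states):
    """Hypothetical per-word reward: a student hidden state closer to the
    teacher's hidden state at the same time step earns a higher reward."""
    return [-squared_distance(s, t)
            for s, t in zip(student_states, teacher_states)]

# Toy hidden states for a three-word caption (one vector per time step).
student = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
teacher = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

uniform = sentence_level_rewards(0.8, len(student))  # same value for each word
per_word = word_level_rewards(student, teacher)      # differs per time step
```

In this toy setup the uniform reward cannot distinguish the three words, while the per-word signal penalizes the third time step most because its hidden state is farthest from the teacher's.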
In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019, December 2019.

Raymond J. Mooney Faculty mooney [at] cs utexas edu
Jialin Wu Ph.D. Student jialinwu [at] utexas edu