Track: High School STEM Poster Competition
Abstract
Attention-based image captioning models have improved significantly over earlier approaches, but their image recognition remains unsatisfactory. Early image captioning models combined a CNN as an encoder with an RNN as a decoder, making overall performance sensitive to the choice of each component. In particular, CNN architectures have shown performance differences over time, and the quality of their features directly affects the RNN decoder. In this paper, we experiment with various CNN architectures and analyze how image captioning performance depends on the choice of CNN encoder. We compare seven CNN architectures across different batch sizes on a public benchmark, the MS-COCO dataset. All CNN architectures in this study are pre-trained on the ImageNet dataset. In our experiments, DenseNet (Huang et al. 2017) and InceptionV3 (Szegedy et al. 2016) achieved the best results among the seven architectures after training for 50 epochs on a GPU.
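To make the experimental setup concrete, the sketch below shows one way a pre-trained CNN encoder can be swapped out in an attention-based captioning pipeline. This is a minimal illustration, not the authors' exact code: it assumes a TensorFlow/Keras setup, and the helper name `build_cnn_encoder` and the specific variants (DenseNet121, InceptionV3) are illustrative choices.

```python
# Minimal sketch: swappable ImageNet-pre-trained CNN encoders for
# attention-based image captioning (assumed TensorFlow/Keras setup).
import tensorflow as tf

def build_cnn_encoder(name: str) -> tf.keras.Model:
    """Return a pre-trained CNN truncated to its final convolutional
    feature map, for use as a captioning encoder."""
    if name == "inception_v3":
        base = tf.keras.applications.InceptionV3(
            include_top=False, weights="imagenet")
    elif name == "densenet121":
        base = tf.keras.applications.DenseNet121(
            include_top=False, weights="imagenet")
    else:
        raise ValueError(f"unknown architecture: {name}")
    # Expose the spatial feature map that the attention mechanism
    # attends over (one feature vector per spatial location).
    return tf.keras.Model(base.input, base.output)

encoder = build_cnn_encoder("densenet121")
# For a 224x224 input, DenseNet121 yields a (1, 7, 7, 1024) feature map.
features = encoder(tf.random.uniform((1, 224, 224, 3)))
```

Because only the encoder changes while the attention decoder stays fixed, any difference in caption quality can be attributed to the CNN architecture itself.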
Keywords
Deep Learning, Computer Vision, Image Captioning, CNN, DenseNet