Track: High School STEM Poster Competition
Abstract
Attention-based image captioning models have improved significantly over earlier approaches, but their image recognition remains unsatisfactory. Early image captioning models combined a CNN as an encoder with an RNN as a decoder, making overall performance sensitive to the choice of each component. In particular, CNN architectures have shown performance differences over time, and the quality of their features directly affects the RNN decoder. In this paper, we experiment with various CNN architectures and analyze how image captioning performance depends on the choice of CNN encoder. We compare seven CNN architectures across different batch sizes on a public benchmark, the MS-COCO dataset. All CNN architectures in this study are pre-trained on the ImageNet dataset. In our experiments, DenseNet (Huang et al. 2017) and InceptionV3 (Szegedy et al. 2016) achieved the best results among the seven architectures after training for 50 epochs on a GPU.
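To make the experimental setup concrete, the sketch below shows one way a pre-trained CNN encoder can be swapped out in an attention-based captioning pipeline. This is a minimal illustration, not the authors' exact code: it assumes a TensorFlow/Keras setup, and the helper name `build_cnn_encoder` and the specific variants (DenseNet121, InceptionV3) are illustrative choices.

```python
# Minimal sketch: swappable ImageNet-pre-trained CNN encoders for
# attention-based image captioning (assumed TensorFlow/Keras setup).
import tensorflow as tf

def build_cnn_encoder(name: str) -> tf.keras.Model:
    """Return a pre-trained CNN truncated to its final convolutional
    feature map, for use as a captioning encoder."""
    if name == "inception_v3":
        base = tf.keras.applications.InceptionV3(
            include_top=False, weights="imagenet")
    elif name == "densenet121":
        base = tf.keras.applications.DenseNet121(
            include_top=False, weights="imagenet")
    else:
        raise ValueError(f"unknown architecture: {name}")
    # Expose the spatial feature map that the attention mechanism
    # attends over (one feature vector per spatial location).
    return tf.keras.Model(base.input, base.output)

encoder = build_cnn_encoder("densenet121")
# For a 224x224 input, DenseNet121 yields a (1, 7, 7, 1024) feature map.
features = encoder(tf.random.uniform((1, 224, 224, 3)))
```

Because only the encoder changes while the attention decoder stays fixed, any difference in caption quality can be attributed to the CNN architecture itself.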
Keywords
Deep Learning, Computer Vision, Image Captioning, CNN, DenseNet