Engineering Science and Technology, an International Journal, vol. 64, 2025 (SCI-Expanded)
In the last decade, the outstanding performance of deep learning has led to a rapid and inevitable rise in automatic image captioning, along with a growing need for large amounts of data. Although well-known, conventional, and publicly available datasets have been proposed for the image captioning task, the lack of ground-truth caption data remains a major challenge in generating accurate image captions. To address this issue, in this paper we introduce a novel image captioning benchmark dataset called Tiny TR-CAP, which consists of 1076 original images and 5380 handwritten captions (five highly diverse captions per image). The captions were translated into English using two web-based language translation APIs and a novel multilingual deep machine translation model, and were then used to evaluate 11 state-of-the-art and prominent deep learning-based models, including CLIPCap, BLIP, BLIP-2, FUSECAP, OFA, PromptCap, Kosmos-2, MiniGPT-4, LLaVA, BakLLaVA, and GIT. In the experimental studies, the captions generated by these models were scored with the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and WMD captioning metrics, and their performance was evaluated comparatively. The performance analysis revealed quite promising captioning results, with the best success rates achieved by the OFA model: 0.7097 BLEU-1, 0.5389 BLEU-2, 0.3940 BLEU-3, 0.2875 BLEU-4, 0.1797 METEOR, 0.4627 ROUGE-L, 0.2938 CIDEr, 0.0626 SPICE, and 0.4605 WMD. To support research in the field of image captioning, the image and caption sets of Tiny TR-CAP will be made publicly available on GitHub (https://github.com/abbasmemis/tiny_TR-CAP) for academic research purposes.
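For readers who wish to reproduce this style of comparison, the sketch below shows how generated captions can be scored against multiple reference captions per image. It is a minimal illustration, assuming the pycocoevalcap package (which provides BLEU, ROUGE-L, and CIDEr; METEOR and SPICE additionally require a Java runtime, and WMD requires pretrained word embeddings, so they are omitted here); the image IDs and captions are hypothetical examples, not drawn from Tiny TR-CAP, and this is not the authors' exact evaluation pipeline.

```python
# Minimal caption-scoring sketch using pycocoevalcap.
# references: image id -> list of ground-truth captions (e.g., the five per image)
# hypotheses: image id -> single-element list with the model-generated caption
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

references = {
    "img_0001": [
        "a brown dog runs across a grassy field",
        "a dog is running on the grass",
        "a small dog playing outdoors",
        "a dog sprinting through a green meadow",
        "a puppy running in a park",
    ],
}
hypotheses = {
    "img_0001": ["a dog running on a green field"],
}

# Each scorer returns (corpus-level score, per-image scores);
# Bleu(4) returns a list of four corpus-level scores (BLEU-1..BLEU-4).
scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Rouge(), ["ROUGE-L"]),
    (Cider(), ["CIDEr"]),
]

for scorer, names in scorers:
    score, _ = scorer.compute_score(references, hypotheses)
    score = score if isinstance(score, list) else [score]
    for name, value in zip(names, score):
        print(f"{name}: {value:.4f}")
```

In practice the reference and hypothesis dictionaries would be built for the full image set, and captions are usually lowercased and tokenized (e.g., with the PTB tokenizer shipped with pycocoevalcap) before scoring.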