Improving scene text image captioning using transformer-based multilevel attention

Swati Srivastava; Himanshu Sharma

doi:10.1117/1.JEI.32.3.033023

Journal of Electronic Imaging, Vol. 32, Issue 3, 033023 (June 2023). https://doi.org/10.1117/1.JEI.32.3.033023

Abstract

Many existing image captioning methods focus only on image objects and their relationships when generating captions, ignoring the text present in an image. Scene text (ST) contains information that is crucial for understanding an image and facilitating reasoning. Existing methods fail to establish strong correlations between optical character recognition (OCR) tokens because they have limited OCR representation power. Further, these methods have not used the positional information of the text efficiently. In this work, we propose an ST-based image captioning model (Trans-MAtt) based on a multilevel attention mechanism and a relation network. We use relation networks to enhance the connections between ST tokens. We employ a multilevel attention method comprising spatial, semantic, and appearance attention modules that precisely define the image. To represent context-enriched ST tokens, we use a combination of appearance, location, FastText, and PHOC features. We predict the ST location in the image, which is further integrated with the generated word embeddings for final caption generation. Experiments on the TextCaps dataset demonstrate the effectiveness of the proposed Trans-MAtt model, where it outperforms the current best model by 3.4% on B-4, 2.9% on METEOR, 3.3% on ROUGE-L, 3.1% on CIDEr-D, and 4.1% on SPICE metric scores. Our experiments on the Flickr30k and MSCOCO datasets also demonstrate the superiority of our proposed model over existing methods.
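As an illustration of one of the token features named in the abstract, the following is a minimal sketch of a simplified PHOC (pyramidal histogram of characters) descriptor for a single OCR token. It is not the authors' implementation: the function name `phoc`, the 36-symbol alphabet, and the pyramid levels 2 to 5 are assumptions chosen for the example, and the abstract does not state which configuration Trans-MAtt uses.

```python
import string

# Assumed alphabet: lowercase letters plus digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(2, 3, 4, 5)):
    """Return a simplified binary PHOC vector for `word`.

    At each pyramid level the word is split into `level` equal regions,
    and each region gets one binary block of len(ALPHABET) entries that
    marks which characters fall inside it.
    """
    chars = [c for c in word.lower() if c in ALPHABET]
    n = len(chars)
    vec = []
    for level in levels:
        for region in range(level):
            block = [0] * len(ALPHABET)
            for i, ch in enumerate(chars):
                # Work in integer units of 1 / (n * level) to avoid
                # floating-point ties: the character spans
                # [i * level, (i + 1) * level] and the region spans
                # [region * n, (region + 1) * n].
                overlap = min((i + 1) * level, (region + 1) * n) - max(i * level, region * n)
                # Assign the character to the region when at least half
                # of its span lies inside the region.
                if 2 * overlap >= level:
                    block[ALPHABET.index(ch)] = 1
            vec.extend(block)
    return vec


vector = phoc("stop")
print(len(vector))
```

With these assumed settings the vector has (2 + 3 + 4 + 5) × 36 entries. Fuller PHOC variants also append blocks for frequent bigrams, which this sketch omits. In a captioning model, a descriptor like this would typically be projected and combined with the appearance, location, and FastText features of the same token.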

Citation

Swati Srivastava and Himanshu Sharma "Improving scene text image captioning using transformer-based multilevel attention," Journal of Electronic Imaging 32(3), 033023 (13 June 2023). https://doi.org/10.1117/1.JEI.32.3.033023

Received: 3 November 2022; Accepted: 30 May 2023; Published: 13 June 2023
