13 June 2023
Improving scene text image captioning using transformer-based multilevel attention
Swati Srivastava, Himanshu Sharma
Abstract

Many existing image captioning methods focus only on image objects and their relationships when generating captions, ignoring the text present in the image. Scene text (ST) contains information that is crucial for understanding an image and for facilitating reasoning. Existing methods fail to establish strong correlations between optical character recognition (OCR) tokens because they have limited OCR representation power. Further, these methods do not use the positional information of the text efficiently. In this work, we propose an ST-based image captioning model (Trans-MAtt) based on a multilevel attention mechanism and a relation network. We use relation networks to enhance the connections between ST tokens. We employ a multilevel attention method comprising spatial, semantic, and appearance attention modules that precisely define the image. To represent context-enriched ST tokens, we use a combination of appearance, location, FastText, and PHOC features. We predict the ST location in the image, which is further integrated with the generated word embeddings for final caption generation. Experiments on the TextCaps dataset demonstrate the effectiveness of the proposed Trans-MAtt model, which outperforms the current best model by 3.4% on B-4, 2.9% on METEOR, 3.3% on ROUGE-L, 3.1% on CIDEr-D, and 4.1% on SPICE. Our experiments on the Flickr30k and MSCOCO datasets also demonstrate the superiority of the proposed model over existing methods.
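
As an illustration only, the combination of appearance, location, FastText, and PHOC features into a single ST token representation could be sketched as below. The abstract does not say how the features are combined, so the project-and-sum fusion, the feature widths (2048, 4, 300, 604), the random weights, and the names `embed_st_tokens` and `layer_norm` are all assumptions made for this sketch and are not taken from Trans-MAtt.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def embed_st_tokens(features, weights):
    """Project every feature type to a shared width, sum them, and normalize.

    features: dict of name -> array of shape (num_tokens, feature_dim)
    weights:  dict of name -> array of shape (feature_dim, d_model)
    This fusion scheme is an assumption, not the authors' stated method.
    """
    fused = sum(features[name] @ weights[name] for name in features)
    return layer_norm(fused)

# Hypothetical feature widths, chosen only for the example.
dims = {"appearance": 2048, "location": 4, "fasttext": 300, "phoc": 604}
num_tokens, d_model = 5, 16

rng = np.random.default_rng(0)
weights = {k: rng.normal(scale=0.02, size=(d, d_model)) for k, d in dims.items()}
features = {k: rng.normal(size=(num_tokens, d)) for k, d in dims.items()}

tokens = embed_st_tokens(features, weights)
print(tokens.shape)
```

Each row of `tokens` is one ST token embedding of width `d_model`; in a trained model the projection weights would be learned rather than sampled at random.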

© 2023 SPIE and IS&T
Swati Srivastava and Himanshu Sharma "Improving scene text image captioning using transformer-based multilevel attention," Journal of Electronic Imaging 32(3), 033023 (13 June 2023). https://doi.org/10.1117/1.JEI.32.3.033023
Received: 3 November 2022; Accepted: 30 May 2023; Published: 13 June 2023
KEYWORDS
Visualization

Semantics

Optical character recognition

Education and training

Information visualization

Data modeling

Transformers
