KEYWORDS: RGB color model, Sensors, Optical transfer functions, Super resolution, Data modeling, Image processing, Optical engineering, Cameras
Super-resolution (SR) aims to increase the resolution of imagery. Applications include security, medical imaging, and object recognition. We propose a deep learning-based SR system that takes a hexagonally sampled low-resolution (LR) image as input and generates a rectangularly sampled SR image as output. For training and testing, we use a realistic observation model that includes optical degradation from diffraction and sensor degradation from detector integration. Our SR approach first uses nonuniform interpolation to partially upsample the observed hexagonal imagery and convert it to a rectangular grid. We then leverage a state-of-the-art convolutional neural network architecture designed for SR known as the residual channel attention network (RCAN). In particular, we use RCAN to further upsample and restore the imagery to produce the final SR image estimate. We demonstrate that this system is superior to applying RCAN directly to rectangularly sampled LR imagery of equivalent sample density. The theoretical advantages of hexagonal sampling are well known. However, to the best of our knowledge, the practical benefit of hexagonal sampling in light of modern processing techniques such as RCAN SR has not previously been tested. Our SR system demonstrates a notable advantage of hexagonally sampled imagery when a modified RCAN is employed for hexagonal SR.
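As a concrete illustration of the interpolation stage described above, the following is a minimal sketch of nonuniform resampling from a hexagonal lattice onto a rectangular grid using SciPy's griddata. The half-pitch row offset, row spacing, upsample factor, and cubic interpolation method are illustrative assumptions rather than the authors' exact implementation, and the hex_to_rect function name is hypothetical.

import numpy as np
from scipy.interpolate import griddata

def hex_to_rect(samples, hex_pitch=1.0, upsample=2):
    """Resample a hexagonally sampled image onto a rectangular grid.

    samples is a 2-D array of hexagonal samples stored row by row; odd rows
    are assumed to be offset by half a pitch (a common hexagonal layout).
    """
    rows, cols = samples.shape
    # Physical (x, y) coordinates of the hexagonal lattice sites.
    y_idx, x_idx = np.mgrid[0:rows, 0:cols]
    x = x_idx * hex_pitch + (y_idx % 2) * (hex_pitch / 2.0)  # half-pitch offset on odd rows
    y = y_idx * hex_pitch * np.sqrt(3.0) / 2.0               # hexagonal row spacing
    pts = np.column_stack([x.ravel(), y.ravel()])

    # Rectangular target grid, partially upsampled relative to the hex pitch.
    gx = np.linspace(x.min(), x.max(), cols * upsample)
    gy = np.linspace(y.min(), y.max(), int(round(rows * upsample * np.sqrt(3.0) / 2.0)))
    GX, GY = np.meshgrid(gx, gy)

    # Nonuniform (scattered-data) interpolation onto the rectangular grid,
    # with nearest-neighbor fill for any NaNs outside the convex hull.
    rect = griddata(pts, samples.ravel(), (GX, GY), method="cubic")
    nearest = griddata(pts, samples.ravel(), (GX, GY), method="nearest")
    return np.where(np.isnan(rect), nearest, rect)

In the pipeline described above, the rectangularly resampled output of a step like this would then be passed to the RCAN stage for further upsampling and restoration.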
The video captioning problem consists of describing a short video clip with natural language. Existing solutions tend to rely on extracting features from frames or sets of frames with pretrained and fixed Convolutional Neural Networks (CNNs). Traditionally, the CNNs are pretrained on the ImageNet-1K (IN1K) classification task. The features are then fed into a sequence-to-sequence model to produce the text description output. In this paper, we propose using Facebook's ResNeXt Weakly Supervised Learning (WSL) CNNs as fixed feature extractors for video captioning. These CNNs are trained on billion-scale weakly supervised datasets constructed from Instagram image-hashtag pairs and then fine-tuned on IN1K. Whereas previous works use complicated architectures or multimodal features, we demonstrate state-of-the-art performance on the Microsoft Video Description (MSVD) dataset and competitive results on the Microsoft Research-Video to Text (MSR-VTT) dataset using only the frame-level features from the new CNNs and a basic Transformer as a sequence-to-sequence model. Moreover, our results validate that CNNs pretrained with weak supervision can effectively transfer to tasks other than classification. Finally, we present results for a number of IN1K feature extractors and discuss the relationship between IN1K accuracy and video captioning performance. Code will be made available at https://github.com/flauted/OpenNMT-py.
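As a sketch of the fixed feature-extraction stage, the snippet below loads a ResNeXt WSL backbone from torch.hub, strips its IN1K classification head, and returns pooled frame-level features that a downstream Transformer could consume. The preprocessing constants are standard ImageNet values and the frame_features helper is hypothetical; frame sampling and the sequence-to-sequence model itself follow the paper rather than this sketch.

import torch
import torchvision.transforms as T

# Load a weakly supervised ResNeXt released by Facebook via torch.hub and
# drop its classifier so the network returns pooled 2048-D activations.
model = torch.hub.load('facebookresearch/WSL-Images', 'resnext101_32x8d_wsl')
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images sampled from a clip -> (num_frames, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)  # fixed features; the backbone is not fine-tuned for captioning

Because the backbone stays frozen, features of this kind can be precomputed once per clip and cached before training the sequence-to-sequence model.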