Handwritten text recognition (HTR) is a challenging task that requires a large amount of diverse training data. One possible approach to this problem is the adoption of CNNs. The key challenge is that a CNN requires geometrically labeled training data, which may increase the cost and time of labeling. To overcome these limitations, we propose a method based on a Generative Adversarial Network (GAN) that transfers handwriting styles onto printed-style images while preserving the Same geometrical Annotation as the Input (SAIGAN). Taking a printed-style image as input, it produces a handwritten-style image with the same text content located in the same positions. Our method operates at the character level and can produce sequences of arbitrary length and any content. Once trained, it can also generate new handwriting styles by simply manipulating latent vectors. The proposed character-style supervision allows our model to surpass the baseline method.
In this paper, we introduce the HoughToRadon Transform layer, a novel layer designed to improve the speed of neural networks that incorporate the Hough Transform to solve semantic image segmentation problems. When placed after a Hough Transform layer, it supplies the 'inner' convolutions with modified feature maps that have new beneficial properties, such as a smaller area of processed images and a parameter space that is linear in angle and shift; these properties are not present in the Hough Transform alone. Furthermore, the HoughToRadon Transform layer allows us to adjust the size of intermediate feature maps using two new parameters, and thus to balance the speed and quality of the resulting neural network. Our experiments on the open MIDV-500 dataset show that this new approach leads to time savings in document segmentation tasks and achieves state-of-the-art 97.7% accuracy, outperforming HoughEncoder, which has a larger computational complexity.
For text line recognition, much attention is paid to augmentation of the training images. Yet the inner structure of the textual information in the images also affects the accuracy of the resulting model. In this paper, we propose an ANN-based method for generating the textual data that is printed over background images of a synthetic training sample. Our method avoids both completely random sequences and dictionary-based ones. As a result, we obtain data that preserves the basic properties of the target language model, such as the balance of vowels and consonants, while avoiding lexicon-based properties, such as the prevalence of specific characters. Moreover, as our method focuses only on high-level features and does not try to generate real words, we can use a small training sample and a lightweight ANN for text generation. To validate our method, we train three ANNs with the same architecture but different training samples. We choose machine-readable zones as the target field because their structure does not correspond to the ordinary lexicon. The results of experiments on three public datasets of identity documents demonstrate the effectiveness of our method and improve on the state-of-the-art results for the target field.
Local patch descriptors are used in many computer vision tasks. Over the past decades, many methods of descriptor extraction have been proposed. In recent years, researchers have started to train descriptors via Convolutional Neural Networks (CNNs), which have shown their advantages in many other computer vision fields. However, the resulting descriptors are usually represented as long real-valued vectors, which leads to high computational complexity and memory usage in real applications that process large amounts of data. To address this problem, binary local descriptors were designed, but they still have a large size. In this paper, we propose a method for creating discrete low-dimensional local descriptors with a lightweight CNN. We show that for small descriptors, quality drops significantly under simple binarization compared to the floating-point versions. Experiments on the HPatches dataset [1] demonstrate that our discretization approach dramatically outperforms naive binarization for compact descriptors.
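To make the binarization-versus-discretization contrast concrete, here is a minimal numpy sketch comparing naive sign binarization with a quantile-based discretization into a few integer levels per dimension; the paper's actual discretization scheme is not reproduced, this only illustrates the general idea:

```python
import numpy as np

def binarize(desc):
    """Naive binarization: 1 bit per dimension, loses fine structure."""
    return (desc > 0).astype(np.uint8)

def discretize(desc, levels=4):
    """Quantile-based discretization: each dimension is mapped to one of
    `levels` integer codes, keeping more of the descriptor's geometry."""
    qs = np.quantile(desc, np.linspace(0, 1, levels + 1)[1:-1], axis=0)
    return np.sum(desc[None, :, :] > qs[:, None, :], axis=0).astype(np.uint8)

descs = np.random.randn(1000, 32).astype(np.float32)       # toy float descriptors
print(binarize(descs).max(), discretize(descs, 4).max())   # 1 vs 3
```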
In this work, we present an auto-clustering method that can be used for pattern recognition tasks and applied to the training of a metric convolutional neural network. The main idea is that the algorithm creates clusters consisting of classes that are similar from the network's point of view. The usage of clusters allows the network to pay more attention to classes that are hard to distinguish. This method improves the generation of pairs during the training process, which is an open problem because the optimal generation of data significantly affects the quality of training. The algorithm works in parallel with the training process and is fully automatic. To evaluate this method, we chose the Korean alphabet with the corresponding PHD08 dataset and compared our auto-clustering with random mining, hard mining, and distance-based mining. The open-source framework Tesseract OCR 4.0.0 was also evaluated as a baseline.
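A minimal sketch of the clustering step, assuming (hypothetically) that classes are grouped by k-means over their mean embeddings taken from the current network state; in the paper the clustering runs in parallel with training:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_classes(embeddings, labels, n_clusters=10, seed=0):
    """Group classes the network currently confuses: compute one mean
    embedding per class and run k-means over the class centroids."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0)
                          for c in classes])
    assignment = KMeans(n_clusters=n_clusters, random_state=seed,
                        n_init=10).fit_predict(centroids)
    return dict(zip(classes.tolist(), assignment.tolist()))

# Training pairs are then sampled preferentially inside a cluster,
# where classes are hardest for the network to tell apart.
emb = np.random.randn(2000, 64).astype(np.float32)   # toy embeddings
lab = np.repeat(np.arange(50), 40)                   # 50 classes
groups = cluster_classes(emb, lab, n_clusters=10)
```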
Image recognition includes problems where special features can be found only in a specific area of an image. This suggests applying different filters to different areas of the input image. Standard convolutional networks offer only fully connected and locally connected layers for this purpose: a fully connected layer discards the position factor for every output, while a locally connected layer stores an enormous number of parameters. We need a layer that can apply different convolution kernels to different areas of an input image without carrying as many parameters as a locally connected layer does for high-resolution images. This is why, in this paper, we introduce a new type of convolutional layer - the block layer - and a way to construct a neural network using block convolutional layers to achieve better performance in the image classification problem. The influence of block layers on the quality of the neural network classifier is shown in this paper, along with a comparison against the LeNet-5 architecture as a baseline. The research was conducted on open datasets: MNIST, CIFAR-10, and Fashion MNIST. The results prove that this layer can increase the accuracy of neural network classifiers without increasing the number of operations.
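A minimal sketch of a block layer under the assumption that it splits the input into a fixed grid and applies an independent convolution to each cell; the grid and kernel sizes below are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class BlockConv2d(nn.Module):
    """Splits the input into a grid of blocks and applies a separate
    convolution to each block, so different image areas get different filters."""
    def __init__(self, in_ch, out_ch, kernel_size, grid=(2, 2)):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(grid[0] * grid[1])
        ])

    def forward(self, x):
        gh, gw = self.grid
        rows = []
        for i, row in enumerate(x.chunk(gh, dim=2)):        # split by height
            cols = [self.convs[i * gw + j](block)           # one filter set per cell
                    for j, block in enumerate(row.chunk(gw, dim=3))]
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)

# A 2x2 grid of independent 3x3 convolutions on a 28x28 input
layer = BlockConv2d(1, 8, kernel_size=3, grid=(2, 2))
out = layer(torch.randn(4, 1, 28, 28))                      # -> (4, 8, 28, 28)
```

Unlike a locally connected layer, the parameter count grows only with the number of grid cells, not with the image resolution.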
In this work, we consider a pairs generation algorithm based on the distances between elements in a metric space. The proper generation of training data is a pressing issue, and its solution leads to better neural network training. Understanding the properties of the source data, we can select pairs for training in such a way that the network pays more attention to elements that are close in the metric space but belong to different classes. However, a problem arises when these properties are difficult to extract from the data, and a more universal pairs generation method is needed. Our method generates pairs using the results of the network from previous iterations, in parallel with the training process itself. Thus, we do not need to evaluate the properties of elements ourselves, and we can use absolutely any data as training objects. We demonstrate this approach on the example of Korean character recognition and compare it with other commonly used pair generation methods.
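A minimal sketch of one plausible reading of this scheme: negative pairs are sampled with probability inversely related to their embedding distance, using embeddings from the previous iteration (the sampling rule here is an assumption, not the authors' exact algorithm):

```python
import numpy as np

def mine_pairs(embeddings, labels, n_pairs, rng=None):
    """Sample negative pairs with probability inversely related to their
    distance, so close elements of different classes are seen more often."""
    rng = rng or np.random.default_rng()
    n = len(embeddings)
    i = rng.integers(0, n, size=4 * n_pairs)        # candidate pool
    j = rng.integers(0, n, size=4 * n_pairs)
    neg = labels[i] != labels[j]                    # keep different-class pairs
    i, j = i[neg], j[neg]
    d = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    w = 1.0 / (d + 1e-6)                            # closer -> higher weight
    pick = rng.choice(len(i), size=min(n_pairs, len(i)),
                      replace=False, p=w / w.sum())
    return list(zip(i[pick], j[pick]))

# Toy usage: embeddings from the previous iteration for 3 classes
emb = np.random.randn(300, 64).astype(np.float32)
lab = np.repeat(np.arange(3), 100)
pairs = mine_pairs(emb, lab, n_pairs=32)
```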
In this paper, we suggest a lightweight filtering neural network that implements the filtering stage of the Filtered Back-Projection (FBP) algorithm but achieves good reconstruction results not only on ideal data but also on noisy data, where the usual FBP algorithm fails. Thus, our neural network is not only a variation of the ramp filter that is usually used in the FBP algorithm, but also a denoising filter. The architecture was inspired by the idea that the ramp filtering operation can be approximated with sufficient accuracy. The efficiency of our network was demonstrated on synthetic data imitating tomographic projections collected at low exposure. In generating the synthetic data, we took into account the quantum nature of X-ray radiation, the exposure time of a single frame, and the non-linear detector response. FBP reconstruction with our neural network was 13 times faster than the reconstruction network from Learned Primal-Dual Reconstruction, and our reconstruction quality of 0.906 by the SSIM metric is sufficient to identify the most significant objects.
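To illustrate the general idea, here is a minimal sketch replacing the ramp filter with a small trainable 1-D convolutional network applied along the detector axis of the sinogram; the paper's actual architecture is not reproduced:

```python
import torch
import torch.nn as nn

class LearnedSinogramFilter(nn.Module):
    """Trainable substitute for the ramp filter: 1-D convolutions applied
    along the detector axis of the sinogram, shared across all angles."""
    def __init__(self, kernel=9, width=16):
        super().__init__()
        pad = kernel // 2
        self.net = nn.Sequential(
            nn.Conv1d(1, width, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(width, 1, kernel, padding=pad),
        )

    def forward(self, sinogram):            # (angles, detectors)
        x = sinogram.unsqueeze(1)           # treat each angle as a 1-D signal
        return self.net(x).squeeze(1)

# Filter a noisy 180-angle, 256-detector sinogram; backprojection follows as usual
filt = LearnedSinogramFilter()
filtered = filt(torch.randn(180, 256))
```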
In this work, we study the effect of activation functions in a neural network. We consider how activation functions with different properties, and their combinations, affect the final quality of the model. Due to optimization and speed issues with most bounded functions, which are represented by sigmoids, we propose a generalized version of the SoftSign function - the ratio function (rf). Its shape strongly depends on the introduced degree parameter, which in theory leads to a new interesting property: contraction to zero. For evaluation, we chose the image binarization problem: based on the U-Net architecture of the DIBCO-2017 winners, we conducted all experiments replacing only the activation functions. Our research has led to state-of-the-art binarization quality on the DIBCO-2017 test dataset: U-Net with modified activation functions significantly outperforms all existing solutions in all metrics.
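As a sketch, one natural parameterization consistent with the description is rf_d(x) = x / (1 + |x|^d), which reduces to SoftSign at d = 1 and, for d > 1, contracts large inputs back toward zero; the paper's exact formula may differ:

```python
import torch

def ratio_function(x, d=2.0):
    """Generalized SoftSign: rf_d(x) = x / (1 + |x|^d).
    d = 1 gives the usual SoftSign; for d > 1 the output for large
    inputs contracts back toward zero instead of saturating."""
    return x / (1.0 + x.abs().pow(d))

x = torch.linspace(-10, 10, 5)
print(ratio_function(x, d=1.0))  # SoftSign behaviour, saturates at +-1
print(ratio_function(x, d=2.0))  # contracts toward zero for large |x|
```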
Despite significant success in the field of text recognition, complex and unsolved problems still remain. In recent years, the recognition accuracy for the English language has greatly increased, while the recognition of hieroglyphs has received much less attention. Hieroglyph recognition, i.e., recognition of images with Korean, Japanese, or Chinese characters, differs from the traditional text recognition task. This article discusses the main differences between hieroglyph languages and the Latin alphabet in the context of image recognition. A lightweight method for recognizing images of hieroglyphs is proposed and tested on a public dataset of Korean hieroglyph images. Unlike existing solutions, the proposed method is suitable for mobile devices, and its recognition accuracy is better than that of an open-source OCR framework. The presented method of training the embedding network is based on similarities in the recognition data.
In this work, we propose a method for tomographic reconstruction in the case of a limited field of view, when the whole image of the investigated sample does not fit on the detector. The proposed technique is based on an iterative procedure with corrections at each step in both the sinogram space and the reconstruction space. On synthetic and experimental data, we show that the proposed technique improves tomographic reconstruction quality and extends the field of view.
In this paper, we study the recently introduced neural network architecture HoughNet for its ability to accumulate transferable high-level features. The main idea of this neural network is to use convolutional layers separated by Fast Hough Transform layers to enable the analysis of complex non-linear statistics along different lines. We show that different convolutional blocks in this neural network serve essentially different purposes: while initial feature extraction is task-specific, the main part of the network operates on high-level features and does not require re-training in order to be applied to data from a different domain. To prove this statement, we use two sets of images with different origins and demonstrate the presence of transfer learning in the neural network, except for the first layers, which are highly task-specific.
Regularization methods play an important role in training artificial neural networks, improving generalization performance and preventing overfitting. In this paper, we introduce a new regularization method based on the orthogonalization of convolutional layer filters. The proposed method is easy to implement and has plug-and-play compatibility with modern training approaches, without any changes or adaptations on their part. Experiments with the MNIST and CIFAR10 datasets showed that the effectiveness of the suggested method depends on the number of filters in the layer, and the maximum quality gain is achieved for architectures with a small number of parameters, which is important for training fast and lightweight neural networks.
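A minimal sketch of one standard way to realize such a regularizer: flatten each filter into a row of W and penalize the deviation of the Gram matrix from the identity (the paper's exact penalty may differ):

```python
import torch

def orthogonality_penalty(conv_weight):
    """Soft orthogonality loss for a conv layer: flatten each filter into a
    row of W and penalize || W W^T - I ||_F^2."""
    w = conv_weight.flatten(1)                      # (out_ch, in_ch * k * k)
    gram = w @ w.t()
    eye = torch.eye(w.size(0), device=w.device)
    return ((gram - eye) ** 2).sum()

# Added to the task loss with a small weight, e.g.:
# loss = task_loss + 1e-4 * sum(orthogonality_penalty(m.weight)
#                               for m in model.modules()
#                               if isinstance(m, torch.nn.Conv2d))
```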
Character segmentation is one of the crucial problems of modern text line recognition methods. In this paper, we propose a per-character segmentation method based on a lightweight convolutional neural network (CNN) that is suitable for on-premise applications on various mobile devices. The distinctive feature of our method is that it provides the coordinates of the start and end points of each character, rather than the coordinates of the “cut” between two characters. This allows us to efficiently utilize known geometrical properties of glyphs; consequently, the target character images are not flawed by character intersections or wide spaces. We present results measured for text lines with various letter spacing, which illustrate that the proposed method decreases the segmentation error rate for the majority of test datasets.
In this paper, we propose a new method for detecting monospaced fonts in text line images. Although many authors address the more complex problems of text recognition or font recognition, this problem is still challenging when dealing with camera-captured images of identity documents, which usually contain complex backgrounds and various distortions. Such a font characteristic can be useful in document authentication. Our approach is based on a segmentation neural network and the Fourier Transform for detecting “strong” periodic components in the segmentor output. The experimental results show that the combination of a neural network and the Fourier Transform handles the task of monospaced font detection more effectively than the same Fourier analysis applied to the results of an image processing method for segmentation. The main advantage of the neural network is that its output does not directly depend on the background, font, and character characteristics.
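A minimal numpy sketch of the Fourier step, assuming the segmentation output is reduced to a 1-D horizontal profile whose dominant non-DC frequency indicates a fixed character pitch; the threshold is illustrative:

```python
import numpy as np

def has_strong_period(profile, min_ratio=5.0):
    """Detect a 'strong' periodic component in a 1-D character-probability
    profile: compare the highest non-DC FFT peak to the spectrum median."""
    x = profile - profile.mean()                # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    peak = spectrum[1:].max()
    background = np.median(spectrum[1:]) + 1e-9
    return peak / background > min_ratio

# Monospaced text gives a near-regular profile -> one strong peak
t = np.arange(512)
monospaced = 0.5 + 0.5 * np.sign(np.sin(2 * np.pi * t / 16))
print(has_strong_period(monospaced))            # True
print(has_strong_period(np.random.rand(512)))   # almost surely False
```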
The paper presents an algorithm for document image recognition robust to projective distortions. The algorithm is based on a similarity metric learned with a Siamese architecture. The idea of training Siamese networks is to build a function that converts an image into a space where a distance function corresponding to a pre-defined metric approximates the similarity between objects of the initial space. During learning, the loss function tries to minimize the distance between pairs of objects from the same class and maximize it between those from different classes. A convolutional network is used to map the initial space to the target one. This network makes it possible to construct a feature vector in the target space for each class; objects are classified by applying the mapping function and finding the nearest feature vector. The proposed algorithm achieved recognition quality comparable to a classification convolutional network on the open dataset of document images MIDV-500 [1]. Another important advantage of this method is the possibility of one-shot learning, which is also shown in the paper.
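A minimal sketch of such a training objective, assuming the standard contrastive loss of Hadsell et al.; whether the paper uses exactly this loss is not stated in the abstract:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(za, zb, same_class, margin=1.0):
    """Pull same-class pairs together, push different-class pairs
    at least `margin` apart in the learned metric space."""
    d = F.pairwise_distance(za, zb)
    pos = same_class * d.pow(2)                     # same class: shrink distance
    neg = (1 - same_class) * F.relu(margin - d).pow(2)  # different: enforce margin
    return (pos + neg).mean()

# za, zb: embeddings from the two branches; same_class: 1.0 / 0.0 labels
za, zb = torch.randn(8, 32), torch.randn(8, 32)
y = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(za, zb, y)
```

At inference time, classification reduces to mapping the image and taking the nearest class feature vector, which is what enables one-shot learning.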
In this paper, we study real-time augmentation - a method for increasing the variability of the training dataset during the learning process. We consider the most common label-preserving deformations, which can be useful in many practical tasks. Due to the limitations of existing augmentation tools, such as increased learning time or dependence on a specific platform, we developed our own real-time augmentation system. Experiments on the MNIST and SVHN datasets demonstrated the effectiveness of the suggested approach: the quality of the trained models improves, while the learning time remains the same as if augmentation were not used.
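A minimal sketch of the real-time idea: deformations are applied on the fly while batches are assembled, so the stored dataset never grows and augmentation overlaps with training; the transforms below are common label-preserving ones, not necessarily the paper's exact set:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(image, rng):
    """Random label-preserving deformation applied at batch-assembly time."""
    img = rotate(image, angle=rng.uniform(-10, 10), reshape=False, mode='nearest')
    img = shift(img, shift=rng.uniform(-2, 2, size=2), mode='nearest')
    if rng.random() < 0.5:
        img = img * rng.uniform(0.8, 1.2)       # brightness jitter
    return np.clip(img, 0.0, 1.0)

# Inside a data loader: each epoch sees a differently deformed copy,
# while the stored dataset itself is never modified or enlarged.
rng = np.random.default_rng(42)
batch = np.stack([augment(img, rng) for img in np.random.rand(8, 28, 28)])
```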
In this paper, we consider the problem of detecting counterfeit identity documents in images captured with smartphones. As a number of documents contain special fonts, we study the applicability of convolutional neural networks (CNNs) for detecting whether the fonts used conform to the government standards. Here, we use multi-task learning to differentiate samples by both font and character, and compare the resulting classifier with its analogue trained for binary font classification. We train neural networks to estimate the authenticity of the fonts used in machine-readable zones and ID numbers of the Russian national passport and test them on samples of individual characters acquired from 3238 images of the Russian national passport. Our results show that the usage of multi-task learning increases the sensitivity and specificity of the classifier. Moreover, the resulting CNNs demonstrate high generalization ability, as they correctly classify fonts that were not present in the training set. We conclude that the proposed method is sufficient for font authentication and can be used as part of a forgery detection system for images acquired with a smartphone camera.
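A minimal sketch of a multi-task arrangement consistent with the description: a shared trunk with separate font and character heads trained with a summed cross-entropy loss (layer sizes and class counts are placeholders):

```python
import torch
import torch.nn as nn

class FontCharNet(nn.Module):
    """Shared feature trunk with two heads: font authenticity and
    character identity, trained jointly."""
    def __init__(self, n_fonts=2, n_chars=36):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.font_head = nn.Linear(32, n_fonts)
        self.char_head = nn.Linear(32, n_chars)

    def forward(self, x):
        h = self.trunk(x)
        return self.font_head(h), self.char_head(h)

model = FontCharNet()
font_logits, char_logits = model(torch.randn(4, 1, 32, 32))
ce = nn.CrossEntropyLoss()
loss = ce(font_logits, torch.randint(0, 2, (4,))) + \
       ce(char_logits, torch.randint(0, 36, (4,)))
```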
In this paper, we study the combination of a Viola-Jones classifier with a deep convolutional neural network as an approach to the problem of object detection and classification. It is well known that Viola-Jones detectors are fast and accurate at detecting a vast variety of objects. On the other hand, neural-network-based methods demonstrate high accuracy in image classification problems. The main goal of this paper is to study the viability of the Viola-Jones classifier for the image classification problem. The first part of both algorithms is the same: we use a Viola-Jones detector to find the object bounding rectangle in the image. The second part differs: we compare a Viola-Jones classifier with a convolutional neural network classifier, and provide a speed and accuracy comparison between the two algorithms.
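A rough OpenCV sketch of such a two-stage pipeline: a cascade detector proposes a bounding rectangle and a classifier labels the crop; the cascade file and the `classify` callable are placeholders, not the paper's models:

```python
import cv2

# Placeholder cascade: any trained Viola-Jones detector fits here.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_then_classify(image_bgr, classify):
    """Stage 1: Viola-Jones proposes regions. Stage 2: a classifier
    (the `classify` callable, e.g. a CNN) labels each cropped region."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in boxes:
        crop = cv2.resize(image_bgr[y:y + h, x:x + w], (64, 64))
        results.append(((x, y, w, h), classify(crop)))
    return results
```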
This paper addresses one of the fundamental problems of machine learning: training data acquisition. Obtaining enough natural training data is rather difficult and expensive. In recent years, the usage of synthetic images has become more beneficial, as it saves human time and provides a huge number of images that would otherwise be difficult to obtain. However, for successful learning on an artificial dataset, one should try to reduce the gap between the natural and synthetic data distributions. In this paper, we describe an algorithm for creating artificial training datasets for OCR systems, using the Russian passport as a case study.
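As a toy illustration of the general recipe (render text over a background, then distort to narrow the synthetic-to-natural gap), here is a Pillow sketch; the actual passport-specific pipeline described in the paper is far more elaborate:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synth_field(text, size=(200, 40), noise=8.0, seed=0):
    """Render a text field on a paper-like background and add mild
    blur and sensor noise to imitate camera-captured crops."""
    rng = np.random.default_rng(seed)
    bg = Image.fromarray(
        rng.normal(225, 10, size[::-1]).clip(0, 255).astype(np.uint8))
    draw = ImageDraw.Draw(bg)
    draw.text((5, 10), text, fill=0, font=ImageFont.load_default())
    img = bg.filter(ImageFilter.GaussianBlur(radius=0.7))
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0, noise, arr.shape)      # sensor noise
    return arr.clip(0, 255).astype(np.uint8)

sample = synth_field("1234 567890")             # one synthetic training crop
```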
In this paper, we propose a novel method for vanishing point detection based on a convolutional neural network (CNN) and the fast Hough transform algorithm. We show how to define a fast Hough transform neural network layer and how to use it to increase the usability of the neural network approach for the vanishing point detection task. Our algorithm includes a CNN with a sequence of convolutional and fast Hough transform layers. We build an estimator of the distribution of possible vanishing points in the image; this distribution can be used to find vanishing point candidates. We provide experimental results from tests of the suggested method on images collected from videos of road trips. Our approach shows stable results on test images with various projective distortions and noise, and can be effectively implemented on mobile GPUs and CPUs.
In this paper, we propose an expansion of convolutional neural network (CNN) input features based on the Hough Transform. We perform morphological contrasting of the source image followed by the Hough Transform, and then use the result as input for some of the convolutional filters. Thus, the CNN's computational complexity and the number of units are not affected; morphological contrasting and the Hough Transform are the only additional computational expenses of the introduced input features expansion. The proposed approach was demonstrated on a CNN with a very simple structure. We considered two image recognition problems: object classification on CIFAR-10 and printed character recognition on a private dataset of symbols taken from Russian passports. Our approach achieves a noticeable accuracy improvement at little computational cost, which can be extremely important in industrial recognition systems or in difficult problems utilizing CNNs, such as pressure ridge analysis and classification.
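A rough sketch of the preprocessing chain built from standard components: a morphological gradient for contrasting and skimage's Hough line accumulator; the paper's own contrasting operator is not reproduced:

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion
from skimage.transform import hough_line

def hough_features(image, size=3):
    """Morphological contrasting (gradient) followed by the Hough
    transform; the accumulator becomes an extra CNN input channel."""
    contrasted = (grey_dilation(image, size=(size, size))
                  - grey_erosion(image, size=(size, size)))
    accumulator, angles, dists = hough_line(contrasted > contrasted.mean())
    return accumulator.astype(np.float32)

feat = hough_features(np.random.rand(32, 32))   # fed to some input filters
```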
In this paper, we consider the task of finding information fields within a document with a flexible form, using the credit card expiration date field as an example. We discuss the main difficulties and suggest possible solutions. In our case, the task has to be solved on mobile devices; therefore, the computational complexity has to be as low as possible. We provide results of the analysis of the suggested algorithm: the error distribution of the recognition system shows that it solves the task with the required accuracy.