We describe our work on text-image alignment in the context of building a historical document retrieval system. We
aim to align images of words in handwritten lines with their text transcriptions. The images of handwritten
lines are automatically segmented from the scanned pages of historical documents and then manually transcribed.
To train automatic routines to detect words in an image of handwritten text, we need a training set: images of
words with their transcriptions. We present our results on aligning words from the images of handwritten lines with
their corresponding text transcriptions. Alignment based on the longest spaces between portions of handwriting
serves as a baseline. We then show that relative lengths, i.e. proportions of words in their lines, can be used to improve
the alignment results considerably. To take the relative word length into account, we define expressions for
the cost function that has to be minimized when aligning text words with their images. We apply right-to-left
alignment as well as alignment based on exhaustive search. The quality assessment of these alignments shows
correct results for 69% of words from 100 lines, or 90% when partially correct and correct alignments are combined.
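The exhaustive-search alignment described above can be illustrated with a minimal sketch. The abstract does not give the actual cost function, so the cost used here (the sum of absolute differences between each word's share of the characters in the transcription and its share of the line width) is an assumption, as are the names `align_words`, `line_width`, and `gap_positions`.

```python
from itertools import combinations


def align_words(words, line_width, gap_positions):
    """Assign each transcribed word a horizontal span of the line image.

    Hypothetical sketch: tries every way of choosing len(words) - 1 cut
    points from the candidate gaps and keeps the one with the lowest cost.
    """
    total_chars = sum(len(w) for w in words)
    # Relative length of each word in the transcription (assumed proxy).
    expected = [len(w) / total_chars for w in words]

    best_cost, best_cuts = float("inf"), None
    for cuts in combinations(sorted(gap_positions), len(words) - 1):
        bounds = [0, *cuts, line_width]
        # Relative width of each candidate word image in the line.
        observed = [(b - a) / line_width for a, b in zip(bounds, bounds[1:])]
        # Assumed cost: total mismatch between text and image proportions.
        cost = sum(abs(o - e) for o, e in zip(observed, expected))
        if cost < best_cost:
            best_cost, best_cuts = cost, cuts

    if best_cuts is None:
        raise ValueError("not enough candidate gaps for the number of words")

    bounds = [0, *best_cuts, line_width]
    return list(zip(words, zip(bounds, bounds[1:])))


# Example with made-up pixel positions for the gaps in one line.
alignment = align_words(["the", "historical", "page"], 170, [30, 60, 130, 150])
```

Exhaustive search grows combinatorially with the number of candidate gaps, which is presumably one reason a cheaper right-to-left strategy is also considered.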
A method is presented for automatically identifying and removing crossed-out text in off-line handwriting. It classifies connected components by simply comparing two scalar features with thresholds. The performance is quantified based on manually labeled connected components of 250 pages of a forensic dataset. 47% of connected components consisting of crossed-out text can be removed automatically while 99% of the normal text components are preserved. The influence of automatically removing crossed-out text on writer verification and identification is also quantified. This influence is not significant.
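A two-threshold classifier of this kind can be sketched as follows. The abstract names neither the two scalar features nor the threshold values, nor how the two comparisons are combined, so `feature_a`, `feature_b`, the thresholds, and the "both must exceed" rule are all placeholders chosen for illustration.

```python
def is_crossed_out(feature_a, feature_b, threshold_a, threshold_b):
    """Flag a connected component as crossed-out text.

    Assumption: a component is flagged only when both (unnamed) features
    exceed their thresholds; the source does not state the actual rule.
    """
    return feature_a > threshold_a and feature_b > threshold_b


def remove_crossed_out(components, threshold_a, threshold_b):
    """Keep only the components that are not flagged as crossed-out."""
    return [
        c for c in components
        if not is_crossed_out(c["feature_a"], c["feature_b"],
                              threshold_a, threshold_b)
    ]


# Made-up feature values for three connected components.
components = [
    {"id": 1, "feature_a": 0.2, "feature_b": 0.1},
    {"id": 2, "feature_a": 0.9, "feature_b": 0.8},
    {"id": 3, "feature_a": 0.9, "feature_b": 0.1},
]
kept = remove_crossed_out(components, threshold_a=0.5, threshold_b=0.5)
```

Requiring both conditions is a conservative choice that favors preserving normal text, in the spirit of the reported trade-off between removed crossed-out components and preserved normal components.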
Word-spotting techniques are usually based on detailed modeling of target words, followed by a search for the
locations of such a target word in images of handwriting. In this study, the focus is on deciding on the presence
of target words in lines of text, regardless of their horizontal position. Line strips are modeled
with a Bag-of-Glyphs approach based on a self-organized map. This approach uses the presence of fragmented-connected
component shapes (glyphs) in a line strip to characterize this text passage, similar to the Bag-of-Words
approach for 'ASCII'-encoded documents in regular Information Retrieval. Subsequently, a support-vector machine is trained
to detect the presence of a word or word category in an iterative setup which involves an active group of
users. Results are promising for a large proportion of words and depend both on the number of labeled
lines and on shape uniqueness. Particularly useful is the ability to train on abstract content classes such as
proper names, municipalities or word-bigram presence in the line-strip images.
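The Bag-of-Glyphs representation can be illustrated with a minimal sketch. It assumes that each glyph has already been reduced to a numeric feature vector and that the trained self-organized map is available as a plain list of unit vectors (`codebook`); both the feature extraction and the map training are outside this sketch, and the function names are hypothetical.

```python
def nearest_unit(glyph, codebook):
    """Return the index of the codebook unit closest to the glyph vector."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((g - c) ** 2 for g, c in zip(glyph, codebook[i])),
    )


def bag_of_glyphs(glyphs, codebook):
    """Count how often each codebook unit is the best match in a line strip.

    The horizontal order of the glyphs is discarded, as in Bag-of-Words.
    """
    histogram = [0] * len(codebook)
    for glyph in glyphs:
        histogram[nearest_unit(glyph, codebook)] += 1
    return histogram


# Toy example: three map units and four glyph vectors from one line strip.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
glyphs = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
line_vector = bag_of_glyphs(glyphs, codebook)
```

A fixed-length vector like `line_vector` is the kind of input a support-vector machine could be trained on to decide whether a word or word category is present in the line.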
The digital cleaning of dirty and old documents and their binarization into a black/white image can be a tedious process that is usually done by experts. This article presents a method that is easy for the end user, so that untrained persons can now perform a task that previously required an expert. The method uses interactive evolutionary computing to program image processing operations that act on the document image.
KEYWORDS: Databases, Image processing, Data processing, Computing systems, Astrophysics, Data archive systems, Cultural heritage, Machine learning, Pattern recognition, Imaging systems
Building a system that allows searching a very large database of document images requires professionalization of hardware and software, e-science and web access. In astrophysics there is ample experience in dealing with large data sets, owing to an increasing number of measurement instruments. The digitization of historical documents of the Dutch cultural heritage poses a similar problem. This paper discusses the use of a system developed at the Kapteyn Institute of Astrophysics for the processing of large data sets, applied to the problem of creating a very large searchable archive of connected cursive handwritten texts. The system is adapted to the specific needs of processing document images. The work shows that interdisciplinary collaboration can be beneficial in the context of machine learning, data processing and professionalization of image processing and retrieval systems.
Conference Committee Involvement (4)
Document Recognition and Retrieval XVIII
26 January 2011 | San Francisco Airport, California, United States
Document Recognition and Retrieval XVII
20 January 2010 | San Jose, California, United States
Document Recognition and Retrieval XVI
21 January 2009 | San Jose, California, United States
Document Recognition and Retrieval XV
30 January 2008 | San Jose, California, United States