30 March 1995 Evaluation of an automatic markup system
Proceedings Volume 2422, Document Recognition II; (1995) https://doi.org/10.1117/12.205833
Event: IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, 1995, San Jose, CA, United States
Abstract
One predominant application of OCR is the recognition of full-text documents for information retrieval. Modern retrieval systems exploit both the textual content of a document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, because of the redundancy in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is the extent to which OCR devices can provide reliable information for capturing the structure of a document. In this paper, we present a preliminary report on the design and evaluation of a system that automatically marks up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, combines this information with dictionary lookup and content analysis to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. To determine correctness, the output of our markup system is compared with a visual examination of the hardcopy.
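As an illustration of one component the abstract lists, de-hyphenated words recovered with the help of dictionary lookup, the following is a minimal sketch and not Autotag's actual implementation, which the abstract does not describe. The word list `DICTIONARY`, the function names `join_fragments` and `dehyphenate`, and the fallback rule for unknown words are all assumptions made for this example.

```python
# Hypothetical stand-in for a real dictionary; the word list Autotag uses
# is not given in the abstract.
DICTIONARY = {"recognition", "document", "well-known", "markup"}


def join_fragments(head, tail, dictionary):
    """Rejoin a word that was split by a hyphen at the end of a line.

    `head` is the fragment before the line-end hyphen (hyphen removed);
    `tail` is the first token of the next line.
    """
    # Set trailing punctuation aside so "nition," can still match "recognition".
    core = tail.rstrip(".,;:!?)")
    trailing = tail[len(core):]

    joined = head + core
    hyphenated = head + "-" + core

    if joined.lower() in dictionary:
        return joined + trailing        # hyphen was only a line-break artifact
    if hyphenated.lower() in dictionary:
        return hyphenated + trailing    # hyphen belongs to the word itself
    # Assumption for this sketch: unknown words are joined without the hyphen.
    return joined + trailing


def dehyphenate(lines, dictionary):
    """Return the words of `lines` with line-end hyphenation resolved."""
    words = []
    pending = None  # fragment waiting for its continuation on the next line
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if pending is not None:
            tokens[0] = join_fragments(pending, tokens[0], dictionary)
            pending = None
        if len(tokens[-1]) > 1 and tokens[-1].endswith("-"):
            pending = tokens.pop()[:-1]
        words.extend(tokens)
    if pending is not None:
        words.append(pending + "-")  # no continuation found; keep as-is
    return words


ocr_lines = [
    "Optical character recog-",
    "nition of a well-",
    "known document.",
]
print(" ".join(dehyphenate(ocr_lines, DICTIONARY)))
```

In this sketch the dictionary decides whether a line-end hyphen is dropped ("recog-" + "nition") or kept ("well-" + "known"). A real system would presumably also draw on the layout and word-geometry information that the OCR device supplies.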
© (1995) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Kazem Taghva, Allen Condit, and Julie Borsack "Evaluation of an automatic markup system", Proc. SPIE 2422, Document Recognition II, (30 March 1995); https://doi.org/10.1117/12.205833
CITATIONS
Cited by 6 scholarly publications.
KEYWORDS
Optical character recognition
Visualization
Image segmentation
Lanthanum
Analytical research
Associative arrays
Image storage
