Paper
Measuring the impact of character recognition errors on downstream text analysis
28 January 2008
Proceedings Volume 6815, Document Recognition and Retrieval XV; 68150G (2008) https://doi.org/10.1117/12.767131
Event: Electronic Imaging, 2008, San Jose, California, United States
Abstract
Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that consume its output. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Employing a hierarchical methodology based on approximate string matching to classify errors, we isolate and analyze their cascading effects as they travel through the pipeline. We present experimental results based on injecting single errors into a large corpus of test documents to study their varying impacts depending on the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some can have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system.
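The paradigm the abstract describes can be illustrated with a minimal sketch: inject a single character-level error into clean text, run a downstream stage on both versions, and align the two outputs to count how far the damage spreads. The tokenizer and function names below are hypothetical stand-ins (the paper's actual pipeline and error taxonomy are more elaborate); the alignment uses Python's standard `difflib.SequenceMatcher` as a simple form of approximate matching.

```python
import difflib
import re

def tokenize(text):
    # Naive stand-in for a real tokenization stage: words and
    # punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def inject_substitution(text, pos, char):
    # Simulate a single OCR substitution error at character position `pos`.
    return text[:pos] + char + text[pos + 1:]

def token_impact(clean, noisy):
    # Align the clean and noisy token streams and count the tokens
    # touched by the error -- a crude proxy for how far a single
    # character error propagates through this pipeline stage.
    sm = difflib.SequenceMatcher(a=tokenize(clean), b=tokenize(noisy))
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in sm.get_opcodes()
               if op != "equal")

clean = "Dr. Smith arrived. He sat down."
# A substitution inside a word typically stays localized to one token...
noisy_word = inject_substitution(clean, clean.index("sat"), "5")
# ...but corrupting a sentence-final period also defeats sentence
# boundary detection, so the same-sized error can cascade further.
noisy_stop = inject_substitution(clean, clean.index(". He"), ",")
print(token_impact(clean, noisy_word))  # -> 1
print(token_impact(clean, noisy_stop))  # -> 1 token, but the boundary is lost
```

The two injected errors change the same number of tokens, yet only the second one would mislead a sentence boundary detector, which is the kind of stage-dependent, cascading effect the paper sets out to quantify.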
© (2008) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Daniel Lopresti "Measuring the impact of character recognition errors on downstream text analysis", Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150G (28 January 2008); https://doi.org/10.1117/12.767131
CITATIONS
Cited by 8 scholarly publications.
KEYWORDS
Optical character recognition
Error analysis
Data mining
Computer programming
Data modeling
Matrices
Speech recognition