This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum OCR error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest--neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called Generalized Multi-Class Ratchet ( GMR ) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem (e.g. see p. 266 in \cite{Dud00a}) and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm \cite{Gal90a}. Finally we apply these methods to a collection of documents and report on the experimental results.
M. Cannon, M. Fugate, D. Hush, and C. Scovel. Selecting a Restoration Technique to Minimize OCR Error. IEEE Transactions on Neural Networks, v. 14, No 3. pp. 478--490, May 2003. Los Alamos National Laboratory Technical Report LA-UR-01-6860. [ Abstract | PostScript (376 KB) | PDF (414 KB) ]






