A system for automatically identifying the script used in a handwritten document image is described. The system was developed using a 496-document dataset representing six scripts, eight languages, and 281 writers. Each document was characterized by the mean, standard deviation, and skew of five connected-component features. Linear discriminant analysis was used to classify new documents, and the classifier was tested using writer-sensitive cross-validation. Classification accuracy averaged 88% across the six scripts. The same method, applied within the Roman subcorpus, discriminated English and German documents with 85% accuracy. Pilot results indicate that a variation of the method may be applicable to writer identification.
J. Hochberg, K. Bowers, M. Cannon, and P. Kelly. Handwritten document image analysis at Los Alamos: Script, language, and writer identification. In Proceedings of the 1999 Symposium on Document Image Understanding Technology, pp. 161-165, Annapolis, MD, May 1999. Los Alamos National Laboratory Technical Report LA-UR-99-1679.
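
The abstract's two data-handling steps can be illustrated with a minimal sketch: summarizing per-component measurements into document-level statistics, and splitting documents so that no writer appears in more than one fold. This is not the authors' implementation. The function names (`summarize`, `document_vector`, `writer_folds`) are hypothetical, the abstract does not name the five component features, and the population form of the skew statistic and the fold-balancing rule are assumptions. The linear discriminant classifier itself is omitted.

```python
import math
from collections import defaultdict


def summarize(values):
    """Return (mean, standard deviation, skew) of one component feature."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Population skewness (assumed form); defined as 0.0 for a constant feature.
    if std == 0:
        return mean, std, 0.0
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return mean, std, skew


def document_vector(components):
    """Build one feature vector for a document.

    components: one tuple of measurements per connected component, all of
    the same length. With five measurements per component, the result has
    15 entries (three statistics per feature).
    """
    vector = []
    for feature_values in zip(*components):
        vector.extend(summarize(feature_values))
    return vector


def writer_folds(writers, n_folds):
    """Assign document indices to folds, keeping each writer in a single fold.

    writers: one writer label per document (labels must be sortable).
    """
    by_writer = defaultdict(list)
    for doc_index, writer in enumerate(writers):
        by_writer[writer].append(doc_index)
    folds = [[] for _ in range(n_folds)]
    # Place the most prolific writers first, each into the smallest fold so far.
    for writer in sorted(by_writer, key=lambda w: (-len(by_writer[w]), w)):
        min(folds, key=len).extend(by_writer[writer])
    return folds
```

Holding out whole writers means the classifier is always tested on handwriting from people it has not seen during training, which is presumably the point of the writer-sensitive evaluation described in the abstract.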






