Abstract

This paper explores the use of script identification vectors in the analysis of multilingual document images. A script identification vector is calculated for each connected component in a document. The vector expresses the closest distance between the component and templates developed for each of thirteen scripts, including Arabic, Chinese, Cyrillic, and Roman. We calculate the first three principal components within the resulting thirteen-dimensional space for each image. By mapping these components to red, green, and blue, we can visualize the information contained in the script identification vectors. Our visualization of several multilingual images suggests that the script identification vectors can be used to segment images into script-specific regions as large as several paragraphs or as small as a few characters. The visualized vectors also reveal distinctions within scripts, such as font in Roman documents, and kanji vs. kana in Japanese. Results are best for documents containing highly dissimilar scripts such as Roman and Japanese. Documents containing similar scripts, such as Roman and Cyrillic, will require further investigation.
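The visualization step described above (project each thirteen-dimensional vector onto the first three principal components, then map those components to red, green, and blue) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `vectors_to_rgb`, the per-channel min-max rescaling to 0..255, and the synthetic input data are all assumptions made for illustration; the abstract does not say how the components were scaled to colors.

```python
import numpy as np

def vectors_to_rgb(vectors):
    """Map N x D vectors to N x 3 RGB colors via their first three principal components.

    Hypothetical helper: the min-max rescaling per channel is an assumption,
    not something stated in the abstract. Requires N >= 3 and D >= 3.
    """
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                      # center each dimension
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:3].T                       # N x 3 principal-component scores
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    span[span == 0] = 1.0                       # avoid division by zero on flat channels
    return np.round(255 * (scores - lo) / span).astype(np.uint8)

# Synthetic stand-in for script identification vectors (NOT real data):
# two well-separated clusters in a 13-dimensional space, one per "script".
rng = np.random.default_rng(0)
script_a = rng.normal(0.2, 0.05, size=(50, 13))
script_b = rng.normal(0.8, 0.05, size=(50, 13))
colors = vectors_to_rgb(np.vstack([script_a, script_b]))

# Components from the same cluster receive similar colors, so painting each
# connected component with its color would make script regions visible.
print(colors[:3])
print(colors[-3:])
```

In this sketch the first principal component separates the two synthetic clusters, so the red channel alone distinguishes them; in a real document the three channels together would carry whatever structure the script identification vectors contain.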

J. Hochberg, M. Cannon, P. Kelly, and J. White. Page Segmentation Using Script Identification Vectors: A First Look. In Proceedings of the 1997 Symposium on Document Image Understanding Technology, pp. 258-264, 1997. Los Alamos National Laboratory Technical Report LA-UR-97-1281.