Persian/Arabic Document Segmentation Based On Pyramidal Image Structure

Automatic transformation of paper documents into electronic documents requires document segmentation as its first stage. However, variations in character font size, differences in text line spacing, and non-uniform document layout structures have for many years made it difficult to design a general-purpose document layout analysis algorithm. Consequently, most previously reported methods inevitably depend on parameters tied to these properties. The problem is especially acute for Persian/Arabic documents. Because the Persian/Arabic scripts differ considerably from English script, most methods proposed for English documents do not give good results on Persian documents. In this paper, we present a novel parameter-free method for segmenting Persian/Arabic document images that also works well for English documents. The method segments the document image into maximal homogeneous regions and classifies them as text or non-text based on a pyramidal image structure. In other words, the proposed method segments documents without relying on character font sizes, text line spacing, or document layout structure. The algorithm was evaluated on 150 Arabic/Persian and English documents, and segmentation succeeded for 96 percent of them.
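The abstract does not specify how the pyramidal image structure is built, so the following is only a minimal sketch of a generic image pyramid, not the authors' algorithm. It assumes a binary page image (1 for ink, 0 for background) and a 2x2 OR reduction between levels; the names `reduce_level` and `build_pyramid` are hypothetical and chosen for illustration.

```python
def reduce_level(img):
    """Halve a binary image: an output pixel is 1 if any pixel
    in the corresponding 2x2 block of the input is 1 (assumed rule)."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            # Collect the 2x2 block, clipped at the image border.
            block = [img[rr][cc]
                     for rr in range(r, min(r + 2, rows))
                     for cc in range(c, min(c + 2, cols))]
            out_row.append(1 if any(block) else 0)
        out.append(out_row)
    return out


def build_pyramid(img):
    """Return all levels, from full resolution down to a single pixel."""
    levels = [img]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(reduce_level(levels[-1]))
    return levels


# Toy 4x4 "page": one isolated ink pixel and one solid 2x2 blob.
page = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
pyramid = build_pyramid(page)
for level in pyramid:
    print(level)
```

In a structure of this kind, nearby ink pixels merge into connected blobs at coarser levels, which is one way homogeneous regions can emerge without a fixed font-size or line-spacing parameter; how the paper's method actually exploits the pyramid is described in its body, not in the abstract.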




